Custom GPT Language Model
A custom GPT-style autoregressive transformer language model implemented from scratch in PyTorch.
This project contains:
- custom multi-head self-attention
- transformer blocks
- causal masking
- autoregressive text generation
- mixed precision training
- top-k / top-p sampling
- safetensors model weights
The model was trained on a subset of FineWeb-Edu using a GPT-2 tokenizer.
Architecture
Model configuration:
{
    "vocab_size": 50257,
    "context_length": 256,
    "emb_dim": 768,
    "n_heads": 12,
    "n_layers": 12,
    "drop_rate": 0.1,
    "qkv_bias": False
}
Approximate parameter count:
- ~124M parameters
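The ~124M figure can be reproduced from the configuration with a short calculation. The sketch below assumes a standard GPT-2 style layout that this README does not spell out: a 4x feed-forward expansion, biases on the attention output projection and on both MLP layers, and an output head that shares its weights with the token embedding. The helper name count_gpt_params is made up for this illustration.

```python
def count_gpt_params(cfg, tie_output_head=True, mlp_ratio=4):
    """Estimate the parameter count of a GPT-2 style model.

    Assumptions (not confirmed by the project): 4x MLP expansion, biases on
    the attention output projection and both MLP layers, LayerNorm with scale
    and shift, and a bias-free output head.
    """
    d = cfg["emb_dim"]
    vocab = cfg["vocab_size"]

    tok_emb = vocab * d
    pos_emb = cfg["context_length"] * d

    # Per-block parameters.
    qkv = 3 * d * d + (3 * d if cfg["qkv_bias"] else 0)
    attn_out = d * d + d
    hidden = mlp_ratio * d
    mlp = (d * hidden + hidden) + (hidden * d + d)
    layer_norms = 2 * (2 * d)  # two LayerNorms, each with scale and shift
    block = qkv + attn_out + mlp + layer_norms

    final_norm = 2 * d
    # A tied head reuses the token embedding matrix, so it adds nothing.
    head = 0 if tie_output_head else d * vocab

    return tok_emb + pos_emb + cfg["n_layers"] * block + final_norm + head


cfg = {
    "vocab_size": 50257,
    "context_length": 256,
    "emb_dim": 768,
    "n_heads": 12,
    "n_layers": 12,
    "drop_rate": 0.1,
    "qkv_bias": False,
}

tied = count_gpt_params(cfg)
untied = count_gpt_params(cfg, tie_output_head=False)
print(f"tied output head:   {tied / 1e6:.1f}M")
print(f"untied output head: {untied / 1e6:.1f}M")
```

Under these assumptions the tied count lands at about 123.8M, which matches the quoted figure, while an untied output head would bring the total to about 162.4M.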
Architecture components:
- token embeddings
- positional embeddings
- masked multi-head self-attention
- feed-forward MLP blocks
- pre-layer normalization
- residual connections
- causal language modeling head
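To make the components above concrete, here is a minimal sketch of one pre-layer-norm transformer block with masked multi-head self-attention. It is not the code in model.py: the class names, the fused QKV projection, the GELU activation and the 4x MLP expansion are assumptions chosen to mirror a common GPT-2 style layout.

```python
import torch
import torch.nn as nn


class CausalSelfAttention(nn.Module):
    """Multi-head self-attention where each position sees only itself and the past."""

    def __init__(self, emb_dim, n_heads, context_length, drop_rate, qkv_bias=False):
        super().__init__()
        assert emb_dim % n_heads == 0, "emb_dim must be divisible by n_heads"
        self.n_heads = n_heads
        self.head_dim = emb_dim // n_heads
        self.qkv = nn.Linear(emb_dim, 3 * emb_dim, bias=qkv_bias)
        self.out_proj = nn.Linear(emb_dim, emb_dim)
        self.dropout = nn.Dropout(drop_rate)
        # Strictly upper-triangular mask: True marks future positions to hide.
        mask = torch.triu(
            torch.ones(context_length, context_length, dtype=torch.bool), diagonal=1
        )
        self.register_buffer("mask", mask, persistent=False)

    def forward(self, x):
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # (b, t, d) -> (b, n_heads, t, head_dim)
        q = q.view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        k = k.view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        v = v.view(b, t, self.n_heads, self.head_dim).transpose(1, 2)

        scores = q @ k.transpose(-2, -1) / self.head_dim ** 0.5
        scores = scores.masked_fill(self.mask[:t, :t], float("-inf"))
        weights = self.dropout(torch.softmax(scores, dim=-1))

        out = (weights @ v).transpose(1, 2).reshape(b, t, d)
        return self.out_proj(out)


class TransformerBlock(nn.Module):
    """Pre-LN block: normalize, apply the sublayer, then add the residual."""

    def __init__(self, cfg):
        super().__init__()
        d = cfg["emb_dim"]
        self.norm1 = nn.LayerNorm(d)
        self.attn = CausalSelfAttention(
            d, cfg["n_heads"], cfg["context_length"], cfg["drop_rate"], cfg["qkv_bias"]
        )
        self.norm2 = nn.LayerNorm(d)
        self.mlp = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
        self.dropout = nn.Dropout(cfg["drop_rate"])

    def forward(self, x):
        x = x + self.dropout(self.attn(self.norm1(x)))
        x = x + self.dropout(self.mlp(self.norm2(x)))
        return x


# Small demo configuration so the sketch runs quickly on a CPU.
torch.manual_seed(0)
demo_cfg = {"emb_dim": 32, "n_heads": 4, "context_length": 8,
            "drop_rate": 0.0, "qkv_bias": False}
block = TransformerBlock(demo_cfg).eval()
x = torch.randn(2, 8, 32)
y = block(x)
print(y.shape)
```

A quick way to check the causal mask is to perturb the last token of the input and confirm that the outputs at all earlier positions stay the same.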
Training
Training setup:
- PyTorch
- AdamW optimizer
- Automatic Mixed Precision (AMP)
- Gradient clipping
- Top-k / Top-p text generation
Hardware used:
- RTX 3060 Ti 8GB
Dataset:
- FineWeb-Edu subset (10M tokens)
Tokenizer:
- GPT-2 tokenizer
Installation
Install dependencies:
pip install torch transformers safetensors
Loading The Model
import json
import torch
from safetensors.torch import load_file
from transformers import AutoTokenizer
from model import GPTModel
# load config
with open("config.json") as f:
    cfg = json.load(f)
# create model
model = GPTModel(cfg)
# load weights
state_dict = load_file("model.safetensors")
model.load_state_dict(state_dict)
model.eval()
# tokenizer
tokenizer = AutoTokenizer.from_pretrained(".")
Text Generation Example
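from model import generate_and_print_sample
device = "cuda" if torch.cuda.is_available() else "cpu"  # fall back to CPU without a GPU
model.to(device)
print(generate_and_print_sample(model, tokenizer, device, "The world is big"))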
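Generation relies on top-k / top-p sampling. The sketch below shows the idea on a plain list of logits using only the standard library. It is an illustration rather than the code inside generate_and_print_sample: the function name sample_next_token, the default values, and the order of the filters (top-k first, then top-p) are assumptions.

```python
import math
import random


def sample_next_token(logits, top_k=50, top_p=0.9, temperature=1.0, rng=random):
    """Sample a token id from raw logits using top-k, then top-p filtering.

    top_k must be a positive integer or None (no top-k limit).
    """
    # Rank token ids by logit, highest first, and keep at most top_k of them.
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    if top_k is not None:
        ranked = ranked[:top_k]

    # Softmax over the surviving logits (max subtracted for numerical stability).
    scaled = [logits[i] / temperature for i in ranked]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept = []
    cumulative = 0.0
    for token_id, p in zip(ranked, probs):
        kept.append((token_id, p))
        cumulative += p
        if cumulative >= top_p:
            break

    # random.choices treats the weights as relative, so no renormalization is needed.
    ids, weights = zip(*kept)
    return rng.choices(ids, weights=weights, k=1)[0]


logits = [2.0, 1.0, 0.5, -1.0, -3.0]
rng = random.Random(0)
samples = [sample_next_token(logits, top_k=3, top_p=0.9, rng=rng) for _ in range(1000)]
print(sorted(set(samples)))
```

With top_k=3 only the three highest-scoring ids can ever be drawn, and setting top_k=1 reduces the procedure to greedy decoding.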
Sample Generations
Example generations from early-stage training:
"The world is big and is a whole for children. The best part of which has been made in the lives, and the state is an ideal man, but also the same one is in the world. “The only one has been created by people,” said the new study of the journal In the past, it is the best “s not people who have no longer to have not been seen in a few years.” “The only one who have one, the most famous in the country has no one at least three years. “If you’re very low, it is not a big or less than one’s risk.” The study is a study of people who have already been reported that the risk of people who are diagnosed with HIV-S"
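The sample above suggests that the model has so far learned:
- local syntactic coherence (phrases and punctuation are mostly grammatical)
- some topic persistence (it stays on study and health vocabulary for several sentences)
- fluent autoregressive continuation of a prompt
- early semantic structure, although sentences are not yet meaningful as a whole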
Files
model.py            # GPT architecture
model.safetensors   # trained weights
config.json         # model configuration
tokenizer files     # GPT-2 tokenizer assets
README.md           # project documentation
Notes
This is a custom PyTorch implementation and is not directly compatible with Hugging Face AutoModelForCausalLM.
Users should load the model using the provided model.py architecture.
License
MIT License.