Custom GPT Language Model

A custom GPT-style autoregressive transformer language model implemented from scratch in PyTorch.

This project contains:

custom multi-head self-attention
transformer blocks
causal masking
autoregressive text generation
mixed precision training
top-k / top-p sampling
safetensors model weights

The model was trained on a subset of FineWeb-Edu using a GPT-2 tokenizer.

Architecture

Model configuration:

{
    "vocab_size": 50257,
    "context_length": 256,
    "emb_dim": 768,
    "n_heads": 12,
    "n_layers": 12,
    "drop_rate": 0.1,
    "qkv_bias": False
}
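A few consistency checks on this configuration can catch mistakes before the model is built. The sketch below uses a hypothetical helper (check_config is not part of model.py) and assumes the usual GPT-2 convention that emb_dim is split evenly across n_heads. The listing above is written with Python's False; in a JSON file the boolean is spelled false, which json.loads converts back to False.

```python
import json

REQUIRED_KEYS = {
    "vocab_size", "context_length", "emb_dim",
    "n_heads", "n_layers", "drop_rate", "qkv_bias",
}

def check_config(cfg):
    """Validate a config dict and return the per-head dimension (hypothetical helper)."""
    missing = REQUIRED_KEYS - cfg.keys()
    if missing:
        raise KeyError(f"missing config keys: {sorted(missing)}")
    if cfg["emb_dim"] % cfg["n_heads"] != 0:
        raise ValueError("emb_dim must be divisible by n_heads")
    if not 0.0 <= cfg["drop_rate"] < 1.0:
        raise ValueError("drop_rate must be in [0, 1)")
    return cfg["emb_dim"] // cfg["n_heads"]

# Assumed JSON form of the configuration (note the lowercase false)
cfg = json.loads("""
{
    "vocab_size": 50257,
    "context_length": 256,
    "emb_dim": 768,
    "n_heads": 12,
    "n_layers": 12,
    "drop_rate": 0.1,
    "qkv_bias": false
}
""")
head_dim = check_config(cfg)  # embedding width handled by each attention head
```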

Approximate parameter count:

~124M parameters
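The count can be reproduced from the configuration alone. The sketch below assumes a standard GPT-2 layout that this README does not spell out: a 4x feed-forward expansion, biases on the attention output projection and on both feed-forward layers, LayerNorm with scale and shift, and a bias-free output head. count_params is a hypothetical helper, not part of model.py.

```python
def count_params(cfg, tie_output_head=False, ff_mult=4):
    """Estimate the parameter count of a GPT-2 style model (assumed layout)."""
    d = cfg["emb_dim"]
    vocab = cfg["vocab_size"]
    ctx = cfg["context_length"]

    # query, key, value projections, with optional bias
    qkv = 3 * (d * d + (d if cfg["qkv_bias"] else 0))
    # attention output projection with bias
    attn = qkv + (d * d + d)
    # two-layer feed-forward network with biases
    hidden = ff_mult * d
    mlp = (d * hidden + hidden) + (hidden * d + d)
    # two LayerNorms per block, each with scale and shift
    norms = 2 * (2 * d)
    block = attn + mlp + norms

    # token embedding + positional embedding + blocks + final LayerNorm
    total = vocab * d + ctx * d + cfg["n_layers"] * block + 2 * d
    if not tie_output_head:
        total += d * vocab  # separate output projection
    return total

cfg = {
    "vocab_size": 50257,
    "context_length": 256,
    "emb_dim": 768,
    "n_heads": 12,
    "n_layers": 12,
    "drop_rate": 0.1,
    "qkv_bias": False,
}
untied = count_params(cfg)
tied = count_params(cfg, tie_output_head=True)
```

Under these assumptions the total is about 162M with a separate output head and about 124M when the head's weights are counted once as shared with the token embedding, which matches the figure quoted above.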

Architecture components:

token embeddings
positional embeddings
masked multi-head self-attention
feed-forward MLP blocks
pre-layer normalization
residual connections
causal language modeling head
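Of these components, the causal mask is what makes the model autoregressive: position i may attend only to positions 0..i. The following is a dependency-free sketch of single-head masked attention in plain Python, for illustration only. model.py presumably does the equivalent with batched PyTorch tensors across all heads, and the function name here is hypothetical.

```python
import math

def causal_self_attention(queries, keys, values):
    """Single-head causal attention over lists of vectors, one vector per position."""
    d_k = len(keys[0])
    seq_len = len(keys)
    outputs, all_weights = [], []
    for i, q in enumerate(queries):
        # Scaled dot-product scores against positions 0..i only;
        # later positions are masked out by never being scored.
        scores = [
            sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(d_k)
            for j in range(i + 1)
        ]
        # Numerically stable softmax over the visible positions
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps] + [0.0] * (seq_len - i - 1)
        # Weighted sum of the value vectors
        out = [
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = causal_self_attention(x, x, x)
```

The first position can see only itself, so its output is its own value vector; every row of weights sums to 1 with zeros above the diagonal.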

Training

Training setup:

PyTorch
AdamW optimizer
Automatic Mixed Precision (AMP)
Gradient clipping
Top-k / top-p sampling (used for text generation, not for the training loss)
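AMP and gradient clipping interact: gradients have to be unscaled before their norm is clipped. The sketch below shows one such training step using standard PyTorch calls. It is not the project's training script; train_step, the max_norm of 1.0, the AdamW hyperparameters, and the tiny stand-in model are assumptions for illustration. On a machine without CUDA, autocast and the scaler are disabled and the step runs in full precision.

```python
import torch
import torch.nn as nn

def train_step(model, inputs, targets, optimizer, scaler, device, max_norm=1.0):
    """One optimizer update with autocast, loss scaling, and gradient-norm clipping."""
    model.train()
    use_amp = device.type == "cuda"
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type=device.type, enabled=use_amp):
        logits = model(inputs)  # (batch, seq, vocab)
        loss = nn.functional.cross_entropy(logits.flatten(0, 1), targets.flatten())
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)  # restore true gradient magnitudes before clipping
    nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    scaler.step(optimizer)      # skips the update if gradients overflowed
    scaler.update()
    return loss.item()

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Stand-in for GPTModel: any module mapping token ids to per-position logits
toy_model = nn.Sequential(nn.Embedding(16, 8), nn.Linear(8, 16)).to(device)
optimizer = torch.optim.AdamW(toy_model.parameters(), lr=1e-3, weight_decay=0.1)
# Newer PyTorch releases also provide torch.amp.GradScaler
scaler = torch.cuda.amp.GradScaler(enabled=(device.type == "cuda"))

tokens = torch.randint(0, 16, (2, 5), device=device)
# Next-token prediction: targets are the inputs shifted left by one position
loss_value = train_step(toy_model, tokens[:, :-1], tokens[:, 1:], optimizer, scaler, device)
```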

Hardware used:

RTX 3060 Ti 8GB

Dataset:

FineWeb-Edu subset (10M tokens)

Tokenizer:

GPT-2 tokenizer

Installation

Install dependencies:

pip install torch transformers safetensors

Loading the Model

import json
import torch

from safetensors.torch import load_file
from transformers import AutoTokenizer

from model import GPTModel

# load config
with open("config.json") as f:
    cfg = json.load(f)

# create model
model = GPTModel(cfg)

# load weights
state_dict = load_file("model.safetensors")

model.load_state_dict(state_dict)

model.eval()

# tokenizer
tokenizer = AutoTokenizer.from_pretrained(".")

Text Generation Example

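from model import generate_and_print_sample

# Keep the model on the same device that is passed to the generator; fall back to CPU without a GPU
device = "cuda" if torch.cuda.is_available() else "cpu"
print(generate_and_print_sample(model.to(device), tokenizer, device, "The world is big"))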
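How each next token is drawn from the logits shapes the generated text. As an illustration of the top-k / top-p sampling listed among the project's features, here is a plain-Python sketch. It is not the code in model.py; the function names, the order of the filters (top-k first, then top-p), and the default temperature are assumptions.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities, with optional temperature scaling."""
    scaled = [x / temperature for x in logits]
    top = max(scaled)
    exps = [math.exp(x - top) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def filter_top_k_top_p(probs, top_k=None, top_p=None):
    """Return {token_id: prob}, renormalized after top-k and then top-p filtering."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    if top_k is not None:
        ranked = ranked[:top_k]  # keep the k most likely tokens
    if top_p is not None:
        kept, cumulative = [], 0.0
        for token_id, p in ranked:
            kept.append((token_id, p))
            cumulative += p
            if cumulative >= top_p:  # smallest prefix reaching mass top_p
                break
        ranked = kept
    total = sum(p for _, p in ranked)
    return {token_id: p / total for token_id, p in ranked}

def sample_next_token(logits, top_k=None, top_p=None, temperature=1.0, rng=random):
    """Draw one token id from the filtered distribution."""
    dist = filter_top_k_top_p(softmax(logits, temperature), top_k, top_p)
    token_ids = list(dist)
    weights = [dist[t] for t in token_ids]
    return rng.choices(token_ids, weights=weights, k=1)[0]

next_id = sample_next_token([2.0, 1.0, 0.5, -1.0], top_k=3, top_p=0.9)
```

Lower top_k or top_p values restrict sampling to the most likely tokens, which tends to reduce incoherent continuations at the cost of variety.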

Sample Generations

Example generations from early-stage training:

"The world is big and is a whole for children. The best part of which has been made in the lives, and the state is an ideal man, but also the same one is in the world. “The only one has been created by people,” said the new study of the journal In the past, it is the best “s not people who have no longer to have not been seen in a few years.” “The only one who have one, the most famous in the country has no one at least three years. “If you’re very low, it is not a big or less than one’s risk.” The study is a study of people who have already been reported that the risk of people who are diagnosed with HIV-S"

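At this stage of training, the sample above shows:

grammatical phrases and clauses over short spans (syntactic coherence)
loose topic persistence (the text stays on studies and health risk)
autoregressive generation from a text prompt
early semantic structure, though sentences are not yet coherent as a whole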

Files

model.py              # GPT architecture
model.safetensors     # trained weights
config.json           # model configuration
tokenizer files       # GPT-2 tokenizer assets
README.md             # project documentation

Notes

This is a custom PyTorch implementation and is not directly compatible with Hugging Face AutoModelForCausalLM.

Users should load the model using the provided model.py architecture.

License

MIT License.
