Custom GPT Language Model

A custom GPT-style autoregressive transformer language model implemented from scratch in PyTorch.

This project contains:

  • custom multi-head self-attention
  • transformer blocks
  • causal masking
  • autoregressive text generation
  • mixed precision training
  • top-k / top-p sampling
  • safetensors model weights

The model was trained on a subset of FineWeb-Edu using a GPT-2 tokenizer.


Architecture

Model configuration:

{
    "vocab_size": 50257,
    "context_length": 256,
    "emb_dim": 768,
    "n_heads": 12,
    "n_layers": 12,
    "drop_rate": 0.1,
    "qkv_bias": False
}

Approximate parameter count:

  • ~124M parameters
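
As a rough check on that figure, the count can be reproduced from the configuration above. This is only a sketch: it assumes a standard GPT-2-style layout (4x MLP expansion, biases on the attention output projection and the MLP layers, LayerNorm with scale and shift), which may differ from what model.py actually does. The name count_params and the tie_weights flag are illustrative and not part of this repository.

```python
def count_params(cfg, tie_weights=True):
    """Estimate the parameter count of a GPT-2-style model (layout is assumed)."""
    d, v, ctx = cfg["emb_dim"], cfg["vocab_size"], cfg["context_length"]
    qkv = 3 * d * d + (3 * d if cfg["qkv_bias"] else 0)  # query/key/value projections
    attn_out = d * d + d                                  # attention output projection
    mlp = (d * 4 * d + 4 * d) + (4 * d * d + d)           # two linear layers, 4x expansion
    norms = 2 * (2 * d)                                   # two LayerNorms per block
    block = qkv + attn_out + mlp + norms
    # token embeddings + positional embeddings + all blocks + final LayerNorm
    total = v * d + ctx * d + cfg["n_layers"] * block + 2 * d
    if not tie_weights:
        total += v * d  # separate, bias-free output head
    return total


cfg = {
    "vocab_size": 50257,
    "context_length": 256,
    "emb_dim": 768,
    "n_heads": 12,
    "n_layers": 12,
    "drop_rate": 0.1,
    "qkv_bias": False,
}

print(count_params(cfg, tie_weights=True))   # 123822336 (~124M)
print(count_params(cfg, tie_weights=False))  # 162419712 (~162M)
```

Under these assumptions, ~124M corresponds to an output head that shares its weights with the token embedding (or to a count that leaves the head out). A separate output head would add vocab_size x emb_dim, about 38.6M more parameters.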

Architecture components:

  • token embeddings
  • positional embeddings
  • masked multi-head self-attention
  • feed-forward MLP blocks
  • pre-layer normalization
  • residual connections
  • causal language modeling head
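
To make the list above concrete, here is a minimal sketch of how masked multi-head self-attention, pre-layer normalization, and residual connections typically fit together in one block. The class names CausalSelfAttention and TransformerBlock, the fused QKV projection, and the GELU MLP with 4x expansion are assumptions for illustration; the actual implementation in model.py may be organized differently.

```python
import torch
import torch.nn as nn


class CausalSelfAttention(nn.Module):
    """Masked multi-head self-attention (illustrative, not the code in model.py)."""

    def __init__(self, emb_dim, n_heads, context_length, drop_rate=0.0, qkv_bias=False):
        super().__init__()
        assert emb_dim % n_heads == 0, "emb_dim must be divisible by n_heads"
        self.n_heads = n_heads
        self.head_dim = emb_dim // n_heads
        self.qkv = nn.Linear(emb_dim, 3 * emb_dim, bias=qkv_bias)
        self.proj = nn.Linear(emb_dim, emb_dim)
        self.dropout = nn.Dropout(drop_rate)
        # True above the diagonal marks future positions that must stay hidden.
        mask = torch.triu(torch.ones(context_length, context_length, dtype=torch.bool), diagonal=1)
        self.register_buffer("mask", mask, persistent=False)

    def forward(self, x):
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Split the embedding into heads: (b, t, d) -> (b, n_heads, t, head_dim).
        q, k, v = (z.view(b, t, self.n_heads, self.head_dim).transpose(1, 2) for z in (q, k, v))
        scores = q @ k.transpose(-2, -1) / self.head_dim ** 0.5
        scores = scores.masked_fill(self.mask[:t, :t], float("-inf"))
        weights = self.dropout(torch.softmax(scores, dim=-1))
        out = (weights @ v).transpose(1, 2).reshape(b, t, d)
        return self.proj(out)


class TransformerBlock(nn.Module):
    """Pre-LayerNorm block: normalize, transform, then add the residual."""

    def __init__(self, cfg):
        super().__init__()
        d = cfg["emb_dim"]
        self.norm1 = nn.LayerNorm(d)
        self.attn = CausalSelfAttention(
            d, cfg["n_heads"], cfg["context_length"], cfg["drop_rate"], cfg["qkv_bias"]
        )
        self.norm2 = nn.LayerNorm(d)
        self.mlp = nn.Sequential(
            nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d), nn.Dropout(cfg["drop_rate"])
        )

    def forward(self, x):
        x = x + self.attn(self.norm1(x))  # residual around attention
        x = x + self.mlp(self.norm2(x))   # residual around the feed-forward MLP
        return x
```

Because of the causal mask, the output at position i depends only on positions 0..i, which is what allows the model to be trained on next-token prediction over a whole sequence at once.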

Training

Training setup:

  • PyTorch
  • AdamW optimizer
  • Automatic Mixed Precision (AMP)
  • Gradient clipping
  • Top-k / top-p sampling for text generation
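
The AdamW, AMP, and gradient-clipping items above usually combine into a single training step along the following lines. This is a hedged sketch, not the training script used for this model: the function name train_step, the max_norm value, and the assumption that the model returns logits of shape (batch, seq, vocab) are all illustrative, and the hyperparameters actually used are not stated here.

```python
import torch
import torch.nn.functional as F


def train_step(model, optimizer, scaler, inputs, targets, device, max_norm=1.0):
    """One optimization step with autocast, loss scaling, and gradient clipping."""
    model.train()
    inputs, targets = inputs.to(device), targets.to(device)
    optimizer.zero_grad(set_to_none=True)

    use_amp = device.type == "cuda"  # mixed precision only on GPU in this sketch
    amp_dtype = torch.float16 if use_amp else torch.bfloat16
    with torch.autocast(device_type=device.type, dtype=amp_dtype, enabled=use_amp):
        logits = model(inputs)  # assumed shape: (batch, seq, vocab)
        loss = F.cross_entropy(logits.flatten(0, 1), targets.flatten())

    scaler.scale(loss).backward()
    # Unscale first so the clipping threshold applies to the true gradient norm.
    scaler.unscale_(optimizer)
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    scaler.step(optimizer)
    scaler.update()
    return loss.item()
```

A typical setup would create the optimizer with torch.optim.AdamW(model.parameters(), lr=...) and the scaler with torch.cuda.amp.GradScaler(enabled=(device.type == "cuda")); recent PyTorch releases spell the latter torch.amp.GradScaler("cuda", enabled=...). With the scaler disabled, the same function runs unchanged on CPU in full precision.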

Hardware used:

  • RTX 3060 Ti 8GB

Dataset:

  • FineWeb-Edu subset (10M tokens)

Tokenizer:

  • GPT-2 tokenizer

Installation

Install dependencies:

pip install torch transformers safetensors

Loading The Model

import json
import torch

from safetensors.torch import load_file
from transformers import AutoTokenizer

from model import GPTModel

# load config
with open("config.json") as f:
    cfg = json.load(f)

# create model
model = GPTModel(cfg)

# load weights
state_dict = load_file("model.safetensors")

model.load_state_dict(state_dict)

model.eval()

# tokenizer
tokenizer = AutoTokenizer.from_pretrained(".")

Text Generation Example

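from model import generate_and_print_sample

# Move the model to the GPU if one is available; fall back to CPU otherwise.
device = "cuda" if torch.cuda.is_available() else "cpu"
print(generate_and_print_sample(model.to(device), tokenizer, device, "The world is big"))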
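
For readers curious about what top-k / top-p sampling does during autoregressive generation, the following sketch shows one common way to implement it. It is not taken from model.py: the names sample_next_token and generate, their parameters, and the assumption that the model maps token ids of shape (batch, seq) to logits of shape (batch, seq, vocab) are all illustrative.

```python
import torch


def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None):
    """Sample one token id per row from logits of shape (batch, vocab)."""
    logits = logits / temperature
    if top_k is not None:
        # Keep only the k highest-scoring tokens in each row.
        kth = torch.topk(logits, min(top_k, logits.size(-1)), dim=-1).values[..., -1:]
        logits = logits.masked_fill(logits < kth, float("-inf"))
    if top_p is not None:
        # Keep the smallest set of tokens whose cumulative probability reaches top_p.
        sorted_logits, sorted_idx = torch.sort(logits, descending=True, dim=-1)
        probs = torch.softmax(sorted_logits, dim=-1)
        cum = torch.cumsum(probs, dim=-1)
        remove = (cum - probs) >= top_p  # mass before this token already suffices
        sorted_logits = sorted_logits.masked_fill(remove, float("-inf"))
        logits = torch.full_like(logits, float("-inf")).scatter(-1, sorted_idx, sorted_logits)
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)


@torch.no_grad()
def generate(model, idx, max_new_tokens, context_length, **sampling):
    """Append max_new_tokens sampled ids to idx, one token at a time."""
    for _ in range(max_new_tokens):
        # Crop to the context window and use only the logits of the last position.
        logits = model(idx[:, -context_length:])[:, -1, :]
        next_id = sample_next_token(logits, **sampling)
        idx = torch.cat([idx, next_id], dim=1)
    return idx
```

Lower top_k or top_p values restrict sampling to the most likely tokens and tend to give more conservative text; setting both to None samples from the full distribution.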

Sample Generations

Example generation from early-stage training (prompt: "The world is big"):

"The world is big and is a whole for children. The best part of which has been made in the lives, and the state is an ideal man, but also the same one is in the world. “The only one has been created by people,” said the new study of the journal In the past, it is the best “s not people who have no longer to have not been seen in a few years.” “The only one who have one, the most famous in the country has no one at least three years. “If you’re very low, it is not a big or less than one’s risk.” The study is a study of people who have already been reported that the risk of people who are diagnosed with HIV-S"

At this stage of training, the samples show:

  • mostly grammatical local phrasing
  • some topic persistence across sentences
  • fluent token-by-token continuation of a prompt
  • early, still unreliable semantic structure

Files

model.py              # GPT architecture
model.safetensors     # trained weights
config.json           # model configuration
tokenizer files       # GPT-2 tokenizer assets
README.md             # project documentation

Notes

This is a custom PyTorch implementation and is not directly compatible with Hugging Face AutoModelForCausalLM.

Users should load the model using the provided model.py architecture.


License

MIT License.
