---
license: mit
language:
- en
tags:
- gpt
- transformer
- pytorch
- language-model
- text-generation
library_name: pytorch
---

# Custom GPT Language Model

A custom GPT-style autoregressive transformer language model implemented from scratch in PyTorch.

This project contains:
- custom multi-head self-attention
- transformer blocks
- causal masking
- autoregressive text generation
- mixed precision training
- top-k / top-p sampling
- safetensors model weights

The model was trained on a 10M-token subset of FineWeb-Edu using the GPT-2 tokenizer.

---

# Architecture

Model configuration:

```python
{
    "vocab_size": 50257,
    "context_length": 256,
    "emb_dim": 768,
    "n_heads": 12,
    "n_layers": 12,
    "drop_rate": 0.1,
    "qkv_bias": False
}
```

Approximate parameter count:
- ~124M parameters
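
The ~124M figure can be sanity-checked from the configuration alone. The sketch below is back-of-the-envelope arithmetic, not a readout of `model.py`: `count_params` is a hypothetical helper, and it assumes a 4x MLP expansion, biases on the attention output projection and MLP layers, LayerNorm with scale and shift, and a bias-free output head. Under those assumptions, ~124M matches a model whose output head shares weights with the token embedding (or a count that leaves the head out); an untied head would add roughly 38.6M more.

```python
# Rough parameter count for the configuration above.
# Assumptions (not confirmed by this README): 4x MLP expansion, biases on the
# attention output projection and MLP layers, LayerNorm with scale and shift.

cfg = {
    "vocab_size": 50257,
    "context_length": 256,
    "emb_dim": 768,
    "n_heads": 12,
    "n_layers": 12,
    "drop_rate": 0.1,
    "qkv_bias": False,
}


def count_params(cfg, tie_weights, mlp_ratio=4):
    d = cfg["emb_dim"]
    hidden = mlp_ratio * d

    tok_emb = cfg["vocab_size"] * d
    pos_emb = cfg["context_length"] * d

    qkv = 3 * d * d + (3 * d if cfg["qkv_bias"] else 0)
    attn_out = d * d + d
    mlp = (d * hidden + hidden) + (hidden * d + d)
    norms = 2 * (2 * d)  # two LayerNorms per block
    block = qkv + attn_out + mlp + norms

    final_norm = 2 * d
    head = 0 if tie_weights else cfg["vocab_size"] * d  # bias-free output head

    return tok_emb + pos_emb + cfg["n_layers"] * block + final_norm + head


print(f"tied:   {count_params(cfg, tie_weights=True) / 1e6:.1f}M")   # 123.8M
print(f"untied: {count_params(cfg, tie_weights=False) / 1e6:.1f}M")  # 162.4M
```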

Architecture components:
- token embeddings
- positional embeddings
- masked multi-head self-attention
- feed-forward MLP blocks
- pre-layer normalization
- residual connections
- causal language modeling head
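
To show how the listed components fit together, here is a minimal sketch of one pre-layer-norm transformer block with masked multi-head self-attention. It is an illustration under assumptions, not the code in `model.py`: the class names `CausalSelfAttention` and `TransformerBlock`, the fused QKV projection, the GELU activation, and the 4x MLP width are all choices made for this example.

```python
import torch
import torch.nn as nn


class CausalSelfAttention(nn.Module):
    """Multi-head self-attention that hides future positions (illustrative)."""

    def __init__(self, emb_dim, n_heads, context_length, drop_rate, qkv_bias):
        super().__init__()
        assert emb_dim % n_heads == 0, "emb_dim must be divisible by n_heads"
        self.n_heads = n_heads
        self.head_dim = emb_dim // n_heads
        self.qkv = nn.Linear(emb_dim, 3 * emb_dim, bias=qkv_bias)
        self.proj = nn.Linear(emb_dim, emb_dim)
        self.dropout = nn.Dropout(drop_rate)
        # True above the diagonal marks the future positions to mask out
        ones = torch.ones(context_length, context_length, dtype=torch.bool)
        self.register_buffer("mask", torch.triu(ones, diagonal=1))

    def forward(self, x):
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # (b, t, d) -> (b, n_heads, t, head_dim)
        q, k, v = (
            z.reshape(b, t, self.n_heads, self.head_dim).transpose(1, 2)
            for z in (q, k, v)
        )
        scores = q @ k.transpose(-2, -1) / self.head_dim ** 0.5
        scores = scores.masked_fill(self.mask[:t, :t], float("-inf"))
        weights = self.dropout(torch.softmax(scores, dim=-1))
        out = (weights @ v).transpose(1, 2).reshape(b, t, d)
        return self.proj(out)


class TransformerBlock(nn.Module):
    """Pre-LN block: normalize, apply the sublayer, add the residual."""

    def __init__(self, cfg):
        super().__init__()
        d = cfg["emb_dim"]
        self.norm1 = nn.LayerNorm(d)
        self.attn = CausalSelfAttention(
            d, cfg["n_heads"], cfg["context_length"], cfg["drop_rate"], cfg["qkv_bias"]
        )
        self.norm2 = nn.LayerNorm(d)
        self.mlp = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
        self.dropout = nn.Dropout(cfg["drop_rate"])

    def forward(self, x):
        x = x + self.dropout(self.attn(self.norm1(x)))
        x = x + self.dropout(self.mlp(self.norm2(x)))
        return x
```

Because of the mask, the output at position `i` depends only on positions `0..i`, which is what allows the model to be trained on next-token prediction over a whole sequence at once.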

---

# Training

Training setup:
- PyTorch
- AdamW optimizer
- Automatic Mixed Precision (AMP)
- Gradient clipping
- Top-k / top-p sampling for generated text samples
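
The optimizer, AMP, and gradient-clipping items above combine into a single training step roughly as follows. This is a sketch of the standard PyTorch pattern, not the project's actual training loop: `train_step` is a hypothetical helper, the toy model stands in for `GPTModel`, and the learning rate, weight decay, and `max_norm` values are placeholders rather than the settings used for this model.

```python
import torch
import torch.nn.functional as F


def train_step(model, optimizer, scaler, input_ids, targets, device_type, max_norm=1.0):
    """One optimization step: AMP forward pass, scaled backward, clip, update."""
    model.train()
    optimizer.zero_grad(set_to_none=True)

    # autocast runs eligible ops in reduced precision; disabled off-GPU here
    with torch.autocast(device_type=device_type, enabled=(device_type == "cuda")):
        logits = model(input_ids)  # (batch, seq, vocab)
        loss = F.cross_entropy(logits.flatten(0, 1), targets.flatten())

    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)  # clip the true gradients, not the scaled ones
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    scaler.step(optimizer)
    scaler.update()
    return loss.item()


# Toy stand-in model so the sketch runs on its own (assumption, not GPTModel)
device_type = "cuda" if torch.cuda.is_available() else "cpu"
toy = torch.nn.Sequential(torch.nn.Embedding(100, 16), torch.nn.Linear(16, 100)).to(device_type)
optimizer = torch.optim.AdamW(toy.parameters(), lr=3e-4, weight_decay=0.1)
# Newer PyTorch releases also expose this class as torch.amp.GradScaler
scaler = torch.cuda.amp.GradScaler(enabled=(device_type == "cuda"))

batch = torch.randint(0, 100, (4, 9), device=device_type)
# Next-token prediction: targets are the inputs shifted left by one position
loss = train_step(toy, optimizer, scaler, batch[:, :-1], batch[:, 1:], device_type)
```

Calling `scaler.unscale_` before clipping matters: without it, the norm threshold would be compared against gradients that are still multiplied by the loss scale.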

Hardware used:
- RTX 3060 Ti 8GB

Dataset:
- FineWeb-Edu subset (10M tokens)

Tokenizer:
- GPT-2 tokenizer

---

# Installation

Install dependencies:

```bash
pip install torch transformers safetensors
```

---

# Loading the Model

```python
import json
import torch

from safetensors.torch import load_file
from transformers import AutoTokenizer

from model import GPTModel

# load config
with open("config.json") as f:
    cfg = json.load(f)

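# create the model from the saved configuration
model = GPTModel(cfg)

# load weights (safetensors loads tensors onto the CPU by default)
state_dict = load_file("model.safetensors")
model.load_state_dict(state_dict)

# move to the GPU when one is available, then disable dropout for inference
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()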

# tokenizer
tokenizer = AutoTokenizer.from_pretrained(".")
```

---

# Text Generation Example

```python
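from model import generate_and_print_sample

# generate on whichever device the model's weights live on
device = next(model.parameters()).device
print(generate_and_print_sample(model, tokenizer, device, "The world is big"))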
```
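
How `generate_and_print_sample` applies top-k / top-p sampling internally is not shown here. As a rough illustration of the idea, the pure-Python sketch below samples one token id from a list of logits. The function name `sample_top_k_top_p` and its default values are hypothetical, and this variant measures the top-p cutoff against the full distribution rather than renormalizing after the top-k step.

```python
import math
import random


def sample_top_k_top_p(logits, top_k=50, top_p=0.9, temperature=1.0, rng=random):
    """Sample one token id from raw logits using top-k, then top-p filtering."""
    # softmax with temperature (subtract the max for numerical stability)
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # top-k: keep the k most likely token ids, most likely first
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]

    # top-p: keep the smallest prefix whose cumulative probability reaches top_p
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # random.choices renormalizes the surviving weights before drawing
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


# Example: only the two most likely tokens survive the top-p cutoff here
rng = random.Random(0)
token_id = sample_top_k_top_p([5.0, 4.0, 0.0, -1.0], top_k=3, top_p=0.9, rng=rng)
```

Lower `top_k` or `top_p` values make the output more conservative; setting `top_k=1` reduces this to greedy decoding.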

---

# Sample Generations

An example generation from early-stage training, using the prompt "The world is big":

> "The world is big and is a whole for children. The best part of which has been made in the lives, and the state is an ideal man, but also the same one is in the world. “The only one has been created by people,” said the new study of the journal In the past, it is the best “s not people who have no longer to have not been seen in a few years.” “The only one who have one, the most famous in the country has no one at least three years. “If you’re very low, it is not a big or less than one’s risk.” The study is a study of people who have already been reported that the risk of people who are diagnosed with HIV-S"

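At this stage of training, the output shows:
- locally grammatical phrases and sentence structure
- loose topic persistence (the sample above drifts toward a health-study theme)
- early semantic structure, though not yet coherent meaning across sentences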

---


# Files

```text
model.py              # GPT architecture
model.safetensors     # trained weights
config.json           # model configuration
tokenizer files       # GPT-2 tokenizer assets
README.md             # project documentation
```

---

# Notes

This is a custom PyTorch implementation and is not directly compatible with Hugging Face `AutoModelForCausalLM`.

Load the weights into the `GPTModel` class defined in the provided `model.py`, as shown in the loading example above.

---

# License

MIT License.