File size: 4,359 Bytes

5a836cf
b9ef8cb
 
5a836cf
b9ef8cb
 
 
 
 
 
 
 
 
5a836cf
b9ef8cb
 
 
 
5a836cf
 
 
 
b9ef8cb
 
 
5a836cf
b9ef8cb
5a836cf
b9ef8cb
5a836cf
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
b9ef8cb
 
5a836cf
 
b9ef8cb
 
 
 
 
 
 
 
 
1a215a2
b9ef8cb
1a215a2
 
 
b9ef8cb
 
 
 
1a215a2
 
 
 
 
b9ef8cb
1a215a2
b9ef8cb
 
 
1a215a2
 
 
 
 
b9ef8cb
 
5a836cf
 
 
 
 
 
b9ef8cb
 
1a215a2
5a836cf
 
1a215a2
b9ef8cb
 
5a836cf
 
 
1a215a2
 
 
 
 
b9ef8cb
 
5a836cf
 
 
1a215a2
b9ef8cb
5a836cf
 
 
1a215a2
5a836cf
1a215a2
 
5a836cf
 
 
 
 
b9ef8cb
 
 
 
 
 
5a836cf
 
b9ef8cb
5a836cf
 
 
1a215a2
b9ef8cb
 
 
5a836cf
 
1a215a2
 
 
 
2a4d19d
1a215a2
b9ef8cb
 
 
5a836cf
 
1a215a2
b9ef8cb
5a836cf
1a215a2
b9ef8cb
 
5a836cf
 
 
b9ef8cb

---
language:
- en
license: mit
tags:
- gpt2
- from-scratch
- decoder-only
- transformer
- zeroshot
- llm-training
- bfloat16
- flash-attention
datasets:
- HuggingFaceFW/fineweb-edu-score-2
- HuggingFaceTB/smoltalk
- tatsu-lab/alpaca
- OpenAssistant/oasst2
pipeline_tag: text-generation
library_name: pytorch
---

# ZeroShot-500M

**530M parameter decoder-only transformer trained entirely from scratch** — base pre-training, mid-training, and supervised fine-tuning — on a single rented RTX 5090.

Part of the **ZeroShot scaling series** by [TobiasLogic](https://github.com/TobiasLogic), a progression of increasingly large GPT-2 style LLMs trained from zero, on consumer/prosumer GPUs, at minimal cost.

---

## Model Details

| | |
|---|---|
| **Parameters** | ~530M |
| **Architecture** | GPT-2 style decoder-only transformer |
| **Layers** | 24 |
| **Attention Heads** | 20 |
| **Embedding Dim** | 1280 |
| **Head Dim** | 64 |
| **Context Window** | 2,048 tokens |
| **Vocab Size** | 50,304 (GPT-2 BPE, padded for tensor cores) |
| **Precision** | bfloat16 |
| **Attention** | Flash Attention via `F.scaled_dot_product_attention` |
| **Weight Tying** | Embedding ↔ LM head |

---

## Training

### Stage 1 — Base Pre-training ✅

| | |
|---|---|
| **Data** | [FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu-score-2) (streamed, zero disk usage) |
| **Tokens** | ~7.9B |
| **Steps** | 30,000 |
| **LR Schedule** | Cosine decay: 4e-4 → 4e-5 |
| **Effective Batch** | 128 sequences (4 micro × 32 grad accumulation) |
| **Final Loss** | 2.75 |

![Base loss curve](loss_curve_base.png)

### Stage 2 — Mid-training ✅

| | |
|---|---|
| **Data** | [SmolTalk](https://huggingface.co/datasets/HuggingFaceTB/smoltalk) (80k) + [Alpaca](https://huggingface.co/datasets/tatsu-lab/alpaca) (50k) + [OpenAssistant](https://huggingface.co/datasets/OpenAssistant/oasst2) (50k) |
| **Steps** | 4,975 |
| **LR** | 1e-4 (cosine decay) |
| **Final Loss** | 1.03 |

![Mid loss curve](loss_curve_mid.png)

### Stage 3 — SFT ✅

| | |
|---|---|
| **Steps** | 1,975 |
| **LR** | 3e-5 (cosine decay) |
| **Final Loss** | 0.93 |

![SFT loss curve](loss_curve_sft.png)

---

## Hardware

| | |
|---|---|
| **GPU** | NVIDIA RTX 5090 (32GB GDDR7) |
| **Platform** | [Vast.ai](https://vast.ai) (South Korea) |
| **Cost/hr** | $0.343/hr |
| **Throughput** | ~43,000 tokens/sec |
| **Total Cost** | ~$18 |
| **PyTorch** | Nightly (cu128, Blackwell sm_120) |
| **torch.compile** | Disabled (unsupported on sm_120) |

---

## Checkpoints

| File | Description |
|---|---|
| `ckpt_base_final.pt` | Base pre-trained — 30k steps, loss 2.75 |
| `ckpt_mid_final.pt` | Post mid-training — 4,975 steps, loss 1.03 |
| `ckpt_sft_final.pt` | Final chat model — 1,975 steps, loss 0.93 |

---

## Usage

Requires the `GPT` class from `train.py` included in this repo.

```python
import torch
import tiktoken
from train import GPT, ModelConfig

# Load the SFT checkpoint (chat model)
ckpt = torch.load("ckpt_sft_final.pt", map_location="cuda", weights_only=False)
model = GPT(ModelConfig(**ckpt["model_config"])).to("cuda")
model.load_state_dict(ckpt["model"])
model.eval()

enc = tiktoken.get_encoding("gpt2")
tokens = torch.tensor(
    [enc.encode("The meaning of life is")],
    dtype=torch.long,
    device="cuda"
)

with torch.no_grad():
    output = model.generate(tokens, max_new_tokens=200, temperature=0.8, top_k=200)

print(enc.decode(output[0].tolist()))
```

> **Blackwell GPU users (RTX 5060 Ti / 5090):** disable `torch.compile` and use PyTorch nightly cu128.

---

## ZeroShot Family

| Model | Params | Base Loss | SFT Loss | Cost | GPU |
|---|---|---|---|---|---|
| [MicroGPT](https://github.com/TobiasLogic/microgpt) | 30.5M | 3.85 | — | Free | RTX 3050 |
| [ZeroShot-124M](https://huggingface.co/TobiasLogic/ZeroShot-124M) | 124M | 3.45 | 1.60 | ~$6.77 | RTX 5060 Ti |
| [ZeroShot-350M](https://huggingface.co/TobiasLogic/ZeroShot-350M) | 337M | 3.20 | 1.3| ~$7.88 | RTX 5090 |
| **ZeroShot-500M** | **530M** | **2.75** | **0.93** | **~$18** | **RTX 5090** |

---

## Limitations

- Undertrained by Chinchilla standards (~75% of optimal tokens for 530M params)
- Will hallucinate, repeat, and struggle on complex reasoning tasks
- English only
- No RLHF or safety alignment

---

## License

[MIT](https://opensource.org/licenses/MIT)