metadata language:
- en
license: mit
tags:
- gpt2
- from-scratch
- decoder-only
- transformer
- zeroshot
- llm-training
- bfloat16
- flash-attention
datasets:
- HuggingFaceFW/fineweb-edu-score-2
- HuggingFaceTB/smoltalk
- tatsu-lab/alpaca
- OpenAssistant/oasst2
pipeline_tag: text-generation
library_name: pytorch
ZeroShot-500M
530M parameter decoder-only transformer trained entirely from scratch β base pre-training, mid-training, and supervised fine-tuning β on a single rented RTX 5090.
Part of the ZeroShot scaling series by TobiasLogic , a progression of increasingly large GPT-2 style LLMs trained from zero, on consumer/prosumer GPUs, at minimal cost.
Model Details
Parameters
~530M
Architecture
GPT-2 style decoder-only transformer
Layers
24
Attention Heads
20
Embedding Dim
1280
Head Dim
64
Context Window
2,048 tokens
Vocab Size
50,304 (GPT-2 BPE, padded for tensor cores)
Precision
bfloat16
Attention
Flash Attention via F.scaled_dot_product_attention
Weight Tying
Embedding β LM head
Training
Stage 1 β Base Pre-training β
Data
FineWeb-Edu (streamed, zero disk usage)
Tokens
~7.9B
Steps
30,000
LR Schedule
Cosine decay: 4e-4 β 4e-5
Effective Batch
128 sequences (4 micro Γ 32 grad accumulation)
Final Loss
2.75
Stage 2 β Mid-training β
Stage 3 β SFT β
Steps
1,975
LR
3e-5 (cosine decay)
Final Loss
0.93
Hardware
GPU
NVIDIA RTX 5090 (32GB GDDR7)
Platform
Vast.ai (South Korea)
Cost/hr
$0.343/hr
Throughput
~43,000 tokens/sec
Total Cost
~$18
PyTorch
Nightly (cu128, Blackwell sm_120)
torch.compile
Disabled (unsupported on sm_120)
Checkpoints
File
Description
ckpt_base_final.pt
Base pre-trained β 30k steps, loss 2.75
ckpt_mid_final.pt
Post mid-training β 4,975 steps, loss 1.03
ckpt_sft_final.pt
Final chat model β 1,975 steps, loss 0.93
Usage
Requires the GPT class from train.py included in this repo.
import torch
import tiktoken
from train import GPT, ModelConfig
ckpt = torch.load("ckpt_sft_final.pt" , map_location="cuda" , weights_only=False )
model = GPT(ModelConfig(**ckpt["model_config" ])).to("cuda" )
model.load_state_dict(ckpt["model" ])
model.eval ()
enc = tiktoken.get_encoding("gpt2" )
tokens = torch.tensor(
[enc.encode("The meaning of life is" )],
dtype=torch.long,
device="cuda"
)
with torch.no_grad():
output = model.generate(tokens, max_new_tokens=200 , temperature=0.8 , top_k=200 )
print (enc.decode(output[0 ].tolist()))
Blackwell GPU users (RTX 5060 Ti / 5090): disable torch.compile and use PyTorch nightly cu128.
ZeroShot Family
Model
Params
Base Loss
SFT Loss
Cost
GPU
MicroGPT
30.5M
3.85
β
Free
RTX 3050
ZeroShot-124M
124M
3.45
1.60
~$6.77
RTX 5060 Ti
ZeroShot-350M
337M
3.20
1.3
~$7.88
RTX 5090
ZeroShot-500M
530M
2.75
0.93
~$18
RTX 5090
Limitations
Undertrained by Chinchilla standards (~75% of optimal tokens for 530M params)
Will hallucinate, repeat, and struggle on complex reasoning tasks
English only
No RLHF or safety alignment
License
MIT