	---
	language:
	- en
	license: mit
	tags:
	- gpt2
	- from-scratch
	- decoder-only
	- transformer
	- zeroshot
	- llm-training
	- bfloat16
	- flash-attention
	datasets:
	- HuggingFaceFW/fineweb-edu-score-2
	- HuggingFaceTB/smoltalk
	- tatsu-lab/alpaca
	- OpenAssistant/oasst2
	pipeline_tag: text-generation
	library_name: pytorch
	---

	# ZeroShot-500M

	A ~530M-parameter decoder-only transformer trained entirely from scratch (base pre-training, mid-training, and supervised fine-tuning) on a single rented RTX 5090.

	Part of the ZeroShot scaling series by [TobiasLogic](https://github.com/TobiasLogic), a progression of increasingly large GPT-2 style LLMs trained from zero, on consumer/prosumer GPUs, at minimal cost.

	---

	## Model Details

	\| \| \|
	\|---\|---\|
	\| Parameters \| ~530M \|
	\| Architecture \| GPT-2 style decoder-only transformer \|
	\| Layers \| 24 \|
	\| Attention Heads \| 20 \|
	\| Embedding Dim \| 1280 \|
	\| Head Dim \| 64 \|
	\| Context Window \| 2,048 tokens \|
	\| Vocab Size \| 50,304 (GPT-2 BPE, padded for tensor cores) \|
	\| Precision \| bfloat16 \|
	\| Attention \| Flash Attention via `F.scaled_dot_product_attention` \|
	\| Weight Tying \| Embedding ↔ LM head \|
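The numbers in the table above can be sanity-checked with a little arithmetic. The sketch below is an estimate, not the repo's code: `SketchConfig` and `estimate_params` are hypothetical names, and the layout it assumes (learned position embeddings, a 4x MLP, biases on every linear layer, two LayerNorms per block) is the standard GPT-2 one rather than something confirmed from `train.py`. Under those assumptions the count comes out a little under 0.54B, in the same ballpark as the ~530M quoted here; the exact figure depends on details such as biases and positional encoding.

```python
from dataclasses import dataclass

GPT2_BPE_VOCAB = 50_257  # size of the standard GPT-2 BPE vocabulary


def round_up(n: int, multiple: int) -> int:
    """Round n up to the nearest multiple (used for vocab padding)."""
    return ((n + multiple - 1) // multiple) * multiple


@dataclass
class SketchConfig:  # hypothetical stand-in, not the repo's ModelConfig
    n_layer: int = 24
    n_head: int = 20
    n_embd: int = 1280
    block_size: int = 2048
    vocab_size: int = round_up(GPT2_BPE_VOCAB, 64)  # 50,304


def estimate_params(cfg: SketchConfig) -> int:
    """Parameter count for an assumed standard GPT-2 layout."""
    d = cfg.n_embd
    tok_emb = cfg.vocab_size * d  # counted once: shared with the LM head
    pos_emb = cfg.block_size * d  # ASSUMPTION: learned position embeddings
    attn = (d * 3 * d + 3 * d) + (d * d + d)  # QKV projection + output projection
    mlp = (d * 4 * d + 4 * d) + (4 * d * d + d)  # ASSUMPTION: 4x hidden width
    norms = 2 * (2 * d)  # two LayerNorms per block (weight + bias)
    final_norm = 2 * d
    return tok_emb + pos_emb + cfg.n_layer * (attn + mlp + norms) + final_norm


cfg = SketchConfig()
print("head dim:", cfg.n_embd // cfg.n_head)
print(f"estimated parameters: {estimate_params(cfg) / 1e6:.0f}M")
```

The head dimension falls out as 1280 / 20 = 64, and padding 50,257 up to a multiple of 64 gives the 50,304 vocabulary size listed in the table.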

	---

	## Training

	### Stage 1 — Base Pre-training ✅

	\| \| \|
	\|---\|---\|
	\| Data \| [FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu-score-2) (streamed, zero disk usage) \|
	\| Tokens \| ~7.9B \|
	\| Steps \| 30,000 \|
	\| LR Schedule \| Cosine decay: 4e-4 → 4e-5 \|
	\| Effective Batch \| 128 sequences (4 micro × 32 grad accumulation) \|
	\| Final Loss \| 2.75 \|
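For readers unfamiliar with the schedule named in the table, here is a minimal sketch of cosine decay from 4e-4 to 4e-5 over 30,000 steps. The function name `cosine_lr` is hypothetical, and the sketch assumes no warmup phase because the card does not mention one; the actual schedule in `train.py` may differ.

```python
import math


def cosine_lr(step: int, total_steps: int = 30_000,
              lr_max: float = 4e-4, lr_min: float = 4e-5) -> float:
    """Cosine decay from lr_max at step 0 to lr_min at total_steps.

    ASSUMPTION: no warmup; the card only states the two endpoints.
    """
    # Clamp progress to [0, 1] so steps past the end stay at lr_min.
    progress = min(max(step / total_steps, 0.0), 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes 1 -> 0
    return lr_min + (lr_max - lr_min) * cosine


for step in (0, 15_000, 30_000):
    print(step, f"{cosine_lr(step):.2e}")
```

The learning rate starts at the peak, passes through the midpoint of the two endpoints halfway through training, and flattens out as it approaches the floor.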

	![Base loss curve](loss_curve_base.png)

	### Stage 2 — Mid-training ✅

	\| \| \|
	\|---\|---\|
	\| Data \| [SmolTalk](https://huggingface.co/datasets/HuggingFaceTB/smoltalk) (80k) + [Alpaca](https://huggingface.co/datasets/tatsu-lab/alpaca) (50k) + [OpenAssistant](https://huggingface.co/datasets/OpenAssistant/oasst2) (50k) \|
	\| Steps \| 4,975 \|
	\| LR \| 1e-4 (cosine decay) \|
	\| Final Loss \| 1.03 \|

	![Mid loss curve](loss_curve_mid.png)

	### Stage 3 — SFT ✅

	\| \| \|
	\|---\|---\|
	\| Steps \| 1,975 \|
	\| LR \| 3e-5 (cosine decay) \|
	\| Final Loss \| 0.93 \|

	![SFT loss curve](loss_curve_sft.png)

	---

	## Hardware

	\| \| \|
	\|---\|---\|
	\| GPU \| NVIDIA RTX 5090 (32GB GDDR7) \|
	\| Platform \| [Vast.ai](https://vast.ai) (South Korea) \|
	\| Cost/hr \| $0.343/hr \|
	\| Throughput \| ~43,000 tokens/sec \|
	\| Total Cost \| ~$18 \|
	\| PyTorch \| Nightly (cu128, Blackwell sm_120) \|
	\| torch.compile \| Disabled (unsupported on sm_120) \|
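As a rough cross-check of the cost figure, the throughput and hourly rate above can be combined with the Stage 1 batch settings. This is a back-of-the-envelope sketch with hypothetical names; it covers only base pre-training, since the card does not state batch sizes for the later stages, and it assumes the ~43,000 tokens/sec figure held for the whole run.

```python
SEQ_LEN = 2048           # context window
MICRO_BATCH = 4          # sequences per forward/backward pass
GRAD_ACCUM = 32          # micro-batches accumulated per optimizer step
BASE_STEPS = 30_000
TOKENS_PER_SEC = 43_000  # ASSUMPTION: constant throughput
USD_PER_HOUR = 0.343


def base_stage_estimate() -> tuple[int, float, float]:
    """Return (total tokens, GPU hours, USD) for base pre-training only."""
    tokens_per_step = MICRO_BATCH * GRAD_ACCUM * SEQ_LEN  # 128 sequences
    total_tokens = tokens_per_step * BASE_STEPS
    hours = total_tokens / TOKENS_PER_SEC / 3600
    return total_tokens, hours, hours * USD_PER_HOUR


tokens, hours, usd = base_stage_estimate()
print(f"{tokens / 1e9:.2f}B tokens, {hours:.1f} GPU hours, ${usd:.2f}")
```

Base pre-training alone accounts for roughly 7.9B tokens and a little over $17 at these rates, which leaves the remainder of the ~$18 total for mid-training and SFT.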

	---

	## Checkpoints

	\| File \| Description \|
	\|---\|---\|
	\| `ckpt_base_final.pt` \| Base pre-trained — 30k steps, loss 2.75 \|
	\| `ckpt_mid_final.pt` \| Post mid-training — 4,975 steps, loss 1.03 \|
	\| `ckpt_sft_final.pt` \| Final chat model — 1,975 steps, loss 0.93 \|

	---

	## Usage

	Requires the `GPT` and `ModelConfig` classes from `train.py`, which is included in this repo.

	```python
	import torch
	import tiktoken
	from train import GPT, ModelConfig

	# Load the SFT checkpoint (chat model)
	ckpt = torch.load("ckpt_sft_final.pt", map_location="cuda", weights_only=False)
	model = GPT(ModelConfig(**ckpt["model_config"])).to("cuda")
	model.load_state_dict(ckpt["model"])
	model.eval()

	enc = tiktoken.get_encoding("gpt2")
	tokens = torch.tensor(
	    [enc.encode("The meaning of life is")],
	    dtype=torch.long,
	    device="cuda",
	)

	# Sample a continuation without tracking gradients
	with torch.no_grad():
	    output = model.generate(tokens, max_new_tokens=200, temperature=0.8, top_k=200)

	print(enc.decode(output[0].tolist()))
	```

	> Blackwell GPU users (RTX 5060 Ti / 5090): disable `torch.compile` and use PyTorch nightly cu128.
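The `temperature` and `top_k` arguments in the usage example control how the next token is drawn. The sketch below illustrates the usual meaning of those two knobs in plain Python; `sample_top_k` is a hypothetical helper, and the repo's `generate` method may implement the step differently (for example, on GPU tensors).

```python
import math
import random


def sample_top_k(logits, temperature=0.8, top_k=200, rng=random):
    """Draw one token id from the top_k highest logits after temperature scaling."""
    k = min(top_k, len(logits))
    # Keep only the indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Lower temperature sharpens the distribution; higher flattens it.
    scaled = [logits[i] / temperature for i in top]
    # Softmax weights, shifted by the max for numerical stability.
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(top, weights=weights, k=1)[0]


toy_logits = [2.0, 0.5, -1.0, 4.0, 0.0]
print(sample_top_k(toy_logits, temperature=0.8, top_k=3))
```

With `top_k=1` this reduces to greedy decoding, while larger values of `top_k` and `temperature` make the output more varied.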

	---

	## ZeroShot Family

	\| Model \| Params \| Base Loss \| SFT Loss \| Cost \| GPU \|
	\|---\|---\|---\|---\|---\|---\|
	\| [MicroGPT](https://github.com/TobiasLogic/microgpt) \| 30.5M \| 3.85 \| — \| Free \| RTX 3050 \|
	\| [ZeroShot-124M](https://huggingface.co/TobiasLogic/ZeroShot-124M) \| 124M \| 3.45 \| 1.60 \| ~$6.77 \| RTX 5060 Ti \|
	\| [ZeroShot-350M](https://huggingface.co/TobiasLogic/ZeroShot-350M) \| 337M \| 3.20 \| 1.3 \| ~$7.88 \| RTX 5090 \|
	\| ZeroShot-500M \| 530M \| 2.75 \| 0.93 \| ~$18 \| RTX 5090 \|

	---

	## Limitations

	- Undertrained by Chinchilla standards: ~7.9B training tokens is roughly 75% of the ~10.6B suggested by the ~20 tokens-per-parameter rule of thumb for 530M params
	- Will hallucinate, repeat, and struggle on complex reasoning tasks
	- English only
	- No RLHF or safety alignment
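The undertraining figure in the first bullet can be reproduced with a short calculation. This sketch assumes the commonly cited Chinchilla rule of thumb of about 20 training tokens per parameter; `chinchilla_fraction` is a hypothetical helper, and the inputs are the rounded values from this card.

```python
def chinchilla_fraction(n_params: float, tokens_seen: float,
                        tokens_per_param: float = 20.0) -> float:
    """Fraction of the rule-of-thumb compute-optimal token budget that was seen.

    ASSUMPTION: ~20 tokens per parameter, the usual Chinchilla heuristic.
    """
    optimal_tokens = tokens_per_param * n_params
    return tokens_seen / optimal_tokens


fraction = chinchilla_fraction(n_params=530e6, tokens_seen=7.9e9)
print(f"{fraction:.0%} of the rule-of-thumb token budget")
```

By this heuristic, closing the gap would take on the order of 2.7B additional pre-training tokens.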

	---

	## License

	[MIT](https://opensource.org/licenses/MIT)