A 135M-parameter language model trained from scratch on FineWeb-Edu (`HuggingFaceFW/fineweb-edu`).
| Property | Value |
|---|---|
| Architecture | Llama-style, deep and narrow, with Grouped Query Attention (GQA) |
| Parameters | 135M |
| Layers | 30 |
| Hidden size | 576 |
| Attention heads | 9 (3 KV, GQA) |
| Context length | 1,024 tokens |
| Vocab size | 49,152 |
| Final loss | 5.4439 |
| Final perplexity | 231.3 |
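The figures in the table can be cross-checked with a little arithmetic. The sketch below counts parameters for a Llama-style layout and converts the final loss to perplexity. The MLP width (`INTERMEDIATE_SIZE = 1536`), the tied input/output embeddings, and the absence of bias terms are assumptions borrowed from the SmolLM2-135M layout, not values stated in this card. The perplexity check also assumes the loss is a mean cross-entropy in nats.

```python
import math

# Values taken from the table above.
VOCAB_SIZE = 49_152
HIDDEN_SIZE = 576
NUM_LAYERS = 30
NUM_HEADS = 9
NUM_KV_HEADS = 3
FINAL_LOSS = 5.4439

# Assumptions (not stated in the table), borrowed from the SmolLM2-135M layout.
INTERMEDIATE_SIZE = 1_536  # assumed width of the gated MLP
TIE_EMBEDDINGS = True      # assumed shared input/output embedding matrix


def count_parameters() -> int:
    """Count weights for a bias-free Llama-style decoder with GQA."""
    head_dim = HIDDEN_SIZE // NUM_HEADS   # 576 / 9 = 64
    kv_dim = NUM_KV_HEADS * head_dim      # 3 * 64 = 192
    # q_proj and o_proj are hidden x hidden; k_proj and v_proj are hidden x kv_dim.
    attention = 2 * HIDDEN_SIZE * HIDDEN_SIZE + 2 * HIDDEN_SIZE * kv_dim
    # Gated MLP: gate, up, and down projections.
    mlp = 3 * HIDDEN_SIZE * INTERMEDIATE_SIZE
    # Two RMSNorm weight vectors per layer.
    norms = 2 * HIDDEN_SIZE
    per_layer = attention + mlp + norms
    embeddings = VOCAB_SIZE * HIDDEN_SIZE
    lm_head = 0 if TIE_EMBEDDINGS else VOCAB_SIZE * HIDDEN_SIZE
    final_norm = HIDDEN_SIZE
    return embeddings + NUM_LAYERS * per_layer + final_norm + lm_head


def perplexity(loss: float) -> float:
    """Perplexity is exp(loss) when the loss is cross-entropy in nats."""
    return math.exp(loss)


if __name__ == "__main__":
    print(f"parameters: {count_parameters():,}")
    print(f"perplexity: {perplexity(FINAL_LOSS):.1f}")
```

Under these assumptions the count comes to roughly 134.5M, which rounds to the 135M in the table, and `exp(5.4439)` agrees with the reported perplexity of 231.3.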
This model was trained from scratch (random initialization) for research purposes.
| Hyperparameter | Value |
|---|---|
| Batch size | 4 sequences |
| Gradient accumulation | 8 steps |
| Effective batch | 32,768 tokens/step |
| Total steps | 6,103 |
| Learning rate | 6e-4 (cosine decay to 6e-5) |
| Warmup steps | 200 |
| Optimizer | AdamW (beta1=0.9, beta2=0.95) |
| Weight decay | 0.1 |
| Gradient clipping | 1.0 |
| Precision | Mixed (AMP fp16) |
| Hardware | NVIDIA T4 GPU (16GB) |
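The batch and schedule rows above can be sketched in a few lines. The token arithmetic uses the 1,024-token context length from the model table. The schedule shape is an assumption: the card gives only the warmup length and the cosine endpoints, so a linear warmup from near zero and a decay that spans all remaining steps are illustrative choices, and `learning_rate` is a hypothetical helper rather than the training script's own function.

```python
import math

# Values from the hyperparameter and model tables above.
MICRO_BATCH = 4          # sequences per forward/backward pass
GRAD_ACCUM = 8           # micro-batches per optimizer step
CONTEXT_LENGTH = 1_024   # tokens per sequence
TOTAL_STEPS = 6_103
WARMUP_STEPS = 200
PEAK_LR = 6e-4
MIN_LR = 6e-5


def tokens_per_step() -> int:
    """Tokens consumed by one optimizer step."""
    return MICRO_BATCH * GRAD_ACCUM * CONTEXT_LENGTH


def total_tokens() -> int:
    """Tokens seen over the whole run."""
    return tokens_per_step() * TOTAL_STEPS


def learning_rate(step: int) -> float:
    """Assumed schedule: linear warmup, then cosine decay from PEAK_LR to MIN_LR."""
    if step < WARMUP_STEPS:
        # Linear warmup (an assumption; the card only gives the step count).
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    progress = min(progress, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return MIN_LR + (PEAK_LR - MIN_LR) * cosine


if __name__ == "__main__":
    print(f"tokens/step:  {tokens_per_step():,}")
    print(f"total tokens: {total_tokens():,}")
    for step in (0, WARMUP_STEPS, TOTAL_STEPS // 2, TOTAL_STEPS):
        print(f"step {step:>5}: lr = {learning_rate(step):.2e}")
```

The product 4 × 8 × 1,024 reproduces the 32,768 tokens/step in the table, and 6,103 such steps total just under 200M tokens, consistent with the `200m` suffix in the model names.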
- rockerritesh/gpt2-small-fineweb-edu-200m: GPT-2 Small (124M, 12 layers)
- rockerritesh/smollm2-135m-fineweb-edu-200m: SmolLM2 (135M, 30 layers, Llama-style)