---
language:
- en
license: apache-2.0
tags:
- causal-lm
- pretraining
- small-language-model
- gqa
- swiglu
- rope
metrics:
- perplexity
pipeline_tag: text-generation
---

# SLM-10M

A 9.97M-parameter causal language model trained from scratch, targeting the `<10M` tier of the [Open SLM Leaderboard](https://huggingface.co/spaces/AxiomicLabs/Open_SLM_Leaderboard).

## Model Details

| Property | Value |
|----------|-------|
| Parameters | 9,968,640 (~10M) |
| Architecture | Causal Transformer |
| Vocabulary | 8,192 tokens |
| Context length | 1,024 tokens |
| Training tokens | 25B |
| Precision | bfloat16 |

## Architecture

| Component | Config |
|-----------|--------|
| Hidden size | 256 |
| Layers | 12 |
| Q heads / KV heads | 8 / 2 (GQA) |
| Head dim | 32 |
| FFN intermediate | 640 |
| Positional encoding | RoPE (θ=100k) |
| Normalization | RMSNorm (fp32 upcast) |
| Activation | SwiGLU |
| Attention | GQA + QK-Norm |
| Weight tying | Embed ↔ LM head |

The design follows SotA SLM recipes (GPT-X2, Qwen3, Gemma2): QK-Norm prevents attention logit explosion, Z-loss stabilises early training (and is disabled later in the run), and scaled residual init keeps the residual-stream variance bounded.

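The parameter count in the Model Details table can be cross-checked against the Architecture table. The sketch below is only a back-of-the-envelope tally, not the training code: it assumes bias-free linear layers, two RMSNorm weight vectors per layer plus one final norm, QK-Norm weights of size `head_dim` shared across heads, and the tied embedding counted once. None of these layout details are stated in this card, and the function name `count_params` is hypothetical.

```python
def count_params(vocab=8192, hidden=256, layers=12,
                 q_heads=8, kv_heads=2, head_dim=32, ffn=640):
    """Tally weights for a GQA + SwiGLU transformer (assumed layout, no biases)."""
    # Token embedding; tied with the LM head, so counted once.
    embed = vocab * hidden

    # Attention projections: Q and O use all query heads, K and V use the KV heads.
    attn = hidden * q_heads * head_dim          # Q projection
    attn += 2 * hidden * kv_heads * head_dim    # K and V projections
    attn += q_heads * head_dim * hidden         # output projection

    # SwiGLU feed-forward: gate, up, and down matrices.
    mlp = 3 * hidden * ffn

    # Two RMSNorm weight vectors per layer (pre-attention, pre-FFN).
    norms = 2 * hidden

    # QK-Norm: one weight vector each for queries and keys (assumed per head_dim).
    qk_norm = 2 * head_dim

    per_layer = attn + mlp + norms + qk_norm
    final_norm = hidden
    return embed + layers * per_layer + final_norm


print(f"{count_params():,}")  # prints 9,968,640
```

Under these assumptions the tally reproduces the 9,968,640 figure in the table, with about 2.1M parameters in the embedding and about 0.66M in each of the 12 layers.
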
## Training

**Data mix (25B tokens total):**

| Source | Weight |
|--------|--------|
| FineWeb-Edu | 55% |
| Cosmopedia-v2 | 25% |
| FineWeb-HQ | 10% |
| FineMath | 10% |

**Optimizer:** AdamW (fused), lr=3e-3, min_lr=3e-4, β=(0.9, 0.95), wd=0.1, grad_clip=1.0

**LR schedule:** Warmup (1k steps) → stable → cosine decay tail (last 15% of steps)

**Batch:** 512K tokens/step (micro-batch 32 × grad_accum 16 × seq_len 1024)

**Hardware:** NVIDIA GB10, bfloat16, `torch.compile`

## Evaluation

Zero-shot evaluation on the [Open SLM Leaderboard](https://huggingface.co/spaces/AxiomicLabs/Open_SLM_Leaderboard) benchmarks:

| Benchmark | Score |
|-----------|-------|
| HellaSwag (acc_norm) | 26.57% |
| ARC-Easy (acc_norm) | 30.47% |
| ARC-Challenge (acc_norm) | 24.83% |
| PIQA (acc_norm) | 50.92% |
| ArithMark-2.0 | 24.52% |
| **Avg** | **32.42%** |

Evaluated using [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) and the [ArithMark-2.0](https://huggingface.co/datasets/axiomiclabs/Arithmark-2.0) custom benchmark script.

## Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Custom architecture code is loaded from the repo, hence trust_remote_code=True.
model = AutoModelForCausalLM.from_pretrained(
    "liodon-ai/slm-10m",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
).to("cuda")
tokenizer = AutoTokenizer.from_pretrained("liodon-ai/slm-10m", trust_remote_code=True)

inputs = tokenizer("The quick brown fox", return_tensors="pt").to("cuda")

# do_sample=True is required for temperature and top_k to take effect;
# without it, generate() falls back to greedy decoding.
output = model.generate(
    **inputs,
    max_new_tokens=50,
    do_sample=True,
    temperature=0.8,
    top_k=50,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

## Reproduce

```bash
git clone https://github.com/liodon-ai/slm-pretrain  # or your repo
cd slm-pretrain
pip install -r requirements.txt

# Prepare data
python prepare_data.py

# Train (25B tokens)
python train.py

# Export to HF format
python export.py --checkpoint checkpoints/step_0044000.pt --out hf_model

# Evaluate
PYTHONPATH=. lm_eval --model hf \
    --model_args pretrained=hf_model,trust_remote_code=True \
    --tasks hellaswag,arc_easy,arc_challenge,piqa \
    --device cuda
python eval_arithmark.py
```

## License

Apache 2.0