---
language:
- en
license: apache-2.0
tags:
- causal-lm
- pretraining
- small-language-model
- gqa
- swiglu
- rope
metrics:
- perplexity
pipeline_tag: text-generation
---

# SLM-10M

A 9.97M-parameter causal language model trained from scratch, targeting the `<10M` tier of the [Open SLM Leaderboard](https://huggingface.co/spaces/AxiomicLabs/Open_SLM_Leaderboard).

## Model Details

| Property | Value |
|----------|-------|
| Parameters | 9,968,640 (~10M) |
| Architecture | Causal Transformer |
| Vocabulary | 8,192 tokens |
| Context length | 1,024 tokens |
| Training tokens | 25B |
| Precision | bfloat16 |

## Architecture

| Component | Config |
|-----------|--------|
| Hidden size | 256 |
| Layers | 12 |
| Q heads / KV heads | 8 / 2 (GQA) |
| Head dim | 32 |
| FFN intermediate | 640 |
| Positional encoding | RoPE (θ=100k) |
| Normalization | RMSNorm (fp32 upcast) |
| Activation | SwiGLU |
| Attention | GQA + QK-Norm |
| Weight tying | Embed ↔ LM head |

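The parameter count reported under Model Details can be cross-checked against the architecture table above. The sketch below assumes bias-free linear layers, two RMSNorm weight vectors per block, one QK-Norm weight vector of size `head_dim` each for queries and keys per block, and a final RMSNorm before the tied LM head. These layout details are assumptions, not something this card states, although under them the total matches the reported figure.

```python
# Parameter-count cross-check for the architecture table.
# Assumed layout (not stated in this card): no biases, two RMSNorms per block,
# per-block QK-Norm weights of size head_dim for q and k, one final RMSNorm,
# and an LM head tied to the embedding (so it adds no parameters).

def count_parameters(vocab=8192, hidden=256, layers=12, q_heads=8,
                     kv_heads=2, head_dim=32, ffn=640):
    embed = vocab * hidden
    attn = (hidden * q_heads * head_dim          # q_proj
            + 2 * hidden * kv_heads * head_dim   # k_proj and v_proj (GQA)
            + q_heads * head_dim * hidden)       # o_proj
    mlp = 3 * hidden * ffn                       # SwiGLU: gate, up, down
    norms = 2 * hidden + 2 * head_dim            # block RMSNorms + QK-Norm
    final_norm = hidden
    return embed + layers * (attn + mlp + norms) + final_norm

total = count_parameters()
print(f"{total:,}")  # 9,968,640
```
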
The design follows state-of-the-art SLM recipes (GPT-X2, Qwen3, Gemma2): QK-Norm prevents attention-logit explosion; Z-loss stabilises early training (disabled after 31B tokens); and scaled residual init keeps the residual-stream variance bounded.
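
For readers unfamiliar with Z-loss: it is an auxiliary penalty on the squared log of the softmax normaliser, which discourages the output logits from drifting to large magnitudes. A minimal pure-Python sketch for a single token position follows. The coefficient `1e-4` and the function names are assumptions chosen for illustration, not values taken from this model's training code.

```python
import math

def log_sum_exp(logits):
    """Numerically stable log of the softmax normaliser Z."""
    m = max(logits)
    return m + math.log(sum(math.exp(x - m) for x in logits))

def z_loss(logits, coeff=1e-4):
    """Auxiliary penalty coeff * (log Z)^2; coeff is an assumed value."""
    return coeff * log_sum_exp(logits) ** 2

def cross_entropy_with_z_loss(logits, target, coeff=1e-4):
    """Cross-entropy for one position plus the Z-loss penalty."""
    log_z = log_sum_exp(logits)
    ce = log_z - logits[target]
    return ce + coeff * log_z ** 2

# Larger logits give a larger log Z and therefore a larger penalty.
print(z_loss([0.0, 0.0]), z_loss([50.0, 50.0]))
```

In a real training loop the same quantity would be averaged over all token positions in the batch and added to the language-modelling loss.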

## Training

**Data mix (25B tokens total):**

| Source | Weight |
|--------|--------|
| FineWeb-Edu | 55% |
| Cosmopedia-v2 | 25% |
| FineWeb-HQ | 10% |
| FineMath | 10% |

**Optimizer:** AdamW (fused) with lr=3e-3, min_lr=3e-4, β=(0.9, 0.95), wd=0.1, grad_clip=1.0

**LR schedule:** Warmup (1k steps) → stable → cosine decay tail (last 15% of steps)
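
The schedule above can be sketched as a function of the step index. The version below assumes a linear warmup, a cosine tail that ends exactly at `min_lr`, and a hypothetical helper name `lr_at`; the actual training script may differ in these details.

```python
import math

def lr_at(step, total_steps, peak_lr=3e-3, min_lr=3e-4,
          warmup_steps=1_000, decay_frac=0.15):
    """Warmup -> stable -> cosine-decay learning rate for a 0-indexed step."""
    decay_steps = round(total_steps * decay_frac)
    decay_start = total_steps - decay_steps
    if step < warmup_steps:
        # Linear warmup (assumed shape) reaching peak_lr at the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    if step < decay_start:
        # Stable phase at the peak learning rate.
        return peak_lr
    # Cosine tail over the final decay_frac of training, from peak_lr to min_lr.
    progress = min(1.0, (step - decay_start) / max(1, decay_steps))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * cosine

# Illustrative run length only; not the step count used for this model.
for s in (0, 999, 5_000, 8_500, 10_000):
    print(s, lr_at(s, total_steps=10_000))
```

Holding the rate flat until the final 15% keeps most of training at the peak learning rate, and the short cosine tail anneals it down to `min_lr`.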

**Batch:** 512K tokens/step (micro-batch 32 × grad_accum 16 × seq_len 1024)
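
As a quick check of the batch arithmetic, the snippet below recomputes tokens per optimizer step and the number of steps needed for the token budget. It assumes a single data-parallel worker and treats 25B as an exact decimal budget, neither of which this card states explicitly.

```python
import math

micro_batch = 32
grad_accum = 16
seq_len = 1024

# Tokens consumed per optimizer step: 32 * 16 * 1024 = 524,288 (512 * 1024).
tokens_per_step = micro_batch * grad_accum * seq_len

# Nominal token budget (assumed to be exactly 25e9).
token_budget = 25_000_000_000
steps = math.ceil(token_budget / tokens_per_step)

print(tokens_per_step, steps)  # 524288 47684
```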

**Hardware:** NVIDIA GB10, bfloat16, `torch.compile`

## Evaluation

Zero-shot evaluation on the [Open SLM Leaderboard](https://huggingface.co/spaces/AxiomicLabs/Open_SLM_Leaderboard) benchmarks:

| Benchmark | Score |
|-----------|-------|
| HellaSwag (acc_norm) | 26.57% |
| ARC-Easy (acc_norm) | 30.47% |
| ARC-Challenge (acc_norm) | 24.83% |
| PIQA (acc_norm) | 50.92% |
| ArithMark-2.0 | 24.52% |
| **Avg** | **31.46%** |

Evaluated using [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) and the [ArithMark-2.0](https://huggingface.co/datasets/axiomiclabs/Arithmark-2.0) custom benchmark script.

## Usage

```python
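from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Use the GPU when available, otherwise fall back to CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

# trust_remote_code=True executes the custom model code shipped with the repo.
model = AutoModelForCausalLM.from_pretrained(
    "liodon-ai/slm-10m",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
).to(device)

tokenizer = AutoTokenizer.from_pretrained("liodon-ai/slm-10m", trust_remote_code=True)

inputs = tokenizer("The quick brown fox", return_tensors="pt").to(device)
# do_sample=True is required for temperature and top_k to take effect;
# without it, generate() decodes greedily and ignores both settings.
output = model.generate(
    **inputs, max_new_tokens=50, do_sample=True, temperature=0.8, top_k=50
)
print(tokenizer.decode(output[0], skip_special_tokens=True))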
```

## Reproduce

```bash
git clone https://github.com/liodon-ai/slm-pretrain  # or your repo
cd slm-pretrain
pip install -r requirements.txt

# Prepare data
python prepare_data.py

# Train (25B tokens)
python train.py

# Export to HF format
python export.py --checkpoint checkpoints/step_0044000.pt --out hf_model

# Evaluate
PYTHONPATH=. lm_eval --model hf \
  --model_args pretrained=hf_model,trust_remote_code=True \
  --tasks hellaswag,arc_easy,arc_challenge,piqa \
  --device cuda

python eval_arithmark.py
```

## License

Apache 2.0