gpt2small-en-it-nanochat-lr2e4-bs6-wsd-earlydecay3500-final5e6-webwiki-step7000

This repo stages the benchmark-selected checkpoint from the local NanoChat EN/IT GPT-2-small-like WSD early-decay-3500 web/wiki run 20260530_fresh-gpt2small-lr2e4-bs6-wsd-earlydecay3500-final5e6-webwiki.

What this is

model family: GPT-2-small-like decoder-only LM
parameters: ~136M
languages: English + Italian
context length: 2500
selected checkpoint: step_7000.pt
selection reason: best full repo-native CPU benchmark result across the closeout candidate checkpoints from this run family
status relative to comparable public checkpoints: currently the second-best benchmarked public GPT-2-small EN/IT checkpoint in the comparable web/wiki slice tracked from this workspace

Important caveat

This run's best online validation checkpoint was not the same as its best benchmark checkpoint.

best online validation of the full run:
- step_16000
- validation_loss=3.8532891550
- validation_perplexity=47.1478851784
benchmark winner after closeout:
- step_7000

This release follows the benchmark winner, not the online-validation winner.

Benchmark summary

Repo-native benchmark suite: configs/eval/20260521_pretrain_minimal_en_it_webwiki_step11000.yaml

Winner metrics:

val_loss_mixed: 5.2358
ppl_mixed: 187.8785
val_loss_en: 5.1587
ppl_en: 173.9426
val_loss_it: 4.0506
ppl_it: 57.4308
loop_rate: 0.675
repeated_4gram_rate: 0.95
cloze_en_contains: 0.02
cloze_it_contains: 0.10
cloze_en_exact: 0.00
cloze_it_exact: 0.00
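
The loss and perplexity pairs above are consistent with perplexity being the exponential of the loss. That assumes the losses are mean per-token cross-entropy in nats, which this card does not state explicitly. A minimal check under that assumption, with the helper name perplexity chosen only for illustration:

```python
import math


def perplexity(loss_nats: float) -> float:
    """Convert a mean per-token cross-entropy (assumed to be in nats) to perplexity."""
    return math.exp(loss_nats)


# (loss, reported perplexity) pairs copied from the winner metrics above.
reported = {
    "mixed": (5.2358, 187.8785),
    "en": (5.1587, 173.9426),
    "it": (4.0506, 57.4308),
}

for name, (loss, ppl) in reported.items():
    # Losses are rounded to 4 decimals, so compare with a small relative tolerance.
    assert math.isclose(perplexity(loss), ppl, rel_tol=1e-3), name
    print(f"{name}: exp({loss}) = {perplexity(loss):.2f} (reported {ppl})")
```

The same relation holds for the online-validation numbers listed under the caveat (loss 3.8532891550, perplexity 47.1478851784).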

Benchmark ranking across the closeout candidates from this run:

step_7000
- mixed=5.2358
- en=5.1587
- it=4.0506
step_16000
- mixed=5.4730
- en=5.2994
- it=4.6116
step_28000
- mixed=5.6045
- en=5.3080
- it=4.7532
step_29000
- mixed=5.7228
- en=5.2835
- it=4.7360
step_23000
- mixed=5.9788
- en=5.4895
- it=4.8834
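
The ordering above can be reproduced by sorting the candidates on the mixed loss, lowest first. This is a sketch of the selection rule only; the dictionary layout is hypothetical and is not the schema of benchmark_scores.json.

```python
# Per-checkpoint benchmark losses copied from the ranking above.
candidates = {
    "step_16000": {"mixed": 5.4730, "en": 5.2994, "it": 4.6116},
    "step_23000": {"mixed": 5.9788, "en": 5.4895, "it": 4.8834},
    "step_7000": {"mixed": 5.2358, "en": 5.1587, "it": 4.0506},
    "step_28000": {"mixed": 5.6045, "en": 5.3080, "it": 4.7532},
    "step_29000": {"mixed": 5.7228, "en": 5.2835, "it": 4.7360},
}

# Rank by the primary metric (mixed loss); lower is better.
ranked = sorted(candidates, key=lambda step: candidates[step]["mixed"])
winner = ranked[0]

for position, step in enumerate(ranked, start=1):
    print(position, step, candidates[step]["mixed"])
```

Note that step_7000 also has the lowest en and it losses among the candidates, so the winner does not depend on which of the three losses is used as the sort key.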

Source/domain losses for the winner

source_loss_books_en: 4.7953
source_loss_books_it: 4.8795
source_loss_code: 8.2308
source_loss_web_en: 5.8867
source_loss_web_it: 6.0990
source_loss_wiki_en: 4.0703
source_loss_wiki_it: 3.9486

Probe reading at step 7000

EN factual prompt The capital of Italy is -> Rome: rank=69, prob=0.0015792847
EN procedural prompt A small language model should -> be: rank=1, prob=0.5
IT factual prompt La capitale d'Italia è -> Roma: rank=273, prob=0.0003337860
IT procedural prompt Un piccolo modello linguistico dovrebbe -> essere: rank=1, prob=0.4785156250

Factual probes remain weak in both languages, while the procedural prompts are strong next-token continuations. These probes are directional evidence only. The main selection rule here is the repo-native benchmark result.
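
For reading the rank and prob values: this card does not describe the probe code, so the following is only a sketch of one common definition. It assumes prob is the softmax probability of the target token at the next position and rank is that token's 1-based position when the vocabulary is sorted by probability. The function probe_target, the four-token vocabulary, and the logits are all hypothetical.

```python
import math


def probe_target(logits: list[float], target_id: int) -> tuple[int, float]:
    """Return (1-based rank, softmax probability) of target_id among the logits."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    target_prob = probs[target_id]
    # Rank 1 means no other token has a strictly higher probability.
    rank = 1 + sum(p > target_prob for p in probs)
    return rank, target_prob


# Toy next-token logits over a made-up 4-token vocabulary.
toy_logits = [2.0, 0.5, -1.0, 0.0]
rank, prob = probe_target(toy_logits, target_id=0)
print(f"rank={rank}, prob={prob:.4f}")
```

Under this definition, rank=1 with prob near 0.5 (the procedural prompts) means the target is the single most likely continuation, while rank=69 or rank=273 means dozens or hundreds of tokens are preferred over the correct answer.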

Cross-run comparison

Token estimate formula:

tokens_seen ~= step * batch_size * grad_accum_steps * sequence_length
for these comparable GPT-2-small web/wiki and v5 runs: batch_size=6, grad_accum_steps=16, sequence_length=2500, so 6 * 16 * 2500 = 240000 tokens per step
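
The estimate can be written as a small helper. This is a sketch under the formula above. The defaults are the per-step factors 6, 16, and 2500 quoted there, mapped to batch_size, grad_accum_steps, and sequence_length in formula order. The function name is illustrative and not part of the repo.

```python
def estimated_tokens_seen(
    step: int,
    batch_size: int = 6,
    grad_accum_steps: int = 16,
    sequence_length: int = 2500,
) -> int:
    """Estimate tokens processed after `step` optimizer steps."""
    return step * batch_size * grad_accum_steps * sequence_length


# Steps that appear in the comparable leaderboard.
for step in (7000, 11000, 14000):
    tokens = estimated_tokens_seen(step)
    print(f"step_{step}: ~{tokens / 1e9:.2f}B tokens")
```

The estimate is an upper bound on distinct tokens seen, since it counts every position in every packed sequence and ignores any padding or repeated data.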

Current comparable leaderboard by the same primary metric val_loss_mixed:

earlydecay7000 step_7000
- mixed=5.2158
- estimated tokens seen: ~1.68B
this release: earlydecay3500 step_7000
- mixed=5.2358
- estimated tokens seen: ~1.68B
gpt2small-en-it-nanochat-lr2e4-bs6-cosine-webwiki-step7000
- mixed=5.3558
- estimated tokens seen: ~1.68B
gpt2small-en-it-nanochat-lr2e4-bs6-wsd-fastdecay-webwiki-step11000
- mixed=5.3576
- estimated tokens seen: ~2.64B
gpt2small-en-it-nanochat-lr1e4-bs6-cosine-webwiki-step14000
- mixed=5.4493
- estimated tokens seen: ~3.36B

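This release therefore ranks second in the comparable public slice. It trails the already published earlydecay7000 step_7000 by 0.0200 in val_loss_mixed (5.2358 vs 5.2158) at the same estimated token count, and it stays ahead of the cosine and fastdecay checkpoints listed above.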

Training/data provenance

training config: training_config.yaml
tokenizer: tokenizer.json + tokenizer_meta.json
packed dataset root used by the run: /mnt/apps/llm-nanochat/datasets/202605141153_fineweb50_wiki50_50en_50it_score100_2500context_5Btokens_tok_20260515_en50it50_webwiki_stratified_500M
tokenizer root used by the run: /mnt/apps/llm-nanochat/tokenizers/tokenizer_20260515_en50it50_webwiki_stratified_500M

Included files

step_7000.pt
step_7000.safetensors
step_7000.safetensors.json
training_config.yaml
tokenizer.json
tokenizer_meta.json
best_validation.json
eval_summary.json
comparison.json
benchmark_report.md
benchmark_metrics.json
benchmark_scores.json
benchmark_source_losses.json
probe_step7000_summary.json
full run telemetry snapshots: eval_metrics.jsonl, metrics.jsonl, probe_generations.jsonl
release note: 2026-06-03_wsd_earlydecay3500_release_step7000.md

Limitations

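mixed quality is still in the weak/intermediate band (val_loss_mixed 5.2358, ppl_mixed 187.8785)
generations remain repetitive and unstable under free-form continuation (loop_rate 0.675, repeated_4gram_rate 0.95)
factual recall is still weak in both languages (cloze exact match 0.00 in EN and IT; factual probe ranks 69 and 273)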
this is the best preserved checkpoint inside the earlydecay3500 run family, not the top checkpoint of the broader comparable leaderboard
dataset redistribution for the full training corpus may have separate licensing constraints; this repo contains model artifacts, not the raw/prepared training corpus
