gpt2small-en-it-nanochat-lr2e4-bs6-wsd-earlydecay3500-final5e6-webwiki-step7000
This repo stages the benchmark-selected checkpoint from the local NanoChat EN/IT GPT-2-small-like WSD early-decay-3500 web/wiki run 20260530_fresh-gpt2small-lr2e4-bs6-wsd-earlydecay3500-final5e6-webwiki.
What this is
- model family: GPT-2-small-like decoder-only LM
- parameters: ~136M
- languages: English + Italian
- context length: 2500
- selected checkpoint: step_7000.pt
- selection reason: best full repo-native CPU benchmark result across the closeout candidate checkpoints from this run family
- status relative to comparable public checkpoints: currently the second-best benchmarked public GPT-2-small EN/IT checkpoint in the comparable web/wiki slice tracked from this workspace
Important caveat
This run's best online validation checkpoint was not the same as its best benchmark checkpoint.
- best online validation of the full run: step_16000 (validation_loss=3.8532891550, validation_perplexity=47.1478851784)
- benchmark winner after closeout: step_7000
This release follows the benchmark winner, not the online-validation winner.
Benchmark summary
Repo-native benchmark suite: configs/eval/20260521_pretrain_minimal_en_it_webwiki_step11000.yaml
Winner metrics:
- val_loss_mixed: 5.2358
- ppl_mixed: 187.8785
- val_loss_en: 5.1587
- ppl_en: 173.9426
- val_loss_it: 4.0506
- ppl_it: 57.4308
- loop_rate: 0.675
- repeated_4gram_rate: 0.95
- cloze_en_contains: 0.02
- cloze_it_contains: 0.10
- cloze_en_exact: 0.00
- cloze_it_exact: 0.00
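The reported perplexities match the usual relation ppl = exp(loss) for a mean per-token cross-entropy in nats. The sketch below checks that relation against the three reported pairs. It assumes natural-log losses; the benchmark suite's own computation is not shown in this repo description, and the helper name `perplexity` is illustrative only.

```python
import math

# Reported (val_loss, ppl) pairs for step_7000, copied from the winner metrics.
reported = {
    "mixed": (5.2358, 187.8785),
    "en": (5.1587, 173.9426),
    "it": (4.0506, 57.4308),
}


def perplexity(loss_nats: float) -> float:
    """Perplexity for a mean per-token cross-entropy given in nats (assumed unit)."""
    return math.exp(loss_nats)


# Losses are rounded to 4 decimals, so compare with a small relative tolerance.
consistent = {
    name: math.isclose(perplexity(loss), ppl, rel_tol=1e-3)
    for name, (loss, ppl) in reported.items()
}

for name, (loss, ppl) in reported.items():
    print(f"{name}: exp({loss}) = {perplexity(loss):.2f}, reported {ppl}, consistent={consistent[name]}")
```

The tolerance is deliberately loose because the losses are published with four decimals while the perplexities were presumably computed from unrounded values.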
Benchmark ranking across the closeout candidates from this run:
- step_7000: mixed=5.2358, en=5.1587, it=4.0506
- step_16000: mixed=5.4730, en=5.2994, it=4.6116
- step_28000: mixed=5.6045, en=5.3080, it=4.7532
- step_29000: mixed=5.7228, en=5.2835, it=4.7360
- step_23000: mixed=5.9788, en=5.4895, it=4.8834
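The ranking above orders the candidates by the primary metric, val_loss_mixed, lowest first. A minimal sketch of that selection rule follows, with the numbers copied from the list; the dictionary layout and variable names are illustrative and do not reflect the schema of the files shipped in this repo.

```python
# Benchmark losses for the closeout candidates, copied from the ranking above.
candidates = {
    "step_16000": {"mixed": 5.4730, "en": 5.2994, "it": 4.6116},
    "step_23000": {"mixed": 5.9788, "en": 5.4895, "it": 4.8834},
    "step_28000": {"mixed": 5.6045, "en": 5.3080, "it": 4.7532},
    "step_29000": {"mixed": 5.7228, "en": 5.2835, "it": 4.7360},
    "step_7000": {"mixed": 5.2358, "en": 5.1587, "it": 4.0506},
}

# Lower loss is better, so sort ascending on the primary metric.
ranked = sorted(candidates, key=lambda step: candidates[step]["mixed"])
winner = ranked[0]

for position, step in enumerate(ranked, start=1):
    print(position, step, candidates[step]["mixed"])
```

Sorting on a single primary metric keeps the rule simple. For these candidates, step_7000 is also lowest on the en and it losses, so the choice does not depend on which of the three is used.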
Source/domain losses for the winner
- source_loss_books_en: 4.7953
- source_loss_books_it: 4.8795
- source_loss_code: 8.2308
- source_loss_web_en: 5.8867
- source_loss_web_it: 6.0990
- source_loss_wiki_en: 4.0703
- source_loss_wiki_it: 3.9486
Probe reading at step 7000
- EN factual prompt: The capital of Italy is -> Rome: rank=69, prob=0.0015792847
- EN procedural prompt: A small language model should -> be: rank=1, prob=0.5
- IT factual prompt: La capitale d'Italia è -> Roma: rank=273, prob=0.0003337860
- IT procedural prompt: Un piccolo modello linguistico dovrebbe -> essere: rank=1, prob=0.4785156250
Factual probes remain weak in both languages: the correct capital ranks 69th in English and 273rd in Italian. The procedural prompts, by contrast, put the expected token at rank 1 with probability near 0.5. These probes are directional evidence only. The main selection rule here is the repo-native benchmark result.
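For readers unfamiliar with rank/prob probes, the sketch below shows one common way to derive both numbers from next-token logits. It is an assumption about the method, not the repo's probe code: rank is taken as 1-based position by descending logit, prob as the softmax probability, and the function name `probe` and the toy logits are hypothetical.

```python
import math


def probe(logits, target_id):
    """Return (rank, prob) of target_id under softmax(logits); rank 1 is the most likely token."""
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    prob = exps[target_id] / sum(exps)
    # Rank = 1 + number of tokens with a strictly higher logit than the target.
    rank = 1 + sum(1 for x in logits if x > logits[target_id])
    return rank, prob


# Toy 4-token vocabulary with made-up logits (not model output).
toy_logits = [2.0, 0.5, 1.0, -1.0]
rank, prob = probe(toy_logits, target_id=2)
print(rank, prob)
```

Under this definition a rank of 69 means 68 other tokens received a higher score than the target at that position.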
Cross-run comparison
Token estimate formula:
- tokens_seen ~= step * batch_size * grad_accum_steps * sequence_length
- for these comparable GPT-2-small web/wiki and v5 runs: 6 * 16 * 2500 = 240000 tokens per step
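The estimate can be reproduced in a few lines. The sketch below assumes the three factors map in order to batch_size=6, grad_accum_steps=16 and sequence_length=2500; the function name `tokens_seen` simply mirrors the formula and is not taken from the training code.

```python
def tokens_seen(step, batch_size=6, grad_accum_steps=16, sequence_length=2500):
    """Approximate tokens processed after `step` optimizer steps."""
    return step * batch_size * grad_accum_steps * sequence_length


tokens_per_step = tokens_seen(1)
print(f"tokens per step: {tokens_per_step}")

# Steps of the public checkpoints compared in this document.
for step in (7000, 11000, 14000):
    print(f"step_{step}: ~{tokens_seen(step) / 1e9:.2f}B tokens")
```

This is an upper-level estimate: it counts every position in every packed sequence and ignores any padding or skipped batches.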
Current comparable leaderboard by the same primary metric val_loss_mixed:
- earlydecay7000 step_7000: mixed=5.2158, estimated tokens seen: ~1.68B
- this release, earlydecay3500 step_7000: mixed=5.2358, estimated tokens seen: ~1.68B
- gpt2small-en-it-nanochat-lr2e4-bs6-cosine-webwiki-step7000: mixed=5.3558, estimated tokens seen: ~1.68B
- gpt2small-en-it-nanochat-lr2e4-bs6-wsd-fastdecay-webwiki-step11000: mixed=5.3576, estimated tokens seen: ~2.64B
- gpt2small-en-it-nanochat-lr1e4-bs6-cosine-webwiki-step14000: mixed=5.4493, estimated tokens seen: ~3.36B
So this release is still a meaningful checkpoint in the comparable public slice, but it does not beat the already published earlydecay7000 step_7000.
Training/data provenance
- training config: training_config.yaml
- tokenizer: tokenizer.json + tokenizer_meta.json
- packed dataset root used by the run: /mnt/apps/llm-nanochat/datasets/202605141153_fineweb50_wiki50_50en_50it_score100_2500context_5Btokens_tok_20260515_en50it50_webwiki_stratified_500M
- tokenizer root used by the run: /mnt/apps/llm-nanochat/tokenizers/tokenizer_20260515_en50it50_webwiki_stratified_500M
Included files
- step_7000.pt
- step_7000.safetensors
- step_7000.safetensors.json
- training_config.yaml
- tokenizer.json
- tokenizer_meta.json
- best_validation.json
- eval_summary.json
- comparison.json
- benchmark_report.md
- benchmark_metrics.json
- benchmark_scores.json
- benchmark_source_losses.json
- probe_step7000_summary.json
- full run telemetry snapshots: eval_metrics.jsonl, metrics.jsonl, probe_generations.jsonl
- release note: 2026-06-03_wsd_earlydecay3500_release_step7000.md
Limitations
- mixed quality is still in the weak/intermediate band
- generations remain repetitive and unstable under free-form continuation
- factual recall is still weak in both languages
- this is the best preserved checkpoint inside the earlydecay3500 run family, not the top checkpoint of the broader comparable leaderboard
- dataset redistribution for the full training corpus may have separate licensing constraints; this repo contains model artifacts, not the raw/prepared training corpus