gpt2small-en-it-nanochat-lr2e4-bs6-wsd-earlydecay3500-final5e6-webwiki-step7000

This repo stages the benchmark-selected checkpoint from the local NanoChat EN/IT GPT-2-small-like WSD early-decay-3500 web/wiki run 20260530_fresh-gpt2small-lr2e4-bs6-wsd-earlydecay3500-final5e6-webwiki.

What this is

  • model family: GPT-2-small-like decoder-only LM
  • parameters: ~136M
  • languages: English + Italian
  • context length: 2500
  • selected checkpoint: step_7000.pt
  • selection reason: best full repo-native CPU benchmark result across the closeout candidate checkpoints from this run family
  • status relative to comparable public checkpoints: currently the second-best benchmarked public GPT-2-small EN/IT checkpoint in the comparable web/wiki slice tracked from this workspace

Important caveat

This run's best online validation checkpoint was not the same as its best benchmark checkpoint.

  • best online validation of the full run:
    • step_16000
    • validation_loss=3.8532891550
    • validation_perplexity=47.1478851784
  • benchmark winner after closeout:
    • step_7000

This release follows the benchmark winner, not the online-validation winner.

Benchmark summary

Repo-native benchmark suite: configs/eval/20260521_pretrain_minimal_en_it_webwiki_step11000.yaml

Winner metrics:

  • val_loss_mixed: 5.2358
  • ppl_mixed: 187.8785
  • val_loss_en: 5.1587
  • ppl_en: 173.9426
  • val_loss_it: 4.0506
  • ppl_it: 57.4308
  • loop_rate: 0.675
  • repeated_4gram_rate: 0.95
  • cloze_en_contains: 0.02
  • cloze_it_contains: 0.10
  • cloze_en_exact: 0.00
  • cloze_it_exact: 0.00
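
The perplexity figures above appear to be the exponential of the corresponding losses, which is the usual convention when the loss is a mean per-token cross-entropy in nats. The sketch below re-derives them from the rounded losses as a sanity check. The `perplexity` helper and the `reported` dictionary are illustrative names and are not taken from the benchmark code.

```python
import math

# Winner metrics copied from the list above: (loss, reported perplexity).
reported = {
    "mixed": (5.2358, 187.8785),
    "en": (5.1587, 173.9426),
    "it": (4.0506, 57.4308),
}


def perplexity(loss: float) -> float:
    """Perplexity for a mean per-token cross-entropy loss given in nats."""
    return math.exp(loss)


for name, (loss, ppl) in reported.items():
    recomputed = perplexity(loss)
    # The losses are rounded to 4 decimals, so allow a small relative gap.
    assert math.isclose(recomputed, ppl, rel_tol=1e-3), name
    print(f"{name}: exp({loss}) = {recomputed:.4f} (reported {ppl})")
```

The same check holds for the online-validation numbers in the caveat section, where the loss and perplexity are given at full precision.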

Benchmark ranking across the closeout candidates from this run:

  1. step_7000
    • mixed=5.2358
    • en=5.1587
    • it=4.0506
  2. step_16000
    • mixed=5.4730
    • en=5.2994
    • it=4.6116
  3. step_28000
    • mixed=5.6045
    • en=5.3080
    • it=4.7532
  4. step_29000
    • mixed=5.7228
    • en=5.2835
    • it=4.7360
  5. step_23000
    • mixed=5.9788
    • en=5.4895
    • it=4.8834
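
The selection rule can be reproduced from the numbers above with a few lines of Python. This is a minimal sketch that assumes the winner is simply the candidate with the lowest mixed loss; the `candidates` dictionary and `rank_by` helper are illustrative and do not come from the repo's benchmark code.

```python
# Closeout candidates and their benchmark losses, copied from the ranking above.
candidates = {
    "step_7000": {"mixed": 5.2358, "en": 5.1587, "it": 4.0506},
    "step_16000": {"mixed": 5.4730, "en": 5.2994, "it": 4.6116},
    "step_28000": {"mixed": 5.6045, "en": 5.3080, "it": 4.7532},
    "step_29000": {"mixed": 5.7228, "en": 5.2835, "it": 4.7360},
    "step_23000": {"mixed": 5.9788, "en": 5.4895, "it": 4.8834},
}


def rank_by(metric: str) -> list[str]:
    """Return checkpoint names sorted by ascending loss on one metric."""
    return sorted(candidates, key=lambda name: candidates[name][metric])


ranking = rank_by("mixed")
winner = ranking[0]
print("ranking by mixed loss:", ranking)
print("benchmark winner:", winner)

# Gap between the online-validation winner (step_16000) and the benchmark winner.
gap = candidates["step_16000"]["mixed"] - candidates[winner]["mixed"]
print(f"step_16000 trails {winner} by {gap:.4f} on mixed loss")
```

Calling `rank_by("en")` or `rank_by("it")` also puts step_7000 first, so the choice does not depend on which of the three losses is used.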

Source/domain losses for the winner

  • source_loss_books_en: 4.7953
  • source_loss_books_it: 4.8795
  • source_loss_code: 8.2308
  • source_loss_web_en: 5.8867
  • source_loss_web_it: 6.0990
  • source_loss_wiki_en: 4.0703
  • source_loss_wiki_it: 3.9486

Probe reading at step 7000

  • EN factual prompt "The capital of Italy is" -> "Rome": rank=69, prob=0.0015792847
  • EN procedural prompt "A small language model should" -> "be": rank=1, prob=0.5
  • IT factual prompt "La capitale d'Italia è" -> "Roma": rank=273, prob=0.0003337860
  • IT procedural prompt "Un piccolo modello linguistico dovrebbe" -> "essere": rank=1, prob=0.4785156250

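Factual probes remain weak in both languages: the correct capital ranks 69th in English and 273rd in Italian, each with probability below 0.002. The procedural prompts, by contrast, rank their target token first with probability close to 0.5. These probes are directional evidence only. The main selection rule here is the repo-native benchmark result.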
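
For readers unfamiliar with this kind of probe, the sketch below shows one common way to obtain a rank and probability for a target token from next-token logits. It is an assumption about how the reported `rank` and `prob` are defined (1-based rank under a softmax over the vocabulary). The actual probe script is not shown here, and `target_rank_and_prob` is a hypothetical helper.

```python
import math


def target_rank_and_prob(logits, target_id):
    """Return (1-based rank, softmax probability) of target_id.

    Assumption: rank counts how many tokens have a strictly higher logit,
    plus one, so ties share the better rank.
    """
    max_logit = max(logits)
    # Subtract the maximum before exponentiating for numerical stability.
    exps = [math.exp(x - max_logit) for x in logits]
    prob = exps[target_id] / sum(exps)
    rank = 1 + sum(1 for x in logits if x > logits[target_id])
    return rank, prob


# Toy three-token vocabulary whose softmax probabilities are 0.2, 0.6, 0.2.
toy_logits = [0.0, math.log(3.0), 0.0]
print(target_rank_and_prob(toy_logits, 1))
print(target_rank_and_prob(toy_logits, 0))
```

With a real model, `logits` would be the output row for the last prompt position and `target_id` the tokenizer ID of the first token of the expected continuation.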

Cross-run comparison

Token estimate formula:

  • tokens_seen ~= step * batch_size * grad_accum_steps * sequence_length
  • for these comparable GPT-2-small web/wiki and v5 runs: 6 * 16 * 2500 = 240000 tokens per step
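
The formula above can be written as a small helper. This is a minimal sketch; the default values assume that the factors 6, 16, and 2500 correspond to batch size, gradient accumulation steps, and sequence length in that order, and `estimate_tokens_seen` is an illustrative name.

```python
def estimate_tokens_seen(
    step: int,
    batch_size: int = 6,
    grad_accum_steps: int = 16,
    sequence_length: int = 2500,
) -> int:
    """Estimate tokens processed after `step` optimizer steps."""
    return step * batch_size * grad_accum_steps * sequence_length


tokens_per_step = estimate_tokens_seen(1)
print(f"tokens per step: {tokens_per_step}")

for step in (7000, 11000, 14000):
    billions = estimate_tokens_seen(step) / 1e9
    print(f"step {step}: ~{billions:.2f}B tokens")
```

The estimate is an upper bound on useful tokens if packed sequences contain padding, which this formula does not account for.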

Current comparable leaderboard by the same primary metric val_loss_mixed:

  1. earlydecay7000 step_7000
    • mixed=5.2158
    • estimated tokens seen: ~1.68B
  2. this release: earlydecay3500 step_7000
    • mixed=5.2358
    • estimated tokens seen: ~1.68B
  3. gpt2small-en-it-nanochat-lr2e4-bs6-cosine-webwiki-step7000
    • mixed=5.3558
    • estimated tokens seen: ~1.68B
  4. gpt2small-en-it-nanochat-lr2e4-bs6-wsd-fastdecay-webwiki-step11000
    • mixed=5.3576
    • estimated tokens seen: ~2.64B
  5. gpt2small-en-it-nanochat-lr1e4-bs6-cosine-webwiki-step14000
    • mixed=5.4493
    • estimated tokens seen: ~3.36B

This release therefore ranks second in the comparable public slice: its val_loss_mixed of 5.2358 trails the already published earlydecay7000 step_7000 (5.2158) by 0.0200 and leads the third-place cosine step_7000 checkpoint (5.3558) by 0.1200.

Training/data provenance

  • training config: training_config.yaml
  • tokenizer: tokenizer.json + tokenizer_meta.json
  • packed dataset root used by the run: /mnt/apps/llm-nanochat/datasets/202605141153_fineweb50_wiki50_50en_50it_score100_2500context_5Btokens_tok_20260515_en50it50_webwiki_stratified_500M
  • tokenizer root used by the run: /mnt/apps/llm-nanochat/tokenizers/tokenizer_20260515_en50it50_webwiki_stratified_500M

Included files

  • step_7000.pt
  • step_7000.safetensors
  • step_7000.safetensors.json
  • training_config.yaml
  • tokenizer.json
  • tokenizer_meta.json
  • best_validation.json
  • eval_summary.json
  • comparison.json
  • benchmark_report.md
  • benchmark_metrics.json
  • benchmark_scores.json
  • benchmark_source_losses.json
  • probe_step7000_summary.json
  • full run telemetry snapshots: eval_metrics.jsonl, metrics.jsonl, probe_generations.jsonl
  • release note: 2026-06-03_wsd_earlydecay3500_release_step7000.md
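
To inspect step_7000.safetensors without loading any weights, the file header can be read with the standard library alone. The safetensors format starts with an 8-byte little-endian unsigned integer giving the header length, followed by that many bytes of JSON describing each tensor's dtype, shape, and byte offsets. The sketch below relies only on that layout; the helper names are illustrative, and the tensor names inside this particular checkpoint are not documented here.

```python
import json
import math
import struct


def read_safetensors_header(path):
    """Read the JSON header of a .safetensors file without loading tensors."""
    with open(path, "rb") as f:
        # First 8 bytes: header length as a little-endian unsigned 64-bit int.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len).decode("utf-8"))
    # Drop the optional free-form metadata entry; the rest are tensors.
    header.pop("__metadata__", None)
    return header


def count_parameters(header):
    """Sum the element counts of all tensors listed in the header."""
    return sum(math.prod(entry["shape"]) for entry in header.values())


# Example usage, assuming the file has been downloaded locally:
#   header = read_safetensors_header("step_7000.safetensors")
#   print(count_parameters(header))
```

The count can then be compared with the ~136M figure stated above. Tensors that share storage may be saved only once, so the stored count can differ from a count taken on the instantiated model.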

Limitations

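  • mixed quality is still in the weak/intermediate band (val_loss_mixed 5.2358, ppl_mixed 187.8785)
  • generations remain repetitive and unstable under free-form continuation (loop_rate 0.675, repeated_4gram_rate 0.95)
  • factual recall is still weak in both languages (cloze exact match 0.00 in both; factual probe targets rank 69 in EN and 273 in IT)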
  • this is the best preserved checkpoint inside the earlydecay3500 run family, not the top checkpoint of the broader comparable leaderboard
  • dataset redistribution for the full training corpus may have separate licensing constraints; this repo contains model artifacts, not the raw/prepared training corpus