# SMOKE: 100M-active MoE, tiny-BPE, physics

All smoke gates PASS (eva01 GPU2, docker nvcr.io/nvidia/pytorch:24.03-py3, fp16).

## Tokenizer
- ByteLevel-BPE trained on 30,000 interleaved train-shard scenes (all 24 trained types).
- vocab_size = 512 (specials //), sim-only (reduced header + 3-keyword frame desc).
- median tokens/scene = 86,640 (p90 218k); scenes are 200 frames x up to 34 objects.

## Model
- config_100m: d_model=640, n_layers=14, n_q=10/n_kv=2/hd64, 8 routed + 1 shared expert, top_k=2, d_ff=1024, max_seq_len=1024, tied embeds, grouped MoE, fp32 router, Liger fused-CE.
- ACTIVE = 92.83M, TOTAL = 246.18M params (within 90-120M target band).

## Forward + train step
- Random input_ids [8,1024]: forward loss = 6.24 (~ln(512)=6.24 at init), finite. router_cv ~0.22.
- One real train step (Muon+AdamW, fp16 autocast, dynamic loss-scale, NaN-guard): loss finite.
- Peak VRAM at bs=8/seq=1024 full step = 11.3 GB (< 13 GB cap; bs=10 OOMs vs other users sharing GPU2).

## Throughput
- Sustained ~16,900 GPU-step tok/s; ~10,500 tok/s end-to-end incl. CPU serialization/tokenization.
- 150-step probe: loss 6.32 -> 2.34 (physics text is highly structured; fast convergence).

## Budget decision
- token_budget = 700M (was 2.5B). At ~10.5k tok/s end-to-end -> ~18.5h wall < 24h cap.
- batch_size=8, seq_len=1024 (8192 tok/step) -> ~85,450 steps. peak_lr=6e-4 Muon, cosine, warmup 500.

## Stability fix (applied mid-run at step 10000)
- Symptom: at loss_scale 2^21 physics grads overflowed ~9x/2000 steps (benign NaN-skip+recover). Trainer nan_cap was CUMULATIVE -> would abort ~step 11k.
- Fix: loss_scale_max 2^24->2^16; nan_cap now counts CONSECUTIVE NaNs (reset on good step).
- Resumed from ckpts/last.pt (step 10000, weights only; fresh Muon momentum, cosine LR continued).
- Result: 0 NaN post-patch (vs ~9/2000 before). Stable.
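
The stability fix can be illustrated with a minimal sketch of a consecutive-NaN guard with a capped loss scale. This is not the trainer's actual code: the class name `NanGuard`, the default `nan_cap`, the `growth_interval`, and the halve/double scaling policy are all assumptions made for illustration. Only the 2^16 ceiling and the reset-on-good-step behaviour come from the notes above.

```python
import math


class NanGuard:
    """Skip non-finite steps; abort only after `nan_cap` CONSECUTIVE ones.

    Hypothetical sketch: names, defaults, and the scaling policy are
    assumptions, not the trainer's real implementation.
    """

    def __init__(self, nan_cap: int = 10, init_scale: float = 2.0**16,
                 scale_max: float = 2.0**16, growth_interval: int = 2000):
        self.nan_cap = nan_cap                  # assumed value
        self.scale_max = scale_max              # 2^16 ceiling from the fix
        self.scale = min(init_scale, scale_max)
        self.growth_interval = growth_interval  # assumed value
        self.consecutive_nans = 0               # reset on every good step
        self.total_nans = 0                     # kept for logging only
        self.good_steps = 0

    def update(self, loss: float) -> bool:
        """Return True if the optimizer step should be applied."""
        if not math.isfinite(loss):
            self.consecutive_nans += 1
            self.total_nans += 1
            self.good_steps = 0
            # Back off the loss scale after an overflow.
            self.scale = max(self.scale / 2.0, 1.0)
            if self.consecutive_nans >= self.nan_cap:
                raise RuntimeError(
                    f"{self.consecutive_nans} consecutive non-finite losses"
                )
            return False  # skip this step and carry on

        # Good step: clear the consecutive counter, maybe grow the scale.
        self.consecutive_nans = 0
        self.good_steps += 1
        if self.good_steps % self.growth_interval == 0:
            self.scale = min(self.scale * 2.0, self.scale_max)
        return True


if __name__ == "__main__":
    guard = NanGuard(nan_cap=3)
    for loss in [2.4, float("nan"), 2.3, float("inf"), 2.3]:
        applied = guard.update(loss)
        print(f"loss={loss} applied={applied} scale={guard.scale:.0f}")
```

With a cumulative cap, sporadic overflows at a rate like ~9 per 2000 steps would eventually exhaust any fixed budget on a long run. Counting only consecutive failures lets isolated skips pass and still aborts when the run has genuinely diverged.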