# RESULTS: moe100m-physics tiny-BPE

## Model (from scratch)
- Qwen3-MoE: d_model=640, n_layers=14, GQA 10q/2kv hd64, partial-RoPE 32, 8 routed + 1 shared SwiGLU experts, top-2, aux-loss-free sigmoid-bias router, tied embeds, fp32 router, Liger fused-CE.
- ACTIVE = 92.83M, TOTAL = 246.18M.
- Tokenizer: custom 512-token ByteLevel-BPE (sim-only). Median ~86.6k tokens/scene.
- ckpt_best_path: `ckpts/best.pt`

## Training
- Data: AlexWortega/physics-scenarios-packed, 24 trained types interleaved; frame descriptions reduced to {in motion, settling, at rest}. Batch 8 x seq 1024, Muon+AdamW, cosine, fp16 + fp32 router.
- Eval loss: 6.24 (init) -> ~1.70 (plateaued from ~step 7000). Best ckpt eval 1.704.

## Eval: Pymunk position error (% of scene diagonal), greedy autoregressive rollout

| set | @15f | note |
|---|---|---|
| trained (all 30) | 6.89% | includes large scenes that overflow 1024 ctx |
| held-out (all) | 5.88% | |
| trained, fittable (<=12 obj) | 2.03% | fair vs baseline |
| held-out, fittable | 2.48% | |
| LFM2-350M baseline | 0.38% trained / 0.93% held-out | 350M, 8192 ctx, bf16 |

Model produces well-formed physics frames (Frame N: obj_i: pos/vel); ~3-5x less precise than the larger 8192-ctx LFM2 baseline. orbit @80f ~1.0-1.2%.

## Stability note
Training diverged reproducibly at ~140M tokens (a toxic explosive-batch cluster at that data offset, independent of LR). Best clean checkpoint (eval 1.704) is the deliverable. See POSTMORTEM.md.
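
For reference, the position-error metric in the Eval table (percent of scene diagonal) could be computed roughly as below. This is a minimal sketch, assuming the error is the mean Euclidean distance between rollout and Pymunk reference positions at a single frame, divided by the scene diagonal; the exact aggregation behind the reported numbers is not stated here. The function name `position_error_pct`, the 800x600 scene size, and the sample positions are hypothetical.

```python
import math


def position_error_pct(pred, ref, scene_width, scene_height):
    """Mean Euclidean position error as a percentage of the scene diagonal.

    pred, ref: lists of (x, y) positions, one per object, in the same order.
    Assumption: errors are averaged over objects at one frame.
    """
    if len(pred) != len(ref) or not pred:
        raise ValueError("pred and ref must be non-empty and the same length")
    diagonal = math.hypot(scene_width, scene_height)
    mean_dist = sum(math.dist(p, r) for p, r in zip(pred, ref)) / len(pred)
    return 100.0 * mean_dist / diagonal


# Hypothetical 800x600 scene (diagonal 1000) with two objects at frame 15.
ref_frame = [(100.0, 200.0), (400.0, 300.0)]
pred_frame = [(103.0, 204.0), (400.0, 310.0)]
print(position_error_pct(pred_frame, ref_frame, 800, 600))  # 0.75
```

Normalizing by the diagonal makes errors comparable across scenes of different sizes, which is presumably why the table reports percentages instead of raw pixel distances.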