# RESULTS — moe100m-physics tiny-BPE

## Model (from scratch)
- Qwen3-MoE: d_model=640, n_layers=14, GQA 10q/2kv hd64, partial-RoPE 32, 8 routed + 1 shared
  SwiGLU experts, top-2, aux-loss-free sigmoid-bias router, tied embeds, fp32 router, Liger fused-CE.
- ACTIVE = 92.83M, TOTAL = 246.18M.
- Tokenizer: custom 512-token ByteLevel-BPE (sim-only). median ~86.6k tokens/scene.
- Best checkpoint: `ckpts/best.pt`
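
The "aux-loss-free sigmoid-bias router" above is not spelled out in this file. The following is a minimal sketch of how such a router is commonly built, assuming the usual scheme: sigmoid affinities, a per-expert bias that affects expert selection only, and a sign-based bias update from observed load. The function names, `step_size`, and the renormalisation of gate weights are illustrative assumptions, not taken from the training code.

```python
import math

def route_top_k(logits, bias, k=2):
    """Pick k experts for one token.

    logits: raw router outputs, one per routed expert.
    bias:   per-expert load-balancing bias (used for selection only).
    Returns (chosen expert indices, gate weights that sum to 1).
    """
    scores = [1.0 / (1.0 + math.exp(-z)) for z in logits]  # sigmoid affinities
    # Selection ranks by score + bias, so under-loaded experts can be favoured.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i] + bias[i], reverse=True)
    chosen = ranked[:k]
    # Gate weights use the unbiased scores, renormalised over the chosen experts.
    total = sum(scores[i] for i in chosen)
    gates = [scores[i] / total for i in chosen]
    return chosen, gates

def update_bias(bias, tokens_per_expert, step_size=1e-3):
    """Nudge bias up for under-loaded experts and down for over-loaded ones."""
    mean_load = sum(tokens_per_expert) / len(tokens_per_expert)
    new_bias = []
    for b, load in zip(bias, tokens_per_expert):
        if load < mean_load:
            new_bias.append(b + step_size)
        elif load > mean_load:
            new_bias.append(b - step_size)
        else:
            new_bias.append(b)
    return new_bias
```

With 8 routed experts and top-2, `route_top_k` would be called per token on the 8 router logits; the shared expert is always applied and does not go through the router. Because the bias only changes which experts are picked, no auxiliary balancing term is added to the loss.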

## Training
- Data: AlexWortega/physics-scenarios-packed, 24 trained types interleaved; frame descriptions
  reduced to {in motion, settling, at rest}. batch 8 x seq 1024, Muon+AdamW, cosine, fp16 + fp32 router.
- Eval loss: 6.24 (init) -> ~1.70 (plateaued from ~step 7000). best ckpt eval 1.704.

## Eval — Pymunk position error (% of scene diagonal), greedy autoregressive rollout
| set | error @15f | note |
|---|---|---|
| trained (all 30)        | 6.89% | includes large scenes that overflow 1024 ctx |
| held-out (all)          | 5.88% | |
| trained, fittable (<=12 obj) | 2.03% | fair vs baseline |
| held-out, fittable      | 2.48% | |
| LFM2-350M baseline      | 0.38% trained / 0.93% held-out | 350M, 8192 ctx, bf16 |

The model produces well-formed physics frames (`Frame N: obj_i: pos/vel`). On fittable scenes it is
roughly 2.7x (held-out) to 5.3x (trained) less precise than the larger 8192-ctx LFM2 baseline.
Orbit @80f: ~1.0-1.2%.
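
For reference, the position-error metric in the table above can be sketched as follows. This is a guess at the computation, not the evaluation script itself: it assumes the error is the mean Euclidean distance between predicted and Pymunk reference positions at a given frame, divided by the scene diagonal. The function name and the `width`/`height` arguments are hypothetical.

```python
import math

def position_error_pct(pred, ref, width, height):
    """Mean Euclidean position error as a percentage of the scene diagonal.

    pred, ref: lists of (x, y) positions for the same objects in the same order.
    width, height: scene extents in the same units as the positions.
    """
    if len(pred) != len(ref) or not pred:
        raise ValueError("pred and ref must be non-empty and the same length")
    diagonal = math.hypot(width, height)
    mean_dist = sum(math.dist(p, r) for p, r in zip(pred, ref)) / len(pred)
    return 100.0 * mean_dist / diagonal
```

Normalising by the diagonal makes scenes of different sizes comparable, which is what lets a single percentage summarise a whole set of scenes.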

## Stability note
Training diverged reproducibly at ~140M tokens (a toxic explosive-batch cluster at that data offset,
independent of LR). Best clean checkpoint (eval 1.704) is the deliverable. See POSTMORTEM.md.
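
To relate the token offset above to step numbers, a rough conversion can be sketched from the batch shape in the Training section. It assumes every step consumes exactly batch 8 x seq 1024 tokens with no gradient accumulation and a single data-parallel rank; `GRAD_ACCUM` is an assumption and should be adjusted if the actual run differs.

```python
BATCH_SIZE = 8      # sequences per step (from the Training section)
SEQ_LEN = 1024      # tokens per sequence (from the Training section)
GRAD_ACCUM = 1      # assumption: no gradient accumulation, single rank

TOKENS_PER_STEP = BATCH_SIZE * SEQ_LEN * GRAD_ACCUM  # 8192 under these assumptions

def tokens_at_step(step):
    """Total tokens consumed after `step` optimizer steps."""
    return step * TOKENS_PER_STEP

def step_at_tokens(tokens):
    """Last fully completed step before `tokens` tokens have been consumed."""
    return tokens // TOKENS_PER_STEP

print(tokens_at_step(7000))         # 57344000
print(step_at_tokens(140_000_000))  # 17089
```

Under these assumptions the eval plateau (~step 7000) starts near 57M tokens, and the divergence at ~140M tokens falls around step 17,000.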