# PLAN — 100M MoE, tiny-BPE, physics next-frame, eva01 GPU2

## Single path (tiny-BPE). GPU: eva01 GPU 2 (idle, ~15GB free). Container: nvcr.io/nvidia/pytorch:24.03-py3.

### Files (in path-bpe/, adapting the cloned repo)
- `tokenizer_build.py` — train a tiny ByteLevel-BPE (`tokenizers`) on ~20k serialized scenes; vocab≈512;
  specials `<bos>`/`<eos>`/`<pad>`. Save `tokenizer.json`. (Sim-only vocab learned from the corpus.)
- `data_physics.py` — stream `AlexWortega/physics-scenarios-{raw,packed}`; decode `jsonl` bytes →
  serialize via `fmt_header`/`fmt_frame` (ported from `physics_core`); reduce descriptions to a small fixed
  KEYWORD set (Type kept as a token; frame desc → {in motion, settling, collision, at rest, …}); wrap each
  scene as BOS+scene+EOS; pack into seq_len blocks. The train/validation/test splits already exist.
- `config_100m.py` — repo MoEModelConfig with vocab=tokenizer.size; tune to ~100M ACTIVE (tiny vocab makes
  embeddings ~free). Start: d_model=512, n_layers=14, n_q=8/n_kv=2/head_dim=64, 8 routed + 1 shared experts
  (top-2 routing), d_ff=1024, max_seq_len=1024. The smoke run prints the param count; adjust to 90–120M active.
- `train_phys.py` — single-GPU adaptation of train/train_200m.py: Muon optim (optim/muon.py), Liger CE if avail
  else torch CE, fp16 + fp32-router hardening (V100), cosine schedule, eval_every, ckpt, wandb.
- `eval_phys.py` — load ckpt, greedy-generate frames, roll out via ported physics_core, score %diag vs Pymunk
  on the 30 bench scenes (24 trained / 6 held-out). Compare to LFM2-350M baseline (@15f 0.38/0.93; orbit 0.75@80f).
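
The BOS+scene+EOS packing step in `data_physics.py` could be sketched roughly as below. This is a minimal illustration, not the repo's loader: `pack_scenes` and its arguments are hypothetical names, scenes are assumed to arrive already tokenized as lists of ids, scenes are allowed to straddle block boundaries, and the final partial block is padded rather than dropped.

```python
from typing import Iterable, Iterator, List


def pack_scenes(
    scenes: Iterable[List[int]],
    seq_len: int,
    bos_id: int,
    eos_id: int,
    pad_id: int,
) -> Iterator[List[int]]:
    """Wrap each scene as BOS + ids + EOS and emit fixed-length blocks.

    Hypothetical helper: the real pipeline may drop the tail block or
    avoid splitting scenes across blocks.
    """
    buf: List[int] = []
    for ids in scenes:
        buf.append(bos_id)
        buf.extend(ids)
        buf.append(eos_id)
        # Emit as many full blocks as the buffer currently holds.
        while len(buf) >= seq_len:
            yield buf[:seq_len]
            buf = buf[seq_len:]
    if buf:
        # Assumption: pad the last partial block up to seq_len.
        yield buf + [pad_id] * (seq_len - len(buf))
```

Because the function is a generator, it can sit directly behind a streaming dataset iterator without materializing the corpus.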
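
Before the smoke run, the active-parameter target for `config_100m.py` can be sanity-checked on paper. The sketch below is a back-of-envelope estimate under loudly stated assumptions that the plan does not confirm: bias-free linear layers, a gated (three-matrix) FFN per expert, a linear router, two norm vectors per layer, and tied input/output embeddings. `estimate_params` is a hypothetical helper, not part of the repo.

```python
def estimate_params(
    d_model: int = 512,
    n_layers: int = 14,
    n_q: int = 8,
    n_kv: int = 2,
    head_dim: int = 64,
    n_routed: int = 8,
    n_shared: int = 1,
    top_k: int = 2,
    d_ff: int = 1024,
    vocab: int = 512,
) -> dict:
    """Rough total/active parameter counts for a GQA + MoE decoder."""
    # Grouped-query attention: Q and O use n_q heads, K and V use n_kv heads.
    attn = 2 * d_model * n_q * head_dim + 2 * d_model * n_kv * head_dim
    # Assumption: each expert is a gated FFN (gate, up, down projections).
    expert = 3 * d_model * d_ff
    router = d_model * n_routed
    norms = 2 * d_model
    shared = attn + router + norms
    per_layer_total = shared + (n_routed + n_shared) * expert
    per_layer_active = shared + (top_k + n_shared) * expert
    # Assumption: tied embeddings, plus one final norm vector.
    extras = vocab * d_model + d_model
    return {
        "total": n_layers * per_layer_total + extras,
        "active": n_layers * per_layer_active + extras,
    }


if __name__ == "__main__":
    counts = estimate_params()
    print({k: f"{v / 1e6:.1f}M" for k, v in counts.items()})
```

Under these assumptions the starting config comes out near 75.6M active and 207.7M total, which would sit below the 90–120M target; the authoritative number is whatever the smoke run prints for the repo's actual `MoEModelConfig`.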

### Hyperparams (start)
batch 16 × seq 1024 = 16,384 tok/step; peak_lr 6e-4 (Muon), warmup 500, cosine decay; ~24 GPU-h ≈ 2–3B tokens.
token_budget = 2.5e9 (≈152.6k steps). NaN guard (halve LR; fp32 router). eval_every 1000.
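
The numbers above imply a step count and a learning-rate curve, sketched here in plain Python. Several details are assumptions rather than settled choices: warmup is taken to be linear and measured in steps, the cosine decays to `min_lr=0.0`, and `NanGuard` is a hypothetical helper that only tracks the LR-halving part of the guard (the fp32 router is a model-side change not shown here).

```python
import math

BATCH, SEQ_LEN = 16, 1024
TOKEN_BUDGET = 2.5e9
PEAK_LR, WARMUP = 6e-4, 500

TOKENS_PER_STEP = BATCH * SEQ_LEN
TOTAL_STEPS = math.ceil(TOKEN_BUDGET / TOKENS_PER_STEP)


def lr_at(step: int, total_steps: int = TOTAL_STEPS, peak_lr: float = PEAK_LR,
          warmup: int = WARMUP, min_lr: float = 0.0) -> float:
    """Linear warmup to peak_lr, then cosine decay to min_lr (assumed shape)."""
    if step < warmup:
        return peak_lr * (step + 1) / warmup
    progress = min(1.0, (step - warmup) / max(1, total_steps - warmup))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


class NanGuard:
    """Halve a global LR multiplier each time a non-finite loss is seen."""

    def __init__(self) -> None:
        self.scale = 1.0

    def check(self, loss: float) -> bool:
        """Return True if the step is usable; otherwise halve the LR scale."""
        if math.isfinite(loss):
            return True
        self.scale *= 0.5
        return False


if __name__ == "__main__":
    print(f"tokens/step={TOKENS_PER_STEP}, total_steps={TOTAL_STEPS}")
    # Throughput needed to fit the budget into 24 hours on one GPU.
    print(f"required tok/s for 24h: {TOKEN_BUDGET / (24 * 3600):.0f}")
```

In a training loop the effective rate would be `lr_at(step) * guard.scale`, with the optimizer step skipped whenever `guard.check(loss)` returns `False`.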

### Success criterion
Primary: produce a verified checkpoint (VERIFY all-pass) that rolls out physically plausible frames and
beats or is competitive with LFM2-350M on @15f %diag for trained scenes. Stretch: match it on the long-horizon orbit rollout.