# PLAN: 100M MoE, tiny-BPE, physics next-frame, eva01 GPU2

## Single path (tiny-BPE)

GPU: eva01 GPU 2 (idle, ~15GB free). Container: `nvcr.io/nvidia/pytorch:24.03-py3`.

### Files (in path-bpe/, adapting the cloned repo)

- `tokenizer_build.py`: train a tiny ByteLevel-BPE (`tokenizers`) on ~20k serialized scenes; vocab≈512, plus special tokens (including BOS and EOS). Save `tokenizer.json`. The vocab is sim-only, learned from the corpus.
- `data_physics.py`: stream `AlexWortega/physics-scenarios-{raw,packed}`; decode `jsonl` bytes → serialize via `fmt_header`/`fmt_frame` (ported from `physics_core`); reduce descriptions to a small fixed KEYWORD set (Type kept as a token; frame desc → {in motion, settling, collision, at rest, …}); emit BOS+scene+EOS; pack into seq_len blocks. train/validation/test splits already exist.
- `config_100m.py`: the repo's `MoEModelConfig` with vocab=tokenizer.size; tune to ~100M ACTIVE params (the tiny vocab makes embeddings ~free). Start with d_model=512, n_layers=14, n_q=8/n_kv=2/hd=64, 8 routed + 1 shared experts with top-2 routing, d_ff=1024, max_seq_len=1024. The smoke run prints the param count → adjust to 90–120M active.
- `train_phys.py`: single-GPU adaptation of `train/train_200m.py`: Muon optimizer (`optim/muon.py`), Liger CE if available, else torch CE; fp16 + fp32-router hardening (V100); cosine schedule, eval_every, ckpt, wandb.
- `eval_phys.py`: load ckpt, greedy-generate frames, roll out via the ported `physics_core`, score %diag vs Pymunk on the 30 bench scenes (24 trained / 6 held-out). Compare to the LFM2-350M baseline (@15f 0.38/0.93; orbit 0.75@80f).

### Hyperparams (start)

- batch 16 × seq 1024 = 16,384 tok/step
- peak_lr 6e-4 (Muon), warmup 500, cosine
- ~24 GPU-h ≈ 2–3B tokens; token_budget = 2.5e9
- NaN guard: halve LR; fp32 router
- eval_every 1000

### Success criterion

- Primary: produce a verified checkpoint (VERIFY all-pass) that rolls out physically plausible frames and beats or contends with LFM2-350M on @15f %diag for trained scenes.
- Stretch: match the baseline on the long-horizon orbit rollout (0.75@80f).
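
### Sketch: active-parameter estimate for the starting config

Before the smoke run, the starting shape from `config_100m.py` can be sanity-checked with back-of-envelope arithmetic. The sketch below is not the repo's `MoEModelConfig`; `MoEShape` and `count_params` are hypothetical names, and it assumes SwiGLU experts (three weight matrices each), tied input/output embeddings, no biases, and it ignores norm parameters.

```python
from dataclasses import dataclass


@dataclass
class MoEShape:
    """Hypothetical stand-in for the repo config; values are the plan's starting point."""
    vocab: int = 512
    d_model: int = 512
    n_layers: int = 14
    n_q: int = 8
    n_kv: int = 2
    head_dim: int = 64
    n_routed: int = 8
    n_shared: int = 1
    top_k: int = 2
    d_ff: int = 1024
    ffn_mats: int = 3             # assumption: SwiGLU (gate, up, down)
    tied_embeddings: bool = True  # assumption: lm_head shares the embedding table


def count_params(c: MoEShape) -> dict:
    """Return approximate total and per-token active parameter counts."""
    # Wq and Wo use n_q heads; Wk and Wv use n_kv heads (grouped-query attention).
    attn = c.d_model * c.head_dim * (2 * c.n_q + 2 * c.n_kv)
    expert = c.ffn_mats * c.d_model * c.d_ff
    router = c.d_model * c.n_routed
    embed = c.vocab * c.d_model * (1 if c.tied_embeddings else 2)

    active_layer = attn + router + (c.top_k + c.n_shared) * expert
    total_layer = attn + router + (c.n_routed + c.n_shared) * expert
    return {
        "active": embed + c.n_layers * active_layer,
        "total": embed + c.n_layers * total_layer,
    }


if __name__ == "__main__":
    counts = count_params(MoEShape())
    print(f"active: {counts['active'] / 1e6:.1f}M, total: {counts['total'] / 1e6:.1f}M")
```

Under these assumptions the starting shape lands near 76M active parameters (about 208M total), which is below the 90–120M target, so widening `d_ff` or adding layers would be the likely adjustment. The count printed by the real smoke run is what decides.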
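
### Sketch: step budget, warmup + cosine schedule, NaN guard

The hyperparameters above imply a fixed number of optimizer steps, which the cosine schedule needs as its horizon. The sketch below works that out in plain Python. It assumes the warmup of 500 is measured in steps, that the cosine decays to 10% of peak (`MIN_LR_RATIO` is an assumption, the plan gives no floor), and that the NaN guard skips the bad update in addition to halving the LR. `lr_at` and `nan_guard` are hypothetical helper names, not functions from the repo.

```python
import math

BATCH, SEQ_LEN = 16, 1024
TOKENS_PER_STEP = BATCH * SEQ_LEN                        # 16,384
TOKEN_BUDGET = 2.5e9
TOTAL_STEPS = math.ceil(TOKEN_BUDGET / TOKENS_PER_STEP)  # steps needed to cover the budget

PEAK_LR = 6e-4
WARMUP_STEPS = 500    # assumption: the plan's "warmup 500" is in steps
MIN_LR_RATIO = 0.1    # assumption: floor at 10% of peak


def lr_at(step: int) -> float:
    """Linear warmup to PEAK_LR, then cosine decay to MIN_LR_RATIO * PEAK_LR."""
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = min(1.0, (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return PEAK_LR * (MIN_LR_RATIO + (1.0 - MIN_LR_RATIO) * cosine)


def nan_guard(loss: float, lr_scale: float) -> tuple:
    """Return (apply_update, new_lr_scale); a non-finite loss halves the scale."""
    if math.isfinite(loss):
        return True, lr_scale
    return False, lr_scale * 0.5


if __name__ == "__main__":
    print("total steps:", TOTAL_STEPS)
    print("tokens/s needed for 24 h:", round(TOKEN_BUDGET / (24 * 3600)))
    for s in (0, 499, 500, TOTAL_STEPS // 2, TOTAL_STEPS):
        print(s, f"{lr_at(s):.2e}")
```

In a training loop the effective rate would be `lr_at(step) * lr_scale`, with `lr_scale` starting at 1.0. Covering 2.5e9 tokens in 24 hours requires roughly 29k tokens/s sustained, which is worth comparing against the throughput measured in the smoke run.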
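
### Sketch: packing BOS+scene+EOS into seq_len blocks

The packing step described for `data_physics.py` can be illustrated with a small generator. This is a minimal sketch under stated assumptions: scenes arrive already tokenized as lists of ids, scenes may straddle block boundaries, and a trailing partial block is dropped rather than padded. `pack_scenes` and the id values in the example are hypothetical.

```python
from typing import Iterable, Iterator, List


def pack_scenes(
    scenes: Iterable[List[int]],
    seq_len: int,
    bos_id: int,
    eos_id: int,
) -> Iterator[List[int]]:
    """Concatenate BOS + scene + EOS streams and yield fixed-length blocks."""
    buf: List[int] = []
    for ids in scenes:
        buf.append(bos_id)
        buf.extend(ids)
        buf.append(eos_id)
        # Emit every full block currently available; keep the remainder.
        while len(buf) >= seq_len:
            yield buf[:seq_len]
            buf = buf[seq_len:]
    # Assumption: whatever is left (< seq_len tokens) is discarded, not padded.


if __name__ == "__main__":
    toy_scenes = [[10, 11, 12], [20, 21], [30]]
    for block in pack_scenes(toy_scenes, seq_len=4, bos_id=1, eos_id=2):
        print(block)
    # [1, 10, 11, 12]
    # [2, 1, 20, 21]
    # [2, 1, 30, 2]
```

Because it is a generator over an iterable, the same function works on a streamed dataset without materializing the whole corpus. Dropping the tail loses at most `seq_len - 1` tokens per pass.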