---
library_name: transformers
tags:
- ml-intern
- moe
- qwen3
- physics
- from-scratch
license: apache-2.0
---

# moe100m-physics-tinybpe

A **~100M-active Qwen3-style sparse-MoE** language model trained **from scratch** on physics-simulation next-frame-prediction text, with a **custom 512-token ByteLevel-BPE** whose vocabulary is simulation-only (digits, punctuation, structural keywords). Built autonomously by the ml-intern Claude Code skill.

## Model

- Active params: **92.8M** | Total: **246.2M**
- d_model=640, n_layers=14, GQA 10q/2kv head_dim 64, partial RoPE(32)
- MoE: 8 routed + 1 shared SwiGLU experts, top-2, aux-loss-free sigmoid-bias router
- Tied embeddings, RMSNorm, QK-Norm, fp32 router, Liger fused-CE
- max_seq_len 1024, vocab 512
- Optimizer: Muon (matrices) + AdamW (rest), cosine LR, fp16 (V100)

## Training

- Data: `AlexWortega/physics-scenarios-packed` (24 trained scenario types, interleaved)
- Frame descriptions reduced to a 3-keyword controlled set (in motion / settling / at rest)
- tokens seen: 1.4e+08 (planned 7e+08)
- final train loss: 1.7107 | best eval loss: 1.6926
- wall: 0.20 GPU-h on 1x V100 (eva01)

## Eval: Pymunk position error (% of scene diagonal), greedy autoregressive rollout

| set | @15f |
|---|---|
| trained (all 30 scenes) | 5.548% |
| held-out (all) | 6.753% |
| trained, fittable (<=12 obj) | 1.649% |
| held-out, fittable | 2.524% |

Baseline (fine-tuned LFM2-350M, bf16, 8192 ctx): @15f trained 0.38% / held-out 0.93%; orbit 0.75% @80f.

> NOTE: this model uses max_seq_len=1024 vs the baseline's 8192. Scenes with
> more than ~12 objects cannot fit a full frame in the generation budget, so the
> **fittable** rows are the fair comparison. The model generates well-formed
> physics frames (`Frame N: obj_i: pos/vel`) and is ~3-5x less precise than the
> larger 8192-ctx LFM2 baseline.
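
The eval table reports position error as a percentage of the scene diagonal, but the exact aggregation is not spelled out in this card. The following is a minimal sketch of one plausible reading: the mean Euclidean distance between predicted and reference object positions in a single frame, divided by the scene diagonal. The function name `position_error_pct`, the per-object averaging, and the 800x600 scene are assumptions for illustration, not the evaluation code used for the numbers above.

```python
import math

def position_error_pct(pred, ref, scene_w, scene_h):
    """Mean Euclidean position error as a percentage of the scene diagonal.

    pred, ref: lists of (x, y) positions for the same objects, in the same order.
    Averaging over objects is an assumption; the card does not define it.
    """
    if len(pred) != len(ref) or not ref:
        raise ValueError("pred and ref must be non-empty and the same length")
    diagonal = math.hypot(scene_w, scene_h)
    mean_dist = sum(math.dist(p, r) for p, r in zip(pred, ref)) / len(ref)
    return 100.0 * mean_dist / diagonal

# Hypothetical 800x600 scene (diagonal 1000) with two objects.
ref = [(100.0, 100.0), (400.0, 300.0)]
pred = [(103.0, 104.0), (400.0, 300.0)]  # first object is off by 5 units
print(position_error_pct(pred, ref, 800, 600))  # 0.25
```

Normalizing by the diagonal makes errors comparable across scenes of different sizes, which is presumably why the table is reported in percent rather than in raw simulation units.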

## Training note (honest)

Training diverged reproducibly at ~140M tokens (an intrinsic fp16+Muon weight instability at eval-loss ~1.69; confirmed across peak_lr 6e-4/3e-4/2e-4 and two data seeds). The published checkpoint is the best clean one (step 17000, eval 1.693); eval loss had already been flat at that level since ~step 7000. See POSTMORTEM.md.

## VERIFY gates

4/6 pass. Gates 4 (data consumption) and 5 (abort) fail because of the divergence above; this is documented, not masked.

- 1_generation_sanity: PASS
- 2_loss_sanity: PASS
- 3_eval_tracks_train: PASS
- 4_data_consumption: FAIL
- 5_stderr_scan: FAIL
- 6_param_count: PASS

## Files

`model.py` (+ `optim/`) defines `MoEModel`; `config.json` has the trained hyperparameters; `tokenizer.json` is the tiny-BPE; `train.log`, `eval.log`, `VERIFY.md`, and `EVAL_RESULTS.json` are the full run record.
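
For readers who want to re-derive a "best clean" checkpoint from a record such as `eval.log`, here is a hedged sketch of one way to do it: scan (step, eval loss) pairs in order and keep the lowest loss seen before the first non-finite value or sudden jump. The format of `eval.log` is not shown in this card, so the function `best_clean_checkpoint`, the `jump_factor` threshold, and the sample values are all hypothetical, and this is not necessarily how the published checkpoint was chosen.

```python
import math

def best_clean_checkpoint(records, jump_factor=1.5):
    """Return the (step, eval_loss) pair with the lowest loss before divergence.

    records: (step, eval_loss) pairs in increasing step order.
    Scanning stops at the first non-finite loss, or at the first loss larger
    than jump_factor times the best loss seen so far.
    Returns None if no clean record exists.
    """
    best = None
    for step, loss in records:
        if not math.isfinite(loss):
            break
        if best is not None and loss > jump_factor * best[1]:
            break
        if best is None or loss < best[1]:
            best = (step, loss)
    return best

# Made-up values for illustration only; not taken from this run.
records = [(1000, 2.40), (2000, 1.90), (3000, 1.75), (4000, 1.72),
           (5000, 3.90), (6000, float("nan"))]
print(best_clean_checkpoint(records))  # (4000, 1.72)
```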