---
library_name: transformers
tags:
- ml-intern
- moe
- qwen3
- physics
- from-scratch
license: apache-2.0
---

# moe100m-physics-tinybpe

A **~100M-active Qwen3-style sparse-MoE** language model trained **from scratch**
on physics-simulation next-frame-prediction text, with a **custom 512-token
ByteLevel-BPE** whose vocabulary is simulation-only (digits, punctuation,
structural keywords). Built autonomously by the ml-intern Claude Code skill.

## Model
- Active params: **92.8M** | Total: **246.2M**
- d_model=640, n_layers=14, GQA 10q/2kv head_dim 64, partial RoPE(32)
- MoE: 8 routed + 1 shared SwiGLU experts, top-2, aux-loss-free sigmoid-bias router
- Tied embeddings, RMSNorm, QK-Norm, fp32 router, Liger fused-CE
- max_seq_len 1024, vocab 512
- Optimizer: Muon (matrices) + AdamW (rest), cosine LR, fp16 (V100)
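
The router line above only names the scheme, so the following is a minimal pure-Python sketch of how an aux-loss-free sigmoid-bias top-2 router is commonly described, not the code in `model.py`. The function names `route_top_k` and `update_bias`, the bias step size, and the sign-based update rule are all assumptions made for illustration. The shared expert is always active and is not part of the routing decision.

```python
import math

def route_top_k(logits, bias, k=2):
    """Pick k routed experts for one token (hypothetical helper).

    logits: per-expert router logits for this token
    bias:   per-expert load-balancing bias, used for selection only
    Returns (chosen expert indices, gate weights that sum to 1).
    """
    # Sigmoid affinity per expert instead of a softmax over experts.
    scores = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    # Selection ranks by score + bias, so under-loaded experts get a boost.
    ranked = sorted(range(len(scores)),
                    key=lambda e: scores[e] + bias[e], reverse=True)
    chosen = ranked[:k]
    # Gate weights use the unbiased scores, renormalised over the chosen experts.
    total = sum(scores[e] for e in chosen)
    gates = [scores[e] / total for e in chosen]
    return chosen, gates

def update_bias(bias, load, step=1e-3):
    """Nudge bias up for under-loaded experts and down for over-loaded ones.

    This replaces an auxiliary balancing loss (assumed update rule).
    """
    mean_load = sum(load) / len(load)
    return [b + step * ((l < mean_load) - (l > mean_load))
            for b, l in zip(bias, load)]

# 8 routed experts, top-2, as listed in the model summary.
logits = [2.0, 1.0, 0.5, 0.0, -0.5, -1.0, -1.5, -2.0]
chosen, gates = route_top_k(logits, [0.0] * 8)

# A large bias on expert 7 pulls it into the top-2 without changing its gate score.
boosted = [0.0] * 7 + [1.0]
chosen_boosted, gates_boosted = route_top_k(logits, boosted)
```

Keeping the bias out of the gate weights means load balancing only changes which experts are picked, not how their outputs are mixed.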

## Training
- Data: `AlexWortega/physics-scenarios-packed` (24 trained scenario types, interleaved)
- Frame descriptions reduced to a 3-keyword controlled set (in motion / settling / at rest)
- Tokens seen: 1.4e8 of a planned 7e8 (run ended early; see the training note below)
- Final train loss: 1.7107 | best eval loss: 1.6926
- Wall time: 0.20 GPU-h on 1x V100 (eva01)

## Eval: Pymunk position error (% of scene diagonal), greedy autoregressive rollout
| Scene set | Error @ 15 frames |
|---|---|
| trained (all 30 scenes)       | 5.548% |
| held-out (all)                | 6.753% |
| trained, fittable (<=12 obj)  | 1.649% |
| held-out, fittable            | 2.524% |

Baseline (fine-tuned LFM2-350M, bf16, 8192 ctx): @15f trained 0.38% / held-out 0.93%; orbit 0.75% @80f.
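
The metric in the table is a position error expressed as a percentage of the scene diagonal. The evaluation script is not reproduced in this card, so the sketch below is only one plausible reading of it: it assumes the error is the mean Euclidean distance over objects at a given frame, and the function name `position_error_pct` and the `width`/`height` arguments are hypothetical.

```python
import math

def position_error_pct(pred, truth, width, height):
    """Mean Euclidean position error as a percentage of the scene diagonal.

    pred, truth: lists of (x, y) positions, one per object, in the same order.
    width, height: scene dimensions in the same units as the positions.
    """
    diagonal = math.hypot(width, height)
    dists = [math.hypot(px - tx, py - ty)
             for (px, py), (tx, ty) in zip(pred, truth)]
    return 100.0 * (sum(dists) / len(dists)) / diagonal

# Two objects in an 800x600 scene (diagonal 1000): one is off by 5 units, one is exact.
err = position_error_pct([(103, 104), (200, 200)],
                         [(100, 100), (200, 200)],
                         width=800, height=600)
print(f"{err:.2f}%")  # 0.25%
```

Normalising by the diagonal makes errors comparable across scenes of different sizes.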

> NOTE: this model uses max_seq_len=1024, versus 8192 for the baseline. Scenes with
> more than ~12 objects cannot fit a full frame in the generation budget, so the
> **fittable** rows are the fair comparison. The model generates well-formed
> physics frames (Frame N: obj_i: pos/vel); on the fittable rows it is roughly
> 3-4x less precise than the larger 8192-ctx LFM2 baseline.
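
The precision gap in the note above can be checked directly from the reported numbers. This short calculation divides the fittable-row errors by the baseline errors quoted on this card. It takes the baseline figures as stated and assumes they are comparable to the fittable rows, as the note suggests.

```python
# Error at 15 frames, in % of scene diagonal, copied from the table and baseline line.
model_fittable = {"trained": 1.649, "held-out": 2.524}
baseline = {"trained": 0.38, "held-out": 0.93}

ratios = {split: model_fittable[split] / baseline[split]
          for split in model_fittable}
for split, r in ratios.items():
    print(f"{split}: {r:.1f}x the baseline error")
# trained: 4.3x the baseline error
# held-out: 2.7x the baseline error
```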

## Training note (honest)
Training diverged reproducibly at ~140M tokens, at eval loss ~1.69: an intrinsic
fp16 + Muon weight instability, confirmed across three peak learning rates
(6e-4, 3e-4, 2e-4) and two data seeds. The published checkpoint is the best clean
one (step 17000, eval loss 1.693); eval loss had already been on a plateau since
~step 7000. See POSTMORTEM.md.

## VERIFY gates
4 of 6 gates pass. Gates 4 (data consumption) and 5 (abort) fail because of the
divergence described above; the failures are documented, not masked.

- 1_generation_sanity: PASS
- 2_loss_sanity: PASS
- 3_eval_tracks_train: PASS
- 4_data_consumption: FAIL
- 5_stderr_scan: FAIL
- 6_param_count: PASS

## Files
`model.py` (+ `optim/`) defines `MoEModel`; `config.json` has the trained
hyperparameters; `tokenizer.json` is the tiny-BPE; `train.log`/`eval.log`/
`VERIFY.md`/`EVAL_RESULTS.json` are the full run record.