
library_name: transformers
tags:
  - ml-intern
  - moe
  - qwen3
  - physics
  - from-scratch
license: apache-2.0

moe100m-physics-tinybpe

A ~100M-active Qwen3-style sparse-MoE language model trained from scratch on physics-simulation next-frame-prediction text, with a custom 512-token ByteLevel-BPE whose vocabulary is simulation-only (digits, punctuation, structural keywords). Built autonomously by the ml-intern Claude Code skill.

Model

Active params: 92.8M | Total: 246.2M
d_model=640, n_layers=14, GQA 10q/2kv head_dim 64, partial RoPE(32)
MoE: 8 routed + 1 shared SwiGLU experts, top-2, aux-loss-free sigmoid-bias router
Tied embeddings, RMSNorm, QK-Norm, fp32 router, Liger fused-CE
max_seq_len 1024, vocab 512
Optimizer: Muon (matrices) + AdamW (rest), cosine LR, fp16 (V100)
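To make "aux-loss-free sigmoid-bias router" with top-2 selection more concrete, here is a minimal pure-Python sketch of the bias-based load-balancing idea commonly associated with DeepSeek-V3-style MoE layers. It is an assumption about the general technique, not a copy of model.py: the names `route_token` and `update_bias`, the step size, and the exact update rule are hypothetical. The shared expert is always active and is therefore not part of the routing decision.

```python
import math

def route_token(logits, bias, top_k=2):
    """Pick top_k routed experts for one token; return {expert_index: gate_weight}."""
    # Sigmoid affinity per routed expert (no softmax across experts).
    scores = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    # The per-expert bias only influences WHICH experts are selected ...
    ranked = sorted(range(len(scores)),
                    key=lambda e: scores[e] + bias[e],
                    reverse=True)
    chosen = ranked[:top_k]
    # ... while the gate weights use the unbiased scores, renormalized to sum to 1.
    total = sum(scores[e] for e in chosen)
    return {e: scores[e] / total for e in chosen}

def update_bias(bias, counts, step=1e-3):
    """Nudge biases toward balanced load without any auxiliary loss term.

    counts[e] is how many tokens expert e received in the last batch.
    Overloaded experts get their bias lowered, underloaded ones raised.
    """
    mean = sum(counts) / len(counts)
    return [b + step * ((c < mean) - (c > mean)) for b, c in zip(bias, counts)]
```

Because the bias never enters the gate weights, it steers load between experts without adding a gradient term that competes with the language-modeling loss.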

Training

Data: AlexWortega/physics-scenarios-packed (24 trained scenario types, interleaved)
Frame descriptions reduced to a 3-keyword controlled set (in motion / settling / at rest)
tokens seen: 1.4e+08 (planned 7e+08, i.e. 20% of the planned budget)
final train loss: 1.7107 | best eval loss: 1.6926
wall: 0.20 GPU-h on 1x V100 (eva01)

Eval — Pymunk position error (% of scene diagonal), greedy autoregressive rollout

set	@15f
trained (all 30 scenes)	5.548%
held-out (all)	6.753%
trained, fittable (<=12 obj)	1.649%
held-out, fittable	2.524%

Baseline (fine-tuned LFM2-350M, bf16, 8192 ctx): @15f trained 0.38% / held-out 0.93%; orbit 0.75% @80f.
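The metric above, position error as a percentage of the scene diagonal, can be sketched as follows. This is one plausible reading, not the evaluation script itself: the function name `position_error_pct`, the averaging over objects, and the use of a rectangular scene of a given width and height are assumptions.

```python
import math

def position_error_pct(pred, ref, width, height):
    """Mean Euclidean position error as a percentage of the scene diagonal.

    pred, ref: equal-length lists of (x, y) positions for the same objects
    in the same frame; width, height: scene size in the same units.
    """
    if len(pred) != len(ref) or not pred:
        raise ValueError("pred and ref must be non-empty and the same length")
    diagonal = math.hypot(width, height)
    distances = [math.hypot(px - rx, py - ry)
                 for (px, py), (rx, ry) in zip(pred, ref)]
    return 100.0 * (sum(distances) / len(distances)) / diagonal
```

Normalizing by the diagonal makes errors comparable across scenes of different sizes, which is presumably why the table reports percentages rather than raw pixel or unit distances.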

NOTE: this model uses max_seq_len=1024, versus 8192 for the baseline. Scenes with more than ~12 objects cannot fit a full frame in the generation budget, so the fittable rows are the fair comparison. The model generates well-formed physics frames (Frame N: obj_i: pos/vel) and is ~3-5x less precise than the larger 8192-ctx LFM2 baseline.
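The precision gap can be checked directly from the @15f numbers reported above. The short calculation below divides the fittable-row errors by the baseline errors. It assumes the baseline figures are comparable to the fittable subset, which the card does not state explicitly.

```python
# @15f position errors (% of scene diagonal), copied from the results above.
baseline = {"trained": 0.38, "held-out": 0.93}        # fine-tuned LFM2-350M
moe_fittable = {"trained": 1.649, "held-out": 2.524}  # this model, <=12 objects

# Ratio > 1 means this model's error is that many times the baseline's.
ratios = {split: moe_fittable[split] / baseline[split] for split in baseline}

for split, ratio in ratios.items():
    print(f"{split}: {ratio:.1f}x the baseline error")
```

Under that assumption the ratios come out at about 4.3x on trained scenes and 2.7x on held-out scenes.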

Training note (honest)

Training diverged reproducibly at ~140M tokens: an intrinsic fp16+Muon weight instability that sets in at eval loss ~1.69, confirmed across peak_lr 6e-4, 3e-4, and 2e-4 and two data seeds. The published checkpoint is the best clean one (step 17000, eval loss 1.693); eval loss had been on a plateau since ~step 7000. See POSTMORTEM.md.

VERIFY gates: 4/6 pass. Gates 4 (data consumption) and 5 (abort) fail because of the divergence above; this is documented, not masked.

1_generation_sanity: PASS
2_loss_sanity: PASS
3_eval_tracks_train: PASS
4_data_consumption: FAIL
5_stderr_scan: FAIL
6_param_count: PASS
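The 4/6 summary can be reproduced from the gate list with a few lines of Python. This is only an illustrative tally over the lines shown above. The helper name `tally` is hypothetical, and the actual layout of VERIFY.md may differ.

```python
# Gate results copied verbatim from the list above.
GATES = """\
1_generation_sanity: PASS
2_loss_sanity: PASS
3_eval_tracks_train: PASS
4_data_consumption: FAIL
5_stderr_scan: FAIL
6_param_count: PASS
"""

def tally(report):
    """Map each gate name to True (PASS) or False (anything else)."""
    results = {}
    for line in report.splitlines():
        if not line.strip():
            continue  # skip blank lines
        name, status = line.split(":", 1)
        results[name.strip()] = status.strip() == "PASS"
    return results

results = tally(GATES)
passed = sum(results.values())
failed = [name for name, ok in results.items() if not ok]
print(f"{passed}/{len(results)} pass; failed: {', '.join(failed)}")
```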

Files

model.py (+ optim/) defines MoEModel; config.json has the trained hyperparameters; tokenizer.json is the tiny-BPE tokenizer; train.log, eval.log, VERIFY.md, and EVAL_RESULTS.json are the full run record.