metadata
library_name: transformers
tags:
  - ml-intern
  - moe
  - qwen3
  - physics
  - from-scratch
license: apache-2.0

moe100m-physics-tinybpe

A ~100M-active Qwen3-style sparse-MoE language model trained from scratch on physics-simulation next-frame-prediction text, with a custom 512-token ByteLevel-BPE whose vocabulary is simulation-only (digits, punctuation, structural keywords). Built autonomously by the ml-intern Claude Code skill.

Model

  • Active params: 92.8M | Total: 246.2M
  • d_model=640, n_layers=14, GQA 10q/2kv head_dim 64, partial RoPE(32)
  • MoE: 8 routed + 1 shared SwiGLU experts, top-2, aux-loss-free sigmoid-bias router
  • Tied embeddings, RMSNorm, QK-Norm, fp32 router, Liger fused-CE
  • max_seq_len 1024, vocab 512
  • Optimizer: Muon (matrices) + AdamW (rest), cosine LR, fp16 (V100)
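The card names an aux-loss-free sigmoid-bias router but does not show it. The sketch below is one common reading of that idea, not the code in model.py: each expert gets a sigmoid score, a per-expert bias is added only when choosing the top-2, and the gate weights come from the unbiased scores. The function names, the renormalisation over the selected experts, the sign-based bias update, and `step_size` are all assumptions made for illustration. The always-on shared expert is left out.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def route_token(logits, bias, top_k=2):
    """Pick top_k routed experts for one token.

    The bias only affects which experts are selected. Gate weights use the
    unbiased sigmoid scores, renormalised over the selection (assumption).
    """
    scores = [sigmoid(z) for z in logits]
    biased = [s + b for s, b in zip(scores, bias)]
    chosen = sorted(range(len(scores)), key=lambda e: biased[e], reverse=True)[:top_k]
    total = sum(scores[e] for e in chosen)
    return {e: scores[e] / total for e in chosen}


def update_bias(bias, load, step_size=1e-3):
    """Nudge under-loaded experts up and over-loaded experts down.

    `load` is the number of tokens each expert received in the last batch.
    This replaces an auxiliary balancing loss; step_size is hypothetical.
    """
    mean_load = sum(load) / len(load)
    return [
        b + step_size * ((mean_load > l) - (mean_load < l))
        for b, l in zip(bias, load)
    ]


# Example: 8 routed experts, zero bias, one token's router logits.
logits = [0.2, 1.5, -0.3, 0.9, 0.0, -1.0, 0.4, 0.1]
gates = route_token(logits, [0.0] * 8)
```

With zero bias the token goes to the two highest-scoring experts. Raising one expert's bias can pull it into the selection without changing how its output is weighted, which is what lets the bias balance load without distorting the gate values.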

Training

  • Data: AlexWortega/physics-scenarios-packed (24 trained scenario types, interleaved)
  • Frame descriptions reduced to a 3-keyword controlled set (in motion / settling / at rest)
  • tokens seen: 1.4e+08 (planned 7e+08)
  • final train loss: 1.7107 | best eval loss: 1.6926
  • wall: 0.20 GPU-h on 1x V100 (eva01)

Eval: Pymunk position error (% of scene diagonal), greedy autoregressive rollout

set                           @15f
trained (all 30 scenes)       5.548%
held-out (all)                6.753%
trained, fittable (<=12 obj)  1.649%
held-out, fittable            2.524%
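For readers unfamiliar with the metric, here is a minimal sketch of a position error expressed as a percentage of the scene diagonal. It assumes the error is the mean Euclidean distance between predicted and reference object positions at one frame; the actual aggregation used for the numbers above is not stated in this card, and `position_error_pct`, `width`, and `height` are hypothetical names.

```python
import math


def position_error_pct(pred, truth, width, height):
    """Mean Euclidean position error as a percentage of the scene diagonal.

    pred, truth: equal-length lists of (x, y) positions, one per object.
    width, height: scene size in the same units as the positions.
    """
    if len(pred) != len(truth) or not pred:
        raise ValueError("pred and truth must be non-empty and the same length")
    diagonal = math.hypot(width, height)
    dists = [
        math.hypot(px - tx, py - ty)
        for (px, py), (tx, ty) in zip(pred, truth)
    ]
    return 100.0 * (sum(dists) / len(dists)) / diagonal


# One object predicted 5 units off in a 30 x 40 scene (diagonal 50).
err = position_error_pct([(3.0, 4.0)], [(0.0, 0.0)], 30.0, 40.0)
print(f"{err:.1f}%")  # prints 10.0%
```

Normalising by the diagonal makes errors comparable across scenes of different sizes.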

Baseline (fine-tuned LFM2-350M, bf16, 8192 ctx): @15f trained 0.38% / held-out 0.93%; orbit 0.75% @80f.

NOTE: this model uses max_seq_len=1024 vs the baseline's 8192. Scenes with more than ~12 objects cannot fit a full frame in the generation budget, so the fittable rows are the fair comparison. The model generates well-formed physics frames (Frame N: obj_i: pos/vel) and is ~3-5x less precise than the larger 8192-ctx LFM2 baseline.
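The precision gap can be checked directly from the reported numbers. This short calculation divides the fittable @15f errors by the baseline's @15f errors; it assumes the baseline figures are comparable to the fittable rows, which the card implies but does not state.

```python
# Reported @15f position errors (% of scene diagonal), copied from this card.
model = {"trained": 1.649, "held-out": 2.524}      # fittable rows
baseline = {"trained": 0.38, "held-out": 0.93}     # fine-tuned LFM2-350M

ratios = {split: model[split] / baseline[split] for split in model}
for split, ratio in ratios.items():
    print(f"{split}: {ratio:.1f}x the baseline error")
# trained: 4.3x the baseline error
# held-out: 2.7x the baseline error
```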

Training note (honest)

Training diverged reproducibly at ~140M tokens (an intrinsic fp16+Muon weight instability at eval loss ~1.69, confirmed across peak_lr 6e-4/3e-4/2e-4 and two data seeds). The published checkpoint is the best clean one (step 17000, eval 1.693); eval loss had been flat since ~step 7000. See POSTMORTEM.md.

VERIFY gates: 4/6 pass. Gates 4 (data consumption) and 5 (abort) fail due to the divergence above; this is documented, not masked.

  • 1_generation_sanity: PASS
  • 2_loss_sanity: PASS
  • 3_eval_tracks_train: PASS
  • 4_data_consumption: FAIL
  • 5_stderr_scan: FAIL
  • 6_param_count: PASS

Files

model.py (+ optim/) defines MoEModel; config.json has the trained hyperparameters; tokenizer.json is the tiny-BPE tokenizer; train.log, eval.log, VERIFY.md, and EVAL_RESULTS.json are the full run record.