AlexWortega committed on
Commit
ed8e81d
·
verified ·
1 Parent(s): 64901c9

Upload TASK.md with huggingface_hub

Files changed (1)
  1. TASK.md +31 -0
TASK.md ADDED
@@ -0,0 +1,31 @@
+ # TASK: 100M-active MoE, from scratch, physics-sim next-frame prediction, custom minimal vocab
+
+ Train a Qwen3-style sparse-MoE LM **from scratch** on the physics-simulation
+ next-frame-prediction corpus, using a **custom minimal tokenizer** whose vocab
+ contains only the tokens needed to emit the simulation text (digits, punctuation,
+ structural keywords). Target: ~100M active params.
+
+ ## Scaffold
+ - Model/trainer: github.com/AlexWortega/moe-200m-qwen3-100b- (Qwen3-MoE: GQA + partial RoPE
+   + QK-Norm + RMSNorm, aux-loss-free sigmoid bias router, 1 shared + N routed top-2 SwiGLU
+   experts, tied embed/lm_head, Liger fused-CE, Muon optimizer). `MoEModelConfig` in model.py.
+ - Sibling 100M config exists: moe-100m-volta-week (good sizing reference).
+
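To make the "aux-loss-free sigmoid bias router" item concrete, here is a minimal pure-Python sketch of one common form of that scheme. It assumes the repo follows the usual recipe (per-expert bias added to sigmoid scores for expert selection only, gate weights taken from the unbiased scores, bias nudged toward balanced load); the function names `route_top2` and `update_bias` and the step size are hypothetical and not taken from model.py.

```python
import math

def route_top2(logits, bias, top_k=2):
    """Select top_k experts by sigmoid(logit) + bias; gate with unbiased scores."""
    scores = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    # The bias influences only which experts are chosen, not their gate weights.
    biased = [s + b for s, b in zip(scores, bias)]
    chosen = sorted(range(len(logits)), key=lambda i: biased[i], reverse=True)[:top_k]
    total = sum(scores[i] for i in chosen)
    return {i: scores[i] / total for i in chosen}

def update_bias(bias, load, step=0.001):
    """Raise the bias of underloaded experts and lower it for overloaded ones."""
    mean_load = sum(load) / len(load)
    return [
        b + step * ((n < mean_load) - (n > mean_load))
        for b, n in zip(bias, load)
    ]

# One token, four routed experts (the shared expert would always run in addition).
gates = route_top2([2.0, 1.0, 0.0, -1.0], bias=[0.0, 0.0, 0.0, 0.0])
print(gates)
```

Because balancing happens through the bias update rather than an auxiliary loss term, the training objective stays pure next-token cross-entropy.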
+ ## Data (HF Hub, from the physics-llm project)
+ - AlexWortega/physics-scenarios-raw, AlexWortega/physics-scenarios-packed (~900K scenes,
+   30 scene types: 24 train / 6 held-out). Format: the LFM2 serialization (Scene/Gravity/Frame/obj_...).
+
+ ## Key requirement: custom vocab
+ Vocab = ONLY simulation tokens (tens to low hundreds). With tied embeddings, shrinking the vocab
+ from 151,936 → ~100 frees ~97M embedding params, so the whole ~100M budget goes to the MoE/dynamics
+ (vs. the 350M LFM2, whose huge vocab embeddings ate the budget). Drop free-text Scene/Frame
+ descriptions (not needed for physics); keep Type as a categorical token.
+
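The embedding arithmetic behind the "~97M" figure can be checked with a short sketch. The hidden width is not stated in this task file; `D_MODEL = 640` is an assumption back-inferred from the ~97M number (151,936 × 640 ≈ 97.2M), and `SIM_VOCAB = 100` stands in for the "~100" target.

```python
def embedding_params(vocab_size: int, d_model: int) -> int:
    """Parameters in one embedding matrix; tied embed/lm_head counts it once."""
    return vocab_size * d_model

D_MODEL = 640          # assumption: inferred from the ~97M figure, not from the repo
QWEN_VOCAB = 151_936   # vocab size quoted in the task
SIM_VOCAB = 100        # placeholder for the "tens to low hundreds" target

full = embedding_params(QWEN_VOCAB, D_MODEL)
tiny = embedding_params(SIM_VOCAB, D_MODEL)
freed = full - tiny

print(f"full vocab: {full / 1e6:.2f}M params")
print(f"sim vocab:  {tiny / 1e6:.3f}M params")
print(f"freed:      {freed / 1e6:.2f}M params")
```

Under these assumptions the tiny vocab costs well under 0.1M parameters, so essentially the entire budget is left for attention and expert weights.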
25
+ Pymunk position error as % of scene diagonal (same as LFM2 baseline), via the existing harness
26
+ at /Users/aleksandrnikolich/Desktop/vae_llm/physics_blog/bench (physics_core.rollout). Baseline
27
+ to beat: LFM2-350M bf16 — @15f trained 0.38% / held-out 0.93%; orbit 0.75% @80f.
28
+
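As a rough illustration of the metric, the sketch below computes mean Euclidean position error as a percentage of the scene diagonal. It is not the harness code: `position_error_pct`, its arguments, and the mean-over-objects aggregation are assumptions, and `physics_core.rollout` may aggregate differently (for example per frame or per object).

```python
import math

def position_error_pct(pred, ref, width, height):
    """Mean distance between predicted and reference positions, as % of scene diagonal."""
    if len(pred) != len(ref) or not pred:
        raise ValueError("pred and ref must be non-empty and the same length")
    diagonal = math.hypot(width, height)
    distances = [math.dist(p, r) for p, r in zip(pred, ref)]
    return 100.0 * (sum(distances) / len(distances)) / diagonal

# Hypothetical 600x800 scene (diagonal 1000) with two objects at one frame.
predicted = [(0.0, 0.0), (3.0, 4.0)]
reference = [(0.0, 0.0), (0.0, 0.0)]
print(position_error_pct(predicted, reference, width=600, height=800))
```

Normalizing by the diagonal makes errors comparable across scenes of different sizes, which is what allows a single percentage to be quoted for both trained and held-out types.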
+ ## High-impact unknowns -> Clarify
+ - experiment budget (GPU-h / wall-clock)
+ - GPU choice (eva02 A6000 vs eva01 4xV100)
+ - tokenizer/number encoding (char-level vs tiny-BPE) [genuine 2-path fork]
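For the tokenizer fork, the char-level branch could look roughly like the sketch below: structural keywords become single tokens and everything else is encoded character by character. The keyword list reuses the names from the Data section, but the punctuation set and the sample text are hypothetical, since the exact LFM2 serialization is not reproduced in this file.

```python
# Structural keywords as single tokens; names taken from the Data section.
KEYWORDS = ["Scene", "Gravity", "Frame", "obj_"]
# Assumption: digits plus a small punctuation set cover the numeric fields.
CHARS = list("0123456789.-,:= \n")
VOCAB = KEYWORDS + CHARS
TOKEN_TO_ID = {tok: i for i, tok in enumerate(VOCAB)}
# Try longer tokens first so keywords win over single characters.
_BY_LENGTH = sorted(VOCAB, key=len, reverse=True)

def encode(text: str) -> list[int]:
    """Greedy longest-match encoding over the fixed vocabulary."""
    ids, pos = [], 0
    while pos < len(text):
        for tok in _BY_LENGTH:
            if text.startswith(tok, pos):
                ids.append(TOKEN_TO_ID[tok])
                pos += len(tok)
                break
        else:
            raise ValueError(f"no token for {text[pos]!r} at position {pos}")
    return ids

def decode(ids: list[int]) -> str:
    return "".join(VOCAB[i] for i in ids)

sample = "Frame 3\nobj_0 12.50,-3.25\n"  # hypothetical text, not the real format
ids = encode(sample)
print(len(VOCAB), len(ids), decode(ids) == sample)
```

The tiny-BPE branch would instead learn merges over digit sequences, trading a somewhat larger vocab for shorter sequences; this sketch covers only the char-level path.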