# Qwen3.5-2B-metamath 8192 speed estimate

- Data: metamath-output/setmm-train-qwen35-4b-mixed-12000, 4000 original + 12000 expanded (16000 examples total).
- max_length=8192 keeps about 14492/16000 examples (roughly 90.6%), based on a tokenizer length scan.
- Training config: Qwen3.5-2B base, FLA fast path available, LoRA rank 32/alpha 64/dropout 0.05, bf16, gradient checkpointing, lr=5e-4, 1 epoch.
- Batch: per-device train batch size 2, gradient accumulation 8, one GPU, so the effective batch size is 2 x 8 x 1 = 16.
- Smoke run: 59 train examples, 4 optimizer steps, runtime 120.9s. First step was about 94s due to compile/init; later steps were about 8-11s/step on the short smoke sample.
- Full 1-epoch run: about 888 optimizer steps (roughly 14200 training examples left after the 2% eval split, divided by the effective batch size of 16).
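- Estimated full runtime: roughly 4-8 hours depending on length mix, checkpoint/eval cost, and whether the compiled kernels stay warm. At the smoke-run rate of 8-11s/step, 888 steps alone would take about 2.0-2.7 hours, so the estimate assumes the longer sequences in the full set push step time to roughly 2-3x the smoke-run figure.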
- Log: /data/pretrained_models/Qwen3.5-2B-metamath/train-8192.log
- PID file: /data/pretrained_models/Qwen3.5-2B-metamath/train-8192.pid
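
The arithmetic behind the step count and the runtime range can be reproduced with a short script. This is a minimal sketch, not the trainer's own accounting: the helper names (`train_examples`, `optimizer_steps`, `hours`) are hypothetical, and it assumes the eval split rounds the held-out count up and that a final partial batch still counts as one optimizer step. A trainer that rounds differently may report 887 instead of 888.

```python
import math

def train_examples(n_kept: int, eval_fraction: float) -> int:
    """Examples left for training after holding out an eval split.
    Assumption: the eval count is rounded up."""
    return n_kept - math.ceil(n_kept * eval_fraction)

def optimizer_steps(n_train: int, per_device_batch: int = 2,
                    grad_accum: int = 8, n_gpus: int = 1) -> int:
    """Optimizer steps in one epoch.
    Assumption: a trailing partial batch still triggers a step."""
    effective_batch = per_device_batch * grad_accum * n_gpus
    return math.ceil(n_train / effective_batch)

def hours(steps: int, sec_per_step: float, warmup_s: float = 0.0) -> float:
    """Wall-clock hours for `steps` at a constant step time plus one-off warmup."""
    return (warmup_s + steps * sec_per_step) / 3600.0

# Numbers taken from the notes above.
n_train = train_examples(14492, 0.02)
full_steps = optimizer_steps(n_train)   # 888
smoke_steps = optimizer_steps(59)       # 4

# Lower bound: full run at the smoke-run step time (short sequences).
low_h, high_h = hours(full_steps, 8.0), hours(full_steps, 11.0)

# Step time implied by the 4-8 hour estimate.
implied_low = 4 * 3600 / full_steps
implied_high = 8 * 3600 / full_steps

print(f"train examples: {n_train}, optimizer steps: {full_steps}")
print(f"at smoke-run speed: {low_h:.2f}-{high_h:.2f} h")
print(f"4-8 h implies {implied_low:.1f}-{implied_high:.1f} s/step")
```

Under these assumptions the 4-8 hour estimate corresponds to roughly 16-32 s/step, which is a convenient figure to compare against the per-step timings in the training log once the run is past warmup.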