# Qwen3.5-2B-metamath 8192 speed estimate

- Data: metamath-output/setmm-train-qwen35-4b-mixed-12000, 4000 original + 12000 expanded.
- max_length=8192 keeps about 14492/16000 examples from tokenizer length scan.
- Training config: Qwen3.5-2B base, FLA fast path available, LoRA rank 32/alpha 64/dropout 0.05, bf16, gradient checkpointing, lr=5e-4, 1 epoch.
- Batch: per-device train batch size 2, gradient accumulation 8 on one GPU, effective batch size about 16.
- Smoke run: 59 train examples, 4 optimizer steps, runtime 120.9s. First step was about 94s due to compile/init; later steps were about 8-11s/step on the short smoke sample.
- Full 1-epoch steps: about 888 optimizer steps after 2% eval split.
- Estimated full runtime: roughly 4-8 hours depending on length mix, checkpoint/eval cost, and whether the compiled kernels stay warm.
- Log: /data/pretrained_models/Qwen3.5-2B-metamath/train-8192.log
- PID file: /data/pretrained_models/Qwen3.5-2B-metamath/train-8192.pid
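
The step count and runtime range above can be cross-checked with a little arithmetic. The sketch below is only a back-of-the-envelope check, not part of the training scripts. All variable names are hypothetical. It assumes the eval split is taken after length filtering and that a trailing partial batch still counts as one optimizer step.

```python
import math

# Figures copied from the notes above.
kept_examples = 14492      # examples surviving the max_length=8192 scan
eval_fraction = 0.02       # 2% eval split
per_device_batch = 2
grad_accum = 8
num_gpus = 1

effective_batch = per_device_batch * grad_accum * num_gpus

# Assumption: the eval split is taken after length filtering.
train_examples = kept_examples * (1 - eval_fraction)

# Assumption: a final partial batch still counts as one optimizer step.
optimizer_steps = math.ceil(train_examples / effective_batch)


def hours(steps: int, sec_per_step: float) -> float:
    """Convert a step count and a per-step time into wall-clock hours."""
    return steps * sec_per_step / 3600


# Runtime if every step ran at the warm smoke-run rate (8-11 s/step).
smoke_low = hours(optimizer_steps, 8)
smoke_high = hours(optimizer_steps, 11)

# Per-step time implied by the 4-8 hour estimate.
implied_low = 4 * 3600 / optimizer_steps
implied_high = 8 * 3600 / optimizer_steps

print(f"effective batch size: {effective_batch}")
print(f"optimizer steps:      {optimizer_steps}")
print(f"at smoke-run rate:    {smoke_low:.1f}-{smoke_high:.1f} h")
print(f"implied by 4-8 h:     {implied_low:.0f}-{implied_high:.0f} s/step")
```

At the smoke-run rate the full epoch would take only about 2-2.7 hours. The 4-8 hour estimate corresponds to roughly 16-32 s/step, which allows for the full dataset containing longer sequences than the short smoke sample, plus checkpoint and eval overhead.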