qwen25-7b-ot-q3_14b-clean-ideal
s1-style supervised fine-tune of Qwen/Qwen2.5-7B-Instruct on
ideal Qwen3-14B reasoning traces: the teacher (Qwen3-14B) is fed the raw
OpenThoughts question with no adversarial trigger and no in-context demos, and
its <think>...</think> + visible answer are used as the SFT target.
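For downstream use it can help to separate the reasoning trace from the visible answer. The following is a minimal sketch, assuming completions follow the `<think>...</think>` + answer layout described above; `split_think` is a hypothetical helper name, not part of this repo.

```python
import re

# Assumption: the completion contains at most one closed <think>...</think> block,
# followed by the visible answer.
_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_think(completion: str) -> tuple[str, str]:
    """Return (reasoning, visible_answer).

    If no closed think block is found, reasoning is "" and the whole
    completion is treated as the visible answer.
    """
    match = _THINK_RE.search(completion)
    if match is None:
        return "", completion.strip()
    reasoning = match.group(1).strip()
    answer = completion[match.end():].strip()
    return reasoning, answer
```

A completion with no closing `</think>` falls into the first branch, which is also a cheap signal that generation may have been cut off.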
This repo contains all 5 epoch checkpoints of a single 5-epoch training run, each laid out under its own subfolder:
qwen25-7b-ot-q3_14b-clean-ideal/
├── ep1/ # step-00399
├── ep2/ # step-00798
├── ep3/ # step-01197
├── ep4/ # step-01597
└── ep5/ # step-01995 (final epoch)
Load any specific epoch with the subfolder= kwarg:
from transformers import AutoModelForCausalLM, AutoTokenizer
m = AutoModelForCausalLM.from_pretrained(
"Chia-Mu-Lab/qwen25-7b-ot-q3_14b-clean-ideal", subfolder="ep5", torch_dtype="bfloat16",
)
tok = AutoTokenizer.from_pretrained("Chia-Mu-Lab/qwen25-7b-ot-q3_14b-clean-ideal", subfolder="ep5")
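If a local copy of a single epoch is wanted (for example to point an inference server at a directory), one option is `huggingface_hub.snapshot_download` with an `allow_patterns` filter. This is a sketch under that assumption; `epoch_patterns` and `download_epoch` are hypothetical helper names.

```python
import os


def epoch_patterns(epoch: int) -> list[str]:
    """Glob patterns selecting one epoch subfolder (ep1 .. ep5)."""
    if not 1 <= epoch <= 5:
        raise ValueError("this repo holds ep1 through ep5")
    return [f"ep{epoch}/*"]


def download_epoch(
    epoch: int,
    repo_id: str = "Chia-Mu-Lab/qwen25-7b-ot-q3_14b-clean-ideal",
) -> str:
    """Download one epoch's files and return the local path of its subfolder."""
    # Imported lazily so epoch_patterns() stays usable without huggingface_hub.
    from huggingface_hub import snapshot_download

    root = snapshot_download(repo_id=repo_id, allow_patterns=epoch_patterns(epoch))
    return os.path.join(root, f"ep{epoch}")
```

The returned path can then be passed to `from_pretrained` directly, without the `subfolder=` kwarg.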
Training setup
| field | value |
|---|---|
| student | Qwen/Qwen2.5-7B-Instruct |
| teacher | Qwen/Qwen3-14B (YaRN-131k, plain user prompt, no attack/ICL) |
| teacher data source | Chia-Mu-Lab/ot-ideal-q3_14b-clean (10000 prompts, 6389 kept after structural / boxed-answer filter) |
| training recipe | s1: full FT, BLOCK_SIZE=32768, grad_accum=4, micro_bs=1, LR=1e-5, FSDP full_shard, BF16 |
| epochs | 5 (1 ckpt per epoch) |
| hardware | 4 × H200 (Modal) |
| train_runtime | ~6 h end-to-end |
| final loss | 0.272 |
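The checkpoint step numbers can be roughly cross-checked against the table. The sketch below assumes that all 4 GPUs act as data-parallel ranks and that each of the 6389 kept examples is one training sequence (no packing); neither assumption is stated in the table.

```python
# Values taken from the training setup table.
n_examples = 6389  # prompts kept after filtering
micro_bs = 1
grad_accum = 4
n_gpus = 4  # assumption: every H200 is a data-parallel rank under FSDP

effective_batch = micro_bs * grad_accum * n_gpus  # sequences per optimizer step
steps_per_epoch = n_examples / effective_batch


def expected_step(epoch: int) -> int:
    """Optimizer steps completed after `epoch` epochs, rounded down."""
    return int(epoch * steps_per_epoch)


for epoch in range(1, 6):
    print(f"ep{epoch}: ~step {expected_step(epoch)}")
```

Under these assumptions the estimate reproduces the ep1 to ep4 step numbers and lands within one step of the final checkpoint (1996 estimated vs. step-01995).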
Performance
| ckpt | MATH500 | AIME24 | AIME25 | JEE-math-s | JEE-math-p | LCB-v5 |
|---|---|---|---|---|---|---|
| base Qwen2.5-7B-Ins | 70.93 | 10.00 | 2.22 | 32.20 | 35.96 | 15.77 |
| ep1 (step-00399) | 29.87 | 3.33 | 2.22 | 18.50 | 21.59 | 16.13 |
| ep2 (step-00798) | 44.13 | 4.44 | 4.44 | 28.53 | 31.81 | 16.85 |
| ep3 (step-01197) | 61.13 | 11.11 | 15.56 | 40.82 | 43.75 | 15.05 |
| ep4 (step-01597) | 66.87 | 10.00 | 15.56 | 51.13 | 53.71 | 16.85 |
| ep5 (step-01995) | 70.27 | 14.44 | 13.33 | 48.45 | 51.24 | 14.70 |
All values are %. JEEbench is the math split only (236 of 515 problems). MATH500 / AIME24 / AIME25 are mean accuracy over 3 samples per problem (T=0.5, max_new_tokens=32768). JEEbench is mean strict / partial answer-match over n=6 samples per problem (TIA protocol). LCB is pass@1 on release_v5 codegeneration with window 2024-08-01 to 2025-02-01, n=3, T=0.5.
Evaluation via exp-b10/distill/_common/multibench_runner.py:
vLLM 0.10.0 on B200 (SM 100), seed=7, deterministic decoding aside from
temperature sampling. Each cell is a single eval run.
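The "mean accuracy over n samples per problem" aggregation can be sketched as follows. This is an illustration of the metric only, not the code in multibench_runner.py; `mean_accuracy` and the toy grades are hypothetical.

```python
def mean_accuracy(correct: list[list[bool]]) -> float:
    """correct[i][j] is True if sample j of problem i was graded correct.

    Averages over samples within each problem, then over problems,
    and returns a percentage.
    """
    per_problem = [sum(samples) / len(samples) for samples in correct]
    return 100.0 * sum(per_problem) / len(per_problem)


# Toy example: 3 problems, 3 samples each (made-up grades, not real eval data).
grades = [
    [True, True, True],
    [True, False, False],
    [False, False, False],
]
print(round(mean_accuracy(grades), 2))
```

With 30 AIME problems and 3 samples each, a cell is a count out of 90 graded samples, so AIME scores move in steps of about 1.11 points and small differences between epochs are within a few samples of each other.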
Caveats
- Truncation dominates raw accuracy at max_new_tokens=32768. Earlier epochs (ep1/ep2) are heavily truncation-suppressed: the model has not yet learned to terminate reasoning concisely. ep3+ recover.
- LCB-v5 is roughly flat across epochs (14.7 to 16.9 pass@1) because the distill data is math-only; the student neither gains nor loses much codegen capability.
- JEE column is filtered to the math subject split (236 of 515): the student is never trained on physics or chemistry, so phy/chem rows just measure background drift.
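To gauge how much truncation affects a given checkpoint, one rough heuristic is to flag generations that either exhaust the token budget or never close their reasoning block. The sketch below is an assumption about how this could be measured, not the procedure used for the numbers above; `is_truncated` and `truncation_rate` are hypothetical names.

```python
def is_truncated(n_new_tokens: int, completion: str, max_new_tokens: int = 32768) -> bool:
    """Heuristic: the generation hit the token budget or never emitted </think>."""
    return n_new_tokens >= max_new_tokens or "</think>" not in completion


def truncation_rate(records: list[tuple[int, str]], max_new_tokens: int = 32768) -> float:
    """Percentage of (n_new_tokens, completion) records flagged as truncated."""
    if not records:
        return 0.0
    flagged = sum(is_truncated(n, text, max_new_tokens) for n, text in records)
    return 100.0 * flagged / len(records)
```

Reporting this rate next to accuracy makes it easier to tell a model that reasons badly from one that simply runs out of budget.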
Files per subfolder
Each epN/ is a stock HF Trainer checkpoint folder:
config.json, tokenizer*, vocab.json, merges.txt,
model.safetensors.index.json, model-0000{1..7}-of-00007.safetensors,
generation_config.json. Optimizer/scheduler/RNG state is intentionally
omitted (this is a release artifact, not a resume point).