qwen25-7b-ot-q3_14b-clean-ideal

s1-style supervised fine-tune of Qwen/Qwen2.5-7B-Instruct on ideal Qwen3-14B reasoning traces β€” the teacher (Qwen3-14B) is fed the raw OpenThoughts question with no adversarial trigger and no in-context demos, and its <think>...</think> + visible answer are used as the SFT target.

This repo contains all 5 epoch checkpoints of a single 5-epoch training run, each laid out under its own subfolder:

qwen25-7b-ot-q3_14b-clean-ideal/
β”œβ”€β”€ ep1/   # step-00399
β”œβ”€β”€ ep2/   # step-00798
β”œβ”€β”€ ep3/   # step-01197
β”œβ”€β”€ ep4/   # step-01597
└── ep5/   # step-01995  ← final epoch

Load any specific epoch with the subfolder= kwarg:

from transformers import AutoModelForCausalLM, AutoTokenizer
m   = AutoModelForCausalLM.from_pretrained(
    "Chia-Mu-Lab/qwen25-7b-ot-q3_14b-clean-ideal", subfolder="ep5", torch_dtype="bfloat16",
)
tok = AutoTokenizer.from_pretrained("Chia-Mu-Lab/qwen25-7b-ot-q3_14b-clean-ideal", subfolder="ep5")

Training setup

field value
student Qwen/Qwen2.5-7B-Instruct
teacher Qwen/Qwen3-14B (YaRN-131k, plain user prompt β€” no attack/ICL)
teacher data source Chia-Mu-Lab/ot-ideal-q3_14b-clean (10000 prompts, 6389 kept after structural / boxed-answer filter)
training recipe s1 β€” full FT, BLOCK_SIZE=32768, grad_accum=4, micro_bs=1, LR=1e-5, FSDP full_shard, BF16
epochs 5 (1 ckpt per epoch)
hardware 4 Γ— H200 (Modal)
train_runtime ~6 h end-to-end
final loss 0.272

Performance

ckpt MATH500 AIME24 AIME25 JEE-math-s JEE-math-p LCB-v5
base Qwen2.5-7B-Ins 70.93 10.00 2.22 32.20 35.96 15.77
ep1 (step-00399) 29.87 3.33 2.22 18.50 21.59 16.13
ep2 (step-00798) 44.13 4.44 4.44 28.53 31.81 16.85
ep3 (step-01197) 61.13 11.11 15.56 40.82 43.75 15.05
ep4 (step-01597) 66.87 10.00 15.56 51.13 53.71 16.85
ep5 (step-01995) 70.27 14.44 13.33 48.45 51.24 14.70

All values are %. JEEbench is the math split only (236 of 515 problems). MATH500 / AIME24 / AIME25 are mean accuracy over 3 samples per problem (T=0.5, max_new_tokens=32768). JEEbench is mean strict / partial answer-match over n=6 samples per problem (TIA protocol). LCB is pass@1 on release_v5 codegeneration with window 2024-08-01 β†’ 2025-02-01, n=3 T=0.5.

Evaluation via exp-b10/distill/_common/multibench_runner.py β€” vLLM 0.10.0 on B200 (SM 100), seed=7, deterministic decoding aside from temperature sampling. Each cell is a single eval run.

Caveats

  • Truncation dominates raw accuracy at max_new_tokens=32768. Earlier epochs (ep1/ep2) are heavily truncation-suppressed β€” the model has not yet learned to terminate reasoning concisely. ep3+ recover.
  • LCB-v5 is roughly flat across epochs (14.7–16.9 pass@1) because the distill data is math-only; the student neither gains nor loses much codegen capability.
  • JEE column is filtered to the math subject split (236 of 515) β€” the student is never trained on physics or chemistry, so phy/chem rows just measure background drift.

Files per subfolder

Each epN/ is a stock HF Trainer checkpoint folder: config.json, tokenizer*, vocab.json, merges.txt, model.safetensors.index.json, model-0000{1..7}-of-00007.safetensors, generation_config.json. Optimizer/scheduler/RNG state is intentionally omitted (this is a release artifact, not a resume point).

Downloads last month

-

Downloads are not tracked for this model. How to track
Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support

Model tree for Chia-Mu-Lab/qwen25-7b-ot-q3_14b-clean-ideal

Base model

Qwen/Qwen2.5-7B
Finetuned
(2619)
this model

Dataset used to train Chia-Mu-Lab/qwen25-7b-ot-q3_14b-clean-ideal