qwen25-7b-ot-q3_14b-clean-ideal

An s1-style supervised fine-tune of Qwen/Qwen2.5-7B-Instruct on ideal Qwen3-14B reasoning traces. The teacher (Qwen3-14B) is given the raw OpenThoughts question with no adversarial trigger and no in-context demonstrations, and its <think>...</think> block plus visible answer are used as the SFT target.
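Since the SFT target is a <think>...</think> block followed by the visible answer, the student's generations should have the same shape. The helper below is a minimal sketch for separating the two parts. `split_reasoning` is a hypothetical name, and the exact tag layout is an assumption based on the description above.

```python
import re

# Assumed output shape: "<think>reasoning</think>visible answer".
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning, answer).

    If no closed <think> block is found (for example, the generation was
    truncated), reasoning is empty and the whole text is treated as the answer.
    """
    match = THINK_RE.search(text)
    if match is None:
        return "", text.strip()
    return match.group(1).strip(), text[match.end():].strip()


reasoning, answer = split_reasoning(
    "<think>2 + 2 = 4</think>The answer is \\boxed{4}."
)
print(reasoning)
print(answer)
```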

This repo contains all 5 epoch checkpoints of a single 5-epoch training run, each laid out under its own subfolder:

qwen25-7b-ot-q3_14b-clean-ideal/
├── ep1/   # step-00399
├── ep2/   # step-00798
├── ep3/   # step-01197
├── ep4/   # step-01597
└── ep5/   # step-01995  ← final epoch

Load any specific epoch with the subfolder= kwarg:

from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "Chia-Mu-Lab/qwen25-7b-ot-q3_14b-clean-ideal"
# subfolder picks the epoch checkpoint (ep1 ... ep5); ep5 is the final epoch
model = AutoModelForCausalLM.from_pretrained(repo, subfolder="ep5", torch_dtype="bfloat16")
tok = AutoTokenizer.from_pretrained(repo, subfolder="ep5")
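Because the repo holds five full checkpoints, it can be useful to fetch only one epoch. The sketch below uses `snapshot_download` from `huggingface_hub` with an `allow_patterns` filter. `epoch_patterns` and `download_epoch` are hypothetical helper names, and the glob assumes the `epN/` layout shown above.

```python
REPO_ID = "Chia-Mu-Lab/qwen25-7b-ot-q3_14b-clean-ideal"


def epoch_patterns(epoch: int) -> list[str]:
    """Glob patterns that match only the files of one epoch subfolder."""
    if not 1 <= epoch <= 5:
        raise ValueError(f"epoch must be in 1..5, got {epoch}")
    return [f"ep{epoch}/*"]


def download_epoch(epoch: int, repo_id: str = REPO_ID) -> str:
    """Download a single epoch subfolder and return the local snapshot path."""
    # Imported lazily so the pattern helper works without huggingface_hub.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id, allow_patterns=epoch_patterns(epoch))


print(epoch_patterns(5))
# local_root = download_epoch(5)  # network access required
```

The returned path is the snapshot root, so the checkpoint files sit under its `ep5/` directory.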

Training setup

| field | value |
| --- | --- |
| student | `Qwen/Qwen2.5-7B-Instruct` |
| teacher | `Qwen/Qwen3-14B` (YaRN-131k, plain user prompt, no attack/ICL) |
| teacher data source | `Chia-Mu-Lab/ot-ideal-q3_14b-clean` (10000 prompts, 6389 kept after structural / boxed-answer filter) |
| training recipe | s1: full FT, BLOCK_SIZE=32768, grad_accum=4, micro_bs=1, LR=1e-5, FSDP full_shard, BF16 |
| epochs | 5 (1 ckpt per epoch) |
| hardware | 4 × H200 (Modal) |
| train_runtime | ~6 h end-to-end |
| final loss | 0.272 |
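As a rough consistency check, the setup values can be related to the checkpoint step numbers. The sketch below assumes all 4 GPUs act as data-parallel ranks under FSDP and that each kept example is one training sequence (no packing); neither assumption is stated in the card.

```python
# Values taken from the training setup above.
num_gpus = 4
micro_bs = 1
grad_accum = 4
kept_examples = 6389
epochs = 5

# Assumption: every GPU is a data-parallel rank, so one optimizer step
# consumes num_gpus * micro_bs * grad_accum sequences.
effective_batch = num_gpus * micro_bs * grad_accum

# Assumption: one kept example = one sequence (no sequence packing).
steps_per_epoch = kept_examples // effective_batch
total_steps = steps_per_epoch * epochs

print(effective_batch, steps_per_epoch, total_steps)
```

Under these assumptions the effective batch is 16 sequences, which gives 399 optimizer steps per epoch and 1995 in total, in line with ep1 at step-00399 and ep5 at step-01995.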

Performance

| checkpoint | MATH500 | AIME24 | AIME25 | JEE-math-s | JEE-math-p | LCB-v5 |
| --- | --- | --- | --- | --- | --- | --- |
| base Qwen2.5-7B-Instruct | 70.93 | 10.00 | 2.22 | 32.20 | 35.96 | 15.77 |
| ep1 (step-00399) | 29.87 | 3.33 | 2.22 | 18.50 | 21.59 | 16.13 |
| ep2 (step-00798) | 44.13 | 4.44 | 4.44 | 28.53 | 31.81 | 16.85 |
| ep3 (step-01197) | 61.13 | 11.11 | 15.56 | 40.82 | 43.75 | 15.05 |
| ep4 (step-01597) | 66.87 | 10.00 | 15.56 | 51.13 | 53.71 | 16.85 |
| ep5 (step-01995) | 70.27 | 14.44 | 13.33 | 48.45 | 51.24 | 14.70 |

All values are percentages. MATH500, AIME24, and AIME25 report mean accuracy over 3 samples per problem (T=0.5, max_new_tokens=32768). JEEbench covers the math split only (236 of 515 problems); JEE-math-s and JEE-math-p are the mean strict and partial answer-match over n=6 samples per problem (TIA protocol). LCB-v5 is pass@1 on LiveCodeBench release_v5 codegeneration with the problem window 2024-08-01 → 2025-02-01, n=3, T=0.5.
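The "mean accuracy over 3 samples per problem" metric can be sketched as below. This illustrates the arithmetic only and is not the runner's actual code. It assumes 30 problems per AIME year, so each cell is built from 90 samples and moves in steps of 100/90 ≈ 1.11 points.

```python
def mean_accuracy(correct: list[list[bool]]) -> float:
    """correct[i][j] is True if sample j of problem i was graded correct.

    Returns the percentage of correct samples over all problems and samples.
    """
    total = sum(len(row) for row in correct)
    hits = sum(sum(row) for row in correct)
    return 100.0 * hits / total


# Toy grid: 30 problems x 3 samples with 13 correct samples in total.
# Which samples are correct is made up; only the count matters here.
flat = [True] * 13 + [False] * 77
grid = [flat[i:i + 3] for i in range(0, 90, 3)]

score = round(mean_accuracy(grid), 2)
print(score)  # 14.44
```

With 13 of 90 samples correct the score is 14.44, the same granularity as the AIME columns in the table.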

Evaluation was run with exp-b10/distill/_common/multibench_runner.py on vLLM 0.10.0, B200 (SM 100), seed=7. Decoding is deterministic apart from temperature sampling. Each cell comes from a single eval run.

Caveats

Truncation dominates raw accuracy at max_new_tokens=32768. The early epochs (ep1 and ep2) are heavily suppressed by truncation because the model has not yet learned to end its reasoning concisely; ep3 and later recover.
LCB-v5 is roughly flat across epochs (14.7–16.9 pass@1) because the distillation data is math-only, so the student neither gains nor loses much codegen capability.
The JEE columns are filtered to the math subject split (236 of 515 problems). The student is never trained on physics or chemistry, so results on those subjects would only measure background drift.
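One way to quantify the truncation caveat is to report accuracy alongside the share of samples that hit the token limit. The sketch below assumes each sample carries a vLLM-style `finish_reason`, with `"length"` meaning the generation stopped at max_new_tokens. `Sample` and `truncation_report` are hypothetical names, and the toy data is made up, not taken from the eval runs.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    correct: bool
    finish_reason: str  # assumed: "length" = hit the token limit, "stop" = ended normally


def truncation_report(samples: list[Sample]) -> dict[str, float]:
    """Raw accuracy, truncation rate, and accuracy on non-truncated samples (all in %)."""
    n = len(samples)
    finished = [s for s in samples if s.finish_reason != "length"]
    acc_finished = (
        100.0 * sum(s.correct for s in finished) / len(finished)
        if finished
        else float("nan")
    )
    return {
        "accuracy": 100.0 * sum(s.correct for s in samples) / n,
        "truncation_rate": 100.0 * (n - len(finished)) / n,
        "accuracy_when_finished": acc_finished,
    }


# Made-up toy data: two finished-and-correct samples, two truncated ones.
toy = [
    Sample(True, "stop"),
    Sample(False, "length"),
    Sample(False, "length"),
    Sample(True, "stop"),
]
report = truncation_report(toy)
print(report)
```

A large gap between `accuracy` and `accuracy_when_finished` would indicate that the raw number is limited by truncation rather than by wrong answers.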

Files per subfolder

Each epN/ is a stock HF Trainer checkpoint folder: config.json, tokenizer*, vocab.json, merges.txt, model.safetensors.index.json, model-0000{1..7}-of-00007.safetensors, generation_config.json. Optimizer/scheduler/RNG state is intentionally omitted (this is a release artifact, not a resume point).
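After downloading a subfolder, a quick completeness check can catch missing shards. The sketch below expands the shard pattern from the file list above and checks only the explicitly named files; the `tokenizer*` files are left out because their exact names are not spelled out. `expected_files` and `missing_files` are hypothetical helpers.

```python
from pathlib import Path


def expected_files(num_shards: int = 7) -> list[str]:
    """File names spelled out in the card, with the shard pattern expanded."""
    shards = [
        f"model-{i:05d}-of-{num_shards:05d}.safetensors"
        for i in range(1, num_shards + 1)
    ]
    return [
        "config.json",
        "generation_config.json",
        "model.safetensors.index.json",
        "vocab.json",
        "merges.txt",
        *shards,
    ]


def missing_files(folder: str, num_shards: int = 7) -> list[str]:
    """Return the expected file names that are absent from the folder."""
    root = Path(folder)
    return [name for name in expected_files(num_shards) if not (root / name).is_file()]


print(expected_files()[5])   # model-00001-of-00007.safetensors
print(expected_files()[-1])  # model-00007-of-00007.safetensors
```

Calling `missing_files("path/to/ep5")` on a complete download should return an empty list; the path here is a placeholder.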
