splats/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-oQ8e-fp16 · Benchmark report: Reasoning-Distilled-oQ8e vs vanilla MLX-oQ8 — accuracy + compute efficiency on 10 benchmarks

May 7

Hi @splats, here is a head-to-head eval of the Reasoning-Distilled-oQ8e checkpoint against the vanilla MLX-oQ8 baseline: oMLX harness, identical hardware and seed, ten benchmarks, both thinking modes.
I am sharing the numbers because the trade-off is non-trivial and the maintainer's view matters more than mine.

TL;DR

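The distillation trades 3.6 pts of mean thinking-mode accuracy for 1.58x
faster wall-clock and 45% fewer output tokens. That trade is a Pareto win
on four benchmarks (MMLU-Pro, ARC-Challenge, BBQ, MathQA) and an
unfavorable one on the three code benchmarks (HumanEval, MBPP,
LiveCodeBench). The remaining three (MMLU, TruthfulQA, SafetyBench) give
up 2.0 to 4.6 pts for roughly half the tokens.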
The most load-bearing finding is on HumanEval no-think: +14.0 pts,
93.3% vs 79.3%, 26 problems gained against 3 lost. That asymmetry says
reasoning is partially internalized, not just relocated.
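
The gained/lost split can be recomputed from per-problem pass/fail records. This is a minimal sketch, assuming each model's results are available as a dict from problem ID to a boolean; the `paired_diff` helper and the toy data are hypothetical and not part of oMLX.

```python
def paired_diff(baseline: dict[str, bool], distilled: dict[str, bool]) -> tuple[int, int]:
    """Count problems the distilled model gained and lost relative to the baseline."""
    shared = baseline.keys() & distilled.keys()
    gained = sum(1 for pid in shared if distilled[pid] and not baseline[pid])
    lost = sum(1 for pid in shared if baseline[pid] and not distilled[pid])
    return gained, lost


# Toy data for illustration only (not the real HumanEval results).
baseline = {"p1": True, "p2": False, "p3": False, "p4": True}
distilled = {"p1": True, "p2": True, "p3": True, "p4": False}

gained, lost = paired_diff(baseline, distilled)
print(f"gained={gained} lost={lost} net={gained - lost}")
```

With the reported 26 gained and 3 lost, the net is +23 problems, which on the 164-problem HumanEval set corresponds to the +14.0 pt delta.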

Setup

Distilled : splats/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-oQ8e-fp16
Baseline : Qwen3.6-35B-A3B-MLX-oQ8-FP16
Eval harness : oMLX 0.3.8, identical hardware/seed (Mac Studio M3 Ultra), 10 benchmarks × {think, no-think}
Token counts are character-based estimates (≈ chars / 4); wall-clock is end-to-end including sampling.
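
For readers who want to reproduce the two cost columns, the proxy can be sketched as below. This is an assumption-laden illustration: the function names are hypothetical, and whether the harness aggregates totals or per-question means is not stated here; this sketch uses totals.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> float:
    """Character-based token estimate (chars / 4 by default)."""
    return len(text) / chars_per_token


def token_ratio(distilled_outputs: list[str], baseline_outputs: list[str]) -> float:
    """Estimated distilled tokens divided by estimated baseline tokens (D / S)."""
    d = sum(estimate_tokens(t) for t in distilled_outputs)
    s = sum(estimate_tokens(t) for t in baseline_outputs)
    return d / s


def speedup(baseline_seconds: float, distilled_seconds: float) -> float:
    """Baseline wall-clock divided by distilled wall-clock (S / D)."""
    return baseline_seconds / distilled_seconds


# Toy inputs for illustration only.
print(token_ratio(["a" * 200], ["b" * 400]))
print(speedup(30.0, 20.0))
```

Because the same divisor is applied to both models, the token ratio equals the raw character ratio, so the choice of 4 chars per token affects absolute counts but not the ratios.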

Full results

D = distilled, S = vanilla oQ8 baseline. Accuracy deltas are in percentage points.
Benchmark	Mode	Acc Δ (D − S)	Token ratio (D / S)	Speedup (S / D)
MMLU	no-think	+2.4	0.58×	1.10×
MMLU	think	−2.5	0.53×	1.57×
MMLU-Pro	no-think	+2.0	0.62×	1.24×
MMLU-Pro	think	+4.0	0.58×	1.71×
TruthfulQA	no-think	−1.2	~1.0×	1.06×
TruthfulQA	think	−4.6	0.54×	1.38×
ARC-Challenge	no-think	−0.3	~1.0×	1.05×
ARC-Challenge	think	0.0	0.34×	3.37×
MathQA	no-think	−3.0	n/a*	0.68×
MathQA	think	0.0	0.69×	1.49×
HumanEval	no-think	+14.0	~1.0×	1.06×
HumanEval	think	−12.2	0.44×	1.90×
MBPP	no-think	−2.0	0.45×	1.91×
MBPP	think	−15.0	0.65×	1.42×
LiveCodeBench	no-think	−7.0	~1.0×	0.89×
LiveCodeBench	think	−5.0	0.59×	1.44×
BBQ	no-think	+0.7	~1.0×	1.25×
BBQ	think	+1.0	0.29×	2.92×
SafetyBench	no-think	−0.7	~1.0×	1.19×
SafetyBench	think	−2.0	0.53×	1.65×

* MathQA no-think: both models output ~1-char answers; the aggregate ratio is dominated by a few outliers and not meaningful.

Pareto wins (thinking mode, acc parity or better AND cheaper)

MMLU-Pro: +4.0 pts at 42% fewer tokens, 1.71x faster. Cleanest win.
ARC-Challenge: 0.0 pts at 66% fewer tokens, 3.37x faster. Identical 97.3% accuracy at one-third the compute.
BBQ: +1.0 pts at 71% fewer tokens, 2.92x faster. Most efficient thinking trace in the entire eval.
MathQA: 0.0 pts at 31% fewer tokens, 1.49x faster. Modest but free.
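
The four wins can be checked mechanically against the thinking-mode rows of the results table. This is a minimal sketch; `is_pareto_win` is a hypothetical helper that encodes the criterion in the heading (accuracy parity or better, fewer tokens, faster).

```python
# (benchmark, accuracy delta in pts, token ratio D/S, speedup S/D), thinking mode,
# copied from the results table above.
THINK_ROWS = [
    ("MMLU", -2.5, 0.53, 1.57),
    ("MMLU-Pro", 4.0, 0.58, 1.71),
    ("TruthfulQA", -4.6, 0.54, 1.38),
    ("ARC-Challenge", 0.0, 0.34, 3.37),
    ("MathQA", 0.0, 0.69, 1.49),
    ("HumanEval", -12.2, 0.44, 1.90),
    ("MBPP", -15.0, 0.65, 1.42),
    ("LiveCodeBench", -5.0, 0.59, 1.44),
    ("BBQ", 1.0, 0.29, 2.92),
    ("SafetyBench", -2.0, 0.53, 1.65),
]


def is_pareto_win(acc_delta: float, token_ratio: float, speedup: float) -> bool:
    """Accuracy at parity or better AND fewer tokens AND faster wall-clock."""
    return acc_delta >= 0.0 and token_ratio < 1.0 and speedup > 1.0


wins = [name for name, d, r, s in THINK_ROWS if is_pareto_win(d, r, s)]
mean_delta = sum(d for _, d, _, _ in THINK_ROWS) / len(THINK_ROWS)

print(wins)                  # ['MMLU-Pro', 'ARC-Challenge', 'MathQA', 'BBQ']
print(round(mean_delta, 1))  # -3.6
```

The same rows also reproduce the headline figure: the unweighted mean of the ten thinking-mode deltas is -3.6 pts.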

Trade-off boundary

Long-form code generation in thinking mode is where the distilled model underperforms (accuracies listed as baseline vs distilled):
HumanEval thinking: 95.1% vs 82.9%, delta -12.2 pts
MBPP thinking: 93.5% vs 78.5%, delta -15.0 pts
LiveCodeBench thinking: 52.0% vs 47.0%, delta -5.0 pts

The pattern points to compressed reasoning traces. Shorter traces help on multi-choice and short-form code but do not help on long-horizon code, where extended deliberation pays off. TruthfulQA also regresses in both modes (-1.2 pts no-think, -4.6 pts thinking); the mechanism is less clear, possibly an alignment artifact from the teacher distribution.

Interpretation

The invariant being tested: does Claude-4.7-Opus distillation transfer reasoning into the student weights, or does it just shape inference-time output style?
The data points to genuine transfer, bounded to specific regimes:
Multi-choice and short-form reasoning: transfer succeeds. No-think gains on MMLU, MMLU-Pro, HumanEval are evidence.
Compressed thinking trace: transfer succeeds for short reasoning, yielding the four Pareto wins.
Long-horizon code generation: transfer incomplete. The teacher's long-trace behavior in thinking mode is not fully recovered.

Default suggestion for downstream users: the distilled checkpoint is the better default for no-think serving paths and for any thinking workload outside long-form code. For HumanEval-style or MBPP-style production code generation with thinking budget available, vanilla oQ8 holds the edge.

Questions

Distillation setup: teacher-trace dataset size, response-based vs reasoning-trace distillation, training duration?
MBPP and HumanEval thinking-mode regression: observed during training or only at eval time?
oQ8e vs standard oQ8: calibration dataset, group size, anything notable?
Intermediate checkpoints: planned release? The no-think reasoning gain emerging across training would be a useful curve.

Raw JSONs and harness output available on request.

Thanks for the model, and sorry for my AI-assisted English translation.
Feel free to reuse my benchmark results.
I'll add Kimi reasoning soon.

∵ RCR Regis ∴

Regis-RCR

May 12

Hi @splats ,

Follow-up to the previous head-to-head: this time both reasoning-distilled checkpoints (Claude-4.7-Opus and Kimi-K2.6) go against the vanilla oQ8 baseline on the same 10 benchmarks, same hardware, same seeds. The token data shifts the picture compared to the 2-way run.

TL;DR (thinking mode, n = 10 benchmarks)

Claude-4.7-Opus distilled: 1.58x faster wall-clock, -45% output tokens, -3.6 pts mean accuracy. The compression is real and quantified.
Kimi-K2.6 distilled: 0.76x as fast (slower), +16% output tokens, -2.9 pts mean accuracy. The trade-off curve moved the wrong way on both axes.
Where each teacher shines: Claude takes MMLU-Pro by +4.0 pts AND -42% tokens AND 1.71x faster (clean Pareto). Kimi recovers TruthfulQA to parity (-0.1) where Claude regresses 4.6 pts, and matches baseline on HumanEval thinking (+0.6) where Claude drops 12.2.

Setup (oMLX 0.3.8, Mac Studio M3 Ultra)

Distilled-A: Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-oQ8e-fp16
Distilled-B: Qwen3.6-35B-A3B-Kimi-K2.6-Reasoning-Distilled-oQ8e-fp16
Baseline: Qwen3.6-35B-A3B-MLX-oQ8-FP16
Harness: oMLX eval, identical hardware (Apple Silicon), identical seeds
Sample sizes: full-set on TruthfulQA (817), HumanEval (164), MBPP (200). Stratified samples elsewhere (300 to 1000 items, see table). Small-n caveat on LiveCodeBench (100) limits resolution.
Token estimates: chars per question divided by 4. Same proxy for all three models, so ratios stay valid.
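
To put the small-n caveat in numbers, the sketch below computes a binomial standard error for a single accuracy score. It assumes items are independent pass/fail trials and ignores the stratification, so treat it as a rough guide rather than the harness's own error model; the function name is hypothetical.

```python
import math


def accuracy_standard_error(accuracy: float, n: int) -> float:
    """Binomial standard error of an accuracy estimate, in percentage points."""
    return 100.0 * math.sqrt(accuracy * (1.0 - accuracy) / n)


# Baseline thinking-mode accuracies and sample sizes quoted in this post.
for name, acc, n in [
    ("LiveCodeBench", 0.520, 100),
    ("HumanEval", 0.951, 164),
    ("TruthfulQA", 0.903, 817),
]:
    se = accuracy_standard_error(acc, n)
    print(f"{name}: +/- {se:.1f} pts (1 standard error, n={n})")
```

At n = 100 and accuracy near 50%, one standard error is about 5 pts, which is the same size as the Claude thinking-mode gap on LiveCodeBench.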

Results

See the results table at the end of this post (table.md) and comparison.png.

Where each distillation shines

Claude-4.7-Opus distilled, clean Pareto wins (more accurate AND fewer tokens AND faster):

MMLU-Pro thinking: +4.0 pts, 1.71x faster, -42% tokens
BBQ thinking: +1.0 pt, 2.92x faster, -71% tokens
HumanEval no-thinking: +14.0 pts (93.3% vs 79.3%) at roughly equal token count (~1.0x), 1.06x faster. Reasoning baked into weights.

Kimi-K2.6 distilled, niche wins:

HumanEval no-thinking: +11.6 pts (90.8% vs 79.3%). Same internalization story.
TruthfulQA thinking: parity with baseline (-0.1) where Claude lost 4.6 pts. Teacher signal carries on truthfulness.
HumanEval thinking: +0.6 pt vs baseline, where Claude drops 12.2. The longer trace pays off on this slice.

Where the baseline still leads

TruthfulQA thinking: Claude distilled regresses to 85.7% (-4.6). Kimi holds (90.2%).
HumanEval thinking: Claude 82.9% vs baseline 95.1% (-12.2). Steep loss on long-form code generation.
MBPP thinking: Claude 78.5% vs baseline 93.5% (-15.0). Same pattern, larger gap. Kimi limits the damage at -4.5.
LiveCodeBench thinking: both distilled checkpoints lose. Kimi loses more (-12.0) than Claude (-5.0). Small-n caveat applies.
MMLU-Pro thinking, Kimi side: 75.3% vs baseline 79.7% (-4.3). The two teachers split: Claude +4.0 here, Kimi -4.3.

Reading the data

The token Pareto points to two distinct distillation regimes. Claude-trained traces produce roughly half the output (529 tokens/q on MMLU vs 994 baseline, 799 vs 1804 on HumanEval), which buys the 1.58x speedup and the 45% token saving. The cost is concentrated on multi-step code generation in thinking mode (MBPP -15.0, HumanEval -12.2), where the shorter trace plausibly stops before the answer converges.

Kimi-trained traces go the other direction. They produce 16% more tokens than baseline on average (1035 vs 994 on MMLU, 2117 vs 1804 on HumanEval), without a matching accuracy gain. The failure modes shift: TruthfulQA and HumanEval thinking hold and MBPP thinking loses less (-4.5), but MMLU-Pro, MathQA, and LiveCodeBench drop. It is hard to find a quadrant where the Kimi checkpoint dominates baseline.
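
The per-benchmark ratios behind those figures follow directly from the tokens-per-question numbers quoted in this section. A small sketch of the arithmetic; the dictionary layout and function name are illustrative, and only the two benchmarks with quoted counts are included.

```python
# Estimated output tokens per question, thinking mode, as quoted in the text.
TOKENS_PER_Q = {
    "MMLU": {"baseline": 994, "claude": 529, "kimi": 1035},
    "HumanEval": {"baseline": 1804, "claude": 799, "kimi": 2117},
}


def ratio_vs_baseline(bench: str, model: str) -> float:
    """Tokens per question for `model` divided by the baseline's, on one benchmark."""
    row = TOKENS_PER_Q[bench]
    return row[model] / row["baseline"]


for bench in TOKENS_PER_Q:
    for model in ("claude", "kimi"):
        r = ratio_vs_baseline(bench, model)
        print(f"{bench} {model}: {r:.2f}x ({(r - 1) * 100:+.0f}% tokens)")
```

The Claude ratios come out at 0.53x and 0.44x, matching the token-ratio column of the two-way run, while the Kimi ratios are 1.04x on MMLU and 1.17x on HumanEval.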

The no-thinking column is the clearer reasoning-transfer signal. HumanEval gains +14 and +11.6 pts without any CoT context, which is the strongest evidence the reasoning transferred into weights rather than into longer outputs.

Questions

Were the two teachers prompted with comparable trace-length budgets, or did Kimi-K2.6 produce systematically longer traces during distillation? The output-length asymmetry (529 vs 1035 tokens/q on MMLU) suggests the latter.
Did any internal eval catch the TruthfulQA regression on the Claude run (-4.6 thinking) before publishing? It could serve as a fine-tune-time signal.
Plans to combine the two teachers (mixture-of-traces, sequential, or weight merge)? The strengths look complementary: Claude on MMLU-Pro/MathQA/LiveCodeBench, Kimi on TruthfulQA/HumanEval/MBPP.
Would you welcome a third teacher in the same setup, kept publicly comparable on oMLX?

Thanks again for shipping both checkpoints publicly. Two reasoning-distillation variants on one base lets the community measure teacher effects directly, which is rare.

Regis

Benchmark	Mode	Vanilla oQ8 acc	Claude-4.7-Opus distilled acc	Δ vs base (Claude-4.7-Opus distilled)	Kimi-K2.6 distilled acc	Δ vs base (Kimi-K2.6 distilled)
MMLU	no-think	80.9%	83.3%	+2.4	83.0%	+2.1
MMLU	think	91.0%	88.5%	-2.5	90.3%	-0.7
MMLU-Pro	no-think	61.3%	63.3%	+2.0	59.7%	-1.7
MMLU-Pro	think	79.7%	83.7%	+4.0	75.3%	-4.3
TruthfulQA	no-think	86.3%	85.1%	-1.2	87.0%	+0.7
TruthfulQA	think	90.3%	85.7%	-4.6	90.2%	-0.1
ARC-Challenge	no-think	95.3%	95.0%	-0.3	96.0%	+0.7
ARC-Challenge	think	97.3%	97.3%	+0.0	96.0%	-1.3
MathQA	no-think	44.0%	41.0%	-3.0	44.0%	+0.0
MathQA	think	89.0%	89.0%	+0.0	83.7%	-5.3
HumanEval	no-think	79.3%	93.3%	+14.0	90.8%	+11.6
HumanEval	think	95.1%	82.9%	-12.2	95.7%	+0.6
MBPP	no-think	84.5%	82.5%	-2.0	83.5%	-1.0
MBPP	think	93.5%	78.5%	-15.0	89.0%	-4.5
LiveCodeBench	no-think	52.0%	45.0%	-7.0	49.0%	-3.0
LiveCodeBench	think	52.0%	47.0%	-5.0	40.0%	-12.0
BBQ	no-think	94.0%	94.7%	+0.7	94.7%	+0.7
BBQ	think	95.3%	96.3%	+1.0	95.3%	+0.0
SafetyBench	no-think	83.7%	83.0%	-0.7	83.3%	-0.3
SafetyBench	think	87.0%	85.0%	-2.0	85.3%	-1.7

splats

Owner May 12

These models were a community quantization request for lordx64/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled