How to use splats/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-oQ8e-fp16 with MLX:
    # Download the model from the Hub
    pip install "huggingface_hub[hf_xet]"
    huggingface-cli download --local-dir Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-oQ8e-fp16 splats/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-oQ8e-fp16
Benchmark report: Reasoning-Distilled-oQ8e vs vanilla MLX-oQ8, accuracy and compute efficiency on 10 benchmarks
Hi @splats, here is a head-to-head eval of the Reasoning-Distilled-oQ8e checkpoint against the vanilla MLX-oQ8 baseline on the oMLX harness: identical hardware and seed, ten benchmarks, both thinking modes.
Sharing the numbers because the trade-off is non-trivial and the maintainer view matters more than mine.
TL;DR
The distillation trades 3.6 pts of mean thinking-mode accuracy for 1.58x
faster wall-clock and 45% fewer output tokens. That trade is Pareto
positive on four benchmarks and Pareto negative on three. The remaining
three are roughly neutral on cost.
The most load-bearing finding is on HumanEval no-think: +14.0 pts,
93.3% vs 79.3%, 26 problems gained against 3 lost. That asymmetry says
reasoning is partially internalized, not just relocated.
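One way to check that the 26-vs-3 asymmetry is more than run-to-run noise is an exact sign test on the problems where the two models disagree (a McNemar-style check). The sketch below is not part of the oMLX harness; the function name is illustrative, and the total of 164 is the full HumanEval set size.

```python
from math import comb

def exact_sign_test(gained: int, lost: int) -> float:
    """Two-sided exact binomial p-value under H0: a flip is equally likely either way."""
    n = gained + lost                      # discordant problems only
    k = min(gained, lost)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

gained, lost, total = 26, 3, 164           # HumanEval no-think, distilled vs baseline
net_points = 100 * (gained - lost) / total # net accuracy change in points
p_value = exact_sign_test(gained, lost)
print(f"net gain: {net_points:.1f} pts, two-sided p = {p_value:.2e}")
```

The test only looks at per-problem flips, so it needs the raw per-item results rather than the aggregate accuracies.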
Setup
- Distilled: splats/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-oQ8e-fp16
- Baseline: Qwen3.6-35B-A3B-MLX-oQ8-FP16
- Eval harness: oMLX 0.3.8, identical hardware/seed (Mac Studio M3 Ultra), 10 benchmarks × {think, no-think}
- Token counts are character-based estimates (≈ chars / 4); wall-clock is end-to-end including sampling.
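For readers who want to reproduce the two efficiency metrics, here is a minimal sketch of the chars / 4 proxy and the ratio definitions. The function names are hypothetical, and aggregating over the summed outputs of a benchmark is an assumption; the harness may aggregate differently.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> float:
    """Character-based token estimate (chars / 4), not a real tokenizer count."""
    return len(text) / chars_per_token

def token_ratio(distilled_outputs: list[str], baseline_outputs: list[str]) -> float:
    """Distilled / baseline output tokens, summed over all items (assumed aggregation)."""
    distilled = sum(estimate_tokens(t) for t in distilled_outputs)
    baseline = sum(estimate_tokens(t) for t in baseline_outputs)
    return distilled / baseline

def speedup(baseline_seconds: float, distilled_seconds: float) -> float:
    """Baseline wall-clock / distilled wall-clock; values above 1 mean the distilled run is faster."""
    return baseline_seconds / distilled_seconds

# Toy data only, not benchmark output.
ratio = token_ratio(["a" * 200, "b" * 400], ["a" * 800, "b" * 400])
print(ratio, speedup(30.0, 20.0))
```

Because the same divisor is applied to both models, the ratio does not depend on the exact chars-per-token constant.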
Full results
| Benchmark | Mode | Acc Δ, pts (distilled - baseline) | Token ratio (distilled / baseline) | Speedup (baseline time / distilled time) |
|---|---|---|---|---|
| MMLU | no-think | +2.4 | 0.58× | 1.10× |
| MMLU | think | -2.5 | 0.53× | 1.57× |
| MMLU-Pro | no-think | +2.0 | 0.62× | 1.24× |
| MMLU-Pro | think | +4.0 | 0.58× | 1.71× |
| TruthfulQA | no-think | -1.2 | ~1.0× | 1.06× |
| TruthfulQA | think | -4.6 | 0.54× | 1.38× |
| ARC-Challenge | no-think | -0.3 | ~1.0× | 1.05× |
| ARC-Challenge | think | 0.0 | 0.34× | 3.37× |
| MathQA | no-think | -3.0 | n/a* | 0.68× |
| MathQA | think | 0.0 | 0.69× | 1.49× |
| HumanEval | no-think | +14.0 | ~1.0× | 1.06× |
| HumanEval | think | -12.2 | 0.44× | 1.90× |
| MBPP | no-think | -2.0 | 0.45× | 1.91× |
| MBPP | think | -15.0 | 0.65× | 1.42× |
| LiveCodeBench | no-think | -7.0 | ~1.0× | 0.89× |
| LiveCodeBench | think | -5.0 | 0.59× | 1.44× |
| BBQ | no-think | +0.7 | ~1.0× | 1.25× |
| BBQ | think | +1.0 | 0.29× | 2.92× |
| SafetyBench | no-think | -0.7 | ~1.0× | 1.19× |
| SafetyBench | think | -2.0 | 0.53× | 1.65× |
* MathQA no-think: both models output ~1-char answers; the aggregate ratio is dominated by a few outliers and not meaningful.
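The mean accuracy deltas quoted in the TL;DR can be recomputed from the accuracy column of the table. This is a plain unweighted mean over the ten benchmarks, which is an assumption about how the headline figure was aggregated.

```python
from statistics import mean

# (benchmark, no-think delta, think delta) in points, copied from the table above
ACC_DELTA = [
    ("MMLU", 2.4, -2.5),
    ("MMLU-Pro", 2.0, 4.0),
    ("TruthfulQA", -1.2, -4.6),
    ("ARC-Challenge", -0.3, 0.0),
    ("MathQA", -3.0, 0.0),
    ("HumanEval", 14.0, -12.2),
    ("MBPP", -2.0, -15.0),
    ("LiveCodeBench", -7.0, -5.0),
    ("BBQ", 0.7, 1.0),
    ("SafetyBench", -0.7, -2.0),
]

mean_no_think = mean(row[1] for row in ACC_DELTA)
mean_think = mean(row[2] for row in ACC_DELTA)
print(f"no-think: {mean_no_think:+.1f} pts, think: {mean_think:+.1f} pts")
```

The thinking-mode mean lands at about -3.6 pts, matching the TL;DR, while the no-think mean is slightly positive and carried mostly by HumanEval.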
Pareto wins (thinking mode, acc parity or better AND cheaper)
MMLU-Pro: +4.0 pts at 42% fewer tokens, 1.71x faster. Cleanest win.
ARC-Challenge: 0.0 pts at 66% fewer tokens, 3.37x faster. Identical 97.3% accuracy at one-third the compute.
BBQ: +1.0 pts at 71% fewer tokens, 2.92x faster. Most efficient thinking trace in the entire eval.
MathQA: 0.0 pts at 31% fewer tokens, 1.49x faster. Modest but free.
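The Pareto-win criterion in this section (accuracy at parity or better, fewer tokens, faster) can be applied mechanically to the thinking-mode rows. A minimal sketch, with the `Row` type and `is_pareto_win` helper being illustrative names rather than anything from the harness:

```python
from typing import NamedTuple

class Row(NamedTuple):
    benchmark: str
    acc_delta: float    # distilled - baseline, points
    token_ratio: float  # distilled / baseline output tokens
    speedup: float      # baseline time / distilled time

# Thinking-mode rows copied from the results table
THINK_ROWS = [
    Row("MMLU", -2.5, 0.53, 1.57),
    Row("MMLU-Pro", 4.0, 0.58, 1.71),
    Row("TruthfulQA", -4.6, 0.54, 1.38),
    Row("ARC-Challenge", 0.0, 0.34, 3.37),
    Row("MathQA", 0.0, 0.69, 1.49),
    Row("HumanEval", -12.2, 0.44, 1.90),
    Row("MBPP", -15.0, 0.65, 1.42),
    Row("LiveCodeBench", -5.0, 0.59, 1.44),
    Row("BBQ", 1.0, 0.29, 2.92),
    Row("SafetyBench", -2.0, 0.53, 1.65),
]

def is_pareto_win(row: Row) -> bool:
    """Accuracy not worse, and strictly cheaper on both tokens and wall-clock."""
    return row.acc_delta >= 0 and row.token_ratio < 1 and row.speedup > 1

wins = [row.benchmark for row in THINK_ROWS if is_pareto_win(row)]
print(wins)
```

Every thinking-mode row is cheaper on both cost axes, so the accuracy condition alone decides which benchmarks qualify.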
Trade-off boundary
Long-form code generation in thinking mode is where the distilled checkpoint underperforms (distilled vs baseline):
HumanEval thinking: 82.9% vs 95.1%, delta -12.2 pts
MBPP thinking: 78.5% vs 93.5%, delta -15.0 pts
LiveCodeBench thinking: 47.0% vs 52.0%, delta -5.0 pts
The pattern points to compressed reasoning traces. Shorter traces help on multi-choice and short-form code but do not help on long-horizon code, where extended deliberation pays off. TruthfulQA also regresses in both modes (-1.2 pts no-think, -4.6 pts think); the mechanism is less clear, possibly an alignment artifact from the teacher distribution.
Interpretation
The invariant being tested: does Claude-4.7-Opus distillation transfer reasoning into the student weights, or does it just shape inference-time output style?
The data points to genuine transfer, bounded to specific regimes:
Multi-choice and short-form reasoning: transfer succeeds. No-think gains on MMLU, MMLU-Pro, HumanEval are evidence.
Compressed thinking trace: transfer succeeds for short reasoning, yielding the four Pareto wins.
Long-horizon code generation: transfer incomplete. The teacher's long-trace behavior in thinking mode is not fully recovered.
Default suggestion for downstream users: the distilled checkpoint is the better default for no-think serving paths and for any thinking workload outside long-form code. For HumanEval-style or MBPP-style production code generation with a thinking budget available, vanilla oQ8 holds the edge.
Questions
Distillation setup: teacher-trace dataset size, response-based vs reasoning-trace distillation, training duration?
MBPP and HumanEval thinking-mode regression: observed during training or only at eval time?
oQ8e vs standard oQ8: calibration dataset, group size, anything notable?
Intermediate checkpoints: planned release? The no-think reasoning gain emerging across training would be a useful curve.
Raw JSONs and harness output available on request.
Thanks for the model, and sorry for my AI-assisted English translation.
Feel free to reuse my benchmark results.
I'll add Kimi reasoning soon.
RCR Regis
Hi @splats ,
Follow-up to the previous head-to-head: this time both reasoning-distilled checkpoints (Claude-4.7-Opus and Kimi-K2.6) go against the vanilla oQ8 baseline on the same 10 benchmarks, same hardware, same seeds. The token data shifts the picture compared to the 2-way run.
TL;DR (thinking mode, n = 10 benchmarks)
- Claude-4.7-Opus distilled: 1.58x faster wall-clock, -45% output tokens, -3.6 pts mean accuracy. The compression is real and quantified.
- Kimi-K2.6 distilled: 0.76x as fast (slower), +16% output tokens, -2.9 pts mean accuracy. The trade-off curve moved the wrong way on both axes.
- Where each teacher shines: Claude takes MMLU-Pro by +4.0 pts AND -42% tokens AND 1.71x faster (clean Pareto). Kimi recovers TruthfulQA to parity (-0.1) where Claude regresses 4.6 pts, and matches baseline on HumanEval thinking (+0.6) where Claude drops 12.2.
Setup (oMLX 0.3.8, Mac Studio M3 Ultra)
- Distilled-A: Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-oQ8e-fp16
- Distilled-B: Qwen3.6-35B-A3B-Kimi-K2.6-Reasoning-Distilled-oQ8e-fp16
- Baseline: Qwen3.6-35B-A3B-MLX-oQ8-FP16
- Harness: oMLX eval, identical hardware (Apple Silicon), identical seeds
- Sample sizes: full-set on TruthfulQA (817), HumanEval (164), MBPP (200). Stratified samples elsewhere (300 to 1000 items, see table). Small-n caveat on LiveCodeBench (100) limits resolution.
- Token estimates: chars per question divided by 4. Same proxy for all three models, so ratios stay valid.
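To put the small-n caveat in numbers, the sketch below computes an approximate 95% confidence half-width for a single accuracy figure using the normal (Wald) approximation. This is a rough illustration, not something the harness reports; the approximation is weak near 100% accuracy, and a paired per-item comparison between models would give tighter bounds than treating each accuracy independently.

```python
from math import sqrt

def ci_half_width_pts(accuracy_pct: float, n: int, z: float = 1.96) -> float:
    """Approximate 95% half-width, in points, for an accuracy measured on n items."""
    p = accuracy_pct / 100
    return 100 * z * sqrt(p * (1 - p) / n)

# Baseline thinking-mode accuracies with the sample sizes listed above
for name, acc, n in [
    ("LiveCodeBench", 52.0, 100),
    ("HumanEval", 95.1, 164),
    ("TruthfulQA", 90.3, 817),
]:
    print(f"{name}: {acc:.1f}% +/- {ci_half_width_pts(acc, n):.1f} pts (n={n})")
```

At n = 100 the half-width on LiveCodeBench is close to 10 points, so a 5-point gap there sits well inside the sampling noise of a single run.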
Results
See table.md (pasted alongside this post) and comparison.png.
Where each distillation shines
Claude-4.7-Opus distilled, clean Pareto wins (more accurate AND fewer tokens AND faster):
- MMLU-Pro thinking: +4.0 pts, 1.71x faster, -42% tokens
- BBQ thinking: +1.0 pt, 2.92x faster, -71% tokens
- HumanEval no-thinking: +14.0 pts (93.3% vs 79.3%). Reasoning baked into weights.
Kimi-K2.6 distilled, niche wins:
- HumanEval no-thinking: +11.6 pts (90.8% vs 79.3%). Same internalization story.
- TruthfulQA thinking: parity with baseline (-0.1) where Claude lost 4.6 pts. Teacher signal carries on truthfulness.
- HumanEval thinking: +0.6 pt vs baseline, where Claude drops 12.2. The longer trace pays off on this slice.
Where the baseline still leads
- TruthfulQA thinking: Claude distilled regresses to 85.7% (-4.6). Kimi holds (90.2%).
- HumanEval thinking: Claude 82.9% vs baseline 95.1% (-12.2). Steep loss on long-form code generation.
- MBPP thinking: Claude 78.5% vs baseline 93.5% (-15.0). Same pattern, larger gap. Kimi limits the damage at -4.5.
- LiveCodeBench: both distilled lose. Kimi loses harder (-12) than Claude (-5). Small-n caveat applies.
- MMLU-Pro thinking, Kimi side: 75.3% vs baseline 79.7% (-4.3). The two teachers split: Claude +4.0 here, Kimi -4.3.
Reading the data
The token Pareto points to two distinct distillation regimes. Claude-trained traces produce ~half the output (529 tokens/q on MMLU vs 994 baseline, 799 vs 1804 on HumanEval), which buys the 1.58x speedup and the 45% token saving. The cost is concentrated on multi-step code generation in thinking mode (MBPP -15, HumanEval -12), where the shorter trace stops before the answer converges.
Kimi-trained traces go the other direction. They produce 16% more tokens than baseline on average (1035 vs 994 on MMLU, 2117 vs 1804 on HumanEval), without a matching accuracy gain. The failure modes shift: TruthfulQA holds, code-think holds, but MMLU-Pro and MathQA drop. Hard to find a quadrant where the Kimi checkpoint dominates baseline.
The no-thinking column is the clearer reasoning-transfer signal. HumanEval gains +14 and +11.6 pts without any CoT context, which is the strongest evidence the reasoning transferred into weights rather than into longer outputs.
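The per-question token figures quoted in this section translate into ratios against baseline as follows. The dictionary layout and helper name are illustrative; only the six numbers come from the post.

```python
# Estimated output tokens per question in thinking mode, as quoted above
TOKENS_PER_QUESTION = {
    "MMLU": {"baseline": 994, "claude": 529, "kimi": 1035},
    "HumanEval": {"baseline": 1804, "claude": 799, "kimi": 2117},
}

def ratio_vs_baseline(benchmark: str, model: str) -> float:
    """Output-token ratio of a distilled model relative to the baseline."""
    row = TOKENS_PER_QUESTION[benchmark]
    return row[model] / row["baseline"]

for benchmark in TOKENS_PER_QUESTION:
    for model in ("claude", "kimi"):
        print(f"{benchmark} {model}: {ratio_vs_baseline(benchmark, model):.2f}x")
```

On these two benchmarks the Claude-distilled ratios fall at roughly half of baseline, while the Kimi-distilled ratios sit above 1.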
Questions
- Were the two teachers prompted with comparable trace-length budgets, or did Kimi-K2.6 produce systematically longer traces during distillation? The output-length asymmetry (529 vs 1035 tokens/q on MMLU) suggests the latter.
- Did any internal eval catch the TruthfulQA regression on the Claude run (-4.6 thinking) before publishing? It could serve as a fine-tune-time signal.
- Plans to combine the two teachers (mixture-of-traces, sequential, or weight merge)? The strengths look complementary: Claude on MMLU-Pro/MathQA/LiveCodeBench, Kimi on TruthfulQA/HumanEval/MBPP.
- Would you welcome a third teacher in the same setup, kept publicly comparable on oMLX?
Thanks again for shipping both checkpoints publicly. Two reasoning-distillation variants on one base lets the community measure teacher effects directly, which is rare.
Regis
| Benchmark | Mode | Vanilla oQ8 acc | Claude-4.7-Opus distilled acc | Δ vs base (Claude-4.7-Opus distilled) | Kimi-K2.6 distilled acc | Δ vs base (Kimi-K2.6 distilled) |
|---|---|---|---|---|---|---|
| MMLU | no-think | 80.9% | 83.3% | +2.4 | 83.0% | +2.1 |
| MMLU | think | 91.0% | 88.5% | -2.5 | 90.3% | -0.7 |
| MMLU-Pro | no-think | 61.3% | 63.3% | +2.0 | 59.7% | -1.7 |
| MMLU-Pro | think | 79.7% | 83.7% | +4.0 | 75.3% | -4.3 |
| TruthfulQA | no-think | 86.3% | 85.1% | -1.2 | 87.0% | +0.7 |
| TruthfulQA | think | 90.3% | 85.7% | -4.6 | 90.2% | -0.1 |
| ARC-Challenge | no-think | 95.3% | 95.0% | -0.3 | 96.0% | +0.7 |
| ARC-Challenge | think | 97.3% | 97.3% | +0.0 | 96.0% | -1.3 |
| MathQA | no-think | 44.0% | 41.0% | -3.0 | 44.0% | +0.0 |
| MathQA | think | 89.0% | 89.0% | +0.0 | 83.7% | -5.3 |
| HumanEval | no-think | 79.3% | 93.3% | +14.0 | 90.8% | +11.6 |
| HumanEval | think | 95.1% | 82.9% | -12.2 | 95.7% | +0.6 |
| MBPP | no-think | 84.5% | 82.5% | -2.0 | 83.5% | -1.0 |
| MBPP | think | 93.5% | 78.5% | -15.0 | 89.0% | -4.5 |
| LiveCodeBench | no-think | 52.0% | 45.0% | -7.0 | 49.0% | -3.0 |
| LiveCodeBench | think | 52.0% | 47.0% | -5.0 | 40.0% | -12.0 |
| BBQ | no-think | 94.0% | 94.7% | +0.7 | 94.7% | +0.7 |
| BBQ | think | 95.3% | 96.3% | +1.0 | 95.3% | +0.0 |
| SafetyBench | no-think | 83.7% | 83.0% | -0.7 | 83.3% | -0.3 |
| SafetyBench | think | 87.0% | 85.0% | -2.0 | 85.3% | -1.7 |
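The thinking-mode rows of this table are enough to recompute the mean deltas and to see where the two teachers diverge. In the sketch below, the 3-point gap used to call a benchmark a clear lead for one teacher is an arbitrary illustrative threshold, not a significance test.

```python
# Thinking-mode accuracy (%) per benchmark: (baseline, claude, kimi), from the table above
THINK_ACC = {
    "MMLU": (91.0, 88.5, 90.3),
    "MMLU-Pro": (79.7, 83.7, 75.3),
    "TruthfulQA": (90.3, 85.7, 90.2),
    "ARC-Challenge": (97.3, 97.3, 96.0),
    "MathQA": (89.0, 89.0, 83.7),
    "HumanEval": (95.1, 82.9, 95.7),
    "MBPP": (93.5, 78.5, 89.0),
    "LiveCodeBench": (52.0, 47.0, 40.0),
    "BBQ": (95.3, 96.3, 95.3),
    "SafetyBench": (87.0, 85.0, 85.3),
}

MIN_GAP = 3.0  # assumed threshold (points) for calling one teacher clearly ahead

claude_leads = [b for b, (_, c, k) in THINK_ACC.items() if c - k >= MIN_GAP]
kimi_leads = [b for b, (_, c, k) in THINK_ACC.items() if k - c >= MIN_GAP]

n = len(THINK_ACC)
mean_delta_claude = sum(c - base for base, c, _ in THINK_ACC.values()) / n
mean_delta_kimi = sum(k - base for base, _, k in THINK_ACC.values()) / n

print("Claude leads:", claude_leads)
print("Kimi leads:", kimi_leads)
print(f"mean delta vs base: Claude {mean_delta_claude:+.1f}, Kimi {mean_delta_kimi:+.1f}")
```

With this threshold the split matches the complementary-strengths reading in the post: Claude ahead on MMLU-Pro, MathQA and LiveCodeBench, Kimi ahead on TruthfulQA, HumanEval and MBPP.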
These models were a community quantization request for lordx64/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled

