---
language:
- en
- zh
base_model:
- Qwen/Qwen3.6-27B
tags:
- gguf
- llama.cpp
- qwen
- qwen3
- qwen3.6
- mtp
- speculative-decoding
- quantized
- long-context
- chinese
pipeline_tag: text-generation
license: apache-2.0
---

# Qwen3.6-27B-Claude-Opus-Sonnet-DistilledV2-MTP-GGUF

GGUF quantized release of the Claude Opus / Sonnet reasoning distillation on Qwen3.6-27B, with native MTP speculative decoding support in `llama.cpp`.

**Key numbers:** Q4_K_M + MTP2 → **114.78 tok/s** generation, **80.33%** draft acceptance, **64%** faster than the non-MTP baseline (69.98 tok/s). On the same machine, this release delivers about **2x the visible answer content** of the original Qwen3.6-27B (2845 vs 1336 characters) while answering 4/4 test prompts correctly (the original scored 3/4).

## Quick Download

| File | Size | Best for |
|:---|---:|:---|
| **Q4_K_M** (recommended) | 15.66 GB | Best overall balance |
| Q6_K | 20.89 GB | Quality-first |
| Q2_K | 10.12 GB | Extreme compression |
| Q8_0 | 27.05 GB | High-fidelity experiments |

## Compared to Original Qwen3.6-27B

Same-machine benchmark against the original (non-quantized) Qwen3.6-27B:

![Release vs original efficiency comparison](release_vs_original_efficiency.png)

*The GGUF side includes llama-cli cold start, so this is a conservative estimate.*

| Metric | Original | This release |
|---|---|---|
| Average response time | 10.93s | **10.09s** |
| Correctness (4 prompts) | 3/4 | **4/4** |
| Visible answer chars | 1336 | **2845** |
| Hidden reasoning overhead | 9002 chars | minimal |

The original spends a large fraction of its token budget on hidden reasoning chains. This release converts that budget into visible answers, making it better suited for interactive local use.

## Compatibility

Requires a recent `llama.cpp` build with Qwen3.5/3.6 MTP support. GGUF files produced by older conversion pipelines may lack the required MTP metadata and fail to load with `failed to create MTP context`.

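
One way to cope with the `failed to create MTP context` error is to retry without the MTP flags. The following is a minimal Python sketch of that idea, not part of this release: the helper names (`build_args`, `is_mtp_failure`, `run_with_fallback`) are hypothetical, the flags are the ones listed in this card's verified stack, and it assumes that the error text shows up in the captured output and that a file without MTP metadata still loads for regular inference.

```python
import subprocess

# Error text quoted in this card for GGUF files that lack MTP metadata.
MTP_ERROR = "failed to create MTP context"

# MTP flags taken from the verified stack in this card.
MTP_FLAGS = ["--spec-type", "draft-mtp", "--spec-draft-n-max", "2", "--spec-draft-ngl", "999"]


def build_args(binary, model, prompt, use_mtp=True, ctx=8192):
    """Build the llama-cli argument list, with or without the MTP flags."""
    args = [binary, "-m", model, "-ngl", "999", "-c", str(ctx)]
    if use_mtp:
        args += MTP_FLAGS
    return args + ["-p", prompt]


def is_mtp_failure(output):
    """Return True when the captured output contains the MTP context error."""
    return MTP_ERROR in output


def _run(args):
    # stdin is closed so a build that starts an interactive session does not block.
    return subprocess.run(
        args,
        capture_output=True,
        text=True,
        encoding="utf-8",
        errors="replace",
        stdin=subprocess.DEVNULL,
    )


def run_with_fallback(binary, model, prompt):
    """Try MTP first; rerun without MTP if the MTP context cannot be created."""
    first = _run(build_args(binary, model, prompt))
    if not is_mtp_failure(first.stdout + first.stderr):
        return first
    # Assumption: the same file is still usable for regular (non-MTP) inference.
    return _run(build_args(binary, model, prompt, use_mtp=False))
```

The fallback only hides the symptom; updating `llama.cpp` or re-downloading a GGUF with MTP metadata remains the real fix.
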
Verified stack:

- Windows CUDA build of `llama.cpp`
- GPU: NVIDIA RTX PRO 6000 Blackwell 96 GB
- Flags: `-ngl 999 --spec-type draft-mtp --spec-draft-n-max 2 --spec-draft-ngl 999`
- LM Studio 0.4.14+ enables MTP by default, with zero configuration

## Benchmarks

### Quantization Comparison (short context)

Test: three-person logic puzzle, `n=160`, GPU + MTP2.

| Variant | Prompt tok/s | Generation tok/s | Draft acceptance |
|---|---:|---:|---:|
| Q2_K + MTP2 | 439.73 | 118.01 | 68.66% |
| **Q4_K_M + MTP2** | **240.55** | **114.78** | **80.33%** |
| Q6_K + MTP2 | 503.87 | 99.85 | 78.86% |
| Q8_0 + MTP2 | 421.04 | 78.86 | 69.17% |

MTP vs non-MTP baseline (Q4_K_M):

| Variant | Prompt tok/s | Generation tok/s |
|---|---:|---:|
| Non-MTP | 796.22 | 69.98 |
| MTP2 | 240.55 | **114.78** |
| MTP3 | 390.77 | 117.16 |

MTP2 offers the best acceptance/throughput tradeoff: MTP3 generates only slightly faster (117.16 vs 114.78 tok/s), while its draft acceptance drops to 69.48% from 80.33%.

### Long Context

Prompt lengths are ~6.6K (ctx8k) and ~26.7K (ctx32k). Generation is intentionally short (17-23 tokens) to isolate prompt processing.

| Context | Variant | Prompt tok/s | Generation tok/s | Draft acceptance |
|---|---|---:|---:|---:|
| ctx8k | Q2_K | 1304.11 | 104.41 | 83.33% |
| ctx8k | **Q4_K_M** | **2798.63** | **31.73** | **60.00%** |
| ctx8k | Q6_K | 2415.74 | 69.48 | 60.00% |
| ctx8k | Q8_0 | 2143.06 | 63.78 | 60.00% |
| ctx32k | Q2_K | 2450.46 | 71.41 | 78.57% |
| ctx32k | **Q4_K_M** | **2846.65** | **87.42** | **83.33%** |
| ctx32k | Q6_K | 2620.59 | 81.02 | 71.43% |
| ctx32k | Q8_0 | 3120.27 | 71.19 | 71.43% |

Overall, Q4_K_M is the most balanced variant across both short and long contexts, and Q6_K is a solid quality-first choice. Because generation is so short in these runs, treat the generation column as a rough indicator only: Q4_K_M's ctx8k figure (31.73 tok/s) is the lowest in the table despite its leading prompt throughput at that context size.

Note: BF16 + MTP2 (historical reference) yielded 20.49 tok/s prompt / 0.85 tok/s generation on this GPU, so quantization is required for practical throughput on this hardware.

## Usage

### LM Studio (zero config)

Upgrade to **LM Studio 0.4.14 or later**.
Load the GGUF file and MTP speculative decoding is enabled automatically, with no settings, flags, or configuration needed.

### llama-cli

```bash
# Regular inference
./llama-cli -m Qwen3.6-27B-Claude-Opus-Sonnet-DistilledV2-MTP-Q4_K_M.gguf -ngl 999 -c 8192 -p "Your prompt here"

# With MTP enabled
./llama-cli -m Qwen3.6-27B-Claude-Opus-Sonnet-DistilledV2-MTP-Q4_K_M.gguf -ngl 999 -c 8192 \
  --spec-type draft-mtp --spec-draft-n-max 2 --spec-draft-ngl 999 -p "Your prompt here"
```

Recommended args (the two presets differ only in context size):

- Short replies: `-c 4096 --temp 0 --top-k 1 --spec-draft-n-max 2`
- Long reasoning: `-c 8192 --temp 0 --top-k 1 --spec-draft-n-max 2`

## Quality Validation

All four quantized variants passed:

- GGUF header integrity check
- GPU `draft-mtp` loadability
- Same-prompt logic consistency (all converge to the same answer: A=lying, B=truth, C=lying)

| Variant | Quality verdict | Recommendation |
|---|---|---|
| Q2_K | Usable, most aggressive compression | Extreme compression only |
| **Q4_K_M** | **Best balance** | **Default** |
| Q6_K | More stable quality | Quality-first choice |
| Q8_0 | Fine, but not always faster than Q6_K | High-fidelity experiments |

Note: Passing Chinese prompts as command-line arguments in Windows PowerShell may corrupt them. Use UTF-8 prompt files, API calls, or your own inference service for Chinese workloads.

## Known Limitations

- Requires a recent `llama.cpp` build (older exports may miss Qwen3.5/3.6 MTP metadata)
- Q8_0 is not guaranteed to be faster than Q6_K on bandwidth-limited GPUs
- Chinese prompts may need extra encoding care in Windows CLI environments

## V1 → V2

V2 optimizes distillation targets, reasoning chain compression, and MTP deployment compatibility. Compared with V1, coding accuracy, tool calling stability, and debugging efficiency are all meaningfully improved.
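
For the Chinese-prompt encoding caveat noted under Quality Validation and Known Limitations, one option is to write the prompt to a UTF-8 file and pass the file instead of a command-line string. The sketch below illustrates that workaround in Python; the helper names (`write_prompt_file`, `build_file_args`) and the `prompt.txt` filename are hypothetical, and it assumes your `llama-cli` build accepts `-f` for reading the prompt from a file.

```python
import tempfile
from pathlib import Path


def write_prompt_file(prompt, directory):
    """Write the prompt as UTF-8 (no BOM) and return the file path."""
    path = Path(directory) / "prompt.txt"
    path.write_text(prompt, encoding="utf-8")
    return path


def build_file_args(binary, model, prompt_path, ctx=8192):
    """Build a llama-cli command that reads the prompt from a file via -f."""
    # Assumption: this llama-cli build supports -f for prompt files.
    return [binary, "-m", model, "-ngl", "999", "-c", str(ctx), "-f", str(prompt_path)]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        prompt = "三个人中谁在说谎？"
        path = write_prompt_file(prompt, tmp)
        # Round-trip check: the bytes on disk decode back to the same text.
        assert path.read_bytes().decode("utf-8") == prompt
        print(build_file_args(
            "./llama-cli",
            "Qwen3.6-27B-Claude-Opus-Sonnet-DistilledV2-MTP-Q4_K_M.gguf",
            path,
        ))
```

Because the prompt text never passes through the shell's argument encoding, only the file path does, which keeps the Chinese characters intact as long as the path itself is plain ASCII.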