---
language:
- en
- zh
base_model:
- Qwen/Qwen3.6-27B
tags:
- gguf
- llama.cpp
- qwen
- qwen3
- qwen3.6
- mtp
- speculative-decoding
- quantized
- long-context
- chinese
pipeline_tag: text-generation
license: apache-2.0
---

# Qwen3.6-27B-Claude-Opus-Sonnet-DistilledV2-MTP-GGUF

GGUF quantized release of the Claude Opus / Sonnet reasoning distillation on Qwen3.6-27B, with native MTP speculative decoding support in `llama.cpp`.

**Key numbers:** Q4_K_M + MTP2 → **114.78 tok/s** generation, **80.33%** draft acceptance, **64%** faster generation than the non-MTP baseline (69.98 tok/s). On the same machine, this release delivers roughly **2x the visible answer content** of the original qwen3.6-27b (2845 vs 1336 characters) and answers **4/4** test prompts correctly (original: 3/4).

## Quick Download

| File | Size | Best for |
|:---|---:|:---|
| **Q4_K_M** (recommended) | 15.66 GB | Best overall balance |
| Q6_K | 20.89 GB | Quality-first |
| Q2_K | 10.12 GB | Extreme compression |
| Q8_0 | 27.05 GB | High-fidelity experiments |

## Compared to Original qwen3.6-27b

Same-machine benchmark against the original (non-quantized) qwen3.6-27b:

![Release vs original efficiency comparison](release_vs_original_efficiency.png)

*The GGUF timings include the llama-cli cold start, so the comparison is conservative for this release.*

| | Original | This release |
|---|---|---|
| Average response time | 10.93s | **10.09s** |
| Correctness (4 prompts) | 3/4 | **4/4** |
| Visible answer chars | 1336 | **2845** |
| Hidden reasoning overhead | 9002 chars | minimal |

The original spends a large fraction of its token budget on hidden reasoning chains. This release converts that budget into visible answers, making it better suited for interactive local use.
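
The ratios behind these figures can be re-derived from the table. The following is a minimal sketch using `awk`. Treating the visible and hidden character counts as totals over the same four prompts is an assumption, since the table does not say how they were aggregated.

```bash
# Figures copied from the comparison table above.
orig_visible=1336
release_visible=2845
orig_hidden=9002

# Ratio of visible answer characters (this release vs. the original).
visible_ratio=$(awk -v a="$release_visible" -v b="$orig_visible" \
  'BEGIN { printf "%.2f", a / b }')

# Share of the original's output spent on hidden reasoning
# (assumption: both counts cover the same prompts).
hidden_share=$(awk -v h="$orig_hidden" -v v="$orig_visible" \
  'BEGIN { printf "%.1f", 100 * h / (h + v) }')

echo "visible ratio: ${visible_ratio}x"
echo "hidden share of original output: ${hidden_share}%"
```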

## Compatibility

Requires a recent `llama.cpp` build with Qwen3.5/3.6 MTP support. GGUF files produced by older conversion pipelines may lack the required metadata and fail to load with `failed to create MTP context`.

Verified stack:
- Windows CUDA build of `llama.cpp`
- GPU: NVIDIA RTX PRO 6000 Blackwell 96 GB
- `-ngl 999 --spec-type draft-mtp --spec-draft-n-max 2 --spec-draft-ngl 999`
- LM Studio 0.4.14+ enables MTP by default, with no configuration required
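
One rough way to check whether a local build knows the flags listed above is to search its `--help` output. This is a sketch, not an official compatibility test: the helper name `has_mtp_flags` and the `./llama-cli` path are assumptions. A build that lists the flags can still fail to load a GGUF file that lacks MTP metadata.

```bash
# Succeeds only if the help text on stdin mentions every MTP-related
# flag used in this README. (Hypothetical helper, not part of llama.cpp.)
has_mtp_flags() {
  local help_text flag
  help_text=$(cat)
  for flag in --spec-type --spec-draft-n-max --spec-draft-ngl; do
    printf '%s\n' "$help_text" | grep -qF -e "$flag" || return 1
  done
}

# Run the check only if a llama-cli binary is present at the assumed path.
if [ -x ./llama-cli ]; then
  if ./llama-cli --help 2>&1 | has_mtp_flags; then
    echo "MTP flags found"
  else
    echo "MTP flags missing: update llama.cpp"
  fi
fi
```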

## Benchmarks

### Quantization Comparison (short context)

Test: three-person logic puzzle, `n=160`, GPU + MTP2.

| Variant | Prompt tok/s | Generation tok/s | Draft acceptance |
|---|---:|---:|---:|
| Q2_K + MTP2 | 439.73 | 118.01 | 68.66% |
| **Q4_K_M + MTP2** | **240.55** | **114.78** | **80.33%** |
| Q6_K + MTP2 | 503.87 | 99.85 | 78.86% |
| Q8_0 + MTP2 | 421.04 | 78.86 | 69.17% |

MTP vs non-MTP baseline (Q4_K_M):

| Variant | Prompt tok/s | Generation tok/s |
|---|---:|---:|
| Non-MTP | 796.22 | 69.98 |
| MTP2 | 240.55 | **114.78** |
| MTP3 | 390.77 | 117.16 |

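MTP2 offers the best acceptance/throughput tradeoff: MTP3 generates only about 2% faster (117.16 vs 114.78 tok/s), while draft acceptance drops from 80.33% to 69.48%. In this test, prompt throughput is lower with MTP enabled than without it (240.55 and 390.77 tok/s vs 796.22 tok/s).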
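
The speedup figures can be re-derived from the generation column of the Q4_K_M table above. A minimal sketch; the `speedup` helper is a name introduced here for illustration.

```bash
# Generation throughput (tok/s) from the Q4_K_M table above.
baseline=69.98
mtp2=114.78
mtp3=117.16

# Percentage speedup of a throughput figure over a baseline figure.
speedup() {
  awk -v new="$1" -v base="$2" 'BEGIN { printf "%.1f", (new / base - 1) * 100 }'
}

echo "MTP2 vs non-MTP: +$(speedup "$mtp2" "$baseline")%"
echo "MTP3 vs non-MTP: +$(speedup "$mtp3" "$baseline")%"
```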

### Long Context

Prompt lengths ~6.6K (ctx8k) and ~26.7K (ctx32k). Generation is intentionally short (17-23 tokens) to isolate prompt processing.

| Context | Variant | Prompt tok/s | Generation tok/s | Draft acceptance |
|---|---|---:|---:|---:|
| ctx8k | Q2_K | 1304.11 | 104.41 | 83.33% |
| ctx8k | **Q4_K_M** | **2798.63** | **31.73** | **60.00%** |
| ctx8k | Q6_K | 2415.74 | 69.48 | 60.00% |
| ctx8k | Q8_0 | 2143.06 | 63.78 | 60.00% |
| ctx32k | Q2_K | 2450.46 | 71.41 | 78.57% |
| ctx32k | **Q4_K_M** | **2846.65** | **87.42** | **83.33%** |
| ctx32k | Q6_K | 2620.59 | 81.02 | 71.43% |
| ctx32k | Q8_0 | 3120.27 | 71.19 | 71.43% |

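Q4_K_M is the most balanced variant overall: it leads on ctx8k prompt throughput and on ctx32k generation throughput and draft acceptance. Its ctx8k generation figure (31.73 tok/s) is the lowest in the table, but with only 17-23 generated tokens per run, the generation and acceptance numbers in this section are coarse. Q6_K is a solid quality-first choice.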

Note: BF16 + MTP2 (historical reference) yielded 20.49 tok/s prompt / 0.85 tok/s generation on this GPU — quantization is required for practical throughput on this hardware.

## Usage

### LM Studio (zero config)

Upgrade to **LM Studio 0.4.14 or later**. Load the GGUF file and MTP speculative decoding is enabled automatically — no settings, no flags, no configuration needed.

### llama-cli

```bash
# Regular inference
./llama-cli -m Qwen3.6-27B-Claude-Opus-Sonnet-DistilledV2-MTP-Q4_K_M.gguf -ngl 999 -c 8192 -p "Your prompt here"

# With MTP enabled
./llama-cli -m Qwen3.6-27B-Claude-Opus-Sonnet-DistilledV2-MTP-Q4_K_M.gguf -ngl 999 -c 8192 \
  --spec-type draft-mtp --spec-draft-n-max 2 --spec-draft-ngl 999 -p "Your prompt here"
```

Recommended args:
- Short replies: `-c 4096 --temp 0 --top-k 1 --spec-draft-n-max 2`
- Long reasoning: `-c 8192 --temp 0 --top-k 1 --spec-draft-n-max 2`
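
The two presets can be wrapped in a small helper. This is a sketch under one assumption: the recommended args are combined with the MTP flags from the command above (`--spec-type draft-mtp`, `--spec-draft-ngl 999`). The function name `mtp_args` is hypothetical.

```bash
# Print the llama-cli arguments for one of the README presets.
# Usage: mtp_args <short|long>
mtp_args() {
  local ctx
  case "$1" in
    short) ctx=4096 ;;
    long)  ctx=8192 ;;
    *) echo "usage: mtp_args <short|long>" >&2; return 1 ;;
  esac
  printf '%s\n' "-ngl 999 -c $ctx --temp 0 --top-k 1 --spec-type draft-mtp --spec-draft-n-max 2 --spec-draft-ngl 999"
}

MODEL=Qwen3.6-27B-Claude-Opus-Sonnet-DistilledV2-MTP-Q4_K_M.gguf

# Print the full command instead of running it, so it can be inspected first.
# $(mtp_args long) is left unquoted on purpose so it splits into separate arguments.
echo ./llama-cli -m "$MODEL" $(mtp_args long) -p '"Your prompt here"'
```

Printing the command first makes it easy to check the flags before a long model load.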

## Quality Validation

All four quantized variants passed:
- GGUF header integrity check
- GPU `draft-mtp` loadability
- Same-prompt logic consistency (all converge to the same answer: A=lying, B=truth, C=lying)

| Variant | Quality verdict | Recommendation |
|---|---|---|
| Q2_K | Usable, most aggressive compression | Extreme compression only |
| **Q4_K_M** | **Best balance** | **Default** |
| Q6_K | More stable quality | Quality-first choice |
| Q8_0 | Fine, but not always faster than Q6_K | High-fidelity experiments |

Note: Windows PowerShell CLI may corrupt Chinese prompt arguments. Use UTF-8 prompt files, API calls, or your own inference service for Chinese workloads.
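
A prompt file avoids passing Chinese text through the command line. The sketch below is in bash to match the examples above (for instance under WSL or Git Bash). The file name `prompt_zh.txt`, the sample prompt, and the `./llama-cli` path are assumptions. `-f` is the `llama-cli` option for reading the prompt from a file.

```bash
# Write the prompt to a UTF-8 file instead of passing it as a CLI argument.
# The sample prompt means: "Which of A, B, and C is telling the truth? Reason step by step."
prompt_file=prompt_zh.txt
printf '%s\n' 'A、B、C 三人中谁在说真话？请逐步推理。' > "$prompt_file"

# Run only if a llama-cli binary is present at the assumed path.
if [ -x ./llama-cli ]; then
  ./llama-cli -m Qwen3.6-27B-Claude-Opus-Sonnet-DistilledV2-MTP-Q4_K_M.gguf \
    -ngl 999 -c 8192 -f "$prompt_file"
fi
```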

## Known Limitations

- Requires a recent `llama.cpp` build; GGUF files from older conversion pipelines may lack the Qwen3.5/3.6 MTP metadata
- Q8_0 is not guaranteed to be faster than Q6_K on bandwidth-limited GPUs
- Chinese prompts may need extra encoding care in Windows CLI environments

## V1 → V2

V2 optimizes distillation targets, reasoning chain compression, and MTP deployment compatibility. Coding accuracy, tool calling stability, and debugging efficiency are all meaningfully improved.