---
language:
- en
license: other
base_model:
- Qwen/Qwen3.6-27B
tags:
- gguf
- llama.cpp
- qwen
- mtp
- speculative-decoding
- quantized
pipeline_tag: text-generation
---

# Qwen3.6-27B-Claude-Opus-Sonnet-DistilledV2-MTP-GGUF

This is the GGUF quantized release of the local distilled model `Qwen3.6-27B-Claude-Opus-Sonnet-DistilledV2-MTP`.

The value proposition of this release is straightforward: it preserves the Claude Opus / Sonnet distilled style, opens `MTP` directly in `llama.cpp` for real acceleration, shortens the overly long reasoning chain seen in the local original model, and converts more of the token budget into user-visible answers.

Key points:

- Preserves the Claude Opus / Sonnet distilled response style and organization
- Verified to open `MTP` directly in `llama.cpp` with `--spec-type draft-mtp`
- `Q4_K_M + MTP2` reaches `80.33%` draft acceptance and `114.78 tok/s` generation, versus `69.98 tok/s` for `Q4_K_M + non-MTP`, or about `64%` faster generation
- Compared with the local original model, this release follows a shorter reasoning path; in the same-machine 4-prompt comparison, the original consumed `9002` hidden reasoning chars
- Delivers higher visible-output efficiency per token budget; the same comparison produced `2845` visible answer chars for this release versus `1336` for the original
- Provides four quantization variants: `Q2_K / Q4_K_M / Q6_K / Q8_0`

## 1. Core Value Of This Release

This is not just a generic GGUF export. It is a release that has already been validated for local deployment. From an end-user perspective, the important points are:

- `MTP` can be opened directly in `llama.cpp`, rather than existing only as metadata that fails at runtime
- In the tested stack, `MTP2` reaches `80.33%` acceptance, showing that speculative acceleration is actually effective
- Same-machine comparison against the local original `qwen3.6-27b` shows that the original spends more of its budget on an overly long hidden reasoning chain
- This release turns more of the token budget into visible answers, making it better suited for efficient local deployment and interactive use

## 2. Files

| File | Size | Notes |
|---|---:|---|
| `Qwen3.6-27B-Claude-Opus-Sonnet-DistilledV2-MTP-Q2_K.gguf` | 10.12 GB | Most aggressive compression, fastest, largest quality loss |
| `Qwen3.6-27B-Claude-Opus-Sonnet-DistilledV2-MTP-Q4_K_M.gguf` | 15.66 GB | Best overall balance, default recommendation |
| `Qwen3.6-27B-Claude-Opus-Sonnet-DistilledV2-MTP-Q6_K.gguf` | 20.89 GB | More quality-oriented, still reasonably fast |
| `Qwen3.6-27B-Claude-Opus-Sonnet-DistilledV2-MTP-Q8_0.gguf` | 27.05 GB | Closer to high precision, heavier bandwidth pressure |

## 3. Compatibility

Verified with:

- Windows CUDA build of `llama.cpp`
- GPU: NVIDIA RTX PRO 6000 Blackwell Workstation Edition 96 GB
- `llama-cli`
- `--spec-type draft-mtp`
- `--spec-draft-n-max 2`
- `-ngl 999`

Note: you need a newer `llama.cpp` build that includes Qwen3.5/3.6 MTP support. Older conversion pipelines may miss the required metadata and fail with `failed to create MTP context`.

## 4. Recommended Variant

- `Q4_K_M`: default recommendation, best speed/quality balance
- `Q6_K`: recommended if you care more about quality
- `Q2_K`: use when VRAM or disk space is very limited
- `Q8_0`: use for higher-fidelity experiments, but it is not always faster

## 5. GPU + MTP2 Benchmarks

Test environment:

- GPU: RTX PRO 6000 Blackwell 96 GB
- Backend: CUDA
- Args: `-ngl 999 --spec-type draft-mtp --spec-draft-n-max 2 --spec-draft-ngl 999`
- Logic puzzle: three-person truth/lie reasoning task, `n=160`
- `5.2` is the short-context benchmark
- `5.3` is the long-context benchmark

### 5.1 Historical Reference

| Variant | Prompt | Generation | Draft acceptance |
|---|---:|---:|---:|
| BF16 + MTP2 | 20.49 tok/s | 0.85 tok/s | 76.80% |
| Q4_K_M + non-MTP | 796.22 tok/s | 69.98 tok/s | - |
| Q4_K_M + MTP2 | 240.55 tok/s | 114.78 tok/s | 80.33% |
| Q4_K_M + MTP3 | 390.77 tok/s | 117.16 tok/s | 69.48% |

### 5.2 Current Quantization Comparison

| Variant | Prompt | Generation | Draft acceptance | Notes |
|---|---:|---:|---:|---|
| Q2_K + MTP2 | 439.73 tok/s | 118.01 tok/s | 68.66% | Fastest generation, but most aggressive compression |
| Q4_K_M + MTP2 | 240.55 tok/s | 114.78 tok/s | 80.33% | Default recommendation |
| Q6_K + MTP2 | 503.87 tok/s | 99.85 tok/s | 78.86% | More quality-oriented |
| Q8_0 + MTP2 | 421.04 tok/s | 78.86 tok/s | 69.17% | Largest file, more bandwidth-limited |

### 5.3 Long-Context Addendum

The long-context tests also use `GPU + MTP2`, but the prompt is changed to a long-document retrieval task:

- `ctx8k` uses an actual prompt length of about `6616 tokens`
- `ctx32k` uses an actual prompt length of about `26738 tokens`
- To reduce output variance, generation is intentionally short; the model usually reaches `EOS` after `17-23 tokens`
- The table below is based on raw `llama.cpp` timing logs

| Tier | Variant | Prompt tokens | Prompt tok/s | Generation tokens | Generation tok/s | Draft acceptance |
|---|---|---:|---:|---:|---:|---:|
| ctx8k | Q2_K | 6616 | 1304.11 | 16 | 104.41 | 83.33% |
| ctx8k | Q4_K_M | 6616 | 2798.63 | 21 | 31.73 | 60.00% |
| ctx8k | Q6_K | 6616 | 2415.74 | 21 | 69.48 | 60.00% |
| ctx8k | Q8_0 | 6616 | 2143.06 | 21 | 63.78 | 60.00% |
| ctx32k | Q2_K | 26738 | 2450.46 | 17 | 71.41 | 78.57% |
| ctx32k | Q4_K_M | 26738 | 2846.65 | 23 | 87.42 | 83.33% |
| ctx32k | Q6_K | 26738 | 2620.59 | 17 | 81.02 | 71.43% |
| ctx32k | Q8_0 | 26738 | 3120.27 | 17 | 71.19 | 71.43% |

Long-context observations:

- `Q4_K_M` remains the most balanced option in this long-context setup
- `Q6_K` still delivers `81 tok/s` generation at `ctx32k`, making it a good quality-first choice
- `Q8_0` shows strong prompt throughput at `ctx32k`, but generation still does not clearly outperform `Q6_K`
- `Q2_K` remains usable for long context, but it is still better suited for extreme compression than for default distribution

Conclusion:

- On this Blackwell workstation GPU, `Q4_K_M` remains the best-balanced variant
- `Q2_K` has the highest generation speed, but it is also the most aggressive in compression and quality trade-off
- `Q6_K` is more stable in acceptance and is a better high-quality option
- `Q8_0` is not guaranteed to be faster, indicating clear bandwidth limits in this setup

### 5.4 Same-Machine Deployment Comparison vs Local Original `qwen3.6-27b`

This section presents a same-machine deployment comparison against the local original `qwen3.6-27b` served on port `1234`. The purpose is to illustrate response efficiency, correctness, and output-budget allocation under a local deployment workflow, rather than to claim a strict cross-hardware or cross-framework benchmark result.

- Comparison target: `qwen3.6-27b` on `http://127.0.0.1:1234/v1/chat/completions`
- Release representative: `Q4_K_M + MTP2`
- Prompt set: `4` mostly objective prompts covering logic, a `sqrt(2)` proof, literary recall, and long-context retrieval
- Measurement note: the GGUF latency in the figure below includes `llama-cli` cold start, so this is a conservative comparison for the release

Key numbers:

- Average wall time: release `10.09s`, original `10.93s`
- Correctness: release `4/4`, original `3/4`
- Original hidden reasoning overhead: `9002` `reasoning_content` characters across the 4 prompts
- Release throughput: average prompt `1035.1 tok/s`, average generation `118.1 tok/s`

![Release vs original efficiency comparison](release_vs_original_efficiency.png)

Observations:

- On this prompt set, the release is faster on average even though the GGUF side is measured with a cold start every run
- On the `sqrt(2)` proof prompt, the original spent a large amount of budget on hidden reasoning and did not reliably finish the final concise answer within the configured limit
- The release follows a direct-answer path and is better aligned with the goal of efficient local deployment
- If the release is deployed as a persistent local service instead of starting `llama-cli` per request, latency is typically lower than what is shown here

This comparison is not meant to be a formal academic benchmark. It answers a more practical question: on the same machine, can a publishable local GGUF release preserve correctness while delivering better response efficiency? In this test, the answer is yes.

## 6. Quality Validation

Two kinds of validation were performed:

1. Loadability validation
   - `Q2_K / Q4_K_M / Q6_K / Q8_0` all passed the `GGUF` header check
   - All four quantization variants can be loaded with GPU `draft-mtp`
2. Same-prompt logic validation
   - `Q2_K / Q4_K_M / Q6_K / Q8_0` all follow the same reasoning direction on the same truth/lie logic puzzle
   - The stable answer is:
     - A is lying
     - B is telling the truth
     - C is lying

Quality assessment:

| Variant | Quality conclusion | Recommendation |
|---|---|---|
| `Q2_K` | Usable, but the most aggressive compression with the largest quality loss | Only recommended for extreme compression scenarios |
| `Q4_K_M` | Best overall balance of quality and speed | Default recommendation for release |
| `Q6_K` | More stable quality, better for fidelity-oriented use | Recommended as the higher-quality option |
| `Q8_0` | Quality is fine, but speed is not necessarily better than `Q6_K` | Recommended for high-fidelity experiments |

Additional notes:

- In the current PowerShell + CLI environment, passing Chinese prompts directly via command-line arguments may occasionally introduce encoding noise
- Therefore, the main quality comparison in this repo uses an English logic puzzle as the unified benchmark
- For actual Chinese usage, validation through UTF-8 prompt files, API calls, or your own inference service is recommended

## 7. `llama.cpp` Usage

### 7.1 Regular Inference

```bash
./llama-cli \
  -m Qwen3.6-27B-Claude-Opus-Sonnet-DistilledV2-MTP-Q4_K_M.gguf \
  -ngl 999 \
  -c 8192 \
  -p "Introduce yourself briefly in Chinese."
```

### 7.2 Enable MTP

```bash
./llama-cli \
  -m Qwen3.6-27B-Claude-Opus-Sonnet-DistilledV2-MTP-Q4_K_M.gguf \
  -ngl 999 \
  -c 8192 \
  --spec-type draft-mtp \
  --spec-draft-n-max 2 \
  --spec-draft-ngl 999 \
  -p "Explain briefly how MTP works."
```

### 7.3 Recommended Args

Short replies:

```bash
-c 4096 --temp 0 --top-k 1 --spec-draft-n-max 2
```

Long reasoning:

```bash
-c 8192 --temp 0 --top-k 1 --spec-draft-n-max 2
```

`MTP3` can still improve speed in some longer-output cases, but acceptance tends to drop. `MTP2` is the recommended starting point.

## 8. Known Limitations

- Requires a newer `llama.cpp`
- BF16/Q4 exported by older converters may miss the key Qwen3.5/3.6 MTP metadata
- Some Windows CLI environments may corrupt Chinese prompt arguments
- `Q8_0` is not guaranteed to be faster than `Q6_K`, especially on bandwidth-limited GPUs

## 9. Final Recommendation

If you only want to download one file:

- Choose `Q4_K_M`

If you care more about quality:

- Choose `Q6_K`

If you care more about extreme compression:

- Choose `Q2_K`

If you are running higher-fidelity experiments:

- Choose `Q8_0`