---
language:
- en
license: other
base_model:
- Qwen/Qwen3.6-27B
tags:
- gguf
- llama.cpp
- qwen
- mtp
- speculative-decoding
- quantized
pipeline_tag: text-generation
---

# Qwen3.6-27B-Claude-Opus-Sonnet-DistilledV2-MTP-GGUF

This is the GGUF quantized release of the local distilled model `Qwen3.6-27B-Claude-Opus-Sonnet-DistilledV2-MTP`.

The value proposition of this release is straightforward: it preserves the Claude Opus / Sonnet distilled style, enables `MTP` directly in `llama.cpp` for real acceleration, shortens the overly long reasoning chain seen in the local original model, and converts more of the token budget into user-visible answers.

Key points:

- Preserves the Claude Opus / Sonnet distilled response style and organization
- Verified to run `MTP` directly in `llama.cpp` with `--spec-type draft-mtp`
- `Q4_K_M + MTP2` reaches `80.33%` draft acceptance and `114.78 tok/s` generation, versus `69.98 tok/s` for `Q4_K_M + non-MTP`, or about `64%` faster generation
- Compared with the local original model, this release follows a shorter reasoning path; in the same-machine 4-prompt comparison, the original consumed `9002` hidden reasoning characters
- Delivers higher visible-output efficiency per token budget; the same comparison produced `2845` visible answer characters for this release versus `1336` for the original
- Provides four quantization variants: `Q2_K / Q4_K_M / Q6_K / Q8_0`

## 1. Core Value Of This Release

This is not just a generic GGUF export. It is a release that has already been validated for local deployment.

From an end-user perspective, the important points are:

- `MTP` can be enabled directly in `llama.cpp`, rather than existing only as metadata that fails at runtime
- In the tested stack, `MTP2` reaches `80.33%` acceptance, showing that speculative acceleration is actually effective
- Same-machine comparison against the local original `qwen3.6-27b` shows that the original spends more of its budget on an overly long hidden reasoning chain
- This release turns more of the token budget into visible answers, making it better suited for efficient local deployment and interactive use

## 2. Files

| File | Size | Notes |
|---|---:|---|
| `Qwen3.6-27B-Claude-Opus-Sonnet-DistilledV2-MTP-Q2_K.gguf` | 10.12 GB | Most aggressive compression, fastest, largest quality loss |
| `Qwen3.6-27B-Claude-Opus-Sonnet-DistilledV2-MTP-Q4_K_M.gguf` | 15.66 GB | Best overall balance, default recommendation |
| `Qwen3.6-27B-Claude-Opus-Sonnet-DistilledV2-MTP-Q6_K.gguf` | 20.89 GB | More quality-oriented, still reasonably fast |
| `Qwen3.6-27B-Claude-Opus-Sonnet-DistilledV2-MTP-Q8_0.gguf` | 27.05 GB | Closer to high precision, heavier bandwidth pressure |

## 3. Compatibility

Verified with:

- Windows CUDA build of `llama.cpp`
- GPU: NVIDIA RTX PRO 6000 Blackwell Workstation Edition 96 GB
- `llama-cli`
- `--spec-type draft-mtp`
- `--spec-draft-n-max 2`
- `-ngl 999`

Note: you need a newer `llama.cpp` build that includes Qwen3.5/3.6 MTP support. Older conversion pipelines may miss the required metadata and fail with `failed to create MTP context`.

## 4. Recommended Variant

- `Q4_K_M`: default recommendation, best speed/quality balance
- `Q6_K`: recommended if you care more about quality
- `Q2_K`: use when VRAM or disk space is very limited
- `Q8_0`: use for higher-fidelity experiments, but it is not always faster

## 5. GPU + MTP2 Benchmarks

Test environment:

- GPU: RTX PRO 6000 Blackwell 96 GB
- Backend: CUDA
- Args: `-ngl 999 --spec-type draft-mtp --spec-draft-n-max 2 --spec-draft-ngl 999`
- Logic puzzle: three-person truth/lie reasoning task, `n=160`
- `5.2` is the short-context benchmark
- `5.3` is the long-context benchmark

### 5.1 Historical Reference

| Variant | Prompt | Generation | Draft acceptance |
|---|---:|---:|---:|
| BF16 + MTP2 | 20.49 tok/s | 0.85 tok/s | 76.80% |
| Q4_K_M + non-MTP | 796.22 tok/s | 69.98 tok/s | - |
| Q4_K_M + MTP2 | 240.55 tok/s | 114.78 tok/s | 80.33% |
| Q4_K_M + MTP3 | 390.77 tok/s | 117.16 tok/s | 69.48% |

### 5.2 Current Quantization Comparison

| Variant | Prompt | Generation | Draft acceptance | Notes |
|---|---:|---:|---:|---|
| Q2_K + MTP2 | 439.73 tok/s | 118.01 tok/s | 68.66% | Fastest generation, but most aggressive compression |
| Q4_K_M + MTP2 | 240.55 tok/s | 114.78 tok/s | 80.33% | Default recommendation |
| Q6_K + MTP2 | 503.87 tok/s | 99.85 tok/s | 78.86% | More quality-oriented |
| Q8_0 + MTP2 | 421.04 tok/s | 78.86 tok/s | 69.17% | Largest file, more bandwidth-limited |

### 5.3 Long-Context Addendum

The long-context tests also use `GPU + MTP2`, but the prompt is changed to a long-document retrieval task:

- `ctx8k` uses an actual prompt length of about `6616 tokens`
- `ctx32k` uses an actual prompt length of about `26738 tokens`
- To reduce output variance, generation is intentionally short; the model usually reaches `EOS` after `17-23 tokens`
- The table below is based on raw `llama.cpp` timing logs

| Tier | Variant | Prompt tokens | Prompt tok/s | Generation tokens | Generation tok/s | Draft acceptance |
|---|---|---:|---:|---:|---:|---:|
| ctx8k | Q2_K | 6616 | 1304.11 | 16 | 104.41 | 83.33% |
| ctx8k | Q4_K_M | 6616 | 2798.63 | 21 | 31.73 | 60.00% |
| ctx8k | Q6_K | 6616 | 2415.74 | 21 | 69.48 | 60.00% |
| ctx8k | Q8_0 | 6616 | 2143.06 | 21 | 63.78 | 60.00% |
| ctx32k | Q2_K | 26738 | 2450.46 | 17 | 71.41 | 78.57% |
| ctx32k | Q4_K_M | 26738 | 2846.65 | 23 | 87.42 | 83.33% |
| ctx32k | Q6_K | 26738 | 2620.59 | 17 | 81.02 | 71.43% |
| ctx32k | Q8_0 | 26738 | 3120.27 | 17 | 71.19 | 71.43% |

Long-context observations:

- `Q4_K_M` remains the most balanced option in this long-context setup
- `Q6_K` still delivers `81 tok/s` generation at `ctx32k`, making it a good quality-first choice
- `Q8_0` shows strong prompt throughput at `ctx32k`, but generation still does not clearly outperform `Q6_K`
- `Q2_K` remains usable for long context, but it is still better suited for extreme compression than for default distribution

Conclusion:

- On this Blackwell workstation GPU, `Q4_K_M` remains the best-balanced variant
- `Q2_K` has the highest generation speed, but it is also the most aggressive in compression and quality trade-off
- `Q6_K` is more stable in acceptance and is a better high-quality option
- `Q8_0` is not guaranteed to be faster, indicating clear bandwidth limits in this setup

### 5.4 Same-Machine Deployment Comparison vs Local Original `qwen3.6-27b`

This section presents a same-machine deployment comparison against the local original `qwen3.6-27b` served on port `1234`. The purpose is to illustrate response efficiency, correctness, and output-budget allocation under a local deployment workflow, rather than to claim a strict cross-hardware or cross-framework benchmark result.

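As a rough illustration of what output-budget allocation means here, the sketch below tallies visible answer characters against hidden `reasoning_content` characters across a set of chat-completion responses. It assumes an OpenAI-compatible response shape (`choices[0].message`); the `BudgetSplit` and `tally` names are hypothetical, the sample data is made up, and this is not the script used to produce the numbers in this section.

```python
from dataclasses import dataclass


@dataclass
class BudgetSplit:
    """Character totals for visible answers and hidden reasoning."""
    visible_chars: int
    hidden_chars: int

    @property
    def visible_share(self) -> float:
        # Fraction of all emitted characters that the user actually sees.
        total = self.visible_chars + self.hidden_chars
        return self.visible_chars / total if total else 0.0


def tally(responses: list[dict]) -> BudgetSplit:
    """Sum visible and hidden characters over parsed chat-completion bodies."""
    visible = 0
    hidden = 0
    for resp in responses:
        # Assumption: OpenAI-compatible layout with one choice per response.
        message = resp["choices"][0]["message"]
        visible += len(message.get("content") or "")
        # Assumption: hidden reasoning, if any, is in `reasoning_content`.
        hidden += len(message.get("reasoning_content") or "")
    return BudgetSplit(visible, hidden)


if __name__ == "__main__":
    # Made-up responses, only to show the input shape.
    sample = [
        {"choices": [{"message": {"content": "A lies.", "reasoning_content": "Suppose A is honest..."}}]},
        {"choices": [{"message": {"content": "B tells the truth."}}]},
    ]
    split = tally(sample)
    print(split, f"visible share = {split.visible_share:.2%}")
```

Feeding it the parsed JSON bodies returned for each prompt would yield totals of the same kind as the character counts reported in this comparison.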
- Comparison target: `qwen3.6-27b` on `http://127.0.0.1:1234/v1/chat/completions`
- Release representative: `Q4_K_M + MTP2`
- Prompt set: `4` mostly objective prompts covering logic, a `sqrt(2)` proof, literary recall, and long-context retrieval
- Measurement note: the GGUF latency in the figure below includes `llama-cli` cold start, so this is a conservative comparison for the release

Key numbers:

- Average wall time: release `10.09s`, original `10.93s`
- Correctness: release `4/4`, original `3/4`
- Original hidden reasoning overhead: `9002` `reasoning_content` characters across the 4 prompts
- Release throughput: average prompt `1035.1 tok/s`, average generation `118.1 tok/s`

![Release vs original efficiency comparison](release_vs_original_efficiency.png)

Observations:

- On this prompt set, the release is faster on average even though the GGUF side is measured with a cold start on every run
- On the `sqrt(2)` proof prompt, the original spent a large amount of budget on hidden reasoning and did not reliably finish the final concise answer within the configured limit
- The release follows a direct-answer path and is better aligned with the goal of efficient local deployment
- If the release is deployed as a persistent local service instead of starting `llama-cli` per request, latency is typically lower than what is shown here

This comparison is not meant to be a formal academic benchmark. It answers a more practical question: on the same machine, can a publishable local GGUF release preserve correctness while delivering better response efficiency? In this test, the answer is yes.

## 6. Quality Validation

Two kinds of validation were performed:

1. Loadability validation
   - `Q2_K / Q4_K_M / Q6_K / Q8_0` all passed the `GGUF` header check
   - All four quantization variants can be loaded with GPU `draft-mtp`
2. Same-prompt logic validation
   - `Q2_K / Q4_K_M / Q6_K / Q8_0` all follow the same reasoning direction on the same truth/lie logic puzzle
   - The stable answer is:
     - A is lying
     - B is telling the truth
     - C is lying

Quality assessment:

| Variant | Quality conclusion | Recommendation |
|---|---|---|
| `Q2_K` | Usable, but the most aggressive compression with the largest quality loss | Only recommended for extreme compression scenarios |
| `Q4_K_M` | Best overall balance of quality and speed | Default recommendation for release |
| `Q6_K` | More stable quality, better for fidelity-oriented use | Recommended as the higher-quality option |
| `Q8_0` | Quality is fine, but speed is not necessarily better than `Q6_K` | Recommended for high-fidelity experiments |

Additional notes:

- In the current PowerShell + CLI environment, passing Chinese prompts directly via command-line arguments may occasionally introduce encoding noise
- Therefore, the main quality comparison in this repo uses an English logic puzzle as the unified benchmark
- For actual Chinese usage, validation through UTF-8 prompt files, API calls, or your own inference service is recommended

## 7. `llama.cpp` Usage

### 7.1 Regular Inference

```bash
# Plain generation with all layers offloaded to the GPU and an 8K context
./llama-cli \
  -m Qwen3.6-27B-Claude-Opus-Sonnet-DistilledV2-MTP-Q4_K_M.gguf \
  -ngl 999 \
  -c 8192 \
  -p "Introduce yourself briefly in Chinese."
```

### 7.2 Enable MTP

```bash
# Same as above, plus MTP speculative decoding with up to 2 draft tokens
./llama-cli \
  -m Qwen3.6-27B-Claude-Opus-Sonnet-DistilledV2-MTP-Q4_K_M.gguf \
  -ngl 999 \
  -c 8192 \
  --spec-type draft-mtp \
  --spec-draft-n-max 2 \
  --spec-draft-ngl 999 \
  -p "Explain briefly how MTP works."
```

### 7.3 Recommended Args

Short replies:

```bash
-c 4096 --temp 0 --top-k 1 --spec-draft-n-max 2
```

Long reasoning:

```bash
-c 8192 --temp 0 --top-k 1 --spec-draft-n-max 2
```

`MTP3` can still improve speed in some longer-output cases, but acceptance tends to drop. `MTP2` is the recommended starting point.

## 8. Known Limitations

- Requires a newer `llama.cpp` build that includes Qwen3.5/3.6 MTP support
- BF16/Q4 files exported by older converters may miss the key Qwen3.5/3.6 MTP metadata
- Some Windows CLI environments may corrupt Chinese prompt arguments
- `Q8_0` is not guaranteed to be faster than `Q6_K`, especially on bandwidth-limited GPUs

## 9. Final Recommendation

If you only want to download one file:

- Choose `Q4_K_M`

If you care more about quality:

- Choose `Q6_K`

If you care more about extreme compression:

- Choose `Q2_K`

If you are running higher-fidelity experiments:

- Choose `Q8_0`
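
For the Chinese-prompt encoding issue noted under Known Limitations, one possible workaround is to pass the prompt through a UTF-8 file instead of a command-line argument. The sketch below writes the prompt to a file and builds a `llama-cli` argument list around it. It assumes your `llama-cli` build accepts `-f` for a prompt file; the helper names `write_prompt_file` and `build_command` are hypothetical, and the MTP flags are the ones listed in the Compatibility section.

```python
import tempfile
from pathlib import Path


def write_prompt_file(prompt: str, directory: str) -> Path:
    """Store the prompt as UTF-8 so the shell never has to encode it."""
    path = Path(directory) / "prompt.txt"
    path.write_text(prompt, encoding="utf-8")
    return path


def build_command(cli: str, model: str, prompt_file: Path, use_mtp: bool = True) -> list[str]:
    """Build an argument list; pass it to subprocess.run to launch llama-cli."""
    # Assumption: `-f` reads the prompt from a file in this llama-cli build.
    cmd = [cli, "-m", model, "-ngl", "999", "-c", "8192", "-f", str(prompt_file)]
    if use_mtp:
        # MTP flags as verified in the Compatibility section of this card.
        cmd += ["--spec-type", "draft-mtp",
                "--spec-draft-n-max", "2",
                "--spec-draft-ngl", "999"]
    return cmd


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        prompt_file = write_prompt_file("请用中文简要介绍一下你自己。", tmp)
        cmd = build_command(
            "./llama-cli",
            "Qwen3.6-27B-Claude-Opus-Sonnet-DistilledV2-MTP-Q4_K_M.gguf",
            prompt_file,
        )
        # Printed only; hand `cmd` to subprocess.run(cmd, check=True) to execute.
        print(cmd)
```

Passing the arguments as a list rather than a single shell string also avoids a second round of quoting and code-page conversion in PowerShell.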