---
license: apache-2.0
base_model:
- llmfan46/Qwen3.6-35B-A3B-uncensored-heretic-Native-MTP-Preserved-GGUF
tags:
- Qwen3.6-27B
- abliterated
- Uncensored
- MTP
- Multi-Token-Prediction
- TurboQuant
- Speculative-Decoding
- gguf
- IQ4_XS
pipeline_tag: text-generation
---

# Qwen3.6-27B-uncensored-abliterated-MTP-i1-IQ4_XS-GGUF(Smaller)

**♥ MTP Inference-Accelerated Model Optimized for 16GB VRAM GPUs ♥**

This model is a **native MTP (Multi-Token Prediction) capable** build, extracted from the Dense backbone of [llmfan46/Qwen3.6-35B-A3B-uncensored-heretic-Native-MTP-Preserved-GGUF](https://hf-mirror.com/llmfan46/Qwen3.6-35B-A3B-uncensored-heretic-Native-MTP-Preserved-GGUF) and quantized. It supports longer contexts, is **uncensored (abliterated)**, and significantly boosts per-token inference speed.

For use cases requiring longer contexts (e.g., 128K+) at approximately 20 tokens/s, consider this model: https://huggingface.co/lemonyins/Qwen3.6-27B-uncensored-abliterated-i1-IQ4_XS-GGUF-Smaller

---

## Key Highlights

- **MTP Speculative Decoding**: Native Multi-Token Prediction draft generation boosts short-context inference from roughly **20 → 35 tokens/s** (about a 75% improvement)
- **High Speed at Long Contexts**: **20 tokens/s** at 50K context, **2× faster** than the non-MTP model (only 10 tokens/s)
- **70% Draft Acceptance Rate**: `spec-draft-n-max=2` is optimal; higher values do not improve acceptance
- **16GB VRAM, up to 60K context**: Fits entirely on a single GPU with TurboQuant KV Cache (turbo4)
- **FFN Layer IQ3_S Mixed Precision**: Further reduces model size, freeing VRAM for KV Cache
- **Uncensored Model**: Abliterated to remove content restrictions, suitable for deep research

---

## Innovation

This model inherits the mixed-precision quantization strategy from [Qwen3.6-27B-uncensored-abliterated-i1-IQ4_XS-GGUF-Smaller](https://huggingface.co/lemonyins/Qwen3.6-27B-uncensored-abliterated-i1-IQ4_XS-GGUF-Smaller): the `attn_qkv` / `attn_k` / `attn_v` / `attn_output` / `output` layers remain at **IQ4_XS**, while the `ffn_down` / `ffn_up` / `ffn_gate` layers are downgraded to **IQ3_S**.

On top of this, **the core breakthrough is MTP support**: the base model preserves the native MTP Head, which drafts multiple tokens in parallel during inference. The target model verifies the drafts and accepts them in one batch, significantly reducing the number of serial decoding steps.

---

## MTP Inference Performance

Tested on: **NVIDIA RTX 4060 Ti 16GB**, llama.cpp (turboquant + mtp branch)

| Scenario | Speed |
| :--- | :--- |
| Short context (non-MTP model) | 19 tokens/s |
| Short context (MTP model) | **35 tokens/s** |
| Long context 50K (non-MTP model) | 10 tokens/s |
| Long context 50K (MTP model) | **20 tokens/s** |
| Draft acceptance rate | **70%** |

---

## Memory Usage (TurboQuant KV Cache)

| Version | Context Length | KV Cache | VRAM Usage |
| :--- | :--- | :--- | :--- |
| `IQ4_XS-FFN-IQ3_S` (this model) | 60K | kv=turbo4 | **~15.4 GB** |
| `IQ4_XS-FFN-IQ3_S` (this model) | 48K | kv=turbo4 | **~15.2 GB** |
| `IQ4_XS-FFN-IQ3_S` (this model) | 32K | k=q8_0,v=turbo4 | **~15.3 GB** |

- **Note**: In testing, a 48K context is more stable and less likely to cause out-of-memory errors.
- **Note**: llama.cpp automatically upgrades `cache-type-k` to `q8_0`, which limits context to ~32K on the same VRAM budget. See the [Run Command](#run-command) section for the solution.

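
As a quick sanity check on the throughput figures in the MTP Inference Performance table, the relative speedup can be recomputed from the raw tokens/s values. The sketch below is plain arithmetic, and the helper name `speedup_pct` is made up for illustration. With the table's short-context baseline of 19 tokens/s the gain comes out at roughly 84%, while the 75% quoted in Key Highlights corresponds to a 20 tokens/s baseline; the 50K-context figures give exactly 100%, i.e. the stated 2×.

```shell
#!/bin/sh
# Relative speedup in percent: (new / old - 1) * 100.
# Hypothetical helper; the inputs are the tokens/s figures quoted in this card.
speedup_pct() {
  awk -v old="$1" -v new="$2" 'BEGIN { printf "%.0f\n", (new / old - 1) * 100 }'
}

speedup_pct 19 35   # short context, baseline from the table      -> 84
speedup_pct 20 35   # short context, baseline from Key Highlights -> 75
speedup_pct 10 20   # 50K context                                 -> 100
```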
---

## KV Cache Precision Comparison (Turbo4 vs q8_0)

With `TURBO_AUTO_ASYMMETRIC=0` set, the KV Cache uses the `turbo4` format instead of the auto-upgraded `q8_0`, providing significant VRAM savings with minimal perplexity impact:

**English novel test:**

| KV Cache Config | Perplexity | Difference |
| :--- | :--- | :--- |
| k=q8_0 + v=turbo4 | 1.3436 +/- 0.00539 | Baseline |
| **kv=turbo4** | **1.3536 +/- 0.00551** | **+0.74%** |

**Code test:**

| KV Cache Config | Perplexity | Difference |
| :--- | :--- | :--- |
| k=q8_0 + v=turbo4 | 1.2312 +/- 0.00157 | Baseline |
| **kv=turbo4** | **1.2322 +/- 0.00157** | **+0.08%** |

Conclusion: **kv=turbo4 delivers significant VRAM savings with minimal perplexity loss (0.1%–0.7%), making 60K context feasible.**

---

## Methodology

1. **Base model**: [llmfan46/Qwen3.6-35B-A3B-uncensored-heretic-Native-MTP-Preserved-GGUF](https://hf-mirror.com/llmfan46/Qwen3.6-35B-A3B-uncensored-heretic-Native-MTP-Preserved-GGUF), an uncensored GGUF with the native MTP Head preserved
2. **Extraction and quantization**: Dense backbone extracted (27B), then quantized with the TurboQuant technology stack using mixed precision
3. **Quantization types**:
   - `attn_qkv`, `attn_k`, `attn_v`, `attn_output`, `output`: `IQ4_XS`
   - `ffn_down`, `ffn_up`, `ffn_gate`: `IQ3_S`
   - Other layers: default `IQ4_XS`

---

## Run Command

### 16GB VRAM | 60K Context | MTP Acceleration

```bat
rem Windows cmd syntax. Keep the K cache at turbo4 instead of the auto-upgraded q8_0.
set TURBO_AUTO_ASYMMETRIC=0
rem Start the server with MTP speculative decoding and a 60K context window.
llama-server.exe ^
  -m Qwen3.6-27B-uncensored-abliterated-MTP-i1-IQ4_XS-FFN-IQ3.gguf ^
  --parallel 1 ^
  --spec-type mtp ^
  --spec-draft-n-max 2 ^
  -c 61440 ^
  -ngl 999 ^
  --flash-attn on ^
  -ctk turbo4 ^
  -ctv turbo4 ^
  --host 0.0.0.0 ^
  --port 1234
```

### Key Parameter Descriptions

| Parameter | Description |
| :--- | :--- |
| `--spec-type mtp` | Enable MTP speculative decoding mode |
| `--spec-draft-n-max 2` | Max draft tokens; **2** is optimal (higher values do not improve acceptance rate in testing) |
| `-ctk turbo4` / `-ctv turbo4` | Use turbo4 format for Key/Value Cache; requires `TURBO_AUTO_ASYMMETRIC=0` to take effect |
| `set TURBO_AUTO_ASYMMETRIC=0` | Prevents automatic K Cache upgrade to q8_0, ensuring turbo4 is used and saving VRAM |
| `--flash-attn on` | Enable Flash Attention for speedup |
| `-c 61440` | 60K context window |

### About spec-draft-n-max

In testing, `--spec-draft-n-max 2` was the optimal configuration. The draft acceptance rate saturates at **~70%**; increasing the draft count to 3 or higher does not improve actual output speed and only adds computational overhead.

---

## Runtime Requirements

You need a llama.cpp fork that **supports both TurboQuant and MTP**:

- **Recommended source branch**: [QuinsZouls/llama-cpp-turboquant/llama-next](https://github.com/QuinsZouls/llama-cpp-turboquant/tree/llama-next)
- **Precompiled binary download**: [lemonyins/llama-cpp-turboquant-mtp](https://github.com/lemonyins/llama-cpp-turboquant-mtp)

> The precompiled build fixes the `TURBO_AUTO_ASYMMETRIC` logic and works out of the box, with no need to set the environment variable manually.

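
The command in the Run Command section uses Windows `cmd` syntax (`set`, `^` line continuation). On Linux or macOS the same flags could be passed as sketched below. This is an untested translation rather than a command from the original author: it assumes the fork builds a `llama-server` binary in the current directory, and the helper name `build_server_cmd` is hypothetical. The function only prints the command line so that it can be inspected first; to launch, run the printed line in a shell, for example with `eval "$(build_server_cmd 49152)"`.

```shell
#!/bin/sh
# Model file name as used in the Windows command.
MODEL="Qwen3.6-27B-uncensored-abliterated-MTP-i1-IQ4_XS-FFN-IQ3.gguf"

# Print a launch command for the given context size (default 61440 = 60K).
# Assumption: ./llama-server is the fork's server binary on this platform.
build_server_cmd() {
  ctx="${1:-61440}"
  # TURBO_AUTO_ASYMMETRIC=0 keeps the K cache at turbo4 instead of q8_0.
  printf '%s ' env TURBO_AUTO_ASYMMETRIC=0 ./llama-server \
    -m "$MODEL" \
    --parallel 1 \
    --spec-type mtp --spec-draft-n-max 2 \
    -c "$ctx" -ngl 999 --flash-attn on \
    -ctk turbo4 -ctv turbo4 \
    --host 0.0.0.0 --port 1234
  printf '\n'
}

build_server_cmd          # 60K context, as in the original command
build_server_cmd 49152    # 48K context, the setting reported as more stable
```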
---

## Caveats

- **MTP is essential for speedup**: You must use an MTP-capable llama.cpp fork and specify `--spec-type mtp`, otherwise the MTP Head will not be activated
- **TurboQuant is mandatory**: Without TurboQuant KV Cache, 16GB VRAM cannot support 60K context
- **Environment variable required**: If using a non-lemonyins build, you must `set TURBO_AUTO_ASYMMETRIC=0` first; otherwise K Cache will be auto-upgraded to q8_0 and VRAM will be insufficient for 60K
- **Vision module removed**: There is insufficient VRAM to load the vision module, so this model is for text-only inference acceleration. For vision support, use: https://huggingface.co/lemonyins/Qwen3.6-27B-uncensored-abliterated-i1-IQ4_XS-GGUF-Smaller

---

## Acknowledgments

- **[llmfan46](https://hf-mirror.com/llmfan46)**: Providing the native MTP-preserved uncensored base GGUF
- **[QuinsZouls](https://github.com/QuinsZouls)**: Providing the llama.cpp branch supporting both TurboQuant and MTP ([llama-cpp-turboquant/llama-next](https://github.com/QuinsZouls/llama-cpp-turboquant/tree/llama-next))
- **[lemonyins](https://github.com/lemonyins)**: Providing precompiled binaries and fixing the K Cache auto-upgrade issue
- **[llama.cpp](https://github.com/ggml-org/llama.cpp)**: The GGML / llama.cpp team and community