Qwen3.6-27B-uncensored-abliterated-MTP-i1-IQ4_XS-GGUF(Smaller)

♥ MTP Inference-Accelerated Model Optimized for 16GB VRAM GPUs ♥

This is a native MTP (Multi-Token Prediction) capable model, extracted from the dense backbone of llmfan46/Qwen3.6-35B-A3B-uncensored-heretic-Native-MTP-Preserved-GGUF and then quantized. It supports longer contexts, is uncensored (abliterated), and significantly increases decoding throughput (tokens/s).

For use cases requiring longer contexts (e.g., 128K+) at approximately 20 tokens/s inference speed, consider this model: https://huggingface.co/lemonyins/Qwen3.6-27B-uncensored-abliterated-i1-IQ4_XS-GGUF-Smaller


Key Highlights

  • MTP Speculative Decoding: Native Multi-Token Prediction draft generation raises inference speed from about 20 to 35 tokens/s (roughly 75% faster)
  • High Speed at Long Contexts: 20 tokens/s at 50K context, twice the 10 tokens/s of the non-MTP model
  • 70% Draft Acceptance Rate: spec-draft-n-max=2 is optimal; higher values do not improve acceptance
  • 16GB VRAM, up to 60K context: Fully fits on a single GPU with TurboQuant KV Cache (turbo4)
  • FFN Layer IQ3_S Mixed Precision: Further reduces model size, freeing VRAM for KV Cache
  • Uncensored Model: Abliterated to remove content restrictions, suitable for deep research

Innovation

This model inherits the mixed-precision quantization strategy of Qwen3.6-27B-uncensored-abliterated-i1-IQ4_XS-GGUF-Smaller: the attn_qkv, attn_k, attn_v, attn_output, and output layers stay at IQ4_XS, while the ffn_down, ffn_up, and ffn_gate layers are reduced to IQ3_S. The main addition on top of this is MTP support. Because the base model preserves the native MTP Head, inference can generate several draft tokens in parallel; the target model then verifies them and accepts the matching ones in a single batch, which significantly reduces the number of serial decoding steps.
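The draft-and-verify loop described above can be illustrated with a toy sketch. This is not llama.cpp's implementation: `toy_target` and `toy_draft` are hypothetical stand-ins for the target model and the MTP Head, decoding is greedy, and the batched verification pass is simulated by per-position calls that are counted as one pass.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]


def speculative_generate(
    target: NextToken,
    draft: NextToken,
    prompt: List[Token],
    n_draft: int,
    n_new: int,
) -> Tuple[List[Token], int]:
    """Greedy draft-and-verify decoding; returns (new tokens, verification passes)."""
    seq = list(prompt)
    passes = 0
    while len(seq) - len(prompt) < n_new:
        # 1. Draft: cheaply propose n_draft tokens.
        proposal: List[Token] = []
        for _ in range(n_draft):
            proposal.append(draft(seq + proposal))

        # 2. Verify: a real engine scores all drafts in one batched target pass.
        #    This toy calls the target per position but counts a single pass.
        passes += 1
        accepted: List[Token] = []
        for tok in proposal:
            expected = target(seq + accepted)
            accepted.append(expected)  # the target's token is always kept
            if tok != expected:        # first mismatch ends the accepted run
                break
        else:
            # Every draft matched, so the same pass yields one extra token.
            accepted.append(target(seq + accepted))
        seq.extend(accepted)
    return seq[len(prompt):len(prompt) + n_new], passes


# Hypothetical toy models: the target counts upward modulo 10; the draft
# agrees with it except when the sequence length is a multiple of 4.
def toy_target(seq: List[Token]) -> Token:
    return (seq[-1] + 1) % 10


def toy_draft(seq: List[Token]) -> Token:
    step = 2 if len(seq) % 4 == 0 else 1
    return (seq[-1] + step) % 10


tokens, passes = speculative_generate(toy_target, toy_draft, [0], n_draft=2, n_new=12)
print(tokens, passes)
```

The output is identical to plain greedy decoding with the target alone; only the number of target passes drops, which is where the speedup comes from.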


MTP Inference Performance

Tested on: NVIDIA RTX 4060 Ti 16GB, llama.cpp (turboquant + mtp branch)

| Scenario | Speed |
| --- | --- |
| Short context (non-MTP model) | 19 tokens/s |
| Short context (MTP model) | 35 tokens/s |
| Long context 50K (non-MTP model) | 10 tokens/s |
| Long context 50K (MTP model) | 20 tokens/s |
| Draft acceptance rate | 70% |
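
As a quick check of the figures above, the speedup per scenario can be computed directly from the measured throughput. The numbers are copied from the table; the dictionary layout and function name are only illustrative. Note that with the measured 19 tokens/s baseline the short-context gain comes out slightly higher than the 75% quoted in the highlights, which rounds the baseline to 20 tokens/s.

```python
# Throughput figures copied from the table above (tokens/s).
measured = {
    "short context": {"non_mtp": 19.0, "mtp": 35.0},
    "50K context": {"non_mtp": 10.0, "mtp": 20.0},
}


def speedup(non_mtp_tps: float, mtp_tps: float) -> float:
    """Ratio of MTP throughput to non-MTP throughput."""
    return mtp_tps / non_mtp_tps


for scenario, tps in measured.items():
    ratio = speedup(tps["non_mtp"], tps["mtp"])
    print(f"{scenario}: {ratio:.2f}x ({(ratio - 1) * 100:.0f}% faster)")
```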

Memory Usage (TurboQuant KV Cache)

| Version | Context Length | KV Cache | VRAM Usage |
| --- | --- | --- | --- |
| IQ4_XS-FFN-IQ3_S (this model) | 60K | kv=turbo4 | ~15.4 GB |
| IQ4_XS-FFN-IQ3_S (this model) | 48K | kv=turbo4 | ~15.2 GB |
| IQ4_XS-FFN-IQ3_S (this model) | 32K | k=q8_0,v=turbo4 | ~15.3 GB |
  • Note: In testing, a 48K context was more stable and less likely to cause out-of-memory errors than 60K.
  • Note: llama.cpp automatically upgrades cache-type-k to q8_0, which limits context to ~32K on the same VRAM budget. See the Run Command section for the solution.

KV Cache Precision Comparison (Turbo4 vs q8_0)

By setting TURBO_AUTO_ASYMMETRIC=0, the KV Cache uses the turbo4 format instead of the auto-upgraded q8_0, providing significant VRAM savings with minimal perplexity impact:

English novel test:

| KV Cache Config | Perplexity | Difference |
| --- | --- | --- |
| k=q8_0 + v=turbo4 | 1.3436 +/- 0.00539 | Baseline |
| kv=turbo4 | 1.3536 +/- 0.00551 | +0.74% only |

Code test:

| KV Cache Config | Perplexity | Difference |
| --- | --- | --- |
| k=q8_0 + v=turbo4 | 1.2312 +/- 0.00157 | Baseline |
| kv=turbo4 | 1.2322 +/- 0.00157 | +0.08% only |

Conclusion: kv=turbo4 delivers significant VRAM savings with minimal perplexity loss (0.1%–0.7%), making 60K context feasible.
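
The percentage differences quoted above follow from the perplexity values themselves. A minimal sketch of the arithmetic, using the mean perplexities from the two tests (the function name is illustrative and the +/- error margins are ignored here):

```python
def relative_increase_pct(baseline: float, candidate: float) -> float:
    """Perplexity increase of `candidate` over `baseline`, in percent."""
    return (candidate - baseline) / baseline * 100.0


# (baseline k=q8_0 + v=turbo4, kv=turbo4), copied from the tests above.
results = {
    "English novel": (1.3436, 1.3536),
    "Code": (1.2312, 1.2322),
}

for name, (base, cand) in results.items():
    print(f"{name}: +{relative_increase_pct(base, cand):.2f}%")
```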


Methodology

  1. Base model: llmfan46/Qwen3.6-35B-A3B-uncensored-heretic-Native-MTP-Preserved-GGUF — an uncensored GGUF with native MTP Head preserved
  2. Extraction and quantization: the dense backbone (27B) is extracted and quantized with mixed precision using the TurboQuant technology stack
  3. Quantization types:
    • attn_qkv, attn_k, attn_v, attn_output, output: IQ4_XS
    • ffn_down, ffn_up, ffn_gate: IQ3_S
    • Other layers: default IQ4_XS

Run Command

16GB VRAM | 60K Context | MTP Acceleration

set TURBO_AUTO_ASYMMETRIC=0

llama-server.exe ^
  -m Qwen3.6-27B-uncensored-abliterated-MTP-i1-IQ4_XS-FFN-IQ3.gguf ^
  --parallel 1 ^
  --spec-type mtp ^
  --spec-draft-n-max 2 ^
  -c 61440 ^
  -ngl 999 ^
  --flash-attn on ^
  -ctk turbo4 ^
  -ctv turbo4 ^
  --host 0.0.0.0 ^
  --port 1234
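
For non-Windows shells or scripted launches, the same invocation can be assembled in Python. This is a sketch under assumptions: the flags and the environment variable are taken from the command above, while the helper names and the default binary name `llama-server` (without `.exe`) are hypothetical and depend on how your fork was built.

```python
import os
import subprocess
from typing import Dict, List

MODEL = "Qwen3.6-27B-uncensored-abliterated-MTP-i1-IQ4_XS-FFN-IQ3.gguf"


def build_server_command(
    binary: str = "llama-server",  # assumption: adjust to your build's binary path
    model: str = MODEL,
    context: int = 61440,
    draft_max: int = 2,
    port: int = 1234,
) -> List[str]:
    """Return the argument list matching the run command above."""
    return [
        binary,
        "-m", model,
        "--parallel", "1",
        "--spec-type", "mtp",
        "--spec-draft-n-max", str(draft_max),
        "-c", str(context),
        "-ngl", "999",
        "--flash-attn", "on",
        "-ctk", "turbo4",
        "-ctv", "turbo4",
        "--host", "0.0.0.0",
        "--port", str(port),
    ]


def build_environment() -> Dict[str, str]:
    """Copy the current environment and disable the automatic K cache upgrade."""
    env = dict(os.environ)
    env["TURBO_AUTO_ASYMMETRIC"] = "0"
    return env


def launch(**kwargs) -> subprocess.Popen:
    """Start the server as a child process (not called automatically here)."""
    return subprocess.Popen(build_server_command(**kwargs), env=build_environment())


print(" ".join(build_server_command()))
```

Calling `launch(context=49152)` would start the server with the 48K context that the memory notes describe as more stable.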

Key Parameter Descriptions

| Parameter | Description |
| --- | --- |
| --spec-type mtp | Enable MTP speculative decoding mode |
| --spec-draft-n-max 2 | Max draft tokens; 2 is optimal (higher values do not improve acceptance rate in testing) |
| -ctk turbo4 / -ctv turbo4 | Use turbo4 format for Key/Value Cache; requires TURBO_AUTO_ASYMMETRIC=0 to take effect |
| set TURBO_AUTO_ASYMMETRIC=0 | Prevents automatic K Cache upgrade to q8_0, ensuring turbo4 is used and saving VRAM |
| --flash-attn on | Enable Flash Attention for speedup |
| -c 61440 | 60K context window |

About spec-draft-n-max

Extensive testing shows that --spec-draft-n-max 2 is the optimal configuration. The draft acceptance rate saturates at ~70%; increasing the draft count to 3 or higher does not improve actual output speed and only adds computational overhead.
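
One way to see why larger draft counts pay off less is a simplified textbook model of speculative decoding. It is not the author's measurement, and it assumes that each draft token is accepted independently with the same probability; the reported 70% rate may be defined differently. Under that assumption, a verification pass with n drafts emits 1 + p + p^2 + ... + p^n tokens on average.

```python
def expected_tokens_per_pass(p_accept: float, n_draft: int) -> float:
    """Expected tokens per target pass, assuming independent acceptance.

    One token always comes from the target itself; draft k is kept only if
    all k drafts up to it were accepted, which has probability p_accept**k.
    """
    return sum(p_accept ** k for k in range(n_draft + 1))


# Assumption: p_accept = 0.7, taken from the reported ~70% acceptance rate.
for n in range(1, 5):
    print(f"n_draft={n}: {expected_tokens_per_pass(0.7, n):.3f} tokens/pass")
```

Each extra draft adds less than the previous one (0.49, then 0.343, then about 0.24 tokens per pass) while still costing draft and verification compute, which is consistent in direction with the observation that values above 2 do not help in practice.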


Runtime Requirements

You need a llama.cpp fork that supports both TurboQuant and MTP.

The lemonyins build fixes the TURBO_AUTO_ASYMMETRIC logic and works out of the box, so you do not need to set the environment variable manually.


Caveats

  • MTP is essential for speedup: You must use an MTP-capable llama.cpp fork and specify --spec-type mtp, otherwise the MTP Head will not be activated
  • TurboQuant is mandatory: Without TurboQuant KV Cache, 16GB VRAM cannot support 60K context
  • Environment variable required: If using a non-lemonyins build, you must set TURBO_AUTO_ASYMMETRIC=0 first; otherwise K Cache will be auto-upgraded to q8_0 and VRAM will be insufficient for 60K
  • Vision module removed: There is insufficient VRAM to load the vision module, so this model is for text-only inference acceleration. For vision support, use: https://huggingface.co/lemonyins/Qwen3.6-27B-uncensored-abliterated-i1-IQ4_XS-GGUF-Smaller

Acknowledgments
