Qwen3.6-27B-uncensored-abliterated-MTP-i1-IQ4_XS-GGUF(Smaller)
♥ MTP Inference-Accelerated Model Optimized for 16GB VRAM GPUs ♥
This model is a native MTP (Multi-Token Prediction) capable version, extracted from the Dense backbone of llmfan46/Qwen3.6-35B-A3B-uncensored-heretic-Native-MTP-Preserved-GGUF and quantized. It supports longer contexts, features uncensored (abliterated) characteristics, and significantly boosts per-token inference speed.
For use cases requiring longer contexts (e.g., 128K+) at approximately 20 tokens/s inference speed, consider this model: https://huggingface.co/lemonyins/Qwen3.6-27B-uncensored-abliterated-i1-IQ4_XS-GGUF-Smaller
Key Highlights
- MTP Speculative Decoding: Native Multi-Token Prediction draft generation boosts short-context inference from roughly 20 to 35 tokens/s (about a 75% improvement)
- High Speed at Long Contexts: 20 tokens/s at 50K context — 2× faster than non-MTP models (only 10 tokens/s)
- 70% Draft Acceptance Rate: spec-draft-n-max=2 is optimal; higher values do not improve acceptance
- 16GB VRAM, up to 60K context: Fully fits on a single GPU with TurboQuant KV Cache (turbo4)
- FFN Layer IQ3_S Mixed Precision: Further reduces model size, freeing VRAM for KV Cache
- Uncensored Model: Abliterated to remove content restrictions, suitable for deep research
Innovation
This model inherits the mixed-precision quantization strategy from Qwen3.6-27B-uncensored-abliterated-i1-IQ4_XS-GGUF-Smaller: attn_qkv / attn_k / attn_v / attn_output / output layers remain at IQ4_XS, while ffn_down / ffn_up / ffn_gate layers are downgraded to IQ3_S. On top of this, the core breakthrough is MTP support — the base model preserves the native MTP Head, enabling parallel generation of multiple draft tokens during inference, which are accepted in one batch after verification by the target model, significantly reducing the number of serial decoding steps.
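The draft-then-verify loop described above can be illustrated with a toy sketch. This is not the llama.cpp implementation: `target_next` and `draft_next` are hypothetical stand-ins for the target model and the MTP head, and verification is written as a plain loop where a real runtime would use one batched forward pass. It only shows why the output matches ordinary greedy decoding while needing fewer serial steps.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]


def target_next(prefix: List[Token]) -> Token:
    """Toy 'target model' (hypothetical): deterministic next token."""
    return (sum(prefix) * 31 + len(prefix)) % 50


def draft_next(prefix: List[Token]) -> Token:
    """Toy 'draft head' (hypothetical): wrong at every 4th position."""
    guess = target_next(prefix)
    return guess if len(prefix) % 4 else (guess + 1) % 50


def greedy_decode(prompt: List[Token], n_new: int, target: NextToken) -> List[Token]:
    """Baseline: one serial target step per generated token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(target(out))
    return out


def speculative_decode(
    prompt: List[Token], n_new: int, n_draft: int,
    target: NextToken, draft: NextToken,
) -> Tuple[List[Token], int]:
    """Return (tokens, number of verification steps)."""
    out = list(prompt)
    steps = 0
    while len(out) - len(prompt) < n_new:
        # 1. Draft up to n_draft tokens cheaply.
        drafted: List[Token] = []
        for _ in range(n_draft):
            drafted.append(draft(out + drafted))
        # 2. Verify against the target (a single batched pass in a real runtime).
        steps += 1
        accepted: List[Token] = []
        for tok in drafted:
            expected = target(out + accepted)
            if tok != expected:
                accepted.append(expected)  # replace the first wrong draft
                break
            accepted.append(tok)
        else:
            accepted.append(target(out + accepted))  # all drafts accepted: bonus token
        out.extend(accepted)
    return out[: len(prompt) + n_new], steps


tokens, steps = speculative_decode([1, 2, 3], 20, 2, target_next, draft_next)
print(f"20 tokens generated in {steps} verification steps")
```

Every emitted token is one the target would have chosen itself, so quality is unchanged; the gain comes only from emitting more than one token per target step when drafts are accepted.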
MTP Inference Performance
Tested on: NVIDIA RTX 4060 Ti 16GB, llama.cpp (turboquant + mtp branch)
| Scenario | Speed |
|---|---|
| Short context (non-MTP model) | 19 tokens/s |
| Short context (MTP model) | 35 tokens/s |
| Long context 50K (non-MTP model) | 10 tokens/s |
| Long context 50K (MTP model) | 20 tokens/s |
| Draft acceptance rate | 70% |
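As a quick check of the table, the speedup ratios can be computed directly from the listed throughput figures. The dictionary layout and function name below are illustrative only; the numbers are the ones reported above.

```python
# Throughput figures copied from the table above (tokens/s).
MEASURED = {
    "short context": {"non_mtp": 19.0, "mtp": 35.0},
    "long context 50K": {"non_mtp": 10.0, "mtp": 20.0},
}


def speedup(non_mtp: float, mtp: float) -> float:
    """Ratio of MTP throughput to non-MTP throughput."""
    return mtp / non_mtp


for scenario, row in MEASURED.items():
    ratio = speedup(**row)
    print(f"{scenario}: {ratio:.2f}x ({(ratio - 1) * 100:.0f}% faster)")
```

With the 19 tokens/s baseline from the table, the short-context gain works out to about 84%; the 75% figure in Key Highlights corresponds to a rounded 20 tokens/s baseline (35 / 20 = 1.75).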
Memory Usage (TurboQuant KV Cache)
| Version | Context Length | KV Cache | VRAM Usage |
|---|---|---|---|
| IQ4_XS-FFN-IQ3_S (this model) | 60K | kv=turbo4 | ~15.4 GB |
| IQ4_XS-FFN-IQ3_S (this model) | 48K | kv=turbo4 | ~15.2 GB |
| IQ4_XS-FFN-IQ3_S (this model) | 32K | k=q8_0,v=turbo4 | ~15.3 GB |
- Note: In testing, a 48K context was more stable and less likely to cause out-of-memory errors.
- Note: llama.cpp automatically upgrades cache-type-k to q8_0, which limits context to ~32K on the same VRAM budget. See the Run Command section for the solution.
KV Cache Precision Comparison (Turbo4 vs q8_0)
By setting TURBO_AUTO_ASYMMETRIC=0, the KV Cache uses the turbo4 format instead of the auto-upgraded q8_0, providing significant VRAM savings with minimal perplexity impact:
English novel test:
| KV Cache Config | Perplexity | Difference |
|---|---|---|
| k=q8_0 + v=turbo4 | 1.3436 +/- 0.00539 | Baseline |
| kv=turbo4 | 1.3536 +/- 0.00551 | +0.74% only |
Code test:
| KV Cache Config | Perplexity | Difference |
|---|---|---|
| k=q8_0 + v=turbo4 | 1.2312 +/- 0.00157 | Baseline |
| kv=turbo4 | 1.2322 +/- 0.00157 | +0.08% only |
Conclusion: kv=turbo4 delivers significant VRAM savings with minimal perplexity loss (0.1%–0.7%), making 60K context feasible.
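The percentage differences in the two tables can be reproduced from the raw perplexity values. The sketch below also compares each gap with the reported error bars; combining the two standard errors as if they were independent is an assumption made here for illustration (both runs use the same text, so the errors are likely correlated), and the function names are hypothetical.

```python
import math

# (perplexity, reported +/- error) pairs copied from the two tables above.
RESULTS = {
    "English novel": {"baseline": (1.3436, 0.00539), "turbo4": (1.3536, 0.00551)},
    "Code": {"baseline": (1.2312, 0.00157), "turbo4": (1.2322, 0.00157)},
}


def relative_increase(baseline: float, candidate: float) -> float:
    """Fractional perplexity increase of candidate over baseline."""
    return (candidate - baseline) / baseline


def gap_in_sigmas(baseline: tuple, candidate: tuple) -> float:
    """Perplexity gap divided by the combined error (independence assumed)."""
    (b, b_err), (c, c_err) = baseline, candidate
    return (c - b) / math.hypot(b_err, c_err)


for name, row in RESULTS.items():
    pct = relative_increase(row["baseline"][0], row["turbo4"][0]) * 100
    sigmas = gap_in_sigmas(row["baseline"], row["turbo4"])
    print(f"{name}: +{pct:.2f}% perplexity, gap = {sigmas:.1f} combined errors")
```

Under that assumption, both gaps are on the order of the measurement error or smaller, which fits the conclusion that the perplexity cost of turbo4 keys is small.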
Methodology
- Base model: llmfan46/Qwen3.6-35B-A3B-uncensored-heretic-Native-MTP-Preserved-GGUF — an uncensored GGUF with native MTP Head preserved
- Extraction and quantization: Dense backbone extracted (27B), quantized using TurboQuant technology stack with mixed precision
- Quantization types:
  - attn_qkv, attn_k, attn_v, attn_output, output: IQ4_XS
  - ffn_down, ffn_up, ffn_gate: IQ3_S
  - Other layers: default IQ4_XS
Run Command
16GB VRAM | 60K Context | MTP Acceleration
set TURBO_AUTO_ASYMMETRIC=0
llama-server.exe ^
-m Qwen3.6-27B-uncensored-abliterated-MTP-i1-IQ4_XS-FFN-IQ3.gguf ^
--parallel 1 ^
--spec-type mtp ^
--spec-draft-n-max 2 ^
-c 61440 ^
-ngl 999 ^
--flash-attn on ^
-ctk turbo4 ^
-ctv turbo4 ^
--host 0.0.0.0 ^
--port 1234
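The command above uses Windows batch syntax (set, ^ line continuations, llama-server.exe). For other platforms, a minimal Python launcher sketch is shown here. It passes exactly the flags from the command above; the function names and the default binary name `llama-server` are assumptions for illustration, and the binary must still come from a fork that supports both TurboQuant and MTP.

```python
import os
import subprocess
from typing import Dict, List, Optional

MODEL_FILE = "Qwen3.6-27B-uncensored-abliterated-MTP-i1-IQ4_XS-FFN-IQ3.gguf"


def build_command(
    server_bin: str = "llama-server",  # assumed binary name; adjust to your build
    model: str = MODEL_FILE,
    ctx: int = 61440,                  # 60K context; 49152 (48K) is the more stable option
    port: int = 1234,
) -> List[str]:
    """Assemble the same arguments as the Windows command above."""
    return [
        server_bin,
        "-m", model,
        "--parallel", "1",
        "--spec-type", "mtp",
        "--spec-draft-n-max", "2",
        "-c", str(ctx),
        "-ngl", "999",
        "--flash-attn", "on",
        "-ctk", "turbo4",
        "-ctv", "turbo4",
        "--host", "0.0.0.0",
        "--port", str(port),
    ]


def build_env(base: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Copy the environment and disable the automatic K cache upgrade."""
    env = dict(os.environ if base is None else base)
    env["TURBO_AUTO_ASYMMETRIC"] = "0"
    return env


def launch(**kwargs) -> subprocess.Popen:
    """Start the server process; not called automatically in this sketch."""
    return subprocess.Popen(build_command(**kwargs), env=build_env())
```

Calling `launch(ctx=49152)` would start the server with the 48K context; the environment variable is set per process, so it does not leak into the rest of the shell session.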
Key Parameter Descriptions
| Parameter | Description |
|---|---|
| --spec-type mtp | Enable MTP speculative decoding mode |
| --spec-draft-n-max 2 | Max draft tokens; 2 is optimal (higher values do not improve acceptance rate in testing) |
| -ctk turbo4 / -ctv turbo4 | Use turbo4 format for Key/Value Cache; requires TURBO_AUTO_ASYMMETRIC=0 to take effect |
| set TURBO_AUTO_ASYMMETRIC=0 | Prevents automatic K Cache upgrade to q8_0, ensuring turbo4 is used and saving VRAM |
| --flash-attn on | Enable Flash Attention for speedup |
| -c 61440 | 60K context window |
About spec-draft-n-max
Extensive testing shows that --spec-draft-n-max 2 is the optimal configuration. The draft acceptance rate saturates at ~70%; increasing the draft count to 3 or higher does not improve actual output speed and only adds computational overhead.
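A rough way to see why extra draft tokens give diminishing returns is to estimate the expected number of tokens emitted per verification step. The sketch below assumes every draft position is accepted independently with the same probability (0.70, taken from the acceptance rate above); that is a simplification, since real acceptance typically falls for later draft positions, and the function name is hypothetical.

```python
def expected_tokens(p_accept: float, n_draft: int) -> float:
    """Expected tokens per verification step: 1 + p + p^2 + ... + p^n.

    The k-th draft only counts if all earlier drafts were accepted, and the
    target always contributes one token (a correction or a bonus token).
    """
    return sum(p_accept ** k for k in range(n_draft + 1))


P_ACCEPT = 0.70  # acceptance rate reported above, assumed constant per position

previous = expected_tokens(P_ACCEPT, 0)
for n in range(1, 5):
    current = expected_tokens(P_ACCEPT, n)
    print(f"n_max={n}: {current:.2f} tokens/step (+{current - previous:.2f})")
    previous = current
```

Under this model each additional draft position adds only p^n tokens per step (0.49 at n=2, about 0.34 at n=3), while every drafted token still costs compute. The model does not predict the measured optimum by itself, but it is consistent with gains flattening once the draft count grows.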
Runtime Requirements
You need a llama.cpp fork that supports both TurboQuant and MTP:
- Recommended source branch: QuinsZouls/llama-cpp-turboquant/llama-next
- Precompiled binary download: lemonyins/llama-cpp-turboquant-mtp
  This build fixes the TURBO_AUTO_ASYMMETRIC logic and works out of the box, with no need to manually set the environment variable.
Caveats
- MTP is essential for speedup: You must use an MTP-capable llama.cpp fork and specify --spec-type mtp; otherwise the MTP Head will not be activated.
- TurboQuant is mandatory: Without TurboQuant KV Cache, 16GB VRAM cannot support 60K context.
- Environment variable required: If using a non-lemonyins build, you must set TURBO_AUTO_ASYMMETRIC=0 first; otherwise K Cache will be auto-upgraded to q8_0 and VRAM will be insufficient for 60K.
- Vision module removed: There is insufficient VRAM to load the vision module, so this model is for text-only inference acceleration. For vision support, use: https://huggingface.co/lemonyins/Qwen3.6-27B-uncensored-abliterated-i1-IQ4_XS-GGUF-Smaller
Acknowledgments
- llmfan46 — Providing the native MTP-preserved uncensored base GGUF
- QuinsZouls — Providing the llama.cpp branch supporting both TurboQuant and MTP (llama-cpp-turboquant/llama-next)
- lemonyins — Providing precompiled binaries and fixing the K Cache auto-upgrade issue
- llama.cpp — The GGML / llama.cpp team and community
Model tree for lemonyins/Qwen3.6-27B-uncensored-abliterated-MTP-i1-IQ4_XS-GGUF-Smaller
Base model
Qwen/Qwen3.6-35B-A3B