Qwen3.6-27B-uncensored-abliterated-MTP-i1-IQ4_XS-GGUF(Smaller)

♥ MTP Inference-Accelerated Model Optimized for 16GB VRAM GPUs ♥

This is a quantized model with native MTP (Multi-Token Prediction) support, extracted from the dense backbone of llmfan46/Qwen3.6-35B-A3B-uncensored-heretic-Native-MTP-Preserved-GGUF. It supports longer contexts, is uncensored (abliterated), and uses MTP speculative decoding to raise generation speed (from about 19 to 35 tokens/s at short context in the tests below).

For use cases requiring longer contexts (e.g., 128K+) at approximately 20 tokens/s inference speed, consider this model: https://huggingface.co/lemonyins/Qwen3.6-27B-uncensored-abliterated-i1-IQ4_XS-GGUF-Smaller

Key Highlights

- MTP Speculative Decoding: Native Multi-Token Prediction draft generation raises short-context inference from about 20 to 35 tokens/s (roughly 75% faster)
- High Speed at Long Contexts: 20 tokens/s at 50K context, 2× the 10 tokens/s of the non-MTP model
- 70% Draft Acceptance Rate: spec-draft-n-max=2 is optimal; higher values did not improve the acceptance rate in testing
- 16GB VRAM, up to 60K context: Fits entirely on a single GPU with TurboQuant KV Cache (turbo4)
- FFN Layer IQ3_S Mixed Precision: Further reduces model size, freeing VRAM for KV Cache
- Uncensored Model: Abliterated to remove content restrictions, suitable for deep research

Innovation

This model inherits the mixed-precision quantization strategy of Qwen3.6-27B-uncensored-abliterated-i1-IQ4_XS-GGUF-Smaller: the attn_qkv / attn_k / attn_v / attn_output / output tensors stay at IQ4_XS, while the ffn_down / ffn_up / ffn_gate tensors are reduced to IQ3_S. The main addition is MTP support. The base model preserves the native MTP head, which proposes several draft tokens per step during inference. The target model then verifies the drafts and accepts the matching ones in a single batch, which significantly reduces the number of serial decoding steps.
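The draft-and-verify idea can be illustrated with a toy greedy loop. This is a minimal sketch, not the llama.cpp implementation: `toy_target` and `toy_draft` are hypothetical stand-ins that operate on integer tokens, and acceptance is a plain exact-match check against the target's own choice.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]


def speculative_step(context: List[int], target_next: NextToken,
                     draft_next: NextToken, n_draft: int = 2) -> List[int]:
    """Run one draft-and-verify step and return the tokens it produced."""
    # 1. The draft head proposes n_draft tokens, one after another.
    drafts: List[int] = []
    for _ in range(n_draft):
        drafts.append(draft_next(context + drafts))

    # 2. The target checks each draft position. A real engine scores all
    #    positions in one batched forward pass; this loop is for clarity.
    produced: List[int] = []
    for token in drafts:
        expected = target_next(context + produced)
        if token != expected:
            produced.append(expected)  # keep the target's token, drop the rest
            return produced
        produced.append(token)

    # 3. Every draft matched, so the target contributes one extra token.
    produced.append(target_next(context + produced))
    return produced


def toy_target(context: List[int]) -> int:
    """Hypothetical target model: counts upward modulo 10."""
    return (context[-1] + 1) % 10


def toy_draft(context: List[int]) -> int:
    """Hypothetical draft head: agrees with the target except after a 4."""
    return 0 if context[-1] == 4 else (context[-1] + 1) % 10


print(speculative_step([1], toy_target, toy_draft))  # [2, 3, 4]: all drafts accepted
print(speculative_step([3], toy_target, toy_draft))  # [4, 5]: second draft rejected
```

In this toy setup the output always matches what the target alone would generate; the draft head only changes how many tokens come out of each verification step.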

MTP Inference Performance

Tested on: NVIDIA RTX 4060 Ti 16GB, llama.cpp (turboquant + mtp branch)

| Scenario | Speed |
|---|---|
| Short context (non-MTP model) | 19 tokens/s |
| Short context (MTP model) | 35 tokens/s |
| Long context 50K (non-MTP model) | 10 tokens/s |
| Long context 50K (MTP model) | 20 tokens/s |
| Draft acceptance rate | 70% |
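The speedup factors implied by this table can be checked with a few lines of arithmetic. This is only a sketch over the reported numbers; the dictionary keys are labels chosen here for illustration. Note that the table's 19 tokens/s baseline gives a slightly larger short-context gain than the 20 to 35 figure quoted in Key Highlights.

```python
# Reported speeds in tokens/s as (non-MTP, MTP), copied from the table above.
reported = {
    "short context": (19, 35),
    "long context 50K": (10, 20),
}


def speedup(baseline: float, accelerated: float) -> float:
    """Ratio of accelerated speed to baseline speed."""
    return accelerated / baseline


for scenario, (non_mtp, mtp) in reported.items():
    factor = speedup(non_mtp, mtp)
    print(f"{scenario}: {factor:.2f}x ({(factor - 1) * 100:.0f}% faster)")
```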

Memory Usage (TurboQuant KV Cache)

| Version | Context Length | KV Cache | VRAM Usage |
|---|---|---|---|
| `IQ4_XS-FFN-IQ3_S` (this model) | 60K | kv=turbo4 | ~15.4 GB |
| `IQ4_XS-FFN-IQ3_S` (this model) | 48K | kv=turbo4 | ~15.2 GB |
| `IQ4_XS-FFN-IQ3_S` (this model) | 32K | k=q8_0,v=turbo4 | ~15.3 GB |

Note: In testing, a 48K context was more stable than 60K and less likely to cause out-of-memory errors.
Note: By default, llama.cpp automatically upgrades cache-type-k to q8_0, which limits context to ~32K on the same VRAM budget. See the Run Command section for the solution.

KV Cache Precision Comparison (Turbo4 vs q8_0)

With TURBO_AUTO_ASYMMETRIC=0, the K cache stays in turbo4 format instead of being auto-upgraded to q8_0. In the memory table above, that is the difference between fitting 32K and 60K of context in roughly the same VRAM, and the perplexity cost is small:

English novel test:

| KV Cache Config | Perplexity | Difference |
|---|---|---|
| k=q8_0 + v=turbo4 | 1.3436 +/- 0.00539 | Baseline |
| kv=turbo4 | 1.3536 +/- 0.00551 | +0.74% |

Code test:

| KV Cache Config | Perplexity | Difference |
|---|---|---|
| k=q8_0 + v=turbo4 | 1.2312 +/- 0.00157 | Baseline |
| kv=turbo4 | 1.2322 +/- 0.00157 | +0.08% |

Conclusion: kv=turbo4 saves enough VRAM to make 60K context feasible, with minimal perplexity loss (about 0.1% to 0.7% in these tests).
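The percentage differences in the two perplexity tables follow directly from the reported values. A short sketch of that arithmetic, where the dictionary labels are chosen here for illustration:

```python
# Perplexity as (k=q8_0 + v=turbo4 baseline, kv=turbo4), from the tables above.
perplexity = {
    "English novel": (1.3436, 1.3536),
    "Code": (1.2312, 1.2322),
}


def relative_increase_percent(baseline: float, value: float) -> float:
    """Relative change of value over baseline, in percent."""
    return (value - baseline) / baseline * 100


for test_name, (baseline, turbo4) in perplexity.items():
    delta = relative_increase_percent(baseline, turbo4)
    print(f"{test_name}: +{delta:.2f}%")
```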

Methodology

Base model: llmfan46/Qwen3.6-35B-A3B-uncensored-heretic-Native-MTP-Preserved-GGUF, an uncensored GGUF with the native MTP head preserved
Extraction and quantization: the dense backbone (27B) was extracted and quantized with mixed precision using the TurboQuant technology stack
Quantization types:
- attn_qkv, attn_k, attn_v, attn_output, output: IQ4_XS
- ffn_down, ffn_up, ffn_gate: IQ3_S
- All other tensors: IQ4_XS (default)
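The per-tensor rule above can be written as a small lookup. This is a sketch of the rule only, not the quantization tooling actually used; `quant_type_for` is a hypothetical helper, and the tensor names assume the usual GGUF pattern such as `blk.0.ffn_down.weight`.

```python
# Tensor groups that are reduced to IQ3_S according to the list above.
FFN_TENSORS = {"ffn_down", "ffn_up", "ffn_gate"}


def quant_type_for(tensor_name: str) -> str:
    """Return the quantization type for a tensor name (hypothetical helper)."""
    # Compare whole dot-separated components so that only exact FFN names match.
    components = tensor_name.split(".")
    if any(component in FFN_TENSORS for component in components):
        return "IQ3_S"
    return "IQ4_XS"  # attention, output, and everything else


for name in ("blk.0.attn_qkv.weight", "blk.0.ffn_down.weight", "output.weight"):
    print(name, "->", quant_type_for(name))
```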

Run Command

16GB VRAM | 60K Context | MTP Acceleration

```
set TURBO_AUTO_ASYMMETRIC=0

llama-server.exe ^
  -m Qwen3.6-27B-uncensored-abliterated-MTP-i1-IQ4_XS-FFN-IQ3.gguf ^
  --parallel 1 ^
  --spec-type mtp ^
  --spec-draft-n-max 2 ^
  -c 61440 ^
  -ngl 999 ^
  --flash-attn on ^
  -ctk turbo4 ^
  -ctv turbo4 ^
  --host 0.0.0.0 ^
  --port 1234
```
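The command above is written for the Windows Command Prompt. On other shells, one option is a small launcher script that sets the environment variable and passes the same flags. The following is a sketch under assumptions: the function names are hypothetical, and the default binary name `llama-server` assumes the fork's server executable is on your PATH.

```python
import os
import subprocess
from typing import Dict, List, Optional


def build_server_command(model_path: str, binary: str = "llama-server",
                         context: int = 61440, port: int = 1234) -> List[str]:
    """Assemble the same flags as the documented run command."""
    return [
        binary,
        "-m", model_path,
        "--parallel", "1",
        "--spec-type", "mtp",
        "--spec-draft-n-max", "2",
        "-c", str(context),
        "-ngl", "999",
        "--flash-attn", "on",
        "-ctk", "turbo4",
        "-ctv", "turbo4",
        "--host", "0.0.0.0",
        "--port", str(port),
    ]


def build_server_env(base_env: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Copy the environment and disable the automatic K cache upgrade."""
    env = dict(os.environ if base_env is None else base_env)
    env["TURBO_AUTO_ASYMMETRIC"] = "0"
    return env


def launch(model_path: str) -> subprocess.Popen:
    """Start the server as a child process (requires the binary to exist)."""
    return subprocess.Popen(build_server_command(model_path), env=build_server_env())


if __name__ == "__main__":
    # Print the command instead of launching, so the sketch is safe to run.
    print(" ".join(build_server_command(
        "Qwen3.6-27B-uncensored-abliterated-MTP-i1-IQ4_XS-FFN-IQ3.gguf")))
```

Lowering `context` to 49152 (48K) would match the more stable setting mentioned in the memory notes.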

Key Parameter Descriptions

| Parameter | Description |
|---|---|
| `--spec-type mtp` | Enable MTP speculative decoding mode |
| `--spec-draft-n-max 2` | Max draft tokens; 2 is optimal (higher values do not improve acceptance rate in testing) |
| `-ctk turbo4 / -ctv turbo4` | Use turbo4 format for Key/Value Cache; requires `TURBO_AUTO_ASYMMETRIC=0` to take effect |
| `set TURBO_AUTO_ASYMMETRIC=0` | Prevents automatic K Cache upgrade to q8_0, ensuring turbo4 is used and saving VRAM |
| `--flash-attn on` | Enable Flash Attention for speedup |
| `-c 61440` | 60K context window |
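Once the server from the Run Command section is running, it can be queried over llama-server's OpenAI-compatible chat endpoint. This is a minimal sketch using only the standard library; the function names are hypothetical, and the host and port assume the `--port 1234` setting shown in the run command with the client on the same machine.

```python
import json
import urllib.request


def build_chat_request(prompt: str, host: str = "127.0.0.1",
                       port: int = 1234) -> urllib.request.Request:
    """Build a POST request for the /v1/chat/completions endpoint."""
    payload = {"messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"http://{host}:{port}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str) -> str:
    """Send the prompt and return the reply text (needs a running server)."""
    with urllib.request.urlopen(build_chat_request(prompt)) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


if __name__ == "__main__":
    request = build_chat_request("What is the capital of France?")
    print(request.full_url)
```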

About spec-draft-n-max

Extensive testing shows that --spec-draft-n-max 2 is the optimal configuration. The draft acceptance rate saturates at ~70%; increasing the draft count to 3 or higher does not improve actual output speed and only adds computational overhead.
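A simple model shows why extra draft tokens pay off less and less. This is a back-of-the-envelope sketch, not a measurement: it assumes each draft token is accepted independently with the same probability (here the reported 70%) and that a rejected draft discards all later ones. Under those assumptions, the expected number of tokens per verification step with n drafts is 1 + p + p^2 + ... + p^n.

```python
def expected_tokens_per_step(n_draft: int, accept_prob: float) -> float:
    """Expected tokens emitted per verification step under the i.i.d. assumption."""
    # The k-th draft only counts if all k drafts up to it were accepted (p**k);
    # the k = 0 term is the one token the target always contributes.
    return sum(accept_prob ** k for k in range(n_draft + 1))


previous = expected_tokens_per_step(0, 0.7)
for n_draft in range(1, 5):
    current = expected_tokens_per_step(n_draft, 0.7)
    print(f"n_draft={n_draft}: {current:.3f} tokens/step (+{current - previous:.3f})")
    previous = current
```

Each additional draft adds less than the one before it while still costing a draft-head pass and a longer verification batch, which is consistent with the observation that values above 2 did not improve output speed.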

Runtime Requirements

You need a llama.cpp fork that supports both TurboQuant and MTP:

Recommended source branch: QuinsZouls/llama-cpp-turboquant/llama-next
Precompiled binary download: lemonyins/llama-cpp-turboquant-mtp

The lemonyins precompiled build fixes the TURBO_AUTO_ASYMMETRIC logic and works out of the box, so you do not need to set the environment variable manually.

Caveats

- MTP is essential for speedup: You must use an MTP-capable llama.cpp fork and specify --spec-type mtp; otherwise the MTP head will not be activated
- TurboQuant is mandatory: Without TurboQuant KV Cache, 16GB VRAM cannot support 60K context
- Environment variable required: If you are not using the lemonyins build, you must set TURBO_AUTO_ASYMMETRIC=0 first; otherwise the K Cache will be auto-upgraded to q8_0 and VRAM will be insufficient for 60K
- Vision module removed: There is insufficient VRAM to load the vision module, so this model is for text-only inference acceleration. For vision support, use: https://huggingface.co/lemonyins/Qwen3.6-27B-uncensored-abliterated-i1-IQ4_XS-GGUF-Smaller

Acknowledgments

llmfan46 — Providing the native MTP-preserved uncensored base GGUF
QuinsZouls — Providing the llama.cpp branch supporting both TurboQuant and MTP (llama-cpp-turboquant/llama-next)
lemonyins — Providing precompiled binaries and fixing the K Cache auto-upgrade issue
llama.cpp — The GGML / llama.cpp team and community

GGUF

Model size: 27B params
Architecture: qwen35
Quantization: 4-bit

Model tree for lemonyins/Qwen3.6-27B-uncensored-abliterated-MTP-i1-IQ4_XS-GGUF-Smaller

Base model: Qwen/Qwen3.6-35B-A3B
Finetuned: llmfan46/Qwen3.6-35B-A3B-uncensored-heretic-Native-MTP-Preserved
Quantized: llmfan46/Qwen3.6-35B-A3B-uncensored-heretic-Native-MTP-Preserved-GGUF
Quantized: this model