Fast!

#10

by mazuj2 - opened 12 days ago

Discussion

mazuj2

12 days ago (edited 12 days ago)

Pushed me over the hump! I was getting 110 tps on my bifurcated 5070 Ti/5060 Ti setup (32 GB). I am now getting 145 to 160 tps!
Thanks, Unsloth and Qwen!

Wladastic

10 days ago

Which quant variant were you using?
My 5060 Ti only manages 20-30 tps, though that is with 65k of active context already loaded, on UD_IQ4NL_XL without MTP. I'm downloading the UD_Q4_XL MTP right now, let's see...

ercangorgulu

10 days ago

You should check whether you have enough VRAM and that no copying is happening because memory is running out. Speed does not differ much between quants, so 20-30 tps is not normal.
My 4090 was doing around 150 tps on Qwen3.6-35B-A3B-GGUF and now does around 170 tps on Qwen3.6-35B-A3B-MTP-GGUF.
Also, just in case: this is an MoE model. Maybe your speed is for the 27B model?

koifish12

10 days ago

> Which quant variant were you using?
> My 5060 Ti only manages 20-30 tps, though that is with 65k of active context already loaded, on UD_IQ4NL_XL without MTP. I'm downloading the UD_Q4_XL MTP right now, let's see...

You don't have enough VRAM.

Wladastic

10 days ago

I do; it ran okay with offloading to the CPU. This machine is my minimal server, but MTP bloated it, so I went without MTP.
On my main machine I have 96 GB and prefer the dense 27B there, but honestly, an nmax above 2 somehow produced only robotic results, putting everything into bullet points, etc.

takeseem

4 days ago

---- The English is a translation, sorry ----
Qwen3.6-35B-A3B-MTP-GGUF: in my measurements, generation speed went down. With 53.5% of draft tokens accepted it ran at 18.42 tokens/s, while the original Qwen3.6-35B-A3B reaches 23.64 tokens/s.

My understanding:

MTP: the model parameters are moved through memory once to generate several tokens, which are then verified in sequence to produce the final output. Prediction and verification cost almost nothing, and each additional token that passes verification saves one pass of moving the parameters.
MoE is not suited to MTP: an MoE model relies on expert routing to compute each token. Prediction and verification will most likely not hit the same experts, which conflicts fundamentally with MTP's batched-prediction optimization.
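The trade-off described above can be sketched with a simple model. This is only an illustration under stated assumptions, not how llama.cpp computes anything: it assumes each drafted token is accepted independently with the reported acceptance rate, and that one draft-and-verify pass costs `step_cost` times an ordinary decode step. The function names, the `step_cost` parameter, and the draft length `k = 1` are hypothetical; the post does not state the draft length.

```python
def expected_tokens_per_step(accept_rate: float, max_draft: int) -> float:
    """Expected tokens emitted per verification pass.

    Assumes each drafted token is accepted independently with probability
    `accept_rate` and that acceptance stops at the first rejection. A pass
    always yields at least one token (the verifier's own prediction).
    """
    return sum(accept_rate ** i for i in range(max_draft + 1))


def speedup(accept_rate: float, max_draft: int, step_cost: float) -> float:
    """Throughput relative to plain decoding.

    `step_cost` is the cost of one draft-and-verify pass divided by the cost
    of one ordinary decode step (1.0 means drafting and verifying are free).
    """
    return expected_tokens_per_step(accept_rate, max_draft) / step_cost


if __name__ == "__main__":
    p = 0.535  # acceptance rate reported in the post
    k = 1      # hypothetical draft length, not stated in the post
    print(f"tokens per pass: {expected_tokens_per_step(p, k):.3f}")
    for cost in (1.0, 1.5, 2.0):
        print(f"step cost {cost:.1f}x -> speedup {speedup(p, k, cost):.2f}x")
```

In this model MTP only helps while a pass costs less than the expected number of tokens it yields, so "prediction and verification cost almost nothing" is the condition that has to hold for a speedup.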

takeseem

4 days ago

On qwen3.5 9b, the measured speed was originally 10 t/s; after enabling MTP it rose to 15 t/s, with an acceptance rate of 54.9%. This is indeed good news for dense models.
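The two measurements in this thread (18.42 vs 23.64 tokens/s on the MoE model, 15 vs 10 t/s on the dense 9b) can be compared by backing out how expensive a draft-and-verify pass would have to be to explain each result. This is a rough sketch under the same kind of assumptions as any simple speculative-decoding model: drafted tokens are accepted independently at the reported rate, and the draft length is 1. The draft length is a guess, since neither post states it, and `implied_step_cost` is a hypothetical helper name.

```python
def implied_step_cost(accept_rate: float, max_draft: int,
                      base_tps: float, mtp_tps: float) -> float:
    """Cost of one draft-and-verify pass relative to a plain decode step,
    inferred from measured throughput with and without MTP."""
    # Expected tokens per pass, assuming independent acceptances.
    expected = sum(accept_rate ** i for i in range(max_draft + 1))
    measured_speedup = mtp_tps / base_tps
    return expected / measured_speedup


if __name__ == "__main__":
    # Numbers reported in the thread; max_draft=1 is an assumption.
    moe = implied_step_cost(0.535, 1, base_tps=23.64, mtp_tps=18.42)
    dense = implied_step_cost(0.549, 1, base_tps=10.0, mtp_tps=15.0)
    print(f"MoE 35B-A3B implied pass cost: {moe:.2f}x")
    print(f"dense 9b implied pass cost:    {dense:.2f}x")
```

With a draft length of 1 this comes out at roughly 1.97x for the MoE run and roughly 1.03x for the dense run, which would fit the claim that verification is nearly free on the dense model but not on the MoE model. A different draft length would change both figures.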

takeseem

4 days ago

MTP conclusion: MTP does not save memory, but it works well on any system that has spare compute and where memory bandwidth usage exceeds 50%.
Tested on a UM 790 Pro mini PC with 96 GB (2x48 GB 5600M, memory bandwidth 59 GB/s):
- 2B and below: compute-bound; MTP adds to the overload. MTP off.
- 4B to 7B: bandwidth-bound, but still sensitive to compute. MTP on, max draft = 1.
- 9B to 32B: purely bandwidth-bound, with spare compute. MTP on, max draft = 2 or 3.
- MoE: expert routing conflicts with MTP verification at a fundamental level. MTP firmly off.
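For reference, the rules of thumb above can be written as a small lookup. This is just a sketch that encodes the list as posted, for that one machine; it is not a general tuning rule. The function name and return shape are hypothetical, and sizes the list does not cover (between 2B and 4B, between 7B and 9B, above 32B) deliberately return `None` rather than a guess.

```python
from typing import Optional, Tuple

# (mtp_enabled, suggested max-draft values)
Recommendation = Tuple[bool, Tuple[int, ...]]


def recommend_mtp(params_b: float, is_moe: bool = False) -> Optional[Recommendation]:
    """Return the posted MTP setting for a model of `params_b` billion
    parameters, or None when the posted list does not cover that size."""
    if is_moe:
        return (False, ())        # MoE: MTP off
    if params_b <= 2:
        return (False, ())        # compute-bound: MTP off
    if 4 <= params_b <= 7:
        return (True, (1,))       # bandwidth-bound, compute-sensitive
    if 9 <= params_b <= 32:
        return (True, (2, 3))     # purely bandwidth-bound
    return None                   # size not covered by the post


if __name__ == "__main__":
    for size, moe in ((1.5, False), (4, False), (9, False), (35, True), (8, False)):
        print(size, "MoE" if moe else "dense", recommend_mtp(size, moe))
```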

Wladastic

4 days ago

I agree. To me, the quality tradeoff on 48 GB of VRAM or less is worse than just using a bigger quant.
I would rather spend 4 GB on a Q6 without MTP than on a Q4 with MTP.
I know the purpose of MTP is to speed things up, but I also saw a degradation in quality, and I would rather have better quality than bullet-point responses.

takeseem

3 days ago

@Wladastic MTP does not reduce the model's intelligence, and there is no loss of accuracy whatsoever.
