# Qwen3.6-35B-A3B + MTP + TurboQuant on ROCm (RX 7800 XT / gfx1101): What Actually Works

#17

by JamesBean187 - opened 8 days ago

Discussion

JamesBean187

8 days ago

Hardware: AMD RX 7800 XT (gfx1101) · 16 GB VRAM
OS: Xubuntu 24.04
ROCm: 6.4.0 · Kernel: 6.17
Fork: NJannasch/llama.cpp mtp-turboquant branch
Model: unsloth/Qwen3.6-35B-A3B-MTP-GGUF — Qwen3.6-35B-A3B-UD-IQ3_XXS.gguf
Date tested: 2026-05-18

Tested on a consumer desktop, not a server rack.

Written by Claude Code · Edited by User

Most writeups on MTP + TurboQuant cover NVIDIA hardware. This is a ROCm-specific report from an RDNA3 card. A few things will bite you that I didn't see documented anywhere; I'm posting this so you don't lose hours to the same issues.

Why This Combination

UD-IQ3_XXS quantization is 13 GB on disk — fits in 16 GB VRAM with room for KV cache.

The catch: at Q3-range quantization the KV cache overhead is proportionally larger. Without TurboQuant, a 32K context window adds another ~640 MB of f16 KV cache on top of the model. With turbo4 KV (-ctk turbo4 -ctv turbo4), that same 32K context costs ~170 MB. That's the margin that makes longer conversations viable on a 16 GB card.
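The arithmetic behind those figures can be sketched roughly as follows. This is an estimate, not the fork's actual allocator. The count of 10 full-attention layers is quoted later in this post; the 2 KV heads, the head dimension of 256, and the ~4.25 bits per value for turbo4 are assumptions, chosen only because they reproduce the ~640 MB and ~170 MB numbers above.

```python
def kv_cache_mib(n_ctx, n_attn_layers, n_kv_heads, head_dim, bits_per_value):
    """Estimate KV cache size in MiB for the full-attention layers only."""
    # K and V each store n_kv_heads * head_dim values per token per layer.
    values = 2 * n_attn_layers * n_kv_heads * head_dim * n_ctx
    return values * bits_per_value / 8 / (1024 ** 2)

# Assumed shape (hypothetical, back-derived from the numbers in the post).
ATTN_LAYERS, KV_HEADS, HEAD_DIM = 10, 2, 256
TURBO4_BITS = 4.25  # assumption: effective bits per cached value

f16_32k = kv_cache_mib(32768, ATTN_LAYERS, KV_HEADS, HEAD_DIM, 16)
turbo4_32k = kv_cache_mib(32768, ATTN_LAYERS, KV_HEADS, HEAD_DIM, TURBO4_BITS)

print(f"f16 @ 32K:    {f16_32k:.0f} MiB")
print(f"turbo4 @ 32K: {turbo4_32k:.0f} MiB")
print(f"saved:        {f16_32k - turbo4_32k:.0f} MiB")
```

Only the full-attention layers are counted because the recurrent layers keep a fixed-size state that does not grow with context length.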

MTP (Multi-Token Prediction) is baked into Qwen3.6-35B-A3B at training time — nextn_predict_layers = 1 in the model metadata. The NJannasch fork activates it with --spec-type draft-mtp.

Build

Single GPU setup (most people): straightforward. Build for your card's gfx target only.

git clone --branch mtp-turboquant --depth 1 https://github.com/NJannasch/llama.cpp
cd llama.cpp
cmake -B build \
  -DGGML_HIP=ON \
  -DAMDGPU_TARGETS="gfx1101" \
  -DCMAKE_BUILD_TYPE=Release
cmake --build build --target llama-server -j$(nproc)

Build time: ~20 minutes on a mid-range CPU.

Multi-GPU setup — read this before you build:

If you have a secondary AMD GPU in the system, the temptation is to list both targets in AMDGPU_TARGETS. Don't.

This fork does not have a rocBLAS TensileLibrary path for every gfx target. Specifically, gfx1032 (RX 6600 / 6600 XT family) is not covered. If you include it, the build succeeds but inference crashes at runtime:

rocBLAS error: Cannot read TensileLibrary.dat: Illegal seek for GPU arch: gfx1032

This is not a VRAM issue. The crash happens because the compiled binary has no kernel path for that architecture, not because of memory size. A 6600 XT with unlimited VRAM would crash the same way. Mainline llama.cpp handles gfx1032 fine via a gfx1030 fallback; this fork does not.

Fix: build for your primary card's target only, and restrict GPU visibility at launch with HIP_VISIBLE_DEVICES (see Launch Flags below). The secondary card stays available for other processes; this binary just won't touch it.

Model Download

The file in the MTP GGUF repo does not include "MTP" in its name:

wget -c -O Qwen3.6-35B-A3B-UD-IQ3_XXS.gguf \
  'https://huggingface.co/unsloth/Qwen3.6-35B-A3B-MTP-GGUF/resolve/main/Qwen3.6-35B-A3B-UD-IQ3_XXS.gguf'

In my session, hf download exited with code 0 but left an .incomplete file behind. Use wget -c with the direct resolve URL to be safe.

Launch Flags That Work

HIP_VISIBLE_DEVICES=1 ./build/bin/llama-server \
  --model Qwen3.6-35B-A3B-UD-IQ3_XXS.gguf \
  --port 8080 \
  --host 127.0.0.1 \
  -ngl 999 \
  --n-cpu-moe 0 \
  --ctx-size 32768 \
  --jinja \
  -np 1 \
  --spec-type draft-mtp \
  --spec-draft-n-max 3 \
  -ctk turbo4 \
  -ctv turbo4 \
  --reasoning-budget 512

HIP_VISIBLE_DEVICES=1 assumes the 7800 XT is your second GPU. Adjust to 0 if it's your only GPU.

Results

Benchmark (API level)

| Config | VRAM | Gen t/s | MTP acceptance |
| --- | --- | --- | --- |
| MTP + f16 KV, ctx 8192 | 14.93 GB | ~118 | ~100% |
| MTP + turbo4 KV, ctx 8192 | 14.80 GB | ~111 | ~88% |
| MTP + turbo4 KV, ctx 32768 | 14.58 GB | ~110 | ~86% |

At 32K context turbo4 saves ~470 MB vs f16. Speed barely changes between ctx 8K and 32K with turbo4 — that's the point.

MTP acceptance at 86-100% is strong. Qwen3.6 was trained with MTP so draft quality is high.

UX Testing (web UI, thinking mode on, `--reasoning-budget 512`)

Tested against the built-in llama-server web UI after full service setup. Scoring is subjective: Pass / Degraded / Fail against the criterion listed.

| Category | Example prompt type | First token | Full response t/s | Result |
| --- | --- | --- | --- | --- |
| Simple direct | Short factual, single-answer | < 1s | ~110 t/s | ✅ Pass: fast, on point, no over-explanation |
| Concise instruction | "Reply in exactly N words" | < 1s | ~110 t/s | ✅ Pass: followed constraint, no padding |
| Multi-step reasoning | Technical problem with 3+ constraints | 1–2s | ~95–105 t/s | ✅ Pass: thinking budget used well, answer structured correctly |
| Live data / current events | "What are good X in [city]?" | 1–2s | ~9 t/s ⚠️ | ❌ Degraded (see note) |
| Conversational follow-up | Short reply in ongoing thread | < 1s | ~90–100 t/s | ✅ Pass: context retained, no repetition |

Live data note: The model has no web access. Without a --reasoning-budget cap, open-ended factual queries trigger an uncapped thinking loop where the model enumerates everything it knows from training data. This accumulated hundreds of tokens of internal reasoning, driving generation speed down from ~110 t/s to ~9 t/s as attention overhead compounded. With --reasoning-budget 512, the thinking phase is cut off at the budget, and the model then states plainly that it can't provide live data, which is the correct answer. The Degraded score reflects the behaviour without the budget; with it, this category becomes a Pass with a graceful "I don't have live data" response.

Thinking mode: The reasoning phase is not visible in the web UI by default — responses appear after the think completes. For tasks where the thinking budget is hit, the model produces a brief response from wherever reasoning ended. This is working as designed. For direct conversational use, disable thinking per-request via the API ("chat_template_kwargs": {"enable_thinking": false}) or set --reasoning-budget 0 at launch.

Issues to Watch For

1. The Silent CPU Fallback — Most Dangerous

Symptom: Model appears to load normally. VRAM reads correctly (14+ GB used). But generation is 5-10 t/s instead of 100+ t/s. CPU is pegged at 200-300% usage. GPU utilization reads 0%.

Cause: The ROCm runtime silently falls back to CPU compute when GPU initialization fails. The VRAM reading is misleading — the weights are loaded to GPU memory via mmap but computed on CPU.

How to catch it: Check the server log for compute buffer:

# GPU — correct
sched_reserve: ROCm0 compute buffer size = 493.00 MiB

# CPU — silent fallback
sched_reserve: CPU compute buffer size = 497.00 MiB

Also check startup for:

ggml_cuda_init: failed to initialize ROCm: no ROCm-capable device is detected

If you see that line, nothing is running on GPU regardless of what rocm-smi shows.

VRAM is not a reliable GPU-usage indicator. Always check compute buffer in the log.
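A small log check along these lines can automate that inspection. It is a minimal sketch that assumes the log lines look exactly like the samples quoted above; the function name and the regular expression are illustrative, not part of llama.cpp.

```python
import re

# Matches lines such as "sched_reserve: ROCm0 compute buffer size = 493.00 MiB".
COMPUTE_BUFFER = re.compile(r"sched_reserve:\s+(\S+)\s+compute buffer size")
INIT_FAILURE = "failed to initialize ROCm"

def compute_backend(log_text):
    """Return 'gpu', 'cpu', or 'unknown' based on llama-server log lines."""
    if INIT_FAILURE in log_text:
        return "cpu"  # ROCm never initialized, so nothing runs on the GPU
    devices = COMPUTE_BUFFER.findall(log_text)
    if any(re.fullmatch(r"ROCm\d+", d) for d in devices):
        return "gpu"
    if devices:
        return "cpu"  # compute buffers exist, but none on a ROCm device
    return "unknown"

print(compute_backend("sched_reserve: ROCm0 compute buffer size = 493.00 MiB"))
print(compute_backend("sched_reserve: CPU compute buffer size = 497.00 MiB"))
```

Feeding it the captured server log (for example the output of journalctl for a systemd unit) gives a quick pass/fail signal before you start benchmarking.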

2. ROCR_VISIBLE_DEVICES + HIP_VISIBLE_DEVICES Integer Conflict

If you're running multiple GPUs and use systemd (or any launcher that sets environment variables explicitly), watch out for this:

ROCR_VISIBLE_DEVICES and HIP_VISIBLE_DEVICES use different index spaces.

| Variable | Index 0 | Index 1 | Index 2 |
| --- | --- | --- | --- |
| `HIP_VISIBLE_DEVICES` | GPU 0 | GPU 1 | GPU 2 |
| `ROCR_VISIBLE_DEVICES` | CPU (HSA agent 0) | GPU 0 | GPU 1 |

Setting both to 1 targets different physical devices. The ROCm runtime sees a conflict and reports no capable device — then silently falls back to CPU. This is the root cause of the silent CPU fallback above in systemd contexts.

Fix: Use HIP_VISIBLE_DEVICES alone. Don't combine both with integer indices.

# Correct — one variable, GPU-only index
HIP_VISIBLE_DEVICES=1 ./llama-server ...

# Wrong — conflicting index spaces
ROCR_VISIBLE_DEVICES=1 HIP_VISIBLE_DEVICES=1 ./llama-server ...

If you want UUID-based targeting (more stable across reboots), use ROCR_VISIBLE_DEVICES=<UUID> + HIP_VISIBLE_DEVICES=0 (since only one device is visible, it becomes index 0).
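If the server is started from a script, the two safe combinations described above can be encoded in a helper like this one. It is a sketch only: the function names are hypothetical, and the UUID in the example is a placeholder rather than a real device ID.

```python
from typing import Dict, Optional

def gpu_visibility_env(hip_index: Optional[int] = None,
                       rocr_uuid: Optional[str] = None) -> Dict[str, str]:
    """Build environment variables for one of the two safe targeting modes."""
    if (hip_index is None) == (rocr_uuid is None):
        raise ValueError("pass exactly one of hip_index or rocr_uuid")
    if rocr_uuid is not None:
        # Only one device stays visible, so HIP sees it as index 0.
        return {"ROCR_VISIBLE_DEVICES": rocr_uuid, "HIP_VISIBLE_DEVICES": "0"}
    return {"HIP_VISIBLE_DEVICES": str(hip_index)}

def has_integer_conflict(env: Dict[str, str]) -> bool:
    """True when both variables are set and ROCR uses an integer index."""
    rocr = env.get("ROCR_VISIBLE_DEVICES", "")
    return rocr.isdigit() and "HIP_VISIBLE_DEVICES" in env

print(gpu_visibility_env(hip_index=1))
print(gpu_visibility_env(rocr_uuid="GPU-0123456789abcdef"))  # placeholder UUID
print(has_integer_conflict({"ROCR_VISIBLE_DEVICES": "1", "HIP_VISIBLE_DEVICES": "1"}))
```

The returned dictionary can be merged into a copy of os.environ and passed as env= to subprocess.run when launching llama-server.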

3. Hybrid Model Requires MTP to Load

Qwen3.6-35B-A3B is a hybrid architecture — it alternates between full attention layers and SSM (Gated Delta Net / Mamba-style) recurrent layers every 4 blocks. This is not a standard transformer.

In the NJannasch fork, attempting to load this model without --spec-type draft-mtp causes an assert crash during slot initialization:

GGML_ASSERT(rollback >= 1 && rollback <= (llama_pos) n_rs_seq) failed

Stack trace: llama_memory_recurrent::seq_rm → llama_memory_hybrid::seq_rm → common_context_can_seq_rm → server_context_impl::load_model

--no-warmup does not fix this. -np 1 does not fix this. Only adding --spec-type draft-mtp resolves it.

This appears to be a fork-specific bug where the recurrent sequence memory (n_rs_seq) is only initialized correctly when MTP is active. The model metadata contains nextn_predict_layers = 1 — this model expects MTP.

Both --spec-type draft-mtp and -np 1 are required, not optional.

4. Thinking Model + No Reasoning Budget = Recall Spiral

Qwen3.6-35B-A3B is a thinking model. By default, it reasons before answering. On open-ended queries (especially anything involving enumeration from training data — "what restaurants are in X", "list all Y"), it can run an unbounded think block that accumulates hundreds to thousands of tokens.

This causes a secondary performance problem: as the thinking context grows, the O(n) attention overhead across the model's 10 full-attention layers compounds. A fresh 16-token prompt runs at ~118 t/s. The same session after 1000 tokens of accumulated thinking: ~9 t/s.

Fix: Set a token budget for the thinking phase:

--reasoning-budget 512

512 tokens is enough for genuine multi-step reasoning. Not enough for exhaustive training-data recall. Adjust upward if you need deeper analysis on complex tasks.

To disable thinking entirely per-request via API:

"chat_template_kwargs": {"enable_thinking": false}

Note: the /no_think prompt token does not work with this fork/model. Use the API parameter.
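For reference, a per-request call with thinking disabled might look like the sketch below. It assumes the server from the launch command above is listening on 127.0.0.1:8080 and exposes the OpenAI-compatible /v1/chat/completions route; the helper names are illustrative, and only the Python standard library is used.

```python
import json
import urllib.request

def build_chat_request(prompt, enable_thinking=False,
                       base_url="http://127.0.0.1:8080"):
    """Build a chat completion request that toggles thinking per request."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "chat_template_kwargs": {"enable_thinking": enable_thinking},
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def chat(prompt, **kwargs):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, chat("Summarize MTP in one sentence.") returns the reply without a thinking phase, and passing enable_thinking=True restores the default behaviour for that request.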

5. gfx1032 (RX 6600 XT) Not Supported by This Fork — and It's Not a VRAM Issue

Clarifying this because it's easy to misread: the problem is not that the 6600 XT has 8 GB of VRAM and the model is 13 GB.

In standard llama.cpp builds, insufficient VRAM is handled gracefully — the runtime calculates how many layers fit on GPU and offloads the rest to CPU. Slow, but not a crash. That behavior works fine.

The crash here is different. If you have a multi-GPU system with a 6600 XT alongside your main card and expose both GPUs (e.g. by removing HIP_VISIBLE_DEVICES), you get:

rocBLAS error: Cannot read TensileLibrary.dat: Illegal seek for GPU arch: gfx1032

This happens because the mtp-turboquant fork's compiled rocBLAS kernels do not include a path for gfx1032. VRAM size is irrelevant — a hypothetical 6600 XT with 64 GB would crash the same way. The architecture simply isn't in this binary's kernel set.

The mainline llama.cpp binary handles gfx1032 via a gfx1030 fallback path. This fork does not. The fix is to restrict GPU visibility to your supported card only:

HIP_VISIBLE_DEVICES=<index of gfx1101 card> ./llama-server ...

If you only have a gfx1032 card and want to run this fork: it will crash. Use mainline llama.cpp instead — you'll get CPU offload for layers that don't fit, but no rocBLAS crash.

Thinking Mode

The model ships with thinking on. For direct conversational use, disable per-request:

{
  "chat_template_kwargs": {"enable_thinking": false}
}

For a persistent no-think default, set --reasoning-budget 0 at launch (disables thinking for all requests).

What I'd Still Like to Test

Higher ctx (65536+) with turbo4 — the VRAM math says it's viable but I haven't validated it
Quantized KV types other than turbo4 (q8_0 baseline comparison on this fork)
Whether the n_rs_seq assert is fixed in newer commits on the branch

TL;DR for AMD Users

✅ MTP works on ROCm gfx1101 at 86-100% acceptance — the NVIDIA-only assumption is wrong
✅ TurboQuant turbo4 KV works on ROCm — 160 MB → 47 MB at ctx 8K, 640 MB → 170 MB at ctx 32K
⚠️ Check compute buffer in logs — VRAM usage will look normal even if you're on CPU
⚠️ Don't combine ROCR_VISIBLE_DEVICES and HIP_VISIBLE_DEVICES with integers
⚠️ --spec-type draft-mtp -np 1 required to load — not optional for this hybrid model
⚠️ Set --reasoning-budget 512 or thinking loops will tank your t/s on open queries
⚠️ Build with gfx1101 only — this fork has no rocBLAS kernel path for gfx1032. Not a VRAM issue — a 6600 XT with unlimited VRAM would crash the same way. Mainline llama.cpp handles gfx1032 fine (CPU layer offload); this fork does not.

Relevant upstream discussion: TurboQuant KV Cache Compression — Full HIP/ROCm Port

natanpodbielski

8 days ago

Did you test it with precise tasks, i.e. tool calls? It fails very often for me.

takeseem

4 days ago

---- English is trans, sorry ----
Qwen3.6-35B-A3B-MTP-GGUF, measured: generation speed drops. Only 53.5% of draft tokens are accepted, at 18.42 tokens per second. The original Qwen3.6-35B-A3B model reaches 23.64 tokens per second.

My understanding:

MTP reads the model parameters once, drafts several tokens, then verifies them in order to produce the final tokens. Drafting and verification cost almost nothing, and every additional draft that passes verification saves one pass of parameter reads.
MoE is not a good fit for MTP: MoE computes each token through routed experts, so drafting and verification will most likely not hit the same experts, which conflicts fundamentally with MTP's batched-prediction optimization.
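The "one saved pass per accepted draft" idea can be made concrete with a simplified model. The sketch assumes every draft token is accepted independently with the same probability, which is an idealization; the acceptance percentage that llama.cpp reports (accepted over drafted tokens) is a different statistic, so treat the output as illustrative only.

```python
def expected_tokens_per_step(p_accept, n_draft):
    """Expected tokens emitted per verification pass.

    One token always comes from the verifying pass itself; draft i is kept
    only if drafts 1..i were all accepted, which has probability p_accept**i.
    """
    return sum(p_accept ** i for i in range(n_draft + 1))

for n in (1, 2, 3):
    print(n, expected_tokens_per_step(0.5, n))
```

The model says nothing about the cost of each pass. If verifying a batch of drafts touches more experts than a single token would, as argued above for MoE, the per-pass cost rises and can cancel the gain.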

takeseem

4 days ago

On Qwen3.5 9B, the measured speed was 10 t/s; after enabling MTP it rose to 15 t/s, with an acceptance rate of 54.9%. This is indeed good news for dense models.

natanpodbielski

4 days ago

This is very interesting and would indeed explain why the draft acceptance rate is so low.

On the other hand, I tested it yesterday and it was generating about 60 t/s at long context on Strix Halo. That is a great achievement for a local model.

takeseem

4 days ago

MTP conclusion: MTP does not save memory, but it works well on systems with spare compute where memory bandwidth utilization exceeds 50%.
Mini PC UM 790 Pro, 96 GB (2x48 GB, 5600 MT/s, memory bandwidth 59 GB/s)
- 2B and below: compute-bound; MTP adds to the overload. Turn MTP off.
- 4B to 7B: bandwidth-bound, but still sensitive to compute. Turn MTP on, with a max draft of 1.
- 9B to 32B: purely bandwidth-bound, with spare compute. Turn MTP on, with a max draft of 2 or 3.
- MoE: expert routing conflicts with MTP verification at a fundamental level. Keep MTP off.

takeseem

4 days ago

@natanpodbielski Can you test Qwen3.6-35B-A3B-MTP-GGUF with MTP off vs. MTP on? My guess is that MTP off wins.

natanpodbielski

4 days ago

I am doing it now.

natanpodbielski

4 days ago

MTP off: 62.98 t/s
MTP 2 drafts: 71.33 t/s
MTP 3 drafts: 68.27 t/s
MTP 6 drafts: 84.72 t/s

I am almost sure I did not make a mistake here. The results are pretty erratic, but generally it is faster with MTP. I will run the test again to make sure.

natanpodbielski

4 days ago

MTP off: 63.28 t/s
MTP 2 drafts: 71.64 t/s
MTP 3 drafts: 69.22 t/s
MTP 6 drafts: 92.21 t/s

Essentially the same result. Maybe because the original model is optimised for 2 draft tokens and the GGUF version for 6?
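To compare the two runs directly, the throughput figures can be turned into speedups relative to the MTP-off baseline. The numbers below are copied from the two posts above; the dictionary layout and function name are just for illustration.

```python
# Tokens per second from the two benchmark runs, keyed by draft count.
RUNS = {
    "run 1": {"off": 62.98, 2: 71.33, 3: 68.27, 6: 84.72},
    "run 2": {"off": 63.28, 2: 71.64, 3: 69.22, 6: 92.21},
}

def speedups(run):
    """Return the speedup over the MTP-off baseline for each draft count."""
    base = run["off"]
    return {k: round(v / base, 2) for k, v in run.items() if k != "off"}

for name, run in RUNS.items():
    print(name, speedups(run))
```

In both runs every MTP setting is faster than the baseline on this machine, and 3 drafts lands below 2 drafts each time, which suggests that dip is reproducible rather than noise.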

takeseem

3 days ago

@natanpodbielski Thank you.

MTP suits cases with spare compute where actual memory bandwidth utilization exceeds 50%. It is very effective there.

MoE models inherently have a lower acceptance rate, so a max draft of 1 or 2 is appropriate. For dense models, 2 or 3 is worth considering.
It should work well where memory bandwidth is the limit, such as my setup with an integrated GPU and ordinary system RAM. The Mac unified memory architecture also benefits from MTP; a Mac's actual memory bandwidth is probably only about 1/4 to 1/2 of the theoretical value (120 GB/s to 230 GB/s).

It is completely normal for the same model to perform differently depending on compute and memory bandwidth, which is why different people reach different conclusions.

My UM 790 Pro has an integrated GPU, and the measured memory (VRAM) bandwidth is only 59 GB/s, which is very low. So models that were bottlenecked on memory bandwidth benefit a lot from MTP. Small models are not bandwidth-bound but compute-bound, so for them MTP is a net slowdown.
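The bandwidth argument can be checked with a back-of-the-envelope bound: if every generated token has to stream the active weights from memory once, decode speed cannot exceed bandwidth divided by the bytes read per token. The sketch below uses the 59 GB/s figure from this thread; the 5.9 GB of weights read per token is a hypothetical size for a quantized 9B dense model, not a measured value.

```python
def bandwidth_bound_tps(bandwidth_gb_s, weights_read_per_token_gb):
    """Upper bound on tokens/s when each token streams the weights once."""
    return bandwidth_gb_s / weights_read_per_token_gb

# 59 GB/s measured above; 5.9 GB per token is an assumed model size.
print(f"{bandwidth_bound_tps(59, 5.9):.1f} t/s")
```

A machine sitting near this bound is the case where MTP helps, because each accepted draft token yields an extra token without another full read of the weights.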
