# Qwen3.6-35B-A3B + MTP + TurboQuant on ROCm (RX 7800 XT / gfx1101): What Actually Works

#17
by JamesBean187 - opened

Hardware: AMD RX 7800 XT (gfx1101) · 16 GB VRAM
OS: Xubuntu 24.04
ROCm: 6.4.0 kernel 6.17.
Fork: NJannasch/llama.cpp mtp-turboquant branch
Model: unsloth/Qwen3.6-35B-A3B-MTP-GGUFQwen3.6-35B-A3B-UD-IQ3_XXS.gguf
Date tested: 2026-05-18

Tested on a consumer desktop, not a server rack.

Written by Claude Code · Edited by User


Most writeups on MTP + TurboQuant are NVIDIA. This is a ROCm-specific report from an RDNA3 card. There are a few things that will bite you that I didn't see documented anywhere. Posting this so you don't lose hours to the same issues.


Why This Combination

UD-IQ3_XXS quantization is 13 GB on disk — fits in 16 GB VRAM with room for KV cache.

The catch: at Q3-range quantization the KV cache overhead is proportionally larger. Without TurboQuant, a 32K context window adds another ~640 MB of f16 KV cache on top of the model. With turbo4 KV (-ctk turbo4 -ctv turbo4), that same 32K context costs ~170 MB. That's the margin that makes longer conversations viable on a 16 GB card.

MTP (Multi-Token Prediction) is baked into Qwen3.6-35B-A3B at training time — nextn_predict_layers = 1 in the model metadata. The NJannasch fork activates it with --spec-type draft-mtp.


Build

Single GPU setup (most people): straightforward. Build for your card's gfx target only.

git clone --branch mtp-turboquant --depth 1 https://github.com/NJannasch/llama.cpp
cd llama.cpp
cmake -B build \
  -DGGML_HIP=ON \
  -DAMDGPU_TARGETS="gfx1101" \
  -DCMAKE_BUILD_TYPE=Release
cmake --build build --target llama-server -j$(nproc)

Build time: ~20 minutes on a mid-range CPU.

Multi-GPU setup — read this before you build:

If you have a secondary AMD GPU in the system, the temptation is to list both targets in AMDGPU_TARGETS. Don't.

This fork does not have a rocBLAS TensileLibrary path for every gfx target. Specifically, gfx1032 (RX 6600 XT / 6700 XT family) is not covered. If you include it, the build succeeds but inference crashes at runtime:

rocBLAS error: Cannot read TensileLibrary.dat: Illegal seek for GPU arch: gfx1032

This is not a VRAM issue. The crash happens because the compiled binary has no kernel path for that architecture. Not due to memory size. A 6600 XT with unlimited VRAM would crash the same way. Mainline llama.cpp handles gfx1032 fine via a gfx1030 fallback — this fork does not.

Fix: build for your primary card's target only, and restrict GPU visibility at launch with HIP_VISIBLE_DEVICES (see Launch Flags below). The secondary card stays available for other processes; this binary just won't touch it.


Model Download

The filename in the MTP GGUF repo does not include "MTP" in the filename itself:

wget -c -O Qwen3.6-35B-A3B-UD-IQ3_XXS.gguf \
  'https://huggingface.co/unsloth/Qwen3.6-35B-A3B-MTP-GGUF/resolve/main/Qwen3.6-35B-A3B-UD-IQ3_XXS.gguf'

hf download worked but left an .incomplete file at exit code 0 in my session. Use wget -c with the direct resolve URL to be safe.


Launch Flags That Work

HIP_VISIBLE_DEVICES=1 ./build/bin/llama-server \
  --model Qwen3.6-35B-A3B-UD-IQ3_XXS.gguf \
  --port 8080 \
  --host 127.0.0.1 \
  -ngl 999 \
  --n-cpu-moe 0 \
  --ctx-size 32768 \
  --jinja \
  -np 1 \
  --spec-type draft-mtp \
  --spec-draft-n-max 3 \
  -ctk turbo4 \
  -ctv turbo4 \
  --reasoning-budget 512

HIP_VISIBLE_DEVICES=1 assumes the 7800 XT is your second GPU. Adjust to 0 if it's your only GPU.


Results

Benchmark (API level)

Config VRAM Gen t/s MTP acceptance
MTP + f16 KV, ctx 8192 14.93 GB ~118 ~100%
MTP + turbo4 KV, ctx 8192 14.80 GB ~111 ~88%
MTP + turbo4 KV, ctx 32768 14.58 GB ~110 ~86%

At 32K context turbo4 saves ~470 MB vs f16. Speed barely changes between ctx 8K and 32K with turbo4 — that's the point.

MTP acceptance at 86-100% is strong. Qwen3.6 was trained with MTP so draft quality is high.


UX Testing (web UI, thinking mode on, --reasoning-budget 512)

Tested against the built-in llama-server web UI after full service setup. Scoring is subjective: Pass / Degraded / Fail against the criterion listed.

Category Example prompt type First token Full response t/s Result
Simple direct Short factual, single-answer < 1s ~110 t/s ✅ Pass — fast, on point, no over-explanation
Concise instruction "Reply in exactly N words" < 1s ~110 t/s ✅ Pass — followed constraint, no padding
Multi-step reasoning Technical problem with 3+ constraints 1–2s ~95–105 t/s ✅ Pass — thinking budget used well, answer structured correctly
Live data / current events "What are good X in [city]?" 1–2s ~9 t/s ⚠️ ❌ Degraded — see note
Conversational follow-up Short reply in ongoing thread < 1s ~90–100 t/s ✅ Pass — context retained, no repetition

Live data note: The model has no web access. Without a --reasoning-budget cap, open-ended factual queries trigger an uncapped thinking loop where the model enumerates everything it knows from training data. This accumulated hundreds of tokens of internal reasoning, driving generation speed down from ~110 t/s to ~9 t/s as attention overhead compounded. With --reasoning-budget 512 of the thinking cuts that off Then the model states plainly that it can't provide live data — which is the correct answer. The degraded score reflects the behaviour without the budget; with it, this category becomes a Pass with a graceful "I don't have live data" response.

Thinking mode: The reasoning phase is not visible in the web UI by default — responses appear after the think completes. For tasks where the thinking budget is hit, the model produces a brief response from wherever reasoning ended. This is working as designed. For direct conversational use, disable thinking per-request via the API ("chat_template_kwargs": {"enable_thinking": false}) or set --reasoning-budget 0 at launch.


Issues to Watch For

1. The Silent CPU Fallback — Most Dangerous

Symptom: Model appears to load normally. VRAM reads correctly (14+ GB used). But generation is 5-10 t/s instead of 100+ t/s. CPU is pegged at 200-300% usage. GPU utilization reads 0%.

Cause: The ROCm runtime silently falls back to CPU compute when GPU initialization fails. The VRAM reading is misleading — the weights are loaded to GPU memory via mmap but computed on CPU.

How to catch it: Check the server log for compute buffer:

# GPU — correct
sched_reserve: ROCm0 compute buffer size = 493.00 MiB

# CPU — silent fallback
sched_reserve: CPU compute buffer size = 497.00 MiB

Also check startup for:

ggml_cuda_init: failed to initialize ROCm: no ROCm-capable device is detected

If you see that line, nothing is running on GPU regardless of what rocm-smi shows.

VRAM is not a reliable GPU-usage indicator. Always check compute buffer in the log.


2. ROCR_VISIBLE_DEVICES + HIP_VISIBLE_DEVICES Integer Conflict

If you're running multiple GPUs and use systemd (or any launcher that sets environment variables explicitly), watch out for this:

ROCR_VISIBLE_DEVICES and HIP_VISIBLE_DEVICES use different index spaces.

Variable Index 0 Index 1 Index 2
HIP_VISIBLE_DEVICES GPU 0 GPU 1 GPU 2
ROCR_VISIBLE_DEVICES CPU (HSA agent 0) GPU 0 GPU 1

Setting both to 1 targets different physical devices. The ROCm runtime sees a conflict and reports no capable device — then silently falls back to CPU. This is the root cause of the silent CPU fallback above in systemd contexts.

Fix: Use HIP_VISIBLE_DEVICES alone. Don't combine both with integer indices.

# Correct — one variable, GPU-only index
HIP_VISIBLE_DEVICES=1 ./llama-server ...

# Wrong — conflicting index spaces
ROCR_VISIBLE_DEVICES=1 HIP_VISIBLE_DEVICES=1 ./llama-server ...

If you want UUID-based targeting (more stable across reboots), use ROCR_VISIBLE_DEVICES=<UUID> + HIP_VISIBLE_DEVICES=0 (since only one device is visible, it becomes index 0).


3. Hybrid Model Requires MTP to Load

Qwen3.6-35B-A3B is a hybrid architecture — it alternates between full attention layers and SSM (Gated Delta Net / Mamba-style) recurrent layers every 4 blocks. This is not a standard transformer.

In the NJannasch fork, attempting to load this model without --spec-type draft-mtp causes an assert crash during slot initialization:

GGML_ASSERT(rollback >= 1 && rollback <= (llama_pos) n_rs_seq) failed

Stack trace: llama_memory_recurrent::seq_rmllama_memory_hybrid::seq_rmcommon_context_can_seq_rmserver_context_impl::load_model

--no-warmup does not fix this. -np 1 does not fix this. Only adding --spec-type draft-mtp resolves it.

This appears to be a fork-specific bug where the recurrent sequence memory (n_rs_seq) is only initialized correctly when MTP is active. The model metadata contains nextn_predict_layers = 1 — this model expects MTP.

--spec-type draft-mtp -np 1 are both required, not optional.


4. Thinking Model + No Reasoning Budget = Recall Spiral

Qwen3.6-35B-A3B is a thinking model. By default, it reasons before answering. On open-ended queries (especially anything involving enumeration from training data — "what restaurants are in X", "list all Y"), it can run an unbounded think block that accumulates hundreds to thousands of tokens.

This causes a secondary performance problem: as the thinking context grows, the O(n) attention overhead across the model's 10 full-attention layers compounds. A fresh 16-token prompt runs at ~118 t/s. The same session after 1000 tokens of accumulated thinking: ~9 t/s.

Fix: Set a token budget for the thinking phase:

--reasoning-budget 512

512 tokens is enough for genuine multi-step reasoning. Not enough for exhaustive training-data recall. Adjust upward if you need deeper analysis on complex tasks.

To disable thinking entirely per-request via API:

"chat_template_kwargs": {"enable_thinking": false}

Note: the /no_think prompt token does not work with this fork/model. Use the API parameter.


5. gfx1032 (RX 6600 XT) Not Supported by This Fork — and It's Not a VRAM Issue

Clarifying this because it's easy to misread: the problem is not that the 6600 XT has 8 GB of VRAM and the model is 13 GB.

In standard llama.cpp builds, insufficient VRAM is handled gracefully — the runtime calculates how many layers fit on GPU and offloads the rest to CPU. Slow, but not a crash. That behavior works fine.

The crash here is different. If you have a multi-GPU system with a 6600 XT alongside your main card and expose both GPUs (e.g. by removing HIP_VISIBLE_DEVICES), you get:

rocBLAS error: Cannot read TensileLibrary.dat: Illegal seek for GPU arch: gfx1032

This happens because the mtp-turboquant fork's compiled rocBLAS kernels do not include a path for gfx1032. VRAM size is irrelevant — a hypothetical 6600 XT with 64 GB would crash the same way. The architecture simply isn't in this binary's kernel set.

The mainline llama.cpp binary handles gfx1032 via a gfx1030 fallback path. This fork does not. The fix is to restrict GPU visibility to your supported card only:

HIP_VISIBLE_DEVICES=<index of gfx1101 card> ./llama-server ...

If you only have a gfx1032 card and want to run this fork: it will crash. Use mainline llama.cpp instead — you'll get CPU offload for layers that don't fit, but no rocBLAS crash.


Thinking Mode

The model ships with thinking on. For direct conversational use, disable per-request:

{
  "chat_template_kwargs": {"enable_thinking": false}
}

For a persistent no-think default, set --reasoning 0 at launch (disables thinking for all requests).


What I'd Still Like to Test

  • Higher ctx (65536+) with turbo4 — the VRAM math says it's viable but I haven't validated it
  • Quantized KV types other than turbo4 (q8_0 baseline comparison on this fork)
  • Whether the n_rs_seq assert is fixed in newer commits on the branch

TL;DR for AMD Users

  • ✅ MTP works on ROCm gfx1101 at 86-100% acceptance — the NVIDIA-only assumption is wrong
  • ✅ TurboQuant turbo4 KV works on ROCm — 160 MB → 47 MB at ctx 8K, 640 MB → 170 MB at ctx 32K
  • ⚠️ Check compute buffer in logs — VRAM usage will look normal even if you're on CPU
  • ⚠️ Don't combine ROCR_VISIBLE_DEVICES and HIP_VISIBLE_DEVICES with integers
  • ⚠️ --spec-type draft-mtp -np 1 required to load — not optional for this hybrid model
  • ⚠️ Set --reasoning-budget 512 or thinking loops will tank your t/s on open queries
  • ⚠️ Build with gfx1101 only — this fork has no rocBLAS kernel path for gfx1032. Not a VRAM issue — a 6600 XT with unlimited VRAM would crash the same way. Mainline llama.cpp handles gfx1032 fine (CPU layer offload); this fork does not.

Relevant upstream discussion: TurboQuant KV Cache Compression — Full HIP/ROCm Port

Did you tested it with pricise tasks? I.e. tool calls? It fails very often for me.

---- English is trans, sorry ----
Qwen3.6-35B-A3B-MTP-GGUF: Actual measurement shows that the generation speed is lower—only 53.5% of draft tokens are accepted, at 18.42 tokens per second. The original Qwen3.6-35B-A3B model can achieve 23.64 tokens per second.

Personal understanding:

  • MTP involves transferring model parameters once to generate multiple tokens, then verifying the final output one by one. The costs of prediction and verification are almost zero, but every successful verification saves on the cost of parameter transfer.
  • MoE is not suitable for MTP: MoE relies on experts to select routes when calculating tokens. It’s likely that predictions and verifications will not match the same experts, which creates a fundamental conflict with MTP’s optimized prediction and packaging.

Qwen3.6-35B-A3B-MTP-GGUF 实测:生成速度变低,draft tokens accepted 53.5% 18.42 token/s,原版的 Qwen3.6-35B-A3B 能到 23.64 token/s
个人理解:MTP:搬运一次模型参数,生成多个 token,然后依次验证输出最终 token;预测和验证成本几乎为零,但每多验证成功一个就省一次搬运成本。
MoE 不适合 MTP:MoE 算 token 依赖专家路由选择,预测和验证大概率不会命中相同的专家,这样就与 MTP 的打包预测优化产生了根本的冲突。

In qwen3.5 9b, the actual speed was 10 T/s; after enabling MTP, it increased to 15 T/s, with a acceptance rate of 54.9%. This is indeed good news for dense models.

qwen3.5 9b 实测 原来是 10 t/s,开启 mtp 后 15t/s,54.9% 接受率。对于稠密模型确实是好消息。

This is very interesting and would indeed explain why draft acceptance rate is so low.

On the other hand, I tested it yesterday and it was generating about 60t/s in long context on Strix Halo. That is great achievement for local model.

  • MTP Conclusion: MTP does not save memory, but it is effective for systems with excess computing power and memory bandwidth usage exceeding 50%.

  • Mini PC UM 790 pro 96G (2x48G 5600M, memory bandwidth 59G/s)

    • 2B and below: The card’s computing power is limited; MTP increases overload. MTP should be turned off.
    • 4B to 7B: Bandwidth bottleneck; the computing power remains sensitive. MTP should be turned on, with a maximum draft of 1.
    • 9B to 32B: Pure bandwidth bottleneck; excess computing power exists. MTP should be turned on, with a maximum draft of 2 or 3.
    • MoE: Expert routing conflicts with MTP’s functionality. MTP must be turned off.

  • MTP 结论:MTP 不会节省内存,但对算力过剩,内存带宽使用率超50%的都会有很好的效果。
  • UM 790 pro 96G(2x48G 5600M,内存带宽 59G/s)
    • 2B 及以下:卡算力,MTP 加重过载,MTP Off
    • 4B ~ 7B:带宽瓶颈,算力仍敏感,MTP On,Max Draft = 1
    • 9B ~ 32B:纯带宽瓶颈,算力过剩,MTP On,Max Draft = 2 或 3
    • MoE:专家路由 与 MTP 验证底层冲突,坚决 MTP Off

@natanpodbielski Can you test Qwen3.6-35B-A3B-MTP-GGUF MTP off vs MTP On ? I guess MTP off win.

I am doing it now.

MTP off: 62.98t/s
MTP 2 drats: 71.33t/s
MTP 3 drafts: 68.27t/s
MTP 6 drafts: 84.72t/s

I am almost sure I did not made mistake here. Seems to be pretty hectic but generally it is faster with MTP. I will run testing again to make sure.

no MTP: 63.28t/s
MTP 2 drafts: 71.64
MTP 3 drafts: 69.22
MTP 6 drafts: 92.21

Exactly the same. Maybe because original model is optimised for 2 draft tokens and GGUF version for 6?

@natanpodbielski Thank you.

MTP is suitable for cases where there’s an excess of computing power and the actual memory bandwidth usage exceeds 50%. It’s very effective in such situations.

  • For MoE models, the initial acceptance level is relatively low; setting the max draft to 1 or 2 would be appropriate. For dense models, setting it to 2 or 3 might be more suitable.
  • This method works well for cases where memory bandwidth is insufficient, like mine, where I use a GPU with regular memory.
  • The unified memory architecture in Mac also benefits from MTP. In reality, Mac’s actual memory bandwidth is probably only about 1/4 to 1/2 of the theoretical value (120GB/s to 230GB/s).

It’s normal for the same model to have different performance levels due to differences in computing power and memory bandwidth. This is why different people may reach different conclusions.

My UM 790 Pro uses a GPU, and the measured memory (memory bandwidth) is only 59G/s. Its performance is very low. Therefore, MTP works well for models that suffer from insufficient memory bandwidth. However, for smaller models, since there’s no issue with memory bandwidth, it focuses more on improving computing power. So, MTP represents a negative optimization in this case.


MTP 适合:算力过剩,实际显存带宽使用率 > 50% ,非常有效。

  • MoE 模型天生 Accepted 偏低,max draft 设置为 1、2 比较合适,稠密模型可以考虑设置为 2、3。
  • 内存带宽不足的应该不错,比如我这种核显+普通内存的,Mac 统一内存架构也是 MTP 的受益者,mac 的实际内存带宽应该也只有理论的 1/4 ~ 1/2 (120GB/s ~ 230GB/s).

因算力和显存带宽不同,相同模型会有不同的表现完全正常,这也是为什么不同的人会有不同的结论。
我的 UM 790 pro 是核显、实测内存(显存)带宽 59G/s,性能非常低,所以以前卡显存带宽的模型 MTP 效果很好,但对小模型来说因为不卡显存带宽,卡算力了,所以 MTP 是负优化。

Sign up or log in to comment