Fast!

#10
by mazuj2 - opened

Pushed me over the hump! I was getting 110 tps on my 5070 Ti / 5060 Ti bifurcated 32 GB setup. I am now getting 145 to 160 tps!
Thanks, unsloth and qwen!

Which Quant variant were you using?
My 5060 Ti only manages 20-30 tps, though that is with 65k of active context already loaded, on UD_IQ4NL_XL without MTP. Downloading the UD_Q4_XL MTP right now, let's see...

You should check whether you have enough VRAM and that nothing is being copied back and forth because memory ran out. Speed does not differ much between quants, so 20-30 tps is not normal.
My 4090 was doing around 150 tps on Qwen3.6-35B-A3B-GGUF and now does around 170 tps on Qwen3.6-35B-A3B-MTP-GGUF.
Also, just in case: this is an MoE model. Maybe your speed is for the 27B model?
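One way to do the VRAM check suggested above is to read per-GPU memory from `nvidia-smi`. This is a minimal sketch, assuming an NVIDIA driver with `nvidia-smi` on the PATH; the helper names and the 0.95 threshold are made up for illustration. A nearly full GPU does not prove that anything is spilling to system RAM, it is only a hint worth following up.

```python
import subprocess


def parse_vram_csv(text: str) -> list[tuple[int, int]]:
    """Parse lines of 'used, total' (MiB) into (used, total) tuples."""
    rows = []
    for line in text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        rows.append((used, total))
    return rows


def query_vram() -> list[tuple[int, int]]:
    """Ask nvidia-smi for per-GPU memory usage in MiB."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_vram_csv(out)


def nearly_full(rows: list[tuple[int, int]], threshold: float = 0.95) -> list[int]:
    """Indices of GPUs whose used/total ratio is at or above `threshold`.

    The threshold is an arbitrary assumption, not a documented limit.
    """
    return [i for i, (used, total) in enumerate(rows) if used / total >= threshold]
```

Running `nearly_full(query_vram())` while the model is generating lists the GPUs that are close to their limit.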

You don't have enough VRAM.

I do. It ran okay with offloading to CPU. This machine is my minimal server, but MTP bloated it, so I went without MTP.
On my main machine I have 96 GB and prefer the dense 27B there, but honestly, nmax above 2 somehow produced only robotic results, putting everything into bullet points etc.

---- English is translated, sorry ----
Qwen3.6-35B-A3B-MTP-GGUF, measured: generation got slower. With 53.5% of draft tokens accepted it ran at 18.42 tokens per second, while the original Qwen3.6-35B-A3B reaches 23.64 tokens per second.

Personal understanding:

  • MTP: move the model parameters once, generate multiple tokens, then verify them one by one to output the final tokens. Prediction and verification cost almost nothing, and every additional token that passes verification saves one parameter transfer.
  • MoE is not suited to MTP: computing a token in an MoE model depends on expert routing, and the predicted and verified tokens will most likely not hit the same experts, which fundamentally conflicts with MTP's batched-prediction optimization.
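The trade-off described above can be put into a rough back-of-the-envelope model. This is only a sketch, not how any runtime actually accounts for it. It assumes each draft token is accepted independently with probability `accept` (the reported "draft tokens accepted" figure is an aggregate, so this is a simplification), and `overhead_per_draft` is a hypothetical knob for the extra cost one draft token adds to a pass, relative to a plain decode step.

```python
def expected_tokens_per_pass(accept: float, max_draft: int) -> float:
    """Expected tokens emitted per verification pass.

    Assumes each draft token is accepted independently with probability
    `accept` and that verification stops at the first rejection. Every pass
    yields one token from the main model plus the accepted draft prefix.
    """
    return 1.0 + sum(accept ** k for k in range(1, max_draft + 1))


def estimated_speedup(accept: float, max_draft: int, overhead_per_draft: float) -> float:
    """Rough speedup over plain decoding.

    `overhead_per_draft` is a hypothetical parameter: the extra cost of one
    draft token (drafting plus verifying it) as a fraction of a normal step.
    """
    cost_per_pass = 1.0 + max_draft * overhead_per_draft
    return expected_tokens_per_pass(accept, max_draft) / cost_per_pass


if __name__ == "__main__":
    # Sweep the assumed overhead at the acceptance rate reported above.
    for overhead in (0.05, 0.30, 0.60):
        s = estimated_speedup(accept=0.535, max_draft=2, overhead_per_draft=overhead)
        print(f"overhead={overhead:.2f} -> estimated speedup {s:.2f}x")
```

Under these assumptions, the same acceptance rate can give either a speedup or a slowdown depending on the per-draft overhead. A result like 18.42 versus 23.64 tokens per second (about 0.78x) would fit the high-overhead case, which is what the expert-routing argument predicts if verification has to read additional experts.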

On qwen3.5 9b, the measured speed was 10 t/s; after enabling MTP it rose to 15 t/s, with an acceptance rate of 54.9%. This is indeed good news for dense models.

  • MTP conclusion: MTP does not save memory, but it works well on any system with spare compute where memory bandwidth utilization exceeds 50%.

  • Mini PC UM 790 pro 96G (2x48G 5600M, memory bandwidth 59 GB/s)

    • 2B and below: compute-bound; MTP adds to the overload. MTP off.
    • 4B to 7B: bandwidth bottleneck, but still sensitive to compute. MTP on, max draft = 1.
    • 9B to 32B: purely bandwidth-bound, with compute to spare. MTP on, max draft = 2 or 3.
    • MoE: expert routing conflicts with MTP verification at a fundamental level. MTP firmly off.
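The rule of thumb above can be written down as a small lookup. This is a sketch of the poster's heuristic for one bandwidth-limited mini PC, not a general recommendation; `recommend_mtp` and `MtpSetting` are hypothetical names, and sizes that fall between the stated ranges (for example 3B or 8B) are assigned to the next range up purely as an assumption.

```python
from typing import NamedTuple


class MtpSetting(NamedTuple):
    enabled: bool
    max_draft: int  # 0 when MTP is off


def recommend_mtp(params_billion: float, is_moe: bool) -> MtpSetting:
    """Encode the rule of thumb from the post.

    Gaps between the stated ranges are filled by assumption (next range up).
    """
    if is_moe:
        return MtpSetting(False, 0)   # expert routing vs. MTP verification
    if params_billion <= 2:
        return MtpSetting(False, 0)   # compute-bound
    if params_billion <= 7:
        return MtpSetting(True, 1)    # bandwidth-bound, compute still tight
    if params_billion <= 32:
        return MtpSetting(True, 2)    # post says 2 or 3; 2 is the cautious end
    raise ValueError("models above 32B are not covered by the rule of thumb")
```

For example, `recommend_mtp(9, is_moe=False)` gives MTP on with a max draft of 2, while any MoE model gets MTP off regardless of size.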

I agree, to me the quality tradeoff on 48 GB of VRAM or less is worse than just using a bigger quant.
I'd rather have the 4 GB used by a Q6 without MTP than by a Q4 with MTP.
I know the purpose of MTP is to speed things up, but I also saw a degradation in quality, and I would rather have better quality than bullet-point responses.

@Wladastic MTP does not reduce intelligence, and there is no loss of accuracy.
