Weirdly no perf gain

#1
by Orosius - opened

MTP works (97% acceptance rate), but that translates into lower GPU utilization instead of more tokens/s.

With this quant:

(APIServer pid=352258) INFO 04-25 11:02:01 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 1.93, Accepted throughput: 26.00 tokens/s, Drafted throughput: 28.00 tokens/s, Accepted: 260 tokens, Drafted: 280 tokens, Per-position acceptance rate: 0.929, Avg Draft acceptance rate: 92.9%
(APIServer pid=352258) INFO 04-25 11:02:11 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 51.4 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 51.7%, Prefix cache hit rate: 47.8%
(APIServer pid=352258) INFO 04-25 11:02:11 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 1.96, Accepted throughput: 25.20 tokens/s, Drafted throughput: 26.20 tokens/s, Accepted: 252 tokens, Drafted: 262 tokens, Per-position acceptance rate: 0.962, Avg Draft acceptance rate: 96.2%
(APIServer pid=352258) INFO 04-25 11:02:21 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 54.6 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 52.5%, Prefix cache hit rate: 47.8%
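For anyone reading these lines closely, here is a small sketch that cross-checks the reported metrics against each other. It assumes that with num_speculative_tokens=1 every decode step drafts exactly one token, so the mean acceptance length should be roughly 1 plus the acceptance rate. The helper name parse_spec_metrics is made up for illustration and is not part of vLLM.

```python
import re

# Metrics line copied from the log above (APIServer prefix trimmed).
LINE = (
    "SpecDecoding metrics: Mean acceptance length: 1.93, "
    "Accepted throughput: 26.00 tokens/s, Drafted throughput: 28.00 tokens/s, "
    "Accepted: 260 tokens, Drafted: 280 tokens, "
    "Per-position acceptance rate: 0.929, Avg Draft acceptance rate: 92.9%"
)


def parse_spec_metrics(line: str) -> dict:
    """Hypothetical helper: pull token counts and mean acceptance length out of one log line."""
    accepted = int(re.search(r"Accepted: (\d+) tokens", line).group(1))
    drafted = int(re.search(r"Drafted: (\d+) tokens", line).group(1))
    mean_len = float(re.search(r"Mean acceptance length: ([\d.]+)", line).group(1))
    return {"accepted": accepted, "drafted": drafted, "mean_len": mean_len}


metrics = parse_spec_metrics(LINE)
acceptance = metrics["accepted"] / metrics["drafted"]

# Assumption: one drafted token per step, so each step emits the verified
# token plus the draft token whenever the draft is accepted.
tokens_per_step = 1 + acceptance

print(f"acceptance rate: {acceptance:.1%}")
print(f"tokens per step: {tokens_per_step:.2f} (log reports {metrics['mean_len']})")
```

The recomputed values agree with what the log prints, so the metrics themselves look internally consistent.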

And GPU util is around 60%.

With another NVFP4 quant without MTP, I'm at around 50–55 tps, but GPU util is around 95%.
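To make the comparison concrete, here is a rough back-of-the-envelope sketch in Python. It takes the 11:02:11 log window (51.4 tokens/s at a mean acceptance length of 1.96) and the low end of the non-MTP run, and assumes the non-MTP run emits exactly one token per decode step. The variable names are illustrative only.

```python
# Numbers from the 11:02:11 log window and the non-MTP baseline quoted above.
mtp_tok_per_s = 51.4        # generation throughput with MTP
mtp_tokens_per_step = 1.96  # mean acceptance length in the same window
baseline_tok_per_s = 50.0   # low end of the 50-55 tps non-MTP run

# Assumption: without speculation, one decode step emits one token.
baseline_tokens_per_step = 1.0

mtp_steps_per_s = mtp_tok_per_s / mtp_tokens_per_step
baseline_steps_per_s = baseline_tok_per_s / baseline_tokens_per_step

mtp_step_ms = 1000 / mtp_steps_per_s
baseline_step_ms = 1000 / baseline_steps_per_s

print(f"MTP:      {mtp_steps_per_s:.1f} steps/s, {mtp_step_ms:.1f} ms per step")
print(f"Baseline: {baseline_steps_per_s:.1f} steps/s, {baseline_step_ms:.1f} ms per step")
print(f"MTP step is {mtp_step_ms / baseline_step_ms:.2f}x longer")
```

Under these assumptions the MTP run completes only about 26 steps per second, against about 50 without MTP. Each step yields nearly two tokens but takes nearly twice as long, which would explain both the flat tokens/s and the lower GPU utilization.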

Hardware : RTX5090
WSL 2
uv run vllm serve sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP --max-model-len 131072 --reasoning-parser qwen3 --kv-cache-dtype "fp8_e4m3" --language-model-only --skip-mm-profiling --enable-prefix-caching --enable-auto-tool-choice --host "0.0.0.0" --tool-call-parser qwen3_coder --port "8080" --max-num-batched-tokens 16384 --gpu-memory-utilization 0.89 --quantization modelopt --speculative-config '{"method":"qwen3_5_mtp","num_speculative_tokens":1}'

Hi @Orosius, thanks for the very clean diagnostic. We re-ran your exact launch flags on our side (RTX PRO 6000 Blackwell, vLLM 0.19.1rc1, native Linux, no WSL) and got numbers that point to an environment-specific issue rather than a config one:

| Setup | Phase A (short, T=0) | Phase B (long-form, 2000 tok, T=0.7) | Acceptance rate | Mean acceptance length |
|---|---|---|---|---|
| Orosius (5090, WSL2, prefix-cache ON) | 51–55 tok/s | 51–55 tok/s | 92.9–96.2% | 1.93–1.96 |
| Lna-Lab (PRO 6000, native Linux, prefix-cache ON) | 57.5 tok/s | 83.3 tok/s | 86.8–88.3% | 1.87–1.88 |
| Lna-Lab (PRO 6000, native Linux, prefix-cache OFF, all else identical) | 59.5 tok/s | 88.4 tok/s | 86.8–89.7% | 1.87–1.90 |

Two takeaways:

  1. --enable-prefix-caching is not the culprit. Toggling it on/off with everything else identical only moves long-form decode from 88.4 → 83.3 tok/s on our box (~5 tok/s difference). Same --max-num-batched-tokens 16384, same KV FP8, same modelopt, same MTP spec config.

  2. Same flags, ~+73% on long-form on PRO 6000 vs your 5090+WSL2. Your acceptance rate is actually slightly higher than ours (you're getting more drafted tokens accepted per step), so the draft head is doing its job; the gain just isn't materializing into wall-clock throughput.
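For reference, the ~+73% figure can be reproduced from the long-form numbers in the comparison above. This sketch assumes it pairs our best long-form result (88.4 tok/s, prefix-cache OFF) with the low end of the 5090 range (51 tok/s); other pairings give a smaller gap. The function name relative_gain is illustrative.

```python
def relative_gain(ours_tok_s: float, theirs_tok_s: float) -> float:
    """Percentage by which `ours_tok_s` exceeds `theirs_tok_s`."""
    return (ours_tok_s / theirs_tok_s - 1.0) * 100.0


# Long-form (Phase B) throughput from the comparison above.
best_case = relative_gain(88.4, 51.0)    # prefix-cache OFF vs low end of the 5090 range
worst_case = relative_gain(83.3, 55.0)   # prefix-cache ON vs high end of the 5090 range

print(f"gap ranges from about +{worst_case:.0f}% to +{best_case:.0f}%")
```

Either way the gap is well above anything the prefix-cache toggle could account for.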

Most likely suspects on your side (in rough order):

  • WSL2 CUDA passthrough overhead. WSL2's GPU virtualization adds latency on small-batch kernel launches; the MTP draft pass is exactly that workload (one extra small forward per step). On native Linux the same draft pass costs much less. If you can boot a native Linux partition (or a Linux container with --gpus all outside WSL), even a quick test would isolate this.
  • vLLM build / nightly drift. Could you share the exact vllm --version? There were Blackwell-specific MTP fixes between 0.19.0 and 0.19.1rc1; if you're on an older nightly, FlashInferCutlassNvFp4LinearKernel selection for the draft pass may be off.
  • GPU clock/thermal on 5090. Slightly less likely given you're not at 95% util, but worth checking nvidia-smi --query-gpu=clocks.current.sm,clocks.max.sm during the run; WSL2 can also mask thermal throttling.
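If it helps with the clock check, here is a minimal sketch for sampling SM clocks while a generation request is running. It assumes nvidia-smi is on PATH and reports plain integers for both fields; the helper names parse_clocks and sample_clocks are made up for this example.

```python
import shutil
import subprocess
import time

# Same query as above, with header and units stripped for easy parsing.
QUERY = [
    "nvidia-smi",
    "--query-gpu=clocks.current.sm,clocks.max.sm",
    "--format=csv,noheader,nounits",
]


def parse_clocks(csv_text: str) -> list[tuple[int, int]]:
    """Parse 'current, max' MHz pairs, one line per GPU."""
    pairs = []
    for line in csv_text.strip().splitlines():
        current, maximum = (int(field.strip()) for field in line.split(","))
        pairs.append((current, maximum))
    return pairs


def sample_clocks(samples: int = 5, interval_s: float = 1.0) -> None:
    """Print the SM clock once per interval; run this during a decode."""
    for _ in range(samples):
        out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
        for current, maximum in parse_clocks(out):
            print(f"SM clock {current} / {maximum} MHz ({current / maximum:.0%} of max)")
        time.sleep(interval_s)


if __name__ == "__main__" and shutil.which("nvidia-smi"):
    sample_clocks()
```

A clock that stays close to its maximum throughout the run would make throttling an unlikely explanation.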

If you want, I can mirror your launch command verbatim including --reasoning-parser qwen3 / --tool-call-parser qwen3_coder and post the kernel-selection lines from our startup log so you can diff them against yours. Happy to dig further.

- sakamakismile

uv run vllm --version
0.19.2rc1.dev206+g95995bbef

nvidia-smi --query-gpu=clocks.current.sm,clocks.max.sm --format=csv
clocks.current.sm [MHz], clocks.max.sm [MHz]
2835 MHz, 3090 MHz

Somewhat stable around 2850 MHz.

Everything seems to point toward a WSL2 problem.

@Orosius Confirmed: your numbers (5090 + WSL2, 51–55 tok/s at 92–96% acceptance) line up with WSL2 CUDA passthrough overhead on small-batch kernel launches, which is exactly the workload MTP draft passes generate. Nothing to fix on the checkpoint side. If you can boot a native Linux partition for one quick sanity run, I'd expect ~85 tok/s long-form at the same flags.

One bonus: num_speculative_tokens=3 (instead of 1) gets us 132 tok/s short-form / 105 long-form on PRO 6000; vLLM applies the MTP layer recursively. Worth trying once you're off WSL2.
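As a rough illustration of why more speculative tokens can help, here is a sketch that builds the --speculative-config value and estimates an upper bound on tokens per step. It assumes every draft position is accepted independently with the same probability, which is optimistic: the thread does not report per-position rates for 3 tokens, and acceptance may well drop at later positions. The function name expected_tokens_per_step is illustrative.

```python
import json


def expected_tokens_per_step(acceptance: float, num_spec_tokens: int) -> float:
    """1 verified token plus drafts accepted in sequence.

    Assumption: each draft position is accepted with the same independent
    probability, and a rejection discards all later drafts in that step.
    """
    return sum(acceptance ** i for i in range(num_spec_tokens + 1))


# Same method string as in the launch command, with 3 tokens instead of 1.
spec_config = json.dumps({"method": "qwen3_5_mtp", "num_speculative_tokens": 3})
print(f"--speculative-config '{spec_config}'")

for k in (1, 3):
    print(f"k={k}: up to {expected_tokens_per_step(0.93, k):.2f} tokens per step")
```

Whether that turns into throughput still depends on how cheap each extra draft pass is, which is the part WSL2 appears to be hurting.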

- Tonoken3 / Lna-Lab

I observed the same and suspect the same, WSL2 👎

Unfortunately I run Windows.

This model is still badass, however.
