RTX PRO 4500 Blackwell results

#2
by Pulsate1680 - opened

Thank you for creating this! Sharing some stats from my run:

Setup:
RTX PRO 4500 Blackwell, 32GB GDDR7, 200W TGP
WSL2 (Ubuntu 24.04) on Windows 11
vLLM 0.19.2rc1 (cu130-nightly Docker image)
Model: sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP (modelopt NVFP4, MTP head grafted back in BF16)
BF16 KV cache, 131K context

Numbers (single-stream, thinking disabled, vllm bench serve):
Steady-state TG: 60–73 tok/s (engine logs, varies by content)
Mean: ~65 tok/s, peaks 73
TPOT: 17 ms
TTFT: 240 ms median
Acceptance length: 3.19 mean (3.35–3.97 on easier text)
Per-position acceptance: 87/72/61% mean, 99/94/91% on best windows
Model footprint: 18.55 GB
KV cache: 9.77 GB available, ~37K token pool
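
As a rough cross-check of the figures above, the following Python sketch recomputes the acceptance length from the per-position acceptance rates and converts TPOT to tokens per second. It assumes the per-position rates are measured per draft round (the fraction of rounds in which draft position i was accepted), so the expected length is one verifier token plus the sum of the rates. That relation is an assumption about how the engine reports these numbers, not something stated in the post.

```python
# Figures quoted in the post above.
per_position_mean = [0.87, 0.72, 0.61]  # mean acceptance at draft positions 1..3
per_position_best = [0.99, 0.94, 0.91]  # best windows
tpot_ms = 17.0                          # time per output token from the benchmark


def expected_acceptance_length(per_position):
    """Assumed relation: 1 token from the verifier plus one per accepted draft position."""
    return 1.0 + sum(per_position)


def tokens_per_second(tpot_ms):
    """Convert a per-token latency in milliseconds to a decode rate."""
    return 1000.0 / tpot_ms


mean_len = expected_acceptance_length(per_position_mean)
best_len = expected_acceptance_length(per_position_best)
rate = tokens_per_second(tpot_ms)

print(f"acceptance length from mean rates: {mean_len:.2f}")
print(f"acceptance length from best rates: {best_len:.2f}")
print(f"decode rate implied by TPOT:       {rate:.1f} tok/s")
```

Under that assumption the mean rates give about 3.20, in line with the reported 3.19, and the best-window rates land inside the reported 3.35–3.97 band. The TPOT-implied rate sits just under the 60–73 tok/s range from the engine logs, which is plausible given that the two figures come from different measurement sources.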

vLLM launch (compose command block, YAML):

- sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP
- --quantization
- modelopt
- --speculative-config
- '{"method":"qwen3_5_mtp","num_speculative_tokens":3}'
- --max-model-len
- "131072"
- --max-num-batched-tokens
- "4096"
- --max-num-seqs
- "10"
- --gpu-memory-utilization
- "0.93"
- --enable-prefix-caching
- --no-scheduler-reserve-full-isl
- --trust-remote-code
- --reasoning-parser
- qwen3
- --enable-auto-tool-choice
- --tool-call-parser
- qwen3_coder
- --default-chat-template-kwargs
- '{"preserve_thinking":true}'
- --language-model-only
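
The memory figures in the post can be sanity-checked against the --gpu-memory-utilization setting with a few lines of Python. This is only back-of-the-envelope arithmetic: it assumes all quantities use the same GB unit, treats "~37K" as 37,000 tokens, and labels whatever is left after weights and KV cache as "other" (presumably activations and runtime overhead, which the post does not break down).

```python
# Values taken from the post; unit consistency (GB vs GiB) is an assumption.
total_vram_gb = 32.0
gpu_memory_utilization = 0.93
model_gb = 18.55
kv_gb = 9.77
kv_tokens = 37_000        # "~37K token pool", rounded
max_model_len = 131_072

budget_gb = total_vram_gb * gpu_memory_utilization  # memory vLLM may claim
other_gb = budget_gb - model_gb - kv_gb             # remainder, not itemised in the post
kv_kb_per_token = kv_gb * 1e6 / kv_tokens           # rough average, decimal units
pool_fraction = kv_tokens / max_model_len           # pool size relative to max context

print(f"budget:          {budget_gb:.2f} GB")
print(f"other/overhead:  {other_gb:.2f} GB")
print(f"KV per token:    {kv_kb_per_token:.0f} KB (rough average)")
print(f"pool / max len:  {pool_fraction:.2f}")
```

The weights and KV cache together account for most of the 0.93 budget, and the token pool is roughly a quarter of the configured 131K maximum context.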

Hi @Pulsate1680 — coming back to thank you. Your num_speculative_tokens=3 line in this thread is what unlocked the next jump for our family of MTP repos.

I had been documenting num_speculative_tokens=1 based on the "MTP head has 1 layer" reasoning, which is structurally true but missed that vLLM applies the single MTP layer recursively to draft several tokens per step. Your mean acceptance length of 3.19 (3.35–3.97 on easier text) on the RTX PRO 4500 was the load-bearing evidence that the recursive draft was actually paying off. I took your numbers, rebenched on RTX PRO 6000 Blackwell + vLLM 0.19.1rc1 at T = 0, and saw the same shape on all four of our Qwen3.6-family NVFP4 + MTP repos:

| Repo | n=1 (prior) | n=3 (this finding) |
|---|---|---|
| Qwen3.6-27B-Text-NVFP4-MTP | 71–85 | 132 / 105 / 106 |
| Carnice-V2-27b-NVFP4-TEXT-MTP | 93 | 134 / 102 / 103 |
| Huihui-Qwen3.6-…-NVFP4-TEXT-MTP | ~71 | 135 / 112 / 109 ← family fastest |
| Huihui-Qwen3.6-…-NVFP4-MTP (VLM) | | 137 / 112 / 104 text · 129 with image |

(short / medium / long-form prompts.)
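
To put the n=1 versus n=3 figures side by side, here is a small Python sketch that turns them into speedup bounds. The n=1 column does not say which prompt length it was measured on, so the sketch assumes it could correspond to any of the three n=3 prompt classes and reports a lower and an upper bound. "~71" is treated as exactly 71. The VLM row is left out because no n=1 figure is given for it.

```python
# (low, high) for n=1 and (short, medium, long) for n=3, copied from the reply.
n1 = {
    "Qwen3.6-27B-Text-NVFP4-MTP": (71, 85),
    "Carnice-V2-27b-NVFP4-TEXT-MTP": (93, 93),
    "Huihui-Qwen3.6-…-NVFP4-TEXT-MTP": (71, 71),  # "~71" treated as 71
}
n3 = {
    "Qwen3.6-27B-Text-NVFP4-MTP": (132, 105, 106),
    "Carnice-V2-27b-NVFP4-TEXT-MTP": (134, 102, 103),
    "Huihui-Qwen3.6-…-NVFP4-TEXT-MTP": (135, 112, 109),
}


def speedup_bounds(prior, current):
    """Most pessimistic and most optimistic n=3 / n=1 ratio."""
    return min(current) / max(prior), max(current) / min(prior)


bounds = {repo: speedup_bounds(n1[repo], n3[repo]) for repo in n1}
for repo, (low, high) in bounds.items():
    print(f"{repo}: {low:.2f}x to {high:.2f}x")
```

Even the pessimistic bound is above 1x for every repo that has an n=1 baseline.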

All four READMEs were updated today to make num_speculative_tokens: 3 the recommended setting and explicitly cite this thread for the credit. The Huihui abliterated body comes out fastest of the group, which is consistent with refusal-shaped tokens being smoothed out — fewer awkward low-acceptance spots for the recursive draft.
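
For readers unfamiliar with the idea, the recursive drafting discussed in this reply can be illustrated with a toy Python sketch. It is not vLLM's implementation: target_next and draft_next are hypothetical stand-ins for the full model and the single MTP layer, and verification is written as a loop, whereas a real engine would check all drafted positions in one batched forward pass. The sketch covers only the greedy (T = 0) case, where a draft token is accepted when it equals the target's own choice.

```python
def target_next(ctx):
    """Toy stand-in for the full model's greedy (T = 0) next token."""
    return (3 * ctx[-1] + len(ctx)) % 11


def draft_next(ctx):
    """Toy stand-in for the single MTP layer: agrees with the target
    except at every 5th position, where it guesses wrong."""
    tok = target_next(ctx)
    return (tok + 1) % 11 if len(ctx) % 5 == 0 else tok


def recursive_draft(ctx, k):
    """Apply the same one-step draft k times, feeding each guess back in."""
    drafted = []
    for _ in range(k):
        drafted.append(draft_next(ctx + drafted))
    return drafted


def verify(ctx, drafted):
    """Greedy verification: keep the matching prefix, then add one target token."""
    accepted = []
    for tok in drafted:
        expected = target_next(ctx + accepted)
        if tok != expected:
            accepted.append(expected)  # correction token replaces the miss
            return accepted
        accepted.append(tok)
    accepted.append(target_next(ctx + accepted))  # bonus token when all drafts hit
    return accepted


def speculative_decode(prompt, n_new, k):
    """Decode n_new tokens with k drafted tokens per step; returns (tokens, steps)."""
    out = list(prompt)
    steps = 0
    while len(out) - len(prompt) < n_new:
        out += verify(out, recursive_draft(out, k))
        steps += 1
    return out[: len(prompt) + n_new], steps


def greedy_decode(prompt, n_new):
    """Plain one-token-per-step decoding with the target model only."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(target_next(out))
    return out


tokens_k3, steps_k3 = speculative_decode([1, 2], 40, k=3)
tokens_k1, steps_k1 = speculative_decode([1, 2], 40, k=1)
print(steps_k1, steps_k3)
```

Both settings reproduce the plain greedy output exactly. Raising k only reduces the number of verifier steps needed to produce it, and the gain depends on how often the draft agrees with the target.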

Your --no-scheduler-reserve-full-isl + preserve_thinking chat-template kwarg recipe is also gold — added both to my standard launch profile.

Real thanks for posting clean numbers with the launch flags inline. Worth more than the whole "we should optimise NVFP4" conversation ever was.

— Tonoken3 / Lna-Lab
