Weirdly no perf gain
MTP works (97% acceptance rate), but that translates into lower GPU utilization instead of more tokens/s.
With this quant:
(APIServer pid=352258) INFO 04-25 11:02:01 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 1.93, Accepted throughput: 26.00 tokens/s, Drafted throughput: 28.00 tokens/s, Accepted: 260 tokens, Drafted: 280 tokens, Per-position acceptance rate: 0.929, Avg Draft acceptance rate: 92.9%
(APIServer pid=352258) INFO 04-25 11:02:11 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 51.4 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 51.7%, Prefix cache hit rate: 47.8%
(APIServer pid=352258) INFO 04-25 11:02:11 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 1.96, Accepted throughput: 25.20 tokens/s, Drafted throughput: 26.20 tokens/s, Accepted: 252 tokens, Drafted: 262 tokens, Per-position acceptance rate: 0.962, Avg Draft acceptance rate: 96.2%
(APIServer pid=352258) INFO 04-25 11:02:21 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 54.6 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 52.5%, Prefix cache hit rate: 47.8%
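For anyone who wants to sanity-check these log lines, here is a minimal sketch that pulls the counters out of a `SpecDecoding metrics` line and recomputes the acceptance rate. The function name `parse_spec_metrics` is hypothetical, the regex only targets the log format quoted above, and the "tokens per step = 1 + acceptance rate" relation is an assumption that should hold only for `num_speculative_tokens=1`.

```python
import re

# Matches the counters in a "SpecDecoding metrics" line as quoted in this thread.
# Other vLLM versions may format this line differently.
METRICS_RE = re.compile(
    r"Mean acceptance length: (?P<mal>[\d.]+).*?"
    r"Accepted: (?P<accepted>\d+) tokens, Drafted: (?P<drafted>\d+) tokens"
)


def parse_spec_metrics(line: str) -> dict:
    """Extract counters from one log line and derive the acceptance rate."""
    m = METRICS_RE.search(line)
    if m is None:
        raise ValueError("not a SpecDecoding metrics line")
    accepted = int(m["accepted"])
    drafted = int(m["drafted"])
    rate = accepted / drafted
    return {
        "logged_mean_acceptance_length": float(m["mal"]),
        "acceptance_rate": rate,
        # Assumption: with one draft token per step, each step emits the
        # verified token plus the draft token whenever it is accepted.
        "tokens_per_step": 1.0 + rate,
    }


line = (
    "SpecDecoding metrics: Mean acceptance length: 1.93, "
    "Accepted throughput: 26.00 tokens/s, Drafted throughput: 28.00 tokens/s, "
    "Accepted: 260 tokens, Drafted: 280 tokens, "
    "Per-position acceptance rate: 0.929, Avg Draft acceptance rate: 92.9%"
)
stats = parse_spec_metrics(line)
print(stats)
```

If the recomputed `tokens_per_step` agrees with the logged mean acceptance length, the draft head is behaving as reported and any missing speedup has to come from step time rather than from rejected drafts.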
GPU utilization sits around 60%.
With another NVFP4 quant without MTP, I get around 50–55 tok/s, but GPU utilization is around 95%.
Hardware: RTX 5090
WSL 2
uv run vllm serve sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP --max-model-len 131072 --reasoning-parser qwen3 --kv-cache-dtype "fp8_e4m3" --language-model-only --skip-mm-profiling --enable-prefix-caching --enable-auto-tool-choice --host "0.0.0.0" --tool-call-parser qwen3_coder --port "8080" --max-num-batched-tokens 16384 --gpu-memory-utilization 0.89 --quantization modelopt --speculative-config '{"method":"qwen3_5_mtp","num_speculative_tokens":1}'
Hi @Orosius, thanks for the very clean diagnostic. We re-ran your exact launch flags on our side (RTX PRO 6000 Blackwell, vLLM 0.19.1rc1, native Linux, no WSL) and got numbers that point to an environment-specific issue rather than a config one:
| setup | Phase A (short, T=0) | Phase B (long-form 2000 tok, T=0.7) | acceptance | mean acceptance length |
|---|---|---|---|---|
| Orosius (5090, WSL2, prefix-cache ON) | 51–55 tok/s | 51–55 tok/s | 92.9–96.2% | 1.93–1.96 |
| Lna-Lab (PRO 6000, native Linux, prefix-cache ON) | 57.5 tok/s | 83.3 tok/s | 86.8–88.3% | 1.87–1.88 |
| Lna-Lab (PRO 6000, native Linux, prefix-cache OFF, all else identical) | 59.5 tok/s | 88.4 tok/s | 86.8–89.7% | 1.87–1.90 |
Two takeaways:
1. `--enable-prefix-caching` is not the culprit. Toggling it on/off with everything else identical only moves long-form decode from 88.4 → 83.3 tok/s on our box (~5 tok/s difference). Same `--max-num-batched-tokens 16384`, same KV FP8, same modelopt, same MTP spec config.
2. Same flags, ~+73% on long-form on PRO 6000 vs your 5090+WSL2. Your acceptance rate is actually slightly higher than ours (you're getting more drafted tokens accepted per step), so the draft head is doing its job; the gain just isn't materializing into wall-clock throughput.
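The percentages behind these two takeaways can be reproduced from the table. This is only a sketch of the arithmetic: the variable names are made up for illustration, and the ~+73% figure appears to correspond to the best PRO 6000 run (88.4 tok/s) against the low end of the 5090 range (51 tok/s), which is an inference from the numbers rather than something stated in the post.

```python
# Long-form (Phase B) throughput in tok/s, copied from the table above.
orosius_low, orosius_high = 51.0, 55.0  # 5090 + WSL2, prefix cache ON
pro6000_cache_on = 83.3                 # PRO 6000, prefix cache ON
pro6000_cache_off = 88.4                # PRO 6000, prefix cache OFF


def relative_gain(new: float, old: float) -> float:
    """Percentage change going from `old` to `new`."""
    return (new - old) / old * 100.0


# Effect of turning prefix caching on, everything else identical.
cache_cost = relative_gain(pro6000_cache_on, pro6000_cache_off)
# Most favourable and least favourable PRO 6000 vs 5090+WSL2 comparisons.
gain_vs_low = relative_gain(pro6000_cache_off, orosius_low)
gain_vs_high = relative_gain(pro6000_cache_on, orosius_high)

print(f"prefix caching ON vs OFF: {cache_cost:+.1f}%")
print(f"PRO 6000 vs 5090+WSL2:    {gain_vs_high:+.0f}% to {gain_vs_low:+.0f}%")
```

Either way the gap between the two machines is roughly an order of magnitude larger than the prefix-caching effect.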
Most likely suspects on your side (in rough order):
- WSL2 CUDA passthrough overhead. WSL2's GPU virtualization adds latency on small-batch kernel launches; the MTP draft pass is exactly that workload (one extra small forward per step). On native Linux the same draft pass costs much less. If you can boot a native Linux partition (or a Linux container with `--gpus all` outside WSL), even a quick test would isolate this.
- vLLM build / nightly drift. Could you share the exact `vllm --version`? There were Blackwell-specific MTP fixes between 0.19.0 and 0.19.1rc1; if you're on an older nightly, FlashInferCutlassNvFp4LinearKernel selection for the draft pass may be off.
- GPU clock/thermal on 5090. Slightly less likely given you're not at 95% util, but worth checking `nvidia-smi --query-gpu=clocks.current.sm,clocks.max.sm` during the run, since WSL2 can also mask thermal throttling.
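One rough way to see where the gain goes is to back out the wall-clock time per decode step from throughput and tokens emitted per step. The sketch below uses the 11:02:11 log lines (51.4 tok/s at mean acceptance length 1.96) for the MTP run. The baseline is an assumption: it takes the midpoint of the 50–55 tok/s reported for a different NVFP4 checkpoint without MTP, so the comparison is indicative only.

```python
def step_time_ms(throughput_tok_s: float, tokens_per_step: float) -> float:
    """Wall-clock time of one decode step implied by the measured throughput."""
    return tokens_per_step / throughput_tok_s * 1000.0


# Without MTP: one token per step. 52.5 tok/s is the midpoint of the reported
# 50-55 tok/s range (an assumption, and measured on a different checkpoint).
baseline_ms = step_time_ms(52.5, 1.0)

# With MTP on the 5090 + WSL2 box: values taken from the log lines in this thread.
with_mtp_ms = step_time_ms(51.4, 1.96)

slowdown = with_mtp_ms / baseline_ms
print(f"baseline step: {baseline_ms:.1f} ms")
print(f"MTP step:      {with_mtp_ms:.1f} ms ({slowdown:.2f}x)")
```

Under these assumptions each step takes about twice as long with MTP enabled, which would cancel out the nearly two tokens gained per step and is consistent with extra per-step overhead rather than a weak draft head.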
If you want, I can mirror your launch command verbatim including `--reasoning-parser qwen3` / `--tool-call-parser qwen3_coder` and post the kernel-selection lines from our startup log so you can diff them against yours. Happy to dig further.
-- sakamakismile
uv run vllm --version
0.19.2rc1.dev206+g95995bbef
nvidia-smi --query-gpu=clocks.current.sm,clocks.max.sm --format=csv
clocks.current.sm [MHz], clocks.max.sm [MHz]
2835 MHz, 3090 MHz
Fairly stable around 2850 MHz.
Everything seems to point toward a WSL2 problem.
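To turn the clock reading above into a single ratio, a small parser for the CSV output is enough. This is a sketch: `parse_sm_clocks` is a hypothetical helper, and it assumes the two-column `--format=csv` layout shown in this thread.

```python
import csv
import io


def parse_sm_clocks(csv_text: str) -> list[tuple[int, int]]:
    """Parse (current, max) SM clocks in MHz from nvidia-smi CSV output.

    Expects the output of:
    nvidia-smi --query-gpu=clocks.current.sm,clocks.max.sm --format=csv
    """
    reader = csv.reader(io.StringIO(csv_text.strip()), skipinitialspace=True)
    rows = list(reader)
    clocks = []
    for current, maximum in rows[1:]:  # rows[0] is the header line
        # Each field looks like "2835 MHz"; keep only the number.
        clocks.append((int(current.split()[0]), int(maximum.split()[0])))
    return clocks


sample = """clocks.current.sm [MHz], clocks.max.sm [MHz]
2835 MHz, 3090 MHz"""

current, maximum = parse_sm_clocks(sample)[0]
ratio = current / maximum
print(f"SM clock: {current} / {maximum} MHz ({ratio:.0%} of max)")
```

A clock that stays above 90% of its maximum under load does not look like thermal throttling, which fits the WSL2 explanation better than the clock one.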
@Orosius Confirmed: your numbers (5090 + WSL2, 51–55 tok/s at 92–96% acceptance) line up with WSL2 CUDA passthrough overhead on small-batch kernel launches, which is exactly the workload MTP draft passes generate. Nothing to fix on the checkpoint side. If you can boot a native Linux partition for one quick sanity run, I'd expect ~85 tok/s long-form at the same flags.
One bonus: `num_speculative_tokens=3` (instead of 1) gets us 132 tok/s short-form / 105 long-form on PRO 6000; vLLM applies the MTP layer recursively. Worth trying once you're off WSL2.
-- Tonoken3 / Lna-Lab
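For experimenting with different draft depths, the sketch below builds the JSON passed to `--speculative-config` (method name taken from the launch command earlier in this thread) and estimates tokens per step. Both helper names are hypothetical. The estimate assumes every draft position is accepted with the same probability `p`, and only if all earlier positions were accepted; real acceptance may drop at later positions when the MTP layer is applied recursively, so treat the result as optimistic.

```python
import json


def speculative_config(num_speculative_tokens: int) -> str:
    """Compact JSON string for vLLM's --speculative-config flag."""
    config = {
        "method": "qwen3_5_mtp",  # as used in the launch command in this thread
        "num_speculative_tokens": num_speculative_tokens,
    }
    return json.dumps(config, separators=(",", ":"))


def expected_tokens_per_step(p: float, k: int) -> float:
    """1 + p + p^2 + ... + p^k under a constant per-position acceptance p.

    Assumption: position i is only usable if positions 1..i-1 were accepted.
    """
    return sum(p**i for i in range(k + 1))


for k in (1, 3):
    print(speculative_config(k), f"-> ~{expected_tokens_per_step(0.88, k):.2f} tokens/step")
```

With `p = 0.88` and `k = 1` the model gives 1.88 tokens per step, in line with the 1.87–1.88 mean acceptance length in the table, which is the only calibration point the thread provides.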
I observed the same thing and suspect the same cause: WSL2.
Unfortunately I run Windows.
This model is still badass, however.