Weirdly no perf gain

by Orosius - opened Apr 25

Discussion

Orosius

Apr 25

MTP works (97% acceptance rate), but this translates into low GPU utilization instead of more tokens/s.

With this quant:

(APIServer pid=352258) INFO 04-25 11:02:01 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 1.93, Accepted throughput: 26.00 tokens/s, Drafted throughput: 28.00 tokens/s, Accepted: 260 tokens, Drafted: 280 tokens, Per-position acceptance rate: 0.929, Avg Draft acceptance rate: 92.9%
(APIServer pid=352258) INFO 04-25 11:02:11 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 51.4 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 51.7%, Prefix cache hit rate: 47.8%
(APIServer pid=352258) INFO 04-25 11:02:11 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 1.96, Accepted throughput: 25.20 tokens/s, Drafted throughput: 26.20 tokens/s, Accepted: 252 tokens, Drafted: 262 tokens, Per-position acceptance rate: 0.962, Avg Draft acceptance rate: 96.2%
(APIServer pid=352258) INFO 04-25 11:02:21 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 54.6 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 52.5%, Prefix cache hit rate: 47.8%
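
The logged figures can be cross-checked by recomputing them from the raw counters. A minimal sketch, assuming the log line format shown above; the helper name `parse_spec_metrics` is hypothetical and not part of vLLM:

```python
import re

# Sample line copied from the log above (prefix trimmed).
LINE = (
    "SpecDecoding metrics: Mean acceptance length: 1.93, "
    "Accepted throughput: 26.00 tokens/s, Drafted throughput: 28.00 tokens/s, "
    "Accepted: 260 tokens, Drafted: 280 tokens, "
    "Per-position acceptance rate: 0.929, Avg Draft acceptance rate: 92.9%"
)


def parse_spec_metrics(line: str) -> dict:
    """Extract the accepted/drafted token counters from one metrics line."""
    accepted = int(re.search(r"Accepted: (\d+) tokens", line).group(1))
    drafted = int(re.search(r"Drafted: (\d+) tokens", line).group(1))
    rate = accepted / drafted
    return {
        "accepted": accepted,
        "drafted": drafted,
        "acceptance_rate": rate,
        # With num_speculative_tokens=1 each step drafts one token, so a step
        # yields 1 verified token plus the draft token when it is accepted.
        "tokens_per_step": 1 + rate,
    }


metrics = parse_spec_metrics(LINE)
print(f"acceptance rate: {metrics['acceptance_rate']:.3f}")
print(f"tokens per step: {metrics['tokens_per_step']:.2f}")
```

Both recomputed values agree with the logged 0.929 and 1.93, so the counters themselves look healthy.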

GPU utilization is around 60%.

With another NVFP4 quant without MTP, I'm at around 50 to 55 tok/s, but GPU utilization is around 95%.
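
One way to read these numbers: throughput is tokens per step times steps per second, so similar throughput at nearly twice the tokens per step implies each step takes roughly twice as long. A rough sketch, assuming 52.5 tok/s as the midpoint of the baseline range and exactly one token per step without MTP (both are assumptions, not measurements):

```python
def step_time_ms(tokens_per_second: float, tokens_per_step: float) -> float:
    """Wall-clock time of one decode step implied by the throughput."""
    steps_per_second = tokens_per_second / tokens_per_step
    return 1000.0 / steps_per_second


# With MTP: 54.6 tok/s at a mean acceptance length of 1.96 (from the log).
mtp_ms = step_time_ms(54.6, 1.96)
# Without MTP: assumed midpoint of the reported range, 1 token per step.
base_ms = step_time_ms(52.5, 1.0)

print(f"MTP step:      {mtp_ms:.1f} ms")
print(f"baseline step: {base_ms:.1f} ms")
print(f"ratio:         {mtp_ms / base_ms:.2f}x")
```

If the per-step cost really is close to double, the extra time is going somewhere other than GPU compute, which would be consistent with the lower utilization reading.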

Hardware: RTX 5090
Environment: WSL2
Launch command:
uv run vllm serve sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP --max-model-len 131072 --reasoning-parser qwen3 --kv-cache-dtype "fp8_e4m3" --language-model-only --skip-mm-profiling --enable-prefix-caching --enable-auto-tool-choice --host "0.0.0.0" --tool-call-parser qwen3_coder --port "8080" --max-num-batched-tokens 16384 --gpu-memory-utilization 0.89 --quantization modelopt --speculative-config '{"method":"qwen3_5_mtp","num_speculative_tokens":1}'

sakamakismile

Owner Apr 25

Hi @Orosius — thanks for the very clean diagnostic. We re-ran your exact launch flags on our side (RTX PRO 6000 Blackwell, vLLM 0.19.1rc1, native Linux, no WSL) and got numbers that point to an environment-specific issue rather than a config one:

| Setup | Phase A (short, T=0) | Phase B (long-form, 2000 tok, T=0.7) | Acceptance rate | Mean acceptance length |
| --- | --- | --- | --- | --- |
| Orosius (5090, WSL2, prefix cache ON) | 51–55 tok/s | 51–55 tok/s | 92.9–96.2% | 1.93–1.96 |
| Lna-Lab (PRO 6000, native Linux, prefix cache ON) | 57.5 tok/s | 83.3 tok/s | 86.8–88.3% | 1.87–1.88 |
| Lna-Lab (PRO 6000, native Linux, prefix cache OFF, all else identical) | 59.5 tok/s | 88.4 tok/s | 86.8–89.7% | 1.87–1.90 |
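
For anyone wanting to reproduce a comparable two-phase measurement, here is a rough sketch against the OpenAI-compatible endpoint. It is not necessarily how the figures above were obtained: the port comes from the launch command in this thread, while the prompts, the 128-token cap for Phase A, and all function names are assumptions.

```python
import json
import time
import urllib.request

# Assumed endpoint: port 8080 as in the launch command from this thread.
BASE_URL = "http://localhost:8080/v1/chat/completions"
MODEL = "sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP"


def build_payload(prompt: str, max_tokens: int, temperature: float) -> dict:
    """Request body for the chat completions endpoint."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }


def tokens_per_second(completion_tokens: int, elapsed_s: float) -> float:
    """End-to-end rate; includes prefill, so it slightly understates decode speed."""
    return completion_tokens / elapsed_s


def run_phase(prompt: str, max_tokens: int, temperature: float) -> float:
    """Send one non-streaming request and return generated tokens per second."""
    body = json.dumps(build_payload(prompt, max_tokens, temperature)).encode()
    req = urllib.request.Request(
        BASE_URL, data=body, headers={"Content-Type": "application/json"}
    )
    start = time.perf_counter()
    with urllib.request.urlopen(req, timeout=600) as resp:
        reply = json.load(resp)
    elapsed = time.perf_counter() - start
    return tokens_per_second(reply["usage"]["completion_tokens"], elapsed)


if __name__ == "__main__":
    try:
        # Phase A: short answer at T=0. Phase B: long-form, 2000 tokens, T=0.7.
        print("Phase A:", run_phase("What is the capital of France?", 128, 0.0))
        print("Phase B:", run_phase("Write a long essay on GPU scheduling.", 2000, 0.7))
    except OSError as exc:
        print("server not reachable:", exc)
```

Running each phase several times and discarding the first (warm-up) request should give steadier numbers than a single call.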

Two takeaways:

1. --enable-prefix-caching is not the culprit. Toggling it with everything else identical only moves long-form decode from 88.4 tok/s (off) to 83.3 tok/s (on) on our box (~5 tok/s difference). Same --max-num-batched-tokens 16384, same KV FP8, same modelopt, same MTP spec config.
2. Same flags, roughly +51% to +63% on long-form on PRO 6000 vs your 5090+WSL2 (83.3 vs 51–55 tok/s), and up to ~+73% with prefix caching off (88.4 vs 51 tok/s). Your acceptance rate is actually slightly higher than ours (more of your drafted tokens are accepted per step), so the draft head is doing its job; the gain just isn't materializing into wall-clock throughput.
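
The size of the gap depends on which pair of numbers is compared, so it is worth spelling out the arithmetic. A small sketch using only the figures from the table above (the helper name is illustrative):

```python
def relative_gain(new: float, old: float) -> float:
    """Percentage change going from old to new."""
    return (new / old - 1.0) * 100.0


wsl_low, wsl_high = 51.0, 55.0  # Orosius, 5090 + WSL2
cache_on, cache_off = 83.3, 88.4  # PRO 6000 long-form, prefix cache on / off

print(f"prefix cache on vs off: {relative_gain(cache_on, cache_off):+.1f}%")
print(
    "same flags vs WSL2:     "
    f"{relative_gain(cache_on, wsl_high):+.0f}% to {relative_gain(cache_on, wsl_low):+.0f}%"
)
print(f"cache off vs WSL2 low:  {relative_gain(cache_off, wsl_low):+.0f}%")
```

The +73% figure corresponds to the most favorable pairing (88.4 vs 51 tok/s); with prefix caching on in both setups the gap is somewhat smaller but still large.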

Most likely suspects on your side (in rough order):

1. WSL2 CUDA passthrough overhead. WSL2's GPU virtualization adds latency on small-batch kernel launches; the MTP draft pass is exactly that workload (one extra small forward per step). On native Linux the same draft pass costs much less. If you can boot a native Linux partition (or a Linux container with --gpus all outside WSL), even a quick test would isolate this.
2. vLLM build / nightly drift. Could you share the exact vllm --version? There were Blackwell-specific MTP fixes between 0.19.0 and 0.19.1rc1; if you're on an older nightly, FlashInferCutlassNvFp4LinearKernel selection for the draft pass may be off.
3. GPU clock/thermal on 5090. Slightly less likely given you're not at 95% util, but worth checking nvidia-smi --query-gpu=clocks.current.sm,clocks.max.sm --format=csv during the run, since WSL2 can also mask thermal throttling.
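
When testing the first suspect, it helps to confirm which kernel the serving process actually runs on, since a container started from Docker Desktop on Windows typically still sits on the WSL2 kernel. A small check, assuming the usual convention that WSL kernels include "microsoft" in /proc/version (function names are illustrative):

```python
from pathlib import Path


def looks_like_wsl(kernel_version: str) -> bool:
    """WSL kernels identify themselves with 'microsoft' in the version string."""
    return "microsoft" in kernel_version.lower()


def read_kernel_version(path: str = "/proc/version") -> str:
    """Return the kernel version string, or '' where the file is unavailable."""
    try:
        return Path(path).read_text()
    except OSError:
        return ""


if __name__ == "__main__":
    version = read_kernel_version()
    print("WSL detected" if looks_like_wsl(version) else "no WSL marker found")
```

Running this inside the same environment as `vllm serve` before benchmarking avoids attributing a result to native Linux when the WSL2 layer is still in the path.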

If you want, I can mirror your launch command verbatim including --reasoning-parser qwen3 / --tool-call-parser qwen3_coder and post the kernel-selection lines from our startup log so you can diff them against yours — happy to dig further.

— sakamakismile

Orosius

Apr 25

uv run vllm --version
0.19.2rc1.dev206+g95995bbef

nvidia-smi --query-gpu=clocks.current.sm,clocks.max.sm --format=csv
clocks.current.sm [MHz], clocks.max.sm [MHz]
2835 MHz, 3090 MHz
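
To put that reading in proportion, the CSV output can be parsed and expressed as a fraction of the maximum SM clock. A minimal sketch that works on the text pasted above; the parser name is hypothetical and it assumes the two-column layout produced by this particular query:

```python
import csv
import io

# Output copied from the post above.
SAMPLE = """clocks.current.sm [MHz], clocks.max.sm [MHz]
2835 MHz, 3090 MHz
"""


def parse_sm_clocks(csv_text: str) -> list[tuple[int, int]]:
    """Return (current_mhz, max_mhz) for each GPU row in the CSV output."""
    rows = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    next(rows)  # skip the header line
    return [
        (int(cur.split()[0]), int(mx.split()[0]))
        for cur, mx in (row for row in rows if row)
    ]


for current, maximum in parse_sm_clocks(SAMPLE):
    print(f"{current} / {maximum} MHz = {current / maximum:.1%} of max")
```

A clock sitting above 90% of maximum during the run makes throttling an unlikely explanation for a throughput gap of this size.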

Fairly stable around 2850 MHz.

Everything seems to point toward a WSL2 problem.

sakamakismile

Owner Apr 28

@Orosius Confirmed — your numbers (5090 + WSL2, 51–55 tok/s at 92–96% acceptance) line up with WSL2 CUDA passthrough overhead on small-batch kernel launches, which is exactly the workload MTP draft passes generate. Nothing to fix on the checkpoint side. If you can boot a native Linux partition for one quick sanity run, I'd expect ~85 tok/s long-form at the same flags.

One bonus: num_speculative_tokens=3 (instead of 1) gets us 132 tok/s short-form / 105 long-form on PRO 6000 — vLLM applies the MTP layer recursively. Worth trying once you're off WSL2.
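
A back-of-the-envelope model shows why more draft tokens can help. This sketch assumes every draft position is accepted independently with the same probability p and that acceptance stops at the first rejection; in practice acceptance tends to fall for later positions, so treat the result as an upper bound. The value p = 0.88 is taken from the PRO 6000 acceptance range reported earlier in the thread.

```python
import json


def expected_tokens_per_step(p: float, k: int) -> float:
    """1 verified token plus the expected length of the accepted draft prefix.

    Assumes each of the k draft positions is accepted independently with
    probability p and that acceptance stops at the first rejection.
    """
    return 1.0 + sum(p ** i for i in range(1, k + 1))


# Speculative config in the shape used by the launch command in this thread.
spec_config = json.dumps({"method": "qwen3_5_mtp", "num_speculative_tokens": 3})
print(f"--speculative-config '{spec_config}'")

for k in (1, 3):
    print(f"k={k}: {expected_tokens_per_step(0.88, k):.2f} tokens per step")
```

Each extra draft position also adds a draft forward pass per step, so the wall-clock gain is smaller than the tokens-per-step ratio suggests, and smaller still where per-launch overhead is high.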

— Tonoken3 / Lna-Lab

Bellesteck

May 3

I observed the same thing and suspect the same cause: WSL2 👎

Unfortunately I run Windows.

Bellesteck

May 3

This model is still badass, however.
