RTX PRO 4500 Blackwell results

by Pulsate1680 - opened Apr 25

Discussion

Pulsate1680

Apr 25

Thank you for creating this! Sharing some stats from my run:

Setup:
- RTX PRO 4500 Blackwell, 32GB GDDR7, 200W TGP
- WSL2 (Ubuntu 24.04) on Windows 11
- vLLM 0.19.2rc1 (cu130-nightly Docker image)
- Model: sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP (modelopt NVFP4, MTP head grafted back in BF16)
- BF16 KV cache, 131K context

Numbers (single-stream, thinking disabled, vllm bench serve):
- Steady-state TG: 60–73 tok/s (engine logs, varies by content)
- Mean: ~65 tok/s, peaks 73
- TPOT: 17 ms
- TTFT: 240 ms median
- Acceptance length: 3.19 mean (3.35–3.97 on easier text)
- Per-position acceptance: 87/72/61% mean, 99/94/91% on best windows
- Model footprint: 18.55 GB
- KV cache: 9.77 GB available, ~37K token pool
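
As a rough cross-check of the figures above, the sketch below recomputes the acceptance length from the per-position rates and converts TPOT into tokens per second. It assumes the per-position percentages are unconditional rates (the share of all draft steps in which that position was accepted), which is my reading of how vLLM reports them and not something the post states. The function names are illustrative only.

```python
# Per-position acceptance rates quoted in the post, as fractions.
# Assumption: each is the share of all draft steps in which that
# draft position (1st, 2nd, 3rd) was accepted.
mean_rates = [0.87, 0.72, 0.61]
best_rates = [0.99, 0.94, 0.91]


def acceptance_length(rates):
    """One verified token per step plus the expected accepted draft tokens."""
    return 1.0 + sum(rates)


def tokens_per_second(tpot_ms):
    """Convert time per output token (milliseconds) into tokens per second."""
    return 1000.0 / tpot_ms


mean_len = acceptance_length(mean_rates)
best_len = acceptance_length(best_rates)
rate_from_tpot = tokens_per_second(17.0)

print(f"mean acceptance length: {mean_len:.2f}")
print(f"best-window acceptance length: {best_len:.2f}")
print(f"17 ms TPOT: {rate_from_tpot:.1f} tok/s")
```

Under that assumption the mean rates give about 3.20, close to the reported 3.19, and the best-window rates give about 3.84, inside the reported 3.35–3.97 range. The 17 ms TPOT works out to roughly 59 tok/s, in the same ballpark as the 60–73 tok/s seen in the engine logs.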
vLLM launch (compose command block):

```yaml
sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP
--quantization
modelopt
--speculative-config
'{"method":"qwen3_5_mtp","num_speculative_tokens":3}'
--max-model-len
"131072"
--max-num-batched-tokens
"4096"
--max-num-seqs
"10"
--gpu-memory-utilization
"0.93"
--enable-prefix-caching
--no-scheduler-reserve-full-isl
--trust-remote-code
--reasoning-parser
qwen3
--enable-auto-tool-choice
--tool-call-parser
qwen3_coder
--default-chat-template-kwargs
'{"preserve_thinking":true}'
--language-model-only
```
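
For anyone launching outside Docker Compose, here is a minimal sketch that assembles the same flags into a single `vllm serve` command line. It assumes a bare `vllm serve` invocation accepts the flags exactly as the container does. The helper name `build_vllm_args` is hypothetical, and every flag and value is copied from the list above.

```python
import json
import shlex

MODEL = "sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP"


def build_vllm_args(num_speculative_tokens=3):
    """Return the launch flags from the post as an argv list (hypothetical helper)."""
    spec = {"method": "qwen3_5_mtp", "num_speculative_tokens": num_speculative_tokens}
    template_kwargs = {"preserve_thinking": True}
    return [
        "vllm", "serve", MODEL,
        "--quantization", "modelopt",
        # Compact JSON so the string matches the form used in the post.
        "--speculative-config", json.dumps(spec, separators=(",", ":")),
        "--max-model-len", "131072",
        "--max-num-batched-tokens", "4096",
        "--max-num-seqs", "10",
        "--gpu-memory-utilization", "0.93",
        "--enable-prefix-caching",
        "--no-scheduler-reserve-full-isl",
        "--trust-remote-code",
        "--reasoning-parser", "qwen3",
        "--enable-auto-tool-choice",
        "--tool-call-parser", "qwen3_coder",
        "--default-chat-template-kwargs", json.dumps(template_kwargs, separators=(",", ":")),
        "--language-model-only",
    ]


cmd = build_vllm_args()
# shlex.join quotes the JSON arguments so the line can be pasted into a shell.
print(shlex.join(cmd))
```

Building the JSON values with `json.dumps` avoids hand-escaping quotes, and changing `num_speculative_tokens` in one place makes it easy to compare settings such as 1 and 3.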

sakamakismile

Owner Apr 26

Hi @Pulsate1680 — coming back to thank you. Your num_speculative_tokens=3 line in this thread is what unlocked the next jump for our family of MTP repos.

I had been documenting num_speculative_tokens=1 based on the "MTP head has 1 layer" reasoning, which is structurally true but missed that vLLM applies the single MTP layer recursively. Your mean acceptance length of 3.19 (peaks 3.35–3.97) on the RTX PRO 4500 was the load-bearing evidence that recursive draft was actually paying off. Took your numbers, rebenched on RTX PRO 6000 Blackwell + vLLM 0.19.1rc1 @ T = 0, and saw the same shape on all four of our Qwen3.6-family NVFP4 + MTP repos:

| Repo | n=1 (prior) | n=3 (this finding) |
|---|---|---|
| `Qwen3.6-27B-Text-NVFP4-MTP` | 71–85 | 132 / 105 / 106 |
| `Carnice-V2-27b-NVFP4-TEXT-MTP` | 93 | 134 / 102 / 103 |
| `Huihui-Qwen3.6-…-NVFP4-TEXT-MTP` | ~71 | 135 / 112 / 109 ← family fastest |
| `Huihui-Qwen3.6-…-NVFP4-MTP` (VLM) | — | 137 / 112 / 104 text · 129 with image |

(short / medium / long-form prompts.)
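
The point about a single MTP layer being applied recursively can be pictured with a toy sketch. This is not vLLM's implementation: `toy_mtp_layer` is a made-up stand-in that does integer arithmetic, and `draft_recursively` is a hypothetical name. It only shows how one layer, fed its own output, can propose several draft tokens per step.

```python
def draft_recursively(mtp_layer, hidden, last_token, num_speculative_tokens):
    """Reuse one draft layer to propose several tokens in a row."""
    drafts = []
    token = last_token
    for _ in range(num_speculative_tokens):
        # Each pass feeds the previous pass's hidden state and token back in.
        hidden, token = mtp_layer(hidden, token)
        drafts.append(token)
    return drafts


def toy_mtp_layer(hidden, token):
    """Stand-in for a real MTP layer: deterministic integer arithmetic only."""
    new_hidden = (hidden * 31 + token) % 1009
    next_token = new_hidden % 100
    return new_hidden, next_token


one = draft_recursively(toy_mtp_layer, hidden=7, last_token=42, num_speculative_tokens=1)
three = draft_recursively(toy_mtp_layer, hidden=7, last_token=42, num_speculative_tokens=3)
print(one, three)
```

With `num_speculative_tokens=1` the layer runs once. With 3, the same layer runs three times, and the first draft token is identical in both cases. That is why a one-layer head does not cap the draft length at one.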

All four READMEs were updated today to make num_speculative_tokens: 3 the recommended setting and explicitly cite this thread for the credit. The Huihui abliterated body comes out fastest of the group, which is consistent with refusal-shaped tokens being smoothed out — fewer awkward low-acceptance spots for the recursive draft.

Your --no-scheduler-reserve-full-isl + preserve_thinking chat-template kwarg recipe is also gold — added both to my standard launch profile.

Real thanks for posting clean numbers with the launch flags inline. Worth more than the whole "we should optimise NVFP4" conversation ever was.

— Tonoken3 / Lna-Lab
