RTX PRO 4500 Blackwell results
Thank you for creating this! Sharing some stats from my run:
Setup:
RTX PRO 4500 Blackwell, 32GB GDDR7, 200W TGP
WSL2 (Ubuntu 24.04) on Windows 11
vLLM 0.19.2rc1 (cu130-nightly Docker image)
Model: sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP (modelopt NVFP4, MTP head grafted back in BF16)
BF16 KV cache, 131K context
Numbers (single-stream, thinking disabled, vllm bench serve):
Steady-state TG: 60–73 tok/s (engine logs, varies by content)
Mean: ~65 tok/s, peaks 73
TPOT: 17 ms
TTFT: 240 ms median
Acceptance length: 3.19 mean (3.35–3.97 on easier text)
Per-position acceptance: 87/72/61% mean, 99/94/91% on best windows
Model footprint: 18.55 GB
KV cache: 9.77 GB available, ~37K token pool
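The mean acceptance length and the per-position rates above appear consistent with each other, which the short Python check below illustrates. It assumes (my reading, not something stated in the post) that each per-position figure is the fraction of speculative steps in which the draft token at that position was accepted, and that every step also emits one token from the target model itself. The function name is illustrative.

```python
def mean_acceptance_length(per_position_rates):
    """Expected tokens emitted per verification step.

    Assumption: each rate is the fraction of steps in which the draft token
    at that position was accepted, and the target model always contributes
    one additional token per step.
    """
    return 1.0 + sum(per_position_rates)


mean_rates = [0.87, 0.72, 0.61]  # "mean" per-position figures from the post
best_rates = [0.99, 0.94, 0.91]  # "best windows" figures from the post

mean_len = mean_acceptance_length(mean_rates)
best_len = mean_acceptance_length(best_rates)
print(f"mean: {mean_len:.2f}, best windows: {best_len:.2f}")
```

Under that assumption the mean rates give about 3.20, close to the reported 3.19, and the best-window rates give about 3.84, inside the reported 3.35–3.97 range.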
vLLM launch (compose command block):
```yaml
- sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP
- --quantization
- modelopt
- --speculative-config
- '{"method":"qwen3_5_mtp","num_speculative_tokens":3}'
- --max-model-len
- "131072"
- --max-num-batched-tokens
- "4096"
- --max-num-seqs
- "10"
- --gpu-memory-utilization
- "0.93"
- --enable-prefix-caching
- --no-scheduler-reserve-full-isl
- --trust-remote-code
- --reasoning-parser
- qwen3
- --enable-auto-tool-choice
- --tool-call-parser
- qwen3_coder
- --default-chat-template-kwargs
- '{"preserve_thinking":true}'
- --language-model-only
```
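As a rough sanity check on how `--gpu-memory-utilization 0.93` relates to the reported footprint and KV cache figures, here is a small Python sketch. It assumes the card's nominal 32 GB is the total the fraction applies to and that all figures share the same unit. The "other" bucket is my own label for whatever remains (activations, CUDA graphs and similar overhead), not a number from the post.

```python
def memory_budget(total_gb, utilization, weights_gb, kv_cache_gb):
    """Split the memory the server may use into weights, KV cache and the rest."""
    budget = total_gb * utilization
    other = budget - weights_gb - kv_cache_gb  # assumption: everything not listed
    return {
        "budget": budget,
        "weights": weights_gb,
        "kv_cache": kv_cache_gb,
        "other": other,
    }


# 32 GB nominal capacity is an assumption; 0.93, 18.55 and 9.77 come from the post.
parts = memory_budget(total_gb=32.0, utilization=0.93, weights_gb=18.55, kv_cache_gb=9.77)
for name, gb in parts.items():
    print(f"{name:>8}: {gb:5.2f}")
```

With these inputs the budget comes to 29.76 and roughly 1.44 is left over after weights and KV cache, so the reported numbers are at least mutually plausible.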
Hi @Pulsate1680 — coming back to thank you. Your num_speculative_tokens=3 line in this thread is what unlocked the next jump for our family of MTP repos.
I had been documenting num_speculative_tokens=1 based on the "MTP head has 1 layer" reasoning, which is structurally true but misses that vLLM applies the single MTP layer recursively. Your mean acceptance length of 3.19 (3.35–3.97 on easier text) on the RTX PRO 4500 was the load-bearing evidence that recursive drafting was actually paying off. I took your numbers, rebenched on an RTX PRO 6000 Blackwell with vLLM 0.19.1rc1 at temperature 0, and saw the same shape on all four of our Qwen3.6-family NVFP4 + MTP repos:
| Repo | n=1 (prior) | n=3 (this finding) |
|---|---|---|
| Qwen3.6-27B-Text-NVFP4-MTP | 71–85 | 132 / 105 / 106 |
| Carnice-V2-27b-NVFP4-TEXT-MTP | 93 | 134 / 102 / 103 |
| Huihui-Qwen3.6-…-NVFP4-TEXT-MTP | ~71 | 135 / 112 / 109 ← family fastest |
| Huihui-Qwen3.6-…-NVFP4-MTP (VLM) | — | 137 / 112 / 104 text · 129 with image |
(short / medium / long-form prompts.)
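For a rough sense of scale, the Python snippet below turns two rows of the table into n=3 / n=1 ratios. The table does not say which prompt length the n=1 figures were measured on, so this sketch only brackets the ratio between its smallest and largest possible value. The helper name is illustrative.

```python
# n=3 figures per prompt length (short, medium, long), copied from the table.
n3 = {
    "Qwen3.6-27B-Text-NVFP4-MTP": (132, 105, 106),
    "Carnice-V2-27b-NVFP4-TEXT-MTP": (134, 102, 103),
}
# n=1 baselines as (low, high); a single value is repeated for both ends.
n1 = {
    "Qwen3.6-27B-Text-NVFP4-MTP": (71, 85),
    "Carnice-V2-27b-NVFP4-TEXT-MTP": (93, 93),
}


def speedup_range(n3_values, n1_low, n1_high):
    """Return the (smallest, largest) n=3 / n=1 ratio across prompt lengths."""
    return min(n3_values) / n1_high, max(n3_values) / n1_low


for repo, values in n3.items():
    low, high = speedup_range(values, *n1[repo])
    print(f"{repo}: {low:.2f}x to {high:.2f}x")
```

These are bounds rather than measured speedups, since the pairing between baseline and prompt length is an assumption.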
All four READMEs were updated today to make num_speculative_tokens: 3 the recommended setting and explicitly cite this thread for the credit. The Huihui abliterated body comes out fastest of the group, which is consistent with refusal-shaped tokens being smoothed out — fewer awkward low-acceptance spots for the recursive draft.
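To make the "single MTP layer applied recursively" idea concrete, here is a toy Python sketch of drafting several tokens with one draft step and then verifying them greedily. It is not vLLM's implementation: the two step functions are hypothetical stand-ins for the MTP head and the full model, and a real engine would score all draft positions in one forward pass rather than in a loop.

```python
from typing import Callable, List

Step = Callable[[List[int]], int]  # maps a token context to the next token


def draft_recursively(mtp_step: Step, context: List[int], n: int) -> List[int]:
    """Run one draft step n times, feeding each guess back in as context."""
    ctx = list(context)
    drafts = []
    for _ in range(n):
        token = mtp_step(ctx)
        drafts.append(token)
        ctx.append(token)
    return drafts


def verify_greedy(target_step: Step, context: List[int], drafts: List[int]) -> List[int]:
    """Keep drafts while they match the target's greedy pick.

    Always returns at least one token: the target's correction at the first
    mismatch, or a bonus token when every draft is accepted.
    """
    ctx = list(context)
    emitted = []
    for token in drafts:
        expected = target_step(ctx)
        if token != expected:
            emitted.append(expected)
            return emitted
        emitted.append(token)
        ctx.append(token)
    emitted.append(target_step(ctx))
    return emitted


# Hypothetical toy models: the target counts upward modulo 10; the draft
# agrees with it except right after a 5, where it guesses wrong.
def target_step(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10


def mtp_step(ctx: List[int]) -> int:
    return 0 if ctx[-1] == 5 else (ctx[-1] + 1) % 10


for n in (1, 3):
    drafts = draft_recursively(mtp_step, [1], n)
    print(n, drafts, verify_greedy(target_step, [1], drafts))
```

In this toy setup one verification step emits two tokens with n=1 and four with n=3 when every draft is accepted, which is the effect the acceptance-length numbers in this thread point to. A wrong draft simply cuts the step short at the target's correction.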
Your --no-scheduler-reserve-full-isl + preserve_thinking chat-template kwarg recipe is also gold — added both to my standard launch profile.
Real thanks for posting clean numbers with the launch flags inline. Worth more than the whole "we should optimise NVFP4" conversation ever was.
— Tonoken3 / Lna-Lab