Instructions to use unsloth/Qwen3.6-35B-A3B-MTP-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use unsloth/Qwen3.6-35B-A3B-MTP-GGUF with Transformers:
    # Use a pipeline as a high-level helper
    from transformers import pipeline

    pipe = pipeline("image-text-to-text", model="unsloth/Qwen3.6-35B-A3B-MTP-GGUF")
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
                {"type": "text", "text": "What animal is on the candy?"}
            ]
        },
    ]
    pipe(text=messages)

    # Load model directly
    from transformers import AutoModel

    model = AutoModel.from_pretrained("unsloth/Qwen3.6-35B-A3B-MTP-GGUF", dtype="auto")
- llama-cpp-python
How to use unsloth/Qwen3.6-35B-A3B-MTP-GGUF with llama-cpp-python:
    # !pip install llama-cpp-python
    from llama_cpp import Llama

    llm = Llama.from_pretrained(
        repo_id="unsloth/Qwen3.6-35B-A3B-MTP-GGUF",
        filename="BF16/Qwen3.6-35B-A3B-BF16-00001-of-00002.gguf",
    )

    llm.create_chat_completion(
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Describe this image in one sentence."},
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
                        }
                    }
                ]
            }
        ]
    )
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- llama.cpp
How to use unsloth/Qwen3.6-35B-A3B-MTP-GGUF with llama.cpp:
Install from brew
    brew install llama.cpp

    # Start a local OpenAI-compatible server with a web UI:
    llama-server -hf unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q4_K_M

    # Run inference directly in the terminal:
    llama-cli -hf unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q4_K_M
Install from WinGet (Windows)
    winget install llama.cpp

    # Start a local OpenAI-compatible server with a web UI:
    llama-server -hf unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q4_K_M

    # Run inference directly in the terminal:
    llama-cli -hf unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q4_K_M
Use pre-built binary
    # Download pre-built binary from:
    # https://github.com/ggerganov/llama.cpp/releases

    # Start a local OpenAI-compatible server with a web UI:
    ./llama-server -hf unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q4_K_M

    # Run inference directly in the terminal:
    ./llama-cli -hf unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q4_K_M
Build from source code
    git clone https://github.com/ggerganov/llama.cpp.git
    cd llama.cpp
    cmake -B build
    cmake --build build -j --target llama-server llama-cli

    # Start a local OpenAI-compatible server with a web UI:
    ./build/bin/llama-server -hf unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q4_K_M

    # Run inference directly in the terminal:
    ./build/bin/llama-cli -hf unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q4_K_M
Use Docker
docker model run hf.co/unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q4_K_M
- LM Studio
- Jan
- vLLM
How to use unsloth/Qwen3.6-35B-A3B-MTP-GGUF with vLLM:
Install from pip and serve model
    # Install vLLM from pip:
    pip install vllm

    # Start the vLLM server:
    vllm serve "unsloth/Qwen3.6-35B-A3B-MTP-GGUF"

    # Call the server using curl (OpenAI-compatible API):
    curl -X POST "http://localhost:8000/v1/chat/completions" \
      -H "Content-Type: application/json" \
      --data '{
        "model": "unsloth/Qwen3.6-35B-A3B-MTP-GGUF",
        "messages": [
          {
            "role": "user",
            "content": [
              { "type": "text", "text": "Describe this image in one sentence." },
              { "type": "image_url", "image_url": { "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg" } }
            ]
          }
        ]
      }'
Use Docker
docker model run hf.co/unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q4_K_M
- SGLang
How to use unsloth/Qwen3.6-35B-A3B-MTP-GGUF with SGLang:
Install from pip and serve model
    # Install SGLang from pip:
    pip install sglang

    # Start the SGLang server:
    python3 -m sglang.launch_server \
      --model-path "unsloth/Qwen3.6-35B-A3B-MTP-GGUF" \
      --host 0.0.0.0 \
      --port 30000

    # Call the server using curl (OpenAI-compatible API):
    curl -X POST "http://localhost:30000/v1/chat/completions" \
      -H "Content-Type: application/json" \
      --data '{
        "model": "unsloth/Qwen3.6-35B-A3B-MTP-GGUF",
        "messages": [
          {
            "role": "user",
            "content": [
              { "type": "text", "text": "Describe this image in one sentence." },
              { "type": "image_url", "image_url": { "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg" } }
            ]
          }
        ]
      }'
Use Docker images
    docker run --gpus all \
      --shm-size 32g \
      -p 30000:30000 \
      -v ~/.cache/huggingface:/root/.cache/huggingface \
      --env "HF_TOKEN=<secret>" \
      --ipc=host \
      lmsysorg/sglang:latest \
      python3 -m sglang.launch_server \
        --model-path "unsloth/Qwen3.6-35B-A3B-MTP-GGUF" \
        --host 0.0.0.0 \
        --port 30000

    # Call the server using curl (OpenAI-compatible API):
    curl -X POST "http://localhost:30000/v1/chat/completions" \
      -H "Content-Type: application/json" \
      --data '{
        "model": "unsloth/Qwen3.6-35B-A3B-MTP-GGUF",
        "messages": [
          {
            "role": "user",
            "content": [
              { "type": "text", "text": "Describe this image in one sentence." },
              { "type": "image_url", "image_url": { "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg" } }
            ]
          }
        ]
      }'
- Ollama
How to use unsloth/Qwen3.6-35B-A3B-MTP-GGUF with Ollama:
ollama run hf.co/unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q4_K_M
- Unsloth Studio
How to use unsloth/Qwen3.6-35B-A3B-MTP-GGUF with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
    curl -fsSL https://unsloth.ai/install.sh | sh

    # Run unsloth studio
    unsloth studio -H 0.0.0.0 -p 8888

    # Then open http://localhost:8888 in your browser
    # Search for unsloth/Qwen3.6-35B-A3B-MTP-GGUF to start chatting
Install Unsloth Studio (Windows)
    irm https://unsloth.ai/install.ps1 | iex

    # Run unsloth studio
    unsloth studio -H 0.0.0.0 -p 8888

    # Then open http://localhost:8888 in your browser
    # Search for unsloth/Qwen3.6-35B-A3B-MTP-GGUF to start chatting
Using HuggingFace Spaces for Unsloth
    # No setup required
    # Open https://huggingface.co/spaces/unsloth/studio in your browser
    # Search for unsloth/Qwen3.6-35B-A3B-MTP-GGUF to start chatting
- Pi
How to use unsloth/Qwen3.6-35B-A3B-MTP-GGUF with Pi:
Start the llama.cpp server
    # Install llama.cpp:
    brew install llama.cpp

    # Start a local OpenAI-compatible server:
    llama-server -hf unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q4_K_M
Configure the model in Pi
    # Install Pi:
    npm install -g @mariozechner/pi-coding-agent

    # Add to ~/.pi/agent/models.json:
    {
      "providers": {
        "llama-cpp": {
          "baseUrl": "http://localhost:8080/v1",
          "api": "openai-completions",
          "apiKey": "none",
          "models": [
            { "id": "unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q4_K_M" }
          ]
        }
      }
    }
Run Pi
    # Start Pi in your project directory:
    pi
- Hermes Agent
How to use unsloth/Qwen3.6-35B-A3B-MTP-GGUF with Hermes Agent:
Start the llama.cpp server
    # Install llama.cpp:
    brew install llama.cpp

    # Start a local OpenAI-compatible server:
    llama-server -hf unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q4_K_M
Configure Hermes
    # Install Hermes:
    curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
    hermes setup

    # Point Hermes at the local server:
    hermes config set model.provider custom
    hermes config set model.base_url http://127.0.0.1:8080/v1
    hermes config set model.default unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q4_K_M
Run Hermes
hermes
- Docker Model Runner
How to use unsloth/Qwen3.6-35B-A3B-MTP-GGUF with Docker Model Runner:
docker model run hf.co/unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q4_K_M
- Lemonade
How to use unsloth/Qwen3.6-35B-A3B-MTP-GGUF with Lemonade:
Pull the model
    # Download Lemonade from https://lemonade-server.ai/
    lemonade pull unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q4_K_M
Run and chat with the model
lemonade run user.Qwen3.6-35B-A3B-MTP-GGUF-UD-Q4_K_M
List all available models
lemonade list
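Once `llama-server` from the llama.cpp steps above is running, it can also be called from Python through its OpenAI-compatible endpoint. The following is a minimal sketch using only the standard library. It assumes the server listens on `http://localhost:8080/v1` (the base URL used in the Pi configuration above); the helper names `build_chat_request` and `chat` are illustrative and not part of llama.cpp.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080/v1"  # assumption: same base URL as in the Pi config
MODEL_ID = "unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q4_K_M"


def build_chat_request(prompt: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) a POST request for the chat completions endpoint."""
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str) -> str:
    """Send the prompt to the running server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as response:
        body = json.load(response)
    # OpenAI-style response: the first choice holds the assistant message.
    return body["choices"][0]["message"]["content"]
```

Calling `chat("Say hello in one sentence.")` only works while the server is up; `build_chat_request` can be inspected without a server to check the URL and payload.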
Fast!
This pushed me over the hump! I was getting 110 tps on my bifurcated 5070 Ti/5060 Ti setup (32 GB). I am now getting 145 to 160 tps!
Thanks, Unsloth and Qwen!
Which quant variant were you using?
My 5060 Ti only manages 20-30 tps, though that is with 65k of active context already loaded, on UD_IQ4NL_XL without MTP. I'm downloading the UD_Q4_XL MTP right now, let's see...
You should check whether you have enough VRAM and whether any copying is happening because memory is running short. Speed does not differ much between quants, so 20-30 tps is not normal.
My 4090 was doing around 150 tps on Qwen3.6-35B-A3B-GGUF, and now it does around 170 tps on Qwen3.6-35B-A3B-MTP-GGUF.
Also, just in case: this is an MoE model. Maybe your speed is for the 27B model?
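The VRAM check suggested here can be roughed out with simple arithmetic: quantized weights take about (parameter count x bits per weight / 8) bytes. Below is a minimal sketch. The 4.5 bits per weight and the 16 GB and 24 GB card sizes are assumed, illustrative values rather than figures from this thread, and the estimate ignores the KV cache and runtime buffers, which only add to the total.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8.0


def fits_in_vram(params_billion: float, bits_per_weight: float, vram_gb: float) -> bool:
    """True if the weights alone fit; context (KV cache) needs extra room."""
    return weights_gb(params_billion, bits_per_weight) <= vram_gb


# Assumed values for illustration: 35B total parameters at about 4.5 bits/weight.
size = weights_gb(35, 4.5)
print(f"weights: ~{size:.1f} GB")
print("fits in 16 GB:", fits_in_vram(35, 4.5, 16))
print("fits in 24 GB:", fits_in_vram(35, 4.5, 24))
```

When the weights do not fit, part of the model has to live in system memory, and throughput typically drops well below what a fully resident model reaches.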
You don't have enough VRAM.
I do; it ran okay with offloading to CPU. This machine is my minimal server, but MTP bloated it, so I went without MTP.
On my main machine I have 96 GB and prefer the dense 27B there, but honestly, nmax above 2 somehow produced only robotic results, putting everything into bullet points etc.
---- The English is translated, sorry ----
Qwen3.6-35B-A3B-MTP-GGUF, measured: generation speed is lower. Only 53.5% of draft tokens are accepted, at 18.42 tokens per second, while the original Qwen3.6-35B-A3B reaches 23.64 tokens per second.
Personal understanding:
- MTP loads the model parameters once to draft multiple tokens, then verifies them in order to produce the final tokens. Drafting and verification cost almost nothing, and each additional draft token that passes verification saves one more pass over the parameters.
- MoE is not a good fit for MTP: MoE relies on expert routing to compute each token, and drafting and verification will most likely not hit the same experts, which conflicts fundamentally with MTP's batched-prediction optimization.
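To put the acceptance rate and the two throughput figures side by side, here is a minimal sketch of the usual speculative-decoding estimate. It assumes each draft token is accepted independently with probability `alpha`, which is a simplification because the 53.5% above is an aggregate rate. The draft length `k` is left as a free parameter since the post does not state it.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens emitted per verification pass with k draft tokens.

    Under the independence assumption the pass yields the accepted prefix
    plus one token from the main model: 1 + alpha + ... + alpha**k.
    """
    return sum(alpha ** i for i in range(k + 1))


def speedup(alpha: float, k: int, relative_pass_cost: float) -> float:
    """Throughput relative to plain decoding, where relative_pass_cost is the
    cost of one draft-and-verify pass in units of one plain decode step."""
    return expected_tokens_per_pass(alpha, k) / relative_pass_cost


measured_ratio = 18.42 / 23.64  # MTP throughput / baseline throughput from the post
print(f"measured speed ratio: {measured_ratio:.3f}")
for k in (1, 2, 3):
    tokens = expected_tokens_per_pass(0.535, k)
    # Pass cost that would explain the measurement if this k had been used.
    implied_cost = tokens / measured_ratio
    print(f"k={k}: {tokens:.3f} tokens/pass, implied pass cost {implied_cost:.2f}x")
```

Under these assumptions, a net slowdown means one draft-and-verify pass cost more than the expected number of tokens it produced, so on that machine drafting and verification were not close to free.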
MTP conclusion: MTP does not save memory, but it works well on systems that have spare compute and memory bandwidth utilization above 50%.
Mini PC UM 790 Pro, 96 GB (2x48 GB at 5600 MT/s, memory bandwidth 59 GB/s)
- 2B and below: compute-bound; MTP adds to the overload. Turn MTP off.
- 4B to 7B: bandwidth-bound, but still sensitive to compute. Turn MTP on, with a maximum draft of 1.
- 9B to 32B: purely bandwidth-bound, with spare compute. Turn MTP on, with a maximum draft of 2 or 3.
- MoE: expert routing conflicts with MTP verification at a low level. Keep MTP off.
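The rule of thumb above can be written down as a small lookup. This is only a sketch of one commenter's heuristic from a single machine, not a general recommendation. The function name `suggest_mtp` is hypothetical, and sizes the list does not cover (between 2B and 4B, between 7B and 9B, above 32B) return `None` rather than a guess.

```python
from typing import Optional, Tuple

Suggestion = Tuple[bool, Tuple[int, ...]]  # (mtp_on, suggested max-draft values)


def suggest_mtp(params_billion: float, is_moe: bool = False) -> Optional[Suggestion]:
    """Encode the commenter's heuristic; None means the list gives no advice."""
    if is_moe:
        return (False, ())        # MoE: keep MTP off
    if params_billion <= 2:
        return (False, ())        # compute-bound: MTP off
    if 4 <= params_billion <= 7:
        return (True, (1,))       # bandwidth-bound, compute still tight
    if 9 <= params_billion <= 32:
        return (True, (2, 3))     # purely bandwidth-bound
    return None                   # size not covered by the list


print(suggest_mtp(35, is_moe=True))
print(suggest_mtp(27))
print(suggest_mtp(3))
```

Note that other posters in this thread report speedups with MTP on this same MoE model on discrete GPUs, so the MoE branch reflects this one setup.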
I agree. To me, on 48 GB of VRAM or less, the quality tradeoff is worse than just using a bigger quant.
I would rather have the 4 GB used by a Q6 without MTP than by a Q4 with MTP.
I know the purpose of MTP is to speed things up, but I also saw a degradation in quality, and I would rather have better quality than bullet-point responses.