Instructions to use unsloth/Qwen3.6-35B-A3B-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use unsloth/Qwen3.6-35B-A3B-GGUF with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="unsloth/Qwen3.6-35B-A3B-GGUF")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoModel
model = AutoModel.from_pretrained("unsloth/Qwen3.6-35B-A3B-GGUF", dtype="auto")

llama-cpp-python

How to use unsloth/Qwen3.6-35B-A3B-GGUF with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="unsloth/Qwen3.6-35B-A3B-GGUF",
	filename="BF16/Qwen3.6-35B-A3B-BF16-00001-of-00002.gguf",
)

llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			"content": [
				{
					"type": "text",
					"text": "Describe this image in one sentence."
				},
				{
					"type": "image_url",
					"image_url": {
						"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
					}
				}
			]
		}
	]
)

Local Apps

llama.cpp

How to use unsloth/Qwen3.6-35B-A3B-GGUF with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_M

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_M

Use Docker

docker model run hf.co/unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_M


vLLM

How to use unsloth/Qwen3.6-35B-A3B-GGUF with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "unsloth/Qwen3.6-35B-A3B-GGUF"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "unsloth/Qwen3.6-35B-A3B-GGUF",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker

docker model run hf.co/unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_M

SGLang

How to use unsloth/Qwen3.6-35B-A3B-GGUF with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "unsloth/Qwen3.6-35B-A3B-GGUF" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "unsloth/Qwen3.6-35B-A3B-GGUF",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "unsloth/Qwen3.6-35B-A3B-GGUF" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "unsloth/Qwen3.6-35B-A3B-GGUF",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Ollama
How to use unsloth/Qwen3.6-35B-A3B-GGUF with Ollama:
```
ollama run hf.co/unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_M
```

Unsloth Studio

How to use unsloth/Qwen3.6-35B-A3B-GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for unsloth/Qwen3.6-35B-A3B-GGUF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for unsloth/Qwen3.6-35B-A3B-GGUF to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for unsloth/Qwen3.6-35B-A3B-GGUF to start chatting

Pi

How to use unsloth/Qwen3.6-35B-A3B-GGUF with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_M

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_M"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use unsloth/Qwen3.6-35B-A3B-GGUF with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_M

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_M

Run Hermes

hermes

Docker Model Runner
How to use unsloth/Qwen3.6-35B-A3B-GGUF with Docker Model Runner:
```
docker model run hf.co/unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_M
```

Lemonade

How to use unsloth/Qwen3.6-35B-A3B-GGUF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_M

Run and chat with the model

lemonade run user.Qwen3.6-35B-A3B-GGUF-UD-Q4_K_M

List all available models

lemonade list

Hallucinations, unstable results, tool call errors with UD-Q6_K_XL?

#15

by tooltd - opened Apr 21

Discussion

tooltd

Apr 21

I tried both the UD-Q5_K_XL and UD-Q6_K_XL quants.
I used them for programming and found that UD-Q6_K_XL had a higher error rate than UD-Q5_K_XL. It made very silly mistakes and its output was unstable. Has anyone else experienced this? @@

Logarifm

Apr 21

I have the same issue with UD-Q6_K_XL and the Kilo Code tool. Tool calls failed even on a simple task.

tooltd

Apr 21

I use the recommended parameters for coding tasks. When I run the same prompt repeatedly, there's one instance where the result is completely different from the others and doesn't even comply with the prompt's requirements. 😂
It's hard to understand, so I'll probably go back to Q5.
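One way to put a number on this kind of instability is to send the same prompt several times and count how often the reply fails a check you define yourself. The sketch below is only an illustration. It assumes a llama-server instance on its default port 8080, and the URL, the model id, and the helper names `ask_server` and `failure_rate` are placeholders that do not come from this thread.

```python
import json
import urllib.request

# Assumptions, not taken from the thread: a llama-server instance on the default
# port 8080, and the model id below. Adjust both for your setup.
URL = "http://localhost:8080/v1/chat/completions"
MODEL = "unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q6_K_XL"


def ask_server(prompt):
    """Send one chat request to the OpenAI-compatible endpoint and return the reply text."""
    body = json.dumps({
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    request = urllib.request.Request(
        URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        reply = json.load(response)
    return reply["choices"][0]["message"]["content"]


def failure_rate(prompt, check, runs=10, ask=ask_server):
    """Run the same prompt `runs` times; return the fraction of replies that fail `check`."""
    failures = sum(1 for _ in range(runs) if not check(ask(prompt)))
    return failures / runs
```

Running `failure_rate` with the same prompt and the same `check` against the Q5 and Q6 quants gives two numbers that can be compared directly, which is easier to discuss than a single odd run.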

shimmyshimmer

Unsloth AI org Apr 21

> I tried both the UD-Q5_K_XL and UD-Q6_K_XL quants.
> I used them for programming and found that UD-Q6_K_XL had a higher error rate than UD-Q5_K_XL. It made very silly mistakes and its output was unstable. Has anyone else experienced this? @@

> I have the same issue with UD-Q6_K_XL and the Kilo Code tool. Tool calls failed even on a simple task.

Could be because there isn't enough compute. We've seen many people say the models don't load, or just break, because they hit the maximum memory use. If you get better results with Q5, then definitely stick with the smaller one.

thaatz

Apr 21

I have also been having tool calling issues on Q8_K_XL and Q8 from Bartowski too. I don't think it is an unsloth specific problem though.
https://huggingface.co/Qwen/Qwen3.6-35B-A3B/discussions/40

danielhanchen

Unsloth AI org Apr 22

> I have also been having tool calling issues on Q8_K_XL and Q8 from Bartowski too. I don't think it is an unsloth specific problem though.
> https://huggingface.co/Qwen/Qwen3.6-35B-A3B/discussions/40

Usually it has to do with the tooling you're using, not necessarily the model or quant. What tooling are you using?

aminya

Apr 22

The base model has issues. Use a fine tune like https://huggingface.co/hesamation/Qwen3.6-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled-GGUF

kyunle

Apr 22

They have similar issues, unfortunately. Doom loops, disobeying skill instructions. 😕

Shuasimodo

Apr 25

the issue seems to be with the chat template. switching to the qwen3.5 chat template fixed all errors for me.

kyunle

Apr 25

@shuasimodo would you mind sharing your template reference? ☺️

Shuasimodo

Apr 25 (edited Apr 25)

sorry about that. just this file: https://huggingface.co/Qwen/Qwen3.5-35B-A3B/raw/main/chat_template.jinja
just download it and link to it as your chat template file in llama.cpp or whatever you're using for inference.

MortiDahlaine

May 10

Yeah, Q6_K_XL seems to take ages to load and then runs at about 6.7 t/s.
Works with v11 template from here: https://huggingface.co/froggeric/Qwen-Fixed-Chat-Templates

Q6_K seems at least 2x faster

Shuasimodo

about 1 month ago (edited 28 days ago)

> Yeah, Q6_K_XL seems to take ages to load and then runs at about 6.7 t/s.
> Works with v11 template from here: https://huggingface.co/froggeric/Qwen-Fixed-Chat-Templates
>
> Q6_K seems at least 2x faster

@MortiDahlaine

This is mainly because of improper use of f32 upcasting during quantization, in an attempt to achieve higher quality.

Using f32 (yes, it is the same as bf16 with extra zeros, and it is universally compatible) takes significantly more VRAM and creates more calculations on the GPU. On the CPU it is heavier and clunkier. It also alters the original precision, and it is what is causing this "use a smaller model" nonsense. When I say "alter", people can argue whatever they want, but I mean it in this sense: bf16 is the original release, and its weights are unchanged. f32 is an upcast. Even though the values are the same with extra zeros appended, the result is no longer the original bf16 data, because it now carries those extra zeros.
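The "same with extra 0's" point can be checked at the bit level. The following is a minimal sketch using only the Python standard library. The helper names are made up for illustration, and the downcast here simply truncates, whereas real converters normally round to nearest. It shows that upcasting a bf16 bit pattern to f32 appends 16 zero bits, keeps the value, and doubles the storage per weight.

```python
import struct


def f32_bits(x):
    """Return the 32-bit pattern of a float32 value as an int."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_to_f32(bits):
    """Interpret a 32-bit pattern as a float32 value."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def bf16_bits_to_f32(bits16):
    """Upcast: a bf16 pattern becomes f32 by appending 16 zero bits."""
    return bits_to_f32(bits16 << 16)


def f32_to_bf16_bits(x):
    """Downcast by truncation (illustrative only; real converters usually round)."""
    return f32_bits(x) >> 16


original = 0x3E9A                      # an arbitrary bf16 bit pattern
upcast = bf16_bits_to_f32(original)    # same value, now stored in 4 bytes

print(f"bf16 bits: {original:#06x}  (2 bytes)")
print(f"f32 bits:  {f32_bits(upcast):#010x}  (4 bytes)")
print("low 16 bits are zero:", f32_bits(upcast) & 0xFFFF == 0)
print("round trip is exact: ", f32_to_bf16_bits(upcast) == original)
```

So the numerical value survives the upcast, and the practical cost is the doubled memory and bandwidth per tensor stored this way.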

Before anybody cries and pretends they're an ego-genius: yes, norms, inps, and maybe a couple of others are in f32 and stay in f32. They're tiny and critical, and that's also how llama.cpp quantizes... that's not my point. My point is that putting all ssm_* weights in f32 is uncalculated recklessness.

BF16 is the original weight format. When properly quantized, it runs on older hardware and creates a more stable numerical flow of calculations during inference, which results in better output.

There is this whole thing people believe without empirically testing it, namely that bf16 doesn't work well on older hardware, and so now we're seeing ssm weights in f32 without anyone knowing what the pros and cons are.

Pros: better on paper (bf16 quality)
Cons: significantly higher VRAM use
Cons: smaller context window
Cons: altered original precision

here's the highest quality model I've been able to quantize (with the trade-off of attn_gate) from experimenting with every weight for over 2 months.

ffn weights should all be in the same precision for a more stable flow of data during inference. Putting shexp in a higher precision than exps does not make it better. The model might seem to have more dynamic range, but the numerical flow becomes inaccurate when a 16-bit decision has to be turned into 8-bit or 4-bit expert routing. Studies show one thing, but hundreds of hours of empirical testing show otherwise.
Using bf16 in the right places is not actually as slow as you'd think, and it is critical for maintaining the coherence, accuracy, and dynamic capability of the model.
Using the q8_0 kv cache type is not a free lunch. It stores k and v in quantized form right at the beginning of the inference stage, which influences the rest of the process and compounds over the length of the chat. This is where you'll see little errors: misspellings, wrong names, wrong variables in code, etc. "Negligible", sure, but if you are seeking quality and accuracy, that is likely the cause. If you are running in a true production environment, give bf16 a try; then you'll be storing the k and v cache values in their original form.
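To make the q8_0 rounding concrete, here is a rough sketch of how one q8_0 block works as far as I understand the ggml format: 32 values share a single f16 scale, and each value is stored as an 8-bit integer. The function names are illustrative, and Python's `round` breaks ties differently from C's `roundf`, so treat this as an approximation of the scheme rather than a bit-exact copy of llama.cpp.

```python
import struct

QK8_0 = 32                 # values per q8_0 block
BLOCK_BYTES = 2 + QK8_0    # one f16 scale plus 32 int8 values = 34 bytes (8.5 bits per value)


def to_f16(x):
    """Round a Python float to the nearest half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def quantize_q8_0(block):
    """Quantize 32 floats to (f16 scale, list of ints in [-127, 127])."""
    assert len(block) == QK8_0
    amax = max(abs(v) for v in block)
    d = amax / 127.0
    inv = 1.0 / d if d else 0.0
    qs = [round(v * inv) for v in block]
    return to_f16(d), qs


def dequantize_q8_0(d, qs):
    """Reconstruct approximate floats from a q8_0 block."""
    return [d * q for q in qs]


def max_roundtrip_error(block):
    """Largest absolute difference between a block and its q8_0 round trip."""
    d, qs = quantize_q8_0(block)
    return max(abs(a - b) for a, b in zip(block, dequantize_q8_0(d, qs)))
```

The per-value error is bounded by roughly half the block scale, so it is small, but it is applied to every cached K and V entry. That is the effect the paragraph above attributes the small slips to.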

I have tested these weights on an RTX 2060 6 GB GPU using llama.cpp.

context 102400
kv cache type f16
batch 1024
ubatch 512

and I get a stable ~100 tokens/sec PP with ~16 tokens/sec TG.
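For a feel of what the context setting costs in memory, here is a back-of-the-envelope estimate of the KV cache size. It is a sketch under stated assumptions: the K/V width of 512 comes from the `attn_k` and `attn_v` tensor shapes shown further down in this post, while the number of layers that keep a KV cache (10) is a hypothetical value that this thread does not confirm.

```python
MIB = 1024 * 1024


def kv_cache_mib(n_ctx, n_attn_layers, kv_width, bytes_per_elem):
    """Estimate KV cache size in MiB: K and V, per attention layer, per token."""
    total_bytes = 2 * n_attn_layers * kv_width * n_ctx * bytes_per_elem
    return total_bytes / MIB


N_CTX = 102400          # from the settings above
KV_WIDTH = 512          # from blk.3.attn_k.weight / attn_v.weight: [2048, 512]
N_ATTN_LAYERS = 10      # ASSUMPTION: hypothetical count of full-attention layers

f16_mib = kv_cache_mib(N_CTX, N_ATTN_LAYERS, KV_WIDTH, 2)          # f16: 2 bytes per value
q8_0_mib = kv_cache_mib(N_CTX, N_ATTN_LAYERS, KV_WIDTH, 34 / 32)   # q8_0: 34 bytes per 32 values

print(f"f16 KV cache:  {f16_mib:.1f} MiB")
print(f"q8_0 KV cache: {q8_0_mib:.1f} MiB")
```

With these assumed inputs the f16 cache comes to 2000 MiB and the q8_0 cache to 1062.5 MiB, so the quantized cache saves a little under half the memory. Change `N_ATTN_LAYERS` if the real layer count differs.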

./llama.cpp/build/bin/llama-quantize \
  --output-tensor-type q8_0 \
  --token-embedding-type bf16 \
  --tensor-type output=bf16 \
  --tensor-type attn_gate=q8_0 \
  --tensor-type attn_qkv=bf16 \
  --tensor-type attn_q=bf16 \
  --tensor-type attn_k=bf16 \
  --tensor-type attn_v=bf16 \
  --tensor-type attn_output=bf16 \
  --tensor-type ssm_beta=bf16 \
  --tensor-type ssm_alpha=bf16 \
  --tensor-type ssm_out=bf16 \
  --tensor-type ffn_up_shexp=q8_0 \
  --tensor-type ffn_gate_shexp=q8_0 \
  --tensor-type ffn_down_shexp=q8_0 \
  --tensor-type ffn_up_exps=q8_0 \
  --tensor-type ffn_gate_exps=q8_0 \
  --tensor-type ffn_down_exps=q8_0 \
  /gguf_files/qwen3.6/bf16/Qwen3.6-35B-A3B-BF16-00001-of-00002.gguf \
  /gguf_files/qwen3.6/Qwen3.6-35B-A3B-HQ.gguf \
  Q8_0

tl;dr of the script: the gate is important, but you need to decide whether or not you want context. With the gate at bf16, on 6 GB of VRAM you get 40k context. It's smart and good, but not really production capable with such a limited context window. With the gate set to q8_0, you get 102k context. Then again, Unsloth's Q8_K_XL uses a q8_0 gate too, so I'm going to safely assume that has been well vetted and is okay to put in q8_0.

Important: this is with no mmproj file. ~~need to lower context or shift your weights around.~~
Actually, you can offload your mmproj kv cache to RAM and set image-max-tokens, and this will all work with vision.

example:

no-mmproj-offload = true
image-max-tokens = 256

What I'm trying to say, and validate, is that bf16 is the original precision, and that it works better than f32 on older hardware in terms of computational overhead and quality, without altering most of the model's original weights at all or upcasting to f32.

Test & results

llama_model_quantize_impl: model size = 66152.24 MiB (16.01 BPW)
llama_model_quantize_impl: quant size = 36560.05 MiB (8.85 BPW)
This seems to work well. Weights are almost all in bf16, ffn is in 8 bit, and attn_gate is in 8 bit too.
attn_gate influences VRAM usage. With this model I can get a 102400 context window with the f16 kv cache type, and it fits.
Speeds are the same as the earlier HQ model (not included in this test, sorry): ~107 tokens/sec PP with tool calls, ~17 tokens/sec TG.
VRAM: 5654MiB / 6144MiB
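The two `llama_model_quantize_impl` lines can be cross-checked with simple arithmetic. The sketch below assumes that BPW means total bits divided by the number of weights. It derives the weight count from the bf16 size and then recomputes the BPW of the quantized file. The function names are illustrative.

```python
MIB = 1024 * 1024


def weight_count(size_mib, bpw):
    """Number of weights implied by a file size and its bits per weight."""
    return size_mib * MIB * 8 / bpw


def bits_per_weight(size_mib, n_weights):
    """Bits per weight for a file of the given size."""
    return size_mib * MIB * 8 / n_weights


# Figures copied from the quantize log above.
n_weights = weight_count(66152.24, 16.01)          # roughly 34.7 billion weights
quant_bpw = bits_per_weight(36560.05, n_weights)   # should land near the logged 8.85

print(f"weights:   {n_weights / 1e9:.2f} B")
print(f"quant BPW: {quant_bpw:.2f}")
```

The recomputed value agrees with the logged 8.85 BPW, a little above the 8.5 BPW of pure q8_0, which fits a mix of q8_0 ffn tensors with bf16 for most of the rest.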

model config:

[Qwen3.6-35b]
model = /gguf_files/qwen3.6/Qwen3.6-35B-A3B-HQ.gguf
#mmproj = /gguf_files/qwen3.6/mmproj-q8_0.gguf
alias = qwen3.6-35b
n-gpu-layers = 41
n-cpu-moe = 40
ctx-size = 102400
threads = 6
cache-type-k = f16
cache-type-v = f16
batch-size = 1024
ubatch-size = 512
flash-attn = true
sleep-idle-seconds = -1
no-kv-offload = false 
context-shift = false
chat-template-file = /home/llama-user/.config/llama/templates/qwen3.6.jinja
jinja = true
temp = 0.6
top-p = 0.95
top-k = 20
min-p = 0.0
repeat-penalty = 1.0
presence-penalty = 0.0
load-on-startup = false
#chat-template-kwargs = {"enable_thinking":false}

GPU

$ nvidia-smi
Thu May 14 22:07:41 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 595.58.03              Driver Version: 595.58.03      CUDA Version: 13.2     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 2060        Off |   00000000:23:00.0 Off |                  N/A |
| 44%   49C    P2             70W /  136W |    5656MiB /   6144MiB |     71%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A         1560439      C   ...ma.cpp/build/bin/llama-server       5652MiB |
+-----------------------------------------------------------------------------------------+

Memory usage

$ top

top - 23:34:56 up 4 days, 10:12,  1 user,  load average: 2.19, 1.65, 1.33
Tasks: 369 total,   2 running, 367 sleeping,   0 stopped,   0 zombie
%Cpu(s):  7.6 us,  1.3 sy,  0.0 ni, 90.5 id,  0.6 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :  31748.4 total,    237.8 free,   1293.5 used,  30859.9 buff/cache
MiB Swap:  31367.0 total,  28732.7 free,   2634.2 used.  30454.8 avail Mem 

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                   
1560905 llama-u+  20   0   80.5g  29.9g  29.8g R  93.4  96.3  12:37.94 llama-server

Important: I use this GPU because it's the smallest I have access to and is the best measure of what can be run and how. Most people have 12 GB, 24 GB, or more, so do the math yourself and you'll have an idea of what you can run. Also, this GPU is f16 only and does not support bf16, so it serves as evidence regarding the claim of slower speeds on non-bf16 GPUs.

You can put ffn weights into 4 bit (I like mxfp4 because it keeps everything floating point and isn't some fp -> int game), and on this GPU get ~220 tokens/sec PP and ~25 tokens/sec TG. Or q6k, or whatever. But too much use of f32 is the tl;dr of this rant.

If anyone is wondering why my script sets both qkv and q, k, v, it's because blocks 0, 1, and 2 use qkv:
[ 6/ 843] blk.0.attn_qkv.weight - [ 2048, 8192, 1, 1], type = bf16, size = 32.000 MiB

and starting in block 3 it moves to q, k, v. I found, at least for me, that if I didn't explicitly state what type those weights should be, llama.cpp put them in q8_0, probably because of the quant type at the bottom of the command.

[  58/ 843] blk.3.attn_k.weight                  - [  2048,    512,      1,      1], type =   bf16, size =    2.000 MiB
[  59/ 843] blk.3.attn_k_norm.weight             - [   256,      1,      1,      1], type =    f32, size =    0.001 MiB
[  60/ 843] blk.3.attn_norm.weight               - [  2048,      1,      1,      1], type =    f32, size =    0.008 MiB
[  61/ 843] blk.3.attn_output.weight             - [  4096,   2048,      1,      1], type =   bf16, size =   16.000 MiB
[  62/ 843] blk.3.attn_q.weight                  - [  2048,   8192,      1,      1], type =   bf16, size =   32.000 MiB
[  63/ 843] blk.3.attn_q_norm.weight             - [   256,      1,      1,      1], type =    f32, size =    0.001 MiB
[  64/ 843] blk.3.attn_v.weight                  - [  2048,    512,      1,      1], type =   bf16, size =    2.000 MiB
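Checking which type each tensor ended up in is easier with a small parser than by eye. The sketch below assumes the log lines keep the layout shown above. The regular expression and helper names are mine and are not part of llama.cpp. It parses the lines, recomputes each size from shape and element width, and totals MiB per type.

```python
import math
import re
from collections import defaultdict

MIB = 1024 * 1024
BYTES_PER_ELEM = {"bf16": 2, "f16": 2, "f32": 4}   # unquantized types only

LINE = re.compile(
    r"\[\s*\d+/\s*\d+\]\s+(?P<name>\S+)\s+-\s+\[(?P<shape>[^\]]+)\],\s+"
    r"type\s*=\s*(?P<type>\w+),\s+size\s*=\s*(?P<mib>[\d.]+)\s+MiB"
)

SAMPLE = """
[  58/ 843] blk.3.attn_k.weight                  - [  2048,    512,      1,      1], type =   bf16, size =    2.000 MiB
[  59/ 843] blk.3.attn_k_norm.weight             - [   256,      1,      1,      1], type =    f32, size =    0.001 MiB
[  62/ 843] blk.3.attn_q.weight                  - [  2048,   8192,      1,      1], type =   bf16, size =   32.000 MiB
"""


def parse_tensor_lines(text):
    """Return one dict per tensor line: name, shape, type, and logged size in MiB."""
    rows = []
    for match in LINE.finditer(text):
        shape = [int(dim) for dim in match["shape"].split(",")]
        rows.append({"name": match["name"], "shape": shape,
                     "type": match["type"], "mib": float(match["mib"])})
    return rows


def expected_mib(row):
    """Size implied by shape and element width, for the unquantized types above."""
    return math.prod(row["shape"]) * BYTES_PER_ELEM[row["type"]] / MIB


def mib_per_type(rows):
    """Total logged MiB grouped by tensor type."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["type"]] += row["mib"]
    return dict(totals)


rows = parse_tensor_lines(SAMPLE)
for row in rows:
    print(row["name"], row["type"], row["mib"], round(expected_mib(row), 3))
print(mib_per_type(rows))
```

Feeding the full quantize log through `mib_per_type` shows at a glance how much of the file sits in bf16, f32, or a quantized type. Note that `expected_mib` would need extra entries for quantized types such as q8_0.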
