endless reasoning loops
Title... I can't be the only one, can I?
Why not use Gemma 4? It is better.
I ran both Qwen 3.6 35B and Gemma 4 31B on the same task with Hermes.
Gemma 4 used the skill for it.
Qwen 3.6 forgot about the skill and tried 100 angles with web search and fetch.
Maybe it is even less capable than lfm2-24b-a2b.
Qwen models have always had problems with this. Try Gemma 4, GLM-4.7-Flash, or Step-3.5-Flash if you want a model that's better tuned to avoid it. But with llama.cpp, you can set --reasoning-budget 4096 or something like that, and it will cut off the reasoning if it goes on too long. You can also set --reasoning-budget-message, which lets you end the thinking with something like "OK, I've thought long enough. Let's answer.", which is a fairly natural way for the model to stop thinking, as if it really chose to.
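To make the idea concrete, here is a minimal client-side sketch of what a reasoning budget with a closing message does. It is not llama.cpp's implementation: the function name `cap_reasoning`, the `</think>` end marker, and the one-token-per-list-item stream are all assumptions made for illustration.

```python
THINK_CLOSE = "</think>"  # assumed end-of-reasoning marker; varies by chat template


def cap_reasoning(tokens, budget,
                  message="OK, I've thought long enough. Let's answer."):
    """Keep reasoning tokens until the model closes the block or the budget runs out.

    Returns (kept_tokens, was_cut). Hypothetical helper, not a library API.
    """
    kept = []
    for tok in tokens:
        if tok == THINK_CLOSE:
            # The model stopped thinking on its own.
            kept.append(tok)
            return kept, False
        if len(kept) >= budget:
            # Budget exhausted: inject the stop message and close the block.
            kept.extend([message, THINK_CLOSE])
            return kept, True
        kept.append(tok)
    # Stream ended without a close marker and without hitting the budget.
    return kept, False


kept, was_cut = cap_reasoning(["Wait,", "let", "me", "check"], budget=2)
print(was_cut, kept)
```

In a real setup the truncated reasoning would be sent back to the model as a prefix so that it continues with the answer; that step depends on the server and is left out here.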
just for fun:
Let's enable get_weather.
Then I will call get_weather.
Actually, I need to check if get_weather requires parameters. Since the definition in the prompt is just {"description": "Use to get the current weather for a location.", "name": "get_weather", "parameters": {"properties": {}, "type": "object"}} (Wait, the prompt doesn't show the parameters, it just lists the name and description in the "Tools you may use" list, but usually the tool_enabler or the tool definition itself provides the schema. The prompt says "Use to enable tools you'll need...".
Let's look at the tool_enabler description: "Use to enable tools you'll need, at start of each turn."
I will enable get_weather.
Then I will call get_weather.
Wait, if I look at the tool list provided in the prompt text:
- get_weather
It doesn't show the schema. This is a bit ambiguous. I will assume I can just call it and see what happens, or that it takes no arguments and defaults to the user's location.
Let's proceed.
Step 1: Enable get_weather.
Step 2: Call get_weather.
Wait, the prompt says "at start of each turn".
I will do that.
Refining the plan:
- Enable get_weather.
- Call get_weather.
Let's do it.
One detail: The user said "get me weather pleae".
I will invoke the tool.
If the tool requires a location and I don't provide it, I might need to ask. But I'll try calling it first as many weather tools are context-aware.
Actually, I don't see the parameters in the provided text for get_weather.
However, I see get_weather in the list.
I will enable it.
Wait, I should check if I need to ask for the location. If I call it and it fails, I'll ask.
But let's assume it works.
I'll enable get_weather and call it.
Note: The user prompt shows the tool list.
- get_weather
I will use it.
Let's start.
- Enable get_weather.
- Call get_weather.
Wait, does get_weather take parameters?
The prompt doesn't show the schema, just the name.
I will assume it takes no parameters or uses a default.
If it needs a location, I'll ask.
Let's execute.
One more thing: The user said "at 21:36".
Current time is 21:35.
Okay.
I'll enable the tool.
Then call it.
Wait, I can just call the tool if it's already enabled or if I enable it in the same turn.
The prompt says "Use to enable tools you'll need, at start of each turn."
So I must call tool_enabler first.
Let's do it.
tool_enabler(tools_to_enable=["get_weather"])
get_weather()
Wait, looking at the tool list, get_weather is there.
I will enable it.
Is there any other tool I might need? No.
Just weather.
Let's go.
tool_enabler(tools_to_enable=["get_weather"])
get_weather()
Wait, does get_weather require parameters?
If the schema isn't provided, I might have to guess or it might be parameterless.
I'll try calling it.
If I look at other tools like get_weather, usually they might take location.
But since I don't see the schema, I will call it without parameters first. If it errors, I'll ask.
Let's try to be safe.
I will enable get_weather and call it.
Wait, I should check if I can just call get_weather directly without enabling?
The instructions say "Use to enable tools you'll need, at start of each turn."
So yes, I must enable it.
Plan:
- tool_enabler with get_weather.
- get_weather.
Let's do this.
The user is waiting.
Wait, the prompt says "Current Time: 21:35".
The user message is "at 21:36 said: get me weather pleae".
This implies the user sent it at 21:36, but the system clock is 21:35?
Maybe the user's local time is different or the log is slightly ahead.
Doesn't matter for the weather.
Let's go.
One final check: Is get_weather available? Yes, it's in the list.
Is tool_enabler available? Yes.
I will perform the actions.
- tool_enabler
- get_weather
Wait, I should confirm if get_weather needs a location.
If I don't know the location, I can't provide it if it's a required parameter.
However, if I can't ask the user (because I'm in the middle of execution), I'll call it and see.
If it returns an error asking for location, I'll reply to the user asking for it.
But usually, these bots have a default location or use IP.
I'll proceed.
One detail: The prompt says "at start of each turn".
I will do tool_enabler and then the tool call in the same turn.
Let's go.
Wait, I need to check the tool definition again.
"name": "get_weather", "description": "Use to get the current weather for a location."
It doesn't show parameters.
I'll assume it's parameterless or uses context.
Let's execute.
tool_enabler(tools_to_enable=["get_weather"])
get_weather()
Wait, I can't execute two tool calls in one go if they are dependent?
No, I can call tool_enabler then get_weather.
Actually, I can call multiple tools in one turn usually.
But tool_enabler usually needs to happen first to make the tool available?
The prompt says "Use to enable tools you'll need, at start of each turn."
So I should call tool_enabler first.
Then call get_weather.
Let's do that.
I'll output the tool_enabler call first.
Then the get_weather call.
Wait, I see "Current Time: 21:35".
The user message is "at 21:36".
This is just metadata.
Let's go.
- tool_enabler
- get_weather
Wait, I should check if I need to enable other tools?
No, just weather.
Okay, I'm ready.
I'll enable get_weather and then call it.
Wait, I notice the user said "pleae".
I'll ignore the typo.
Let's go.
tool_enabler(tools_to_enable=["get_weather"])
get_weather()
Wait, I need to check if get_weather is actually a tool I can call.
The list says: get_weather.
Yes.
Okay.
I'll generate the response.
I will call tool_enabler with get_weather.
Then I will call get_weather.
Wait, does get_weather return a string or JSON?
I'll see.
Okay, let's start.
tool_enabler(tools_to_enable=["get_weather"])
get_weather()
Wait, I should check if there are any other tools I need.
Maybe speak to speak the weather?
The user didn't ask for speech.
I'll just get the weather.
Okay.
tool_enabler(tools_to_enable=["get_weather"])
get_weather()
Wait, I can't see the output of get_weather until I make the call.
I will make the call now.
One thought: The user might be in a specific location.
If get_weather
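The trace above keeps cycling through the same handful of lines ("I will enable get_weather.", "Let's go.", and so on). A rough way to catch this on the client side is to measure how many lines in the reasoning text are exact repeats. The sketch below is only a heuristic; the function names and the thresholds (`threshold=0.3`, `min_lines=8`) are arbitrary assumptions, not something any inference server provides.

```python
def repetition_ratio(trace):
    """Fraction of non-empty lines that duplicate an earlier line (case-insensitive)."""
    lines = [ln.strip().lower() for ln in trace.splitlines() if ln.strip()]
    if not lines:
        return 0.0
    return 1.0 - len(set(lines)) / len(lines)


def is_looping(trace, threshold=0.3, min_lines=8):
    """Flag a trace as a probable loop once it is long enough and repetitive enough."""
    line_count = sum(1 for ln in trace.splitlines() if ln.strip())
    return line_count >= min_lines and repetition_ratio(trace) >= threshold


# Hypothetical sample built from lines that recur in the trace above.
sample = "\n".join([
    "I will enable get_weather.",
    "Then I will call get_weather.",
    "Wait, does get_weather take parameters?",
    "Let's go.",
] * 3)
print(is_looping(sample))
```

A caller streaming the reasoning could run `is_looping` every few hundred tokens and abort or retry the request when it fires. Exact-line matching misses paraphrased repeats, so this only catches the most literal loops.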
Dammit, I just tried that --reasoning-budget with a specific length myself, and found it doesn't work after all; llama-server only supports -1 (do as much reasoning as you like) or 0 (don't do any reasoning) as values :/
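For anyone scripting the server launch, a small guard can catch the unsupported values before they are silently ignored. This is a sketch based only on the behaviour reported in this post (-1 or 0 accepted); the helper name `llama_server_args` and the model path `model.gguf` are placeholders, and newer llama.cpp builds may accept other values.

```python
import shlex


def llama_server_args(model_path, reasoning_budget=-1):
    """Build a llama-server command line, rejecting budgets other than -1 or 0.

    Hypothetical helper; the allowed values follow the report in this thread.
    """
    if reasoning_budget not in (-1, 0):
        raise ValueError(
            "reasoning budget must be -1 (unrestricted) or 0 (disabled), "
            "got %r" % (reasoning_budget,)
        )
    return [
        "llama-server",
        "-m", model_path,
        "--reasoning-budget", str(reasoning_budget),
    ]


# "model.gguf" is a placeholder path.
print(shlex.join(llama_server_args("model.gguf", reasoning_budget=0)))
```

The returned list can be passed to `subprocess.run` as-is; the `shlex.join` call is only there to show the resulting command.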
I'll second this; sadly, the reasoning budget does jack all to Qwen 3.5 in llama-server.
I solved the problem using reasoning_budget in vLLM. It is an endemic problem with Qwen.