Instructions to use Qwen/Qwen3.5-122B-A10B with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use Qwen/Qwen3.5-122B-A10B with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="Qwen/Qwen3.5-122B-A10B")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForMultimodalLM

processor = AutoProcessor.from_pretrained("Qwen/Qwen3.5-122B-A10B")
model = AutoModelForMultimodalLM.from_pretrained("Qwen/Qwen3.5-122B-A10B")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use Qwen/Qwen3.5-122B-A10B with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "Qwen/Qwen3.5-122B-A10B"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Qwen/Qwen3.5-122B-A10B",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
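
For readers who would rather call the server from Python than from curl, the same request can be sketched with the standard library only. This is a minimal sketch, not an official client: the helper names `build_chat_request` and `send_chat_request` are hypothetical, and the response is assumed to follow the usual OpenAI chat-completions shape (`choices[0].message.content`).

```python
import json
import urllib.request


def build_chat_request(base_url, model, prompt, image_url):
    """Build (but do not send) a POST request mirroring the curl example."""
    payload = {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request, timeout=600):
    """Send the request; assumes an OpenAI-style response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Building the request needs no running server; sending it does.
request = build_chat_request(
    base_url="http://localhost:8000",
    model="Qwen/Qwen3.5-122B-A10B",
    prompt="Describe this image in one sentence.",
    image_url="https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg",
)
print(request.full_url)
```

Once the server is up, `send_chat_request(request)` should return the generated text. Changing `base_url` to `http://localhost:30000` should target the SGLang server described further down in the same way.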

Use Docker

docker model run hf.co/Qwen/Qwen3.5-122B-A10B

SGLang

How to use Qwen/Qwen3.5-122B-A10B with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "Qwen/Qwen3.5-122B-A10B" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Qwen/Qwen3.5-122B-A10B",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "Qwen/Qwen3.5-122B-A10B" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Qwen/Qwen3.5-122B-A10B",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Docker Model Runner
How to use Qwen/Qwen3.5-122B-A10B with Docker Model Runner:
```
docker model run hf.co/Qwen/Qwen3.5-122B-A10B
```

Thank you team Qwen for a 120B LLM

by rtzurtz - opened Feb 24

Discussion

rtzurtz

Feb 24

I have 64 GB RAM + 12 GB VRAM (bought in 2023, before LLMs were a thing in the sense they are now), and Gpt-Oss-120B replaced Qwen3-30B for me some time ago. Qwen3-235B and other similarly sized LLMs don't fit. This is the first LLM since Gpt-Oss-120B that I'm going to try out in a 4-5-bit GGUF quant. I wonder how a 4-bit QAT version would perform, like the Gpt-Oss-120B one. Thanks again.

ztsvvstz

Feb 24

May I ask what you use to infer such big models and how many tokens per second you get?^^

rtzurtz

Mar 13 • edited Mar 13

I have been using this command: llama-server -m 'gpt-oss-120b-mxfp4-00001-of-00003.gguf' --n_gpu_layers 99 --n-cpu-moe 32 --threads 4 --temp 1.0 --top-k 0 --top-p 1.0 -c 8192 --chat-template-kwargs '{"reasoning_effort": "medium"}' --jinja --no-warmup (Used Dedicated Memory: 90%), and I am getting 17.5 tokens per second for the first 1000 tokens.

Some time ago llama.cpp added the fit option(s) and turned it on by default. Now --n_gpu_layers 99 --n-cpu-moe 32 are no longer needed, and without these options I'm getting 17.0 tokens per second for the first 1000 tokens and a Used Dedicated Memory of 85-86%.

After stopping inference after the first 1000 tokens and running the same prompt again, I'm getting over 19 t/s with both commands (my prompt was such that the output is different each time, but maybe something was still cached).

Using n_gpu_layers and n-cpu-moe to manually offload a little more to VRAM naturally gives a slightly higher t/s, but I seem to remember that there was an issue with not enough VRAM at some point, so I'd recommend removing n_gpu_layers and n-cpu-moe. I guess use them if you want to keep some VRAM free for something else, so that not all of it is used.
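
The two launch variants described above could be kept side by side in a small launcher script. This is only a sketch: the function name `build_llama_server_args` and the `manual_offload` switch are hypothetical, and every flag and value is copied from the command quoted in this post rather than checked against a particular llama.cpp version.

```python
import shlex


def build_llama_server_args(model_path, manual_offload=False):
    """Return the llama-server argument list from the post.

    With manual_offload=False the offload flags are left out, so placement
    is left to llama.cpp's automatic fitting as described in the post.
    """
    args = ["llama-server", "-m", model_path]
    if manual_offload:
        # Manual split from the post: all layers on GPU, 32 MoE layers on CPU.
        args += ["--n_gpu_layers", "99", "--n-cpu-moe", "32"]
    args += [
        "--threads", "4",
        "--temp", "1.0",
        "--top-k", "0",
        "--top-p", "1.0",
        "-c", "8192",
        "--chat-template-kwargs", '{"reasoning_effort": "medium"}',
        "--jinja",
        "--no-warmup",
    ]
    return args


# Print both variants as shell-quoted command lines.
model = "gpt-oss-120b-mxfp4-00001-of-00003.gguf"
print(shlex.join(build_llama_server_args(model)))
print(shlex.join(build_llama_server_args(model, manual_offload=True)))
```

The returned list can be passed directly to `subprocess.run`, which avoids shell quoting problems with the JSON argument.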

PS: Since I can't fit the good 4-bit quants of 122B-A10B, I went with testing Qwen3.5-27B for now (as a dense LLM, it performs better than its parameter count would suggest compared to a MoE LLM). I may still try the 122B later.
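
As a rough back-of-the-envelope check of why the 4-5-bit quants are a tight fit on 64 GB RAM + 12 GB VRAM, the weight size can be estimated as parameters times bits per weight divided by 8. This is a sketch under loud assumptions: 122e9 parameters is taken from the model name, the bits-per-weight values are illustrative rather than measured GGUF sizes, decimal gigabytes are used throughout, and KV cache, context buffers, and operating-system memory are ignored.

```python
def quantized_size_gb(n_params, bits_per_weight):
    """Approximate weight size in decimal GB for a given average bit width."""
    return n_params * bits_per_weight / 8 / 1e9


# Assumptions for illustration only (see the paragraph above).
TOTAL_MEMORY_GB = 64 + 12  # RAM + VRAM from the post
N_PARAMS = 122e9           # nominal count implied by "122B"

for bits in (4.0, 4.5, 5.0):
    size = quantized_size_gb(N_PARAMS, bits)
    headroom = TOTAL_MEMORY_GB - size
    print(f"{bits:.1f} bits/weight: ~{size:.1f} GB weights, {headroom:+.1f} GB headroom")
```

Under these assumptions a quant averaging about 4.5 bits per weight already takes roughly 69 GB, leaving only a few GB for everything else, and 5 bits per weight exceeds the combined 76 GB.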
