Instructions to use Qwen/Qwen3.5-397B-A17B with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use Qwen/Qwen3.5-397B-A17B with Transformers:

```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="Qwen/Qwen3.5-397B-A17B")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)
```

```python
# Load model directly
from transformers import AutoProcessor, AutoModelForMultimodalLM

processor = AutoProcessor.from_pretrained("Qwen/Qwen3.5-397B-A17B")
model = AutoModelForMultimodalLM.from_pretrained("Qwen/Qwen3.5-397B-A17B")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
# Decode only the newly generated tokens, skipping the prompt tokens
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
```

Local Apps

vLLM

How to use Qwen/Qwen3.5-397B-A17B with vLLM:

Install from pip and serve model

```shell
# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "Qwen/Qwen3.5-397B-A17B"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "Qwen/Qwen3.5-397B-A17B",
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "Describe this image in one sentence."
                    },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
                        }
                    }
                ]
            }
        ]
    }'
```
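
The same request can also be sent from Python. The following is a minimal sketch using only the standard library; the helper names `build_chat_payload`, `post_chat`, and `first_reply` are hypothetical and not part of vLLM. It assumes the server started above is listening on `http://localhost:8000` and returns the usual OpenAI-style response shape (`choices[0].message.content`).

```python
import json
import urllib.request


def build_chat_payload(model: str, prompt: str, image_url: str) -> dict:
    """Build the same JSON body that the curl example sends."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }


def post_chat(base_url: str, payload: dict, timeout: float = 600.0) -> dict:
    """POST the payload to <base_url>/v1/chat/completions and parse the JSON reply."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def first_reply(response: dict) -> str:
    """Extract the assistant text from an OpenAI-style chat completion response."""
    return response["choices"][0]["message"]["content"]


payload = build_chat_payload(
    "Qwen/Qwen3.5-397B-A17B",
    "Describe this image in one sentence.",
    "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg",
)
# With the server running (assumed address, adjust as needed):
# print(first_reply(post_chat("http://localhost:8000", payload)))
```

Because only the base URL changes, the same helpers should work against any other server that exposes the same OpenAI-compatible endpoint.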

Use Docker

```shell
docker model run hf.co/Qwen/Qwen3.5-397B-A17B
```

SGLang

How to use Qwen/Qwen3.5-397B-A17B with SGLang:

Install from pip and serve model

```shell
# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "Qwen/Qwen3.5-397B-A17B" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "Qwen/Qwen3.5-397B-A17B",
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "Describe this image in one sentence."
                    },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
                        }
                    }
                ]
            }
        ]
    }'
```

Use Docker images

```shell
# Start the SGLang server in a container:
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "Qwen/Qwen3.5-397B-A17B" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "Qwen/Qwen3.5-397B-A17B",
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "Describe this image in one sentence."
                    },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
                        }
                    }
                ]
            }
        ]
    }'
```

Docker Model Runner
How to use Qwen/Qwen3.5-397B-A17B with Docker Model Runner:
```
docker model run hf.co/Qwen/Qwen3.5-397B-A17B
```

will there be a smaller version?

#16

by iojvsuynv - opened Feb 16

Discussion

iojvsuynv

Feb 16

A huge, powerful model is great, but what about smaller models for local use?

RecViking

Feb 16

edited Feb 16

I'm sure folks will create REAP/REAM versions of this model, if they haven't already uploaded them here somewhere. That said, I doubt you'll get REAP/REAM + quants that fit in something like 32 GB of VRAM without a terrible loss in competency. TBH, I expect consumer GPUs to eventually catch up to the "need" for something like this. High-end workstation cards are already close to running a REAP/REAM + quant of this model.

I know that doesn't directly answer your question. I also expect Qwen might release smaller 3.5 versions of this architecture, but I doubt they'll have the same competency as this model. As with many previous models, they release the large model first, then distill it to create smaller ones. This isn't cheap. I'm just thankful Qwen does it. If they DON'T create smaller versions/distillations of this model, I'm sure someone else will.
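
As a rough check on the 32 GB point above, here is a back-of-the-envelope sketch of weight-only memory at different bit widths. It assumes the 397B in the model name is the total parameter count, ignores KV cache, activations, and quantization overhead, and uses a hypothetical `KEEP_FRACTION` of 0.5 for the pruned case. That fraction is an illustration only, not a figure reported for any REAP/REAM release.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Weight-only memory in GB (1 GB = 1e9 bytes); ignores KV cache and activations."""
    return params_billion * bits_per_weight / 8.0


def max_bits_per_weight(params_billion: float, budget_gb: float) -> float:
    """Largest average bits per weight whose weights alone fit in budget_gb."""
    return budget_gb * 8.0 / params_billion


TOTAL_PARAMS_B = 397.0  # assumption: taken from the "397B" in the model name
KEEP_FRACTION = 0.5     # hypothetical share of parameters kept after pruning

variants = [("full", TOTAL_PARAMS_B), ("pruned", TOTAL_PARAMS_B * KEEP_FRACTION)]
for label, params_b in variants:
    for bits in (16, 8, 4):
        print(f"{label:>6} @ {bits:>2}-bit: {weight_memory_gb(params_b, bits):7.2f} GB")
    limit = max_bits_per_weight(params_b, 32.0)
    print(f"{label:>6} in 32 GB needs <= {limit:.2f} bits/weight")
```

Under these assumptions the full model needs about 198.5 GB for weights at 4 bits, and a half-pruned variant still needs about 99 GB. Fitting the weights alone into 32 GB would take roughly 0.64 bits per weight for the full model, or about 1.29 for the half-pruned one, which is the kind of compression the post expects to hurt quality badly.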

hihenry69

Feb 17

@RecViking I just bought an RTX Pro 6000. Given current memory pricing and the market, I'm not holding my breath that a new generation of GPUs from team green will mean more VRAM... one can hope, however.
