Thank you team Qwen for a 120B LLM
I have 64GB RAM + 12 GB VRAM (bought in 2023, before LLMs were a thing in the sense they are now), and Gpt-Oss-120B replaced Qwen3-30B for me some time ago. Qwen3-235B and other similarly sized LLMs don't fit. This is the first LLM since Gpt-Oss-120B that I'm going to try out in a 4-5-bit GGUF quant. I wonder how a 4-bit QAT version would perform, like the Gpt-Oss-120B one. Thanks again.
May I ask what you use to infer such big models and how many tokens per second you get?^^
I have been using this command: llama-server -m 'gpt-oss-120b-mxfp4-00001-of-00003.gguf' --n_gpu_layers 99 --n-cpu-moe 32 --threads 4 --temp 1.0 --top-k 0 --top-p 1.0 -c 8192 --chat-template-kwargs '{"reasoning_effort": "medium"}' --jinja --no-warmup (Used Dedicated Memory: 90%), and I'm getting 17.5 tokens per second for the first 1000 tokens.
Some time ago llama.cpp added the fit option(s) and turned them on by default. Now --n_gpu_layers 99 --n-cpu-moe 32 is no longer needed, and without these flags I'm getting 17.0 tokens per second for the first 1000 tokens and a Used Dedicated Memory of 85-86%.
After stopping inference at the first 1000 tokens and running the same prompt again, I'm getting over 19 t/s with both commands (my prompt was such that the output is different each time, but maybe something was still cached).
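For anyone who wants to reproduce numbers like these, here is a minimal sketch of one way to time a request against a running llama-server through its OpenAI-compatible endpoint. It is not how the figures above were obtained. The function names, the base URL, and the prompt are assumptions, and the wall-clock time includes prompt processing, so it will read slightly lower than pure generation speed.

```python
import json
import time
import urllib.request


def tokens_per_second(n_tokens: int, seconds: float) -> float:
    """Throughput in tokens per second; 0.0 for a non-positive duration."""
    return n_tokens / seconds if seconds > 0 else 0.0


def measure(base_url: str, prompt: str, max_tokens: int = 1000) -> float:
    """Send one chat request and time it end to end (hypothetical helper)."""
    body = json.dumps({
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,  # cap generation at the first 1000 tokens
    }).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    start = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        reply = json.load(resp)
    elapsed = time.perf_counter() - start
    # The OpenAI-style response reports generated tokens under "usage".
    return tokens_per_second(reply["usage"]["completion_tokens"], elapsed)


# Example (needs a running server; the URL is an assumption):
# print(measure("http://localhost:8080", "Tell me a random story."))
```

Running it twice with the same prompt would show whether the second run is faster, as described above.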
Using n_gpu_layers and n-cpu-moe to manually offload a little more to VRAM naturally gives slightly higher t/s, but I seem to remember there was an issue with not enough VRAM at some point, so I'd recommend removing n_gpu_layers and n-cpu-moe. I guess use them if you want to keep some VRAM free for something else, so that not all of it is used.
PS: Since I can't fit the good 4-bit quants of 122B-A10B, I went with testing Qwen3.5-27B for now (as a dense LLM it performs better than its parameter count would suggest compared with a MoE LLM). I may still try the 122B later.
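A rough back-of-the-envelope sketch of why 4-5-bit quants of a 122B model are tight on 64 GB RAM + 12 GB VRAM. It assumes weight size ≈ parameters × bits per weight / 8 and ignores KV cache, activations, and whatever the OS needs. The bits-per-weight values are illustrative assumptions, not measured GGUF file sizes.

```python
def quant_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


TOTAL_MEMORY_GB = 64 + 12  # RAM + VRAM from the post above
N_PARAMS = 122e9           # total parameters of a 122B model

for bpw in (4.0, 4.5, 5.0):  # assumed average bits per weight
    size = quant_size_gb(N_PARAMS, bpw)
    headroom = TOTAL_MEMORY_GB - size
    print(f"{bpw} bpw: ~{size:.1f} GB of weights, {headroom:.1f} GB headroom")
```

Under these assumptions a 5-bit quant already exceeds the combined 76 GB, and the higher-quality 4-bit variants leave little room for context and the rest of the system.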