Qwen3-VL-32B-Instruct
This repository contains Qwen/Qwen3-VL-32B-Instruct together with a Furiosa Executable Bundle (FXB) for running it on FuriosaAI RNGD with Furiosa-LLM. The same model also runs on other frameworks (such as vLLM, SGLang, and Transformers); for usage with those, see the upstream Qwen/Qwen3-VL-32B-Instruct model card.
Overview
Qwen3-VL-32B-Instruct is a 32-billion-parameter dense vision-language model from the Qwen3-VL series. It pairs a vision encoder with a dense transformer decoder, using Interleaved-MRoPE positional embeddings and DeepStack multi-level feature fusion to handle images and videos alongside text. The model covers visual understanding tasks such as OCR, document and chart analysis, spatial reasoning, and video comprehension, and it natively supports tool (function) calling. This is the Instruct (non-thinking) edition. Its intended use is the same as the upstream Qwen/Qwen3-VL-32B-Instruct, released under the Apache 2.0 License.
- Architecture: Qwen3-VL (dense)
- Input / Output: Image + Text / Text
- Supported Inference Engine: Furiosa-LLM
- Supported Hardware: FuriosaAI RNGD
Quantization
No quantization is applied — the model runs in the same precision as the upstream weights.
Features
- Vision-language. The model accepts OpenAI-style multimodal chat messages with image_url content parts alongside text.
- Tool calling. The model supports tool (function) calling through the hermes tool-call parser.
Parallelism Strategy
On RNGD, Qwen3-VL-32B-Instruct runs with a tensor-parallel size of 32 PEs, which maps to four RNGD cards (8 PEs per card).
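The card count follows directly from the figures above. As a minimal sketch, the helper below (a hypothetical name, not part of Furiosa-LLM) divides the tensor-parallel size by the 8 PEs per card and assumes the group always occupies whole cards:

```python
def cards_required(tensor_parallel_pes: int, pes_per_card: int = 8) -> int:
    """Return how many cards a tensor-parallel group of PEs spans.

    Assumption: the group fills whole cards, so a size that is not a
    multiple of pes_per_card is treated as an error.
    """
    if tensor_parallel_pes <= 0 or pes_per_card <= 0:
        raise ValueError("PE counts must be positive")
    if tensor_parallel_pes % pes_per_card != 0:
        raise ValueError("tensor-parallel size must be a multiple of PEs per card")
    return tensor_parallel_pes // pes_per_card


print(cards_required(32))  # 32 PEs / 8 PEs per card = 4 cards
```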
Usage
To run this model with Furiosa-LLM, follow the example commands below after installing Furiosa-LLM and its prerequisites.
Launch the server
The simplest way to serve the model is:
# Launch the server, listening on port 8000 by default
furiosa-llm serve furiosa-ai/Qwen3-VL-32B-Instruct
When the server is ready, you will see:
INFO: Started server process [27507]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
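In a script, watching the log by hand is not practical. One option, sketched below with only the Python standard library, is to poll the port until it accepts TCP connections. The function name and defaults are illustrative, and an open port only shows that something is listening; it is not a Furiosa-LLM readiness check.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout_s: float = 300.0,
                  interval_s: float = 1.0) -> bool:
    """Poll until a TCP connection to (host, port) succeeds.

    Returns True once a connection is accepted, or False if timeout_s
    elapses first.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # A successful connect means a process is listening on the port.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval_s)
```

For the default launch command, `wait_for_port("localhost", 8000)` would block until the server port opens or five minutes pass.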
Launch the server with tool calling
To enable tool (function) calling, start the server with the hermes tool-call
parser:
furiosa-llm serve furiosa-ai/Qwen3-VL-32B-Instruct \
--enable-auto-tool-choice \
--tool-call-parser hermes
Query the server
The server exposes an OpenAI-compatible API. You can send a text-only request
with curl:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "furiosa-ai/Qwen3-VL-32B-Instruct",
"messages": [{"role": "user", "content": "What is the capital of France?"}]
}' \
| python -m json.tool
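The same text-only request can be sent from Python. The sketch below uses only the standard library and assumes the server is reachable at the default address from the launch example; `build_chat_request` and `send` are illustrative helper names, not part of any Furiosa client library.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # default address from the launch example
MODEL = "furiosa-ai/Qwen3-VL-32B-Instruct"


def build_chat_request(prompt: str, base_url: str = BASE_URL,
                       model: str = MODEL) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout_s: float = 60.0) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return json.load(response)
```

With the server running, `send(build_chat_request("What is the capital of France?"))` returns the parsed response; in the OpenAI response format the answer text is under `["choices"][0]["message"]["content"]`.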
To ask about an image, pass an image_url content part in the message:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "furiosa-ai/Qwen3-VL-32B-Instruct",
"messages": [{
"role": "user",
"content": [
{"type": "image_url", "image_url": {"url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"}},
{"type": "text", "text": "Describe this image."}
]
}]
}' \
| python -m json.tool
The image_url.url field accepts a remote http:// or https:// URL, an inline
base64 data: URL, or a local file:// path (the last requires the
--allowed-local-media-path flag described below).
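To illustrate the inline form, the sketch below encodes raw image bytes as a base64 data: URL and wraps it in the same message shape as the curl example. The helper names and the default MIME type are assumptions made for illustration.

```python
import base64


def to_data_url(image_bytes: bytes, mime_type: str = "image/jpeg") -> str:
    """Encode raw image bytes as an inline base64 data: URL."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def image_message(image_url: str, prompt: str) -> dict:
    """Build a user message with one image_url part followed by a text part."""
    return {
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": image_url}},
            {"type": "text", "text": prompt},
        ],
    }
```

Reading a file with `open(path, "rb").read()` and passing the bytes to `to_data_url` gives a URL that can be sent without the server fetching anything remotely. Base64 inflates the payload by roughly a third, so remote URLs may be preferable for large images.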
Multimodal serving options
furiosa-llm serve provides flags to control multimodal behavior; requests
that violate them are rejected with HTTP 400:
- --image-limit-per-prompt N / --video-limit-per-prompt N: maximum number of images/videos allowed per request (default: unlimited).
- --allowed-local-media-path PATH: allow file:// URLs whose resolved path is under PATH. Local file access is disabled unless this is set.
- --allowed-media-domains D [D ...]: whitelist of remote domains for SSRF protection. When set, only images from the listed domains are fetched.
- --interleave-mm-strings: keep image placeholders at their original positions when the model uses a string-format chat template (no-op for OpenAI-format templates, the common case).
- --mm-processor-cache-gb GB: size of the UUID-keyed multimodal processor cache (default: 4.0). Clients can tag an image_url part with a uuid field and re-reference it in follow-up requests without re-uploading the image bytes; set to 0 to disable.
For example, to serve local images under /srv/media and restrict remote
fetches to a single domain:
furiosa-llm serve furiosa-ai/Qwen3-VL-32B-Instruct \
--allowed-local-media-path /srv/media \
--allowed-media-domains cdn.example.com \
--image-limit-per-prompt 4
See the Vision-Language Models guide for image input formats, the UUID cache, and Python client examples.
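Because the server rejects violating requests with HTTP 400, a client may want to check a request before sending it. The sketch below mirrors two of the limits on the client side. It is not the server's implementation: the function names are hypothetical, and it assumes domains are compared by exact hostname, which may differ from how Furiosa-LLM matches them.

```python
from urllib.parse import urlparse


def iter_image_urls(messages: list):
    """Yield the URL of every image_url content part in a chat message list."""
    for message in messages:
        content = message.get("content")
        if isinstance(content, list):  # plain-string content has no parts
            for part in content:
                if part.get("type") == "image_url":
                    yield part["image_url"]["url"]


def count_images(messages: list) -> int:
    """Count image_url parts across all messages."""
    return sum(1 for _ in iter_image_urls(messages))


def check_request(messages: list, image_limit=None, allowed_domains=None) -> list:
    """Return a list of problems; an empty list means no violation was found.

    Assumption: remote hosts are matched exactly against allowed_domains.
    """
    problems = []
    if image_limit is not None and count_images(messages) > image_limit:
        problems.append(f"too many images: {count_images(messages)} > {image_limit}")
    if allowed_domains is not None:
        for url in iter_image_urls(messages):
            parsed = urlparse(url)
            # Only remote URLs are subject to the domain list.
            if parsed.scheme in ("http", "https") and parsed.hostname not in allowed_domains:
                problems.append(f"domain not allowed: {parsed.hostname}")
    return problems
```

For the example command above, the matching call would be `check_request(messages, image_limit=4, allowed_domains={"cdn.example.com"})`.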
Tool calling
With the server launched using --enable-auto-tool-choice --tool-call-parser hermes,
you can pass tools and let the model decide when to call them. See the
Tool Calling guide
for a complete client example and details on tool-choice options.
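As a rough sketch of the client side, the code below builds a request body with one tool and extracts any tool calls from a response. It assumes the standard OpenAI chat-completions shapes for `tools` and `tool_calls`; the `get_weather` tool and the helper names are hypothetical and exist only for illustration.

```python
import json

# Hypothetical tool definition in the OpenAI function-tool format.
WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}


def build_tool_request(prompt: str, tools: list,
                       model: str = "furiosa-ai/Qwen3-VL-32B-Instruct") -> dict:
    """Build a chat completions body that offers the given tools to the model."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "tools": tools,
    }


def extract_tool_calls(response: dict) -> list:
    """Return (name, arguments) pairs from the first choice, if any.

    In the OpenAI format, arguments arrive as a JSON-encoded string,
    so they are decoded into a dict here.
    """
    message = response["choices"][0]["message"]
    calls = []
    for call in message.get("tool_calls") or []:
        function = call["function"]
        calls.append((function["name"], json.loads(function["arguments"])))
    return calls
```

An empty result from `extract_tool_calls` means the model answered directly; otherwise the client runs each named tool and returns the result to the model in a follow-up message.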
Learn more
- Vision-Language Models: image input formats, multimodal server options, and the UUID cache
- Tool Calling: parsers, tool-choice options, and more examples
- Furiosa-LLM Server (furiosa-llm serve): full OpenAI-compatible API reference and serving options
- Qwen/Qwen3-VL-32B-Instruct: upstream model card