Qwen3-VL-32B-Instruct

This repository contains Qwen/Qwen3-VL-32B-Instruct together with a Furiosa Executable Bundle (FXB) for running it on FuriosaAI RNGD with Furiosa-LLM. The same model also runs on other frameworks (such as vLLM, SGLang, and Transformers); for usage with those, see the upstream Qwen/Qwen3-VL-32B-Instruct model card.

Overview

Qwen3-VL-32B-Instruct is a 32-billion-parameter dense vision-language model from the Qwen3-VL series. It pairs a vision encoder with a dense transformer decoder, using Interleaved-MRoPE positional embeddings and DeepStack multi-level feature fusion to handle images and videos alongside text. The model covers visual understanding tasks such as OCR, document and chart analysis, spatial reasoning, and video comprehension, and it natively supports tool (function) calling. This is the Instruct (non-thinking) edition. Its intended use is the same as the upstream Qwen/Qwen3-VL-32B-Instruct, released under the Apache 2.0 License.

Architecture: Qwen3-VL (dense)
Input / Output: Image + Text / Text
Supported Inference Engine: Furiosa-LLM
Supported Hardware: FuriosaAI RNGD

Quantization

No quantization is applied. The model runs in the same precision as the upstream weights (BF16).

Features

Vision-language. The model accepts OpenAI-style multimodal chat messages with image_url content parts alongside text.
Tool calling. The model supports tool (function) calling through the hermes tool-call parser.

Parallelism Strategy

On RNGD, Qwen3-VL-32B-Instruct runs with a tensor-parallel size of 32 PEs, which maps to four RNGD cards (8 PEs per card).
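The card count follows directly from the figures above (32 PEs, 8 PEs per card). A minimal sketch of that arithmetic; the helper name `cards_needed` is hypothetical and not part of Furiosa-LLM:

```python
def cards_needed(tensor_parallel_pes: int, pes_per_card: int = 8) -> int:
    """Return how many cards are required to host the given number of PEs."""
    if tensor_parallel_pes <= 0 or pes_per_card <= 0:
        raise ValueError("PE counts must be positive")
    # Ceiling division: a partially used card still counts as a whole card.
    return -(-tensor_parallel_pes // pes_per_card)


# 32 PEs at 8 PEs per card
print(cards_needed(32))  # 4
```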

Usage

To run this model with Furiosa-LLM, follow the example commands below after installing Furiosa-LLM and its prerequisites.

Launch the server

The simplest way to serve the model is:

# Launch the server, listening on port 8000 by default
furiosa-llm serve furiosa-ai/Qwen3-VL-32B-Instruct

When the server is ready, you will see:

INFO:     Started server process [27507]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
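When the server is started from a script, watching the log is inconvenient. One possible approach is to poll over HTTP until the server answers. The sketch below assumes the server exposes the `/v1/models` route that OpenAI-compatible servers commonly provide; that route, the helper names, and the timeout values are assumptions, not something this card documents.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_probe(url: str) -> bool:
    """Return True if a GET on `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or non-2xx status: not ready yet.
        return False


def wait_until_ready(
    base_url: str = "http://localhost:8000",
    timeout_s: float = 600.0,
    interval_s: float = 2.0,
    probe: Callable[[str], bool] = http_probe,
) -> bool:
    """Poll the (assumed) /v1/models route until it answers or time runs out."""
    url = f"{base_url}/v1/models"
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

Calling `wait_until_ready()` after launching `furiosa-llm serve` returns `True` once the probe succeeds, or `False` if the timeout elapses first.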

Launch the server with tool calling

To enable tool (function) calling, start the server with the hermes tool-call parser:

furiosa-llm serve furiosa-ai/Qwen3-VL-32B-Instruct \
  --enable-auto-tool-choice \
  --tool-call-parser hermes

Query the server

The server exposes an OpenAI-compatible API. You can send a text-only request with curl:

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
    "model": "furiosa-ai/Qwen3-VL-32B-Instruct",
    "messages": [{"role": "user", "content": "What is the capital of France?"}]
    }' \
    | python -m json.tool
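The same text-only request can be sent from Python. This is a minimal sketch using only the standard library; the helper names `build_chat_request` and `ask` are hypothetical, and the response is assumed to follow the usual OpenAI chat-completions shape (`choices[0].message.content`).

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"
MODEL = "furiosa-ai/Qwen3-VL-32B-Instruct"


def build_chat_request(prompt: str, base_url: str = BASE_URL,
                       model: str = MODEL) -> urllib.request.Request:
    """Build the POST request that mirrors the curl command above."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        body = json.load(resp)
    # Assumes the standard OpenAI-style response layout.
    return body["choices"][0]["message"]["content"]
```

With the server running, `ask("What is the capital of France?")` returns the reply as a string.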

To ask about an image, pass an image_url content part in the message:

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
    "model": "furiosa-ai/Qwen3-VL-32B-Instruct",
    "messages": [{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"}},
            {"type": "text", "text": "Describe this image."}
        ]
    }]
    }' \
    | python -m json.tool

The image_url.url field accepts a remote http:// or https:// URL, an inline base64 data: URL, or a local file:// path (the last of these requires the --allowed-local-media-path flag described below).
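For the inline option, the image bytes are base64-encoded and wrapped in a `data:` URL. A minimal sketch of how a client might build one and place it in a message; the helper names are hypothetical, and the MIME type must match the actual image format:

```python
import base64


def to_data_url(image_bytes: bytes, mime_type: str = "image/jpeg") -> str:
    """Encode raw image bytes as an inline base64 data: URL."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def image_message(url: str, question: str) -> dict:
    """Build a user message with one image part followed by one text part."""
    return {
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": url}},
            {"type": "text", "text": question},
        ],
    }
```

For example, `image_message(to_data_url(open("demo.jpeg", "rb").read()), "Describe this image.")` yields a message that can be placed in the `messages` array of a chat-completions request (the file name here is illustrative).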

Multimodal serving options

furiosa-llm serve provides flags to control multimodal behavior; requests that violate them are rejected with HTTP 400:

--image-limit-per-prompt N / --video-limit-per-prompt N — maximum number of images/videos allowed per request (default: unlimited).
--allowed-local-media-path PATH — allow file:// URLs whose resolved path is under PATH. Local file access is disabled unless this is set.
--allowed-media-domains D [D ...] — whitelist of remote domains for SSRF protection. When set, only images from the listed domains are fetched.
--interleave-mm-strings — keep image placeholders at their original positions when the model uses a string-format chat template (no-op for OpenAI-format templates, the common case).
--mm-processor-cache-gb GB — size of the UUID-keyed multimodal processor cache (default: 4.0). Clients can tag an image_url part with a uuid field and re-reference it in follow-up requests without re-uploading the image bytes; set to 0 to disable.

For example, to serve local images under /srv/media and restrict remote fetches to a single domain:

furiosa-llm serve furiosa-ai/Qwen3-VL-32B-Instruct \
  --allowed-local-media-path /srv/media \
  --allowed-media-domains cdn.example.com \
  --image-limit-per-prompt 4
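Because requests that violate these limits come back as HTTP 400, a client may prefer to check a request before sending it. The sketch below is a client-side approximation only: the function name `preflight` is hypothetical, and the exact-hostname comparison is an assumption, since the server's own matching rules (for subdomains, ports, and so on) are not described here.

```python
from urllib.parse import urlsplit


def preflight(messages: list, image_limit: int, allowed_domains: set) -> list:
    """Return a list of problems; an empty list means the request looks fine."""
    # Collect every image URL from list-style (multimodal) message contents.
    urls = [
        part["image_url"]["url"]
        for msg in messages
        if isinstance(msg.get("content"), list)
        for part in msg["content"]
        if part.get("type") == "image_url"
    ]
    problems = []
    if len(urls) > image_limit:
        problems.append(f"{len(urls)} images exceeds limit of {image_limit}")
    for url in urls:
        parts = urlsplit(url)
        # Only remote URLs are subject to the domain allowlist.
        if parts.scheme in ("http", "https") and parts.hostname not in allowed_domains:
            problems.append(f"domain not allowed: {parts.hostname}")
    return problems
```

With the server flags from the example above, a client would call `preflight(messages, 4, {"cdn.example.com"})` and send the request only if the result is empty.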

See the Vision-Language Models guide for image input formats, the UUID cache, and Python client examples.
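As a rough illustration of the UUID cache, a client could attach a stable identifier to each image part so the same image always carries the same tag. The placement of `uuid` as a sibling of `image_url` is an assumption here, and the form of a follow-up request that references the tag is not shown; the Vision-Language Models guide is the authority on both. The helper name is hypothetical.

```python
import uuid


def tagged_image_part(url: str) -> dict:
    """Build an image_url content part carrying a deterministic uuid tag."""
    # uuid5 is derived from the URL, so the same URL always yields the same tag.
    image_id = str(uuid.uuid5(uuid.NAMESPACE_URL, url))
    return {
        "type": "image_url",
        "image_url": {"url": url},
        "uuid": image_id,  # assumed field placement; check the guide
    }
```

Deriving the tag from the URL keeps it stable across client restarts without any local bookkeeping; a content hash would serve the same purpose for inline data: URLs.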

Tool calling

With the server launched using --enable-auto-tool-choice --tool-call-parser hermes, you can pass tools and let the model decide when to call them. See the Tool Calling guide for a complete client example and details on tool-choice options.
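A request with tools might look like the following sketch. It assumes the standard OpenAI `tools` / `tool_calls` schema; the `get_weather` tool and both helper names are hypothetical and exist only for illustration.

```python
import json

MODEL = "furiosa-ai/Qwen3-VL-32B-Instruct"


def build_tool_request(question: str, model: str = MODEL) -> dict:
    """Build a chat-completions payload that offers one illustrative tool."""
    weather_tool = {
        "type": "function",
        "function": {
            "name": "get_weather",  # hypothetical tool
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
    return {
        "model": model,
        "messages": [{"role": "user", "content": question}],
        "tools": [weather_tool],
        "tool_choice": "auto",
    }


def extract_tool_calls(response: dict) -> list:
    """Return (name, arguments) pairs from an OpenAI-style response body."""
    message = response["choices"][0]["message"]
    calls = message.get("tool_calls") or []
    # Arguments arrive as a JSON-encoded string in the OpenAI schema.
    return [
        (call["function"]["name"], json.loads(call["function"]["arguments"]))
        for call in calls
    ]
```

The payload is posted to `/v1/chat/completions` like any other request. If `extract_tool_calls` returns an empty list, the model answered directly instead of calling a tool.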

Learn more

Vision-Language Models — image input formats, multimodal server options, and the UUID cache
Tool Calling — parsers, tool-choice options, and more examples
Furiosa-LLM Server (furiosa-llm serve) — full OpenAI-compatible API reference and serving options
Qwen/Qwen3-VL-32B-Instruct — upstream model card

Model size (Safetensors): 33B params
Tensor type: BF16
