Qwen3-30B-A3B-FP8

This repository contains Qwen/Qwen3-30B-A3B-FP8 together with a Furiosa Executable Bundle (FXB) for running it on FuriosaAI RNGD with Furiosa-LLM. The same model also runs on other frameworks (such as vLLM, SGLang, and Transformers); for usage with those, see the upstream Qwen/Qwen3-30B-A3B-FP8 model card.

Overview

Qwen3-30B-A3B is an auto-regressive Mixture-of-Experts (MoE) transformer with 30.5B total parameters, of which about 3.3B are activated per token. It is a hybrid reasoning model that supports both thinking and non-thinking modes within a single model, switchable per request, and offers strong reasoning over text, instruction following, multilingual coverage, and tool usage. Its intended use is the same as the upstream Qwen/Qwen3-30B-A3B-FP8, and it is released under the Apache 2.0 License.

  • Architecture: Qwen3-MoE (Mixture-of-Experts)
  • Input / Output: Text / Text
  • Supported Inference Engine: Furiosa-LLM
  • Supported Hardware: FuriosaAI RNGD

Quantization

Weights are quantized to FP8 (static), following the upstream FP8 release, and activations use dynamic FP8 quantization at runtime (per-token / per-block). The KV cache stays in 16-bit precision.
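
To make the "dynamic, per-token" part concrete, here is a minimal pure-Python sketch of how a per-token activation scale can be derived at runtime. It is illustrative only and not the Furiosa-LLM kernel: the function names are hypothetical, the constant 448 is the largest finite magnitude of the FP8 E4M3 format, and the final rounding to the FP8 grid is deliberately omitted.

```python
# Illustrative sketch only; not the Furiosa-LLM implementation.
E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3


def per_token_scales(activations):
    """Return one dynamic scale per token (row), computed from that row's max magnitude."""
    scales = []
    for row in activations:
        amax = max(abs(v) for v in row)
        # Guard against all-zero rows so the later division is safe.
        scales.append(amax / E4M3_MAX if amax > 0 else 1.0)
    return scales


def scale_to_fp8_range(activations):
    """Divide each row by its scale so values fit (up to float error) in [-448, 448].

    Rounding to the FP8 grid is omitted; a real kernel would cast after this step
    and keep the scales to dequantize the result.
    """
    scales = per_token_scales(activations)
    scaled = [[v / s for v in row] for row, s in zip(activations, scales)]
    return scaled, scales


if __name__ == "__main__":
    tokens = [[0.5, -2.0, 1.0], [100.0, 25.0, -50.0]]
    scaled, scales = scale_to_fp8_range(tokens)
    print(scales)
    print(scaled)
```

Because the scale is recomputed from each token's own values, no calibration data is needed for activations, in contrast to the static scales stored with the weights.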

Features

  • Reasoning. Qwen3-30B-A3B is a hybrid reasoning model: thinking mode can be enabled or disabled per request (for example, via enable_thinking or the /think and /no_think prompt directives). To have Furiosa-LLM return the chain of thought in a separate field whenever thinking mode is active, launch the server with the qwen3 reasoning parser (see Reasoning below).
  • Tool calling. The model supports tool (function) calling through the hermes tool-call parser, the parser used by the Qwen3 series.
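
As a rough illustration of the per-request switch, the sketch below builds a Chat Completions request body that toggles thinking with the /think and /no_think directives. The helper names are hypothetical, and placing the directive at the end of the user message is an assumption based on upstream Qwen3 usage; check the upstream model card for the authoritative behavior.

```python
def with_thinking_directive(prompt, thinking):
    """Append a Qwen3 soft-switch directive to a user prompt.

    Assumption: the directive is placed at the end of the user message.
    """
    directive = "/think" if thinking else "/no_think"
    return f"{prompt} {directive}"


def build_payload(prompt, thinking, model="furiosa-ai/Qwen3-30B-A3B-FP8"):
    """Build a Chat Completions request body with thinking toggled for this request."""
    return {
        "model": model,
        "messages": [
            {"role": "user", "content": with_thinking_directive(prompt, thinking)}
        ],
    }


print(build_payload("What is the capital of France?", thinking=False))
```

The resulting dictionary can be serialized as JSON and posted to the server's chat completions endpoint.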

Parallelism Strategy

On RNGD, Qwen3-30B-A3B-FP8 runs with a tensor-parallel size of 32 PEs, which maps to four RNGD cards (8 PEs per card).
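
The card count follows directly from the PE count. A small sketch of that arithmetic, where the helper name is hypothetical and the 8 PEs per card figure is taken from the sentence above:

```python
import math

PES_PER_CARD = 8  # per the text: each RNGD card provides 8 PEs


def cards_needed(tensor_parallel_pes, pes_per_card=PES_PER_CARD):
    """Number of RNGD cards required for a given tensor-parallel size in PEs."""
    return math.ceil(tensor_parallel_pes / pes_per_card)


print(cards_needed(32))  # 4
```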

Usage

To run this model with Furiosa-LLM, follow the example commands below after installing Furiosa-LLM and its prerequisites.

Launch the server

The simplest way to serve the model is:

# Launch the server, listening on port 8000 by default
furiosa-llm serve furiosa-ai/Qwen3-30B-A3B-FP8 \
  --reasoning-parser qwen3

When thinking mode is active, the reasoning content is returned in a separate field (see Reasoning below).

When the server is ready, you will see:

INFO:     Started server process [27507]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
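
If a script needs to wait for this state rather than watch the log, one option is to poll the server until it answers. The sketch below assumes the server exposes the OpenAI-style /v1/models endpoint, which this document does not confirm; the function names and timeouts are hypothetical, so substitute any URL you know your deployment serves.

```python
import time
import urllib.request


def http_probe(url):
    """Return True if a GET to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status == 200
    except OSError:  # covers URLError, connection refused, and socket timeouts
        return False


def wait_until_ready(url="http://localhost:8000/v1/models", timeout_s=600.0,
                     interval_s=2.0, probe=http_probe, sleep=time.sleep):
    """Poll `url` until the probe succeeds or `timeout_s` elapses.

    Assumption: the /v1/models path is served; it is not confirmed by this card.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if probe(url):
            return True
        sleep(interval_s)
    return False
```

Calling wait_until_ready() returns True once the server responds and False if the timeout expires first. The probe and sleep parameters are injectable so the loop can be exercised without a running server.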

Launch the server with tool calling

To enable tool (function) calling, start the server with the hermes tool-call parser:

furiosa-llm serve furiosa-ai/Qwen3-30B-A3B-FP8 \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser hermes

Query the server

The server exposes an OpenAI-compatible API. You can send a request with curl:

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
    "model": "furiosa-ai/Qwen3-30B-A3B-FP8",
    "messages": [{"role": "user", "content": "What is the capital of France?"}]
    }' \
    | python -m json.tool
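
The same request can be sent from Python without extra dependencies. This is a minimal sketch using only the standard library; the helper names are hypothetical, and the URL and model name mirror the curl command above.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"
MODEL = "furiosa-ai/Qwen3-30B-A3B-FP8"


def build_chat_request(prompt, base_url=BASE_URL, model=MODEL):
    """Build the same POST request that the curl command sends."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt):
    """Send the request to a running server and return the assistant's text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        reply = json.load(resp)
    return reply["choices"][0]["message"]["content"]
```

With the server running, ask("What is the capital of France?") returns the answer text.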

Reasoning

In thinking mode, Qwen3-30B-A3B-FP8 returns its reasoning separately from the final answer:

  • response.choices[].message.reasoning (non-streaming)
  • response.choices[].delta.reasoning (streaming)

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="furiosa-ai/Qwen3-30B-A3B-FP8",
    messages=[{"role": "user", "content": "How many r's are in 'strawberry'?"}],
)

# reasoning is present only when the response contains reasoning content
print("Reasoning:", getattr(response.choices[0].message, "reasoning", None))
print("Answer:", response.choices[0].message.content)

Note: The reasoning field is not part of the OpenAI API specification, but it is the convention OpenAI recommends for returning the chain-of-thought (CoT) in Chat Completions-compatible APIs. The OpenAI Agents SDK uses reasoning as its primary property for the CoT, and many LLM serving frameworks (such as vLLM) follow the same convention. It appears only in responses that contain reasoning content; accessing it on a response without reasoning content raises an AttributeError.
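
For the streaming case, the same caveat applies to each delta. The sketch below accumulates delta.reasoning and delta.content separately and uses getattr so that chunks without reasoning content do not raise. The names collect_stream and fake_chunk are hypothetical; fake_chunk exists only so the logic can be tried without a server.

```python
from types import SimpleNamespace


def collect_stream(chunks):
    """Accumulate reasoning and answer text from streamed chat-completion chunks."""
    reasoning_parts, answer_parts = [], []
    for chunk in chunks:
        if not chunk.choices:
            continue  # skip chunks that carry no choices
        delta = chunk.choices[0].delta
        # reasoning is absent on deltas without reasoning content, so use getattr.
        reasoning = getattr(delta, "reasoning", None)
        if reasoning:
            reasoning_parts.append(reasoning)
        content = getattr(delta, "content", None)
        if content:
            answer_parts.append(content)
    return "".join(reasoning_parts), "".join(answer_parts)


def fake_chunk(**delta_fields):
    """Offline stand-in for a streamed chunk; carries only the given delta fields."""
    delta = SimpleNamespace(**delta_fields)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


if __name__ == "__main__":
    demo = [fake_chunk(reasoning="Counting letters. "), fake_chunk(content="Three.")]
    print(collect_stream(demo))
```

Against a live server, pass the iterator returned by client.chat.completions.create(..., stream=True) to collect_stream in place of the fake chunks.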

Tool calling

With the server launched using --enable-auto-tool-choice --tool-call-parser hermes, you can pass tools and let the model decide when to call them. See the Tool Calling guide for a complete client example and details on tool-choice options.
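
As a rough outline of the client side, the sketch below defines one tool in the OpenAI function-calling schema and turns any tool calls on an assistant message into tool-role replies. The get_weather function, the REGISTRY mapping, and run_tool_calls are hypothetical names for illustration; the Tool Calling guide remains the reference for the full flow.

```python
import json
from types import SimpleNamespace


def get_weather(city):
    """Hypothetical tool for illustration; replace with a real implementation."""
    return {"city": city, "forecast": "sunny"}


# Tool description in the OpenAI function-calling schema.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

REGISTRY = {"get_weather": get_weather}


def run_tool_calls(message):
    """Execute each tool call on an assistant message and build tool-role replies."""
    results = []
    for call in getattr(message, "tool_calls", None) or []:
        func = REGISTRY[call.function.name]
        args = json.loads(call.function.arguments)  # arguments arrive as a JSON string
        results.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": json.dumps(func(**args)),
        })
    return results


if __name__ == "__main__":
    # Offline stand-in for an assistant message that requests one tool call.
    call = SimpleNamespace(
        id="call_1",
        function=SimpleNamespace(name="get_weather", arguments='{"city": "Seoul"}'),
    )
    print(run_tool_calls(SimpleNamespace(tool_calls=[call])))
```

In a real exchange, pass tools=TOOLS to client.chat.completions.create, run run_tool_calls on response.choices[0].message, append the assistant message and the returned tool messages to the conversation, and call the API again to get the final answer.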

Learn more
