Qwen3-8B-FP8
This repository contains Qwen/Qwen3-8B-FP8 together with a Furiosa Executable Bundle (FXB) for running it on FuriosaAI RNGD with Furiosa-LLM. The same model also runs on other frameworks (such as vLLM, SGLang, and Transformers); for usage with those, see the upstream Qwen/Qwen3-8B-FP8 model card.
Overview
Qwen3-8B is the 8.2B-parameter dense model of the Qwen3 series, a causal transformer with grouped-query attention. Its hallmark is seamless switching between a thinking mode — emitting a chain of thought before the final answer for complex reasoning, math, and coding — and a non-thinking mode for efficient general dialogue, within a single model. It also offers strong tool-calling and agent capabilities and multilingual support. Its intended use is the same as the upstream Qwen/Qwen3-8B-FP8, and it is released under the Apache 2.0 License.
- Architecture: Qwen3 (dense)
- Input / Output: Text / Text
- Supported Inference Engine: Furiosa-LLM
- Supported Hardware: FuriosaAI RNGD
Quantization
The model weights are statically quantized to FP8, using the same fine-grained FP8 quantization (block size 128) that the upstream model ships with. Activations are quantized to FP8 dynamically at runtime (per-token / per-block).
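To make "block size 128" concrete, here is a minimal pure-Python sketch of how a per-block scale can be derived for block-wise FP8 weights. It is not the quantizer used by the upstream model or by Furiosa-LLM. The E4M3 maximum of 448, the square 128 x 128 tiles, and the helper names `block_scales` and `scaled_tile_value` are all assumptions made for illustration, and the final rounding to FP8 is omitted.

```python
# Illustrative only: per-block scale computation for block-wise FP8 weights.
# Assumptions (not from this model card): E4M3 format with max finite value 448,
# and square tiles of BLOCK x BLOCK weights that share one scale.
FP8_E4M3_MAX = 448.0
BLOCK = 128


def block_scales(weight, block=BLOCK):
    """Return one scale per (block x block) tile of a 2-D list of floats."""
    rows, cols = len(weight), len(weight[0])
    scales = []
    for r0 in range(0, rows, block):
        row_scales = []
        for c0 in range(0, cols, block):
            tile_max = max(
                abs(weight[r][c])
                for r in range(r0, min(r0 + block, rows))
                for c in range(c0, min(c0 + block, cols))
            )
            # Map the largest magnitude in the tile onto the FP8 range.
            row_scales.append(tile_max / FP8_E4M3_MAX if tile_max else 1.0)
        scales.append(row_scales)
    return scales


def scaled_tile_value(value, scale):
    """Scale and clamp a value into the FP8 range (rounding to FP8 is omitted)."""
    return max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, value / scale))
```

Because each tile gets its own scale, a single outlier weight only reduces precision inside its own block rather than across the whole tensor.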
Features
- Reasoning. Qwen3-8B is a hybrid reasoning model: in thinking mode it produces a chain of thought before the final answer, and you can toggle modes per request with /think and /no_think. Launch the server with --reasoning-parser qwen3 to have the reasoning content returned in a separate field.
- Tool calling. The model supports tool (function) calling through the hermes tool-call parser.
Parallelism Strategy
On RNGD, Qwen3-8B-FP8 runs with a tensor-parallel size of 8 PEs, which maps to a single RNGD card (8 PEs per card).
Usage
To run this model with Furiosa-LLM, follow the example commands below after installing Furiosa-LLM and its prerequisites.
Launch the server
The simplest way to serve the model is:
# Launch the server, listening on port 8000 by default
furiosa-llm serve furiosa-ai/Qwen3-8B-FP8 --reasoning-parser qwen3
The --reasoning-parser qwen3 flag parses the model's thinking content into a
separate field (see Reasoning below).
When the server is ready, you will see:
INFO: Started server process [27507]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
Launch the server with tool calling
To enable tool (function) calling, start the server with the hermes tool-call
parser:
furiosa-llm serve furiosa-ai/Qwen3-8B-FP8 \
--reasoning-parser qwen3 \
--enable-auto-tool-choice \
--tool-call-parser hermes
Query the server
The server exposes an OpenAI-compatible API. You can send a request with curl:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "furiosa-ai/Qwen3-8B-FP8",
"messages": [{"role": "user", "content": "What is the capital of France?"}]
}' \
| python -m json.tool
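The same request can be sent from Python using only the standard library. This is a minimal sketch that assumes the server is listening on localhost:8000 as in the launch example; `build_chat_request` and `ask` are hypothetical helper names, not part of Furiosa-LLM.

```python
import json
import urllib.request

# Assumes the server from "Launch the server" is listening on localhost:8000.
BASE_URL = "http://localhost:8000/v1"
MODEL = "furiosa-ai/Qwen3-8B-FP8"


def build_chat_request(prompt, base_url=BASE_URL, model=MODEL):
    """Build a POST request for the OpenAI-compatible chat completions endpoint."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt):
    """Send the prompt to the server and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(ask("What is the capital of France?"))` prints the final answer. Any OpenAI-compatible client library should work equally well against the same base URL.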
Reasoning
With --reasoning-parser qwen3, the thinking content is returned separately
from the final answer:
- response.choices[].message.reasoning (non-streaming)
- response.choices[].delta.reasoning (streaming)
You can switch off thinking for a request by adding /no_think to the prompt
(and back on with /think).
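As a small illustration of the per-request switch, the helper below builds a user message with /no_think appended when thinking should be off. Placing the switch at the end of the prompt is an assumption made here for simplicity, and `user_message` is a hypothetical name.

```python
def user_message(prompt, thinking=True):
    """Build a chat message dict, appending /no_think when thinking is disabled.

    Assumption: the switch is placed at the end of the user prompt.
    """
    content = prompt if thinking else f"{prompt} /no_think"
    return {"role": "user", "content": content}


# Example: the same question with thinking left on and switched off.
messages_thinking = [user_message("Is 97 a prime number?")]
messages_direct = [user_message("Is 97 a prime number?", thinking=False)]
```

Either list can be passed as the "messages" field of a chat completions request.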
Note: The reasoning field is not part of the OpenAI API specification, but it is the convention OpenAI recommends for returning the chain-of-thought (CoT) in Chat Completions-compatible APIs. The OpenAI Agents SDK uses reasoning as its primary property for the CoT, and many LLM serving frameworks (such as vLLM) follow the same convention. It appears only in responses that contain reasoning content; accessing it on a response without reasoning content raises an AttributeError.
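Because the field may be absent, client code can read it defensively with getattr instead of plain attribute access. The sketch below uses SimpleNamespace stand-ins in place of real client response objects; `split_reasoning` and `collect_stream` are hypothetical helpers written for illustration.

```python
from types import SimpleNamespace


def split_reasoning(message):
    """Return (reasoning, answer); reasoning is None when the field is absent."""
    return getattr(message, "reasoning", None), message.content


def collect_stream(deltas):
    """Concatenate reasoning and answer text from a sequence of streamed deltas."""
    reasoning, answer = [], []
    for delta in deltas:
        # Either field may be missing or None on any given chunk.
        reasoning.append(getattr(delta, "reasoning", None) or "")
        answer.append(getattr(delta, "content", None) or "")
    return "".join(reasoning), "".join(answer)


# Stand-in objects for illustration; real ones come from the client library.
with_cot = SimpleNamespace(reasoning="Paris is the capital.", content="Paris")
without_cot = SimpleNamespace(content="Paris")
print(split_reasoning(with_cot))
print(split_reasoning(without_cot))
```

The second call returns None for the reasoning part rather than raising, which keeps the same code path usable for requests sent with /no_think.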
Tool calling
With the server launched using --enable-auto-tool-choice --tool-call-parser hermes,
you can pass tools and let the model decide when to call them. See the
Tool Calling guide
for a complete client example and details on tool-choice options.
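For orientation, a request body with tools in the OpenAI Chat Completions format might look like the sketch below. The `get_weather` tool is hypothetical and not taken from the Tool Calling guide, and the helpers `build_tool_payload` and `parse_tool_calls` are illustrative names that operate on plain JSON dictionaries.

```python
import json

# Hypothetical tool definition for illustration; not taken from the Tool Calling guide.
GET_WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}


def build_tool_payload(prompt, tools, model="furiosa-ai/Qwen3-8B-FP8"):
    """Build a chat completions request body that lets the model pick a tool."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "tools": tools,
        "tool_choice": "auto",
    }


def parse_tool_calls(message):
    """Extract (name, arguments) pairs from a response message dict."""
    calls = message.get("tool_calls") or []
    return [
        (call["function"]["name"], json.loads(call["function"]["arguments"]))
        for call in calls
    ]
```

The payload is posted to /v1/chat/completions like any other chat request. When the model decides to call a tool, the response message carries a tool_calls list whose arguments arrive as a JSON string, which is why `parse_tool_calls` decodes them.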
Learn more
- Tool Calling: parsers, tool-choice options, and more examples
- Furiosa-LLM Server (furiosa-llm serve): full OpenAI-compatible API reference and serving options
- Qwen/Qwen3-8B-FP8: upstream model card