Instructions to use Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 with libraries, inference providers, notebooks, and local apps.

Libraries

How to use Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="Qwen/Qwen3-Next-80B-A3B-Instruct-FP8")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-Next-80B-A3B-Instruct-FP8")
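# device_map="auto" (requires the accelerate package) places the weights on the available GPUs
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-Next-80B-A3B-Instruct-FP8", device_map="auto")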
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "Qwen/Qwen3-Next-80B-A3B-Instruct-FP8"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Qwen/Qwen3-Next-80B-A3B-Instruct-FP8",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
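
The same request can also be sent from Python. The following is a minimal sketch using only the standard library; the helper names `build_chat_request` and `chat` are hypothetical, and it assumes the server started above is listening on `http://localhost:8000` and returns the usual OpenAI-style response shape.

```python
import json
import urllib.request

MODEL_ID = "Qwen/Qwen3-Next-80B-A3B-Instruct-FP8"


def build_chat_request(prompt, base_url="http://localhost:8000", model=MODEL_ID):
    """Build the same POST request as the curl example (nothing is sent yet)."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(prompt, **kwargs)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    # Assumption: OpenAI-compatible servers put the reply under choices[0].message.content.
    return body["choices"][0]["message"]["content"]
```

With a server running, `chat("What is the capital of France?")` should return the reply text. Passing `base_url="http://localhost:30000"` would target a server on port 30000 instead.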

Use Docker

docker model run hf.co/Qwen/Qwen3-Next-80B-A3B-Instruct-FP8

SGLang

How to use Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "Qwen/Qwen3-Next-80B-A3B-Instruct-FP8" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Qwen/Qwen3-Next-80B-A3B-Instruct-FP8",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "Qwen/Qwen3-Next-80B-A3B-Instruct-FP8" \
        --host 0.0.0.0 \
        --port 30000

Docker Model Runner
How to use Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 with Docker Model Runner:
```
docker model run hf.co/Qwen/Qwen3-Next-80B-A3B-Instruct-FP8
```

I find Qwen3 Next exceptional, but too big.

by ZeroWw - opened Nov 5, 2025

Discussion

ZeroWw

Nov 5, 2025

Please create a 32B or even 14B model! It would be great!

RecViking

Nov 6, 2025

Qwen3 Next is exceptional partly because of its size. Parameter count doesn't map 1:1 to capability, but there is certainly a strong, well-studied link between the two. You could remove half of the experts from the model and attempt to resettle the weights, but you'd end up with something of roughly half the capability, depending on how and what you decided to remove and what you decided to measure as capability. You could even get more precise: measure activations to discover which experts are most useful for your use cases, then remove the ones you don't "need". But you are giving up generalization in that case too.
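
To make the activation-based idea concrete, here is a toy sketch of counting how often each expert is selected under top-k routing and keeping only the most-used half. It is not Qwen3 Next's actual routing code: the function names, the 8 experts, the 2 active experts per token, and the random scores are all assumptions for illustration. In practice the scores would come from the model's router on your own prompts.

```python
import random
from collections import Counter


def top_k_experts(router_scores, k):
    """Return the indices of the k highest-scoring experts for one token."""
    ranked = sorted(range(len(router_scores)), key=lambda i: router_scores[i], reverse=True)
    return ranked[:k]


def count_expert_usage(scores_per_token, k):
    """Count how often each expert is selected across a batch of tokens."""
    usage = Counter()
    for router_scores in scores_per_token:
        usage.update(top_k_experts(router_scores, k))
    return usage


def experts_to_keep(usage, num_experts, keep):
    """Keep the `keep` most-used experts; ties are broken by lower index."""
    ranked = sorted(range(num_experts), key=lambda i: (-usage[i], i))
    return sorted(ranked[:keep])


# Toy data (hypothetical): 8 experts, 2 active per token, 1000 random tokens.
random.seed(0)
NUM_EXPERTS, TOP_K = 8, 2
scores = [[random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)] for _ in range(1000)]
usage = count_expert_usage(scores, TOP_K)
kept = experts_to_keep(usage, NUM_EXPERTS, keep=NUM_EXPERTS // 2)
print("selections per expert:", [usage[i] for i in range(NUM_EXPERTS)])
print("experts kept:", kept)
```

The counts only reflect the prompts they were measured on, which is the generalization trade-off described above: experts that look unused on one workload may matter on another.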

A model "stores" information and behaviors/capabilities in one and the same parameter space. Remove part of that space and you remove whatever knowledge and/or capability lived there. Other areas within the network may be able to compensate, but you are losing specificity.

Information only compresses so far. There are limits. For an LLM, size matters, at least with current technology and architectures. We really need a major architecture shift and/or a boost in hardware capabilities and power efficiency.
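
For a rough sense of the sizes being discussed, the arithmetic below estimates weight storage alone. It is a back-of-the-envelope sketch that assumes 1 byte per parameter for FP8 and 2 bytes for BF16, and ignores quantization scales, layers kept in higher precision, the KV cache, and runtime overhead. The helper name `weight_memory_gb` is hypothetical; the 80B and 3B figures are read from the "80B-A3B" in the model name.

```python
def weight_memory_gb(num_params_billion, bytes_per_param):
    """Approximate weight storage in GB (1 GB = 1e9 bytes).

    Billions of parameters times bytes per parameter gives GB directly.
    """
    return num_params_billion * bytes_per_param


# Assumed precisions: FP8 = 1 byte per parameter, BF16 = 2 bytes per parameter.
for size_b in (80, 32, 14):
    fp8 = weight_memory_gb(size_b, 1)
    bf16 = weight_memory_gb(size_b, 2)
    print(f"{size_b}B parameters: ~{fp8:.0f} GB at FP8, ~{bf16:.0f} GB at BF16")

# Only about 3B parameters are active per token, but every expert must stay
# loaded, so the full 80B still has to fit in memory.
print(f"active per token: ~{weight_memory_gb(3, 1):.0f} GB of FP8 weights read")
```

This is why a mixture-of-experts model can be fast per token yet still demanding to host: compute scales with the active parameters, memory with the total.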
