I find Qwen3 Next exceptional, but too big.
Please create a 32B or even a 14B model! It would be great!
Qwen3 Next is exceptional partly because of its size. Parameter count doesn't map 1:1 to capability, but there is certainly a strong, well-studied link between the two. You could remove half of the experts from the model and attempt to resettle the weights, but you'd end up with something at roughly half the capability, depending on what you removed, how you removed it, and what you measure as capability. You could even be more precise: measure expert activations, find which experts are most useful for your use cases, and remove the ones you don't "need". But you are giving up generalization in that case too.
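To make the activation-based idea concrete, here is a minimal toy sketch in plain Python. It is not Qwen3 Next's actual router or any library's API: the function names, the 8-expert/top-2 setup, and the randomly generated router scores are all hypothetical, chosen only to show how you might count which experts a calibration set actually routes to and then pick a subset to keep.

```python
import random
from collections import Counter

def count_expert_usage(router_scores, top_k):
    """Count how often each expert lands in a token's top-k routing choices."""
    usage = Counter()
    for scores in router_scores:  # one list of per-expert scores per token
        ranked = sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)
        usage.update(ranked[:top_k])
    return usage

def experts_to_keep(usage, num_experts, keep):
    """Return the indices of the `keep` most frequently routed experts, sorted."""
    ranked = sorted(range(num_experts), key=lambda e: usage[e], reverse=True)
    return sorted(ranked[:keep])

# Hypothetical toy setup: 8 experts, 2 active per token, 1000 calibration tokens.
rng = random.Random(0)
NUM_EXPERTS, TOP_K = 8, 2

# Assumption: bias the random scores toward low-index experts to mimic a
# narrow use case that leans on only part of the model.
router_scores = [
    [rng.gauss(0, 1) - 0.5 * e for e in range(NUM_EXPERTS)]
    for _ in range(1000)
]

usage = count_expert_usage(router_scores, TOP_K)
kept = experts_to_keep(usage, NUM_EXPERTS, keep=NUM_EXPERTS // 2)

# Share of routing decisions on this calibration set that the kept experts cover.
coverage = sum(usage[e] for e in kept) / sum(usage.values())
print("kept experts:", kept)
print("routing coverage on calibration data:", round(coverage, 3))
```

High coverage here only says the kept experts handle the inputs you measured. Anything routed to the dropped experts on other kinds of input is exactly the generalization you would be giving up.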
A model "stores" information and behaviors/capabilities in one single space. Remove part of that space and you remove whatever knowledge and/or capability lived there. Other areas of the network may be able to compensate, but you lose specificity.
Information only compresses so far. There are limits. For an LLM, size matters, at least with current technology and architectures. We really need a major architecture shift and/or a boost in hardware capabilities and power efficiency.