Throughput does not seem to be as good as Eagle3?

by Jing17 - opened Mar 6

Discussion

Jing17

Mar 6

Hello, I trained a Qwen3-32B-Eagle3 model on the Eagle Chat training data and tested on GSM8K with H20 GPUs and SGLang. The DFlash acceptance rate is higher than that of Eagle3 with the 3-1-4 strategy, but DFlash throughput does not seem to be as good as Eagle3's. Is that expected?

gqzs

Mar 10

It might be because the verification stage of DFlash consumes too much unnecessary compute. You could try using a better GPU or reducing the number of tokens in the verification stage.
By the way, what concurrency level did you use for the evaluation?
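
To illustrate why a higher acceptance rate does not by itself guarantee higher throughput, here is a minimal cost-model sketch. It assumes the speedup of one draft-and-verify cycle is roughly the number of tokens emitted per cycle divided by the cycle's total cost, measured in units of one ordinary target-model decode step. The function name and all numbers are hypothetical and chosen only for illustration; they are not measurements of DFlash or Eagle3.

```python
def speculative_speedup(tokens_per_cycle, draft_cost, verify_cost):
    """Rough speedup over plain autoregressive decoding.

    tokens_per_cycle: mean tokens emitted per draft+verify cycle.
    draft_cost:  cost of drafting, in units of one target decode step.
    verify_cost: cost of the verification pass, in the same units.
    """
    if tokens_per_cycle <= 0 or draft_cost < 0 or verify_cost <= 0:
        raise ValueError("token count and verify cost must be positive")
    return tokens_per_cycle / (draft_cost + verify_cost)


# Hypothetical method A: fewer accepted tokens, cheap verification.
method_a = speculative_speedup(tokens_per_cycle=3.0, draft_cost=0.3, verify_cost=1.0)

# Hypothetical method B: more accepted tokens, but verifying a larger
# block costs more once the GPU is compute-bound.
method_b = speculative_speedup(tokens_per_cycle=4.0, draft_cost=0.2, verify_cost=1.8)

print(f"A: {method_a:.2f}x, B: {method_b:.2f}x")
```

With these made-up inputs, method B accepts more tokens per cycle yet ends up slower than method A, because its verification pass is more expensive. Under this simplified model, shrinking the number of verified tokens or running on hardware with more compute headroom both lower `verify_cost`.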

Jing17

Mar 11

I used 4 × H20 GPUs, with concurrency=8.
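
For anyone reproducing a comparison like this, the following is a rough sketch of measuring output throughput at a fixed concurrency. It assumes a hypothetical `send_request` callable that sends one prompt to the server under test and returns the number of completion tokens generated; that callable is not part of SGLang or any other library and has to be supplied by the user.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def measure_throughput(send_request, prompts, concurrency=8):
    """Return generated tokens per second over all prompts.

    send_request: hypothetical callable, prompt -> completion token count.
    concurrency:  number of requests kept in flight at once.
    """
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        token_counts = list(pool.map(send_request, prompts))
    elapsed = time.perf_counter() - start
    return sum(token_counts) / elapsed
```

Running the same prompts and the same `concurrency` value against both draft models keeps the comparison fair, since the relative cost of verification can change with batch size.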
