Qwen3-32B DFlash Draft Model (EagleChat 400K Mix)
A DFlash draft model trained from Qwen3-32B using an EagleChat subset (English 200K + Chinese 200K) to accelerate speculative decoding.
Model Summary
This repository provides a DFlash draft model for Qwen3-32B. The draft model is intended to be used together with the target model in SpecForge, improving throughput (output tokens/sec) under standard speculative verification.
- Base / Target model: `Qwen/Qwen3-32B`
- Draft model type: DFlash (speculative decoding draft)
- Training data: EagleChat subset (English 200K + Chinese 200K; total ~400K)
- Training hardware: H100
- Primary use case: accelerate inference with DFlash / SpecForge
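To make "standard speculative verification" concrete, here is a minimal sketch of one draft-then-verify step under greedy decoding. It is a generic illustration, not the DFlash or SpecForge implementation: the function name `verify_greedy` and the token lists are hypothetical, and real systems operate on logits and batched tensors rather than Python lists.

```python
from typing import List


def verify_greedy(draft_tokens: List[int], target_tokens: List[int]) -> List[int]:
    """Return the tokens emitted by one draft-then-verify step (greedy case).

    draft_tokens:  k tokens proposed by the draft model.
    target_tokens: k + 1 tokens the target model would pick (argmax) at the
                   same positions, obtained from one forward pass over the
                   drafted block.
    """
    if len(target_tokens) != len(draft_tokens) + 1:
        raise ValueError("target_tokens must have one more entry than draft_tokens")

    emitted: List[int] = []
    for drafted, verified in zip(draft_tokens, target_tokens):
        if drafted != verified:
            # First mismatch: keep the target's own token and discard the rest,
            # since later predictions were conditioned on a rejected draft.
            emitted.append(verified)
            return emitted
        emitted.append(drafted)

    # Every draft token matched: the target's extra prediction is kept too.
    emitted.append(target_tokens[-1])
    return emitted


if __name__ == "__main__":
    print(verify_greedy([5, 7, 9], [5, 7, 2, 4]))
    print(verify_greedy([1, 2], [1, 2, 3]))
```

Because every emitted token is one the target model would have chosen itself, the output matches plain greedy decoding of the target; the draft only changes how many tokens are produced per target forward pass. Whether the acceptance length reported on this card counts the final target token is not stated here, so treat the length of `emitted` as one possible convention.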
Training Details
Data
- Dataset: EagleChat subset
- Composition:
  - English: ~200,000 samples
  - Chinese: 200,000 samples
- Total: ~400,000 samples
Procedure
- Epochs: 6
- Sequence length: 4096
- Precision: bf16
- Codebase: SpecForge (DFlash training)
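As a rough sense of scale, the figures above imply the following training budget. This is back-of-the-envelope arithmetic only: the helper `training_budget` is hypothetical, and the token count is an upper bound that assumes every sample fills the full 4096-token sequence length, which real chat samples usually do not.

```python
def training_budget(num_samples: int, epochs: int, max_seq_len: int) -> dict:
    """Compute sample passes and an upper bound on tokens seen during training."""
    sample_passes = num_samples * epochs
    return {
        "sample_passes": sample_passes,
        # Upper bound: assumes each sample is exactly max_seq_len tokens long.
        "max_tokens_seen": sample_passes * max_seq_len,
    }


# ~400K samples, 6 epochs, sequence length 4096 (values from this card).
budget = training_budget(400_000, 6, 4096)

if __name__ == "__main__":
    print(budget)
```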
Evaluation
Benchmark settings
- Target model: `/models/Qwen3-32B`
- Draft model: `sx-aicp/qwen3-32b-dflash-en-zh` (or local path)
- Max new tokens: 2048
- Attention backend: `fa3`
- Tensor parallel (tp_size): 4
- device_sm: 90 (H100)
- drop_first_batch: true
- Concurrencies: 1 / 4 / 32 (varies by suite)
Speed Bench Results
Environment: H100 (SM90), tp=4, attention=fa3, max_new_tokens=2048, drop_first_batch=true.
Unified Summary
| Benchmark | Conc=1 | Conc=4 | Conc=32 |
|---|---|---|---|
| Math500 | 109.20 → 392.63; 3.595× / L=5.564 | 409.44 → 1351.51; 3.301× / L=5.582 | 2554.68 → 4554.81; 1.783× / L=5.588 |
| HumanEval | 108.93 → 331.66; 3.045× / L=4.769 | 407.34 → 1129.16; 2.772× / L=4.756 | 2482.40 → 3632.36; 1.463× / L=4.757 |
| MT-Bench | 109.19 → 233.75; 2.141× / L=3.791 | 409.97 → 804.64; 1.963× / L=3.852 | 2470.75 → 2767.16; 1.120× / L=3.917 |
Cell format: baseline tok/s → DFlash tok/s; speedup× / L, where L is the acceptance length.
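The speedup column can be rechecked directly from the two throughput numbers in each cell. The sketch below transcribes the table and recomputes every ratio; it also derives `L / speedup`, which under a simplified model (the baseline emits one token per decode step, DFlash emits L tokens per verification step) approximates the cost of one draft-plus-verify step in units of baseline steps. That interpretation is an assumption for illustration, not a measurement from the benchmark.

```python
# (baseline tok/s, DFlash tok/s, reported speedup, acceptance length L)
RESULTS = {
    "Math500": {
        1: (109.20, 392.63, 3.595, 5.564),
        4: (409.44, 1351.51, 3.301, 5.582),
        32: (2554.68, 4554.81, 1.783, 5.588),
    },
    "HumanEval": {
        1: (108.93, 331.66, 3.045, 4.769),
        4: (407.34, 1129.16, 2.772, 4.756),
        32: (2482.40, 3632.36, 1.463, 4.757),
    },
    "MT-Bench": {
        1: (109.19, 233.75, 2.141, 3.791),
        4: (409.97, 804.64, 1.963, 3.852),
        32: (2470.75, 2767.16, 1.120, 3.917),
    },
}


def speedup(baseline: float, dflash: float) -> float:
    """Throughput ratio of DFlash over the baseline."""
    return dflash / baseline


def relative_step_cost(accept_len: float, reported_speedup: float) -> float:
    """Approximate cost of one draft+verify step in baseline decode steps.

    Simplified model (an assumption): speedup = L / step_cost.
    """
    return accept_len / reported_speedup


def max_speedup_error(results: dict) -> float:
    """Largest gap between recomputed and reported speedups."""
    return max(
        abs(speedup(base, fast) - reported)
        for per_conc in results.values()
        for (base, fast, reported, _len) in per_conc.values()
    )


if __name__ == "__main__":
    print(f"max |recomputed - reported| = {max_speedup_error(RESULTS):.4f}")
    for bench, per_conc in RESULTS.items():
        for conc, (_b, _f, reported, accept_len) in per_conc.items():
            cost = relative_step_cost(accept_len, reported)
            print(f"{bench:10s} conc={conc:<2d} step cost ~ {cost:.2f}x baseline step")
```

The recomputed ratios agree with the reported speedups to within about 0.001, consistent with rounding. The acceptance length barely changes with concurrency while the speedup drops, so under this model the per-step cost is what grows at higher concurrency.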
How to Evaluate (z-lab / dflash)
```bash
python benchmark_sglang.py \
  --tp-size 4 \
  --target-model /models/Qwen3-32B \
  --draft-model /path/to/draft_model \
  --concurrencies 1,4,32 \
  --dataset-name math500 \
  --attention-backends fa3 \
  --output-md sglang_results.md
```
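If you want to drive the command above from a script, a small argument builder keeps the flags in one place. This is a sketch: `build_benchmark_cmd` is a hypothetical helper, it only uses the flags shown in the command above, and the only dataset name taken from this card is `math500` (names for the other suites are not given here).

```python
import shlex
from typing import List, Sequence


def build_benchmark_cmd(
    target_model: str,
    draft_model: str,
    dataset_name: str = "math500",
    tp_size: int = 4,
    concurrencies: Sequence[int] = (1, 4, 32),
    attention_backend: str = "fa3",
    output_md: str = "sglang_results.md",
) -> List[str]:
    """Assemble the benchmark_sglang.py invocation as an argv list."""
    return [
        "python", "benchmark_sglang.py",
        "--tp-size", str(tp_size),
        "--target-model", target_model,
        "--draft-model", draft_model,
        "--concurrencies", ",".join(str(c) for c in concurrencies),
        "--dataset-name", dataset_name,
        "--attention-backends", attention_backend,
        "--output-md", output_md,
    ]


cmd = build_benchmark_cmd("/models/Qwen3-32B", "/path/to/draft_model")

if __name__ == "__main__":
    # Print the shell-equivalent command instead of running it.
    print(shlex.join(cmd))
```

Passing `cmd` to `subprocess.run(cmd, check=True)` from the directory containing `benchmark_sglang.py` would execute it; printing first makes it easy to confirm the flags before a long run.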