

QwQ-32B - EXL2 4.5bpw

This is a 4.5 bpw (bits per weight) EXL2 quantization of Qwen/QwQ-32B.

Details about the original model can be found on the Qwen/QwQ-32B model page.
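
To give a rough sense of what 4.5 bpw means in practice, the sketch below estimates the size of the quantized weights. The 32e9 parameter count is an assumption taken from the "32B" in the model name, and `estimated_weight_gb` is a hypothetical helper, not part of any library. The result ignores any layers stored at a different precision as well as the KV cache and activations, so treat it as a ballpark figure only.

```python
# Rough weight-size estimate for a quantized model.
# ASSUMED_PARAMS is an assumption based on the "32B" in the model name;
# the exact parameter count may differ.
ASSUMED_PARAMS = 32e9
BITS_PER_WEIGHT = 4.5


def estimated_weight_gb(num_params: float, bits_per_weight: float) -> float:
    """Return the approximate weight size in gigabytes (10^9 bytes)."""
    total_bits = num_params * bits_per_weight
    total_bytes = total_bits / 8
    return total_bytes / 1e9


if __name__ == "__main__":
    size = estimated_weight_gb(ASSUMED_PARAMS, BITS_PER_WEIGHT)
    print(f"Approximate weight size: {size:.1f} GB")
```

Actual memory use at inference time will be higher than this estimate, since it also depends on context length and cache settings.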

Perplexity Scoring

Below are the perplexity scores for the EXL2 models. A lower score is better.

| Quant Level (bpw) | Perplexity Score |
|---|---|
| 8.0 | 6.4393 |
| 7.0 | 6.4452 |
| 6.0 | 6.4693 |
| 5.0 | 6.4732 |
| 4.5 | 6.5417 |
| 4.0 | 6.6190 |
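
To compare quant levels, it can help to express each score as a relative increase over the 8.0 bpw score. The following is a minimal sketch using the values from the table above; `relative_increase` is a hypothetical helper name, and using 8.0 bpw as the baseline is an assumption, since no unquantized score is given here.

```python
# Perplexity scores copied from the table above, keyed by bits per weight.
PERPLEXITY = {
    8.0: 6.4393,
    7.0: 6.4452,
    6.0: 6.4693,
    5.0: 6.4732,
    4.5: 6.5417,
    4.0: 6.6190,
}


def relative_increase(bpw: float, baseline_bpw: float = 8.0) -> float:
    """Percentage increase in perplexity relative to the baseline quant."""
    base = PERPLEXITY[baseline_bpw]
    return (PERPLEXITY[bpw] - base) / base * 100.0


if __name__ == "__main__":
    for bpw in sorted(PERPLEXITY, reverse=True):
        print(f"{bpw} bpw: {PERPLEXITY[bpw]:.4f} "
              f"(+{relative_increase(bpw):.2f}% vs 8.0 bpw)")
```

By this measure, the 4.5 bpw quant scores roughly 1.6% higher than the 8.0 bpw quant.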
Model tree for Dracones/QwQ-32B_exl2_4.5bpw

Base model: Qwen/Qwen2.5-32B
Finetuned: Qwen/QwQ-32B
Quantized: this model