PolarQuant: Optimal Gaussian Weight Quantization via Hadamard Rotation for LLM Compression
Paper • 2603.29078 • Published
How to use caiovicentino1/Qwen3.6-35B-A3B-HLWQ-Q5 with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline
pipe = pipeline("image-text-to-text", model="caiovicentino1/Qwen3.6-35B-A3B-HLWQ-Q5")
messages = [
{
"role": "user",
"content": [
{"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
{"type": "text", "text": "What animal is on the candy?"}
]
},
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText
processor = AutoProcessor.from_pretrained("caiovicentino1/Qwen3.6-35B-A3B-HLWQ-Q5")
model = AutoModelForImageTextToText.from_pretrained("caiovicentino1/Qwen3.6-35B-A3B-HLWQ-Q5")
messages = [
{
"role": "user",
"content": [
{"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
{"type": "text", "text": "What animal is on the candy?"}
]
},
]
inputs = processor.apply_chat_template(
messages,
add_generation_prompt=True,
tokenize=True,
return_dict=True,
return_tensors="pt",
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

How to use caiovicentino1/Qwen3.6-35B-A3B-HLWQ-Q5 with vLLM:
# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "caiovicentino1/Qwen3.6-35B-A3B-HLWQ-Q5"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
-H "Content-Type: application/json" \
--data '{
"model": "caiovicentino1/Qwen3.6-35B-A3B-HLWQ-Q5",
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "Describe this image in one sentence."
},
{
"type": "image_url",
"image_url": {
"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
}
}
]
}
]
}'
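The same request can be sent from Python instead of curl. The following is a minimal sketch using only the standard library; the helper names `build_payload` and `chat` are hypothetical, and it assumes the server returns the usual OpenAI-compatible response shape (`choices[0].message.content`).

```python
import json
import urllib.request

MODEL = "caiovicentino1/Qwen3.6-35B-A3B-HLWQ-Q5"


def build_payload(prompt: str, image_url: str) -> dict:
    """Build the same chat-completions body as the curl example (hypothetical helper)."""
    return {
        "model": MODEL,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }


def chat(payload: dict, base_url: str = "http://localhost:8000") -> str:
    """POST the payload to an OpenAI-compatible server and return the reply text."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    # Assumption: standard OpenAI-compatible response layout.
    return body["choices"][0]["message"]["content"]


# Usage (requires a running server, so it is left commented out):
# payload = build_payload(
#     "Describe this image in one sentence.",
#     "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg",
# )
# print(chat(payload))
```

Passing a different `base_url` (for example port 30000) points the same helper at another OpenAI-compatible server.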
How to use caiovicentino1/Qwen3.6-35B-A3B-HLWQ-Q5 with SGLang:
# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
--model-path "caiovicentino1/Qwen3.6-35B-A3B-HLWQ-Q5" \
--host 0.0.0.0 \
--port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
-H "Content-Type: application/json" \
--data '{
"model": "caiovicentino1/Qwen3.6-35B-A3B-HLWQ-Q5",
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "Describe this image in one sentence."
},
{
"type": "image_url",
"image_url": {
"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
}
}
]
}
]
}'

# Alternatively, run the SGLang server with Docker:
docker run --gpus all \
--shm-size 32g \
-p 30000:30000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HF_TOKEN=<secret>" \
--ipc=host \
lmsysorg/sglang:latest \
python3 -m sglang.launch_server \
--model-path "caiovicentino1/Qwen3.6-35B-A3B-HLWQ-Q5" \
--host 0.0.0.0 \
--port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
-H "Content-Type: application/json" \
--data '{
"model": "caiovicentino1/Qwen3.6-35B-A3B-HLWQ-Q5",
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "Describe this image in one sentence."
},
{
"type": "image_url",
"image_url": {
"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
}
}
]
}
]
}'

How to use caiovicentino1/Qwen3.6-35B-A3B-HLWQ-Q5 with Docker Model Runner:
docker model run hf.co/caiovicentino1/Qwen3.6-35B-A3B-HLWQ-Q5
Hadamard-Lloyd Weight Quantization of Qwen/Qwen3.6-35B-A3B
First HLWQ quantization of a 256-expert hybrid GDN+Attention MoE model
| Metric | Value |
|---|---|
| Weight bits | 5 (HLWQ Q5: Lloyd-Max + Hadamard) |
| polar_state | 21.55 GB (6 shards, 62,190 keys) |
| Coverage | 95.8% of 35.11B params (33.62B quantized) |
| Quantization time | 60 s (PQ5) + 65 s (CT INT4) |
| Architecture | 40L hybrid (30 GDN + 10 Full Attention) |
| Experts | 256/layer, 8 routed + 1 shared |
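As a rough sanity check, the coverage and polar_state size in the table can be reproduced from the parameter counts. This sketch assumes one FP16 norm (2 bytes) per 128-value block and GB meaning 10^9 bytes; neither is stated in the table, so treat the result as an estimate rather than an exact accounting.

```python
# Figures taken from the table above.
quantized_params = 33.62e9
total_params = 35.11e9
bits_per_weight = 5

# Assumptions (not stated in the table): block size 128 as in the HLWQ
# pipeline, and one FP16 norm stored per block.
block_size = 128
norm_bytes = 2

coverage = quantized_params / total_params
code_gb = quantized_params * bits_per_weight / 8 / 1e9
norm_gb = quantized_params / block_size * norm_bytes / 1e9
effective_bits = bits_per_weight + 8 * norm_bytes / block_size

print(f"coverage        {coverage:.1%}")
print(f"packed codes    {code_gb:.2f} GB")
print(f"block norms     {norm_gb:.2f} GB")
print(f"estimated total {code_gb + norm_gb:.2f} GB")
print(f"effective bits  {effective_bits:.3f} per weight")
```

Under these assumptions the estimate comes out at about 21.54 GB, close to the reported 21.55 GB; the small remainder would plausibly be metadata.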
Hardware: RTX PRO 6000 Blackwell (96 GB). The FP16 KV configuration uses the optimized model.generate(); the Q3/Q2 KV configurations use a manual generation loop with PolarQuantKVCache.
| Component | Spec |
|---|---|
| Hidden dim | 2048 |
| Head dim | 256 (full attn) / 128 (GDN) |
| Expert intermediate | 512 |
| Vocab | 248,320 |
| Context | 262,144 tokens |
| Vision | 27-layer ViT (kept BF16) |
| Component | Count | Status |
|---|---|---|
| MoE expert slices | 20,480 | HLWQ Q5 |
| Attention projections | 130 | HLWQ Q5 |
| Shared expert MLPs | 120 | HLWQ Q5 |
| Norms, layernorms | - | BF16 |
| MoE routers | 40 | BF16 |
| GDN gates (in_proj_a/b) | 60 | BF16 (critical) |
| Vision encoder | 27 layers | BF16 |
| MTP layer | 1 | BF16 |
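The counts in this table are consistent with a simple per-layer breakdown. The sketch below reproduces them under assumed multiplicities that the card does not spell out: two stored slices per expert, four projections per full-attention layer, three quantized projections per GDN layer, three matrices per shared expert, and two gates per GDN layer.

```python
# Architecture figures from the card.
layers, gdn_layers, attn_layers = 40, 30, 10
experts_per_layer = 256

# Assumed multiplicities (chosen because they reproduce the table; not confirmed).
slices_per_expert = 2        # e.g. a fused gate/up slice plus a down slice
proj_per_attn_layer = 4      # e.g. q, k, v, o
proj_per_gdn_layer = 3       # quantized GDN projections, excluding in_proj_a/b
mats_per_shared_expert = 3   # e.g. gate, up, down
gates_per_gdn_layer = 2      # in_proj_a and in_proj_b

counts = {
    "MoE expert slices": layers * experts_per_layer * slices_per_expert,
    "Attention projections": attn_layers * proj_per_attn_layer
    + gdn_layers * proj_per_gdn_layer,
    "Shared expert MLPs": layers * mats_per_shared_expert,
    "MoE routers": layers,
    "GDN gates (in_proj_a/b)": gdn_layers * gates_per_gdn_layer,
}

for name, value in counts.items():
    print(f"{name:26s} {value:>6,}")
```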
For inference, use the CT INT4 version:
caiovicentino1/Qwen3.6-35B-A3B-HLWQ-CT-INT4
pip install git+https://github.com/caiovicentino/vllm-expert-offload.git
vllm serve caiovicentino1/Qwen3.6-35B-A3B-HLWQ-CT-INT4 \
--language-model-only --enforce-eager --moe-expert-cache-size 8
| GPU | Expert Cache | VRAM |
|---|---|---|
| RTX PRO 6000 (96 GB) | all-in | ~20 GB |
| RTX 4090 (24 GB) | cache=4 | ~4 GB |
| RTX 3060 (12 GB) | cache=2 | ~3 GB |
HLWQ (Hadamard-Lloyd Weight Quantization):
Weight Matrix W (out × in)
    │
    ▼
[1] Block reshape → (out, n_blocks, 128)
    │
    ▼
[2] Per-block L2 normalize → norms saved
    │
    ▼
[3] Walsh-Hadamard rotation: blocks @ H128 × √128
    │   (uniform information distribution)
    │
    ▼
[4] Lloyd-Max 5-bit quantization (32 centroids, N(0,1))
    │   (optimal MSE for Gaussian values)
    │
    ▼
[5] 5-bit pack: 8 codes → 5 bytes
    │
    ▼
polar_state: __packed, __norms, __meta
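The five steps above can be sketched in NumPy as follows. This is an illustrative sketch, not the polarquant package's implementation: all function names are hypothetical, H128 is taken to be the orthonormal Sylvester-Hadamard matrix, the centroids are approximated by running Lloyd's algorithm on Gaussian samples, and the bit order used for packing is an arbitrary choice.

```python
import numpy as np

BLOCK = 128  # block size from step [1]


def hadamard(n: int) -> np.ndarray:
    """Orthonormal Walsh-Hadamard matrix (Sylvester construction, n a power of 2)."""
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(n)


def lloyd_max_gaussian(levels: int = 32, iters: int = 50) -> np.ndarray:
    """Approximate Lloyd-Max centroids for N(0,1) from samples (assumed procedure)."""
    x = np.sort(np.random.default_rng(0).standard_normal(200_000))
    c = np.quantile(x, (np.arange(levels) + 0.5) / levels)
    for _ in range(iters):
        idx = np.searchsorted((c[:-1] + c[1:]) / 2, x)  # nearest centroid
        c = np.bincount(idx, weights=x, minlength=levels) / np.bincount(idx, minlength=levels)
    return c


def quantize(w: np.ndarray, centroids: np.ndarray):
    out_dim, in_dim = w.shape
    assert in_dim % BLOCK == 0, "sketch assumes in_dim is a multiple of 128"
    blocks = w.reshape(out_dim, in_dim // BLOCK, BLOCK)            # [1]
    norms = np.linalg.norm(blocks, axis=-1, keepdims=True)         # [2]
    unit = blocks / np.maximum(norms, 1e-12)
    rotated = unit @ hadamard(BLOCK) * np.sqrt(BLOCK)              # [3] roughly N(0,1)
    edges = (centroids[:-1] + centroids[1:]) / 2
    codes = np.searchsorted(edges, rotated).astype(np.uint8)       # [4] values 0..31
    return codes, norms


def pack5(codes: np.ndarray) -> np.ndarray:
    """[5] Pack 5-bit codes, 8 codes per 5 bytes (MSB-first; bit order is an assumption)."""
    bits = np.unpackbits(codes.reshape(-1, 1), axis=1)[:, 3:]      # keep low 5 bits
    return np.packbits(bits.reshape(-1))


def unpack5(packed: np.ndarray, n: int) -> np.ndarray:
    bits = np.unpackbits(packed)[: n * 5].reshape(n, 5).astype(np.int64)
    return (bits @ np.array([16, 8, 4, 2, 1])).astype(np.uint8)


def dequantize(codes: np.ndarray, norms: np.ndarray, centroids: np.ndarray) -> np.ndarray:
    rotated = centroids[codes] / np.sqrt(BLOCK)
    unit = rotated @ hadamard(BLOCK).T                             # inverse rotation
    return (unit * norms).reshape(codes.shape[0], -1)


# Round trip on a random matrix.
w = np.random.default_rng(1).standard_normal((64, 256)).astype(np.float32)
centroids = lloyd_max_gaussian()
codes, norms = quantize(w, centroids)
packed = pack5(codes.reshape(-1))
restored = unpack5(packed, codes.size).reshape(codes.shape)
w_hat = dequantize(restored, norms, centroids)
rel_err = np.linalg.norm(w - w_hat) / np.linalg.norm(w)
print(f"packed bytes: {packed.nbytes}, relative error: {rel_err:.4f}")
```

The multiplication by √128 in step [3] rescales each unit-norm block so that its entries have roughly unit variance, which is what lets a single N(0,1) codebook serve every block.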
@misc{hlwq2026,
title={HLWQ: Hadamard-Lloyd Weight Quantization for Large Language Models},
author={Caio Vicentino},
year={2026},
url={https://arxiv.org/abs/2603.29078}
}
| Resource | Link |
|---|---|
| Paper | arXiv:2603.29078 |
| Code | GitHub |
| PyPI | pip install polarquant |
| CT INT4 | Qwen3.6-35B-A3B-HLWQ-CT-INT4 |
| Base model | Qwen/Qwen3.6-35B-A3B |