Naming notice (2026-04-10). The technique used in this model, formerly called "PolarQuant", is being rebranded to HLWQ (Hadamard-Lloyd Weight Quantization). Only the name changes; the algorithm and the weights in this repository are unchanged.
The rebrand resolves a name collision with an unrelated, earlier KV cache quantization method also named PolarQuant (Han et al., arXiv:2502.02617, 2025). HLWQ quantizes weights using a deterministic Walsh-Hadamard rotation and a Lloyd-Max scalar codebook; Han et al.'s PolarQuant quantizes the KV cache using a random polar rotation. The two methods are technically distinct.
Existing loaders that load this repository by ID continue to work without changes. Future model uploads will use the HLWQ name.
Reference paper for this technique: arXiv:2603.29078 (v2 in preparation; v1 still uses the old name).
⚡ HLWQ-Engine v4 -- Qwen3.5-9B
Custom Triton kernel inference for HLWQ -- 34 tok/s with direct quantized computation (no dequantization).
Note: For production use, we recommend HLWQ Q5 + torchao instead (43 tok/s vs 34 tok/s). HLWQ-Engine v4 is a research artifact demonstrating direct quantized inference without dequantization.
🎯 Key Results
| Metric | Value |
|---|---|
| Method | HLWQ-Engine v4 (Triton GEMV) |
| Perplexity (WikiText-2) | 6.89 |
| Throughput | 34.0 tok/s |
| VRAM | 12.2 GB |
| Platform | RTX PRO 6000 (Blackwell) |
| vs v3 | 2.9x speedup (11.8 -> 34.0 tok/s) |
| vs FP16 | 74% speed, 68% VRAM |
📊 Performance Evolution
| Version | tok/s | Speedup | Key Optimization |
|---|---|---|---|
| v1 | ~3 | 1x | Naive Python dequant |
| v2 | ~7 | 2.3x | GPU dequant |
| v3 | 11.8 | 3.9x | Triton GEMV kernel |
| v4 | 34.0 | 11.3x | Matmul FWHT + cache |
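The Speedup column follows directly from the throughput figures. A quick check, taking the approximate v1 and v2 values (~3 and ~7 tok/s) at face value:

```python
# Throughput per version in tok/s, copied from the table above.
# v1 and v2 are approximate in the source ("~3", "~7").
tok_per_s = {"v1": 3.0, "v2": 7.0, "v3": 11.8, "v4": 34.0}

# Speedup of each version relative to v1.
speedup_vs_v1 = {v: round(t / tok_per_s["v1"], 1) for v, t in tok_per_s.items()}

# Step from v3 to v4, as quoted in Key Results.
v3_to_v4 = round(tok_per_s["v4"] / tok_per_s["v3"], 1)

print(speedup_vs_v1)
print(v3_to_v4)
```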
vs Recommended Approach
| Method | tok/s | VRAM | PPL | Approach |
|---|---|---|---|---|
| HLWQ Q5 + torchao | 43.1 | 6.5 GB | 6.56 | Dequant + cuBLAS |
| HLWQ-Engine v4 | 34.0 | 12.2 GB | 6.89 | Direct Triton GEMV |
| FP16 baseline | 45.7 | 17.9 GB | 6.37 | cuBLAS |
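To put the two quantized rows on the same footing, the following sketch expresses each as a percentage of the FP16 baseline, using only the numbers in the table. The variable names are illustrative.

```python
# (tok/s, VRAM in GB) copied from the comparison table above.
runs = {
    "HLWQ Q5 + torchao": (43.1, 6.5),
    "HLWQ-Engine v4": (34.0, 12.2),
}
fp16_tok_s, fp16_vram = 45.7, 17.9

# Percentage of FP16 throughput and FP16 VRAM for each method.
relative = {
    name: (round(100 * tok_s / fp16_tok_s), round(100 * vram / fp16_vram))
    for name, (tok_s, vram) in runs.items()
}

for name, (speed_pct, vram_pct) in relative.items():
    print(f"{name}: {speed_pct}% of FP16 speed, {vram_pct}% of FP16 VRAM")
```

The HLWQ-Engine v4 row reproduces the "74% speed, 68% VRAM" figure from Key Results.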
🔬 Architecture
HLWQ-Engine v4 performs inference directly on quantized weights without dequantization, using custom Triton kernels:
Input Activations
|
v
[FWHT Cache] --> Hadamard Transform (cached per forward pass)
|
v
[Triton GEMV] --> Centroid lookup + accumulate (fused kernel)
|
v
[Scale by Norms] --> Output Activations
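The pipeline above can be sketched in NumPy to show why no weight dequantization is needed: rotating the activation once gives the same result as un-rotating every weight block. This is a minimal illustration, not the Triton kernel. The toy block size of 8 (the model card states 128), the evenly spaced centroid table, and the array layout are all assumptions made for the example.

```python
import numpy as np

def hadamard(n):
    """Sylvester-construction Hadamard matrix (n must be a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

BLOCK = 8                                   # toy value; the model card uses 128
rng = np.random.default_rng(0)
Hn = hadamard(BLOCK) / np.sqrt(BLOCK)       # orthonormal and symmetric

rows, cols = 4, 16
n_blocks = cols // BLOCK
centroids = np.linspace(-2.0, 2.0, 32)      # stand-in for a 5-bit codebook
codes = rng.integers(0, 32, size=(rows, n_blocks, BLOCK))
norms = rng.uniform(0.5, 2.0, size=(rows, n_blocks))
x = rng.standard_normal(cols)

# Reference path: dequantize (lookup, un-rotate, rescale), then dense matvec.
W = norms[..., None] * (centroids[codes] @ Hn)
y_ref = W.reshape(rows, cols) @ x

# Direct path: rotate the activation once per block, then
# centroid lookup + accumulate, then scale by the per-block norms.
x_rot = x.reshape(n_blocks, BLOCK) @ Hn
y = np.einsum("rbk,bk,rb->r", centroids[codes], x_rot, norms)

print(np.allclose(y, y_ref))
```

The two paths agree because the normalized Hadamard matrix is its own inverse, so the rotation can be moved from the weights onto the activation.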
Key Optimizations (v3 -> v4)
| Optimization | Before | After | Speedup |
|---|---|---|---|
| Matmul FWHT | 0.208 ms/call | 0.008 ms/call | 25x |
| FWHT cache | 3x redundant calls | 1x (Q/K/V reuse) | 3x |
| Pre-scaled centroids | Runtime multiply | Baked into table | ~1.1x |
Matmul FWHT: Replaced butterfly-algorithm FWHT with torch.matmul(x, H128) -- cuBLAS is faster than custom code for 128-dim transforms.
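As a sanity check on the equivalence this relies on, the sketch below compares a butterfly FWHT against a plain matrix product with a 128x128 Hadamard matrix. It uses NumPy rather than torch, the helper names `hadamard` and `fwht_butterfly` are illustrative, and it demonstrates correctness only, not the timings in the table.

```python
import numpy as np

def hadamard(n):
    """Sylvester-construction Hadamard matrix (n must be a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

def fwht_butterfly(x):
    """Unnormalized fast Walsh-Hadamard transform over the last axis."""
    y = np.array(x, dtype=np.float64)
    n = y.shape[-1]
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            a = y[..., i:i + h].copy()
            b = y[..., i + h:i + 2 * h].copy()
            y[..., i:i + h] = a + b
            y[..., i + h:i + 2 * h] = a - b
        h *= 2
    return y

H128 = hadamard(128)
x = np.random.default_rng(0).standard_normal((5, 128))

via_butterfly = fwht_butterfly(x)
via_matmul = x @ H128          # same transform as one dense matmul

print(np.allclose(via_butterfly, via_matmul))
```

The butterfly needs fewer arithmetic operations, but at this size a single dense matmul maps onto one well-tuned library call, which is the trade-off the paragraph above describes.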
FWHT Cache: Q, K, V projections in attention share the same input activation. Cache by data_ptr to avoid redundant transforms. Auto-cleared between model.forward() calls via pre-hook.
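A pointer-keyed cache of this kind might look like the sketch below. `FWHTCache` is a hypothetical name, NumPy's `ndarray.ctypes.data` stands in for `Tensor.data_ptr()`, and the explicit `clear()` call stands in for the forward pre-hook; the actual implementation may differ.

```python
import numpy as np

class FWHTCache:
    """Caches transformed activations, keyed by the input buffer's address."""

    def __init__(self, H):
        self.H = H
        self._store = {}
        self.misses = 0

    def transform(self, x):
        # Same buffer address and shape -> assume same activation tensor.
        key = (x.ctypes.data, x.shape)
        cached = self._store.get(key)
        if cached is None:
            self.misses += 1
            cached = x @ self.H
            self._store[key] = cached
        return cached

    def clear(self):
        # Called once per forward pass so stale addresses are never reused.
        self._store.clear()

H4 = np.array([[1, 1, 1, 1],
               [1, -1, 1, -1],
               [1, 1, -1, -1],
               [1, -1, -1, 1]], dtype=float)

cache = FWHTCache(H4)
x = np.arange(8.0).reshape(2, 4)

# Q, K and V projections all receive the same input activation.
q_in = cache.transform(x)
k_in = cache.transform(x)
v_in = cache.transform(x)
misses_after_qkv = cache.misses

cache.clear()                  # start of the next forward pass
cache.transform(x)
misses_after_next_pass = cache.misses
```

Clearing on every forward pass matters because an address only identifies a tensor while that tensor is alive and unmodified; a later buffer can be allocated at the same address with different contents.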
🚀 Usage
```bash
# HLWQ-Engine v4 requires the polarengine-vllm package
pip install polarengine-vllm
```

```python
from polarengine_vllm import HLWQizer
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the quantized model
model = AutoModelForCausalLM.from_pretrained(
    "caiovicentino1/Qwen3.5-9B-HLWQ-Engine-v4",
    dtype="bfloat16", device_map="auto", trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained("caiovicentino1/Qwen3.5-9B-HLWQ-Engine-v4")

output = model.generate(
    **tokenizer("What is machine learning?", return_tensors="pt").to("cuda"),
    max_new_tokens=200
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
🔧 Technical Details
| Component | Details |
|---|---|
| Kernel | Triton GEMV (polar_gemv_kernel) |
| FWHT | Matmul-based, cached per forward pass |
| Centroids | Pre-scaled by 1/sqrt(block_size) |
| Quantization | HLWQ Q5 (5-bit, block_size=128) |
| Storage | int8 codes + fp16 norms + fp32 centroids |
| Remaining gap | 2048x8192 layers 4.5x slower than cuBLAS (SplitK would fix) |
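The Storage row implies a per-weight footprint that can be worked out by hand. The sketch below assumes one fp16 norm per 128-weight block and one unpacked int8 code per weight, as the table suggests. The model card does not break down the 12.2 GB figure, so this is only a rough estimate of the quantized-weight portion.

```python
BLOCK_SIZE = 128       # weights per block (from the table)
CODE_BITS_STORED = 8   # one int8 code per weight
CODE_BITS_USED = 5     # Q5: 2**5 = 32 centroids
NORM_BITS = 16         # one fp16 norm per block (assumed layout)

# Bits per weight as stored, versus a hypothetical tightly packed 5-bit layout.
stored_bits_per_weight = CODE_BITS_STORED + NORM_BITS / BLOCK_SIZE
packed_bits_per_weight = CODE_BITS_USED + NORM_BITS / BLOCK_SIZE

# Size of one fp32 centroid table.
centroid_table_bytes = (2 ** CODE_BITS_USED) * 4

print(stored_bits_per_weight, packed_bits_per_weight, centroid_table_bytes)
```

Under these assumptions, storing each 5-bit code in a full byte costs about 8.1 bits per weight instead of about 5.1, which may account for part of the VRAM gap relative to HLWQ Q5 + torchao.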
Known Limitations
- Slower than torchao: cuBLAS INT4 matmul is highly optimized; custom Triton GEMV cannot yet match it
- Higher VRAM: Stores quantized weights + lookup tables (12.2 GB vs 6.5 GB)
- No SplitK: Large matrices (2048x8192) are bottlenecked without split-K parallel reduction
- Research code: Not recommended for production deployment
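For readers unfamiliar with split-K, the idea is to cut the reduction dimension into slices, compute a partial result per slice, and sum the partials. The NumPy sketch below shows only the arithmetic. On a GPU each slice would typically be handled by separate thread blocks, followed by a reduction step. The function name and the toy shape are illustrative and not taken from the engine.

```python
import numpy as np

def gemv_split_k(W, x, num_splits):
    """Compute W @ x as a sum of partial products over slices of K."""
    K = W.shape[1]
    bounds = np.linspace(0, K, num_splits + 1, dtype=int)
    partials = [
        W[:, lo:hi] @ x[lo:hi]               # independent work per slice
        for lo, hi in zip(bounds[:-1], bounds[1:])
    ]
    return np.sum(partials, axis=0)          # final reduction

rng = np.random.default_rng(0)
W = rng.standard_normal((32, 512))           # toy shape, not 2048x8192
x = rng.standard_normal(512)

y_split = gemv_split_k(W, x, num_splits=4)
print(np.allclose(y_split, W @ x))
```

Splitting the reduction exposes more parallel work when a matrix has a long reduction dimension relative to its number of output rows, at the cost of an extra reduction pass.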
🔗 Links
- 📄 Paper (arXiv) -- HLWQ: Optimal Gaussian Weight Quantization
- 💻 Code (GitHub) -- Full research codebase
- 🔌 vLLM Plugin -- Production inference integration
- 📦 Recommended: HLWQ Q5 -- Faster, smaller, better quality
📖 Citation
@article{vicentino2026polarquant,
title={HLWQ: Optimal Gaussian Weight Quantization via Hadamard Rotation for LLM Compression},
author={Vicentino, Caio},
journal={arXiv preprint arXiv:2603.29078},
year={2026}
}
🙏 Acknowledgements
Built with PyTorch, Triton, and the Qwen team's open-weight models.
