Naming notice (2026-04-10). The "PolarQuant" technique used in this model is being rebranded to HLWQ (Hadamard-Lloyd Weight Quantization). Only the name changes; the algorithm and the weights in this repository are unchanged.

The rebrand resolves a name collision with an unrelated, earlier KV cache quantization method also named PolarQuant (Han et al., arXiv:2502.02617, 2025). HLWQ addresses weight quantization with a deterministic Walsh-Hadamard rotation and a Lloyd-Max scalar codebook; Han et al.'s PolarQuant addresses KV cache quantization with a random polar rotation. The two methods are technically distinct.

Existing loaders that load this repository by ID continue to work without changes. Future model uploads will use the HLWQ name.

Reference paper for this technique: arXiv:2603.29078 (v2 in preparation; v1 still uses the old name).

⚡ HLWQ-Engine v4 -- Qwen3.5-9B

Custom Triton kernel inference for HLWQ -- 34 tok/s with direct quantized computation (no dequantization).

Note: For production use, we recommend HLWQ Q5 + torchao instead (43 tok/s vs 34 tok/s). HLWQ-Engine v4 is a research artifact demonstrating direct quantized inference without dequantization.

🎯 Key Results

Metric	Value
Method	HLWQ-Engine v4 (Triton GEMV)
Perplexity (WikiText-2)	6.89
Throughput	34.0 tok/s
VRAM	12.2 GB
Platform	RTX PRO 6000 (Blackwell)
vs v3	2.9x speedup (11.8 -> 34.0 tok/s)
vs FP16	74% speed, 68% VRAM

📊 Performance Evolution

Version	tok/s	Speedup	Key Optimization
v1	~3	1x	Naive Python dequant
v2	~7	2.3x	GPU dequant
v3	11.8	3.9x	Triton GEMV kernel
v4	34.0	11.3x	Matmul FWHT + cache

vs Recommended Approach

Method	tok/s	VRAM	PPL	Approach
HLWQ Q5 + torchao	43.1	6.5 GB	6.56	Dequant + cuBLAS
HLWQ-Engine v4	34.0	12.2 GB	6.89	Direct Triton GEMV
FP16 baseline	45.7	17.9 GB	6.37	cuBLAS

🔬 Architecture

HLWQ-Engine v4 performs inference directly on quantized weights without dequantization, using custom Triton kernels:

Input Activations
       |
       v
  [FWHT Cache] --> Hadamard Transform (cached per forward pass)
       |
       v
  [Triton GEMV] --> Centroid lookup + accumulate (fused kernel)
       |
       v
  [Scale by Norms] --> Output Activations
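
The data flow in the diagram can be approximated by a small NumPy reference. This is only a sketch under stated assumptions, not the Triton kernel itself: the function names, the uniform placeholder codebook, the orthonormal scaling of the Hadamard matrix, and the one-norm-per-block layout are all illustrative choices rather than details confirmed by the repository.

```python
import numpy as np

B = 128  # block size, as listed in this card

def orthonormal_hadamard(n: int) -> np.ndarray:
    """Sylvester Hadamard matrix scaled by 1/sqrt(n) (assumed normalization)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

H = orthonormal_hadamard(B)
# Placeholder 32-level (5-bit) codebook; a real one would come from Lloyd-Max.
centroids = np.linspace(-3.0, 3.0, 32)

def quantize(W: np.ndarray):
    """Hypothetical encoder: per-block norm, Hadamard rotation, nearest centroid."""
    out, inp = W.shape
    blocks = W.reshape(out, inp // B, B)
    norms = np.linalg.norm(blocks, axis=-1, keepdims=True)
    z = (blocks / norms) @ H * np.sqrt(B)  # rotated entries, roughly unit variance
    codes = np.abs(z[..., None] - centroids).argmin(-1).astype(np.int8)
    return codes, norms[..., 0].astype(np.float16)

def quantized_gemv(codes: np.ndarray, norms: np.ndarray, x: np.ndarray) -> np.ndarray:
    """y = W_hat @ x computed from codes, without rebuilding W_hat."""
    x_rot = x.reshape(-1, B) @ H              # transform the activation once
    table = centroids / np.sqrt(B)            # pre-scaled centroids
    partial = (table[codes] * x_rot).sum(-1)  # centroid lookup + accumulate
    return (partial * norms.astype(np.float64)).sum(-1)  # scale by norms

rng = np.random.default_rng(0)
W = rng.standard_normal((16, 256))
x = rng.standard_normal(256)
codes, norms = quantize(W)
y = quantized_gemv(codes, norms, x)
rel_err = np.linalg.norm(y - W @ x) / np.linalg.norm(W @ x)
```

Because the rotation is orthogonal, the dot product can be taken in the rotated domain: the activation is transformed once and reused for every output row, while the weights stay as integer codes.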

Key Optimizations (v3 -> v4)

Optimization	Before	After	Speedup
Matmul FWHT	0.208 ms/call	0.008 ms/call	25x
FWHT cache	3x redundant calls	1x (Q/K/V reuse)	3x
Pre-scaled centroids	Runtime multiply	Baked into table	~1.1x

Matmul FWHT: Replaced butterfly-algorithm FWHT with torch.matmul(x, H128) -- cuBLAS is faster than custom code for 128-dim transforms.
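
To illustrate why the two approaches are interchangeable, the sketch below builds a 128x128 Sylvester Hadamard matrix and checks that a plain matrix product gives the same result as a butterfly FWHT. It uses NumPy as a stand-in for `torch.matmul`; the helper names are hypothetical, and the transform is left unnormalized (an orthonormal variant would divide by sqrt(128)).

```python
import numpy as np

def hadamard(n: int) -> np.ndarray:
    """Sylvester construction of an n x n Hadamard matrix (n a power of two)."""
    assert n > 0 and n & (n - 1) == 0, "n must be a power of two"
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

def fwht_butterfly(x: np.ndarray) -> np.ndarray:
    """Unnormalized butterfly FWHT along the last axis: log2(n) passes of add/sub."""
    y = np.array(x, dtype=np.float64)  # copies the input
    n = y.shape[-1]
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            a = y[..., i:i + h].copy()
            b = y[..., i + h:i + 2 * h].copy()
            y[..., i:i + h] = a + b
            y[..., i + h:i + 2 * h] = a - b
        h *= 2
    return y

H128 = hadamard(128)
x = np.random.default_rng(0).standard_normal((4, 128))
via_matmul = x @ H128               # one dense matmul, analogous to torch.matmul(x, H128)
via_butterfly = fwht_butterfly(x)   # seven dependent passes for n = 128
```

The butterfly does fewer arithmetic operations, but at n = 128 it is a chain of small dependent steps, whereas the dense product maps onto a single, heavily tuned GEMM call.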

FWHT Cache: The Q, K, and V projections in attention share the same input activation, so its transform only needs to be computed once. The cache is keyed by the input tensor's data_ptr, and a forward pre-hook clears it automatically before each model.forward() call.
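
The caching idea can be sketched without a GPU. The class below is hypothetical and uses NumPy, where the buffer address from `__array_interface__` plays the role of a tensor's `data_ptr()`; in PyTorch, `clear()` would be the kind of function registered through `torch.nn.Module.register_forward_pre_hook`.

```python
import numpy as np

class TransformCache:
    """Memoizes a transform by the memory address and shape of its input."""

    def __init__(self, transform):
        self._transform = transform
        self._store = {}
        self.misses = 0  # number of times the transform actually ran

    def get(self, x: np.ndarray) -> np.ndarray:
        key = (x.__array_interface__["data"][0], x.shape)
        if key not in self._store:
            self._store[key] = self._transform(x)
            self.misses += 1
        return self._store[key]

    def clear(self) -> None:
        """Drop all entries; addresses may be reused by later allocations."""
        self._store.clear()

# 128 x 128 Sylvester Hadamard matrix via repeated Kronecker products
H = np.array([[1.0]])
for _ in range(7):
    H = np.kron(np.array([[1.0, 1.0], [1.0, -1.0]]), H)

cache = TransformCache(lambda v: v @ H)
x = np.random.default_rng(0).standard_normal((1, 128))
q_in = cache.get(x)  # computed
k_in = cache.get(x)  # reused
v_in = cache.get(x)  # reused
```

Keying on an address is cheap but only valid while the buffer is alive and unmodified, which is why the cache has to be emptied at the start of every forward pass.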

🚀 Usage

# HLWQ-Engine v4 requires the polarengine-vllm package. Run in a shell:
pip install polarengine-vllm

# Then, in Python:
from polarengine_vllm import HLWQizer
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "caiovicentino1/Qwen3.5-9B-HLWQ-Engine-v4"

# Load the quantized model; trust_remote_code lets the repository's custom code run
model = AutoModelForCausalLM.from_pretrained(
    model_id, dtype="bfloat16", device_map="auto", trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Tokenize the prompt and move it to the device the model was placed on
inputs = tokenizer("What is machine learning?", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(output[0], skip_special_tokens=True))

🔧 Technical Details

Component	Details
Kernel	Triton GEMV (polar_gemv_kernel)
FWHT	Matmul-based, cached per forward pass
Centroids	Pre-scaled by 1/sqrt(block_size)
Quantization	HLWQ Q5 (5-bit, block_size=128)
Storage	int8 codes + fp16 norms + fp32 centroids
Remaining gap	2048x8192 layers 4.5x slower than cuBLAS (SplitK would fix)
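
For readers unfamiliar with Lloyd-Max codebooks, the following sketch fits a 32-level (5-bit) scalar codebook to standard-normal samples by alternating nearest-centroid assignment and centroid updates. It is an illustrative approximation only: the function names, sample count, quantile initialization, and iteration count are assumptions, and the codebook shipped with this model may have been derived differently.

```python
import numpy as np

def quantize_mse(centroids: np.ndarray, samples: np.ndarray) -> float:
    """Mean squared error when each sample snaps to its nearest centroid."""
    edges = (centroids[:-1] + centroids[1:]) / 2
    idx = np.searchsorted(edges, samples)
    return float(np.mean((samples - centroids[idx]) ** 2))

def lloyd_max(samples: np.ndarray, levels: int = 32, iters: int = 50) -> np.ndarray:
    """Empirical Lloyd-Max iteration for a scalar codebook."""
    # Start from evenly spaced quantiles so every level owns some samples.
    c = np.quantile(samples, (np.arange(levels) + 0.5) / levels)
    for _ in range(iters):
        edges = (c[:-1] + c[1:]) / 2           # decision boundaries are midpoints
        idx = np.searchsorted(edges, samples)  # nearest-centroid assignment
        sums = np.bincount(idx, weights=samples, minlength=levels)
        counts = np.bincount(idx, minlength=levels)
        # Move each centroid to the mean of its samples; keep it if the bin is empty.
        c = np.where(counts > 0, sums / np.maximum(counts, 1), c)
    return c

samples = np.random.default_rng(0).standard_normal(200_000)
init = np.quantile(samples, (np.arange(32) + 0.5) / 32)
codebook = lloyd_max(samples)
mse_init = quantize_mse(init, samples)
mse_final = quantize_mse(codebook, samples)
```

Each iteration can only lower the empirical error, so `mse_final` ends up below `mse_init`; the resulting levels are denser near zero, where most rotated weights fall.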

Known Limitations

Slower than torchao: cuBLAS INT4 matmul is highly optimized; custom Triton GEMV cannot yet match it
Higher VRAM: Stores quantized weights + lookup tables (12.2 GB vs 6.5 GB)
No SplitK: Large matrices (2048x8192) are bottlenecked without split-K parallel reduction
Research code: Not recommended for production deployment
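
The split-K idea mentioned above can be shown in a few lines. In a GEMV, each output element is a long reduction over the input dimension K; split-K divides that reduction into slices that independent workers process in parallel, followed by a final sum. The NumPy sketch below only demonstrates the arithmetic; the function name and the slice count are hypothetical, and a GPU version would combine partial results with atomic adds or a second reduction kernel.

```python
import numpy as np

def gemv_split_k(W: np.ndarray, x: np.ndarray, splits: int = 4) -> np.ndarray:
    """Compute W @ x as a sum of partial products over slices of the K dimension."""
    out, k = W.shape
    assert k % splits == 0, "K must divide evenly into the requested slices"
    chunk = k // splits
    # Each slice stands for one independent worker with its own partial output.
    partials = np.stack([
        W[:, s * chunk:(s + 1) * chunk] @ x[s * chunk:(s + 1) * chunk]
        for s in range(splits)
    ])
    return partials.sum(axis=0)  # final reduction across slices

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 8192))
x = rng.standard_normal(8192)
y_split = gemv_split_k(W, x, splits=8)
```

The result matches the unsplit product up to floating-point rounding; the benefit on a GPU comes from having more independent work items when the output dimension alone is too small to occupy the device.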

🔗 Links

📄 Paper (arXiv) -- HLWQ: Optimal Gaussian Weight Quantization
💻 Code (GitHub) -- Full research codebase
🔌 vLLM Plugin -- Production inference integration
📦 Recommended: HLWQ Q5 -- Faster, smaller, better quality

📖 Citation

@article{vicentino2026polarquant,
  title={HLWQ: Optimal Gaussian Weight Quantization via Hadamard Rotation for LLM Compression},
  author={Vicentino, Caio},
  journal={arXiv preprint arXiv:2603.29078},
  year={2026}
}