Naming notice (2026-04-10). The "PolarQuant" technique used in this model is being rebranded to HLWQ (Hadamard-Lloyd Weight Quantization). Only the name changes; the algorithm and the weights in this repository are unchanged.

The rebrand resolves a name collision with an unrelated, earlier KV cache quantization method named PolarQuant (Han et al., arXiv:2502.02617, 2025). HLWQ quantizes weights using a deterministic Walsh-Hadamard rotation and a Lloyd-Max scalar codebook; Han et al.'s PolarQuant quantizes the KV cache using a random polar rotation. The two methods are technically distinct.

Existing loaders that load this repository by ID continue to work without changes. Future model uploads will use the HLWQ name.

Reference paper for this technique: arXiv:2603.29078 (v2 in preparation; v1 still uses the old name).

⚡ HLWQ-Engine v4 -- Qwen3.5-9B

Custom Triton kernel inference for HLWQ -- 34 tok/s with direct quantized computation (no dequantization).

Note: For production use, we recommend HLWQ Q5 + torchao instead (43 tok/s vs 34 tok/s). HLWQ-Engine v4 is a research artifact demonstrating direct quantized inference without dequantization.


🎯 Key Results

| Metric | Value |
|---|---|
| Method | HLWQ-Engine v4 (Triton GEMV) |
| Perplexity (WikiText-2) | 6.89 |
| Throughput | 34.0 tok/s |
| VRAM | 12.2 GB |
| Platform | RTX PRO 6000 (Blackwell) |
| vs v3 | 2.9x speedup (11.8 -> 34.0 tok/s) |
| vs FP16 | 74% speed, 68% VRAM |

📊 Performance Evolution

[Figure: Speed vs VRAM]

| Version | tok/s | Speedup | Key Optimization |
|---|---|---|---|
| v1 | ~3 | 1x | Naive Python dequant |
| v2 | ~7 | 2.3x | GPU dequant |
| v3 | 11.8 | 3.9x | Triton GEMV kernel |
| v4 | 34.0 | 11.3x | Matmul FWHT + cache |

vs Recommended Approach

| Method | tok/s | VRAM | PPL | Approach |
|---|---|---|---|---|
| HLWQ Q5 + torchao | 43.1 | 6.5 GB | 6.56 | Dequant + cuBLAS |
| HLWQ-Engine v4 | 34.0 | 12.2 GB | 6.89 | Direct Triton GEMV |
| FP16 baseline | 45.7 | 17.9 GB | 6.37 | cuBLAS |

🔬 Architecture

HLWQ-Engine v4 performs inference directly on quantized weights without dequantization, using custom Triton kernels:

Input Activations
       |
       v
  [FWHT Cache] --> Hadamard Transform (cached per forward pass)
       |
       v
  [Triton GEMV] --> Centroid lookup + accumulate (fused kernel)
       |
       v
  [Scale by Norms] --> Output Activations
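
The three stages in the diagram can be sketched in NumPy. This is a reading of the pipeline under stated assumptions, not the Triton kernel itself: the quantization scheme (per-block norm, orthonormal Hadamard rotation, nearest-centroid codes), the function names `quantize`, `direct_matvec`, and `dequant_matvec`, and the evenly spaced placeholder codebook are all illustrative. The point is only that rotating the activations once and accumulating looked-up centroids gives the same result as rebuilding the weights first.

```python
import numpy as np

B = 128  # block size, as listed under Technical Details

def hadamard(n):
    """Sylvester construction of the n x n Hadamard matrix (n a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

Hn = hadamard(B) / np.sqrt(B)  # orthonormal and symmetric: Hn @ Hn == I

def quantize(W, centroids):
    """Assumed scheme: per-block norm, Hadamard rotation, nearest-centroid codes."""
    out_f, in_f = W.shape
    blocks = W.reshape(out_f, in_f // B, B)
    norms = np.linalg.norm(blocks, axis=-1)               # shape (out, n_blocks)
    rotated = (blocks / norms[..., None]) @ Hn            # entries roughly N(0, 1/B)
    dist = np.abs(rotated[..., None] * np.sqrt(B) - centroids)
    return dist.argmin(axis=-1).astype(np.int8), norms

def direct_matvec(codes, norms, scaled_centroids, x):
    """Compute W_hat @ x without rebuilding W_hat."""
    xh = x.reshape(-1, B) @ Hn                            # rotate activations once
    looked_up = scaled_centroids[codes]                   # a fused kernel would keep this in registers
    partial = np.einsum("obj,bj->ob", looked_up, xh)      # centroid lookup + accumulate
    return (norms * partial).sum(axis=1)                  # scale by norms

def dequant_matvec(codes, norms, scaled_centroids, x):
    """Reference path: rebuild W_hat, then multiply."""
    blocks = norms[..., None] * (scaled_centroids[codes] @ Hn)
    return blocks.reshape(codes.shape[0], -1) @ x

rng = np.random.default_rng(0)
W = rng.standard_normal((16, 2 * B))
x = rng.standard_normal(2 * B)
centroids = np.linspace(-3.0, 3.0, 32)   # placeholder 5-bit codebook, NOT a Lloyd-Max codebook
codes, norms = quantize(W, centroids)
scaled = centroids / np.sqrt(B)          # pre-scaled lookup table

y_direct = direct_matvec(codes, norms, scaled, x)
y_ref = dequant_matvec(codes, norms, scaled, x)
```

Because the rotation is orthonormal, moving it from the weights to the activations leaves each dot product unchanged, so `y_direct` and `y_ref` agree up to floating-point rounding.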

Key Optimizations (v3 -> v4)

| Optimization | Before | After | Speedup |
|---|---|---|---|
| Matmul FWHT | 0.208 ms/call | 0.008 ms/call | 25x |
| FWHT cache | 3x redundant calls | 1x (Q/K/V reuse) | 3x |
| Pre-scaled centroids | Runtime multiply | Baked into table | ~1.1x |

Matmul FWHT: Replaced butterfly-algorithm FWHT with torch.matmul(x, H128) -- cuBLAS is faster than custom code for 128-dim transforms.
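
As a rough illustration of why the swap is safe, the sketch below checks that a butterfly FWHT and a dense multiply by the Hadamard matrix give the same result. NumPy stands in for `torch.matmul` here, and `hadamard` and `fwht_butterfly` are illustrative helpers, not functions from the package. No timing is claimed; the speedup in the table comes from the author's GPU measurements.

```python
import numpy as np

def hadamard(n):
    """Sylvester construction of the n x n Hadamard matrix (n a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

def fwht_butterfly(x):
    """Unnormalized fast Walsh-Hadamard transform over the last axis."""
    y = x.copy()
    n = y.shape[-1]
    h = 1
    while h < n:
        # Pair element j with element j + h inside each group of 2h entries.
        y = y.reshape(*x.shape[:-1], n // (2 * h), 2, h)
        a, b = y[..., 0, :], y[..., 1, :]
        y = np.stack((a + b, a - b), axis=-2).reshape(x.shape)
        h *= 2
    return y

H128 = hadamard(128)
x = np.random.default_rng(0).standard_normal((4, 128))

via_butterfly = fwht_butterfly(x)   # log2(128) = 7 dependent passes
via_matmul = x @ H128               # one dense matrix multiply
```

The butterfly does less arithmetic (O(n log n) versus O(n^2) per vector), but at n = 128 the dense form is a single small GEMM, which is the shape of work that vendor BLAS libraries are tuned for.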

FWHT Cache: Q, K, V projections in attention share the same input activation. Cache by data_ptr to avoid redundant transforms. Auto-cleared between model.forward() calls via pre-hook.
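
The caching idea can be sketched as follows. The class name `FWHTCache` and its methods are hypothetical, and NumPy's buffer address stands in for `torch.Tensor.data_ptr()`; in a PyTorch model, `clear` would plausibly be registered through `torch.nn.Module.register_forward_pre_hook`, which matches the pre-hook described above.

```python
import numpy as np

def hadamard(n):
    """Sylvester construction of the n x n Hadamard matrix (n a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

class FWHTCache:
    """Hypothetical per-forward-pass cache keyed by the input buffer address."""

    def __init__(self, H):
        self.H = H
        self._store = {}
        self.transforms = 0  # number of transforms actually computed

    @staticmethod
    def _key(x):
        # NumPy analogue of data_ptr(); the shape guards against differently shaped views.
        return (x.__array_interface__["data"][0], x.shape)

    def transform(self, x):
        key = self._key(x)
        if key not in self._store:
            self._store[key] = x @ self.H
            self.transforms += 1
        return self._store[key]

    def clear(self):
        # Must run before each forward pass: a freed buffer's address can be
        # reused by a different tensor, which would return a stale transform.
        self._store.clear()

cache = FWHTCache(hadamard(128))
x = np.random.default_rng(0).standard_normal((1, 128))

q_in = cache.transform(x)  # computed
k_in = cache.transform(x)  # reused
v_in = cache.transform(x)  # reused
count_after_qkv = cache.transforms

cache.clear()              # start of the next forward pass
cache.transform(x)
count_after_clear = cache.transforms
```

Keying on the address avoids hashing tensor contents, at the cost of being valid only while the input buffer is alive and unmodified, which is why the cache is scoped to a single forward pass.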


🚀 Usage

# HLWQ-Engine v4 requires the polarengine-vllm package
pip install polarengine-vllm

from polarengine_vllm import HLWQizer
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the quantized model
model = AutoModelForCausalLM.from_pretrained(
    "caiovicentino1/Qwen3.5-9B-HLWQ-Engine-v4",
    dtype="bfloat16", device_map="auto", trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained("caiovicentino1/Qwen3.5-9B-HLWQ-Engine-v4")

output = model.generate(
    **tokenizer("What is machine learning?", return_tensors="pt").to("cuda"),
    max_new_tokens=200
)
print(tokenizer.decode(output[0], skip_special_tokens=True))

🔧 Technical Details

| Component | Details |
|---|---|
| Kernel | Triton GEMV (polar_gemv_kernel) |
| FWHT | Matmul-based, cached per forward pass |
| Centroids | Pre-scaled by 1/sqrt(block_size) |
| Quantization | HLWQ Q5 (5-bit, block_size=128) |
| Storage | int8 codes + fp16 norms + fp32 centroids |
| Remaining gap | 2048x8192 layers 4.5x slower than cuBLAS (SplitK would fix) |
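
The storage row implies a per-weight cost that is easy to work out. The sketch below assumes one fp16 norm per 128-weight block and a single 32-entry fp32 centroid table; the helper name `bits_per_weight` is illustrative. It covers quantized linear weights only and is not an attempt to reproduce the 12.2 GB figure, which also includes embeddings, any unquantized layers, and runtime buffers.

```python
BLOCK = 128  # block_size from the table above

def bits_per_weight(code_bits, norm_bits=16, block=BLOCK):
    """Code bits per weight plus the per-block norm amortized over the block."""
    return code_bits + norm_bits / block

stored_bits = bits_per_weight(8)  # 5-bit codes held one per int8 byte
packed_bits = bits_per_weight(5)  # the same codes if they were bit-packed
table_bytes = 2**5 * 4            # 32 fp32 centroids

print(stored_bits, packed_bits, table_bytes)  # 8.125 5.125 128
```

Holding each 5-bit code in a full byte costs about 3 extra bits per weight compared with a packed layout, which is consistent with the higher VRAM use relative to the torchao path.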

Known Limitations

  • Slower than torchao (34.0 vs 43.1 tok/s): cuBLAS INT4 matmul is highly optimized; the custom Triton GEMV cannot yet match it
  • Higher VRAM (12.2 GB vs 6.5 GB): 5-bit codes are stored one per int8 byte, alongside fp16 norms and lookup tables
  • No SplitK: Large matrices (2048x8192) are bottlenecked without split-K parallel reduction
  • Research code: Not recommended for production deployment
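
For readers unfamiliar with split-K, the sketch below shows the idea in NumPy under the assumption that the long dimension (8192 in the 2048x8192 case) is the reduction axis. The function `gemv_split_k` is illustrative and is not part of the engine; a smaller matrix is used so the example runs quickly.

```python
import numpy as np

def gemv_split_k(W, x, splits):
    """Matrix-vector product with the reduction (K) axis cut into slices."""
    K = W.shape[1]
    bounds = np.linspace(0, K, splits + 1, dtype=int)
    # On a GPU each slice would be a separate program instance; here they run in a loop.
    partials = [W[:, lo:hi] @ x[lo:hi] for lo, hi in zip(bounds[:-1], bounds[1:])]
    return np.sum(partials, axis=0)  # final reduction across slices

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 1024))  # stand-in for a 2048x8192 layer
x = rng.standard_normal(1024)

y_split = gemv_split_k(W, x, splits=8)
y_plain = W @ x
```

Splitting the reduction creates more independent work items when there are too few output rows to occupy the GPU. The cost is an extra reduction step, typically done with atomic adds or a second pass.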

📖 Citation

@article{vicentino2026polarquant,
  title={HLWQ: Optimal Gaussian Weight Quantization via Hadamard Rotation for LLM Compression},
  author={Vicentino, Caio},
  journal={arXiv preprint arXiv:2603.29078},
  year={2026}
}

🙏 Acknowledgements

Built with PyTorch, Triton, and the Qwen team's open-weight models.
