🧊 Gemopus-4-26B-A4B-it · HLWQ Q5 + Expert Offload
HLWQ (Hadamard-Lloyd Weight Quantization) full-stack compression of Jackrong/Gemopus-4-26B-A4B-it — Gemma 4 26B-A4B MoE, 128 experts × top-8, 30 layers, head_dim=256, multimodal (vision encoder preserved).
🎯 Headline
| Metric | HLWQ Q5 | FP16 Baseline |
|---|---|---|
| Linear layers | 207 × INT4 | — |
| MoE experts | 7,680 × HLWQ Q5 codes | BF16 |
| Peak VRAM during quant (A100) | 76.9 GB | ~54 GB |
| Download size | 16.6 GB | 51.6 GB |
| KV Cache Q3 memory (200 tok) | 34.4 MB | 49.8 MB |
| Quality check (greedy 250 tok) | ✅ coherent | ✅ coherent |
Generation quality was spot-checked with greedy decoding on three prompts (TCP vs UDP, Python Fibonacci, aurora borealis). All three produced well-structured, technically accurate answers with clean markdown.
📊 Compression
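The download is 3.1× smaller than the BF16 baseline (16.6 GB vs 51.6 GB). The MoE expert weights, which dominate the model, are stored as 5-bit HLWQ codes with per-block norms. Shapes whose input dimension is not a multiple of the INT4 group size (dense mlp.down_proj in=2112, expert down_proj in=704) stay in BF16 rather than being quantized to INT4 with a partial group, which would silently corrupt the weights.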
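The alignment rule for the two shapes named above can be checked with plain arithmetic. This is a minimal sketch, assuming "aligned" means the input dimension divides evenly by the INT4 group size of 128 used elsewhere in this card; the helper name `is_group_aligned` is hypothetical and not part of torchao.

```python
GROUP_SIZE = 128  # INT4 group size quoted in this card


def is_group_aligned(in_features: int, group_size: int = GROUP_SIZE) -> bool:
    """Return True when in_features splits into whole quantization groups."""
    return in_features % group_size == 0


# Shapes named in this card: both leave a partial group behind.
for name, in_features in [("dense mlp.down_proj", 2112), ("expert down_proj", 704)]:
    leftover = in_features % GROUP_SIZE
    print(f"{name}: in={in_features}, aligned={is_group_aligned(in_features)}, leftover={leftover}")

# Download-size ratio from the headline table (BF16 baseline vs HLWQ Q5).
ratio = 51.6 / 16.6
print(f"compression: {ratio:.1f}x")
```

Both shapes leave 64 values that do not fill a 128-wide group, which is consistent with the BF16 fallback described above.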
🔬 What got quantized
| Component | Count | Treatment |
|---|---|---|
| Attention + dense MLP Linear | 177 | INT4 (torchao, group_size=128, Marlin-native) |
| Dense MLP down_proj (in=2112) | 57 | BF16 fallback (non-aligned for group_size=128) |
| MoE expert slices | 7,680 | HLWQ Q5 codes (per-expert 2D, saved as bit-packed codes + norms) |
| Vision encoder | 190 | BF16 (full quality preserved) |
| Router (gate projections) | 30 | BF16 (critical for correct expert selection) |
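To give a feel for what "HLWQ Q5 codes + norms" could look like, here is a pure-Python sketch of the general technique the method name points to: a Walsh-Hadamard rotation followed by Lloyd-Max scalar quantization, with one norm stored per block. It is not the author's implementation. The block size of 128, the codebook fitted on Gaussian samples, and all function names are assumptions made for illustration.

```python
import bisect
import math
import random


def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform; len(x) must be a power of two."""
    y = list(x)
    n = len(y)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = y[j], y[j + h]
                y[j], y[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in y]


def lloyd_max(samples, levels, iters=30):
    """Fit a scalar codebook: alternate nearest-centroid assignment and mean update."""
    data = sorted(samples)
    # Initialise centroids at evenly spaced quantiles of the data.
    cents = [data[(2 * k + 1) * len(data) // (2 * levels)] for k in range(levels)]
    for _ in range(iters):
        bounds = [(a + b) / 2 for a, b in zip(cents, cents[1:])]
        sums, counts = [0.0] * levels, [0] * levels
        for v in data:
            k = bisect.bisect_left(bounds, v)
            sums[k] += v
            counts[k] += 1
        cents = [s / c if c else old for s, c, old in zip(sums, counts, cents)]
    return cents


def quantize_block(block, cents):
    """Return (codes, norm): one integer code per value plus one norm per block."""
    norm = math.sqrt(sum(v * v for v in block)) or 1.0
    n = len(block)
    rotated = fwht([v / norm for v in block])
    bounds = [(a + b) / 2 for a, b in zip(cents, cents[1:])]
    # Unit-norm rotated coordinates have variance about 1/n; rescale before lookup.
    codes = [bisect.bisect_left(bounds, v * math.sqrt(n)) for v in rotated]
    return codes, norm


def dequantize_block(codes, norm, cents):
    n = len(codes)
    rotated = [cents[c] / math.sqrt(n) for c in codes]
    # The orthonormal Walsh-Hadamard transform is its own inverse.
    return [v * norm for v in fwht(rotated)]


rng = random.Random(0)
codebook = lloyd_max([rng.gauss(0.0, 1.0) for _ in range(4096)], levels=32)  # 5 bits
block = [rng.gauss(0.0, 0.02) for _ in range(128)]  # stand-in for one weight block
codes, norm = quantize_block(block, codebook)
restored = dequantize_block(codes, norm, codebook)
err = math.sqrt(sum((a - b) ** 2 for a, b in zip(block, restored)))
print(f"relative reconstruction error: {err / norm:.4f}")
```

The rotation spreads each block's energy across coordinates so that a single shared codebook fits them reasonably well, and the stored norm restores the original scale on dequantization.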
🚀 Consumer-GPU deployment
This checkpoint is stored in HLWQ-codes format. For direct vLLM Marlin deployment on consumer GPUs, use the CompressedTensors sibling:
pip install git+https://github.com/caiovicentino/vllm-expert-offload.git
vllm serve caiovicentino1/Gemopus-4-26B-A4B-it-HLWQ-Q5-CT \
--moe-expert-cache-size 4 \
--enforce-eager \
--language-model-only \
--max-model-len 4096
moe_expert_cache_size tuning per GPU:
| GPU | VRAM | Recommended cache | Expected VRAM usage |
|---|---|---|---|
| RTX 3060 | 12 GB | 2 | ~8 GB |
| RTX 4070 | 16 GB | 3 | ~8.2 GB |
| RTX 4090 | 24 GB | 4 (default) | ~8.5 GB |
| RTX 6000 Ada | 48 GB | 16 | ~11.3 GB |
| A100 | 80 GB | 128 (full GPU) | ~38 GB |
Reference throughput, measured on Qwopus-MoE-35B rather than this checkpoint: 37.4 tok/s on an RTX 4090 with this fork (1.58× faster than all-experts-on-GPU due to better memory locality).
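The per-GPU recommendations above can be turned into a small lookup when scripting deployments. This is a sketch only: the helper names are hypothetical, and rounding a VRAM size down to the nearest listed tier is an assumption, since the table covers only five GPUs.

```python
# (minimum VRAM in GB, recommended cache size), copied from the table above.
CACHE_BY_VRAM = [(80, 128), (48, 16), (24, 4), (16, 3), (12, 2)]

MODEL_ID = "caiovicentino1/Gemopus-4-26B-A4B-it-HLWQ-Q5-CT"


def recommended_cache_size(vram_gb):
    """Pick the cache size of the largest listed tier that fits in vram_gb."""
    for min_vram, cache in CACHE_BY_VRAM:
        if vram_gb >= min_vram:
            return cache
    raise ValueError(f"no recommendation listed for {vram_gb} GB of VRAM")


def serve_command(vram_gb, max_model_len=4096):
    """Build the vllm serve argument list shown earlier in this section."""
    return [
        "vllm", "serve", MODEL_ID,
        "--moe-expert-cache-size", str(recommended_cache_size(vram_gb)),
        "--enforce-eager",
        "--language-model-only",
        "--max-model-len", str(max_model_len),
    ]


print(" ".join(serve_command(24)))
```

The resulting list can be passed to `subprocess.run` once the fork is installed. The `--moe-expert-cache-size` flag comes from the fork, not upstream vLLM.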
🔁 Quick Start (transformers, A100)
from transformers import AutoModelForCausalLM, AutoProcessor

# trust_remote_code=True lets transformers run the custom code shipped with the repo.
model = AutoModelForCausalLM.from_pretrained(
    "caiovicentino1/Gemopus-4-26B-A4B-it-HLWQ-Q5",
    device_map="auto",
    trust_remote_code=True,
)
# The processor (tokenizer + chat template) is loaded from the base model.
processor = AutoProcessor.from_pretrained("Jackrong/Gemopus-4-26B-A4B-it")

messages = [{"role": "user", "content": "What is 2+2?"}]
inputs = processor.apply_chat_template(
    messages, return_tensors="pt", add_generation_prompt=True,
    return_dict=True, tokenize=True,
).to("cuda")

# Greedy decoding; decode only the newly generated tokens, not the prompt.
out = model.generate(**inputs, max_new_tokens=60, do_sample=False)
print(processor.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
# → "Two plus two equals four."
🧪 KV cache (HLWQ Q3, benchmark-only)
This checkpoint ships weights only — the HLWQ Q3 KV cache is a runtime feature implemented in PolarQuantKVCache (see the /polarquant skill source). Asymptotic compression is 5.22× (3 bits/value + norm overhead vs FP16's 16 bits/value), approached as sequence length grows beyond the 128-token BF16 residual window.
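The 5.22× figure can be reproduced with a short calculation. The sketch below assumes one 16-bit norm is stored per 256-value block (matching head_dim=256); the card does not state this layout, but it is consistent with the quoted ratio. The residual-window model is likewise a simplification and is not expected to match measured memory figures exactly.

```python
def asymptotic_ratio(bits=3, block=256, norm_bits=16, baseline_bits=16):
    """Compression vs the 16-bit baseline when every value is quantized."""
    effective_bits = bits + norm_bits / block  # code bits plus amortised norm
    return baseline_bits / effective_bits


def effective_ratio(seq_len, residual=128, **kwargs):
    """Overall ratio when the most recent `residual` tokens stay unquantized."""
    kept = min(seq_len, residual)            # tokens held at full precision
    quantized = max(seq_len - residual, 0)   # tokens stored as Q3 codes
    compressed_cost = kept + quantized / asymptotic_ratio(**kwargs)
    return seq_len / compressed_cost


print(f"asymptotic: {asymptotic_ratio():.2f}x")  # asymptotic: 5.22x
for n in (1024, 8192, 65536):
    print(f"{n} tokens: {effective_ratio(n):.2f}x")
```

Under these assumptions the ratio starts at 1× while the whole sequence fits inside the residual window and climbs toward 5.22× as the quantized portion dominates.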
Note: vLLM manages its own KV cache, so the HLWQ Q3 custom cache is useful in transformers-native inference but does not apply when serving via vLLM. For vLLM deployment, the memory win comes from expert offload + INT4 weights, not the KV cache.
🔧 Files
| File | Size | Purpose |
|---|---|---|
| polar_state.safetensors | 16.61 GB | HLWQ Q5 bit-packed codes + norms + meta for every quantized weight |
| hlwq_config.json | <1 KB | HLWQ metadata (method, bits, block_size, head_dim, stats) |
| config.json | ~10 KB | Base Gemma 4 config + quantization_config (wire: quant_method=polarengine) |
| tokenizer*, processor_config.json, chat_template.jinja | ~10 MB | Tokenizer/processor from base model |
| chart_*.png | ~400 KB | Model card assets |
🏷️ Note on naming
HLWQ replaces the author's earlier "PolarQuant" branding to disambiguate from Han et al. 2025 (arXiv:2502.02617), a distinct KV-cache quantization method published under the same name. The two techniques address different components (weights vs KV cache) and are unrelated.
Internally, quant_method in config.json remains "polarengine" — that is the wire-format string recognized by transformers and vLLM loaders. Brand is HLWQ; wire format is polarengine.
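A quick way to confirm the wire-format string is to read it back from config.json. The sketch below parses a hypothetical excerpt. The nesting of `quant_method` under `quantization_config` follows the Files table, and the real file also carries the full Gemma 4 config.

```python
import json

# Hypothetical excerpt of config.json; only the field of interest is shown.
config_text = '{"quantization_config": {"quant_method": "polarengine"}}'


def wire_quant_method(text):
    """Return the quant_method string from a config.json payload, or None."""
    cfg = json.loads(text)
    return cfg.get("quantization_config", {}).get("quant_method")


method = wire_quant_method(config_text)
print(method)  # polarengine
```

For a downloaded checkpoint, replace `config_text` with the contents of the local config.json.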
📖 Citation
@misc{vicentino2026hlwq,
title = {Hadamard-Lloyd Weight Quantization (HLWQ): Near-Lossless 5-bit PTQ for LLMs via Walsh-Hadamard Rotation and Lloyd-Max Scalar Quantization},
author = {Vicentino, Caio},
year = {2026},
eprint = {2603.29078},
archivePrefix = {arXiv},
note = {Formerly titled "PolarQuant"; v2 retitle pending to avoid collision with Han et al. 2025.}
}
🔗 References
- Paper (HLWQ): arXiv:2603.29078
- Code (HLWQ): github.com/caiovicentino/eoq-quantization
- vLLM expert offload fork: github.com/caiovicentino/vllm-expert-offload
- Base model: Jackrong/Gemopus-4-26B-A4B-it (SFT of google/gemma-4-26B-A4B-it)
- Related (distinct method): Han, Kacham, Karbasi, Mirrokni, Zandieh. "PolarQuant: Quantizing KV caches with polar transformation." arXiv:2502.02617, 2025.
🙏 Acknowledgements
Built on Jackrong's excellent Gemopus SFT of Google's Gemma 4 26B-A4B architecture. Thanks to the vLLM team for the CompressedTensors + Marlin kernels, and to RedHatAI for the reference quantized MoE format (Qwen3-30B-A3B-quantized.w4a16) that this pipeline follows.