🧊 Gemopus-4-26B-A4B-it · HLWQ Q5 + Expert Offload

HLWQ (Hadamard-Lloyd Weight Quantization) full-stack compression of Jackrong/Gemopus-4-26B-A4B-it — Gemma 4 26B-A4B MoE, 128 experts × top-8, 30 layers, head_dim=256, multimodal (vision encoder preserved).

[Figure: Pipeline]

🎯 Headline

| Metric | HLWQ Q5 | FP16 Baseline |
|---|---|---|
| Linear layers | 207 × INT4 | |
| MoE experts | 7,680 × HLWQ Q5 codes | BF16 |
| Peak VRAM during quant (A100) | 76.9 GB | ~54 GB |
| Download size | 16.6 GB | 51.6 GB |
| KV Cache Q3 memory (200 tok) | 34.4 MB | 49.8 MB |
| Quality check (greedy 250 tok) | ✅ coherent | ✅ coherent |

Generation quality was spot-checked on three prompts (TCP vs UDP, Python Fibonacci, aurora borealis). All three produced well-structured, technically accurate answers with clean markdown.

📊 Compression

[Figure: Download size]

The download is 3.1× smaller than the BF16 baseline (16.6 GB vs 51.6 GB). The MoE expert weights, which dominate the model, are stored as 5-bit HLWQ codes with per-block norms. Shapes whose input width is not a multiple of the INT4 group size (dense mlp.down_proj in=2112, expert down_proj in=704) stay in BF16 rather than being forced into INT4 groups, which would silently corrupt them.
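The alignment rule and the size ratio above are simple arithmetic, and a short check makes them concrete. This is a minimal sketch rather than the pipeline's own code: the helper names are hypothetical, and the group size of 128 is the group_size quoted in the layer breakdown further down.

```python
# Sketch only: helper names are hypothetical, not part of the HLWQ pipeline.
GROUP_SIZE = 128  # group_size quoted in the layer breakdown of this card


def is_group_aligned(in_features: int, group_size: int = GROUP_SIZE) -> bool:
    """True when in_features splits into whole quantization groups."""
    return in_features % group_size == 0


def leftover_columns(in_features: int, group_size: int = GROUP_SIZE) -> int:
    """Number of input columns that would not fill a complete group."""
    return in_features % group_size


# The two non-aligned widths named in this section.
for name, width in [("dense mlp.down_proj", 2112), ("expert down_proj", 704)]:
    print(name, width, is_group_aligned(width), leftover_columns(width))

# Download-size ratio from the headline table (BF16 baseline / HLWQ Q5).
size_ratio = 51.6 / 16.6
print(f"{size_ratio:.1f}x smaller")
```

Both widths leave a partial group of 64 columns, which is why they cannot be packed into 128-wide groups without padding or truncation.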

🔬 What got quantized

[Figure: Layer breakdown]

| Component | Count | Treatment |
|---|---|---|
| Attention + dense MLP Linear | 177 | INT4 (torchao, group_size=128, Marlin-native) |
| Dense MLP down_proj (in=2112) | 57 | BF16 fallback (non-aligned for group_size=128) |
| MoE expert slices | 7,680 | HLWQ Q5 codes (per-expert 2D, saved as bit-packed codes + norms) |
| Vision encoder | 190 | BF16 (full quality preserved) |
| Router (gate projections) | 30 | BF16 (critical for correct expert selection) |
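To make "HLWQ Q5 codes + norms" less abstract, the sketch below shows the general idea the method's name points to: normalize a weight block, rotate it with a Walsh-Hadamard matrix, and quantize each rotated value with a Lloyd-Max codebook of 2^5 = 32 levels. It is an illustration under stated assumptions, not the author's implementation: the block size of 128, the codebook trained on synthetic Gaussian samples, and all function names are hypothetical, and the bit-packing step is omitted.

```python
import numpy as np

BITS = 5     # "Q5": 2**5 = 32 reconstruction levels
BLOCK = 128  # hypothetical block size; the real value is stored in hlwq_config.json


def hadamard(n: int) -> np.ndarray:
    """Orthonormal Walsh-Hadamard matrix (Sylvester construction)."""
    assert n > 0 and n & (n - 1) == 0, "n must be a power of two"
    h = np.ones((1, 1))
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(n)


def lloyd_max(samples: np.ndarray, bits: int, iters: int = 30) -> np.ndarray:
    """1-D Lloyd-Max codebook: alternate nearest-level assignment and mean update."""
    levels = 2 ** bits
    codebook = np.quantile(samples, (np.arange(levels) + 0.5) / levels)
    for _ in range(iters):
        edges = (codebook[:-1] + codebook[1:]) / 2  # decision thresholds
        cells = np.searchsorted(edges, samples)
        for k in range(levels):
            members = samples[cells == k]
            if members.size:
                codebook[k] = members.mean()
    return codebook


def encode(block: np.ndarray, H: np.ndarray, codebook: np.ndarray):
    """Return (codes, norm) for one weight block."""
    norm = float(np.linalg.norm(block))
    unit = block / norm if norm > 0.0 else block
    # Rescale so rotated entries have roughly unit variance, matching the codebook.
    rotated = (H @ unit) * np.sqrt(block.size)
    edges = (codebook[:-1] + codebook[1:]) / 2
    codes = np.searchsorted(edges, rotated).astype(np.uint8)
    return codes, norm


def decode(codes: np.ndarray, norm: float, H: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Look up levels, undo the rescaling and rotation, restore the block norm."""
    rotated = codebook[codes] / np.sqrt(codes.size)
    return (H.T @ rotated) * norm


rng = np.random.default_rng(0)
H = hadamard(BLOCK)
codebook = lloyd_max(rng.standard_normal(50_000), BITS)  # synthetic training data
weights = rng.standard_normal(BLOCK) * 0.02               # stand-in for real weights
codes, norm = encode(weights, H, codebook)
restored = decode(codes, norm, H, codebook)
rel_err = np.linalg.norm(weights - restored) / np.linalg.norm(weights)
print(f"max code = {codes.max()}, relative error = {rel_err:.3f}")
```

The rotation is what lets a single Gaussian-shaped codebook serve every block: after normalization and a Hadamard transform, the entries of most weight blocks look approximately Gaussian regardless of their original distribution.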

🚀 Consumer-GPU deployment

[Figure: Consumer GPU VRAM]

This checkpoint is stored in the HLWQ-codes format. For direct vLLM Marlin deployment on consumer GPUs, install the expert-offload fork and serve the CompressedTensors sibling:

pip install git+https://github.com/caiovicentino/vllm-expert-offload.git

vllm serve caiovicentino1/Gemopus-4-26B-A4B-it-HLWQ-Q5-CT \
  --moe-expert-cache-size 4 \
  --enforce-eager \
  --language-model-only \
  --max-model-len 4096
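Once the server started by the command above is running, it can be queried over HTTP. The sketch below assumes the fork keeps upstream vLLM's OpenAI-compatible chat endpoint on the default localhost:8000, which this card does not state explicitly; the helper names are hypothetical, and only the Python standard library is used.

```python
import json
import urllib.request

MODEL_ID = "caiovicentino1/Gemopus-4-26B-A4B-it-HLWQ-Q5-CT"
BASE_URL = "http://localhost:8000"  # assumption: vLLM's default host and port


def build_chat_request(prompt: str, max_tokens: int = 60) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,  # keep well under --max-model-len 4096
        "temperature": 0.0,        # greedy decoding, as in the quality check
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str) -> str:
    """Send the prompt to a running server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=120) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Usage, once `vllm serve` has finished loading the model:
#   print(ask("What is 2+2?"))
```

Because the server is started with --language-model-only, the request carries text messages only.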

moe_expert_cache_size tuning per GPU:

| GPU | VRAM | Recommended cache | Expected VRAM usage |
|---|---|---|---|
| RTX 3060 | 12 GB | 2 | ~8 GB |
| RTX 4070 | 16 GB | 3 | ~8.2 GB |
| RTX 4090 | 24 GB | 4 (default) | ~8.5 GB |
| RTX 6000 Ada | 48 GB | 16 | ~11.3 GB |
| A100 | 80 GB | 128 (full GPU) | ~38 GB |
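The table can be turned into a small lookup when scripting deployments. This is a sketch under assumptions: the function name is hypothetical, GPUs that fall between two listed VRAM sizes are rounded down to the nearest listed tier, and anything below 12 GB is rejected because the table gives no data for it.

```python
# (minimum VRAM in GB, recommended cache size) pairs copied from the table above.
TIERS = [(80, 128), (48, 16), (24, 4), (16, 3), (12, 2)]


def recommend_cache_size(vram_gb: float) -> int:
    """Pick the cache size of the largest listed tier that fits in vram_gb."""
    for min_vram, cache_size in TIERS:
        if vram_gb >= min_vram:
            return cache_size
    raise ValueError(f"no recommendation listed for {vram_gb} GB of VRAM")


def serve_args(vram_gb: float) -> list:
    """Assemble the serve command from this card with a tuned cache size."""
    return [
        "vllm", "serve", "caiovicentino1/Gemopus-4-26B-A4B-it-HLWQ-Q5-CT",
        "--moe-expert-cache-size", str(recommend_cache_size(vram_gb)),
        "--enforce-eager",
        "--language-model-only",
        "--max-model-len", "4096",
    ]


print(" ".join(serve_args(24)))
```

Rounding down is the conservative choice: a smaller cache costs throughput, while an oversized one risks running out of memory.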

Reference throughput (measured on a different model): Qwopus-MoE-35B on an RTX 4090 with this fork reaches 37.4 tok/s, 1.58× faster than keeping all experts on the GPU, a gain attributed to better memory locality.

🔁 Quick Start (transformers, A100)

from transformers import AutoModelForCausalLM, AutoProcessor

model = AutoModelForCausalLM.from_pretrained(
    "caiovicentino1/Gemopus-4-26B-A4B-it-HLWQ-Q5",
    device_map="auto",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained("Jackrong/Gemopus-4-26B-A4B-it")

messages = [{"role": "user", "content": "What is 2+2?"}]
inputs = processor.apply_chat_template(
    messages, return_tensors="pt", add_generation_prompt=True,
    return_dict=True, tokenize=True,
).to(model.device)  # follow wherever device_map="auto" placed the model
out = model.generate(**inputs, max_new_tokens=60, do_sample=False)
print(processor.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
# → "Two plus two equals four."

🧪 KV cache (HLWQ Q3, benchmark-only)

This checkpoint ships weights only. The HLWQ Q3 KV cache is a runtime feature implemented in PolarQuantKVCache (see the /polarquant skill source). Its asymptotic compression is 5.22× (3 bits/value plus norm overhead, versus FP16's 16 bits/value); real sequences approach that figure only as they grow well beyond the 128-token BF16 residual window, which is kept uncompressed.
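The 5.22× figure can be reproduced with a few lines of arithmetic. The sketch below assumes one 16-bit norm per head_dim=256 vector, an assumption chosen because it yields the stated ratio; the card does not spell out the actual norm layout. The effective-ratio function is a simplified model, so measured numbers such as those in the headline table may differ from it.

```python
HEAD_DIM = 256     # head_dim from the model description
CODE_BITS = 3      # "Q3": 3 bits per cached value
NORM_BITS = 16     # assumption: one 16-bit norm per head_dim-sized vector
FP16_BITS = 16     # uncompressed baseline
WINDOW = 128       # most recent tokens kept uncompressed (residual window)


def bits_per_value() -> float:
    """Code bits plus the norm overhead amortized over one vector."""
    return CODE_BITS + NORM_BITS / HEAD_DIM


def asymptotic_ratio() -> float:
    """Compression ratio when the residual window is negligible."""
    return FP16_BITS / bits_per_value()


def effective_ratio(seq_len: int) -> float:
    """Simplified ratio for a finite sequence with an uncompressed window."""
    recent = min(seq_len, WINDOW)
    older = seq_len - recent
    compressed_bits = recent * FP16_BITS + older * bits_per_value()
    return seq_len * FP16_BITS / compressed_bits


print(f"{bits_per_value():.4f} bits/value")   # 3.0625
print(f"{asymptotic_ratio():.2f}x asymptotic")  # 5.22
for n in (128, 1_000, 10_000, 100_000):
    print(n, round(effective_ratio(n), 2))
```

Sequences at or below 128 tokens see no compression at all in this model, which is why the benefit only shows up at long context lengths.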

[Figure: KV cache scaling]

Note: vLLM manages its own KV cache, so the custom HLWQ Q3 cache is useful for transformers-native inference but does not apply when serving via vLLM. For vLLM deployment, the memory win comes from expert offload and INT4 weights, not from the KV cache.

🔧 Files

| File | Size | Purpose |
|---|---|---|
| polar_state.safetensors | 16.61 GB | HLWQ Q5 bit-packed codes + norms + meta for every quantized weight |
| hlwq_config.json | <1 KB | HLWQ metadata (method, bits, block_size, head_dim, stats) |
| config.json | ~10 KB | Base Gemma 4 config + quantization_config (wire: quant_method=polarengine) |
| tokenizer*, processor_config.json, chat_template.jinja | ~10 MB | Tokenizer/processor from base model |
| chart_*.png | ~400 KB | Model card assets |

🏷️ Note on naming

HLWQ replaces the author's earlier "PolarQuant" branding to disambiguate from Han et al. 2025 (arXiv:2502.02617), a distinct KV-cache quantization method published under the same name. The two techniques address different components (weights vs KV cache) and are unrelated.

Internally, quant_method in config.json remains "polarengine" — that is the wire-format string recognized by transformers and vLLM loaders. Brand is HLWQ; wire format is polarengine.
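A quick way to confirm which wire format a downloaded checkpoint declares is to read the field directly. This is a minimal sketch: the helper name is hypothetical, and the inline sample stands in for a real config.json, of which only the quantization_config.quant_method field named above is reproduced.

```python
import json
from typing import Optional

EXPECTED_WIRE_FORMAT = "polarengine"


def wire_format(config: dict) -> Optional[str]:
    """Return quantization_config.quant_method, or None when it is absent."""
    return config.get("quantization_config", {}).get("quant_method")


# Stand-in for the relevant part of config.json; other fields are omitted.
sample = json.loads('{"quantization_config": {"quant_method": "polarengine"}}')
print(wire_format(sample) == EXPECTED_WIRE_FORMAT)

# With a real checkpoint directory:
#   with open("config.json") as f:
#       print(wire_format(json.load(f)))
```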

📖 Citation

@misc{vicentino2026hlwq,
  title   = {Hadamard-Lloyd Weight Quantization (HLWQ): Near-Lossless 5-bit PTQ for LLMs via Walsh-Hadamard Rotation and Lloyd-Max Scalar Quantization},
  author  = {Vicentino, Caio},
  year    = {2026},
  eprint  = {2603.29078},
  archivePrefix = {arXiv},
  note    = {Formerly titled "PolarQuant"; v2 retitle pending to avoid collision with Han et al. 2025.}
}

🙏 Acknowledgements

Built on Jackrong's excellent Gemopus SFT of Google's Gemma 4 26B-A4B architecture. Thanks to the vLLM team for the CompressedTensors + Marlin kernels, and to RedHatAI for the reference quantized MoE format (Qwen3-30B-A3B-quantized.w4a16) that this pipeline follows.
