🧊 Gemopus-4-26B-A4B-it · HLWQ Q5 + Expert Offload

HLWQ (Hadamard-Lloyd Weight Quantization) full-stack compression of Jackrong/Gemopus-4-26B-A4B-it — Gemma 4 26B-A4B MoE, 128 experts × top-8, 30 layers, head_dim=256, multimodal (vision encoder preserved).

🎯 Headline

Metric	HLWQ Q5	BF16 Baseline
Linear layers	207 × INT4	—
MoE experts	7,680 × HLWQ Q5 codes	BF16
Peak VRAM during quant (A100)	76.9 GB	~54 GB
Download size	16.6 GB	51.6 GB
KV Cache Q3 memory (200 tok)	34.4 MB	49.8 MB
Quality check (greedy 250 tok)	✅ coherent	✅ coherent

Generation quality was spot-checked on three prompts (TCP vs UDP, Python Fibonacci, aurora borealis). All three produce well-structured, technically accurate answers with clean markdown.

📊 Compression

The download is 3.1× smaller than the BF16 baseline (16.6 GB vs 51.6 GB). The MoE expert weights, which dominate the model, are stored as 5-bit HLWQ codes with per-block norms. Shapes whose input dimension is not a multiple of group_size=128 (dense mlp.down_proj with in=2112, expert down_proj with in=704) are kept in BF16 rather than risking silent corruption from misaligned INT4 groups.
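The two numeric claims above can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of the quantization pipeline; the helper names `is_group_aligned` and `compression_ratio` are hypothetical and exist only for illustration.

```python
GROUP_SIZE = 128  # INT4 group size quoted in this card


def is_group_aligned(in_features: int, group_size: int = GROUP_SIZE) -> bool:
    """True when the input dimension splits into whole quantization groups."""
    return in_features % group_size == 0


def compression_ratio(baseline_gb: float, quantized_gb: float) -> float:
    """Size ratio between the baseline and the quantized download."""
    return baseline_gb / quantized_gb


# Shapes named in the card as non-aligned for group_size=128.
for name, in_features in [("dense mlp.down_proj", 2112), ("expert down_proj", 704)]:
    groups, leftover = divmod(in_features, GROUP_SIZE)
    print(f"{name}: {groups} full groups + {leftover} leftover columns, "
          f"aligned={is_group_aligned(in_features)}")

# Download sizes from the headline table.
print(f"download-size ratio: {compression_ratio(51.6, 16.6):.2f}x")
```

Both shapes leave a half group of 64 columns, which is why a kernel that assumes whole 128-column groups cannot handle them safely.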

🔬 What got quantized

Component	Count	Treatment
Attention + dense MLP Linear	177	INT4 (torchao, group_size=128, Marlin-native)
Dense MLP down_proj (in=2112)	57	BF16 fallback (non-aligned for group_size=128)
MoE expert slices	7,680	HLWQ Q5 codes (per-expert 2D, saved as bit-packed codes + norms)
Vision encoder	190	BF16 (full quality preserved)
Router (gate projections)	30	BF16 (critical for correct expert selection)
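To give a feel for what "HLWQ Q5 codes + norms" means, here is a minimal NumPy sketch of the three ingredients named in the paper title: per-block norms, a Walsh-Hadamard rotation, and a Lloyd-Max scalar quantizer. It is an illustration under stated assumptions, not the author's implementation: the block size of 128, the sample-based codebook fit, and all function names are assumptions, and bit-packing of the 5-bit codes is omitted.

```python
import numpy as np


def hadamard(n: int) -> np.ndarray:
    """Orthonormal Walsh-Hadamard matrix (Sylvester construction), n a power of two."""
    assert n > 0 and n & (n - 1) == 0, "n must be a power of two"
    h = np.ones((1, 1))
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(n)


def lloyd_max_gaussian(bits: int, iters: int = 50, samples: int = 200_000) -> np.ndarray:
    """Approximate Lloyd-Max centroids for N(0, 1) via 1-D Lloyd iterations on samples."""
    x = np.random.default_rng(0).standard_normal(samples)
    levels = 2 ** bits
    centroids = np.quantile(x, (np.arange(levels) + 0.5) / levels)  # quantile init
    for _ in range(iters):
        edges = (centroids[:-1] + centroids[1:]) / 2   # nearest-centroid boundaries
        idx = np.searchsorted(edges, x)
        sums = np.bincount(idx, weights=x, minlength=levels)
        counts = np.bincount(idx, minlength=levels)
        centroids = np.where(counts > 0, sums / np.maximum(counts, 1), centroids)
    return centroids


def quantize(w: np.ndarray, block: int, centroids: np.ndarray):
    """Return (codes, norms) for a 2D weight whose size is a multiple of `block`."""
    blocks = w.reshape(-1, block)
    norms = np.linalg.norm(blocks, axis=1, keepdims=True)
    unit = blocks / np.maximum(norms, 1e-12)
    # Rotate, then rescale so each coordinate is roughly N(0, 1).
    rotated = unit @ hadamard(block) * np.sqrt(block)
    edges = (centroids[:-1] + centroids[1:]) / 2
    codes = np.searchsorted(edges, rotated).astype(np.uint8)
    return codes, norms


def dequantize(codes, norms, block, centroids, shape):
    """Invert quantize(): look up centroids, undo the rotation, restore the norms."""
    rotated = centroids[codes] / np.sqrt(block)
    unit = rotated @ hadamard(block).T  # orthonormal, so the transpose is the inverse
    return (unit * norms).reshape(shape)


w = np.random.default_rng(1).standard_normal((64, 256)).astype(np.float32)
codebook = lloyd_max_gaussian(bits=5)
codes, norms = quantize(w, 128, codebook)
w_hat = dequantize(codes, norms, 128, codebook, w.shape)
rel_err = np.linalg.norm(w - w_hat) / np.linalg.norm(w)
print(f"relative reconstruction error at 5 bits: {rel_err:.4f}")
```

The rotation is what lets a single Gaussian codebook serve every block: after normalization and an orthonormal Hadamard transform, the coordinates of a block are close to Gaussian regardless of the original weight distribution.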

🚀 Consumer-GPU deployment

This checkpoint is stored in the HLWQ-codes format. For direct vLLM Marlin deployment on consumer GPUs, use the CompressedTensors sibling:

pip install git+https://github.com/caiovicentino/vllm-expert-offload.git

vllm serve caiovicentino1/Gemopus-4-26B-A4B-it-HLWQ-Q5-CT \
  --moe-expert-cache-size 4 \
  --enforce-eager \
  --language-model-only \
  --max-model-len 4096

Suggested --moe-expert-cache-size values per GPU:

GPU	VRAM	Recommended cache	Expected VRAM usage
RTX 3060	12 GB	`2`	~8 GB
RTX 4070	16 GB	`3`	~8.2 GB
RTX 4090	24 GB	`4` (default)	~8.5 GB
RTX 6000 Ada	48 GB	`16`	~11.3 GB
A100	80 GB	`128` (full GPU)	~38 GB
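For scripted deployments, the table can be turned into a small lookup. This is a sketch only: the helper names are hypothetical, and the rule for VRAM sizes that fall between rows (use the largest row that still fits) is an assumption rather than a recommendation from the fork.

```python
# (minimum VRAM in GB, recommended cache size), copied from the table above.
RECOMMENDED_CACHE = [(80, 128), (48, 16), (24, 4), (16, 3), (12, 2)]

MODEL_ID = "caiovicentino1/Gemopus-4-26B-A4B-it-HLWQ-Q5-CT"


def recommended_cache_size(vram_gb: float) -> int:
    """Pick the cache size of the largest table row that fits in `vram_gb`."""
    for min_vram, cache in RECOMMENDED_CACHE:
        if vram_gb >= min_vram:
            return cache
    raise ValueError("the table has no entry for GPUs with less than 12 GB")


def serve_command(vram_gb: float, max_model_len: int = 4096) -> list[str]:
    """Build the `vllm serve` argument list shown above for a given GPU size."""
    return [
        "vllm", "serve", MODEL_ID,
        "--moe-expert-cache-size", str(recommended_cache_size(vram_gb)),
        "--enforce-eager",
        "--language-model-only",
        "--max-model-len", str(max_model_len),
    ]


print(" ".join(serve_command(24)))
```

The resulting list can be passed to `subprocess.run` once the fork from the install step is available on the machine.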

Reference throughput (measured on a different model): Qwopus-MoE-35B on an RTX 4090 with this fork reached 37.4 tok/s, 1.58× faster than keeping all experts on the GPU, which is attributed to better memory locality.

🔁 Quick Start (transformers, A100)

from transformers import AutoModelForCausalLM, AutoProcessor

# trust_remote_code=True allows the repository's custom loading code to run.
model = AutoModelForCausalLM.from_pretrained(
    "caiovicentino1/Gemopus-4-26B-A4B-it-HLWQ-Q5",
    device_map="auto",
    trust_remote_code=True,
)
# The processor is loaded from the base model repository.
processor = AutoProcessor.from_pretrained("Jackrong/Gemopus-4-26B-A4B-it")

messages = [{"role": "user", "content": "What is 2+2?"}]
inputs = processor.apply_chat_template(
    messages, return_tensors="pt", add_generation_prompt=True,
    return_dict=True, tokenize=True,
).to(model.device)  # follow wherever device_map placed the model

# Greedy decoding; slice off the prompt tokens before decoding the reply.
out = model.generate(**inputs, max_new_tokens=60, do_sample=False)
print(processor.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
# → "Two plus two equals four."

🧪 KV cache (HLWQ Q3, benchmark-only)

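This checkpoint ships weights only. The HLWQ Q3 KV cache is a runtime feature implemented in PolarQuantKVCache (see the /polarquant skill source). Asymptotic compression is 5.22× (3 bits/value plus norm overhead, versus 16 bits/value for FP16), approached only as the sequence grows well beyond the 128-token BF16 residual window. At 200 tokens most of the cache still sits inside that window, which is why the headline table shows a much smaller saving (34.4 MB vs 49.8 MB).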
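The 5.22× figure can be reproduced with a simple bit count. The sketch below assumes one 16-bit norm per block of 256 values (matching head_dim=256) and that the most recent 128 tokens stay at 16 bits; both are assumptions chosen because they reproduce the quoted ratio, not a confirmed description of PolarQuantKVCache, and the function names are hypothetical.

```python
FULL_BITS = 16  # bits per cached value in FP16/BF16


def bits_per_value(code_bits: int = 3, norm_bits: int = 16, block: int = 256) -> float:
    """Average storage per value: the code plus the amortized per-block norm."""
    return code_bits + norm_bits / block


def asymptotic_ratio() -> float:
    """Compression ratio when the uncompressed window is negligible."""
    return FULL_BITS / bits_per_value()


def ratio_at(seq_len: int, window: int = 128) -> float:
    """Compression ratio at a given length with `window` tokens left at 16 bits."""
    if seq_len < 1:
        raise ValueError("seq_len must be at least 1")
    uncompressed = min(seq_len, window)
    compressed = seq_len - uncompressed
    total_bits = FULL_BITS * uncompressed + bits_per_value() * compressed
    return FULL_BITS * seq_len / total_bits


print(f"asymptotic: {asymptotic_ratio():.2f}x")  # → asymptotic: 5.22x
for n in (128, 200, 1024, 8192):
    print(f"{n:>5} tokens: {ratio_at(n):.2f}x")
```

This model is deliberately rough. It gives about 1.41× at 200 tokens, in the same range as the measured 49.8 MB / 34.4 MB ≈ 1.45× from the headline table, but it is not meant to reproduce that measurement exactly.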

Note: vLLM manages its own KV cache, so the HLWQ Q3 custom cache is useful in transformers-native inference but does not apply when serving via vLLM. For vLLM deployment, the memory win comes from expert offload and INT4 weights, not the KV cache.

🔧 Files

File	Size	Purpose
`polar_state.safetensors`	16.61 GB	HLWQ Q5 bit-packed codes + norms + meta for every quantized weight
`hlwq_config.json`	<1 KB	HLWQ metadata (method, bits, block_size, head_dim, stats)
`config.json`	~10 KB	Base Gemma 4 config + `quantization_config` (wire: `quant_method=polarengine`)
`tokenizer*`, `processor_config.json`, `chat_template.jinja`	~10 MB	Tokenizer/processor from base model
`chart_*.png`	~400 KB	Model card assets
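Because polar_state.safetensors is a standard safetensors file, its tensor index can be inspected without loading 16.6 GB into memory. The sketch below relies only on the documented safetensors layout (an 8-byte little-endian header length followed by a JSON header); the helper names are hypothetical, and nothing is assumed about the tensor names stored in this checkpoint.

```python
import json
import struct


def read_safetensors_header(path: str) -> dict:
    """Read only the JSON header of a safetensors file (no tensor data is loaded)."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # little-endian uint64
        return json.loads(f.read(header_len).decode("utf-8"))


def bytes_by_dtype(header: dict) -> dict:
    """Sum stored bytes per dtype, skipping the optional __metadata__ entry."""
    totals = {}
    for name, info in header.items():
        if name == "__metadata__":
            continue
        start, end = info["data_offsets"]
        totals[info["dtype"]] = totals.get(info["dtype"], 0) + (end - start)
    return totals
```

Calling `bytes_by_dtype(read_safetensors_header("polar_state.safetensors"))` on a local copy shows how the file size splits between packed codes, norms, and any tensors left at higher precision.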

🏷️ Note on naming

HLWQ replaces the author's earlier "PolarQuant" branding to disambiguate from Han et al. 2025 (arXiv:2502.02617), a distinct KV-cache quantization method published under the same name. The two techniques address different components (weights vs KV cache) and are unrelated.

Internally, quant_method in config.json remains "polarengine" — that is the wire-format string recognized by transformers and vLLM loaders. Brand is HLWQ; wire format is polarengine.

📖 Citation

@misc{vicentino2026hlwq,
  title   = {Hadamard-Lloyd Weight Quantization (HLWQ): Near-Lossless 5-bit PTQ for LLMs via Walsh-Hadamard Rotation and Lloyd-Max Scalar Quantization},
  author  = {Vicentino, Caio},
  year    = {2026},
  eprint  = {2603.29078},
  archivePrefix = {arXiv},
  note    = {Formerly titled "PolarQuant"; v2 retitle pending to avoid collision with Han et al. 2025.}
}

🔗 References

Paper (HLWQ): arXiv:2603.29078
Code (HLWQ): github.com/caiovicentino/eoq-quantization
vLLM expert offload fork: github.com/caiovicentino/vllm-expert-offload
Base model: Jackrong/Gemopus-4-26B-A4B-it — SFT of google/gemma-4-26B-A4B-it
Related (distinct method): Han, Kacham, Karbasi, Mirrokni, Zandieh. "PolarQuant: Quantizing KV caches with polar transformation." arXiv:2502.02617, 2025.

🙏 Acknowledgements

Built on Jackrong's excellent Gemopus SFT of Google's Gemma 4 26B-A4B architecture. Thanks to the vLLM team for the CompressedTensors + Marlin kernels, and to RedHatAI for the reference quantized MoE format (Qwen3-30B-A3B-quantized.w4a16) that this pipeline follows.
