---
license: mit
base_model: zai-org/GLM-5
tags:
- turboquant
- tq3
- weight-compression
- moe
- glm
library_name: transformers
---

# GLM-5.1 TQ3 (3-bit weight compression)

Native TQ3 checkpoint of [zai-org/GLM-5](https://huggingface.co/zai-org/GLM-5) (769B MoE, 40B active).

## Compression

| | BF16 | TQ3 |
|--|--|--|
| **Checkpoint size** | ~1,510 GB | **309 GB** |
| **Compression ratio** | 1x | **4.9x** |

Created with the streaming checkpoint creation in [turboquant-plus-vllm](https://github.com/varjoranta/turboquant-vllm) on a $0.11/hr CPU instance. Total cost: $0.84.

## Status

**Not yet tested on GPU.** This checkpoint was created and uploaded automatically. Quality validation on a multi-GPU setup is pending.

The same code path was validated on GLM-4.7-Flash (30B, same MoE architecture with 64 experts), where it loaded successfully and answered all test prompts correctly using 13.3 GB of GPU memory.

## Architecture

GLM-5.1 uses the `Glm4MoeLiteNaiveMoe` architecture:

- 769B total parameters, 40B active per token
- 256 routed experts, 8 active per token, 1 shared expert
- 78 layers, hidden_size=6144
- Multi-head Latent Attention (MLA)
- First 3 layers are dense (not MoE)
- 200K context window

## How it works

TQ3 uses the WHT rotation + Gaussian Lloyd-Max codebook from [TurboQuant](https://arxiv.org/abs/2504.19874) (ICLR 2026). After a random Walsh-Hadamard rotation, the weight distribution within each group becomes near-Gaussian, so it can be quantized efficiently with 8 centroids (3-bit) per 128-element group. No calibration data is needed.

The checkpoint stores packed 3-bit indices + per-group norms.
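
The scheme described above can be sketched in a few lines of pure Python. This is an illustrative sketch only, not the turboquant-plus-vllm implementation: the function names (`fwht`, `quantize_group`, `pack3`, and so on), the sign-flip randomization, the way the norm is used for scaling, and the bit layout are all assumptions made for the example. The centroids are the standard 8-level Lloyd-Max levels for a unit Gaussian, rounded to three decimals.

```python
import math
import random

GROUP = 128  # group size stated in the text
# Lloyd-Max reconstruction levels for a unit Gaussian at 8 levels (rounded).
CENTROIDS = [-2.152, -1.344, -0.756, -0.245, 0.245, 0.756, 1.344, 2.152]


def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of two)."""
    x = list(x)
    n, h = len(x), 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in x]


def quantize_group(weights, signs):
    """Rotate one group, then return (3-bit centroid indices, group norm)."""
    n = len(weights)
    rotated = fwht([s * w for s, w in zip(signs, weights)])
    norm = math.sqrt(sum(v * v for v in rotated))
    if norm == 0.0:
        return [0] * n, 0.0
    scale = math.sqrt(n) / norm  # rotated entries become roughly N(0, 1)
    indices = [
        min(range(8), key=lambda k: abs(CENTROIDS[k] - v * scale))
        for v in rotated
    ]
    return indices, norm


def dequantize_group(indices, norm, signs):
    """Look up centroids, rescale, and undo the rotation (H and D are self-inverse)."""
    scale = norm / math.sqrt(len(indices))
    rotated = [CENTROIDS[i] * scale for i in indices]
    return [s * v for s, v in zip(signs, fwht(rotated))]


def pack3(indices):
    """Pack 3-bit indices into bytes (little-endian bit order, assumed layout)."""
    bits = 0
    for k, idx in enumerate(indices):
        bits |= idx << (3 * k)
    return bits.to_bytes((3 * len(indices) + 7) // 8, "little")


def unpack3(data, count):
    """Inverse of pack3."""
    bits = int.from_bytes(data, "little")
    return [(bits >> (3 * k)) & 0b111 for k in range(count)]


# Demo on one group of non-Gaussian (uniform) weights.
rng = random.Random(0)
signs = [rng.choice((-1.0, 1.0)) for _ in range(GROUP)]
weights = [rng.uniform(-1.0, 1.0) for _ in range(GROUP)]

indices, norm = quantize_group(weights, signs)
packed = pack3(indices)  # 128 weights * 3 bits = 48 bytes
restored = dequantize_group(unpack3(packed, GROUP), norm, signs)
err = math.sqrt(sum((a - b) ** 2 for a, b in zip(weights, restored)))
print(f"{len(packed)} bytes per group, relative error {err / norm:.3f}")
```

Because the normalized Hadamard matrix and the sign flips are both their own inverses, dequantization reuses the same transform. No data-dependent fitting happens anywhere: the codebook is fixed in advance, which is why no calibration set is needed.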

The loader handles:

- Per-expert 2D → fused 3D regrouping (gate_proj + up_proj → gate_up_proj fusion)
- Router/gate weight decompression in-place
- Meta-device model creation for low-memory loading

## Usage

```bash
pip install "turboquant-plus-vllm @ git+https://github.com/varjoranta/turboquant-vllm.git"
```

```python
from turboquant_vllm import load_tq3_model

# Requires a multi-GPU setup; see the GPU requirements below
model, tokenizer = load_tq3_model("varjosoft/GLM-5.1-Open-TQ3", device="cuda")
```

## GPU requirements for inference

| Setup | Total VRAM | Per-GPU | Cost/hr (Verda) |
|--|--|--|--|
| 8× A100 80GB | 640 GB | 45 GB | $10.32 |
| 4× H200 141GB | 564 GB | 90 GB | $13.56 |
| 2× B300 262GB | 524 GB | 180 GB | $13.98 |

Without TQ3, the BF16 model requires 1,510 GB VRAM (minimum 8× B300 at $55.92/hr).

## Software requirements

- `transformers >= 5.5.0`
- `turboquant-plus-vllm` ([GitHub](https://github.com/varjoranta/turboquant-vllm))
- PyTorch with CUDA

## Comparison with other quantizations

| Method | Size | Calibration | Format | Target |
|--|--|--|--|--|
| **This (TQ3)** | **309 GB (4.9x)** | **None** | Safetensors | GPU serving (vLLM/PyTorch) |
| Unsloth Dynamic 2-bit | 236 GB (6.4x) | 300K+ tokens | GGUF | Local/CPU (llama.cpp) |
| BF16 original | 1,510 GB | N/A | Safetensors | 8× B300+ |

## License

MIT (same as base model). Created by [Varjosoft Oy](https://varjosoft.com).