--- license: other license_name: glm-5 license_link: https://huggingface.co/zai-org/GLM-5.1/blob/main/LICENSE base_model: zai-org/GLM-5.1 tags: - glm - glm-5 - reap - pruning - moe - gguf - llama.cpp - quantized - q4_k_m - unverified - experimental - do-not-use-in-production pipeline_tag: text-generation --- # CRITICAL WARNING: EXPERIMENTAL GGUF EXPORT This repository is an experimental GGUF export of a 40% REAP-pruned `zai-org/GLM-5.1` checkpoint. It is not fully benchmarked or validated. Do not use it for production or make quality claims from it yet. ## What this repo is for This repo is intended to host GGUF artifacts derived from the 40% REAP checkpoint: - BF16 GGUF export - Protected Q4_K_M GGUF export for `llama.cpp`-style serving ## Source checkpoint - Base model: [`zai-org/GLM-5.1`](https://huggingface.co/zai-org/GLM-5.1) - Pruned checkpoint family: [`0xSero/GLM-5.1-444B-A14B-REAP`](https://huggingface.co/0xSero/GLM-5.1-444B-A14B-REAP) - Architecture: `GlmMoeDsaForCausalLM` - Routed experts per layer: `256 -> 154` - Active params per token: `~14B` ## Quantization / protection strategy The protected Q4 export is not a blanket low-bit quantization. Sensitive tensors are kept at higher precision where possible. ### Kept higher precision - Router gate / router bias: F32 - DSA indexer tensors: Q8_0 - Attention tensors: Q8_0 - Shared expert tensors: Q8_0 - Dense-layer MLP tensors: Q8_0 ### Quantized lower precision - Routed MoE expert projection tensors: Q4_K / Q6_K family ## Chat / reasoning notes - The original GLM-5.1 chat template is preserved and embedded in GGUF metadata. - This is a reasoning/chat model; serving stacks must handle GLM-style thinking correctly. - Early serving probes suggest that unrestricted thinking can consume the entire generation budget before a final answer is emitted. Size `max_tokens` accordingly, or disable thinking per request if you need direct outputs. ## Current status - GGUF conversion: complete - Protected Q4 export: complete - Full benchmark suite: still in progress - Public quality verdict: not ready ## Intended usage Research / experimentation only: - llama.cpp serving experiments - GGUF compatibility testing - Quantization behavior analysis for GLM-5.1 MoE + DSA - Comparing protected low-bit exports against BF16 baselines ## Example llama.cpp serving ```bash llama-server \ -m glm51-444b-reap-Q4_K_M-protected-00001-of-00019.gguf \ --jinja \ --reasoning on \ --reasoning-format deepseek ``` If you need direct outputs rather than reasoning-heavy traces, disable thinking at request time in the client payload. ## Related repos - BF16 pruned checkpoint: [`0xSero/GLM-5.1-444B-A14B-REAP`](https://huggingface.co/0xSero/GLM-5.1-444B-A14B-REAP) - 25% sibling: [`0xSero/GLM-5.1-555B-A14B-REAP`](https://huggingface.co/0xSero/GLM-5.1-555B-A14B-REAP) - 50% sibling: [`0xSero/GLM-5.1-367B-A14B-REAP`](https://huggingface.co/0xSero/GLM-5.1-367B-A14B-REAP) ## Citation If you use these artifacts, cite the upstream GLM-5.1 release and the REAP method, and clearly state that this GGUF export is experimental and unverified. --- **Last updated:** 2026-04-14 **Status:** EXPERIMENTAL / UNVERIFIED / GGUF EXPORT ONLY