---
license: other
license_name: glm-5
license_link: https://huggingface.co/zai-org/GLM-5.1/blob/main/LICENSE
base_model:
- zai-org/GLM-5.1
tags:
- broken
- deprecated
- expert-pruning
- gguf
- glm
- moe
- pruning
- reap
library_name: llama-cpp
pipeline_tag: text-generation
base_model_relation: quantized
---

> [!TIP]
> **[Support this work →](https://donate.sybilsolutions.ai)** · [X](https://x.com/0xsero) · [GitHub](https://github.com/0xsero) · [REAP paper](https://arxiv.org/abs/2510.13999) · [Cerebras REAP](https://huggingface.co/collections/cerebras/cerebras-reap)

# GLM-5.1-444B-GGUF

GGUF quantization of [zai-org/GLM-5.1](https://huggingface.co/zai-org/GLM-5.1).

## At a glance

| | |
|---|---|
| Base model | [zai-org/GLM-5.1](https://huggingface.co/zai-org/GLM-5.1) |
| Format | GGUF |
| Total params | **444B** |
| Active / token | 14B |
| Experts / layer | 154 (of 256 in the base model) |
| Layers | not listed |
| Hidden size | not listed |
| Context | not listed |
| On-disk size | 283 GB |

## Which variant should I pick?

| Variant | Format | Link |
|---|---|---|
| `GLM-5.1-444B` | BF16 | [link](https://huggingface.co/0xSero/GLM-5.1-444B) |
| `GLM-5.1-444B-GGUF` **(this)** | GGUF | [link](https://huggingface.co/0xSero/GLM-5.1-444B-GGUF) |
| `GLM-5.1-478B-NVFP4` | NVFP4 | [link](https://huggingface.co/0xSero/GLM-5.1-478B-NVFP4) |
| `GLM-5.1-555B` | BF16 | [link](https://huggingface.co/0xSero/GLM-5.1-555B) |
| `GLM-5.1-555B-GGUF` | GGUF | [link](https://huggingface.co/0xSero/GLM-5.1-555B-GGUF) |
| `GLM-5.1-555B-NVFP4` | NVFP4 | [link](https://huggingface.co/0xSero/GLM-5.1-555B-NVFP4) |
| `GLM-5.1-555B-W4A16` | W4A16 | [link](https://huggingface.co/0xSero/GLM-5.1-555B-W4A16) |

## This model has repetition degeneration. Use the 25% pruned version instead.

**Use this instead:** [0xSero/GLM-5.1-555B-GGUF](https://huggingface.co/0xSero/GLM-5.1-555B-GGUF)

---

## What is wrong with this model?

This is a Q4_K_M GGUF of the **40% expert-pruned** GLM-5.1 (154/256 experts retained).
It suffers from **repetition degeneration**: the model enters infinite loops when generating code, structured output, or any long-form content requiring syntactic templates.

### Measured degeneration rates:

- **29% overall** (13/45 probes degenerate in fuzz testing)
- **40% of code generation** tasks loop (red-black trees, chess engines, regex, B-trees)
- **75% of structured output** tasks loop (comparison tables, API specs, enum lists)
- **18% of Terminal-Bench** probes loop (9/50)
- **30% of SWE-bench Pro** probes loop (12/40)

### Root cause:

Removing 40% of experts (102 per layer) exceeds the model's tolerance for expert pruning. The remaining 154 experts cannot cover the full routing distribution needed for coherent long-form generation. The degeneration compounds over sequence length: short outputs (<512 tokens) work fine, but anything over ~600-1000 words risks entering a repetition loop.

### The fix:

The 25% pruned variant (192/256 experts, 555B) completely eliminates repetition loops while maintaining competitive quality:

- **0/220 benchmark probes** had repetition loops
- Terminal-Bench: 88% proxy pass rate
- SWE-Pro: 66% proxy pass rate

**Use [0xSero/GLM-5.1-555B-GGUF](https://huggingface.co/0xSero/GLM-5.1-555B-GGUF) instead.**

## License & citation

License inherited from the base model.

```bibtex
@misc{lasby2025reap,
  title         = {REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
  author        = {Mike Lasby and Ivan Lazarevich and Nish Sinnadurai and Sean Lie and Yani Ioannou and Vithursan Thangarasa},
  year          = {2025},
  eprint        = {2510.13999},
  archivePrefix = {arXiv}
}
```

## Sponsors

Made possible by **NVIDIA · TNG Technology · Lambda · Prime Intellect · Hot Aisle**.
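
## Appendix: checking outputs for repetition loops

To check your own generations for the kind of repetition loop described under "What is wrong with this model?", a simple heuristic is to look for a short block of words repeated back to back several times. The sketch below is illustrative only and is not the fuzz-testing harness behind the rates reported above. The function names `has_repetition_loop` and `degeneration_rate` and the thresholds `max_period` and `min_repeats` are assumptions chosen for the example.

```python
from typing import Iterable


def has_repetition_loop(text: str, max_period: int = 8, min_repeats: int = 4) -> bool:
    """Return True if some block of 1..max_period words repeats
    back to back at least min_repeats times (hypothetical thresholds)."""
    words = text.split()
    for period in range(1, max_period + 1):
        run = 0  # consecutive positions where a word equals the word `period` back
        for i in range(period, len(words)):
            if words[i] == words[i - period]:
                run += 1
                # A block of `period` words repeated k times gives run == period * (k - 1)
                if run >= period * (min_repeats - 1):
                    return True
            else:
                run = 0
    return False


def degeneration_rate(outputs: Iterable[str]) -> float:
    """Fraction of outputs flagged as looping by has_repetition_loop."""
    outputs = list(outputs)
    if not outputs:
        return 0.0
    flagged = sum(has_repetition_loop(o) for o in outputs)
    return flagged / len(outputs)
```

Running `degeneration_rate` over a batch of completions gives a rough per-task loop rate. Whitespace tokenization and exact word matching are deliberately crude, so near-duplicate loops with small variations will be missed and legitimately repetitive output, such as table rows, may be flagged.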