---
license: other
license_name: glm-5
license_link: https://huggingface.co/zai-org/GLM-5.1/blob/main/LICENSE
base_model:
- zai-org/GLM-5.1
tags:
- broken
- deprecated
- expert-pruning
- gguf
- glm
- moe
- pruning
- reap
library_name: llama-cpp
pipeline_tag: text-generation
base_model_relation: quantized
---

> [!TIP]
> **[Support this work →](https://donate.sybilsolutions.ai)** · [X](https://x.com/0xsero) · [GitHub](https://github.com/0xsero) · [REAP paper](https://arxiv.org/abs/2510.13999) · [Cerebras REAP](https://huggingface.co/collections/cerebras/cerebras-reap)

# GLM-5.1-444B-GGUF

Q4_K_M GGUF quantization of the 40% expert-pruned [zai-org/GLM-5.1](https://huggingface.co/zai-org/GLM-5.1) (154 of 256 experts retained per layer).

## At a glance

| | |
|---|---|
| Base model | [zai-org/GLM-5.1](https://huggingface.co/zai-org/GLM-5.1) |
| Format | GGUF |
| Total params | **444B** |
| Active / token | 14B |
| Experts / layer | 154 of 256 retained |
| Layers | — |
| Hidden size | — |
| Context | — |
| On-disk size | 283 GB |

## Which variant should I pick?

| Variant | Format | Link |
|---|---|---|
| `GLM-5.1-444B` | BF16 | [link](https://huggingface.co/0xSero/GLM-5.1-444B) |
| `GLM-5.1-444B-GGUF` **(this)** | GGUF | [link](https://huggingface.co/0xSero/GLM-5.1-444B-GGUF) |
| `GLM-5.1-478B-NVFP4` | NVFP4 | [link](https://huggingface.co/0xSero/GLM-5.1-478B-NVFP4) |
| `GLM-5.1-555B` | BF16 | [link](https://huggingface.co/0xSero/GLM-5.1-555B) |
| `GLM-5.1-555B-GGUF` | GGUF | [link](https://huggingface.co/0xSero/GLM-5.1-555B-GGUF) |
| `GLM-5.1-555B-NVFP4` | NVFP4 | [link](https://huggingface.co/0xSero/GLM-5.1-555B-NVFP4) |
| `GLM-5.1-555B-W4A16` | W4A16 | [link](https://huggingface.co/0xSero/GLM-5.1-555B-W4A16) |

## This model has repetition degeneration. Use the 25% pruned version instead.

**Use this instead:** [0xSero/GLM-5.1-555B-GGUF](https://huggingface.co/0xSero/GLM-5.1-555B-GGUF)

---

## What is wrong with this model?

This is a Q4_K_M GGUF of the **40% expert-pruned** GLM-5.1 (154/256 experts retained). It suffers from **repetition degeneration**: the model enters infinite loops when generating code, structured output, or any long-form content that relies on syntactic templates.

### Measured degeneration rates
- **29% overall** (13/45 probes degenerate in fuzz testing)
- **40% of code generation** tasks loop (red-black trees, chess engines, regex, B-trees)
- **75% of structured output** tasks loop (comparison tables, API specs, enum lists)
- **18% of Terminal-Bench** probes loop (9/50)
- **30% of SWE-bench Pro** probes loop (12/40)
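
The card does not state how a probe was judged to be looping. As a rough sketch only, one common heuristic is to flag an output when the same run of tokens repeats many times. The function names, the whitespace tokenization, and the `n` and `max_repeats` thresholds below are all assumptions chosen for illustration, not the method used to produce the numbers above.

```python
from collections import Counter


def looks_degenerate(text: str, n: int = 8, max_repeats: int = 4) -> bool:
    """Flag text in which any n-gram of whitespace tokens occurs more than max_repeats times.

    Hypothetical heuristic: n and max_repeats are illustrative, untuned values.
    """
    tokens = text.split()
    if len(tokens) < n:
        return False
    # Count every sliding window of n consecutive tokens.
    counts = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return max(counts.values()) > max_repeats


def degeneration_rate(outputs: list[str]) -> float:
    """Fraction of outputs flagged by looks_degenerate."""
    if not outputs:
        return 0.0
    return sum(looks_degenerate(o) for o in outputs) / len(outputs)


if __name__ == "__main__":
    looping = "def insert(node, key): return node " * 20
    healthy = "The quick brown fox jumps over the lazy dog near the quiet river bank today"
    print(degeneration_rate([looping, healthy]))
```

A detector like this can be run over saved generations to compare variants. It will also flag legitimately repetitive output (for example, long tables with identical rows), so flagged samples are worth inspecting by hand.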

### Root cause
Removing 40% of experts (102 of 256 per layer) exceeds the model's tolerance for expert pruning. The remaining 154 experts cannot cover the full routing distribution needed for coherent long-form generation. The degeneration compounds with sequence length: short outputs (<512 tokens) work fine, but outputs longer than roughly 600-1000 words risk entering a repetition loop.
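
The pruning ratios quoted on this card follow directly from the retained expert counts. The short check below is only arithmetic on the figures stated here (154 and 192 retained out of 256); the helper name `pruned_fraction` is made up for illustration.

```python
TOTAL_EXPERTS = 256  # experts per layer in the base model, as stated on this card


def pruned_fraction(retained: int, total: int = TOTAL_EXPERTS) -> float:
    """Share of experts removed per layer."""
    return (total - retained) / total


if __name__ == "__main__":
    for name, retained in [("444B", 154), ("555B", 192)]:
        removed = TOTAL_EXPERTS - retained
        print(f"{name}: {removed} removed per layer, {pruned_fraction(retained):.1%} pruned")
    # 444B: 102 removed per layer, 39.8% pruned
    # 555B: 64 removed per layer, 25.0% pruned
```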

### The fix
The 25% pruned variant (192/256 experts, 555B) showed no repetition loops in benchmarking while maintaining competitive quality:
- **0/220 benchmark probes** had repetition loops
- Terminal-Bench: 88% proxy pass rate
- SWE-Pro: 66% proxy pass rate

**Use [0xSero/GLM-5.1-555B-GGUF](https://huggingface.co/0xSero/GLM-5.1-555B-GGUF) instead.**

## License & citation
License inherited from the base model.

```bibtex
@misc{lasby2025reap,
  title  = {REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
  author = {Mike Lasby and Ivan Lazarevich and Nish Sinnadurai and Sean Lie and Yani Ioannou and Vithursan Thangarasa},
  year   = {2025}, eprint = {2510.13999}, archivePrefix = {arXiv}
}
```

## Sponsors
Made possible by **NVIDIA · TNG Technology · Lambda · Prime Intellect · Hot Aisle**.