---
license: other
license_name: glm-5
license_link: https://huggingface.co/zai-org/GLM-5.1/blob/main/LICENSE
base_model: zai-org/GLM-5.1
tags:
  - glm
  - glm-5
  - reap
  - pruning
  - moe
  - gguf
  - llama.cpp
  - quantized
  - q4_k_m
  - unverified
  - experimental
  - do-not-use-in-production
pipeline_tag: text-generation
---

# CRITICAL WARNING: EXPERIMENTAL GGUF EXPORT

This repository is an experimental GGUF export of a 40% REAP-pruned `zai-org/GLM-5.1` checkpoint.
It is not fully benchmarked or validated. Do not use it for production or make quality claims from it yet.

## What this repo is for

This repo is intended to host GGUF artifacts derived from the 40% REAP checkpoint:

- BF16 GGUF export
- Protected Q4_K_M GGUF export for `llama.cpp`-style serving

## Source checkpoint

- Base model: [`zai-org/GLM-5.1`](https://huggingface.co/zai-org/GLM-5.1)
- Pruned checkpoint family: [`0xSero/GLM-5.1-444B-A14B-REAP`](https://huggingface.co/0xSero/GLM-5.1-444B-A14B-REAP)
- Architecture: `GlmMoeDsaForCausalLM`
- Routed experts per layer: `256 -> 154`
- Active params per token: `~14B`

## Quantization / protection strategy

The protected Q4 export is not a blanket low-bit quantization. Sensitive tensors are kept at higher precision where possible.

### Kept higher precision

- Router gate / router bias: F32
- DSA indexer tensors: Q8_0
- Attention tensors: Q8_0
- Shared expert tensors: Q8_0
- Dense-layer MLP tensors: Q8_0

### Quantized lower precision

- Routed MoE expert projection tensors: Q4_K / Q6_K family

## Chat / reasoning notes

- The original GLM-5.1 chat template is preserved and embedded in GGUF metadata.
- This is a reasoning/chat model; serving stacks must handle GLM-style thinking correctly.
- Early serving probes suggest that unrestricted thinking can consume the entire generation budget before a final answer is emitted. Size `max_tokens` accordingly, or disable thinking per request if you need direct outputs.

## Current status

- GGUF conversion: complete
- Protected Q4 export: complete
- Full benchmark suite: still in progress
- Public quality verdict: not ready

## Intended usage

Research / experimentation only:

- llama.cpp serving experiments
- GGUF compatibility testing
- Quantization behavior analysis for GLM-5.1 MoE + DSA
- Comparing protected low-bit exports against BF16 baselines

## Example llama.cpp serving

```bash
llama-server \
  -m glm51-444b-reap-Q4_K_M-protected-00001-of-00019.gguf \
  --jinja \
  --reasoning on \
  --reasoning-format deepseek
```

If you need direct outputs rather than reasoning-heavy traces, disable thinking at request time in the client payload.

## Related repos

- BF16 pruned checkpoint: [`0xSero/GLM-5.1-444B-A14B-REAP`](https://huggingface.co/0xSero/GLM-5.1-444B-A14B-REAP)
- 25% sibling: [`0xSero/GLM-5.1-555B-A14B-REAP`](https://huggingface.co/0xSero/GLM-5.1-555B-A14B-REAP)
- 50% sibling: [`0xSero/GLM-5.1-367B-A14B-REAP`](https://huggingface.co/0xSero/GLM-5.1-367B-A14B-REAP)

## Citation

If you use these artifacts, cite the upstream GLM-5.1 release and the REAP method, and clearly state that this GGUF export is experimental and unverified.

---

**Last updated:** 2026-04-14
**Status:** EXPERIMENTAL / UNVERIFIED / GGUF EXPORT ONLY