
GLM-5.1-444B-GGUF

Q4_K_M GGUF quantization of the 40% expert-pruned variant of zai-org/GLM-5.1 (154 of 256 experts retained).

At a glance

Base model: zai-org/GLM-5.1
Format: GGUF
Total params: 444B
Active / token: 14B
Experts / layer: 154 of 256 retained
Layers: not specified
Hidden size: not specified
Context: not specified
On-disk size: 283 GB
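
As a sanity check on the figures above, the on-disk size and parameter count imply an average storage cost per weight. The sketch below assumes that "GB" means 10^9 bytes and that the whole file consists of weights (metadata overhead is ignored); neither assumption is stated in this card.

```python
# Rough bits-per-weight estimate from the figures listed above.
# Assumptions (not stated in the card): 1 GB = 10**9 bytes, and the
# entire file is weight data.
ON_DISK_BYTES = 283 * 10**9
TOTAL_PARAMS = 444 * 10**9


def bits_per_weight(size_bytes: int, n_params: int) -> float:
    """Average number of stored bits per parameter."""
    return size_bytes * 8 / n_params


bpw = bits_per_weight(ON_DISK_BYTES, TOTAL_PARAMS)
print(f"{bpw:.2f} bits per weight")
```

Under these assumptions the result is roughly 5 bits per weight, which is in the range expected for a 4-bit K-quant that keeps some tensors at higher precision.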

Which variant should I pick?

Variant | Format | Link
GLM-5.1-444B | BF16 | link
GLM-5.1-444B-GGUF (this) | GGUF | link
GLM-5.1-478B-NVFP4 | NVFP4 | link
GLM-5.1-555B | BF16 | link
GLM-5.1-555B-GGUF | GGUF | link
GLM-5.1-555B-NVFP4 | NVFP4 | link
GLM-5.1-555B-W4A16 | W4A16 | link

This model has repetition degeneration. Use the 25% pruned version instead.

Use this instead: 0xSero/GLM-5.1-555B-GGUF


What is wrong with this model?

This is a Q4_K_M GGUF of the 40% expert-pruned GLM-5.1 (154/256 experts retained). It suffers from repetition degeneration: the model enters infinite loops when generating code, structured output, or any long-form content that requires syntactic templates.

Measured degeneration rates:

  • 29% overall (13/45 probes degenerate in fuzz testing)
  • 40% of code generation tasks loop (red-black trees, chess engines, regex, B-trees)
  • 75% of structured output tasks loop (comparison tables, API specs, enum lists)
  • 18% of Terminal-Bench probes loop (9/50)
  • 30% of SWE-bench Pro probes loop (12/40)
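
This card does not say how a probe was classified as degenerate. One hypothetical way to flag a repetition loop is to count how often the same window of words recurs in an output. The function names, the window length `n`, and the `max_repeats` threshold below are illustrative assumptions, not the method used to produce the numbers above.

```python
from collections import Counter


def looks_degenerate(text: str, n: int = 8, max_repeats: int = 4) -> bool:
    """Flag text in which any n-word window occurs more than max_repeats times.

    Hypothetical heuristic: n and max_repeats are illustrative defaults.
    """
    words = text.split()
    if len(words) < n:
        return False
    windows = Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return max(windows.values()) > max_repeats


def degeneration_rate(outputs: list[str]) -> float:
    """Fraction of outputs flagged by looks_degenerate."""
    if not outputs:
        return 0.0
    return sum(looks_degenerate(o) for o in outputs) / len(outputs)
```

A heuristic like this can misfire on legitimately repetitive output (for example, large tables), so the thresholds would need tuning against manually inspected samples.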

Root cause:

Removing 40% of experts (102 of 256 per layer) exceeds the model's tolerance for expert pruning. The remaining 154 experts cannot cover the full routing distribution needed for coherent long-form generation. The degeneration compounds with sequence length: short outputs (under 512 tokens) work fine, but anything longer than roughly 600 to 1000 words risks entering a repetition loop.
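
The pruning ratios quoted in this card follow directly from the expert counts. The short sketch below recomputes them for this model (154 of 256 retained) and for the 25% pruned variant (192 of 256 retained); the helper name `pruning_summary` is illustrative.

```python
TOTAL_EXPERTS = 256  # experts per layer before pruning, as stated in this card


def pruning_summary(retained: int, total: int = TOTAL_EXPERTS) -> dict:
    """Return retained and removed expert counts plus the pruned fraction."""
    removed = total - retained
    return {
        "retained": retained,
        "removed": removed,
        "pruned_fraction": removed / total,
    }


this_model = pruning_summary(154)   # the 444B variant described here
recommended = pruning_summary(192)  # the 555B variant

print(this_model)
print(recommended)
```

The 444B variant removes 102 experts per layer (just under 40%), while the 555B variant removes 64 (exactly 25%).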

The fix:

The 25% pruned variant (192/256 experts retained, 555B parameters) eliminated repetition loops in testing while maintaining competitive quality:

  • 0/220 benchmark probes had repetition loops
  • Terminal-Bench: 88% proxy pass rate
  • SWE-bench Pro: 66% proxy pass rate

Use 0xSero/GLM-5.1-555B-GGUF instead.
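
A minimal sketch of fetching the recommended repository with the `huggingface_hub` library is shown below. The file layout of that repository is not described in this card, so the `*.gguf` pattern is an assumption, and `fetch_recommended` is an illustrative helper name.

```python
REPO_ID = "0xSero/GLM-5.1-555B-GGUF"  # repository recommended by this card


def fetch_recommended(local_dir: str, patterns=("*.gguf",)) -> str:
    """Download files matching `patterns` from REPO_ID; return the local path.

    Assumption: the weights are stored as *.gguf files. Adjust `patterns`
    after inspecting the repository's file list.
    """
    # Imported lazily so this module loads even without huggingface_hub.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=REPO_ID,
        allow_patterns=list(patterns),
        local_dir=local_dir,
    )
```

Calling `fetch_recommended("models/glm-5.1-555b")` would start the download; expect a transfer of several hundred gigabytes and check free disk space first.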

License & citation

License inherited from the base model.

@misc{lasby2025reap,
  title  = {REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
  author = {Mike Lasby and Ivan Lazarevich and Nish Sinnadurai and Sean Lie and Yani Ioannou and Vithursan Thangarasa},
  year   = {2025},
  eprint = {2510.13999},
  archivePrefix = {arXiv}
}

Sponsors

Made possible by NVIDIA · TNG Technology · Lambda · Prime Intellect · Hot Aisle.

GGUF metadata: model size 455B params, architecture glm-dsa, 4-bit quantization.
