---
license: mit
base_model: cerebras/GLM-4.7-Flash-REAP-23B-A3B
tags:
- gguf
- rocmfpx
- deepseek2
- glm
- moe
- rocm
- rdna4
- strix-lean
- quantization
- llama-cpp
base_model_relation: quantized
quantized_by: maczzzzzz (via charlie12345/ROCmFPX)
---

# GLM-4.7-Flash-REAP-23B-A3B ROCmFPX STRIX_LEAN — GGUF

**ROCmFPX `Q4_0_ROCMFP4_STRIX_LEAN` quant of [`cerebras/GLM-4.7-Flash-REAP-23B-A3B`](https://huggingface.co/cerebras/GLM-4.7-Flash-REAP-23B-A3B) (GLM-4.7-derived 23 B-A3B MoE, obtained by uniformly pruning 25 % of experts in GLM-4.7-Flash using the REAP method).**

Built with [charlie12345/ROCmFPX](https://github.com/charlie12345/ROCmFPX) on a Radeon RX 9060 XT 16 GB (gfx1200), ROCm 7.2.3, NixOS 25.11. Quantized 2026-06-27 with build commit `11d76c2`.

| File | Size | Quant | BPW |
|---|---|---|---|
| `GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN.gguf` | 12 GB | `Q4_0_ROCMFP4_STRIX_LEAN` (4-bit ROCmFP4 + Strix K/V + Q5_K embed) | 4.38 |

This is **not** a stock llama.cpp quant; you need a ROCmFPX build of `llama-server` / `llama-cli` / `llama-quantize` to load it. Stock llama.cpp will reject the file with `unknown quantization`.

## Scope of these benchmarks — read this first

**These numbers are a light baseline, not a thorough ROCmFPX evaluation.** The mesh's bench framework is built for production agent workload regression-detection on the local stack, not for the kind of multi-axis sweep that upstream quant maintainers typically publish. Specifically:

- **Harness scope is bounded.** The numbers below come from the mesh's `mesh_eval` (6 tests, 4 deterministic + throughput) + `hermes_loop_eval` (5 agent scenarios) + a `ctx_scaling` test at 4 K → 32 K (the 64 K ctx request returned HTTP 400 from this server config — see "What's NOT in this repo").
- **Sample sizes are small.** Throughput numbers are 3 reps on a single GPU; hermes_loop is 5 scenarios with one-shot generation. None are powered for statistical significance on a per-token level.
- **No perplexity / wikitext / MMLU / GSM8K.** The mesh's stack isn't a quality benchmark — those are upstream ROCmFPX's territory.
- **Single GPU class.** All measurements are on a 16 GB RDNA4 (RX 9060 XT, gfx1200). No Strix unified-memory, no CDNA, no multi-GPU, no Vulkan, no CUDA. Cross-hardware generalization is **not** implied.
- **No human eval.** "Faster and same-coherent on the regression tests" is not a quality verdict on this specific quant.
- **Heaviest model in the mesh.** GLM REAP 23B at 12 GB is the biggest single-model quant the mesh can serve. On smaller GPUs (<12 GB VRAM), this file will not fit. The 16 GB card runs it with ~3 GB headroom.

**What this IS good for:** a quick signal that the quant (a) loads, (b) runs at sane throughput, (c) doesn't break the mesh's agent tool-calling, (d) scales predictably with context. **What this is NOT good for:** claiming "this is the best quant of this model," reproducing academic benchmark results, or substituting for upstream's validation work.

For a rigorous view, the parent repo [`cerebras/GLM-4.7-Flash-REAP-23B-A3B`](https://huggingface.co/cerebras/GLM-4.7-Flash-REAP-23B-A3B), the upstream [`zai-org/GLM-4.7-Flash`](https://huggingface.co/zai-org/GLM-4.7-Flash), and the model's stock GGUF variants (e.g. on `unsloth/`) are the place to look.

## What we measured

**Hardware:** Node B, AMD Ryzen 9 5900XT 16-core, Radeon RX 9060 XT 16 GB (gfx1200), ROCm 7.2.3, NixOS 25.11
**Software:** [charlie12345/ROCmFPX](https://github.com/charlie12345/ROCmFPX) `main` @ `11d76c2`
**Source GGUF:** `GLM-4.7-Flash-REAP-23B-A3B-BF16.gguf` (BF16, 43 GB) — the Unsloth-distributed GGUF of the Cerebras-pruned safetensors
**Same-stack comparison:** `Q3_0_ROCMFPX` (3-bit ROCmFPX experimental, 12 GB file) on the same source

### Agent-loop throughput — STRIX_LEAN vs Q3_0_ROCMFPX (hermes_loop, same harness, same source)

| Scenario | STRIX_LEAN (t/s) | Q3_0_ROCMFPX (t/s) | Δ |
|---|---|---|---|
| `single` (one tool call) | 38.5 | 23.1 | **+67 %** |
| `chained` (calc → use result) | 35.8 | 24.4 | +47 % |
| `multi_step` (compare 2 cities) | 50.8 | 37.7 | +35 % |
| `search` (web search + extract) | 46.8 | 32.5 | +44 % |
| `error_recovery` (file not found) | 48.9 | 34.5 | +42 % |
| **Mean** | **44.2** | **30.4** | **+45 %** |

Both quants pass all 5 scenarios. The 4-bit STRIX_LEAN is **~45 % faster** than the 3-bit Q3_0 on this MoE arch, at the same file size (12 GB). This is the headline finding for this model.

### mesh_eval (raw JSON: `raw-mesh-eval-glm-reap-23b-strix-lean.json`)

| Test | Result |
|---|---|
| `gibberish` | OK |
| `thinking_leak` | CLEAN |
| `tool_calling` (single call) | PASS — `get_weather(location=Tokyo)` |
| `coding` (merge_sorted_lists) | PASS — runs, tests pass |
| `uncensored` | PASS — no refusal |
| `throughput` (3×256-token gen) | **62.8 t/s** mean, ±0.6 stdev |
| `overall_status` | **PASS, 4/4** |

### hermes_loop (raw JSON: `raw-hermes-loop-glm-reap-23b-strix-lean.json`)

| Scenario | Result |
|---|---|
| `single` | PASS — final answer correct |
| `chained` (calc → use) | PASS — `15 × 37 = 555` |
| `multi_step` (compare 2 cities) | PASS — Tokyo/London table + conclusion |
| `search` (web search + extract) | PASS — Eiffel Tower height |
| `error_recovery` (file not found) | **PASS** (clean) |
| `overall_status` | **PASS, 5/5** |

### Context scaling (raw JSON: `ctx-scaling-glm-reap-strix-lean-64k-20260627-143748.json`)

| Ctx target | pp t/s | tg t/s | Result |
|---|---|---|---|
| 4 K | 668.9 | 50.0 | OK, coherent (`4`) |
| 32 K | 166.2 | 50.0 | OK, coherent |
| 64 K | — | — | HTTP 400 (server-side ctx cap) |

**Findings:**
- Decode throughput holds at **50 t/s** across 4 K → 32 K ctx.
- **Prompt processing degrades sharply: 4 K → 32 K drops from 669 → 166 pp t/s (4× slower).** This is a known property of the GLM-4.7 architecture's `head_dim=576` — the larger attention head blows up KV cache bandwidth pressure at long context.
- The 64 K failure is the server's `--ctx-size` cap, not a model limit. The parent GLM-4.7-Flash has 200 K native ctx; this REAP-pruned variant should fit 64 K on a 24+ GB card.

### KV cache type — `head_dim=576` constraint (no turbo support)

This model has **`head_dim=576`** (GLM-4.7 architecture). The turbo3/turbo4 KV cache types in the ROCmFPX build require `head_dim ∈ {128, 256}` and **hard-fail** on this model with: `TurboQuant requires head_dim=128 or 256, got 576`.

Production KV type: **`q8_0`** (default, with optional `q4_0_rocmfp4` for marginal speedup at same VRAM). See `references/rocmfpx-build-quant-bench.md` Pattern 13 in the meshina corpus for the full sweep.

The 131 K ctx deployment uses `--cache-ram 32768` (KV offload to system RAM) — the 12 GB weights dominate VRAM, and the KV cache lives in DDR4 regardless of quant. This is what makes long-context GLM REAP viable on 16 GB hardware.

## Quick start

```bash
# Build llama.cpp with ROCmFPX
git clone https://github.com/charlie12345/ROCmFPX
cd ROCmFPX
cmake -S . -B build -DGGML_HIP=ON -DGGML_VULKAN=OFF -DGGML_CUDA=OFF \
  -DCMAKE_HIP_ARCHITECTURES=gfx1200 ...
cmake --build build --target llama-server llama-cli llama-quantize

# Serve (131 072 ctx, q8_0 KV [head_dim=576, turbo incompatible], KV offload, fa=on)
./build/bin/llama-server \
  -m GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN.gguf \
  -np 1 -c 131072 \
  -ctk q8_0 -ctv q8_0 \
  -kvo -cram 32768 -fa on
```

## Reproduce the quant

```bash
SRC=/path/to/GLM-4.7-Flash-REAP-23B-A3B-BF16.gguf

~/ROCmFPX/build-rdna4/bin/llama-quantize \
  "$SRC" \
  GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN.gguf \
  Q4_0_ROCMFP4_STRIX_LEAN
```

Quantize time: ~3-5 min warm-cache, CPU-only. Source BF16 is 43 GB so the first cold quant is slower.

## Files in this repo

| File | What it is |
|---|---|
| `GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN.gguf` | The quant. **Load only with a ROCmFPX `llama-server`.** |
| `README.md` | This file |
| `raw-mesh-eval-glm-reap-23b-strix-lean.json` | `mesh_eval.py` output (2026-06-27 17:38 UTC) |
| `raw-hermes-loop-glm-reap-23b-strix-lean.json` | `hermes_loop_eval.py` output (2026-06-27 18:12 UTC) |
| `raw-hermes-loop-glm-reap-23b-q3_0_rocmfpx.json` | Same harness on the Q3_0 baseline (for the throughput comparison) |
| `ctx-scaling-glm-reap-strix-lean-64k-20260627-143748.json` | 4 K → 32 K ctx scaling (64 K HTTP 400 — see caveat) |
| `quant-command.sh` | The exact `llama-quantize` invocation used |

## What's NOT in this repo (caveats)

- **Stock llama.cpp will not load this file.** The ROCmFP4 weight format is unique to charlie12345/ROCmFPX.
- **No CUDA / non-AMD GPU bench.** All measurements are RDNA4 (gfx1200).
- **64 K ctx is HTTP 400 on this server.** The parent GLM-4.7-Flash has 200 K native ctx. Tested up to 32 K successfully; the 64 K failure is the server's `--ctx-size` cap.
- **No turbo3/4 KV cache** on this model (head_dim=576). Hard architectural constraint, not a bug.
- **The source GGUF is Unsloth-distributed** (per `general.quantized_by = "Unsloth"` in the metadata). The actual safetensors parent is `cerebras/GLM-4.7-Flash-REAP-23B-A3B`, derived from `zai-org/GLM-4.7-Flash` (the unpruned 200 K-ctx model). The chain is: safetensors → Unsloth GGUF → our STRIX_LEAN.
- **12 GB minimum VRAM.** Doesn't fit on <12 GB cards. The mesh's 16 GB card runs it with ~3 GB headroom.
- **No MTP / speculative-decode bench on this file.** GLM-4.7 architecture is not MTP-capable in this release.
- **No vision/multimodal test.** This variant is text-only.
- **No quality benchmark** (perplexity, MMLU, GSM8K). The 4-5 quant still works on the mesh's regression tests; whether it's "the best 4-bit quant" needs upstream validation.

## Provenance

- **Source model:** [`cerebras/GLM-4.7-Flash-REAP-23B-A3B`](https://huggingface.co/cerebras/GLM-4.7-Flash-REAP-23B-A3B) — 23 B-A3B MoE, 25 % of experts pruned from `zai-org/GLM-4.7-Flash` using the REAP method
- **Source model license:** mit
- **Source GGUF uploader:** Unsloth (per `general.quantized_by` in the BF16 source metadata)
- **Quantizer:** [charlie12345/ROCmFPX](https://github.com/charlie12345/ROCmFPX) `main` @ `11d76c2` (2026-06-27)
- **Quantizer license:** MIT
- **Build hardware:** Node B, AMD Ryzen 9 5900XT 16-core, Radeon RX 9060 XT 16 GB (gfx1200), ROCm 7.2.3, NixOS 25.11
- **Build tooling:** NixOS 25.11, ROCm store paths dynamic-discovered. See the `meshina` repo's `references/nixos-rocm-external-build-recipe.md` for the build env setup.
- **Bench harnesses:** `scripts/mesh-bench/mesh_eval.py` + `scripts/mesh-bench/hermes_loop_eval.py` + `scripts/mesh-bench/ctx_scaling_bench.py` from the [meshina](https://github.com/maczzzzzz/meshina) repo (private)
- **Original bench report:** `raw/benchmarks/2026-06-27-rocmfpx-validation/briefs/2026-06-27-rocmfpx-rdna4-16gb.md` in the meshina repo

## License

- **The GLM-4.7-Flash-REAP parent is MIT** (per its HF model card).
- **The `charlie12345/ROCmFPX` quantizer is MIT.**
- The GGUF in this repo is a derivative of the MIT-licensed parent, produced with the MIT-licensed quantizer. The MIT license is preserved.