How to use from
Hermes Agent
Start the llama.cpp server
# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf maczzzzzz/GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN-GGUF
Configure Hermes
# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default maczzzzzz/GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN-GGUF
Run Hermes
hermes
Quick Links

GLM-4.7-Flash-REAP-23B-A3B ROCmFPX STRIX_LEAN — GGUF

ROCmFPX Q4_0_ROCMFP4_STRIX_LEAN quant of cerebras/GLM-4.7-Flash-REAP-23B-A3B (GLM-4.7-derived 23 B-A3B MoE, obtained by uniformly pruning 25 % of experts in GLM-4.7-Flash using the REAP method).

Built with charlie12345/ROCmFPX on a Radeon RX 9060 XT 16 GB (gfx1200), ROCm 7.2.3, NixOS 25.11. Quantized 2026-06-27 with build commit 11d76c2.

File Size Quant BPW
GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN.gguf 12 GB Q4_0_ROCMFP4_STRIX_LEAN (4-bit ROCmFP4 + Strix K/V + Q5_K embed) 4.38

This is not a stock llama.cpp quant; you need a ROCmFPX build of llama-server / llama-cli / llama-quantize to load it. Stock llama.cpp will reject the file with unknown quantization.

Scope of these benchmarks — read this first

These numbers are a light baseline, not a thorough ROCmFPX evaluation. The mesh's bench framework is built for production agent workload regression-detection on the local stack, not for the kind of multi-axis sweep that upstream quant maintainers typically publish. Specifically:

  • Harness scope is bounded. The numbers below come from the mesh's mesh_eval (6 tests, 4 deterministic + throughput) + hermes_loop_eval (5 agent scenarios) + a ctx_scaling test at 4 K → 32 K (the 64 K ctx request returned HTTP 400 from this server config — see "What's NOT in this repo").
  • Sample sizes are small. Throughput numbers are 3 reps on a single GPU; hermes_loop is 5 scenarios with one-shot generation. None are powered for statistical significance on a per-token level.
  • No perplexity / wikitext / MMLU / GSM8K. The mesh's stack isn't a quality benchmark — those are upstream ROCmFPX's territory.
  • Single GPU class. All measurements are on a 16 GB RDNA4 (RX 9060 XT, gfx1200). No Strix unified-memory, no CDNA, no multi-GPU, no Vulkan, no CUDA. Cross-hardware generalization is not implied.
  • No human eval. "Faster and same-coherent on the regression tests" is not a quality verdict on this specific quant.
  • Heaviest model in the mesh. GLM REAP 23B at 12 GB is the biggest single-model quant the mesh can serve. On smaller GPUs (<12 GB VRAM), this file will not fit. The 16 GB card runs it with ~3 GB headroom.

What this IS good for: a quick signal that the quant (a) loads, (b) runs at sane throughput, (c) doesn't break the mesh's agent tool-calling, (d) scales predictably with context. What this is NOT good for: claiming "this is the best quant of this model," reproducing academic benchmark results, or substituting for upstream's validation work.

For a rigorous view, the parent repo cerebras/GLM-4.7-Flash-REAP-23B-A3B, the upstream zai-org/GLM-4.7-Flash, and the model's stock GGUF variants (e.g. on unsloth/) are the place to look.

What we measured

Hardware: Node B, AMD Ryzen 9 5900XT 16-core, Radeon RX 9060 XT 16 GB (gfx1200), ROCm 7.2.3, NixOS 25.11 Software: charlie12345/ROCmFPX main @ 11d76c2 Source GGUF: GLM-4.7-Flash-REAP-23B-A3B-BF16.gguf (BF16, 43 GB) — the Unsloth-distributed GGUF of the Cerebras-pruned safetensors Same-stack comparison: Q3_0_ROCMFPX (3-bit ROCmFPX experimental, 12 GB file) on the same source

Agent-loop throughput — STRIX_LEAN vs Q3_0_ROCMFPX (hermes_loop, same harness, same source)

Scenario STRIX_LEAN (t/s) Q3_0_ROCMFPX (t/s) Δ
single (one tool call) 38.5 23.1 +67 %
chained (calc → use result) 35.8 24.4 +47 %
multi_step (compare 2 cities) 50.8 37.7 +35 %
search (web search + extract) 46.8 32.5 +44 %
error_recovery (file not found) 48.9 34.5 +42 %
Mean 44.2 30.4 +45 %

Both quants pass all 5 scenarios. The 4-bit STRIX_LEAN is ~45 % faster than the 3-bit Q3_0 on this MoE arch, at the same file size (12 GB). This is the headline finding for this model.

mesh_eval (raw JSON: raw-mesh-eval-glm-reap-23b-strix-lean.json)

Test Result
gibberish OK
thinking_leak CLEAN
tool_calling (single call) PASS — get_weather(location=Tokyo)
coding (merge_sorted_lists) PASS — runs, tests pass
uncensored PASS — no refusal
throughput (3×256-token gen) 62.8 t/s mean, ±0.6 stdev
overall_status PASS, 4/4

hermes_loop (raw JSON: raw-hermes-loop-glm-reap-23b-strix-lean.json)

Scenario Result
single PASS — final answer correct
chained (calc → use) PASS — 15 × 37 = 555
multi_step (compare 2 cities) PASS — Tokyo/London table + conclusion
search (web search + extract) PASS — Eiffel Tower height
error_recovery (file not found) PASS (clean)
overall_status PASS, 5/5

Context scaling (raw JSON: ctx-scaling-glm-reap-strix-lean-64k-20260627-143748.json)

Ctx target pp t/s tg t/s Result
4 K 668.9 50.0 OK, coherent (4)
32 K 166.2 50.0 OK, coherent
64 K — — HTTP 400 (server-side ctx cap)

Findings:

  • Decode throughput holds at 50 t/s across 4 K → 32 K ctx.
  • Prompt processing degrades sharply: 4 K → 32 K drops from 669 → 166 pp t/s (4× slower). This is a known property of the GLM-4.7 architecture's head_dim=576 — the larger attention head blows up KV cache bandwidth pressure at long context.
  • The 64 K failure is the server's --ctx-size cap, not a model limit. The parent GLM-4.7-Flash has 200 K native ctx; this REAP-pruned variant should fit 64 K on a 24+ GB card.

KV cache type — head_dim=576 constraint (no turbo support)

This model has head_dim=576 (GLM-4.7 architecture). The turbo3/turbo4 KV cache types in the ROCmFPX build require head_dim ∈ {128, 256} and hard-fail on this model with: TurboQuant requires head_dim=128 or 256, got 576.

Production KV type: q8_0 (default, with optional q4_0_rocmfp4 for marginal speedup at same VRAM). See references/rocmfpx-build-quant-bench.md Pattern 13 in the meshina corpus for the full sweep.

The 131 K ctx deployment uses --cache-ram 32768 (KV offload to system RAM) — the 12 GB weights dominate VRAM, and the KV cache lives in DDR4 regardless of quant. This is what makes long-context GLM REAP viable on 16 GB hardware.

Quick start

# Build llama.cpp with ROCmFPX
git clone https://github.com/charlie12345/ROCmFPX
cd ROCmFPX
cmake -S . -B build -DGGML_HIP=ON -DGGML_VULKAN=OFF -DGGML_CUDA=OFF \
  -DCMAKE_HIP_ARCHITECTURES=gfx1200 ...
cmake --build build --target llama-server llama-cli llama-quantize

# Serve (131 072 ctx, q8_0 KV [head_dim=576, turbo incompatible], KV offload, fa=on)
./build/bin/llama-server \
  -m GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN.gguf \
  -np 1 -c 131072 \
  -ctk q8_0 -ctv q8_0 \
  -kvo -cram 32768 -fa on

Reproduce the quant

SRC=/path/to/GLM-4.7-Flash-REAP-23B-A3B-BF16.gguf

~/ROCmFPX/build-rdna4/bin/llama-quantize \
  "$SRC" \
  GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN.gguf \
  Q4_0_ROCMFP4_STRIX_LEAN

Quantize time: ~3-5 min warm-cache, CPU-only. Source BF16 is 43 GB so the first cold quant is slower.

Files in this repo

File What it is
GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN.gguf The quant. Load only with a ROCmFPX llama-server.
README.md This file
raw-mesh-eval-glm-reap-23b-strix-lean.json mesh_eval.py output (2026-06-27 17:38 UTC)
raw-hermes-loop-glm-reap-23b-strix-lean.json hermes_loop_eval.py output (2026-06-27 18:12 UTC)
raw-hermes-loop-glm-reap-23b-q3_0_rocmfpx.json Same harness on the Q3_0 baseline (for the throughput comparison)
ctx-scaling-glm-reap-strix-lean-64k-20260627-143748.json 4 K → 32 K ctx scaling (64 K HTTP 400 — see caveat)
quant-command.sh The exact llama-quantize invocation used

What's NOT in this repo (caveats)

  • Stock llama.cpp will not load this file. The ROCmFP4 weight format is unique to charlie12345/ROCmFPX.
  • No CUDA / non-AMD GPU bench. All measurements are RDNA4 (gfx1200).
  • 64 K ctx is HTTP 400 on this server. The parent GLM-4.7-Flash has 200 K native ctx. Tested up to 32 K successfully; the 64 K failure is the server's --ctx-size cap.
  • No turbo3/4 KV cache on this model (head_dim=576). Hard architectural constraint, not a bug.
  • The source GGUF is Unsloth-distributed (per general.quantized_by = "Unsloth" in the metadata). The actual safetensors parent is cerebras/GLM-4.7-Flash-REAP-23B-A3B, derived from zai-org/GLM-4.7-Flash (the unpruned 200 K-ctx model). The chain is: safetensors → Unsloth GGUF → our STRIX_LEAN.
  • 12 GB minimum VRAM. Doesn't fit on <12 GB cards. The mesh's 16 GB card runs it with ~3 GB headroom.
  • No MTP / speculative-decode bench on this file. GLM-4.7 architecture is not MTP-capable in this release.
  • No vision/multimodal test. This variant is text-only.
  • No quality benchmark (perplexity, MMLU, GSM8K). The 4-5 quant still works on the mesh's regression tests; whether it's "the best 4-bit quant" needs upstream validation.

Provenance

  • Source model: cerebras/GLM-4.7-Flash-REAP-23B-A3B — 23 B-A3B MoE, 25 % of experts pruned from zai-org/GLM-4.7-Flash using the REAP method
  • Source model license: mit
  • Source GGUF uploader: Unsloth (per general.quantized_by in the BF16 source metadata)
  • Quantizer: charlie12345/ROCmFPX main @ 11d76c2 (2026-06-27)
  • Quantizer license: MIT
  • Build hardware: Node B, AMD Ryzen 9 5900XT 16-core, Radeon RX 9060 XT 16 GB (gfx1200), ROCm 7.2.3, NixOS 25.11
  • Build tooling: NixOS 25.11, ROCm store paths dynamic-discovered. See the meshina repo's references/nixos-rocm-external-build-recipe.md for the build env setup.
  • Bench harnesses: scripts/mesh-bench/mesh_eval.py + scripts/mesh-bench/hermes_loop_eval.py + scripts/mesh-bench/ctx_scaling_bench.py from the meshina repo (private)
  • Original bench report: raw/benchmarks/2026-06-27-rocmfpx-validation/briefs/2026-06-27-rocmfpx-rdna4-16gb.md in the meshina repo

License

  • The GLM-4.7-Flash-REAP parent is MIT (per its HF model card).
  • The charlie12345/ROCmFPX quantizer is MIT.
  • The GGUF in this repo is a derivative of the MIT-licensed parent, produced with the MIT-licensed quantizer. The MIT license is preserved.
Downloads last month
129
GGUF
Model size
23B params
Architecture
deepseek2
Hardware compatibility
Log In to add your hardware

We're not able to determine the quantization variants.

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for maczzzzzz/GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN-GGUF

Quantized
(22)
this model

Collection including maczzzzzz/GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN-GGUF