Instructions to use maczzzzzz/GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use maczzzzzz/GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN-GGUF with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="maczzzzzz/GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN-GGUF",
	filename="GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN.gguf",
)

llm.create_chat_completion(
	messages = "No input example has been defined for this model task."
)

Notebooks
Google Colab
Kaggle
Local Apps Settings

llama.cpp

How to use maczzzzzz/GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN-GGUF with llama.cpp:

Install (macOS, Linux)

curl -LsSf https://llama.app/install.sh | sh
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf maczzzzzz/GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN-GGUF
# Run inference directly in the terminal:
llama cli -hf maczzzzzz/GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN-GGUF

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf maczzzzzz/GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN-GGUF
# Run inference directly in the terminal:
llama cli -hf maczzzzzz/GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN-GGUF

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf maczzzzzz/GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN-GGUF
# Run inference directly in the terminal:
./llama-cli -hf maczzzzzz/GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN-GGUF

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf maczzzzzz/GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN-GGUF
# Run inference directly in the terminal:
./build/bin/llama-cli -hf maczzzzzz/GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN-GGUF

Use Docker

docker model run hf.co/maczzzzzz/GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN-GGUF

LM Studio
Jan
Ollama
How to use maczzzzzz/GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN-GGUF with Ollama:
```
ollama run hf.co/maczzzzzz/GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN-GGUF
```

Unsloth Studio

How to use maczzzzzz/GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN-GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for maczzzzzz/GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN-GGUF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for maczzzzzz/GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN-GGUF to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for maczzzzzz/GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN-GGUF to start chatting

How to use maczzzzzz/GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN-GGUF with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf maczzzzzz/GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN-GGUF

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "maczzzzzz/GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN-GGUF"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent new

How to use maczzzzzz/GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN-GGUF with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf maczzzzzz/GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN-GGUF

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default maczzzzzz/GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN-GGUF

Run Hermes

hermes

Atomic Chat new

OpenClaw new

How to use maczzzzzz/GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN-GGUF with OpenClaw:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf maczzzzzz/GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN-GGUF

Configure OpenClaw

# Install OpenClaw:
npm install -g openclaw@latest
# Register the local server and set it as the default model:
openclaw onboard --non-interactive --mode local \
  --auth-choice custom-api-key \
  --custom-base-url http://127.0.0.1:8080/v1 \
  --custom-model-id "maczzzzzz/GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN-GGUF" \
  --custom-provider-id llama-cpp \
  --custom-compatibility openai \
  --custom-text-input \
  --accept-risk \
  --skip-health

Run OpenClaw

openclaw agent --local --agent main --message "Hello from Hugging Face"

Docker Model Runner
How to use maczzzzzz/GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN-GGUF with Docker Model Runner:
```
docker model run hf.co/maczzzzzz/GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN-GGUF
```

Lemonade

How to use maczzzzzz/GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN-GGUF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull maczzzzzz/GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN-GGUF

Run and chat with the model

lemonade run user.GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN-GGUF-{{QUANT_TAG}}

List all available models

lemonade list

GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN-GGUF / README.md

maczzzzzz

Upload GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN.gguf + bench data (ROCmFPX STRIX_LEAN)

9d4fac2 verified 5 days ago

preview code

Raw

History Blame Contribute Delete

11.1 kB

metadata

license: mit
base_model: cerebras/GLM-4.7-Flash-REAP-23B-A3B
tags:
  - gguf
  - rocmfpx
  - deepseek2
  - glm
  - moe
  - rocm
  - rdna4
  - strix-lean
  - quantization
  - llama-cpp
base_model_relation: quantized
quantized_by: maczzzzzz (via charlie12345/ROCmFPX)

GLM-4.7-Flash-REAP-23B-A3B ROCmFPX STRIX_LEAN — GGUF

ROCmFPX Q4_0_ROCMFP4_STRIX_LEAN quant of cerebras/GLM-4.7-Flash-REAP-23B-A3B (GLM-4.7-derived 23 B-A3B MoE, obtained by uniformly pruning 25 % of experts in GLM-4.7-Flash using the REAP method).

Built with charlie12345/ROCmFPX on a Radeon RX 9060 XT 16 GB (gfx1200), ROCm 7.2.3, NixOS 25.11. Quantized 2026-06-27 with build commit 11d76c2.

File	Size	Quant	BPW
`GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN.gguf`	12 GB	`Q4_0_ROCMFP4_STRIX_LEAN` (4-bit ROCmFP4 + Strix K/V + Q5_K embed)	4.38

This is not a stock llama.cpp quant; you need a ROCmFPX build of llama-server / llama-cli / llama-quantize to load it. Stock llama.cpp will reject the file with unknown quantization.

Scope of these benchmarks — read this first

These numbers are a light baseline, not a thorough ROCmFPX evaluation. The mesh's bench framework is built for production agent workload regression-detection on the local stack, not for the kind of multi-axis sweep that upstream quant maintainers typically publish. Specifically:

Harness scope is bounded. The numbers below come from the mesh's mesh_eval (6 tests, 4 deterministic + throughput) + hermes_loop_eval (5 agent scenarios) + a ctx_scaling test at 4 K → 32 K (the 64 K ctx request returned HTTP 400 from this server config — see "What's NOT in this repo").
Sample sizes are small. Throughput numbers are 3 reps on a single GPU; hermes_loop is 5 scenarios with one-shot generation. None are powered for statistical significance on a per-token level.
No perplexity / wikitext / MMLU / GSM8K. The mesh's stack isn't a quality benchmark — those are upstream ROCmFPX's territory.
Single GPU class. All measurements are on a 16 GB RDNA4 (RX 9060 XT, gfx1200). No Strix unified-memory, no CDNA, no multi-GPU, no Vulkan, no CUDA. Cross-hardware generalization is not implied.
No human eval. "Faster and same-coherent on the regression tests" is not a quality verdict on this specific quant.
Heaviest model in the mesh. GLM REAP 23B at 12 GB is the biggest single-model quant the mesh can serve. On smaller GPUs (<12 GB VRAM), this file will not fit. The 16 GB card runs it with ~3 GB headroom.

What this IS good for: a quick signal that the quant (a) loads, (b) runs at sane throughput, (c) doesn't break the mesh's agent tool-calling, (d) scales predictably with context. What this is NOT good for: claiming "this is the best quant of this model," reproducing academic benchmark results, or substituting for upstream's validation work.

For a rigorous view, the parent repo cerebras/GLM-4.7-Flash-REAP-23B-A3B, the upstream zai-org/GLM-4.7-Flash, and the model's stock GGUF variants (e.g. on unsloth/) are the place to look.

What we measured

Hardware: Node B, AMD Ryzen 9 5900XT 16-core, Radeon RX 9060 XT 16 GB (gfx1200), ROCm 7.2.3, NixOS 25.11 Software: charlie12345/ROCmFPX main @ 11d76c2 Source GGUF: GLM-4.7-Flash-REAP-23B-A3B-BF16.gguf (BF16, 43 GB) — the Unsloth-distributed GGUF of the Cerebras-pruned safetensors Same-stack comparison: Q3_0_ROCMFPX (3-bit ROCmFPX experimental, 12 GB file) on the same source

Agent-loop throughput — STRIX_LEAN vs Q3_0_ROCMFPX (hermes_loop, same harness, same source)

Scenario	STRIX_LEAN (t/s)	Q3_0_ROCMFPX (t/s)	Δ
`single` (one tool call)	38.5	23.1	+67 %
`chained` (calc → use result)	35.8	24.4	+47 %
`multi_step` (compare 2 cities)	50.8	37.7	+35 %
`search` (web search + extract)	46.8	32.5	+44 %
`error_recovery` (file not found)	48.9	34.5	+42 %
Mean	44.2	30.4	+45 %

Both quants pass all 5 scenarios. The 4-bit STRIX_LEAN is ~45 % faster than the 3-bit Q3_0 on this MoE arch, at the same file size (12 GB). This is the headline finding for this model.

mesh_eval (raw JSON: `raw-mesh-eval-glm-reap-23b-strix-lean.json`)

Test	Result
`gibberish`	OK
`thinking_leak`	CLEAN
`tool_calling` (single call)	PASS — `get_weather(location=Tokyo)`
`coding` (merge_sorted_lists)	PASS — runs, tests pass
`uncensored`	PASS — no refusal
`throughput` (3×256-token gen)	62.8 t/s mean, ±0.6 stdev
`overall_status`	PASS, 4/4

hermes_loop (raw JSON: `raw-hermes-loop-glm-reap-23b-strix-lean.json`)

Scenario	Result
`single`	PASS — final answer correct
`chained` (calc → use)	PASS — `15 × 37 = 555`
`multi_step` (compare 2 cities)	PASS — Tokyo/London table + conclusion
`search` (web search + extract)	PASS — Eiffel Tower height
`error_recovery` (file not found)	PASS (clean)
`overall_status`	PASS, 5/5

Context scaling (raw JSON: `ctx-scaling-glm-reap-strix-lean-64k-20260627-143748.json`)

Ctx target	pp t/s	tg t/s	Result
4 K	668.9	50.0	OK, coherent (`4`)
32 K	166.2	50.0	OK, coherent
64 K	—	—	HTTP 400 (server-side ctx cap)

Findings:

Decode throughput holds at 50 t/s across 4 K → 32 K ctx.
Prompt processing degrades sharply: 4 K → 32 K drops from 669 → 166 pp t/s (4× slower). This is a known property of the GLM-4.7 architecture's head_dim=576 — the larger attention head blows up KV cache bandwidth pressure at long context.
The 64 K failure is the server's --ctx-size cap, not a model limit. The parent GLM-4.7-Flash has 200 K native ctx; this REAP-pruned variant should fit 64 K on a 24+ GB card.

KV cache type — `head_dim=576` constraint (no turbo support)

This model has head_dim=576 (GLM-4.7 architecture). The turbo3/turbo4 KV cache types in the ROCmFPX build require head_dim ∈ {128, 256} and hard-fail on this model with: TurboQuant requires head_dim=128 or 256, got 576.

Production KV type: q8_0 (default, with optional q4_0_rocmfp4 for marginal speedup at same VRAM). See references/rocmfpx-build-quant-bench.md Pattern 13 in the meshina corpus for the full sweep.

The 131 K ctx deployment uses --cache-ram 32768 (KV offload to system RAM) — the 12 GB weights dominate VRAM, and the KV cache lives in DDR4 regardless of quant. This is what makes long-context GLM REAP viable on 16 GB hardware.

Quick start

# Build llama.cpp with ROCmFPX
git clone https://github.com/charlie12345/ROCmFPX
cd ROCmFPX
cmake -S . -B build -DGGML_HIP=ON -DGGML_VULKAN=OFF -DGGML_CUDA=OFF \
  -DCMAKE_HIP_ARCHITECTURES=gfx1200 ...
cmake --build build --target llama-server llama-cli llama-quantize

# Serve (131 072 ctx, q8_0 KV [head_dim=576, turbo incompatible], KV offload, fa=on)
./build/bin/llama-server \
  -m GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN.gguf \
  -np 1 -c 131072 \
  -ctk q8_0 -ctv q8_0 \
  -kvo -cram 32768 -fa on

Reproduce the quant

SRC=/path/to/GLM-4.7-Flash-REAP-23B-A3B-BF16.gguf

~/ROCmFPX/build-rdna4/bin/llama-quantize \
  "$SRC" \
  GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN.gguf \
  Q4_0_ROCMFP4_STRIX_LEAN

Quantize time: ~3-5 min warm-cache, CPU-only. Source BF16 is 43 GB so the first cold quant is slower.

Files in this repo

File	What it is
`GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN.gguf`	The quant. Load only with a ROCmFPX `llama-server`.
`README.md`	This file
`raw-mesh-eval-glm-reap-23b-strix-lean.json`	`mesh_eval.py` output (2026-06-27 17:38 UTC)
`raw-hermes-loop-glm-reap-23b-strix-lean.json`	`hermes_loop_eval.py` output (2026-06-27 18:12 UTC)
`raw-hermes-loop-glm-reap-23b-q3_0_rocmfpx.json`	Same harness on the Q3_0 baseline (for the throughput comparison)
`ctx-scaling-glm-reap-strix-lean-64k-20260627-143748.json`	4 K → 32 K ctx scaling (64 K HTTP 400 — see caveat)
`quant-command.sh`	The exact `llama-quantize` invocation used

What's NOT in this repo (caveats)

Stock llama.cpp will not load this file. The ROCmFP4 weight format is unique to charlie12345/ROCmFPX.
No CUDA / non-AMD GPU bench. All measurements are RDNA4 (gfx1200).
64 K ctx is HTTP 400 on this server. The parent GLM-4.7-Flash has 200 K native ctx. Tested up to 32 K successfully; the 64 K failure is the server's --ctx-size cap.
No turbo3/4 KV cache on this model (head_dim=576). Hard architectural constraint, not a bug.
The source GGUF is Unsloth-distributed (per general.quantized_by = "Unsloth" in the metadata). The actual safetensors parent is cerebras/GLM-4.7-Flash-REAP-23B-A3B, derived from zai-org/GLM-4.7-Flash (the unpruned 200 K-ctx model). The chain is: safetensors → Unsloth GGUF → our STRIX_LEAN.
12 GB minimum VRAM. Doesn't fit on <12 GB cards. The mesh's 16 GB card runs it with ~3 GB headroom.
No MTP / speculative-decode bench on this file. GLM-4.7 architecture is not MTP-capable in this release.
No vision/multimodal test. This variant is text-only.
No quality benchmark (perplexity, MMLU, GSM8K). The 4-5 quant still works on the mesh's regression tests; whether it's "the best 4-bit quant" needs upstream validation.

Provenance

Source model: cerebras/GLM-4.7-Flash-REAP-23B-A3B — 23 B-A3B MoE, 25 % of experts pruned from zai-org/GLM-4.7-Flash using the REAP method
Source model license: mit
Source GGUF uploader: Unsloth (per general.quantized_by in the BF16 source metadata)
Quantizer: charlie12345/ROCmFPX main @ 11d76c2 (2026-06-27)
Quantizer license: MIT
Build hardware: Node B, AMD Ryzen 9 5900XT 16-core, Radeon RX 9060 XT 16 GB (gfx1200), ROCm 7.2.3, NixOS 25.11
Build tooling: NixOS 25.11, ROCm store paths dynamic-discovered. See the meshina repo's references/nixos-rocm-external-build-recipe.md for the build env setup.
Bench harnesses: scripts/mesh-bench/mesh_eval.py + scripts/mesh-bench/hermes_loop_eval.py + scripts/mesh-bench/ctx_scaling_bench.py from the meshina repo (private)
Original bench report: raw/benchmarks/2026-06-27-rocmfpx-validation/briefs/2026-06-27-rocmfpx-rdna4-16gb.md in the meshina repo

License

The GLM-4.7-Flash-REAP parent is MIT (per its HF model card).
The charlie12345/ROCmFPX quantizer is MIT.
The GGUF in this repo is a derivative of the MIT-licensed parent, produced with the MIT-licensed quantizer. The MIT license is preserved.

GLM-4.7-Flash-REAP-23B-A3B ROCmFPX STRIX_LEAN — GGUF

Scope of these benchmarks — read this first

What we measured

Agent-loop throughput — STRIX_LEAN vs Q3_0_ROCMFPX (hermes_loop, same harness, same source)

mesh_eval (raw JSON: raw-mesh-eval-glm-reap-23b-strix-lean.json)

hermes_loop (raw JSON: raw-hermes-loop-glm-reap-23b-strix-lean.json)

Context scaling (raw JSON: ctx-scaling-glm-reap-strix-lean-64k-20260627-143748.json)

KV cache type — head_dim=576 constraint (no turbo support)

Quick start

Reproduce the quant

Files in this repo

What's NOT in this repo (caveats)

Provenance

License

mesh_eval (raw JSON: `raw-mesh-eval-glm-reap-23b-strix-lean.json`)

hermes_loop (raw JSON: `raw-hermes-loop-glm-reap-23b-strix-lean.json`)

Context scaling (raw JSON: `ctx-scaling-glm-reap-strix-lean-64k-20260627-143748.json`)

KV cache type — `head_dim=576` constraint (no turbo support)