Instructions to use maczzzzzz/GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use maczzzzzz/GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN-GGUF with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="maczzzzzz/GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN-GGUF",
	filename="GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN.gguf",
)

llm.create_chat_completion(
	messages = "No input example has been defined for this model task."
)

Notebooks
Google Colab
Kaggle
Local Apps Settings

llama.cpp

How to use maczzzzzz/GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN-GGUF with llama.cpp:

Install (macOS, Linux)

curl -LsSf https://llama.app/install.sh | sh
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf maczzzzzz/GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN-GGUF
# Run inference directly in the terminal:
llama cli -hf maczzzzzz/GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN-GGUF

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf maczzzzzz/GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN-GGUF
# Run inference directly in the terminal:
llama cli -hf maczzzzzz/GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN-GGUF

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf maczzzzzz/GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN-GGUF
# Run inference directly in the terminal:
./llama-cli -hf maczzzzzz/GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN-GGUF

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf maczzzzzz/GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN-GGUF
# Run inference directly in the terminal:
./build/bin/llama-cli -hf maczzzzzz/GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN-GGUF

Use Docker

docker model run hf.co/maczzzzzz/GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN-GGUF

LM Studio
Jan
Ollama
How to use maczzzzzz/GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN-GGUF with Ollama:
```
ollama run hf.co/maczzzzzz/GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN-GGUF
```

Unsloth Studio

How to use maczzzzzz/GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN-GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for maczzzzzz/GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN-GGUF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for maczzzzzz/GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN-GGUF to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for maczzzzzz/GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN-GGUF to start chatting

How to use maczzzzzz/GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN-GGUF with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf maczzzzzz/GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN-GGUF

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "maczzzzzz/GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN-GGUF"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent new

How to use maczzzzzz/GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN-GGUF with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf maczzzzzz/GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN-GGUF

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default maczzzzzz/GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN-GGUF

Run Hermes

hermes

Atomic Chat new

OpenClaw new

How to use maczzzzzz/GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN-GGUF with OpenClaw:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf maczzzzzz/GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN-GGUF

Configure OpenClaw

# Install OpenClaw:
npm install -g openclaw@latest
# Register the local server and set it as the default model:
openclaw onboard --non-interactive --mode local \
  --auth-choice custom-api-key \
  --custom-base-url http://127.0.0.1:8080/v1 \
  --custom-model-id "maczzzzzz/GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN-GGUF" \
  --custom-provider-id llama-cpp \
  --custom-compatibility openai \
  --custom-text-input \
  --accept-risk \
  --skip-health

Run OpenClaw

openclaw agent --local --agent main --message "Hello from Hugging Face"

Docker Model Runner
How to use maczzzzzz/GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN-GGUF with Docker Model Runner:
```
docker model run hf.co/maczzzzzz/GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN-GGUF
```

Lemonade

How to use maczzzzzz/GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN-GGUF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull maczzzzzz/GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN-GGUF

Run and chat with the model

lemonade run user.GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN-GGUF-{{QUANT_TAG}}

List all available models

lemonade list

GLM-4.7-Flash-REAP-23B-A3B ROCmFPX STRIX_LEAN — GGUF

ROCmFPX Q4_0_ROCMFP4_STRIX_LEAN quant of cerebras/GLM-4.7-Flash-REAP-23B-A3B (GLM-4.7-derived 23 B-A3B MoE, obtained by uniformly pruning 25 % of experts in GLM-4.7-Flash using the REAP method).

Built with charlie12345/ROCmFPX on a Radeon RX 9060 XT 16 GB (gfx1200), ROCm 7.2.3, NixOS 25.11. Quantized 2026-06-27 with build commit 11d76c2.

File	Size	Quant	BPW
`GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN.gguf`	12 GB	`Q4_0_ROCMFP4_STRIX_LEAN` (4-bit ROCmFP4 + Strix K/V + Q5_K embed)	4.38

This is not a stock llama.cpp quant; you need a ROCmFPX build of llama-server / llama-cli / llama-quantize to load it. Stock llama.cpp will reject the file with unknown quantization.

Scope of these benchmarks — read this first

These numbers are a light baseline, not a thorough ROCmFPX evaluation. The mesh's bench framework is built for production agent workload regression-detection on the local stack, not for the kind of multi-axis sweep that upstream quant maintainers typically publish. Specifically:

Harness scope is bounded. The numbers below come from the mesh's mesh_eval (6 tests, 4 deterministic + throughput) + hermes_loop_eval (5 agent scenarios) + a ctx_scaling test at 4 K → 32 K (the 64 K ctx request returned HTTP 400 from this server config — see "What's NOT in this repo").
Sample sizes are small. Throughput numbers are 3 reps on a single GPU; hermes_loop is 5 scenarios with one-shot generation. None are powered for statistical significance on a per-token level.
No perplexity / wikitext / MMLU / GSM8K. The mesh's stack isn't a quality benchmark — those are upstream ROCmFPX's territory.
Single GPU class. All measurements are on a 16 GB RDNA4 (RX 9060 XT, gfx1200). No Strix unified-memory, no CDNA, no multi-GPU, no Vulkan, no CUDA. Cross-hardware generalization is not implied.
No human eval. "Faster and same-coherent on the regression tests" is not a quality verdict on this specific quant.
Heaviest model in the mesh. GLM REAP 23B at 12 GB is the biggest single-model quant the mesh can serve. On smaller GPUs (<12 GB VRAM), this file will not fit. The 16 GB card runs it with ~3 GB headroom.

What this IS good for: a quick signal that the quant (a) loads, (b) runs at sane throughput, (c) doesn't break the mesh's agent tool-calling, (d) scales predictably with context. What this is NOT good for: claiming "this is the best quant of this model," reproducing academic benchmark results, or substituting for upstream's validation work.

For a rigorous view, the parent repo cerebras/GLM-4.7-Flash-REAP-23B-A3B, the upstream zai-org/GLM-4.7-Flash, and the model's stock GGUF variants (e.g. on unsloth/) are the place to look.

What we measured

Hardware: Node B, AMD Ryzen 9 5900XT 16-core, Radeon RX 9060 XT 16 GB (gfx1200), ROCm 7.2.3, NixOS 25.11 Software: charlie12345/ROCmFPX main @ 11d76c2 Source GGUF: GLM-4.7-Flash-REAP-23B-A3B-BF16.gguf (BF16, 43 GB) — the Unsloth-distributed GGUF of the Cerebras-pruned safetensors Same-stack comparison: Q3_0_ROCMFPX (3-bit ROCmFPX experimental, 12 GB file) on the same source

Agent-loop throughput — STRIX_LEAN vs Q3_0_ROCMFPX (hermes_loop, same harness, same source)

Scenario	STRIX_LEAN (t/s)	Q3_0_ROCMFPX (t/s)	Δ
`single` (one tool call)	38.5	23.1	+67 %
`chained` (calc → use result)	35.8	24.4	+47 %
`multi_step` (compare 2 cities)	50.8	37.7	+35 %
`search` (web search + extract)	46.8	32.5	+44 %
`error_recovery` (file not found)	48.9	34.5	+42 %
Mean	44.2	30.4	+45 %

Both quants pass all 5 scenarios. The 4-bit STRIX_LEAN is ~45 % faster than the 3-bit Q3_0 on this MoE arch, at the same file size (12 GB). This is the headline finding for this model.

mesh_eval (raw JSON: `raw-mesh-eval-glm-reap-23b-strix-lean.json`)

Test	Result
`gibberish`	OK
`thinking_leak`	CLEAN
`tool_calling` (single call)	PASS — `get_weather(location=Tokyo)`
`coding` (merge_sorted_lists)	PASS — runs, tests pass
`uncensored`	PASS — no refusal
`throughput` (3×256-token gen)	62.8 t/s mean, ±0.6 stdev
`overall_status`	PASS, 4/4

hermes_loop (raw JSON: `raw-hermes-loop-glm-reap-23b-strix-lean.json`)

Scenario	Result
`single`	PASS — final answer correct
`chained` (calc → use)	PASS — `15 × 37 = 555`
`multi_step` (compare 2 cities)	PASS — Tokyo/London table + conclusion
`search` (web search + extract)	PASS — Eiffel Tower height
`error_recovery` (file not found)	PASS (clean)
`overall_status`	PASS, 5/5

Context scaling (raw JSON: `ctx-scaling-glm-reap-strix-lean-64k-20260627-143748.json`)

Ctx target	pp t/s	tg t/s	Result
4 K	668.9	50.0	OK, coherent (`4`)
32 K	166.2	50.0	OK, coherent
64 K	—	—	HTTP 400 (server-side ctx cap)

Findings:

Decode throughput holds at 50 t/s across 4 K → 32 K ctx.
Prompt processing degrades sharply: 4 K → 32 K drops from 669 → 166 pp t/s (4× slower). This is a known property of the GLM-4.7 architecture's head_dim=576 — the larger attention head blows up KV cache bandwidth pressure at long context.
The 64 K failure is the server's --ctx-size cap, not a model limit. The parent GLM-4.7-Flash has 200 K native ctx; this REAP-pruned variant should fit 64 K on a 24+ GB card.

KV cache type — `head_dim=576` constraint (no turbo support)

This model has head_dim=576 (GLM-4.7 architecture). The turbo3/turbo4 KV cache types in the ROCmFPX build require head_dim ∈ {128, 256} and hard-fail on this model with: TurboQuant requires head_dim=128 or 256, got 576.

Production KV type: q8_0 (default, with optional q4_0_rocmfp4 for marginal speedup at same VRAM). See references/rocmfpx-build-quant-bench.md Pattern 13 in the meshina corpus for the full sweep.

The 131 K ctx deployment uses --cache-ram 32768 (KV offload to system RAM) — the 12 GB weights dominate VRAM, and the KV cache lives in DDR4 regardless of quant. This is what makes long-context GLM REAP viable on 16 GB hardware.

Quick start

# Build llama.cpp with ROCmFPX
git clone https://github.com/charlie12345/ROCmFPX
cd ROCmFPX
cmake -S . -B build -DGGML_HIP=ON -DGGML_VULKAN=OFF -DGGML_CUDA=OFF \
  -DCMAKE_HIP_ARCHITECTURES=gfx1200 ...
cmake --build build --target llama-server llama-cli llama-quantize

# Serve (131 072 ctx, q8_0 KV [head_dim=576, turbo incompatible], KV offload, fa=on)
./build/bin/llama-server \
  -m GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN.gguf \
  -np 1 -c 131072 \
  -ctk q8_0 -ctv q8_0 \
  -kvo -cram 32768 -fa on

Reproduce the quant

SRC=/path/to/GLM-4.7-Flash-REAP-23B-A3B-BF16.gguf

~/ROCmFPX/build-rdna4/bin/llama-quantize \
  "$SRC" \
  GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN.gguf \
  Q4_0_ROCMFP4_STRIX_LEAN

Quantize time: ~3-5 min warm-cache, CPU-only. Source BF16 is 43 GB so the first cold quant is slower.

Files in this repo

File	What it is
`GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN.gguf`	The quant. Load only with a ROCmFPX `llama-server`.
`README.md`	This file
`raw-mesh-eval-glm-reap-23b-strix-lean.json`	`mesh_eval.py` output (2026-06-27 17:38 UTC)
`raw-hermes-loop-glm-reap-23b-strix-lean.json`	`hermes_loop_eval.py` output (2026-06-27 18:12 UTC)
`raw-hermes-loop-glm-reap-23b-q3_0_rocmfpx.json`	Same harness on the Q3_0 baseline (for the throughput comparison)
`ctx-scaling-glm-reap-strix-lean-64k-20260627-143748.json`	4 K → 32 K ctx scaling (64 K HTTP 400 — see caveat)
`quant-command.sh`	The exact `llama-quantize` invocation used

What's NOT in this repo (caveats)

Stock llama.cpp will not load this file. The ROCmFP4 weight format is unique to charlie12345/ROCmFPX.
No CUDA / non-AMD GPU bench. All measurements are RDNA4 (gfx1200).
64 K ctx is HTTP 400 on this server. The parent GLM-4.7-Flash has 200 K native ctx. Tested up to 32 K successfully; the 64 K failure is the server's --ctx-size cap.
No turbo3/4 KV cache on this model (head_dim=576). Hard architectural constraint, not a bug.
The source GGUF is Unsloth-distributed (per general.quantized_by = "Unsloth" in the metadata). The actual safetensors parent is cerebras/GLM-4.7-Flash-REAP-23B-A3B, derived from zai-org/GLM-4.7-Flash (the unpruned 200 K-ctx model). The chain is: safetensors → Unsloth GGUF → our STRIX_LEAN.
12 GB minimum VRAM. Doesn't fit on <12 GB cards. The mesh's 16 GB card runs it with ~3 GB headroom.
No MTP / speculative-decode bench on this file. GLM-4.7 architecture is not MTP-capable in this release.
No vision/multimodal test. This variant is text-only.
No quality benchmark (perplexity, MMLU, GSM8K). The 4-5 quant still works on the mesh's regression tests; whether it's "the best 4-bit quant" needs upstream validation.

Provenance

Source model: cerebras/GLM-4.7-Flash-REAP-23B-A3B — 23 B-A3B MoE, 25 % of experts pruned from zai-org/GLM-4.7-Flash using the REAP method
Source model license: mit
Source GGUF uploader: Unsloth (per general.quantized_by in the BF16 source metadata)
Quantizer: charlie12345/ROCmFPX main @ 11d76c2 (2026-06-27)
Quantizer license: MIT
Build hardware: Node B, AMD Ryzen 9 5900XT 16-core, Radeon RX 9060 XT 16 GB (gfx1200), ROCm 7.2.3, NixOS 25.11
Build tooling: NixOS 25.11, ROCm store paths dynamic-discovered. See the meshina repo's references/nixos-rocm-external-build-recipe.md for the build env setup.
Bench harnesses: scripts/mesh-bench/mesh_eval.py + scripts/mesh-bench/hermes_loop_eval.py + scripts/mesh-bench/ctx_scaling_bench.py from the meshina repo (private)
Original bench report: raw/benchmarks/2026-06-27-rocmfpx-validation/briefs/2026-06-27-rocmfpx-rdna4-16gb.md in the meshina repo

License

The GLM-4.7-Flash-REAP parent is MIT (per its HF model card).
The charlie12345/ROCmFPX quantizer is MIT.
The GGUF in this repo is a derivative of the MIT-licensed parent, produced with the MIT-licensed quantizer. The MIT license is preserved.

Downloads last month: 129

GGUF

Model size

23B params

Architecture

deepseek2

Hardware compatibility

We're not able to determine the quantization variants.

View all variants

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for maczzzzzz/GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN-GGUF

Base model

zai-org/GLM-4.7-Flash

Finetuned

cerebras/GLM-4.7-Flash-REAP-23B-A3B

Quantized

(22)

this model

Collection including maczzzzzz/GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN-GGUF

ROCmFPX Quants

Collection

Series of different model quants using; https://github.com/charlie12345/ROCmFPX https://github.com/charlie12345/rocmfp4-llama • 5 items • Updated 4 days ago

GLM-4.7-Flash-REAP-23B-A3B ROCmFPX STRIX_LEAN — GGUF

Scope of these benchmarks — read this first

What we measured

Agent-loop throughput — STRIX_LEAN vs Q3_0_ROCMFPX (hermes_loop, same harness, same source)

mesh_eval (raw JSON: raw-mesh-eval-glm-reap-23b-strix-lean.json)

hermes_loop (raw JSON: raw-hermes-loop-glm-reap-23b-strix-lean.json)

Context scaling (raw JSON: ctx-scaling-glm-reap-strix-lean-64k-20260627-143748.json)

KV cache type — head_dim=576 constraint (no turbo support)

Quick start

Reproduce the quant

Files in this repo

What's NOT in this repo (caveats)

Provenance

License

Model tree for maczzzzzz/GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN-GGUF

Collection including maczzzzzz/GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN-GGUF

mesh_eval (raw JSON: `raw-mesh-eval-glm-reap-23b-strix-lean.json`)

hermes_loop (raw JSON: `raw-hermes-loop-glm-reap-23b-strix-lean.json`)

Context scaling (raw JSON: `ctx-scaling-glm-reap-strix-lean-64k-20260627-143748.json`)

KV cache type — `head_dim=576` constraint (no turbo support)