Instructions to use maczzzzzz/GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- llama-cpp-python
How to use maczzzzzz/GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN-GGUF with llama-cpp-python:
# !pip install llama-cpp-python from llama_cpp import Llama llm = Llama.from_pretrained( repo_id="maczzzzzz/GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN-GGUF", filename="GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN.gguf", )
llm.create_chat_completion( messages = "No input example has been defined for this model task." )
- Notebooks
- Google Colab
- Kaggle
- Local Apps Settings
- llama.cpp
How to use maczzzzzz/GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN-GGUF with llama.cpp:
Install (macOS, Linux)
curl -LsSf https://llama.app/install.sh | sh # Start a local OpenAI-compatible server with a web UI: llama serve -hf maczzzzzz/GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN-GGUF # Run inference directly in the terminal: llama cli -hf maczzzzzz/GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN-GGUF
Install from WinGet (Windows)
winget install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama serve -hf maczzzzzz/GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN-GGUF # Run inference directly in the terminal: llama cli -hf maczzzzzz/GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN-GGUF
Use pre-built binary
# Download pre-built binary from: # https://github.com/ggerganov/llama.cpp/releases # Start a local OpenAI-compatible server with a web UI: ./llama-server -hf maczzzzzz/GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN-GGUF # Run inference directly in the terminal: ./llama-cli -hf maczzzzzz/GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN-GGUF
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git cd llama.cpp cmake -B build cmake --build build -j --target llama-server llama-cli # Start a local OpenAI-compatible server with a web UI: ./build/bin/llama-server -hf maczzzzzz/GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN-GGUF # Run inference directly in the terminal: ./build/bin/llama-cli -hf maczzzzzz/GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN-GGUF
Use Docker
docker model run hf.co/maczzzzzz/GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN-GGUF
- LM Studio
- Jan
- Ollama
How to use maczzzzzz/GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN-GGUF with Ollama:
ollama run hf.co/maczzzzzz/GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN-GGUF
- Unsloth Studio
How to use maczzzzzz/GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN-GGUF with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for maczzzzzz/GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN-GGUF to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for maczzzzzz/GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN-GGUF to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required # Open https://huggingface.co/spaces/unsloth/studio in your browser # Search for maczzzzzz/GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN-GGUF to start chatting
- Pi
How to use maczzzzzz/GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN-GGUF with Pi:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama serve -hf maczzzzzz/GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN-GGUF
Configure the model in Pi
# Install Pi: npm install -g @mariozechner/pi-coding-agent # Add to ~/.pi/agent/models.json: { "providers": { "llama-cpp": { "baseUrl": "http://localhost:8080/v1", "api": "openai-completions", "apiKey": "none", "models": [ { "id": "maczzzzzz/GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN-GGUF" } ] } } }Run Pi
# Start Pi in your project directory: pi
- Hermes Agent new
How to use maczzzzzz/GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN-GGUF with Hermes Agent:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama serve -hf maczzzzzz/GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN-GGUF
Configure Hermes
# Install Hermes: curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash hermes setup # Point Hermes at the local server: hermes config set model.provider custom hermes config set model.base_url http://127.0.0.1:8080/v1 hermes config set model.default maczzzzzz/GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN-GGUF
Run Hermes
hermes
- Atomic Chat new
- OpenClaw new
How to use maczzzzzz/GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN-GGUF with OpenClaw:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama serve -hf maczzzzzz/GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN-GGUF
Configure OpenClaw
# Install OpenClaw: npm install -g openclaw@latest # Register the local server and set it as the default model: openclaw onboard --non-interactive --mode local \ --auth-choice custom-api-key \ --custom-base-url http://127.0.0.1:8080/v1 \ --custom-model-id "maczzzzzz/GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN-GGUF" \ --custom-provider-id llama-cpp \ --custom-compatibility openai \ --custom-text-input \ --accept-risk \ --skip-health
Run OpenClaw
openclaw agent --local --agent main --message "Hello from Hugging Face"
- Docker Model Runner
How to use maczzzzzz/GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN-GGUF with Docker Model Runner:
docker model run hf.co/maczzzzzz/GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN-GGUF
- Lemonade
How to use maczzzzzz/GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN-GGUF with Lemonade:
Pull the model
# Download Lemonade from https://lemonade-server.ai/ lemonade pull maczzzzzz/GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN-GGUF
Run and chat with the model
lemonade run user.GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN-GGUF-{{QUANT_TAG}}List all available models
lemonade list
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for maczzzzzz/GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN-GGUF to start chattingUsing HuggingFace Spaces for Unsloth
# No setup required# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for maczzzzzz/GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN-GGUF to start chatting- GLM-4.7-Flash-REAP-23B-A3B ROCmFPX STRIX_LEAN — GGUF
- Scope of these benchmarks — read this first
- What we measured
- Agent-loop throughput — STRIX_LEAN vs Q3_0_ROCMFPX (hermes_loop, same harness, same source)
- mesh_eval (raw JSON:
raw-mesh-eval-glm-reap-23b-strix-lean.json) - hermes_loop (raw JSON:
raw-hermes-loop-glm-reap-23b-strix-lean.json) - Context scaling (raw JSON:
ctx-scaling-glm-reap-strix-lean-64k-20260627-143748.json) - KV cache type —
head_dim=576constraint (no turbo support)
- Quick start
- Reproduce the quant
- Files in this repo
- What's NOT in this repo (caveats)
- Provenance
- License
- Scope of these benchmarks — read this first
GLM-4.7-Flash-REAP-23B-A3B ROCmFPX STRIX_LEAN — GGUF
ROCmFPX Q4_0_ROCMFP4_STRIX_LEAN quant of cerebras/GLM-4.7-Flash-REAP-23B-A3B (GLM-4.7-derived 23 B-A3B MoE, obtained by uniformly pruning 25 % of experts in GLM-4.7-Flash using the REAP method).
Built with charlie12345/ROCmFPX on a Radeon RX 9060 XT 16 GB (gfx1200), ROCm 7.2.3, NixOS 25.11. Quantized 2026-06-27 with build commit 11d76c2.
| File | Size | Quant | BPW |
|---|---|---|---|
GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN.gguf |
12 GB | Q4_0_ROCMFP4_STRIX_LEAN (4-bit ROCmFP4 + Strix K/V + Q5_K embed) |
4.38 |
This is not a stock llama.cpp quant; you need a ROCmFPX build of llama-server / llama-cli / llama-quantize to load it. Stock llama.cpp will reject the file with unknown quantization.
Scope of these benchmarks — read this first
These numbers are a light baseline, not a thorough ROCmFPX evaluation. The mesh's bench framework is built for production agent workload regression-detection on the local stack, not for the kind of multi-axis sweep that upstream quant maintainers typically publish. Specifically:
- Harness scope is bounded. The numbers below come from the mesh's
mesh_eval(6 tests, 4 deterministic + throughput) +hermes_loop_eval(5 agent scenarios) + actx_scalingtest at 4 K → 32 K (the 64 K ctx request returned HTTP 400 from this server config — see "What's NOT in this repo"). - Sample sizes are small. Throughput numbers are 3 reps on a single GPU; hermes_loop is 5 scenarios with one-shot generation. None are powered for statistical significance on a per-token level.
- No perplexity / wikitext / MMLU / GSM8K. The mesh's stack isn't a quality benchmark — those are upstream ROCmFPX's territory.
- Single GPU class. All measurements are on a 16 GB RDNA4 (RX 9060 XT, gfx1200). No Strix unified-memory, no CDNA, no multi-GPU, no Vulkan, no CUDA. Cross-hardware generalization is not implied.
- No human eval. "Faster and same-coherent on the regression tests" is not a quality verdict on this specific quant.
- Heaviest model in the mesh. GLM REAP 23B at 12 GB is the biggest single-model quant the mesh can serve. On smaller GPUs (<12 GB VRAM), this file will not fit. The 16 GB card runs it with ~3 GB headroom.
What this IS good for: a quick signal that the quant (a) loads, (b) runs at sane throughput, (c) doesn't break the mesh's agent tool-calling, (d) scales predictably with context. What this is NOT good for: claiming "this is the best quant of this model," reproducing academic benchmark results, or substituting for upstream's validation work.
For a rigorous view, the parent repo cerebras/GLM-4.7-Flash-REAP-23B-A3B, the upstream zai-org/GLM-4.7-Flash, and the model's stock GGUF variants (e.g. on unsloth/) are the place to look.
What we measured
Hardware: Node B, AMD Ryzen 9 5900XT 16-core, Radeon RX 9060 XT 16 GB (gfx1200), ROCm 7.2.3, NixOS 25.11
Software: charlie12345/ROCmFPX main @ 11d76c2
Source GGUF: GLM-4.7-Flash-REAP-23B-A3B-BF16.gguf (BF16, 43 GB) — the Unsloth-distributed GGUF of the Cerebras-pruned safetensors
Same-stack comparison: Q3_0_ROCMFPX (3-bit ROCmFPX experimental, 12 GB file) on the same source
Agent-loop throughput — STRIX_LEAN vs Q3_0_ROCMFPX (hermes_loop, same harness, same source)
| Scenario | STRIX_LEAN (t/s) | Q3_0_ROCMFPX (t/s) | Δ |
|---|---|---|---|
single (one tool call) |
38.5 | 23.1 | +67 % |
chained (calc → use result) |
35.8 | 24.4 | +47 % |
multi_step (compare 2 cities) |
50.8 | 37.7 | +35 % |
search (web search + extract) |
46.8 | 32.5 | +44 % |
error_recovery (file not found) |
48.9 | 34.5 | +42 % |
| Mean | 44.2 | 30.4 | +45 % |
Both quants pass all 5 scenarios. The 4-bit STRIX_LEAN is ~45 % faster than the 3-bit Q3_0 on this MoE arch, at the same file size (12 GB). This is the headline finding for this model.
mesh_eval (raw JSON: raw-mesh-eval-glm-reap-23b-strix-lean.json)
| Test | Result |
|---|---|
gibberish |
OK |
thinking_leak |
CLEAN |
tool_calling (single call) |
PASS — get_weather(location=Tokyo) |
coding (merge_sorted_lists) |
PASS — runs, tests pass |
uncensored |
PASS — no refusal |
throughput (3×256-token gen) |
62.8 t/s mean, ±0.6 stdev |
overall_status |
PASS, 4/4 |
hermes_loop (raw JSON: raw-hermes-loop-glm-reap-23b-strix-lean.json)
| Scenario | Result |
|---|---|
single |
PASS — final answer correct |
chained (calc → use) |
PASS — 15 × 37 = 555 |
multi_step (compare 2 cities) |
PASS — Tokyo/London table + conclusion |
search (web search + extract) |
PASS — Eiffel Tower height |
error_recovery (file not found) |
PASS (clean) |
overall_status |
PASS, 5/5 |
Context scaling (raw JSON: ctx-scaling-glm-reap-strix-lean-64k-20260627-143748.json)
| Ctx target | pp t/s | tg t/s | Result |
|---|---|---|---|
| 4 K | 668.9 | 50.0 | OK, coherent (4) |
| 32 K | 166.2 | 50.0 | OK, coherent |
| 64 K | — | — | HTTP 400 (server-side ctx cap) |
Findings:
- Decode throughput holds at 50 t/s across 4 K → 32 K ctx.
- Prompt processing degrades sharply: 4 K → 32 K drops from 669 → 166 pp t/s (4× slower). This is a known property of the GLM-4.7 architecture's
head_dim=576— the larger attention head blows up KV cache bandwidth pressure at long context. - The 64 K failure is the server's
--ctx-sizecap, not a model limit. The parent GLM-4.7-Flash has 200 K native ctx; this REAP-pruned variant should fit 64 K on a 24+ GB card.
KV cache type — head_dim=576 constraint (no turbo support)
This model has head_dim=576 (GLM-4.7 architecture). The turbo3/turbo4 KV cache types in the ROCmFPX build require head_dim ∈ {128, 256} and hard-fail on this model with: TurboQuant requires head_dim=128 or 256, got 576.
Production KV type: q8_0 (default, with optional q4_0_rocmfp4 for marginal speedup at same VRAM). See references/rocmfpx-build-quant-bench.md Pattern 13 in the meshina corpus for the full sweep.
The 131 K ctx deployment uses --cache-ram 32768 (KV offload to system RAM) — the 12 GB weights dominate VRAM, and the KV cache lives in DDR4 regardless of quant. This is what makes long-context GLM REAP viable on 16 GB hardware.
Quick start
# Build llama.cpp with ROCmFPX
git clone https://github.com/charlie12345/ROCmFPX
cd ROCmFPX
cmake -S . -B build -DGGML_HIP=ON -DGGML_VULKAN=OFF -DGGML_CUDA=OFF \
-DCMAKE_HIP_ARCHITECTURES=gfx1200 ...
cmake --build build --target llama-server llama-cli llama-quantize
# Serve (131 072 ctx, q8_0 KV [head_dim=576, turbo incompatible], KV offload, fa=on)
./build/bin/llama-server \
-m GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN.gguf \
-np 1 -c 131072 \
-ctk q8_0 -ctv q8_0 \
-kvo -cram 32768 -fa on
Reproduce the quant
SRC=/path/to/GLM-4.7-Flash-REAP-23B-A3B-BF16.gguf
~/ROCmFPX/build-rdna4/bin/llama-quantize \
"$SRC" \
GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN.gguf \
Q4_0_ROCMFP4_STRIX_LEAN
Quantize time: ~3-5 min warm-cache, CPU-only. Source BF16 is 43 GB so the first cold quant is slower.
Files in this repo
| File | What it is |
|---|---|
GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN.gguf |
The quant. Load only with a ROCmFPX llama-server. |
README.md |
This file |
raw-mesh-eval-glm-reap-23b-strix-lean.json |
mesh_eval.py output (2026-06-27 17:38 UTC) |
raw-hermes-loop-glm-reap-23b-strix-lean.json |
hermes_loop_eval.py output (2026-06-27 18:12 UTC) |
raw-hermes-loop-glm-reap-23b-q3_0_rocmfpx.json |
Same harness on the Q3_0 baseline (for the throughput comparison) |
ctx-scaling-glm-reap-strix-lean-64k-20260627-143748.json |
4 K → 32 K ctx scaling (64 K HTTP 400 — see caveat) |
quant-command.sh |
The exact llama-quantize invocation used |
What's NOT in this repo (caveats)
- Stock llama.cpp will not load this file. The ROCmFP4 weight format is unique to charlie12345/ROCmFPX.
- No CUDA / non-AMD GPU bench. All measurements are RDNA4 (gfx1200).
- 64 K ctx is HTTP 400 on this server. The parent GLM-4.7-Flash has 200 K native ctx. Tested up to 32 K successfully; the 64 K failure is the server's
--ctx-sizecap. - No turbo3/4 KV cache on this model (head_dim=576). Hard architectural constraint, not a bug.
- The source GGUF is Unsloth-distributed (per
general.quantized_by = "Unsloth"in the metadata). The actual safetensors parent iscerebras/GLM-4.7-Flash-REAP-23B-A3B, derived fromzai-org/GLM-4.7-Flash(the unpruned 200 K-ctx model). The chain is: safetensors → Unsloth GGUF → our STRIX_LEAN. - 12 GB minimum VRAM. Doesn't fit on <12 GB cards. The mesh's 16 GB card runs it with ~3 GB headroom.
- No MTP / speculative-decode bench on this file. GLM-4.7 architecture is not MTP-capable in this release.
- No vision/multimodal test. This variant is text-only.
- No quality benchmark (perplexity, MMLU, GSM8K). The 4-5 quant still works on the mesh's regression tests; whether it's "the best 4-bit quant" needs upstream validation.
Provenance
- Source model:
cerebras/GLM-4.7-Flash-REAP-23B-A3B— 23 B-A3B MoE, 25 % of experts pruned fromzai-org/GLM-4.7-Flashusing the REAP method - Source model license: mit
- Source GGUF uploader: Unsloth (per
general.quantized_byin the BF16 source metadata) - Quantizer: charlie12345/ROCmFPX
main@11d76c2(2026-06-27) - Quantizer license: MIT
- Build hardware: Node B, AMD Ryzen 9 5900XT 16-core, Radeon RX 9060 XT 16 GB (gfx1200), ROCm 7.2.3, NixOS 25.11
- Build tooling: NixOS 25.11, ROCm store paths dynamic-discovered. See the
meshinarepo'sreferences/nixos-rocm-external-build-recipe.mdfor the build env setup. - Bench harnesses:
scripts/mesh-bench/mesh_eval.py+scripts/mesh-bench/hermes_loop_eval.py+scripts/mesh-bench/ctx_scaling_bench.pyfrom the meshina repo (private) - Original bench report:
raw/benchmarks/2026-06-27-rocmfpx-validation/briefs/2026-06-27-rocmfpx-rdna4-16gb.mdin the meshina repo
License
- The GLM-4.7-Flash-REAP parent is MIT (per its HF model card).
- The
charlie12345/ROCmFPXquantizer is MIT. - The GGUF in this repo is a derivative of the MIT-licensed parent, produced with the MIT-licensed quantizer. The MIT license is preserved.
- Downloads last month
- 129
We're not able to determine the quantization variants.
Model tree for maczzzzzz/GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN-GGUF
Base model
zai-org/GLM-4.7-Flash
Install Unsloth Studio (macOS, Linux, WSL)
# Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for maczzzzzz/GLM-4.7-Flash-REAP-23B-A3B-ROCmFPX-STRIX_LEAN-GGUF to start chatting