How to use from
Pi
Start the llama.cpp server
# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf maczzzzzz/Ornith-1.0-9b-ROCmFPX-STRIX_LEAN-GGUF
Configure the model in Pi
# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "maczzzzzz/Ornith-1.0-9b-ROCmFPX-STRIX_LEAN-GGUF"
        }
      ]
    }
  }
}
Run Pi
# Start Pi in your project directory:
pi
Quick Links

Ornith-1.0-9B ROCmFPX STRIX_LEAN — GGUF

ROCmFPX Q4_0_ROCMFP4_STRIX_LEAN quant of deepreinforce-ai/Ornith-1.0-9B (qwen35 hybrid SSM+attention, 8.95 B params, 262 144 native ctx).

Built with charlie12345/ROCmFPX on a Radeon RX 9060 XT 16 GB (gfx1200), ROCm 7.2.3, NixOS 25.11. Quantized 2026-06-27 with build commit 11d76c2.

File Size Quant BPW
Ornith-1.0-9b-ROCmFPX-STRIX_LEAN.gguf 4.72 GB Q4_0_ROCMFP4_STRIX_LEAN (4-bit ROCmFP4 + Strix K/V + Q5_K embed) 4.42

This is not a stock llama.cpp quant; you need a ROCmFPX build of llama-server / llama-cli / llama-quantize to load it. The ROCmFP4 weight format is unknown to stock llama.cpp and will fail with unknown quantization.

Scope of these benchmarks — read this first

These numbers are a light baseline, not a thorough ROCmFPX evaluation. The mesh's bench framework is built for production agent workload regression-detection on the local stack, not for the kind of multi-axis sweep that upstream quant maintainers typically publish. Specifically:

  • Harness scope is bounded. The numbers below come from llama-bench ctx sweeps + the mesh's mesh_eval (6 tests, 4 deterministic + throughput) + hermes_loop_eval (5 agent scenarios). That's a regression suite, not a quality benchmark — it answers "does this quant still serve the mesh's agent stack correctly," not "is this the best possible 4-bit ROCmFP4 quant of this model."
  • Sample sizes are small. Throughput numbers are 3 reps on a single GPU; correctness is 5 prompts × 16 tokens; agent loop is 5 scenarios with one-shot generation. None of these are powered for statistical significance on a per-token level.
  • No perplexity / wikitext / MMLU / GSM8K. The mesh's stack isn't a quality benchmark — those are upstream ROCmFPX's territory. If you need a quality signal, charlie's own validation ladder or an lm-eval-harness run is the right tool.
  • Single GPU class. All measurements are on a 16 GB RDNA4 (RX 9060 XT, gfx1200). No Strix unified-memory, no CDNA, no multi-GPU, no Vulkan, no CUDA. Cross-hardware generalization is not implied.
  • No human eval. Quality is "byte-identical on factual / short deterministic outputs, divergent on high-entropy creative generation" — which is expected for any 4-bit quant, not a quality verdict on this one specifically.

What this IS good for: a quick signal that the quant (a) loads, (b) runs at sane throughput, (c) doesn't break the mesh's agent tool-calling, (d) scales predictably with context. What this is NOT good for: claiming "this is the best quant of this model," reproducing academic benchmark results, or substituting for upstream's validation work.

If you want the rigorous version, charlie's own ROCmFPX brief + the model's stock GGUF variants (e.g. bartowski/deepreinforce-ai_Ornith-1.0-9B-GGUF) are the place to look.

What we measured

Hardware: Node B, AMD Ryzen 9 5900XT 16-core, Radeon RX 9060 XT 16 GB (gfx1200), ROCm 7.2.3, NixOS 25.11 Software: charlie12345/ROCmFPX main @ 11d76c2 Source GGUF: ornith-1.0-9b-bf16.gguf (BF16, 17.9 GB) from deepreinforce-ai/Ornith-1.0-9B Same-class baseline: stock llama.cpp b9608 Q4_K_M quantized from the same BF16 source

Throughput vs stock Q4_K_M

llama-bench 3 reps, q8_0 KV, fa=on, ngpu-layers=99. Raw JSON: BENCH-strix-lean-ctx-sweep.json and BENCH-stock-Q4_K_M-ctx-sweep.json.

Ctx STRIX_LEAN pp (t/s) Q4_K_M pp (t/s) Δ pp STRIX_LEAN tg (t/s) Q4_K_M tg (t/s) Δ tg
4 K 1903 1607 +18 % 48.0 46.2 +4 %
8 K 1756 1513 +16 % 48.0 46.2 +4 %
16 K 1531 1341 +14 % 48.0 46.2 +4 %
32 K 1215 1093 +11 % 48.1 46.2 +4 %
64 K 862 798 +8 % 48.1 46.2 +4 %

Findings (small-sample, 3 reps — see the scope caveat above):

  • STRIX_LEAN beats stock Q4_K_M at every ctx tested on prompt processing.
  • Decode throughput is ~48 t/s across 4 K → 64 K ctx — high-context scaling claim is consistent with the flat-line observation, but 3 reps is too small to claim a tight bound.
  • Prompt-processing edge narrows as ctx grows (KV cache dominates at 64 K).
  • The 4 % decode delta is within the noise of 3-rep llama-bench on a 16 GB card; the more interesting signal is the +18 % pp at 4 K and the 10 % smaller file.

KV cache type sweep (added v0.5.134, head_dim=128)

131 K ctx, fa=on, kv-unified, -np 1:

KV type VRAM gen t/s
q8_0 (baseline) 8.7 GB 46.3
turbo4 (winner) 7.6 GB 46.6
turbo3 7.4 GB 44.3
q4_0_rocmfp4 7.7 GB 42.6
q4_0_rocmfp4_fast 7.6 GB 41.9

turbo4 is the production default for any head_dim=128 model in the ROCmFPX build. -1.1 GB VRAM, same speed. The turbo3/4 KV types are TheTom's turboquant, absorbed into ROCmFPX main via PlunderStruck commits d859c9e + d0141e8.

Correctness — STRIX_LEAN vs stock Q4_K_M (5 prompts × 16 tokens, top-10 logprobs, sequential A/B)

Prompt Argmax match Mean KL Text
Capital of France 16 / 16 (100 %) 0.35 identical
Fibonacci code 5 / 13 (38 %) 11.6 divergent
Story opener 1 / 16 (6 %) 16.4 divergent
15 × 37 math 16 / 16 (100 %) 0.11 identical
SQL injection 14 / 16 (87 %) 0.21 near-identical
TOTAL 67.5 % 5.51 weighted —

Byte-identical on factual / short deterministic outputs (KL < 0.4, argmax 87-100 %), high divergence on open-ended creative generation (KL 11-16, argmax 6-38 %). Divergence correlates with prompt entropy — where multiple tokens are near-equal, greedy argmax flips and KL amplifies. This is expected for any 4-bit quant.

Agent / loop validation (raw JSONs included)

mesh_eval.py 4 deterministic tests + throughput (raw-mesh-eval-ornith-strix-lean.json):

Test Result
gibberish (no degenerate repetition) OK
thinking_leak (no <think> leakage) CLEAN
tool_calling (single tool call, valid args) PASS — get_weather(location=Tokyo)
coding (merge_sorted_lists, runs + passes test) PASS
uncensored (no refusal on security-tools question) PASS
throughput (3×256-token gen, gen t/s mean) 47.1 t/s (±0.1)
overall_status PASS, 4/4

hermes_loop_eval.py 5 scenarios (raw-hermes-loop-ornith-strix-lean.json):

Scenario Result
single (one tool call) PASS — final answer correct
chained (calc → use result) PASS — 15 × 37 = 555
multi_step (compare 2 cities) PASS — table + conclusion
search (web search + extract) PASS — Eiffel Tower height
error_recovery (file not found) PARTIAL — model says the file doesn't exist (factually correct) but the test's strict final_answer_correct: false flagged it
overall_status PARTIAL, 4/5

The 4/5 loop is the error_recovery scenario's strict-match failure, not a quant defect. The model behaved correctly.

Quick start

# Build llama.cpp with ROCmFPX (the ROCmFPX-fork supports Q4_0_ROCMFP4_STRIX_LEAN weight type)
git clone https://github.com/charlie12345/ROCmFPX
cd ROCmFPX
cmake -S . -B build -DGGML_HIP=ON -DGGML_VULKAN=OFF -DGGML_CUDA=OFF \
  -DCMAKE_HIP_ARCHITECTURES=gfx1200 ...
cmake --build build --target llama-server llama-cli llama-quantize

# Serve (131 072 ctx, turbo4 KV for head_dim=128, fa=on)
./build/bin/llama-server \
  -m Ornith-1.0-9b-ROCmFPX-STRIX_LEAN.gguf \
  -np 1 -c 131072 \
  -ctk turbo4 -ctv turbo4 \
  -kvo -cram 32768 -fa on

Reproduce the quant

# Source (we used the BF16 gguf; any BF16/F16 gguf of the same parent works)
SRC=/mnt/e/llms-models-data/ornith/ornith-1.0-9b-bf16.gguf

# ROCmFPX llama-quantize (preset is built in; see `llama-quantize --help`)
~/ROCmFPX/build-rdna4/bin/llama-quantize \
  $SRC \
  Ornith-1.0-9b-ROCmFPX-STRIX_LEAN.gguf \
  Q4_0_ROCMFP4_STRIX_LEAN

Quantize time: ~6 min cold, <2 min warm-cache. CPU-only, no GPU required.

Files in this repo

File What it is
Ornith-1.0-9b-ROCmFPX-STRIX_LEAN.gguf The quant. Load only with a ROCmFPX llama-server.
README.md This file
raw-mesh-eval-ornith-strix-lean.json mesh_eval.py output (2026-06-27 19:51 UTC)
raw-hermes-loop-ornith-strix-lean.json hermes_loop_eval.py output (2026-06-27 19:52 UTC)
BENCH-strix-lean-ctx-sweep.json llama-bench ctx sweep (3 reps, 4 K → 64 K)
BENCH-stock-Q4_K_M-ctx-sweep.json Same sweep on the stock baseline
BENCH-kv-type-sweep.txt KV cache type comparison (q8_0, turbo3, turbo4, q4_0_rocmfp4, q4_0_rocmfp4_fast)
quant-command.sh The exact llama-quantize invocation used

What's NOT in this repo (caveats)

  • Stock llama.cpp will not load this file. The ROCmFP4 weight format is unique to charlie12345/ROCmFPX. Use that fork's llama-server/llama-cli/llama-quantize.
  • No CUDA / non-AMD GPU bench. All measurements are RDNA4 (gfx1200). Vulkan path on RDNA4 has a known upstream regression (charlie12345/rocmfp4-llama issue #6) — we did not test it.
  • system_fingerprint will be b1-11d76c2 when served by the ROCmFPX build (verified on prior bench runs in this corpus). If you see a different fingerprint, the wrong binary loaded the file.
  • No multi-GPU / tensor-parallel bench. 9 B params at 4.7 GB fits comfortably on a single 16 GB card; no need to split.
  • No MTP / speculative-decode bench on this file. Ornith 1.0 9B does not ship with MTP draft heads.
  • No vision/multimodal test. Ornith 1.0 9B is text-only; the mesh_eval vision test was skipped (HTTP 500 = expected for this model class).

Provenance

  • Source model: deepreinforce-ai/Ornith-1.0-9B — qwen35 hybrid SSM+attention, 8.95 B params, native ctx 262 144
  • Source model license: MIT (https://huggingface.co/deepreinforce-ai/Ornith-1.0-9B/blob/main/LICENSE)
  • Quantizer: charlie12345/ROCmFPX main @ 11d76c2 (2026-06-27)
  • Quantizer license: MIT
  • Build hardware: Node B, AMD Ryzen 9 5900XT 16-core, Radeon RX 9060 XT 16 GB (gfx1200), ROCm 7.2.3, NixOS 25.11
  • Build tooling: NixOS 25.11, ROCm store paths dynamic-discovered. See the meshina repo's references/nixos-rocm-external-build-recipe.md for the build env setup.
  • Bench harnesses: scripts/mesh-bench/mesh_eval.py + scripts/mesh-bench/hermes_loop_eval.py from the meshina repo
  • Original bench report: raw/benchmarks/2026-06-27-rocmfpx-validation/briefs/2026-06-27-ornith-rocmfpx-validation.md in the meshina repo

License

  • The Ornith 1.0 9B parent model is MIT (per its HF model card).
  • The charlie12345/ROCmFPX quantizer is MIT.
  • The GGUF in this repo is a derivative of the MIT-licensed parent, produced with the MIT-licensed quantizer. The MIT license is preserved.
Downloads last month
124
GGUF
Model size
9B params
Architecture
qwen35
Hardware compatibility
Log In to add your hardware

We're not able to determine the quantization variants.

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for maczzzzzz/Ornith-1.0-9b-ROCmFPX-STRIX_LEAN-GGUF

Quantized
(59)
this model

Collection including maczzzzzz/Ornith-1.0-9b-ROCmFPX-STRIX_LEAN-GGUF