---
license: gemma
library_name: mlx
base_model: yuxinlu1/gemma-4-12B-agentic-fable5-composer2.5-v2-3.5x-tau2-GGUF
base_model_relation: quantized
pipeline_tag: text-generation
model_type: gemma4_unified
tags:
  - nvfp4
  - mlx
  - krill
  - gemma4
  - gemma4_unified
  - apple-silicon
  - agentic
  - tool-use
---

# gemma-4-12B-agentic-fable5-composer2.5-v2 — NVFP4 (MLX)

> Original fine-tune by [yuxinlu1](https://huggingface.co/yuxinlu1) (`gemma-4-12B-agentic-fable5-composer2.5-v2-3.5x-tau2`); this repo is an NVFP4 (MLX) requant for fast local inference on Apple Silicon.

A mixed-precision **NVFP4** conversion of the Gemma‑4‑12B *agentic* fine‑tune (optimized for tool‑use / τ²‑bench). The weights are plain MLX safetensors;
built for and tested with **[Krill](https://github.com/srvsngh99/Krill)** — a pure Swift + MLX
inference engine for Apple Silicon (no Python, no GGUF at inference). Loads in **~1.7 s** at **~6.8 GB**.

## The format

- Bulk weights are **NVFP4** (4‑bit float, group_size 16); attention `o_proj` and the vision/audio
  projectors are kept at **8‑bit affine** (group_size 64). This "protected" mixed recipe recovers the
  quality uniform 4‑bit loses on those sensitive modules while keeping 4‑bit speed and size.
- **Conversion:** bf16 safetensors → key‑remap to MLX layout → NVFP4 requant. No GGUF round‑trip.

## Compatibility (please read)

This is an **MLX** checkpoint — not GGUF, not a HF/transformers checkpoint. To load it an engine needs
(1) the `gemma4_unified` architecture (text+vision+audio) and (2) the mixed‑precision NVFP4 config
(top‑level nvfp4 + per‑module 8‑bit overrides). Today that means **Krill**; it is **not** drop‑in for
vanilla `mlx_lm`/`mlx_vlm`, and **not** loadable by llama.cpp/Ollama (GGUF) or transformers/vLLM.

## Install Krill & run

```bash
# Homebrew:
brew tap srvsngh99/krill && brew install krill
# …or one-line installer (Apple Silicon):
curl -fsSL https://raw.githubusercontent.com/srvsngh99/Krill/main/install.sh | sh

krill pull srv-sngh/gemma-4-12B-agentic-fable5-composer2.5-v2-nvfp4   # by full path (alias TBD)
krill run  srv-sngh/gemma-4-12B-agentic-fable5-composer2.5-v2-nvfp4 "Write a Python LRU cache."
krill serve --model srv-sngh/gemma-4-12B-agentic-fable5-composer2.5-v2-nvfp4 --port 57455     # OpenAI-compatible API
KRILL_ENABLE_THINKING=1 krill run srv-sngh/gemma-4-12B-agentic-fable5-composer2.5-v2-nvfp4 "..."   # reasoning channel
```

## Benchmarks

**pass@1 / accuracy**, single greedy pass, run with Krill on an M4 Pro. All three models are in this
**same NVFP4 format** (true apples‑to‑apples). HumanEval+/MBPP+ are
[EvalPlus](https://github.com/evalplus/evalplus) (stricter tests); MBPP = the 378‑problem EvalPlus set;
GSM8K = 150‑problem subset, 8‑shot. **Not** EvalPlus‑leaderboard‑comparable.

| Model | mode | HumanEval | HumanEval+ | MBPP | MBPP+ | GSM8K |
|---|---|---|---|---|---|---|
| Google gemma-4-12B-it (base) | off | 57.3 | 56.7 | 42.1 | 37.6 | 95.3 |
| Google gemma-4-12B-it (base) | on | 48.8 | 48.8 | 49.5 | 43.9 | 90.7 |
| coder v1 | off | 81.7 | — | 79.4 | — | 90.7 |
| **agentic v2** ⟵ this model | off | 83.5 | 81.7 | — | — | — |
| **agentic v2** ⟵ this model | on | 86.0 | 82.9 | — | — | — |

> ⚠️ **Partial — benchmark run still in progress.** Empty cells (—) are filling in; this card updates as the full sweep completes.

**Takeaways:** the code/agentic fine‑tunes massively out‑code the Google base on HumanEval/MBPP, while
the base is stronger at math (GSM8K). Reasoning‑on helps the fine‑tunes but tends to *hurt* the base's
coding (it over‑reasons and mangles the code block). Decode ≈ **28 tok/s**.

## Credits & license

Fine‑tune © its original author ([yuxinlu1](https://huggingface.co/yuxinlu1) (`gemma-4-12B-agentic-fable5-composer2.5-v2-3.5x-tau2`)); base model is Google Gemma 4, under the
[Gemma license](https://ai.google.dev/gemma/terms). This repo only changes quantization/packaging.