--- license: gemma library_name: mlx base_model: yuxinlu1/gemma-4-12B-agentic-fable5-composer2.5-v2-3.5x-tau2-GGUF base_model_relation: quantized pipeline_tag: text-generation model_type: gemma4_unified tags: - nvfp4 - mlx - krill - gemma4 - gemma4_unified - apple-silicon - agentic - tool-use --- # gemma-4-12B-agentic-fable5-composer2.5-v2 — NVFP4 (MLX) > Original fine-tune by [yuxinlu1](https://huggingface.co/yuxinlu1) (`gemma-4-12B-agentic-fable5-composer2.5-v2-3.5x-tau2`); this repo is an NVFP4 (MLX) requant for fast local inference on Apple Silicon. A mixed-precision **NVFP4** conversion of the Gemma‑4‑12B *agentic* fine‑tune (optimized for tool‑use / τ²‑bench). The weights are plain MLX safetensors; built for and tested with **[Krill](https://github.com/srvsngh99/Krill)** — a pure Swift + MLX inference engine for Apple Silicon (no Python, no GGUF at inference). Loads in **~1.7 s** at **~6.8 GB**. ## The format - Bulk weights are **NVFP4** (4‑bit float, group_size 16); attention `o_proj` and the vision/audio projectors are kept at **8‑bit affine** (group_size 64). This "protected" mixed recipe recovers the quality uniform 4‑bit loses on those sensitive modules while keeping 4‑bit speed and size. - **Conversion:** bf16 safetensors → key‑remap to MLX layout → NVFP4 requant. No GGUF round‑trip. ## Compatibility (please read) This is an **MLX** checkpoint — not GGUF, not a HF/transformers checkpoint. To load it an engine needs (1) the `gemma4_unified` architecture (text+vision+audio) and (2) the mixed‑precision NVFP4 config (top‑level nvfp4 + per‑module 8‑bit overrides). Today that means **Krill**; it is **not** drop‑in for vanilla `mlx_lm`/`mlx_vlm`, and **not** loadable by llama.cpp/Ollama (GGUF) or transformers/vLLM. ## Install Krill & run ```bash # Homebrew: brew tap srvsngh99/krill && brew install krill # …or one-line installer (Apple Silicon): curl -fsSL https://raw.githubusercontent.com/srvsngh99/Krill/main/install.sh | sh krill pull srv-sngh/gemma-4-12B-agentic-fable5-composer2.5-v2-nvfp4 # by full path (alias TBD) krill run srv-sngh/gemma-4-12B-agentic-fable5-composer2.5-v2-nvfp4 "Write a Python LRU cache." krill serve --model srv-sngh/gemma-4-12B-agentic-fable5-composer2.5-v2-nvfp4 --port 57455 # OpenAI-compatible API KRILL_ENABLE_THINKING=1 krill run srv-sngh/gemma-4-12B-agentic-fable5-composer2.5-v2-nvfp4 "..." # reasoning channel ``` ## Benchmarks **pass@1 / accuracy**, single greedy pass, run with Krill on an M4 Pro. All three models are in this **same NVFP4 format** (true apples‑to‑apples). HumanEval+/MBPP+ are [EvalPlus](https://github.com/evalplus/evalplus) (stricter tests); MBPP = the 378‑problem EvalPlus set; GSM8K = 150‑problem subset, 8‑shot. **Not** EvalPlus‑leaderboard‑comparable. | Model | mode | HumanEval | HumanEval+ | MBPP | MBPP+ | GSM8K | |---|---|---|---|---|---|---| | Google gemma-4-12B-it (base) | off | 57.3 | 56.7 | 42.1 | 37.6 | 95.3 | | Google gemma-4-12B-it (base) | on | 48.8 | 48.8 | 49.5 | 43.9 | 90.7 | | coder v1 | off | 81.7 | — | 79.4 | — | 90.7 | | **agentic v2** ⟵ this model | off | 83.5 | 81.7 | — | — | — | | **agentic v2** ⟵ this model | on | 86.0 | 82.9 | — | — | — | > ⚠️ **Partial — benchmark run still in progress.** Empty cells (—) are filling in; this card updates as the full sweep completes. **Takeaways:** the code/agentic fine‑tunes massively out‑code the Google base on HumanEval/MBPP, while the base is stronger at math (GSM8K). Reasoning‑on helps the fine‑tunes but tends to *hurt* the base's coding (it over‑reasons and mangles the code block). Decode ≈ **28 tok/s**. ## Credits & license Fine‑tune © its original author ([yuxinlu1](https://huggingface.co/yuxinlu1) (`gemma-4-12B-agentic-fable5-composer2.5-v2-3.5x-tau2`)); base model is Google Gemma 4, under the [Gemma license](https://ai.google.dev/gemma/terms). This repo only changes quantization/packaging.