💻🤖 Gemma4-12B v2 — Coding + Agentic Edition ✨

🐣 Tiny footprint, big brain — a local coding & tool-using agent for everyone

No matter your GPU. No matter your RAM. With ~4.5 GB of VRAM or unified memory free, you can run your own private, offline coding agent right now. 🚀 v2 is the big agentic upgrade — it reads, reasons, uses tools, and works through multi-step technical tasks before it acts. 🧠🛠️ All local, all yours, no API, no cloud.

📊 The headline — it works as an agent (tau2-bench)

v2 is built for coding + agentic work — writing code, running commands, using tools, debugging, multi-step technical tasks. The clearest signal is tau2-bench telecom, an agentic tool-use benchmark whose diagnose → fix → verify loop mirrors real terminal/debugging work:

| tau2-bench telecom · 20 tasks · local, same harness, all Q8_0 | score |
| --- | --- |
| official `gemma-4-12B-it` (base) | ~15% |
| 🟢 Gemma4-12B v2 (this model) | ~55% |

→ Roughly 3.5× higher than the base model on technical-agentic tasks. 🎯 Want the full story — why telecom, how the two models fail differently, the honest caveats, and the trade-offs (including general knowledge)? It's all broken down further below. 👇

🚀 Announcements

📌 Hitting a problem? Please check my pinned discussion first. ~99% of issues are a client/sampler config, not the weights — and they have a quick fix there. For example: garbled or repeating 0000… output almost always means no repetition penalty (set rep_pen 1.1, temp 1.0); and leaked <|tool_call> / <|channel> tokens mean your front-end isn't parsing Gemma 4's native tool format (use llama.cpp --jinja). If your question isn't covered, don't hesitate to open a discussion — I read them and reply as fast as I can. 💬

📦 No Q2_K this release. I finished a Q2_K (imatrix) build, but it didn't hold up under real stress-testing, so I'm holding it back — I only ship a quant once I'm confident it's genuinely good. Smallest reliable option is Q3_K_M; Q4_K_M is the recommended sweet spot. 🙏

🔮 v3 is already on the way. Honestly? Even I didn't expect the post-training jump to be this large — so I'm pushing further. v3 keeps the coding + agentic focus and aims higher still. Stay tuned! 🎉

🐘 And a bigger sibling is coming — Qwen3.6-27B. I've also started fine-tuning Qwen3.6-27B with the same coding + agentic recipe, for those of you who do have the headroom and want more raw capability. But I haven't forgotten what this project is about: a 27B may be too heavy for some of your GPUs / RAM. So this is not a replacement — I'm pushing v3 (this 12B line) in parallel, at the same time, and it will only get stronger. 💪 No matter your hardware, you'll have a model that fits. 💚

💚 A personal note — thank you, and a few honest words (please read)

First, a huge thank-you for all the data and help you've shared. 🙏 The bittersweet part: none of us saw it coming that Fable 5 would be retired — and only my own dataset holds Fable 5's genuine, self-authored chain-of-thought. So for every dataset the community contributed, I rebuilt the missing reasoning from scratch with Opus 4.8 (xhigh). It may diverge from the original Fable 5 traces, but it was the only workable path — and the improvement turned out really, really huge (it nearly launched me out of my chair 😄). The benchmark numbers are right above. 👆

Second — I've tried to reply to every community comment, and I've openly owned v1's training problems. Truly, thank you: your feedback is what lets me improve. 💚

Because v1 hit #1 trending, it also attracted some insults and trolling. I'll say this gently but firmly: real criticism is always welcome here; pure insults are not. This is a local model that lets anyone run a capable AI on tiny RAM/VRAM, at zero API cost and fully private; I even open-sourced the full safetensors master to study and build on. If something's off, open a discussion about the actual problem: I genuinely want to hear it and I'll act on it. But comments that are only insults help no one, and I'll remove them without hesitation. 🙏

Please remember: I'm one person — not a lab shipping an "open" model for marketing or to monetize later. I don't advertise. I build this for you on my own time and my own money: synthesizing data, reviewing and cleaning it by hand, splitting and re-segmenting it (this round I even built a dynamic context-window pass to keep the agent's read-before-act steps intact), reading the latest papers, then training → evaluating → training → evaluating. It burned through an entire Claude Max 20× plan (I keep a separate Pro for my own work), and v2 alone cost 40+ hours — even with Opus 4.8, the data threw constant curveballs I had to verify myself. Thank you, truly. 🐾

🔬 The benchmarks, in detail (tau2-bench)

I evaluated v2 on tau2-bench (an agentic tool-use benchmark). I did not run the whole suite — it's very time-consuming — so I focused on the single domain that best matches what v2 is for.

Why tau2-bench telecom? Telecom troubleshooting makes the agent diagnose with read/inspect tools → pinpoint the issue → apply a fix → verify it — structurally the same loop as real terminal/debugging work (check state → diagnose → fix → confirm). That's exactly what this model is meant to be good at, which makes it the right yardstick for v2 (much more so than a shopping/customer-service domain).

| tau2-bench telecom · 20 tasks · local, same harness, all Q8_0 | score |
| --- | --- |
| official `gemma-4-12B-it` (base) | ~15% |
| 🟢 Gemma4-12B v2 (this model) | ~55% |

→ Roughly 3.5× higher than the base model on technical-agentic tasks. 🎯
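As a quick sanity check on that ratio, the sketch below converts the approximate scores into task counts. It assumes the rounded figures from the table (~15% and ~55% of 20 tasks); per-task results are not listed here, so the counts are implied rather than reported.

```python
# Approximate scores from the table above (both are rounded "~" values).
TASKS = 20
base_score = 0.15
v2_score = 0.55

# With 20 tasks, each solved task moves the score by 5 percentage points.
points_per_task = 100 / TASKS

# Implied number of solved tasks (an inference from the rounded scores).
base_solved = round(base_score * TASKS)
v2_solved = round(v2_score * TASKS)

# Ratio of the two rounded scores.
ratio = v2_score / base_score

print(f"one task = {points_per_task:.0f} points")
print(f"base ~{base_solved}/{TASKS}, v2 ~{v2_solved}/{TASKS}")
print(f"ratio ~{ratio:.2f}x")
```

On these rounded inputs the ratio comes out a little above the quoted "roughly 3.5×". Because one task is worth 5 points at this sample size, a single task flipping either way shifts the ratio noticeably.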

Grounded, not made-up. Independently, a coding/terminal fabrication probe (tasks that deliberately tempt the model to invent file paths / function signatures / values) found that v2, like the base model, grounds before it acts: it runs grep/read/ls first and doesn't make things up (0% fabrication, on par with the base model).

The interesting part — how they fail. The base model gives up early: on this run it bailed to a human agent 10 times (transfer_to_human) instead of finishing the fix. v2 keeps going — it stays in the loop and works the problem the way a much bigger model would, which is exactly why it solves so many more. It's not perfect yet: it still flails a little sometimes (over-trying, retrying). And some of the remaining misses are actually a bug in the benchmark's own APN tool (it throws on inputs it should handle gracefully), not the model. To be clear: I will not patch the benchmark's tools or leak its test questions just to inflate my score — I'd rather report an honest number and improve the model itself. More training is coming in v3. 🔧

About retail (customer-service shopping): on tau2-bench retail, the base model scores a bit higher than v2. This is fully expected and by design. Retail is pure customer-service (look up a user, process an order) — not what this model is for. v2 is specialized for coding / terminal / technical-agentic work, and on those (telecom) it dramatically outperforms the base. Need a customer-service bot? This isn't it. Need a local coding/agentic model? It is. 💚

Let's keep it honest about scale. Today's frontier models — think mimo-v2.5-pro or Opus 4.8 — all land 90%+ on this telecom benchmark. They're also enormous. For a 12B model, my rough guess is that v3 might top out somewhere around 60–70% (emphasis on guess — I haven't even started v3 yet). So let's be clear-eyed: there's still a real gap to the frontier. But keep the scale in mind — this is a 12B model running on your own machine, and narrowing that gap as much as possible at this size is the whole point. 💪

And the trade-off — there's no free lunch. I also ran a general-knowledge benchmark (MMLU-Pro), and v2 lands a little below the base model there. That's completely normal and expected for a focused fine-tune: when you push hard on coding + agentic, you trade a sliver of broad-knowledge breadth for it. Need a generalist? Try my own general-purpose Claude Opus 4.6/4.8 distillation — or the original google/gemma-4-12B-it base. Need a local coding/agentic worker? That's what v2 is tuned for.

🔬 Methodology, honestly: these are local, same-harness, relative numbers (all models tested at Q8_0, greedy decoding, self-simulated user, 20 tasks). They are not directly comparable to published tau2-bench leaderboard figures (different user-simulator, full task sets, full precision) — local self-eval runs systematically lower than published scores. Read them as "v2 vs the base model under identical conditions", which is the comparison that actually matters here.

📚 What's new in v2 (training)

v2 continues from the v1 coder and adds a big agentic push — the piece v1 was missing:

🛠️ Agentic / terminal — real multi-step tool-use trajectories (read → reason → act → verify), in Gemma 4's native tool protocol. This is what drove the tau2-bench telecom jump, and it fixes v1's "stops after the first step" behavior.
💻 Coding — verified chain-of-thought over Python tasks (real CoT, gated on passing tests) plus the Fable-5-redo set for the hard cases.
📚 General — a curated slice of reasoning/instruction data to keep broad competence.

All reasoning is distilled CoT (see the personal note above on how the Fable 5 traces were rebuilt with Opus 4.8).

📦 Pick your size (GGUF quants)

| Quant | Size | Vibe |
| --- | --- | --- |
| 🟡 Q3_K_M | 5.7 GB | great for 8 GB VRAM |
| 🔵 Q4_K_M | 6.87 GB | the sweet spot 👌 (recommended) |
| 🟣 Q6_K | 9.11 GB | near-lossless |
| ⚪ Q8_0 | 11.8 GB | basically full quality |

ℹ️ No Q2_K this release — it didn't pass stress-testing yet (see Announcements). Smallest reliable quant = Q3_K_M.
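One rough way to choose from the table is to take the largest quant whose file still leaves some memory spare. The sketch below does that with the sizes listed above. The `pick_quant` helper and its 1.5 GB `headroom_gb` default are hypothetical: real overhead depends on context size, KV cache settings, and backend, and partial GPU offload is ignored here.

```python
from typing import Optional

# File sizes in GB, copied from the table above.
QUANT_SIZES_GB = {
    "Q3_K_M": 5.7,
    "Q4_K_M": 6.87,
    "Q6_K": 9.11,
    "Q8_0": 11.8,
}


def pick_quant(free_memory_gb: float, headroom_gb: float = 1.5) -> Optional[str]:
    """Return the largest quant whose file plus headroom fits, or None.

    headroom_gb is a hypothetical allowance for the KV cache and runtime
    buffers, not a measured value.
    """
    fitting = [
        (size, name)
        for name, size in QUANT_SIZES_GB.items()
        if size + headroom_gb <= free_memory_gb
    ]
    # max() compares by size first, so this picks the biggest file that fits.
    return max(fitting)[1] if fitting else None


if __name__ == "__main__":
    for budget in (8.0, 12.0, 16.0):
        print(budget, "GB ->", pick_quant(budget))
```

With the default headroom, an 8 GB budget resolves to Q3_K_M, which lines up with the "great for 8 GB VRAM" note in the table.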

🚀 How to run it

Option A — llama.cpp (recommended) 🦙

⚠️ Needs a recent llama.cpp (this is the gemma4_unified architecture — older builds won't load it).

@echo off
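cd /d C:\llama.cpp
REM Full GPU offload, 16K context, recommended sampling defaults.
REM --jinja applies the model's chat template (needed for tool calls).
REM --host 0.0.0.0 listens on all network interfaces; use 127.0.0.1 to keep it local.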
llama-server.exe ^
  -m C:\models\gemma4-v2-Q4_K_M.gguf ^
  --ctx-size 16384 ^
  --n-gpu-layers 99 ^
  --no-mmap -fa on ^
  --jinja ^
  --temp 1.0 --top-p 0.95 --top-k 64 ^
  --host 0.0.0.0 --port 18080
pause

🛠️ Agentic use: pass your tools via the OpenAI tools field (works with --jinja). v2 emits structured tool-calls in Gemma 4's native protocol and is happy in agent loops (read/grep/edit/run, then verify).
🖱️ One-click apps: LM Studio / Jan / Ollama — import the GGUF, pick a quant, go.
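To make the agentic-use note concrete, here is a minimal sketch of a request that offers one tool through the OpenAI `tools` field, plus a helper that reads tool calls back out of the reply. The `read_file` tool is hypothetical and only illustrates the schema. The response shape assumed here is the standard OpenAI chat-completions format, which the server is expected to return when started with --jinja.

```python
import json
from typing import Any

# Hypothetical tool definition in the OpenAI function-calling schema.
READ_FILE_TOOL = {
    "type": "function",
    "function": {
        "name": "read_file",
        "description": "Return the contents of a text file.",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}


def build_request(user_message: str) -> dict[str, Any]:
    """Build a chat-completions payload that offers the tool to the model."""
    return {
        "messages": [{"role": "user", "content": user_message}],
        "tools": [READ_FILE_TOOL],
    }


def extract_tool_calls(response: dict[str, Any]) -> list[tuple[str, dict[str, Any]]]:
    """Return (name, arguments) pairs from an OpenAI-style response."""
    message = response["choices"][0]["message"]
    calls = []
    for call in message.get("tool_calls") or []:
        fn = call["function"]
        # In the OpenAI format, arguments arrive as a JSON-encoded string.
        calls.append((fn["name"], json.loads(fn["arguments"])))
    return calls
```

The payload would be POSTed to the server's /v1/chat/completions endpoint (port 18080 in the launch script above). In an agent loop, each extracted call is executed and its result sent back as a `tool` role message before asking the model to continue.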

🧠 Thinking mode

v2 thinks in Gemma's native thought channel before answering (keep enable_thinking=true; the default chat template handles it). Recommended sampling: temp 1.0, top_p 0.95, top_k 64; for coding you can also go greedy (temp 0).
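The recommended sampling can also be set per request instead of on the command line. The sketch below is one way to do that from Python using only the standard library. It assumes a llama-server instance on port 18080 (as in the launch script above) and assumes the server accepts the llama.cpp-specific `top_k` and `repeat_penalty` fields in the request body. The 1.1 repetition penalty is the value suggested in the Announcements.

```python
import json
import urllib.request

# Port taken from the launch script above; adjust if yours differs.
SERVER_URL = "http://localhost:18080/v1/chat/completions"


def sampling_payload(prompt: str, greedy: bool = False) -> dict:
    """Build a request body with the recommended sampler settings."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        # temp 1.0 by default; temp 0 gives greedy decoding for coding.
        "temperature": 0.0 if greedy else 1.0,
        "top_p": 0.95,
        # Assumed llama.cpp extensions, not standard OpenAI fields.
        "top_k": 64,
        "repeat_penalty": 1.1,
    }


def ask(prompt: str, greedy: bool = False) -> str:
    """Send the prompt to a running llama-server and return the reply text."""
    body = json.dumps(sampling_payload(prompt, greedy)).encode("utf-8")
    request = urllib.request.Request(
        SERVER_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        reply = json.load(response)
    return reply["choices"][0]["message"]["content"]
```

Calling `ask("Write a function that reverses a string.", greedy=True)` would use greedy decoding; it needs the server to be running, so it is not invoked here.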

⚠️ Good to know

Specialized for coding / terminal / agentic. General-knowledge facts/numbers should still be double-checked.
Reduced refusals: task-focused training, not safety-aligned — add your own guardrails for production. Use responsibly. 🙏
English-centric.

📚 Base & License

License: Apache 2.0. Gemma 4 is released by Google under Apache 2.0 (unlike the older Gemma 1/2/3 terms), so this fine-tune is Apache 2.0 too — free to use, modify, and redistribute. 🎉
Base model: google/gemma-4-12B-it.
Personal/hobby project — shared as-is, no warranty. Built with time, care, and a lot of coffee. Have fun! 🐾✨

⚡ Speculative decoding (MTP draft) — verified build

The MTP/ folder ships the Gemma 4 multi-token-prediction draft (unsloth's GGUF conversion of Google's official gemma-4-12B-it-assistant) for speculative decoding. Gemma 4 MTP is in llama.cpp mainline (PR #23398) — no fork needed — but the gemma4-assistant loader is build-sensitive right now, so please use the exact build below:

✅ Verified working: llama.cpp b9553 (commit 9e3b928fd). I reproduced it with gemma4-v2-Q8_0 + the MTP-Q8_0 draft: loads cleanly and accelerates generation (~88 → ~180 tok/s on a simple deterministic prompt; expect ~1.2–1.3× on real coding/thinking). Lossless either way.
⚠️ Newer builds (e.g. b9702 / b9717) currently crash while loading the draft with invalid vector subscript. This is an upstream regression in the gemma4-assistant loader path, not a problem with these GGUFs — the same files load fine on b9553. Stick with b9553 until it's fixed upstream.

Working command on b9553 (note the older flag names — --model-draft, not --spec-draft-model):

llama-server -m gemma4-v2-Q8_0.gguf ^
  --model-draft MTP\gemma-4-12B-it-MTP-Q8_0.gguf ^
  --spec-type draft-mtp --spec-draft-n-max 4 ^
  -ngl 99 -ngld 99 -fa on --jinja

ℹ️ The log line "Gemma4Assistant requires ctx_other to be set (this is normal during memory fitting)" is harmless. The draft is the generic Gemma 4 assistant (not retrained for v2), so acceptance is a touch lower than a model-specific draft would give, but output is still 100% lossless. On small-VRAM cards, Q8 main + long context + the draft can be tight; drop to Q6_K/Q4_K_M or a smaller --ctx-size if you hit OOM.
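To compare your own runs against the figures quoted above, a small helper like the following can turn token counts and wall-clock time into a speedup factor. This is only a sketch: the function names are hypothetical, and the inputs are the approximate numbers from the verified-build note, not new measurements.

```python
def tokens_per_second(completion_tokens: int, elapsed_seconds: float) -> float:
    """Throughput for one generation, from token count and wall-clock time."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return completion_tokens / elapsed_seconds


def speedup(baseline_tps: float, draft_tps: float) -> float:
    """How many times faster generation is with the MTP draft enabled."""
    return draft_tps / baseline_tps


# Approximate figures quoted above for a simple deterministic prompt.
simple_prompt_speedup = speedup(88, 180)

# Throughput implied by the expected 1.2-1.3x on real coding/thinking work,
# assuming the same ~88 tok/s baseline.
expected_low = 88 * 1.2
expected_high = 88 * 1.3

print(f"simple prompt: ~{simple_prompt_speedup:.2f}x")
print(f"real workloads: ~{expected_low:.0f} to ~{expected_high:.0f} tok/s")
```

The gap between the two cases reflects the point above: the draft is accepted far more often on simple, predictable text than on real coding or thinking output.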
