gemma-4-12B-coder-fable5-composer2.5-v1-GGUF-full-etheroi

Instructions to use developerjeremylive/gemma-4-12B-coder-fable5-composer2.5-v1-GGUF-full-etheroi with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use developerjeremylive/gemma-4-12B-coder-fable5-composer2.5-v1-GGUF-full-etheroi with llama-cpp-python:

# !pip install llama-cpp-python huggingface-hub
# (Llama.from_pretrained downloads the file through huggingface-hub)

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="developerjeremylive/gemma-4-12B-coder-fable5-composer2.5-v1-GGUF-full-etheroi",
	filename="gemma4-coding-Q2_K.gguf",
)

llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			"content": "What is the capital of France?"
		}
	]
)
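
The call above returns an OpenAI-style response dictionary and prints nothing, so a script shows no output unless the reply is extracted. A minimal sketch, assuming the usual `choices[0].message.content` layout; the helper name `extract_reply` is hypothetical and not part of llama-cpp-python:

```python
def extract_reply(response: dict) -> str:
    """Return the assistant text from an OpenAI-style chat completion dict."""
    choices = response.get("choices") or []
    if not choices:
        raise ValueError("response contains no choices")
    # Each choice carries a message with a role and the generated content.
    return choices[0]["message"]["content"] or ""


# Usage with the `llm` object created above (not executed here):
# response = llm.create_chat_completion(
#     messages=[{"role": "user", "content": "What is the capital of France?"}]
# )
# print(extract_reply(response))
```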

llama.cpp

How to use developerjeremylive/gemma-4-12B-coder-fable5-composer2.5-v1-GGUF-full-etheroi with llama.cpp:

Install (macOS, Linux)

curl -LsSf https://llama.app/install.sh | sh
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf developerjeremylive/gemma-4-12B-coder-fable5-composer2.5-v1-GGUF-full-etheroi:Q4_K_M
# Run inference directly in the terminal:
llama cli -hf developerjeremylive/gemma-4-12B-coder-fable5-composer2.5-v1-GGUF-full-etheroi:Q4_K_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf developerjeremylive/gemma-4-12B-coder-fable5-composer2.5-v1-GGUF-full-etheroi:Q4_K_M
# Run inference directly in the terminal:
llama cli -hf developerjeremylive/gemma-4-12B-coder-fable5-composer2.5-v1-GGUF-full-etheroi:Q4_K_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf developerjeremylive/gemma-4-12B-coder-fable5-composer2.5-v1-GGUF-full-etheroi:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf developerjeremylive/gemma-4-12B-coder-fable5-composer2.5-v1-GGUF-full-etheroi:Q4_K_M

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf developerjeremylive/gemma-4-12B-coder-fable5-composer2.5-v1-GGUF-full-etheroi:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf developerjeremylive/gemma-4-12B-coder-fable5-composer2.5-v1-GGUF-full-etheroi:Q4_K_M

Use Docker

docker model run hf.co/developerjeremylive/gemma-4-12B-coder-fable5-composer2.5-v1-GGUF-full-etheroi:Q4_K_M

LM Studio
Jan

vLLM

How to use developerjeremylive/gemma-4-12B-coder-fable5-composer2.5-v1-GGUF-full-etheroi with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "developerjeremylive/gemma-4-12B-coder-fable5-composer2.5-v1-GGUF-full-etheroi"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "developerjeremylive/gemma-4-12B-coder-fable5-composer2.5-v1-GGUF-full-etheroi",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
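
The same request can be sent from Python with only the standard library. This is a sketch, assuming the server started above is listening on `http://localhost:8000`; the helper names `build_chat_request` and `chat` are hypothetical.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build a POST request for an OpenAI-compatible /v1/chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url: str, model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(base_url, model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Example (requires the server to be running):
# model = "developerjeremylive/gemma-4-12B-coder-fable5-composer2.5-v1-GGUF-full-etheroi"
# print(chat("http://localhost:8000", model, "What is the capital of France?"))
```

Separating request construction from sending keeps the payload easy to inspect without a live server.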

Ollama
How to use developerjeremylive/gemma-4-12B-coder-fable5-composer2.5-v1-GGUF-full-etheroi with Ollama:
```
ollama run hf.co/developerjeremylive/gemma-4-12B-coder-fable5-composer2.5-v1-GGUF-full-etheroi:Q4_K_M
```

Unsloth Studio

How to use developerjeremylive/gemma-4-12B-coder-fable5-composer2.5-v1-GGUF-full-etheroi with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for developerjeremylive/gemma-4-12B-coder-fable5-composer2.5-v1-GGUF-full-etheroi to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for developerjeremylive/gemma-4-12B-coder-fable5-composer2.5-v1-GGUF-full-etheroi to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for developerjeremylive/gemma-4-12B-coder-fable5-composer2.5-v1-GGUF-full-etheroi to start chatting

Pi

How to use developerjeremylive/gemma-4-12B-coder-fable5-composer2.5-v1-GGUF-full-etheroi with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf developerjeremylive/gemma-4-12B-coder-fable5-composer2.5-v1-GGUF-full-etheroi:Q4_K_M

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "developerjeremylive/gemma-4-12B-coder-fable5-composer2.5-v1-GGUF-full-etheroi:Q4_K_M"
        }
      ]
    }
  }
}
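
If `~/.pi/agent/models.json` already lists other providers, pasting the snippet over the file would drop them. A minimal sketch that merges the entry instead, assuming the file has the structure shown above; the function name `add_llama_cpp_provider` is hypothetical, and the field names are copied from the snippet rather than from Pi's documentation.

```python
import json
from pathlib import Path


def add_llama_cpp_provider(config_path: Path, model_id: str,
                           base_url: str = "http://localhost:8080/v1") -> dict:
    """Merge a llama-cpp provider entry into models.json, keeping other providers."""
    config = {}
    if config_path.exists():
        config = json.loads(config_path.read_text(encoding="utf-8"))
    providers = config.setdefault("providers", {})
    # Field names mirror the JSON snippet above.
    providers["llama-cpp"] = {
        "baseUrl": base_url,
        "api": "openai-completions",
        "apiKey": "none",
        "models": [{"id": model_id}],
    }
    config_path.parent.mkdir(parents=True, exist_ok=True)
    config_path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config


# Example (writes to the real config location):
# add_llama_cpp_provider(
#     Path.home() / ".pi" / "agent" / "models.json",
#     "developerjeremylive/gemma-4-12B-coder-fable5-composer2.5-v1-GGUF-full-etheroi:Q4_K_M",
# )
```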

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use developerjeremylive/gemma-4-12B-coder-fable5-composer2.5-v1-GGUF-full-etheroi with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf developerjeremylive/gemma-4-12B-coder-fable5-composer2.5-v1-GGUF-full-etheroi:Q4_K_M

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default developerjeremylive/gemma-4-12B-coder-fable5-composer2.5-v1-GGUF-full-etheroi:Q4_K_M

Run Hermes

hermes

Atomic Chat
Docker Model Runner
How to use developerjeremylive/gemma-4-12B-coder-fable5-composer2.5-v1-GGUF-full-etheroi with Docker Model Runner:
```
docker model run hf.co/developerjeremylive/gemma-4-12B-coder-fable5-composer2.5-v1-GGUF-full-etheroi:Q4_K_M
```

Lemonade

How to use developerjeremylive/gemma-4-12B-coder-fable5-composer2.5-v1-GGUF-full-etheroi with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull developerjeremylive/gemma-4-12B-coder-fable5-composer2.5-v1-GGUF-full-etheroi:Q4_K_M

Run and chat with the model

lemonade run user.gemma-4-12B-coder-fable5-composer2.5-v1-GGUF-full-etheroi-Q4_K_M

List all available models

lemonade list

developerjeremylive

Duplicate from yuxinlu1/gemma-4-12B-coder-fable5-composer2.5-v1-GGUF

41063fc 10 days ago

	---
	license: apache-2.0
	base_model: google/gemma-4-12B-it
	library_name: gguf
	pipeline_tag: text-generation
	tags: [gemma4, coding, code, reasoning, thinking, gguf, llama.cpp, local-llm]
	---

	# 💻 Gemma4-12B-Coder (GGUF) — Composer 2.5 × Fable 5 ✨
	### 🐣 Tiny footprint, big brain — a local coding model for everyone

	> No matter your GPU. No matter your RAM. The smallest quant is ~4.5 GB, so with that much VRAM or unified
	> memory free, plus room for the KV cache and runtime overhead, you can run your own private, offline coding
	> assistant right now. 🚀
	> This is the v1 / code edition, distilled from real chain-of-thought so it thinks through a problem
	> before writing the solution. 🧠💻 All local, all yours, no API, no cloud.

	### 🎯 What it is
	A focused fine-tune of Gemma 4 12B on verifiable Python coding data — every training example's reasoning leads to
	code that actually passed its tests. The result reasons in the open (edge cases, complexity, approach) and then
	emits a clean, runnable solution. 💚

	---

	## 📌 Announcements

	🚀🔥 **v2 is out now!** The GGUF quants are live and ready to run →
	[grab v2 here](https://huggingface.co/yuxinlu1/gemma-4-12B-agentic-fable5-composer2.5-v2-3.5x-tau2-GGUF). 🎉
	The full `safetensors` master (to build or fine-tune on top of) goes up tomorrow. v2 is agentic + coding focused,
	the piece v1 was missing.

	Here's the result that got me most excited. When I saw v2's tau2-bench `telecom` result — an agentic tool-use
	benchmark where the model has to diagnose → fix → verify, exactly like real terminal/debugging work — I literally got
	launched out of my chair (…okay, kidding 😄). The jump in actually solving the problem is wild:

	| tau2-bench telecom · local, same harness, Q8_0 | score |
	|---|---|
	| official `gemma-4-12B-it` (base) | ~15% |
	| 🟢 v2 (this release) | ~55% |

	The base model tends to give up early (hands the problem off to a human); v2 keeps going and works it the way a
	much bigger model would. Full benchmark details are in the [v2 card](https://huggingface.co/yuxinlu1/gemma-4-12B-agentic-fable5-composer2.5-v2-3.5x-tau2-GGUF) now. 🔧

	✅ safetensors master (this v1 model) is UP. Full-precision weights are live →
	[yuxinlu1/gemma-4-12B-coder-fable5-composer2.5-v1](https://huggingface.co/yuxinlu1/gemma-4-12B-coder-fable5-composer2.5-v1)
	— roll your own GGUF / MLX / AWQ quants or fine-tune straight from the master. 🎉

	---

	## 📣 Context length fixed: now 256K (was 131K) — thanks, community! 💚

	A community member spotted that this model was reporting only a 131K context window. That turned out to be
	the well-known upstream Gemma 4 metadata bug — Google's initial `config.json` shipped with
	`max_position_embeddings: 131072` instead of the real 262144 (256K), and that value got baked into a lot of
	downstream finetunes and quants (including this one) before it was fixed upstream.

	The weights were always fine — it was purely a metadata field. **All GGUF quants have been re-patched to the
	full 256K context** (`gemma4.context_length = 262144`). Just re-download if you grabbed an earlier copy. 🙏
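
To confirm a downloaded file carries the patched value, the metadata key quoted above can be checked directly. A minimal sketch: `context_is_patched` is a hypothetical helper, and the usage comment assumes llama-cpp-python exposes the GGUF key/value pairs as a `Llama.metadata` dict.

```python
EXPECTED_CONTEXT = 262144  # 256K, the corrected value quoted above
CONTEXT_KEY = "gemma4.context_length"


def context_is_patched(metadata: dict) -> bool:
    """True if the GGUF metadata reports the full 256K context window."""
    raw = metadata.get(CONTEXT_KEY)
    if raw is None:
        return False
    # Metadata values may come back as strings, so normalise before comparing.
    try:
        return int(raw) == EXPECTED_CONTEXT
    except (TypeError, ValueError):
        return False


# Usage sketch (assumption: `Llama.metadata` holds the GGUF key/value pairs;
# `vocab_only=True` avoids loading the weights):
# from llama_cpp import Llama
# llm = Llama(model_path="gemma4-coding-Q4_K_M.gguf", vocab_only=True)
# print(context_is_patched(llm.metadata))
```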

	---

	## 📚 Training data (the interesting part 🍳)

	This is a distillation of two complementary chain-of-thought sources, both over verifiable Python coding tasks
	(algorithmic / function-level problems that come with deterministic tests):

	- **🥇 Main set: Composer 2.5 *real* CoT.** Genuine, model-authored reasoning traces. The teacher solved each problem,
	its code was run against the task's tests, and only the passing solutions were kept. So the reasoning the model
	learns from leads to code that actually works.
	- **🥈 Aux set: Fable 5** (released today! 🎉). A clever twist: we took the problems where Composer 2.5 got it wrong
	and handed them to Fable 5 to redo, re-deriving a fresh, self-consistent chain-of-thought and a correct
	solution, again gated on passing the tests. This recovers the hard cases the main teacher missed. These traces
	are synthetic (rationalized CoT), and are tagged separately so the two sources stay distinguishable.

	The recipe: real CoT for the bulk of solid coverage, plus synthetic "second-attempt" CoT to patch the failures —
	both verified by execution before anything entered training. ✅
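
The execution gate described above can be sketched roughly as follows. This is an illustration only, not the author's actual pipeline: the sample layout (`code` and `tests` keys), the function names, and the `source` tag are all assumptions, and a real setup would run candidates in a sandbox with timeouts rather than calling `exec` in-process.

```python
def passes_tests(solution_code: str, test_code: str) -> bool:
    """Run a candidate solution and its assert-based tests in a fresh namespace."""
    namespace: dict = {}
    try:
        exec(solution_code, namespace)  # define the candidate's functions
        exec(test_code, namespace)      # any failing assert raises
    except Exception:
        return False
    return True


def split_by_verification(samples: list[dict]) -> tuple[list[dict], list[dict]]:
    """Separate verified samples from failures that a second teacher should retry."""
    kept, retry = [], []
    for sample in samples:
        if passes_tests(sample["code"], sample["tests"]):
            # Tag the origin so main and aux traces stay distinguishable.
            kept.append({**sample, "source": "main"})
        else:
            retry.append(sample)
    return kept, retry


samples = [
    {"code": "def add(a, b):\n    return a + b", "tests": "assert add(1, 2) == 3"},
    {"code": "def add(a, b):\n    return a - b", "tests": "assert add(1, 2) == 3"},
]
kept, retry = split_by_verification(samples)
print(len(kept), len(retry))  # 1 1
```

Second-attempt solutions would pass through the same `passes_tests` gate before being added with a different `source` tag.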

	---

	## 📦 Pick your size (GGUF quants)

	| Quant | Size | Vibe |
	|------|------|------|
	| 🟢 Q2_K | 4.5 GB | tiniest, runs almost anywhere |
	| 🟡 Q3_K_M | 5.7 GB | great for 8 GB VRAM, much better than Q2 |
	| 🔵 Q4_K_M | 6.87 GB | the sweet spot 👌 (recommended) |
	| 🟣 Q6_K | 9.11 GB | near-lossless |
	| ⚪ Q8_0 | 11.8 GB | basically full quality |

	---

	## 🧮 "Will it fit?" — context length cheat-sheet

	Rough estimates 🤓 (assumes `q8_0` KV cache + ~1.5 GB overhead; use `q4_0` KV cache for ≈2× more context!).
	Max context is 256K. "—" = won't fit, pick a smaller quant. ✂️

	| Your VRAM / unified mem | 🟢 Q2_K (4.5G) | 🟡 Q3_K_M (5.7G) | 🔵 Q4_K_M (6.87G) | 🟣 Q6_K (9.11G) | ⚪ Q8_0 (11.8G) |
	|---|---|---|---|---|---|
	| 8 GB | ~16K ctx | ~10K | tight (~2–4K) | — | — |
	| 12 GB | ~48K | ~38K | ~30K | ~12K | — |
	| 16 GB | ~80K | ~72K | ~64K | ~44K | ~22K |
	| 24 GB | ~200K | ~160K | ~128K | ~110K | ~88K |
	| 32 GB | 256K (max) 🎉 | 256K | 256K | ~230K | ~190K |

	> 💡 Apple Silicon / integrated GPUs with unified memory count too — same numbers, just slower than a dGPU.
	> 💡 Low on room? Drop a quant or switch KV cache to `q4_0` and your context roughly doubles.
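
The arithmetic behind a table like this can be sketched as memory left after weights and overhead, divided by the KV-cache cost per token. The per-token figure below is an assumption backed out of the 8 GB / Q2_K cell, not a measured value, and a linear model like this will not reproduce every cell of the table.

```python
MAX_CONTEXT = 262144  # 256K cap stated above
GIB = 1024 ** 3


def estimate_context(vram_gb: float, model_gb: float, kv_bytes_per_token: int,
                     overhead_gb: float = 1.5) -> int:
    """Estimate how many tokens of KV cache fit beside the model weights."""
    free_bytes = (vram_gb - model_gb - overhead_gb) * GIB
    if free_bytes <= 0:
        return 0  # model plus overhead already exceeds memory
    return min(int(free_bytes // kv_bytes_per_token), MAX_CONTEXT)


# Illustrative per-token cost (assumption) from the 8 GB / Q2_K cell:
# (8 - 4.5 - 1.5) GiB spread over ~16K tokens is about 128 KiB per token.
KV_Q8_BYTES_PER_TOKEN = 128 * 1024

print(estimate_context(8, 4.5, KV_Q8_BYTES_PER_TOKEN))   # 16384
print(estimate_context(12, 4.5, KV_Q8_BYTES_PER_TOKEN))  # 49152
print(estimate_context(8, 9.11, KV_Q8_BYTES_PER_TOKEN))  # 0
```

Halving `kv_bytes_per_token` models the switch to a `q4_0` KV cache, which is why the context roughly doubles.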

	---

	## 🚀 How to run it (super easy)

	### Option A — llama.cpp (recommended) 🦙
	1. Grab a quant above (e.g. `…-Q4_K_M.gguf`) and `llama-server` from [llama.cpp](https://github.com/ggml-org/llama.cpp).
	> ⚠️ Needs a recent llama.cpp (this is the `gemma4_unified` architecture — older builds won't load it).
	2. Run a server (Windows `.bat` shown — tweak `--port`, `--ctx-size` to taste):

	```bat
	@echo off
	cd /d C:\llama.cpp
	llama-server.exe ^
	-m C:\models\gemma4-coding-Q4_K_M.gguf ^
	--ctx-size 16384 ^
	--n-gpu-layers 99 ^
	--no-mmap ^
	-fa on ^
	--cache-type-k q8_0 --cache-type-v q8_0 ^
	--temp 1.0 --top-p 0.95 --top-k 64 ^
	--host 0.0.0.0 --port 18080
	pause
	```
	3. Open `http://localhost:18080` and chat. 🎉 (Tip: bump `--ctx-size` per the table; use `q4_0` KV for more.)
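
On macOS or Linux the same launch can be scripted from Python. The sketch below only assembles the argument list from the flags in the `.bat` file above; the function name `build_server_command` is hypothetical, and it assumes `llama-server` is on your `PATH`.

```python
def build_server_command(model_path: str, ctx_size: int = 16384, port: int = 18080,
                         kv_cache_type: str = "q8_0",
                         binary: str = "llama-server") -> list[str]:
    """Assemble the llama-server argument list used in the .bat file above."""
    return [
        binary,
        "-m", model_path,
        "--ctx-size", str(ctx_size),
        "--n-gpu-layers", "99",
        "--no-mmap",
        "-fa", "on",
        "--cache-type-k", kv_cache_type,
        "--cache-type-v", kv_cache_type,
        "--temp", "1.0", "--top-p", "0.95", "--top-k", "64",
        "--host", "0.0.0.0", "--port", str(port),
    ]


# Launch (blocks until the server exits):
# import subprocess
# subprocess.run(build_server_command("/models/gemma4-coding-Q4_K_M.gguf"), check=True)
```

Passing `kv_cache_type="q4_0"` together with a larger `ctx_size` applies the tip from step 3.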

	### Option B — one-click apps 🖱️
	Works in LM Studio, Jan, Ollama, etc. — just import the GGUF, pick your quant, go. 🐾

	### 🧠 Thinking mode
	This model thinks in Gemma's native thought channel before answering — exactly how it was trained. Keep
	`enable_thinking=true` (the default chat template handles it). Recommended sampling: `temp 1.0, top_p 0.95, top_k 64`.
	For coding you can also go greedy (`temp 0`) for more deterministic solutions.
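
With llama-cpp-python, these settings map onto keyword arguments of `create_chat_completion`. A small sketch; the helper name `sampling_kwargs` is hypothetical:

```python
def sampling_kwargs(greedy: bool = False) -> dict:
    """Return the recommended sampling settings, or greedy decoding."""
    if greedy:
        # Temperature 0 makes decoding pick the most likely token each step.
        return {"temperature": 0.0}
    return {"temperature": 1.0, "top_p": 0.95, "top_k": 64}


# Usage with an existing `llm` and `messages` (not executed here):
# response = llm.create_chat_completion(messages=messages, **sampling_kwargs())
# response = llm.create_chat_completion(messages=messages, **sampling_kwargs(greedy=True))
```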

	---

	## ⚠️ Good to know
	- Reduced refusals: the training data is task-focused with no safety hedging, so this refuses less than the base
	model. It is not safety-aligned — add your own guardrails for production. Use responsibly. 🙏
	- Specialized for Python / algorithmic coding. Reasoning quality is strongest in that domain; general-knowledge
	facts/numbers should still be double-checked.
	- English-centric.

	---

	## 📚 Base & License
	- License: Apache 2.0. Gemma 4 is released by Google under
	[Apache 2.0](https://ai.google.dev/gemma/apache_2) (unlike the older Gemma 1/2/3 terms), so this fine-tune is
	Apache 2.0 too — free to use, modify, and redistribute. 🎉
	- Base model: [`google/gemma-4-12B-it`](https://huggingface.co/google/gemma-4-12B-it).
	- Personal/hobby project — shared as-is, no warranty. Have fun, and happy hacking! 🐾✨