--- license: apache-2.0 base_model: google/gemma-4-12B-it library_name: gguf pipeline_tag: text-generation tags: [gemma4, coding, code, reasoning, thinking, gguf, llama.cpp, local-llm] --- # ๐Ÿ’ป Gemma4-12B-Coder (GGUF) โ€” Composer 2.5 ร— Fable 5 โœจ ### ๐Ÿฃ Tiny footprint, big brain โ€” a local **coding** model for *everyone* > **No matter your GPU. No matter your RAM.** If you've got **~4.5 GB** of VRAM *or* unified memory free, > you can run your own private, offline coding assistant right now. ๐Ÿš€ > This is the **v1 / code edition** โ€” distilled from **real chain-of-thought** so it *thinks through* a problem > before writing the solution. ๐Ÿง ๐Ÿ’ป All local, all yours, no API, no cloud. ### ๐ŸŽฏ What it is A focused fine-tune of Gemma 4 12B on **verifiable Python coding** data โ€” every training example's reasoning leads to code that **actually passed its tests**. The result reasons in the open (edge cases, complexity, approach) and then emits a clean, runnable solution. ๐Ÿ’š --- ## ๐Ÿ“Œ Announcements **๐Ÿš€๐Ÿ”ฅ IT'S HERE โ€” v2 is OUT NOW!** v2 has shipped โ€” the **GGUF quants are live and ready to run** โ†’ **[grab v2 here](https://huggingface.co/yuxinlu1/gemma-4-12B-agentic-fable5-composer2.5-v2-3.5x-tau2-GGUF)**. ๐ŸŽ‰ The **full `safetensors` master** (build / fine-tune on top) goes up **tomorrow**. v2 is **agentic + coding** focused โ€” the piece v1 was missing. **Here's the result that got me most excited.** When I saw v2's **tau2-bench `telecom`** result โ€” an agentic tool-use benchmark where the model has to *diagnose โ†’ fix โ†’ verify*, exactly like real terminal/debugging work โ€” I literally got **launched out of my chair** (โ€ฆokay, *kidding* ๐Ÿ˜„). The jump in **actually solving the problem** is wild: | tau2-bench **telecom** ยท local, same harness, **Q8_0** | score | |---|---| | official `gemma-4-12B-it` (base) | **~15%** | | ๐ŸŸข **v2 (this release)** | **~55%** | The base model tends to **give up early** (hands the problem off to a human); **v2 keeps going** and works it the way a much bigger model would. Full benchmark details are in the **[v2 card](https://huggingface.co/yuxinlu1/gemma-4-12B-agentic-fable5-composer2.5-v2-3.5x-tau2-GGUF)** now. ๐Ÿ”ง **โœ… safetensors master (this v1 model) is UP.** Full-precision weights are live โ†’ **[yuxinlu1/gemma-4-12B-coder-fable5-composer2.5-v1](https://huggingface.co/yuxinlu1/gemma-4-12B-coder-fable5-composer2.5-v1)** โ€” roll your own GGUF / MLX / AWQ quants or fine-tune straight from the master. ๐ŸŽ‰ --- ## ๐Ÿ“ฃ Context length fixed: now **256K** (was 131K) โ€” thanks, community! ๐Ÿ’š A community member spotted that this model was reporting only a **131K** context window. That turned out to be the well-known upstream **Gemma 4 metadata bug** โ€” Google's initial `config.json` shipped with `max_position_embeddings: 131072` instead of the real **262144 (256K)**, and that value got baked into a lot of downstream finetunes and quants (including this one) before it was fixed upstream. The weights were always fine โ€” it was purely a metadata field. **All GGUF quants have been re-patched to the full 256K context** (`gemma4.context_length = 262144`). Just re-download if you grabbed an earlier copy. ๐Ÿ™ --- ## ๐Ÿ“š Training data (the interesting part ๐Ÿณ) This is a **distillation** of two complementary chain-of-thought sources, both over verifiable Python coding tasks (algorithmic / function-level problems that come with deterministic tests): - **๐Ÿฅ‡ Main set โ€” Composer 2.5 *real* CoT.** Genuine, model-authored reasoning traces. The teacher solved each problem, its code was **run against the task's tests, and only the passing solutions were kept**. So the reasoning you're learning from leads to code that *actually works*. - **๐Ÿฅˆ Aux set โ€” Fable 5 (released today! ๐ŸŽ‰).** A clever twist: we took the problems where **Composer 2.5 got it wrong** and handed them to **Fable 5** to *redo* โ€” re-deriving a fresh, self-consistent chain-of-thought and a correct solution, again **gated on passing the tests**. This recovers the hard cases the main teacher missed. These traces are **synthetic** (rationalized CoT), and are tagged separately so the two sources stay distinguishable. The recipe: real CoT for the bulk of solid coverage, plus synthetic "second-attempt" CoT to patch the failures โ€” both verified by execution before anything entered training. โœ… --- ## ๐Ÿ“ฆ Pick your size (GGUF quants) | Quant | Size | Vibe | |------|------|------| | ๐ŸŸข **Q2_K** | **4.5 GB** | tiniest โ€” runs almost anywhere | | ๐ŸŸก **Q3_K_M** | **5.7 GB** | great for 8 GB VRAM โ€” much better than Q2 | | ๐Ÿ”ต **Q4_K_M** | **6.87 GB** | the sweet spot ๐Ÿ‘Œ (recommended) | | ๐ŸŸฃ **Q6_K** | **9.11 GB** | near-lossless | | โšช **Q8_0** | **11.8 GB** | basically full quality | --- ## ๐Ÿงฎ "Will it fit?" โ€” context length cheat-sheet Rough estimates ๐Ÿค“ (assumes `q8_0` KV cache + ~1.5 GB overhead; **use `q4_0` KV cache for โ‰ˆ2ร— more context!**). Max context is **256K**. "โ€”" = won't fit, pick a smaller quant. โœ‚๏ธ | Your VRAM / unified mem | ๐ŸŸข Q2_K (4.5G) | ๐ŸŸก Q3_K_M (5.7G) | ๐Ÿ”ต Q4_K_M (6.87G) | ๐ŸŸฃ Q6_K (9.11G) | โšช Q8_0 (11.8G) | |---|---|---|---|---|---| | **8 GB** | ~16K ctx | ~10K | tight (~2โ€“4K) | โ€” | โ€” | | **12 GB** | ~48K | ~38K | ~30K | ~12K | โ€” | | **16 GB** | ~80K | ~72K | ~64K | ~44K | ~22K | | **24 GB** | ~200K | ~160K | ~128K | ~110K | ~88K | | **32 GB** | 256K (max) ๐ŸŽ‰ | 256K | 256K | ~230K | ~190K | > ๐Ÿ’ก Apple Silicon / integrated GPUs with **unified memory** count too โ€” same numbers, just slower than a dGPU. > ๐Ÿ’ก Low on room? Drop a quant or switch KV cache to `q4_0` and your context roughly doubles. --- ## ๐Ÿš€ How to run it (super easy) ### Option A โ€” llama.cpp (recommended) ๐Ÿฆ™ 1. Grab a quant above (e.g. `โ€ฆ-Q4_K_M.gguf`) and `llama-server` from [llama.cpp](https://github.com/ggml-org/llama.cpp). > โš ๏ธ Needs a **recent llama.cpp** (this is the `gemma4_unified` architecture โ€” older builds won't load it). 2. Run a server (Windows `.bat` shown โ€” tweak `--port`, `--ctx-size` to taste): ```bat @echo off cd /d C:\llama.cpp llama-server.exe ^ -m C:\models\gemma4-coding-Q4_K_M.gguf ^ --ctx-size 16384 ^ --n-gpu-layers 99 ^ --no-mmap ^ -fa on ^ --cache-type-k q8_0 --cache-type-v q8_0 ^ --temp 1.0 --top-p 0.95 --top-k 64 ^ --host 0.0.0.0 --port 18080 pause ``` 3. Open `http://localhost:18080` and chat. ๐ŸŽ‰ (Tip: bump `--ctx-size` per the table; use `q4_0` KV for more.) ### Option B โ€” one-click apps ๐Ÿ–ฑ๏ธ Works in **LM Studio**, **Jan**, **Ollama**, etc. โ€” just import the GGUF, pick your quant, go. ๐Ÿพ ### ๐Ÿง  Thinking mode This model thinks in Gemma's native thought channel before answering โ€” exactly how it was trained. Keep **`enable_thinking=true`** (the default chat template handles it). Recommended sampling: `temp 1.0, top_p 0.95, top_k 64`. For coding you can also go greedy (`temp 0`) for more deterministic solutions. --- ## โš ๏ธ Good to know - **Reduced refusals:** the training data is task-focused with no safety hedging, so this refuses less than the base model. It is **not** safety-aligned โ€” add your own guardrails for production. Use responsibly. ๐Ÿ™ - Specialized for **Python / algorithmic** coding. Reasoning quality is strongest in that domain; general-knowledge facts/numbers should still be double-checked. - English-centric. --- ## ๐Ÿ“š Base & License - **License: Apache 2.0.** Gemma 4 is released by Google under **[Apache 2.0](https://ai.google.dev/gemma/apache_2)** (unlike the older Gemma 1/2/3 terms), so this fine-tune is **Apache 2.0** too โ€” free to use, modify, and redistribute. ๐ŸŽ‰ - **Base model:** [`google/gemma-4-12B-it`](https://huggingface.co/google/gemma-4-12B-it). - Personal/hobby project โ€” shared as-is, no warranty. Have fun, and happy hacking! ๐Ÿพโœจ