---
license: apache-2.0
base_model: google/gemma-4-12B-it
library_name: gguf
pipeline_tag: text-generation
tags: [gemma4, coding, code, reasoning, thinking, gguf, llama.cpp, local-llm]
---

# Gemma4-12B-Coder (GGUF): Composer 2.5 × Fable 5

### Tiny footprint, big brain: a local **coding** model for *everyone*

> **No matter your GPU. No matter your RAM.** The smallest quant is a **~4.5 GB** file, so if you have that much VRAM *or* unified memory free, plus some headroom for the KV cache,
> you can run your own private, offline coding assistant right now.
>
> This is the **v1 / code edition**, distilled from **real chain-of-thought** so it *thinks through* a problem
> before writing the solution. All local, all yours, no API, no cloud.

### What it is

A focused fine-tune of Gemma 4 12B on **verifiable Python coding** data: every training example's reasoning leads to
code that **actually passed its tests**. The result reasons in the open (edge cases, complexity, approach) and then
emits a clean, runnable solution.

---

## Announcements

**IT'S HERE: v2 is OUT NOW!** The **GGUF quants are live and ready to run**:
**[grab v2 here](https://huggingface.co/yuxinlu1/gemma-4-12B-agentic-fable5-composer2.5-v2-3.5x-tau2-GGUF)**.
The **full `safetensors` master** (build / fine-tune on top) goes up **tomorrow**. v2 is **agentic + coding** focused,
the piece v1 was missing.

**Here's the result that got me most excited.** When I saw v2's **tau2-bench `telecom`** result (an agentic tool-use
benchmark where the model has to *diagnose → fix → verify*, exactly like real terminal/debugging work), I literally got
**launched out of my chair** (okay, *kidding*). The jump in **actually solving the problem** is wild:

| tau2-bench **telecom** · local, same harness, **Q8_0** | Score |
|---|---|
| official `gemma-4-12B-it` (base) | **~15%** |
| **v2 (the new release)** | **~55%** |

The base model tends to **give up early** (it hands the problem off to a human); **v2 keeps going** and works it the way a
much bigger model would. Full benchmark details are in the **[v2 card](https://huggingface.co/yuxinlu1/gemma-4-12B-agentic-fable5-composer2.5-v2-3.5x-tau2-GGUF)** now.

**The safetensors master (this v1 model) is UP.** Full-precision weights are live at
**[yuxinlu1/gemma-4-12B-coder-fable5-composer2.5-v1](https://huggingface.co/yuxinlu1/gemma-4-12B-coder-fable5-composer2.5-v1)**:
roll your own GGUF / MLX / AWQ quants or fine-tune straight from the master.

---

## Context length fixed: now **256K** (was 131K). Thanks, community!

A community member spotted that this model was reporting only a **131K** (131,072-token) context window. That turned out to be
the well-known upstream **Gemma 4 metadata bug**: Google's initial `config.json` shipped with
`max_position_embeddings: 131072` instead of the real **262144 (256K)**, and that value got baked into a lot of
downstream finetunes and quants (including this one) before it was fixed upstream.

The weights were always fine; it was purely a metadata field. **All GGUF quants have been re-patched to the
full 256K context** (`gemma4.context_length = 262144`). Just re-download if you grabbed an earlier copy.

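
To check whether a local copy already carries the patched value, the metadata can be read straight from the GGUF header. The sketch below is a minimal, stdlib-only reader that assumes the GGUF v2+ header layout (little-endian, 64-bit counts). The function names are illustrative, not part of any library, and the `gguf` Python package that ships with llama.cpp is the more complete tool for real use.

```python
import io
import struct

# GGUF scalar value-type IDs mapped to little-endian struct formats.
_SCALARS = {0: "<B", 1: "<b", 2: "<H", 3: "<h", 4: "<I", 5: "<i",
            6: "<f", 7: "<?", 10: "<Q", 11: "<q", 12: "<d"}
_STRING, _ARRAY = 8, 9


def _read(stream, fmt):
    """Read one fixed-size value from the stream."""
    size = struct.calcsize(fmt)
    data = stream.read(size)
    if len(data) != size:
        raise ValueError("truncated GGUF header")
    return struct.unpack(fmt, data)[0]


def _read_string(stream):
    """Strings are stored as a uint64 byte length followed by UTF-8 bytes."""
    length = _read(stream, "<Q")
    return stream.read(length).decode("utf-8")


def _read_value(stream, value_type):
    if value_type in _SCALARS:
        return _read(stream, _SCALARS[value_type])
    if value_type == _STRING:
        return _read_string(stream)
    if value_type == _ARRAY:
        item_type = _read(stream, "<I")
        count = _read(stream, "<Q")
        return [_read_value(stream, item_type) for _ in range(count)]
    raise ValueError(f"unknown GGUF value type {value_type}")


def read_gguf_metadata(stream):
    """Return the key/value metadata section of a GGUF file as a dict."""
    if stream.read(4) != b"GGUF":
        raise ValueError("not a GGUF file")
    _version = _read(stream, "<I")
    _tensor_count = _read(stream, "<Q")
    kv_count = _read(stream, "<Q")
    metadata = {}
    for _ in range(kv_count):
        key = _read_string(stream)
        value_type = _read(stream, "<I")
        metadata[key] = _read_value(stream, value_type)
    return metadata


def has_full_context(metadata, key="gemma4.context_length"):
    """True when the file reports the patched 262144-token context."""
    return metadata.get(key) == 262144
```

To use it, open the quant in binary mode (for example `open("gemma4-coding-Q4_K_M.gguf", "rb")`), pass the file object to `read_gguf_metadata`, and hand the result to `has_full_context`. A `False` result means the copy predates the patch and should be re-downloaded.
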
---

## Training data (the interesting part)

This is a **distillation** of two complementary chain-of-thought sources, both over verifiable Python coding tasks
(algorithmic / function-level problems that come with deterministic tests):

- **Main set: Composer 2.5 *real* CoT.** Genuine, model-authored reasoning traces. The teacher solved each problem,
  its code was **run against the task's tests, and only the passing solutions were kept**. So the reasoning you're
  learning from leads to code that *actually works*.
- **Aux set: Fable 5 (released today!).** A clever twist: we took the problems where **Composer 2.5 got it wrong**
  and handed them to **Fable 5** to *redo*, re-deriving a fresh, self-consistent chain-of-thought and a correct
  solution, again **gated on passing the tests**. This recovers the hard cases the main teacher missed. These traces
  are **synthetic** (rationalized CoT), and are tagged separately so the two sources stay distinguishable.

The recipe: real CoT for the bulk of solid coverage, plus synthetic "second-attempt" CoT to patch the failures,
both verified by execution before anything entered training.

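
The recipe above can be sketched as a small filter loop. This is an illustration under stated assumptions, not the actual pipeline: the `Trace` record, the `Teacher` callable type, `passes_tests`, and `build_dataset` are all hypothetical names, and the in-process `exec` stands in for whatever sandboxed test runner was really used.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass
class Trace:
    problem_id: str
    reasoning: str
    code: str
    source: str  # "real" (main teacher) or "synthetic" (second attempt)


# A teacher maps a problem prompt to (reasoning, code). Hypothetical interface.
Teacher = Callable[[str], Tuple[str, str]]


def passes_tests(code: str, tests: str) -> bool:
    """Run the candidate and then its tests; any exception counts as a failure.

    Assumption: exec() is used only for brevity. Untrusted model output
    should be executed in an isolated sandbox, not in-process.
    """
    namespace: dict = {}
    try:
        exec(code, namespace)
        exec(tests, namespace)
    except Exception:
        return False
    return True


def build_dataset(
    problems: Iterable[Tuple[str, str, str]],  # (problem_id, prompt, tests)
    main_teacher: Teacher,
    aux_teacher: Teacher,
) -> List[Trace]:
    """Keep passing main-teacher traces; retry failures with the aux teacher."""
    kept: List[Trace] = []
    for problem_id, prompt, tests in problems:
        attempts = ((main_teacher, "real"), (aux_teacher, "synthetic"))
        for teacher, source in attempts:
            reasoning, code = teacher(prompt)
            if passes_tests(code, tests):
                kept.append(Trace(problem_id, reasoning, code, source))
                break  # first passing attempt wins; later teachers are skipped
    return kept
```

Problems that neither teacher solves are simply dropped, and the `source` field keeps the real and synthetic traces distinguishable, mirroring the separate tagging described above.
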
---

## Pick your size (GGUF quants)

| Quant | Size | Vibe |
|------|------|------|
| **Q2_K** | **4.5 GB** | tiniest, runs almost anywhere |
| **Q3_K_M** | **5.7 GB** | great for 8 GB VRAM, much better than Q2 |
| **Q4_K_M** | **6.87 GB** | the sweet spot (recommended) |
| **Q6_K** | **9.11 GB** | near-lossless |
| **Q8_0** | **11.8 GB** | basically full quality |

---

## "Will it fit?": context length cheat-sheet

Rough estimates (assumes `q8_0` KV cache + ~1.5 GB overhead; **use `q4_0` KV cache for ≈2× more context!**).
Max context is **256K**. "-" = won't fit, pick a smaller quant.

| Your VRAM / unified mem | Q2_K (4.5 GB) | Q3_K_M (5.7 GB) | Q4_K_M (6.87 GB) | Q6_K (9.11 GB) | Q8_0 (11.8 GB) |
|---|---|---|---|---|---|
| **8 GB** | ~16K | ~10K | tight (~2-4K) | - | - |
| **12 GB** | ~48K | ~38K | ~30K | ~12K | - |
| **16 GB** | ~80K | ~72K | ~64K | ~44K | ~22K |
| **24 GB** | ~200K | ~160K | ~128K | ~110K | ~88K |
| **32 GB** | 256K (max) | 256K | 256K | ~230K | ~190K |

> Apple Silicon / integrated GPUs with **unified memory** count too: same numbers, just slower than a dGPU.

> Low on room? Drop a quant or switch KV cache to `q4_0` and your context roughly doubles.

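
For scripting, the cheat-sheet can be encoded as a plain lookup. The sketch below only restates the table's rough estimates. The function name `estimate_context_k`, the choice to round memory down to the nearest table row, and the use of the lower bound (2K) for the "tight" cell are assumptions made for illustration.

```python
from typing import Optional

MAX_CONTEXT_K = 256  # model maximum, in thousands of tokens

# Rows copied from the cheat-sheet (q8_0 KV cache, ~1.5 GB overhead).
# Values are thousands of tokens; None means "won't fit".
CONTEXT_TABLE_K = {
    8:  {"Q2_K": 16,  "Q3_K_M": 10,  "Q4_K_M": 2,   "Q6_K": None, "Q8_0": None},
    12: {"Q2_K": 48,  "Q3_K_M": 38,  "Q4_K_M": 30,  "Q6_K": 12,   "Q8_0": None},
    16: {"Q2_K": 80,  "Q3_K_M": 72,  "Q4_K_M": 64,  "Q6_K": 44,   "Q8_0": 22},
    24: {"Q2_K": 200, "Q3_K_M": 160, "Q4_K_M": 128, "Q6_K": 110,  "Q8_0": 88},
    32: {"Q2_K": 256, "Q3_K_M": 256, "Q4_K_M": 256, "Q6_K": 230,  "Q8_0": 190},
}


def estimate_context_k(memory_gb: float, quant: str,
                       kv_cache: str = "q8_0") -> Optional[int]:
    """Look up a rough context budget (in K tokens) for a memory size and quant.

    Memory is rounded down to the nearest table row, so the answer is
    conservative. Returns None when the table has no fitting entry.
    """
    rows = [size for size in sorted(CONTEXT_TABLE_K) if size <= memory_gb]
    if not rows:
        return None  # below the smallest row in the table
    context_k = CONTEXT_TABLE_K[rows[-1]][quant]
    if context_k is None:
        return None
    if kv_cache == "q4_0":
        # The card says q4_0 roughly doubles context; cap at the model maximum.
        context_k = min(context_k * 2, MAX_CONTEXT_K)
    return context_k
```

For example, `estimate_context_k(16, "Q4_K_M")` returns `64`, and the same call with `kv_cache="q4_0"` returns `128`. Treat the result as a starting value for `--ctx-size` rather than a guarantee.
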
---

## How to run it (super easy)

### Option A: llama.cpp (recommended)

1. Grab a quant above (e.g. `…-Q4_K_M.gguf`) and `llama-server` from [llama.cpp](https://github.com/ggml-org/llama.cpp).
   > **Warning:** Needs a **recent llama.cpp** (this is the `gemma4_unified` architecture; older builds won't load it).
2. Run a server (Windows `.bat` shown; tweak `--port`, `--ctx-size` to taste):
   ```bat
   @echo off
   rem Launch llama-server with full GPU offload, flash attention, and a q8_0 KV cache.
   rem --host 0.0.0.0 exposes the server to your whole network; use 127.0.0.1 to keep it local.
   cd /d C:\llama.cpp
   llama-server.exe ^
     -m C:\models\gemma4-coding-Q4_K_M.gguf ^
     --ctx-size 16384 ^
     --n-gpu-layers 99 ^
     --no-mmap ^
     -fa on ^
     --cache-type-k q8_0 --cache-type-v q8_0 ^
     --temp 1.0 --top-p 0.95 --top-k 64 ^
     --host 0.0.0.0 --port 18080
   pause
   ```
3. Open `http://localhost:18080` and chat. (Tip: bump `--ctx-size` per the table; use `q4_0` KV for more.)

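
Besides the web UI, the running server can be called from code through llama-server's OpenAI-compatible chat endpoint. The sketch below is a minimal stdlib-only client, assuming the server from step 2 is listening on port 18080. The helper names `build_request` and `chat` are illustrative, and `top_k` is sent on the assumption that llama-server accepts it as an extension to the standard OpenAI fields.

```python
import json
import urllib.request

BASE_URL = "http://localhost:18080"  # matches --port 18080 in the .bat script


def build_request(prompt: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a POST request for the /v1/chat/completions endpoint."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        # Sampling values mirror the flags used when launching the server.
        "temperature": 1.0,
        "top_p": 0.95,
        "top_k": 64,  # assumed llama-server extension, not a standard OpenAI field
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, base_url: str = BASE_URL, timeout: float = 600.0) -> str:
    """Send one user message and return the assistant's reply text."""
    request = build_request(prompt, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server up, `print(chat("Write a Python function that reverses a linked list."))` prints the model's answer. The generous timeout is deliberate, since a thinking model can take a while before it starts answering.
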
### Option B: one-click apps

Works in **LM Studio**, **Jan**, **Ollama**, etc. Just import the GGUF, pick your quant, go.

### Thinking mode

This model thinks in Gemma's native thought channel before answering, exactly how it was trained. Keep
**`enable_thinking=true`** (the default chat template handles it). Recommended sampling: `temp 1.0, top_p 0.95, top_k 64`.
For coding you can also go greedy (`temp 0`) for more deterministic solutions.

---

## Good to know

- **Reduced refusals:** the training data is task-focused with no safety hedging, so this refuses less than the base
  model. It is **not** safety-aligned; add your own guardrails for production. Use responsibly.
- Specialized for **Python / algorithmic** coding. Reasoning quality is strongest in that domain; general-knowledge
  facts/numbers should still be double-checked.
- English-centric.

---

## Base & License

- **License: Apache 2.0.** Gemma 4 is released by Google under
  **[Apache 2.0](https://ai.google.dev/gemma/apache_2)** (unlike the older Gemma 1/2/3 terms), so this fine-tune is
  **Apache 2.0** too: free to use, modify, and redistribute.
- **Base model:** [`google/gemma-4-12B-it`](https://huggingface.co/google/gemma-4-12B-it).
- Personal/hobby project, shared as-is, no warranty. Have fun, and happy hacking!