Gemma 4 12B IT — GGUF Quantizations
GGUF quantizations of google/gemma-4-12B-it by Google DeepMind.
Model Overview
Gemma 4 12B IT is a 12-billion-parameter instruction-tuned model from Google DeepMind, built on the Gemma 4 architecture. It features:
- Reasoning — Configurable thinking modes for step-by-step problem solving
- Coding — Strong code generation, completion, and correction capabilities
- Long Context — 256K token context window
- Multilingual — Support for 140+ languages
- Function Calling — Native tool use support for agentic workflows
- License — Apache 2.0 (free to use, modify, and redistribute)
These GGUF quantizations enable running the model locally on consumer hardware using llama.cpp, Ollama, LM Studio, and other compatible tools.
Quick Start
Ollama
ollama run hf.co/liodon-ai/gemma-4-12B-it-GGUF:Q4_K_M
llama.cpp
# Install llama.cpp
brew install llama.cpp # macOS
# or download from https://github.com/ggerganov/llama.cpp/releases
# Start server with web UI
llama-server -hf liodon-ai/gemma-4-12B-it-GGUF:Q4_K_M
# Or run directly in terminal
llama-cli -hf liodon-ai/gemma-4-12B-it-GGUF:Q4_K_M
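Once llama-server is running, it can also be queried from Python through its OpenAI-compatible chat endpoint. The following is a minimal sketch using only the standard library. It assumes the server is listening on llama-server's default address, http://localhost:8080 (adjust this if you passed --host or --port). The helper names build_chat_request and ask are illustrative and not part of llama.cpp.

```python
import json
import urllib.request

# Assumption: llama-server is listening on its default address.
BASE_URL = "http://localhost:8080/v1"


def build_chat_request(prompt, base_url=BASE_URL):
    """Build (without sending) a POST request for the chat completions endpoint."""
    payload = {"messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, base_url=BASE_URL, timeout=120):
    """Send the prompt to the running server and return the reply text."""
    request = build_chat_request(prompt, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Usage (requires a running server):
# print(ask("What is the capital of France?"))
```

Keeping the request construction separate from the network call makes it easy to inspect the payload before anything is sent.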
LM Studio
- Open LM Studio
- Search for liodon-ai/gemma-4-12B-it-GGUF
- Download your preferred quantization
- Start chatting
Jan
- Open Jan
- Navigate to Hub
- Search for liodon-ai/gemma-4-12B-it-GGUF
- Download and run
Available Quantizations
| Quant | File Size | Quality | Best For |
|---|---|---|---|
| Q2_K | ~4.8 GB | Lowest — usable | Ultra-low VRAM (6 GB), testing |
| Q3_K_M | ~6.1 GB | Good — much better than Q2 | 8 GB VRAM GPUs |
| Q4_K_M | ~7.4 GB | Sweet spot (recommended) | 8-12 GB VRAM, best balance |
| Q5_K_M | ~8.6 GB | High quality | 12 GB VRAM, near-lossless |
| Q6_K | ~9.8 GB | Near-lossless | 16 GB VRAM, high fidelity |
| Q8_0 | ~12.7 GB | Basically full quality | 24 GB VRAM, maximum quality |
Hardware Requirements
Estimated VRAM requirements for full model loading (no context):
| VRAM | Q2_K | Q3_K_M | Q4_K_M | Q5_K_M | Q6_K | Q8_0 |
|---|---|---|---|---|---|---|
| 6 GB | ✓ | — | — | — | — | — |
| 8 GB | ✓ | ✓ | tight | — | — | — |
| 12 GB | ✓ | ✓ | ✓ | tight | — | — |
| 16 GB | ✓ | ✓ | ✓ | ✓ | tight | — |
| 24 GB | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
Tip: Use --cache-type-k q4_0 --cache-type-v q4_0 in llama.cpp to roughly double your available context length compared with a q8_0 KV cache.
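The "roughly double" figure in the tip follows from how much storage each cache type needs per value. As a rough sketch, ggml's q8_0 format stores 32 values in 34 bytes and q4_0 stores 32 values in 18 bytes, while unquantized f16 needs 64 bytes for the same 32 values. The function names below are illustrative, and the calculation ignores model weights and other runtime buffers.

```python
# Bytes used to store one block of 32 cached values in each format.
BLOCK_BYTES = {"f16": 64, "q8_0": 34, "q4_0": 18}
VALUES_PER_BLOCK = 32


def bits_per_value(cache_type):
    """Average storage cost of one cached value, in bits."""
    return BLOCK_BYTES[cache_type] * 8 / VALUES_PER_BLOCK


def context_gain(old_type, new_type):
    """How many times more tokens fit in the same KV cache memory budget."""
    return bits_per_value(old_type) / bits_per_value(new_type)


print(f"q8_0 -> q4_0: {context_gain('q8_0', 'q4_0'):.2f}x")
print(f"f16  -> q4_0: {context_gain('f16', 'q4_0'):.2f}x")
```

The gain applies only to the memory left over for the KV cache, and a more aggressively quantized cache can reduce output quality, so treat the ratio as an upper bound on the practical benefit.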
Context Length Cheat Sheet
Rough estimates for max context length at each quantization (assumes q8_0 KV cache + ~1.5 GB overhead):
| VRAM | Q2_K | Q3_K_M | Q4_K_M | Q5_K_M | Q6_K | Q8_0 |
|---|---|---|---|---|---|---|
| 8 GB | ~16K | ~10K | ~2-4K | — | — | — |
| 12 GB | ~48K | ~38K | ~30K | ~20K | ~12K | — |
| 16 GB | ~80K | ~72K | ~64K | ~52K | ~44K | ~22K |
| 24 GB | ~200K | ~160K | ~128K | ~110K | ~90K | ~60K |
| 32 GB | 256K (max) | 256K | 256K | 256K | ~230K | ~190K |
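The cheat sheet can be turned into a small lookup that builds a llama-server command line. This is a sketch under a few assumptions: "K" is read as 1,024 tokens, the "~2-4K" entry is taken at its lower bound, and the q8_0 cache flags are included because the estimates assume a q8_0 KV cache. The name server_command is hypothetical. Depending on your llama.cpp build, a quantized V cache may also require flash attention to be enabled.

```python
import shlex

REPO = "liodon-ai/gemma-4-12B-it-GGUF"

# Context estimates in K tokens, copied from the cheat sheet (missing = does not fit).
CONTEXT_K = {
    8: {"Q2_K": 16, "Q3_K_M": 10, "Q4_K_M": 2},
    12: {"Q2_K": 48, "Q3_K_M": 38, "Q4_K_M": 30, "Q5_K_M": 20, "Q6_K": 12},
    16: {"Q2_K": 80, "Q3_K_M": 72, "Q4_K_M": 64, "Q5_K_M": 52, "Q6_K": 44, "Q8_0": 22},
    24: {"Q2_K": 200, "Q3_K_M": 160, "Q4_K_M": 128, "Q5_K_M": 110, "Q6_K": 90, "Q8_0": 60},
    32: {"Q2_K": 256, "Q3_K_M": 256, "Q4_K_M": 256, "Q5_K_M": 256, "Q6_K": 230, "Q8_0": 190},
}


def server_command(vram_gb, quant):
    """Return a llama-server argument list sized from the cheat sheet."""
    context_k = CONTEXT_K.get(vram_gb, {}).get(quant)
    if context_k is None:
        raise ValueError(f"No estimate for {quant} with {vram_gb} GB of VRAM")
    return [
        "llama-server",
        "-hf", f"{REPO}:{quant}",
        "-c", str(context_k * 1024),  # assumption: 1K = 1,024 tokens
        "--cache-type-k", "q8_0",
        "--cache-type-v", "q8_0",
    ]


print(shlex.join(server_command(12, "Q4_K_M")))
```

Because the table values are rough estimates, it is sensible to start below the listed context size and increase it while watching memory use.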
Recommended Sampling Parameters
For best results with Gemma 4:
| Mode | Temperature | Top P | Top K | Use Case |
|---|---|---|---|---|
| General | 1.0 | 0.95 | 64 | Chat, creative tasks |
| Coding | 0.6 | 0.95 | 20 | Code generation |
| Deterministic | 0.0 | 1.0 | 1 | Reproducible outputs |
| Reasoning | 1.0 | 0.95 | 64 | Math, logic puzzles |
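The table above maps naturally onto a dictionary of presets. The sketch below keeps the values exactly as listed. The names SAMPLING_PRESETS and sampling_kwargs are illustrative, and the commented usage line assumes llama-cpp-python, whose create_chat_completion accepts temperature, top_p, and top_k keyword arguments.

```python
# Presets copied from the "Recommended Sampling Parameters" table.
SAMPLING_PRESETS = {
    "general": {"temperature": 1.0, "top_p": 0.95, "top_k": 64},
    "coding": {"temperature": 0.6, "top_p": 0.95, "top_k": 20},
    "deterministic": {"temperature": 0.0, "top_p": 1.0, "top_k": 1},
    "reasoning": {"temperature": 1.0, "top_p": 0.95, "top_k": 64},
}


def sampling_kwargs(mode):
    """Return a copy of the sampling parameters for the given mode."""
    key = mode.lower()
    if key not in SAMPLING_PRESETS:
        raise ValueError(f"Unknown mode {mode!r}; choose from {sorted(SAMPLING_PRESETS)}")
    return dict(SAMPLING_PRESETS[key])


print(sampling_kwargs("coding"))

# Usage with llama-cpp-python (assumes `llm` is an already loaded Llama instance):
# llm.create_chat_completion(messages=messages, **sampling_kwargs("coding"))
```

Returning a copy keeps callers from accidentally mutating the shared presets when they tweak a single value.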
Thinking Mode
Gemma 4 supports configurable thinking modes. Enable thinking for complex tasks:
- Enable thinking: Add the <|think|> token at the start of the system prompt
- Disable thinking: Remove the <|think|> token
- Default: Most libraries handle this automatically via the chat template
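When building the message list by hand, the toggle described above can be sketched as a small helper. The function name build_messages is hypothetical. Whether a separator is expected after the token is not stated here, so none is added. If your library already applies the chat template for you, this manual step may be unnecessary.

```python
THINK_TOKEN = "<|think|>"


def build_messages(user_prompt, system_prompt="", thinking=False):
    """Build chat messages, placing the think token at the start of the system prompt."""
    # Strip any existing token first so it is never duplicated or left behind.
    system = system_prompt.replace(THINK_TOKEN, "").lstrip()
    if thinking:
        system = THINK_TOKEN + system

    messages = []
    if system:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": user_prompt})
    return messages


for message in build_messages("Prove that 17 is prime.", "Be concise.", thinking=True):
    print(message)
```

With thinking=False and an empty system prompt, no system message is emitted at all, which leaves the default behavior to the chat template.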
Model Architecture
| Property | Value |
|---|---|
| Architecture | Gemma4Unified |
| Parameters | 12B |
| Layers | 48 |
| Hidden Size | 3,840 |
| Attention Heads | 16 (Q) / 8 (KV) |
| Context Length | 256K tokens |
| Vocabulary | 262,144 |
| Sliding Window | 1,024 tokens |
Base Model
- Model: google/gemma-4-12B-it
- Organization: Google DeepMind
- License: Apache 2.0
- Blog: Gemma 4 Launch
Quantization Method
Quantized using llama.cpp's llama-quantize tool with the following methods:
- Q2_K — 2-bit K-quants (super-blocks of 16 blocks, 16 weights each)
- Q3_K_M — 3-bit K-quants (medium quality)
- Q4_K_M — 4-bit K-quants (medium quality, recommended)
- Q5_K_M — 5-bit K-quants (medium quality)
- Q6_K — 6-bit K-quants (all tensors quantized to 6-bit)
- Q8_0 — 8-bit block-wise quantization (near-lossless)
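The nominal bit widths above do not translate directly into file size, because quantized files also store per-block scales and may keep some tensors at higher precision. A rough way to see this is to compute the effective bits per weight from the file sizes in the Available Quantizations table. The sketch below assumes exactly 12 billion weights and 1 GB = 10^9 bytes, both of which are approximations.

```python
PARAMS = 12e9  # assumption: exactly 12 billion weights

# Approximate file sizes in GB, copied from the "Available Quantizations" table.
FILE_SIZES_GB = {
    "Q2_K": 4.8,
    "Q3_K_M": 6.1,
    "Q4_K_M": 7.4,
    "Q5_K_M": 8.6,
    "Q6_K": 9.8,
    "Q8_0": 12.7,
}


def bits_per_weight(size_gb, params=PARAMS):
    """Effective storage cost per weight, assuming 1 GB = 10^9 bytes."""
    return size_gb * 1e9 * 8 / params


for name, size_gb in FILE_SIZES_GB.items():
    print(f"{name}: ~{bits_per_weight(size_gb):.1f} bits/weight")
```

For Q4_K_M this works out to about 4.9 bits per weight, noticeably more than the nominal 4 bits.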
Citation
@misc{gemma4_12b_it_gguf,
title = {Gemma 4 12B IT GGUF Quantizations},
author = {{liodon-ai}},
year = {2026},
url = {https://huggingface.co/liodon-ai/gemma-4-12B-it-GGUF},
note = {Quantizations of google/gemma-4-12B-it by Google DeepMind}
}