Gemma 4 12B IT — iMatrix GGUF (including sub-4-bit)

The only iMatrix GGUF repo for Gemma 4 12B that includes Q2 and Q3 quants.

Most iMatrix repos for this model stop at Q4. This one goes all the way down to IQ2_M (4.1 GB), which makes Gemma 4 12B runnable on 6 GB of VRAM while still benefiting from iMatrix calibration.

All quants were produced with llama.cpp using importance-matrix calibration on a 2M-token sample of wikitext-103.


Quick Start

Ollama

ollama run hf.co/liodon-ai/gemma-4-12B-it-imatrix-GGUF:Q4_K_M

llama.cpp

llama-cli -hf liodon-ai/gemma-4-12B-it-imatrix-GGUF:Q4_K_M

LM Studio / Jan

Search liodon-ai/gemma-4-12B-it-imatrix-GGUF and pick your quant.


Available Quants

Quant    Size     VRAM    Notes
IQ2_M    4.1 GB   6 GB    Ultra-tiny. iMatrix keeps it coherent where standard Q2 breaks down
IQ3_M    5.4 GB   7 GB    Best quality under 6 GB file size
Q2_K     4.5 GB   6 GB    Smallest standard quant. Runs almost anywhere
Q3_K_M   5.7 GB   7 GB    Good balance for tight VRAM
IQ4_XS   6.2 GB   8 GB    iMatrix Q4. Rivals standard Q5 at smaller size
Q4_K_M   6.9 GB   8 GB    Recommended. Sweet spot for most setups
Q5_K_M   8.0 GB   10 GB   High quality
Q6_K     9.2 GB   12 GB   Near-lossless
Q8_0     12 GB    16 GB   Basically full quality
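
As a rough cross-check on what "sub-4-bit" means here, the file sizes in the table can be converted to average bits per weight. This is a minimal sketch that assumes exactly 12 billion parameters and 1 GB = 10^9 bytes. Neither assumption comes from the model card, and real GGUF files also carry metadata and may keep some tensors at higher precision, so treat the results as estimates only.

```python
# Rough bits-per-weight estimate from the file sizes listed in the table.
# Assumptions (not from the model card): exactly 12e9 parameters, 1 GB = 1e9 bytes.
PARAMS = 12e9

FILE_SIZE_GB = {
    "IQ2_M": 4.1, "Q2_K": 4.5, "IQ3_M": 5.4, "Q3_K_M": 5.7, "IQ4_XS": 6.2,
    "Q4_K_M": 6.9, "Q5_K_M": 8.0, "Q6_K": 9.2, "Q8_0": 12.0,
}


def bits_per_weight(size_gb, params=PARAMS):
    """Average number of bits stored per parameter for a file of size_gb."""
    return size_gb * 1e9 * 8 / params


if __name__ == "__main__":
    for name, size in FILE_SIZE_GB.items():
        print(f"{name:7s} {size:5.1f} GB  ~{bits_per_weight(size):.2f} bits/weight")
```

Under these assumptions IQ2_M comes out at roughly 2.7 bits per weight and Q4_K_M at roughly 4.6.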

Why iMatrix Matters for Sub-4-bit Quants

Standard quantization treats every weight as equally important, so at Q2/Q3 the rounding error hits critical weights as hard as unimportant ones: the model loses coherence, repeats itself, and produces broken output. Other iMatrix repos for this model have excluded sub-4-bit quants entirely for this reason.

iMatrix fixes this by identifying which weights actually matter during a calibration pass over real text, then protecting those weights from aggressive rounding. The result: IQ2_M and IQ3_M remain usable and coherent at sizes that standard Q2/Q3 can't match.

If you have 6-7 GB of VRAM and want to run Gemma 4 12B, the sub-4-bit quants here (IQ2_M, Q2_K, IQ3_M, Q3_K_M) are the ones that fit. At 8 GB, IQ4_XS and Q4_K_M fit as well.
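
To make the choice concrete, here is a small sketch that picks a quant for a given VRAM budget using the sizes and VRAM figures from the Available Quants table. The function name `pick_quant` and the "largest file that fits" rule are illustrative assumptions, not part of this repo. File size is only a rough proxy for quality.

```python
# (name, file size in GB, suggested VRAM in GB), copied from the Available Quants table.
QUANTS = [
    ("IQ2_M", 4.1, 6), ("IQ3_M", 5.4, 7), ("Q2_K", 4.5, 6), ("Q3_K_M", 5.7, 7),
    ("IQ4_XS", 6.2, 8), ("Q4_K_M", 6.9, 8), ("Q5_K_M", 8.0, 10),
    ("Q6_K", 9.2, 12), ("Q8_0", 12.0, 16),
]


def pick_quant(vram_gb):
    """Return the largest listed quant whose suggested VRAM fits, or None.

    Assumption: a larger file means higher quality, which is only a heuristic.
    """
    fitting = [q for q in QUANTS if q[2] <= vram_gb]
    if not fitting:
        return None
    return max(fitting, key=lambda q: q[1])[0]


if __name__ == "__main__":
    for budget in (4, 6, 7, 8, 12, 24):
        print(f"{budget:2d} GB VRAM -> {pick_quant(budget)}")
```

When two quants share the same VRAM tier, this heuristic simply prefers the larger file. You may still want to compare the IQ and K variants yourself on your own prompts.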


What is iMatrix?

Standard GGUF quantization: compress all weights equally → fast, but imprecise at low bit widths.

iMatrix quantization:

  1. Run a calibration text through the full-precision model
  2. Record activation statistics to estimate which weights contribute most to the output (the "importance matrix")
  3. Quantize with higher precision on important weights, lower precision on less important ones

Same file size. Better output. Most noticeable at Q2/Q3/Q4.
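
The three steps above can be illustrated with a toy example. This is a simplified sketch, not llama.cpp's actual implementation, which works block-wise and differs in detail. It assumes importance is the mean squared activation of each input channel, uses a single scale for a short row of weights, and searches for the scale that minimizes the importance-weighted rounding error. All function names and the sample numbers are made up for illustration.

```python
# Toy illustration of importance-weighted quantization (not llama.cpp's actual code).


def importance_from_activations(activation_rows):
    """Mean squared activation per input channel over the calibration samples."""
    n = len(activation_rows)
    channels = len(activation_rows[0])
    return [sum(row[c] ** 2 for row in activation_rows) / n for c in range(channels)]


def dequantized(weights, scale, qmax):
    """Round each weight to an integer level in [-qmax, qmax], then map it back."""
    return [max(-qmax, min(qmax, round(w / scale))) * scale for w in weights]


def weighted_error(weights, approx, importance):
    """Squared rounding error, with each weight's error scaled by its importance."""
    return sum(i * (w - a) ** 2 for w, a, i in zip(weights, approx, importance))


def plain_scale(weights, qmax):
    """Importance-blind choice: map the largest magnitude to the top level."""
    return max(abs(w) for w in weights) / qmax  # assumes at least one nonzero weight


def importance_aware_scale(weights, importance, qmax, steps=64):
    """Search scales around the plain one and keep the lowest weighted error."""
    base = plain_scale(weights, qmax)
    best = base
    best_err = weighted_error(weights, dequantized(weights, base, qmax), importance)
    for k in range(1, steps + 1):
        candidate = base * (0.5 + k / steps)  # from about 0.5x to 1.5x the plain scale
        err = weighted_error(weights, dequantized(weights, candidate, qmax), importance)
        if err < best_err:
            best, best_err = candidate, err
    return best


if __name__ == "__main__":
    weights = [0.90, -0.10, 0.35, 0.02]
    activations = [[0.1, 2.0, 0.3, 0.1], [0.2, 1.5, 0.2, 0.0]]  # made-up calibration data
    imp = importance_from_activations(activations)
    qmax = 1  # three levels (-1, 0, +1), a very coarse grid
    for name, s in [("plain", plain_scale(weights, qmax)),
                    ("importance-aware", importance_aware_scale(weights, imp, qmax))]:
        err = weighted_error(weights, dequantized(weights, s, qmax), imp)
        print(f"{name:17s} scale={s:.3f} weighted_error={err:.4f}")
```

Because the search starts from the plain scale and only accepts improvements, the importance-aware result is never worse than the plain one on the weighted error. The storage cost is identical, which is the sense in which iMatrix gives "same file size, better output".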


Calibration

The importance matrix was computed from a 2M-token sample of wikitext-103 (diverse English text from Wikipedia articles across many topics), using 128 calibration chunks.


Base Model

  • Model: google/gemma-4-12B-it
  • Params: 12B
  • Context: 128K tokens
  • Architecture: Gemma 4 (multimodal)
  • License: Apache 2.0
  • Authors: Google DeepMind