How to use from
Pi
Start the MLX server
# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "OsaurusAI/gemma-4-12B-it-qat-MXFP4"
Configure the model in Pi
# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "mlx-lm": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "OsaurusAI/gemma-4-12B-it-qat-MXFP4"
        }
      ]
    }
  }
}
Run Pi
# Start Pi in your project directory:
pi
Quick Links

Osaurus AI

OsaurusAI/gemma-4-12B-it-qat-MXFP4

MXFP4 MLX bundle converted from google/gemma-4-12B-it-qat-q4_0-unquantized. Decoder linears are quantized with MLX mxfp4 at group size 32; embeddings, norms, and Gemma 4 early-fusion media embedders are preserved as fp16 passthrough.

Bundle

Field Value
Source google/gemma-4-12B-it-qat-q4_0-unquantized
Architecture gemma4_unified / Gemma4UnifiedForConditionalGeneration
Text layers 48 total (8 full attention, 40 sliding attention)
Hidden size 3840
Quantization mxfp4, bits=4, group_size=32
Quantized weights 328 tensors with matching .scales sidecars
Shards 7 safetensors shards
Indexed weight bytes 7.37 GiB
Processor processor_class=Gemma4UnifiedProcessor, image_seq_length=280, audio_seq_length=750, audio_ms_per_token=40, video_processor=present

Modalities

Path Status
Text Preserved
Vision model_type=gemma4_unified_vision, patch_size=16
Audio model_type=gemma4_unified_audio
Video no video_config

Audio encoder/config is present and preserved.

No video_config is present in the source config; the processor file includes a video processor block, but this card does not claim a verified video runtime path.

Tokenizer And Template

Field Value
BOS token/id <bos> / 2
EOS token/id <eos> / [1, 106, 50]
PAD token/id <pad> / 0
Suppress tokens [258883, 258882]
Chat template chat_template.jinja, also folded into tokenizer_config.json
Tool parser metadata gemma4
Reasoning parser metadata gemma4

The chat template keeps Gemma 4 turn/channel formatting and includes the required-tool-choice compatibility stanza used by vMLX/Osaurus runtimes. The empty no-thinking thought-channel prefill is removed so non-thinking turns start in visible assistant content.

Files To Keep Together

  • config.json
  • jang_config.json
  • model.safetensors.index.json
  • all model-*.safetensors shards
  • tokenizer.json
  • tokenizer_config.json
  • processor_config.json
  • generation_config.json
  • chat_template.jinja

Loading

Use an MLX/vMLX runtime with Gemma 4 MXFP4 support. This bundle is not GGUF and should not be loaded with GGUF runtimes.

from mlx_vlm import load, generate

model, processor = load("OsaurusAI/gemma-4-12B-it-qat-MXFP4")

Notes

This is a quantized derivative of Google's Gemma 4 QAT release. License and use restrictions follow the upstream Gemma terms. Packaged for Apple Silicon MLX/vMLX use. Contact: eric@osaurus.ai.

Bundle Metadata

This bundle metadata is source-derived: text=true, vision=true, audio=true, video=false. No video runtime path is claimed unless video_config is present.

Downloads last month
2,398
Safetensors
Model size
3B params
Tensor type
F16
·
U32
·
U8
·
MLX
Hardware compatibility
Log In to add your hardware

Quantized

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for OsaurusAI/gemma-4-12B-it-qat-MXFP4

Finetuned
(14)
this model