OsaurusAI/gemma-4-12B-it-qat-MXFP4

MXFP4 MLX bundle converted from google/gemma-4-12B-it-qat-q4_0-unquantized. Decoder linear layers are quantized with MLX mxfp4 at group size 32; embeddings, norms, and the Gemma 4 early-fusion media embedders are left unquantized and passed through as fp16.
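As a rough guide to what group size 32 means for storage, the sketch below computes the effective bits per quantized weight. It assumes the usual microscaling layout, in which each group of 32 four-bit values shares a single 8-bit scale; the function names are illustrative and are not part of MLX.

```python
# Assumption: MX-style block scaling, one 8-bit shared scale per group of 32 values.
ELEMENT_BITS = 4
SCALE_BITS = 8
GROUP_SIZE = 32


def effective_bits_per_weight(element_bits=ELEMENT_BITS,
                              scale_bits=SCALE_BITS,
                              group_size=GROUP_SIZE):
    """Bits stored per weight once the shared scale is amortized over the group."""
    return element_bits + scale_bits / group_size


def quantized_bytes(num_weights, **kwargs):
    """Approximate storage in bytes for `num_weights` quantized values."""
    return num_weights * effective_bits_per_weight(**kwargs) / 8


if __name__ == "__main__":
    print(effective_bits_per_weight())  # 4.25
```

Under that assumption the quantized layers cost 4.25 bits per weight; the fp16 passthrough tensors account for the rest of the bundle size.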

Bundle

Field	Value
Source	`google/gemma-4-12B-it-qat-q4_0-unquantized`
Architecture	`gemma4_unified` / `Gemma4UnifiedForConditionalGeneration`
Text layers	48 total (8 full attention, 40 sliding attention)
Hidden size	3840
Quantization	`mxfp4`, bits=4, group_size=32
Quantized weights	328 tensors with matching `.scales` sidecars
Shards	7 safetensors shards
Indexed weight bytes	7.37 GiB
Processor	`processor_class=Gemma4UnifiedProcessor, image_seq_length=280, audio_seq_length=750, audio_ms_per_token=40, video_processor=present`
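
The pairing of quantized weights with `.scales` sidecars can be checked from the shard index alone. This is a minimal sketch, assuming the index uses the standard `weight_map` key and that each sidecar is named by swapping the `.weight` suffix for `.scales`; both helper names are hypothetical.

```python
import json
from pathlib import Path


def load_weight_map(bundle_dir):
    """Read the tensor-name -> shard-file mapping from the safetensors index."""
    index_path = Path(bundle_dir) / "model.safetensors.index.json"
    with index_path.open() as f:
        return json.load(f)["weight_map"]


def check_scale_pairs(weight_map):
    """Return (paired_count, orphaned_scales).

    Assumption: a quantized tensor `<prefix>.weight` has a sidecar
    named `<prefix>.scales` in the same index.
    """
    names = set(weight_map)
    paired = 0
    orphaned = []
    for name in sorted(names):
        if not name.endswith(".scales"):
            continue
        weight_name = name[: -len(".scales")] + ".weight"
        if weight_name in names:
            paired += 1
        else:
            orphaned.append(name)
    return paired, orphaned
```

Calling `check_scale_pairs(load_weight_map(path))` on a local copy of the bundle should report a paired count that matches the table and no orphaned sidecars, if the naming assumption holds.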

Modalities

Path	Status
Text	Preserved
Vision	model_type=gemma4_unified_vision, patch_size=16
Audio	model_type=gemma4_unified_audio
Video	no video_config

Audio encoder/config is present and preserved.

No video_config is present in the source config; the processor file includes a video processor block, but this card does not claim a verified video runtime path.

Tokenizer And Template

Field	Value
BOS token/id	`<bos>` / `2`
EOS token/id	`<eos>` / `[1, 106, 50]`
PAD token/id	`<pad>` / `0`
Suppress tokens	`[258883, 258882]`
Chat template	`chat_template.jinja`, also folded into `tokenizer_config.json`
Tool parser metadata	`gemma4`
Reasoning parser metadata	`gemma4`
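
To illustrate how the EOS and suppress-token ids in the table are conventionally used during decoding, here is a small sketch on plain Python lists. It is not the vMLX/Osaurus implementation; the function names are hypothetical, and a real runtime would apply the same idea to its own logits arrays.

```python
import math

# Values copied from the table above.
EOS_TOKEN_IDS = frozenset({1, 106, 50})
SUPPRESS_TOKEN_IDS = (258883, 258882)


def apply_suppression(logits, suppress_ids=SUPPRESS_TOKEN_IDS):
    """Return a copy of `logits` with suppressed ids set to -inf.

    A token whose logit is -inf can never be sampled. Ids outside the
    vocabulary range are ignored.
    """
    masked = list(logits)
    for token_id in suppress_ids:
        if 0 <= token_id < len(masked):
            masked[token_id] = -math.inf
    return masked


def is_stop_token(token_id, eos_ids=EOS_TOKEN_IDS):
    """True when generation should stop after emitting `token_id`."""
    return token_id in eos_ids
```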

The chat template keeps Gemma 4 turn/channel formatting and includes the required-tool-choice compatibility stanza used by vMLX/Osaurus runtimes. The empty no-thinking thought-channel prefill is removed so non-thinking turns start in visible assistant content.

Files To Keep Together

config.json
jang_config.json
model.safetensors.index.json
all model-*.safetensors shards
tokenizer.json
tokenizer_config.json
processor_config.json
generation_config.json
chat_template.jinja
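
A local copy can be checked against this list before loading. The sketch below is one possible way to do it, assuming the index file uses the standard `weight_map` key to name its shards; `missing_files` is a hypothetical helper, not part of any runtime.

```python
import json
from pathlib import Path

# Fixed-name files from the list above; shard names are read from the index.
REQUIRED_FILES = [
    "config.json",
    "jang_config.json",
    "model.safetensors.index.json",
    "tokenizer.json",
    "tokenizer_config.json",
    "processor_config.json",
    "generation_config.json",
    "chat_template.jinja",
]


def missing_files(bundle_dir):
    """Return the names of required files and shards absent from `bundle_dir`."""
    root = Path(bundle_dir)
    missing = [name for name in REQUIRED_FILES if not (root / name).is_file()]

    index_path = root / "model.safetensors.index.json"
    if index_path.is_file():
        # Every distinct value in weight_map is a shard that must sit beside the index.
        weight_map = json.loads(index_path.read_text())["weight_map"]
        shards = sorted(set(weight_map.values()))
        missing.extend(s for s in shards if not (root / s).is_file())
    return missing
```

An empty list means every listed file and every shard named in the index is present.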

Loading

Use an MLX/vMLX runtime with Gemma 4 MXFP4 support. This bundle is not GGUF and should not be loaded with GGUF runtimes.

from mlx_vlm import load, generate

# Fetch the bundle from the Hub (or reuse a local copy) and build the model and its processor.
model, processor = load("OsaurusAI/gemma-4-12B-it-qat-MXFP4")

Notes

This is a quantized derivative of Google's Gemma 4 QAT release. License and use restrictions follow the upstream Gemma terms. Packaged for Apple Silicon MLX/vMLX use. Contact: eric@osaurus.ai.

Bundle Metadata

The bundle metadata is derived from the source config: text=true, vision=true, audio=true, video=false. No video runtime path is claimed unless video_config is present.
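
The rule behind these flags can be sketched as a presence check on the source config. Only `video_config` is named in this card; the other key names (`text_config`, `vision_config`, `audio_config`) are assumptions made for illustration, as is the helper itself.

```python
# Assumption: each modality is described by a top-level "<name>_config" entry.
# Only "video_config" is confirmed by this card; the other keys are guesses.
MODALITY_KEYS = {
    "text": "text_config",
    "vision": "vision_config",
    "audio": "audio_config",
    "video": "video_config",
}


def modality_flags(config):
    """Map each modality to True when its config block is present and not null."""
    return {
        modality: config.get(key) is not None
        for modality, key in MODALITY_KEYS.items()
    }
```

With a config that carries text, vision, and audio blocks but no `video_config`, this returns the same four flags as listed above.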