osmGemma-4-12B-uncensored-mxfp4-mlx

MXFP4 (4-bit microscaling, 7.628 effective bpw) MLX quant of an abliterated (refusal-ablated) google/gemma-4-12B-it, Google's encoder-free unified multimodal model (text · image · audio · video). Quantized for Apple Silicon by osmAPI. All vision & audio weights are preserved.

⚠️ Abliterated model — read this

Refusal directions were ablated from the parent with Heretic. The model will answer many prompts the parent refuses, though not all: 12 of the 100 test prompts below were still refused. No new capabilities were added; only refusal behavior was reduced. Use responsibly and within applicable law.

🔓 Refusal removal — before / after

The headline result. Measured with Heretic's evaluator on 100 harmful prompts (mlabonne/harmful_behaviors test[:100]), greedy decoding, refusal-marker classifier:

Model	Refusals	Refusal rate
`google/gemma-4-12B-it` (original)	99 / 100	99.0%
this model (abliterated)	12 / 100	12.0%

↓ 87 fewer refusals (99 → 12), an 87.9% relative reduction, at KL divergence 0.053 from the original (≪ 0.5, the damage threshold), which suggests general capabilities are largely preserved. Measured at bf16; quantization preserves the abliteration.
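
For readers who want to see how numbers like these are derived, here is a minimal sketch of a refusal-marker classifier plus the reduction and KL arithmetic. The marker list and all function names are hypothetical stand-ins for illustration; this is not Heretic's actual evaluator.

```python
import math

# Hypothetical marker list for illustration; Heretic's real list may differ.
REFUSAL_MARKERS = ("i'm sorry", "i cannot", "i can't", "i won't", "as an ai")

def is_refusal(response: str) -> bool:
    """Flag a response as a refusal if it contains any marker (case-insensitive)."""
    text = response.lower().replace("\u2019", "'")  # normalize curly apostrophes
    return any(marker in text for marker in REFUSAL_MARKERS)

def relative_reduction(before: int, after: int) -> float:
    """Percentage drop in refusals relative to the original count."""
    return (before - after) / before * 100

def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions over the same tokens."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

print(round(relative_reduction(99, 12), 1))  # 87.9
```

Substring matching is cheap and deterministic, which suits greedy-decoding comparisons, but it can miss refusals phrased without any of the listed markers.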

📊 Specs

Property	Value
Disk size	~11.9 GB
Effective BPW	7.628
Scheme	MXFP4 (4-bit microscaling), group size 32 (MLX)
Base	`google/gemma-4-12B-it` — 11.95B, 48 layers, 256K context, 140+ languages
Modalities	text · image · audio · video in, text out (encoder-free / unified)
Vision / audio weights	✅ fully preserved (kept at bf16)
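
To make "4-bit microscaling, group size 32" concrete, the sketch below quantizes one block the way the OCP Microscaling (MX) format describes it, as far as we understand it: each element is a 4-bit E2M1 float and each block shares one power-of-two scale. The function names are ours, and MLX's kernels may differ in rounding and bit packing, so treat this as an illustration rather than the MLX implementation.

```python
import math

# Magnitudes representable by a 4-bit E2M1 element (sign is stored separately).
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
E2M1_EMAX = 2  # largest element is 1.5 * 2**2 = 6.0

def quantize_block(block):
    """Return (shared_exponent, dequantized_values) for one block of weights."""
    amax = max(abs(x) for x in block)
    if amax == 0:
        return 0, [0.0] * len(block)
    # Shared power-of-two scale chosen so the largest value lands near the top
    # of the E2M1 range; anything above 6 * scale saturates.
    exp = math.floor(math.log2(amax)) - E2M1_EMAX
    scale = 2.0 ** exp
    deq = []
    for x in block:
        mag = min(abs(x) / scale, E2M1_VALUES[-1])
        q = min(E2M1_VALUES, key=lambda v: abs(v - mag))  # nearest representable
        deq.append(math.copysign(q * scale, x))
    return exp, deq

def bits_per_weight(group_size=32, element_bits=4, scale_bits=8):
    """Storage cost per weight once the shared 8-bit scale is amortized."""
    return (group_size * element_bits + scale_bits) / group_size

print(bits_per_weight())  # 4.25
```

The quantized tensors therefore cost about 4.25 bits per weight; the higher effective figure in the table reflects tensors that are kept at higher precision.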

⚡ Inference & compatibility

These are MLX-format quants (Apple Silicon). Format matters — pick the runtime accordingly:

Runtime	This MLX quant?	Notes
mlx-vlm / mlx-lm (Mac)	✅ text today · 🟡 vision/audio pending	native; needs the small shim below
LM Studio (Mac · MLX engine)	🟡 when its bundled `mlx-lm` adds `gemma4_unified`	drop-in once supported
vLLM (CUDA/GPU)	❌ MLX format not supported	use bf16 `google/gemma-4-12B-it` + an FP8/AWQ/GPTQ quant instead
Ollama / llama.cpp	❌ needs GGUF	requires a separate GGUF build (llama.cpp `gemma4_unified` support pending)
🤗 transformers (PyTorch)	runs the bf16 model, not this quant	✅ full multimodal — see Vision & audio

Why the shim / "pending"? Gemma 4 12B is the brand-new gemma4_unified encoder-free architecture. mlx-vlm 0.5.0 ships a gemma4 module that loads the text + projection weights but does not yet implement the image patch-embedder forward, so vision/audio inference in MLX is pending an upstream update. The weights are all here, so it will "just work" once support lands.

🚀 Quick start — MLX (text)

pip install -U mlx-vlm torchvision     # torchvision is needed by the Gemma-4 processor

# Shim: map gemma4_unified onto mlx-vlm's gemma4 module + tolerate the
# not-yet-modeled vision patch-embedder tensors. Remove once mlx-vlm ships support.
import mlx_vlm.utils as U
U.MODEL_REMAPPING["gemma4_unified"] = "gemma4"
import mlx.nn as nn
_lw = nn.Module.load_weights
nn.Module.load_weights = lambda self, w, strict=True: _lw(self, w, strict=False)

from mlx_vlm import load, generate
model, processor = load("osmapi/osmGemma-4-12B-uncensored-mxfp4-mlx")

# Gemma 4 REQUIRES its chat template — a raw string produces garbage (repeated tokens).
messages = [{"role": "user", "content": "Explain abliteration in two sentences."}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
print(generate(model, processor, prompt, max_tokens=256).text)

Verified: this produces coherent output on Apple Silicon (M-series). The chat template + the shim are both required until mlx-vlm ships native gemma4_unified support.
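
The Quick start shim patches nn.Module.load_weights for the whole process. If you would rather limit the patch to the load call, a scoped variant could look like the sketch below. The helper name force_non_strict and the FakeModule stand-in are ours (the stand-in only keeps the example runnable without MLX); we have not verified this variant against mlx-vlm itself.

```python
from contextlib import contextmanager

@contextmanager
def force_non_strict(cls, method_name="load_weights"):
    """Temporarily make cls.<method_name> ignore strict=True, then restore it."""
    original = getattr(cls, method_name)

    def relaxed(self, weights, strict=True):
        # Always load non-strictly so unmodeled tensors are tolerated.
        return original(self, weights, strict=False)

    setattr(cls, method_name, relaxed)
    try:
        yield
    finally:
        setattr(cls, method_name, original)  # undo the patch even on errors

# Stand-in class so the sketch runs anywhere; it just echoes the strict flag.
class FakeModule:
    def load_weights(self, weights, strict=True):
        return strict

module = FakeModule()
with force_non_strict(FakeModule):
    print(module.load_weights([], strict=True))  # False while patched
print(module.load_weights([]))                   # True again afterwards

# Intended use with MLX (assumption, mirrors the shim above):
#   import mlx.nn as nn
#   with force_non_strict(nn.Module):
#       model, processor = load("osmapi/osmGemma-4-12B-uncensored-mxfp4-mlx")
```

Restoring the original method afterwards keeps strict loading in place for any other model you load in the same session.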

🍎 Running on Mac — inference apps

These are MLX models, so any Apple-Silicon MLX runtime can serve them — once its bundled mlx-lm/mlx-vlm recognizes the gemma4_unified encoder-free arch (text first; vision when the patch-embedder lands upstream). Until then, the Quick start shim above runs text today.

App	What it is
oMLX · omlx.ai	MLX inference server + macOS menu-bar app — paged SSD KV cache, continuous batching, OpenAI/Anthropic-compatible API (great for agents & long context)
vMLX	Free MLX Mac app — prefix + paged KV cache, continuous batching, MCP tools
LM Studio	GUI bundling the MLX engine + llama.cpp; best-in-class model browser (pick the MLX runtime)
Ollama 0.19+	now runs MLX under the hood on Apple Silicon; REST API on `:11434`
mlx-vlm / mlx-lm	Native MLX libraries (mlx-lm from Apple's ml-explore, mlx-vlm community-maintained); maximum performance (the Quick start above)
macMLX · Msty	native macOS MLX app · unified local-and-cloud workspace

They all build on mlx-lm/mlx-vlm, so gemma4_unified (and the vision path) arrives through that stack. For full multimodal today, run the bf16 repo in 🤗 transformers.
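
Once one of these apps serves the model behind an OpenAI-compatible endpoint, a client call might be built as sketched below. The base URL and port are assumptions (check your app's settings), and build_chat_request is a helper name of ours; only the standard library is used.

```python
import json
import urllib.request

def build_chat_request(base_url, model, prompt, max_tokens=256):
    """Build a POST request for an OpenAI-style /chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Assumed local endpoint; replace with whatever your app exposes.
req = build_chat_request(
    "http://localhost:8000/v1",
    "osmapi/osmGemma-4-12B-uncensored-mxfp4-mlx",
    "Explain abliteration in two sentences.",
)

# To send it (requires a running server):
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

Sending structured messages lets the server apply the chat template, which matters here because Gemma 4 produces garbage from raw strings.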

🖼️🎙️ Vision & audio

Every vision/audio weight (vision_embedder, embed_vision, embed_audio) and the vision_config/audio_config are preserved in this quant.

In MLX today: text only (see above). Image/audio inference is pending mlx-vlm encoder-free support — no re-quantization will be needed when it lands.
For image + audio right now, run the bf16 model in 🤗 transformers (which fully supports gemma4_unified):

pip install -U "transformers>=5.10" torch torchvision librosa accelerate

from transformers import AutoProcessor, AutoModelForMultimodalLM

mid = "google/gemma-4-12B-it"          # base model; swap for an abliterated-bf16 repo for refusal-free multimodal
processor = AutoProcessor.from_pretrained(mid)
model = AutoModelForMultimodalLM.from_pretrained(mid, dtype="auto", device_map="auto")

messages = [{"role": "user", "content": [
    {"type": "image", "url":   "https://.../photo.jpg"},   # image → key "url"
    {"type": "audio", "audio": "https://.../clip.wav"},    # audio → key "audio" (≤30s)
    {"type": "text",  "text":  "Describe what you see and hear."},
]}]
inputs = processor.apply_chat_template(messages, tokenize=True, return_dict=True,
    return_tensors="pt", add_generation_prompt=True, enable_thinking=False).to(model.device)
n = inputs["input_ids"].shape[-1]
out = model.generate(**inputs, max_new_tokens=512)
print(processor.parse_response(processor.decode(out[0][n:], skip_special_tokens=False)))

Audio: ≤ 30 s clips (native ASR + speech translation). Images: variable resolution. Video: ≤ 60 s at ~1 fps. For refusal-free multimodal, swap mid for the abliterated bf16 checkpoint (listed as osmGemma-4-12B-uncensored-bf16 in the quant family below; ask osmAPI if it is not yet published).
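
If you pre-process video yourself, the limits above translate into a simple frame-sampling rule. The sketch below is a hypothetical helper (not part of transformers or the Gemma processor, which may do its own sampling) that picks roughly one frame per second and rejects clips over the 60 s limit.

```python
def sample_frame_indices(duration_s, native_fps, target_fps=1.0, max_seconds=60.0):
    """Return source-frame indices sampled at about target_fps.

    Raises ValueError if the clip exceeds max_seconds.
    """
    if duration_s > max_seconds:
        raise ValueError(f"clip is {duration_s:.1f}s; limit is {max_seconds:.0f}s")
    n_frames = max(1, int(duration_s * target_fps))  # always keep at least one frame
    step = native_fps / target_fps                   # source frames per sampled frame
    return [int(i * step) for i in range(n_frames)]

print(sample_frame_indices(10, 30))
# [0, 30, 60, 90, 120, 150, 180, 210, 240, 270]
```

Longer clips can be split into segments of 60 s or less and described one segment at a time.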

🗂️ Quant family

Repo	Scheme	Eff. BPW	Size	Link / status
`osmGemma-4-12B-uncensored-bf16` — abliterated, full multimodal	bf16	16	~23.9 GB	↗
`osmGemma-4-12B-uncensored-8bit-mlx`	8-bit affine	8.805	~13.7 GB	↗
`osmGemma-4-12B-uncensored-mxfp4-mlx`	MXFP4 (4-bit microscaling)	7.628	~11.9 GB	✅ you are here
`osmGemma-4-12B-uncensored-mixed-4.2bpw-mlx`	mixed 3/4-bit	4.2	~6.6 GB	↗
`google/gemma-4-12B-it` — base (not abliterated)	bf16	16	~24 GB	↗
`google/gemma-4-12B-it-assistant` — MTP draft	can be added later	—	—	⏳ planned

🧪 Quantization details

Tool: mlx-vlm convert (MLX), group size 32, scheme MXFP4 (4-bit microscaling) → 7.628 bpw, ~11.9 GB.
Weight-complete: the 48-layer language model is quantized; the vision patch-embedder (vision_embedder, 9 tensors) is re-inserted at bf16 and vision_config/audio_config retained — every original tensor is present.
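
A "4-bit" scheme landing at 7.628 effective bpw can look surprising. As a back-of-the-envelope check, the sketch below assumes a simplified two-precision split: MXFP4 tensors at 4.25 bpw (4-bit elements plus one 8-bit scale per 32 weights) and everything else at bf16. The real mix in this repo may include other precisions, so the result is an estimate under that assumption, not a measured breakdown.

```python
MXFP4_BPW = 4.25  # (32 * 4 + 8) / 32, assuming an 8-bit shared scale per group
BF16_BPW = 16.0

def effective_bpw(bf16_fraction):
    """Average bits per weight for a given share of weights kept at bf16."""
    return MXFP4_BPW * (1 - bf16_fraction) + BF16_BPW * bf16_fraction

def bf16_fraction(target_bpw):
    """Share of weights at bf16 that would yield target_bpw (two-precision model)."""
    return (target_bpw - MXFP4_BPW) / (BF16_BPW - MXFP4_BPW)

share = bf16_fraction(7.628)
print(f"{share:.1%} of weights at bf16 would give {effective_bpw(share):.3f} bpw")
```

Under this simplification, a bit under 29% of the weights staying at bf16 would account for the reported figure.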

🧬 Lineage

google/gemma-4-12B                       (Google DeepMind — base pretrain)
        ↓  instruction tuning
google/gemma-4-12B-it               (multimodal, encoder-free)
        ↓  Heretic 1.3.0 — directional ablation, Optuna/TPE-optimized over 100 trials, best Pareto trial #55
abliterated bf16                         (refusals 99→12 / 100, KL 0.053)
        ↓  mlx-vlm quantization (osmAPI)
this repo — MXFP4 (4-bit microscaling), MLX
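
The "directional ablation" step in the lineage can be pictured as projecting a refusal direction out of a vector. The sketch below shows that core operation on plain Python lists; the function names and the weight parameter are ours for illustration, and Heretic's optimized, per-layer procedure is more involved than this.

```python
import math

def normalize(v):
    """Scale v to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def ablate_direction(x, direction, weight=1.0):
    """Remove (a fraction of) the component of x that lies along direction.

    weight=1.0 removes it fully; smaller values ablate only partially.
    """
    r = normalize(direction)
    projection = sum(xi * ri for xi, ri in zip(x, r))  # scalar component along r
    return [xi - weight * projection * ri for xi, ri in zip(x, r)]

print(ablate_direction([3.0, 4.0], [1.0, 0.0]))  # [0.0, 4.0]
```

Everything orthogonal to the direction is left untouched, which is why a well-chosen direction can reduce refusals while keeping KL divergence from the original low.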

🙏 Credits

Role	Project
Quantization & release	osmAPI
Abliteration	Heretic by p-e-w
Research	osmAPI Research Team · Terv Student Research Team
Base model	Google DeepMind — Gemma 4
Quant toolkit	mlx · mlx-vlm

📜 License

Apache-2.0 (inherited from the base). Also subject to the Gemma 4 Terms of Use.
