---
license: apache-2.0
license_link: https://ai.google.dev/gemma/docs/gemma_4_license
language:
- en
- multilingual
tags:
- mlx
- mlx-vlm
- gemma4
- gemma-4
- abliterated
- refusal-ablated
- uncensored
- zerofuse
- multimodal
- apple-silicon
library_name: mlx
pipeline_tag: any-to-any
base_model: google/gemma-4-12B-it
base_model_relation: quantized
---

# osmGemma-4-12B-uncensored-mxfp4-mlx

> **MXFP4 (4-bit microscaling, 7.628 bpw) MLX quant** of an **abliterated** (refusal-ablated) `google/gemma-4-12B-it`, Google's *encoder-free* unified multimodal model (text · image · audio · video). Quantized for Apple Silicon by **osmAPI**. **All vision & audio weights are preserved.**

## ⚠️ Abliterated model — read this

Refusal directions were surgically removed from the parent via [ZeroFuse](https://github.com/junainfinity/ZeroFuse). It will answer many prompts the parent refuses. **No new capabilities were added — only refusal behavior was reduced.** Use responsibly and within applicable law.

## 🔓 Refusal removal — before / after

The headline result. Measured with ZeroFuse's evaluator on **100 harmful prompts** (`mlabonne/harmful_behaviors` `test[:100]`), greedy decoding, refusal-marker classifier:

| Model | Refusals | Refusal rate |
|---|---|---|
| `google/gemma-4-12B-it` (original) | **99 / 100** | 99.0% |
| **this model** (abliterated) | **12 / 100** | 12.0% |
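
As a rough illustration of how a refusal-marker classifier and the table's arithmetic fit together, here is a minimal sketch. The marker list and the helper names (`REFUSAL_MARKERS`, `is_refusal`, `refusal_rate`) are assumptions made for this example; ZeroFuse's actual markers and evaluator code are not reproduced in this card.

```python
# Hypothetical marker list: the real ZeroFuse markers are not given in this card.
REFUSAL_MARKERS = ("i cannot", "i can't", "i won't", "i'm sorry", "as an ai")


def is_refusal(response: str, markers=REFUSAL_MARKERS) -> bool:
    """Return True if the response contains any refusal marker (case-insensitive)."""
    text = response.lower().replace("\u2019", "'")  # normalize curly apostrophes
    return any(marker in text for marker in markers)


def refusal_rate(responses) -> float:
    """Fraction of responses classified as refusals."""
    responses = list(responses)
    return sum(is_refusal(r) for r in responses) / len(responses)


# Arithmetic behind the table: 99 -> 12 refusals out of 100 prompts.
before, after, total = 99, 12, 100
fewer = before - after                # absolute drop in refusals
relative_reduction = fewer / before   # drop relative to the original count
print(f"{fewer} fewer refusals, {relative_reduction:.1%} relative reduction")
```

Substring matching of this kind is cheap but coarse: it can miss refusals phrased without a marker and can flag compliant answers that happen to contain one, so the rates are best read as approximate.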

### ↓ 87 fewer refusals (an 87.9% reduction)

The abliterated model sits at **KL divergence 0.053** from the original, well below 0.5 (the damage threshold), which indicates that general capabilities are largely preserved. Both figures were measured at bf16; quantization preserves the abliteration.
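
For readers unfamiliar with the metric, the sketch below shows how a KL divergence between two next-token distributions can be computed. It assumes the direction KL(original ‖ abliterated) and uses toy logits; how ZeroFuse aggregates the value over prompts and token positions is not stated in this card, so treat this as an illustration rather than the evaluator's method.

```python
import math


def softmax(logits):
    """Convert raw logits into a probability distribution (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two discrete distributions over the same vocabulary."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


# Toy logits for illustration only; they are not taken from either model.
original = softmax([2.0, 1.0, 0.1])
ablated = softmax([1.9, 1.1, 0.1])
print(f"KL = {kl_divergence(original, ablated):.4f}")
```

A value near zero means the two models assign almost the same probabilities to the next token, which is why a small KL is read as a sign that behavior on ordinary prompts has barely moved.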

## 📊 Specs

| Spec | Value |
|---|---|
| **Disk size** | ~11.9 GB |
| **Effective BPW** | 7.628 |
| **Scheme** | MXFP4 (4-bit microscaling), group size 32 (MLX) |
| **Base** | `google/gemma-4-12B-it` — 11.95B, 48 layers, 256K context, 140+ languages |
| **Modalities** | text · image · audio · video in, text out (encoder-free / unified) |
| **Vision / audio weights** | ✅ fully preserved (kept at bf16) |

## ⚡ Inference & compatibility

These are **MLX-format** quants (Apple Silicon). Format matters — pick the runtime accordingly:

| Runtime | This MLX quant? | Notes |
|---|---|---|
| **mlx-vlm / mlx-lm** (Mac) | ✅ **text** today · 🟡 vision/audio pending | native; needs the small shim below |
| **LM Studio** (Mac · MLX engine) | 🟡 when its bundled `mlx-lm` adds `gemma4_unified` | drop-in once supported |
| **vLLM** (CUDA/GPU) | ❌ MLX format not supported | use bf16 `google/gemma-4-12B-it` + an FP8/AWQ/GPTQ quant instead |
| **Ollama / llama.cpp** | ❌ needs **GGUF** | requires a separate GGUF build (llama.cpp `gemma4_unified` support pending) |
| **🤗 transformers** (PyTorch) | runs the **bf16** model, not this quant | ✅ **full multimodal** — see *Vision & audio* |

> Why the shim / "pending"? Gemma 4 12B is the brand-new **`gemma4_unified`** *encoder-free* architecture. `mlx-vlm` 0.5.0 ships a `gemma4` module that loads the text + projection weights but does not yet implement the image **patch-embedder** forward, so **vision/audio inference in MLX is pending** an upstream update. The weights are all here, so it will "just work" once support lands.

## 🚀 Quick start — MLX (text)

```bash
pip install -U mlx-vlm torchvision     # torchvision is needed by the Gemma-4 processor
```

```python
import mlx.nn as nn
import mlx_vlm.utils as U
from mlx_vlm import load, generate

# Shim, part 1: route the gemma4_unified architecture to mlx-vlm's existing gemma4 module.
U.MODEL_REMAPPING["gemma4_unified"] = "gemma4"

# Shim, part 2: force non-strict weight loading so the not-yet-modeled vision
# patch-embedder tensors are tolerated. The patch is process-wide; remove both
# parts once mlx-vlm ships gemma4_unified support.
_load_weights = nn.Module.load_weights
nn.Module.load_weights = lambda self, w, strict=True: _load_weights(self, w, strict=False)

model, processor = load("osmapi/osmGemma-4-12B-uncensored-mxfp4-mlx")

# Gemma 4 REQUIRES its chat template; a raw string produces garbage (repeated tokens).
messages = [{"role": "user", "content": "Explain abliteration in two sentences."}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
print(generate(model, processor, prompt, max_tokens=256).text)
```

> Verified: this produces coherent output on Apple Silicon (M-series). The chat template + the shim are both required until `mlx-vlm` ships native `gemma4_unified` support.
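
To make the first half of the shim less opaque, here is a minimal sketch of what an architecture remapping amounts to. It assumes the checkpoint's `config.json` reports `model_type: "gemma4_unified"` and that the loader resolves module names through a dictionary lookup; `resolve_model_module` is a hypothetical helper written for this example, not an `mlx-vlm` function.

```python
# Stand-in for mlx_vlm.utils.MODEL_REMAPPING after the shim has added its entry.
MODEL_REMAPPING = {"gemma4_unified": "gemma4"}


def resolve_model_module(config: dict, remapping: dict = MODEL_REMAPPING) -> str:
    """Return the module name a loader would look up for this config.

    Unknown model types fall through unchanged, so other architectures
    are unaffected by the extra entry.
    """
    model_type = config["model_type"]
    return remapping.get(model_type, model_type)


print(resolve_model_module({"model_type": "gemma4_unified"}))
print(resolve_model_module({"model_type": "llama"}))
```

Because the mapping only adds one key, removing the shim later is a matter of deleting that entry once the upstream library recognizes the architecture on its own.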

## 🍎 Running on Mac — inference apps

These are **MLX** models, so any Apple-Silicon MLX runtime can serve them — **once its bundled `mlx-lm`/`mlx-vlm` recognizes the `gemma4_unified` encoder-free arch** (text first; vision when the patch-embedder lands upstream). Until then, the **Quick start** shim above runs text today.

| App | What it is |
|---|---|
| [**oMLX**](https://github.com/jundot/omlx) · [omlx.ai](https://omlx.ai) | MLX inference **server** + macOS menu-bar app — paged SSD KV cache, continuous batching, OpenAI/Anthropic-compatible API (great for agents & long context) |
| [**vMLX**](https://vmlx.net) | Free MLX Mac app — prefix + paged KV cache, continuous batching, MCP tools |
| [**LM Studio**](https://lmstudio.ai) | GUI bundling the **MLX** engine + llama.cpp; best-in-class model browser (pick the MLX runtime) |
| [**Ollama** 0.19+](https://ollama.com) | now runs **MLX** under the hood on Apple Silicon; REST API on `:11434` |
| [**mlx-vlm**](https://github.com/Blaizzy/mlx-vlm) / [**mlx-lm**](https://github.com/ml-explore/mlx-lm) | The native MLX libraries that the apps above build on: maximum performance (the **Quick start** above) |
| [**macMLX**](https://macmlx.app) · [**Msty**](https://msty.app) | native macOS MLX app · unified local-and-cloud workspace |

> They all build on `mlx-lm`/`mlx-vlm`, so `gemma4_unified` (and the vision path) arrives through that stack. For **full multimodal today**, run the [bf16 repo](https://huggingface.co/osmapi/osmGemma-4-12B-uncensored-bf16) in 🤗 transformers.
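
For servers that expose an OpenAI-compatible API, a request could look roughly like the sketch below. The base URL, port, and the `build_chat_request` helper are assumptions for illustration; check your server's documentation for its actual address and the model name it expects, and note that the request only succeeds once that server can load this architecture.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, user_text: str, max_tokens: int = 256):
    """Build (but do not send) an OpenAI-style chat-completions request."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical local address; use whatever host and port your server reports.
req = build_chat_request(
    "http://localhost:8000",
    "osmapi/osmGemma-4-12B-uncensored-mxfp4-mlx",
    "Explain abliteration in two sentences.",
)
# To send it against a running server:
#   reply = json.load(urllib.request.urlopen(req))
#   print(reply["choices"][0]["message"]["content"])
```

Letting the server apply the chat template from the `messages` list avoids the raw-string problem described in the Quick start.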

## 🖼️🎙️ Vision & audio

Every vision/audio weight (`vision_embedder`, `embed_vision`, `embed_audio`) and the `vision_config`/`audio_config` are **preserved** in this quant.

- **In MLX today:** text only (see above). Image/audio **inference** is pending `mlx-vlm` encoder-free support — no re-quantization will be needed when it lands.
- **For image + audio right now**, run the **bf16** model in 🤗 transformers (which fully supports `gemma4_unified`):

```bash
pip install -U "transformers>=5.10" torch torchvision librosa accelerate
```

```python
from transformers import AutoProcessor, AutoModelForMultimodalLM

mid = "google/gemma-4-12B-it"          # base model; use "osmapi/osmGemma-4-12B-uncensored-bf16" for the abliterated variant
processor = AutoProcessor.from_pretrained(mid)
model = AutoModelForMultimodalLM.from_pretrained(mid, dtype="auto", device_map="auto")

messages = [{"role": "user", "content": [
    {"type": "image", "url":   "https://.../photo.jpg"},   # image → key "url"
    {"type": "audio", "audio": "https://.../clip.wav"},    # audio → key "audio" (≤30s)
    {"type": "text",  "text":  "Describe what you see and hear."},
]}]
inputs = processor.apply_chat_template(messages, tokenize=True, return_dict=True,
    return_tensors="pt", add_generation_prompt=True, enable_thinking=False).to(model.device)
n = inputs["input_ids"].shape[-1]
out = model.generate(**inputs, max_new_tokens=512)
print(processor.parse_response(processor.decode(out[0][n:], skip_special_tokens=False)))
```

> Audio: ≤ 30 s clips (native ASR + speech translation). Images: variable resolution. Video: ≤ 60 s at ~1 fps. For **refusal-free** multimodal, swap `mid` for the abliterated bf16 checkpoint, [`osmapi/osmGemma-4-12B-uncensored-bf16`](https://huggingface.co/osmapi/osmGemma-4-12B-uncensored-bf16).
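
Since clips longer than the stated limit are out of spec, it can help to check duration before building the message. The sketch below does this for uncompressed WAV files using only the standard library; the helper names are hypothetical, and compressed formats (MP3, FLAC, and so on) would need a different reader.

```python
import io
import wave

MAX_AUDIO_SECONDS = 30  # audio limit stated above


def wav_duration_seconds(source) -> float:
    """Duration of a PCM WAV; `source` is a path string or a binary file object."""
    with wave.open(source, "rb") as wf:
        return wf.getnframes() / wf.getframerate()


def fits_audio_limit(source, limit: float = MAX_AUDIO_SECONDS) -> bool:
    """True if the clip is no longer than `limit` seconds."""
    return wav_duration_seconds(source) <= limit


# Self-check on one second of 16 kHz mono silence generated in memory.
buf = io.BytesIO()
with wave.open(buf, "wb") as wf:
    wf.setnchannels(1)      # mono
    wf.setsampwidth(2)      # 16-bit samples
    wf.setframerate(16000)  # 16 kHz
    wf.writeframes(b"\x00\x00" * 16000)
buf.seek(0)
print(fits_audio_limit(buf))
```

Longer recordings can be split into segments under the limit and sent as separate turns.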

## 🗂️ Quant family

| Repo | Scheme | Eff. BPW | Size | Link |
|---|---|---|---|---|
| `osmGemma-4-12B-uncensored-bf16` — abliterated, **full multimodal** | bf16 | 16 | ~23.9 GB | [↗](https://huggingface.co/osmapi/osmGemma-4-12B-uncensored-bf16) |
| `osmGemma-4-12B-uncensored-8bit-mlx` | 8-bit affine | 8.805 | ~13.7 GB | [↗](https://huggingface.co/osmapi/osmGemma-4-12B-uncensored-8bit-mlx) |
| `osmGemma-4-12B-uncensored-mxfp4-mlx` | MXFP4 (4-bit microscaling) | 7.628 | ~11.9 GB | ✅ **you are here** |
| `osmGemma-4-12B-uncensored-mixed-4.2bpw-mlx` | mixed 3/4-bit | 4.2 | ~6.6 GB | [↗](https://huggingface.co/osmapi/osmGemma-4-12B-uncensored-mixed-4.2bpw-mlx) |
| `google/gemma-4-12B-it` — base (not abliterated) | bf16 | 16 | ~24 GB | [↗](https://huggingface.co/google/gemma-4-12B-it) |
| `google/gemma-4-12B-it-assistant` — MTP draft | *can be added later* | — | — | ⏳ planned |

## 🧪 Quantization details

- **Tool:** `mlx-vlm` convert (MLX), group size 32, scheme **MXFP4 (4-bit microscaling)** → **7.628 bpw**, ~11.9 GB.
- **Weight-complete:** the 48-layer language model is quantized; the vision patch-embedder (`vision_embedder`, 9 tensors) is re-inserted at **bf16** and `vision_config`/`audio_config` retained — every original tensor is present.
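
The gap between "4-bit" and the reported effective figure can be reasoned about with some simple arithmetic. The sketch below assumes the usual microscaling layout of one shared 8-bit scale per group of 32 four-bit elements, and models the checkpoint as a mix of quantized and bf16 tensors. The 50/50 split in the example is purely illustrative and is not this repo's actual ratio.

```python
def mx_bits_per_weight(element_bits: int = 4, scale_bits: int = 8, group_size: int = 32) -> float:
    """Storage cost per weight when each group of weights shares one scale."""
    return element_bits + scale_bits / group_size


def effective_bpw(frac_quantized: float, quant_bpw: float, full_bpw: float = 16.0) -> float:
    """Parameter-weighted average bpw for a mix of quantized and bf16 tensors."""
    return frac_quantized * quant_bpw + (1.0 - frac_quantized) * full_bpw


mxfp4 = mx_bits_per_weight()       # 4 bits per element plus 8/32 bits of shared scale
print(mxfp4)
print(effective_bpw(0.5, mxfp4))   # illustrative split only, not measured from this repo
```

Under these assumptions a fully quantized tensor costs slightly more than 4 bits per weight, and every tensor left at bf16 pulls the checkpoint-wide average upward, which is the sense in which an "effective" bpw is reported.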

## 🧬 Lineage

```
google/gemma-4-12B                       (Google DeepMind — base pretrain)
        ↓  instruction tuning
google/gemma-4-12B-it                    (multimodal, encoder-free)
        ↓  ZeroFuse 1.3.0 — directional ablation, Optuna/TPE-optimized over 100 trials, best Pareto trial #55
abliterated bf16                         (refusals 99→12 / 100, KL 0.053)
        ↓  mlx-vlm quantization (osmAPI)
this repo — MXFP4 (4-bit microscaling), MLX
```

## 🙏 Credits

| Role | Project |
|---|---|
| **Quantization & release** | [osmAPI](https://huggingface.co/osmapi) |
| **Abliteration** | [ZeroFuse](https://github.com/junainfinity/ZeroFuse) by [osmAPI](https://osmAPI.com) |
| **Research** | [osmAPI Research Team](https://osmapi.com) · [Terv Student Research Team](https://terv.pro) |
| **Base model** | [Google DeepMind](https://huggingface.co/google) — Gemma 4 |
| **Quant toolkit** | [mlx](https://github.com/ml-explore/mlx) · [mlx-vlm](https://github.com/Blaizzy/mlx-vlm) |

## 📜 License

Apache-2.0 (inherited from the base). Also subject to the [Gemma 4 Terms of Use](https://ai.google.dev/gemma/docs/gemma_4_license).