osmGemma-4-12B-uncensored-mxfp4-mlx
An MXFP4 (4-bit microscaling, 7.628 bpw) MLX quant of an abliterated (refusal-ablated) google/gemma-4-12B-it, Google's encoder-free unified multimodal model (text · image · audio · video). Quantized for Apple Silicon by osmAPI. All vision & audio weights are preserved.
⚠️ Abliterated model — read this
Refusal directions were surgically removed from the parent via Heretic. It will answer many prompts the parent refuses. No new capabilities were added — only refusal behavior was reduced. Use responsibly and within applicable law.
🔓 Refusal removal — before / after
The headline result. Measured with Heretic's evaluator on 100 harmful prompts (mlabonne/harmful_behaviors test[:100]), greedy decoding, refusal-marker classifier:
| Model | Refusals | Refusal rate |
|---|---|---|
| google/gemma-4-12B-it (original) | 99 / 100 | 99.0% |
| this model (abliterated) | 12 / 100 | 12.0% |
↓ 87 fewer refusals — an 87.9% reduction, at KL divergence 0.053 from the original (≪ 0.5, the damage threshold) → general capabilities preserved. Measured at bf16; quantization preserves the abliteration.
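The two quantities in that summary line can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the marker list is a hypothetical stand-in (Heretic's actual list is not reproduced here), and the KL helper assumes the divergence is taken between two next-token probability distributions, which this card does not spell out.

```python
import math

# Hypothetical markers for illustration; not the evaluator's real list.
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "i won't")


def is_refusal(response, markers=REFUSAL_MARKERS):
    """Flag a response as a refusal if it contains any marker substring."""
    text = response.lower()
    return any(marker in text for marker in markers)


def relative_reduction(before, after):
    """Fraction of the original refusals that disappeared."""
    return (before - after) / before


def softmax(logits):
    """Turn raw logits into a probability distribution (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions over the same tokens."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# 99 refusals before, 12 after: (99 - 12) / 99
print(f"{relative_reduction(99, 12):.1%}")  # 87.9%
```

A KL value near zero means the two distributions are almost identical, which is the sense in which 0.053 is read as "capabilities preserved" above.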
📊 Specs
| Property | Value |
|---|---|
| Disk size | ~11.9 GB |
| Effective BPW | 7.628 |
| Scheme | MXFP4 (4-bit microscaling), group size 32 (MLX) |
| Base | google/gemma-4-12B-it (11.95B, 48 layers, 256K context, 140+ languages) |
| Modalities | text · image · audio · video in, text out (encoder-free / unified) |
| Vision / audio weights | ✅ fully preserved (kept at bf16) |
⚡ Inference & compatibility
These are MLX-format quants (Apple Silicon). Format matters — pick the runtime accordingly:
| Runtime | This MLX quant? | Notes |
|---|---|---|
| mlx-vlm / mlx-lm (Mac) | ✅ text today · 🟡 vision/audio pending | native; needs the small shim below |
| LM Studio (Mac · MLX engine) | 🟡 when its bundled mlx-lm adds gemma4_unified | drop-in once supported |
| vLLM (CUDA/GPU) | ❌ MLX format not supported | use bf16 google/gemma-4-12B-it + an FP8/AWQ/GPTQ quant instead |
| Ollama / llama.cpp | ❌ needs GGUF | requires a separate GGUF build (llama.cpp gemma4_unified support pending) |
| 🤗 transformers (PyTorch) | runs the bf16 model, not this quant | ✅ full multimodal — see Vision & audio |
Why the shim / "pending"? Gemma 4 12B uses the brand-new gemma4_unified encoder-free architecture. mlx-vlm 0.5.0 ships a gemma4 module that loads the text + projection weights but does not yet implement the image patch-embedder forward, so vision/audio inference in MLX is pending an upstream update. The weights are all here, so it will "just work" once support lands.
🚀 Quick start — MLX (text)
pip install -U mlx-vlm torchvision # torchvision is needed by the Gemma-4 processor
# Shim: map gemma4_unified onto mlx-vlm's gemma4 module + tolerate the
# not-yet-modeled vision patch-embedder tensors. Remove once mlx-vlm ships support.
import mlx_vlm.utils as U
U.MODEL_REMAPPING["gemma4_unified"] = "gemma4"
import mlx.nn as nn
_lw = nn.Module.load_weights
nn.Module.load_weights = lambda self, w, strict=True: _lw(self, w, strict=False)
from mlx_vlm import load, generate
model, processor = load("osmapi/osmGemma-4-12B-uncensored-mxfp4-mlx")
# Gemma 4 REQUIRES its chat template — a raw string produces garbage (repeated tokens).
messages = [{"role": "user", "content": "Explain abliteration in two sentences."}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
print(generate(model, processor, prompt, max_tokens=256).text)
Verified: this produces coherent output on Apple Silicon (M-series). The chat template and the shim are both required until mlx-vlm ships native gemma4_unified support.
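The shim above patches load_weights for the whole process. If you would rather limit the patch to the load call, a context manager can restore the original method afterwards. This is a minimal sketch, not part of mlx-vlm: force_non_strict and DummyModule are hypothetical names, and the stand-in class only mimics a load_weights(weights, strict=True) signature so the pattern can be tried without MLX installed.

```python
from contextlib import contextmanager


@contextmanager
def force_non_strict(cls, method_name="load_weights"):
    """Temporarily make cls.<method_name> ignore strict=True, then restore it."""
    original = getattr(cls, method_name)

    def patched(self, weights, strict=True):
        # Always forward strict=False, whatever the caller asked for.
        return original(self, weights, strict=False)

    setattr(cls, method_name, patched)
    try:
        yield
    finally:
        setattr(cls, method_name, original)


class DummyModule:
    """Stand-in for a module class; returns the strict flag it received."""

    def load_weights(self, weights, strict=True):
        return strict


module = DummyModule()
with force_non_strict(DummyModule):
    print(module.load_weights({}))  # False while the patch is active
print(module.load_weights({}))      # True again after the block exits
```

With MLX installed, the same idea would presumably be applied as `with force_non_strict(nn.Module): model, processor = load(...)`, which keeps strict loading intact for everything else in the session.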
🍎 Running on Mac — inference apps
These are MLX models, so any Apple-Silicon MLX runtime can serve them — once its bundled mlx-lm/mlx-vlm recognizes the gemma4_unified encoder-free arch (text first; vision when the patch-embedder lands upstream). Until then, the Quick start shim above runs text today.
| App | What it is |
|---|---|
| oMLX · omlx.ai | MLX inference server + macOS menu-bar app — paged SSD KV cache, continuous batching, OpenAI/Anthropic-compatible API (great for agents & long context) |
| vMLX | Free MLX Mac app — prefix + paged KV cache, continuous batching, MCP tools |
| LM Studio | GUI bundling the MLX engine + llama.cpp; best-in-class model browser (pick the MLX runtime) |
| Ollama 0.19+ | now runs MLX under the hood on Apple Silicon; REST API on :11434 |
| mlx-vlm / mlx-lm | Apple's native libraries — maximum performance (the Quick start above) |
| macMLX · Msty | native macOS MLX app · unified local-and-cloud workspace |
They all build on mlx-lm/mlx-vlm, so gemma4_unified (and the vision path) arrives through that stack. For full multimodal today, run the bf16 repo in 🤗 transformers.
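For the apps that expose an OpenAI-compatible API, a chat request can be built with the standard library alone. The sketch below assumes the conventional `/v1/chat/completions` route; the base URL, port, and helper name are placeholders, so check your app's settings for the real endpoint and for the model identifier it expects.

```python
import json
import urllib.request


def build_chat_request(base_url, model, user_text, max_tokens=256):
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder address; substitute whatever your local server reports.
request = build_chat_request(
    "http://localhost:8000",
    "osmapi/osmGemma-4-12B-uncensored-mxfp4-mlx",
    "Explain abliteration in two sentences.",
)
print(request.full_url)

# To send it once a server is running:
#   with urllib.request.urlopen(request) as response:
#       reply = json.load(response)["choices"][0]["message"]["content"]
```

Building the request separately from sending it makes it easy to inspect the payload before pointing it at a live server.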
🖼️🎙️ Vision & audio
Every vision/audio weight (vision_embedder, embed_vision, embed_audio) and the vision_config/audio_config are preserved in this quant.
- In MLX today: text only (see above). Image/audio inference is pending mlx-vlm encoder-free support; no re-quantization will be needed when it lands.
- For image + audio right now, run the bf16 model in 🤗 transformers (which fully supports gemma4_unified):
pip install -U "transformers>=5.10" torch torchvision librosa accelerate
from transformers import AutoProcessor, AutoModelForMultimodalLM
mid = "google/gemma-4-12B-it" # base model; swap for an abliterated-bf16 repo for refusal-free multimodal
processor = AutoProcessor.from_pretrained(mid)
model = AutoModelForMultimodalLM.from_pretrained(mid, dtype="auto", device_map="auto")
messages = [{"role": "user", "content": [
    {"type": "image", "url": "https://.../photo.jpg"},   # image → key "url"
    {"type": "audio", "audio": "https://.../clip.wav"},  # audio → key "audio" (≤30s)
    {"type": "text", "text": "Describe what you see and hear."},
]}]
inputs = processor.apply_chat_template(
    messages, tokenize=True, return_dict=True, return_tensors="pt",
    add_generation_prompt=True, enable_thinking=False,
).to(model.device)
n = inputs["input_ids"].shape[-1]
out = model.generate(**inputs, max_new_tokens=512)
print(processor.parse_response(processor.decode(out[0][n:], skip_special_tokens=False)))
Audio: ≤ 30 s clips (native ASR + speech translation). Images: variable resolution. Video: ≤ 60 s at ~1 fps. For refusal-free multimodal, swap mid for the abliterated bf16 checkpoint (ask osmAPI if you need it published).
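Because audio clips are capped at 30 s, it can help to check a file's length before building the message. The helper below is a small sketch using only the standard library; it handles uncompressed PCM WAV files only, and the function names are illustrative rather than part of any Gemma or transformers API.

```python
import io
import wave

MAX_AUDIO_SECONDS = 30.0  # limit quoted above


def wav_duration_seconds(source):
    """Return the duration of a PCM WAV file (path or binary file object)."""
    with wave.open(source, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def check_audio_limit(source, limit=MAX_AUDIO_SECONDS):
    """Return the clip duration, or raise ValueError if it exceeds the limit."""
    duration = wav_duration_seconds(source)
    if duration > limit:
        raise ValueError(f"clip is {duration:.1f}s, limit is {limit:.0f}s")
    return duration


def make_silent_wav(seconds, sample_rate=16000):
    """Build an in-memory mono 16-bit silent clip, handy for trying the check."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * (sample_rate * seconds))
    buffer.seek(0)
    return buffer


print(check_audio_limit(make_silent_wav(1)))  # 1.0
```

Longer recordings would need to be trimmed or split before being passed under the "audio" key.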
🗂️ Quant family
| Repo | Scheme | Eff. BPW | Size | |
|---|---|---|---|---|
| osmGemma-4-12B-uncensored-bf16 (abliterated, full multimodal) | bf16 | 16 | ~23.9 GB | ↗ |
| osmGemma-4-12B-uncensored-8bit-mlx | 8-bit affine | 8.805 | ~13.7 GB | ↗ |
| osmGemma-4-12B-uncensored-mxfp4-mlx | MXFP4 (4-bit microscaling) | 7.628 | ~11.9 GB | ✅ you are here |
| osmGemma-4-12B-uncensored-mixed-4.2bpw-mlx | mixed 3/4-bit | 4.2 | ~6.6 GB | ↗ |
| google/gemma-4-12B-it (base, not abliterated) | bf16 | 16 | ~24 GB | ↗ |
| google/gemma-4-12B-it-assistant (MTP draft) | can be added later | - | - | ⏳ planned |
🧪 Quantization details
- Tool: mlx-vlm convert (MLX), group size 32, scheme MXFP4 (4-bit microscaling) → 7.628 bpw, ~11.9 GB.
- Weight-complete: the 48-layer language model is quantized; the vision patch-embedder (vision_embedder, 9 tensors) is re-inserted at bf16 and vision_config/audio_config are retained, so every original tensor is present.
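To make "4-bit microscaling, group size 32" concrete, here is a pure-Python sketch of the general MX idea: each group of 32 weights shares one 8-bit power-of-two scale, and each weight is stored as a 4-bit E2M1 value. It follows the scheme as described in the OCP Microscaling specification and is an assumption-laden illustration; the rounding and scale selection inside MLX's own kernels may differ.

```python
import math

# Magnitudes an FP4 E2M1 element can represent (the sign is a separate bit).
E2M1_VALUES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
GROUP_SIZE = 32    # weights sharing one scale
ELEMENT_BITS = 4   # bits per stored weight
SCALE_BITS = 8     # bits for the shared power-of-two scale


def bits_per_weight(group_size=GROUP_SIZE):
    """Storage cost per weight inside a quantized tensor."""
    return ELEMENT_BITS + SCALE_BITS / group_size


def quantize_group(values):
    """Return (scale_exponent, dequantized_values) for one group of weights."""
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return 0, [0.0] * len(values)
    # The largest E2M1 magnitude is 6 = 1.5 * 2**2, so its exponent is 2.
    exponent = math.floor(math.log2(amax)) - 2
    scale = 2.0 ** exponent
    dequantized = []
    for v in values:
        target = min(abs(v) / scale, E2M1_VALUES[-1])  # clamp to the FP4 range
        nearest = min(E2M1_VALUES, key=lambda c: abs(c - target))
        dequantized.append(math.copysign(nearest * scale, v))
    return exponent, dequantized


print(bits_per_weight())                       # 4.25
print(quantize_group([1.0, -0.5, 0.26, 6.0]))  # (0, [1.0, -0.5, 0.5, 6.0])
```

The 4.25 bits per weight applies only within quantized tensors. The 7.628 effective bpw reported for this repo is higher, which would be consistent with some tensors being stored at higher precision (such as the bf16 vision weights noted above), though the card does not give a full breakdown.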
🧬 Lineage
google/gemma-4-12B (Google DeepMind — base pretrain)
↓ instruction tuning
google/gemma-4-12B-it (multimodal, encoder-free)
↓ Heretic 1.3.0 — directional ablation, Optuna/TPE-optimized over 100 trials, best Pareto trial #55
abliterated bf16 (refusals 99→12 / 100, KL 0.053)
↓ mlx-vlm quantization (osmAPI)
this repo — MXFP4 (4-bit microscaling), MLX
🙏 Credits
| Role | Project |
|---|---|
| Quantization & release | osmAPI |
| Abliteration | Heretic by p-e-w |
| Research | osmAPI Research Team · Terv Student Research Team |
| Base model | Google DeepMind — Gemma 4 |
| Quant toolkit | mlx · mlx-vlm |
📜 License
Apache-2.0 (inherited from the base). Also subject to the Gemma 4 Terms of Use.