---
license: apache-2.0
base_model: google/diffusiongemma-26B-A4B-it
base_model_relation: quantized
library_name: mlx
tags:
- mlx
- optiq
- diffusion_gemma
- diffusion-llm
- image-text-to-text
pipeline_tag: image-text-to-text
---

# diffusiongemma-26B-A4B-it-OptiQ-4bit

> **Built with [mlx-optiq](https://mlx-optiq.com)**, the MLX-native toolkit to quantize, fine-tune, and serve LLMs locally on Apple Silicon, with no PyTorch and no cloud. [Try the Lab](https://mlx-optiq.com/docs/lab/) · [All OptiQ quants](https://mlx-optiq.com/models) · [Docs](https://mlx-optiq.com/docs/)

**OptiQ data-driven mixed-precision quant** of Google's [DiffusionGemma-26B-A4B-it](https://huggingface.co/google/diffusiongemma-26B-A4B-it), a **block/masked-diffusion** LLM (image-text-to-text), the first diffusion model in the OptiQ lineup.

Instead of quantizing uniformly to 4 bits, OptiQ measures each layer's quantization sensitivity (KL divergence on the denoising-canvas logits) and spends its 8-bit budget where it helps most. At the **same ~4.66 bpw** as the standard published 4-bit quant, OptiQ shifts that budget away from the dense MLP (where the hand-coded recipe puts it) and onto **early-layer attention + routers** (which the measurement shows are more sensitive).
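
As a rough illustration of the sensitivity measurement described above, the sketch below computes a mean KL divergence between reference logits and logits from a perturbed (for example, more coarsely quantized) layer, averaged over canvas positions. It is a toy under stated assumptions, not OptiQ's implementation: the function names and the tiny logit arrays are hypothetical, and the real measurement runs on the model's denoising forward pass.

```python
import math

def softmax(logits):
    # Numerically stable softmax over one position's vocabulary logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(ref_logits, test_logits):
    # KL(P_ref || P_test) for a single canvas position.
    p = softmax(ref_logits)
    q = softmax(test_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def layer_sensitivity(ref_canvas, test_canvas):
    # Mean KL over all canvas positions; higher means the layer is more sensitive.
    kls = [kl_divergence(r, t) for r, t in zip(ref_canvas, test_canvas)]
    return sum(kls) / len(kls)

# Hypothetical logits: 2 canvas positions, 3-token vocabulary.
ref = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]
mild = [[1.9, 0.6, -1.0], [0.1, 0.3, 2.9]]    # small perturbation
harsh = [[0.5, 2.0, -1.0], [3.0, 0.2, 0.1]]   # large perturbation

print("mild :", layer_sensitivity(ref, mild))
print("harsh:", layer_sensitivity(ref, harsh))
```

A layer whose quantization produces the larger value would be a stronger candidate for 8 bits.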

> ⚠️ **Requires [`mlx-optiq`](https://pypi.org/project/mlx-optiq/) ≥ 0.2.3.** DiffusionGemma is not loadable by stock `mlx-lm`/`mlx-vlm`; OptiQ ships a vendored, dependency-free decoder for it.

## Capability Score

Full 6-metric OptiQ Capability Score (`optiq eval --task all --score`), vs the published `-4bit` (mlx-vlm's hand-coded recipe) at equal bpw:

| Benchmark | OptiQ-4bit | published-4bit | Δ |
|---|---|---|---|
| MMLU (1000, 5-shot) | **47.4** | 44.5 | **+2.9** |
| GSM8K (1000) | **91.8** | 91.7 | +0.1 |
| IFEval (strict) | **69.1** | 68.9 | +0.2 |
| BFCL v3 | 68.5 | 68.5 | +0.0 |
| HumanEval (pass@1) | **75.6** | 74.4 | **+1.2** |
| HashHop | 7.0 | 11.0 | −4.0 |
| **Capability Score** | **59.90** | 59.84 | **+0.07** |
| Disk | **14.0 GB** | 14.5 GB | **−0.5 GB** |

OptiQ matches or beats the hand-tuned recipe on 5 of 6 benchmarks, with clear wins on the non-saturated ones (**MMLU +2.9, HumanEval +1.2**), while being **0.5 GB smaller**. HashHop sits near the floor for both models (7.0 vs 11.0): the fixed 256-token canvas can't do 12k-context retrieval, so the −4.0 gap is noise on near-floor scores.
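
The bottom line of the table can be sanity-checked with a few lines of arithmetic. The sketch below assumes the Capability Score is an unweighted mean of the six benchmark scores; that aggregation rule is an assumption, not something stated here, although it reproduces the reported values to within about 0.01 (the per-benchmark numbers in the table are themselves rounded).

```python
# Per-benchmark scores copied from the table above.
optiq = {"MMLU": 47.4, "GSM8K": 91.8, "IFEval": 69.1,
         "BFCL v3": 68.5, "HumanEval": 75.6, "HashHop": 7.0}
published = {"MMLU": 44.5, "GSM8K": 91.7, "IFEval": 68.9,
             "BFCL v3": 68.5, "HumanEval": 74.4, "HashHop": 11.0}

def mean_score(scores):
    # Assumed aggregation: plain unweighted mean over the six metrics.
    return sum(scores.values()) / len(scores)

delta = mean_score(optiq) - mean_score(published)
print(f"OptiQ     : {mean_score(optiq):.2f}")
print(f"published : {mean_score(published):.2f}")
print(f"delta     : {delta:+.2f}")
```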

## Usage

```python
from optiq.vlm.diffusion_gemma import load, generate

model, tokenizer = load("mlx-community/diffusiongemma-26B-A4B-it-OptiQ-4bit")

# text
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Write a haiku about Apple Silicon."}],
    tokenize=False, add_generation_prompt=True)
print(generate(model, tokenizer, prompt))

# image + text
from PIL import Image
print(generate(model, tokenizer, "What is in this image?", images=[Image.open("photo.jpg")]))
```

### Best inference config

DiffusionGemma decodes by iteratively un-masking a fixed 256-token canvas. The sampler choice dominates speed:

| sampler | code | prose |
|---|---|---|
| `entropy-bound` (model default) | 12.7 tok/s | 1.8 tok/s |
| **`confidence-threshold`** (OptiQ default) | **58 tok/s** | **9 tok/s** |

OptiQ defaults to **`confidence-threshold`** (`generate(..., sampler="confidence-threshold")`), **4.6–5× faster** than the model's default, with no quality loss. On code it's comparable to the autoregressive Gemma-4 26B-A4B (~60 tok/s); on prose it's slower (diffusion's strength is structured/parallel-friendly output).
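
To make the idea of threshold-based un-masking concrete, here is a toy sketch of one common way such a sampler works: at each step, every masked position whose predicted confidence clears a threshold is committed at once, and the single most confident position is committed if none do. This is an assumption about the general technique, not OptiQ's actual sampler; `toy_denoiser` is a hypothetical stand-in that returns random confidences instead of running the model.

```python
import random

MASK = None  # sentinel for a still-masked canvas slot

def toy_denoiser(canvas, rng):
    # Hypothetical stand-in for the model: propose (token, confidence)
    # for every slot that is still masked.
    return {i: (f"tok{i}", rng.random())
            for i, tok in enumerate(canvas) if tok is MASK}

def confidence_threshold_decode(length, threshold=0.7, seed=0):
    rng = random.Random(seed)
    canvas = [MASK] * length
    steps = 0
    while any(tok is MASK for tok in canvas):
        proposals = toy_denoiser(canvas, rng)
        # Commit every slot whose confidence clears the threshold.
        accepted = [i for i, (_, conf) in proposals.items() if conf >= threshold]
        if not accepted:
            # Guarantee progress: commit the single most confident slot.
            accepted = [max(proposals, key=lambda i: proposals[i][1])]
        for i in accepted:
            canvas[i] = proposals[i][0]
        steps += 1
    return canvas, steps

canvas, steps = confidence_threshold_decode(16)
print(f"filled {len(canvas)} slots in {steps} denoising steps")
```

In this toy, fewer steps means fewer forward passes per canvas, which is where the speed-up from committing many positions in parallel would come from.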

### LoRA fine-tuning

OptiQ ships a diffusion-native LoRA trainer (the model's denoising objective, not autoregressive cross-entropy):

```python
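from optiq.vlm.diffusion_gemma.lora import train_diffusion_lora, load_diffusion_lora
# Assumption: model_path points at this quantized checkpoint (repo id or local directory).
model_path = "mlx-community/diffusiongemma-26B-A4B-it-OptiQ-4bit"
# Reads data/train.jsonl ({prompt, completion} records) and writes the adapter to adapter/.
train_diffusion_lora(model_path, "data/", "adapter/", rank=8)
model, tok = load_diffusion_lora(model_path, "adapter/")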
```
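
The trainer's input is described above as `data/train.jsonl` with `{prompt, completion}` records. A minimal sketch of producing such a file with the standard library follows; the helper name `write_train_jsonl` and the example records are hypothetical, and any additional fields or splits the trainer may accept are not covered here.

```python
import json
import tempfile
from pathlib import Path

def write_train_jsonl(examples, data_dir):
    # Write one JSON object per line with exactly the two documented keys.
    path = Path(data_dir) / "train.jsonl"
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        for ex in examples:
            record = {"prompt": ex["prompt"], "completion": ex["completion"]}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return path

# Hypothetical training examples.
examples = [
    {"prompt": "Translate to French: hello", "completion": "bonjour"},
    {"prompt": "Write a Python function that adds two numbers.",
     "completion": "def add(a, b):\n    return a + b"},
]

# Written to a temporary directory here; point data_dir at "data/" for real use.
with tempfile.TemporaryDirectory() as tmp:
    out = write_train_jsonl(examples, tmp)
    print(out.read_text(encoding="utf-8"))
```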

## Feature support

| OptiQ feature | DiffusionGemma |
|---|---|
| Mixed-precision quant | ✅ |
| Text + image generation | ✅ |
| LoRA fine-tuning | ✅ (diffusion-native denoising loss) |
| MTP / speculative / assistant draft | N/A (diffusion is not autoregressive; parallel canvas un-masking is the native analog) |
| KV-cache quant | N/A (fixed 256-token canvas; the cache holds only the prompt) |

## How it was made

`optiq convert` measured per-layer KL sensitivity on the masked-diffusion forward (uniform-4 reference, candidate bits {4,8}), ran the greedy-knapsack allocator at the published recipe's 8-bit budget, and quantized via the OptiQ pipeline. The 27-layer SigLIP vision tower is kept and quantized alongside the language tower.
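
The allocation step can be pictured as a greedy knapsack: rank layers by how much sensitivity an upgrade to 8 bits removes per extra bit spent, then upgrade in that order until the budget runs out. The sketch below is a simplified illustration, not OptiQ's allocator: the layer names, parameter counts, and gains are made up, and the cost model (4 extra bits per weight, ignoring per-group scale overhead) is an assumption.

```python
def greedy_allocate(layers, budget_bits):
    """Pick layers to promote from 4-bit to 8-bit under a bit budget.

    layers: list of (name, n_params, kl_gain), where kl_gain is the
    (hypothetical) sensitivity reduction from using 8 bits for that layer.
    """
    extra_bits_per_param = 8 - 4  # assumed cost of a 4 -> 8 bit upgrade
    # Rank by benefit per extra bit, best first.
    ranked = sorted(layers,
                    key=lambda l: l[2] / (extra_bits_per_param * l[1]),
                    reverse=True)
    bits = {name: 4 for name, _, _ in layers}
    remaining = budget_bits
    for name, n_params, gain in ranked:
        cost = extra_bits_per_param * n_params
        if gain > 0 and cost <= remaining:
            bits[name] = 8
            remaining -= cost
    return bits

# Hypothetical layers: (name, parameter count, measured KL gain).
layers = [
    ("attn.0", 100, 0.50),
    ("router.0", 10, 0.20),
    ("mlp.0", 400, 0.30),
    ("attn.20", 100, 0.05),
]

print(greedy_allocate(layers, budget_bits=500))
```

With these toy numbers the small, sensitive router and the early attention layer are promoted first, while the large MLP stays at 4 bits because its gain per bit is low.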

Built with [OptiQ](https://mlx-optiq.com). Vendored DiffusionGemma decoder derived from [mlx-vlm](https://github.com/Blaizzy/mlx-vlm) (MIT).

## Quantize your own

This quant was produced by [mlx-optiq](https://mlx-optiq.com). Point it at any Hugging Face model to get the same sensitivity-aware mixed precision:

```bash
pip install mlx-optiq
optiq convert <hf-model-id> --target-bpw 5.0 --candidate-bits 4,8
optiq lab   # full local workbench: chat, compare, quantize, fine-tune
```