---
license: apache-2.0
base_model: google/diffusiongemma-26B-A4B-it
base_model_relation: quantized
library_name: mlx
tags:
- mlx
- optiq
- diffusion_gemma
- diffusion-llm
- image-text-to-text
pipeline_tag: image-text-to-text
---

# diffusiongemma-26B-A4B-it-OptiQ-4bit

> **Built with [mlx-optiq](https://mlx-optiq.com)**, the MLX-native toolkit to quantize, fine-tune, and serve LLMs locally on Apple Silicon, no PyTorch and no cloud. [Try the Lab](https://mlx-optiq.com/docs/lab/) · [All OptiQ quants](https://mlx-optiq.com/models) · [Docs](https://mlx-optiq.com/docs/)

**OptiQ data-driven mixed-precision quant** of Google's [DiffusionGemma-26B-A4B-it](https://huggingface.co/google/diffusiongemma-26B-A4B-it), a **block/masked-diffusion** LLM (image-text-to-text), the first diffusion model in the OptiQ lineup.

Instead of uniform 4-bit, OptiQ measures each layer's quantization sensitivity (KL on the denoising-canvas logits) and spends an 8-bit budget where it helps most. At the **same ~4.66 bpw** as the standard published 4-bit, OptiQ shifts the 8-bit budget from the dense-MLP (where the hand-coded recipe puts it) onto **early-layer attention + routers** (which the measurement shows are more sensitive).

> ⚠️ **Requires [`mlx-optiq`](https://pypi.org/project/mlx-optiq/) ≥ 0.2.3.** DiffusionGemma is not loadable by stock `mlx-lm`/`mlx-vlm`; OptiQ ships a vendored, dependency-free decoder for it.

## Capability Score

Full 6-metric OptiQ Capability Score (`optiq eval --task all --score`), vs the published `-4bit` (mlx-vlm's hand-coded recipe) at equal bpw:

| Benchmark | OptiQ-4bit | published-4bit | Δ |
|---|---|---|---|
| MMLU (1000, 5-shot) | **47.4** | 44.5 | **+2.9** |
| GSM8K (1000) | **91.8** | 91.7 | +0.1 |
| IFEval (strict) | **69.1** | 68.9 | +0.2 |
| BFCL v3 | 68.5 | 68.5 | +0.0 |
| HumanEval (pass@1) | **75.6** | 74.4 | **+1.2** |
| HashHop | 7.0 | 11.0 | −4.0 |
| **Capability Score** | **59.90** | 59.84 | **+0.07** |
| Disk | **14.0 GB** | 14.5 GB | **−0.5 GB** |

OptiQ matches or beats the hand-tuned recipe on 5 of 6 benchmarks, with clear wins on the non-saturated ones (**MMLU +2.9, HumanEval +1.2**), while being **0.5 GB smaller**. (HashHop is ~0 for both: the fixed 256-token canvas can't do 12k-context retrieval; the −4.0 is noise on near-zero scores.)

## Usage

```python
from optiq.vlm.diffusion_gemma import load, generate

model, tokenizer = load("mlx-community/diffusiongemma-26B-A4B-it-OptiQ-4bit")

# text
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Write a haiku about Apple Silicon."}],
    tokenize=False, add_generation_prompt=True)
print(generate(model, tokenizer, prompt))

# image + text
from PIL import Image
print(generate(model, tokenizer, "What is in this image?",
               images=[Image.open("photo.jpg")]))
```

### Best inference config

DiffusionGemma decodes by iteratively un-masking a fixed 256-token canvas. The sampler choice dominates speed:

| sampler | code | prose |
|---|---|---|
| `entropy-bound` (model default) | 12.7 tok/s | 1.8 tok/s |
| **`confidence-threshold`** (OptiQ default) | **58 tok/s** | **9 tok/s** |

OptiQ defaults to **`confidence-threshold`** (`generate(..., sampler="confidence-threshold")`), **4.6–5× faster** than the model's default, with no quality loss.
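
To make the un-masking loop concrete, here is a minimal toy sketch of confidence-threshold decoding. It is not OptiQ's sampler: `toy_denoiser` returns random logits in place of the model, and `MASK`, `confidence_threshold_decode`, and the threshold value are names and numbers assumed for illustration only. The idea it shows is that every masked slot whose top-token probability clears the threshold is committed in the same step, so confident canvases finish in few steps.

```python
import math
import random

MASK = -1  # hypothetical sentinel for a still-masked canvas slot


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_denoiser(canvas, vocab_size, rng):
    """Stand-in for the model: random logits per slot (assumption, not a real network)."""
    return [[rng.gauss(0.0, 2.0) for _ in range(vocab_size)] for _ in canvas]


def confidence_threshold_decode(canvas_len=16, vocab_size=8, threshold=0.6, seed=0):
    """Un-mask a fixed-length canvas, committing all confident slots per step."""
    rng = random.Random(seed)
    canvas = [MASK] * canvas_len
    steps = 0
    while MASK in canvas:
        logits = toy_denoiser(canvas, vocab_size, rng)
        # Collect (confidence, position, token) for every still-masked slot.
        candidates = []
        for pos, tok in enumerate(canvas):
            if tok != MASK:
                continue
            probs = softmax(logits[pos])
            best = max(range(vocab_size), key=probs.__getitem__)
            candidates.append((probs[best], pos, best))
        # Commit every slot that clears the threshold in this step.
        accepted = [c for c in candidates if c[0] >= threshold]
        if not accepted:
            # Guarantee progress: commit the single most confident slot.
            accepted = [max(candidates)]
        for _, pos, best in accepted:
            canvas[pos] = best
        steps += 1
    return canvas, steps


canvas, steps = confidence_threshold_decode()
print(f"filled {len(canvas)} slots in {steps} steps")
```

Because at least one slot is committed per step, the loop takes at most `canvas_len` steps; a lower threshold trades more parallel commits per step against acting on less certain predictions.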

With `confidence-threshold`, code throughput is comparable to the autoregressive Gemma-4 26B-A4B (~60 tok/s); prose is slower (diffusion's strength is structured, parallel-friendly output).

### LoRA fine-tuning

OptiQ ships a diffusion-native LoRA trainer (the model's denoising objective, not autoregressive cross-entropy):

```python
from optiq.vlm.diffusion_gemma.lora import train_diffusion_lora, load_diffusion_lora

# Base model to adapt: the repo id from the Usage section, or a local path.
model_path = "mlx-community/diffusiongemma-26B-A4B-it-OptiQ-4bit"

# Expects data/train.jsonl with one {prompt, completion} object per line.
train_diffusion_lora(model_path, "data/", "adapter/", rank=8)

# Reload the base model with the trained adapter applied.
model, tok = load_diffusion_lora(model_path, "adapter/")
```

## Feature support

| OptiQ feature | DiffusionGemma |
|---|---|
| Mixed-precision quant | ✅ |
| Text + image generation | ✅ |
| LoRA fine-tuning | ✅ (diffusion-native denoising loss) |
| MTP / speculative / assistant draft | N/A (diffusion is not autoregressive; parallel canvas un-masking is the native analog) |
| KV-cache quant | N/A (fixed 256-token canvas; the cache holds only the prompt) |

## How it was made

`optiq convert` measured per-layer KL sensitivity on the masked-diffusion forward (uniform-4 reference, candidate bits {4,8}), ran the greedy-knapsack allocator at the published recipe's 8-bit budget, and quantized via the OptiQ pipeline. The 27-layer SigLIP vision tower is kept and quantized alongside the language tower.

Built with [OptiQ](https://mlx-optiq.com). Vendored DiffusionGemma decoder derived from [mlx-vlm](https://github.com/Blaizzy/mlx-vlm) (MIT).

## Quantize your own

This quant was produced by [mlx-optiq](https://mlx-optiq.com). Point it at any Hugging Face model to get the same sensitivity-aware mixed precision:

```bash
pip install mlx-optiq
optiq convert --target-bpw 5.0 --candidate-bits 4,8
optiq lab   # full local workbench: chat, compare, quantize, fine-tune
```
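
The allocation step named under "How it was made" can be pictured with a toy greedy allocator. This is a sketch under stated assumptions, not `optiq convert`'s actual code: the `Layer` fields, layer names, KL values, and budget are made up, the KL measured at 4-bit is treated as the full gain from upgrading to 8-bit, and the bits-per-weight figure ignores quantization scale/bias overhead.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    name: str          # hypothetical layer name
    n_params: int      # number of weights in the layer
    kl_at_4bit: float  # KL vs. the reference when this layer is at 4-bit (made-up values)


def allocate_bits(layers, extra_bit_budget, low=4, high=8):
    """Greedy knapsack: upgrade layers with the most KL removed per extra bit stored."""
    bits = {layer.name: low for layer in layers}
    ranked = sorted(
        layers,
        key=lambda layer: layer.kl_at_4bit / (layer.n_params * (high - low)),
        reverse=True,
    )
    remaining = extra_bit_budget
    for layer in ranked:
        cost = layer.n_params * (high - low)  # extra bits needed for this upgrade
        if cost <= remaining:
            bits[layer.name] = high
            remaining -= cost
    return bits


def bits_per_weight(layers, bits):
    """Average bits per weight, ignoring scale/bias overhead (simplifying assumption)."""
    total = sum(layer.n_params for layer in layers)
    return sum(layer.n_params * bits[layer.name] for layer in layers) / total


layers = [
    Layer("attn.0", 1_000, 0.50),
    Layer("router.0", 100, 0.20),
    Layer("mlp.0", 4_000, 0.40),
    Layer("mlp.1", 4_000, 0.05),
]
bits = allocate_bits(layers, extra_bit_budget=5_000)
print(bits, round(bits_per_weight(layers, bits), 2))
```

In this toy case the small, sensitive attention and router layers are upgraded first because they remove the most KL per bit spent, while the large MLP layers stay at 4-bit once the budget runs out.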