---
license: gemma
language:
- en
base_model: google/gemma-4-26b-a4b-it
tags:
- mlx
- mlx-node
- quantized
- awq
- mxfp8
- micro-scaling-fp
- gemma4
- moe
- sliding-window-attention
- vision-language
- apple-silicon
- unsloth-dynamic
library_name: mlx-node
quantized_by: mlx-node
pipeline_tag: text-generation
model_type: gemma4
---
# Gemma-4-26B-A4B-IT — UD-MXFP8_K_XL (mlx-node)
MXFP8 (OCP micro-scaling FP8) quantization of [google/gemma-4-26b-a4b-it](https://huggingface.co/google/gemma-4-26b-a4b-it) for Apple Silicon, using the [**Unsloth Dynamic** quantization strategy](https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks) via [mlx-node](https://github.com/mlx-node/mlx-node).
| | Original (BF16) | UD-MXFP8_K_XL (this model) |
|---|---|---|
| **Size** | ~49 GB | **26 GB** |
| **Format** | SafeTensors | SafeTensors |
| **Precision** | BF16 uniform | MXFP8 (E8M0 scales) + BF16 |
| **FFN group size** | — | **32** |
| **Biases** | — | no |
## What is MXFP8?
MXFP8 is the [Open Compute Project (OCP) micro-scaling FP8](https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf) format. Each group of 32 elements shares a single 8-bit E8M0 scale (a power-of-two exponent), and elements themselves are stored as E4M3 FP8 values. Compared to 8-bit affine:
- **Half the scale storage**: uint8 E8M0 vs. fp16/fp32 affine scales
- **No biases**: no zero-point is stored, because signed FP8 elements already cover a symmetric ± range
- **Hardware-friendly**: scale is just an exponent shift, no FP multiply on the scale path
For typical LLM weight distributions, MXFP8 retains quality on par with 8-bit affine while shrinking the metadata footprint.
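The block layout described above can be sketched in a few lines of TypeScript. This is an illustration of the format only, not mlx-node's kernel code: the helper names (`floorLog2`, `roundToE4M3`, `quantizeGroup`, `dequantizeGroup`) are hypothetical, elements are kept as decoded numbers rather than packed bytes, inputs are assumed finite, and ties are rounded with `Math.round` instead of round-half-to-even. The shared scale is chosen as the power of two that places the group maximum in the top E4M3 binade.

```typescript
// Illustrative MXFP8 block quantization (hypothetical helpers, not mlx-node's code).
const GROUP_SIZE = 32;   // elements sharing one scale
const E4M3_MAX = 448;    // largest finite E4M3 magnitude (1.75 * 2^8)
const E4M3_EMAX = 8;     // exponent of the largest E4M3 binade
const E4M3_MIN_EXP = -6; // smallest normal exponent; below it, subnormal spacing applies
const E8M0_BIAS = 127;   // E8M0 stores (exponent + 127) in one byte

// floor(log2(x)) for finite x > 0, computed without Math.log2 to avoid rounding surprises.
function floorLog2(x: number): number {
  let e = 0;
  while (2 ** (e + 1) <= x) e++;
  while (2 ** e > x) e--;
  return e;
}

// Round to the nearest E4M3-representable value, saturating at +/-448.
function roundToE4M3(x: number): number {
  if (x === 0) return 0;
  const mag = Math.min(Math.abs(x), E4M3_MAX);
  const exp = Math.max(floorLog2(mag), E4M3_MIN_EXP);
  const step = 2 ** (exp - 3); // 3 mantissa bits give 8 steps per binade
  return Math.sign(x) * Math.min(Math.round(mag / step) * step, E4M3_MAX);
}

interface MxGroup {
  scaleByte: number;  // E8M0 byte: shared exponent + bias
  elements: number[]; // E4M3 values, kept decoded for readability
}

function quantizeGroup(values: number[]): MxGroup {
  const maxAbs = values.reduce((m, v) => Math.max(m, Math.abs(v)), 0);
  if (maxAbs === 0) return { scaleByte: E8M0_BIAS, elements: values.map(() => 0) };
  // Shared exponent puts the group maximum inside the top E4M3 binade.
  const sharedExp = Math.min(Math.max(floorLog2(maxAbs) - E4M3_EMAX, -127), 127);
  const scale = 2 ** sharedExp;
  return {
    scaleByte: sharedExp + E8M0_BIAS,
    elements: values.map((v) => roundToE4M3(v / scale)),
  };
}

function dequantizeGroup(group: MxGroup): number[] {
  const scale = 2 ** (group.scaleByte - E8M0_BIAS); // pure exponent shift
  return group.elements.map((e) => e * scale);
}

// Storage cost: 32 one-byte elements plus one scale byte per group.
const bitsPerWeight = (GROUP_SIZE * 8 + 8) / GROUP_SIZE; // 8.25
```

Because the scale is a power of two, dequantization only shifts the exponent of each element, which is the "hardware-friendly" property listed above.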
## All Variants
| Repo | Bit budget | Size | Decode (tok/s) |
|---|---|---|---|
| [Brooooooklyn/Gemma-4-26B-A4B-IT-UD-Q3_K_XL-mlx](https://huggingface.co/Brooooooklyn/Gemma-4-26B-A4B-IT-UD-Q3_K_XL-mlx) | 3-bit base | 14 GB | 60.6 |
| [Brooooooklyn/Gemma-4-26B-A4B-IT-UD-MXFP4_K_XL-mlx](https://huggingface.co/Brooooooklyn/Gemma-4-26B-A4B-IT-UD-MXFP4_K_XL-mlx) | mxfp4 | 16 GB | 58.4 |
| [Brooooooklyn/Gemma-4-26B-A4B-IT-UD-Q4_K_XL-mlx](https://huggingface.co/Brooooooklyn/Gemma-4-26B-A4B-IT-UD-Q4_K_XL-mlx) | 4-bit base | 17 GB | 58.6 |
| [Brooooooklyn/Gemma-4-26B-A4B-IT-UD-NVFP4_K_XL-mlx](https://huggingface.co/Brooooooklyn/Gemma-4-26B-A4B-IT-UD-NVFP4_K_XL-mlx) | nvfp4 | 17 GB | 57.9 |
| [Brooooooklyn/Gemma-4-26B-A4B-IT-UD-Q5_K_XL-mlx](https://huggingface.co/Brooooooklyn/Gemma-4-26B-A4B-IT-UD-Q5_K_XL-mlx) | 5-bit base | 20 GB | 50.3 |
| [Brooooooklyn/Gemma-4-26B-A4B-IT-UD-Q6_K_XL-mlx](https://huggingface.co/Brooooooklyn/Gemma-4-26B-A4B-IT-UD-Q6_K_XL-mlx) | 6-bit base | 23 GB | 51.9 |
| **[Brooooooklyn/Gemma-4-26B-A4B-IT-UD-MXFP8_K_XL-mlx](https://huggingface.co/Brooooooklyn/Gemma-4-26B-A4B-IT-UD-MXFP8_K_XL-mlx) (this model)** | **mxfp8** | **26 GB** | **49.8** |
| [Brooooooklyn/Gemma-4-26B-A4B-IT-UD-Q8_K_XL-mlx](https://huggingface.co/Brooooooklyn/Gemma-4-26B-A4B-IT-UD-Q8_K_XL-mlx) | 8-bit base | 27 GB | 49.8 |
Benchmarked on Apple M3 Max 128GB via [`examples/lm.ts`](https://github.com/mlx-node/mlx-node/blob/main/examples/lm.ts) (best decode tok/s across turns 2–4, steady-state, capitals chat with `reasoningEffort: 'low'`).
**Note:** No Q2 variant is published. Gemma-4-26B-A4B-IT has only ~4B active parameters per token, which leaves too little redundancy for 2-bit quantization to remain coherent. Both the `unsloth` and `mixed_2_6` recipes produced gibberish at Q2 on this model.
## Performance
Steady-state decode: **49.8 tok/s** on Apple M3 Max 128GB (best of turns 2–4, `examples/lm.ts` capitals chat with `reasoningEffort: 'low'`). Decode is memory-bandwidth bound on Apple Silicon, so fewer bytes read per token translate directly into higher throughput. The MoE architecture activates only the top-K of 128 experts per token (~4B active out of ~26B total), and the compiled C++ forward graph fuses the per-layer dispatch.
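The bandwidth argument can be turned into a back-of-envelope ceiling. The sketch below assumes every active weight is read exactly once per token at about 8.25 bits per weight (32 one-byte elements plus one scale byte per group), and it takes 400 GB/s as the memory bandwidth. That bandwidth figure is an assumption, not a number from this card. The BF16 `o_proj` weights, the KV cache, and activations are ignored, so real throughput should land below the result.

```typescript
// Rough decode ceiling when each active weight is read once per token.
// The function name and parameters are hypothetical, for illustration only.
function decodeCeilingTokPerSec(
  activeParams: number,
  bitsPerWeight: number,
  bandwidthGBps: number,
): number {
  const bytesPerToken = (activeParams * bitsPerWeight) / 8;
  return (bandwidthGBps * 1e9) / bytesPerToken;
}

// ~4B active parameters, MXFP8 with group size 32, assumed 400 GB/s.
const ceiling = decodeCeilingTokPerSec(4e9, 8.25, 400);
console.log(`ceiling ≈ ${ceiling.toFixed(1)} tok/s`);
```

Under these assumptions the ceiling is roughly 97 tok/s, so the measured 49.8 tok/s sits at about half of the idealized bound.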
## Per-Tensor Bit Assignments (base bits N=8)
| Weight | Mode | Bits | Group | Rationale |
|---|---|---|---|---|
| `self_attn.q_proj` | **mxfp8** | 8 | 32 | AWQ-corrected via input_layernorm |
| `self_attn.k_proj` | **mxfp8** | 8 | 32 | AWQ-corrected via input_layernorm |
| `self_attn.v_proj` | **mxfp8** | 8 | 32 | AWQ-corrected via input_layernorm (only on full-attention layers) |
| `mlp.gate_proj` | **mxfp8** | 8 | 32 | Shared dense MLP |
| `mlp.up_proj` | **mxfp8** | 8 | 32 | Shared dense MLP |
| `mlp.down_proj` | **mxfp8** | 8 | 32 | Shared dense MLP; "slightly more sensitive" (unsloth `base+1`) |
| `experts.switch_glu.gate_proj` | **mxfp8** | 8 | 32 | MoE expert gate (per-expert across all 128); base bits |
| `experts.switch_glu.up_proj` | **mxfp8** | 8 | 32 | MoE expert up (per-expert across all 128); base bits |
| `experts.switch_glu.down_proj` | **mxfp8** | 8 | 32 | MoE expert down (per-expert across all 128 + routing); unsloth `base+1` |
| `self_attn.o_proj` | **bf16** | — | — | NOT AWQ-correctable; kept full-precision |
## Quantization Strategy
Built on Unsloth Dynamic 2.0 per-tensor KLD analysis. At `--q-bits 8` the unsloth recipe's per-layer bit offsets all snap to 8-bit. Then `--q-mxfp` orthogonally promotes every 8-bit affine decision to MXFP8 (`mode="mxfp8", bits=8, group_size=32`) — except for keys whose dequantizers are affine-only (`embed_tokens`, `lm_head`, `router.proj`, `embed_vision.embedding_projection`).
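The promotion rule above can be read as a small lookup. The sketch below is a hypothetical restatement, not mlx-node's implementation: `resolveMode` and the substring matching are assumptions, and the example tensor keys are illustrative. It only applies to weights the recipe would quantize at all. The BF16 entry for `self_attn.o_proj` comes from the per-tensor table.

```typescript
type QuantMode = 'mxfp8' | 'affine' | 'bf16';

// Keys whose dequantizers are affine-only, as listed in the paragraph above.
const AFFINE_ONLY_KEYS = [
  'embed_tokens',
  'lm_head',
  'router.proj',
  'embed_vision.embedding_projection',
];

// Kept in full precision according to the per-tensor table.
const FULL_PRECISION_KEYS = ['self_attn.o_proj'];

// Hypothetical helper: pick the storage mode for one quantizable weight.
function resolveMode(tensorKey: string): QuantMode {
  if (FULL_PRECISION_KEYS.some((k) => tensorKey.includes(k))) return 'bf16';
  if (AFFINE_ONLY_KEYS.some((k) => tensorKey.includes(k))) return 'affine';
  return 'mxfp8'; // every other 8-bit affine decision is promoted
}

// Example keys are illustrative; real checkpoint key names may differ.
console.log(resolveMode('layers.0.self_attn.q_proj.weight'));
console.log(resolveMode('layers.0.router.proj.weight'));
console.log(resolveMode('layers.0.self_attn.o_proj.weight'));
```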
imatrix AWQ pre-scaling amplifies important weight channels and fuses inverse scales into preceding layer norms (zero inference overhead). At 8-bit, the recipe's primary contribution is the affine-only safety net for embedding and routing layers.
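The "zero inference overhead" claim rests on a simple identity: scaling input channel `i` of a weight matrix by `s[i]` and dividing the preceding norm gain by `s[i]` leaves the layer output unchanged. The toy example below checks that identity with plain arrays before any quantization is applied. All names and values are made up for illustration, and the scales are powers of two so the comparison is exact in floating point.

```typescript
// y[j] = sum_i (x[i] * gain[i]) * W[i][j]: a norm gain followed by a linear layer.
function linearAfterNorm(x: number[], gain: number[], W: number[][]): number[] {
  const out = new Array<number>(W[0].length).fill(0);
  for (let i = 0; i < x.length; i++) {
    for (let j = 0; j < out.length; j++) {
      out[j] += x[i] * gain[i] * W[i][j];
    }
  }
  return out;
}

const x = [0.5, -1.0, 2.0];   // already-normalized activations
const gain = [1.0, 0.5, 2.0]; // layer-norm weight
const W = [
  [1, 2],
  [3, 4],
  [5, 6],
];
const s = [4, 1, 2];          // per-input-channel AWQ scales (illustrative)

// Amplify important weight channels, and fold the inverse into the norm gain.
const scaledW = W.map((row, i) => row.map((w) => w * s[i]));
const fusedGain = gain.map((g, i) => g / s[i]);

const reference = linearAfterNorm(x, gain, W);
const fused = linearAfterNorm(x, fusedGain, scaledW);
console.log(reference, fused); // both paths give the same output
```

In a real pipeline the quantizer then operates on the scaled weights, so the channels with larger scales lose relatively less precision, while inference still runs a single norm and a single matmul.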
## Architecture
| Parameter | Value |
|---|---|
| Total parameters | ~26B (~4B active per token) |
| Hidden size | 2,816 |
| Layers | 30 (sliding-window and full-attention layers) |
| Attention heads | 16 (8 KV heads, GQA 2:1) |
| Head dimension | 256 |
| Experts | 128 per MoE layer |
| MoE intermediate size | 704 |
| Vocab size | 262,144 |
| Max context | 262,144 tokens |
| Vision | yes (Gemma4ForConditionalGeneration) |
## Usage
```typescript
import { loadSession } from '@mlx-node/lm';
const session = await loadSession('./Gemma-4-26B-A4B-IT-UD-MXFP8_K_XL-mlx');
for await (const event of session.sendStream('Explain the MoE architecture in Gemma-4.', {
config: { maxNewTokens: 2048, temperature: 0.6, reasoningEffort: 'low' },
})) {
if (!event.done) process.stdout.write(event.text);
}
```
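When the full reply is needed as a string rather than streamed to stdout, the events can be accumulated. The helper below is a sketch that assumes only the event shape visible in the example above (`done` and `text`); the real `@mlx-node/lm` event type may carry more fields, and `collectText` is a hypothetical name.

```typescript
// Event shape assumed from the streaming example; not the library's actual type.
interface StreamEvent {
  done: boolean;
  text?: string;
}

// Accumulate streamed text chunks into one string.
async function collectText(events: AsyncIterable<StreamEvent>): Promise<string> {
  let out = '';
  for await (const event of events) {
    if (!event.done && event.text) out += event.text;
  }
  return out;
}
```

With a loaded session this could be called as `const reply = await collectText(session.sendStream(prompt, { config }))`, assuming `sendStream` returns an async iterable as the loop above suggests.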
## How It Was Made
```bash
mlx convert \
-i gemma-4-26b-a4b-it \
-o Gemma-4-26B-A4B-IT-UD-MXFP8_K_XL-mlx \
-q --q-bits 8 --q-mxfp --q-recipe unsloth \
--imatrix-path imatrix_unsloth.gguf
```
## Acknowledgments
- **[Unsloth](https://unsloth.ai)** — Quantization strategy based on their [per-layer KLD benchmarks](https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks) and Dynamic 2.0 methodology
- **[Google DeepMind](https://deepmind.google/)** — For the Gemma-4 model family
- **[OCP Microscaling FP](https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf)** — For the MXFP4/MXFP8 specification
- **[Apple MLX](https://github.com/ml-explore/mlx)** — For the Metal-accelerated ML framework
## License
[Gemma Terms of Use](https://ai.google.dev/gemma/terms) (inherited from base model).