---
license: gemma
language:
  - en
base_model: google/gemma-4-26b-a4b-it
tags:
  - mlx
  - mlx-node
  - quantized
  - awq
  - mxfp8
  - micro-scaling-fp
  - gemma4
  - moe
  - sliding-window-attention
  - vision-language
  - apple-silicon
  - unsloth-dynamic
library_name: mlx-node
quantized_by: mlx-node
pipeline_tag: text-generation
model_type: gemma4
---

# Gemma-4-26B-A4B-IT — UD-MXFP8_K_XL (mlx-node)

MXFP8 (OCP micro-scaling FP8) quantization of [google/gemma-4-26b-a4b-it](https://huggingface.co/google/gemma-4-26b-a4b-it) for Apple Silicon, using the [**Unsloth Dynamic** quantization strategy](https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks) via [mlx-node](https://github.com/mlx-node/mlx-node).

| | Original (BF16) | UD-MXFP8_K_XL (this model) |
|---|---|---|
| **Size** | ~49 GB | **26 GB** |
| **Format** | SafeTensors | SafeTensors |
| **Precision** | BF16 uniform | MXFP8 (E8M0 scales) + BF16 |
| **FFN group size** | — | **32** |
| **Biases** | — | no |

## What is MXFP8?

MXFP8 is the [Open Compute Project (OCP) micro-scaling FP8](https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf) format. Each group of 32 elements shares a single 8-bit E8M0 scale (a power-of-two exponent), and elements themselves are stored as E4M3 FP8 values. Compared to 8-bit affine:

- **At most half the scale storage**: one uint8 E8M0 scale per group, versus an fp16 or fp32 scale for affine
- **No biases**: E4M3 values are signed and symmetric around zero, so no per-group zero-point is stored
- **Hardware-friendly**: scale is just an exponent shift, no FP multiply on the scale path

For typical LLM weight distributions, MXFP8 retains quality on par with 8-bit affine while shrinking the metadata footprint.
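
To make the layout concrete, here is a minimal sketch of decoding one MXFP8 block, following the OCP MX definitions of E8M0 (biased power-of-two exponent, bias 127) and E4M3 (1 sign bit, 4 exponent bits with bias 7, 3 mantissa bits). The function names are illustrative and are not part of mlx-node or MLX; real kernels do this on the GPU.

```typescript
// Decode an E8M0 scale byte: value = 2^(byte - 127). 0xFF is reserved for NaN.
function decodeE8M0(byte: number): number {
  if (byte === 0xff) return NaN;
  return 2 ** (byte - 127);
}

// Decode an OCP FP8 E4M3 byte. E4M3 has no infinities; S.1111.111 encodes NaN,
// which makes 448 the largest finite magnitude.
function decodeE4M3(byte: number): number {
  const sign = byte & 0x80 ? -1 : 1;
  const exp = (byte >> 3) & 0x0f;
  const man = byte & 0x07;
  if (exp === 0x0f && man === 0x07) return NaN;
  if (exp === 0) return sign * (man / 8) * 2 ** -6; // subnormal range
  return sign * (1 + man / 8) * 2 ** (exp - 7);
}

// Dequantize one group: every element is multiplied by the shared scale.
function dequantizeBlock(scaleByte: number, elements: Uint8Array): Float32Array {
  const scale = decodeE8M0(scaleByte);
  return Float32Array.from(elements, (e) => scale * decodeE4M3(e));
}

// Average storage per weight: element bits plus per-group metadata spread over the group.
function bitsPerWeight(elementBits: number, groupSize: number, metadataBits: number): number {
  return elementBits + metadataBits / groupSize;
}

const mxfp8Bits = bitsPerWeight(8, 32, 8); // 8.25 bits per weight
// Assumed comparison point: 8-bit affine, group size 64, fp16 scale + fp16 bias.
const affineBits = bitsPerWeight(8, 64, 32); // 8.5 bits per weight
```

The affine comparison point (group size 64 with an fp16 scale and bias) is an assumption chosen for illustration; other group sizes shift the numbers.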

## All Variants

| Repo | Bit budget | Size | Decode (tok/s) |
|---|---|---|---|
| [Brooooooklyn/Gemma-4-26B-A4B-IT-UD-Q3_K_XL-mlx](https://huggingface.co/Brooooooklyn/Gemma-4-26B-A4B-IT-UD-Q3_K_XL-mlx) | 3-bit base | 14 GB | 60.6 |
| [Brooooooklyn/Gemma-4-26B-A4B-IT-UD-MXFP4_K_XL-mlx](https://huggingface.co/Brooooooklyn/Gemma-4-26B-A4B-IT-UD-MXFP4_K_XL-mlx) | mxfp4 | 16 GB | 58.4 |
| [Brooooooklyn/Gemma-4-26B-A4B-IT-UD-Q4_K_XL-mlx](https://huggingface.co/Brooooooklyn/Gemma-4-26B-A4B-IT-UD-Q4_K_XL-mlx) | 4-bit base | 17 GB | 58.6 |
| [Brooooooklyn/Gemma-4-26B-A4B-IT-UD-NVFP4_K_XL-mlx](https://huggingface.co/Brooooooklyn/Gemma-4-26B-A4B-IT-UD-NVFP4_K_XL-mlx) | nvfp4 | 17 GB | 57.9 |
| [Brooooooklyn/Gemma-4-26B-A4B-IT-UD-Q5_K_XL-mlx](https://huggingface.co/Brooooooklyn/Gemma-4-26B-A4B-IT-UD-Q5_K_XL-mlx) | 5-bit base | 20 GB | 50.3 |
| [Brooooooklyn/Gemma-4-26B-A4B-IT-UD-Q6_K_XL-mlx](https://huggingface.co/Brooooooklyn/Gemma-4-26B-A4B-IT-UD-Q6_K_XL-mlx) | 6-bit base | 23 GB | 51.9 |
| **[Brooooooklyn/Gemma-4-26B-A4B-IT-UD-MXFP8_K_XL-mlx](https://huggingface.co/Brooooooklyn/Gemma-4-26B-A4B-IT-UD-MXFP8_K_XL-mlx) (this model)** | **mxfp8** | **26 GB** | **49.8** |
| [Brooooooklyn/Gemma-4-26B-A4B-IT-UD-Q8_K_XL-mlx](https://huggingface.co/Brooooooklyn/Gemma-4-26B-A4B-IT-UD-Q8_K_XL-mlx) | 8-bit base | 27 GB | 49.8 |

Benchmarked on Apple M3 Max 128GB via [`examples/lm.ts`](https://github.com/mlx-node/mlx-node/blob/main/examples/lm.ts) (best decode tok/s across turns 2–4, steady-state, capitals chat with `reasoningEffort: 'low'`).

**Note:** No Q2 variant is published. Gemma-4-26B-A4B-IT has only ~4B active parameters per token, which leaves too little redundancy for 2-bit quantization to remain coherent. Both the `unsloth` and `mixed_2_6` recipes produced gibberish at Q2 on this model.

## Performance

Steady-state decode: **49.8 tok/s** on Apple M3 Max 128GB (best of turns 2–4, `examples/lm.ts` capitals chat with `reasoningEffort: 'low'`). Decode is memory-bandwidth bound on Apple Silicon, so fewer bytes read per token translate directly into higher throughput. The MoE architecture activates only the top-K of 128 experts per token (~4B active out of ~26B total), and the compiled C++ forward graph fuses the per-layer dispatch.
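
As a rough sanity check on the bandwidth-bound claim, the sketch below estimates a decode ceiling from bytes streamed per token. The inputs are assumptions, not measurements: 400 GB/s of memory bandwidth for this M3 Max configuration, ~4B active parameters, and ~8.25 bits per weight (8 element bits plus one 8-bit scale per 32 elements). It ignores the BF16 `o_proj`, KV-cache reads, and activation traffic, so the true ceiling is lower.

```typescript
// Upper bound on decode speed if each token must stream all active weights once.
function decodeCeilingTokPerSec(
  bandwidthGBs: number, // assumed memory bandwidth in GB/s
  activeParamsB: number, // active parameters per token, in billions
  bitsPerWeight: number, // average storage per weight, including scales
): number {
  const gbPerToken = (activeParamsB * bitsPerWeight) / 8;
  return bandwidthGBs / gbPerToken;
}

// 4B params * 8.25 bits / 8 = 4.125 GB per token, so roughly 97 tok/s at 400 GB/s.
const ceiling = decodeCeilingTokPerSec(400, 4, 8.25);
```

Under these assumptions the measured 49.8 tok/s sits at about half of the idealized ceiling.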

## Per-Tensor Bit Assignments (N=8)

| Weight | Mode | Bits | Group | Rationale |
|---|---|---|---|---|
| `self_attn.q_proj` | **mxfp8** | 8 | 32 | AWQ-corrected via input_layernorm |
| `self_attn.k_proj` | **mxfp8** | 8 | 32 | AWQ-corrected via input_layernorm |
| `self_attn.v_proj` | **mxfp8** | 8 | 32 | AWQ-corrected via input_layernorm (only on full-attention layers) |
| `mlp.gate_proj` | **mxfp8** | 8 | 32 | Shared dense MLP |
| `mlp.up_proj` | **mxfp8** | 8 | 32 | Shared dense MLP |
| `mlp.down_proj` | **mxfp8** | 8 | 32 | Shared dense MLP; "slightly more sensitive" (unsloth `base+1`) |
| `experts.switch_glu.gate_proj` | **mxfp8** | 8 | 32 | MoE expert gate (per-expert across all 128); base bits |
| `experts.switch_glu.up_proj` | **mxfp8** | 8 | 32 | MoE expert up (per-expert across all 128); base bits |
| `experts.switch_glu.down_proj` | **mxfp8** | 8 | 32 | MoE expert down (per-expert across all 128 + routing); unsloth `base+1` |
| `self_attn.o_proj` | **bf16** | — | — | NOT AWQ-correctable; kept full-precision |

## Quantization Strategy

Built on Unsloth Dynamic 2.0 per-tensor KLD analysis. At `--q-bits 8` the unsloth recipe's per-layer bit offsets all snap to 8-bit. Then `--q-mxfp` orthogonally promotes every 8-bit affine decision to MXFP8 (`mode="mxfp8", bits=8, group_size=32`) — except for keys whose dequantizers are affine-only (`embed_tokens`, `lm_head`, `router.proj`, `embed_vision.embedding_projection`).
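
The resulting per-tensor decision can be pictured as a lookup on the weight key. This is a hypothetical sketch that restates the table and the exception list above; it is not the mlx-node implementation, and the full key names in the usage lines are illustrative.

```typescript
type QuantMode = 'mxfp8' | 'affine8' | 'bf16' | 'unspecified';

// Keys whose dequantizers are affine-only, so they stay 8-bit affine.
const AFFINE_ONLY = ['embed_tokens', 'lm_head', 'router.proj', 'embed_vision.embedding_projection'];

// Projections listed as mxfp8 in the bit-assignment table.
const MXFP8_PROJECTIONS = [
  'self_attn.q_proj', 'self_attn.k_proj', 'self_attn.v_proj',
  'mlp.gate_proj', 'mlp.up_proj', 'mlp.down_proj',
  'experts.switch_glu.gate_proj', 'experts.switch_glu.up_proj', 'experts.switch_glu.down_proj',
];

function quantModeFor(key: string): QuantMode {
  if (AFFINE_ONLY.some((name) => key.includes(name))) return 'affine8';
  if (key.includes('self_attn.o_proj')) return 'bf16'; // kept full-precision
  if (MXFP8_PROJECTIONS.some((name) => key.includes(name))) return 'mxfp8';
  return 'unspecified'; // tensors this card does not describe (e.g. norms)
}

// Hypothetical key names, for illustration only.
const qMode = quantModeFor('model.layers.0.self_attn.q_proj.weight'); // 'mxfp8'
const headMode = quantModeFor('lm_head.weight'); // 'affine8'
```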

imatrix AWQ pre-scaling amplifies important weight channels and fuses inverse scales into preceding layer norms (zero inference overhead). At 8-bit, the recipe's primary contribution is the affine-only safety net for embedding and routing layers.
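
The "zero inference overhead" property follows from a simple identity: scaling input channel `j` of a weight matrix by `s[j]` and dividing the preceding norm gain by `s[j]` leaves the layer output unchanged before quantization. The sketch below demonstrates that identity on a toy example; the helper names are illustrative, the scales are arbitrary, and how the scales are derived from the imatrix is not shown.

```typescript
// y = W · (g ⊙ xHat): a norm gain g followed by a linear layer W (rows are outputs).
function linearAfterNorm(W: number[][], g: number[], xHat: number[]): number[] {
  return W.map((row) => row.reduce((acc, w, j) => acc + w * g[j] * xHat[j], 0));
}

// AWQ-style rescaling: multiply input channel j of W by s[j] and divide gain j by s[j].
function applyAwqScales(W: number[][], g: number[], s: number[]) {
  return {
    W: W.map((row) => row.map((w, j) => w * s[j])),
    g: g.map((gj, j) => gj / s[j]),
  };
}

const weights = [[0.5, -1.0], [2.0, 0.25]];
const gain = [1.0, 1.5];
const normalized = [0.3, -0.7];
const scales = [4, 0.5]; // arbitrary example scales

const rescaled = applyAwqScales(weights, gain, scales);
const before = linearAfterNorm(weights, gain, normalized);
const after = linearAfterNorm(rescaled.W, rescaled.g, normalized);
// `before` and `after` agree up to floating-point rounding.
```

Quantization is then applied to the rescaled weights, where the amplified channels lose relatively less precision; the norm gain stays in full precision and carries the inverse scale.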

## Architecture

| Parameter | Value |
|---|---|
| Total parameters | ~26B (~4B active per token) |
| Hidden size | 2,816 |
| Layers | 30 (sliding-window attention) |
| Attention heads | 16 (8 KV heads, GQA 2:1) |
| Head dimension | 256 |
| Experts | 128 per MoE layer |
| MoE intermediate size | 704 |
| Vocab size | 262,144 |
| Max context | 262,144 tokens |
| Vision | yes (Gemma4ForConditionalGeneration) |
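
The table values can be cross-checked against the ~26B total with a little arithmetic. This sketch assumes that each expert holds gate, up, and down projections of shape hidden × intermediate and that all 30 layers contain a MoE block; neither assumption is stated in this card.

```typescript
const hiddenSize = 2816;
const moeIntermediate = 704;
const expertsPerLayer = 128;
const layers = 30;
const vocabSize = 262144;

// Assumption: three projections (gate, up, down) per expert.
const paramsPerExpert = 3 * hiddenSize * moeIntermediate; // 5,947,392
const expertParamsTotal = paramsPerExpert * expertsPerLayer * layers; // 22,837,985,280
const embeddingParams = vocabSize * hiddenSize; // 738,197,504

console.log((expertParamsTotal / 1e9).toFixed(1)); // prints "22.8"
console.log((embeddingParams / 1e9).toFixed(2)); // prints "0.74"
```

Under these assumptions the experts alone account for roughly 22.8B parameters, which is consistent with a ~26B total once attention, the shared dense MLP, embeddings, and the vision tower are added.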

## Usage

```typescript
import { loadSession } from '@mlx-node/lm';

const session = await loadSession('./Gemma-4-26B-A4B-IT-UD-MXFP8_K_XL-mlx');

for await (const event of session.sendStream('Explain the MoE architecture in Gemma-4.', {
  config: { maxNewTokens: 2048, temperature: 0.6, reasoningEffort: 'low' },
})) {
  if (!event.done) process.stdout.write(event.text);
}
```
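
To collect the full reply instead of printing it as it streams, a small helper can fold the events into a string. The event shape here (`done` and `text`) is inferred from the snippet above rather than taken from the mlx-node type definitions, and `fakeStream` is a stand-in so the helper can be tried without loading the model.

```typescript
// Event shape assumed from the streaming snippet: text chunks until `done` is true.
interface StreamEvent {
  done: boolean;
  text: string;
}

// Concatenate the text of all non-final events.
async function collectText(events: AsyncIterable<StreamEvent>): Promise<string> {
  let out = '';
  for await (const event of events) {
    if (!event.done) out += event.text;
  }
  return out;
}

// Stand-in stream for local experimentation (not part of mlx-node).
async function* fakeStream(): AsyncGenerator<StreamEvent> {
  yield { done: false, text: 'Hello' };
  yield { done: false, text: ', world' };
  yield { done: true, text: '' };
}
```

If the assumed event shape holds, the iterable returned by `session.sendStream(...)` could be passed to `collectText` in place of `fakeStream()`.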

## How It Was Made

```bash
mlx convert \
  -i gemma-4-26b-a4b-it \
  -o Gemma-4-26B-A4B-IT-UD-MXFP8_K_XL-mlx \
  -q --q-bits 8 --q-mxfp --q-recipe unsloth \
  --imatrix-path imatrix_unsloth.gguf
```

## Acknowledgments

- **[Unsloth](https://unsloth.ai)** — Quantization strategy based on their [per-layer KLD benchmarks](https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks) and Dynamic 2.0 methodology
- **[Google DeepMind](https://deepmind.google/)** — For the Gemma-4 model family
- **[OCP Microscaling FP](https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf)** — For the MXFP4/MXFP8 specification
- **[Apple MLX](https://github.com/ml-explore/mlx)** — For the Metal-accelerated ML framework

## License

[Gemma Terms of Use](https://ai.google.dev/gemma/terms) (inherited from base model).