How to use from the MLX library
# Make sure mlx-lm is installed
# pip install --upgrade mlx-lm

# Generate text with mlx-lm
from mlx_lm import load, generate

model, tokenizer = load("Brooooooklyn/Gemma-4-26B-A4B-IT-UD-MXFP8_K_XL-mlx")

prompt = "Write a story about Einstein"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True
)

text = generate(model, tokenizer, prompt=prompt, verbose=True)

Gemma-4-26B-A4B-IT — UD-MXFP8_K_XL (mlx-node)

MXFP8 (OCP micro-scaling FP8) quantization of google/gemma-4-26b-a4b-it for Apple Silicon, using the Unsloth Dynamic quantization strategy via mlx-node.

| | Original (BF16) | UD-MXFP8_K_XL (this model) |
|---|---|---|
| Size | ~49 GB | 26 GB |
| Format | SafeTensors | SafeTensors |
| Precision | BF16 uniform | MXFP8 (E8M0 scales) + BF16 |
| FFN group size | | 32 |
| Biases | | no |

What is MXFP8?

MXFP8 is the Open Compute Project (OCP) micro-scaling FP8 format. Each group of 32 elements shares a single 8-bit E8M0 scale (a power-of-two exponent), and elements themselves are stored as E4M3 FP8 values. Compared to 8-bit affine:

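  • Half the scale storage or less: a 1-byte uint8 E8M0 scale vs. 2-byte fp16 or 4-byte fp32 affine scales
  • No biases: the zero point is implicit because FP8 values are signed and cover a symmetric ±range
  • Hardware-friendly: the scale is just an exponent shift, so there is no FP multiply on the scale path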

For typical LLM weight distributions, MXFP8 retains quality on par with 8-bit affine while shrinking the metadata footprint.
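To make the format concrete, here is a minimal pure-Python sketch of quantizing one group of 32 values in the MXFP8 style described above: a shared power-of-two scale stored as a biased byte, plus elements rounded to E4M3. The helper names (`round_to_e4m3`, `quantize_group`, `dequantize_group`) are hypothetical, and the scale-selection rule (largest magnitude's exponent minus the E4M3 maximum exponent) is an assumption based on the usual OCP MX convention. It is not the kernel that MLX or mlx-node actually runs.

```python
import math

E4M3_MAX = 448.0   # largest finite E4M3 magnitude
E4M3_EMAX = 8      # exponent of the top E4M3 binade (448 = 1.75 * 2**8)
GROUP_SIZE = 32


def round_to_e4m3(v: float) -> float:
    """Round v to the nearest E4M3-representable value, saturating at +/-448."""
    if v == 0.0:
        return 0.0
    a = min(abs(v), E4M3_MAX)
    # frexp gives a = m * 2**e with 0.5 <= m < 1, so floor(log2(a)) == e - 1.
    exp = max(math.frexp(a)[1] - 1, -6)  # -6 is the smallest normal exponent
    step = 2.0 ** (exp - 3)              # 3 mantissa bits -> 8 steps per binade
    return math.copysign(round(a / step) * step, v)


def quantize_group(values):
    """Return (scale_byte, elements) for one group of 32 floats."""
    assert len(values) == GROUP_SIZE
    amax = max(abs(v) for v in values)
    # ASSUMPTION: shared exponent = floor(log2(amax)) - emax of the element format.
    shared_exp = 0 if amax == 0.0 else math.frexp(amax)[1] - 1 - E4M3_EMAX
    shared_exp = max(-127, min(127, shared_exp))
    scale = 2.0 ** shared_exp
    elements = [round_to_e4m3(v / scale) for v in values]
    return shared_exp + 127, elements    # E8M0 stores the exponent with bias 127


def dequantize_group(scale_byte, elements):
    """Undo quantize_group: multiply each element by the shared power of two."""
    scale = 2.0 ** (scale_byte - 127)
    return [e * scale for e in elements]


weights = [0.01 * i for i in range(1, 33)]
scale_byte, elements = quantize_group(weights)
restored = dequantize_group(scale_byte, elements)
max_rel_err = max(abs(r - w) / abs(w) for r, w in zip(restored, weights))
print(f"scale byte: {scale_byte}, max relative error: {max_rel_err:.4f}")
```

With three mantissa bits, the round-trip error of any element that stays in the normal range is bounded by 1/16 of its magnitude, which is the property the sketch checks.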

All Variants

Benchmarked on Apple M3 Max 128GB via examples/lm.ts (best decode tok/s across turns 2–4, steady-state, capitals chat with reasoningEffort: 'low').

Note: No Q2 variant is published — Gemma-4-26B-A4B-IT has only ~4B active parameters per token, which is below the architectural redundancy needed for 2-bit quantization to remain coherent. Both unsloth and mixed_2_6 recipes produced gibberish at Q2 on this model.

Performance

Steady-state decode: 49.8 tok/s on Apple M3 Max 128GB (best of turns 2–4, examples/lm.ts capitals chat with reasoningEffort: 'low'). Decode is memory-bandwidth bound on Apple Silicon — fewer bytes per token directly translates to higher throughput. The MoE architecture activates only top-K of 128 experts per token (~4B active out of ~26B total), and the compiled C++ forward graph fuses the per-layer dispatch.
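The bandwidth argument can be turned into a rough back-of-the-envelope ceiling. The sketch below assumes a peak memory bandwidth of 400 GB/s for the M3 Max (an assumption, not a figure from this card), treats all ~4B active parameters as MXFP8 with one 8-bit scale per 32 elements, and ignores KV-cache reads, activations, and the tensors kept in BF16. It is an optimistic upper bound, not a prediction.

```python
GROUP_SIZE = 32
ELEMENT_BITS = 8   # E4M3 element
SCALE_BITS = 8     # one E8M0 scale shared by each group

ACTIVE_PARAMS = 4e9             # "~4B active" per token, from the text above
BANDWIDTH_BYTES_PER_S = 400e9   # ASSUMPTION: peak M3 Max memory bandwidth
MEASURED_TOK_PER_S = 49.8       # steady-state decode reported above

# Effective storage cost per weight, including the shared scale.
bits_per_weight = ELEMENT_BITS + SCALE_BITS / GROUP_SIZE

# Bytes that must be streamed from memory to produce one token.
bytes_per_token = ACTIVE_PARAMS * bits_per_weight / 8

# If decode were purely bandwidth-limited, this would be the ceiling.
ceiling_tok_per_s = BANDWIDTH_BYTES_PER_S / bytes_per_token
utilization = MEASURED_TOK_PER_S / ceiling_tok_per_s

print(f"bits per weight:   {bits_per_weight:.2f}")
print(f"ceiling (tok/s):   {ceiling_tok_per_s:.1f}")
print(f"fraction achieved: {utilization:.2f}")
```

Under these assumptions the ceiling comes out near 97 tok/s, so the measured 49.8 tok/s is roughly half of the idealized bound.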

Per-Tensor Bit Assignments (N=8)

| Weight | Mode | Bits | Group | Rationale |
|---|---|---|---|---|
| self_attn.q_proj | mxfp8 | 8 | 32 | AWQ-corrected via input_layernorm |
| self_attn.k_proj | mxfp8 | 8 | 32 | AWQ-corrected via input_layernorm |
| self_attn.v_proj | mxfp8 | 8 | 32 | AWQ-corrected via input_layernorm (only on full-attention layers) |
| mlp.gate_proj | mxfp8 | 8 | 32 | Shared dense MLP |
| mlp.up_proj | mxfp8 | 8 | 32 | Shared dense MLP |
| mlp.down_proj | mxfp8 | 8 | 32 | Shared dense MLP; "slightly more sensitive" (unsloth base+1) |
| experts.switch_glu.gate_proj | mxfp8 | 8 | 32 | MoE expert gate (per-expert across all 128); base bits |
| experts.switch_glu.up_proj | mxfp8 | 8 | 32 | MoE expert up (per-expert across all 128); base bits |
| experts.switch_glu.down_proj | mxfp8 | 8 | 32 | MoE expert down (per-expert across all 128 + routing); unsloth base+1 |
| self_attn.o_proj | bf16 | | | NOT AWQ-correctable; kept full-precision |

Quantization Strategy

Built on Unsloth Dynamic 2.0 per-tensor KLD analysis. At --q-bits 8 the unsloth recipe's per-layer bit offsets all snap to 8-bit. Then --q-mxfp orthogonally promotes every 8-bit affine decision to MXFP8 (mode="mxfp8", bits=8, group_size=32) — except for keys whose dequantizers are affine-only (embed_tokens, lm_head, router.proj, embed_vision.embedding_projection).

imatrix AWQ pre-scaling amplifies important weight channels and fuses inverse scales into preceding layer norms (zero inference overhead). At 8-bit, the recipe's primary contribution is the affine-only safety net for embedding and routing layers.
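The "zero inference overhead" claim rests on a simple identity: multiplying a projection's input channels by per-channel scales and dividing the preceding norm's gain by the same scales leaves the output unchanged. The sketch below demonstrates that identity with a plain RMSNorm and a tiny matrix. The channel scales, sizes, and helper names are hypothetical, and the real recipe derives its scales from imatrix statistics rather than hard-coding them.

```python
import math


def rms_norm(x, gain, eps=1e-6):
    """Plain RMSNorm: normalize by the root-mean-square, then apply a per-channel gain."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms * g for v, g in zip(x, gain)]


def matvec(weight, x):
    """Multiply a row-major matrix by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


x = [0.5, -1.25, 2.0, 0.75]
gain = [1.0, 0.9, 1.1, 1.2]
weight = [
    [0.2, -0.1, 0.4, 0.3],
    [-0.5, 0.6, 0.1, -0.2],
]
# Hypothetical per-input-channel importance scales (illustrative values only).
channel_scale = [2.0, 0.5, 4.0, 1.0]

# Amplify the important input channels of the weight before quantization...
scaled_weight = [[w * s for w, s in zip(row, channel_scale)] for row in weight]
# ...and fold the inverse scales into the gain of the norm that feeds this layer.
fused_gain = [g / s for g, s in zip(gain, channel_scale)]

reference = matvec(weight, rms_norm(x, gain))
fused = matvec(scaled_weight, rms_norm(x, fused_gain))
print(reference)
print(fused)
```

This also suggests why the table marks `self_attn.o_proj` as not AWQ-correctable: presumably its input is the attention output rather than a layer norm, so there is no gain vector available to absorb the inverse scales.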

Architecture

| Parameter | Value |
|---|---|
| Total parameters | 26B (4B active per token) |
| Hidden size | 2,816 |
| Layers | 30 (sliding-window attention) |
| Attention heads | 16 (8 KV heads, GQA 2:1) |
| Head dimension | 256 |
| Experts | 128 per MoE layer |
| MoE intermediate size | 704 |
| Vocab size | 262,144 |
| Max context | 262,144 tokens |
| Vision | yes (Gemma4ForConditionalGeneration) |
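As a sanity check on these figures, the expert weights alone can be counted from the hidden size, MoE intermediate size, expert count, and layer count. The sketch assumes each expert holds exactly three projections (gate, up, down, as listed in the per-tensor table) and that all 30 layers contain an MoE block. The second point is an assumption, not something this card states.

```python
HIDDEN_SIZE = 2816
MOE_INTERMEDIATE_SIZE = 704
EXPERTS_PER_LAYER = 128
NUM_LAYERS = 30               # ASSUMPTION: every layer has an MoE block
PROJECTIONS_PER_EXPERT = 3    # gate_proj, up_proj, down_proj
TOTAL_PARAMS = 26e9           # "26B" from the table above

# Each projection is a HIDDEN_SIZE x MOE_INTERMEDIATE_SIZE matrix (or its transpose).
params_per_expert = PROJECTIONS_PER_EXPERT * HIDDEN_SIZE * MOE_INTERMEDIATE_SIZE
params_per_layer = params_per_expert * EXPERTS_PER_LAYER
total_expert_params = params_per_layer * NUM_LAYERS
expert_share = total_expert_params / TOTAL_PARAMS

print(f"per expert:  {params_per_expert:,}")
print(f"per layer:   {params_per_layer:,}")
print(f"all experts: {total_expert_params:,}")
print(f"share of 26B: {expert_share:.1%}")
```

Under these assumptions the routed experts account for roughly 22.8B of the 26B parameters, which is consistent with a model whose size is dominated by its MoE layers.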

Usage

import { loadSession } from '@mlx-node/lm';

const session = await loadSession('./Gemma-4-26B-A4B-IT-UD-MXFP8_K_XL-mlx');

for await (const event of session.sendStream('Explain the MoE architecture in Gemma-4.', {
  config: { maxNewTokens: 2048, temperature: 0.6, reasoningEffort: 'low' },
})) {
  if (!event.done) process.stdout.write(event.text);
}

How It Was Made

mlx convert \
  -i gemma-4-26b-a4b-it \
  -o Gemma-4-26B-A4B-IT-UD-MXFP8_K_XL-mlx \
  -q --q-bits 8 --q-mxfp --q-recipe unsloth \
  --imatrix-path imatrix_unsloth.gguf

Acknowledgments

License

Gemma Terms of Use (inherited from base model).
