Gemma-4-26B-A4B-IT — UD-MXFP4_K_XL (mlx-node)

MXFP4 (OCP micro-scaling FP4) quantization of google/gemma-4-26b-a4b-it for Apple Silicon, using the Unsloth Dynamic quantization strategy via mlx-node.

| | Original (BF16) | UD-MXFP4_K_XL (this model) |
| --- | --- | --- |
| Size | ~49 GB | 16 GB |
| Format | SafeTensors | SafeTensors |
| Precision | BF16 uniform | MXFP4 (E8M0 scales) + mixed affine + BF16 |
| FFN group size | | 32 |
| Biases | | no (MLP); yes (affine layers) |

What is MXFP4?

MXFP4 is the Open Compute Project (OCP) micro-scaling FP4 format. Each group of 32 elements shares a single 8-bit E8M0 scale (a power-of-two exponent), and elements themselves are stored as E2M1 FP4 values. Compared to 4-bit affine:

  • Half the scale storage: uint8 E8M0 vs. fp16/fp32 affine scales
  • No biases: the zero point is implicit because E2M1 values are signed and symmetric around zero
  • Hardware-friendly: scale is just an exponent shift, no FP multiply on the scale path

For typical LLM weight distributions, MXFP4 retains quality on par with 4-bit affine while shrinking the metadata footprint.
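The group layout described above can be sketched in a few lines. This is a minimal illustration, not mlx-node's kernel: the function names are hypothetical, the scale exponent follows the OCP convention of floor(log2(max |x|)) minus the largest E2M1 exponent (2), ties round toward the smaller magnitude for simplicity, and the clamping of the exponent to the E8M0 range is omitted. Real groups hold 32 elements; the function accepts any length so short examples stay readable.

```typescript
// Magnitudes representable by one E2M1 (FP4) element; the sign is a separate bit.
const E2M1_MAGNITUDES = [0, 0.5, 1, 1.5, 2, 3, 4, 6];
const E2M1_EMAX = 2; // exponent of the largest E2M1 value (6 = 1.5 * 2^2)

interface MxGroup {
  scaleExp: number; // unbiased power-of-two exponent; E8M0 would store scaleExp + 127
  codes: number[]; // signed E2M1 values, kept as plain numbers for readability
}

// Snap a pre-scaled value to the nearest representable E2M1 value (saturating at ±6).
function nearestE2M1(x: number): number {
  const mag = Math.min(Math.abs(x), 6);
  let best = 0;
  for (const m of E2M1_MAGNITUDES) {
    if (Math.abs(m - mag) < Math.abs(best - mag)) best = m;
  }
  if (best === 0) return 0;
  return x < 0 ? -best : best;
}

// Quantize one group: a single shared power-of-two scale plus one FP4 code per element.
function quantizeGroup(values: number[]): MxGroup {
  const maxAbs = Math.max(...values.map((v) => Math.abs(v)));
  const scaleExp = maxAbs === 0 ? 0 : Math.floor(Math.log2(maxAbs)) - E2M1_EMAX;
  const scale = 2 ** scaleExp;
  return { scaleExp, codes: values.map((v) => nearestE2M1(v / scale)) };
}

// Dequantize: multiplying by a power of two is only an exponent shift.
function dequantizeGroup(group: MxGroup): number[] {
  const scale = 2 ** group.scaleExp;
  return group.codes.map((c) => c * scale);
}

const example = quantizeGroup([0.1, 0.2, -0.4]);
console.log(example, dequantizeGroup(example));
```

Because the scale is a pure power of two, the only rounding error comes from snapping each element to the eight E2M1 magnitudes.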

All Variants

Benchmarked on Apple M3 Max 128GB via examples/lm.ts (best decode tok/s across turns 2–4, steady-state, capitals chat with reasoningEffort: 'low').

Note: No Q2 variant is published. Gemma-4-26B-A4B-IT has only ~4B active parameters per token, which leaves too little redundancy for 2-bit quantization to stay coherent: both the unsloth and mixed_2_6 recipes produced gibberish at Q2 on this model.

Performance

Steady-state decode: 58.4 tok/s on Apple M3 Max 128GB (best of turns 2–4, examples/lm.ts capitals chat with reasoningEffort: 'low'). Decode is memory-bandwidth bound on Apple Silicon, so fewer bytes read per token translate directly into higher throughput. The MoE architecture activates only the top-K of 128 experts per token (~4B active out of ~26B total parameters), and the compiled C++ forward graph fuses the per-layer dispatch.
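To make the bandwidth argument concrete, here is a rough back-of-the-envelope sketch. It assumes every active weight is read once per token and ignores the KV cache, activations, and kernel overhead, so it gives a ceiling rather than a prediction. The function name is hypothetical, and the inputs in the usage lines (400 GB/s of memory bandwidth, 4.25 bits per MXFP4 weight including the shared scale) are illustrative assumptions, not measurements from this model.

```typescript
// Upper bound on decode speed if reading the active weights were the only cost.
function decodeCeilingTokPerSec(
  bandwidthGBps: number, // assumed memory bandwidth in GB/s
  activeParams: number, // parameters read per token
  avgBitsPerWeight: number, // average storage cost per weight, metadata included
): number {
  const bytesPerToken = (activeParams * avgBitsPerWeight) / 8;
  return (bandwidthGBps * 1e9) / bytesPerToken;
}

// Illustrative comparison: the same active parameter count at two storage widths.
const mxfp4Ceiling = decodeCeilingTokPerSec(400, 4e9, 4.25);
const bf16Ceiling = decodeCeilingTokPerSec(400, 4e9, 16);
console.log({ mxfp4Ceiling, bf16Ceiling, ratio: mxfp4Ceiling / bf16Ceiling });
```

Under this model the ceiling scales inversely with bits per weight, which is why shrinking the weights speeds up decode. Measured throughput sits below the ceiling because of the costs the sketch leaves out.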

Per-Tensor Bit Assignments (N=4)

| Weight | Mode | Bits | Group size | Rationale |
| --- | --- | --- | --- | --- |
| embed_tokens | 6-bit affine | 6 | 64 | Tied with lm_head (Gemma4 shares weights); affine-only loader |
| self_attn.q_proj | 6-bit affine | 6 | 64 | AWQ-corrected via input_layernorm |
| self_attn.k_proj | 6-bit affine | 6 | 64 | AWQ-corrected via input_layernorm |
| self_attn.v_proj | 6-bit affine | 6 | 64 | AWQ-corrected via input_layernorm (only on full-attention layers) |
| mlp.gate_proj | mxfp4 | 4 | 32 | Shared dense MLP |
| mlp.up_proj | mxfp4 | 4 | 32 | Shared dense MLP |
| mlp.down_proj | 5-bit affine | 5 | 64 | Shared dense MLP; "slightly more sensitive" (unsloth base+1) |
| experts.switch_glu.gate_proj | mxfp4 | 4 | 32 | MoE expert gate (per-expert across all 128); base bits |
| experts.switch_glu.up_proj | mxfp4 | 4 | 32 | MoE expert up (per-expert across all 128); base bits |
| experts.switch_glu.down_proj | 5-bit affine | 5 | 64 | MoE expert down (per-expert across all 128 + routing); unsloth base+1 |
| router.proj | 8-bit affine | 8 | 64 | MoE routing; low-bit noise breaks top-K expert selection |
| self_attn.o_proj | bf16 | | | NOT AWQ-correctable; kept full-precision |

Quantization Strategy

Built on Unsloth Dynamic 2.0 per-tensor KLD analysis. At --q-bits 4 the unsloth recipe assigns 4-bit to the MLP gate/up projections (both shared dense and per-expert), 5-bit affine to down_proj, 6-bit affine + AWQ pre-scaling to attention q/k/v, 8-bit affine to the MoE router, and keeps self_attn.o_proj as bf16. Then --q-mxfp orthogonally promotes the 4-bit affine decisions to MXFP4 (mode="mxfp4", bits=4, group_size=32). Keys whose dequantizers are affine-only (embed_tokens, router.proj, embed_vision.embedding_projection) are never promoted; the MoE router in particular stays at 8-bit affine because FP4 noise breaks top-K expert selection.
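The resulting per-tensor decisions can be written as a small lookup. This is a sketch of the outcome listed in this card, not the converter's actual code: the `assignQuant` function, the `QuantSpec` type, and the substring matching on tensor names are all hypothetical. Tensors the card does not list (for example embed_vision.embedding_projection, whose bit width is not stated) return `undefined`.

```typescript
type QuantSpec =
  | { mode: 'mxfp4'; bits: 4; groupSize: 32 }
  | { mode: 'affine'; bits: 5 | 6 | 8; groupSize: 64 }
  | { mode: 'bf16' };

// Map a tensor name to the quantization this card reports for --q-bits 4 --q-mxfp.
function assignQuant(tensorName: string): QuantSpec | undefined {
  // The router is checked first: it must stay 8-bit affine to keep top-K selection stable.
  if (tensorName.includes('router.proj')) return { mode: 'affine', bits: 8, groupSize: 64 };
  if (tensorName.includes('self_attn.o_proj')) return { mode: 'bf16' };
  if (tensorName.includes('embed_tokens')) return { mode: 'affine', bits: 6, groupSize: 64 };
  if (/self_attn\.(q|k|v)_proj/.test(tensorName)) return { mode: 'affine', bits: 6, groupSize: 64 };
  // down_proj gets base+1 bits and stays affine, for dense and expert MLPs alike.
  if (tensorName.includes('down_proj')) return { mode: 'affine', bits: 5, groupSize: 64 };
  // gate/up projections are the 4-bit decisions that --q-mxfp promotes to MXFP4.
  if (tensorName.includes('gate_proj') || tensorName.includes('up_proj')) {
    return { mode: 'mxfp4', bits: 4, groupSize: 32 };
  }
  return undefined; // not covered by the table in this card
}

// Hypothetical tensor names, used only to exercise the lookup.
console.log(assignQuant('layers.0.experts.switch_glu.up_proj.weight'));
console.log(assignQuant('layers.0.router.proj.weight'));
```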

imatrix-guided AWQ pre-scaling amplifies important weight channels and fuses the inverse scales into the preceding layer norms, so it adds no inference overhead.
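The reason the fusion is free can be checked numerically. In this toy sketch (all values and names are made up, and the quantization step that would follow the scaling is omitted), each input channel of a projection weight is multiplied by a scale while the matching layer-norm gain is divided by the same scale. The product is unchanged, so no extra operation is needed at inference time.

```typescript
// Plain matrix-vector product: one output per weight row.
function matVec(w: number[][], x: number[]): number[] {
  return w.map((row) => row.reduce((acc, wij, j) => acc + wij * x[j], 0));
}

// Toy data: normalized activations, layer-norm gain, and a 2x3 projection weight.
const normed = [0.5, -1.0, 2.0];
const gain = [1.0, 0.8, 1.2];
const weight = [
  [0.2, -0.1, 0.4],
  [0.3, 0.5, -0.2],
];
const channelScale = [2.0, 0.5, 4.0]; // hypothetical per-input-channel AWQ scales

// Baseline: projection applied to the gained, normalized activations.
const baseline = matVec(weight, normed.map((n, j) => n * gain[j]));

// AWQ-style rewrite: scale weight columns up, fold the inverse into the norm gain.
const scaledWeight = weight.map((row) => row.map((wij, j) => wij * channelScale[j]));
const fusedGain = gain.map((g, j) => g / channelScale[j]);
const fused = matVec(scaledWeight, normed.map((n, j) => n * fusedGain[j]));

console.log({ baseline, fused });
```

In a real pipeline the scaled weight is what gets quantized, so the amplified channels lose proportionally less precision to rounding.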

Architecture

| Parameter | Value |
| --- | --- |
| Total parameters | 26B (4B active per token) |
| Hidden size | 2,816 |
| Layers | 30 (sliding-window attention) |
| Attention heads | 16 (8 KV heads, GQA 2:1) |
| Head dimension | 256 |
| Experts | 128 per MoE layer |
| MoE intermediate size | 704 |
| Vocab size | 262,144 |
| Max context | 262,144 tokens |
| Vision | yes (Gemma4ForConditionalGeneration) |

Usage

import { loadSession } from '@mlx-node/lm';

const session = await loadSession('./Gemma-4-26B-A4B-IT-UD-MXFP4_K_XL-mlx');

for await (const event of session.sendStream('Explain the MoE architecture in Gemma-4.', {
  config: { maxNewTokens: 2048, temperature: 0.6, reasoningEffort: 'low' },
})) {
  if (!event.done) process.stdout.write(event.text);
}

How It Was Made

mlx convert \
  -i gemma-4-26b-a4b-it \
  -o Gemma-4-26B-A4B-IT-UD-MXFP4_K_XL-mlx \
  -q --q-bits 4 --q-mxfp --q-recipe unsloth \
  --imatrix-path imatrix_unsloth.gguf

Acknowledgments

License

Gemma Terms of Use (inherited from base model).
