---
license: gemma
library_name: gguf
base_model: unsloth/diffusiongemma-26B-A4B-it-GGUF
tags:
- gguf
- diffusion
- gemma
- moe
- 2-bit
- imatrix
- asymmetric-quant
pipeline_tag: text-generation
---

# diffusiongemma-26B-A4B: Asymmetric 2-bit-Expert GGUF (v2, from-Q8 + imatrix)

An **antirez-style asymmetric** low-bit GGUF quant of **diffusiongemma-26B-A4B-it**, a
**Gemma-4 MoE *diffusion* language model** (26B total parameters, ~4B active, 128 experts,
8 active per token, 30 layers).

> **This is a DIFFUSION model, not a standard autoregressive LM.** It generates by
> iterative parallel canvas denoising (`diffusion.canvas_length = 256`, `attention.causal = false`),
> not left-to-right next-token sampling. Standard `llama-cli` AR generation and perplexity
> (PPL) are **not the right validation harness** for this model class; coherence is judged
> by generation, not PPL (see validation below).

> **v2 supersedes the original release.** The original was built **from Q4_K_M, imatrix-free,
> with Q2_K experts, and serving was not validated**. This v2 rebuild fixes all three:
> **built from Q8_0** (near-lossless source), **imatrix-optimized** experts, and **serving-validated**
> on a CUDA diffusion-gemma visual-server. Same filename (`diffusiongemma-26B-A4B-asym-2bitexp.gguf`),
> so existing wiring resolves unchanged. **10.98 GB (was 12.02 GB).**

## Asymmetric quantization scheme (v2)

The whole point of an asymmetric quant is to spend bits where they matter. The routed
experts are the bulk of the weights but each is touched by only a fraction of tokens, so
they are pushed to 2-bit, but now with an **importance matrix** protecting the salient
channels. The down-projection (more sensitive) and the dense/attention path are kept higher.

| Tensor group | Tensor name(s) | Type written | Notes |
|---|---|---|---|
| Routed experts gate+up (FUSED) | `blk.*.ffn_gate_up_exps.weight` | **IQ2_S + imatrix** | ~2.44 bpw; imatrix-protected (was blind Q2_K) |
| Routed experts down | `blk.*.ffn_down_exps.weight` | **IQ4_NL + imatrix** | ~4.29 bpw; 704-col → best valid 4-bit for this shape |
| Attention q/k/v/o + dense FFN gate/up/down | `blk.*.attn_{q,k,v,output}`, `blk.*.ffn_{gate,up,down}.weight` | **Q5_K** (175) / **Q5_1** (30) | dense `ffn_down` (2112-col) → Q5_1 fallback |
| Token embeddings (tied output) | `token_embd.weight` | **Q6_K** | tied embeddings; no separate `output.weight` |
| Diffusion self-conditioning | `self_cond_{down,gate,up}.weight` | **Q4_K** (2) | non-256-divisible cols |
| Norms, scales, router (`ffn_gate_inp`) | `*_norm`, `*.scale`, `ffn_gate_inp.*` | **F32** | router kept precise |

Stored-type census from the model loader:
`f32:423, q5_K:175, q5_1:30, iq4_nl:30, iq2_s:30, q6_K:1, q4_K:2`.

## Importance matrix: produced and applied (the key v2 win)

Unlike the original imatrix-free release, an importance matrix **was produced and applied**:

- `llama-imatrix` was run on the **Q8_0 source** over `calibration_datav3.txt` at NGL=99
  using the diffusion-gemma graph: **129 chunks completed, 295 importance entries covering
  all 30 blocks' expert tensors at 93–99% coverage** (93–99% is the structural ceiling for a
  128-expert/8-active MoE: only the experts that fired on calibration data get stats;
  expected, not a failure).
- The quantize log confirms the imatrix was **loaded and applied to every expert tensor**
  (`have importance matrix data with 295 entries`; zero "no importance matrix" warnings).
- Honesty note: the AR-perplexity printed by `llama-imatrix` (~506) is **meaningless in
  absolute terms** for a non-AR diffusion model, but the activation statistics it collects
  are real and valid, which is what the quantizer consumes.

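
As a rough illustration of what "the quantizer consumes activation statistics" means, the sketch below picks a 2-bit scale for a single row by minimising an importance-weighted squared error. It is a conceptual toy under stated assumptions: the function names, the example row, and the importance values are hypothetical, and the real IQ2_S quantizer in the fork is more elaborate and is not reproduced here.

```python
def weighted_error(weights, importance, scale, codes):
    """Importance-weighted squared reconstruction error for one row."""
    return sum(i * (w - scale * q) ** 2
               for w, i, q in zip(weights, importance, codes))


def quantize_row(weights, importance, bits=2, candidates=100):
    """Grid-search the scale that minimises the importance-weighted error."""
    lo, hi = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1  # 2-bit: codes -2..1
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return 0.0, [0] * len(weights)
    best_scale, best_codes, best_err = 0.0, [], float("inf")
    for k in range(1, candidates + 1):
        scale = max_abs * k / candidates  # candidate scales in (0, max_abs]
        # Round to the nearest code, then clamp into the representable range.
        codes = [min(hi, max(lo, round(w / scale))) for w in weights]
        err = weighted_error(weights, importance, scale, codes)
        if err < best_err:
            best_scale, best_codes, best_err = scale, codes, err
    return best_scale, best_codes


# Hypothetical data: column 1 is small in magnitude but sees large activations.
row = [0.9, -0.1, 0.35, -0.6]
importance = [0.05, 4.0, 0.05, 0.05]
uniform = [1.0] * len(row)  # "blind" quantization treats every column equally

blind_scale, blind_codes = quantize_row(row, uniform)
aware_scale, aware_codes = quantize_row(row, importance)

blind_err = weighted_error(row, importance, blind_scale, blind_codes)
aware_err = weighted_error(row, importance, aware_scale, aware_codes)
print(f"blind: scale={blind_scale:.3f} weighted_err={blind_err:.5f}")
print(f"aware: scale={aware_scale:.3f} weighted_err={aware_err:.5f}")
```

Both calls search the same candidate scales, so the importance-aware choice can never score worse than the blind one on the weighted error; how much better it does depends on how unevenly the importance is distributed.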

The imatrix is why IQ2_S (which *requires* an imatrix) is usable for the fused experts
here, replacing the original's blind Q2_K. Imatrix protects the salient channels that drive
expert routing.

## Validation: coherence served (v2)

This file was **load-and-serve validated** on a CUDA `diffusion-gemma-visual-server`
(prism/llama.cpp build with the `DIFFUSION_GEMMA` forward graph, NGL=99, on-device CUDA
sampler, entropy-bound denoising). Five prompts (factual, code, explanation, list, creative)
all produced **fully coherent committed answers with no word-drops**: correct Fibonacci
code, correct first-10 primes (2,3,5,7,11,13,17,19,23,29), accurate Rayleigh-scattering
explanation, coherent multi-sentence prose. Measured **141–270 tok/s** on the substantive
prompts (H100).

Against the original artifact on the identical prompts, v2 was **faster on every prompt**
and **converged in fewer denoising steps** (e.g. 77 vs 100 steps on the code prompt), at
**equal-or-better coherence** and a **smaller file**. Fewer steps to converge is a direct
confidence/quality signal for entropy-bound diffusion.

## Provenance / caveats (v2)

- **Source:** `unsloth/diffusiongemma-26B-A4B-it-GGUF`, **Q8_0** (26.88 GB, near-lossless).
  `--allow-requantize` is required (source is already quantized), but Q8_0 removes the
  lossy Q4_K_M generation that bounded the original release.
- The GGUF metadata arch is `diffusion-gemma`; tensor names and data are unchanged. To serve
  this file you need a runtime that understands the diffusion-gemma tensor set
  (`self_cond_*`, transposed `ffn_gate_inp`, fused `ffn_gate_up_exps`), e.g. a
  prism/llama.cpp build with `LLM_ARCH_DIFFUSION_GEMMA` / the `diffusion-gemma-visual-server`.

## Files

- `diffusiongemma-26B-A4B-asym-2bitexp.gguf`: the v2 asymmetric 2-bit-expert quant (10.98 GB).
- sha256: `c7c66b99fbc311cfc61fb74380e037b2667db4bc79a98a284887b2b17b1d7a14`

Built with a prism (llama.cpp-derived, diffusion fork) `llama-quantize` + `llama-imatrix`,
CUDA build, on an H100.
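
To check a download against the sha256 listed under Files, a minimal Python sketch follows. It assumes the GGUF sits in the current working directory under the published filename; the helper name `sha256_of` and the 1 MiB chunk size are illustrative choices, not part of any tooling shipped with this release.

```python
import hashlib
from pathlib import Path

# Digest copied from the Files section of this card.
EXPECTED_SHA256 = "c7c66b99fbc311cfc61fb74380e037b2667db4bc79a98a284887b2b17b1d7a14"


def sha256_of(path, chunk_size=1 << 20):
    """Stream the file in chunks so an ~11 GB GGUF never has to fit in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    # Assumed location: the current working directory.
    gguf = Path("diffusiongemma-26B-A4B-asym-2bitexp.gguf")
    if gguf.exists():
        actual = sha256_of(gguf)
        print("OK" if actual == EXPECTED_SHA256 else f"MISMATCH: {actual}")
    else:
        print(f"{gguf} not found")
```

A mismatch usually means a truncated or corrupted download, or that the file on disk is the original release rather than v2, since both share the same filename.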