diffusiongemma-26B-A4B — Asymmetric 2-bit-Expert GGUF (v2, from-Q8 + imatrix)

An antirez-style asymmetric low-bit GGUF quant of diffusiongemma-26B-A4B-it, a Gemma-4 MoE diffusion language model (26B total parameters, ~4B active, 128 experts, 8 active per token, 30 layers).

This is a DIFFUSION model, not a standard autoregressive LM. It generates by iterative parallel canvas denoising (diffusion.canvas_length = 256, attention.causal = false), not left-to-right next-token sampling. Standard llama-cli AR generation and perplexity (PPL) are not the right validation harness for this model class — coherence is judged by generation, not PPL (see validation below).
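To make the contrast with left-to-right sampling concrete, here is a toy sketch of entropy-thresholded parallel unmasking. It is not the model's sampler: `toy_model`, the vocabulary size, and the `entropy_bound` value are hypothetical stand-ins, and only the canvas length of 256 and the non-causal, whole-canvas view come from the description above.

```python
import math
import random

MASK = None  # placeholder for a canvas position that is not committed yet


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def toy_model(canvas, vocab_size, rng):
    """Hypothetical denoiser stand-in: one distribution per canvas position.

    It looks at the whole canvas at once (non-causal) and gets sharper as more
    positions are committed, which is what lets the loop converge.
    """
    committed = sum(tok is not MASK for tok in canvas) / len(canvas)
    dists = []
    for _ in canvas:
        peak = min(0.99, 0.5 + 0.5 * committed + 0.3 * rng.random())
        rest = (1.0 - peak) / (vocab_size - 1)
        winner = rng.randrange(vocab_size)
        dists.append([peak if t == winner else rest for t in range(vocab_size)])
    return dists


def denoise(canvas_length=256, vocab_size=16, entropy_bound=1.2,
            max_steps=256, seed=0):
    """Commit low-entropy positions in parallel until the canvas is full."""
    rng = random.Random(seed)
    canvas = [MASK] * canvas_length
    for step in range(1, max_steps + 1):
        dists = toy_model(canvas, vocab_size, rng)
        scored = [(entropy(d), i) for i, d in enumerate(dists) if canvas[i] is MASK]
        # Commit every masked position under the entropy bound; if none
        # qualifies, commit the single most confident one to keep progressing.
        chosen = [i for h, i in scored if h <= entropy_bound] or [min(scored)[1]]
        for i in chosen:
            canvas[i] = max(range(vocab_size), key=dists[i].__getitem__)
        if MASK not in canvas:
            return canvas, step
    return canvas, max_steps


canvas, steps = denoise()
print(f"filled {len(canvas)} positions in {steps} denoising steps")
```

Because many positions can be committed in one step, the step count is typically far below the canvas length, which is why the number of steps to converge is a more natural metric here than per-token perplexity.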

v2 supersedes the original release. The original was built from Q4_K_M without an imatrix, used Q2_K experts, and was never validated for serving. The v2 rebuild fixes all three: it is built from Q8_0 (a near-lossless source), its experts are imatrix-optimized, and it is serving-validated on a CUDA diffusion-gemma-visual-server. The filename is unchanged (diffusiongemma-26B-A4B-asym-2bitexp.gguf), so existing wiring resolves as before. The file is 10.98 GB (was 12.02 GB).

Asymmetric quantization scheme (v2)

The point of an asymmetric quant is to spend bits where they matter. The routed experts make up the bulk of the weights, but each one is touched by only a fraction of tokens, so they are pushed to 2-bit, now with an importance matrix protecting the salient channels. The more sensitive expert down-projection and the dense/attention path are kept at higher precision.

| Tensor group | Tensor name(s) | Type written | Notes |
|---|---|---|---|
| Routed experts gate+up (FUSED) | blk.*.ffn_gate_up_exps.weight | IQ2_S + imatrix | ~2.44 bpw; imatrix-protected (was blind Q2_K) |
| Routed experts down | blk.*.ffn_down_exps.weight | IQ4_NL + imatrix | ~4.29 bpw; 704-col → best valid 4-bit for this shape |
| Attention q/k/v/o + dense FFN gate/up/down | blk.*.attn_{q,k,v,output}, blk.*.ffn_{gate,up,down}.weight | Q5_K (175) / Q5_1 (30) | dense ffn_down (2112-col) → Q5_1 fallback |
| Token embeddings (tied output) | token_embd.weight | Q6_K | tied embeddings; no separate output.weight |
| Diffusion self-conditioning | self_cond_{down,gate,up}.weight | Q4_K (2) | non-256-divisible cols |
| Norms, scales, router (ffn_gate_inp) | *_norm, *.scale, ffn_gate_inp.* | F32 | router kept precise |
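The two fallbacks noted in the table (704-col → IQ4_NL, 2112-col → Q5_1) follow from block-size divisibility: in ggml, K-quants and IQ2_S pack weights in 256-weight super-blocks, while IQ4_NL, Q5_1 and Q8_0 use 32-weight blocks. The sketch below illustrates that check only; the preference lists and the `first_valid_type` helper are hypothetical, not the quantizer's actual selection logic.

```python
# Weights per quantization block in ggml: K-quants and IQ2_S use 256-weight
# super-blocks; IQ4_NL, Q5_1 and Q8_0 use 32-weight blocks.
BLOCK_SIZE = {
    "Q4_K": 256, "Q5_K": 256, "Q6_K": 256, "IQ2_S": 256,
    "IQ4_NL": 32, "Q5_1": 32, "Q8_0": 32,
}


def first_valid_type(n_cols, preferences):
    """Return the first type in `preferences` whose block size divides n_cols."""
    for qtype in preferences:
        if n_cols % BLOCK_SIZE[qtype] == 0:
            return qtype
    return "F32"  # hypothetical last resort for this sketch


# 704 = 22 * 32 and 2112 = 66 * 32; neither is a multiple of 256.
print(first_valid_type(704, ["Q4_K", "IQ4_NL"]))   # IQ4_NL
print(first_valid_type(2112, ["Q5_K", "Q5_1"]))    # Q5_1
```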

Stored-type census from the model loader: f32:423, q5_K:175, q5_1:30, iq4_nl:30, iq2_s:30, q6_K:1, q4_K:2.
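As a quick sanity check, the census can be parsed and cross-checked against the 30-layer structure. The snippet uses only the figures printed above; the variable names are illustrative.

```python
census = "f32:423, q5_K:175, q5_1:30, iq4_nl:30, iq2_s:30, q6_K:1, q4_K:2"

# Parse "type:count" pairs into a dict of integer counts.
counts = {
    name.strip(): int(num)
    for name, num in (item.split(":") for item in census.split(","))
}

total = sum(counts.values())
quantized = total - counts["f32"]
print(f"{total} tensors in total, {quantized} stored in a quantized type")

# One fused gate+up expert tensor, one expert down tensor and one dense
# ffn_down fallback per layer should each add up to the 30 layers.
n_layers = 30
for qtype in ("iq2_s", "iq4_nl", "q5_1"):
    assert counts[qtype] == n_layers, qtype
```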

Importance matrix — produced and applied (the key v2 win)

Unlike the original imatrix-free release, an importance matrix was produced and applied:

  • llama-imatrix was run on the Q8_0 source over calibration_datav3.txt at NGL=99 (all layers offloaded to the GPU) using the diffusion-gemma graph: 129 chunks completed, yielding 295 importance entries that cover all 30 blocks' expert tensors at 93–99% coverage. That range is the structural ceiling for a 128-expert/8-active MoE: only the experts that fired on the calibration data get statistics, so it is expected and not a failure.
  • The quantize log confirms the imatrix was loaded and applied to every expert tensor (have importance matrix data with 295 entries; zero "no importance matrix" warnings).
  • Honesty note: the AR perplexity printed by llama-imatrix (~506) is meaningless in absolute terms for a non-AR diffusion model, but the activation statistics it collects are real and valid, and those are what the quantizer consumes.
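The coverage figure is easier to read with a simplified picture of what gets collected. The sketch below assumes the statistic is a per-input-column sum of squared activations, and that an expert only accumulates from tokens routed to it; the function names and the toy data are hypothetical, and this is not the llama-imatrix source.

```python
def accumulate_expert_stats(tokens, n_experts, n_cols):
    """Accumulate squared activations per expert and per input column.

    tokens: iterable of (activation_vector, routed_expert_ids) pairs.
    Returns (sums, counts): sums[e][j] is the running sum of x[j]**2 over the
    tokens routed to expert e, and counts[e] is how many tokens that was.
    """
    sums = [[0.0] * n_cols for _ in range(n_experts)]
    counts = [0] * n_experts
    for x, expert_ids in tokens:
        for e in expert_ids:
            counts[e] += 1
            row = sums[e]
            for j, value in enumerate(x):
                row[j] += value * value
    return sums, counts


def coverage(counts):
    """Fraction of experts that received at least one calibration token."""
    return sum(c > 0 for c in counts) / len(counts)


# Toy data: 4 experts, 2 active per token, 3 input columns.
toy_tokens = [
    ([1.0, 2.0, 0.0], [0, 1]),
    ([0.5, 0.0, 3.0], [1, 2]),
]
sums, counts = accumulate_expert_stats(toy_tokens, n_experts=4, n_cols=3)
print(counts, coverage(counts))  # expert 3 never fired, so coverage is 0.75
```

An expert that never fires on the calibration text has no statistics at all, which is why coverage below 100% reflects routing sparsity rather than a failed run.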

This is why IQ2_S (which requires an imatrix) is usable for the fused experts here, replacing the original's blind Q2_K. Imatrix protects the salient channels that drive expert routing.

Validation — coherence served (v2)

This file was load-and-serve validated on a CUDA diffusion-gemma-visual-server (prism/llama.cpp build with the DIFFUSION_GEMMA forward graph, NGL=99, on-device CUDA sampler, entropy-bound denoising). Five prompts (factual, code, explanation, list, creative) all produced fully coherent committed answers with no word-drops — correct Fibonacci code, correct first-10 primes (2,3,5,7,11,13,17,19,23,29), accurate Rayleigh-scattering explanation, coherent multi-sentence prose. Measured 141–270 tok/s on the substantive prompts (H100).

Against the original artifact on identical prompts, v2 was faster on every prompt and converged in fewer denoising steps (e.g. 77 vs 100 steps on the code prompt), at equal-or-better coherence and with a smaller file. For entropy-bound diffusion, converging in fewer steps is a direct confidence/quality signal.

Provenance / caveats (v2)

  • Source: unsloth/diffusiongemma-26B-A4B-it-GGUF, Q8_0 (26.88 GB, near-lossless). --allow-requantize is required (source is already quantized), but Q8_0 removes the lossy Q4_K_M generation that bounded the original release.
  • The GGUF metadata arch is diffusion-gemma; tensor names and data are unchanged. To serve this file you need a runtime that understands the diffusion-gemma tensor set (self_cond_*, transposed ffn_gate_inp, fused ffn_gate_up_exps), e.g. a prism/llama.cpp build with LLM_ARCH_DIFFUSION_GEMMA / the diffusion-gemma-visual-server.
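Before pointing a runtime at the file, the architecture string can be checked directly from the GGUF header. The sketch below is a minimal reader written against the public GGUF layout (version 2 or later, little-endian); the helper names are illustrative, and in practice the `gguf` Python package from the llama.cpp repository is the more robust choice.

```python
import struct

# GGUF metadata value types: fixed-size scalars, plus string (8) and array (9).
_SCALARS = {0: "<B", 1: "<b", 2: "<H", 3: "<h", 4: "<I", 5: "<i",
            6: "<f", 7: "<?", 10: "<Q", 11: "<q", 12: "<d"}
_STRING, _ARRAY = 8, 9


def _read(f, fmt):
    """Read one little-endian scalar described by a struct format string."""
    return struct.unpack(fmt, f.read(struct.calcsize(fmt)))[0]


def _read_string(f):
    """GGUF strings are a uint64 byte length followed by UTF-8 bytes."""
    return f.read(_read(f, "<Q")).decode("utf-8")


def _read_value(f, vtype):
    if vtype in _SCALARS:
        return _read(f, _SCALARS[vtype])
    if vtype == _STRING:
        return _read_string(f)
    if vtype == _ARRAY:
        item_type = _read(f, "<I")
        count = _read(f, "<Q")
        return [_read_value(f, item_type) for _ in range(count)]
    raise ValueError(f"unknown GGUF value type {vtype}")


def read_gguf_metadata(path, wanted=("general.architecture",)):
    """Return the requested metadata keys, stopping once all are found."""
    found = {}
    with open(path, "rb") as f:
        if f.read(4) != b"GGUF":
            raise ValueError("not a GGUF file")
        _version = _read(f, "<I")
        _tensor_count = _read(f, "<Q")
        kv_count = _read(f, "<Q")
        for _ in range(kv_count):
            key = _read_string(f)
            value = _read_value(f, _read(f, "<I"))
            if key in wanted:
                found[key] = value
                if len(found) == len(wanted):
                    break
    return found
```

Calling `read_gguf_metadata("diffusiongemma-26B-A4B-asym-2bitexp.gguf")` on this file should report `diffusion-gemma` for `general.architecture`, per the note above; a runtime without that architecture will refuse to load it.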

Files

  • diffusiongemma-26B-A4B-asym-2bitexp.gguf — the v2 asymmetric 2-bit-expert quant (10.98 GB).
    • sha256: c7c66b99fbc311cfc61fb74380e037b2667db4bc79a98a284887b2b17b1d7a14
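A download can be checked against the published digest with the standard library. This is a minimal sketch: the local path is an assumption about where the file was saved, and the helper names are illustrative.

```python
import hashlib

EXPECTED_SHA256 = "c7c66b99fbc311cfc61fb74380e037b2667db4bc79a98a284887b2b17b1d7a14"


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so an 11 GB GGUF never sits in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path, expected=EXPECTED_SHA256):
    """True if the file at `path` hashes to the expected hex digest."""
    return sha256_of_file(path) == expected.lower()
```

For example, `matches_published_digest("diffusiongemma-26B-A4B-asym-2bitexp.gguf")` returns `True` only if the local copy is byte-identical to the v2 artifact listed above.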

Built with the llama-quantize and llama-imatrix tools from prism (a llama.cpp-derived diffusion fork), CUDA build, on an H100.
