Libraries

How to use hyperspaceai/diffusiongemma-26B-A4B-asym-2bitexp-GGUF with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="hyperspaceai/diffusiongemma-26B-A4B-asym-2bitexp-GGUF",
	filename="diffusiongemma-26B-A4B-asym-2bitexp.gguf",
)

llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			"content": "What is the capital of France?"
		}
	]
)

Local Apps

llama.cpp

How to use hyperspaceai/diffusiongemma-26B-A4B-asym-2bitexp-GGUF with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf hyperspaceai/diffusiongemma-26B-A4B-asym-2bitexp-GGUF
# Run inference directly in the terminal:
llama-cli -hf hyperspaceai/diffusiongemma-26B-A4B-asym-2bitexp-GGUF

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf hyperspaceai/diffusiongemma-26B-A4B-asym-2bitexp-GGUF
# Run inference directly in the terminal:
llama-cli -hf hyperspaceai/diffusiongemma-26B-A4B-asym-2bitexp-GGUF

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggml-org/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf hyperspaceai/diffusiongemma-26B-A4B-asym-2bitexp-GGUF
# Run inference directly in the terminal:
./llama-cli -hf hyperspaceai/diffusiongemma-26B-A4B-asym-2bitexp-GGUF

Build from source code

git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf hyperspaceai/diffusiongemma-26B-A4B-asym-2bitexp-GGUF
# Run inference directly in the terminal:
./build/bin/llama-cli -hf hyperspaceai/diffusiongemma-26B-A4B-asym-2bitexp-GGUF

Use Docker

docker model run hf.co/hyperspaceai/diffusiongemma-26B-A4B-asym-2bitexp-GGUF

vLLM

How to use hyperspaceai/diffusiongemma-26B-A4B-asym-2bitexp-GGUF with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "hyperspaceai/diffusiongemma-26B-A4B-asym-2bitexp-GGUF"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "hyperspaceai/diffusiongemma-26B-A4B-asym-2bitexp-GGUF",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
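
The same request can be issued from Python. This is a minimal sketch using only the standard library; it assumes a server is listening on http://localhost:8000 and exposes the `/v1/chat/completions` route used in the curl call. The helper names `build_chat_request` and `chat` are illustrative and not part of vLLM.

```python
import json
import urllib.request

MODEL_ID = "hyperspaceai/diffusiongemma-26B-A4B-asym-2bitexp-GGUF"


def build_chat_request(prompt, base_url="http://localhost:8000/v1", model=MODEL_ID):
    """Build the same POST request that the curl example sends."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the request and return the reply text (requires a running server)."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as response:
        body = json.load(response)
    # Assumes the usual OpenAI-style response shape.
    return body["choices"][0]["message"]["content"]
```

Calling `chat("What is the capital of France?")` sends the request. Passing `base_url="http://localhost:8080/v1"` points the same helper at a local llama-server instead.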

Use Docker

docker model run hf.co/hyperspaceai/diffusiongemma-26B-A4B-asym-2bitexp-GGUF

Ollama
How to use hyperspaceai/diffusiongemma-26B-A4B-asym-2bitexp-GGUF with Ollama:
```
ollama run hf.co/hyperspaceai/diffusiongemma-26B-A4B-asym-2bitexp-GGUF
```

Unsloth Studio

How to use hyperspaceai/diffusiongemma-26B-A4B-asym-2bitexp-GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for hyperspaceai/diffusiongemma-26B-A4B-asym-2bitexp-GGUF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for hyperspaceai/diffusiongemma-26B-A4B-asym-2bitexp-GGUF to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for hyperspaceai/diffusiongemma-26B-A4B-asym-2bitexp-GGUF to start chatting

Pi

How to use hyperspaceai/diffusiongemma-26B-A4B-asym-2bitexp-GGUF with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf hyperspaceai/diffusiongemma-26B-A4B-asym-2bitexp-GGUF

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "hyperspaceai/diffusiongemma-26B-A4B-asym-2bitexp-GGUF"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use hyperspaceai/diffusiongemma-26B-A4B-asym-2bitexp-GGUF with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf hyperspaceai/diffusiongemma-26B-A4B-asym-2bitexp-GGUF

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default hyperspaceai/diffusiongemma-26B-A4B-asym-2bitexp-GGUF

Run Hermes

hermes

Docker Model Runner
How to use hyperspaceai/diffusiongemma-26B-A4B-asym-2bitexp-GGUF with Docker Model Runner:
```
docker model run hf.co/hyperspaceai/diffusiongemma-26B-A4B-asym-2bitexp-GGUF
```

Lemonade

How to use hyperspaceai/diffusiongemma-26B-A4B-asym-2bitexp-GGUF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull hyperspaceai/diffusiongemma-26B-A4B-asym-2bitexp-GGUF

Run and chat with the model

lemonade run user.diffusiongemma-26B-A4B-asym-2bitexp-GGUF-{{QUANT_TAG}}

List all available models

lemonade list

diffusiongemma-26B-A4B — Asymmetric 2-bit-Expert GGUF (v2, from-Q8 + imatrix)

An antirez-style asymmetric low-bit GGUF quant of diffusiongemma-26B-A4B-it, a Gemma-4 MoE diffusion language model (26B total parameters, ~4B active, 128 experts, 8 active per token, 30 layers).

This is a DIFFUSION model, not a standard autoregressive LM. It generates by iterative parallel canvas denoising (diffusion.canvas_length = 256, attention.causal = false), not left-to-right next-token sampling. Standard llama-cli AR generation and perplexity (PPL) are not the right validation harness for this model class — coherence is judged by generation, not PPL (see validation below).
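
To make the contrast with left-to-right sampling concrete, here is a toy sketch of parallel canvas denoising. It is not this model's sampler. `toy_denoiser` is a hypothetical stand-in that returns one token distribution per canvas slot and simply grows more confident each step, and the commit rule (freeze a slot once its entropy falls below a threshold) is an assumption loosely modeled on the "entropy-bound denoising" named in the validation section.

```python
import math
import random

MASK = None  # marks a canvas slot that has not been committed yet


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def toy_denoiser(canvas, step, vocab_size, rng):
    """Hypothetical model stand-in: one distribution per slot, sharper each step.

    A real diffusion LM would condition on the whole canvas with
    non-causal attention; this toy ignores the canvas contents.
    """
    dists = []
    for pos in range(len(canvas)):
        target = pos % vocab_size
        sharpness = rng.uniform(0.5, 1.5) * (step + 1)
        logits = [sharpness if tok == target else 0.0 for tok in range(vocab_size)]
        z = sum(math.exp(v) for v in logits)
        dists.append([math.exp(v) / z for v in logits])
    return dists


def denoise(canvas_length=16, vocab_size=8, max_entropy=0.5, max_steps=100, seed=0):
    """Refine all slots in parallel; commit a slot once the model is confident."""
    rng = random.Random(seed)
    canvas = [MASK] * canvas_length
    for step in range(max_steps):
        dists = toy_denoiser(canvas, step, vocab_size, rng)
        for pos, probs in enumerate(dists):
            if canvas[pos] is MASK and entropy(probs) <= max_entropy:
                canvas[pos] = max(range(vocab_size), key=probs.__getitem__)
        if all(tok is not MASK for tok in canvas):
            return canvas, step + 1  # number of denoising steps used
    return canvas, max_steps


tokens, steps_used = denoise()
print(tokens, steps_used)
```

Every slot is updated on every step and slots commit in whatever order they become confident, so there is no left-to-right token stream for an AR harness to score.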

v2 supersedes the original release. The original was built from Q4_K_M, imatrix-free, with Q2_K experts, and serving was not validated. This v2 rebuild fixes all three: built from Q8_0 (near-lossless source), imatrix-optimized experts, and serving-validated on a CUDA diffusion-gemma visual-server. Same filename — diffusiongemma-26B-A4B-asym-2bitexp.gguf — so existing wiring resolves unchanged. 10.98 GB (was 12.02 GB).

Asymmetric quantization scheme (v2)

The whole point of an asymmetric quant is to spend bits where they matter. The routed experts are the bulk of the weights but each is touched by only a fraction of tokens, so they are pushed to 2-bit, now with an importance matrix protecting the salient channels. The more sensitive expert down-projection and the dense/attention path are kept at higher precision.

Tensor group	Tensor name(s)	Type written	Notes
Routed experts gate+up (FUSED)	`blk.*.ffn_gate_up_exps.weight`	IQ2_S + imatrix	~2.44 bpw; imatrix-protected (was blind Q2_K)
Routed experts down	`blk.*.ffn_down_exps.weight`	IQ4_NL + imatrix	~4.29 bpw; 704-col → best valid 4-bit for this shape
Attention q/k/v/o + dense FFN gate/up/down	`blk.*.attn_{q,k,v,output}`, `blk.*.ffn_{gate,up,down}.weight`	Q5_K (175) / Q5_1 (30)	dense `ffn_down` (2112-col) → Q5_1 fallback
Token embeddings (tied output)	`token_embd.weight`	Q6_K	tied embeddings; no separate `output.weight`
Diffusion self-conditioning	`self_cond_{down,gate,up}.weight`	Q4_K (2)	non-256-divisible cols
Norms, scales, router (`ffn_gate_inp`)	`*_norm`, `*.scale`, `ffn_gate_inp.*`	F32	router kept precise

Stored-type census from the model loader: f32:423, q5_K:175, q5_1:30, iq4_nl:30, iq2_s:30, q6_K:1, q4_K:2.
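
The census can be checked against the table with a few lines of Python. This is only a sketch: the census string is copied from the line above, the layer count of 30 comes from the model description, and `parse_census` is an illustrative helper rather than part of any tool.

```python
CENSUS = "f32:423, q5_K:175, q5_1:30, iq4_nl:30, iq2_s:30, q6_K:1, q4_K:2"
N_LAYERS = 30  # layer count stated in the model description


def parse_census(text):
    """Turn 'type:count, type:count, ...' into a dict of integer counts."""
    counts = {}
    for item in text.split(","):
        name, count = item.strip().split(":")
        counts[name] = int(count)
    return counts


counts = parse_census(CENSUS)
total_tensors = sum(counts.values())
quantized_tensors = total_tensors - counts["f32"]

# Types that occur exactly once per layer; per the table these are the fused
# expert gate+up (iq2_s), expert down (iq4_nl) and dense ffn_down (q5_1).
one_per_layer = sorted(name for name, n in counts.items() if n == N_LAYERS)

print(total_tensors, quantized_tensors, one_per_layer)
```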

Importance matrix — produced and applied (the key v2 win)

Unlike the original imatrix-free release, an importance matrix was produced and applied:

- llama-imatrix was run on the Q8_0 source over calibration_datav3.txt at NGL=99 using the diffusion-gemma graph: 129 chunks completed, yielding 295 importance entries that cover the expert tensors of all 30 blocks at 93–99% coverage. That range is the structural ceiling for a 128-expert/8-active MoE, since only experts that fired on the calibration data get statistics; it is expected, not a failure.
- The quantize log confirms the imatrix was loaded and applied to every expert tensor (have importance matrix data with 295 entries; zero "no importance matrix" warnings).
- Honesty note: the AR perplexity printed by llama-imatrix (~506) is meaningless in absolute terms for a non-AR diffusion model, but the activation statistics it collects are real and valid, and those are what the quantizer consumes.

This is why IQ2_S (which requires an imatrix) is usable for the fused experts here, replacing the original's blind Q2_K. The imatrix protects the salient channels that drive expert routing.
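
The effect of an importance matrix can be illustrated with a toy 2-bit quantizer. The sketch below is not the IQ2_S algorithm, which is considerably more elaborate. It only shows the general idea of choosing quantization parameters to minimize an importance-weighted squared error instead of a plain one. The row values, importance weights and candidate scales are made up for illustration.

```python
LEVELS = (-2, -1, 0, 1)  # four signed levels, i.e. 2 bits per weight


def dequantized(row, scale):
    """Round each weight to the nearest representable value scale * level."""
    return [scale * min(LEVELS, key=lambda q: abs(x - scale * q)) for x in row]


def weighted_error(row, approx, importance):
    """Squared reconstruction error, weighted per channel."""
    return sum(w * (x - y) ** 2 for x, y, w in zip(row, approx, importance))


def best_scale(row, importance, candidates):
    """Pick the candidate scale with the lowest weighted error."""
    return min(
        candidates,
        key=lambda s: weighted_error(row, dequantized(row, s), importance),
    )


row = [0.9, -0.1, 0.3, -1.6]          # made-up weights
importance = [10.0, 0.1, 0.1, 0.1]    # made-up: channel 0 matters most
blind = [1.0] * len(row)              # "no imatrix": all channels equal
candidates = [0.05 * k for k in range(1, 41)]

blind_scale = best_scale(row, blind, candidates)
aware_scale = best_scale(row, importance, candidates)

# Judge both choices by the error that matters: the importance-weighted one.
blind_err = weighted_error(row, dequantized(row, blind_scale), importance)
aware_err = weighted_error(row, dequantized(row, aware_scale), importance)
print(blind_scale, aware_scale, blind_err, aware_err)
```

Because both scales are drawn from the same candidate set, the importance-aware choice can never score worse on the weighted error than the blind one.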

Validation — coherence served (v2)

This file was load-and-serve validated on a CUDA diffusion-gemma-visual-server (prism/llama.cpp build with the DIFFUSION_GEMMA forward graph, NGL=99, on-device CUDA sampler, entropy-bound denoising). Five prompts (factual, code, explanation, list, creative) all produced fully coherent committed answers with no word-drops — correct Fibonacci code, correct first-10 primes (2,3,5,7,11,13,17,19,23,29), accurate Rayleigh-scattering explanation, coherent multi-sentence prose. Measured 141–270 tok/s on the substantive prompts (H100).

Against the original artifact on identical prompts, v2 was faster on every prompt and converged in fewer denoising steps (e.g. 77 vs 100 steps on the code prompt), at equal-or-better coherence and with a smaller file. For entropy-bound diffusion, converging in fewer steps is a direct confidence/quality signal.

Provenance / caveats (v2)

- Source: unsloth/diffusiongemma-26B-A4B-it-GGUF, Q8_0 (26.88 GB, near-lossless). --allow-requantize is required (source is already quantized), but Q8_0 removes the lossy Q4_K_M generation that bounded the original release.
- The GGUF metadata arch is diffusion-gemma; tensor names and data are unchanged. To serve this file you need a runtime that understands the diffusion-gemma tensor set (self_cond_*, transposed ffn_gate_inp, fused ffn_gate_up_exps), e.g. a prism/llama.cpp build with LLM_ARCH_DIFFUSION_GEMMA / the diffusion-gemma-visual-server.
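
Before pointing a runtime at the file, it can help to confirm what the GGUF header declares. The following is a minimal standard-library sketch of a GGUF key/value reader, assuming a GGUF v2 or later little-endian file. `read_gguf_metadata` is an illustrative helper, not an API of llama.cpp or any library. Only `general.architecture` is a standard key name; other key names should be checked against the file itself.

```python
import struct

# GGUF metadata value types: scalar type id -> struct format (little-endian).
_SCALAR = {0: "<B", 1: "<b", 2: "<H", 3: "<h", 4: "<I", 5: "<i",
           6: "<f", 7: "<?", 10: "<Q", 11: "<q", 12: "<d"}
_STRING, _ARRAY = 8, 9


def _read(f, fmt):
    return struct.unpack(fmt, f.read(struct.calcsize(fmt)))[0]


def _read_string(f):
    length = _read(f, "<Q")
    return f.read(length).decode("utf-8")


def _read_value(f, value_type):
    if value_type == _STRING:
        return _read_string(f)
    if value_type == _ARRAY:
        item_type = _read(f, "<I")
        count = _read(f, "<Q")
        return [_read_value(f, item_type) for _ in range(count)]
    return _read(f, _SCALAR[value_type])


def read_gguf_metadata(path, wanted):
    """Return the requested metadata keys from a GGUF file header."""
    with open(path, "rb") as f:
        if f.read(4) != b"GGUF":
            raise ValueError("not a GGUF file")
        _version = _read(f, "<I")
        _tensor_count = _read(f, "<Q")  # 64-bit counts assume GGUF v2+
        kv_count = _read(f, "<Q")
        found = {}
        for _ in range(kv_count):
            key = _read_string(f)
            value = _read_value(f, _read(f, "<I"))
            if key in wanted:
                found[key] = value
        return found
```

For example, `read_gguf_metadata("diffusiongemma-26B-A4B-asym-2bitexp.gguf", {"general.architecture"})` should report the architecture string the loader will see. Large array values such as the tokenizer vocabulary are parsed and discarded, so the call may take a moment.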

Files

diffusiongemma-26B-A4B-asym-2bitexp.gguf — the v2 asymmetric 2-bit-expert quant (10.98 GB).
- sha256: c7c66b99fbc311cfc61fb74380e037b2667db4bc79a98a284887b2b17b1d7a14
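
A downloaded copy can be checked against the published digest. This is a minimal sketch using Python's standard `hashlib`; the function names are illustrative, and the local path passed in is whatever location the file was saved to.

```python
import hashlib

# Digest published above for diffusiongemma-26B-A4B-asym-2bitexp.gguf (v2).
EXPECTED_SHA256 = "c7c66b99fbc311cfc61fb74380e037b2667db4bc79a98a284887b2b17b1d7a14"


def sha256_of(path, chunk_size=1 << 20):
    """Hash the file in 1 MiB chunks so an ~11 GB file never sits in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected=EXPECTED_SHA256):
    """True if the file's SHA-256 equals the expected hex digest."""
    return sha256_of(path) == expected.lower()
```

`matches_published_hash("diffusiongemma-26B-A4B-asym-2bitexp.gguf")` returns `False` for a truncated or corrupted download, and also for the original (pre-v2) file, which shares the same filename.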

Built with a prism (llama.cpp-derived, diffusion fork) llama-quantize + llama-imatrix, CUDA build, on an H100.

GGUF

Model size: 25B params
Architecture: diffusion-gemma

Model tree for hyperspaceai/diffusiongemma-26B-A4B-asym-2bitexp-GGUF

Base model: google/diffusiongemma-26B-A4B-it
Quantized: unsloth/diffusiongemma-26B-A4B-it-GGUF
Quantized (2): this model