diffusiongemma-26B-A4B — Asymmetric 2-bit-Expert GGUF (v2, from-Q8 + imatrix)
An antirez-style asymmetric low-bit GGUF quant of diffusiongemma-26B-A4B-it, a Gemma-4 MoE diffusion language model (26B total parameters, ~4B active, 128 experts, 8 active per token, 30 layers).
This is a DIFFUSION model, not a standard autoregressive LM. It generates by iterative parallel canvas denoising (`diffusion.canvas_length = 256`, `attention.causal = false`), not left-to-right next-token sampling. Standard `llama-cli` AR generation and perplexity (PPL) are not the right validation harness for this model class; coherence is judged by generation, not PPL (see validation below).
v2 supersedes the original release. The original was built from Q4_K_M, imatrix-free, with Q2_K experts, and serving was not validated. This v2 rebuild fixes all three: built from Q8_0 (near-lossless source), imatrix-optimized experts, and serving-validated on a CUDA diffusion-gemma visual-server. The filename is the same (`diffusiongemma-26B-A4B-asym-2bitexp.gguf`), so existing wiring resolves unchanged. Size: 10.98 GB (was 12.02 GB).
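As a rough sanity check on the sizes quoted above, the average bits per weight can be estimated from file size and parameter count. This is a back-of-the-envelope sketch, assuming "GB" means 10^9 bytes and taking the nominal 26B total parameters at face value; the real figure also includes F32 tensors and GGUF metadata, so treat the result as approximate.

```python
# Rough estimate only: assumes decimal gigabytes and a nominal 26e9 parameters.
BYTES_PER_GB = 1e9


def average_bits_per_weight(file_size_gb: float, params_billions: float) -> float:
    """Whole-file average bits per weight (metadata and F32 tensors included)."""
    return file_size_gb * BYTES_PER_GB * 8 / (params_billions * 1e9)


def percent_smaller(old_gb: float, new_gb: float) -> float:
    """Relative size reduction of the new file versus the old one, in percent."""
    return 100.0 * (old_gb - new_gb) / old_gb


v2_bpw = average_bits_per_weight(10.98, 26)
v1_bpw = average_bits_per_weight(12.02, 26)
shrink = percent_smaller(12.02, 10.98)

print(f"v2: ~{v2_bpw:.2f} bpw, original: ~{v1_bpw:.2f} bpw, v2 is ~{shrink:.1f}% smaller")
```

The whole-file average sits above the ~2.44 bpw of the fused experts because the down-projections, attention, embeddings, and F32 tensors are stored at higher precision.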
Asymmetric quantization scheme (v2)
The whole point of an asymmetric quant is to spend bits where they matter. The routed experts are the bulk of the weights but each is touched by only a fraction of tokens, so they are pushed to 2-bit — but now with an importance matrix protecting the salient channels. The down-projection (more sensitive) and the dense/attention path are kept higher.
| Tensor group | Tensor name(s) | Type written | Notes |
|---|---|---|---|
| Routed experts gate+up (FUSED) | `blk.*.ffn_gate_up_exps.weight` | IQ2_S + imatrix | ~2.44 bpw; imatrix-protected (was blind Q2_K) |
| Routed experts down | `blk.*.ffn_down_exps.weight` | IQ4_NL + imatrix | ~4.29 bpw; 704-col → best valid 4-bit for this shape |
| Attention q/k/v/o + dense FFN gate/up/down | `blk.*.attn_{q,k,v,output}`, `blk.*.ffn_{gate,up,down}.weight` | Q5_K (175) / Q5_1 (30) | dense ffn_down (2112-col) → Q5_1 fallback |
| Token embeddings (tied output) | `token_embd.weight` | Q6_K | tied embeddings; no separate `output.weight` |
| Diffusion self-conditioning | `self_cond_{down,gate,up}.weight` | Q4_K (2) | non-256-divisible cols |
| Norms, scales, router (`ffn_gate_inp`) | `*_norm`, `*.scale`, `ffn_gate_inp.*` | F32 | router kept precise |
Stored-type census from the model loader: f32:423, q5_K:175, q5_1:30, iq4_nl:30, iq2_s:30, q6_K:1, q4_K:2.
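The census string above can be tallied programmatically, for example to compare a loader log against the expected counts. The following is a minimal sketch: the census text is copied from this card, while the per-layer check assumes one fused gate+up expert tensor and one down expert tensor per block across the 30 layers.

```python
# Census string copied from the model card; the parsing helper is illustrative.
CENSUS = "f32:423, q5_K:175, q5_1:30, iq4_nl:30, iq2_s:30, q6_K:1, q4_K:2"
N_LAYERS = 30  # layer count stated in the card


def parse_census(text: str) -> dict[str, int]:
    """Turn 'type:count, type:count, ...' into a {type: count} mapping."""
    counts = {}
    for item in text.split(","):
        name, count = item.strip().split(":")
        counts[name] = int(count)
    return counts


counts = parse_census(CENSUS)
total_tensors = sum(counts.values())
quantized_tensors = total_tensors - counts["f32"]

# Assumption: exactly one IQ2_S (gate+up) and one IQ4_NL (down) expert tensor per block.
experts_match_layers = counts["iq2_s"] == N_LAYERS and counts["iq4_nl"] == N_LAYERS

print(f"total={total_tensors} quantized={quantized_tensors} experts_ok={experts_match_layers}")
```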
Importance matrix — produced and applied (the key v2 win)
Unlike the original imatrix-free release, an importance matrix was produced and applied:
- `llama-imatrix` was run on the Q8_0 source over `calibration_datav3.txt` at NGL=99 using the diffusion-gemma graph: 129 chunks completed, 295 importance entries covering all 30 blocks' expert tensors at 93–99% coverage. (93–99% is the structural ceiling for a 128-expert/8-active MoE: only the experts that fired on calibration data get stats. This is expected, not a failure.)
- The quantize log confirms the imatrix was loaded and applied to every expert tensor (`have importance matrix data with 295 entries`; zero "no importance matrix" warnings).
- Honesty note: the AR-perplexity printed by `llama-imatrix` (~506) is meaningless in absolute terms for a non-AR diffusion model, but the activation statistics it collects are real and valid, which is what the quantizer consumes.
This is why IQ2_S (which requires an imatrix) is usable for the fused experts here, replacing the original's blind Q2_K. Imatrix protects the salient channels that drive expert routing.
Validation — coherence served (v2)
This file was load-and-serve validated on a CUDA diffusion-gemma-visual-server
(prism/llama.cpp build with the DIFFUSION_GEMMA forward graph, NGL=99, on-device CUDA sampler,
entropy-bound denoising). Five prompts (factual, code, explanation, list, creative) all produced
fully coherent committed answers with no word-drops — correct Fibonacci code, correct first-10
primes (2,3,5,7,11,13,17,19,23,29), accurate Rayleigh-scattering explanation, coherent multi-sentence
prose. Measured 141–270 tok/s on the substantive prompts (H100).
Against the original artifact on the identical prompts, v2 was faster on every prompt and converged in fewer denoising steps (e.g. 77 vs 100 steps on the code prompt) — fewer steps to converge is a direct confidence/quality signal for entropy-bound diffusion — at equal-or-better coherence and a smaller file.
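To illustrate why step count is a meaningful signal, here is a toy sketch of entropy-bound parallel denoising. It is not the sampler used by the diffusion-gemma server: `toy_model`, the entropy threshold, the canvas size, and the vocabulary size are all hypothetical stand-ins chosen so the example runs with the standard library alone. The idea shown is only that a position is committed once its predicted distribution falls below an entropy bound, so a more confident model fills the canvas in fewer steps.

```python
import math
import random


def entropy(probs):
    """Shannon entropy (nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def toy_model(canvas, step, vocab_size, rng):
    """Hypothetical denoiser stand-in: one distribution per canvas position.

    Confidence in the 'right' token (pos % vocab_size) grows with the step index.
    """
    dists = []
    for pos in range(len(canvas)):
        target = pos % vocab_size
        confidence = min(0.99, 0.3 + 0.1 * step + 0.2 * rng.random())
        rest = (1.0 - confidence) / (vocab_size - 1)
        dists.append([confidence if tok == target else rest for tok in range(vocab_size)])
    return dists


def denoise(canvas_length=16, vocab_size=8, max_entropy=0.5, max_steps=100, seed=0):
    """Commit every still-open position whose entropy is under the bound, in parallel."""
    rng = random.Random(seed)
    canvas = [None] * canvas_length  # None marks a not-yet-committed position
    for step in range(1, max_steps + 1):
        dists = toy_model(canvas, step, vocab_size, rng)
        for pos, probs in enumerate(dists):
            if canvas[pos] is None and entropy(probs) <= max_entropy:
                canvas[pos] = max(range(vocab_size), key=probs.__getitem__)
        if all(tok is not None for tok in canvas):
            return canvas, step
    return canvas, max_steps


tokens, steps_used = denoise()
print(f"canvas committed in {steps_used} steps: {tokens}")
```

In this toy, raising the model's confidence (or loosening `max_entropy`) lowers `steps_used`, which mirrors the reasoning behind treating fewer denoising steps as a confidence signal.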
Provenance / caveats (v2)
- Source: `unsloth/diffusiongemma-26B-A4B-it-GGUF`, Q8_0 (26.88 GB, near-lossless). `--allow-requantize` is required (source is already quantized), but Q8_0 removes the lossy Q4_K_M generation that bounded the original release.
- The GGUF metadata arch is `diffusion-gemma`; tensor names and data are unchanged. To serve this file you need a runtime that understands the diffusion-gemma tensor set (`self_cond_*`, transposed `ffn_gate_inp`, fused `ffn_gate_up_exps`), e.g. a prism/llama.cpp build with `LLM_ARCH_DIFFUSION_GEMMA` / the `diffusion-gemma-visual-server`.
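For readers who want to attempt a similar requantization, the sketch below assembles a `llama-quantize` argument list. The exact command used for this release is not given in this card, so everything here is an assumption: the file names are hypothetical, the `Q5_K` base type is a guess, and it presumes the prism fork keeps upstream llama.cpp's `--allow-requantize`, `--imatrix`, `--token-embedding-type`, and `--tensor-type` options. The script only builds and prints the command; it does not run it.

```python
import shlex


def build_quantize_cmd(src, dst, imatrix, base_type, overrides, embedding_type=None):
    """Assemble a llama-quantize argv list (assumes upstream-compatible flags)."""
    cmd = ["llama-quantize", "--allow-requantize", "--imatrix", imatrix]
    if embedding_type:
        cmd += ["--token-embedding-type", embedding_type]
    for pattern, qtype in overrides.items():
        # Each override maps a tensor-name pattern to a quantization type.
        cmd += ["--tensor-type", f"{pattern}={qtype}"]
    cmd += [src, dst, base_type]
    return cmd


# Hypothetical paths and base type; the overrides follow the table in this card.
cmd = build_quantize_cmd(
    src="diffusiongemma-26B-A4B-it-Q8_0.gguf",       # hypothetical local file name
    dst="diffusiongemma-26B-A4B-asym-2bitexp.gguf",
    imatrix="imatrix.dat",                           # hypothetical imatrix file name
    base_type="Q5_K",                                # assumed default for dense tensors
    overrides={"ffn_gate_up_exps": "iq2_s", "ffn_down_exps": "iq4_nl"},
    embedding_type="q6_K",
)

print(shlex.join(cmd))
```

Check the option names against the `--help` output of your own build before relying on them.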
Files
- `diffusiongemma-26B-A4B-asym-2bitexp.gguf`: the v2 asymmetric 2-bit-expert quant (10.98 GB).
  - sha256: `c7c66b99fbc311cfc61fb74380e037b2667db4bc79a98a284887b2b17b1d7a14`
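A downloaded copy can be checked against the published digest with a few lines of Python. This is a minimal sketch using only the standard library; the local path is an assumption about where the file was saved.

```python
import hashlib

# Digest published in this model card.
EXPECTED_SHA256 = "c7c66b99fbc311cfc61fb74380e037b2667db4bc79a98a284887b2b17b1d7a14"


def sha256_of(path, chunk_size=1 << 20):
    """Stream the file in 1 MiB chunks so a multi-GB GGUF never sits in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected=EXPECTED_SHA256):
    """Return True when the file's SHA-256 matches the expected hex digest."""
    return sha256_of(path) == expected.lower()


if __name__ == "__main__":
    import sys

    # Assumed default: the GGUF sits in the current directory.
    target = sys.argv[1] if len(sys.argv) > 1 else "diffusiongemma-26B-A4B-asym-2bitexp.gguf"
    print("OK" if verify(target) else "MISMATCH")
```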
Built with a prism (llama.cpp-derived, diffusion fork) llama-quantize + llama-imatrix, CUDA build, on an H100.
Model tree for hyperspaceai/diffusiongemma-26B-A4B-asym-2bitexp-GGUF
Base model
google/diffusiongemma-26B-A4B-it