- Quick Links
- Quick Start
- Model Specs
- 2026-06-11 — DFlash drafter fixes (vLLM PR #40898 + #41703)
- Performance — historical short-context sweep (pre-fix image, 2026-06-09)
- Stock Community vLLM Baseline (No DFlash)
- Why This Is Hard: Gemma 4 on DGX Spark
- Container Image Details
- All Fixes Included
- Related Models
- Disclaimer, Liability Waiver, and Assumption of Risk
- License
- Support the work
Quick Links
| Resource | Link |
|---|---|
| Model Weights + Full Documentation | AEON-7/Gemma-4-26B-A4B-it-Uncensored-NVFP4 on HuggingFace |
| DFlash vLLM Container (DGX Spark) | ghcr.io/aeon-7/aeon-vllm-ultimate:latest |
| DFlash Drafter | z-lab/gemma-4-26B-A4B-it-DFlash |
Quick Start
```bash
# 1. Pull the AEON vLLM Ultimate image (vLLM 0.22.1 + PR#44389 NVFP4-KV + PR#40898/#41703
#    DFlash drafter fixes, sm_121a). :latest = :2026-06-11-pr41703.
docker pull ghcr.io/aeon-7/aeon-vllm-ultimate:latest

# 2. Download the target model and DFlash drafter.
mkdir -p models
huggingface-cli download AEON-7/Gemma-4-26B-A4B-it-Uncensored-NVFP4 \
  --local-dir ./models/gemma4
huggingface-cli download z-lab/gemma-4-26B-A4B-it-DFlash \
  --local-dir ./models/gemma4-dflash

# 3. Serve: recipe for :latest (2026-06-11+): DFlash n=10, drafter flash_attn, body triton_attn.
#    (The image ENTRYPOINT is bash, so override it with --entrypoint vllm.)
docker run --gpus all --ipc host --network host \
  -e VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 \
  -e TORCH_MATMUL_PRECISION=high \
  -e PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
  -e VLLM_USE_FLASHINFER_MOE_FP4=0 \
  -e VLLM_TEST_FORCE_FP8_MARLIN=0 \
  -e VLLM_NVFP4_GEMM_BACKEND=flashinfer-cutlass \
  -e VLLM_USE_FLASHINFER_SAMPLER=1 \
  -v "$PWD/models/gemma4:/models/gemma4:ro" \
  -v "$PWD/models/gemma4-dflash:/models/gemma4-dflash:ro" \
  --entrypoint vllm \
  ghcr.io/aeon-7/aeon-vllm-ultimate:latest \
  serve /models/gemma4 \
  --served-model-name gemma4-aeon-uncensored gemma4-fast gemma4-deep \
  --host 0.0.0.0 \
  --port 8000 \
  --tensor-parallel-size 1 \
  --dtype auto \
  --quantization compressed-tensors \
  --attention-backend triton_attn \
  --max-model-len 262144 \
  --max-num-seqs 64 \
  --max-num-batched-tokens 32768 \
  --gpu-memory-utilization 0.80 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --trust-remote-code \
  --enable-auto-tool-choice \
  --tool-call-parser gemma4 \
  --reasoning-parser gemma4 \
  --speculative-config '{"method":"dflash","model":"/models/gemma4-dflash","num_speculative_tokens":10,"attention_backend":"flash_attn"}'
```
Recipe notes: the drafter backend is tied to the image tag.
- On `:latest` / `:2026-06-11-pr41703` (current): the drafter must use `attention_backend: flash_attn` (enabled for multimodal Gemma targets by PR #41703's `use_mm_prefix=False`). `flex_attention` crashes at the first request on this image (non-contiguous KV view in the new KV-sharing path). `--enable-prefix-caching` is safe and recommended; it is soak-validated with DFlash (see the fixes section below).
- On the rollback tag `:2026-06-04-pr44389` (pre-fix): the reverse. Only `flex_attention` loads (`flash_attn` fails with `partial multimodal token full attention not supported`), the drafter silently runs without sliding-window support (~0% acceptance beyond 2k-token contexts), and `--enable-prefix-caching` triggers a slow acceptance collapse (see below). Not recommended.
- Either way the body stays on `triton_attn` (Gemma's heterogeneous head dims), and do not set `--kv-cache-dtype fp8`: DFlash's non-causal drafter requires BF16 KV.
- `--reasoning-parser gemma4` cleanly splits `enable_thinking` output into `reasoning_content`.
- `num_speculative_tokens=10` is the value our production fleet runs; the drafter's trained block size is 16 and z-lab ships 15. An n=10-vs-15 re-bench on the fixed drafter is pending (the earlier "10 beats 15" result was measured on the pre-fix drafter and shouldn't be treated as settled).
- The model's `config.json` already ships the vision ignore-list fix (`re:.*embed_vision.*`, `re:.*vision_tower.*`) so vLLM keeps the vision tower BF16.
This default profile suits agentic gateways — one large full-context chat plus many short-lived subagents under the --max-num-seqs 64 cap. For short-context throughput benchmarking, use --max-model-len 32768 --max-num-seqs 256.
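Once the server is up, it can be queried through vLLM's OpenAI-compatible chat endpoint on the port set in the recipe above. The sketch below builds such a request with only the Python standard library. The helper names `build_chat_request` and `send_chat_request` are illustrative, not part of vLLM, and the code assumes the server runs on `localhost` and that `reasoning_content` may be missing from a response.

```python
import json
import urllib.request

# Assumptions: the Quick Start server is listening locally on port 8000 and
# answers to the first name passed to --served-model-name.
BASE_URL = "http://localhost:8000"
MODEL_NAME = "gemma4-aeon-uncensored"


def build_chat_request(prompt, model=MODEL_NAME, base_url=BASE_URL, max_tokens=256):
    """Return a urllib Request for the OpenAI-compatible chat endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request, timeout=120):
    """Send the request and return (answer, reasoning) from the first choice."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    message = body["choices"][0]["message"]
    # reasoning_content may be absent or None; .get() covers both cases.
    return message.get("content"), message.get("reasoning_content")
```

With a running server, `send_chat_request(build_chat_request("What is the capital of France?"))` returns the visible answer and, when the reasoning parser has split any out, the reasoning text.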
Model Specs
| Property | Value |
|---|---|
| Architecture | Gemma 4 Mixture of Experts |
| Total / Active Parameters | 26B / ~4B per token (top-8 of 128 experts) |
| Layers | 30 (25 sliding-window + 5 full-attention) |
| Max Context | 262,144 tokens |
| Quantization | NVFP4 (compressed-tensors) |
| Model Size on Disk | 15.3 GB |
| VRAM Loaded | 16.25 GB |
| Vision | 27-layer ViT (BF16) |
| Tool Calling | Native Gemma 4 format |
2026-06-11 — DFlash drafter fixes (vLLM PR #40898 + #41703)
All previously published DFlash numbers for this model were measured on a defective drafter path. We root-caused three vLLM-side defects in production and merged the upstream fixes (both PRs still open upstream) into aeon-vllm-ultimate:latest:
| Defect (pre-fix images) | Symptom | Fixed by |
|---|---|---|
| Rejected-token context-KV writes corrupted the drafter's paged KV cache; --enable-prefix-caching made the corruption persistent and self-accelerating | Draft acceptance decayed 34–56% → 0.0% over minutes-to-hours of traffic → ~6× decode slowdown (144 → 24 tok/s), healed only by restart | #41703 masks rejected/invalid context slots |
| The z-lab drafter is 4-of-5-layers sliding-window-2048, but ran as full attention | Any request with >2k tokens of context got ~0% acceptance (long chats, agent histories, big system prompts) | #40898 adds DFlash SWA support |
| Missing Gemma-4 sqrt(hidden) embed normalizer + final-logit softcap in the draft path | Depressed acceptance ceiling — mean acceptance length 4.4–6.6 vs z-lab's published 6.1–8.6 | #41703 |
Validated results on the fixed image (production profile: --gpu-memory-utilization 0.68 --max-model-len 184320 --max-num-seqs 32 --enable-prefix-caching, drafter flash_attn, n=10):
| Gate | pre-fix | 2026-06-11-pr41703 |
|---|---|---|
| Long-context (~9k system prompt) draft acceptance | ~0–7% | 43.3% / MAL 5.3 |
| Prefix-caching ON + fleet-burst + 10-min production soak | collapse to 0% in ~25 min | 52.0% / MAL 6.20 — improves under load |
| Single-stream coding (c=1, greedy) | 144 tok/s fresh-boot, decaying to ~24 | 149–150 tok/s, sustained |
| Long-context (~9k) throughput | ~46 tok/s (APC unusable) | 78 tok/s (APC accelerates the cached prefix) |
| Live production probe (30-persona voice fleet) | — | 60% acceptance / MAL 7.0 |
KV at this profile: 726k tokens / 3.94× concurrency at 180k ctx. Mean acceptance length now lands in z-lab's published range for the first time. A full per-category concurrency re-sweep on the fixed image is pending; the section below is the historical pre-fix sweep.
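The acceptance percentages and MAL figures in the table line up with a simple relation, assuming the acceptance rate is accepted draft tokens divided by proposed draft tokens and that every verification step also emits one token from the target model. The sketch below checks that relation and the KV concurrency figure; the function names are illustrative and the 726k value is the rounded number from the text.

```python
def mean_acceptance_length(acceptance_rate, num_speculative_tokens):
    """Expected tokens emitted per verification step.

    Assumes acceptance_rate = accepted draft tokens / proposed draft tokens,
    and that each step also emits one token from the target model.
    """
    return 1.0 + acceptance_rate * num_speculative_tokens


def kv_concurrency(kv_capacity_tokens, max_model_len):
    """How many full-length sequences fit in the KV cache at once."""
    return kv_capacity_tokens / max_model_len


# Rows from the validation table, all measured with n=10 draft tokens.
for label, rate in [("long-context", 0.433), ("soak", 0.520), ("live probe", 0.60)]:
    print(f"{label}: MAL ~ {mean_acceptance_length(rate, 10):.2f}")

# Rounded 726k KV tokens at --max-model-len 184320.
print(f"KV concurrency ~ {kv_concurrency(726_000, 184_320):.2f}x")
```

Under these assumptions the three rows come out at about 5.33, 6.20, and 7.00, and the concurrency at about 3.94, matching the reported figures.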
Performance — historical short-context sweep (pre-fix image, 2026-06-09)
⚠️ Measured on the pre-fix image (`:2026-06-04-pr44389`) with the flex-drafter recipe that crashes on the current `:latest`, with the broken drafter (no SWA, no embed normalizer), and on short-context prompts only. Kept as an honest historical floor; the fixed image measures higher (see above). Long-context performance on this image was far worse than the table suggests (~0% acceptance beyond 2k-token contexts).
Per-category single-stream + concurrency sweep (c=1…128, fresh server per category, 0 request errors), DFlash n=10, BF16 KV:
| Category | c=1 tok/s | Peak aggregate tok/s | vs prior DFlash v2 (n=15) |
|---|---|---|---|
| Coding | 144.2 | 1,724 @ c=128 | +53% c1 / +51% peak |
| Math | 99.6 | 1,335 @ c=128 | +35% / +34% |
| Reasoning | 73.6 | 1,087 @ c=128 | +21% / +24% |
| Extraction / JSON | 158.4 | 1,315 @ c=32 | +2% c1 |
| Natural language | 53.8 | 684 @ c=128 | +5% peak |
| Prose | 36.5 | 486 @ c=128 | — |
This pre-fix sweep showed n=10 beating z-lab's default 15 by 20–50% on the reasoning-heavy categories — but both arms ran the broken drafter, so treat the n=10-vs-15 conclusion as unsettled pending the re-bench on the fixed image (drafter block size is 16).
For maximum many-request aggregate throughput (very high concurrency, c≥128), serving without the drafter measured best on the pre-fix image (~3,000–3,700 tok/s at c=256 via the stock path below). Note the "drafter destabilizes at c=256" behavior observed then is plausibly the now-fixed KV-corruption defect — the high-concurrency crossover needs re-measuring on the fixed image before treating it as current guidance.
Stock Community vLLM Baseline (No DFlash)
Benchmarked with the official community image vllm/vllm-openai:latest pulled on 2026-05-06 (vLLM 0.20.1, PyTorch 2.11.0+cu130, transformers 5.7.0, image digest sha256:9eff9734a30b6713a8566217d36f8277630fd2d31cec7f0a0292835901a23aa4). This run used the same model weights, 32K context, --max-num-batched-tokens 32768, and --max-num-seqs 256, but no DFlash drafter and no AEON container env overrides. Upstream vLLM now boots this model on GB10 with FlashInfer CUTLASS NVFP4 linear kernels and VLLM CUTLASS MoE.
Full sweep: 6 natural prompt categories x 8 concurrency levels (1, 4, 8, 16, 32, 64, 128, 256) = 48 benchmark points, 0 request errors.
| Category | c=1 tok/s | c=1 TTFT p50 | Peak aggregate tok/s | c=256 aggregate tok/s | c=256 TTFT p50 |
|---|---|---|---|---|---|
| Coding | 49.12 | 130.7 ms | 3,356.61 @ c=256 | 3,356.61 | 542 ms |
| Math | 48.79 | 134.0 ms | 3,006.60 @ c=256 | 3,006.60 | 1,078 ms |
| Reasoning | 48.90 | 113.8 ms | 3,241.42 @ c=256 | 3,241.42 | 274 ms |
| Prose | 48.86 | 115.9 ms | 3,222.85 @ c=256 | 3,222.85 | 662 ms |
| Natural language | 49.38 | 72.4 ms | 3,418.94 @ c=256 | 3,418.94 | 650 ms |
| Extraction / JSON | 47.34 | 120.6 ms | 3,674.70 @ c=256 | 3,674.70 | 385 ms |
Use the stock community path when raw many-request aggregate throughput matters more than speculative single-stream speed. Use the DFlash image when you want the lower interactive TPOT and the integrated Gemma 4 DFlash serving recipe.
Why This Is Hard: Gemma 4 on DGX Spark
Running Gemma 4 NVFP4 on a DGX Spark used to require a source-built stack. As of the 2026-05-06 community vllm/vllm-openai:latest image, upstream vLLM can boot this model on GB10, and AEON's aeon-vllm-ultimate image packages the optimized (and now corrected — see the fixes section above) DFlash path as a single pull-and-run container. Every layer of the stack, from the silicon to the serving framework to the model weights themselves, has had compatibility gaps worth understanding.
The DGX Spark Problem
The NVIDIA DGX Spark ships with a GB10 Grace Blackwell chip: SM 12.1 on ARM64 (aarch64). This is bleeding-edge silicon that much of the ML ecosystem is still catching up to:
- Python wheels remain risky on SM 12.1. Official PyPI releases have historically targeted SM 8.0/8.9/9.0 (Ampere/Ada/Hopper). Running `pip install vllm` can give you CUDA kernels compiled for the wrong GPU; use a tested Docker image or build from source.
- No pre-built FlashInfer wheels for SM 12.1. FlashInfer provides the fused MoE dispatch kernels that make expert routing fast. Without it compiled for your architecture, MoE models can't use the optimized CUTLASS/Triton backends.
- ARM64 architecture means many x86-only prebuilt binaries don't run at all. Even when packages claim CUDA support, the host-side code is often x86-compiled.
- 273 GB/s memory bandwidth: fast for a desktop-class device, but a fraction of what data center GPUs offer (H100: 3.35 TB/s, A100: 2 TB/s). This makes model architecture choice critical: dense models that need to read all parameters every token are bandwidth-starved here.
The practical result: current stock vLLM can serve this model, but high-confidence production recipes still need to pin image versions, model format, attention backend, KV dtype, and concurrency settings instead of assuming any vLLM tag will behave the same way.
The Gemma 4 Problem
Gemma 4 is not just a new model. It is architecturally unusual in ways that break assumptions in existing tooling:
1. Requires transformers v5+ (nothing else does yet)
Gemma 4 was the first major model to require the transformers v5 major version bump. Older stock vLLM images shipped with v4.x and failed to parse the Gemma 4 config. Current community images may include transformers v5, but pin the version because v4/v5 API differences can still break model loading.
2. Heterogeneous attention head dimensions
Most models have uniform head dimensions across all layers. Gemma 4 has head_dim=256 for sliding-window layers and global_head_dim=512 for full-attention layers. This breaks attention backends that assume a single head dimension. vLLM forces the TRITON_ATTN backend specifically for Gemma 4 to handle this — other backends (FlashAttention, FlashInfer attention) produce numerical divergence or crash.
3. Hybrid sliding-window + full-attention layers
Of the 30 layers, 25 use a sliding window of 1024 tokens and 5 use full global attention; all 30 carry MoE blocks (128 experts, top-8 — checkpoint-verified: router + expert tensors in every layer). The two attention types have different compute patterns and KV cache requirements — interleaved through the stack.
4. Massive MoE expert count
128 experts per layer with top-8 routing across all 30 layers. That's 128 x 30 = 3,840 expert weight blocks, each with 4 NVFP4 tensors (weight_packed, weight_scale, weight_global_scale, input_global_scale). The total tensor count in this model is 47,648. Loading and routing these correctly requires FusedMoE kernels that can handle the stacked expert format, and the compressed-tensors naming convention doesn't match what vLLM expects (see below).
The NVFP4 Quantization Problem
NVFP4 (NVIDIA 4-bit floating point — E2M1 elements with block scales) is how we get a 26B-parameter model into 15.3 GB. But there are two completely different NVFP4 formats in the ecosystem, and they are not compatible:
ModelOpt NVFP4 (NVIDIA's TensorRT-LLM toolchain): Stores weights as weight, weight_scale_inverse, input_scale. This is what NVIDIA's own tools produce and what most vLLM NVFP4 code paths expect.
Compressed-tensors NVFP4 (llmcompressor/vLLM community): Stores weights as weight_packed, weight_scale, weight_global_scale, input_global_scale. Different tensor names, different scale conventions, different packing format.
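The two layouts can be told apart from tensor names alone. The following sketch classifies a list of names by the suffixes given in the two descriptions above. `detect_nvfp4_format` is a hypothetical helper, not part of vLLM or llmcompressor, and it only looks at the suffixes that are unambiguous between the two formats.

```python
# Suffixes taken from the two format descriptions above. "weight" and
# "weight_scale" are omitted so the check relies on unambiguous names only.
COMPRESSED_TENSORS_SUFFIXES = ("weight_packed", "weight_global_scale", "input_global_scale")
MODELOPT_SUFFIXES = ("weight_scale_inverse", "input_scale")


def _has_suffix(names, suffixes):
    """True if any tensor name ends with '.<suffix>' for one of the suffixes."""
    return any(name.endswith("." + suffix) for name in names for suffix in suffixes)


def detect_nvfp4_format(tensor_names):
    """Classify a checkpoint as compressed-tensors, modelopt, mixed, or unknown."""
    names = list(tensor_names)
    is_compressed_tensors = _has_suffix(names, COMPRESSED_TENSORS_SUFFIXES)
    is_modelopt = _has_suffix(names, MODELOPT_SUFFIXES)
    if is_compressed_tensors and is_modelopt:
        return "mixed"
    if is_compressed_tensors:
        return "compressed-tensors"
    if is_modelopt:
        return "modelopt"
    return "unknown"
```

In practice the names would come from the checkpoint's safetensors index or from iterating the keys of each shard; the helper itself only needs an iterable of strings.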
This model uses compressed-tensors format (quantized with llmcompressor on an H200). vLLM's Gemma 4 weight loader has hard-coded assumptions about tensor naming that don't match. Specifically:
- Expert path mismatch: Compressed-tensors names MoE experts as `layers.X.experts.{id}.{proj}.weight_packed`. vLLM's FusedMoE expects `layers.X.moe.experts.{id}.{proj}.weight_packed` (note the `.moe.` segment). Without patching, every single expert tensor fails to load with a KeyError.
- Suffix format mismatch: The weight loader constructs names like `w2_weight.weight_packed` when it should be `w2_weight_packed`. The `_weight.` needs to be collapsed to `_`.
- Dimension assertion failure: The original code asserts `dim == 2` for weight tensors, but NVFP4 packed tensors have different dimensionality due to the 4-bit packing.
The included gemma4_patched.py fixes all three issues with targeted patches to the weight loading pipeline.
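To make the first two mismatches concrete, here is a minimal sketch of the renaming they imply. It is not the contents of `gemma4_patched.py`: `remap_expert_name` and `collapse_weight_suffix` are hypothetical helpers, and the patterns assume names shaped exactly like the examples in the list above. The dimension assertion is not covered here.

```python
import re

# Matches "layers.<N>.experts." so ".moe." can be inserted before "experts".
_EXPERT_PATH = re.compile(r"(layers\.\d+\.)experts\.")


def remap_expert_name(name):
    """Rewrite layers.X.experts.* to layers.X.moe.experts.*.

    Names that already contain the .moe. segment do not match and pass through.
    """
    return _EXPERT_PATH.sub(r"\1moe.experts.", name)


def collapse_weight_suffix(name):
    """Collapse '_weight.' to '_', e.g. w2_weight.weight_packed -> w2_weight_packed."""
    return name.replace("_weight.", "_")


print(remap_expert_name("model.layers.3.experts.17.down_proj.weight_packed"))
print(collapse_weight_suffix("w2_weight.weight_packed"))
```

Keeping the two rewrites as separate pure functions makes each one easy to test against a list of real tensor names before touching the loader.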
The Accidental Quantization Problem
When quantizing with llmcompressor, you specify ignore patterns for layers that should stay in BF16 (full precision). The original quantization used patterns like re:.*visual.* and re:.*gate.* to skip vision and routing layers. But Gemma 4's naming conventions didn't match:
| Layer | Expected Pattern | Actual Name in Gemma 4 | Result |
|---|---|---|---|
| Vision tower | `re:.*visual.*` | `model.vision_tower.*` | Quantized (wrong) |
| Vision embedding | `re:.*visual.*` | `model.embed_vision.*` | Quantized (wrong) |
| MoE routers | `re:.*gate.*` | `model.*.router.proj.*` | Quantized (wrong) |
Quantizing these layers breaks the model:
- Vision tower: NVFP4 weights crash at load because vLLM allocates standard `Linear` layers (expects a `.weight` tensor, gets `weight_packed`/`weight_scale`/etc.).
- MoE routers: NVFP4 corrupts expert routing. The router decides which experts to activate for each token, and 4-bit precision on routing logits causes degenerate expert selection.
- Vision embedding projection: this layer bridges the ViT output to the language model, so quantization here cascades errors through every subsequent layer.
We fixed this by extracting the original BF16 weights from the base model (TrevorJS/gemma-4-26B-A4B-it-uncensored) and replacing the incorrectly quantized tensors in the safetensors file:
- 760 NVFP4 tensors removed from the vision tower, replaced with 190 original BF16 weights (355 total vision tensors including biases and layernorms)
- 120 NVFP4 tensors removed from router.proj layers, replaced with 30 BF16 weights
- 4 NVFP4 tensors removed from embed_vision, replaced with 1 BF16 weight
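The pattern mismatch is easy to reproduce offline. The sketch below applies ignore patterns to a few module names. The names are illustrative examples shaped like the table entries, `is_ignored` is a hypothetical helper, the `re:.*router.*` pattern is an assumed fix for the router case, and the code assumes a `re:` pattern is applied with `re.match` against the full module name.

```python
import re

# Patterns from the original quantization run.
ORIGINAL_PATTERNS = ["re:.*visual.*", "re:.*gate.*"]
# The two vision patterns are the ones this release ships; the router pattern
# is a hypothetical addition for illustration.
FIXED_PATTERNS = ["re:.*vision_tower.*", "re:.*embed_vision.*", "re:.*router.*"]

# Illustrative module names, not copied from the checkpoint.
EXAMPLE_MODULES = [
    "model.vision_tower.encoder.layers.0.self_attn.q_proj",
    "model.embed_vision.embedding_projection",
    "model.language_model.layers.0.router.proj",
]


def is_ignored(module_name, patterns):
    """True if any pattern matches: 're:' entries are regexes, others exact names."""
    for pattern in patterns:
        if pattern.startswith("re:"):
            if re.match(pattern[3:], module_name):
                return True
        elif pattern == module_name:
            return True
    return False


for name in EXAMPLE_MODULES:
    print(name, is_ignored(name, ORIGINAL_PATTERNS), is_ignored(name, FIXED_PATTERNS))
```

Running a check like this over the real module list before quantizing would surface silently unmatched patterns.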
The Token Leakage Problem
Gemma 4 uses internal control tokens for multi-channel generation (thinking, tool calls, output). These tokens have specific IDs in the vocabulary:
| Token ID | Token | Purpose |
|---|---|---|
| 100 | `<\|channel>` | Start internal channel (e.g., thinking) |
| 101 | `<channel\|>` | End internal channel |
| 98 | `<\|think\|>` | Enter thinking mode |
| 48 | `<\|tool_call>` | Start tool call |
| 49 | `<tool_call\|>` | End tool call |
Without proper EOS configuration, the model can enter its "thinking" channel mid-generation, and those internal tokens stream through as plaintext in the API response. Worse, it can get stuck in a repetition loop — endlessly generating <|channel>thought<channel|>call:process{...} as visible text. This manifests as the model appearing to "spam" garbage in the chat.
The fix is adding tokens 98, 100, and 101 to the eos_token_id list in generation_config.json, so vLLM terminates generation cleanly before any internal channel tokens leak into the output.
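A minimal sketch of that edit, assuming `generation_config.json` stores `eos_token_id` as either a single integer or a list. `add_eos_token_ids` is an illustrative helper, not a script shipped with this release.

```python
import json
from pathlib import Path

# Token IDs from the table above: <|think|>, <|channel>, <channel|>.
LEAKY_TOKEN_IDS = (98, 100, 101)


def add_eos_token_ids(config_path, extra_ids=LEAKY_TOKEN_IDS):
    """Append extra_ids to eos_token_id in place, keeping order and skipping duplicates."""
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))

    current = config.get("eos_token_id", [])
    if isinstance(current, int):
        # A single integer is promoted to a one-element list.
        current = [current]

    merged = list(current)
    for token_id in extra_ids:
        if token_id not in merged:
            merged.append(token_id)

    config["eos_token_id"] = merged
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return merged
```

If the file starts with `[1, 106, 50]`, the call returns `[1, 106, 50, 98, 100, 101]`, and running it a second time leaves the list unchanged.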
What's In The Container (The Special Sauce)
The current image is ghcr.io/aeon-7/aeon-vllm-ultimate:latest (= :2026-06-11-pr41703). Users pull one image; no local patching or source build is required.
| Component | What It Is | Why It Matters |
|---|---|---|
| vLLM 0.22.1-lineage + PR #44389 | main@2026-06-05 + Triton NVFP4 KV cache cherry-pick | Up to 3× KV capacity opt-in (--kv-cache-dtype nvfp4; causal speculators only). |
| PR #40898 + #41703 overlay | DFlash drafter fixes merged ahead of upstream | Rejected-slot KV-write masking (no more acceptance collapse under prefix caching), drafter sliding-window support, Gemma-4 embed normalizer + logit softcap, flash_attn drafter on multimodal Gemma targets. |
| AEON sm_121a patches | 3 idempotent runtime patches | GB10 correctness (optional-import lazy binding, hybrid block-size None guards, CUDA-graph capture-size alignment for spec decode). |
| PyTorch 2.11.0 + CUDA 13 | Framework + runtime | SM 12.1 (GB10) support. |
| transformers 5.10 dev | Model config/tokenizer loading | Recognizes gemma4_unified and other bleeding-edge classes. |
| DFlash drafter | z-lab/gemma-4-26B-A4B-it-DFlash (mount separately) | 5-layer SWA-2048 block-diffusion drafter, now actually run with sliding windows. |
| Native FP4 CUTLASS kernels | FlashInfer CUTLASS for linear layers, VLLM CUTLASS for MoE | Do not force Marlin on this image; the native FP4 path is faster on GB10. |
| TRITON_ATTN body / FLASH_ATTN drafter | Attention computation | Triton handles Gemma 4's heterogeneous head dims (256/512); the drafter's non-causal attention runs on FlashAttention (required on this image). |
| torch.compile + CUDA graphs | Graph capture and kernel fusion | Captures decode graphs for the configured batch sizes, reducing Python overhead on the decode hot path. |
Why MoE Makes This Possible
The fundamental constraint on DGX Spark is memory bandwidth: 273 GB/s. During autoregressive decode, the GPU must read the model weights for every single token generated. This is what determines tok/s:
tok/s = memory_bandwidth / bytes_read_per_token
For a dense 27B model at NVFP4 (~13.5 GB weights):
273 GB/s / 13.5 GB = ~20 tok/s (theoretical max, before KV cache and overhead)
For this MoE model (top-8 of 128 experts, ~2.8 GB active per token):
273 GB/s / 2.8 GB = ~97 tok/s (theoretical max)
We achieve ~37–158 tok/s single-stream depending on category: coding sustains 149–150 tok/s measured on the fixed image, while prose (~37) and extraction/JSON (~158) come from the pre-fix short-context sweep (a floor; fixed-image re-sweep pending). Aggregate throughput passes 1,700 tok/s on coding (pre-fix sweep) and reaches ~3,700 tok/s on extraction via the stock no-drafter path. Categories that land below the theoretical limit lose time to KV cache reads, attention computation, router overhead, drafter verification, and memory access patterns. Categories that exceed it can do so because speculative decoding verifies several drafted tokens per target forward pass, so a single pass can yield more than one output token. The key insight is that MoE turns a bandwidth-impossible problem (dense 27B) into a bandwidth-comfortable one.
| Model Type | Params Read/Token | Max tok/s on GB10 | Practical tok/s |
|---|---|---|---|
| Dense 27B BF16 | ~54 GB | 5 | Not viable |
| Dense 27B NVFP4 | ~13.5 GB | 20 | ~15 |
| MoE 26B top-8/128 NVFP4 + DFlash | ~2.8 GB + drafter | 97 | 149–150 coding (fixed image); ~54 natural chat / ~158 extraction (pre-fix sweep); 1.7K+ aggregate |
This is why architecture choice matters more than raw parameter count on bandwidth-limited hardware. A 26B MoE model at NVFP4 is faster than a dense 7B at BF16 on the same hardware.
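The ceilings in the table follow directly from the bandwidth formula. A small sketch that reproduces them, using the 273 GB/s figure and the per-token read sizes quoted in this section; the 14 GB entry for a dense 7B BF16 model is an added assumption of 2 bytes per parameter, and the function name is illustrative.

```python
GB10_BANDWIDTH_GB_S = 273.0


def decode_ceiling_tok_s(gb_read_per_token, bandwidth_gb_s=GB10_BANDWIDTH_GB_S):
    """Upper bound on single-stream decode speed when weight reads dominate.

    Ignores KV cache traffic, compute time, and speculative decoding.
    """
    return bandwidth_gb_s / gb_read_per_token


# Weight bytes read per generated token, in GB.
MODELS = {
    "Dense 27B BF16": 54.0,
    "Dense 27B NVFP4": 13.5,
    "MoE 26B top-8/128 NVFP4": 2.8,
    "Dense 7B BF16 (assumed 2 bytes/param)": 14.0,
}

for name, gb_per_token in MODELS.items():
    print(f"{name}: {decode_ceiling_tok_s(gb_per_token):.1f} tok/s ceiling")
```

The MoE ceiling is roughly five times the dense 7B BF16 ceiling, which is the comparison made in the paragraph above.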
Container Image Details
AEON vLLM Ultimate (recommended)
ghcr.io/aeon-7/aeon-vllm-ultimate:latest (= :2026-06-11-pr41703) — repo + full card
| Component | Version |
|---|---|
| vLLM | 0.22.1-lineage (main@2026-06-05) + PR #44389 NVFP4-KV + PR #40898/#41703 DFlash fixes |
| PyTorch | 2.11.0+cu130 |
| transformers | 5.10.0.dev (HEAD) |
| flashinfer | 0.6.8.post1 |
| DFlash drafter | mount z-lab/gemma-4-26B-A4B-it-DFlash, drafter backend flash_attn |
| Target GPU | NVIDIA GB10 (DGX Spark, sm_121a); also RTX 50-series (sm_120) |
Pinned tags: :2026-06-11-pr41703 (current :latest), :2026-06-04-pr44389 (pre-fix rollback — flex drafter only, prefix caching NOT safe with DFlash). The older aeon-gemma-4-26b-a4b-dflash:v2 (vLLM 0.20.1) lineage is superseded and kept only for historical comparison.
Stock Community Baseline Image
vllm/vllm-openai:latest@sha256:9eff9734a30b6713a8566217d36f8277630fd2d31cec7f0a0292835901a23aa4
| Component | Version |
|---|---|
| vLLM | 0.20.1 |
| PyTorch | 2.11.0+cu130 |
| transformers | 5.7.0 |
| Speculative decoding | None |
This image is useful as a current upstream reference point. It is not the AEON DFlash package and does not include the Gemma 4 DFlash drafter path.
All Fixes Included
This model required several post-quantization fixes to work correctly with vLLM. All fixes are baked into the HuggingFace release — no additional debugging needed:
- De-quantized 760 vision tower tensors (27 ViT layers), 120 router tensors (30 MoE layers), and 4 embedding projection tensors, all restored from original BF16 weights
- Patched vLLM weight loader for compressed-tensors NVFP4 MoE format (`gemma4_patched.py`: 3 targeted patches to `_weight_iterator` and `load_weights`)
- Added `audio_config` and `num_experts_per_tok` to `config.json` (vLLM config parser requirements)
- Created `preprocessor_config.json` and `processor_config.json` for multimodal support
- Configured EOS token IDs [1, 106, 50, 98, 100, 101] to prevent thinking/channel token leakage
Full technical details: HuggingFace Model Card
Related Models
| Model | Type | Size | tok/s (DGX Spark) | Links |
|---|---|---|---|---|
| This model (Gemma 4 26B MoE + DFlash) | MoE NVFP4 | 15.3 GB | 149–150 c=1 coding sustained (fixed image) / 1,724 aggregate coding (pre-fix sweep; re-sweep pending) / ~3,700 aggregate extraction (stock, no-drafter) | HuggingFace |
| Gemma 4 31B DECKARD AWQ_FULL | Dense NVFP4 | 20.5 GB | ~12-14 | HuggingFace, GitHub |
| Gemma 4 31B DECKARD SVDQuant | Dense NVFP4 | 20.9 GB | ~10-13 | HuggingFace |
| Qwen3.5-27B Uncensored | Dense NVFP4 | ~15 GB | ~15-18 | HuggingFace |
MoE vs Dense: The MoE model is 3-4x faster than dense models because it only reads ~4B parameters per token (top-8 of 128 experts) vs 27-31B for dense models. Choose MoE for speed and concurrency, dense for maximum quality.
Disclaimer, Liability Waiver, and Assumption of Risk
THIS IS AN UNCENSORED MODEL. By downloading, accessing, or using this model, the associated container image (ghcr.io/aeon-7/aeon-vllm-ultimate), or any derivative works thereof, you expressly acknowledge and agree to the following:
Assumption of Risk
Uncensored language models present materially elevated risks compared to safety-aligned models, including but not limited to: generation of harmful, misleading, illegal, or objectionable content; susceptibility to adversarial misuse; potential for facilitating activities that violate applicable laws or regulations; and amplified risk in automated or agentic pipelines where outputs may be executed without human review.
These tools are powerful and serve a multitude of legitimate and essential purposes — including security research, red-teaming, content analysis, creative work, and applications where safety filters interfere with valid use cases. However, the absence of safety guardrails demands a correspondingly higher standard of care from the operator. You must implement your own safeguards, content filtering, access controls, and monitoring appropriate to your use case and jurisdiction.
Limitation of Liability
The authors, contributors, and distributors of this model and container image ("Providers") are not responsible or liable, directly or indirectly, for any actions taken, content generated, damages incurred, or legal consequences arising from the use or misuse of these materials. This includes, without limitation:
- Any harmful, illegal, unethical, or objectionable outputs produced by the model
- Any decisions made or actions taken based on model outputs
- Any damages — direct, indirect, incidental, consequential, special, or exemplary — arising from the use of the model or container, regardless of whether the Providers were advised of the possibility of such damages
- Any violation of local, state, national, or international laws or regulations by the user
User Responsibility
You, the user, assume full and sole responsibility and liability for:
- All outputs generated by the model under your operation
- Ensuring your use complies with all applicable laws, regulations, and ethical standards in your jurisdiction
- Implementing appropriate access controls, content filtering, and human oversight
- Any consequences of deploying this model in production, automated, or public-facing systems
- Evaluating whether an uncensored model is appropriate for your specific use case
Acceptance
By downloading or using any component of this release — including the model weights, container image, configuration files, or patched code — you indicate your acceptance of these terms and your assumption of all associated risks and liabilities. If you do not agree to these terms, do not download or use these materials.
License
This model inherits the Gemma license from Google.
Support the work
If this release has been useful, tips are deeply appreciated. They go directly toward more compute, more models, and more open releases.
- Bitcoin (BTC): bc1q09xmzn00q4z3c5raene0f3pzn9d9pvawfm0py4
- Ethereum (ETH): 0x1512667F6D61454ad531d2E45C0a5d1fd82D0500
- Solana (SOL): DgQsjHdAnT5PNLQTNpJdpLS3tYGpVcsHQCkpoiAKsw8t
- Monero (XMR): 836XrSKw4R76vNi3QPJ5Fa9ugcyvE2cWmKSPv3AhpTNNKvqP8v5ba9JRL4Vh7UnFNjDz3E2GXZDVVenu3rkZaNdUFhjAvgd
Ethereum L2s (Base, Arbitrum, Optimism, Polygon, etc.) and EVM-compatible tokens can be sent to the same Ethereum address.
Model tree for AEON-7/Gemma-4-26B-A4B-it-Uncensored-NVFP4
Base model
google/gemma-4-26B-A4B


