How to use from the Transformers library
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-Multimodal-NVFP4-MTP")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForMultimodalLM

processor = AutoProcessor.from_pretrained("AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-Multimodal-NVFP4-MTP")
model = AutoModelForMultimodalLM.from_pretrained("AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-Multimodal-NVFP4-MTP")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
Qwen3.6-27B-AEON-Ultimate-Uncensored-Multimodal-NVFP4-MTP

Validated on the unified AEON vLLM Ultimate image ghcr.io/aeon-7/aeon-vllm-ultimate:latest (vLLM 0.23.0; currently identical to tag :2026-06-18-v0.23.0-dflashfix; rollback tag :2026-06-11-pr41703). The model loads and serves cleanly with the z-lab DFlash drafter at n=12. Latest v0.23.0 DGX Spark bench: ~36 tok/s single-stream, ~274 tok/s aggregate at c=64, ~38% DFlash acceptance (holds ~41% at long context). This is the recommended container base. Full numbers are in Performance below.

Deployment, operations & benchmarks → github.com/AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-DFlash

The GitHub repo is the source of truth for the production deployment guide, hardware-tuned docker-compose configs, full configuration reference, measured benchmarks, and AGENTS.md — an operator's manual that pre-empts common stale-documentation traps.

🙏 Reference recipe credit: The modelopt + MTP graft pipeline used to build this variant is based on sakamakismile's validated Qwen3.6-27B-NVFP4-MTP series (22K+ downloads). They worked out the modelopt config, the per-projection quantization choices, and the MTP-head graft technique on the un-abliterated base; we adapted the same recipe to AEON-Ultimate's abliterated weights. The reference benchmark numbers cited below are theirs. Full credit for the recipe → sakamakismile.

🆕 AEON vLLM Ultimate container

ghcr.io/aeon-7/aeon-vllm-ultimate:latest (vLLM 0.23.0; currently identical to tag :2026-06-18-v0.23.0-dflashfix; rollback tag :2026-06-11-pr41703) bundles vLLM 0.23.0 built from source for sm_121a, the Triton NVFP4 KV cache (~3× capacity), the DFlash high-concurrency fix, TurboQuant K8V4, and the AEON sm_121a patches. This is the canonical container for all Qwen3.6-27B repos, benchmarked end-to-end on DGX Spark / GB10 under v0.23.0 (see Performance below).

This variant uses the modelopt NVFP4 format, the qwen3_5_mtp native head, and the hybrid GDN+attention stack. Serve it with --quantization modelopt plus one of two speculative-decoding methods:

  • --speculative-config '{"method":"qwen3_5_mtp","num_speculative_tokens":3}' for native MTP, the recommended method on the dedicated-VRAM Blackwell hardware this variant targets.
  • A DFlash drafter, recommended on Spark (see container README Recipe A; the v0.23.0 Spark numbers below use DFlash@12).

The image ENTRYPOINT is /bin/bash, so docker run must pass --entrypoint vllm and then serve ... (not IMAGE vllm serve, which runs bash vllm serve and fails). DFlash needs BF16 KV — leave --kv-cache-dtype unset (it defaults to BF16); do not set fp8/nvfp4. Full setup + bench comparison: container README.
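When the launch is scripted, the ENTRYPOINT rule above comes down to argument order. The following is a minimal Python sketch of assembling the docker run argument list. The helper name build_docker_argv, the host path /srv/aeon-model, and the reduced flag set are illustrative assumptions, not part of the container's tooling.

```python
import shlex

IMAGE = "ghcr.io/aeon-7/aeon-vllm-ultimate:latest"

def build_docker_argv(model_dir, serve_args):
    """Return a `docker run` argv that overrides the /bin/bash ENTRYPOINT.

    `--entrypoint vllm` must come before the image name; everything after
    the image name is passed to vllm, so it starts with `serve`.
    """
    return [
        "docker", "run", "--gpus", "all", "--ipc", "host", "--network", "host",
        "-v", f"{model_dir}:/model:ro",
        "--entrypoint", "vllm",   # replaces the image's /bin/bash ENTRYPOINT
        IMAGE,
        "serve", "/model",        # arguments handed to vllm, not to bash
        *serve_args,
    ]

# /srv/aeon-model is a hypothetical host path used only for illustration.
# --kv-cache-dtype is deliberately not passed, so the KV cache stays BF16.
argv = build_docker_argv("/srv/aeon-model", ["--quantization", "modelopt"])
print(shlex.join(argv))
```

Building the command as a list rather than a single string keeps the JSON value of --speculative-config intact when it is later appended, since no shell quoting is involved.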

Why the new image matters for long-context DFlash: the z-lab Qwen3.6-27B DFlash drafter is a sliding-window model — 4 of its 5 layers use sliding-window attention (window 2048). vLLM PR #40898 (in aeon-vllm-ultimate:latest) runs those layers as proper SWA; earlier images ran them as full attention, so drafting collapsed once context grew past ~2048 tokens. PR #41703 additionally makes --enable-prefix-caching corruption-immune with DFlash. Net: long-context drafting holds up; short-context (<2048, one window) is unchanged.
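To make the "one window" remark concrete, here is a small sketch comparing a full causal mask with a causal sliding-window mask. It is a toy illustration in plain Python, not vLLM's kernel code, and the exact window boundary convention (whether the current position counts toward the 2048) is an assumption.

```python
def causal_row(i, seq_len):
    """Full causal attention: query position i may attend to every key j <= i."""
    return [j <= i for j in range(seq_len)]

def sliding_window_row(i, seq_len, window):
    """Causal sliding-window attention: query i sees only the last `window`
    keys, i.e. those with i - window < j <= i."""
    return [i - window < j <= i for j in range(seq_len)]

WINDOW = 2048  # drafter window size quoted above

# Within a single window the two masks are identical for every query position.
short_len = 512
same_when_short = all(
    causal_row(i, short_len) == sliding_window_row(i, short_len, WINDOW)
    for i in range(short_len)
)

# Past the window, SWA drops the oldest keys that full attention still sees.
long_len = WINDOW + 4
keys_seen_full = sum(causal_row(long_len - 1, long_len))
keys_seen_swa = sum(sliding_window_row(long_len - 1, long_len, WINDOW))
print(same_when_short, keys_seen_full, keys_seen_swa)
```

The two masks only diverge once the sequence is longer than the window, which is why running the sliding-window layers as full attention went unnoticed on short prompts.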

Quickstart

Complete copy-paste recipe — pull the container, pull this model, pull the DFlash drafter (fresh), then serve with the validated flags. The image ENTRYPOINT is /bin/bash, so docker run overrides it with --entrypoint vllm. DFlash needs BF16 KV — leave --kv-cache-dtype unset.

# 1. Pull the AEON vLLM Ultimate container (vLLM 0.23.0 sm_121a from-source + PR #44389 NVFP4-KV
#    + PR #40898/#41703 DFlash fixes + DFlash high-concurrency fix).
#    :latest = :2026-06-18-v0.23.0-dflashfix; rollback :2026-06-11-pr41703.
docker pull ghcr.io/aeon-7/aeon-vllm-ultimate:latest

# 2. Download this model (fresh).
huggingface-cli download AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-Multimodal-NVFP4-MTP \
  --local-dir ./aeon-model

# 3. Download the z-lab DFlash drafter (fresh — pull every time).
huggingface-cli download z-lab/Qwen3.6-27B-DFlash \
  --local-dir ./aeon-drafter

# 4. Serve — DFlash@12 on the NVFP4 (modelopt) body, vision tower preserved.
docker run --gpus all --ipc host --network host \
  -e VLLM_NVFP4_GEMM_BACKEND=flashinfer-cutlass \
  -e VLLM_USE_FLASHINFER_MOE_FP4=0 \
  -e VLLM_USE_FLASHINFER_SAMPLER=1 \
  -v "$PWD/aeon-model":/model:ro \
  -v "$PWD/aeon-drafter":/drafter:ro \
  --entrypoint vllm \
  ghcr.io/aeon-7/aeon-vllm-ultimate:latest \
  serve /model \
    --quantization modelopt \
    --trust-remote-code \
    --mamba-cache-dtype float16 \
    --mamba-block-size 256 \
    --max-model-len 262144 \
    --max-num-seqs 32 \
    --max-num-batched-tokens 32768 \
    --gpu-memory-utilization 0.85 \
    --enable-chunked-prefill \
    --enable-prefix-caching \
    --reasoning-parser qwen3 \
    --tool-call-parser qwen3_coder \
    --enable-auto-tool-choice \
    --limit-mm-per-prompt '{"image":4,"video":2}' \
    --mm-encoder-tp-mode data \
    --speculative-config '{"method":"dflash","model":"/drafter","num_speculative_tokens":12}'

num_speculative_tokens=12 is the validated DFlash setting for this NVFP4 body. On dedicated-VRAM Blackwell you can swap to the model's native grafted MTP head with --speculative-config '{"method":"qwen3_5_mtp","num_speculative_tokens":3}' (see hardware routing). Full flag reference, env vars, and BF16 / dedicated-GPU examples are in Usage below; deployment & compose configs live in the GitHub repo.

Variants

| Format | Size | Use case |
|---|---|---|
| BF16 | 51 GB | Full-precision reference weights (A100/H100 80 GB, RTX PRO 6000 96 GB, multi-GPU, fine-tuning) |
| NVFP4 (compressed-tensors + DFlash) | 26 GB | DGX Spark / GB10: production validated with DFlash speculative decoding. Unified ghcr.io/aeon-7/aeon-vllm-ultimate:latest container. |
| Multimodal-NVFP4-MTP (this repo) | 27 GB | High-bandwidth dedicated GPUs (RTX 5090, RTX PRO 6000, B100/B200) with MTP speculative decoding via the model's native mtp.* head. modelopt format, --quantization modelopt. Vision tower preserved. |
| Text-NVFP4-MTP | 20 GB | Same as this repo but with the vision tower stripped. Smaller footprint for text-only deployments on tighter VRAM. |

What this is

This is the modelopt-format NVFP4 variant of AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-BF16, with MTP speculative decoding and the vision tower preserved. The BF16 source is the lossless abliteration of Qwen 3.6 27B (KL 0.000492 vs base, 0/100 refusals, multimodal preserved, hybrid GDN-aware quantization).

Specifically:

  • Body quantized to NVFP4 via nvidia-modelopt 0.43.0 with NVFP4_DEFAULT_CFG. The result is modelopt's own export format, which vLLM serves through --quantization modelopt (a different code path from the -NVFP4 sibling release, which uses --quantization compressed-tensors).
  • Linear-attn / GatedDeltaNet layers preserved BF16 (432 keys across 48 GDN layers). NVFP4 quantization on Mamba/SSM state collapses the recurrence; modelopt's *linear_attn.conv1d* ignore plus our explicit *linear_attn* exclude keeps these intact.
  • Vision tower preserved BF16 (333 keys). Multimodal inference fully functional.
  • MTP head grafted from the base Qwen/Qwen3.6-27B checkpoint (15 tensors, BF16). The base contains MTP heads but Qwen3_5ForConditionalGeneration.from_pretrained drops them during loading; the lna-lab pipeline pattern (which this build follows) explicitly grafts them back into the quantized output, giving vLLM a working drafter for --speculative-config '{"method":"qwen3_5_mtp",...}'.
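The graft step in the last bullet can be pictured as a key-filtered copy between two state dicts. The sketch below uses plain dictionaries as stand-ins for checkpoints. The function name, the placeholder values, and the two mtp.* key names are hypothetical, and the real pipeline operates on safetensors shards rather than in-memory dicts.

```python
def graft_mtp_tensors(quantized, base, prefix="mtp."):
    """Copy every tensor whose key starts with `prefix` from `base` into a
    copy of `quantized`, leaving the quantized body untouched."""
    grafted = dict(quantized)
    mtp_keys = [key for key in base if key.startswith(prefix)]
    for key in mtp_keys:
        grafted[key] = base[key]
    return grafted, mtp_keys

# Toy stand-ins for state dicts; real checkpoints map names to tensors.
base_ckpt = {
    "model.layers.0.self_attn.q_proj.weight": "bf16-tensor",
    "mtp.fc.weight": "bf16-tensor",      # hypothetical MTP key name
    "mtp.norm.weight": "bf16-tensor",    # hypothetical MTP key name
}
quantized_ckpt = {
    "model.layers.0.self_attn.q_proj.weight": "nvfp4-tensor",
}

merged, copied = graft_mtp_tensors(quantized_ckpt, base_ckpt)
print(sorted(copied))
```

Only keys under the prefix are taken from the base checkpoint, so the quantized body weights are never overwritten by their BF16 counterparts.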

Why MTP — and where it actually wins

Multi-Token Prediction (MTP) lets the model predict multiple future tokens per forward pass via the trained mtp.* head, enabling speculative decoding without a separate drafter model. The acceptance rate is high because the drafter is the model itself — same architecture, same weights, same distribution.

Measured numbers on AEON-Ultimate (this exact variant)

| Hardware | Median tok/s | Peak tok/s | Spec-decode acceptance |
|---|---|---|---|
| RTX PRO 6000 Blackwell (96 GB dedicated VRAM) | ~92 (this variant) / 111.4 (XS sibling) | 124.7 (XS sibling) | 67.7 % regular / 69.2 % XS |
| DGX Spark / GB10 (unified memory), MTP method | 24.1 (XS sibling) | 27.5 | 66.3 % |
| DGX Spark / GB10, DFlash method on this body | 🏆 38.5 tok/s thinking-on / 38.1 thinking-off | 71.3 tok/s thinking-on / 68.4 off | DFlash (n=12) |
| RTX 5090, B100 / B200 | not yet measured by us; community welcome | | |

Reference numbers from sakamakismile's un-abliterated recipe (RTX 5090)

  • Single-stream short prompts at n=3: ~132 tok/s
  • Single-stream long-form: ~105 tok/s
  • 2-parallel aggregate (256K + KV FP8): ~189–207 tok/s
  • Mean MTP acceptance length: ~3.0–4.0 (vs DFlash chains ~2.0–2.3)
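Acceptance rate and mean acceptance length are two views of the same measurement. The sketch below relates them under one assumption: that the acceptance percentages on this card are accepted draft tokens divided by proposed draft tokens, and that each verification step also emits one token from the target model. If the card's percentages were computed differently, the relation does not apply as written.

```python
def mean_acceptance_length(draft_acceptance_rate, num_speculative_tokens):
    """Tokens emitted per verification step: one token from the target model
    plus the accepted share of the drafted tokens."""
    return 1.0 + draft_acceptance_rate * num_speculative_tokens

# MTP at n=3 with the acceptance rates measured on RTX PRO 6000
regular_len = mean_acceptance_length(0.677, 3)   # this variant
xs_len = mean_acceptance_length(0.692, 3)        # XS sibling
print(f"regular: {regular_len:.2f} tokens/step, XS: {xs_len:.2f} tokens/step")
```

Under that assumption both measured rates land at the lower end of the ~3.0–4.0 range quoted for the reference recipe.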

The hardware-routing punchline

On RTX PRO 6000 the XS sibling beats DFlash territory (~111 tok/s vs DFlash-class ~85 we'd expect there). On DGX Spark, DFlash beats MTP by 26 % median / 52 % peak — the unified-memory bandwidth caps how much MTP's high acceptance can translate to throughput. So: MTP is a dedicated-VRAM-Blackwell variant, not a universal upgrade. Full bench data: GitHub repo Performance section.

🎯 When to pick this variant — measured hardware routing

The right speculative-decode method depends on memory architecture:

| Hardware tier | Recommended variant | Why |
|---|---|---|
| DGX Spark / GB10 (sm_121a, unified memory) | -NVFP4 (DFlash), not this MTP variant | Bench on Spark: DFlash beats MTP by +26 % median, +52 % peak. Spark's unified-memory bandwidth doesn't reward MTP's high acceptance rate. Don't run MTP on Spark. |
| RTX PRO 6000 Blackwell (sm_120, 96 GB dedicated VRAM) | This variant (Multimodal-NVFP4-MTP) ✅ if you need vision; Text if text-only | MTP wins on dedicated VRAM. ~92 tok/s median measured with GDN BF16; dedicated-VRAM bandwidth lets the MTP head's high acceptance rate translate to throughput. |
| RTX 5090 (sm_120, 32 GB dedicated VRAM) | Multimodal-XS if you use vision; Text-XS if text-only | XS variants fit comfortably in 32 GB. 111.4 tok/s median measured on RTX PRO 6000; RTX 5090 should land near or above that. |
| A100 / H100 (no native FP4) | BF16 | NVFP4 dequantizes to BF16 on Ampere/Hopper, so there is no benefit. |
| B100 / B200 (sm_100, dedicated FP4) | This variant (Multimodal) or Text variant | Native FP4 + dedicated VRAM = MTP territory. |

Full bench numbers: GitHub repo Performance section.

Usage

vLLM serve

# One-time: pull this repo locally
hf download AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-Multimodal-NVFP4-MTP \
  --local-dir ./aeon-ultimate-multimodal-nvfp4-mtp

# Serve
export VLLM_NVFP4_GEMM_BACKEND=flashinfer-cutlass
export VLLM_USE_FLASHINFER_MOE_FP4=0
export VLLM_USE_FLASHINFER_SAMPLER=1

vllm serve ./aeon-ultimate-multimodal-nvfp4-mtp \
  --quantization modelopt \
  --trust-remote-code \
  --mamba-cache-dtype float16 \
  --mamba-block-size 256 \
  --max-model-len 262144 \
  --max-num-seqs 32 \
  --max-num-batched-tokens 32768 \
  --gpu-memory-utilization 0.85 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --reasoning-parser qwen3 \
  --tool-call-parser qwen3_coder \
  --enable-auto-tool-choice \
  --limit-mm-per-prompt '{"image":4,"video":2}' \
  --mm-encoder-tp-mode data \
  --speculative-config '{"method":"dflash","model":"z-lab/Qwen3.6-27B-DFlash","num_speculative_tokens":12}'

num_speculative_tokens=12 is the validated DFlash setting for this NVFP4 body. The --limit-mm-per-prompt / --mm-encoder-tp-mode data flags drive the preserved vision tower (multimodal body).
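Once the server is up, it can be queried through vLLM's OpenAI-compatible chat endpoint. The sketch below uses only the Python standard library. The base URL (vLLM's default port 8000), the helper names, and the max_tokens value are assumptions; the model name relies on vLLM using the path passed to vllm serve as the served name when --served-model-name is not set.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"            # assumption: default vLLM port
MODEL = "./aeon-ultimate-multimodal-nvfp4-mtp"   # path given to `vllm serve` above

def build_chat_request(image_url, question, max_tokens=256):
    """Build an OpenAI-style chat payload with one image part and one text part."""
    return {
        "model": MODEL,
        "max_tokens": max_tokens,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_url}},
                {"type": "text", "text": question},
            ],
        }],
    }

def send(payload):
    """POST the payload to the running server and return the reply text."""
    request = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]

payload = build_chat_request(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG",
    "What animal is on the candy?",
)
print(json.dumps(payload, indent=2))
# With the server running: print(send(payload))
```

The request stays within the --limit-mm-per-prompt budget of four images per prompt; sending more image parts in one message would be rejected by the server.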

Configuration notes

  • --quantization modelopt is required (not compressed-tensors — different format).
  • --speculative-config '{"method":"dflash", ...}' drives the z-lab Qwen3.6-27B-DFlash drafter at num_speculative_tokens=12 — the validated optimal for this NVFP4 body. (The native qwen3_5_mtp head is also grafted into this repo's safetensors and can be selected instead on dedicated-VRAM Blackwell; see the GitHub repo for the MTP-vs-DFlash hardware routing.)
  • --gpu-memory-utilization 0.85 keeps headroom on unified-memory parts; on dedicated-VRAM RTX PRO 6000 you can push higher, but 0.95+ causes the FlashInfer NVFP4 GEMM autotuner to OOM on first boot. See the GitHub repo's RTX PRO 6000 page for the same OOM behavior under DFlash.

Performance — DGX Spark (v0.23.0, aeon-vllm-ultimate:latest)

Measured on a single DGX Spark / GB10 (Blackwell sm_121a, unified memory) with ghcr.io/aeon-7/aeon-vllm-ultimate:latest (vLLM 0.23.0), this NVFP4 body driven by the z-lab DFlash drafter @ n=12 (DFlash@12 speculative decoding). Headline: ~36 tok/s single-stream, ~274 tok/s aggregate at c=64, ~38% DFlash acceptance (holds ~41% at long context).

[Figure: Aggregate throughput vs concurrency (c=1→64) per prompt category, DFlash@12 on DGX Spark GB10; peaks at ~274 tok/s at c=64]

[Figure: Qwen3.6-27B-AEON-Ultimate variant comparison; NVFP4 + DFlash@12 single-stream and c=64 aggregate vs unquantized BF16 baseline]

Single-stream (c=1) by prompt category — DFlash@12

| Category | Decode tok/s | TTFT (ms) | TPOT (ms) | Prefill (tok/s) | DFlash accept |
|---|---|---|---|---|---|
| Coding | 36.1 | 177 | 27.7 | 254 | 37.9 % |
| Math | 37.7 | 308 | 26.5 | 198 | 41.5 % |
| Reasoning | 42.4 | 299 | 23.6 | 164 | 47.0 % |
| Prose | 24.4 | 299 | 41.0 | 127 | 22.9 % |
| Natural language | 27.0 | 317 | 37.0 | 126 | 26.8 % |
| Extraction / JSON | 36.1 | 304 | 27.7 | 178 | 38.0 % |

Single-stream decode lands around 24–42 tok/s depending on category (~36 tok/s on the structured Coding/Extraction workloads); the higher-acceptance Reasoning/Math prompts decode fastest. Acceptance tracks how predictable the next tokens are — high on Reasoning (47%) and Math (41.5%), lower on free-form Prose (22.9%).
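The decode rate and TPOT columns in the single-stream table are reciprocals of each other (tok/s = 1000 / TPOT in ms). The short check below recomputes one from the other using the table's own figures; the function and variable names are illustrative.

```python
# (TPOT in ms, reported decode tok/s) from the single-stream table
rows = {
    "Coding": (27.7, 36.1),
    "Math": (26.5, 37.7),
    "Reasoning": (23.6, 42.4),
    "Prose": (41.0, 24.4),
    "Natural language": (37.0, 27.0),
    "Extraction / JSON": (27.7, 36.1),
}

def decode_rate(tpot_ms):
    """Tokens per second implied by a time-per-output-token in milliseconds."""
    return 1000.0 / tpot_ms

consistent = {
    name: round(decode_rate(tpot_ms), 1) == reported
    for name, (tpot_ms, reported) in rows.items()
}
print(consistent)
```

TTFT is not part of this relation: it covers prefill and the first token, while TPOT describes only the steady-state decode phase.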

Aggregate throughput by concurrency

Aggregate throughput scales cleanly from c=1 up to c=64 with no crash (the prior image crashed at c≥32 under speculative decoding — see What we fixed below). Peak aggregate throughput is ~274 tok/s at c=64 (Reasoning); other categories at c=64: Math ~251, Extraction/JSON ~240, Coding ~229, Natural language ~186, Prose ~156 tok/s. Most of the gain is already captured by c=16; c=16→64 adds only a few percent.

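| Category | c=1 | c=8 | c=16 | c=32 | c=64 |
|---|---|---|---|---|---|
| Coding | 35 | 162 | 214 | 222 | 229 |
| Math | 36 | 180 | 246 | 248 | 251 |
| Reasoning | 41 | 177 | 258 | 269 | 274 |
| Prose | 24 | 106 | 142 | 155 | 156 |
| Natural language | 26 | 129 | 180 | 183 | 186 |
| Extraction / JSON | 35 | 155 | 248 | 218 | 240 |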
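To see how much throughput is added beyond c=16, the per-category change from c=16 to c=64 can be computed directly from the table. The sketch below does only that arithmetic; the variable names are illustrative and the figures are copied from the table above.

```python
# Aggregate tok/s at (c=1, c=16, c=64), copied from the concurrency table
throughput = {
    "Coding": (35, 214, 229),
    "Math": (36, 246, 251),
    "Reasoning": (41, 258, 274),
    "Prose": (24, 142, 156),
    "Natural language": (26, 180, 186),
    "Extraction / JSON": (35, 248, 240),
}

def pct_change(old, new):
    """Relative change from `old` to `new`, in percent."""
    return 100.0 * (new - old) / old

late_gain = {
    name: round(pct_change(c16, c64), 1)
    for name, (_, c16, c64) in throughput.items()
}
for name, gain in late_gain.items():
    print(f"{name:18s} c=16->64: {gain:+.1f} %")
```

The same helper applied to the c=1 and c=16 columns gives the early-scaling gain, which is useful when choosing --max-num-seqs for a given workload mix.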

Long-context DFlash acceptance

DFlash draft acceptance holds at ~41% (40.9%) at long context rather than collapsing — PR #40898 runs the drafter's sliding-window layers as true SWA, so drafting survives as the agent history grows past ~2048 tokens.

Stock baseline pending fresh vanilla re-bench. No matched stock / un-optimized vanilla-vLLM baseline exists yet for this variant; the BF16 bar in the variant chart is the unquantized AEON body, not a stock-vLLM reference. A fully-vanilla (no DFlash, no sm_121a opts) re-bench is planned and these figures will be cross-referenced once it lands.

What we fixed for the DGX Spark

All AEON Qwen3.6-27B repos now run on one unified container, ghcr.io/aeon-7/aeon-vllm-ultimate:latest, which is vLLM 0.23.0 built from source for sm_121a and merged with the AEON speculative-decoding stack, tuned end-to-end for the GB10's unified-memory Blackwell architecture. The changes that matter most for this card:

  • DFlash high-concurrency fix (new in v0.23.0) — the speculative drafter previously crashed at ≥32 concurrent requests (a padded-vs-unpadded KV block-table shape mismatch in FlashAttention). The fix slices the drafter's block-table to the unpadded batch (block_table[:num_reqs]), so it now scales cleanly to c=64. This is a port of upstream PR #43982, which fixed the same bug for MTP but never for DFlash — it was present and unfixed even in the prior image.
  • Triton NVFP4 KV cache (PR #44389) — the only 4-bit KV path on sm_121a (upstream's is hard-gated to B200), giving ~3× KV capacity / longer context per GB of unified memory.
  • DFlash sliding-window attention (PR #40898) — runs the drafter's SWA layers as true sliding window, so long-context draft acceptance holds (~41% here at long context) instead of collapsing past ~2k tokens.
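The block-table slice described in the first bullet can be illustrated with nested lists. This is a toy model of the shape mismatch, not the actual vLLM patch: the batch sizes, block IDs, and the choice of which structure carries the padding are assumptions made for illustration.

```python
def slice_block_table(block_table, num_reqs):
    """Drop padding rows so the table has one row per live request."""
    return block_table[:num_reqs]

# Hypothetical numbers: 3 live requests padded up to a batch of 4.
# Each row lists the KV-cache block IDs owned by one request.
padded_block_table = [
    [10, 11, 12],
    [20, 21, 0],
    [30, 0, 0],
    [0, 0, 0],   # padding row, belongs to no request
]
seq_lens = [40, 25, 7]   # one entry per live request
num_reqs = len(seq_lens)

unpadded = slice_block_table(padded_block_table, num_reqs)
shapes_agree = len(unpadded) == len(seq_lens)
print(len(padded_block_table), len(unpadded), shapes_agree)
```

Before slicing, the table has one more row than there are sequence lengths, which is the kind of disagreement an attention kernel rejects; after slicing, both describe the same three requests.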

Container rollback tag: :2026-06-11-pr41703. Full writeup: container README.

Quantization recipe

  • Tool: nvidia-modelopt 0.43.0 with NVFP4_DEFAULT_CFG
  • Loader: Qwen3_5ForConditionalGeneration.from_pretrained (multimodal-preserved class)
  • Calibration: neuralmagic/calibration LLM split, 20 samples × 8192 tokens
  • Excluded from quantization (kept BF16):
    • lm_head, proj_out.*, *router*, *mlp.gate.* (NVFP4_DEFAULT_CFG)
    • *linear_attn.conv1d*, *mixer.conv1d* (NVFP4_DEFAULT_CFG)
    • *linear_attn* (added — full GDN preservation)
    • *visual* (added — vision tower preservation)
    • *mtp* (added — MTP head preservation)
    • *output_layer*, output.*
  • MTP graft: 15 tensors copied in BF16 from Qwen/Qwen3.6-27B after modelopt export (from_pretrained drops them on load; the explicit graft restores them)
  • Pipeline: lna-lab/GGUF-to-NVFP4-SM120 reference recipe, adapted for AEON-Ultimate-BF16 input + separate MTP source
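The exclusion list above is a set of wildcard patterns. The sketch below shows how such patterns decide which modules stay in BF16, assuming glob-style matching against the full module name. The module names are hypothetical examples, and the helper is not modelopt's own matching code.

```python
from fnmatch import fnmatchcase

EXCLUDE_PATTERNS = [
    "lm_head", "proj_out.*", "*router*", "*mlp.gate.*",
    "*linear_attn.conv1d*", "*mixer.conv1d*",
    "*linear_attn*", "*visual*", "*mtp*",
    "*output_layer*", "output.*",
]

def kept_in_bf16(module_name):
    """True if any exclude pattern matches, i.e. the module is not quantized."""
    return any(fnmatchcase(module_name, pattern) for pattern in EXCLUDE_PATTERNS)

# Hypothetical module names for illustration; real names come from the checkpoint.
examples = [
    "model.layers.0.linear_attn.in_proj_qkvz",
    "model.layers.3.self_attn.q_proj",
    "model.layers.3.mlp.gate_proj",
    "model.visual.blocks.0.attn.qkv",
    "mtp.fc",
    "lm_head",
]
for name in examples:
    print(f"{name:42s} {'BF16' if kept_in_bf16(name) else 'NVFP4'}")
```

Note that *mlp.gate.* requires the literal segment mlp.gate. and therefore does not match a name like mlp.gate_proj, so ordinary MLP projections are still quantized under this reading of the patterns.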

Provenance & credits

License + responsibility

Apache 2.0, inherited from Qwen/Qwen3.6-27B. This is an uncensored model. Read the full User Responsibility & Arbitration Clause on the BF16 source card before deploying. Summary: you implement downstream safety layers (input validation, output filtering, content moderation, audit logging, rate limiting, access controls, human-in-the-loop for high-risk workflows). The model has no opinions of its own — you supply the opinions, the judgment, and the ethics.


☕ Support the work

If this release has been useful, tips are deeply appreciated — they go directly toward more compute, more models, and more open releases.

₿ Bitcoin (BTC)
bc1q09xmzn00q4z3c5raene0f3pzn9d9pvawfm0py4
Ξ Ethereum (ETH)
0x1512667F6D61454ad531d2E45C0a5d1fd82D0500
◎ Solana (SOL)
DgQsjHdAnT5PNLQTNpJdpLS3tYGpVcsHQCkpoiAKsw8t
ⓜ Monero (XMR)
836XrSKw4R76vNi3QPJ5Fa9ugcyvE2cWmKSPv3AhpTNNKvqP8v5ba9JRL4Vh7UnFNjDz3E2GXZDVVenu3rkZaNdUFhjAvgd

Ethereum L2s (Base, Arbitrum, Optimism, Polygon, etc.) and EVM-compatible tokens can be sent to the same Ethereum address.
