
Libraries

How to use AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-Multimodal-NVFP4-MTP with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-Multimodal-NVFP4-MTP")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForMultimodalLM

processor = AutoProcessor.from_pretrained("AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-Multimodal-NVFP4-MTP")
model = AutoModelForMultimodalLM.from_pretrained("AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-Multimodal-NVFP4-MTP")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))


vLLM

How to use AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-Multimodal-NVFP4-MTP with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-Multimodal-NVFP4-MTP"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-Multimodal-NVFP4-MTP",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-Multimodal-NVFP4-MTP

SGLang

How to use AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-Multimodal-NVFP4-MTP with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-Multimodal-NVFP4-MTP" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-Multimodal-NVFP4-MTP",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-Multimodal-NVFP4-MTP" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-Multimodal-NVFP4-MTP",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'


Qwen3.6-27B-AEON-Ultimate-Uncensored-Multimodal-NVFP4-MTP

✅ Validated on the unified AEON vLLM Ultimate image ghcr.io/aeon-7/aeon-vllm-ultimate:latest (vLLM 0.23.0; :latest currently points to tag :2026-06-18-v0.23.0-dflashfix; rollback tag :2026-06-11-pr41703): loads and serves cleanly with the z-lab DFlash drafter at n=12. Latest v0.23.0 DGX Spark bench: ~36 tok/s single-stream, ~274 tok/s aggregate at c=64, ~38% DFlash acceptance (holds ~41% at long context). This is the recommended container base. Full numbers in Performance below.

Deployment, operations & benchmarks → github.com/AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-DFlash

The GitHub repo is the source of truth for the production deployment guide, hardware-tuned docker-compose configs, full configuration reference, measured benchmarks, and AGENTS.md — an operator's manual that pre-empts common stale-documentation traps.

🙏 Reference recipe credit: The modelopt + MTP graft pipeline used to build this variant is based on sakamakismile's validated Qwen3.6-27B-NVFP4-MTP series (22K+ downloads). They worked out the modelopt config, the per-projection quantization choices, and the MTP-head graft technique on the un-abliterated base; we adapted the same recipe to AEON-Ultimate's abliterated weights. The reference benchmark numbers cited below are theirs. Full credit for the recipe → sakamakismile.

🆕 AEON vLLM Ultimate container

ghcr.io/aeon-7/aeon-vllm-ultimate:latest (:latest currently points to tag :2026-06-18-v0.23.0-dflashfix; rollback tag :2026-06-11-pr41703) bundles vLLM 0.23.0 built from source for sm_121a, the Triton NVFP4 KV cache (~3× capacity), the DFlash high-concurrency fix, TurboQuant K8V4, and the AEON sm_121a patches. This is the canonical container for all Qwen3.6-27B repos, benchmarked end-to-end on DGX Spark / GB10 under v0.23.0 (see Performance below).

This variant uses the modelopt NVFP4 format, the qwen3_5_mtp native head, and the hybrid GDN+attention stack. It serves with --quantization modelopt plus one of two speculative-decoding methods: native MTP via --speculative-config '{"method":"qwen3_5_mtp","num_speculative_tokens":3}' (the recommended method on dedicated-VRAM Blackwell, which this variant targets), or a DFlash drafter (recommended on Spark; see container README Recipe A. The v0.23.0 Spark numbers below use DFlash@12).

The image ENTRYPOINT is /bin/bash, so docker run must pass --entrypoint vllm and then serve ... (not IMAGE vllm serve, which runs bash vllm serve and fails). DFlash needs BF16 KV — leave --kv-cache-dtype unset (it defaults to BF16); do not set fp8/nvfp4. Full setup + bench comparison: container README.

Why the new image matters for long-context DFlash: the z-lab Qwen3.6-27B DFlash drafter is a sliding-window model — 4 of its 5 layers use sliding-window attention (window 2048). vLLM PR #40898 (in aeon-vllm-ultimate:latest) runs those layers as proper SWA; earlier images ran them as full attention, so drafting collapsed once context grew past ~2048 tokens. PR #41703 additionally makes --enable-prefix-caching corruption-immune with DFlash. Net: long-context drafting holds up; short-context (<2048, one window) is unchanged.
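To make the window behaviour concrete, here is a minimal sketch of which key positions a causal query can see under full attention versus a sliding window. The helper names are hypothetical and this is plain Python rather than vLLM's attention kernel. It also assumes the window counts the current token, which is one common convention and may not be exactly how the drafter defines it. It only illustrates why the two modes agree while the context fits in one 2048-token window and diverge beyond it.

```python
def visible_keys_full(query_pos: int) -> range:
    """Causal full attention: a query sees every position up to itself."""
    return range(0, query_pos + 1)


def visible_keys_swa(query_pos: int, window: int = 2048) -> range:
    """Causal sliding-window attention: only the last `window` positions.

    Assumption: the window includes the query position itself.
    """
    start = max(0, query_pos - window + 1)
    return range(start, query_pos + 1)


def modes_agree(context_len: int, window: int = 2048) -> bool:
    """True when both modes see identical keys for the last token."""
    last = context_len - 1
    return visible_keys_swa(last, window) == visible_keys_full(last)


short_ok = modes_agree(2048)                  # context fits in one window
long_ok = modes_agree(8192)                   # context exceeds the window
keys_seen_long = len(visible_keys_swa(8191))  # keys visible to the last token
```

With these definitions `short_ok` is true and `long_ok` is false: a layer trained with a 2048-token window but run as full attention sees key positions at long context that it never saw in training, which is consistent with the drafting collapse described above.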

Quickstart

Complete copy-paste recipe — pull the container, pull this model, pull the DFlash drafter (fresh), then serve with the validated flags. The image ENTRYPOINT is /bin/bash, so docker run overrides it with --entrypoint vllm. DFlash needs BF16 KV — leave --kv-cache-dtype unset.

# 1. Pull the AEON vLLM Ultimate container (vLLM 0.23.0 sm_121a from-source + PR #44389 NVFP4-KV
#    + PR #40898/#41703 DFlash fixes + DFlash high-concurrency fix).
#    :latest = :2026-06-18-v0.23.0-dflashfix; rollback :2026-06-11-pr41703.
docker pull ghcr.io/aeon-7/aeon-vllm-ultimate:latest

# 2. Download this model (fresh).
huggingface-cli download AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-Multimodal-NVFP4-MTP \
  --local-dir ./aeon-model

# 3. Download the z-lab DFlash drafter (fresh — pull every time).
huggingface-cli download z-lab/Qwen3.6-27B-DFlash \
  --local-dir ./aeon-drafter

# 4. Serve — DFlash@12 on the NVFP4 (modelopt) body, vision tower preserved.
docker run --gpus all --ipc host --network host \
  -e VLLM_NVFP4_GEMM_BACKEND=flashinfer-cutlass \
  -e VLLM_USE_FLASHINFER_MOE_FP4=0 \
  -e VLLM_USE_FLASHINFER_SAMPLER=1 \
  -v "$(pwd)/aeon-model:/model:ro" \
  -v "$(pwd)/aeon-drafter:/drafter:ro" \
  --entrypoint vllm \
  ghcr.io/aeon-7/aeon-vllm-ultimate:latest \
  serve /model \
    --quantization modelopt \
    --trust-remote-code \
    --mamba-cache-dtype float16 \
    --mamba-block-size 256 \
    --max-model-len 262144 \
    --max-num-seqs 32 \
    --max-num-batched-tokens 32768 \
    --gpu-memory-utilization 0.85 \
    --enable-chunked-prefill \
    --enable-prefix-caching \
    --reasoning-parser qwen3 \
    --tool-call-parser qwen3_coder \
    --enable-auto-tool-choice \
    --limit-mm-per-prompt '{"image":4,"video":2}' \
    --mm-encoder-tp-mode data \
    --speculative-config '{"method":"dflash","model":"/drafter","num_speculative_tokens":12}'

num_speculative_tokens=12 is the validated DFlash setting for this NVFP4 body. On dedicated-VRAM Blackwell you can swap to the model's native grafted MTP head with --speculative-config '{"method":"qwen3_5_mtp","num_speculative_tokens":3}' (see hardware routing). Full flag reference, env vars, and BF16 / dedicated-GPU examples are in Usage below; deployment & compose configs live in the GitHub repo.
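Once the server is up, a request can be sent from Python with only the standard library. The sketch below makes two assumptions that are not stated in this card: the server listens on vLLM's default port 8000 (reachable on localhost because of --network host), and the served model name is the path passed to serve (/model), since no --served-model-name flag is set. The helper names build_payload and post_chat are hypothetical.

```python
import json
import urllib.request
from typing import Optional

# Assumptions (not from the model card): default vLLM port, and the served
# model name defaulting to the path given to `vllm serve`.
BASE_URL = "http://localhost:8000/v1/chat/completions"
MODEL_NAME = "/model"


def build_payload(question: str, image_url: Optional[str] = None,
                  max_tokens: int = 256) -> dict:
    """Build an OpenAI-style chat request, optionally with one image."""
    content = [{"type": "text", "text": question}]
    if image_url is not None:
        # Image goes first, then the question, in OpenAI content-part format.
        content.insert(0, {"type": "image_url", "image_url": {"url": image_url}})
    return {
        "model": MODEL_NAME,
        "messages": [{"role": "user", "content": content}],
        "max_tokens": max_tokens,
    }


def post_chat(payload: dict, url: str = BASE_URL, timeout: float = 120.0):
    """POST the payload and return the assistant message content."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


payload = build_payload(
    "What animal is on the candy?",
    image_url="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG",
)
```

Calling post_chat(payload) against the running container returns the reply text. Because the server is started with --reasoning-parser qwen3, the returned content can be empty if max_tokens is used up during thinking, so a larger max_tokens may be needed for reasoning-heavy prompts.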

Variants

Format	Size	Use case
BF16	51 GB	Full-precision reference weights (A100/H100 80 GB, RTX PRO 6000 96 GB, multi-GPU, fine-tuning)
NVFP4 (compressed-tensors + DFlash)	26 GB	DGX Spark / GB10 — production validated with DFlash speculative decoding. Unified `ghcr.io/aeon-7/aeon-vllm-ultimate:latest` container.
Multimodal-NVFP4-MTP (this repo)	27 GB	High-bandwidth dedicated GPUs (RTX 5090, RTX PRO 6000, B100/B200) with MTP speculative decoding via the model's native `mtp.*` head. modelopt format, `--quantization modelopt`. Vision tower preserved.
Text-NVFP4-MTP	20 GB	Same as this repo but with vision tower stripped. Smaller footprint for text-only deployments on tighter VRAM.

What this is

This is the modelopt-format NVFP4 variant with MTP speculative decoding, multimodal-preserved, of AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-BF16 — the lossless abliteration of Qwen 3.6 27B (KL 0.000492 vs base, 0/100 refusals, multimodal preserved, hybrid GDN-aware quantization).

Specifically:

Body quantized to NVFP4 via nvidia-modelopt 0.43.0 with NVFP4_DEFAULT_CFG. This is the modelopt checkpoint format, which vLLM serves through --quantization modelopt (a different code path from the -NVFP4 sibling release, which uses --quantization compressed-tensors).
Linear-attn / GatedDeltaNet layers preserved BF16 (432 keys across 48 GDN layers). NVFP4 quantization on Mamba/SSM state collapses the recurrence; modelopt's *linear_attn.conv1d* ignore plus our explicit *linear_attn* exclude keeps these intact.
Vision tower preserved BF16 (333 keys). Multimodal inference fully functional.
MTP head grafted from the base Qwen/Qwen3.6-27B checkpoint (15 tensors, BF16). The base contains MTP heads but Qwen3_5ForConditionalGeneration.from_pretrained drops them during loading; the lna-lab pipeline pattern (which this build follows) explicitly grafts them back into the quantized output, giving vLLM a working drafter for --speculative-config '{"method":"qwen3_5_mtp",...}'.
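The graft step can be pictured as a key-prefix copy between two checkpoints. The sketch below uses plain dictionaries with strings standing in for tensors, and the key names are hypothetical examples rather than the real Qwen3.6 layout. Only the mtp. prefix comes from this card. The actual pipeline operates on safetensors files and is documented in the lna-lab recipe.

```python
def graft_mtp(quantized: dict, base: dict, prefix: str = "mtp.") -> dict:
    """Return a copy of `quantized` with every `prefix` tensor taken from `base`."""
    mtp_keys = [key for key in base if key.startswith(prefix)]
    if not mtp_keys:
        raise ValueError(f"base checkpoint has no tensors under {prefix!r}")
    grafted = dict(quantized)  # leave the input mapping untouched
    for key in mtp_keys:
        grafted[key] = base[key]
    return grafted


# Stand-in checkpoints: strings replace real tensors, key names are illustrative.
base_ckpt = {
    "model.layers.0.mlp.down_proj.weight": "bf16",
    "mtp.fc.weight": "bf16",
    "mtp.norm.weight": "bf16",
}
quantized_ckpt = {
    "model.layers.0.mlp.down_proj.weight": "nvfp4",
    "model.layers.0.mlp.down_proj.weight_scale": "fp8",
}

merged = graft_mtp(quantized_ckpt, base_ckpt)
grafted_keys = sorted(key for key in merged if key.startswith("mtp."))
```

The quantized body tensors are left as they are and only the head tensors are added, which matches the description above of a BF16 head sitting next to an NVFP4 body.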

Why MTP — and where it actually wins

Multi-Token Prediction (MTP) lets the model predict multiple future tokens per forward pass via the trained mtp.* head, enabling speculative decoding without a separate drafter model. The acceptance rate is high because the drafter is part of the model itself: the mtp.* head ships in the base checkpoint and was trained to predict that model's own upcoming tokens, so its proposals follow the same distribution.
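As a rough illustration of the draft-then-verify loop behind speculative decoding, the sketch below implements one greedy step with toy next-token functions. It is not vLLM's implementation: a real engine verifies all drafts in a single batched forward pass and supports sampling, whereas this version calls the target once per position for clarity. The names speculative_step, draft_next, and target_next are hypothetical.

```python
from typing import Callable, List


def speculative_step(
    context: List[int],
    draft_next: Callable[[List[int]], int],
    target_next: Callable[[List[int]], int],
    num_speculative_tokens: int = 3,
) -> List[int]:
    """Run one draft-then-verify step and return the tokens it emits."""
    # 1. Draft k tokens autoregressively with the cheap head.
    drafted: List[int] = []
    for _ in range(num_speculative_tokens):
        drafted.append(draft_next(context + drafted))

    # 2. Verify: keep drafts while they match the target's greedy choice.
    emitted: List[int] = []
    for token in drafted:
        expected = target_next(context + emitted)
        if token != expected:
            emitted.append(expected)  # first mismatch: emit the target's token, stop
            return emitted
        emitted.append(token)

    # 3. All drafts accepted: verification also yields one bonus token.
    emitted.append(target_next(context + emitted))
    return emitted


def target_next(ctx: List[int]) -> int:
    """Toy target model: always counts upward."""
    return ctx[-1] + 1


def draft_next(ctx: List[int]) -> int:
    """Toy draft head: agrees with the target except after tokens ending a block of 5."""
    return -1 if ctx[-1] % 5 == 4 else ctx[-1] + 1


full_accept = speculative_step([0], draft_next, target_next)
partial_accept = speculative_step([3], draft_next, target_next)
```

Starting from [0] all three drafts match, so the step emits four tokens. Starting from [3] the second draft is wrong, so the step emits two. In both cases the output equals what the target would have produced on its own, which is why higher acceptance translates into speed without changing results under greedy decoding.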

Measured numbers on AEON-Ultimate (this variant and its XS sibling)

Hardware	Median tok/s	Peak tok/s	Spec-decode acceptance
RTX PRO 6000 Blackwell (96 GB dedicated VRAM)	~92 (this variant) / 111.4 (XS sibling)	124.7 (XS sibling)	67.7 % regular / 69.2 % XS
DGX Spark / GB10 (unified memory) — MTP method	24.1 (XS sibling)	27.5	66.3 %
DGX Spark / GB10 — DFlash method on this body 🏆	38.5 tok/s thinking-on / 38.1 thinking-off	71.3 tok/s thinking-on / 68.4 off	DFlash (n=12)
RTX 5090, B100 / B200	not yet measured by us — community welcome

Reference numbers from sakamakismile's un-abliterated recipe (RTX 5090)

Single-stream short prompts at n=3: ~132 tok/s
Single-stream long-form: ~105 tok/s
2-parallel aggregate (256K + KV FP8): ~189–207 tok/s
Mean MTP acceptance length: ~3.0–4.0 (vs DFlash chains ~2.0–2.3)

The hardware-routing punchline

On RTX PRO 6000 the XS sibling beats DFlash territory (~111 tok/s vs DFlash-class ~85 we'd expect there). On DGX Spark, DFlash beats MTP by 26 % median / 52 % peak — the unified-memory bandwidth caps how much MTP's high acceptance can translate to throughput. So: MTP is a dedicated-VRAM-Blackwell variant, not a universal upgrade. Full bench data: GitHub repo Performance section.

🎯 When to pick this variant — measured hardware routing

The right speculative-decode method depends on memory architecture:

Hardware tier	Recommended variant	Why
DGX Spark / GB10 (sm_121a, unified memory)	`-NVFP4` (DFlash) — not this MTP variant	Bench on Spark: DFlash beats MTP by +26 % median, +52 % peak. Spark's unified-memory bandwidth doesn't reward MTP's high acceptance rate. Don't run MTP on Spark.
RTX PRO 6000 Blackwell (sm_120, 96 GB dedicated VRAM)	This variant (Multimodal-NVFP4-MTP) ✅ if you need vision; Text if text-only	MTP wins on dedicated VRAM. ~92 tok/s median measured with GDN BF16; dedicated-VRAM bandwidth lets the MTP head's high acceptance rate translate to throughput.
RTX 5090 (sm_120, 32 GB dedicated VRAM)	Multimodal-XS if you use vision; Text-XS if text-only	XS variants fit comfortably in 32 GB. 111.4 tok/s median measured on RTX PRO 6000; RTX 5090 should land near or above that.
A100 / H100 (no native FP4)	BF16	NVFP4 dequantizes to BF16 on Ampere/Hopper — no benefit.
B100 / B200 (sm_100, dedicated FP4)	This variant (Multimodal) or Text variant	Native FP4 + dedicated VRAM = MTP territory.

Full bench numbers: GitHub repo Performance section.

Usage

vLLM serve

# One-time: pull this repo locally
hf download AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-Multimodal-NVFP4-MTP \
  --local-dir ./aeon-ultimate-multimodal-nvfp4-mtp

# Serve
export VLLM_NVFP4_GEMM_BACKEND=flashinfer-cutlass
export VLLM_USE_FLASHINFER_MOE_FP4=0
export VLLM_USE_FLASHINFER_SAMPLER=1

vllm serve ./aeon-ultimate-multimodal-nvfp4-mtp \
  --quantization modelopt \
  --trust-remote-code \
  --mamba-cache-dtype float16 \
  --mamba-block-size 256 \
  --max-model-len 262144 \
  --max-num-seqs 32 \
  --max-num-batched-tokens 32768 \
  --gpu-memory-utilization 0.85 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --reasoning-parser qwen3 \
  --tool-call-parser qwen3_coder \
  --enable-auto-tool-choice \
  --limit-mm-per-prompt '{"image":4,"video":2}' \
  --mm-encoder-tp-mode data \
  --speculative-config '{"method":"dflash","model":"z-lab/Qwen3.6-27B-DFlash","num_speculative_tokens":12}'

num_speculative_tokens=12 is the validated DFlash setting for this NVFP4 body. The --limit-mm-per-prompt / --mm-encoder-tp-mode data flags drive the preserved vision tower (multimodal body).

Configuration notes

--quantization modelopt is required (not compressed-tensors — different format).
--speculative-config '{"method":"dflash", ...}' drives the z-lab Qwen3.6-27B-DFlash drafter at num_speculative_tokens=12 — the validated optimal for this NVFP4 body. (The native qwen3_5_mtp head is also grafted into this repo's safetensors and can be selected instead on dedicated-VRAM Blackwell; see the GitHub repo for the MTP-vs-DFlash hardware routing.)
--gpu-memory-utilization 0.85 keeps headroom on unified-memory parts; on dedicated-VRAM RTX PRO 6000 you can push higher, but 0.95+ causes the FlashInfer NVFP4 GEMM autotuner to OOM on first boot. See the GitHub repo's RTX PRO 6000 page for the same OOM behavior under DFlash.

Performance — DGX Spark (v0.23.0, aeon-vllm-ultimate:latest)

Measured on a single DGX Spark / GB10 (Blackwell sm_121a, unified memory) with ghcr.io/aeon-7/aeon-vllm-ultimate:latest (vLLM 0.23.0), using this NVFP4 body and the z-lab DFlash drafter at num_speculative_tokens=12 (DFlash@12). Headline: ~36 tok/s single-stream, ~274 tok/s aggregate at c=64, ~38% DFlash acceptance (holds ~41% at long context).

Figure: Aggregate throughput vs concurrency (c=1→64) per prompt category, DFlash@12 on DGX Spark GB10; peaks at ~274 tok/s at c=64.

Figure: Qwen3.6-27B-AEON-Ultimate variant comparison, NVFP4 + DFlash@12 single-stream and c=64 aggregate vs unquantized BF16 baseline.

Single-stream (c=1) by prompt category — DFlash@12

Category	Decode tok/s	TTFT (ms)	TPOT (ms)	Prefill (tok/s)	DFlash accept
Coding	36.1	177	27.7	254	37.9 %
Math	37.7	308	26.5	198	41.5 %
Reasoning	42.4	299	23.6	164	47.0 %
Prose	24.4	299	41.0	127	22.9 %
Natural language	27.0	317	37.0	126	26.8 %
Extraction / JSON	36.1	304	27.7	178	38.0 %

Single-stream decode lands around 24–42 tok/s depending on category (~36 tok/s on the structured Coding/Extraction workloads); the higher-acceptance Reasoning/Math prompts decode fastest. Acceptance tracks how predictable the next tokens are — high on Reasoning (47%) and Math (41.5%), lower on free-form Prose (22.9%).
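The decode and TPOT columns are two views of the same measurement, which the short check below makes explicit. It assumes the usual definitions (TPOT is the mean time per output token after the first, and decode tok/s is its reciprocal); the card does not spell these out. The end_to_end_seconds helper is a hypothetical back-of-envelope estimate, not a benchmark result.

```python
# Single-stream rows from the table above: category -> (decode tok/s, TPOT in ms).
SINGLE_STREAM = {
    "Coding": (36.1, 27.7),
    "Math": (37.7, 26.5),
    "Reasoning": (42.4, 23.6),
    "Prose": (24.4, 41.0),
    "Natural language": (27.0, 37.0),
    "Extraction / JSON": (36.1, 27.7),
}


def decode_rate_from_tpot(tpot_ms: float) -> float:
    """Decode rate in tok/s implied by a time-per-output-token in milliseconds."""
    return 1000.0 / tpot_ms


def end_to_end_seconds(ttft_ms: float, tpot_ms: float, output_tokens: int) -> float:
    """Estimated latency: first token after TTFT, then one TPOT per further token."""
    return (ttft_ms + tpot_ms * (output_tokens - 1)) / 1000.0


# Largest disagreement between the reported rate and 1000 / TPOT, in tok/s.
max_gap = max(
    abs(decode_rate_from_tpot(tpot) - rate) for rate, tpot in SINGLE_STREAM.values()
)

# Estimated wall time for a 500-token Coding reply (TTFT 177 ms, TPOT 27.7 ms).
coding_500 = end_to_end_seconds(177, 27.7, 500)
```

The table's two columns agree to within rounding under that assumption, and the estimate for a 500-token Coding reply comes out at roughly 14 seconds.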

Aggregate throughput by concurrency

Aggregate throughput scales from c=1 up to c=64 with no crash (the prior image crashed at c≥32 under speculative decoding; see What we fixed below). Peak aggregate throughput is ~274 tok/s at c=64 (Reasoning); other categories at c=64: Math ~251, Extraction/JSON ~240, Coding ~229, Natural language ~186, Prose ~156 tok/s. Most of the gain is already captured by c=16: going from c=16 to c=64 adds roughly 2–10% depending on category, and Extraction/JSON dips at c=32 and finishes slightly below its c=16 figure (248 → 240).

Category	c=1	c=8	c=16	c=32	c=64
Coding	35	162	214	222	229
Math	36	180	246	248	251
Reasoning	41	177	258	269	274
Prose	24	106	142	155	156
Natural language	26	129	180	183	186
Extraction / JSON	35	155	248	218	240
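The scaling behaviour can be read straight off the table with a few lines of Python. The figures below are copied from the table above; the helper names are hypothetical, and the per-stream number is a simple division of aggregate throughput by concurrency rather than a separately measured value.

```python
CONCURRENCY = (1, 8, 16, 32, 64)

# Aggregate tok/s per category, copied from the table above.
AGGREGATE_TOK_S = {
    "Coding": (35, 162, 214, 222, 229),
    "Math": (36, 180, 246, 248, 251),
    "Reasoning": (41, 177, 258, 269, 274),
    "Prose": (24, 106, 142, 155, 156),
    "Natural language": (26, 129, 180, 183, 186),
    "Extraction / JSON": (35, 155, 248, 218, 240),
}


def gain_percent(category: str, low: int, high: int) -> float:
    """Percentage change in aggregate throughput between two concurrency levels."""
    row = AGGREGATE_TOK_S[category]
    before = row[CONCURRENCY.index(low)]
    after = row[CONCURRENCY.index(high)]
    return round(100.0 * (after - before) / before, 1)


def per_stream(category: str, concurrency: int) -> float:
    """Average tok/s each request receives at a given concurrency."""
    total = AGGREGATE_TOK_S[category][CONCURRENCY.index(concurrency)]
    return round(total / concurrency, 2)


gains_16_to_64 = {name: gain_percent(name, 16, 64) for name in AGGREGATE_TOK_S}
reasoning_per_stream_c64 = per_stream("Reasoning", 64)
```

The same helpers make the trade-off visible: aggregate throughput flattens beyond c=16 while the share each request receives keeps shrinking, so the best concurrency setting depends on whether total throughput or per-user latency matters more.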

Long-context DFlash acceptance

DFlash draft acceptance holds at ~41% (40.9%) at long context rather than collapsing — PR #40898 runs the drafter's sliding-window layers as true SWA, so drafting survives as the agent history grows past ~2048 tokens.

Stock baseline pending fresh vanilla re-bench. No matched stock / un-optimized vanilla-vLLM baseline exists yet for this variant; the BF16 bar in the variant chart is the unquantized AEON body, not a stock-vLLM reference. A fully-vanilla (no DFlash, no sm_121a opts) re-bench is planned and these figures will be cross-referenced once it lands.

What we fixed for the DGX Spark

All AEON Qwen3.6-27B repos now run on one unified container, ghcr.io/aeon-7/aeon-vllm-ultimate:latest: vLLM 0.23.0 built from source for sm_121a and merged with the AEON speculative-decoding stack, tuned end-to-end for the GB10's unified-memory Blackwell architecture. The three changes that matter most for this card:

DFlash high-concurrency fix (new in v0.23.0) — the speculative drafter previously crashed at ≥32 concurrent requests (a padded-vs-unpadded KV block-table shape mismatch in FlashAttention). The fix slices the drafter's block-table to the unpadded batch (block_table[:num_reqs]), so it now scales cleanly to c=64. This is a port of upstream PR #43982, which fixed the same bug for MTP but never for DFlash — it was present and unfixed even in the prior image.
Triton NVFP4 KV cache (PR #44389) — the only 4-bit KV path on sm_121a (upstream's is hard-gated to B200), giving ~3× KV capacity / longer context per GB of unified memory.
DFlash sliding-window attention (PR #40898) — runs the drafter's SWA layers as true sliding window, so long-context draft acceptance holds (~41% here at long context) instead of collapsing past ~2k tokens.
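The shape mismatch behind the high-concurrency crash can be illustrated with nested lists. Only the slicing expression block_table[:num_reqs] comes from the description above. The padded batch size, the block IDs, and the helper name are hypothetical, and the real fix operates on tensors inside vLLM rather than on Python lists.

```python
from typing import List


def slice_block_table(block_table: List[List[int]], num_reqs: int) -> List[List[int]]:
    """Keep only the rows that belong to live requests, dropping batch padding."""
    if num_reqs > len(block_table):
        raise ValueError("more requests than block-table rows")
    return block_table[:num_reqs]


# Hypothetical batch: 3 live requests padded up to 4 rows.
# Each row lists the KV-cache block IDs assigned to one request.
padded_block_table = [
    [7, 8, 9],
    [2, 3, 0],
    [5, 0, 0],
    [0, 0, 0],  # padding row, no live request behind it
]
seq_lens = [40, 20, 10]  # one entry per live request

num_reqs = len(seq_lens)
shapes_match_before = len(padded_block_table) == num_reqs
unpadded = slice_block_table(padded_block_table, num_reqs)
shapes_match_after = len(unpadded) == num_reqs
```

Before slicing, the block table has one more row than there are sequence lengths, which is the kind of padded-versus-unpadded disagreement an attention kernel rejects. After slicing, both describe the same three requests.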

Container rollback tag: :2026-06-11-pr41703. Full writeup: container README.

Quantization recipe

Tool: nvidia-modelopt 0.43.0 with NVFP4_DEFAULT_CFG
Loader: Qwen3_5ForConditionalGeneration.from_pretrained (multimodal-preserved class)
Calibration: neuralmagic/calibration LLM split, 20 samples × 8192 tokens
Excluded from quantization (kept BF16):
- lm_head, proj_out.*, *router*, *mlp.gate.* (NVFP4_DEFAULT_CFG)
- *linear_attn.conv1d*, *mixer.conv1d* (NVFP4_DEFAULT_CFG)
- *linear_attn* (added — full GDN preservation)
- *visual* (added — vision tower preservation)
- *mtp* (added — MTP head preservation)
- *output_layer*, output.*
MTP graft: 15 tensors copied in BF16 from Qwen/Qwen3.6-27B after modelopt export (Qwen3_5ForConditionalGeneration.from_pretrained drops them; the explicit graft restores them)
Pipeline: lna-lab/GGUF-to-NVFP4-SM120 reference recipe, adapted for AEON-Ultimate-BF16 input + separate MTP source
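To see how the exclusion list partitions the model, the sketch below applies the patterns from the recipe with Python's glob-style matcher. It assumes the patterns are matched as shell-style wildcards against module names, and the module names in the example are hypothetical stand-ins for the real Qwen3.6 layout.

```python
from fnmatch import fnmatchcase

# Exclusion globs copied from the recipe above.
KEEP_BF16_PATTERNS = [
    "lm_head", "proj_out.*", "*router*", "*mlp.gate.*",   # NVFP4_DEFAULT_CFG
    "*linear_attn.conv1d*", "*mixer.conv1d*",             # NVFP4_DEFAULT_CFG
    "*linear_attn*",                                      # full GDN preservation
    "*visual*",                                           # vision tower
    "*mtp*",                                              # MTP head
    "*output_layer*", "output.*",
]


def stays_bf16(module_name: str) -> bool:
    """True if any exclusion pattern matches, meaning the module is not quantized."""
    return any(fnmatchcase(module_name, pattern) for pattern in KEEP_BF16_PATTERNS)


# Hypothetical module names, for illustration only.
examples = [
    "model.language_model.layers.0.linear_attn.in_proj_qkv",
    "model.language_model.layers.3.self_attn.q_proj",
    "model.language_model.layers.3.mlp.down_proj",
    "model.visual.blocks.0.attn.qkv",
    "mtp.layers.0.mlp.up_proj",
    "lm_head",
]
plan = {name: ("BF16" if stays_bf16(name) else "NVFP4") for name in examples}
```

In this toy layout the GDN, vision, MTP, and lm_head modules stay in BF16, while the full-attention and MLP projections are quantized. The broad *linear_attn* pattern already covers the narrower *linear_attn.conv1d* one, so the added exclusion is a superset of the default.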

Provenance & credits

BF16 source: AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-BF16. See that card for the full abliteration pipeline.
MTP graft technique: lna-lab/GGUF-to-NVFP4-SM120 (docs/MTP_GRAFT_RECIPE.md)
Reference benchmark recipes: sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP
Quantization: NVIDIA TensorRT Model Optimizer (nvidia-modelopt 0.43.0)
Base: Alibaba Qwen team — Qwen/Qwen3.6-27B

License + responsibility

Apache 2.0, inherited from Qwen/Qwen3.6-27B. This is an uncensored model. Read the full User Responsibility & Arbitration Clause on the BF16 source card before deploying. Summary: you implement downstream safety layers (input validation, output filtering, content moderation, audit logging, rate limiting, access controls, human-in-the-loop for high-risk workflows). The model has no opinions of its own — you supply the opinions, the judgment, and the ethics.