Qwen3.6-27B-AEON-Ultimate-Uncensored-Multimodal-NVFP4-MTP
✅ Validated on the unified AEON vLLM Ultimate image `ghcr.io/aeon-7/aeon-vllm-ultimate:latest` (vLLM 0.23.0; same as tag `:2026-06-18-v0.23.0-dflashfix`; rollback tag `:2026-06-11-pr41703`). It loads and serves cleanly with the z-lab DFlash drafter at n=12. Latest v0.23.0 DGX Spark bench: ~36 tok/s single-stream / ~274 tok/s aggregate at c=64, ~38% DFlash acceptance (holds ~41% at long context). Recommended container base. Full numbers in Performance below.
Deployment, operations & benchmarks → github.com/AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-DFlash
The GitHub repo is the source of truth for the production deployment guide, hardware-tuned docker-compose configs, full configuration reference, measured benchmarks, and `AGENTS.md`, an operator's manual that pre-empts common stale-documentation traps.
🙏 Reference recipe credit: The modelopt + MTP graft pipeline used to build this variant is based on sakamakismile's validated Qwen3.6-27B-NVFP4-MTP series (22K+ downloads). They worked out the modelopt config, the per-projection quantization choices, and the MTP-head graft technique on the un-abliterated base; we adapted the same recipe to AEON-Ultimate's abliterated weights. The reference benchmark numbers cited below are theirs. Full credit for the recipe → sakamakismile.
🆕 AEON vLLM Ultimate container
`ghcr.io/aeon-7/aeon-vllm-ultimate:latest` (vLLM 0.23.0; same as tag `:2026-06-18-v0.23.0-dflashfix`; rollback tag `:2026-06-11-pr41703`): vLLM 0.23.0 built from source for sm_121a + Triton NVFP4 KV cache (~3× capacity) + DFlash high-concurrency fix + TurboQuant K8V4 + AEON sm_121a patches. This is the canonical container for all Qwen3.6-27B repos. Benchmarked end-to-end on DGX Spark / GB10 under v0.23.0 (see Performance below). This variant uses the modelopt NVFP4 format, the `qwen3_5_mtp` native head, and the hybrid GDN+attention stack. It serves with `--quantization modelopt` and either `--speculative-config '{"method":"qwen3_5_mtp","num_speculative_tokens":3}'` (native MTP, the recommended method for this dedicated-VRAM-Blackwell variant) or a DFlash drafter (recommended on Spark; see container README Recipe A; the v0.23.0 Spark numbers below use DFlash@12).
The image ENTRYPOINT is `/bin/bash`, so `docker run` must pass `--entrypoint vllm` and then `serve ...` (not `IMAGE vllm serve`, which runs `bash vllm serve` and fails). DFlash needs BF16 KV: leave `--kv-cache-dtype` unset (it defaults to BF16); do not set `fp8`/`nvfp4`. Full setup + bench comparison: container README.
Why the new image matters for long-context DFlash: the z-lab Qwen3.6-27B DFlash drafter is a sliding-window model; 4 of its 5 layers use sliding-window attention (window 2048). vLLM PR #40898 (in `aeon-vllm-ultimate:latest`) runs those layers as proper SWA; earlier images ran them as full attention, so drafting collapsed once context grew past ~2048 tokens. PR #41703 additionally makes `--enable-prefix-caching` corruption-immune with DFlash. Net: long-context drafting holds up; short-context (<2048, one window) is unchanged.
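To make the sliding-window point concrete, here is a minimal sketch of which key positions a causal query may attend to under full attention versus a window of 2048. The function name and the convention that the window includes the current token are assumptions for illustration; this is not the drafter's or vLLM's actual code.

```python
from typing import Optional

WINDOW = 2048  # window size quoted above for the drafter's SWA layers


def visible_positions(query_pos: int, window: Optional[int] = None) -> range:
    """Key positions a causal query at `query_pos` can attend to.

    window=None models full causal attention; otherwise only the most
    recent `window` positions (including the query itself) are visible.
    """
    if window is None:
        return range(0, query_pos + 1)
    start = max(0, query_pos - window + 1)
    return range(start, query_pos + 1)


# Short context (inside one window): both modes see the same keys.
print(len(visible_positions(100)), len(visible_positions(100, WINDOW)))
# Long context: full attention sees everything, SWA only the last 2048 keys.
print(len(visible_positions(5000)), len(visible_positions(5000, WINDOW)))
```

Below 2048 tokens the two modes are identical, which matches the note that short-context behavior is unchanged; past that point they diverge, which is where running SWA layers as full attention would feed the drafter a context shape it was not trained on.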
Quickstart
Complete copy-paste recipe — pull the container, pull this model, pull the DFlash drafter (fresh), then serve with the validated flags. The image ENTRYPOINT is /bin/bash, so docker run overrides it with --entrypoint vllm. DFlash needs BF16 KV — leave --kv-cache-dtype unset.
# 1. Pull the AEON vLLM Ultimate container (vLLM 0.23.0 sm_121a from-source + PR #44389 NVFP4-KV
# + PR #40898/#41703 DFlash fixes + DFlash high-concurrency fix).
# :latest = :2026-06-18-v0.23.0-dflashfix; rollback :2026-06-11-pr41703.
docker pull ghcr.io/aeon-7/aeon-vllm-ultimate:latest
# 2. Download this model (fresh).
huggingface-cli download AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-Multimodal-NVFP4-MTP \
--local-dir ./aeon-model
# 3. Download the z-lab DFlash drafter (fresh — pull every time).
huggingface-cli download z-lab/Qwen3.6-27B-DFlash \
--local-dir ./aeon-drafter
# 4. Serve — DFlash@12 on the NVFP4 (modelopt) body, vision tower preserved.
docker run --gpus all --ipc host --network host \
-e VLLM_NVFP4_GEMM_BACKEND=flashinfer-cutlass \
-e VLLM_USE_FLASHINFER_MOE_FP4=0 \
-e VLLM_USE_FLASHINFER_SAMPLER=1 \
-v ./aeon-model:/model:ro \
-v ./aeon-drafter:/drafter:ro \
--entrypoint vllm \
ghcr.io/aeon-7/aeon-vllm-ultimate:latest \
serve /model \
--quantization modelopt \
--trust-remote-code \
--mamba-cache-dtype float16 \
--mamba-block-size 256 \
--max-model-len 262144 \
--max-num-seqs 32 \
--max-num-batched-tokens 32768 \
--gpu-memory-utilization 0.85 \
--enable-chunked-prefill \
--enable-prefix-caching \
--reasoning-parser qwen3 \
--tool-call-parser qwen3_coder \
--enable-auto-tool-choice \
--limit-mm-per-prompt '{"image":4,"video":2}' \
--mm-encoder-tp-mode data \
--speculative-config '{"method":"dflash","model":"/drafter","num_speculative_tokens":12}'
num_speculative_tokens=12 is the validated DFlash setting for this NVFP4 body. On dedicated-VRAM Blackwell you can swap to the model's native grafted MTP head with --speculative-config '{"method":"qwen3_5_mtp","num_speculative_tokens":3}' (see hardware routing). Full flag reference, env vars, and BF16 / dedicated-GPU examples are in Usage below; deployment & compose configs live in the GitHub repo.
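Once the server is up, a request can be assembled as in the sketch below. It uses only the standard library and the OpenAI-compatible `/v1/chat/completions` route. The port (8000, vLLM's default) and the served model name (`/model`, the path passed to `serve`) are assumptions; adjust them if you set `--port` or `--served-model-name`.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # assumption: vLLM default port
MODEL_NAME = "/model"  # assumption: served name defaults to the path given to `serve`


def build_chat_request(prompt, image_url=None, max_tokens=256):
    """Build an OpenAI-style chat payload, optionally with one image."""
    content = [{"type": "text", "text": prompt}]
    if image_url is not None:
        # Images go in as `image_url` parts ahead of the text part.
        content.insert(0, {"type": "image_url", "image_url": {"url": image_url}})
    return {
        "model": MODEL_NAME,
        "messages": [{"role": "user", "content": content}],
        "max_tokens": max_tokens,
    }


def send(payload):
    """POST the payload to the running server and return the parsed JSON reply."""
    request = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=600) as response:
        return json.load(response)


payload = build_chat_request(
    "What animal is on the candy?",
    image_url="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG",
)
print(json.dumps(payload, indent=2))
```

With the server running, `send(payload)["choices"][0]["message"]["content"]` should hold the answer text.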
Variants
| Format | Size | Use case |
|---|---|---|
| BF16 | 51 GB | Full-precision reference weights (A100/H100 80 GB, RTX PRO 6000 96 GB, multi-GPU, fine-tuning) |
| NVFP4 (compressed-tensors + DFlash) | 26 GB | DGX Spark / GB10 — production validated with DFlash speculative decoding. Unified ghcr.io/aeon-7/aeon-vllm-ultimate:latest container. |
| Multimodal-NVFP4-MTP (this repo) | 27 GB | High-bandwidth dedicated GPUs (RTX 5090, RTX PRO 6000, B100/B200) with MTP speculative decoding via the model's native mtp.* head. modelopt format, --quantization modelopt. Vision tower preserved. |
| Text-NVFP4-MTP | 20 GB | Same as this repo but with vision tower stripped. Smaller footprint for text-only deployments on tighter VRAM. |
What this is
This is the modelopt-format NVFP4 variant with MTP speculative decoding, multimodal-preserved, of AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-BF16 — the lossless abliteration of Qwen 3.6 27B (KL 0.000492 vs base, 0/100 refusals, multimodal preserved, hybrid GDN-aware quantization).
Specifically:
- Body quantized to NVFP4 via `nvidia-modelopt` 0.43.0 with `NVFP4_DEFAULT_CFG`. This is the modelopt checkpoint format that vLLM serves through `--quantization modelopt` (a different code path from the `-NVFP4` sibling release, which uses `--quantization compressed-tensors`).
- Linear-attn / GatedDeltaNet layers preserved BF16 (432 keys across 48 GDN layers). NVFP4 quantization on Mamba/SSM state collapses the recurrence; modelopt's `*linear_attn.conv1d*` ignore plus our explicit `*linear_attn*` exclude keeps these intact.
- Vision tower preserved BF16 (333 keys). Multimodal inference fully functional.
- MTP head grafted from the base `Qwen/Qwen3.6-27B` checkpoint (15 tensors, BF16). The base contains MTP heads but `Qwen3_5ForConditionalGeneration.from_pretrained` drops them during loading; the lna-lab pipeline pattern (which this build follows) explicitly grafts them back into the quantized output, giving vLLM a working drafter for `--speculative-config '{"method":"qwen3_5_mtp",...}'`.
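The graft step in the last bullet can be pictured as a key-filtered merge of two state dicts. The sketch below uses plain dictionaries with placeholder values in place of real tensors, and the tensor names are hypothetical. The actual pipeline works on safetensors shards and is documented in the lna-lab recipe, so treat this only as an illustration of the idea.

```python
def graft_mtp(quantized: dict, base: dict, prefix: str = "mtp.") -> dict:
    """Return a copy of `quantized` with every `prefix`-named entry of `base` added.

    Entries from `base` are copied as-is (BF16 in the real pipeline);
    nothing else in the quantized checkpoint is touched.
    """
    grafted = dict(quantized)
    for name, tensor in base.items():
        if name.startswith(prefix):
            grafted[name] = tensor
    return grafted


# Hypothetical tensor names; strings stand in for tensors.
quantized_ckpt = {"model.layers.0.mlp.down_proj.weight": "nvfp4"}
base_ckpt = {
    "model.layers.0.mlp.down_proj.weight": "bf16",
    "mtp.fc.weight": "bf16",
    "mtp.norm.weight": "bf16",
}

merged = graft_mtp(quantized_ckpt, base_ckpt)
print(sorted(merged))
```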
Why MTP — and where it actually wins
Multi-Token Prediction (MTP) lets the model predict multiple future tokens per forward pass via the trained mtp.* head, enabling speculative decoding without a separate drafter model. The acceptance rate is high because the drafter is the model itself — same architecture, same weights, same distribution.
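As a rough illustration of how draft tokens turn into accepted tokens, the sketch below implements greedy draft verification: keep the longest draft prefix that agrees with the target model, then emit one token from the target. This is a simplified assumption for exposition; production engines such as vLLM verify drafts against token probabilities, and the function name here is hypothetical.

```python
def verify_draft(draft: list, target: list) -> list:
    """Greedy speculative-decoding verification.

    `draft`  : tokens proposed by the drafter (e.g. the MTP head).
    `target` : the target model's own choice at each draft position,
               plus one extra token, so len(target) == len(draft) + 1.
    Returns the tokens actually emitted this step.
    """
    accepted = 0
    for drafted, wanted in zip(draft, target):
        if drafted != wanted:
            break
        accepted += 1
    # The target's token at the first mismatch (or the extra token when
    # every draft token matched) is always emitted.
    return draft[:accepted] + [target[accepted]]


print(verify_draft([1, 2, 3], [1, 2, 3, 4]))  # all three drafts accepted
print(verify_draft([1, 2, 3], [1, 2, 9, 4]))  # third draft rejected
```

The number of tokens emitted per step is what the "acceptance length" figures below summarize: the closer the drafter tracks the target, the more tokens each verification pass yields.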
Measured numbers on AEON-Ultimate (this exact variant)
| Hardware | Median tok/s | Peak tok/s | Spec-decode acceptance |
|---|---|---|---|
| RTX PRO 6000 Blackwell (96 GB dedicated VRAM) | ~92 (this variant) / 111.4 (XS sibling) | 124.7 (XS sibling) | 67.7 % regular / 69.2 % XS |
| DGX Spark / GB10 (unified memory) — MTP method | 24.1 (XS sibling) | 27.5 | 66.3 % |
| DGX Spark / GB10 — DFlash method on this body 🏆 | 38.5 tok/s thinking-on / 38.1 thinking-off | 71.3 tok/s thinking-on / 68.4 off | DFlash (n=12) |
| RTX 5090, B100 / B200 | not yet measured by us; community results welcome | n/a | n/a |
Reference numbers from sakamakismile's un-abliterated recipe (RTX 5090)
- Single-stream short prompts at `n=3`: ~132 tok/s
- Single-stream long-form: ~105 tok/s
- 2-parallel aggregate (256K + KV FP8): ~189–207 tok/s
- Mean MTP acceptance length: ~3.0–4.0 (vs DFlash chains ~2.0–2.3)
The hardware-routing punchline
On RTX PRO 6000 the XS sibling beats DFlash territory (~111 tok/s vs DFlash-class ~85 we'd expect there). On DGX Spark, DFlash beats MTP by 26 % median / 52 % peak — the unified-memory bandwidth caps how much MTP's high acceptance can translate to throughput. So: MTP is a dedicated-VRAM-Blackwell variant, not a universal upgrade. Full bench data: GitHub repo Performance section.
🎯 When to pick this variant — measured hardware routing
The right speculative-decode method depends on memory architecture:
| Hardware tier | Recommended variant | Why |
|---|---|---|
| DGX Spark / GB10 (sm_121a, unified memory) | `-NVFP4` (DFlash), not this MTP variant | Bench on Spark: DFlash beats MTP by +26 % median, +52 % peak. Spark's unified-memory bandwidth doesn't reward MTP's high acceptance rate. Don't run MTP on Spark. |
| RTX PRO 6000 Blackwell (sm_120, 96 GB dedicated VRAM) | This variant (Multimodal-NVFP4-MTP) ✅ if you need vision; Text if text-only | MTP wins on dedicated VRAM. ~92 tok/s median measured with GDN BF16; dedicated-VRAM bandwidth lets the MTP head's high acceptance rate translate to throughput. |
| RTX 5090 (sm_120, 32 GB dedicated VRAM) | Multimodal-XS if you use vision; Text-XS if text-only | XS variants fit comfortably in 32 GB. 111.4 tok/s median measured on RTX PRO 6000; RTX 5090 should land near or above that. |
| A100 / H100 (no native FP4) | BF16 | NVFP4 dequantizes to BF16 on Ampere/Hopper — no benefit. |
| B100 / B200 (sm_100, dedicated FP4) | This variant (Multimodal) or Text variant | Native FP4 + dedicated VRAM = MTP territory. |
Full bench numbers: GitHub repo Performance section.
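The routing table can also be expressed as a small lookup, which may be handy in deployment scripts. The tier keys and the function name are hypothetical; the recommendations are copied from the table above.

```python
# Hypothetical tier keys; values mirror the "Recommended variant" column.
ROUTING = {
    "dgx-spark": {"vision": "-NVFP4 (DFlash)", "text": "-NVFP4 (DFlash)"},
    "rtx-pro-6000": {"vision": "Multimodal-NVFP4-MTP", "text": "Text-NVFP4-MTP"},
    "rtx-5090": {"vision": "Multimodal-XS", "text": "Text-XS"},
    "a100": {"vision": "BF16", "text": "BF16"},
    "h100": {"vision": "BF16", "text": "BF16"},
    "b100": {"vision": "Multimodal-NVFP4-MTP", "text": "Text-NVFP4-MTP"},
    "b200": {"vision": "Multimodal-NVFP4-MTP", "text": "Text-NVFP4-MTP"},
}


def recommend_variant(tier: str, needs_vision: bool) -> str:
    """Look up the recommended variant for a hardware tier."""
    try:
        row = ROUTING[tier]
    except KeyError:
        raise ValueError(f"no measured recommendation for tier {tier!r}") from None
    return row["vision" if needs_vision else "text"]


print(recommend_variant("rtx-pro-6000", needs_vision=True))
print(recommend_variant("dgx-spark", needs_vision=True))
```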
Usage
vLLM serve
# One-time: pull this repo locally
hf download AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-Multimodal-NVFP4-MTP \
--local-dir ./aeon-ultimate-multimodal-nvfp4-mtp
# Serve
export VLLM_NVFP4_GEMM_BACKEND=flashinfer-cutlass
export VLLM_USE_FLASHINFER_MOE_FP4=0
export VLLM_USE_FLASHINFER_SAMPLER=1
vllm serve ./aeon-ultimate-multimodal-nvfp4-mtp \
--quantization modelopt \
--trust-remote-code \
--mamba-cache-dtype float16 \
--mamba-block-size 256 \
--max-model-len 262144 \
--max-num-seqs 32 \
--max-num-batched-tokens 32768 \
--gpu-memory-utilization 0.85 \
--enable-chunked-prefill \
--enable-prefix-caching \
--reasoning-parser qwen3 \
--tool-call-parser qwen3_coder \
--enable-auto-tool-choice \
--limit-mm-per-prompt '{"image":4,"video":2}' \
--mm-encoder-tp-mode data \
--speculative-config '{"method":"dflash","model":"z-lab/Qwen3.6-27B-DFlash","num_speculative_tokens":12}'
num_speculative_tokens=12 is the validated DFlash setting for this NVFP4 body. The --limit-mm-per-prompt / --mm-encoder-tp-mode data flags drive the preserved vision tower (multimodal body).
Configuration notes
- `--quantization modelopt` is required (not `compressed-tensors`, which is a different format).
- `--speculative-config '{"method":"dflash", ...}'` drives the z-lab `Qwen3.6-27B-DFlash` drafter at `num_speculative_tokens=12`, the validated setting for this NVFP4 body. (The native `qwen3_5_mtp` head is also grafted into this repo's safetensors and can be selected instead on dedicated-VRAM Blackwell; see the GitHub repo for the MTP-vs-DFlash hardware routing.)
- `--gpu-memory-utilization 0.85` keeps headroom on unified-memory parts; on dedicated-VRAM RTX PRO 6000 you can push higher, but `0.95`+ causes the FlashInfer NVFP4 GEMM autotuner to OOM on first boot. See the GitHub repo's RTX PRO 6000 page for the same OOM behavior under DFlash.
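The constraints in these notes can be checked before launch with a small pre-flight function. This is a sketch under assumptions: the dictionary keys are hypothetical names mirroring the CLI flags, and the rules are only the ones stated in this card (modelopt quantization, no explicit KV dtype with DFlash, memory utilization below 0.95).

```python
def check_serve_config(cfg: dict) -> list:
    """Return human-readable problems found in a serve configuration."""
    problems = []
    if cfg.get("quantization") != "modelopt":
        problems.append("quantization must be 'modelopt' for this checkpoint")
    spec = cfg.get("speculative_config") or {}
    if spec.get("method") == "dflash" and cfg.get("kv_cache_dtype") is not None:
        problems.append("DFlash needs BF16 KV: leave kv_cache_dtype unset")
    if cfg.get("gpu_memory_utilization", 0.0) >= 0.95:
        problems.append("gpu_memory_utilization >= 0.95 risks an autotuner OOM on first boot")
    return problems


validated = {
    "quantization": "modelopt",
    "gpu_memory_utilization": 0.85,
    "speculative_config": {"method": "dflash", "num_speculative_tokens": 12},
}
print(check_serve_config(validated))
```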
Performance — DGX Spark (v0.23.0, aeon-vllm-ultimate:latest)
Measured on a single DGX Spark / GB10 (Blackwell sm_121a, unified memory) with ghcr.io/aeon-7/aeon-vllm-ultimate:latest (vLLM 0.23.0), this NVFP4 body driven by the z-lab DFlash drafter @ n=12 (DFlash@12 speculative decoding). Headline: ~36 tok/s single-stream, ~274 tok/s aggregate at c=64, ~38% DFlash acceptance (holds ~41% at long context).
Single-stream (c=1) by prompt category — DFlash@12
| Category | Decode tok/s | TTFT (ms) | TPOT (ms) | Prefill (tok/s) | DFlash accept |
|---|---|---|---|---|---|
| Coding | 36.1 | 177 | 27.7 | 254 | 37.9 % |
| Math | 37.7 | 308 | 26.5 | 198 | 41.5 % |
| Reasoning | 42.4 | 299 | 23.6 | 164 | 47.0 % |
| Prose | 24.4 | 299 | 41.0 | 127 | 22.9 % |
| Natural language | 27.0 | 317 | 37.0 | 126 | 26.8 % |
| Extraction / JSON | 36.1 | 304 | 27.7 | 178 | 38.0 % |
Single-stream decode lands around 24–42 tok/s depending on category (~36 tok/s on the structured Coding/Extraction workloads); the higher-acceptance Reasoning/Math prompts decode fastest. Acceptance tracks how predictable the next tokens are — high on Reasoning (47%) and Math (41.5%), lower on free-form Prose (22.9%).
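The Decode tok/s and TPOT columns appear to be two views of the same measurement: decode rate is roughly 1000 divided by the time per output token in milliseconds. The short check below, using the table's own figures, is a sanity sketch rather than part of the benchmark harness.

```python
# (decode tok/s, TPOT in ms) per category, copied from the table above.
SINGLE_STREAM = {
    "Coding": (36.1, 27.7),
    "Math": (37.7, 26.5),
    "Reasoning": (42.4, 23.6),
    "Prose": (24.4, 41.0),
    "Natural language": (27.0, 37.0),
    "Extraction / JSON": (36.1, 27.7),
}


def tpot_to_tok_s(tpot_ms: float) -> float:
    """Convert time-per-output-token (ms) to decode tokens per second."""
    return 1000.0 / tpot_ms


for category, (decode, tpot) in SINGLE_STREAM.items():
    derived = tpot_to_tok_s(tpot)
    print(f"{category:<18} reported={decode:5.1f}  from TPOT={derived:5.1f}")
```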
Aggregate throughput by concurrency
Aggregate throughput scales cleanly from c=1 up to c=64 with no crash (the prior image crashed at c≥32 under speculative decoding — see What we fixed below). Peak aggregate throughput is ~274 tok/s at c=64 (Reasoning); other categories at c=64: Math ~251, Extraction/JSON ~240, Coding ~229, Natural language ~186, Prose ~156 tok/s. Most of the gain is already captured by c=16; c=16→64 adds only a few percent.
| Category | c=1 | c=8 | c=16 | c=32 | c=64 |
|---|---|---|---|---|---|
| Coding | 35 | 162 | 214 | 222 | 229 |
| Math | 36 | 180 | 246 | 248 | 251 |
| Reasoning | 41 | 177 | 258 | 269 | 274 |
| Prose | 24 | 106 | 142 | 155 | 156 |
| Natural language | 26 | 129 | 180 | 183 | 186 |
| Extraction / JSON | 35 | 155 | 248 | 218 | 240 |
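To see how much throughput is left between c=16 and c=64, the percentage change can be computed directly from the table. The sketch below only re-derives numbers already shown above.

```python
# Aggregate tok/s by concurrency, copied from the table above.
AGGREGATE = {
    "Coding": {1: 35, 8: 162, 16: 214, 32: 222, 64: 229},
    "Math": {1: 36, 8: 180, 16: 246, 32: 248, 64: 251},
    "Reasoning": {1: 41, 8: 177, 16: 258, 32: 269, 64: 274},
    "Prose": {1: 24, 8: 106, 16: 142, 32: 155, 64: 156},
    "Natural language": {1: 26, 8: 129, 16: 180, 32: 183, 64: 186},
    "Extraction / JSON": {1: 35, 8: 155, 16: 248, 32: 218, 64: 240},
}


def gain_percent(category: str, low: int = 16, high: int = 64) -> float:
    """Percent change in aggregate tok/s going from concurrency `low` to `high`."""
    row = AGGREGATE[category]
    return (row[high] - row[low]) / row[low] * 100.0


for category in AGGREGATE:
    print(f"{category:<18} c=16->64: {gain_percent(category):+.1f} %")
```

Note that Extraction / JSON comes out slightly negative, since its c=16 figure is higher than its c=64 figure in the table.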
Long-context DFlash acceptance
DFlash draft acceptance holds at ~41% (40.9%) at long context rather than collapsing — PR #40898 runs the drafter's sliding-window layers as true SWA, so drafting survives as the agent history grows past ~2048 tokens.
Stock baseline pending fresh vanilla re-bench. No matched stock / un-optimized vanilla-vLLM baseline exists yet for this variant; the BF16 bar in the variant chart is the unquantized AEON body, not a stock-vLLM reference. A fully-vanilla (no DFlash, no sm_121a opts) re-bench is planned and these figures will be cross-referenced once it lands.
What we fixed for the DGX Spark
All AEON Qwen3.6-27B repos now run on one unified container, `ghcr.io/aeon-7/aeon-vllm-ultimate:latest`: vLLM 0.23.0 built from source for sm_121a and merged with the AEON speculative-decoding stack, tuned end-to-end for the GB10's unified-memory Blackwell architecture. The three changes that matter most for this card:
- DFlash high-concurrency fix (new in v0.23.0): the speculative drafter previously crashed at ≥32 concurrent requests (a padded-vs-unpadded KV block-table shape mismatch in FlashAttention). The fix slices the drafter's block-table to the unpadded batch (`block_table[:num_reqs]`), so it now scales cleanly to c=64. This is a port of upstream PR #43982, which fixed the same bug for MTP but never for DFlash; the bug was present and unfixed even in the prior image.
- Triton NVFP4 KV cache (PR #44389): the only 4-bit KV path on sm_121a (upstream's is hard-gated to B200), giving ~3× KV capacity / longer context per GB of unified memory.
- DFlash sliding-window attention (PR #40898): runs the drafter's SWA layers as true sliding window, so long-context draft acceptance holds (~41% here at long context) instead of collapsing past ~2k tokens.
Container rollback tag: :2026-06-11-pr41703. Full writeup: container README.
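The block-table slice described in the first fix can be illustrated with a toy example. Here nested Python lists stand in for tensors, and the padding value and helper names are hypothetical; the real change lives in the container's vLLM build and operates on GPU tensors.

```python
def pad_block_table(block_table: list, padded_batch: int, pad_value: int = -1) -> list:
    """Pad a per-request block table with dummy rows up to `padded_batch` rows."""
    width = len(block_table[0])
    padding = [[pad_value] * width for _ in range(padded_batch - len(block_table))]
    return block_table + padding


def slice_to_live_requests(block_table: list, num_reqs: int) -> list:
    """Keep only the rows that belong to real requests: block_table[:num_reqs]."""
    return block_table[:num_reqs]


live = [[0, 1], [2, 3], [4, 5]]          # 3 real requests, 2 KV blocks each
padded = pad_block_table(live, padded_batch=4)
num_reqs = len(live)

# A consumer expecting one row per live request sees 4 rows before the
# slice and 3 rows after it.
print(len(padded), len(slice_to_live_requests(padded, num_reqs)))
```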
Quantization recipe
- Tool: `nvidia-modelopt` 0.43.0 with `NVFP4_DEFAULT_CFG`
- Loader: `Qwen3_5ForConditionalGeneration.from_pretrained` (multimodal-preserved class)
- Calibration: `neuralmagic/calibration` LLM split, 20 samples × 8192 tokens
- Excluded from quantization (kept BF16):
  - `lm_head`, `proj_out.*`, `*router*`, `*mlp.gate.*` (NVFP4_DEFAULT_CFG)
  - `*linear_attn.conv1d*`, `*mixer.conv1d*` (NVFP4_DEFAULT_CFG)
  - `*linear_attn*` (added: full GDN preservation)
  - `*visual*` (added: vision tower preservation)
  - `*mtp*` (added: MTP head preservation)
  - `*output_layer*`, `output.*`
- MTP graft: 15 tensors copied bf16 from `Qwen/Qwen3.6-27B` after modelopt export (`AutoModelForCausalLM.from_pretrained` drops them; explicit graft restores)
- Pipeline: lna-lab/GGUF-to-NVFP4-SM120 reference recipe, adapted for AEON-Ultimate-BF16 input + separate MTP source
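To get a feel for what the exclusion patterns cover, the sketch below applies them to a few module names using standard-library glob matching. The module names are hypothetical, and `fnmatch` is used here as an assumed stand-in for modelopt's own wildcard matching, which may differ in detail.

```python
from fnmatch import fnmatchcase

# Exclusion patterns copied from the recipe above.
EXCLUDE_PATTERNS = [
    "lm_head", "proj_out.*", "*router*", "*mlp.gate.*",
    "*linear_attn.conv1d*", "*mixer.conv1d*",
    "*linear_attn*", "*visual*", "*mtp*",
    "*output_layer*", "output.*",
]


def kept_in_bf16(module_name: str) -> bool:
    """True if any exclusion pattern matches, i.e. the module is not quantized."""
    return any(fnmatchcase(module_name, pattern) for pattern in EXCLUDE_PATTERNS)


# Hypothetical module names for illustration only.
examples = [
    "model.layers.0.linear_attn.in_proj",  # GDN layer
    "model.visual.blocks.0.attn.qkv",      # vision tower
    "mtp.fc",                              # MTP head
    "lm_head",
    "model.layers.3.self_attn.q_proj",     # regular attention projection
    "model.layers.3.mlp.gate_proj",        # MLP projection
]
for name in examples:
    print(f"{name:<40} {'BF16' if kept_in_bf16(name) else 'NVFP4'}")
```

Under this matching, `*mlp.gate.*` does not catch `mlp.gate_proj`, because the pattern requires a literal `.` after `gate`; it targets router-style gate modules rather than ordinary MLP projections.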
Provenance & credits
- BF16 source: `AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-BF16`. See that card for the full abliteration pipeline.
- MTP graft technique: lna-lab/GGUF-to-NVFP4-SM120 (`docs/MTP_GRAFT_RECIPE.md`)
- Reference benchmark recipes: `sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP`
- Quantization: NVIDIA TensorRT Model Optimizer (`nvidia-modelopt` 0.43.0)
- Base: Alibaba Qwen team, `Qwen/Qwen3.6-27B`
License + responsibility
Apache 2.0, inherited from Qwen/Qwen3.6-27B. This is an uncensored model. Read the full User Responsibility & Arbitration Clause on the BF16 source card before deploying. Summary: you implement downstream safety layers (input validation, output filtering, content moderation, audit logging, rate limiting, access controls, human-in-the-loop for high-risk workflows). The model has no opinions of its own — you supply the opinions, the judgment, and the ethics.
☕ Support the work
If this release has been useful, tips are deeply appreciated — they go directly toward more compute, more models, and more open releases.
- ₿ Bitcoin (BTC): bc1q09xmzn00q4z3c5raene0f3pzn9d9pvawfm0py4
- Ξ Ethereum (ETH): 0x1512667F6D61454ad531d2E45C0a5d1fd82D0500
- ◎ Solana (SOL): DgQsjHdAnT5PNLQTNpJdpLS3tYGpVcsHQCkpoiAKsw8t
- ⓜ Monero (XMR): 836XrSKw4R76vNi3QPJ5Fa9ugcyvE2cWmKSPv3AhpTNNKvqP8v5ba9JRL4Vh7UnFNjDz3E2GXZDVVenu3rkZaNdUFhjAvgd
Ethereum L2s (Base, Arbitrum, Optimism, Polygon, etc.) and EVM-compatible tokens can be sent to the same Ethereum address.