--- license: apache-2.0 base_model: Qwen/Qwen3.6-27B base_model_relation: quantized library_name: transformers tags: - qwen3_5 - qwen3.6 - nvfp4 - quantized - modelopt - mtp - speculative-decoding - blackwell - text-only pipeline_tag: text-generation language: - en - zh - ja - ko - fr - de - es - it - pt - ru - ar --- # Qwen3.6-27B-Text-NVFP4-MTP NVFP4-quantized text-only sibling of [`Qwen/Qwen3.6-27B`](https://huggingface.co/Qwen/Qwen3.6-27B), with the **MTP (Multi-Token Prediction) head restored in bf16** so speculative decoding actually works. ## What's different from `sakamakismile/Qwen3.6-27B-NVFP4` | | This repo (`-Text-NVFP4-MTP`) | `Qwen3.6-27B-NVFP4` | |---|---|---| | Quantization format | **`modelopt`** (vLLM SM120 native path) | `compressed-tensors` | | MTP head | **Restored in bf16, working** | Dropped during export → 0% draft acceptance | | Vision tower | **Stripped (text-only)** | Present (kept for VLM use) | | Suggested launch | with `--speculative-config` | without speculation | The original `Qwen3.6-27B-NVFP4` is left untouched so existing users (~15K downloads) are not disrupted. This is a focused text-only sibling for users who want maximum speed and don't need vision input. ## Why this exists Two HF Discussion threads on the original repo prompted this: - [#5 — slower than official FP8 on Blackwell](https://huggingface.co/sakamakismile/Qwen3.6-27B-NVFP4/discussions/5) — root cause is the `compressed-tensors` NVFP4 path being slower than `modelopt` on Blackwell SM120; this repo uses `modelopt` natively. - [#7 — MTP not responding](https://huggingface.co/sakamakismile/Qwen3.6-27B-NVFP4/discussions/7) — `AutoModelForCausalLM.from_pretrained` does not load the MTP head, so it gets dropped during quantization, leading to 0% draft acceptance. This repo grafts the 15 `mtp.*` tensors (bf16) back into the quantized checkpoint and adds them to the quantization ignore list. Recipe is adapted from [`osoleve/Qwen3.5-27B-Text-NVFP4-MTP`](https://huggingface.co/osoleve/Qwen3.5-27B-Text-NVFP4-MTP) — credit and thanks. ## Reproduce this quantization This model was produced by the open-source [`lna-lab/GGUF-to-NVFP4-SM120`](https://github.com/lna-lab/GGUF-to-NVFP4-SM120) pipeline — Lna-Lab's production line for converting Qwen3.5 / 3.6 / Gemma 4 checkpoints into modelopt-format NVFP4 + working MTP, ready for vLLM on Blackwell SM120. The exact script is [`src/quantize/qwen36_27b_text_mtp.py`](https://github.com/lna-lab/GGUF-to-NVFP4-SM120/blob/main/src/quantize/qwen36_27b_text_mtp.py); the 5-step MTP graft recipe is documented in [`docs/MTP_GRAFT_RECIPE.md`](https://github.com/lna-lab/GGUF-to-NVFP4-SM120/blob/main/docs/MTP_GRAFT_RECIPE.md). ## Quantization details - **Base**: `Qwen/Qwen3.6-27B` (bf16, 27.78B params, hybrid linear-attn + full-attn, 64 layers, 1 MTP layer) - **Quantizer**: `nvidia-modelopt` 0.43.0 with `NVFP4_DEFAULT_CFG` - **Calibration**: 20 samples from `neuralmagic/calibration` (LLM split), max_seq_len 8192 - **Ignored from quantization** (kept in bf16): - `lm_head` - All `model.visual.*` (vision tower) — then **physically deleted** in the text-only build - All `*linear_attn.conv1d*` (Mamba-style SSM convolutions, 48 of the 64 layers) - All `mtp.*` modules (the 1-layer MTP head: 15 tensors total, ~850 MB bf16) - Other defaults from `NVFP4_DEFAULT_CFG`: `*router*`, `*mlp.gate.*`, `*block_sparse_moe.gate*`, `*output_layer*` ## Usage with vLLM (Blackwell, SM120) ### Recommended production launch — 256K context · KV FP8 · MTP n=3 · max-num-seqs 2 ```bash vllm serve sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP \ --trust-remote-code \ --quantization modelopt \ --language-model-only \ --max-model-len 262144 \ --max-num-seqs 2 \ --kv-cache-dtype fp8 \ --gpu-memory-utilization 0.9 \ --reasoning-parser qwen3 \ --speculative-config '{"method":"qwen3_5_mtp","num_speculative_tokens":3}' ``` This is the production launch on a single RTX PRO 6000 Blackwell — full 256K context, two concurrent slots, KV FP8 for concurrency headroom. The four flags that are easy to skip but matter: - **`--max-model-len 262144`** — full 256K context (Qwen3.6 trained max). - **`--kv-cache-dtype fp8`** — halves KV memory; lifts max concurrency at 256K from ~4× (BF16, won't fit) to **7.0×** with the same VRAM. ~5–10 % per-token decode overhead, more than paid back by capacity. - **`--max-num-seqs 2`** — load-bearing. `--max-num-seqs 4` plus `--kv-cache-dtype fp8` plus `--speculative-config n=3` plus `--max-model-len 262144` will silently OOM during cuda-graph capture on this build of vLLM (0.19.1rc1). - **`num_speculative_tokens: 3`** — vLLM applies the single MTP layer (`mtp_num_hidden_layers=1`) recursively three times per draft pass. Per-position acceptance ~87 / 72 / 61 %, mean accepted-length ≈ 3.0 / 4.0. The `qwen3_5_mtp` handler is internally normalized to `mtp` (deprecated-name warning is harmless). The `mtp.fc` weight is kept in **bf16** in the safetensors (not NVFP4) — equivalent to the Lorbus-style "dequantize the fusion layer in the file" trick applied to NVFP4 instead of AutoRound. Side effect of the `*mtp*` ignore entry in the modelopt config, but load-bearing for the n=3 throughput. ### Smaller-context launch (16K, no fp8) — fastest single-request decode ```bash vllm serve sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP \ --trust-remote-code \ --gpu-memory-utilization 0.85 \ --max-model-len 8192 \ --language-model-only \ --quantization modelopt \ --reasoning-parser qwen3 \ --speculative-config '{"method":"qwen3_5_mtp","num_speculative_tokens":3}' ``` ## Verified throughput vs the family baseline (RTX PRO 6000 Blackwell, vLLM 0.19.1rc1) Same 256K + KV FP8 + max-num-seqs 2 production launch, T = 0: | Repo | Format | MTP | Single (S/M/L) | 2-parallel agg (M/L) | vs baseline | |---|---|---|---|---|---| | [`Qwen3.6-27B-NVFP4`](https://huggingface.co/sakamakismile/Qwen3.6-27B-NVFP4) (the family baseline) | `compressed-tensors` | ❌ | 56 / 59 / 59 | 119 / 119 | 1.0× | | **`Qwen3.6-27B-Text-NVFP4-MTP`** (this repo) | `modelopt` | ✅ n=3 | **104 / 98 / 100** | **189 / 207** | **1.67× / 1.74×** | | [`Carnice-V2-27b-NVFP4-TEXT-MTP`](https://huggingface.co/sakamakismile/Carnice-V2-27b-NVFP4-TEXT-MTP) | `modelopt` | ✅ n=3 | 107 / 98 / 102 | 193 / 194 | 1.68× / 1.63× | | [`Huihui-Qwen3.6-27B-abliterated-NVFP4-TEXT-MTP`](https://huggingface.co/sakamakismile/Huihui-Qwen3.6-27B-abliterated-NVFP4-TEXT-MTP) | `modelopt` | ✅ n=3 | 117 / 96 / 101 | 203 / 183 | 1.65× / 1.54× | | [`Huihui-Qwen3.6-27B-abliterated-NVFP4-MTP`](https://huggingface.co/sakamakismile/Huihui-Qwen3.6-27B-abliterated-NVFP4-MTP) (VLM) | `modelopt` | ✅ n=3 | 118 / 97 / 100 | 183 / 198 | 1.66× / 1.66× | (S = 50-token, M = 350-token, L = 700-token decodes.) This repo lands at **1.74× the baseline's 2-parallel aggregate throughput** on long-form decodes (207 tok/s vs 119) — the gain comes from two compounding fixes: - **`modelopt` NVFP4 export** — vLLM's native fast path on Blackwell SM120, vs the `compressed-tensors` slow fallback the baseline lands on. - **bf16-restored MTP head + `num_speculative_tokens=3`** — single MTP layer applied recursively for ~1.9× decode multiplier via speculative decoding. KV cache size at 256K + fp8: **491,200 tokens** → max concurrency 6.98× per request at full 256K. Mean acceptance length 1.93 / 2.0 at n=1, ~3.0 / 4.0 at n=3. ### Smaller-context single-request bench (16K, no fp8) | Prompt | Tokens | n=1 tok/s | **n=3 tok/s** | |---|---|---|---| | Short (50 tok) | 50 | ~71 | **132.5** | | Medium (350 tok) | 350 | ~85 | **105.5** | | Long-form (700 tok) | 700 | ~85 | **106.5** | GPU memory at load: ~15 GB. ## Hardware target Built and tested on **NVIDIA RTX PRO 6000 Blackwell (SM120)**. Should also work on **RTX 5090** and other Blackwell consumer/workstation cards with sufficient VRAM (the model is roughly 14 GB after NVFP4 + ~850 MB of bf16 MTP/conv1d/lm_head). ## Acknowledgements - [`osoleve`](https://huggingface.co/osoleve) — for the MTP-restoration recipe on Qwen3.5 - [`Qwen`](https://huggingface.co/Qwen) — for the base model - [`nvidia-modelopt`](https://github.com/NVIDIA/TensorRT-Model-Optimizer) team - The reporters of Discussions #5 and #7 — for catching this cleanly ## Support the Base Model Authors If you find this model useful, please consider supporting: - **Qwen Team** (original model): [Star the Qwen repo](https://huggingface.co/Qwen/Qwen3.6-27B) ## License This model inherits the Apache 2.0 license.