---
license: apache-2.0
library_name: vllm
pipeline_tag: text-generation
tags:
  - nvfp4
  - modelopt
  - mtp
  - speculative-decoding
  - qwen3_5
  - blackwell
  - vllm
  - abliterated
---

# Huihui-Qwen3.6-27B-abliterated-NVFP4-MTP

NVFP4 (modelopt **W4A4**) quant of **Huihui-Qwen3.6-27B-abliterated** — a Qwen3.5-family
**hybrid** model (linear attention + periodic full attention) with a built-in **MTP**
(multi-token-prediction) head for speculative decoding. The checkpoint is multimodal-capable
(`Qwen3_5ForConditionalGeneration`, vision/video tokens) but is served here as a
text-generation model with reasoning and tool calling. Fits **4× 16 GB Blackwell (SM120)**.

- ~7.2 GiB/GPU weights at TP=4 · 64K–262K context · `<think>` reasoning · XML tool-calls
- Single-stream **~81–83 tok/s** (TP=4, MTP n=3); peak **~880 tok/s** @ 24 concurrent (64K)

---

## TL;DR — run it (no build required)

The official vLLM image already ships the `qwen3_5` architecture **and** the `Qwen3_5MTP`
draft module, so you do not need to build anything.

```bash
# from this directory; pick exactly TP_SIZE GPUs and avoid your display GPU
CUDA_VISIBLE_DEVICES=0,1,2,3 ./run.sh up
./run.sh test          # waits for /v1/models
./run.sh bench         # one-shot smoke test
```

Or the raw `docker run` (what `run.sh`/`compose.yaml` wrap):

```bash
docker run -d --name vllm-huihui --runtime nvidia --gpus '"device=0,1,2,3"' \
  -p 8000:8000 -v "$PWD":/model:ro --shm-size 32g \
  -e VLLM_USE_FLASHINFER_SAMPLER=1 -e TORCH_MATMUL_PRECISION=high \
  --entrypoint vllm vllm/vllm-openai:v0.22.0 serve /model \
    --served-model-name huihui-qwen36-27b-local \
    --trust-remote-code --tensor-parallel-size 4 --quantization modelopt \
    --max-model-len 65536 --max-num-seqs 8 --max-num-batched-tokens 16384 \
    --gpu-memory-utilization 0.85 --kv-cache-dtype fp8 --dtype auto \
    --reasoning-parser qwen3 \
    --speculative-config '{"method":"qwen3_5_mtp","num_speculative_tokens":3}' \
    --chat-template /model/chat_template.jinja \
    --enable-auto-tool-choice --tool-call-parser qwen3_xml \
    --host 0.0.0.0 --port 8000
```

Smoke test:

```bash
curl -s localhost:8000/v1/chat/completions -H 'Content-Type: application/json' -d '{
  "model":"huihui-qwen36-27b-local",
  "messages":[{"role":"user","content":"Name three famous sights in Tokyo, briefly."}],
  "max_tokens":512, "temperature":0.7, "top_k":20, "top_p":0.95}' | jq .
```

---

## Hardware & requirements

- **4× NVIDIA RTX PRO 2000 Blackwell** (16 GB each, SM120), PCIe (no NVLink).
- Docker + NVIDIA Container Toolkit. The pre-built **`vllm/vllm-openai:v0.22.0`** image
  carries vLLM ≥0.22 with NVFP4/modelopt + FlashInfer FP4 kernels and the qwen3_5 + MTP code.
- TP=4 sharding is clean: heads 24, KV heads 4, hidden 5120, intermediate 17408 — all ÷4.
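
The divisibility claim in the last bullet can be checked for other tensor-parallel sizes with a few lines of Python. This is only a sketch of the arithmetic: the dimension names are labels chosen for this example, and an uneven dimension here is a warning sign, not a statement about what vLLM will accept.

```python
# Dimensions quoted in this card; the dict keys are labels chosen for this example.
DIMS = {"attention_heads": 24, "kv_heads": 4, "hidden": 5120, "intermediate": 17408}


def uneven_dims(tp_size: int, dims: dict[str, int] = DIMS) -> list[str]:
    """Return the names of dimensions that do not divide evenly by tp_size."""
    return [name for name, size in dims.items() if size % tp_size != 0]


if __name__ == "__main__":
    for tp in (1, 2, 4, 8):
        bad = uneven_dims(tp)
        status = "clean" if not bad else "uneven: " + ", ".join(bad)
        print(f"TP={tp}: {status}")
```

At TP=8 the four KV heads no longer split evenly, which is one reason TP=4 is the natural fit for this checkpoint.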

> Bare-metal (no container) also works: `pip install vllm` (≥0.21 introduced qwen3_5),
> CUDA 13.x toolchain for the SM120 Triton/NVFP4 kernels, then the same `vllm serve` flags.

---

## Flags, and why

| flag | value | why |
|---|---|---|
| `--quantization modelopt` | **required** | checkpoint is NVFP4 (`hf_quant_config.json`); without it weights read as garbage. |
| `--trust-remote-code` | recommended | `qwen3_5` multimodal config. |
| `--tensor-parallel-size` | `4` | model needs ~7.2 GiB/GPU; 4× 16 GB is the design point. |
| `--max-model-len` | `65536` (≤ `262144`) | hybrid attention keeps KV cheap — long context is affordable. |
| `--max-num-seqs` | `8` (peak `24` @64K) | concurrent slots. See benchmarks for the throughput curve. |
| `--kv-cache-dtype fp8` | recommended | ~2× KV capacity for more concurrency / longer context. |
| `--gpu-memory-utilization` | `0.85` | model ≈7.2 GiB/GPU → ~6 GiB left for KV. Raise only on a clean card. |
| `--reasoning-parser qwen3` | recommended | splits `<think>…</think>` into `reasoning_content`; answer in `content`. |
| `--speculative-config '{"method":"qwen3_5_mtp","num_speculative_tokens":3}'` | recommended | turns on the **MTP** draft head. vLLM ≥0.22 auto-maps `qwen3_5_mtp → mtp` (harmless deprecation warning). `SPEC_TOKENS=0` disables it. |
| `--enable-auto-tool-choice --tool-call-parser qwen3_xml` | optional (agentic) | parses Qwen3 XML tool calls. Drop for pure chat (`ENABLE_TOOLS=0`). |
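
The memory figures in the `--gpu-memory-utilization` row follow from simple arithmetic, sketched below. The usable per-GPU total of 15.5 GiB is an assumption inferred from the ~13.2 GiB target quoted elsewhere in this card, not a measured value, and the sketch ignores activations, CUDA graphs and other runtime overhead.

```python
def kv_budget_gib(total_gib: float, utilization: float, weights_gib: float) -> float:
    """GiB left for KV cache: the utilization target minus the resident weights."""
    target = total_gib * utilization
    if target <= weights_gib:
        raise ValueError("utilization target does not even cover the weights")
    return target - weights_gib


if __name__ == "__main__":
    # ASSUMPTION: ~15.5 GiB usable on a 16 GB card; 7.2 GiB weights is from this card.
    for util in (0.80, 0.85, 0.90):
        print(f"util={util:.2f}: ~{kv_budget_gib(15.5, util, 7.2):.1f} GiB for KV")
```

Anything already resident on the card (a display server, a leftover worker) comes out of the same budget, which is why the value should only be raised on a clean GPU.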

**Sampling (Qwen default, `generation_config.json`):** `temperature=0.7`, `top_k=20`,
`top_p=0.95`. It is a reasoning model — give it room (`max_tokens ≥ 512`).
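
A minimal Python client that applies these sampling defaults, assuming the server from the TL;DR is listening on `localhost:8000`. The helper names `build_request` and `chat` are made up for this sketch; the `reasoning_content` field is the one the reasoning parser is described as filling in the flags table.

```python
import json
import urllib.request

# Sampling defaults quoted in this card.
SAMPLING = {"temperature": 0.7, "top_k": 20, "top_p": 0.95}


def build_request(prompt: str, model: str = "huihui-qwen36-27b-local",
                  max_tokens: int = 512) -> dict:
    """Build a chat-completions body with the card's sampling defaults."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        **SAMPLING,
    }


def chat(prompt: str, base_url: str = "http://localhost:8000/v1"):
    """POST the request and return (reasoning, answer) from the first choice."""
    body = json.dumps(build_request(prompt)).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        message = json.load(resp)["choices"][0]["message"]
    # reasoning_content may be absent if no reasoning parser is configured.
    return message.get("reasoning_content"), message.get("content")
```

Calling `chat("Name three famous sights in Tokyo, briefly.")` against a running server returns the thinking text and the final answer separately.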

---

## Docker package (bundled)

`compose.yaml` · `entrypoint.sh` · `run.sh` · `Dockerfile`. By default, `compose.yaml` uses
the official image with a mounted `entrypoint.sh`, so nothing needs to be built. Every flag
is env-overridable:

```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 ./run.sh up        # start on those 4 GPUs
MAX_MODEL_LEN=262144 MAX_NUM_SEQS=8 ./run.sh up # 256K long-context mode
MAX_MODEL_LEN=131072 MAX_NUM_SEQS=1 ./run.sh up # 128K single-request benchmark
SPEC_TOKENS=0 ./run.sh up                       # disable MTP speculative decoding
ENABLE_TOOLS=0 ./run.sh up                      # pure chat (no tool parser)
PORT=8001 ./run.sh up                           # serve on a different host port
./run.sh logs                                   # tail the server log
./run.sh down                                   # stop
```

Env knobs: `PORT`, `MAX_MODEL_LEN`, `MAX_NUM_SEQS`, `MAX_NUM_BATCHED_TOKENS`,
`GPU_MEM_UTIL`, `KV_CACHE_DTYPE`, `SPEC_TOKENS`, `TP_SIZE`, `ENABLE_TOOLS`,
`REASONING_PARSER`, `TOOL_CALL_PARSER`, `CUDA_VISIBLE_DEVICES`, `VLLM_IMAGE`.
The model weights are mounted read-only (`. → /model`); the image carries only the runtime.
`shm_size: 32g` is set (vLLM V1 uses a lot of shared memory).
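
To make the relationship between the env knobs and the `vllm serve` flags concrete, here is a Python sketch of one plausible mapping. It is not the bundled `entrypoint.sh`: the function name `serve_args` is hypothetical, and the defaults are assumed to match the raw `docker run` command in the TL;DR.

```python
import json
import os


def serve_args(env=None) -> list[str]:
    """Translate env knobs into a vllm serve argument list (illustrative only)."""
    env = os.environ if env is None else env
    args = [
        "vllm", "serve", "/model",
        "--quantization", "modelopt",
        "--tensor-parallel-size", env.get("TP_SIZE", "4"),
        "--max-model-len", env.get("MAX_MODEL_LEN", "65536"),
        "--max-num-seqs", env.get("MAX_NUM_SEQS", "8"),
        "--max-num-batched-tokens", env.get("MAX_NUM_BATCHED_TOKENS", "16384"),
        "--gpu-memory-utilization", env.get("GPU_MEM_UTIL", "0.85"),
        "--kv-cache-dtype", env.get("KV_CACHE_DTYPE", "fp8"),
        "--reasoning-parser", env.get("REASONING_PARSER", "qwen3"),
    ]
    # SPEC_TOKENS=0 disables MTP: the speculative config is simply omitted.
    spec_tokens = int(env.get("SPEC_TOKENS", "3"))
    if spec_tokens > 0:
        config = {"method": "qwen3_5_mtp", "num_speculative_tokens": spec_tokens}
        args += ["--speculative-config", json.dumps(config)]
    # ENABLE_TOOLS=0 drops both tool-calling flags for pure chat.
    if env.get("ENABLE_TOOLS", "1") != "0":
        args += ["--enable-auto-tool-choice",
                 "--tool-call-parser", env.get("TOOL_CALL_PARSER", "qwen3_xml")]
    return args
```

For example, `serve_args({"MAX_MODEL_LEN": "262144", "SPEC_TOKENS": "0"})` yields a 256K configuration without the speculative-decoding flag.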

To build a self-contained image instead: uncomment the `build:` block in `compose.yaml`
and run `./run.sh rebuild` (the `Dockerfile` just pip-installs vLLM on a CUDA 13.1 base).

---

## Benchmark results (RTX PRO 2000 Blackwell ×4, TP=4, MTP n=3)

Conditions: 512 output tokens, ~350-token prompt, `--kv-cache-dtype fp8`,
`--gpu-memory-utilization 0.85`.

### 64K context

| Reqs | Aggregate | Per-req (t/s) |     | Reqs | Aggregate | Per-req (t/s) |
|----:|----------:|--------:|-----|----:|----------:|--------:|
| 1   | 81.0 t/s  | 81.0    |     | 14  | 669.9 t/s | 49.5    |
| 2   | 134.0 t/s | 67.0    |     | 16  | 720.2 t/s | 46.1    |
| 3   | 205.1 t/s | 71.6    |     | 18  | 764.7 t/s | 44.2    |
| 4   | 274.5 t/s | 72.5    |     | 20  | 798.7 t/s | 41.6    |
| 6   | 380.3 t/s | 65.2    |     | 22  | 835.0 t/s | 39.5    |
| 8   | 454.2 t/s | 58.9    |     | **24**  | **879.5 t/s** | 37.2 |
| 10  | 518.9 t/s | 53.7    |     | 28  | 859.7 t/s | 31.7    |
| 12  | 613.8 t/s | 52.6    |     | 32  | 736.8 t/s | 32.1    |
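
One way to read the table is to compare each aggregate figure with perfect linear scaling from the single-stream rate. The sketch below does that with the aggregate column copied from the table; "scaling efficiency" is a metric defined for this example, not one reported by the benchmark.

```python
# (concurrent requests, aggregate tok/s) copied from the 64K table above.
AGGREGATE_64K = [
    (1, 81.0), (2, 134.0), (3, 205.1), (4, 274.5), (6, 380.3), (8, 454.2),
    (10, 518.9), (12, 613.8), (14, 669.9), (16, 720.2), (18, 764.7),
    (20, 798.7), (22, 835.0), (24, 879.5), (28, 859.7), (32, 736.8),
]


def scaling_efficiency(rows):
    """Aggregate throughput divided by n times the single-stream throughput."""
    single = dict(rows)[1]
    return {n: agg / (n * single) for n, agg in rows}


def peak(rows):
    """Return the (concurrency, aggregate tok/s) row with the highest aggregate."""
    return max(rows, key=lambda row: row[1])


if __name__ == "__main__":
    eff = scaling_efficiency(AGGREGATE_64K)
    for n, agg in AGGREGATE_64K:
        print(f"{n:>2} reqs: {agg:6.1f} tok/s, {eff[n]:.0%} of linear")
    print("peak:", peak(AGGREGATE_64K))
```

The efficiency falls steadily as concurrency grows, while the aggregate keeps rising until 24 concurrent requests.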

### 256K context (1→16 req)

`83.3 → 131.4 → 203.8 → 269.7 → 376.0 → 442.0 → 516.4 → 618.8 → 666.4 → 701.3 t/s`
(per-req 83 → 45). 256K tracks 64K almost exactly: the hybrid KV cache (16/64
full-attention layers + 48/64 linear-attention layers) stays cheap at length.
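
A rough way to see why the hybrid layout keeps long context cheap is to count per-token KV bytes. The sketch reads the 16/64 split as 16 full-attention layers out of 64, which is an interpretation of the figures above. The head dimension of 256 is an assumption that does not appear in this card, and fp8 is taken as one byte per element with scale factors ignored.

```python
def kv_bytes_per_token(kv_layers: int, kv_heads: int, head_dim: int,
                       bytes_per_elem: int) -> int:
    """Bytes appended to the KV cache per token: one K and one V per caching layer."""
    return 2 * kv_layers * kv_heads * head_dim * bytes_per_elem


HEAD_DIM = 256  # ASSUMPTION: not stated in this card; replace with the real value.

# Only full-attention layers grow a per-token cache; linear-attention layers
# keep a fixed-size state, so they are left out of the per-token count.
hybrid = kv_bytes_per_token(kv_layers=16, kv_heads=4, head_dim=HEAD_DIM, bytes_per_elem=1)
all_full = kv_bytes_per_token(kv_layers=64, kv_heads=4, head_dim=HEAD_DIM, bytes_per_elem=1)

if __name__ == "__main__":
    print(f"hybrid: {hybrid} B/token, all-full: {all_full} B/token, "
          f"ratio: {hybrid / all_full:.2f}")
```

The ratio does not depend on the assumed head dimension: under this reading, the growing part of the cache is a quarter of what an all-full-attention model with the same shape would need.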

**Takeaways:** peak throughput is **~880 tok/s @ 24 concurrent (64K)**; aggregate throughput
falls beyond that (859.7 t/s at 28, 736.8 t/s at 32). Long context is nearly free: 256K runs
16-way without OOM. For 256K use `--max-model-len 262144 --max-num-seqs 8`; a 128K
single-request run (`--max-model-len 131072 --max-num-seqs 1`) reaches ~83.9 tok/s.

---

## Rituals (gotchas)

1. **Kill zombie GPU procs** — a failed/cancelled launch leaves workers in VRAM:
   `nvidia-smi --query-compute-apps=pid,process_name --format=csv,noheader` → `kill -9 <Worker_TP* PIDs>`.
2. **First launch is slow** — torch.compile + Triton + NVFP4 warmup ≈2 min. Wait for
   `Application startup complete` / `Uvicorn running on http://0.0.0.0:8000`.
3. **`gpu-memory-utilization` must exceed real usage** — clean start ≈7.2 GiB/GPU; with
   0.85 vLLM targets ~13.2 GiB leaving ~6 GiB KV. `Free memory < desired…` = residual
   allocation from a previous run (see #1).
4. **Concurrent NCCL init can hang** — bringing up two TP servers at once may spin one at
   NCCL init (GPUs stuck ~370 MiB / 100% util / low watts). Start them **one at a time**,
   or set `NCCL_P2P_DISABLE=1` for the smaller group.
5. **MTP acceptance** — `num_speculative_tokens>1` reuses one MTP layer per step; higher
   values trade acceptance for draft depth. `n=3` is a good default here.
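
The acceptance trade-off in item 5 can be made concrete with a small model of speculative decoding. This sketch assumes a draft token only counts if every earlier draft token in the same step was accepted, and that each verify step always yields one token from the target model. The acceptance probabilities are hypothetical placeholders, not measurements from this checkpoint.

```python
def expected_tokens_per_step(accept_probs: list[float]) -> float:
    """Expected tokens emitted per verify step.

    accept_probs[k] is the chance that draft token k+1 is accepted, given
    that all earlier draft tokens in the step were accepted.
    """
    expected = 1.0  # the target model always contributes one token
    survive = 1.0   # probability that the draft chain is still intact
    for p in accept_probs:
        survive *= p
        expected += survive
    return expected


if __name__ == "__main__":
    # HYPOTHETICAL acceptance rates that fall with draft depth.
    rates = [0.8, 0.6, 0.4, 0.3, 0.2]
    for n in range(len(rates) + 1):
        print(f"n={n}: {expected_tokens_per_step(rates[:n]):.3f} tokens/step")
```

Each extra draft token adds less than the one before it while still costing draft compute, which is why a moderate depth tends to be the practical choice.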

---

## OpenCode provider

```jsonc
// ~/.config/opencode/opencode.jsonc
{
  "provider": {
    "local-vllm": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "Local vLLM",
      "options": { "baseURL": "http://127.0.0.1:8000/v1", "apiKey": "EMPTY" },
      "models": {
        "huihui-qwen36-27b-local": {
          "name": "Huihui Qwen3.6 27B NVFP4 MTP Local",
          "reasoning": true, "tool_call": true, "temperature": true,
          "limit": { "context": 65536, "output": 8192 }
        }
      }
    }
  },
  "model": "local-vllm/huihui-qwen36-27b-local",
  "small_model": "local-vllm/huihui-qwen36-27b-local"
}
```

---

## What's inside

- **Quantized → NVFP4** (modelopt 0.43, W4A4, group 16): the Linear layers are quantized;
  `lm_head`, conv/short-conv, routers and the MTP embedding are kept at higher precision
  (`ignore` list in `config.json` / `hf_quant_config.json`).
- **MTP** draft head (`mtp_num_hidden_layers: 1`) → speculative decoding via vLLM.
- Files: `model.safetensors` (~20 GB), `config.json`, `hf_quant_config.json`,
  `chat_template.jinja`, tokenizer, and this Docker package.