---
license: apache-2.0
library_name: vllm
pipeline_tag: text-generation
tags:
- nvfp4
- modelopt
- mtp
- speculative-decoding
- qwen3_5
- blackwell
- vllm
- abliterated
---

# Huihui-Qwen3.6-27B-abliterated-NVFP4-MTP

NVFP4 (modelopt **W4A4**) quant of **Huihui-Qwen3.6-27B-abliterated**, a Qwen3.5-family **hybrid** model (linear attention + periodic full attention) with a built-in **MTP** (multi-token-prediction) head for speculative decoding. Multimodal-capable (`Qwen3_5ForConditionalGeneration`, vision/video tokens) but served here as a text generation / reasoning + tool-calling model. Fits **4× 16 GB Blackwell (SM120)**.

- ~7.2 GiB/GPU weights at TP=4 · 64K–262K context · `<think>` reasoning · XML tool-calls
- Single-stream **~81–83 tok/s** (TP=4, MTP n=3); peak **~880 tok/s** @ 24 concurrent (64K)

---

## TL;DR: run it (no build required)

The official vLLM image already ships the `qwen3_5` architecture **and** the `Qwen3_5MTP` draft module, so you do not need to build anything.

```bash
# from this directory; pick exactly TP_SIZE GPUs and avoid your display GPU
CUDA_VISIBLE_DEVICES=0,1,2,3 ./run.sh up
./run.sh test     # waits for /v1/models
./run.sh bench    # one-shot smoke test
```

Or the raw `docker run` (what `run.sh`/`compose.yaml` wrap):

```bash
docker run -d --name vllm-huihui --runtime nvidia --gpus '"device=0,1,2,3"' \
  -p 8000:8000 -v "$PWD":/model:ro --shm-size 32g \
  -e VLLM_USE_FLASHINFER_SAMPLER=1 -e TORCH_MATMUL_PRECISION=high \
  --entrypoint vllm vllm/vllm-openai:v0.22.0 serve /model \
  --served-model-name huihui-qwen36-27b-local \
  --trust-remote-code --tensor-parallel-size 4 --quantization modelopt \
  --max-model-len 65536 --max-num-seqs 8 --max-num-batched-tokens 16384 \
  --gpu-memory-utilization 0.85 --kv-cache-dtype fp8 --dtype auto \
  --reasoning-parser qwen3 \
  --speculative-config '{"method":"qwen3_5_mtp","num_speculative_tokens":3}' \
  --chat-template /model/chat_template.jinja \
  --enable-auto-tool-choice --tool-call-parser qwen3_xml \
  --host 0.0.0.0 --port 8000
```

Smoke test:

```bash
# The prompt is Japanese for "Three famous sights in Tokyo, briefly."
curl -s localhost:8000/v1/chat/completions -H 'Content-Type: application/json' -d '{
  "model":"huihui-qwen36-27b-local",
  "messages":[{"role":"user","content":"東京の名所を3つ、簡潔に。"}],
  "max_tokens":512, "temperature":0.7, "top_k":20, "top_p":0.95}' | jq .
```

---

## Hardware & requirements

- **4× NVIDIA RTX PRO 2000 Blackwell** (16 GB each, SM120), PCIe (no NVLink).
- Docker + NVIDIA Container Toolkit. The pre-built **`vllm/vllm-openai:v0.22.0`** image carries vLLM ≥0.22 with NVFP4/modelopt + FlashInfer FP4 kernels and the qwen3_5 + MTP code.
- TP=4 sharding is clean: heads 24, KV heads 4, hidden 5120, intermediate 17408, all divisible by 4.

> Bare-metal (no container) also works: `pip install vllm` (≥0.21 introduced qwen3_5),
> CUDA 13.x toolchain for the SM120 Triton/NVFP4 kernels, then the same `vllm serve` flags.

---

## Flags, and why

| flag | value | why |
|---|---|---|
| `--quantization modelopt` | **required** | checkpoint is NVFP4 (`hf_quant_config.json`); without it weights read as garbage. |
| `--trust-remote-code` | recommended | `qwen3_5` multimodal config. |
| `--tensor-parallel-size` | `4` | model needs ~7.2 GiB/GPU; 4× 16 GB is the design point. |
| `--max-model-len` | `65536` (≤ `262144`) | hybrid attention keeps KV cheap, so long context is affordable. |
| `--max-num-seqs` | `8` (peak `24` @64K) | concurrent slots. See benchmarks for the throughput curve. |
| `--kv-cache-dtype fp8` | recommended | ~2× KV capacity for more concurrency / longer context. |
| `--gpu-memory-utilization` | `0.85` | model ≈7.2 GiB/GPU → ~6 GiB left for KV. Raise only on a clean card. |
| `--reasoning-parser qwen3` | recommended | splits `<think>…</think>` into `reasoning_content`; answer in `content`. |
| `--speculative-config '{"method":"qwen3_5_mtp","num_speculative_tokens":3}'` | recommended | turns on the **MTP** draft head. vLLM ≥0.22 auto-maps `qwen3_5_mtp → mtp` (harmless deprecation warning). `SPEC_TOKENS=0` disables it. |
| `--enable-auto-tool-choice --tool-call-parser qwen3_xml` | optional (agentic) | parses Qwen3 XML tool calls. Drop for pure chat (`ENABLE_TOOLS=0`). |

**Sampling (Qwen default, `generation_config.json`):** `temperature=0.7`, `top_k=20`, `top_p=0.95`. It is a reasoning model, so give it room (`max_tokens ≥ 512`).

---

## Docker package (bundled)

`compose.yaml` · `entrypoint.sh` · `run.sh` · `Dockerfile`. The compose file defaults to the official image plus a mounted `entrypoint.sh` (build-free). Every flag is env-overridable:

```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 ./run.sh up          # start on those 4 GPUs
MAX_MODEL_LEN=262144 MAX_NUM_SEQS=8 ./run.sh up   # 256K long-context mode
MAX_MODEL_LEN=131072 MAX_NUM_SEQS=1 ./run.sh up   # 128K single-request benchmark
SPEC_TOKENS=0 ./run.sh up                         # disable MTP speculative decoding
ENABLE_TOOLS=0 ./run.sh up                        # pure chat (no tool parser)
PORT=8001 ./run.sh up                             # serve on a different host port
./run.sh logs                                     # tail
./run.sh down                                     # stop
```

Env knobs: `PORT`, `MAX_MODEL_LEN`, `MAX_NUM_SEQS`, `MAX_NUM_BATCHED_TOKENS`, `GPU_MEM_UTIL`, `KV_CACHE_DTYPE`, `SPEC_TOKENS`, `TP_SIZE`, `ENABLE_TOOLS`, `REASONING_PARSER`, `TOOL_CALL_PARSER`, `CUDA_VISIBLE_DEVICES`, `VLLM_IMAGE`.

The model weights are mounted read-only (`. → /model`); the image carries only the runtime. `shm_size: 32g` is set (vLLM V1 uses a lot of shared memory).

To build a self-contained image instead: uncomment the `build:` block in `compose.yaml` and run `./run.sh rebuild` (the `Dockerfile` just pip-installs vLLM on a CUDA 13.1 base).

---

## Benchmark results (RTX PRO 2000 Blackwell ×4, TP=4, MTP n=3)

Conditions: 512 output tokens, ~350-token prompt, `--kv-cache-dtype fp8`, `--gpu-memory-utilization 0.85`.

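The conditions above can be approximated with a small concurrency probe. What follows is a hedged sketch, not the bundled `./run.sh bench`: the function names `tok_per_sec` and `bench_concurrent` are hypothetical, the prompt is a short placeholder rather than a ~350-token one, and it assumes `curl`, `jq`, `awk`, and GNU `date` (for `%N`) are available and that the server reports `usage.completion_tokens` in its OpenAI-compatible response.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for illustration; not part of the bundled run.sh.

# Aggregate throughput = generated tokens / wall-clock seconds.
tok_per_sec() {
  awk -v t="$1" -v s="$2" \
    'BEGIN { if (s <= 0) { print "nan"; exit }; printf "%.1f\n", t / s }'
}

# Fire N identical chat requests in parallel and print aggregate tok/s.
# Assumes the server from this README is listening on localhost:8000.
bench_concurrent() {
  local n="${1:-1}" url="${2:-http://localhost:8000/v1/chat/completions}"
  local tmp start end total elapsed i
  tmp="$(mktemp -d)"
  start="$(date +%s.%N)"
  for i in $(seq 1 "$n"); do
    # Placeholder prompt; swap in a ~350-token prompt to match the conditions.
    curl -s "$url" -H 'Content-Type: application/json' -d '{
      "model":"huihui-qwen36-27b-local",
      "messages":[{"role":"user","content":"Explain speculative decoding in detail."}],
      "max_tokens":512, "temperature":0.7, "top_k":20, "top_p":0.95}' \
      | jq -r '.usage.completion_tokens // 0' > "$tmp/$i" &
  done
  wait  # block until every background request has finished
  end="$(date +%s.%N)"
  total="$(cat "$tmp"/* | awk '{ s += $1 } END { print s + 0 }')"
  elapsed="$(awk -v a="$start" -v b="$end" 'BEGIN { print b - a }')"
  rm -rf "$tmp"
  echo "requests=$n tokens=$total tok/s=$(tok_per_sec "$total" "$elapsed")"
}
```

Once the server is up, `bench_concurrent 8` prints one summary line for eight parallel requests. The figure includes prefill time and HTTP overhead, so it will not necessarily match the numbers reported in this section.
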
### 64K context

| Req | Aggregate | Per-req (t/s) | | Req | Aggregate | Per-req (t/s) |
|----:|----------:|--------:|-----|----:|----------:|--------:|
| 1 | 81.0 t/s | 81.0 | | 14 | 669.9 t/s | 49.5 |
| 2 | 134.0 t/s | 67.0 | | 16 | 720.2 t/s | 46.1 |
| 3 | 205.1 t/s | 71.6 | | 18 | 764.7 t/s | 44.2 |
| 4 | 274.5 t/s | 72.5 | | 20 | 798.7 t/s | 41.6 |
| 6 | 380.3 t/s | 65.2 | | 22 | 835.0 t/s | 39.5 |
| 8 | 454.2 t/s | 58.9 | | **24** | **879.5 t/s** | 37.2 |
| 10 | 518.9 t/s | 53.7 | | 28 | 859.7 t/s | 31.7 |
| 12 | 613.8 t/s | 52.6 | | 32 | 736.8 t/s | 32.1 |

### 256K context (1→16 req)

`83.3 → 131.4 → 203.8 → 269.7 → 376.0 → 442.0 → 516.4 → 618.8 → 666.4 → 701.3 t/s` (per-req 83 → 45). 256K tracks 64K almost exactly: the hybrid KV (16/64 full + 48/64 linear attention) stays cheap at length.

**Takeaways:** peak throughput is **~880 tok/s @ 24 concurrent (64K)**, falling to ~860 tok/s at 28 and ~737 tok/s at 32. Long context is nearly free: 256K runs 16-way without OOM. For 256K use `--max-model-len 262144 --max-num-seqs 8`; a 128K single-request run (`--max-num-seqs 1`) reaches ~83.9 tok/s.

---

## Rituals (gotchas)

1. **Kill zombie GPU procs**: a failed/cancelled launch leaves workers in VRAM: `nvidia-smi --query-compute-apps=pid,process_name --format=csv,noheader` → `kill -9 <pid>`.
2. **First launch is slow**: torch.compile + Triton + NVFP4 warmup takes ≈2 min. Wait for `Application startup complete` / `Uvicorn running on http://0.0.0.0:8000`.
3. **`gpu-memory-utilization` must exceed real usage**: a clean start uses ≈7.2 GiB/GPU; with 0.85 vLLM targets ~13.2 GiB, leaving ~6 GiB for KV. `Free memory < desired…` means a residual allocation from a previous run (see #1).
4. **Concurrent NCCL init can hang**: bringing up two TP servers at once may leave one spinning at NCCL init (GPUs stuck at ~370 MiB / 100% util / low watts). Start them **one at a time**, or set `NCCL_P2P_DISABLE=1` for the smaller group.
5. **MTP acceptance**: `num_speculative_tokens>1` reuses one MTP layer per step; higher values trade acceptance for draft depth. `n=3` is a good default here.

---

## OpenCode provider

```jsonc
// ~/.config/opencode/opencode.jsonc
{
  "provider": {
    "local-vllm": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "Local vLLM",
      "options": { "baseURL": "http://127.0.0.1:8000/v1", "apiKey": "EMPTY" },
      "models": {
        "huihui-qwen36-27b-local": {
          "name": "Huihui Qwen3.6 27B NVFP4 MTP Local",
          "reasoning": true,
          "tool_call": true,
          "temperature": true,
          "limit": { "context": 65536, "output": 8192 }
        }
      }
    }
  },
  "model": "local-vllm/huihui-qwen36-27b-local",
  "small_model": "local-vllm/huihui-qwen36-27b-local"
}
```

---

## What's inside

- **Quantized → NVFP4** (modelopt 0.43, W4A4, group 16): the Linear layers; `lm_head`, conv/short-conv, routers and the MTP embedding are kept at higher precision (`ignore` list in `config.json` / `hf_quant_config.json`).
- **MTP** draft head (`mtp_num_hidden_layers: 1`) → speculative decoding via vLLM.
- Files: `model.safetensors` (~20 GB), `config.json`, `hf_quant_config.json`, `chat_template.jinja`, tokenizer, and this Docker package.
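
Two of the figures quoted in this README, the TP divisibility under Hardware & requirements and the KV headroom in ritual 3, are plain arithmetic and can be checked before launching. The following is a minimal sketch assuming bash and awk; the function names `tp_divides` and `kv_headroom_gib` are hypothetical, and the 15.5 GiB usable-VRAM figure used in the usage note is an assumption inferred from the ~13.2 GiB target at 0.85, not a measured value.

```bash
#!/usr/bin/env bash
# Hypothetical preflight helpers; not part of the bundled Docker package.

# Succeeds only if every listed dimension divides evenly by the TP size.
tp_divides() {
  local tp="$1" d
  shift
  for d in "$@"; do
    (( d % tp == 0 )) || return 1
  done
}

# KV headroom per GPU (GiB) = usable VRAM * gpu-memory-utilization - weights.
kv_headroom_gib() {
  awk -v total="$1" -v util="$2" -v weights="$3" \
    'BEGIN { printf "%.1f\n", total * util - weights }'
}
```

For example, `tp_divides 4 24 4 5120 17408 && echo "TP=4 shards cleanly"` checks heads, KV heads, hidden size, and intermediate size in one call, and `kv_headroom_gib 15.5 0.85 7.2` estimates the per-GPU KV budget under the assumed usable VRAM.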