Instructions to use sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForMultimodalLM

processor = AutoProcessor.from_pretrained("sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP")
model = AutoModelForMultimodalLM.from_pretrained("sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

vLLM

How to use sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP

SGLang

How to use sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

--enable-prefix-caching and Feedback from HGX B200 Deployment

by cgelias - opened Apr 27

Does this model support prefix caching on vLLM while also using MTP? Or do we have to wait until this PR is merged:
https://github.com/vllm-project/vllm/pull/26807

I'm getting 200+ tokens/s decode on a single B200, but prefill and TTFT suffer without prefix caching.

I really appreciate this model. Any thoughts?
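
For anyone who wants to put a number on the TTFT problem, here is a minimal sketch in Python. It assumes a streaming request (`"stream": true`) to the OpenAI-compatible `/v1/chat/completions` route and an HTTP client that yields the response as decoded `data: {...}` lines. The helper name `time_to_first_token` is hypothetical, and the reasoning field names it checks are assumptions that may differ between server versions.

```python
import json
import time
from typing import Callable, Iterable, Optional


def time_to_first_token(
    sse_lines: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Optional[float]:
    """Return seconds from the call until the first non-empty streamed delta.

    `sse_lines` is any iterable of decoded server-sent-event lines from a
    streaming chat completion. Returns None if the stream ends without text.
    """
    start = clock()
    for line in sse_lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            delta = choice.get("delta", {})
            # Count answer text or reasoning text. The reasoning field
            # names below are assumptions, not a confirmed schema.
            text = (
                delta.get("content")
                or delta.get("reasoning_content")
                or delta.get("reasoning")
            )
            if text:
                return clock() - start
    return None
```

The clock starts when the function is called, so call it right after sending the request. Comparing the result for a cold prompt against a repeated prompt with a long shared prefix is one way to see how much prefix caching would help.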

For reference, here is my current vLLM config and startup output:

Verify if container recognizes the gpu:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Wed_Aug_20_01:58:59_PM_PDT_2025
Cuda compilation tools, release 13.0, V13.0.88
Build cuda_13.0.r13.0/compiler.36424714_0
Sun Apr 26 21:50:41 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.65.06              Driver Version: 580.65.06      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA B200                    On  |   00000000:3A:00.0 Off |                    0 |
| N/A   30C    P0            141W / 1000W |       0MiB / 183359MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Verify if PyTorch recognizes the GPU
pytorch version: 2.11.0+cu130
CUDA version recognized by pytorch: 13.0
Recognized CUDA device: True

VLLM args:
>>> -O3
>>> --download-dir /data/vllm/download
>>> --enable-auto-tool-choice
>>> --gpu-memory-utilization 0.9
>>> --kv-cache-dtype fp8_e4m3
>>> --language-model-only
>>> --max-model-len 179200
>>> --max-num-batched-tokens 32768
>>> --max-num-seqs 32
>>> --model sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP
>>> --reasoning-parser qwen3
>>> --served-model-name Qwen3.6-27B
>>> --speculative-config {"method": "mtp", "num_speculative_tokens": 2}
>>> --tensor-parallel-size 1
>>> --tool-call-parser qwen3_coder
>>> --trust-remote-code
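
A note on dumping a launch command like this one: boolean flags such as `--enable-auto-tool-choice` take no value, so pairing tokens two at a time misaligns every flag after them. Below is a minimal Python sketch of a grouping that avoids this. The function name `group_cli_args` is hypothetical, and the sketch assumes no value starts with `-` (a negative number would be misread as a flag).

```python
from typing import List, Optional, Tuple


def group_cli_args(tokens: List[str]) -> List[Tuple[str, Optional[str]]]:
    """Group a flat argv list into (flag, value) pairs.

    A token opens a new flag; the next token is its value unless that
    token itself starts with "-", in which case the flag is boolean.
    """
    pairs: List[Tuple[str, Optional[str]]] = []
    i = 0
    while i < len(tokens):
        flag = tokens[i]
        has_value = i + 1 < len(tokens) and not tokens[i + 1].startswith("-")
        pairs.append((flag, tokens[i + 1] if has_value else None))
        i += 2 if has_value else 1
    return pairs


if __name__ == "__main__":
    argv = [
        "-O3",
        "--download-dir", "/data/vllm/download",
        "--enable-auto-tool-choice",
        "--gpu-memory-utilization", "0.9",
        "--trust-remote-code",
    ]
    for flag, value in group_cli_args(argv):
        print(f">>> {flag}" if value is None else f">>> {flag} {value}")
```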

Env:
>>> NVIDIA_VISIBLE_DEVICES=void
>>> HOSTNAME=e92e510eb6a8
>>> NCCL_P2P_DISABLE=1
>>> NVIDIA_REQUIRE_CUDA=cuda>=13.0 brand=unknown,driver>=535,driver<536 brand=grid,driver>=535,driver<536 brand=tesla,driver>=535,driver<536 brand=nvidia,driver>=535,driver<536 brand=quadro,driver>=535,driver<536 brand=quadrortx,driver>=535,driver<536 brand=nvidiartx,driver>=535,driver<536 brand=vapps,driver>=535,driver<536 brand=vpc,driver>=535,driver<536 brand=vcs,driver>=535,driver<536 brand=vws,driver>=535,driver<536 brand=cloudgaming,driver>=535,driver<536 brand=unknown,driver>=550,driver<551 brand=grid,driver>=550,driver<551 brand=tesla,driver>=550,driver<551 brand=nvidia,driver>=550,driver<551 brand=quadro,driver>=550,driver<551 brand=quadrortx,driver>=550,driver<551 brand=nvidiartx,driver>=550,driver<551 brand=vapps,driver>=550,driver<551 brand=vpc,driver>=550,driver<551 brand=vcs,driver>=550,driver<551 brand=vws,driver>=550,driver<551 brand=cloudgaming,driver>=550,driver<551 brand=unknown,driver>=565,driver<566 brand=grid,driver>=565,driver<566 brand=tesla,driver>=565,driver<566 brand=nvidia,driver>=565,driver<566 brand=quadro,driver>=565,driver<566 brand=quadrortx,driver>=565,driver<566 brand=nvidiartx,driver>=565,driver<566 brand=vapps,driver>=565,driver<566 brand=vpc,driver>=565,driver<566 brand=vcs,driver>=565,driver<566 brand=vws,driver>=565,driver<566 brand=cloudgaming,driver>=565,driver<566 brand=unknown,driver>=570,driver<571 brand=grid,driver>=570,driver<571 brand=tesla,driver>=570,driver<571 brand=nvidia,driver>=570,driver<571 brand=quadro,driver>=570,driver<571 brand=quadrortx,driver>=570,driver<571 brand=nvidiartx,driver>=570,driver<571 brand=vapps,driver>=570,driver<571 brand=vpc,driver>=570,driver<571 brand=vcs,driver>=570,driver<571 brand=vws,driver>=570,driver<571 brand=cloudgaming,driver>=570,driver<571 brand=unknown,driver>=575,driver<576 brand=grid,driver>=575,driver<576 brand=tesla,driver>=575,driver<576 brand=nvidia,driver>=575,driver<576 brand=quadro,driver>=575,driver<576 brand=quadrortx,driver>=575,driver<576 
brand=nvidiartx,driver>=575,driver<576 brand=vapps,driver>=575,driver<576 brand=vpc,driver>=575,driver<576 brand=vcs,driver>=575,driver<576 brand=vws,driver>=575,driver<576 brand=cloudgaming,driver>=575,driver<576
>>> TORCH_CUDA_ARCH_LIST=7.5 8.0 8.6 8.9 9.0 10.0 12.0+PTX
>>> PWD=/vllm-workspace
>>> NVIDIA_DRIVER_CAPABILITIES=compute,utility
>>> NV_CUDA_CUDART_VERSION=13.0.88-1
>>> VLLM_USAGE_SOURCE=production-docker-image
>>> HOME=/root
>>> CUDA_VERSION=13.0.1
>>> HF_TOKEN=hf_**********************************
>>> UV_LINK_MODE=copy
>>> VLLM_ENABLE_CUDA_COMPATIBILITY=0
>>> UV_INDEX_STRATEGY=unsafe-best-match
>>> DO_NOT_TRACK=1
>>> SHLVL=1
>>> NVARCH=x86_64
>>> LD_LIBRARY_PATH=/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64
>>> SSL_CERT_FILE=/etc/ssl/certs/ca-certificates.crt
>>> VLLM_NO_USAGE_STATS=1
>>> REQUESTS_CA_BUNDLE=/etc/ssl/certs/ca-certificates.crt
>>> NVIDIA_CTK_LIBCUDA_DIR=/usr/lib/x86_64-linux-gnu
>>> VLLM_LOGGING_LEVEL=INFO
>>> PATH=/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
>>> UV_HTTP_TIMEOUT=500
>>> DEBIAN_FRONTEND=noninteractive
>>> VLLM_API_KEY=****************************************
>>> _=/usr/bin/printenv

Starting vllm:
(APIServer pid=1) INFO 04-26 21:51:11 [utils.py:299]
(APIServer pid=1) INFO 04-26 21:51:11 [utils.py:299]        █     █     █▄   ▄█
(APIServer pid=1) INFO 04-26 21:51:11 [utils.py:299]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.20.0
(APIServer pid=1) INFO 04-26 21:51:11 [utils.py:299]   █▄█▀ █     █     █     █  model   sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP
(APIServer pid=1) INFO 04-26 21:51:11 [utils.py:299]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀
(APIServer pid=1) INFO 04-26 21:51:11 [utils.py:299]
(APIServer pid=1) INFO 04-26 21:51:11 [utils.py:233] non-default args: {'model_tag': 'None', 'enable_auto_tool_choice': True, 'tool_call_parser': 'qwen3_coder', 'model': 'sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP', 'trust_remote_code': True, 'max_model_len': 179200, 'served_model_name': ['Qwen3.6-27B'], 'download_dir': '/data/vllm/download', 'reasoning_parser': 'qwen3', 'kv_cache_dtype': 'fp8_e4m3', 'language_model_only': True, 'max_num_batched_tokens': 32768, 'max_num_seqs': 32, 'speculative_config': {'method': 'mtp', 'num_speculative_tokens': 2}, 'optimization_level': '3'}
(APIServer pid=1) INFO 04-26 21:52:11 [model.py:554] Resolved architecture: Qwen3_5ForConditionalGeneration
(APIServer pid=1) INFO 04-26 21:52:11 [model.py:1685] Using max model len 179200
(APIServer pid=1) INFO 04-26 21:52:13 [cache.py:247] Using fp8_e4m3 data type to store kv cache. It reduces the GPU memory footprint and boosts the performance. Meanwhile, it may cause accuracy drop without a proper scaling factor
(APIServer pid=1) INFO 04-26 21:52:25 [model.py:554] Resolved architecture: Qwen3_5MTP
(APIServer pid=1) INFO 04-26 21:52:25 [model.py:1685] Using max model len 262144
(APIServer pid=1) WARNING 04-26 21:52:25 [speculative.py:532] Enabling num_speculative_tokens > 1 will run multiple times of forward on same MTP layer,which may result in lower acceptance rate
(APIServer pid=1) INFO 04-26 21:52:25 [scheduler.py:239] Chunked prefill is enabled with max_num_batched_tokens=32768.
(APIServer pid=1) WARNING 04-26 21:52:25 [modelopt.py:1013] Detected ModelOpt NVFP4 checkpoint. Please note that the format is experimental and could change in future.
(APIServer pid=1) INFO 04-26 21:52:25 [vllm.py:834] Asynchronous scheduling is enabled.
(APIServer pid=1) INFO 04-26 21:52:25 [kernel.py:199] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'])
(APIServer pid=1) INFO 04-26 21:52:32 [compilation.py:294] Enabled custom fusions: act_quant
(APIServer pid=1) [transformers] `Qwen2VLImageProcessorFast` is deprecated. The `Fast` suffix for image processors has been removed; use `Qwen2VLImageProcessor` instead.
(EngineCore pid=2051) [transformers] `Qwen2VLImageProcessorFast` is deprecated. The `Fast` suffix for image processors has been removed; use `Qwen2VLImageProcessor` instead.
(APIServer pid=1) INFO 04-26 21:52:33 [registry.py:126] All limits of multimodal modalities supported by the model are set to 0, running in text-only mode.
(EngineCore pid=2051) INFO 04-26 21:52:48 [core.py:107] Initializing a V1 LLM engine (v0.20.0) with config: model='sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP', speculative_config=SpeculativeConfig(method='mtp', model='sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP', num_spec_tokens=2), tokenizer='sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=179200, download_dir='/data/vllm/download', load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=modelopt_fp4, quantization_config=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=fp8_e4m3, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='qwen3', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=Qwen3.6-27B, enable_prefix_caching=False, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'ir_enable_torch_wrap': True, 'splitting_ops': ['vllm::unified_attention_with_output', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 
'vllm::gdn_attention_core', 'vllm::olmo_hybrid_gdn_full_forward', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_vision_items_per_batch': 0, 'encoder_cudagraph_max_frames_per_batch': 0, 'compile_sizes': [], 'compile_ranges_endpoints': [32768], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'size_asserts': False, 'alignment_asserts': False, 'scalar_asserts': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': True, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 192, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': False, 'static_all_moe_layers': []}, kernel_config=KernelConfig(ir_op_priority=IrOpPriorityConfig(rms_norm=['native']), enable_flashinfer_autotune=True, moe_backend='auto')
(EngineCore pid=2051) INFO 04-26 21:52:52 [registry.py:126] All limits of multimodal modalities supported by the model are set to 0, running in text-only mode.
(EngineCore pid=2051) INFO 04-26 21:52:52 [parallel_state.py:1400] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://*********:38535 backend=nccl
(EngineCore pid=2051) INFO 04-26 21:52:52 [parallel_state.py:1713] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A
(EngineCore pid=2051) WARNING 04-26 21:52:53 [__init__.py:206] min_p and logit_bias parameters won't work with speculative decoding.
(EngineCore pid=2051) INFO 04-26 21:52:53 [gpu_model_runner.py:4752] Starting to load model sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP...
(EngineCore pid=2051) INFO 04-26 21:52:54 [cuda.py:424] Using backend AttentionBackendEnum.FLASH_ATTN for vit attention
(EngineCore pid=2051) INFO 04-26 21:52:54 [mm_encoder_attention.py:230] Using AttentionBackendEnum.FLASH_ATTN for MMEncoderAttention.
(EngineCore pid=2051) INFO 04-26 21:52:54 [__init__.py:683] Using FlashInferCutlassNvFp4LinearKernel for NVFP4 GEMM
(EngineCore pid=2051) INFO 04-26 21:52:54 [gdn_linear_attn.py:155] Using Triton/FLA GDN prefill kernel
(EngineCore pid=2051) INFO 04-26 21:52:54 [cuda.py:368] Using FLASHINFER attention backend out of potential backends: ['FLASHINFER', 'TRITON_ATTN'].
(EngineCore pid=2051) INFO 04-26 21:52:54 [selector.py:132] Using HND KV cache layout for FLASHINFER backend.
(EngineCore pid=2051) INFO 04-26 21:52:54 [deep_gemm.py:115] DeepGEMM E8M0 enabled on current platform.
(EngineCore pid=2051) INFO 04-26 21:52:56 [weight_utils.py:659] No model.safetensors.index.json found in remote.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
(EngineCore pid=2051) INFO 04-26 21:52:56 [weight_utils.py:904] Filesystem type for checkpoints: BTRFS. Checkpoint size: 18.29 GiB. Available RAM: 1920.73 GiB.
(EngineCore pid=2051) INFO 04-26 21:52:56 [weight_utils.py:927] Auto-prefetch is disabled because the filesystem (BTRFS) is not a recognized network FS (NFS/Lustre). If you want to force prefetching, start vLLM with --safetensors-load-strategy=prefetch.
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:03<00:00,  3.34s/it]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:03<00:00,  3.34s/it]
(EngineCore pid=2051)
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:01<00:00,  1.06s/it]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:01<00:00,  1.06s/it]
(EngineCore pid=2051)
(EngineCore pid=2051) INFO 04-26 21:52:59 [default_loader.py:384] Loading weights took 3.69 seconds
(EngineCore pid=2051) WARNING 04-26 21:52:59 [kv_cache.py:109] Checkpoint does not provide a q scaling factor. Setting it to k_scale. This only matters for FP8 Attention backends (flash-attn or flashinfer).
(EngineCore pid=2051) WARNING 04-26 21:52:59 [kv_cache.py:123] Using KV cache scaling factor 1.0 for fp8_e4m3. If this is unintended, verify that k/v_scale scaling factors are properly set in the checkpoint.
(EngineCore pid=2051) INFO 04-26 21:52:59 [gpu_model_runner.py:4776] Loading drafter model...
(EngineCore pid=2051) INFO 04-26 21:53:00 [weight_utils.py:659] No model.safetensors.index.json found in remote.
(EngineCore pid=2051) INFO 04-26 21:53:00 [weight_utils.py:904] Filesystem type for checkpoints: BTRFS. Checkpoint size: 18.29 GiB. Available RAM: 1920.54 GiB.
(EngineCore pid=2051) INFO 04-26 21:53:01 [default_loader.py:384] Loading weights took 1.18 seconds
(EngineCore pid=2051) INFO 04-26 21:53:01 [eagle.py:1425] Detected MTP model. Sharing target model embedding weights with the draft model.
(EngineCore pid=2051) INFO 04-26 21:53:01 [eagle.py:1481] Detected MTP model. Sharing target model lm_head weights with the draft model.
(EngineCore pid=2051) INFO 04-26 21:53:01 [gpu_model_runner.py:4837] Model loading took 18.55 GiB memory and 7.411608 seconds
(EngineCore pid=2051) INFO 04-26 21:53:01 [interface.py:606] Setting attention block size to 1600 tokens to ensure that attention page size is >= mamba page size.
(EngineCore pid=2051) INFO 04-26 21:53:01 [interface.py:630] Padding mamba page size by 0.88% to ensure that mamba page size and attention page size are exactly equal.
(EngineCore pid=2051) INFO 04-26 21:53:22 [backends.py:1077] Using cache directory: /root/.cache/vllm/torch_compile_cache/4c919e5dad/rank_0_0/backbone for vLLM's torch.compile
![claude_b9akDIiYx5](https://cdn-uploads.huggingface.co/production/uploads/6717761a6599a79d31f483d1/IyZbY3qNEHEy27JSxiS3v.png)

(EngineCore pid=2051) INFO 04-26 21:53:22 [backends.py:1137] Dynamo bytecode transform time: 20.73 s
(EngineCore pid=2051) INFO 04-26 21:53:31 [backends.py:377] Cache the graph of compile range (1, 32768) for later use
(EngineCore pid=2051) INFO 04-26 21:54:44 [backends.py:398] Compiling a graph for compile range (1, 32768) takes 80.66 s
(EngineCore pid=2051) INFO 04-26 21:54:54 [decorators.py:665] saved AOT compiled function to /root/.cache/vllm/torch_compile_cache/torch_aot_compile/ee57439e4b671222aec9e727c54957d772d1fb982891ac133bef1e83b760cf33/rank_0_0/model
(EngineCore pid=2051) INFO 04-26 21:54:54 [monitor.py:48] torch.compile took 112.76 s in total
(EngineCore pid=2051) INFO 04-26 21:56:36 [monitor.py:76] Initial profiling/warmup run took 101.73 s
(EngineCore pid=2051) INFO 04-26 21:56:37 [backends.py:1077] Using cache directory: /root/.cache/vllm/torch_compile_cache/4c919e5dad/rank_0_0/eagle_head for vLLM's torch.compile
(EngineCore pid=2051) INFO 04-26 21:56:37 [backends.py:1137] Dynamo bytecode transform time: 0.91 s
(EngineCore pid=2051) INFO 04-26 21:57:03 [backends.py:398] Compiling a graph for compile range (1, 32768) takes 25.46 s
(EngineCore pid=2051) INFO 04-26 21:57:03 [decorators.py:665] saved AOT compiled function to /root/.cache/vllm/torch_compile_cache/torch_aot_compile/b7f8179a9f54ed62f029189106417382f0c152d89a0e90fcb7fbebfd894fb6ce/rank_0_0/model
(EngineCore pid=2051) INFO 04-26 21:57:03 [monitor.py:48] torch.compile took 26.98 s in total
(EngineCore pid=2051) INFO 04-26 21:57:04 [monitor.py:76] Initial profiling/warmup run took 0.99 s
(EngineCore pid=2051) WARNING 04-26 21:57:16 [kv_cache_utils.py:1059] Add 3 padding layers, may waste at most 6.25% KV cache memory
(EngineCore pid=2051) INFO 04-26 21:57:16 [kv_cache_utils.py:829] Overriding num_gpu_blocks=0 with num_gpu_blocks_override=192
(EngineCore pid=2051) INFO 04-26 21:57:16 [utils.py:60] `_KV_CACHE_LAYOUT_OVERRIDE` variable detected. Setting KV cache layout to HND.
(EngineCore pid=2051) INFO 04-26 21:57:16 [gpu_model_runner.py:5916] Profiling CUDA graph memory: PIECEWISE=26 (largest=192), FULL=14 (largest=96)
(EngineCore pid=2051) INFO 04-26 21:57:17 [flashinfer.py:390] Using TRTLLM attention (query is quantized).
(EngineCore pid=2051) INFO 04-26 21:57:23 [gpu_model_runner.py:5995] Estimated CUDA graph memory: 0.82 GiB total
(EngineCore pid=2051) INFO 04-26 21:57:23 [gpu_worker.py:440] Available KV cache memory: 136.34 GiB
(EngineCore pid=2051) INFO 04-26 21:57:23 [gpu_worker.py:474] In v0.19, CUDA graph memory profiling will be enabled by default (VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1), which more accurately accounts for CUDA graph memory during KV cache allocation. To try it now, set VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1 and increase --gpu-memory-utilization from 0.9000 to 0.9046 to maintain the same effective KV cache size.
(EngineCore pid=2051) WARNING 04-26 21:57:23 [kv_cache_utils.py:1059] Add 3 padding layers, may waste at most 6.25% KV cache memory
(EngineCore pid=2051) INFO 04-26 21:57:23 [kv_cache_utils.py:1319] GPU KV cache size: 1,051,200 tokens
(EngineCore pid=2051) 2026-04-26 21:57:23,820 - INFO - autotuner.py:446 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
(EngineCore pid=2051) 2026-04-26 21:57:36,497 - INFO - autotuner.py:455 - flashinfer.jit: [Autotuner]: Autotuning process ends
(EngineCore pid=2051) INFO 04-26 21:57:23 [kv_cache_utils.py:1324] Maximum concurrency for 179,200 tokens per request: 21.72x
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|██████████| 26/26 [00:04<00:00,  5.26it/s]
Capturing CUDA graphs (decode, FULL): 100%|██████████| 14/14 [00:05<00:00,  2.39it/s]
(EngineCore pid=2051) INFO 04-26 21:57:48 [gpu_model_runner.py:6086] Graph capturing finished in 12 secs, took 0.49 GiB
(EngineCore pid=2051) INFO 04-26 21:57:48 [gpu_worker.py:601] CUDA graph pool memory: 0.49 GiB (actual), 0.82 GiB (estimated), difference: 0.32 GiB (65.6%).
(EngineCore pid=2051) INFO 04-26 21:57:48 [core.py:299] init engine (profile, create kv cache, warmup model) took 286.47 s (compilation: 127.75 s)
(EngineCore pid=2051) INFO 04-26 21:57:48 [vllm.py:834] Asynchronous scheduling is enabled.
(EngineCore pid=2051) INFO 04-26 21:57:48 [kernel.py:199] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'])
(EngineCore pid=2051) INFO 04-26 21:57:48 [compilation.py:294] Enabled custom fusions: act_quant
(APIServer pid=1) INFO 04-26 21:57:48 [api_server.py:598] Supported tasks: ['generate']
(APIServer pid=1) INFO 04-26 21:57:49 [parser_manager.py:202] "auto" tool choice has been enabled.
(APIServer pid=1) WARNING 04-26 21:57:49 [model.py:1442] Default vLLM sampling parameters have been overridden by the model's `generation_config.json`: `{'temperature': 1.0, 'top_k': 20, 'top_p': 0.95}`. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
(APIServer pid=1) INFO 04-26 21:57:52 [hf.py:314] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
(APIServer pid=1) WARNING 04-26 21:57:53 [base.py:247] Multi-modal warmup failed
(APIServer pid=1) INFO 04-26 21:57:53 [api_server.py:602] Starting vLLM server on http://0.0.0.0:8000
(APIServer pid=1) INFO 04-26 21:57:53 [launcher.py:37] Available routes are:
(APIServer pid=1) INFO 04-26 21:57:53 [launcher.py:46] Route: /openapi.json, Methods: HEAD, GET
(APIServer pid=1) INFO 04-26 21:57:53 [launcher.py:46] Route: /docs, Methods: HEAD, GET
(APIServer pid=1) INFO 04-26 21:57:53 [launcher.py:46] Route: /docs/oauth2-redirect, Methods: HEAD, GET
(APIServer pid=1) INFO 04-26 21:57:53 [launcher.py:46] Route: /redoc, Methods: HEAD, GET
(APIServer pid=1) INFO 04-26 21:57:53 [launcher.py:46] Route: /tokenize, Methods: POST
(APIServer pid=1) INFO 04-26 21:57:53 [launcher.py:46] Route: /detokenize, Methods: POST
(APIServer pid=1) INFO 04-26 21:57:53 [launcher.py:46] Route: /load, Methods: GET
(APIServer pid=1) INFO 04-26 21:57:53 [launcher.py:46] Route: /version, Methods: GET
(APIServer pid=1) INFO 04-26 21:57:53 [launcher.py:46] Route: /health, Methods: GET
(APIServer pid=1) INFO 04-26 21:57:53 [launcher.py:46] Route: /metrics, Methods: GET
(APIServer pid=1) INFO 04-26 21:57:53 [launcher.py:46] Route: /v1/models, Methods: GET
(APIServer pid=1) INFO 04-26 21:57:53 [launcher.py:46] Route: /ping, Methods: GET
(APIServer pid=1) INFO 04-26 21:57:53 [launcher.py:46] Route: /ping, Methods: POST
(APIServer pid=1) INFO 04-26 21:57:53 [launcher.py:46] Route: /invocations, Methods: POST
(APIServer pid=1) INFO 04-26 21:57:53 [launcher.py:46] Route: /v1/chat/completions, Methods: POST
(APIServer pid=1) INFO 04-26 21:57:53 [launcher.py:46] Route: /v1/chat/completions/batch, Methods: POST
(APIServer pid=1) INFO 04-26 21:57:53 [launcher.py:46] Route: /v1/responses, Methods: POST
(APIServer pid=1) INFO 04-26 21:57:53 [launcher.py:46] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=1) INFO 04-26 21:57:53 [launcher.py:46] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=1) INFO 04-26 21:57:53 [launcher.py:46] Route: /v1/completions, Methods: POST
(APIServer pid=1) INFO 04-26 21:57:53 [launcher.py:46] Route: /v1/messages, Methods: POST
(APIServer pid=1) INFO 04-26 21:57:53 [launcher.py:46] Route: /v1/messages/count_tokens, Methods: POST
(APIServer pid=1) INFO 04-26 21:57:53 [launcher.py:46] Route: /inference/v1/generate, Methods: POST
(APIServer pid=1) INFO 04-26 21:57:53 [launcher.py:46] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=1) INFO 04-26 21:57:53 [launcher.py:46] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=1) INFO:     Started server process [1]
(APIServer pid=1) INFO:     Waiting for application startup.
(APIServer pid=1) INFO:     Application startup complete.
(APIServer pid=1) INFO 04-26 21:57:53 [launcher.py:46] Route: /generative_scoring, Methods: POST
(APIServer pid=1) INFO 04-26 21:57:53 [launcher.py:46] Route: /v1/chat/completions/render, Methods: POST
(APIServer pid=1) INFO 04-26 21:57:53 [launcher.py:46] Route: /v1/completions/render, Methods: POST
(APIServer pid=1) INFO:     **.*.**.***:60138 - "GET /metrics HTTP/1.1" 200 OK
(APIServer pid=1) INFO 04-26 21:58:14 [loggers.py:271] Engine 000: Avg prompt throughput: 71.1 tokens/s, Avg generation throughput: 140.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.4%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 04-26 21:58:14 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.53, Accepted throughput: 32.93 tokens/s, Drafted throughput: 43.05 tokens/s, Accepted: 846 tokens, Drafted: 1106 tokens, Per-position acceptance rate: 0.855, 0.675, Avg Draft acceptance rate: 76.5%
(APIServer pid=1) INFO:     **.*.**.***:60138 - "GET /metrics HTTP/1.1" 200 OK
(APIServer pid=1) INFO:     **.*.**.***:50818 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1) INFO:     **.*.**.***:60138 - "GET /metrics HTTP/1.1" 200 OK
(APIServer pid=1) INFO 04-26 21:58:24 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 142.6 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 04-26 21:58:24 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.87, Accepted throughput: 92.98 tokens/s, Drafted throughput: 99.38 tokens/s, Accepted: 930 tokens, Drafted: 994 tokens, Per-position acceptance rate: 0.966, 0.905, Avg Draft acceptance rate: 93.6%
(APIServer pid=1) INFO:     **.*.**.***:60138 - "GET /metrics HTTP/1.1" 200 OK
(APIServer pid=1) INFO:     **.*.**.***:60138 - "GET /metrics HTTP/1.1" 200 OK
(APIServer pid=1) INFO 04-26 21:58:34 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO:     **.*.**.***:60138 - "GET /metrics HTTP/1.1" 200 OK
[22:01:57] /project/cpp/grammar_matcher.cc:497: Warning: The matcher has terminated after accepting the stop token, but is trying to accept new token with id 198.
(APIServer pid=1) INFO:     **.*.**.***:60138 - "GET /metrics HTTP/1.1" 200 OK
(APIServer pid=1) INFO 04-26 22:01:54 [loggers.py:271] Engine 000: Avg prompt throughput: 67.7 tokens/s, Avg generation throughput: 182.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.4%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 04-26 22:01:54 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.63, Accepted throughput: 5.37 tokens/s, Drafted throughput: 6.58 tokens/s, Accepted: 1128 tokens, Drafted: 1382 tokens, Per-position acceptance rate: 0.890, 0.742, Avg Draft acceptance rate: 81.6%
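
For anyone comparing runs, the SpecDecoding lines above can be summarized with a short script. This is a minimal sketch and not part of vLLM: the regex and the helper names `parse_spec_line` and `acceptance_rate` are hypothetical, and they assume the log wording shown in this thread, which may differ in other vLLM versions.

```python
import re

# Matches the token counters in the "SpecDecoding metrics" line as logged above.
# The pattern is an assumption based on this thread's log format only.
SPEC_RE = re.compile(
    r"Accepted: (?P<accepted>\d+) tokens, Drafted: (?P<drafted>\d+) tokens"
)

def parse_spec_line(line):
    """Return (accepted, drafted) token counts, or None if the line does not match."""
    m = SPEC_RE.search(line)
    if m is None:
        return None
    return int(m.group("accepted")), int(m.group("drafted"))

def acceptance_rate(pairs):
    """Token-weighted draft acceptance rate in percent over one or more intervals."""
    accepted = sum(a for a, _ in pairs)
    drafted = sum(d for _, d in pairs)
    return 100.0 * accepted / drafted if drafted else 0.0

# The three SpecDecoding lines from the log above, minus the process/timestamp prefix.
log_lines = [
    "SpecDecoding metrics: Mean acceptance length: 2.53, Accepted throughput: 32.93 tokens/s, Drafted throughput: 43.05 tokens/s, Accepted: 846 tokens, Drafted: 1106 tokens, Per-position acceptance rate: 0.855, 0.675, Avg Draft acceptance rate: 76.5%",
    "SpecDecoding metrics: Mean acceptance length: 2.87, Accepted throughput: 92.98 tokens/s, Drafted throughput: 99.38 tokens/s, Accepted: 930 tokens, Drafted: 994 tokens, Per-position acceptance rate: 0.966, 0.905, Avg Draft acceptance rate: 93.6%",
    "SpecDecoding metrics: Mean acceptance length: 2.63, Accepted throughput: 5.37 tokens/s, Drafted throughput: 6.58 tokens/s, Accepted: 1128 tokens, Drafted: 1382 tokens, Per-position acceptance rate: 0.890, 0.742, Avg Draft acceptance rate: 81.6%",
]

pairs = [p for p in map(parse_spec_line, log_lines) if p is not None]
per_interval = [round(acceptance_rate([p]), 1) for p in pairs]
overall = round(acceptance_rate(pairs), 1)
print(per_interval, overall)
```

The per-interval values reproduce the logged "Avg Draft acceptance rate" figures (76.5%, 93.6%, 81.6%), and the token-weighted figure across all three intervals is a steadier number to compare between configurations than any single 10-second window.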

livepeer-ren

Apr 27

Yes, it supports it. Your configs are all over the place, though: --max-num-batched-tokens 179200 is wrong. Use these flags:
sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP
--quantization modelopt
--speculative-config '{"method":"qwen3_5_mtp","num_speculative_tokens":3}'
--max-model-len "196608"
--max-num-batched-tokens "8192"
--max-num-seqs "50"
--gpu-memory-utilization "0.95"
--enable-prefix-caching
--no-scheduler-reserve-full-isl
--trust-remote-code
--reasoning-parser qwen3
--enable-auto-tool-choice
--tool-call-parser qwen3_coder
--default-chat-template-kwargs '{"preserve_thinking":true}'
--language-model-only
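
If it helps to see the list above as a single command line, here is a minimal sketch that assembles it in Python. The flag names and values are copied from this post and the `vllm serve` entry point from the model card; nothing here checks that your installed vLLM version accepts every flag, and the variable names are only illustrative.

```python
import json
import shlex

MODEL = "sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP"

# JSON-valued flags are built from dicts so that shlex handles the shell quoting.
speculative_config = {"method": "qwen3_5_mtp", "num_speculative_tokens": 3}
chat_template_kwargs = {"preserve_thinking": True}

def compact_json(obj):
    """Serialize without spaces, matching the style used in the post."""
    return json.dumps(obj, separators=(",", ":"))

args = [
    "vllm", "serve", MODEL,
    "--quantization", "modelopt",
    "--speculative-config", compact_json(speculative_config),
    "--max-model-len", "196608",
    "--max-num-batched-tokens", "8192",
    "--max-num-seqs", "50",
    "--gpu-memory-utilization", "0.95",
    "--enable-prefix-caching",
    "--no-scheduler-reserve-full-isl",
    "--trust-remote-code",
    "--reasoning-parser", "qwen3",
    "--enable-auto-tool-choice",
    "--tool-call-parser", "qwen3_coder",
    "--default-chat-template-kwargs", compact_json(chat_template_kwargs),
    "--language-model-only",
]

# shlex.join quotes each argument so the result can be pasted into a POSIX shell.
command = shlex.join(args)
print(command)
```

Keeping the arguments as a list also maps directly onto a container `command:` or `args:` array, which is the shape the flags were posted in.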

sakamakismile

Owner Apr 28

@cgelias 200+ TPS on a single B200, congrats. @livepeer-ren's flags point in the right direction. Let me add the why so you can tune from there:

1. num_speculative_tokens: 2 → 3 ← biggest single lever
vLLM 0.20+ applies the single Qwen3.6 MTP layer recursively num_speculative_tokens times. At n=3 we measured per-position acceptance of 87% / 72% / 61% across the family, for a mean acceptance length of ~3.0–3.4, which works out to roughly +30–50% decode TPS over n=2. Your current mean acceptance length of ~2.53 matches what we see at n=2.

2. method: "mtp" is fine on 0.20.0. Your startup log shows "Detected MTP model. Sharing target embedding...", which is the qwen3_5_mtp path being auto-dispatched. No change needed there.

3. --max-num-batched-tokens 32768 → 8192 for prefill-heavy workloads
With chunked prefill, this is the per-step token budget. 32K means each prefill chunk is huge — fewer overlap opportunities with decode and longer per-step latency. 8K trims TTFT noticeably without hurting decode throughput on B200.

4. Prefix caching does work with MTP on 0.20.0 — you don't need to wait for #26807. Just add --enable-prefix-caching (your config has it disabled). #26807 further improves cross-request MTP draft state reuse, but baseline KV prefix reuse already gives the biggest TTFT win on repeated system prompts. --no-scheduler-reserve-full-isl (livepeer-ren's tip) is the right complement — it stops the scheduler from front-loading full ISL reservations.
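
To make the arithmetic behind points 1 and 3 concrete, here is a rough sketch. It assumes that mean acceptance length equals one verified token per step plus the sum of the per-position acceptance rates, which is consistent with the three SpecDecoding log lines earlier in the thread. The prefill step count is a simplification that ignores decode tokens sharing the budget, and the 100,000-token prompt is a hypothetical example, not a measurement.

```python
import math

def mean_acceptance_length(per_position_rates):
    """One target-verified token per step plus the expected number of accepted drafts."""
    return 1.0 + sum(per_position_rates)

def prefill_steps(prompt_tokens, max_num_batched_tokens):
    """Rough number of scheduler steps to prefill one prompt with chunked prefill.

    Simplification: assumes the whole per-step budget goes to this prompt.
    """
    return math.ceil(prompt_tokens / max_num_batched_tokens)

# n=2: per-position rates from the first SpecDecoding log line in the thread.
n2_logged = mean_acceptance_length([0.855, 0.675])

# n=3: the 87 / 72 / 61% per-position figures quoted in point 1.
n3_quoted = mean_acceptance_length([0.87, 0.72, 0.61])

# Hypothetical 100,000-token prompt under the two budgets discussed in point 3.
steps_32k = prefill_steps(100_000, 32768)
steps_8k = prefill_steps(100_000, 8192)

print(round(n2_logged, 2), round(n3_quoted, 2), steps_32k, steps_8k)
```

The n=2 value reproduces the logged 2.53 and the n=3 value lands at 3.2, inside the ~3.0–3.4 range quoted above. The smaller budget splits the same prefill into more, shorter steps, which is where the room to interleave decode comes from.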

Optional next step: with 183 GB on B200 you can comfortably push our newer sakamakismile/Qwen3.6-27B-LNARIZE-NVFP4 (same NVFP4 + MTP recipe + KVTC) to ~1M max-model-len with single-stream ~120 TPS held.

Thanks @livepeer-ren for the immediate help.

— Tonoken3 / Lna-Lab
