--enable-prefix-caching and Feedback from HGX B200 Deployment

#5
by cgelias - opened

Does this model support prefix caching on vLLM while also using MTP? Or do we have to wait until this issue is resolved and merged:
https://github.com/vllm-project/vllm/pull/26807

Getting 200+ tokens/s decode on a single B200, but prefill and TTFT are suffering because prefix caching is missing.

I really appreciate this model. Any thoughts?

For reference here is my current vLLM config:

Verify if container recognizes the gpu:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Wed_Aug_20_01:58:59_PM_PDT_2025
Cuda compilation tools, release 13.0, V13.0.88
Build cuda_13.0.r13.0/compiler.36424714_0
Sun Apr 26 21:50:41 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.65.06              Driver Version: 580.65.06      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA B200                    On  |   00000000:3A:00.0 Off |                    0 |
| N/A   30C    P0            141W / 1000W |       0MiB / 183359MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Verify if PyTorch recognizes the GPU
pytorch version: 2.11.0+cu130
CUDA version recognized by pytorch: 13.0
Recognized CUDA device: True

VLLM args:
>>> -O3: --download-dir
>>> /data/vllm/download: --enable-auto-tool-choice
>>> gpu-memory-utilization: 0.9
>>> kv-cache-dtype: fp8_e4m3
>>> language-model-only: --max-model-len
>>> 179200: --max-num-batched-tokens
>>> 32768: --max-num-seqs
>>> 32: --model
>>> sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP: --reasoning-parser
>>> qwen3: --served-model-name
>>> Qwen3.6-27B: --speculative-config
>>> {"method": "mtp", "num_speculative_tokens": 2}: --tensor-parallel-size
>>> 1: --tool-call-parser
>>> qwen3_coder: --trust-remote-code
>>> None:

Env:
>>> NVIDIA_VISIBLE_DEVICES=void
>>> HOSTNAME=e92e510eb6a8
>>> NCCL_P2P_DISABLE=1
>>> NVIDIA_REQUIRE_CUDA=cuda>=13.0 brand=unknown,driver>=535,driver<536 brand=grid,driver>=535,driver<536 brand=tesla,driver>=535,driver<536 brand=nvidia,driver>=535,driver<536 brand=quadro,driver>=535,driver<536 brand=quadrortx,driver>=535,driver<536 brand=nvidiartx,driver>=535,driver<536 brand=vapps,driver>=535,driver<536 brand=vpc,driver>=535,driver<536 brand=vcs,driver>=535,driver<536 brand=vws,driver>=535,driver<536 brand=cloudgaming,driver>=535,driver<536 brand=unknown,driver>=550,driver<551 brand=grid,driver>=550,driver<551 brand=tesla,driver>=550,driver<551 brand=nvidia,driver>=550,driver<551 brand=quadro,driver>=550,driver<551 brand=quadrortx,driver>=550,driver<551 brand=nvidiartx,driver>=550,driver<551 brand=vapps,driver>=550,driver<551 brand=vpc,driver>=550,driver<551 brand=vcs,driver>=550,driver<551 brand=vws,driver>=550,driver<551 brand=cloudgaming,driver>=550,driver<551 brand=unknown,driver>=565,driver<566 brand=grid,driver>=565,driver<566 brand=tesla,driver>=565,driver<566 brand=nvidia,driver>=565,driver<566 brand=quadro,driver>=565,driver<566 brand=quadrortx,driver>=565,driver<566 brand=nvidiartx,driver>=565,driver<566 brand=vapps,driver>=565,driver<566 brand=vpc,driver>=565,driver<566 brand=vcs,driver>=565,driver<566 brand=vws,driver>=565,driver<566 brand=cloudgaming,driver>=565,driver<566 brand=unknown,driver>=570,driver<571 brand=grid,driver>=570,driver<571 brand=tesla,driver>=570,driver<571 brand=nvidia,driver>=570,driver<571 brand=quadro,driver>=570,driver<571 brand=quadrortx,driver>=570,driver<571 brand=nvidiartx,driver>=570,driver<571 brand=vapps,driver>=570,driver<571 brand=vpc,driver>=570,driver<571 brand=vcs,driver>=570,driver<571 brand=vws,driver>=570,driver<571 brand=cloudgaming,driver>=570,driver<571 brand=unknown,driver>=575,driver<576 brand=grid,driver>=575,driver<576 brand=tesla,driver>=575,driver<576 brand=nvidia,driver>=575,driver<576 brand=quadro,driver>=575,driver<576 brand=quadrortx,driver>=575,driver<576 
brand=nvidiartx,driver>=575,driver<576 brand=vapps,driver>=575,driver<576 brand=vpc,driver>=575,driver<576 brand=vcs,driver>=575,driver<576 brand=vws,driver>=575,driver<576 brand=cloudgaming,driver>=575,driver<576
>>> TORCH_CUDA_ARCH_LIST=7.5 8.0 8.6 8.9 9.0 10.0 12.0+PTX
>>> PWD=/vllm-workspace
>>> NVIDIA_DRIVER_CAPABILITIES=compute,utility
>>> NV_CUDA_CUDART_VERSION=13.0.88-1
>>> VLLM_USAGE_SOURCE=production-docker-image
>>> HOME=/root
>>> CUDA_VERSION=13.0.1
>>> HF_TOKEN=hf_**********************************
>>> UV_LINK_MODE=copy
>>> VLLM_ENABLE_CUDA_COMPATIBILITY=0
>>> UV_INDEX_STRATEGY=unsafe-best-match
>>> DO_NOT_TRACK=1
>>> SHLVL=1
>>> NVARCH=x86_64
>>> LD_LIBRARY_PATH=/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64
>>> SSL_CERT_FILE=/etc/ssl/certs/ca-certificates.crt
>>> VLLM_NO_USAGE_STATS=1
>>> REQUESTS_CA_BUNDLE=/etc/ssl/certs/ca-certificates.crt
>>> NVIDIA_CTK_LIBCUDA_DIR=/usr/lib/x86_64-linux-gnu
>>> VLLM_LOGGING_LEVEL=INFO
>>> PATH=/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
>>> UV_HTTP_TIMEOUT=500
>>> DEBIAN_FRONTEND=noninteractive
>>> VLLM_API_KEY=****************************************
>>> _=/usr/bin/printenv

Starting vllm:
(APIServer pid=1) INFO 04-26 21:51:11 [utils.py:299]
(APIServer pid=1) INFO 04-26 21:51:11 [utils.py:299]        █     █     █▄   ▄█
(APIServer pid=1) INFO 04-26 21:51:11 [utils.py:299]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.20.0
(APIServer pid=1) INFO 04-26 21:51:11 [utils.py:299]   █▄█▀ █     █     █     █  model   sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP
(APIServer pid=1) INFO 04-26 21:51:11 [utils.py:299]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀
(APIServer pid=1) INFO 04-26 21:51:11 [utils.py:299]
(APIServer pid=1) INFO 04-26 21:51:11 [utils.py:233] non-default args: {'model_tag': 'None', 'enable_auto_tool_choice': True, 'tool_call_parser': 'qwen3_coder', 'model': 'sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP', 'trust_remote_code': True, 'max_model_len': 179200, 'served_model_name': ['Qwen3.6-27B'], 'download_dir': '/data/vllm/download', 'reasoning_parser': 'qwen3', 'kv_cache_dtype': 'fp8_e4m3', 'language_model_only': True, 'max_num_batched_tokens': 32768, 'max_num_seqs': 32, 'speculative_config': {'method': 'mtp', 'num_speculative_tokens': 2}, 'optimization_level': '3'}
(APIServer pid=1) INFO 04-26 21:52:11 [model.py:554] Resolved architecture: Qwen3_5ForConditionalGeneration
(APIServer pid=1) INFO 04-26 21:52:11 [model.py:1685] Using max model len 179200
(APIServer pid=1) INFO 04-26 21:52:13 [cache.py:247] Using fp8_e4m3 data type to store kv cache. It reduces the GPU memory footprint and boosts the performance. Meanwhile, it may cause accuracy drop without a proper scaling factor
(APIServer pid=1) INFO 04-26 21:52:25 [model.py:554] Resolved architecture: Qwen3_5MTP
(APIServer pid=1) INFO 04-26 21:52:25 [model.py:1685] Using max model len 262144
(APIServer pid=1) WARNING 04-26 21:52:25 [speculative.py:532] Enabling num_speculative_tokens > 1 will run multiple times of forward on same MTP layer,which may result in lower acceptance rate
(APIServer pid=1) INFO 04-26 21:52:25 [scheduler.py:239] Chunked prefill is enabled with max_num_batched_tokens=32768.
(APIServer pid=1) WARNING 04-26 21:52:25 [modelopt.py:1013] Detected ModelOpt NVFP4 checkpoint. Please note that the format is experimental and could change in future.
(APIServer pid=1) INFO 04-26 21:52:25 [vllm.py:834] Asynchronous scheduling is enabled.
(APIServer pid=1) INFO 04-26 21:52:25 [kernel.py:199] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'])
(APIServer pid=1) INFO 04-26 21:52:32 [compilation.py:294] Enabled custom fusions: act_quant
(APIServer pid=1) [transformers] `Qwen2VLImageProcessorFast` is deprecated. The `Fast` suffix for image processors has been removed; use `Qwen2VLImageProcessor` instead.
(EngineCore pid=2051) [transformers] `Qwen2VLImageProcessorFast` is deprecated. The `Fast` suffix for image processors has been removed; use `Qwen2VLImageProcessor` instead.
(APIServer pid=1) INFO 04-26 21:52:33 [registry.py:126] All limits of multimodal modalities supported by the model are set to 0, running in text-only mode.
(EngineCore pid=2051) INFO 04-26 21:52:48 [core.py:107] Initializing a V1 LLM engine (v0.20.0) with config: model='sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP', speculative_config=SpeculativeConfig(method='mtp', model='sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP', num_spec_tokens=2), tokenizer='sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=179200, download_dir='/data/vllm/download', load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=modelopt_fp4, quantization_config=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=fp8_e4m3, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='qwen3', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=Qwen3.6-27B, enable_prefix_caching=False, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'ir_enable_torch_wrap': True, 'splitting_ops': ['vllm::unified_attention_with_output', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 
'vllm::gdn_attention_core', 'vllm::olmo_hybrid_gdn_full_forward', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_vision_items_per_batch': 0, 'encoder_cudagraph_max_frames_per_batch': 0, 'compile_sizes': [], 'compile_ranges_endpoints': [32768], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'size_asserts': False, 'alignment_asserts': False, 'scalar_asserts': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': True, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 192, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': False, 'static_all_moe_layers': []}, kernel_config=KernelConfig(ir_op_priority=IrOpPriorityConfig(rms_norm=['native']), enable_flashinfer_autotune=True, moe_backend='auto')
(EngineCore pid=2051) INFO 04-26 21:52:52 [registry.py:126] All limits of multimodal modalities supported by the model are set to 0, running in text-only mode.
(EngineCore pid=2051) INFO 04-26 21:52:52 [parallel_state.py:1400] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://*********:38535 backend=nccl
(EngineCore pid=2051) INFO 04-26 21:52:52 [parallel_state.py:1713] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A
(EngineCore pid=2051) WARNING 04-26 21:52:53 [__init__.py:206] min_p and logit_bias parameters won't work with speculative decoding.
(EngineCore pid=2051) INFO 04-26 21:52:53 [gpu_model_runner.py:4752] Starting to load model sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP...
(EngineCore pid=2051) INFO 04-26 21:52:54 [cuda.py:424] Using backend AttentionBackendEnum.FLASH_ATTN for vit attention
(EngineCore pid=2051) INFO 04-26 21:52:54 [mm_encoder_attention.py:230] Using AttentionBackendEnum.FLASH_ATTN for MMEncoderAttention.
(EngineCore pid=2051) INFO 04-26 21:52:54 [__init__.py:683] Using FlashInferCutlassNvFp4LinearKernel for NVFP4 GEMM
(EngineCore pid=2051) INFO 04-26 21:52:54 [gdn_linear_attn.py:155] Using Triton/FLA GDN prefill kernel
(EngineCore pid=2051) INFO 04-26 21:52:54 [cuda.py:368] Using FLASHINFER attention backend out of potential backends: ['FLASHINFER', 'TRITON_ATTN'].
(EngineCore pid=2051) INFO 04-26 21:52:54 [selector.py:132] Using HND KV cache layout for FLASHINFER backend.
(EngineCore pid=2051) INFO 04-26 21:52:54 [deep_gemm.py:115] DeepGEMM E8M0 enabled on current platform.
(EngineCore pid=2051) INFO 04-26 21:52:56 [weight_utils.py:659] No model.safetensors.index.json found in remote.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
(EngineCore pid=2051) INFO 04-26 21:52:56 [weight_utils.py:904] Filesystem type for checkpoints: BTRFS. Checkpoint size: 18.29 GiB. Available RAM: 1920.73 GiB.
(EngineCore pid=2051) INFO 04-26 21:52:56 [weight_utils.py:927] Auto-prefetch is disabled because the filesystem (BTRFS) is not a recognized network FS (NFS/Lustre). If you want to force prefetching, start vLLM with --safetensors-load-strategy=prefetch.
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:03<00:00,  3.34s/it]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:03<00:00,  3.34s/it]
(EngineCore pid=2051)
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:01<00:00,  1.06s/it]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:01<00:00,  1.06s/it]
(EngineCore pid=2051)
(EngineCore pid=2051) INFO 04-26 21:52:59 [default_loader.py:384] Loading weights took 3.69 seconds
(EngineCore pid=2051) WARNING 04-26 21:52:59 [kv_cache.py:109] Checkpoint does not provide a q scaling factor. Setting it to k_scale. This only matters for FP8 Attention backends (flash-attn or flashinfer).
(EngineCore pid=2051) WARNING 04-26 21:52:59 [kv_cache.py:123] Using KV cache scaling factor 1.0 for fp8_e4m3. If this is unintended, verify that k/v_scale scaling factors are properly set in the checkpoint.
(EngineCore pid=2051) INFO 04-26 21:52:59 [gpu_model_runner.py:4776] Loading drafter model...
(EngineCore pid=2051) INFO 04-26 21:53:00 [weight_utils.py:659] No model.safetensors.index.json found in remote.
(EngineCore pid=2051) INFO 04-26 21:53:00 [weight_utils.py:904] Filesystem type for checkpoints: BTRFS. Checkpoint size: 18.29 GiB. Available RAM: 1920.54 GiB.
(EngineCore pid=2051) INFO 04-26 21:53:01 [default_loader.py:384] Loading weights took 1.18 seconds
(EngineCore pid=2051) INFO 04-26 21:53:01 [eagle.py:1425] Detected MTP model. Sharing target model embedding weights with the draft model.
(EngineCore pid=2051) INFO 04-26 21:53:01 [eagle.py:1481] Detected MTP model. Sharing target model lm_head weights with the draft model.
(EngineCore pid=2051) INFO 04-26 21:53:01 [gpu_model_runner.py:4837] Model loading took 18.55 GiB memory and 7.411608 seconds
(EngineCore pid=2051) INFO 04-26 21:53:01 [interface.py:606] Setting attention block size to 1600 tokens to ensure that attention page size is >= mamba page size.
(EngineCore pid=2051) INFO 04-26 21:53:01 [interface.py:630] Padding mamba page size by 0.88% to ensure that mamba page size and attention page size are exactly equal.
(EngineCore pid=2051) INFO 04-26 21:53:22 [backends.py:1077] Using cache directory: /root/.cache/vllm/torch_compile_cache/4c919e5dad/rank_0_0/backbone for vLLM's torch.compile
![claude_b9akDIiYx5](https://cdn-uploads.huggingface.co/production/uploads/6717761a6599a79d31f483d1/IyZbY3qNEHEy27JSxiS3v.png)

(EngineCore pid=2051) INFO 04-26 21:53:22 [backends.py:1137] Dynamo bytecode transform time: 20.73 s
(EngineCore pid=2051) INFO 04-26 21:53:31 [backends.py:377] Cache the graph of compile range (1, 32768) for later use
(EngineCore pid=2051) INFO 04-26 21:54:44 [backends.py:398] Compiling a graph for compile range (1, 32768) takes 80.66 s
(EngineCore pid=2051) INFO 04-26 21:54:54 [decorators.py:665] saved AOT compiled function to /root/.cache/vllm/torch_compile_cache/torch_aot_compile/ee57439e4b671222aec9e727c54957d772d1fb982891ac133bef1e83b760cf33/rank_0_0/model
(EngineCore pid=2051) INFO 04-26 21:54:54 [monitor.py:48] torch.compile took 112.76 s in total
(EngineCore pid=2051) INFO 04-26 21:56:36 [monitor.py:76] Initial profiling/warmup run took 101.73 s
(EngineCore pid=2051) INFO 04-26 21:56:37 [backends.py:1077] Using cache directory: /root/.cache/vllm/torch_compile_cache/4c919e5dad/rank_0_0/eagle_head for vLLM's torch.compile
(EngineCore pid=2051) INFO 04-26 21:56:37 [backends.py:1137] Dynamo bytecode transform time: 0.91 s
(EngineCore pid=2051) INFO 04-26 21:57:03 [backends.py:398] Compiling a graph for compile range (1, 32768) takes 25.46 s
(EngineCore pid=2051) INFO 04-26 21:57:03 [decorators.py:665] saved AOT compiled function to /root/.cache/vllm/torch_compile_cache/torch_aot_compile/b7f8179a9f54ed62f029189106417382f0c152d89a0e90fcb7fbebfd894fb6ce/rank_0_0/model
(EngineCore pid=2051) INFO 04-26 21:57:03 [monitor.py:48] torch.compile took 26.98 s in total
(EngineCore pid=2051) INFO 04-26 21:57:04 [monitor.py:76] Initial profiling/warmup run took 0.99 s
(EngineCore pid=2051) WARNING 04-26 21:57:16 [kv_cache_utils.py:1059] Add 3 padding layers, may waste at most 6.25% KV cache memory
(EngineCore pid=2051) INFO 04-26 21:57:16 [kv_cache_utils.py:829] Overriding num_gpu_blocks=0 with num_gpu_blocks_override=192
(EngineCore pid=2051) INFO 04-26 21:57:16 [utils.py:60] `_KV_CACHE_LAYOUT_OVERRIDE` variable detected. Setting KV cache layout to HND.
(EngineCore pid=2051) INFO 04-26 21:57:16 [gpu_model_runner.py:5916] Profiling CUDA graph memory: PIECEWISE=26 (largest=192), FULL=14 (largest=96)
(EngineCore pid=2051) INFO 04-26 21:57:17 [flashinfer.py:390] Using TRTLLM attention (query is quantized).
(EngineCore pid=2051) INFO 04-26 21:57:23 [gpu_model_runner.py:5995] Estimated CUDA graph memory: 0.82 GiB total
(EngineCore pid=2051) INFO 04-26 21:57:23 [gpu_worker.py:440] Available KV cache memory: 136.34 GiB
(EngineCore pid=2051) INFO 04-26 21:57:23 [gpu_worker.py:474] In v0.19, CUDA graph memory profiling will be enabled by default (VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1), which more accurately accounts for CUDA graph memory during KV cache allocation. To try it now, set VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1 and increase --gpu-memory-utilization from 0.9000 to 0.9046 to maintain the same effective KV cache size.
(EngineCore pid=2051) WARNING 04-26 21:57:23 [kv_cache_utils.py:1059] Add 3 padding layers, may waste at most 6.25% KV cache memory
(EngineCore pid=2051) 2026-04-26 21:57:23,820 - INFO - autotuner.py:446 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
(EngineCore pid=2051) INFO 04-26 21:57:23 [kv_cache_utils.py:1319] GPU KV cache size: 1,051,200 tokens
(EngineCore pid=2051) 2026-04-26 21:57:36,497 - INFO - autotuner.py:455 - flashinfer.jit: [Autotuner]: Autotuning process ends
(EngineCore pid=2051) INFO 04-26 21:57:23 [kv_cache_utils.py:1324] Maximum concurrency for 179,200 tokens per request: 21.72x
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|██████████| 26/26 [00:04<00:00,  5.26it/s]
Capturing CUDA graphs (decode, FULL): 100%|██████████| 14/14 [00:05<00:00,  2.39it/s]
(EngineCore pid=2051) INFO 04-26 21:57:48 [gpu_model_runner.py:6086] Graph capturing finished in 12 secs, took 0.49 GiB
(EngineCore pid=2051) INFO 04-26 21:57:48 [gpu_worker.py:601] CUDA graph pool memory: 0.49 GiB (actual), 0.82 GiB (estimated), difference: 0.32 GiB (65.6%).
(EngineCore pid=2051) INFO 04-26 21:57:48 [core.py:299] init engine (profile, create kv cache, warmup model) took 286.47 s (compilation: 127.75 s)
(EngineCore pid=2051) INFO 04-26 21:57:48 [vllm.py:834] Asynchronous scheduling is enabled.
(EngineCore pid=2051) INFO 04-26 21:57:48 [kernel.py:199] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'])
(EngineCore pid=2051) INFO 04-26 21:57:48 [compilation.py:294] Enabled custom fusions: act_quant
(APIServer pid=1) INFO 04-26 21:57:48 [api_server.py:598] Supported tasks: ['generate']
(APIServer pid=1) INFO 04-26 21:57:49 [parser_manager.py:202] "auto" tool choice has been enabled.
(APIServer pid=1) WARNING 04-26 21:57:49 [model.py:1442] Default vLLM sampling parameters have been overridden by the model's `generation_config.json`: `{'temperature': 1.0, 'top_k': 20, 'top_p': 0.95}`. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
(APIServer pid=1) INFO 04-26 21:57:52 [hf.py:314] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
(APIServer pid=1) WARNING 04-26 21:57:53 [base.py:247] Multi-modal warmup failed
(APIServer pid=1) INFO 04-26 21:57:53 [api_server.py:602] Starting vLLM server on http://0.0.0.0:8000
(APIServer pid=1) INFO 04-26 21:57:53 [launcher.py:37] Available routes are:
(APIServer pid=1) INFO 04-26 21:57:53 [launcher.py:46] Route: /openapi.json, Methods: HEAD, GET
(APIServer pid=1) INFO 04-26 21:57:53 [launcher.py:46] Route: /docs, Methods: HEAD, GET
(APIServer pid=1) INFO 04-26 21:57:53 [launcher.py:46] Route: /docs/oauth2-redirect, Methods: HEAD, GET
(APIServer pid=1) INFO 04-26 21:57:53 [launcher.py:46] Route: /redoc, Methods: HEAD, GET
(APIServer pid=1) INFO 04-26 21:57:53 [launcher.py:46] Route: /tokenize, Methods: POST
(APIServer pid=1) INFO 04-26 21:57:53 [launcher.py:46] Route: /detokenize, Methods: POST
(APIServer pid=1) INFO 04-26 21:57:53 [launcher.py:46] Route: /load, Methods: GET
(APIServer pid=1) INFO 04-26 21:57:53 [launcher.py:46] Route: /version, Methods: GET
(APIServer pid=1) INFO 04-26 21:57:53 [launcher.py:46] Route: /health, Methods: GET
(APIServer pid=1) INFO 04-26 21:57:53 [launcher.py:46] Route: /metrics, Methods: GET
(APIServer pid=1) INFO 04-26 21:57:53 [launcher.py:46] Route: /v1/models, Methods: GET
(APIServer pid=1) INFO 04-26 21:57:53 [launcher.py:46] Route: /ping, Methods: GET
(APIServer pid=1) INFO 04-26 21:57:53 [launcher.py:46] Route: /ping, Methods: POST
(APIServer pid=1) INFO 04-26 21:57:53 [launcher.py:46] Route: /invocations, Methods: POST
(APIServer pid=1) INFO 04-26 21:57:53 [launcher.py:46] Route: /v1/chat/completions, Methods: POST
(APIServer pid=1) INFO 04-26 21:57:53 [launcher.py:46] Route: /v1/chat/completions/batch, Methods: POST
(APIServer pid=1) INFO 04-26 21:57:53 [launcher.py:46] Route: /v1/responses, Methods: POST
(APIServer pid=1) INFO 04-26 21:57:53 [launcher.py:46] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=1) INFO 04-26 21:57:53 [launcher.py:46] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=1) INFO 04-26 21:57:53 [launcher.py:46] Route: /v1/completions, Methods: POST
(APIServer pid=1) INFO 04-26 21:57:53 [launcher.py:46] Route: /v1/messages, Methods: POST
(APIServer pid=1) INFO 04-26 21:57:53 [launcher.py:46] Route: /v1/messages/count_tokens, Methods: POST
(APIServer pid=1) INFO 04-26 21:57:53 [launcher.py:46] Route: /inference/v1/generate, Methods: POST
(APIServer pid=1) INFO 04-26 21:57:53 [launcher.py:46] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=1) INFO 04-26 21:57:53 [launcher.py:46] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=1) INFO:     Started server process [1]
(APIServer pid=1) INFO:     Waiting for application startup.
(APIServer pid=1) INFO:     Application startup complete.
(APIServer pid=1) INFO 04-26 21:57:53 [launcher.py:46] Route: /generative_scoring, Methods: POST
(APIServer pid=1) INFO 04-26 21:57:53 [launcher.py:46] Route: /v1/chat/completions/render, Methods: POST
(APIServer pid=1) INFO 04-26 21:57:53 [launcher.py:46] Route: /v1/completions/render, Methods: POST
(APIServer pid=1) INFO:     **.*.**.***:60138 - "GET /metrics HTTP/1.1" 200 OK
(APIServer pid=1) INFO:     **.*.**.***:60138 - "GET /metrics HTTP/1.1" 200 OK
(APIServer pid=1) INFO:     **.*.**.***:60138 - "GET /metrics HTTP/1.1" 200 OK
(APIServer pid=1) INFO:     **.*.**.***:60138 - "GET /metrics HTTP/1.1" 200 OK
(APIServer pid=1) INFO 04-26 21:58:14 [loggers.py:271] Engine 000: Avg prompt throughput: 71.1 tokens/s, Avg generation throughput: 140.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.4%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 04-26 21:58:14 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.53, Accepted throughput: 32.93 tokens/s, Drafted throughput: 43.05 tokens/s, Accepted: 846 tokens, Drafted: 1106 tokens, Per-position acceptance rate: 0.855, 0.675, Avg Draft acceptance rate: 76.5%
(APIServer pid=1) INFO:     **.*.**.***:60138 - "GET /metrics HTTP/1.1" 200 OK
(APIServer pid=1) INFO:     **.*.**.***:50818 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1) INFO:     **.*.**.***:60138 - "GET /metrics HTTP/1.1" 200 OK
(APIServer pid=1) INFO 04-26 21:58:24 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 142.6 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 04-26 21:58:24 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.87, Accepted throughput: 92.98 tokens/s, Drafted throughput: 99.38 tokens/s, Accepted: 930 tokens, Drafted: 994 tokens, Per-position acceptance rate: 0.966, 0.905, Avg Draft acceptance rate: 93.6%
(APIServer pid=1) INFO:     **.*.**.***:60138 - "GET /metrics HTTP/1.1" 200 OK
(APIServer pid=1) INFO:     **.*.**.***:60138 - "GET /metrics HTTP/1.1" 200 OK
(APIServer pid=1) INFO 04-26 21:58:34 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO:     **.*.**.***:60138 - "GET /metrics HTTP/1.1" 200 OK
[22:01:57] /project/cpp/grammar_matcher.cc:497: Warning: The matcher has terminated after accepting the stop token, but is trying to accept new token with id 198.
(APIServer pid=1) INFO:     **.*.**.***:60138 - "GET /metrics HTTP/1.1" 200 OK
(APIServer pid=1) INFO 04-26 22:01:54 [loggers.py:271] Engine 000: Avg prompt throughput: 67.7 tokens/s, Avg generation throughput: 182.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.4%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 04-26 22:01:54 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.63, Accepted throughput: 5.37 tokens/s, Drafted throughput: 6.58 tokens/s, Accepted: 1128 tokens, Drafted: 1382 tokens, Per-position acceptance rate: 0.890, 0.742, Avg Draft acceptance rate: 81.6%
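
The periodic loggers.py lines above carry the numbers that matter for this question: prompt throughput, generation throughput, and the prefix cache hit rate (0.0% throughout this run). A minimal Python sketch for pulling them out of a saved log, assuming the line format stays exactly as it appears in this 0.20.0 log; the function and field names are made up for illustration:

```python
import re

# Pattern for the periodic engine stats line, copied from the format in the log above.
ENGINE_STATS = re.compile(
    r"Avg prompt throughput: (?P<prompt_tps>[\d.]+) tokens/s, "
    r"Avg generation throughput: (?P<gen_tps>[\d.]+) tokens/s, "
    r".*?GPU KV cache usage: (?P<kv_usage_pct>[\d.]+)%, "
    r"Prefix cache hit rate: (?P<prefix_hit_pct>[\d.]+)%"
)


def parse_engine_stats(line):
    """Return the throughput and cache fields as floats, or None if the line does not match."""
    match = ENGINE_STATS.search(line)
    if match is None:
        return None
    return {name: float(value) for name, value in match.groupdict().items()}


# One stats line taken verbatim from the log above.
sample = (
    "(APIServer pid=1) INFO 04-26 21:58:14 [loggers.py:271] Engine 000: "
    "Avg prompt throughput: 71.1 tokens/s, Avg generation throughput: 140.0 tokens/s, "
    "Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.4%, Prefix cache hit rate: 0.0%"
)
stats = parse_engine_stats(sample)
print(stats)
```

Running the same parser over the log before and after a config change gives a quick check on whether the prefix cache hit rate moves off 0.0%.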


Yes, it supports it. Your config is all over the place: >>> 179200: --max-num-batched-tokens is wrong!! Use these flags:
sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP
--quantization
modelopt
--speculative-config
'{"method":"qwen3_5_mtp","num_speculative_tokens":3}'
--max-model-len
"196608"
--max-num-batched-tokens
"8192"
--max-num-seqs
"50"
--gpu-memory-utilization
"0.95"
--enable-prefix-caching
--no-scheduler-reserve-full-isl
--trust-remote-code
--reasoning-parser
qwen3
--enable-auto-tool-choice
--tool-call-parser
qwen3_coder
--default-chat-template-kwargs
'{"preserve_thinking":true}'
--language-model-only
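
For anyone launching outside a container argument list, here is a minimal Python sketch that assembles the flags above into a single command line. It assumes the `vllm serve` entry point is on PATH; the flag names and values are copied from the list above, and the helper name `build_argv` is hypothetical. Keeping valued flags and valueless switches in separate collections avoids the kind of flag/value mispairing visible in the args dump earlier in the thread.

```python
import shlex

MODEL = "sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP"

# Flags that take a value, copied from the list above.
OPTIONS = {
    "--quantization": "modelopt",
    "--speculative-config": '{"method":"qwen3_5_mtp","num_speculative_tokens":3}',
    "--max-model-len": "196608",
    "--max-num-batched-tokens": "8192",
    "--max-num-seqs": "50",
    "--gpu-memory-utilization": "0.95",
    "--reasoning-parser": "qwen3",
    "--tool-call-parser": "qwen3_coder",
    "--default-chat-template-kwargs": '{"preserve_thinking":true}',
}

# Boolean switches that take no value.
SWITCHES = [
    "--enable-prefix-caching",
    "--no-scheduler-reserve-full-isl",
    "--trust-remote-code",
    "--enable-auto-tool-choice",
    "--language-model-only",
]


def build_argv(model, options, switches):
    """Build the argument vector: positional model first, then valued flags, then switches."""
    argv = ["vllm", "serve", model]
    for flag, value in options.items():
        argv += [flag, value]
    argv += switches
    return argv


argv = build_argv(MODEL, OPTIONS, SWITCHES)
# shlex.join quotes the JSON values so the printed line is safe to paste into a shell.
print(shlex.join(argv))
```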

@cgelias 200+ TPS on a single B200, congrats. @livepeer-ren's flags are pointing in the right direction; let me add the why so you can tune from there:

1. num_speculative_tokens: 2 → 3 ← biggest single lever
vLLM 0.20+ applies the single Qwen3.6 MTP layer recursively num_speculative_tokens times. We measured per-position acceptance of 87% / 72% / 61% across the family at n=3, with a mean acceptance length of ~3.0–3.4, which works out to roughly +30–50% decode TPS over n=2. Your current mean acceptance length of ~2.53 matches what we see at n=2.

2. method: "mtp" is fine on 0.20.0. Your startup log shows Detected MTP model. Sharing target embedding..., which is the qwen3_5_mtp path being auto-dispatched. No change needed there.

3. --max-num-batched-tokens 32768 → 8192 for prefill-heavy workloads
With chunked prefill, this is the per-step token budget. 32K means each prefill chunk is huge, which leaves fewer overlap opportunities with decode and gives longer per-step latency. 8K trims TTFT noticeably without hurting decode throughput on B200.

4. Prefix caching does work with MTP on 0.20.0, so you don't need to wait for #26807. Just add --enable-prefix-caching (your config has it disabled). #26807 further improves cross-request MTP draft state reuse, but baseline KV prefix reuse already gives the biggest TTFT win on repeated system prompts. --no-scheduler-reserve-full-isl (livepeer-ren's tip) is the right complement: it stops the scheduler from front-loading full ISL reservations.
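
The arithmetic behind points 1 and 3 can be sketched in a few lines of Python. Two assumptions, both labeled in the code: the mean acceptance length is taken as 1 plus the sum of the per-position acceptance rates (a relation inferred from the SpecDecoding lines in the log, where 1 + 0.855 + 0.675 = 2.53), and a lone prompt is assumed to need at least ceil(prompt_tokens / max_num_batched_tokens) scheduler steps to prefill. The function names are made up for illustration.

```python
import math


def mean_acceptance_length(per_position_rates):
    """Assumption: one target token per step plus the expected number of accepted drafts."""
    return 1.0 + sum(per_position_rates)


def min_prefill_steps(prompt_tokens, max_num_batched_tokens):
    """Assumption: a lone prompt is prefilled in chunks no larger than the per-step budget."""
    return math.ceil(prompt_tokens / max_num_batched_tokens)


# n=2 rates from the first SpecDecoding log line; n=3 rates quoted in point 1.
n2 = mean_acceptance_length([0.855, 0.675])
n3 = mean_acceptance_length([0.87, 0.72, 0.61])
print(f"n=2: {n2:.2f}, n=3: {n3:.2f}")  # n=2: 2.53, n=3: 3.20

# Prefill steps for a full 179,200-token prompt at the two budgets discussed in point 3.
steps_32k = min_prefill_steps(179200, 32768)
steps_8k = min_prefill_steps(179200, 8192)
print(steps_32k, steps_8k)  # 6 22
```

The smaller budget means more, shorter steps, which is where the extra room for interleaving decode comes from.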

Optional next step: with 183 GB on B200 you can comfortably push our newer sakamakismile/Qwen3.6-27B-LNARIZE-NVFP4 (same NVFP4 + MTP recipe + KVTC) to ~1M max-model-len with single-stream ~120 TPS held.

Thanks @livepeer-ren for the immediate help.

-- Tonoken3 / Lna-Lab
