on the DGX spark

by shakhizat - opened 21 days ago

(EngineCore pid=7077) ValueError: There is no module or parameter named 'lm_head.input_scale' in Qwen3_5MoeForCausalLM. The available parameters belonging to lm_head (ParallelLMHead) are: {'lm_head.weight'}

cpatonn

21 days ago

Yeah, it is due to they quantize lm_head, which vllm does not support. We have to manually dequantized lm_head from NVFP4 to bf16 and replace the quantized lm_head tensors with the reconstructed lm_head.

ComplexMinded

20 days ago

Why was DGX spark not considered when this was optimized into NVFP4 - just curious! Is this something that is expected that vllm will fix in the near future? If not, it just seems weird the DGX Spark wasn't considered for NVFP4 especially since it's NVIDIA's product.

technigmaai

20 days ago

https://github.com/technigmaai/dgx-spark/tree/main/spark-vllm-docker/nvidia-Qwen3.6-35B-A3B-NVFP4

remifan

20 days ago

https://github.com/technigmaai/dgx-spark/tree/main/spark-vllm-docker/nvidia-Qwen3.6-35B-A3B-NVFP4

promising work. There are community NVFP4 models resorting to compressed-tensor path, the hurdle here is the modelopt.
NVIDIA quantized the output head, vLLM's modelopt loader only accepts an unquantized lm_head.weight, so the extra lm_head.input_scale tensor has nowhere to go and loading aborts

MrPMorris

19 days ago

•

edited 19 days ago

This worked for me

docker run -d \
  --name vllm-qwen36-nvfp4 \
  --gpus all \
  --ipc=host \
  -p 8000:8000 \
  -e HF_HUB_DISABLE_XET=1 \
  -e HF_HUB_DOWNLOAD_TIMEOUT=60 \
  -e HF_HUB_ETAG_TIMEOUT=60 \
  -e HF_HOME=/hf \
  -v "$HOME/.cache/huggingface:/hf" \
  ghcr.io/spark-arena/dgx-vllm-eugr-nightly-tf5:latest \
  vllm serve RedHatAI/Qwen3.6-35B-A3B-NVFP4 \
    --served-model-name qwen3.6-35b-a3b-nvfp4 \
    --host 0.0.0.0 \
    --port 8000 \
    --gpu-memory-utilization 0.9 \
    --max-model-len 32768 \
    --reasoning-parser qwen3 \
    --moe_backend flashinfer_cutlass \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --speculative-config '{"method":"mtp","num_speculative_tokens":2}' \
    --trust-remote-code

remifan

18 days ago

real fix: vLLM 0.22.1rc1.dev33 include the loader fix that skips/handles the orphan lm_head scale tensor for an unquantized head. The 0.22 release branch doesn't have it; main/nightly does

dangzhangqq

18 days ago

real fix: vLLM 0.22.1rc1.dev33 include the loader fix that skips/handles the orphan lm_head scale tensor for an unquantized head. The 0.22 release branch doesn't have it; main/nightly does

Thanks bro, I'll check it out.

bitbound

16 days ago

•

edited 16 days ago

This is my compose file, and it's working. Note the "nightly" tag for the vllm-openai image.

Edit: I should probably add that I'm noticing degradation in tool calling, coding, and logic compared to the FP8. So I think I'll be going back to that.

services:
  vllm:
    image: vllm/vllm-openai:nightly
    container_name: qwen3-6-35b
    ipc: host
    gpus: all
    shm_size: '16gb'
    restart: unless-stopped
    ports:
      - "8001:8000"
    volumes:
      - /opt/lm_cache/huggingface:/root/.cache/huggingface
    environment:
      - HF_TOKEN=${HF_TOKEN}
      - VLLM_USE_FLASHINFER_MOE_FP4=0
      - VLLM_FP8_MOE_BACKEND=flashinfer_cutlass
      - FLASHINFER_DISABLE_VERSION_CHECK=1
      - CUTE_DSL_ARCH=sm_121a
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    command: |
      --model nvidia/Qwen3.6-35B-A3B-NVFP4
      --port 8000
      --tensor-parallel-size 1
      --trust-remote-code
      --enable-auto-tool-choice
      --reasoning-parser qwen3
      --tool-call-parser qwen3_coder
      --dtype auto
      --quantization modelopt
      --kv-cache-dtype fp8
      --attention-backend flashinfer
      --moe-backend marlin
      --gpu-memory-utilization 0.5
      --max-model-len 262144
      --max-num-seqs 8
      --max-num-batched-tokens 32768
      --enable-chunked-prefill
      --async-scheduling
      --enable-prefix-caching
      --default-chat-template-kwargs '{"enable_thinking": true, "preserve_thinking": true}'
      --speculative-config '{"method":"mtp","num_speculative_tokens":3,"moe_backend":"triton"}'

dangzhangqq

16 days ago

This works for me:

networks:
  1panel-network:
    external: true

services:
  vllm-qwopus:
    image: vllm/vllm-openai:nightly-f91fb2fcf3f100d46baf4db2e5d86fa5596b276b
    container_name: vllm-qwen-36-27b-nvfp4
    restart: unless-stopped
    ports:
      - "8004:8000"
    ipc: host
    networks:
      - 1panel-network
    runtime: nvidia

    environment:
      - TZ=Asia/Shanghai
      - HF_HOME=/models/hf_cache
      - CUDA_VISIBLE_DEVICES=0
      - HF_HUB_OFFLINE=1           
      - TRANSFORMERS_OFFLINE=1   
      - HF_HUB_ENABLE_HF_TRANSFER=0
      - VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
      - VLLM_USE_FLASHINFER_MOE_FP4=0 #nvidia spec
      - VLLM_FP8_MOE_BACKEND=flashinfer_cutlass #nvidia spec
      - FLASHINFER_DISABLE_VERSION_CHECK=1 #nvidia spec
      - CUTE_DSL_ARCH=sm_121a #nvidia spec


    volumes:
      - ./models:/models
      - ./logs:/logs

    command: >
      /models/qwen3.6-35B-A3B-NVFP4
      --served-model-name qwen36nvfp4
      --host 0.0.0.0
      --port 8000
      --dtype auto
      --quantization modelopt
      --kv-cache-dtype fp8
      --attention-backend flashinfer
      --moe-backend marlin
      --gpu-memory-utilization 0.65
      --max-model-len 262144
      --max-num-seqs 4
      --max-num-batched-tokens 8192
      --enable-chunked-prefill
      --async-scheduling
      --enable-prefix-caching
      --speculative-config '{"method":"mtp","num_speculative_tokens":3,"moe_backend":"triton"}'
      --enable-auto-tool-choice
      --tool-call-parser qwen3_coder

But I don't Know if it works correctly， this is logs:

(APIServer pid=1) INFO 06-03 16:34:45 [utils.py:344]
(APIServer pid=1) INFO 06-03 16:34:45 [utils.py:344]        █     █     █▄   ▄█
(APIServer pid=1) INFO 06-03 16:34:45 [utils.py:344]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.22.1rc1.dev55+gf91fb2fcf
(APIServer pid=1) INFO 06-03 16:34:45 [utils.py:344]   █▄█▀ █     █     █     █  model   /models/qwen3.6-35B-A3B-NVFP4
(APIServer pid=1) INFO 06-03 16:34:45 [utils.py:344]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀
(APIServer pid=1) INFO 06-03 16:34:45 [utils.py:344]
(APIServer pid=1) INFO 06-03 16:34:45 [utils.py:278] non-default args: {'model_tag': '/models/qwen3.6-35B-A3B-NVFP4', 'enable_auto_tool_choice': True, 'tool_call_parser': 'qwen3_coder', 'host': '0.0.0.0', 'model': '/models/qwen3.6-35B-A3B-NVFP4', 'max_model_len': 262144, 'quantization': 'modelopt', 'served_model_name': ['qwen36nvfp4'], 'attention_backend': 'flashinfer', 'gpu_memory_utilization': 0.65, 'kv_cache_dtype': 'fp8', 'enable_prefix_caching': True, 'max_num_batched_tokens': 8192, 'max_num_seqs': 4, 'enable_chunked_prefill': True, 'async_scheduling': True, 'moe_backend': 'marlin', 'speculative_config': {'method': 'mtp', 'num_speculative_tokens': 3, 'moe_backend': 'triton'}}
(APIServer pid=1) WARNING 06-03 16:34:45 [envs.py:2060] Unknown vLLM environment variable detected: VLLM_FP8_MOE_BACKEND
(APIServer pid=1) WARNING 06-03 16:34:45 [envs.py:2060] Unknown vLLM environment variable detected: VLLM_BUILD_COMMIT
(APIServer pid=1) WARNING 06-03 16:34:45 [envs.py:2060] Unknown vLLM environment variable detected: VLLM_BUILD_PIPELINE
(APIServer pid=1) WARNING 06-03 16:34:45 [envs.py:2060] Unknown vLLM environment variable detected: VLLM_BUILD_URL
(APIServer pid=1) WARNING 06-03 16:34:45 [envs.py:2060] Unknown vLLM environment variable detected: VLLM_IMAGE_TAG
(APIServer pid=1) INFO 06-03 16:34:53 [model.py:617] Resolved architecture: Qwen3_5MoeForConditionalGeneration
(APIServer pid=1) INFO 06-03 16:34:53 [model.py:1751] Using max model len 262144
(APIServer pid=1) INFO 06-03 16:34:54 [cache.py:269] Using fp8 data type to store kv cache. It reduces the GPU memory footprint and boosts the performance. Meanwhile, it may cause accuracy drop without a proper scaling factor
(APIServer pid=1) INFO 06-03 16:34:58 [model.py:617] Resolved architecture: Qwen3_5MoeMTP
(APIServer pid=1) INFO 06-03 16:34:58 [model.py:1751] Using max model len 262144
(APIServer pid=1) WARNING 06-03 16:34:58 [speculative.py:722] Enabling num_speculative_tokens > 1 will run multiple times of forward on same MTP layer,which may result in lower acceptance rate
(APIServer pid=1) INFO 06-03 16:34:58 [scheduler.py:239] Chunked prefill is enabled with max_num_batched_tokens=8192.
(APIServer pid=1) WARNING 06-03 16:34:58 [config.py:355] Mamba cache mode is set to 'align' for Qwen3_5MoeForConditionalGeneration by default when prefix caching is enabled
(APIServer pid=1) INFO 06-03 16:34:58 [config.py:375] Warning: Prefix caching in Mamba cache 'align' mode is currently enabled. Its support for Mamba layers is experimental. Please report any issues you may observe.
(APIServer pid=1) WARNING 06-03 16:34:58 [modelopt.py:379] Detected ModelOpt fp8 checkpoint (quant_algo=FP8). Please note that the format is experimental and could change.
(APIServer pid=1) WARNING 06-03 16:34:58 [modelopt.py:1022] Detected ModelOpt NVFP4 checkpoint (quant_algo=NVFP4). Please note that the format is experimental and could change in future.
(APIServer pid=1) WARNING 06-03 16:34:58 [modelopt.py:1022] Detected ModelOpt NVFP4 checkpoint (quant_algo=W4A16_NVFP4). Please note that the format is experimental and could change in future.
(APIServer pid=1) INFO 06-03 16:34:58 [vllm.py:984] Asynchronous scheduling is enabled.
(APIServer pid=1) INFO 06-03 16:34:58 [kernel.py:270] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'], fused_add_rms_norm=['native'])
(APIServer pid=1) [transformers] `Qwen2VLImageProcessorFast` is deprecated. The `Fast` suffix for image processors has been removed; use `Qwen2VLImageProcessor` instead.
(APIServer pid=1) [transformers] The `use_fast` parameter is deprecated and will be removed in a future version. Use `backend="torchvision"` instead of `use_fast=True`, or `backend="pil"` instead of `use_fast=False`.
(EngineCore pid=263) INFO 06-03 16:35:09 [core.py:112] Initializing a V1 LLM engine (v0.22.1rc1.dev55+gf91fb2fcf) with config: model='/models/qwen3.6-35B-A3B-NVFP4', speculative_config=SpeculativeConfig(method='mtp', model='/models/qwen3.6-35B-A3B-NVFP4', num_spec_tokens=3), tokenizer='/models/qwen3.6-35B-A3B-NVFP4', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=262144, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=modelopt_mixed, quantization_config=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=fp8, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=qwen36nvfp4, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'ir_enable_torch_wrap': True, 'splitting_ops': ['vllm::unified_attention_with_output', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::qwen_gdn_attention_core', 'vllm::gdn_attention_core_xpu', 'vllm::olmo_hybrid_gdn_full_forward', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::deepseek_v4_attention', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_vision_items_per_batch': 0, 'encoder_cudagraph_max_frames_per_batch': None, 'compile_sizes': [], 'compile_ranges_endpoints': [8192], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'size_asserts': False, 'alignment_asserts': False, 'scalar_asserts': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False, 'fuse_rope_kvcache_cat_mla': False, 'fuse_act_padding': False}, 'max_cudagraph_capture_size': 32, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': False, 'static_all_moe_layers': []}, kernel_config=KernelConfig(ir_op_priority=IrOpPriorityConfig(rms_norm=['native'], fused_add_rms_norm=['native']), enable_flashinfer_autotune=True, moe_backend='marlin', linear_backend='auto')
(EngineCore pid=263) [transformers] `Qwen2VLImageProcessorFast` is deprecated. The `Fast` suffix for image processors has been removed; use `Qwen2VLImageProcessor` instead.
(EngineCore pid=263) INFO 06-03 16:35:10 [parallel_state.py:1422] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://172.18.0.5:50911 backend=nccl
(EngineCore pid=263) INFO 06-03 16:35:10 [parallel_state.py:1735] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0, EPLB rank N/A
(EngineCore pid=263) INFO 06-03 16:35:10 [topk_topp_sampler.py:45] Using FlashInfer for top-p & top-k sampling.
(EngineCore pid=263) WARNING 06-03 16:35:10 [__init__.py:204] min_p and logit_bias parameters won't work with speculative decoding.
(EngineCore pid=263) [transformers] The `use_fast` parameter is deprecated and will be removed in a future version. Use `backend="torchvision"` instead of `use_fast=True`, or `backend="pil"` instead of `use_fast=False`.
(EngineCore pid=263) INFO 06-03 16:35:14 [gpu_model_runner.py:5075] Starting to load model /models/qwen3.6-35B-A3B-NVFP4...
(EngineCore pid=263) INFO 06-03 16:35:14 [cuda.py:433] Using backend AttentionBackendEnum.FLASH_ATTN for vit attention
(EngineCore pid=263) INFO 06-03 16:35:14 [mm_encoder_attention.py:372] Using AttentionBackendEnum.FLASH_ATTN for MMEncoderAttention.
(EngineCore pid=263) INFO 06-03 16:35:14 [__init__.py:569] Selected FlashInferFP8ScaledMMLinearKernel for ModelOptFp8LinearMethod
(EngineCore pid=263) INFO 06-03 16:35:14 [qwen_gdn_linear_attn.py:228] Using Triton/FLA GDN prefill kernel (requested=auto, head_k_dim=None).
(EngineCore pid=263) INFO 06-03 16:35:14 [nvfp4.py:231] Using 'MARLIN' NvFp4 MoE backend out of potential backends: ['FLASHINFER_TRTLLM', 'FLASHINFER_CUTEDSL', 'FLASHINFER_CUTEDSL_BATCHED', 'FLASHINFER_CUTLASS', 'VLLM_CUTLASS', 'MARLIN', 'EMULATION'].
(EngineCore pid=263) INFO 06-03 16:35:14 [cuda.py:318] Using AttentionBackendEnum.FLASHINFER backend.
(EngineCore pid=263) INFO 06-03 16:35:15 [weight_utils.py:922] Filesystem type for checkpoints: EXT4. Checkpoint size: 21.82 GiB. Available RAM: 64.60 GiB.
(EngineCore pid=263) INFO 06-03 16:35:15 [weight_utils.py:945] Auto-prefetch is disabled because the filesystem (EXT4) is not a recognized network FS (NFS/Lustre). If you want to force prefetching, start vLLM with --safetensors-load-strategy=prefetch.
Loading safetensors checkpoint shards:   0% Completed | 0/3 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  33% Completed | 1/3 [00:57<01:54, 57.31s/it]
Loading safetensors checkpoint shards:  67% Completed | 2/3 [01:56<00:58, 58.39s/it]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [02:06<00:00, 36.45s/it]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [02:06<00:00, 42.27s/it]
(EngineCore pid=263)
(EngineCore pid=263) INFO 06-03 16:37:22 [default_loader.py:397] Loading weights took 126.87 seconds
(EngineCore pid=263) WARNING 06-03 16:37:22 [marlin.py:34] Your GPU does not have native support for FP4 computation but FP4 quantization is being used. Weight-only FP4 compression will be used leveraging the Marlin kernel. This may degrade performance for compute-heavy workloads.
(EngineCore pid=263) WARNING 06-03 16:37:23 [marlin_utils_fp4.py:300] Your GPU does not have native support for FP4 computation but FP4 quantization is being used. Weight-only FP4 compression will be used leveraging the Marlin kernel. This may degrade performance for compute-heavy workloads.
(EngineCore pid=263) INFO 06-03 16:37:23 [nvfp4.py:537] Using MoEPrepareAndFinalizeNoDPEPModular
(EngineCore pid=263) INFO 06-03 16:37:26 [gpu_model_runner.py:5099] Loading drafter model...
(EngineCore pid=263) INFO 06-03 16:37:26 [vllm.py:984] Asynchronous scheduling is enabled.
(EngineCore pid=263) INFO 06-03 16:37:26 [kernel.py:270] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'], fused_add_rms_norm=['native'])
(EngineCore pid=263) INFO 06-03 16:37:26 [kernel.py:270] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'], fused_add_rms_norm=['native'])
(EngineCore pid=263) INFO 06-03 16:37:26 [cuda.py:378] Using FLASHINFER attention backend out of potential backends: ['FLASHINFER', 'TRITON_ATTN'].
(EngineCore pid=263) INFO 06-03 16:37:26 [unquantized.py:212] Using TRITON Unquantized MoE backend out of potential backends: ['FlashInfer TRTLLM', 'FlashInfer CUTLASS', 'TRITON', 'BATCHED_TRITON'].
Loading safetensors checkpoint shards:   0% Completed | 0/3 [00:00<?, ?it/s]
(EngineCore pid=263) INFO 06-03 16:37:26 [weight_utils.py:922] Filesystem type for checkpoints: EXT4. Checkpoint size: 21.82 GiB. Available RAM: 56.37 GiB.
Loading safetensors checkpoint shards:  33% Completed | 1/3 [00:06<00:13,  6.94s/it]
Loading safetensors checkpoint shards:  67% Completed | 2/3 [00:07<00:03,  3.18s/it]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:15<00:00,  5.46s/it]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:15<00:00,  5.22s/it]
(EngineCore pid=263)
(EngineCore pid=263) INFO 06-03 16:37:42 [default_loader.py:397] Loading weights took 15.73 seconds
(EngineCore pid=263) INFO 06-03 16:37:42 [unquantized.py:341] Using MoEPrepareAndFinalizeNoDPEPModular
(EngineCore pid=263) INFO 06-03 16:37:42 [llm_base_proposer.py:1328] Detected MTP model. Sharing target model embedding weights with the draft model.
(EngineCore pid=263) INFO 06-03 16:37:42 [llm_base_proposer.py:1384] Detected MTP model. Sharing target model lm_head weights with the draft model.
(EngineCore pid=263) INFO 06-03 16:37:42 [gpu_model_runner.py:5170] Model loading took 21.94 GiB memory and 147.987344 seconds
(EngineCore pid=263) INFO 06-03 16:37:42 [interface.py:662] Setting attention block size to 2144 tokens to ensure that attention page size is >= mamba page size.
(EngineCore pid=263) INFO 06-03 16:37:43 [gpu_model_runner.py:6179] Encoder cache will be initialized with a budget of 16384 tokens, and profiled with 1 image items of the maximum feature size.
(EngineCore pid=263) /usr/local/lib/python3.12/dist-packages/vllm/envs.py:2144: FutureWarning: VLLM_USE_FLASHINFER_MOE_FP4 is deprecated and will be removed in v0.23. Use --moe-backend (e.g. flashinfer_trtllm, flashinfer_cutlass, flashinfer_cutedsl).
(EngineCore pid=263)   raw = getter()
(EngineCore pid=263) /usr/local/lib/python3.12/dist-packages/vllm/envs.py:2144: FutureWarning: VLLM_USE_FLASHINFER_MOE_FP4 is deprecated and will be removed in v0.23. Use --moe-backend (e.g. flashinfer_trtllm, flashinfer_cutlass, flashinfer_cutedsl).
(EngineCore pid=263)   raw = getter()
(EngineCore pid=263) INFO 06-03 16:37:56 [backends.py:1089] Using cache directory: /root/.cache/vllm/torch_compile_cache/a404f2a2de/rank_0_0/backbone for vLLM's torch.compile
(EngineCore pid=263) INFO 06-03 16:37:56 [backends.py:1148] Dynamo bytecode transform time: 6.51 s
(EngineCore pid=263) [rank0]:W0603 16:38:00.566000 263 torch/_inductor/utils.py:1731] Not enough SMs to use max_autotune_gemm mode
(EngineCore pid=263) INFO 06-03 16:38:01 [backends.py:378] Cache the graph of compile range (1, 8192) for later use
(EngineCore pid=263) INFO 06-03 16:38:19 [backends.py:393] Compiling a graph for compile range (1, 8192) takes 22.55 s
(EngineCore pid=263) INFO 06-03 16:38:21 [decorators.py:708] saved AOT compiled function to /root/.cache/vllm/torch_compile_cache/torch_aot_compile/a5d5ced553da0bab8ed18975f13669d36cbe7e6c7df08660fdb13a91502de5db/rank_0_0/model
(EngineCore pid=263) INFO 06-03 16:38:21 [monitor.py:53] torch.compile took 31.35 s in total
(EngineCore pid=263) INFO 06-03 16:38:54 [marlin_utils.py:437] Marlin kernel can achieve better performance for small size_n with experimental use_atomic_add feature. You can consider set environment variable VLLM_MARLIN_USE_ATOMIC_ADD to 1 if possible.
(EngineCore pid=263) INFO 06-03 16:38:57 [monitor.py:81] Initial profiling/warmup run took 36.36 s
(EngineCore pid=263) /usr/local/lib/python3.12/dist-packages/vllm/envs.py:2144: FutureWarning: VLLM_USE_FLASHINFER_MOE_FP4 is deprecated and will be removed in v0.23. Use --moe-backend (e.g. flashinfer_trtllm, flashinfer_cutlass, flashinfer_cutedsl).
(EngineCore pid=263)   raw = getter()
(EngineCore pid=263) INFO 06-03 16:38:57 [backends.py:1089] Using cache directory: /root/.cache/vllm/torch_compile_cache/a404f2a2de/rank_0_0/eagle_head for vLLM's torch.compile
(EngineCore pid=263) INFO 06-03 16:38:57 [backends.py:1148] Dynamo bytecode transform time: 0.35 s
(EngineCore pid=263) INFO 06-03 16:39:04 [backends.py:393] Compiling a graph for compile range (1, 8192) takes 6.29 s
(EngineCore pid=263) INFO 06-03 16:39:04 [decorators.py:708] saved AOT compiled function to /root/.cache/vllm/torch_compile_cache/torch_aot_compile/4278900251c5847da4039e4393fed9b34372422be97abf94adcbc268f996460a/rank_0_0/model
(EngineCore pid=263) INFO 06-03 16:39:04 [monitor.py:53] torch.compile took 6.76 s in total
(EngineCore pid=263) WARNING 06-03 16:39:05 [fused_moe.py:1071] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=256,N=512,device_name=NVIDIA_GB10.json
(EngineCore pid=263) INFO 06-03 16:39:05 [monitor.py:81] Initial profiling/warmup run took 1.32 s
(EngineCore pid=263) WARNING 06-03 16:39:07 [kv_cache_utils.py:1157] Add 3 padding layers, may waste at most 10.00% KV cache memory
(EngineCore pid=263) WARNING 06-03 16:39:07 [compilation.py:1416] CUDAGraphMode.FULL_AND_PIECEWISE is not supported with spec-decode for attention backend FlashInferBackend (support: AttentionCGSupport.UNIFORM_SINGLE_TOKEN_DECODE); setting cudagraph_mode=PIECEWISE
(EngineCore pid=263) INFO 06-03 16:39:07 [gpu_model_runner.py:6380] Profiling CUDA graph memory: PIECEWISE=7 (largest=32)
(EngineCore pid=263) INFO 06-03 16:39:08 [gpu_model_runner.py:6485] Estimated CUDA graph memory: 0.20 GiB total
(EngineCore pid=263) INFO 06-03 16:39:08 [gpu_worker.py:469] Available KV cache memory: 51.19 GiB
(EngineCore pid=263) INFO 06-03 16:39:08 [gpu_worker.py:484] CUDA graph memory profiling is enabled (default since v0.21.0). The current --gpu-memory-utilization=0.6500 is equivalent to --gpu-memory-utilization=0.6484 without CUDA graph memory profiling. To maintain the same effective KV cache size as before, increase --gpu-memory-utilization to 0.6516. To disable, set VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=0.
(EngineCore pid=263) WARNING 06-03 16:39:08 [kv_cache_utils.py:1157] Add 3 padding layers, may waste at most 10.00% KV cache memory
(EngineCore pid=263) INFO 06-03 16:39:08 [kv_cache_utils.py:1733] GPU KV cache size: 4,323,476 tokens
(EngineCore pid=263) INFO 06-03 16:39:08 [kv_cache_utils.py:1734] Maximum concurrency for 262,144 tokens per request: 16.49x
(EngineCore pid=263) 2026-06-03 16:39:14,868 - INFO - autotuner.py:615 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
[AutoTuner]: Tuning fp8_gemm: 100%|██████████| 21/21 [00:07<00:00,  2.86profile/s]
(EngineCore pid=263) cudnn_handle created for device_id = 0
(EngineCore pid=263)
[AutoTuner]: Tuning fp8_gemm: 100%|██████████| 21/21 [00:05<00:00,  3.67profile/s]
[AutoTuner]: Tuning fp8_gemm: 100%|██████████| 21/21 [00:00<00:00, 14.05profile/s]
[AutoTuner]: Tuning fp8_gemm: 100%|██████████| 21/21 [00:00<00:00, 86.88profile/s]
[AutoTuner]: Tuning fp8_gemm: 100%|██████████| 21/21 [00:00<00:00, 24.04profile/s]
[AutoTuner]: Tuning fp8_gemm: 100%|██████████| 21/21 [00:00<00:00, 101.49profile/s]
[AutoTuner]: Tuning fp8_gemm: 100%|██████████| 21/21 [00:04<00:00,  4.24profile/s]
[AutoTuner]: Tuning fp8_gemm: 100%|██████████| 21/21 [00:00<00:00, 102.16profile/s]
[AutoTuner]: Tuning fp8_gemm: 100%|██████████| 21/21 [00:00<00:00, 24.22profile/s]
[AutoTuner]: Tuning fp8_gemm: 100%|██████████| 21/21 [00:00<00:00, 102.55profile/s]
[AutoTuner]: Tuning fp8_gemm: 100%|██████████| 21/21 [00:00<00:00, 23.93profile/s]
[AutoTuner]: Tuning fp8_gemm: 100%|██████████| 21/21 [00:00<00:00, 15.13profile/s]
[AutoTuner]: Tuning fp8_gemm: 100%|██████████| 21/21 [00:00<00:00, 23.91profile/s]
[AutoTuner]: Tuning fp8_gemm: 100%|██████████| 21/21 [00:00<00:00, 15.13profile/s]
[AutoTuner]: Tuning fp8_gemm: 100%|██████████| 21/21 [00:00<00:00,  8.54profile/s]
[AutoTuner]: Tuning fp8_gemm: 100%|██████████| 21/21 [00:00<00:00, 14.87profile/s]
[AutoTuner]: Tuning fp8_gemm: 100%|██████████| 21/21 [00:00<00:00, 14.30profile/s]
[AutoTuner]: Tuning fp8_gemm: 100%|██████████| 21/21 [00:00<00:00, 13.91profile/s]
[AutoTuner]: Tuning fp8_gemm: 100%|██████████| 21/21 [00:00<00:00, 13.70profile/s]
[AutoTuner]: Tuning fp8_gemm: 100%|██████████| 21/21 [00:00<00:00,  8.53profile/s]
[AutoTuner]: Tuning fp8_gemm: 100%|██████████| 21/21 [00:00<00:00, 14.96profile/s]
[AutoTuner]: Tuning fp8_gemm: 100%|██████████| 21/21 [00:00<00:00, 14.21profile/s]
[AutoTuner]: Tuning fp8_gemm: 100%|██████████| 21/21 [00:00<00:00, 13.65profile/s]
[AutoTuner]: Tuning fp8_gemm: 100%|██████████| 21/21 [00:00<00:00, 13.90profile/s]
[AutoTuner]: Tuning fp8_gemm: 100%|██████████| 21/21 [00:00<00:00,  8.53profile/s]
[AutoTuner]: Tuning fp8_gemm: 100%|██████████| 21/21 [00:00<00:00, 14.86profile/s]
[AutoTuner]: Tuning fp8_gemm: 100%|██████████| 21/21 [00:00<00:00, 14.22profile/s]
[AutoTuner]: Tuning fp8_gemm: 100%|██████████| 21/21 [00:00<00:00, 13.74profile/s]
[AutoTuner]: Tuning fp8_gemm: 100%|██████████| 21/21 [00:00<00:00, 13.67profile/s]
[AutoTuner]: Tuning fp8_gemm: 100%|██████████| 21/21 [00:00<00:00,  8.07profile/s]
[AutoTuner]: Tuning fp8_gemm: 100%|██████████| 21/21 [00:00<00:00, 15.03profile/s]
[AutoTuner]: Tuning fp8_gemm: 100%|██████████| 21/21 [00:00<00:00, 13.96profile/s]
[AutoTuner]: Tuning fp8_gemm: 100%|██████████| 21/21 [00:00<00:00, 13.73profile/s]
[AutoTuner]: Tuning fp8_gemm: 100%|██████████| 21/21 [00:00<00:00, 14.00profile/s]
[AutoTuner]: Tuning fp8_gemm: 100%|██████████| 21/21 [00:00<00:00, 13.75profile/s]
[AutoTuner]: Tuning fp8_gemm: 100%|██████████| 21/21 [00:00<00:00, 14.21profile/s]
[AutoTuner]: Tuning fp8_gemm: 100%|██████████| 21/21 [00:00<00:00, 13.91profile/s]
[AutoTuner]: Tuning fp8_gemm: 100%|██████████| 21/21 [00:00<00:00, 13.70profile/s]
[AutoTuner]: Tuning fp8_gemm: 100%|██████████| 21/21 [00:00<00:00, 13.80profile/s]
[AutoTuner]: Tuning fp8_gemm: 100%|██████████| 21/21 [00:00<00:00, 14.31profile/s]
[AutoTuner]: Tuning fp8_gemm: 100%|██████████| 21/21 [00:00<00:00, 13.78profile/s]
[AutoTuner]: Tuning fp8_gemm: 100%|██████████| 21/21 [00:00<00:00, 13.81profile/s]
[AutoTuner]: Tuning fp8_gemm: 100%|██████████| 21/21 [00:00<00:00, 13.85profile/s]
[AutoTuner]: Tuning fp8_gemm: 100%|██████████| 21/21 [00:00<00:00, 14.09profile/s]
[AutoTuner]: Tuning fp8_gemm: 100%|██████████| 21/21 [00:00<00:00, 13.75profile/s]
[AutoTuner]: Tuning fp8_gemm: 100%|██████████| 21/21 [00:00<00:00, 13.83profile/s]
[AutoTuner]: Tuning fp8_gemm: 100%|██████████| 21/21 [00:00<00:00, 13.76profile/s]
[AutoTuner]: Tuning fp8_gemm: 100%|██████████| 21/21 [00:00<00:00, 14.21profile/s]
(EngineCore pid=263) 2026-06-03 16:39:41,927 - INFO - autotuner.py:634 - flashinfer.jit: [Autotuner]: Autotuning process ends
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|██████████| 7/7 [00:01<00:00,  3.91it/s]
(EngineCore pid=263) INFO 06-03 16:39:44 [gpu_model_runner.py:6553] Graph capturing finished in 3 secs, took 0.30 GiB
(EngineCore pid=263) INFO 06-03 16:39:44 [gpu_worker.py:622] CUDA graph pool memory: 0.3 GiB (actual), 0.2 GiB (estimated), difference: 0.1 GiB (34.6%).
(EngineCore pid=263) INFO 06-03 16:39:44 [jit_monitor.py:54] Kernel JIT monitor activated — Triton JIT compilations during inference will be logged as warnings.
(EngineCore pid=263) INFO 06-03 16:39:44 [core.py:302] init engine (profile, create kv cache, warmup model) took 121.57 s (compilation: 38.12 s)
(EngineCore pid=263) /usr/local/lib/python3.12/dist-packages/vllm/envs.py:1999: FutureWarning: VLLM_USE_FLASHINFER_MOE_FP4 is deprecated and will be removed in v0.23. Use --moe-backend (e.g. flashinfer_trtllm, flashinfer_cutlass, flashinfer_cutedsl).
(EngineCore pid=263)   return environment_variables[name]()
(EngineCore pid=263) INFO 06-03 16:39:44 [kernel.py:270] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'], fused_add_rms_norm=['native'])
(APIServer pid=1) INFO 06-03 16:39:46 [api_server.py:580] Supported tasks: ['generate']
(APIServer pid=1) INFO 06-03 16:39:46 [parser_manager.py:202] "auto" tool choice has been enabled.
(APIServer pid=1) WARNING 06-03 16:39:46 [model.py:1508] Default vLLM sampling parameters have been overridden by the model's `generation_config.json`: `{'temperature': 1.0, 'top_k': 20, 'top_p': 0.95}`. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
(APIServer pid=1) INFO 06-03 16:39:50 [hf.py:488] Detected the chat template content format to be 'openai'. You can set `--chat-template-content-format` to override this.
(APIServer pid=1) INFO 06-03 16:40:02 [base.py:227] Multi-modal warmup completed in 12.413s
(APIServer pid=1) INFO 06-03 16:40:03 [base.py:227] Readonly multi-modal warmup completed in 0.428s
(APIServer pid=1) INFO 06-03 16:40:03 [api_server.py:584] Starting vLLM server on http://0.0.0.0:8000
(APIServer pid=1) INFO 06-03 16:40:03 [launcher.py:37] Available routes are:
(APIServer pid=1) INFO 06-03 16:40:03 [launcher.py:46] Route: /openapi.json, Methods: GET, HEAD
(APIServer pid=1) INFO 06-03 16:40:03 [launcher.py:46] Route: /docs, Methods: GET, HEAD
(APIServer pid=1) INFO 06-03 16:40:03 [launcher.py:46] Route: /docs/oauth2-redirect, Methods: GET, HEAD
(APIServer pid=1) INFO 06-03 16:40:03 [launcher.py:46] Route: /redoc, Methods: GET, HEAD
(APIServer pid=1) INFO 06-03 16:40:03 [launcher.py:46] Route: /tokenize, Methods: POST
(APIServer pid=1) INFO 06-03 16:40:03 [launcher.py:46] Route: /detokenize, Methods: POST
(APIServer pid=1) INFO 06-03 16:40:03 [launcher.py:46] Route: /load, Methods: GET
(APIServer pid=1) INFO 06-03 16:40:03 [launcher.py:46] Route: /version, Methods: GET
(APIServer pid=1) INFO 06-03 16:40:03 [launcher.py:46] Route: /health, Methods: GET
(APIServer pid=1) INFO 06-03 16:40:03 [launcher.py:46] Route: /metrics, Methods: GET
(APIServer pid=1) INFO 06-03 16:40:03 [launcher.py:46] Route: /v1/models, Methods: GET
(APIServer pid=1) INFO 06-03 16:40:03 [launcher.py:46] Route: /ping, Methods: GET
(APIServer pid=1) INFO 06-03 16:40:03 [launcher.py:46] Route: /ping, Methods: POST
(APIServer pid=1) INFO 06-03 16:40:03 [launcher.py:46] Route: /invocations, Methods: POST
(APIServer pid=1) INFO 06-03 16:40:03 [launcher.py:46] Route: /v1/chat/completions, Methods: POST
(APIServer pid=1) INFO 06-03 16:40:03 [launcher.py:46] Route: /v1/chat/completions/batch, Methods: POST
(APIServer pid=1) INFO 06-03 16:40:03 [launcher.py:46] Route: /v1/responses, Methods: POST
(APIServer pid=1) INFO 06-03 16:40:03 [launcher.py:46] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=1) INFO 06-03 16:40:03 [launcher.py:46] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=1) INFO 06-03 16:40:03 [launcher.py:46] Route: /v1/completions, Methods: POST
(APIServer pid=1) INFO 06-03 16:40:03 [launcher.py:46] Route: /v1/messages, Methods: POST
(APIServer pid=1) INFO 06-03 16:40:03 [launcher.py:46] Route: /v1/messages/count_tokens, Methods: POST
(APIServer pid=1) INFO 06-03 16:40:03 [launcher.py:46] Route: /generative_scoring, Methods: POST
(APIServer pid=1) INFO 06-03 16:40:03 [launcher.py:46] Route: /inference/v1/generate, Methods: POST
(APIServer pid=1) INFO 06-03 16:40:03 [launcher.py:46] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=1) INFO 06-03 16:40:03 [launcher.py:46] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=1) INFO 06-03 16:40:03 [launcher.py:46] Route: /v1/chat/completions/render, Methods: POST
(APIServer pid=1) INFO 06-03 16:40:03 [launcher.py:46] Route: /v1/completions/render, Methods: POST
(APIServer pid=1) INFO:     Started server process [1]
(APIServer pid=1) INFO:     Waiting for application startup.
(APIServer pid=1) INFO:     Application startup complete.
(APIServer pid=1) INFO:     172.18.0.4:34444 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(EngineCore pid=263) WARNING 06-03 16:43:20 [jit_monitor.py:103] Triton kernel JIT compilation during inference: _zero_kv_blocks_kernel. This causes a latency spike; consider extending warmup to cover this shape/config.
(EngineCore pid=263) WARNING 06-03 16:43:20 [jit_monitor.py:103] Triton kernel JIT compilation during inference: _compute_slot_mapping_kernel. This causes a latency spike; consider extending warmup to cover this shape/config.
(EngineCore pid=263) WARNING 06-03 16:43:21 [jit_monitor.py:103] Triton kernel JIT compilation during inference: _copy_page_indices_kernel. This causes a latency spike; consider extending warmup to cover this shape/config.
(EngineCore pid=263) 2026-06-03 16:43:21,176 - WARNING - autotuner.py:1121 - flashinfer.jit: [AutoTuner]: No tuned config covers fp8_gemm input_shapes=(torch.Size([1, 6432, 2048]), torch.Size([1, 2048, 12288]), torch.Size([]), torch.Size([]), torch.Size([1, 6432, 12288]), torch.Size([33554432])); falling back to runner=CutlassFp8GemmRunner tactic=-1.  This shape is outside the tuning bucket range -- expand tuning_buckets / max_num_tokens during the next tuning pass to avoid this perf cliff.
(EngineCore pid=263) WARNING 06-03 16:43:21 [jit_monitor.py:103] Triton kernel JIT compilation during inference: _causal_conv1d_fwd_kernel. This causes a latency spike; consider extending warmup to cover this shape/config.
(EngineCore pid=263) 2026-06-03 16:43:21,502 - WARNING - autotuner.py:1121 - flashinfer.jit: [AutoTuner]: No tuned config covers fp8_gemm input_shapes=(torch.Size([1, 6432, 4096]), torch.Size([1, 4096, 2048]), torch.Size([]), torch.Size([]), torch.Size([1, 6432, 2048]), torch.Size([33554432])); falling back to runner=CutlassFp8GemmRunner tactic=-1.  This shape is outside the tuning bucket range -- expand tuning_buckets / max_num_tokens during the next tuning pass to avoid this perf cliff.
(EngineCore pid=263) 2026-06-03 16:43:21,521 - WARNING - autotuner.py:1121 - flashinfer.jit: [AutoTuner]: No tuned config covers fp8_gemm input_shapes=(torch.Size([1, 6432, 2048]), torch.Size([1, 2048, 9216]), torch.Size([]), torch.Size([]), torch.Size([1, 6432, 9216]), torch.Size([33554432])); falling back to runner=CutlassFp8GemmRunner tactic=-1.  This shape is outside the tuning bucket range -- expand tuning_buckets / max_num_tokens during the next tuning pass to avoid this perf cliff.
(EngineCore pid=263) WARNING 06-03 16:43:22 [jit_monitor.py:103] Triton kernel JIT compilation during inference: postprocess_mamba_fused_kernel. This causes a latency spike; consider extending warmup to cover this shape/config.
(EngineCore pid=263) WARNING 06-03 16:43:22 [jit_monitor.py:103] Triton kernel JIT compilation during inference: eagle_prepare_next_token_padded_kernel. This causes a latency spike; consider extending warmup to cover this shape/config.
(EngineCore pid=263) WARNING 06-03 16:43:22 [jit_monitor.py:103] Triton kernel JIT compilation during inference: eagle_step_slot_mapping_metadata_kernel. This causes a latency spike; consider extending warmup to cover this shape/config.
(EngineCore pid=263) WARNING 06-03 16:43:22 [jit_monitor.py:103] Triton kernel JIT compilation during inference: batch_memcpy_kernel. This causes a latency spike; consider extending warmup to cover this shape/config.
(EngineCore pid=263) WARNING 06-03 16:43:27 [jit_monitor.py:103] Triton kernel JIT compilation during inference: _fused_post_conv_kernel. This causes a latency spike; consider extending warmup to cover this shape/config.
(EngineCore pid=263) WARNING 06-03 16:43:28 [jit_monitor.py:103] Triton kernel JIT compilation during inference: fused_moe_kernel. This causes a latency spike; consider extending warmup to cover this shape/config.
(EngineCore pid=263) WARNING 06-03 16:43:29 [jit_monitor.py:103] Triton kernel JIT compilation during inference: _causal_conv1d_update_kernel. This causes a latency spike; consider extending warmup to cover this shape/config.
(EngineCore pid=263) WARNING 06-03 16:43:29 [jit_monitor.py:103] Triton kernel JIT compilation during inference: fused_sigmoid_gating_delta_rule_update_kernel. This causes a latency spike; consider extending warmup to cover this shape/config.
(EngineCore pid=263) WARNING 06-03 16:43:29 [jit_monitor.py:103] Triton kernel JIT compilation during inference: expand_kernel. This causes a latency spike; consider extending warmup to cover this shape/config.
(EngineCore pid=263) WARNING 06-03 16:43:30 [jit_monitor.py:103] Triton kernel JIT compilation during inference: eagle_prepare_inputs_padded_kernel. This causes a latency spike; consider extending warmup to cover this shape/config.
(APIServer pid=1) INFO:     172.18.0.4:34444 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1) INFO 06-03 16:43:33 [loggers.py:271] Engine 000: Avg prompt throughput: 3329.9 tokens/s, Avg generation throughput: 10.6 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 06-03 16:43:33 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 3.09, Accepted throughput: 0.32 tokens/s, Drafted throughput: 0.46 tokens/s, Accepted: 73 tokens, Drafted: 105 tokens, Per-position acceptance rate: 0.800, 0.686, 0.600, Avg Draft acceptance rate: 69.5%
(APIServer pid=1) INFO:     172.18.0.4:34444 - "POST /v1/chat/completions HTTP/1.1" 200 OK

cboergermann

7 days ago

Is it better than the RedHat Version? Because that works ootb with the current vLLM 0.22.1

MrPMorris

7 days ago

I got Claude to write a script to iterate over lots of options. After 3+ hours of testing, this was the fastest combination at 49.15 tokens per second.

recipe_version: '2'

model: Qwen/Qwen3.6-35B-A3B-FP8

runtime: vllm
builder: eugr
container: ghcr.io/spark-arena/dgx-vllm-eugr-nightly:latest

min_nodes: 1
max_nodes: 1

metadata:
  description: "Qwen3.6-35B-A3B-FP8 - native FP8 format"

mods:
- mods/fix-qwen3-coder-next
- mods/fix-qwen3.5-chat-template

defaults:
  port: 8000
  host: 0.0.0.0
  tensor_parallel: 1
  pipeline_parallel: 1
  gpu_memory_utilization: 0.9
  max_model_len: 65536
  max_num_batched_tokens: 8192
  max_num_seqs: 4
  load_format: instanttensor
  kv_cache_dtype: fp8
  attention_backend: flashinfer
  tool_call_parser: qwen3_coder
  reasoning_parser: qwen3
  speculative_config: '{"method": "mtp", "num_speculative_tokens": 3}'

env:
  VLLM_MARLIN_USE_ATOMIC_ADD: '1'

command: |
  vllm serve {model} \
    --host {host} \
    --port {port} \
    --max-model-len {max_model_len} \
    --max-num-batched-tokens {max_num_batched_tokens} \
    --trust-remote-code \
    --chat-template unsloth.jinja \
    --gpu-memory-utilization {gpu_memory_utilization} \
    --enable-auto-tool-choice \
    --tool-call-parser {tool_call_parser} \
    --reasoning-parser {reasoning_parser} \
    --kv-cache-dtype {kv_cache_dtype} \
    --load-format {load_format} \
    --attention-backend {attention_backend} \
    --speculative-config '{speculative_config}' \
    -tp {tensor_parallel} \
    -pp {pipeline_parallel} \
    --max-num-seqs {max_num_seqs}

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment