---
tags:
- neuron
- aws
- inf2
- qwen3-vl
- pre-compiled
library_name: neuronx-distributed-inference
---

# Qwen3-VL-4B-Instruct: Pre-compiled for AWS Inferentia2

Pre-compiled artifacts for running [Qwen/Qwen3-VL-4B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-4B-Instruct) on AWS Inferentia2 (inf2) instances.

## Configuration

| Setting | Value |
|---------|-------|
| Instance type | inf2.xlarge / inf2.8xlarge (2 NeuronCores) |
| Tensor parallel | 2 |
| Batch size | 1 |
| Max sequence length | 4096 |
| Data type | BF16 |
| ISA Kernels | All OFF |
| Buckets (context) | 512, 1024, 4096 |
| Buckets (token gen) | 512, 1024, 4096 |
| Vision buckets | 512, 1024, 4096 |
| SDK | Neuron SDK 2.28 (DLAMI 20260227) |
| NxD Inference | 0.8.x |
| vLLM | 0.13.x |

## Usage

### Prerequisites

- AWS Inferentia2 instance (inf2.xlarge or larger)
- Deep Learning AMI Neuron (Ubuntu 24.04) 20260227
- Pre-installed venv: `/opt/aws_neuronx_venv_pytorch_inference_vllm_0_13/`

### Quick Start

```bash
# Activate environment
source /opt/aws_neuronx_venv_pytorch_inference_vllm_0_13/bin/activate
pip install huggingface_hub

# Download original model weights (required for weight loading)
python -c "
from huggingface_hub import snapshot_download
snapshot_download(repo_id='Qwen/Qwen3-VL-4B-Instruct', local_dir='Qwen3-VL-4B-Instruct')
"

# Download pre-compiled artifacts
python -c "
from huggingface_hub import snapshot_download
snapshot_download(repo_id='jburtoft/Qwen3-VL-4B-Instruct-neuron-inf2-tp2', local_dir='neuron-artifacts')
"

# Set environment
export NEURON_COMPILED_ARTIFACTS=$PWD/neuron-artifacts/bs1_tp2
export VLLM_NEURON_FRAMEWORK=neuronx-distributed-inference
```

### Python API (vLLM)

```python
import os

# Set these before importing vLLM
os.environ["NEURON_COMPILED_ARTIFACTS"] = "neuron-artifacts/bs1_tp2"
os.environ["VLLM_NEURON_FRAMEWORK"] = "neuronx-distributed-inference"

from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen3-VL-4B-Instruct",
    trust_remote_code=True,
    dtype="bfloat16",
    tensor_parallel_size=2,
    max_num_seqs=1,
    max_model_len=4096,
    swap_space=0,
    additional_config=dict(override_neuron_config=dict(
        text_neuron_config={
            "batch_size": 1,
            "ctx_batch_size": 1,
            "tkg_batch_size": 1,
            "seq_len": 4096,
            "max_context_length": 4096,
            "torch_dtype": "bfloat16",
            "tp_degree": 2,
            "world_size": 2,
            "enable_bucketing": True,
            "context_encoding_buckets": [512, 1024, 4096],
            "token_generation_buckets": [512, 1024, 4096],
            "fused_qkv": True,
            "qkv_kernel_enabled": False,
            "mlp_kernel_enabled": False,
            "attn_kernel_enabled": False,
            "logical_neuron_cores": 1,
            "cc_pipeline_tiling_factor": 1,
            "rpl_reduce_dtype": "bfloat16",
            "attention_dtype": "bfloat16",
            "cast_type": "as-declared",
        },
        vision_neuron_config={
            "batch_size": 1,
            "seq_len": 4096,
            "max_context_length": 4096,
            "enable_bucketing": True,
            "buckets": [512, 1024, 4096],
            "world_size": 2,
            "tp_degree": 2,
            "torch_dtype": "bfloat16",
            "rpl_reduce_dtype": "bfloat16",
            "cast_type": "as-declared",
            "logical_neuron_cores": 1,
            "cc_pipeline_tiling_factor": 1,
            "fused_qkv": True,
            "attn_kernel_enabled": False,
            "mlp_kernel_enabled": False,
        },
    )),
    limit_mm_per_prompt={"image": 1},
    enable_prefix_caching=False,
    enable_chunked_prefill=False,
)

# Run inference
sampling = SamplingParams(top_k=1, max_tokens=256, temperature=0.0)
outputs = llm.generate([{"prompt": "Hello, what can you do?"}], sampling)
print(outputs[0].outputs[0].text)
```

## Important Notes

1. **Original model weights required**: This repo contains only compiled NEFFs and (if available) pre-sharded weight checkpoints. You still need the original `Qwen/Qwen3-VL-4B-Instruct` model weights on disk.

2. **`tie_word_embeddings` fix**: The original model has `tie_word_embeddings=true`. You must either add `lm_head.weight` to the safetensors file or apply the monkey-patch (see the benchmark script for details).

3. **inf2.xlarge (16 GB RAM)**: System RAM is tight. Use `swap_space=0` to avoid vLLM allocating swap memory. Pre-sharded checkpoints (if included) help reduce peak memory during weight loading.

4. **Artifacts are SDK-version and hardware specific**: These artifacts only work on inf2 instances with Neuron SDK 2.28 (DLAMI 20260227).
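
Because loading fails late if any of the pieces above are missing, it can help to check the on-disk layout before starting vLLM. The following is a minimal sketch, not part of this repo: the `preflight` helper is hypothetical, the directory names are the ones used in Quick Start, and it assumes the flag appears either at the top level of `config.json` or under `text_config`. It only checks that files exist and reads the config; it cannot tell whether the `lm_head.weight` fix has already been applied, and it does not verify the SDK version.

```python
import json
from pathlib import Path


def preflight(model_dir, artifacts_dir):
    """Return a list of findings; an empty list means the basic layout looks right.

    Hypothetical helper for illustration only.
    """
    findings = []
    model_dir, artifacts_dir = Path(model_dir), Path(artifacts_dir)

    # Note 1: the original model weights must be on disk.
    config_path = model_dir / "config.json"
    if not config_path.is_file():
        findings.append(f"missing {config_path}")
    if not any(model_dir.glob("*.safetensors")):
        findings.append(f"no *.safetensors files in {model_dir}")

    # The compiled artifacts directory must exist and contain something.
    if not artifacts_dir.is_dir() or not any(artifacts_dir.iterdir()):
        findings.append(f"artifacts directory {artifacts_dir} is missing or empty")

    # Note 2: remind about tied embeddings (assumed config locations, see lead-in).
    if config_path.is_file():
        config = json.loads(config_path.read_text())
        tied = config.get("tie_word_embeddings") or config.get(
            "text_config", {}
        ).get("tie_word_embeddings")
        if tied:
            findings.append(
                "tie_word_embeddings is true: add lm_head.weight or apply the monkey-patch"
            )
    return findings


if __name__ == "__main__":
    for finding in preflight("Qwen3-VL-4B-Instruct", "neuron-artifacts/bs1_tp2"):
        print("WARNING:", finding)
```

Run it from the directory that holds both downloads. With the unmodified model config, the `tie_word_embeddings` reminder is expected to appear even when everything else is in place.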