---
tags:
- neuron
- aws
- inf2
- qwen3-vl
- pre-compiled
library_name: neuronx-distributed-inference
---

# Qwen3-VL-4B-Instruct — Pre-compiled for AWS Inferentia2

Pre-compiled artifacts for running [Qwen/Qwen3-VL-4B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-4B-Instruct) on AWS Inferentia2 (inf2) instances.

## Configuration

| Setting | Value |
|---------|-------|
| Instance type | inf2.xlarge / inf2.8xlarge (2 NeuronCores) |
| Tensor parallel | 2 |
| Batch size | 1 |
| Max sequence length | 4096 |
| Data type | BF16 |
| ISA Kernels | All OFF |
| Buckets (context) | 512, 1024, 4096 |
| Buckets (token gen) | 512, 1024, 4096 |
| Vision buckets | 512, 1024, 4096 |
| SDK | Neuron SDK 2.28 (DLAMI 20260227) |
| NxD Inference | 0.8.x |
| vLLM | 0.13.x |

## Usage

### Prerequisites

- AWS Inferentia2 instance (inf2.xlarge or larger)
- Deep Learning AMI Neuron (Ubuntu 24.04) 20260227
- Pre-installed venv: `/opt/aws_neuronx_venv_pytorch_inference_vllm_0_13/`

### Quick Start

```bash
# Activate environment
source /opt/aws_neuronx_venv_pytorch_inference_vllm_0_13/bin/activate
pip install huggingface_hub

# Download original model weights (required for weight loading)
python -c "
from huggingface_hub import snapshot_download
snapshot_download(repo_id='Qwen/Qwen3-VL-4B-Instruct', local_dir='Qwen3-VL-4B-Instruct')
"

# Download pre-compiled artifacts
python -c "
from huggingface_hub import snapshot_download
snapshot_download(repo_id='jburtoft/Qwen3-VL-4B-Instruct-neuron-inf2-tp2', local_dir='neuron-artifacts')
"

# Set environment
export NEURON_COMPILED_ARTIFACTS=$PWD/neuron-artifacts/bs1_tp2
export VLLM_NEURON_FRAMEWORK=neuronx-distributed-inference
```

### Python API (vLLM)

```python
import os
os.environ["NEURON_COMPILED_ARTIFACTS"] = "neuron-artifacts/bs1_tp2"
os.environ["VLLM_NEURON_FRAMEWORK"] = "neuronx-distributed-inference"

from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen3-VL-4B-Instruct",
    trust_remote_code=True,
    dtype="bfloat16",
    tensor_parallel_size=2,
    max_num_seqs=1,
    max_model_len=4096,
    swap_space=0,
    additional_config=dict(override_neuron_config=dict(
        text_neuron_config={
            "batch_size": 1, "ctx_batch_size": 1, "tkg_batch_size": 1,
            "seq_len": 4096, "max_context_length": 4096,
            "torch_dtype": "bfloat16", "tp_degree": 2, "world_size": 2,
            "enable_bucketing": True,
            "context_encoding_buckets": [512, 1024, 4096],
            "token_generation_buckets": [512, 1024, 4096],
            "fused_qkv": True,
            "qkv_kernel_enabled": False, "mlp_kernel_enabled": False,
            "attn_kernel_enabled": False,
            "logical_neuron_cores": 1, "cc_pipeline_tiling_factor": 1,
            "rpl_reduce_dtype": "bfloat16", "attention_dtype": "bfloat16",
            "cast_type": "as-declared",
        },
        vision_neuron_config={
            "batch_size": 1, "seq_len": 4096, "max_context_length": 4096,
            "enable_bucketing": True, "buckets": [512, 1024, 4096],
            "world_size": 2, "tp_degree": 2,
            "torch_dtype": "bfloat16", "rpl_reduce_dtype": "bfloat16",
            "cast_type": "as-declared",
            "logical_neuron_cores": 1, "cc_pipeline_tiling_factor": 1,
            "fused_qkv": True,
            "attn_kernel_enabled": False, "mlp_kernel_enabled": False,
        },
    )),
    limit_mm_per_prompt={"image": 1},
    enable_prefix_caching=False,
    enable_chunked_prefill=False,
)

# Run inference
sampling = SamplingParams(top_k=1, max_tokens=256, temperature=0.0)
outputs = llm.generate([{"prompt": "Hello, what can you do?"}], sampling)
print(outputs[0].outputs[0].text)
```

## Important Notes

1. **Original model weights required**: This repo contains only compiled NEFFs and
   (if available) pre-sharded weight checkpoints. You still need the original
   `Qwen/Qwen3-VL-4B-Instruct` model weights on disk.

2. **`tie_word_embeddings` fix**: The original model has `tie_word_embeddings=true`.
   You must either add `lm_head.weight` to the safetensors file or apply the
   monkey-patch (see the benchmark script for details).

3. **inf2.xlarge (16 GB RAM)**: System RAM is tight. Use `swap_space=0` to avoid
   vLLM allocating swap memory. Pre-sharded checkpoints (if included) help reduce
   peak memory during weight loading.

4. **Artifacts are SDK-version and hardware specific**: These artifacts only work on
   inf2 instances with Neuron SDK 2.28 (DLAMI 20260227).