Qwen3-14B with LoRA -- Pre-compiled for AWS Inferentia2

Pre-compiled artifacts for running Qwen/Qwen3-14B with LoRA adapters on AWS Inferentia2 (inf2.24xlarge).

Configuration

| Setting | Value |
|---|---|
| Instance type | inf2.24xlarge (12 NeuronCores) |
| Tensor parallel | 4 |
| Batch size | 1 |
| Max sequence length | 4096 |
| Data type | BF16 |
| ISA kernels | All OFF (required on inf2) |
| Compile time | 753 s |
| SDK | Neuron SDK 2.28 (DLAMI 20260227) |
| NxD Inference | 0.8.x |
| vLLM-neuron | 0.4.1 |
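
As a rough guide to what tensor parallelism of 4 means for memory, the per-rank share of the base weights can be estimated from figures in this README. This is a back-of-the-envelope sketch only: it uses the ~30 GB BF16 weight size quoted under Important Notes, assumes the weights shard evenly, and ignores the KV cache, LoRA weights, and runtime overhead.

```python
# Rough per-rank weight footprint under tensor parallelism.
BASE_WEIGHTS_GB = 30   # approximate BF16 size of Qwen/Qwen3-14B, from this README
TENSOR_PARALLEL = 4    # tensor parallel degree from the configuration table


def weights_per_rank_gb(total_gb: float, tp: int) -> float:
    """Weights held by each tensor-parallel rank, assuming an even split."""
    return total_gb / tp


share = weights_per_rank_gb(BASE_WEIGHTS_GB, TENSOR_PARALLEL)
print(f"~{share:.1f} GB of base weights per NeuronCore")
```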

Benchmark Results

| Config | Throughput (tok/s) | Latency (s) | Avg tokens |
|---|---|---|---|
| Adapter A (nicoboss/Uncensored) | 27.6 +/- 0.0 | 9.26 | 256 |
| Adapter B (Wuhall/LoRA) | 27.5 +/- 0.1 | 9.29 | 256 |
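
The throughput and latency columns agree with each other to within about 0.1 tok/s if throughput is taken as generated tokens divided by end-to-end latency. The check below assumes that definition; the benchmark script itself is not shown in this README.

```python
# Sanity check: throughput ~= average tokens / latency.
# Figures are copied from the benchmark table; the formula is an assumption
# about how the reported numbers were derived.
results = {
    "adapter_a": {"latency_s": 9.26, "avg_tokens": 256, "reported_tok_s": 27.6},
    "adapter_b": {"latency_s": 9.29, "avg_tokens": 256, "reported_tok_s": 27.5},
}


def derived_throughput(latency_s: float, avg_tokens: int) -> float:
    """Tokens generated per second of end-to-end latency."""
    return avg_tokens / latency_s


for name, r in results.items():
    tok_s = derived_throughput(r["latency_s"], r["avg_tokens"])
    print(f"{name}: derived {tok_s:.2f} tok/s, reported {r['reported_tok_s']} tok/s")
```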

Included LoRA Adapters

| Adapter | Source | Rank | Alpha | Target modules |
|---|---|---|---|---|
| adapter_a | nicoboss/Qwen3-14B-Uncensored-Lora | 32 | 16 | q/k/v/o/gate/up/down_proj |
| adapter_b | Wuhall/Qwen3-14B-LoRA | 32 | 32 | q/k/v/o/gate/up/down_proj |
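
Rank and alpha together determine how strongly an adapter's update is applied. Under the standard LoRA convention the low-rank update is scaled by alpha / rank. The sketch below assumes that convention; an adapter trained with a variant such as rsLoRA would scale differently, and the adapter configs are not reproduced here.

```python
# Standard LoRA scaling: W' = W + (alpha / rank) * (B @ A).
# Assumes neither adapter uses an alternative scaling rule.
adapters = {
    "adapter_a": {"rank": 32, "alpha": 16},
    "adapter_b": {"rank": 32, "alpha": 32},
}


def lora_scale(rank: int, alpha: float) -> float:
    """Multiplier applied to the low-rank update under standard LoRA."""
    return alpha / rank


for name, cfg in adapters.items():
    print(name, lora_scale(cfg["rank"], cfg["alpha"]))
```

Both adapters have rank 32, so max_lora_rank=32 covers both.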

Quick Start on a Fresh inf2.24xlarge

# 1. Activate Neuron venv
source /opt/aws_neuronx_venv_pytorch_inference_vllm_0_13/bin/activate
pip install huggingface_hub

# 2. Download base model (required -- artifacts don't include base weights)
python -c "
from huggingface_hub import snapshot_download
snapshot_download(repo_id='Qwen/Qwen3-14B', local_dir='Qwen3-14B',
                  ignore_patterns=['*.gguf', '*.md', 'original/*'])
"

# 3. Download pre-compiled artifacts + LoRA adapters
python -c "
from huggingface_hub import snapshot_download
snapshot_download(repo_id='jburtoft/Qwen3-14B-neuron-inf2-tp4-lora', local_dir='neuron-artifacts')
"

# 4. Create LoRA config JSON (update paths to match your layout)
python -c "
import json, os
config = {
    'lora-ckpt-dir': os.path.abspath('neuron-artifacts/lora_adapters'),
    'lora-ckpt-paths': {
        'adapter_a': 'adapter_a',
        'adapter_b': 'adapter_b',
    },
    'lora-ckpt-paths-cpu': {}
}
with open('neuron-artifacts/lora_adapters/adapters.json', 'w') as f:
    json.dump(config, f, indent=2)
print('LoRA config written')
"

# 5. Set environment, then run the Python script below
export NEURON_COMPILED_ARTIFACTS=$PWD/neuron-artifacts/bs1_tp4
export VLLM_NEURON_FRAMEWORK=neuronx-distributed-inference
export VLLM_USE_V1=1

Python Usage

import os

# Set before importing vllm, mirroring the shell exports above.
os.environ["NEURON_COMPILED_ARTIFACTS"] = os.path.abspath("neuron-artifacts/bs1_tp4")
os.environ["VLLM_NEURON_FRAMEWORK"] = "neuronx-distributed-inference"
os.environ["VLLM_USE_V1"] = "1"

from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(
    model="Qwen3-14B",
    trust_remote_code=True,
    dtype="bfloat16",
    tensor_parallel_size=4,
    max_num_seqs=1,
    max_model_len=4096,
    swap_space=0,
    enable_lora=True,
    max_loras=2,
    max_lora_rank=32,
    additional_config=dict(
        override_neuron_config=dict(
            text_neuron_config={...},  # same config as Section 3
            lora_ckpt_json="neuron-artifacts/lora_adapters/adapters.json",
        )
    ),
    enable_prefix_caching=False,
    enable_chunked_prefill=False,
)

# NOTE: When enable_lora=True, EVERY request must include a lora_request.
# Sending lora_request=None causes an error (known issue, internal ticket filed).
sampling = SamplingParams(top_k=1, max_tokens=256, temperature=0.0)
lora_req = LoRARequest("adapter_a", lora_int_id=1, lora_path=" ")
outputs = llm.generate(
    [{"prompt": "Hello! Explain quantum computing briefly."}],
    sampling,
    lora_request=[lora_req],
)
print(outputs[0].outputs[0].text)
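
Because a missing lora_request only fails at generation time, it can help to validate requests before calling llm.generate. The helper below is a hypothetical pre-flight check (check_lora_requests is not part of vLLM). It inspects plain Python values only, so it can be tested without Neuron hardware.

```python
from typing import Optional, Sequence


def check_lora_requests(prompts: Sequence[object],
                        lora_requests: Optional[Sequence[object]]) -> None:
    """Raise ValueError unless every prompt has a non-None LoRA request.

    Hypothetical pre-flight check; not a vLLM API.
    """
    if lora_requests is None:
        raise ValueError("lora_request is required when enable_lora=True")
    if len(lora_requests) != len(prompts):
        raise ValueError(
            f"expected {len(prompts)} LoRA requests, got {len(lora_requests)}"
        )
    for i, req in enumerate(lora_requests):
        if req is None:
            raise ValueError(f"prompt {i} has no LoRA request")
```

A typical use would be to call check_lora_requests(prompts, [lora_req]) immediately before llm.generate(prompts, sampling, lora_request=[lora_req]).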

Important Notes

  1. Base model weights required: Download Qwen/Qwen3-14B separately (~30 GB); the artifacts do not include base weights.
  2. LoRA always required: When enable_lora=True, every request MUST include a lora_request. Omitting it raises an AttributeError (known issue; internal ticket filed).
  3. inf2.24xlarge required: tp=4 needs at least 4 NeuronCores. Smaller inf2 instances (inf2.xlarge, inf2.8xlarge) have only 2 and won't work.
  4. No FP8 on inf2: BF16 only. The customer's GPU setting --quantization fp8 cannot be replicated.
  5. SDK-specific: The artifacts work only with Neuron SDK 2.28 (DLAMI 20260227).
  6. Update LoRA paths: The lora-ckpt-dir in adapters.json must be an absolute path matching your local layout.
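
Note 6 is easy to get wrong after moving the artifacts directory. Here is a minimal sketch of a check, assuming the adapters.json layout written in the Quick Start. validate_lora_config is a hypothetical helper, and the rule that adapter paths resolve relative to lora-ckpt-dir is inferred from that example rather than from the Neuron documentation.

```python
import json
import os


def validate_lora_config(config_path: str) -> list[str]:
    """Return a list of problems found in an adapters.json file (empty if none)."""
    with open(config_path) as f:
        config = json.load(f)

    problems = []
    ckpt_dir = config.get("lora-ckpt-dir", "")
    if not os.path.isabs(ckpt_dir):
        problems.append(f"lora-ckpt-dir is not absolute: {ckpt_dir!r}")
    # Assumption: each adapter path is relative to lora-ckpt-dir.
    for name, rel_path in config.get("lora-ckpt-paths", {}).items():
        full = os.path.join(ckpt_dir, rel_path)
        if not os.path.isdir(full):
            problems.append(f"adapter {name!r} not found at {full}")
    return problems


if __name__ == "__main__":
    path = "neuron-artifacts/lora_adapters/adapters.json"
    if os.path.exists(path):
        for problem in validate_lora_config(path):
            print("WARNING:", problem)
```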

Customer Migration Notes

The customer's GPU command:

vllm serve Qwen3-14B --quantization fp8 --enable-lora --max-loras 2 --max-lora-rank 32 \
  --lora-modules main_adapter=<path1> main_adapter_green=<path2>

On inf2.24xlarge, this becomes:

export NEURON_COMPILED_ARTIFACTS=neuron-artifacts/bs1_tp4
export VLLM_NEURON_FRAMEWORK=neuronx-distributed-inference
# LoRA adapters specified via JSON config (see above)
# No --quantization (BF16 only on inf2)
# tensor_parallel_size=4, max_model_len=4096
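
The name=path pairs passed to --lora-modules on GPU carry the same information as the lora-ckpt-paths mapping in adapters.json. Here is a rough sketch of the translation, assuming the JSON layout from the Quick Start and that the adapter directories are copied under a single lora-ckpt-dir. lora_modules_to_config is a hypothetical helper and the /data/loras paths are placeholders. Whether the pre-compiled artifacts accept adapters other than the two bundled ones without recompiling is not covered here.

```python
import json
import os


def lora_modules_to_config(lora_modules: list[str], ckpt_dir: str) -> dict:
    """Convert vLLM-style 'name=path' specs into an adapters.json-style dict."""
    paths = {}
    for spec in lora_modules:
        name, sep, path = spec.partition("=")
        if not sep or not name or not path:
            raise ValueError(f"expected name=path, got {spec!r}")
        # Keep only the directory name; it is resolved against lora-ckpt-dir.
        paths[name] = os.path.basename(os.path.normpath(path))
    return {
        "lora-ckpt-dir": os.path.abspath(ckpt_dir),
        "lora-ckpt-paths": paths,
        "lora-ckpt-paths-cpu": {},
    }


config = lora_modules_to_config(
    ["main_adapter=/data/loras/main", "main_adapter_green=/data/loras/green"],
    ckpt_dir="neuron-artifacts/lora_adapters",
)
print(json.dumps(config, indent=2))
```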