Qwen3-14B with LoRA -- Pre-compiled for AWS Inferentia2

Pre-compiled artifacts for running Qwen/Qwen3-14B with LoRA adapters on AWS Inferentia2 (inf2.24xlarge).

Configuration

Setting	Value
Instance type	inf2.24xlarge (12 NeuronCores)
Tensor parallel	4
Batch size	1
Max sequence length	4096
Data type	BF16
ISA Kernels	All OFF (required on inf2)
Compile time	753s
SDK	Neuron SDK 2.28 (DLAMI 20260227)
NxD Inference	0.8.x
vLLM-neuron	0.4.1
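
As a rough sanity check on the tensor-parallel setting, the sketch below estimates how much BF16 weight memory each of the 4 ranks would hold. The parameter count of 14.8e9 is an assumption (the commonly quoted size for Qwen3-14B), and the estimate assumes the weights shard evenly; it ignores the KV cache, LoRA weights, and runtime buffers.

```python
# Rough estimate of BF16 weight memory per tensor-parallel rank.
PARAMS = 14.8e9        # assumption: approximate Qwen3-14B parameter count
BYTES_PER_PARAM = 2    # BF16 stores each parameter in 2 bytes
TP_DEGREE = 4          # tensor parallel degree from the table above

total_gb = PARAMS * BYTES_PER_PARAM / 1e9
per_rank_gb = total_gb / TP_DEGREE  # assumes an even split across ranks

print(f"total weights ~{total_gb:.1f} GB, ~{per_rank_gb:.1f} GB per rank")
```

The total is in line with the ~30 GB download size quoted for the base model under Important Notes.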

Benchmark Results

Config	Throughput (tok/s)	Latency (s)	Avg Tokens
Adapter A (nicoboss/Uncensored)	27.6 +/- 0.0	9.26	256
Adapter B (Wuhall/LoRA)	27.5 +/- 0.1	9.29	256
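
The throughput and latency columns can be cross-checked against each other. This sketch assumes throughput is simply generated tokens divided by end-to-end latency for a single request (batch size 1); the function and dictionary names are illustrative.

```python
# Single-stream throughput = generated tokens / end-to-end latency.
runs = {
    "adapter_a": {"tokens": 256, "latency_s": 9.26},
    "adapter_b": {"tokens": 256, "latency_s": 9.29},
}

def throughput(tokens: int, latency_s: float) -> float:
    """Tokens per second for one request."""
    return tokens / latency_s

for name, run in runs.items():
    print(f"{name}: {throughput(run['tokens'], run['latency_s']):.2f} tok/s")
```

Both values land within 0.1 tok/s of the table, which suggests the reported numbers are internally consistent.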

Included LoRA Adapters

Adapter	Source	Rank	Alpha	Target Modules
adapter_a	nicoboss/Qwen3-14B-Uncensored-Lora	32	16	q/k/v/o/gate/up/down_proj
adapter_b	Wuhall/Qwen3-14B-LoRA	32	32	q/k/v/o/gate/up/down_proj
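
The rank and alpha columns determine how strongly each adapter's update is applied. The sketch below assumes the standard LoRA scaling factor alpha / rank; if an adapter was trained with rank-stabilized LoRA (alpha / sqrt(rank)), the numbers would differ, and this README does not say which was used.

```python
# Standard LoRA scaling: the low-rank update is multiplied by alpha / rank.
adapters = {
    "adapter_a": {"rank": 32, "alpha": 16},
    "adapter_b": {"rank": 32, "alpha": 32},
}

def lora_scale(alpha: float, rank: int) -> float:
    """Scaling factor applied to the low-rank update (assumes classic LoRA)."""
    return alpha / rank

for name, cfg in adapters.items():
    print(f"{name}: scale = {lora_scale(cfg['alpha'], cfg['rank'])}")
```

Both adapters use rank 32, which matches the max_lora_rank=32 setting in the Quick Start.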

Quick Start on a Fresh inf2.24xlarge

# 1. Activate Neuron venv
source /opt/aws_neuronx_venv_pytorch_inference_vllm_0_13/bin/activate
pip install huggingface_hub

# 2. Download base model (required -- artifacts don't include base weights)
python -c "
from huggingface_hub import snapshot_download
snapshot_download(repo_id='Qwen/Qwen3-14B', local_dir='Qwen3-14B',
                  ignore_patterns=['*.gguf', '*.md', 'original/*'])
"

# 3. Download pre-compiled artifacts + LoRA adapters
python -c "
from huggingface_hub import snapshot_download
snapshot_download(repo_id='jburtoft/Qwen3-14B-neuron-inf2-tp4-lora', local_dir='neuron-artifacts')
"

# 4. Create LoRA config JSON (update paths to match your layout)
python -c "
import json, os
config = {
    'lora-ckpt-dir': os.path.abspath('neuron-artifacts/lora_adapters'),
    'lora-ckpt-paths': {
        'adapter_a': 'adapter_a',
        'adapter_b': 'adapter_b',
    },
    'lora-ckpt-paths-cpu': {}
}
with open('neuron-artifacts/lora_adapters/adapters.json', 'w') as f:
    json.dump(config, f, indent=2)
print('LoRA config written')
"

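# 5. Set environment (shell)
export NEURON_COMPILED_ARTIFACTS=$PWD/neuron-artifacts/bs1_tp4
export VLLM_NEURON_FRAMEWORK=neuronx-distributed-inference
export VLLM_USE_V1=1

# 6. Run inference (Python). The os.environ lines mirror the exports above,
# so the script also works when launched from a fresh shell.
import os
os.environ["NEURON_COMPILED_ARTIFACTS"] = os.path.abspath("neuron-artifacts/bs1_tp4")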
os.environ["VLLM_NEURON_FRAMEWORK"] = "neuronx-distributed-inference"
os.environ["VLLM_USE_V1"] = "1"

from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(
    model="Qwen3-14B",
    trust_remote_code=True,
    dtype="bfloat16",
    tensor_parallel_size=4,
    max_num_seqs=1,
    max_model_len=4096,
    swap_space=0,
    enable_lora=True,
    max_loras=2,
    max_lora_rank=32,
    additional_config=dict(
        override_neuron_config=dict(
            text_neuron_config={...},  # placeholder, not literal: use the same config as Section 3
            lora_ckpt_json="neuron-artifacts/lora_adapters/adapters.json",
        )
    ),
    enable_prefix_caching=False,
    enable_chunked_prefill=False,
)

# NOTE: When enable_lora=True, EVERY request must include a lora_request.
# Sending lora_request=None causes an error (known issue, internal ticket filed).
sampling = SamplingParams(top_k=1, max_tokens=256, temperature=0.0)
lora_req = LoRARequest("adapter_a", lora_int_id=1, lora_path=" ")
outputs = llm.generate(
    [{"prompt": "Hello! Explain quantum computing briefly."}],
    sampling,
    lora_request=[lora_req],
)
print(outputs[0].outputs[0].text)
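
Before constructing the LLM object, it can help to confirm that the three environment variables from the Quick Start are set as expected. The following is a minimal sketch; check_env and REQUIRED_ENV are hypothetical names, not part of vLLM or the Neuron SDK.

```python
import os

# Expected values, taken from the Quick Start exports.
REQUIRED_ENV = {
    "VLLM_NEURON_FRAMEWORK": "neuronx-distributed-inference",
    "VLLM_USE_V1": "1",
}

def check_env(env=os.environ):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    artifacts = env.get("NEURON_COMPILED_ARTIFACTS")
    if not artifacts:
        problems.append("NEURON_COMPILED_ARTIFACTS is not set")
    elif not os.path.isdir(artifacts):
        problems.append(f"NEURON_COMPILED_ARTIFACTS is not a directory: {artifacts}")
    for key, expected in REQUIRED_ENV.items():
        if env.get(key) != expected:
            problems.append(f"{key} should be {expected!r}, got {env.get(key)!r}")
    return problems

if __name__ == "__main__":
    for problem in check_env():
        print("WARN:", problem)
```

Running this before the vllm import surfaces a missing export early instead of during model load.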

Important Notes

Base model weights required: Download Qwen/Qwen3-14B separately (~30 GB).
LoRA always required: When enable_lora=True, every request MUST include a lora_request. Omitting it raises an AttributeError. Internal ticket filed.
inf2.24xlarge required: tp=4 needs at least 4 NeuronCores. The smaller inf2.xlarge and inf2.8xlarge instances expose only 2 NeuronCores each, so they won't work.
No FP8 on inf2: BF16 only. The customer's GPU setting --quantization fp8 cannot be replicated.
SDK-specific: Artifacts work with Neuron SDK 2.28 (DLAMI 20260227) only.
Update LoRA paths: The lora-ckpt-dir in adapters.json must be an absolute path matching your local layout.
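
The path requirement in the last note can be checked programmatically. This is a minimal sketch that assumes each entry in lora-ckpt-paths is a directory relative to lora-ckpt-dir, as in the adapters.json written in step 4; validate_lora_config is a hypothetical helper, not a Neuron or vLLM API.

```python
import json
import os

def validate_lora_config(path):
    """Return a list of problems found in an adapters.json file."""
    with open(path) as f:
        config = json.load(f)

    problems = []
    ckpt_dir = config.get("lora-ckpt-dir", "")
    if not os.path.isabs(ckpt_dir):
        problems.append(f"lora-ckpt-dir is not absolute: {ckpt_dir!r}")

    # Assumption: adapter paths are resolved relative to lora-ckpt-dir.
    for name, rel in config.get("lora-ckpt-paths", {}).items():
        full = os.path.join(ckpt_dir, rel)
        if not os.path.isdir(full):
            problems.append(f"{name}: missing adapter directory {full}")
    return problems
```

For example, validate_lora_config("neuron-artifacts/lora_adapters/adapters.json") should return an empty list once both adapter directories are in place.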

Customer Migration Notes

The customer's GPU command:

vllm serve Qwen3-14B --quantization fp8 --enable-lora --max-loras 2 --max-lora-rank 32 \
  --lora-modules main_adapter=<path1> main_adapter_green=<path2>

On inf2.24xlarge becomes:

export NEURON_COMPILED_ARTIFACTS=neuron-artifacts/bs1_tp4
export VLLM_NEURON_FRAMEWORK=neuronx-distributed-inference
# LoRA adapters specified via JSON config (see above)
# No --quantization (BF16 only on inf2)
# tensor_parallel_size=4, max_model_len=4096
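
To carry the adapter names over from the GPU command, the name=path pairs passed to --lora-modules can be turned into the JSON layout used in step 4. The sketch below is illustrative only: lora_modules_to_config is a hypothetical helper, the example paths are made up, and it assumes each adapter has been copied into a subdirectory of the checkpoint directory. Whether the pre-compiled artifacts accept adapters other than the two included ones is not covered here.

```python
import json
import os

def lora_modules_to_config(specs, ckpt_dir):
    """Turn ['name=path', ...] into the adapters.json layout from step 4."""
    paths = {}
    for spec in specs:
        name, sep, path = spec.partition("=")
        if not sep or not name or not path:
            raise ValueError(f"expected name=path, got {spec!r}")
        # Assumption: the adapter lives at <ckpt_dir>/<last component of path>.
        paths[name] = os.path.basename(os.path.normpath(path))
    return {
        "lora-ckpt-dir": os.path.abspath(ckpt_dir),
        "lora-ckpt-paths": paths,
        "lora-ckpt-paths-cpu": {},
    }

config = lora_modules_to_config(
    # Hypothetical paths standing in for <path1> and <path2>.
    ["main_adapter=/data/loras/main", "main_adapter_green=/data/loras/green"],
    "neuron-artifacts/lora_adapters",
)
print(json.dumps(config, indent=2))
```

The adapter name given to LoRARequest would then need to match a key in lora-ckpt-paths, as it does for adapter_a in the Quick Start.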
