Qwen3-14B with LoRA -- Pre-compiled for AWS Inferentia2
Pre-compiled artifacts for running Qwen/Qwen3-14B with LoRA adapters on AWS Inferentia2 (inf2.24xlarge).
Configuration
| Setting | Value |
|---|---|
| Instance type | inf2.24xlarge (12 NeuronCores) |
| Tensor parallel | 4 |
| Batch size | 1 |
| Max sequence length | 4096 |
| Data type | BF16 |
| ISA Kernels | All OFF (required on inf2) |
| Compile time | 753s |
| SDK | Neuron SDK 2.28 (DLAMI 20260227) |
| NxD Inference | 0.8.x |
| vLLM-neuron | 0.4.1 |
Benchmark Results
| Config | Throughput (tok/s) | Latency (s) | Avg Tokens |
|---|---|---|---|
| Adapter A (nicoboss/Uncensored) | 27.6 +/- 0.0 | 9.26 | 256 |
| Adapter B (Wuhall/LoRA) | 27.5 +/- 0.1 | 9.29 | 256 |
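The throughput and latency columns can be cross-checked against each other. The sketch below assumes that throughput is simply generated tokens divided by end-to-end latency for a single request; the table does not state how the numbers were measured, so treat this as a sanity check rather than the benchmark's actual method.

```python
# Rows copied from the benchmark table: (reported tok/s, latency in s, avg tokens).
rows = {
    "adapter_a": (27.6, 9.26, 256),
    "adapter_b": (27.5, 9.29, 256),
}

# Assumption: throughput = generated tokens / end-to-end latency for one request.
derived = {name: tokens / latency for name, (_, latency, tokens) in rows.items()}

for name, (reported, _, _) in rows.items():
    print(f"{name}: reported {reported:.1f} tok/s, derived {derived[name]:.2f} tok/s")
```

Under that assumption both derived values land within 0.1 tok/s of the reported figures.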
Included LoRA Adapters
| Adapter | Source | Rank | Alpha | Target Modules |
|---|---|---|---|---|
| adapter_a | nicoboss/Qwen3-14B-Uncensored-Lora | 32 | 16 | q/k/v/o/gate/up/down_proj |
| adapter_b | Wuhall/Qwen3-14B-LoRA | 32 | 32 | q/k/v/o/gate/up/down_proj |
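The two adapters share a rank but differ in alpha, so their updates are scaled differently. A minimal sketch, assuming the conventional LoRA scaling factor alpha / rank (not the rank-stabilized variant, which divides by the square root of the rank); the adapter configs themselves would confirm which applies.

```python
# Rank and alpha values copied from the adapter table.
adapters = {
    "adapter_a": {"rank": 32, "alpha": 16},
    "adapter_b": {"rank": 32, "alpha": 32},
}

# Assumption: standard LoRA scaling, i.e. the low-rank update is multiplied by alpha / rank.
scaling = {name: cfg["alpha"] / cfg["rank"] for name, cfg in adapters.items()}

for name, factor in scaling.items():
    print(f"{name}: scaling factor {factor}")
```

With these values adapter_a's update is applied at half the weight of adapter_b's.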
Quick Start on a Fresh inf2.24xlarge
# 1. Activate Neuron venv
source /opt/aws_neuronx_venv_pytorch_inference_vllm_0_13/bin/activate
pip install huggingface_hub
# 2. Download base model (required -- artifacts don't include base weights)
python -c "
from huggingface_hub import snapshot_download
snapshot_download(repo_id='Qwen/Qwen3-14B', local_dir='Qwen3-14B',
                  ignore_patterns=['*.gguf', '*.md', 'original/*'])
"
# 3. Download pre-compiled artifacts + LoRA adapters
python -c "
from huggingface_hub import snapshot_download
snapshot_download(repo_id='jburtoft/Qwen3-14B-neuron-inf2-tp4-lora', local_dir='neuron-artifacts')
"
# 4. Create LoRA config JSON (update paths to match your layout)
python -c "
import json, os
config = {
    'lora-ckpt-dir': os.path.abspath('neuron-artifacts/lora_adapters'),
    'lora-ckpt-paths': {
        'adapter_a': 'adapter_a',
        'adapter_b': 'adapter_b',
    },
    'lora-ckpt-paths-cpu': {}
}
with open('neuron-artifacts/lora_adapters/adapters.json', 'w') as f:
    json.dump(config, f, indent=2)
print('LoRA config written')
"
# 5. Set environment and run
export NEURON_COMPILED_ARTIFACTS=$PWD/neuron-artifacts/bs1_tp4
export VLLM_NEURON_FRAMEWORK=neuronx-distributed-inference
export VLLM_USE_V1=1
# Python script: sets the same variables in-process before importing vllm
import os
os.environ["NEURON_COMPILED_ARTIFACTS"] = "neuron-artifacts/bs1_tp4"
os.environ["VLLM_NEURON_FRAMEWORK"] = "neuronx-distributed-inference"
os.environ["VLLM_USE_V1"] = "1"
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest
llm = LLM(
    model="Qwen3-14B",
    trust_remote_code=True,
    dtype="bfloat16",
    tensor_parallel_size=4,
    max_num_seqs=1,
    max_model_len=4096,
    swap_space=0,
    enable_lora=True,
    max_loras=2,
    max_lora_rank=32,
    additional_config=dict(
        override_neuron_config=dict(
            text_neuron_config={...},  # same config as Section 3
            lora_ckpt_json="neuron-artifacts/lora_adapters/adapters.json",
        )
    ),
    enable_prefix_caching=False,
    enable_chunked_prefill=False,
)
# NOTE: When enable_lora=True, EVERY request must include a lora_request.
# Sending lora_request=None causes an error (known issue, internal ticket filed).
sampling = SamplingParams(top_k=1, max_tokens=256, temperature=0.0)
lora_req = LoRARequest("adapter_a", lora_int_id=1, lora_path=" ")
outputs = llm.generate(
    [{"prompt": "Hello! Explain quantum computing briefly."}],
    sampling,
    lora_request=[lora_req],
)
print(outputs[0].outputs[0].text)
Important Notes
- Base model weights required: Download Qwen/Qwen3-14B separately (~30 GB).
- LoRA always required: When enable_lora=True, every request MUST include a lora_request. Omitting it causes AttributeError. Internal ticket filed.
- inf2.24xlarge required: tp=4 needs 4+ NeuronCores. Smaller inf2 instances won't work.
- No FP8 on inf2: BF16 only. Customer's GPU --quantization fp8 cannot be replicated.
- SDK-specific: Artifacts work with Neuron SDK 2.28 (DLAMI 20260227) only.
- Update LoRA paths: The lora-ckpt-dir in adapters.json must be an absolute path matching your local layout.
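The path requirement in the last note is easy to get wrong after moving the artifacts. Below is a small pre-flight check, offered as a sketch: `check_lora_config` is a hypothetical helper, not part of vLLM or the Neuron SDK, and it assumes each entry in lora-ckpt-paths is resolved relative to lora-ckpt-dir. The demo runs against a throwaway directory where adapter_b is deliberately missing.

```python
import json
import os
import tempfile


def check_lora_config(config_path):
    """Return a list of problems found in an adapters.json file (empty if none)."""
    with open(config_path) as f:
        config = json.load(f)
    problems = []
    ckpt_dir = config.get("lora-ckpt-dir", "")
    if not os.path.isabs(ckpt_dir):
        problems.append(f"lora-ckpt-dir is not absolute: {ckpt_dir!r}")
    for name, rel_path in config.get("lora-ckpt-paths", {}).items():
        # Assumption: each entry is a directory relative to lora-ckpt-dir.
        full = os.path.join(ckpt_dir, rel_path)
        if not os.path.isdir(full):
            problems.append(f"adapter {name!r}: missing directory {full}")
    return problems


# Demo layout: only adapter_a exists on disk, adapter_b is listed but absent.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "adapter_a"))
    config_path = os.path.join(root, "adapters.json")
    with open(config_path, "w") as f:
        json.dump(
            {
                "lora-ckpt-dir": root,
                "lora-ckpt-paths": {"adapter_a": "adapter_a", "adapter_b": "adapter_b"},
                "lora-ckpt-paths-cpu": {},
            },
            f,
        )
    demo_problems = check_lora_config(config_path)

print(demo_problems)
```

Against a real download, calling `check_lora_config("neuron-artifacts/lora_adapters/adapters.json")` before constructing the LLM should return an empty list if the layout matches.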
Customer Migration Notes
The customer's GPU command:
vllm serve Qwen3-14B --quantization fp8 --enable-lora --max-loras 2 --max-lora-rank 32 \
--lora-modules main_adapter=<path1> main_adapter_green=<path2>
On inf2.24xlarge becomes:
export NEURON_COMPILED_ARTIFACTS=neuron-artifacts/bs1_tp4
export VLLM_NEURON_FRAMEWORK=neuronx-distributed-inference
# LoRA adapters specified via JSON config (see above)
# No --quantization (BF16 only on inf2)
# tensor_parallel_size=4, max_model_len=4096
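The main structural change is that the name=path pairs passed to --lora-modules move into the JSON config. The sketch below shows that mapping with a hypothetical helper, `lora_modules_to_config`. It assumes each adapter has been copied into the lora-ckpt-dir under a subdirectory named after the adapter, and the example paths are placeholders. Whether the pre-compiled artifacts accept adapters other than the two bundled ones is not stated here, so this illustrates the config shape only.

```python
import json
import os


def lora_modules_to_config(lora_modules, ckpt_dir):
    """Hypothetical helper: map GPU-style name=path pairs to the JSON layout used above."""
    paths = {}
    for item in lora_modules:
        name, sep, path = item.partition("=")
        if not sep or not name or not path:
            raise ValueError(f"expected name=path, got {item!r}")
        # Assumption: the adapter was copied to <ckpt_dir>/<name>, so the original
        # GPU-side path is not reused.
        paths[name] = name
    return {
        "lora-ckpt-dir": os.path.abspath(ckpt_dir),
        "lora-ckpt-paths": paths,
        "lora-ckpt-paths-cpu": {},
    }


# Placeholder paths stand in for the customer's <path1> and <path2>.
config = lora_modules_to_config(
    ["main_adapter=/placeholder/path1", "main_adapter_green=/placeholder/path2"],
    "neuron-artifacts/lora_adapters",
)
print(json.dumps(config, indent=2))
```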