Running Llama-3_3-Nemotron-Super-49B-v1_5 on DGX Spark with NGC vLLM Container
Hardware
- System: NVIDIA DGX Spark
- Memory: 128GB unified memory (CPU+GPU shared)
- GPU: Single GPU (Grace Blackwell architecture)
- Current working model: gpt-oss-120b runs successfully with NGC vLLM container
Current Setup
I'm using the NGC vLLM container for inference:
nvcr.io/nvidia/vllm:25.11-py3 (vLLM 0.11.0)
My working gpt-oss-120b configuration:
sudo docker run -d \
--gpus all \
--ipc=host \
--shm-size 32g \
-v /home/data/models/gpt-oss-120b:/model \
-p 8000:8000 \
nvcr.io/nvidia/vllm:25.11-py3 \
vllm serve /model \
--host 0.0.0.0 \
--port 8000 \
--gpu-memory-utilization 0.7 \
--max-model-len 131072 \
--trust-remote-code \
--generation-config=vllm
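A quick way to confirm that this configuration is serving requests is to call the OpenAI-compatible endpoints that vllm serve exposes. The sketch below assumes the server is reachable on localhost:8000 and that the served model name defaults to the path passed to vllm serve (/model here); GET /v1/models reports the actual name. chat_payload is a hypothetical helper, not part of vLLM.

```shell
# Hypothetical helper: build a minimal chat-completions request body.
# Arguments are inserted verbatim, so they must not contain quotes or backslashes.
chat_payload() {
  printf '{"model": "%s", "messages": [{"role": "user", "content": "%s"}]}\n' "$1" "$2"
}

# Example calls (these need the container above to be running):
#   curl -s http://localhost:8000/v1/models
#   curl -s http://localhost:8000/v1/chat/completions \
#     -H "Content-Type: application/json" \
#     --data "$(chat_payload /model 'What is the capital of France?')"
```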
Questions
The Hugging Face example installs vllm==0.9.2 via pip and uses tensor-parallel-size=8. I need to adapt this for DGX Spark (single GPU, unified memory).
1. NGC vLLM 0.11.0 Compatibility
- Is nvcr.io/nvidia/vllm:25.11-py3 (vLLM 0.11.0) compatible with this model?
- Or must I use vllm==0.9.2 specifically?
2. Required Parameters
Please confirm or correct each parameter for DGX Spark:
| Parameter | My assumption | Correct? |
|---|---|---|
| --trust-remote-code | Required | ? |
| --enforce-eager | Required | ? |
| --gpu-memory-utilization | 0.7 (unified memory constraint) | ? |
| --max-model-len | 32768 or 65536? | ? |
| --tensor-parallel-size | 1 (single GPU) | ? |
| --generation-config | vllm | ? |
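As a rough sanity check on the memory-related assumptions, the arithmetic below compares the weight footprint with the budget implied by --gpu-memory-utilization. It is only a sketch. It assumes about 49B parameters (taken from the model name), 2 bytes per parameter (unquantized BF16 weights, which is an assumption), and that the utilization fraction applies to the full 128 GB of unified memory, which may not match how the device reports memory.

```shell
# Back-of-the-envelope memory check (decimal GB, integer arithmetic).
# All four values are assumptions; adjust them to match the real setup.
params_b=49          # billions of parameters, from the model name
bytes_per_param=2    # BF16, unquantized
total_gb=128         # unified memory on DGX Spark
util_pct=70          # --gpu-memory-utilization 0.7

weights_gb=$((params_b * bytes_per_param))
budget_gb=$((total_gb * util_pct / 100))

echo "weights ~${weights_gb} GB, vLLM budget ~${budget_gb} GB"
if [ "$weights_gb" -gt "$budget_gb" ]; then
  echo "weights alone exceed the budget; no room left for KV cache"
fi
```

Under these assumptions, 0.7 would not leave room for the unquantized weights, so the value in the table may need to be higher.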
3. DeciLMForCausalLM Architecture
- I saw GitHub issues about DeciLMForCausalLM not being supported in some vLLM versions.
- Does NGC vLLM 0.11.0 support this architecture natively, or does --trust-remote-code handle it?
4. Reasoning Mode
- Does vLLM deployment support <think> tag parsing natively?
- Or is additional configuration needed for reasoning on/off modes?
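For reference, if server-side parsing is not available, a client-side fallback could strip the reasoning block from the returned text. The sketch below assumes the model wraps its reasoning in literal <think> and </think> tags inside the message content; strip_think is a hypothetical helper, not a vLLM feature.

```shell
# Hypothetical client-side fallback: remove everything between <think> and </think>.
# Reads text on stdin, writes the remaining text on stdout.
# Lines that end up empty after stripping are dropped.
strip_think() {
  awk '
    {
      line = $0
      out = ""
      while (length(line) > 0) {
        if (in_think) {
          i = index(line, "</think>")
          if (i == 0) { line = ""; break }      # still inside the reasoning block
          line = substr(line, i + 8)            # skip past the closing tag
          in_think = 0
        } else {
          i = index(line, "<think>")
          if (i == 0) { out = out line; line = ""; break }
          out = out substr(line, 1, i - 1)      # keep text before the opening tag
          line = substr(line, i + 7)            # skip past the opening tag
          in_think = 1
        }
      }
      if (out != "") print out
    }
  '
}

printf '<think>\nstep 1\n</think>\nParis\n' | strip_think
```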
5. Complete Docker Command
If possible, please provide a tested docker run command for DGX Spark with:
- NGC vLLM container (or specify if pip version is required)
- Single GPU / unified memory configuration
- Recommended context length for 128GB unified memory
Model Location
/home/data/models/llama-nemotron-super-49b/
(Downloaded via huggingface-cli download nvidia/Llama-3_3-Nemotron-Super-49B-v1_5)
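To make the request concrete, here is an untested draft that combines the working gpt-oss-120b command with the parameter assumptions from this post. It only prints the command and does not run it. Every flag value is an unconfirmed assumption, and MAX_MODEL_LEN and GPU_MEM_UTIL are hypothetical environment variables used to swap values while the open questions are pending.

```shell
# Untested draft: prints (does not run) a docker command built from the assumptions in this post.
MODEL_DIR="/home/data/models/llama-nemotron-super-49b"
IMAGE="nvcr.io/nvidia/vllm:25.11-py3"
MAX_MODEL_LEN="${MAX_MODEL_LEN:-32768}"   # 32768 or 65536, still an open question
GPU_MEM_UTIL="${GPU_MEM_UTIL:-0.7}"       # assumption, not confirmed

draft_cmd() {
  printf '%s ' \
    sudo docker run -d \
    --gpus all \
    --ipc=host \
    --shm-size 32g \
    -v "$MODEL_DIR:/model" \
    -p 8000:8000 \
    "$IMAGE" \
    vllm serve /model \
    --host 0.0.0.0 \
    --port 8000 \
    --trust-remote-code \
    --enforce-eager \
    --gpu-memory-utilization "$GPU_MEM_UTIL" \
    --max-model-len "$MAX_MODEL_LEN" \
    --tensor-parallel-size 1 \
    --generation-config=vllm
  printf '\n'
}

draft_cmd
```

Corrections to any flag in this draft would answer most of the questions above.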
Thank you for any guidance. I want to avoid trial-and-error on production hardware.