Instructions to use nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
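
The same request can be issued from Python. The following is a minimal sketch using only the standard library; it assumes the server started above is reachable at `http://localhost:8000`, and the helper names `build_chat_request` and `ask` are illustrative, not part of vLLM.

```python
import json
import urllib.request

MODEL = "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16"


def build_chat_request(prompt, base_url="http://localhost:8000", model=MODEL):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, **kwargs):
    """Send the request and return the assistant's reply text.

    Requires a running OpenAI-compatible server at the given base_url.
    """
    request = build_chat_request(prompt, **kwargs)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(ask("What is the capital of France?"))` prints the reply. Passing `base_url="http://localhost:30000"` targets a server listening on port 30000 instead, as in the SGLang commands on this page.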

Use Docker

docker model run hf.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16

SGLang

How to use nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 with Docker Model Runner:
```
docker model run hf.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
```

Does not work with dgx spark

#13

by sotaaa - opened Dec 18, 2025

Discussion

sotaaa

Dec 18, 2025

Tried to follow https://build.nvidia.com/spark/sglang/instructions to run an SGLang Docker container on DGX Spark, then ran the model using the command provided in the model card. Got this error:

launch_server.py: error: argument --reasoning-parser: invalid choice: 'nano_v3' (choose from 'deepseek-r1', 'deepseek-v3', 'glm45', 'gpt-oss', 'kimi', 'qwen3', 'qwen3-thinking', 'minimax', 'minimax-append-think', 'step3')
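
The message above is a standard argparse "invalid choice" error, which suggests that the SGLang build inside that image does not register a `nano_v3` reasoning parser. As a rough sketch (the helper name `parse_invalid_choice` is hypothetical), the rejected value and the parsers that the image does accept can be pulled out of the error line like this:

```python
import re


def parse_invalid_choice(error_line):
    """Return (rejected_value, supported_values) from an argparse
    'invalid choice' error line, or None if the line does not match."""
    match = re.search(r"invalid choice: '([^']*)' \(choose from (.*)\)", error_line)
    if match is None:
        return None
    rejected = match.group(1)
    supported = re.findall(r"'([^']*)'", match.group(2))
    return rejected, supported


# The error line reported in this thread.
ERROR = (
    "launch_server.py: error: argument --reasoning-parser: invalid choice: "
    "'nano_v3' (choose from 'deepseek-r1', 'deepseek-v3', 'glm45', 'gpt-oss', "
    "'kimi', 'qwen3', 'qwen3-thinking', 'minimax', 'minimax-append-think', 'step3')"
)

rejected, supported = parse_invalid_choice(ERROR)
print(f"rejected: {rejected}")
print(f"nano_v3 supported: {'nano_v3' in supported}")
```

Running the same check against the error output of a different image tag is a quick way to see whether that tag accepts `nano_v3`.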

julien-c

Dec 18, 2025

yes would be awesome to have it work on dgx spark

PhotosGrafus

Dec 19, 2025

Same issue here on DGX Spark.

Environment:

DGX Spark, GB10 GPU, CUDA 13.0.1
Model downloaded: /home/data/models/nemotron-3-nano-bf16 (all 13 safetensors verified)

Tried:

lmsysorg/sglang:spark → nano_v3 parser not available (same error as OP)
nvcr.io/nvidia/vllm:25.11-py3 → AttributeError: 'NemotronHConfig' object has no attribute 'rms_norm_eps'

Question:
Which exact docker image + tag supports Nemotron-3-Nano on DGX Spark today?

okuchaiev

NVIDIA org Dec 19, 2025 • edited Dec 19, 2025

To try it on DGX Spark right now, you can do the following:

Via LM Studio (https://lmstudio.ai/), I tried the Q4_K_M (24.5 GB) GGUF and everything "just worked".
A few queries produced reasonable results at about 65 tok/sec, which is higher than what I get (53 tok/sec) on a MacBook M3 Pro with the MLX variant.
It is able to use the js-sandbox tool built into LM Studio, even though we did not train on that specific tool.

I did not measure accuracy, and this GGUF isn't an "official" NVIDIA GGUF. Many thanks to the OSS community for providing these!
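
For a sense of scale, the two throughput figures quoted above can be compared directly. This is plain arithmetic on the informal numbers from this post (65 and 53 tok/sec), not a benchmark; the 1,000-token reply length is an arbitrary assumption for illustration.

```python
# Throughput figures quoted in the post above (informal, single-user observations).
spark_tok_per_sec = 65.0    # DGX Spark, Q4_K_M GGUF via LM Studio
macbook_tok_per_sec = 53.0  # MacBook M3 Pro, MLX variant

# Relative throughput of the DGX Spark run over the MacBook run.
speedup = spark_tok_per_sec / macbook_tok_per_sec
print(f"relative throughput: {speedup:.2f}x")

# Time to generate a hypothetical 1,000-token reply at each rate.
reply_tokens = 1000
spark_seconds = reply_tokens / spark_tok_per_sec
macbook_seconds = reply_tokens / macbook_tok_per_sec
print(f"DGX Spark: {spark_seconds:.1f} s, MacBook M3 Pro: {macbook_seconds:.1f} s")
```

The two runs use different runtimes and model variants, so the ratio only describes these particular setups.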

suhara

NVIDIA org Dec 20, 2025

Hi @sotaaa @PhotosGrafus

We have two options confirmed for DGX Spark. We're looking into SGLang support.

The vLLM path is a little tricky because users need to build the Docker image themselves. We'll keep you posted on the latest information about Nemotron 3 Nano for DGX Spark.

(1) Llama.cpp (used as the backend for LM Studio as @okuchaiev mentioned above)
- https://docs.unsloth.ai/models/nemotron-3#run-nemotron-3-nano-30b-a3b
(2) vLLM
- https://github.com/zhenghax/recipes/blob/main/NVIDIA/Nemotron-3-Nano-30B-A3B.md#run-docker-container-on-dgx-spark
- The docker build command is currently missing from the documentation. vLLM v0.12.0 or newer is needed. Pull vLLM v0.12.0 or later, then build the Docker image using the following command.

git clone https://github.com/vllm-project/vllm.git
cd vllm
# Decrease max_jobs if you face an OOM issue during the build
DOCKER_BUILDKIT=1 docker build \
    --build-arg max_jobs=32 \
    --build-arg RUN_WHEEL_CHECK=false \
    --build-arg CUDA_VERSION=13.0.1 \
    --build-arg BUILD_BASE_IMAGE=nvidia/cuda:13.0.1-devel-ubuntu22.04 \
    --build-arg torch_cuda_arch_list='12.1' \
    --platform "linux/arm64" \
    --tag <docker-image-tag-name> \
    --target vllm-openai \
    --progress plain \
    -f docker/Dockerfile \
    .
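
If the build needs to be retried with different `max_jobs` values, it may help to assemble the command programmatically. The sketch below only builds and prints the argument list using the build arguments quoted above; the function name `vllm_build_command` and the image tag `vllm-spark:local` are placeholders chosen for illustration.

```python
import shlex


def vllm_build_command(image_tag, max_jobs=32):
    """Return the `docker build` argument list for the DGX Spark vLLM image.

    Lower max_jobs if the build runs out of memory.
    """
    build_args = {
        "max_jobs": str(max_jobs),
        "RUN_WHEEL_CHECK": "false",
        "CUDA_VERSION": "13.0.1",
        "BUILD_BASE_IMAGE": "nvidia/cuda:13.0.1-devel-ubuntu22.04",
        "torch_cuda_arch_list": "12.1",
    }
    argv = ["docker", "build"]
    for name, value in build_args.items():
        argv += ["--build-arg", f"{name}={value}"]
    argv += [
        "--platform", "linux/arm64",
        "--tag", image_tag,
        "--target", "vllm-openai",
        "--progress", "plain",
        "-f", "docker/Dockerfile",
        ".",
    ]
    return argv


# Placeholder tag; halve the job count as an example of backing off after an OOM.
command = vllm_build_command("vllm-spark:local", max_jobs=16)
print("DOCKER_BUILDKIT=1 " + shlex.join(command))
```

The printed line can be pasted into a shell from the root of the vLLM checkout, or the list can be handed to `subprocess.run` with `DOCKER_BUILDKIT=1` set in the environment.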

PhotosGrafus

Dec 25, 2025

@suhara Thank you for the build instructions.

I need to run the BF16 full precision model (nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16), not quantized.

Will this custom vLLM build support BF16 on DGX Spark?

suhara

NVIDIA org Jan 8

Hi @PhotosGrafus

Sorry for the delayed response.

A pre-built container that supports NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 and FP8 is now available.
https://catalog.ngc.nvidia.com/orgs/nvidia/containers/vllm?version=25.12.post1-py3

The information is now in the model card.
https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16#use-it-with-vllm
