Does not work with DGX Spark
I followed https://build.nvidia.com/spark/sglang/instructions to run an SGLang Docker container on DGX Spark, then ran the model using the command provided in the model card. I got this error:
launch_server.py: error: argument --reasoning-parser: invalid choice: 'nano_v3' (choose from 'deepseek-r1', 'deepseek-v3', 'glm45', 'gpt-oss', 'kimi', 'qwen3', 'qwen3-thinking', 'minimax', 'minimax-append-think', 'step3')
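The error text itself lists every reasoning parser the image accepts, so one quick check of an image is to scan that list for `nano_v3`. A minimal sketch, assuming the argparse "invalid choice" message format shown above; `parser_choices` and `has_parser` are hypothetical helper names, not part of SGLang:

```shell
# Hypothetical helpers (not part of SGLang): inspect an argparse
# "invalid choice" error and report which parser names it lists.

# Read error text on stdin, print one accepted choice per line.
parser_choices() {
  grep -o "choose from [^)]*" \
    | sed "s/choose from //" \
    | tr ',' '\n' \
    | tr -d "' "
}

# Succeed if parser name $1 appears among the choices read from stdin.
has_parser() {
  parser_choices | grep -qx -- "$1"
}
```

For example, after saving the failed launch's stderr to a file such as `launch.err`, running `has_parser nano_v3 < launch.err && echo supported || echo "not in this build"` tells you whether that image knows the parser.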
Yes, it would be awesome to have it work on DGX Spark.
Same issue here on DGX Spark.
Environment:
- DGX Spark, GB10 GPU, CUDA 13.0.1
- Model downloaded: /home/data/models/nemotron-3-nano-bf16 (all 13 safetensors verified)
Tried:
- lmsysorg/sglang:spark → nano_v3 parser not available (same error as OP)
- nvcr.io/nvidia/vllm:25.11-py3 → AttributeError: 'NemotronHConfig' object has no attribute 'rms_norm_eps'
Question:
Which exact docker image + tag supports Nemotron-3-Nano on DGX Spark today?
To try it on DGX Spark right now, here is what worked for me:
- Via LM Studio (https://lmstudio.ai/), I tried the Q4_K_M (24.5 GB) GGUF and everything "just worked".
- A few test queries produced reasonable results at about 65 tok/sec, which is higher than what I get (53 tok/sec) on a MacBook M3 Pro with the MLX variant.
- It is able to use the tool built into LM Studio, js-sandbox, even though we did not train on that specific tool.
I did not measure accuracy, and this GGUF isn't an "official" NVIDIA GGUF. Many thanks to the OSS community for providing these!
We have two options confirmed for DGX Spark. We're looking into SGLang support.
- (1) Llama.cpp (used as the backend for LM Studio, as @okuchaiev mentioned above)
- (2) vLLM
  - https://github.com/zhenghax/recipes/blob/main/NVIDIA/Nemotron-3-Nano-30B-A3B.md#run-docker-container-on-dgx-spark
  - The docker build command is currently missing from the documentation.
The vLLM path is a little tricky because users need to build the Docker image themselves. We'll keep you posted with the latest information about Nemotron 3 Nano on DGX Spark.
vLLM v0.12.0 or newer is needed. Pull vLLM v0.12.0 or later, then build the Docker image using the following command.
$ git clone https://github.com/vllm-project/vllm.git
$ cd vllm
# Decrease max_jobs below if you face an OOM issue during the build
$ DOCKER_BUILDKIT=1 docker build \
--build-arg max_jobs=32 \
--build-arg RUN_WHEEL_CHECK=false \
--build-arg CUDA_VERSION=13.0.1 \
--build-arg BUILD_BASE_IMAGE=nvidia/cuda:13.0.1-devel-ubuntu22.04 \
--build-arg torch_cuda_arch_list='12.1' \
--platform "linux/arm64" \
--tag <docker-image-tag-name> \
--target vllm-openai \
--progress plain \
-f docker/Dockerfile \
.
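Since the build needs vLLM v0.12.0 or newer, it may help to verify the checked-out release tag before starting a long build. A minimal sketch, assuming GNU `sort -V` is available; `version_ge` is a hypothetical helper, not something shipped with vLLM:

```shell
# version_ge A B: succeed when version A >= version B.
# Relies on `sort -V` (version sort) putting the smaller version first.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Example use inside the cloned vllm directory (not run here):
#   tag="$(git describe --tags --abbrev=0)"   # nearest release tag, e.g. v0.12.0
#   version_ge "${tag#v}" 0.12.0 || echo "Check out v0.12.0 or newer before building"
```

Comparing with version sort rather than plain string comparison matters here because a string comparison would rank 0.9.x above 0.12.0.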
@suhara Thank you for the build instructions.
I need to run the BF16 full precision model (nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16), not quantized.
Will this custom vLLM build support BF16 on DGX Spark?
Sorry for the delayed response.
A pre-built container that supports NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 and FP8 is now available:
https://catalog.ngc.nvidia.com/orgs/nvidia/containers/vllm?version=25.12.post1-py3
The information is now in the model card.
https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16#use-it-with-vllm
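For orientation, a rough sketch of what launching the pre-built container might look like. The image path is an assumption inferred from the catalog link above and the `nvcr.io/nvidia/vllm:<tag>` naming used earlier in this thread, and `--trust-remote-code` plus the port mapping are assumptions as well; the model card section linked above has the authoritative command and flags. `serve_cmd` is a hypothetical helper that only prints the command so it can be reviewed before running:

```shell
# Assumed values: image path inferred from the NGC catalog link;
# 8000 is vLLM's default serving port inside the container.
IMAGE="nvcr.io/nvidia/vllm:25.12.post1-py3"
MODEL="nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16"

# Print (do not execute) a docker command that serves MODEL
# on host port $1 (default 8000).
serve_cmd() {
  printf 'docker run --gpus all --rm -p %s:8000 %s vllm serve %s --trust-remote-code\n' \
    "${1:-8000}" "$IMAGE" "$MODEL"
}
```

Calling `serve_cmd 9000` prints the command with host port 9000 mapped to the container; copy it into a terminal once the flags match what the model card recommends.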
