Instructions to use nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
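
For callers who prefer Python to curl, the same request can be sketched with the standard library alone. The helper names `build_chat_request` and `send_chat_request` are illustrative and not part of vLLM; the URL, model name, and payload mirror the curl call above, and sending assumes the server started by `vllm serve` is listening on port 8000.

```python
import json
import urllib.request


def build_chat_request(prompt, base_url="http://localhost:8000"):
    """Build the same POST request as the curl call, without sending it."""
    payload = {
        "model": "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16",
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request):
    """Send the request and return the parsed JSON body (needs a running server)."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Building the request does not touch the network.
request = build_chat_request("What is the capital of France?")
print(request.full_url)
```

Once the server is up, `send_chat_request(request)` returns the parsed JSON response.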

Use Docker

docker model run hf.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16

SGLang

How to use nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 with Docker Model Runner:
```
docker model run hf.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
```

Correct `get_decoder`/`set_decoder`

#40

by kylemylonakisprotopia - opened Jan 19

base: refs/heads/main ← from: refs/pr/40

kylemylonakisprotopia

Jan 19

No description provided.

refactor: Call after initializing backbone to conform with standard *ForCausalLM interface (2876b6ff)

kylemylonakisprotopia

Jan 19

All `*ForCausalLM` models have `set_decoder` and `get_decoder` methods that point to the actual decoder of the underlying transformer. Typically `get_decoder` returns the `self.model` attribute. `NemotronHForCausalLM` has no such attribute, because the module that functions as the decoder is named `backbone` in this model. It would be useful if `get_decoder` pointed to `backbone` by default, since that would keep the interface consistent with other `*ForCausalLM` Transformers models.
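
The requested behavior can be sketched in plain Python. This is an illustration only, not the diff in this PR: `TinyBackbone` and `TinyForCausalLM` are hypothetical stand-ins, and the only detail taken from the model is that the decoder is stored under `backbone`.

```python
class TinyBackbone:
    """Hypothetical stand-in for the decoder stack stored under `backbone`."""

    def __init__(self, name):
        self.name = name


class TinyForCausalLM:
    """Hypothetical *ForCausalLM-style wrapper whose decoder lives in `backbone`."""

    def __init__(self):
        self.backbone = TinyBackbone("original")

    def get_decoder(self):
        # Return the decoder regardless of the attribute name it is stored under.
        return self.backbone

    def set_decoder(self, decoder):
        # Swap the decoder so callers never need to know about `backbone`.
        self.backbone = decoder


model = TinyForCausalLM()
assert model.get_decoder() is model.backbone

replacement = TinyBackbone("replacement")
model.set_decoder(replacement)
assert model.get_decoder() is replacement
```

Code written against `get_decoder()` then works the same way whether the decoder attribute is called `model` or `backbone`.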

kylemylonakisprotopia changed pull request status to open Jan 19

Gerald001

Feb 18

Hi @kylemylonakisprotopia, slightly off-topic: I'm trying to reach you about https://github.com/huggingface/transformers/pull/42901. Would you be able to reply there and provide more information?

Ready to merge

This branch is ready to get merged automatically.
