Instructions to use nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))


vLLM

How to use nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16

SGLang

How to use nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 with Docker Model Runner:
```
docker model run hf.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
```

Fix streaming output when enable_thinking is disabled

#29

by Kwindla - opened Dec 22, 2025

base: refs/heads/main ← from: refs/pr/29

Files changed: +49 -1

Kwindla

Dec 22, 2025

Fix streaming output when enable_thinking is disabled

Problem

The current nano_v3_reasoning_parser.py correctly handles the enable_thinking: false flag for non-streaming requests, but streaming requests still route content to the wrong field.

When using vLLM with streaming enabled and thinking disabled:

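from openai import OpenAI

# Assumption: vLLM's OpenAI-compatible server on its default local port;
# the api_key value is a placeholder.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16",
    messages=[{"role": "user", "content": "Hello"}],
    stream=True,
    extra_body={"chat_template_kwargs": {"enable_thinking": False}}
)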

Current behavior: Content appears in delta.reasoning_content instead of delta.content

Expected behavior: Content should appear in delta.content (since thinking is disabled)

Root Cause

The existing extract_reasoning method handles the field swap for non-streaming responses, but the streaming path uses extract_reasoning_streaming from the parent DeepSeekR1ReasoningParser, which doesn't know about the enable_thinking flag.

Solution

Override extract_reasoning_streaming to swap the fields when thinking is disabled, matching the behavior of the non-streaming path.

Changes

Add __init__ to capture enable_thinking state at parser initialization
Add extract_reasoning_streaming override to swap fields in streaming mode
Add docstring explaining the parser's purpose
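
The changes listed above can be sketched roughly as follows. This is an illustrative stand-in, not the code in this PR: `DeltaMessage`, `BaseStreamingParser`, and `NanoV3ParserSketch` are hypothetical simplifications of the vLLM classes, the method signature is reduced to a single argument, and the `enable_thinking` constructor argument stands in for however the real parser reads the flag.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DeltaMessage:
    """Hypothetical stand-in for a streaming delta with two text fields."""
    content: Optional[str] = None
    reasoning_content: Optional[str] = None


class BaseStreamingParser:
    """Stand-in for a parent parser that is unaware of enable_thinking.

    It mimics the reported behavior by labeling all streamed text as reasoning.
    """

    def extract_reasoning_streaming(self, delta_text: str) -> Optional[DeltaMessage]:
        if not delta_text:
            return None
        return DeltaMessage(reasoning_content=delta_text)


class NanoV3ParserSketch(BaseStreamingParser):
    """Sketch of the override: swap the fields when thinking is disabled."""

    def __init__(self, enable_thinking: bool = True):
        # Capture the flag once, at parser initialization.
        self.enable_thinking = enable_thinking

    def extract_reasoning_streaming(self, delta_text: str) -> Optional[DeltaMessage]:
        delta = super().extract_reasoning_streaming(delta_text)
        if delta is None or self.enable_thinking:
            return delta
        # Thinking disabled: swap the two fields so text reaches `content`.
        return DeltaMessage(
            content=delta.reasoning_content,
            reasoning_content=delta.content,
        )


parser = NanoV3ParserSketch(enable_thinking=False)
print(parser.extract_reasoning_streaming("Hello"))
```

Reading the flag in `__init__` keeps the per-chunk override cheap, since the streaming method runs once for every delta.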

Testing

Tested with vLLM v0.1.dev on NVIDIA DGX Spark (GB10) with both streaming and non-streaming requests:

# Streaming with thinking disabled - now works correctly
curl -X POST http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "/path/to/model",
        "messages": [{"role": "user", "content": "Hello"}],
        "stream": true,
        "chat_template_kwargs": {"enable_thinking": false}
    }'

Content now correctly appears in delta.content for all streaming chunks.
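
To check the raw curl output without a client library, the streamed `data:` lines can be parsed directly. The sketch below assumes the usual OpenAI-compatible server-sent-events framing (one JSON object per `data:` line, terminated by `data: [DONE]`); `content_from_sse` is a hypothetical helper, and `sample` is a hand-written illustration rather than captured server output.

```python
import json


def content_from_sse(lines):
    """Join delta.content across the 'data: ...' lines of a streamed response."""
    parts = []
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # ignore blank separators and non-data lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        event = json.loads(payload)
        for choice in event.get("choices", []):
            delta = choice.get("delta", {})
            parts.append(delta.get("content") or "")
    return "".join(parts)


# Hand-written sample in the assumed framing.
sample = [
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    "data: [DONE]",
]
print(content_from_sse(sample))
```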

Fix streaming output when enable_thinking is disabled · f17bf950

Kwindla changed pull request status to open Dec 22, 2025

g-a-b-y

Jan 16

@Kwindla This does fix the non-reasoning streaming issue, but it breaks tool calling.

I can't get any IDE to use tools after adding this parser.

Any ideas?

Ready to merge

This branch is ready to get merged automatically.