Tool calling issue: boolean arguments come back as the string "True" instead of the valid JSON value true (the primitive, unquoted value)
I have a tool with a parameter defined as "type": "boolean". However, the model keeps calling this tool with a Python-style value:
"key": "True"
instead of the correct JSON format for tool calling:
"key": true
Interestingly, the reasoning block contains the correct format, but the actual tool call in the response returns with the incorrect Python-style format.
I should note that I'm using DeepInfra via OpenRouter and testing with a direct Postman call: there are no frameworks between the request and the response, just a plain HTTP POST to the /chat/completions endpoint.
The same test produces correct JSON-formatted tool calls when using both Qwen3 30B and GPT-OSS 20B on DeepInfra via OpenRouter.
UPDATE:
I repeated the above test against a local llama.cpp-hosted nemotron-3-nano-30b and the result was the same: the boolean is returned as "True" (with an initial uppercase letter and surrounded by double quotes) instead of true.
Steps to reproduce:
{
  "model": "Nemotron-3-Nano-30B-A3B",
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_wheather",
        "description": "return the wheather for the given city",
        "parameters": {
          "type": "object",
          "properties": {
            "city": {
              "type": "string",
              "description": "A City."
            },
            "includeWindSpeed": {
              "type": "boolean",
              "description": "Whether the wind speed should be included in the response"
            }
          },
          "required": [
            "city",
            "includeWindSpeed"
          ]
        }
      }
    }
  ],
  "messages": [
    {
      "role": "system",
      "content": "You are an helpful assistant"
    },
    {
      "role": "user",
      "content": "Find the wheather in Florence"
    }
  ]
}
Response:
[...]
"tool_calls": [
  {
    "index": 0,
    "id": "call_69f36fd277c8216e",
    "function": {
      "arguments": "{\"city\": \"Florence\", \"includeWindSpeed\": \"False\"}",
      "name": "get_wheather"
    },
    "type": "function"
  }
]
[...]
Hey @j3st3r666, I ran into the same issue. I dug into it a bit and I'm pretty sure the root cause is in the chat template itself, not the model.
In tokenizer_config.json, the chat template formats tool call arguments with args_value | string for any value that is not a dict or a list. The problem is that | string calls Python's str(), so str(True) gives you "True", whereas json.dumps(True) would give "true". Dicts and lists correctly go through | tojson, but booleans fall through to | string. So the model literally learned to output Python-style booleans because that's what it saw in training.
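To make the difference concrete, here is a minimal plain-Python sketch of the branch described above. It does not load the real template. The function names render_arg_current and render_arg_with_bool_branch are hypothetical and only mimic the "dicts and lists go to tojson, everything else goes to string" logic, on the assumption that | string behaves like str() and | tojson behaves like json.dumps() for these values.

```python
import json


def render_arg_current(value):
    """Mimic the described template branch: only dicts/lists are JSON-encoded."""
    if isinstance(value, (dict, list)):
        return json.dumps(value)  # stands in for `| tojson`
    return str(value)             # stands in for `| string`


def render_arg_with_bool_branch(value):
    """Hypothetical variant: booleans are also routed through JSON encoding."""
    if isinstance(value, (dict, list, bool)):
        return json.dumps(value)
    return str(value)


if __name__ == "__main__":
    for value in (True, False, "Florence", {"nested": True}):
        print(repr(value), "->",
              render_arg_current(value), "|",
              render_arg_with_bool_branch(value))
```

With the first function a boolean renders as True or False, while a boolean nested inside a dict renders as true, which matches the inconsistency reported in this thread.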
I tested this on build.nvidia.com (Super-49B): if you ask the model for plain JSON output, it gives you correct true/false. But when it uses its native XML tool call format, you get True/False. I even tried few-shot prompting both ways: give it examples with JSON bools and it follows the JSON style, give it Python bools and it follows that. The model isn't broken, the template just trained it wrong for this case.
The fix is straightforward: add booleans to the tojson branch in the template, e.g. args_value is sameas true or args_value is sameas false. Alternatively, inference engines could normalize True/False to true/false on the parser side, since <type>boolean</type> is already in the tool schema.
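Until either fix lands, the same normalization can be done on the client. The following is only a sketch under assumptions: normalize_boolean_args is a hypothetical helper, not part of any inference engine, and it only looks at top-level properties of the tool's "parameters" schema. It uses the schema and the arguments string from the original post.

```python
import json

# Case-insensitive mapping from string spellings to JSON booleans.
_BOOL_STRINGS = {"true": True, "false": False}


def normalize_boolean_args(arguments_json, parameters_schema):
    """Coerce "True"/"False" strings to booleans for boolean-typed parameters.

    Only top-level properties are inspected; values for parameters that are
    not declared as "boolean" are left untouched.
    """
    args = json.loads(arguments_json)
    properties = parameters_schema.get("properties", {})
    for name, value in args.items():
        declared_type = properties.get(name, {}).get("type")
        if declared_type == "boolean" and isinstance(value, str):
            coerced = _BOOL_STRINGS.get(value.strip().lower())
            if coerced is not None:
                args[name] = coerced
    return json.dumps(args)


if __name__ == "__main__":
    schema = {
        "type": "object",
        "properties": {
            "city": {"type": "string"},
            "includeWindSpeed": {"type": "boolean"},
        },
    }
    raw = '{"city": "Florence", "includeWindSpeed": "False"}'
    print(normalize_boolean_args(raw, schema))
```

Keying the coercion on the declared type means a string parameter whose value happens to be "True" is not changed.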
@suhara would be curious if this tracks with what you guys see internally. Happy to put together a PR if useful.