Tool calling issue: got "True" as a String instead of a valid JSON format such as true (the primitive, unquoted value)

#25

by j3st3r666 - opened Dec 20, 2025

Discussion

j3st3r666

Dec 20, 2025

•

edited Dec 20, 2025

I have a tool with a parameter defined as "type": "boolean". However, the model keeps calling this tool using Python-style format:

"key": "True"

instead of the correct JSON format for tool calling:

"key": true

Interestingly, the reasoning block contains the correct format, but the actual tool call in the response returns with the incorrect Python-style format.

I should note that I'm using DeepInfra via OpenRouter and testing with a direct Postman call—no frameworks between the request and response, just a plain HTTP POST to the /chat/completions endpoint.

The same test produces correct JSON-formatted tool calls when using both Qwen3 30B and GPT-OSS 20B on DeepInfra via OpenRouter.

j3st3r666

Dec 20, 2025

UPDATE:
I repeated the above test against a local llama.cpp-hosted nemotron-3-nano-30b and the result was the same: the boolean is returned as "True" (capitalized and wrapped in double quotes) instead of true.

j3st3r666

Dec 20, 2025

•

edited Dec 20, 2025

Steps to reproduce:

{
    "model": "Nemotron-3-Nano-30B-A3B",
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_wheather",
                "description": "return the wheather for the given city",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "city": {
                            "type": "string",
                            "description": "A City."
                        },
                        "includeWindSpeed": {
                            "type": "boolean",
                            "description": "Whether the wind speed should be included in the response"
                        }
                    },
                    "required": [
                        "city",
                        "includeWindSpeed"
                    ]
                }
            }
        }
    ],
    "messages": [
        {
            "role": "system",
            "content": "You are an helpful assistant"
        },
        {
            "role": "user",
            "content": "Find the wheather in Florence"
        }
    ]
}

Response:

[...]
"tool_calls": [
    {
        "index": 0,
        "id": "call_69f36fd277c8216e",
        "function": {
            "arguments": "{\"city\": \"Florence\", \"includeWindSpeed\": \"False\"}",
            "name": "get_wheather"
        },
        "type": "function"
    }
]
[...]
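A client can detect this kind of mismatch before acting on the tool call. The following is a minimal sketch, assuming the schema fragment and the arguments string shown above; find_type_mismatches and JSON_TYPES are hypothetical names introduced for illustration and are not part of any inference API.

```python
import json

# Schema fragment and arguments string taken from the request and response above.
schema_properties = {
    "city": {"type": "string"},
    "includeWindSpeed": {"type": "boolean"},
}
arguments_json = '{"city": "Florence", "includeWindSpeed": "False"}'

# Map JSON Schema type names to the Python types that json.loads produces.
JSON_TYPES = {"string": str, "boolean": bool}


def find_type_mismatches(arguments, properties):
    """Return (name, declared type, actual Python type name) for each mismatch."""
    mismatches = []
    for name, value in arguments.items():
        declared = properties.get(name, {}).get("type")
        expected = JSON_TYPES.get(declared)
        if expected is not None and not isinstance(value, expected):
            mismatches.append((name, declared, type(value).__name__))
    return mismatches


mismatches = find_type_mismatches(json.loads(arguments_json), schema_properties)
print(mismatches)  # [('includeWindSpeed', 'boolean', 'str')]
```

The check only covers the two types used in this reproduction; a fuller validator would need to handle numbers, arrays, and nested objects as well.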

chjkh8113

Feb 4

Hey @j3st3r666, ran into the same issue. I dug into it a bit and I'm pretty sure the root cause is the chat template itself, not the model.

In tokenizer_config.json, the chat template formats tool call arguments with args_value | string for non-dict/non-list values. The problem is that | string calls Python's str(), so a boolean is rendered via str(True), which gives "True", instead of via json.dumps(True), which gives "true". Dicts and lists correctly go through | tojson, but booleans fall through to the string branch. So the model learned to output Python-style booleans because that's what it saw in training.
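The difference between the two paths can be reproduced in plain Python. This is only a sketch, assuming the template branch behaves as described in this comment: render_arg is a hypothetical stand-in for the Jinja logic, and json.dumps only approximates Jinja's tojson filter.

```python
import json


def render_arg(value):
    """Hypothetical stand-in for the template branch described above."""
    if isinstance(value, (dict, list)):
        return json.dumps(value)  # the "| tojson" path
    return str(value)             # the "| string" path


print(render_arg({"a": 1}))  # {"a": 1}
print(render_arg(True))      # True
print(json.dumps(True))      # true
```

The boolean takes the str() path and comes out capitalized, which is not a valid JSON literal.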

I tested this on build.nvidia.com (Super-49B): if you ask the model for plain JSON output, it gives you correct true/false, but when it uses its native XML tool call format, you get True/False. I also tried few-shot prompting both ways: give it examples with JSON bools and it follows JSON style, give it Python bools and it follows that. The model isn't broken; the template just taught it the wrong format for this case.

The fix is straightforward: add booleans to the tojson branch in the template with args_value is sameas true or args_value is sameas false. Alternatively, inference engines could normalize True/False to true/false on the parser side, since <type>boolean</type> is already in the tool schema.
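Both options can be sketched in plain Python. This is an illustration under assumptions, not the actual template or any engine's parser: render_arg_fixed and normalize_booleans are hypothetical names, json.dumps stands in for Jinja's tojson filter, and the schema is the one from the reproduction earlier in this thread.

```python
import json


def render_arg_fixed(value):
    """Template-side option: booleans join dicts and lists on the JSON path."""
    if isinstance(value, (dict, list, bool)):
        return json.dumps(value)
    return str(value)


def normalize_booleans(arguments, properties):
    """Parser-side option: coerce "True"/"False" strings for boolean parameters."""
    fixed = dict(arguments)
    for name, value in arguments.items():
        is_boolean = properties.get(name, {}).get("type") == "boolean"
        if is_boolean and isinstance(value, str) and value.lower() in ("true", "false"):
            fixed[name] = value.lower() == "true"
    return fixed


properties = {"city": {"type": "string"}, "includeWindSpeed": {"type": "boolean"}}
raw = {"city": "Florence", "includeWindSpeed": "False"}

print(render_arg_fixed(False))                          # false
print(json.dumps(normalize_booleans(raw, properties)))  # {"city": "Florence", "includeWindSpeed": false}
```

The parser-side version only touches parameters whose schema type is boolean, so a string parameter whose value happens to be "True" is left alone.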

@suhara would be curious if this tracks with what you guys see internally. Happy to put together a PR if useful.
