[Research] Adaptive-K Routing Validation: 33% Compute Savings on Nemotron 3 Nano
#41
by Gabrobals - opened
Hi NVIDIA team!
I've been working on Adaptive-K routing - an entropy-guided method for dynamic expert selection in MoE models. Today I validated it on Nemotron 3 Nano and wanted to share the results.
TL;DR
- 33.3% compute savings by dynamically selecting K ∈ {2, 4, 6} based on router entropy
- Zero retraining required - inference-time only
- Entropy-based selection: confident tokens use fewer experts
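As a rough illustration of the entropy-based selection summarized above, here is a minimal sketch that maps a token's router entropy to K ∈ {2, 4, 6}. The function names and the two thresholds (`low`, `high`) are hypothetical placeholders, not values taken from the validation script.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits of a probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)

def select_k(entropy, low=4.5, high=6.0):
    """Map router entropy (bits) to a number of routed experts.

    `low` and `high` are hypothetical thresholds chosen for illustration.
    Lower entropy means a more confident router, so fewer experts are used.
    """
    if entropy < low:
        return 2
    if entropy < high:
        return 4
    return 6

uniform = [1.0 / 128] * 128            # maximally uncertain router
one_hot = [1.0] + [0.0] * 127          # fully confident router
print(select_k(entropy_bits(uniform)))  # 6
print(select_k(entropy_bits(one_hot)))  # 2
```

In practice the thresholds would need to be calibrated against the entropy distribution actually observed on the model.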
Results
| Test Case | Router Entropy | Effective K | Savings |
|---|---|---|---|
| Easy tokens | 5.26 bits | 4.1 | 32.4% |
| Code tokens | 5.28 bits | 4.0 | 33.3% |
| Hard tokens | 5.16 bits | 3.9 | 34.4% |
| Average | 5.23 bits | 4.0 | 33.3% |
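One way to read the Savings column, assuming it measures the fraction of routed-expert evaluations avoided relative to the default K = 6 (the post does not state whether the shared expert or non-MoE layers are included), is the following arithmetic. The helper name `routed_savings` is made up for this sketch.

```python
K_MAX = 6  # routed experts per token in the default configuration

def routed_savings(effective_k, k_max=K_MAX):
    """Fraction of routed-expert evaluations skipped versus always using k_max."""
    return 1.0 - effective_k / k_max

print(f"{routed_savings(4.0):.1%}")  # 33.3%
```

The per-row figures in the table may differ slightly from this formula if they were computed from unrounded effective K values.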
The Insight
Nemotron 3's router entropy (measured via pre-top-k logits) averages 5.23 bits out of 7.0 max (log₂(128)). This means:
- ~75% of max entropy → router is moderately confident
- Many tokens don't need all 6 experts
- The shared expert provides a quality safety net for aggressive K reduction
Methodology
Since output_router_logits isn't supported, I used forward hooks on backbone.layers.X.mixer.gate to compute full 128-expert logits:
router_logits = hidden_states @ module.weight.T  # [batch, seq, 128]
probs = torch.softmax(router_logits, dim=-1)  # normalize over the 128 experts
entropy = -(probs * torch.log2(probs)).sum(dim=-1)  # per-token entropy in bits
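The hook-based measurement described above can be sketched end to end on a toy module. `ToyGate` is a hypothetical stand-in for the model's gate (it only assumes a `weight` of shape `[experts, hidden]`), and the hidden size is arbitrary; on the real model the hook would presumably be registered on each `backbone.layers.X.mixer.gate` instead.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_EXPERTS = 128
HIDDEN = 64  # toy size, not the model's real hidden dimension

class ToyGate(nn.Module):
    """Hypothetical router that exposes only top-k indices, not full logits."""
    def __init__(self):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(NUM_EXPERTS, HIDDEN) * 0.02)

    def forward(self, hidden_states):
        logits = hidden_states @ self.weight.T
        return logits.topk(6, dim=-1).indices

entropies = []

def entropy_hook(module, inputs, output):
    # Recompute the full logits from the gate's input, as the post describes.
    hidden_states = inputs[0]
    router_logits = hidden_states.float() @ module.weight.float().T
    log_probs = F.log_softmax(router_logits, dim=-1)  # numerically stable
    ent_nats = -(log_probs.exp() * log_probs).sum(dim=-1)
    entropies.append((ent_nats / math.log(2)).detach())  # nats -> bits

gate = ToyGate()
handle = gate.register_forward_hook(entropy_hook)
with torch.no_grad():
    gate(torch.randn(2, 5, HIDDEN))
handle.remove()  # always detach hooks once measurement is done

per_token_bits = torch.cat(entropies)
print(per_token_bits.shape)          # one entropy value per token
print(float(per_token_bits.mean()))  # mean router entropy in bits
```

Using `log_softmax` avoids `log(0)` when some expert probabilities underflow, which a direct `probs * log(probs)` would turn into NaN.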
Why This Matters for Nemotron 3
- Amplifies reasoning budget control: users already control reasoning tokens; Adaptive-K automates compute optimization per token
- Shared expert synergy: the always-active shared expert means quality is maintained even at K=2
- No retraining: drop-in replacement for the router
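To make the "drop-in" idea concrete, here is a minimal sketch of a per-token variable-K selection step. It is not the repository's implementation: `adaptive_topk` is a hypothetical name, and renormalizing the kept weights is an assumption that may not match how Nemotron's router scales its expert weights.

```python
import torch

def adaptive_topk(router_logits, k_per_token, k_max=6):
    """Select up to k_max experts per token, keeping only the top k_per_token."""
    probs = torch.softmax(router_logits.float(), dim=-1)
    top_w, top_idx = probs.topk(k_max, dim=-1)        # sorted, largest first
    ranks = torch.arange(k_max, device=router_logits.device)
    keep = ranks < k_per_token.unsqueeze(-1)          # boolean mask [..., k_max]
    top_w = top_w * keep                              # zero out dropped experts
    top_w = top_w / top_w.sum(dim=-1, keepdim=True)   # renormalize (assumption)
    return top_idx, top_w, keep

torch.manual_seed(0)
logits = torch.randn(1, 3, 128)       # [batch, seq, experts]
k = torch.tensor([[2, 4, 6]])         # per-token K, e.g. derived from entropy
idx, weights, keep = adaptive_topk(logits, k)
print(keep.sum(dim=-1))               # tensor([[2, 4, 6]])
```

Masking within a fixed `k_max` keeps tensor shapes static; actual compute savings only materialize if the expert dispatch skips the masked slots.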
Open Source
Full validation: https://github.com/Gabrobals/sbm-efficient
Results JSON: nemotron3_nano_validation.json
Validation script: nemotron3_entropy_validation.py
Questions for NVIDIA
- Would you be interested in integrating Adaptive-K as an optional routing mode?
- Is there a preferred way to contribute to the Nemotron cookbooks?
- Any plans to expose output_router_logits in future versions?
Happy to collaborate on benchmarks or provide a PR to the NeMo repository!
Gabriele Balsamo
GitHub: @Gabrobals