Instructions to use nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForMultimodalLM

tokenizer = AutoTokenizer.from_pretrained("nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4", trust_remote_code=True)
model = AutoModelForMultimodalLM.from_pretrained("nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
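
The same request can also be sent from Python. The following is a minimal sketch using only the standard library; it assumes the server started above is listening on localhost:8000 and returns the usual OpenAI-style response shape. The helper names `build_chat_request` and `chat` are illustrative, not part of vLLM.

```python
import json
import urllib.request

MODEL_ID = "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4"


def build_chat_request(prompt, base_url="http://localhost:8000", model=MODEL_ID):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(prompt, **kwargs)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    # Assumption: the server answers in the OpenAI chat-completions format,
    # with the reply under choices[0].message.content.
    return body["choices"][0]["message"]["content"]
```

With a server running, `print(chat("What is the capital of France?"))` should print the reply. Passing `base_url="http://localhost:30000"` points the same helper at a server on port 30000 instead.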

Use Docker

docker model run hf.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4

SGLang

How to use nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 with Docker Model Runner:
```
docker model run hf.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4
```

Run on DGX Spark

#14

by LimeemiL - opened Mar 15

Discussion

LimeemiL

Mar 15

Hello, I was trying to get this model running on the DGX Spark with the latest vLLM (I do not use the official DGX Spark container; I build the latest from GitHub). All other models work, but almost none of the NVFP4 models do (they fail with "illegal instruction"). So I wanted to ask: how can it be done? I looked at TRTLLM, and it does not seem to support it yet either. I haven't checked SGLang, but getting its latest version running on the DGX Spark is not as simple as with vLLM.
Thanks.

nawoalanor

Mar 15

•

edited Mar 15

Try with the experimental Avarok backend. It makes nvfp4 work on GB10 faster than int4 (~20%) but I can't say for sure that it will work for this specific LLM.

Its developer is working on further optimizations to possibly greatly improve performance but it's a complicated task.

shakhizat

Mar 15

The same on the Nvidia Jetson Thor: https://github.com/vllm-project/vllm/issues/37060

Zambonilli

Mar 15

I was able to get vLLM to boot by setting the attention backend to FlashInfer, but it's not fast and it gobbled nearly all the memory on the DGX Spark. After a couple of slow turns in a chat, vLLM crashed. A bit disappointing, because the Nano NVFP4 ran at ~80 tokens/second, and right now the Super's Q4_K_M GGUF runs better than the NVFP4.

eugreugr

Mar 15

It only runs with Marlin backend right now. Have a look at our community Docker and an existing recipe: https://github.com/eugr/spark-vllm-docker/blob/main/recipes/nemotron-3-super-nvfp4.yaml

raphaelamorim

Mar 16

You guys can also take a look at the expected performance at https://spark-arena.com

steveheh

NVIDIA org Mar 16

@LimeemiL What vllm configs do you use to launch it on DGX Spark? I used the provided ones in the HF model card and the process got killed due to OOM every time....

catplusplus

Mar 16

I get a different tensor size mismatch exception on NVIDIA Thor + Triton attention. It would help if someone from NVIDIA got this to work on Thor, DGX Spark and, for good measure, an sm_120 consumer GPU, and documented the versions, config and expected performance. The similar Qwen 3.5 122B model runs fine, at least on Thor, with the same vLLM, so there seems to be a bug specific to this model's attention implementation.

LimeemiL

Mar 16

@eugreugr
Thanks, it now works.
When I set it up, I decided to test with Qwen3.5 35B NVFP4 first, and it failed, so I thought the setup didn't work. Then I tried the Nemotron Super, and it works. Thanks again.

dionode

Mar 17

Hi,

Any recommendations or known docs for running on DGX Spark using vLLM?

I've downloaded the NVFP4 version, but vLLM raises this error:

ModelOpt currently only supports: ['FP8', 'NVFP4'] quantizations in vLLM

Both config.json and hf_quant_config.json define "quant_algo": "MIXED_PRECISION" rather than plain NVFP4, so vLLM raises the error.

Any idea on a possible fix?
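
To confirm which quantization algorithm a downloaded checkpoint declares, a small check like the one below may help. It is only a sketch: it assumes the two files are plain JSON in the model directory, and it searches for "quant_algo" at any nesting depth because the exact layout of those files is not shown here. The function names are made up for illustration.

```python
import json
from pathlib import Path


def find_quant_algos(node):
    """Recursively collect every value stored under a "quant_algo" key."""
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "quant_algo":
                found.append(value)
            else:
                found.extend(find_quant_algos(value))
    elif isinstance(node, list):
        for item in node:
            found.extend(find_quant_algos(item))
    return found


def report_quant_algos(model_dir):
    """Map each config file present in model_dir to the quant_algo values it declares."""
    report = {}
    for name in ("config.json", "hf_quant_config.json"):
        path = Path(model_dir) / name
        if path.is_file():
            report[name] = find_quant_algos(json.loads(path.read_text()))
    return report
```

For example, `report_quant_algos("/models/nemotron-120b")` would list the declared values per file, which makes it easy to see whether a checkpoint reports MIXED_PRECISION before launching the server.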

I was able to run NVIDIA-Nemotron-3-Super-120B using Ollama and OpenWebUI but I want to try with vLLM to get the theoretical higher performance with dedicated quantization.

Trying to run with docker:

docker run -d --name vllm-nemotron \
  --gpus all \
  --shm-size=16gb \
  --network host \
  -p 8000:8000 \
  -v ~/Documents/models:/models \
  -e VLLM_USE_FLASHINFER_MOE_FP4=1 \
  -e VLLM_FLASHINFER_MOE_BACKEND=latency \
  -e VLLM_MOE_PADDING_SIZE=512 \
  nvcr.io/nvidia/vllm:26.01-py3 \
  vllm serve "/models/nemotron-120b" \
  --quantization nvfp4 \
  --tensor-parallel-size 4 \
  --max-model-len 2048 \
  --gpu-memory-utilization 0.90 \
  --trust-remote-code \
  --kv-cache-dtype fp8 \
  --override-generation-config '{"max_new_tokens":4096}'

Support docs:

https://vllm.ai/blog/nemotron-3-super
https://catalog.ngc.nvidia.com/orgs/nvidia/containers/vllm?version=26.02-py3

eugreugr

Mar 17

@dionode - yes, you can follow this guide: https://github.com/eugr/Nemotron/tree/main/usage-cookbook/Nemotron-3-Super/AdvancedDeploymentGuide#config-b---dgx-spark-single-and-dual-configuration
NVIDIA folks asked me to submit this PR, but it's still not approved.

It is using our community Docker: https://github.com/eugr/spark-vllm-docker

raphaelamorim

Mar 18

@dionode our community Docker has recipes, and the community shares its recipes and benchmarks for different configurations, runtimes, and cluster sizes at http://spark-arena.com. Please check it out.

dionode

Mar 19

@eugreugr & @raphaelamorim Thanks for sharing the additional resources. Super helpful to kickstart my work with the DGX!

ryuunami1

Mar 21

Can I run this Nemotron Super on a DGX Spark with 200k context and an FP4 KV cache? Thanks!

bkmtech

Mar 21

•

edited Mar 21

This worked for me. Great model so far. Huge context (256k):

cd ~/Documents
git clone https://github.com/eugr/spark-vllm-docker.git
cd spark-vllm-docker

./build-and-copy.sh          # one-time build (~5-10 min, builds the patched image)

./run-recipe.sh nemotron-3-super-nvfp4 --solo

The author maintains this build of vLLM for the DGX Spark, and it includes the setup for this model.
https://github.com/eugr/spark-vllm-docker

Specifically this docker compose:
https://github.com/eugr/spark-vllm-docker/blob/main/recipes/nemotron-3-super-nvfp4.yaml

llm-wizard

Apr 1

https://github.com/NVIDIA-NeMo/Nemotron/tree/main/usage-cookbook/Nemotron-3-Super/SparkDeploymentGuide

This is where we're going to keep up to date/most recent configs for Spark for this model!

LMK if you run into any issues!

RedstoneWhite

Apr 5

Hi, I followed the instructions in this guide exactly to deploy Nemotron 3 Super to my Founders Edition Spark using vLLM (cu130-nightly, 1d4065704a3a). I ran into a problem: in streaming mode, the parser seems to fail to separate the thinking from the content, so responses end up hidden inside thinking blocks in frontends like Jan and OpenWebUI, and the model became completely unusable with openclaw.
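
For anyone debugging this on the client side, here is a rough sketch of separating thinking from content in a streamed reply. It is not the vLLM parser. It assumes the server is run without a reasoning parser and that the raw text marks reasoning with `<think>` and `</think>` tags; both the tag names and the `ThinkSplitter` class are assumptions made for illustration. The awkward part in streaming is that a tag can be cut in half across two chunks, so the splitter holds back any trailing text that could be the start of a tag.

```python
class ThinkSplitter:
    """Split streamed text into thinking and content (assumed <think> tags)."""

    OPEN, CLOSE = "<think>", "</think>"

    def __init__(self):
        self.in_think = False
        self.buffer = ""
        self.thinking = []
        self.content = []

    def feed(self, chunk):
        """Consume one streamed chunk of raw text."""
        self.buffer += chunk
        while self.buffer:
            tag = self.CLOSE if self.in_think else self.OPEN
            sink = self.thinking if self.in_think else self.content
            index = self.buffer.find(tag)
            if index != -1:
                # A complete tag arrived: emit text before it, then switch mode.
                sink.append(self.buffer[:index])
                self.buffer = self.buffer[index + len(tag):]
                self.in_think = not self.in_think
                continue
            # No complete tag: hold back the longest suffix that could be
            # the beginning of the tag, and emit everything before it.
            keep = 0
            for size in range(min(len(tag) - 1, len(self.buffer)), 0, -1):
                if self.buffer.endswith(tag[:size]):
                    keep = size
                    break
            cut = len(self.buffer) - keep
            sink.append(self.buffer[:cut])
            self.buffer = self.buffer[cut:]
            break

    def finish(self):
        """Flush held-back text and return (thinking, content)."""
        sink = self.thinking if self.in_think else self.content
        sink.append(self.buffer)
        self.buffer = ""
        return "".join(self.thinking), "".join(self.content)
```

Feeding each streamed text delta to `feed()` and calling `finish()` at the end gives the two parts separately. If this shows clean content while the frontend still hides it, that would suggest the issue is in the server-side parser rather than in the model output.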

mholler47

May 10

I have been able to get the Nemotron-3-Super-120B model to run at the command line using pieces of the NVIDIA script "NemoClaw with Nemotron 3 Super and Telegram on DGX Spark". I skipped the NemoClaw and Telegram parts. Just do steps 1-3: configure Docker and the NVIDIA container runtime, install Ollama, and then use Ollama to pull the model and run it. You enter prompts at the command line and get its responses there. It runs very fast, using 95% of the DGX GPUs.

I tried to use WebUI to provide a web browser interface for the text interactions, following the "OpenWebUI with ollama" playbook, but it wouldn't connect to the model I already had running. It downloaded some version of the 120B model, which then showed up in the WebUI but ran very slowly there, with no sign that it was using any of the GPUs. I am going to try the llama.cpp approach shown in the "Nemotron-3-Nano with llama.cpp" playbook, which runs the model on the GPUs and gives a nice GPT-style UI in the browser. llama.cpp gets compiled with CUDA in the process. llama.cpp is an inference engine alternative to vLLM, but I don't know how broadly usable it is. I'm hoping it at least also works for the Nemotron-3-Super model, although I can live with the command-line interface for now.

mholler47

May 13

Follow-up: I did find a .gguf file of the Nemotron 3 Super 120B model on Hugging Face that was able to run and serve a standard web UI at localhost:30000. This model is apparently a quantized (Q4_K) version posted by Mr. Gerganov. It runs fast (16 tps) on my Spark, but I don't know what quality impact the quantization had. Inference with this model uses 89 GB of RAM on the Spark.

cosmicequanimity

13 days ago

Running a GGUF model with Ollama on a DGX Spark completely defeats the purpose, I think: reduced precision and extra overhead. Am I wrong?
