
Run on DGX Spark

#14
by LimeemiL - opened

Hello, I was trying to get this model running on the DGX Spark with the latest vLLM (I do not use the official DGX Spark container; I build the latest from GitHub). All other models work, but almost none of the NVFP4 models do (they fail with "illegal instruction"). So I wanted to ask: how can it be done? I looked at TRTLLM, and it does not seem to support it yet either. I haven't checked sglang, but getting its latest version running on the DGX Spark is not as simple as with vLLM.
Thanks.

Try the experimental Avarok backend. It makes NVFP4 run about 20% faster than INT4 on GB10, but I can't say for sure that it will work for this specific LLM.

Its developer is working on further optimizations that could greatly improve performance, but it's a complicated task.

The same on the Nvidia Jetson Thor: https://github.com/vllm-project/vllm/issues/37060

I was able to get vLLM to boot by setting the attention backend to flashinfer, but it's not fast and it gobbled nearly all the memory on the DGX Spark. After a couple of slow turns in a chat, vLLM crashed. A bit disappointing, because the Nano NVFP4 ran at ~80 tokens/second, and right now the Super's Q4_K_M GGUF runs better than the NVFP4.

It only runs with the Marlin backend right now. Have a look at our community Docker and an existing recipe: https://github.com/eugr/spark-vllm-docker/blob/main/recipes/nemotron-3-super-nvfp4.yaml

You guys can also take a look at the expected performance at https://spark-arena.com

@LimeemiL What vLLM configs do you use to launch it on DGX Spark? I used the ones provided in the HF model card and the process got killed due to OOM every time.

I get a different tensor size mismatch exception on NVIDIA Thor with triton attention. It would help if someone from NVIDIA got this to work on Thor, DGX Spark and, for good measure, an sm_120 consumer GPU, and documented the versions, config and expected performance. The similar Qwen 3.5 122B model runs fine, at least on Thor, with the same vLLM, so there seems to be a bug specific to this model's attention implementation.

@eugreugr
Thanks, it now works.
When I set it up, I decided to test with Qwen3.5 35B NVFP4 first, and it failed, so I thought the setup didn't work. Then I tried the Nemotron Super, and it works. Thanks again.

Hi,

Any recommendations or known docs for running on DGX Spark using vLLM?

I've downloaded the NVFP4 version but vLLM raises this error:

ModelOpt currently only supports: ['FP8', 'NVFP4'] quantizations in vLLM

Both config.json and hf_quant_config.json define "quant_algo": "MIXED_PRECISION" rather than NVFP4 alone, so vLLM raises the error.

Any idea for a possible fix?
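A quick way to confirm what the checkpoint declares is to scan config.json and hf_quant_config.json for every "quant_algo" key. The following is only a minimal sketch: the helper names `find_quant_algos` and `report` and the `/models/nemotron-120b` path are illustrative assumptions, and the search is recursive so that it does not depend on any particular JSON layout.

```python
import json
from pathlib import Path


def find_quant_algos(obj, path=""):
    """Recursively collect (json_path, value) for every 'quant_algo' key."""
    found = []
    if isinstance(obj, dict):
        for key, value in obj.items():
            child = f"{path}.{key}" if path else key
            if key == "quant_algo":
                found.append((child, value))
            found.extend(find_quant_algos(value, child))
    elif isinstance(obj, list):
        for i, item in enumerate(obj):
            found.extend(find_quant_algos(item, f"{path}[{i}]"))
    return found


def report(model_dir):
    """Print every quant_algo entry found in the two config files, if present."""
    for name in ("config.json", "hf_quant_config.json"):
        file = Path(model_dir) / name
        if not file.is_file():
            print(f"{name}: not found")
            continue
        for where, value in find_quant_algos(json.loads(file.read_text())):
            print(f"{name}: {where} = {value!r}")


if __name__ == "__main__":
    # Hypothetical local path; point this at the downloaded checkpoint.
    report("/models/nemotron-120b")
```

This only reports what the files contain. It does not change how vLLM interprets them, so it helps narrow down the mismatch rather than fix it.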

I was able to run NVIDIA-Nemotron-3-Super-120B using Ollama and OpenWebUI but I want to try with vLLM to get the theoretical higher performance with dedicated quantization.

Trying to run with docker:

docker run -d --name vllm-nemotron \
  --gpus all \
  --shm-size=16gb \
  --network host \
  -p 8000:8000 \
  -v ~/Documents/models:/models \
  -e VLLM_USE_FLASHINFER_MOE_FP4=1 \
  -e VLLM_FLASHINFER_MOE_BACKEND=latency \
  -e VLLM_MOE_PADDING_SIZE=512 \
  nvcr.io/nvidia/vllm:26.01-py3 \
  vllm serve "/models/nemotron-120b" \
  --quantization nvfp4 \
  --tensor-parallel-size 4 \
  --max-model-len 2048 \
  --gpu-memory-utilization 0.90 \
  --trust-remote-code \
  --kv-cache-dtype fp8 \
  --override-generation-config '{"max_new_tokens":4096}'

Support docs:

https://vllm.ai/blog/nemotron-3-super
https://catalog.ngc.nvidia.com/orgs/nvidia/containers/vllm?version=26.02-py3

@dionode - yes, you can follow this guide: https://github.com/eugr/Nemotron/tree/main/usage-cookbook/Nemotron-3-Super/AdvancedDeploymentGuide#config-b---dgx-spark-single-and-dual-configuration
NVIDIA folks asked me to submit this PR, but it's still not approved.

It is using our community Docker: https://github.com/eugr/spark-vllm-docker

@dionode our community Docker has recipes, and the community shares its recipes and benchmarks of different configurations, runtimes and cluster sizes at http://spark-arena.com. Please check it out.

@eugreugr & @raphaelamorim Thanks for sharing the additional resources. Super helpful to kickstart my work with the DGX!

Can I run this Nemotron Super on a DGX Spark with 200k context and an FP4 KV cache? Thanks!
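For a rough feel of how KV cache size scales with context length and cache precision, the usual back-of-the-envelope formula is 2 (keys and values) x attention layers x KV heads x head dim x bytes per element x tokens. The sketch below applies it with PLACEHOLDER architecture values that are not taken from this model's config.json; substitute the real ones before drawing conclusions. It counts attention layers only, ignores quantization scale factors and allocator overhead, and says nothing about whether a given runtime accepts an FP4 KV cache.

```python
def kv_cache_gib(tokens, attn_layers, kv_heads, head_dim, bytes_per_elem):
    """Estimate KV cache size in GiB for one sequence.

    Counts keys and values (factor of 2) for attention layers only and
    ignores quantization scale factors and allocator overhead.
    """
    total_bytes = 2 * attn_layers * kv_heads * head_dim * bytes_per_elem * tokens
    return total_bytes / 2**30


if __name__ == "__main__":
    # PLACEHOLDER architecture values, not taken from the model's config.json.
    shape = dict(attn_layers=8, kv_heads=8, head_dim=128)
    for label, width in (("fp16", 2.0), ("fp8", 1.0), ("fp4", 0.5)):
        size = kv_cache_gib(200_000, bytes_per_elem=width, **shape)
        print(f"{label}: {size:.2f} GiB for 200k tokens")
```

Halving the element width halves the estimate, so the comparison between precisions holds whatever the real layer and head counts turn out to be.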

This worked for me. Great model so far. Huge context (256k):

cd ~/Documents
git clone https://github.com/eugr/spark-vllm-docker.git
cd spark-vllm-docker

./build-and-copy.sh          # one-time build (~5-10 min, builds the patched image)

./run-recipe.sh nemotron-3-super-nvfp4 --solo

The author maintains this version of vLLM for DGX Spark, and it includes the setup for this model.
https://github.com/eugr/spark-vllm-docker

Specifically this docker compose:
https://github.com/eugr/spark-vllm-docker/blob/main/recipes/nemotron-3-super-nvfp4.yaml

https://github.com/NVIDIA-NeMo/Nemotron/tree/main/usage-cookbook/Nemotron-3-Super/SparkDeploymentGuide

This is where we're going to keep up to date/most recent configs for Spark for this model!

LMK if you run into any issues!

Hi, I followed the instructions in this guide exactly to deploy Nemotron 3 Super on my Founders Edition Spark using vLLM (cu130-nightly, 1d4065704a3a). I ran into a problem: in streaming mode, the parser seems to fail to separate the thinking from the content, so responses end up hidden in thinking blocks in frontends like Jan and OpenWebUI, and the model becomes completely unusable with openclaw.
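One way to check this independently of any frontend is to capture the raw streamed lines from the chat completions endpoint and count how much text arrives as reasoning versus as final content. This is a minimal sketch under assumptions: the helper `summarize_stream` is hypothetical, and the reasoning field names (`reasoning_content`, `reasoning`) are guesses about the server's streaming deltas that may differ between vLLM versions.

```python
import json


def summarize_stream(sse_lines, reasoning_keys=("reasoning_content", "reasoning")):
    """Count characters streamed as reasoning vs. final content.

    `sse_lines` is an iterable of raw 'data: ...' lines from a streaming
    chat completion. The reasoning field names are assumptions and may
    differ between server versions.
    """
    totals = {"reasoning": 0, "content": 0}
    for line in sse_lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            delta = choice.get("delta") or {}
            totals["content"] += len(delta.get("content") or "")
            # Count only the first reasoning field present to avoid double counting.
            for key in reasoning_keys:
                if delta.get(key):
                    totals["reasoning"] += len(delta[key])
                    break
    return totals
```

If `content` stays at 0 while `reasoning` keeps growing, the server is placing the whole response in the reasoning field, which would match the symptom described above.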

I have been able to get the nemotron-3-super-120B model to run at the command line using pieces of the NVIDIA script "NemoClaw with Nemotron 3 Super and Telegram on DGX Spark". I skipped the NemoClaw and Telegram parts. Just do steps 1-3: configure Docker and the NVIDIA container runtime, install Ollama, and then use Ollama to pull the model and run it. You enter prompts at the command line and get its responses there. It runs very fast, using 95% of the DGX GPUs.

I tried to use WebUI to provide a web browser interface for the text interactions, following the "OpenWebUI with ollama" playbook, but it wouldn't connect to the model I already had running. It downloaded some version of the 120B model, which then showed up in the WebUI but ran very slowly, with no sign that it was using any of the GPUs. I am going to try the llama.cpp approach shown in the "Nemotron-3-Nano with llama.cpp" playbook, which runs the model on the GPUs and gives a nice GPT-style UI in the browser. llama.cpp gets compiled with CUDA support in the process. llama.cpp is an inference engine alternative to vLLM, but I don't know how broadly usable it is. I'm hoping it also works for the Nemotron-3-super model, although I can live with the command line interface for now.

Follow up: I did find a .gguf file of the Nemotron 3 Super 120B model on HuggingFace which was able to run and serve a standard web UI at localhost:30000. This model is apparently a quantized (Q4_K) version posted by Mr. Gerganov. It runs fast (16 tps) on my Spark, but I don't know what impact the quantization had on quality. Inference with this model uses 89GB of RAM on the Spark.

Running a GGUF model with Ollama on a DGX Spark completely defeats the purpose, I think. Reduced precision and extra overhead. Am I wrong?
