Instructions to use cyankiwi/NVIDIA-Nemotron-3-Super-120B-A12B-AWQ-4bit with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use cyankiwi/NVIDIA-Nemotron-3-Super-120B-A12B-AWQ-4bit with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="cyankiwi/NVIDIA-Nemotron-3-Super-120B-A12B-AWQ-4bit")

# Load model directly
from transformers import AutoModel
model = AutoModel.from_pretrained("cyankiwi/NVIDIA-Nemotron-3-Super-120B-A12B-AWQ-4bit", dtype="auto")

Notebooks
Google Colab
Kaggle
Local Apps Settings

vLLM

How to use cyankiwi/NVIDIA-Nemotron-3-Super-120B-A12B-AWQ-4bit with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "cyankiwi/NVIDIA-Nemotron-3-Super-120B-A12B-AWQ-4bit"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "cyankiwi/NVIDIA-Nemotron-3-Super-120B-A12B-AWQ-4bit",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker

docker model run hf.co/cyankiwi/NVIDIA-Nemotron-3-Super-120B-A12B-AWQ-4bit

SGLang

How to use cyankiwi/NVIDIA-Nemotron-3-Super-120B-A12B-AWQ-4bit with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "cyankiwi/NVIDIA-Nemotron-3-Super-120B-A12B-AWQ-4bit" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "cyankiwi/NVIDIA-Nemotron-3-Super-120B-A12B-AWQ-4bit",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "cyankiwi/NVIDIA-Nemotron-3-Super-120B-A12B-AWQ-4bit" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "cyankiwi/NVIDIA-Nemotron-3-Super-120B-A12B-AWQ-4bit",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Docker Model Runner
How to use cyankiwi/NVIDIA-Nemotron-3-Super-120B-A12B-AWQ-4bit with Docker Model Runner:
```
docker model run hf.co/cyankiwi/NVIDIA-Nemotron-3-Super-120B-A12B-AWQ-4bit
```

The RTX3090 works very well，ths！

by summerbuild - opened Mar 17

Discussion

summerbuild

Mar 17

CUDA_VISIBLE_DEVICES=0,1,2,3 PYTHONIOENCODING=utf-8 vllm serve cyankiwi/NVIDIA-Nemotron-3-Super-120B-A12B-AWQ-4bit
--host 0.0.0.0
--port 50001
--api-key xxx
--served-model-name my-model
--override-generation-config '{"temperature": 1, "top_p": 0.95}'
--trust-remote-code
--tensor-parallel-size 2
--pipeline-parallel-size 2
--enable-prefix-caching
--enable-chunked-prefill
--max-model-len 262144
--gpu-memory-utilization 0.85
--max-num-seqs 64
--enable-auto-tool-choice
--tool-call-parser qwen3_coder
--attention-backend flashinfer
--reasoning-parser nemotron_v3
--mamba-ssm-cache-dtype float16

dehnhaide

Mar 20

Indeed, great quant, thanks cyankiwi! On 8x RTX3090 I've got it (eventually!!!) working with:

export OMP_NUM_THREADS=6
export VLLM_SKIP_P2P_CHECK=1
export NCCL_P2P_LEVEL=NVL
export PYTHONIOENCODING=utf-8

vllm serve cyankiwi/NVIDIA-Nemotron-3-Super-120B-A12B-AWQ-4bit --served-model-name "cyankiwi/NVIDIA-Nemotron-3-Super-120B-A12B-AWQ-4bit"
--tensor-parallel-size 2
--pipeline-parallel-size 4
--max-model-len 262144
--gpu-memory-utilization 0.85
--max-num-seqs 4
--max-num-batched-tokens 4192
--kv-cache-dtype fp8
--enable-expert-parallel
--attention-backend flashinfer
--swap-space 0
--trust-remote-code
--enable-chunked-prefill
--mamba-ssm-cache-dtype float16
--reasoning-parser-plugin cyankiwi/NVIDIA-Nemotron-3-Super-120B-A12B-AWQ-4bit/super_v3_reasoning_parser.py
--reasoning-parser super_v3
--enable-auto-tool-choice
--tool-call-parser qwen3_coder
--host 0.0.0.0
--port 5005
--disable-uvicorn-access-log
--override-generation-config '{"temperature": 1, "top_p": 0.95}'

summerbuild changed discussion title from The RX3090 works very well，ths！ to The RTX3090 works very well，ths！ Mar 23

anonymousmaharaj

Mar 27

CUDA_VISIBLE_DEVICES=0,1,2,3 PYTHONIOENCODING=utf-8 vllm serve cyankiwi/NVIDIA-Nemotron-3-Super-120B-A12B-AWQ-4bit
--host 0.0.0.0
--port 50001
--api-key xxx
--served-model-name my-model
--override-generation-config '{"temperature": 1, "top_p": 0.95}'
--trust-remote-code
--tensor-parallel-size 2
--pipeline-parallel-size 2
--enable-prefix-caching
--enable-chunked-prefill
--max-model-len 262144
--gpu-memory-utilization 0.85
--max-num-seqs 64
--enable-auto-tool-choice
--tool-call-parser qwen3_coder
--attention-backend flashinfer
--reasoning-parser nemotron_v3
--mamba-ssm-cache-dtype float16

Hey Bro which vLLM version do u use?

dehnhaide

Mar 27

•

edited Mar 27

@
Hey Bro which vLLM version do u use?

Latest

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment