Instructions to use cyankiwi/NVIDIA-Nemotron-3-Super-120B-A12B-AWQ-4bit with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use cyankiwi/NVIDIA-Nemotron-3-Super-120B-A12B-AWQ-4bit with Transformers:
# Use a pipeline as a high-level helper from transformers import pipeline pipe = pipeline("text-generation", model="cyankiwi/NVIDIA-Nemotron-3-Super-120B-A12B-AWQ-4bit")# Load model directly from transformers import AutoModel model = AutoModel.from_pretrained("cyankiwi/NVIDIA-Nemotron-3-Super-120B-A12B-AWQ-4bit", dtype="auto") - Notebooks
- Google Colab
- Kaggle
- Local Apps Settings
- vLLM
How to use cyankiwi/NVIDIA-Nemotron-3-Super-120B-A12B-AWQ-4bit with vLLM:
Install from pip and serve model
# Install vLLM from pip: pip install vllm # Start the vLLM server: vllm serve "cyankiwi/NVIDIA-Nemotron-3-Super-120B-A12B-AWQ-4bit" # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:8000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "cyankiwi/NVIDIA-Nemotron-3-Super-120B-A12B-AWQ-4bit", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }'Use Docker
docker model run hf.co/cyankiwi/NVIDIA-Nemotron-3-Super-120B-A12B-AWQ-4bit
- SGLang
How to use cyankiwi/NVIDIA-Nemotron-3-Super-120B-A12B-AWQ-4bit with SGLang:
Install from pip and serve model
# Install SGLang from pip: pip install sglang # Start the SGLang server: python3 -m sglang.launch_server \ --model-path "cyankiwi/NVIDIA-Nemotron-3-Super-120B-A12B-AWQ-4bit" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "cyankiwi/NVIDIA-Nemotron-3-Super-120B-A12B-AWQ-4bit", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }'Use Docker images
docker run --gpus all \ --shm-size 32g \ -p 30000:30000 \ -v ~/.cache/huggingface:/root/.cache/huggingface \ --env "HF_TOKEN=<secret>" \ --ipc=host \ lmsysorg/sglang:latest \ python3 -m sglang.launch_server \ --model-path "cyankiwi/NVIDIA-Nemotron-3-Super-120B-A12B-AWQ-4bit" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "cyankiwi/NVIDIA-Nemotron-3-Super-120B-A12B-AWQ-4bit", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }' - Docker Model Runner
How to use cyankiwi/NVIDIA-Nemotron-3-Super-120B-A12B-AWQ-4bit with Docker Model Runner:
docker model run hf.co/cyankiwi/NVIDIA-Nemotron-3-Super-120B-A12B-AWQ-4bit
The RTX3090 works very well,ths!
CUDA_VISIBLE_DEVICES=0,1,2,3 PYTHONIOENCODING=utf-8 vllm serve cyankiwi/NVIDIA-Nemotron-3-Super-120B-A12B-AWQ-4bit
--host 0.0.0.0
--port 50001
--api-key xxx
--served-model-name my-model
--override-generation-config '{"temperature": 1, "top_p": 0.95}'
--trust-remote-code
--tensor-parallel-size 2
--pipeline-parallel-size 2
--enable-prefix-caching
--enable-chunked-prefill
--max-model-len 262144
--gpu-memory-utilization 0.85
--max-num-seqs 64
--enable-auto-tool-choice
--tool-call-parser qwen3_coder
--attention-backend flashinfer
--reasoning-parser nemotron_v3
--mamba-ssm-cache-dtype float16
Indeed, great quant, thanks cyankiwi! On 8x RTX3090 I've got it (eventually!!!) working with:
export OMP_NUM_THREADS=6
export VLLM_SKIP_P2P_CHECK=1
export NCCL_P2P_LEVEL=NVL
export PYTHONIOENCODING=utf-8
vllm serve cyankiwi/NVIDIA-Nemotron-3-Super-120B-A12B-AWQ-4bit --served-model-name "cyankiwi/NVIDIA-Nemotron-3-Super-120B-A12B-AWQ-4bit"
--tensor-parallel-size 2
--pipeline-parallel-size 4
--max-model-len 262144
--gpu-memory-utilization 0.85
--max-num-seqs 4
--max-num-batched-tokens 4192
--kv-cache-dtype fp8
--enable-expert-parallel
--attention-backend flashinfer
--swap-space 0
--trust-remote-code
--enable-chunked-prefill
--mamba-ssm-cache-dtype float16
--reasoning-parser-plugin cyankiwi/NVIDIA-Nemotron-3-Super-120B-A12B-AWQ-4bit/super_v3_reasoning_parser.py
--reasoning-parser super_v3
--enable-auto-tool-choice
--tool-call-parser qwen3_coder
--host 0.0.0.0
--port 5005
--disable-uvicorn-access-log
--override-generation-config '{"temperature": 1, "top_p": 0.95}'
@
CUDA_VISIBLE_DEVICES=0,1,2,3 PYTHONIOENCODING=utf-8 vllm serve cyankiwi/NVIDIA-Nemotron-3-Super-120B-A12B-AWQ-4bit
--host 0.0.0.0
--port 50001
--api-key xxx
--served-model-name my-model
--override-generation-config '{"temperature": 1, "top_p": 0.95}'
--trust-remote-code
--tensor-parallel-size 2
--pipeline-parallel-size 2
--enable-prefix-caching
--enable-chunked-prefill
--max-model-len 262144
--gpu-memory-utilization 0.85
--max-num-seqs 64
--enable-auto-tool-choice
--tool-call-parser qwen3_coder
--attention-backend flashinfer
--reasoning-parser nemotron_v3
--mamba-ssm-cache-dtype float16
Hey Bro which vLLM version do u use?
@
Hey Bro which vLLM version do u use?
Latest