Text Generation
Transformers
Safetensors
PyTorch
nvidia
nemotron-3
latent-moe
mtp

The RTX3090 works very well,ths!

#2
by summerbuild - opened

CUDA_VISIBLE_DEVICES=0,1,2,3 PYTHONIOENCODING=utf-8 vllm serve cyankiwi/NVIDIA-Nemotron-3-Super-120B-A12B-AWQ-4bit
--host 0.0.0.0
--port 50001
--api-key xxx
--served-model-name my-model
--override-generation-config '{"temperature": 1, "top_p": 0.95}'
--trust-remote-code
--tensor-parallel-size 2
--pipeline-parallel-size 2
--enable-prefix-caching
--enable-chunked-prefill
--max-model-len 262144
--gpu-memory-utilization 0.85
--max-num-seqs 64
--enable-auto-tool-choice
--tool-call-parser qwen3_coder
--attention-backend flashinfer
--reasoning-parser nemotron_v3
--mamba-ssm-cache-dtype float16

Indeed, great quant, thanks cyankiwi! On 8x RTX3090 I've got it (eventually!!!) working with:

export OMP_NUM_THREADS=6
export VLLM_SKIP_P2P_CHECK=1
export NCCL_P2P_LEVEL=NVL
export PYTHONIOENCODING=utf-8

vllm serve cyankiwi/NVIDIA-Nemotron-3-Super-120B-A12B-AWQ-4bit --served-model-name "cyankiwi/NVIDIA-Nemotron-3-Super-120B-A12B-AWQ-4bit"
--tensor-parallel-size 2
--pipeline-parallel-size 4
--max-model-len 262144
--gpu-memory-utilization 0.85
--max-num-seqs 4
--max-num-batched-tokens 4192
--kv-cache-dtype fp8
--enable-expert-parallel
--attention-backend flashinfer
--swap-space 0
--trust-remote-code
--enable-chunked-prefill
--mamba-ssm-cache-dtype float16
--reasoning-parser-plugin cyankiwi/NVIDIA-Nemotron-3-Super-120B-A12B-AWQ-4bit/super_v3_reasoning_parser.py
--reasoning-parser super_v3
--enable-auto-tool-choice
--tool-call-parser qwen3_coder
--host 0.0.0.0
--port 5005
--disable-uvicorn-access-log
--override-generation-config '{"temperature": 1, "top_p": 0.95}'

summerbuild changed discussion title from The RX3090 works very well,ths! to The RTX3090 works very well,ths!

@

CUDA_VISIBLE_DEVICES=0,1,2,3 PYTHONIOENCODING=utf-8 vllm serve cyankiwi/NVIDIA-Nemotron-3-Super-120B-A12B-AWQ-4bit
--host 0.0.0.0
--port 50001
--api-key xxx
--served-model-name my-model
--override-generation-config '{"temperature": 1, "top_p": 0.95}'
--trust-remote-code
--tensor-parallel-size 2
--pipeline-parallel-size 2
--enable-prefix-caching
--enable-chunked-prefill
--max-model-len 262144
--gpu-memory-utilization 0.85
--max-num-seqs 64
--enable-auto-tool-choice
--tool-call-parser qwen3_coder
--attention-backend flashinfer
--reasoning-parser nemotron_v3
--mamba-ssm-cache-dtype float16

Hey Bro which vLLM version do u use?

@
Hey Bro which vLLM version do u use?

Latest

Sign up or log in to comment