How to use from vLLM
Install from pip and serve model
# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "shiny-plan/Nemotron-Cascade-2-30B-A3B-Q4_K_M-GGUF"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "shiny-plan/Nemotron-Cascade-2-30B-A3B-Q4_K_M-GGUF",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
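The same request can be sent from Python. The following is a minimal sketch using only the standard library; it assumes the server started above is listening on localhost:8000 and returns the usual OpenAI-style chat completion body. The helper names `build_request` and `ask` are hypothetical and chosen for illustration.

```python
import json
import urllib.request

# Values copied from the curl example; adjust if your server differs.
BASE_URL = "http://localhost:8000/v1/chat/completions"
MODEL = "shiny-plan/Nemotron-Cascade-2-30B-A3B-Q4_K_M-GGUF"


def build_request(prompt, model=MODEL, url=BASE_URL):
    """Build the same POST request that the curl command sends."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt):
    """Send the request and return the first reply's text.

    Requires a running server; assumes an OpenAI-compatible response shape.
    """
    with urllib.request.urlopen(build_request(prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, `ask("What is the capital of France?")` would return the model's reply as a string.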
Use Docker
docker model run hf.co/shiny-plan/Nemotron-Cascade-2-30B-A3B-Q4_K_M-GGUF:Q4_K_M

Nemotron-Cascade-2-30B-A3B — Q4_K_M GGUF

GGUF quantization of nvidia/Nemotron-Cascade-2-30B-A3B.

  • Architecture: Hybrid Attention + Mamba (SSM) + MoE — 30B total parameters, 3B active
  • Quantization: Q4_K_M (k-quant, mixed precision, about 4.5 bits per weight)
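As a rough sanity check on what those numbers mean for disk and memory use, the weight storage can be estimated as parameters times bits per weight divided by 8. The sketch below uses the 30B parameter count and the approximate 4.5 bits per weight quoted above; the function name is hypothetical, and the result ignores metadata and any tensors kept at higher precision, so the real file size will differ somewhat.

```python
def estimated_size_gb(n_params, bits_per_weight):
    """Rough weight storage in gigabytes (10**9 bytes), ignoring file overhead."""
    return n_params * bits_per_weight / 8 / 1e9


total_params = 30e9  # total parameters, as stated in the list above

print(round(estimated_size_gb(total_params, 4.5), 1))  # Q4_K_M estimate: 16.9
print(round(estimated_size_gb(total_params, 16), 1))   # bf16 baseline: 60.0
```

Because only about 3B parameters are active per token, compute per token is much lower than the total size suggests, but all expert weights still need to be stored.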

Quantization commands

# Convert HF model to GGUF (bf16); the model argument is read as a local
# directory, so download the checkpoint to that path first
python llama.cpp/convert_hf_to_gguf.py \
  nvidia/Nemotron-Cascade-2-30B-A3B \
  --outfile Nemotron-Cascade-2-30B-A3B-bf16.gguf \
  --outtype bf16

# Quantize to Q4_K_M
llama-quantize Nemotron-Cascade-2-30B-A3B-bf16.gguf \
  Nemotron-Cascade-2-30B-A3B-Q4_K_M.gguf Q4_K_M
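
The two steps above can also be chained from a small script so that quantization only runs if conversion succeeds. This is a sketch, not the author's tooling: the function names are hypothetical, and it assumes `llama.cpp/convert_hf_to_gguf.py` and `llama-quantize` are reachable exactly as in the commands above.

```python
import subprocess


def conversion_commands(model_dir, stem, quant="Q4_K_M"):
    """Return the convert and quantize commands as argument lists."""
    bf16_file = f"{stem}-bf16.gguf"
    quant_file = f"{stem}-{quant}.gguf"
    convert = [
        "python", "llama.cpp/convert_hf_to_gguf.py", model_dir,
        "--outfile", bf16_file,
        "--outtype", "bf16",
    ]
    quantize = ["llama-quantize", bf16_file, quant_file, quant]
    return [convert, quantize]


def run_pipeline(model_dir, stem, quant="Q4_K_M"):
    """Run both steps in order; check=True stops at the first failure."""
    for cmd in conversion_commands(model_dir, stem, quant):
        subprocess.run(cmd, check=True)
```

Calling `run_pipeline("nvidia/Nemotron-Cascade-2-30B-A3B", "Nemotron-Cascade-2-30B-A3B")` would issue the same two commands shown above.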

Usage

Load in LM Studio, llama.cpp, or any other GGUF-compatible runtime that supports the nemotron_h_moe architecture.
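
Before pointing a runtime at a multi-gigabyte download, it can help to confirm the file is a valid GGUF container. The sketch below reads only the fixed-size header; it assumes a little-endian file in GGUF version 2 or later, where the tensor and metadata counts are 64-bit. The function name `read_gguf_header` is hypothetical.

```python
import struct

GGUF_MAGIC = b"GGUF"
# magic (4 bytes), version (uint32), tensor count (uint64), metadata KV count (uint64)
HEADER = struct.Struct("<4sIQQ")


def read_gguf_header(path):
    """Return version and counts from a GGUF file header, or raise ValueError."""
    with open(path, "rb") as f:
        raw = f.read(HEADER.size)
    if len(raw) < HEADER.size:
        raise ValueError("file is too short to contain a GGUF header")
    magic, version, n_tensors, n_kv = HEADER.unpack(raw)
    if magic != GGUF_MAGIC:
        raise ValueError(f"not a GGUF file (magic={magic!r})")
    return {"version": version, "tensors": n_tensors, "metadata_keys": n_kv}
```

For example, `read_gguf_header("Nemotron-Cascade-2-30B-A3B-Q4_K_M.gguf")` returns a small dict; a truncated or mislabeled download raises `ValueError` instead.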

GGUF metadata: model size 32B params, architecture nemotron_h_moe