metadata
library_name: gguf
base_model: nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
tags:
  - gguf
  - turboquant
  - kv-cache-quantization
  - nemotron
  - nvidia
  - mamba2
  - hybrid
  - moe
  - llama-cpp
  - quantized
license: other
license_name: nvidia-open-model-license
license_link: >-
  https://developer.download.nvidia.com/licenses/nvidia-open-model-license-agreement-june-2024.pdf

Nemotron-3-Nano-30B-A3B-TurboQuant-GGUF-Q4_K_M

A GGUF Q4_K_M weight-quantized variant of nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 with TurboQuant KV cache compression, intended for efficient inference with llama.cpp, Ollama, and LM Studio. The base model uses a hybrid Mamba-2 + Transformer MoE architecture with 30.7B total parameters (3.2B active per token) and supports a context length of up to 1M tokens.

Overview

This model combines two compression techniques:

  • GGUF Q4_K_M weight quantization -- reduces model size from ~60 GB to ~14 GB
  • TurboQuant KV cache compression -- block-diagonal rotations (Clifford algebra) for a 3-bit KV cache, with a reported 5.3x faster prefill (3,822 vs 722 tok/s in the comparison table below)
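
The weight-size figures above follow from simple arithmetic on parameter count and bits per weight. The sketch below uses only the card's approximate numbers; the helper names est_size_gb and ratio are illustrative, and real GGUF files also carry metadata and mixed-precision tensors, so treat the output as a rough estimate.

```shell
#!/bin/sh
# Estimated weight footprint in GB: params (billions) * bits per weight / 8.
est_size_gb() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

# Ratio between two sizes, e.g. original vs quantized.
ratio() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", a / b }'
}

est_size_gb 30.7 16   # BF16 weights at 16 bits each, prints 61.4
ratio 60 14           # the card's ~60 GB vs ~14 GB, prints 4.3
```

The first call matches the "~60 GB" starting point for BF16 weights; the second gives the overall reduction implied by the two quoted sizes.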

Quickstart

llama.cpp

llama-cli -m Nemotron-3-Nano-30B-A3B-TurboQuant-GGUF-Q4_K_M.gguf \
  --cache-type-k q4_0 --cache-type-v q4_0 \
  -p "Explain quantum computing"
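
To get a feel for what the cache-type flags change, the following sketch estimates KV cache memory from bits per element. The layer, head, and context values are hypothetical placeholders, not taken from this model's configuration, and in a hybrid model only the attention layers would be expected to hold a KV cache. The 4.5-bit figure comes from llama.cpp's q4_0 block layout (32 values stored in 18 bytes).

```shell
#!/bin/sh
# Approximate KV cache size in MiB.
# Args: attention layers, KV heads, head dim, context tokens, bits per element.
# The leading 2 accounts for storing both K and V.
kv_cache_mib() {
  awk -v l="$1" -v h="$2" -v d="$3" -v c="$4" -v b="$5" \
    'BEGIN { printf "%.1f\n", 2 * l * h * d * c * b / 8 / 1048576 }'
}

# Hypothetical dimensions, NOT read from the model's config:
# 6 attention layers, 2 KV heads, head dim 128, 8192-token context.
kv_cache_mib 6 2 128 8192 16    # f16 cache, prints 48.0
kv_cache_mib 6 2 128 8192 4.5   # q4_0 cache, prints 13.5
```

Substitute the real attention dimensions from the model's metadata to get a meaningful number; the ratio between the two cache types (about 3.6x) does not depend on those dimensions.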

Ollama

ollama run majentik/Nemotron-3-Nano-30B-A3B-TurboQuant-GGUF-Q4_K_M

LM Studio

Download the GGUF file and load it in LM Studio, then enable TurboQuant KV cache in the advanced settings.
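
Before loading a multi-gigabyte download, a quick sanity check can catch truncated or mislabeled files. GGUF files begin with the four ASCII bytes "GGUF", so a minimal check might look like the sketch below. The helper name is_gguf is illustrative, and the path assumes the file was saved to the current directory under the name used in the llama.cpp example.

```shell
#!/bin/sh
# Return 0 if the file exists and begins with the 4-byte magic "GGUF".
is_gguf() {
  [ -f "$1" ] && [ "$(head -c 4 "$1")" = "GGUF" ]
}

# Assumed local path; adjust to wherever the file was saved.
MODEL="Nemotron-3-Nano-30B-A3B-TurboQuant-GGUF-Q4_K_M.gguf"
if is_gguf "$MODEL"; then
  echo "looks like a GGUF file"
else
  echo "missing or not a GGUF file: $MODEL"
fi
```

This only inspects the header; it does not verify that the rest of the file is complete or uncorrupted.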

Specifications

| Property | Value |
| --- | --- |
| Base Model | nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 |
| Parameters | 30.7B (3.2B active, Mamba-2 + Transformer MoE) |
| Context Length | 1,048,576 tokens (1M) |
| Weight Quantization | GGUF Q4_K_M |
| KV Cache | TurboQuant 3-bit (planar/iso) |
| File Size | ~14 GB |
| License | NVIDIA Open Model License (commercial use OK) |
| Compatible | llama.cpp, Ollama, LM Studio, koboldcpp |

What is TurboQuant?

TurboQuant applies block-diagonal rotations (Clifford algebra) to compress the KV cache. Reported figures when running with llama.cpp's --cache-type-k q4_0 --cache-type-v q4_0 flags (q4_0 is llama.cpp's built-in 4-bit cache type):

| Metric | TurboQuant | TurboQuant |
| --- | --- | --- |
| Prefill Speed | 3,822 tok/s | 722 tok/s |
| Decode Speed | 119 tok/s | 93 tok/s |
| Perplexity | 6.91 | 7.07 |
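
The relative differences between the first and second value columns can be worked out directly from the table. A small sketch, with illustrative helper names (ratio2, pct_diff) and no assumptions beyond the numbers quoted above:

```shell
#!/bin/sh
# Print a/b to two decimals.
ratio2() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

# Print how much larger b is than a, in percent, to two decimals.
pct_diff() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", (b - a) / a * 100 }'
}

ratio2 3822 722      # prefill ratio, prints 5.29
ratio2 119 93        # decode ratio, prints 1.28
pct_diff 6.91 7.07   # perplexity gap in percent, prints 2.32
```

The prefill ratio rounds to the 5.3x figure quoted in the overview; the decode gain is much smaller, at roughly 1.3x.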

GGUF Quant Variants

| Quant | File Size | Quality | Variant |
| --- | --- | --- | --- |
| Q2_K | ~9 GB | Lowest | Q2_K |
| Q3_K_M | ~11 GB | Low | Q3_K_M |
| IQ4_XS | ~13 GB | Medium-Low | IQ4_XS |
| Q4_K_M | ~14 GB | Medium (recommended) | This model |
| Q5_K_M | ~17 GB | Medium-High | Q5_K_M |
| Q8_0 | ~27 GB | High | Q8_0 |
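
When choosing among these variants, one simple rule of thumb is to take the largest quant whose file fits a given size budget. The sketch below encodes the table's approximate sizes; the helper name pick_quant is illustrative, the budget must be a whole number of GB, and actual memory use will be higher than file size once the KV cache and runtime overhead are included.

```shell
#!/bin/sh
# Print the largest listed quant whose approximate file size (GB)
# fits within the given integer budget, or "none" if nothing fits.
pick_quant() {
  budget="$1"
  best="none"
  # name:size pairs, smallest to largest, copied from the table above.
  for entry in Q2_K:9 Q3_K_M:11 IQ4_XS:13 Q4_K_M:14 Q5_K_M:17 Q8_0:27; do
    name="${entry%%:*}"
    size="${entry##*:}"
    if [ "$size" -le "$budget" ]; then
      best="$name"
    fi
  done
  echo "$best"
}

pick_quant 16   # prints Q4_K_M
pick_quant 8    # prints none
```

Leaving a few GB of headroom below the available memory is a sensible adjustment to the budget passed in.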

See Also