File size: 7,711 Bytes
bfa9043
32b6d20
 
 
bfa9043
 
e4404dd
 
 
 
 
 
 
 
 
 
32b6d20
 
bfa9043
 
 
 
32b6d20
 
 
 
 
bfa9043
e4404dd
 
 
 
 
 
 
 
bfa9043
 
32b6d20
 
 
 
 
 
bfa9043
 
 
32b6d20
 
 
 
bfa9043
32b6d20
 
 
 
 
 
 
 
 
 
 
 
 
 
bfa9043
32b6d20
 
 
 
 
bfa9043
 
32b6d20
 
 
 
 
bfa9043
32b6d20
 
 
 
bfa9043
 
32b6d20
 
 
 
 
 
 
 
 
 
 
 
 
bfa9043
 
 
 
 
32b6d20
 
 
 
 
 
 
 
 
 
 
 
bfa9043
 
 
32b6d20
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
bfa9043
32b6d20
bfa9043
32b6d20
bfa9043
32b6d20
 
 
 
 
bfa9043
 
 
32b6d20
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
---
license: other
license_name: nvidia-open-model-license
license_link: https://developer.download.nvidia.com/licenses/nvidia-open-model-license-agreement-june-2024.pdf
base_model: nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
tags:
  - gguf
  - turboquant
  - kv-cache-quantization
  - nemotron
  - nvidia
  - mamba2
  - hybrid
  - moe
  - llama-cpp
  - quantized
library_name: gguf
pipeline_tag: text-generation
---

# Nemotron-3-Nano-30B-A3B-TurboQuant-GGUF-Q4_K_M

GGUF Q4_K_M weight-quantized variant of [nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16) optimised for use with **TurboQuant** KV cache compression via a dedicated llama.cpp fork.

> **Important:** TurboQuant KV cache types (`planar3`, `iso3`) are **not** available in upstream llama.cpp, standard Ollama, or LM Studio.
> They require a [specific llama.cpp fork](https://github.com/johndpope/llama-cpp-turboquant/tree/feature/planarquant-kv-cache).
> The GGUF file itself is a standard GGUF and works with any llama.cpp-compatible runtime using normal KV cache types (f16, q8_0, q4_0, etc.).

## Hardware compatibility

| Device | VRAM / RAM | Recommendation |
| --- | --- | --- |
| CPU host with β‰₯18 GB RAM | ~17.8 GB | works via llama.cpp; slower than GPU but no accelerator required |
| Apple Silicon (Metal) | ~19.4 GB | llama.cpp Metal backend; fast on M-series unified memory |
| NVIDIA GPU (partial offload) | split between GPU + RAM | offload as many layers as VRAM allows; rest on CPU |

## Overview

This model combines two independent compression techniques:

| Technique | What it does | Requirement |
|-----------|-------------|-------------|
| **GGUF Q4_K_M weight quantization** | Reduces model size from ~60 GB (BF16) to ~16.2 GB | Any llama.cpp-compatible runtime |
| **TurboQuant KV cache compression** β€” random rotation + Lloyd-Max scalar quantization (`--cache-type-k planar3 --cache-type-v planar3`) | Block-diagonal rotations / random rotation for compressed KV cache | [llama-cpp-turboquant fork](https://github.com/johndpope/llama-cpp-turboquant/tree/feature/planarquant-kv-cache) only |

## Quickstart

### Option A β€” With TurboQuant KV cache (fork required)

You must build from the TurboQuant-enabled llama.cpp fork:

```bash
# Clone and build the fork
git clone https://github.com/johndpope/llama-cpp-turboquant.git
cd llama-cpp-turboquant && git checkout feature/planarquant-kv-cache

# CUDA (Windows/Linux)
cmake -B build -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build -j

# Metal (Apple Silicon)
cmake -B build -DGGML_METAL=ON -DGGML_METAL_EMBED_LIBRARY=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build -j

# Run with TurboQuant KV cache
./build/bin/llama-cli -m Nemotron-3-Nano-30B-A3B-TurboQuant-GGUF-Q4_K_M.gguf \
  --cache-type-k planar3 --cache-type-v planar3 \
  -ngl 99 -fa \
  -p "Explain quantum computing"

# Or run as a server
./build/bin/llama-server -m Nemotron-3-Nano-30B-A3B-TurboQuant-GGUF-Q4_K_M.gguf \
  --cache-type-k planar3 --cache-type-v planar3 \
  -ngl 99 -fa --jinja
```

### Option B β€” With standard llama.cpp / LM Studio / Ollama

The GGUF works as a normal quantised model. You won't get TurboQuant-specific KV cache benefits, but standard KV cache quantization (q8_0, q4_0) still reduces VRAM significantly.

**llama.cpp (upstream)**
```bash
llama-cli -m Nemotron-3-Nano-30B-A3B-TurboQuant-GGUF-Q4_K_M.gguf \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  -ngl 99 -fa \
  -p "Explain quantum computing"
```

**LM Studio**
1. Download the GGUF file and load in LM Studio.
2. Enable **Developer Mode** (Settings β†’ Developer).
3. In the model loader's advanced settings, set **Flash Attention** to ON.
4. Set **K Cache Quantization** and **V Cache Quantization** to `q8_0` (or `q4_0` for more aggressive VRAM savings).
5. Note: LM Studio does not currently support TurboQuant's `planar3` cache types. Track [this feature request](https://github.com/lmstudio-ai/lmstudio-bug-tracker/issues/1719) for updates.

**Ollama**
```bash
# Standard Ollama does not support TurboQuant cache types.
# Use with default or q8_0 KV cache via OLLAMA_KV_CACHE_TYPE=q8_0
OLLAMA_KV_CACHE_TYPE=q8_0 OLLAMA_FLASH_ATTENTION=1 ollama run majentik/Nemotron-3-Nano-30B-A3B-TurboQuant-GGUF-Q4_K_M
```

## Specifications

| Property | Value |
|----------|-------|
| Base Model | [nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16) |
| Architecture | Mamba-2 + Transformer hybrid Sparse MoE |
| Parameters | 30.7B total, 3.2B active per token |
| Context Length | 1M |
| Weight Quantization | GGUF Q4_K_M (popular 4-bit, best quality/size tradeoff) |
| Original Size (BF16) | ~60 GB |
| Quantized File Size | ~16.2 GB |
| KV Cache (TurboQuant) | 3-bit via `--cache-type-k planar3 --cache-type-v planar3` (fork only) |
| KV Cache (standard) | q8_0, q4_0, f16, etc. (any llama.cpp runtime) |
| License | other |
| Modalities | Text only |
| Compatible Runtimes | llama.cpp, LM Studio, Ollama, koboldcpp |

## What is TurboQuant?

[TurboQuant](https://arxiv.org/abs/2504.19874) (ICLR 2026) is a KV cache compression method that applies a random orthogonal rotation followed by optimal scalar quantization. Bit-identical prefill logits at 4-bit on tested models, with up to 4-8Γ— memory savings for long sequences.

**Benchmarks from the TurboQuant repository** (Llama 3.1 8B, RTX 5090 β€” results will vary by model and hardware):

| Metric | TurboQuant (4-bit) | Standard q4_0 |
|--------|--------------------|---------------|
| Quality | Bit-identical prefill | Lossy |
| KV Compression | ~4Γ— vs FP16 | ~4Γ— vs FP16 |
| Speedup (Apple Silicon) | 1.4–1.7Γ— | β€” |

> **Note:** These benchmarks are from the TurboQuant repository using Llama 3.1 8B on an RTX 5090. Performance on Nemotron-3-Nano-30B-A3B will differ. Independent benchmarks for this specific model are welcome β€” please open a discussion if you have results to share.

## Current Status of TurboQuant in the Ecosystem

| Runtime | TurboQuant Support | Standard KV Quant |
|---------|---------------------|-------------------|
| llama.cpp (upstream) | ❌ Not merged | βœ… q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1 |
| llama-cpp-turboquant fork | βœ… planar3 | βœ… All standard types |
| LM Studio | ❌ [Requested](https://github.com/lmstudio-ai/lmstudio-bug-tracker/issues/1719) | βœ… Via advanced settings |
| Ollama | ❌ Not supported | βœ… Via OLLAMA_KV_CACHE_TYPE |
| koboldcpp | ❌ Not supported | βœ… Standard types |

## Recommended Settings

For VRAM-constrained setups, standard q8_0 KV cache quantization already halves KV cache memory with negligible quality impact. Flash Attention should always be enabled β€” it is required for V cache quantization and improves memory efficiency regardless.

| VRAM | Suggested Configuration |
|------|------------------------|
| 24 GB (RTX 4090) | Q4_K_M + q8_0 KV cache + Flash Attention, 8K–16K context |
| 16 GB | Q4_K_M + q4_0 KV cache + Flash Attention, 4K–8K context |
| 48+ GB | Q4_K_M + f16 KV cache, full 32K+ context |

## See Also

- [RotorQuant GitHub](https://github.com/scrya-com/rotorquant)
- [llama-cpp-turboquant fork](https://github.com/johndpope/llama-cpp-turboquant/tree/feature/planarquant-kv-cache)
- [TurboQuant llama.cpp discussion](https://github.com/ggml-org/llama.cpp/discussions/20969)
- [TurboQuant paper (arXiv: 2504.19874)](https://arxiv.org/abs/2504.19874)
- [Base model: nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16)
- [Nemotron-3-Nano-30B-A3B announcement](https://huggingface.co/blog/nvidia/nemotron-3-nano-efficient-open-intelligent-models)