majentik committed on
Commit e4404dd · verified · 1 Parent(s): a6a20d3

chore(card): add hardware compatibility section

  1. README.md +18 -12
README.md CHANGED
@@ -4,20 +4,18 @@ license_name: nvidia-open-model-license
  license_link: https://developer.download.nvidia.com/licenses/nvidia-open-model-license-agreement-june-2024.pdf
  base_model: nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
  tags:
- - gguf
- - turboquant
- - kv-cache-quantization
- - nemotron
- - nvidia
- - mamba2
- - hybrid
- - moe
- - llama-cpp
- - quantized
+ - gguf
+ - turboquant
+ - kv-cache-quantization
+ - nemotron
+ - nvidia
+ - mamba2
+ - hybrid
+ - moe
+ - llama-cpp
+ - quantized
  library_name: gguf
  pipeline_tag: text-generation
- language:
- - en
  ---

  # Nemotron-3-Nano-30B-A3B-TurboQuant-GGUF-Q4_K_M
@@ -28,6 +26,14 @@ GGUF Q4_K_M weight-quantized variant of [nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-B
  > They require a [specific llama.cpp fork](https://github.com/johndpope/llama-cpp-turboquant/tree/feature/planarquant-kv-cache).
  > The GGUF file itself is a standard GGUF and works with any llama.cpp-compatible runtime using normal KV cache types (f16, q8_0, q4_0, etc.).

+ ## Hardware compatibility
+
+ | Device | VRAM / RAM | Recommendation |
+ | --- | --- | --- |
+ | CPU host with ≥18 GB RAM | ~17.8 GB | works via llama.cpp; slower than GPU but no accelerator required |
+ | Apple Silicon (Metal) | ~19.4 GB | llama.cpp Metal backend; fast on M-series unified memory |
+ | NVIDIA GPU (partial offload) | split between GPU + RAM | offload as many layers as VRAM allows; rest on CPU |
+
  ## Overview

  This model combines two independent compression techniques: