Spaces:
Running on Zero
Running on Zero
File size: 6,359 Bytes
e9fc2fc 2f663bd e18b02f 2f663bd e9fc2fc 2f663bd e18b02f 2f663bd e9fc2fc 2f663bd e9fc2fc 2f663bd e9fc2fc 2f663bd e9fc2fc | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 | # NVIDIA Nemotron 3 Nano 4B GGUF Setup Guide
## Overview
The game generation system uses **NVIDIA Nemotron 3 Nano 4B** in GGUF quantized format, optimized for inference via **llama.cpp**.
### Why this configuration?
1. **Hackathon bonus** β llama.cpp runtime gives extra credit
2. **Memory efficient** β GGUF 4-bit quantization reduces model size to ~2.5GB
3. **Performance** β llama.cpp provides fast CPU and GPU inference
4. **Quality** β Nemotron 3 Nano 4B is NVIDIA's optimized chat model
5. **Sponsor visibility** β NVIDIA + llama.cpp integration
## Installation
### Step 1: Install llama-cpp-python
**For CPU inference:**
```bash
pip install llama-cpp-python
```
**For GPU inference (CUDA 11.8+):**
```bash
CMAKE_ARGS="-DLLAMA_CUDA=on" pip install llama-cpp-python
```
**For GPU (Metal on macOS):**
```bash
CMAKE_ARGS="-DLLAMA_METAL=on" pip install llama-cpp-python
```
### Step 2: Verify installation
```bash
python -c "from llama_cpp import Llama; print('β llama-cpp-python installed')"
```
## Model Details
- **Model ID:** `nvidia/NVIDIA-Nemotron-3-Nano-4B-GGUF`
- **Repository:** https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-4B-GGUF
- **File:** `model.gguf`
- **Size:** ~2.5GB (4-bit quantization)
- **Context:** 2048 tokens
- **Format:** GGUF (compatible with llama.cpp)
## How it works
1. **First run** β llama-cpp-python downloads the GGUF model from HuggingFace (~2.5GB)
2. **Model caching** β Subsequent runs use the cached model (no re-download)
3. **GPU acceleration** β If CUDA/Metal is available, llama.cpp uses GPU for faster inference
4. **Fallback** β If llama-cpp-python unavailable, system uses mock generation
## Usage in code
```python
from app.services.generator import generate_game
from app.services.retrieval import retrieve_examples
# Generate a game
config = {
"game_type": "scavenger_hunt",
"city": "Paris",
"area": "Le Marais",
"duration_minutes": 60,
"num_players": 4,
"difficulty": "medium",
"age_group": "adults",
"location_type": "mixed"
}
# Retrieve similar games for grounding
retrieved = retrieve_examples(config, dataset, k=3)
# Generate game (uses llama.cpp if available, mock fallback otherwise)
game = generate_game(config, retrieved)
```
## Performance expectations
| Setup | First Run | Subsequent Runs | Speed |
|-------|-----------|-----------------|-------|
| CPU (no optimization) | 5-10 min | 2-5 min per game | Slow |
| CPU (quantized) | 5-10 min | 30-60s per game | Moderate |
| GPU (CUDA/Metal) | 5-10 min | 5-15s per game | Fast |
## Troubleshooting
### Issue: "llama-cpp-python not found"
**Solution:** Install with `pip install llama-cpp-python` or with GPU support.
### Issue: CUDA compatibility errors
**Solution:** Check CUDA version compatibility:
```bash
# Check CUDA version
nvidia-smi
# Install specific CUDA-compatible version
CMAKE_ARGS="-DLLAMA_CUDA=on -DLLAMA_CUDA_DATALAYOUT=row -DCUDAToolkit_INCLUDE_DIR=/path/to/cuda/include" \
pip install llama-cpp-python
```
### Issue: Model download stuck
**Solution:** Download manually from HuggingFace and place in `~/.cache/huggingface/hub/`
### Issue: Out of memory
**Solution:** Reduce context window or use CPU-offloading in llama.cpp settings.
## Performance expectations
| Setup | First Run | Subsequent Runs | Speed |
|-------|-----------|-----------------|-------|
| CPU (no optimization) | 5-10 min | 2-5 min per game | Slow |
| CPU (quantized) | 5-10 min | 30-60s per game | Moderate |
| GPU (CUDA/Metal) | 5-10 min | 5-15s per game | Fast |
| HF Zero GPU (auto) | 5-10 min | 5-15s per game | Fast |
## Integration with Hugging Face Spaces (Zero GPU)
The app is fully configured for **Hugging Face Spaces** with **Zero GPU** support.
### What is Zero GPU?
Hugging Face Spaces **Zero GPU** (paid tier) provides on-demand GPU allocation:
- GPU is allocated only when a `@spaces.GPU`-decorated function runs
- GPU is **released** after the function completes (saves cost)
- Without the decorator, code runs on CPU
### How it works in this app
1. `app.py` imports `spaces` gracefully (no error if missing)
2. The `_generate_with_gpu()` function is wrapped with `@spaces.GPU` only at runtime on HF Spaces
3. Inside that function, `torch.cuda.is_available()` returns `True`, so `generator.py` auto-detects GPU via `_get_n_gpu_layers()` and sets `n_gpu_layers=-1`
4. On CPU (local dev or free Spaces tier), it falls back to `n_gpu_layers=0`
### Requirements
Add to `requirements.txt`:
```
llama-cpp-python
spaces
```
> **Note:** The `spaces` package is only available on the Hugging Face Spaces runtime. Local imports use `try/except ImportError` to handle this gracefully.
### File structure for HF Spaces
```
app/
main.py β HF Spaces entry point (launches Gradio)
services/
generator.py β Auto-detects GPU via torch.cuda
...
requirements.txt
```
On Hugging Face Spaces, the app runs `app.py` automatically.
### GPU auto-detection logic
In `app/services/generator.py`:
```python
def _get_n_gpu_layers() -> int:
try:
import torch
if torch.cuda.is_available():
return -1 # All layers on GPU
except ImportError:
pass
return 0 # CPU only
```
This works because `@spaces.GPU` makes `torch.cuda.is_available()` return `True` inside the decorated function.
### Deployment steps
1. Push the repo to Hugging Face Spaces
2. Set **Space SDK** to **Gradio**
3. Set **Space hardware** to **Zero GPU** (paid) for GPU acceleration, or leave as CPU (free)
4. The app auto-detects and uses GPU/CPU accordingly
## Hackathon Integration
This setup satisfies:
- β **Extra credit:** llama.cpp runtime
- β **Sponsor integration:** NVIDIA Nemotron model
- β **Visible pipeline:** Model usage shown in logs and prompts
- β **Quality:** Small model optimized for this task
## Code reference
See [app/services/generator.py](app/services/generator.py) for:
- `generate_game_with_model()` β llama.cpp integration
- `NEMOTRON_MODEL_ID` β Model configuration
- Model caching and initialization
## Testing
Run the test suite:
```bash
python test_generation_gguf.py
```
Expected output:
- β Tests pass with mock (if llama-cpp-python not installed)
- β Tests pass with actual model (if llama-cpp-python installed)
- β All generated games validate against schema
|