File size: 6,359 Bytes
e9fc2fc
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
2f663bd
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
e18b02f
2f663bd
 
 
 
 
e9fc2fc
2f663bd
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
e18b02f
2f663bd
 
 
 
 
 
 
 
 
 
 
 
 
 
 
e9fc2fc
2f663bd
e9fc2fc
2f663bd
e9fc2fc
2f663bd
 
 
 
e9fc2fc
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
# NVIDIA Nemotron 3 Nano 4B GGUF Setup Guide

## Overview

The game generation system uses **NVIDIA Nemotron 3 Nano 4B** in GGUF quantized format, optimized for inference via **llama.cpp**.

### Why this configuration?

1. **Hackathon bonus** β€” llama.cpp runtime gives extra credit
2. **Memory efficient** β€” GGUF 4-bit quantization reduces model size to ~2.5GB
3. **Performance** β€” llama.cpp provides fast CPU and GPU inference
4. **Quality** β€” Nemotron 3 Nano 4B is NVIDIA's optimized chat model
5. **Sponsor visibility** β€” NVIDIA + llama.cpp integration

## Installation

### Step 1: Install llama-cpp-python

**For CPU inference:**
```bash
pip install llama-cpp-python
```

**For GPU inference (CUDA 11.8+):**
```bash
CMAKE_ARGS="-DLLAMA_CUDA=on" pip install llama-cpp-python
```

**For GPU (Metal on macOS):**
```bash
CMAKE_ARGS="-DLLAMA_METAL=on" pip install llama-cpp-python
```

### Step 2: Verify installation

```bash
python -c "from llama_cpp import Llama; print('βœ“ llama-cpp-python installed')"
```

## Model Details

- **Model ID:** `nvidia/NVIDIA-Nemotron-3-Nano-4B-GGUF`
- **Repository:** https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-4B-GGUF
- **File:** `model.gguf`
- **Size:** ~2.5GB (4-bit quantization)
- **Context:** 2048 tokens
- **Format:** GGUF (compatible with llama.cpp)

## How it works

1. **First run** β€” llama-cpp-python downloads the GGUF model from HuggingFace (~2.5GB)
2. **Model caching** β€” Subsequent runs use the cached model (no re-download)
3. **GPU acceleration** β€” If CUDA/Metal is available, llama.cpp uses GPU for faster inference
4. **Fallback** β€” If llama-cpp-python unavailable, system uses mock generation

## Usage in code

```python
from app.services.generator import generate_game
from app.services.retrieval import retrieve_examples

# Generate a game
config = {
    "game_type": "scavenger_hunt",
    "city": "Paris",
    "area": "Le Marais",
    "duration_minutes": 60,
    "num_players": 4,
    "difficulty": "medium",
    "age_group": "adults",
    "location_type": "mixed"
}

# Retrieve similar games for grounding
retrieved = retrieve_examples(config, dataset, k=3)

# Generate game (uses llama.cpp if available, mock fallback otherwise)
game = generate_game(config, retrieved)
```

## Performance expectations

| Setup | First Run | Subsequent Runs | Speed |
|-------|-----------|-----------------|-------|
| CPU (no optimization) | 5-10 min | 2-5 min per game | Slow |
| CPU (quantized) | 5-10 min | 30-60s per game | Moderate |
| GPU (CUDA/Metal) | 5-10 min | 5-15s per game | Fast |

## Troubleshooting

### Issue: "llama-cpp-python not found"
**Solution:** Install with `pip install llama-cpp-python` or with GPU support.

### Issue: CUDA compatibility errors
**Solution:** Check CUDA version compatibility:
```bash
# Check CUDA version
nvidia-smi

# Install specific CUDA-compatible version
CMAKE_ARGS="-DLLAMA_CUDA=on -DLLAMA_CUDA_DATALAYOUT=row -DCUDAToolkit_INCLUDE_DIR=/path/to/cuda/include" \
  pip install llama-cpp-python
```

### Issue: Model download stuck
**Solution:** Download manually from HuggingFace and place in `~/.cache/huggingface/hub/`

### Issue: Out of memory
**Solution:** Reduce context window or use CPU-offloading in llama.cpp settings.

## Performance expectations

| Setup | First Run | Subsequent Runs | Speed |
|-------|-----------|-----------------|-------|
| CPU (no optimization) | 5-10 min | 2-5 min per game | Slow |
| CPU (quantized) | 5-10 min | 30-60s per game | Moderate |
| GPU (CUDA/Metal) | 5-10 min | 5-15s per game | Fast |
| HF Zero GPU (auto) | 5-10 min | 5-15s per game | Fast |

## Integration with Hugging Face Spaces (Zero GPU)

The app is fully configured for **Hugging Face Spaces** with **Zero GPU** support.

### What is Zero GPU?

Hugging Face Spaces **Zero GPU** (paid tier) provides on-demand GPU allocation:
- GPU is allocated only when a `@spaces.GPU`-decorated function runs
- GPU is **released** after the function completes (saves cost)
- Without the decorator, code runs on CPU

### How it works in this app

1. `app.py` imports `spaces` gracefully (no error if missing)
2. The `_generate_with_gpu()` function is wrapped with `@spaces.GPU` only at runtime on HF Spaces
3. Inside that function, `torch.cuda.is_available()` returns `True`, so `generator.py` auto-detects GPU via `_get_n_gpu_layers()` and sets `n_gpu_layers=-1`
4. On CPU (local dev or free Spaces tier), it falls back to `n_gpu_layers=0`

### Requirements

Add to `requirements.txt`:
```
llama-cpp-python
spaces
```

> **Note:** The `spaces` package is only available on the Hugging Face Spaces runtime. Local imports use `try/except ImportError` to handle this gracefully.

### File structure for HF Spaces

```
app/
  main.py           ← HF Spaces entry point (launches Gradio)
  services/
    generator.py    ← Auto-detects GPU via torch.cuda
    ...
requirements.txt
```

On Hugging Face Spaces, the app runs `app.py` automatically.

### GPU auto-detection logic

In `app/services/generator.py`:

```python
def _get_n_gpu_layers() -> int:
    try:
        import torch
        if torch.cuda.is_available():
            return -1  # All layers on GPU
    except ImportError:
        pass
    return 0  # CPU only
```

This works because `@spaces.GPU` makes `torch.cuda.is_available()` return `True` inside the decorated function.

### Deployment steps

1. Push the repo to Hugging Face Spaces
2. Set **Space SDK** to **Gradio**
3. Set **Space hardware** to **Zero GPU** (paid) for GPU acceleration, or leave as CPU (free)
4. The app auto-detects and uses GPU/CPU accordingly

## Hackathon Integration

This setup satisfies:
- βœ“ **Extra credit:** llama.cpp runtime
- βœ“ **Sponsor integration:** NVIDIA Nemotron model
- βœ“ **Visible pipeline:** Model usage shown in logs and prompts
- βœ“ **Quality:** Small model optimized for this task

## Code reference

See [app/services/generator.py](app/services/generator.py) for:
- `generate_game_with_model()` β€” llama.cpp integration
- `NEMOTRON_MODEL_ID` β€” Model configuration
- Model caching and initialization

## Testing

Run the test suite:
```bash
python test_generation_gguf.py
```

Expected output:
- βœ“ Tests pass with mock (if llama-cpp-python not installed)
- βœ“ Tests pass with actual model (if llama-cpp-python installed)
- βœ“ All generated games validate against schema