# NVIDIA Nemotron 3 Nano 4B GGUF Setup Guide

## Overview

The game generation system uses **NVIDIA Nemotron 3 Nano 4B** in GGUF quantized format, optimized for inference via **llama.cpp**.

### Why this configuration?

1. **Hackathon bonus** — llama.cpp runtime gives extra credit
2. **Memory efficient** — GGUF 4-bit quantization reduces model size to ~2.5GB
3. **Performance** — llama.cpp provides fast CPU and GPU inference
4. **Quality** — Nemotron 3 Nano 4B is NVIDIA's optimized chat model
5. **Sponsor visibility** — NVIDIA + llama.cpp integration

## Installation

### Step 1: Install llama-cpp-python

**For CPU inference:**
```bash
pip install llama-cpp-python
```

**For GPU inference (CUDA 11.8+):**
```bash
CMAKE_ARGS="-DLLAMA_CUDA=on" pip install llama-cpp-python
```

**For GPU (Metal on macOS):**
```bash
CMAKE_ARGS="-DLLAMA_METAL=on" pip install llama-cpp-python
```

### Step 2: Verify installation

```bash
python -c "from llama_cpp import Llama; print('✓ llama-cpp-python installed')"
```

## Model Details

- **Model ID:** `nvidia/NVIDIA-Nemotron-3-Nano-4B-GGUF`
- **Repository:** https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-4B-GGUF
- **File:** `model.gguf`
- **Size:** ~2.5GB (4-bit quantization)
- **Context:** 2048 tokens
- **Format:** GGUF (compatible with llama.cpp)

## How it works

1. **First run** — llama-cpp-python downloads the GGUF model from HuggingFace (~2.5GB)
2. **Model caching** — Subsequent runs use the cached model (no re-download)
3. **GPU acceleration** — If CUDA/Metal is available, llama.cpp uses GPU for faster inference
4. **Fallback** — If llama-cpp-python unavailable, system uses mock generation

## Usage in code

```python
from app.services.generator import generate_game
from app.services.retrieval import retrieve_examples

# Generate a game
config = {
    "game_type": "scavenger_hunt",
    "city": "Paris",
    "area": "Le Marais",
    "duration_minutes": 60,
    "num_players": 4,
    "difficulty": "medium",
    "age_group": "adults",
    "location_type": "mixed"
}

# Retrieve similar games for grounding
retrieved = retrieve_examples(config, dataset, k=3)

# Generate game (uses llama.cpp if available, mock fallback otherwise)
game = generate_game(config, retrieved)
```

## Performance expectations

| Setup | First Run | Subsequent Runs | Speed |
|-------|-----------|-----------------|-------|
| CPU (no optimization) | 5-10 min | 2-5 min per game | Slow |
| CPU (quantized) | 5-10 min | 30-60s per game | Moderate |
| GPU (CUDA/Metal) | 5-10 min | 5-15s per game | Fast |

## Troubleshooting

### Issue: "llama-cpp-python not found"
**Solution:** Install with `pip install llama-cpp-python` or with GPU support.

### Issue: CUDA compatibility errors
**Solution:** Check CUDA version compatibility:
```bash
# Check CUDA version
nvidia-smi

# Install specific CUDA-compatible version
CMAKE_ARGS="-DLLAMA_CUDA=on -DLLAMA_CUDA_DATALAYOUT=row -DCUDAToolkit_INCLUDE_DIR=/path/to/cuda/include" \
  pip install llama-cpp-python
```

### Issue: Model download stuck
**Solution:** Download manually from HuggingFace and place in `~/.cache/huggingface/hub/`

### Issue: Out of memory
**Solution:** Reduce context window or use CPU-offloading in llama.cpp settings.

## Performance expectations

| Setup | First Run | Subsequent Runs | Speed |
|-------|-----------|-----------------|-------|
| CPU (no optimization) | 5-10 min | 2-5 min per game | Slow |
| CPU (quantized) | 5-10 min | 30-60s per game | Moderate |
| GPU (CUDA/Metal) | 5-10 min | 5-15s per game | Fast |
| HF Zero GPU (auto) | 5-10 min | 5-15s per game | Fast |

## Integration with Hugging Face Spaces (Zero GPU)

The app is fully configured for **Hugging Face Spaces** with **Zero GPU** support.

### What is Zero GPU?

Hugging Face Spaces **Zero GPU** (paid tier) provides on-demand GPU allocation:
- GPU is allocated only when a `@spaces.GPU`-decorated function runs
- GPU is **released** after the function completes (saves cost)
- Without the decorator, code runs on CPU

### How it works in this app

1. `app.py` imports `spaces` gracefully (no error if missing)
2. The `_generate_with_gpu()` function is wrapped with `@spaces.GPU` only at runtime on HF Spaces
3. Inside that function, `torch.cuda.is_available()` returns `True`, so `generator.py` auto-detects GPU via `_get_n_gpu_layers()` and sets `n_gpu_layers=-1`
4. On CPU (local dev or free Spaces tier), it falls back to `n_gpu_layers=0`

### Requirements

Add to `requirements.txt`:
```
llama-cpp-python
spaces
```

> **Note:** The `spaces` package is only available on the Hugging Face Spaces runtime. Local imports use `try/except ImportError` to handle this gracefully.

### File structure for HF Spaces

```
app/
  main.py           ← HF Spaces entry point (launches Gradio)
  services/
    generator.py    ← Auto-detects GPU via torch.cuda
    ...
requirements.txt
```

On Hugging Face Spaces, the app runs `app.py` automatically.

### GPU auto-detection logic

In `app/services/generator.py`:

```python
def _get_n_gpu_layers() -> int:
    try:
        import torch
        if torch.cuda.is_available():
            return -1  # All layers on GPU
    except ImportError:
        pass
    return 0  # CPU only
```

This works because `@spaces.GPU` makes `torch.cuda.is_available()` return `True` inside the decorated function.

### Deployment steps

1. Push the repo to Hugging Face Spaces
2. Set **Space SDK** to **Gradio**
3. Set **Space hardware** to **Zero GPU** (paid) for GPU acceleration, or leave as CPU (free)
4. The app auto-detects and uses GPU/CPU accordingly

## Hackathon Integration

This setup satisfies:
- ✓ **Extra credit:** llama.cpp runtime
- ✓ **Sponsor integration:** NVIDIA Nemotron model
- ✓ **Visible pipeline:** Model usage shown in logs and prompts
- ✓ **Quality:** Small model optimized for this task

## Code reference

See [app/services/generator.py](app/services/generator.py) for:
- `generate_game_with_model()` — llama.cpp integration
- `NEMOTRON_MODEL_ID` — Model configuration
- Model caching and initialization

## Testing

Run the test suite:
```bash
python test_generation_gguf.py
```

Expected output:
- ✓ Tests pass with mock (if llama-cpp-python not installed)
- ✓ Tests pass with actual model (if llama-cpp-python installed)
- ✓ All generated games validate against schema