# NVIDIA Nemotron 3 Nano 4B GGUF Setup Guide

## Overview

The game generation system uses **NVIDIA Nemotron 3 Nano 4B** in GGUF quantized format, optimized for inference via **llama.cpp**.

### Why this configuration?

1. **Hackathon bonus:** using the llama.cpp runtime earns extra credit
2. **Memory efficient:** GGUF 4-bit quantization reduces the model size to ~2.5GB
3. **Performance:** llama.cpp provides fast CPU and GPU inference
4. **Quality:** Nemotron 3 Nano 4B is NVIDIA's optimized chat model
5. **Sponsor visibility:** NVIDIA + llama.cpp integration

## Installation

### Step 1: Install llama-cpp-python

**For CPU inference:**

```bash
pip install llama-cpp-python
```

**For GPU inference (CUDA 11.8+):**

```bash
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python
```

**For GPU (Metal on macOS):**

```bash
CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python
```

Older llama-cpp-python releases used `-DLLAMA_CUDA=on` and `-DLLAMA_METAL=on` instead of the `GGML_` flags. Use the flag that matches your installed version.

### Step 2: Verify installation

```bash
python -c "from llama_cpp import Llama; print('✓ llama-cpp-python installed')"
```

## Model Details

- **Model ID:** `nvidia/NVIDIA-Nemotron-3-Nano-4B-GGUF`
- **Repository:** https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-4B-GGUF
- **File:** `model.gguf`
- **Size:** ~2.5GB (4-bit quantization)
- **Context:** 2048 tokens
- **Format:** GGUF (compatible with llama.cpp)

## How it works

1. **First run:** llama-cpp-python downloads the GGUF model from Hugging Face (~2.5GB)
2. **Model caching:** subsequent runs use the cached model (no re-download)
3. **GPU acceleration:** if CUDA or Metal is available, llama.cpp uses the GPU for faster inference
4. **Fallback:** if llama-cpp-python is unavailable, the system uses mock generation

## Usage in code

```python
from app.services.generator import generate_game
from app.services.retrieval import retrieve_examples

# Describe the game to generate
config = {
    "game_type": "scavenger_hunt",
    "city": "Paris",
    "area": "Le Marais",
    "duration_minutes": 60,
    "num_players": 4,
    "difficulty": "medium",
    "age_group": "adults",
    "location_type": "mixed",
}

# Retrieve similar games for grounding
# (`dataset` is the collection of example games, loaded elsewhere in the app)
retrieved = retrieve_examples(config, dataset, k=3)

# Generate the game (uses llama.cpp if available, mock fallback otherwise)
game = generate_game(config, retrieved)
```

## Troubleshooting

### Issue: "llama-cpp-python not found"

**Solution:** Install it with `pip install llama-cpp-python`, or use one of the GPU install commands from Step 1.

### Issue: CUDA compatibility errors

**Solution:** Check that the installed CUDA toolkit matches what your driver supports, then rebuild:

```bash
# Show the driver version and the highest CUDA version it supports
nvidia-smi

# Rebuild llama-cpp-python against a specific CUDA toolkit
CMAKE_ARGS="-DGGML_CUDA=on -DCUDAToolkit_ROOT=/path/to/cuda" \
  pip install llama-cpp-python --force-reinstall --no-cache-dir
```

### Issue: Model download stuck

**Solution:** Download the model manually from Hugging Face and place it in `~/.cache/huggingface/hub/`.

### Issue: Out of memory

**Solution:** Reduce the context window (`n_ctx`) or keep some layers on the CPU by lowering `n_gpu_layers` in the llama.cpp settings.
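As a rough illustration of the out-of-memory advice, the sketch below builds a smaller set of constructor arguments for `llama_cpp.Llama`. The helper name `low_memory_llama_kwargs`, the default values, and the local `model.gguf` path are assumptions made for this example and are not taken from `generator.py`. Only `model_path`, `n_ctx`, and `n_gpu_layers` are real `Llama` parameters.

```python
import os
from typing import Any, Dict


def low_memory_llama_kwargs(
    model_path: str, n_ctx: int = 1024, n_gpu_layers: int = 16
) -> Dict[str, Any]:
    """Build arguments for llama_cpp.Llama with a reduced memory footprint.

    Hypothetical helper: the defaults are illustrative, not tuned values.
    """
    if n_ctx <= 0:
        raise ValueError("n_ctx must be positive")
    return {
        "model_path": model_path,
        "n_ctx": n_ctx,                # smaller context window, smaller KV cache
        "n_gpu_layers": n_gpu_layers,  # remaining layers stay on the CPU
    }


kwargs = low_memory_llama_kwargs("model.gguf")

try:
    from llama_cpp import Llama
except ImportError:
    Llama = None  # llama-cpp-python is not installed

# Load the model only if the library and the file are both present.
llm = None
if Llama is not None and os.path.exists(kwargs["model_path"]):
    llm = Llama(**kwargs)
```

If memory is still tight, lowering `n_gpu_layers` further (down to `0` for CPU only) trades speed for GPU memory.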
## Performance expectations

| Setup | First Run | Subsequent Runs | Speed |
|-------|-----------|-----------------|-------|
| CPU (no optimization) | 5-10 min | 2-5 min per game | Slow |
| CPU (quantized) | 5-10 min | 30-60s per game | Moderate |
| GPU (CUDA/Metal) | 5-10 min | 5-15s per game | Fast |
| HF Zero GPU (auto) | 5-10 min | 5-15s per game | Fast |

## Integration with Hugging Face Spaces (Zero GPU)

The app is fully configured for **Hugging Face Spaces** with **Zero GPU** support.

### What is Zero GPU?

Hugging Face Spaces **Zero GPU** (paid tier) provides on-demand GPU allocation:

- A GPU is allocated only while a `@spaces.GPU`-decorated function runs
- The GPU is **released** after the function completes (saves cost)
- Without the decorator, code runs on CPU

### How it works in this app

1. `app.py` imports `spaces` gracefully (no error if it is missing)
2. The `_generate_with_gpu()` function is wrapped with `@spaces.GPU` only at runtime on HF Spaces
3. Inside that function, `torch.cuda.is_available()` returns `True`, so `generator.py` auto-detects the GPU via `_get_n_gpu_layers()` and sets `n_gpu_layers=-1`
4. On CPU (local dev or the free Spaces tier), it falls back to `n_gpu_layers=0`

### Requirements

Add to `requirements.txt`:

```
llama-cpp-python
spaces
```

> **Note:** The `spaces` package is provided by the Hugging Face Spaces runtime and may not be installed locally. Local imports use `try/except ImportError` to handle this gracefully.

### File structure for HF Spaces

```
app/
  main.py            ← HF Spaces entry point (launches Gradio)
  services/
    generator.py     ← Auto-detects GPU via torch.cuda
  ...
requirements.txt
```

On Hugging Face Spaces, the app runs `app.py` automatically.
### GPU auto-detection logic

In `app/services/generator.py`:

```python
def _get_n_gpu_layers() -> int:
    try:
        import torch
        if torch.cuda.is_available():
            return -1  # All layers on GPU
    except ImportError:
        pass
    return 0  # CPU only
```

This works because `@spaces.GPU` makes `torch.cuda.is_available()` return `True` inside the decorated function.

### Deployment steps

1. Push the repo to Hugging Face Spaces
2. Set **Space SDK** to **Gradio**
3. Set **Space hardware** to **Zero GPU** (paid) for GPU acceleration, or leave as CPU (free)
4. The app auto-detects and uses GPU/CPU accordingly

## Hackathon Integration

This setup satisfies:

- ✓ **Extra credit:** llama.cpp runtime
- ✓ **Sponsor integration:** NVIDIA Nemotron model
- ✓ **Visible pipeline:** Model usage shown in logs and prompts
- ✓ **Quality:** Small model optimized for this task

## Code reference

See [app/services/generator.py](app/services/generator.py) for:

- `generate_game_with_model()`: llama.cpp integration
- `NEMOTRON_MODEL_ID`: Model configuration
- Model caching and initialization

## Testing

Run the test suite:

```bash
python test_generation_gguf.py
```

Expected output:

- ✓ Tests pass with mock (if llama-cpp-python not installed)
- ✓ Tests pass with actual model (if llama-cpp-python installed)
- ✓ All generated games validate against schema
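The GPU auto-detection logic can also be checked without a GPU or a PyTorch install. The sketch below is an assumption about how such a check might be written and is not part of `test_generation_gguf.py`. It copies `_get_n_gpu_layers()` from this guide and swaps in a fake `torch` module through `sys.modules`. The `_fake_torch` and `check_gpu_detection` names are hypothetical.

```python
import sys
import types
from unittest import mock


def _get_n_gpu_layers() -> int:
    # Copied from the guide so that the example is self-contained.
    try:
        import torch
        if torch.cuda.is_available():
            return -1  # All layers on GPU
    except ImportError:
        pass
    return 0  # CPU only


def _fake_torch(cuda_available: bool) -> types.ModuleType:
    """Hypothetical stand-in for torch that exposes only cuda.is_available()."""
    module = types.ModuleType("torch")
    module.cuda = types.SimpleNamespace(is_available=lambda: cuda_available)
    return module


def check_gpu_detection() -> dict:
    """Run the detector under three simulated environments."""
    results = {}
    with mock.patch.dict(sys.modules, {"torch": _fake_torch(True)}):
        results["cuda"] = _get_n_gpu_layers()
    with mock.patch.dict(sys.modules, {"torch": _fake_torch(False)}):
        results["no_cuda"] = _get_n_gpu_layers()
    # A None entry in sys.modules makes `import torch` raise ImportError.
    with mock.patch.dict(sys.modules, {"torch": None}):
        results["no_torch"] = _get_n_gpu_layers()
    return results


results = check_gpu_detection()
print(results)
```

Because `mock.patch.dict` restores `sys.modules` on exit, the check leaves any real `torch` installation untouched.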