# NVIDIA Nemotron 3 Nano 4B GGUF Setup Guide

## Overview

The game generation system uses **NVIDIA Nemotron 3 Nano 4B** in GGUF quantized format, optimized for inference via **llama.cpp**.

### Why this configuration?

1. **Hackathon bonus:** llama.cpp runtime gives extra credit
2. **Memory efficient:** GGUF 4-bit quantization reduces model size to ~2.5GB
3. **Performance:** llama.cpp provides fast CPU and GPU inference
4. **Quality:** Nemotron 3 Nano 4B is NVIDIA's optimized chat model
5. **Sponsor visibility:** NVIDIA + llama.cpp integration

## Installation

### Step 1: Install llama-cpp-python

**For CPU inference:**

```bash
pip install llama-cpp-python
```

**For GPU inference (CUDA 11.8+):**

```bash
# Recent llama-cpp-python releases use GGML_CUDA; older releases used LLAMA_CUDA
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python
```

**For GPU inference (Metal on macOS):**

```bash
# Recent llama-cpp-python releases use GGML_METAL; older releases used LLAMA_METAL
CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python
```

### Step 2: Verify installation

```bash
python -c "from llama_cpp import Llama; print('✓ llama-cpp-python installed')"
```

## Model Details

- **Model ID:** `nvidia/NVIDIA-Nemotron-3-Nano-4B-GGUF`
- **Repository:** https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-4B-GGUF
- **File:** `model.gguf`
- **Size:** ~2.5GB (4-bit quantization)
- **Context:** 2048 tokens
- **Format:** GGUF (compatible with llama.cpp)
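
The ~2.5GB figure can be sanity-checked with rough arithmetic. The sketch below assumes a nominal 4 billion parameters and a Q4_0-style layout (blocks of 32 weights stored as 16 bytes of 4-bit values plus a 2-byte scale). Both are assumptions, not values taken from the model card: the published file may use a different 4-bit variant and typically keeps some tensors at higher precision, so this is only an estimate.

```python
# Rough size estimate for a 4-bit GGUF file.
# Assumptions (not from the model card): 4e9 parameters, Q4_0-style blocks.
PARAMS = 4_000_000_000
BLOCK_WEIGHTS = 32      # weights per quantization block
BLOCK_BYTES = 16 + 2    # 32 x 4-bit values (16 bytes) + one fp16 scale (2 bytes)

bits_per_weight = BLOCK_BYTES * 8 / BLOCK_WEIGHTS
size_gb = PARAMS * bits_per_weight / 8 / 1e9

print(f"{bits_per_weight} bits/weight, about {size_gb:.2f} GB")
```

Under these assumptions the weights alone come to a little over 2GB, which is in line with the size quoted above once higher-precision tensors and metadata are included.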

## How it works

1. **First run:** llama-cpp-python downloads the GGUF model from Hugging Face (~2.5GB)
2. **Model caching:** Subsequent runs use the cached model (no re-download)
3. **GPU acceleration:** If CUDA/Metal is available, llama.cpp uses the GPU for faster inference
4. **Fallback:** If llama-cpp-python is unavailable, the system uses mock generation
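
The download, caching and fallback steps above can be sketched roughly as follows. This is a minimal illustration, not the code in `generator.py`: `load_llm`, `mock_game` and `generate` are hypothetical names, and the prompt text is a placeholder. `Llama.from_pretrained` is the llama-cpp-python helper that fetches a file from the Hugging Face Hub and reuses the local cache on later calls (it needs `huggingface_hub` installed).

```python
import json


def load_llm():
    """Return a llama.cpp model, or None when llama-cpp-python is not installed."""
    try:
        from llama_cpp import Llama  # optional dependency
    except ImportError:
        return None
    # Downloads the file on first use; later calls reuse the Hugging Face cache.
    return Llama.from_pretrained(
        repo_id="nvidia/NVIDIA-Nemotron-3-Nano-4B-GGUF",
        filename="model.gguf",
        n_ctx=2048,
        verbose=False,
    )


def mock_game(config):
    """Hypothetical stand-in for the app's mock generator."""
    return {"title": f"{config['city']} {config['game_type']}", "source": "mock"}


def generate(config, llm=None):
    """Use the model when one is supplied, otherwise fall back to the mock."""
    if llm is None:
        return mock_game(config)
    prompt = "Create a game for this configuration as JSON:\n" + json.dumps(config)
    result = llm(prompt, max_tokens=512)
    return {"text": result["choices"][0]["text"], "source": "llama.cpp"}
```

A caller would typically run `generate(config, llm=load_llm())`, so a missing llama-cpp-python install degrades to the mock path instead of raising.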

## Usage in code

```python
from app.services.generator import generate_game
from app.services.retrieval import retrieve_examples

# Game configuration
config = {
    "game_type": "scavenger_hunt",
    "city": "Paris",
    "area": "Le Marais",
    "duration_minutes": 60,
    "num_players": 4,
    "difficulty": "medium",
    "age_group": "adults",
    "location_type": "mixed"
}

# Retrieve similar games for grounding.
# `dataset` is the collection of example games; it must be loaded beforehand (not shown here).
retrieved = retrieve_examples(config, dataset, k=3)

# Generate game (uses llama.cpp if available, mock fallback otherwise)
game = generate_game(config, retrieved)
```

## Troubleshooting

### Issue: "llama-cpp-python not found"

**Solution:** Install it with `pip install llama-cpp-python`, or use one of the GPU build commands from the Installation section.

### Issue: CUDA compatibility errors

**Solution:** Check which CUDA version is available, then rebuild against that toolkit:

```bash
# Check the driver and the highest CUDA version it supports
nvidia-smi
# Rebuild against a specific CUDA toolkit (GGML_CUDA on recent releases, LLAMA_CUDA on older ones)
CMAKE_ARGS="-DGGML_CUDA=on -DCUDAToolkit_ROOT=/path/to/cuda" \
pip install --force-reinstall --no-cache-dir llama-cpp-python
```

### Issue: Model download stuck

**Solution:** Download the GGUF file manually from the Hugging Face repository and place it in the hub cache under `~/.cache/huggingface/hub/`. The cache uses its own `models--<org>--<name>/snapshots/<revision>/` layout, so a file copied to the top level of that directory is not picked up.
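
One way to get the file into the cache with the layout the hub expects is `huggingface_hub.hf_hub_download`, which returns the local path of the cached file. The sketch below is an illustration under assumptions: `fetch_model` and its `downloader` parameter are hypothetical (the parameter exists only to make the function easy to exercise without network access), and the file name is the one listed under Model Details.

```python
REPO_ID = "nvidia/NVIDIA-Nemotron-3-Nano-4B-GGUF"
FILENAME = "model.gguf"  # file name as listed under Model Details


def fetch_model(repo_id=REPO_ID, filename=FILENAME, downloader=None):
    """Download the GGUF file into the Hugging Face cache and return its local path."""
    if downloader is None:
        # Imported lazily so this module loads even without huggingface_hub.
        from huggingface_hub import hf_hub_download
        downloader = hf_hub_download
    return downloader(repo_id=repo_id, filename=filename)
```

The returned path can then be passed to llama-cpp-python directly, for example `Llama(model_path=fetch_model())`.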

### Issue: Out of memory

**Solution:** Reduce the context window (`n_ctx`), or offload fewer layers to the GPU (a lower `n_gpu_layers`) so that more of the model stays in CPU memory.
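
A retry with smaller settings might look like the sketch below. `low_memory_kwargs` is a hypothetical helper, and the floor of 512 tokens and the cap of 16 GPU layers are arbitrary illustrative values rather than recommendations. `n_ctx` and `n_gpu_layers` are the corresponding `Llama` constructor arguments in llama-cpp-python.

```python
def low_memory_kwargs(n_ctx=2048, n_gpu_layers=-1, shrink=2, gpu_layers_cap=16):
    """Return smaller llama.cpp settings to retry with after an out-of-memory error."""
    reduced_ctx = max(512, n_ctx // shrink)   # smaller context means a smaller KV cache
    if n_gpu_layers == 0:
        reduced_layers = 0                    # already CPU-only
    elif n_gpu_layers < 0:
        reduced_layers = gpu_layers_cap       # -1 meant "all layers on GPU"
    else:
        reduced_layers = min(n_gpu_layers, gpu_layers_cap)
    return {"n_ctx": reduced_ctx, "n_gpu_layers": reduced_layers}
```

The result can be splatted into the constructor, for example `Llama(model_path=path, **low_memory_kwargs())`, where `path` is the local GGUF file.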

## Performance expectations

| Setup | First Run | Subsequent Runs | Speed |
|-------|-----------|-----------------|-------|
| CPU (no optimization) | 5-10 min | 2-5 min per game | Slow |
| CPU (quantized) | 5-10 min | 30-60s per game | Moderate |
| GPU (CUDA/Metal) | 5-10 min | 5-15s per game | Fast |
| HF Zero GPU (auto) | 5-10 min | 5-15s per game | Fast |
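
The table gives indicative figures without naming the hardware, so it is worth measuring on the target machine. A minimal timing sketch, assuming a hypothetical `time_generation` helper that accepts any callable taking a config:

```python
import time


def time_generation(generate_fn, config, runs=3):
    """Call generate_fn(config) `runs` times and return per-call durations in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        generate_fn(config)
        durations.append(time.perf_counter() - start)
    return durations
```

If the generator loads the model lazily, the first duration includes the load (and possibly the download), which corresponds to the "First Run" column.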

## Integration with Hugging Face Spaces (Zero GPU)

The app is configured for **Hugging Face Spaces** with **Zero GPU** support.

### What is Zero GPU?

Hugging Face Spaces **Zero GPU** (paid tier) provides on-demand GPU allocation:

- A GPU is allocated only while a `@spaces.GPU`-decorated function runs
- The GPU is **released** when the function completes (saves cost)
- Code outside a decorated function runs on CPU

### How it works in this app

1. `app.py` imports `spaces` inside a `try/except ImportError`, so a missing package is not an error
2. On HF Spaces, `_generate_with_gpu()` is wrapped with `@spaces.GPU` at runtime; elsewhere it is left undecorated
3. Inside that function, `torch.cuda.is_available()` returns `True`, so `generator.py` detects the GPU via `_get_n_gpu_layers()` and sets `n_gpu_layers=-1`
4. On CPU (local dev or free Spaces tier), it falls back to `n_gpu_layers=0`

### Requirements

Add to `requirements.txt`:

```
llama-cpp-python
spaces
```

> **Note:** The `spaces` package may not be installed outside the Hugging Face Spaces runtime. The app wraps the import in `try/except ImportError`, so local runs work without it.

### File structure for HF Spaces

```
app/
  main.py          ← HF Spaces entry point (launches Gradio)
  services/
    generator.py   ← Auto-detects GPU via torch.cuda
  ...
requirements.txt
```

On Hugging Face Spaces, `app.py` runs automatically.

### GPU auto-detection logic

In `app/services/generator.py`:

```python
def _get_n_gpu_layers() -> int:
    try:
        import torch
        if torch.cuda.is_available():
            return -1  # All layers on GPU
    except ImportError:
        pass
    return 0  # CPU only
```

This works because `@spaces.GPU` makes `torch.cuda.is_available()` return `True` inside the decorated function.

### Deployment steps

1. Push the repo to Hugging Face Spaces
2. Set **Space SDK** to **Gradio**
3. Set **Space hardware** to **Zero GPU** (paid) for GPU acceleration, or leave as CPU (free)
4. The app auto-detects and uses GPU/CPU accordingly

## Hackathon Integration

This setup satisfies:

- ✅ **Extra credit:** llama.cpp runtime
- ✅ **Sponsor integration:** NVIDIA Nemotron model
- ✅ **Visible pipeline:** Model usage shown in logs and prompts
- ✅ **Quality:** Small model optimized for this task

## Code reference

See [app/services/generator.py](app/services/generator.py) for:

- `generate_game_with_model()`: llama.cpp integration
- `NEMOTRON_MODEL_ID`: Model configuration
- Model caching and initialization

## Testing

Run the test suite:

```bash
python test_generation_gguf.py
```

Expected output:

- ✅ Tests pass with mock (if llama-cpp-python not installed)
- ✅ Tests pass with actual model (if llama-cpp-python installed)
- ✅ All generated games validate against schema