Spaces:
Running on Zero
A newer version of the Gradio SDK is available: 6.19.0
NVIDIA Nemotron 3 Nano 4B GGUF Setup Guide
Overview
The game generation system uses NVIDIA Nemotron 3 Nano 4B in GGUF quantized format, optimized for inference via llama.cpp.
Why this configuration?
- Hackathon bonus β llama.cpp runtime gives extra credit
- Memory efficient β GGUF 4-bit quantization reduces model size to ~2.5GB
- Performance β llama.cpp provides fast CPU and GPU inference
- Quality β Nemotron 3 Nano 4B is NVIDIA's optimized chat model
- Sponsor visibility β NVIDIA + llama.cpp integration
Installation
Step 1: Install llama-cpp-python
For CPU inference:
pip install llama-cpp-python
For GPU inference (CUDA 11.8+):
CMAKE_ARGS="-DLLAMA_CUDA=on" pip install llama-cpp-python
For GPU (Metal on macOS):
CMAKE_ARGS="-DLLAMA_METAL=on" pip install llama-cpp-python
Step 2: Verify installation
python -c "from llama_cpp import Llama; print('β llama-cpp-python installed')"
Model Details
- Model ID:
nvidia/NVIDIA-Nemotron-3-Nano-4B-GGUF - Repository: https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-4B-GGUF
- File:
model.gguf - Size: ~2.5GB (4-bit quantization)
- Context: 2048 tokens
- Format: GGUF (compatible with llama.cpp)
How it works
- First run β llama-cpp-python downloads the GGUF model from HuggingFace (~2.5GB)
- Model caching β Subsequent runs use the cached model (no re-download)
- GPU acceleration β If CUDA/Metal is available, llama.cpp uses GPU for faster inference
- Fallback β If llama-cpp-python unavailable, system uses mock generation
Usage in code
from app.services.generator import generate_game
from app.services.retrieval import retrieve_examples
# Generate a game
config = {
"game_type": "scavenger_hunt",
"city": "Paris",
"area": "Le Marais",
"duration_minutes": 60,
"num_players": 4,
"difficulty": "medium",
"age_group": "adults",
"location_type": "mixed"
}
# Retrieve similar games for grounding
retrieved = retrieve_examples(config, dataset, k=3)
# Generate game (uses llama.cpp if available, mock fallback otherwise)
game = generate_game(config, retrieved)
Performance expectations
| Setup | First Run | Subsequent Runs | Speed |
|---|---|---|---|
| CPU (no optimization) | 5-10 min | 2-5 min per game | Slow |
| CPU (quantized) | 5-10 min | 30-60s per game | Moderate |
| GPU (CUDA/Metal) | 5-10 min | 5-15s per game | Fast |
Troubleshooting
Issue: "llama-cpp-python not found"
Solution: Install with pip install llama-cpp-python or with GPU support.
Issue: CUDA compatibility errors
Solution: Check CUDA version compatibility:
# Check CUDA version
nvidia-smi
# Install specific CUDA-compatible version
CMAKE_ARGS="-DLLAMA_CUDA=on -DLLAMA_CUDA_DATALAYOUT=row -DCUDAToolkit_INCLUDE_DIR=/path/to/cuda/include" \
pip install llama-cpp-python
Issue: Model download stuck
Solution: Download manually from HuggingFace and place in ~/.cache/huggingface/hub/
Issue: Out of memory
Solution: Reduce context window or use CPU-offloading in llama.cpp settings.
Performance expectations
| Setup | First Run | Subsequent Runs | Speed |
|---|---|---|---|
| CPU (no optimization) | 5-10 min | 2-5 min per game | Slow |
| CPU (quantized) | 5-10 min | 30-60s per game | Moderate |
| GPU (CUDA/Metal) | 5-10 min | 5-15s per game | Fast |
| HF Zero GPU (auto) | 5-10 min | 5-15s per game | Fast |
Integration with Hugging Face Spaces (Zero GPU)
The app is fully configured for Hugging Face Spaces with Zero GPU support.
What is Zero GPU?
Hugging Face Spaces Zero GPU (paid tier) provides on-demand GPU allocation:
- GPU is allocated only when a
@spaces.GPU-decorated function runs - GPU is released after the function completes (saves cost)
- Without the decorator, code runs on CPU
How it works in this app
app.pyimportsspacesgracefully (no error if missing)- The
_generate_with_gpu()function is wrapped with@spaces.GPUonly at runtime on HF Spaces - Inside that function,
torch.cuda.is_available()returnsTrue, sogenerator.pyauto-detects GPU via_get_n_gpu_layers()and setsn_gpu_layers=-1 - On CPU (local dev or free Spaces tier), it falls back to
n_gpu_layers=0
Requirements
Add to requirements.txt:
llama-cpp-python
spaces
Note: The
spacespackage is only available on the Hugging Face Spaces runtime. Local imports usetry/except ImportErrorto handle this gracefully.
File structure for HF Spaces
app/
main.py β HF Spaces entry point (launches Gradio)
services/
generator.py β Auto-detects GPU via torch.cuda
...
requirements.txt
On Hugging Face Spaces, the app runs app.py automatically.
GPU auto-detection logic
In app/services/generator.py:
def _get_n_gpu_layers() -> int:
try:
import torch
if torch.cuda.is_available():
return -1 # All layers on GPU
except ImportError:
pass
return 0 # CPU only
This works because @spaces.GPU makes torch.cuda.is_available() return True inside the decorated function.
Deployment steps
- Push the repo to Hugging Face Spaces
- Set Space SDK to Gradio
- Set Space hardware to Zero GPU (paid) for GPU acceleration, or leave as CPU (free)
- The app auto-detects and uses GPU/CPU accordingly
Hackathon Integration
This setup satisfies:
- β Extra credit: llama.cpp runtime
- β Sponsor integration: NVIDIA Nemotron model
- β Visible pipeline: Model usage shown in logs and prompts
- β Quality: Small model optimized for this task
Code reference
See app/services/generator.py for:
generate_game_with_model()β llama.cpp integrationNEMOTRON_MODEL_IDβ Model configuration- Model caching and initialization
Testing
Run the test suite:
python test_generation_gguf.py
Expected output:
- β Tests pass with mock (if llama-cpp-python not installed)
- β Tests pass with actual model (if llama-cpp-python installed)
- β All generated games validate against schema