cq-test / NEMOTRON_GGUF_SETUP.md
NANI-Nithin's picture
add recap and poster genration with minicpm and bfl
e18b02f
|
Raw
History Blame Contribute Delete
6.36 kB
# NVIDIA Nemotron 3 Nano 4B GGUF Setup Guide
## Overview
The game generation system uses **NVIDIA Nemotron 3 Nano 4B** in GGUF quantized format, optimized for inference via **llama.cpp**.
### Why this configuration?
1. **Hackathon bonus** β€” llama.cpp runtime gives extra credit
2. **Memory efficient** β€” GGUF 4-bit quantization reduces model size to ~2.5GB
3. **Performance** β€” llama.cpp provides fast CPU and GPU inference
4. **Quality** β€” Nemotron 3 Nano 4B is NVIDIA's optimized chat model
5. **Sponsor visibility** β€” NVIDIA + llama.cpp integration
## Installation
### Step 1: Install llama-cpp-python
**For CPU inference:**
```bash
pip install llama-cpp-python
```
**For GPU inference (CUDA 11.8+):**
```bash
CMAKE_ARGS="-DLLAMA_CUDA=on" pip install llama-cpp-python
```
**For GPU (Metal on macOS):**
```bash
CMAKE_ARGS="-DLLAMA_METAL=on" pip install llama-cpp-python
```
### Step 2: Verify installation
```bash
python -c "from llama_cpp import Llama; print('βœ“ llama-cpp-python installed')"
```
## Model Details
- **Model ID:** `nvidia/NVIDIA-Nemotron-3-Nano-4B-GGUF`
- **Repository:** https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-4B-GGUF
- **File:** `model.gguf`
- **Size:** ~2.5GB (4-bit quantization)
- **Context:** 2048 tokens
- **Format:** GGUF (compatible with llama.cpp)
## How it works
1. **First run** β€” llama-cpp-python downloads the GGUF model from HuggingFace (~2.5GB)
2. **Model caching** β€” Subsequent runs use the cached model (no re-download)
3. **GPU acceleration** β€” If CUDA/Metal is available, llama.cpp uses GPU for faster inference
4. **Fallback** β€” If llama-cpp-python unavailable, system uses mock generation
## Usage in code
```python
from app.services.generator import generate_game
from app.services.retrieval import retrieve_examples
# Generate a game
config = {
"game_type": "scavenger_hunt",
"city": "Paris",
"area": "Le Marais",
"duration_minutes": 60,
"num_players": 4,
"difficulty": "medium",
"age_group": "adults",
"location_type": "mixed"
}
# Retrieve similar games for grounding
retrieved = retrieve_examples(config, dataset, k=3)
# Generate game (uses llama.cpp if available, mock fallback otherwise)
game = generate_game(config, retrieved)
```
## Performance expectations
| Setup | First Run | Subsequent Runs | Speed |
|-------|-----------|-----------------|-------|
| CPU (no optimization) | 5-10 min | 2-5 min per game | Slow |
| CPU (quantized) | 5-10 min | 30-60s per game | Moderate |
| GPU (CUDA/Metal) | 5-10 min | 5-15s per game | Fast |
## Troubleshooting
### Issue: "llama-cpp-python not found"
**Solution:** Install with `pip install llama-cpp-python` or with GPU support.
### Issue: CUDA compatibility errors
**Solution:** Check CUDA version compatibility:
```bash
# Check CUDA version
nvidia-smi
# Install specific CUDA-compatible version
CMAKE_ARGS="-DLLAMA_CUDA=on -DLLAMA_CUDA_DATALAYOUT=row -DCUDAToolkit_INCLUDE_DIR=/path/to/cuda/include" \
pip install llama-cpp-python
```
### Issue: Model download stuck
**Solution:** Download manually from HuggingFace and place in `~/.cache/huggingface/hub/`
### Issue: Out of memory
**Solution:** Reduce context window or use CPU-offloading in llama.cpp settings.
## Performance expectations
| Setup | First Run | Subsequent Runs | Speed |
|-------|-----------|-----------------|-------|
| CPU (no optimization) | 5-10 min | 2-5 min per game | Slow |
| CPU (quantized) | 5-10 min | 30-60s per game | Moderate |
| GPU (CUDA/Metal) | 5-10 min | 5-15s per game | Fast |
| HF Zero GPU (auto) | 5-10 min | 5-15s per game | Fast |
## Integration with Hugging Face Spaces (Zero GPU)
The app is fully configured for **Hugging Face Spaces** with **Zero GPU** support.
### What is Zero GPU?
Hugging Face Spaces **Zero GPU** (paid tier) provides on-demand GPU allocation:
- GPU is allocated only when a `@spaces.GPU`-decorated function runs
- GPU is **released** after the function completes (saves cost)
- Without the decorator, code runs on CPU
### How it works in this app
1. `app.py` imports `spaces` gracefully (no error if missing)
2. The `_generate_with_gpu()` function is wrapped with `@spaces.GPU` only at runtime on HF Spaces
3. Inside that function, `torch.cuda.is_available()` returns `True`, so `generator.py` auto-detects GPU via `_get_n_gpu_layers()` and sets `n_gpu_layers=-1`
4. On CPU (local dev or free Spaces tier), it falls back to `n_gpu_layers=0`
### Requirements
Add to `requirements.txt`:
```
llama-cpp-python
spaces
```
> **Note:** The `spaces` package is only available on the Hugging Face Spaces runtime. Local imports use `try/except ImportError` to handle this gracefully.
### File structure for HF Spaces
```
app/
main.py ← HF Spaces entry point (launches Gradio)
services/
generator.py ← Auto-detects GPU via torch.cuda
...
requirements.txt
```
On Hugging Face Spaces, the app runs `app.py` automatically.
### GPU auto-detection logic
In `app/services/generator.py`:
```python
def _get_n_gpu_layers() -> int:
try:
import torch
if torch.cuda.is_available():
return -1 # All layers on GPU
except ImportError:
pass
return 0 # CPU only
```
This works because `@spaces.GPU` makes `torch.cuda.is_available()` return `True` inside the decorated function.
### Deployment steps
1. Push the repo to Hugging Face Spaces
2. Set **Space SDK** to **Gradio**
3. Set **Space hardware** to **Zero GPU** (paid) for GPU acceleration, or leave as CPU (free)
4. The app auto-detects and uses GPU/CPU accordingly
## Hackathon Integration
This setup satisfies:
- βœ“ **Extra credit:** llama.cpp runtime
- βœ“ **Sponsor integration:** NVIDIA Nemotron model
- βœ“ **Visible pipeline:** Model usage shown in logs and prompts
- βœ“ **Quality:** Small model optimized for this task
## Code reference
See [app/services/generator.py](app/services/generator.py) for:
- `generate_game_with_model()` β€” llama.cpp integration
- `NEMOTRON_MODEL_ID` β€” Model configuration
- Model caching and initialization
## Testing
Run the test suite:
```bash
python test_generation_gguf.py
```
Expected output:
- βœ“ Tests pass with mock (if llama-cpp-python not installed)
- βœ“ Tests pass with actual model (if llama-cpp-python installed)
- βœ“ All generated games validate against schema