Spaces:

build-small-hackathon
/

cq-test

Running on Zero

App Files Files Community

cq-test / NEMOTRON_GGUF_SETUP.md

NANI-Nithin

add recap and poster genration with minicpm and bfl

e18b02f 26 days ago

preview code

Raw

History Blame Contribute Delete

6.36 kB

	# NVIDIA Nemotron 3 Nano 4B GGUF Setup Guide

	## Overview

	The game generation system uses NVIDIA Nemotron 3 Nano 4B in GGUF quantized format, optimized for inference via llama.cpp.

	### Why this configuration?

	1. Hackathon bonus — llama.cpp runtime gives extra credit
	2. Memory efficient — GGUF 4-bit quantization reduces model size to ~2.5GB
	3. Performance — llama.cpp provides fast CPU and GPU inference
	4. Quality — Nemotron 3 Nano 4B is NVIDIA's optimized chat model
	5. Sponsor visibility — NVIDIA + llama.cpp integration

	## Installation

	### Step 1: Install llama-cpp-python

	For CPU inference:
	```bash
	pip install llama-cpp-python
	```

	For GPU inference (CUDA 11.8+):
	```bash
	CMAKE_ARGS="-DLLAMA_CUDA=on" pip install llama-cpp-python
	```

	For GPU (Metal on macOS):
	```bash
	CMAKE_ARGS="-DLLAMA_METAL=on" pip install llama-cpp-python
	```

	### Step 2: Verify installation

	```bash
	python -c "from llama_cpp import Llama; print('✓ llama-cpp-python installed')"
	```

	## Model Details

	- Model ID: `nvidia/NVIDIA-Nemotron-3-Nano-4B-GGUF`
	- Repository: https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-4B-GGUF
	- File: `model.gguf`
	- Size: ~2.5GB (4-bit quantization)
	- Context: 2048 tokens
	- Format: GGUF (compatible with llama.cpp)

	## How it works

	1. First run — llama-cpp-python downloads the GGUF model from HuggingFace (~2.5GB)
	2. Model caching — Subsequent runs use the cached model (no re-download)
	3. GPU acceleration — If CUDA/Metal is available, llama.cpp uses GPU for faster inference
	4. Fallback — If llama-cpp-python unavailable, system uses mock generation

	## Usage in code

	```python
	from app.services.generator import generate_game
	from app.services.retrieval import retrieve_examples

	# Generate a game
	config = {
	"game_type": "scavenger_hunt",
	"city": "Paris",
	"area": "Le Marais",
	"duration_minutes": 60,
	"num_players": 4,
	"difficulty": "medium",
	"age_group": "adults",
	"location_type": "mixed"
	}

	# Retrieve similar games for grounding
	retrieved = retrieve_examples(config, dataset, k=3)

	# Generate game (uses llama.cpp if available, mock fallback otherwise)
	game = generate_game(config, retrieved)
	```

	## Performance expectations

	\| Setup \| First Run \| Subsequent Runs \| Speed \|
	\|-------\|-----------\|-----------------\|-------\|
	\| CPU (no optimization) \| 5-10 min \| 2-5 min per game \| Slow \|
	\| CPU (quantized) \| 5-10 min \| 30-60s per game \| Moderate \|
	\| GPU (CUDA/Metal) \| 5-10 min \| 5-15s per game \| Fast \|

	## Troubleshooting

	### Issue: "llama-cpp-python not found"
	Solution: Install with `pip install llama-cpp-python` or with GPU support.

	### Issue: CUDA compatibility errors
	Solution: Check CUDA version compatibility:
	```bash
	# Check CUDA version
	nvidia-smi

	# Install specific CUDA-compatible version
	CMAKE_ARGS="-DLLAMA_CUDA=on -DLLAMA_CUDA_DATALAYOUT=row -DCUDAToolkit_INCLUDE_DIR=/path/to/cuda/include" \
	pip install llama-cpp-python
	```

	### Issue: Model download stuck
	Solution: Download manually from HuggingFace and place in `~/.cache/huggingface/hub/`

	### Issue: Out of memory
	Solution: Reduce context window or use CPU-offloading in llama.cpp settings.

	## Performance expectations

	\| Setup \| First Run \| Subsequent Runs \| Speed \|
	\|-------\|-----------\|-----------------\|-------\|
	\| CPU (no optimization) \| 5-10 min \| 2-5 min per game \| Slow \|
	\| CPU (quantized) \| 5-10 min \| 30-60s per game \| Moderate \|
	\| GPU (CUDA/Metal) \| 5-10 min \| 5-15s per game \| Fast \|
	\| HF Zero GPU (auto) \| 5-10 min \| 5-15s per game \| Fast \|

	## Integration with Hugging Face Spaces (Zero GPU)

	The app is fully configured for Hugging Face Spaces with Zero GPU support.

	### What is Zero GPU?

	Hugging Face Spaces Zero GPU (paid tier) provides on-demand GPU allocation:
	- GPU is allocated only when a `@spaces.GPU`-decorated function runs
	- GPU is released after the function completes (saves cost)
	- Without the decorator, code runs on CPU

	### How it works in this app

	1. `app.py` imports `spaces` gracefully (no error if missing)
	2. The `_generate_with_gpu()` function is wrapped with `@spaces.GPU` only at runtime on HF Spaces
	3. Inside that function, `torch.cuda.is_available()` returns `True`, so `generator.py` auto-detects GPU via `_get_n_gpu_layers()` and sets `n_gpu_layers=-1`
	4. On CPU (local dev or free Spaces tier), it falls back to `n_gpu_layers=0`

	### Requirements

	Add to `requirements.txt`:
	```
	llama-cpp-python
	spaces
	```

	> Note: The `spaces` package is only available on the Hugging Face Spaces runtime. Local imports use `try/except ImportError` to handle this gracefully.

	### File structure for HF Spaces

	```
	app/
	main.py ← HF Spaces entry point (launches Gradio)
	services/
	generator.py ← Auto-detects GPU via torch.cuda
	...
	requirements.txt
	```

	On Hugging Face Spaces, the app runs `app.py` automatically.

	### GPU auto-detection logic

	In `app/services/generator.py`:

	```python
	def _get_n_gpu_layers() -> int:
	try:
	import torch
	if torch.cuda.is_available():
	return -1 # All layers on GPU
	except ImportError:
	pass
	return 0 # CPU only
	```

	This works because `@spaces.GPU` makes `torch.cuda.is_available()` return `True` inside the decorated function.

	### Deployment steps

	1. Push the repo to Hugging Face Spaces
	2. Set Space SDK to Gradio
	3. Set Space hardware to Zero GPU (paid) for GPU acceleration, or leave as CPU (free)
	4. The app auto-detects and uses GPU/CPU accordingly

	## Hackathon Integration

	This setup satisfies:
	- ✓ Extra credit: llama.cpp runtime
	- ✓ Sponsor integration: NVIDIA Nemotron model
	- ✓ Visible pipeline: Model usage shown in logs and prompts
	- ✓ Quality: Small model optimized for this task

	## Code reference

	See [app/services/generator.py](app/services/generator.py) for:
	- `generate_game_with_model()` — llama.cpp integration
	- `NEMOTRON_MODEL_ID` — Model configuration
	- Model caching and initialization

	## Testing

	Run the test suite:
	```bash
	python test_generation_gguf.py
	```

	Expected output:
	- ✓ Tests pass with mock (if llama-cpp-python not installed)
	- ✓ Tests pass with actual model (if llama-cpp-python installed)
	- ✓ All generated games validate against schema