Spaces:

build-small-hackathon
/

cq-test

Running on Zero

App Files Files Community

cq-test / NEMOTRON_GGUF_SETUP.md

NANI-Nithin

add recap and poster genration with minicpm and bfl

e18b02f 26 days ago

preview code

Raw

History Blame Contribute Delete

6.36 kB

A newer version of the Gradio SDK is available: 6.19.0

Upgrade

NVIDIA Nemotron 3 Nano 4B GGUF Setup Guide

Overview

The game generation system uses NVIDIA Nemotron 3 Nano 4B in GGUF quantized format, optimized for inference via llama.cpp.

Why this configuration?

Hackathon bonus — llama.cpp runtime gives extra credit
Memory efficient — GGUF 4-bit quantization reduces model size to ~2.5GB
Performance — llama.cpp provides fast CPU and GPU inference
Quality — Nemotron 3 Nano 4B is NVIDIA's optimized chat model
Sponsor visibility — NVIDIA + llama.cpp integration

Installation

Step 1: Install llama-cpp-python

For CPU inference:

pip install llama-cpp-python

For GPU inference (CUDA 11.8+):

CMAKE_ARGS="-DLLAMA_CUDA=on" pip install llama-cpp-python

For GPU (Metal on macOS):

CMAKE_ARGS="-DLLAMA_METAL=on" pip install llama-cpp-python

Step 2: Verify installation

python -c "from llama_cpp import Llama; print('✓ llama-cpp-python installed')"

Model Details

Model ID: nvidia/NVIDIA-Nemotron-3-Nano-4B-GGUF
Repository: https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-4B-GGUF
File: model.gguf
Size: ~2.5GB (4-bit quantization)
Context: 2048 tokens
Format: GGUF (compatible with llama.cpp)

How it works

First run — llama-cpp-python downloads the GGUF model from HuggingFace (~2.5GB)
Model caching — Subsequent runs use the cached model (no re-download)
GPU acceleration — If CUDA/Metal is available, llama.cpp uses GPU for faster inference
Fallback — If llama-cpp-python unavailable, system uses mock generation

Usage in code

from app.services.generator import generate_game
from app.services.retrieval import retrieve_examples

# Generate a game
config = {
    "game_type": "scavenger_hunt",
    "city": "Paris",
    "area": "Le Marais",
    "duration_minutes": 60,
    "num_players": 4,
    "difficulty": "medium",
    "age_group": "adults",
    "location_type": "mixed"
}

# Retrieve similar games for grounding
retrieved = retrieve_examples(config, dataset, k=3)

# Generate game (uses llama.cpp if available, mock fallback otherwise)
game = generate_game(config, retrieved)

Performance expectations

Setup	First Run	Subsequent Runs	Speed
CPU (no optimization)	5-10 min	2-5 min per game	Slow
CPU (quantized)	5-10 min	30-60s per game	Moderate
GPU (CUDA/Metal)	5-10 min	5-15s per game	Fast

Troubleshooting

Issue: "llama-cpp-python not found"

Solution: Install with pip install llama-cpp-python or with GPU support.

Issue: CUDA compatibility errors

Solution: Check CUDA version compatibility:

# Check CUDA version
nvidia-smi

# Install specific CUDA-compatible version
CMAKE_ARGS="-DLLAMA_CUDA=on -DLLAMA_CUDA_DATALAYOUT=row -DCUDAToolkit_INCLUDE_DIR=/path/to/cuda/include" \
  pip install llama-cpp-python

Issue: Model download stuck

Solution: Download manually from HuggingFace and place in ~/.cache/huggingface/hub/

Issue: Out of memory

Solution: Reduce context window or use CPU-offloading in llama.cpp settings.

Performance expectations

Setup	First Run	Subsequent Runs	Speed
CPU (no optimization)	5-10 min	2-5 min per game	Slow
CPU (quantized)	5-10 min	30-60s per game	Moderate
GPU (CUDA/Metal)	5-10 min	5-15s per game	Fast
HF Zero GPU (auto)	5-10 min	5-15s per game	Fast

Integration with Hugging Face Spaces (Zero GPU)

The app is fully configured for Hugging Face Spaces with Zero GPU support.

What is Zero GPU?

Hugging Face Spaces Zero GPU (paid tier) provides on-demand GPU allocation:

GPU is allocated only when a @spaces.GPU-decorated function runs
GPU is released after the function completes (saves cost)
Without the decorator, code runs on CPU

How it works in this app

app.py imports spaces gracefully (no error if missing)
The _generate_with_gpu() function is wrapped with @spaces.GPU only at runtime on HF Spaces
Inside that function, torch.cuda.is_available() returns True, so generator.py auto-detects GPU via _get_n_gpu_layers() and sets n_gpu_layers=-1
On CPU (local dev or free Spaces tier), it falls back to n_gpu_layers=0

Requirements

Add to requirements.txt:

llama-cpp-python
spaces

Note: The spaces package is only available on the Hugging Face Spaces runtime. Local imports use try/except ImportError to handle this gracefully.

File structure for HF Spaces

app/
  main.py           ← HF Spaces entry point (launches Gradio)
  services/
    generator.py    ← Auto-detects GPU via torch.cuda
    ...
requirements.txt

On Hugging Face Spaces, the app runs app.py automatically.

GPU auto-detection logic

In app/services/generator.py:

def _get_n_gpu_layers() -> int:
    try:
        import torch
        if torch.cuda.is_available():
            return -1  # All layers on GPU
    except ImportError:
        pass
    return 0  # CPU only

This works because @spaces.GPU makes torch.cuda.is_available() return True inside the decorated function.

Deployment steps

Push the repo to Hugging Face Spaces
Set Space SDK to Gradio
Set Space hardware to Zero GPU (paid) for GPU acceleration, or leave as CPU (free)
The app auto-detects and uses GPU/CPU accordingly

Hackathon Integration

This setup satisfies:

✓ Extra credit: llama.cpp runtime
✓ Sponsor integration: NVIDIA Nemotron model
✓ Visible pipeline: Model usage shown in logs and prompts
✓ Quality: Small model optimized for this task

Code reference

See app/services/generator.py for:

generate_game_with_model() — llama.cpp integration
NEMOTRON_MODEL_ID — Model configuration
Model caching and initialization

Testing

Run the test suite:

python test_generation_gguf.py

Expected output:

✓ Tests pass with mock (if llama-cpp-python not installed)
✓ Tests pass with actual model (if llama-cpp-python installed)
✓ All generated games validate against schema