cq-test / NEMOTRON_GGUF_SETUP.md
NANI-Nithin's picture
add recap and poster genration with minicpm and bfl
e18b02f
|
Raw
History Blame Contribute Delete
6.36 kB

A newer version of the Gradio SDK is available: 6.19.0

Upgrade

NVIDIA Nemotron 3 Nano 4B GGUF Setup Guide

Overview

The game generation system uses NVIDIA Nemotron 3 Nano 4B in GGUF quantized format, optimized for inference via llama.cpp.

Why this configuration?

  1. Hackathon bonus β€” llama.cpp runtime gives extra credit
  2. Memory efficient β€” GGUF 4-bit quantization reduces model size to ~2.5GB
  3. Performance β€” llama.cpp provides fast CPU and GPU inference
  4. Quality β€” Nemotron 3 Nano 4B is NVIDIA's optimized chat model
  5. Sponsor visibility β€” NVIDIA + llama.cpp integration

Installation

Step 1: Install llama-cpp-python

For CPU inference:

pip install llama-cpp-python

For GPU inference (CUDA 11.8+):

CMAKE_ARGS="-DLLAMA_CUDA=on" pip install llama-cpp-python

For GPU (Metal on macOS):

CMAKE_ARGS="-DLLAMA_METAL=on" pip install llama-cpp-python

Step 2: Verify installation

python -c "from llama_cpp import Llama; print('βœ“ llama-cpp-python installed')"

Model Details

How it works

  1. First run β€” llama-cpp-python downloads the GGUF model from HuggingFace (~2.5GB)
  2. Model caching β€” Subsequent runs use the cached model (no re-download)
  3. GPU acceleration β€” If CUDA/Metal is available, llama.cpp uses GPU for faster inference
  4. Fallback β€” If llama-cpp-python unavailable, system uses mock generation

Usage in code

from app.services.generator import generate_game
from app.services.retrieval import retrieve_examples

# Generate a game
config = {
    "game_type": "scavenger_hunt",
    "city": "Paris",
    "area": "Le Marais",
    "duration_minutes": 60,
    "num_players": 4,
    "difficulty": "medium",
    "age_group": "adults",
    "location_type": "mixed"
}

# Retrieve similar games for grounding
retrieved = retrieve_examples(config, dataset, k=3)

# Generate game (uses llama.cpp if available, mock fallback otherwise)
game = generate_game(config, retrieved)

Performance expectations

Setup First Run Subsequent Runs Speed
CPU (no optimization) 5-10 min 2-5 min per game Slow
CPU (quantized) 5-10 min 30-60s per game Moderate
GPU (CUDA/Metal) 5-10 min 5-15s per game Fast

Troubleshooting

Issue: "llama-cpp-python not found"

Solution: Install with pip install llama-cpp-python or with GPU support.

Issue: CUDA compatibility errors

Solution: Check CUDA version compatibility:

# Check CUDA version
nvidia-smi

# Install specific CUDA-compatible version
CMAKE_ARGS="-DLLAMA_CUDA=on -DLLAMA_CUDA_DATALAYOUT=row -DCUDAToolkit_INCLUDE_DIR=/path/to/cuda/include" \
  pip install llama-cpp-python

Issue: Model download stuck

Solution: Download manually from HuggingFace and place in ~/.cache/huggingface/hub/

Issue: Out of memory

Solution: Reduce context window or use CPU-offloading in llama.cpp settings.

Performance expectations

Setup First Run Subsequent Runs Speed
CPU (no optimization) 5-10 min 2-5 min per game Slow
CPU (quantized) 5-10 min 30-60s per game Moderate
GPU (CUDA/Metal) 5-10 min 5-15s per game Fast
HF Zero GPU (auto) 5-10 min 5-15s per game Fast

Integration with Hugging Face Spaces (Zero GPU)

The app is fully configured for Hugging Face Spaces with Zero GPU support.

What is Zero GPU?

Hugging Face Spaces Zero GPU (paid tier) provides on-demand GPU allocation:

  • GPU is allocated only when a @spaces.GPU-decorated function runs
  • GPU is released after the function completes (saves cost)
  • Without the decorator, code runs on CPU

How it works in this app

  1. app.py imports spaces gracefully (no error if missing)
  2. The _generate_with_gpu() function is wrapped with @spaces.GPU only at runtime on HF Spaces
  3. Inside that function, torch.cuda.is_available() returns True, so generator.py auto-detects GPU via _get_n_gpu_layers() and sets n_gpu_layers=-1
  4. On CPU (local dev or free Spaces tier), it falls back to n_gpu_layers=0

Requirements

Add to requirements.txt:

llama-cpp-python
spaces

Note: The spaces package is only available on the Hugging Face Spaces runtime. Local imports use try/except ImportError to handle this gracefully.

File structure for HF Spaces

app/
  main.py           ← HF Spaces entry point (launches Gradio)
  services/
    generator.py    ← Auto-detects GPU via torch.cuda
    ...
requirements.txt

On Hugging Face Spaces, the app runs app.py automatically.

GPU auto-detection logic

In app/services/generator.py:

def _get_n_gpu_layers() -> int:
    try:
        import torch
        if torch.cuda.is_available():
            return -1  # All layers on GPU
    except ImportError:
        pass
    return 0  # CPU only

This works because @spaces.GPU makes torch.cuda.is_available() return True inside the decorated function.

Deployment steps

  1. Push the repo to Hugging Face Spaces
  2. Set Space SDK to Gradio
  3. Set Space hardware to Zero GPU (paid) for GPU acceleration, or leave as CPU (free)
  4. The app auto-detects and uses GPU/CPU accordingly

Hackathon Integration

This setup satisfies:

  • βœ“ Extra credit: llama.cpp runtime
  • βœ“ Sponsor integration: NVIDIA Nemotron model
  • βœ“ Visible pipeline: Model usage shown in logs and prompts
  • βœ“ Quality: Small model optimized for this task

Code reference

See app/services/generator.py for:

  • generate_game_with_model() β€” llama.cpp integration
  • NEMOTRON_MODEL_ID β€” Model configuration
  • Model caching and initialization

Testing

Run the test suite:

python test_generation_gguf.py

Expected output:

  • βœ“ Tests pass with mock (if llama-cpp-python not installed)
  • βœ“ Tests pass with actual model (if llama-cpp-python installed)
  • βœ“ All generated games validate against schema