# Deploy Tiny Narrator to Hugging Face Spaces This guide deploys Tiny Narrator on Hugging Face Spaces using the **Docker SDK on free CPU**. The Space serves the custom article UI, FastAPI routes, Kokoro/browser speech path, and fallback assets. GPU inference runs outside the Space through Modal or other OpenAI-compatible endpoints. > **Why Docker SDK?** Tiny Narrator uses `gradio.Server` with custom FastAPI routes and static files. The Docker SDK preserves that app shape without rewriting the UI into `gr.Blocks`. > > **Why free CPU?** Paid HF GPU is not required when `llama.cpp` and image generation are hosted externally. The Space can stay on CPU Basic while still calling live model endpoints. --- ## Prerequisites - A Hugging Face account - Git and Git LFS installed locally - Python 3.11+ installed locally for testing - Modal CLI installed if you want live reader-brain or Klein image generation --- ## Step 1 - Create a New Space 1. Go to [https://huggingface.co/new-space](https://huggingface.co/new-space). 2. Fill in the form: - **Space name**: `tiny-narrator` - **SDK**: **Docker** - **Hardware**: **CPU Basic** - **Visibility**: Public or Private 3. Click **Create Space**. Your Space README metadata should include: ```yaml --- title: Tiny Narrator emoji: book colorFrom: blue colorTo: teal sdk: docker app_port: 7860 --- ``` The `app_port: 7860` line is important because Tiny Narrator binds to port 7860 by default. --- ## Step 2 - Copy Project Files Clone the Space repository and copy the Tiny Narrator project files into it: ```bash git clone https://huggingface.co/spaces//tiny-narrator cd tiny-narrator ``` Include these files and directories: ```text app.py requirements.txt Dockerfile start.sh README.md SUBMISSION.md FIELD_NOTES.md LICENSE static/ modal_workers/ scripts/ ``` Do **not** copy `.env`; configure secrets in the Space Settings UI. --- ## Step 3 - Deploy Modal Reader Brain The reader-brain role uses `nvidia/NVIDIA-Nemotron-3-Nano-4B-GGUF:Q4_K_M` through `llama.cpp`. Deploy it to Modal: ```bash pip install modal modal setup modal secret create tiny-narrator-reader-brain-token LLAMA_CPP_TOKEN=your-random-token modal deploy modal_workers/reader_brain.py ``` After deployment, Modal prints a URL similar to: ```text https://your-workspace--tiny-narrator-reader-brain.modal.run ``` Set the Space variable: ```text LLAMA_CPP_BASE_URL=https://your-workspace--tiny-narrator-reader-brain.modal.run/v1 LLAMA_CPP_MODEL=narrator-brain LLAMA_CPP_TOKEN=your-random-token LLAMA_CPP_TIMEOUT_SECONDS=90 ``` The worker starts `llama-server` with: ```bash llama-server \ -hf nvidia/NVIDIA-Nemotron-3-Nano-4B-GGUF:Q4_K_M \ --alias narrator-brain \ --host 0.0.0.0 \ --port 8080 \ --ctx-size 4096 \ --parallel 1 \ --reasoning off \ --n-gpu-layers 999 \ --api-key your-random-token ``` Modal scales the worker down when idle. The first request after scale-to-zero can be slow while the container and model start. The worker uses the prebuilt `ghcr.io/ggml-org/llama.cpp:server-cuda12` image, so Modal should pull a CUDA server image instead of compiling llama.cpp from source during deployment. The worker also clears the prebuilt image entrypoint so Modal can start its Python runner before launching `llama-server`, and gives the web server up to 10 minutes to download/load the GGUF on first startup. --- ## Step 4 - Optional Live Image Generation Tiny Narrator can use the Modal Klein worker for `black-forest-labs/FLUX.2-klein-4B` thumbnails: ```bash modal secret create tiny-narrator-klein-token KLEIN_MODAL_TOKEN=your-random-token modal deploy modal_workers/klein_image.py ``` Set these Space values: ```text KLEIN_MODAL_ENDPOINT=https://your-workspace--tiny-narrator-klein.modal.run KLEIN_MODAL_TOKEN=your-random-token KLEIN_MODAL_HEALTH_TIMEOUT_SECONDS=30 KLEIN_MODAL_TIMEOUT_SECONDS=120 ``` If the Klein worker is not configured, the app falls back to bundled SVG assets. --- ## Step 5 - Optional MiniCPM-V Image Descriptions If you have an OpenAI-compatible MiniCPM-V-4.6 endpoint, set: ```text MINICPM_VISION_BASE_URL=https://your-vision-endpoint.example.com/v1 MINICPM_VISION_API_KEY= MINICPM_VISION_MODEL=openbmb/MiniCPM-V-4.6 MINICPM_VISION_TIMEOUT_SECONDS=45 ``` If this endpoint is not configured, Tiny Narrator uses deterministic cached alt text. --- ## Step 6 - Configure Space Variables In your Space, open **Settings -> Variables and secrets**. Add these variables: | Variable | Example | Notes | | --- | --- | --- | | `PUBLIC_BASE_URL` | `https://your-username-tiny-narrator.hf.space` | Used in generated judge/demo links | | `GRADIO_SHARE` | `false` | Keep false on Spaces | | `LLAMA_CPP_BASE_URL` | `https://...modal.run/v1` | Modal reader-brain endpoint | | `LLAMA_CPP_MODEL` | `narrator-brain` | Alias served by llama.cpp | | `LLAMA_CPP_TIMEOUT_SECONDS` | `90` | Reader/article generation timeout | | `KLEIN_MODAL_ENDPOINT` | `https://...modal.run` | Optional image generation endpoint | | `KLEIN_MODAL_HEALTH_TIMEOUT_SECONDS` | `30` | Optional Klein health timeout | | `KLEIN_MODAL_TIMEOUT_SECONDS` | `120` | Optional Klein generation timeout | | `MINICPM_VISION_BASE_URL` | `https://.../v1` | Optional image descriptor endpoint | | `MINICPM_VISION_MODEL` | `openbmb/MiniCPM-V-4.6` | Optional descriptor model id | | `MINICPM_VISION_TIMEOUT_SECONDS` | `45` | Optional descriptor timeout | Add these as **Secrets**: | Secret | Notes | | --- | --- | | `LLAMA_CPP_TOKEN` | Recommended for Modal reader-brain auth | | `KLEIN_MODAL_TOKEN` | Only needed if Modal Klein is enabled | | `MINICPM_VISION_API_KEY` | Only needed if MiniCPM-V is enabled | --- ## Step 7 - Commit and Push ```bash git add . git commit -m "Deploy Tiny Narrator to HF Spaces CPU" git push origin main ``` HF Spaces will: 1. Detect the Dockerfile. 2. Build a small CPU image. 3. Run `start.sh`. 4. Launch `python app.py` on port 7860. 5. Expose the app at `https://-tiny-narrator.hf.space`. --- ## Step 8 - Verify the Deployment Open the Space URL in a browser, then check: ```bash curl https://-tiny-narrator.hf.space/api/health curl https://-tiny-narrator.hf.space/api/model-budget curl https://-tiny-narrator.hf.space/api/runtime-status curl https://-tiny-narrator.hf.space/api/submission-readiness ``` `/api/runtime-status` should show: - `reader_brain`: `online` when Modal llama.cpp is reachable, `fallback-ready` otherwise; configured MiniCPM-V can still act as the first text fallback before deterministic narration - `speech`: Kokoro or fallback speech path - `vision`: MiniCPM online or fallback-ready - `image_generation`: Modal Klein online or fallback-ready --- ## Troubleshooting ### Space starts but reader brain is fallback-ready - Confirm `LLAMA_CPP_BASE_URL` ends in `/v1`. - Confirm the Space `LLAMA_CPP_TOKEN` secret matches the Modal `tiny-narrator-reader-brain-token` secret. - Open the Modal logs for `tiny-narrator-reader-brain`. - Check that the Modal URL is reachable at `/v1/models`. - The first request after scale-to-zero may need extra time while Modal starts the container and loads the GGUF. ### Space stuck on "Starting" - Check the Space **Container** logs. - Make sure `requirements.txt` installed successfully. - Make sure the app binds to `0.0.0.0:7860`; the defaults already do this. ### Kokoro TTS not producing audio - `libsndfile1` is installed in the Docker image. - If Kokoro fails to load, the app falls back to browser speech synthesis and transcript output. ### Modal reader-brain cold starts are slow - Keep `scaledown_window` higher in `modal_workers/reader_brain.py` if you want the container to stay warm longer. - Use a larger GPU by editing `gpu="T4"` in `modal_workers/reader_brain.py` before deploying. - Keep `min_containers=0` for cheapest operation; use a warm container only if you accept continuous cost. ### External services unreachable - HF Spaces has outbound internet access, but private endpoints are not reachable. - Use public HTTPS endpoints for Modal, MiniCPM, and any tunnels. --- ## Architecture Overview ```text Hugging Face Space (Docker CPU) app.py static HTML/CSS/JS outputs WAV files deterministic fallbacks | +--> Modal reader-brain worker | llama.cpp /v1/chat/completions | +--> Modal Klein image worker (optional) | +--> MiniCPM-V-4.6 OpenAI-compatible endpoint (optional) image descriptions and reader-brain text fallback ``` --- ## Quick Reference | Item | Value | | --- | --- | | Space SDK | Docker | | Space hardware | CPU Basic | | App port | 7860 | | Space base image | `python:3.12-slim` | | Reader brain runtime | Modal-hosted llama.cpp | | Reader brain model | `nvidia/NVIDIA-Nemotron-3-Nano-4B-GGUF:Q4_K_M` | | Reader brain base URL | `https:///v1` | | Image generation | Modal Klein worker or SVG fallback | | Image descriptions | MiniCPM-V endpoint or cached fallback | | Reader-brain fallback | MiniCPM-V chat endpoint, then deterministic narration | --- ## Further Reading - [HF Spaces Docker SDK docs](https://huggingface.co/docs/hub/spaces-sdks-docker) - [HF Spaces Secrets & Variables](https://huggingface.co/docs/hub/spaces-overview#managing-secrets) - [Modal Web Functions](https://modal.com/docs/guide/webhooks) - [Modal GPU acceleration](https://modal.com/docs/guide/gpu) - [llama.cpp repository](https://github.com/ggml-org/llama.cpp)