---
title: Pocket-TTS API
emoji: 🔊
colorFrom: green
colorTo: blue
sdk: docker
app_port: 7860
pinned: false
license: cc-by-4.0
secrets:
  - name: HF_TOKEN
    description: "Hugging Face token with write access. Required to download the gated kyutai/pocket-tts model and voice WAV files from source Spaces."
---

# Pocket-TTS API

FastAPI server running [kyutai/pocket-tts](https://huggingface.co/kyutai/pocket-tts) with direct WAV/OGG audio output. No Gradio, just clean API endpoints.

## API Endpoints

| Endpoint | Description |
|---|---|
| `GET /tts?text=Hello&voice=af_alloy&format=ogg` | Generate speech (format: `wav` or `ogg`) |
| `GET /voices` | List all available voices |
| `GET /health` | Health check |

## Scripts

Utility scripts for generating audio, available in `scripts/`:

| Script | Purpose |
|---|---|
| `voice.py` | One-shot TTS: `python3 voice.py "text" [voice] [format]` or `--file input.txt` |
| `voice.sh` | Shell wrapper for voice.py |
| `voice-from-file.py` | Read text from file and generate TTS |
| `voice-chunked.py` | Split long text and generate in sequence |
| `voice-long.sh` | Shell script for long text with ffmpeg concat |
| `chunk_giselle.py` | Split `giselle_60min.txt` into ~10K char chunks on paragraph boundaries |
| `batch_tts.py` | Full batch generator with auto-restart on failure and ffmpeg concat |
| `giselle_batch.sh` | Shell batch equivalent for Giselle story generation |
| `run_giselle_batch.sh` | Sequential batch runner with retry logic |
| `restart_space.py` | Restart the HF Space via API (requires token) |

**Usage:**

```bash
# Quick one-shot
python3 scripts/voice.py "Hello world" af_alloy ogg

# From file
python3 scripts/voice.py --file story.txt scarlett_johansson ogg

# Chunk a long story and generate
python3 scripts/chunk_giselle.py
bash scripts/run_giselle_batch.sh
```

## Voices (78 total)

### Standard voices (from [Nymbo/Pocket-TTS](https://huggingface.co/spaces/Nymbo/Pocket-TTS))

54 multilingual voices including: `af_alloy`, `af_nova`,
`am_onyx`, `am_adam`, `bf_emma`, `bm_fable`, `ef_dora`, `ff_siwis`, `jf_alpha`, `zf_xiaoxiao`, and more.

Prefixes: `af` (American female), `am` (American male), `bf` (British female), `bm` (British male), `ef` (Spanish female), `em` (Spanish male), `ff` (French female), `hf` (Hindi female), `if` (Italian female), `jf` (Japanese female), `pf` (Portuguese female), `zf` (Chinese female), `zm` (Chinese male).

### Character voices (from [chandypants/ollie-pocket-tts](https://huggingface.co/spaces/chandypants/ollie-pocket-tts))

24 character voices, including: `benji`, `bertha`, `damian`, `f01_young_bright`, `f02_texas_gal`, `f03_sharp_pro`, `f04_warm_mom`, `f05_husky_mature`, `f06_perky_young`, `f07_southern_belle`, `f08_tough_cop`, `f09_elderly_sweet`, `f10_theater_kid`, `m01_deep_south`, `m02_smooth_tenor`, `m03_gruff_ny`, `m04_warm_dad`, `m05_distinguished`, `m06_young_rough`, `m07_cowboy`, `m08_fast_talker`, `m09_gentle_giant`, `m10_slick`.

## Setup

1. **Duplicate this Space** or deploy the Dockerfile
2. **Add `HF_TOKEN` secret** in Space Settings → Secrets (required for gated model access)
3. **Accept model terms** at https://huggingface.co/kyutai/pocket-tts
4. Space builds and serves on port 7860

### Keep-Alive (recommended for free Spaces)

Free Hugging Face Spaces sleep after inactivity.
Use a Cloudflare Worker cron to keep it awake:

```bash
CLOUDFLARE_WORKERS_TOKEN=your_token SPACE_HOST=your-space.hf.space python3 cloudflare-keepalive-setup.py
```

## Example Usage

```bash
# Generate OGG audio (Telegram-friendly)
curl "https://your-space.hf.space/tts?text=Hello+world&voice=af_alloy&format=ogg" -o speech.ogg

# Generate WAV audio
curl "https://your-space.hf.space/tts?text=Hello+world&voice=m07_cowboy&format=wav" -o speech.wav

# List voices
curl "https://your-space.hf.space/voices"
```

## Architecture

```
Request → FastAPI (/tts) → Pocket-TTS model → WAV audio → ffmpeg → OGG/Opus → Response
```

- **Server**: FastAPI on uvicorn
- **Model**: kyutai/pocket-tts (english_2026-04, with voice cloning)
- **Voices**: Downloaded on-demand from HF Spaces, cached in memory
- **Audio conversion**: ffmpeg (installed in Docker image) converts WAV → OGG/Opus

## Lessons Learned (Debugging Notes)

This Space went through several iterations before producing clean speech. Documenting the issues and fixes for anyone who runs into similar problems:

### 1. Gradio SDK routing issues ❌→✅

**Problem**: Initially used the Gradio SDK (duplicating Nymbo/Pocket-TTS). Adding custom API endpoints (`/api/tts`) alongside Gradio caused persistent 500 errors (`jinja2.exceptions.UndefinedError: 'None' has no attribute 'get'`). Gradio's internal SvelteKit catch-all route intercepted all custom paths.

**Attempts**:

- Mounting FastAPI under `/api` with Starlette wrapper → broke Gradio template rendering (500 error)
- Adding routes to `demo.app.routes` → `AttributeError: property 'routes' of 'App' object has no setter`
- ASGI middleware to intercept `/api/` paths → Gradio's `demo.launch()` creates its own server, ignoring the wrapped app
- Adding a Gradio button with `api_name="tts_file"` → worked but returned HLS playlists, hard to consume programmatically

**Fix**: Switched to Docker SDK with pure FastAPI server. Full control over routing, no Gradio interference.

### 2. All audio was noise (static) ❌→✅

**Problem**: Every generated audio file sounded like white noise/static, not speech. This persisted across multiple approaches.

**Root causes** (multiple, layered):

#### 2a. Wrong model variant

`TTSModel.load_model()` defaults to `language="english"` which loads the **without-voice-cloning** variant (`kyutai/pocket-tts-without-voice-cloning`). This model **cannot process voice embeddings** at all; it just generates noise when given voice conditioning.

**Fix**: `TTSModel.load_model(language="english_2026-04")` loads the full model with voice cloning support.

#### 2b. Incompatible embeddings

The `kyutai/pocket-tts` model repo provides embeddings in three formats:

- `embeddings/` (v1): Contains `audio_prompt` tensor, but using it with the wrong model variant produces noise
- `embeddings_v2/`: Pre-computed KV caches with `cache` and `current_end` keys. **Incompatible format**: produces noise/garbage even with the voice cloning model
- `embeddings_v3/`: Pre-computed KV caches with `cache` and `offset` keys. Also incompatible: the model generates until max tokens without EOS (indicates garbage output)

The Nymbo/Pocket-TTS Space works because it uses **Kokoro-82M compatible embeddings** that are different from the kyutai repo embeddings.

**Fix**: Don't use pre-computed embeddings at all. Use `model.get_state_for_audio_prompt(wav_file)` with actual WAV audio files. This is the only reliable method.

#### 2c. OGG conversion without ffmpeg

Without ffmpeg installed, pydub's `export(format="ogg", codec="libopus")` silently produced corrupted files that sounded like noise.

**Fix**: Install ffmpeg in the Docker image (`apt-get install ffmpeg`).

### 3. HLS streaming output ❌→✅

**Problem**: Gradio's streaming audio component returns HLS playlists (`.m3u8` files with `.aac` segments). The `gradio_client` downloads the playlist file but not the segments, making programmatic audio retrieval impossible.
**Fix**: FastAPI endpoint returns complete audio files directly: no streaming, no playlists.

### 4. Deep copy error ❌→✅

**Problem**: `model.generate_audio(..., copy_state=True)` internally calls `copy.deepcopy(voice_state)` which fails with `RuntimeError: Only Tensors created explicitly by the user (graph leaves) support the deepcopy protocol`.

**Fix**: Detach all tensors in the voice state to make them leaf tensors:

```python
import torch

def detach_all(obj):
    """Recursively detach and clone every tensor in a nested dict."""
    if isinstance(obj, torch.Tensor):
        # detach() drops the autograd graph; clone() gives an independent leaf tensor
        return obj.detach().clone()
    elif isinstance(obj, dict):
        return {k: detach_all(v) for k, v in obj.items()}
    else:
        # Non-tensor values pass through unchanged
        return obj

voice_state = detach_all(voice_state)
```

### 5. Gated model access ❌→✅

**Problem**: `403 Client Error: Cannot access gated repo`. The kyutai/pocket-tts model is gated: you must accept the terms on the model page before your token can download it.

**Fix**: Visit https://huggingface.co/kyutai/pocket-tts and click "Agree and access repository". Then set `HF_TOKEN` as a Space secret.

## License

See [kyutai/pocket-tts](https://huggingface.co/kyutai/pocket-tts) for model licensing (CC-BY-4.0).
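
## Python client sketch

The curl calls under Example Usage can also be made from Python. The following is a minimal sketch using only the standard library, not part of this repository: the function names `build_tts_url` and `fetch_tts` are hypothetical, the host is the same placeholder used in the curl examples, and the 120-second timeout is an arbitrary assumption. It relies only on what is documented above: `GET /tts` takes `text`, `voice`, and `format` (`wav` or `ogg`) and returns the complete audio file in the response body.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# The two output formats documented for GET /tts
ALLOWED_FORMATS = {"wav", "ogg"}


def build_tts_url(base_url: str, text: str, voice: str = "af_alloy", fmt: str = "ogg") -> str:
    """Build a GET /tts URL with percent-encoded query parameters (hypothetical helper)."""
    if fmt not in ALLOWED_FORMATS:
        raise ValueError(f"format must be one of {sorted(ALLOWED_FORMATS)}, got {fmt!r}")
    # urlencode escapes spaces, ampersands, and non-ASCII text safely
    query = urlencode({"text": text, "voice": voice, "format": fmt})
    return f"{base_url.rstrip('/')}/tts?{query}"


def fetch_tts(base_url: str, text: str, voice: str = "af_alloy",
              fmt: str = "ogg", timeout: float = 120.0) -> bytes:
    """Request speech from the Space and return the raw audio bytes (hypothetical helper)."""
    # The timeout is an assumed value; long texts may need more
    with urlopen(build_tts_url(base_url, text, voice, fmt), timeout=timeout) as resp:
        return resp.read()
```

A call such as `fetch_tts("https://your-space.hf.space", "Hello world", "m07_cowboy", "wav")` returns bytes that can be written straight to `speech.wav`. Building the query with `urlencode` rather than string concatenation keeps text containing `&`, `?`, or non-ASCII characters from corrupting the request.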