| title: Pocket-TTS API | |
| emoji: π | |
| colorFrom: green | |
| colorTo: blue | |
| sdk: docker | |
| app_port: 7860 | |
| pinned: false | |
| license: cc-by-4.0 | |
| secrets: | |
| - name: HF_TOKEN | |
| description: "Hugging Face token with write access. Required to download the gated kyutai/pocket-tts model and voice WAV files from source Spaces." | |
| # Pocket-TTS API | |
| FastAPI server running [kyutai/pocket-tts](https://huggingface.co/kyutai/pocket-tts) with direct WAV/OGG audio output. No Gradio, just clean API endpoints. | |
| ## API Endpoints | |
| | Endpoint | Description | | |
| |---|---| | |
| | `GET /tts?text=Hello&voice=af_alloy&format=ogg` | Generate speech (format: `wav` or `ogg`) | | |
| | `GET /voices` | List all available voices | | |
| | `GET /health` | Health check | | |
| ## Voices (78 total) | |
| ### Standard voices (from [Nymbo/Pocket-TTS](https://huggingface.co/spaces/Nymbo/Pocket-TTS)) | |
| 54 multilingual voices including: `af_alloy`, `af_nova`, `am_onyx`, `am_adam`, `bf_emma`, `bm_fable`, `ef_dora`, `ff_siwis`, `jf_alpha`, `zf_xiaoxiao`, and more. | |
| Prefixes: `af` (American female), `am` (American male), `bf` (British female), `bm` (British male), `ef` (Spanish female), `em` (Spanish male), `ff` (French female), `hf` (Hindi female), `if` (Italian female), `jf` (Japanese female), `pf` (Portuguese female), `zf` (Chinese female), `zm` (Chinese male). | |
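The prefix scheme above can be decoded mechanically. The following is a minimal sketch, not part of the server: `describe_voice`, `LANGUAGES`, and `GENDERS` are hypothetical names, and reading `ef`/`em` as Spanish is an assumption based on Kokoro voice naming.

```python
from typing import Optional, Tuple

# First letter = language, second letter = gender.
# Assumption: `e` means Spanish, as in Kokoro voice naming.
LANGUAGES = {
    "a": "American English",
    "b": "British English",
    "e": "Spanish",
    "f": "French",
    "h": "Hindi",
    "i": "Italian",
    "j": "Japanese",
    "p": "Portuguese",
    "z": "Chinese",
}
GENDERS = {"f": "female", "m": "male"}


def describe_voice(voice_id: str) -> Optional[Tuple[str, str]]:
    """Return (language, gender) for a standard voice ID, or None if it does not match."""
    prefix, sep, _name = voice_id.partition("_")
    if not sep or len(prefix) != 2:
        return None  # character voices such as `benji` or `f01_young_bright`
    language = LANGUAGES.get(prefix[0])
    gender = GENDERS.get(prefix[1])
    if language is None or gender is None:
        return None
    return language, gender
```

Character voices do not follow the two-letter scheme, so the helper returns `None` for them rather than guessing.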
| ### Character voices (from [chandypants/ollie-pocket-tts](https://huggingface.co/spaces/chandypants/ollie-pocket-tts)) | |
| 24 character voices, including: `benji`, `bertha`, `damian`, `f01_young_bright`, `f02_texas_gal`, `f03_sharp_pro`, `f04_warm_mom`, `f05_husky_mature`, `f06_perky_young`, `f07_southern_belle`, `f08_tough_cop`, `f09_elderly_sweet`, `f10_theater_kid`, `m01_deep_south`, `m02_smooth_tenor`, `m03_gruff_ny`, `m04_warm_dad`, `m05_distinguished`, `m06_young_rough`, `m07_cowboy`, `m08_fast_talker`, `m09_gentle_giant`, `m10_slick`. | |
| ## Setup | |
| 1. **Duplicate this Space** or deploy the Dockerfile | |
| 2. **Add `HF_TOKEN` secret** in Space Settings → Secrets (required for gated model access) | |
| 3. **Accept model terms** at https://huggingface.co/kyutai/pocket-tts | |
| 4. Space builds and serves on port 7860 | |
| ### Keep-Alive (recommended for free Spaces) | |
| Free Hugging Face Spaces sleep after a period of inactivity. A Cloudflare Worker cron trigger can keep the Space awake: | |
| ```bash | |
| CLOUDFLARE_WORKERS_TOKEN=your_token SPACE_HOST=your-space.hf.space python3 cloudflare-keepalive-setup.py | |
| ``` | |
| ## Example Usage | |
| ```bash | |
| # Generate OGG audio (Telegram-friendly) | |
| curl "https://your-space.hf.space/tts?text=Hello+world&voice=af_alloy&format=ogg" -o speech.ogg | |
| # Generate WAV audio | |
| curl "https://your-space.hf.space/tts?text=Hello+world&voice=m07_cowboy&format=wav" -o speech.wav | |
| # List voices | |
| curl "https://your-space.hf.space/voices" | |
| ``` | |
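The same requests can be made from Python. This is a minimal client sketch using only the standard library; `build_tts_url` and `download_speech` are hypothetical helper names, and `your-space.hf.space` is the same placeholder host as in the curl examples.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def build_tts_url(base_url: str, text: str, voice: str = "af_alloy", fmt: str = "ogg") -> str:
    """Build a /tts request URL with properly escaped query parameters."""
    if fmt not in ("wav", "ogg"):
        raise ValueError("format must be 'wav' or 'ogg'")
    query = urlencode({"text": text, "voice": voice, "format": fmt})
    return f"{base_url.rstrip('/')}/tts?{query}"


def download_speech(url: str, path: str, timeout: float = 60.0) -> None:
    """Fetch the audio and write it to disk (performs a network request)."""
    with urlopen(url, timeout=timeout) as response, open(path, "wb") as out:
        out.write(response.read())
```

For example, `download_speech(build_tts_url("https://your-space.hf.space", "Hello world"), "speech.ogg")` mirrors the first curl command. Escaping through `urlencode` matters once the text contains characters such as `&` or `?`.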
| ## Architecture | |
| ``` | |
| Request → FastAPI (/tts) → Pocket-TTS model → WAV audio → ffmpeg → OGG/Opus → Response | |
| ``` | |
| - **Server**: FastAPI on uvicorn | |
| - **Model**: kyutai/pocket-tts (english_2026-04, with voice cloning) | |
| - **Voices**: Downloaded on-demand from HF Spaces, cached in memory | |
| - **Audio conversion**: ffmpeg (installed in Docker image) converts WAV → OGG/Opus | |
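The conversion step in the pipeline above could look roughly like this. It is a sketch under assumptions, not the server's actual code: the function names are hypothetical, and it pipes audio through stdin/stdout instead of temporary files.

```python
import subprocess
from typing import List


def ffmpeg_opus_command() -> List[str]:
    """Arguments for an ffmpeg process that reads WAV on stdin and writes OGG/Opus on stdout."""
    return [
        "ffmpeg", "-hide_banner", "-loglevel", "error",
        "-i", "pipe:0",      # read the WAV bytes from stdin
        "-c:a", "libopus",   # encode the audio stream with Opus
        "-f", "ogg",         # wrap it in an OGG container
        "pipe:1",            # write the result to stdout
    ]


def wav_to_ogg(wav_bytes: bytes) -> bytes:
    """Convert in-memory WAV data to OGG/Opus.

    Raises FileNotFoundError if ffmpeg is not installed and
    subprocess.CalledProcessError if ffmpeg exits with an error.
    """
    result = subprocess.run(
        ffmpeg_opus_command(),
        input=wav_bytes,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,
    )
    return result.stdout
```

Using `check=True` means a failed conversion surfaces as an exception rather than as an empty or truncated response body.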
| ## Lessons Learned (Debugging Notes) | |
| This Space went through several iterations before producing clean speech. The issues and fixes are documented here for anyone who runs into similar problems: | |
| ### 1. Gradio SDK routing issues | |
| **Problem**: Initially used the Gradio SDK (duplicating Nymbo/Pocket-TTS). Adding custom API endpoints (`/api/tts`) alongside Gradio caused persistent 500 errors (`jinja2.exceptions.UndefinedError: 'None' has no attribute 'get'`). Gradio's internal SvelteKit catch-all route intercepted all custom paths. | |
| **Attempts**: | |
| - Mounting FastAPI under `/api` with a Starlette wrapper → broke Gradio template rendering (500 error) | |
| - Adding routes to `demo.app.routes` → `AttributeError: property 'routes' of 'App' object has no setter` | |
| - ASGI middleware to intercept `/api/` paths → Gradio's `demo.launch()` creates its own server, ignoring the wrapped app | |
| - Adding a Gradio button with `api_name="tts_file"` → worked but returned HLS playlists, hard to consume programmatically | |
| **Fix**: Switched to the Docker SDK with a pure FastAPI server. Full control over routing, no Gradio interference. | |
| ### 2. All audio was noise (static) | |
| **Problem**: Every generated audio file sounded like white noise/static, not speech. This persisted across multiple approaches. | |
| **Root causes** (multiple, layered): | |
| #### 2a. Wrong model variant | |
| `TTSModel.load_model()` defaults to `language="english"`, which loads the **without-voice-cloning** variant (`kyutai/pocket-tts-without-voice-cloning`). This model **cannot process voice embeddings** at all; it just generates noise when given voice conditioning. | |
| **Fix**: `TTSModel.load_model(language="english_2026-04")` loads the full model with voice cloning support. | |
| #### 2b. Incompatible embeddings | |
| The `kyutai/pocket-tts` model repo provides embeddings in three formats: | |
| - `embeddings/` (v1): Contains an `audio_prompt` tensor, but using it with the wrong model variant produces noise | |
| - `embeddings_v2/`: Pre-computed KV caches with `cache` and `current_end` keys. **Incompatible format**: produces noise/garbage even with the voice cloning model | |
| - `embeddings_v3/`: Pre-computed KV caches with `cache` and `offset` keys. Also incompatible: the model generates until max tokens without EOS (indicates garbage output) | |
| The Nymbo/Pocket-TTS Space works because it uses **Kokoro-82M compatible embeddings** that are different from the kyutai repo embeddings. | |
| **Fix**: Don't use pre-computed embeddings at all. Use `model.get_state_for_audio_prompt(wav_file)` with actual WAV audio files. This is the only reliable method. | |
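Because computing a voice state from a WAV file is repeated work for every request with the same voice, the states can be kept in memory. The sketch below shows one way to do that; `VoiceStateCache` is a hypothetical helper, and the `compute_state` callable is injected so the example does not depend on the model library. In the server it would presumably wrap `model.get_state_for_audio_prompt`.

```python
from typing import Any, Callable, Dict


class VoiceStateCache:
    """Compute each voice state once and keep it in memory for later requests."""

    def __init__(self, compute_state: Callable[[str], Any]) -> None:
        # Assumption: compute_state takes a WAV file path and returns the
        # voice state (e.g. a wrapper around model.get_state_for_audio_prompt).
        self._compute_state = compute_state
        self._states: Dict[str, Any] = {}

    def get(self, voice: str, wav_path: str) -> Any:
        """Return the cached state for `voice`, computing it on first use."""
        if voice not in self._states:
            self._states[voice] = self._compute_state(wav_path)
        return self._states[voice]
```

The cache is unbounded, which seems reasonable for a fixed set of 78 voices but would need an eviction policy if arbitrary user uploads were allowed.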
| #### 2c. OGG conversion without ffmpeg | |
| Without ffmpeg installed, pydub's `export(format="ogg", codec="libopus")` silently produced corrupted files that sounded like noise. | |
| **Fix**: Install ffmpeg in the Docker image (`apt-get update && apt-get install -y ffmpeg`). | |
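To keep this failure from being silent again, the server could check for ffmpeg at startup and sanity-check converted output. This is a hedged sketch with hypothetical helper names; the only format fact it relies on is that OGG data begins with the capture pattern `OggS`.

```python
import shutil


def require_ffmpeg(binary: str = "ffmpeg") -> str:
    """Return the full path to ffmpeg, or raise so a missing install is not silent."""
    path = shutil.which(binary)
    if path is None:
        raise RuntimeError(f"{binary} not found on PATH; install it in the Docker image")
    return path


def looks_like_ogg(data: bytes) -> bool:
    """Cheap check: every OGG page starts with the capture pattern b'OggS'."""
    return data[:4] == b"OggS"
```

Calling `require_ffmpeg()` once when the application starts turns a confusing audio problem into an immediate, readable error. The header check does not prove the audio is valid speech, only that the container is not obviously wrong.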
| ### 3. HLS streaming output | |
| **Problem**: Gradio's streaming audio component returns HLS playlists (`.m3u8` files with `.aac` segments). The `gradio_client` downloads the playlist file but not the segments, making programmatic audio retrieval impossible. | |
| **Fix**: FastAPI endpoint returns complete audio files directly, with no streaming and no playlists. | |
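Returning a complete file means the whole WAV, header included, is built in memory before the response is sent. A minimal standard-library sketch of that step follows; `pcm_to_wav_bytes` is a hypothetical name, the input is assumed to be mono float samples in the range [-1.0, 1.0], and the 24000 Hz default is a placeholder rather than a confirmed model setting.

```python
import io
import struct
import wave


def pcm_to_wav_bytes(samples, sample_rate: int = 24000) -> bytes:
    """Pack float samples in [-1.0, 1.0] into a complete mono 16-bit WAV file."""
    # Assumption: sample_rate=24000 is a placeholder; use the rate the model reports.
    clipped = [max(-1.0, min(1.0, float(s))) for s in samples]
    frames = struct.pack(f"<{len(clipped)}h", *(int(s * 32767) for s in clipped))
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 16-bit PCM
        wav.setframerate(sample_rate)
        wav.writeframes(frames)
    return buffer.getvalue()
```

The resulting bytes can be returned as a single response body with an `audio/wav` media type, so clients receive one self-contained file.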
| ### 4. Deep copy error | |
| **Problem**: `model.generate_audio(..., copy_state=True)` internally calls `copy.deepcopy(voice_state)` which fails with `RuntimeError: Only Tensors created explicitly by the user (graph leaves) support the deepcopy protocol`. | |
| **Fix**: Detach all tensors in the voice state to make them leaf tensors: | |
| ```python | |
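| import torch | |
| def detach_all(obj): | |
|     """Recursively detach and clone tensors so deepcopy only sees leaf tensors.""" | |
|     if isinstance(obj, torch.Tensor): | |
|         return obj.detach().clone() | |
|     elif isinstance(obj, dict): | |
|         return {k: detach_all(v) for k, v in obj.items()} | |
|     else: | |
|         return obj | |
| # Apply before passing the state to model.generate_audio(..., copy_state=True) | |
| voice_state = detach_all(voice_state) | |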
| ``` | |
| ### 5. Gated model access | |
| **Problem**: `403 Client Error: Cannot access gated repo`. The kyutai/pocket-tts model is gated: you must accept the terms on the model page before your token can download it. | |
| **Fix**: Visit https://huggingface.co/kyutai/pocket-tts and click "Agree and access repository". Then set `HF_TOKEN` as a Space secret. | |
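A startup check can make a missing secret obvious before the first download attempt fails with a 403. This is a small sketch; `read_hf_token` is a hypothetical helper, and the optional `env` parameter exists only to make it easy to test.

```python
import os
from typing import Mapping, Optional


def read_hf_token(env: Optional[Mapping[str, str]] = None) -> str:
    """Return HF_TOKEN from the environment, or raise with a hint about the gated repo."""
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError(
            "HF_TOKEN is not set. Add it as a Space secret and accept the model "
            "terms at https://huggingface.co/kyutai/pocket-tts"
        )
    return token
```

A valid token alone is not enough: the account that owns it must also have accepted the model terms, which this check cannot verify.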
| ## License | |
| See [kyutai/pocket-tts](https://huggingface.co/kyutai/pocket-tts) for model licensing (CC-BY-4.0). | |