---
title: Pocket-TTS API
emoji: 🔊
colorFrom: green
colorTo: blue
sdk: docker
app_port: 7860
pinned: false
license: cc-by-4.0
secrets:
- name: HF_TOKEN
description: "Hugging Face token with write access. Required to download the gated kyutai/pocket-tts model and voice WAV files from source Spaces."
---
# Pocket-TTS API
FastAPI server running [kyutai/pocket-tts](https://huggingface.co/kyutai/pocket-tts) with direct WAV/OGG audio output. No Gradio, just clean API endpoints.
## API Endpoints
| Endpoint | Description |
|---|---|
| `GET /tts?text=Hello&voice=af_alloy&format=ogg` | Generate speech (format: `wav` or `ogg`) |
| `GET /voices` | List all available voices |
| `GET /health` | Health check |
## Voices (78 total)
### Standard voices (from [Nymbo/Pocket-TTS](https://huggingface.co/spaces/Nymbo/Pocket-TTS))
54 multilingual voices including: `af_alloy`, `af_nova`, `am_onyx`, `am_adam`, `bf_emma`, `bm_fable`, `ef_dora`, `ff_siwis`, `jf_alpha`, `zf_xiaoxiao`, and more.
Prefixes: `af` (American female), `am` (American male), `bf` (British female), `bm` (British male), `ef` (Spanish female), `em` (Spanish male), `ff` (French female), `hf` (Hindi female), `if` (Italian female), `jf` (Japanese female), `pf` (Portuguese female), `zf` (Chinese female), `zm` (Chinese male).
### Character voices (from [chandypants/ollie-pocket-tts](https://huggingface.co/spaces/chandypants/ollie-pocket-tts))
24 character voices, including: `benji`, `bertha`, `damian`, `f01_young_bright`, `f02_texas_gal`, `f03_sharp_pro`, `f04_warm_mom`, `f05_husky_mature`, `f06_perky_young`, `f07_southern_belle`, `f08_tough_cop`, `f09_elderly_sweet`, `f10_theater_kid`, `m01_deep_south`, `m02_smooth_tenor`, `m03_gruff_ny`, `m04_warm_dad`, `m05_distinguished`, `m06_young_rough`, `m07_cowboy`, `m08_fast_talker`, `m09_gentle_giant`, `m10_slick`.
## Setup
1. **Duplicate this Space** or deploy the Dockerfile
2. **Add `HF_TOKEN` secret** in Space Settings → Secrets (required for gated model access)
3. **Accept model terms** at https://huggingface.co/kyutai/pocket-tts
4. Space builds and serves on port 7860
### Keep-Alive (recommended for free Spaces)
Free Hugging Face Spaces go to sleep after a period of inactivity. Use a Cloudflare Worker cron to keep the Space awake:
```bash
CLOUDFLARE_WORKERS_TOKEN=your_token SPACE_HOST=your-space.hf.space python3 cloudflare-keepalive-setup.py
```
## Example Usage
```bash
# Generate OGG audio (Telegram-friendly)
curl "https://your-space.hf.space/tts?text=Hello+world&voice=af_alloy&format=ogg" -o speech.ogg
# Generate WAV audio
curl "https://your-space.hf.space/tts?text=Hello+world&voice=m07_cowboy&format=wav" -o speech.wav
# List voices
curl "https://your-space.hf.space/voices"
```
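The same requests can be made from Python. The sketch below uses only the standard library; `build_tts_url` and `fetch_speech` are hypothetical helper names, and the 60-second timeout is an assumption rather than a documented limit of the server.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def build_tts_url(base_url: str, text: str, voice: str = "af_alloy", fmt: str = "ogg") -> str:
    """Build the /tts URL with the text, voice, and format query parameters."""
    if fmt not in ("wav", "ogg"):
        raise ValueError(f"format must be 'wav' or 'ogg', got {fmt!r}")
    # urlencode escapes spaces as '+' and percent-encodes reserved characters.
    query = urlencode({"text": text, "voice": voice, "format": fmt})
    return f"{base_url.rstrip('/')}/tts?{query}"


def fetch_speech(base_url: str, text: str, out_path: str,
                 voice: str = "af_alloy", fmt: str = "ogg",
                 timeout: float = 60.0) -> None:
    """Download generated speech and write it to out_path.

    urlopen raises urllib.error.HTTPError on 4xx/5xx responses.
    The timeout value is an assumption, not a server guarantee.
    """
    url = build_tts_url(base_url, text, voice, fmt)
    with urlopen(url, timeout=timeout) as response:
        audio = response.read()
    with open(out_path, "wb") as f:
        f.write(audio)
```

For example, `fetch_speech("https://your-space.hf.space", "Hello world", "speech.ogg")` would request the same URL as the first curl command above.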
## Architecture
```
Request → FastAPI (/tts) → Pocket-TTS model → WAV audio → ffmpeg → OGG/Opus → Response
```
- **Server**: FastAPI on uvicorn
- **Model**: kyutai/pocket-tts (english_2026-04, with voice cloning)
- **Voices**: Downloaded on-demand from HF Spaces, cached in memory
- **Audio conversion**: ffmpeg (installed in Docker image) converts WAV → OGG/Opus
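The conversion step at the end of the pipeline can be sketched as a subprocess call that pipes WAV bytes through ffmpeg. This is a minimal illustration, not necessarily how this Space's server invokes ffmpeg; `ffmpeg_ogg_command` and `wav_to_ogg` are hypothetical names, and it assumes an ffmpeg build with `libopus` is on `PATH`.

```python
import subprocess


def ffmpeg_ogg_command() -> list[str]:
    """Return an ffmpeg command that reads WAV on stdin and writes OGG/Opus to stdout."""
    return [
        "ffmpeg", "-hide_banner", "-loglevel", "error",
        "-f", "wav", "-i", "pipe:0",   # input: WAV bytes from stdin
        "-c:a", "libopus",             # encode audio with the Opus codec
        "-f", "ogg", "pipe:1",         # output: OGG container to stdout
    ]


def wav_to_ogg(wav_bytes: bytes) -> bytes:
    """Convert in-memory WAV audio to OGG/Opus (assumes ffmpeg is installed)."""
    result = subprocess.run(
        ffmpeg_ogg_command(),
        input=wav_bytes,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,  # raise CalledProcessError if ffmpeg exits non-zero
    )
    return result.stdout
```

Piping through stdin and stdout avoids temporary files, and `check=True` turns a failed conversion into an exception instead of an empty or corrupted response body.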
## Lessons Learned (Debugging Notes)
This Space went through several iterations before producing clean speech. The issues and fixes are documented here for anyone who runs into similar problems:
### 1. Gradio SDK routing issues ❌→✅
**Problem**: Initially used the Gradio SDK (duplicating Nymbo/Pocket-TTS). Adding custom API endpoints (`/api/tts`) alongside Gradio caused persistent 500 errors (`jinja2.exceptions.UndefinedError: 'None' has no attribute 'get'`). Gradio's internal SvelteKit catch-all route intercepted all custom paths.
**Attempts**:
- Mounting FastAPI under `/api` with Starlette wrapper → broke Gradio template rendering (500 error)
- Adding routes to `demo.app.routes` → `AttributeError: property 'routes' of 'App' object has no setter`
- ASGI middleware to intercept `/api/` paths → Gradio's `demo.launch()` creates its own server, ignoring the wrapped app
- Adding a Gradio button with `api_name="tts_file"` → worked but returned HLS playlists, hard to consume programmatically
**Fix**: Switched to Docker SDK with pure FastAPI server. Full control over routing, no Gradio interference.
### 2. All audio was noise (static) ❌→✅
**Problem**: Every generated audio file sounded like white noise/static, not speech. This persisted across multiple approaches.
**Root causes** (multiple, layered):
#### 2a. Wrong model variant
`TTSModel.load_model()` defaults to `language="english"`, which loads the **without-voice-cloning** variant (`kyutai/pocket-tts-without-voice-cloning`). This model **cannot process voice embeddings** at all; it just generates noise when given voice conditioning.
**Fix**: `TTSModel.load_model(language="english_2026-04")` loads the full model with voice cloning support.
#### 2b. Incompatible embeddings
The `kyutai/pocket-tts` model repo provides embeddings in three formats:
- `embeddings/` (v1): Contains an `audio_prompt` tensor, but using it with the wrong model variant produces noise
- `embeddings_v2/`: Pre-computed KV caches with `cache` and `current_end` keys. **Incompatible format**: produces noise/garbage even with the voice cloning model
- `embeddings_v3/`: Pre-computed KV caches with `cache` and `offset` keys. Also incompatible: the model generates until the max-token limit without emitting EOS, which indicates garbage output
The Nymbo/Pocket-TTS Space works because it uses **Kokoro-82M compatible embeddings** that are different from the kyutai repo embeddings.
**Fix**: Don't use pre-computed embeddings at all. Use `model.get_state_for_audio_prompt(wav_file)` with actual WAV audio files. This is the only reliable method.
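Computing a voice state from a WAV file on every request would be wasteful, so the state can be computed once per voice and kept in memory. The sketch below shows one way to do that; `VoiceStateCache` is a hypothetical helper, not part of the pocket-tts API, and it takes the state-building function as a parameter so it does not depend on how the model is loaded.

```python
from typing import Any, Callable, Dict


class VoiceStateCache:
    """Compute each voice state once, then serve it from memory."""

    def __init__(self, compute_state: Callable[[str], Any]) -> None:
        # compute_state maps a WAV file path to a voice state,
        # e.g. model.get_state_for_audio_prompt (passed in by the caller).
        self._compute_state = compute_state
        self._states: Dict[str, Any] = {}

    def get(self, voice: str, wav_path: str) -> Any:
        """Return the cached state for a voice, computing it on first use."""
        if voice not in self._states:
            self._states[voice] = self._compute_state(wav_path)
        return self._states[voice]
```

With a loaded model, this might be used as `cache = VoiceStateCache(model.get_state_for_audio_prompt)` followed by `cache.get("m07_cowboy", "voices/m07_cowboy.wav")`, where the file path is illustrative.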
#### 2c. OGG conversion without ffmpeg
Without ffmpeg installed, pydub's `export(format="ogg", codec="libopus")` silently produced corrupted files that sounded like noise.
**Fix**: Install ffmpeg in the Docker image (`apt-get install ffmpeg`).
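Because the failure was silent, it can help to check for ffmpeg at startup and refuse to serve OGG without it. A minimal sketch, assuming the check runs before the first request; `require_ffmpeg` is a hypothetical helper name, and the `which` parameter exists only so the lookup can be swapped out in tests.

```python
import shutil
from typing import Callable, Optional


def require_ffmpeg(which: Callable[[str], Optional[str]] = shutil.which) -> str:
    """Return the path to ffmpeg, or raise if it is not on PATH."""
    path = which("ffmpeg")
    if path is None:
        raise RuntimeError(
            "ffmpeg not found on PATH; OGG/Opus export may produce corrupted audio"
        )
    return path
```

Calling `require_ffmpeg()` once during server startup makes a missing binary show up as a clear error in the build logs rather than as noise in the output.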
### 3. HLS streaming output ❌→✅
**Problem**: Gradio's streaming audio component returns HLS playlists (`.m3u8` files with `.aac` segments). The `gradio_client` downloads the playlist file but not the segments, making programmatic audio retrieval impossible.
**Fix**: FastAPI endpoint returns complete audio files directly: no streaming, no playlists.
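A client can tell a complete audio file from an HLS playlist by looking at the first few bytes of the download. The sketch below checks the standard container signatures (`RIFF`/`WAVE` for WAV, `OggS` for OGG, `#EXTM3U` for an M3U8 playlist); `sniff_audio_format` is a hypothetical helper name.

```python
def sniff_audio_format(data: bytes) -> str:
    """Classify downloaded bytes by their leading signature."""
    # WAV: RIFF container whose form type (bytes 8-11) is WAVE.
    if data[:4] == b"RIFF" and data[8:12] == b"WAVE":
        return "wav"
    # OGG: every page starts with the capture pattern "OggS".
    if data[:4] == b"OggS":
        return "ogg"
    # HLS: extended M3U playlists start with the "#EXTM3U" tag.
    if data.lstrip()[:7] == b"#EXTM3U":
        return "hls-playlist"
    return "unknown"
```

A result of `"hls-playlist"` means the download is only a list of segment URLs and contains no audio data itself.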
### 4. Deep copy error ❌→✅
**Problem**: `model.generate_audio(..., copy_state=True)` internally calls `copy.deepcopy(voice_state)` which fails with `RuntimeError: Only Tensors created explicitly by the user (graph leaves) support the deepcopy protocol`.
**Fix**: Detach all tensors in the voice state to make them leaf tensors:
```python
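import torch


def detach_all(obj):
    """Recursively detach and clone tensors so they become leaf tensors."""
    if isinstance(obj, torch.Tensor):
        return obj.detach().clone()
    elif isinstance(obj, dict):
        return {k: detach_all(v) for k, v in obj.items()}
    else:
        return obj


# voice_state comes from model.get_state_for_audio_prompt(...)
voice_state = detach_all(voice_state)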
```
### 5. Gated model access ❌→✅
**Problem**: `403 Client Error: Cannot access gated repo`. The kyutai/pocket-tts model is gated: you must accept the terms on the model page before your token can download it.
**Fix**: Visit https://huggingface.co/kyutai/pocket-tts and click "Agree and access repository". Then set `HF_TOKEN` as a Space secret.
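A missing secret is easier to diagnose when the server checks for it at startup instead of failing later during the model download. A minimal sketch, assuming the token is exposed to the container as the `HF_TOKEN` environment variable; `get_hf_token` is a hypothetical helper name.

```python
import os
from typing import Mapping


def get_hf_token(env: Mapping[str, str] = os.environ) -> str:
    """Return the HF_TOKEN value, or raise with a hint if it is unset."""
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError(
            "HF_TOKEN is not set. Add it as a Space secret and accept the "
            "model terms at https://huggingface.co/kyutai/pocket-tts"
        )
    return token
```

This only confirms that a token is present. A token whose account has not accepted the model terms would still get the 403 error described above.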
## License
See [kyutai/pocket-tts](https://huggingface.co/kyutai/pocket-tts) for model licensing (CC-BY-4.0).