---
title: Pocket-TTS API
emoji: 🔊
colorFrom: green
colorTo: blue
sdk: docker
app_port: 7860
pinned: false
license: cc-by-4.0
secrets:
  - name: HF_TOKEN
    description: "Hugging Face token with write access. Required to download the gated kyutai/pocket-tts model and voice WAV files from source Spaces."
---

# Pocket-TTS API

FastAPI server running [kyutai/pocket-tts](https://huggingface.co/kyutai/pocket-tts) with direct WAV/OGG audio output. No Gradio — just clean API endpoints.

## API Endpoints

| Endpoint | Description |
|---|---|
| `GET /tts?text=Hello&voice=af_alloy&format=ogg` | Generate speech (format: `wav` or `ogg`) |
| `GET /voices` | List all available voices |
| `GET /health` | Health check |
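
The same request can be made from Python. The sketch below is a minimal illustration using only the standard library; the helper names `build_tts_url` and `fetch_speech` are hypothetical and not part of this repo, and `your-space.hf.space` is a placeholder host.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def build_tts_url(host: str, text: str, voice: str = "af_alloy", fmt: str = "ogg") -> str:
    """Build a /tts URL; urlencode escapes spaces and punctuation in the text."""
    if fmt not in ("wav", "ogg"):
        raise ValueError("format must be 'wav' or 'ogg'")
    query = urlencode({"text": text, "voice": voice, "format": fmt})
    return f"https://{host}/tts?{query}"


def fetch_speech(host: str, text: str, out_path: str, **kwargs) -> None:
    """Download the generated audio to out_path (performs a network request)."""
    with urlopen(build_tts_url(host, text, **kwargs), timeout=120) as resp:
        audio = resp.read()
    with open(out_path, "wb") as f:
        f.write(audio)
```

Building the query with `urlencode` avoids hand-escaping text that contains `&`, `?`, or non-ASCII characters.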

## Scripts

Utility scripts for generating audio, available in `scripts/`:

| Script | Purpose |
|---|---|
| `voice.py` | One-shot TTS: `python3 voice.py "text" [voice] [format]` or `--file input.txt` |
| `voice.sh` | Shell wrapper for voice.py |
| `voice-from-file.py` | Read text from file and generate TTS |
| `voice-chunked.py` | Split long text and generate in sequence |
| `voice-long.sh` | Shell script for long text with ffmpeg concat |
| `chunk_giselle.py` | Split `giselle_60min.txt` into ~10K char chunks on paragraph boundaries |
| `batch_tts.py` | Full batch generator with auto-restart on failure and ffmpeg concat |
| `giselle_batch.sh` | Shell batch equivalent for Giselle story generation |
| `run_giselle_batch.sh` | Sequential batch runner with retry logic |
| `restart_space.py` | Restart the HF Space via API (requires token) |

**Usage:**
```bash
# Quick one-shot
python3 scripts/voice.py "Hello world" af_alloy ogg

# From file
python3 scripts/voice.py --file story.txt scarlett_johansson ogg

# Chunk a long story and generate
python3 scripts/chunk_giselle.py
bash scripts/run_giselle_batch.sh
```
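
The paragraph-boundary chunking mentioned in the table can be sketched as follows. This is not the code in `chunk_giselle.py`; it is a minimal illustration that assumes paragraphs are separated by blank lines, and `chunk_paragraphs` is a hypothetical name.

```python
def chunk_paragraphs(text: str, max_chars: int = 10_000) -> list[str]:
    """Greedily pack paragraphs into chunks of at most max_chars characters.

    A single paragraph longer than max_chars is kept whole as its own chunk.
    """
    chunks: list[str] = []
    current: list[str] = []
    size = 0
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        # +2 accounts for the blank line that rejoins paragraphs.
        added = len(para) + (2 if current else 0)
        if current and size + added > max_chars:
            chunks.append("\n\n".join(current))
            current, size = [], 0
            added = len(para)
        current.append(para)
        size += added
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Splitting only between paragraphs keeps sentences intact, so each chunk can be synthesized separately and the audio concatenated afterwards.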

## Voices (78 total)

### Standard voices (from [Nymbo/Pocket-TTS](https://huggingface.co/spaces/Nymbo/Pocket-TTS))
54 multilingual voices including: `af_alloy`, `af_nova`, `am_onyx`, `am_adam`, `bf_emma`, `bm_fable`, `ef_dora`, `ff_siwis`, `jf_alpha`, `zf_xiaoxiao`, and more.

Prefixes: `af` (American female), `am` (American male), `bf` (British female), `bm` (British male), `ef` (Spanish female), `em` (Spanish male), `ff` (French female), `hf` (Hindi female), `if` (Italian female), `jf` (Japanese female), `pf` (Portuguese female), `zf` (Chinese female), `zm` (Chinese male).

### Character voices (from [chandypants/ollie-pocket-tts](https://huggingface.co/spaces/chandypants/ollie-pocket-tts))
24 character voices, including: `benji`, `bertha`, `damian`, `f01_young_bright`, `f02_texas_gal`, `f03_sharp_pro`, `f04_warm_mom`, `f05_husky_mature`, `f06_perky_young`, `f07_southern_belle`, `f08_tough_cop`, `f09_elderly_sweet`, `f10_theater_kid`, `m01_deep_south`, `m02_smooth_tenor`, `m03_gruff_ny`, `m04_warm_dad`, `m05_distinguished`, `m06_young_rough`, `m07_cowboy`, `m08_fast_talker`, `m09_gentle_giant`, `m10_slick`.

## Setup

1. **Duplicate this Space** or deploy the Dockerfile
2. **Add `HF_TOKEN` secret** in Space Settings → Secrets (required for gated model access)
3. **Accept model terms** at https://huggingface.co/kyutai/pocket-tts
4. The Space then builds and serves the API on port 7860

### Keep-Alive (recommended for free Spaces)

Free Hugging Face Spaces sleep after a period of inactivity. A Cloudflare Worker cron can keep the Space awake:

```bash
CLOUDFLARE_WORKERS_TOKEN=your_token SPACE_HOST=your-space.hf.space python3 cloudflare-keepalive-setup.py
```
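
Whatever schedules it, a keep-alive boils down to a periodic HTTP request. A minimal Python equivalent is sketched below, assuming the ping targets `GET /health` and that the endpoint answers HTTP 200 when the Space is awake. The function name and the injectable `opener` parameter are illustrative, not taken from `cloudflare-keepalive-setup.py`.

```python
from urllib.request import urlopen


def ping_health(host: str, opener=urlopen, timeout: float = 30.0) -> bool:
    """Return True if GET /health answers with HTTP 200, False on any error."""
    url = f"https://{host}/health"
    try:
        with opener(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # URLError, HTTPError, and socket timeouts all subclass OSError.
        return False
```

Running this from any scheduler (cron, a CI job) at an interval shorter than the sleep timeout should have the same effect as the Worker.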

## Example Usage

```bash
# Generate OGG audio (Telegram-friendly)
curl "https://your-space.hf.space/tts?text=Hello+world&voice=af_alloy&format=ogg" -o speech.ogg

# Generate WAV audio
curl "https://your-space.hf.space/tts?text=Hello+world&voice=m07_cowboy&format=wav" -o speech.wav

# List voices
curl "https://your-space.hf.space/voices"
```

## Architecture

```
Request → FastAPI (/tts) → Pocket-TTS model → WAV audio → ffmpeg → OGG/Opus → Response
```

- **Server**: FastAPI on uvicorn
- **Model**: kyutai/pocket-tts (english_2026-04, with voice cloning)
- **Voices**: Downloaded on-demand from HF Spaces, cached in memory
- **Audio conversion**: ffmpeg (installed in Docker image) converts WAV → OGG/Opus
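
The WAV → OGG/Opus step can be sketched as a direct ffmpeg call. This is an illustration only: the server may go through pydub instead, the function names are hypothetical, and the `32k` bitrate is an assumed value rather than one taken from this repo.

```python
import subprocess


def ffmpeg_ogg_command(wav_path: str, ogg_path: str, bitrate: str = "32k") -> list[str]:
    """Build an ffmpeg command that encodes a WAV file as Opus in an OGG container."""
    return [
        "ffmpeg", "-y",      # overwrite the output file if it exists
        "-i", wav_path,      # input WAV
        "-c:a", "libopus",   # Opus audio codec
        "-b:a", bitrate,     # target bitrate (assumed value, tune as needed)
        ogg_path,
    ]


def wav_to_ogg(wav_path: str, ogg_path: str) -> None:
    """Run the conversion; check=True raises CalledProcessError on failure."""
    subprocess.run(ffmpeg_ogg_command(wav_path, ogg_path), check=True, capture_output=True)
```

Passing the command as a list avoids shell quoting issues, and `check=True` turns a failed conversion into an exception instead of a silently broken file.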

## Lessons Learned (Debugging Notes)

This Space went through several iterations before producing clean speech. The issues and fixes are documented here for anyone who runs into similar problems:

### 1. Gradio SDK routing issues ❌→✅
**Problem**: Initially used the Gradio SDK (duplicating Nymbo/Pocket-TTS). Adding custom API endpoints (`/api/tts`) alongside Gradio caused persistent 500 errors (`jinja2.exceptions.UndefinedError: 'None' has no attribute 'get'`). Gradio's internal SvelteKit catch-all route intercepted all custom paths.

**Attempts**:
- Mounting FastAPI under `/api` with Starlette wrapper → broke Gradio template rendering (500 error)
- Adding routes to `demo.app.routes` → `AttributeError: property 'routes' of 'App' object has no setter`
- ASGI middleware to intercept `/api/` paths → Gradio's `demo.launch()` creates its own server, ignoring the wrapped app
- Adding a Gradio button with `api_name="tts_file"` → worked but returned HLS playlists, hard to consume programmatically

**Fix**: Switched to Docker SDK with pure FastAPI server. Full control over routing, no Gradio interference.

### 2. All audio was noise (static) ❌→✅
**Problem**: Every generated audio file sounded like white noise/static, not speech. This persisted across multiple approaches.

**Root causes** (multiple, layered):

#### 2a. Wrong model variant
`TTSModel.load_model()` defaults to `language="english"` which loads the **without-voice-cloning** variant (`kyutai/pocket-tts-without-voice-cloning`). This model **cannot process voice embeddings** at all — it just generates noise when given voice conditioning.

**Fix**: `TTSModel.load_model(language="english_2026-04")` loads the full model with voice cloning support.

#### 2b. Incompatible embeddings
The `kyutai/pocket-tts` model repo provides embeddings in three formats:
- `embeddings/` (v1): Contains `audio_prompt` tensor — but using it with the wrong model variant produces noise
- `embeddings_v2/`: Pre-computed KV caches with `cache` and `current_end` keys — **incompatible format**, produces noise/garbage even with the voice cloning model
- `embeddings_v3/`: Pre-computed KV caches with `cache` and `offset` keys — also incompatible, model generates until max tokens without EOS (indicates garbage output)

The Nymbo/Pocket-TTS Space works because it uses **Kokoro-82M compatible embeddings** that are different from the kyutai repo embeddings.

**Fix**: Don't use pre-computed embeddings at all. Use `model.get_state_for_audio_prompt(wav_file)` with actual WAV audio files. This is the only reliable method.
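
Since computing a voice state from a WAV file has a cost, it makes sense to do it once per voice and keep the result in memory. The sketch below shows one way to do that; `VoiceStateCache` is a hypothetical helper, and `build_state` stands in for `model.get_state_for_audio_prompt` so the example does not depend on the model being installed.

```python
from typing import Any, Callable


class VoiceStateCache:
    """Compute each voice state once from its WAV file and reuse it afterwards."""

    def __init__(self, build_state: Callable[[str], Any]):
        # build_state is expected to be model.get_state_for_audio_prompt
        self._build_state = build_state
        self._states: dict[str, Any] = {}

    def get(self, voice: str, wav_path: str) -> Any:
        """Return the cached state for voice, building it on first use."""
        if voice not in self._states:
            self._states[voice] = self._build_state(wav_path)
        return self._states[voice]
```

With the real model this would be constructed as `VoiceStateCache(model.get_state_for_audio_prompt)`.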

#### 2c. OGG conversion without ffmpeg
Without ffmpeg installed, pydub's `export(format="ogg", codec="libopus")` silently produced corrupted files that sounded like noise.

**Fix**: Install ffmpeg in the Docker image (`apt-get install ffmpeg`).
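
To avoid this kind of silent failure, a server can check for the binary at startup and refuse to run without it. A minimal sketch using the standard library follows; `require_binary` is a hypothetical helper name, and the injectable `which` parameter exists only to make the function easy to test.

```python
import shutil


def require_binary(name: str = "ffmpeg", which=shutil.which) -> str:
    """Return the resolved path of an executable, or raise if it is not on PATH."""
    path = which(name)
    if path is None:
        raise RuntimeError(f"{name} not found on PATH; OGG export would be unreliable")
    return path
```

Calling `require_binary()` once before the server starts accepting requests makes a missing ffmpeg show up in the build logs rather than in the audio.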

### 3. HLS streaming output ❌→✅
**Problem**: Gradio's streaming audio component returns HLS playlists (`.m3u8` files with `.aac` segments). The `gradio_client` downloads the playlist file but not the segments, making programmatic audio retrieval impossible.

**Fix**: FastAPI endpoint returns complete audio files directly — no streaming, no playlists.
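
When debugging this kind of problem, it helps to look at the first bytes of what was actually downloaded. The sketch below distinguishes a WAV file, an OGG file, and an HLS playlist by their standard signatures (`RIFF`/`WAVE`, `OggS`, `#EXTM3U`); `sniff_audio` is a hypothetical helper, not part of this repo.

```python
def sniff_audio(data: bytes) -> str:
    """Classify a payload by its leading bytes: 'wav', 'ogg', 'hls', or 'unknown'."""
    if data[:4] == b"RIFF" and data[8:12] == b"WAVE":
        return "wav"
    if data[:4] == b"OggS":
        return "ogg"
    if data.lstrip()[:7] == b"#EXTM3U":
        return "hls"  # an .m3u8 playlist, not audio
    return "unknown"
```

A client can call this on the response body and fail early if it receives a playlist instead of audio.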

### 4. Deep copy error ❌→✅
**Problem**: `model.generate_audio(..., copy_state=True)` internally calls `copy.deepcopy(voice_state)` which fails with `RuntimeError: Only Tensors created explicitly by the user (graph leaves) support the deepcopy protocol`.

**Fix**: Detach all tensors in the voice state to make them leaf tensors:
```python
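import torch

def detach_all(obj):
    """Recursively detach and clone every tensor so each becomes a graph leaf."""
    if isinstance(obj, torch.Tensor):
        return obj.detach().clone()
    elif isinstance(obj, dict):
        return {k: detach_all(v) for k, v in obj.items()}
    else:
        return obj  # non-tensor, non-dict values pass through unchanged

# Apply to the voice state before calling generate_audio(..., copy_state=True)
voice_state = detach_all(voice_state)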
```

### 5. Gated model access ❌→✅
**Problem**: `403 Client Error: Cannot access gated repo`. The kyutai/pocket-tts model is gated — you must accept the terms on the model page before your token can download it.

**Fix**: Visit https://huggingface.co/kyutai/pocket-tts and click "Agree and access repository". Then set `HF_TOKEN` as a Space secret.
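
A missing secret is easier to diagnose if the server checks for it before attempting the download. A minimal sketch, assuming the token is exposed to the container as the `HF_TOKEN` environment variable; `get_hf_token` is a hypothetical helper name.

```python
import os


def get_hf_token(env=os.environ) -> str:
    """Return HF_TOKEN from the environment, or raise with a pointer to the fix."""
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError(
            "HF_TOKEN is not set. Add it as a Space secret and accept the model "
            "terms at https://huggingface.co/kyutai/pocket-tts"
        )
    return token
```

This only catches a missing token. A token that exists but has not been granted access still produces the 403 described above.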

## License

See [kyutai/pocket-tts](https://huggingface.co/kyutai/pocket-tts) for model licensing (CC-BY-4.0).