title: Pocket-TTS API
emoji: 🔊
colorFrom: green
colorTo: blue
sdk: docker
app_port: 7860
pinned: false
license: cc-by-4.0
secrets:
  - name: HF_TOKEN
    description: >-
      Hugging Face token with write access. Required to download the gated
      kyutai/pocket-tts model and voice WAV files from source Spaces.

Pocket-TTS API

FastAPI server running kyutai/pocket-tts with direct WAV/OGG audio output. No Gradio, just clean API endpoints.

API Endpoints

| Endpoint | Description |
| --- | --- |
| GET /tts?text=Hello&voice=af_alloy&format=ogg | Generate speech (format: wav or ogg) |
| GET /voices | List all available voices |
| GET /health | Health check |

Voices (78 total)

Standard voices (from Nymbo/Pocket-TTS)

54 multilingual voices including: af_alloy, af_nova, am_onyx, am_adam, bf_emma, bm_fable, ef_dora, ff_siwis, jf_alpha, zf_xiaoxiao, and more.

Prefixes: af (American female), am (American male), bf (British female), bm (British male), ef (Spanish female), em (Spanish male), ff (French female), hf (Hindi female), if (Italian female), jf (Japanese female), pf (Portuguese female), zf (Chinese female), zm (Chinese male).
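The prefix convention above (one language letter, one gender letter, an underscore, then a name) can be checked mechanically. The following is a minimal sketch, not code from this Space; `parse_voice` and the returned keys are hypothetical names chosen for illustration.

```python
import re

# Standard voice IDs look like "<language letter><f|m>_<name>", e.g. "af_alloy".
STANDARD_VOICE = re.compile(r"^(?P<lang>[a-z])(?P<gender>[fm])_(?P<name>.+)$")

def parse_voice(voice_id):
    """Split a standard voice ID into its parts; return None for other IDs."""
    match = STANDARD_VOICE.match(voice_id)
    if match is None:
        return None  # character voices such as "benji" or "m07_cowboy" land here
    return {
        "lang_code": match["lang"],
        "gender": "female" if match["gender"] == "f" else "male",
        "name": match["name"],
    }
```

For example, `parse_voice("af_alloy")` yields language code `a`, gender `female`, and name `alloy`, while character voice IDs return `None` because their second character is not `f` or `m`.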

Character voices (from chandypants/ollie-pocket-tts)

24 character voices, including: benji, bertha, damian, f01_young_bright, f02_texas_gal, f03_sharp_pro, f04_warm_mom, f05_husky_mature, f06_perky_young, f07_southern_belle, f08_tough_cop, f09_elderly_sweet, f10_theater_kid, m01_deep_south, m02_smooth_tenor, m03_gruff_ny, m04_warm_dad, m05_distinguished, m06_young_rough, m07_cowboy, m08_fast_talker, m09_gentle_giant, m10_slick.

Setup

  1. Duplicate this Space or deploy the Dockerfile
  2. Add HF_TOKEN secret in Space Settings → Secrets (required for gated model access)
  3. Accept model terms at https://huggingface.co/kyutai/pocket-tts
  4. Space builds and serves on port 7860

Keep-Alive (recommended for free Spaces)

Free Hugging Face Spaces go to sleep after a period of inactivity. Use a Cloudflare Worker cron trigger to keep the Space awake:

CLOUDFLARE_WORKERS_TOKEN=your_token SPACE_HOST=your-space.hf.space python3 cloudflare-keepalive-setup.py

Example Usage

# Generate OGG audio (Telegram-friendly)
curl "https://your-space.hf.space/tts?text=Hello+world&voice=af_alloy&format=ogg" -o speech.ogg

# Generate WAV audio
curl "https://your-space.hf.space/tts?text=Hello+world&voice=m07_cowboy&format=wav" -o speech.wav

# List voices
curl "https://your-space.hf.space/voices"
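The same requests can be made from Python with only the standard library. This is a minimal sketch: the host is the placeholder from the curl examples, and `build_tts_url` and `fetch_speech` are hypothetical helper names, not part of this Space.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Placeholder host, matching the curl examples; replace with your own Space.
BASE_URL = "https://your-space.hf.space"

def build_tts_url(text, voice="af_alloy", audio_format="ogg", base_url=BASE_URL):
    """Build the /tts URL with properly encoded query parameters."""
    if audio_format not in ("wav", "ogg"):
        raise ValueError("format must be 'wav' or 'ogg'")
    query = urlencode({"text": text, "voice": voice, "format": audio_format})
    return f"{base_url}/tts?{query}"

def fetch_speech(text, out_path, **kwargs):
    """Download generated speech to out_path and return the byte count."""
    with urlopen(build_tts_url(text, **kwargs), timeout=120) as response:
        data = response.read()
    with open(out_path, "wb") as f:
        f.write(data)
    return len(data)
```

A call such as `fetch_speech("Hello world", "speech.ogg")` would then mirror the first curl command. The 120-second timeout is an arbitrary choice that leaves room for a sleeping Space to wake up.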

Architecture

Request → FastAPI (/tts) → Pocket-TTS model → WAV audio → ffmpeg → OGG/Opus → Response
  • Server: FastAPI on uvicorn
  • Model: kyutai/pocket-tts (english_2026-04, with voice cloning)
  • Voices: Downloaded on-demand from HF Spaces, cached in memory
  • Audio conversion: ffmpeg (installed in the Docker image) converts WAV → OGG/Opus
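The "downloaded on-demand, cached in memory" behaviour for voices could look roughly like the sketch below. It is an illustration under assumptions, not the Space's actual code: `VoiceCache` and the `loader` callable are hypothetical, and in a real server the loader would download the voice WAV and build the voice state from it.

```python
import threading

class VoiceCache:
    """Load each voice at most once, on first request, and keep it in memory."""

    def __init__(self, loader):
        self._loader = loader          # hypothetical callable: voice name -> voice state
        self._states = {}
        self._lock = threading.Lock()  # guards the dict across concurrent requests

    def get(self, voice):
        # Holding the lock while loading serializes loads, which keeps the
        # sketch simple and avoids downloading the same voice twice.
        with self._lock:
            if voice not in self._states:
                self._states[voice] = self._loader(voice)
            return self._states[voice]
```

The first request for a voice pays the download cost; later requests for the same voice return the cached state immediately.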

Lessons Learned (Debugging Notes)

This Space went through several iterations before producing clean speech. The issues and fixes are documented here for anyone who runs into similar problems:

1. Gradio SDK routing issues ❌→✅

Problem: Initially used the Gradio SDK (duplicating Nymbo/Pocket-TTS). Adding custom API endpoints (/api/tts) alongside Gradio caused persistent 500 errors (jinja2.exceptions.UndefinedError: 'None' has no attribute 'get'). Gradio's internal SvelteKit catch-all route intercepted all custom paths.

Attempts:

  • Mounting FastAPI under /api with Starlette wrapper → broke Gradio template rendering (500 error)
  • Adding routes to demo.app.routes → AttributeError: property 'routes' of 'App' object has no setter
  • ASGI middleware to intercept /api/ paths → Gradio's demo.launch() creates its own server, ignoring the wrapped app
  • Adding a Gradio button with api_name="tts_file" → worked but returned HLS playlists, hard to consume programmatically

Fix: Switched to Docker SDK with pure FastAPI server. Full control over routing, no Gradio interference.

2. All audio was noise (static) ❌→✅

Problem: Every generated audio file sounded like white noise/static, not speech. This persisted across multiple approaches.

Root causes (multiple, layered):

2a. Wrong model variant

TTSModel.load_model() defaults to language="english", which loads the variant without voice cloning (kyutai/pocket-tts-without-voice-cloning). This model cannot process voice embeddings at all; it just generates noise when given voice conditioning.

Fix: TTSModel.load_model(language="english_2026-04") loads the full model with voice cloning support.

2b. Incompatible embeddings

The kyutai/pocket-tts model repo provides embeddings in three formats:

  • embeddings/ (v1): Contains an audio_prompt tensor, but using it with the wrong model variant produces noise
  • embeddings_v2/: Pre-computed KV caches with cache and current_end keys; the format is incompatible and produces noise/garbage even with the voice cloning model
  • embeddings_v3/: Pre-computed KV caches with cache and offset keys; also incompatible, the model generates until max tokens without emitting EOS (a sign of garbage output)

The Nymbo/Pocket-TTS Space works because it uses Kokoro-82M compatible embeddings that are different from the kyutai repo embeddings.

Fix: Don't use pre-computed embeddings at all. Use model.get_state_for_audio_prompt(wav_file) with actual WAV audio files. This is the only reliable method.

2c. OGG conversion without ffmpeg

Without ffmpeg installed, pydub's export(format="ogg", codec="libopus") silently produced corrupted files that sounded like noise.

Fix: Install ffmpeg in the Docker image (apt-get install ffmpeg).
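To avoid silent corruption, the conversion step can fail loudly when ffmpeg is missing. The sketch below calls ffmpeg directly through `subprocess` rather than through pydub, so it is an alternative technique and not what this Space necessarily does; `ffmpeg_ogg_command` and `wav_to_ogg` are hypothetical names.

```python
import shutil
import subprocess

def ffmpeg_ogg_command(ffmpeg="ffmpeg"):
    """Build an ffmpeg command that reads WAV on stdin and writes OGG/Opus to stdout."""
    return [
        ffmpeg, "-hide_banner", "-loglevel", "error",
        "-i", "pipe:0",     # read the WAV data from stdin
        "-c:a", "libopus",  # encode the audio stream with Opus
        "-f", "ogg",        # force the OGG container, since stdout has no file extension
        "pipe:1",           # write the result to stdout
    ]

def wav_to_ogg(wav_bytes):
    """Convert WAV bytes to OGG/Opus bytes, raising if ffmpeg is unavailable or fails."""
    ffmpeg = shutil.which("ffmpeg")
    if ffmpeg is None:
        raise RuntimeError("ffmpeg not found on PATH (e.g. apt-get install ffmpeg)")
    result = subprocess.run(
        ffmpeg_ogg_command(ffmpeg),
        input=wav_bytes,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,  # a non-zero exit raises CalledProcessError instead of returning bad data
    )
    return result.stdout
```

Checking `shutil.which("ffmpeg")` up front turns a missing binary into an explicit error at request time instead of a noisy audio file.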

3. HLS streaming output ❌→✅

Problem: Gradio's streaming audio component returns HLS playlists (.m3u8 files with .aac segments). The gradio_client downloads the playlist file but not the segments, making programmatic audio retrieval impossible.

Fix: The FastAPI endpoint returns complete audio files directly: no streaming, no playlists.

4. Deep copy error ❌→✅

Problem: model.generate_audio(..., copy_state=True) internally calls copy.deepcopy(voice_state) which fails with RuntimeError: Only Tensors created explicitly by the user (graph leaves) support the deepcopy protocol.

Fix: Detach all tensors in the voice state to make them leaf tensors:

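import torch

def detach_all(obj):
    """Recursively detach and clone every tensor in a nested voice state."""
    if isinstance(obj, torch.Tensor):
        # detach() drops the autograd history; clone() gives the copy its own storage
        return obj.detach().clone()
    elif isinstance(obj, dict):
        return {k: detach_all(v) for k, v in obj.items()}
    else:
        return obj  # non-tensor values pass through unchanged

voice_state = detach_all(voice_state)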

5. Gated model access ❌→✅

Problem: 403 Client Error: Cannot access gated repo. The kyutai/pocket-tts model is gated: you must accept the terms on the model page before your token can download it.

Fix: Visit https://huggingface.co/kyutai/pocket-tts and click "Agree and access repository". Then set HF_TOKEN as a Space secret.
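A startup check can surface a missing secret before the first download attempt. This is a minimal sketch with a hypothetical helper name, `require_hf_token`; it only verifies that the variable is set and cannot tell whether the model terms have been accepted.

```python
import os

def require_hf_token(env=None):
    """Return the HF_TOKEN value, or raise a clear error if it is missing or empty."""
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError(
            "HF_TOKEN is not set. Add it as a Space secret and accept the "
            "model terms at https://huggingface.co/kyutai/pocket-tts."
        )
    return token
```

Calling `require_hf_token()` once at server startup makes a misconfigured Space fail with a readable message rather than a 403 during model loading.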

License

See kyutai/pocket-tts for model licensing (CC-BY-4.0).