title: Pocket-TTS API
emoji: π
colorFrom: green
colorTo: blue
sdk: docker
app_port: 7860
pinned: false
license: cc-by-4.0
secrets:
  - name: HF_TOKEN
    description: >-
      Hugging Face token with write access. Required to download the gated
      kyutai/pocket-tts model and voice WAV files from source Spaces.
Pocket-TTS API
FastAPI server running kyutai/pocket-tts with direct WAV/OGG audio output. No Gradio, just clean API endpoints.
API Endpoints
| Endpoint | Description |
|---|---|
| GET /tts?text=Hello&voice=af_alloy&format=ogg | Generate speech (format: wav or ogg) |
| GET /voices | List all available voices |
| GET /health | Health check |
Voices (78 total)
Standard voices (from Nymbo/Pocket-TTS)
54 multilingual voices including: af_alloy, af_nova, am_onyx, am_adam, bf_emma, bm_fable, ef_dora, ff_siwis, jf_alpha, zf_xiaoxiao, and more.
Prefixes: af (American female), am (American male), bf (British female), bm (British male), ef (Spanish female), em (Spanish male), ff (French female), hf (Hindi female), if (Italian female), jf (Japanese female), pf (Portuguese female), zf (Chinese female), zm (Chinese male).
Character voices (from chandypants/ollie-pocket-tts)
24 character voices: benji, bertha, damian, f01_young_bright, f02_texas_gal, f03_sharp_pro, f04_warm_mom, f05_husky_mature, f06_perky_young, f07_southern_belle, f08_tough_cop, f09_elderly_sweet, f10_theater_kid, m01_deep_south, m02_smooth_tenor, m03_gruff_ny, m04_warm_dad, m05_distinguished, m06_young_rough, m07_cowboy, m08_fast_talker, m09_gentle_giant, m10_slick.
Setup
- Duplicate this Space or deploy the Dockerfile
- Add HF_TOKEN secret in Space Settings → Secrets (required for gated model access)
- Accept model terms at https://huggingface.co/kyutai/pocket-tts
- Space builds and serves on port 7860
Keep-Alive (recommended for free Spaces)
Free Hugging Face Spaces sleep after a period of inactivity. Use a Cloudflare Worker cron to keep the Space awake:
CLOUDFLARE_WORKERS_TOKEN=your_token SPACE_HOST=your-space.hf.space python3 cloudflare-keepalive-setup.py
Example Usage
# Generate OGG audio (Telegram-friendly)
curl "https://your-space.hf.space/tts?text=Hello+world&voice=af_alloy&format=ogg" -o speech.ogg
# Generate WAV audio
curl "https://your-space.hf.space/tts?text=Hello+world&voice=m07_cowboy&format=wav" -o speech.wav
# List voices
curl "https://your-space.hf.space/voices"
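The same requests can be made from Python. The following is a minimal sketch using only the standard library; the helper names `build_tts_url` and `fetch_speech` are hypothetical, and the host is the same placeholder used in the curl examples.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Placeholder host, as in the curl examples; replace with your own Space.
BASE_URL = "https://your-space.hf.space"


def build_tts_url(text: str, voice: str = "af_alloy", fmt: str = "ogg",
                  base_url: str = BASE_URL) -> str:
    """Build a /tts request URL with properly escaped query parameters."""
    if fmt not in ("wav", "ogg"):
        raise ValueError("format must be 'wav' or 'ogg'")
    query = urlencode({"text": text, "voice": voice, "format": fmt})
    return f"{base_url.rstrip('/')}/tts?{query}"


def fetch_speech(text: str, voice: str, fmt: str, out_path: str) -> None:
    """Download the generated audio to out_path (performs a network request)."""
    with urlopen(build_tts_url(text, voice, fmt), timeout=120) as resp:
        data = resp.read()
    with open(out_path, "wb") as fh:
        fh.write(data)
```

Building the query with `urlencode` avoids hand-escaping spaces and punctuation in the text parameter. The 120-second timeout is an arbitrary choice for this sketch, not a value documented by the server.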
Architecture
Request → FastAPI (/tts) → Pocket-TTS model → WAV audio → ffmpeg → OGG/Opus → Response
- Server: FastAPI on uvicorn
- Model: kyutai/pocket-tts (english_2026-04, with voice cloning)
- Voices: Downloaded on-demand from HF Spaces, cached in memory
- Audio conversion: ffmpeg (installed in Docker image) converts WAV → OGG/Opus
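The "downloaded on-demand, cached in memory" behaviour for voices could look roughly like the sketch below. This is an illustration, not the Space's actual code: `VoiceCache` is a hypothetical name, and the `loader` callable stands in for whatever downloads a voice WAV and prepares its voice state.

```python
import threading
from typing import Callable, Dict


class VoiceCache:
    """Load each voice at most once and keep the result in memory."""

    def __init__(self, loader: Callable[[str], object]) -> None:
        self._loader = loader          # hypothetical: downloads and prepares one voice
        self._cache: Dict[str, object] = {}
        self._lock = threading.Lock()  # guards the dict under concurrent requests

    def get(self, voice: str) -> object:
        with self._lock:
            if voice not in self._cache:
                # First request for this voice: pay the download cost once.
                self._cache[voice] = self._loader(voice)
            return self._cache[voice]
```

Holding the lock while loading keeps the sketch simple but serializes first-time loads of different voices; a per-voice lock would be one way to relax that.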
Lessons Learned (Debugging Notes)
This Space went through several iterations before producing clean speech. Documenting the issues and fixes for anyone who runs into similar problems:
1. Gradio SDK routing issues
Problem: Initially used the Gradio SDK (duplicating Nymbo/Pocket-TTS). Adding custom API endpoints (/api/tts) alongside Gradio caused persistent 500 errors (jinja2.exceptions.UndefinedError: 'None' has no attribute 'get'). Gradio's internal SvelteKit catch-all route intercepted all custom paths.
Attempts:
- Mounting FastAPI under /api with Starlette wrapper → broke Gradio template rendering (500 error)
- Adding routes to demo.app.routes → AttributeError: property 'routes' of 'App' object has no setter
- ASGI middleware to intercept /api/ paths → Gradio's demo.launch() creates its own server, ignoring the wrapped app
- Adding a Gradio button with api_name="tts_file" → worked but returned HLS playlists, hard to consume programmatically
Fix: Switched to the Docker SDK with a pure FastAPI server, which gives full control over routing with no Gradio interference.
2. All audio was noise (static)
Problem: Every generated audio file sounded like white noise/static, not speech. This persisted across multiple approaches.
Root causes (multiple, layered):
2a. Wrong model variant
TTSModel.load_model() defaults to language="english", which loads the variant without voice cloning (kyutai/pocket-tts-without-voice-cloning). This model cannot process voice embeddings at all; it just generates noise when given voice conditioning.
Fix: TTSModel.load_model(language="english_2026-04") loads the full model with voice cloning support.
2b. Incompatible embeddings
The kyutai/pocket-tts model repo provides embeddings in three formats:
- embeddings/ (v1): Contains audio_prompt tensor, but using it with the wrong model variant produces noise
- embeddings_v2/: Pre-computed KV caches with cache and current_end keys. Incompatible format; produces noise/garbage even with the voice cloning model
- embeddings_v3/: Pre-computed KV caches with cache and offset keys. Also incompatible; the model generates until max tokens without EOS (indicates garbage output)
The Nymbo/Pocket-TTS Space works because it uses Kokoro-82M compatible embeddings that are different from the kyutai repo embeddings.
Fix: Don't use pre-computed embeddings at all. Use model.get_state_for_audio_prompt(wav_file) with actual WAV audio files. This is the only reliable method.
2c. OGG conversion without ffmpeg
Without ffmpeg installed, pydub's export(format="ogg", codec="libopus") silently produced corrupted files that sounded like noise.
Fix: Install ffmpeg in the Docker image (apt-get install ffmpeg).
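To avoid silently serving corrupted audio, the conversion step can fail loudly when ffmpeg is missing or produces something that is not an Ogg stream. The sketch below calls ffmpeg directly through `subprocess` rather than through pydub, which is a substitution made for illustration; the function names are hypothetical.

```python
import shutil
import subprocess


def ffmpeg_opus_command(ffmpeg: str = "ffmpeg") -> list:
    """Arguments that read WAV from stdin and write OGG/Opus to stdout."""
    return [ffmpeg, "-hide_banner", "-loglevel", "error",
            "-i", "pipe:0", "-c:a", "libopus", "-f", "ogg", "pipe:1"]


def wav_to_ogg(wav_bytes: bytes) -> bytes:
    """Convert WAV bytes to OGG/Opus, raising instead of returning bad audio."""
    ffmpeg = shutil.which("ffmpeg")
    if ffmpeg is None:
        raise RuntimeError("ffmpeg not found on PATH; install it in the image")
    result = subprocess.run(ffmpeg_opus_command(ffmpeg), input=wav_bytes,
                            stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    if result.returncode != 0:
        raise RuntimeError(result.stderr.decode(errors="replace"))
    # Every Ogg page starts with the capture pattern "OggS".
    if not result.stdout.startswith(b"OggS"):
        raise RuntimeError("ffmpeg output is not an Ogg stream")
    return result.stdout
```

Checking for the binary with `shutil.which` at startup, rather than on the first request, would surface a missing ffmpeg install as soon as the container boots.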
3. HLS streaming output
Problem: Gradio's streaming audio component returns HLS playlists (.m3u8 files with .aac segments). The gradio_client downloads the playlist file but not the segments, making programmatic audio retrieval impossible.
Fix: The FastAPI endpoint returns complete audio files directly: no streaming, no playlists.
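A client can confirm that it received a complete audio file rather than a playlist by inspecting the first bytes of the response. The helper below is a hypothetical sketch based on the standard container signatures: WAV files begin with `RIFF` followed by `WAVE` at offset 8, Ogg streams begin with `OggS`, and HLS playlists begin with the text `#EXTM3U`.

```python
def sniff_audio_kind(data: bytes) -> str:
    """Classify a response body by its leading bytes."""
    if data[:4] == b"RIFF" and data[8:12] == b"WAVE":
        return "wav"
    if data[:4] == b"OggS":
        return "ogg"
    # HLS playlists are plain text; tolerate leading whitespace.
    if data.lstrip()[:7] == b"#EXTM3U":
        return "hls-playlist"
    return "unknown"
```

This only identifies the container format. It does not verify that the audio inside is intelligible speech.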
4. Deep copy error
Problem: model.generate_audio(..., copy_state=True) internally calls copy.deepcopy(voice_state) which fails with RuntimeError: Only Tensors created explicitly by the user (graph leaves) support the deepcopy protocol.
Fix: Detach all tensors in the voice state to make them leaf tensors:
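import torch

def detach_all(obj):
    """Recursively detach and clone every tensor in a nested dict."""
    if isinstance(obj, torch.Tensor):
        # detach() drops the autograd history; clone() gives an independent leaf tensor
        return obj.detach().clone()
    elif isinstance(obj, dict):
        return {k: detach_all(v) for k, v in obj.items()}
    else:
        return obj

voice_state = detach_all(voice_state)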
5. Gated model access
Problem: 403 Client Error: Cannot access gated repo. The kyutai/pocket-tts model is gated: you must accept the terms on the model page before your token can download it.
Fix: Visit https://huggingface.co/kyutai/pocket-tts and click "Agree and access repository". Then set HF_TOKEN as a Space secret.
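A startup check can turn a missing secret into a readable error before any download is attempted. This is a small sketch under the assumption that the Space secret is exposed to the container as the `HF_TOKEN` environment variable; `require_hf_token` is a hypothetical helper name. It cannot detect whether the model terms have been accepted, so a 403 is still possible with a valid token.

```python
import os


def require_hf_token(env=None) -> str:
    """Return the HF_TOKEN value or raise a clear error if it is missing."""
    env = os.environ if env is None else env
    token = (env.get("HF_TOKEN") or "").strip()
    if not token:
        raise RuntimeError(
            "HF_TOKEN is not set. Add it as a Space secret and accept the "
            "model terms at https://huggingface.co/kyutai/pocket-tts."
        )
    return token
```

Passing a plain dict as `env` makes the check easy to exercise without touching the real environment.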
License
See kyutai/pocket-tts for model licensing (CC-BY-4.0).