Pocket-TTS

hf4uwho

Comprehensive README: API docs, 78 voices, architecture, and full debugging notes

d5bd886 about 2 months ago

	---
	title: Pocket-TTS API
	emoji: 🔊
	colorFrom: green
	colorTo: blue
	sdk: docker
	app_port: 7860
	pinned: false
	license: cc-by-4.0
	secrets:
	  - name: HF_TOKEN
	    description: "Hugging Face token with write access. Required to download the gated kyutai/pocket-tts model and voice WAV files from source Spaces."
	---

	# Pocket-TTS API

	FastAPI server running [kyutai/pocket-tts](https://huggingface.co/kyutai/pocket-tts) with direct WAV/OGG audio output. No Gradio — just clean API endpoints.

	## API Endpoints

	| Endpoint | Description |
	|---|---|
	| `GET /tts?text=Hello&voice=af_alloy&format=ogg` | Generate speech (format: `wav` or `ogg`) |
	| `GET /voices` | List all available voices |
	| `GET /health` | Health check |

	## Voices (78 total)

	### Standard voices (from [Nymbo/Pocket-TTS](https://huggingface.co/spaces/Nymbo/Pocket-TTS))
	54 multilingual voices including: `af_alloy`, `af_nova`, `am_onyx`, `am_adam`, `bf_emma`, `bm_fable`, `ef_dora`, `ff_siwis`, `jf_alpha`, `zf_xiaoxiao`, and more.

	Prefixes: `af` (American female), `am` (American male), `bf` (British female), `bm` (British male), `ef` (Spanish female), `em` (Spanish male), `ff` (French female), `hf` (Hindi female), `if` (Italian female), `jf` (Japanese female), `pf` (Portuguese female), `zf` (Chinese female), `zm` (Chinese male).

	### Character voices (from [chandypants/ollie-pocket-tts](https://huggingface.co/spaces/chandypants/ollie-pocket-tts))
	24 character voices, including: `benji`, `bertha`, `damian`, `f01_young_bright`, `f02_texas_gal`, `f03_sharp_pro`, `f04_warm_mom`, `f05_husky_mature`, `f06_perky_young`, `f07_southern_belle`, `f08_tough_cop`, `f09_elderly_sweet`, `f10_theater_kid`, `m01_deep_south`, `m02_smooth_tenor`, `m03_gruff_ny`, `m04_warm_dad`, `m05_distinguished`, `m06_young_rough`, `m07_cowboy`, `m08_fast_talker`, `m09_gentle_giant`, `m10_slick`.

	## Setup

	1. Duplicate this Space or deploy the Dockerfile
	2. Add `HF_TOKEN` secret in Space Settings → Secrets (required for gated model access)
	3. Accept model terms at https://huggingface.co/kyutai/pocket-tts
	4. Space builds and serves on port 7860

	### Keep-Alive (recommended for free Spaces)

	Free Hugging Face Spaces go to sleep after a period of inactivity. A Cloudflare Worker cron trigger can keep the Space awake:

	```bash
	CLOUDFLARE_WORKERS_TOKEN=your_token SPACE_HOST=your-space.hf.space python3 cloudflare-keepalive-setup.py
	```
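
The contents of `cloudflare-keepalive-setup.py` are not reproduced here. As a rough illustration of what a keep-alive does, the sketch below pings `/health` once and reports whether the Space answered. The function name `space_is_awake` and the injectable `opener` parameter are assumptions made for illustration, not the Worker's actual logic.

```python
import urllib.request


def space_is_awake(space_host, timeout=30, opener=urllib.request.urlopen):
    """Return True if GET https://<space_host>/health answers with HTTP 200."""
    url = f"https://{space_host}/health"
    try:
        with opener(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # URLError, HTTPError, and socket timeouts are all OSError subclasses.
        return False


# A scheduler (cron, a Cloudflare Worker, ...) would call this periodically:
# space_is_awake("your-space.hf.space")
```

Passing the opener as a parameter keeps the check easy to exercise without touching the network.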

	## Example Usage

	```bash
	# Generate OGG audio (Telegram-friendly)
	curl "https://your-space.hf.space/tts?text=Hello+world&voice=af_alloy&format=ogg" -o speech.ogg

	# Generate WAV audio
	curl "https://your-space.hf.space/tts?text=Hello+world&voice=m07_cowboy&format=wav" -o speech.wav

	# List voices
	curl "https://your-space.hf.space/voices"
	```
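
For callers who prefer Python over curl, the same requests can be sketched with the standard library. The helper names (`build_tts_url`, `fetch_speech`) and the `your-space.hf.space` host are placeholders, not part of this Space; only the `/tts` path and its `text`, `voice`, and `format` parameters come from the endpoint table.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://your-space.hf.space"  # placeholder host


def build_tts_url(text, voice="af_alloy", fmt="ogg", base_url=BASE_URL):
    """Return a /tts URL with the query string percent-encoded."""
    if fmt not in ("wav", "ogg"):
        raise ValueError(f"format must be 'wav' or 'ogg', got {fmt!r}")
    query = urlencode({"text": text, "voice": voice, "format": fmt})
    return f"{base_url}/tts?{query}"


def fetch_speech(text, out_path, **kwargs):
    """Download generated speech to out_path (performs a network request)."""
    with urlopen(build_tts_url(text, **kwargs), timeout=120) as resp:
        data = resp.read()
    with open(out_path, "wb") as fh:
        fh.write(data)
    return len(data)


# Example (requires a running Space):
# fetch_speech("Hello world", "speech.wav", voice="m07_cowboy", fmt="wav")
```

Using `urlencode` rather than string concatenation means text containing `&`, `?`, or non-ASCII characters is escaped correctly.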

	## Architecture

	```
	Request → FastAPI (/tts) → Pocket-TTS model → WAV audio → ffmpeg → OGG/Opus → Response
	```

	- Server: FastAPI on uvicorn
	- Model: kyutai/pocket-tts (english_2026-04, with voice cloning)
	- Voices: Downloaded on-demand from HF Spaces, cached in memory
	- Audio conversion: ffmpeg (installed in Docker image) converts WAV → OGG/Opus
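
The WAV → OGG/Opus step in the pipeline can be sketched as a pipe through the `ffmpeg` CLI. This is an assumption about how such a conversion could be done, not the Space's actual code; the function names and the 64k bitrate are illustrative.

```python
import shutil
import subprocess


def ffmpeg_ogg_command(bitrate="64k"):
    """Build an ffmpeg command that reads WAV on stdin and writes OGG/Opus to stdout."""
    return [
        "ffmpeg", "-hide_banner", "-loglevel", "error",
        "-i", "pipe:0",      # input: WAV bytes from stdin
        "-c:a", "libopus",   # encode audio with Opus
        "-b:a", bitrate,
        "-f", "ogg",         # OGG container
        "pipe:1",            # output: stdout
    ]


def wav_to_ogg(wav_bytes, bitrate="64k"):
    """Convert WAV bytes to OGG/Opus bytes, raising instead of returning bad audio."""
    if shutil.which("ffmpeg") is None:
        raise RuntimeError("ffmpeg not found on PATH")
    result = subprocess.run(
        ffmpeg_ogg_command(bitrate),
        input=wav_bytes,
        capture_output=True,
        check=True,  # a non-zero exit raises CalledProcessError
    )
    return result.stdout
```

Checking for the binary and using `check=True` makes a missing or failing ffmpeg an explicit error rather than a file that merely sounds wrong.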

	## Lessons Learned (Debugging Notes)

	This Space went through several iterations before producing clean speech. Documenting the issues and fixes for anyone who runs into similar problems:

	### 1. Gradio SDK routing issues ❌→✅
	Problem: Initially used the Gradio SDK (duplicating Nymbo/Pocket-TTS). Adding custom API endpoints (`/api/tts`) alongside Gradio caused persistent 500 errors (`jinja2.exceptions.UndefinedError: 'None' has no attribute 'get'`). Gradio's internal SvelteKit catch-all route intercepted all custom paths.

	Attempts:
	- Mounting FastAPI under `/api` with Starlette wrapper → broke Gradio template rendering (500 error)
	- Adding routes to `demo.app.routes` → `AttributeError: property 'routes' of 'App' object has no setter`
	- ASGI middleware to intercept `/api/` paths → Gradio's `demo.launch()` creates its own server, ignoring the wrapped app
	- Adding a Gradio button with `api_name="tts_file"` → worked but returned HLS playlists, hard to consume programmatically

	Fix: Switched to Docker SDK with pure FastAPI server. Full control over routing, no Gradio interference.

	### 2. All audio was noise (static) ❌→✅
	Problem: Every generated audio file sounded like white noise/static, not speech. This persisted across multiple approaches.

	Root causes (multiple, layered):

	#### 2a. Wrong model variant
	`TTSModel.load_model()` defaults to `language="english"` which loads the without-voice-cloning variant (`kyutai/pocket-tts-without-voice-cloning`). This model cannot process voice embeddings at all — it just generates noise when given voice conditioning.

	Fix: `TTSModel.load_model(language="english_2026-04")` loads the full model with voice cloning support.

	#### 2b. Incompatible embeddings
	The `kyutai/pocket-tts` model repo provides embeddings in three formats:
	- `embeddings/` (v1): Contains `audio_prompt` tensor — but using it with the wrong model variant produces noise
	- `embeddings_v2/`: Pre-computed KV caches with `cache` and `current_end` keys — incompatible format, produces noise/garbage even with the voice cloning model
	- `embeddings_v3/`: Pre-computed KV caches with `cache` and `offset` keys — also incompatible, model generates until max tokens without EOS (indicates garbage output)

	The Nymbo/Pocket-TTS Space works because it uses Kokoro-82M compatible embeddings that are different from the kyutai repo embeddings.

	Fix: Don't use pre-computed embeddings at all. Use `model.get_state_for_audio_prompt(wav_file)` with actual WAV audio files. This is the only reliable method.
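
Because each voice state has to be computed from a WAV file, and the architecture notes say voices are cached in memory, one way to combine the two is a small thread-safe cache. `VoiceStateCache` is a hypothetical helper, not the Space's code; in a real server the `loader` would presumably wrap `model.get_state_for_audio_prompt(wav_file)`, and a stand-in loader is used here so the sketch runs without the model.

```python
import threading


class VoiceStateCache:
    """Compute each voice state once and reuse it across requests."""

    def __init__(self, loader):
        self._loader = loader   # callable: voice name -> voice state
        self._states = {}
        self._lock = threading.Lock()

    def get(self, voice):
        with self._lock:
            if voice not in self._states:
                self._states[voice] = self._loader(voice)
            return self._states[voice]


# Stand-in loader for demonstration; it only records which voices were loaded.
loaded = []


def fake_loader(voice):
    loaded.append(voice)
    return {"voice": voice}


cache = VoiceStateCache(fake_loader)
cache.get("af_alloy")
cache.get("af_alloy")  # second call is served from the cache
print(loaded)          # ['af_alloy']
```

Holding the lock while loading serializes first-time loads, which is simple and avoids computing the same voice twice at the cost of some concurrency.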

	#### 2c. OGG conversion without ffmpeg
	Without ffmpeg installed, pydub's `export(format="ogg", codec="libopus")` silently produced corrupted files that sounded like noise.

	Fix: Install ffmpeg in the Docker image (`apt-get install ffmpeg`).

	### 3. HLS streaming output ❌→✅
	Problem: Gradio's streaming audio component returns HLS playlists (`.m3u8` files with `.aac` segments). The `gradio_client` downloads the playlist file but not the segments, making programmatic audio retrieval impossible.

	Fix: FastAPI endpoint returns complete audio files directly — no streaming, no playlists.

	### 4. Deep copy error ❌→✅
	Problem: `model.generate_audio(..., copy_state=True)` internally calls `copy.deepcopy(voice_state)` which fails with `RuntimeError: Only Tensors created explicitly by the user (graph leaves) support the deepcopy protocol`.

	Fix: Detach all tensors in the voice state to make them leaf tensors:
	```python
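	import torch

	def detach_all(obj):
	    """Recursively detach and clone every tensor so the state can be deep-copied."""
	    if isinstance(obj, torch.Tensor):
	        # detach() drops the autograd history, so the result is a leaf tensor
	        return obj.detach().clone()
	    elif isinstance(obj, dict):
	        return {k: detach_all(v) for k, v in obj.items()}
	    else:
	        return obj

	voice_state = detach_all(voice_state)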
	```

	### 5. Gated model access ❌→✅
	Problem: `403 Client Error: Cannot access gated repo`. The kyutai/pocket-tts model is gated — you must accept the terms on the model page before your token can download it.

	Fix: Visit https://huggingface.co/kyutai/pocket-tts and click "Agree and access repository". Then set `HF_TOKEN` as a Space secret.
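
A missing or empty secret is easy to confuse with the gated-access error, so a startup check can separate the two cases. The sketch below is one possible approach; `require_hf_token` is a hypothetical helper, and it only verifies that the variable is set, not that the model terms were accepted.

```python
import os


def require_hf_token(env=os.environ):
    """Return the HF_TOKEN value, or raise with a pointer to the fix."""
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError(
            "HF_TOKEN is not set. Add it under Space Settings -> Secrets and "
            "accept the terms at https://huggingface.co/kyutai/pocket-tts"
        )
    return token
```

Calling this before loading the model turns a confusing download failure into an actionable message.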

	## License

	See [kyutai/pocket-tts](https://huggingface.co/kyutai/pocket-tts) for model licensing (CC-BY-4.0).