---
library_name: TTS
tags:
- text-to-speech
- hindi
- english
- bilingual
- gradio
- coqui-xtts
- hing-bert
license: mit
pipeline_tag: text-to-speech
language:
- hi
- en
---

# Hindi/English Text-to-Speech Pipeline

## Overview

This project delivers a bilingual text-to-speech (TTS) experience that accepts English or Hinglish text, detects Hindi tokens, transliterates them to Devanagari, and renders speech with a voice-cloned XTTS model. The interactive Gradio UI defined in `inference.py` is the primary entry point for end users.

Key capabilities:

- **Language identification & transliteration** – `hing_bert_module` wraps a fine-tuned Hing-BERT token classifier, dictionary overrides, and Devanagari transliteration helpers.
- **Speech synthesis** – Coqui XTTS (fine-tuned checkpoint under `xtts_Hindi_FineTuned/`) generates audio from the processed text using reference speaker WAVs.
- **User interface** – a Gradio Blocks app exposes text input, language/voice choices, and advanced sampling controls, and returns generated audio plus metadata.

## Repository layout

```
├── inference.py                 # Gradio UI and XTTS generation pipeline
├── hing_bert_module/            # Token classifier, transliteration utilities, and assets
│   ├── hing-bert-lid/           # Hugging Face model weights & tokenizer files (local)
│   └── dictionary.txt           # Mythology/Sanskrit dictionary overrides
├── xtts_Hindi_FineTuned/        # Fine-tuned XTTS checkpoint and reference voices
├── imp_scripts/
│   └── test_inference.py        # Console-driven TTS workflow (optional)
├── text_processor.py            # Standalone CLI for token tagging & transliteration (optional)
├── translitor.py                # Standalone CLI transliterator (optional)
├── requirements.txt             # Minimal dependency lock for runtime
└── README.md                    # Project documentation (this file)
```

## Prerequisites

- Windows 10/11 (project paths are Windows-oriented, though the code is portable).
- Python 3.10 (matching the fine-tuned environment used for XTTS + Hing-BERT).
- CUDA-capable GPU recommended for low-latency inference (CPU is supported but slower).
- Fine-tuned XTTS assets placed under `xtts_Hindi_FineTuned/` (includes `config.json`, checkpoints, and `speakers/Reference_*.wav`).

## Quickstart

1. **Create & activate a virtual environment**
   ```powershell
   python -m venv xtts_env_win
   .\xtts_env_win\Scripts\Activate.ps1
   ```
2. **Install dependencies**
   ```powershell
   # Upgrade pip through the interpreter so pip can replace itself on Windows
   python -m pip install --upgrade pip
   pip install -r requirements.txt
   ```
3. **Launch the Gradio app**
   ```powershell
   python inference.py
   ```
   The UI listens on `http://0.0.0.0:7860`, so open `http://localhost:7860` in a browser on the same machine (Gradio can also provide an optional public share URL). Enter text, choose a voice and language, tweak the advanced settings, and click **Generate Speech**.

## How it works

1. **Text preprocessing** (`hing_bert_module.process_text`)
   - Loads the Hing-BERT model from `hing_bert_module/hing-bert-lid/`.
   - Classifies tokens as Hindi or English and applies heuristics to boost Hindi detection.
   - Uses dictionary lookups and a Hindi transliteration model to convert detected Hindi words into Devanagari.
   - Reconstructs the final text string for speech synthesis and logs outputs to `final_output.txt`.
2. **Speech synthesis** (`TTSGenerator` in `inference.py`)
   - Initializes Coqui XTTS with the supplied fine-tuned checkpoint and reference speakers.
   - Generates audio using parameters from the UI (temperature, top-k/p, speed).
   - Writes audio to a temporary WAV file and reports processing stats.

## Optional tools

- `imp_scripts/test_inference.py`: menu-driven CLI for batch experimentation and audio preview without Gradio.
- `text_processor.py` / `translitor.py`: utility scripts for inspecting or debugging language detection & transliteration.

## Maintenance tips

- Keep `requirements.txt` in sync with the active environment (run `pip freeze` and prune to essentials as needed).
- Do **not** commit virtual environments (`xtts_env_win/`) or large checkpoints beyond repository policy.
- Periodically review `hing_bert_module/dictionary.txt` for custom transliteration entries.

## Troubleshooting

- **Model load errors** – ensure `xtts_Hindi_FineTuned/` contains the expected files and paths referenced in `TTSGenerator.reference_voices`.
- **Missing dependencies** – rerun `pip install -r requirements.txt`; verify CUDA compatibility for torch/torchaudio builds.
- **Unicode output in terminals** – scripts handle Windows UTF-8 console settings; if characters still render incorrectly, set `PYTHONUTF8=1` or use UTF-8 capable shells.
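
For the model load errors item above, a quick pre-flight check can confirm that the asset folder looks complete before launching the app. The following is a minimal sketch and not part of the repository: `check_xtts_assets` is a hypothetical helper, and the `*.pth` checkpoint pattern is an assumption, since this README names only `config.json` and `speakers/Reference_*.wav` explicitly.

```python
from pathlib import Path


def check_xtts_assets(root):
    """Return a list of human-readable problems found under the asset folder."""
    root = Path(root)
    if not root.is_dir():
        return [f"missing folder: {root}"]

    problems = []
    # config.json is listed in the prerequisites.
    if not (root / "config.json").is_file():
        problems.append("missing config.json")
    # Assumption: checkpoints use the .pth extension; adjust to the real file names.
    if not any(root.glob("*.pth")):
        problems.append("no *.pth checkpoint found")
    # Reference voices follow the speakers/Reference_*.wav pattern.
    if not any(root.glob("speakers/Reference_*.wav")):
        problems.append("no speakers/Reference_*.wav files found")
    return problems


if __name__ == "__main__":
    issues = check_xtts_assets("xtts_Hindi_FineTuned")
    print("\n".join(issues) if issues else "All expected XTTS assets were found.")
```

Run from the repository root, the script prints one line per missing item. It checks only that files are present; it does not verify that the paths in `TTSGenerator.reference_voices` match them.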