---
language:
- mn
license: apache-2.0
library_name: coqui
pipeline_tag: text-to-speech
tags:
- text-to-speech
- tts
- vits
- mongolian
- coqui-tts
---

# Mongolian VITS TTS

Multi-speaker [VITS](https://arxiv.org/abs/2106.06103) text-to-speech model for **Mongolian**, trained with [Coqui TTS](https://github.com/coqui-ai/TTS).

- **Architecture:** VITS (end-to-end, multi-speaker)
- **Language:** Mongolian (`mn`)
- **Sample rate:** 22050 Hz
- **Speakers:** 78 (see `speakers.pth`)
- **Checkpoint:** best model (lowest eval loss), saved at training step 241549

## Files

| File | Description |
|------|-------------|
| `best_model.pth` | Best VITS checkpoint (lowest eval loss) |
| `config.json` | Coqui TTS training/inference config |
| `speakers.pth` | Speaker-name-to-ID map used by the speaker manager |
| `tensorboard/` | TensorBoard event files (training curves) |

## Usage

```python
from huggingface_hub import hf_hub_download
from TTS.utils.synthesizer import Synthesizer

repo = "Bokhbat/mongolian-vits-tts"

# Download the checkpoint, config, and speaker map from the Hub (cached locally).
model_path = hf_hub_download(repo, "best_model.pth")
config_path = hf_hub_download(repo, "config.json")
speakers_path = hf_hub_download(repo, "speakers.pth")

synth = Synthesizer(
    tts_checkpoint=model_path,
    tts_config_path=config_path,
    tts_speakers_file=speakers_path,
    use_cuda=False,
)

# Pick a speaker by name; all 78 are listed in synth.tts_model.speaker_manager.speaker_names.
speaker = synth.tts_model.speaker_manager.speaker_names[0]
wav = synth.tts("Сайн байна уу?", speaker_name=speaker)  # "Hello" in Mongolian
synth.save_wav(wav, "out.wav")
```

## Training metrics

TensorBoard logs are included under `tensorboard/` and render in the **Training metrics** tab of this repository.
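If loading fails in the Usage snippet because `config.json` still points `speakers_file` at a path from the training machine, one workaround is to write a local copy of the config that points at the downloaded `speakers.pth`. The sketch below assumes the path is stored under a top-level `speakers_file` key and/or under `model_args.speakers_file`, as Coqui VITS configs typically do; check your `config.json` before relying on it. The helper name `write_local_config` and the output filename `config.local.json` are hypothetical. It writes a new file instead of editing the downloaded one, because `hf_hub_download` returns a path inside the shared Hub cache.

```python
import json
from pathlib import Path


def write_local_config(config_path: str, speakers_path: str,
                       out_path: str = "config.local.json") -> str:
    """Copy a Coqui-style config.json, repointing speakers_file at speakers_path.

    Hypothetical helper. Assumes speakers_file may appear at the top level
    and/or inside "model_args". Returns the path of the new config file.
    """
    config = json.loads(Path(config_path).read_text(encoding="utf-8"))
    for section in (config, config.get("model_args")):
        # Only rewrite entries that already name a file; leave null/missing ones alone.
        if (isinstance(section, dict)
                and isinstance(section.get("speakers_file"), str)
                and section["speakers_file"]):
            section["speakers_file"] = speakers_path
    # ensure_ascii=False keeps any Mongolian characters in the config readable.
    Path(out_path).write_text(json.dumps(config, indent=4, ensure_ascii=False),
                              encoding="utf-8")
    return out_path
```

Then pass `tts_config_path=write_local_config(config_path, speakers_path)` to `Synthesizer` in place of `config_path`.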
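To confirm that the `out.wav` produced by the Usage snippet has the 22050 Hz sample rate listed above, the header can be read with Python's standard-library `wave` module. This is a minimal sketch that assumes the file is integer PCM (which `wave` can read; it cannot parse float or compressed WAVs). The helper name `describe_wav` is hypothetical.

```python
import wave


def describe_wav(path: str) -> dict:
    """Return basic header fields of a PCM WAV file (hypothetical helper)."""
    with wave.open(path, "rb") as wf:
        rate = wf.getframerate()
        return {
            "sample_rate": rate,
            "channels": wf.getnchannels(),
            "sample_width_bytes": wf.getsampwidth(),
            # Duration in seconds = number of frames / frames per second.
            "duration_s": wf.getnframes() / rate,
        }
```

For example, `describe_wav("out.wav")["sample_rate"]` should equal 22050 for audio from this model.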