---
language:
- mn
license: apache-2.0
library_name: coqui
pipeline_tag: text-to-speech
tags:
- text-to-speech
- tts
- vits
- mongolian
- coqui-tts
---

# Mongolian VITS TTS

Multi-speaker [VITS](https://arxiv.org/abs/2106.06103) text-to-speech model for **Mongolian**, trained with [Coqui TTS](https://github.com/coqui-ai/TTS).

- **Architecture:** VITS (end-to-end, multi-speaker)
- **Language:** Mongolian (`mn`)
- **Sample rate:** 22050 Hz
- **Speakers:** 78 (see `speakers.pth`)
- **Checkpoint:** best model at training step 241549

## Files

| File | Description |
|------|-------------|
| `best_model.pth` | Best VITS checkpoint (lowest evaluation loss) |
| `config.json` | Coqui TTS training/inference config |
| `speakers.pth` | Speaker manager / speaker id map |
| `tensorboard/` | TensorBoard event files (training curves) |

## Usage

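```python
from huggingface_hub import hf_hub_download
from TTS.utils.synthesizer import Synthesizer

# Download the checkpoint, config, and speaker map from the Hub (cached locally).
repo = "Bokhbat/mongolian-vits-tts"
model_path = hf_hub_download(repo, "best_model.pth")
config_path = hf_hub_download(repo, "config.json")
speakers_path = hf_hub_download(repo, "speakers.pth")

# Set use_cuda=True to run inference on a GPU.
synth = Synthesizer(
    tts_checkpoint=model_path,
    tts_config_path=config_path,
    tts_speakers_file=speakers_path,
    use_cuda=False,
)

# A multi-speaker model needs a speaker name; this example uses the first one.
speaker = synth.tts_model.speaker_manager.speaker_names[0]
wav = synth.tts("Сайн байна уу?", speaker_name=speaker)  # Mongolian greeting ("Hello")
synth.save_wav(wav, "out.wav")
```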
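
Indexing `speaker_names[0]` selects whichever speaker happens to be listed first. The sketch below shows two small helpers that may be useful around the snippet above. Both are illustrative and not part of Coqui TTS: `pick_speaker` falls back to the first available speaker when a requested name is unknown, and `duration_seconds` estimates clip length, assuming the 22050 Hz sample rate listed above and that the synthesized audio is a flat sequence of samples.

```python
from typing import Optional, Sequence

# Sample rate stated in this model card; adjust if config.json says otherwise.
SAMPLE_RATE_HZ = 22050


def pick_speaker(speaker_names: Sequence[str], preferred: Optional[str] = None) -> str:
    """Return `preferred` if the model knows it, otherwise the first speaker."""
    if not speaker_names:
        raise ValueError("the model exposes no speaker names")
    if preferred is not None and preferred in speaker_names:
        return preferred
    return speaker_names[0]


def duration_seconds(wav: Sequence[float], sample_rate: int = SAMPLE_RATE_HZ) -> float:
    """Length of a mono waveform in seconds (number of samples / sample rate)."""
    return len(wav) / sample_rate
```

With the `synth` object from the usage snippet, this could be called as `pick_speaker(synth.tts_model.speaker_manager.speaker_names, "some_speaker")`, where `"some_speaker"` is a placeholder rather than a name known to exist in `speakers.pth`.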

## Training metrics

TensorBoard logs are included under `tensorboard/` and render in the
**Training metrics** tab of this repository.