---
license: cc-by-sa-4.0
language:
  - ne
tags:
  - text-to-speech
  - nepali
  - vits
  - piper
  - multi-speaker
library_name: piper-tts
datasets:
  - openslr/openslr
pipeline_tag: text-to-speech
---

# Nepali Voices v0 (Piper-VITS, 419 speakers)

A multi-speaker Nepali text-to-speech model in the [Piper](https://github.com/rhasspy/piper) /
[piper-plus](https://github.com/ayutaz/piper-plus) format. It provides 419 speaker embeddings, was
trained on ~22 hours of audio, and uses a custom 65-phone Nepali inventory (NOT eSpeak).

## At a glance

| | |
|---|---|
| Architecture | VITS (multi-speaker), 77.5 M parameters |
| Phoneme inventory | Project-internal 65-phone Nepali (Khatiwada 2009) |
| Speaker count | 419 |
| Sample rate | 22 050 Hz |
| Audio quality | 22 kHz medium |
| Base | `ayousanz/piper-plus-base` (multilingual, 6 languages) |
| Fine-tune steps | ~130 800 (v2 600 epochs + v3b 200 epochs) |
| License | CC-BY-SA-4.0 (forced by training-data licenses) |

## Recommended speakers for production inference

| speaker_id | label | training utterances | notes |
|---|---|---|---|
| 399 | `slr143_F` | 554 | Cleanest studio female. **Default recommendation.** |
| 403 | `slr43_0546` | 505 | Alternate clean female (different timbre). |
| 406 | `slr43_2099` | 275 | Alternate clean female. |
| 400 | `slr143_M` | 108 | Male reference. Smaller training set, voice less stable. |
| 398 | `algenib` | 1984 | Synthetic teacher (Gemini-Flash). Under-trained at this checkpoint. |

For other speaker IDs (IndicVoices-R crowdsourced speakers, additional SLR43 voices), see
`dataset.jsonl` for the full mapping. Quality varies; the IDs in the table above are the curated set.
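
For convenience, the table above can be kept as a small lookup in your own code. This is a minimal sketch: the names `RECOMMENDED_SPEAKERS`, `DEFAULT_SPEAKER`, and `speaker_id` are illustrative and not part of piper-plus or this repository; only the labels and IDs come from the table.

```python
# Labels and IDs copied from the "Recommended speakers" table.
# The helper names are hypothetical, not part of any library.
RECOMMENDED_SPEAKERS = {
    "slr143_F": 399,    # default recommendation
    "slr43_0546": 403,
    "slr43_2099": 406,
    "slr143_M": 400,
    "algenib": 398,     # synthetic teacher, under-trained at this checkpoint
}
DEFAULT_SPEAKER = "slr143_F"


def speaker_id(label: str = DEFAULT_SPEAKER) -> int:
    """Return the numeric speaker ID for a recommended speaker label."""
    try:
        return RECOMMENDED_SPEAKERS[label]
    except KeyError:
        known = ", ".join(sorted(RECOMMENDED_SPEAKERS))
        raise ValueError(f"Unknown speaker label {label!r}; expected one of: {known}")
```

The returned integer is the value to wrap in the `sid` tensor at inference time.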

## Quick start

```python
# Prerequisites (install both before running):
#   1. piper-plus, the trainer/inference fork we used
#   2. our G2P frontend, the phoneme producer (required; eSpeak is NOT compatible)
import json

import torch
from piper_train.vits import VitsModel
from nepali_frontend.g2p import phonemizer as ph

model = VitsModel.load_from_checkpoint("model.ckpt", dataset=None).cuda().eval()
with open("config.json", encoding="utf-8") as f:
    config = json.load(f)
PIM = config["phoneme_id_map"]  # phoneme -> list of integer IDs

def to_ids(sentence: str) -> list[int]:
    """Convert a Nepali sentence to the phoneme-ID sequence fed to the model."""
    out = [1]  # BOS
    for w in ph.phonemize_text(sentence):
        for p in w.phones:
            if p == "|":  # "|" is not sent to the model
                continue
            out.extend(PIM.get(p, []))  # phones missing from the map are skipped
    out.append(2)  # EOS
    return out

text = "नेपाल हाम्रो देश हो।"
ids = torch.LongTensor(to_ids(text)).unsqueeze(0).cuda()
text_lengths = torch.LongTensor([ids.size(1)]).cuda()
sid = torch.LongTensor([399]).cuda()  # slr143_F
# scales = [noise_scale, length_scale, noise_w] (upstream Piper order)
with torch.no_grad():  # inference only; without this, .numpy() fails on a grad-tracking tensor
    audio = model(ids, text_lengths, scales=[0.667, 1.0, 0.0], sid=sid).cpu().numpy()
```
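
The Quick start stops at a NumPy array. To listen to the result, the samples need to be written out at the model's 22 050 Hz sample rate. The following is a minimal, standard-library-only sketch; `write_wav` is a hypothetical helper, and it assumes the model output is mono float audio in the range [-1.0, 1.0], which this card does not state explicitly.

```python
import array
import sys
import wave

SAMPLE_RATE = 22050  # from the "At a glance" table


def write_wav(path, samples, sample_rate=SAMPLE_RATE):
    """Write mono float samples as 16-bit PCM WAV.

    Assumes samples are floats in [-1.0, 1.0]; values outside are clipped.
    """
    pcm = array.array(
        "h", (int(max(-1.0, min(1.0, s)) * 32767) for s in samples)
    )
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV data is little-endian
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 16-bit
        wav.setframerate(sample_rate)
        wav.writeframes(pcm.tobytes())
```

With the `audio` array from the Quick start, a call such as `write_wav("out.wav", audio.reshape(-1))` flattens the batch and channel dimensions before writing.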

## Training data

| source | hours | utterances | speakers | license |
|---|---|---|---|---|
| AI4Bharat IndicVoices-R Nepali | 13.74h | 5598 | 401 | CC-BY-4.0 |
| OpenSLR SLR143 (M+F TTS) | 1.24h | 662 | 2 | CC-BY-SA-4.0 |
| OpenSLR SLR43 (multi-speaker female TTS) | 2.80h | 2064 | 18 | CC-BY-SA-4.0 |
| Gemini-Flash Algenib (synthetic teacher) | 4.47h | 1984 | 1 | Synthetic, public-release consent |
| **Total** | **~22h** | **~10 200** | **419** | **CC-BY-SA-4.0 (most restrictive)** |

## What this model does well

- Renders common Nepali phonotactic patterns cleanly across the production speakers.
- Distinguishes Nepali-specific contrasts: aspiration, retroflex/dental, oral/nasal vowels, gemination.
- Handles natural prosody on Wikipedia-style and conversational sentences.

## Known limitations

- **Rare phoneme contexts** (e.g. `ts i n` / `p ax s . ts i m` / final `r` after `h ax`) are
  underlearned: the model mispronounces certain words such as `चीन`, `पश्चिम`, `सहर`. These contexts
  appear only ~120-170 times in training, which is in the marginal zone for VITS articulation learning.
- **/ts/ vs /tʃ/ for च**: this model follows Khatiwada 2009 (`/ts/`). Native speakers may
  expect Devanagari `च` to sound like the more familiar `/tʃ/` ("ch"); this is a transcription-policy
  decision baked into the phoneme inventory, not a model defect.
- **No phonemic vowel length** — `ि` and `ी` both map to `i` per Khatiwada policy.
- **English / mixed-script input is not supported.** The G2P drops Latin-script tokens silently.
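
Because Latin-script tokens are dropped without any warning, it can help to check input text before synthesis. The sketch below is one possible pre-check and is not part of the G2P frontend; `find_latin_tokens` is a hypothetical helper, and it only detects runs of ASCII letters (accented Latin characters are not covered).

```python
import re

# Runs of ASCII letters; a deliberately narrow definition of "Latin-script token".
LATIN_RUN = re.compile(r"[A-Za-z]+")


def find_latin_tokens(text: str) -> list[str]:
    """Return the ASCII-letter runs in text that the G2P would likely drop."""
    return LATIN_RUN.findall(text)


sample = "नेपाल Nepal हाम्रो देश हो।"
dropped = find_latin_tokens(sample)
if dropped:
    print(f"Warning: Latin-script tokens will not be spoken: {dropped}")
```

Callers can then decide whether to reject the input, transliterate it, or accept the silent omission.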

## License

The model is released under **CC-BY-SA-4.0** (Attribution-ShareAlike 4.0 International), the most
restrictive license among the training datasets. If you redistribute this model or share work
adapted from it, you must give attribution and license the adapted material under CC-BY-SA-4.0 or a
compatible license.

## Citation

```bibtex
@misc{nepali_voices_v0_2026,
  title  = {Nepali Voices v0: Multi-speaker Piper-VITS for Nepali},
  author = {Ampixa},
  year   = {2026},
  url    = {https://huggingface.co/ampixa/nepali-voices-v0}
}
```

## Acknowledgements

- [piper-plus](https://github.com/ayutaz/piper-plus) (training stack)
- [ayousanz/piper-plus-base](https://huggingface.co/ayousanz/piper-plus-base) (multilingual base)
- [OpenSLR SLR43](https://www.openslr.org/43/), [SLR143](https://www.openslr.org/143/) (audio corpora)
- [AI4Bharat IndicVoices-R](https://huggingface.co/datasets/ai4bharat/IndicVoices-R) (audio corpus)
- Khatiwada, R. (2009). *Nepali*. *Journal of the International Phonetic Association*. (phonology source)