---
license: cc-by-sa-4.0
language:
- ne
tags:
- text-to-speech
- nepali
- vits
- piper
- multi-speaker
library_name: piper-tts
datasets:
- openslr/openslr
pipeline_tag: text-to-speech
---

# Nepali Voices v0 (Piper-VITS, 419 speakers)

A multi-speaker Nepali text-to-speech model in the [Piper](https://github.com/rhasspy/piper) / [piper-plus](https://github.com/ayutaz/piper-plus) format. It has 419 speaker embeddings, was trained on ~22 hours of audio, and uses a custom 65-phone Nepali inventory (NOT eSpeak).

## At a glance

| | |
|---|---|
| Architecture | VITS (multi-speaker), 77.5 M parameters |
| Phoneme inventory | Project-internal 65-phone Nepali (Khatiwada 2009) |
| Speaker count | 419 |
| Sample rate | 22 050 Hz |
| Audio quality | 22 kHz medium |
| Base | `ayousanz/piper-plus-base` (multilingual, 6 languages) |
| Fine-tune steps | ~130 800 (v2 600 epochs + v3b 200 epochs) |
| License | CC-BY-SA-4.0 (forced by training-data licenses) |

## Recommended speakers for production inference

| speaker_id | label | training utterances | notes |
|---|---|---|---|
| 399 | `slr143_F` | 554 | Cleanest studio female. **Default recommendation.** |
| 403 | `slr43_0546` | 505 | Alternate clean female (different timbre). |
| 406 | `slr43_2099` | 275 | Alternate clean female. |
| 400 | `slr143_M` | 108 | Male reference. Smaller training set, voice less stable. |
| 398 | `algenib` | 1984 | Synthetic teacher (Gemini-Flash). Under-trained at this checkpoint. |

For other speaker IDs (crowdsourced IndicVoices-R speakers, additional SLR43 voices), see `dataset.jsonl` for the full mapping. Quality varies; the four IDs above are the curated production set.

## Quick start

```python
# 1. Install piper-plus (the trainer/inference fork we used)
# 2. Install our G2P frontend (the phoneme producer; required, because eSpeak is NOT compatible)
import json

import torch
from piper_train.vits import VitsModel
from nepali_frontend.g2p import phonemizer as ph

model = VitsModel.load_from_checkpoint("model.ckpt", dataset=None).cuda().eval()

with open("config.json", encoding="utf-8") as f:
    config = json.load(f)
PIM = config["phoneme_id_map"]  # phone -> list of integer IDs


def to_ids(sentence: str) -> list[int]:
    out = [1]  # BOS
    for w in ph.phonemize_text(sentence):
        for p in w.phones:
            if p == "|":  # skip boundary markers
                continue
            out.extend(PIM.get(p, []))  # phones missing from the map are skipped silently
    out.append(2)  # EOS
    return out


text = "नेपाल हाम्रो देश हो।"
ids = torch.LongTensor(to_ids(text)).unsqueeze(0).cuda()
text_lengths = torch.LongTensor([ids.size(1)]).cuda()
sid = torch.LongTensor([399]).cuda()  # slr143_F

# Run without gradient tracking so the output can be converted to NumPy.
with torch.no_grad():
    audio = model(ids, text_lengths, scales=[0.667, 1.0, 0.0], sid=sid).cpu().numpy()
```

## Training data

| source | hours | utterances | speakers | license |
|---|---|---|---|---|
| AI4Bharat IndicVoices-R Nepali | 13.74h | 5598 | 401 | CC-BY-4.0 |
| OpenSLR SLR143 (M+F TTS) | 1.24h | 662 | 2 | CC-BY-SA-4.0 |
| OpenSLR SLR43 (multi-speaker female TTS) | 2.80h | 2064 | 18 | CC-BY-SA-4.0 |
| Gemini-Flash Algenib (synthetic teacher) | 4.47h | 1984 | 1 | Synthetic, public-release consent |
| **Total** | **~22h** | **~10 200** | **419** | **CC-BY-SA-4.0 (most restrictive)** |

## What this model does well

- Renders common Nepali phonotactic patterns cleanly across the production speakers.
- Distinguishes Nepali-specific contrasts: aspiration, retroflex/dental, oral/nasal vowels, gemination.
- Produces natural prosody on Wikipedia-style and conversational sentences.

## Known limitations

- **Rare phoneme contexts** (e.g. `ts i n` / `p ax s . ts i m` / final `r` after `h ax`) are underlearned: the model fumbles certain words such as `चीन`, `पश्चिम`, `सहर`. These contexts appear ~120-170 times in training, which is in the marginal zone for VITS articulation learning.
- **/ts/ vs /tʃ/ for च**: this model follows Khatiwada 2009 (`/ts/`). Native speakers may perceive Devanagari `च` as the more familiar `/tʃ/` ("ch") sound; this is a transcription-policy decision baked into the phoneme inventory, not a model defect.
- **No phonemic vowel length**: `ि` and `ी` both map to `i` per Khatiwada policy.
- **English / mixed-script input is not supported.** The G2P silently drops Latin-script tokens.

## License

The model is released under **CC-BY-SA-4.0** (Attribution-ShareAlike 4.0 International), the most restrictive license among the training datasets. If you redistribute or build on this model, your work must also be ShareAlike-licensed.

## Citation

```
@misc{nepali_voices_v0_2026,
  title  = {Nepali Voices v0: Multi-speaker Piper-VITS for Nepali},
  author = {Ampixa},
  year   = {2026},
  url    = {https://huggingface.co/ampixa/nepali-voices-v0}
}
```

## Acknowledgements

- [piper-plus](https://github.com/ayutaz/piper-plus) (training stack)
- [ayousanz/piper-plus-base](https://huggingface.co/ayousanz/piper-plus-base) (multilingual base)
- [OpenSLR SLR43](https://www.openslr.org/43/), [SLR143](https://www.openslr.org/143/) (audio corpora)
- [AI4Bharat IndicVoices-R](https://huggingface.co/datasets/ai4bharat/IndicVoices-R) (audio corpus)
- Khatiwada, R. (2009). *Nepali*. *Journal of the International Phonetic Association*. (phonology source)
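
## Appendix: writing audio to WAV

The Quick start snippet leaves `audio` as an in-memory array. The sketch below shows one way to write such samples to a 16-bit mono WAV file at the model's 22 050 Hz sample rate, using only the Python standard library. The helper name `write_wav` is hypothetical and not part of piper-plus. It assumes the samples are floats in roughly [-1, 1] and have already been flattened to one dimension, for example with `audio.squeeze().tolist()`, which in turn assumes the model output holds a single utterance.

```python
import array
import sys
import wave
from typing import Iterable

SAMPLE_RATE = 22050  # sample rate stated in the "At a glance" table


def write_wav(path: str, samples: Iterable[float], sample_rate: int = SAMPLE_RATE) -> int:
    """Write mono float samples as 16-bit PCM WAV and return the frame count.

    Hypothetical helper for illustration; assumes samples lie roughly in [-1, 1].
    """
    pcm = array.array("h")  # signed 16-bit integers
    for s in samples:
        s = max(-1.0, min(1.0, float(s)))  # clip so the int16 conversion cannot overflow
        pcm.append(int(round(s * 32767)))
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV data is little-endian
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)  # mono
        wf.setsampwidth(2)  # 2 bytes = 16 bits per sample
        wf.setframerate(sample_rate)
        wf.writeframes(pcm.tobytes())
    return len(pcm)


# Usage with the Quick start output (assumes `audio` holds one utterance):
# write_wav("out.wav", audio.squeeze().tolist())
```

Clipping before the integer conversion keeps occasional out-of-range samples from wrapping around, which would otherwise be audible as loud clicks.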