Kasanoma TTS v0.3 — Asante Twi + English + code-switch voice
Kasanoma is Neriqlabs' Ghanaian voice for the Neriq Live customer-service voice agent. v0.3 upgrades the backbone to CosyVoice3-0.5B (Apache-2.0, 1M training hours) and fine-tunes the LLM stage on clean-license Ghanaian Twi speech. It speaks Asante Twi, English, and Twi-English code-switch in one voice (zero-shot cross-lingual cloning).
What's new vs v0.1
v0.1 was a CosyVoice2-0.5B fine-tune. v0.3 swaps in the stronger CosyVoice3 base. The biggest gains are on Asante Twi and Twi–English code-switch (the hardest, most product-critical axis); English intelligibility is essentially unchanged (identical median WER, slightly higher mean).
Objective eval: round-trip ASR WER/CER. Each held-out sentence is synthesized, transcribed with our TubaSTT v0.5 ASR, and scored against the reference text (lower = more intelligible). The test set has 15 Twi, 15 English, and 10 code-switch sentences:
| Category | v0.1 WER / CER | v0.3 WER / CER | head-to-head (better/tie/worse) |
|---|---|---|---|
| Asante Twi | 26.7 / 7.8 | 20.6 / 4.4 | 8 / 3 / 4 |
| Twi–English code-switch | 57.2 / 27.2 | 46.1 / 21.0 | 5 / 4 / 1 |
| English | 23.7 / 12.0 | 26.6 / 14.4 | 5 / 6 / 4 (median 22→22, unchanged) |
Twi improves by 6.1 WER points and code-switch by 11.1 WER points; both gains hold across mean, median, and sentence-level head-to-head comparisons. English is effectively flat: the median is identical (22), per-sentence wins outnumber losses (5 vs 4), and the higher mean comes from two round-trip outliers. Round-trip WER conflates TTS and ASR error, so absolute numbers are inflated; the metric is valid as a relative intelligibility signal under a fixed ASR. Naturalness MOS calibration is pending.
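The scoring step of the round-trip evaluation can be sketched as below. This is a minimal illustration, not the evaluation script used for the table: it assumes whitespace tokenization, no text normalization, and per-sentence scores, and the function names (`edit_distance`, `wer`, `cer`, `head_to_head`) are hypothetical. The synthesis and ASR steps are left out; the functions only compare a reference string with an ASR transcript.

```python
from typing import Sequence, Tuple


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: minimum substitutions, insertions, and deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference item
                curr[j - 1] + 1,     # insertion of a hypothesis item
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(ref: str, hyp: str) -> float:
    """Word error rate in percent (assumes whitespace-separated words)."""
    ref_words = ref.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return 100.0 * edit_distance(ref_words, hyp.split()) / len(ref_words)


def cer(ref: str, hyp: str) -> float:
    """Character error rate in percent."""
    if not ref:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


def head_to_head(old: Sequence[float], new: Sequence[float]) -> Tuple[int, int, int]:
    """Count sentences where the new model is better / tied / worse (lower is better)."""
    better = sum(n < o for o, n in zip(old, new))
    tie = sum(n == o for o, n in zip(old, new))
    worse = sum(n > o for o, n in zip(old, new))
    return better, tie, worse
```

For example, `wer("me PIN no reset", "me PIN reset")` counts one deleted word out of four reference words. Feeding the per-sentence scores of two model versions to `head_to_head` gives a better/tie/worse tally of the kind shown in the last table column.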
Training data (all commercially clean-license)
- BibleTTS Asante Twi (CC-BY-SA 4.0) — studio single-speaker, ɛ/ɔ-corrected transcripts.
- Ashesi Financial-Inclusion Speech (CC-BY) — multi-speaker, fintech domain.
- Common Voice Twi (CC0).
36,328 utterances from 62 speakers. Only the llm (text→speech-token) stage is fine-tuned; the flow-matching and HiFi-GAN vocoder stages are reused unchanged from the base. The characters ɛ (U+025B) and ɔ (U+0254) are preserved, and text is NFC-normalized.
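A minimal sketch of what NFC normalization means for Twi text, using only Python's standard `unicodedata` module. The helper names `normalize_twi` and `has_open_vowels` are hypothetical and are not part of the training pipeline. NFC composes sequences such as `e` + combining acute into a single code point, while ɛ and ɔ have no decomposition and pass through unchanged.

```python
import unicodedata

OPEN_E = "\u025b"  # ɛ, LATIN SMALL LETTER OPEN E
OPEN_O = "\u0254"  # ɔ, LATIN SMALL LETTER OPEN O


def normalize_twi(text: str) -> str:
    """Return the text in Unicode NFC form; ɛ and ɔ are left untouched."""
    return unicodedata.normalize("NFC", text)


def has_open_vowels(text: str) -> bool:
    """True if the text contains ɛ or ɔ (a quick check that they were not stripped)."""
    return OPEN_E in text or OPEN_O in text


sample = "\u025by\u025b"               # "ɛyɛ"
assert normalize_twi(sample) == sample  # open vowels survive NFC
assert has_open_vowels(normalize_twi(sample))
```

Running input text through the same normalization before synthesis should keep it consistent with transcripts prepared this way.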
Usage
from cosyvoice.cli.cosyvoice import CosyVoice3

cv = CosyVoice3("kasanoma-tts-twi-v0.3", load_trt=False, fp16=False)
prompt_wav = "ref_voice.wav"  # 16 kHz reference for the voice to clone

# CosyVoice3 needs the instruct prefix:
text = "You are a helpful assistant.<|endofprompt|>me PIN no reset, please help me"
for out in cv.inference_cross_lingual(text, prompt_wav, stream=False, text_frontend=False):
    audio = out["tts_speech"]  # 24 kHz
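Because the instruct prefix is required, it can be convenient to add it in one place. The helper below is a hypothetical convenience function, not part of the CosyVoice API; it only builds the string in the format shown in the usage snippet and skips text that already carries the `<|endofprompt|>` marker.

```python
END_OF_PROMPT = "<|endofprompt|>"
DEFAULT_INSTRUCT = "You are a helpful assistant."  # prefix used in the snippet above


def with_instruct_prefix(text: str, instruct: str = DEFAULT_INSTRUCT) -> str:
    """Prepend the instruct prefix unless the text already contains the marker."""
    if END_OF_PROMPT in text:
        return text  # already prefixed; leave unchanged
    return f"{instruct}{END_OF_PROMPT}{text.strip()}"


# Produces the same string as the hand-written `text` in the usage snippet.
text = with_instruct_prefix("me PIN no reset, please help me")
```

The result can be passed as the first argument to `inference_cross_lingual` in place of the hand-written string.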
License
Weights are released under CC-BY-SA-4.0 (inherited from BibleTTS, the most restrictive training source). The CosyVoice3 backbone is Apache-2.0. No non-commercial or undeclared-license data was used.
Limitations
Code-switch remains the weakest axis. Naturalness is not yet MOS-calibrated. Each reference clip yields a single cloned voice. Built by Neriqlabs (founder: Samson Nkrumah) for Ghanaian-language voice AI.
Model tree for neriqlabs/kasanoma-tts-twi-v0.3
Base model
FunAudioLLM/Fun-CosyVoice3-0.5B-2512