Kasanoma TTS v0.3 — Asante Twi + English + code-switch voice

Kasanoma is Neriqlabs' Ghanaian voice for the Neriq Live customer-service voice agent. v0.3 upgrades the backbone to CosyVoice3-0.5B (Apache-2.0, 1M training hours) and fine-tunes its LLM stage on clean-license Ghanaian Twi speech. A single voice speaks Asante Twi, English, and Twi–English code-switch via zero-shot cross-lingual cloning.

What's new vs v0.1

v0.1 was a CosyVoice2-0.5B fine-tune; v0.3 swaps in the stronger CosyVoice3 base. The biggest gains are on Asante Twi and Twi–English code-switch (the hardest, most product-critical axis). English intelligibility is roughly unchanged: the median WER is identical and the mean is slightly higher.

Objective eval: round-trip ASR WER/CER. Each held-out sentence is synthesized, transcribed with our TubaSTT v0.5 ASR, and scored against the reference text (lower = more intelligible). The test set has 15 Twi, 15 English, and 10 code-switch sentences:

| Category | v0.1 WER / CER (%) | v0.3 WER / CER (%) | Head-to-head, v0.3 vs v0.1 (better / tie / worse) |
|---|---|---|---|
| Asante Twi | 26.7 / 7.8 | 20.6 / 4.4 | 8 / 3 / 4 |
| Twi–English code-switch | 57.2 / 27.2 | 46.1 / 21.0 | 5 / 4 / 1 |
| English | 23.7 / 12.0 | 26.6 / 14.4 | 5 / 6 / 4 (median 22 → 22, unchanged) |

The Twi (−6.1 WER) and code-switch (−11.1 WER) gains are large and consistent: mean, median, and sentence-level head-to-head all agree. English is effectively flat: the median is identical (22) and there are more per-sentence wins than losses (5 vs 4); the higher mean (+2.9 WER) comes from two round-trip outliers. Round-trip WER conflates TTS and ASR error, so the absolute numbers are inflated; they are valid only as a relative intelligibility signal under a fixed ASR. Naturalness MOS calibration is pending.
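
The scoring step of the round-trip metric can be sketched as below. This is a minimal illustration, not the evaluation script used for the table: the helper names (`edit_distance`, `word_error_rate`, `char_error_rate`, `head_to_head`) are hypothetical, text normalization is assumed to be plain lowercasing plus whitespace splitting, and the transcripts are made-up toy strings standing in for ASR output of synthesized audio.

```python
from statistics import mean


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def word_error_rate(reference, hypothesis):
    """WER in percent: word-level edit distance / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    return 100.0 * edit_distance(ref, hyp) / len(ref)


def char_error_rate(reference, hypothesis):
    """CER in percent: character-level edit distance / reference length."""
    ref = list(reference.lower())
    hyp = list(hypothesis.lower())
    return 100.0 * edit_distance(ref, hyp) / len(ref)


def head_to_head(old_scores, new_scores):
    """Per-sentence (better, tie, worse) counts for the new model; lower is better."""
    pairs = list(zip(old_scores, new_scores))
    better = sum(new < old for old, new in pairs)
    tie = sum(new == old for old, new in pairs)
    worse = sum(new > old for old, new in pairs)
    return better, tie, worse


# Toy data (invented for illustration): reference text and the ASR transcript
# of each model's synthesized audio.
references = ["me PIN no reset", "please help me"]
old_transcripts = ["me pin reset", "please help me"]
new_transcripts = ["me pin no reset", "please help"]

old_wer = [word_error_rate(r, h) for r, h in zip(references, old_transcripts)]
new_wer = [word_error_rate(r, h) for r, h in zip(references, new_transcripts)]
summary = {
    "old_mean_wer": mean(old_wer),
    "new_mean_wer": mean(new_wer),
    "head_to_head": head_to_head(old_wer, new_wer),
}
```

Because both models are scored with the same ASR and the same normalization, the ASR's own errors affect both columns, which is why the comparison is read as relative rather than absolute.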

Training data (all commercially clean-license)

  • BibleTTS Asante Twi (CC-BY-SA 4.0) — studio single-speaker, ɛ/ɔ-corrected transcripts.
  • Ashesi Financial-Inclusion Speech (CC-BY) — multi-speaker, fintech domain.
  • Common Voice Twi (CC0).

In total: 36,328 utterances from 62 speakers. Only the llm (text → speech-token) stage is fine-tuned; the flow-matching model and HiFi-GAN vocoder are carried over unchanged from the base. Transcripts are NFC-normalized, and the Twi letters ɛ (U+025B) and ɔ (U+0254) are preserved.
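
A minimal sketch of the NFC handling described above, using only the Python standard library. The helper names `normalize_transcript` and `count_open_vowels` are hypothetical and not part of the training pipeline; the point is that NFC composes base-plus-combining sequences where a precomposed form exists, while ɛ and ɔ pass through unchanged.

```python
import unicodedata

OPEN_E = "\u025b"  # ɛ, LATIN SMALL LETTER OPEN E
OPEN_O = "\u0254"  # ɔ, LATIN SMALL LETTER OPEN O


def normalize_transcript(text):
    """Return the NFC form of a transcript."""
    return unicodedata.normalize("NFC", text)


def count_open_vowels(text):
    """Count ɛ and ɔ (case-insensitive) so a pipeline can check none were lost."""
    lowered = text.lower()
    return lowered.count(OPEN_E) + lowered.count(OPEN_O)


def check_preserved(raw):
    """Normalize and verify that the number of ɛ/ɔ characters is unchanged."""
    normalized = normalize_transcript(raw)
    if count_open_vowels(normalized) != count_open_vowels(raw):
        raise ValueError("open vowels were altered during normalization")
    return normalized


sample = check_preserved("mepa wo kyɛw, ɔdɔ")
composed = normalize_transcript("e\u0301")  # e + combining acute accent
```

A check like `check_preserved` is mainly useful as a guard against upstream tools that silently replace ɛ/ɔ with ASCII e/o, which would change the words the model is trained to pronounce.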

Usage

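import torchaudio
from cosyvoice.cli.cosyvoice import CosyVoice3

# Load the fine-tuned checkpoint from its model directory (TensorRT and fp16 off).
cv = CosyVoice3("kasanoma-tts-twi-v0.3", load_trt=False, fp16=False)
prompt_wav = "ref_voice.wav"            # 16 kHz reference for the voice to clone
# CosyVoice3 needs the instruct prefix before the text to synthesize:
text = "You are a helpful assistant.<|endofprompt|>me PIN no reset, please help me"
# text_frontend=False passes the text through without CosyVoice's text normalization.
for i, out in enumerate(cv.inference_cross_lingual(text, prompt_wav, stream=False, text_frontend=False)):
    audio = out["tts_speech"]           # 24 kHz waveform tensor
    torchaudio.save(f"kasanoma_{i}.wav", audio, cv.sample_rate)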
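
Since every input needs the instruct prefix, a small helper can add it consistently. This is a sketch only: `with_instruct_prefix` is a hypothetical convenience function, not part of the CosyVoice API, and it assumes the prefix string shown in the usage example is the one to use for all requests.

```python
INSTRUCT_PREFIX = "You are a helpful assistant.<|endofprompt|>"
END_OF_PROMPT = "<|endofprompt|>"


def with_instruct_prefix(text, prefix=INSTRUCT_PREFIX):
    """Prepend the instruct prefix unless the text already contains the marker."""
    text = text.strip()
    if not text:
        raise ValueError("text to synthesize is empty")
    if END_OF_PROMPT in text:
        return text  # caller already supplied a prefix; leave it untouched
    return prefix + text


tts_text = with_instruct_prefix("me PIN no reset, please help me")
```

The resulting `tts_text` would then be passed as the first argument to `inference_cross_lingual` in place of the hand-built string.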

License

The weights are released under CC-BY-SA-4.0 (inherited from BibleTTS, the most restrictive training source). The CosyVoice3 backbone is Apache-2.0. No non-commercial or undeclared-license data was used.

Limitations

Code-switch remains the weakest axis. Naturalness has not yet been MOS-calibrated. Each reference clip yields a single cloned voice.

Built by Neriqlabs (founder: Samson Nkrumah) for Ghanaian-language voice AI.

