VoxCPM — AfriSpeech Multilingual TTS (50 African Languages)

A full fine-tune of openbmb/VoxCPM-0.5B on all 50 language subsets of AfriSpeech/africa-speech. The subsets were merged into a single training manifest of mono 16 kHz WAV audio.

Try it live: AfriSpeech/VoxCPM-AfriSpeech

Supported languages

Afar, Akan (Twi), Amharic, Baoule, Bemba, Burkina Faso Fulfulde, Dan, Ewe, Fon, Fulani, Ganda (Luganda), Hausa, Igbo, Jola-Kasa, Kalanga, Kalenjin, Kikuyu, Lingala, Lozi, Luba-Lulua, Makhuwa-Shirima, Malgache, Mankanya, Mbunda, Mende, Mossi, Ngambay, Northeastern Dinka, Nyanja, Oromo (Borana-Arsi-Guji), Pular, Punu, Rundi (Kirundi), Rwandan (Kinyarwanda), Sango, Shilluk, Shona, Somali, Sukuma, Swahili, Tarifit, Tashelhayt, Tigrinya, Tiv, Tumbuka, West Central Oromo, Western Niger Fulfulde, Wolof (Senegal), Yaka, Yoruba.

Training details


Base model	openbmb/VoxCPM-0.5B (full fine-tune, no LoRA)
Data	all 50 configs of AfriSpeech/africa-speech, mono 16 kHz WAV
Epochs	2 (12,364 optimizer steps, effective batch size 16)
LR / warmup	1e-5 / 200 steps
Hardware	1× A100-80GB (Modal)
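
As a rough sanity check on the figures above, the step count and effective batch size imply an approximate size for the merged manifest. The sketch below assumes that every optimizer step consumes exactly one full effective batch of 16 utterances and that both epochs contain the same number of steps. Neither assumption is stated in the training details, so treat the result as an upper-bound estimate rather than a reported dataset size.

```python
# Figures copied from the training details table.
OPTIMIZER_STEPS = 12_364
EFFECTIVE_BATCH = 16
EPOCHS = 2
WARMUP_STEPS = 200


def estimate_utterances_per_epoch(steps: int, batch: int, epochs: int) -> int:
    """Estimate utterances per epoch, assuming one full batch per step (an assumption)."""
    return steps * batch // epochs


def warmup_fraction(warmup_steps: int, total_steps: int) -> float:
    """Share of all optimizer steps spent in learning-rate warmup."""
    return warmup_steps / total_steps


utterances = estimate_utterances_per_epoch(OPTIMIZER_STEPS, EFFECTIVE_BATCH, EPOCHS)
fraction = warmup_fraction(WARMUP_STEPS, OPTIMIZER_STEPS)

print(f"~{utterances:,} utterances per epoch")       # ~98,912 utterances per epoch
print(f"warmup covers {fraction:.1%} of training")   # warmup covers 1.6% of training
```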

Weights are stored in pytorch_model.bin as a dictionary of the form {"state_dict": ...}, which is the format VoxCPM.from_pretrained() expects. Optimizer and scheduler states are not included.
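
To inspect the weights outside of VoxCPM, the wrapper has to be removed first. The following is a minimal sketch, assuming the file is a standard PyTorch pickle that torch.load can read. The helper names unwrap_state_dict and load_weights are hypothetical and are not part of the voxcpm package.

```python
from typing import Any, Mapping


def unwrap_state_dict(checkpoint: Mapping[str, Any]) -> Mapping[str, Any]:
    """Return the parameter mapping, whether or not it is wrapped in {"state_dict": ...}."""
    inner = checkpoint.get("state_dict")
    if isinstance(inner, Mapping):
        return inner
    return checkpoint


def load_weights(path: str) -> Mapping[str, Any]:
    """Load a checkpoint on CPU and strip the wrapper (hypothetical helper)."""
    import torch  # imported lazily so unwrap_state_dict works without PyTorch

    # weights_only=True restricts unpickling to tensors and plain containers.
    checkpoint = torch.load(path, map_location="cpu", weights_only=True)
    return unwrap_state_dict(checkpoint)
```

Calling load_weights("pytorch_model.bin") on a local copy of the file should then return a mapping from parameter names to tensors, which can be used to list layer names or count parameters.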

Usage

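from voxcpm import VoxCPM
import soundfile as sf

# Download the fine-tuned weights from the Hub (or reuse the local cache).
model = VoxCPM.from_pretrained("AfriSpeech/voxcpm-afrispeech-full-inference-20260606")

# Synthesize a Swahili sentence; generate() returns the waveform samples.
wav = model.generate(
    text="Karibu sana! Tunafurahi kukuona hapa leo.",
    inference_timesteps=10,
    cfg_value=2.0,
)

# Write the result as a 16 kHz WAV file.
sf.write("out.wav", wav, 16000)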

For voice cloning, pass prompt_wav_path and prompt_text (a 3–10 s reference clip and its transcript) to model.generate(...).
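
A minimal voice-cloning sketch follows. It reuses the model ID, sample rate, and generation settings from the usage example above. The file names reference.wav and cloned.wav, the placeholder transcript, and the helper names build_generate_kwargs and clone_voice are hypothetical. The check that both prompt arguments are supplied together is an assumption about how generate() expects them.

```python
from typing import Any, Dict, Optional


def build_generate_kwargs(
    text: str,
    prompt_wav_path: Optional[str] = None,
    prompt_text: Optional[str] = None,
    inference_timesteps: int = 10,
    cfg_value: float = 2.0,
) -> Dict[str, Any]:
    """Assemble keyword arguments for model.generate(), with or without a voice prompt."""
    # Assumption: the reference clip and its transcript must be given together.
    if (prompt_wav_path is None) != (prompt_text is None):
        raise ValueError("Pass both prompt_wav_path and prompt_text, or neither.")
    kwargs: Dict[str, Any] = {
        "text": text,
        "inference_timesteps": inference_timesteps,
        "cfg_value": cfg_value,
    }
    if prompt_wav_path is not None:
        kwargs["prompt_wav_path"] = prompt_wav_path
        kwargs["prompt_text"] = prompt_text
    return kwargs


def clone_voice(out_path: str = "cloned.wav") -> None:
    """Generate speech in the voice of a reference clip (paths are placeholders)."""
    from voxcpm import VoxCPM
    import soundfile as sf

    model = VoxCPM.from_pretrained("AfriSpeech/voxcpm-afrispeech-full-inference-20260606")
    kwargs = build_generate_kwargs(
        text="Karibu sana! Tunafurahi kukuona hapa leo.",
        prompt_wav_path="reference.wav",  # hypothetical 3-10 s reference clip
        prompt_text="Transcript of the reference clip.",  # placeholder transcript
    )
    wav = model.generate(**kwargs)
    sf.write(out_path, wav, 16000)
```

Keeping the argument assembly separate from the model call means the prompt validation can be checked without downloading the weights.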