Kikiri German Base — 51 Speakers Synthetic
A German multi-speaker TTS base model trained with StyleTTS2 on a synthetic dataset of 51 German speakers.
This is a Stage 1 base model: it provides the acoustic foundation for speaker-adapted fine-tuning (Stage 2) and is compatible with the Kokoro inference architecture.
Model Details
| Property | Value |
|---|---|
| Architecture | StyleTTS2 Stage 1 (Kokoro-compatible) |
| Language | German (de) |
| Speakers | 51 synthetic voices |
| Training data | ~30,800 samples (synthetic, TTS-generated) |
| Training epochs | 4 |
| Validation Mel Loss | 0.286 |
| Sample rate | 24 kHz |
| G2P | misaki 0.9.4 + espeak-ng 1.50 |
Audio Samples
Generated with the Victoria voice (Stage 2 fine-tune, voices/victoria.pt).
"Schön, dass du da bist. Die Bücher liegen auf dem großen Tisch."
"Ich mache mich auf den Weg nach Aachen, um auch nachts wach zu sein."
"Er aß die Maße in der Straße, aber das Maß war voll."
"Zwei weiße Zwerge zwängen sich zwischen zwei Zweige."
"Ein Pfau pflegt seine Federn an der Pfütze."
"Warum hast du das getan? Das ist ja unglaublich!"
"Das kostet genau einhundertdreiundzwanzig Millionen Euro."
Files
| File | Description |
|---|---|
| kikiri_german_base_51spk_ep4.pth | Model weights (Kokoro-compatible format) |
| voices/victoria.pt | Victoria speaker voicepack (512-dim style embedding) |
| audio/test_*.wav | German phonetic test sentences |
Usage
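# Uses the kokoro library as the underlying framework
from kokoro import KPipeline
# Build a German pipeline from the Stage 1 checkpoint
pipeline = KPipeline(lang_code="de", model_path="kikiri_german_base_51spk_ep4.pth")
# Load the Victoria voicepack (Stage 2 fine-tune)
voicepack = pipeline.load_voice("voices/victoria.pt")
# Synthesize a test sentence with that voice
text = "Guten Tag, wie geht es Ihnen?"
audio = pipeline(text, voice=voicepack)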
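The snippet above does not show how the result is stored. As a minimal sketch, assuming you end up with a flat sequence of float samples in the range [-1.0, 1.0], the following writes them as a 16-bit mono WAV file at the model's 24 kHz sample rate using only the standard library. The helper name `save_wav` and the sine-wave stand-in are hypothetical and not part of kokoro.

```python
import math
import os
import struct
import tempfile
import wave

SAMPLE_RATE = 24_000  # the 24 kHz sample rate listed under Model Details


def save_wav(samples, path, sample_rate=SAMPLE_RATE):
    """Write float samples in [-1.0, 1.0] as 16-bit mono PCM (hypothetical helper)."""
    # Clip to the valid range, then scale to signed 16-bit integers.
    pcm = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)            # mono
        wav.setsampwidth(2)            # 2 bytes per sample = 16 bit
        wav.setframerate(sample_rate)
        wav.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))


# Stand-in for model output (assumption): one second of a 440 Hz tone.
tone = [0.3 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE) for n in range(SAMPLE_RATE)]
out_path = os.path.join(tempfile.gettempdir(), "kikiri_example.wav")
save_wav(tone, out_path)
```

Writing at a different rate than 24 kHz would change the pitch and speed on playback, so the constant is kept in one place.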
Training
- Stage 1 trains the acoustic model on mel spectrogram reconstruction across all 51 speakers
- Stage 2 fine-tunes a single speaker using WavLM adversarial training (bf16)
- Data pipeline: text → misaki G2P (de) → Kokoro 178-token IPA vocabulary
- All training data is phoneme-validated: no ?? artifacts, no OOV symbols
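The out-of-vocabulary check in the last bullet can be pictured as a simple set-membership test over each phoneme string. The sketch below is an illustration only: `VOCAB` is a small hypothetical subset, not the actual Kokoro 178-token vocabulary, `find_oov` and `is_valid` are made-up helper names, and the phoneme strings are hand-written rather than real misaki output.

```python
# Hypothetical subset of an IPA symbol inventory (the real vocabulary has 178 tokens).
VOCAB = set("abdefghijklmnoprstuvzçøŋœɐɔəɛɪʁʃʊʏː ˈˌ.,!?")


def find_oov(phonemes, vocab=VOCAB):
    """Return the sorted list of symbols in `phonemes` that are not in `vocab`."""
    return sorted(set(phonemes) - vocab)


def is_valid(phonemes, vocab=VOCAB):
    """A sample passes only if it is non-empty and contains no OOV symbols."""
    return bool(phonemes) and not find_oov(phonemes, vocab)


clean = "ʃøːn"        # every symbol is in the toy vocabulary
leaky = "ʃøːn 3"      # an unnormalized digit leaked through G2P

print(find_oov(clean))   # []
print(find_oov(leaky))   # ['3']
```

Rejecting a sample outright, rather than silently dropping the unknown symbol, keeps the text and audio of each training pair aligned.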
Limitations
- Trained entirely on synthetic (TTS-generated) audio; training on real human recordings may improve naturalness
- The Stage 1 model alone is not intended for production use; production-quality single-speaker output requires Stage 2 fine-tuning
- German number/date normalization is handled by the caller (not built-in)
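Since number normalization is left to the caller, a pre-processing step is needed before text reaches the pipeline. The following is a minimal sketch, not part of this model or of kokoro: the helpers `spell_number` and `normalize_numbers` are hypothetical, cover only plain integers from 0 to 999999, and ignore dates, ordinals, decimals, thousands separators, and inflection (for example "1 Euro" would need "ein", not "eins").

```python
import re

UNITS = ["null", "ein", "zwei", "drei", "vier", "fünf", "sechs", "sieben", "acht",
         "neun", "zehn", "elf", "zwölf", "dreizehn", "vierzehn", "fünfzehn",
         "sechzehn", "siebzehn", "achtzehn", "neunzehn"]
TENS = ["", "", "zwanzig", "dreißig", "vierzig", "fünfzig", "sechzig", "siebzig",
        "achtzig", "neunzig"]


def _below_1000(n):
    """Spell 1..999, using 'ein' (not 'eins') for a one."""
    words = ""
    hundreds, rest = divmod(n, 100)
    if hundreds:
        words += UNITS[hundreds] + "hundert"
    if rest >= 20:
        tens, ones = divmod(rest, 10)
        # German puts the ones digit first: 23 -> "dreiundzwanzig".
        words += (UNITS[ones] + "und" if ones else "") + TENS[tens]
    elif rest:
        words += UNITS[rest]
    return words


def spell_number(n):
    """Spell an integer in 0..999999 as a German cardinal number."""
    if not 0 <= n < 1_000_000:
        raise ValueError("only 0..999999 is supported in this sketch")
    if n == 0:
        return "null"
    thousands, rest = divmod(n, 1000)
    words = _below_1000(thousands) + "tausend" if thousands else ""
    words += _below_1000(rest)
    # A number ending in a bare one is spoken "eins" (1, 101, 1001).
    if n % 100 == 1:
        words += "s"
    return words


def normalize_numbers(text):
    """Replace digit runs of up to six digits; longer runs are left untouched."""
    return re.sub(r"(?<!\d)\d{1,6}(?!\d)",
                  lambda m: spell_number(int(m.group())), text)


print(normalize_numbers("Das kostet 123 Euro."))
# Das kostet einhundertdreiundzwanzig Euro.
```

For production use, an established normalization library is likely a better choice than a hand-rolled speller like this one.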
License
Apache 2.0