---
language:
- de
license: apache-2.0
tags:
- text-to-speech
- german
- kokoro
- styletts2
- multi-speaker
pipeline_tag: text-to-speech
---

# Kikiri German Base: 51 Speakers Synthetic

A German multi-speaker TTS base model trained with [StyleTTS2](https://github.com/yl4579/StyleTTS2) on a synthetic dataset of 51 German speakers.

This is a **Stage 1 base model**: it provides the acoustic foundation for speaker-adapted fine-tuning (Stage 2). It is compatible with the [Kokoro](https://github.com/hexgrad/kokoro) inference architecture.

## Model Details

| Property | Value |
|---|---|
| Architecture | StyleTTS2 Stage 1 (Kokoro-compatible) |
| Language | German (de) |
| Speakers | 51 synthetic voices |
| Training data | ~30,800 samples (synthetic, TTS-generated) |
| Training epochs | 4 |
| Validation Mel Loss | 0.286 |
| Sample rate | 24 kHz |
| G2P | misaki 0.9.4 + espeak-ng 1.50 |

## Audio Samples

Generated with the Victoria voice (Stage 2 fine-tune, `voices/victoria.pt`).

**"Schön, dass du da bist. Die Bücher liegen auf dem großen Tisch."**

**"Ich mache mich auf den Weg nach Aachen, um auch nachts wach zu sein."**

**"Er aß die Maße in der Straße, aber das Maß war voll."**

**"Zwei weiße Zwerge zwängen sich zwischen zwei Zweige."**

**"Ein Pfau pflegt seine Federn an der Pfütze."**

**"Warum hast du das getan? Das ist ja unglaublich!"**

**"Das kostet genau einhundertdreiundzwanzig Millionen Euro."**

## Files

| File | Description |
|---|---|
| `kikiri_german_base_51spk_ep4.pth` | Model weights (Kokoro-compatible format) |
| `voices/victoria.pt` | Victoria speaker voicepack (512-dim style embedding) |
| `audio/test_*.wav` | German phonetic test sentences |

## Usage

```python
# Uses the kokoro library as underlying framework
from kokoro import KPipeline

pipeline = KPipeline(lang_code="de", model_path="kikiri_german_base_51spk_ep4.pth")
voicepack = pipeline.load_voice("voices/victoria.pt")

text = "Guten Tag, wie geht es Ihnen?"
audio = pipeline(text, voice=voicepack)
```

## Training

- **Stage 1** trains the acoustic model on mel spectrogram reconstruction across all 51 speakers
- **Stage 2** fine-tunes a single speaker using WavLM adversarial training (bf16)
- Data pipeline: text → misaki G2P (de) → Kokoro 178-token IPA vocabulary
- All training data phoneme-validated: no `??` artifacts, no OOV symbols

## Limitations

- Trained entirely on **synthetic** (TTS-generated) audio; real human recordings may improve naturalness
- Stage 1 alone requires Stage 2 fine-tuning for production-quality single-speaker output
- German number/date normalization is handled by the caller (not built-in)

## License

Apache 2.0
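
## Appendix: Number Normalization Sketch

The Limitations section notes that German number normalization is left to the caller. The following is a minimal sketch of what such a pre-processing step could look like, not part of this model or of the kokoro library. The function names `number_to_german` and `normalize_numbers` are hypothetical, and the sketch only covers plain non-negative integers below one billion. Dates, ordinals, decimals, thousands separators, and million counts ending in 01 (such as 101 000 000) are out of scope and would need extra handling.

```python
import re

# Compound forms: "ein" (not "eins") is used inside larger number words.
UNITS = ["", "ein", "zwei", "drei", "vier", "fünf", "sechs", "sieben", "acht",
         "neun", "zehn", "elf", "zwölf", "dreizehn", "vierzehn", "fünfzehn",
         "sechzehn", "siebzehn", "achtzehn", "neunzehn"]
TENS = ["", "", "zwanzig", "dreißig", "vierzig", "fünfzig", "sechzig",
        "siebzig", "achtzig", "neunzig"]


def _below_1000(n: int) -> str:
    """Spell 1..999 as one compound word, e.g. 123 -> 'einhundertdreiundzwanzig'."""
    hundreds, rest = divmod(n, 100)
    word = UNITS[hundreds] + "hundert" if hundreds else ""
    if rest < 20:
        return word + UNITS[rest]
    tens, unit = divmod(rest, 10)
    # German puts the unit before the ten: 23 -> "drei" + "und" + "zwanzig".
    return word + (UNITS[unit] + "und" if unit else "") + TENS[tens]


def number_to_german(n: int) -> str:
    """Spell a non-negative integer below one billion in German words."""
    if not 0 <= n < 1_000_000_000:
        raise ValueError("only integers in [0, 1e9) are supported by this sketch")
    if n == 0:
        return "null"
    millions, rest = divmod(n, 1_000_000)
    thousands, below = divmod(rest, 1000)
    parts = []
    if millions == 1:
        parts.append("eine Million")
    elif millions:
        # Not handled idiomatically: million counts ending in 01.
        parts.append(_below_1000(millions) + " Millionen")
    tail = ""
    if thousands:
        tail += _below_1000(thousands) + "tausend"
    if below:
        tail += _below_1000(below)
        if below % 100 == 1:
            tail += "s"  # a final 1 is spoken "eins", e.g. 101 -> "einhunderteins"
    if tail:
        parts.append(tail)
    return " ".join(parts)


def normalize_numbers(text: str) -> str:
    """Replace every run of digits in `text` with its German spelling."""
    return re.sub(r"\d+", lambda m: number_to_german(int(m.group())), text)


print(normalize_numbers("Das kostet genau 123000000 Euro."))
```

The normalized string can then be passed to the pipeline in place of the raw text. A maintained library such as `num2words` is likely a better choice for production use; the stdlib-only version above is meant only to show the kind of work the caller is expected to do.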