language:
  - de
license: apache-2.0
tags:
  - text-to-speech
  - german
  - kokoro
  - styletts2
  - multi-speaker
pipeline_tag: text-to-speech

Kikiri German Base — 51 Speakers Synthetic

A German multi-speaker TTS base model trained with StyleTTS2 on a synthetic dataset of 51 German speakers.

This is a Stage 1 base model: it provides the acoustic foundation for speaker-adapted fine-tuning (Stage 2) and is compatible with the Kokoro inference architecture.

Model Details

| Property | Value |
|---|---|
| Architecture | StyleTTS2 Stage 1 (Kokoro-compatible) |
| Language | German (de) |
| Speakers | 51 synthetic voices |
| Training data | ~30,800 samples (synthetic, TTS-generated) |
| Training epochs | 4 |
| Validation mel loss | 0.286 |
| Sample rate | 24 kHz |
| G2P | misaki 0.9.4 + espeak-ng 1.50 |

Audio Samples

Generated with the Victoria voice (Stage 2 fine-tune, voices/victoria.pt).

"Schön, dass du da bist. Die Bücher liegen auf dem großen Tisch."

"Ich mache mich auf den Weg nach Aachen, um auch nachts wach zu sein."

"Er aß die Maße in der Straße, aber das Maß war voll."

"Zwei weiße Zwerge zwängen sich zwischen zwei Zweige."

"Ein Pfau pflegt seine Federn an der Pfütze."

"Warum hast du das getan? Das ist ja unglaublich!"

"Das kostet genau einhundertdreiundzwanzig Millionen Euro."

Files

| File | Description |
|---|---|
| kikiri_german_base_51spk_ep4.pth | Model weights (Kokoro-compatible format) |
| voices/victoria.pt | Victoria speaker voicepack (512-dim style embedding) |
| audio/test_*.wav | German phonetic test sentences |

Usage

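# Uses the kokoro library as the underlying framework.
# The model_path argument and lang_code="de" are as given by the model author;
# check them against the kokoro version you have installed.
from kokoro import KPipeline

# Load the German pipeline with the Stage 1 weights from this repository.
pipeline = KPipeline(lang_code="de", model_path="kikiri_german_base_51spk_ep4.pth")
# Load the Victoria voicepack (512-dim style embedding).
voicepack = pipeline.load_voice("voices/victoria.pt")

text = "Guten Tag, wie geht es Ihnen?"
# Synthesize speech (24 kHz). In upstream kokoro, calling the pipeline yields
# per-segment results, so iterate over the return value if your version does too.
audio = pipeline(text, voice=voicepack)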
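The model outputs audio at 24 kHz. As a minimal sketch of saving that output, the helper below writes mono float samples to a 16-bit PCM WAV file using only the Python standard library. The function name `write_wav` is hypothetical, and the code assumes the samples are a flat sequence of floats in the range [-1.0, 1.0]; neither is confirmed by this model card.

```python
import struct
import wave

SAMPLE_RATE = 24_000  # matches the sample rate listed under Model Details


def write_wav(path, samples, sample_rate=SAMPLE_RATE):
    """Write mono float samples in [-1.0, 1.0] as a 16-bit PCM WAV file.

    Hypothetical helper, not part of kokoro. Values outside the range are
    clipped before conversion.
    """
    frames = bytearray()
    for sample in samples:
        clipped = max(-1.0, min(1.0, float(sample)))
        # Scale to the signed 16-bit range and pack as little-endian.
        frames += struct.pack("<h", int(round(clipped * 32767)))
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 2 bytes = 16 bit
        wav.setframerate(sample_rate)
        wav.writeframes(bytes(frames))
```

If the pipeline returns several segments rather than one array, concatenate their samples before calling `write_wav("out.wav", samples)`.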

Training

  • Stage 1 trains the acoustic model on mel spectrogram reconstruction across all 51 speakers
  • Stage 2 fine-tunes a single speaker using WavLM adversarial training (bf16)
  • Data pipeline: text → misaki G2P (de) → Kokoro 178-token IPA vocabulary
  • All training data is phoneme-validated: no ?? artifacts and no out-of-vocabulary (OOV) symbols
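The phoneme validation step could look roughly like the sketch below. It checks a phoneme string for `??` artifacts and for symbols outside a given vocabulary. The function `validate_phonemes` and the `TOY_VOCAB` set are hypothetical stand-ins: the real check would run against the 178-token Kokoro vocabulary, which is not reproduced here.

```python
def validate_phonemes(phonemes, vocab):
    """Return a list of problems found in one phoneme string (empty if clean)."""
    problems = []
    # "??" is treated as a sign that G2P failed on some input.
    if "??" in phonemes:
        problems.append("contains '??' artifact")
    # Any character not in the vocabulary is out-of-vocabulary (OOV).
    oov = sorted({ch for ch in phonemes if ch not in vocab})
    if oov:
        problems.append("out-of-vocabulary symbols: " + " ".join(oov))
    return problems


# Toy vocabulary for illustration only; NOT the actual Kokoro token set.
TOY_VOCAB = set("abdefhijklmnoprstuvz .,!?")

if __name__ == "__main__":
    for sample in ["hallo", "ha??lo", "haXlo"]:
        print(repr(sample), validate_phonemes(sample, TOY_VOCAB))
```

Returning a list of problems rather than a boolean makes it easier to log why a sample was rejected when filtering a dataset.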

Limitations

  • Trained entirely on synthetic (TTS-generated) audio; adding real human recordings may improve naturalness
  • The Stage 1 model alone is not production quality; single-speaker output at that level requires Stage 2 fine-tuning
  • German number and date normalization is not built in; the caller must normalize text before synthesis
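Since number normalization is left to the caller, here is a minimal sketch of spelling out integers in German before passing text to the model. The function `german_number` is hypothetical and not part of this repository or of kokoro. It assumes cardinal integers from 0 to 999,999,999 only; dates, ordinals, decimals, and inflected forms are out of scope.

```python
UNITS = [
    "null", "ein", "zwei", "drei", "vier", "fünf", "sechs", "sieben", "acht",
    "neun", "zehn", "elf", "zwölf", "dreizehn", "vierzehn", "fünfzehn",
    "sechzehn", "siebzehn", "achtzehn", "neunzehn",
]
TENS = {2: "zwanzig", 3: "dreißig", 4: "vierzig", 5: "fünfzig",
        6: "sechzig", 7: "siebzig", 8: "achtzig", 9: "neunzig"}


def _below_1000(n):
    """Spell 1..999 with 1 in its combining form 'ein'."""
    parts = []
    hundreds, rest = divmod(n, 100)
    if hundreds:
        parts.append(UNITS[hundreds] + "hundert")
    if rest:
        if rest < 20:
            parts.append(UNITS[rest])
        else:
            tens, unit = divmod(rest, 10)
            # German puts the unit before the ten: 23 -> dreiundzwanzig.
            parts.append((UNITS[unit] + "und" if unit else "") + TENS[tens])
    return "".join(parts)


def german_number(n):
    """Spell a cardinal integer 0..999_999_999 in German (sketch only)."""
    if not 0 <= n < 10**9:
        raise ValueError("only 0..999,999,999 is supported")
    if n == 0:
        return "null"
    millions, rest = divmod(n, 10**6)
    thousands, below = divmod(rest, 1000)
    words = []
    if millions:
        # Million counts ending in 01 (e.g. 101 000 000) get no special form.
        words.append("eine Million" if millions == 1
                     else _below_1000(millions) + " Millionen")
    tail = ""
    if thousands:
        tail += _below_1000(thousands) + "tausend"
    if below:
        tail += _below_1000(below)
        if below % 100 == 1:
            tail += "s"  # a final 1 is 'eins', not 'ein'
    if tail:
        words.append(tail)
    return " ".join(words)
```

For example, `german_number(123_000_000) + " Euro"` yields the wording used in the last audio sample sentence above.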

License

Apache 2.0