---
language:
- de
license: apache-2.0
tags:
- text-to-speech
- german
- kokoro
- styletts2
- multi-speaker
pipeline_tag: text-to-speech
---

# Kikiri German Base — 51 Speakers Synthetic

A German multi-speaker TTS base model trained with [StyleTTS2](https://github.com/yl4579/StyleTTS2) on a synthetic dataset of 51 German speakers.

This is a **Stage 1 base model**: it provides the acoustic foundation for speaker-adapted fine-tuning (Stage 2) and is compatible with the [Kokoro](https://github.com/hexgrad/kokoro) inference architecture.

## Model Details

| Property | Value |
|---|---|
| Architecture | StyleTTS2 Stage 1 (Kokoro-compatible) |
| Language | German (de) |
| Speakers | 51 synthetic voices |
| Training data | ~30,800 samples (synthetic, TTS-generated) |
| Training epochs | 4 |
| Validation Mel Loss | 0.286 |
| Sample rate | 24 kHz |
| G2P | misaki 0.9.4 + espeak-ng 1.50 |

## Audio Samples

Generated with the Victoria voice (Stage 2 fine-tune, `voices/victoria.pt`).

**"Schön, dass du da bist. Die Bücher liegen auf dem großen Tisch."**
<audio controls><source src="https://huggingface.co/kikiri-tts/kikiri-german-base-51speakers-synthetic/resolve/main/audio/test_01.wav" type="audio/wav"></audio>

**"Ich mache mich auf den Weg nach Aachen, um auch nachts wach zu sein."**
<audio controls><source src="https://huggingface.co/kikiri-tts/kikiri-german-base-51speakers-synthetic/resolve/main/audio/test_02.wav" type="audio/wav"></audio>

**"Er aß die Maße in der Straße, aber das Maß war voll."**
<audio controls><source src="https://huggingface.co/kikiri-tts/kikiri-german-base-51speakers-synthetic/resolve/main/audio/test_03.wav" type="audio/wav"></audio>

**"Zwei weiße Zwerge zwängen sich zwischen zwei Zweige."**
<audio controls><source src="https://huggingface.co/kikiri-tts/kikiri-german-base-51speakers-synthetic/resolve/main/audio/test_04.wav" type="audio/wav"></audio>

**"Ein Pfau pflegt seine Federn an der Pfütze."**
<audio controls><source src="https://huggingface.co/kikiri-tts/kikiri-german-base-51speakers-synthetic/resolve/main/audio/test_05.wav" type="audio/wav"></audio>

**"Warum hast du das getan? Das ist ja unglaublich!"**
<audio controls><source src="https://huggingface.co/kikiri-tts/kikiri-german-base-51speakers-synthetic/resolve/main/audio/test_06.wav" type="audio/wav"></audio>

**"Das kostet genau einhundertdreiundzwanzig Millionen Euro."**
<audio controls><source src="https://huggingface.co/kikiri-tts/kikiri-german-base-51speakers-synthetic/resolve/main/audio/test_07.wav" type="audio/wav"></audio>

## Files

| File | Description |
|---|---|
| `kikiri_german_base_51spk_ep4.pth` | Model weights (Kokoro-compatible format) |
| `voices/victoria.pt` | Victoria speaker voicepack (512-dim style embedding) |
| `audio/test_*.wav` | German phonetic test sentences |

## Usage

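```python
# Uses the kokoro library as the underlying framework.
# NOTE: the keyword arguments below are the ones given in this model card;
# the KPipeline signature can differ between kokoro versions, so check yours.
from kokoro import KPipeline

# Build a German pipeline on top of the Stage 1 base weights.
pipeline = KPipeline(lang_code="de", model_path="kikiri_german_base_51spk_ep4.pth")
# Load the Victoria voicepack (512-dim style embedding).
voicepack = pipeline.load_voice("voices/victoria.pt")

text = "Guten Tag, wie geht es Ihnen?"
audio = pipeline(text, voice=voicepack)
```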
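
To listen to the result outside Python, the samples can be written to a WAV file at the 24 kHz sample rate listed under Model Details. The following is a minimal sketch using only the standard library. It assumes the synthesized audio is available as a flat sequence of float samples in the range [-1, 1]; `write_wav` is a hypothetical helper, not part of kokoro.

```python
import struct
import wave

SAMPLE_RATE = 24_000  # Hz, as listed under Model Details


def write_wav(path, samples, sample_rate=SAMPLE_RATE):
    """Write mono float samples in [-1, 1] as 16-bit PCM WAV."""
    frames = bytearray()
    for sample in samples:
        # Clamp to the valid range so out-of-range values cannot overflow int16.
        clamped = max(-1.0, min(1.0, float(sample)))
        frames += struct.pack("<h", int(round(clamped * 32767)))
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(bytes(frames))
```

If the audio arrives as a tensor or array, converting it to a plain list of floats first (for example via `.tolist()`) should be enough for this helper.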

## Training

- **Stage 1** trains the acoustic model on mel spectrogram reconstruction across all 51 speakers
- **Stage 2** fine-tunes a single speaker using WavLM adversarial training (bf16)
- Data pipeline: text → misaki G2P (de) → Kokoro 178-token IPA vocabulary
- All training data is phoneme-validated: no `??` artifacts and no out-of-vocabulary (OOV) symbols
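
The phoneme validation step can be sketched as a simple per-utterance check. This is an illustration only, not the project's actual validation code: `find_phoneme_issues` is a hypothetical helper, and `TOY_VOCAB` is a small stand-in for the real 178-token Kokoro vocabulary. It treats a literal `??` in the phoneme string as a conversion artifact and reports any character that is not in the vocabulary.

```python
# Toy stand-in vocabulary (assumption); the real Kokoro vocabulary has 178 tokens.
TOY_VOCAB = set("abdefhiklmnoprstuvz ˈˌː.,!?")


def find_phoneme_issues(phonemes, vocab):
    """Return a list of human-readable problems found in a phoneme string."""
    issues = []
    # A literal '??' is treated as a failed conversion left in the output.
    if "??" in phonemes:
        issues.append("contains '??' artifact")
    # Any character outside the vocabulary cannot be mapped to a token.
    oov = sorted({ch for ch in phonemes if ch not in vocab})
    if oov:
        issues.append("OOV symbols: " + " ".join(oov))
    return issues


def keep_valid(samples, vocab):
    """Keep only (text, phonemes) pairs whose phonemes pass both checks."""
    return [(text, ph) for text, ph in samples if not find_phoneme_issues(ph, vocab)]
```

Filtering at this stage means a bad G2P result drops a single sample instead of surfacing later as an unknown-token error during training.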

## Limitations

- Trained entirely on **synthetic** (TTS-generated) audio; training on real human recordings may improve naturalness
- Stage 1 output alone is not production quality; single-speaker output at that level requires Stage 2 fine-tuning
- German number and date normalization is not built in and must be handled by the caller
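
Because number normalization is left to the caller, digits need to be spelled out before the text reaches the pipeline. The following is a minimal sketch, assuming plain non-negative integers below one million. `number_to_german` and `normalize_numbers` are hypothetical helpers, not part of kokoro or misaki. Decimals and grouped thousands such as `3,5` or `1.000` are left untouched, and ordinals and dates are not handled (a trailing `3.` would be read as a cardinal).

```python
import re

_UNITS = [
    "null", "eins", "zwei", "drei", "vier", "fünf", "sechs", "sieben",
    "acht", "neun", "zehn", "elf", "zwölf", "dreizehn", "vierzehn",
    "fünfzehn", "sechzehn", "siebzehn", "achtzehn", "neunzehn",
]
_TENS = {2: "zwanzig", 3: "dreißig", 4: "vierzig", 5: "fünfzig",
         6: "sechzig", 7: "siebzig", 8: "achtzig", 9: "neunzig"}


def _below_100(n, final):
    """Spell 1..99; 'final' is True when nothing follows (1 -> 'eins')."""
    if n < 20:
        return "ein" if n == 1 and not final else _UNITS[n]
    tens, unit = divmod(n, 10)
    if unit == 0:
        return _TENS[tens]
    # German puts the unit before the tens: 23 -> "dreiundzwanzig".
    unit_word = "ein" if unit == 1 else _UNITS[unit]
    return f"{unit_word}und{_TENS[tens]}"


def _below_1000(n, final):
    """Spell 1..999 as one compound word."""
    hundreds, rest = divmod(n, 100)
    word = ""
    if hundreds:
        word += _below_100(hundreds, False) + "hundert"
    if rest:
        word += _below_100(rest, final)
    return word


def number_to_german(n):
    """Spell an integer in 0..999999 as a German cardinal."""
    if not 0 <= n < 1_000_000:
        raise ValueError("only 0..999999 is supported in this sketch")
    if n == 0:
        return "null"
    thousands, rest = divmod(n, 1000)
    word = ""
    if thousands:
        word += _below_1000(thousands, False) + "tausend"
    if rest:
        word += _below_1000(rest, True)
    return word


# Standalone integers of 1 to 6 digits, not part of a decimal or grouped number.
_INT_RE = re.compile(r"(?<![\w.,])\d{1,6}(?![.,]?\d|\w)")


def normalize_numbers(text):
    """Replace standalone integers in text with their German spelling."""
    return _INT_RE.sub(lambda m: number_to_german(int(m.group())), text)
```

For production use, a maintained library that also covers ordinals, dates, and currency is likely a better choice than extending this sketch.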

## License

Apache 2.0