NMikka committed
Commit ad54cdc · verified · 1 Parent(s): e73672f

Add model card

  1. README.md +168 -0
README.md ADDED
@@ -0,0 +1,168 @@
---
language:
- ka
license: cc-by-nc-4.0
base_model: SWivid/F5-TTS
tags:
- tts
- text-to-speech
- georgian
- f5-tts
- speech-synthesis
- flow-matching
pipeline_tag: text-to-speech
datasets:
- NMikka/Common-Voice-Geo-Cleaned
---

# F5-TTS Georgian

A fine-tuned version of [SWivid/F5-TTS](https://huggingface.co/SWivid/F5-TTS) (335M params) for **Georgian text-to-speech**. The model produces high-quality Georgian speech when the reference audio comes from one of the training speakers. Generalization to arbitrary voice cloning is a work in progress.

## Model Details

| Property | Value |
|---|---|
| **Base model** | [SWivid/F5-TTS v1 Base](https://huggingface.co/SWivid/F5-TTS) (335M params, DiT + ConvNeXt V2) |
| **Fine-tuning** | Full fine-tune (continuation of flow-matching pretraining), no LoRA |
| **Training data** | [NMikka/Common-Voice-Geo-Cleaned](https://huggingface.co/datasets/NMikka/Common-Voice-Geo-Cleaned) (20,300 samples, 12 speakers) |
| **Training** | 110,000 updates (~100 epochs), single NVIDIA RTX A6000 (48GB) |
| **Sample rate** | 24 kHz |
| **Voice cloning** | Works well with training speakers; generalizing to new voices is WIP |
| **License** | CC-BY-NC-4.0 (inherited from F5-TTS pretrained weights) |

## Evaluation: FLEURS Georgian Benchmark (979 unseen samples)

Round-trip CER: the TTS model generates audio, [Meta Omnilingual ASR 7B](https://huggingface.co/facebook/omniASR_LLM_7B) transcribes it, and the transcript is compared to the original text.

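Character error rate is commonly defined as the Levenshtein (edit) distance between the reference text and the ASR transcript, divided by the reference length. The sketch below assumes that definition. The function names are illustrative, and any text normalization applied before scoring in this evaluation is not specified in this card.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion from the reference
                curr[j - 1] + 1,     # insertion into the reference
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference text must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Under this definition, a CER of 0.05 corresponds to roughly one character error per 20 reference characters.
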
| Metric | Value |
|---|---|
| **CER mean** | **0.0509** |
| CER median | 0.0309 |
| CER p90 | 0.1183 |
| CER std | 0.0558 |
| WER mean | 0.1866 |
| WER median | 0.1600 |

**CER distribution:**
- 65.9% of samples < 5% CER
- 85.9% of samples < 10% CER
- 96.5% of samples < 20% CER
- 0 catastrophic failures (> 50% CER)

Evaluated with speaker 3 reference audio (NISQA MOS 4.99).

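Summary figures like those above can be derived from a list of per-sample CER values. The sketch below is one way to do it, not the evaluation script used here. The nearest-rank percentile and the population standard deviation are assumptions, since the card does not say which conventions were used.

```python
import statistics


def summarize_cer(values, thresholds=(0.05, 0.10, 0.20)):
    """Summarize per-sample CER values (conventions are assumptions, see lead-in)."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("need at least one CER value")
    # Nearest-rank 90th percentile: rank = ceil(0.9 * n), computed with integers.
    rank = (9 * n + 9) // 10
    return {
        "mean": statistics.fmean(ordered),
        "median": statistics.median(ordered),
        "p90": ordered[rank - 1],
        "std": statistics.pstdev(ordered),
        # Share of samples strictly below each threshold.
        "share_below": {t: sum(v < t for v in ordered) / n for t in thresholds},
        # Count of samples above 50% CER ("catastrophic failures" in the list above).
        "catastrophic": sum(v > 0.50 for v in ordered),
    }
```
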
## Usage

### Install

```bash
pip install f5-tts
```

### Download Model

```python
from huggingface_hub import hf_hub_download

# Download checkpoint and vocab
ckpt_path = hf_hub_download("NMikka/F5-TTS-Georgian", "model_110000.pt")
vocab_path = hf_hub_download("NMikka/F5-TTS-Georgian", "extended_vocab.txt")
```

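As an optional sanity check after downloading, the vocab file can be counted and compared with the 2,579 tokens reported under Training Details. This sketch assumes the file stores one token per line, with each line terminated by a newline. `count_vocab_tokens` is an illustrative helper, not part of `f5-tts`.

```python
def count_vocab_tokens(path: str) -> int:
    """Count tokens in a vocab file, assuming one token per line."""
    with open(path, encoding="utf-8") as f:
        text = f.read()
    # Drop only the final newline so a whitespace-only token (e.g. " ") still counts.
    if text.endswith("\n"):
        text = text[:-1]
    return len(text.split("\n")) if text else 0
```

If that layout assumption holds, `count_vocab_tokens(vocab_path)` should report 2,579 for `extended_vocab.txt`.
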
### Inference

The model works best with reference audio from the training dataset. Voice cloning to arbitrary Georgian speakers is a work in progress.

```python
from datasets import load_dataset
from huggingface_hub import hf_hub_download
from f5_tts.api import F5TTS
import soundfile as sf
import numpy as np

# Download model
ckpt_path = hf_hub_download("NMikka/F5-TTS-Georgian", "model_110000.pt")
vocab_path = hf_hub_download("NMikka/F5-TTS-Georgian", "extended_vocab.txt")

# Load a reference sample from the training dataset
ds = load_dataset("NMikka/Common-Voice-Geo-Cleaned", split="test")
ref_sample = ds[0]  # Pick any sample as voice reference

# Save reference audio to temp file (F5-TTS expects a file path)
ref_path = "/tmp/ref.wav"
sf.write(ref_path, np.array(ref_sample["audio"]["array"]), ref_sample["audio"]["sampling_rate"])

# Load model
model = F5TTS(
    ckpt_file=ckpt_path,
    vocab_file=vocab_path,
    device="cuda",
    use_ema=False,  # Important: this checkpoint was not trained with EMA
)

# Generate speech using a training speaker as reference
# gen_text means: "Hello, how are you? Georgia is a most beautiful country."
wav, sr, _ = model.infer(
    ref_file=ref_path,
    ref_text=ref_sample["text"],
    gen_text="გამარჯობა, როგორ ხარ? საქართველო ულამაზესი ქვეყანაა.",
)
sf.write("output.wav", wav, sr)
```

### Generation Parameters

```python
# `model` is the F5TTS instance created in the Inference example
wav, sr, _ = model.infer(
    ref_file="reference.wav",
    ref_text="reference transcript",
    gen_text="text to synthesize",
    nfe_step=32,       # Denoising steps (default 32, higher = better quality, slower)
    cfg_strength=2.0,  # Classifier-free guidance (default 2.0)
    speed=1.0,         # Speech speed multiplier
)
```

## Training Details

| Setting | Value |
|---|---|
| **Method** | Full fine-tune (flow-matching loss, continuation of pretraining) |
| **Base checkpoint** | `F5TTS_v1_Base/model_1250000.safetensors` |
| **Learning rate** | 1e-5 |
| **Warmup** | 500 steps |
| **Batch size** | 9,600 audio frames per GPU |
| **Max sequences/batch** | 64 |
| **Optimizer** | 8-bit Adam (bitsandbytes) |
| **Epochs** | 100 |
| **Total updates** | 110,000 |
| **Tokenizer** | Character-level (`char`, not `pinyin`) |
| **Vocab** | 2,579 tokens (2,545 pretrained + 34 Georgian characters) |
| **GPU** | 1x NVIDIA RTX A6000 (48GB) |

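The figures in the table imply a few derived quantities, worked out in the short sketch below. The mel hop length of 256 samples is an assumption (it is not stated in this card), so the seconds-per-batch figure is only as reliable as that value. `training_budget` is an illustrative name.

```python
def training_budget(samples=20_300, epochs=100, total_updates=110_000,
                    frames_per_batch=9_600, sample_rate_hz=24_000,
                    hop_length=256):  # hop_length is an assumed value, not from the card
    """Derive per-epoch and per-batch quantities from the reported training settings."""
    updates_per_epoch = total_updates / epochs
    return {
        "updates_per_epoch": updates_per_epoch,
        # Average number of utterances packed into one update.
        "samples_per_update": samples / updates_per_epoch,
        # Audio duration covered by one full frame budget.
        "seconds_per_batch": frames_per_batch * hop_length / sample_rate_hz,
    }
```

With the defaults above this gives 1,100 updates per epoch, about 18.5 samples per update, and 102.4 seconds of audio per full batch.
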
### Vocab Extension

The pretrained F5-TTS uses a pinyin-based vocabulary (2,545 tokens). For Georgian, we extended the vocabulary by appending 34 characters (the Georgian letters ა-ჰ plus the opening quotation mark „), giving 2,579 tokens. New embeddings were initialized with the mean of the existing pretrained embeddings, and the text embedding layer was resized from 2,546 to 2,580 rows (one more than the token count in each case).

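The extension described above can be sketched in plain Python. This illustrates the idea and is not the script used for this model. Plain lists stand in for the embedding tensor, the helper names are hypothetical, and whether the extra (token count + 1) row was included in the mean is not stated in this card.

```python
def georgian_characters() -> list:
    """Letters U+10D0 (ა) through U+10F0 (ჰ) plus the low quotation mark U+201E („)."""
    letters = [chr(cp) for cp in range(0x10D0, 0x10F1)]
    return letters + ["\u201e"]


def extend_vocab(vocab: list, new_tokens: list) -> list:
    """Append tokens that are not already present, keeping existing indices stable."""
    seen = set(vocab)
    extended = list(vocab)
    for tok in new_tokens:
        if tok not in seen:
            extended.append(tok)
            seen.add(tok)
    return extended


def extend_embedding(rows: list, n_new: int) -> list:
    """Append n_new rows, each equal to the column-wise mean of the existing rows."""
    dim = len(rows[0])
    mean_row = [sum(r[d] for r in rows) / len(rows) for d in range(dim)]
    return [list(r) for r in rows] + [list(mean_row) for _ in range(n_new)]
```

Starting the new rows at the mean keeps them inside the range of the pretrained embeddings, which is a common choice when adding tokens to a pretrained model.
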
## Limitations and Future Work

- **License**: CC-BY-NC-4.0, non-commercial use only (inherited from F5-TTS weights)
- **Voice cloning to new speakers is limited**: the model clones training speakers well but does not yet generalize to arbitrary Georgian voices. This is an active area of improvement.
- Trained on 12 speakers from Common Voice Georgian, so speaker diversity is limited
- Some complex Georgian text with rare characters may produce higher error rates
- No emotion or prosody control beyond what the reference audio provides

## Part of the Georgian TTS Benchmark

This model was trained as part of the first Georgian TTS benchmark, a comparative study of 6 open-source TTS architectures. See the full project: [github.com/NMikaa/TTS_pipelines](https://github.com/NMikaa/TTS_pipelines)

## Citation

```bibtex
@misc{f5tts-georgian-2026,
  title={F5-TTS Georgian: Fine-tuned Flow-Matching TTS for Georgian},
  author={NMikka},
  year={2026},
  url={https://huggingface.co/NMikka/F5-TTS-Georgian}
}
```