---
language:
- ka
license: cc-by-nc-4.0
base_model: SWivid/F5-TTS
tags:
- tts
- text-to-speech
- georgian
- f5-tts
- speech-synthesis
- flow-matching
pipeline_tag: text-to-speech
datasets:
- NMikka/Common-Voice-Geo-Cleaned
---
# F5-TTS Georgian

A fine-tuned version of [SWivid/F5-TTS](https://huggingface.co/SWivid/F5-TTS) (335M parameters) for **Georgian text-to-speech**. With reference audio from a training speaker, the model reaches a mean round-trip CER of 5.1% on 979 unseen FLEURS Georgian samples. Generalization to arbitrary voice cloning is a work in progress.
## Model Details

| | |
|---|---|
| **Base model** | [SWivid/F5-TTS v1 Base](https://huggingface.co/SWivid/F5-TTS) (335M params, DiT + ConvNeXt V2) |
| **Fine-tuning** | Full fine-tune (continuation of flow-matching pretraining), no LoRA |
| **Training data** | [NMikka/Common-Voice-Geo-Cleaned](https://huggingface.co/datasets/NMikka/Common-Voice-Geo-Cleaned): 20,300 samples, 12 speakers |
| **Training** | 110,000 updates (~100 epochs), single NVIDIA RTX A6000 (48GB) |
| **Sample rate** | 24 kHz |
| **Voice cloning** | Works well with training speakers; generalizing to new voices is a work in progress |
| **License** | CC-BY-NC-4.0 (inherited from F5-TTS pretrained weights) |
## Evaluation: FLEURS Georgian Benchmark (979 unseen samples)

Round-trip evaluation: the model synthesizes each benchmark sentence, [Meta Omnilingual ASR 7B](https://huggingface.co/facebook/omniASR_LLM_7B) transcribes the generated audio, and the transcript is compared with the original text to compute character error rate (CER) and word error rate (WER).

| Metric | Value |
|---|---|
| **CER mean** | **0.0509** |
| CER median | 0.0309 |
| CER p90 | 0.1183 |
| CER std | 0.0558 |
| WER mean | 0.1866 |
| WER median | 0.1600 |
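
The two error rates in the table can be sketched as edit distance divided by reference length. This is a minimal illustration rather than the evaluation script itself: the helper names `edit_distance`, `cer`, and `wer` are hypothetical, and any text normalization applied before scoring (punctuation, casing, whitespace) is not specified here and is left out.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion from the reference
                curr[j - 1] + 1,     # insertion into the reference
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character edits divided by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: word edits divided by the number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


# Assumed example pair: the ASR transcript differs in one final character
reference = "გამარჯობა მსოფლიო"
hypothesis = "გამარჯობა მსოფლია"
print(cer(reference, hypothesis), wer(reference, hypothesis))
```

In this example a single wrong character out of 17 gives a CER of about 0.059, while the same error makes one of two words wrong and gives a WER of 0.5. The same effect is one reason the WER figures above are several times larger than the CER figures.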
**CER distribution:**

- 65.9% of samples < 5% CER
- 85.9% of samples < 10% CER
- 96.5% of samples < 20% CER
- 0 catastrophic failures (> 50% CER)

All evaluation samples were generated with reference audio from speaker 3 (NISQA MOS 4.99).
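
The summary statistics and distribution buckets above can be derived from a list of per-sample CER values along these lines. This is a sketch under assumptions: `summarize_cer` is a hypothetical helper, the 90th percentile uses the nearest-rank method, and the standard deviation is the population form; the original evaluation may have used different conventions.

```python
import math
import statistics


def summarize_cer(scores, thresholds=(0.05, 0.10, 0.20), failure_above=0.50):
    """Aggregate per-sample CER values into summary statistics."""
    if not scores:
        raise ValueError("scores must be non-empty")
    ordered = sorted(scores)
    n = len(ordered)
    # Nearest-rank 90th percentile (assumption; other tools interpolate)
    p90 = ordered[max(0, math.ceil(0.9 * n) - 1)]
    return {
        "mean": statistics.fmean(ordered),
        "median": statistics.median(ordered),
        "p90": p90,
        "std": statistics.pstdev(ordered),  # population std (assumption)
        # Share of samples strictly below each CER threshold
        "share_below": {t: sum(s < t for s in ordered) / n for t in thresholds},
        # Count of samples above the catastrophic-failure cutoff
        "failures": sum(s > failure_above for s in ordered),
    }


# Made-up scores for illustration only, not benchmark data
example = summarize_cer([0.0, 0.02, 0.04, 0.08, 0.30])
print(example["median"], example["share_below"], example["failures"])
```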
## Usage

### Install

```bash
pip install f5-tts
```
### Download Model

```python
from huggingface_hub import hf_hub_download

# Download checkpoint and vocab
ckpt_path = hf_hub_download("NMikka/F5-TTS-Georgian", "model_110000.pt")
vocab_path = hf_hub_download("NMikka/F5-TTS-Georgian", "extended_vocab.txt")
```
### Inference

The model works best with reference audio from the training dataset. Voice cloning to arbitrary Georgian speakers is a work in progress.

```python
from datasets import load_dataset
from huggingface_hub import hf_hub_download
from f5_tts.api import F5TTS
import soundfile as sf
import numpy as np

# Download model
ckpt_path = hf_hub_download("NMikka/F5-TTS-Georgian", "model_110000.pt")
vocab_path = hf_hub_download("NMikka/F5-TTS-Georgian", "extended_vocab.txt")

# Load a reference sample from the training corpus (this example reads its test split)
ds = load_dataset("NMikka/Common-Voice-Geo-Cleaned", split="test")
ref_sample = ds[0]  # Pick any sample as voice reference

# Save reference audio to a temp file (F5-TTS expects a file path)
ref_path = "/tmp/ref.wav"
sf.write(ref_path, np.array(ref_sample["audio"]["array"]), ref_sample["audio"]["sampling_rate"])

# Load model
model = F5TTS(
    ckpt_file=ckpt_path,
    vocab_file=vocab_path,
    device="cuda",
    use_ema=False,  # Important: this checkpoint was not trained with EMA
)

# Generate speech using a training speaker as reference
wav, sr, _ = model.infer(
    ref_file=ref_path,
    ref_text=ref_sample["text"],  # Transcript of the reference audio
    gen_text="გამარჯობა, როგორ ხარ? საქართველო ულამაზესი ქვეყანაა.",
)
sf.write("output.wav", wav, sr)
```
### Generation Parameters

```python
wav, sr, _ = model.infer(
    ref_file="reference.wav",
    ref_text="reference transcript",
    gen_text="text to synthesize",
    nfe_step=32,       # Sampling steps (number of function evaluations; default 32, higher = better quality, slower)
    cfg_strength=2.0,  # Classifier-free guidance strength (default 2.0)
    speed=1.0,         # Speech speed multiplier
)
```
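
What `nfe_step` and `cfg_strength` control can be illustrated with a toy sampler. The sketch assumes that sampling integrates a learned velocity field from t=0 to t=1 with uniform Euler steps and combines conditional and unconditional predictions in the common classifier-free guidance form; the actual F5-TTS sampler may differ in step spacing and other details. `euler_sample`, `toy_cond`, and `toy_uncond` are hypothetical names and not part of the `f5_tts` API.

```python
def euler_sample(x0, velocity_cond, velocity_uncond, nfe_step=32, cfg_strength=2.0):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with uniform Euler steps."""
    x = list(x0)
    dt = 1.0 / nfe_step
    for step in range(nfe_step):
        t = step * dt
        v_cond = velocity_cond(x, t)      # prediction given reference audio and text
        v_uncond = velocity_uncond(x, t)  # prediction with the conditioning dropped
        # Classifier-free guidance: move the conditional prediction
        # further away from the unconditional one
        v = [c + cfg_strength * (c - u) for c, u in zip(v_cond, v_uncond)]
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy velocity fields standing in for the neural network (illustration only)
def toy_cond(x, t):
    return [1.0, -2.0]


def toy_uncond(x, t):
    return [0.0, 0.0]


print(euler_sample([0.0, 0.0], toy_cond, toy_uncond, nfe_step=32, cfg_strength=0.0))
print(euler_sample([0.0, 0.0], toy_cond, toy_uncond, nfe_step=32, cfg_strength=2.0))
```

More steps mean a finer approximation of the trajectory at the cost of more network evaluations, and a larger `cfg_strength` scales up the part of the prediction that depends on the conditioning. In the toy case, guidance of 2.0 triples the effective velocity.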
## Training Details

| | |
|---|---|
| **Method** | Full fine-tune (flow-matching loss, continuation of pretraining) |
| **Base checkpoint** | `F5TTS_v1_Base/model_1250000.safetensors` |
| **Learning rate** | 1e-5 |
| **Warmup** | 500 steps |
| **Batch size** | 9,600 audio frames per GPU |
| **Max sequences/batch** | 64 |
| **Optimizer** | 8-bit Adam (bitsandbytes) |
| **Epochs** | 100 |
| **Total updates** | 110,000 |
| **Tokenizer** | Character-level (`char`, not `pinyin`) |
| **Vocab** | 2,579 tokens (2,545 pretrained + 34 Georgian characters) |
| **GPU** | 1x NVIDIA RTX A6000 (48GB) |
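
Batching by a frame budget rather than a fixed sample count, as the batch-size and max-sequences rows suggest, can be sketched with a greedy packer. This is an illustration in the spirit of those two settings, not the trainer's actual sampler; `frame_budget_batches` is a hypothetical helper, and sorting by length before packing is an assumption.

```python
def frame_budget_batches(frame_lengths, max_frames=9600, max_seqs=64):
    """Greedily group sample indices so each batch fits the frame budget."""
    # Sort by length so clips of similar duration share a batch (less padding)
    order = sorted(range(len(frame_lengths)), key=lambda i: frame_lengths[i])
    batches, current, current_frames = [], [], 0
    for idx in order:
        n = frame_lengths[idx]
        if n > max_frames:
            continue  # a clip longer than the whole budget cannot be batched
        over_budget = current_frames + n > max_frames
        if current and (over_budget or len(current) >= max_seqs):
            batches.append(current)
            current, current_frames = [], 0
        current.append(idx)
        current_frames += n
    if current:
        batches.append(current)
    return batches


# Made-up clip lengths in frames, for illustration only
print(frame_budget_batches([4000, 5000, 3000, 9600, 10000, 600]))
```

Two rough figures follow from the table, both under assumptions not stated in it. With a hop length of 256 samples at 24 kHz, 9,600 frames is about 102 seconds of audio per batch. If all 20,300 samples are used for training, 110,000 updates over 100 epochs is 1,100 updates per epoch, or about 18 samples per update on average.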
### Vocab Extension

The pretrained F5-TTS model uses a pinyin-based vocabulary of 2,545 tokens. For Georgian, we extended the vocabulary by appending 34 characters: the 33 letters of the modern Georgian alphabet (ა-ჰ) and the opening quotation mark „. The text embedding layer was resized from 2,546 to 2,580 rows (one more than the vocabulary size in each case), and each new row was initialized with the mean of the existing pretrained embeddings.
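
The extension described above can be sketched in plain Python. This is a toy illustration under assumptions, not the script used for this model: `extend_vocab` and `extend_embeddings` are hypothetical helpers, embeddings are represented as lists of floats instead of tensors, and the Unicode range U+10D0 to U+10F0 is assumed to be the ა-ჰ range meant in the text.

```python
# 33 modern Georgian letters, ა (U+10D0) through ჰ (U+10F0)
GEORGIAN_LETTERS = [chr(cp) for cp in range(0x10D0, 0x10F1)]
# Plus the opening quotation mark „ (U+201E), 34 new tokens in total
NEW_TOKENS = GEORGIAN_LETTERS + ["\u201e"]


def extend_vocab(vocab, new_tokens):
    """Append tokens missing from vocab, keeping existing indices unchanged."""
    seen = set(vocab)
    added = []
    for tok in new_tokens:
        if tok not in seen:
            seen.add(tok)
            added.append(tok)
    return vocab + added


def extend_embeddings(rows, num_new):
    """Append num_new rows, each set to the column-wise mean of existing rows."""
    n = len(rows)
    mean_row = [sum(col) / n for col in zip(*rows)]
    return rows + [list(mean_row) for _ in range(num_new)]


# Toy stand-ins for the pretrained vocabulary and embedding table
toy_vocab = [" ", "a", "b"]
toy_embeddings = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

new_vocab = extend_vocab(toy_vocab, NEW_TOKENS)
new_embeddings = extend_embeddings(toy_embeddings, len(new_vocab) - len(toy_vocab))
print(len(NEW_TOKENS), len(new_vocab), len(new_embeddings))
```

Starting new rows at the mean keeps them inside the distribution of the pretrained embeddings, so the first updates adjust them from a neutral point rather than from random noise.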
## Limitations and Future Work

- **License**: CC-BY-NC-4.0, non-commercial use only (inherited from F5-TTS weights)
- **Voice cloning to new speakers is limited**: the model clones training speakers well but does not yet generalize to arbitrary Georgian voices. This is an active area of improvement.
- Limited speaker diversity: trained on 12 speakers from Common Voice Georgian
- Some complex Georgian text with rare characters may produce higher error rates
- No emotion or prosody control beyond what the reference audio provides
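
One way to catch part of the rare-character issue before synthesis is to check the input text against the vocabulary. The sketch assumes that the problem includes characters with no entry in `extended_vocab.txt` and that the vocab file stores one token per line; `load_vocab` and `unknown_characters` are hypothetical helpers, not part of the `f5_tts` API.

```python
def load_vocab(path):
    """Read a vocab file, assuming one token per line."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


def unknown_characters(text, vocab):
    """Return characters in text that have no vocab entry, in first-seen order."""
    known = set(vocab)
    missing = []
    for ch in text:
        if ch not in known and ch not in missing:
            missing.append(ch)
    return missing


# Toy vocabulary for illustration: space, basic punctuation, Georgian letters
toy_vocab = [" ", ",", "?", "."] + [chr(cp) for cp in range(0x10D0, 0x10F1)]
print(unknown_characters("გამარჯობა 24 kHz", toy_vocab))
```

Digits, Latin letters, and symbols flagged this way can be spelled out in Georgian before calling `model.infer`.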
## Part of the Georgian TTS Benchmark

This model was trained as part of the first Georgian TTS benchmark, a comparative study of six open-source TTS architectures. See the full project: [github.com/NMikaa/TTS_pipelines](https://github.com/NMikaa/TTS_pipelines)
## Citation

```bibtex
@misc{f5tts-georgian-2026,
  title={F5-TTS Georgian: Fine-tuned Flow-Matching TTS for Georgian},
  author={NMikka},
  year={2026},
  url={https://huggingface.co/NMikka/F5-TTS-Georgian}
}
```