---
language:
- ka
license: cc-by-nc-4.0
base_model: SWivid/F5-TTS
tags:
- tts
- text-to-speech
- georgian
- f5-tts
- speech-synthesis
- flow-matching
pipeline_tag: text-to-speech
datasets:
- NMikka/Common-Voice-Geo-Cleaned
---

# F5-TTS Georgian

A fine-tuned version of [SWivid/F5-TTS](https://huggingface.co/SWivid/F5-TTS) (335M params) for **Georgian text-to-speech**.

The model produces high-quality Georgian speech when using training speakers as reference. Generalization to arbitrary voice cloning is a work in progress.

## Model Details

| | |
|---|---|
| **Base model** | [SWivid/F5-TTS v1 Base](https://huggingface.co/SWivid/F5-TTS) (335M params, DiT + ConvNeXt V2) |
| **Fine-tuning** | Full fine-tune (continuation of flow-matching pretraining), no LoRA |
| **Training data** | [NMikka/Common-Voice-Geo-Cleaned](https://huggingface.co/datasets/NMikka/Common-Voice-Geo-Cleaned): 20,300 samples, 12 speakers |
| **Training** | 110,000 updates (~100 epochs), single NVIDIA RTX A6000 (48GB) |
| **Sample rate** | 24 kHz |
| **Voice cloning** | Works well with training speakers; generalizing to new voices is WIP |
| **License** | CC-BY-NC-4.0 (inherited from F5-TTS pretrained weights) |

## Evaluation: FLEURS Georgian Benchmark (979 unseen samples)

Round-trip CER: TTS generates audio → [Meta Omnilingual ASR 7B](https://huggingface.co/facebook/omniASR_LLM_7B) transcribes → compare to original text.

| Metric | Value |
|---|---|
| **CER mean** | **0.0509** |
| CER median | 0.0309 |
| CER p90 | 0.1183 |
| CER std | 0.0558 |
| WER mean | 0.1866 |
| WER median | 0.1600 |

**CER distribution:**
- 65.9% of samples < 5% CER
- 85.9% of samples < 10% CER
- 96.5% of samples < 20% CER
- 0 catastrophic failures (> 50% CER)

Evaluated with speaker 3 reference audio (NISQA MOS 4.99).
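For readers who want to reproduce the metric, the comparison step of a round-trip evaluation can be sketched as below. This is a minimal illustration, not the evaluation script used for the numbers above: the function names `edit_distance`, `cer`, and `wer` are hypothetical, and it assumes no text normalization (casing, punctuation, whitespace), which may differ from the setup behind the reported scores.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))  # distances from the empty ref prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning ref[:i] into an empty hypothesis takes i deletions
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character edits divided by reference length."""
    if not reference:
        raise ValueError("reference text must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: word edits divided by the number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference text must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


# Example: the ASR transcript drops the final character of the reference.
reference = "გამარჯობა"
transcript = "გამარჯობ"
print(f"CER = {cer(reference, transcript):.4f}")
```

In a full pipeline, `transcript` would be the ASR output for the synthesized audio, and the per-sample scores would be aggregated into the mean, median, and percentile values reported in the table.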
## Usage

### Install

```bash
pip install f5-tts
```

### Download Model

```python
from huggingface_hub import hf_hub_download

# Download checkpoint and vocab
ckpt_path = hf_hub_download("NMikka/F5-TTS-Georgian", "model_110000.pt")
vocab_path = hf_hub_download("NMikka/F5-TTS-Georgian", "extended_vocab.txt")
```

### Inference

The model works best with reference audio from the training dataset. Voice cloning to arbitrary Georgian speakers is a work in progress.

```python
from datasets import load_dataset
from huggingface_hub import hf_hub_download
from f5_tts.api import F5TTS
import soundfile as sf
import numpy as np

# Download model
ckpt_path = hf_hub_download("NMikka/F5-TTS-Georgian", "model_110000.pt")
vocab_path = hf_hub_download("NMikka/F5-TTS-Georgian", "extended_vocab.txt")

# Load a reference sample from the training dataset
ds = load_dataset("NMikka/Common-Voice-Geo-Cleaned", split="test")
ref_sample = ds[0]  # Pick any sample as voice reference

# Save reference audio to temp file (F5-TTS expects a file path)
ref_path = "/tmp/ref.wav"
sf.write(ref_path, np.array(ref_sample["audio"]["array"]), ref_sample["audio"]["sampling_rate"])

# Load model
model = F5TTS(
    ckpt_file=ckpt_path,
    vocab_file=vocab_path,
    device="cuda",
    use_ema=False,  # Important: this checkpoint was not trained with EMA
)

# Generate speech using a training speaker as reference
wav, sr, _ = model.infer(
    ref_file=ref_path,
    ref_text=ref_sample["text"],
    gen_text="გამარჯობა, როგორ ხარ? საქართველო ულამაზესი ქვეყანაა.",
)
sf.write("output.wav", wav, sr)
```

### Generation Parameters

```python
wav, sr, _ = model.infer(
    ref_file="reference.wav",
    ref_text="reference transcript",
    gen_text="text to synthesize",
    nfe_step=32,       # Denoising steps (default 32, higher = better quality, slower)
    cfg_strength=2.0,  # Classifier-free guidance (default 2.0)
    speed=1.0,         # Speech speed multiplier
)
```

## Training Details

| | |
|---|---|
| **Method** | Full fine-tune (flow-matching loss, continuation of pretraining) |
| **Base checkpoint** | `F5TTS_v1_Base/model_1250000.safetensors` |
| **Learning rate** | 1e-5 |
| **Warmup** | 500 steps |
| **Batch size** | 9,600 audio frames per GPU |
| **Max sequences/batch** | 64 |
| **Optimizer** | 8-bit Adam (bitsandbytes) |
| **Epochs** | 100 |
| **Total updates** | 110,000 |
| **Tokenizer** | Character-level (`char`, not `pinyin`) |
| **Vocab** | 2,579 tokens (2,545 pretrained + 34 Georgian characters) |
| **GPU** | 1x NVIDIA RTX A6000 (48GB) |

### Vocab Extension

The pretrained F5-TTS uses a pinyin-based vocabulary (2,545 tokens). For Georgian, we extended it by appending 34 characters: the 33 Georgian letters (ა-ჰ) plus the opening quotation mark „. The text embedding layer was resized from 2,546 to 2,580 rows (F5-TTS allocates one row more than the vocabulary size, for a filler token), and the new rows were initialized with the mean of the pretrained embeddings.

## Limitations and Future Work

- **License**: CC-BY-NC-4.0, non-commercial use only (inherited from F5-TTS weights)
- **Voice cloning to new speakers is limited**: the model clones training speakers well but does not yet generalize to arbitrary Georgian voices. This is an active area of improvement.
- Trained on 12 speakers from Common Voice Georgian, so speaker diversity is limited
- Some complex Georgian text with rare characters may produce higher error rates
- No emotion or prosody control beyond what the reference audio provides

## Part of the Georgian TTS Benchmark

This model was trained as part of the first Georgian TTS benchmark, a comparative study of 6 open-source TTS architectures. See the full project: [github.com/NMikaa/TTS_pipelines](https://github.com/NMikaa/TTS_pipelines)

## Citation

```bibtex
@misc{f5tts-georgian-2026,
  title={F5-TTS Georgian: Fine-tuned Flow-Matching TTS for Georgian},
  author={NMikka},
  year={2026},
  url={https://huggingface.co/NMikka/F5-TTS-Georgian}
}
```