---
language:
- ka
license: cc-by-nc-4.0
base_model: SWivid/F5-TTS
tags:
- tts
- text-to-speech
- georgian
- f5-tts
- speech-synthesis
- flow-matching
pipeline_tag: text-to-speech
datasets:
- NMikka/Common-Voice-Geo-Cleaned
---
# F5-TTS Georgian

A fine-tuned version of [SWivid/F5-TTS](https://huggingface.co/SWivid/F5-TTS) (335M params) for **Georgian text-to-speech**. The model produces high-quality Georgian speech when using training speakers as reference. Generalization to arbitrary voice cloning is a work in progress.
## Model Details

| | |
|---|---|
| **Base model** | [SWivid/F5-TTS v1 Base](https://huggingface.co/SWivid/F5-TTS) (335M params, DiT + ConvNeXt V2) |
| **Fine-tuning** | Full fine-tune (continuation of flow-matching pretraining), no LoRA |
| **Training data** | [NMikka/Common-Voice-Geo-Cleaned](https://huggingface.co/datasets/NMikka/Common-Voice-Geo-Cleaned) (20,300 samples, 12 speakers) |
| **Training** | 110,000 updates (~100 epochs), single NVIDIA RTX A6000 (48GB) |
| **Sample rate** | 24 kHz |
| **Voice cloning** | Works well with training speakers; generalizing to new voices is WIP |
| **License** | CC-BY-NC-4.0 (inherited from F5-TTS pretrained weights) |
## Evaluation: FLEURS Georgian Benchmark (979 unseen samples)

Round-trip CER: TTS generates audio → [Meta Omnilingual ASR 7B](https://huggingface.co/facebook/omniASR_LLM_7B) transcribes → compare to original text.

| Metric | Value |
|---|---|
| **CER mean** | **0.0509** |
| CER median | 0.0309 |
| CER p90 | 0.1183 |
| CER std | 0.0558 |
| WER mean | 0.1866 |
| WER median | 0.1600 |
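
The round-trip metrics in the table are normalized edit distances. The following is a minimal sketch, assuming a plain Levenshtein distance over characters (CER) or whitespace-separated words (WER) with no text normalization. The normalization used in the actual evaluation is not stated here, and the function names `edit_distance`, `cer`, and `wer` are illustrative only.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate; assumes a non-empty reference string."""
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate; assumes the reference contains at least one word."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)
```

For example, `cer("hello world", "hello worle")` is 1/11 while `wer("hello world", "hello worle")` is 1/2: a single wrong character makes its whole word count as an error, which is one reason WER figures sit well above CER figures.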

**CER distribution:**

- 65.9% of samples < 5% CER
- 85.9% of samples < 10% CER
- 96.5% of samples < 20% CER
- 0 catastrophic failures (> 50% CER)

Evaluated with speaker 3 reference audio (NISQA MOS 4.99).
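
Summary statistics and threshold shares of this kind can be derived from a list of per-sample CER values. The sketch below uses made-up toy values, not the benchmark data. It also assumes the population standard deviation and the `inclusive` quantile method for p90; the evaluation may have used different conventions. `summarize_cer` is a hypothetical helper name.

```python
import statistics


def summarize_cer(values):
    """Aggregate per-sample CER values into summary statistics."""
    n = len(values)
    deciles = statistics.quantiles(values, n=10, method="inclusive")
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "p90": deciles[8],                      # ninth cut point = 90th percentile
        "std": statistics.pstdev(values),       # population std (assumption)
        "below_5pct": sum(v < 0.05 for v in values) / n,
        "below_10pct": sum(v < 0.10 for v in values) / n,
        "below_20pct": sum(v < 0.20 for v in values) / n,
        "catastrophic": sum(v > 0.50 for v in values),  # count, not share
    }


# Toy values for illustration only.
toy_cers = [0.00, 0.02, 0.03, 0.04, 0.08, 0.12, 0.25, 0.60]
summary = summarize_cer(toy_cers)
```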
## Usage

### Install

```bash
pip install f5-tts
```

### Download Model

```python
from huggingface_hub import hf_hub_download

# Download checkpoint and vocab
ckpt_path = hf_hub_download("NMikka/F5-TTS-Georgian", "model_110000.pt")
vocab_path = hf_hub_download("NMikka/F5-TTS-Georgian", "extended_vocab.txt")
```
### Inference

The model works best with reference audio from the training dataset. Voice cloning to arbitrary Georgian speakers is a work in progress.

```python
from datasets import load_dataset
from huggingface_hub import hf_hub_download
from f5_tts.api import F5TTS
import soundfile as sf
import numpy as np

# Download model
ckpt_path = hf_hub_download("NMikka/F5-TTS-Georgian", "model_110000.pt")
vocab_path = hf_hub_download("NMikka/F5-TTS-Georgian", "extended_vocab.txt")

# Load a reference sample from the training dataset
ds = load_dataset("NMikka/Common-Voice-Geo-Cleaned", split="test")
ref_sample = ds[0]  # Pick any sample as voice reference

# Save reference audio to temp file (F5-TTS expects a file path)
ref_path = "/tmp/ref.wav"
sf.write(ref_path, np.array(ref_sample["audio"]["array"]), ref_sample["audio"]["sampling_rate"])

# Load model
model = F5TTS(
    ckpt_file=ckpt_path,
    vocab_file=vocab_path,
    device="cuda",
    use_ema=False,  # Important: this checkpoint was not trained with EMA
)

# Generate speech using a training speaker as reference
wav, sr, _ = model.infer(
    ref_file=ref_path,
    ref_text=ref_sample["text"],
    gen_text="გამარჯობა, როგორ ხარ? საქართველო ულამაზესი ქვეყანაა.",
)
sf.write("output.wav", wav, sr)
```
### Generation Parameters

```python
wav, sr, _ = model.infer(
    ref_file="reference.wav",
    ref_text="reference transcript",
    gen_text="text to synthesize",
    nfe_step=32,       # Denoising steps (default 32, higher = better quality, slower)
    cfg_strength=2.0,  # Classifier-free guidance (default 2.0)
    speed=1.0,         # Speech speed multiplier
)
```
## Training Details

| | |
|---|---|
| **Method** | Full fine-tune (flow-matching loss, continuation of pretraining) |
| **Base checkpoint** | `F5TTS_v1_Base/model_1250000.safetensors` |
| **Learning rate** | 1e-5 |
| **Warmup** | 500 steps |
| **Batch size** | 9,600 audio frames per GPU |
| **Max sequences/batch** | 64 |
| **Optimizer** | 8-bit Adam (bitsandbytes) |
| **Epochs** | 100 |
| **Total updates** | 110,000 |
| **Tokenizer** | Character-level (`char`, not `pinyin`) |
| **Vocab** | 2,579 tokens (2,545 pretrained + 34 Georgian characters) |
| **GPU** | 1x NVIDIA RTX A6000 (48GB) |
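
The figures in this table can be cross-checked with a little arithmetic. The sketch below assumes a mel hop length of 256 samples, which is the usual F5-TTS default but is not stated in this card, and assumes each batch is filled to its frame budget. The implied clip length is therefore a rough upper bound, not a measured value.

```python
SAMPLE_RATE = 24_000       # Hz, from the model details
HOP_LENGTH = 256           # samples per mel frame (assumed F5-TTS default)
FRAMES_PER_BATCH = 9_600   # batch size in audio frames
TOTAL_UPDATES = 110_000
EPOCHS = 100
NUM_SAMPLES = 20_300

frames_per_second = SAMPLE_RATE / HOP_LENGTH                # 93.75 frames/s
seconds_per_batch = FRAMES_PER_BATCH / frames_per_second    # about 102.4 s of audio
updates_per_epoch = TOTAL_UPDATES / EPOCHS                  # 1100 updates
samples_per_update = NUM_SAMPLES / updates_per_epoch        # about 18.45 clips
implied_clip_seconds = seconds_per_batch / samples_per_update  # about 5.55 s

print(f"{seconds_per_batch:.1f} s of audio per update")
print(f"{samples_per_update:.2f} clips per update")
print(f"{implied_clip_seconds:.2f} s implied average clip length (upper bound)")
```

Under these assumptions, about 18 clips per update is well below the 64-sequence cap, so the frame budget rather than the sequence limit would be the binding constraint.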
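### Vocab Extension

The pretrained F5-TTS uses a pinyin-based vocabulary (2,545 tokens). For Georgian, we extended the vocabulary by appending 34 Georgian Unicode characters (the 33 letters ა-ჰ plus one additional character), giving 2,579 tokens. New embeddings were initialized with the mean of the existing pretrained embeddings, and the text embedding layer was resized from 2,546 → 2,580 rows. Each count is one more than the vocabulary size, which is consistent with F5-TTS reserving one extra embedding index as a filler token.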
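
The extension step can be sketched in plain Python, with a list of lists standing in for the embedding matrix. This is an illustration under stated assumptions, not the training code: `extend_vocab` and `extend_embedding` are hypothetical helpers, the mean is taken over all existing rows (the card does not say which rows were averaged), and only the 33 alphabet letters are generated here. Any further character would be appended the same way.

```python
# The 33 letters of the modern Georgian alphabet, U+10D0 through U+10F0.
GEORGIAN_LETTERS = [chr(cp) for cp in range(0x10D0, 0x10F1)]


def extend_vocab(vocab, new_tokens):
    """Append tokens not already present, keeping existing indices stable."""
    seen = set(vocab)
    extended = list(vocab)
    for tok in new_tokens:
        if tok not in seen:
            extended.append(tok)
            seen.add(tok)
    return extended


def extend_embedding(rows, n_new):
    """Append n_new rows, each set to the column-wise mean of the existing rows."""
    mean_row = [sum(col) / len(rows) for col in zip(*rows)]
    return rows + [list(mean_row) for _ in range(n_new)]


# Toy stand-ins: a 4-token vocabulary with 3-dimensional embeddings.
toy_vocab = [" ", "a", "b", "c"]
toy_rows = [[0.0, 1.0, 2.0], [2.0, 1.0, 0.0], [1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]

new_vocab = extend_vocab(toy_vocab, GEORGIAN_LETTERS)
new_rows = extend_embedding(toy_rows, len(new_vocab) - len(toy_vocab))
```

Appending at the end keeps every pretrained token at its original index, so the pretrained rows stay valid and only the new rows need to be learned from scratch.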
## Limitations and Future Work

- **License**: CC-BY-NC-4.0, non-commercial use only (inherited from F5-TTS weights)
- **Voice cloning to new speakers is limited**: the model clones training speakers well but does not yet generalize to arbitrary Georgian voices. This is an active area of improvement.
- Trained on 12 speakers from Common Voice Georgian, so speaker diversity is limited
- Some complex Georgian text with rare characters may produce higher error rates
- No emotion or prosody control beyond what the reference audio provides
## Part of the Georgian TTS Benchmark

This model was trained as part of the first Georgian TTS benchmark, a comparative study of 6 open-source TTS architectures. See the full project: [github.com/NMikaa/TTS_pipelines](https://github.com/NMikaa/TTS_pipelines)
## Citation

```bibtex
@misc{f5tts-georgian-2026,
  title={F5-TTS Georgian: Fine-tuned Flow-Matching TTS for Georgian},
  author={NMikka},
  year={2026},
  url={https://huggingface.co/NMikka/F5-TTS-Georgian}
}
```