	---
	language:
	- ka
	license: cc-by-nc-4.0
	base_model: SWivid/F5-TTS
	tags:
	- tts
	- text-to-speech
	- georgian
	- f5-tts
	- speech-synthesis
	- flow-matching
	pipeline_tag: text-to-speech
	datasets:
	- NMikka/Common-Voice-Geo-Cleaned
	---

	# F5-TTS Georgian

	A fine-tuned version of [SWivid/F5-TTS](https://huggingface.co/SWivid/F5-TTS) (335M params) for Georgian text-to-speech. The model produces intelligible, natural-sounding Georgian speech (mean round-trip CER of 0.0509 on FLEURS Georgian) when the reference audio comes from one of the training speakers. Generalization to arbitrary voices is a work in progress.

	## Model Details

	| | |
	|---|---|
	| Base model | [SWivid/F5-TTS v1 Base](https://huggingface.co/SWivid/F5-TTS) (335M params, DiT + ConvNeXt V2) |
	| Fine-tuning | Full fine-tune (continuation of flow-matching pretraining), no LoRA |
	| Training data | [NMikka/Common-Voice-Geo-Cleaned](https://huggingface.co/datasets/NMikka/Common-Voice-Geo-Cleaned) (20,300 samples, 12 speakers) |
	| Training | 110,000 updates (~100 epochs), single NVIDIA RTX A6000 (48GB) |
	| Sample rate | 24 kHz |
	| Voice cloning | Works well with training speakers; generalizing to new voices is WIP |
	| License | CC-BY-NC-4.0 (inherited from F5-TTS pretrained weights) |

	## Evaluation — FLEURS Georgian Benchmark (979 unseen samples)

	Round-trip CER: the TTS model generates audio, [Meta Omnilingual ASR 7B](https://huggingface.co/facebook/omniASR_LLM_7B) transcribes it, and the transcript is compared to the original text at the character level.
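As a rough illustration of the comparison step, the sketch below computes CER and WER as Levenshtein distance divided by reference length. The function names are hypothetical, and any text normalization (punctuation, casing, whitespace) applied in the actual evaluation is not specified in this card, so none is assumed here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character edits divided by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: word edits divided by number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


# One dropped character in a 9-character word
print(cer("გამარჯობა", "გამარჯობ"))
```

In a real round trip, `hypothesis` would be the ASR transcript of the synthesized audio and `reference` the text passed to the TTS model.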

	| Metric | Value |
	|---|---|
	| CER mean | 0.0509 |
	| CER median | 0.0309 |
	| CER p90 | 0.1183 |
	| CER std | 0.0558 |
	| WER mean | 0.1866 |
	| WER median | 0.1600 |

	CER distribution:
	- 65.9% of samples < 5% CER
	- 85.9% of samples < 10% CER
	- 96.5% of samples < 20% CER
	- 0 catastrophic failures (> 50% CER)
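
The threshold shares above can be derived from per-sample CER values along the following lines. This is a minimal sketch: `summarize` is a hypothetical helper, the input values are made up for illustration, and whether the reported standard deviation is the population or sample variant is not stated in this card (population is assumed here).

```python
import statistics


def summarize(cers, thresholds=(0.05, 0.10, 0.20), catastrophic=0.50):
    """Summary statistics and threshold shares for a list of per-sample CERs."""
    n = len(cers)
    summary = {
        "mean": statistics.fmean(cers),
        "median": statistics.median(cers),
        "std": statistics.pstdev(cers),  # assumption: population std
    }
    for t in thresholds:
        # Fraction of samples strictly below the threshold
        summary[f"share_below_{t:.2f}"] = sum(c < t for c in cers) / n
    # Count of samples strictly above the catastrophic-failure threshold
    summary["catastrophic"] = sum(c > catastrophic for c in cers)
    return summary


# Made-up values for illustration only, not the benchmark results
example = [0.00, 0.02, 0.04, 0.08, 0.15, 0.30]
for key, value in summarize(example).items():
    print(key, value)
```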

	Evaluated with speaker 3 reference audio (NISQA MOS 4.99).

	## Usage

	### Install

	```bash
	pip install f5-tts
	```

	### Download Model

	```python
	from huggingface_hub import hf_hub_download

	# Download checkpoint and vocab
	ckpt_path = hf_hub_download("NMikka/F5-TTS-Georgian", "model_110000.pt")
	vocab_path = hf_hub_download("NMikka/F5-TTS-Georgian", "extended_vocab.txt")
	```

	### Inference

	The model works best with reference audio from the training dataset. Voice cloning to arbitrary Georgian speakers is a work in progress.

	```python
	from datasets import load_dataset
	from huggingface_hub import hf_hub_download
	from f5_tts.api import F5TTS
	import soundfile as sf
	import numpy as np

	# Download model
	ckpt_path = hf_hub_download("NMikka/F5-TTS-Georgian", "model_110000.pt")
	vocab_path = hf_hub_download("NMikka/F5-TTS-Georgian", "extended_vocab.txt")

	# Load a reference sample from the training dataset
	ds = load_dataset("NMikka/Common-Voice-Geo-Cleaned", split="test")
	ref_sample = ds[0] # Pick any sample as voice reference

	# Save reference audio to temp file (F5-TTS expects a file path)
	ref_path = "/tmp/ref.wav"
	sf.write(ref_path, np.array(ref_sample["audio"]["array"]), ref_sample["audio"]["sampling_rate"])

	# Load model
	model = F5TTS(
	    ckpt_file=ckpt_path,
	    vocab_file=vocab_path,
	    device="cuda",
	    use_ema=False,  # Important: this checkpoint was not trained with EMA
	)

	# Generate speech using a training speaker as reference
	wav, sr, _ = model.infer(
	    ref_file=ref_path,
	    ref_text=ref_sample["text"],
	    gen_text="გამარჯობა, როგორ ხარ? საქართველო ულამაზესი ქვეყანაა.",
	)
	sf.write("output.wav", wav, sr)
	```

	### Generation Parameters

	```python
	# `model` is the F5TTS instance created in the Inference example above
	wav, sr, _ = model.infer(
	    ref_file="reference.wav",
	    ref_text="reference transcript",
	    gen_text="text to synthesize",
	    nfe_step=32,       # Denoising steps (default 32; higher = better quality, slower)
	    cfg_strength=2.0,  # Classifier-free guidance strength (default 2.0)
	    speed=1.0,         # Speech speed multiplier
	)
	```
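
To give some intuition for `nfe_step` and `cfg_strength`, here is a toy sketch, not the F5-TTS sampler itself. It assumes the sampler integrates a velocity field with fixed-step Euler updates and combines conditional and unconditional predictions in a common classifier-free-guidance form; both are assumptions for illustration, and the function names are hypothetical.

```python
import math


def guided(v_cond, v_uncond, cfg_strength):
    """A common CFG form: push the conditional prediction away from the unconditional one."""
    return v_cond + cfg_strength * (v_cond - v_uncond)


def euler_integrate(velocity, x0, n_steps):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with n_steps Euler steps."""
    x, dt = x0, 1.0 / n_steps
    for i in range(n_steps):
        x += dt * velocity(x, i * dt)  # one function evaluation per step
    return x


# Toy ODE dx/dt = x with x(0) = 1, whose exact value at t=1 is e.
def toy_velocity(x, t):
    return x


for steps in (8, 32, 128):
    err = abs(euler_integrate(toy_velocity, 1.0, steps) - math.e)
    print(f"steps={steps:4d}  abs error={err:.4f}")
```

More steps shrink the integration error at the cost of proportionally more model evaluations, which is the quality versus speed trade-off noted for `nfe_step`.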

	## Training Details

	| | |
	|---|---|
	| Method | Full fine-tune (flow-matching loss, continuation of pretraining) |
	| Base checkpoint | `F5TTS_v1_Base/model_1250000.safetensors` |
	| Learning rate | 1e-5 |
	| Warmup | 500 steps |
	| Batch size | 9,600 audio frames per GPU |
	| Max sequences/batch | 64 |
	| Optimizer | 8-bit Adam (bitsandbytes) |
	| Epochs | 100 |
	| Total updates | 110,000 |
	| Tokenizer | Character-level (`char`, not `pinyin`) |
	| Vocab | 2,579 tokens (2,545 pretrained + 34 Georgian characters) |
	| GPU | 1x NVIDIA RTX A6000 (48GB) |
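
The figures in the table can be cross-checked with some quick arithmetic. The sketch below derives updates per epoch and the average number of samples per update from the reported totals, then converts the frame budget to seconds. Two things are assumptions rather than facts from this card: that all 20,300 samples are used for training, and that a frame corresponds to a mel hop of 256 audio samples at 24 kHz.

```python
total_updates = 110_000
epochs = 100
num_samples = 20_300       # assumption: the full dataset is the training split
frames_per_batch = 9_600
sample_rate_hz = 24_000
hop_length = 256           # assumption: not stated in this card

updates_per_epoch = total_updates / epochs
samples_per_update = num_samples / updates_per_epoch
seconds_per_batch = frames_per_batch * hop_length / sample_rate_hz
seconds_per_sample = seconds_per_batch / samples_per_update

print(f"updates per epoch:  {updates_per_epoch:.0f}")
print(f"samples per update: {samples_per_update:.2f}")
print(f"audio per batch:    {seconds_per_batch:.1f} s")
print(f"implied avg clip:   {seconds_per_sample:.2f} s")
```

Under these assumptions the implied average clip length is a few seconds, which is a plausibility check rather than a measured statistic.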

	### Vocab Extension

	The pretrained F5-TTS checkpoint uses a pinyin-based vocabulary of 2,545 tokens. For Georgian, we appended 34 characters to it: the 33 letters ა-ჰ plus the opening quotation mark „. The text embedding layer was resized from 2,546 to 2,580 rows (one more than the vocabulary size in each case), and the 34 new rows were initialized with the mean of the existing pretrained embeddings.
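
The extension can be sketched in dependency-free Python as follows. This is an illustration under stated assumptions, not the training code: `extend_embedding` is a hypothetical helper, the toy table stands in for the real embedding matrix, and whether the mean was taken over every existing row is assumed. With PyTorch, the same idea would amount to copying the old weights into a larger `nn.Embedding` and filling the new rows with the mean.

```python
# Georgian letters ა (U+10D0) through ჰ (U+10F0), plus the low opening quote „
georgian_chars = [chr(cp) for cp in range(0x10D0, 0x10F0 + 1)] + ["„"]

pretrained_vocab_size = 2_545
extended_vocab_size = pretrained_vocab_size + len(georgian_chars)


def extend_embedding(rows, n_new):
    """Append n_new rows, each set to the column-wise mean of the existing rows."""
    dim = len(rows[0])
    mean_row = [sum(r[j] for r in rows) / len(rows) for j in range(dim)]
    return [list(r) for r in rows] + [list(mean_row) for _ in range(n_new)]


# Toy table: 3 "pretrained" rows of width 2 (the real layer has 2,546 rows)
toy = [[0.0, 2.0], [1.0, 4.0], [2.0, 6.0]]
extended = extend_embedding(toy, n_new=2)

print(len(georgian_chars), extended_vocab_size)
print(extended[-1])
```

Starting new rows at the mean keeps them in the same range as the pretrained embeddings, so early fine-tuning updates are less likely to be dominated by randomly initialized rows.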

	## Limitations and Future Work

	- License: CC-BY-NC-4.0 — non-commercial use only (inherited from F5-TTS weights)
	- Voice cloning to new speakers is limited — the model clones training speakers well but does not yet generalize to arbitrary Georgian voices. This is an active area of improvement.
	- Trained on 12 speakers from Common Voice Georgian — limited speaker diversity
	- Some complex Georgian text with rare characters may produce higher error rates
	- No emotion or prosody control beyond what the reference audio provides

	## Part of the Georgian TTS Benchmark

	This model was trained as part of the first Georgian TTS benchmark — a comparative study of 6 open-source TTS architectures. See the full project: [github.com/NMikaa/TTS_pipelines](https://github.com/NMikaa/TTS_pipelines)

	## Citation

	```bibtex
	@misc{f5tts-georgian-2026,
	  title={F5-TTS Georgian: Fine-tuned Flow-Matching TTS for Georgian},
	  author={NMikka},
	  year={2026},
	  url={https://huggingface.co/NMikka/F5-TTS-Georgian}
	}
	```