---
license: other
license_name: nvidia-open-model-license
license_link: https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/
language:
- ka
tags:
- tts
- text-to-speech
- georgian
- nemo
- magpie-tts
base_model: nvidia/magpie_tts_multilingual_357m
datasets:
- NMikka/Common-Voice-Geo-Cleaned
pipeline_tag: text-to-speech
---

# MagPIE TTS — Georgian

A fine-tuned [MagPIE TTS](https://huggingface.co/nvidia/magpie_tts_multilingual_357m) model for Georgian (ქართული) text-to-speech synthesis.

This is an **open-source TTS model fine-tuned specifically for Georgian**, produced as part of the [Georgian TTS Benchmark](https://github.com/NikaGaworworw/TTS_pipelines).

## Evaluation Results

Evaluated on the full [FLEURS Georgian](https://huggingface.co/datasets/google/fleurs) test set (979 samples) using round-trip intelligibility:

| Metric | Score |
|--------|-------|
| **CER** | **2.16%** |
| **WER** | **7.08%** |

> CER/WER are measured via round-trip: the TTS model generates audio → [Meta Omnilingual ASR 7B](https://huggingface.co/facebook/omniASR-LLM-7B) transcribes it → the transcript is compared to the original text.

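
The round-trip scoring described above can be sketched as follows. This is a minimal illustration of character and word error rate computed from Levenshtein distance, not the benchmark's actual scoring code: `edit_distance`, `cer`, and `wer` are hypothetical helper names, and any text normalization applied before scoring is not specified in this card.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution (or match)
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character edits divided by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word edits divided by number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


# Hypothetical example: original text vs. an ASR transcript of the synthesized audio
original = "გამარჯობა მსოფლიო"
transcript = "გამარჯობა მსოფლი"
print(f"CER: {cer(original, transcript):.2%}  WER: {wer(original, transcript):.2%}")
```

Corpus-level scores are usually computed by summing edits and reference lengths over all samples rather than averaging per-sample rates; which variant the benchmark uses is not stated here.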
## Quick Start

### Installation

```bash
# Requires NeMo 2.7.2 (install from source at the tested commit).
# The requirement is quoted so the shell does not interpret the brackets.
pip install "nemo_toolkit[tts] @ git+https://github.com/NVIDIA-NeMo/NeMo.git@3d73c48aca1ae3be44657267b81f25dc3201161a"
pip install huggingface_hub
```

> Requires Python 3.10+, PyTorch 2.0+, CUDA 11.8+

### Inference

```python
import re
import torch
import torchaudio
from huggingface_hub import hf_hub_download
from nemo.collections.tts.models.magpietts import MagpieTTSModel

# Download and load model
nemo_path = hf_hub_download(repo_id="NMikka/Magpie-TTS-Geo-357m", filename="magpie_tts_georgian.nemo")
model = MagpieTTSModel.restore_from(nemo_path, map_location="cpu")
model = model.eval().cuda()

TOKENIZER_NAME = "text_ce_tokenizer"
# Measured in UTF-8 bytes (the text encoder is byte-level).
MAX_TOKENS_PER_CHUNK = 400  # ~133 Georgian chars, keeps well under 500 decoder steps


def split_georgian_text(text: str) -> list[str]:
    """Split Georgian text into chunks suitable for TTS inference.

    Splitting priority:
    1. Sentence-ending punctuation (. ! ?)
    2. Clause-level punctuation (, ; : —)
    3. Word boundaries (whitespace) as last resort for very long spans
    """
    sentences = re.split(r'(?<=[.!?])\s+', text)
    chunks = []

    for sentence in sentences:
        est_tokens = len(sentence.encode('utf-8'))
        if est_tokens <= MAX_TOKENS_PER_CHUNK:
            chunks.append(sentence)
            continue

        # Sentence is too long: pack clauses greedily into chunks
        clauses = re.split(r'(?<=[,;:—])\s+', sentence)
        current = ""
        for clause in clauses:
            combined = f"{current} {clause}".strip() if current else clause
            if len(combined.encode('utf-8')) <= MAX_TOKENS_PER_CHUNK:
                current = combined
            else:
                if current:
                    chunks.append(current)
                if len(clause.encode('utf-8')) > MAX_TOKENS_PER_CHUNK:
                    # Clause alone is too long: fall back to word boundaries
                    words = clause.split()
                    current = ""
                    for word in words:
                        combined = f"{current} {word}".strip() if current else word
                        if len(combined.encode('utf-8')) <= MAX_TOKENS_PER_CHUNK:
                            current = combined
                        else:
                            if current:
                                chunks.append(current)
                            current = word
                else:
                    current = clause
        if current:
            chunks.append(current)

    return [c for c in chunks if c.strip()]


def tokenize_chunks(chunks: list[str], tokenizer, eos_id: int):
    """Tokenize pre-split text chunks, appending EOS to each."""
    chunked_tokens = []
    chunked_tokens_len = []
    for chunk in chunks:
        tokens = tokenizer.encode(text=chunk, tokenizer_name=TOKENIZER_NAME)
        tokens = tokens + [eos_id]
        tokens = torch.tensor(tokens, dtype=torch.int32)
        chunked_tokens.append(tokens)
        chunked_tokens_len.append(tokens.shape[0])
    return chunked_tokens, chunked_tokens_len


# Synthesize
text = "გამარჯობა, მე მქვია მაგპაი და ქართულად ვლაპარაკობ."
if text[-1] not in ".!?,:;":
    text += "."

chunks = split_georgian_text(text)
chunked_tokens, chunked_tokens_len = tokenize_chunks(chunks, model.tokenizer, model.eos_id)

chunk_state = model.create_longform_chunk_state(batch_size=1)
all_codes = []

for i, (toks, toks_len) in enumerate(zip(chunked_tokens, chunked_tokens_len)):
    batch = {
        "text": toks.unsqueeze(0).cuda(),
        "text_lens": torch.tensor([toks_len], device="cuda", dtype=torch.long),
        "speaker_indices": 1,  # speaker index (0-4)
    }
    with torch.no_grad():
        output = model.generate_long_form_speech(
            batch,
            chunk_state=chunk_state,
            end_of_text=[i == len(chunked_tokens) - 1],
            beginning_of_text=(i == 0),
            use_cfg=True,
            use_local_transformer_for_inference=True,
        )
    if output.predicted_codes_lens[0] > 0:
        all_codes.append(output.predicted_codes[0, :, :output.predicted_codes_lens[0]])

# Decode to waveform
codes = torch.cat(all_codes, dim=1).unsqueeze(0)
codes_lens = torch.tensor([codes.shape[2]], device="cuda", dtype=torch.long)
audio, audio_lens, _ = model.codes_to_audio(codes, codes_lens)

waveform = audio[0, :audio_lens[0]].cpu().float().unsqueeze(0)
torchaudio.save("output.wav", waveform, 22050)
```

### Convenience Wrapper

For easier use, here is a helper function that wraps the steps above. It relies on `split_georgian_text` and `tokenize_chunks` from the previous block:

```python
def synthesize(model, text, speaker=1, use_cfg=True):
    """Generate Georgian speech from text.

    Args:
        model: Loaded MagpieTTSModel
        text: Georgian text string
        speaker: Baked speaker index (0-4). Speaker 1 recommended.
        use_cfg: Use classifier-free guidance (better quality, 2x slower)

    Returns:
        waveform (torch.Tensor): Audio tensor, shape (1, num_samples), 22050 Hz,
        or None if the text is empty or no audio codes were generated.
    """
    text = text.strip()
    if not text:
        return None  # avoid IndexError on text[-1] for empty input
    if text[-1] not in ".!?,:;":
        text += "."

    chunks = split_georgian_text(text)
    chunked_tokens, chunked_tokens_len = tokenize_chunks(chunks, model.tokenizer, model.eos_id)

    chunk_state = model.create_longform_chunk_state(batch_size=1)
    all_codes = []

    for i, (toks, toks_len) in enumerate(zip(chunked_tokens, chunked_tokens_len)):
        batch = {
            "text": toks.unsqueeze(0).cuda(),
            "text_lens": torch.tensor([toks_len], device="cuda", dtype=torch.long),
            "speaker_indices": speaker,
        }
        with torch.no_grad():
            output = model.generate_long_form_speech(
                batch,
                chunk_state=chunk_state,
                end_of_text=[i == len(chunked_tokens) - 1],
                beginning_of_text=(i == 0),
                use_cfg=use_cfg,
                use_local_transformer_for_inference=True,
            )
        if output.predicted_codes_lens[0] > 0:
            all_codes.append(output.predicted_codes[0, :, :output.predicted_codes_lens[0]])

    if not all_codes:
        return None

    codes = torch.cat(all_codes, dim=1).unsqueeze(0)
    codes_lens = torch.tensor([codes.shape[2]], device="cuda", dtype=torch.long)
    audio, audio_lens, _ = model.codes_to_audio(codes, codes_lens)
    return audio[0, :audio_lens[0]].cpu().float().unsqueeze(0)


# Usage:
waveform = synthesize(model, "გამარჯობა მსოფლიო")
torchaudio.save("hello_world.wav", waveform, 22050)
```

## How It Works

MagPIE TTS is an **encoder-decoder transformer** (not a diffusion or flow model):

1. **ByT5-small** encodes text at the byte level, so no language-specific tokenizer is needed
2. **6-layer causal encoder** processes text embeddings
3. **CTC monotonic alignment** maps text to audio frames (prevents hallucinations such as skipped or repeated words)
4. **12-layer causal decoder** autoregressively generates NanoCodec tokens
5. **NanoCodec** (22.05 kHz, 8 codebooks) decodes tokens to a waveform

**Classifier-Free Guidance (CFG)** runs two forward passes (with and without text conditioning) and combines their outputs. Set `use_cfg=False` for ~2x faster inference with slightly lower quality.

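
To make the CFG step concrete, the sketch below shows one common way of combining the two passes: the conditional logits are pushed away from the unconditional ones by a scale factor. This is an illustrative assumption, not NeMo's code; `cfg_combine` and `softmax` are hypothetical helpers written in plain Python, and the exact formula used inside `generate_long_form_speech` may differ.

```python
import math


def cfg_combine(cond_logits, uncond_logits, cfg_scale=2.5):
    """Combine conditional and unconditional logits (assumed formulation).

    cfg_scale = 1.0 returns the conditional logits unchanged;
    values above 1.0 amplify what the text conditioning contributes.
    """
    return [
        cfg_scale * c + (1.0 - cfg_scale) * u
        for c, u in zip(cond_logits, uncond_logits)
    ]


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with a sampling temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Toy logits over three codec tokens (made-up numbers for illustration)
cond = [2.0, 1.0, 0.0]    # pass with text conditioning
uncond = [1.0, 1.0, 1.0]  # pass without text conditioning

guided = cfg_combine(cond, uncond, cfg_scale=2.5)
print("guided logits:", guided)
print("p(cond):  ", [round(p, 3) for p in softmax(cond)])
print("p(guided):", [round(p, 3) for p in softmax(guided)])
```

Under this formulation the guided distribution is sharper around tokens the text favors. The second forward pass is also where the roughly 2x cost comes from.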
## Text Chunking

Georgian text requires custom chunking because NeMo's built-in `split_by_sentence` doesn't handle Georgian properly (incorrect capitalization, no splitting of long sentences). The chunker included above splits text with this priority:

1. **Sentence-ending punctuation** (`.` `!` `?`)
2. **Clause-level punctuation** (`,` `;` `:` `—`)
3. **Word boundaries** as a last resort

Each chunk is limited to 400 UTF-8 bytes (about 133 Georgian characters at 3 bytes each), keeping well under the model's 500 decoder step limit.

## Speakers

The model has 5 baked speaker embeddings from pretraining. Set the speaker via `speaker_indices` in the batch dict.

| Index | Quality |
|-------|---------|
| **1** | **Best** (recommended) |
| 0 | Good |
| 2 | Acceptable |
| 3 | Mediocre |
| 4 | Mediocre |

## Parameters

You can tune inference parameters via `model.inference_parameters`:

```python
model.inference_parameters.temperature = 0.6        # sampling temperature (lower = more deterministic)
model.inference_parameters.topk = 80                # top-k sampling (lower = more focused)
model.inference_parameters.cfg_scale = 2.5          # CFG strength (higher = follows text more strictly)
model.inference_parameters.max_decoder_steps = 500  # max generation length in frames
```

## Training Details

| Setting | Value |
|---|---|
| **Base model** | [nvidia/magpie_tts_multilingual_357m](https://huggingface.co/nvidia/magpie_tts_multilingual_357m) |
| **Method** | Full SFT via NeMo |
| **Training data** | [NMikka/Common-Voice-Geo-Cleaned](https://huggingface.co/datasets/NMikka/Common-Voice-Geo-Cleaned) (~20,300 clips, 24 kHz, resampled to 22,050 Hz) |
| **Parameters** | 357M (all trainable) |
| **Epochs** | 37 |
| **Steps** | 15,614 |
| **Learning rate** | 2e-5 |
| **Precision** | bf16-mixed |
| **GPU** | 1x A6000 (48 GB) |
| **Best val_loss** | 9.5569 |
| **Sample rate** | 22,050 Hz |
| **Codec** | NanoCodec (8 codebooks, 21.5 fps, 1.89 kbps) |

## Limitations

- **Single language**: Fine-tuned on Georgian only. The base model is multilingual, but this checkpoint is specialized for Georgian.
- **No voice cloning**: Uses 5 baked speaker embeddings from pretraining. Reference audio cloning was not trained.
- **Autoregressive**: Not real-time. The real-time factor (RTF) is ~0.6-0.8 on an A6000 with CFG and ~0.4-0.7 without.
- **NeMo dependency**: Requires the NVIDIA NeMo toolkit. Not a standalone model.
- **NanoCodec dependency**: The codec model (`nvidia/nemo-nano-codec-22khz-1.89kbps-21.5fps`) is downloaded automatically on first use.

## Citation

```bibtex
@misc{magpie-tts-georgian-2026,
  title={MagPIE TTS Georgian: Fine-tuned Text-to-Speech for Georgian},
  author={TODO},
  year={2026},
  url={https://huggingface.co/NMikka/Magpie-TTS-Geo-357m}
}
```

## Acknowledgments

- [NVIDIA NeMo](https://github.com/NVIDIA/NeMo) for the MagPIE TTS architecture and training framework
- [NMikka/Common-Voice-Geo-Cleaned](https://huggingface.co/datasets/NMikka/Common-Voice-Geo-Cleaned) for the cleaned Georgian speech dataset
- [Google FLEURS](https://huggingface.co/datasets/google/fleurs) for the evaluation benchmark