---
license: other
license_name: nvidia-open-model-license
license_link: https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/
language:
- ka
tags:
- tts
- text-to-speech
- georgian
- nemo
- magpie-tts
base_model: nvidia/magpie_tts_multilingual_357m
datasets:
- NMikka/Common-Voice-Geo-Cleaned
pipeline_tag: text-to-speech
---

# MagPIE TTS — Georgian

A fine-tuned [MagPIE TTS](https://huggingface.co/nvidia/magpie_tts_multilingual_357m) model for Georgian (ქართული) text-to-speech synthesis.

This is an **open-source TTS model fine-tuned specifically for Georgian**, produced as part of the [Georgian TTS Benchmark](https://github.com/NikaGaworworw/TTS_pipelines).

## Evaluation Results

Evaluated on the full [FLEURS Georgian](https://huggingface.co/datasets/google/fleurs) test set (979 samples) using round-trip intelligibility:

| Metric | Score |
|--------|-------|
| **CER** | **2.16%** |
| **WER** | **7.08%** |

> CER/WER are measured via a round trip: the TTS model generates audio, [Meta Omnilingual ASR 7B](https://huggingface.co/facebook/omniASR-LLM-7B) transcribes it, and the transcript is compared to the original text.
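
As a rough illustration of the scoring step, the sketch below computes CER and WER from a reference text and an ASR transcript using Levenshtein distance. The function names (`edit_distance`, `cer`, `wer`) are hypothetical, and any text normalization applied by the benchmark, which this card does not describe, is not reproduced here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character edits / reference characters."""
    return edit_distance(reference, hypothesis) / max(len(reference), 1)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word edits / reference words."""
    ref_words, hyp_words = reference.split(), hypothesis.split()
    return edit_distance(ref_words, hyp_words) / max(len(ref_words), 1)


reference = "გამარჯობა მსოფლიო"   # text given to the TTS model
transcript = "გამარჯობა მსოფლია"  # hypothetical ASR output with one wrong character
print(f"CER: {cer(reference, transcript):.2%}")
print(f"WER: {wer(reference, transcript):.2%}")
```

A single wrong character costs one edit at the character level but a whole word at the word level, which is why WER is typically several times higher than CER.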

## Quick Start

### Installation

```bash
# Requires NeMo 2.7.2 (install from source at the tested commit)
pip install "nemo_toolkit[tts]@git+https://github.com/NVIDIA-NeMo/NeMo.git@3d73c48aca1ae3be44657267b81f25dc3201161a"
pip install huggingface_hub
```

> Requires Python 3.10+, PyTorch 2.0+, CUDA 11.8+

### Inference

```python
import re
import torch
import torchaudio
from huggingface_hub import hf_hub_download
from nemo.collections.tts.models.magpietts import MagpieTTSModel

# Download and load model
nemo_path = hf_hub_download(repo_id="NMikka/Magpie-TTS-Geo-357m", filename="magpie_tts_georgian.nemo")
model = MagpieTTSModel.restore_from(nemo_path, map_location="cpu")
model = model.eval().cuda()

TOKENIZER_NAME = "text_ce_tokenizer"
MAX_TOKENS_PER_CHUNK = 400  # measured in UTF-8 bytes (~133 Georgian chars); keeps well under 500 decoder steps


def split_georgian_text(text: str) -> list[str]:
    """Split Georgian text into chunks suitable for TTS inference.

    Splitting priority:
    1. Sentence-ending punctuation (. ! ?)
    2. Clause-level punctuation (, ; : —)
    3. Word boundaries (whitespace) as last resort for very long spans
    """
    sentences = re.split(r'(?<=[.!?])\s+', text)

    chunks = []
    for sentence in sentences:
        est_tokens = len(sentence.encode('utf-8'))
        if est_tokens <= MAX_TOKENS_PER_CHUNK:
            chunks.append(sentence)
            continue

        clauses = re.split(r'(?<=[,;:—])\s+', sentence)
        current = ""
        for clause in clauses:
            combined = f"{current} {clause}".strip() if current else clause
            if len(combined.encode('utf-8')) <= MAX_TOKENS_PER_CHUNK:
                current = combined
            else:
                if current:
                    chunks.append(current)
                if len(clause.encode('utf-8')) > MAX_TOKENS_PER_CHUNK:
                    words = clause.split()
                    current = ""
                    for word in words:
                        combined = f"{current} {word}".strip() if current else word
                        if len(combined.encode('utf-8')) <= MAX_TOKENS_PER_CHUNK:
                            current = combined
                        else:
                            if current:
                                chunks.append(current)
                            current = word
                else:
                    current = clause
        if current:
            chunks.append(current)

    return [c for c in chunks if c.strip()]


def tokenize_chunks(chunks: list[str], tokenizer, eos_id: int):
    """Tokenize pre-split text chunks, appending EOS to each."""
    chunked_tokens = []
    chunked_tokens_len = []
    for chunk in chunks:
        tokens = tokenizer.encode(text=chunk, tokenizer_name=TOKENIZER_NAME)
        tokens = tokens + [eos_id]
        tokens = torch.tensor(tokens, dtype=torch.int32)
        chunked_tokens.append(tokens)
        chunked_tokens_len.append(tokens.shape[0])
    return chunked_tokens, chunked_tokens_len


# Synthesize
text = "გამარჯობა, მე მქვია მაგპაი და ქართულად ვლაპარაკობ."

if text[-1] not in ".!?,:;":
    text += "."

chunks = split_georgian_text(text)
chunked_tokens, chunked_tokens_len = tokenize_chunks(chunks, model.tokenizer, model.eos_id)

chunk_state = model.create_longform_chunk_state(batch_size=1)
all_codes = []

for i, (toks, toks_len) in enumerate(zip(chunked_tokens, chunked_tokens_len)):
    batch = {
        "text": toks.unsqueeze(0).cuda(),
        "text_lens": torch.tensor([toks_len], device="cuda", dtype=torch.long),
        "speaker_indices": 1,  # speaker index (0-4)
    }
    with torch.no_grad():
        output = model.generate_long_form_speech(
            batch,
            chunk_state=chunk_state,
            end_of_text=[i == len(chunked_tokens) - 1],
            beginning_of_text=(i == 0),
            use_cfg=True,
            use_local_transformer_for_inference=True,
        )
    if output.predicted_codes_lens[0] > 0:
        all_codes.append(output.predicted_codes[0, :, :output.predicted_codes_lens[0]])

# Decode to waveform
codes = torch.cat(all_codes, dim=1).unsqueeze(0)
codes_lens = torch.tensor([codes.shape[2]], device="cuda", dtype=torch.long)
audio, audio_lens, _ = model.codes_to_audio(codes, codes_lens)
waveform = audio[0, :audio_lens[0]].cpu().float().unsqueeze(0)

torchaudio.save("output.wav", waveform, 22050)
```

### Convenience Wrapper

For easier use, here's a helper function. It reuses `split_georgian_text` and `tokenize_chunks` from the inference example above:

```python
def synthesize(model, text, speaker=1, use_cfg=True):
    """Generate Georgian speech from text.

    Args:
        model: Loaded MagpieTTSModel
        text: Georgian text string
        speaker: Baked speaker index (0-4). Speaker 1 recommended.
        use_cfg: Use classifier-free guidance (better quality, 2x slower)

    Returns:
        waveform (torch.Tensor): Audio tensor, shape (1, num_samples), 22050 Hz,
        or None if no audio codes were generated.
    """
    text = text.strip()
    if not text:
        return None  # nothing to synthesize; also avoids an IndexError on text[-1]
    if text[-1] not in ".!?,:;":
        text += "."

    chunks = split_georgian_text(text)
    chunked_tokens, chunked_tokens_len = tokenize_chunks(chunks, model.tokenizer, model.eos_id)

    chunk_state = model.create_longform_chunk_state(batch_size=1)
    all_codes = []

    for i, (toks, toks_len) in enumerate(zip(chunked_tokens, chunked_tokens_len)):
        batch = {
            "text": toks.unsqueeze(0).cuda(),
            "text_lens": torch.tensor([toks_len], device="cuda", dtype=torch.long),
            "speaker_indices": speaker,
        }
        with torch.no_grad():
            output = model.generate_long_form_speech(
                batch,
                chunk_state=chunk_state,
                end_of_text=[i == len(chunked_tokens) - 1],
                beginning_of_text=(i == 0),
                use_cfg=use_cfg,
                use_local_transformer_for_inference=True,
            )
        if output.predicted_codes_lens[0] > 0:
            all_codes.append(output.predicted_codes[0, :, :output.predicted_codes_lens[0]])

    if not all_codes:
        return None

    codes = torch.cat(all_codes, dim=1).unsqueeze(0)
    codes_lens = torch.tensor([codes.shape[2]], device="cuda", dtype=torch.long)
    audio, audio_lens, _ = model.codes_to_audio(codes, codes_lens)
    return audio[0, :audio_lens[0]].cpu().float().unsqueeze(0)


# Usage:
waveform = synthesize(model, "გამარჯობა მსოფლიო")
torchaudio.save("hello_world.wav", waveform, 22050)
```

## How It Works

MagPIE TTS is an **encoder-decoder transformer** (not a diffusion or flow model):

1. **ByT5-small** encodes text at the byte level — no language-specific tokenizer needed
2. **6-layer causal encoder** processes text embeddings
3. **CTC monotonic alignment** maps text to audio frames, which guards against hallucinations such as skipped or repeated words
4. **12-layer causal decoder** autoregressively generates NanoCodec tokens
5. **NanoCodec** (22.05 kHz, 8 codebooks) decodes tokens to waveform

**Classifier-Free Guidance (CFG)** runs two forward passes (with and without text conditioning) and combines their outputs, weighted by `cfg_scale`. Set `use_cfg=False` for ~2x faster inference with slightly lower quality.
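
The combination step can be sketched as follows. This assumes the common CFG formulation, `scale * conditional + (1 - scale) * unconditional`; the card does not spell out NeMo's exact implementation, and `apply_cfg` is a hypothetical helper. With a scale above 1, the result is pushed past the conditional logits, away from the unconditional ones.

```python
def apply_cfg(cond_logits, uncond_logits, cfg_scale=2.5):
    """Combine conditional and unconditional logits (common CFG formulation, assumed)."""
    return [
        cfg_scale * c + (1.0 - cfg_scale) * u
        for c, u in zip(cond_logits, uncond_logits)
    ]


cond = [2.0, 0.5, -1.0]    # toy logits from the text-conditioned pass
uncond = [1.0, 0.5, 0.0]   # toy logits from the unconditioned pass

guided = apply_cfg(cond, uncond, cfg_scale=2.5)
print(guided)  # [3.5, 0.5, -2.5]
```

Where the two passes agree (the middle value), guidance changes nothing; where they differ, the gap is amplified in the direction the text conditioning favors.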

## Text Chunking

Georgian text requires custom chunking because NeMo's built-in `split_by_sentence` doesn't handle Georgian properly (incorrect capitalization, no splitting of long sentences). The chunker included above splits text with this priority:

1. **Sentence-ending punctuation** (`.` `!` `?`)
2. **Clause-level punctuation** (`,` `;` `:` `—`)
3. **Word boundaries** as a last resort

Each chunk is limited to 400 UTF-8 bytes (~133 Georgian characters), which keeps generation well under the model's limit of 500 decoder steps.
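
To see where the ~133-character figure comes from: each Georgian (Mkhedruli) letter takes 3 bytes in UTF-8, while spaces and ASCII punctuation take 1. A small check using only the standard library (`utf8_len` is a hypothetical helper name):

```python
MAX_BYTES_PER_CHUNK = 400  # same value as MAX_TOKENS_PER_CHUNK in the inference example


def utf8_len(text: str) -> int:
    """Number of bytes the text occupies in UTF-8."""
    return len(text.encode("utf-8"))


print(utf8_len("ა"))   # 3: one Georgian letter
print(utf8_len(" "))   # 1: ASCII space
print(MAX_BYTES_PER_CHUNK // utf8_len("ა"))  # 133 letters fit in one chunk

sentence = "გამარჯობა, მე მქვია მაგპაი და ქართულად ვლაპარაკობ."
print(utf8_len(sentence), utf8_len(sentence) <= MAX_BYTES_PER_CHUNK)
```

Because spaces and punctuation are single bytes, real sentences fit slightly more than 133 characters per chunk.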

## Speakers

The model has 5 baked speaker embeddings from pretraining. Select one by setting `speaker_indices` in the batch dict.

| Index | Quality |
|-------|---------|
| **1** | **Best** (recommended) |
| 0 | Good |
| 2 | Acceptable |
| 3 | Mediocre |
| 4 | Mediocre |

## Parameters

You can tune inference parameters via `model.inference_parameters`:

```python
model.inference_parameters.temperature = 0.6    # sampling temperature (lower = more deterministic)
model.inference_parameters.topk = 80            # top-k sampling (lower = more focused)
model.inference_parameters.cfg_scale = 2.5      # CFG strength (higher = follows text more strictly)
model.inference_parameters.max_decoder_steps = 500  # max generation length in frames
```
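
As a rough guide for choosing `max_decoder_steps`, the step limit caps how much audio a single chunk can produce. The arithmetic below assumes one codec frame per decoder step and uses the 21.5 fps codec frame rate listed under Training Details; `max_chunk_seconds` is a hypothetical helper.

```python
CODEC_FPS = 21.5          # NanoCodec frame rate from the Training Details table
MAX_DECODER_STEPS = 500   # value used in the parameters example


def max_chunk_seconds(max_steps: int, fps: float = CODEC_FPS) -> float:
    """Upper bound on audio duration (seconds) for one chunk."""
    return max_steps / fps


print(round(max_chunk_seconds(MAX_DECODER_STEPS), 1))  # 23.3
print(round(max_chunk_seconds(250), 1))                # 11.6
```

Lowering the limit shortens the longest possible chunk, so text chunks may need to shrink with it to avoid truncated audio.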

## Training Details

| | |
|---|---|
| **Base model** | [nvidia/magpie_tts_multilingual_357m](https://huggingface.co/nvidia/magpie_tts_multilingual_357m) |
| **Method** | Full SFT via NeMo |
| **Training data** | [NMikka/Common-Voice-Geo-Cleaned](https://huggingface.co/datasets/NMikka/Common-Voice-Geo-Cleaned) (~20,300 clips, 24 kHz, resampled to 22,050 Hz) |
| **Parameters** | 357M (all trainable) |
| **Epochs** | 37 |
| **Steps** | 15,614 |
| **Learning rate** | 2e-5 |
| **Precision** | bf16-mixed |
| **GPU** | 1x A6000 (48GB) |
| **Best val_loss** | 9.5569 |
| **Sample rate** | 22,050 Hz |
| **Codec** | NanoCodec (8 codebooks, 21.5 fps, 1.89 kbps) |
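
Some figures in this table can be cross-checked against each other. The sketch below derives steps per epoch from the step and epoch counts, and the bits carried by each codec token from the bitrate, frame rate, and codebook count. It assumes the reported numbers are exact and that every codebook contributes equally to the bitrate.

```python
steps, epochs = 15_614, 37
steps_per_epoch = steps / epochs
print(steps_per_epoch)  # 422.0

bitrate_bps = 1890   # 1.89 kbps
fps = 21.5           # codec frames per second
codebooks = 8
bits_per_token = bitrate_bps / (fps * codebooks)
print(round(bits_per_token, 2))  # 10.99
```

Roughly 11 bits per token means each codebook entry is drawn from on the order of 2,000 possible values.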

## Limitations

- **Single language**: Fine-tuned on Georgian only. The base model supports 105 languages but this checkpoint is specialized.
- **No voice cloning**: Uses 5 baked speaker embeddings from pretraining. Reference audio cloning was not trained.
- **Autoregressive**: Not real-time. RTF ~0.6-0.8 on A6000 with CFG, ~0.4-0.7 without.
- **NeMo dependency**: Requires NVIDIA NeMo toolkit. Not a standalone model.
- **NanoCodec dependency**: The codec model (`nvidia/nemo-nano-codec-22khz-1.89kbps-21.5fps`) is downloaded automatically on first use.

## Citation

```bibtex
@misc{magpie-tts-georgian-2026,
  title={MagPIE TTS Georgian: Fine-tuned Text-to-Speech for Georgian},
  author={TODO},
  year={2026},
  url={https://huggingface.co/NMikka/Magpie-TTS-Geo-357m}
}
```

## Acknowledgments

- [NVIDIA NeMo](https://github.com/NVIDIA/NeMo) for the MagPIE TTS architecture and training framework
- [NMikka/Common-Voice-Geo-Cleaned](https://huggingface.co/datasets/NMikka/Common-Voice-Geo-Cleaned) for the cleaned Georgian speech dataset
- [Google FLEURS](https://huggingface.co/datasets/google/fleurs) for the evaluation benchmark