---
language:
- si
- en
tags:
- singlish
- sinhala
- transliteration
- seq2seq
- mt5
- lora
license: mit
base_model: google/mt5-base
datasets:
- savinugunarathna/indonlp-final
metrics:
- cer
- wer
- bleu
---

# mT5 — Singlish → Sinhala Transliteration

A fine-tuned version of **google/mt5-base** for **Singlish-to-Sinhala transliteration**, developed for the **IndoNLP 2025 Shared Task on Singlish–Sinhala Transliteration**.

This repository holds the final merged model: the LoRA weights have been absorbed into the base weights, so it loads as a standard `transformers` checkpoint.

---

## Task

Singlish (romanised colloquial Sinhala) → Sinhala script transliteration.

| Input (Singlish) | Output (Sinhala) |
|---|---|
| `mama giya` | `මම ගිය` |
| `kohomada` | `කොහොමද` |

---

## Training Pipeline

The model was trained with **LoRA** under a **three-phase curriculum strategy**, following the same pipeline as the Small100 variant with adaptations for the mT5 architecture.

### Data

| Split | Source | Size |
|---|---|---|
| Phase 1 & 2 training | `phonetic_train_1M.csv` | 1,000,000 samples |
| Adhoc fine-tuning | `adhoc.csv` | 11,937 samples |
| Phonetic validation | `phonetic_test.csv` | 10,003 samples |
| Adhoc validation | `adhoc_test.csv` | 5,003 samples |

### Synthetic Augmentation

Adhoc data was expanded with a **rule-based Singlish augmenter**:
- **Vowel dropping** — randomly drops non-boundary vowels
- **Cluster simplification** — collapses common digraphs (`th→t`, `sh→s`, `nd→n`, etc.)
- **Vowel swapping** — substitutes phonetically similar vowels (`a↔e`, `i↔e`, `o↔u`)

Aggression factor: 0.5. Augmentation was applied at rates of 15%, 20%, and 15% in Phases 1, 2, and 3 respectively.
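
The three rules above can be sketched in a few lines of plain Python. This is an illustrative reconstruction, not the augmenter used for training: the function names, the rule tables beyond the pairs listed above, and the reading of the aggression factor as a per-word perturbation probability are all assumptions.

```python
import random

# Hypothetical rule tables: only the pairs named in this card are included.
CLUSTERS = {"th": "t", "sh": "s", "nd": "n"}
VOWEL_SWAPS = {"a": "e", "e": "ai", "i": "e", "o": "u", "u": "o"}


def augment_word(word, rng, aggression=0.5):
    """Apply at most one randomly chosen noise rule to a romanised word."""
    # Assumption: `aggression` is the probability that a word is perturbed.
    if len(word) < 3 or rng.random() >= aggression:
        return word
    rule = rng.choice(["drop", "cluster", "swap"])
    if rule == "drop":
        # Vowel dropping: only non-boundary vowels are candidates.
        inner = [i for i in range(1, len(word) - 1) if word[i] in VOWEL_SWAPS]
        if inner:
            i = rng.choice(inner)
            return word[:i] + word[i + 1:]
    elif rule == "cluster":
        # Cluster simplification: collapse the first matching digraph.
        for src, dst in CLUSTERS.items():
            if src in word:
                return word.replace(src, dst, 1)
    else:
        # Vowel swapping: replace one vowel with a phonetically similar one.
        spots = [i for i, ch in enumerate(word) if ch in VOWEL_SWAPS]
        if spots:
            i = rng.choice(spots)
            return word[:i] + rng.choice(VOWEL_SWAPS[word[i]]) + word[i + 1:]
    return word  # the chosen rule did not apply to this word


def augment(text, rng, aggression=0.5):
    """Augment a whitespace-separated Singlish string word by word."""
    return " ".join(augment_word(w, rng, aggression) for w in text.split())


print(augment("mama kohomada", random.Random(0)))
```

Passing an explicit `random.Random` instance keeps the noise reproducible across runs, which helps when comparing phases.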

### Input Prefix

Inputs are prefixed with `transliterate: ` at inference time, consistent with T5-style task conditioning.

### Three-Phase Curriculum

| Phase | Data | Epochs | Learning rate | Validation set | Augmentation rate |
|---|---|---|---|---|---|
| 1 — Foundation | 65% of phonetic train (~650K) | 2 | 1e-4 | Phonetic | 15% |
| 2 — Expansion | Remaining phonetic + 5× adhoc + 80K replay | 2 | 5e-5 | Adhoc | 20% |
| 3 — Mastery | 10× adhoc + 200K phonetic mix | 2 | 2e-5 | Adhoc | 15% |
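
As a rough check on the scale of each phase, the sizes implied by the table can be worked out from the dataset counts in the Data section. The sketch below assumes that "5× adhoc" and "10× adhoc" mean oversampling the 11,937 adhoc samples, that "remaining phonetic" is the other 35% of the phonetic set, and that augmented copies are not counted separately; none of this is confirmed beyond what the table states.

```python
PHONETIC_TRAIN = 1_000_000  # phonetic_train_1M.csv
ADHOC_TRAIN = 11_937        # adhoc.csv

# Phase 1: 65% of the phonetic training set.
phase1 = PHONETIC_TRAIN * 65 // 100

# Phase 2: remaining phonetic + 5x adhoc + 80K replay samples.
phase2 = (PHONETIC_TRAIN - phase1) + 5 * ADHOC_TRAIN + 80_000

# Phase 3: 10x adhoc + 200K phonetic samples.
phase3 = 10 * ADHOC_TRAIN + 200_000

print(phase1, phase2, phase3)  # 650000 489685 319370
```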

### LoRA Configuration

| Parameter | Value |
|---|---|
| Rank (r) | 64 |
| Alpha | 128 |
| Dropout | 0.05 |
| Target modules | `q_proj`, `k_proj`, `v_proj`, `out_proj`, `fc1`, `fc2` |
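
To give a sense of what rank 64 and alpha 128 imply, the sketch below computes the standard LoRA scaling factor (`alpha / r`) and the number of extra trainable parameters for one adapted linear layer. The 768×768 shape is taken from the public `google/mt5-base` configuration (`d_model = 768`); the helper name is hypothetical.

```python
R, ALPHA = 64, 128


def lora_extra_params(d_in, d_out, r=R):
    """Trainable parameters LoRA adds to one linear layer.

    LoRA learns A with shape (r, d_in) and B with shape (d_out, r).
    """
    return r * (d_in + d_out)


scaling = ALPHA / R  # factor applied to the low-rank update B @ A
print(scaling)                       # 2.0
print(lora_extra_params(768, 768))   # 98304
```

When reproducing this setup with `peft` on the Hugging Face mT5 implementation, note that its linear layers are named `q`, `k`, `v`, `o`, `wi_0`, `wi_1`, and `wo`, so the target-module list above may need to be mapped accordingly.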

### Training Arguments

| Parameter | Value |
|---|---|
| Batch size | 8 |
| Gradient accumulation | 4 (effective batch: 32) |
| Weight decay | 0.01 |
| Max grad norm | 1.0 |
| Warmup ratio | 0.03 |
| Optimizer | AdamW fused |
| Precision | bfloat16 / fp16 |
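
The effective batch size and warmup ratio translate into step counts as sketched below. This is only an estimate under the assumption of a single device; the exact rounding depends on the trainer implementation, and the function name is hypothetical.

```python
import math


def schedule(num_samples, epochs, batch_size=8, grad_accum=4, warmup_ratio=0.03):
    """Estimate optimizer steps and warmup steps for one training phase."""
    effective_batch = batch_size * grad_accum
    steps_per_epoch = math.ceil(num_samples / effective_batch)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return effective_batch, total_steps, warmup_steps


# Phase 1: ~650K samples for 2 epochs.
print(schedule(650_000, epochs=2))  # (32, 40626, 1219)
```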

---

## Evaluation Results

| Test Set | CER ↓ | WER ↓ | BLEU ↑ | BERTScore ↑ |
|---|---|---|---|---|
| Phonetic | 0.0478 | 0.1764 | 0.6213 | 0.9897 |
| Adhoc | 0.1034 | 0.3015 | 0.4223 | 0.9861 |

> BERTScore computed using [Ransaka/sinhala-bert-medium-v2](https://huggingface.co/Ransaka/sinhala-bert-medium-v2).
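
For reference, CER and WER are edit-distance ratios at the character and word level. The sketch below is a minimal pure-Python version for a single reference/hypothesis pair; it is not the evaluation script behind the numbers above. It counts Unicode code points rather than grapheme clusters, which is an assumption that matters for Sinhala, where vowel signs are separate code points.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution (free on a match)
            ))
        prev = cur
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: character edits divided by reference length."""
    return edit_distance(ref, hyp) / max(len(ref), 1)


def wer(ref, hyp):
    """Word error rate: word edits divided by reference word count."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / max(len(ref_words), 1)


print(cer("kitten", "sitting"))  # 0.5
```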

---

## Usage

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "savinugunarathna/mT5-Singlish-Sinhala-Merged"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

inputs = tokenizer("transliterate: mama giya", return_tensors="pt")
outputs = model.generate(
    **inputs,
    num_beams=4,
    max_length=128,
    length_penalty=1.2,
    repetition_penalty=1.2,
    no_repeat_ngram_size=3,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# → මම ගිය
```

> **Note:** Always prepend `transliterate: ` to inputs. Suppressing `<extra_id_N>` tokens via `bad_words_ids` is recommended for clean output.
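
One way to build the `bad_words_ids` list mentioned in the note is sketched below. The helper name and the default of 100 sentinels are assumptions; the function only relies on the tokenizer's `convert_tokens_to_ids` method and `unk_token_id` attribute, and skips any sentinel the tokenizer does not know.

```python
def sentinel_bad_words_ids(tokenizer, n_sentinels=100):
    """Return [[id], ...] for every <extra_id_N> token the tokenizer knows."""
    ids = []
    for n in range(n_sentinels):
        token_id = tokenizer.convert_tokens_to_ids(f"<extra_id_{n}>")
        # Skip sentinels that resolve to nothing or to the unknown token.
        if token_id is not None and token_id != tokenizer.unk_token_id:
            ids.append([token_id])
    return ids


# Usage with `tokenizer`, `model`, and `inputs` from the snippet above:
# outputs = model.generate(
#     **inputs,
#     num_beams=4,
#     max_length=128,
#     bad_words_ids=sentinel_bad_words_ids(tokenizer),
# )
```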