mT5 — Singlish → Sinhala Transliteration
A fine-tuned version of google/mt5-base for Singlish-to-Sinhala transliteration, developed for the IndoNLP 2025 Shared Task on Singlish–Sinhala Transliteration.
This is the final merged model: the LoRA adapter weights have been absorbed into the base weights, so it loads as a standard seq2seq checkpoint.
Task
Singlish (romanised colloquial Sinhala) → Sinhala script transliteration.
| Input (Singlish) | Output (Sinhala) |
|---|---|
| mama giya | මම ගිය |
| kohomada | කොහොමද |
Training Pipeline
The model was trained with LoRA under a three-phase curriculum, following the same pipeline as the Small100 variant, adapted to the mT5 architecture.
Data
| Split | Source | Size |
|---|---|---|
| Phase 1 & 2 training | phonetic_train_1M.csv | 1,000,000 samples |
| Adhoc fine-tuning | adhoc.csv | 11,937 samples |
| Phonetic validation | phonetic_test.csv | 10,003 samples |
| Adhoc validation | adhoc_test.csv | 5,003 samples |
Synthetic Augmentation
Adhoc data was expanded with a rule-based Singlish augmenter:
- Vowel dropping: randomly drops non-boundary vowels
- Cluster simplification: collapses common digraphs (th→t, sh→s, nd→n, etc.)
- Vowel swapping: substitutes phonetically similar vowels (a↔e, i↔e, o↔u)
The aggression factor was 0.5. Augmentation was applied at rates of 15%, 20%, and 15% in phases 1, 2, and 3 respectively.
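The three rules above can be sketched in a few lines of Python. This is an illustrative reconstruction, not the augmenter used in training: the function names, the exact digraph and vowel tables, the choice of one rule per word, and the reading of the aggression factor as a per-candidate firing probability are all assumptions.

```python
import random

VOWELS = set("aeiou")
# Assumed tables, taken from the examples listed above ("etc." entries are unknown).
CLUSTERS = {"th": "t", "sh": "s", "nd": "n"}
VOWEL_SWAPS = {"a": ["e"], "e": ["a", "i"], "i": ["e"], "o": ["u"], "u": ["o"]}


def drop_vowels(word, rng, p):
    """Drop each interior vowel with probability p; first and last characters are kept."""
    if len(word) <= 2:
        return word
    inner = "".join(
        ch for ch in word[1:-1] if not (ch in VOWELS and rng.random() < p)
    )
    return word[0] + inner + word[-1]


def simplify_clusters(word, rng, p):
    """Collapse each known digraph (all occurrences) with probability p."""
    for cluster, simple in CLUSTERS.items():
        if cluster in word and rng.random() < p:
            word = word.replace(cluster, simple)
    return word


def swap_vowels(word, rng, p):
    """Replace each vowel by a phonetically similar one with probability p."""
    return "".join(
        rng.choice(VOWEL_SWAPS[ch]) if ch in VOWEL_SWAPS and rng.random() < p else ch
        for ch in word
    )


def augment(text, rng, aggression=0.5):
    """Apply one randomly chosen rule to every word (assumed composition)."""
    rules = [drop_vowels, simplify_clusters, swap_vowels]
    return " ".join(rng.choice(rules)(w, rng, aggression) for w in text.split())


def maybe_augment(text, rng, rate, aggression=0.5):
    """Augment a sample with probability `rate` (e.g. 0.15 in phase 1)."""
    return augment(text, rng, aggression) if rng.random() < rate else text


rng = random.Random(0)
print(maybe_augment("mama gedara giya", rng, rate=0.15))
```

Passing a seeded `random.Random` keeps the noise reproducible across runs, which helps when comparing phases that differ only in augmentation rate.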
Input Prefix
Inputs are prefixed with transliterate: at inference time, consistent with T5-style task conditioning.
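A small helper can make the prefixing step hard to forget. The sketch below is hypothetical (the name `with_prefix` is not part of the model's code); the separator, a single space after the colon, is taken from the usage example in this card.

```python
TASK_PREFIX = "transliterate: "


def with_prefix(text: str) -> str:
    """Return `text` with exactly one task prefix in front of it."""
    text = text.strip()
    bare = TASK_PREFIX.strip()  # "transliterate:"
    if text.startswith(bare):
        # Already prefixed: strip it so the spacing is normalised below.
        text = text[len(bare):].strip()
    return TASK_PREFIX + text


print(with_prefix("mama giya"))
```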
Three-Phase Curriculum
| Phase | Data | Epochs | LR | Validation | Aug |
|---|---|---|---|---|---|
| 1 — Foundation | 65% of phonetic train (~650K) | 2 | 1e-4 | Phonetic | 15% |
| 2 — Expansion | Remaining phonetic + 5× adhoc + 80K replay | 2 | 5e-5 | Adhoc | 20% |
| 3 — Mastery | 10× adhoc + 200K phonetic mix | 2 | 2e-5 | Adhoc | 15% |
LoRA Configuration
| Parameter | Value |
|---|---|
| Rank (r) | 64 |
| Alpha | 128 |
| Dropout | 0.05 |
| Target modules | q_proj, k_proj, v_proj, out_proj, fc1, fc2 |
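With rank 64 and alpha 128, the standard LoRA scaling factor alpha / r works out to 2. The toy example below shows what merging an adapter means for a single weight matrix, assuming the usual formulation W' = W + (alpha / r) · B · A. The matrices are made up for illustration and use rank 1 rather than 64 to stay readable.

```python
RANK = 64
ALPHA = 128
SCALE = ALPHA / RANK  # 2.0 under the standard alpha / r scaling


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(w, a, b, scale):
    """Fold a low-rank update into the base weight: W + scale * (B @ A)."""
    delta = matmul(b, a)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


# Toy 2x2 base weight with a rank-1 adapter (A is 1x2, B is 2x1).
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]
B = [[1.0], [3.0]]

merged = merge_lora(W, A, B, SCALE)
print(merged)
```

After merging, a forward pass needs only the single matrix `merged`, which is why a merged checkpoint can be loaded without the adapter library.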
Training Arguments
| Parameter | Value |
|---|---|
| Batch size | 8 |
| Gradient accumulation | 4 (effective batch: 32) |
| Weight decay | 0.01 |
| Max grad norm | 1.0 |
| Warmup ratio | 0.03 |
| Optimizer | AdamW fused |
| Precision | bfloat16 / fp16 |
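Combining the curriculum table with the batch settings above gives a rough idea of how long each phase runs. The arithmetic below is an estimate under stated assumptions: sample counts exclude any extra rows produced by augmentation, "remaining phonetic" is read as the other 35% of the 1M set, each phase is treated as a separate run with its own warmup, and step counts are rounded up.

```python
import math

PHONETIC_TRAIN = 1_000_000
ADHOC = 11_937
EFFECTIVE_BATCH = 8 * 4  # batch size x gradient accumulation
EPOCHS = 2
WARMUP_RATIO = 0.03

phase1 = PHONETIC_TRAIN * 65 // 100
phase_sizes = {
    "1 Foundation": phase1,
    "2 Expansion": (PHONETIC_TRAIN - phase1) + 5 * ADHOC + 80_000,
    "3 Mastery": 10 * ADHOC + 200_000,
}


def schedule(n_samples):
    """Approximate optimizer steps and warmup steps for one phase."""
    steps = math.ceil(n_samples * EPOCHS / EFFECTIVE_BATCH)
    warmup = math.ceil(steps * WARMUP_RATIO)
    return steps, warmup


for name, size in phase_sizes.items():
    steps, warmup = schedule(size)
    print(f"Phase {name}: {size:,} samples/epoch, ~{steps:,} steps, ~{warmup:,} warmup")
```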
Evaluation Results
| Test Set | CER ↓ | WER ↓ | BLEU ↑ | BERTScore ↑ |
|---|---|---|---|---|
| Phonetic | 0.0478 | 0.1764 | 0.6213 | 0.9897 |
| Adhoc | 0.1034 | 0.3015 | 0.4223 | 0.9861 |
BERTScore computed using Ransaka/sinhala-bert-medium-v2.
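For readers who want to sanity-check CER and WER on their own outputs, both can be computed from a plain Levenshtein distance. This is a minimal sketch, not the evaluation script behind the table: it counts Unicode code points as characters, whereas Sinhala text contains combining vowel signs, so a grapheme-based CER would give slightly different numbers. Which unit the reported scores use is not stated here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(
                min(
                    prev[j] + 1,             # deletion
                    curr[j - 1] + 1,         # insertion
                    prev[j - 1] + (r != h),  # substitution or match
                )
            )
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character edits divided by reference length."""
    return edit_distance(reference, hypothesis) / max(len(reference), 1)


def wer(reference, hypothesis):
    """Word error rate: word edits divided by reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / max(len(ref_words), 1)


reference = "මම ගිය"
hypothesis = "මම ගියා"
print(f"CER={cer(reference, hypothesis):.4f}  WER={wer(reference, hypothesis):.4f}")
```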
Usage
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "savinugunarathna/mT5-Singlish-Sinhala-Merged"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

# The task prefix is required; see the note below.
inputs = tokenizer("transliterate: mama giya", return_tensors="pt")
outputs = model.generate(
    **inputs,
    num_beams=4,
    max_length=128,
    length_penalty=1.2,
    repetition_penalty=1.2,
    no_repeat_ngram_size=3,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# → මම ගිය
```
Note: Always prepend transliterate: to inputs. Suppressing <extra_id_N> tokens via bad_words_ids is recommended for clean output.