mT5 — Singlish → Sinhala Transliteration

A fine-tuned version of google/mt5-base for Singlish-to-Sinhala transliteration, developed for the IndoNLP 2025 Shared Task on Singlish–Sinhala Transliteration.

This is the final merged model: the LoRA adapter weights have been folded into the base weights, so no separate adapter is needed at load time.

Task

Singlish (romanised colloquial Sinhala) → Sinhala script transliteration.

Input (Singlish)	Output (Sinhala)
`mama giya`	`මම ගිය`
`kohomada`	`කොහොමද`

Training Pipeline

The model was trained with LoRA under a three-phase curriculum. The pipeline is the same one used for the Small100 variant, adapted to the mT5 architecture.

Data

Split	Source	Size
Phase 1 & 2 training	`phonetic_train_1M.csv`	1,000,000 samples
Adhoc fine-tuning	`adhoc.csv`	11,937 samples
Phonetic validation	`phonetic_test.csv`	10,003 samples
Adhoc validation	`adhoc_test.csv`	5,003 samples

Synthetic Augmentation

Adhoc data was expanded with a rule-based Singlish augmenter:

Vowel dropping — randomly drops non-boundary vowels
Cluster simplification — collapses common digraphs (th→t, sh→s, nd→n, etc.)
Vowel swapping — substitutes phonetically similar vowels (a↔e, i↔e, o↔u)

Aggression factor: 0.5. Augmentation rate: 15% / 20% / 15% in Phases 1 / 2 / 3.
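The three rules above can be sketched in a few lines of Python. This is only an illustration, not the augmenter used in training: the function names are hypothetical, the cluster table contains only the three digraphs named above, and the way the aggression factor is applied (one randomly chosen rule per word, fired with that probability) is an assumption.

```python
import random

VOWELS = set("aeiou")
# Only the digraphs named in the list above; the "etc." is left unfilled.
CLUSTERS = {"th": "t", "sh": "s", "nd": "n"}
# a<->e, i<->e, o<->u, written out in both directions.
SWAPS = {"a": "e", "e": "ai", "i": "e", "o": "u", "u": "o"}


def augment_word(word, rng, aggression=0.5):
    """Apply one randomly chosen rule to `word` with probability `aggression`.

    Assumption: the real augmenter may use the aggression factor differently.
    """
    if len(word) < 3 or rng.random() >= aggression:
        return word
    rule = rng.choice(("drop", "cluster", "swap"))
    if rule == "cluster":
        for src, dst in CLUSTERS.items():
            if src in word:
                return word.replace(src, dst, 1)
        return word
    # Interior vowels only, so the first and last characters are never edited.
    # (Restricting swaps to interior vowels is an assumption of this sketch.)
    inner = [i for i in range(1, len(word) - 1) if word[i] in VOWELS]
    if not inner:
        return word
    i = rng.choice(inner)
    if rule == "drop":
        return word[:i] + word[i + 1:]
    return word[:i] + rng.choice(SWAPS[word[i]]) + word[i + 1:]


def augment(text, rng, aggression=0.5):
    """Augment a lowercase, whitespace-separated Singlish string word by word."""
    return " ".join(augment_word(w, rng, aggression) for w in text.split())


print(augment("mama giya kohomada", random.Random(0)))
```

Passing an explicit `random.Random` instance keeps the noise reproducible, which makes it easier to compare runs across phases.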

Input Prefix

Inputs are prefixed with `transliterate: ` at inference time, consistent with T5-style task conditioning.

Three-Phase Curriculum

Phase	Data	Epochs	LR	Validation	Aug
1 — Foundation	65% of phonetic train (~650K)	2	1e-4	Phonetic	15%
2 — Expansion	Remaining phonetic + 5× adhoc + 80K replay	2	5e-5	Adhoc	20%
3 — Mastery	10× adhoc + 200K phonetic mix	2	2e-5	Adhoc	15%
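One way to assemble the three phase datasets from the table is sketched below. It is a guess at the mixing logic, not the actual training script: `build_phases` is a hypothetical helper, the replay set is assumed to come from the Phase 1 portion, and the Phase 3 phonetic mix is assumed to be sampled from the full phonetic set.

```python
import random


def build_phases(phonetic, adhoc, rng, foundation_frac=0.65,
                 replay_size=80_000, mastery_phonetic=200_000):
    """Return (phase1, phase2, phase3) as lists of samples."""
    phonetic = list(phonetic)
    adhoc = list(adhoc)
    rng.shuffle(phonetic)
    cut = round(len(phonetic) * foundation_frac)
    seen, remaining = phonetic[:cut], phonetic[cut:]

    # Phase 1: 65% of the phonetic training data.
    phase1 = seen
    # Phase 2: remaining phonetic + 5x adhoc + replay.
    # Assumption: replay samples are drawn from the Phase 1 portion.
    replay = rng.sample(seen, min(replay_size, len(seen)))
    phase2 = remaining + adhoc * 5 + replay
    # Phase 3: 10x adhoc + phonetic mix.
    # Assumption: the mix is sampled from the whole phonetic set.
    mix = rng.sample(phonetic, min(mastery_phonetic, len(phonetic)))
    phase3 = adhoc * 10 + mix

    for phase in (phase2, phase3):
        rng.shuffle(phase)
    return phase1, phase2, phase3


# Toy sizes; the real inputs would be the rows of the two training CSVs.
p1, p2, p3 = build_phases(range(1000), ["a", "b", "c"], random.Random(0),
                          replay_size=80, mastery_phonetic=200)
print(len(p1), len(p2), len(p3))
```

Under these assumptions, and before any augmentation, the sizes in the Data table would give 650,000 samples in Phase 1, 489,685 in Phase 2 (350,000 + 59,685 + 80,000), and 319,370 in Phase 3 (119,370 + 200,000).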

LoRA Configuration

Parameter	Value
Rank (r)	64
Alpha	128
Dropout	0.05
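Target modules	`q_proj`, `k_proj`, `v_proj`, `out_proj`, `fc1`, `fc2`

Note: these module names follow the M2M100-style naming used by Small100. In the Hugging Face mT5 implementation, the corresponding attention and feed-forward projections are named `q`, `k`, `v`, `o`, `wi_0`, `wi_1`, and `wo`.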
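For reference, the table can be written as a plain dictionary whose keys match the argument names of `peft.LoraConfig`. Whether training used PEFT is an assumption here, and `lora_params` is a hypothetical helper for illustration. The sketch also works out the scaling that standard LoRA applies to the low-rank update.

```python
# Values copied from the table above; module names exactly as listed there.
lora_kwargs = {
    "r": 64,
    "lora_alpha": 128,
    "lora_dropout": 0.05,
    "target_modules": ["q_proj", "k_proj", "v_proj", "out_proj", "fc1", "fc2"],
}

# Standard LoRA multiplies the low-rank update B @ A by alpha / r.
scaling = lora_kwargs["lora_alpha"] / lora_kwargs["r"]


def lora_params(d_in, d_out, r=lora_kwargs["r"]):
    """Trainable parameters added to one weight matrix of shape (d_out, d_in)."""
    return r * (d_in + d_out)


print(scaling)
print(lora_params(768, 768))  # a square projection at hidden size 768
```

With alpha set to twice the rank, the adapter update is scaled by 2, a common choice that keeps the effective update magnitude stable when the rank changes.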

Training Arguments

Parameter	Value
Batch size	8
Gradient accumulation	4 (effective batch: 32)
Weight decay	0.01
Max grad norm	1.0
Warmup ratio	0.03
Optimizer	AdamW fused
Precision	bfloat16 / fp16
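The effective batch size and the approximate length of the warmup follow directly from these values. The sketch below assumes a single device and uses a hypothetical `schedule` helper; exact step counts depend on how the trainer handles the last partial batch.

```python
import math


def schedule(num_samples, epochs=2, batch_size=8, grad_accum=4,
             warmup_ratio=0.03, num_devices=1):
    """Return (effective batch size, optimizer steps, warmup steps) for one phase."""
    effective = batch_size * grad_accum * num_devices
    steps = math.ceil(num_samples / effective) * epochs
    warmup = math.ceil(steps * warmup_ratio)
    return effective, steps, warmup


# Phase 1 uses roughly 650K samples for 2 epochs.
print(schedule(650_000))
```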

Evaluation Results

Test Set	CER ↓	WER ↓	BLEU ↑	BERTScore ↑
Phonetic	0.0478	0.1764	0.6213	0.9897
Adhoc	0.1034	0.3015	0.4223	0.9861

BERTScore was computed with Ransaka/sinhala-bert-medium-v2.
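CER and WER are both edit-distance ratios, at character and word level respectively. The sketch below shows the usual definition for a single sentence pair; the evaluation tooling actually used is not stated, and two details are assumptions: characters are counted as Unicode code points (Sinhala vowel signs count separately), and words are split on whitespace.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: character edits divided by reference length."""
    return edit_distance(ref, hyp) / max(len(ref), 1)


def wer(ref, hyp):
    """Word error rate: word edits divided by reference word count."""
    ref_words, hyp_words = ref.split(), hyp.split()
    return edit_distance(ref_words, hyp_words) / max(len(ref_words), 1)


print(cer("මම ගිය", "මම ගියා"), wer("මම ගිය", "මම ගියා"))
```

A corpus-level score can be obtained either by averaging per-sentence rates or by summing edits and reference lengths over the whole test set; the two give different numbers, so it matters which one a reported figure uses.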

Usage

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "savinugunarathna/mT5-Singlish-Sinhala-Merged"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

inputs = tokenizer("transliterate: mama giya", return_tensors="pt")
outputs = model.generate(
    **inputs,
    num_beams=4,
    max_length=128,
    length_penalty=1.2,
    repetition_penalty=1.2,
    no_repeat_ngram_size=3,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# → මම ගිය

Note: Always prepend `transliterate: ` to inputs. Suppressing `<extra_id_N>` tokens via `bad_words_ids` is recommended for clean output.
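A minimal sketch of how the sentinel tokens could be suppressed is shown below. `sentinel_bad_words` is a hypothetical helper, and the default of 100 sentinels is an assumption based on the usual T5-family tokenizer setup; sentinels missing from the vocabulary are skipped.

```python
def sentinel_bad_words(tokenizer, num_sentinels=100):
    """Build a bad_words_ids list covering <extra_id_0> .. <extra_id_{n-1}>."""
    unk = getattr(tokenizer, "unk_token_id", None)
    bad = []
    for n in range(num_sentinels):
        token_id = tokenizer.convert_tokens_to_ids(f"<extra_id_{n}>")
        # Skip sentinels that the vocabulary does not contain.
        if token_id is not None and token_id != unk:
            bad.append([token_id])
    return bad


# With the tokenizer, model, and inputs from the Usage section:
# outputs = model.generate(
#     **inputs,
#     num_beams=4,
#     max_length=128,
#     bad_words_ids=sentinel_bad_words(tokenizer),
# )
```

Each entry is a one-token sequence, which is the list-of-lists shape that `generate` expects for `bad_words_ids`.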

Model size: 0.3B params (Safetensors, tensor type BF16)
