---
language:
- si
- en
tags:
- singlish
- sinhala
- transliteration
- seq2seq
- mt5
- lora
license: mit
base_model: google/mt5-base
datasets:
- savinugunarathna/indonlp-final
metrics:
- cer
- wer
- bleu
---

# mT5: Singlish → Sinhala Transliteration

Fine-tuned version of **google/mt5-base** for **Singlish-to-Sinhala transliteration**, developed as part of the **IndoNLP 2025 Shared Task on Singlish–Sinhala Transliteration**. This is the final merged model, with the LoRA weights absorbed into the base weights.

---

## Task

Singlish (romanised colloquial Sinhala) → Sinhala script transliteration.

| Input (Singlish) | Output (Sinhala) |
|---|---|
| `mama giya` | `මම ගිය` |
| `kohomada` | `කොහොමද` |

---

## Training Pipeline

Trained using a **three-phase curriculum strategy** with **LoRA**, using the same pipeline as the Small100 variant, adapted for the mT5 architecture.

### Data

| Split | Source | Size |
|---|---|---|
| Phase 1 & 2 training | `phonetic_train_1M.csv` | 1,000,000 samples |
| Adhoc fine-tuning | `adhoc.csv` | 11,937 samples |
| Phonetic validation | `phonetic_test.csv` | 10,003 samples |
| Adhoc validation | `adhoc_test.csv` | 5,003 samples |

### Synthetic Augmentation

Adhoc data was expanded with a **rule-based Singlish augmenter**:

- **Vowel dropping**: randomly drops non-boundary vowels
- **Cluster simplification**: collapses common digraphs (`th→t`, `sh→s`, `nd→n`, etc.)
- **Vowel swapping**: substitutes phonetically similar vowels (`a↔e`, `i↔e`, `o↔u`)

Aggression factor: 0.5. Applied at rates of 15% / 20% / 15% across the three phases.

### Input Prefix

Inputs are prefixed with `transliterate: ` at inference time, consistent with T5-style task conditioning.
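
The augmentation rules and the input prefix described above can be illustrated with a minimal sketch. This is not the augmenter used in training: the function names, the reading of the aggression factor as a per-site probability, and the choice of one rule per sample are all assumptions made for illustration. Only the three rule types, the digraph and vowel pairs, and the `transliterate: ` prefix come from this card.

```python
import random

PREFIX = "transliterate: "          # task prefix stated in this card
VOWELS = set("aeiou")
CLUSTERS = {"th": "t", "sh": "s", "nd": "n"}   # digraphs listed in this card
# a<->e, i<->e, o<->u; "e" can swap to either "a" or "i"
VOWEL_SWAPS = {"a": "e", "e": "ai", "i": "e", "o": "u", "u": "o"}


def drop_vowels(word, rng, p):
    """Drop interior (non-boundary) vowels, each with probability p."""
    if len(word) <= 2:
        return word
    middle = "".join(
        ch for ch in word[1:-1] if not (ch in VOWELS and rng.random() < p)
    )
    return word[0] + middle + word[-1]


def simplify_clusters(word, rng, p):
    """Collapse each known digraph with probability p."""
    for cluster, simple in CLUSTERS.items():
        if cluster in word and rng.random() < p:
            word = word.replace(cluster, simple)
    return word


def swap_vowels(word, rng, p):
    """Replace each vowel with a phonetically similar one with probability p."""
    return "".join(
        rng.choice(VOWEL_SWAPS[ch]) if ch in VOWEL_SWAPS and rng.random() < p else ch
        for ch in word
    )


def augment(text, rng, aggression=0.5):
    """Apply one randomly chosen rule to every word (assumed strategy)."""
    rule = rng.choice([drop_vowels, simplify_clusters, swap_vowels])
    return " ".join(rule(word, rng, aggression) for word in text.split())


def maybe_augment(text, rng, rate, aggression=0.5):
    """Augment a sample with probability `rate` (e.g. 0.15 or 0.20 per phase)."""
    return augment(text, rng, aggression) if rng.random() < rate else text


def with_prefix(text):
    """Prepend the task prefix expected by the model."""
    return PREFIX + text
```

For example, `with_prefix(maybe_augment("kohomada", random.Random(0), rate=0.15))` returns either the original or a perturbed spelling, always with the prefix attached. Passing an explicit `random.Random` instance keeps the perturbations reproducible.
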
### Three-Phase Curriculum

| Phase | Data | Epochs | LR | Validation | Aug |
|---|---|---|---|---|---|
| 1: Foundation | 65% of phonetic train (~650K) | 2 | 1e-4 | Phonetic | 15% |
| 2: Expansion | Remaining phonetic + 5× adhoc + 80K replay | 2 | 5e-5 | Adhoc | 20% |
| 3: Mastery | 10× adhoc + 200K phonetic mix | 2 | 2e-5 | Adhoc | 15% |

### LoRA Configuration

| Parameter | Value |
|---|---|
| Rank (r) | 64 |
| Alpha | 128 |
| Dropout | 0.05 |
| Target modules | `q_proj`, `k_proj`, `v_proj`, `out_proj`, `fc1`, `fc2` |

### Training Arguments

| Parameter | Value |
|---|---|
| Batch size | 8 |
| Gradient accumulation | 4 (effective batch: 32) |
| Weight decay | 0.01 |
| Max grad norm | 1.0 |
| Warmup ratio | 0.03 |
| Optimizer | AdamW fused |
| Precision | bfloat16 / fp16 |

---

## Evaluation Results

| Test Set | CER ↓ | WER ↓ | BLEU ↑ | BERTScore ↑ |
|---|---|---|---|---|
| Phonetic | 0.0478 | 0.1764 | 0.6213 | 0.9897 |
| Adhoc | 0.1034 | 0.3015 | 0.4223 | 0.9861 |

> BERTScore computed using [Ransaka/sinhala-bert-medium-v2](https://huggingface.co/Ransaka/sinhala-bert-medium-v2).

---

## Usage

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "savinugunarathna/mT5-Singlish-Sinhala-Merged"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

# The task prefix "transliterate: " must precede the Singlish text.
inputs = tokenizer("transliterate: mama giya", return_tensors="pt")

# Beam search with mild length and repetition control.
outputs = model.generate(
    **inputs,
    num_beams=4,
    max_length=128,
    length_penalty=1.2,
    repetition_penalty=1.2,
    no_repeat_ngram_size=3,
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# → මම ගිය
```

> **Note:** Always prepend `transliterate: ` to inputs. Suppressing sentinel tokens such as `<extra_id_0>` via `bad_words_ids` is recommended for clean output.
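
The note above recommends `bad_words_ids` without showing how to build the list. A minimal sketch follows, assuming the tokens to suppress are T5-style sentinels whose vocabulary entries contain `<extra_id_N>`. The helper names `sentinel_bad_words_ids` and `transliterate` are hypothetical and not part of this repository. `get_vocab()` and the `bad_words_ids` argument of `generate` are standard `transformers` APIs.

```python
import re

# Assumption: the unwanted tokens are T5-style sentinels such as <extra_id_0>.
SENTINEL_PATTERN = re.compile(r"<extra_id_\d+>")


def sentinel_bad_words_ids(tokenizer):
    """Return [[token_id], ...] for every sentinel-like token in the vocabulary."""
    vocab = tokenizer.get_vocab()  # maps token string -> token id
    ids = sorted(i for tok, i in vocab.items() if SENTINEL_PATTERN.search(tok))
    return [[i] for i in ids]


def transliterate(text, model, tokenizer, prefix="transliterate: "):
    """Generate Sinhala text for one Singlish input, blocking sentinel tokens."""
    inputs = tokenizer(prefix + text, return_tensors="pt")
    generate_kwargs = dict(
        num_beams=4,
        max_length=128,
        length_penalty=1.2,
        repetition_penalty=1.2,
        no_repeat_ngram_size=3,
    )
    bad_words = sentinel_bad_words_ids(tokenizer)
    if bad_words:  # only pass the argument when there is something to block
        generate_kwargs["bad_words_ids"] = bad_words
    outputs = model.generate(**inputs, **generate_kwargs)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```

With the `model` and `tokenizer` loaded as in the usage example, `transliterate("mama giya", model, tokenizer)` runs the same beam search while preventing any matched sentinel from being generated.
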