---
language:
- si
- en
license: apache-2.0
library_name: transformers
pipeline_tag: translation
tags:
- singlish
- sinhala
- translation
- byt5
- character-level
- two-stage-training
base_model: google/byt5-small
datasets:
- custom
---

# Singlish to Sinhala Translation Model (ByT5-Small)

A character-level translation model that converts Singlish (romanized Sinhala mixed with English) to Sinhala script. Built on `google/byt5-small` using a two-stage training approach.

## Model Description

- **Base Model:** google/byt5-small (character-level T5)
- **Task:** Translation (Singlish → Sinhala)
- **Languages:** Singlish (romanized) → Sinhala (සිංහල)
- **Training Date:** 2026-01-16
- **Architecture:** Character-level encoder-decoder

## Two-Stage Training Strategy

This model uses a specialized two-stage training approach to handle both phonetic romanization and shorthand Singlish:

### Stage 1: Phonetic Foundation

- **Dataset:** ~500,000 phonetic romanization pairs
- **Purpose:** Learn standard Sinhala phonetic patterns
- **Learning Rate:** 1e-5
- **Epochs:** 1
- **Batch Size:** 8 (effective: 32 with gradient accumulation)

### Stage 2: Shorthand Fine-tuning

- **Dataset:** Ad-hoc Singlish variations
- **Purpose:** Adapt to informal, conversational Singlish
- **Learning Rate:** 3e-6 (3× lower to prevent catastrophic forgetting)
- **Epochs:** 1
- **Strategy:** Unfrozen encoder with gentle learning rate

This approach ensures the model handles both formal phonetic romanization and informal chat-style Singlish.

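
The stage settings above can be collected in a small, dependency-free sketch to make the arithmetic explicit. The `StageConfig` class and its field names are hypothetical and are not taken from the training code. The accumulation factor of 4 is derived from the stated batch size of 8 and effective batch size of 32, the warmup values come from the hyperparameter list in this card, and the Stage 2 batch settings are an assumption (the card does not state them).

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class StageConfig:
    """Hypothetical container for one training stage's settings."""
    name: str
    learning_rate: float
    epochs: int
    per_device_batch_size: int
    grad_accum_steps: int
    warmup_steps: int

    @property
    def effective_batch_size(self) -> int:
        # Examples contributing to each optimizer update.
        return self.per_device_batch_size * self.grad_accum_steps

    def optimizer_steps(self, num_examples: int) -> int:
        # Optimizer updates needed to cover the dataset for all epochs.
        return math.ceil(num_examples / self.effective_batch_size) * self.epochs


# Stage 1: values as listed in this card (8 x 4 = 32 effective batch size).
STAGE_1 = StageConfig("phonetic-foundation", 1e-5, 1, 8, 4, 500)

# Stage 2: learning rate and warmup from the card; batch settings are
# ASSUMED to match Stage 1 because the card does not list them.
STAGE_2 = StageConfig("shorthand-finetune", 3e-6, 1, 8, 4, 200)

if __name__ == "__main__":
    print(STAGE_1.effective_batch_size)        # 32
    print(STAGE_1.optimizer_steps(500_000))    # 15625 updates for ~500k pairs
    print(round(STAGE_1.learning_rate / STAGE_2.learning_rate, 2))  # 3.33
```

With roughly 500,000 pairs and an effective batch size of 32, one Stage 1 epoch corresponds to about 15,600 optimizer updates, so the 500 warmup steps cover only the first few percent of training.
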
## Training Details

**Hardware & Environment:**

- GPU: Tesla P100
- Precision: FP32
- Framework: Hugging Face Transformers
- Optimizer: AdamW with warmup

**Hyperparameters:**

- Max source length: 80 characters
- Max target length: 80 characters
- Gradient clipping: 1.0
- Weight decay: 0.01
- Warmup steps: 500 (Stage 1), 200 (Stage 2)

## Usage

### Using Transformers Pipeline

```python
from transformers import pipeline

translator = pipeline("translation", model="savinugunarathna/ByT5-Small-fine-tuned")
result = translator("kohomada")
print(result[0]["translation_text"])
# Output: කොහොමද
```

### Manual Loading (Recommended for ByT5)

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("savinugunarathna/ByT5-Small-fine-tuned")
model = T5ForConditionalGeneration.from_pretrained("savinugunarathna/ByT5-Small-fine-tuned")

# Translate
input_text = "mata badagini"
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_length=80, num_beams=5)
translation = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(translation)
# Output: මට බඩගිනි
```

### Interactive Translator Script

```python
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

# Load the tokenizer and model so the script runs on its own
model_id = "savinugunarathna/ByT5-Small-fine-tuned"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = T5ForConditionalGeneration.from_pretrained(model_id)

# Use the GPU when one is available
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
model.eval()

while True:
    text = input("Enter Singlish (or 'quit'): ")
    if text.lower() == 'quit':
        break
    inputs = tokenizer(text, return_tensors="pt").to(device)
    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=80, num_beams=5)
    print(f"Sinhala: {tokenizer.decode(outputs[0], skip_special_tokens=True)}")
```

## Example Translations

| Singlish Input | Sinhala Output | Type |
|----------------|----------------|------|
| kohomada | කොහොමද | Phonetic |
| mama hodata innawa | මම හොඳට ඉන්නවා | Phonetic |
| api yamu | අපි යමු | Phonetic |
| oyage nama mokakda | ඔයාගේ නම මොකක්ද | Shorthand |
| api koheda yanne | අපි කොහෙද යන්නේ | Shorthand |

## Model Capabilities

✅ **Handles
phonetic romanization** (standard Latin script representations)

✅ **Understands informal Singlish** (chat-style abbreviations and variations)

✅ **Character-level processing** (robust to typos and spelling variations)

✅ **No subword tokenization** (ByT5's byte-level approach)

## Limitations

- Performance may vary with highly non-standard spellings
- Best suited for conversational text
- May struggle with very long compound words
- Character-level processing is slower than subword models
- Cannot handle code-mixed input

## Training Data

**Stage 1 - Phonetic Dataset:**

- Source: Curated phonetic romanization pairs from Swa-bhasha Resource Hub
- Size: ~500,000 unique pairs
- Type: Standard Sinhala ↔ Latin script mappings

**Stage 2 - Shorthand Dataset:**

- Source: Ad-hoc Singlish variations
- Type: Informal, conversational Singlish patterns
- Purpose: Generalization to real-world usage

## Why Two-Stage Training?

Direct training on mixed data can cause interference between formal phonetic patterns and informal shorthand. The two-stage approach:

1. **Establishes a foundation** with clean phonetic data
2. **Adapts gently** to informal patterns using a lower learning rate
3. **Prevents catastrophic forgetting** of base phonetic knowledge
4. **Maintains performance** on both task types

## Comparison with mT5-Small

This ByT5 model differs from mT5 in key ways:

- **Character-level vs.
subword:** More robust to spelling variations
- **Smaller vocabulary:** Processes raw UTF-8 bytes
- **Better generalization:** Handles unseen romanization patterns

## Citations

If you use this model, please cite:

```bibtex
@misc{byt5-singlish-sinhala-20260116,
  author = {savinugunarathna},
  title = {Singlish to Sinhala Translation Model (ByT5-Small)},
  year = {2026},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/savinugunarathna/ByT5-Small-fine-tuned}}
}
```

### Data Source Citation

This model uses data from the Swa-bhasha Resource Hub:

```bibtex
@article{sumanathilaka2025swa,
  title={Swa-bhasha Resource Hub: Romanized Sinhala to Sinhala Transliteration Systems and Data Resources},
  author={Sumanathilaka, Deshan and Perera, Sameera and Dharmasiri, Sachithya and Athukorala, Maneesha and Herath, Anuja Dilrukshi and Dias, Rukshan and Gamage, Pasindu and Weerasinghe, Ruvan and Priyadarshana, YHPP},
  journal={arXiv preprint arXiv:2507.09245},
  year={2025}
}
```

## License

Apache 2.0

## Acknowledgments

- Base model: google/byt5-small
- Training data: Swa-bhasha Resource Hub (Sumanathilaka et al., 2025)
- Training framework: Hugging Face Transformers
- Compute: Kaggle GPU (Tesla P100)

## Model Card Contact

For questions or issues, please open an issue in the model repository.