---
language:
- si
- en
license: apache-2.0
library_name: transformers
pipeline_tag: translation
tags:
- singlish
- sinhala
- translation
- byt5
- character-level
- two-stage-training
base_model: google/byt5-small
datasets:
- custom
---

# Singlish to Sinhala Translation Model (ByT5-Small)

A byte-level translation model that converts Singlish (romanized Sinhala mixed with English) to Sinhala script. Built on `google/byt5-small`, which operates on raw UTF-8 bytes rather than subword tokens, and trained with a two-stage approach.

## Model Description

- **Base Model:** google/byt5-small (byte-level T5)
- **Task:** Translation (Singlish → Sinhala)
- **Languages:** Singlish (romanized) → Sinhala (සිංහල)
- **Training Date:** 2026-01-16
- **Architecture:** Byte-level encoder-decoder (input and output are sequences of UTF-8 bytes)

## Two-Stage Training Strategy

This model uses a specialized two-stage training approach to handle both phonetic romanization and shorthand Singlish:

### Stage 1: Phonetic Foundation
- **Dataset:** ~500,000 phonetic romanization pairs
- **Purpose:** Learn standard Sinhala phonetic patterns
- **Learning Rate:** 1e-5
- **Epochs:** 1
- **Batch Size:** 8 (effective: 32 with gradient accumulation)

### Stage 2: Shorthand Fine-tuning
- **Dataset:** Ad-hoc Singlish variations
- **Purpose:** Adapt to informal, conversational Singlish
- **Learning Rate:** 3e-6 (about 3× lower than Stage 1, to limit catastrophic forgetting)
- **Epochs:** 1
- **Strategy:** Encoder left unfrozen and trained with the lower learning rate

This approach is intended to let the model handle both formal phonetic romanization and informal chat-style Singlish.

## Training Details

**Hardware & Environment:**
- GPU: Tesla P100
- Precision: FP32
- Framework: Hugging Face Transformers
- Optimizer: AdamW with warmup

**Hyperparameters:**
- Max source length: 80 characters
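- Max target length: 80 (counted in ByT5 tokens, i.e. UTF-8 bytes; Sinhala characters take 3 bytes each, so this is roughly 26 characters)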
- Gradient clipping: 1.0
- Weight decay: 0.01
- Warmup steps: 500 (Stage 1), 200 (Stage 2)
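
The settings above can be gathered into one place per stage. The following is a minimal sketch, not the author's training script: the `StageConfig` dataclass and its field names are hypothetical, the accumulation factor of 4 is inferred from the reported batch sizes (8 per step, 32 effective), and Stage 2 is assumed to reuse the Stage 1 batch settings, which this card does not state.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class StageConfig:
    """Hypothetical container for the per-stage settings reported in this card."""
    name: str
    learning_rate: float
    warmup_steps: int
    epochs: int = 1
    per_device_batch_size: int = 8
    grad_accum_steps: int = 4      # inferred: 8 * 4 = 32 effective
    weight_decay: float = 0.01
    max_grad_norm: float = 1.0     # gradient clipping
    max_source_length: int = 80
    max_target_length: int = 80

    @property
    def effective_batch_size(self) -> int:
        return self.per_device_batch_size * self.grad_accum_steps

    def optimizer_steps(self, num_examples: int) -> int:
        """Number of optimizer updates for this stage, given a dataset size."""
        return math.ceil(num_examples / self.effective_batch_size) * self.epochs


STAGE_1 = StageConfig(name="phonetic", learning_rate=1e-5, warmup_steps=500)
# Assumption: Stage 2 keeps the Stage 1 batch settings; only LR and warmup are reported as different.
STAGE_2 = StageConfig(name="shorthand", learning_rate=3e-6, warmup_steps=200)

print(STAGE_1.effective_batch_size)      # 32
print(STAGE_1.optimizer_steps(500_000))  # 15625
```

With roughly 500,000 Stage 1 pairs this works out to about 15,625 optimizer updates, so the 500 warmup steps cover only the first 3% or so of the stage.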

## Usage

### Using Transformers Pipeline
```python
from transformers import pipeline

translator = pipeline("translation", model="savinugunarathna/ByT5-Small-fine-tuned")
result = translator("kohomada")
print(result[0]["translation_text"])
# Output: කොහොමද
```

### Manual Loading (Recommended for ByT5)
```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("savinugunarathna/ByT5-Small-fine-tuned")
model = T5ForConditionalGeneration.from_pretrained("savinugunarathna/ByT5-Small-fine-tuned")

# Translate
input_text = "mata badagini"
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_length=80, num_beams=5)
translation = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(translation)
# Output: මට බඩගිනි
```
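
ByT5 measures length in UTF-8 bytes rather than characters, so the 80-token limit used above is reached much sooner for Sinhala script than for Latin text. The sketch below uses only the standard library to show the difference; the helper name `utf8_len` is illustrative and not part of any library.

```python
def utf8_len(text: str) -> int:
    """Number of UTF-8 bytes in `text`, i.e. ByT5 tokens before the end-of-sequence token."""
    return len(text.encode("utf-8"))


print(utf8_len("k"))  # 1 byte per ASCII letter
print(utf8_len("ක"))  # 3 bytes per Sinhala code point (block U+0D80 to U+0DFF)

source = "mata badagini"
target = "මට බඩගිනි"

# The romanized source costs one token per character; the Sinhala target costs more.
print(len(source), utf8_len(source))
print(len(target), utf8_len(target))

# Upper bound on Sinhala code points that fit in an 80-token output.
print(80 // 3)  # 26
```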

### Interactive Translator Script
```python
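import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

model_id = "savinugunarathna/ByT5-Small-fine-tuned"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = T5ForConditionalGeneration.from_pretrained(model_id)

# Use the GPU when available; fall back to CPU otherwise
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
model.eval()

while True:
    text = input("Enter Singlish (or 'quit'): ").strip()
    if text.lower() == "quit":
        break
    if not text:
        continue  # skip empty input

    inputs = tokenizer(text, return_tensors="pt").to(device)
    with torch.no_grad():  # inference only, no gradients needed
        outputs = model.generate(**inputs, max_new_tokens=80, num_beams=5)
    print(f"Sinhala: {tokenizer.decode(outputs[0], skip_special_tokens=True)}")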
```
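
Inputs that would exceed the 80-token limit can be split on whitespace before translation. The helper below is a sketch under stated assumptions: `chunk_words` is a hypothetical name, and the default budget of 25 source bytes is a conservative guess based on Sinhala output needing about 3 bytes per character, not a value taken from the training setup. Translating chunks independently may also lose context across chunk boundaries.

```python
def chunk_words(text: str, max_bytes: int = 25) -> list[str]:
    """Greedily group whitespace-separated words into chunks of at most `max_bytes` UTF-8 bytes.

    A single word longer than `max_bytes` is kept whole in its own chunk.
    """
    chunks: list[str] = []
    current = ""
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        if current and len(candidate.encode("utf-8")) > max_bytes:
            # Adding this word would overflow the budget: close the current chunk.
            chunks.append(current)
            current = word
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


print(chunk_words("mama hodata innawa api koheda yanne"))
# ['mama hodata innawa api', 'koheda yanne']
```

Each chunk can then be passed through `model.generate` as in the loop above, and the decoded pieces joined with spaces.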

## Example Translations

| Singlish Input | Sinhala Output | Type |
|----------------|----------------|------|
| kohomada | කොහොමද | Phonetic |
| mama hodata innawa | මම හොඳට ඉන්නවා | Phonetic |
| api yamu | අපි යමු | Phonetic |
| oyage nama mokakda | ඔයාගේ නම මොකක්ද | Shorthand |
| api koheda yanne | අපි කොහෙද යන්නේ | Shorthand |

## Model Capabilities

✅ **Handles phonetic romanization** (standard Latin script representations)  
✅ **Understands informal Singlish** (chat-style abbreviations and variations)  
✅ **Byte-level processing** (more tolerant of typos and spelling variations than a fixed subword vocabulary)  
✅ **No subword tokenization** (ByT5's byte-level approach)
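
To illustrate what the byte-level approach means in practice, the sketch below reproduces ByT5's byte-to-ID scheme in plain Python: each UTF-8 byte is shifted by 3 to leave room for the pad, end-of-sequence, and unknown tokens (IDs 0, 1, and 2), and an end-of-sequence ID is appended. It is a simplified stand-in for the real tokenizer and ignores sentinel tokens; `byt5_like_encode` and `byt5_like_decode` are hypothetical names.

```python
SPECIAL_OFFSET = 3  # IDs 0, 1, 2 are reserved for <pad>, </s>, <unk>
EOS_ID = 1


def byt5_like_encode(text: str) -> list[int]:
    """Map each UTF-8 byte to (byte + 3) and append the end-of-sequence ID."""
    return [b + SPECIAL_OFFSET for b in text.encode("utf-8")] + [EOS_ID]


def byt5_like_decode(ids: list[int]) -> str:
    """Drop special and out-of-range IDs, then decode the remaining bytes as UTF-8."""
    raw = bytes(
        i - SPECIAL_OFFSET
        for i in ids
        if SPECIAL_OFFSET <= i < SPECIAL_OFFSET + 256
    )
    return raw.decode("utf-8", errors="ignore")


ids = byt5_like_encode("api")
print(ids)                    # [100, 115, 108, 1]
print(byt5_like_decode(ids))  # api
```

Because every possible byte has an ID, any spelling can be encoded without falling back to an unknown token.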

## Limitations

- Performance may vary with highly non-standard spellings
- Best suited for conversational text
- May struggle with very long compound words
- Byte-level processing produces longer sequences, so inference is slower than with subword models
- Cannot handle code-mixed input (English words mixed into Singlish text)

## Training Data

**Stage 1 - Phonetic Dataset:**
- Source: Curated phonetic romanization pairs from the Swa-bhasha Resource Hub
- Size: ~500,000 unique pairs
- Type: Standard Sinhala ↔ Latin script mappings

**Stage 2 - Shorthand Dataset:**
- Source: Ad-hoc Singlish variations
- Type: Informal, conversational Singlish patterns
- Purpose: Generalization to real-world usage

## Why Two-Stage Training?

Direct training on mixed data can cause interference between formal phonetic patterns and informal shorthand. The two-stage approach:

1. **Establishes foundation** with clean phonetic data
2. **Adapts gently** to informal patterns using a lower learning rate
3. **Limits catastrophic forgetting** of base phonetic knowledge
4. **Aims to maintain performance** on both task types
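
The difference between the two stages can be made concrete by comparing their learning rates step by step, using the values from Training Details (peak 1e-5 with 500 warmup steps for Stage 1, peak 3e-6 with 200 warmup steps for Stage 2). The sketch below assumes linear warmup from zero followed by a constant rate; this card only states "AdamW with warmup", so the post-warmup shape is an assumption and `warmup_lr` is a hypothetical helper.

```python
def warmup_lr(step: int, peak_lr: float, warmup_steps: int) -> float:
    """Linear warmup from 0 to `peak_lr`, then constant (the constant tail is an assumption)."""
    if warmup_steps <= 0:
        return peak_lr
    return peak_lr * min(1.0, step / warmup_steps)


for step in (0, 100, 200, 500, 1000):
    stage1 = warmup_lr(step, peak_lr=1e-5, warmup_steps=500)
    stage2 = warmup_lr(step, peak_lr=3e-6, warmup_steps=200)
    print(f"step {step:>4}: stage 1 = {stage1:.1e}, stage 2 = {stage2:.1e}")
```

Under these assumptions Stage 2 reaches its peak after 200 steps, and that peak stays below a third of the Stage 1 peak, which is what keeps the updates to the Stage 1 weights small.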

## Comparison with mT5-Small

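This ByT5 model differs from an mT5-Small model in the following ways:
- **Byte-level vs. subword:** More tolerant of spelling variations, since no word is split according to a fixed subword vocabulary
- **Smaller vocabulary:** Operates on raw UTF-8 bytes instead of a large multilingual subword vocabulary
- **Generalization:** Intended to cope better with unseen romanization patterns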

## Citations

If you use this model, please cite:
```bibtex
@misc{byt5-singlish-sinhala-20260116,
  author = {savinugunarathna},
  title = {Singlish to Sinhala Translation Model (ByT5-Small)},
  year = {2026},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/savinugunarathna/ByT5-Small-fine-tuned}}
}
```

### Data Source Citation

This model uses data from the Swa-bhasha Resource Hub:
```bibtex
@article{sumanathilaka2025swa,
  title={Swa-bhasha Resource Hub: Romanized Sinhala to Sinhala Transliteration Systems and Data Resources},
  author={Sumanathilaka, Deshan and Perera, Sameera and Dharmasiri, Sachithya and Athukorala, Maneesha and Herath, Anuja Dilrukshi and Dias, Rukshan and Gamage, Pasindu and Weerasinghe, Ruvan and Priyadarshana, YHPP},
  journal={arXiv preprint arXiv:2507.09245},
  year={2025}
}
```

## License

Apache 2.0

## Acknowledgments

- Base model: google/byt5-small
- Training data: Swa-bhasha Resource Hub (Sumanathilaka et al., 2025)
- Training framework: Hugging Face Transformers
- Compute: Kaggle GPU (Tesla P100)

## Model Card Contact

For questions or issues, please open an issue in the model repository.