---
language:
- en
- hi
license: mit
library_name: pytorch
tags:
- translation
- transformer
- seq2seq
- english-to-hindi
- pytorch
- from-scratch
- ray-tune
- optuna
datasets:
- tatoeba
metrics:
- bleu
model-index:
- name: EN-HI Transformer v1.0.0
  results:
  - task:
      type: translation
      name: Machine Translation
    dataset:
      name: Tatoeba EN-HI (raw export, 13186 pairs)
      type: tatoeba
    metrics:
    - type: bleu
      value: 75.66
      name: BLEU (NLTK method4 ×100)
- name: EN-HI Transformer v1.1.0
  results:
  - task:
      type: translation
      name: Machine Translation
    dataset:
      name: Tatoeba EN-HI (raw export, 13186 pairs)
      type: tatoeba
    metrics:
    - type: bleu
      value: 83.69
      name: BLEU (NLTK method4 ×100)
---

# English → Hindi Transformer

A **from-scratch PyTorch encoder-decoder Transformer** for English → Hindi machine translation, trained on a **raw [Tatoeba](https://tatoeba.org/en/downloads) EN-HI export** (13 186 sentence pairs, including multiple Hindi translations per English sentence).

This repository provides **two versioned checkpoints**:

| Version | Description | BLEU | Epochs | Weights file |
|---|---|---|---|---|
| **v1.0.0** | Baseline - fixed hyperparameters | 0.7566 | 100 | `v1.0.0/transformer_translation_final.pth` |
| **v1.1.0** ✔ *recommended* | Ray Tune + Optuna optimised | **0.8369** | **50** | `v1.1.0/m25csa023_ass_4_best_model.pth` |

> v1.1.0 achieves a **+10.6% relative BLEU gain** (0.7566 → 0.8369) in **half the epochs** compared to v1.0.0.

---

## Training Summary

![Training & Evaluation Summary](assets/summary.png)

- **(a)** Training loss curves - baseline (100 ep) vs tuned (50 ep).
- **(b)** BLEU progression across epochs.
- **(c)** All 20 Ray Tune trial loss curves (grey = pruned by ASHA, orange = best).
- **(d)** Hyperparameter importance (Spearman ρ) - batch size & dropout matter most.
- **(e–g)** Scatter plots: LR / dropout / batch size vs final loss across all trials.
- **(h)** Final comparison bar chart: time, loss, and BLEU for v1.0.0 vs v1.1.0.

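
Panel (d) ranks hyperparameters by their Spearman ρ against the final loss. As a rough illustration of how such a ranking can be computed, the sketch below implements Spearman's rank correlation in plain Python and applies it to a few made-up trial records. The function names, the `trials` list, and every value in it are hypothetical and are not taken from the actual sweep; using |ρ| as the importance score is likewise an assumption about how the plot was produced.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the window over all entries tied with the current value.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of the 1-based ranks in the window
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman's rho, computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one variable is constant")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical (made-up) trial records, NOT results from the real sweep.
trials = [
    {"lr": 1e-4, "dropout": 0.05, "batch_size": 16, "loss": 0.21},
    {"lr": 3e-4, "dropout": 0.10, "batch_size": 32, "loss": 0.14},
    {"lr": 5e-5, "dropout": 0.20, "batch_size": 32, "loss": 0.16},
    {"lr": 2e-4, "dropout": 0.15, "batch_size": 64, "loss": 0.25},
    {"lr": 1e-3, "dropout": 0.30, "batch_size": 128, "loss": 0.40},
]
losses = [t["loss"] for t in trials]

# Assumed importance score: absolute rank correlation with the final loss.
importance = {
    name: abs(spearman_rho([t[name] for t in trials], losses))
    for name in ("lr", "dropout", "batch_size")
}
for name, score in sorted(importance.items(), key=lambda kv: -kv[1]):
    print(f"{name}: |rho| = {score:.2f}")
```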

---

## Dataset

**Source:** Raw export from [tatoeba.org/en/downloads](https://tatoeba.org/en/downloads) - English-Hindi sentence pairs.

**Note:** This is the **unprocessed Tatoeba dump**, not the Helsinki-NLP filtered version.

The file used during training: `English-Hindi.tsv`

### TSV Column Structure

| Column | Content | Example |
|---|---|---|
| 1 | English sentence ID (Tatoeba) | `1282` |
| 2 | English sentence | `Muiriel is 20 now.` |
| 3 | Hindi sentence ID (Tatoeba) | `485968` |
| 4 | Hindi sentence | `म्यूरियल अब बीस साल की हो गई है।` |

### Statistics

| Property | Value |
|---|---|
| Total sentence pairs | 13 186 |
| Unique English sentences | 11 109 (2 077 have multiple Hindi translations) |
| Mean English length | 5.6 words |
| Mean Hindi length | 6.3 words |
| Max English length | 53 tokens |
| Max Hindi length | 57 tokens |
| English ID range | 1 277 – 12 886 231 (Tatoeba IDs) |
| Hindi ID range | 440 811 – 13 125 624 (Tatoeba IDs) |
| Tokenisation | Whitespace split, lowercased |
| Min word frequency (vocab) | 2 |

### Repository File Structure

```
en-hi-transformer/
├── README.md                               ← model card (this page)
├── config.json                             ← shared architecture config
├── assets/
│   └── summary.png                         ← training & evaluation plots
├── v1.0.0/
│   ├── transformer_translation_final.pth   ← baseline weights (~192 MB)
│   └── config.json                         ← v1.0.0 hyperparameters
├── v1.1.0/
│   ├── m25csa023_ass_4_best_model.pth      ← optimised weights (~216 MB) ← recommended
│   └── config.json                         ← v1.1.0 hyperparameters + search config
└── vocab/
    ├── en_vocab.pkl                        ← English vocabulary (4 117 tokens)
    └── hi_vocab.pkl                        ← Hindi vocabulary (4 044 tokens)
```

---

## Model Architecture

Built from scratch following [Vaswani et al. (2017) - *"Attention Is All You Need"*](https://papers.neurips.cc/paper/7181-attention-is-all-you-need.pdf), **no HuggingFace Transformers library used internally**.

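
The model uses the paper's fixed sinusoidal positional encoding, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). As a small, framework-free illustration of that formula, the sketch below builds the encoding table in plain Python for this card's `d_model = 512` and maximum length of 50. The function name is hypothetical and this is not the repository's implementation, which presumably keeps the same values in a PyTorch tensor.

```python
import math


def sinusoidal_positional_encoding(max_len: int, d_model: int) -> list[list[float]]:
    """Return a max_len x d_model table of fixed sinusoidal position encodings."""
    if d_model % 2 != 0:
        raise ValueError("d_model must be even")
    table = []
    for pos in range(max_len):
        row = [0.0] * d_model
        # k walks over the even dimensions 2i; dimension 2i + 1 shares its frequency.
        for k in range(0, d_model, 2):
            angle = pos / (10000 ** (k / d_model))
            row[k] = math.sin(angle)      # even dimension: sine
            row[k + 1] = math.cos(angle)  # odd dimension: cosine
        table.append(row)
    return table


# Values from this model card: max sequence length 50, d_model 512.
pe = sinusoidal_positional_encoding(max_len=50, d_model=512)
print(len(pe), len(pe[0]))  # table dimensions: positions x model width
```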

| Property | Value |
|---|---|
| Architecture | Encoder-Decoder Transformer |
| d_model | 512 |
| num_layers | 6 encoder + 6 decoder |
| num_heads | 8 |
| d_ff | 2048 (v1.0.0) / **2560 (v1.1.0)** |
| Dropout | 0.10 (v1.0.0) / **0.081 (v1.1.0)** |
| Max sequence length | 50 tokens |
| Positional encoding | Sinusoidal (fixed) |
| Source vocabulary | 4 117 English tokens |
| Target vocabulary | 4 044 Hindi tokens |
| Special tokens | padding, start-of-sequence, end-of-sequence, unknown-word |

---

## Versions

### v1.0.0 - Baseline

Trained for **100 epochs** with manually chosen hyperparameters on an NVIDIA A100 80 GB (BF16 autocast + `torch.compile` + `cudnn.benchmark`).

| Hyperparameter | Value |
|---|---|
| Learning rate | 1e-4 |
| Batch size | 60 |
| d_ff | 2048 |
| Dropout | 0.10 |
| Gradient clipping | - |

**Results:** BLEU **0.7566** · Loss **0.0998** · Training time **12.3 min**

---

### v1.1.0 - Ray Tune + Optuna Optimised ✔

Hyperparameters discovered automatically using **Ray Tune 2.x** with **OptunaSearch (TPE)** and an **ASHA early-stopping scheduler** (20 trials, ~65% pruned early).

| Hyperparameter | Optimised Value |
|---|---|
| Learning rate | **1.112e-4** |
| Batch size | **32** |
| d_ff | **2560** |
| Dropout | **0.081** |
| Gradient clipping | max_norm = 1.0 |

**Results:** BLEU **0.8369** · Loss **0.1264** · Training time **13.5 min** · Epochs **50**

The winning configuration first surpassed the v1.0.0 BLEU at **epoch 10** during the search sweep.

---

## How to Use

### 1. Clone the repo & install dependencies

```bash
git lfs install
git clone https://huggingface.co/priyadip/en-hi-transformer
pip install torch
```

### 2. Load a checkpoint

```python
import pickle

import torch

DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load vocabularies
with open("en-hi-transformer/vocab/en_vocab.pkl", "rb") as f:
    en_vocab = pickle.load(f)
with open("en-hi-transformer/vocab/hi_vocab.pkl", "rb") as f:
    hi_vocab = pickle.load(f)

# Instantiate the model (Transformer class from the training script)
model = Transformer(
    src_vocab_size=len(en_vocab),
    tgt_vocab_size=len(hi_vocab),
    d_model=512,
    num_layers=6,
    num_heads=8,
    d_ff=2560,      # use 2048 for v1.0.0
    max_len=50,
    dropout=0.081,  # use 0.10 for v1.0.0
).to(DEVICE)

# Load weights - pick the version you need
model.load_state_dict(
    torch.load(
        "en-hi-transformer/v1.1.0/m25csa023_ass_4_best_model.pth",
        # or: "en-hi-transformer/v1.0.0/transformer_translation_final.pth"
        map_location=DEVICE,
    )
)
model.eval()
```

### 3. Translate a sentence

```python
# Special-token names are assumptions; check them against the keys in the vocab files.
PAD, SOS, EOS = "<pad>", "<sos>", "<eos>"


def translate(model, sentence, max_len=50):
    """Greedy-decode a Hindi translation for one English sentence."""
    tokens = encode_sentence(sentence, en_vocab, max_len)
    src = torch.tensor(tokens).unsqueeze(0).to(DEVICE)
    tgt = [hi_vocab[SOS]]
    with torch.no_grad():
        for _ in range(max_len):
            tgt_in = torch.tensor(tgt).unsqueeze(0).to(DEVICE)
            out = model(src, tgt_in, en_vocab[PAD], hi_vocab[PAD])
            nxt = out[0, -1].argmax().item()
            if nxt == hi_vocab[EOS]:
                break  # stop before appending the end token
            tgt.append(nxt)
    # Skip the start token; the end token is never appended, so no real
    # token is dropped when decoding stops at max_len.
    return " ".join(hi_vocab.itos[i] for i in tgt[1:])


print(translate(model, "How are you?"))        # → तुम कैसी हो?
print(translate(model, "I love you."))         # → मैं तुमसे प्यार करती हूँ।
print(translate(model, "What is your name?"))  # → आपका नाम क्या है?
```

> `Transformer` and `encode_sentence` are defined in the training script available
> in the linked GitHub repository.

---

## Sample Outputs (v1.1.0)

| English | Hindi |
|---|---|
| How are you? | तुम कैसी हो? |
| I love you. | मैं तुमसे प्यार करती हूँ। |
| What is your name? | आपका नाम क्या है? |
| The weather is nice today. | आज मौसम अच्छा है। |
| She is a good teacher. | वह अच्छा शिक्षक है। |

---

## Limitations

- Vocabulary of ~4 K tokens; unknown words map to the unknown-word token.
- Optimised for short sentences (≤ 10 words); quality degrades on longer input.
- Greedy decoding - no beam search.
- BLEU evaluated on a small held-out set; treat scores as indicative.

---

## Citation

If you use this model, please cite:

**This model:**

```bibtex
@misc{en_hi_transformer_2026,
  author       = {priyadip},
  title        = {English to Hindi Transformer (v1.0.0 / v1.1.0)},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/priyadip/en-hi-transformer}},
  note         = {v1.0.0: BLEU 0.7566 / 100 epochs. v1.1.0: BLEU 0.8369 / 50 epochs via Ray Tune + Optuna (+10.6\%).}
}
```

**Architecture - Attention Is All You Need:**

```bibtex
@inproceedings{vaswani2017attention,
  title     = {Attention Is All You Need},
  author    = {Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N and Kaiser, Lukasz and Polosukhin, Illia},
  booktitle = {Advances in Neural Information Processing Systems},
  volume    = {30},
  year      = {2017},
  url       = {https://papers.neurips.cc/paper/7181-attention-is-all-you-need.pdf}
}
```

- Paper: https://papers.neurips.cc/paper/7181-attention-is-all-you-need.pdf
- Papers with Code: https://paperswithcode.com/paper/attention-is-all-you-need

**Dataset - Tatoeba:**

```bibtex
@misc{tatoeba,
  title        = {Tatoeba: A multilingual sentence collection},
  author       = {Tatoeba contributors},
  howpublished = {\url{https://tatoeba.org}},
  note         = {Raw EN-HI export used; 13 186 pairs including multiple Hindi translations per English sentence.}
}
```