Nematus RNN · English → Turkish (BPE)

An attentional RNN (bi-GRU encoder / GRU decoder) neural machine translation model that translates English → Turkish, trained with the Nematus toolkit on a ~142k-sentence news-domain parallel corpus using a joint 32k BPE subword vocabulary.

This is the RNN baseline in the reverse (EN→TR) direction for the machine translation models collection. Siblings include the stronger Transformer atahanuz/transformers-translator-en-tr-75M and the forward (TR→EN) RNN atahanuz/rnn-translator-tr-en-63M. All four models in the collection (Transformer and RNN, each in both directions) share the same data and the same joint BPE model.

TL;DR

  • BLEU 33.93 on a 1,000-sentence held-out test set (beam 12, length-normalized).
  • ~63M parameters · single A100-80GB · ~125 min training.
  • A baseline: the Transformer sibling scores 42.04 on the same test set (see matrix below).

Model details

| Property | Value |
|---|---|
| Toolkit | Nematus (TensorFlow), commit 49d050863bc9644b8c0a9d9ab6e54ccd30f927dd |
| Architecture | Attentional RNN: single-layer bi-GRU encoder, single-layer GRU decoder with attention (conditional GRU, dec_base_transition_depth=2) |
| Direction | English (en) → Turkish (tr) |
| Embedding size | 512 |
| Hidden / state size | 1000 |
| Dropout | none (0.0) |
| Embedding tying | none (untied) |
| Subword model | joint 32k BPE (subword-nmt) |
| Vocab caps | source 18,000 / target 24,000 |
| Parameters | ~63M |

Training data

  • Corpus: mt_datasets_vol2 — 144,065 English–Turkish sentence pairs (news / current affairs).
  • Filtering: drop any pair where either side is empty or longer than 60 whitespace tokens → 143,926 pairs.
  • Split (shuffled, seed 42): 141,926 train / 1,000 dev / 1,000 test.
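
The filtering and split described above can be sketched as follows. This is a minimal illustration, not the script used for this model: the helper names `keep_pair` and `filter_and_split` are hypothetical, and both the choice of Python's `random.Random` as the shuffler and the order in which dev and test are carved out are assumptions, so it will not reproduce the exact split.

```python
import random

MAX_TOKENS = 60  # length cap from the filtering rule above


def keep_pair(src, tgt, max_tokens=MAX_TOKENS):
    """Keep a pair only if both sides are non-empty and within the token cap."""
    n_src, n_tgt = len(src.split()), len(tgt.split())
    return 0 < n_src <= max_tokens and 0 < n_tgt <= max_tokens


def filter_and_split(pairs, n_dev=1000, n_test=1000, seed=42):
    """Filter (src, tgt) pairs, shuffle with a fixed seed, and split them."""
    kept = [pair for pair in pairs if keep_pair(*pair)]
    random.Random(seed).shuffle(kept)  # assumed RNG; only the seed is documented
    dev = kept[:n_dev]
    test = kept[n_dev:n_dev + n_test]
    train = kept[n_dev + n_test:]
    return train, dev, test


# Sanity check on the documented counts: 143,926 kept pairs minus 2,000 held out.
assert 143_926 - 1_000 - 1_000 == 141_926
```

Holding the seed fixed keeps dev and test identical across all four models, which is what makes the architecture × direction comparison meaningful.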

Preprocessing

  1. Tokenization: Moses tokenizer.perl -l <lang> -no-escape.
  2. Joint BPE: 32,000 merges learned on the training split only (EN+TR concatenated), applied without a vocabulary-frequency threshold. Resulting subword vocab: TR 23,211 / EN 17,659.

(BPE is identical across all four models — it is architecture- and direction-independent.)
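
To illustrate what "joint" means in step 2, here is a toy merge learner run over concatenated English and Turkish text. It is a teaching sketch of the BPE idea only, not subword-nmt's implementation: the function name `learn_bpe_merges`, the `</w>` end-of-word marker, and the tie-breaking rule are assumptions chosen for clarity.

```python
from collections import Counter


def learn_bpe_merges(lines, num_merges):
    """Learn up to num_merges BPE merge operations from whitespace-tokenized lines."""
    # Each word starts as a tuple of characters plus an end-of-word marker.
    words = Counter()
    for line in lines:
        for token in line.split():
            words[tuple(token) + ("</w>",)] += 1

    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in words.items():
            for left, right in zip(symbols, symbols[1:]):
                pairs[(left, right)] += freq
        if not pairs:
            break
        # Most frequent pair wins; ties are broken by the pair itself (assumption).
        best = max(pairs, key=lambda pair: (pairs[pair], pair))

        # Rewrite every word, fusing each occurrence of the chosen pair.
        merged = Counter()
        for symbols, freq in words.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] += freq
        words = merged
        merges.append(best)
    return merges


# Joint learning: both languages feed one merge table.
english = ["the embassy welcomed the offer"]
turkish = ["büyükelçilik teklifi memnuniyetle karşıladı"]
joint_merges = learn_bpe_merges(english + turkish, num_merges=10)
```

Because the merge table is learned once over both languages, the same codes file segments source and target text regardless of translation direction.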

Training configuration

  • Optimizer: Adam (β1 0.9, β2 0.999, ε 1e-8), learning rate 1e-4 constant, gradient-norm clip 1.0.
  • Batch 320 sentences, maxlen 120, label smoothing 0.0.
  • Validation every 200 updates; early stopping with patience 10 on dev cross-entropy.
  • Hardware: 1× NVIDIA A100-80GB.
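
The early-stopping rule above (validate every 200 updates, stop after 10 validations without improvement, i.e. 2,000 updates) can be sketched as a generic patience loop. This is not Nematus's code: the function name `early_stopping` is hypothetical, and the toolkit's exact counter semantics may differ from this simple version.

```python
def early_stopping(dev_losses, valid_freq=200, patience=10):
    """Return (best_update, stop_update) for a sequence of dev losses.

    dev_losses[i] is the dev cross-entropy at validation i + 1, taken every
    valid_freq updates. stop_update is None if patience never runs out.
    """
    best_loss = float("inf")
    best_update = None
    bad_validations = 0
    for index, loss in enumerate(dev_losses, start=1):
        update = index * valid_freq
        if loss < best_loss:
            # New best: remember it and reset the patience counter.
            best_loss, best_update = loss, update
            bad_validations = 0
        else:
            bad_validations += 1
            if bad_validations >= patience:
                return best_update, update
    return best_update, None
```

The checkpoint worth keeping is the one saved at `best_update`, not the one at `stop_update`.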

Training run

  • Best validation at update 24,400 (dev cross-entropy 56.70) — the checkpoint in this repo.
  • Early-stopped at update ~26,900. Wall-clock ≈ 125 min (RNNs converge slower than the Transformer and run slower per word due to sequential recurrence).

Evaluation

BLEU via multi-bleu.perl on the merged-BPE (Moses-tokenized) hypothesis vs the tokenized reference; beam 12, length-normalized.

Architecture × direction matrix (same 1,000 mirrored test pairs):

| Architecture | TR → EN | EN → TR |
|---|---|---|
| Transformer (~75M) | 42.78 | 42.04 |
| RNN (~63M) | 35.10 | 33.93 |

This model = RNN EN→TR = 33.93 (n-gram 60.1 / 39.6 / 27.9 / 20.0, BP 1.00). The Transformer beats the RNN by ~7–8 BLEU; Turkish-target (→TR) is the slightly harder direction for both architectures (agglutinative morphology).
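
As a consistency check, the corpus BLEU can be recomputed from the reported n-gram precisions and brevity penalty using the standard definition (brevity penalty times the geometric mean of the 1- to 4-gram precisions). The function name `bleu_from_precisions` is hypothetical. Because the published precisions are rounded to one decimal, the result agrees with 33.93 only up to rounding.

```python
import math


def bleu_from_precisions(precisions_pct, brevity_penalty=1.0):
    """BLEU (0-100) from n-gram precisions given in percent."""
    log_mean = sum(math.log(p / 100.0) for p in precisions_pct) / len(precisions_pct)
    return 100.0 * brevity_penalty * math.exp(log_mean)


# Values reported above for this model (RNN EN→TR).
rnn_en_tr = bleu_from_precisions([60.1, 39.6, 27.9, 20.0], brevity_penalty=1.00)

# Transformer minus RNN, per direction, from the matrix above.
gap_tr_en = 42.78 - 35.10
gap_en_tr = 42.04 - 33.93
print(f"recomputed BLEU: {rnn_en_tr:.2f}")
print(f"gaps: TR→EN {gap_tr_en:.2f}, EN→TR {gap_en_tr:.2f}")
```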

Example

| English (input) | Model output | Reference |
|---|---|---|
| The US Embassy in Bosnia and Herzegovina welcomed the offer to send soldiers to Iraq. | ABD'nin Bosna-Hersek Büyükelçiliği Irak'a asker gönderme teklifini memnuniyetle karşıladı. | ABD'nin BH Büyükelçiliği Irak'a asker gönderme teklifini memnuniyetle karşıladı. |

Files in this repo

  • model.npz.* — Nematus/TensorFlow checkpoint (best-validation, update 24,400).
  • train.bpe.en.json, train.bpe.tr.json — Nematus source/target vocabularies (referenced by model.npz.json).
  • tr-en.bpe.codes, vocab.tr, vocab.en — the joint BPE model, used to segment new input.
  • nematus_tf220_compat.patch — makes Nematus run on TensorFlow ≥ 2.16 / NumPy ≥ 2 / Python 3.12.

Installation

git clone https://github.com/EdinburghNLP/nematus.git
cd nematus && git checkout 49d050863bc9644b8c0a9d9ab6e54ccd30f927dd
git apply /path/to/nematus_tf220_compat.patch
pip install "tensorflow>=2.16" "numpy>=2" subword-nmt sacremoses
cd ..

Usage

# 0. download this model
python3 -c "from huggingface_hub import snapshot_download; \
snapshot_download('atahanuz/rnn-translator-en-tr-63M', local_dir='en_tr_rnn')"

# 1. preprocess an English input file (one sentence per line)
perl nematus/data/tokenizer.perl -l en -no-escape < input.en > input.tok.en
subword-nmt apply-bpe -c en_tr_rnn/tr-en.bpe.codes < input.tok.en > input.bpe.en

# 2. translate — run from the model dir so the dict paths in model.npz.json resolve
cd en_tr_rnn
python3 ../nematus/nematus/translate.py -m model.npz -i ../input.bpe.en -o ../out.bpe.tr -k 12 -n -b 50
cd ..

# 3. postprocess: undo BPE, then detokenize (Turkish)
sed -E 's/(@@ )|(@@ ?$)//g' out.bpe.tr > out.tok.tr
python3 - <<'EOF'
import re
from sacremoses import MosesDetokenizer
d = MosesDetokenizer(lang='tr')
with open('out.tok.tr', encoding='utf-8') as f:
    for line in f:
        # detokenize, then join Turkish suffix apostrophes (Irak ' a -> Irak'a)
        print(re.sub(r"\s*'\s*", "'", d.detokenize(line.split())))
EOF
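
The two string rewrites in step 3 can also be expressed as small pure-Python helpers, which is handy for testing them in isolation. This sketch mirrors the `sed` pattern and the apostrophe regex used above; the function names `undo_bpe` and `join_apostrophes` are hypothetical, and Moses detokenization itself is deliberately left out so the example has no dependencies.

```python
import re


def undo_bpe(line):
    """Remove BPE continuation markers, mirroring the sed expression above."""
    return re.sub(r"(@@ )|(@@ ?$)", "", line.rstrip("\n"))


def join_apostrophes(text):
    """Drop whitespace around apostrophes so Turkish suffixes reattach."""
    return re.sub(r"\s*'\s*", "'", text)


segmented = "Irak ' a asker gönder@@ me tek@@ lif@@ ini"
restored = join_apostrophes(undo_bpe(segmented))
```

Note that the apostrophe rule is applied everywhere, so it also closes up spaces around single quotes used as quotation marks.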

Intended use & limitations

  • A research baseline; for best quality use the Transformer sibling (+8.1 BLEU).
  • Domain: news / current affairs. Degrades out of domain and on long / complex sentences.
  • Trained on sentences ≤ 60 tokens.
  • Reported scores are tokenized multi-bleu.perl BLEU; for citable numbers, use sacreBLEU on detokenized output.
  • No safety/bias auditing.

License

cc-by-4.0 placeholder — set to match your training-data terms.

Acknowledgements

Nematus; subword-nmt. GRU attentional NMT: Bahdanau et al. (2015) / Sennrich et al. (Nematus).
