Nematus RNN · English → Turkish (BPE)
An attentional RNN (bi-GRU encoder / GRU decoder) neural machine translation model that translates English → Turkish, trained with the Nematus toolkit on a ~142k-sentence news-domain parallel corpus using a joint 32k BPE subword vocabulary.
This is the RNN baseline (reverse direction) in the machine translation models collection. Siblings: the stronger Transformer atahanuz/transformers-translator-en-tr-75M and the forward RNN atahanuz/rnn-translator-tr-en-63M.
All four models in the collection share the same data and the same joint BPE model.
TL;DR
- BLEU 33.93 on a 1,000-sentence held-out test set (beam 12, length-normalized).
- ~63M parameters · single A100-80GB · ~125 min training.
- A baseline: the Transformer sibling scores 42.04 on the same test set (see matrix below).
Model details
| Field | Value |
|---|---|
| Toolkit | Nematus (TensorFlow), commit 49d050863bc9644b8c0a9d9ab6e54ccd30f927dd |
| Architecture | Attentional RNN — single-layer bi-GRU encoder, single-layer GRU decoder with attention (conditional GRU, dec_base_transition_depth=2) |
| Direction | English (en) → Turkish (tr) |
| Embedding size | 512 |
| Hidden / state size | 1000 |
| Dropout | none (0.0) |
| Embedding tying | none (untied) |
| Subword model | joint 32k BPE (subword-nmt) |
| Vocab caps | source 18,000 / target 24,000 |
| Parameters | ~63M |
Training data
- Corpus: `mt_datasets_vol2`, 144,065 English–Turkish sentence pairs (news / current affairs).
- Filtering: drop any pair where either side is empty or longer than 60 whitespace tokens → 143,926 pairs.
- Split (shuffled, seed 42): 141,926 train / 1,000 dev / 1,000 test.
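The filtering and split described above can be sketched in a few lines of Python. This is a minimal illustration, not the script used for this model: the function name `filter_and_split`, the use of `random.Random`, and the order in which dev, test, and train are sliced off are all assumptions.

```python
import random

def filter_and_split(pairs, max_tokens=60, n_dev=1000, n_test=1000, seed=42):
    """Drop empty or over-long pairs, shuffle with a fixed seed, carve off dev/test.

    `pairs` is a list of (english, turkish) strings. The slicing order below
    (dev, then test, then train) is a hypothetical choice for illustration.
    """
    def ok(sentence):
        n = len(sentence.split())  # whitespace tokens
        return 0 < n <= max_tokens

    kept = [(src, tgt) for src, tgt in pairs if ok(src) and ok(tgt)]
    random.Random(seed).shuffle(kept)
    dev = kept[:n_dev]
    test = kept[n_dev:n_dev + n_test]
    train = kept[n_dev + n_test:]
    return train, dev, test

# Toy usage: 10 valid pairs, one with an empty side, one with 61 tokens.
toy = [("a b", "c d")] * 10 + [("", "x"), (" ".join(["w"] * 61), "y")]
train, dev, test = filter_and_split(toy, n_dev=2, n_test=2)
print(len(train), len(dev), len(test))
```

With the card's numbers, 143,926 filtered pairs minus 1,000 dev and 1,000 test leaves the stated 141,926 training pairs.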
Preprocessing
- Tokenization: Moses `tokenizer.perl -l <lang> -no-escape`.
- Joint BPE: 32,000 merges learned on the training split only (EN+TR concatenated), applied without a vocabulary-frequency threshold. Resulting subword vocab: TR 23,211 / EN 17,659.
(BPE is identical across all four models — it is architecture- and direction-independent.)
Training configuration
- Optimizer: Adam (β1 0.9, β2 0.999, ε 1e-8), learning rate 1e-4 constant, gradient-norm clip 1.0.
- Batch 320 sentences, `maxlen` 120, label smoothing 0.0.
- Validation every 200 updates; early stopping with patience 10 on dev cross-entropy.
- Hardware: 1× NVIDIA A100-80GB.
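The validation schedule above follows the usual patience pattern: keep the checkpoint with the lowest dev loss and stop once a fixed number of consecutive validations fail to improve on it. The sketch below is generic and is not taken from Nematus; the function name `early_stop` is hypothetical, and Nematus may count non-improving validations slightly differently.

```python
def early_stop(dev_losses, valid_freq=200, patience=10):
    """Return (best_update, stop_update) for a sequence of dev losses.

    dev_losses[i] is assumed to be measured at update (i + 1) * valid_freq.
    Training stops at the first validation where `patience` consecutive
    checks have failed to beat the best loss seen so far.
    """
    best_loss = float("inf")
    best_update = None
    bad = 0
    update = 0
    for i, loss in enumerate(dev_losses, start=1):
        update = i * valid_freq
        if loss < best_loss:
            best_loss, best_update, bad = loss, update, 0
        else:
            bad += 1
            if bad >= patience:
                break
    return best_update, update

# Toy usage with a short patience so the behaviour is easy to follow.
print(early_stop([60.0, 58.0, 57.0, 57.5, 58.0, 59.0], valid_freq=200, patience=3))
```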
Training run
- Best validation at update 24,400 (dev cross-entropy 56.70) — the checkpoint in this repo.
- Early-stopped at update ~26,900. Wall-clock ≈ 125 min (RNNs converge slower than the Transformer and run slower per word due to sequential recurrence).
Evaluation
BLEU via multi-bleu.perl on the merged-BPE (Moses-tokenized) hypothesis vs the tokenized reference; beam 12, length-normalized.
Architecture × direction matrix (same 1,000 mirrored test pairs):
| Architecture | TR → EN | EN → TR |
|---|---|---|
| Transformer (~75M) | 42.78 | 42.04 |
| RNN (~63M) | 35.10 | 33.93 |
This model = RNN EN→TR = 33.93 (n-gram 60.1 / 39.6 / 27.9 / 20.0, BP 1.00). The Transformer beats the RNN by 7.7 BLEU (TR→EN) and 8.1 BLEU (EN→TR); Turkish-target (→TR) is the slightly harder direction for both architectures (agglutinative morphology).
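As a sanity check on the figures above, corpus BLEU is the brevity penalty times the geometric mean of the 1- to 4-gram precisions. The sketch below recomputes the score from the rounded precisions in this card; the helper name `bleu_from_precisions` is made up for illustration. Because the precisions are rounded to one decimal, the result only approximates the reported value.

```python
import math

def bleu_from_precisions(precisions_pct, brevity_penalty=1.0):
    """BLEU = BP * geometric mean of the n-gram precisions (given in percent)."""
    log_mean = sum(math.log(p) for p in precisions_pct) / len(precisions_pct)
    return brevity_penalty * math.exp(log_mean)

# n-gram precisions and brevity penalty as reported for this model.
score = bleu_from_precisions([60.1, 39.6, 27.9, 20.0], brevity_penalty=1.00)
print(round(score, 2))  # lands within about 0.1 of the reported 33.93
```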
Example
| English (input) | Model output | Reference |
|---|---|---|
| The US Embassy in Bosnia and Herzegovina welcomed the offer to send soldiers to Iraq. | ABD'nin Bosna-Hersek Büyükelçiliği Irak'a asker gönderme teklifini memnuniyetle karşıladı. | ABD'nin BH Büyükelçiliği Irak'a asker gönderme teklifini memnuniyetle karşıladı. |
Files in this repo
- `model.npz.*`: Nematus/TensorFlow checkpoint (best-validation, update 24,400).
- `train.bpe.en.json`, `train.bpe.tr.json`: Nematus source/target vocabularies (referenced by `model.npz.json`).
- `tr-en.bpe.codes`, `vocab.tr`, `vocab.en`: the joint BPE model, used to segment new input.
- `nematus_tf220_compat.patch`: makes Nematus run on TensorFlow ≥ 2.16 / NumPy ≥ 2 / Python 3.12.
Installation
git clone https://github.com/EdinburghNLP/nematus.git
cd nematus && git checkout 49d050863bc9644b8c0a9d9ab6e54ccd30f927dd
git apply /path/to/nematus_tf220_compat.patch
pip install "tensorflow>=2.16" "numpy>=2" subword-nmt sacremoses huggingface_hub
cd ..
Usage
# 0. download this model
python3 -c "from huggingface_hub import snapshot_download; \
snapshot_download('atahanuz/rnn-translator-en-tr-63M', local_dir='en_tr_rnn')"
# 1. preprocess an English input file (one sentence per line)
perl nematus/data/tokenizer.perl -l en -no-escape < input.en > input.tok.en
subword-nmt apply-bpe -c en_tr_rnn/tr-en.bpe.codes < input.tok.en > input.bpe.en
# 2. translate — run from the model dir so the dict paths in model.npz.json resolve
cd en_tr_rnn
python3 ../nematus/nematus/translate.py -m model.npz -i ../input.bpe.en -o ../out.bpe.tr -k 12 -n -b 50
cd ..
# 3. postprocess: undo BPE, then detokenize (Turkish)
sed -E 's/(@@ )|(@@ ?$)//g' out.bpe.tr > out.tok.tr
python3 - <<'EOF'
import re
from sacremoses import MosesDetokenizer
d = MosesDetokenizer(lang='tr')
for line in open('out.tok.tr', encoding='utf-8'):
    print(re.sub(r"\s*'\s*", "'", d.detokenize(line.split())))  # join Turkish suffix apostrophes
EOF
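To make step 3 easier to follow, the two regular expressions can be tried on their own with only the standard library. The sketch below mirrors the `sed` pattern and the apostrophe join from the commands above; the function names and the sample segmentation are invented for illustration and are not actual model output. It skips the Moses detokenizer, so the space before the final period remains.

```python
import re

def undo_bpe(line):
    """Remove subword-nmt continuation markers ("@@ " and a trailing "@@")."""
    return re.sub(r"(@@ )|(@@ ?$)", "", line)

def join_apostrophes(text):
    """Close up spaces around apostrophes so Turkish suffixes attach (Irak ' a -> Irak'a)."""
    return re.sub(r"\s*'\s*", "'", text)

# Hypothetical BPE-segmented, tokenized line.
segmented = "ABD ' nin Büyük@@ elçi@@ liği Irak ' a asker gönder@@ di ."
print(join_apostrophes(undo_bpe(segmented)))
```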
Intended use & limitations
- A research baseline; for best quality use the Transformer sibling (+8.1 BLEU).
- Domain: news / current affairs. Degrades out of domain and on long / complex sentences.
- Trained on sentences ≤ 60 tokens.
- Scores are tokenized `multi-bleu.perl` BLEU; for citable numbers use sacreBLEU on detokenized output.
- No safety/bias auditing.
License
cc-by-4.0 placeholder — set to match your training-data terms.
Acknowledgements
Nematus; subword-nmt. GRU attentional NMT: Bahdanau et al. (2015) / Sennrich et al. (Nematus).