| license: apache-2.0 | |
| datasets: | |
| - projecte-aina/CA-DE_Parallel_Corpus | |
| language: | |
| - de | |
| - ca | |
| metrics: | |
| - bleu | |
| library_name: fairseq | |
| ## Projecte Aina’s German-Catalan machine translation model | |
| ## Model description | |
| This model was trained from scratch using the [Fairseq toolkit](https://fairseq.readthedocs.io/en/latest/) on a combination of Catalan-German datasets | |
| sourced from Opus and additional datasets in which synthetic Catalan was generated from the Spanish side of Spanish-German corpora using | |
| [Projecte Aina’s Spanish-Catalan model](https://huggingface.co/projecte-aina/aina-translator-es-ca). This gave a total of approximately 100 million sentence pairs. | |
| The model is evaluated on the Flores, NTEU and NTREX evaluation sets. | |
| ## Intended uses and limitations | |
| You can use this model for machine translation from German to Catalan. | |
| ## How to use | |
| ### Usage | |
| Required libraries: | |
| ```bash | |
| pip install ctranslate2 pyonmttok huggingface_hub | |
| ``` | |
| Translate a sentence using Python: | |
| ```python | |
| import ctranslate2 | |
| import pyonmttok | |
| from huggingface_hub import snapshot_download | |
| # Download the model files from the Hugging Face Hub | |
| model_dir = snapshot_download(repo_id="projecte-aina/aina-translator-de-ca", revision="main") | |
| # Load the SentencePiece model shipped with the repository | |
| tokenizer = pyonmttok.Tokenizer(mode="none", sp_model_path=model_dir + "/spm.model") | |
| # tokenize() returns a (tokens, features) pair | |
| tokens, _ = tokenizer.tokenize("Willkommen beim Projekt Aina") | |
| translator = ctranslate2.Translator(model_dir) | |
| results = translator.translate_batch([tokens]) | |
| # Detokenize the best hypothesis of the first (and only) sentence | |
| print(tokenizer.detokenize(results[0].hypotheses[0])) | |
| ``` | |
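To translate several sentences in one call, the same steps can be wrapped in a small helper. This is a minimal sketch, not part of the original card: `translate_sentences` is a hypothetical name, and it assumes `tokenizer` and `translator` are the `pyonmttok.Tokenizer` and `ctranslate2.Translator` objects created in the snippet above.

```python
def translate_sentences(sentences, tokenizer, translator):
    """Translate a list of German sentences to Catalan in one batch.

    `tokenizer` and `translator` are assumed to behave like the
    pyonmttok.Tokenizer and ctranslate2.Translator objects built above.
    """
    # tokenize() returns a (tokens, features) pair; only the tokens are needed.
    batch = [tokenizer.tokenize(sentence)[0] for sentence in sentences]
    results = translator.translate_batch(batch)
    # Each result holds a list of hypotheses, best first.
    return [tokenizer.detokenize(result.hypotheses[0]) for result in results]
```

For example, `translate_sentences(["Guten Morgen", "Wie geht es dir?"], tokenizer, translator)` would return one Catalan string per input sentence, in the same order.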
| ## Limitations and bias | |
| At the time of submission, no measures have been taken to estimate the bias and toxicity embedded in the model. | |
| However, we are well aware that our models may be biased. We intend to conduct research in these areas in the future, and if completed, this model card will be updated. | |
| ## Training | |
| ### Training data | |
| The model was trained on a combination of the following datasets: | |
| | Datasets | | |
| |-------------------| | |
| | Multi CCAligned | | |
| | WikiMatrix | | |
| | GNOME | | |
| | KDE4 | | |
| | OpenSubtitles | | |
| | GlobalVoices | |
| | Tatoeba | | |
| | Books | | |
| | Europarl | | |
| | Tilde | | |
| | MultiParaCrawl | |
| | DGT | | |
| | EU Bookshop | | |
| | NLLB | | |
| All corpora except Europarl and Tilde were collected from [Opus](https://opus.nlpl.eu/). | |
| The Europarl and Tilde corpora are synthetic parallel corpora created from the original Spanish-Catalan corpora by [SoftCatalà](https://github.com/Softcatala). | |
| Where a Spanish-German corpus was used, synthetic Catalan was generated from the Spanish side using | |
| [Projecte Aina’s Spanish-Catalan model](https://huggingface.co/projecte-aina/aina-translator-es-ca). | |
| ### Training procedure | |
| #### Data preparation | |
| All datasets are deduplicated, filtered for language identification, and filtered to remove any sentence pairs with a cosine similarity of less than 0.75. | |
| Similarity is computed on sentence embeddings calculated with [LaBSE](https://huggingface.co/sentence-transformers/LaBSE). | |
| The filtered datasets are then concatenated to form a final corpus of 6,258,272 sentence pairs. Before training, the punctuation is normalized using a | |
| modified version of the join-single-file.py script from [SoftCatalà](https://github.com/Softcatala/nmt-models/blob/master/data-processing-tools/join-single-file.py). | |
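The similarity filter can be illustrated with a short sketch. It is not the authors' pipeline: `cosine_similarity` and `filter_pairs` are hypothetical helpers, and the two-dimensional vectors are hand-made stand-ins for the LaBSE sentence embeddings that would be used in practice.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat zero vectors as dissimilar to everything
    return dot / (norm_u * norm_v)

def filter_pairs(pairs, src_embeddings, tgt_embeddings, threshold=0.75):
    """Keep sentence pairs whose embedding similarity is at least `threshold`."""
    kept = []
    for pair, src_vec, tgt_vec in zip(pairs, src_embeddings, tgt_embeddings):
        if cosine_similarity(src_vec, tgt_vec) >= threshold:
            kept.append(pair)
    return kept

# Toy data: the vectors are illustrative, not real LaBSE embeddings.
pairs = [("Guten Morgen", "Bon dia"), ("Guten Morgen", "El gat dorm")]
src_vecs = [[1.0, 0.0], [1.0, 0.0]]
tgt_vecs = [[0.9, 0.1], [0.1, 0.9]]
kept = filter_pairs(pairs, src_vecs, tgt_vecs)
```

With these toy vectors only the first pair survives, because the second pair's embeddings point in nearly orthogonal directions.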
| #### Tokenization | |
| All data is tokenized using SentencePiece, with a 50,000-token SentencePiece model learned from the combination of all filtered training data. | |
| This SentencePiece model is included in the repository as `spm.model`. | |
| #### Hyperparameters | |
| The model is based on the Transformer-XLarge proposed by [Subramanian et al.](https://aclanthology.org/2021.wmt-1.18.pdf). | |
| The following hyperparameters were set in the Fairseq toolkit: | |
| | Hyperparameter | Value | | |
| |------------------------------------|----------------------------------| | |
| | Architecture | transformer_vaswani_wmt_en_de_big | | |
| | Embedding size | 1024 | | |
| | Feedforward size | 4096 | | |
| | Number of heads | 16 | | |
| | Encoder layers | 24 | | |
| | Decoder layers | 6 | | |
| | Normalize before attention | True | | |
| | --share-decoder-input-output-embed | True | | |
| | --share-all-embeddings | True | | |
| | Effective batch size | 48,000 | |
| | Optimizer | adam | | |
| | Adam betas | (0.9, 0.980) | | |
| | Clip norm | 0.0 | | |
| | Learning rate | 5e-4 | | |
| | LR scheduler | inverse sqrt | |
| | Warmup updates | 8000 | | |
| | Dropout | 0.1 | | |
| | Label smoothing | 0.1 | | |
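The learning-rate rows in the table describe an inverse square root schedule with warmup. The sketch below shows the shape of such a schedule using the peak learning rate and warmup length from the table. The initial warmup learning rate is not given in the table, so the value `0.0` is an assumption, and `inverse_sqrt_lr` is a hypothetical helper rather than Fairseq code.

```python
import math

PEAK_LR = 5e-4          # "Learning rate" in the table
WARMUP_UPDATES = 8000   # "Warmup updates" in the table
WARMUP_INIT_LR = 0.0    # assumption: not stated in the table

def inverse_sqrt_lr(update):
    """Learning rate at a given update under an inverse sqrt schedule."""
    if update < WARMUP_UPDATES:
        # Linear warmup from WARMUP_INIT_LR to PEAK_LR.
        return WARMUP_INIT_LR + (PEAK_LR - WARMUP_INIT_LR) * update / WARMUP_UPDATES
    # After warmup, decay proportionally to 1 / sqrt(update).
    return PEAK_LR * math.sqrt(WARMUP_UPDATES / update)
```

Under these assumptions the rate climbs linearly to 5e-4 at update 8000 and then halves each time the update count quadruples.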
| The model was trained for a total of 29,000 updates. Weights were saved every 1,000 updates, and reported results are the average of the last 2 checkpoints. | |
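One common reading of averaging the last checkpoints is an element-wise mean of the saved parameters, which is what Fairseq's `scripts/average_checkpoints.py` does. Whether the authors used that script is an assumption. The sketch below shows the idea on toy data: `average_checkpoints` is a hypothetical helper, and each checkpoint is represented as a plain dict of flat float lists rather than real tensors.

```python
def average_checkpoints(checkpoints):
    """Average parameters element-wise across checkpoints.

    Each checkpoint is a dict mapping a parameter name to a flat list of floats.
    """
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    averaged = {}
    for name in checkpoints[0]:
        # Group the i-th value of this parameter from every checkpoint.
        columns = zip(*(ckpt[name] for ckpt in checkpoints))
        averaged[name] = [sum(values) / len(checkpoints) for values in columns]
    return averaged

# Toy stand-ins for the last two saved checkpoints.
last_two = [
    {"w": [1.0, 2.0], "b": [0.0]},
    {"w": [3.0, 4.0], "b": [1.0]},
]
averaged = average_checkpoints(last_two)
```

Here `averaged` holds the midpoint of each parameter, which is the single set of weights that would then be used for inference.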
| ## Evaluation | |
| ### Variables and metrics | |
| We use the BLEU score for evaluation on the [Flores-101](https://github.com/facebookresearch/flores), NTEU (unpublished) and [NTREX](https://github.com/MicrosoftTranslator/NTREX) test sets. | |
| ### Evaluation results | |
| Below are the BLEU scores for German-to-Catalan machine translation, compared to [SoftCatalà](https://www.softcatala.org/) and | |
| [Google Translate](https://translate.google.es/?hl=es): | |
| | Test set | SoftCatalà | Google Translate | aina-translator-de-ca | |
| |----------------------|------------|------------------|-----------------------| |
| | Flores 101 dev | 28.9 | **35.1** | 33.1 | |
| | Flores 101 devtest | 29.2 | **35.9** | 33.2 | |
| | NTEU | 38.9 | 39.1 | **42.9** | |
| | NTREX | 25.7 | **31.2** | 29.1 | |
| | **Average** | 30.7 | **35.3** | 34.3 | |
| ## Additional information | |
| ### Author | |
| The Language Technologies Unit from Barcelona Supercomputing Center. | |
| ### Contact | |
| For further information, please send an email to <langtech@bsc.es>. | |
| ### Copyright | |
| Copyright (c) 2023 by Language Technologies Unit, Barcelona Supercomputing Center. | |
| ### License | |
| [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0) | |
| ### Funding | |
| This work has been promoted and financed by the Generalitat de Catalunya through the [Aina project](https://projecteaina.cat/). | |
| ### Disclaimer | |
| <details> | |
| <summary>Click to expand</summary> | |
| The model published in this repository is intended for a generalist purpose and is available to third parties under a permissive Apache License, Version 2.0. | |
| Be aware that the model may have biases and/or any other undesirable distortions. | |
| When third parties deploy or provide systems and/or services to other parties using this model (or any system based on it) | |
| or become users of the model, they should note that it is their responsibility to mitigate the risks arising from its use and, | |
| in any event, to comply with applicable regulations, including regulations regarding the use of Artificial Intelligence. | |
| In no event shall the owner and creator of the model (Barcelona Supercomputing Center) | |
| be liable for any results arising from the use made by third parties. | |
| </details> |