Committed by rdelyon
Commit 8bddaff · verified · 1 Parent(s): 594ec24

Update model card: Intended Use, Limitations, code example, BibTeX, licence/email/description fixes

Files changed (1)
  1. README.md +67 -26
README.md CHANGED
@@ -1,58 +1,99 @@
  ---
  license: cc-by-2.0
  datasets:
- - mbazaNLP/NMT_Tourism_parallel_data_en_kin
- - mbazaNLP/NMT_Education_parallel_data_en_kin
  - mbazaNLP/Kinyarwanda_English_parallel_dataset
  language:
  - en
  - rw
  library_name: transformers
  ---
- ## Model Details
-
- ### Model Description
-
- <!-- Provide a longer summary of what this model is. -->

- This is a Machine Translation model, finetuned from [NLLB](https://huggingface.co/facebook/nllb-200-distilled-1.3B)-200's distilled 1.3B model, it is meant to be used in machine translation for education-related data.



- - **Finetuning code repository:** the code used to finetune this model can be found [here](https://github.com/Digital-Umuganda/twb_nllb_finetuning)


- <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->


- ## How to Get Started with the Model

- Use the code below to get started with the model.


- ### Training Procedure

- The model was finetuned on three datasets; a [general](https://huggingface.co/datasets/mbazaNLP/Kinyarwanda_English_parallel_dataset) purpose dataset, a [tourism](https://huggingface.co/datasets/mbazaNLP/NMT_Tourism_parallel_data_en_kin), and an [education](https://huggingface.co/datasets/mbazaNLP/NMT_Education_parallel_data_en_kin) dataset.
- The model was finetuned on an A100 40GB GPU for two epochs.


  ## Evaluation

- <!-- This section describes the evaluation protocols and provides the results. -->
-
-
- #### Testing Data
-
- <!-- This should link to a Data Card if possible. -->
-
-
- #### Metrics

- Model performance was measured using BLEU, spBLEU, TER, and chrF++ metrics.

- ### Results





  ---
  license: cc-by-2.0
  datasets:
  - mbazaNLP/Kinyarwanda_English_parallel_dataset
+ - mbazaNLP/NMT_Education_parallel_data_en_kin
+ - mbazaNLP/NMT_Tourism_parallel_data_en_kin
  language:
  - en
  - rw
  library_name: transformers
+ pipeline_tag: translation
  ---

+ # Nllb_finetuned_general_en_kin - English ↔ Kinyarwanda (General Purpose)

+ General-purpose machine translation model for English ↔ Kinyarwanda.
+ Fine-tuned from [facebook/nllb-200-distilled-1.3B](https://huggingface.co/facebook/nllb-200-distilled-1.3B).

+ **Fine-tuning code:** [Digital-Umuganda/twb_nllb_finetuning](https://github.com/Digital-Umuganda/twb_nllb_finetuning)

+ ## Usage

+ ```python
+ from transformers import pipeline

+ # English → Kinyarwanda
+ translator = pipeline(
+     "translation",
+     model="mbazaNLP/Nllb_finetuned_general_en_kin",
+     src_lang="eng_Latn",
+     tgt_lang="kin_Latn",
+     max_length=400,
+ )
+ result = translator("Rwanda is a country in East Africa known for its biodiversity.")
+ print(result[0]["translation_text"])

+ # Kinyarwanda → English
+ translator_rev = pipeline(
+     "translation",
+     model="mbazaNLP/Nllb_finetuned_general_en_kin",
+     src_lang="kin_Latn",
+     tgt_lang="eng_Latn",
+     max_length=400,
+ )
+ result = translator_rev("U Rwanda ni igihugu giri mu Afurika yo Hagati.")
+ print(result[0]["translation_text"])
+ ```

+ ## Intended Use

+ **Suitable for:**
+ - General-purpose English ↔ Kinyarwanda translation
+ - Applications requiring broad language coverage across domains
+ - Research baseline for NLLB Kinyarwanda translation

+ **Not intended for:**
+ - High-stakes translation without human review
+ - Specialised domains where the education or tourism models may perform better

+ ## Training

+ Fine-tuned on a general-purpose corpus in a single phase:
+ - [mbazaNLP/Kinyarwanda_English_parallel_dataset](https://huggingface.co/datasets/mbazaNLP/Kinyarwanda_English_parallel_dataset)
+ - [mbazaNLP/NMT_Education_parallel_data_en_kin](https://huggingface.co/datasets/mbazaNLP/NMT_Education_parallel_data_en_kin)
+ - [mbazaNLP/NMT_Tourism_parallel_data_en_kin](https://huggingface.co/datasets/mbazaNLP/NMT_Tourism_parallel_data_en_kin)

+ Training hardware: A100 40 GB GPU, 2 epochs.

  ## Evaluation

+ <!-- TODO: add BLEU/spBLEU/chrF++ scores from evaluation -->

+ | Lang. Direction | BLEU | spBLEU | chrF++ | TER |
+ |-----------------|------|--------|--------|-----|
+ | Eng → Kin | - | - | - | - |
+ | Kin → Eng | - | - | - | - |

+ ## Limitations

+ - General-purpose training may underperform domain-specific models in education or tourism contexts.
+ - Low-frequency Kinyarwanda vocabulary and tonal nuances may not be handled accurately.
+ - Outputs should be reviewed for high-stakes applications.
+ - Maximum reliable input length is approximately 200 tokens.

+ ## Bias and Fairness

+ Training data spans multiple domains but may not equally represent all registers of Kinyarwanda. Colloquial or dialectal text may translate with lower quality.

+ ## Citation

+ ```bibtex
+ @misc{mbazaNLP2023nllb_finetuned_general,
+ author = {MBAZA-NLP Community},
+ title = {Nllb\_finetuned\_general\_en\_kin: English--Kinyarwanda Machine Translation (General Purpose)},
+ year = {2023},
+ url = {https://huggingface.co/mbazaNLP/Nllb_finetuned_general_en_kin},
+ note = {Hugging Face model repository}
+ }
+ ```
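
Note on the input-length limitation: the Limitations section added in this commit puts the reliable input length at roughly 200 tokens. One way a caller might stay within that bound is to group whole sentences into chunks before translating them. The sketch below is illustrative only and is not part of the commit. `chunk_by_budget` and `rough_token_count` are hypothetical helpers, the whitespace word count is a crude stand-in for the model's tokenizer, and the regex sentence splitter is an assumption that will not suit every text.

```python
import re
from typing import Callable, List


def rough_token_count(text: str) -> int:
    """Whitespace word count: a crude proxy for the real tokenizer (assumption)."""
    return len(text.split())


def chunk_by_budget(
    text: str,
    budget: int = 200,
    count_tokens: Callable[[str], int] = rough_token_count,
) -> List[str]:
    """Group whole sentences into chunks whose token count stays within `budget`.

    A single sentence that exceeds the budget on its own is kept as one
    oversized chunk rather than being split mid-sentence.
    """
    # Split after sentence-ending punctuation that is followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and count_tokens(candidate) > budget:
            # Adding this sentence would exceed the budget: close the chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    sample = "Rwanda is in East Africa. It is known for its biodiversity. Kigali is the capital."
    for chunk in chunk_by_budget(sample, budget=10):
        print(chunk)
```

To count real tokens instead of words, a caller could pass something like `count_tokens=lambda s: len(tokenizer(s)["input_ids"])`, where `tokenizer` is assumed to be loaded with `AutoTokenizer.from_pretrained` for this model. Each chunk could then be passed to the `translator` pipeline shown in the Usage section.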