---
language: fr
license: apache-2.0
tags:
- tokenizer
- wordpiece
- french
- xnli
- nlp-research
datasets:
- facebook/xnli
---

# NIRVLab: WordPiece Tokenizer for French XNLI

A **WordPiece** tokenizer (BERT-style) trained from scratch on the French (`fr`) subset
of the [facebook/xnli](https://huggingface.co/datasets/facebook/xnli) dataset.

## Training Details

| Parameter | Value |
|---|---|
| Algorithm | WordPiece |
| Vocabulary size | 8,000 |
| Special tokens | `<s>`, `<pad>`, `</s>`, `<unk>`, `<mask>` |
| Corpus | `facebook/xnli`, `fr` subset, all splits |
| Corpus size | 800,404 sentences |
| Normalizer | NFD + StripAccents + NFC |
| Pre-tokenizer | Whitespace |
| Min frequency | 2 |
| Continuing subword prefix | `##` |
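
The normalizer row has the largest effect on French text, because it removes diacritics before tokenization. The sketch below approximates the NFD, StripAccents, NFC chain using only Python's standard `unicodedata` module. It is an illustration and not the training code behind this tokenizer: the function name `strip_accents_like` is hypothetical, and treating every Unicode mark category as an accent is an assumption about how `StripAccents` behaves.

```python
import unicodedata


def strip_accents_like(text: str) -> str:
    """Approximate an NFD -> strip combining marks -> NFC normalization chain."""
    # NFD splits accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop every character in a Unicode "Mark" category (assumed behavior).
    stripped = "".join(
        ch for ch in decomposed if not unicodedata.category(ch).startswith("M")
    )
    # NFC recomposes anything that can still be composed.
    return unicodedata.normalize("NFC", stripped)


print(strip_accents_like("Élève à Noël, ça coûte"))  # Eleve a Noel, ca coute
print(strip_accents_like("cœur"))  # cœur (the ligature has no canonical decomposition)
```

One consequence of this kind of normalization is that accent-only distinctions disappear: `côte` and `cote` reach the tokenizer as the same string.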

## Evaluation Metrics

| Metric | Value |
|---|---|
| Tokens / char | `0.2560` |
| Fertility (tokens / word) | `1.5407` |
| Avg sequence length | `23.52` tokens |
| Vocabulary coverage | `1.0000` |
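
The card does not spell out how these metrics were computed, so the following is only a minimal sketch of one plausible set of definitions. It assumes that words are whitespace-separated, that characters include spaces, and that coverage is the fraction of tokens that are not the unknown token. The helper names `tokenizer_metrics` and `toy_tokenize` are hypothetical, and the toy tokenizer merely stands in for a real one so the example runs without downloads.

```python
from typing import Callable, Dict, Iterable, List


def tokenizer_metrics(
    sentences: Iterable[str],
    tokenize: Callable[[str], List[str]],
    unk_token: str = "<unk>",
) -> Dict[str, float]:
    """Compute corpus-level tokenizer statistics under the stated assumptions."""
    n_sent = n_tok = n_char = n_word = n_unk = 0
    for sentence in sentences:
        tokens = tokenize(sentence)
        n_sent += 1
        n_tok += len(tokens)
        n_char += len(sentence)          # characters, spaces included
        n_word += len(sentence.split())  # whitespace-separated words
        n_unk += sum(tok == unk_token for tok in tokens)
    if n_tok == 0 or n_word == 0:
        raise ValueError("corpus must contain at least one word and one token")
    return {
        "tokens_per_char": n_tok / n_char,
        "fertility": n_tok / n_word,
        "avg_seq_len": n_tok / n_sent,
        "coverage": 1 - n_unk / n_tok,
    }


def toy_tokenize(sentence: str, piece_len: int = 4) -> List[str]:
    """Stand-in tokenizer: cut each word into fixed-size pieces with a ## prefix."""
    tokens = []
    for word in sentence.split():
        for i in range(0, len(word), piece_len):
            piece = word[i:i + piece_len]
            tokens.append(piece if i == 0 else "##" + piece)
    return tokens


metrics = tokenizer_metrics(["le chat dort", "bonjour"], toy_tokenize)
print(metrics["fertility"])    # 1.25
print(metrics["avg_seq_len"])  # 2.5
```

To apply the same function to a real tokenizer, pass a callable that returns a list of token strings, for example the `tokenize` method of a loaded `transformers` tokenizer.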

## Usage

```python
from transformers import AutoTokenizer

# Load the tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("NIRVLab/xnli-wordpiece-fr")

# return_tensors="pt" returns PyTorch tensors, so PyTorch must be installed
tokens = tokenizer("Bonjour le monde!", return_tensors="pt")
```