---
language: fr
license: apache-2.0
tags:
  - tokenizer
  - wordpiece
  - french
  - xnli
  - nlp-research
datasets:
  - facebook/xnli
---

# NIRVLab — WordPiece Tokenizer for French XNLI

A **WordPiece** tokenizer (BERT-style) trained from scratch on the French (`fr`) subset
of the [facebook/xnli](https://huggingface.co/datasets/facebook/xnli) dataset.

## Training Details

| Parameter | Value |
|---|---|
| Algorithm | WordPiece |
| Vocabulary size | 8,000 |
| Special tokens | `<s>`, `<pad>`, `</s>`, `<unk>`, `<mask>` |
| Corpus | `facebook/xnli` / `fr` — all splits |
| Corpus size | 800,404 sentences |
| Normalizer | NFD + StripAccents + NFC |
| Pre-tokenizer | Whitespace |
| Min frequency | 2 |
| Continuing subword prefix | `##` |
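
The normalizer row above reads as three steps: decompose each character (NFD), drop the combining accent marks, then recompose (NFC). The sketch below illustrates that effect using only Python's standard `unicodedata` module. It is an approximation for illustration, not the `tokenizers` pipeline used for training, and `strip_accents` is a hypothetical helper name.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Approximate NFD + StripAccents + NFC with the standard library."""
    # NFD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Assumption: "accents" are the characters in the Unicode Mark categories.
    without_marks = "".join(
        ch for ch in decomposed if not unicodedata.category(ch).startswith("M")
    )
    # NFC recomposes whatever sequences remain.
    return unicodedata.normalize("NFC", without_marks)


print(strip_accents("Élève à Noël"))  # Eleve a Noel
```

One consequence is that accented and unaccented spellings map to the same tokens, so pairs such as "a" and "à" are not distinguished after normalization.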

## Evaluation Metrics

| Metric | Value |
|---|---|
| Tokens / char | `0.2560` |
| Fertility (tokens / word) | `1.5407` |
| Avg sequence length | `23.52` tokens |
| Vocabulary coverage | `1.0000` |
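
The card does not state how these metrics were computed. The sketch below shows one plausible way to derive the first three rows, under the assumptions that characters include spaces, words are split on whitespace, and special tokens are not counted. `tokenizer_metrics` is a hypothetical helper, and `tokenize` is any callable that maps a string to a list of tokens.

```python
from typing import Callable, Dict, Iterable, List


def tokenizer_metrics(
    sentences: Iterable[str],
    tokenize: Callable[[str], List[str]],
) -> Dict[str, float]:
    """Compute tokens/char, fertility, and average sequence length."""
    n_tokens = n_chars = n_words = n_sentences = 0
    for sentence in sentences:
        n_tokens += len(tokenize(sentence))
        n_chars += len(sentence)          # assumption: spaces count as characters
        n_words += len(sentence.split())  # assumption: whitespace-delimited words
        n_sentences += 1
    if n_chars == 0 or n_words == 0:
        raise ValueError("need at least one non-empty sentence")
    return {
        "tokens_per_char": n_tokens / n_chars,
        "fertility": n_tokens / n_words,
        "avg_sequence_length": n_tokens / n_sentences,
    }
```

To apply it to this tokenizer, one option is to pass `tokenizer.tokenize` from a loaded `AutoTokenizer` as the `tokenize` argument. Results may differ from the table if the original evaluation used other definitions.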

## Usage

```python
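from transformers import AutoTokenizer

# Load the tokenizer from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("NIRVLab/xnli-wordpiece-fr")
# return_tensors="pt" returns PyTorch tensors, so PyTorch must be installed.
tokens = tokenizer("Bonjour le monde!", return_tensors="pt")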
```