---
language: fr
license: apache-2.0
tags:
  - tokenizer
  - wordpiece
  - french
  - xnli
  - nlp-research
datasets:
  - facebook/xnli
---

NIRVLab — WordPiece Tokenizer for French XNLI

A WordPiece tokenizer (BERT-style) trained from scratch on the French (fr) subset of the facebook/xnli dataset.

Training Details

| Parameter | Value |
|---|---|
| Algorithm | WordPiece |
| Vocabulary size | 8,000 |
| Special tokens | `<s>`, `<pad>`, `</s>`, `<unk>`, `<mask>` |
| Corpus | facebook/xnli, fr subset, all splits |
| Corpus size | 800,404 sentences |
| Normalizer | NFD + StripAccents + NFC |
| Pre-tokenizer | Whitespace |
| Min frequency | 2 |
| Continuing subword prefix | `##` |
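
The settings above map fairly directly onto the Hugging Face `tokenizers` library. The following is a minimal sketch of how a tokenizer with these parameters could be trained, not the script actually used for this model. The function name `train_wordpiece` and the `SPECIAL_TOKENS` constant are hypothetical, and extracting the French sentences from facebook/xnli is assumed to happen elsewhere.

```python
from typing import Iterable

from tokenizers import Tokenizer, models, normalizers, pre_tokenizers, trainers

# Special tokens as listed in the table above.
SPECIAL_TOKENS = ["<s>", "<pad>", "</s>", "<unk>", "<mask>"]


def train_wordpiece(corpus: Iterable[str], vocab_size: int = 8000) -> Tokenizer:
    """Train a WordPiece tokenizer on an iterable of sentences (hypothetical helper)."""
    tokenizer = Tokenizer(models.WordPiece(unk_token="<unk>"))

    # NFD decomposes accented characters, StripAccents drops the combining
    # marks, and NFC recomposes whatever remains.
    tokenizer.normalizer = normalizers.Sequence(
        [normalizers.NFD(), normalizers.StripAccents(), normalizers.NFC()]
    )
    tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

    trainer = trainers.WordPieceTrainer(
        vocab_size=vocab_size,
        min_frequency=2,
        special_tokens=SPECIAL_TOKENS,
        continuing_subword_prefix="##",
        show_progress=False,
    )
    tokenizer.train_from_iterator(corpus, trainer=trainer)
    return tokenizer
```

Calling `train_wordpiece(sentences)` with the French XNLI sentences and then `tokenizer.save("tokenizer.json")` would produce a file that `transformers` can wrap. Because the normalizer strips accents, inputs such as "Été" are reduced to "Ete" before tokenization.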

Evaluation Metrics

| Metric | Value |
|---|---|
| Tokens / char | 0.2560 |
| Fertility (tokens / word) | 1.5407 |
| Avg sequence length | 23.52 tokens |
| Vocabulary coverage | 1.0000 |
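
This card does not state exactly how the metrics were computed. The sketch below shows one common set of definitions, under these assumptions: words are whitespace-separated, characters are counted on the raw sentence including spaces, and coverage is the fraction of tokens that are not `<unk>`. The function name `tokenizer_metrics` is hypothetical.

```python
from typing import Callable, Dict, Iterable, List


def tokenizer_metrics(
    sentences: Iterable[str],
    tokenize: Callable[[str], List[str]],
    unk_token: str = "<unk>",
) -> Dict[str, float]:
    """Compute simple corpus-level tokenizer statistics (assumed definitions)."""
    n_sentences = n_tokens = n_chars = n_words = n_unk = 0
    for sentence in sentences:
        tokens = tokenize(sentence)
        n_sentences += 1
        n_tokens += len(tokens)
        n_chars += len(sentence)          # raw characters, spaces included
        n_words += len(sentence.split())  # whitespace-separated words
        n_unk += sum(1 for t in tokens if t == unk_token)

    if n_sentences == 0 or n_tokens == 0 or n_chars == 0 or n_words == 0:
        raise ValueError("corpus must contain at least one non-empty sentence")

    return {
        "tokens_per_char": n_tokens / n_chars,
        "fertility": n_tokens / n_words,
        "avg_sequence_length": n_tokens / n_sentences,
        "vocabulary_coverage": 1.0 - n_unk / n_tokens,
    }
```

With a `transformers` tokenizer, passing `tokenize=tokenizer.tokenize` yields token strings without special tokens. Whether the reported average sequence length includes special tokens is not specified, so results from this sketch may differ slightly from the table.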

Usage

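from transformers import AutoTokenizer

# Load the tokenizer from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("NIRVLab/xnli-wordpiece-fr")
# return_tensors="pt" returns PyTorch tensors, so PyTorch must be installed.
tokens = tokenizer("Bonjour le monde!", return_tensors="pt")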