---
language: vi
license: apache-2.0
tags:
- tokenizer
- wordpiece
- vietnamese
- xnli
- nlp-research
datasets:
- facebook/xnli
---
# NIRVLab — WordPiece Tokenizer for Vietnamese XNLI
A **WordPiece** tokenizer (BERT-style) trained from scratch on the Vietnamese (`vi`) subset
of the [facebook/xnli](https://huggingface.co/datasets/facebook/xnli) dataset.
## Training Details
| Parameter | Value |
|---|---|
| Algorithm | WordPiece |
| Vocabulary size | 8,000 |
| Special tokens | `, , , , ` |
| Corpus | `facebook/xnli` / `vi` — all splits |
| Corpus size | 800,404 sentences |
| Normalizer | NFD + StripAccents + NFC |
| Pre-tokenizer | Whitespace |
| Min frequency | 2 |
| Continuing subword prefix | `##` |
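
The normalizer row above (NFD + StripAccents + NFC) suggests that Vietnamese diacritics are removed before the text is split into subwords. The following is a minimal standard-library sketch of that effect, not the tokenizer's actual implementation: the helper name `strip_accents_sketch` is hypothetical, and dropping every character in a Unicode "Mark" category is an assumption about how StripAccents behaves.

```python
import unicodedata


def strip_accents_sketch(text: str) -> str:
    """Approximate an NFD -> strip combining marks -> NFC pipeline (assumed behavior)."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop combining marks (Unicode categories Mn, Mc, Me).
    stripped = "".join(
        ch for ch in decomposed if not unicodedata.category(ch).startswith("M")
    )
    # NFC recomposes whatever is left into canonical composed form.
    return unicodedata.normalize("NFC", stripped)


print(strip_accents_sketch("Xin chào thế giới!"))  # Xin chao the gioi!
```

Under this approximation, `đ` is left unchanged because it has no canonical decomposition, so `đường` becomes `đuong`. Words that differ only in tone marks or vowel diacritics would map to the same normalized form.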
## Evaluation Metrics
| Metric | Value |
|---|---|
| Tokens / char | `0.2581` |
| Fertility (tokens / word) | `1.1429` |
| Avg sequence length | `21.57` tokens |
| Vocabulary coverage | `1.0000` |
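
The card does not state how these metrics were computed. The sketch below shows one plausible set of definitions, all of which are assumptions: words are whitespace-separated, characters are counted on the raw sentence, and coverage is the share of tokens that are not the unknown token. The function name `tokenizer_metrics` and the default `unk_token="[UNK]"` are hypothetical.

```python
from typing import Callable, Dict, Iterable, List


def tokenizer_metrics(
    sentences: Iterable[str],
    tokenize: Callable[[str], List[str]],
    unk_token: str = "[UNK]",  # assumed name; pass the tokenizer's real unknown token
) -> Dict[str, float]:
    """Compute corpus-level tokenizer statistics under the assumed definitions."""
    sentences = list(sentences)
    n_tokens = n_chars = n_words = n_unk = 0
    for sentence in sentences:
        tokens = tokenize(sentence)
        n_tokens += len(tokens)
        n_chars += len(sentence)          # raw characters, spaces included
        n_words += len(sentence.split())  # whitespace-separated words
        n_unk += sum(1 for t in tokens if t == unk_token)
    return {
        "tokens_per_char": n_tokens / n_chars,
        "fertility": n_tokens / n_words,
        "avg_seq_len": n_tokens / len(sentences),
        "coverage": 1.0 - n_unk / n_tokens,
    }


# Toy check with a whitespace "tokenizer": every word is exactly one token.
print(tokenizer_metrics(["a bc", "def"], str.split))
```

With a loaded `transformers` tokenizer, the same function could be called as `tokenizer_metrics(sentences, tokenizer.tokenize, unk_token=tokenizer.unk_token)`. The values may differ from the table if the original evaluation used different definitions or counted special tokens.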
## Usage
```python
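from transformers import AutoTokenizer

# Load the tokenizer from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("NIRVLab/xnli-wordpiece-vi")

# Encode a sentence; return_tensors="pt" requires PyTorch to be installed.
tokens = tokenizer("Xin chào thế giới!", return_tensors="pt")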
```