---
language: vi
license: apache-2.0
tags:
- tokenizer
- wordpiece
- vietnamese
- xnli
- nlp-research
datasets:
- facebook/xnli
---

# NIRVLab: WordPiece Tokenizer for Vietnamese XNLI

A **WordPiece** tokenizer (BERT-style) trained from scratch on the Vietnamese (`vi`) subset of the [facebook/xnli](https://huggingface.co/datasets/facebook/xnli) dataset.

## Training Details

| Parameter | Value |
|---|---|
| Algorithm | WordPiece |
| Vocabulary size | 8,000 |
| Special tokens | `, , , , ` |
| Corpus | `facebook/xnli`, `vi` subset, all splits |
| Corpus size | 800,404 sentences |
| Normalizer | NFD + StripAccents + NFC |
| Pre-tokenizer | Whitespace |
| Min frequency | 2 |
| Continuing subword prefix | `##` |

## Evaluation Metrics

| Metric | Value |
|---|---|
| Tokens / char | `0.2581` |
| Fertility (tokens / word) | `1.1429` |
| Avg sequence length | `21.57` tokens |
| Vocabulary coverage | `1.0000` |

## Usage

```python
from transformers import AutoTokenizer

# Load the tokenizer from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("NIRVLab/xnli-wordpiece-vi")

# return_tensors="pt" returns PyTorch tensors, so PyTorch must be installed.
tokens = tokenizer("Xin chào thế giới!", return_tensors="pt")
```
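
## Reproducing the Metrics (sketch)

The evaluation metrics above are ratios of token counts to character, word, and sentence counts. The snippet below is a minimal sketch of how such ratios could be computed, not the script used to produce the reported numbers. The function name `tokenizer_metrics` is hypothetical, and the sketch assumes that a "word" is a whitespace-separated unit and that characters are counted with `len()` on the raw string. Whether the reported values were computed with or without special tokens is not stated here.

```python
from typing import Callable, Dict, Iterable, List


def tokenizer_metrics(
    sentences: Iterable[str],
    tokenize: Callable[[str], List[str]],
) -> Dict[str, float]:
    """Compute tokens/char, fertility (tokens/word), and average sequence length.

    Assumptions (not confirmed by the model card): words are split on
    whitespace and characters are counted on the raw, unnormalized string.
    """
    n_sentences = n_tokens = n_words = n_chars = 0
    for sentence in sentences:
        n_sentences += 1
        n_tokens += len(tokenize(sentence))
        n_words += len(sentence.split())
        n_chars += len(sentence)

    if n_sentences == 0:
        raise ValueError("at least one sentence is required")

    return {
        "tokens_per_char": n_tokens / n_chars if n_chars else 0.0,
        "fertility": n_tokens / n_words if n_words else 0.0,
        "avg_seq_len": n_tokens / n_sentences,
    }


# Toy usage with a whitespace "tokenizer" standing in for the real one.
toy_metrics = tokenizer_metrics(["a b", "c d e"], str.split)
```

To apply this to the tokenizer loaded in the Usage section, `tokenizer.tokenize` can be passed as the `tokenize` argument; it returns the subword strings for a text without adding special tokens.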