---
language: vi
license: apache-2.0
tags:
- tokenizer
- unigram
- vietnamese
- xnli
- nlp-research
datasets:
- facebook/xnli
---

# NIRVLab: Unigram Tokenizer for Vietnamese XNLI

A **Unigram Language Model** tokenizer trained from scratch on the Vietnamese (`vi`) subset of the [facebook/xnli](https://huggingface.co/datasets/facebook/xnli) dataset.

## Training Details

| Parameter | Value |
|---|---|
| Algorithm | Unigram LM (SentencePiece-style) |
| Vocabulary size | 8,000 |
| Special tokens | `, , , , ` |
| Corpus | `facebook/xnli` / `vi`, all splits |
| Corpus size | 800,404 sentences |
| Normalizer | Nmt + NFC Unicode |
| Pre-tokenizer | Metaspace (▁ prefix) |
| Shrinking factor | 0.75 |
| Max piece length | 16 |

## Evaluation Metrics

| Metric | Value |
|---|---|
| Tokens / char | `0.2967` |
| Fertility (tokens / word) | `1.3142` |
| Avg sequence length | `24.81` tokens |
| Vocabulary coverage | `1.0000` |

## Usage

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("NIRVLab/xnli-unigram-vi")
tokens = tokenizer("Xin chào thế giới!", return_tensors="pt")
```
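
The card does not say how the evaluation metrics were computed. The sketch below shows one plausible reading, assuming that "words" are whitespace-separated spans, that characters are counted on the raw string (spaces included), and that average sequence length is tokens per sentence without special tokens. The helper name `tokenizer_stats` and the toy tokenizer are hypothetical and only illustrate the arithmetic; they are not the author's evaluation script.

```python
from typing import Callable, Dict, Iterable, List


def tokenizer_stats(
    sentences: Iterable[str],
    tokenize: Callable[[str], List[str]],
) -> Dict[str, float]:
    """Compute tokens/char, fertility (tokens/word), and average sequence length.

    Assumptions (not confirmed by the model card): words are split on
    whitespace, characters are counted with len() on the raw string, and
    no special tokens are added.
    """
    n_tokens = n_chars = n_words = n_sentences = 0
    for sentence in sentences:
        n_tokens += len(tokenize(sentence))
        n_chars += len(sentence)
        n_words += len(sentence.split())
        n_sentences += 1

    if n_chars == 0 or n_words == 0:
        raise ValueError("corpus must contain at least one non-empty sentence")

    return {
        "tokens_per_char": n_tokens / n_chars,
        "fertility": n_tokens / n_words,
        "avg_seq_len": n_tokens / n_sentences,
    }


def toy_tokenize(text: str) -> List[str]:
    # Toy stand-in tokenizer: one token per non-space character.
    return [ch for ch in text if not ch.isspace()]


stats = tokenizer_stats(["a b", "cd"], toy_tokenize)
print(stats)
```

To apply the same helper to this tokenizer, pass `tokenizer.tokenize` (from the `AutoTokenizer` loaded in the Usage section) as the `tokenize` argument, together with an iterable of Vietnamese sentences. The resulting numbers will match the table only if the original evaluation used the same definitions.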