NIRVLab — Unigram Tokenizer for Vietnamese XNLI

A Unigram Language Model tokenizer trained from scratch on the Vietnamese (vi) subset of the facebook/xnli dataset.

Training Details

| Parameter | Value |
| --- | --- |
| Algorithm | Unigram LM (SentencePiece-style) |
| Vocabulary size | 8,000 |
| Special tokens | `<s>`, `<pad>`, `</s>`, `<unk>`, `<mask>` |
| Corpus | `facebook/xnli`, `vi` config, all splits |
| Corpus size | 800,404 sentences |
| Normalizer | Nmt + NFC Unicode |
| Pre-tokenizer | Metaspace (▁ prefix) |
| Shrinking factor | 0.75 |
| Max piece length | 16 |
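
A minimal sketch of how the settings above could map onto the Hugging Face `tokenizers` library. The card does not publish its training script, so the library choice, the helper names (`build_tokenizer_and_trainer`, `iter_xnli_vi_sentences`, `train`), and the decision to feed both `premise` and `hypothesis` from every split are assumptions. That last choice is at least consistent with the reported corpus size, since 400,202 Vietnamese XNLI pairs across train, validation, and test would yield 800,404 sentences.

```python
from tokenizers import Tokenizer, normalizers, pre_tokenizers
from tokenizers.models import Unigram
from tokenizers.trainers import UnigramTrainer

# Special tokens in the order listed on the card.
SPECIAL_TOKENS = ["<s>", "<pad>", "</s>", "<unk>", "<mask>"]


def build_tokenizer_and_trainer():
    """Hypothetical helper: untrained Unigram tokenizer plus a matching trainer."""
    tokenizer = Tokenizer(Unigram())
    # Nmt cleanup followed by NFC Unicode normalization.
    tokenizer.normalizer = normalizers.Sequence(
        [normalizers.Nmt(), normalizers.NFC()]
    )
    # Metaspace replaces spaces with "▁" and prefixes each word with it.
    tokenizer.pre_tokenizer = pre_tokenizers.Metaspace()
    trainer = UnigramTrainer(
        vocab_size=8000,
        special_tokens=SPECIAL_TOKENS,
        unk_token="<unk>",
        shrinking_factor=0.75,
        max_piece_length=16,
    )
    return tokenizer, trainer


def iter_xnli_vi_sentences():
    """Hypothetical helper: yield every premise and hypothesis from all splits.

    Assumption: the card only says "all splits"; using both text fields is a guess.
    """
    from datasets import load_dataset  # imported lazily; needs network access

    dataset = load_dataset("facebook/xnli", "vi")
    for split in dataset.values():
        for row in split:
            yield row["premise"]
            yield row["hypothesis"]


def train(sentences):
    """Train a fresh tokenizer on an iterable of sentences."""
    tokenizer, trainer = build_tokenizer_and_trainer()
    tokenizer.train_from_iterator(sentences, trainer=trainer)
    return tokenizer
```

Calling `train(iter_xnli_vi_sentences())` would run the whole pipeline under these assumptions; the resulting vocabulary is not guaranteed to match the published one.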

Evaluation Metrics

| Metric | Value |
| --- | --- |
| Tokens / char | `0.2967` |
| Fertility (tokens / word) | `1.3142` |
| Average sequence length | `24.81` tokens |
| Vocabulary coverage | `1.0000` |
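
The card does not define these metrics, so the following is only a sketch of one plausible reading, not the authors' evaluation code. It assumes corpus-level (micro-averaged) ratios, words split on whitespace, character counts taken from the raw string, and coverage measured as the share of tokens that are not `<unk>`. The function name `tokenizer_metrics` and its `tokenize` callback are hypothetical.

```python
from typing import Callable, Dict, Iterable, List


def tokenizer_metrics(
    sentences: Iterable[str],
    tokenize: Callable[[str], List[str]],
    unk_token: str = "<unk>",
) -> Dict[str, float]:
    """Compute corpus-level tokenizer statistics under the assumed definitions."""
    n_sent = n_chars = n_words = n_tokens = n_unk = 0
    for text in sentences:
        pieces = tokenize(text)
        n_sent += 1
        n_chars += len(text)          # raw characters, spaces included
        n_words += len(text.split())  # whitespace-delimited words
        n_tokens += len(pieces)
        n_unk += sum(1 for piece in pieces if piece == unk_token)
    if n_tokens == 0 or n_words == 0:
        raise ValueError("need at least one non-empty sentence")
    return {
        "tokens_per_char": n_tokens / n_chars,
        "fertility": n_tokens / n_words,
        "avg_sequence_length": n_tokens / n_sent,
        "coverage": 1.0 - n_unk / n_tokens,
    }
```

With a tokenizer loaded through `transformers`, passing `tokenize=tokenizer.tokenize` would supply the token strings; whether that reproduces the reported values depends on the evaluation set and definitions the authors actually used.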

Usage

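```python
from transformers import AutoTokenizer

# Load the tokenizer from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("NIRVLab/xnli-unigram-vi")
# return_tensors="pt" returns PyTorch tensors, so PyTorch must be installed.
tokens = tokenizer("Xin chào thế giới!", return_tensors="pt")
```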
