---
language: tr
license: apache-2.0
tags:
- tokenizer
- unigram
- turkish
- xnli
- nlp-research
datasets:
- facebook/xnli
---

# NIRVLab: Unigram Tokenizer for Turkish XNLI

A **Unigram Language Model** tokenizer trained from scratch on the Turkish (`tr`) subset of the [facebook/xnli](https://huggingface.co/datasets/facebook/xnli) dataset.

## Training Details

| Parameter | Value |
|---|---|
| Algorithm | Unigram LM (SentencePiece-style) |
| Vocabulary size | 8,000 |
| Special tokens | `, , , , ` |
| Corpus | `facebook/xnli` / `tr`, all splits |
| Corpus size | 800,404 sentences |
| Normalizer | Nmt + NFC Unicode |
| Pre-tokenizer | Metaspace (▁ prefix) |
| Shrinking factor | 0.75 |
| Max piece length | 16 |

## Evaluation Metrics

| Metric | Value |
|---|---|
| Tokens / char | `0.2597` |
| Fertility (tokens / word) | `1.9452` |
| Avg sequence length | `20.27` tokens |
| Vocabulary coverage | `1.0000` |

## Usage

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("NIRVLab/xnli-unigram-tr")
tokens = tokenizer("Merhaba dünya!", return_tensors="pt")
```
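
This card does not spell out how the evaluation metrics were computed. The sketch below shows one plausible reading, assuming that words are whitespace-separated, characters are counted on the raw text, and vocabulary coverage is the share of produced tokens that are not the unknown token. The function name `tokenizer_metrics`, the dictionary keys, and the default `unk_token` value are hypothetical and not taken from the original evaluation script.

```python
from typing import Callable, Dict, Iterable, List


def tokenizer_metrics(
    sentences: Iterable[str],
    tokenize: Callable[[str], List[str]],
    unk_token: str = "<unk>",  # assumed name; check the tokenizer's actual unknown token
) -> Dict[str, float]:
    """Compute corpus-level tokenizer statistics (illustrative definitions)."""
    n_sent = n_tokens = n_chars = n_words = n_unk = 0
    for text in sentences:
        pieces = tokenize(text)
        n_sent += 1
        n_tokens += len(pieces)
        n_chars += len(text)          # raw characters, including spaces
        n_words += len(text.split())  # whitespace-delimited words
        n_unk += sum(1 for piece in pieces if piece == unk_token)

    if n_sent == 0 or n_tokens == 0 or n_chars == 0 or n_words == 0:
        raise ValueError("corpus must contain at least one non-empty sentence")

    return {
        "tokens_per_char": n_tokens / n_chars,
        "fertility": n_tokens / n_words,           # tokens per word
        "avg_sequence_length": n_tokens / n_sent,  # tokens per sentence
        "vocab_coverage": 1.0 - n_unk / n_tokens,  # share of non-unknown tokens
    }


if __name__ == "__main__":
    # Stand-in whitespace tokenizer so the example runs without downloads.
    demo = ["Merhaba dünya!", "Bu bir deneme."]
    print(tokenizer_metrics(demo, str.split))
```

To evaluate the published tokenizer instead of the whitespace stand-in, pass `tokenizer.tokenize` from the `AutoTokenizer` loaded in the Usage section as the `tokenize` argument. The resulting numbers will only match the table if the original evaluation used the same definitions and the same corpus.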