---
license: mit
language:
- fas
tags:
- tokenizer
- unigram
- flexitok
- fineweb2
---

# UnigramLM Tokenizer: fas_Arab (32K)

A **UnigramLM** tokenizer trained on **fas_Arab** data from Fineweb-2-HQ.

## Training Details

| Parameter | Value |
|-----------|-------|
| Algorithm | UnigramLM |
| Language | `fas_Arab` |
| Target Vocab Size | 32,000 |
| Final Vocab Size | 0 |
| Pre-tokenizer | ByteLevel |
| Normalizer | NFC |
| Special Tokens | ``, ``, ``, `` |
| Training Shards | 2 |

## Usage

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("")
tokens = tokenizer.encode("Hello, world!")
```

## Files

- `tokenizer.json`: Full HuggingFace tokenizer
- `vocab.json`: Vocabulary mapping
- `tokenizer.model`: SentencePiece protobuf (if available)
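
## Notes on NFC and byte-level input

The Training Details table lists an NFC normalizer and a ByteLevel pre-tokenizer. The sketch below uses only the Python standard library to illustrate what those two steps imply for Persian text. It does not load this tokenizer. The helper names `nfc` and `utf8_byte_count` are hypothetical and chosen for illustration, and the byte count is only an approximation of what a byte-level pre-tokenizer sees, since implementations may add a prefix space or split on whitespace first.

```python
import unicodedata


def nfc(text: str) -> str:
    """Apply Unicode NFC normalization (canonical composition)."""
    return unicodedata.normalize("NFC", text)


def utf8_byte_count(text: str) -> int:
    """Count the UTF-8 bytes of the text, the base symbols of a byte-level scheme."""
    return len(text.encode("utf-8"))


# ALEF (U+0627) followed by combining MADDAH ABOVE (U+0653) is canonically
# equivalent to the single code point ALEF WITH MADDA ABOVE (U+0622).
decomposed = "\u0627\u0653"
composed = nfc(decomposed)

# Each letter of the Persian word for "hello" occupies two bytes in UTF-8.
word = "\u0633\u0644\u0627\u0645"  # سلام
byte_count = utf8_byte_count(word)
```

After NFC, the two-code-point sequence collapses to one code point, so visually identical inputs map to the same token sequence. The four-letter word occupies eight bytes, which is why a byte-level vocabulary relies on learned merges to represent Arabic-script text compactly.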