Setswana Clean BPE 52k (`setswana_clean_bpe_52k`)

A SentencePiece BPE tokenizer with a 52k vocabulary, trained on word-scrubbed Setswana text (`setswana_corpus_clean_words.txt`). No morphological lexicon surface forms were included in the training data.

Training

Algorithm: BPE (SentencePiece)
Vocab size: 52,000
character_coverage=1.0, byte_fallback=true
Corpus: ~585k lines after English/numeric token removal
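
The settings above map onto SentencePiece trainer options roughly as follows. This is a minimal sketch, not the exact training script: the helper names `build_training_args` and `train_tokenizer` and the `tokenizer` model prefix are assumptions for illustration, and any option not listed above is left at the SentencePiece default.

```python
def build_training_args(corpus_path, model_prefix="tokenizer"):
    """Collect the trainer options listed in the Training section.

    Hypothetical helper: only the four settings named in this card are
    set here; everything else stays at the SentencePiece default.
    """
    return {
        "input": corpus_path,          # one sentence per line, plain text
        "model_prefix": model_prefix,  # writes <prefix>.model and <prefix>.vocab
        "model_type": "bpe",
        "vocab_size": 52000,
        "character_coverage": 1.0,
        "byte_fallback": True,         # unseen characters fall back to byte pieces
    }


def train_tokenizer(corpus_path, model_prefix="tokenizer"):
    """Train a BPE model and return the path of the resulting model file."""
    # Imported here so the argument helper can be inspected without
    # sentencepiece installed.
    import sentencepiece as spm

    spm.SentencePieceTrainer.train(**build_training_args(corpus_path, model_prefix))
    return f"{model_prefix}.model"
```

Calling `train_tokenizer("setswana_corpus_clean_words.txt")` would then write `tokenizer.model` next to the script, assuming the cleaned corpus is available locally.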

Evaluation (held-out clean text, 50k lines)

Tokenizer	Fertility (tokens/word)
This tokenizer	1.181
Plain in-house BPE	1.200
PuoBERTa tokenizer	1.302
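
Fertility here is the average number of subword tokens per word. A minimal sketch of that calculation follows, assuming words are whitespace-delimited and that blank lines are skipped; the evaluation behind the figures above may have split words differently. The function name `fertility` and the `encode` callable are illustrative, not part of this repository.

```python
from typing import Callable, Iterable, List


def fertility(lines: Iterable[str], encode: Callable[[str], List[str]]) -> float:
    """Return total tokens divided by total whitespace-delimited words."""
    total_tokens = 0
    total_words = 0
    for line in lines:
        text = line.strip()
        words = text.split()
        if not words:
            continue  # skip blank lines so they do not affect the ratio
        total_words += len(words)
        total_tokens += len(encode(text))
    if total_words == 0:
        raise ValueError("no words found in input")
    return total_tokens / total_words
```

With a loaded `SentencePieceProcessor` named `sp`, the encoder can be passed as `lambda s: sp.encode(s, out_type=str)`, and `lines` can be any iterable of held-out sentences. Lower values mean fewer pieces per word.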

Quick use

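import sentencepiece as spm

# Load the trained SentencePiece model.
sp = spm.SentencePieceProcessor(model_file="tokenizer.model")

# Encode a sentence into subword pieces (strings rather than integer ids).
pieces = sp.encode("Dumela, o tsogile jang kajeno mo Gaborone?", out_type=str)
print(pieces)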

Citation

If you use this tokenizer, cite the accompanying paper (Setswana subword tokenization with corpus cleaning and SMTB evaluation).

Limitations

This is a tokenizer only, not a language model; pair it with `setswana-gpt2-small-clean-bpe-52k` for the matched LM probe.
The training corpus is in-house, and the full raw text is not redistributed here.
