Setswana Clean BPE 52k (setswana_clean_bpe_52k)
A SentencePiece BPE tokenizer with a 52k vocabulary, trained on word-scrubbed Setswana text (setswana_corpus_clean_words.txt). No morphological lexicon surface forms were included in the training data.
Training
- Algorithm: BPE (SentencePiece)
- Vocab size: 52,000
- Options: character_coverage=1.0, byte_fallback=true
- Corpus: ~585k lines after English/numeric token removal
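The settings listed above could be passed to the SentencePiece trainer roughly as follows. This is a minimal sketch, not the authors' training script: the function names `build_training_args` and `train_tokenizer` and the output prefix are hypothetical, and any trainer option not listed in this card is left at its SentencePiece default.

```python
def build_training_args(corpus_path, model_prefix):
    """Collect the trainer options listed in this card (hypothetical helper)."""
    return {
        "input": corpus_path,          # one sentence per line
        "model_prefix": model_prefix,  # writes <prefix>.model and <prefix>.vocab
        "model_type": "bpe",
        "vocab_size": 52000,
        "character_coverage": 1.0,
        "byte_fallback": True,
    }


def train_tokenizer(corpus_path, model_prefix):
    """Train a BPE model with the options above and return the model path."""
    # Imported lazily so the options can be inspected without sentencepiece installed.
    import sentencepiece as spm

    spm.SentencePieceTrainer.train(**build_training_args(corpus_path, model_prefix))
    return model_prefix + ".model"
```

Calling `train_tokenizer("setswana_corpus_clean_words.txt", "setswana_clean_bpe_52k")` would write the model next to the working directory, assuming the corpus file is available locally.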
Evaluation (held-out clean text, 50k lines)
| Tokenizer | Fertility (tokens/word) |
|---|---|
| This tokenizer (setswana_clean_bpe_52k) | 1.181 |
| Plain in-house BPE | 1.200 |
| PuoBERTa tokenizer | 1.302 |
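Fertility is the average number of subword tokens produced per word. The sketch below shows one way to compute it. It assumes words are whitespace-delimited, which may differ from the word segmentation used for the reported numbers. `fertility` and `toy_encode` are illustrative names, and `toy_encode` is a stand-in tokenizer so the example runs without a model file.

```python
def fertility(lines, encode):
    """Return total tokens divided by total whitespace-delimited words."""
    total_tokens = 0
    total_words = 0
    for line in lines:
        words = line.split()
        if not words:
            continue  # skip blank lines
        total_words += len(words)
        total_tokens += len(encode(line))
    if total_words == 0:
        raise ValueError("no words found in input")
    return total_tokens / total_words


def toy_encode(line, width=4):
    """Stand-in tokenizer: cut each word into chunks of at most `width` characters."""
    pieces = []
    for word in line.split():
        pieces.extend(word[i:i + width] for i in range(0, len(word), width))
    return pieces


# "dumela" splits into 2 chunks and "rra" into 1, so 3 tokens over 2 words.
score = fertility(["dumela rra", ""], toy_encode)
```

With a loaded SentencePiece model, `encode` could instead be `lambda s: sp.encode(s, out_type=str)`.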
Quick use
```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load("tokenizer.model")
pieces = sp.encode("Dumela, o tsogile jang kajeno mo Gaborone?", out_type=str)
print(pieces)
```
Citation
If you use this tokenizer, cite the accompanying paper (Setswana subword tokenization with corpus cleaning and SMTB evaluation).
Limitations
- Not a full language model; use with setswana-gpt2-small-clean-bpe-52k for the matched LM probe.
- Training corpus is in-house; full raw text is not redistributed here.