Setswana Clean BPE 52k (setswana_clean_bpe_52k)

A 52k-vocabulary SentencePiece BPE tokenizer trained on word-scrubbed Setswana text (setswana_corpus_clean_words.txt). No surface forms from a morphological lexicon were added to the training data.

Training

  • Algorithm: BPE (SentencePiece)
  • Vocab size: 52,000
  • character_coverage=1.0, byte_fallback=true
  • Corpus: ~585k lines after English/numeric token removal
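
The settings above map directly onto SentencePiece trainer options. The following is a minimal sketch of how a comparable tokenizer could be trained, not the exact command used for this model. The function names and the model_prefix value are hypothetical, and any trainer flags not listed above are left at their SentencePiece defaults, which is an assumption.

```python
from typing import Any, Dict


def build_training_args(corpus_path: str, model_prefix: str) -> Dict[str, Any]:
    """Collect the trainer options listed in the Training section."""
    return {
        "input": corpus_path,          # one sentence per line
        "model_prefix": model_prefix,  # writes <prefix>.model and <prefix>.vocab
        "model_type": "bpe",
        "vocab_size": 52000,
        "character_coverage": 1.0,
        "byte_fallback": True,         # unseen characters decompose into byte pieces
    }


def train_tokenizer(corpus_path: str, model_prefix: str) -> None:
    """Train a SentencePiece BPE model with the options above."""
    # Imported lazily so the argument builder can be inspected
    # without sentencepiece installed.
    import sentencepiece as spm

    spm.SentencePieceTrainer.train(**build_training_args(corpus_path, model_prefix))
```

Calling train_tokenizer("setswana_corpus_clean_words.txt", "tokenizer") would write tokenizer.model, matching the file name used under Quick use, provided the cleaned corpus is available locally.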

Evaluation (held-out clean text, 50k lines)

Metric                     Value
Fertility (tokens/word)    1.181
vs plain in-house BPE      1.200
vs PuoBERTa tokenizer      1.302
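
Fertility is the average number of subword tokens produced per word, so lower values mean less fragmentation. The sketch below shows one way to compute it. It assumes words are delimited by whitespace, which may differ from the word segmentation used for the reported figures, and the toy tokenizer is purely illustrative.

```python
from typing import Callable, Iterable, List


def fertility(lines: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Return total tokens divided by total whitespace-delimited words."""
    total_tokens = 0
    total_words = 0
    for line in lines:
        words = line.split()
        if not words:
            continue  # skip empty lines so they do not affect the ratio
        total_words += len(words)
        total_tokens += len(tokenize(line))
    if total_words == 0:
        raise ValueError("no words found in input")
    return total_tokens / total_words


def toy_tokenize(text: str) -> List[str]:
    """Illustrative stand-in: split each word into chunks of at most 4 characters."""
    pieces: List[str] = []
    for word in text.split():
        pieces.extend(word[i:i + 4] for i in range(0, len(word), 4))
    return pieces


print(fertility(["dumela rra", ""], toy_tokenize))
```

To measure the real tokenizer, pass a callable such as lambda s: sp.encode(s, out_type=str), where sp is a loaded SentencePieceProcessor, over the held-out lines.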

Quick use

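import sentencepiece as spm

# Load the SentencePiece model file.
sp = spm.SentencePieceProcessor()
sp.load("tokenizer.model")

# Encode a sentence into subword pieces (strings rather than integer ids).
pieces = sp.encode("Dumela, o tsogile jang kajeno mo Gaborone?", out_type=str)
print(pieces)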

Citation

If you use this tokenizer, cite the accompanying paper (Setswana subword tokenization with corpus cleaning and SMTB evaluation).

Limitations

  • This is a tokenizer only, not a language model; pair it with setswana-gpt2-small-clean-bpe-52k for the matched LM probe.
  • Training corpus is in-house; full raw text is not redistributed here.