🔤 Balochi Tokenizer — Balochi_SP_47K

Production tokenizer for Southern Balochi · SentencePiece Unigram · 47,000 vocabulary · AraToken-normalized · Entropy-pruned

Evaluated against 27 tokenizers across 6 research phases. Optimized for integration with the Gemma 2B Language Extension Pipeline (LEP).


Why This Tokenizer Exists

Standard multilingual tokenizers fragment Balochi text at roughly 2× the rate of this custom model. BERT Multilingual shatters the single word کپیٹلسٹک into 6 pieces; this tokenizer splits it into just 2.

| Tokenizer | کپیٹلسٹک (capitalistic) | Tokens |
|---|---|---|
| Balochi SP 47K | ['▁کپیٹل', 'سٹک'] | 2 |
| BERT Multilingual | ['ک', '##پی', '##ٹ', '##لس', '##ٹ', '##ک'] | 6 |
| Gemma 2B | ['ک', 'پی', 'ٹ', 'لس', 'ٹ', 'ک'] | 6 |

In a 4,096-token context window, this tokenizer fits ~3,593 Balochi words vs. ~1,665 for Gemma: just over 2× as much text per forward pass.


Latest Performance Metrics (47K Vocabulary)

| Tokenizer | Token Count | Compression (chars/token) | Fertility (tokens/word) | Roundtrip |
|---|---|---|---|---|
| Balochi SP 47K | 5,746 | 4.08 | 1.141 | Lossless |
| Balochi BPE 47K | 5,767 | 4.06 | 1.145 | Lossless |
| Balochi WP 47K | 5,912 | 3.96 | 1.174 | Lossless |
| Gemma 2B | 11,621 | 2.02 | 2.308 | Fragmented |
| BERT Multilingual | 10,991 | 2.13 | 2.183 | Fragmented |

Quick Start

import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load("balochi_sp_47k.model")

text = "باسک یک لبزانکی ءُ اِلمی دیوان انت ۔ پہ زبان ءُ لبزانک ءِ دیمروئ ءَ باسکءِ تند جاہ  سال 2005 ءَ بنگیج کنگ بیتگ ۔   اے جہدءِ بندات ءَ باسک ءَ  ھالتران ، سازءُ زیمل ءِ دیمروئ، دودءُ ربیدگ ءِ پجار ءِ سرھالانی سرءَ ھم کار کُتگ ۔ اے کاروان ءَ بازینے باسک اتکگ ءُ بازینے باسک شُتگ اَنت بلے ھرکس ءَ کہ اے دیوان ءَ کلہوکے ھم کارکُتگ آئ ءِ نام چَہ ادءَ گار نہ بیتگ۔"

tokens = sp.EncodeAsPieces(text)   # list of subword strings
ids    = sp.EncodeAsIds(text)      # list of vocabulary ids

# Roundtrip check. SentencePiece's default normalizer collapses repeated
# spaces, so compare against the whitespace-collapsed input (this assumes
# the model was trained with the default normalizer settings).
assert sp.Decode(ids) == " ".join(text.split())

# HuggingFace-compatible wrapper
from transformers import PreTrainedTokenizerFast
tokenizer = PreTrainedTokenizerFast(tokenizer_file="balochi_sp_47k.json")

Performance vs. Multilingual Baselines

Evaluated on liberal capitalism.txt (5,036 words of Southern Balochi political text):

| Tokenizer | Compression (chars/token) | Fertility (tokens/word) | UNK Rate | Continuation Rate | Roundtrip |
|---|---|---|---|---|---|
| Balochi SP 47K | ~4.13 | ~1.14 | 0.00% | ~12.4% | ✅ Lossless |
| Balochi SP 64K | 4.13 | 1.14 | 0.00% | 12.4% | ✅ Lossless |
| Balochi SP 128K | 4.27 | 1.10 | 0.00% | 8.5% | ✅ Lossless |
| BERT Multilingual | 2.25 | 1.97 | 0.44% | 43.7% | ✗ Lossy |
| Gemma 2B | 2.12 | 2.10 | 0.00% | 52.4% | ✗ Lossy |
| AraBERT v2 | 3.72 | 1.20 | 51.0% | 7.1% | ✗ Lossy |
| UrduBERT | 0.58 | 7.63 | 0.00% | 0.0% | ✗ Lossy |

Fertility = tokens per word (lower → better). 1.14 means nearly one token per Balochi word.
UrduBERT's fertility of 7.63 confirms script similarity ≠ tokenizer compatibility.
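
The metrics in the table can be sketched in a few lines. This is a minimal illustration, not the evaluation script used for the numbers above: it assumes words are whitespace-delimited, that compression counts characters per token, and that a "continuation" piece is any SentencePiece piece that does not begin with the word-start marker. The function names are hypothetical.

```python
from typing import List

WORD_START = "▁"  # SentencePiece word-start marker


def fertility(pieces: List[str], text: str) -> float:
    """Tokens per whitespace-delimited word (lower is better)."""
    return len(pieces) / len(text.split())


def compression(pieces: List[str], text: str) -> float:
    """Characters per token (higher is better)."""
    return len(text) / len(pieces)


def continuation_rate(pieces: List[str]) -> float:
    """Share of pieces that continue a word instead of starting one."""
    continuing = sum(1 for p in pieces if not p.startswith(WORD_START))
    return continuing / len(pieces)


def words_per_context(window: int, fert: float) -> int:
    """Approximate number of words that fit into `window` tokens."""
    return round(window / fert)


# The split reported for this tokenizer in the opening comparison.
word = "کپیٹلسٹک"
pieces = ["▁کپیٹل", "سٹک"]

print(fertility(pieces, word))         # 2 pieces for 1 word
print(continuation_rate(pieces))       # 1 of the 2 pieces is a continuation
print(words_per_context(4096, 1.14))   # context-window estimate for fertility 1.14
```

Under these definitions, a fertility of 1.14 implies a continuation rate of about (1.14 − 1) / 1.14 ≈ 12%, which is in line with the reported 12.4%.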


Why 47K and Not 64K?

Pruning from 64K was driven by entropy analysis on the full normalized corpus (54.3M tokens):

| Vocab Size | Shannon Entropy (H₁) | Effective Vocab (V_eff) | Efficiency (η) |
|---|---|---|---|
| 128,000 | 7.3924 | 168.0 | 0.13% |
| 64,000 | 7.4184 | 171.1 | 0.27% |
| 47,000 | 7.4295 | 172.4 | 0.37% |

Removing 17,000 noise tokens raises both entropy and effective vocabulary. Those tokens were diacritical variants and ZWNJ-split artifacts that only existed because the training corpus hadn't been normalized yet — they would never appear in clean text.
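
The three columns of the entropy table appear to be linked by V_eff = 2^H₁ and η = V_eff / V. That relationship is inferred from the reported numbers rather than stated explicitly, so the sketch below is an assumption about how the figures were derived. The function names are hypothetical.

```python
import math
from typing import Iterable


def shannon_entropy(counts: Iterable[int]) -> float:
    """Unigram Shannon entropy in bits, from raw token counts."""
    counts = [c for c in counts if c > 0]
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts)


def effective_vocab(entropy_bits: float) -> float:
    """Perplexity-style effective vocabulary: 2 ** H."""
    return 2.0 ** entropy_bits


def efficiency(entropy_bits: float, vocab_size: int) -> float:
    """Fraction of the nominal vocabulary that is 'effectively' used."""
    return effective_vocab(entropy_bits) / vocab_size


# Reproduce the 47K row from its reported entropy.
h_47k = 7.4295
print(round(effective_vocab(h_47k), 1))          # effective vocabulary
print(f"{efficiency(h_47k, 47_000):.2%}")        # efficiency
```

In practice `shannon_entropy` would be fed the token counts obtained by encoding the corpus with each candidate vocabulary.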

Cumulative frequency analysis showed that 99% of real Balochi text is covered by just 46,406 tokens. The 47K model rounds up slightly for safety.

| Coverage Target | Vocab Required | Tokens Pruned (from 64K) |
|---|---|---|
| 90% | 10,219 | 53,781 |
| 95% | 21,141 | 42,859 |
| 99% | 46,406 | 17,594 |
| 99.5% | 53,240 | 10,760 |
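
A cumulative-frequency cutoff of this kind can be computed as follows. This is a sketch under the assumption that coverage means the share of all token occurrences accounted for by the most frequent tokens; `vocab_for_coverage` is a hypothetical helper, and the toy counts are not from the corpus.

```python
from typing import Iterable


def vocab_for_coverage(freqs: Iterable[int], target: float) -> int:
    """Smallest number of most-frequent tokens whose counts reach `target`."""
    ordered = sorted(freqs, reverse=True)
    total = sum(ordered)
    running = 0
    for rank, count in enumerate(ordered, start=1):
        running += count
        if running / total >= target:
            return rank
    return len(ordered)


# Toy frequency list: 5 tokens, 100 occurrences in total.
toy_counts = [50, 30, 15, 4, 1]
print(vocab_for_coverage(toy_counts, 0.80))  # top 2 tokens cover 80%
print(vocab_for_coverage(toy_counts, 0.99))  # top 4 tokens cover 99%

# The pruned column is the difference from the 64K starting vocabulary.
print(64_000 - 46_406)
```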

For Gemma 2B LEP integration, this means ~17,000 fewer new embedding rows than the 64K vocabulary would add: a 26.6% reduction in added embedding rows (17,000 / 64,000) with no measurable coverage loss.


Normalization Pipeline

This tokenizer was trained on text pre-processed with a Balochi-adapted version of the AraToken normalization methodology:

| Rule | Characters | Note |
|---|---|---|
| NFKC Unicode normalization | All compatibility chars | Applied first |
| Diacritics removal | \u064B–\u065F | Inconsistently used in Balochi; causes duplicate embeddings |
| Alif variant unification | أ/إ/آ/ٱ → ا | Standardizes Arabic loanword stems |
| Tatweel removal | \u0640 | Decorative only; no linguistic content |
| Arabic punctuation | ؟→? · ؛→; · ،→, | Prevents duplicate punctuation tokens |
| Arabic-Indic numerals | ٠١٢…٩ → 0-9 | Eliminates numeral inconsistency |
| ZWNJ/ZWJ removal | \u200C, \u200D | Balochi-specific: causes spurious verb-compound splits |
| RLM/ALM removal | \u200F, \u061C | Balochi-specific: copy-paste artifacts in mixed-direction text |
| Ye variant preserved | ے vs ی | Inverted from AraToken: this distinction carries grammatical case in Southern Balochi |

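import re, unicodedata

def normalize_balochi(text: str, drop_diacritics: bool = True,
                      preserve_ye: bool = True) -> str:
    # Fold compatibility characters (e.g. presentation forms) first.
    text = unicodedata.normalize('NFKC', text)
    # Unify alif variants.
    text = re.sub(r'[أإآٱ]', 'ا', text)
    # Ye variants stay distinct unless the caller opts out.
    if not preserve_ye:
        text = text.replace('ے', 'ی')
    # Arabic-Indic digits -> ASCII digits.
    text = text.translate(str.maketrans('٠١٢٣٤٥٦٧٨٩', '0123456789'))
    # Arabic punctuation -> ASCII equivalents.
    text = text.replace('؟','?').replace('؛',';').replace('،',',')
    # Tatweel, ZWNJ/ZWJ, and RLM/ALM carry no linguistic content here.
    text = text.replace('\u0640', '')
    text = text.replace('\u200C','').replace('\u200D','')
    text = text.replace('\u200F','').replace('\u061C','')
    if drop_diacritics:
        # The lookbehind keeps a mark that directly follows hamza (ء),
        # so particles such as ءَ / ءِ / ءُ survive.
        text = re.sub(r'(?<!ء)[\u064B-\u065F\u0610-\u061A\u06D6-\u06DC]', '', text)
    # Collapse whitespace runs and trim the ends.
    return re.sub(r'\s+', ' ', text).strip()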
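
To illustrate why the invisible-character rules matter, the following self-contained sketch shows that two visually identical strings are different to a tokenizer until ZWNJ is stripped. The helper `strip_invisibles` is hypothetical and covers only the four format characters listed in the table; the example word is illustrative.

```python
# Code points removed by the ZWNJ/ZWJ and RLM/ALM rules.
INVISIBLES = dict.fromkeys(map(ord, "\u200c\u200d\u200f\u061c"))  # value None = delete


def strip_invisibles(text: str) -> str:
    """Delete zero-width joiners and directional marks."""
    return text.translate(INVISIBLES)


plain = "کارکتگ"
with_zwnj = "کار\u200cکتگ"  # renders almost the same, but has one extra code point

print(plain == with_zwnj)                    # False: distinct strings
print(len(plain), len(with_zwnj))            # lengths differ by one
print(strip_invisibles(with_zwnj) == plain)  # True after stripping
```

Without this step, each variant would compete for its own vocabulary entries during training.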

Training Corpus

These tokenizers were trained from scratch on a deduplicated dataset of ~54 million words.

| Source | Size | Content |
|---|---|---|
| balochi_dedup_corpus.txt | ~52M words | Deduplicated Balochi prose: literature, news, religious, conversational |
| balochi_clean_corpus_dictionary.txt | ~lexical | Curated dictionary entries and normalized grammatical forms |
| english_corpus_2M.txt | ~2M words | English text for code-switching coverage |
| Total | ~54M words | 185.6M characters |

The Mixed-Corpus Advantage: To keep the models from over-fragmenting English loanwords and technical terminology, the training data mixes ~2 million English words with the ~52 million Balochi words (about 3.7% of the total). This ratio is intended to let the tokenizers handle English text (e.g., "war industry", mathematical symbols, code) without taking many vocabulary slots away from core Balochi morphological roots.


Model Variants

| Model | Vocab | Best For |
|---|---|---|
| Balochi_SP_47K | 47,000 | Gemma/LLaMA CPT · optimal embedding efficiency · lossless roundtrip |
| Balochi_SP_64K | 64,000 | Resource-constrained production · compression/memory sweet spot |
| Balochi_SP_128K | 128,000 | Maximum sequence packing · best raw compression (4.26 chars/token) |
| Balochi_WP_47K | 47,000 | BERT / ALBERT fine-tuning · native WordPiece architecture · protects Latin alphabet and Arabic numerals natively |
| Balochi_BPE_47K | 47,000 | GPT-2 / RoBERTa pre-training · BPE-aligned training · relies on HF's fast tokenizers |

All SP models are trained with byte_fallback=True, giving a zero UNK rate on any Unicode input.
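
Byte fallback works by decomposing any character that is missing from the vocabulary into its UTF-8 bytes, each of which has a dedicated piece of the form `<0xNN>`. The sketch below mimics that decomposition in plain Python; it does not call SentencePiece, and the helper names are hypothetical.

```python
from typing import List


def byte_fallback_pieces(ch: str) -> List[str]:
    """Represent a character as SentencePiece-style UTF-8 byte pieces."""
    return [f"<0x{b:02X}>" for b in ch.encode("utf-8")]


def decode_byte_pieces(pieces: List[str]) -> str:
    """Invert byte_fallback_pieces: '<0xE2>' -> 0xE2 -> UTF-8 text."""
    return bytes(int(p[3:-1], 16) for p in pieces).decode("utf-8")


pieces = byte_fallback_pieces("€")  # U+20AC, three bytes in UTF-8
print(pieces)
print(decode_byte_pieces(pieces))
```

Because every byte value has a piece, no input needs to map to `<unk>`, at the cost of several tokens per rare character.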


💻 Usage Code for Tokenizer Architectures

1. WordPiece Tokenizer (BERT-style)

from tokenizers import Tokenizer

# Load the custom WordPiece tokenizer
wp_tokenizer = Tokenizer.from_file("Balochi_WP_47K.json")

text = "باسک یک لبزانکی ءُ اِلمی دیوان انت ۔ پہ زبان ءُ لبزانک ءِ دیمروئ ءَ باسکءِ تند جاہ  سال 2005 ءَ بنگیج کنگ بیتگ ۔   اے جہدءِ بندات ءَ باسک ءَ  ھالتران ، سازءُ زیمل ءِ دیمروئ، دودءُ ربیدگ ءِ پجار ءِ سرھالانی سرءَ ھم کار کُتگ ۔ اے کاروان ءَ بازینے باسک اتکگ ءُ بازینے باسک شُتگ اَنت بلے ھرکس ءَ کہ اے دیوان ءَ کلہوکے ھم کارکُتگ آئ ءِ نام چَہ ادءَ گار نہ بیتگ۔"

encoded = wp_tokenizer.encode(text)

print("WordPiece Tokens:")
print(encoded.tokens)

2. SentencePiece Tokenizer (Gemma / LLaMA-style)

import sentencepiece as spm

# Load the custom SentencePiece model
spm_model = spm.SentencePieceProcessor()
spm_model.load("balochi_sp_47k.model")

text = "باسک یک لبزانکی ءُ اِلمی دیوان انت ۔ پہ زبان ءُ لبزانک ءِ دیمروئ ءَ باسکءِ تند جاہ  سال 2005 ءَ بنگیج کنگ بیتگ ۔   اے جہدءِ بندات ءَ باسک ءَ  ھالتران ، سازءُ زیمل ءِ دیمروئ، دودءُ ربیدگ ءِ پجار ءِ سرھالانی سرءَ ھم کار کُتگ ۔ اے کاروان ءَ بازینے باسک اتکگ ءُ بازینے باسک شُتگ اَنت بلے ھرکس ءَ کہ اے دیوان ءَ کلہوکے ھم کارکُتگ آئ ءِ نام چَہ ادءَ گار نہ بیتگ۔"

tokens = spm_model.encode_as_pieces(text)

print("SentencePiece Tokens:")
print(tokens)

3. HF BPE Tokenizer (RoBERTa / GPT-2-style)

from tokenizers import Tokenizer

# Load the custom Hugging Face BPE tokenizer
bpe_tokenizer = Tokenizer.from_file("Balochi_BPE_47K.json")

text = "باسک یک لبزانکی ءُ اِلمی دیوان انت ۔ پہ زبان ءُ لبزانک ءِ دیمروئ ءَ باسکءِ تند جاہ  سال 2005 ءَ بنگیج کنگ بیتگ ۔   اے جہدءِ بندات ءَ باسک ءَ  ھالتران ، سازءُ زیمل ءِ دیمروئ، دودءُ ربیدگ ءِ پجار ءِ سرھالانی سرءَ ھم کار کُتگ ۔ اے کاروان ءَ بازینے باسک اتکگ ءُ بازینے باسک شُتگ اَنت بلے ھرکس ءَ کہ اے دیوان ءَ کلہوکے ھم کارکُتگ آئ ءِ نام چَہ ادءَ گار نہ بیتگ۔"

encoded = bpe_tokenizer.encode(text)

print("HF BPE Tokens:")
print(encoded.tokens)

Training Configuration

| Parameter | Value |
|---|---|
| Algorithm | SentencePiece Unigram |
| Implementation | Google sentencepiece |
| Vocabulary size | 47,000 |
| Character coverage | 0.9995 |
| Byte fallback | True |
| Space handling | ▁ word-start marker |
| Special tokens | <pad>, <unk>, <s>, </s> |
| Roundtrip fidelity | ✅ Lossless |
| Normalization | AraToken-adapted (Balochi-specific rules) |
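
The configuration above maps onto `sentencepiece` trainer options roughly as follows. This is a sketch, not the actual training command: the file names and the special-token id ordering are assumptions, and since the 47K vocabulary is described as pruned from 64K, training directly at 47,000 only approximates the published model.

```python
from typing import Any, Dict


def build_trainer_args(corpus_path: str, model_prefix: str) -> Dict[str, Any]:
    """Collect SentencePiece trainer options matching the table above."""
    return dict(
        input=corpus_path,            # pre-normalized corpus (hypothetical path)
        model_prefix=model_prefix,    # writes <prefix>.model and <prefix>.vocab
        model_type="unigram",
        vocab_size=47_000,
        character_coverage=0.9995,
        byte_fallback=True,
        # Assumed id layout for <pad>, <unk>, <s>, </s>; the source does not
        # state the ordering.
        pad_id=0, unk_id=1, bos_id=2, eos_id=3,
    )


def train(args: Dict[str, Any]) -> None:
    """Run the trainer; requires the `sentencepiece` package and a corpus."""
    import sentencepiece as spm
    spm.SentencePieceTrainer.train(**args)


args = build_trainer_args("balochi_normalized.txt", "balochi_sp_47k")
print(args["model_type"], args["vocab_size"])
# train(args)  # uncomment once the corpus file exists
```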

Intended Uses & Limitations

Suitable for:

  • Continual pre-training (CPT) of LLMs on Southern Balochi text
  • BERT-style masked language modeling (use Balochi_WP_47K for WordPiece compatibility)
  • Named entity recognition, text classification, sentiment analysis on Balochi
  • Gemma 2B Language Extension Pipeline (LEP) via vocabulary extension + mean subtoken initialization
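
Mean subtoken initialization, mentioned in the last item above, sets the embedding of each newly added token to the average of the embeddings of the pieces the base tokenizer would have split it into. The sketch below uses plain Python lists and toy values; the function name and the data are hypothetical, and a real pipeline would operate on the model's embedding matrix.

```python
from typing import List, Sequence


def mean_subtoken_init(subtoken_ids: Sequence[int],
                       base_embeddings: Sequence[Sequence[float]]) -> List[float]:
    """Average the base-model embeddings of a new token's subtokens."""
    rows = [base_embeddings[i] for i in subtoken_ids]
    dim = len(rows[0])
    return [sum(row[d] for row in rows) / len(rows) for d in range(dim)]


# Toy base embedding table: 3 tokens, 2 dimensions.
base = [
    [0.0, 2.0],
    [2.0, 4.0],
    [4.0, 0.0],
]

# Suppose the base tokenizer splits a new Balochi token into ids 0 and 1.
new_row = mean_subtoken_init([0, 1], base)
print(new_row)
```

Starting from the mean places the new row inside the region the model already uses, which is generally gentler than random initialization.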

Limitations:

  • Optimized for Southern Balochi; Northern, Eastern, and Makrani dialect performance is untested
  • Evaluated on political/economic text only — cross-domain generalization (literary, social media, technical) requires validation
  • No downstream task benchmarking (NER F1, MLM perplexity) has been performed yet
  • English coverage is functional but not optimized; a bilingual Bal-Eng tokenizer is planned
