🔤 Balochi Tokenizer — `Balochi_SP_47K`

Production tokenizer for Southern Balochi · SentencePiece Unigram · 47,000 vocabulary · AraToken-normalized · Entropy-pruned Evaluated against 27 tokenizers across 6 research phases. Optimized for integration with the Gemma 2B Language Extension Pipeline (LEP).

Why This Tokenizer Exists

Standard multilingual tokenizers fragment Balochi text at 2× the rate of this custom model. BERT shatters the single word کپیٹلسٹک into 6 pieces; this tokenizer keeps it whole.

Tokenizer	`کپیٹلسٹک` (capitalistic)	Tokens
Balochi SP 47K	`['▁کپیٹل', 'سٹک']`	2 ✅
BERT Multilingual	`['ک', '##پی', '##ٹ', '##لس', '##ٹ', '##ک']`	6 ❌
Gemma 2B	`['ک', 'پی', 'ٹ', 'لس', 'ٹ', 'ک']`	6 ❌

In a 4,096-token context window, this tokenizer fits ~3,593 Balochi words vs. ~1,665 for Gemma — nearly 2× more semantic content per forward pass.

Latest Performance Metrics (47K Vocabulary)

Tokenizer	Token Count	Compression	Fertility	Roundtrip
Balochi SP 47K	5,746	4.08	1.141	Lossless
Balochi BPE 47K	5,767	4.06	1.145	Lossless
Balochi WP 47K	5,912	3.96	1.174	Lossless
Gemma 2B	11,621	2.02	2.308	Fragmented
BERT Multilingual	10,991	2.13	2.183	Fragmented

Quick Start

import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load("balochi_sp_47k.model")

text =
 "
باسک یک لبزانکی ءُ اِلمی دیوان انت ۔ پہ زبان ءُ لبزانک ءِ دیمروئ ءَ باسکءِ تند جاہ  سال 2005 ءَ بنگیج کنگ بیتگ ۔   اے جہدءِ بندات ءَ باسک ءَ  ھالتران ، سازءُ زیمل ءِ دیمروئ، دودءُ ربیدگ ءِ پجار ءِ سرھالانی سرءَ ھم کار کُتگ ۔ اے کاروان ءَ بازینے باسک اتکگ ءُ بازینے باسک شُتگ اَنت بلے ھرکس ءَ کہ اے دیوان ءَ کلہوکے ھم کارکُتگ آئ ءِ نام چَہ ادءَ گار نہ بیتگ۔
"
tokens = sp.EncodeAsPieces(text)
ids    = sp.EncodeAsIds(text)

# Lossless roundtrip (unique to SentencePiece family)
assert sp.Decode(ids) == text

# HuggingFace-compatible wrapper
from transformers import PreTrainedTokenizerFast
tokenizer = PreTrainedTokenizerFast(tokenizer_file="balochi_sp_47k.json")

Performance vs. Multilingual Baselines

Evaluated on liberal capitalism.txt (5,036 words of Southern Balochi political text):

Tokenizer	Compression	Fertility	UNK Rate	Continuation Rate	Roundtrip
Balochi SP 47K	~4.13	~1.14	0.00%	~12.4%	✅ Lossless
Balochi SP 64K	4.13	1.14	0.00%	12.4%	✅ Lossless
Balochi SP 128K	4.27	1.10	0.00%	8.5%	✅ Lossless
BERT Multilingual	2.25	1.97	0.44%	43.7%	✗ Lossy
Gemma 2B	2.12	2.10	0.00%	52.4%	✗ Lossy
AraBERT v2	3.72	1.20	51.0%	7.1%	✗ Lossy
UrduBERT	0.58	7.63	0.00%	0.0%	✗ Lossy

Fertility = tokens per word (lower → better). 1.14 means nearly one token per Balochi word.
UrduBERT's fertility of 7.63 confirms script similarity ≠ tokenizer compatibility.

Why 47K and Not 64K?

Pruning from 64K was driven by entropy analysis on the full normalized corpus (54.3M tokens):

Vocab Size	Shannon Entropy (H₁)	Effective Vocab (V_eff)	Efficiency (η)
128,000	7.3924	168.0	0.13%
64,000	7.4184	171.1	0.27%
47,000	7.4295	172.4	0.37% ✅

Removing 17,000 noise tokens raises both entropy and effective vocabulary. Those tokens were diacritical variants and ZWNJ-split artifacts that only existed because the training corpus hadn't been normalized yet — they would never appear in clean text.

Cumulative frequency analysis showed that 99% of real Balochi text is covered by just 46,406 tokens. The 47K model rounds up slightly for safety.

Coverage Target	Vocab Required	Tokens Pruned
90%	10,219	53,781
95%	21,141	42,859
99%	46,406	17,594
99.5%	53,240	10,760

For Gemma 2B LEP integration, this means ~17,000 fewer new embedding rows — a 26.6% reduction in embedding layer dimensions with no measurable coverage loss.

Normalization Pipeline

This tokenizer was trained on text pre-processed with a Balochi-adapted version of the AraToken normalization methodology:

Rule	Characters	Note
NFKC Unicode normalization	All compatibility chars	Applied first
Diacritics removal	`\u064B`–`\u065F`	Inconsistently used in Balochi; causes duplicate embeddings
Alif variant unification	أ/إ/آ/ٱ → ا	Standardizes Arabic loanword stems
Tatweel removal	`\u0640`	Decorative only; no linguistic content
Arabic punctuation	؟→? · ؛→; · ،→,	Prevents duplicate punctuation tokens
Arabic-Indic numerals	٠١٢…٩ → 0-9	Eliminates numeral inconsistency
ZWNJ/ZWJ removal	`\u200C`, `\u200D`	Balochi-specific — causes spurious verb-compound splits
RLM/ALM removal	`\u200F`, `\u061C`	Balochi-specific — copy-paste artifacts in mixed-direction text
Ye variant preserved	`ے` vs `ی`	Inverted from AraToken — this distinction carries grammatical case in Southern Balochi

import re, unicodedata

def normalize_balochi(text: str, drop_diacritics: bool = True,
                      preserve_ye: bool = True) -> str:
    text = unicodedata.normalize('NFKC', text)
    text = re.sub(r'[أإآٱ]', 'ا', text)
    if not preserve_ye:
        text = text.replace('ے', 'ی')
    text = text.translate(str.maketrans('٠١٢٣٤٥٦٧٨٩', '0123456789'))
    text = text.replace('؟','?').replace('؛',';').replace('،',',')
    text = text.replace('\u0640', '')
    text = text.replace('\u200C','').replace('\u200D','')
    text = text.replace('\u200F','').replace('\u061C','')
    if drop_diacritics:
        text = re.sub(r'(?<!ء)[\u064B-\u065F\u0610-\u061A\u06D6-\u06DC]', '', text)
    return re.sub(r'\s+', ' ', text).strip()

Training Corpus

These tokenizers were trained from scratch on a massive, deduplicated dataset comprising ~54 Million words.

Source	Size	Content
`balochi_dedup_corpus.txt`	~52M words	Deduplicated Balochi prose: literature, news, religious, conversational
`balochi_clean_corpus_dictionary.txt`	~lexical	Curated dictionary entries and normalized grammatical forms
`english_corpus_2M.txt`	~2M words	English text for code-switching coverage
Total	~54M words	185.6M characters

The Mixed-Corpus Advantage: To prevent the models from destroying or over-fragmenting English loan-words or technical terminology, the training data strategically includes ~2 million English words seamlessly mixed with the ~52 million Balochi words. This highly optimized ratio ensures that the tokenizers can flawlessly process English text (e.g., "war industry", mathematical symbols, code) without sacrificing the vocabulary slots dedicated strictly to core Balochi morphological roots.

Model Variants

Model	Vocab	Best For
⭐ `Balochi_SP_47K`	47,000	Gemma/LLaMA CPT · optimal embedding efficiency · lossless roundtrip
`Balochi_SP_64K`	64,000	Resource-constrained production · compression/memory sweet spot
`Balochi_SP_128K`	128,000	Maximum sequence packing · best raw compression (4.26 chars/token)
`Balochi_WP_47K`	47,000	BERT / ALBERT fine-tuning · native WordPiece architecture · protects Latin alphabet and Arabic numerals natively
`Balochi_BPE_47K`	47,000	GPT-2 / RoBERTa pre-training · BPE-aligned training · relies on HF's fast tokenizers

All SP models include byte_fallback=True — zero UNK rate on any Unicode input.

💻 Usage Code for Tokenizer Architectures

1. WordPiece Tokenizer (BERT-style)

from tokenizers import Tokenizer

# Load the custom WordPiece tokenizer
wp_tokenizer = Tokenizer.from_file("Balochi_WP_47K.json")

text =
 "
باسک یک لبزانکی ءُ اِلمی دیوان انت ۔ پہ زبان ءُ لبزانک ءِ دیمروئ ءَ باسکءِ تند جاہ  سال 2005 ءَ بنگیج کنگ بیتگ ۔   اے جہدءِ بندات ءَ باسک ءَ  ھالتران ، سازءُ زیمل ءِ دیمروئ، دودءُ ربیدگ ءِ پجار ءِ سرھالانی سرءَ ھم کار کُتگ ۔ اے کاروان ءَ بازینے باسک اتکگ ءُ بازینے باسک شُتگ اَنت بلے ھرکس ءَ کہ اے دیوان ءَ کلہوکے ھم کارکُتگ آئ ءِ نام چَہ ادءَ گار نہ بیتگ۔
"
encoded = wp_tokenizer.encode(text)

print("WordPiece Tokens:")
print(encoded.tokens)

2. SentencePiece Tokenizer (Gemma / LLaMA-style)

import sentencepiece as spm

# Load the custom SentencePiece model
spm_model = spm.SentencePieceProcessor()
spm_model.load("balochi_sp_47k.model")

text =
 "
باسک یک لبزانکی ءُ اِلمی دیوان انت ۔ پہ زبان ءُ لبزانک ءِ دیمروئ ءَ باسکءِ تند جاہ  سال 2005 ءَ بنگیج کنگ بیتگ ۔   اے جہدءِ بندات ءَ باسک ءَ  ھالتران ، سازءُ زیمل ءِ دیمروئ، دودءُ ربیدگ ءِ پجار ءِ سرھالانی سرءَ ھم کار کُتگ ۔ اے کاروان ءَ بازینے باسک اتکگ ءُ بازینے باسک شُتگ اَنت بلے ھرکس ءَ کہ اے دیوان ءَ کلہوکے ھم کارکُتگ آئ ءِ نام چَہ ادءَ گار نہ بیتگ۔
"
tokens = spm_model.encode_as_pieces(text)

print("SentencePiece Tokens:")
print(tokens)

3. HF BPE Tokenizer (RoBERTa / GPT-2-style)

from tokenizers import Tokenizer

# Load the custom Hugging Face BPE tokenizer
bpe_tokenizer = Tokenizer.from_file("Balochi_BPE_47K.json")

text =
 "
باسک یک لبزانکی ءُ اِلمی دیوان انت ۔ پہ زبان ءُ لبزانک ءِ دیمروئ ءَ باسکءِ تند جاہ  سال 2005 ءَ بنگیج کنگ بیتگ ۔   اے جہدءِ بندات ءَ باسک ءَ  ھالتران ، سازءُ زیمل ءِ دیمروئ، دودءُ ربیدگ ءِ پجار ءِ سرھالانی سرءَ ھم کار کُتگ ۔ اے کاروان ءَ بازینے باسک اتکگ ءُ بازینے باسک شُتگ اَنت بلے ھرکس ءَ کہ اے دیوان ءَ کلہوکے ھم کارکُتگ آئ ءِ نام چَہ ادءَ گار نہ بیتگ۔
"
encoded = bpe_tokenizer.encode(text)

print("HF BPE Tokens:")
print(encoded.tokens)

Training Configuration

Parameter	Value
Algorithm	SentencePiece Unigram
Implementation	Google `sentencepiece`
Vocabulary size	47,000
Character coverage	`0.9995`
Byte fallback	`True`
Space handling	`▁` word-start marker
Special tokens	`<pad>`, `<unk>`, `<s>`, `</s>`
Roundtrip fidelity	✅ Lossless
Normalization	AraToken-adapted (Balochi-specific rules)

Intended Uses & Limitations

Suitable for:

Continual pre-training (CPT) of LLMs on Southern Balochi text
BERT-style masked language modeling (use Balochi_WP_47K for WordPiece compatibility)
Named entity recognition, text classification, sentiment analysis on Balochi
Gemma 2B Language Extension Pipeline (LEP) via vocabulary extension + mean subtoken initialization

Limitations:

Optimized for Southern Balochi; Northern, Eastern, and Makrani dialect performance is untested
Evaluated on political/economic text only — cross-domain generalization (literary, social media, technical) requires validation
No downstream task benchmarking (NER F1, MLM perplexity) has been performed yet
English coverage is functional but not optimized; a bilingual Bal-Eng tokenizer is planned

Downloads last month: -; Downloads are not tracked for this model. How to track

🔤 Balochi Tokenizer — Balochi_SP_47K