---
tags:
- kabyle
- tokenizer
- sentencepiece
- low-resource
language: kab
license: apache-2.0
---

# Kabyle Tokenizer for T5

A SentencePiece tokenizer trained on 787,648 Kabyle sentences from Tatoeba, designed for T5-style models.

## Vocabulary

- Size: 32,000 tokens
- Type: BPE (Byte Pair Encoding)
- Character coverage: 99.99%

## Special Tokens

- Unknown token
- Padding token
- End of sequence
- Beginning of sequence
- Mask token (for T5)

## Usage

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("boffire/kabyle-tokenizer-T5")
tokens = tokenizer.tokenize("Aqcic-nni yeɣra adlis.")
print(tokens)
# ['▁Aqcic', '-', 'nni', '▁yeɣra', '▁adlis', '.']
```

## Training Data

- Source: Tatoeba Kabyle corpus
- Sentences: 787,648
- Cleaning: Greek/Cyrillic contamination removed (ε→ɛ, γ→ɣ, Σ→Ɛ, Γ→Ɣ, Ԑ→Ɛ, ԑ→ɛ)

## Comparison with T5-original

| Phrase | T5-original tokens | Kabyle-SPM tokens |
|--------|-------------------|-------------------|
| Aqcic-nni yeɣra adlis. | 19 | 6 |
| Tettmeslayeḍ taqbaylit? | 18 | 4 |
| Ur zmireɣ ara ad qqimeɣ argaz-a. | ~20 | 10 |

## Kabyle Characters Preserved

- ɛ / Ɛ (open e)
- ɣ / Ɣ (gamma)
- č / Č (c with caron)
- ǧ / Ǧ (g with caron)
- ḍ / Ḍ (d with dot below)
- ḥ / Ḥ (h with dot below)
- ṛ / Ṛ (r with dot below)
- ṣ / Ṣ (s with dot below)
- ṭ / Ṭ (t with dot below)
- ẓ / Ẓ (z with dot below)

## Limitations

- Optimized for short sentences (Tatoeba style)
- May split rare compound words (e.g., "tebirt" → "teb" + "irt")
- Requires a T5 model with resized embeddings for full compatibility
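
## Example: Character Normalization

The cleaning step listed under Training Data replaces Greek and Cyrillic lookalikes with the Latin letters used in Kabyle. The sketch below shows one way to apply the same six replacements to input text before tokenizing it. The original cleaning script is not part of this card, so the names `LOOKALIKES` and `normalize_kabyle` are hypothetical and the approach (a plain character translation table) is an assumption.

```python
# Hypothetical helper: the mapping mirrors the six replacements listed
# under Training Data; the actual cleaning script may differ.
LOOKALIKES = {
    "\u03b5": "\u025b",  # Greek small epsilon ε    -> Latin small open e ɛ
    "\u03b3": "\u0263",  # Greek small gamma γ      -> Latin small gamma ɣ
    "\u03a3": "\u0190",  # Greek capital sigma Σ    -> Latin capital open E Ɛ
    "\u0393": "\u0194",  # Greek capital gamma Γ    -> Latin capital gamma Ɣ
    "\u0510": "\u0190",  # Cyrillic capital Ԑ       -> Latin capital open E Ɛ
    "\u0511": "\u025b",  # Cyrillic small ԑ         -> Latin small open e ɛ
}

# Build the translation table once so repeated calls stay cheap.
_TABLE = str.maketrans(LOOKALIKES)


def normalize_kabyle(text: str) -> str:
    """Replace Greek/Cyrillic lookalikes with their Latin counterparts."""
    return text.translate(_TABLE)


# The input uses a Greek gamma (U+03B3); the result uses Latin gamma (U+0263).
print(normalize_kabyle("Aqcic-nni ye\u03b3ra adlis."))
```

Running text through such a function before calling `tokenizer.tokenize` keeps lookalike characters from being split into rare or unknown pieces, since the training corpus contains only the Latin forms.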