---
tags:
- kabyle
- tokenizer
- sentencepiece
- low-resource
language: kab
license: apache-2.0
---
# Kabyle Tokenizer for T5
A SentencePiece tokenizer trained on 787,648 Kabyle sentences from Tatoeba and designed for T5-style models.
## Vocabulary
- Size: 32,000 tokens
- Type: BPE (Byte Pair Encoding)
- Character coverage: 99.99%
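The settings above map onto SentencePiece trainer options. The following is a minimal sketch of how comparable settings could be passed, not the command actually used for this tokenizer: the function names, corpus path, and model prefix are hypothetical, and special-token options are left out because their configuration is not documented here.

```python
def build_training_args(corpus_path: str, model_prefix: str) -> dict:
    """Collect SentencePiece trainer options matching the settings listed above."""
    return {
        "input": corpus_path,          # hypothetical path: one sentence per line
        "model_prefix": model_prefix,  # hypothetical output name
        "vocab_size": 32000,
        "model_type": "bpe",
        "character_coverage": 0.9999,  # 99.99%
    }


def train(corpus_path: str, model_prefix: str) -> None:
    """Train a model with the options above (requires the sentencepiece package)."""
    import sentencepiece as spm  # imported lazily so the helper above has no dependency

    spm.SentencePieceTrainer.train(**build_training_args(corpus_path, model_prefix))
```

Calling `train("kabyle.txt", "kabyle_spm")` would write `kabyle_spm.model` and `kabyle_spm.vocab` to the working directory.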
## Special Tokens
- `<unk>`: Unknown token
- `<pad>`: Padding token
- `</s>`: End of sequence
- `<s>`: Beginning of sequence
- `<mask>`: Mask token (for T5)
## Usage
```python
from transformers import AutoTokenizer
# Load the tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("boffire/kabyle-tokenizer-T5")
# Split a Kabyle sentence into subword tokens
tokens = tokenizer.tokenize("Aqcic-nni yeɣra adlis.")
print(tokens)  # ['▁Aqcic', '-', 'nni', '▁yeɣra', '▁adlis', '.']
```
## Training Data
- Source: Tatoeba Kabyle corpus
- Sentences: 787,648
- Cleaning: Greek and Cyrillic look-alike characters replaced with their Latin equivalents (ε→ɛ, γ→ɣ, Σ→Ɛ, Γ→Ɣ, Ԑ→Ɛ, ԑ→ɛ)
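The character replacements listed under Cleaning can be expressed as a translation table. This is a minimal sketch of that mapping, not the original cleaning script. The names `CONFUSABLES` and `normalize_kabyle` are hypothetical, and code points are written as escapes so the look-alike characters cannot be confused in the source.

```python
# Look-alike characters mapped to the Latin letters used in Kabyle.
CONFUSABLES = {
    "\u03b5": "\u025b",  # Greek small epsilon    -> Latin small open e (ɛ)
    "\u03b3": "\u0263",  # Greek small gamma      -> Latin small gamma (ɣ)
    "\u03a3": "\u0190",  # Greek capital sigma    -> Latin capital open e (Ɛ)
    "\u0393": "\u0194",  # Greek capital gamma    -> Latin capital gamma (Ɣ)
    "\u0510": "\u0190",  # Cyrillic capital Ԑ     -> Latin capital open e (Ɛ)
    "\u0511": "\u025b",  # Cyrillic small ԑ       -> Latin small open e (ɛ)
}

_TABLE = str.maketrans(CONFUSABLES)


def normalize_kabyle(text: str) -> str:
    """Replace Greek/Cyrillic look-alikes with their Latin equivalents."""
    return text.translate(_TABLE)
```

Applying the same function to text before tokenization keeps input consistent with the cleaned training data.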
## Comparison with T5-original
| Phrase | T5-original (token count) | Kabyle-SPM (token count) |
|--------|---------------------------|--------------------------|
| Aqcic-nni yeɣra adlis. | 19 | 6 |
| Tettmeslayeḍ taqbaylit? | 18 | 4 |
| Ur zmireɣ ara ad qqimeɣ argaz-a. | ~20 | 10 |
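To put the table in terms of a reduction factor, the counts can be divided row by row. The sketch below only reuses the numbers from the table. The third T5 count is approximate in the source (~20), so its ratio is approximate as well. The variable names are illustrative.

```python
# (phrase, T5-original token count, Kabyle-SPM token count), copied from the table
rows = [
    ("Aqcic-nni yeɣra adlis.", 19, 6),
    ("Tettmeslayeḍ taqbaylit?", 18, 4),
    ("Ur zmireɣ ara ad qqimeɣ argaz-a.", 20, 10),  # T5 count is approximate (~20)
]

# Ratio of T5-original tokens to Kabyle-SPM tokens for each phrase
ratios = [t5_count / kab_count for _, t5_count, kab_count in rows]

for (phrase, _, _), ratio in zip(rows, ratios):
    print(f"{phrase} -> {ratio:.2f}x fewer tokens")
```

On these three phrases the Kabyle tokenizer produces between roughly 2 and 4.5 times fewer tokens.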
## Kabyle Characters Preserved
- ɛ / Ɛ (open e)
- ɣ / Ɣ (Latin gamma)
- č / Č (c with caron)
- ǧ / Ǧ (g with caron)
- ḍ / Ḍ (d with dot below)
- ḥ / Ḥ (h with dot below)
- ṛ / Ṛ (r with dot below)
- ṣ / Ṣ (s with dot below)
- ṭ / Ṭ (t with dot below)
- ẓ / Ẓ (z with dot below)
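The caron and dot-below letters in this list can also arrive as a base letter plus a combining mark, which looks identical on screen but is a different code point sequence. The sketch below composes such input to single code points with Unicode NFC using the standard library. Whether this tokenizer applies any normalization itself is not stated here, so treat this as an optional pre-processing assumption. The helper name `to_nfc` is hypothetical.

```python
import unicodedata

# Decomposed forms (base letter + combining mark) of letters from the list above
DECOMPOSED = [
    "c\u030c",  # c + combining caron     -> č
    "g\u030c",  # g + combining caron     -> ǧ
    "d\u0323",  # d + combining dot below -> ḍ
    "h\u0323",  # h + combining dot below -> ḥ
    "r\u0323",  # r + combining dot below -> ṛ
    "s\u0323",  # s + combining dot below -> ṣ
    "t\u0323",  # t + combining dot below -> ṭ
    "z\u0323",  # z + combining dot below -> ẓ
]


def to_nfc(text: str) -> str:
    """Compose base letters and combining marks into precomposed characters."""
    return unicodedata.normalize("NFC", text)


for form in DECOMPOSED:
    composed = to_nfc(form)
    print(f"{len(form)} code points -> {len(composed)} code point: {composed}")
```

The letters ɛ and ɣ have no decomposed form, so NFC leaves them unchanged.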
## Limitations
- Optimized for short sentences (Tatoeba style)
- May split rare compound words (e.g., "tebirt" → "teb" + "irt")
- Requires a T5 model with resized embeddings for full compatibility
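For the last point, Hugging Face models expose `resize_token_embeddings`, which accepts the new vocabulary size. The helper below is a minimal sketch. The function name is hypothetical, and the `t5-small` checkpoint in the usage comment is an assumption, not a recommendation from this card.

```python
def resize_embeddings_for(model, tokenizer) -> int:
    """Resize the model's token embeddings to the tokenizer's vocabulary size."""
    new_size = len(tokenizer)
    model.resize_token_embeddings(new_size)
    return new_size


# Hypothetical usage (requires transformers and a T5 checkpoint of your choice):
# from transformers import AutoTokenizer, T5ForConditionalGeneration
# tokenizer = AutoTokenizer.from_pretrained("boffire/kabyle-tokenizer-T5")
# model = T5ForConditionalGeneration.from_pretrained("t5-small")
# resize_embeddings_for(model, tokenizer)
```

Resizing only changes the shape of the embedding matrix. Existing rows still correspond to the original vocabulary's token IDs, so the model would likely need training or fine-tuning with the new tokenizer before its output is meaningful.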