---
tags:
- kabyle
- tokenizer
- sentencepiece
- low-resource
language: kab
license: apache-2.0
---

# Kabyle Tokenizer for T5

A SentencePiece tokenizer trained on 787,648 Kabyle sentences from Tatoeba and designed for T5-style models.

## Vocabulary
- Size: 32,000 tokens
- Type: BPE (Byte Pair Encoding)
- Character coverage: 99.99%

## Special Tokens
- `<unk>`: Unknown token
- `<pad>`: Padding token
- `</s>`: End of sequence
- `<s>`: Beginning of sequence
- `<mask>`: Mask token (for T5)

## Usage

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("boffire/kabyle-tokenizer-T5")
tokens = tokenizer.tokenize("Aqcic-nni yeɣra adlis.")
print(tokens)  # ['▁Aqcic', '-', 'nni', '▁yeɣra', '▁adlis', '.']
```

## Training Data
- Source: Tatoeba Kabyle corpus
- Sentences: 787,648
- Cleaning: Greek and Cyrillic look-alike characters replaced with their Latin equivalents (ε→ɛ, γ→ɣ, Σ→Ɛ, Γ→Ɣ, Ԑ→Ɛ, ԑ→ɛ)
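
The replacements listed above could be applied with a single translation table. This is a minimal sketch, not the cleaning script actually used for this tokenizer; the names `HOMOGLYPH_MAP` and `normalize_kabyle` are hypothetical.

```python
# Hypothetical helper: maps Greek/Cyrillic look-alikes to the Latin letters
# used in Kabyle, following the character pairs listed in this card.
HOMOGLYPH_MAP = str.maketrans({
    "\u03b5": "\u025b",  # Greek ε    -> Latin ɛ
    "\u03b3": "\u0263",  # Greek γ    -> Latin ɣ
    "\u03a3": "\u0190",  # Greek Σ    -> Latin Ɛ
    "\u0393": "\u0194",  # Greek Γ    -> Latin Ɣ
    "\u0510": "\u0190",  # Cyrillic Ԑ -> Latin Ɛ
    "\u0511": "\u025b",  # Cyrillic ԑ -> Latin ɛ
})


def normalize_kabyle(text: str) -> str:
    """Replace look-alike characters; all other characters pass through unchanged."""
    return text.translate(HOMOGLYPH_MAP)


# The input below contains a Greek gamma (U+03B3) instead of Latin ɣ (U+0263).
print(normalize_kabyle("ye\u03b3ra"))
```

Applying the same function to any text before tokenizing keeps inference input consistent with the cleaned training data.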

## Comparison with T5-original

| Phrase | T5-original (token count) | Kabyle-SPM (token count) |
|--------|-------------------|-------------------|
| Aqcic-nni yeɣra adlis. | 19 | 6 |
| Tettmeslayeḍ taqbaylit? | 18 | 4 |
| Ur zmireɣ ara ad qqimeɣ argaz-a. | ~20 | 10 |
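
To make the table easier to read at a glance, the counts can be turned into reduction factors. The sketch below only does arithmetic on the numbers reported in the table (the third T5 count is approximate, so its ratio is too); `reduction_factor` is a hypothetical helper name.

```python
# Token counts copied from the comparison table: (T5-original, Kabyle-SPM).
# The third T5 count is reported as "~20", so treat its ratio as approximate.
counts = {
    "Aqcic-nni yeɣra adlis.": (19, 6),
    "Tettmeslayeḍ taqbaylit?": (18, 4),
    "Ur zmireɣ ara ad qqimeɣ argaz-a.": (20, 10),
}


def reduction_factor(t5_tokens: int, kabyle_tokens: int) -> float:
    """How many times fewer tokens the Kabyle tokenizer produces."""
    return t5_tokens / kabyle_tokens


for phrase, (t5_n, kab_n) in counts.items():
    print(f"{phrase!r}: {reduction_factor(t5_n, kab_n):.1f}x fewer tokens")
```

On these three phrases the reduction works out to roughly 3.2x, 4.5x, and 2x respectively.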

## Kabyle Characters Preserved
- ɛ / Ɛ (open e)
- ɣ / Ɣ (gamma)
- č / Č (c with caron)
- ǧ / Ǧ (g with caron)
- ḍ / Ḍ (d with dot below)
- ḥ / Ḥ (h with dot below)
- ṛ / Ṛ (r with dot below)
- ṣ / Ṣ (s with dot below)
- ṭ / Ṭ (t with dot below)
- ẓ / Ẓ (z with dot below)
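
Several of the letters above can also be typed as a base letter followed by a combining mark (for example `d` + combining dot below). This card does not state whether the tokenizer normalizes such sequences itself, so the sketch below assumes it may not and composes the text to NFC beforehand; `to_nfc` is a hypothetical helper name.

```python
import unicodedata


def to_nfc(text: str) -> str:
    """Compose base letter + combining mark sequences into single code points."""
    return unicodedata.normalize("NFC", text)


# 'd' followed by COMBINING DOT BELOW (U+0323), as some keyboards produce it.
decomposed = "Tettmeslaye" + "d\u0323"
composed = to_nfc(decomposed)

print(len(decomposed), len(composed))   # the composed form is one code point shorter
print(composed.endswith("\u1e0d"))      # U+1E0D is the precomposed ḍ
```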

## Limitations
- Optimized for short sentences (Tatoeba style)
- May split rare compound words (e.g., "tebirt" → "teb" + "irt")
- For full compatibility, the T5 model's embedding matrix must be resized to match this tokenizer's 32,000-token vocabulary
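
Resizing the embeddings can be sketched with the `resize_token_embeddings` method from `transformers`. The example below uses a tiny, randomly initialized T5 so that it runs offline; the configuration values are illustrative assumptions, and in practice you would load a real checkpoint with `from_pretrained` and pass `len(tokenizer)` instead of the constant.

```python
from transformers import T5Config, T5ForConditionalGeneration

KABYLE_VOCAB_SIZE = 32_000  # vocabulary size stated in this card

# Tiny random T5 (illustrative sizes only) so the sketch runs without downloads.
# With a real model: T5ForConditionalGeneration.from_pretrained(<checkpoint>).
config = T5Config(
    vocab_size=32_128,  # vocabulary size used by the original T5 checkpoints
    d_model=64,
    d_kv=16,
    d_ff=128,
    num_layers=1,
    num_heads=2,
)
model = T5ForConditionalGeneration(config)

# Resize the input embeddings (and the tied output layer) to the new vocabulary.
# With a loaded tokenizer, len(tokenizer) is the safer argument, since it also
# counts any special tokens added on top of the base vocabulary.
model.resize_token_embeddings(KABYLE_VOCAB_SIZE)

print(model.get_input_embeddings().weight.shape)
print(model.config.vocab_size)
```

Note that resizing only changes the shape of the matrix: rows inherited from a pretrained checkpoint were learned for a different vocabulary, so the model would still need training or fine-tuning with this tokenizer before its output is meaningful.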