Mailwoman SentencePiece Tokenizer v0.6.0-a0
Multi-script SentencePiece tokenizer trained for address parsing. Achieves 0% byte-fallback on CJK, Korean, Thai, and Arabic scripts (down from 36–75% on prior tokenizers).
- Source: https://github.com/sister-software/mailwoman
- License: AGPL-3.0
Training data
2.19 million place-name records from Who's On First across 7 countries:
| Script group | Lines |
|---|---|
| Latin (en/fr/de/es/it/pt/nl/sv/pl) | 500,000 |
| Chinese (zho) | 500,000 |
| Cyrillic (rus/ukr/bul/srp) | 468,111 |
| Arabic (ara/fas/urd) | 285,109 |
| Japanese (jpn) | 183,219 |
| Korean (kor) | 94,414 |
| Other (tha/hin/ben/heb/ell) | 160,407 |
Format
| Field | Value |
|---|---|
| Algorithm | SentencePiece unigram |
| Vocabulary size | 48,000 |
| Character coverage | 0.9999 |
| Byte-fallback | enabled |
| Split digits | false |
| User-defined symbols | 64 (US state abbreviations + postcode formats) |
| Model size | 1.0 MB |
Byte-fallback rates
Measured on 12 hard multi-script address examples:
| Script | Old (v0.5.0-a1) | This tokenizer (v0.6.0-a0) |
|---|---|---|
| Chinese | 50–75% | 0% |
| Japanese | 58–60% | 0% |
| Korean | 41% | 0% |
| Thai | 30% | 0% |
| Arabic | 0% | 0% |
| Latin | 0% | 0% |
| Aggregate | 36.6% | 0.0% |
Usage
import sentencepiece as spm
tokenizer = spm.SentencePieceProcessor()
tokenizer.load("tokenizer.model")
pieces = tokenizer.encode_as_pieces("東京都新宿区西新宿2-8-1")
# ['▁', '東京', '都', '新宿', '区', '西', '新宿', '2', '-', '8', '-', '1']
Files
| File | Description |
|---|---|
tokenizer.model |
SentencePiece binary (load with spm.SentencePieceProcessor.load) |
tokenizer.vocab |
Plain-text vocabulary listing (one piece per line, with score) |
model_card.json |
Training provenance metadata (SHA256, training lines, etc.) |
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support