Mailwoman SentencePiece Tokenizer v0.6.0-a0

Multi-script SentencePiece tokenizer trained for address parsing. Achieves 0% byte-fallback on CJK, Korean, Thai, and Arabic scripts (down from 36–75% on prior tokenizers).

Training data

2.19 million place-name records from Who's On First across 7 countries:

Script group Lines
Latin (en/fr/de/es/it/pt/nl/sv/pl) 500,000
Chinese (zho) 500,000
Cyrillic (rus/ukr/bul/srp) 468,111
Arabic (ara/fas/urd) 285,109
Japanese (jpn) 183,219
Korean (kor) 94,414
Other (tha/hin/ben/heb/ell) 160,407

Format

Field Value
Algorithm SentencePiece unigram
Vocabulary size 48,000
Character coverage 0.9999
Byte-fallback enabled
Split digits false
User-defined symbols 64 (US state abbreviations + postcode formats)
Model size 1.0 MB

Byte-fallback rates

Measured on 12 hard multi-script address examples:

Script Old (v0.5.0-a1) This tokenizer (v0.6.0-a0)
Chinese 50–75% 0%
Japanese 58–60% 0%
Korean 41% 0%
Thai 30% 0%
Arabic 0% 0%
Latin 0% 0%
Aggregate 36.6% 0.0%

Usage

import sentencepiece as spm
tokenizer = spm.SentencePieceProcessor()
tokenizer.load("tokenizer.model")

pieces = tokenizer.encode_as_pieces("東京都新宿区西新宿2-8-1")
# ['▁', '東京', '都', '新宿', '区', '西', '新宿', '2', '-', '8', '-', '1']

Files

File Description
tokenizer.model SentencePiece binary (load with spm.SentencePieceProcessor.load)
tokenizer.vocab Plain-text vocabulary listing (one piece per line, with score)
model_card.json Training provenance metadata (SHA256, training lines, etc.)
Downloads last month

-

Downloads are not tracked for this model. How to track
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support