---
language: ["khm"]
tokenizer_type: "HybridBPE-MD"
license: "mit"
tags:
- khmer
- tokenizer
- bpe-md
- sentencepiece
- hybrid
- language-model
- text-processing
---
# 🇰🇭 Khmer BPE-MD-v3-SPM (Hybrid Tokenizer)
**BPE-MD-v3-SPM** is a hybrid Khmer tokenizer that combines:
- **BPE-MD (Morphology-Driven)** rules for Khmer word segmentation, and
- **SentencePiece BPE** modeling for subword learning, coverage, and byte safety.

This tokenizer is built for Khmer and for mixed-script text (Khmer + English + math).
It handles Unicode normalization, symbols, and numerals, which makes it suitable for LLMs, translation models, and RAG systems.

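
The two-stage idea (rule-based segmentation first, subword modeling second) can be sketched as follows. The actual BPE-MD morphology rules are not given in this card, so this stand-in splits only on script boundaries; `script_of` and `pre_segment` are hypothetical names, and in a real pipeline each run would then be handed to the SentencePiece model.

```python
from itertools import groupby


def script_of(ch: str) -> str:
    """Coarse script class for one character (illustrative, not the real BPE-MD rules)."""
    if 0x1780 <= ord(ch) <= 0x17FF:  # Khmer Unicode block, including Khmer digits
        return "khmer"
    if ch.isspace():
        return "space"
    if "0" <= ch <= "9":  # ASCII digits only, so "²" stays a symbol
        return "digit"
    if ch.isascii() and ch.isalpha():
        return "latin"
    return "symbol"


def pre_segment(text: str) -> list[str]:
    """Split text into runs of a single script class, dropping whitespace runs."""
    return [
        "".join(run)
        for script, run in groupby(text, key=script_of)
        if script != "space"
    ]


print(pre_segment("ភាសាខ្មែរ AI 25 + 3² = 34"))
```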
---
### 🧠 Features
- **Hybrid design:** BPE-MD (morphology) × SentencePiece (subword)
- **Script coverage:** Khmer + Latin + Math + Digits
- **Vocab size:** 16,100
- **Character coverage:** 1.0
- **Includes:** user-defined math and chemical tokens (², H₂O, log₁₀, etc.)
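
User-defined symbols are meant to survive tokenization as single pieces rather than being broken into subwords. The sketch below imitates that behaviour with a plain regular expression; it is not how SentencePiece implements it, and `USER_SYMBOLS` is a hypothetical subset, since the full symbol list is not given here.

```python
import re

# Hypothetical subset of the user-defined symbols (assumption, not the real list).
USER_SYMBOLS = ("H₂O", "log₁₀", "²")


def split_on_symbols(text: str, symbols: tuple[str, ...] = USER_SYMBOLS) -> list[str]:
    """Split text so that each user-defined symbol becomes its own atomic piece."""
    # Try longer symbols first so they win over any shorter overlapping symbol.
    alternatives = sorted(symbols, key=len, reverse=True)
    pattern = "(" + "|".join(re.escape(s) for s in alternatives) + ")"
    # re.split with a capturing group keeps the matched symbols in the result.
    return [piece for piece in re.split(pattern, text) if piece]


print(split_on_symbols("3² + H₂O"))
```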
---
### 🧩 Example usage
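```python
from transformers import T5Tokenizer

# Load the tokenizer from the Hugging Face Hub
tok = T5Tokenizer.from_pretrained("Msok99/km-bpe-md-v3-spm")

# Mixed Khmer + math sample
text = "αααα»αααΆαααααΆααΆ β25 + 3² = 34"

print(tok.tokenize(text))            # subword pieces
print(tok.decode(tok.encode(text)))  # round trip back to text
```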
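
A round trip like the one above can be turned into a quick check. This is a minimal sketch: `roundtrip_ok` is a hypothetical helper that accepts any object with Hugging Face style `encode` and `decode` methods. It compares under NFC rather than NFKC because NFKC folds `²` into `2`; which normalization this tokenizer applies is not stated in the card, so a failed check may only mean the model normalizes more aggressively.

```python
import unicodedata


def _nfc(s: str) -> str:
    """NFC-normalize and trim surrounding whitespace."""
    return unicodedata.normalize("NFC", s).strip()


def roundtrip_ok(tok, text: str) -> bool:
    """True if decode(encode(text)) reproduces text, ignoring special tokens."""
    decoded = tok.decode(tok.encode(text), skip_special_tokens=True)
    return _nfc(decoded) == _nfc(text)


# NFKC is lossy for superscripts, NFC is not.
print(unicodedata.normalize("NFKC", "3²"))  # prints "32"
print(unicodedata.normalize("NFC", "3²"))   # prints "3²"
```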
---
### Training details
- **Base:** Khmer Morphology-Driven corpus (education, news, QA)
- **Algorithm:** SentencePiece (BPE mode)
- **User symbols:** Mathematical, scientific, and Khmer-digit patterns
- **Goal:** Robust tokenization for LLM fine-tuning on Khmer + mixed-script data
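
The settings listed in this card map onto SentencePiece trainer options roughly as below. This is a sketch under assumptions, not the author's training script: the corpus path, the model prefix, and the symbol list are placeholders, and `build_trainer_args` and `train` are hypothetical names.

```python
def build_trainer_args(corpus_path: str, model_prefix: str = "km_bpe_md_v3") -> dict:
    """Collect SentencePiece trainer options that mirror the values in this card."""
    return {
        "input": corpus_path,          # one sentence per line (assumed, pre-segmented text)
        "model_prefix": model_prefix,  # hypothetical output name
        "model_type": "bpe",
        "vocab_size": 16100,
        "character_coverage": 1.0,
        # Placeholder subset; the real user-symbol list is not given here.
        "user_defined_symbols": ["²", "H₂O", "log₁₀"],
    }


def train(corpus_path: str) -> None:
    """Train a model; requires the sentencepiece package and a real corpus file."""
    import sentencepiece as spm  # imported lazily so the config can be inspected without it

    spm.SentencePieceTrainer.train(**build_trainer_args(corpus_path))
```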
---
### License
MIT