---

language: ["khm"]
tokenizer_type: "HybridBPE-MD"
license: "mit"
tags:
  - khmer
  - tokenizer
  - bpe-md
  - sentencepiece
  - hybrid
  - language-model
  - text-processing
---


# 🇰🇭 Khmer BPE-MD-v3-SPM (Hybrid Tokenizer)

**BPE-MD-v3-SPM** is a hybrid Khmer tokenizer that combines:
- **BPE-MD (Morphology-Driven)** rules for Khmer word segmentation, and  
- **SentencePiece BPE** modeling for subword learning, coverage, and byte safety.

This tokenizer is built for Khmer text and for mixed Khmer + English + Math text.
It handles Unicode normalization, symbols, and numerics, which makes it suitable for LLMs, translation models, and RAG systems.
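
As a small illustration of why the normalization form matters for math and chemistry symbols, the sketch below compares NFC and NFKC using only Python's standard library. It does not show this tokenizer's actual normalization settings, which this card does not specify; the sample string and the helper name `compare_normal_forms` are assumptions made for the example.

```python
import unicodedata


def compare_normal_forms(text: str) -> dict[str, str]:
    """Return the NFC and NFKC forms of `text` for side-by-side inspection."""
    return {form: unicodedata.normalize(form, text) for form in ("NFC", "NFKC")}


# Hypothetical sample containing superscript and subscript characters.
sample = "√25 + 3² = log₁₀ H₂O"
forms = compare_normal_forms(sample)

print(forms["NFC"])   # canonical form: superscripts and subscripts are kept
print(forms["NFKC"])  # compatibility form: they are folded into plain digits
```

NFKC folds `²` and `₁₀` into plain digits, so a pipeline that wants to keep such symbols as distinct tokens has to avoid that folding before tokenization.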

---

### 🧠 Features
- **Hybrid design:** BPE-MD (morphology) × SentencePiece (subword)  
- **Script coverage:** Khmer + Latin + Math + Digits  
- **Vocab size:** 16,100  
- **Character coverage:** 1.0  
- **Includes:** user-defined math and chemical tokens (√, ², ₁₀, H₂O, log₁₀, etc.)

---

### 🧩 Example usage

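    # Requires the `transformers` and `sentencepiece` packages.
    from transformers import T5Tokenizer

    tok = T5Tokenizer.from_pretrained("Msok99/km-bpe-md-v3-spm")

    # Mixed Khmer + math sample text
    text = "ខ្ញុំបានគណនាថា √25 + 3² = 34"

    # Subword pieces produced by the tokenizer
    print(tok.tokenize(text))

    # Round trip: encode to ids, then decode back to text.
    # skip_special_tokens drops markers such as the end-of-sequence
    # token that encode() may append.
    print(tok.decode(tok.encode(text), skip_special_tokens=True))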
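
To check the round trip programmatically rather than by eye, a small helper along these lines can be used. It is only a sketch: `round_trip_ok` is a hypothetical name, and the byte-level codec below is a stand-in so the snippet runs without downloading the model. With the real tokenizer, one would pass `tok.encode` and a decode function that sets `skip_special_tokens=True`; an exact match is not guaranteed, since SentencePiece models can normalize whitespace and characters.

```python
from typing import Callable


def round_trip_ok(
    encode: Callable[[str], list[int]],
    decode: Callable[[list[int]], str],
    text: str,
) -> bool:
    """Return True if decoding the encoded text reproduces it exactly."""
    return decode(encode(text)) == text


# Stand-in codec (assumption): raw UTF-8 bytes instead of the real tokenizer.
def byte_encode(s: str) -> list[int]:
    return list(s.encode("utf-8"))


def byte_decode(ids: list[int]) -> str:
    return bytes(ids).decode("utf-8")


samples = ["ខ្ញុំបានគណនាថា √25 + 3² = 34", "H₂O and log₁₀"]
results = [round_trip_ok(byte_encode, byte_decode, s) for s in samples]
print(results)
```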



---



### 📊 Training details

- **Base:** Khmer Morphology-Driven corpus (education, news, QA)
- **Algorithm:** SentencePiece (BPE mode)
- **User symbols:** Mathematical, scientific, and Khmer-digit patterns
- **Goal:** Robust tokenization for LLM fine-tuning on Khmer + mixed-script data
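
The settings listed on this card map onto SentencePiece trainer options roughly as follows. This is a sketch under assumptions, not the script used to train this tokenizer: the corpus path, the model prefix, and the helper names are hypothetical, the symbol list contains only the examples named in the Features list, and the corpus is assumed to have been pre-segmented by the BPE-MD rules beforehand.

```python
def build_spm_args(corpus_path: str, model_prefix: str) -> dict:
    """Collect SentencePiece trainer options that mirror the values on this card."""
    return {
        "input": corpus_path,            # hypothetical path to the training text
        "model_prefix": model_prefix,    # hypothetical output name
        "model_type": "bpe",             # "SentencePiece (BPE mode)"
        "vocab_size": 16100,             # vocab size stated on the card
        "character_coverage": 1.0,       # character coverage stated on the card
        # Only the example symbols named on the card; the full list is not given.
        "user_defined_symbols": ["√", "²", "₁₀", "H₂O", "log₁₀"],
    }


def train_tokenizer(corpus_path: str, model_prefix: str) -> None:
    """Run SentencePiece training (needs the third-party `sentencepiece` package)."""
    import sentencepiece as spm

    spm.SentencePieceTrainer.train(**build_spm_args(corpus_path, model_prefix))


args = build_spm_args("khmer_corpus.txt", "km_bpe_md")
print(args["model_type"], args["vocab_size"])
```

Calling `train_tokenizer("khmer_corpus.txt", "km_bpe_md")` would write `km_bpe_md.model` and `km_bpe_md.vocab` next to the script, provided the corpus file exists.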



---



### 📜 License

MIT