---

language: ["khm"]
tokenizer_type: "HybridBPE-MD"
license: "mit"
tags:
  - khmer
  - tokenizer
  - bpe-md
  - sentencepiece
  - hybrid
  - language-model
  - text-processing
---


# 🇰🇭 Khmer BPE-MD-v3-SPM (Hybrid Tokenizer)

**BPE-MD-v3-SPM** is a hybrid Khmer tokenizer that combines:
- **BPE-MD (Morphology-Driven)** rules for Khmer word segmentation, and  
- **SentencePiece BPE** modeling for subword learning, coverage, and byte safety.

This tokenizer is built for Khmer text and for mixed Khmer, English, and math text.
It handles Unicode normalization, symbols, and numerals, which makes it a fit for LLMs, translation models, and RAG systems.
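
The two-stage idea (rule-based segmentation first, subword modeling second) can be illustrated with a rough sketch. The actual BPE-MD rules are not described in this card, so the script-run splitting below is only a hypothetical stand-in for the morphology step, and `pre_segment` is an illustrative name. NFC is used here rather than NFKC on the assumption that characters such as `²` should survive normalization.

```python
import re
import unicodedata

# Hypothetical stand-in for the morphology-driven step: split text into runs
# of the same script. The real BPE-MD rules are NOT documented in this card.
_RUNS = re.compile(
    r"[\u1780-\u17FF\u19E0-\u19FF]+"  # Khmer and Khmer Symbols blocks
    r"|[A-Za-z]+"                      # Latin letters
    r"|[0-9]+"                         # ASCII digits
    r"|\S"                             # any other non-space character, alone
)


def pre_segment(text: str) -> list[str]:
    """Normalize to NFC, then return script-homogeneous chunks."""
    text = unicodedata.normalize("NFC", text)
    return _RUNS.findall(text)


if __name__ == "__main__":
    # Each chunk would then be passed to the SentencePiece BPE model.
    print(pre_segment("ខ្ញុំបានគណនាថា √25 + 3² = 14"))
```

Note that Khmer digits (U+17E0 to U+17E9) sit inside the Khmer block, so this sketch keeps them attached to neighboring Khmer letters; a real segmenter would likely treat them separately.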

---

### 🧠 Features
- **Hybrid design:** BPE-MD (morphology) × SentencePiece (subword)  
- **Script coverage:** Khmer + Latin + Math + Digits  
- **Vocab size:** 16,100  
- **Character coverage:** 1.0  
- **Includes:** user-defined math and chemical tokens (√, ², ₁₀, H₂O, log₁₀, etc.)

---

### 🧩 Example usage

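    from transformers import T5Tokenizer

    # Load the tokenizer from the Hugging Face Hub (needs the sentencepiece package)
    tok = T5Tokenizer.from_pretrained("Msok99/km-bpe-md-v3-spm")

    text = "ខ្ញុំបានគណនាថា √25 + 3² = 14"

    print(tok.tokenize(text))  # subword pieces
    # T5Tokenizer appends an end-of-sequence token on encode; skip it when decoding
    print(tok.decode(tok.encode(text), skip_special_tokens=True))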
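
To turn the encode/decode round trip into a check rather than a visual comparison, a small helper along these lines may be useful. It is a sketch only: `round_trip_ok` is a hypothetical name, it accepts any pair of encode/decode callables, and it compares strings after NFC normalization and whitespace stripping, which is an assumption about what counts as "lossless" here.

```python
import unicodedata
from typing import Callable, Sequence


def round_trip_ok(
    encode: Callable[[str], Sequence[int]],
    decode: Callable[[Sequence[int]], str],
    text: str,
) -> bool:
    """Return True if decode(encode(text)) reproduces text up to NFC and outer whitespace."""
    def norm(s: str) -> str:
        return unicodedata.normalize("NFC", s).strip()

    return norm(decode(encode(text))) == norm(text)


if __name__ == "__main__":
    # Toy codec (code points) so the sketch runs without downloading a model.
    to_ids = lambda s: [ord(c) for c in s]
    from_ids = lambda ids: "".join(chr(i) for i in ids)
    print(round_trip_ok(to_ids, from_ids, "√25 + 3²"))
```

With the tokenizer loaded as `tok`, the call would look like `round_trip_ok(tok.encode, lambda ids: tok.decode(ids, skip_special_tokens=True), text)`. A `False` result would indicate that normalization or unknown-token handling changed the text.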



---



### 📊 Training details

- **Base:** Khmer Morphology-Driven corpus (education, news, QA)
- **Algorithm:** SentencePiece (BPE mode)
- **User symbols:** Mathematical, scientific, and Khmer-digit patterns
- **Goal:** Robust tokenization for LLM fine-tuning on Khmer + mixed-script data
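
For orientation, the settings listed in this card map onto SentencePiece trainer options roughly as follows. This is a sketch, not the training script used for this model: the corpus path and model prefix are placeholders, the symbol list only repeats the examples named above, and `byte_fallback=True` is an assumption inferred from the "byte safety" claim.

```python
def build_trainer_args(corpus_path: str, model_prefix: str) -> dict:
    """Collect SentencePiece trainer options matching the figures in this card."""
    return {
        "input": corpus_path,            # placeholder: pre-segmented Khmer corpus
        "model_prefix": model_prefix,    # placeholder output name
        "model_type": "bpe",             # SentencePiece in BPE mode
        "vocab_size": 16100,
        "character_coverage": 1.0,
        # Only the symbols named in the card; the full list is not published here.
        "user_defined_symbols": ["√", "²", "₁₀", "H₂O", "log₁₀"],
        "byte_fallback": True,           # assumption, based on "byte safety"
    }


def train(corpus_path: str, model_prefix: str) -> None:
    """Run training; sentencepiece is imported lazily so the args can be inspected without it."""
    import sentencepiece as spm

    spm.SentencePieceTrainer.train(**build_trainer_args(corpus_path, model_prefix))


if __name__ == "__main__":
    print(build_trainer_args("corpus.txt", "km_bpe_md"))
```

Keeping the options in a plain dictionary makes it easy to log or diff the configuration before starting a training run.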



---



### 📜 License

MIT