---
language: th
tags:
- thai
- tokenizer
- nlp
- text-processing
license: mit
---
# ZombitX64 Thai Tokenizer
A simple Thai-language tokenizer that preserves newlines and performs basic Thai text segmentation.
## Features
- **Newline Preservation**: Correctly handles and preserves newlines in tokenized text
- **Thai Character Support**: Recognizes and processes Thai Unicode characters
- **Hugging Face Compatible**: Works with the `transformers` library
- **Simple API**: Easy to use tokenize and detokenize methods
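
As a rough illustration of the first two features, the sketch below tokenizes at the character level, keeps each newline as its own token, and checks characters against the Unicode Thai block (U+0E00 to U+0E7F). It is a toy written for illustration, not this tokenizer's actual implementation; `is_thai`, `toy_tokenize`, and `toy_detokenize` are hypothetical names.

```python
from typing import List

# Toy sketch only: NOT the actual ZombitX64 implementation.
THAI_START, THAI_END = 0x0E00, 0x0E7F  # bounds of the Unicode "Thai" block


def is_thai(ch: str) -> bool:
    """Return True if a single character lies in the Unicode Thai block."""
    return THAI_START <= ord(ch) <= THAI_END


def toy_tokenize(text: str) -> List[str]:
    """Split text into one token per character; a newline stays a token."""
    return list(text)


def toy_detokenize(tokens: List[str]) -> str:
    """Join tokens back together without adding or dropping characters."""
    return "".join(tokens)


text = "สวัสดีครับ\nนี่คือตัวอย่าง"
tokens = toy_tokenize(text)
print(tokens)
print(toy_detokenize(tokens) == text)  # round trip keeps the newline
```

Because no character is dropped or merged, detokenizing returns the original string, newline included.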
## Usage
```python
from transformers import AutoTokenizer
# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("ZombitX64/zombitx64-thaitokenizer")
# Tokenize text
text = "สวัสดีครับ\nนี่คือตัวอย่าง"
tokens = tokenizer.tokenize(text)
print(tokens)
# Encode to IDs
token_ids = tokenizer.encode(text)
print(token_ids)
# Decode back
decoded = tokenizer.decode(token_ids)
print(decoded)
```
## Model Details
- **Model Type**: Thai Tokenizer
- **Language**: Thai (th)
- **Vocab Size**: 112
- **Max Length**: 512
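
A maximum length of 512 means longer inputs have to be truncated or split before they reach a model. Whether this tokenizer does that automatically is not stated here, so the sketch below splits a list of token IDs by hand. `chunk_ids` is a hypothetical helper, not part of this package or of `transformers`, and the IDs are placeholders rather than real tokenizer output.

```python
from typing import List

MAX_LENGTH = 512  # "Max Length" from the Model Details list


def chunk_ids(token_ids: List[int], max_length: int = MAX_LENGTH) -> List[List[int]]:
    """Split a flat list of token IDs into consecutive windows of at most max_length."""
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    return [token_ids[i:i + max_length] for i in range(0, len(token_ids), max_length)]


# Placeholder IDs standing in for the output of tokenizer.encode(...)
fake_ids = list(range(1200))
chunks = chunk_ids(fake_ids)
print([len(c) for c in chunks])
```

This simple split ignores any special tokens the tokenizer might add, so leave headroom below 512 if that applies to your setup.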
## Training Data
This tokenizer was trained on basic Thai character sets and common patterns.
## Limitations
- Word segmentation is basic; a dedicated library such as PyThaiNLP can provide better word boundaries
- The vocabulary is small (112 entries) but can be extended for specific use cases
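
To show what dictionary-based word segmentation adds over a basic split, here is a minimal greedy longest-match sketch. It is not PyThaiNLP's API or algorithm and not part of this tokenizer; `longest_match_segment` and the five-word dictionary are assumptions made up for the example.

```python
from typing import List, Set


def longest_match_segment(text: str, dictionary: Set[str]) -> List[str]:
    """Greedy longest-match segmentation; unknown characters become single tokens."""
    max_len = max((len(word) for word in dictionary), default=1)
    tokens: List[str] = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink one character at a time.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
        else:
            # No dictionary word starts here: emit the character as-is,
            # which also keeps newlines as separate tokens.
            tokens.append(text[i])
            i += 1
    return tokens


# Hypothetical mini-dictionary, just large enough for the sample sentence
words = {"สวัสดี", "ครับ", "นี่", "คือ", "ตัวอย่าง"}
segments = longest_match_segment("สวัสดีครับ\nนี่คือตัวอย่าง", words)
print(segments)
```

Greedy matching is easy to follow but can pick the wrong boundary when words overlap, which is one reason a maintained segmenter is the better choice for real text.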
## Contact
For questions or issues, please visit the [GitHub repository](https://github.com/ZombitX64/ZombitX64-Thaitokenizer).