| ---
|
| language: th
|
| tags:
|
| - thai
|
| - tokenizer
|
| - nlp
|
| - text-processing
|
| license: mit
|
| ---
|
|
|
| # ZombitX64 Thai Tokenizer
|
|
|
| A simple Thai-language tokenizer that preserves newlines and segments Thai text.
|
|
|
| ## Features
|
|
|
| - **Newline Preservation**: Keeps newline characters in the tokenized text instead of dropping them
|
| - **Thai Character Support**: Recognizes and processes Thai Unicode characters
|
| - **Hugging Face Compatible**: Loads through `AutoTokenizer` from the `transformers` library
|
| - **Simple API**: Easy-to-use methods for tokenizing (`tokenize`, `encode`) and detokenizing (`decode`)
|
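To make the "Thai Character Support" feature concrete, here is a minimal sketch of how Thai characters could be recognized by their Unicode block (U+0E00 to U+0E7F). The function name `is_thai_char` is hypothetical and is not part of this tokenizer's API; the tokenizer's own detection logic may differ.

```python
# Illustrative only: a range check against the Unicode Thai block.
# This is an assumption about how detection could work, not the
# tokenizer's actual implementation.
THAI_BLOCK_START = 0x0E00
THAI_BLOCK_END = 0x0E7F


def is_thai_char(ch: str) -> bool:
    """Return True if ch is a single character inside the Thai block."""
    return len(ch) == 1 and THAI_BLOCK_START <= ord(ch) <= THAI_BLOCK_END


sample = "สวัสดี\nabc"
# Pair each character with a flag saying whether it is Thai.
flags = [(ch, is_thai_char(ch)) for ch in sample]
print(flags)
```

Newlines and Latin letters fall outside the block, so they are flagged `False` and can be handled separately.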
|
|
| ## Usage
|
|
|
| ```python
|
| from transformers import AutoTokenizer
|
|
|
| # Load the tokenizer
|
| tokenizer = AutoTokenizer.from_pretrained("ZombitX64/zombitx64-thaitokenizer")
|
|
|
| # Tokenize text
|
| text = "สวัสดีครับ\nนี่คือตัวอย่าง"
|
| tokens = tokenizer.tokenize(text)
|
| print(tokens)
|
|
|
| # Encode to IDs
|
| token_ids = tokenizer.encode(text)
|
| print(token_ids)
|
|
|
| # Decode the IDs back into text
|
| decoded = tokenizer.decode(token_ids)
|
| print(decoded)
|
| ```
|
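One way to check the newline-preservation claim is a round trip: encode, decode, and compare newline counts. The sketch below takes the encode and decode steps as plain callables so it runs offline; `toy_encode` and `toy_decode` are hypothetical stand-ins, not this tokenizer's vocabulary.

```python
def newlines_survive_roundtrip(encode, decode, text):
    """Return True if decode(encode(text)) has as many newlines as text."""
    restored = decode(encode(text))
    return restored.count("\n") == text.count("\n")


# Stand-in character-level codec (hypothetical) so the sketch runs
# without downloading anything.
def toy_encode(text):
    return [ord(ch) for ch in text]


def toy_decode(ids):
    return "".join(chr(i) for i in ids)


sample = "สวัสดีครับ\nนี่คือตัวอย่าง"
print(newlines_survive_roundtrip(toy_encode, toy_decode, sample))
```

With the tokenizer loaded as shown above, `tokenizer.encode` and `lambda ids: tokenizer.decode(ids, skip_special_tokens=True)` could be passed in place of the stand-ins. Whether special tokens are added during encoding depends on the tokenizer configuration, which this README does not describe.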
|
|
| ## Model Details
|
|
|
| - **Model Type**: Thai Tokenizer
|
| - **Language**: Thai (th)
|
| - **Vocab Size**: 112
|
| - **Max Length**: 512
|
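Inputs longer than the maximum length need to be truncated or split. The following sketch splits a list of token IDs into consecutive chunks, assuming the 512 limit applies to the number of IDs per sequence and ignoring any special tokens; `chunk_ids` is a hypothetical helper, not part of this tokenizer.

```python
def chunk_ids(token_ids, max_length=512):
    """Split token_ids into consecutive chunks of at most max_length IDs."""
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    return [
        token_ids[start:start + max_length]
        for start in range(0, len(token_ids), max_length)
    ]


# 1200 dummy IDs stand in for a long encoded document.
chunks = chunk_ids(list(range(1200)))
print([len(chunk) for chunk in chunks])  # → [512, 512, 176]
```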
|
|
| ## Training Data
|
|
|
| This tokenizer was trained on basic Thai character sets and common patterns.
|
|
|
| ## Limitations
|
|
|
| - Word segmentation is basic; a dedicated segmenter such as PyThaiNLP (`pythainlp`) can improve it
|
| - The vocabulary is small (112 entries) but can be expanded for specific use cases
|
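As a rough illustration of how an external word segmenter could be combined with newline preservation, the sketch below segments each line separately and re-inserts `"\n"` as its own token. The helper `tokenize_lines` and the stand-in `whitespace_segmenter` are hypothetical and not part of this tokenizer.

```python
def tokenize_lines(text, segment_words):
    """Segment each line with segment_words, keeping "\n" as its own token."""
    tokens = []
    for index, line in enumerate(text.split("\n")):
        if index > 0:
            # Re-insert the newline that split() removed.
            tokens.append("\n")
        if line:
            tokens.extend(segment_words(line))
    return tokens


# Stand-in segmenter (hypothetical) so the sketch runs without extra packages.
def whitespace_segmenter(line):
    return line.split()


print(tokenize_lines("hello world\n\nfoo", whitespace_segmenter))
# → ['hello', 'world', '\n', '\n', 'foo']
```

If PyThaiNLP is installed, its `pythainlp.tokenize.word_tokenize` function could be passed as `segment_words` to get dictionary-based Thai word boundaries instead of whitespace splitting.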
|
|
| ## Contact
|
|
|
| For questions or issues, please visit the [GitHub repository](https://github.com/ZombitX64/ZombitX64-Thaitokenizer).
|
|
|