---
language: th
tags:
- thai
- tokenizer
- nlp
- text-processing
license: mit
---

# ZombitX64 Thai Tokenizer

A simple Thai language tokenizer that properly handles newlines and Thai text segmentation.

## Features

- **Newline Preservation**: Correctly handles and preserves newlines in tokenized text
- **Thai Character Support**: Recognizes and processes Thai Unicode characters
- **Hugging Face Compatible**: Works with transformers library
- **Simple API**: Easy to use tokenize and detokenize methods

## Usage

```python
from transformers import AutoTokenizer

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("ZombitX64/zombitx64-thaitokenizer")

# Tokenize text
text = "สวัสดีครับ\nนี่คือตัวอย่าง"
tokens = tokenizer.tokenize(text)
print(tokens)

# Encode to IDs
token_ids = tokenizer.encode(text)
print(token_ids)

# Decode back
decoded = tokenizer.decode(token_ids)
print(decoded)
```

## Model Details

- **Model Type**: Thai Tokenizer
- **Language**: Thai (th)
- **Vocab Size**: 112
- **Max Length**: 512

## Training Data

This tokenizer was trained on basic Thai character sets and common patterns.

## Limitations

- Basic Thai word segmentation (can be improved with pythainlp)
- Simple vocabulary (expandable for specific use cases)

## Contact

For questions or issues, please visit the [GitHub repository](https://github.com/ZombitX64/ZombitX64-Thaitokenizer).
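
## Example: Newline Preservation and Thai Character Detection

The newline preservation and Thai character support listed under Features can be illustrated with a small standalone sketch. It is not this tokenizer's actual implementation: the names `TOKEN`, `THAI_RUN`, `sketch_tokenize`, `sketch_detokenize`, and `is_thai` are hypothetical, and the approach (a regular expression over the Thai Unicode block, U+0E00 to U+0E7F) is an assumption chosen for clarity. It groups runs of Thai characters rather than segmenting them into words.

```python
import re
from typing import List

# Assumption: "Thai character" means a code point in the Thai Unicode
# block, U+0E00 through U+0E7F. The real tokenizer may define it differently.
THAI_RUN = re.compile(r"[\u0E00-\u0E7F]+")

# Each character falls into exactly one alternative, so no input is dropped:
#   1. a single newline (kept as its own token)
#   2. a run of Thai characters
#   3. a run of non-whitespace, non-Thai characters
#   4. a run of whitespace other than newline
TOKEN = re.compile(r"\n|[\u0E00-\u0E7F]+|[^\s\u0E00-\u0E7F]+|[^\S\n]+")


def sketch_tokenize(text: str) -> List[str]:
    """Split text into runs while keeping every newline as a token."""
    return TOKEN.findall(text)


def sketch_detokenize(tokens: List[str]) -> str:
    """Rebuild the original text; lossless because nothing was dropped."""
    return "".join(tokens)


def is_thai(token: str) -> bool:
    """Return True if the token consists only of Thai-block characters."""
    return THAI_RUN.fullmatch(token) is not None


if __name__ == "__main__":
    text = "สวัสดีครับ\nนี่คือตัวอย่าง"
    tokens = sketch_tokenize(text)
    print(tokens)
    # The round trip reproduces the input, newline included.
    print(sketch_detokenize(tokens) == text)
```

Because each newline is emitted as its own token and detokenizing is plain concatenation, the round trip is exact. Real Thai word segmentation needs a dictionary or a trained model (for example via pythainlp, as noted under Limitations), which this sketch deliberately leaves out.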