---
language: th
tags:
- thai
- tokenizer
- nlp
- text-processing
license: mit
---

# ZombitX64 Thai Tokenizer

A simple Thai tokenizer that preserves newlines and segments Thai text.

## Features

- **Newline Preservation**: Correctly handles and preserves newlines in tokenized text
- **Thai Character Support**: Recognizes and processes Thai Unicode characters
- **Hugging Face Compatible**: Works with the `transformers` library
- **Simple API**: Easy-to-use `tokenize`, `encode`, and `decode` methods
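
To make "Thai Unicode characters" concrete, here is a minimal sketch that tests whether a character falls in the Unicode Thai block (U+0E00 to U+0E7F). The helper names `is_thai_char` and `thai_ratio` are hypothetical and are not part of this tokenizer; the tokenizer's own character handling may differ.

```python
# Unicode Thai block boundaries (U+0E00 to U+0E7F).
THAI_BLOCK_START = 0x0E00
THAI_BLOCK_END = 0x0E7F


def is_thai_char(ch: str) -> bool:
    """Return True if ch is a single code point inside the Thai block."""
    return len(ch) == 1 and THAI_BLOCK_START <= ord(ch) <= THAI_BLOCK_END


def thai_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are in the Thai block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(is_thai_char(c) for c in chars) / len(chars)
```

A check like `thai_ratio(text) > 0.5` can be a cheap way to decide whether a string is worth routing to a Thai-specific tokenizer at all.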

## Usage

```python
from transformers import AutoTokenizer

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("ZombitX64/zombitx64-thaitokenizer")

# Tokenize text
text = "สวัสดีครับ\nนี่คือตัวอย่าง"
tokens = tokenizer.tokenize(text)
print(tokens)

# Encode to IDs
token_ids = tokenizer.encode(text)
print(token_ids)

# Decode back
decoded = tokenizer.decode(token_ids)
print(decoded)
```
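
One way to verify the newline behavior yourself is a round-trip check like the sketch below. `newline_roundtrip_ok` is a hypothetical helper, and `CharTokenizer` is only an offline stand-in so the snippet runs without downloading anything; in practice you would pass the tokenizer loaded with `AutoTokenizer.from_pretrained` instead.

```python
def newline_roundtrip_ok(tokenizer, text: str) -> bool:
    """Check that encode followed by decode keeps the same number of newlines."""
    ids = tokenizer.encode(text, add_special_tokens=False)
    decoded = tokenizer.decode(ids)
    return decoded.count("\n") == text.count("\n")


class CharTokenizer:
    """Offline stand-in (assumption): one ID per character, its code point."""

    def encode(self, text, add_special_tokens=True):
        return [ord(c) for c in text]

    def decode(self, ids):
        return "".join(chr(i) for i in ids)


if __name__ == "__main__":
    sample = "สวัสดีครับ\nนี่คือตัวอย่าง"
    print(newline_roundtrip_ok(CharTokenizer(), sample))
```

The check counts newlines rather than comparing strings exactly, because some tokenizers normalize spacing on decode even when line breaks survive.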

## Model Details

- **Model Type**: Thai Tokenizer
- **Language**: Thai (th)
- **Vocab Size**: 112
- **Max Length**: 512
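
Inputs longer than the maximum length need to be truncated or split. The sketch below splits a list of token IDs into consecutive chunks of at most 512 IDs. `chunk_ids` is a hypothetical helper, and it assumes the 512 limit applies to the raw ID sequence; whether special tokens count toward that limit is not stated here.

```python
from typing import Iterator, List

MAX_LENGTH = 512  # taken from the Model Details list


def chunk_ids(token_ids: List[int], max_length: int = MAX_LENGTH) -> Iterator[List[int]]:
    """Yield consecutive slices of token_ids, each at most max_length long."""
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    for start in range(0, len(token_ids), max_length):
        yield token_ids[start:start + max_length]
```

For example, the output of `tokenizer.encode(text)` can be passed through `chunk_ids` and each chunk decoded or processed separately.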

## Training Data

This tokenizer was trained on basic Thai character sets and common patterns.

## Limitations

- Word segmentation is basic; a dedicated segmenter such as PyThaiNLP can improve it
- The vocabulary is small (112 entries) and may need to be extended for specific use cases
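
As a rough illustration of the PyThaiNLP suggestion, the sketch below segments text into words line by line so that newlines stay as separate items. It is not part of this tokenizer: `segment_keep_newlines` is a hypothetical helper, PyThaiNLP is treated as an optional dependency, and when it is not installed each line is returned unsegmented.

```python
from typing import List

try:
    # Optional dependency: pip install pythainlp
    from pythainlp.tokenize import word_tokenize
except ImportError:
    word_tokenize = None


def segment_keep_newlines(text: str) -> List[str]:
    """Split text into words per line, keeping each newline as its own item."""
    words: List[str] = []
    for i, line in enumerate(text.split("\n")):
        if i > 0:
            words.append("\n")  # restore the line break removed by split()
        if not line:
            continue
        if word_tokenize is not None:
            words.extend(word_tokenize(line, engine="newmm"))
        else:
            words.append(line)  # fallback: leave the line unsegmented
    return words
```

Splitting on newlines before segmenting keeps line structure independent of how the word segmenter treats whitespace.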

## Contact

For questions or issues, please visit the [GitHub repository](https://github.com/ZombitX64/ZombitX64-Thaitokenizer).