---

language: th
tags:
- thai
- tokenizer
- nlp
- text-processing
license: mit
---


# ZombitX64 Thai Tokenizer

A simple Thai-language tokenizer that preserves newlines and performs basic Thai text segmentation.

## Features

- **Newline Preservation**: Preserves newlines in tokenized text
- **Thai Character Support**: Recognizes and processes Thai Unicode characters
- **Hugging Face Compatible**: Loads through the `transformers` library with `AutoTokenizer`
- **Simple API**: Easy-to-use `tokenize`, `encode`, and `decode` methods

## Usage

```python
from transformers import AutoTokenizer

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("ZombitX64/zombitx64-thaitokenizer")

# Tokenize text
text = "สวัสดีครับ\nนี่คือตัวอย่าง"
tokens = tokenizer.tokenize(text)
print(tokens)

# Encode to IDs
token_ids = tokenizer.encode(text)
print(token_ids)

# Decode back
decoded = tokenizer.decode(token_ids)
print(decoded)
```
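
To check the newline-preservation claim on your own text, a small round-trip helper can be useful. The sketch below is only an illustration: `newline_roundtrip_ok` and `CharTokenizer` are hypothetical names that are not part of this repository, and `CharTokenizer` is a stand-in so the example runs offline. It assumes the tokenizer object exposes `encode` and `decode` the way `transformers` tokenizers do.

```python
class CharTokenizer:
    """Hypothetical stand-in: maps each character to its Unicode code point."""

    def encode(self, text, add_special_tokens=True):
        # The flag is accepted for interface compatibility and ignored here.
        return [ord(ch) for ch in text]

    def decode(self, ids):
        return "".join(chr(i) for i in ids)


def newline_roundtrip_ok(tokenizer, text):
    """Return True if decode(encode(text)) keeps the same number of newlines."""
    ids = tokenizer.encode(text, add_special_tokens=False)
    decoded = tokenizer.decode(ids)
    return decoded.count("\n") == text.count("\n")


text = "สวัสดีครับ\nนี่คือตัวอย่าง"
print(newline_roundtrip_ok(CharTokenizer(), text))
```

To run the same check against this tokenizer, pass the object returned by `AutoTokenizer.from_pretrained(...)` in place of `CharTokenizer()`.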

## Model Details

- **Model Type**: Thai Tokenizer
- **Language**: Thai (th)
- **Vocab Size**: 112
- **Max Length**: 512

## Training Data

This tokenizer was trained on basic Thai character sets and common patterns.

## Limitations

- Word segmentation is basic; pre-segmenting text with pythainlp can improve it
- The vocabulary is small and simple, but can be expanded for specific use cases
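
One possible way to combine pythainlp with a newline-preserving pipeline is to segment each line separately and keep `"\n"` as its own token. This is a sketch under assumptions, not the method used by this tokenizer: `load_pythainlp_segmenter` and `segment_preserving_newlines` are hypothetical helper names, pythainlp is treated as an optional dependency (`pip install pythainlp`), and the `newmm` engine is just one reasonable choice.

```python
from typing import Callable, List


def load_pythainlp_segmenter() -> Callable[[str], List[str]]:
    """Return a word segmenter backed by pythainlp (optional dependency)."""
    # Imported lazily so the rest of this sketch works without pythainlp.
    from pythainlp.tokenize import word_tokenize

    return lambda line: word_tokenize(line, engine="newmm")


def segment_preserving_newlines(
    text: str, segment: Callable[[str], List[str]]
) -> List[str]:
    """Segment each line on its own and emit a newline token between lines."""
    words: List[str] = []
    lines = text.split("\n")
    for i, line in enumerate(lines):
        if line:
            words.extend(segment(line))
        if i < len(lines) - 1:
            # Re-insert the newline that split() removed.
            words.append("\n")
    return words
```

With pythainlp installed, `segment_preserving_newlines(text, load_pythainlp_segmenter())` returns a word list whose concatenation should match the input, newlines included. Any other callable that maps a string to a list of strings can be passed as `segment`.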

## Contact

For questions or issues, please visit the [GitHub repository](https://github.com/ZombitX64/ZombitX64-Thaitokenizer).