---

language: th
tags:
- thai
- tokenizer
- nlp
- text-processing
license: mit
---


# ZombitX64 Thai Tokenizer

A simple Thai-language tokenizer that preserves newlines and performs basic Thai text segmentation.

## Features

- **Newline Preservation**: Correctly handles and preserves newlines in tokenized text
- **Thai Character Support**: Recognizes and processes Thai Unicode characters
- **Hugging Face Compatible**: Works with the `transformers` library
- **Simple API**: Easy-to-use methods for tokenizing, encoding, and decoding text
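
To make the first two features concrete, here is a minimal, self-contained sketch of what newline-preserving, Thai-aware splitting can look like. It is an illustration only and not this tokenizer's actual implementation: the helper names `is_thai_char` and `simple_tokenize` are hypothetical, and the per-character splitting strategy is an assumption. The only fact relied on is that the Thai Unicode block spans U+0E00 to U+0E7F.

```python
from typing import List

# The Thai Unicode block covers U+0E00 through U+0E7F.
THAI_BLOCK_START = 0x0E00
THAI_BLOCK_END = 0x0E7F


def is_thai_char(ch: str) -> bool:
    """Return True if the single character lies in the Thai Unicode block."""
    return THAI_BLOCK_START <= ord(ch) <= THAI_BLOCK_END


def simple_tokenize(text: str) -> List[str]:
    """Hypothetical splitter: one token per Thai character, one token per
    newline, and runs of any other characters grouped into a single token."""
    tokens: List[str] = []
    buffer: List[str] = []  # accumulates a run of non-Thai, non-newline characters
    for ch in text:
        if ch == "\n" or is_thai_char(ch):
            if buffer:
                tokens.append("".join(buffer))
                buffer = []
            tokens.append(ch)
        else:
            buffer.append(ch)
    if buffer:
        tokens.append("".join(buffer))
    return tokens


sample = "สวัสดี\nabc"
sample_tokens = simple_tokenize(sample)
print(sample_tokens)
# Because no character is dropped, joining the tokens restores the input,
# including the newline.
print("".join(sample_tokens) == sample)
```

Keeping `"\n"` as its own token is what makes the round trip lossless in this sketch; a splitter that discards whitespace could not reconstruct the original line breaks.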

## Usage

```python
from transformers import AutoTokenizer

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("ZombitX64/zombitx64-thaitokenizer")

# Tokenize text
text = "สวัสดีครับ\nนี่คือตัวอย่าง"
tokens = tokenizer.tokenize(text)
print(tokens)

# Encode to IDs
token_ids = tokenizer.encode(text)
print(token_ids)

# Decode back
decoded = tokenizer.decode(token_ids)
print(decoded)
```

## Model Details

- **Model Type**: Thai Tokenizer
- **Language**: Thai (th)
- **Vocab Size**: 112
- **Max Length**: 512
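
If "Max Length" refers to the maximum sequence length in tokens (an assumption; the card does not define it), longer inputs need to be split before use. The sketch below shows one simple way to do that on a plain list of token IDs. The helper `chunk_ids` is hypothetical and does not account for any special tokens the tokenizer might add.

```python
from typing import List

MAX_LENGTH = 512  # taken from the model details above


def chunk_ids(token_ids: List[int], max_length: int = MAX_LENGTH) -> List[List[int]]:
    """Split a list of token IDs into consecutive chunks of at most max_length."""
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    return [
        token_ids[start:start + max_length]
        for start in range(0, len(token_ids), max_length)
    ]


# Example with dummy IDs standing in for the output of tokenizer.encode(...)
dummy_ids = list(range(1200))
chunks = chunk_ids(dummy_ids)
print([len(chunk) for chunk in chunks])
```

Non-overlapping chunks are the simplest choice; if context across chunk boundaries matters, an overlapping window would be a reasonable alternative.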

## Training Data

This tokenizer was trained on basic Thai character sets and common patterns.

## Limitations

- Basic Thai word segmentation (can be improved with PyThaiNLP)
- Simple vocabulary (expandable for specific use cases)

## Contact

For questions or issues, please visit the [GitHub repository](https://github.com/ZombitX64/ZombitX64-Thaitokenizer).