JonusNattapong: Upload Thai tokenizer v1.0.0 (commit 6a65062, verified, 9 months ago)

	---
	language: th
	tags:
	- thai
	- tokenizer
	- nlp
	- text-processing
	license: mit
	---

	# ZombitX64 Thai Tokenizer

	A simple Thai-language tokenizer that preserves newlines and performs basic Thai text segmentation.

	## Features

	- Newline Preservation: Correctly handles and preserves newlines in tokenized text
	- Thai Character Support: Recognizes and processes Thai Unicode characters
	- Hugging Face Compatible: Works with the `transformers` library
	- Simple API: Easy-to-use tokenize and detokenize methods
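
To make the first two features concrete, here is a minimal sketch of newline-preserving segmentation based on the Thai Unicode block (U+0E00 to U+0E7F). It is not this repository's implementation: the names `is_thai_char`, `sketch_tokenize`, and `sketch_detokenize` are hypothetical, and the regular expression is only one plausible way to keep newlines as separate tokens.

```python
import re
from typing import List

# Thai Unicode block: U+0E00 to U+0E7F.
THAI_RUN = r"[\u0E00-\u0E7F]+"

# Alternatives, in order: a single newline, a run of Thai characters,
# a run of other non-space characters, a run of non-newline whitespace.
# Every character matches exactly one alternative, so nothing is dropped.
TOKEN_PATTERN = re.compile(rf"\n|{THAI_RUN}|[^\s\u0E00-\u0E7F]+|[^\S\n]+")


def is_thai_char(ch: str) -> bool:
    """Return True if the character lies in the Thai Unicode block."""
    return "\u0E00" <= ch <= "\u0E7F"


def sketch_tokenize(text: str) -> List[str]:
    """Split text into tokens, keeping each newline as its own token."""
    return TOKEN_PATTERN.findall(text)


def sketch_detokenize(tokens: List[str]) -> str:
    """Rebuild the original text by concatenating the tokens."""
    return "".join(tokens)


text = "สวัสดีครับ\nนี่คือตัวอย่าง"
tokens = sketch_tokenize(text)
print(tokens)  # ['สวัสดีครับ', '\n', 'นี่คือตัวอย่าง']
print(sketch_detokenize(tokens) == text)  # True
```

Because every character falls into exactly one alternative of the pattern, concatenating the tokens reproduces the input, which is the property the newline-preservation feature describes. This sketch does not split a Thai run into words.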

	## Usage

	```python
	from transformers import AutoTokenizer

	# Load the tokenizer
	tokenizer = AutoTokenizer.from_pretrained("ZombitX64/zombitx64-thaitokenizer")

	# Tokenize text
	text = "สวัสดีครับ\nนี่คือตัวอย่าง"
	tokens = tokenizer.tokenize(text)
	print(tokens)

	# Encode to IDs
	token_ids = tokenizer.encode(text)
	print(token_ids)

	# Decode back
	decoded = tokenizer.decode(token_ids)
	print(decoded)
	```

	## Model Details

	- Model Type: Thai Tokenizer
	- Language: Thai (th)
	- Vocab Size: 112
	- Max Length: 512
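
Inputs longer than the 512-token maximum need to be shortened or split before further processing. The sketch below shows one way to split a list of token IDs into consecutive chunks of at most 512 entries. The helper `chunk_ids` is hypothetical and not part of this tokenizer. It also does not reserve room for special tokens, and whether the tokenizer truncates on its own is not stated here.

```python
from typing import List

MAX_LENGTH = 512  # "Max Length" from the model details above


def chunk_ids(token_ids: List[int], max_length: int = MAX_LENGTH) -> List[List[int]]:
    """Split token IDs into consecutive chunks of at most max_length items."""
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    return [
        token_ids[start:start + max_length]
        for start in range(0, len(token_ids), max_length)
    ]


# Example with dummy IDs; real IDs would come from tokenizer.encode(text).
chunks = chunk_ids(list(range(1200)))
print([len(chunk) for chunk in chunks])  # [512, 512, 176]
```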

	## Training Data

	This tokenizer was trained on basic Thai character sets and common patterns.

	## Limitations

	- Word segmentation is basic (it can be improved with a dedicated library such as pythainlp)
	- The vocabulary is small (it can be expanded for specific use cases)
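
As an illustration of the first limitation, the sketch below segments each line with PyThaiNLP's `word_tokenize` (the `newmm` engine) and re-inserts newlines as separate tokens. This is an assumption about how the two could be combined, not something this repository provides. `segment_lines` is a hypothetical helper, and when `pythainlp` is not installed it falls back to returning each line as a single token.

```python
from typing import List

try:
    # Optional dependency: pip install pythainlp
    from pythainlp.tokenize import word_tokenize
except ImportError:
    word_tokenize = None


def segment_lines(text: str) -> List[str]:
    """Segment Thai text line by line, keeping newlines as their own tokens."""
    tokens: List[str] = []
    lines = text.split("\n")
    for index, line in enumerate(lines):
        if line:
            if word_tokenize is not None:
                # Dictionary-based maximal matching from PyThaiNLP.
                tokens.extend(word_tokenize(line, engine="newmm", keep_whitespace=True))
            else:
                # Fallback: no word segmentation, keep the line whole.
                tokens.append(line)
        if index < len(lines) - 1:
            tokens.append("\n")
    return tokens


print(segment_lines("สวัสดีครับ\nนี่คือตัวอย่าง"))
```

The exact word boundaries depend on the PyThaiNLP version and its dictionary, so no expected output is shown.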

	## Contact

	For questions or issues, please visit the [GitHub repository](https://github.com/ZombitX64/ZombitX64-Thaitokenizer).