JonusNattapong committed
Commit 6a65062 · verified · 1 Parent(s): 3c2032b

Upload Thai tokenizer v1.0.0

Files changed (4)
  1. README.md +62 -0
  2. special_tokens_map.json +6 -0
  3. tokenizer_config.json +12 -0
  4. vocab.json +114 -0
README.md ADDED
@@ -0,0 +1,62 @@
+ ---
+ language: th
+ tags:
+ - thai
+ - tokenizer
+ - nlp
+ - text-processing
+ license: mit
+ ---
+
+ # ZombitX64 Thai Tokenizer
+
+ A simple Thai language tokenizer that properly handles newlines and Thai text segmentation.
+
+ ## Features
+
+ - **Newline Preservation**: Correctly handles and preserves newlines in tokenized text
+ - **Thai Character Support**: Recognizes and processes Thai Unicode characters
+ - **Hugging Face Compatible**: Works with transformers library
+ - **Simple API**: Easy to use tokenize and detokenize methods
+
+ ## Usage
+
+ ```python
+ from transformers import AutoTokenizer
+
+ # Load the tokenizer
+ tokenizer = AutoTokenizer.from_pretrained("ZombitX64/zombitx64-thaitokenizer")
+
+ # Tokenize text
+ text = "สวัสดีครับ\nนี่คือตัวอย่าง"
+ tokens = tokenizer.tokenize(text)
+ print(tokens)
+
+ # Encode to IDs
+ token_ids = tokenizer.encode(text)
+ print(token_ids)
+
+ # Decode back
+ decoded = tokenizer.decode(token_ids)
+ print(decoded)
+ ```
+
+ ## Model Details
+
+ - **Model Type**: Thai Tokenizer
+ - **Language**: Thai (th)
+ - **Vocab Size**: 112
+ - **Max Length**: 512
+
+ ## Training Data
+
+ This tokenizer was trained on basic Thai character sets and common patterns.
+
+ ## Limitations
+
+ - Basic Thai word segmentation (can be improved with pythainlp)
+ - Simple vocabulary (expandable for specific use cases)
+
+ ## Contact
+
+ For questions or issues, please visit the [GitHub repository](https://github.com/ZombitX64/ZombitX64-Thaitokenizer).
special_tokens_map.json ADDED
@@ -0,0 +1,6 @@
+ {
+ "pad_token": "[PAD]",
+ "unk_token": "[UNK]",
+ "bos_token": "[BOS]",
+ "eos_token": "[EOS]"
+ }
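
Reviewer note: the map above only names the four special tokens; how they are applied is not shown in this commit. As a rough sketch, assuming the conventional use (wrap a sequence in `[BOS]`/`[EOS]` and right-pad with `[PAD]`), the helper below illustrates the idea. `add_special_tokens` is a hypothetical name, not a function in this repository; the ids come from `vocab.json` in the same commit.

```python
# Special-token ids as listed in this commit's vocab.json.
SPECIAL_IDS = {"[PAD]": 0, "[UNK]": 1, "[BOS]": 2, "[EOS]": 3}


def add_special_tokens(ids, max_length=None):
    """Wrap ids in [BOS]/[EOS]; optionally right-pad with [PAD] to max_length.

    Hypothetical helper for illustration only; not part of this repository.
    """
    wrapped = [SPECIAL_IDS["[BOS]"], *ids, SPECIAL_IDS["[EOS]"]]
    if max_length is not None:
        if len(wrapped) > max_length:
            raise ValueError("sequence is longer than max_length")
        wrapped += [SPECIAL_IDS["[PAD]"]] * (max_length - len(wrapped))
    return wrapped


# 52 and 49 are the ids of "ส" and "ว" in vocab.json.
print(add_special_tokens([52, 49], max_length=6))
```

Whether padding should go on the right or the left depends on the downstream model; right-padding is only an assumption here.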
tokenizer_config.json ADDED
@@ -0,0 +1,12 @@
+ {
+ "name": "ZombitX64-Thaitokenizer",
+ "version": "1.0.0",
+ "description": "Thai language tokenizer with newline preservation",
+ "vocab_size": 112,
+ "max_length": 512,
+ "pad_token": "[PAD]",
+ "unk_token": "[UNK]",
+ "bos_token": "[BOS]",
+ "eos_token": "[EOS]",
+ "model_type": "thai_tokenizer"
+ }
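
Reviewer note: the config declares `vocab_size` and four special tokens, and those values are only meaningful if they agree with `vocab.json`. A minimal consistency check might look like the sketch below. `check_config` is a hypothetical helper, and the config is inlined as a string so the example stands alone; in practice the two JSON files would be read from disk.

```python
import json

# Inline copy of the relevant fields from tokenizer_config.json in this commit.
CONFIG = json.loads("""
{
  "vocab_size": 112,
  "max_length": 512,
  "pad_token": "[PAD]",
  "unk_token": "[UNK]",
  "bos_token": "[BOS]",
  "eos_token": "[EOS]"
}
""")


def check_config(config, vocab):
    """Return a list of human-readable problems; an empty list means consistent."""
    problems = []
    if config["vocab_size"] != len(vocab):
        problems.append(
            f"vocab_size is {config['vocab_size']} but vocab has {len(vocab)} entries"
        )
    for key in ("pad_token", "unk_token", "bos_token", "eos_token"):
        if config[key] not in vocab:
            problems.append(f"{key} {config[key]!r} is missing from the vocabulary")
    return problems


# A deliberately incomplete stand-in vocabulary, to show a reported mismatch.
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "[BOS]": 2, "[EOS]": 3}
print(check_config(CONFIG, toy_vocab))
```

Run against the full `vocab.json` from this commit (ids 0 through 111), the size check should pass, since the file lists 112 entries.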
vocab.json ADDED
@@ -0,0 +1,114 @@
+ {
+ "[PAD]": 0,
+ "[UNK]": 1,
+ "[BOS]": 2,
+ "[EOS]": 3,
+ "\n": 4,
+ "\r\n": 5,
+ "\r": 6,
+ " ": 7,
+ "\t": 8,
+ " ": 9,
+ " ": 10,
+ "ก": 11,
+ "ข": 12,
+ "ฃ": 13,
+ "ค": 14,
+ "ฅ": 15,
+ "ฆ": 16,
+ "ง": 17,
+ "จ": 18,
+ "ฉ": 19,
+ "ช": 20,
+ "ซ": 21,
+ "ฌ": 22,
+ "ญ": 23,
+ "ฎ": 24,
+ "ฏ": 25,
+ "ฐ": 26,
+ "ฑ": 27,
+ "ฒ": 28,
+ "ณ": 29,
+ "ด": 30,
+ "ต": 31,
+ "ถ": 32,
+ "ท": 33,
+ "ธ": 34,
+ "น": 35,
+ "บ": 36,
+ "ป": 37,
+ "ผ": 38,
+ "ฝ": 39,
+ "พ": 40,
+ "ฟ": 41,
+ "ภ": 42,
+ "ม": 43,
+ "ย": 44,
+ "ร": 45,
+ "ฤ": 46,
+ "ล": 47,
+ "ฦ": 48,
+ "ว": 49,
+ "ศ": 50,
+ "ษ": 51,
+ "ส": 52,
+ "ห": 53,
+ "ฬ": 54,
+ "อ": 55,
+ "ฮ": 56,
+ "ะ": 57,
+ "ั": 58,
+ "า": 59,
+ "ำ": 60,
+ "ิ": 61,
+ "ี": 62,
+ "ึ": 63,
+ "ื": 64,
+ "ุ": 65,
+ "ู": 66,
+ "ฺ": 67,
+ "฻": 68,
+ "฼": 69,
+ "฽": 70,
+ "฾": 71,
+ "฿": 72,
+ "เ": 73,
+ "แ": 74,
+ "โ": 75,
+ "ใ": 76,
+ "ไ": 77,
+ "ๅ": 78,
+ "ๆ": 79,
+ "็": 80,
+ "่": 81,
+ "้": 82,
+ "๊": 83,
+ "๋": 84,
+ "์": 85,
+ "ํ": 86,
+ "๎": 87,
+ "๐": 88,
+ "๑": 89,
+ "๒": 90,
+ "๓": 91,
+ "๔": 92,
+ "๕": 93,
+ "๖": 94,
+ "๗": 95,
+ "๘": 96,
+ "๙": 97,
+ ".": 98,
+ ",": 99,
+ "!": 100,
+ "?": 101,
+ ";": 102,
+ ":": 103,
+ "\"": 104,
+ "'": 105,
+ "(": 106,
+ ")": 107,
+ "[": 108,
+ "]": 109,
+ "{": 110,
+ "}": 111
+ }
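
Reviewer note: the vocabulary consists almost entirely of single characters plus the two-character entry `"\r\n"`, which suggests character-level tokenization, although the commit does not include the tokenizer code itself. The sketch below shows one way such a vocabulary could be applied, assuming greedy longest-match lookup with an `[UNK]` fallback. `encode` and `decode` are hypothetical names, and `VOCAB` is only an excerpt of the file above.

```python
# Excerpt of vocab.json from this commit; the full file has 112 entries.
VOCAB = {
    "[PAD]": 0, "[UNK]": 1, "[BOS]": 2, "[EOS]": 3,
    "\n": 4, "\r\n": 5, "\r": 6,
    "ค": 14, "ด": 30, "บ": 36, "ร": 45, "ว": 49, "ส": 52,
    "ั": 58, "ี": 62,
}
SPECIAL = {"[PAD]", "[UNK]", "[BOS]", "[EOS]"}
MATCHABLE = {tok for tok in VOCAB if tok not in SPECIAL}
MAX_LEN = max(len(tok) for tok in MATCHABLE)  # 2, because of "\r\n"
ID_TO_TOKEN = {idx: tok for tok, idx in VOCAB.items()}


def encode(text):
    """Greedy longest-match lookup; unmatched characters become [UNK]."""
    ids = []
    pos = 0
    while pos < len(text):
        for size in range(MAX_LEN, 0, -1):
            piece = text[pos:pos + size]
            if piece in MATCHABLE:
                ids.append(VOCAB[piece])
                pos += len(piece)
                break
        else:
            # No vocabulary entry starts at this position.
            ids.append(VOCAB["[UNK]"])
            pos += 1
    return ids


def decode(ids):
    """Concatenate tokens, dropping special tokens (so [UNK] is lossy)."""
    return "".join(ID_TO_TOKEN[i] for i in ids if ID_TO_TOKEN[i] not in SPECIAL)


text = "สวัสดี\r\nครับ"
ids = encode(text)
print(ids)
print(decode(ids) == text)
```

Trying the longest candidate first is what keeps `"\r\n"` as a single token (id 5) instead of splitting it into `"\r"` and `"\n"`, which matters for the newline preservation the README advertises.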