metadata
license: mit
language:
  - ind
tags:
  - tokenizer
  - bpe
  - flexitok
  - fineweb2

Byte-Level BPE Tokenizer: ind_Latn (16K)

A byte-level BPE tokenizer trained on the ind_Latn (Indonesian, Latin script) portion of Fineweb-2-HQ.

Training Details

| Parameter | Value |
| --- | --- |
| Algorithm | Byte-Level BPE |
| Language | ind_Latn |
| Target Vocab Size | 16,000 |
| Final Vocab Size | 16,961 |
| Pre-tokenizer | custom:ind_Latn |
| Number handling | ltr_3digit |
| Contraction handling | True |
| Normalizer | NFC |
| Special Tokens | <s>, </s>, <pad>, <unk> |
| Training Shards | 2 |
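Two of the settings above, the NFC normalizer and ltr_3digit number handling, can be illustrated with the standard library alone. The sketch below is not this tokenizer's actual pre-tokenizer: it assumes that ltr_3digit means "split each run of ASCII digits into chunks of at most three, starting from the left", which matches the sample encoding in this card (12345 becomes 123 and 45). The function names are hypothetical.

```python
import re
import unicodedata

# Matches either a run of ASCII digits or a run of anything else.
_RUNS = re.compile(r"(?P<num>[0-9]+)|(?P<other>[^0-9]+)")


def nfc(text: str) -> str:
    """Apply Unicode NFC normalization (compose base letters and combining marks)."""
    return unicodedata.normalize("NFC", text)


def split_digits_ltr(digits: str, size: int = 3) -> list[str]:
    """Split a digit string into chunks of `size`, left to right.

    Assumed reading of `ltr_3digit`; any remainder ends up in the last chunk.
    """
    return [digits[i:i + size] for i in range(0, len(digits), size)]


def chunk_numbers(text: str) -> list[str]:
    """Split text into pieces, chunking digit runs and keeping other runs whole."""
    pieces: list[str] = []
    for match in _RUNS.finditer(text):
        if match.lastgroup == "num":
            pieces.extend(split_digits_ltr(match.group()))
        else:
            pieces.append(match.group())
    return pieces


if __name__ == "__main__":
    print(split_digits_ltr("12345"))
    print(chunk_numbers(nfc("harga 1234567 rupiah")))
```

Chunking from the left keeps the leading digits of a number in a full-size piece; a right-to-left scheme would instead align chunks with thousands separators. Which of the two a given tokenizer uses is a training-time choice, so this sketch should be checked against real encodings before relying on it.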

Usage

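from transformers import AutoTokenizer

# Download the tokenizer from the Hugging Face Hub (cached after the first call)
tokenizer = AutoTokenizer.from_pretrained("flexitok/bpe_ltr_ind_Latn_16000_v2")
# encode() returns a list of integer token IDs
token_ids = tokenizer.encode("Hello, world!")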

Files

  • tokenizer.json — Full HuggingFace tokenizer
  • vocab.json — Vocabulary mapping
  • merges.txt — BPE merge rules
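To show how vocab.json and merges.txt relate, here is a minimal sketch that parses both formats from strings. It assumes the common GPT-2 style layout: vocab.json is a JSON object mapping token strings to IDs, and merges.txt holds one space-separated pair per line, optionally preceded by a "#version" header line. The files in this repository have not been checked against that assumption, and the sample data below is invented for illustration.

```python
import json


def parse_vocab(vocab_json_text: str) -> dict[str, int]:
    """Parse the contents of a vocab.json file into a token -> ID mapping."""
    return json.loads(vocab_json_text)


def parse_merges(merges_text: str) -> list[tuple[str, str]]:
    """Parse the contents of a merges.txt file into ordered (left, right) pairs.

    Earlier pairs have higher priority when BPE merges are applied.
    """
    lines = merges_text.split("\n")
    # Skip the optional header line (assumed format: "#version: ...").
    if lines and lines[0].startswith("#version"):
        lines = lines[1:]
    merges: list[tuple[str, str]] = []
    for line in lines:
        if not line:
            continue  # ignore blank lines, e.g. a trailing newline
        left, right = line.split(" ")
        merges.append((left, right))
    return merges


if __name__ == "__main__":
    # Hypothetical toy data, not taken from this repository.
    vocab = parse_vocab('{"h": 0, "e": 1, "he": 2}')
    merges = parse_merges("#version: 0.2\nh e\n")
    for left, right in merges:
        # Each merge produces a token that should itself be in the vocabulary.
        print(left, right, "->", vocab.get(left + right))
```

With the real files, the same functions can be fed the output of open(path, encoding="utf-8").read(). tokenizer.json bundles the vocabulary and merges together with the normalizer and pre-tokenizer settings, so it is the file to prefer when loading the tokenizer rather than inspecting it.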

Sample Encoding

| Text | Tokens | Token IDs |
| --- | --- | --- |
| Hello, world! 12345 This is a test. こんにちは | H, ello, ,, Ġw, orld, !, Ġ, 123, 45, ĠThis, Ġis, Ġa, Ġtest, ., Ġ, ãģ, ĵ, ãĤ, ĵ, ãģ | 42, 15107, 14, 429, 4639, 3, 223, 16355, 4529, 13915, 1153, 395, 7029, 16, 223, 9732, 244, 15716, 244, 9732 |
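Token strings such as Ġw or ãģ are not encoding damage. Byte-level BPE tokenizers display each raw byte as a printable stand-in character. The sketch below assumes this tokenizer uses the GPT-2 byte-to-character table (the one used by the Hugging Face ByteLevel pre-tokenizer), which is consistent with the sample above; the helper names are hypothetical. It maps token strings back to bytes and then to text.

```python
def bytes_to_unicode() -> dict[int, str]:
    """Build the GPT-2 byte -> printable character table.

    Printable Latin-1 bytes map to themselves; the remaining bytes
    (control characters, space, etc.) map to code points from U+0100 upward.
    """
    keep = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = keep[:]
    code_points = keep[:]
    offset = 0
    for b in range(256):
        if b not in keep:
            byte_values.append(b)
            code_points.append(256 + offset)
            offset += 1
    return dict(zip(byte_values, map(chr, code_points)))


# Inverse table: stand-in character -> original byte value.
CHAR_TO_BYTE = {ch: b for b, ch in bytes_to_unicode().items()}


def tokens_to_bytes(tokens: list[str]) -> bytes:
    """Concatenate token strings and convert every character back to its byte."""
    return bytes(CHAR_TO_BYTE[ch] for token in tokens for ch in token)


def tokens_to_text(tokens: list[str]) -> str:
    """Decode tokens to text; incomplete UTF-8 sequences become U+FFFD."""
    return tokens_to_bytes(tokens).decode("utf-8", errors="replace")


if __name__ == "__main__":
    print(repr(tokens_to_text(["H", "ello", ",", "Ġw", "orld", "!"])))
    print(repr(tokens_to_text(["ãģ", "ĵ", "ãĤ", "ĵ"])))
```

Under this mapping Ġ stands for the space byte, and the pair ãģ, ĵ covers the three UTF-8 bytes of こ. The last token listed in the table, ãģ, holds only the first two bytes of a three-byte character, so it does not decode to a complete character on its own.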