Fast ByteLevel BPE Tokenizer

A fast ByteLevel BPE tokenizer trained on Modal on a mixed corpus of English and Spanish web text, code, Wikipedia, and educational web pages.

Tokenizer trained in 3.50 minutes.

Overview

This tokenizer was trained from scratch using Hugging Face tokenizers with a ByteLevel BPE setup.
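
For orientation, a ByteLevel BPE setup in Hugging Face tokenizers might look roughly like the sketch below. This is not the actual training script: the helper name train_bytelevel_bpe, the toy in-memory corpus, and the trainer options are illustrative assumptions. Only the model type and the 32,000 vocab size come from this card.

```python
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers


def train_bytelevel_bpe(texts, vocab_size=32_000):
    """Train a ByteLevel BPE tokenizer on an iterable of strings (hypothetical helper)."""
    tok = Tokenizer(models.BPE())
    # Assumption: prefix space enabled, matching the leading-space behavior
    # described in the Notes section of this card.
    tok.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)
    tok.decoder = decoders.ByteLevel()
    trainer = trainers.BpeTrainer(
        vocab_size=vocab_size,
        # Seed the vocabulary with all 256 byte symbols so any input can be encoded.
        initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
        show_progress=False,
    )
    tok.train_from_iterator(texts, trainer=trainer)
    return tok


# Toy corpus and a small vocab, only to show the call shape.
toy = train_bytelevel_bpe(["Hello world!", "Hola amigo."] * 50, vocab_size=300)
print(toy.encode("Hello!").tokens)
```

Calling save("tokenizer.json") on the returned object would write the all-in-one file described under Files.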

Training Stats

Field              Value
Tokenizer type     ByteLevel BPE
Vocab size         32,000
Target texts       700,000
Elapsed minutes    3.50
Training platform  Modal
Output file        tokenizer.json

Dataset Mix

Share  Dataset                      Config       Split  Column
50%    allenai/c4                   en           train  text
20%    HuggingFaceFW/fineweb-edu    None         train  text
10%    wikimedia/wikipedia          20231101.en  train  text
10%    codeparrot/codeparrot-clean  None         train  content
10%    allenai/c4                   es           train  text
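
As a rough worked example, the shares can be turned into per-source text counts. This assumes the percentages are fractions of the 700,000 target texts, which the card does not state explicitly. The names MIX and texts_per_source are made up for illustration.

```python
TARGET_TEXTS = 700_000

# (share in percent, dataset, config) as listed in the Dataset Mix table.
MIX = [
    (50, "allenai/c4", "en"),
    (20, "HuggingFaceFW/fineweb-edu", None),
    (10, "wikimedia/wikipedia", "20231101.en"),
    (10, "codeparrot/codeparrot-clean", None),
    (10, "allenai/c4", "es"),
]


def texts_per_source(target, mix):
    """Split a target text count across sources using integer percent shares."""
    return [(name, config, target * share // 100) for share, name, config in mix]


for name, config, count in texts_per_source(TARGET_TEXTS, MIX):
    print(f"{name} ({config}): {count:,} texts")
```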

Files

Expected files:

tokenizer-bpe-32k/
├── tokenizer.json
├── metadata.json
├── README.md
├── vocab.json      # optional, if exported
└── merges.txt      # optional, if exported

tokenizer.json is the main all-in-one tokenizer file.

vocab.json and merges.txt are optional classic BPE files. Some older GPT-2/RoBERTa-style tools may ask for them.
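
A small check like the following can confirm that a downloaded folder matches the layout above. It is a sketch: the helper name check_tokenizer_folder and the split into required and optional files are assumptions based on the tree shown here.

```python
from pathlib import Path

# Assumed grouping, taken from the file tree in this card.
REQUIRED = ("tokenizer.json", "metadata.json", "README.md")
OPTIONAL = ("vocab.json", "merges.txt")


def check_tokenizer_folder(folder):
    """Return (missing required files, optional files that are present)."""
    root = Path(folder)
    missing = [name for name in REQUIRED if not (root / name).is_file()]
    optional_present = [name for name in OPTIONAL if (root / name).is_file()]
    return missing, optional_present


missing, optional_present = check_tokenizer_folder("./tokenizer-bpe-32k")
print("missing required files:", missing)
print("optional files present:", optional_present)
```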

Install

pip install tokenizers

Load Tokenizer

from tokenizers import Tokenizer

tok = Tokenizer.from_file("./tokenizer-bpe-32k/tokenizer.json")

enc = tok.encode("Hello!")
print(enc.tokens)
print(enc.ids)

Example output:

['ĠHello', '!']
[25831, 5]

Token Examples

English

Input:  Hello!
Tokens: ['ĠHello', '!']
IDs:    [25831, 5]
Count:  2

Spanish

Input:  Hola amigo, el tokenizer funciona muy bien.
Tokens: ['ĠHol', 'a', 'Ġamigo', ',', 'Ġel', 'Ġtoken', 'izer', 'Ġfunciona', 'Ġmuy', 'Ġbien', '.']
Count:  11

Code

Input:
import torch
print(torch.__version__)

Tokens:
['Ġimport', 'Ġtor', 'ch', 'Ċ', 'print', '(', 'tor', 'ch', '.__', 'version', '__)']

Count: 11

Meme / Emoji Text

Input:  BROOOOOOOOOOOOOO 💀💀💀🔥🔥🔥
Count:  24

Emoji-heavy text may split into many byte-level pieces. That is normal for ByteLevel BPE.
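
The reason is visible at the byte level: each of these emoji takes 4 bytes in UTF-8, so unless the vocabulary has learned merges for them, one emoji can cost up to four tokens. The sketch below only counts bytes with plain Python and does not use the tokenizer. The helper name utf8_byte_counts is illustrative.

```python
def utf8_byte_counts(text):
    """Return (character, number of UTF-8 bytes) for each character in text."""
    return [(ch, len(ch.encode("utf-8"))) for ch in text]


sample = "BRO 💀🔥"
for ch, n_bytes in utf8_byte_counts(sample):
    print(repr(ch), n_bytes)
print("total bytes:", len(sample.encode("utf-8")))
```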

Notes

What does Ġ mean?

Ġ marks a space before a token. For example:

ĠHello

means the token is Hello preceded by a space.

What does Ċ mean?

Ċ represents a newline in byte-level tokenization.
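
Both symbols come from the GPT-2 style byte-to-character table that byte-level BPE uses to give every byte a printable stand-in. The sketch below rebuilds that table in plain Python to show where Ġ and Ċ come from. The function name follows the GPT-2 reference code, and the tokenizers library builds an equivalent table internally rather than calling this function.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable Unicode character."""
    # Bytes that are already printable keep their own character.
    keep = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = keep[:]
    code_points = keep[:]
    shift = 0
    # Remaining bytes (space, newline, control bytes, ...) are moved to 256 and up.
    for b in range(256):
        if b not in keep:
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(cp) for b, cp in zip(byte_values, code_points)}


table = bytes_to_unicode()
print("space   ->", table[ord(" ")])
print("newline ->", table[ord("\n")])
```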

Why does decoding add a leading space?

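This tokenizer was trained with the ByteLevel prefix-space behavior (add_prefix_space in Hugging Face tokenizers), which inserts a space before the first word. So encoding Hello! and then decoding the IDs can give: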

" Hello!"

This is normal for this tokenizer style.
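
If the extra space gets in the way downstream, one option is to drop it after decoding. This is a minimal sketch, not part of the tokenizer. The helper name strip_prefix_space is hypothetical, and it cannot tell whether the original input itself began with a space, so it is lossy in that case.

```python
def strip_prefix_space(decoded: str) -> str:
    """Remove at most one leading space from a decoded string."""
    return decoded[1:] if decoded.startswith(" ") else decoded


# " Hello!" stands in for the result of decoding the IDs of "Hello!".
print(repr(strip_prefix_space(" Hello!")))
```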

Export vocab.json and merges.txt

If you only have tokenizer.json, you can export classic BPE files like this:

from tokenizers import Tokenizer

out = "./tokenizer-bpe-32k"
tok = Tokenizer.from_file(f"{out}/tokenizer.json")

tok.model.save(out)

This should create:

vocab.json
merges.txt

Modal Download Command

To download the tokenizer folder from the Modal Volume:

modal volume get tokenizer-outputs /tokenizer-bpe-32k .

Quick Test

python - <<'PY'
from tokenizers import Tokenizer

tok = Tokenizer.from_file("./tokenizer-bpe-32k/tokenizer.json")

tests = [
    "Hello!",
    "The quick brown fox jumps over the lazy dog.",
    "Hola amigo, el tokenizer funciona muy bien.",
    "def hello_world(): print('hi')",
    "BROOOOOOOOOOOOOO 💀🔥"
]

for text in tests:
    enc = tok.encode(text)
    print("\nTEXT:", text)
    print("TOKENS:", enc.tokens)
    print("IDS:", enc.ids)
    print("COUNT:", len(enc.ids))
PY

License

This tokenizer is released under the Apache License 2.0.

The tokenizer artifacts include:

  • tokenizer.json
  • vocab.json
  • merges.txt
  • metadata.json

The license applies to the tokenizer files in this repository.

Dataset Attribution

This tokenizer was trained on a mixture of public datasets:

  • allenai/c4
  • HuggingFaceFW/fineweb-edu
  • wikimedia/wikipedia
  • codeparrot/codeparrot-clean
  • allenai/c4 Spanish config

Users should respect the licenses and terms of the original datasets. The allenai/c4 dataset card lists its license as odc-by, so attribution is especially important.

Final Verdict

This tokenizer is strong for:

  • English
  • Spanish
  • Python/code-like text
  • URLs and emails
  • General web text

It is weaker for:

  • Emoji-heavy text
  • CJK scripts
  • Korean
  • Arabic
  • Very meme-specific strings

For a 3.50-minute Modal run, this is a solid general-purpose tokenizer for English, Spanish, and code.
