How to use yakul259/english-wordpiece-tokenizer-60k with Transformers:
# Load the tokenizer directly
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("yakul259/english-wordpiece-tokenizer-60k")

This repository contains a custom WordPiece-based tokenizer trained from scratch on the WikiText-103 Raw v1 dataset.
The tokenizer is designed for use in natural language processing tasks such as language modeling, text classification, and information retrieval.
Key Features:
- [CLS] and [SEP] special tokens.
- ## prefix for subwords.
- Dataset split: train

Special tokens:
- [CLS]: Classification token
- [SEP]: Separator token
- [UNK]: Unknown token
- [PAD]: Padding token
- [MASK]: Masking token (MLM tasks)

Post-processing templates:
- Single sequence: [CLS] $A [SEP]
- Sequence pair: [CLS] $A [SEP] $B [SEP]

Implementation:
- ## prefix handling.
- WordPieceTrainer from the Hugging Face tokenizers library.
- Special tokens: [CLS], [SEP], [UNK], [PAD], [MASK]

This tokenizer is released under the MIT License.
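To make the ## subword prefix and the [CLS] $A [SEP] templates concrete, here is a minimal pure-Python sketch of greedy longest-match WordPiece splitting, template wrapping, and ## merging on decode. It is an illustration only, not the code used to build this tokenizer. The toy vocabulary `TOY_VOCAB` and the helpers `wordpiece_split`, `apply_template`, and `merge_subwords` are hypothetical names introduced for this example. The real vocabulary is far larger and the tokenizers library applies additional steps such as pre-tokenization.

```python
from typing import Dict, List, Optional

# Hypothetical toy vocabulary for illustration; not the repository's real vocab.
TOY_VOCAB: Dict[str, int] = {
    "[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "[MASK]": 4,
    "token": 5, "##izer": 6, "##s": 7, "word": 8, "##piece": 9,
}
SPECIAL_TOKENS = {"[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"}


def wordpiece_split(word: str, vocab: Dict[str, int]) -> List[str]:
    """Split one word greedily, always taking the longest piece found in vocab."""
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        match: Optional[str] = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word carry the ## prefix.
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position: the whole word becomes [UNK].
            return ["[UNK]"]
        pieces.append(match)
        start = end
    return pieces


def apply_template(a: List[str], b: Optional[List[str]] = None) -> List[str]:
    """Wrap tokens as [CLS] $A [SEP] or [CLS] $A [SEP] $B [SEP]."""
    out = ["[CLS]"] + a + ["[SEP]"]
    if b is not None:
        out += b + ["[SEP]"]
    return out


def merge_subwords(tokens: List[str]) -> str:
    """Drop special tokens and glue ## pieces back onto the preceding piece."""
    words: List[str] = []
    for tok in tokens:
        if tok in SPECIAL_TOKENS:
            continue
        if tok.startswith("##") and words:
            words[-1] += tok[2:]
        else:
            words.append(tok)
    return " ".join(words)


first = wordpiece_split("tokenizers", TOY_VOCAB)
second = wordpiece_split("wordpiece", TOY_VOCAB)
single = apply_template(first)
pair = apply_template(first, second)
print(single)
print(pair)
print(merge_subwords(pair))
```

With this toy vocabulary, a word with no matching piece at some position (for example "xyz") maps to [UNK], which is the role the unknown token plays in the list above.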
If you use this tokenizer, please cite:
title = Custom WordPiece Tokenizer Trained on WikiText-103 Raw v1
author = yakul259
year = 2025
publisher = Hugging Face