---
language: en
tags:
- tokenizer
- wordpiece
- NLP
- wikitext
license: mit
datasets:
- wikitext
library_name: transformers
---

# **Custom WordPiece Tokenizer (Trained on WikiText-103 Raw v1)**

## **Model Overview**

This repository contains a custom **WordPiece-based tokenizer** trained from scratch on the **WikiText-103 Raw v1** dataset. The tokenizer is designed for use in natural language processing tasks such as **language modeling**, **text classification**, and **information retrieval**.

**Key Features:**

- Custom `[CLS]` and `[SEP]` special tokens.
- WordPiece subword segmentation with a `##` prefix for subwords.
- Template-based post-processing for both single and paired sequences.
- Decoding configured with the WordPiece decoder, which merges `##` subwords back into whole words.

---

## **Training Details**

### **Dataset**

- **Name:** [WikiText-103 Raw v1](https://huggingface.co/datasets/wikitext)
- **Source:** High-quality, long-form Wikipedia articles.
- **Split Used:** `train`
- **Size:** ~103 million tokens
- **Loading Method:** Streaming mode for efficient large-scale training without local storage bottlenecks.

### **Tokenizer Configuration**

- **Model Type:** WordPiece
- **Vocabulary Size:** *60,000* (medium-scale for general-purpose LLMs)
- **Lowercasing:** Enabled
- **Special Tokens:**
  - `[CLS]`: Classification token
  - `[SEP]`: Separator token
  - `[UNK]`: Unknown token
  - `[PAD]`: Padding token
  - `[MASK]`: Masking token (MLM tasks)
- **Post-Processing Template:**
  - **Single Sequence:** `[CLS] $A [SEP]`
  - **Paired Sequences:** `[CLS] $A [SEP] $B [SEP]`
- **Decoder:** WordPiece decoder with `##` prefix handling.

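
The configuration listed above can be approximated with the Hugging Face `tokenizers` library. The following is a minimal sketch, not the exact training script: the `Lowercase` normalizer, the `Whitespace` pre-tokenizer, the type IDs in the pair template, the `build_tokenizer` helper, and the tiny in-memory `sample_corpus` (a stand-in for the WikiText-103 stream) are all assumptions made for illustration.

```python
from tokenizers import Tokenizer, decoders, normalizers, pre_tokenizers
from tokenizers.models import WordPiece
from tokenizers.processors import TemplateProcessing
from tokenizers.trainers import WordPieceTrainer

SPECIAL_TOKENS = ["[CLS]", "[SEP]", "[UNK]", "[PAD]", "[MASK]"]


def build_tokenizer(corpus, vocab_size=60_000):
    """Train a lowercasing WordPiece tokenizer on an iterable of strings."""
    tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))

    # Assumption: lowercasing is done with a plain Lowercase normalizer and
    # words are split on whitespace/punctuation before WordPiece is applied.
    tokenizer.normalizer = normalizers.Lowercase()
    tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

    trainer = WordPieceTrainer(
        vocab_size=vocab_size,
        special_tokens=SPECIAL_TOKENS,
        show_progress=False,
    )
    tokenizer.train_from_iterator(corpus, trainer=trainer)

    # The template needs the IDs assigned to the special tokens in training.
    cls_id = tokenizer.token_to_id("[CLS]")
    sep_id = tokenizer.token_to_id("[SEP]")
    tokenizer.post_processor = TemplateProcessing(
        single="[CLS] $A [SEP]",
        # Assumption: the second sequence gets type ID 1, as in BERT.
        pair="[CLS] $A [SEP] $B:1 [SEP]:1",
        special_tokens=[("[CLS]", cls_id), ("[SEP]", sep_id)],
    )

    # The WordPiece decoder merges "##" continuation pieces back into words.
    tokenizer.decoder = decoders.WordPiece(prefix="##")
    return tokenizer


# Hypothetical toy corpus; the real tokenizer was trained on WikiText-103.
sample_corpus = [
    "hello world",
    "the tokenizer splits rare words into subwords",
]
tokenizer = build_tokenizer(sample_corpus)

single = tokenizer.encode("Hello world")
pair = tokenizer.encode("hello", "world")
print(single.tokens)
print(pair.tokens, pair.type_ids)
print(tokenizer.decode(single.ids))
```

The post-processor is attached after training because the template must reference the IDs of `[CLS]` and `[SEP]`, which exist only once the vocabulary has been built.
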
### **Training Method**

- **Corpus Source:** Streaming iterator from WikiText-103 Raw v1 (train split)
- **Batch Size:** 1000 lines per batch
- **Trainer:** `WordPieceTrainer` from the Hugging Face `tokenizers` library
- **Special Tokens Added:** `[CLS]`, `[SEP]`, `[UNK]`, `[PAD]`, `[MASK]`

---

## **Intended Uses & Limitations**

### Intended Uses

- Tokenizing text for training Transformer-based LLMs.
- Downstream NLP tasks:
  - Language modeling
  - Text classification
  - Question answering
  - Summarization

### Limitations

- Trained exclusively on English Wikipedia text, so performance may degrade in informal, domain-specific, or multilingual contexts.
- May inherit biases present in Wikipedia data.

---

## **License**

This tokenizer is released under the **MIT License**.

---

## **Citation**

If you use this tokenizer, please cite:

- **Title:** Custom WordPiece Tokenizer Trained on WikiText-103 Raw v1
- **Author:** yakul259
- **Year:** 2025
- **Publisher:** Hugging Face