---
language: en
tags:
- tokenizer
- wordpiece
- NLP
- wikitext
license: mit
datasets:
- wikitext
library_name: transformers
---

# **Custom WordPiece Tokenizer (Trained on WikiText-103 Raw v1)**

## **Model Overview**
This repository contains a custom **WordPiece-based tokenizer** trained from scratch on the **WikiText-103 Raw v1** dataset.  
The tokenizer is designed for use in natural language processing tasks such as **language modeling**, **text classification**, and **information retrieval**.  

**Key Features:**
- Custom `[CLS]` and `[SEP]` special tokens.
- WordPiece subword segmentation with `##` prefix for subwords.
- Template-based post-processing for both single and paired sequences.
- Decoding through the WordPiece decoder, which merges `##` subwords back into whole words. Because lowercasing is enabled, decoded text is lowercased rather than an exact copy of the input.
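To illustrate what the `##` prefix means in practice, here is a minimal pure-Python sketch of greedy longest-match-first WordPiece segmentation and of joining the pieces back together. The vocabulary `TOY_VOCAB` and the helper names are hypothetical and exist only for this illustration. The released tokenizer uses the `tokenizers` library and its trained vocabulary, not this code.

```python
def wordpiece_split(word, vocab, unk_token="[UNK]", prefix="##"):
    """Segment one (already lowercased) word by greedy longest match."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate  # non-initial pieces carry "##"
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matches at this position: the whole word is unknown.
            return [unk_token]
        pieces.append(piece)
        start = end
    return pieces


def wordpiece_join(pieces, prefix="##"):
    """Merge "##" continuation pieces back into space-separated words."""
    words = []
    for piece in pieces:
        if piece.startswith(prefix) and words:
            words[-1] += piece[len(prefix):]
        else:
            words.append(piece)
    return " ".join(words)


# Hypothetical toy vocabulary, not the trained 60,000-entry one.
TOY_VOCAB = {"token", "##izer", "##s", "play", "##ing"}

pieces = wordpiece_split("tokenizers", TOY_VOCAB)
print(pieces, "->", wordpiece_join(pieces))
```

A word with no matching pieces, such as `"xyz"` under this toy vocabulary, maps to `[UNK]` as a whole.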

---

## **Training Details**

### **Dataset**
- **Name:** [WikiText-103 Raw v1](https://huggingface.co/datasets/wikitext)
- **Source:** High-quality, long-form Wikipedia articles.
- **Split Used:** `train`
- **Size:** ~103 million tokens
- **Loading Method:** Streaming mode, so the corpus is read incrementally during training instead of being downloaded and stored in full beforehand.

### **Tokenizer Configuration**
- **Model Type:** WordPiece
- **Vocabulary Size:** 60,000 (a mid-sized vocabulary for general-purpose language models)
- **Lowercasing:** Enabled
- **Special Tokens:**
  - `[CLS]` — Classification token
  - `[SEP]` — Separator token
  - `[UNK]` — Unknown token
  - `[PAD]` — Padding token
  - `[MASK]` — Masking token (MLM tasks)
- **Post-Processing Template:**
  - **Single Sequence:** `[CLS] $A [SEP]`
  - **Paired Sequences:** `[CLS] $A [SEP] $B [SEP]`
- **Decoder:** WordPiece decoder with `##` prefix handling.
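
The configuration above could be assembled with the Hugging Face `tokenizers` library roughly as follows. This is a sketch, not the author's training script: the `Whitespace` pre-tokenizer, the function name `build_and_train`, and the toy corpus are assumptions made for illustration, since the card does not describe them.

```python
from tokenizers import Tokenizer, decoders, normalizers, pre_tokenizers
from tokenizers.models import WordPiece
from tokenizers.processors import TemplateProcessing
from tokenizers.trainers import WordPieceTrainer

SPECIAL_TOKENS = ["[CLS]", "[SEP]", "[UNK]", "[PAD]", "[MASK]"]


def build_and_train(lines, vocab_size=60_000):
    """Train a lowercasing WordPiece tokenizer on an iterable of strings."""
    tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
    tokenizer.normalizer = normalizers.Lowercase()
    # Assumption: split on whitespace and punctuation; the card does not say.
    tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
    tokenizer.decoder = decoders.WordPiece(prefix="##")

    trainer = WordPieceTrainer(vocab_size=vocab_size, special_tokens=SPECIAL_TOKENS)
    tokenizer.train_from_iterator(lines, trainer=trainer)

    # The template needs the ids that training assigned to [CLS] and [SEP].
    tokenizer.post_processor = TemplateProcessing(
        single="[CLS] $A [SEP]",
        pair="[CLS] $A [SEP] $B [SEP]",
        special_tokens=[
            ("[CLS]", tokenizer.token_to_id("[CLS]")),
            ("[SEP]", tokenizer.token_to_id("[SEP]")),
        ],
    )
    return tokenizer


# Toy corpus standing in for WikiText-103 (hypothetical data).
toy_corpus = ["hello world", "hello tokenizer", "the world of tokenizers"] * 10
toy_tokenizer = build_and_train(toy_corpus, vocab_size=100)
print(toy_tokenizer.encode("Hello world").tokens)
```

The post-processor is attached after training because the template refers to special tokens by id, and those ids only exist once the vocabulary has been built.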

### **Training Method**
- **Corpus Source:** Streaming iterator from WikiText-103 Raw v1 (train split)
- **Batch Size:** 1000 lines per batch
- **Trainer:** `WordPieceTrainer` from Hugging Face `tokenizers` library
- **Special Tokens Added:** `[CLS]`, `[SEP]`, `[UNK]`, `[PAD]`, `[MASK]`
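
A batching iterator of the kind described above might look like the following sketch. The function name `batch_lines`, the `"text"` field name, and the choice to skip blank lines are assumptions for illustration. The card only states that lines are fed in batches of 1000.

```python
from itertools import islice


def batch_lines(records, batch_size=1000, text_key="text"):
    """Yield lists of up to batch_size non-blank text lines from a record stream."""
    # Assumption: blank lines carry no training signal and are skipped.
    texts = (rec[text_key] for rec in records if rec[text_key].strip())
    while True:
        batch = list(islice(texts, batch_size))
        if not batch:
            return
        yield batch


# Small in-memory stand-in for a streamed dataset (hypothetical records).
demo_records = [{"text": "a"}, {"text": ""}, {"text": "b"}, {"text": "c"}]
print(list(batch_lines(demo_records, batch_size=2)))
```

Assuming the `datasets` library is used for loading, a stream from `load_dataset("wikitext", "wikitext-103-raw-v1", split="train", streaming=True)` yields dictionaries with a `text` field. The result of `batch_lines(stream)` can then be passed to `Tokenizer.train_from_iterator`, which accepts an iterator over strings or over lists of strings.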

---

## **Intended Uses & Limitations**

### Intended Uses
- Tokenizing text for training Transformer-based language models.
- Downstream NLP tasks:
  - Language modeling
  - Text classification
  - Question answering
  - Summarization

### Limitations
- Trained exclusively on English Wikipedia text, so segmentation quality may degrade on informal, domain-specific, or non-English text (more subword fragments and more `[UNK]` tokens).
- May inherit biases present in Wikipedia data.

---

## **License**
This tokenizer is released under the **MIT License**.

---

## **Citation**
If you use this tokenizer, please cite:  

title = Custom WordPiece Tokenizer Trained on WikiText-103 Raw v1  
author = yakul259  
year = 2025  
publisher = Hugging Face