---
languages:
- zh
- en
- multilingual
tags:
- tokenizer
- bilingual
- chinese
- english
- multilingual
license: apache-2.0
---

# QiTianTokenizer-Base

**QiTianTokenizer** is a *universal multilingual tokenizer* primarily optimized for **Chinese–English mixed text**, offering consistent and reversible tokenization across diverse languages and scripts.
It is designed as a **general-purpose tokenizer**, is not tied to any specific model, and is fully compatible with the 🤗 **Transformers** ecosystem.

---

## ✨ Overview

| Property                | Value                                 |
|-------------------------|---------------------------------------|
| **Name**                | QiTianTokenizer-Base                  |
| **Type**                | Tokenizer-only repository             |
| **Purpose**             | General multilingual tokenization     |
| **Primary Languages**   | Chinese, English                      |
| **Extended Support**    | Multilingual (Unicode-complete)       |
| **Architecture**        | Byte-level BPE                        |
| **Vocabulary Size**     | 32,000 tokens                         |
| **Fast Implementation** | ✅ Available (`QiTianTokenizerFast`)  |
| **Framework**           | 🤗 `transformers`                     |
| **License**             | Apache 2.0                            |

---

## 🧩 QiTian Tokenizer Series

| Variant | Vocabulary Size | Description | Recommended Use |
|---------|-----------------|-------------|-----------------|
| [**QiTianTokenizer-Tiny**](https://huggingface.co/Morton-Li/QiTianTokenizer-Tiny) | 12k | Lightweight tokenizer designed for compact or embedded models. | On-device or low-resource tasks |
| [**QiTianTokenizer-Base**](https://huggingface.co/Morton-Li/QiTianTokenizer-Base) | 32k | Balanced vocabulary offering solid coverage and efficiency for most multilingual use cases. | **Recommended for general use** |
| [**QiTianTokenizer-Medium**](https://huggingface.co/Morton-Li/QiTianTokenizer-Medium) | 64k | **Optimal balance in language coverage**: broad enough to capture fine-grained linguistic diversity while maintaining reasonable model complexity. | **Recommended for multilingual and high-quality general-purpose models** |
| [**QiTianTokenizer-Large**](https://huggingface.co/Morton-Li/QiTianTokenizer-Large) | 96k | Extended multilingual vocabulary designed for diverse cross-lingual pretraining and high-capacity language models. | High-resource training |
| [**QiTianTokenizer-XLarge**](https://huggingface.co/Morton-Li/QiTianTokenizer-XLarge) | 128k | Full-script and domain-extensive vocabulary for comprehensive multilingual modeling. | Research & large-scale pretraining |

> All variants share consistent token definitions, special tokens, and compatible configurations.

---

## ⚙️ Usage

You can load this tokenizer directly with `AutoTokenizer`:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Morton-Li/QiTianTokenizer-Base", trust_remote_code=True)

# Example
text = "你好,QiTian!"
tokens = tokenizer(text)
print(tokens["input_ids"])
```

### ➕ Batch Example

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Morton-Li/QiTianTokenizer-Base", trust_remote_code=True)

# Example
texts = ["Hello, 世界!", "QiTian is multilingual."]
batch_tokens = tokenizer(texts, padding=True, return_tensors="pt")
print(batch_tokens["input_ids"])
```

### 💬 Chat Template (`apply_chat_template`)

For chat-style data, you can format a list of messages using `apply_chat_template`:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Morton-Li/QiTianTokenizer-Base", trust_remote_code=True)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "你好,介绍一下 QiTianTokenizer。"},
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,
)
print(text)

# If you need token ids directly:
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    enable_thinking=False,
    return_tensors="pt",
    return_dict=True,  # return a mapping so that inputs["input_ids"] is valid
)
print(inputs["input_ids"])
```

**Parameters**

- `add_generation_prompt`
  - `True`: append the assistant role token (e.g. `<|assistant|>`) at the end, so the model can continue generating.
  - `False`: do not append the generation prompt (useful for evaluating full dialogues).
- `enable_thinking`
  - `True`: wrap the assistant part with a thinking span (e.g. `<|begin_of_think|> ... <|end_of_think|>`), if your training/inference uses it.
  - `False`: keep plain assistant content without the thinking wrapper.

---

## 📦 Files Included

| File                      | Description                                     |
|---------------------------|-------------------------------------------------|
| `tokenizer.json`          | Serialized fast tokenizer definition            |
| `tokenizer_config.json`   | Configuration (max length, padding side, etc.)  |
| `tokenizer.py`            | Tokenizer implementation                        |

---

## 🔍 Special Tokens

| Token                  | Purpose                                            |
|------------------------|----------------------------------------------------|
| `<\|bos\|>`            | Beginning of sequence                              |
| `<\|eos\|>`            | End of sequence                                    |
| `<\|eot\|>`            | End of turn (marks message boundary)               |
| `<\|pad\|>`            | Padding token for batch alignment                  |
| `<\|mask\|>`           | Masked token for MLM-style objectives              |
| `<\|system\|>`         | Defines system or meta-instruction context         |
| `<\|user\|>`           | Marks user message boundary in conversational data |
| `<\|assistant\|>`      | Marks assistant message boundary                   |
| `<\|begin_of_think\|>` | Begin internal reasoning span                      |
| `<\|end_of_think\|>`   | End internal reasoning span                        |

---

## 🔖 License

This tokenizer and vocabulary are released under the **Apache License 2.0**.
You are free to use, modify, and redistribute them under the terms of that license.

---

## 📚 Citation

If you use **QiTianTokenizer** in your research or project, please cite it as:

```bibtex
@misc{QiTianTokenizer,
  title  = {QiTianTokenizer: A Universal Multilingual Tokenizer with Chinese–English Optimization},
  author = {Morton Li},
  year   = {2026},
}
```
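
---

## 🔁 Appendix: Byte-Level Round-Trip Sketch

The Overview lists the architecture as byte-level BPE and describes tokenization as reversible. As a rough illustration of why those two properties go together, the sketch below implements a toy byte-level BPE in plain Python. It is not QiTianTokenizer's implementation: the `TOY_MERGES` table, the `encode` and `decode` helpers, and the ids 256 and 257 are hypothetical, and a real tokenizer also applies pre-tokenization and special-token handling that are omitted here.

```python
# Toy byte-level BPE: ids 0-255 are raw bytes, ids >= 256 are learned merges.
# TOY_MERGES is a made-up table for illustration, not the real vocabulary.
TOY_MERGES = {
    (0xE4, 0xBD): 256,  # first two UTF-8 bytes of "你"
    (256, 0xA0): 257,   # merge 256 with the third byte, giving all of "你"
}


def encode(text, merges):
    """Convert text to ids by applying merges in insertion order."""
    # Every string has a UTF-8 byte form, so no input is out of vocabulary.
    ids = list(text.encode("utf-8"))
    for pair, new_id in merges.items():
        merged, i = [], 0
        while i < len(ids):
            if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
                merged.append(new_id)
                i += 2
            else:
                merged.append(ids[i])
                i += 1
        ids = merged
    return ids


def decode(ids, merges):
    """Expand merged ids back to bytes, then decode the bytes as UTF-8."""
    expand = {new_id: pair for pair, new_id in merges.items()}
    stack = list(reversed(ids))
    raw = bytearray()
    while stack:
        token = stack.pop()
        if token < 256:
            raw.append(token)
        else:
            left, right = expand[token]
            # Push right first so that left is popped (and emitted) first.
            stack.append(right)
            stack.append(left)
    return raw.decode("utf-8")


sample = "你好,QiTian!"
sample_ids = encode(sample, TOY_MERGES)
print(sample_ids)
print(decode(sample_ids, TOY_MERGES) == sample)
```

Because every merge is an exact pair of existing ids, decoding only has to undo the merges and reassemble the bytes, which is what makes the mapping lossless for any Unicode input in this sketch.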