---
language:
    - zh
    - en
    - multilingual
tags:
    - tokenizer
    - bilingual
    - chinese
    - english
    - multilingual
license: apache-2.0
---
# QiTianTokenizer-Base

**QiTianTokenizer** is a *multilingual byte-level BPE tokenizer* optimized primarily for **mixed Chinese–English text**.
It tokenizes consistently and reversibly across languages and scripts.
It is a **general-purpose tokenizer**, not tied to any specific model,
and is compatible with the 🤗 **Transformers** ecosystem.

---

## ✨ Overview

| Property                | Value                                 |
|-------------------------|---------------------------------------|
| **Name**                | QiTianTokenizer-Base                  |
| **Type**                | Tokenizer-only repository             |
| **Purpose**             | General multilingual tokenization     |
| **Primary Languages**   | Chinese, English                      |
| **Extended Support**    | Multilingual (Unicode-complete)       |
| **Architecture**        | Byte-level BPE                        |
| **Vocabulary Size**     | 32,000 tokens                         |
| **Fast Implementation** | ✅ Available (`QiTianTokenizerFast`)   |
| **Framework**           | 🤗 `transformers`                     |
| **License**             | Apache 2.0                            |
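
The "Unicode-complete" entry follows from the byte-level design: every string has a UTF-8 byte encoding, so a vocabulary that covers all 256 byte values can represent any input. The sketch below illustrates only that base layer in plain Python. It does not use QiTianTokenizer or its learned merges, and `to_byte_symbols` / `from_byte_symbols` are hypothetical names chosen for illustration.

```python
from typing import List


def to_byte_symbols(text: str) -> List[int]:
    """Map text to its UTF-8 byte values (each in the range 0..255)."""
    return list(text.encode("utf-8"))


def from_byte_symbols(symbols: List[int]) -> str:
    """Invert to_byte_symbols by decoding the bytes as UTF-8."""
    return bytes(symbols).decode("utf-8")


# Mixed Chinese, Latin, full-width punctuation, and an emoji
sample = "你好，QiTian！🙂"
symbols = to_byte_symbols(sample)

# Every symbol fits in one byte, and the mapping is lossless
print(all(0 <= s <= 255 for s in symbols))
print(from_byte_symbols(symbols) == sample)
```

A byte-level BPE tokenizer then merges frequent byte sequences into longer tokens, which shortens sequences without giving up this coverage.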

---

## 🧩 QiTian Tokenizer Series

| Variant                                                                               | Vocabulary Size | Description                                                                                                                                         | Recommended Use                                                          |
|---------------------------------------------------------------------------------------|-----------------|-----------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------|
| [**QiTianTokenizer-Tiny**](https://huggingface.co/Morton-Li/QiTianTokenizer-Tiny)     | 12k             | Lightweight tokenizer designed for compact or embedded models.                                                                                      | On-device or low-resource tasks                                          |
| [**QiTianTokenizer-Base**](https://huggingface.co/Morton-Li/QiTianTokenizer-Base)     | 32k             | Balanced vocabulary offering solid coverage and efficiency for most multilingual use cases.                                                         | **Recommended for general use**                                          |
| [**QiTianTokenizer-Medium**](https://huggingface.co/Morton-Li/QiTianTokenizer-Medium) | 64k             | **Optimal balance in language coverage** — broad enough to capture fine-grained linguistic diversity while maintaining reasonable model complexity. | **Recommended for multilingual and high-quality general-purpose models** |
| [**QiTianTokenizer-Large**](https://huggingface.co/Morton-Li/QiTianTokenizer-Large)   | 96k             | Extended multilingual vocabulary designed for diverse cross-lingual pretraining and high-capacity language models.                                  | High-resource training                                                   |
| [**QiTianTokenizer-XLarge**](https://huggingface.co/Morton-Li/QiTianTokenizer-XLarge) | 128k            | Full-script and domain-extensive vocabulary for comprehensive multilingual modeling.                                                                | Research & large-scale pretraining                                       |

> All variants share consistent token definitions, special tokens, and compatible configurations.
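
One practical consequence of the vocabulary size is the size of the embedding table, which has `vocab_size × hidden_size` parameters. The sketch below compares the variants under stated assumptions: only the Base size (32,000) is given exactly in this card, the other sizes assume that "k" means ×1000, and `HIDDEN_SIZE = 768` is a hypothetical model width used purely for illustration.

```python
# Nominal vocabulary sizes from the table above.
# Assumption: "k" means x1000; only Base (32,000) is stated exactly.
NOMINAL_VOCAB_SIZES = {
    "Tiny": 12_000,
    "Base": 32_000,
    "Medium": 64_000,
    "Large": 96_000,
    "XLarge": 128_000,
}

HIDDEN_SIZE = 768  # hypothetical model width, for illustration only


def embedding_params(vocab_size: int, hidden_size: int) -> int:
    """Number of parameters in one vocab_size x hidden_size embedding matrix."""
    return vocab_size * hidden_size


for name, size in NOMINAL_VOCAB_SIZES.items():
    millions = embedding_params(size, HIDDEN_SIZE) / 1e6
    print(f"{name:>7}: {millions:.1f}M embedding parameters")
```

If the model's output projection is not tied to the input embedding, this cost is paid twice.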

---

## ⚙️ Usage

You can load this tokenizer directly with `AutoTokenizer`:

```python
from transformers import AutoTokenizer

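# trust_remote_code=True lets Transformers load the custom tokenizer class
# shipped in this repository (tokenizer.py)
tokenizer = AutoTokenizer.from_pretrained("Morton-Li/QiTianTokenizer-Base", trust_remote_code=True)

# Tokenize a mixed Chinese-English string and print its token ids
text = "你好，QiTian！"
tokens = tokenizer(text)
print(tokens["input_ids"])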
```
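
To check reversibility on your own data, you can encode each string, decode it again, and compare. The helper below is a minimal sketch: `roundtrip_failures` is a hypothetical name, and it takes the encode and decode steps as plain callables so that it works with any tokenizer.

```python
from typing import Callable, Iterable, List, Sequence, Tuple


def roundtrip_failures(
    encode: Callable[[str], Sequence[int]],
    decode: Callable[[Sequence[int]], str],
    texts: Iterable[str],
) -> List[Tuple[str, str]]:
    """Return (original, restored) pairs for texts that do not survive encode -> decode."""
    failures = []
    for text in texts:
        restored = decode(encode(text))
        if restored != text:
            failures.append((text, restored))
    return failures


# Hypothetical usage with a tokenizer loaded as in the example above:
# failures = roundtrip_failures(
#     lambda t: tokenizer.encode(t, add_special_tokens=False),
#     tokenizer.decode,
#     ["你好，QiTian！", "Hello, 世界！"],
# )
# print(failures)  # an empty list means every string round-tripped exactly
```

Passing `add_special_tokens=False` keeps any added sequence markers out of the comparison, so the check measures only the text itself.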

### ➕ Batch Example

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Morton-Li/QiTianTokenizer-Base", trust_remote_code=True)

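# Pad to the longest sequence in the batch and return PyTorch tensors
# (return_tensors="pt" requires PyTorch to be installed)
texts = ["Hello, 世界！", "QiTian is multilingual."]
batch_tokens = tokenizer(texts, padding=True, return_tensors="pt")
print(batch_tokens["input_ids"])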
```
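
With `padding=True`, shorter sequences are filled with the pad token up to the length of the longest sequence in the batch, and the returned `attention_mask` marks real tokens with 1 and padding with 0. The sketch below reproduces that logic in plain Python for illustration. `pad_batch` is a hypothetical helper, and the pad id of `0` in the usage lines is made up; the real id comes from `tokenizer.pad_token_id`.

```python
from typing import Dict, List, Sequence


def pad_batch(
    sequences: Sequence[Sequence[int]],
    pad_id: int,
    padding_side: str = "right",
) -> Dict[str, List[List[int]]]:
    """Pad token-id sequences to the longest one and build the attention mask."""
    if not sequences:
        raise ValueError("sequences must not be empty")
    if padding_side not in ("left", "right"):
        raise ValueError("padding_side must be 'left' or 'right'")

    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        ids, mask = list(seq), [1] * len(seq)
        if padding_side == "right":
            input_ids.append(ids + [pad_id] * n_pad)
            attention_mask.append(mask + [0] * n_pad)
        else:
            input_ids.append([pad_id] * n_pad + ids)
            attention_mask.append([0] * n_pad + mask)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Illustrative ids only; they are not real QiTianTokenizer ids
batch = pad_batch([[5, 6, 7], [8]], pad_id=0)
print(batch["input_ids"])
print(batch["attention_mask"])
```

Which side gets padded is a tokenizer setting, so check `tokenizer.padding_side` before relying on either layout.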

### 💬 Chat Template (`apply_chat_template`)

For chat-style data, you can format a list of messages using `apply_chat_template`:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Morton-Li/QiTianTokenizer-Base", trust_remote_code=True)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "你好，介绍一下 QiTianTokenizer。"},
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,
)
print(text)

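# If you need token ids directly. return_dict=True makes the result a
# dict-like object; without it, some transformers versions return only the ids.
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    enable_thinking=False,
    return_dict=True,
    return_tensors="pt",
)
print(inputs["input_ids"])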
```

**Parameters**

- `add_generation_prompt`
  - `True`: append the assistant role token (e.g. `<|assistant|>`) at the end, so the model can continue generating.
  - `False`: do not append a generation prompt (useful when evaluating complete dialogues).

- `enable_thinking`
  - `True`: wrap the assistant part with a thinking span (e.g. `<|begin_of_think|> ... <|end_of_think|>`), if your training/inference uses it.
  - `False`: keep plain assistant content without the thinking wrapper.
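
To see exactly what these flags add for this tokenizer, you can render the same messages twice and compare the results. The helper below is a minimal sketch: `generation_prompt_suffix` is a hypothetical name, and it assumes the template only appends text when `add_generation_prompt=True`, which it verifies before returning.

```python
from typing import Any, Callable, Dict, List


def generation_prompt_suffix(
    apply_chat_template: Callable[..., str],
    messages: List[Dict[str, str]],
    **template_kwargs: Any,
) -> str:
    """Return the text that add_generation_prompt=True appends to the rendered chat."""
    without_prompt = apply_chat_template(
        messages, tokenize=False, add_generation_prompt=False, **template_kwargs
    )
    with_prompt = apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True, **template_kwargs
    )
    # Assumption: the generation prompt is a pure suffix of the rendered text
    if not with_prompt.startswith(without_prompt):
        raise ValueError("generation prompt is not a pure suffix for this template")
    return with_prompt[len(without_prompt):]


# Hypothetical usage with a tokenizer loaded as in the example above:
# for thinking in (False, True):
#     suffix = generation_prompt_suffix(
#         tokenizer.apply_chat_template, messages, enable_thinking=thinking
#     )
#     print(thinking, repr(suffix))
```

Printing the suffix with `repr` makes whitespace and special tokens visible, which helps when matching the prompt format used in training.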

---

## 📦 Files Included

| File                      | Description                                    |
|---------------------------|------------------------------------------------|
| `tokenizer.json`          | Serialized fast tokenizer definition           |
| `tokenizer_config.json`   | Configuration (max length, padding side, etc.) |
| `tokenizer.py`            | Tokenizer implementation                       |

---

## 🔍 Special Tokens

| Token                  | Purpose                                            |
|------------------------|----------------------------------------------------|
| `<\|bos\|>`            | Beginning of sequence                              |
| `<\|eos\|>`            | End of sequence                                    |
| `<\|eot\|>`            | End of turn (marks message boundary)               |
| `<\|pad\|>`            | Padding token for batch alignment                  |
| `<\|mask\|>`           | Masked token for MLM-style objectives              |
| `<\|system\|>`         | Defines system or meta-instruction context         |
| `<\|user\|>`           | Marks user message boundary in conversational data |
| `<\|assistant\|>`      | Marks assistant message boundary                   |
| `<\|begin_of_think\|>` | Begin internal reasoning span                      |
| `<\|end_of_think\|>`   | End internal reasoning span                        |
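
Before training, it can be worth confirming that each of these tokens is present in the loaded vocabulary and noting its id. The sketch below works on the plain `dict` returned by `tokenizer.get_vocab()`. `special_token_ids` is a hypothetical helper, and the token strings are copied from the table above.

```python
from typing import Dict, List, Sequence, Tuple

# Token strings as listed in the table above
EXPECTED_SPECIAL_TOKENS = [
    "<|bos|>", "<|eos|>", "<|eot|>", "<|pad|>", "<|mask|>",
    "<|system|>", "<|user|>", "<|assistant|>",
    "<|begin_of_think|>", "<|end_of_think|>",
]


def special_token_ids(
    vocab: Dict[str, int],
    expected: Sequence[str] = EXPECTED_SPECIAL_TOKENS,
) -> Tuple[Dict[str, int], List[str]]:
    """Split the expected tokens into {token: id} for those found and a list of missing ones."""
    found = {token: vocab[token] for token in expected if token in vocab}
    missing = [token for token in expected if token not in vocab]
    return found, missing


# Hypothetical usage with a tokenizer loaded as in the Usage section:
# found, missing = special_token_ids(tokenizer.get_vocab())
# print(found)
# print(missing)  # expected to be empty if the table matches the vocabulary
```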

---

## 🔖 License

This tokenizer and vocabulary are released under the **Apache License 2.0**.
You are free to use, modify, and redistribute them under the terms of that license.

---

## 📚 Citation

If you use **QiTianTokenizer** in your research or project, please cite it as:

```bibtex
@misc{QiTianTokenizer,
  title  = {QiTianTokenizer: A Universal Multilingual Tokenizer with Chinese–English Optimization},
  author = {Morton Li},
  year   = {2026},
}
```