---
license: apache-2.0
datasets:
- diabolic6045/Sanskrit-shlok-collection
- roneneldan/TinyStories
language:
- sa
- en
pipeline_tag: text-generation
---
# Native Sanskrit-English Tokenizer for Qwen2.5

## Problem Statement

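The original Qwen2.5 tokenizer splits Sanskrit text into byte-level tokens that are both unreadable and inefficient. For the example sentence used in the Usage section (the Qwen token list is abbreviated here):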
- **Qwen's output**: `['à¤¹', 'à¤°', 'à¥ĩ', 'Ġà¤ķ', 'à¥', 'ĥ', 'à¤·', 'à¥įà¤', '£']` (36 tokens)
- **Our output**: `['▁हरे', '▁कृष्ण', '▁हरे', '▁कृष्ण', '▁कृष्ण', '▁कृष्ण', '▁हरे', '▁हरे']` (8 tokens)

On this example, the tokenizer needs 4.5x fewer tokens (8 instead of 36), and every token is a readable word.

## Usage

```python
from transformers import AutoTokenizer

# Load tokenizer (native Hugging Face format)
tokenizer = AutoTokenizer.from_pretrained("diabolic6045/Sanskrit-English-qwen2-tokenizer")

# Test Sanskrit tokenization
text = "हरे कृष्ण हरे कृष्ण कृष्ण कृष्ण हरे हरे"
tokens = tokenizer.tokenize(text)
print(tokens)  # ['▁हरे', '▁कृष्ण', '▁हरे', '▁कृष्ण', '▁कृष्ण', '▁कृष्ण', '▁हरे', '▁हरे']

# Perfect reconstruction
decoded = tokenizer.decode(tokenizer.encode(text))
print(decoded)  # "हरे कृष्ण हरे कृष्ण कृष्ण कृष्ण हरे हरे"

# Chat template support
messages = [{'role': 'user', 'content': 'What is the meaning of हरे कृष्ण?'}]
formatted = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(formatted)
```

## Performance Comparison

| Tokenizer | Tokens (example sentence) | Readable | Relative token count | Format |
|-----------|---------------------------|----------|----------------------|--------|
| **Ours** | 8 | Yes | **4.5x fewer** | Native HF |
| Qwen | 36 | No | Baseline | ByteLevel BPE |
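
The ratio in the table is simply the baseline token count divided by the new token count (36 / 8 = 4.5). A minimal sketch of how it could be measured on other text is shown below. The helper `token_count_ratio` is a hypothetical name introduced for illustration, and the commented usage assumes `Qwen/Qwen2.5-1.5B` is an appropriate checkpoint to load the baseline tokenizer from.

```python
from typing import Callable, List


def token_count_ratio(
    text: str,
    baseline_tokenize: Callable[[str], List[str]],
    candidate_tokenize: Callable[[str], List[str]],
) -> float:
    """Return the number of baseline tokens spent per candidate token on `text`.

    Hypothetical helper for illustration; it is not part of this repository.
    """
    baseline_tokens = baseline_tokenize(text)
    candidate_tokens = candidate_tokenize(text)
    if not candidate_tokens:
        raise ValueError("candidate tokenizer produced no tokens")
    return len(baseline_tokens) / len(candidate_tokens)


# Arithmetic behind the table: 36 baseline tokens vs. 8 tokens.
print(36 / 8)  # 4.5

# Usage with real tokenizers (requires `transformers` and network access).
# The baseline checkpoint name is an assumption:
# from transformers import AutoTokenizer
# qwen = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B")
# ours = AutoTokenizer.from_pretrained("diabolic6045/Sanskrit-English-qwen2-tokenizer")
# text = "हरे कृष्ण हरे कृष्ण कृष्ण कृष्ण हरे हरे"
# print(token_count_ratio(text, qwen.tokenize, ours.tokenize))
```

A single sentence is a small sample, so the ratio on a larger and more varied corpus may differ from 4.5.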

## Training with Axolotl

```yaml
# qwen.yaml
base_model: Qwen/Qwen2.5-1.5B
tokenizer_config: diabolic6045/Sanskrit-English-qwen2-tokenizer
resize_token_embeddings_to_32x: true
```

```bash
# Start training
accelerate launch -m axolotl.cli.train qwen.yaml
```
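
As its name suggests, `resize_token_embeddings_to_32x` pads the embedding matrix so that its row count is a multiple of 32. The sketch below shows only that rounding step, under the assumption that this is all the flag changes about the size; `round_up_to_multiple` is a hypothetical helper, not an Axolotl function.

```python
def round_up_to_multiple(vocab_size: int, multiple: int = 32) -> int:
    """Return the smallest multiple of `multiple` that is >= `vocab_size`."""
    if vocab_size < 0 or multiple <= 0:
        raise ValueError("vocab_size must be >= 0 and multiple must be > 0")
    # Ceiling division using integer arithmetic only.
    return -(-vocab_size // multiple) * multiple


print(round_up_to_multiple(120_000))  # 120000 (already a multiple of 32)
print(round_up_to_multiple(120_005))  # 120032

# Outside Axolotl, plain Transformers offers a comparable option
# (assumed to behave equivalently for this purpose):
# model.resize_token_embeddings(len(tokenizer), pad_to_multiple_of=32)
```

The real size depends on `len(tokenizer)`, which can exceed the base vocabulary once special tokens are counted.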

## Key Features

- **Native Hugging Face Format** - No custom code needed
- **120,000-token vocabulary** trained on a combined English and Sanskrit corpus
- **Clean, readable tokens** - no more byte-level artifacts
- **4.5x fewer tokens** than Qwen's original tokenizer on the example above
- **Official Qwen chat template** - ready for inference
- **Personalized identity** - "Created by Divax Shah (diabolic6045)"
- **Axolotl compatible** - works seamlessly with distributed training

## Training Pipeline

1. **Base Model Training** - Train on Sanskrit text completion
2. **Instruct Tuning** - Add chat capabilities with proper formatting
3. **Deployment** - Use for Sanskrit-English applications

### Technical Details: [TECHNICAL_README.md](./TECHNICAL_README.md)