---
license: apache-2.0
datasets:
- diabolic6045/Sanskrit-shlok-collection
- roneneldan/TinyStories
language:
- sa
- en
pipeline_tag: text-generation
---

# Native Sanskrit-English Tokenizer for Qwen2.5

## Problem Statement

The original Qwen2.5 tokenizer splits Sanskrit text into inefficient byte-level tokens:

- **Qwen's output**: `['ह', 'र', 'à¥ĩ', 'Ġà¤ķ', 'à¥', 'ĥ', 'ष', 'à¥įà¤', '£']` (36 tokens)
- **Our output**: `['▁हरे', '▁कृष्ण', '▁हरे', '▁कृष्ण', '▁कृष्ण', '▁कृष्ण', '▁हरे', '▁हरे']` (8 tokens)

For this example, our tokenizer needs 4.5x fewer tokens (8 instead of 36), and each token is a readable, meaningful unit.

## Usage

```python
from transformers import AutoTokenizer

# Load tokenizer (native Hugging Face format)
tokenizer = AutoTokenizer.from_pretrained("diabolic6045/Sanskrit-English-qwen2-tokenizer")

# Test Sanskrit tokenization
text = "हरे कृष्ण हरे कृष्ण कृष्ण कृष्ण हरे हरे"
tokens = tokenizer.tokenize(text)
print(tokens)
# ['▁हरे', '▁कृष्ण', '▁हरे', '▁कृष्ण', '▁कृष्ण', '▁कृष्ण', '▁हरे', '▁हरे']

# Perfect reconstruction
decoded = tokenizer.decode(tokenizer.encode(text))
print(decoded)
# "हरे कृष्ण हरे कृष्ण कृष्ण कृष्ण हरे हरे"

# Chat template support
messages = [{'role': 'user', 'content': 'What is the meaning of हरे कृष्ण?'}]
formatted = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(formatted)
```

## Performance Comparison

| Tokenizer | Tokens | Readable | Efficiency | Format |
|-----------|--------|----------|------------|--------|
| **Ours** | 8 | YES | **4.5x better** | Native HF |
| Qwen | 36 | NO | Poor | ByteLevel BPE |

Token counts are for the example shloka shown in the Usage section.

## Training with Axolotl

```yaml
# qwen.yaml
base_model: Qwen/Qwen2.5-1.5B
tokenizer_config: diabolic6045/Sanskrit-English-qwen2-tokenizer
resize_token_embeddings_to_32x: true
```

```bash
# Start training
accelerate launch -m axolotl.cli.train qwen.yaml
```

## Key Features

- **Native Hugging Face Format** - No custom code needed
- **120,000-token vocabulary** - trained on an English+Sanskrit corpus
- **Clean, readable tokens** - no more byte-level artifacts
- **4.5x more efficient** than Qwen's original tokenizer on the example above
- **Official Qwen chat template** - ready for inference
- **Personalized identity** - "Created by Divax Shah (diabolic6045)"
- **Axolotl compatible** - works seamlessly with distributed training

## Training Pipeline

1. **Base Model Training** - Train on Sanskrit text completion
2. **Instruct Tuning** - Add chat capabilities with proper formatting
3. **Deployment** - Use for Sanskrit-English applications

### Technical Details: [TECHNICAL_README.md](./TECHNICAL_README.md)
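
The 4.5x figure in the performance comparison is the ratio of token counts on the example shloka (36 / 8). The sketch below shows one way to run the same comparison on other text. The helper names `token_ratio` and `load_and_compare` are hypothetical and not part of this repository, and using `Qwen/Qwen2.5-1.5B` as the baseline tokenizer is an assumption taken from the Axolotl config above.

```python
from typing import Callable, List


def token_ratio(
    text: str,
    baseline_tokenize: Callable[[str], List[str]],
    candidate_tokenize: Callable[[str], List[str]],
) -> float:
    """Return the number of baseline tokens per candidate token for `text`."""
    baseline_count = len(baseline_tokenize(text))
    candidate_count = len(candidate_tokenize(text))
    if candidate_count == 0:
        raise ValueError("candidate tokenizer produced no tokens")
    return baseline_count / candidate_count


def load_and_compare(text: str) -> float:
    """Compare the two Hub tokenizers (needs `transformers` and network access)."""
    # Imported lazily so token_ratio itself has no third-party dependency.
    from transformers import AutoTokenizer

    # Assumption: the baseline is the tokenizer shipped with Qwen/Qwen2.5-1.5B.
    baseline = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B")
    candidate = AutoTokenizer.from_pretrained(
        "diabolic6045/Sanskrit-English-qwen2-tokenizer"
    )
    return token_ratio(text, baseline.tokenize, candidate.tokenize)
```

Calling `load_and_compare("हरे कृष्ण हरे कृष्ण कृष्ण कृष्ण हरे हरे")` should correspond to the comparison reported above. The ratio depends on the input, so mixed Sanskrit-English text may give a different value.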