Hindi fine-tune of MiniCPM5-1B now available + GGUF quants

#6
by pankajpandey-dev - opened

Hi @openbmb team and community! 👋

Thanks for releasing MiniCPM5-1B. The tokenizer handles Devanagari beautifully (0.81 tokens/char on Hindi text), and the model is the perfect size for low-resource Indic adaptation.
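For anyone who wants to reproduce a tokens/char figure like the one above on their own text, here is a minimal sketch. It assumes "characters" means Unicode code points (Python's `len`), not grapheme clusters, which matters for Devanagari because matras and the virama count as separate code points. The function name `tokens_per_char` and the `encode` callable are illustrative; in practice `encode` would be your tokenizer's encode method.

```python
from typing import Callable, Sequence


def tokens_per_char(encode: Callable[[str], Sequence], text: str) -> float:
    """Return the number of tokens produced per Unicode code point of `text`.

    `encode` is any callable that maps a string to a sequence of tokens,
    for example a tokenizer's encode method (assumption: special tokens
    are not added, otherwise the ratio is slightly inflated).
    """
    if not text:
        raise ValueError("text must be non-empty")
    return len(encode(text)) / len(text)


if __name__ == "__main__":
    # Toy stand-in for a real tokenizer: split on whitespace.
    sample = "नमस्ते दुनिया"
    print(tokens_per_char(str.split, sample))
```

A lower ratio means the tokenizer packs more text into each token, so the same context window covers more Hindi.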

I've released a Hindi instruction-tuned version trained on AI4Bharat's indic-instruct-data-v0.1 (anudesh + dolly Hindi splits, ~4k high-quality examples):

🔗 HF Model: https://huggingface.co/pankajpandey-dev/MiniCPM5-1B-Hindi-Instruct
🔗 GGUF Quants (Q3_K_M, Q4_K_M, Q5_K_M, Q6_K, Q8_0): https://huggingface.co/pankajpandey-dev/MiniCPM5-1B-Hindi-Instruct-v1-GGUF

Training stack: Unsloth + TRL + LoRA (r=32), 60 min on a single T4. Full details on the model card.

One note for the llama.cpp folks: the BPE pre-tokenizer hash isn't in llama.cpp's registry yet. I registered 36f3066e97b7f3994b379aaacde306c1444c6ae84e81a5ae3cd2b7ed3b8c42d4 → qwen2 as the closest match, and conversion worked cleanly. Happy to submit a PR to llama.cpp upstream if this is the right pre-tokenizer family for MiniCPM5.
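For context on where such a hash comes from, here is a minimal sketch. It assumes the fingerprint is computed the way llama.cpp's Hugging Face conversion script appears to do it: encode a fixed check string, take the string form of the token ID list, and SHA-256 it. The names `pretokenizer_fingerprint`, `KNOWN_PRE_TOKENIZERS`, and `lookup_pre_tokenizer` are hypothetical, and the check string used upstream is not reproduced here, so this will not recreate the exact hash above unless you pass in the real tokenizer and the real check text.

```python
import hashlib
from typing import Callable, Optional, Sequence


def pretokenizer_fingerprint(encode: Callable[[str], Sequence[int]], check_text: str) -> str:
    """Hash the token IDs produced for `check_text`.

    Two tokenizers that split the check text identically get the same
    fingerprint, which is what lets a registry map hash -> pre-tokenizer.
    """
    token_ids = list(encode(check_text))
    return hashlib.sha256(str(token_ids).encode()).hexdigest()


# Hypothetical local registry; the single entry is the mapping from this post.
KNOWN_PRE_TOKENIZERS = {
    "36f3066e97b7f3994b379aaacde306c1444c6ae84e81a5ae3cd2b7ed3b8c42d4": "qwen2",
}


def lookup_pre_tokenizer(fingerprint: str) -> Optional[str]:
    """Return the registered pre-tokenizer name, or None if the hash is unknown."""
    return KNOWN_PRE_TOKENIZERS.get(fingerprint)


if __name__ == "__main__":
    # Toy encoder (code points as IDs) standing in for a real tokenizer.
    toy_encode = lambda s: [ord(c) for c in s]
    fp = pretokenizer_fingerprint(toy_encode, "ab")
    print(fp, lookup_pre_tokenizer(fp))
```

An unknown hash only means the tokenizer's splitting behaviour has not been registered yet. Mapping it to an existing family is a judgment call, so it is worth comparing tokenization output on Hindi text between the original tokenizer and the converted GGUF.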

Looking forward to more Indic fine-tunes of this base. Thanks again!
