---
license: apache-2.0
base_model: unsloth/pixtral-12b-2409-unsloth-bnb-4bit
tags:
- manchu
- ocr
- vision-language-model
- lora
- peft
- unsloth
datasets:
- mic7ch/manchu-2025-0033
language:
- mnc
- en
library_name: peft
pipeline_tag: image-text-to-text
---

# manchu-ocr-pixtral-final

LoRA adapters for Manchu script OCR, fine-tuned from **`unsloth/pixtral-12b-2409-unsloth-bnb-4bit`** on the [`mic7ch/manchu-2025-0033`](https://huggingface.co/datasets/mic7ch/manchu-2025-0033) dataset.

Part of a replication / ablation study on VLM-based OCR for Manchu. The VLM is asked to output both the Manchu-script text and a romanized transliteration in a structured format (`Manchu: {text}\nRoman: {text}`).

## Contents

- `best_model/`: the trainer's final save (the model selected by `load_best_model_at_end` on `manchu_cer`).
- `checkpoints/checkpoint-{step}/`: every intermediate save, written every 500 steps (`save_steps=500`; 10 checkpoints total).

> **Paper selection** uses the sweep winner on held-out `real_val` rather than the trainer's
> `best_model/`, because `best_model/` is selected on training-time `manchu_cer` and does not
> always coincide with the real-test peak. See the table under "Best checkpoint by split".

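The structured output format described above (`Manchu: {text}\nRoman: {text}`) can be split into its two fields with a small parser. The following is a minimal sketch, not code from this repository: `OcrResult` and `parse_ocr_output` are hypothetical names, and the tolerance for extra lines or missing fields is an assumption about how raw generations might look.

```python
import re
from typing import NamedTuple, Optional


class OcrResult(NamedTuple):
    """Hypothetical container for one parsed OCR generation."""
    manchu: Optional[str]
    roman: Optional[str]


# Matches a line that starts with one of the two labels used in the
# structured output format, capturing the label and the rest of the line.
_FIELD = re.compile(r"^(Manchu|Roman):[ \t]*(.*)$", re.MULTILINE)


def parse_ocr_output(text: str) -> OcrResult:
    """Split a 'Manchu: ...' / 'Roman: ...' generation into its two fields.

    A field that does not appear in the text is returned as None.
    """
    fields = {}
    for name, value in _FIELD.findall(text):
        # Keep the first occurrence of each label and ignore repeats.
        fields.setdefault(name.lower(), value.strip())
    return OcrResult(fields.get("manchu"), fields.get("roman"))


if __name__ == "__main__":
    # Illustrative input only; not taken from the dataset.
    print(parse_ocr_output("Manchu: ᠮᠠᠨᠵᡠ\nRoman: manju"))
```

Returning `None` for a missing field, rather than raising, keeps a malformed generation visible to downstream scoring instead of aborting an evaluation run; whether that is the right choice depends on how the metrics treat empty predictions.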
## Best checkpoint by split

| Split | Checkpoint path | manchu_word_accuracy |
|---|---|---|
| real_val | `checkpoints/checkpoint-4000/` | 0.9940 |
| test | `checkpoints/checkpoint-5000/` | 0.9602 |
| validation | `checkpoints/checkpoint-5000/` | 0.8980 |

## Training recipe

- **Base model:** `unsloth/pixtral-12b-2409-unsloth-bnb-4bit` (4-bit, bitsandbytes)
- **Framework:** [Unsloth](https://github.com/unslothai/unsloth) + TRL `SFTTrainer`
- **LoRA:** r=32, alpha=64, dropout=0.05; targets include all attention + MLP projections plus vision/language heads
- **Optimizer:** `paged_adamw_8bit`, lr=2e-4, cosine_with_restarts, warmup=100
- **Batch:** `per_device_train_batch_size=4`, `gradient_accumulation_steps=2`, 4 GPUs (DDP), for an effective batch size of 4 × 2 × 4 = 32
- **Save cadence:** `save_steps=500`, `save_total_limit=50`
- **Primary training metric:** `manchu_cer`
- **Selection metric for paper:** `manchu_word_accuracy` on held-out `real_val`

## Usage (inference)

```python
from transformers import AutoProcessor, AutoModelForVision2Seq
from peft import PeftModel

base = "unsloth/pixtral-12b-2409-unsloth-bnb-4bit"
adapter = "mic7ch/manchu-ocr-pixtral-final"  # adapter repo; choose a save with subfolder=

# Load the processor and the 4-bit base model.
processor = AutoProcessor.from_pretrained(base)
model = AutoModelForVision2Seq.from_pretrained(base, device_map="auto", load_in_4bit=True)

# Attach the LoRA adapter saved under best_model/.
model = PeftModel.from_pretrained(model, adapter, subfolder="best_model")

# For a specific step, replace {N} with the step number:
# model = PeftModel.from_pretrained(model, adapter, subfolder="checkpoints/checkpoint-{N}")
```

The model outputs:

```
Manchu: {text}
Roman: {text}
```

## Checkpoints NOT included

We strip optimizer state (`optimizer.pt`), scheduler state (`scheduler.pt`), RNG state (`rng_state_*.pth`), and TensorBoard events from each checkpoint. The uploaded files are sufficient for inference and for continued LoRA fine-tuning from any step, but not for an *exact* bit-identical training resume.

## Citation

Paper forthcoming.
Please cite the repository meanwhile:

```
@software{manchu_ocr_2026,
  author = {Chung, H.-M. and collaborators},
  title  = {Vision-language-model OCR for Manchu script},
  year   = {2026},
  url    = {https://huggingface.co/mic7ch/manchu-ocr-pixtral-final}
}
```