---
license: apache-2.0
base_model: unsloth/pixtral-12b-2409-unsloth-bnb-4bit
tags:
- manchu
- ocr
- vision-language-model
- lora
- peft
- unsloth
datasets:
- mic7ch/manchu-2025-0033
language:
- mnc
- en
library_name: peft
pipeline_tag: image-text-to-text
---

# manchu-ocr-pixtral-final

LoRA adapters for Manchu script OCR, fine-tuned from **`unsloth/pixtral-12b-2409-unsloth-bnb-4bit`** on the
[`mic7ch/manchu-2025-0033`](https://huggingface.co/datasets/mic7ch/manchu-2025-0033) dataset.
Part of a replication / ablation study on VLM-based OCR for Manchu.

The VLM is prompted to output both the Manchu-script text and its romanized transliteration
in a structured format (`Manchu: {text}\nRoman: {text}`).

## Contents

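- `best_model/`: the model saved when training finished. Because `load_best_model_at_end` was set, this is the checkpoint the trainer selected on `manchu_cer`, which need not be the last step.
- `checkpoints/checkpoint-{step}/`: one intermediate save every 500 steps (`save_steps=500`), 10 checkpoints in total.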
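
The layout above can be enumerated programmatically. This is a minimal sketch that assumes the first save is at step 500 and the last at step 5000 (consistent with 10 saves at `save_steps=500`, but not checked against the repository file listing). The helper name `checkpoint_subfolders` is hypothetical.

```python
SAVE_STEPS = 500       # save cadence stated in this card
NUM_CHECKPOINTS = 10   # number of intermediate saves stated in this card


def checkpoint_subfolders(save_steps=SAVE_STEPS, count=NUM_CHECKPOINTS):
    """Return the assumed subfolder name of each intermediate save."""
    return [f"checkpoints/checkpoint-{save_steps * i}" for i in range(1, count + 1)]


# Print the subfolder names, e.g. for use as the `subfolder=` argument when loading.
for path in checkpoint_subfolders():
    print(path)
```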

> **Paper selection** uses the sweep winner on held-out `real_val` rather than the trainer's
> `best_model/`, because `best_model/` is selected on training-time `manchu_cer` and does not
> always coincide with the real-test peak. See the table below.

## Best checkpoint by split

| Split | Checkpoint path | manchu_word_accuracy |
|---|---|---|
| real_val | `checkpoints/checkpoint-4000/` | 0.9940 |
| test | `checkpoints/checkpoint-5000/` | 0.9602 |
| validation | `checkpoints/checkpoint-5000/` | 0.8980 |

## Training recipe

- **Base model:** `unsloth/pixtral-12b-2409-unsloth-bnb-4bit` (4-bit, bitsandbytes)
- **Framework:** [Unsloth](https://github.com/unslothai/unsloth) + TRL `SFTTrainer`
- **LoRA:** r=32, alpha=64, dropout=0.05; targets include all attention + MLP projections plus vision/language heads
- **Optimizer:** `paged_adamw_8bit`, lr=2e-4, cosine_with_restarts, warmup=100
- **Batch:** `per_device_train_batch_size=4`, `gradient_accumulation_steps=2`, 4 GPUs (DDP) → effective batch size 32 (4 × 2 × 4)
- **Save cadence:** `save_steps=500`, `save_total_limit=50`
- **Primary training metric:** `manchu_cer`
- **Selection metric for paper:** `manchu_word_accuracy` on held-out `real_val`
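
This card does not spell out how `manchu_cer` and `manchu_word_accuracy` are computed. The sketch below shows one common way to define such metrics, under two assumptions: CER is the total Levenshtein edit distance divided by the total number of reference characters (counted as Unicode code points), and word accuracy is the fraction of predictions that match their reference exactly. The function names are hypothetical, and the study's actual implementation may differ.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(predictions, references):
    """Assumed definition: total edit distance / total reference characters."""
    total_edits = sum(edit_distance(p, r) for p, r in zip(predictions, references))
    total_chars = sum(len(r) for r in references)
    return total_edits / total_chars if total_chars else 0.0


def word_accuracy(predictions, references):
    """Assumed definition: fraction of predictions equal to their reference."""
    if not references:
        return 0.0
    matches = sum(p == r for p, r in zip(predictions, references))
    return matches / len(references)


# Toy romanized strings, for illustration only (not taken from the dataset).
preds = ["abka", "na"]
refs = ["abka", "ne"]
print(cer(preds, refs), word_accuracy(preds, refs))
```

Counting code points rather than grapheme clusters is a simplification; a stricter evaluation might normalize the strings first.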

## Usage (inference)

```python
from transformers import AutoProcessor, AutoModelForVision2Seq
from peft import PeftModel

base = "unsloth/pixtral-12b-2409-unsloth-bnb-4bit"
adapter = "mic7ch/manchu-ocr-pixtral-final"  # adapter repo; choose a subfolder below
processor = AutoProcessor.from_pretrained(base)
# The base checkpoint is already quantized to 4-bit, so no extra quantization flag is passed.
model = AutoModelForVision2Seq.from_pretrained(base, device_map="auto")
model = PeftModel.from_pretrained(model, adapter, subfolder="best_model")

# For a specific step, pass that checkpoint's subfolder instead, e.g. step 4000:
# model = PeftModel.from_pretrained(model, adapter, subfolder="checkpoints/checkpoint-4000")
```

The model outputs:
```
Manchu: <manchu glyphs>
Roman: <romanized transliteration>
```
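
Downstream code usually needs the two fields separately. The following is a minimal parsing sketch, assuming the generated text contains the `Manchu:` and `Roman:` labels exactly as shown, on consecutive lines. The helper name `parse_ocr_output` is hypothetical, and the sample string is illustrative only.

```python
import re

# Matches "Manchu: <text>" followed by "Roman: <text>" on a later line.
_OUTPUT_PATTERN = re.compile(r"Manchu:\s*(?P<manchu>.*?)\s*\nRoman:\s*(?P<roman>.*)")


def parse_ocr_output(text):
    """Split generated text into its Manchu and romanized fields.

    Returns None when the text does not follow the expected format.
    """
    match = _OUTPUT_PATTERN.search(text)
    if match is None:
        return None
    return {
        "manchu": match.group("manchu").strip(),
        "roman": match.group("roman").strip(),
    }


sample = "Manchu: \u1820\u182a\u1874\u1820\nRoman: abka"  # illustrative string
print(parse_ocr_output(sample))
```

Returning `None` on malformed output lets an evaluation loop count format failures explicitly rather than silently scoring them as empty strings.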

## Files NOT included in checkpoints

We strip optimizer state (`optimizer.pt`), scheduler state (`scheduler.pt`), RNG state
(`rng_state_*.pth`), and TensorBoard events from each checkpoint. The uploaded files are
sufficient for inference and for continued LoRA fine-tuning from any step. They do not allow a
bit-identical training resume, because the optimizer, scheduler, and RNG states start fresh.
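
The stripping step can be reproduced on a local copy of a checkpoint directory. The sketch below is not the script used for this repository. The helper name `strip_training_state` is hypothetical, and the TensorBoard pattern assumes the default `events.out.tfevents.*` file naming.

```python
from pathlib import Path

# Patterns for the resume-only files named above. The TensorBoard pattern
# is an assumption based on the default event-file naming.
STRIP_PATTERNS = (
    "optimizer.pt",
    "scheduler.pt",
    "rng_state_*.pth",
    "events.out.tfevents.*",
)


def strip_training_state(checkpoint_dir, dry_run=True):
    """List (and, unless dry_run is set, delete) resume-only files under checkpoint_dir."""
    root = Path(checkpoint_dir)
    targets = sorted(
        {p for pattern in STRIP_PATTERNS for p in root.rglob(pattern) if p.is_file()}
    )
    if not dry_run:
        for path in targets:
            path.unlink()
    return targets


# Example (hypothetical local path): preview what would be removed.
# for path in strip_training_state("checkpoints/checkpoint-500"):
#     print(path)
```

The default `dry_run=True` only lists candidates, so nothing is deleted until the caller opts in.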

## Citation

A paper is forthcoming. Until it is published, please cite this repository:
```
@software{manchu_ocr_2026,
  author = {Chung, H.-M. and collaborators},
  title  = {Vision-language-model OCR for Manchu script},
  year   = {2026},
  url    = {https://huggingface.co/mic7ch/manchu-ocr-pixtral-final}
}
```