Instructions to use Vokturz/Loyca-Qwen3-VL-2B-Instruct-OCR with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use Vokturz/Loyca-Qwen3-VL-2B-Instruct-OCR with Transformers:
# Load model directly from transformers import AutoModel model = AutoModel.from_pretrained("Vokturz/Loyca-Qwen3-VL-2B-Instruct-OCR", dtype="auto") - Notebooks
- Google Colab
- Kaggle
| library_name: transformers | |
| tags: | |
| - vision-language | |
| - ocr | |
| - multimodal | |
| - qwen | |
| - lora | |
| - instruction-tuning | |
| datasets: | |
| - Vokturz/sourceforge-app-screenshots-ocr | |
| base_model: | |
| - unsloth/Qwen3-VL-2B-Instruct-unsloth-bnb-4bit | |
| # Model Card for Vokturz/Loyca-Qwen3-VL-2B-Instruct-OCR | |
| ## Model Details | |
| ### Model Description | |
| **Loyca-Qwen3-VL-2B-Instruct-OCR** is a lightweight LoRA adapter built on top of **Qwen/Qwen3-VL-2B-Instruct**, fine-tuned for **visual text recognition (OCR)** and **screen content understanding**. | |
| It enhances the base model’s ability to read and interpret text embedded in images — particularly screenshots and user interfaces — and respond with structured, instruction-following outputs. | |
| ### Model Sources | |
| - **Repository:** [https://huggingface.co/Vokturz/Loyca-Qwen3-VL-2B-Instruct-OCR](https://huggingface.co/Vokturz/Loyca-Qwen3-VL-2B-Instruct-OCR) | |
| - **Base model:** [https://huggingface.co/Qwen/Qwen3-VL-2B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-2B-Instruct) | |
| - **Fine-tuning run:** [W&B Experiment](https://wandb.ai/vokturz/Loyca-Qwen3-VL-2B-OCR) | |
| --- | |
| ## Uses | |
| This model can be used directly for Optical Character Recognition (OCR) on screenshots, UI layouts, or application previews. | |
| The model is **not designed** for: | |
| * Handwritten OCR | |
| * Scene text in natural environments (e.g., street signs) | |
| * Legal or financial document processing without human review | |
| --- | |
| ## Training Details | |
| ### Training Data | |
| The model was trained on **`Vokturz/sourceforge-app-screenshots-ocr`** (~1100 records), a custom dataset of annotated application screenshots containing readable text and UI elements. | |
| The dataset focuses on **clean UI text extraction** rather than general image captioning. | |
| ### Training Hyperparameters | |
| | Parameter | Value | | |
| | --------------------- | ---------------- | | |
| | Epochs | 8 | | |
| | Batch size | 8 | | |
| | Learning rate | 3e-4 | | |
| | LoRA rank | 64 | | |
| | LoRA alpha | 64 | | |
| | Precision | bfloat16 (mixed) | | |
| | Optimizer | AdamW | | |
| | Scheduler | Cosine decay | | |
| | Gradient accumulation | 2 | | |
| | Weight decay | 0.01 | |