VARCO-VISION-2.0-1.7B-OCR
Introduction
VARCO-VISION-2.0-1.7B-OCR is a lightweight yet powerful OCR-specialized model derived from VARCO-VISION-2.0-1.7B, designed to deliver efficient and accurate text recognition in real-world scenarios. Unlike conventional vision-language models (VLMs) that primarily focus on transcribing visible text, this model performs both recognition and spatial localization by detecting bounding boxes around each character, enabling structured, layout-aware OCR outputs.
The model supports both Korean and English, making it well-suited for multilingual environments where mixed-script documents are common. Each recognized character is paired with its precise position in the image, formatted as `<char>{characters}</char><bbox>{x1}, {y1}, {x2}, {y2}</bbox>`, where the coordinates correspond to the top-left `(x1, y1)` and bottom-right `(x2, y2)` corners of the character's bounding box.
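The output format described above can be turned into structured data with a small parser. The following is a minimal sketch, not part of the official model code: the names `OcrItem` and `parse_ocr_output` are hypothetical, and because the card does not state whether coordinates are pixel values or normalized values, the sketch simply parses them as floats.

```python
import re
from typing import List, NamedTuple, Tuple


class OcrItem(NamedTuple):
    """One recognized text span and its box (hypothetical helper type)."""
    text: str
    box: Tuple[float, float, float, float]  # (x1, y1, x2, y2)


# Matches one <char>...</char><bbox>...</bbox> pair as described in the card.
_PAIR = re.compile(
    r"<char>(.*?)</char>\s*<bbox>\s*([^<]*?)\s*</bbox>",
    re.DOTALL,
)


def parse_ocr_output(raw: str) -> List[OcrItem]:
    """Extract (text, box) pairs from raw model output.

    Pairs whose bbox does not hold exactly four numbers are skipped.
    """
    items = []
    for text, coords in _PAIR.findall(raw):
        parts = [p.strip() for p in coords.split(",")]
        if len(parts) != 4:
            continue
        try:
            box = tuple(float(p) for p in parts)
        except ValueError:
            continue
        items.append(OcrItem(text, box))
    return items


if __name__ == "__main__":
    # Made-up sample string for illustration, not real model output.
    sample = "<char>A</char><bbox>1, 2, 3, 4</bbox>"
    for item in parse_ocr_output(sample):
        print(item.text, item.box)
```

Any special tokens surrounding the pairs (for example end-of-sequence markers) are ignored by the pattern, so the raw decoded string can be passed in directly.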
While VARCO-VISION-2.0-14B demonstrates strong OCR capabilities as part of its broader multimodal reasoning skills, deploying such a large model for single-task use cases can be computationally inefficient. VARCO-VISION-2.0-1.7B-OCR addresses this with a task-optimized design that retains high accuracy while significantly reducing resource requirements, making it ideal for real-time or resource-constrained applications.
🚨News🎙️
- 📰 2025-07-28: We released VARCO-VISION-2.0-1.7B-OCR at link
- 📰 2025-07-28: We released VARCO-VISION-2.0-1.7B at link
- 📰 2025-07-18: Updated the checkpoint of VARCO-VISION-2.0-14B for improved performance.
- 📰 2025-07-16: We released VARCO-VISION-2.0-14B at link
- 📰 2025-07-16: We released GME-VARCO-VISION-Embedding at link
VARCO-VISION-2.0 Family
| Model Name | Base Models (Vision / Language) | HF Link |
|---|---|---|
| VARCO-VISION-2.0-14B | siglip2-so400m-patch16-384 / Qwen3-14B | link |
| VARCO-VISION-2.0-1.7B | siglip2-so400m-patch16-384 / Qwen3-1.7B | link |
| VARCO-VISION-2.0-1.7B-OCR | siglip2-so400m-patch16-384 / Qwen3-1.7B | link |
| GME-VARCO-VISION-Embedding | Qwen2-VL-7B-Instruct | link |
Model Architecture
VARCO-VISION-2.0 follows the architecture of LLaVA-OneVision.
Evaluation
OCR Benchmark
| Benchmark | CLOVA OCR | PaddleOCR | EasyOCR | VARCO-VISION-2.0-1.7B-OCR |
|---|---|---|---|---|
| CORD | 93.9 | 91.4 | 77.8 | 95.6 |
| ICDAR2013 | 94.4 | 92.0 | 85.0 | 95.5 |
| ICDAR2015 | 84.1 | 73.7 | 57.9 | 75.4 |
Usage
To use this model, we recommend installing transformers version 4.53.1 or higher.
Additionally, for best results, we recommend upscaling any input image whose longer side is below 2,304 pixels so that its longer side reaches 2,304 pixels.
```python
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

model_name = "NCSOFT/VARCO-VISION-2.0-1.7B-OCR"
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    attn_implementation="sdpa",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_name)

# Image.open expects a filesystem path or file object, not a file:// URL
image = Image.open("/path/to/image.jpg")

# Image upscaling for OCR performance boost
w, h = image.size
target_size = 2304
if max(w, h) < target_size:
    scaling_factor = target_size / max(w, h)
    new_w = int(w * scaling_factor)
    new_h = int(h * scaling_factor)
    image = image.resize((new_w, new_h))

# The "<ocr>" text prompt triggers OCR on the attached image
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "<ocr>"},
        ],
    },
]

inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device, torch.float16)

generate_ids = model.generate(**inputs, max_new_tokens=1024)

# Drop the prompt tokens so only the newly generated tokens are decoded
generate_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generate_ids)
]
output = processor.decode(generate_ids_trimmed[0], skip_special_tokens=False)
print(output)
```
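Because the image may be upscaled before inference, the predicted boxes may need to be mapped back onto the original image. The sketch below shows one way to do that under stated assumptions: `map_box_to_original` is a hypothetical helper, and since the card does not say whether the coordinates are normalized to the range 0 to 1 or given in pixels of the resized image, the caller has to pass that in through the `normalized` flag after checking real model output.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]


def map_box_to_original(
    box: Box,
    original_size: Tuple[int, int],
    resized_size: Tuple[int, int],
    normalized: bool,
) -> Tuple[int, int, int, int]:
    """Map a box predicted on the resized image to original-image pixels.

    original_size and resized_size are (width, height), as returned by
    PIL's Image.size before and after the upscaling step.
    """
    orig_w, orig_h = original_size
    resized_w, resized_h = resized_size
    x1, y1, x2, y2 = box
    if normalized:
        # Assumption: values are fractions of the image width and height,
        # so the resize has no effect on them.
        den_w, den_h = 1.0, 1.0
    else:
        # Assumption: values are pixels of the resized image,
        # so the resize is undone per axis.
        den_w, den_h = float(resized_w), float(resized_h)
    return (
        round(x1 * orig_w / den_w),
        round(y1 * orig_h / den_h),
        round(x2 * orig_w / den_w),
        round(y2 * orig_h / den_h),
    )


if __name__ == "__main__":
    # Illustrative numbers only: a 400x200 image upscaled to 2304x1152.
    print(map_box_to_original((0.25, 0.5, 0.75, 1.0), (400, 200), (2304, 1152), True))
```

To use it with the snippet above, keep the size of the image before resizing (for example `original_size = (w, h)`) and pass `image.size` after resizing as `resized_size`.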
