lmms-lab/DocVQA
This is the recommended version for inference: a fully merged 16-bit model that combines Qwen3-VL-4B-Instruct with the trained LoRA weights.
Trained on 4 datasets covering the following languages:
Assamese, Bengali, Bodo, Dogri, Gujarati, Hindi, Kannada, Kashmiri, Konkani, Maithili, Malayalam, Marathi, Manipuri, Nepali, Odia, Punjabi, Sanskrit, Santali, Sindhi, Tamil, Telugu, Urdu, English
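The usage example further down prompts the model with "Translate this document from English to Hindi." As a minimal sketch, the hypothetical helper below (the names `SUPPORTED_LANGUAGES` and `build_translation_prompt` are illustrative, not part of the model or of `transformers`) builds the same prompt for other pairs and rejects languages that are not in the list above. Whether this exact wording works equally well for every pair is an assumption, not something this card states.

```python
# Languages listed on this card as covered by the training data.
SUPPORTED_LANGUAGES = {
    "Assamese", "Bengali", "Bodo", "Dogri", "Gujarati", "Hindi", "Kannada",
    "Kashmiri", "Konkani", "Maithili", "Malayalam", "Marathi", "Manipuri",
    "Nepali", "Odia", "Punjabi", "Sanskrit", "Santali", "Sindhi", "Tamil",
    "Telugu", "Urdu", "English",
}


def build_translation_prompt(source_lang: str, target_lang: str) -> str:
    """Build a translation prompt in the same wording as the usage example.

    Hypothetical helper: raises ValueError for languages not listed above.
    """
    for lang in (source_lang, target_lang):
        if lang not in SUPPORTED_LANGUAGES:
            raise ValueError(f"Language not listed on the model card: {lang}")
    return f"Translate this document from {source_lang} to {target_lang}."


if __name__ == "__main__":
    print(build_translation_prompt("English", "Tamil"))
```

The returned string can be used as the `"text"` entry of the user message in the example below.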
from transformers import Qwen3VLForConditionalGeneration, Qwen3VLProcessor
from PIL import Image

# Load model
model = Qwen3VLForConditionalGeneration.from_pretrained(
    "mashriram/Sarvam-1-VL-4B-Instruct-VLLM",
    torch_dtype="auto",
    device_map="auto"
)
processor = Qwen3VLProcessor.from_pretrained("mashriram/Sarvam-1-VL-4B-Instruct-VLLM")

# Prepare input
image = Image.open("document.jpg")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "Translate this document from English to Hindi."}
        ]
    }
]

# Generate
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(processor.decode(outputs[0], skip_special_tokens=True))
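For decoder-only models, `generate` returns the prompt tokens followed by the newly generated ones, so the decoded string above includes the prompt text as well as the answer. A minimal sketch of one way to keep only the generated part is shown below. The helper name `strip_prompt_tokens` is hypothetical. It relies only on `len()` and slicing, so it works on plain lists as well as on the tensors returned by the processor and by `generate`.

```python
def strip_prompt_tokens(input_ids, output_ids):
    """Drop the leading prompt tokens from each generated sequence.

    Hypothetical helper: assumes every output sequence starts with its
    corresponding input sequence, as with decoder-only generation.
    """
    return [out[len(inp):] for inp, out in zip(input_ids, output_ids)]


if __name__ == "__main__":
    # Toy token ids standing in for real model input and output.
    prompt_ids = [[101, 7, 8, 9]]
    generated_ids = [[101, 7, 8, 9, 42, 43]]
    print(strip_prompt_tokens(prompt_ids, generated_ids))
```

With the objects from the example above, this would be used as `trimmed = strip_prompt_tokens(inputs.input_ids, outputs)` followed by `processor.batch_decode(trimmed, skip_special_tokens=True)[0]`.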
License: Apache 2.0
Base model: Qwen/Qwen3-VL-4B-Instruct