Fine-tuned openbmb/MiniCPM-V-4.6 for
structured JSON extraction from Indian distributor (kirana) invoices.
QLoRA adapter weights are fully merged into the base model — no PEFT dependency at inference time. Part of the Kirana Detective project: a six-agent AI pipeline that audits invoices for pricing anomalies, missing deliveries, and GST errors.
| Attribute | Value |
|---|---|
| Base model | openbmb/MiniCPM-V-4.6 |
| Task | Vision-language OCR + structured JSON extraction |
| Fine-tuning method | QLoRA — 4-bit NF4 base, LoRA rank 16, α 32 |
| Trainable parameters | 9,486,336 / 1,309,914,352 (0.72%) |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Training epochs | 3 |
| Final eval loss | 0.2120 (↓ from 0.2901 at epoch 1) |
| Training hardware | NVIDIA A10G 22 GB VRAM (Modal) |
| Training duration | ~52 minutes |
| Output format | Merged full weights — bfloat16 |
| Inference runtime | transformers (AutoModel + model.chat()) |
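As a rough illustration of the numbers in the table: LoRA attaches two low-rank matrices to each adapted linear layer, so a layer mapping `d_in` to `d_out` gains `rank * (d_in + d_out)` trainable parameters, and the adapter update is scaled by α / rank. The sketch below applies that arithmetic. The helper names and the 2048-wide layer are hypothetical and are not taken from the MiniCPM-V architecture; only the rank, α, and parameter totals come from the table.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Parameters LoRA adds to one linear layer (A is rank x d_in, B is d_out x rank)."""
    return rank * (d_in + d_out)


def trainable_percent(trainable: int, total: int) -> float:
    """Share of trainable parameters, as a percentage."""
    return 100.0 * trainable / total


RANK, ALPHA = 16, 32          # values from the table above
scaling = ALPHA / RANK        # LoRA scales the low-rank update by alpha / rank

# Hypothetical 2048 x 2048 projection, for illustration only.
example_layer = lora_param_count(2048, 2048, RANK)

# Totals reported in the table above.
pct = trainable_percent(9_486_336, 1_309_914_352)

print(f"scaling={scaling}, example layer adds {example_layer} params, trainable={pct:.2f}%")
```

With rank 16 and α 32 the scaling factor is 2, and the reported totals reproduce the 0.72% figure.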
Dataset: build-small-hackathon/kirana-invoice-train-data
| Split | Examples |
|---|---|
| Train | 450 |
| Eval | 50 |
Synthetic Indian distributor invoices generated with Pillow across:
| Epoch | Train Loss | Eval Loss |
|---|---|---|
| 1 | — | 0.2901 |
| 2 | — | 0.2281 |
| 3 | — | 0.2120 |
| Format | Example |
|---|---|
| Printed GST invoice | Standard B2B tax invoice with HSN codes |
| Tally PDF export | Machine-generated tabular layout |
| Handwritten invoice | Photo of handwritten bill |
| WhatsApp screenshot | Low-resolution forwarded invoice image |
The model returns only a JSON object matching this schema — no markdown, no prose:
```json
{
  "invoice_number": "INV-2024-001",
  "supplier": "Hindustan Unilever Ltd.",
  "date": "2026-06-10",
  "items": [
    {
      "product_raw": "SURF XL 1KG",
      "quantity": 12,
      "unit_price": 95.00,
      "gst_rate": 18,
      "line_total": 1140.00
    },
    {
      "product_raw": "MAGGI MASALA 70G",
      "quantity": 48,
      "unit_price": 14.00,
      "gst_rate": 5,
      "line_total": 672.00
    }
  ],
  "grand_total": 9650.00,
  "extraction_warnings": []
}
```
Field notes:

- product_raw: verbatim as printed on the invoice (abbreviations and typos preserved)
- gst_rate: percentage value (5, 12, 18, 28), not a decimal
- date: ISO 8601 (YYYY-MM-DD) when parseable, raw string otherwise
- extraction_warnings: list of issues noticed (missing fields, illegible areas, GST anomalies)
- Numeric fields default to 0 when unreadable; invoice_number, supplier, and date default to null

```python
import torch
from transformers import AutoModel, AutoTokenizer
from PIL import Image

MODEL_ID = "naazimsnh02/minicpm-v-4-6-indian-invoice-extraction-merged"

# Load the merged bfloat16 weights; trust_remote_code pulls in the model's custom code.
model = AutoModel.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model.eval()

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

image = Image.open("invoice.jpg").convert("RGB")

# Extraction prompt: states the schema and asks for JSON only.
prompt = (
    "You are an OCR agent for Indian kirana store invoices. "
    "Extract all information from this invoice image and return ONLY valid JSON "
    "matching this schema exactly:\n"
    '{"invoice_number": string|null, "supplier": string|null, "date": string|null, '
    '"items": [{"product_raw": string, "quantity": number, "unit_price": number, '
    '"gst_rate": number, "line_total": number}], '
    '"grand_total": number, "extraction_warnings": [string]}\n'
    "Return ONLY the JSON object, no markdown, no prose."
)

# The image is passed inside the message content, so the image argument stays None.
msgs = [{"role": "user", "content": [image, prompt]}]
response = model.chat(
    image=None,
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=False,
    max_new_tokens=2048,
)
print(response)
```
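The response is a plain string, so downstream code still has to parse it and may want to sanity-check it against the field notes. The following is a minimal sketch of that step, not part of the released pipeline: `parse_response` and `check_invoice` are hypothetical helpers, and the check that `line_total` equals `quantity * unit_price` is an assumption inferred from the example values in the schema above.

```python
import json
import re

# GST percentages listed in the field notes.
ALLOWED_GST_RATES = {5, 12, 18, 28}


def parse_response(raw: str) -> dict:
    """Parse the span from the first '{' to the last '}' of the model output as JSON."""
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in model output")
    return json.loads(match.group(0))


def check_invoice(invoice: dict, tolerance: float = 0.01) -> list:
    """Return human-readable problems; an empty list means nothing was flagged."""
    problems = []
    for i, item in enumerate(invoice.get("items", [])):
        # Assumption: line_total is the pre-tax product of quantity and unit_price.
        expected = item["quantity"] * item["unit_price"]
        if abs(expected - item["line_total"]) > tolerance:
            problems.append(f"item {i}: line_total does not equal quantity * unit_price")
        if item["gst_rate"] not in ALLOWED_GST_RATES:
            problems.append(f"item {i}: unexpected gst_rate {item['gst_rate']}")
    return problems


# Hand-written sample response, wrapped in a stray markdown fence to show the tolerance.
fence = "`" * 3
raw = (
    fence + "json\n"
    '{"invoice_number": "INV-2024-001", "supplier": null, "date": null, '
    '"items": [{"product_raw": "SURF XL 1KG", "quantity": 12, '
    '"unit_price": 95.0, "gst_rate": 18, "line_total": 1140.0}], '
    '"grand_total": 1140.0, "extraction_warnings": []}\n' + fence
)
invoice = parse_response(raw)
print(check_invoice(invoice))  # prints []
```

Because unreadable numeric fields default to 0, a `gst_rate` of 0 is flagged by this check; whether that should count as an error or as "unreadable" is a choice left to the caller.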
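```python
import io
import json

import fitz  # PyMuPDF
from PIL import Image

# Reuses model, tokenizer, and prompt from the image example above.
doc = fitz.open("invoice.pdf")
results = []
for page in doc:
    # Render each page at 2x zoom so small print stays legible.
    pix = page.get_pixmap(matrix=fitz.Matrix(2.0, 2.0))
    img = Image.open(io.BytesIO(pix.tobytes("png"))).convert("RGB")
    msgs = [{"role": "user", "content": [img, prompt]}]
    raw = model.chat(
        image=None,
        msgs=msgs,
        tokenizer=tokenizer,
        sampling=False,
        max_new_tokens=2048,
    )
    # json.loads raises json.JSONDecodeError if the output is not valid JSON.
    results.append(json.loads(raw))
```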
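The loop above yields one JSON object per page, so a multi-page invoice still needs to be combined into a single record. One possible way to do that is sketched below. `merge_pages` is a hypothetical helper, and its policy (first non-null header value wins, items and warnings are concatenated, the last non-zero `grand_total` wins) is an assumption rather than the project's documented behaviour.

```python
def merge_pages(pages: list) -> dict:
    """Combine per-page extraction results into one invoice dict."""
    merged = {
        "invoice_number": None,
        "supplier": None,
        "date": None,
        "items": [],
        "grand_total": 0,
        "extraction_warnings": [],
    }
    for page in pages:
        # Header fields: keep the first non-null value seen.
        for key in ("invoice_number", "supplier", "date"):
            if merged[key] is None and page.get(key) is not None:
                merged[key] = page[key]
        merged["items"].extend(page.get("items", []))
        merged["extraction_warnings"].extend(page.get("extraction_warnings", []))
        # Assumption: the invoice total is the last non-zero grand_total encountered.
        if page.get("grand_total"):
            merged["grand_total"] = page["grand_total"]
    return merged


# Two hand-written page results for illustration.
pages = [
    {"invoice_number": "INV-2024-001", "supplier": None, "date": "2026-06-10",
     "items": [{"product_raw": "SURF XL 1KG", "quantity": 12, "unit_price": 95.0,
                "gst_rate": 18, "line_total": 1140.0}],
     "grand_total": 0, "extraction_warnings": ["grand total not on this page"]},
    {"invoice_number": None, "supplier": "Hindustan Unilever Ltd.", "date": None,
     "items": [{"product_raw": "MAGGI MASALA 70G", "quantity": 48, "unit_price": 14.0,
                "gst_rate": 5, "line_total": 672.0}],
     "grand_total": 1812.0, "extraction_warnings": []},
]
merged = merge_pages(pages)
print(merged["invoice_number"], len(merged["items"]), merged["grand_total"])
```

Taking the last non-zero total is a simple heuristic; invoices that print running subtotals on every page may need a different rule.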
- Product names are returned verbatim (product_raw); normalization to canonical SKU names is handled downstream by the MiniCPM5-1B normalizer agent.
- grand_total extraction can fail on invoices with complex multi-page subtotal structures.

```bibtex
@misc{kirana_detective_minicpmv_2026,
  author       = {Syed Naazim Hussain},
  title        = {MiniCPM-V 4.6 Fine-Tuned for Indian Invoice Extraction},
  year         = {2026},
  publisher    = {HuggingFace},
  howpublished = {\url{https://huggingface.co/naazimsnh02/minicpm-v-4-6-indian-invoice-extraction-merged}},
}
```
Apache 2.0 — same license as the base openbmb/MiniCPM-V-4.6 model.