MiniCPM-V 4.6 — Indian Invoice Extraction (Merged)

Fine-tuned openbmb/MiniCPM-V-4.6 for structured JSON extraction from Indian distributor (kirana) invoices.

QLoRA adapter weights are fully merged into the base model — no PEFT dependency at inference time. Part of the Kirana Detective project: a six-agent AI pipeline that audits invoices for pricing anomalies, missing deliveries, and GST errors.


Model Details

Attribute               Value
Base model              openbmb/MiniCPM-V-4.6
Task                    Vision-language OCR + structured JSON extraction
Fine-tuning method      QLoRA (4-bit NF4 base, LoRA rank 16, α 32)
Trainable parameters    9,486,336 / 1,309,914,352 (0.72%)
Target modules          q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
Training epochs         3
Final eval loss         0.2120 (↓ from 0.2901 at epoch 1)
Training hardware       NVIDIA A10G, 22 GB VRAM (Modal)
Training duration       ~52 minutes
Output format           Merged full weights, bfloat16
Inference runtime       transformers (AutoModel + model.chat())
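Merging folds each LoRA update into the weight matrix it adapts. In the standard LoRA formulation (which PEFT follows unless rank-stabilized scaling is enabled), the merged weight is W + (α/r)·B·A, so the rank 16 and α 32 listed above give a scale factor of 2. The sketch below assumes that scaling and uses tiny made-up matrices in plain Python, not real model weights, to show that the merged layer produces the same output as running the base layer and the adapter separately.

```python
# Toy illustration of merging a LoRA adapter into a base weight matrix.
# Assumes the standard LoRA scaling alpha / r; all matrices are made-up examples.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def matvec(w, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in w]


def add_scaled(w, delta, scale):
    """Return w + scale * delta, element-wise."""
    return [[w[i][j] + scale * delta[i][j] for j in range(len(w[0]))] for i in range(len(w))]


R, ALPHA = 16, 32      # rank and alpha from the Model Details table
SCALE = ALPHA / R      # 2.0

# Base weight W (2 x 3) and LoRA factors B (2 x 1), A (1 x 3).
# The toy factors have rank 1 for readability; only SCALE comes from the real config.
W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, -1.0]]
B = [[0.5],
     [1.0]]
A = [[1.0, 2.0, 0.0]]

# Merged weight: W + (alpha / r) * B @ A, computed once, offline.
W_merged = add_scaled(W, matmul(B, A), SCALE)

x = [1.0, 1.0, 1.0]
# Unmerged path: base output plus scaled adapter output (what PEFT does at runtime).
unmerged = [b + SCALE * d for b, d in zip(matvec(W, x), matvec(B, matvec(A, x)))]
merged = matvec(W_merged, x)
print(merged, unmerged)  # → [6.0, 6.0] [6.0, 6.0]
```

Because the update is absorbed into W once, inference runs a plain forward pass, which is why the merged checkpoint needs no PEFT dependency.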

Training Data

Dataset: build-small-hackathon/kirana-invoice-train-data

Split    Examples
Train    450
Eval     50

Synthetic Indian distributor invoices, rendered with Pillow, covering:

  • 10 suppliers: HUL, Nestlé, Parle, Britannia, ITC, Amul, Dabur, Marico, Emami, Godrej
  • 4 invoice formats: printed GST bill, Tally PDF export, handwritten, WhatsApp screenshot
  • Intentionally injected errors (GST rate mismatches, duplicate line items, price spikes), which train the model to report extraction warnings alongside the extracted data

Training Metrics

Epoch    Train loss    Eval loss
1        n/a           0.2901
2        n/a           0.2281
3        n/a           0.2120
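The eval-loss column can be summarized as relative reductions. This short calculation uses only the three values in the table above.

```python
# Relative eval-loss reduction between epochs, using the values from the table above.
eval_loss = {1: 0.2901, 2: 0.2281, 3: 0.2120}


def relative_drop(before: float, after: float) -> float:
    """Fractional reduction going from `before` to `after`."""
    return (before - after) / before


for epoch in (2, 3):
    drop = relative_drop(eval_loss[epoch - 1], eval_loss[epoch])
    print(f"epoch {epoch - 1} -> {epoch}: {drop:.1%} lower")

overall = relative_drop(eval_loss[1], eval_loss[3])
print(f"epoch 1 -> 3: {overall:.1%} lower")
```

This prints reductions of about 21.4% (epoch 1 to 2), 7.1% (epoch 2 to 3) and 26.9% overall, so most of the measured improvement came in the second epoch.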

Supported Input Formats

Format                  Example
Printed GST invoice     Standard B2B tax invoice with HSN codes
Tally PDF export        Machine-generated tabular layout
Handwritten invoice     Photo of handwritten bill
WhatsApp screenshot     Low-resolution forwarded invoice image

Output Schema

The model is trained to return only a JSON object matching this schema, with no markdown or prose:

{
  "invoice_number": "INV-2024-001",
  "supplier": "Hindustan Unilever Ltd.",
  "date": "2026-06-10",
  "items": [
    {
      "product_raw": "SURF XL 1KG",
      "quantity": 12,
      "unit_price": 95.00,
      "gst_rate": 18,
      "line_total": 1140.00
    },
    {
      "product_raw": "MAGGI MASALA 70G",
      "quantity": 48,
      "unit_price": 14.00,
      "gst_rate": 5,
      "line_total": 672.00
    }
  ],
  "grand_total": 9650.00,
  "extraction_warnings": []
}
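Because downstream agents consume this object directly, it can help to check each response against the schema before using it. The following is a minimal, hypothetical validator (not part of the model or the Kirana Detective code) built only on the standard library; it checks required keys and value types and returns a list of problems rather than raising.

```python
import json

# Allowed Python types per field, following the schema above.
# Header fields may be null (None); numeric fields may be int or float.
TOP_LEVEL_TYPES = {
    "invoice_number": (str, type(None)),
    "supplier": (str, type(None)),
    "date": (str, type(None)),
    "items": (list,),
    "grand_total": (int, float),
    "extraction_warnings": (list,),
}
ITEM_TYPES = {
    "product_raw": (str,),
    "quantity": (int, float),
    "unit_price": (int, float),
    "gst_rate": (int, float),
    "line_total": (int, float),
}


def _type_ok(value, types) -> bool:
    # bool is a subclass of int in Python, so reject it explicitly.
    return isinstance(value, types) and not isinstance(value, bool)


def schema_problems(raw: str) -> list:
    """Return schema violations in a model response; an empty list means it conforms."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]
    if not isinstance(data, dict):
        return ["top level is not a JSON object"]

    problems = []
    for key, types in TOP_LEVEL_TYPES.items():
        if key not in data:
            problems.append(f"missing key: {key}")
        elif not _type_ok(data[key], types):
            problems.append(f"wrong type for {key}: {type(data[key]).__name__}")

    items = data.get("items")
    if isinstance(items, list):
        for i, item in enumerate(items):
            if not isinstance(item, dict):
                problems.append(f"items[{i}] is not an object")
                continue
            for key, types in ITEM_TYPES.items():
                if key not in item:
                    problems.append(f"items[{i}] missing key: {key}")
                elif not _type_ok(item[key], types):
                    problems.append(f"items[{i}] wrong type for {key}")
    return problems


sample = '{"invoice_number": null, "supplier": "Parle", "date": null, "items": [], "grand_total": 0}'
print(schema_problems(sample))  # → ['missing key: extraction_warnings']
```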

Field notes:

  • product_raw — verbatim as printed on the invoice (abbreviations, typos preserved)
  • gst_rate — percentage value (5, 12, 18, 28), not a decimal
  • date — ISO 8601 (YYYY-MM-DD) when parseable, raw string otherwise
  • extraction_warnings — list of issues noticed (missing fields, illegible areas, GST anomalies)
  • Numeric fields default to 0 when unreadable; invoice_number/supplier/date default to null
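The field notes above translate into a few mechanical checks on a parsed record. This is a hypothetical post-processing helper, not taken from the Kirana Detective code: it assumes the record already passed a schema check, flags a date that is not ISO 8601, a gst_rate that looks like a decimal fraction or falls outside the four slabs named in the field notes, null header fields, and zero quantities or prices (which may be unreadable values defaulted to 0). It only reports warnings and never modifies the extracted values.

```python
from datetime import date

# GST slabs named in the field notes; any other value is flagged for review.
EXPECTED_GST_SLABS = {5, 12, 18, 28}


def field_note_warnings(record: dict) -> list:
    """Return warnings for a parsed record, based on the field-note conventions."""
    warnings = []

    # date: ISO 8601 (YYYY-MM-DD) when parseable, raw string otherwise.
    raw_date = record.get("date")
    if raw_date is None:
        warnings.append("date is null")
    else:
        try:
            date.fromisoformat(raw_date)
        except (TypeError, ValueError):
            warnings.append(f"date not in ISO 8601 form: {raw_date!r}")

    # invoice_number and supplier default to null when unreadable.
    for key in ("invoice_number", "supplier"):
        if record.get(key) is None:
            warnings.append(f"{key} is null")

    for i, item in enumerate(record.get("items") or []):
        rate = item.get("gst_rate")
        if not isinstance(rate, (int, float)) or isinstance(rate, bool):
            warnings.append(f"items[{i}] gst_rate is not a number")
        elif 0 < rate < 1:
            # Percentages are expected (18), not decimal fractions (0.18).
            warnings.append(f"items[{i}] gst_rate {rate} looks like a decimal")
        elif rate not in EXPECTED_GST_SLABS:
            warnings.append(f"items[{i}] gst_rate {rate} is outside {sorted(EXPECTED_GST_SLABS)}")
        # Numeric fields default to 0 when unreadable, so a zero deserves a second look.
        for key in ("quantity", "unit_price"):
            if item.get(key) == 0:
                warnings.append(f"items[{i}] {key} is 0 (possibly unreadable)")
    return warnings


record = {
    "invoice_number": "INV-2024-001",
    "supplier": "Hindustan Unilever Ltd.",
    "date": "10/06/2026",
    "items": [{"product_raw": "SURF XL 1KG", "quantity": 12, "unit_price": 95.0,
               "gst_rate": 0.18, "line_total": 1140.0}],
    "grand_total": 9650.0,
    "extraction_warnings": [],
}
print(field_note_warnings(record))  # flags the non-ISO date and the decimal gst_rate
```

The resulting messages could be appended to the record's own extraction_warnings list, keeping the model's warnings and the rule-based ones in one place.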

Usage

Basic Inference

import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "naazimsnh02/minicpm-v-4-6-indian-invoice-extraction-merged"

# trust_remote_code is required: MiniCPM-V ships its own modeling code, including model.chat().
model = AutoModel.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,  # the merged weights are published in bfloat16
    device_map="auto",
)
model.eval()
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

image = Image.open("invoice.jpg").convert("RGB")

# The prompt spells out the same schema as in Output Schema.
prompt = (
    "You are an OCR agent for Indian kirana store invoices. "
    "Extract all information from this invoice image and return ONLY valid JSON "
    "matching this schema exactly:\n"
    '{"invoice_number": string|null, "supplier": string|null, "date": string|null, '
    '"items": [{"product_raw": string, "quantity": number, "unit_price": number, '
    '"gst_rate": number, "line_total": number}], '
    '"grand_total": number, "extraction_warnings": [string]}\n'
    "Return ONLY the JSON object, no markdown, no prose."
)

# The image goes inline in the message content, so the separate image argument is None.
msgs = [{"role": "user", "content": [image, prompt]}]
# sampling=False disables sampling, keeping the structured output deterministic.
response = model.chat(image=None, msgs=msgs, tokenizer=tokenizer, sampling=False, max_new_tokens=2048)
print(response)
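The model is trained to return bare JSON, but when the output feeds an automated pipeline a defensive parser costs little. This hypothetical helper (not part of the model's remote code) tries a direct json.loads first and, if that fails, falls back to the outermost {...} span, which also discards a stray Markdown fence or surrounding text.

```python
import json


def parse_model_json(response: str) -> dict:
    """Parse the JSON object in a model response, tolerating stray fences or prose."""
    text = response.strip()
    try:
        # Expected case: the bare JSON object the model is trained to emit.
        data = json.loads(text)
    except json.JSONDecodeError:
        # Fallback: keep only the outermost {...} span, dropping any surrounding
        # Markdown fence or stray text.
        start, end = text.find("{"), text.rfind("}")
        if start == -1 or end < start:
            raise ValueError("no JSON object found in model response") from None
        data = json.loads(text[start:end + 1])
    if not isinstance(data, dict):
        raise ValueError("model response is JSON but not an object")
    return data
```

Usage: record = parse_model_json(response). Since json.JSONDecodeError subclasses ValueError, callers can catch ValueError to handle every failure mode in one place.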

From a PDF (multi-page)

import io
import json

import fitz  # PyMuPDF
from PIL import Image

# Reuses model, tokenizer and prompt from the Basic Inference example.
doc = fitz.open("invoice.pdf")
results = []
for page in doc:
    # Render at 2x zoom (about 144 DPI instead of the default 72) so small print stays legible.
    pix = page.get_pixmap(matrix=fitz.Matrix(2.0, 2.0))
    img = Image.open(io.BytesIO(pix.tobytes("png"))).convert("RGB")
    msgs = [{"role": "user", "content": [img, prompt]}]
    raw = model.chat(image=None, msgs=msgs, tokenizer=tokenizer, sampling=False, max_new_tokens=2048)
    results.append(json.loads(raw))  # one JSON object per page
doc.close()
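The loop above yields one JSON object per page, and the card does not specify how to combine them. One simple, hypothetical policy: concatenate line items across pages, take each header field from the first page that has it (warning if a later page disagrees), take grand_total from the last page that reports a non-zero value on the assumption that the total is printed at the end, and keep every warning with a page prefix. Given the multi-page grand_total limitation noted under Limitations, the result is worth cross-checking.

```python
def merge_pages(pages: list) -> dict:
    """Combine per-page extraction records into one invoice record (hypothetical policy)."""
    merged = {
        "invoice_number": None,
        "supplier": None,
        "date": None,
        "items": [],
        "grand_total": 0,
        "extraction_warnings": [],
    }
    for page_no, page in enumerate(pages, start=1):
        # Header fields: keep the first non-null value, flag conflicting ones.
        for key in ("invoice_number", "supplier", "date"):
            value = page.get(key)
            if value is None:
                continue
            if merged[key] is None:
                merged[key] = value
            elif merged[key] != value:
                merged["extraction_warnings"].append(
                    f"page {page_no}: {key} {value!r} differs from {merged[key]!r}"
                )
        merged["items"].extend(page.get("items") or [])
        # Assume the invoice total appears on the last page that reports one.
        if page.get("grand_total"):
            merged["grand_total"] = page["grand_total"]
        merged["extraction_warnings"].extend(
            f"page {page_no}: {w}" for w in page.get("extraction_warnings") or []
        )
    return merged
```

Usage: invoice = merge_pages(results).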

Limitations

  • Trained on synthetic invoices only — real-world performance may vary on heavily degraded, stamped, or non-standard layouts until production data is collected.
  • Optimised for English and numeric invoice content; Hindi/regional-language invoices are not yet covered.
  • Product names are extracted verbatim (product_raw) — normalization to canonical SKU names is handled downstream by the MiniCPM5-1B normalizer agent.
  • grand_total extraction can fail on invoices with complex multi-page subtotal structures.

Citation

@misc{kirana_detective_minicpmv_2026,
  author    = {Syed Naazim Hussain},
  title     = {MiniCPM-V 4.6 Fine-Tuned for Indian Invoice Extraction},
  year      = {2026},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/naazimsnh02/minicpm-v-4-6-indian-invoice-extraction-merged}},
}

License

Apache 2.0 — same license as the base openbmb/MiniCPM-V-4.6 model.
