	# Model Fine-tuning Guide

	Fine-tune Kirana Detective's three models on Indian FMCG invoice data.

	## Quick Start (TL;DR)

	```bash
	export ROBOFLOW_API_KEY="<your-key>"
	export HF_TOKEN="<your-token>"
	modal run finetune/generate_invoices.py   # ~10 min
	modal run finetune/train_minicpm_v.py     # ~1 hour (52 min measured on A10G)
	modal run finetune/train_minicpm5_1b.py   # ~1 hour
	modal run finetune/train_yolo26n.py       # ~2 hours
	```

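	Each training script publishes its model to the Hugging Face Hub on completion. The merged MiniCPM-V weights are published separately by `finetune/export_minicpm_v_gguf.py`.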

	---

	## Three Models, Three Pipelines

	### 1. MiniCPM-V 4.6 (Invoice OCR) — `train_minicpm_v.py`

	Purpose: Extract line items, amounts, GST from invoice images (printed PDFs, handwritten, WhatsApp screenshots)

	Input: 500 synthetic invoices (4 formats)
	Method: QLoRA fine-tuning via PEFT + bitsandbytes (Unsloth is incompatible with MiniCPM-V-4.6)
	Output: LoRA adapter → merged HF weights (bfloat16). GGUF conversion is a separate manual step via [gguf-my-repo Space](https://huggingface.co/spaces/ggml-org/gguf-my-repo).
	Hardware: A10G, 22 GB VRAM, ~52 min (actual)

	Datasets used:
	- Synthetic invoices generated by `generate_invoices.py`
	- Splits: train/val/test = 400/50/50
	- Formats: GST, Tally PDF, handwritten, WhatsApp (all four rendered with pure Pillow, no native dependencies)
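
The 400/50/50 split above can be reproduced with a seeded shuffle. This is a minimal sketch, not the code in `generate_invoices.py`: the function name `split_invoices`, the ID format, and the seed value are assumptions made for illustration.

```python
import random


def split_invoices(ids, n_train=400, n_val=50, n_test=50, seed=42):
    """Shuffle invoice IDs deterministically and cut them into three splits."""
    ids = list(ids)
    expected = n_train + n_val + n_test
    if len(ids) != expected:
        raise ValueError(f"expected {expected} invoices, got {len(ids)}")
    rng = random.Random(seed)  # local RNG, so global random state is untouched
    rng.shuffle(ids)
    return {
        "train": ids[:n_train],
        "val": ids[n_train:n_train + n_val],
        "test": ids[n_train + n_val:],
    }


# Hypothetical invoice IDs; the real generator may name files differently.
splits = split_invoices(f"invoice_{i:04d}" for i in range(500))
print({name: len(items) for name, items in splits.items()})
# {'train': 400, 'val': 50, 'test': 50}
```

Using a local `random.Random(seed)` keeps the split stable across runs regardless of what other code does to the global random state.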

	---

	### 2. MiniCPM5-1B (Product Name Normalizer) — `train_minicpm5_1b.py`

	Purpose: Map invoice abbreviations (e.g., "MAGGI NDL 70GM") to canonical names

	Input: 2,000 synthetic (raw, canonical) pairs
	Method: QLoRA, 4-bit base + LoRA adapters
	Output: GGUF quantized model (Q4_K_M, ~1.2 GB)
	Hardware: A10G, ~1 hour

	Dataset generation:
	- Hand-curated 200 SKU catalog
	- Rule-based augmentation: abbreviation expansion, typo injection, truncation
	- Coverage: 10 major Indian FMCG suppliers
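
The three augmentation rules listed above could be sketched as follows. Everything here is illustrative: the abbreviation table, the helper names, the canonical name, and the 50% application probability are assumptions, and the abbreviation rule is modeled as replacing full words with short forms to produce the raw invoice text.

```python
import random

# Hypothetical abbreviation table; the real catalog and rules are not shown in this guide.
ABBREVIATIONS = {"NOODLES": "NDL", "BISCUIT": "BISC", "POWDER": "PWD", "SHAMPOO": "SHMP"}


def abbreviate(name):
    """Replace known full words with their invoice-style short forms."""
    return " ".join(ABBREVIATIONS.get(word, word) for word in name.split())


def inject_typo(name, rng):
    """Swap two adjacent characters at a random position."""
    if len(name) < 2:
        return name
    i = rng.randrange(len(name) - 1)
    chars = list(name)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def truncate(name, max_len=12):
    """Cut the name to at most max_len characters, as narrow invoice columns do."""
    return name[:max_len].rstrip()


def make_pairs(canonical, n, seed=0):
    """Generate n (raw, canonical) pairs by applying each rule with probability 0.5."""
    rng = random.Random(seed)
    steps = [abbreviate, lambda s: inject_typo(s, rng), truncate]
    pairs = []
    for _ in range(n):
        raw = canonical.upper()
        for step in steps:
            if rng.random() < 0.5:
                raw = step(raw)
        pairs.append((raw, canonical))
    return pairs


print(abbreviate("MAGGI NOODLES 70GM"))  # MAGGI NDL 70GM
for raw, canonical in make_pairs("Maggi Noodles 70gm", n=3):
    print(f"{raw!r} -> {canonical!r}")
```

Because the rules are purely mechanical, a model trained on them can overfit to exactly these patterns, which is the limitation noted later for MiniCPM5-1B.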

	---

	### 3. YOLO26n (Product Detection) — `train_yolo26n.py`

	Purpose: Count packaged products in shelf/counter photos

	Input: 3 Roboflow datasets merged (11,000+ images)
	Method: Ultralytics standard training pipeline
	Output: ONNX format for CPU/GPU inference
	Hardware: A10G, ~2 hours
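
Counting products from detector output reduces to tallying class IDs above a confidence threshold. The sketch below assumes detections arrive as `(class_id, confidence)` tuples and uses made-up class names; the actual post-processing in the app may differ.

```python
from collections import Counter


def count_products(detections, class_names, min_conf=0.25):
    """Tally detections per class name, ignoring low-confidence boxes."""
    counts = Counter()
    for class_id, conf in detections:
        if conf >= min_conf:
            counts[class_names[class_id]] += 1
    return dict(counts)


# Hypothetical class list and detections, for illustration only.
class_names = ["maggi", "parle-g", "nivea"]
detections = [(0, 0.91), (0, 0.88), (1, 0.40), (2, 0.10)]
print(count_products(detections, class_names))
# {'maggi': 2, 'parle-g': 1}
```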

	Datasets merged:
	1. [agentsk47/indian-grocery-object-detection](https://universe.roboflow.com/agentsk47/indian-grocery-object-detection-mfsnx) v1
	2. [iit-patna/grocery_items](https://universe.roboflow.com/iit-patna-qg1jh/grocery_items-7i2em) v45 (6,695 images)
	3. [project-c5ho0/indian-market](https://universe.roboflow.com/project-c5ho0/indian-market-qieug) v2 (4,694 images)
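
Merging datasets with different label sets requires one unified class list and a per-dataset ID remap. This is a sketch under assumptions: the dataset keys and class names are invented, and classes are unified by simple case-insensitive name matching, which may not be what `train_yolo26n.py` does.

```python
def build_unified_classes(datasets):
    """Return (unified_names, remap).

    datasets maps a dataset name to its class names in local ID order.
    remap[dataset][local_id] gives the class ID in the unified list.
    """
    unified = []
    index = {}
    remap = {}
    for ds_name, names in datasets.items():
        remap[ds_name] = {}
        for local_id, raw in enumerate(names):
            key = raw.strip().lower()  # assumption: same name means same class
            if key not in index:
                index[key] = len(unified)
                unified.append(key)
            remap[ds_name][local_id] = index[key]
    return unified, remap


def remap_label_line(line, id_map):
    """Rewrite the class ID of one YOLO label line: '<cls> <cx> <cy> <w> <h>'."""
    cls, *box = line.split()
    return " ".join([str(id_map[int(cls)]), *box])


# Hypothetical class lists for two of the merged datasets.
unified, remap = build_unified_classes({
    "dataset_a": ["Maggi", "Parle-G"],
    "dataset_b": ["parle-g", "Nivea"],
})
print(unified)                                                  # ['maggi', 'parle-g', 'nivea']
print(remap_label_line("1 0.5 0.5 0.2 0.3", remap["dataset_b"]))  # 2 0.5 0.5 0.2 0.3
```

Name-based matching keeps shared products under one ID, but near-duplicate labels (spelling variants, pack sizes) still end up as separate classes.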

	---

	## Prerequisites

	```bash
	# 1. Clone this repo
	git clone https://github.com/naazimsnh02/kirana-detective.git
	cd kirana-detective

	# 2. Install local deps (only needed to preview the generated synthetic invoices)
	pip install -r requirements.txt

	# 3. Set up secrets for Modal/HF
	modal token new
	export ROBOFLOW_API_KEY="<your-key>"   # from your Roboflow Universe account
	export HF_TOKEN="<your-token>"         # from huggingface.co/settings/tokens (needs write access)

	# 4. Test Modal setup
	modal run finetune/generate_invoices.py
	```

	---

	## Reproducibility Checklist

	- [ ] Dataset versioning: All Roboflow versions pinned (v1, v45, v2)
	- [ ] Seed control: Random seeds fixed in all training scripts
	- [ ] Output validation: Run `tests/` after each model completes
	- [ ] HF Hub publish logs: Check that each model card was auto-generated from the training run
	- [ ] Quantization/export check: Compare F1 of the GGUF models and mAP of the YOLO ONNX export against the unquantized baseline
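
For the seed-control item, a helper along these lines is a common pattern. It is a sketch rather than the code used in the training scripts: the name `seed_everything` and the default seed are assumptions, and NumPy and PyTorch are seeded only if they are installed.

```python
import os
import random


def seed_everything(seed=42):
    """Seed the RNGs that are available in this environment."""
    random.seed(seed)
    # Only affects hash randomization in subprocesses started after this call.
    os.environ["PYTHONHASHSEED"] = str(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # safe to call without a GPU
    except ImportError:
        pass
    return seed


seed_everything(42)
first = random.random()
seed_everything(42)
assert random.random() == first  # same seed, same sequence
```

Fixed seeds make data shuffling and initialization repeatable; some GPU kernels remain non-deterministic, so small run-to-run metric differences can still occur.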

	---

	## Known Limitations & Biases

	| Model | Limitation | Impact | Mitigation |
	|---|---|---|---|
	| MiniCPM-V | Only 10 FMCG suppliers in training data | Fails on uncommon brands | Add more invoices post-hackathon |
	| MiniCPM5-1B | Synthetic data only (no real invoice typos) | Overfits to rule-based augmentation | Collect 200+ real examples next |
	| YOLO26n | Merged dataset skewed toward beauty/personal care (Tresemmé, Nivea, Patanjali) | May underperform on grocery staples | Balance class distribution across grocery categories |

	---

	## Troubleshooting

	"Modal timeout after 2 hours?"
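	→ YOLO training can take 2–3 hours depending on the GPU queue. Raise the function timeout: in Modal this is normally the `timeout=` argument (in seconds) of the `@app.function(...)` decorator in `train_yolo26n.py`, not a `modal.json` file.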

	"GGUF quantization fails?"
	→ Ensure llama.cpp is compiled with CUDA support if you intend to quantize on the GPU.

	"HF Hub publish returns 403?"
	→ `HF_TOKEN` must have write access. Regenerate at huggingface.co/settings/tokens.

	---

	## Output Files

	Training scripts publish initially to the personal `naazimsnh02/` namespace; models are then
	manually transferred to the `build-small-hackathon/` org for the hackathon submission.

	After training runs, check HF Hub (`naazimsnh02/`):

	- MiniCPM-V LoRA adapter: `naazimsnh02/minicpm-v-4-6-indian-invoice-extraction`
	  - LoRA adapter files (`adapter_config.json`, `adapter_model.safetensors`, etc.)
	  - `mmproj.gguf` (vision encoder, uploaded separately via `export_minicpm_v_gguf.py`)

	- MiniCPM-V merged weights: `naazimsnh02/minicpm-v-4-6-indian-invoice-extraction-merged`
	  - Full merged bfloat16 weights (no PEFT required at inference)
	  - Run `modal run finetune/export_minicpm_v_gguf.py` after training to create this repo

	- MiniCPM5-1B: `naazimsnh02/minicpm5-1b-indian-fmcg-normalizer`
	  - `model.gguf` (Q4_K_M, ~1.2 GB)

	- YOLO26n: `naazimsnh02/yolo26n-indian-fmcg-detection`
	  - `yolo26n_fmcg.onnx` (~15 MB, opset 12)
	  - `best.pt` (PyTorch checkpoint)
	  - `class_names.json` (1,831 unified classes from merged dataset)

	Hackathon / production repos (after manual transfer):

	- `build-small-hackathon/minicpm-v-4-6-indian-invoice-extraction-merged`
	- `build-small-hackathon/minicpm5-1b-indian-fmcg-normalizer`
	- `build-small-hackathon/yolo26n-indian-fmcg-detection`
	- `build-small-hackathon/kirana-invoice-train-data` (HF dataset)

	Sharing is Caring — trace datasets:

	```bash
	# Upload Claude Code build sessions (run once after project is complete)
	export HF_TOKEN=<your-token>
	python finetune/upload_build_traces.py
	# → publishes to build-small-hackathon/kirana-detective-build-traces
	# → viewable in HF Data Studio native trace viewer

	# Runtime audit traces are auto-published by tracer.py during app use
	# → build-small-hackathon/kirana-detective-traces
	```

	---

	## Next Steps Post-Hackathon

	1. Collect real invoice data from partner kirana stores (500 invoices minimum)
	2. Expand the product taxonomy from 200 to 2,000 SKUs
	3. Add regional variants (Hindi/Tamil/Malayalam abbreviations)
	4. Benchmark inference latency on Raspberry Pi / Android devices