# Model Fine-tuning Guide

Fine-tune Kirana Detective's three models on Indian FMCG invoice and product-image data.

## Quick Start (TL;DR)

```bash
# Replace the placeholders; the quotes keep the shell from treating < and > as redirections
export ROBOFLOW_API_KEY="<your-key>"
export HF_TOKEN="<your-token>"
modal run finetune/generate_invoices.py     # 10 min
modal run finetune/train_minicpm_v.py       # ~1 hour (52 min measured on A10G)
modal run finetune/train_minicpm5_1b.py     # 1 hour
modal run finetune/train_yolo26n.py         # 2 hours
```

Each script publishes its model to the Hugging Face Hub automatically when the run completes.

---

## Three Models, Three Pipelines

### 1. MiniCPM-V 4.6 (Invoice OCR) — `train_minicpm_v.py`

**Purpose**: Extract line items, amounts, GST from invoice images (printed PDFs, handwritten, WhatsApp screenshots)

**Input**: 500 synthetic invoices (4 formats)  
**Method**: QLoRA fine-tuning via PEFT + bitsandbytes (Unsloth incompatible with MiniCPM-V-4.6)  
**Output**: LoRA adapter → merged HF weights (bfloat16). GGUF conversion is a separate manual step via [gguf-my-repo Space](https://huggingface.co/spaces/ggml-org/gguf-my-repo).  
**Hardware**: A10G, 22 GB VRAM, ~52 min (actual)

**Datasets used**:
- Synthetic invoices generated by `generate_invoices.py` 
- Splits: train/val/test = 400/50/50
- Formats: GST, Tally PDF, handwritten, WhatsApp (all rendered with pure Pillow, no native deps)
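A 400/50/50 split like the one above can be reproduced with a seeded shuffle. This is a minimal sketch, not the code in `generate_invoices.py`: the function name `split_dataset`, the seed value, and the file names are assumptions made for illustration.

```python
import random

def split_dataset(items, n_val=50, n_test=50, seed=42):
    """Shuffle deterministically, then carve off test and val; the rest is train."""
    if n_val + n_test >= len(items):
        raise ValueError("val + test must be smaller than the dataset")
    shuffled = list(items)
    # A local RNG keeps the split reproducible without touching global random state.
    random.Random(seed).shuffle(shuffled)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

# Hypothetical file names; the real generator may name its outputs differently.
invoices = [f"invoice_{i:04d}.png" for i in range(500)]
train, val, test = split_dataset(invoices)
print(len(train), len(val), len(test))  # 400 50 50
```

Because the shuffle depends only on the seed and the input order, rerunning the split on the same file list yields the same three subsets.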

---

### 2. MiniCPM5-1B (Product Name Normalizer) — `train_minicpm5_1b.py`

**Purpose**: Map invoice abbreviations (e.g., "MAGGI NDL 70GM") to canonical names

**Input**: 2,000 synthetic (raw, canonical) pairs  
**Method**: QLoRA, 4-bit base + LoRA adapters  
**Output**: GGUF quantized model  
**Hardware**: A10G, ~1 hour

**Dataset generation**:
- Hand-curated 200 SKU catalog
- Rule-based augmentation: abbreviation expansion, typo injection, truncation
- Coverage: 10 major Indian FMCG suppliers
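The three rule types above could look roughly like this. It is a sketch under stated assumptions: the abbreviation table, the function names, and the example canonical name are hypothetical, and it derives the abbreviated raw side from a canonical name, which may differ from how `train_minicpm5_1b.py` applies its rules.

```python
import random

# Hypothetical abbreviation table; the real catalog and rules live in the training script.
ABBREVIATIONS = {"NOODLES": "NDL", "BISCUIT": "BISC", "SHAMPOO": "SHMP"}

def abbreviate(name):
    """Replace known words with invoice-style abbreviations."""
    return " ".join(ABBREVIATIONS.get(word, word) for word in name.split())

def inject_typo(name, rng):
    """Swap two adjacent characters at a random position."""
    if len(name) < 2:
        return name
    i = rng.randrange(len(name) - 1)
    return name[:i] + name[i + 1] + name[i] + name[i + 2:]

def truncate(name, rng, min_len=6):
    """Cut the string short, keeping at least min_len characters."""
    if len(name) <= min_len:
        return name
    return name[:rng.randrange(min_len, len(name))]

def make_pairs(canonical, n, seed=0):
    """Generate n (raw, canonical) training pairs for one catalog entry."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        raw = abbreviate(canonical.upper())
        op = rng.choice(["none", "typo", "truncate"])
        if op == "typo":
            raw = inject_typo(raw, rng)
        elif op == "truncate":
            raw = truncate(raw, rng)
        pairs.append((raw, canonical))
    return pairs

pairs = make_pairs("Maggi Noodles 70g", n=5)
for raw, canon in pairs:
    print(f"{raw!r} -> {canon!r}")
```

Applying such rules to every entry of a 200-SKU catalog, ten variants each, would give a pair count in the range quoted above.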

---

### 3. YOLO26n (Product Detection) — `train_yolo26n.py`

**Purpose**: Count packaged products in shelf/counter photos

**Input**: 3 Roboflow datasets merged (11,000+ images)  
**Method**: Ultralytics standard training pipeline  
**Output**: ONNX format for CPU/GPU inference  
**Hardware**: A10G, ~2 hours

**Datasets merged**:
1. [agentsk47/indian-grocery-object-detection](https://universe.roboflow.com/agentsk47/indian-grocery-object-detection-mfsnx) v1
2. [iit-patna/grocery_items](https://universe.roboflow.com/iit-patna-qg1jh/grocery_items-7i2em) v45 (6,695 images)
3. [project-c5ho0/indian-market](https://universe.roboflow.com/project-c5ho0/indian-market-qieug) v2 (4,694 images)
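Merging detection datasets means mapping each dataset's local class ids onto one shared class list and rewriting the label files to match. The sketch below shows one way to do that; the function names are hypothetical, the class names are made-up examples, and normalising names to lowercase is an assumption rather than the rule used in `train_yolo26n.py`.

```python
def build_unified_classes(dataset_classes):
    """Return the sorted unified class list and a per-dataset old-id -> new-id map."""
    unified = sorted(
        {name.strip().lower() for names in dataset_classes.values() for name in names}
    )
    index = {name: i for i, name in enumerate(unified)}
    remap = {
        ds: {old: index[name.strip().lower()] for old, name in enumerate(names)}
        for ds, names in dataset_classes.items()
    }
    return unified, remap

def remap_label_line(line, id_map):
    """Rewrite the class id of one YOLO label line: '<cls> <cx> <cy> <w> <h>'."""
    cls, *box = line.split()
    return " ".join([str(id_map[int(cls)]), *box])

# Made-up class lists standing in for two of the source datasets.
datasets = {
    "indian-grocery": ["Maggi", "Parle-G"],
    "grocery_items": ["parle-g", "Nivea Cream"],
}
unified, remap = build_unified_classes(datasets)
print(unified)  # ['maggi', 'nivea cream', 'parle-g']
print(remap_label_line("1 0.5 0.5 0.2 0.3", remap["indian-grocery"]))  # 2 0.5 0.5 0.2 0.3
```

Sorting the unified list keeps class ids stable across runs as long as the set of class names does not change.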

---

## Prerequisites

```bash
# 1. Clone this repo
git clone https://github.com/naazimsnh02/kirana-detective.git
cd kirana-detective

# 2. Install local deps (only needed to preview the generated synthetic invoices)
pip install -r requirements.txt

# 3. Set up secrets for Modal/HF
modal token new
export ROBOFLOW_API_KEY="<from your Roboflow Universe account>"
export HF_TOKEN="<from huggingface.co/settings/tokens>"

# 4. Test Modal setup
modal run finetune/generate_invoices.py
```
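A missing key is cheaper to catch before launching a multi-hour run. The following is a small preflight sketch, assuming the two variables exported above are the only secrets the scripts read; `missing_env` is a hypothetical helper, not part of the repo.

```python
import os

REQUIRED_VARS = ("ROBOFLOW_API_KEY", "HF_TOKEN")

def missing_env(required=REQUIRED_VARS, env=None):
    """Return the names of required variables that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name, "").strip()]

# Demonstration with an explicit mapping instead of the real environment.
missing = missing_env(env={"HF_TOKEN": "hf_example"})
print(missing)  # ['ROBOFLOW_API_KEY']
```

Calling `missing_env()` with no arguments checks the real environment; an empty list means both variables are set.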

---

## Reproducibility Checklist

- [ ] **Dataset versioning**: All Roboflow versions pinned (v1, v45, v2)
- [ ] **Seed control**: Random seeds fixed in all training scripts
- [ ] **Output validation**: Run `tests/` after each model completes
- [ ] **HF Hub publish logs**: Check that the model card was auto-generated from the training run
- [ ] **Quantization and export**: Verify F1 for the GGUF models and mAP for the YOLO ONNX export against the unquantized baseline
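For the seed-control item, a single helper called at the top of each script is a common pattern. This is a sketch, not the code used in the training scripts: `seed_everything` and the default seed are assumptions, and NumPy and PyTorch are seeded only if they are installed.

```python
import os
import random

def seed_everything(seed=42):
    """Seed Python's RNG and, when installed, NumPy and PyTorch."""
    random.seed(seed)
    # Only affects Python subprocesses started after this point.
    os.environ["PYTHONHASHSEED"] = str(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # safe to call without a GPU
    except ImportError:
        pass
    return seed

seed_everything(42)
first = [random.random() for _ in range(3)]
seed_everything(42)
second = [random.random() for _ in range(3)]
print(first == second)  # True
```

Fixed seeds make data shuffling and initialisation repeatable, but GPU kernels can still introduce small run-to-run differences.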

---

## Known Limitations & Biases

| Model | Limitation | Impact | Mitigation |
|---|---|---|---|
| MiniCPM-V | Only 10 FMCG suppliers in training data | Fails on uncommon brands | Add more invoices post-hackathon |
| MiniCPM5-1B | Synthetic data only (no real invoice typos) | Overfits to rule-based augmentation | Collect 200+ real examples next |
| YOLO26n | Merged dataset skewed toward beauty/personal care (Tresemmé, Nivea, Patanjali) | May underperform on grocery staples | Balance class distribution across grocery categories |

---

## Troubleshooting

**"Modal timeout after 2 hours?"**  
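→ YOLO training can take 2–3 hours depending on the GPU queue. Increase the timeout in `modal.json`, or wherever the script sets it (Modal itself takes a per-function `timeout`, in seconds, on `@app.function`).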

**"GGUF quantization fails?"**  
→ Ensure llama.cpp is compiled with CUDA support if you intend to quantize on the GPU.

**"HF Hub publish returns 403?"**  
→ `HF_TOKEN` must have write access. Regenerate at huggingface.co/settings/tokens.

---

## Output Files

Training scripts publish initially to the personal `naazimsnh02/` namespace; models are then
manually transferred to the `build-small-hackathon/` org for the hackathon submission.

**After training runs, check HF Hub (`naazimsnh02/`):**

- **MiniCPM-V LoRA adapter**: `naazimsnh02/minicpm-v-4-6-indian-invoice-extraction`
  - LoRA adapter files (`adapter_config.json`, `adapter_model.safetensors`, etc.)
  - `mmproj.gguf` (vision encoder, uploaded separately via `export_minicpm_v_gguf.py`)

- **MiniCPM-V merged weights**: `naazimsnh02/minicpm-v-4-6-indian-invoice-extraction-merged`
  - Full merged bfloat16 weights (no PEFT required at inference)
  - Run `modal run finetune/export_minicpm_v_gguf.py` after training to create this repo

- **MiniCPM5-1B**: `naazimsnh02/minicpm5-1b-indian-fmcg-normalizer`
  - `model.gguf` (Q4_K_M, ~1.2 GB)

- **YOLO26n**: `naazimsnh02/yolo26n-indian-fmcg-detection`
  - `yolo26n_fmcg.onnx` (~15 MB, opset 12)
  - `best.pt` (PyTorch checkpoint)
  - `class_names.json` (1,831 unified classes from merged dataset)
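Once the detector has produced class ids, `class_names.json` turns them into per-product counts. The sketch below assumes the file is a JSON array indexed by class id, which is a guess about its layout; the class names and detections are made-up examples and `count_products` is a hypothetical helper.

```python
import json
from collections import Counter

def count_products(class_ids, class_names):
    """Turn a list of detected class ids into {class_name: count}."""
    return dict(Counter(class_names[i] for i in class_ids))

# Assumption: class_names.json holds a JSON array indexed by class id.
class_names = json.loads('["maggi", "nivea cream", "parle-g"]')
detected_ids = [2, 0, 2, 2, 1]  # example class ids after post-processing
print(count_products(detected_ids, class_names))
# {'parle-g': 3, 'maggi': 1, 'nivea cream': 1}
```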

**Hackathon / production repos (after manual transfer):**

- `build-small-hackathon/minicpm-v-4-6-indian-invoice-extraction-merged`
- `build-small-hackathon/minicpm5-1b-indian-fmcg-normalizer`
- `build-small-hackathon/yolo26n-indian-fmcg-detection`
- `build-small-hackathon/kirana-invoice-train-data` (HF dataset)

**Sharing is Caring — trace datasets:**

```bash
# Upload Claude Code build sessions (run once after project is complete)
export HF_TOKEN="<your-token>"
python finetune/upload_build_traces.py
# → publishes to build-small-hackathon/kirana-detective-build-traces
# → viewable in HF Data Studio native trace viewer

# Runtime audit traces are auto-published by tracer.py during app use
# → build-small-hackathon/kirana-detective-traces
```

---

## Next Steps Post-Hackathon

1. **Collect real invoice data** from partner kirana stores (at least 500 invoices)
2. **Expand product taxonomy** (from the current 200 SKUs to 2,000)
3. **Add regional variants** (Hindi/Tamil/Malayalam abbreviations)
4. **Benchmark inference latency** on Raspberry Pi / Android devices
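For the latency item, a small timing harness is enough to compare devices. This is a generic sketch: `benchmark` is a hypothetical helper, the workload is a stand-in, and the p95 value is a simple index-based approximation.

```python
import statistics
import time

def benchmark(fn, *, warmup=3, runs=20):
    """Time fn() and return median and approximate p95 latency in milliseconds."""
    for _ in range(warmup):
        fn()  # warm caches before measuring
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    p95 = samples[min(len(samples) - 1, int(0.95 * len(samples)))]
    return {"median_ms": statistics.median(samples), "p95_ms": p95, "runs": runs}

# Stand-in workload; replace with a call into the ONNX or GGUF runtime under test.
result = benchmark(lambda: sum(range(10_000)))
print(result)
```

Reporting the median alongside a tail percentile helps on devices such as a Raspberry Pi, where thermal throttling can skew the mean.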