Claude committed on
docs: rewrite README as comprehensive English documentation
Replace the bilingual French/English README with a complete English version
covering all features, installation methods, CLI commands, web API endpoints,
pipeline modes, supported engines, normalization profiles, error taxonomy,
project structure, environment variables, CI/CD setup, and development guide.
https://claude.ai/code/session_01PJLbDjPUK3VFiqKK6gT8Hh
README.md
CHANGED
|
@@ -9,203 +9,441 @@ pinned: false
|
|
| 9 |
|
| 10 |
# Picarones
|
| 11 |
|
| 12 |
-
> **
|
| 13 |
-
Apache 2.0
|
| 14 |
|
| 15 |
[](https://github.com/maribakulj/Picarones/actions/workflows/ci.yml)
|
| 16 |
[](https://www.python.org/downloads/)
|
| 17 |
[](LICENSE)
|
|
|
|
| 18 |
|
| 19 |
---
|
| 20 |
|
| 21 |
-
**Picarones**
|
| 22 |
-
(Tesseract, Pero OCR, Kraken,
|
| 23 |
-
|
| 24 |
|
| 25 |
-
-
|
| 26 |
-
|
| 27 |
-
|
| 28 |
|
| 29 |
---
|
| 30 |
|
| 31 |
-
##
|
| 32 |
-
|
| 33 |
-
- [
|
| 34 |
-
- [
|
| 35 |
-
- [
|
| 36 |
-
- [
|
| 37 |
-
- [
|
| 38 |
-
- [
|
| 39 |
- [Roadmap](#roadmap)
|
| 40 |
-
- [
|
|
|
|
| 41 |
|
| 42 |
---
|
| 43 |
|
| 44 |
-
##
|
| 45 |
-
|
| 46 |
-
###
|
| 47 |
-
|
| 48 |
-
- **CER** (Character Error Rate) :
|
| 49 |
-
|
| 50 |
-
- **
|
| 51 |
-
- **
|
| 52 |
-
- **
|
| 53 |
-
|
| 54 |
-
- **
|
| 55 |
-
|
| 56 |
-
|
| 57 |
-
|
| 58 |
|
| 59 |
-
|
| 60 |
-
- Modes: text only, image+text, zero-shot
|
| 61 |
-
- **LLM over-normalization** detection: does the LLM wrongly modernize historical spellings?
|
| 62 |
-
- Prompt library for medieval manuscripts, early printed books, Latin…
|
| 63 |
|
| 64 |
-
###
|
| 65 |
|
| 66 |
-
|
| 67 |
-
|
| 68 |
-
|
| 69 |
-
|
| 70 |
-
|
| 71 |
-
|
| 72 |
-
|
| 73 |
-
| eScriptorium | `EScriptoriumClient` |
|
| 74 |
|
| 75 |
-
###
|
| 76 |
|
| 77 |
-
-
|
| 78 |
-
-
|
| 79 |
-
-
|
| 80 |
-
-
|
| 81 |
-
|
| 82 |
-
-
|
| 83 |
-
- Export to CSV, JSON, ALTO XML, PAGE XML, annotated images
|
| 84 |
|
| 85 |
-
###
|
| 86 |
|
| 87 |
-
- **
|
| 88 |
-
- **
|
| 89 |
-
-
|
| 90 |
-
-
|
| 91 |
-
-
|
|
|
|
| 92 |
|
| 93 |
---
|
| 94 |
|
| 95 |
-
##
|
| 96 |
|
| 97 |
```bash
|
| 98 |
-
#
|
| 99 |
git clone https://github.com/maribakulj/Picarones.git
|
| 100 |
-
cd
|
| 101 |
pip install -e .
|
| 102 |
|
| 103 |
-
# Tesseract (
|
| 104 |
# Ubuntu/Debian
|
| 105 |
sudo apt install tesseract-ocr tesseract-ocr-fra tesseract-ocr-lat
|
| 106 |
|
| 107 |
# macOS
|
| 108 |
brew install tesseract
|
| 109 |
|
| 110 |
-
#
|
|
|
|
|
|
|
|
|
|
| 111 |
picarones engines
|
| 112 |
-
```
|
| 113 |
|
| 114 |
-
|
| 115 |
|
| 116 |
---
|
| 117 |
|
| 118 |
-
##
|
| 119 |
|
| 120 |
```bash
|
| 121 |
-
|
| 122 |
-
|
| 123 |
|
| 124 |
-
|
| 125 |
-
|
| 126 |
|
| 127 |
-
#
|
| 128 |
-
picarones report --results resultats.json --output rapport.html
|
| 129 |
|
| 130 |
-
#
|
| 131 |
-
picarones metrics --reference gt.txt --hypothesis ocr.txt
|
| 132 |
|
| 133 |
-
|
| 134 |
picarones import iiif https://gallica.bnf.fr/ark:/12148/xxx/manifest.json --pages 1-10
|
| 135 |
|
| 136 |
-
#
|
| 137 |
-
picarones
|
|
|
|
|
|
|
| 138 |
picarones history --engine tesseract --regression
|
| 139 |
|
| 140 |
-
#
|
| 141 |
picarones robustness --corpus ./gt/ --engine tesseract --demo
|
| 142 |
|
| 143 |
-
#
|
| 144 |
-
picarones
|
| 145 |
```
|
| 146 |
|
| 147 |
---
|
| 148 |
|
| 149 |
-
##
|
| 150 |
-
|
| 151 |
-
|
|
| 152 |
-
|--------|------|--------------|
|
| 153 |
-
| **Tesseract 5** | Local CLI | `pip install pytesseract` +
|
| 154 |
-
| **Pero OCR** | Local Python | `pip install pero-ocr` |
|
| 155 |
-
| **Kraken** | Local Python | `pip install kraken` |
|
| 156 |
-
| **Mistral OCR** | API
|
| 157 |
-
| **
|
| 158 |
-
| **
|
| 159 |
-
| **
|
| 160 |
-
| **
|
| 161 |
-
| **
|
| 162 |
-
| **
|
| 163 |
-
| **
|
| 164 |
-
|
|
|
|
|
|
|
|
|
|
| 165 |
|
| 166 |
---
|
| 167 |
|
| 168 |
-
##
|
| 169 |
|
| 170 |
```
|
| 171 |
picarones/
|
| 172 |
-
├──
|
| 173 |
-
├──
|
|
|
|
|
|
|
|
|
|
| 174 |
├── core/
|
| 175 |
-
│ ├── corpus.py #
|
| 176 |
-
│ ├── metrics.py # CER, WER, MER, WIL (jiwer)
|
| 177 |
-
│ ├── normalization.py #
|
| 178 |
-
│ ├── statistics.py # Bootstrap CI, Wilcoxon,
|
| 179 |
-
│ ├──
|
| 180 |
-
│ ├──
|
| 181 |
-
│ ├──
|
| 182 |
-
│ ├──
|
| 183 |
-
│ ├──
|
| 184 |
-
│ ├──
|
| 185 |
-
│ ├──
|
| 186 |
-
│ ├──
|
| 187 |
-
│ ├──
|
| 188 |
-
│
|
| 189 |
-
├──
|
| 190 |
-
|
| 191 |
-
|
| 192 |
-
├──
|
| 193 |
-
├──
|
| 194 |
-
|
| 195 |
-
|
| 196 |
```
|
| 197 |
|
| 198 |
---
|
| 199 |
|
| 200 |
-
##
|
|
|
|
|
|
|
| 201 |
|
| 202 |
```bash
|
| 203 |
-
#
|
| 204 |
export OPENAI_API_KEY="sk-..."
|
| 205 |
export ANTHROPIC_API_KEY="sk-ant-..."
|
| 206 |
export MISTRAL_API_KEY="..."
|
| 207 |
|
| 208 |
-
#
|
| 209 |
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/credentials.json"
|
| 210 |
export AWS_ACCESS_KEY_ID="..."
|
| 211 |
export AWS_SECRET_ACCESS_KEY="..."
|
|
@@ -214,80 +452,92 @@ export AZURE_DOC_INTEL_ENDPOINT="https://..."
|
|
| 214 |
export AZURE_DOC_INTEL_KEY="..."
|
| 215 |
```
|
| 216 |
|
|
|
|
|
|
|
| 217 |
---
|
| 218 |
|
| 219 |
-
##
|
| 220 |
|
| 221 |
-
|
| 222 |
-
|--------|--------|-----------|
|
| 223 |
-
| Sprint 1 | ✅ | Structure, Tesseract, Pero OCR, CER/WER, CLI |
|
| 224 |
-
| Sprint 2 | ✅ | HTML report v1, colored diff, gallery |
|
| 225 |
-
| Sprint 3 | ✅ | OCR+LLM pipelines, GPT-4o, Claude |
|
| 226 |
-
| Sprint 4 | ✅ | Cloud APIs, IIIF import, diplomatic normalization |
|
| 227 |
-
| Sprint 5 | ✅ | Advanced metrics: Unicode confusion, ligatures, taxonomy |
|
| 228 |
-
| Sprint 6 | ✅ | FastAPI web interface, HTR-United, HuggingFace, Ollama |
|
| 229 |
-
| Sprint 7 | ✅ | HTML report v2: Wilcoxon, clustering, scatter plots |
|
| 230 |
-
| Sprint 8 | ✅ | eScriptorium, Gallica API, longitudinal history, robustness |
|
| 231 |
-
| Sprint 9 | ✅ | Documentation, packaging, Docker, CI/CD |
|
| 232 |
|
| 233 |
-
-
|
| 234 |
|
| 235 |
-
##
|
| 236 |
|
| 237 |
-
|
|
|
|
| 238 |
|
| 239 |
---
|
| 240 |
|
| 241 |
-
##
|
| 242 |
|
| 243 |
-
|
|
|
|
|
|
|
| 244 |
|
| 245 |
-
|
|
|
|
| 246 |
|
| 247 |
-
|
|
|
|
| 248 |
|
| 249 |
-
#
|
|
|
|
| 250 |
|
| 251 |
-
#
|
|
|
|
| 252 |
|
| 253 |
-
|
| 254 |
-
|
| 255 |
-
|
| 256 |
|
| 257 |
-
|
| 258 |
|
| 259 |
-
-
|
| 260 |
-
|
| 261 |
-
|
| 262 |
-
-
|
| 263 |
-
|
| 264 |
-
-
|
| 265 |
-
Datasets, HTR-United, eScriptorium
|
| 266 |
-
- **Interactive HTML report**: self-contained file, sortable ranking, gallery, coloured diff,
|
| 267 |
-
unicode character view, CSV/JSON/ALTO/PAGE XML export
|
| 268 |
-
- **Longitudinal tracking**: SQLite benchmark history, CER evolution curves, automatic regression
|
| 269 |
-
detection
|
| 270 |
-
- **Robustness analysis**: degraded image versions (noise, blur, rotation, resolution,
|
| 271 |
-
binarisation), critical threshold detection
|
| 272 |
|
| 273 |
-
|
| 274 |
|
| 275 |
-
|
| 276 |
-
pip install -e .
|
| 277 |
-
sudo apt install tesseract-ocr tesseract-ocr-fra # Ubuntu/Debian
|
| 278 |
-
picarones demo # demo report without any engine installed
|
| 279 |
-
picarones engines # list available engines
|
| 280 |
-
picarones run --corpus ./corpus/ --engines tesseract --output results.json
|
| 281 |
-
picarones report --results results.json
|
| 282 |
-
```
|
| 283 |
|
| 284 |
-
|
| 285 |
|
| 286 |
-
|
| 287 |
|
| 288 |
-
|
| 289 |
-
Google Vision · AWS Textract · Azure Document Intelligence · Ollama (local LLMs) · Custom YAML engine
|
| 290 |
|
| 291 |
-
|
| 292 |
|
| 293 |
-
|
|
|
|
| 9 |
|
| 10 |
# Picarones
|
| 11 |
|
| 12 |
+
> **OCR/HTR Benchmarking Platform for Heritage Documents**
|
|
|
|
| 13 |
|
| 14 |
[](https://github.com/maribakulj/Picarones/actions/workflows/ci.yml)
|
| 15 |
[](https://www.python.org/downloads/)
|
| 16 |
[](LICENSE)
|
| 17 |
+
[](https://huggingface.co/spaces/Ma-Ri-Ba-Ku/Picarones)
|
| 18 |
|
| 19 |
---
|
| 20 |
|
| 21 |
+
**Picarones** is an open-source platform for rigorously comparing OCR and HTR engines
|
| 22 |
+
(Tesseract, Pero OCR, Kraken, cloud APIs...) and OCR+LLM pipelines on historical document
|
| 23 |
+
corpora -- manuscripts, early printed books, and archives.
|
| 24 |
|
| 25 |
+
It provides heritage-specific metrics (diplomatic CER, ligature scores, medieval abbreviation
|
| 26 |
+
handling), composable OCR+LLM pipelines, interactive HTML reports, and multiple corpus import
|
| 27 |
+
sources including IIIF, Gallica, HuggingFace, and eScriptorium.
|
| 28 |
|
| 29 |
---
|
| 30 |
|
| 31 |
+
## Table of Contents
|
| 32 |
+
|
| 33 |
+
- [Features](#features)
|
| 34 |
+
- [Heritage-Specific Metrics](#heritage-specific-metrics)
|
| 35 |
+
  - [OCR+LLM Pipelines](#ocrllm-pipelines)
|
| 36 |
+
- [Corpus Import](#corpus-import)
|
| 37 |
+
- [Interactive HTML Report](#interactive-html-report)
|
| 38 |
+
- [Longitudinal Tracking & Robustness](#longitudinal-tracking--robustness)
|
| 39 |
+
- [Web Interface](#web-interface)
|
| 40 |
+
- [Quick Start](#quick-start)
|
| 41 |
+
- [Installation](#installation)
|
| 42 |
+
- [From Source](#from-source)
|
| 43 |
+
- [Docker](#docker)
|
| 44 |
+
- [Optional Extras](#optional-extras)
|
| 45 |
+
- [Usage](#usage)
|
| 46 |
+
- [CLI Commands](#cli-commands)
|
| 47 |
+
- [Web Interface](#web-interface-1)
|
| 48 |
+
- [Pipeline Modes](#pipeline-modes)
|
| 49 |
+
- [Supported Engines](#supported-engines)
|
| 50 |
+
- [Normalization Profiles](#normalization-profiles)
|
| 51 |
+
- [Error Taxonomy](#error-taxonomy)
|
| 52 |
+
- [Project Structure](#project-structure)
|
| 53 |
+
- [Environment Variables](#environment-variables)
|
| 54 |
+
- [CI/CD](#cicd)
|
| 55 |
+
- [Development](#development)
|
| 56 |
- [Roadmap](#roadmap)
|
| 57 |
+
- [Contributing](#contributing)
|
| 58 |
+
- [License](#license)
|
| 59 |
|
| 60 |
---
|
| 61 |
|
| 62 |
+
## Features
|
| 63 |
+
|
| 64 |
+
### Heritage-Specific Metrics
|
| 65 |
+
|
| 66 |
+
- **CER** (Character Error Rate) in four variants: raw, NFC-normalized, caseless, and
|
| 67 |
+
**diplomatic** (historical equivalences: long s = s, u = v, i = j, etc.)
|
| 68 |
+
- **WER**, **MER**, **WIL** with historical-aware tokenization (via [jiwer](https://github.com/jitsi/jiwer))
|
| 69 |
+
- **Unicode confusion matrix** -- fingerprint each engine's character-level errors
|
| 70 |
+
- **Ligature and diacritic scores** -- track handling of fi, fl, ff, oe, ae, p-bar, and other
|
| 71 |
+
medieval glyphs
|
| 72 |
+
- **10-class error taxonomy** -- automatic classification of every error (visual confusion,
|
| 73 |
+
abbreviation, segmentation, lacuna, over-normalization, etc.)
|
| 74 |
+
- **Bootstrap 95% confidence intervals** and **Wilcoxon signed-rank tests** for statistical
|
| 75 |
+
significance
|
| 76 |
+
- **Intrinsic difficulty score** per document, independent of engine performance
|
| 77 |
+
- **Line-level error distribution** with Gini coefficient and percentile analysis
|
| 78 |
+
- **VLM hallucination detection** -- anchor score and length ratio to flag fabricated output
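The CER variants in the list above can be illustrated with a minimal sketch. This is not Picarones' implementation (the project computes its metrics via jiwer); the function names `levenshtein`, `cer`, and `cer_variants` are assumptions made for this example, and the diplomatic variant is left out because it depends on a profile table.

```python
import unicodedata


def levenshtein(a: str, b: str) -> int:
    """Edit distance where insertion, deletion, and substitution each cost 1."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by the reference length."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return levenshtein(reference, hypothesis) / len(reference)


def cer_variants(reference: str, hypothesis: str) -> dict[str, float]:
    """Raw, NFC-normalized, and caseless CER for one pair of strings."""
    nfc_ref = unicodedata.normalize("NFC", reference)
    nfc_hyp = unicodedata.normalize("NFC", hypothesis)
    return {
        "raw": cer(reference, hypothesis),
        "nfc": cer(nfc_ref, nfc_hyp),
        "caseless": cer(nfc_ref.casefold(), nfc_hyp.casefold()),
    }


# Precomposed reference vs. a lowercase hypothesis using combining accents.
variants = cer_variants("\u00c9t\u00e9", "e\u0301te\u0301")
print(variants)
```

In this example the raw score is inflated by the decomposed accents, the NFC score only counts the case difference, and the caseless score is zero, which is why the variants are worth reporting side by side.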
|
| 79 |
+
|
| 80 |
+
### OCR+LLM Pipelines
|
| 81 |
+
|
| 82 |
+
- Composable chains: `tesseract -> gpt-4o`, `pero_ocr -> claude-sonnet`, zero-shot VLM, etc.
|
| 83 |
+
- Three pipeline modes: text-only post-correction, image+text post-correction, and zero-shot
|
| 84 |
+
- **Over-normalization detection** -- does the LLM silently modernize historical spellings?
|
| 85 |
+
- Versioned prompt library for medieval French, early modern French, medieval Latin, medieval
|
| 86 |
+
English, and early modern English -- both correction and zero-shot variants
|
| 87 |
+
|
| 88 |
+
### Corpus Import
|
| 89 |
+
|
| 90 |
+
| Source | Method |
|
| 91 |
+
|--------|--------|
|
| 92 |
+
| Local folder | `picarones run --corpus ./corpus/` |
|
| 93 |
+
| IIIF manifests (Gallica, Bodleian, BL...) | `picarones import iiif <manifest-url>` |
|
| 94 |
+
| Gallica API (SRU + OCR) | `GallicaClient` / `picarones import iiif` |
|
| 95 |
+
| HuggingFace Datasets | `picarones import hf <dataset-id>` |
|
| 96 |
+
| HTR-United catalogue | `picarones import htr-united` |
|
| 97 |
+
| eScriptorium | `EScriptoriumClient` |
|
| 98 |
+
| ZIP upload (browser) | Web interface upload endpoint |
|
| 99 |
|
| 100 |
+
Supported corpus formats: plain text pairs (image + ground truth), **ALTO XML**, and **PAGE XML**.
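As an illustration of what reading an ALTO XML ground truth involves, here is a small sketch using only the standard library. It is not the loader in `picarones/core/corpus.py`; the function name `alto_lines` is an assumption, and the sketch ignores hyphenation elements and other details a real loader would handle.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix so any ALTO schema version is accepted."""
    return tag.rsplit("}", 1)[-1]


def alto_lines(xml_text: str) -> list[str]:
    """Return one string per TextLine, joining the CONTENT of its String children."""
    root = ET.fromstring(xml_text)
    lines = []
    for element in root.iter():
        if local_name(element.tag) != "TextLine":
            continue
        words = [
            child.attrib.get("CONTENT", "")
            for child in element
            if local_name(child.tag) == "String"
        ]
        lines.append(" ".join(words))
    return lines


SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine><String CONTENT="Ci"/><SP/><String CONTENT="commence"/></TextLine>
    <TextLine><String CONTENT="le"/><SP/><String CONTENT="livre"/></TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""

print(alto_lines(SAMPLE))  # ['Ci commence', 'le livre']
```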
|
|
|
|
|
|
|
|
|
|
| 101 |
|
| 102 |
+
### Interactive HTML Report
|
| 103 |
|
| 104 |
+
- **Self-contained HTML file** -- works offline, no server needed
|
| 105 |
+
- Sortable ranking table, radar charts, histograms (powered by Chart.js)
|
| 106 |
+
- Gallery view with dynamic filters and color-coded CER badges
|
| 107 |
+
- GitHub-style colored diff with synchronized N-way scrolling
|
| 108 |
+
- Triple diff view for OCR+LLM: ground truth / raw OCR / post-correction
|
| 109 |
+
- Unicode character view: interactive confusion matrix explorer
|
| 110 |
+
- Export to **CSV**, **JSON**, **ALTO XML**, **PAGE XML**, and annotated images
|
|
|
|
| 111 |
|
| 112 |
+
### Longitudinal Tracking & Robustness
|
| 113 |
|
| 114 |
+
- Optional **SQLite database** to record benchmark history across runs
|
| 115 |
+
- **CER evolution curves** over time, per engine
|
| 116 |
+
- **Automatic regression detection** between consecutive runs
|
| 117 |
+
- **Robustness analysis**: measure engine resilience to noise, blur, rotation, resolution
|
| 118 |
+
reduction, and binarization
|
| 119 |
+
- Critical degradation threshold identification
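The history and regression features above could work roughly as in the following sketch. The table layout, the function names, and the `tolerance` parameter are assumptions for illustration, not the schema used by `picarones/core/history.py`.

```python
import sqlite3


def open_history(path: str = ":memory:") -> sqlite3.Connection:
    """Create (if needed) a one-table benchmark history database."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS runs ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " engine TEXT NOT NULL,"
        " cer REAL NOT NULL)"
    )
    return conn


def record_run(conn: sqlite3.Connection, engine: str, cer: float) -> None:
    """Append one benchmark result to the history."""
    conn.execute("INSERT INTO runs (engine, cer) VALUES (?, ?)", (engine, cer))
    conn.commit()


def find_regressions(conn: sqlite3.Connection, engine: str,
                     tolerance: float = 0.01) -> list[tuple[float, float]]:
    """Return (previous, current) CER pairs where CER rose by more than tolerance."""
    rows = conn.execute(
        "SELECT cer FROM runs WHERE engine = ? ORDER BY id", (engine,)
    ).fetchall()
    values = [row[0] for row in rows]
    return [
        (before, after)
        for before, after in zip(values, values[1:])
        if after - before > tolerance
    ]


history = open_history()
for value in (0.12, 0.11, 0.15, 0.152):
    record_run(history, "tesseract", value)
print(find_regressions(history, "tesseract"))  # [(0.11, 0.15)]
```

Comparing only consecutive runs keeps the check cheap; the tolerance avoids flagging run-to-run noise as a regression.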
|
|
|
|
| 120 |
|
| 121 |
+
### Web Interface
|
| 122 |
|
| 123 |
+
- **FastAPI** application with real-time **Server-Sent Events** (SSE) progress streaming
|
| 124 |
+
- Upload corpus as a **ZIP file** directly from the browser
|
| 125 |
+
- Dynamic engine and normalization profile selectors
|
| 126 |
+
- Browse and re-download generated HTML reports
|
| 127 |
+
- Bilingual **French/English** interface
|
| 128 |
+
- Deployable on HuggingFace Spaces (Docker, port 7860)
|
| 129 |
|
| 130 |
---
|
| 131 |
|
| 132 |
+
## Quick Start
|
| 133 |
|
| 134 |
```bash
|
| 135 |
+
# Clone and install
|
| 136 |
git clone https://github.com/maribakulj/Picarones.git
|
| 137 |
+
cd Picarones
|
| 138 |
pip install -e .
|
| 139 |
|
| 140 |
+
# Install Tesseract (system binary, required for the Tesseract engine)
|
| 141 |
# Ubuntu/Debian
|
| 142 |
sudo apt install tesseract-ocr tesseract-ocr-fra tesseract-ocr-lat
|
| 143 |
|
| 144 |
# macOS
|
| 145 |
brew install tesseract
|
| 146 |
|
| 147 |
+
# Generate a demo report (no OCR engine needed)
|
| 148 |
+
picarones demo --output demo_report.html
|
| 149 |
+
|
| 150 |
+
# List available engines
|
| 151 |
picarones engines
|
|
|
|
| 152 |
|
| 153 |
+
# Run a benchmark
|
| 154 |
+
picarones run --corpus ./corpus/ --engines tesseract --output results.json
|
| 155 |
+
|
| 156 |
+
# Generate HTML report
|
| 157 |
+
picarones report --results results.json --output report.html
|
| 158 |
+
|
| 159 |
+
# Launch the web interface
|
| 160 |
+
picarones serve --port 8080
|
| 161 |
+
```
|
| 162 |
|
| 163 |
---
|
| 164 |
|
| 165 |
+
## Installation
|
| 166 |
+
|
| 167 |
+
### From Source
|
| 168 |
+
|
| 169 |
+
```bash
|
| 170 |
+
git clone https://github.com/maribakulj/Picarones.git
|
| 171 |
+
cd Picarones
|
| 172 |
+
pip install -e ".[dev,web]" # includes test and web dependencies
|
| 173 |
+
```
|
| 174 |
+
|
| 175 |
+
**System requirements:**
|
| 176 |
+
|
| 177 |
+
- Python >= 3.11
|
| 178 |
+
- [Tesseract OCR 5](https://github.com/tesseract-ocr/tesseract) (for the Tesseract engine)
|
| 179 |
+
|
| 180 |
+
### Docker
|
| 181 |
|
| 182 |
```bash
|
| 183 |
+
docker build -t picarones .
|
| 184 |
+
docker run -p 7860:7860 \
|
| 185 |
+
-e MISTRAL_API_KEY=... \
|
| 186 |
+
-e OPENAI_API_KEY=... \
|
| 187 |
+
picarones
|
| 188 |
+
```
|
| 189 |
+
|
| 190 |
+
The Docker image is based on Python 3.11-slim, includes Tesseract 5 with language packs
|
| 191 |
+
(fra, lat, eng, deu, ita, spa), and runs as a non-root user. A health check polls
|
| 192 |
+
`/health` every 30 seconds.
|
| 193 |
|
| 194 |
+
The [HuggingFace Space](https://huggingface.co/spaces/Ma-Ri-Ba-Ku/Picarones) uses this
|
| 195 |
+
same Docker image.
|
| 196 |
+
|
| 197 |
+
### Optional Extras
|
| 198 |
+
|
| 199 |
+
| Extra | Install command | What it adds |
|
| 200 |
+
|-------|----------------|--------------|
|
| 201 |
+
| `dev` | `pip install -e ".[dev]"` | pytest, pytest-cov, httpx, linting |
|
| 202 |
+
| `web` | `pip install -e ".[web]"` | FastAPI, uvicorn, python-multipart |
|
| 203 |
+
| `llm` | `pip install -e ".[llm]"` | OpenAI, Anthropic, Mistral SDKs |
|
| 204 |
+
| `hf` | `pip install -e ".[hf]"` | HuggingFace Datasets |
|
| 205 |
+
| `pero` | `pip install -e ".[pero]"` | Pero OCR engine |
|
| 206 |
+
| `kraken` | `pip install -e ".[kraken]"` | Kraken engine |
|
| 207 |
+
| `ocr-cloud` | `pip install -e ".[ocr-cloud]"` | Google Vision, AWS Textract, Azure Doc Intelligence |
|
| 208 |
+
| `all` | `pip install -e ".[all]"` | Everything except ocr-cloud |
|
| 209 |
+
|
| 210 |
+
See [INSTALL.md](INSTALL.md) for detailed instructions on Linux, macOS, Windows, and Docker.
|
| 211 |
+
|
| 212 |
+
---
|
| 213 |
|
| 214 |
+
## Usage
|
|
|
|
| 215 |
|
| 216 |
+
### CLI Commands
|
|
|
|
| 217 |
|
| 218 |
+
| Command | Description |
|
| 219 |
+
|---------|-------------|
|
| 220 |
+
| `picarones run` | Run a full benchmark on a corpus |
|
| 221 |
+
| `picarones report` | Generate an HTML report from JSON results |
|
| 222 |
+
| `picarones demo` | Generate a demo report with synthetic data (no engine required) |
|
| 223 |
+
| `picarones metrics` | Calculate CER/WER between two text files |
|
| 224 |
+
| `picarones engines` | List all available OCR engines and LLM adapters |
|
| 225 |
+
| `picarones info` | Display version and system information |
|
| 226 |
+
| `picarones serve` | Launch the FastAPI web interface |
|
| 227 |
+
| `picarones history` | Query longitudinal benchmark history (SQLite) |
|
| 228 |
+
| `picarones robustness` | Run robustness analysis with degraded images |
|
| 229 |
+
| `picarones import` | Import corpus from IIIF, HuggingFace, or HTR-United |
|
| 230 |
+
|
| 231 |
+
**Examples:**
|
| 232 |
+
|
| 233 |
+
```bash
|
| 234 |
+
# Benchmark with Tesseract, French language, PSM 6
|
| 235 |
+
picarones run --corpus ./manuscripts/ --engines tesseract --lang fra --psm 6 \
|
| 236 |
+
--output results.json --verbose
|
| 237 |
+
|
| 238 |
+
# Compare two text files
|
| 239 |
+
picarones metrics --reference ground_truth.txt --hypothesis ocr_output.txt
|
| 240 |
+
|
| 241 |
+
# Import 10 pages from a Gallica IIIF manifest
|
| 242 |
picarones import iiif https://gallica.bnf.fr/ark:/12148/xxx/manifest.json --pages 1-10
|
| 243 |
|
| 244 |
+
# Import a HuggingFace dataset
|
| 245 |
+
picarones import hf medieval-ocr/dataset-name
|
| 246 |
+
|
| 247 |
+
# View benchmark history with regression detection
|
| 248 |
picarones history --engine tesseract --regression
|
| 249 |
|
| 250 |
+
# Robustness demo (noise, blur, rotation, resolution)
|
| 251 |
picarones robustness --corpus ./gt/ --engine tesseract --demo
|
| 252 |
|
| 253 |
+
# Fail CI if CER exceeds threshold
|
| 254 |
+
picarones run --corpus ./corpus/ --engines tesseract --fail-if-cer-above 0.15
|
| 255 |
```
|
| 256 |
|
| 257 |
+
### Web Interface
|
| 258 |
+
|
| 259 |
+
```bash
|
| 260 |
+
picarones serve --host 0.0.0.0 --port 8080
|
| 261 |
+
```
|
| 262 |
+
|
| 263 |
+
**API endpoints include:**
|
| 264 |
+
|
| 265 |
+
| Endpoint | Method | Description |
|
| 266 |
+
|----------|--------|-------------|
|
| 267 |
+
| `/` | GET | Main single-page application |
|
| 268 |
+
| `/api/status` | GET | Version and application status |
|
| 269 |
+
| `/api/engines` | GET | Available OCR/LLM engines |
|
| 270 |
+
| `/api/normalization/profiles` | GET | Normalization profiles (read dynamically) |
|
| 271 |
+
| `/api/benchmark/start` | POST | Start a benchmark job (returns `job_id`) |
|
| 272 |
+
| `/api/benchmark/{job_id}/stream` | GET | SSE real-time progress stream |
|
| 273 |
+
| `/api/benchmark/{job_id}/cancel` | POST | Cancel a running benchmark |
|
| 274 |
+
| `/api/corpus/browse` | GET | Browse server-side corpus folders |
|
| 275 |
+
| `/api/htr-united/catalogue` | GET | Browse HTR-United catalogue |
|
| 276 |
+
| `/api/huggingface/search` | GET | Search HuggingFace datasets |
|
| 277 |
+
| `/reports/{filename}` | GET | Download generated HTML reports |
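A client following the `/api/benchmark/{job_id}/stream` endpoint has to parse Server-Sent Events. The sketch below shows that parsing step on an in-memory list of lines; the event names (`progress`, `complete`) and the JSON payload fields are invented for the example, since the actual stream format is not described here.

```python
import json
from typing import Iterable, Iterator


def iter_sse_events(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per Server-Sent Event, with 'event' and 'data' keys."""
    event_type, data_parts = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line dispatches the event collected so far.
            if data_parts:
                yield {"event": event_type, "data": "\n".join(data_parts)}
            event_type, data_parts = "message", []
        elif line.startswith(":"):
            continue  # comment or keep-alive line
        else:
            field, _, value = line.partition(":")
            if value.startswith(" "):
                value = value[1:]  # a single leading space is not part of the value
            if field == "event":
                event_type = value
            elif field == "data":
                data_parts.append(value)


# Hypothetical stream content, standing in for the lines of an HTTP response body.
stream = [
    ": keep-alive\n",
    "event: progress\n",
    'data: {"done": 3, "total": 10}\n',
    "\n",
    "event: complete\n",
    'data: {"report": "report.html"}\n',
    "\n",
]
events = list(iter_sse_events(stream))
progress = json.loads(events[0]["data"])
print(events[0]["event"], progress["done"], "/", progress["total"])
```

With a real HTTP client, the same generator can be fed the decoded lines of the streaming response.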
|
| 278 |
+
|
| 279 |
+
### Pipeline Modes
|
| 280 |
+
|
| 281 |
+
Picarones supports three modes for OCR+LLM pipelines:
|
| 282 |
+
|
| 283 |
+
| Mode | Description | Model type |
|
| 284 |
+
|------|-------------|------------|
|
| 285 |
+
| `zero_shot` | LLM receives the image directly and transcribes without prior OCR | VLM (vision) |
|
| 286 |
+
| `post_correction_texte` | OCR produces raw text, then LLM corrects it | Text-only LLM |
|
| 287 |
+
| `post_correction_image_texte` | OCR produces raw text, then LLM receives both image and text for correction | VLM (vision) |
|
| 288 |
+
|
| 289 |
+
**Example:** `ministral-3b-latest` is a text-only model and should use `post_correction_texte`.
|
| 290 |
+
GPT-4o and Claude support all three modes.
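The constraint described above (text-only models can only do text post-correction) can be expressed as a small check. The mode strings come from the table; the `PipelineMode` enum, `allowed_modes`, and `check_mode` are hypothetical names for this sketch, not part of the Picarones API.

```python
from enum import Enum


class PipelineMode(str, Enum):
    ZERO_SHOT = "zero_shot"
    POST_CORRECTION_TEXT = "post_correction_texte"
    POST_CORRECTION_IMAGE_TEXT = "post_correction_image_texte"


# Modes that send the page image to the model and therefore need a VLM.
VISION_MODES = {PipelineMode.ZERO_SHOT, PipelineMode.POST_CORRECTION_IMAGE_TEXT}


def allowed_modes(supports_vision: bool) -> list[PipelineMode]:
    """List the modes a model can run, given whether it accepts images."""
    if supports_vision:
        return list(PipelineMode)
    return [mode for mode in PipelineMode if mode not in VISION_MODES]


def check_mode(mode: str, supports_vision: bool) -> PipelineMode:
    """Validate a mode name against the model's capabilities."""
    selected = PipelineMode(mode)  # raises ValueError for an unknown mode name
    if selected in VISION_MODES and not supports_vision:
        raise ValueError(f"{selected.value} needs a vision-capable model")
    return selected


print([mode.value for mode in allowed_modes(supports_vision=False)])
```

Failing early on an incompatible mode is cheaper than discovering it after the OCR step has already run.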
|
| 291 |
+
|
| 292 |
---
|
| 293 |
|
| 294 |
+
## Supported Engines
|
| 295 |
+
|
| 296 |
+
| Engine | Type | Execution Mode | Installation |
|
| 297 |
+
|--------|------|---------------|-------------|
|
| 298 |
+
| **Tesseract 5** | Local CLI | CPU (ProcessPool) | `pip install pytesseract` + system binary |
|
| 299 |
+
| **Pero OCR** | Local Python | CPU (ProcessPool) | `pip install pero-ocr` |
|
| 300 |
+
| **Kraken** | Local Python | CPU (ProcessPool) | `pip install kraken` |
|
| 301 |
+
| **Mistral OCR** | Cloud API | IO (ThreadPool) | `MISTRAL_API_KEY` env var |
|
| 302 |
+
| **Google Vision** | Cloud API | IO (ThreadPool) | `GOOGLE_APPLICATION_CREDENTIALS` env var |
|
| 303 |
+
| **Azure Doc Intelligence** | Cloud API | IO (ThreadPool) | `AZURE_DOC_INTEL_ENDPOINT` + `AZURE_DOC_INTEL_KEY` |
|
| 304 |
+
| **GPT-4o** (VLM) | LLM API | IO (ThreadPool) | `OPENAI_API_KEY` env var |
|
| 305 |
+
| **Claude Sonnet** (VLM) | LLM API | IO (ThreadPool) | `ANTHROPIC_API_KEY` env var |
|
| 306 |
+
| **Mistral Large** (LLM) | LLM API | IO (ThreadPool) | `MISTRAL_API_KEY` env var |
|
| 307 |
+
| **Ollama** (local LLM) | Local LLM | IO (ThreadPool) | `ollama serve` running locally |
|
| 308 |
+
| **Custom engine** | CLI or API | Configurable | YAML declaration, no code required |
|
| 309 |
+
|
| 310 |
+
Engines declare their `execution_mode` (`"io"` or `"cpu"`), allowing the runner to use
|
| 311 |
+
`ThreadPoolExecutor` for IO-bound engines and `ProcessPoolExecutor` for CPU-bound engines
|
| 312 |
+
simultaneously.
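The dispatch described above can be sketched as follows. The `Engine` dataclass, `run_engines`, `benchmark`, and the two fake engines are assumptions for illustration; the actual runner in `picarones/core/runner.py` may be organized differently.

```python
from concurrent.futures import Executor, ProcessPoolExecutor, ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Engine:
    name: str
    execution_mode: str                # "io" or "cpu"
    transcribe: Callable[[str], str]   # must be picklable for the process pool


def run_engines(engines: list[Engine], image_path: str,
                cpu_pool: Executor, io_pool: Executor) -> dict[str, str]:
    """Submit each engine to the pool matching its execution_mode, then collect."""
    pools = {"cpu": cpu_pool, "io": io_pool}
    futures = {
        engine.name: pools[engine.execution_mode].submit(engine.transcribe, image_path)
        for engine in engines
    }
    # All jobs are already submitted, so both pools work concurrently here.
    return {name: future.result() for name, future in futures.items()}


def benchmark(engines: list[Engine], image_path: str) -> dict[str, str]:
    """Run CPU-bound engines in processes and IO-bound engines in threads."""
    with ProcessPoolExecutor() as cpu_pool, ThreadPoolExecutor(max_workers=8) as io_pool:
        return run_engines(engines, image_path, cpu_pool, io_pool)


def fake_local_ocr(image_path: str) -> str:
    """Stand-in for a CPU-bound engine such as Tesseract."""
    return f"local:{image_path}"


def fake_cloud_ocr(image_path: str) -> str:
    """Stand-in for an IO-bound engine that calls a remote API."""
    return f"cloud:{image_path}"
```

Passing the pools in as parameters keeps `run_engines` easy to test with two thread pools, while `benchmark` shows the intended pairing of executor types.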
|
| 313 |
|
| 314 |
---
|
| 315 |
|
| 316 |
+
## Normalization Profiles
|
| 317 |
+
|
| 318 |
+
Picarones includes eight built-in diplomatic normalization profiles designed for historical
|
| 319 |
+
text comparison. These reduce noise from expected orthographic variation so metrics reflect
|
| 320 |
+
genuine OCR errors, not historical spelling differences.
|
| 321 |
+
|
| 322 |
+
| Profile | Period | Key equivalences |
|
| 323 |
+
|---------|--------|-----------------|
|
| 324 |
+
| `nfc` | Any | Unicode NFC normalization only |
|
| 325 |
+
| `minimal` | Any | NFC + long s (ſ -> s) |
|
| 326 |
+
| `medieval_french` | 12th-15th c. | ſ=s, u=v, i=j, y=i, æ=ae, œ=oe, p-bar=per, etc. |
|
| 327 |
+
| `early_modern_french` | 16th-18th c. | ſ=s, æ=ae, œ=oe, y-tilde=yn, &=et |
|
| 328 |
+
| `medieval_latin` | 12th-15th c. | ſ=s, u=v, i=j, æ=ae, œ=oe, p-bar=per, q-bar=que |
|
| 329 |
+
| `medieval_english` | 12th-15th c. | ſ=s, u=v, i=j, thorn=th, eth=th, yogh=y, p-bar=per |
|
| 330 |
+
| `early_modern_english` | 16th-18th c. | ſ=s, u=v, i=j, vv=w, thorn=th, eth=th, yogh=y |
|
| 331 |
+
| `secretary_hand` | 16th-17th c. | Early modern English + secretary hand visual confusions |
|
| 332 |
+
|
| 333 |
+
Custom profiles can be loaded from YAML files with user-defined diplomatic tables.
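To make the idea of a diplomatic table concrete, here is a minimal sketch of applying one. The table below is a hypothetical subset in the spirit of the `medieval_english` row, and `normalize` is an assumed name; neither is taken from `picarones/core/normalization.py`, and the YAML layout of custom profiles is not shown.

```python
import unicodedata

# Hypothetical equivalence table: each key is replaced by its value.
MEDIEVAL_ENGLISH = {
    "\u017f": "s",    # long s
    "v": "u",         # u = v (the direction is arbitrary; both sides are normalized)
    "j": "i",         # i = j
    "\u00fe": "th",   # thorn
    "\u00f0": "th",   # eth
    "\u021d": "y",    # yogh
    "\ua751": "per",  # p with stroke through descender (p-bar)
}


def normalize(text: str, table: dict[str, str]) -> str:
    """Apply NFC, then replace each character by its diplomatic equivalent."""
    text = unicodedata.normalize("NFC", text)
    return "".join(table.get(char, char) for char in text)


ground_truth = "\u00feat \u017felue"   # thorn + "at", long s + "elue"
ocr_output = "that selve"
print(normalize(ground_truth, MEDIEVAL_ENGLISH) == normalize(ocr_output, MEDIEVAL_ENGLISH))
```

Because the same table is applied to both the ground truth and the OCR output, only differences that survive normalization are counted as errors.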
|
| 334 |
+
|
| 335 |
+
---
|
| 336 |
+
|
| 337 |
+
## Error Taxonomy
|
| 338 |
+
|
| 339 |
+
Every character-level error is automatically classified into one of 10 categories:
|
| 340 |
+
|
| 341 |
+
| Class | Name | Description |
|
| 342 |
+
|-------|------|-------------|
|
| 343 |
+
| 1 | `visual_confusion` | Morphologically similar characters (rn/m, l/1, O/0, u/n) |
|
| 344 |
+
| 2 | `diacritic_error` | Missing, incorrect, or spurious diacritical mark |
|
| 345 |
+
| 3 | `case_error` | Case difference only (A/a) |
|
| 346 |
+
| 4 | `ligature_error` | Ligature not resolved or incorrectly resolved |
|
| 347 |
+
| 5 | `abbreviation_error` | Medieval abbreviation not expanded |
|
| 348 |
+
| 6 | `hapax` | Word not found in any reference lexicon |
|
| 349 |
+
| 7 | `segmentation_error` | Token fusion or fragmentation (words/lines) |
|
| 350 |
+
| 8 | `oov_character` | Character outside the engine's vocabulary |
|
| 351 |
+
| 9 | `lacuna` | Text present in ground truth but absent from OCR output |
|
| 352 |
+
| 10 | `over_normalization` | LLM silently modernized a historical spelling |
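A few of these classes can be decided from the characters alone, as the sketch below shows for `lacuna`, `case_error`, `diacritic_error`, and `visual_confusion`. The confusion list and the `classify` function are assumptions for illustration; the real classifier in `picarones/core/taxonomy.py` covers all ten classes and may apply different rules.

```python
import unicodedata

# Assumed, deliberately tiny confusion list taken from the table's examples.
VISUAL_PAIRS = {
    frozenset(pair) for pair in [("rn", "m"), ("l", "1"), ("O", "0"), ("u", "n")]
}


def strip_marks(text: str) -> str:
    """Remove combining marks after NFD decomposition (é becomes e)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(char for char in decomposed if not unicodedata.combining(char))


def classify(expected: str, observed: str) -> str:
    """Assign one differing segment (ground truth vs. OCR output) to a class."""
    if observed == "":
        return "lacuna"
    if expected.casefold() == observed.casefold():
        return "case_error"
    if strip_marks(expected) == strip_marks(observed):
        return "diacritic_error"
    if frozenset((expected, observed)) in VISUAL_PAIRS:
        return "visual_confusion"
    return "unclassified"  # the remaining classes need lexicons or context


for pair in [("m", "rn"), ("A", "a"), ("\u00e9", "e"), ("x", "")]:
    print(pair, classify(*pair))
```

The order of the checks matters: cheaper, unambiguous tests run first so that a case-only difference is never reported as a visual confusion.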
|
| 353 |
+
|
| 354 |
+
---
|
| 355 |
+
|
| 356 |
+
## Project Structure
|
| 357 |
|
| 358 |
```
|
| 359 |
picarones/
|
| 360 |
+
├── __init__.py # Version (1.0.0), package metadata
|
| 361 |
+
├── cli.py # Click CLI: run, demo, report, metrics, engines, info,
|
| 362 |
+
│ # serve, import, history, robustness
|
| 363 |
+
├── fixtures.py # Realistic synthetic test data (medieval documents)
|
| 364 |
+
│
|
| 365 |
├── core/
|
| 366 |
+
│ ├── corpus.py # Corpus loading (folder, ALTO XML, PAGE XML)
|
| 367 |
+
│ ├── metrics.py # CER, WER, MER, WIL (via jiwer)
|
| 368 |
+
│ ├── normalization.py # Unicode normalization, 8 diplomatic profiles
|
| 369 |
+
│ ├── statistics.py # Bootstrap CI 95%, Wilcoxon test, correlations
|
| 370 |
+
│ ├── runner.py # Benchmark orchestrator (ThreadPool + ProcessPool)
|
| 371 |
+
│ ├── results.py # DocumentResult, BenchmarkResults, JSON export
|
| 372 |
+
│ ├── confusion.py # Unicode confusion matrix
|
| 373 |
+
│ ├── char_scores.py # Ligature and diacritic scores
|
| 374 |
+
│ ├── taxonomy.py # 10-class error taxonomy
|
| 375 |
+
│ ├── structure.py # Structural analysis (blocks, lines, words)
|
| 376 |
+
│ ├── image_quality.py # Image quality metrics (contrast, noise, resolution)
|
| 377 |
+
│ ├── difficulty.py # Intrinsic difficulty score per document
|
| 378 |
+
│ ├── hallucination.py # VLM hallucination detection
|
| 379 |
+
│ ├── line_metrics.py # Line-level error distribution (Gini, percentiles)
|
| 380 |
+
│ ├── history.py # SQLite longitudinal tracking
|
| 381 |
+
│ └── robustness.py # Robustness analysis (noise, blur, rotation, resolution)
|
| 382 |
+
│
|
| 383 |
+
├── engines/
|
| 384 |
+
│ ├── base.py # BaseOCREngine (execution_mode: "io" | "cpu")
|
| 385 |
+
│ ├── tesseract.py # Tesseract 5 adapter (CPU)
|
| 386 |
+
│ ├── pero_ocr.py # Pero OCR adapter (CPU)
|
| 387 |
+
│ ├── mistral_ocr.py # Mistral OCR API (/v1/ocr endpoint)
|
| 388 |
+
│ ├── google_vision.py # Google Cloud Vision adapter
|
| 389 |
+
│ └── azure_doc_intel.py # Azure Document Intelligence adapter
|
| 390 |
+
│
|
| 391 |
+
├── llm/
|
| 392 |
+
│ ├── base.py # BaseLLMAdapter interface
|
| 393 |
+
│ ├── openai_adapter.py # OpenAI / GPT-4o adapter
|
| 394 |
+
│ ├── anthropic_adapter.py # Anthropic / Claude adapter
|
| 395 |
+
│ ├── mistral_adapter.py # Mistral chat completions adapter
|
| 396 |
+
│ └── ollama_adapter.py # Ollama local LLM adapter
|
| 397 |
+
│
|
| 398 |
+
├── pipelines/
|
| 399 |
+
│ ├── base.py # OCRLLMPipeline orchestrator
|
| 400 |
+
│ └── over_normalization.py # Over-normalization detection
|
| 401 |
+
│
|
| 402 |
+
├── prompts/ # Versioned prompt templates (FR + EN)
|
| 403 |
+
│ ├── correction_medieval_french.txt
|
| 404 |
+
│ ├── zero_shot_medieval_french.txt
|
| 405 |
+
│ ├── correction_imprime_ancien.txt
|
| 406 |
+
│ ├── zero_shot_imprime_ancien.txt
|
| 407 |
+
│ ├── correction_medieval_english.txt
|
| 408 |
+
│ ├── zero_shot_medieval_english.txt
|
| 409 |
+
│ ├── correction_early_modern_english.txt
|
| 410 |
+
│ └── correction_image_medieval_french.txt
|
| 411 |
+
│
|
| 412 |
+
├── report/
|
| 413 |
+
│ ├── generator.py # Self-contained HTML report (Chart.js + diff2html)
|
| 414 |
+
│ └── diff_utils.py # Diff computation utilities
|
| 415 |
+
│
|
| 416 |
+
├── web/
|
| 417 |
+
│ ├── app.py # FastAPI app (SSE, ZIP upload, dynamic endpoints)
|
| 418 |
+
│ └── static/ # CSS assets
|
| 419 |
+
│
|
| 420 |
+
└── importers/
|
| 421 |
+
├── iiif.py # IIIF manifest importer
|
| 422 |
+
├── gallica.py # Gallica API client (BnF)
|
| 423 |
+
├── htr_united.py # HTR-United catalogue importer
|
| 424 |
+
├── huggingface.py # HuggingFace Datasets importer
|
| 425 |
+
└── escriptorium.py # eScriptorium client
|
| 426 |
+
|
| 427 |
+
tests/ # ~1020 unit and integration tests
|
| 428 |
+
.github/workflows/
|
| 429 |
+
├── ci.yml # CI: Python 3.11/3.12, Linux/macOS/Windows
|
| 430 |
+
└── sync_to_huggingface.yml # Auto-sync to HuggingFace Space on push to main
|
| 431 |
+
Dockerfile # Multi-stage Docker build for HuggingFace Spaces
|
| 432 |
```
|
| 433 |
|
| 434 |
---
|
| 435 |
|
| 436 |
+
## Environment Variables
|
| 437 |
+
|
| 438 |
+
Configure API keys depending on which engines and LLM adapters you use:
|
| 439 |
|
| 440 |
```bash
|
| 441 |
+
# LLM APIs
|
| 442 |
export OPENAI_API_KEY="sk-..."
|
| 443 |
export ANTHROPIC_API_KEY="sk-ant-..."
|
| 444 |
export MISTRAL_API_KEY="..."
|
| 445 |
|
| 446 |
+
# Cloud OCR APIs (optional)
|
| 447 |
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/credentials.json"
|
| 448 |
export AWS_ACCESS_KEY_ID="..."
|
| 449 |
export AWS_SECRET_ACCESS_KEY="..."
|
|
|
|
| 452 |
export AZURE_DOC_INTEL_KEY="..."
|
| 453 |
```
|
| 454 |
|
| 455 |
+
For deployment on HuggingFace Spaces, set these in **Settings > Variables and secrets**.
|
| 456 |
+
|
| 457 |
---
|
| 458 |
|
| 459 |
+
## CI/CD
|
| 460 |
|
| 461 |
+
### GitHub Actions (`ci.yml`)
|
| 462 |
|
| 463 |
+
- **Triggers:** push to `main`/`develop`/`feature/*`/`sprint/*`/`claude/*`, PRs to
|
| 464 |
+
`main`/`develop`, manual dispatch
|
| 465 |
+
- **Matrix:** Python 3.11 + 3.12 on Linux, macOS, and Windows
|
| 466 |
+
- **Jobs:**
|
| 467 |
+
1. **Tests** -- full pytest suite with coverage, uploaded to Codecov
|
| 468 |
+
2. **Demo** -- end-to-end demo report generation with history and robustness
|
| 469 |
+
3. **Build** -- wheel and sdist with twine validation
|
| 470 |
+
4. **Lint** -- ruff check for errors (E, W, F; ignores E501, E402)
|
| 471 |
|
| 472 |
+
### HuggingFace Sync (`sync_to_huggingface.yml`)
|
| 473 |
|
| 474 |
+
- Automatically pushes `main` to the HuggingFace Space `Ma-Ri-Ba-Ku/Picarones`
|
| 475 |
+
- Requires the `HF_TOKEN` secret in GitHub repository settings
|
| 476 |
|
| 477 |
---
|
| 478 |
|
| 479 |
+
## Development
|
| 480 |
|
| 481 |
+
```bash
|
| 482 |
+
# Install with dev + web dependencies
|
| 483 |
+
pip install -e ".[dev,web]"
|
| 484 |
|
| 485 |
+
# Run the test suite
|
| 486 |
+
pytest tests/ -q --tb=short
|
| 487 |
|
| 488 |
+
# Run with coverage
|
| 489 |
+
pytest tests/ --cov=picarones --cov-report=term-missing
|
| 490 |
|
| 491 |
+
# Generate a demo report
|
| 492 |
+
picarones demo --output demo_report.html
|
| 493 |
|
| 494 |
+
# Launch the web UI in development mode
|
| 495 |
+
picarones serve --port 8080
|
| 496 |
|
| 497 |
+
# Full refresh (useful in Codespaces)
|
| 498 |
+
git pull && pip install -e ".[dev,web]" && picarones demo --output demo.html
|
| 499 |
+
```
|
| 500 |
|
| 501 |
+
**Key development conventions:**
|
| 502 |
|
| 503 |
+
- Never silently swallow exceptions with `except Exception: pass` -- always log with `logger.warning()`
|
| 504 |
+
- Normalization profiles are read dynamically from `picarones/core/normalization.py` --
|
| 505 |
+
never hardcode them in endpoint handlers
|
| 506 |
+
- Engines declare their `execution_mode` (`"io"` or `"cpu"`) so the runner can select the
|
| 507 |
+
appropriate executor
|
| 508 |
+
- `python-multipart` must remain in dependencies (FastAPI checks for it when routes that accept form or file uploads are defined, that is, at import time)
|
| 509 |
|
| 510 |
+
---
|
| 511 |
|
| 512 |
+
## Roadmap
|
| 513 |
|
| 514 |
+
| Sprint | Status | Deliverables |
|
| 515 |
+
|--------|--------|-------------|
|
| 516 |
+
| 1 | Done | Project structure, Tesseract, Pero OCR, CER/WER, CLI |
|
| 517 |
+
| 2 | Done | HTML report v1: Chart.js, colored diff, gallery |
|
| 518 |
+
| 3 | Done | OCR+LLM pipelines, GPT-4o, Claude, Mistral, Ollama |
|
| 519 |
+
| 4 | Done | Cloud OCR APIs, IIIF import, diplomatic normalization |
|
| 520 |
+
| 5 | Done | Advanced metrics: confusion matrix, ligatures, 9-class taxonomy |
|
| 521 |
+
| 6 | Done | FastAPI web interface, HTR-United, HuggingFace, bilingual UI |
|
| 522 |
+
| 7 | Done | HTML report v2: Wilcoxon, bootstrap, clustering, difficulty score |
|
| 523 |
+
| 8 | Done | eScriptorium, Gallica API, SQLite history, robustness analysis |
|
| 524 |
+
| 9 | Done | Documentation, packaging, Docker, CI/CD, PyInstaller, v1.0.0-Beta |
|
| 525 |
+
| 10 | Done | Line error distribution (Gini), VLM hallucination detection |
|
| 526 |
+
| 11 | Done | Internationalization FR/EN, English normalization profiles |
|
| 527 |
+
| 12 | Done | Browser ZIP upload, macOS file filtering, dynamic model selector |
|
| 528 |
+
| 13 | Done | pyproject.toml cleanup, runner parallelization, NDJSON streaming, Wilcoxon validation |
|
| 529 |
|
| 530 |
+
---
|
| 531 |
+
|
| 532 |
+
## Contributing
|
| 533 |
+
|
| 534 |
+
See [CONTRIBUTING.md](CONTRIBUTING.md) for instructions on adding an OCR engine, an LLM
|
| 535 |
+
adapter, or submitting a pull request.
|
| 536 |
+
|
| 537 |
+
---
|
| 538 |
|
| 539 |
+
## License
|
|
|
|
| 540 |
|
| 541 |
+
[Apache License 2.0](LICENSE)
|
| 542 |
|
| 543 |
+
Copyright 2024 Picarones contributors.
|