Claude committed on
Commit
f6a6dc4
·
unverified ·
1 Parent(s): 9e7f788

docs: rewrite README as comprehensive English documentation


Replace the bilingual French/English README with a complete English version
covering all features, installation methods, CLI commands, web API endpoints,
pipeline modes, supported engines, normalization profiles, error taxonomy,
project structure, environment variables, CI/CD setup, and development guide.

https://claude.ai/code/session_01PJLbDjPUK3VFiqKK6gT8Hh

Files changed (1)
  1. README.md +426 -176
README.md CHANGED
@@ -9,203 +9,441 @@ pinned: false
9
 
10
  # Picarones
11
 
12
- > **Plateforme de comparaison de moteurs OCR/HTR pour documents patrimoniaux**
13
- Apache 2.0
14
 
15
  [![CI](https://github.com/maribakulj/Picarones/actions/workflows/ci.yml/badge.svg)](https://github.com/maribakulj/Picarones/actions/workflows/ci.yml)
16
  [![Python 3.11+](https://img.shields.io/badge/python-3.11+-blue.svg)](https://www.python.org/downloads/)
17
  [![License: Apache 2.0](https://img.shields.io/badge/License-Apache%202.0-green.svg)](LICENSE)
 
18
 
19
  ---
20
 
21
- **Picarones** est un outil open-source conçu pour comparer des moteurs OCR et HTR
22
- (Tesseract, Pero OCR, Kraken, APIs cloud) ainsi que des pipelines OCR+LLM sur des corpus de
23
- documents historiques (manuscrits, imprimés anciens, archives).
24
 
25
- ---
26
-
27
- *[English version below](#english)*
28
 
29
  ---
30
 
31
- ## Sommaire
32
-
33
- - [Fonctionnalités](#fonctionnalités)
34
- - [Installation rapide](#installation-rapide)
35
- - [Usage rapide](#usage-rapide)
36
- - [Moteurs supportés](#moteurs-supportés)
37
- - [Structure du projet](#structure-du-projet)
38
- - [Variables d'environnement](#variables-denvironnement)
39
  - [Roadmap](#roadmap)
40
- - [English](#english)
 
41
 
42
  ---
43
 
44
- ## Fonctionnalités
45
-
46
- ### Métriques adaptées aux documents patrimoniaux
47
-
48
- - **CER** (Character Error Rate) : brut, NFC, caseless, diplomatique (ſ=s, u=v, i=j…)
49
- - **WER**, MER, WIL avec tokenisation historique
50
- - **Matrice de confusion unicode** fingerprint de chaque moteur
51
- - **Scores ligatures** : fi, fl, ff, œ, æ, ꝑ, ꝓ…
52
- - **Scores diacritiques** : accents, cédilles, trémas
53
- - **Taxonomie des erreurs** en 10 classes (confusion visuelle, abréviation, ligature, casse…)
54
- - **Intervalles de confiance à 95%** par bootstrap tests de Wilcoxon pour la significativité
55
- - **Score de difficulté intrinsèque** par document (indépendant des moteurs)
56
-
57
- ### Pipelines OCR+LLM
58
 
59
- - Chaînes composables : `tesseract → gpt-4o`, `pero_ocr → claude-sonnet`, LLM zero-shot…
60
- - Modes : texte seul, image+texte, zero-shot
61
- - Détection de **sur-normalisation LLM** : le LLM modernise-t-il à tort les graphies anciennes ?
62
- - Bibliothèque de prompts pour manuscrits médiévaux, imprimés anciens, latin…
63
 
64
- ### Import de corpus
65
 
66
- | Source | Commande |
67
- |--------|----------|
68
- | Dossier local | `picarones run --corpus ./corpus/` |
69
- | IIIF (Gallica, Bodleian, BL…) | `picarones import iiif <url>` |
70
- | Gallica (API SRU + OCR) | `GallicaClient` / `picarones import iiif` |
71
- | HuggingFace Datasets | `picarones import hf <dataset>` |
72
- | HTR-United | `picarones import htr-united` |
73
- | eScriptorium | `EScriptoriumClient` |
74
 
75
- ### Rapport HTML interactif
76
 
77
- - Fichier HTML **auto-contenu**, lisible hors-ligne
78
- - Tableau de classement trié, graphiques radar, histogrammes
79
- - Vue galerie avec filtres dynamiques et badges CER colorés
80
- - Diff coloré façon GitHub, scroll synchronisé N-way
81
- - Vue spécifique OCR+LLM : diff triple GT / OCR brut / après correction
82
- - Vue Caractères : matrice de confusion unicode interactive
83
- - Export CSV, JSON, ALTO XML, PAGE XML, images annotées
84
 
85
- ### Suivi longitudinal & robustesse
86
 
87
- - **Base SQLite** optionnelle pour historiser les runs
88
- - **Courbes d'évolution CER** dans le temps par moteur
89
- - **Détection automatique des régressions** entre deux runs
90
- - **Analyse de robustesse** : bruit, flou, rotation, réduction de résolution, binarisation
91
- - Commandes `picarones history`, `picarones robustness`
 
92
 
93
  ---
94
 
95
- ## Installation rapide
96
 
97
  ```bash
98
- # Cloner et installer
99
  git clone https://github.com/maribakulj/Picarones.git
100
- cd picarones
101
  pip install -e .
102
 
103
- # Tesseract (binaire système, obligatoire pour le moteur Tesseract)
104
  # Ubuntu/Debian
105
  sudo apt install tesseract-ocr tesseract-ocr-fra tesseract-ocr-lat
106
 
107
  # macOS
108
  brew install tesseract
109
 
110
- # Vérifier l'installation
111
  picarones engines
112
- ```
113
 
114
- Voir [INSTALL.md](INSTALL.md) pour un guide détaillé (Linux, macOS, Windows, Docker).
115
 
116
  ---
117
 
118
- ## Usage rapide
119
 
120
  ```bash
121
- # Rapport de démonstration (sans moteur OCR installé)
122
- picarones demo
123
 
124
- # Benchmark sur un corpus local
125
- picarones run --corpus ./mon_corpus/ --engines tesseract --output resultats.json
126
 
127
- # Générer le rapport HTML interactif
128
- picarones report --results resultats.json --output rapport.html
129
 
130
- # Calculer CER/WER entre deux fichiers
131
- picarones metrics --reference gt.txt --hypothesis ocr.txt
132
 
133
- # Importer depuis Gallica (IIIF)
134
  picarones import iiif https://gallica.bnf.fr/ark:/12148/xxx/manifest.json --pages 1-10
135
 
136
- # Suivi longitudinal (historique des runs)
137
- picarones history --demo
 
 
138
  picarones history --engine tesseract --regression
139
 
140
- # Analyse de robustesse
141
  picarones robustness --corpus ./gt/ --engine tesseract --demo
142
 
143
- # Interface web locale
144
- picarones serve
145
  ```
146

147
  ---
148
 
149
- ## Moteurs supportés
150
-
151
- | Moteur | Type | Installation |
152
- |--------|------|--------------|
153
- | **Tesseract 5** | Local CLI | `pip install pytesseract` + binaire système |
154
- | **Pero OCR** | Local Python | `pip install pero-ocr` |
155
- | **Kraken** | Local Python | `pip install kraken` |
156
- | **Mistral OCR** | API REST | Clé `MISTRAL_API_KEY` |
157
- | **GPT-4o** (LLM) | API REST | Clé `OPENAI_API_KEY` |
158
- | **Claude Sonnet** (LLM) | API REST | Clé `ANTHROPIC_API_KEY` |
159
- | **Mistral Large** (LLM) | API REST | Clé `MISTRAL_API_KEY` |
160
- | **Google Vision** | API REST | Credentials JSON Google |
161
- | **AWS Textract** | API REST | Credentials AWS |
162
- | **Azure Doc. Intel.** | API REST | Endpoint + clé Azure |
163
- | **Ollama** (LLM local) | Local | `ollama serve` |
164
- | **Moteur custom** | CLI/API YAML | Déclaration YAML, sans code |
 
 
 
165
 
166
  ---
167
 
168
- ## Structure du projet
169
 
170
  ```
171
  picarones/
172
- ├── cli.py # CLI Click (run, demo, report, history, robustness…)
173
- ├── fixtures.py # Données de test fictives réalistes
 
 
 
174
  ├── core/
175
- │ ├── corpus.py # Chargement corpus (dossier, ALTO, PAGE XML)
176
- │ ├── metrics.py # CER, WER, MER, WIL (jiwer)
177
- │ ├── normalization.py # Normalisation unicode, profils diplomatiques
178
- │ ├── statistics.py # Bootstrap CI, Wilcoxon, corrélations
179
- │ ├── confusion.py # Matrice de confusion unicode
180
- │ ├── char_scores.py # Scores ligatures et diacritiques
181
- │ ├── taxonomy.py # Taxonomie des erreurs (10 classes)
182
- │ ├── structure.py # Analyse structurelle
183
- │ ├── image_quality.py # Métriques qualité image
184
- │ ├── difficulty.py # Score de difficulté intrinsèque
185
- │ ├── history.py # Suivi longitudinal SQLite
186
- │ ├── robustness.py # Analyse de robustesse
187
- │ ├── results.py # Modèles de données + export JSON
188
- │ └── runner.py # Orchestrateur benchmark
189
- ├── engines/ # Adaptateurs moteurs OCR
190
- ├── llm/ # Adaptateurs LLM (OpenAI, Anthropic, Mistral, Ollama)
191
- ├── importers/ # Sources d'import (IIIF, Gallica, eScriptorium, HF…)
192
- ├── pipelines/ # Orchestrateur OCR+LLM
193
- ├── report/ # Générateur rapport HTML
194
- └── web/ # Interface web FastAPI
195
- tests/ # Tests unitaires et d'intégration (743 tests)
196
  ```
197
 
198
  ---
199
 
200
- ## Variables d'environnement
 
 
201
 
202
  ```bash
203
- # APIs LLM (selon les moteurs utilisés)
204
  export OPENAI_API_KEY="sk-..."
205
  export ANTHROPIC_API_KEY="sk-ant-..."
206
  export MISTRAL_API_KEY="..."
207
 
208
- # APIs OCR cloud (optionnel)
209
  export GOOGLE_APPLICATION_CREDENTIALS="/path/to/credentials.json"
210
  export AWS_ACCESS_KEY_ID="..."
211
  export AWS_SECRET_ACCESS_KEY="..."
@@ -214,80 +452,92 @@ export AZURE_DOC_INTEL_ENDPOINT="https://..."
214
  export AZURE_DOC_INTEL_KEY="..."
215
  ```
216
 
 
 
217
  ---
218
 
219
- ## Roadmap
220
 
221
- | Sprint | Statut | Livrables |
222
- |--------|--------|-----------|
223
- | Sprint 1 | ✅ | Structure, Tesseract, Pero OCR, CER/WER, CLI |
224
- | Sprint 2 | ✅ | Rapport HTML v1, diff coloré, galerie |
225
- | Sprint 3 | ✅ | Pipelines OCR+LLM, GPT-4o, Claude |
226
- | Sprint 4 | ✅ | APIs cloud, import IIIF, normalisation diplomatique |
227
- | Sprint 5 | ✅ | Métriques avancées : confusion unicode, ligatures, taxonomie |
228
- | Sprint 6 | ✅ | Interface web FastAPI, HTR-United, HuggingFace, Ollama |
229
- | Sprint 7 | ✅ | Rapport HTML v2 : Wilcoxon, clustering, scatter plots |
230
- | Sprint 8 | ✅ | eScriptorium, Gallica API, historique longitudinal, robustesse |
231
- | Sprint 9 | ✅ | Documentation, packaging, Docker, CI/CD |
232
 
233
- ---
234
 
235
- ## Contribuer
236
 
237
- Voir [CONTRIBUTING.md](CONTRIBUTING.md) pour ajouter un moteur OCR, un adaptateur LLM, ou soumettre une pull request.
 
238
 
239
  ---
240
 
241
- ## Licence
242
 
243
- Apache License 2.0
 
 
244
 
245
- ---
 
246
 
247
- ---
 
248
 
249
- # English
 
250
 
251
- ## Picarones OCR/HTR Benchmark Platform for Heritage Documents
 
252
 
253
- **Picarones** is an open-source platform for rigorously comparing OCR and HTR engines (Tesseract,
254
- Pero OCR, Kraken, cloud APIs…) and OCR+LLM pipelines on historical document corpora — manuscripts,
255
- early printed books, archives.
256
 
257
- ### Key Features
258
 
259
- - **Metrics tailored to historical documents**: CER (raw, NFC, caseless, diplomatic), WER, MER,
260
- WIL; unicode confusion matrix; ligature and diacritic scores; 10-class error taxonomy; bootstrap
261
- confidence intervals; Wilcoxon significance tests
262
- - **OCR+LLM pipelines**: composable chains (`tesseract → gpt-4o`), three modes (text-only,
263
- image+text, zero-shot), LLM over-normalisation detection
264
- - **Corpus import**: local folder, IIIF (Gallica, Bodleian, BL…), Gallica API + OCR, HuggingFace
265
- Datasets, HTR-United, eScriptorium
266
- - **Interactive HTML report**: self-contained file, sortable ranking, gallery, coloured diff,
267
- unicode character view, CSV/JSON/ALTO/PAGE XML export
268
- - **Longitudinal tracking**: SQLite benchmark history, CER evolution curves, automatic regression
269
- detection
270
- - **Robustness analysis**: degraded image versions (noise, blur, rotation, resolution,
271
- binarisation), critical threshold detection
272
 
273
- ### Quick Start
274
 
275
- ```bash
276
- pip install -e .
277
- sudo apt install tesseract-ocr tesseract-ocr-fra # Ubuntu/Debian
278
- picarones demo # demo report without any engine installed
279
- picarones engines # list available engines
280
- picarones run --corpus ./corpus/ --engines tesseract --output results.json
281
- picarones report --results results.json
282
- ```
283
 
284
- See [INSTALL.md](INSTALL.md) for detailed installation on Linux, macOS, Windows, and Docker.
285
 
286
- ### Supported Engines
287
 
288
- Tesseract 5 · Pero OCR · Kraken · Mistral OCR · GPT-4o · Claude Sonnet · Mistral Large ·
289
- Google Vision · AWS Textract · Azure Document Intelligence · Ollama (local LLMs) · Custom YAML engine
290
 
291
- ### License
292
 
293
- Apache License 2.0
 
9
 
10
  # Picarones
11
 
12
+ > **OCR/HTR Benchmarking Platform for Heritage Documents**
 
13
 
14
  [![CI](https://github.com/maribakulj/Picarones/actions/workflows/ci.yml/badge.svg)](https://github.com/maribakulj/Picarones/actions/workflows/ci.yml)
15
  [![Python 3.11+](https://img.shields.io/badge/python-3.11+-blue.svg)](https://www.python.org/downloads/)
16
  [![License: Apache 2.0](https://img.shields.io/badge/License-Apache%202.0-green.svg)](LICENSE)
17
+ [![HuggingFace Space](https://img.shields.io/badge/%F0%9F%A4%97-HuggingFace%20Space-yellow.svg)](https://huggingface.co/spaces/Ma-Ri-Ba-Ku/Picarones)
18
 
19
  ---
20
 
21
+ **Picarones** is an open-source platform for rigorously comparing OCR and HTR engines
22
+ (Tesseract, Pero OCR, Kraken, cloud APIs...) and OCR+LLM pipelines on historical document
23
+ corpora -- manuscripts, early printed books, and archives.
24
 
25
+ It provides heritage-specific metrics (diplomatic CER, ligature scores, medieval abbreviation
26
+ handling), composable OCR+LLM pipelines, interactive HTML reports, and multiple corpus import
27
+ sources including IIIF, Gallica, HuggingFace, and eScriptorium.
28
 
29
  ---
30
 
31
+ ## Table of Contents
32
+
33
+ - [Features](#features)
34
+ - [Heritage-Specific Metrics](#heritage-specific-metrics)
35
+ - [OCR+LLM Pipelines](#ocr-llm-pipelines)
36
+ - [Corpus Import](#corpus-import)
37
+ - [Interactive HTML Report](#interactive-html-report)
38
+ - [Longitudinal Tracking & Robustness](#longitudinal-tracking--robustness)
39
+ - [Web Interface](#web-interface)
40
+ - [Quick Start](#quick-start)
41
+ - [Installation](#installation)
42
+ - [From Source](#from-source)
43
+ - [Docker](#docker)
44
+ - [Optional Extras](#optional-extras)
45
+ - [Usage](#usage)
46
+ - [CLI Commands](#cli-commands)
47
+ - [Web Interface](#web-interface-1)
48
+ - [Pipeline Modes](#pipeline-modes)
49
+ - [Supported Engines](#supported-engines)
50
+ - [Normalization Profiles](#normalization-profiles)
51
+ - [Error Taxonomy](#error-taxonomy)
52
+ - [Project Structure](#project-structure)
53
+ - [Environment Variables](#environment-variables)
54
+ - [CI/CD](#cicd)
55
+ - [Development](#development)
56
  - [Roadmap](#roadmap)
57
+ - [Contributing](#contributing)
58
+ - [License](#license)
59
 
60
  ---
61
 
62
+ ## Features
63
+
64
+ ### Heritage-Specific Metrics
65
+
66
+ - **CER** (Character Error Rate) in four variants: raw, NFC-normalized, caseless, and
67
+ **diplomatic** (historical equivalences: long s = s, u = v, i = j, etc.)
68
+ - **WER**, **MER**, **WIL** with historically aware tokenization (via [jiwer](https://github.com/jitsi/jiwer))
69
+ - **Unicode confusion matrix** -- fingerprint each engine's character-level errors
70
+ - **Ligature and diacritic scores** -- track handling of fi, fl, ff, oe, ae, p-bar, and other
71
+ medieval glyphs
72
+ - **10-class error taxonomy** -- automatic classification of every error (visual confusion,
73
+ abbreviation, segmentation, lacuna, over-normalization, etc.)
74
+ - **Bootstrap 95% confidence intervals** and **Wilcoxon signed-rank tests** for statistical
75
+ significance
76
+ - **Intrinsic difficulty score** per document, independent of engine performance
77
+ - **Line-level error distribution** with Gini coefficient and percentile analysis
78
+ - **VLM hallucination detection** -- anchor score and length ratio to flag fabricated output
79
+
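The bootstrap confidence interval mentioned in the list above can be illustrated with a minimal sketch. It is not the code in `core/statistics.py`; the function name `bootstrap_ci`, the resample count, the percentile method, and the sample CER values are all assumptions made for illustration.

```python
import random
import statistics


def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    if not values:
        raise ValueError("values must not be empty")
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return means[lo_idx], means[hi_idx]


# Hypothetical per-document CER values for a single engine.
cer_per_document = [0.04, 0.07, 0.05, 0.12, 0.03, 0.09, 0.06, 0.08]
low, high = bootstrap_ci(cer_per_document)
mean_cer = statistics.fmean(cer_per_document)
print(f"mean CER = {mean_cer:.3f}, 95% CI = [{low:.3f}, {high:.3f}]")
```

Reporting the interval alongside the mean makes it easier to tell whether two engines differ by more than sampling noise on a small corpus.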
80
+ ### OCR+LLM Pipelines
81
+
82
+ - Composable chains: `tesseract -> gpt-4o`, `pero_ocr -> claude-sonnet`, zero-shot VLM, etc.
83
+ - Three pipeline modes: text-only post-correction, image+text post-correction, and zero-shot
84
+ - **Over-normalization detection** -- does the LLM silently modernize historical spellings?
85
+ - Versioned prompt library for medieval French, early modern French, medieval Latin, medieval
86
+ English, and early modern English -- both correction and zero-shot variants
87
+
88
+ ### Corpus Import
89
+
90
+ | Source | Method |
91
+ |--------|--------|
92
+ | Local folder | `picarones run --corpus ./corpus/` |
93
+ | IIIF manifests (Gallica, Bodleian, BL...) | `picarones import iiif <manifest-url>` |
94
+ | Gallica API (SRU + OCR) | `GallicaClient` / `picarones import iiif` |
95
+ | HuggingFace Datasets | `picarones import hf <dataset-id>` |
96
+ | HTR-United catalogue | `picarones import htr-united` |
97
+ | eScriptorium | `EScriptoriumClient` |
98
+ | ZIP upload (browser) | Web interface upload endpoint |
99
 
100
+ Supported corpus formats: plain text pairs (image + ground truth), **ALTO XML**, and **PAGE XML**.
 
 
 
101
 
102
+ ### Interactive HTML Report
103
 
104
+ - **Self-contained HTML file** -- works offline, no server needed
105
+ - Sortable ranking table, radar charts, histograms (powered by Chart.js)
106
+ - Gallery view with dynamic filters and color-coded CER badges
107
+ - GitHub-style colored diff with synchronized N-way scrolling
108
+ - Triple diff view for OCR+LLM: ground truth / raw OCR / post-correction
109
+ - Unicode character view: interactive confusion matrix explorer
110
+ - Export to **CSV**, **JSON**, **ALTO XML**, **PAGE XML**, and annotated images
 
111
 
112
+ ### Longitudinal Tracking & Robustness
113
 
114
+ - Optional **SQLite database** to record benchmark history across runs
115
+ - **CER evolution curves** over time, per engine
116
+ - **Automatic regression detection** between consecutive runs
117
+ - **Robustness analysis**: measure engine resilience to noise, blur, rotation, resolution
118
+ reduction, and binarization
119
+ - Critical degradation threshold identification
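As a rough illustration of regression detection between consecutive runs, the sketch below stores a few runs in an in-memory SQLite database and flags any run whose CER rose by more than a tolerance. The table schema, the `find_regressions` helper, and the tolerance value are hypothetical and do not describe the schema used by `core/history.py`.

```python
import sqlite3


def find_regressions(runs, tolerance=0.01):
    """Return (previous_id, current_id) pairs where CER rose by more than `tolerance`.

    `runs` is a chronologically ordered list of (run_id, cer) tuples.
    """
    regressions = []
    for previous, current in zip(runs, runs[1:]):
        if current[1] - previous[1] > tolerance:
            regressions.append((previous[0], current[0]))
    return regressions


# Hypothetical schema: one row per benchmark run.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE runs (id INTEGER PRIMARY KEY, engine TEXT, cer REAL)")
conn.executemany(
    "INSERT INTO runs (engine, cer) VALUES (?, ?)",
    [("tesseract", 0.080), ("tesseract", 0.075), ("tesseract", 0.110), ("tesseract", 0.112)],
)
rows = conn.execute(
    "SELECT id, cer FROM runs WHERE engine = ? ORDER BY id", ("tesseract",)
).fetchall()
regressions = find_regressions(rows)
print(regressions)
conn.close()
```

A tolerance avoids flagging small run-to-run fluctuations as regressions; a stricter check could compare confidence intervals instead of point estimates.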
 
120
 
121
+ ### Web Interface
122
 
123
+ - **FastAPI** application with real-time **Server-Sent Events** (SSE) progress streaming
124
+ - Upload corpus as a **ZIP file** directly from the browser
125
+ - Dynamic engine and normalization profile selectors
126
+ - Browse and re-download generated HTML reports
127
+ - Bilingual **French/English** interface
128
+ - Deployable on HuggingFace Spaces (Docker, port 7860)
129
 
130
  ---
131
 
132
+ ## Quick Start
133
 
134
  ```bash
135
+ # Clone and install
136
  git clone https://github.com/maribakulj/Picarones.git
137
+ cd Picarones
138
  pip install -e .
139
 
140
+ # Install Tesseract (system binary, required for the Tesseract engine)
141
  # Ubuntu/Debian
142
  sudo apt install tesseract-ocr tesseract-ocr-fra tesseract-ocr-lat
143
 
144
  # macOS
145
  brew install tesseract
146
 
147
+ # Generate a demo report (no OCR engine needed)
148
+ picarones demo --output demo_report.html
149
+
150
+ # List available engines
151
  picarones engines
 
152
 
153
+ # Run a benchmark
154
+ picarones run --corpus ./corpus/ --engines tesseract --output results.json
155
+
156
+ # Generate HTML report
157
+ picarones report --results results.json --output report.html
158
+
159
+ # Launch the web interface
160
+ picarones serve --port 8080
161
+ ```
162
 
163
  ---
164
 
165
+ ## Installation
166
+
167
+ ### From Source
168
+
169
+ ```bash
170
+ git clone https://github.com/maribakulj/Picarones.git
171
+ cd Picarones
172
+ pip install -e ".[dev,web]" # includes test and web dependencies
173
+ ```
174
+
175
+ **System requirements:**
176
+
177
+ - Python >= 3.11
178
+ - [Tesseract OCR 5](https://github.com/tesseract-ocr/tesseract) (for the Tesseract engine)
179
+
180
+ ### Docker
181
 
182
  ```bash
183
+ docker build -t picarones .
184
+ docker run -p 7860:7860 \
185
+ -e MISTRAL_API_KEY=... \
186
+ -e OPENAI_API_KEY=... \
187
+ picarones
188
+ ```
189
+
190
+ The Docker image is based on Python 3.11-slim, includes Tesseract 5 with language packs
191
+ (fra, lat, eng, deu, ita, spa), and runs as a non-root user. A health check polls
192
+ `/health` every 30 seconds.
193
 
194
+ The [HuggingFace Space](https://huggingface.co/spaces/Ma-Ri-Ba-Ku/Picarones) uses this
195
+ same Docker image.
196
+
197
+ ### Optional Extras
198
+
199
+ | Extra | Install command | What it adds |
200
+ |-------|----------------|--------------|
201
+ | `dev` | `pip install -e ".[dev]"` | pytest, pytest-cov, httpx, linting |
202
+ | `web` | `pip install -e ".[web]"` | FastAPI, uvicorn, python-multipart |
203
+ | `llm` | `pip install -e ".[llm]"` | OpenAI, Anthropic, Mistral SDKs |
204
+ | `hf` | `pip install -e ".[hf]"` | HuggingFace Datasets |
205
+ | `pero` | `pip install -e ".[pero]"` | Pero OCR engine |
206
+ | `kraken` | `pip install -e ".[kraken]"` | Kraken engine |
207
+ | `ocr-cloud` | `pip install -e ".[ocr-cloud]"` | Google Vision, AWS Textract, Azure Doc Intelligence |
208
+ | `all` | `pip install -e ".[all]"` | Everything except ocr-cloud |
209
+
210
+ See [INSTALL.md](INSTALL.md) for detailed instructions on Linux, macOS, Windows, and Docker.
211
+
212
+ ---
213
 
214
+ ## Usage
 
215
 
216
+ ### CLI Commands
 
217
 
218
+ | Command | Description |
219
+ |---------|-------------|
220
+ | `picarones run` | Run a full benchmark on a corpus |
221
+ | `picarones report` | Generate an HTML report from JSON results |
222
+ | `picarones demo` | Generate a demo report with synthetic data (no engine required) |
223
+ | `picarones metrics` | Calculate CER/WER between two text files |
224
+ | `picarones engines` | List all available OCR engines and LLM adapters |
225
+ | `picarones info` | Display version and system information |
226
+ | `picarones serve` | Launch the FastAPI web interface |
227
+ | `picarones history` | Query longitudinal benchmark history (SQLite) |
228
+ | `picarones robustness` | Run robustness analysis with degraded images |
229
+ | `picarones import` | Import corpus from IIIF, HuggingFace, or HTR-United |
230
+
231
+ **Examples:**
232
+
233
+ ```bash
234
+ # Benchmark with Tesseract, French language, PSM 6
235
+ picarones run --corpus ./manuscripts/ --engines tesseract --lang fra --psm 6 \
236
+ --output results.json --verbose
237
+
238
+ # Compare two text files
239
+ picarones metrics --reference ground_truth.txt --hypothesis ocr_output.txt
240
+
241
+ # Import 10 pages from a Gallica IIIF manifest
242
  picarones import iiif https://gallica.bnf.fr/ark:/12148/xxx/manifest.json --pages 1-10
243
 
244
+ # Import a HuggingFace dataset
245
+ picarones import hf medieval-ocr/dataset-name
246
+
247
+ # View benchmark history with regression detection
248
  picarones history --engine tesseract --regression
249
 
250
+ # Robustness demo (noise, blur, rotation, resolution)
251
  picarones robustness --corpus ./gt/ --engine tesseract --demo
252
 
253
+ # Fail CI if CER exceeds threshold
254
+ picarones run --corpus ./corpus/ --engines tesseract --fail-if-cer-above 0.15
255
  ```
256
 
257
+ ### Web Interface
258
+
259
+ ```bash
260
+ picarones serve --host 0.0.0.0 --port 8080
261
+ ```
262
+
263
+ **API endpoints include:**
264
+
265
+ | Endpoint | Method | Description |
266
+ |----------|--------|-------------|
267
+ | `/` | GET | Main single-page application |
268
+ | `/api/status` | GET | Version and application status |
269
+ | `/api/engines` | GET | Available OCR/LLM engines |
270
+ | `/api/normalization/profiles` | GET | Normalization profiles (read dynamically) |
271
+ | `/api/benchmark/start` | POST | Start a benchmark job (returns `job_id`) |
272
+ | `/api/benchmark/{job_id}/stream` | GET | SSE real-time progress stream |
273
+ | `/api/benchmark/{job_id}/cancel` | POST | Cancel a running benchmark |
274
+ | `/api/corpus/browse` | GET | Browse server-side corpus folders |
275
+ | `/api/htr-united/catalogue` | GET | Browse HTR-United catalogue |
276
+ | `/api/huggingface/search` | GET | Search HuggingFace datasets |
277
+ | `/reports/{filename}` | GET | Download generated HTML reports |
278
+
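For readers unfamiliar with Server-Sent Events, the following sketch parses a `text/event-stream` body such as the one a `/stream` endpoint would emit. The README does not document the event names or payload fields, so the `progress` and `complete` events and their JSON contents below are invented for illustration; only the line-based SSE framing itself is standard.

```python
import json


def parse_sse(lines):
    """Yield (event_name, data) tuples from an iterable of SSE text lines."""
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if data:
                yield event, "\n".join(data)
            event, data = "message", []
        elif line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        else:
            field, _, value = line.partition(":")
            if value.startswith(" "):
                value = value[1:]
            if field == "event":
                event = value
            elif field == "data":
                data.append(value)


# Hypothetical stream content; real event names and fields may differ.
sample = [
    ": keep-alive",
    "event: progress",
    'data: {"done": 3, "total": 10}',
    "",
    "event: complete",
    'data: {"report": "report.html"}',
    "",
]
events = [(name, json.loads(payload)) for name, payload in parse_sse(sample)]
print(events)
```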
279
+ ### Pipeline Modes
280
+
281
+ Picarones supports three modes for OCR+LLM pipelines:
282
+
283
+ | Mode | Description | Model type |
284
+ |------|-------------|------------|
285
+ | `zero_shot` | LLM receives the image directly and transcribes without prior OCR | VLM (vision) |
286
+ | `post_correction_texte` | OCR produces raw text, then LLM corrects it | Text-only LLM |
287
+ | `post_correction_image_texte` | OCR produces raw text, then LLM receives both image and text for correction | VLM (vision) |
288
+
289
+ **Example:** `ministral-3b-latest` is a text-only model and should use `post_correction_texte`.
290
+ GPT-4o and Claude support all three modes.
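The constraint in the table above (two of the three modes need a vision-capable model) can be expressed as a small validation step. This is a sketch only: the mode names come from the table, but the `check_mode` helper and its `supports_vision` flag are hypothetical and not part of the Picarones API.

```python
# Modes that send the page image to the model, per the table above.
VISION_MODES = {"zero_shot", "post_correction_image_texte"}
ALL_MODES = VISION_MODES | {"post_correction_texte"}


def check_mode(mode, supports_vision):
    """Return `mode` if it is known and compatible with the model, else raise."""
    if mode not in ALL_MODES:
        raise ValueError(f"unknown pipeline mode: {mode!r}")
    if mode in VISION_MODES and not supports_vision:
        raise ValueError(f"mode {mode!r} requires a vision-capable model")
    return mode


# A text-only model such as ministral-3b-latest can only post-correct text.
print(check_mode("post_correction_texte", supports_vision=False))
```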
291
+
292
  ---
293
 
294
+ ## Supported Engines
295
+
296
+ | Engine | Type | Execution Mode | Installation |
297
+ |--------|------|---------------|-------------|
298
+ | **Tesseract 5** | Local CLI | CPU (ProcessPool) | `pip install pytesseract` + system binary |
299
+ | **Pero OCR** | Local Python | CPU (ProcessPool) | `pip install pero-ocr` |
300
+ | **Kraken** | Local Python | CPU (ProcessPool) | `pip install kraken` |
301
+ | **Mistral OCR** | Cloud API | IO (ThreadPool) | `MISTRAL_API_KEY` env var |
302
+ | **Google Vision** | Cloud API | IO (ThreadPool) | `GOOGLE_APPLICATION_CREDENTIALS` env var |
303
+ | **Azure Doc Intelligence** | Cloud API | IO (ThreadPool) | `AZURE_DOC_INTEL_ENDPOINT` + `AZURE_DOC_INTEL_KEY` |
304
+ | **GPT-4o** (VLM) | LLM API | IO (ThreadPool) | `OPENAI_API_KEY` env var |
305
+ | **Claude Sonnet** (VLM) | LLM API | IO (ThreadPool) | `ANTHROPIC_API_KEY` env var |
306
+ | **Mistral Large** (LLM) | LLM API | IO (ThreadPool) | `MISTRAL_API_KEY` env var |
307
+ | **Ollama** (local LLM) | Local LLM | IO (ThreadPool) | `ollama serve` running locally |
308
+ | **Custom engine** | CLI or API | Configurable | YAML declaration, no code required |
309
+
310
+ Engines declare their `execution_mode` (`"io"` or `"cpu"`), allowing the runner to use
311
+ `ThreadPoolExecutor` for IO-bound engines and `ProcessPoolExecutor` for CPU-bound engines
312
+ simultaneously.
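A minimal sketch of this dispatch idea, using only the standard library: engines are grouped by their declared mode and each group is mapped to an executor class. The engine list, the `group_by_mode` helper, and the `fake_api_call` stand-in are assumptions for illustration, not the logic of `core/runner.py`.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

# Executor class per declared execution mode.
EXECUTORS = {"io": ThreadPoolExecutor, "cpu": ProcessPoolExecutor}


def group_by_mode(engines):
    """Split (name, execution_mode) pairs into IO-bound and CPU-bound groups."""
    groups = {"io": [], "cpu": []}
    for name, mode in engines:
        groups[mode].append(name)
    return groups


def fake_api_call(name):
    """Stand-in for a network-bound OCR request."""
    return f"{name}: done"


# Hypothetical engine declarations.
engines = [("tesseract", "cpu"), ("mistral_ocr", "io"), ("pero_ocr", "cpu"), ("gpt-4o", "io")]
groups = group_by_mode(engines)

# IO-bound engines share a thread pool; CPU-bound ones would go to a process pool.
with EXECUTORS["io"](max_workers=4) as pool:
    results = list(pool.map(fake_api_call, groups["io"]))
print(groups, results)
```

Threads suit engines that mostly wait on the network, while separate processes let CPU-bound engines run in parallel without contending for the interpreter lock.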
313
 
314
  ---
315
 
316
+ ## Normalization Profiles
317
+
318
+ Picarones includes eight built-in diplomatic normalization profiles designed for historical
319
+ text comparison. These reduce noise from expected orthographic variation so metrics reflect
320
+ genuine OCR errors, not historical spelling differences.
321
+
322
+ | Profile | Period | Key equivalences |
323
+ |---------|--------|-----------------|
324
+ | `nfc` | Any | Unicode NFC normalization only |
325
+ | `minimal` | Any | NFC + long s (ſ -> s) |
326
+ | `medieval_french` | 12th-15th c. | ſ=s, u=v, i=j, y=i, æ=ae, œ=oe, p-bar=per, etc. |
327
+ | `early_modern_french` | 16th-18th c. | ſ=s, æ=ae, œ=oe, y-tilde=yn, &=et |
328
+ | `medieval_latin` | 12th-15th c. | ſ=s, u=v, i=j, æ=ae, œ=oe, p-bar=per, q-bar=que |
329
+ | `medieval_english` | 12th-15th c. | ſ=s, u=v, i=j, thorn=th, eth=th, yogh=y, p-bar=per |
330
+ | `early_modern_english` | 16th-18th c. | ſ=s, u=v, i=j, vv=w, thorn=th, eth=th, yogh=y |
331
+ | `secretary_hand` | 16th-17th c. | Early modern English + secretary hand visual confusions |
332
+
333
+ Custom profiles can be loaded from YAML files with user-defined diplomatic tables.
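To show why a diplomatic profile changes the score, here is a small sketch that computes CER before and after applying an equivalence table. The table is a hypothetical subset of the equivalences listed above, and the plain Levenshtein implementation is a stand-in for illustration (the project structure states that Picarones computes its metrics via jiwer).

```python
import unicodedata

# Hypothetical subset of a diplomatic table: each key is folded into its value.
DIPLOMATIC_TABLE = {"ſ": "s", "v": "u", "j": "i", "æ": "ae", "œ": "oe"}


def normalize(text, table=DIPLOMATIC_TABLE):
    """Apply NFC normalization, then fold historical variants together."""
    text = unicodedata.normalize("NFC", text)
    return "".join(table.get(ch, ch) for ch in text)


def levenshtein(a, b):
    """Character-level edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character Error Rate: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


reference = "vne maiſon"   # ground truth keeps the historical spelling
hypothesis = "une maison"  # engine output uses modern letter forms
raw_cer = cer(reference, hypothesis)
diplomatic_cer = cer(normalize(reference), normalize(hypothesis))
print(raw_cer, diplomatic_cer)
```

Both differences in this example (v/u and ſ/s) are expected orthographic variation, so the raw CER penalizes them while the diplomatic CER does not.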
334
+
335
+ ---
336
+
337
+ ## Error Taxonomy
338
+
339
+ Every character-level error is automatically classified into one of 10 categories:
340
+
341
+ | Class | Name | Description |
342
+ |-------|------|-------------|
343
+ | 1 | `visual_confusion` | Morphologically similar characters (rn/m, l/1, O/0, u/n) |
344
+ | 2 | `diacritic_error` | Missing, incorrect, or spurious diacritical mark |
345
+ | 3 | `case_error` | Case difference only (A/a) |
346
+ | 4 | `ligature_error` | Ligature not resolved or incorrectly resolved |
347
+ | 5 | `abbreviation_error` | Medieval abbreviation not expanded |
348
+ | 6 | `hapax` | Word not found in any reference lexicon |
349
+ | 7 | `segmentation_error` | Token fusion or fragmentation (words/lines) |
350
+ | 8 | `oov_character` | Character outside the engine's vocabulary |
351
+ | 9 | `lacuna` | Text present in ground truth but absent from OCR output |
352
+ | 10 | `over_normalization` | LLM silently modernized a historical spelling |
353
+
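Two of the classes above, `case_error` and `diacritic_error`, lend themselves to a short sketch based on Unicode properties. The README does not describe the actual rules in `core/taxonomy.py`, so the `classify_substitution` helper below is an assumption that covers only single-character substitutions and lumps everything else into `"other"`.

```python
import unicodedata


def strip_diacritics(ch):
    """Remove combining marks from a character via NFD decomposition."""
    decomposed = unicodedata.normalize("NFD", ch)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def classify_substitution(expected, actual):
    """Classify a one-character substitution into a few taxonomy classes."""
    if expected == actual:
        return None  # not an error
    if expected.lower() == actual.lower():
        return "case_error"
    if strip_diacritics(expected) == strip_diacritics(actual):
        return "diacritic_error"
    return "other"  # the remaining classes need lexicons or alignment context


for pair in [("A", "a"), ("é", "e"), ("m", "n")]:
    print(pair, classify_substitution(*pair))
```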
354
+ ---
355
+
356
+ ## Project Structure
357
 
358
  ```
359
  picarones/
360
+ ├── __init__.py # Version (1.0.0), package metadata
361
+ ├── cli.py # Click CLI: run, demo, report, metrics, engines, info,
362
+ │ # serve, import, history, robustness
363
+ ├── fixtures.py # Realistic synthetic test data (medieval documents)
364
+
365
  ├── core/
366
+ │ ├── corpus.py # Corpus loading (folder, ALTO XML, PAGE XML)
367
+ │ ├── metrics.py # CER, WER, MER, WIL (via jiwer)
368
+ │ ├── normalization.py # Unicode normalization, 8 diplomatic profiles
369
+ │ ├── statistics.py # Bootstrap CI 95%, Wilcoxon test, correlations
370
+ │ ├── runner.py # Benchmark orchestrator (ThreadPool + ProcessPool)
371
+ │ ├── results.py # DocumentResult, BenchmarkResults, JSON export
372
+ │ ├── confusion.py # Unicode confusion matrix
373
+ │ ├── char_scores.py # Ligature and diacritic scores
374
+ │ ├── taxonomy.py # 10-class error taxonomy
375
+ │ ├── structure.py # Structural analysis (blocks, lines, words)
376
+ │ ├── image_quality.py # Image quality metrics (contrast, noise, resolution)
377
+ │ ├── difficulty.py # Intrinsic difficulty score per document
378
+ │ ├── hallucination.py # VLM hallucination detection
379
+ │ ├── line_metrics.py # Line-level error distribution (Gini, percentiles)
380
+ │ ├── history.py # SQLite longitudinal tracking
381
+ │ └── robustness.py # Robustness analysis (noise, blur, rotation, resolution)
382
+
383
+ ├── engines/
384
+ │ ├── base.py # BaseOCREngine (execution_mode: "io" | "cpu")
385
+ │ ├── tesseract.py # Tesseract 5 adapter (CPU)
386
+ │ ├── pero_ocr.py # Pero OCR adapter (CPU)
387
+ │ ├── mistral_ocr.py # Mistral OCR API (/v1/ocr endpoint)
388
+ │ ├── google_vision.py # Google Cloud Vision adapter
389
+ │ └── azure_doc_intel.py # Azure Document Intelligence adapter
390
+
391
+ ├── llm/
392
+ │ ├── base.py # BaseLLMAdapter interface
393
+ │ ├── openai_adapter.py # OpenAI / GPT-4o adapter
394
+ │ ├── anthropic_adapter.py # Anthropic / Claude adapter
395
+ │ ├── mistral_adapter.py # Mistral chat completions adapter
396
+ │ └── ollama_adapter.py # Ollama local LLM adapter
397
+
398
+ ├── pipelines/
399
+ │ ├── base.py # OCRLLMPipeline orchestrator
400
+ │ └── over_normalization.py # Over-normalization detection
401
+
402
+ ├── prompts/ # Versioned prompt templates (FR + EN)
403
+ │ ├── correction_medieval_french.txt
404
+ │ ├── zero_shot_medieval_french.txt
405
+ │ ├── correction_imprime_ancien.txt
406
+ │ ├── zero_shot_imprime_ancien.txt
407
+ │ ├── correction_medieval_english.txt
408
+ │ ├── zero_shot_medieval_english.txt
409
+ │ ├── correction_early_modern_english.txt
410
+ │ └── correction_image_medieval_french.txt
411
+
412
+ ├── report/
413
+ │ ├── generator.py # Self-contained HTML report (Chart.js + diff2html)
414
+ │ └── diff_utils.py # Diff computation utilities
415
+
416
+ ├── web/
417
+ │ ├── app.py # FastAPI app (SSE, ZIP upload, dynamic endpoints)
418
+ │ └── static/ # CSS assets
419
+
420
+ └── importers/
421
+ ├── iiif.py # IIIF manifest importer
422
+ ├── gallica.py # Gallica API client (BnF)
423
+ ├── htr_united.py # HTR-United catalogue importer
424
+ ├── huggingface.py # HuggingFace Datasets importer
425
+ └── escriptorium.py # eScriptorium client
426
+
427
+ tests/ # ~1020 unit and integration tests
428
+ .github/workflows/
429
+ ├── ci.yml # CI: Python 3.11/3.12, Linux/macOS/Windows
430
+ └── sync_to_huggingface.yml # Auto-sync to HuggingFace Space on push to main
431
+ Dockerfile # Multi-stage Docker build for HuggingFace Spaces
432
  ```
433
 
434
  ---
435
 
436
+ ## Environment Variables
437
+
438
+ Configure API keys depending on which engines and LLM adapters you use:
439
 
440
  ```bash
441
+ # LLM APIs
442
  export OPENAI_API_KEY="sk-..."
443
  export ANTHROPIC_API_KEY="sk-ant-..."
444
  export MISTRAL_API_KEY="..."
445
 
446
+ # Cloud OCR APIs (optional)
447
  export GOOGLE_APPLICATION_CREDENTIALS="/path/to/credentials.json"
448
  export AWS_ACCESS_KEY_ID="..."
449
  export AWS_SECRET_ACCESS_KEY="..."
 
452
  export AZURE_DOC_INTEL_KEY="..."
453
  ```
454
 
455
+ For deployment on HuggingFace Spaces, set these in **Settings > Variables and secrets**.
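
Before starting a long benchmark it can help to check which credentials are actually present. The sketch below is illustrative only: `REQUIRED_VARS` and `configured_backends` are hypothetical names, not part of the Picarones API. The grouping of variables per backend is an assumption based solely on the variables listed above, so a given adapter may need more than is shown here.

```python
import os

# Hypothetical mapping from backend name to the environment variables it
# needs. Only the variable names come from the README; the backend labels
# and the grouping are assumptions made for this example.
REQUIRED_VARS = {
    "openai": ["OPENAI_API_KEY"],
    "anthropic": ["ANTHROPIC_API_KEY"],
    "mistral": ["MISTRAL_API_KEY"],
    "google_vision": ["GOOGLE_APPLICATION_CREDENTIALS"],
    "aws": ["AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"],
    "azure_doc_intel": ["AZURE_DOC_INTEL_KEY"],
}


def configured_backends(environ=None):
    """Return {backend: bool}, True when every required variable is non-empty."""
    env = os.environ if environ is None else environ
    return {
        backend: all(env.get(name) for name in names)
        for backend, names in REQUIRED_VARS.items()
    }


if __name__ == "__main__":
    for backend, ok in configured_backends().items():
        print(f"{backend:16s} {'configured' if ok else 'missing'}")
```

Passing a plain dictionary instead of `os.environ` makes the check easy to exercise in tests without touching the real environment.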
456
+
457
  ---
458
 
459
+ ## CI/CD
460
 
461
+ ### GitHub Actions (`ci.yml`)
462
 
463
+ - **Triggers:** push to `main`/`develop`/`feature/*`/`sprint/*`/`claude/*`, PRs to
464
+ `main`/`develop`, manual dispatch
465
+ - **Matrix:** Python 3.11 + 3.12 on Linux, macOS, and Windows
466
+ - **Jobs:**
467
+ 1. **Tests** -- full pytest suite with coverage; the coverage report is uploaded to Codecov
468
+ 2. **Demo** -- end-to-end demo report generation with history and robustness
469
+ 3. **Build** -- wheel and sdist with twine validation
470
+ 4. **Lint** -- ruff check for errors (E, W, F; ignores E501, E402)
471
 
472
+ ### HuggingFace Sync (`sync_to_huggingface.yml`)
473
 
474
+ - Automatically pushes `main` to the HuggingFace Space `Ma-Ri-Ba-Ku/Picarones`
475
+ - Requires the `HF_TOKEN` secret in GitHub repository settings
476
 
477
  ---
478
 
479
+ ## Development
480
 
481
+ ```bash
482
+ # Install with dev + web dependencies
483
+ pip install -e ".[dev,web]"
484
 
485
+ # Run the test suite
486
+ pytest tests/ -q --tb=short
487
 
488
+ # Run with coverage
489
+ pytest tests/ --cov=picarones --cov-report=term-missing
490
 
491
+ # Generate a demo report
492
+ picarones demo --output demo_report.html
493
 
494
+ # Launch the web UI in development mode
495
+ picarones serve --port 8080
496
 
497
+ # Full refresh (useful in Codespaces)
498
+ git pull && pip install -e ".[dev,web]" && picarones demo --output demo.html
499
+ ```
500
 
501
+ **Key development conventions:**
502
 
503
+ - Never swallow exceptions with `except Exception: pass` -- always log them with `logger.warning()`
504
+ - Normalization profiles are read dynamically from `picarones/core/normalization.py` --
505
+ never hardcode them in endpoint handlers
506
+ - Engines declare their `execution_mode` (`"io"` or `"cpu"`) so the runner can select the
507
+ appropriate executor
508
+ - `python-multipart` must remain in the dependencies (FastAPI checks for it at import time)
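
The first and third conventions above can be sketched as follows. This is a minimal illustration, not the actual Picarones runner: `executor_class_for` and `safe_call` are hypothetical helpers. The mapping of `"io"` to threads and `"cpu"` to processes is an assumption about what "appropriate executor" means here.

```python
import logging
from concurrent.futures import Executor, ProcessPoolExecutor, ThreadPoolExecutor

logger = logging.getLogger("picarones.example")


def executor_class_for(execution_mode: str) -> type[Executor]:
    """Map an engine's declared execution_mode to an executor class.

    Assumption: I/O-bound engines (remote APIs) run in threads, while
    CPU-bound engines run in separate processes.
    """
    if execution_mode == "io":
        return ThreadPoolExecutor
    if execution_mode == "cpu":
        return ProcessPoolExecutor
    raise ValueError(f"unknown execution_mode: {execution_mode!r}")


def safe_call(func, *args, default=None):
    """Call func(*args); on failure, log a warning and return `default`.

    This replaces the discouraged `except Exception: pass` pattern: the
    error is still tolerated, but it leaves a trace in the logs.
    """
    try:
        return func(*args)
    except Exception as exc:
        logger.warning("%s failed: %s", getattr(func, "__name__", func), exc)
        return default


if __name__ == "__main__":
    # Run a trivial I/O-style task through the selected executor.
    with executor_class_for("io")(max_workers=2) as pool:
        print(list(pool.map(str.upper, ["tesseract", "kraken"])))
    print(safe_call(int, "not a number", default=-1))
```

Returning the executor class rather than an instance leaves pool sizing and lifetime to the caller, which keeps the selection logic trivial to test.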
509
 
510
+ ---
511
 
512
+ ## Roadmap
513
 
514
+ | Sprint | Status | Deliverables |
515
+ |--------|--------|-------------|
516
+ | 1 | Done | Project structure, Tesseract, Pero OCR, CER/WER, CLI |
517
+ | 2 | Done | HTML report v1: Chart.js, colored diff, gallery |
518
+ | 3 | Done | OCR+LLM pipelines, GPT-4o, Claude, Mistral, Ollama |
519
+ | 4 | Done | Cloud OCR APIs, IIIF import, diplomatic normalization |
520
+ | 5 | Done | Advanced metrics: confusion matrix, ligatures, 9-class taxonomy |
521
+ | 6 | Done | FastAPI web interface, HTR-United, HuggingFace, bilingual UI |
522
+ | 7 | Done | HTML report v2: Wilcoxon, bootstrap, clustering, difficulty score |
523
+ | 8 | Done | eScriptorium, Gallica API, SQLite history, robustness analysis |
524
+ | 9 | Done | Documentation, packaging, Docker, CI/CD, PyInstaller, v1.0.0-Beta |
525
+ | 10 | Done | Line error distribution (Gini), VLM hallucination detection |
526
+ | 11 | Done | Internationalization FR/EN, English normalization profiles |
527
+ | 12 | Done | Browser ZIP upload, macOS file filtering, dynamic model selector |
528
+ | 13 | Done | pyproject.toml cleanup, runner parallelization, NDJSON streaming, Wilcoxon validation |
529
 
530
+ ---
531
+
532
+ ## Contributing
533
+
534
+ See [CONTRIBUTING.md](CONTRIBUTING.md) for instructions on adding an OCR engine, an LLM
535
+ adapter, or submitting a pull request.
536
+
537
+ ---
538
 
539
+ ## License
 
540
 
541
+ [Apache License 2.0](LICENSE)
542
 
543
+ Copyright 2024 Picarones contributors.