Ma-Ri-Ba-Ku/Picarones

Claude committed on Apr 13

Commit f6a6dc4 (unverified) · 1 parent: 9e7f788

docs: rewrite README as comprehensive English documentation

Replace the bilingual French/English README with a complete English version
covering all features, installation methods, CLI commands, web API endpoints,
pipeline modes, supported engines, normalization profiles, error taxonomy,
project structure, environment variables, CI/CD setup, and development guide.

https://claude.ai/code/session_01PJLbDjPUK3VFiqKK6gT8Hh

Files changed (1)

README.md +426 -176


@@ -9,203 +9,441 @@ pinned: false
 # Picarones
-> **Plateforme de comparaison de moteurs OCR/HTR pour documents patrimoniaux**
-Apache 2.0
 [![CI](https://github.com/maribakulj/Picarones/actions/workflows/ci.yml/badge.svg)](https://github.com/maribakulj/Picarones/actions/workflows/ci.yml)
 [![Python 3.11+](https://img.shields.io/badge/python-3.11+-blue.svg)](https://www.python.org/downloads/)
 [![License: Apache 2.0](https://img.shields.io/badge/License-Apache%202.0-green.svg)](LICENSE)
 ---
-**Picarones** est un outil open-source conçu pour comparer des moteurs OCR et HTR
-(Tesseract, Pero OCR, Kraken, APIs cloud…) ainsi que des pipelines OCR+LLM sur des corpus de
-documents historiques (manuscrits, imprimés anciens, archives).
----
-*[English version below](#english)*
 ---
-## Sommaire
-- [Fonctionnalités](#fonctionnalités)
-- [Installation rapide](#installation-rapide)
-- [Usage rapide](#usage-rapide)
-- [Moteurs supportés](#moteurs-supportés)
-- [Structure du projet](#structure-du-projet)
-- [Variables d'environnement](#variables-denvironnement)
 - [Roadmap](#roadmap)
-- [English](#english)
 ---
-## Fonctionnalités
-### Métriques adaptées aux documents patrimoniaux
-- **CER** (Character Error Rate) : brut, NFC, caseless, diplomatique (ſ=s, u=v, i=j…)
-- **WER**, MER, WIL avec tokenisation historique
-- **Matrice de confusion unicode** — fingerprint de chaque moteur
-- **Scores ligatures** : fi, fl, ff, œ, æ, ꝑ, ꝓ…
-- **Scores diacritiques** : accents, cédilles, trémas
-- **Taxonomie des erreurs** en 10 classes (confusion visuelle, abréviation, ligature, casse…)
-- **Intervalles de confiance à 95%** par bootstrap — tests de Wilcoxon pour la significativité
-- **Score de difficulté intrinsèque** par document (indépendant des moteurs)
-### Pipelines OCR+LLM
-- Chaînes composables : `tesseract → gpt-4o`, `pero_ocr → claude-sonnet`, LLM zero-shot…
-- Modes : texte seul, image+texte, zero-shot
-- Détection de **sur-normalisation LLM** : le LLM modernise-t-il à tort les graphies anciennes ?
-- Bibliothèque de prompts pour manuscrits médiévaux, imprimés anciens, latin…
-### Import de corpus
-| Source | Commande |
-|--------|----------|
-| Dossier local | `picarones run --corpus ./corpus/` |
-| IIIF (Gallica, Bodleian, BL…) | `picarones import iiif <url>` |
-| Gallica (API SRU + OCR) | `GallicaClient` / `picarones import iiif` |
-| HuggingFace Datasets | `picarones import hf <dataset>` |
-| HTR-United | `picarones import htr-united` |
-| eScriptorium | `EScriptoriumClient` |
-### Rapport HTML interactif
-- Fichier HTML **auto-contenu**, lisible hors-ligne
-- Tableau de classement trié, graphiques radar, histogrammes
-- Vue galerie avec filtres dynamiques et badges CER colorés
-- Diff coloré façon GitHub, scroll synchronisé N-way
-- Vue spécifique OCR+LLM : diff triple GT / OCR brut / après correction
-- Vue Caractères : matrice de confusion unicode interactive
-- Export CSV, JSON, ALTO XML, PAGE XML, images annotées
-### Suivi longitudinal & robustesse
-- **Base SQLite** optionnelle pour historiser les runs
-- **Courbes d'évolution CER** dans le temps par moteur
-- **Détection automatique des régressions** entre deux runs
-- **Analyse de robustesse** : bruit, flou, rotation, réduction de résolution, binarisation
-- Commandes `picarones history`, `picarones robustness`
 ---
-## Installation rapide
 ```bash
-# Cloner et installer
 git clone https://github.com/maribakulj/Picarones.git
-cd picarones
 pip install -e .
-# Tesseract (binaire système, obligatoire pour le moteur Tesseract)
 # Ubuntu/Debian
 sudo apt install tesseract-ocr tesseract-ocr-fra tesseract-ocr-lat
 # macOS
 brew install tesseract
-# Vérifier l'installation
 picarones engines
-```
-Voir [INSTALL.md](INSTALL.md) pour un guide détaillé (Linux, macOS, Windows, Docker).
 ---
-## Usage rapide
 ```bash
-# Rapport de démonstration (sans moteur OCR installé)
-picarones demo
-# Benchmark sur un corpus local
-picarones run --corpus ./mon_corpus/ --engines tesseract --output resultats.json
-# Générer le rapport HTML interactif
-picarones report --results resultats.json --output rapport.html
-# Calculer CER/WER entre deux fichiers
-picarones metrics --reference gt.txt --hypothesis ocr.txt
-# Importer depuis Gallica (IIIF)
 picarones import iiif https://gallica.bnf.fr/ark:/12148/xxx/manifest.json --pages 1-10
-# Suivi longitudinal (historique des runs)
-picarones history --demo
 picarones history --engine tesseract --regression
-# Analyse de robustesse
 picarones robustness --corpus ./gt/ --engine tesseract --demo
-# Interface web locale
-picarones serve
 ```
 ---
-## Moteurs supportés
-| Moteur | Type | Installation |
-|--------|------|--------------|
-| **Tesseract 5** | Local CLI | `pip install pytesseract` + binaire système |
-| **Pero OCR** | Local Python | `pip install pero-ocr` |
-| **Kraken** | Local Python | `pip install kraken` |
-| **Mistral OCR** | API REST | Clé `MISTRAL_API_KEY` |
-| **GPT-4o** (LLM) | API REST | Clé `OPENAI_API_KEY` |
-| **Claude Sonnet** (LLM) | API REST | Clé `ANTHROPIC_API_KEY` |
-| **Mistral Large** (LLM) | API REST | Clé `MISTRAL_API_KEY` |
-| **Google Vision** | API REST | Credentials JSON Google |
-| **AWS Textract** | API REST | Credentials AWS |
-| **Azure Doc. Intel.** | API REST | Endpoint + clé Azure |
-| **Ollama** (LLM local) | Local | `ollama serve` |
-| **Moteur custom** | CLI/API YAML | Déclaration YAML, sans code |
 ---
-## Structure du projet
 ```
 picarones/
-├── cli.py                      # CLI Click (run, demo, report, history, robustness…)
-├── fixtures.py                 # Données de test fictives réalistes
 ├── core/
-│   ├── corpus.py               # Chargement corpus (dossier, ALTO, PAGE XML…)
-│   ├── metrics.py              # CER, WER, MER, WIL (jiwer)
-│   ├── normalization.py        # Normalisation unicode, profils diplomatiques
-│   ├── statistics.py           # Bootstrap CI, Wilcoxon, corrélations
-│   ├── confusion.py            # Matrice de confusion unicode
-│   ├── char_scores.py          # Scores ligatures et diacritiques
-│   ├── taxonomy.py             # Taxonomie des erreurs (10 classes)
-│   ├── structure.py            # Analyse structurelle
-│   ├── image_quality.py        # Métriques qualité image
-│   ├── difficulty.py           # Score de difficulté intrinsèque
-│   ├── history.py              # Suivi longitudinal SQLite
-│   ├── robustness.py           # Analyse de robustesse
-│   ├── results.py              # Modèles de données + export JSON
-│   └── runner.py               # Orchestrateur benchmark
-├── engines/                    # Adaptateurs moteurs OCR
-├── llm/                        # Adaptateurs LLM (OpenAI, Anthropic, Mistral, Ollama)
-├── importers/                  # Sources d'import (IIIF, Gallica, eScriptorium, HF…)
-├── pipelines/                  # Orchestrateur OCR+LLM
-├── report/                     # Générateur rapport HTML
-└── web/                        # Interface web FastAPI
-tests/                          # Tests unitaires et d'intégration (743 tests)
 ```
 ---
-## Variables d'environnement
 ```bash
-# APIs LLM (selon les moteurs utilisés)
 export OPENAI_API_KEY="sk-..."
 export ANTHROPIC_API_KEY="sk-ant-..."
 export MISTRAL_API_KEY="..."
-# APIs OCR cloud (optionnel)
 export GOOGLE_APPLICATION_CREDENTIALS="/path/to/credentials.json"
 export AWS_ACCESS_KEY_ID="..."
 export AWS_SECRET_ACCESS_KEY="..."
@@ -214,80 +452,92 @@ export AZURE_DOC_INTEL_ENDPOINT="https://..."
 export AZURE_DOC_INTEL_KEY="..."
 ```
 ---
-## Roadmap
-| Sprint | Statut | Livrables |
-|--------|--------|-----------|
-| Sprint 1 | ✅ | Structure, Tesseract, Pero OCR, CER/WER, CLI |
-| Sprint 2 | ✅ | Rapport HTML v1, diff coloré, galerie |
-| Sprint 3 | ✅ | Pipelines OCR+LLM, GPT-4o, Claude |
-| Sprint 4 | ✅ | APIs cloud, import IIIF, normalisation diplomatique |
-| Sprint 5 | ✅ | Métriques avancées : confusion unicode, ligatures, taxonomie |
-| Sprint 6 | ✅ | Interface web FastAPI, HTR-United, HuggingFace, Ollama |
-| Sprint 7 | ✅ | Rapport HTML v2 : Wilcoxon, clustering, scatter plots |
-| Sprint 8 | ✅ | eScriptorium, Gallica API, historique longitudinal, robustesse |
-| Sprint 9 | ✅ | Documentation, packaging, Docker, CI/CD |
----
-## Contribuer
-Voir [CONTRIBUTING.md](CONTRIBUTING.md) pour ajouter un moteur OCR, un adaptateur LLM, ou soumettre une pull request.
 ---
-## Licence
-Apache License 2.0
----
----
-# English
-## Picarones — OCR/HTR Benchmark Platform for Heritage Documents
-**Picarones** is an open-source platform for rigorously comparing OCR and HTR engines (Tesseract,
-Pero OCR, Kraken, cloud APIs…) and OCR+LLM pipelines on historical document corpora — manuscripts,
-early printed books, archives.
-### Key Features
-- **Metrics tailored to historical documents**: CER (raw, NFC, caseless, diplomatic), WER, MER,
-  WIL; unicode confusion matrix; ligature and diacritic scores; 10-class error taxonomy; bootstrap
-  confidence intervals; Wilcoxon significance tests
-- **OCR+LLM pipelines**: composable chains (`tesseract → gpt-4o`), three modes (text-only,
-  image+text, zero-shot), LLM over-normalisation detection
-- **Corpus import**: local folder, IIIF (Gallica, Bodleian, BL…), Gallica API + OCR, HuggingFace
-  Datasets, HTR-United, eScriptorium
-- **Interactive HTML report**: self-contained file, sortable ranking, gallery, coloured diff,
-  unicode character view, CSV/JSON/ALTO/PAGE XML export
-- **Longitudinal tracking**: SQLite benchmark history, CER evolution curves, automatic regression
-  detection
-- **Robustness analysis**: degraded image versions (noise, blur, rotation, resolution,
-  binarisation), critical threshold detection
-### Quick Start
-```bash
-pip install -e .
-sudo apt install tesseract-ocr tesseract-ocr-fra   # Ubuntu/Debian
-picarones demo          # demo report without any engine installed
-picarones engines       # list available engines
-picarones run --corpus ./corpus/ --engines tesseract --output results.json
-picarones report --results results.json
-```
-See [INSTALL.md](INSTALL.md) for detailed installation on Linux, macOS, Windows, and Docker.
-### Supported Engines
-Tesseract 5 · Pero OCR · Kraken · Mistral OCR · GPT-4o · Claude Sonnet · Mistral Large ·
-Google Vision · AWS Textract · Azure Document Intelligence · Ollama (local LLMs) · Custom YAML engine
-### License
-Apache License 2.0

 # Picarones
+> **OCR/HTR Benchmarking Platform for Heritage Documents**
 [![CI](https://github.com/maribakulj/Picarones/actions/workflows/ci.yml/badge.svg)](https://github.com/maribakulj/Picarones/actions/workflows/ci.yml)
 [![Python 3.11+](https://img.shields.io/badge/python-3.11+-blue.svg)](https://www.python.org/downloads/)
 [![License: Apache 2.0](https://img.shields.io/badge/License-Apache%202.0-green.svg)](LICENSE)
+[![HuggingFace Space](https://img.shields.io/badge/%F0%9F%A4%97-HuggingFace%20Space-yellow.svg)](https://huggingface.co/spaces/Ma-Ri-Ba-Ku/Picarones)
 ---
+**Picarones** is an open-source platform for rigorously comparing OCR and HTR engines
+(Tesseract, Pero OCR, Kraken, cloud APIs...) and OCR+LLM pipelines on historical document
+corpora -- manuscripts, early printed books, and archives.
+It provides heritage-specific metrics (diplomatic CER, ligature scores, medieval abbreviation
+handling), composable OCR+LLM pipelines, interactive HTML reports, and multiple corpus import
+sources including IIIF, Gallica, HuggingFace, and eScriptorium.
 ---
+## Table of Contents
+- [Features](#features)
+  - [Heritage-Specific Metrics](#heritage-specific-metrics)
+  - [OCR+LLM Pipelines](#ocr-llm-pipelines)
+  - [Corpus Import](#corpus-import)
+  - [Interactive HTML Report](#interactive-html-report)
+  - [Longitudinal Tracking & Robustness](#longitudinal-tracking--robustness)
+  - [Web Interface](#web-interface)
+- [Quick Start](#quick-start)
+- [Installation](#installation)
+  - [From Source](#from-source)
+  - [Docker](#docker)
+  - [Optional Extras](#optional-extras)
+- [Usage](#usage)
+  - [CLI Commands](#cli-commands)
+  - [Web Interface](#web-interface-1)
+  - [Pipeline Modes](#pipeline-modes)
+- [Supported Engines](#supported-engines)
+- [Normalization Profiles](#normalization-profiles)
+- [Error Taxonomy](#error-taxonomy)
+- [Project Structure](#project-structure)
+- [Environment Variables](#environment-variables)
+- [CI/CD](#cicd)
+- [Development](#development)
 - [Roadmap](#roadmap)
+- [Contributing](#contributing)
+- [License](#license)
 ---
+## Features
+### Heritage-Specific Metrics
+- **CER** (Character Error Rate) in four variants: raw, NFC-normalized, caseless, and
+  **diplomatic** (historical equivalences: long s = s, u = v, i = j, etc.)
+- **WER**, **MER**, **WIL** with history-aware tokenization (via [jiwer](https://github.com/jitsi/jiwer))
+- **Unicode confusion matrix** -- fingerprint each engine's character-level errors
+- **Ligature and diacritic scores** -- track handling of fi, fl, ff, œ, æ, p-bar, and other
+  medieval glyphs
+- **10-class error taxonomy** -- automatic classification of every error (visual confusion,
+  abbreviation, segmentation, lacuna, over-normalization, etc.)
+- **Bootstrap 95% confidence intervals** and **Wilcoxon signed-rank tests** for statistical
+  significance
+- **Intrinsic difficulty score** per document, independent of engine performance
+- **Line-level error distribution** with Gini coefficient and percentile analysis
+- **VLM hallucination detection** -- anchor score and length ratio to flag fabricated output
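The bootstrap confidence interval mentioned in the list above can be illustrated with a small standard-library sketch. It is not Picarones' implementation: the function name `bootstrap_ci`, its parameters, and the sample CER values are all hypothetical, and it uses the plain percentile method.

```python
import random
from statistics import mean


def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`.

    Hypothetical helper for illustration only.
    """
    if not values:
        raise ValueError("values must be non-empty")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(mean(rng.choices(values, k=n)) for _ in range(n_resamples))
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = int((1 - alpha / 2) * n_resamples) - 1
    return means[lo_idx], means[hi_idx]


# Made-up per-document CER values for one engine.
per_document_cer = [0.04, 0.07, 0.12, 0.05, 0.09, 0.21, 0.03, 0.08]
low, high = bootstrap_ci(per_document_cer)
print(f"mean CER = {mean(per_document_cer):.3f}, 95% CI = [{low:.3f}, {high:.3f}]")
```

Two engines whose intervals overlap heavily are hard to rank on that corpus, which is where a paired test such as Wilcoxon signed-rank becomes useful.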
+### OCR+LLM Pipelines
+- Composable chains: `tesseract -> gpt-4o`, `pero_ocr -> claude-sonnet`, zero-shot VLM, etc.
+- Three pipeline modes: text-only post-correction, image+text post-correction, and zero-shot
+- **Over-normalization detection** -- does the LLM silently modernize historical spellings?
+- Versioned prompt library for medieval French, early modern French, medieval Latin, medieval
+  English, and early modern English -- both correction and zero-shot variants
+### Corpus Import
+| Source | Method |
+|--------|--------|
+| Local folder | `picarones run --corpus ./corpus/` |
+| IIIF manifests (Gallica, Bodleian, BL...) | `picarones import iiif <manifest-url>` |
+| Gallica API (SRU + OCR) | `GallicaClient` / `picarones import iiif` |
+| HuggingFace Datasets | `picarones import hf <dataset-id>` |
+| HTR-United catalogue | `picarones import htr-united` |
+| eScriptorium | `EScriptoriumClient` |
+| ZIP upload (browser) | Web interface upload endpoint |
+Supported corpus formats: plain text pairs (image + ground truth), **ALTO XML**, and **PAGE XML**.
+### Interactive HTML Report
+- **Self-contained HTML file** -- works offline, no server needed
+- Sortable ranking table, radar charts, histograms (powered by Chart.js)
+- Gallery view with dynamic filters and color-coded CER badges
+- GitHub-style colored diff with synchronized N-way scrolling
+- Triple diff view for OCR+LLM: ground truth / raw OCR / post-correction
+- Unicode character view: interactive confusion matrix explorer
+- Export to **CSV**, **JSON**, **ALTO XML**, **PAGE XML**, and annotated images
+### Longitudinal Tracking & Robustness
+- Optional **SQLite database** to record benchmark history across runs
+- **CER evolution curves** over time, per engine
+- **Automatic regression detection** between consecutive runs
+- **Robustness analysis**: measure engine resilience to noise, blur, rotation, resolution
+  reduction, and binarization
+- Critical degradation threshold identification
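A minimal sketch of how regression detection between consecutive runs might work on top of SQLite. The `runs` table, its columns, the `tolerance` parameter, and the sample rows are assumptions made for illustration; the actual schema in `picarones/core/history.py` is not shown in this README.

```python
import sqlite3


def record_run(conn, engine, run_date, cer):
    """Store one benchmark run (hypothetical schema)."""
    conn.execute(
        "INSERT INTO runs (engine, run_date, cer) VALUES (?, ?, ?)",
        (engine, run_date, cer),
    )


def detect_regressions(conn, engine, tolerance=0.01):
    """Return (previous_date, current_date, delta) for every pair of
    consecutive runs whose CER rose by more than `tolerance`."""
    rows = conn.execute(
        "SELECT run_date, cer FROM runs WHERE engine = ? ORDER BY run_date",
        (engine,),
    ).fetchall()
    regressions = []
    for (prev_date, prev_cer), (cur_date, cur_cer) in zip(rows, rows[1:]):
        delta = cur_cer - prev_cer
        if delta > tolerance:
            regressions.append((prev_date, cur_date, round(delta, 4)))
    return regressions


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE runs (engine TEXT, run_date TEXT, cer REAL)")
# Made-up history: the third run is worse than the second.
for date, cer in [("2024-01-10", 0.080), ("2024-02-10", 0.075), ("2024-03-10", 0.110)]:
    record_run(conn, "tesseract", date, cer)
print(detect_regressions(conn, "tesseract"))
```

The tolerance keeps small run-to-run fluctuations from being reported as regressions.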
+### Web Interface
+- **FastAPI** application with real-time **Server-Sent Events** (SSE) progress streaming
+- Upload corpus as a **ZIP file** directly from the browser
+- Dynamic engine and normalization profile selectors
+- Browse and re-download generated HTML reports
+- Bilingual **French/English** interface
+- Deployable on HuggingFace Spaces (Docker, port 7860)
 ---
+## Quick Start
 ```bash
+# Clone and install
 git clone https://github.com/maribakulj/Picarones.git
+cd Picarones
 pip install -e .
+# Install Tesseract (system binary, required for the Tesseract engine)
 # Ubuntu/Debian
 sudo apt install tesseract-ocr tesseract-ocr-fra tesseract-ocr-lat
 # macOS
 brew install tesseract
+# Generate a demo report (no OCR engine needed)
+picarones demo --output demo_report.html
+# List available engines
 picarones engines
+# Run a benchmark
+picarones run --corpus ./corpus/ --engines tesseract --output results.json
+# Generate HTML report
+picarones report --results results.json --output report.html
+# Launch the web interface
+picarones serve --port 8080
+```
 ---
+## Installation
+### From Source
+```bash
+git clone https://github.com/maribakulj/Picarones.git
+cd Picarones
+pip install -e ".[dev,web]"    # includes test and web dependencies
+```
+**System requirements:**
+- Python >= 3.11
+- [Tesseract OCR 5](https://github.com/tesseract-ocr/tesseract) (for the Tesseract engine)
+### Docker
 ```bash
+docker build -t picarones .
+docker run -p 7860:7860 \
+  -e MISTRAL_API_KEY=... \
+  -e OPENAI_API_KEY=... \
+  picarones
+```
+The Docker image is based on Python 3.11-slim, includes Tesseract 5 with language packs
+(fra, lat, eng, deu, ita, spa), and runs as a non-root user. A health check polls
+`/health` every 30 seconds.
+The [HuggingFace Space](https://huggingface.co/spaces/Ma-Ri-Ba-Ku/Picarones) uses this
+same Docker image.
+### Optional Extras
+| Extra | Install command | What it adds |
+|-------|----------------|--------------|
+| `dev` | `pip install -e ".[dev]"` | pytest, pytest-cov, httpx, linting |
+| `web` | `pip install -e ".[web]"` | FastAPI, uvicorn, python-multipart |
+| `llm` | `pip install -e ".[llm]"` | OpenAI, Anthropic, Mistral SDKs |
+| `hf` | `pip install -e ".[hf]"` | HuggingFace Datasets |
+| `pero` | `pip install -e ".[pero]"` | Pero OCR engine |
+| `kraken` | `pip install -e ".[kraken]"` | Kraken engine |
+| `ocr-cloud` | `pip install -e ".[ocr-cloud]"` | Google Vision, AWS Textract, Azure Doc Intelligence |
+| `all` | `pip install -e ".[all]"` | Everything except ocr-cloud |
+See [INSTALL.md](INSTALL.md) for detailed instructions on Linux, macOS, Windows, and Docker.
+---
+## Usage
+### CLI Commands
+| Command | Description |
+|---------|-------------|
+| `picarones run` | Run a full benchmark on a corpus |
+| `picarones report` | Generate an HTML report from JSON results |
+| `picarones demo` | Generate a demo report with synthetic data (no engine required) |
+| `picarones metrics` | Calculate CER/WER between two text files |
+| `picarones engines` | List all available OCR engines and LLM adapters |
+| `picarones info` | Display version and system information |
+| `picarones serve` | Launch the FastAPI web interface |
+| `picarones history` | Query longitudinal benchmark history (SQLite) |
+| `picarones robustness` | Run robustness analysis with degraded images |
+| `picarones import` | Import corpus from IIIF, HuggingFace, or HTR-United |
+**Examples:**
+```bash
+# Benchmark with Tesseract, French language, PSM 6
+picarones run --corpus ./manuscripts/ --engines tesseract --lang fra --psm 6 \
+  --output results.json --verbose
+# Compare two text files
+picarones metrics --reference ground_truth.txt --hypothesis ocr_output.txt
+# Import 10 pages from a Gallica IIIF manifest
 picarones import iiif https://gallica.bnf.fr/ark:/12148/xxx/manifest.json --pages 1-10
+# Import a HuggingFace dataset
+picarones import hf medieval-ocr/dataset-name
+# View benchmark history with regression detection
 picarones history --engine tesseract --regression
+# Robustness demo (noise, blur, rotation, resolution)
 picarones robustness --corpus ./gt/ --engine tesseract --demo
+# Fail CI if CER exceeds threshold
+picarones run --corpus ./corpus/ --engines tesseract --fail-if-cer-above 0.15
 ```
+### Web Interface
+```bash
+picarones serve --host 0.0.0.0 --port 8080
+```
+**API endpoints include:**
+| Endpoint | Method | Description |
+|----------|--------|-------------|
+| `/` | GET | Main single-page application |
+| `/api/status` | GET | Version and application status |
+| `/api/engines` | GET | Available OCR/LLM engines |
+| `/api/normalization/profiles` | GET | Normalization profiles (read dynamically) |
+| `/api/benchmark/start` | POST | Start a benchmark job (returns `job_id`) |
+| `/api/benchmark/{job_id}/stream` | GET | SSE real-time progress stream |
+| `/api/benchmark/{job_id}/cancel` | POST | Cancel a running benchmark |
+| `/api/corpus/browse` | GET | Browse server-side corpus folders |
+| `/api/htr-united/catalogue` | GET | Browse HTR-United catalogue |
+| `/api/huggingface/search` | GET | Search HuggingFace datasets |
+| `/reports/{filename}` | GET | Download generated HTML reports |
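A client consuming the `/api/benchmark/{job_id}/stream` endpoint has to decode the Server-Sent Events format. The sketch below parses a stream line by line. It assumes each event carries a JSON object in its `data:` field; the `progress` and `message` keys in the sample are hypothetical, since the README does not document the payload.

```python
import json


def iter_sse_events(lines):
    """Yield the decoded JSON payload of each event in a text/event-stream.

    `lines` is any iterable of text lines, for example a streamed HTTP body.
    """
    data_parts = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        if line == "":
            # A blank line terminates the current event.
            if data_parts:
                yield json.loads("\n".join(data_parts))
                data_parts = []
            continue
        field, _, value = line.partition(":")
        if field == "data":
            # The SSE format strips one optional leading space from the value.
            data_parts.append(value[1:] if value.startswith(" ") else value)


# Hypothetical stream contents, standing in for a live connection.
sample_stream = [
    'data: {"progress": 0.5, "message": "page 5/10"}\n',
    "\n",
    ": keep-alive\n",
    'data: {"progress": 1.0, "message": "done"}\n',
    "\n",
]
events = list(iter_sse_events(sample_stream))
for event in events:
    print(event)
```

In practice the lines would come from an HTTP client that streams the response body instead of the in-memory list used here.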
+### Pipeline Modes
+Picarones supports three modes for OCR+LLM pipelines:
+| Mode | Description | Model type |
+|------|-------------|------------|
+| `zero_shot` | LLM receives the image directly and transcribes without prior OCR | VLM (vision) |
+| `post_correction_texte` | OCR produces raw text, then LLM corrects it | Text-only LLM |
+| `post_correction_image_texte` | OCR produces raw text, then LLM receives both image and text for correction | VLM (vision) |
+**Example:** `ministral-3b-latest` is a text-only model and should use `post_correction_texte`.
+GPT-4o and Claude support all three modes.
 ---
+## Supported Engines
+| Engine | Type | Execution Mode | Installation |
+|--------|------|---------------|-------------|
+| **Tesseract 5** | Local CLI | CPU (ProcessPool) | `pip install pytesseract` + system binary |
+| **Pero OCR** | Local Python | CPU (ProcessPool) | `pip install pero-ocr` |
+| **Kraken** | Local Python | CPU (ProcessPool) | `pip install kraken` |
+| **Mistral OCR** | Cloud API | IO (ThreadPool) | `MISTRAL_API_KEY` env var |
+| **Google Vision** | Cloud API | IO (ThreadPool) | `GOOGLE_APPLICATION_CREDENTIALS` env var |
+| **Azure Doc Intelligence** | Cloud API | IO (ThreadPool) | `AZURE_DOC_INTEL_ENDPOINT` + `AZURE_DOC_INTEL_KEY` |
+| **GPT-4o** (VLM) | LLM API | IO (ThreadPool) | `OPENAI_API_KEY` env var |
+| **Claude Sonnet** (VLM) | LLM API | IO (ThreadPool) | `ANTHROPIC_API_KEY` env var |
+| **Mistral Large** (LLM) | LLM API | IO (ThreadPool) | `MISTRAL_API_KEY` env var |
+| **Ollama** (local LLM) | Local LLM | IO (ThreadPool) | `ollama serve` running locally |
+| **Custom engine** | CLI or API | Configurable | YAML declaration, no code required |
+Engines declare their `execution_mode` (`"io"` or `"cpu"`), allowing the runner to use
+`ThreadPoolExecutor` for IO-bound engines and `ProcessPoolExecutor` for CPU-bound engines
+simultaneously.
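The dispatch described above can be sketched with `concurrent.futures`. This is an illustration under assumptions, not the code in `picarones/core/runner.py`: the `Engine` dataclass, `run_all`, and the mock engine are hypothetical names.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
from contextlib import ExitStack
from dataclasses import dataclass
from typing import Callable

# Executor class per declared execution mode.
EXECUTORS = {"io": ThreadPoolExecutor, "cpu": ProcessPoolExecutor}


@dataclass
class Engine:
    """Hypothetical stand-in for an engine adapter."""
    name: str
    execution_mode: str  # "io" or "cpu"
    transcribe: Callable[[str], str]


def run_all(engines, image_paths, max_workers=4):
    """Submit every (engine, page) job to the pool matching the engine's mode.

    Both pools stay open until all jobs finish, so IO-bound and CPU-bound
    engines make progress at the same time.
    """
    with ExitStack() as stack:
        pools, futures = {}, {}
        for engine in engines:
            mode = engine.execution_mode
            if mode not in EXECUTORS:
                raise ValueError(f"unknown execution_mode: {mode!r}")
            if mode not in pools:  # create each pool lazily, once
                pools[mode] = stack.enter_context(EXECUTORS[mode](max_workers=max_workers))
            for path in image_paths:
                futures[(engine.name, path)] = pools[mode].submit(engine.transcribe, path)
        return {key: future.result() for key, future in futures.items()}


def fake_cloud_ocr(path):
    """Mock IO-bound engine; a real adapter would call a remote API."""
    return f"transcription of {path}"


demo = run_all([Engine("mock-cloud", "io", fake_cloud_ocr)], ["p1.png", "p2.png"])
print(demo)
```

For `"cpu"` engines the `transcribe` callable must be picklable (a module-level function, for instance), because `ProcessPoolExecutor` sends it to worker processes.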
 ---
+## Normalization Profiles
+Picarones includes eight built-in diplomatic normalization profiles designed for historical
+text comparison. These reduce noise from expected orthographic variation so metrics reflect
+genuine OCR errors, not historical spelling differences.
+| Profile | Period | Key equivalences |
+|---------|--------|-----------------|
+| `nfc` | Any | Unicode NFC normalization only |
+| `minimal` | Any | NFC + long s (ſ -> s) |
+| `medieval_french` | 12th-15th c. | ſ=s, u=v, i=j, y=i, æ=ae, œ=oe, p-bar=per, etc. |
+| `early_modern_french` | 16th-18th c. | ſ=s, æ=ae, œ=oe, y-tilde=yn, &=et |
+| `medieval_latin` | 12th-15th c. | ſ=s, u=v, i=j, æ=ae, œ=oe, p-bar=per, q-bar=que |
+| `medieval_english` | 12th-15th c. | ſ=s, u=v, i=j, thorn=th, eth=th, yogh=y, p-bar=per |
+| `early_modern_english` | 16th-18th c. | ſ=s, u=v, i=j, vv=w, thorn=th, eth=th, yogh=y |
+| `secretary_hand` | 16th-17th c. | Early modern English + secretary hand visual confusions |
+Custom profiles can be loaded from YAML files with user-defined diplomatic tables.
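To show why a diplomatic profile changes the score, here is a self-contained sketch that computes CER with and without a small equivalence table. The table is an illustrative subset, not the real `medieval_french` profile, and the edit distance is written in pure Python rather than via jiwer. The names `ILLUSTRATIVE_TABLE`, `normalize`, and `cer` are hypothetical.

```python
import unicodedata

# Illustrative subset of diplomatic equivalences (assumption, not the shipped profile).
ILLUSTRATIVE_TABLE = {"ſ": "s", "v": "u", "j": "i", "æ": "ae", "œ": "oe"}


def normalize(text, table):
    """Apply NFC, then map each character through the equivalence table."""
    text = unicodedata.normalize("NFC", text)
    return "".join(table.get(ch, ch) for ch in text)


def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def cer(reference, hypothesis, table=None):
    """Character error rate; both sides are normalized first if a table is given."""
    if table is not None:
        reference = normalize(reference, table)
        hypothesis = normalize(hypothesis, table)
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return levenshtein(reference, hypothesis) / len(reference)


ground_truth = "ſi vous auez"
ocr_output = "si uous avez"
raw_cer = cer(ground_truth, ocr_output)
diplomatic_cer = cer(ground_truth, ocr_output, ILLUSTRATIVE_TABLE)
print(raw_cer, diplomatic_cer)
```

The three raw mismatches (ſ/s, v/u, u/v) are all expected orthographic variation, so the diplomatic score drops to zero while the raw score does not.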
+---
+## Error Taxonomy
+Every character-level error is automatically classified into one of 10 categories:
+| Class | Name | Description |
+|-------|------|-------------|
+| 1 | `visual_confusion` | Morphologically similar characters (rn/m, l/1, O/0, u/n) |
+| 2 | `diacritic_error` | Missing, incorrect, or spurious diacritical mark |
+| 3 | `case_error` | Case difference only (A/a) |
+| 4 | `ligature_error` | Ligature not resolved or incorrectly resolved |
+| 5 | `abbreviation_error` | Medieval abbreviation not expanded |
+| 6 | `hapax` | Word not found in any reference lexicon |
+| 7 | `segmentation_error` | Token fusion or fragmentation (words/lines) |
+| 8 | `oov_character` | Character outside the engine's vocabulary |
+| 9 | `lacuna` | Text present in ground truth but absent from OCR output |
+| 10 | `over_normalization` | LLM silently modernized a historical spelling |
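As a rough illustration of how a few of these classes could be told apart, the sketch below classifies a single aligned substitution. It covers only four classes, relies on the confusion pairs listed in the table, and returns a placeholder label for everything else. The function and its rule ordering are assumptions, not the logic in `picarones/core/taxonomy.py`.

```python
import unicodedata

# Visually confusable pairs taken from the table above.
VISUAL_PAIRS = {frozenset(p) for p in [("rn", "m"), ("l", "1"), ("O", "0"), ("u", "n")]}


def strip_diacritics(text):
    """Remove combining marks after canonical decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def classify_substitution(expected, observed):
    """Classify one ground-truth/OCR segment pair (hypothetical helper)."""
    if expected == observed:
        return None  # not an error
    if observed == "":
        return "lacuna"
    if expected.lower() == observed.lower():
        return "case_error"
    if strip_diacritics(expected) == strip_diacritics(observed):
        return "diacritic_error"
    if frozenset((expected, observed)) in VISUAL_PAIRS:
        return "visual_confusion"
    return "unclassified"  # placeholder; the other classes need lexicons or context


for pair in [("A", "a"), ("é", "e"), ("m", "rn"), ("mot", ""), ("x", "y")]:
    print(pair, classify_substitution(*pair))
```

Classes such as `hapax` or `over_normalization` need lexicons or pipeline context, so they cannot be decided from a single pair like this.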
+---
+## Project Structure
 ```
 picarones/
+├── __init__.py                 # Version (1.0.0), package metadata
+├── cli.py                      # Click CLI: run, demo, report, metrics, engines, info,
+│                               #   serve, import, history, robustness
+├── fixtures.py                 # Realistic synthetic test data (medieval documents)
+│
 ├── core/
+│   ├── corpus.py               # Corpus loading (folder, ALTO XML, PAGE XML)
+│   ├── metrics.py              # CER, WER, MER, WIL (via jiwer)
+│   ├── normalization.py        # Unicode normalization, 8 diplomatic profiles
+│   ├── statistics.py           # Bootstrap CI 95%, Wilcoxon test, correlations
+│   ├── runner.py               # Benchmark orchestrator (ThreadPool + ProcessPool)
+│   ├── results.py              # DocumentResult, BenchmarkResults, JSON export
+│   ├── confusion.py            # Unicode confusion matrix
+│   ├── char_scores.py          # Ligature and diacritic scores
+│   ├── taxonomy.py             # 10-class error taxonomy
+│   ├── structure.py            # Structural analysis (blocks, lines, words)
+│   ├── image_quality.py        # Image quality metrics (contrast, noise, resolution)
+│   ├── difficulty.py           # Intrinsic difficulty score per document
+│   ├── hallucination.py        # VLM hallucination detection
+│   ├── line_metrics.py         # Line-level error distribution (Gini, percentiles)
+│   ├── history.py              # SQLite longitudinal tracking
+│   └── robustness.py           # Robustness analysis (noise, blur, rotation, resolution)
+│
+├── engines/
+│   ├── base.py                 # BaseOCREngine (execution_mode: "io" | "cpu")
+│   ├── tesseract.py            # Tesseract 5 adapter (CPU)
+│   ├── pero_ocr.py             # Pero OCR adapter (CPU)
+│   ├── mistral_ocr.py          # Mistral OCR API (/v1/ocr endpoint)
+│   ├── google_vision.py        # Google Cloud Vision adapter
+│   └── azure_doc_intel.py      # Azure Document Intelligence adapter
+│
+├── llm/
+│   ├── base.py                 # BaseLLMAdapter interface
+│   ├── openai_adapter.py       # OpenAI / GPT-4o adapter
+│   ├── anthropic_adapter.py    # Anthropic / Claude adapter
+│   ├── mistral_adapter.py      # Mistral chat completions adapter
+│   └── ollama_adapter.py       # Ollama local LLM adapter
+│
+├── pipelines/
+│   ├── base.py                 # OCRLLMPipeline orchestrator
+│   └── over_normalization.py   # Over-normalization detection
+│
+├── prompts/                    # Versioned prompt templates (FR + EN)
+│   ├── correction_medieval_french.txt
+│   ├── zero_shot_medieval_french.txt
+│   ├── correction_imprime_ancien.txt
+│   ├── zero_shot_imprime_ancien.txt
+│   ├── correction_medieval_english.txt
+│   ├── zero_shot_medieval_english.txt
+│   ├── correction_early_modern_english.txt
+│   └── correction_image_medieval_french.txt
+│
+├── report/
+│   ├── generator.py            # Self-contained HTML report (Chart.js + diff2html)
+│   └── diff_utils.py           # Diff computation utilities
+│
+├── web/
+│   ├── app.py                  # FastAPI app (SSE, ZIP upload, dynamic endpoints)
+│   └── static/                 # CSS assets
+│
+└── importers/
+    ├── iiif.py                 # IIIF manifest importer
+    ├── gallica.py              # Gallica API client (BnF)
+    ├── htr_united.py           # HTR-United catalogue importer
+    ├── huggingface.py          # HuggingFace Datasets importer
+    └── escriptorium.py         # eScriptorium client
+tests/                          # ~1020 unit and integration tests
+.github/workflows/
+├── ci.yml                      # CI: Python 3.11/3.12, Linux/macOS/Windows
+└── sync_to_huggingface.yml     # Auto-sync to HuggingFace Space on push to main
+Dockerfile                      # Multi-stage Docker build for HuggingFace Spaces
 ```
 ---
+## Environment Variables
+Configure API keys depending on which engines and LLM adapters you use:
 ```bash
+# LLM APIs
 export OPENAI_API_KEY="sk-..."
 export ANTHROPIC_API_KEY="sk-ant-..."
 export MISTRAL_API_KEY="..."
+# Cloud OCR APIs (optional)
 export GOOGLE_APPLICATION_CREDENTIALS="/path/to/credentials.json"
 export AWS_ACCESS_KEY_ID="..."
 export AWS_SECRET_ACCESS_KEY="..."
 export AZURE_DOC_INTEL_KEY="..."
 ```
+For deployment on HuggingFace Spaces, set these in **Settings > Variables and secrets**.
 ---
+## CI/CD
+### GitHub Actions (`ci.yml`)
+- **Triggers:** push to `main`/`develop`/`feature/*`/`sprint/*`/`claude/*`, PRs to
+  `main`/`develop`, manual dispatch
+- **Matrix:** Python 3.11 + 3.12 on Linux, macOS, and Windows
+- **Jobs:**
+  1. **Tests** -- full pytest suite with coverage, uploaded to Codecov
+  2. **Demo** -- end-to-end demo report generation with history and robustness
+  3. **Build** -- wheel and sdist with twine validation
+  4. **Lint** -- ruff check for errors (E, W, F; ignores E501, E402)
+### HuggingFace Sync (`sync_to_huggingface.yml`)
+- Automatically pushes `main` to the HuggingFace Space `Ma-Ri-Ba-Ku/Picarones`
+- Requires the `HF_TOKEN` secret in GitHub repository settings
 ---
+## Development
+```bash
+# Install with dev + web dependencies
+pip install -e ".[dev,web]"
+# Run the test suite
+pytest tests/ -q --tb=short
+# Run with coverage
+pytest tests/ --cov=picarones --cov-report=term-missing
+# Generate a demo report
+picarones demo --output demo_report.html
+# Launch the web UI in development mode
+picarones serve --port 8080
+# Full refresh (useful in Codespaces)
+git pull && pip install -e ".[dev,web]" && picarones demo --output demo.html
+```
+**Key development conventions:**
+- Never use bare `except Exception: pass` -- always log with `logger.warning()`
+- Normalization profiles are read dynamically from `picarones/core/normalization.py` --
+  never hardcode them in endpoint handlers
+- Engines declare their `execution_mode` (`"io"` or `"cpu"`) so the runner can select the
+  appropriate executor
+- `python-multipart` must remain in dependencies (FastAPI checks at import time)
+---
+## Roadmap
+| Sprint | Status | Deliverables |
+|--------|--------|-------------|
+| 1 | Done | Project structure, Tesseract, Pero OCR, CER/WER, CLI |
+| 2 | Done | HTML report v1: Chart.js, colored diff, gallery |
+| 3 | Done | OCR+LLM pipelines, GPT-4o, Claude, Mistral, Ollama |
+| 4 | Done | Cloud OCR APIs, IIIF import, diplomatic normalization |
+| 5 | Done | Advanced metrics: confusion matrix, ligatures, 10-class taxonomy |
+| 6 | Done | FastAPI web interface, HTR-United, HuggingFace, bilingual UI |
+| 7 | Done | HTML report v2: Wilcoxon, bootstrap, clustering, difficulty score |
+| 8 | Done | eScriptorium, Gallica API, SQLite history, robustness analysis |
+| 9 | Done | Documentation, packaging, Docker, CI/CD, PyInstaller, v1.0.0-Beta |
+| 10 | Done | Line error distribution (Gini), VLM hallucination detection |
+| 11 | Done | Internationalization FR/EN, English normalization profiles |
+| 12 | Done | Browser ZIP upload, macOS file filtering, dynamic model selector |
+| 13 | Done | pyproject.toml cleanup, runner parallelization, NDJSON streaming, Wilcoxon validation |
+---
+## Contributing
+See [CONTRIBUTING.md](CONTRIBUTING.md) for instructions on adding an OCR engine, an LLM
+adapter, or submitting a pull request.
+---
+## License
+[Apache License 2.0](LICENSE)
+Copyright 2024 Picarones contributors.