--- title: Picarones emoji: 📜 colorFrom: blue colorTo: purple sdk: docker pinned: false --- # Picarones > **Heritage OCR / HTR / VLM and post-correction benchmarking platform** > > **Banc d'essai d'OCR / HTR / VLM et de post-correction pour documents patrimoniaux** [![CI](https://github.com/maribakulj/Picarones/actions/workflows/ci.yml/badge.svg)](https://github.com/maribakulj/Picarones/actions/workflows/ci.yml) [![Python 3.11+](https://img.shields.io/badge/python-3.11+-blue.svg)](https://www.python.org/downloads/) [![License: Apache 2.0](https://img.shields.io/badge/License-Apache%202.0-green.svg)](LICENSE) [![Code style: ruff](https://img.shields.io/badge/lint-ruff-46aef7.svg)](https://github.com/astral-sh/ruff) [![HuggingFace Space](https://img.shields.io/badge/%F0%9F%A4%97-HuggingFace%20Space-yellow.svg)](https://huggingface.co/spaces/Ma-Ri-Ba-Ku/Picarones) --- ## What is Picarones? **Picarones** is an open-source benchmarking platform for OCR, HTR, VLM and post-correction pipelines on **heritage documents** (manuscripts, early printed books, archives). The input is a folder of `(image, ground truth)` pairs — ground truth in plain text, ALTO XML, or PAGE XML. Picarones runs the AIs you plug in (OCR engines, VLMs, OCR+LLM pipelines, ALTO mappers, ensembles…) on every page, compares each output to the ground truth at every relevant level (text, ALTO, PAGE, entities, reading order), and produces a **self-contained HTML report** with factual numbers, statistical tests and a reproducibility snapshot. **Without ground truth, no benchmark** — Picarones measures how well an AI matches a known reference, not how it transcribes an arbitrary document. > *Version française ci-dessous.* ### Use case A digital library plans to OCR a production corpus — say, several thousand 17th-century parish registers, 19th-century newspapers, or medieval glossed manuscripts. Several pipelines are on the table (alternative OCR, LLM correction, ALTO mappers, ensembles); the question is which one to deploy. The candidates cannot be benchmarked on the production corpus itself (no ground truth). A small **golden dataset** matching the target profile is assembled; Picarones runs each candidate on it and reports CER, recovered fuzzy searchability, preserved numerical sequences, errors introduced by post-correctors, and statistical significance. The numbers inform the deployment decision. ### En français **Picarones** est une plateforme open-source de banc d'essai pour des IA d'OCR, HTR, VLM et des pipelines de post-correction sur documents patrimoniaux. L'entrée est un dossier de paires `(image, vérité terrain)` — VT en texte brut, ALTO XML ou PAGE XML. Picarones exécute les IA que vous branchez sur chaque page, compare la sortie à la VT à tous les niveaux pertinents et produit un rapport HTML autonome avec chiffres factuels, tests statistiques et snapshot de reproductibilité. Sans vérité terrain, pas de benchmark. --- ## Features ### Heritage-specific metrics Three families of metrics calibrated for historical documents: - **Classical OCR/HTR** — CER (raw, NFC, caseless, **diplomatic**), WER, MER, WIL via [jiwer](https://github.com/jitsi/jiwer); 10-class error taxonomy; bootstrap 95% CIs; line-level Gini distribution. - **Philological** — MUFI coverage, abbreviation expansion (Capelli), early-modern typography (long-s, ligatures, tilde nasals), modern archives markers, Roman numerals, Unicode block accuracy, NER precision (HIPE), reading-order F1 (ICDAR 2015), layout F1. - **Comparison & decision** — Friedman + Nemenyi + **Critical Difference Diagram** (Demšar 2006); cross-engine taxonomic divergence + oracle complementarity; cost / speed / CO₂ Pareto front; multi-run stability (Cohen κ, Krippendorff α); longitudinal trend with change-point detection; controlled per-slot ANOVA-like comparison. For the full list with definitions, see [`docs/views.md`](docs/views.md) and the contextual glossary embedded in every report (25 bilingual entries). ### OCR+LLM pipelines Composable chains: `tesseract -> gpt-4o`, `pero_ocr -> claude-sonnet`, zero-shot VLM, etc. Three pipeline modes: text-only post-correction, image+text post-correction, and zero-shot. **Over-normalisation detection** flags LLMs that silently modernise historical spellings. A composed-pipeline benchmarking layer (Sprint 63+) runs N candidate pipelines on the same corpus and ranks them by a chosen metric. ### Corpus import | Source | Method | |--------|--------| | Local folder | `picarones run --corpus ./corpus/` | | IIIF manifests (any institutional repository) | `picarones import iiif ` | | Gallica API (BnF SRU + IIIF) | `GallicaClient` / `picarones import iiif` | | HuggingFace Datasets | Web UI: `POST /api/huggingface/import` | | HTR-United catalogue | Web UI: `POST /api/htr-united/import` | | eScriptorium | `EScriptoriumClient` | | ZIP upload (browser) | Web upload endpoint | Supported corpus formats: plain text pairs, ALTO XML, PAGE XML. ### Interactive HTML report A single self-contained HTML file (or with `--lazy-images` for large corpora). Five views: - **Ranking** — sortable table of all engines and metrics. - **Gallery** — color-coded CER badges per document. - **Document** — synchronized N-way diff, triple diff for OCR+LLM. - **Analyses** — distribution charts, Pareto, calibration, robustness projection, philological profile, longitudinal trends, levers. - **Characters** — Unicode confusion matrix, ligature analysis. Above the views: factual narrative synthesis (20+ deterministic detectors, every number traceable to the input — anti-hallucination proven by tests), Critical Difference Diagram, Pareto front. Side panels for contextual glossary and Advanced mode (visible columns, strata filters, opt-in personal composite score). ### Web interface FastAPI application with real-time SSE progress streaming, ZIP upload from the browser, dynamic engine and normalization profile selectors, browse and re-download generated reports, bilingual French/English UI. Deployable on HuggingFace Spaces (Docker, port 7860) **and** on institutional infrastructure (see [`docs/operations/deployment-institutional.md`](docs/operations/deployment-institutional.md)). ### Longitudinal tracking & robustness Optional SQLite database recording benchmark history across runs. CER evolution curves per engine, automatic regression detection between consecutive runs (Pettitt change-point analysis, Sprint 92). **Robustness analysis** measures engine resilience to noise, blur, rotation, resolution and binarization, projected on the real corpus quality profile (Sprint 81). --- ## Quick start ```bash # Install pip install -e ".[dev,web]" # Tesseract (system binary, required for the Tesseract engine) sudo apt install tesseract-ocr tesseract-ocr-fra tesseract-ocr-lat # Debian/Ubuntu brew install tesseract tesseract-lang # macOS # Generate a demo report (no engine needed) picarones demo --output demo_report.html # Run a benchmark picarones run --corpus ./corpus/ --engines tesseract --output results.json picarones report --results results.json --output report.html # Web UI picarones serve --port 8080 ``` For Docker, institutional deployment, or HuggingFace Spaces, see [`INSTALL.md`](INSTALL.md) and [`docs/operations/deployment-institutional.md`](docs/operations/deployment-institutional.md). --- ## Supported engines | Engine | Type | Installation | |--------|------|-------------| | **Azure Doc Intelligence** | Cloud API | `AZURE_DOC_INTEL_ENDPOINT` + `AZURE_DOC_INTEL_KEY` | | **Google Vision** | Cloud API | `GOOGLE_APPLICATION_CREDENTIALS` env var | | **Mistral OCR** | Cloud API | `MISTRAL_API_KEY` env var | | **Pero OCR** | Local Python | `pip install -e .[pero]` | | **Tesseract 5** | Local CLI | `pip install pytesseract` + system binary | LLM/VLM adapters (used through pipelines, not as standalone OCR engines): GPT-4o, Claude, Mistral Large, Ollama (local). See [`docs/cli-workflows.md`](docs/cli-workflows.md). The `Engine` table is regenerated automatically by `scripts/gen_readme_tables.py` — adding a new adapter under `picarones/engines/` makes the next CI run update this table or fail. --- ## CLI commands | Command | Description | |---------|-------------| | `picarones compare` | Compare two benchmark JSON runs and flag regressions (Sprint 28) | | `picarones demo` | Generate a demo report with synthetic data (no engine required) | | `picarones diagnose` | Pre-wired workflow: bench + improvement levers + factual recommendations | | `picarones economics` | Pre-wired workflow: bench + effective throughput + cost projection | | `picarones edition` | Pre-wired workflow: bench + philological metrics for critical editing | | `picarones engines` | List available OCR engines and LLM adapters | | `picarones history` | Query longitudinal benchmark history (SQLite) | | `picarones import` | Import a corpus from a remote source (IIIF, HF, HTR-United) | | `picarones info` | Display version and system information | | `picarones metrics` | Compute CER/WER between two text files | | `picarones pipeline` | Run / compare composed pipelines from a YAML spec (Sprint 70) | | `picarones report` | Generate an HTML report from JSON results | | `picarones robustness` | Run robustness analysis with degraded images | | `picarones run` | Run a full benchmark on a corpus | | `picarones serve` | Launch the FastAPI web interface | Each command supports `--help` for full options. See [`docs/cli-workflows.md`](docs/cli-workflows.md) for end-to-end examples. --- ## Web API endpoints The web app exposes a documented OpenAPI spec at `/docs` (Swagger UI) when running. Summary: | Method | Endpoint | Summary | |--------|----------|---------| | `GET` | `/` | Index | | `POST` | `/api/benchmark/run` | Api Benchmark Run | | `POST` | `/api/benchmark/start` | Api Benchmark Start | | `POST` | `/api/benchmark/{job_id}/cancel` | Api Benchmark Cancel | | `GET` | `/api/benchmark/{job_id}/status` | Api Benchmark Status | | `GET` | `/api/benchmark/{job_id}/stream` | Api Benchmark Stream | | `GET` | `/api/benchmark/{job_id}/synthesis_preview` | Api Benchmark Synthesis Preview | | `POST` | `/api/config/load` | Api Config Load | | `POST` | `/api/config/save` | Api Config Save | | `GET` | `/api/corpus/browse` | Api Corpus Browse | | `GET` | `/api/corpus/image/{upload_id}/{filename}` | Api Corpus Image | | `POST` | `/api/corpus/upload` | Api Corpus Upload | | `GET` | `/api/corpus/uploads` | Api Corpus Uploads | | `DELETE` | `/api/corpus/uploads/{corpus_id}` | Api Corpus Delete | | `GET` | `/api/csrf/token` | Api Csrf Token | | `GET` | `/api/engines` | Api Engines | | `GET` | `/api/history/regressions` | Api History Regressions | | `GET` | `/api/htr-united/catalogue` | Api Htr United Catalogue | | `POST` | `/api/htr-united/import` | Api Htr United Import | | `POST` | `/api/huggingface/import` | Api Huggingface Import | | `GET` | `/api/huggingface/search` | Api Huggingface Search | | `GET` | `/api/lang` | Api Get Lang | | `POST` | `/api/lang/{lang_code}` | Api Set Lang | | `GET` | `/api/models/{provider}` | Api Models | | `GET` | `/api/normalization/profiles` | Api Normalization Profiles | | `GET` | `/api/reports` | Api Reports | | `GET` | `/api/status` | Api Status | | `GET` | `/health` | Health | | `GET` | `/reports/{filename}` | Serve Report | The complete OpenAPI JSON is also exposed at `/openapi.json` for client generation. --- ## Normalization profiles Picarones ships **11 built-in normalization profiles** for historical text comparison (defined in [`picarones/measurements/normalization.py`](picarones/measurements/normalization.py), exposed via `/api/normalization/profiles`): `nfc`, `caseless`, `minimal`, `medieval_french`, `early_modern_french`, `medieval_latin`, `medieval_english`, `early_modern_english`, `secretary_hand`, `sans_ponctuation`, `sans_apostrophes`. Custom profiles can be loaded from YAML files with user-defined diplomatic tables and `exclude_chars` sets. See [`docs/profiles.md`](docs/profiles.md). A traceability table mapping each profile to its source standard (MUFI v4.0, TEI P5, DEAF) will ship in Sprint A12 (B-6). --- ## Project structure ``` picarones/ ├── core/ Cercle 1 — pure abstractions (7 modules) ├── measurements/ Cercle 2 — official metrics (~70 modules + narrative engine) ├── engines/ Cercle 2 — 5 OCR adapters ├── llm/ Cercle 2 — 4 LLM adapters ├── pipelines/ Cercle 2 — OCR+LLM pipelines ├── modules/ Cercle 2 — official BaseModule modules ├── extras/ Cercle 3 — plugins (importers, historical) ├── report/ Cercle 3 — HTML rendering ├── cli/ Cercle 3 — Click CLI (15 commands) ├── web/ Cercle 3 — FastAPI app + 11 routers ├── prompts/ 8 versioned prompt templates └── data/ Indicative tables (pricing.yaml) ``` Strict 3-circle architecture: imports flow only from outer to inner. Enforced by `tests/core/test_circle_dependencies.py` (Sprint A3). See [`docs/architecture.md`](docs/architecture.md) for the full manifesto. --- ## Environment variables See [`.env.example`](.env.example) for the complete list. Key variables: ```bash # Security & mode (cf. SECURITY.md) PICARONES_PUBLIC_MODE= # 1/true/yes for HF Space (no cloud OCR) PICARONES_CSRF_REQUIRED= # 1 for institutional deployment PICARONES_BROWSE_ROOTS= # restrict browse to specific paths # Cloud API keys (optional) MISTRAL_API_KEY= OPENAI_API_KEY= ANTHROPIC_API_KEY= GOOGLE_APPLICATION_CREDENTIALS= AZURE_DOC_INTEL_ENDPOINT= AZURE_DOC_INTEL_KEY= # RGPD retention (Sprint A11) PICARONES_UPLOAD_RETENTION_DAYS=7 ``` For HuggingFace Spaces, set these in **Settings → Variables and secrets**. --- ## CI/CD GitHub Actions: `.github/workflows/` - `ci.yml` — tests on Python 3.11/3.12/3.13 × Linux/macOS/Windows, ruff, mypy strict on core/, security scanners (bandit + pip-audit + trivy), coverage gate `--cov-fail-under=85`, pytest-timeout 300s. - `precommit.yml` — replays pre-commit hooks (catches `--no-verify` bypass). - `release.yml` — on tag `v*.*.*`: PyPI + ghcr.io multi-arch + GitHub Release with notes from CHANGELOG. - `perf_regression.yml` — weekly cron + PR-triggered: CER anti-regression on a synthetic reference corpus. - `sync_to_huggingface.yml` — auto-syncs `main` to the HF Space. --- ## Development ```bash pip install -e ".[dev,web]" pre-commit install pytest tests/ -q ruff check picarones/ tests/ python -m mypy picarones/core/ ``` **Test suite**: ~3865 tests, ~3 min on a modern laptop. Coverage floor at 85% (currently ~87%). The `network` marker excludes tests requiring live HTTP. For end-to-end developer guides, see [`docs/developer/index.md`](docs/developer/index.md) (FR) / [`docs/developer/index.en.md`](docs/developer/index.en.md) (EN). ### Conventions - Never `except Exception: pass` — use `logger.warning("[module] degraded feature: %s", e)`. - One canonical home per module — circle dependency direction enforced by tests. - Engines declare `execution_mode` (`"io"` or `"cpu"`) so the runner picks `ThreadPoolExecutor` vs `ProcessPoolExecutor` appropriately. - Hardcoded UI strings forbidden — always go through i18n (cf. [`docs/developer/extending-i18n.md`](docs/developer/extending-i18n.md)). --- ## Roadmap Detailed history and current direction live in: - [`CHANGELOG.md`](CHANGELOG.md) — Keep a Changelog format, one entry per sprint up to the latest release. - [`docs/roadmap/evolution-2026.md`](docs/roadmap/evolution-2026.md) — technical evolution roadmap (axes A and B for 2026+). - [`docs/audits/`](docs/audits/) — institutional readiness audit and remediation plan (sprints A1–A15). The **Phase 1 of the institutional readiness plan** (sprints A1–A11) is complete as of May 2026: CI hardening, doc consistency gates, 3-circle refactor, web hardening, perf+concurrency tests, WCAG 2.1 AA accessibility, reproducibility ops (lock files, Docker pinning), PyPI/ghcr.io release pipeline, governance & COI policies, institutional deployment guide & RGPD documentation. Remaining: scientific publication track (CITATION + JOSS, sprint A12), README/SPECS final polish (this sprint and A14), external audits (RGAA + security pentest, A15). --- ## Documentation | Audience | Entry point | |----------|-------------| | **End user** | [`docs/user/reading-a-report.md`](docs/user/reading-a-report.md) ([EN](docs/user/reading-a-report.en.md)) | | **Developer** | [`docs/developer/index.md`](docs/developer/index.md) ([EN](docs/developer/index.en.md)) | | **Operations / DSI** | [`docs/operations/deployment-institutional.md`](docs/operations/deployment-institutional.md), [`docs/operations/data-retention-rgpd.md`](docs/operations/data-retention-rgpd.md), [`docs/operations/release-process.md`](docs/operations/release-process.md) | | **Architect** | [`docs/architecture.md`](docs/architecture.md), [`docs/api-stable.md`](docs/api-stable.md) | | **Researcher** | [`docs/case-studies/`](docs/case-studies/), [`docs/reproducibility-snapshots.md`](docs/reproducibility-snapshots.md) | | **Contributor** | [`CONTRIBUTING.md`](CONTRIBUTING.md), [`GOVERNANCE.md`](GOVERNANCE.md), [`CODE_OF_CONDUCT.md`](CODE_OF_CONDUCT.md) | | **Security** | [`SECURITY.md`](SECURITY.md) | | **Accessibility** | [`ACCESSIBILITY.md`](ACCESSIBILITY.md) | The complete functional specification is in [`SPECS.md`](SPECS.md) (full refresh planned in Sprint A14). --- ## Citation A `CITATION.cff` file and a Zenodo DOI will land in Sprint A12 (scientific publication track). Until then, cite the GitHub repo with the commit SHA used in your benchmark — every Picarones report embeds the commit and full snapshot for reproducibility (cf. [`docs/reproducibility-snapshots.md`](docs/reproducibility-snapshots.md)). --- ## Contributing See [`CONTRIBUTING.md`](CONTRIBUTING.md) (FR) / [`CONTRIBUTING.en.md`](CONTRIBUTING.en.md) (EN). Code of conduct: [`CODE_OF_CONDUCT.md`](CODE_OF_CONDUCT.md) (Contributor Covenant 2.1). Governance & maintainership: [`GOVERNANCE.md`](GOVERNANCE.md). --- ## License [Apache License 2.0](LICENSE) Copyright 2024–2026 Picarones contributors.