title: Picarones
emoji: 📜
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false

Picarones

Heritage OCR / HTR / VLM and post-correction benchmarking platform

Banc d'essai d'OCR / HTR / VLM et de post-correction pour documents patrimoniaux

CI · Python 3.11+ · License: Apache 2.0 · Code style: ruff · HuggingFace Space


What is Picarones?

Picarones is an open-source benchmarking platform for OCR, HTR, VLM and post-correction pipelines on heritage documents (manuscripts, early printed books, archives).

The input is a folder of (image, ground truth) pairs — ground truth in plain text, ALTO XML, or PAGE XML. Picarones runs the AIs you plug in (OCR engines, VLMs, OCR+LLM pipelines, ALTO mappers, ensembles…) on every page, compares each output to the ground truth at every relevant level (text, ALTO, PAGE, entities, reading order), and produces a self-contained HTML report with factual numbers, statistical tests and a reproducibility snapshot.

Without ground truth, no benchmark — Picarones measures how well an AI matches a known reference, not how it transcribes an arbitrary document.
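To make the input contract concrete, here is a minimal sketch of how a corpus folder could be scanned for (image, ground truth) pairs. The file naming convention (`page01.png` next to `page01.gt.txt`, `page01.txt` or `page01.xml`) and the `find_pairs` helper are assumptions made for illustration, not Picarones' actual loader.

```python
from pathlib import Path
import tempfile

IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".tif", ".tiff"}
# Hypothetical naming convention: page01.png <-> page01.gt.txt / page01.txt / page01.xml
GT_SUFFIXES = (".gt.txt", ".txt", ".xml")


def find_pairs(corpus_dir):
    """Return (image, ground_truth) pairs; images without ground truth are skipped."""
    pairs = []
    for image in sorted(Path(corpus_dir).iterdir()):
        if image.suffix.lower() not in IMAGE_EXTS:
            continue
        for suffix in GT_SUFFIXES:
            candidate = image.with_name(image.stem + suffix)
            if candidate.exists():
                pairs.append((image, candidate))
                break
    return pairs


# Tiny throwaway corpus: one page with ground truth, one without.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "page01.png").write_bytes(b"")
    (root / "page01.gt.txt").write_text("In nomine Domini", encoding="utf-8")
    (root / "page02.png").write_bytes(b"")  # no reference, so nothing to measure
    pair_names = [(img.name, gt.name) for img, gt in find_pairs(root)]

print(pair_names)  # [('page01.png', 'page01.gt.txt')]
```

The second page is dropped on purpose: without a reference there is nothing to compare the engine output against.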

Version française ci-dessous.

Use case

A digital library plans to OCR a production corpus — say, several thousand 17th-century parish registers, 19th-century newspapers, or medieval glossed manuscripts. Several pipelines are on the table (alternative OCR, LLM correction, ALTO mappers, ensembles); the question is which one to deploy.

The candidates cannot be benchmarked on the production corpus itself (no ground truth). A small golden dataset matching the target profile is assembled; Picarones runs each candidate on it and reports CER, recovered fuzzy searchability, preserved numerical sequences, errors introduced by post-correctors, and statistical significance. The numbers inform the deployment decision.

En français

Picarones est une plateforme open-source de banc d'essai pour des IA d'OCR, HTR, VLM et des pipelines de post-correction sur documents patrimoniaux.

L'entrée est un dossier de paires (image, vérité terrain) — VT en texte brut, ALTO XML ou PAGE XML. Picarones exécute les IA que vous branchez sur chaque page, compare la sortie à la VT à tous les niveaux pertinents et produit un rapport HTML autonome avec chiffres factuels, tests statistiques et snapshot de reproductibilité. Sans vérité terrain, pas de benchmark.


Features

Heritage-specific metrics

Three families of metrics calibrated for historical documents:

  • Classical OCR/HTR — CER (raw, NFC, caseless, diplomatic), WER, MER, WIL via jiwer; 10-class error taxonomy; bootstrap 95% CIs; line-level Gini distribution.
  • Philological — MUFI coverage, abbreviation expansion (Cappelli), early-modern typography (long-s, ligatures, tilde nasals), modern archives markers, Roman numerals, Unicode block accuracy, NER precision (HIPE), reading-order F1 (ICDAR 2015), layout F1.
  • Comparison & decision — Friedman + Nemenyi + Critical Difference Diagram (Demšar 2006); cross-engine taxonomic divergence + oracle complementarity; cost / speed / CO₂ Pareto front; multi-run stability (Cohen κ, Krippendorff α); longitudinal trend with change-point detection; controlled per-slot ANOVA-like comparison.

For the full list with definitions, see docs/views.md and the contextual glossary embedded in every report (25 bilingual entries).
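As an illustration of the classical family, the sketch below computes a character error rate under optional NFC and caseless normalization, plus a percentile bootstrap interval over per-page scores. Picarones relies on jiwer for these metrics; this example deliberately swaps in a plain dynamic-programming edit distance so that it has no dependencies. The function names and the per-page values are assumptions chosen for the example.

```python
import random
import unicodedata


def edit_distance(ref, hyp):
    """Levenshtein distance (insertions, deletions, substitutions all cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # delete r
                            curr[j - 1] + 1,          # insert h
                            prev[j - 1] + (r != h)))  # substitute or match
        prev = curr
    return prev[-1]


def cer(ref, hyp, nfc=False, caseless=False):
    """Character error rate: edit distance divided by reference length."""
    if nfc:
        ref = unicodedata.normalize("NFC", ref)
        hyp = unicodedata.normalize("NFC", hyp)
    if caseless:
        ref, hyp = ref.casefold(), hyp.casefold()
    return edit_distance(ref, hyp) / max(len(ref), 1)


def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of per-page scores."""
    rng = random.Random(seed)
    means = sorted(
        sum(rng.choices(values, k=len(values))) / len(values)
        for _ in range(n_resamples)
    )
    low = means[int(n_resamples * alpha / 2)]
    high = means[int(n_resamples * (1 - alpha / 2)) - 1]
    return low, high


raw = cer("e\u0301glise", "\u00e9glise")             # decomposed vs precomposed accent
normalised = cer("e\u0301glise", "\u00e9glise", nfc=True)
page_scores = [0.02, 0.05, 0.04, 0.10, 0.03, 0.07]   # illustrative values only
low, high = bootstrap_ci(page_scores)
print(raw, normalised, (low, high))
```

The raw score penalises a purely encoding-level difference that the NFC variant ignores, which is why reporting several CER variants side by side is informative on historical text.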

OCR+LLM pipelines

Composable chains: tesseract -> gpt-4o, pero_ocr -> claude-sonnet, zero-shot VLM, etc. Three pipeline modes: text-only post-correction, image+text post-correction, and zero-shot. Over-normalisation detection flags LLMs that silently modernise historical spellings. A composed-pipeline benchmarking layer (Sprint 63+) runs N candidate pipelines on the same corpus and ranks them by a chosen metric.
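The idea of a composed chain and of over-normalisation detection can be sketched as follows. Everything here is hypothetical: `fake_ocr` and `fake_llm` stand in for real adapters, and the detector uses a naive position-by-position token comparison (it assumes the three texts have the same number of tokens), which is far simpler than what a production implementation would need.

```python
from typing import Callable

Step = Callable[[str], str]


def compose(*steps: Step) -> Step:
    """Chain steps left to right: the output of one is the input of the next."""
    def run(value: str) -> str:
        for step in steps:
            value = step(value)
        return value
    return run


def fake_ocr(_image_path: str) -> str:
    """Stand-in for an OCR engine: misses one accent, keeps historical spelling."""
    return "le roy estoit a Paris"


def fake_llm(text: str) -> str:
    """Stand-in for an LLM corrector that fixes the accent but also modernises."""
    return (text.replace(" a ", " \u00e0 ")
                .replace("roy", "roi")
                .replace("estoit", "\u00e9tait"))


def over_normalised(ground_truth: str, ocr: str, corrected: str):
    """Tokens the OCR already had right that the corrector changed anyway."""
    triples = zip(ground_truth.split(), ocr.split(), corrected.split())
    return [(g, c) for g, o, c in triples if o == g and c != g]


pipeline = compose(fake_ocr, fake_llm)
truth = "le roy estoit \u00e0 Paris"
ocr_text = fake_ocr("page01.png")
final_text = pipeline("page01.png")
flags = over_normalised(truth, ocr_text, final_text)
print(flags)  # [('roy', 'roi'), ('estoit', 'était')]
```

The accent fix on "a" is a genuine correction and is not flagged; the two modernised spellings are, because the OCR output already matched the reference there.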

Corpus import

| Source | Method |
|---|---|
| Local folder | picarones run --corpus ./corpus/ |
| IIIF manifests (any institutional repository) | picarones import iiif <manifest-url> |
| Gallica API (BnF SRU + IIIF) | GallicaClient / picarones import iiif |
| HuggingFace Datasets | Web UI: POST /api/huggingface/import |
| HTR-United catalogue | Web UI: POST /api/htr-united/import |
| eScriptorium | EScriptoriumClient |
| ZIP upload (browser) | Web upload endpoint |

Supported corpus formats: plain text pairs, ALTO XML, PAGE XML.

Interactive HTML report

A single self-contained HTML file (the --lazy-images option is available for large corpora). Five views:

  • Ranking — sortable table of all engines and metrics.
  • Gallery — color-coded CER badges per document.
  • Document — synchronized N-way diff, triple diff for OCR+LLM.
  • Analyses — distribution charts, Pareto, calibration, robustness projection, philological profile, longitudinal trends, levers.
  • Characters — Unicode confusion matrix, ligature analysis.

Above the views: factual narrative synthesis (20+ deterministic detectors, every number traceable to the input — anti-hallucination proven by tests), Critical Difference Diagram, Pareto front. Side panels for contextual glossary and Advanced mode (visible columns, strata filters, opt-in personal composite score).
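For readers unfamiliar with the Pareto front shown in the report, the sketch below keeps only the candidates that no other candidate beats on every criterion at once. The two criteria (CER and cost per page), the engine names and the numbers are made-up illustrations, and `pareto_front` is a hypothetical helper rather than the report's own code.

```python
def pareto_front(candidates):
    """Names of candidates not dominated by any other (lower is better everywhere)."""
    front = []
    for name, point in candidates.items():
        dominated = any(
            other != point and all(o <= p for o, p in zip(other, point))
            for other_name, other in candidates.items()
            if other_name != name
        )
        if not dominated:
            front.append(name)
    return sorted(front)


# (CER, cost per page) -- illustrative values, not measurements.
engines = {
    "engine_a": (0.05, 10.0),  # most accurate, most expensive
    "engine_b": (0.08, 2.0),   # cheaper, slightly less accurate
    "engine_c": (0.09, 5.0),   # worse than engine_b on both axes
}
front = pareto_front(engines)
print(front)  # ['engine_a', 'engine_b']
```

Only the candidates on the front represent a real trade-off; a dominated candidate can be discarded without weighing accuracy against cost.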

Web interface

FastAPI application with real-time SSE progress streaming, ZIP upload from the browser, dynamic engine and normalization profile selectors, browse and re-download generated reports, bilingual French/English UI. Deployable on HuggingFace Spaces (Docker, port 7860) and on institutional infrastructure (see docs/operations/deployment-institutional.md).

Longitudinal tracking & robustness

Optional SQLite database recording benchmark history across runs. CER evolution curves per engine, automatic regression detection between consecutive runs (Pettitt change-point analysis, Sprint 92). Robustness analysis measures engine resilience to noise, blur, rotation, resolution and binarization, projected on the real corpus quality profile (Sprint 81).
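The change-point idea can be illustrated with the classic Pettitt statistic, using its usual approximate p-value. This is a textbook sketch written for this README, not the code behind the history database; the CER series is invented for the example.

```python
import math


def sign(delta):
    """Return -1, 0 or 1 depending on the sign of delta."""
    return (delta > 0) - (delta < 0)


def pettitt(series):
    """Return (change_index, K, approx_p) for the most likely single change point.

    change_index is the number of observations before the shift.
    """
    n = len(series)
    best_t, best_u = 0, 0
    for t in range(1, n):
        # Compare every value before the split with every value after it.
        u = sum(sign(series[i] - series[j])
                for i in range(t) for j in range(t, n))
        if abs(u) > abs(best_u):
            best_t, best_u = t, u
    k = abs(best_u)
    approx_p = min(1.0, 2.0 * math.exp(-6.0 * k ** 2 / (n ** 3 + n ** 2)))
    return best_t, k, approx_p


# Invented CER history: five stable runs, then a regression.
cer_history = [0.050, 0.052, 0.049, 0.051, 0.050,
               0.080, 0.082, 0.079, 0.081, 0.083]
change_index, k_stat, p_value = pettitt(cer_history)
print(change_index, k_stat, round(p_value, 3))  # 5 25 0.066
```

With only ten runs the approximate p-value stays above 0.05 even for a visually obvious shift, which is a reminder that short histories give weak statistical evidence.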


Quick start

# Install
pip install -e ".[dev,web]"

# Tesseract (system binary, required for the Tesseract engine)
sudo apt install tesseract-ocr tesseract-ocr-fra tesseract-ocr-lat   # Debian/Ubuntu
brew install tesseract tesseract-lang                                # macOS

# Generate a demo report (no engine needed)
picarones demo --output demo_report.html

# Run a benchmark
picarones run --corpus ./corpus/ --engines tesseract --output results.json
picarones report --results results.json --output report.html

# Web UI
picarones serve --port 8080

For Docker, institutional deployment, or HuggingFace Spaces, see INSTALL.md and docs/operations/deployment-institutional.md.


Supported engines

| Engine | Type | Installation |
|---|---|---|
| Azure Doc Intelligence | Cloud API | AZURE_DOC_INTEL_ENDPOINT + AZURE_DOC_INTEL_KEY |
| Google Vision | Cloud API | GOOGLE_APPLICATION_CREDENTIALS env var |
| Mistral OCR | Cloud API | MISTRAL_API_KEY env var |
| Pero OCR | Local Python | pip install -e .[pero] |
| Tesseract 5 | Local CLI | pip install pytesseract + system binary |

LLM/VLM adapters (used through pipelines, not as standalone OCR engines): GPT-4o, Claude, Mistral Large, Ollama (local). See docs/cli-workflows.md.

The Engine table is regenerated automatically by scripts/gen_readme_tables.py; after a new adapter is added under picarones/engines/, the next CI run either updates this table or fails.


CLI commands

| Command | Description |
|---|---|
| picarones compare | Compare two benchmark JSON runs and flag regressions (Sprint 28) |
| picarones demo | Generate a demo report with synthetic data (no engine required) |
| picarones diagnose | Pre-wired workflow: bench + improvement levers + factual recommendations |
| picarones economics | Pre-wired workflow: bench + effective throughput + cost projection |
| picarones edition | Pre-wired workflow: bench + philological metrics for critical editing |
| picarones engines | List available OCR engines and LLM adapters |
| picarones history | Query longitudinal benchmark history (SQLite) |
| picarones import | Import a corpus from a remote source (IIIF, HF, HTR-United) |
| picarones info | Display version and system information |
| picarones metrics | Compute CER/WER between two text files |
| picarones pipeline | Run / compare composed pipelines from a YAML spec (Sprint 70) |
| picarones report | Generate an HTML report from JSON results |
| picarones robustness | Run robustness analysis with degraded images |
| picarones run | Run a full benchmark on a corpus |
| picarones serve | Launch the FastAPI web interface |

Each command supports --help for full options. See docs/cli-workflows.md for end-to-end examples.


Web API endpoints

The web app exposes a documented OpenAPI spec at /docs (Swagger UI) when running. Summary:

| Method | Endpoint | Summary |
|---|---|---|
| GET | / | Index |
| POST | /api/benchmark/run | Api Benchmark Run |
| POST | /api/benchmark/start | Api Benchmark Start |
| POST | /api/benchmark/{job_id}/cancel | Api Benchmark Cancel |
| GET | /api/benchmark/{job_id}/status | Api Benchmark Status |
| GET | /api/benchmark/{job_id}/stream | Api Benchmark Stream |
| GET | /api/benchmark/{job_id}/synthesis_preview | Api Benchmark Synthesis Preview |
| POST | /api/config/load | Api Config Load |
| POST | /api/config/save | Api Config Save |
| GET | /api/corpus/browse | Api Corpus Browse |
| GET | /api/corpus/image/{upload_id}/{filename} | Api Corpus Image |
| POST | /api/corpus/upload | Api Corpus Upload |
| GET | /api/corpus/uploads | Api Corpus Uploads |
| DELETE | /api/corpus/uploads/{corpus_id} | Api Corpus Delete |
| GET | /api/csrf/token | Api Csrf Token |
| GET | /api/engines | Api Engines |
| GET | /api/history/regressions | Api History Regressions |
| GET | /api/htr-united/catalogue | Api Htr United Catalogue |
| POST | /api/htr-united/import | Api Htr United Import |
| POST | /api/huggingface/import | Api Huggingface Import |
| GET | /api/huggingface/search | Api Huggingface Search |
| GET | /api/lang | Api Get Lang |
| POST | /api/lang/{lang_code} | Api Set Lang |
| GET | /api/models/{provider} | Api Models |
| GET | /api/normalization/profiles | Api Normalization Profiles |
| GET | /api/reports | Api Reports |
| GET | /api/status | Api Status |
| GET | /health | Health |
| GET | /reports/{filename} | Serve Report |

The complete OpenAPI JSON is also exposed at /openapi.json for client generation.
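A client that follows /api/benchmark/{job_id}/stream has to decode the Server-Sent Events wire format. The parser below handles the standard `event:` and `data:` fields and is independent of any HTTP library. The event names and JSON payloads in the sample are assumptions made for the example; check /docs on a running instance for the real schema.

```python
import json


def iter_sse_events(lines):
    """Yield (event_name, data) from an iterable of decoded SSE lines."""
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            # A blank line terminates the current event.
            if data:
                yield event, "\n".join(data)
            event, data = "message", []
        elif line.startswith(":"):
            continue  # comment line, often used as keep-alive
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # The format allows one optional space after the colon.
            data.append(value[1:] if value.startswith(" ") else value)


# Hypothetical stream content, for illustration only.
sample_stream = [
    "event: progress\n",
    'data: {"done": 3, "total": 10}\n',
    "\n",
    ": keep-alive\n",
    "event: complete\n",
    'data: {"report": "report.html"}\n',
    "\n",
]
events = [(name, json.loads(payload))
          for name, payload in iter_sse_events(sample_stream)]
print(events)
```

In practice the lines would come from a streaming HTTP response instead of a list; the parsing logic stays the same.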


Normalization profiles

Picarones ships 11 built-in normalization profiles for historical text comparison (defined in picarones/measurements/normalization.py, exposed via /api/normalization/profiles):

nfc, caseless, minimal, medieval_french, early_modern_french, medieval_latin, medieval_english, early_modern_english, secretary_hand, sans_ponctuation, sans_apostrophes.

Custom profiles can be loaded from YAML files with user-defined diplomatic tables and exclude_chars sets. See docs/profiles.md.
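The effect of a profile can be sketched in a few lines: normalise both sides with the same rules before comparing them. The dictionary layout below (keys `nfc`, `caseless`, `diplomatic`, `exclude_chars`) and the `apply_profile` helper are assumptions for illustration; the actual YAML schema is described in docs/profiles.md.

```python
import unicodedata

# Hypothetical in-memory equivalent of a custom profile file.
example_profile = {
    "nfc": True,
    "diplomatic": {"\u017f": "s", "v": "u"},  # long s -> s, v -> u
    "exclude_chars": {",", ".", ";"},
}


def apply_profile(text, profile):
    """Normalise text so that differences the profile tolerates disappear."""
    if profile.get("nfc"):
        text = unicodedata.normalize("NFC", text)
    if profile.get("caseless"):
        text = text.casefold()
    table = profile.get("diplomatic", {})
    text = "".join(table.get(ch, ch) for ch in text)
    excluded = profile.get("exclude_chars", set())
    return "".join(ch for ch in text if ch not in excluded)


reference = "\u017fouuent, il vint"   # diplomatic transcription
hypothesis = "souvent il vint"        # engine output with modern letter forms
same = apply_profile(reference, example_profile) == apply_profile(hypothesis, example_profile)
print(apply_profile(reference, example_profile), same)  # souuent il uint True
```

Both strings collapse to the same normalised form, so a CER computed under this profile would count none of the long-s, u/v or punctuation differences as errors.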

A traceability table mapping each profile to its source standard (MUFI v4.0, TEI P5, DEAF) will ship in Sprint A12 (B-6).


Project structure

picarones/
├── core/                       Circle 1: pure abstractions (7 modules)
├── measurements/               Circle 2: official metrics (~70 modules + narrative engine)
├── engines/                    Circle 2: 5 OCR adapters
├── llm/                        Circle 2: 4 LLM adapters
├── pipelines/                  Circle 2: OCR+LLM pipelines
├── modules/                    Circle 2: official BaseModule modules
├── extras/                     Circle 3: plugins (importers, historical)
├── report/                     Circle 3: HTML rendering
├── cli/                        Circle 3: Click CLI (15 commands)
├── web/                        Circle 3: FastAPI app + 11 routers
├── prompts/                    8 versioned prompt templates
└── data/                       Indicative tables (pricing.yaml)

Strict 3-circle architecture: imports flow only from outer to inner. Enforced by tests/core/test_circle_dependencies.py (Sprint A3). See docs/architecture.md for the full manifesto.
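A dependency-direction rule of this kind can be checked statically. The sketch below parses a module's source with `ast` and reports any import of a package that sits in a more outer circle than the module itself. The circle numbers come from the tree above; the `violations` helper, the sample sources and the decision to ignore relative imports are assumptions, and the real test in tests/core/test_circle_dependencies.py may work differently.

```python
import ast

# Circle number of each top-level package, as listed in the project tree.
CIRCLE = {
    "core": 1,
    "measurements": 2, "engines": 2, "llm": 2, "pipelines": 2, "modules": 2,
    "extras": 3, "report": 3, "cli": 3, "web": 3,
}


def violations(package, source):
    """Imports in `source` (a module of `package`) that point to an outer circle."""
    bad = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names = [node.module]
        else:
            continue  # not an import, or a relative import (ignored in this sketch)
        for name in names:
            parts = name.split(".")
            if len(parts) >= 2 and parts[0] == "picarones" and parts[1] in CIRCLE:
                if CIRCLE[parts[1]] > CIRCLE[package]:
                    bad.append(name)
    return bad


# Hypothetical module sources used only to exercise the check.
inner_source = "import json\nfrom picarones.web import app\n"
outer_source = "from picarones.core import types\nimport picarones.measurements.cer\n"
print(violations("core", inner_source))  # ['picarones.web']
print(violations("web", outer_source))   # []
```

Running such a check over every file of the package in a test makes an accidental inner-to-outer import fail CI instead of surfacing later as a circular import.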


Environment variables

See .env.example for the complete list. Key variables:

# Security & mode (cf. SECURITY.md)
PICARONES_PUBLIC_MODE=         # 1/true/yes for HF Space (no cloud OCR)
PICARONES_CSRF_REQUIRED=       # 1 for institutional deployment
PICARONES_BROWSE_ROOTS=        # restrict browse to specific paths

# Cloud API keys (optional)
MISTRAL_API_KEY=
OPENAI_API_KEY=
ANTHROPIC_API_KEY=
GOOGLE_APPLICATION_CREDENTIALS=
AZURE_DOC_INTEL_ENDPOINT=
AZURE_DOC_INTEL_KEY=

# GDPR (RGPD) retention (Sprint A11)
PICARONES_UPLOAD_RETENTION_DAYS=7

For HuggingFace Spaces, set these in Settings → Variables and secrets.
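For reference, boolean switches such as PICARONES_PUBLIC_MODE are typically read along the following lines. This is a minimal sketch: the accepted values 1/true/yes come from the comments above, while case-insensitive matching, the helper names and the fallback of 7 days are assumptions, not the application's actual settings code.

```python
import os

TRUTHY = {"1", "true", "yes"}


def env_flag(name, environ=os.environ):
    """True when the variable is set to 1/true/yes (case-insensitive, an assumption)."""
    return environ.get(name, "").strip().lower() in TRUTHY


def retention_days(environ=os.environ, default=7):
    """Upload retention in days; falls back to `default` when unset or empty."""
    raw = environ.get("PICARONES_UPLOAD_RETENTION_DAYS", "").strip()
    return int(raw) if raw else default


# A plain dict stands in for the process environment in this example.
fake_env = {"PICARONES_PUBLIC_MODE": "True", "PICARONES_CSRF_REQUIRED": ""}
public_mode = env_flag("PICARONES_PUBLIC_MODE", fake_env)
csrf_required = env_flag("PICARONES_CSRF_REQUIRED", fake_env)
print(public_mode, csrf_required, retention_days(fake_env))  # True False 7
```

Treating an empty value as "off" matches the way the variables are listed above with nothing after the equals sign.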


CI/CD

GitHub Actions: .github/workflows/

  • ci.yml — tests on Python 3.11/3.12/3.13 × Linux/macOS/Windows, ruff, mypy strict on core/, security scanners (bandit + pip-audit + trivy), coverage gate --cov-fail-under=85, pytest-timeout 300s.
  • precommit.yml — replays pre-commit hooks (catches --no-verify bypass).
  • release.yml — on tag v*.*.*: PyPI + ghcr.io multi-arch + GitHub Release with notes from CHANGELOG.
  • perf_regression.yml — weekly cron + PR-triggered: CER anti-regression on a synthetic reference corpus.
  • sync_to_huggingface.yml — auto-syncs main to the HF Space.

Development

pip install -e ".[dev,web]"
pre-commit install
pytest tests/ -q
ruff check picarones/ tests/
python -m mypy picarones/core/

Test suite: ~3763 tests, ~3 min on a modern laptop. Coverage floor at 85% (currently ~87%). Tests that require live HTTP carry the network marker so that they can be excluded from a run.

For end-to-end developer guides, see docs/developer/index.md (FR) / docs/developer/index.en.md (EN).

Conventions

  • Never except Exception: pass — use logger.warning("[module] degraded feature: %s", e).
  • One canonical home per module — circle dependency direction enforced by tests.
  • Engines declare execution_mode ("io" or "cpu") so the runner picks ThreadPoolExecutor vs ProcessPoolExecutor appropriately.
  • Hardcoded UI strings forbidden — always go through i18n (cf. docs/developer/extending-i18n.md).
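Two of the conventions above lend themselves to a short illustration: logging a degraded feature instead of swallowing the exception, and choosing the executor from an engine's execution_mode. The engine classes and helper functions are hypothetical stand-ins written for this example; only the attribute name execution_mode, its two values and the logging pattern come from the list above.

```python
import logging
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

logger = logging.getLogger("picarones.example")


class FakeCloudEngine:
    """Hypothetical engine that mostly waits on network calls."""
    execution_mode = "io"


class FakeLocalEngine:
    """Hypothetical engine that does heavy local computation."""
    execution_mode = "cpu"


def executor_class(engine):
    """Threads for I/O-bound engines, processes for CPU-bound ones."""
    if engine.execution_mode == "io":
        return ThreadPoolExecutor
    return ProcessPoolExecutor


def optional_feature(compute):
    """Run an optional step; on failure, log and degrade instead of passing silently."""
    try:
        return compute()
    except Exception as e:  # broad on purpose, but never silent
        logger.warning("[example] degraded feature: %s", e)
        return None


print(executor_class(FakeCloudEngine()).__name__)  # ThreadPoolExecutor
print(executor_class(FakeLocalEngine()).__name__)  # ProcessPoolExecutor
print(optional_feature(lambda: 1 / 0))             # None, with a warning logged
```

Threads are enough when the work is waiting on an API, while separate processes avoid contention on the interpreter lock for CPU-heavy engines.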

Roadmap

Detailed history and current direction live in:

Phase 1 of the institutional readiness plan (sprints A1–A11) is complete as of May 2026: CI hardening, doc consistency gates, 3-circle refactor, web hardening, perf+concurrency tests, WCAG 2.1 AA accessibility, reproducibility ops (lock files, Docker pinning), PyPI/ghcr.io release pipeline, governance & COI policies, institutional deployment guide & GDPR (RGPD) documentation.

Remaining: scientific publication track (CITATION + JOSS, sprint A12), README/SPECS final polish (this sprint and A14), external audits (RGAA + security pentest, A15).


Documentation

The complete functional specification is in SPECS.md (full refresh planned in Sprint A14).


Citation

A CITATION.cff file and a Zenodo DOI will land in Sprint A12 (scientific publication track). Until then, cite the GitHub repo with the commit SHA used in your benchmark — every Picarones report embeds the commit and full snapshot for reproducibility (cf. docs/reproducibility-snapshots.md).


Contributing

See CONTRIBUTING.md (FR) / CONTRIBUTING.en.md (EN). Code of conduct: CODE_OF_CONDUCT.md (Contributor Covenant 2.1). Governance & maintainership: GOVERNANCE.md.


License

Apache License 2.0

Copyright 2024–2026 Picarones contributors.