
title: Picarones
emoji: 📜
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false

Picarones

Heritage OCR / HTR / VLM and post-correction benchmarking platform

Banc d'essai d'OCR / HTR / VLM et de post-correction pour documents patrimoniaux

What is Picarones?

Picarones is an open-source benchmarking platform for OCR, HTR, VLM and post-correction pipelines on heritage documents (manuscripts, early printed books, archives).

The input is a folder of (image, ground truth) pairs — ground truth in plain text, ALTO XML, or PAGE XML. Picarones runs the AIs you plug in (OCR engines, VLMs, OCR+LLM pipelines, ALTO mappers, ensembles…) on every page, compares each output to the ground truth at every relevant level (text, ALTO, PAGE, entities, reading order), and produces a self-contained HTML report with factual numbers, statistical tests and a reproducibility snapshot.

Without ground truth, no benchmark — Picarones measures how well an AI matches a known reference, not how it transcribes an arbitrary document.
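The (image, ground truth) pairing can be pictured with a short sketch. It assumes, purely for illustration, that each image sits next to a ground-truth file sharing its stem (for example `page_001.png` with `page_001.gt.txt`); the suffixes and the `find_pairs` helper are hypothetical and are not the loader Picarones ships.

```python
from pathlib import Path

IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff"}
# Hypothetical ground-truth suffixes, tried in order (an assumption, not the Picarones convention).
GT_SUFFIXES = (".gt.txt", ".alto.xml", ".page.xml")


def find_pairs(corpus_dir):
    """Return (pairs, orphans): images with a ground-truth file, and images without one."""
    pairs, orphans = [], []
    for image in sorted(Path(corpus_dir).iterdir()):
        if image.suffix.lower() not in IMAGE_SUFFIXES:
            continue
        candidates = [image.with_name(image.stem + suffix) for suffix in GT_SUFFIXES]
        ground_truth = next((c for c in candidates if c.is_file()), None)
        if ground_truth is None:
            orphans.append(image)  # no reference, so this page cannot be benchmarked
        else:
            pairs.append((image, ground_truth))
    return pairs, orphans
```

Reporting orphans separately mirrors the rule stated above: a page without a reference is excluded from the benchmark rather than silently scored.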

Version française ci-dessous.

Use case

A digital library plans to OCR a production corpus — say, several thousand 17th-century parish registers, 19th-century newspapers, or medieval glossed manuscripts. Several pipelines are on the table (alternative OCR, LLM correction, ALTO mappers, ensembles); the question is which one to deploy.

The candidates cannot be benchmarked on the production corpus itself (no ground truth). A small golden dataset matching the target profile is assembled; Picarones runs each candidate on it and reports CER, recovered fuzzy searchability, preserved numerical sequences, errors introduced by post-correctors, and statistical significance. The numbers inform the deployment decision.

En français

Picarones est une plateforme open-source de banc d'essai pour des IA d'OCR, HTR, VLM et des pipelines de post-correction sur documents patrimoniaux.

L'entrée est un dossier de paires (image, vérité terrain) — VT en texte brut, ALTO XML ou PAGE XML. Picarones exécute les IA que vous branchez sur chaque page, compare la sortie à la VT à tous les niveaux pertinents et produit un rapport HTML autonome avec chiffres factuels, tests statistiques et snapshot de reproductibilité. Sans vérité terrain, pas de benchmark.

Features

Heritage-specific metrics

Three families of metrics calibrated for historical documents:

Classical OCR/HTR — CER (raw, NFC, caseless, diplomatic), WER, MER, WIL via jiwer; 10-class error taxonomy; bootstrap 95% CIs; line-level Gini distribution.
Philological — MUFI coverage, abbreviation expansion (Cappelli), early-modern typography (long-s, ligatures, tilde nasals), modern archival markers, Roman numerals, Unicode block accuracy, NER precision (HIPE), reading-order F1 (ICDAR 2015), layout F1.
Comparison & decision — Friedman + Nemenyi + Critical Difference Diagram (Demšar 2006); cross-engine taxonomic divergence + oracle complementarity; cost / speed / CO₂ Pareto front; multi-run stability (Cohen κ, Krippendorff α); longitudinal trend with change-point detection; controlled per-slot ANOVA-like comparison.
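To make the classical metrics concrete, here is a minimal sketch of CER in its raw, NFC and caseless variants, plus a percentile bootstrap confidence interval over per-document scores. It is written with the standard library only and is not the Picarones implementation (which, per the list above, relies on jiwer); the function names and the bootstrap settings are assumptions.

```python
import random
import unicodedata


def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution or match
            ))
        previous = current
    return previous[-1]


def cer(reference, hypothesis, nfc=False, caseless=False):
    """Character error rate: edit distance divided by reference length."""
    if nfc:
        reference = unicodedata.normalize("NFC", reference)
        hypothesis = unicodedata.normalize("NFC", hypothesis)
    if caseless:
        reference, hypothesis = reference.casefold(), hypothesis.casefold()
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return levenshtein(reference, hypothesis) / len(reference)


def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of per-document scores."""
    rng = random.Random(seed)
    means = sorted(
        sum(rng.choices(values, k=len(values))) / len(values)
        for _ in range(n_resamples)
    )
    low = means[int((alpha / 2) * n_resamples)]
    high = means[int((1 - alpha / 2) * n_resamples) - 1]
    return low, high
```

The NFC variant matters for heritage text: a precomposed "é" and an "e" followed by a combining acute accent look identical on screen but count as errors under raw CER.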

For the full list with definitions, see docs/views.md and the contextual glossary embedded in every report (25 bilingual entries).

OCR+LLM pipelines

Composable chains: tesseract -> gpt-4o, pero_ocr -> claude-sonnet, zero-shot VLM, etc. Three pipeline modes: text-only post-correction, image+text post-correction, and zero-shot. Over-normalisation detection flags LLMs that silently modernise historical spellings. A composed-pipeline benchmarking layer (Sprint 63+) runs N candidate pipelines on the same corpus and ranks them by a chosen metric.
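As an illustration of a text-only post-correction chain and of what over-normalisation means, the sketch below chains two stand-in functions and lists the tokens the OCR step got right but the corrector then changed. `fake_ocr`, `fake_llm`, `run_chain` and `over_normalised_tokens` are hypothetical names, and the position-by-position comparison is a simplification; the detector Picarones actually uses is not described in this README.

```python
def fake_ocr(image_bytes):
    """Stand-in for an OCR engine: returns a fixed diplomatic transcription."""
    return "Le roy eſt mort"


def fake_llm(text):
    """Stand-in for an LLM corrector that silently modernises spelling."""
    return text.replace("roy", "roi").replace("ſ", "s")


def run_chain(image_bytes, steps):
    """Feed the output of each step into the next one (OCR first, then correctors)."""
    result = image_bytes
    for step in steps:
        result = step(result)
    return result


def over_normalised_tokens(ground_truth, ocr_text, corrected_text):
    """Return (expected, produced) for tokens the OCR got right but the corrector changed.

    Simplification: tokens are compared position by position, so the three
    texts must contain the same number of whitespace-separated tokens.
    """
    gt, ocr, fixed = ground_truth.split(), ocr_text.split(), corrected_text.split()
    if not (len(gt) == len(ocr) == len(fixed)):
        raise ValueError("token counts differ; a real implementation would align first")
    return [(g, f) for g, o, f in zip(gt, ocr, fixed) if o == g and f != g]
```

With the ground truth "Le roy eſt mort", the chain `[fake_ocr, fake_llm]` yields a modernised sentence, and both modernised tokens are flagged even though the result reads as correct modern French.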

Corpus import

Source	Method
Local folder	`picarones run --corpus ./corpus/`
IIIF manifests (any institutional repository)	`picarones import iiif <manifest-url>`
Gallica API (BnF SRU + IIIF)	`GallicaClient` / `picarones import iiif`
HuggingFace Datasets	Web UI: `POST /api/huggingface/import`
HTR-United catalogue	Web UI: `POST /api/htr-united/import`
eScriptorium	`EScriptoriumClient`
ZIP upload (browser)	Web upload endpoint

Supported corpus formats: plain text pairs, ALTO XML, PAGE XML.
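For readers unfamiliar with IIIF, the sketch below shows where page images live in a manifest. It walks an already-parsed manifest dictionary in either the Presentation API 2.x or 3.0 layout and returns one image URL per canvas. `iiif_image_urls` is a hypothetical helper; how `picarones import iiif` resolves images internally is not documented here, and the 3.0 branch assumes a single image body per annotation.

```python
def iiif_image_urls(manifest):
    """Return one image URL per canvas from a parsed IIIF Presentation manifest."""
    urls = []
    if "sequences" in manifest:  # Presentation API 2.x layout
        for canvas in manifest["sequences"][0].get("canvases", []):
            for image in canvas.get("images", [])[:1]:
                urls.append(image["resource"]["@id"])
    else:  # Presentation API 3.0 layout
        for canvas in manifest.get("items", []):
            for page in canvas.get("items", [])[:1]:
                for annotation in page.get("items", [])[:1]:
                    urls.append(annotation["body"]["id"])
    return urls
```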

Interactive HTML report

A single self-contained HTML file, or a report generated with --lazy-images for large corpora. Five views:

Ranking — sortable table of all engines and metrics.
Gallery — color-coded CER badges per document.
Document — synchronized N-way diff, triple diff for OCR+LLM.
Analyses — distribution charts, Pareto, calibration, robustness projection, philological profile, longitudinal trends, levers.
Characters — Unicode confusion matrix, ligature analysis.

Above the views: a factual narrative synthesis (20+ deterministic detectors; every number is traceable to the input, an anti-hallucination property verified by tests), the Critical Difference Diagram, and the Pareto front. Side panels provide the contextual glossary and an Advanced mode (visible columns, strata filters, opt-in personal composite score).
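The Pareto front mentioned above can be read as follows: a candidate stays on the front if no other candidate is at least as good on every axis and strictly better on one. The sketch uses two axes (cost and CER, lower is better on both); the `pareto_front` helper and the tuple layout are illustrative assumptions, not the report's code.

```python
def pareto_front(candidates):
    """Keep the names of candidates not dominated on (cost, cer); lower is better."""
    front = []
    for name, cost, cer in candidates:
        dominated = any(
            other_cost <= cost and other_cer <= cer
            and (other_cost < cost or other_cer < cer)
            for _, other_cost, other_cer in candidates
        )
        if not dominated:
            front.append(name)
    return front
```

A pipeline that is both more expensive and less accurate than another drops off the front; the remaining candidates represent genuine trade-offs that the deployment decision has to arbitrate.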

Web interface

FastAPI application with real-time SSE progress streaming, ZIP upload from the browser, dynamic engine and normalization profile selectors, browsing and re-download of generated reports, and a bilingual French/English UI. Deployable on HuggingFace Spaces (Docker, port 7860) and on institutional infrastructure (see docs/operations/deployment-institutional.md).
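A client that wants to follow a benchmark outside the browser needs to decode the Server-Sent Events stream. The sketch below parses the generic SSE wire format (`event:` and `data:` fields, blank line as terminator) from any iterable of text lines. The event names and payloads Picarones emits are not specified in this README, so `parse_sse` makes no assumption about them.

```python
def parse_sse(lines):
    """Yield (event, data) tuples from an iterable of Server-Sent Events text lines."""
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:  # a blank line terminates the current event
            if data:
                yield event, "\n".join(data)
            event, data = "message", []
        elif line.startswith(":"):
            continue  # comment or keep-alive line
        else:
            field, _, value = line.partition(":")
            value = value[1:] if value.startswith(" ") else value
            if field == "event":
                event = value
            elif field == "data":
                data.append(value)
```

In practice the lines would come from an HTTP response opened on the job's stream endpoint and decoded as UTF-8; the parser itself is independent of the transport.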

Longitudinal tracking & robustness

Optional SQLite database recording benchmark history across runs. CER evolution curves per engine, automatic regression detection between consecutive runs (Pettitt change-point analysis, Sprint 92). Robustness analysis measures engine resilience to noise, blur, rotation, resolution and binarization, projected on the real corpus quality profile (Sprint 81).
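The Pettitt test named above looks for the single split point that best separates a series into two different distributions. Below is a textbook sketch with the usual approximate p-value; how Picarones applies it (thresholds, minimum history length, per-engine series) is not shown in this README, and the brute-force loop is only suitable for short histories.

```python
import math


def pettitt(series):
    """Pettitt change-point test: return (index, K, approximate p-value).

    `index` is the position of the first value after the most likely change.
    """
    n = len(series)
    if n < 2:
        raise ValueError("need at least two observations")

    def sign(x):
        return (x > 0) - (x < 0)

    best_t, best_k = 1, 0
    for t in range(1, n):
        # U_t compares every value before the split with every value after it.
        u = sum(sign(series[i] - series[j]) for i in range(t) for j in range(t, n))
        if abs(u) > best_k:
            best_t, best_k = t, abs(u)
    p_value = min(1.0, 2.0 * math.exp(-6.0 * best_k**2 / (n**3 + n**2)))
    return best_t, best_k, p_value
```

For a CER history that sits at one level for five runs and then jumps to a higher level for five more, the split is located at the sixth run (index 5).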

Quick start

# Install
pip install -e ".[dev,web]"

# Tesseract (system binary, required for the Tesseract engine)
sudo apt install tesseract-ocr tesseract-ocr-fra tesseract-ocr-lat   # Debian/Ubuntu
brew install tesseract tesseract-lang                                # macOS

# Generate a demo report (no engine needed)
picarones demo --output demo_report.html

# Run a benchmark
picarones run --corpus ./corpus/ --engines tesseract --output results.json
picarones report --results results.json --output report.html

# Web UI
picarones serve --port 8080

For Docker, institutional deployment, or HuggingFace Spaces, see INSTALL.md and docs/operations/deployment-institutional.md.

Supported engines

Engine	Type	Installation
Azure Doc Intelligence	Cloud API	`AZURE_DOC_INTEL_ENDPOINT` + `AZURE_DOC_INTEL_KEY`
Google Vision	Cloud API	`GOOGLE_APPLICATION_CREDENTIALS` env var
Mistral OCR	Cloud API	`MISTRAL_API_KEY` env var
Pero OCR	Local Python	`pip install -e .[pero]`
Tesseract 5	Local CLI	`pip install pytesseract` + system binary

LLM/VLM adapters (used through pipelines, not as standalone OCR engines): GPT-4o, Claude, Mistral Large, Ollama (local). See docs/cli-workflows.md.

The Engine table is regenerated automatically by scripts/gen_readme_tables.py — adding a new adapter under picarones/engines/ makes the next CI run update this table or fail.

CLI commands

Command	Description
`picarones compare`	Compare two benchmark JSON runs and flag regressions (Sprint 28)
`picarones demo`	Generate a demo report with synthetic data (no engine required)
`picarones diagnose`	Pre-wired workflow: bench + improvement levers + factual recommendations
`picarones economics`	Pre-wired workflow: bench + effective throughput + cost projection
`picarones edition`	Pre-wired workflow: bench + philological metrics for critical editing
`picarones engines`	List available OCR engines and LLM adapters
`picarones history`	Query longitudinal benchmark history (SQLite)
`picarones import`	Import a corpus from a remote source (IIIF, HF, HTR-United)
`picarones info`	Display version and system information
`picarones metrics`	Compute CER/WER between two text files
`picarones pipeline`	Run / compare composed pipelines from a YAML spec (Sprint 70)
`picarones report`	Generate an HTML report from JSON results
`picarones robustness`	Run robustness analysis with degraded images
`picarones run`	Run a full benchmark on a corpus
`picarones serve`	Launch the FastAPI web interface

Each command supports --help for full options. See docs/cli-workflows.md for end-to-end examples.

Web API endpoints

The web app exposes a documented OpenAPI spec at /docs (Swagger UI) when running. Summary:

Method	Endpoint	Summary
`GET`	`/`	Index
`POST`	`/api/benchmark/run`	Api Benchmark Run
`POST`	`/api/benchmark/start`	Api Benchmark Start
`POST`	`/api/benchmark/{job_id}/cancel`	Api Benchmark Cancel
`GET`	`/api/benchmark/{job_id}/status`	Api Benchmark Status
`GET`	`/api/benchmark/{job_id}/stream`	Api Benchmark Stream
`GET`	`/api/benchmark/{job_id}/synthesis_preview`	Api Benchmark Synthesis Preview
`POST`	`/api/config/load`	Api Config Load
`POST`	`/api/config/save`	Api Config Save
`GET`	`/api/corpus/browse`	Api Corpus Browse
`GET`	`/api/corpus/image/{upload_id}/{filename}`	Api Corpus Image
`POST`	`/api/corpus/upload`	Api Corpus Upload
`GET`	`/api/corpus/uploads`	Api Corpus Uploads
`DELETE`	`/api/corpus/uploads/{corpus_id}`	Api Corpus Delete
`GET`	`/api/csrf/token`	Api Csrf Token
`GET`	`/api/engines`	Api Engines
`GET`	`/api/history/regressions`	Api History Regressions
`GET`	`/api/htr-united/catalogue`	Api Htr United Catalogue
`POST`	`/api/htr-united/import`	Api Htr United Import
`POST`	`/api/huggingface/import`	Api Huggingface Import
`GET`	`/api/huggingface/search`	Api Huggingface Search
`GET`	`/api/lang`	Api Get Lang
`POST`	`/api/lang/{lang_code}`	Api Set Lang
`GET`	`/api/models/{provider}`	Api Models
`GET`	`/api/normalization/profiles`	Api Normalization Profiles
`GET`	`/api/reports`	Api Reports
`GET`	`/api/status`	Api Status
`GET`	`/health`	Health
`GET`	`/reports/{filename}`	Serve Report

The complete OpenAPI JSON is also exposed at /openapi.json for client generation.
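As a starting point for client generation, the sketch below lists the operations found in a parsed OpenAPI document. It relies only on the standard `paths` object of the OpenAPI format; `list_operations` is a hypothetical helper, and fetching the document (for instance from a locally running server) is left to the caller.

```python
HTTP_METHODS = ("get", "post", "put", "patch", "delete")


def list_operations(openapi):
    """Return (METHOD, path, summary) tuples from a parsed OpenAPI document, sorted by path."""
    operations = []
    for path, item in openapi.get("paths", {}).items():
        for method in HTTP_METHODS:
            if method in item:
                summary = item[method].get("summary", "")
                operations.append((method.upper(), path, summary))
    return sorted(operations, key=lambda op: (op[1], op[0]))
```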

Normalization profiles

Picarones ships 11 built-in normalization profiles for historical text comparison (defined in picarones/measurements/normalization.py, exposed via /api/normalization/profiles):

nfc, caseless, minimal, medieval_french, early_modern_french, medieval_latin, medieval_english, early_modern_english, secretary_hand, sans_ponctuation, sans_apostrophes.

Custom profiles can be loaded from YAML files with user-defined diplomatic tables and exclude_chars sets. See docs/profiles.md.
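To show what a profile does before comparison, here is a minimal sketch that applies a profile expressed as a plain dictionary. Only `exclude_chars` is a name taken from this README; the other keys (`nfc`, `caseless`, `diplomatic`) and the `normalize` function are assumptions for illustration, and the real YAML schema is the one described in docs/profiles.md.

```python
import unicodedata


def normalize(text, profile):
    """Apply a normalization profile given as a plain dict (hypothetical schema)."""
    if profile.get("nfc", True):
        text = unicodedata.normalize("NFC", text)
    if profile.get("caseless", False):
        text = text.casefold()
    # Diplomatic table: map historical glyphs to modern equivalents before comparison.
    for source, target in profile.get("diplomatic", {}).items():
        text = text.replace(source, target)
    excluded = set(profile.get("exclude_chars", ""))
    return "".join(ch for ch in text if ch not in excluded)


EXAMPLE_PROFILE = {
    "nfc": True,
    "caseless": False,
    "diplomatic": {"ſ": "s", "ꝑ": "per"},
    "exclude_chars": ".,;",
}
```

Applying the same profile to both the ground truth and the engine output means that a long s transcribed as "s" or a resolved abbreviation no longer counts as an error under that profile.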

A traceability table mapping each profile to its source standard (MUFI v4.0, TEI P5, DEAF) will ship in Sprint A12 (B-6).

Project structure

picarones/
├── core/                       Circle 1 — pure abstractions (7 modules)
├── measurements/               Circle 2 — official metrics (~70 modules + narrative engine)
├── engines/                    Circle 2 — 5 OCR adapters
├── llm/                        Circle 2 — 4 LLM adapters
├── pipelines/                  Circle 2 — OCR+LLM pipelines
├── modules/                    Circle 2 — official BaseModule modules
├── extras/                     Circle 3 — plugins (importers, historical)
├── report/                     Circle 3 — HTML rendering
├── cli/                        Circle 3 — Click CLI (15 commands)
├── web/                        Circle 3 — FastAPI app + 11 routers
├── prompts/                    8 versioned prompt templates
└── data/                       Indicative tables (pricing.yaml)

Strict 3-circle architecture: imports flow only from outer to inner. Enforced by tests/core/test_circle_dependencies.py (Sprint A3). See docs/architecture.md for the full manifesto.
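The import rule can be checked statically. The sketch below parses a module's source with `ast` and reports any absolute `picarones.*` import that targets a more outer circle than the module's own package. The circle assignments come from the tree above; `circle_violations` is a hypothetical helper and is not the content of tests/core/test_circle_dependencies.py.

```python
import ast

# Circle of each top-level package, taken from the tree above.
CIRCLES = {
    "core": 1,
    "measurements": 2, "engines": 2, "llm": 2, "pipelines": 2, "modules": 2,
    "extras": 3, "report": 3, "cli": 3, "web": 3,
}


def circle_violations(package, source):
    """List picarones imports in `source` that point outward from `package`'s circle."""
    own_circle = CIRCLES[package]
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            names = [node.module]
        else:
            continue  # not an import, or a relative import (ignored in this sketch)
        for name in names:
            parts = name.split(".")
            if len(parts) >= 2 and parts[0] == "picarones":
                if CIRCLES.get(parts[1], 0) > own_circle:
                    violations.append(name)
    return violations
```

A module in `core/` importing from `picarones.web` is reported, while a module in `web/` importing from `picarones.core` is allowed, which is the "outer to inner" direction.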

Environment variables

See .env.example for the complete list. Key variables:

# Security & mode (cf. SECURITY.md)
PICARONES_PUBLIC_MODE=         # 1/true/yes for HF Space (no cloud OCR)
PICARONES_CSRF_REQUIRED=       # 1 for institutional deployment
PICARONES_BROWSE_ROOTS=        # restrict browse to specific paths

# Cloud API keys (optional)
MISTRAL_API_KEY=
OPENAI_API_KEY=
ANTHROPIC_API_KEY=
GOOGLE_APPLICATION_CREDENTIALS=
AZURE_DOC_INTEL_ENDPOINT=
AZURE_DOC_INTEL_KEY=

# GDPR (RGPD) data retention (Sprint A11)
PICARONES_UPLOAD_RETENTION_DAYS=7

For HuggingFace Spaces, set these in Settings → Variables and secrets.
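The sketch below shows one way such variables could be read into typed settings. The variable names are the ones listed above; everything else is an assumption: the `read_settings` helper, accepting `1/true/yes` for the CSRF flag as well, splitting browse roots on `os.pathsep`, and treating 7 days as the default retention.

```python
import os

TRUTHY = {"1", "true", "yes"}


def read_settings(environ=None):
    """Read a few of the variables listed above into a plain dict."""
    env = os.environ if environ is None else environ
    roots = env.get("PICARONES_BROWSE_ROOTS", "")
    return {
        "public_mode": env.get("PICARONES_PUBLIC_MODE", "").strip().lower() in TRUTHY,
        "csrf_required": env.get("PICARONES_CSRF_REQUIRED", "").strip().lower() in TRUTHY,
        # Assumption: several roots are separated by os.pathsep; the README does not say.
        "browse_roots": [p for p in roots.split(os.pathsep) if p],
        # Assumption: 7 days is used when the variable is unset.
        "upload_retention_days": int(env.get("PICARONES_UPLOAD_RETENTION_DAYS", "7")),
    }
```

Passing a dictionary instead of relying on `os.environ` keeps the function easy to test without touching the process environment.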

CI/CD

GitHub Actions: .github/workflows/

ci.yml — tests on Python 3.11/3.12/3.13 × Linux/macOS/Windows, ruff, mypy strict on core/, security scanners (bandit + pip-audit + trivy), coverage gate --cov-fail-under=85, pytest-timeout 300s.
precommit.yml — replays pre-commit hooks (catches --no-verify bypass).
release.yml — on tag v*.*.*: PyPI + ghcr.io multi-arch + GitHub Release with notes from CHANGELOG.
perf_regression.yml — weekly cron + PR-triggered: CER anti-regression on a synthetic reference corpus.
sync_to_huggingface.yml — auto-syncs main to the HF Space.

Development

pip install -e ".[dev,web]"
pre-commit install
pytest tests/ -q
ruff check picarones/ tests/
python -m mypy picarones/core/

Test suite: ~3763 tests, ~3 min on a modern laptop. Coverage floor at 85% (currently ~87%). The network marker excludes tests requiring live HTTP.

For end-to-end developer guides, see docs/developer/index.md (FR) / docs/developer/index.en.md (EN).

Conventions

Never except Exception: pass — use logger.warning("[module] degraded feature: %s", e).
One canonical home per module — circle dependency direction enforced by tests.
Engines declare execution_mode ("io" or "cpu") so the runner picks ThreadPoolExecutor vs ProcessPoolExecutor appropriately.
Hardcoded UI strings forbidden — always go through i18n (cf. docs/developer/extending-i18n.md).
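Two of these conventions translate directly into code. The sketch below pairs the logging rule with the executor choice driven by `execution_mode`; `executor_class_for` and `optional_feature` are hypothetical helpers written for this example, not functions from the Picarones runner.

```python
import logging
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

logger = logging.getLogger(__name__)


def executor_class_for(execution_mode):
    """Threads suit I/O-bound engines (cloud APIs); processes suit CPU-bound ones."""
    if execution_mode == "io":
        return ThreadPoolExecutor
    if execution_mode == "cpu":
        return ProcessPoolExecutor
    raise ValueError(f"unknown execution_mode: {execution_mode!r}")


def optional_feature(compute, default=None):
    """Run an optional computation; on failure, log the degradation instead of hiding it."""
    try:
        return compute()
    except Exception as e:  # broad on purpose, but never silent
        logger.warning("[example] degraded feature: %s", e)
        return default
```

Rejecting unknown modes with an explicit error, rather than falling back to threads, keeps a misdeclared engine from silently running in the wrong pool.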

Roadmap

Detailed history and current direction live in:

CHANGELOG.md — Keep a Changelog format, one entry per sprint up to the latest release.
docs/roadmap/evolution-2026.md — technical evolution roadmap (axes A and B for 2026+).
docs/audits/ — institutional readiness audit and remediation plan (sprints A1–A15).

Phase 1 of the institutional readiness plan (sprints A1–A11) is complete as of May 2026: CI hardening, doc consistency gates, 3-circle refactor, web hardening, perf+concurrency tests, WCAG 2.1 AA accessibility, reproducibility ops (lock files, Docker pinning), PyPI/ghcr.io release pipeline, governance & COI policies, institutional deployment guide & GDPR (RGPD) documentation.

Remaining: scientific publication track (CITATION + JOSS, sprint A12), README/SPECS final polish (this sprint and A14), external audits (RGAA + security pentest, A15).

Documentation

Audience	Entry point
End user	`docs/user/reading-a-report.md` (EN)
Developer	`docs/developer/index.md` (FR) / `docs/developer/index.en.md` (EN)
Operations / DSI	`docs/operations/deployment-institutional.md`, `docs/operations/data-retention-rgpd.md`, `docs/operations/release-process.md`
Architect	`docs/architecture.md`, `docs/api-stable.md`
Researcher	`docs/case-studies/`, `docs/reproducibility-snapshots.md`
Contributor	`CONTRIBUTING.md`, `GOVERNANCE.md`, `CODE_OF_CONDUCT.md`
Security	`SECURITY.md`
Accessibility	`ACCESSIBILITY.md`

The complete functional specification is in SPECS.md (full refresh planned in Sprint A14).

Citation

A CITATION.cff file and a Zenodo DOI will land in Sprint A12 (scientific publication track). Until then, cite the GitHub repo with the commit SHA used in your benchmark — every Picarones report embeds the commit and full snapshot for reproducibility (cf. docs/reproducibility-snapshots.md).

Contributing

See CONTRIBUTING.md (FR) / CONTRIBUTING.en.md (EN). Code of conduct: CODE_OF_CONDUCT.md (Contributor Covenant 2.1). Governance & maintainership: GOVERNANCE.md.

License

Apache License 2.0