--- title: Picarones emoji: 📜 colorFrom: blue colorTo: purple sdk: docker pinned: false --- # Picarones > **OCR/HTR Benchmarking Platform for Heritage Documents** [![CI](https://github.com/maribakulj/Picarones/actions/workflows/ci.yml/badge.svg)](https://github.com/maribakulj/Picarones/actions/workflows/ci.yml) [![Python 3.11+](https://img.shields.io/badge/python-3.11+-blue.svg)](https://www.python.org/downloads/) [![License: Apache 2.0](https://img.shields.io/badge/License-Apache%202.0-green.svg)](LICENSE) [![HuggingFace Space](https://img.shields.io/badge/%F0%9F%A4%97-HuggingFace%20Space-yellow.svg)](https://huggingface.co/spaces/Ma-Ri-Ba-Ku/Picarones) --- **Picarones** is an open-source platform for rigorously comparing OCR and HTR engines (Tesseract, Pero OCR, Kraken, cloud APIs...) and OCR+LLM pipelines on historical document corpora -- manuscripts, early printed books, and archives. It provides heritage-specific metrics (diplomatic CER, ligature scores, medieval abbreviation handling), composable OCR+LLM pipelines, interactive HTML reports, and multiple corpus import sources including IIIF, Gallica, HuggingFace, and eScriptorium. --- ## Table of Contents - [Features](#features) - [Heritage-Specific Metrics](#heritage-specific-metrics) - [OCR+LLM Pipelines](#ocr-llm-pipelines) - [Corpus Import](#corpus-import) - [Interactive HTML Report](#interactive-html-report) - [Longitudinal Tracking & Robustness](#longitudinal-tracking--robustness) - [Web Interface](#web-interface) - [Quick Start](#quick-start) - [Installation](#installation) - [From Source](#from-source) - [Docker](#docker) - [Optional Extras](#optional-extras) - [Usage](#usage) - [CLI Commands](#cli-commands) - [Web Interface](#web-interface-1) - [Pipeline Modes](#pipeline-modes) - [Supported Engines](#supported-engines) - [Normalization Profiles](#normalization-profiles) - [Error Taxonomy](#error-taxonomy) - [Project Structure](#project-structure) - [Environment Variables](#environment-variables) - [CI/CD](#cicd) - [Development](#development) - [Roadmap](#roadmap) - [Contributing](#contributing) - [License](#license) --- ## Features ### Heritage-Specific Metrics - **CER** (Character Error Rate) in four variants: raw, NFC-normalized, caseless, and **diplomatic** (historical equivalences: long s = s, u = v, i = j, etc.) - **WER**, **MER**, **WIL** with historical-aware tokenization (via [jiwer](https://github.com/jitsi/jiwer)) - **Unicode confusion matrix** -- fingerprint each engine's character-level errors - **Ligature and diacritic scores** -- track handling of fi, fl, ff, oe, ae, p-bar, and other medieval glyphs - **10-class error taxonomy** -- automatic classification of every error (visual confusion, abbreviation, segmentation, lacuna, over-normalization, etc.) - **Bootstrap 95% confidence intervals** and **Wilcoxon signed-rank tests** for statistical significance - **Intrinsic difficulty score** per document, independent of engine performance - **Line-level error distribution** with Gini coefficient and percentile analysis - **VLM hallucination detection** -- anchor score and length ratio to flag fabricated output ### OCR+LLM Pipelines - Composable chains: `tesseract -> gpt-4o`, `pero_ocr -> claude-sonnet`, zero-shot VLM, etc. - Three pipeline modes: text-only post-correction, image+text post-correction, and zero-shot - **Over-normalization detection** -- does the LLM silently modernize historical spellings? - Versioned prompt library for medieval French, early modern French, medieval Latin, medieval English, and early modern English -- both correction and zero-shot variants ### Corpus Import | Source | Method | |--------|--------| | Local folder | `picarones run --corpus ./corpus/` | | IIIF manifests (Gallica, Bodleian, BL...) | `picarones import iiif ` | | Gallica API (SRU + OCR) | `GallicaClient` / `picarones import iiif` | | HuggingFace Datasets | `picarones import hf ` | | HTR-United catalogue | `picarones import htr-united` | | eScriptorium | `EScriptoriumClient` | | ZIP upload (browser) | Web interface upload endpoint | Supported corpus formats: plain text pairs (image + ground truth), **ALTO XML**, and **PAGE XML**. ### Interactive HTML Report - **Self-contained HTML file** -- works offline, no server needed - Sortable ranking table, radar charts, histograms (powered by Chart.js) - Gallery view with dynamic filters and color-coded CER badges - GitHub-style colored diff with synchronized N-way scrolling - Triple diff view for OCR+LLM: ground truth / raw OCR / post-correction - Unicode character view: interactive confusion matrix explorer - Export to **CSV**, **JSON**, **ALTO XML**, **PAGE XML**, and annotated images ### Longitudinal Tracking & Robustness - Optional **SQLite database** to record benchmark history across runs - **CER evolution curves** over time, per engine - **Automatic regression detection** between consecutive runs - **Robustness analysis**: measure engine resilience to noise, blur, rotation, resolution reduction, and binarization - Critical degradation threshold identification ### Web Interface - **FastAPI** application with real-time **Server-Sent Events** (SSE) progress streaming - Upload corpus as a **ZIP file** directly from the browser - Dynamic engine and normalization profile selectors - Browse and re-download generated HTML reports - Bilingual **French/English** interface - Deployable on HuggingFace Spaces (Docker, port 7860) --- ## Quick Start ```bash # Clone and install git clone https://github.com/maribakulj/Picarones.git cd Picarones pip install -e . # Install Tesseract (system binary, required for the Tesseract engine) # Ubuntu/Debian sudo apt install tesseract-ocr tesseract-ocr-fra tesseract-ocr-lat # macOS brew install tesseract # Generate a demo report (no OCR engine needed) picarones demo --output demo_report.html # List available engines picarones engines # Run a benchmark picarones run --corpus ./corpus/ --engines tesseract --output results.json # Generate HTML report picarones report --results results.json --output report.html # Launch the web interface picarones serve --port 8080 ``` --- ## Installation ### From Source ```bash git clone https://github.com/maribakulj/Picarones.git cd Picarones pip install -e ".[dev,web]" # includes test and web dependencies ``` **System requirements:** - Python >= 3.11 - [Tesseract OCR 5](https://github.com/tesseract-ocr/tesseract) (for the Tesseract engine) ### Docker ```bash docker build -t picarones . docker run -p 7860:7860 \ -e MISTRAL_API_KEY=... \ -e OPENAI_API_KEY=... \ picarones ``` The Docker image is based on Python 3.11-slim, includes Tesseract 5 with language packs (fra, lat, eng, deu, ita, spa), and runs as a non-root user. A health check polls `/health` every 30 seconds. The [HuggingFace Space](https://huggingface.co/spaces/Ma-Ri-Ba-Ku/Picarones) uses this same Docker image. ### Optional Extras | Extra | Install command | What it adds | |-------|----------------|--------------| | `dev` | `pip install -e ".[dev]"` | pytest, pytest-cov, httpx, linting | | `web` | `pip install -e ".[web]"` | FastAPI, uvicorn, python-multipart | | `llm` | `pip install -e ".[llm]"` | OpenAI, Anthropic, Mistral SDKs | | `hf` | `pip install -e ".[hf]"` | HuggingFace Datasets | | `pero` | `pip install -e ".[pero]"` | Pero OCR engine | | `kraken` | `pip install -e ".[kraken]"` | Kraken engine | | `ocr-cloud` | `pip install -e ".[ocr-cloud]"` | Google Vision, AWS Textract, Azure Doc Intelligence | | `all` | `pip install -e ".[all]"` | Everything except ocr-cloud | See [INSTALL.md](INSTALL.md) for detailed instructions on Linux, macOS, Windows, and Docker. --- ## Usage ### CLI Commands | Command | Description | |---------|-------------| | `picarones run` | Run a full benchmark on a corpus | | `picarones report` | Generate an HTML report from JSON results | | `picarones demo` | Generate a demo report with synthetic data (no engine required) | | `picarones metrics` | Calculate CER/WER between two text files | | `picarones engines` | List all available OCR engines and LLM adapters | | `picarones info` | Display version and system information | | `picarones serve` | Launch the FastAPI web interface | | `picarones history` | Query longitudinal benchmark history (SQLite) | | `picarones robustness` | Run robustness analysis with degraded images | | `picarones import` | Import corpus from IIIF, HuggingFace, or HTR-United | **Examples:** ```bash # Benchmark with Tesseract, French language, PSM 6 picarones run --corpus ./manuscripts/ --engines tesseract --lang fra --psm 6 \ --output results.json --verbose # Compare two text files picarones metrics --reference ground_truth.txt --hypothesis ocr_output.txt # Import 10 pages from a Gallica IIIF manifest picarones import iiif https://gallica.bnf.fr/ark:/12148/xxx/manifest.json --pages 1-10 # Import a HuggingFace dataset picarones import hf medieval-ocr/dataset-name # View benchmark history with regression detection picarones history --engine tesseract --regression # Robustness demo (noise, blur, rotation, resolution) picarones robustness --corpus ./gt/ --engine tesseract --demo # Fail CI if CER exceeds threshold picarones run --corpus ./corpus/ --engines tesseract --fail-if-cer-above 0.15 ``` ### Web Interface ```bash picarones serve --host 0.0.0.0 --port 8080 ``` **API endpoints include:** | Endpoint | Method | Description | |----------|--------|-------------| | `/` | GET | Main single-page application | | `/api/status` | GET | Version and application status | | `/api/engines` | GET | Available OCR/LLM engines | | `/api/normalization/profiles` | GET | Normalization profiles (read dynamically) | | `/api/benchmark/start` | POST | Start a benchmark job (returns `job_id`) | | `/api/benchmark/{job_id}/stream` | GET | SSE real-time progress stream | | `/api/benchmark/{job_id}/cancel` | POST | Cancel a running benchmark | | `/api/corpus/browse` | GET | Browse server-side corpus folders | | `/api/htr-united/catalogue` | GET | Browse HTR-United catalogue | | `/api/huggingface/search` | GET | Search HuggingFace datasets | | `/reports/{filename}` | GET | Download generated HTML reports | ### Pipeline Modes Picarones supports three modes for OCR+LLM pipelines: | Mode | Description | Model type | |------|-------------|------------| | `zero_shot` | LLM receives the image directly and transcribes without prior OCR | VLM (vision) | | `post_correction_texte` | OCR produces raw text, then LLM corrects it | Text-only LLM | | `post_correction_image_texte` | OCR produces raw text, then LLM receives both image and text for correction | VLM (vision) | **Example:** `ministral-3b-latest` is a text-only model and should use `post_correction_texte`. GPT-4o and Claude support all three modes. --- ## Supported Engines | Engine | Type | Execution Mode | Installation | |--------|------|---------------|-------------| | **Tesseract 5** | Local CLI | CPU (ProcessPool) | `pip install pytesseract` + system binary | | **Pero OCR** | Local Python | CPU (ProcessPool) | `pip install pero-ocr` | | **Kraken** | Local Python | CPU (ProcessPool) | `pip install kraken` | | **Mistral OCR** | Cloud API | IO (ThreadPool) | `MISTRAL_API_KEY` env var | | **Google Vision** | Cloud API | IO (ThreadPool) | `GOOGLE_APPLICATION_CREDENTIALS` env var | | **Azure Doc Intelligence** | Cloud API | IO (ThreadPool) | `AZURE_DOC_INTEL_ENDPOINT` + `AZURE_DOC_INTEL_KEY` | | **GPT-4o** (VLM) | LLM API | IO (ThreadPool) | `OPENAI_API_KEY` env var | | **Claude Sonnet** (VLM) | LLM API | IO (ThreadPool) | `ANTHROPIC_API_KEY` env var | | **Mistral Large** (LLM) | LLM API | IO (ThreadPool) | `MISTRAL_API_KEY` env var | | **Ollama** (local LLM) | Local LLM | IO (ThreadPool) | `ollama serve` running locally | | **Custom engine** | CLI or API | Configurable | YAML declaration, no code required | Engines declare their `execution_mode` (`"io"` or `"cpu"`), allowing the runner to use `ThreadPoolExecutor` for IO-bound engines and `ProcessPoolExecutor` for CPU-bound engines simultaneously. --- ## Normalization Profiles Picarones includes eight built-in diplomatic normalization profiles designed for historical text comparison. These reduce noise from expected orthographic variation so metrics reflect genuine OCR errors, not historical spelling differences. | Profile | Period | Key equivalences | |---------|--------|-----------------| | `nfc` | Any | Unicode NFC normalization only | | `minimal` | Any | NFC + long s (ſ -> s) | | `medieval_french` | 12th-15th c. | ſ=s, u=v, i=j, y=i, ae=ae, oe=oe, p-bar=per, etc. | | `early_modern_french` | 16th-18th c. | ſ=s, ae=ae, oe=oe, y-tilde=yn, &=et | | `medieval_latin` | 12th-15th c. | ſ=s, u=v, i=j, ae=ae, oe=oe, p-bar=per, q-bar=que | | `medieval_english` | 12th-15th c. | ſ=s, u=v, i=j, thorn=th, eth=th, yogh=y, p-bar=per | | `early_modern_english` | 16th-18th c. | ſ=s, u=v, i=j, vv=w, thorn=th, eth=th, yogh=y | | `secretary_hand` | 16th-17th c. | Early modern English + secretary hand visual confusions | Custom profiles can be loaded from YAML files with user-defined diplomatic tables. --- ## Error Taxonomy Every character-level error is automatically classified into one of 10 categories: | Class | Name | Description | |-------|------|-------------| | 1 | `visual_confusion` | Morphologically similar characters (rn/m, l/1, O/0, u/n) | | 2 | `diacritic_error` | Missing, incorrect, or spurious diacritical mark | | 3 | `case_error` | Case difference only (A/a) | | 4 | `ligature_error` | Ligature not resolved or incorrectly resolved | | 5 | `abbreviation_error` | Medieval abbreviation not expanded | | 6 | `hapax` | Word not found in any reference lexicon | | 7 | `segmentation_error` | Token fusion or fragmentation (words/lines) | | 8 | `oov_character` | Character outside the engine's vocabulary | | 9 | `lacuna` | Text present in ground truth but absent from OCR output | | 10 | `over_normalization` | LLM silently modernized a historical spelling | --- ## Project Structure ``` picarones/ ├── __init__.py # Version (1.0.0), package metadata ├── cli.py # Click CLI: run, demo, report, metrics, engines, info, │ # serve, import, history, robustness ├── fixtures.py # Realistic synthetic test data (medieval documents) │ ├── core/ │ ├── corpus.py # Corpus loading (folder, ALTO XML, PAGE XML) │ ├── metrics.py # CER, WER, MER, WIL (via jiwer) │ ├── normalization.py # Unicode normalization, 8 diplomatic profiles │ ├── statistics.py # Bootstrap CI 95%, Wilcoxon test, correlations │ ├── runner.py # Benchmark orchestrator (ThreadPool + ProcessPool) │ ├── results.py # DocumentResult, BenchmarkResults, JSON export │ ├── confusion.py # Unicode confusion matrix │ ├── char_scores.py # Ligature and diacritic scores │ ├── taxonomy.py # 10-class error taxonomy │ ├── structure.py # Structural analysis (blocks, lines, words) │ ├── image_quality.py # Image quality metrics (contrast, noise, resolution) │ ├── difficulty.py # Intrinsic difficulty score per document │ ├── hallucination.py # VLM hallucination detection │ ├── line_metrics.py # Line-level error distribution (Gini, percentiles) │ ├── history.py # SQLite longitudinal tracking │ └── robustness.py # Robustness analysis (noise, blur, rotation, resolution) │ ├── engines/ │ ├── base.py # BaseOCREngine (execution_mode: "io" | "cpu") │ ├── tesseract.py # Tesseract 5 adapter (CPU) │ ├── pero_ocr.py # Pero OCR adapter (CPU) │ ├── mistral_ocr.py # Mistral OCR API (/v1/ocr endpoint) │ ├── google_vision.py # Google Cloud Vision adapter │ └── azure_doc_intel.py # Azure Document Intelligence adapter │ ├── llm/ │ ├── base.py # BaseLLMAdapter interface │ ├── openai_adapter.py # OpenAI / GPT-4o adapter │ ├── anthropic_adapter.py # Anthropic / Claude adapter │ ├── mistral_adapter.py # Mistral chat completions adapter │ └── ollama_adapter.py # Ollama local LLM adapter │ ├── pipelines/ │ ├── base.py # OCRLLMPipeline orchestrator │ └── over_normalization.py # Over-normalization detection │ ├── prompts/ # Versioned prompt templates (FR + EN) │ ├── correction_medieval_french.txt │ ├── zero_shot_medieval_french.txt │ ├── correction_imprime_ancien.txt │ ├── zero_shot_imprime_ancien.txt │ ├── correction_medieval_english.txt │ ├── zero_shot_medieval_english.txt │ ├── correction_early_modern_english.txt │ └── correction_image_medieval_french.txt │ ├── report/ │ ├── generator.py # Self-contained HTML report (Chart.js + diff2html) │ └── diff_utils.py # Diff computation utilities │ ├── web/ │ ├── app.py # FastAPI app (SSE, ZIP upload, dynamic endpoints) │ └── static/ # CSS assets │ └── importers/ ├── iiif.py # IIIF manifest importer ├── gallica.py # Gallica API client (BnF) ├── htr_united.py # HTR-United catalogue importer ├── huggingface.py # HuggingFace Datasets importer └── escriptorium.py # eScriptorium client tests/ # ~1020 unit and integration tests .github/workflows/ ├── ci.yml # CI: Python 3.11/3.12, Linux/macOS/Windows └── sync_to_huggingface.yml # Auto-sync to HuggingFace Space on push to main Dockerfile # Multi-stage Docker build for HuggingFace Spaces ``` --- ## Environment Variables Configure API keys depending on which engines and LLM adapters you use: ```bash # LLM APIs export OPENAI_API_KEY="sk-..." export ANTHROPIC_API_KEY="sk-ant-..." export MISTRAL_API_KEY="..." # Cloud OCR APIs (optional) export GOOGLE_APPLICATION_CREDENTIALS="/path/to/credentials.json" export AWS_ACCESS_KEY_ID="..." export AWS_SECRET_ACCESS_KEY="..." export AWS_DEFAULT_REGION="eu-west-1" export AZURE_DOC_INTEL_ENDPOINT="https://..." export AZURE_DOC_INTEL_KEY="..." ``` For deployment on HuggingFace Spaces, set these in **Settings > Variables and secrets**. --- ## CI/CD ### GitHub Actions (`ci.yml`) - **Triggers:** push to `main`/`develop`/`feature/*`/`sprint/*`/`claude/*`, PRs to `main`/`develop`, manual dispatch - **Matrix:** Python 3.11 + 3.12 on Linux, macOS, and Windows - **Jobs:** 1. **Tests** -- full pytest suite with coverage, uploaded to Codecov 2. **Demo** -- end-to-end demo report generation with history and robustness 3. **Build** -- wheel and sdist with twine validation 4. **Lint** -- ruff check for errors (E, W, F; ignores E501, E402) ### HuggingFace Sync (`sync_to_huggingface.yml`) - Automatically pushes `main` to the HuggingFace Space `Ma-Ri-Ba-Ku/Picarones` - Requires the `HF_TOKEN` secret in GitHub repository settings --- ## Development ```bash # Install with dev + web dependencies pip install -e ".[dev,web]" # Run the test suite pytest tests/ -q --tb=short # Run with coverage pytest tests/ --cov=picarones --cov-report=term-missing # Generate a demo report picarones demo --output demo_report.html # Launch the web UI in development mode picarones serve --port 8080 # Full refresh (useful in Codespaces) git pull && pip install -e ".[dev,web]" && picarones demo --output demo.html ``` **Key development conventions:** - Never use bare `except Exception: pass` -- always log with `logger.warning()` - Normalization profiles are read dynamically from `picarones/core/normalization.py` -- never hardcode them in endpoint handlers - Engines declare their `execution_mode` (`"io"` or `"cpu"`) so the runner can select the appropriate executor - `python-multipart` must remain in dependencies (FastAPI checks at import time) --- ## Roadmap | Sprint | Status | Deliverables | |--------|--------|-------------| | 1 | Done | Project structure, Tesseract, Pero OCR, CER/WER, CLI | | 2 | Done | HTML report v1: Chart.js, colored diff, gallery | | 3 | Done | OCR+LLM pipelines, GPT-4o, Claude, Mistral, Ollama | | 4 | Done | Cloud OCR APIs, IIIF import, diplomatic normalization | | 5 | Done | Advanced metrics: confusion matrix, ligatures, 9-class taxonomy | | 6 | Done | FastAPI web interface, HTR-United, HuggingFace, bilingual UI | | 7 | Done | HTML report v2: Wilcoxon, bootstrap, clustering, difficulty score | | 8 | Done | eScriptorium, Gallica API, SQLite history, robustness analysis | | 9 | Done | Documentation, packaging, Docker, CI/CD, PyInstaller, v1.0.0-Beta | | 10 | Done | Line error distribution (Gini), VLM hallucination detection | | 11 | Done | Internationalization FR/EN, English normalization profiles | | 12 | Done | Browser ZIP upload, macOS file filtering, dynamic model selector | | 13 | Done | pyproject.toml cleanup, runner parallelization, NDJSON streaming, Wilcoxon validation | --- ## Contributing See [CONTRIBUTING.md](CONTRIBUTING.md) for instructions on adding an OCR engine, an LLM adapter, or submitting a pull request. --- ## License [Apache License 2.0](LICENSE) Copyright 2024 Picarones contributors.