---
title: Picarones
emoji: π
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
---
# Picarones
> **Heritage OCR / HTR / VLM and post-correction benchmarking**
>
> **Banc d'essai d'OCR / HTR / VLM et de post-correction pour documents patrimoniaux**
[CI](https://github.com/maribakulj/Picarones/actions/workflows/ci.yml)
[Python](https://www.python.org/downloads/)
[License](LICENSE)
[HuggingFace Space](https://huggingface.co/spaces/Ma-Ri-Ba-Ku/Picarones)
---
**Picarones** is an open-source benchmarking platform for OCR, HTR, VLM
and post-correction pipelines on heritage documents.

The input is a folder of `(image, ground truth)` pairs, with ground truth in
plain text (`.gt.txt`), ALTO XML, or PAGE XML, hand-annotated or sourced
from a reference corpus. Picarones runs the AIs you plug in (OCR engines,
VLMs, OCR+LLM pipelines, ALTO mappers, ensembles...) on every page,
compares each output to the ground truth at every relevant level (text,
ALTO, PAGE, entities, reading order), and produces a self-contained HTML
report with factual numbers, statistical tests and a reproducibility
snapshot. Without ground truth there is no benchmark: Picarones measures how
well an AI matches a known reference, not how well it transcribes an
arbitrary document.

Typical workflow: assemble a small golden dataset whose profile (script
type, period, language) matches the production corpus you intend to
process; benchmark candidate AIs on it; read the report; decide which AI
to deploy. Picarones does not yet ship a curated library of standard
datasets: the importers (IIIF, Gallica, HuggingFace, HTR-United,
eScriptorium, ZIP) help fetch existing data, but curation remains yours.

Key features: heritage-specific metrics (diplomatic CER, ligature and diacritic scores,
medieval abbreviations, Roman numerals, foliation, fuzzy searchability,
philological marker fidelity), composable pipelines, factual narrative
synthesis at the top of the report, multi-engine Friedman/Nemenyi tests
with critical difference diagram, cost / speed / CO₂ Pareto analysis,
per-junction error absorption, multi-run stability, and controlled per-slot
comparison.

> *Version française ci-dessous.*
---
## Use case
An archive, a digital library or a heritage service plans to OCR a
production corpus -- say, several thousand 17th-century parish registers,
19th-century newspapers, or medieval glossed manuscripts. Several
candidate pipelines are on the table (alternative OCR, LLM correction,
ALTO mappers, ensembles); the question is which one to deploy.

The candidates cannot be benchmarked on the production corpus itself
(no ground truth). A small golden dataset matching the target profile is
assembled; Picarones runs each candidate on it and reports CER gain,
recovered fuzzy searchability, preserved numerical sequences, errors
introduced by post-correctors and statistical significance. The numbers
inform the deployment decision.
---
## En français
**Picarones** est une plateforme open source de banc d'essai pour des IA
d'OCR, HTR, VLM et des pipelines de post-correction sur documents
patrimoniaux.

L'entrée est un dossier de paires `(image, vérité terrain)` -- VT en
texte brut (`.gt.txt`), ALTO XML, ou PAGE XML, annotée à la main ou
issue d'un corpus de référence. Picarones exécute les IA que vous
branchez (moteurs OCR, VLM, pipelines OCR+LLM, mappeurs ALTO,
ensembles...) sur chaque page, compare la sortie à la VT à tous les
niveaux pertinents (texte, ALTO, PAGE, entités, ordre de lecture) et
produit un rapport HTML autonome avec chiffres factuels, tests
statistiques et snapshot de reproductibilité. Sans vérité terrain, pas
de benchmark -- Picarones mesure à quel point une IA matche une référence
connue, pas à quel point elle transcrit un document quelconque.

Workflow type : constituer un golden dataset dont le profil (type
d'écriture, période, langue) correspond au corpus de production à
traiter ; benchmarker les IA candidates dessus ; lire le rapport ;
décider quelle IA déployer. Picarones ne fournit pas encore de
bibliothèque curatée de datasets standards -- les importers (IIIF,
Gallica, HuggingFace, HTR-United, eScriptorium, ZIP) aident à récupérer
des données existantes, la curation reste à votre charge.

Métriques spécifiques aux corpus patrimoniaux (CER diplomatique, scores
de ligatures et diacritiques, abréviations médiévales, numéraux romains,
foliotation, recherchabilité fuzzy, fidélité aux marqueurs
philologiques), pipelines composables, synthèse narrative factuelle au
sommet du rapport, tests Friedman/Nemenyi multi-moteurs avec diagramme
de différence critique, analyse Pareto coût/vitesse/CO₂, absorption
d'erreur par jonction, stabilité multi-runs, comparaison contrôlée par
slot.
### Cas d'usage type
Une archive, une bibliothèque numérique ou un service patrimonial
prévoit d'OCRiser un corpus de production -- par exemple plusieurs
milliers de registres paroissiaux du XVIIᵉ, de presse du XIXᵉ ou de
manuscrits glosés médiévaux. Plusieurs pipelines candidats sont sur la
table (OCR alternatif, correction LLM, mappeurs ALTO, ensembles) ;
reste à décider lequel déployer.

Les candidats ne peuvent pas être benchmarkés sur le corpus de
production lui-même (pas de VT). On constitue un golden dataset
matching le profil cible ; Picarones exécute chaque candidat dessus et
remonte le gain CER, la recherchabilité fuzzy gagnée, les séquences
numériques préservées, les erreurs introduites par les post-correcteurs
et la significativité statistique. Les chiffres nourrissent la décision
de déploiement.
---
## Table of Contents
- [Features](#features)
  - [Heritage-Specific Metrics](#heritage-specific-metrics)
  - [OCR+LLM Pipelines](#ocrllm-pipelines)
  - [Corpus Import](#corpus-import)
  - [Interactive HTML Report](#interactive-html-report)
  - [Longitudinal Tracking & Robustness](#longitudinal-tracking--robustness)
  - [Web Interface](#web-interface)
- [Quick Start](#quick-start)
- [Installation](#installation)
  - [From Source](#from-source)
  - [Docker](#docker)
  - [Optional Extras](#optional-extras)
- [Usage](#usage)
  - [CLI Commands](#cli-commands)
  - [Web Interface](#web-interface-1)
  - [Pipeline Modes](#pipeline-modes)
- [Supported Engines](#supported-engines)
- [Normalization Profiles](#normalization-profiles)
- [Error Taxonomy](#error-taxonomy)
- [Project Structure](#project-structure)
- [Environment Variables](#environment-variables)
- [CI/CD](#cicd)
- [Development](#development)
- [Roadmap](#roadmap)
- [Contributing](#contributing)
- [License](#license)
---
## Features
### Heritage-Specific Metrics
- **CER** (Character Error Rate) in four variants: raw, NFC-normalized, caseless, and
**diplomatic** (historical equivalences: long s = s, u = v, i = j, etc.)
- **WER**, **MER**, **WIL** with history-aware tokenization (via [jiwer](https://github.com/jitsi/jiwer))
- **Unicode confusion matrix** -- fingerprint each engine's character-level errors
- **Ligature and diacritic scores** -- track handling of fi, fl, ff, oe, ae, p-bar, and other
medieval glyphs
- **10-class error taxonomy** -- automatic classification of every error (visual confusion,
abbreviation, segmentation, lacuna, over-normalization, etc.)
- **Bootstrap 95% confidence intervals**, **Wilcoxon signed-rank tests**, and the
  **Friedman test + Nemenyi post-hoc** with a **Critical Difference Diagram** (Demšar 2006)
  for rigorous multi-engine comparison
- **Intrinsic difficulty score** per document, independent of engine performance
- **Line-level error distribution** with Gini coefficient and percentile analysis
- **VLM hallucination detection** -- anchor score and length ratio to flag fabricated output
- **Cost / speed / carbon Pareto front** (local vs cloud, per-token pricing model)
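
To make the four CER variants listed above concrete, here is a minimal sketch. It uses a plain Levenshtein distance rather than jiwer, and `DIPLOMATIC_TABLE`, `levenshtein`, `cer` and `cer_variants` are illustrative names, not Picarones' own API. The equivalence table is a tiny assumed subset of the historical equivalences.

```python
import unicodedata

# Assumed, illustrative subset of historical equivalences (long s = s, u = v, i = j).
DIPLOMATIC_TABLE = str.maketrans({"ſ": "s", "v": "u", "j": "i"})


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return levenshtein(reference, hypothesis) / len(reference)


def cer_variants(reference: str, hypothesis: str) -> dict[str, float]:
    """Apply the same normalizer to both sides, then compute CER."""
    normalizers = {
        "raw": lambda s: s,
        "nfc": lambda s: unicodedata.normalize("NFC", s),
        "caseless": lambda s: unicodedata.normalize("NFC", s).casefold(),
        "diplomatic": lambda s: unicodedata.normalize("NFC", s).translate(DIPLOMATIC_TABLE),
    }
    return {name: cer(norm(reference), norm(hypothesis))
            for name, norm in normalizers.items()}


scores = cer_variants("vne maiſon", "une maison")
```

On this toy pair the raw and diplomatic scores differ because the two substitutions (`v`/`u`, `ſ`/`s`) are historical equivalences rather than recognition errors. Note that Python's `casefold()` already maps `ſ` to `s`, so the caseless variant absorbs that one as well.
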
### OCR+LLM Pipelines
- Composable chains: `tesseract -> gpt-4o`, `pero_ocr -> claude-sonnet`, zero-shot VLM, etc.
- Three pipeline modes: text-only post-correction, image+text post-correction, and zero-shot
- **Over-normalization detection** -- does the LLM silently modernize historical spellings?
- Versioned prompt library for medieval French, early modern French, medieval Latin, medieval
English, and early modern English -- both correction and zero-shot variants
### Corpus Import
| Source | Method |
|--------|--------|
| Local folder | `picarones run --corpus ./corpus/` |
| IIIF manifests (institutional repositories) | `picarones import iiif <manifest-url>` |
| Gallica API (SRU + OCR) | `GallicaClient` / `picarones import iiif` |
| HuggingFace Datasets | Web interface (`POST /api/huggingface/import`) |
| HTR-United catalogue | Web interface (`POST /api/htr-united/import`) |
| eScriptorium | `EScriptoriumClient` |
| ZIP upload (browser) | Web interface upload endpoint |
Supported corpus formats: plain text pairs (image + ground truth), **ALTO XML**, and **PAGE XML**.
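
As a rough illustration of the plain-text pair format, the sketch below walks a folder and pairs each image with its `.gt.txt` file. The naming convention (`page.png` next to `page.gt.txt`), the suffix list and the `find_pairs` helper are assumptions made for this example; the actual loader lives in `picarones/core/corpus.py` and may behave differently.

```python
from pathlib import Path

# Assumed set of image extensions, for illustration only.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff"}


def find_pairs(corpus_dir: str) -> list[tuple[Path, Path]]:
    """Return (image, ground truth) pairs, skipping images without a .gt.txt file."""
    pairs = []
    for image in sorted(Path(corpus_dir).iterdir()):
        if image.suffix.lower() not in IMAGE_SUFFIXES:
            continue
        # page_001.png -> page_001.gt.txt (hypothetical naming convention)
        ground_truth = image.with_suffix(".gt.txt")
        if ground_truth.is_file():
            pairs.append((image, ground_truth))
    return pairs
```

Images with no matching ground truth are skipped rather than reported as errors here, which matches the principle that a page without a reference cannot be benchmarked.
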
### Interactive HTML Report
- **Self-contained HTML file** -- works offline, no server needed (Jinja2-templated since Sprint 17)
- **Factual narrative synthesis** at the top of the report (Sprint 19): 12 deterministic
detectors extract salient facts (global leader, significant gap, stratum collapse, VLM
hallucination flag, speed winner, cost outlier, Pareto alternative, ...) and render them
as short sentences -- every number is traceable to the source payload, no LLM, no
hallucination risk
- **Critical Difference Diagram** (CDD) rendered server-side as static SVG -- no JS required
- **Cost / speed / carbon Pareto chart** with toggleable axes and highlighted Pareto front
- **Contextual glossary**: a `?` icon next to every metric header opens a side panel with
definition, what it measures, usage, limits, and reference (25 bilingual entries)
- **Advanced mode panel**: visible-column picker, per-stratum filter, and opt-in personal
composite score (sliders default to 0, formula always visible, explicit warning that no
universal weighting exists). State is persisted in the URL.
- Sortable ranking table, radar charts, histograms (powered by Chart.js)
- Gallery view with dynamic filters and color-coded CER badges
- GitHub-style colored diff with synchronized N-way scrolling
- Triple diff view for OCR+LLM: ground truth / raw OCR / post-correction
- Unicode character view: interactive confusion matrix explorer
- Export to **CSV**, **JSON**, **ALTO XML**, **PAGE XML**, and annotated images
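
The highlighted Pareto front in the cost / speed chart can be understood with a short sketch: an engine is on the front if no other engine is at least as good on every axis and strictly better on one. `EnginePoint`, its fields and the numbers below are hypothetical and chosen only for illustration; they are not taken from `pricing.py` or from any real benchmark.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EnginePoint:
    """Hypothetical per-engine summary; every axis is 'lower is better'."""
    name: str
    cer: float
    cost_per_page: float
    seconds_per_page: float

    def axes(self) -> tuple[float, float, float]:
        return (self.cer, self.cost_per_page, self.seconds_per_page)


def dominates(a: EnginePoint, b: EnginePoint) -> bool:
    """True if a is no worse than b on every axis and strictly better on one."""
    pairs = list(zip(a.axes(), b.axes()))
    return all(x <= y for x, y in pairs) and any(x < y for x, y in pairs)


def pareto_front(points: list[EnginePoint]) -> list[EnginePoint]:
    """Keep the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Invented numbers, for illustration only.
points = [
    EnginePoint("engine_a", cer=0.05, cost_per_page=0.01, seconds_per_page=2.0),
    EnginePoint("engine_b", cer=0.08, cost_per_page=0.00, seconds_per_page=1.0),
    EnginePoint("engine_c", cer=0.09, cost_per_page=0.02, seconds_per_page=3.0),
]
front = pareto_front(points)
```

In this toy data `engine_c` is dominated by `engine_a`, while `engine_a` and `engine_b` trade accuracy against cost and speed, so both stay on the front.
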
### Longitudinal Tracking & Robustness
- Optional **SQLite database** to record benchmark history across runs
- **CER evolution curves** over time, per engine
- **Automatic regression detection** between consecutive runs
- **Robustness analysis**: measure engine resilience to noise, blur, rotation, resolution
reduction, and binarization
- Critical degradation threshold identification
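
One way to picture regression detection between consecutive runs is the sketch below. It assumes a hypothetical `runs(engine, run_date, cer)` table and a fixed tolerance; the real schema and detection rule in `picarones/core/history.py` are not documented here and may differ.

```python
import sqlite3

# Hypothetical schema, for illustration only.
SCHEMA = "CREATE TABLE IF NOT EXISTS runs (engine TEXT, run_date TEXT, cer REAL)"


def record_run(conn: sqlite3.Connection, engine: str, run_date: str, cer: float) -> None:
    conn.execute("INSERT INTO runs VALUES (?, ?, ?)", (engine, run_date, cer))


def detect_regressions(conn: sqlite3.Connection, engine: str,
                       tolerance: float = 0.01) -> list[tuple[str, float, float]]:
    """Return (run_date, previous_cer, current_cer) where CER rose by more than tolerance."""
    rows = conn.execute(
        "SELECT run_date, cer FROM runs WHERE engine = ? ORDER BY run_date",
        (engine,),
    ).fetchall()
    return [
        (curr_date, prev_cer, curr_cer)
        for (_, prev_cer), (curr_date, curr_cer) in zip(rows, rows[1:])
        if curr_cer - prev_cer > tolerance
    ]


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
# Invented history, for illustration only.
for date, value in [("2024-01-01", 0.10), ("2024-02-01", 0.09), ("2024-03-01", 0.14)]:
    record_run(conn, "tesseract", date, value)
regressions = detect_regressions(conn, "tesseract")
```

ISO-formatted dates sort correctly as text, which is why a plain `ORDER BY run_date` is enough in this sketch.
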
### Web Interface
- **FastAPI** application with real-time **Server-Sent Events** (SSE) progress streaming
- Upload corpus as a **ZIP file** directly from the browser
- Dynamic engine and normalization profile selectors
- Browse and re-download generated HTML reports
- Bilingual **French/English** interface
- Deployable on HuggingFace Spaces (Docker, port 7860)
---
## Quick Start
```bash
# Clone and install
git clone https://github.com/maribakulj/Picarones.git
cd Picarones
pip install -e .
# Install Tesseract (system binary, required for the Tesseract engine)
# Ubuntu/Debian
sudo apt install tesseract-ocr tesseract-ocr-fra tesseract-ocr-lat
# macOS
brew install tesseract
# Generate a demo report (no OCR engine needed)
picarones demo --output demo_report.html
# List available engines
picarones engines
# Run a benchmark
picarones run --corpus ./corpus/ --engines tesseract --output results.json
# Generate HTML report
picarones report --results results.json --output report.html
# Launch the web interface
picarones serve --port 8080
```
---
## Installation
### From Source
```bash
git clone https://github.com/maribakulj/Picarones.git
cd Picarones
pip install -e ".[dev,web]" # includes test and web dependencies
```
**System requirements:**
- Python >= 3.11
- [Tesseract OCR 5](https://github.com/tesseract-ocr/tesseract) (for the Tesseract engine)
### Docker
```bash
docker build -t picarones .
docker run -p 7860:7860 \
-e MISTRAL_API_KEY=... \
-e OPENAI_API_KEY=... \
picarones
```
The Docker image is based on Python 3.11-slim, includes Tesseract 5 with language packs
(fra, lat, eng, deu, ita, spa), and runs as a non-root user. A health check polls
`/health` every 30 seconds.
The [HuggingFace Space](https://huggingface.co/spaces/Ma-Ri-Ba-Ku/Picarones) uses this
same Docker image.
### Optional Extras
| Extra | Install command | What it adds |
|-------|----------------|--------------|
| `dev` | `pip install -e ".[dev]"` | pytest, pytest-cov, httpx, FastAPI, uvicorn, python-multipart |
| `web` | `pip install -e ".[web]"` | FastAPI, uvicorn, python-multipart, httpx |
| `stats` | `pip install -e ".[stats]"` | scipy (exact Wilcoxon/Friedman/Nemenyi -- otherwise pure-Python fallback) |
| `llm` | `pip install -e ".[llm]"` | OpenAI, Anthropic, Mistral SDKs |
| `hf` | `pip install -e ".[hf]"` | HuggingFace Datasets |
| `pero` | `pip install -e ".[pero]"` | Pero OCR engine |
| `kraken` | `pip install -e ".[kraken]"` | Kraken engine |
| `ocr-cloud` | `pip install -e ".[ocr-cloud]"` | Google Vision, AWS (boto3), Azure Doc Intelligence |
| `all` | `pip install -e ".[all]"` | `web` + `hf` + `llm` + `dev` (no `ocr-cloud`) |
See [INSTALL.md](INSTALL.md) for detailed instructions on Linux, macOS, Windows, and Docker.
---
## Usage
### CLI Commands
| Command | Description |
|---------|-------------|
| `picarones run` | Run a full benchmark on a corpus |
| `picarones report` | Generate an HTML report from JSON results |
| `picarones demo` | Generate a demo report with synthetic data (no engine required) |
| `picarones metrics` | Calculate CER/WER between two text files |
| `picarones engines` | List all available OCR engines and LLM adapters |
| `picarones info` | Display version and system information |
| `picarones serve` | Launch the FastAPI web interface |
| `picarones history` | Query longitudinal benchmark history (SQLite) |
| `picarones robustness` | Run robustness analysis with degraded images |
| `picarones import iiif` | Import corpus from an IIIF manifest (any institutional repository) |

HTR-United and HuggingFace imports are exposed through the web interface (`/api/htr-united/import`, `/api/huggingface/import`).

**Examples:**
```bash
# Benchmark with Tesseract, French language, PSM 6
picarones run --corpus ./manuscripts/ --engines tesseract --lang fra --psm 6 \
--output results.json --verbose
# Compare two text files
picarones metrics --reference ground_truth.txt --hypothesis ocr_output.txt
# Import 10 pages from any IIIF manifest URL
picarones import iiif https://institution.example/iiif/xxx/manifest.json --pages 1-10
# HuggingFace and HTR-United imports are available via the web UI at
# http://localhost:8000/ (endpoints POST /api/huggingface/import and /api/htr-united/import)
# View benchmark history with regression detection
picarones history --engine tesseract --regression
# Robustness demo (noise, blur, rotation, resolution)
picarones robustness --corpus ./gt/ --engine tesseract --demo
# Fail CI if CER exceeds threshold
picarones run --corpus ./corpus/ --engines tesseract --fail-if-cer-above 0.15
```
### Web Interface
```bash
picarones serve --host 0.0.0.0 --port 8080
```
**API endpoints include:**
| Endpoint | Method | Description |
|----------|--------|-------------|
| `/` | GET | Main single-page application |
| `/api/status` | GET | Version and application status |
| `/api/engines` | GET | Available OCR/LLM engines |
| `/api/normalization/profiles` | GET | Normalization profiles (read dynamically) |
| `/api/benchmark/start` | POST | Start a benchmark job (returns `job_id`) |
| `/api/benchmark/{job_id}/stream` | GET | SSE real-time progress stream |
| `/api/benchmark/{job_id}/cancel` | POST | Cancel a running benchmark |
| `/api/corpus/browse` | GET | Browse server-side corpus folders |
| `/api/htr-united/catalogue` | GET | Browse HTR-United catalogue |
| `/api/huggingface/search` | GET | Search HuggingFace datasets |
| `/reports/{filename}` | GET | Download generated HTML reports |
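
A client consuming `/api/benchmark/{job_id}/stream` has to group raw Server-Sent Events lines into events. The parser below is a minimal sketch of that step, following the standard SSE framing (`event:` and `data:` fields, blank line as terminator). The event names and JSON payloads Picarones actually emits are not specified here, so any payload shape you feed it is an assumption.

```python
from typing import Iterable, Iterator


def iter_sse_events(lines: Iterable[str]) -> Iterator[dict[str, str]]:
    """Group raw SSE lines into events; a blank line terminates each event."""
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            # Blank line: dispatch the buffered event, then reset.
            if data:
                yield {"event": event, "data": "\n".join(data)}
            event, data = "message", []
        elif line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        else:
            field, _, value = line.partition(":")
            if value.startswith(" "):
                value = value[1:]  # a single leading space is not part of the value
            if field == "event":
                event = value
            elif field == "data":
                data.append(value)
```

In practice `lines` would be the decoded lines of a streaming HTTP response; any iterable of strings works, which keeps the parser easy to test offline.
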
### Pipeline Modes
Picarones supports three modes for OCR+LLM pipelines:
| Mode | Description | Model type |
|------|-------------|------------|
| `zero_shot` | LLM receives the image directly and transcribes without prior OCR | VLM (vision) |
| `post_correction_texte` | OCR produces raw text, then LLM corrects it | Text-only LLM |
| `post_correction_image_texte` | OCR produces raw text, then LLM receives both image and text for correction | VLM (vision) |
**Example:** `ministral-3b-latest` is a text-only model and should use `post_correction_texte`.
GPT-4o and Claude support all three modes.
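
The difference between the three modes comes down to what the model receives. The sketch below spells that out; the mode strings come from the table above, while `PipelineMode`, `run_pipeline` and the `ocr` / `llm` callables are hypothetical stand-ins, not the `OCRLLMPipeline` interface.

```python
from enum import Enum
from typing import Callable, Optional


class PipelineMode(str, Enum):
    ZERO_SHOT = "zero_shot"
    POST_CORRECTION_TEXT = "post_correction_texte"
    POST_CORRECTION_IMAGE_TEXT = "post_correction_image_texte"


def run_pipeline(
    mode: str,
    image_path: str,
    ocr: Callable[[str], str],
    llm: Callable[[Optional[str], Optional[str]], str],
) -> str:
    """Dispatch on the mode; llm(text, image_path) receives None for unused inputs."""
    selected = PipelineMode(mode)  # raises ValueError on an unknown mode
    if selected is PipelineMode.ZERO_SHOT:
        return llm(None, image_path)      # image only, no prior OCR
    raw_text = ocr(image_path)
    if selected is PipelineMode.POST_CORRECTION_TEXT:
        return llm(raw_text, None)        # text only, suits text-only models
    return llm(raw_text, image_path)      # image and OCR text together
```

A text-only model can only be wired into the branch that passes `None` for the image, which is why such models are restricted to `post_correction_texte`.
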
---
## Supported Engines
| Engine | Type | Execution Mode | Installation |
|--------|------|---------------|-------------|
| **Tesseract 5** | Local CLI | CPU (ProcessPool) | `pip install pytesseract` + system binary |
| **Pero OCR** | Local Python | CPU (ProcessPool) | `pip install pero-ocr` |
| **Kraken** | Local Python | CPU (ProcessPool) | `pip install kraken` |
| **Mistral OCR** | Cloud API | IO (ThreadPool) | `MISTRAL_API_KEY` env var |
| **Google Vision** | Cloud API | IO (ThreadPool) | `GOOGLE_APPLICATION_CREDENTIALS` env var |
| **Azure Doc Intelligence** | Cloud API | IO (ThreadPool) | `AZURE_DOC_INTEL_ENDPOINT` + `AZURE_DOC_INTEL_KEY` |
| **GPT-4o** (VLM) | LLM API | IO (ThreadPool) | `OPENAI_API_KEY` env var |
| **Claude Sonnet** (VLM) | LLM API | IO (ThreadPool) | `ANTHROPIC_API_KEY` env var |
| **Mistral Large** (LLM) | LLM API | IO (ThreadPool) | `MISTRAL_API_KEY` env var |
| **Ollama** (local LLM) | Local LLM | IO (ThreadPool) | `ollama serve` running locally |
| **Custom engine** | CLI or API | Configurable | YAML declaration, no code required |
Engines declare their `execution_mode` (`"io"` or `"cpu"`), allowing the runner to use
`ThreadPoolExecutor` for IO-bound engines and `ProcessPoolExecutor` for CPU-bound engines
simultaneously.
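
A minimal sketch of that executor selection, assuming each engine exposes an `execution_mode` attribute and a `transcribe(page)` method. `executor_class_for`, `run_engine` and `transcribe` are illustrative names; the real runner in `picarones/core/runner.py` also schedules both pools at once, which this sketch does not attempt.

```python
from concurrent.futures import Executor, ProcessPoolExecutor, ThreadPoolExecutor

EXECUTORS: dict[str, type[Executor]] = {
    "io": ThreadPoolExecutor,    # API calls spend their time waiting on the network
    "cpu": ProcessPoolExecutor,  # local engines need real parallelism across cores
}


def executor_class_for(execution_mode: str) -> type[Executor]:
    """Map an engine's declared execution_mode to an executor class."""
    try:
        return EXECUTORS[execution_mode]
    except KeyError:
        raise ValueError(f"unknown execution_mode: {execution_mode!r}") from None


def run_engine(engine, pages: list[str], max_workers: int = 4) -> list[str]:
    """Transcribe every page with the executor matching the engine's mode.

    With "cpu" engines the work runs in worker processes, so the engine
    and the pages must be picklable.
    """
    executor_cls = executor_class_for(engine.execution_mode)
    with executor_cls(max_workers=max_workers) as pool:
        return list(pool.map(engine.transcribe, pages))
```

Keeping the mapping in one dictionary means a new engine only has to declare its mode; the runner never needs to special-case engine names.
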
---
## Normalization Profiles
Picarones ships **11 built-in normalization profiles** designed for historical text comparison.
These reduce noise from expected orthographic variation so metrics reflect genuine OCR errors,
not historical spelling differences. The canonical list is defined in
[`picarones/core/normalization.py`](picarones/core/normalization.py) (`NORMALIZATION_PROFILES`)
and is exposed dynamically via `/api/normalization/profiles`.
| Profile | Period | Key equivalences |
|---------|--------|-----------------|
| `nfc` | Any | Unicode NFC normalization only |
| `caseless` | Any | NFC + case folding (`casefold`) |
| `minimal` | Any | NFC + long s (ſ -> s) |
| `medieval_french` | 12th-15th c. | ſ=s, u=v, i=j, y=i, æ=ae, œ=oe, ꝑ=per, & = et |
| `early_modern_french` | 16th-18th c. | ſ=s, æ=ae, œ=oe |
| `medieval_latin` | 12th-15th c. | ſ=s, u=v, i=j, ꝑ=per, ꝓ=pro |
| `medieval_english` | 12th-15th c. | ſ=s, u=v, i=j, þ=th, ȝ=y, ꝑ=per, ꝓ=pro |
| `early_modern_english` | 16th-18th c. | ſ=s, u=v, i=j, vv=w, þ=th, ð=th, ȝ=y |
| `secretary_hand` | 16th-17th c. | Early Modern English + secretary hand visual confusions |
| `sans_ponctuation` | Any | NFC + strips `. , ; : ! ? ' " - – — ( ) [ ]` |
| `sans_apostrophes` | Any | NFC + strips straight (`'`) and typographic (`’`) apostrophes |
Custom profiles can be loaded from YAML files with user-defined diplomatic tables and/or
`exclude_chars` sets.
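
To show how a diplomatic table and an `exclude_chars` set combine, here is a minimal sketch of a custom profile. The `Profile` class and its `diplomatic` field are assumptions made for the example (only `exclude_chars` is named in this README), and the profile is built in Python rather than loaded from YAML so the block stays dependency-free.

```python
import unicodedata
from dataclasses import dataclass, field


@dataclass
class Profile:
    """Hypothetical in-memory form of a custom normalization profile."""
    name: str
    diplomatic: dict[str, str] = field(default_factory=dict)  # source -> replacement
    exclude_chars: frozenset[str] = frozenset()

    def normalize(self, text: str) -> str:
        text = unicodedata.normalize("NFC", text)
        # Apply longer sources first so that "vv" is handled before any "v" rule.
        for source in sorted(self.diplomatic, key=len, reverse=True):
            text = text.replace(source, self.diplomatic[source])
        # Drop excluded characters last, after all equivalences are applied.
        return "".join(ch for ch in text if ch not in self.exclude_chars)


demo = Profile(
    name="demo",
    diplomatic={"ſ": "s", "vv": "w", "æ": "ae"},
    exclude_chars=frozenset(".,"),
)
normalized = demo.normalize("vvhen, Cæſar.")
```

Both the ground truth and the engine output would go through the same `normalize` call before any metric is computed, so the profile never favors one side.
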
---
## Error Taxonomy
Every character-level error is automatically classified into one of 10 categories:
| Class | Name | Description |
|-------|------|-------------|
| 1 | `visual_confusion` | Morphologically similar characters (rn/m, l/1, O/0, u/n) |
| 2 | `diacritic_error` | Missing, incorrect, or spurious diacritical mark |
| 3 | `case_error` | Case difference only (A/a) |
| 4 | `ligature_error` | Ligature not resolved or incorrectly resolved |
| 5 | `abbreviation_error` | Medieval abbreviation not expanded |
| 6 | `hapax` | Word not found in any reference lexicon |
| 7 | `segmentation_error` | Token fusion or fragmentation (words/lines) |
| 8 | `oov_character` | Character outside the engine's vocabulary |
| 9 | `lacuna` | Text present in ground truth but absent from OCR output |
| 10 | `over_normalization` | LLM silently modernized a historical spelling |
---
## Project Structure
```
picarones/
├── __init__.py              # Version (1.0.0), package metadata
├── __main__.py              # `python -m picarones`
├── cli.py                   # Click CLI: run, demo, report, metrics, engines, info,
│                            # serve, import iiif, history, robustness
├── fixtures.py              # Realistic synthetic test data (medieval documents)
├── i18n.py                  # Back-compat shim loading report/i18n/{fr,en}.json
│
├── core/
│   ├── corpus.py            # Corpus loading (folder, ALTO XML, PAGE XML)
│   ├── metrics.py           # CER, WER, MER, WIL (via jiwer)
│   ├── normalization.py     # Unicode normalization, 11 diplomatic/exclusion profiles
│   ├── statistics.py        # Bootstrap CI, Wilcoxon, Friedman, Nemenyi, CDD SVG
│   ├── runner.py            # Benchmark orchestrator (ThreadPool + ProcessPool)
│   ├── results.py           # DocumentResult, BenchmarkResults, JSON export
│   ├── confusion.py         # Unicode confusion matrix
│   ├── char_scores.py       # Ligature and diacritic scores
│   ├── taxonomy.py          # 10-class error taxonomy
│   ├── structure.py         # Structural analysis (blocks, lines, words)
│   ├── image_quality.py     # Image quality metrics (contrast, noise, resolution)
│   ├── difficulty.py        # Intrinsic difficulty score per document
│   ├── hallucination.py     # VLM hallucination detection
│   ├── line_metrics.py      # Line-level error distribution (Gini, percentiles)
│   ├── history.py           # SQLite longitudinal tracking
│   ├── robustness.py        # Robustness analysis (noise, blur, rotation, resolution)
│   ├── pricing.py           # Cost model, EngineCost, Pareto front
│   └── narrative/           # Factual narrative engine (Sprint 16-19)
│       ├── facts.py         # Fact model, 12 FactType, DetectorRegistry
│       ├── detectors.py     # 12 detectors (global_leader_cer, significant_gap,
│       │                    # stratum_winner/collapse, error_profile_outlier,
│       │                    # llm_hallucination_flag, robustness_fragile,
│       │                    # speed_winner, confidence_warning,
│       │                    # statistical_tie, pareto_alternative, cost_outlier)
│       ├── arbiter.py       # Sort by importance, dedup, anti-contradiction
│       ├── renderer.py      # YAML template rendering via str.format_map
│       └── templates/{fr,en}.yaml
│
├── data/
│   └── pricing.yaml         # Indicative cost table (OCR local/cloud + LLM)
│
├── engines/
│   ├── base.py              # BaseOCREngine (execution_mode: "io" | "cpu")
│   ├── tesseract.py         # Tesseract 5 adapter (CPU)
│   ├── pero_ocr.py          # Pero OCR adapter (CPU)
│   ├── mistral_ocr.py       # Mistral OCR API (/v1/ocr endpoint)
│   ├── google_vision.py     # Google Cloud Vision adapter
│   └── azure_doc_intel.py   # Azure Document Intelligence adapter
│
├── llm/
│   ├── base.py              # BaseLLMAdapter interface
│   ├── openai_adapter.py    # OpenAI / GPT-4o adapter
│   ├── anthropic_adapter.py # Anthropic / Claude adapter
│   ├── mistral_adapter.py   # Mistral chat completions adapter
│   └── ollama_adapter.py    # Ollama local LLM adapter
│
├── pipelines/
│   ├── base.py              # OCRLLMPipeline orchestrator
│   └── over_normalization.py # Over-normalization detection
│
├── prompts/                 # 8 versioned prompt templates
│   ├── correction_medieval_french.txt
│   ├── correction_image_medieval_french.txt
│   ├── correction_imprime_ancien.txt
│   ├── correction_medieval_english.txt
│   ├── correction_early_modern_english.txt
│   ├── zero_shot_medieval_french.txt
│   ├── zero_shot_imprime_ancien.txt
│   └── zero_shot_medieval_english.txt
│
├── report/
│   ├── generator.py         # Orchestrates Jinja2 rendering (617 lines since Sprint 17)
│   ├── diff_utils.py        # Diff computation utilities
│   ├── templates/           # Jinja2 partials (Sprint 17)
│   │   ├── base.html.j2     # assembles everything via {% include %}
│   │   ├── _header.html, _footer.html, _styles.css, _app.js
│   │   ├── _critical_difference.html, _narrative_summary.html, _side_panels.html
│   │   └── view_ranking.html, view_gallery.html, view_document.html,
│   │       view_analyses.html, view_characters.html
│   ├── i18n/                # FR/EN translations (Sprint 17 -- extracted from i18n.py)
│   │   ├── fr.json
│   │   └── en.json
│   ├── glossary/            # Contextual glossary (Sprint 21)
│   │   ├── fr.yaml          # 25 bilingual entries (definition, measures, usage,
│   │   └── en.yaml          # limits, reference)
│   └── vendor/              # Vendored Chart.js
│
├── web/
│   ├── app.py               # FastAPI app (SSE, ZIP upload, dynamic endpoints)
│   └── static/              # CSS assets
│
└── importers/
    ├── iiif.py              # IIIF manifest importer
    ├── gallica.py           # Gallica API client (institutional digital library)
    ├── htr_united.py        # HTR-United catalogue importer
    ├── huggingface.py       # HuggingFace Datasets importer
    └── escriptorium.py      # eScriptorium client
docs/                        # User + developer documentation (Sprint 22)
├── case-studies/            # Two labelled case studies ("Cas d'école")
│   ├── 01-registres-paroissiaux.md
│   └── 02-edition-critique.md
├── user/
│   └── reading-a-report.md  # Anatomy, suggested reading order, advanced panel
└── developer/
    ├── index.md
    ├── narrative-engine.md
    ├── extending-glossary.md
    └── extending-i18n.md
tests/                       # 1242 tests (1 skipped: scipy optional)
.github/workflows/
├── ci.yml                   # CI: Python 3.11/3.12, Linux/macOS/Windows, ruff lint
└── sync_to_huggingface.yml  # Auto-sync to HuggingFace Space on push to main
Dockerfile                   # Multi-stage Docker build for HuggingFace Spaces
```
---
## Environment Variables
Configure API keys depending on which engines and LLM adapters you use:
```bash
# LLM APIs
export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="sk-ant-..."
export MISTRAL_API_KEY="..."
# Cloud OCR APIs (optional)
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/credentials.json"
export AWS_ACCESS_KEY_ID="..."
export AWS_SECRET_ACCESS_KEY="..."
export AWS_DEFAULT_REGION="eu-west-1"
export AZURE_DOC_INTEL_ENDPOINT="https://..."
export AZURE_DOC_INTEL_KEY="..."
```
For deployment on HuggingFace Spaces, set these in **Settings > Variables and secrets**.
---
## CI/CD
### GitHub Actions (`ci.yml`)
- **Triggers:** push to `main`/`develop`/`feature/*`/`sprint/*`/`claude/*`, PRs to
`main`/`develop`, manual dispatch
- **Matrix:** Python 3.11 + 3.12 on Linux, macOS, and Windows
- **Jobs:**
1. **Tests** -- full pytest suite (1242 passing, 1 skipped when scipy is absent) with
coverage uploaded to Codecov
2. **Demo** -- end-to-end demo report generation with history and robustness
3. **Build** -- wheel and sdist with twine validation
4. **Lint** -- `ruff check picarones/ tests/` (E, W, F; ignores E501, E402). The ruff
config lives in `pyproject.toml` under `[tool.ruff]` so CI, `make lint` and direct
invocation all produce the same result -- blocking on F401 / E741.
### HuggingFace Sync (`sync_to_huggingface.yml`)
- Automatically pushes `main` to the HuggingFace Space `Ma-Ri-Ba-Ku/Picarones`
- Requires the `HF_TOKEN` secret in GitHub repository settings
---
## Development
```bash
# Install with dev + web dependencies
pip install -e ".[dev,web]"
# Run the test suite
pytest tests/ -q --tb=short
# Run with coverage
pytest tests/ --cov=picarones --cov-report=term-missing
# Generate a demo report
picarones demo --output demo_report.html
# Launch the web UI in development mode
picarones serve --port 8080
# Full refresh (useful in Codespaces)
git pull && pip install -e ".[dev,web]" && picarones demo --output demo.html
```
**Test suite:** `pytest tests/` -> **1242 passed, 1 skipped** (the skip is intentional
when the optional `scipy` extra is not installed).
**Key development conventions:**
- Never use bare `except Exception: pass` -- always log with `logger.warning()`
- Normalization profiles are read dynamically from `picarones/core/normalization.py` --
never hardcode them in endpoint handlers
- Engines declare their `execution_mode` (`"io"` or `"cpu"`) so the runner can select the
appropriate executor
- `python-multipart` must remain in dependencies (FastAPI checks at import time)
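The first convention above can be illustrated with a minimal sketch. `load_optional_profile` and its JSON file argument are hypothetical names chosen for illustration, not Picarones functions; the only point is that a failure is reported through `logger.warning()` rather than silently discarded.

```python
import json
import logging

logger = logging.getLogger(__name__)


def load_optional_profile(path: str) -> dict:
    """Hypothetical helper: read an optional JSON file, falling back to {}."""
    try:
        with open(path, encoding="utf-8") as handle:
            return json.load(handle)
    except Exception as exc:
        # The broad catch is tolerable only because the failure is logged.
        logger.warning("Could not load %s: %s", path, exc)
        return {}
```

A caller still gets a usable default, and the warning leaves a trace in the logs when an optional resource is missing or malformed.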
---
## Roadmap
| Sprint | Status | Deliverables |
|--------|--------|-------------|
| 1 | Done | Project structure, Tesseract, Pero OCR, CER/WER, CLI |
| 2 | Done | HTML report v1: Chart.js, colored diff, gallery |
| 3 | Done | OCR+LLM pipelines, GPT-4o, Claude, Mistral, Ollama |
| 4 | Done | Cloud OCR APIs, IIIF import, diplomatic normalization |
| 5 | Done | Advanced metrics: confusion matrix, ligatures, 9-class taxonomy |
| 6 | Done | FastAPI web interface, HTR-United, HuggingFace, bilingual UI |
| 7 | Done | HTML report v2: Wilcoxon, bootstrap, clustering, difficulty score |
| 8 | Done | eScriptorium, Gallica API, SQLite history, robustness analysis |
| 9 | Done | Documentation, packaging, Docker, CI/CD, PyInstaller, v1.0.0-Beta |
| 10 | Done | Line error distribution (Gini), VLM hallucination detection |
| 11 | Done | Internationalization FR/EN, English normalization profiles |
| 12 | Done | Browser ZIP upload, macOS file filtering, dynamic model selector |
| 13 | Done | pyproject.toml cleanup, runner parallelization, NDJSON streaming, Wilcoxon validation |
| 14 | Done | Robust engine filtering, corpus validation |
| 15 | Done | Fix empty OCR+LLM pipeline output (Mistral ContentChunk normalization, `finish_reason` logging) |
| 16 | Done | `line_metrics` + `hallucination` wired into runner/`EngineReport`; narrative engine foundations (`core/narrative/` with `Fact` / `DetectorRegistry`); Pillow `getdata`->`tobytes`, silent excepts -> explicit warnings |
| 17 | Done | Report refactor: `generator.py` 3690 -> 617 lines via Jinja2; monolithic HTML template split into 10 files under `picarones/report/templates/`; i18n migrated to `report/i18n/{fr,en}.json`; +16 non-regression tests |
| 18 | Done | Friedman test + Nemenyi post-hoc + Critical Difference Diagram (Demšar 2006); `detect_statistical_tie` enabled; SVG rendered server-side; +41 tests |
| 19 | Done | Factual narrative engine complete: 9 new detectors, arbiter (importance + anti-contradiction), YAML templates renderer, `_narrative_summary.html` partial, anti-hallucination traceability test; +32 tests |
| 20 | Done | Cost model + Pareto view: `core/pricing.py` + `data/pricing.yaml`, `compute_pareto_front`, Chart.js Pareto chart with cost/speed/carbon toggles, `pareto_alternative` and `cost_outlier` detectors; +28 tests |
| 21 | Done | Contextual glossary (25 bilingual entries) + advanced-mode side panel (visible columns, strata filters, opt-in composite score, URL state persistence); +21 tests |
| 22 | Done | Case studies (`docs/case-studies/`), user guide (`docs/user/reading-a-report.md`), three developer guides (`docs/developer/`); +18 tests |
---
## Known Issues & Improvement Opportunities
This section captures the findings of the Sprint 22 audit. None of them block the current
release (all 1242 tests pass, lint clean), but each represents a sensible next step.
### Architecture / refactor
- **`picarones/web/app.py` is 3072 lines** (FastAPI routes, corpus upload, SSE, ZIP flattening,
HTML delivery, model registry all in one module). Candidate split: `app_routes.py` /
`app_corpus.py` / `app_jobs.py` / `app_models.py`.
- **`picarones/core/statistics.py` is 1127 lines** mixing bootstrap CI, Wilcoxon, Friedman,
Nemenyi table, Pareto front and CDD SVG. Splitting into `statistics/bootstrap.py`,
`statistics/tests.py`, `statistics/pareto.py`, `statistics/cdd_svg.py` would shorten
import graphs and ease review.
- **`picarones/cli.py` is 971 lines** -- each Click command could live in its own module under
`picarones/cli/` and be registered via `cli.add_command(...)`.
- **`picarones/core/runner.py` is 847 lines** -- orchestrator is reasonable but edges past the
500-line guideline; extracting the per-document worker + the partial-NDJSON writer would
reduce mental load.
- **`picarones/core/narrative/detectors.py` is 680 lines** -- all 12 detectors live together;
one file per `FactType` (or per importance tier) would make additions safer.
### Back-compat shim
- **`picarones/i18n.py`** is a 66-line shim that reads `picarones/report/i18n/{fr,en}.json`.
Since Sprint 17 the JSON files are the source of truth and only
`picarones/report/generator.py:654` still imports through the shim. Either promote the
shim to `picarones.report.i18n` (renaming the import) or delete the file and import the
loader directly.
### Explicit engine declarations
- `MistralOCREngine`, `GoogleVisionEngine` and `AzureDocIntelEngine` inherit the implicit
`execution_mode = "io"` default from `BaseOCREngine`. For clarity and to protect against a
future default flip, declare it explicitly (as `TesseractEngine` and `PeroOCREngine` already
do for `"cpu"`).
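A minimal sketch of what an explicit declaration buys, under stated assumptions: the class bodies below are stand-ins rather than the real Picarones engines, and `executor_class_for` is a hypothetical selector. The mapping of `"io"` to threads and `"cpu"` to processes is an assumption about how the runner might choose an executor.

```python
from concurrent.futures import Executor, ProcessPoolExecutor, ThreadPoolExecutor


class BaseOCREngine:
    # Implicit default: subclasses inherit it unless they override it.
    execution_mode = "io"


class MistralOCREngine(BaseOCREngine):
    # Explicit declaration: unaffected if the base-class default ever changes.
    execution_mode = "io"


class TesseractEngine(BaseOCREngine):
    execution_mode = "cpu"


def executor_class_for(engine: BaseOCREngine) -> type[Executor]:
    """Hypothetical selector: threads for I/O-bound engines, processes for CPU-bound."""
    if engine.execution_mode == "cpu":
        return ProcessPoolExecutor
    if engine.execution_mode == "io":
        return ThreadPoolExecutor
    raise ValueError(f"unknown execution_mode: {engine.execution_mode!r}")
```

With the attribute declared on the subclass itself, `"execution_mode" in vars(MistralOCREngine)` is true, which also makes the convention easy to enforce in a unit test.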
### Test coverage gaps
- No dedicated unit tests for `picarones/core/char_scores.py` (exercised only transitively).
- No unit tests for the cloud engine adapters themselves (`mistral_ocr.py`,
`google_vision.py`, `azure_doc_intel.py`) -- they are only reached via integration fixtures.
- `pytest` installed as a `uv` tool doesn't see project dependencies automatically; document
`pip install -e ".[dev,web,stats]"` in the pytest environment or switch to an in-repo venv
to avoid "`ModuleNotFoundError: No module named 'yaml'`" surprises.
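One way to surface the missing-dependency problem early is a preflight check that runs before the test session. The sketch below is an assumption about how that could be done, not something the project ships; `missing_modules` is a hypothetical helper and the module list passed to it is up to the caller.

```python
from importlib.util import find_spec


def missing_modules(names: list[str]) -> list[str]:
    """Return the top-level module names that cannot be imported here."""
    # find_spec returns None for a top-level module that is not installed.
    return [name for name in names if find_spec(name) is None]
```

Called from a `conftest.py` with, for example, `missing_modules(["yaml"])`, a non-empty result can be turned into a clear error message that points at the install command above.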
### Documentation
- `CHANGELOG.md` stops at Sprint 9 (2025-03). Sprints 10-22 are described in `CLAUDE.md` and
this README but should be back-ported into `CHANGELOG.md` to follow Keep-a-Changelog.
- `SPECS.md` predates the narrative engine, Pareto view and glossary -- worth a pass.
- Some code comments and docstrings are still in French while user-facing strings are
bilingual; harmonising module docstrings in English would make the project more
contributor-friendly.
### CI / packaging
- `sync_to_huggingface.yml` uses `git push --force hf main` unconditionally -- safe today but
worth documenting because a non-main branch push would silently rewrite the Space.
- `picarones.spec` (PyInstaller) is still present but not exercised in CI -- either add a
`build-exe` job or mark the spec as community-maintained.
### Security (nothing critical)
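- ZIP upload flattening in `web/app.py` rejects absolute paths and `..` traversal but does
not check for symlinks inside archives. Python's `zipfile` does not recreate symlinks on
extraction (a symlink entry is written out as a regular file containing the link target), so
the risk is theoretical; adding an explicit check on the Unix mode stored in the upper 16 bits
of `ZipInfo.external_attr` (`stat.S_ISLNK(info.external_attr >> 16)`) is a belt-and-braces
improvement.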
- API keys are read from environment variables only (no hardcoded fallback) -- good.
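The symlink check mentioned in the first bullet above could look like the following sketch. It uses only the standard library; `find_symlink_entries` and `build_demo_archive` are hypothetical names, and nothing here is taken from `web/app.py`.

```python
import io
import stat
import zipfile


def find_symlink_entries(archive: zipfile.ZipFile) -> list[str]:
    """Return names of entries whose stored Unix mode marks them as symlinks."""
    return [
        info.filename
        for info in archive.infolist()
        # The upper 16 bits of external_attr carry the Unix st_mode, if any.
        if stat.S_ISLNK(info.external_attr >> 16)
    ]


def build_demo_archive() -> zipfile.ZipFile:
    """Build an in-memory archive with one regular file and one symlink entry."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("page_001.txt", "transcription")
        link = zipfile.ZipInfo("evil_link")
        link.external_attr = (stat.S_IFLNK | 0o777) << 16
        archive.writestr(link, "/etc/passwd")
    buffer.seek(0)
    return zipfile.ZipFile(buffer)
```

An upload handler could reject the archive whenever `find_symlink_entries(...)` returns a non-empty list, before any entry is written to disk.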
---
## Contributing
See [CONTRIBUTING.md](CONTRIBUTING.md) for instructions on adding an OCR engine, an LLM
adapter, or submitting a pull request.
---
## License
[Apache License 2.0](LICENSE)
Copyright 2024 Picarones contributors.