Claude committed on
readme: English on top, French below
Reverses the order of the taglines (English first) and restructures
the descriptive block: full English version at the top (intro +
"Use case"), French version below it under an "En français"
section + "Cas d'usage type".
The content is identical on both sides; only the order changes,
to position English as the primary language of the README.
https://claude.ai/code/session_01RusTQYcSfXqTsbFNvwmCV7
README.md
CHANGED
|
@@ -9,10 +9,10 @@ pinned: false
|
|
| 9 |
|
| 10 |
# Picarones
|
| 11 |
|
| 12 |
-
> **Plateforme d'évaluation de pipelines de post-correction OCR sur corpus ALTO XML**
|
| 13 |
-
|
| 14 |
> **OCR Post-Correction Benchmarking Platform for Existing ALTO XML Corpora**
|
| 15 |
|
| 16 |
[](https://github.com/maribakulj/Picarones/actions/workflows/ci.yml)
|
| 17 |
[](https://www.python.org/downloads/)
|
| 18 |
[](LICENSE)
|
|
@@ -20,6 +20,60 @@ pinned: false
|
|
| 20 |
|
| 21 |
---
|
| 22 |
|
| 23 |
**Picarones** est une plateforme open source conçue pour un **contexte
|
| 24 |
institutionnel** — services patrimoniaux, archives, bibliothèques numériques
|
| 25 |
qui disposent déjà d'un **corpus en XML ALTO** (issu d'une chaîne d'OCR
|
|
@@ -43,9 +97,7 @@ jonction**, **stabilité multi-runs**, **comparaison contrôlée par slot**, et
|
|
| 43 |
plusieurs sources d'import (IIIF, HuggingFace, HTR-United, eScriptorium,
|
| 44 |
upload ZIP).
|
| 45 |
|
| 46 |
-
|
| 47 |
-
|
| 48 |
-
## Cas d'usage type
|
| 49 |
|
| 50 |
Une institution (archive, bibliothèque numérique, service patrimonial) a
|
| 51 |
**déjà OCRisé** un corpus de plusieurs milliers de pages — sortie au format
|
|
|
|
| 9 |
|
| 10 |
# Picarones
|
| 11 |
|
| 12 |
> **OCR Post-Correction Benchmarking Platform for Existing ALTO XML Corpora**
|
| 13 |
|
| 14 |
+
> **Plateforme d'évaluation de pipelines de post-correction OCR sur corpus ALTO XML**
|
| 15 |
+
|
| 16 |
[](https://github.com/maribakulj/Picarones/actions/workflows/ci.yml)
|
| 17 |
[](https://www.python.org/downloads/)
|
| 18 |
[](LICENSE)
|
|
|
|
| 20 |
|
| 21 |
---
|
| 22 |
|
| 23 |
+
**Picarones** is an open-source platform designed for an **institutional
|
| 24 |
+
context** — heritage services, archives, digital libraries that already
|
| 25 |
+
have a **corpus in ALTO XML** (output of a prior OCR pipeline) and want
|
| 26 |
+
to **rigorously evaluate** post-correction strategies: alternative re-OCR,
|
| 27 |
+
LLM correction, specialised ALTO mappers, ensemble voting, etc.
|
| 28 |
+
|
| 29 |
+
This is a **benchmarking platform, not a production workshop**. Picarones
|
| 30 |
+
loads an existing ALTO corpus, runs the pipelines the researcher brings,
|
| 31 |
+
measures every relevant metric, and produces a self-contained HTML report
|
| 32 |
+
that is **factual and reproducible**. No prescriptions, no automatic
|
| 33 |
+
verdicts: the report shows the numbers, the researcher decides.
|
| 34 |
+
|
| 35 |
+
Heritage-specific metrics (diplomatic CER, ligature and diacritic scores,
|
| 36 |
+
medieval abbreviations, Roman numerals, foliation, fuzzy full-text
|
| 37 |
+
searchability, philological marker fidelity), composable pipelines, a
|
| 38 |
+
**factual narrative synthesis** at the top of the report, **multi-engine
|
| 39 |
+
Friedman/Nemenyi significance tests** with a **critical difference
|
| 40 |
+
diagram**, **cost / speed / CO₂ Pareto analysis**, **per-junction error
|
| 41 |
+
absorption**, **multi-run stability**, **controlled per-slot comparison**,
|
| 42 |
+
and several corpus import sources (IIIF, HuggingFace, HTR-United,
|
| 43 |
+
eScriptorium, ZIP upload).
|
| 44 |
+
|
| 45 |
+
> *Version française ci-dessous.*
|
| 46 |
+
|
| 47 |
+
---
|
| 48 |
+
|
| 49 |
+
## Use case
|
| 50 |
+
|
| 51 |
+
An institution (archive, digital library, heritage service) has **already
|
| 52 |
+
OCR'd** a corpus of several thousand pages — output in **ALTO XML** with
|
| 53 |
+
zone, line and word coordinates. The output has a decent but imperfect
|
| 54 |
+
CER, with the typical defects on historical ligatures, unexpanded
|
| 55 |
+
abbreviations and badly recognised proper names.
|
| 56 |
+
|
| 57 |
+
The institution wants to **rigorously compare** several post-correction
|
| 58 |
+
strategies on that existing corpus:
|
| 59 |
+
|
| 60 |
+
- alternative re-OCR (Pero OCR, Kraken, Mistral OCR…);
|
| 61 |
+
- LLM correction (GPT-4o, Claude, Mistral) in text-only or image+text mode;
|
| 62 |
+
- specialised ALTO mappers (line re-segmentation, abbreviation expansion,
|
| 63 |
+
diplomatic normalisation);
|
| 64 |
+
- composed pipelines: alternative OCR → LLM correction → ALTO mapper.
|
| 65 |
+
|
| 66 |
+
Picarones loads the ALTO corpus, runs each pipeline, measures the
|
| 67 |
+
relevant metrics (CER gain, recovered fuzzy searchability, preserved
|
| 68 |
+
numerical sequences, **errors introduced by the post-corrector** —
|
| 69 |
+
critical for LLMs that silently modernise) and produces a factual HTML
|
| 70 |
+
report that is **directly citable in a scientific publication**: every
|
| 71 |
+
number is traceable to its source payload, no prescription imposed.
|
| 72 |
+
|
| 73 |
+
---
|
| 74 |
+
|
| 75 |
+
## En français
|
| 76 |
+
|
| 77 |
**Picarones** est une plateforme open source conçue pour un **contexte
|
| 78 |
institutionnel** — services patrimoniaux, archives, bibliothèques numériques
|
| 79 |
qui disposent déjà d'un **corpus en XML ALTO** (issu d'une chaîne d'OCR
|
|
|
|
| 97 |
plusieurs sources d'import (IIIF, HuggingFace, HTR-United, eScriptorium,
|
| 98 |
upload ZIP).
|
| 99 |
|
| 100 |
+
### Cas d'usage type
|
| 101 |
|
| 102 |
Une institution (archive, bibliothèque numérique, service patrimonial) a
|
| 103 |
**déjà OCRisé** un corpus de plusieurs milliers de pages — sortie au format
|