Claude committed on
Commit 0ca9244 · unverified · 1 Parent(s): 2cde8e8

readme: English on top, French below

Reverses the order of the taglines (English first) and restructures
the descriptive block: full English version on top (intro +
"Use case"), French version below under an "En français" section
+ "Cas d'usage type".

The content is identical on both sides; only the order changes,
to position English as the primary language of the README.

https://claude.ai/code/session_01RusTQYcSfXqTsbFNvwmCV7

Files changed (1)
  1. README.md +57 -5
README.md CHANGED
@@ -9,10 +9,10 @@ pinned: false
 
  # Picarones
 
- > **Plateforme d'évaluation de pipelines de post-correction OCR sur corpus ALTO XML**
-
  > **OCR Post-Correction Benchmarking Platform for Existing ALTO XML Corpora**
 
+ > **Plateforme d'évaluation de pipelines de post-correction OCR sur corpus ALTO XML**
+
  [![CI](https://github.com/maribakulj/Picarones/actions/workflows/ci.yml/badge.svg)](https://github.com/maribakulj/Picarones/actions/workflows/ci.yml)
  [![Python 3.11+](https://img.shields.io/badge/python-3.11+-blue.svg)](https://www.python.org/downloads/)
  [![License: Apache 2.0](https://img.shields.io/badge/License-Apache%202.0-green.svg)](LICENSE)
@@ -20,6 +20,60 @@ pinned: false
 
  ---
 
+ **Picarones** is an open-source platform designed for an **institutional
+ context** — heritage services, archives, digital libraries that already
+ have a **corpus in ALTO XML** (output of a prior OCR pipeline) and want
+ to **rigorously evaluate** post-correction strategies: alternative re-OCR,
+ LLM correction, specialised ALTO mappers, ensemble voting, etc.
+
+ This is a **benchmarking platform, not a production workshop**. Picarones
+ loads an existing ALTO corpus, runs the pipelines the researcher brings,
+ measures every relevant metric, and produces a self-contained HTML report
+ that is **factual and reproducible**. No prescriptions, no automatic
+ verdicts: the report shows the numbers, the researcher decides.
+
+ Heritage-specific metrics (diplomatic CER, ligature and diacritic scores,
+ medieval abbreviations, Roman numerals, foliation, fuzzy full-text
+ searchability, philological marker fidelity), composable pipelines, a
+ **factual narrative synthesis** at the top of the report, **multi-engine
+ Friedman/Nemenyi significance tests** with a **critical difference
+ diagram**, **cost / speed / CO₂ Pareto analysis**, **per-junction error
+ absorption**, **multi-run stability**, **controlled per-slot comparison**,
+ and several corpus import sources (IIIF, HuggingFace, HTR-United,
+ eScriptorium, ZIP upload).
+
+ > *Version française ci-dessous.*
+
+ ---
+
+ ## Use case
+
+ An institution (archive, digital library, heritage service) has **already
+ OCR'd** a corpus of several thousand pages — output in **ALTO XML** with
+ zone, line and word coordinates. The output has a decent but imperfect
+ CER, with the typical defects on historical ligatures, unexpanded
+ abbreviations and badly recognised proper names.
+
+ The institution wants to **rigorously compare** several post-correction
+ strategies on that existing corpus:
+
+ - alternative re-OCR (Pero OCR, Kraken, Mistral OCR…);
+ - LLM correction (GPT-4o, Claude, Mistral) in text-only or image+text mode;
+ - specialised ALTO mappers (line re-segmentation, abbreviation expansion,
+ diplomatic normalisation);
+ - composed pipelines: alternative OCR → LLM correction → ALTO mapper.
+
+ Picarones loads the ALTO corpus, runs each pipeline, measures the
+ relevant metrics (CER gain, recovered fuzzy searchability, preserved
+ numerical sequences, **errors introduced by the post-corrector** —
+ critical for LLMs that silently modernise) and produces a factual HTML
+ report that is **directly citable in a scientific publication**: every
+ number is traceable to its source payload, no prescription imposed.
+
+ ---
+
+ ## En français
+
  **Picarones** est une plateforme open source conçue pour un **contexte
  institutionnel** — services patrimoniaux, archives, bibliothèques numériques
  qui disposent déjà d'un **corpus en XML ALTO** (issu d'une chaîne d'OCR
@@ -43,9 +97,7 @@ jonction**, **stabilité multi-runs**, **comparaison contrôlée par slot**, et
  plusieurs sources d'import (IIIF, HuggingFace, HTR-United, eScriptorium,
  upload ZIP).
 
- ---
-
- ## Cas d'usage type
+ ### Cas d'usage type
 
  Une institution (archive, bibliothèque numérique, service patrimonial) a
  **déjà OCRisé** un corpus de plusieurs milliers de pages — sortie au format