---
title: Picarones
emoji: 📜
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
---

# Picarones

> **Heritage OCR / HTR / VLM and post-correction benchmarking — bring your golden dataset, plug in the AIs.**

> **Banc d'essai d'OCR / HTR / VLM et de post-correction pour documents patrimoniaux — amenez votre golden dataset, branchez vos IA.**

[![CI](https://github.com/maribakulj/Picarones/actions/workflows/ci.yml/badge.svg)](https://github.com/maribakulj/Picarones/actions/workflows/ci.yml)
[![Python 3.11+](https://img.shields.io/badge/python-3.11+-blue.svg)](https://www.python.org/downloads/)
[![License: Apache 2.0](https://img.shields.io/badge/License-Apache%202.0-green.svg)](LICENSE)
[![HuggingFace Space](https://img.shields.io/badge/%F0%9F%A4%97-HuggingFace%20Space-yellow.svg)](https://huggingface.co/spaces/Ma-Ri-Ba-Ku/Picarones)

---

**Picarones** is an open-source benchmarking platform for OCR, HTR, VLM and
post-correction pipelines on heritage documents.

### Input contract: pairs of (image, ground truth)

The user provides a **golden dataset** — a folder of pairs `image.{jpg,png,…}`
+ ground truth, where the ground truth is plain text (`image.gt.txt`),
**ALTO XML** (`image.xml`), or **PAGE XML** (`image.xml`). The ground truth
must be hand-annotated (or come from a curated reference corpus); Picarones
auto-detects the format and converts ALTO/PAGE to plain text for the
text-level metrics while keeping the structured GT for the ALTO/PAGE/entity
metrics.
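As an illustration of the input contract, the sketch below pairs each image with its ground truth and guesses the GT format from the file suffix and the XML root element. It is not Picarones' actual loader: `find_pairs`, `detect_gt_format` and the `IMAGE_SUFFIXES` set are hypothetical names chosen for this example. It only relies on the fact that ALTO files have an `alto` root element and PAGE files a `PcGts` root element.

```python
from pathlib import Path
import xml.etree.ElementTree as ET

# Assumed set of image extensions; the real loader may accept others.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".tif", ".tiff"}


def detect_gt_format(gt_path: Path) -> str:
    """Return 'text', 'alto' or 'page' for a ground-truth file."""
    if gt_path.suffix.lower() != ".xml":
        return "text"
    root = ET.parse(gt_path).getroot()
    # Tags look like '{namespace}alto' or '{namespace}PcGts'; keep the local name.
    local_name = root.tag.rsplit("}", 1)[-1].lower()
    if local_name == "alto":
        return "alto"
    if local_name == "pcgts":
        return "page"
    raise ValueError(f"Unrecognised XML ground truth: {gt_path}")


def find_pairs(corpus_dir: Path) -> list[tuple[Path, Path, str]]:
    """List (image, ground_truth, format) triples found in a corpus folder."""
    pairs = []
    for image in sorted(corpus_dir.iterdir()):
        if image.suffix.lower() not in IMAGE_SUFFIXES:
            continue
        # Naming convention described above: image.gt.txt first, then image.xml.
        for candidate in (image.with_suffix(".gt.txt"), image.with_suffix(".xml")):
            if candidate.exists():
                pairs.append((image, candidate, detect_gt_format(candidate)))
                break
    return pairs
```

Images without a matching ground-truth file are skipped in this sketch, which mirrors the rule that no metric can be computed without GT.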

### Evaluation contract: every metric is computed against the GT in the input pair

The user plugs in one or several AIs to evaluate — OCR engines, VLMs,
OCR+LLM correction pipelines, chains of alternative re-OCR, LLM correction
and ALTO mapper, etc. Picarones runs each AI on every page of the dataset,
compares the output to the ground truth at every relevant level (text,
ALTO, PAGE, entities, reading order), and produces a self-contained HTML
report with factual numbers, statistical tests and a reproducibility
snapshot. **A benchmark on a corpus without GT is impossible by design**:
Picarones measures how well an AI matches a known annotated reference,
not how well it transcribes an arbitrary document.

### Decision contract: the researcher reads the numbers and decides

This is a **benchmarking platform, not a production workshop**. The
typical workflow is: build a small golden dataset whose script type,
period and language match the production corpus you eventually want to
process; benchmark candidate AIs on that dataset; read the report and
decide which AI is reliable enough to deploy on your real (unlabelled)
production corpus. No prescriptions, no automatic verdicts.

### Each researcher brings their own dataset

Picarones does not yet maintain a curated library of standard golden
datasets. The corpus importers (IIIF, Gallica, HuggingFace, HTR-United,
eScriptorium, ZIP upload) help **fetch and ingest** existing datasets,
but the **choice and curation** are the researcher's responsibility.

---

Picarones provides heritage-specific metrics (diplomatic CER, ligature and
diacritic scores, medieval abbreviations, Roman numerals, foliation, fuzzy
full-text searchability, philological marker fidelity), composable
pipelines, a **factual narrative synthesis** at the top of the report,
**multi-engine Friedman/Nemenyi significance tests** with a **critical
difference diagram**, **cost / speed / CO₂ Pareto analysis**, **per-junction
error absorption**, **multi-run stability**, and **controlled per-slot
comparison**.

> *Version française ci-dessous.*

---

## Use case

A heritage institution wants to choose an OCR / HTR / post-correction
pipeline to deploy on a future production corpus — say, several thousand
17th-century parish registers, or 19th-century newspapers, or medieval
glossed manuscripts. They cannot benchmark candidate AIs directly on that
production corpus: there is no ground truth for it, so no metric can be
computed.

Instead, they assemble (or borrow) a **golden dataset** of a few hundred
hand-annotated pages whose script type, period and language match the
target corpus. Each page is a pair: the image, plus a ground truth in
plain text, ALTO XML, or PAGE XML. They feed the dataset to Picarones and
plug in the AIs to compare:

- alternative re-OCR (Pero OCR, Kraken, Mistral OCR…);
- LLM correction (GPT-4o, Claude, Mistral) in text-only or image+text mode;
- specialised ALTO mappers (line re-segmentation, abbreviation expansion,
  diplomatic normalisation);
- composed pipelines: alternative OCR → LLM correction → ALTO mapper.

Picarones runs each AI on every page of the golden dataset, compares the
output to the ground truth at every relevant level, measures the metrics
(CER gain, recovered fuzzy searchability, preserved numerical sequences,
**errors introduced by the post-corrector** — critical for LLMs that
silently modernise) and produces a factual HTML report that is **directly
citable in a scientific publication**: every number is traceable to its
source payload, and no prescription is imposed.

The researcher reads the numbers and decides which pipeline is reliable
enough to deploy on the actual (unlabelled) production corpus.

---

## En français

**Picarones** est une plateforme open source de banc d'essai pour des IA
d'OCR, HTR, VLM et des pipelines de post-correction sur documents
patrimoniaux.

### Contrat d'entrée : paires (image, vérité terrain)

L'utilisateur amène un **golden dataset** — un dossier de paires
`image.{jpg,png,…}` + vérité terrain, où la VT est en texte brut
(`image.gt.txt`), en **ALTO XML** (`image.xml`), ou en **PAGE XML**
(`image.xml`). La VT doit être annotée à la main (ou provenir d'un corpus
de référence curaté) ; Picarones détecte automatiquement le format et
convertit l'ALTO / PAGE en texte brut pour les métriques textuelles tout
en conservant la VT structurée pour les métriques ALTO / PAGE / entités.

### Contrat d'évaluation : chaque métrique est calculée contre la VT de la paire en entrée

L'utilisateur branche une ou plusieurs IA à évaluer — moteurs OCR, VLM,
pipelines OCR+LLM, ré-OCR alternatif + LLM + mappeur ALTO chaînés, etc.
Picarones exécute chaque IA sur chaque page du dataset, compare la sortie
à la vérité terrain à tous les niveaux pertinents (texte, ALTO, PAGE,
entités, ordre de lecture) et produit un rapport HTML autonome avec
chiffres factuels, tests statistiques et snapshot de reproductibilité.
**Un benchmark sur un corpus sans VT est impossible par design** :
Picarones mesure à quel point une IA matche une référence annotée connue,
pas à quel point elle transcrit un document quelconque.

### Contrat de décision : le chercheur lit les chiffres et arbitre

C'est un **banc d'essai, pas un atelier de production**. Le workflow type
est : constituer un golden dataset de quelques pages annotées dont le
type d'écriture, la période et la langue correspondent au corpus de
production qu'on veut traiter ; benchmarker les IA candidates sur ce
dataset ; lire le rapport et décider quelle IA est assez fiable pour la
passer en prod sur le vrai corpus (non annoté). Pas de prescription, pas
de verdict automatique.

### Chaque chercheur amène son propre dataset

Picarones ne maintient pas (encore) de bibliothèque curatée de golden
datasets standards. Les importers de corpus (IIIF, Gallica, HuggingFace,
HTR-United, eScriptorium, upload ZIP) aident à **récupérer et ingérer**
des datasets existants, mais le **choix et la curation** restent à la
charge du chercheur.

---

Métriques spécifiques aux corpus patrimoniaux (CER diplomatique, scores de
ligatures, abréviations médiévales, numéraux romains, foliotation,
recherchabilité fuzzy plein-texte, fidélité aux marqueurs philologiques),
pipelines composables, **synthèse narrative factuelle** au sommet du rapport,
**tests Friedman/Nemenyi multi-moteurs** avec **diagramme de différence
critique**, analyse **Pareto coût/vitesse/CO₂**, **absorption d'erreur par
jonction**, **stabilité multi-runs**, **comparaison contrôlée par slot**.

### Cas d'usage type

Une institution patrimoniale veut choisir un pipeline OCR / HTR /
post-correction à déployer sur un futur corpus de production — par
exemple plusieurs milliers de registres paroissiaux du XVIIᵉ siècle, ou
de presse du XIXᵉ, ou de manuscrits glosés médiévaux. Elle ne peut pas
benchmarker les IA candidates directement sur ce corpus de production :
il n'y a pas de vérité terrain pour lui, donc aucune métrique ne peut
être calculée.

À la place, elle constitue (ou récupère) un **golden dataset** de
quelques centaines de pages annotées à la main dont le type d'écriture,
la période et la langue correspondent au corpus cible. Chaque page est
une paire : l'image, plus une vérité terrain en texte brut, ALTO XML, ou
PAGE XML. Elle alimente Picarones avec ce dataset et branche les IA à
comparer :

- ré-OCR avec un moteur alternatif (Pero OCR, Kraken, Mistral OCR…) ;
- correction LLM (GPT-4o, Claude, Mistral) en mode texte seul ou image+texte ;
- mappeurs ALTO spécialisés (re-segmentation des lignes, fusion des
  abréviations, normalisation diplomatique) ;
- pipelines composées : OCR alternatif → correction LLM → mappeur ALTO.

Picarones exécute chaque IA sur chaque page du golden dataset, compare la
sortie à la vérité terrain à tous les niveaux pertinents, mesure les
métriques (gain CER, recherchabilité fuzzy gagnée, séquences numériques
préservées, **erreurs introduites par le post-correcteur** — critique
pour les LLM qui modernisent silencieusement) et produit un rapport HTML
factuel **directement citable dans une publication scientifique** :
chaque chiffre est traçable au payload source, aucune prescription n'est
imposée.

Le chercheur lit les chiffres et décide quel pipeline est assez fiable
pour le déployer sur son corpus de production réel (non annoté).

---

## Table of Contents

- [Features](#features)
  - [Heritage-Specific Metrics](#heritage-specific-metrics)
  - [OCR+LLM Pipelines](#ocrllm-pipelines)
  - [Corpus Import](#corpus-import)
  - [Interactive HTML Report](#interactive-html-report)
  - [Longitudinal Tracking & Robustness](#longitudinal-tracking--robustness)
  - [Web Interface](#web-interface)
- [Quick Start](#quick-start)
- [Installation](#installation)
  - [From Source](#from-source)
  - [Docker](#docker)
  - [Optional Extras](#optional-extras)
- [Usage](#usage)
  - [CLI Commands](#cli-commands)
  - [Web Interface](#web-interface-1)
  - [Pipeline Modes](#pipeline-modes)
- [Supported Engines](#supported-engines)
- [Normalization Profiles](#normalization-profiles)
- [Error Taxonomy](#error-taxonomy)
- [Project Structure](#project-structure)
- [Environment Variables](#environment-variables)
- [CI/CD](#cicd)
- [Development](#development)
- [Roadmap](#roadmap)
- [Contributing](#contributing)
- [License](#license)

---

## Features

### Heritage-Specific Metrics

- **CER** (Character Error Rate) in four variants: raw, NFC-normalized, caseless, and
  **diplomatic** (historical equivalences: long s = s, u = v, i = j, etc.)
- **WER**, **MER**, **WIL** with historically aware tokenization (via [jiwer](https://github.com/jitsi/jiwer))
- **Unicode confusion matrix** -- fingerprint each engine's character-level errors
- **Ligature and diacritic scores** -- track handling of fi, fl, ff, oe, ae, p-bar, and other
  medieval glyphs
- **10-class error taxonomy** -- automatic classification of every error (visual confusion,
  abbreviation, segmentation, lacuna, over-normalization, etc.)
- **Bootstrap 95% confidence intervals**, **Wilcoxon signed-rank tests**, and the
  **Friedman test + Nemenyi post-hoc** with a **Critical Difference Diagram** (Demšar 2006)
  for rigorous multi-engine comparison
- **Intrinsic difficulty score** per document, independent of engine performance
- **Line-level error distribution** with Gini coefficient and percentile analysis
- **VLM hallucination detection** -- anchor score and length ratio to flag fabricated output
- **Cost / speed / carbon Pareto front** (local vs cloud, per-token pricing model)
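
To make the CER and bootstrap items above concrete, here is a minimal pure-Python sketch: a Levenshtein-based character error rate and a percentile bootstrap interval over per-document CER values. It is an illustration only. Picarones computes its metrics via jiwer, and the function names, resample count and seed below are assumptions of this example.

```python
import random


def levenshtein(ref: str, hyp: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    previous = list(range(len(hyp) + 1))
    for i, ref_char in enumerate(ref, start=1):
        current = [i]
        for j, hyp_char in enumerate(hyp, start=1):
            current.append(min(
                previous[j] + 1,                           # deletion
                current[j - 1] + 1,                        # insertion
                previous[j - 1] + (ref_char != hyp_char),  # substitution or match
            ))
        previous = current
    return previous[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return levenshtein(reference, hypothesis) / len(reference)


def bootstrap_ci(values, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(n_resamples))
    lower = means[int((1 - level) / 2 * n_resamples)]
    upper = means[int((1 + level) / 2 * n_resamples) - 1]
    return lower, upper
```

The raw, NFC, caseless and diplomatic variants differ only in the normalization applied to both strings before calling `cer`.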

### OCR+LLM Pipelines

- Composable chains: `tesseract -> gpt-4o`, `pero_ocr -> claude-sonnet`, zero-shot VLM, etc.
- Three pipeline modes: text-only post-correction, image+text post-correction, and zero-shot
- **Over-normalization detection** -- does the LLM silently modernize historical spellings?
- Versioned prompt library for medieval French, early modern French, medieval Latin, medieval
  English, and early modern English -- both correction and zero-shot variants

### Corpus Import

| Source | Method |
|--------|--------|
| Local folder | `picarones run --corpus ./corpus/` |
| IIIF manifests (institutional repositories) | `picarones import iiif <manifest-url>` |
| Gallica API (SRU + OCR) | `GallicaClient` / `picarones import iiif` |
| HuggingFace Datasets | Web interface (`POST /api/huggingface/import`) |
| HTR-United catalogue | Web interface (`POST /api/htr-united/import`) |
| eScriptorium | `EScriptoriumClient` |
| ZIP upload (browser) | Web interface upload endpoint |

Supported corpus formats: plain text pairs (image + ground truth), **ALTO XML**, and **PAGE XML**.

### Interactive HTML Report

- **Self-contained HTML file** -- works offline, no server needed (Jinja2-templated since Sprint 17)
- **Factual narrative synthesis** at the top of the report (Sprint 19): 12 deterministic
  detectors extract salient facts (global leader, significant gap, stratum collapse, VLM
  hallucination flag, speed winner, cost outlier, Pareto alternative, ...) and render them
  as short sentences -- every number is traceable to the source payload, no LLM, no
  hallucination risk
- **Critical Difference Diagram** (CDD) rendered server-side as static SVG -- no JS required
- **Cost / speed / carbon Pareto chart** with toggleable axes and highlighted Pareto front
- **Contextual glossary**: a `?` icon next to every metric header opens a side panel with
  definition, what it measures, usage, limits, and reference (25 bilingual entries)
- **Advanced mode panel**: visible-column picker, per-stratum filter, and opt-in personal
  composite score (sliders default to 0, formula always visible, explicit warning that no
  universal weighting exists). State is persisted in the URL.
- Sortable ranking table, radar charts, histograms (powered by Chart.js)
- Gallery view with dynamic filters and color-coded CER badges
- GitHub-style colored diff with synchronized N-way scrolling
- Triple diff view for OCR+LLM: ground truth / raw OCR / post-correction
- Unicode character view: interactive confusion matrix explorer
- Export to **CSV**, **JSON**, **ALTO XML**, **PAGE XML**, and annotated images

### Longitudinal Tracking & Robustness

- Optional **SQLite database** to record benchmark history across runs
- **CER evolution curves** over time, per engine
- **Automatic regression detection** between consecutive runs
- **Robustness analysis**: measure engine resilience to noise, blur, rotation, resolution
  reduction, and binarization
- Critical degradation threshold identification
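
The idea behind history tracking and regression detection can be sketched with the standard `sqlite3` module. The table layout, function names and the `tolerance` threshold below are assumptions made for this example and do not describe the schema used by `picarones/core/history.py`.

```python
import sqlite3


def open_history(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a history database with one row per benchmark run."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS runs ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " engine TEXT NOT NULL,"
        " run_date TEXT NOT NULL,"   # ISO 8601 dates sort chronologically as text
        " mean_cer REAL NOT NULL)"
    )
    return conn


def record_run(conn: sqlite3.Connection, engine: str, run_date: str, mean_cer: float) -> None:
    """Append one benchmark result to the history."""
    conn.execute(
        "INSERT INTO runs (engine, run_date, mean_cer) VALUES (?, ?, ?)",
        (engine, run_date, mean_cer),
    )
    conn.commit()


def find_regressions(conn: sqlite3.Connection, engine: str, tolerance: float = 0.01):
    """Return (date, previous_cer, new_cer) for each run whose CER rose by more than `tolerance`."""
    rows = conn.execute(
        "SELECT run_date, mean_cer FROM runs WHERE engine = ? ORDER BY run_date, id",
        (engine,),
    ).fetchall()
    regressions = []
    # Compare each run with the one immediately before it.
    for (_, previous_cer), (run_date, new_cer) in zip(rows, rows[1:]):
        if new_cer - previous_cer > tolerance:
            regressions.append((run_date, previous_cer, new_cer))
    return regressions
```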

### Web Interface

- **FastAPI** application with real-time **Server-Sent Events** (SSE) progress streaming
- Upload corpus as a **ZIP file** directly from the browser
- Dynamic engine and normalization profile selectors
- Browse and re-download generated HTML reports
- Bilingual **French/English** interface
- Deployable on HuggingFace Spaces (Docker, port 7860)

---

## Quick Start

```bash
# Clone and install
git clone https://github.com/maribakulj/Picarones.git
cd Picarones
pip install -e .

# Install Tesseract (system binary, required for the Tesseract engine)
# Ubuntu/Debian
sudo apt install tesseract-ocr tesseract-ocr-fra tesseract-ocr-lat

# macOS
brew install tesseract

# Generate a demo report (no OCR engine needed)
picarones demo --output demo_report.html

# List available engines
picarones engines

# Run a benchmark
picarones run --corpus ./corpus/ --engines tesseract --output results.json

# Generate HTML report
picarones report --results results.json --output report.html

# Launch the web interface
picarones serve --port 8080
```

---

## Installation

### From Source

```bash
git clone https://github.com/maribakulj/Picarones.git
cd Picarones
pip install -e ".[dev,web]"    # includes test and web dependencies
```

**System requirements:**

- Python >= 3.11
- [Tesseract OCR 5](https://github.com/tesseract-ocr/tesseract) (for the Tesseract engine)

### Docker

```bash
docker build -t picarones .
docker run -p 7860:7860 \
  -e MISTRAL_API_KEY=... \
  -e OPENAI_API_KEY=... \
  picarones
```

The Docker image is based on Python 3.11-slim, includes Tesseract 5 with language packs
(fra, lat, eng, deu, ita, spa), and runs as a non-root user. A health check polls
`/health` every 30 seconds.

The [HuggingFace Space](https://huggingface.co/spaces/Ma-Ri-Ba-Ku/Picarones) uses this
same Docker image.

### Optional Extras

| Extra | Install command | What it adds |
|-------|----------------|--------------|
| `dev` | `pip install -e ".[dev]"` | pytest, pytest-cov, httpx, FastAPI, uvicorn, python-multipart |
| `web` | `pip install -e ".[web]"` | FastAPI, uvicorn, python-multipart, httpx |
| `stats` | `pip install -e ".[stats]"` | scipy (exact Wilcoxon/Friedman/Nemenyi -- otherwise pure-Python fallback) |
| `llm` | `pip install -e ".[llm]"` | OpenAI, Anthropic, Mistral SDKs |
| `hf` | `pip install -e ".[hf]"` | HuggingFace Datasets |
| `pero` | `pip install -e ".[pero]"` | Pero OCR engine |
| `kraken` | `pip install -e ".[kraken]"` | Kraken engine |
| `ocr-cloud` | `pip install -e ".[ocr-cloud]"` | Google Vision, AWS (boto3), Azure Doc Intelligence |
| `all` | `pip install -e ".[all]"` | `web` + `hf` + `llm` + `dev` (no `ocr-cloud`) |

See [INSTALL.md](INSTALL.md) for detailed instructions on Linux, macOS, Windows, and Docker.

---

## Usage

### CLI Commands

| Command | Description |
|---------|-------------|
| `picarones run` | Run a full benchmark on a corpus |
| `picarones report` | Generate an HTML report from JSON results |
| `picarones demo` | Generate a demo report with synthetic data (no engine required) |
| `picarones metrics` | Calculate CER/WER between two text files |
| `picarones engines` | List all available OCR engines and LLM adapters |
| `picarones info` | Display version and system information |
| `picarones serve` | Launch the FastAPI web interface |
| `picarones history` | Query longitudinal benchmark history (SQLite) |
| `picarones robustness` | Run robustness analysis with degraded images |
| `picarones import iiif` | Import corpus from an IIIF manifest (any institutional repository). HTR-United and HuggingFace imports are exposed through the web interface (`/api/htr-united/import`, `/api/huggingface/import`). |

**Examples:**

```bash
# Benchmark with Tesseract, French language, PSM 6
picarones run --corpus ./manuscripts/ --engines tesseract --lang fra --psm 6 \
  --output results.json --verbose

# Compare two text files
picarones metrics --reference ground_truth.txt --hypothesis ocr_output.txt

# Import 10 pages from any IIIF manifest URL
picarones import iiif https://institution.example/iiif/xxx/manifest.json --pages 1-10

# HuggingFace and HTR-United imports are available via the web UI at
#   http://localhost:8000/  (endpoints POST /api/huggingface/import and /api/htr-united/import)

# View benchmark history with regression detection
picarones history --engine tesseract --regression

# Robustness demo (noise, blur, rotation, resolution)
picarones robustness --corpus ./gt/ --engine tesseract --demo

# Fail CI if CER exceeds threshold
picarones run --corpus ./corpus/ --engines tesseract --fail-if-cer-above 0.15
```

### Web Interface

```bash
picarones serve --host 0.0.0.0 --port 8080
```

**API endpoints include:**

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/` | GET | Main single-page application |
| `/api/status` | GET | Version and application status |
| `/api/engines` | GET | Available OCR/LLM engines |
| `/api/normalization/profiles` | GET | Normalization profiles (read dynamically) |
| `/api/benchmark/start` | POST | Start a benchmark job (returns `job_id`) |
| `/api/benchmark/{job_id}/stream` | GET | SSE real-time progress stream |
| `/api/benchmark/{job_id}/cancel` | POST | Cancel a running benchmark |
| `/api/corpus/browse` | GET | Browse server-side corpus folders |
| `/api/htr-united/catalogue` | GET | Browse HTR-United catalogue |
| `/api/huggingface/search` | GET | Search HuggingFace datasets |
| `/reports/{filename}` | GET | Download generated HTML reports |
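
A client following `/api/benchmark/{job_id}/stream` has to decode the Server-Sent Events wire format. The sketch below parses that format from any iterable of text lines. It follows the generic SSE framing (`event:` and `data:` fields, blank line as separator); the event names and payloads Picarones actually emits are not documented here, so nothing about them is assumed.

```python
from typing import Iterable, Iterator


def parse_sse(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per Server-Sent Event from an iterable of text lines."""
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            # A blank line terminates the current event.
            if data:
                yield {"event": event, "data": "\n".join(data)}
            event, data = "message", []
        elif line.startswith(":"):
            # Comment line, often used as a keep-alive; ignore it.
            continue
        else:
            field, _, value = line.partition(":")
            if value.startswith(" "):
                value = value[1:]  # a single leading space is not part of the value
            if field == "event":
                event = value
            elif field == "data":
                data.append(value)
```

With `httpx` (installed by the `web` extra), the lines could come from `response.iter_lines()` inside a `client.stream("GET", url)` block.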

### Pipeline Modes

Picarones supports three modes for OCR+LLM pipelines:

| Mode | Description | Model type |
|------|-------------|------------|
| `zero_shot` | LLM receives the image directly and transcribes without prior OCR | VLM (vision) |
| `post_correction_texte` | OCR produces raw text, then LLM corrects it | Text-only LLM |
| `post_correction_image_texte` | OCR produces raw text, then LLM receives both image and text for correction | VLM (vision) |

**Example:** `ministral-3b-latest` is a text-only model and should use `post_correction_texte`.
GPT-4o and Claude support all three modes.
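
The constraints in the table can be expressed as a small validation step. The sketch below is an assumption about how such a check might look, not Picarones' own code: only the three mode strings come from the table, while `PipelineMode`, `check_mode` and its two boolean parameters are hypothetical.

```python
from enum import Enum


class PipelineMode(str, Enum):
    ZERO_SHOT = "zero_shot"
    POST_CORRECTION_TEXT = "post_correction_texte"
    POST_CORRECTION_IMAGE_TEXT = "post_correction_image_texte"


# Modes in which the LLM must be able to read the page image.
VISION_MODES = {PipelineMode.ZERO_SHOT, PipelineMode.POST_CORRECTION_IMAGE_TEXT}
# Modes in which an OCR engine runs before the LLM.
OCR_FIRST_MODES = {
    PipelineMode.POST_CORRECTION_TEXT,
    PipelineMode.POST_CORRECTION_IMAGE_TEXT,
}


def check_mode(mode: str, model_has_vision: bool, has_ocr_engine: bool) -> PipelineMode:
    """Validate a mode string against the capabilities of the configured pipeline."""
    parsed = PipelineMode(mode)  # raises ValueError on an unknown mode
    if parsed in VISION_MODES and not model_has_vision:
        raise ValueError(f"{parsed.value} needs a vision-capable model")
    if parsed in OCR_FIRST_MODES and not has_ocr_engine:
        raise ValueError(f"{parsed.value} needs an OCR engine before the LLM")
    return parsed
```

Under these rules a text-only model paired with an OCR engine passes only for `post_correction_texte`.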

---

## Supported Engines

| Engine | Type | Execution Mode | Installation |
|--------|------|---------------|-------------|
| **Tesseract 5** | Local CLI | CPU (ProcessPool) | `pip install pytesseract` + system binary |
| **Pero OCR** | Local Python | CPU (ProcessPool) | `pip install pero-ocr` |
| **Kraken** | Local Python | CPU (ProcessPool) | `pip install kraken` |
| **Mistral OCR** | Cloud API | IO (ThreadPool) | `MISTRAL_API_KEY` env var |
| **Google Vision** | Cloud API | IO (ThreadPool) | `GOOGLE_APPLICATION_CREDENTIALS` env var |
| **Azure Doc Intelligence** | Cloud API | IO (ThreadPool) | `AZURE_DOC_INTEL_ENDPOINT` + `AZURE_DOC_INTEL_KEY` |
| **GPT-4o** (VLM) | LLM API | IO (ThreadPool) | `OPENAI_API_KEY` env var |
| **Claude Sonnet** (VLM) | LLM API | IO (ThreadPool) | `ANTHROPIC_API_KEY` env var |
| **Mistral Large** (LLM) | LLM API | IO (ThreadPool) | `MISTRAL_API_KEY` env var |
| **Ollama** (local LLM) | Local LLM | IO (ThreadPool) | `ollama serve` running locally |
| **Custom engine** | CLI or API | Configurable | YAML declaration, no code required |

Engines declare their `execution_mode` (`"io"` or `"cpu"`), allowing the runner to use
`ThreadPoolExecutor` for IO-bound engines and `ProcessPoolExecutor` for CPU-bound engines
simultaneously.
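
A minimal sketch of this dispatch, assuming each engine object exposes an `execution_mode` attribute and a `transcribe` method (the method name and the `EchoEngine` class are hypothetical stand-ins, not the real `BaseOCREngine` interface):

```python
from concurrent.futures import Executor, ProcessPoolExecutor, ThreadPoolExecutor


class EchoEngine:
    """Hypothetical stand-in for an engine adapter."""

    def __init__(self, name: str, execution_mode: str):
        self.name = name
        self.execution_mode = execution_mode  # "io" or "cpu"

    def transcribe(self, page: str) -> str:
        return f"{self.name}:{page}"


def run_engines(engines, pages, io_pool: Executor, cpu_pool: Executor) -> dict:
    """Submit every (engine, page) job to the pool matching the engine's mode."""
    futures = {}
    for engine in engines:
        pool = cpu_pool if engine.execution_mode == "cpu" else io_pool
        for page in pages:
            futures[(engine.name, page)] = pool.submit(engine.transcribe, page)
    # Both pools work concurrently; results are collected once all jobs are queued.
    return {key: future.result() for key, future in futures.items()}


def demo() -> dict:
    """Run one CPU-bound and one IO-bound engine side by side."""
    engines = [EchoEngine("tesseract", "cpu"), EchoEngine("mistral_ocr", "io")]
    with ThreadPoolExecutor(max_workers=8) as io_pool, \
            ProcessPoolExecutor(max_workers=2) as cpu_pool:
        return run_engines(engines, ["p1.jpg", "p2.jpg"], io_pool, cpu_pool)
```

Call `demo()` from a script guarded by `if __name__ == "__main__":`, since process pools need an importable main module on some platforms. Threads suit engines that mostly wait on network responses, while processes sidestep the GIL for engines that compute locally.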

---

## Normalization Profiles

Picarones ships **11 built-in normalization profiles** designed for historical text comparison.
These reduce noise from expected orthographic variation so metrics reflect genuine OCR errors,
not historical spelling differences. The canonical list is defined in
[`picarones/core/normalization.py`](picarones/core/normalization.py) (`NORMALIZATION_PROFILES`)
and is exposed dynamically via `/api/normalization/profiles`.

| Profile | Period | Key equivalences |
|---------|--------|-----------------|
| `nfc` | Any | Unicode NFC normalization only |
| `caseless` | Any | NFC + case folding (`casefold`) |
| `minimal` | Any | NFC + long s (ſ -> s) |
| `medieval_french` | 12th-15th c. | ſ=s, u=v, i=j, y=i, æ=ae, œ=oe, ꝑ=per, & = et |
| `early_modern_french` | 16th-18th c. | ſ=s, æ=ae, œ=oe |
| `medieval_latin` | 12th-15th c. | ſ=s, u=v, i=j, ꝑ=per, ꝓ=pro |
| `medieval_english` | 12th-15th c. | ſ=s, u=v, i=j, þ=th, ȝ=y, ꝑ=per, ꝓ=pro |
| `early_modern_english` | 16th-18th c. | ſ=s, u=v, i=j, vv=w, þ=th, ð=th, ȝ=y |
| `secretary_hand` | 16th-17th c. | Early Modern English + secretary hand visual confusions |
| `sans_ponctuation` | Any | NFC + strips `. , ; : ! ? ' " - – — ( ) [ ]` |
| `sans_apostrophes` | Any | NFC + strips straight (`'`) and typographic (`’`) apostrophes |

Custom profiles can be loaded from YAML files with user-defined diplomatic tables and/or
`exclude_chars` sets.
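
To show what a diplomatic table and an `exclude_chars` set do, the sketch below applies both before comparison. It is not the code in `picarones/core/normalization.py`: the `normalize` function is hypothetical, the table is an illustrative lowercase-only subset of the `medieval_french` row, and the direction of each mapping (for example `v` to `u`) is an arbitrary choice of representative.

```python
import unicodedata

# Illustrative subset of the `medieval_french` equivalences listed above.
MEDIEVAL_FRENCH = {
    "ſ": "s", "v": "u", "j": "i", "y": "i",
    "æ": "ae", "œ": "oe", "ꝑ": "per", "&": "et",
}


def normalize(text: str, table: dict[str, str], exclude_chars: str = "") -> str:
    """Apply NFC, drop excluded characters, then fold diplomatic equivalents."""
    text = unicodedata.normalize("NFC", text)
    text = "".join(ch for ch in text if ch not in exclude_chars)
    # Map every member of an equivalence class to one representative.
    return "".join(table.get(ch, ch) for ch in text)
```

Applying the same profile to both ground truth and engine output makes `ſeruice` and `service` compare as equal, so the difference is not counted as an OCR error.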

---

## Error Taxonomy

Every error is automatically classified into one of 10 classes:

| Class | Name | Description |
|-------|------|-------------|
| 1 | `visual_confusion` | Morphologically similar characters (rn/m, l/1, O/0, u/n) |
| 2 | `diacritic_error` | Missing, incorrect, or spurious diacritical mark |
| 3 | `case_error` | Case difference only (A/a) |
| 4 | `ligature_error` | Ligature not resolved or incorrectly resolved |
| 5 | `abbreviation_error` | Medieval abbreviation not expanded |
| 6 | `hapax` | Word not found in any reference lexicon |
| 7 | `segmentation_error` | Token fusion or fragmentation (words/lines) |
| 8 | `oov_character` | Character outside the engine's vocabulary |
| 9 | `lacuna` | Text present in ground truth but absent from OCR output |
| 10 | `over_normalization` | LLM silently modernized a historical spelling |
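
As a rough illustration of how a few of these classes could be told apart for a single substituted character, consider the toy heuristic below. It is not the classifier in `picarones/core/taxonomy.py`: the function names, the short `VISUAL_PAIRS` list and the `"unclassified"` fallback label are assumptions of this example, and only four of the ten classes are covered.

```python
import unicodedata

# A few single-character look-alikes taken from the table above.
VISUAL_PAIRS = {frozenset(pair) for pair in [("l", "1"), ("O", "0"), ("u", "n")]}


def strip_marks(char: str) -> str:
    """Remove combining marks, e.g. 'é' becomes 'e'."""
    decomposed = unicodedata.normalize("NFD", char)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def classify_substitution(gt_char: str, ocr_char: str):
    """Return a class name for one GT/OCR character pair, or None if they match."""
    if gt_char == ocr_char:
        return None
    if ocr_char == "":
        return "lacuna"              # present in GT, absent from the output
    if gt_char.casefold() == ocr_char.casefold():
        return "case_error"          # differs by case only
    if strip_marks(gt_char) == strip_marks(ocr_char):
        return "diacritic_error"     # same base letter, different marks
    if frozenset((gt_char, ocr_char)) in VISUAL_PAIRS:
        return "visual_confusion"
    return "unclassified"            # placeholder label, not a Picarones class
```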

---

## Project Structure

```
picarones/
├── __init__.py                 # Version (1.0.0), package metadata
├── __main__.py                 # `python -m picarones`
├── cli.py                      # Click CLI: run, demo, report, metrics, engines, info,
│                               #   serve, import iiif, history, robustness
├── fixtures.py                 # Realistic synthetic test data (medieval documents)
├── i18n.py                     # Back-compat shim loading report/i18n/{fr,en}.json

├── core/
│   ├── corpus.py               # Corpus loading (folder, ALTO XML, PAGE XML)
│   ├── metrics.py              # CER, WER, MER, WIL (via jiwer)
│   ├── normalization.py        # Unicode normalization, 11 diplomatic/exclusion profiles
│   ├── statistics.py           # Bootstrap CI, Wilcoxon, Friedman, Nemenyi, CDD SVG
│   ├── runner.py               # Benchmark orchestrator (ThreadPool + ProcessPool)
│   ├── results.py              # DocumentResult, BenchmarkResults, JSON export
│   ├── confusion.py            # Unicode confusion matrix
│   ├── char_scores.py          # Ligature and diacritic scores
│   ├── taxonomy.py             # 10-class error taxonomy
│   ├── structure.py            # Structural analysis (blocks, lines, words)
│   ├── image_quality.py        # Image quality metrics (contrast, noise, resolution)
│   ├── difficulty.py           # Intrinsic difficulty score per document
│   ├── hallucination.py        # VLM hallucination detection
│   ├── line_metrics.py         # Line-level error distribution (Gini, percentiles)
│   ├── history.py              # SQLite longitudinal tracking
│   ├── robustness.py           # Robustness analysis (noise, blur, rotation, resolution)
│   ├── pricing.py              # Cost model, EngineCost, Pareto front
│   └── narrative/              # Factual narrative engine (Sprint 16-19)
│       ├── facts.py            # Fact model, 12 FactType, DetectorRegistry
│       ├── detectors.py        # 12 detectors (global_leader_cer, significant_gap,
│       │                       #   stratum_winner/collapse, error_profile_outlier,
│       │                       #   llm_hallucination_flag, robustness_fragile,
│       │                       #   speed_winner, confidence_warning,
│       │                       #   statistical_tie, pareto_alternative, cost_outlier)
│       ├── arbiter.py          # Sort by importance, dedup, anti-contradiction
│       ├── renderer.py         # YAML template rendering via str.format_map
│       └── templates/{fr,en}.yaml

├── data/
│   └── pricing.yaml            # Indicative cost table (OCR local/cloud + LLM)

├── engines/
│   ├── base.py                 # BaseOCREngine (execution_mode: "io" | "cpu")
│   ├── tesseract.py            # Tesseract 5 adapter (CPU)
│   ├── pero_ocr.py             # Pero OCR adapter (CPU)
│   ├── mistral_ocr.py          # Mistral OCR API (/v1/ocr endpoint)
│   ├── google_vision.py        # Google Cloud Vision adapter
│   └── azure_doc_intel.py      # Azure Document Intelligence adapter

├── llm/
│   ├── base.py                 # BaseLLMAdapter interface
│   ├── openai_adapter.py       # OpenAI / GPT-4o adapter
│   ├── anthropic_adapter.py    # Anthropic / Claude adapter
│   ├── mistral_adapter.py      # Mistral chat completions adapter
│   └── ollama_adapter.py       # Ollama local LLM adapter

├── pipelines/
│   ├── base.py                 # OCRLLMPipeline orchestrator
│   └── over_normalization.py   # Over-normalization detection

├── prompts/                    # 8 versioned prompt templates
│   ├── correction_medieval_french.txt
│   ├── correction_image_medieval_french.txt
│   ├── correction_imprime_ancien.txt
│   ├── correction_medieval_english.txt
│   ├── correction_early_modern_english.txt
│   ├── zero_shot_medieval_french.txt
│   ├── zero_shot_imprime_ancien.txt
│   └── zero_shot_medieval_english.txt

├── report/
│   ├── generator.py            # Orchestrates Jinja2 rendering (617 lines since Sprint 17)
│   ├── diff_utils.py           # Diff computation utilities
│   ├── templates/              # Jinja2 partials (Sprint 17)
│   │   ├── base.html.j2        # assembles everything via {% include %}
│   │   ├── _header.html, _footer.html, _styles.css, _app.js
│   │   ├── _critical_difference.html, _narrative_summary.html, _side_panels.html
│   │   └── view_ranking.html, view_gallery.html, view_document.html,
│   │       view_analyses.html, view_characters.html
│   ├── i18n/                   # FR/EN translations (Sprint 17 -- extracted from i18n.py)
│   │   ├── fr.json
│   │   └── en.json
│   ├── glossary/               # Contextual glossary (Sprint 21)
│   │   ├── fr.yaml             # 25 bilingual entries (definition, measures, usage,
│   │   └── en.yaml             #   limits, reference)
│   └── vendor/                 # Vendored Chart.js

├── web/
│   ├── app.py                  # FastAPI app (SSE, ZIP upload, dynamic endpoints)
│   └── static/                 # CSS assets

└── importers/
    ├── iiif.py                 # IIIF manifest importer
    ├── gallica.py              # Gallica API client (institutional digital library)
    ├── htr_united.py           # HTR-United catalogue importer
    ├── huggingface.py          # HuggingFace Datasets importer
    └── escriptorium.py         # eScriptorium client

docs/                           # User + developer documentation (Sprint 22)
├── case-studies/               # Two labelled case studies ("Cas d'école")
│   ├── 01-registres-paroissiaux.md
│   └── 02-edition-critique.md
├── user/
│   └── reading-a-report.md     # Anatomy, suggested reading order, advanced panel
└── developer/
    ├── index.md
    ├── narrative-engine.md
    ├── extending-glossary.md
    └── extending-i18n.md

tests/                          # 1242 tests (1 skipped: scipy optional)
.github/workflows/
├── ci.yml                      # CI: Python 3.11/3.12, Linux/macOS/Windows, ruff lint
└── sync_to_huggingface.yml     # Auto-sync to HuggingFace Space on push to main
Dockerfile                      # Multi-stage Docker build for HuggingFace Spaces
```

---

## Environment Variables

Configure API keys depending on which engines and LLM adapters you use:

```bash
# LLM APIs
export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="sk-ant-..."
export MISTRAL_API_KEY="..."

# Cloud OCR APIs (optional)
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/credentials.json"
export AWS_ACCESS_KEY_ID="..."
export AWS_SECRET_ACCESS_KEY="..."
export AWS_DEFAULT_REGION="eu-west-1"
export AZURE_DOC_INTEL_ENDPOINT="https://..."
export AZURE_DOC_INTEL_KEY="..."
```

For deployment on HuggingFace Spaces, set these in **Settings > Variables and secrets**.
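
Before a long benchmark it can help to check which variables are actually set. The snippet below is a small sketch using only the variable names listed above; the backend labels and the `missing_vars` helper are made up for this example and are not part of Picarones.

```python
import os

# Variable names come from the export list above; the grouping is illustrative.
REQUIRED_VARS = {
    "openai": ["OPENAI_API_KEY"],
    "anthropic": ["ANTHROPIC_API_KEY"],
    "mistral": ["MISTRAL_API_KEY"],
    "google_vision": ["GOOGLE_APPLICATION_CREDENTIALS"],
    "azure_doc_intel": ["AZURE_DOC_INTEL_ENDPOINT", "AZURE_DOC_INTEL_KEY"],
}


def missing_vars(backends, environ=None):
    """Return {backend: [unset variables]} for the requested backends."""
    env = os.environ if environ is None else environ
    report = {}
    for name in backends:
        unset = [var for var in REQUIRED_VARS[name] if not env.get(var)]
        if unset:
            report[name] = unset
    return report


print(missing_vars(["openai", "azure_doc_intel"]))
```

An empty dictionary means every requested backend has its variables defined; it does not verify that the keys are valid.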

---

## CI/CD

### GitHub Actions (`ci.yml`)

- **Triggers:** push to `main`/`develop`/`feature/*`/`sprint/*`/`claude/*`, PRs to
  `main`/`develop`, manual dispatch
- **Matrix:** Python 3.11 + 3.12 on Linux, macOS, and Windows
- **Jobs:**
  1. **Tests** -- full pytest suite (1242 passing, 1 skipped when scipy is absent) with
     coverage uploaded to Codecov
  2. **Demo** -- end-to-end demo report generation with history and robustness
  3. **Build** -- wheel and sdist with twine validation
  4. **Lint** -- `ruff check picarones/ tests/` (rule sets E, W, F; ignores E501, E402). The
     ruff config lives in `pyproject.toml` under `[tool.ruff]`, so CI, `make lint` and direct
     invocation all produce the same result, and each fails on F401 (unused import) or
     E741 (ambiguous variable name).

### HuggingFace Sync (`sync_to_huggingface.yml`)

- Automatically pushes `main` to the HuggingFace Space `Ma-Ri-Ba-Ku/Picarones`
- Requires the `HF_TOKEN` secret in GitHub repository settings

---

## Development

```bash
# Install with dev + web dependencies
pip install -e ".[dev,web]"

# Run the test suite
pytest tests/ -q --tb=short

# Run with coverage
pytest tests/ --cov=picarones --cov-report=term-missing

# Generate a demo report
picarones demo --output demo_report.html

# Launch the web UI in development mode
picarones serve --port 8080

# Full refresh (useful in Codespaces)
git pull && pip install -e ".[dev,web]" && picarones demo --output demo.html
```

**Test suite:** `pytest tests/` -> **1242 passed, 1 skipped** (the skip is intentional
when the optional `scipy` extra is not installed).

**Key development conventions:**

- Never use bare `except Exception: pass` -- always log with `logger.warning()`
- Normalization profiles are read dynamically from `picarones/core/normalization.py` --
  never hardcode them in endpoint handlers
- Engines declare their `execution_mode` (`"io"` or `"cpu"`) so the runner can select the
  appropriate executor
- `python-multipart` must remain in dependencies (FastAPI checks at import time)
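
The first and third conventions can be sketched as follows. This is an illustration under assumptions, not the code in `picarones/core/runner.py`: `executor_class_for`, `safe_transcribe` and the `engine.run(...)` method are hypothetical names.

```python
import logging
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

logger = logging.getLogger(__name__)


def executor_class_for(engine):
    """Pick processes for CPU-bound engines and threads for I/O-bound ones."""
    mode = getattr(engine, "execution_mode", "io")
    return ProcessPoolExecutor if mode == "cpu" else ThreadPoolExecutor


def safe_transcribe(engine, image_path):
    """Log failures instead of swallowing them with `except Exception: pass`."""
    try:
        return engine.run(image_path)  # `run` is an assumed method name
    except Exception as exc:
        logger.warning(
            "Engine %s failed on %s: %s", type(engine).__name__, image_path, exc
        )
        return None
```

Threads suit engines that mostly wait on a remote API, while separate processes let CPU-bound engines sidestep the GIL.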

---

## Roadmap

| Sprint | Status | Deliverables |
|--------|--------|-------------|
| 1 | Done | Project structure, Tesseract, Pero OCR, CER/WER, CLI |
| 2 | Done | HTML report v1: Chart.js, colored diff, gallery |
| 3 | Done | OCR+LLM pipelines, GPT-4o, Claude, Mistral, Ollama |
| 4 | Done | Cloud OCR APIs, IIIF import, diplomatic normalization |
| 5 | Done | Advanced metrics: confusion matrix, ligatures, 9-class taxonomy |
| 6 | Done | FastAPI web interface, HTR-United, HuggingFace, bilingual UI |
| 7 | Done | HTML report v2: Wilcoxon, bootstrap, clustering, difficulty score |
| 8 | Done | eScriptorium, Gallica API, SQLite history, robustness analysis |
| 9 | Done | Documentation, packaging, Docker, CI/CD, PyInstaller, v1.0.0-Beta |
| 10 | Done | Line error distribution (Gini), VLM hallucination detection |
| 11 | Done | Internationalization FR/EN, English normalization profiles |
| 12 | Done | Browser ZIP upload, macOS file filtering, dynamic model selector |
| 13 | Done | pyproject.toml cleanup, runner parallelization, NDJSON streaming, Wilcoxon validation |
| 14 | Done | Robust engine filtering, corpus validation |
| 15 | Done | Fix empty OCR+LLM pipeline output (Mistral ContentChunk normalization, `finish_reason` logging) |
| 16 | Done | `line_metrics` + `hallucination` wired into runner/`EngineReport`; narrative engine foundations (`core/narrative/` with `Fact` / `DetectorRegistry`); Pillow `getdata`->`tobytes`, silent excepts -> explicit warnings |
| 17 | Done | Report refactor: `generator.py` 3690 -> 617 lines via Jinja2; monolithic HTML template split into 10 files under `picarones/report/templates/`; i18n migrated to `report/i18n/{fr,en}.json`; +16 non-regression tests |
| 18 | Done | Friedman test + Nemenyi post-hoc + Critical Difference Diagram (Demšar 2006); `detect_statistical_tie` enabled; SVG rendered server-side; +41 tests |
| 19 | Done | Factual narrative engine complete: 9 new detectors, arbiter (importance + anti-contradiction), YAML templates renderer, `_narrative_summary.html` partial, anti-hallucination traceability test; +32 tests |
| 20 | Done | Cost model + Pareto view: `core/pricing.py` + `data/pricing.yaml`, `compute_pareto_front`, Chart.js Pareto chart with cost/speed/carbon toggles, `pareto_alternative` and `cost_outlier` detectors; +28 tests |
| 21 | Done | Contextual glossary (25 bilingual entries) + advanced-mode side panel (visible columns, strata filters, opt-in composite score, URL state persistence); +21 tests |
| 22 | Done | Case studies (`docs/case-studies/`), user guide (`docs/user/reading-a-report.md`), three developer guides (`docs/developer/`); +18 tests |

---

## Known Issues & Improvement Opportunities

This section captures the findings of the Sprint 22 audit. None of them block the current
release (all 1242 tests pass, lint clean), but each represents a sensible next step.

### Architecture / refactor

- **`picarones/web/app.py` is 3072 lines** (FastAPI routes, corpus upload, SSE, ZIP flattening,
  HTML delivery, model registry all in one module). Candidate split: `app_routes.py` /
  `app_corpus.py` / `app_jobs.py` / `app_models.py`.
- **`picarones/core/statistics.py` is 1127 lines** mixing bootstrap CI, Wilcoxon, Friedman,
  Nemenyi table, Pareto front and CDD SVG. Splitting into `statistics/bootstrap.py`,
  `statistics/tests.py`, `statistics/pareto.py`, `statistics/cdd_svg.py` would shorten
  import graphs and ease review.
- **`picarones/cli.py` is 971 lines** — each Click command could live in its own module under
  `picarones/cli/` and be registered via `cli.add_command(...)`.
- **`picarones/core/runner.py` is 847 lines** -- the orchestrator is reasonable but exceeds the
  500-line guideline; extracting the per-document worker and the partial-NDJSON writer would
  make it easier to follow.
- **`picarones/core/narrative/detectors.py` is 680 lines** — all 12 detectors live together;
  one file per `FactType` (or per importance tier) would make additions safer.

### Back-compat shim

- **`picarones/i18n.py`** is a 66-line shim that reads `picarones/report/i18n/{fr,en}.json`.
  Since Sprint 17 the JSON files are the source of truth and only
  `picarones/report/generator.py:654` still imports through the shim. Either promote the
  shim to `picarones.report.i18n` (renaming the import) or delete the file and import the
  loader directly.

### Explicit engine declarations

- `MistralOCREngine`, `GoogleVisionEngine` and `AzureDocIntelEngine` inherit the implicit
  `execution_mode = "io"` default from `BaseOCREngine`. For clarity and to protect against a
  future default flip, declare it explicitly (as `TesseractEngine` and `PeroOCREngine` already
  do for `"cpu"`).
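
One way to enforce this is a test that distinguishes an inherited attribute from one set in the class body. The sketch below uses stand-in classes rather than the real adapters, and `declares_execution_mode` is a hypothetical helper.

```python
def declares_execution_mode(engine_cls):
    """True only if the class sets `execution_mode` in its own body."""
    return "execution_mode" in vars(engine_cls)


# Stand-in classes for illustration, not the real Picarones adapters.
class BaseEngine:
    execution_mode = "io"  # implicit default inherited by subclasses


class ImplicitCloudEngine(BaseEngine):
    pass  # relies on the inherited default


class ExplicitCloudEngine(BaseEngine):
    execution_mode = "io"  # survives a change of the base default


print(declares_execution_mode(ImplicitCloudEngine))  # False
print(declares_execution_mode(ExplicitCloudEngine))  # True
```

`vars(cls)` only lists attributes defined on the class itself, so an inherited default does not satisfy the check.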

### Test coverage gaps

- No dedicated unit tests for `picarones/core/char_scores.py` (exercised only transitively).
- No unit tests for the cloud engine adapters themselves (`mistral_ocr.py`,
  `google_vision.py`, `azure_doc_intel.py`) — they are only reached via integration fixtures.
- `pytest` installed as a `uv` tool doesn't see project dependencies automatically; document
  `pip install -e ".[dev,web,stats]"` in the pytest environment or switch to an in-repo venv
  to avoid "`ModuleNotFoundError: No module named 'yaml'`" surprises.

### Documentation

- `CHANGELOG.md` stops at Sprint 9 (2025-03). Sprints 10-22 are described in `CLAUDE.md` and
  this README but should be back-ported into `CHANGELOG.md` to follow Keep-a-Changelog.
- `SPECS.md` predates the narrative engine, Pareto view and glossary — worth a pass.
- Some code comments and docstrings are still in French while user-facing strings are
  bilingual; harmonising module docstrings in English would make the project more
  contributor-friendly.

### CI / packaging

- `sync_to_huggingface.yml` uses `git push --force hf main` unconditionally. This is safe
  while the workflow only runs on pushes to `main`, but it is worth documenting: a run
  triggered from any other branch could silently rewrite the Space.
- `picarones.spec` (PyInstaller) is still present but not exercised in CI — either add a
  `build-exe` job or mark the spec as community-maintained.

### Security (nothing critical)

- ZIP upload flattening in `web/app.py` rejects absolute paths and `..` traversal but does
  not check for symlinks inside archives. Python's `zipfile` doesn't extract symlinks, so
  the risk is theoretical; adding an explicit check (`stat.S_ISLNK(info.external_attr >> 16)`,
  because the Unix file mode sits in the upper 16 bits of `ZipInfo.external_attr`) is a
  belt-and-braces improvement.
- API keys are read from environment variables only (no hardcoded fallback) — good.
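
A minimal sketch of the symlink check, using only the standard library. The helper name `symlink_entries` is hypothetical and this is not the code in `web/app.py`. Note that the Unix mode has to be shifted out of the upper 16 bits of `external_attr` before it is tested.

```python
import io
import stat
import zipfile


def symlink_entries(archive):
    """Return the names of entries whose Unix mode marks them as symlinks."""
    names = []
    for info in archive.infolist():
        unix_mode = info.external_attr >> 16  # upper 16 bits hold st_mode
        if stat.S_ISLNK(unix_mode):
            names.append(info.filename)
    return names


# Build a small in-memory archive with one regular file and one symlink entry.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("page_001.txt", "transcription")
    link = zipfile.ZipInfo("evil_link")
    link.create_system = 3  # Unix, so external_attr carries a file mode
    link.external_attr = (stat.S_IFLNK | 0o777) << 16
    zf.writestr(link, "/etc/passwd")

with zipfile.ZipFile(buffer) as zf:
    print(symlink_entries(zf))  # ['evil_link']
```

An upload handler could reject the archive as soon as this list is non-empty, before any entry is written to disk.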

---

## Contributing

See [CONTRIBUTING.md](CONTRIBUTING.md) for instructions on adding an OCR engine, an LLM
adapter, or submitting a pull request.

---

## License

[Apache License 2.0](LICENSE)

Copyright 2024 Picarones contributors.