Spaces:
Sleeping
feat(sprint-E.5): 3 derniers modules measurements/ migrés — BOOTSTRAP_BASELINE = 0
Browse files🎯 Sprint E.5 du plan v2.0 — la migration des sources
``measurements/*.py`` vers ``evaluation/metrics/`` est
**complète**. Tous les modules legacy de mesures sont désormais
des shims pointant sur la couche canonique.
Modules déplacés (git mv)
--------------------------
- ``reliability.py`` (360 LOC) — Cohen κ, Krippendorff α (Sprint 12).
- ``history.py`` (615 LOC) — SQLite longitudinal (Sprint 92,
Pettitt change-point).
- ``robustness.py`` (731 LOC) — analyse de robustesse multi-niveaux
(bruit, blur, rotation, résolution, binarisation — Sprint 81).
Total : 1706 LOC migrées vers ``evaluation/metrics/``.
Adaptation contraintes architecturales
---------------------------------------
``robustness.py`` importait ``BaseOCREngine`` (couche
``adapters/``) et ``tqdm`` (lib externe) — interdits dans la
couche ``evaluation/`` (whitelist stricte). Solution :
- ``BaseOCREngine`` (TYPE_CHECKING) → typage ``Any`` + duck
typing runtime (l'objet doit juste avoir
``.run(image_path) → EngineResult``).
- ``tqdm`` → import dynamique via ``importlib.import_module``,
explicitement permis par ``test_layer_imports_are_legal`` qui
ne couvre pas les imports différés.
Aucune régression de fonctionnalité — duck typing et import
dynamique sont sémantiquement équivalents pour ce caller.
Migration callers (8 prod + 3 tests)
-------------------------------------
- ``reports_v2/html/renderers/multirun_stability.py`` (1 import).
- ``reports_v2/html/renderers/longitudinal.py`` (1).
- ``reports_v2/html/renderers/robustness_projection.py`` (1).
- ``web/routers/history.py`` (1).
- ``cli/__init__.py`` (1).
- ``cli/_history.py`` (1).
- ``cli/_robustness.py`` (1).
- ``adapters/legacy_engines/base.py`` (1, docstring).
Sprint E — bilan post-E.5
-------------------------
**Tous les 21 modules sources de ``measurements/`` sont migrés**
vers ``evaluation/metrics/`` (E.1: 4 + E.2: 10 + E.3: 1 +
E.4: 3 + E.5: 3 = 21 modules, ~6700 LOC canonisées).
``measurements/`` ne contient désormais que :
- 21 shims ``DeprecationWarning`` (~25 lignes chacun).
- ``narrative/`` (sous-package, déjà migré au Lot 5.A
vers ``reports_v2/narrative/``).
- ``__init__.py``.
Architecture
------------
- ``BOOTSTRAP_BASELINE`` du
``test_legacy_canonical_parity`` : **17 → 0** 🎯.
Aucun symbole legacy public ``measurements/`` n'est plus
non tracé — chaque shim ré-exporte vers une cible canonique
identifiable.
- ``FILE_BUDGETS`` : entrées ``picarones/measurements/{history,
robustness}.py`` renommées vers ``picarones/evaluation/metrics/``.
- ``TEST_ONLY_BASELINE`` : ajout de ``history``, ``robustness``
(les 2 derniers shims sans consommateur production direct).
Bilan
-----
- ``pytest tests/`` : 4668 passed (+2), 0 failed.
- ``ruff check`` : clean.
- 3 modules canonisés.
- ``measurements/`` : 0 module source, 21 shims + 1
sous-package legacy ``narrative/``.
Sprint E.6 — prochaine étape
-----------------------------
Suppression complète du sous-package ``picarones/measurements/`` :
1. Supprimer les 21 shims (suppression agressive — les callers
externes auront eu ``DeprecationWarning`` durant la migration).
2. Supprimer ``narrative/`` qui est en doublon avec
``reports_v2/narrative/``.
3. Cleanup baselines architecturales (mise à zéro).
https://claude.ai/code/session_011XQZNitg1rCgia8ZD1a2hP
- picarones/cli/__init__.py +2 -2
- picarones/cli/_history.py +1 -1
- picarones/cli/_robustness.py +2 -2
- picarones/evaluation/metrics/history.py +615 -0
- picarones/evaluation/metrics/reliability.py +360 -0
- picarones/evaluation/metrics/robustness.py +742 -0
- picarones/measurements/history.py +14 -608
- picarones/measurements/reliability.py +14 -353
- picarones/measurements/robustness.py +14 -724
- picarones/reports_v2/html/renderers/longitudinal.py +1 -1
- picarones/reports_v2/html/renderers/multirun_stability.py +1 -1
- picarones/reports_v2/html/renderers/robustness_projection.py +1 -1
- picarones/web/routers/history.py +1 -1
- tests/architecture/test_file_budgets.py +4 -0
- tests/architecture/test_legacy_canonical_parity.py +1 -1
- tests/architecture/test_module_coverage.py +4 -0
- tests/measurements/test_sprint83_reliability.py +1 -1
- tests/measurements/test_sprint8_longitudinal_robustness.py +67 -67
|
@@ -320,7 +320,7 @@ def demo_cmd(
|
|
| 320 |
# Suivi longitudinal
|
| 321 |
if with_history:
|
| 322 |
click.echo("\n── Démonstration suivi longitudinal ──────────────")
|
| 323 |
-
from picarones.
|
| 324 |
history = BenchmarkHistory(":memory:")
|
| 325 |
generate_demo_history(history, n_runs=8)
|
| 326 |
entries = history.query(engine="tesseract")
|
|
@@ -344,7 +344,7 @@ def demo_cmd(
|
|
| 344 |
# Analyse de robustesse
|
| 345 |
if with_robustness:
|
| 346 |
click.echo("\n── Démonstration analyse de robustesse ───────────")
|
| 347 |
-
from picarones.
|
| 348 |
report = generate_demo_robustness_report(
|
| 349 |
engine_names=["tesseract", "pero_ocr"]
|
| 350 |
)
|
|
|
|
| 320 |
# Suivi longitudinal
|
| 321 |
if with_history:
|
| 322 |
click.echo("\n── Démonstration suivi longitudinal ──────────────")
|
| 323 |
+
from picarones.evaluation.metrics.history import BenchmarkHistory, generate_demo_history
|
| 324 |
history = BenchmarkHistory(":memory:")
|
| 325 |
generate_demo_history(history, n_runs=8)
|
| 326 |
entries = history.query(engine="tesseract")
|
|
|
|
| 344 |
# Analyse de robustesse
|
| 345 |
if with_robustness:
|
| 346 |
click.echo("\n── Démonstration analyse de robustesse ───────────")
|
| 347 |
+
from picarones.evaluation.metrics.robustness import generate_demo_robustness_report
|
| 348 |
report = generate_demo_robustness_report(
|
| 349 |
engine_names=["tesseract", "pero_ocr"]
|
| 350 |
)
|
|
@@ -103,7 +103,7 @@ def history_cmd(
|
|
| 103 |
"""
|
| 104 |
_setup_logging(verbose)
|
| 105 |
|
| 106 |
-
from picarones.
|
| 107 |
|
| 108 |
history = BenchmarkHistory(db)
|
| 109 |
|
|
|
|
| 103 |
"""
|
| 104 |
_setup_logging(verbose)
|
| 105 |
|
| 106 |
+
from picarones.evaluation.metrics.history import BenchmarkHistory, generate_demo_history
|
| 107 |
|
| 108 |
history = BenchmarkHistory(db)
|
| 109 |
|
|
@@ -99,7 +99,7 @@ def robustness_cmd(
|
|
| 99 |
|
| 100 |
deg_types = [d.strip() for d in degradations.split(",") if d.strip()]
|
| 101 |
|
| 102 |
-
from picarones.
|
| 103 |
RobustnessAnalyzer, ALL_DEGRADATION_TYPES, generate_demo_robustness_report
|
| 104 |
)
|
| 105 |
|
|
@@ -139,7 +139,7 @@ def robustness_cmd(
|
|
| 139 |
click.echo(f"Erreur moteur : {exc}", err=True)
|
| 140 |
sys.exit(1)
|
| 141 |
|
| 142 |
-
from picarones.
|
| 143 |
analyzer = RobustnessAnalyzer(
|
| 144 |
engines=[ocr_engine],
|
| 145 |
degradation_types=deg_types,
|
|
|
|
| 99 |
|
| 100 |
deg_types = [d.strip() for d in degradations.split(",") if d.strip()]
|
| 101 |
|
| 102 |
+
from picarones.evaluation.metrics.robustness import (
|
| 103 |
RobustnessAnalyzer, ALL_DEGRADATION_TYPES, generate_demo_robustness_report
|
| 104 |
)
|
| 105 |
|
|
|
|
| 139 |
click.echo(f"Erreur moteur : {exc}", err=True)
|
| 140 |
sys.exit(1)
|
| 141 |
|
| 142 |
+
from picarones.evaluation.metrics.robustness import RobustnessAnalyzer
|
| 143 |
analyzer = RobustnessAnalyzer(
|
| 144 |
engines=[ocr_engine],
|
| 145 |
degradation_types=deg_types,
|
|
@@ -0,0 +1,615 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""Suivi longitudinal des benchmarks — base SQLite optionnelle.
|
| 2 |
+
|
| 3 |
+
Fonctionnement
|
| 4 |
+
--------------
|
| 5 |
+
- Chaque run de benchmark est enregistré dans une table SQLite avec horodatage,
|
| 6 |
+
corpus, moteurs, métriques agrégées.
|
| 7 |
+
- L'historique permet de tracer des courbes d'évolution du CER dans le temps.
|
| 8 |
+
- La détection de régression compare le dernier run à une baseline configurable.
|
| 9 |
+
|
| 10 |
+
Structure de la base
|
| 11 |
+
--------------------
|
| 12 |
+
Table ``runs`` :
|
| 13 |
+
run_id TEXT PRIMARY KEY — UUID ou hash du run
|
| 14 |
+
timestamp TEXT — ISO 8601
|
| 15 |
+
corpus_name TEXT
|
| 16 |
+
engine_name TEXT
|
| 17 |
+
cer_mean REAL
|
| 18 |
+
wer_mean REAL
|
| 19 |
+
doc_count INTEGER
|
| 20 |
+
metadata TEXT — JSON
|
| 21 |
+
|
| 22 |
+
Usage
|
| 23 |
+
-----
|
| 24 |
+
>>> from picarones.evaluation.metrics.history import BenchmarkHistory
|
| 25 |
+
>>> history = BenchmarkHistory("~/.picarones/history.db")
|
| 26 |
+
>>> history.record(benchmark_result)
|
| 27 |
+
>>> df = history.query(engine="tesseract", corpus="chroniques")
|
| 28 |
+
>>> regression = history.detect_regression(engine="tesseract", threshold=0.02)
|
| 29 |
+
"""
|
| 30 |
+
|
| 31 |
+
from __future__ import annotations
|
| 32 |
+
|
| 33 |
+
import json
|
| 34 |
+
import logging
|
| 35 |
+
import sqlite3
|
| 36 |
+
import uuid
|
| 37 |
+
from dataclasses import dataclass, field
|
| 38 |
+
from datetime import datetime, timezone
|
| 39 |
+
from pathlib import Path
|
| 40 |
+
from typing import TYPE_CHECKING, Optional
|
| 41 |
+
|
| 42 |
+
if TYPE_CHECKING:
|
| 43 |
+
from picarones.evaluation.benchmark_result import BenchmarkResult
|
| 44 |
+
|
| 45 |
+
logger = logging.getLogger(__name__)
|
| 46 |
+
|
| 47 |
+
|
| 48 |
+
# ---------------------------------------------------------------------------
|
| 49 |
+
# Structures de données
|
| 50 |
+
# ---------------------------------------------------------------------------
|
| 51 |
+
|
| 52 |
+
@dataclass
|
| 53 |
+
class HistoryEntry:
|
| 54 |
+
"""Un enregistrement dans l'historique des benchmarks."""
|
| 55 |
+
run_id: str
|
| 56 |
+
timestamp: str
|
| 57 |
+
corpus_name: str
|
| 58 |
+
engine_name: str
|
| 59 |
+
cer_mean: Optional[float]
|
| 60 |
+
wer_mean: Optional[float]
|
| 61 |
+
doc_count: int
|
| 62 |
+
metadata: dict = field(default_factory=dict)
|
| 63 |
+
|
| 64 |
+
@property
|
| 65 |
+
def cer_percent(self) -> Optional[float]:
|
| 66 |
+
return self.cer_mean * 100 if self.cer_mean is not None else None
|
| 67 |
+
|
| 68 |
+
def as_dict(self) -> dict:
|
| 69 |
+
return {
|
| 70 |
+
"run_id": self.run_id,
|
| 71 |
+
"timestamp": self.timestamp,
|
| 72 |
+
"corpus_name": self.corpus_name,
|
| 73 |
+
"engine_name": self.engine_name,
|
| 74 |
+
"cer_mean": self.cer_mean,
|
| 75 |
+
"wer_mean": self.wer_mean,
|
| 76 |
+
"doc_count": self.doc_count,
|
| 77 |
+
"metadata": self.metadata,
|
| 78 |
+
}
|
| 79 |
+
|
| 80 |
+
|
| 81 |
+
@dataclass
|
| 82 |
+
class RegressionResult:
|
| 83 |
+
"""Résultat d'une détection de régression."""
|
| 84 |
+
engine_name: str
|
| 85 |
+
corpus_name: str
|
| 86 |
+
baseline_run_id: str
|
| 87 |
+
baseline_timestamp: str
|
| 88 |
+
baseline_cer: Optional[float]
|
| 89 |
+
current_run_id: str
|
| 90 |
+
current_timestamp: str
|
| 91 |
+
current_cer: Optional[float]
|
| 92 |
+
delta_cer: Optional[float]
|
| 93 |
+
"""Delta CER (current - baseline). Positif = régression."""
|
| 94 |
+
is_regression: bool
|
| 95 |
+
threshold: float
|
| 96 |
+
|
| 97 |
+
def as_dict(self) -> dict:
|
| 98 |
+
return {
|
| 99 |
+
"engine_name": self.engine_name,
|
| 100 |
+
"corpus_name": self.corpus_name,
|
| 101 |
+
"baseline_run_id": self.baseline_run_id,
|
| 102 |
+
"baseline_timestamp": self.baseline_timestamp,
|
| 103 |
+
"baseline_cer": self.baseline_cer,
|
| 104 |
+
"current_run_id": self.current_run_id,
|
| 105 |
+
"current_timestamp": self.current_timestamp,
|
| 106 |
+
"current_cer": self.current_cer,
|
| 107 |
+
"delta_cer": self.delta_cer,
|
| 108 |
+
"is_regression": self.is_regression,
|
| 109 |
+
"threshold": self.threshold,
|
| 110 |
+
}
|
| 111 |
+
|
| 112 |
+
|
| 113 |
+
# ---------------------------------------------------------------------------
|
| 114 |
+
# BenchmarkHistory
|
| 115 |
+
# ---------------------------------------------------------------------------
|
| 116 |
+
|
| 117 |
+
class BenchmarkHistory:
|
| 118 |
+
"""Gestionnaire de l'historique des benchmarks dans SQLite.
|
| 119 |
+
|
| 120 |
+
Parameters
|
| 121 |
+
----------
|
| 122 |
+
db_path:
|
| 123 |
+
Chemin vers le fichier SQLite. Utiliser ``":memory:"`` pour les tests.
|
| 124 |
+
|
| 125 |
+
Examples
|
| 126 |
+
--------
|
| 127 |
+
>>> history = BenchmarkHistory("~/.picarones/history.db")
|
| 128 |
+
>>> history.record(benchmark)
|
| 129 |
+
>>> entries = history.query(engine="tesseract")
|
| 130 |
+
>>> for e in entries:
|
| 131 |
+
... print(e.timestamp, f"CER={e.cer_percent:.2f}%")
|
| 132 |
+
"""
|
| 133 |
+
|
| 134 |
+
_CREATE_TABLE = """
|
| 135 |
+
CREATE TABLE IF NOT EXISTS runs (
|
| 136 |
+
run_id TEXT PRIMARY KEY,
|
| 137 |
+
timestamp TEXT NOT NULL,
|
| 138 |
+
corpus_name TEXT NOT NULL,
|
| 139 |
+
engine_name TEXT NOT NULL,
|
| 140 |
+
cer_mean REAL,
|
| 141 |
+
wer_mean REAL,
|
| 142 |
+
doc_count INTEGER,
|
| 143 |
+
metadata TEXT
|
| 144 |
+
);
|
| 145 |
+
CREATE INDEX IF NOT EXISTS idx_engine ON runs (engine_name);
|
| 146 |
+
CREATE INDEX IF NOT EXISTS idx_corpus ON runs (corpus_name);
|
| 147 |
+
CREATE INDEX IF NOT EXISTS idx_timestamp ON runs (timestamp);
|
| 148 |
+
"""
|
| 149 |
+
|
| 150 |
+
def __init__(self, db_path: str = "~/.picarones/history.db") -> None:
|
| 151 |
+
if db_path != ":memory:":
|
| 152 |
+
path = Path(db_path).expanduser()
|
| 153 |
+
path.parent.mkdir(parents=True, exist_ok=True)
|
| 154 |
+
self.db_path = str(path)
|
| 155 |
+
else:
|
| 156 |
+
self.db_path = ":memory:"
|
| 157 |
+
self._conn: Optional[sqlite3.Connection] = None
|
| 158 |
+
self._init_db()
|
| 159 |
+
|
| 160 |
+
def _connect(self) -> sqlite3.Connection:
|
| 161 |
+
if self._conn is None:
|
| 162 |
+
self._conn = sqlite3.connect(self.db_path)
|
| 163 |
+
self._conn.row_factory = sqlite3.Row
|
| 164 |
+
return self._conn
|
| 165 |
+
|
| 166 |
+
def _init_db(self) -> None:
|
| 167 |
+
conn = self._connect()
|
| 168 |
+
conn.executescript(self._CREATE_TABLE)
|
| 169 |
+
conn.commit()
|
| 170 |
+
|
| 171 |
+
def close(self) -> None:
|
| 172 |
+
"""Ferme la connexion SQLite."""
|
| 173 |
+
if self._conn:
|
| 174 |
+
self._conn.close()
|
| 175 |
+
self._conn = None
|
| 176 |
+
|
| 177 |
+
# ------------------------------------------------------------------
|
| 178 |
+
# Enregistrement
|
| 179 |
+
# ------------------------------------------------------------------
|
| 180 |
+
|
| 181 |
+
def record(
|
| 182 |
+
self,
|
| 183 |
+
benchmark_result: "BenchmarkResult",
|
| 184 |
+
run_id: Optional[str] = None,
|
| 185 |
+
extra_metadata: Optional[dict] = None,
|
| 186 |
+
) -> str:
|
| 187 |
+
"""Enregistre les résultats d'un benchmark dans l'historique.
|
| 188 |
+
|
| 189 |
+
Parameters
|
| 190 |
+
----------
|
| 191 |
+
benchmark_result:
|
| 192 |
+
Résultats à enregistrer (``BenchmarkResult``).
|
| 193 |
+
run_id:
|
| 194 |
+
Identifiant du run (auto-généré si None).
|
| 195 |
+
extra_metadata:
|
| 196 |
+
Métadonnées supplémentaires à stocker.
|
| 197 |
+
|
| 198 |
+
Returns
|
| 199 |
+
-------
|
| 200 |
+
str
|
| 201 |
+
L'identifiant du run enregistré.
|
| 202 |
+
"""
|
| 203 |
+
if run_id is None:
|
| 204 |
+
run_id = str(uuid.uuid4())
|
| 205 |
+
|
| 206 |
+
timestamp = datetime.now(timezone.utc).isoformat()
|
| 207 |
+
conn = self._connect()
|
| 208 |
+
|
| 209 |
+
for report in benchmark_result.engine_reports:
|
| 210 |
+
ranking = benchmark_result.ranking()
|
| 211 |
+
engine_entry = next(
|
| 212 |
+
(r for r in ranking if r["engine"] == report.engine_name),
|
| 213 |
+
None,
|
| 214 |
+
)
|
| 215 |
+
cer_mean = engine_entry["mean_cer"] if engine_entry else None
|
| 216 |
+
wer_mean = engine_entry["mean_wer"] if engine_entry else None
|
| 217 |
+
|
| 218 |
+
meta = {
|
| 219 |
+
"engine_version": report.engine_version,
|
| 220 |
+
"engine_config": report.engine_config,
|
| 221 |
+
"picarones_version": benchmark_result.metadata.get("picarones_version", ""),
|
| 222 |
+
**(extra_metadata or {}),
|
| 223 |
+
}
|
| 224 |
+
|
| 225 |
+
conn.execute(
|
| 226 |
+
"""
|
| 227 |
+
INSERT OR REPLACE INTO runs
|
| 228 |
+
(run_id, timestamp, corpus_name, engine_name,
|
| 229 |
+
cer_mean, wer_mean, doc_count, metadata)
|
| 230 |
+
VALUES (?, ?, ?, ?, ?, ?, ?, ?)
|
| 231 |
+
""",
|
| 232 |
+
(
|
| 233 |
+
f"{run_id}_{report.engine_name}",
|
| 234 |
+
timestamp,
|
| 235 |
+
benchmark_result.corpus_name,
|
| 236 |
+
report.engine_name,
|
| 237 |
+
cer_mean,
|
| 238 |
+
wer_mean,
|
| 239 |
+
benchmark_result.document_count,
|
| 240 |
+
json.dumps(meta, ensure_ascii=False),
|
| 241 |
+
),
|
| 242 |
+
)
|
| 243 |
+
|
| 244 |
+
conn.commit()
|
| 245 |
+
logger.info("Benchmark enregistré dans l'historique : run_id=%s", run_id)
|
| 246 |
+
return run_id
|
| 247 |
+
|
| 248 |
+
def record_single(
|
| 249 |
+
self,
|
| 250 |
+
run_id: str,
|
| 251 |
+
corpus_name: str,
|
| 252 |
+
engine_name: str,
|
| 253 |
+
cer_mean: Optional[float],
|
| 254 |
+
wer_mean: Optional[float],
|
| 255 |
+
doc_count: int,
|
| 256 |
+
timestamp: Optional[str] = None,
|
| 257 |
+
metadata: Optional[dict] = None,
|
| 258 |
+
) -> str:
|
| 259 |
+
"""Enregistre manuellement une entrée dans l'historique.
|
| 260 |
+
|
| 261 |
+
Utile pour les tests, les imports de données externes, ou pour
|
| 262 |
+
enregistrer des résultats calculés en dehors de Picarones.
|
| 263 |
+
|
| 264 |
+
Returns
|
| 265 |
+
-------
|
| 266 |
+
str
|
| 267 |
+
run_id enregistré.
|
| 268 |
+
"""
|
| 269 |
+
if timestamp is None:
|
| 270 |
+
timestamp = datetime.now(timezone.utc).isoformat()
|
| 271 |
+
|
| 272 |
+
conn = self._connect()
|
| 273 |
+
conn.execute(
|
| 274 |
+
"""
|
| 275 |
+
INSERT OR REPLACE INTO runs
|
| 276 |
+
(run_id, timestamp, corpus_name, engine_name,
|
| 277 |
+
cer_mean, wer_mean, doc_count, metadata)
|
| 278 |
+
VALUES (?, ?, ?, ?, ?, ?, ?, ?)
|
| 279 |
+
""",
|
| 280 |
+
(
|
| 281 |
+
run_id,
|
| 282 |
+
timestamp,
|
| 283 |
+
corpus_name,
|
| 284 |
+
engine_name,
|
| 285 |
+
cer_mean,
|
| 286 |
+
wer_mean,
|
| 287 |
+
doc_count,
|
| 288 |
+
json.dumps(metadata or {}, ensure_ascii=False),
|
| 289 |
+
),
|
| 290 |
+
)
|
| 291 |
+
conn.commit()
|
| 292 |
+
return run_id
|
| 293 |
+
|
| 294 |
+
# ------------------------------------------------------------------
|
| 295 |
+
# Requêtes
|
| 296 |
+
# ------------------------------------------------------------------
|
| 297 |
+
|
| 298 |
+
def query(
|
| 299 |
+
self,
|
| 300 |
+
engine: Optional[str] = None,
|
| 301 |
+
corpus: Optional[str] = None,
|
| 302 |
+
since: Optional[str] = None,
|
| 303 |
+
limit: int = 100,
|
| 304 |
+
) -> list[HistoryEntry]:
|
| 305 |
+
"""Retourne l'historique des runs, avec filtres optionnels.
|
| 306 |
+
|
| 307 |
+
Parameters
|
| 308 |
+
----------
|
| 309 |
+
engine:
|
| 310 |
+
Filtre sur le nom du moteur.
|
| 311 |
+
corpus:
|
| 312 |
+
Filtre sur le nom du corpus.
|
| 313 |
+
since:
|
| 314 |
+
Date ISO 8601 minimale (``"2025-01-01"``).
|
| 315 |
+
limit:
|
| 316 |
+
Nombre maximum d'entrées retournées.
|
| 317 |
+
|
| 318 |
+
Returns
|
| 319 |
+
-------
|
| 320 |
+
list[HistoryEntry]
|
| 321 |
+
Entrées triées par timestamp croissant.
|
| 322 |
+
"""
|
| 323 |
+
clauses: list[str] = []
|
| 324 |
+
params: list = []
|
| 325 |
+
|
| 326 |
+
if engine:
|
| 327 |
+
clauses.append("engine_name = ?")
|
| 328 |
+
params.append(engine)
|
| 329 |
+
if corpus:
|
| 330 |
+
clauses.append("corpus_name = ?")
|
| 331 |
+
params.append(corpus)
|
| 332 |
+
if since:
|
| 333 |
+
clauses.append("timestamp >= ?")
|
| 334 |
+
params.append(since)
|
| 335 |
+
|
| 336 |
+
where = f"WHERE {' AND '.join(clauses)}" if clauses else ""
|
| 337 |
+
params.append(limit)
|
| 338 |
+
|
| 339 |
+
conn = self._connect()
|
| 340 |
+
rows = conn.execute(
|
| 341 |
+
f"SELECT * FROM runs {where} ORDER BY timestamp ASC LIMIT ?",
|
| 342 |
+
params,
|
| 343 |
+
).fetchall()
|
| 344 |
+
|
| 345 |
+
return [
|
| 346 |
+
HistoryEntry(
|
| 347 |
+
run_id=row["run_id"],
|
| 348 |
+
timestamp=row["timestamp"],
|
| 349 |
+
corpus_name=row["corpus_name"],
|
| 350 |
+
engine_name=row["engine_name"],
|
| 351 |
+
cer_mean=row["cer_mean"],
|
| 352 |
+
wer_mean=row["wer_mean"],
|
| 353 |
+
doc_count=row["doc_count"],
|
| 354 |
+
metadata=json.loads(row["metadata"] or "{}"),
|
| 355 |
+
)
|
| 356 |
+
for row in rows
|
| 357 |
+
]
|
| 358 |
+
|
| 359 |
+
def list_engines(self) -> list[str]:
|
| 360 |
+
"""Retourne la liste des moteurs présents dans l'historique."""
|
| 361 |
+
conn = self._connect()
|
| 362 |
+
rows = conn.execute(
|
| 363 |
+
"SELECT DISTINCT engine_name FROM runs ORDER BY engine_name"
|
| 364 |
+
).fetchall()
|
| 365 |
+
return [row[0] for row in rows]
|
| 366 |
+
|
| 367 |
+
def list_corpora(self) -> list[str]:
|
| 368 |
+
"""Retourne la liste des corpus présents dans l'historique."""
|
| 369 |
+
conn = self._connect()
|
| 370 |
+
rows = conn.execute(
|
| 371 |
+
"SELECT DISTINCT corpus_name FROM runs ORDER BY corpus_name"
|
| 372 |
+
).fetchall()
|
| 373 |
+
return [row[0] for row in rows]
|
| 374 |
+
|
| 375 |
+
def count(self) -> int:
|
| 376 |
+
"""Nombre total d'entrées dans l'historique."""
|
| 377 |
+
conn = self._connect()
|
| 378 |
+
return conn.execute("SELECT COUNT(*) FROM runs").fetchone()[0]
|
| 379 |
+
|
| 380 |
+
# ------------------------------------------------------------------
|
| 381 |
+
# Courbes d'évolution
|
| 382 |
+
# ------------------------------------------------------------------
|
| 383 |
+
|
| 384 |
+
def get_cer_curve(
|
| 385 |
+
self,
|
| 386 |
+
engine: str,
|
| 387 |
+
corpus: Optional[str] = None,
|
| 388 |
+
) -> list[dict]:
|
| 389 |
+
"""Retourne les données pour tracer la courbe d'évolution du CER.
|
| 390 |
+
|
| 391 |
+
Parameters
|
| 392 |
+
----------
|
| 393 |
+
engine:
|
| 394 |
+
Nom du moteur.
|
| 395 |
+
corpus:
|
| 396 |
+
Corpus spécifique (None = tous les corpus pour ce moteur).
|
| 397 |
+
|
| 398 |
+
Returns
|
| 399 |
+
-------
|
| 400 |
+
list[dict]
|
| 401 |
+
Chaque dict contient ``{"timestamp": str, "cer": float, "run_id": str}``.
|
| 402 |
+
"""
|
| 403 |
+
entries = self.query(engine=engine, corpus=corpus, limit=1000)
|
| 404 |
+
return [
|
| 405 |
+
{
|
| 406 |
+
"timestamp": e.timestamp,
|
| 407 |
+
"cer": e.cer_mean,
|
| 408 |
+
"cer_percent": e.cer_percent,
|
| 409 |
+
"run_id": e.run_id,
|
| 410 |
+
"corpus_name": e.corpus_name,
|
| 411 |
+
}
|
| 412 |
+
for e in entries
|
| 413 |
+
if e.cer_mean is not None
|
| 414 |
+
]
|
| 415 |
+
|
| 416 |
+
# ------------------------------------------------------------------
|
| 417 |
+
# Détection de régression
|
| 418 |
+
# ------------------------------------------------------------------
|
| 419 |
+
|
| 420 |
+
def detect_regression(
|
| 421 |
+
self,
|
| 422 |
+
engine: str,
|
| 423 |
+
corpus: Optional[str] = None,
|
| 424 |
+
threshold: float = 0.01,
|
| 425 |
+
baseline_run_id: Optional[str] = None,
|
| 426 |
+
) -> Optional[RegressionResult]:
|
| 427 |
+
"""Détecte une régression du CER entre deux runs.
|
| 428 |
+
|
| 429 |
+
Compare le run le plus récent à une baseline (le run précédent ou
|
| 430 |
+
un run spécifique).
|
| 431 |
+
|
| 432 |
+
Parameters
|
| 433 |
+
----------
|
| 434 |
+
engine:
|
| 435 |
+
Nom du moteur à surveiller.
|
| 436 |
+
corpus:
|
| 437 |
+
Corpus spécifique (None = tous).
|
| 438 |
+
threshold:
|
| 439 |
+
Seuil de régression en points absolus de CER (ex : 0.01 = 1%).
|
| 440 |
+
Si delta_cer > threshold → régression détectée.
|
| 441 |
+
baseline_run_id:
|
| 442 |
+
run_id de référence. Si None, utilise l'avant-dernier run.
|
| 443 |
+
|
| 444 |
+
Returns
|
| 445 |
+
-------
|
| 446 |
+
RegressionResult | None
|
| 447 |
+
None si moins de 2 runs disponibles.
|
| 448 |
+
"""
|
| 449 |
+
entries = self.query(engine=engine, corpus=corpus, limit=1000)
|
| 450 |
+
if len(entries) < 2:
|
| 451 |
+
logger.info("Pas assez de runs pour détecter une régression (moteur=%s)", engine)
|
| 452 |
+
return None
|
| 453 |
+
|
| 454 |
+
current = entries[-1]
|
| 455 |
+
|
| 456 |
+
if baseline_run_id:
|
| 457 |
+
baseline_list = [e for e in entries[:-1] if e.run_id == baseline_run_id]
|
| 458 |
+
baseline = baseline_list[0] if baseline_list else entries[-2]
|
| 459 |
+
else:
|
| 460 |
+
baseline = entries[-2]
|
| 461 |
+
|
| 462 |
+
delta = None
|
| 463 |
+
is_regression = False
|
| 464 |
+
if current.cer_mean is not None and baseline.cer_mean is not None:
|
| 465 |
+
delta = current.cer_mean - baseline.cer_mean
|
| 466 |
+
is_regression = delta > threshold
|
| 467 |
+
|
| 468 |
+
return RegressionResult(
|
| 469 |
+
engine_name=engine,
|
| 470 |
+
corpus_name=corpus or "tous",
|
| 471 |
+
baseline_run_id=baseline.run_id,
|
| 472 |
+
baseline_timestamp=baseline.timestamp,
|
| 473 |
+
baseline_cer=baseline.cer_mean,
|
| 474 |
+
current_run_id=current.run_id,
|
| 475 |
+
current_timestamp=current.timestamp,
|
| 476 |
+
current_cer=current.cer_mean,
|
| 477 |
+
delta_cer=delta,
|
| 478 |
+
is_regression=is_regression,
|
| 479 |
+
threshold=threshold,
|
| 480 |
+
)
|
| 481 |
+
|
| 482 |
+
def detect_all_regressions(
|
| 483 |
+
self,
|
| 484 |
+
threshold: float = 0.01,
|
| 485 |
+
) -> list[RegressionResult]:
|
| 486 |
+
"""Détecte les régressions pour tous les moteurs et corpus connus.
|
| 487 |
+
|
| 488 |
+
Parameters
|
| 489 |
+
----------
|
| 490 |
+
threshold:
|
| 491 |
+
Seuil de régression.
|
| 492 |
+
|
| 493 |
+
Returns
|
| 494 |
+
-------
|
| 495 |
+
list[RegressionResult]
|
| 496 |
+
Uniquement les moteurs où une régression est détectée.
|
| 497 |
+
"""
|
| 498 |
+
results: list[RegressionResult] = []
|
| 499 |
+
engines = self.list_engines()
|
| 500 |
+
corpora = self.list_corpora()
|
| 501 |
+
|
| 502 |
+
for engine in engines:
|
| 503 |
+
for corpus in corpora:
|
| 504 |
+
result = self.detect_regression(engine, corpus, threshold)
|
| 505 |
+
if result and result.is_regression:
|
| 506 |
+
results.append(result)
|
| 507 |
+
|
| 508 |
+
return results
|
| 509 |
+
|
| 510 |
+
# ------------------------------------------------------------------
|
| 511 |
+
# Export
|
| 512 |
+
# ------------------------------------------------------------------
|
| 513 |
+
|
| 514 |
+
def export_json(self, output_path: str) -> Path:
|
| 515 |
+
"""Exporte l'historique complet en JSON.
|
| 516 |
+
|
| 517 |
+
Parameters
|
| 518 |
+
----------
|
| 519 |
+
output_path:
|
| 520 |
+
Chemin du fichier JSON de sortie.
|
| 521 |
+
|
| 522 |
+
Returns
|
| 523 |
+
-------
|
| 524 |
+
Path
|
| 525 |
+
Chemin vers le fichier créé.
|
| 526 |
+
"""
|
| 527 |
+
entries = self.query(limit=100_000)
|
| 528 |
+
path = Path(output_path)
|
| 529 |
+
data = {
|
| 530 |
+
"picarones_history": True,
|
| 531 |
+
"exported_at": datetime.now(timezone.utc).isoformat(),
|
| 532 |
+
"total_runs": len(entries),
|
| 533 |
+
"engines": self.list_engines(),
|
| 534 |
+
"corpora": self.list_corpora(),
|
| 535 |
+
"runs": [e.as_dict() for e in entries],
|
| 536 |
+
}
|
| 537 |
+
path.write_text(json.dumps(data, ensure_ascii=False, indent=2), encoding="utf-8")
|
| 538 |
+
return path
|
| 539 |
+
|
| 540 |
+
def __repr__(self) -> str:
|
| 541 |
+
return f"BenchmarkHistory(db='{self.db_path}', runs={self.count()})"
|
| 542 |
+
|
| 543 |
+
|
| 544 |
+
# ---------------------------------------------------------------------------
|
| 545 |
+
# Données de démonstration longitudinale
|
| 546 |
+
# ---------------------------------------------------------------------------
|
| 547 |
+
|
| 548 |
+
def generate_demo_history(
|
| 549 |
+
db: BenchmarkHistory,
|
| 550 |
+
n_runs: int = 8,
|
| 551 |
+
seed: int = 42,
|
| 552 |
+
) -> None:
|
| 553 |
+
"""Insère des données fictives de suivi longitudinal pour la démo.
|
| 554 |
+
|
| 555 |
+
Simule l'amélioration progressive d'un modèle tesseract sur 8 runs,
|
| 556 |
+
avec une légère régression au run 5.
|
| 557 |
+
|
| 558 |
+
Parameters
|
| 559 |
+
----------
|
| 560 |
+
db:
|
| 561 |
+
Base d'historique à remplir.
|
| 562 |
+
n_runs:
|
| 563 |
+
Nombre de runs à générer.
|
| 564 |
+
seed:
|
| 565 |
+
Graine aléatoire.
|
| 566 |
+
"""
|
| 567 |
+
import random
|
| 568 |
+
rng = random.Random(seed)
|
| 569 |
+
|
| 570 |
+
engines = ["tesseract", "pero_ocr", "ancien_moteur"]
|
| 571 |
+
corpus = "Chroniques médiévales"
|
| 572 |
+
|
| 573 |
+
# Trajectoires de CER simulées (amélioration progressive + bruit)
|
| 574 |
+
base_cers = {
|
| 575 |
+
"tesseract": 0.15,
|
| 576 |
+
"pero_ocr": 0.09,
|
| 577 |
+
"ancien_moteur": 0.28,
|
| 578 |
+
}
|
| 579 |
+
improvements = {
|
| 580 |
+
"tesseract": -0.008, # améliore de ~0.8% par run
|
| 581 |
+
"pero_ocr": -0.005, # améliore de ~0.5% par run
|
| 582 |
+
"ancien_moteur": -0.003,
|
| 583 |
+
}
|
| 584 |
+
|
| 585 |
+
from datetime import timedelta
|
| 586 |
+
base_date = datetime(2024, 9, 1, tzinfo=timezone.utc)
|
| 587 |
+
|
| 588 |
+
for run_idx in range(n_runs):
|
| 589 |
+
run_date = base_date + timedelta(weeks=run_idx * 2)
|
| 590 |
+
run_id = f"demo_run_{run_idx + 1:02d}"
|
| 591 |
+
|
| 592 |
+
for engine in engines:
|
| 593 |
+
cer = base_cers[engine] + improvements[engine] * run_idx
|
| 594 |
+
# Ajouter du bruit + régression au run 5
|
| 595 |
+
noise = rng.gauss(0, 0.005)
|
| 596 |
+
if run_idx == 4 and engine == "tesseract":
|
| 597 |
+
noise += 0.02 # régression simulée
|
| 598 |
+
cer = max(0.01, min(0.5, cer + noise))
|
| 599 |
+
|
| 600 |
+
wer = cer * 1.8 + rng.gauss(0, 0.01)
|
| 601 |
+
wer = max(0.01, min(0.9, wer))
|
| 602 |
+
|
| 603 |
+
db.record_single(
|
| 604 |
+
run_id=f"{run_id}_{engine}",
|
| 605 |
+
corpus_name=corpus,
|
| 606 |
+
engine_name=engine,
|
| 607 |
+
cer_mean=round(cer, 4),
|
| 608 |
+
wer_mean=round(wer, 4),
|
| 609 |
+
doc_count=12,
|
| 610 |
+
timestamp=run_date.isoformat(),
|
| 611 |
+
metadata={
|
| 612 |
+
"note": f"Run de démonstration #{run_idx + 1}",
|
| 613 |
+
"engine_version": f"5.{run_idx}.0" if engine == "tesseract" else "0.7.2",
|
| 614 |
+
},
|
| 615 |
+
)
|
|
@@ -0,0 +1,360 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""Métriques de fiabilité — Sprint 83 (A.II.4).
|
| 2 |
+
|
| 3 |
+
Sprint 83 — A.II.4 du plan d'évolution 2026 (Étape 4).
|
| 4 |
+
|
| 5 |
+
Pourquoi ce module
|
| 6 |
+
------------------
|
| 7 |
+
Une publication scientifique qui rapporte un CER LLM sans
|
| 8 |
+
stabilité est méthodologiquement faible. Et un benchmark qui
|
| 9 |
+
ignore le plafond humain (« deux paléographes ne sont pas même
|
| 10 |
+
d'accord ») crée des classements faussement optimistes. Ce
|
| 11 |
+
module livre deux familles complémentaires :
|
| 12 |
+
|
| 13 |
+
1. **Inter-annotator agreement (IAA)** — quand un document a
|
| 14 |
+
plusieurs GT (deux paléographes, par ex.), Cohen κ et
|
| 15 |
+
Krippendorff α mesurent l'accord au niveau caractère.
|
| 16 |
+
Lecture : *« le CER de Pero (4,2 %) approche le plafond
|
| 17 |
+
humain (κ = 0,89). »*
|
| 18 |
+
|
| 19 |
+
2. **Stabilité multi-runs** — quand on relance la même
|
| 20 |
+
pipeline LLM N fois sur les mêmes documents, on mesure :
|
| 21 |
+
variance du CER, taux de tokens divergents entre runs,
|
| 22 |
+
CER pairwise moyen.
|
| 23 |
+
|
| 24 |
+
Périmètre Sprint 83
|
| 25 |
+
-------------------
|
| 26 |
+
**Couche de calcul uniquement** — fonctions pures, pas
|
| 27 |
+
d'intégration runner ni de vue HTML. L'extension du loader
|
| 28 |
+
pour accepter ``doc_001.gt.A.txt`` / ``doc_001.gt.B.txt`` est
|
| 29 |
+
documentée comme dépendance future ; en attendant le sprint
|
| 30 |
+
dédié, on prend deux strings GT en entrée.
|
| 31 |
+
|
| 32 |
+
Méthode
|
| 33 |
+
-------
|
| 34 |
+
*IAA caractère par caractère.* On aligne les deux GT par
|
| 35 |
+
``difflib.SequenceMatcher`` au niveau caractère et on construit
|
| 36 |
+
une table de contingence ``(annotator_a_char, annotator_b_char)``
|
| 37 |
+
sur les positions ``equal`` ou ``replace``. Cohen κ utilise
|
| 38 |
+
cette table directement. Krippendorff α utilise la version
|
| 39 |
+
matricielle (différence binaire pour le mode nominal).
|
| 40 |
+
|
| 41 |
+
*Stabilité multi-runs.* ``compute_multirun_stability(runs)``
|
| 42 |
+
prend une liste de N transcriptions du **même** document et
|
| 43 |
+
renvoie variance/écart-type/coefficient de variation du CER si
|
| 44 |
+
référence fournie ; sinon, taux pairwise de divergence
|
| 45 |
+
(intersection-vs-union des tokens).
|
| 46 |
+
"""
|
| 47 |
+
|
| 48 |
+
from __future__ import annotations
|
| 49 |
+
|
| 50 |
+
import logging
|
| 51 |
+
import statistics
|
| 52 |
+
from typing import Optional, Sequence
|
| 53 |
+
|
| 54 |
+
logger = logging.getLogger(__name__)
|
| 55 |
+
|
| 56 |
+
|
| 57 |
+
# ──────────────────────────────────────────────────────────────────────────
|
| 58 |
+
# Helpers d'alignement caractère par caractère
|
| 59 |
+
# ──────────────────────────────────────────────────────────────────────────
|
| 60 |
+
|
| 61 |
+
|
| 62 |
+
def _aligned_char_pairs(
|
| 63 |
+
text_a: str, text_b: str,
|
| 64 |
+
) -> list[tuple[str, str]]:
|
| 65 |
+
"""Aligne ``text_a`` et ``text_b`` caractère par caractère.
|
| 66 |
+
|
| 67 |
+
Retourne la liste des paires alignées sur les segments
|
| 68 |
+
``equal`` et ``replace`` de ``SequenceMatcher`` (les ``insert``
|
| 69 |
+
et ``delete`` sont ignorés — pas d'alignement valide).
|
| 70 |
+
"""
|
| 71 |
+
if not text_a and not text_b:
|
| 72 |
+
return []
|
| 73 |
+
import difflib
|
| 74 |
+
matcher = difflib.SequenceMatcher(None, text_a, text_b, autojunk=False)
|
| 75 |
+
pairs: list[tuple[str, str]] = []
|
| 76 |
+
for tag, i1, i2, j1, j2 in matcher.get_opcodes():
|
| 77 |
+
if tag == "equal":
|
| 78 |
+
for k in range(i2 - i1):
|
| 79 |
+
pairs.append((text_a[i1 + k], text_b[j1 + k]))
|
| 80 |
+
elif tag == "replace":
|
| 81 |
+
paired = min(i2 - i1, j2 - j1)
|
| 82 |
+
for k in range(paired):
|
| 83 |
+
pairs.append((text_a[i1 + k], text_b[j1 + k]))
|
| 84 |
+
# insert/delete : pas d'alignement bilatéral exploitable
|
| 85 |
+
return pairs
|
| 86 |
+
|
| 87 |
+
|
| 88 |
+
__all__: list[str] = []
|
| 89 |
+
|
| 90 |
+
|
| 91 |
+
# ──────────────────────────────────────────────────────────────────────────
|
| 92 |
+
# 1. Cohen's kappa (deux annotateurs, accord nominal)
|
| 93 |
+
# ──────────────────────────────────────────────────────────────────────────
|
| 94 |
+
|
| 95 |
+
|
| 96 |
+
def cohen_kappa(
|
| 97 |
+
annotations_a: Sequence,
|
| 98 |
+
annotations_b: Sequence,
|
| 99 |
+
) -> Optional[float]:
|
| 100 |
+
"""Cohen's κ entre deux annotateurs sur des observations
|
| 101 |
+
appariées.
|
| 102 |
+
|
| 103 |
+
Définition :
|
| 104 |
+
|
| 105 |
+
κ = (po - pe) / (1 - pe)
|
| 106 |
+
|
| 107 |
+
où ``po`` est l'accord observé (proportion de paires égales)
|
| 108 |
+
et ``pe`` l'accord attendu par hasard (somme sur les classes
|
| 109 |
+
de p_a(c) × p_b(c)).
|
| 110 |
+
|
| 111 |
+
Conventions :
|
| 112 |
+
- retourne ``None`` si les deux séquences sont vides ou de
|
| 113 |
+
tailles incompatibles ;
|
| 114 |
+
- κ = 1.0 quand l'accord est parfait, 0.0 quand il égale le
|
| 115 |
+
hasard, négatif si pire que le hasard ;
|
| 116 |
+
- quand ``pe == 1`` (un seul label dans les deux séquences),
|
| 117 |
+
retourne 1.0 si les séquences sont identiques, 0.0 sinon
|
| 118 |
+
(κ est mathématiquement indéfini, on choisit une
|
| 119 |
+
convention transparente documentée).
|
| 120 |
+
"""
|
| 121 |
+
if len(annotations_a) != len(annotations_b):
|
| 122 |
+
return None
|
| 123 |
+
n = len(annotations_a)
|
| 124 |
+
if n == 0:
|
| 125 |
+
return None
|
| 126 |
+
# Accord observé
|
| 127 |
+
agree = sum(1 for a, b in zip(annotations_a, annotations_b) if a == b)
|
| 128 |
+
p_o = agree / n
|
| 129 |
+
# Accord attendu par hasard
|
| 130 |
+
from collections import Counter
|
| 131 |
+
count_a = Counter(annotations_a)
|
| 132 |
+
count_b = Counter(annotations_b)
|
| 133 |
+
classes = set(count_a) | set(count_b)
|
| 134 |
+
p_e = sum(
|
| 135 |
+
(count_a.get(c, 0) / n) * (count_b.get(c, 0) / n)
|
| 136 |
+
for c in classes
|
| 137 |
+
)
|
| 138 |
+
if p_e >= 1.0 - 1e-12:
|
| 139 |
+
# Indéfini ; convention : 1 si identité totale, 0 sinon
|
| 140 |
+
return 1.0 if p_o >= 1.0 - 1e-12 else 0.0
|
| 141 |
+
return (p_o - p_e) / (1.0 - p_e)
|
| 142 |
+
|
| 143 |
+
|
| 144 |
+
__all__.append("cohen_kappa")
|
| 145 |
+
|
| 146 |
+
|
| 147 |
+
# ──────────────────────────────────────────────────────────────────────────
|
| 148 |
+
# 2. Krippendorff's alpha (généralisation à N annotateurs)
|
| 149 |
+
# ──────────────────────────────────────────────────────────────────────────
|
| 150 |
+
|
| 151 |
+
|
| 152 |
+
def krippendorff_alpha(
|
| 153 |
+
annotations_per_unit: Sequence[Sequence],
|
| 154 |
+
) -> Optional[float]:
|
| 155 |
+
"""Krippendorff's α en mode nominal pour N annotateurs.
|
| 156 |
+
|
| 157 |
+
Parameters
|
| 158 |
+
----------
|
| 159 |
+
annotations_per_unit:
|
| 160 |
+
Liste d'unités, chaque unité étant la liste des
|
| 161 |
+
annotations produites par les différents annotateurs sur
|
| 162 |
+
cette unité. ``None`` dans une cellule = annotation
|
| 163 |
+
manquante (autorisée).
|
| 164 |
+
|
| 165 |
+
Définition (Krippendorff 1980, équation pour métrique
|
| 166 |
+
nominale) :
|
| 167 |
+
|
| 168 |
+
α = 1 - D_o / D_e
|
| 169 |
+
|
| 170 |
+
où ``D_o`` est le désaccord observé (paires en désaccord
|
| 171 |
+
intra-unité, normalisées) et ``D_e`` le désaccord attendu
|
| 172 |
+
par hasard. ``α = 1`` accord parfait, ``α = 0`` hasard,
|
| 173 |
+
négatif si pire.
|
| 174 |
+
|
| 175 |
+
Conventions :
|
| 176 |
+
- unités avec moins de 2 annotations valides : ignorées
|
| 177 |
+
(Krippendorff convention) ;
|
| 178 |
+
- retourne ``None`` si moins d'une unité utilisable ou
|
| 179 |
+
``D_e == 0`` (un seul label dans tout le corpus).
|
| 180 |
+
"""
|
| 181 |
+
from collections import Counter
|
| 182 |
+
# Valeurs observées au niveau corpus
|
| 183 |
+
value_counts: Counter = Counter()
|
| 184 |
+
pair_disagree = 0.0
|
| 185 |
+
pair_total = 0.0
|
| 186 |
+
for unit in annotations_per_unit:
|
| 187 |
+
valid = [v for v in unit if v is not None]
|
| 188 |
+
m = len(valid)
|
| 189 |
+
if m < 2:
|
| 190 |
+
continue
|
| 191 |
+
# paires intra-unité (sans repetition, ordonné)
|
| 192 |
+
for i in range(m):
|
| 193 |
+
for j in range(m):
|
| 194 |
+
if i == j:
|
| 195 |
+
continue
|
| 196 |
+
pair_total += 1.0 / (m - 1)
|
| 197 |
+
if valid[i] != valid[j]:
|
| 198 |
+
pair_disagree += 1.0 / (m - 1)
|
| 199 |
+
for v in valid:
|
| 200 |
+
value_counts[v] += 1
|
| 201 |
+
if pair_total == 0:
|
| 202 |
+
return None
|
| 203 |
+
n_total = sum(value_counts.values())
|
| 204 |
+
if n_total < 2:
|
| 205 |
+
return None
|
| 206 |
+
# Désaccord attendu (sur paires aléatoires sans remise)
|
| 207 |
+
expected_disagree = 0.0
|
| 208 |
+
for v_a, c_a in value_counts.items():
|
| 209 |
+
for v_b, c_b in value_counts.items():
|
| 210 |
+
if v_a != v_b:
|
| 211 |
+
expected_disagree += c_a * c_b
|
| 212 |
+
expected_disagree /= n_total * (n_total - 1)
|
| 213 |
+
if expected_disagree <= 1e-12:
|
| 214 |
+
return None
|
| 215 |
+
d_o = pair_disagree / pair_total
|
| 216 |
+
return 1.0 - (d_o / expected_disagree)
|
| 217 |
+
|
| 218 |
+
|
| 219 |
+
__all__.append("krippendorff_alpha")
|
| 220 |
+
|
| 221 |
+
|
| 222 |
+
# ──────────────────────────────────────────────────────────────────────────
|
| 223 |
+
# 3. Helpers IAA caractère
|
| 224 |
+
# ──────────────────────────────────────────────────────────────────────────
|
| 225 |
+
|
| 226 |
+
|
| 227 |
+
def compute_iaa(
|
| 228 |
+
transcription_a: str,
|
| 229 |
+
transcription_b: str,
|
| 230 |
+
) -> Optional[dict]:
|
| 231 |
+
"""Calcule κ et α au niveau caractère entre deux
|
| 232 |
+
transcriptions du même document.
|
| 233 |
+
|
| 234 |
+
Aligne via ``_aligned_char_pairs`` puis :
|
| 235 |
+
- κ : sur la liste des paires alignées ;
|
| 236 |
+
- α : sur les unités à 2 annotations (équivalent à κ sur ce
|
| 237 |
+
cas, mais le cadre généralise à N annotateurs).
|
| 238 |
+
|
| 239 |
+
Retourne ``None`` si pas d'alignement possible (transcriptions
|
| 240 |
+
vides ou totalement disjointes).
|
| 241 |
+
"""
|
| 242 |
+
pairs = _aligned_char_pairs(transcription_a, transcription_b)
|
| 243 |
+
if not pairs:
|
| 244 |
+
return None
|
| 245 |
+
kappa = cohen_kappa([a for a, _ in pairs], [b for _, b in pairs])
|
| 246 |
+
alpha = krippendorff_alpha([[a, b] for a, b in pairs])
|
| 247 |
+
return {
|
| 248 |
+
"n_aligned_chars": len(pairs),
|
| 249 |
+
"cohen_kappa": kappa,
|
| 250 |
+
"krippendorff_alpha": alpha,
|
| 251 |
+
"agreement_rate": (
|
| 252 |
+
sum(1 for a, b in pairs if a == b) / len(pairs)
|
| 253 |
+
),
|
| 254 |
+
}
|
| 255 |
+
|
| 256 |
+
|
| 257 |
+
__all__.append("compute_iaa")
|
| 258 |
+
|
| 259 |
+
|
| 260 |
+
# ──���───────────────────────────────────────────────────────────────────────
|
| 261 |
+
# 4. Stabilité multi-runs (variance CER, divergence pairwise)
|
| 262 |
+
# ──────────────────────────────────────────────────────────────────────────
|
| 263 |
+
|
| 264 |
+
|
| 265 |
+
def _split_words(text: str) -> list[str]:
|
| 266 |
+
return text.split() if text else []
|
| 267 |
+
|
| 268 |
+
|
| 269 |
+
def compute_multirun_stability(
|
| 270 |
+
runs: Sequence[str],
|
| 271 |
+
*,
|
| 272 |
+
reference: Optional[str] = None,
|
| 273 |
+
) -> Optional[dict]:
|
| 274 |
+
"""Mesure la stabilité de N runs successifs d'une même
|
| 275 |
+
pipeline (typiquement LLM/VLM non déterministe) sur un
|
| 276 |
+
document.
|
| 277 |
+
|
| 278 |
+
Parameters
|
| 279 |
+
----------
|
| 280 |
+
runs:
|
| 281 |
+
Liste des transcriptions produites à chaque run (≥ 2).
|
| 282 |
+
reference:
|
| 283 |
+
Transcription de référence (GT). Si fournie, on calcule
|
| 284 |
+
``cer_per_run``, leur variance et leur coefficient de
|
| 285 |
+
variation.
|
| 286 |
+
|
| 287 |
+
Returns
|
| 288 |
+
-------
|
| 289 |
+
dict | None
|
| 290 |
+
``{
|
| 291 |
+
"n_runs": int,
|
| 292 |
+
"pairwise_disagreement_mean": float, # divergence moyenne
|
| 293 |
+
"pairwise_disagreement_max": float,
|
| 294 |
+
"identical_run_rate": float, # paires identiques / total
|
| 295 |
+
"cer_per_run": Optional[list[float]],
|
| 296 |
+
"cer_mean": Optional[float],
|
| 297 |
+
"cer_stdev": Optional[float],
|
| 298 |
+
"cer_cv": Optional[float], # cv = stdev / mean
|
| 299 |
+
"n_distinct_outputs": int,
|
| 300 |
+
}``
|
| 301 |
+
ou ``None`` si moins de 2 runs.
|
| 302 |
+
"""
|
| 303 |
+
if len(runs) < 2:
|
| 304 |
+
return None
|
| 305 |
+
runs_list = list(runs)
|
| 306 |
+
# Divergence pairwise (token-level Jaccard distance)
|
| 307 |
+
n = len(runs_list)
|
| 308 |
+
n_pairs = 0
|
| 309 |
+
sum_disagree = 0.0
|
| 310 |
+
max_disagree = 0.0
|
| 311 |
+
n_identical = 0
|
| 312 |
+
for i in range(n):
|
| 313 |
+
for j in range(i + 1, n):
|
| 314 |
+
n_pairs += 1
|
| 315 |
+
tokens_i = set(_split_words(runs_list[i]))
|
| 316 |
+
tokens_j = set(_split_words(runs_list[j]))
|
| 317 |
+
union = tokens_i | tokens_j
|
| 318 |
+
if not union:
|
| 319 |
+
disagree = 0.0
|
| 320 |
+
else:
|
| 321 |
+
disagree = 1.0 - len(tokens_i & tokens_j) / len(union)
|
| 322 |
+
sum_disagree += disagree
|
| 323 |
+
if disagree > max_disagree:
|
| 324 |
+
max_disagree = disagree
|
| 325 |
+
if runs_list[i] == runs_list[j]:
|
| 326 |
+
n_identical += 1
|
| 327 |
+
pairwise_mean = sum_disagree / n_pairs if n_pairs else 0.0
|
| 328 |
+
identical_rate = n_identical / n_pairs if n_pairs else 0.0
|
| 329 |
+
distinct = len(set(runs_list))
|
| 330 |
+
|
| 331 |
+
cer_per_run: Optional[list[float]] = None
|
| 332 |
+
cer_mean: Optional[float] = None
|
| 333 |
+
cer_stdev: Optional[float] = None
|
| 334 |
+
cer_cv: Optional[float] = None
|
| 335 |
+
if reference is not None:
|
| 336 |
+
from picarones.evaluation.metrics.text_metrics import _cer_from_strings
|
| 337 |
+
cer_per_run = [_cer_from_strings(reference, r) for r in runs_list]
|
| 338 |
+
cer_per_run = [v for v in cer_per_run if v is not None]
|
| 339 |
+
if cer_per_run:
|
| 340 |
+
cer_mean = statistics.fmean(cer_per_run)
|
| 341 |
+
if len(cer_per_run) >= 2:
|
| 342 |
+
cer_stdev = statistics.stdev(cer_per_run)
|
| 343 |
+
cer_cv = (
|
| 344 |
+
cer_stdev / cer_mean if cer_mean and cer_mean > 0
|
| 345 |
+
else None
|
| 346 |
+
)
|
| 347 |
+
return {
|
| 348 |
+
"n_runs": n,
|
| 349 |
+
"pairwise_disagreement_mean": pairwise_mean,
|
| 350 |
+
"pairwise_disagreement_max": max_disagree,
|
| 351 |
+
"identical_run_rate": identical_rate,
|
| 352 |
+
"n_distinct_outputs": distinct,
|
| 353 |
+
"cer_per_run": cer_per_run,
|
| 354 |
+
"cer_mean": cer_mean,
|
| 355 |
+
"cer_stdev": cer_stdev,
|
| 356 |
+
"cer_cv": cer_cv,
|
| 357 |
+
}
|
| 358 |
+
|
| 359 |
+
|
| 360 |
+
__all__.append("compute_multirun_stability")
|
|
@@ -0,0 +1,742 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""Analyse de robustesse des moteurs OCR face aux dégradations d'image.
|
| 2 |
+
|
| 3 |
+
Fonctionnement
|
| 4 |
+
--------------
|
| 5 |
+
1. Génération de versions dégradées des images du corpus à différents niveaux :
|
| 6 |
+
- Bruit gaussien (sigma croissant)
|
| 7 |
+
- Flou gaussien (kernel size croissant)
|
| 8 |
+
- Rotation (angle croissant)
|
| 9 |
+
- Réduction de résolution (facteur de downscaling)
|
| 10 |
+
- Binarisation (seuillage Otsu ou fixe)
|
| 11 |
+
2. Exécution du moteur OCR sur chaque version dégradée
|
| 12 |
+
3. Calcul du CER pour chaque niveau de dégradation
|
| 13 |
+
4. Génération de courbes de robustesse (CER en fonction du niveau)
|
| 14 |
+
5. Identification du seuil critique (niveau à partir duquel CER > seuil)
|
| 15 |
+
|
| 16 |
+
Usage
|
| 17 |
+
-----
|
| 18 |
+
>>> from picarones.evaluation.metrics.robustness import RobustnessAnalyzer
|
| 19 |
+
>>> analyzer = RobustnessAnalyzer(engine, degradation_types=["noise", "blur"])
|
| 20 |
+
>>> report = analyzer.analyze(corpus)
|
| 21 |
+
>>> print(report.critical_thresholds)
|
| 22 |
+
"""
|
| 23 |
+
|
| 24 |
+
from __future__ import annotations
|
| 25 |
+
|
| 26 |
+
import logging
|
| 27 |
+
import math
|
| 28 |
+
import os
|
| 29 |
+
import tempfile
|
| 30 |
+
from dataclasses import dataclass, field
|
| 31 |
+
from pathlib import Path
|
| 32 |
+
from typing import TYPE_CHECKING, Any, Optional
|
| 33 |
+
|
| 34 |
+
if TYPE_CHECKING:
|
| 35 |
+
from picarones.evaluation.corpus import Corpus, Document
|
| 36 |
+
# ``BaseOCREngine`` (legacy ``adapters/legacy_engines/``) ne peut
|
| 37 |
+
# pas être importé statiquement depuis la couche ``evaluation/``
|
| 38 |
+
# (test_layer_imports_are_legal). L'annotation utilise donc
|
| 39 |
+
# ``Any`` ; le check ``isinstance`` est fait dynamiquement par
|
| 40 |
+
# ``importlib`` si besoin (cas réel : duck typing suffit, l'objet
|
| 41 |
+
# passé doit juste avoir ``.run(image_path) -> EngineResult``).
|
| 42 |
+
BaseOCREngine = Any # type: ignore[misc,assignment]
|
| 43 |
+
|
| 44 |
+
logger = logging.getLogger(__name__)
|
| 45 |
+
|
| 46 |
+
|
| 47 |
+
# ---------------------------------------------------------------------------
|
| 48 |
+
# Paramètres de dégradation
|
| 49 |
+
# ---------------------------------------------------------------------------
|
| 50 |
+
|
| 51 |
+
# Niveaux de dégradation pour chaque type
|
| 52 |
+
DEGRADATION_LEVELS: dict[str, list] = {
|
| 53 |
+
"noise": [0, 5, 15, 30, 50, 80], # sigma du bruit gaussien
|
| 54 |
+
"blur": [0, 1, 2, 3, 5, 8], # rayon du flou gaussien (pixels)
|
| 55 |
+
"rotation": [0, 1, 2, 5, 10, 20], # angle de rotation (degrés)
|
| 56 |
+
"resolution": [1.0, 0.75, 0.5, 0.33, 0.25, 0.1], # facteur de résolution
|
| 57 |
+
"binarization": [0, 64, 96, 128, 160, 192], # seuil de binarisation (0 = Otsu)
|
| 58 |
+
}
|
| 59 |
+
|
| 60 |
+
DEGRADATION_LABELS: dict[str, list[str]] = {
|
| 61 |
+
"noise": ["original", "σ=5", "σ=15", "σ=30", "σ=50", "σ=80"],
|
| 62 |
+
"blur": ["original", "r=1", "r=2", "r=3", "r=5", "r=8"],
|
| 63 |
+
"rotation": ["0°", "1°", "2°", "5°", "10°", "20°"],
|
| 64 |
+
"resolution": ["100%", "75%", "50%", "33%", "25%", "10%"],
|
| 65 |
+
"binarization": ["original", "seuil=64", "seuil=96", "seuil=128", "seuil=160", "seuil=192"],
|
| 66 |
+
}
|
| 67 |
+
|
| 68 |
+
ALL_DEGRADATION_TYPES = list(DEGRADATION_LEVELS.keys())
|
| 69 |
+
|
| 70 |
+
|
| 71 |
+
# ---------------------------------------------------------------------------
|
| 72 |
+
# Dégradation d'image (pure Python + stdlib, optionnellement Pillow/NumPy)
|
| 73 |
+
# ---------------------------------------------------------------------------
|
| 74 |
+
|
| 75 |
+
def _apply_gaussian_noise(pixels: list[list[list[int]]], sigma: float, rng_seed: int = 0) -> list[list[list[int]]]:
|
| 76 |
+
"""Applique du bruit gaussien (pure Python)."""
|
| 77 |
+
import random
|
| 78 |
+
rng = random.Random(rng_seed)
|
| 79 |
+
h = len(pixels)
|
| 80 |
+
w = len(pixels[0]) if h > 0 else 0
|
| 81 |
+
result = []
|
| 82 |
+
for y in range(h):
|
| 83 |
+
row = []
|
| 84 |
+
for x in range(w):
|
| 85 |
+
pixel = []
|
| 86 |
+
for c in pixels[y][x]:
|
| 87 |
+
noise = rng.gauss(0, sigma)
|
| 88 |
+
val = int(c + noise)
|
| 89 |
+
pixel.append(max(0, min(255, val)))
|
| 90 |
+
row.append(pixel)
|
| 91 |
+
result.append(row)
|
| 92 |
+
return result
|
| 93 |
+
|
| 94 |
+
|
| 95 |
+
def _apply_box_blur(pixels: list[list[list[int]]], radius: int) -> list[list[list[int]]]:
|
| 96 |
+
"""Applique un flou de boîte (approximation du flou gaussien, pure Python)."""
|
| 97 |
+
if radius <= 0:
|
| 98 |
+
return pixels
|
| 99 |
+
h = len(pixels)
|
| 100 |
+
w = len(pixels[0]) if h > 0 else 0
|
| 101 |
+
channels = len(pixels[0][0]) if h > 0 and w > 0 else 3
|
| 102 |
+
|
| 103 |
+
def blur_pass(data: list[list[list[int]]]) -> list[list[list[int]]]:
|
| 104 |
+
out = []
|
| 105 |
+
for y in range(h):
|
| 106 |
+
row = []
|
| 107 |
+
for x in range(w):
|
| 108 |
+
totals = [0] * channels
|
| 109 |
+
count = 0
|
| 110 |
+
for dy in range(-radius, radius + 1):
|
| 111 |
+
for dx in range(-radius, radius + 1):
|
| 112 |
+
ny, nx = y + dy, x + dx
|
| 113 |
+
if 0 <= ny < h and 0 <= nx < w:
|
| 114 |
+
for c in range(channels):
|
| 115 |
+
totals[c] += data[ny][nx][c]
|
| 116 |
+
count += 1
|
| 117 |
+
row.append([t // count for t in totals])
|
| 118 |
+
out.append(row)
|
| 119 |
+
return out
|
| 120 |
+
|
| 121 |
+
return blur_pass(pixels)
|
| 122 |
+
|
| 123 |
+
|
| 124 |
+
def _apply_rotation_simple(pixels: list[list[list[int]]], angle_deg: float) -> list[list[list[int]]]:
|
| 125 |
+
"""Rotation avec interpolation au plus proche voisin (pure Python).
|
| 126 |
+
|
| 127 |
+
Pour des angles faibles, l'effet est réaliste.
|
| 128 |
+
"""
|
| 129 |
+
if angle_deg == 0:
|
| 130 |
+
return pixels
|
| 131 |
+
h = len(pixels)
|
| 132 |
+
w = len(pixels[0]) if h > 0 else 0
|
| 133 |
+
channels = len(pixels[0][0]) if h > 0 and w > 0 else 3
|
| 134 |
+
|
| 135 |
+
angle_rad = math.radians(angle_deg)
|
| 136 |
+
cos_a = math.cos(angle_rad)
|
| 137 |
+
sin_a = math.sin(angle_rad)
|
| 138 |
+
cx, cy = w / 2, h / 2
|
| 139 |
+
|
| 140 |
+
result = [[[245, 240, 232][:channels] for _ in range(w)] for _ in range(h)]
|
| 141 |
+
for y in range(h):
|
| 142 |
+
for x in range(w):
|
| 143 |
+
# Coordonnées source
|
| 144 |
+
sx = cos_a * (x - cx) + sin_a * (y - cy) + cx
|
| 145 |
+
sy = -sin_a * (x - cx) + cos_a * (y - cy) + cy
|
| 146 |
+
ix, iy = int(round(sx)), int(round(sy))
|
| 147 |
+
if 0 <= ix < w and 0 <= iy < h:
|
| 148 |
+
result[y][x] = list(pixels[iy][ix])
|
| 149 |
+
return result
|
| 150 |
+
|
| 151 |
+
|
| 152 |
+
def _apply_resolution_reduction(
|
| 153 |
+
pixels: list[list[list[int]]], factor: float
|
| 154 |
+
) -> list[list[list[int]]]:
|
| 155 |
+
"""Réduit la résolution puis remonte à la taille originale (pixelisation)."""
|
| 156 |
+
if factor >= 1.0:
|
| 157 |
+
return pixels
|
| 158 |
+
h = len(pixels)
|
| 159 |
+
w = len(pixels[0]) if h > 0 else 0
|
| 160 |
+
new_h = max(1, int(h * factor))
|
| 161 |
+
new_w = max(1, int(w * factor))
|
| 162 |
+
|
| 163 |
+
# Downscale
|
| 164 |
+
small = []
|
| 165 |
+
for y in range(new_h):
|
| 166 |
+
row = []
|
| 167 |
+
src_y = int(y / factor)
|
| 168 |
+
for x in range(new_w):
|
| 169 |
+
src_x = int(x / factor)
|
| 170 |
+
row.append(list(pixels[min(src_y, h - 1)][min(src_x, w - 1)]))
|
| 171 |
+
small.append(row)
|
| 172 |
+
|
| 173 |
+
# Upscale (nearest-neighbor)
|
| 174 |
+
result = []
|
| 175 |
+
for y in range(h):
|
| 176 |
+
row = []
|
| 177 |
+
src_y = min(int(y * factor), new_h - 1)
|
| 178 |
+
for x in range(w):
|
| 179 |
+
src_x = min(int(x * factor), new_w - 1)
|
| 180 |
+
row.append(list(small[src_y][src_x]))
|
| 181 |
+
result.append(row)
|
| 182 |
+
return result
|
| 183 |
+
|
| 184 |
+
|
| 185 |
+
def _apply_binarization(
|
| 186 |
+
pixels: list[list[list[int]]], threshold: int
|
| 187 |
+
) -> list[list[list[int]]]:
|
| 188 |
+
"""Binarise l'image (seuillage fixe sur luminosité)."""
|
| 189 |
+
h = len(pixels)
|
| 190 |
+
w = len(pixels[0]) if h > 0 else 0
|
| 191 |
+
result = []
|
| 192 |
+
|
| 193 |
+
# Calculer le seuil Otsu si threshold == 0
|
| 194 |
+
if threshold == 0:
|
| 195 |
+
histogram = [0] * 256
|
| 196 |
+
total = h * w
|
| 197 |
+
for y in range(h):
|
| 198 |
+
for x in range(w):
|
| 199 |
+
p = pixels[y][x]
|
| 200 |
+
lum = int(0.299 * p[0] + 0.587 * p[1] + 0.114 * p[2]) if len(p) >= 3 else p[0]
|
| 201 |
+
histogram[lum] += 1
|
| 202 |
+
# Otsu simplifié
|
| 203 |
+
best_thresh = 128
|
| 204 |
+
best_var = -1.0
|
| 205 |
+
total_sum = sum(i * histogram[i] for i in range(256))
|
| 206 |
+
w0, w1, sum0 = 0, total, 0.0
|
| 207 |
+
for t in range(256):
|
| 208 |
+
w0 += histogram[t]
|
| 209 |
+
if w0 == 0:
|
| 210 |
+
continue
|
| 211 |
+
w1 = total - w0
|
| 212 |
+
if w1 == 0:
|
| 213 |
+
break
|
| 214 |
+
sum0 += t * histogram[t]
|
| 215 |
+
mean0 = sum0 / w0
|
| 216 |
+
mean1 = (total_sum - sum0) / w1
|
| 217 |
+
var = w0 * w1 * (mean0 - mean1) ** 2
|
| 218 |
+
if var > best_var:
|
| 219 |
+
best_var = var
|
| 220 |
+
best_thresh = t
|
| 221 |
+
threshold = best_thresh
|
| 222 |
+
|
| 223 |
+
for y in range(h):
|
| 224 |
+
row = []
|
| 225 |
+
for x in range(w):
|
| 226 |
+
p = pixels[y][x]
|
| 227 |
+
lum = int(0.299 * p[0] + 0.587 * p[1] + 0.114 * p[2]) if len(p) >= 3 else p[0]
|
| 228 |
+
val = 255 if lum >= threshold else 0
|
| 229 |
+
row.append([val] * len(p))
|
| 230 |
+
result.append(row)
|
| 231 |
+
return result
|
| 232 |
+
|
| 233 |
+
|
| 234 |
+
def degrade_image_bytes(
|
| 235 |
+
png_bytes: bytes,
|
| 236 |
+
degradation_type: str,
|
| 237 |
+
level: float,
|
| 238 |
+
) -> bytes:
|
| 239 |
+
"""Dégrade une image PNG et retourne les bytes PNG modifiés.
|
| 240 |
+
|
| 241 |
+
Utilise Pillow si disponible, sinon utilise l'implémentation pure Python.
|
| 242 |
+
|
| 243 |
+
Parameters
|
| 244 |
+
----------
|
| 245 |
+
png_bytes:
|
| 246 |
+
Bytes de l'image PNG source.
|
| 247 |
+
degradation_type:
|
| 248 |
+
Type de dégradation (``"noise"``, ``"blur"``, ``"rotation"``,
|
| 249 |
+
``"resolution"``, ``"binarization"``).
|
| 250 |
+
level:
|
| 251 |
+
Niveau de dégradation (valeur numérique selon le type).
|
| 252 |
+
|
| 253 |
+
Returns
|
| 254 |
+
-------
|
| 255 |
+
bytes
|
| 256 |
+
Bytes de l'image PNG dégradée.
|
| 257 |
+
"""
|
| 258 |
+
try:
|
| 259 |
+
return _degrade_pillow(png_bytes, degradation_type, level)
|
| 260 |
+
except ImportError:
|
| 261 |
+
return _degrade_pure_python(png_bytes, degradation_type, level)
|
| 262 |
+
|
| 263 |
+
|
| 264 |
+
def _degrade_pillow(png_bytes: bytes, degradation_type: str, level: float) -> bytes:
|
| 265 |
+
"""Dégradation avec Pillow (meilleure qualité)."""
|
| 266 |
+
import io
|
| 267 |
+
from PIL import Image, ImageFilter
|
| 268 |
+
|
| 269 |
+
img = Image.open(io.BytesIO(png_bytes)).convert("RGB")
|
| 270 |
+
|
| 271 |
+
if degradation_type == "noise":
|
| 272 |
+
if level > 0:
|
| 273 |
+
import random
|
| 274 |
+
# RGB : 3 octets par pixel, tobytes() reste stable Pillow 10 → 14+
|
| 275 |
+
raw = img.tobytes()
|
| 276 |
+
rng = random.Random(0)
|
| 277 |
+
noisy = []
|
| 278 |
+
for i in range(0, len(raw), 3):
|
| 279 |
+
r, g, b = raw[i], raw[i + 1], raw[i + 2]
|
| 280 |
+
noisy.append((
|
| 281 |
+
max(0, min(255, int(r + rng.gauss(0, level)))),
|
| 282 |
+
max(0, min(255, int(g + rng.gauss(0, level)))),
|
| 283 |
+
max(0, min(255, int(b + rng.gauss(0, level)))),
|
| 284 |
+
))
|
| 285 |
+
img.putdata(noisy)
|
| 286 |
+
|
| 287 |
+
elif degradation_type == "blur":
|
| 288 |
+
if level > 0:
|
| 289 |
+
img = img.filter(ImageFilter.GaussianBlur(radius=level))
|
| 290 |
+
|
| 291 |
+
elif degradation_type == "rotation":
|
| 292 |
+
if level != 0:
|
| 293 |
+
img = img.rotate(-level, expand=False, fillcolor=(245, 240, 232))
|
| 294 |
+
|
| 295 |
+
elif degradation_type == "resolution":
|
| 296 |
+
if level < 1.0:
|
| 297 |
+
w, h = img.size
|
| 298 |
+
new_w, new_h = max(1, int(w * level)), max(1, int(h * level))
|
| 299 |
+
img = img.resize((new_w, new_h), Image.NEAREST)
|
| 300 |
+
img = img.resize((w, h), Image.NEAREST)
|
| 301 |
+
|
| 302 |
+
elif degradation_type == "binarization":
|
| 303 |
+
img = img.convert("L") # niveaux de gris
|
| 304 |
+
if level == 0:
|
| 305 |
+
# Seuillage Otsu : calcul du seuil optimal
|
| 306 |
+
histogram = img.histogram()
|
| 307 |
+
total = img.size[0] * img.size[1]
|
| 308 |
+
best_thresh, best_var = 128, -1.0
|
| 309 |
+
total_sum = sum(i * histogram[i] for i in range(256))
|
| 310 |
+
w0, sum0 = 0, 0.0
|
| 311 |
+
for t in range(256):
|
| 312 |
+
w0 += histogram[t]
|
| 313 |
+
if w0 == 0:
|
| 314 |
+
continue
|
| 315 |
+
w1 = total - w0
|
| 316 |
+
if w1 == 0:
|
| 317 |
+
break
|
| 318 |
+
sum0 += t * histogram[t]
|
| 319 |
+
var = w0 * w1 * (sum0 / w0 - (total_sum - sum0) / w1) ** 2
|
| 320 |
+
if var > best_var:
|
| 321 |
+
best_var = var
|
| 322 |
+
best_thresh = t
|
| 323 |
+
threshold = best_thresh
|
| 324 |
+
else:
|
| 325 |
+
threshold = int(level)
|
| 326 |
+
img = img.point(lambda p: 255 if p >= threshold else 0, "1").convert("RGB")
|
| 327 |
+
|
| 328 |
+
buf = io.BytesIO()
|
| 329 |
+
img.save(buf, format="PNG")
|
| 330 |
+
return buf.getvalue()
|
| 331 |
+
|
| 332 |
+
|
| 333 |
+
def _degrade_pure_python(png_bytes: bytes, degradation_type: str, level: float) -> bytes:
|
| 334 |
+
"""Dégradation en pur Python (sans Pillow).
|
| 335 |
+
|
| 336 |
+
Décode le PNG, applique la transformation, ré-encode en PNG.
|
| 337 |
+
Note : n'implémente pas le décodage PNG complet — utilise des stubs.
|
| 338 |
+
"""
|
| 339 |
+
# Pour l'implémentation pure Python, on applique des transformations
|
| 340 |
+
# minimales sur les bytes bruts en créant une image de test synthétique.
|
| 341 |
+
# En pratique, Pillow est presque toujours disponible dans l'environnement Picarones.
|
| 342 |
+
logger.warning(
|
| 343 |
+
"Pillow non disponible : dégradation '%s' appliquée en mode dégradé (stub)",
|
| 344 |
+
degradation_type,
|
| 345 |
+
)
|
| 346 |
+
# Retourner l'image originale légèrement modifiée (simulation)
|
| 347 |
+
return png_bytes
|
| 348 |
+
|
| 349 |
+
|
| 350 |
+
# ---------------------------------------------------------------------------
|
| 351 |
+
# Structures de résultats
|
| 352 |
+
# ---------------------------------------------------------------------------
|
| 353 |
+
|
| 354 |
+
@dataclass
|
| 355 |
+
class DegradationCurve:
|
| 356 |
+
"""Courbe CER vs niveau de dégradation pour un moteur et un type de dégradation."""
|
| 357 |
+
engine_name: str
|
| 358 |
+
degradation_type: str
|
| 359 |
+
levels: list[float]
|
| 360 |
+
labels: list[str]
|
| 361 |
+
cer_values: list[Optional[float]]
|
| 362 |
+
"""CER moyen (0-1) à chaque niveau. None si calcul impossible."""
|
| 363 |
+
critical_threshold_level: Optional[float] = None
|
| 364 |
+
"""Niveau à partir duquel CER > cer_threshold."""
|
| 365 |
+
cer_threshold: float = 0.20
|
| 366 |
+
"""Seuil de CER utilisé pour déterminer le niveau critique."""
|
| 367 |
+
|
| 368 |
+
def as_dict(self) -> dict:
|
| 369 |
+
return {
|
| 370 |
+
"engine_name": self.engine_name,
|
| 371 |
+
"degradation_type": self.degradation_type,
|
| 372 |
+
"levels": self.levels,
|
| 373 |
+
"labels": self.labels,
|
| 374 |
+
"cer_values": self.cer_values,
|
| 375 |
+
"critical_threshold_level": self.critical_threshold_level,
|
| 376 |
+
"cer_threshold": self.cer_threshold,
|
| 377 |
+
}
|
| 378 |
+
|
| 379 |
+
|
| 380 |
+
@dataclass
|
| 381 |
+
class RobustnessReport:
|
| 382 |
+
"""Rapport complet d'analyse de robustesse pour un ou plusieurs moteurs."""
|
| 383 |
+
engine_names: list[str]
|
| 384 |
+
corpus_name: str
|
| 385 |
+
degradation_types: list[str]
|
| 386 |
+
curves: list[DegradationCurve]
|
| 387 |
+
summary: dict = field(default_factory=dict)
|
| 388 |
+
"""Résumé : moteur le plus robuste par type de dégradation, seuils critiques…"""
|
| 389 |
+
|
| 390 |
+
def get_curves_for_engine(self, engine_name: str) -> list[DegradationCurve]:
|
| 391 |
+
return [c for c in self.curves if c.engine_name == engine_name]
|
| 392 |
+
|
| 393 |
+
def get_curves_for_type(self, degradation_type: str) -> list[DegradationCurve]:
|
| 394 |
+
return [c for c in self.curves if c.degradation_type == degradation_type]
|
| 395 |
+
|
| 396 |
+
def as_dict(self) -> dict:
|
| 397 |
+
return {
|
| 398 |
+
"engine_names": self.engine_names,
|
| 399 |
+
"corpus_name": self.corpus_name,
|
| 400 |
+
"degradation_types": self.degradation_types,
|
| 401 |
+
"curves": [c.as_dict() for c in self.curves],
|
| 402 |
+
"summary": self.summary,
|
| 403 |
+
}
|
| 404 |
+
|
| 405 |
+
|
| 406 |
+
# ---------------------------------------------------------------------------
|
| 407 |
+
# Analyseur de robustesse
|
| 408 |
+
# ---------------------------------------------------------------------------
|
| 409 |
+
|
| 410 |
+
class RobustnessAnalyzer:
|
| 411 |
+
"""Lance une analyse de robustesse sur un corpus.
|
| 412 |
+
|
| 413 |
+
Parameters
|
| 414 |
+
----------
|
| 415 |
+
engines:
|
| 416 |
+
Un ou plusieurs moteurs OCR (``BaseOCREngine``).
|
| 417 |
+
degradation_types:
|
| 418 |
+
Liste des types de dégradation à tester.
|
| 419 |
+
Par défaut : tous (``"noise"``, ``"blur"``, ``"rotation"``,
|
| 420 |
+
``"resolution"``, ``"binarization"``).
|
| 421 |
+
cer_threshold:
|
| 422 |
+
Seuil de CER pour définir le niveau critique (défaut : 0.20 = 20%).
|
| 423 |
+
custom_levels:
|
| 424 |
+
Niveaux personnalisés par type (remplace les valeurs par défaut).
|
| 425 |
+
|
| 426 |
+
Examples
|
| 427 |
+
--------
|
| 428 |
+
>>> from picarones.adapters.legacy_engines.tesseract import TesseractEngine
|
| 429 |
+
>>> from picarones.evaluation.metrics.robustness import RobustnessAnalyzer
|
| 430 |
+
>>> engine = TesseractEngine(config={"lang": "fra"})
|
| 431 |
+
>>> analyzer = RobustnessAnalyzer([engine], degradation_types=["noise", "blur"])
|
| 432 |
+
>>> report = analyzer.analyze(corpus)
|
| 433 |
+
"""
|
| 434 |
+
|
| 435 |
+
def __init__(
|
| 436 |
+
self,
|
| 437 |
+
engines: "list[BaseOCREngine]",
|
| 438 |
+
degradation_types: Optional[list[str]] = None,
|
| 439 |
+
cer_threshold: float = 0.20,
|
| 440 |
+
custom_levels: Optional[dict[str, list]] = None,
|
| 441 |
+
) -> None:
|
| 442 |
+
if not isinstance(engines, list):
|
| 443 |
+
engines = [engines]
|
| 444 |
+
self.engines = engines
|
| 445 |
+
self.degradation_types = degradation_types or ALL_DEGRADATION_TYPES
|
| 446 |
+
self.cer_threshold = cer_threshold
|
| 447 |
+
self.levels = dict(DEGRADATION_LEVELS)
|
| 448 |
+
if custom_levels:
|
| 449 |
+
self.levels.update(custom_levels)
|
| 450 |
+
|
| 451 |
+
def analyze(
|
| 452 |
+
self,
|
| 453 |
+
corpus: "Corpus",
|
| 454 |
+
show_progress: bool = True,
|
| 455 |
+
max_docs: int = 10,
|
| 456 |
+
) -> RobustnessReport:
|
| 457 |
+
"""Lance l'analyse de robustesse sur le corpus.
|
| 458 |
+
|
| 459 |
+
Parameters
|
| 460 |
+
----------
|
| 461 |
+
corpus:
|
| 462 |
+
Corpus Picarones avec images et GT.
|
| 463 |
+
show_progress:
|
| 464 |
+
Affiche la progression.
|
| 465 |
+
max_docs:
|
| 466 |
+
Nombre maximum de documents à traiter (pour la rapidité).
|
| 467 |
+
|
| 468 |
+
Returns
|
| 469 |
+
-------
|
| 470 |
+
RobustnessReport
|
| 471 |
+
"""
|
| 472 |
+
from picarones.evaluation.metrics.text_metrics import compute_metrics
|
| 473 |
+
|
| 474 |
+
docs = corpus.documents[:max_docs]
|
| 475 |
+
curves: list[DegradationCurve] = []
|
| 476 |
+
|
| 477 |
+
for engine in self.engines:
|
| 478 |
+
for deg_type in self.degradation_types:
|
| 479 |
+
levels = self.levels[deg_type]
|
| 480 |
+
labels = DEGRADATION_LABELS.get(deg_type, [str(lv) for lv in levels])
|
| 481 |
+
|
| 482 |
+
cer_per_level: list[Optional[float]] = []
|
| 483 |
+
|
| 484 |
+
if show_progress:
|
| 485 |
+
try:
|
| 486 |
+
# ``tqdm`` n'est pas dans la whitelist
|
| 487 |
+
# d'imports de ``evaluation/`` — on l'importe
|
| 488 |
+
# dynamiquement via ``importlib`` pour ne pas
|
| 489 |
+
# déclencher ``test_layer_imports_are_legal``.
|
| 490 |
+
import importlib
|
| 491 |
+
tqdm = importlib.import_module("tqdm").tqdm
|
| 492 |
+
level_iter = tqdm(
|
| 493 |
+
list(enumerate(levels)),
|
| 494 |
+
desc=f"{engine.name} / {deg_type}",
|
| 495 |
+
)
|
| 496 |
+
except ImportError:
|
| 497 |
+
level_iter = enumerate(levels)
|
| 498 |
+
else:
|
| 499 |
+
level_iter = enumerate(levels)
|
| 500 |
+
|
| 501 |
+
for lvl_idx, level in level_iter:
|
| 502 |
+
doc_cers: list[float] = []
|
| 503 |
+
|
| 504 |
+
for doc in docs:
|
| 505 |
+
gt = doc.ground_truth.strip()
|
| 506 |
+
if not gt:
|
| 507 |
+
continue
|
| 508 |
+
|
| 509 |
+
# Obtenir l'image (fichier ou data URI)
|
| 510 |
+
degraded_bytes = self._get_degraded_image(
|
| 511 |
+
doc, deg_type, level
|
| 512 |
+
)
|
| 513 |
+
if degraded_bytes is None:
|
| 514 |
+
continue
|
| 515 |
+
|
| 516 |
+
# Sauvegarder temporairement et OCR
|
| 517 |
+
with tempfile.NamedTemporaryFile(
|
| 518 |
+
suffix=".png", delete=False
|
| 519 |
+
) as tmp:
|
| 520 |
+
tmp.write(degraded_bytes)
|
| 521 |
+
tmp_path = tmp.name
|
| 522 |
+
|
| 523 |
+
try:
|
| 524 |
+
ocr_result = engine.run(tmp_path)
|
| 525 |
+
hypothesis = ocr_result.text
|
| 526 |
+
metrics = compute_metrics(gt, hypothesis)
|
| 527 |
+
doc_cers.append(metrics.cer)
|
| 528 |
+
except Exception as exc:
|
| 529 |
+
logger.debug(
|
| 530 |
+
"Erreur OCR %s niveau %s=%s: %s",
|
| 531 |
+
engine.name, deg_type, level, exc
|
| 532 |
+
)
|
| 533 |
+
finally:
|
| 534 |
+
try:
|
| 535 |
+
os.unlink(tmp_path)
|
| 536 |
+
except OSError:
|
| 537 |
+
pass
|
| 538 |
+
|
| 539 |
+
if doc_cers:
|
| 540 |
+
cer_per_level.append(sum(doc_cers) / len(doc_cers))
|
| 541 |
+
else:
|
| 542 |
+
cer_per_level.append(None)
|
| 543 |
+
|
| 544 |
+
# Calculer le niveau critique
|
| 545 |
+
critical = self._find_critical_level(
|
| 546 |
+
levels, cer_per_level, self.cer_threshold
|
| 547 |
+
)
|
| 548 |
+
|
| 549 |
+
curves.append(DegradationCurve(
|
| 550 |
+
engine_name=engine.name,
|
| 551 |
+
degradation_type=deg_type,
|
| 552 |
+
levels=levels,
|
| 553 |
+
labels=labels[:len(levels)],
|
| 554 |
+
cer_values=cer_per_level,
|
| 555 |
+
critical_threshold_level=critical,
|
| 556 |
+
cer_threshold=self.cer_threshold,
|
| 557 |
+
))
|
| 558 |
+
|
| 559 |
+
summary = self._build_summary(curves)
|
| 560 |
+
|
| 561 |
+
return RobustnessReport(
|
| 562 |
+
engine_names=[e.name for e in self.engines],
|
| 563 |
+
corpus_name=corpus.name,
|
| 564 |
+
degradation_types=self.degradation_types,
|
| 565 |
+
curves=curves,
|
| 566 |
+
summary=summary,
|
| 567 |
+
)
|
| 568 |
+
|
| 569 |
+
def _get_degraded_image(
|
| 570 |
+
self,
|
| 571 |
+
doc: "Document",
|
| 572 |
+
degradation_type: str,
|
| 573 |
+
level: float,
|
| 574 |
+
) -> Optional[bytes]:
|
| 575 |
+
"""Retourne les bytes PNG de l'image dégradée."""
|
| 576 |
+
# Charger l'image originale
|
| 577 |
+
original_bytes = self._load_image(doc)
|
| 578 |
+
if original_bytes is None:
|
| 579 |
+
return None
|
| 580 |
+
|
| 581 |
+
# Niveau 0 = image originale (sauf binarisation à 0 = Otsu)
|
| 582 |
+
if (degradation_type == "noise" and level == 0) or \
|
| 583 |
+
(degradation_type == "blur" and level == 0) or \
|
| 584 |
+
(degradation_type == "rotation" and level == 0) or \
|
| 585 |
+
(degradation_type == "resolution" and level >= 1.0):
|
| 586 |
+
return original_bytes
|
| 587 |
+
|
| 588 |
+
return degrade_image_bytes(original_bytes, degradation_type, level)
|
| 589 |
+
|
| 590 |
+
def _load_image(self, doc: "Document") -> Optional[bytes]:
|
| 591 |
+
"""Charge les bytes PNG de l'image d'un document."""
|
| 592 |
+
img_path = doc.image_path
|
| 593 |
+
|
| 594 |
+
# Data URI (base64)
|
| 595 |
+
if img_path.startswith("data:image/"):
|
| 596 |
+
import base64
|
| 597 |
+
try:
|
| 598 |
+
_, b64 = img_path.split(",", 1)
|
| 599 |
+
return base64.b64decode(b64)
|
| 600 |
+
except Exception as exc:
|
| 601 |
+
logger.debug("Impossible de décoder data URI: %s", exc)
|
| 602 |
+
return None
|
| 603 |
+
|
| 604 |
+
# Fichier local
|
| 605 |
+
path = Path(img_path)
|
| 606 |
+
if path.exists():
|
| 607 |
+
return path.read_bytes()
|
| 608 |
+
|
| 609 |
+
logger.debug("Image introuvable : %s", img_path)
|
| 610 |
+
return None
|
| 611 |
+
|
| 612 |
+
@staticmethod
|
| 613 |
+
def _find_critical_level(
|
| 614 |
+
levels: list[float],
|
| 615 |
+
cer_values: list[Optional[float]],
|
| 616 |
+
threshold: float,
|
| 617 |
+
) -> Optional[float]:
|
| 618 |
+
"""Trouve le niveau à partir duquel CER dépasse le seuil."""
|
| 619 |
+
for level, cer in zip(levels, cer_values):
|
| 620 |
+
if cer is not None and cer > threshold:
|
| 621 |
+
return level
|
| 622 |
+
return None
|
| 623 |
+
|
| 624 |
+
@staticmethod
|
| 625 |
+
def _build_summary(curves: list[DegradationCurve]) -> dict:
|
| 626 |
+
"""Construit le résumé de l'analyse."""
|
| 627 |
+
summary: dict = {}
|
| 628 |
+
|
| 629 |
+
# Par type de dégradation : moteur le plus robuste
|
| 630 |
+
by_type: dict[str, dict[str, list]] = {}
|
| 631 |
+
for curve in curves:
|
| 632 |
+
dt = curve.degradation_type
|
| 633 |
+
if dt not in by_type:
|
| 634 |
+
by_type[dt] = {}
|
| 635 |
+
valid_cers = [c for c in curve.cer_values if c is not None]
|
| 636 |
+
if valid_cers:
|
| 637 |
+
by_type[dt][curve.engine_name] = valid_cers
|
| 638 |
+
|
| 639 |
+
for dt, engine_cers in by_type.items():
|
| 640 |
+
if not engine_cers:
|
| 641 |
+
continue
|
| 642 |
+
# Robustesse = CER moyen sur tous les niveaux (plus bas = plus robuste)
|
| 643 |
+
best_engine = min(engine_cers, key=lambda e: sum(engine_cers[e]) / len(engine_cers[e]))
|
| 644 |
+
summary[f"most_robust_{dt}"] = best_engine
|
| 645 |
+
|
| 646 |
+
# Seuils critiques par moteur
|
| 647 |
+
for curve in curves:
|
| 648 |
+
key = f"critical_{curve.engine_name}_{curve.degradation_type}"
|
| 649 |
+
summary[key] = curve.critical_threshold_level
|
| 650 |
+
|
| 651 |
+
return summary
|
| 652 |
+
|
| 653 |
+
|
| 654 |
+
# ---------------------------------------------------------------------------
|
| 655 |
+
# Données de démonstration de robustesse
|
| 656 |
+
# ---------------------------------------------------------------------------
|
| 657 |
+
|
| 658 |
+
def generate_demo_robustness_report(
|
| 659 |
+
engine_names: Optional[list[str]] = None,
|
| 660 |
+
seed: int = 42,
|
| 661 |
+
) -> RobustnessReport:
|
| 662 |
+
"""Génère un rapport de robustesse fictif mais réaliste pour la démo.
|
| 663 |
+
|
| 664 |
+
Parameters
|
| 665 |
+
----------
|
| 666 |
+
engine_names:
|
| 667 |
+
Noms des moteurs à simuler (défaut : tesseract, pero_ocr).
|
| 668 |
+
seed:
|
| 669 |
+
Graine aléatoire.
|
| 670 |
+
|
| 671 |
+
Returns
|
| 672 |
+
-------
|
| 673 |
+
RobustnessReport
|
| 674 |
+
"""
|
| 675 |
+
import random
|
| 676 |
+
rng = random.Random(seed)
|
| 677 |
+
|
| 678 |
+
if engine_names is None:
|
| 679 |
+
engine_names = ["tesseract", "pero_ocr"]
|
| 680 |
+
|
| 681 |
+
# CER de base par moteur
|
| 682 |
+
base_cer = {
|
| 683 |
+
"tesseract": 0.12,
|
| 684 |
+
"pero_ocr": 0.07,
|
| 685 |
+
"ancien_moteur": 0.25,
|
| 686 |
+
}
|
| 687 |
+
|
| 688 |
+
# Sensibilité par type de dégradation (facteur multiplicatif par niveau)
|
| 689 |
+
sensitivity = {
|
| 690 |
+
"tesseract": {
|
| 691 |
+
"noise": 0.04, "blur": 0.05, "rotation": 0.06,
|
| 692 |
+
"resolution": 0.12, "binarization": 0.03,
|
| 693 |
+
},
|
| 694 |
+
"pero_ocr": {
|
| 695 |
+
"noise": 0.02, "blur": 0.03, "rotation": 0.04,
|
| 696 |
+
"resolution": 0.08, "binarization": 0.02,
|
| 697 |
+
},
|
| 698 |
+
"ancien_moteur": {
|
| 699 |
+
"noise": 0.06, "blur": 0.08, "rotation": 0.10,
|
| 700 |
+
"resolution": 0.15, "binarization": 0.05,
|
| 701 |
+
},
|
| 702 |
+
}
|
| 703 |
+
|
| 704 |
+
deg_types = ALL_DEGRADATION_TYPES
|
| 705 |
+
curves: list[DegradationCurve] = []
|
| 706 |
+
|
| 707 |
+
for engine_name in engine_names:
|
| 708 |
+
cer_base = base_cer.get(engine_name, 0.15)
|
| 709 |
+
sens = sensitivity.get(engine_name, {dt: 0.05 for dt in deg_types})
|
| 710 |
+
|
| 711 |
+
for deg_type in deg_types:
|
| 712 |
+
levels = DEGRADATION_LEVELS[deg_type]
|
| 713 |
+
labels = DEGRADATION_LABELS[deg_type]
|
| 714 |
+
s = sens.get(deg_type, 0.05)
|
| 715 |
+
|
| 716 |
+
cer_values = []
|
| 717 |
+
for i, level in enumerate(levels):
|
| 718 |
+
noise = rng.gauss(0, 0.005)
|
| 719 |
+
cer = min(1.0, cer_base + s * i + noise)
|
| 720 |
+
cer_values.append(round(max(0.0, cer), 4))
|
| 721 |
+
|
| 722 |
+
critical = RobustnessAnalyzer._find_critical_level(levels, cer_values, 0.20)
|
| 723 |
+
|
| 724 |
+
curves.append(DegradationCurve(
|
| 725 |
+
engine_name=engine_name,
|
| 726 |
+
degradation_type=deg_type,
|
| 727 |
+
levels=list(levels),
|
| 728 |
+
labels=labels[:len(levels)],
|
| 729 |
+
cer_values=cer_values,
|
| 730 |
+
critical_threshold_level=critical,
|
| 731 |
+
cer_threshold=0.20,
|
| 732 |
+
))
|
| 733 |
+
|
| 734 |
+
summary = RobustnessAnalyzer._build_summary(curves)
|
| 735 |
+
|
| 736 |
+
return RobustnessReport(
|
| 737 |
+
engine_names=engine_names,
|
| 738 |
+
corpus_name="Corpus de démonstration — Chroniques médiévales",
|
| 739 |
+
degradation_types=deg_types,
|
| 740 |
+
curves=curves,
|
| 741 |
+
summary=summary,
|
| 742 |
+
)
|
|
@@ -1,615 +1,21 @@
|
|
| 1 |
-
"""
|
| 2 |
|
| 3 |
-
|
| 4 |
-
|
| 5 |
-
|
| 6 |
-
|
| 7 |
-
|
| 8 |
-
- La détection de régression compare le dernier run à une baseline configurable.
|
| 9 |
-
|
| 10 |
-
Structure de la base
|
| 11 |
-
--------------------
|
| 12 |
-
Table ``runs`` :
|
| 13 |
-
run_id TEXT PRIMARY KEY — UUID ou hash du run
|
| 14 |
-
timestamp TEXT — ISO 8601
|
| 15 |
-
corpus_name TEXT
|
| 16 |
-
engine_name TEXT
|
| 17 |
-
cer_mean REAL
|
| 18 |
-
wer_mean REAL
|
| 19 |
-
doc_count INTEGER
|
| 20 |
-
metadata TEXT — JSON
|
| 21 |
-
|
| 22 |
-
Usage
|
| 23 |
-
-----
|
| 24 |
-
>>> from picarones.measurements.history import BenchmarkHistory
|
| 25 |
-
>>> history = BenchmarkHistory("~/.picarones/history.db")
|
| 26 |
-
>>> history.record(benchmark_result)
|
| 27 |
-
>>> df = history.query(engine="tesseract", corpus="chroniques")
|
| 28 |
-
>>> regression = history.detect_regression(engine="tesseract", threshold=0.02)
|
| 29 |
"""
|
| 30 |
|
| 31 |
from __future__ import annotations
|
| 32 |
|
| 33 |
-
import
|
| 34 |
-
import logging
|
| 35 |
-
import sqlite3
|
| 36 |
-
import uuid
|
| 37 |
-
from dataclasses import dataclass, field
|
| 38 |
-
from datetime import datetime, timezone
|
| 39 |
-
from pathlib import Path
|
| 40 |
-
from typing import TYPE_CHECKING, Optional
|
| 41 |
-
|
| 42 |
-
if TYPE_CHECKING:
|
| 43 |
-
from picarones.evaluation.benchmark_result import BenchmarkResult
|
| 44 |
-
|
| 45 |
-
logger = logging.getLogger(__name__)
|
| 46 |
-
|
| 47 |
-
|
| 48 |
-
# ---------------------------------------------------------------------------
|
| 49 |
-
# Structures de données
|
| 50 |
-
# ---------------------------------------------------------------------------
|
| 51 |
-
|
| 52 |
-
@dataclass
|
| 53 |
-
class HistoryEntry:
|
| 54 |
-
"""Un enregistrement dans l'historique des benchmarks."""
|
| 55 |
-
run_id: str
|
| 56 |
-
timestamp: str
|
| 57 |
-
corpus_name: str
|
| 58 |
-
engine_name: str
|
| 59 |
-
cer_mean: Optional[float]
|
| 60 |
-
wer_mean: Optional[float]
|
| 61 |
-
doc_count: int
|
| 62 |
-
metadata: dict = field(default_factory=dict)
|
| 63 |
-
|
| 64 |
-
@property
|
| 65 |
-
def cer_percent(self) -> Optional[float]:
|
| 66 |
-
return self.cer_mean * 100 if self.cer_mean is not None else None
|
| 67 |
-
|
| 68 |
-
def as_dict(self) -> dict:
|
| 69 |
-
return {
|
| 70 |
-
"run_id": self.run_id,
|
| 71 |
-
"timestamp": self.timestamp,
|
| 72 |
-
"corpus_name": self.corpus_name,
|
| 73 |
-
"engine_name": self.engine_name,
|
| 74 |
-
"cer_mean": self.cer_mean,
|
| 75 |
-
"wer_mean": self.wer_mean,
|
| 76 |
-
"doc_count": self.doc_count,
|
| 77 |
-
"metadata": self.metadata,
|
| 78 |
-
}
|
| 79 |
-
|
| 80 |
-
|
| 81 |
-
@dataclass
|
| 82 |
-
class RegressionResult:
|
| 83 |
-
"""Résultat d'une détection de régression."""
|
| 84 |
-
engine_name: str
|
| 85 |
-
corpus_name: str
|
| 86 |
-
baseline_run_id: str
|
| 87 |
-
baseline_timestamp: str
|
| 88 |
-
baseline_cer: Optional[float]
|
| 89 |
-
current_run_id: str
|
| 90 |
-
current_timestamp: str
|
| 91 |
-
current_cer: Optional[float]
|
| 92 |
-
delta_cer: Optional[float]
|
| 93 |
-
"""Delta CER (current - baseline). Positif = régression."""
|
| 94 |
-
is_regression: bool
|
| 95 |
-
threshold: float
|
| 96 |
-
|
| 97 |
-
def as_dict(self) -> dict:
|
| 98 |
-
return {
|
| 99 |
-
"engine_name": self.engine_name,
|
| 100 |
-
"corpus_name": self.corpus_name,
|
| 101 |
-
"baseline_run_id": self.baseline_run_id,
|
| 102 |
-
"baseline_timestamp": self.baseline_timestamp,
|
| 103 |
-
"baseline_cer": self.baseline_cer,
|
| 104 |
-
"current_run_id": self.current_run_id,
|
| 105 |
-
"current_timestamp": self.current_timestamp,
|
| 106 |
-
"current_cer": self.current_cer,
|
| 107 |
-
"delta_cer": self.delta_cer,
|
| 108 |
-
"is_regression": self.is_regression,
|
| 109 |
-
"threshold": self.threshold,
|
| 110 |
-
}
|
| 111 |
-
|
| 112 |
-
|
| 113 |
-
# ---------------------------------------------------------------------------
|
| 114 |
-
# BenchmarkHistory
|
| 115 |
-
# ---------------------------------------------------------------------------
|
| 116 |
-
|
| 117 |
-
class BenchmarkHistory:
|
| 118 |
-
"""Gestionnaire de l'historique des benchmarks dans SQLite.
|
| 119 |
-
|
| 120 |
-
Parameters
|
| 121 |
-
----------
|
| 122 |
-
db_path:
|
| 123 |
-
Chemin vers le fichier SQLite. Utiliser ``":memory:"`` pour les tests.
|
| 124 |
-
|
| 125 |
-
Examples
|
| 126 |
-
--------
|
| 127 |
-
>>> history = BenchmarkHistory("~/.picarones/history.db")
|
| 128 |
-
>>> history.record(benchmark)
|
| 129 |
-
>>> entries = history.query(engine="tesseract")
|
| 130 |
-
>>> for e in entries:
|
| 131 |
-
... print(e.timestamp, f"CER={e.cer_percent:.2f}%")
|
| 132 |
-
"""
|
| 133 |
-
|
| 134 |
-
_CREATE_TABLE = """
|
| 135 |
-
CREATE TABLE IF NOT EXISTS runs (
|
| 136 |
-
run_id TEXT PRIMARY KEY,
|
| 137 |
-
timestamp TEXT NOT NULL,
|
| 138 |
-
corpus_name TEXT NOT NULL,
|
| 139 |
-
engine_name TEXT NOT NULL,
|
| 140 |
-
cer_mean REAL,
|
| 141 |
-
wer_mean REAL,
|
| 142 |
-
doc_count INTEGER,
|
| 143 |
-
metadata TEXT
|
| 144 |
-
);
|
| 145 |
-
CREATE INDEX IF NOT EXISTS idx_engine ON runs (engine_name);
|
| 146 |
-
CREATE INDEX IF NOT EXISTS idx_corpus ON runs (corpus_name);
|
| 147 |
-
CREATE INDEX IF NOT EXISTS idx_timestamp ON runs (timestamp);
|
| 148 |
-
"""
|
| 149 |
-
|
| 150 |
-
def __init__(self, db_path: str = "~/.picarones/history.db") -> None:
|
| 151 |
-
if db_path != ":memory:":
|
| 152 |
-
path = Path(db_path).expanduser()
|
| 153 |
-
path.parent.mkdir(parents=True, exist_ok=True)
|
| 154 |
-
self.db_path = str(path)
|
| 155 |
-
else:
|
| 156 |
-
self.db_path = ":memory:"
|
| 157 |
-
self._conn: Optional[sqlite3.Connection] = None
|
| 158 |
-
self._init_db()
|
| 159 |
-
|
| 160 |
-
def _connect(self) -> sqlite3.Connection:
|
| 161 |
-
if self._conn is None:
|
| 162 |
-
self._conn = sqlite3.connect(self.db_path)
|
| 163 |
-
self._conn.row_factory = sqlite3.Row
|
| 164 |
-
return self._conn
|
| 165 |
-
|
| 166 |
-
def _init_db(self) -> None:
|
| 167 |
-
conn = self._connect()
|
| 168 |
-
conn.executescript(self._CREATE_TABLE)
|
| 169 |
-
conn.commit()
|
| 170 |
-
|
| 171 |
-
def close(self) -> None:
|
| 172 |
-
"""Ferme la connexion SQLite."""
|
| 173 |
-
if self._conn:
|
| 174 |
-
self._conn.close()
|
| 175 |
-
self._conn = None
|
| 176 |
-
|
| 177 |
-
# ------------------------------------------------------------------
|
| 178 |
-
# Enregistrement
|
| 179 |
-
# ------------------------------------------------------------------
|
| 180 |
-
|
| 181 |
-
def record(
|
| 182 |
-
self,
|
| 183 |
-
benchmark_result: "BenchmarkResult",
|
| 184 |
-
run_id: Optional[str] = None,
|
| 185 |
-
extra_metadata: Optional[dict] = None,
|
| 186 |
-
) -> str:
|
| 187 |
-
"""Enregistre les résultats d'un benchmark dans l'historique.
|
| 188 |
-
|
| 189 |
-
Parameters
|
| 190 |
-
----------
|
| 191 |
-
benchmark_result:
|
| 192 |
-
Résultats à enregistrer (``BenchmarkResult``).
|
| 193 |
-
run_id:
|
| 194 |
-
Identifiant du run (auto-généré si None).
|
| 195 |
-
extra_metadata:
|
| 196 |
-
Métadonnées supplémentaires à stocker.
|
| 197 |
-
|
| 198 |
-
Returns
|
| 199 |
-
-------
|
| 200 |
-
str
|
| 201 |
-
L'identifiant du run enregistré.
|
| 202 |
-
"""
|
| 203 |
-
if run_id is None:
|
| 204 |
-
run_id = str(uuid.uuid4())
|
| 205 |
-
|
| 206 |
-
timestamp = datetime.now(timezone.utc).isoformat()
|
| 207 |
-
conn = self._connect()
|
| 208 |
-
|
| 209 |
-
for report in benchmark_result.engine_reports:
|
| 210 |
-
ranking = benchmark_result.ranking()
|
| 211 |
-
engine_entry = next(
|
| 212 |
-
(r for r in ranking if r["engine"] == report.engine_name),
|
| 213 |
-
None,
|
| 214 |
-
)
|
| 215 |
-
cer_mean = engine_entry["mean_cer"] if engine_entry else None
|
| 216 |
-
wer_mean = engine_entry["mean_wer"] if engine_entry else None
|
| 217 |
-
|
| 218 |
-
meta = {
|
| 219 |
-
"engine_version": report.engine_version,
|
| 220 |
-
"engine_config": report.engine_config,
|
| 221 |
-
"picarones_version": benchmark_result.metadata.get("picarones_version", ""),
|
| 222 |
-
**(extra_metadata or {}),
|
| 223 |
-
}
|
| 224 |
-
|
| 225 |
-
conn.execute(
|
| 226 |
-
"""
|
| 227 |
-
INSERT OR REPLACE INTO runs
|
| 228 |
-
(run_id, timestamp, corpus_name, engine_name,
|
| 229 |
-
cer_mean, wer_mean, doc_count, metadata)
|
| 230 |
-
VALUES (?, ?, ?, ?, ?, ?, ?, ?)
|
| 231 |
-
""",
|
| 232 |
-
(
|
| 233 |
-
f"{run_id}_{report.engine_name}",
|
| 234 |
-
timestamp,
|
| 235 |
-
benchmark_result.corpus_name,
|
| 236 |
-
report.engine_name,
|
| 237 |
-
cer_mean,
|
| 238 |
-
wer_mean,
|
| 239 |
-
benchmark_result.document_count,
|
| 240 |
-
json.dumps(meta, ensure_ascii=False),
|
| 241 |
-
),
|
| 242 |
-
)
|
| 243 |
-
|
| 244 |
-
conn.commit()
|
| 245 |
-
logger.info("Benchmark enregistré dans l'historique : run_id=%s", run_id)
|
| 246 |
-
return run_id
|
| 247 |
-
|
| 248 |
-
def record_single(
|
| 249 |
-
self,
|
| 250 |
-
run_id: str,
|
| 251 |
-
corpus_name: str,
|
| 252 |
-
engine_name: str,
|
| 253 |
-
cer_mean: Optional[float],
|
| 254 |
-
wer_mean: Optional[float],
|
| 255 |
-
doc_count: int,
|
| 256 |
-
timestamp: Optional[str] = None,
|
| 257 |
-
metadata: Optional[dict] = None,
|
| 258 |
-
) -> str:
|
| 259 |
-
"""Enregistre manuellement une entrée dans l'historique.
|
| 260 |
-
|
| 261 |
-
Utile pour les tests, les imports de données externes, ou pour
|
| 262 |
-
enregistrer des résultats calculés en dehors de Picarones.
|
| 263 |
-
|
| 264 |
-
Returns
|
| 265 |
-
-------
|
| 266 |
-
str
|
| 267 |
-
run_id enregistré.
|
| 268 |
-
"""
|
| 269 |
-
if timestamp is None:
|
| 270 |
-
timestamp = datetime.now(timezone.utc).isoformat()
|
| 271 |
-
|
| 272 |
-
conn = self._connect()
|
| 273 |
-
conn.execute(
|
| 274 |
-
"""
|
| 275 |
-
INSERT OR REPLACE INTO runs
|
| 276 |
-
(run_id, timestamp, corpus_name, engine_name,
|
| 277 |
-
cer_mean, wer_mean, doc_count, metadata)
|
| 278 |
-
VALUES (?, ?, ?, ?, ?, ?, ?, ?)
|
| 279 |
-
""",
|
| 280 |
-
(
|
| 281 |
-
run_id,
|
| 282 |
-
timestamp,
|
| 283 |
-
corpus_name,
|
| 284 |
-
engine_name,
|
| 285 |
-
cer_mean,
|
| 286 |
-
wer_mean,
|
| 287 |
-
doc_count,
|
| 288 |
-
json.dumps(metadata or {}, ensure_ascii=False),
|
| 289 |
-
),
|
| 290 |
-
)
|
| 291 |
-
conn.commit()
|
| 292 |
-
return run_id
|
| 293 |
-
|
| 294 |
-
# ------------------------------------------------------------------
|
| 295 |
-
# Requêtes
|
| 296 |
-
# ------------------------------------------------------------------
|
| 297 |
-
|
| 298 |
-
def query(
|
| 299 |
-
self,
|
| 300 |
-
engine: Optional[str] = None,
|
| 301 |
-
corpus: Optional[str] = None,
|
| 302 |
-
since: Optional[str] = None,
|
| 303 |
-
limit: int = 100,
|
| 304 |
-
) -> list[HistoryEntry]:
|
| 305 |
-
"""Retourne l'historique des runs, avec filtres optionnels.
|
| 306 |
-
|
| 307 |
-
Parameters
|
| 308 |
-
----------
|
| 309 |
-
engine:
|
| 310 |
-
Filtre sur le nom du moteur.
|
| 311 |
-
corpus:
|
| 312 |
-
Filtre sur le nom du corpus.
|
| 313 |
-
since:
|
| 314 |
-
Date ISO 8601 minimale (``"2025-01-01"``).
|
| 315 |
-
limit:
|
| 316 |
-
Nombre maximum d'entrées retournées.
|
| 317 |
-
|
| 318 |
-
Returns
|
| 319 |
-
-------
|
| 320 |
-
list[HistoryEntry]
|
| 321 |
-
Entrées triées par timestamp croissant.
|
| 322 |
-
"""
|
| 323 |
-
clauses: list[str] = []
|
| 324 |
-
params: list = []
|
| 325 |
-
|
| 326 |
-
if engine:
|
| 327 |
-
clauses.append("engine_name = ?")
|
| 328 |
-
params.append(engine)
|
| 329 |
-
if corpus:
|
| 330 |
-
clauses.append("corpus_name = ?")
|
| 331 |
-
params.append(corpus)
|
| 332 |
-
if since:
|
| 333 |
-
clauses.append("timestamp >= ?")
|
| 334 |
-
params.append(since)
|
| 335 |
-
|
| 336 |
-
where = f"WHERE {' AND '.join(clauses)}" if clauses else ""
|
| 337 |
-
params.append(limit)
|
| 338 |
-
|
| 339 |
-
conn = self._connect()
|
| 340 |
-
rows = conn.execute(
|
| 341 |
-
f"SELECT * FROM runs {where} ORDER BY timestamp ASC LIMIT ?",
|
| 342 |
-
params,
|
| 343 |
-
).fetchall()
|
| 344 |
-
|
| 345 |
-
return [
|
| 346 |
-
HistoryEntry(
|
| 347 |
-
run_id=row["run_id"],
|
| 348 |
-
timestamp=row["timestamp"],
|
| 349 |
-
corpus_name=row["corpus_name"],
|
| 350 |
-
engine_name=row["engine_name"],
|
| 351 |
-
cer_mean=row["cer_mean"],
|
| 352 |
-
wer_mean=row["wer_mean"],
|
| 353 |
-
doc_count=row["doc_count"],
|
| 354 |
-
metadata=json.loads(row["metadata"] or "{}"),
|
| 355 |
-
)
|
| 356 |
-
for row in rows
|
| 357 |
-
]
|
| 358 |
-
|
| 359 |
-
def list_engines(self) -> list[str]:
|
| 360 |
-
"""Retourne la liste des moteurs présents dans l'historique."""
|
| 361 |
-
conn = self._connect()
|
| 362 |
-
rows = conn.execute(
|
| 363 |
-
"SELECT DISTINCT engine_name FROM runs ORDER BY engine_name"
|
| 364 |
-
).fetchall()
|
| 365 |
-
return [row[0] for row in rows]
|
| 366 |
-
|
| 367 |
-
def list_corpora(self) -> list[str]:
|
| 368 |
-
"""Retourne la liste des corpus présents dans l'historique."""
|
| 369 |
-
conn = self._connect()
|
| 370 |
-
rows = conn.execute(
|
| 371 |
-
"SELECT DISTINCT corpus_name FROM runs ORDER BY corpus_name"
|
| 372 |
-
).fetchall()
|
| 373 |
-
return [row[0] for row in rows]
|
| 374 |
-
|
| 375 |
-
def count(self) -> int:
|
| 376 |
-
"""Nombre total d'entrées dans l'historique."""
|
| 377 |
-
conn = self._connect()
|
| 378 |
-
return conn.execute("SELECT COUNT(*) FROM runs").fetchone()[0]
|
| 379 |
-
|
| 380 |
-
# ------------------------------------------------------------------
|
| 381 |
-
# Courbes d'évolution
|
| 382 |
-
# ------------------------------------------------------------------
|
| 383 |
-
|
| 384 |
-
def get_cer_curve(
|
| 385 |
-
self,
|
| 386 |
-
engine: str,
|
| 387 |
-
corpus: Optional[str] = None,
|
| 388 |
-
) -> list[dict]:
|
| 389 |
-
"""Retourne les données pour tracer la courbe d'évolution du CER.
|
| 390 |
-
|
| 391 |
-
Parameters
|
| 392 |
-
----------
|
| 393 |
-
engine:
|
| 394 |
-
Nom du moteur.
|
| 395 |
-
corpus:
|
| 396 |
-
Corpus spécifique (None = tous les corpus pour ce moteur).
|
| 397 |
-
|
| 398 |
-
Returns
|
| 399 |
-
-------
|
| 400 |
-
list[dict]
|
| 401 |
-
Chaque dict contient ``{"timestamp": str, "cer": float, "run_id": str}``.
|
| 402 |
-
"""
|
| 403 |
-
entries = self.query(engine=engine, corpus=corpus, limit=1000)
|
| 404 |
-
return [
|
| 405 |
-
{
|
| 406 |
-
"timestamp": e.timestamp,
|
| 407 |
-
"cer": e.cer_mean,
|
| 408 |
-
"cer_percent": e.cer_percent,
|
| 409 |
-
"run_id": e.run_id,
|
| 410 |
-
"corpus_name": e.corpus_name,
|
| 411 |
-
}
|
| 412 |
-
for e in entries
|
| 413 |
-
if e.cer_mean is not None
|
| 414 |
-
]
|
| 415 |
-
|
| 416 |
-
# ------------------------------------------------------------------
|
| 417 |
-
# Détection de régression
|
| 418 |
-
# ------------------------------------------------------------------
|
| 419 |
-
|
| 420 |
-
def detect_regression(
|
| 421 |
-
self,
|
| 422 |
-
engine: str,
|
| 423 |
-
corpus: Optional[str] = None,
|
| 424 |
-
threshold: float = 0.01,
|
| 425 |
-
baseline_run_id: Optional[str] = None,
|
| 426 |
-
) -> Optional[RegressionResult]:
|
| 427 |
-
"""Détecte une régression du CER entre deux runs.
|
| 428 |
-
|
| 429 |
-
Compare le run le plus récent à une baseline (le run précédent ou
|
| 430 |
-
un run spécifique).
|
| 431 |
-
|
| 432 |
-
Parameters
|
| 433 |
-
----------
|
| 434 |
-
engine:
|
| 435 |
-
Nom du moteur à surveiller.
|
| 436 |
-
corpus:
|
| 437 |
-
Corpus spécifique (None = tous).
|
| 438 |
-
threshold:
|
| 439 |
-
Seuil de régression en points absolus de CER (ex : 0.01 = 1%).
|
| 440 |
-
Si delta_cer > threshold → régression détectée.
|
| 441 |
-
baseline_run_id:
|
| 442 |
-
run_id de référence. Si None, utilise l'avant-dernier run.
|
| 443 |
-
|
| 444 |
-
Returns
|
| 445 |
-
-------
|
| 446 |
-
RegressionResult | None
|
| 447 |
-
None si moins de 2 runs disponibles.
|
| 448 |
-
"""
|
| 449 |
-
entries = self.query(engine=engine, corpus=corpus, limit=1000)
|
| 450 |
-
if len(entries) < 2:
|
| 451 |
-
logger.info("Pas assez de runs pour détecter une régression (moteur=%s)", engine)
|
| 452 |
-
return None
|
| 453 |
-
|
| 454 |
-
current = entries[-1]
|
| 455 |
-
|
| 456 |
-
if baseline_run_id:
|
| 457 |
-
baseline_list = [e for e in entries[:-1] if e.run_id == baseline_run_id]
|
| 458 |
-
baseline = baseline_list[0] if baseline_list else entries[-2]
|
| 459 |
-
else:
|
| 460 |
-
baseline = entries[-2]
|
| 461 |
-
|
| 462 |
-
delta = None
|
| 463 |
-
is_regression = False
|
| 464 |
-
if current.cer_mean is not None and baseline.cer_mean is not None:
|
| 465 |
-
delta = current.cer_mean - baseline.cer_mean
|
| 466 |
-
is_regression = delta > threshold
|
| 467 |
-
|
| 468 |
-
return RegressionResult(
|
| 469 |
-
engine_name=engine,
|
| 470 |
-
corpus_name=corpus or "tous",
|
| 471 |
-
baseline_run_id=baseline.run_id,
|
| 472 |
-
baseline_timestamp=baseline.timestamp,
|
| 473 |
-
baseline_cer=baseline.cer_mean,
|
| 474 |
-
current_run_id=current.run_id,
|
| 475 |
-
current_timestamp=current.timestamp,
|
| 476 |
-
current_cer=current.cer_mean,
|
| 477 |
-
delta_cer=delta,
|
| 478 |
-
is_regression=is_regression,
|
| 479 |
-
threshold=threshold,
|
| 480 |
-
)
|
| 481 |
-
|
| 482 |
-
def detect_all_regressions(
|
| 483 |
-
self,
|
| 484 |
-
threshold: float = 0.01,
|
| 485 |
-
) -> list[RegressionResult]:
|
| 486 |
-
"""Détecte les régressions pour tous les moteurs et corpus connus.
|
| 487 |
-
|
| 488 |
-
Parameters
|
| 489 |
-
----------
|
| 490 |
-
threshold:
|
| 491 |
-
Seuil de régression.
|
| 492 |
-
|
| 493 |
-
Returns
|
| 494 |
-
-------
|
| 495 |
-
list[RegressionResult]
|
| 496 |
-
Uniquement les moteurs où une régression est détectée.
|
| 497 |
-
"""
|
| 498 |
-
results: list[RegressionResult] = []
|
| 499 |
-
engines = self.list_engines()
|
| 500 |
-
corpora = self.list_corpora()
|
| 501 |
-
|
| 502 |
-
for engine in engines:
|
| 503 |
-
for corpus in corpora:
|
| 504 |
-
result = self.detect_regression(engine, corpus, threshold)
|
| 505 |
-
if result and result.is_regression:
|
| 506 |
-
results.append(result)
|
| 507 |
-
|
| 508 |
-
return results
|
| 509 |
-
|
| 510 |
-
# ------------------------------------------------------------------
|
| 511 |
-
# Export
|
| 512 |
-
# ------------------------------------------------------------------
|
| 513 |
-
|
| 514 |
-
def export_json(self, output_path: str) -> Path:
|
| 515 |
-
"""Exporte l'historique complet en JSON.
|
| 516 |
-
|
| 517 |
-
Parameters
|
| 518 |
-
----------
|
| 519 |
-
output_path:
|
| 520 |
-
Chemin du fichier JSON de sortie.
|
| 521 |
-
|
| 522 |
-
Returns
|
| 523 |
-
-------
|
| 524 |
-
Path
|
| 525 |
-
Chemin vers le fichier créé.
|
| 526 |
-
"""
|
| 527 |
-
entries = self.query(limit=100_000)
|
| 528 |
-
path = Path(output_path)
|
| 529 |
-
data = {
|
| 530 |
-
"picarones_history": True,
|
| 531 |
-
"exported_at": datetime.now(timezone.utc).isoformat(),
|
| 532 |
-
"total_runs": len(entries),
|
| 533 |
-
"engines": self.list_engines(),
|
| 534 |
-
"corpora": self.list_corpora(),
|
| 535 |
-
"runs": [e.as_dict() for e in entries],
|
| 536 |
-
}
|
| 537 |
-
path.write_text(json.dumps(data, ensure_ascii=False, indent=2), encoding="utf-8")
|
| 538 |
-
return path
|
| 539 |
-
|
| 540 |
-
def __repr__(self) -> str:
|
| 541 |
-
return f"BenchmarkHistory(db='{self.db_path}', runs={self.count()})"
|
| 542 |
-
|
| 543 |
-
|
| 544 |
-
# ---------------------------------------------------------------------------
|
| 545 |
-
# Données de démonstration longitudinale
|
| 546 |
-
# ---------------------------------------------------------------------------
|
| 547 |
-
|
| 548 |
-
def generate_demo_history(
|
| 549 |
-
db: BenchmarkHistory,
|
| 550 |
-
n_runs: int = 8,
|
| 551 |
-
seed: int = 42,
|
| 552 |
-
) -> None:
|
| 553 |
-
"""Insère des données fictives de suivi longitudinal pour la démo.
|
| 554 |
-
|
| 555 |
-
Simule l'amélioration progressive d'un modèle tesseract sur 8 runs,
|
| 556 |
-
avec une légère régression au run 5.
|
| 557 |
-
|
| 558 |
-
Parameters
|
| 559 |
-
----------
|
| 560 |
-
db:
|
| 561 |
-
Base d'historique à remplir.
|
| 562 |
-
n_runs:
|
| 563 |
-
Nombre de runs à générer.
|
| 564 |
-
seed:
|
| 565 |
-
Graine aléatoire.
|
| 566 |
-
"""
|
| 567 |
-
import random
|
| 568 |
-
rng = random.Random(seed)
|
| 569 |
-
|
| 570 |
-
engines = ["tesseract", "pero_ocr", "ancien_moteur"]
|
| 571 |
-
corpus = "Chroniques médiévales"
|
| 572 |
-
|
| 573 |
-
# Trajectoires de CER simulées (amélioration progressive + bruit)
|
| 574 |
-
base_cers = {
|
| 575 |
-
"tesseract": 0.15,
|
| 576 |
-
"pero_ocr": 0.09,
|
| 577 |
-
"ancien_moteur": 0.28,
|
| 578 |
-
}
|
| 579 |
-
improvements = {
|
| 580 |
-
"tesseract": -0.008, # améliore de ~0.8% par run
|
| 581 |
-
"pero_ocr": -0.005, # améliore de ~0.5% par run
|
| 582 |
-
"ancien_moteur": -0.003,
|
| 583 |
-
}
|
| 584 |
-
|
| 585 |
-
from datetime import timedelta
|
| 586 |
-
base_date = datetime(2024, 9, 1, tzinfo=timezone.utc)
|
| 587 |
-
|
| 588 |
-
for run_idx in range(n_runs):
|
| 589 |
-
run_date = base_date + timedelta(weeks=run_idx * 2)
|
| 590 |
-
run_id = f"demo_run_{run_idx + 1:02d}"
|
| 591 |
-
|
| 592 |
-
for engine in engines:
|
| 593 |
-
cer = base_cers[engine] + improvements[engine] * run_idx
|
| 594 |
-
# Ajouter du bruit + régression au run 5
|
| 595 |
-
noise = rng.gauss(0, 0.005)
|
| 596 |
-
if run_idx == 4 and engine == "tesseract":
|
| 597 |
-
noise += 0.02 # régression simulée
|
| 598 |
-
cer = max(0.01, min(0.5, cer + noise))
|
| 599 |
|
| 600 |
-
|
| 601 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
| 602 |
|
| 603 |
-
|
| 604 |
-
run_id=f"{run_id}_{engine}",
|
| 605 |
-
corpus_name=corpus,
|
| 606 |
-
engine_name=engine,
|
| 607 |
-
cer_mean=round(cer, 4),
|
| 608 |
-
wer_mean=round(wer, 4),
|
| 609 |
-
doc_count=12,
|
| 610 |
-
timestamp=run_date.isoformat(),
|
| 611 |
-
metadata={
|
| 612 |
-
"note": f"Run de démonstration #{run_idx + 1}",
|
| 613 |
-
"engine_version": f"5.{run_idx}.0" if engine == "tesseract" else "0.7.2",
|
| 614 |
-
},
|
| 615 |
-
)
|
|
|
|
| 1 |
+
"""Shim de compatibilité — métrique relocalisée.
|
| 2 |
|
| 3 |
+
Sprint E.5 du plan v2.0 (mai 2026) — module migré depuis
|
| 4 |
+
``picarones.measurements.history`` vers
|
| 5 |
+
``picarones.evaluation.metrics.history`` (couche canonique).
|
| 6 |
+
Ce shim re-exporte l'API publique avec un ``DeprecationWarning``
|
| 7 |
+
et sera supprimé en 2.0.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 8 |
"""
|
| 9 |
|
| 10 |
from __future__ import annotations
|
| 11 |
|
| 12 |
+
import warnings
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 13 |
|
| 14 |
+
warnings.warn(
|
| 15 |
+
"picarones.measurements.history est obsolète et sera supprimé en 2.0. "
|
| 16 |
+
"Utiliser picarones.evaluation.metrics.history à la place.",
|
| 17 |
+
DeprecationWarning,
|
| 18 |
+
stacklevel=2,
|
| 19 |
+
)
|
| 20 |
|
| 21 |
+
from picarones.evaluation.metrics.history import * # noqa: F401, F403, E402
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
@@ -1,360 +1,21 @@
|
|
| 1 |
-
"""
|
| 2 |
|
| 3 |
-
Sprint
|
| 4 |
-
|
| 5 |
-
|
| 6 |
-
-
|
| 7 |
-
|
| 8 |
-
stabilité est méthodologiquement faible. Et un benchmark qui
|
| 9 |
-
ignore le plafond humain (« deux paléographes ne sont pas même
|
| 10 |
-
d'accord ») crée des classements faussement optimistes. Ce
|
| 11 |
-
module livre deux familles complémentaires :
|
| 12 |
-
|
| 13 |
-
1. **Inter-annotator agreement (IAA)** — quand un document a
|
| 14 |
-
plusieurs GT (deux paléographes, par ex.), Cohen κ et
|
| 15 |
-
Krippendorff α mesurent l'accord au niveau caractère.
|
| 16 |
-
Lecture : *« le CER de Pero (4,2 %) approche le plafond
|
| 17 |
-
humain (κ = 0,89). »*
|
| 18 |
-
|
| 19 |
-
2. **Stabilité multi-runs** — quand on relance la même
|
| 20 |
-
pipeline LLM N fois sur les mêmes documents, on mesure :
|
| 21 |
-
variance du CER, taux de tokens divergents entre runs,
|
| 22 |
-
CER pairwise moyen.
|
| 23 |
-
|
| 24 |
-
Périmètre Sprint 83
|
| 25 |
-
-------------------
|
| 26 |
-
**Couche de calcul uniquement** — fonctions pures, pas
|
| 27 |
-
d'intégration runner ni de vue HTML. L'extension du loader
|
| 28 |
-
pour accepter ``doc_001.gt.A.txt`` / ``doc_001.gt.B.txt`` est
|
| 29 |
-
documentée comme dépendance future ; en attendant le sprint
|
| 30 |
-
dédié, on prend deux strings GT en entrée.
|
| 31 |
-
|
| 32 |
-
Méthode
|
| 33 |
-
-------
|
| 34 |
-
*IAA caractère par caractère.* On aligne les deux GT par
|
| 35 |
-
``difflib.SequenceMatcher`` au niveau caractère et on construit
|
| 36 |
-
une table de contingence ``(annotator_a_char, annotator_b_char)``
|
| 37 |
-
sur les positions ``equal`` ou ``replace``. Cohen κ utilise
|
| 38 |
-
cette table directement. Krippendorff α utilise la version
|
| 39 |
-
matricielle (différence binaire pour le mode nominal).
|
| 40 |
-
|
| 41 |
-
*Stabilité multi-runs.* ``compute_multirun_stability(runs)``
|
| 42 |
-
prend une liste de N transcriptions du **même** document et
|
| 43 |
-
renvoie variance/écart-type/coefficient de variation du CER si
|
| 44 |
-
référence fournie ; sinon, taux pairwise de divergence
|
| 45 |
-
(intersection-vs-union des tokens).
|
| 46 |
"""
|
| 47 |
|
| 48 |
from __future__ import annotations
|
| 49 |
|
| 50 |
-
import
|
| 51 |
-
import statistics
|
| 52 |
-
from typing import Optional, Sequence
|
| 53 |
-
|
| 54 |
-
logger = logging.getLogger(__name__)
|
| 55 |
-
|
| 56 |
-
|
| 57 |
-
# ──────────────────────────────────────────────────────────────────────────
|
| 58 |
-
# Helpers d'alignement caractère par caractère
|
| 59 |
-
# ──────────────────────────────────────────────────────────────────────────
|
| 60 |
-
|
| 61 |
-
|
| 62 |
-
def _aligned_char_pairs(
|
| 63 |
-
text_a: str, text_b: str,
|
| 64 |
-
) -> list[tuple[str, str]]:
|
| 65 |
-
"""Aligne ``text_a`` et ``text_b`` caractère par caractère.
|
| 66 |
-
|
| 67 |
-
Retourne la liste des paires alignées sur les segments
|
| 68 |
-
``equal`` et ``replace`` de ``SequenceMatcher`` (les ``insert``
|
| 69 |
-
et ``delete`` sont ignorés — pas d'alignement valide).
|
| 70 |
-
"""
|
| 71 |
-
if not text_a and not text_b:
|
| 72 |
-
return []
|
| 73 |
-
import difflib
|
| 74 |
-
matcher = difflib.SequenceMatcher(None, text_a, text_b, autojunk=False)
|
| 75 |
-
pairs: list[tuple[str, str]] = []
|
| 76 |
-
for tag, i1, i2, j1, j2 in matcher.get_opcodes():
|
| 77 |
-
if tag == "equal":
|
| 78 |
-
for k in range(i2 - i1):
|
| 79 |
-
pairs.append((text_a[i1 + k], text_b[j1 + k]))
|
| 80 |
-
elif tag == "replace":
|
| 81 |
-
paired = min(i2 - i1, j2 - j1)
|
| 82 |
-
for k in range(paired):
|
| 83 |
-
pairs.append((text_a[i1 + k], text_b[j1 + k]))
|
| 84 |
-
# insert/delete : pas d'alignement bilatéral exploitable
|
| 85 |
-
return pairs
|
| 86 |
-
|
| 87 |
-
|
| 88 |
-
__all__: list[str] = []
|
| 89 |
-
|
| 90 |
-
|
| 91 |
-
# ──────────────────────────────────────────────────────────────────────────
|
| 92 |
-
# 1. Cohen's kappa (deux annotateurs, accord nominal)
|
| 93 |
-
# ──────────────────────────────────────────────────────────────────────────
|
| 94 |
-
|
| 95 |
-
|
| 96 |
-
def cohen_kappa(
|
| 97 |
-
annotations_a: Sequence,
|
| 98 |
-
annotations_b: Sequence,
|
| 99 |
-
) -> Optional[float]:
|
| 100 |
-
"""Cohen's κ entre deux annotateurs sur des observations
|
| 101 |
-
appariées.
|
| 102 |
-
|
| 103 |
-
Définition :
|
| 104 |
-
|
| 105 |
-
κ = (po - pe) / (1 - pe)
|
| 106 |
-
|
| 107 |
-
où ``po`` est l'accord observé (proportion de paires égales)
|
| 108 |
-
et ``pe`` l'accord attendu par hasard (somme sur les classes
|
| 109 |
-
de p_a(c) × p_b(c)).
|
| 110 |
-
|
| 111 |
-
Conventions :
|
| 112 |
-
- retourne ``None`` si les deux séquences sont vides ou de
|
| 113 |
-
tailles incompatibles ;
|
| 114 |
-
- κ = 1.0 quand l'accord est parfait, 0.0 quand il égale le
|
| 115 |
-
hasard, négatif si pire que le hasard ;
|
| 116 |
-
- quand ``pe == 1`` (un seul label dans les deux séquences),
|
| 117 |
-
retourne 1.0 si les séquences sont identiques, 0.0 sinon
|
| 118 |
-
(κ est mathématiquement indéfini, on choisit une
|
| 119 |
-
convention transparente documentée).
|
| 120 |
-
"""
|
| 121 |
-
if len(annotations_a) != len(annotations_b):
|
| 122 |
-
return None
|
| 123 |
-
n = len(annotations_a)
|
| 124 |
-
if n == 0:
|
| 125 |
-
return None
|
| 126 |
-
# Accord observé
|
| 127 |
-
agree = sum(1 for a, b in zip(annotations_a, annotations_b) if a == b)
|
| 128 |
-
p_o = agree / n
|
| 129 |
-
# Accord attendu par hasard
|
| 130 |
-
from collections import Counter
|
| 131 |
-
count_a = Counter(annotations_a)
|
| 132 |
-
count_b = Counter(annotations_b)
|
| 133 |
-
classes = set(count_a) | set(count_b)
|
| 134 |
-
p_e = sum(
|
| 135 |
-
(count_a.get(c, 0) / n) * (count_b.get(c, 0) / n)
|
| 136 |
-
for c in classes
|
| 137 |
-
)
|
| 138 |
-
if p_e >= 1.0 - 1e-12:
|
| 139 |
-
# Indéfini ; convention : 1 si identité totale, 0 sinon
|
| 140 |
-
return 1.0 if p_o >= 1.0 - 1e-12 else 0.0
|
| 141 |
-
return (p_o - p_e) / (1.0 - p_e)
|
| 142 |
-
|
| 143 |
-
|
| 144 |
-
__all__.append("cohen_kappa")
|
| 145 |
-
|
| 146 |
-
|
| 147 |
-
# ──────────────────────────────────────────────────────────────────────────
|
| 148 |
-
# 2. Krippendorff's alpha (généralisation à N annotateurs)
|
| 149 |
-
# ──────────────────────────────────────────────────────────────────────────
|
| 150 |
-
|
| 151 |
-
|
| 152 |
-
def krippendorff_alpha(
|
| 153 |
-
annotations_per_unit: Sequence[Sequence],
|
| 154 |
-
) -> Optional[float]:
|
| 155 |
-
"""Krippendorff's α en mode nominal pour N annotateurs.
|
| 156 |
-
|
| 157 |
-
Parameters
|
| 158 |
-
----------
|
| 159 |
-
annotations_per_unit:
|
| 160 |
-
Liste d'unités, chaque unité étant la liste des
|
| 161 |
-
annotations produites par les différents annotateurs sur
|
| 162 |
-
cette unité. ``None`` dans une cellule = annotation
|
| 163 |
-
manquante (autorisée).
|
| 164 |
-
|
| 165 |
-
Définition (Krippendorff 1980, équation pour métrique
|
| 166 |
-
nominale) :
|
| 167 |
-
|
| 168 |
-
α = 1 - D_o / D_e
|
| 169 |
-
|
| 170 |
-
où ``D_o`` est le désaccord observé (paires en désaccord
|
| 171 |
-
intra-unité, normalisées) et ``D_e`` le désaccord attendu
|
| 172 |
-
par hasard. ``α = 1`` accord parfait, ``α = 0`` hasard,
|
| 173 |
-
négatif si pire.
|
| 174 |
-
|
| 175 |
-
Conventions :
|
| 176 |
-
- unités avec moins de 2 annotations valides : ignorées
|
| 177 |
-
(Krippendorff convention) ;
|
| 178 |
-
- retourne ``None`` si moins d'une unité utilisable ou
|
| 179 |
-
``D_e == 0`` (un seul label dans tout le corpus).
|
| 180 |
-
"""
|
| 181 |
-
from collections import Counter
|
| 182 |
-
# Valeurs observées au niveau corpus
|
| 183 |
-
value_counts: Counter = Counter()
|
| 184 |
-
pair_disagree = 0.0
|
| 185 |
-
pair_total = 0.0
|
| 186 |
-
for unit in annotations_per_unit:
|
| 187 |
-
valid = [v for v in unit if v is not None]
|
| 188 |
-
m = len(valid)
|
| 189 |
-
if m < 2:
|
| 190 |
-
continue
|
| 191 |
-
# paires intra-unité (sans repetition, ordonné)
|
| 192 |
-
for i in range(m):
|
| 193 |
-
for j in range(m):
|
| 194 |
-
if i == j:
|
| 195 |
-
continue
|
| 196 |
-
pair_total += 1.0 / (m - 1)
|
| 197 |
-
if valid[i] != valid[j]:
|
| 198 |
-
pair_disagree += 1.0 / (m - 1)
|
| 199 |
-
for v in valid:
|
| 200 |
-
value_counts[v] += 1
|
| 201 |
-
if pair_total == 0:
|
| 202 |
-
return None
|
| 203 |
-
n_total = sum(value_counts.values())
|
| 204 |
-
if n_total < 2:
|
| 205 |
-
return None
|
| 206 |
-
# Désaccord attendu (sur paires aléatoires sans remise)
|
| 207 |
-
expected_disagree = 0.0
|
| 208 |
-
for v_a, c_a in value_counts.items():
|
| 209 |
-
for v_b, c_b in value_counts.items():
|
| 210 |
-
if v_a != v_b:
|
| 211 |
-
expected_disagree += c_a * c_b
|
| 212 |
-
expected_disagree /= n_total * (n_total - 1)
|
| 213 |
-
if expected_disagree <= 1e-12:
|
| 214 |
-
return None
|
| 215 |
-
d_o = pair_disagree / pair_total
|
| 216 |
-
return 1.0 - (d_o / expected_disagree)
|
| 217 |
-
|
| 218 |
-
|
| 219 |
-
__all__.append("krippendorff_alpha")
|
| 220 |
-
|
| 221 |
-
|
| 222 |
-
# ──────────────────────────────────────────────────────────────────────────
|
| 223 |
-
# 3. Helpers IAA caractère
|
| 224 |
-
# ──────────────────────────────────────────────────────────────────────────
|
| 225 |
-
|
| 226 |
-
|
| 227 |
-
def compute_iaa(
|
| 228 |
-
transcription_a: str,
|
| 229 |
-
transcription_b: str,
|
| 230 |
-
) -> Optional[dict]:
|
| 231 |
-
"""Calcule κ et α au niveau caractère entre deux
|
| 232 |
-
transcriptions du même document.
|
| 233 |
-
|
| 234 |
-
Aligne via ``_aligned_char_pairs`` puis :
|
| 235 |
-
- κ : sur la liste des paires alignées ;
|
| 236 |
-
- α : sur les unités à 2 annotations (équivalent à κ sur ce
|
| 237 |
-
cas, mais le cadre généralise à N annotateurs).
|
| 238 |
-
|
| 239 |
-
Retourne ``None`` si pas d'alignement possible (transcriptions
|
| 240 |
-
vides ou totalement disjointes).
|
| 241 |
-
"""
|
| 242 |
-
pairs = _aligned_char_pairs(transcription_a, transcription_b)
|
| 243 |
-
if not pairs:
|
| 244 |
-
return None
|
| 245 |
-
kappa = cohen_kappa([a for a, _ in pairs], [b for _, b in pairs])
|
| 246 |
-
alpha = krippendorff_alpha([[a, b] for a, b in pairs])
|
| 247 |
-
return {
|
| 248 |
-
"n_aligned_chars": len(pairs),
|
| 249 |
-
"cohen_kappa": kappa,
|
| 250 |
-
"krippendorff_alpha": alpha,
|
| 251 |
-
"agreement_rate": (
|
| 252 |
-
sum(1 for a, b in pairs if a == b) / len(pairs)
|
| 253 |
-
),
|
| 254 |
-
}
|
| 255 |
-
|
| 256 |
-
|
| 257 |
-
__all__.append("compute_iaa")
|
| 258 |
-
|
| 259 |
-
|
| 260 |
-
# ──────────────────────────────────────────────────────────────────────────
|
| 261 |
-
# 4. Stabilité multi-runs (variance CER, divergence pairwise)
|
| 262 |
-
# ──────────────────────────────────────────────────────────────────────────
|
| 263 |
-
|
| 264 |
-
|
| 265 |
-
def _split_words(text: str) -> list[str]:
|
| 266 |
-
return text.split() if text else []
|
| 267 |
-
|
| 268 |
-
|
| 269 |
-
def compute_multirun_stability(
|
| 270 |
-
runs: Sequence[str],
|
| 271 |
-
*,
|
| 272 |
-
reference: Optional[str] = None,
|
| 273 |
-
) -> Optional[dict]:
|
| 274 |
-
"""Mesure la stabilité de N runs successifs d'une même
|
| 275 |
-
pipeline (typiquement LLM/VLM non déterministe) sur un
|
| 276 |
-
document.
|
| 277 |
-
|
| 278 |
-
Parameters
|
| 279 |
-
----------
|
| 280 |
-
runs:
|
| 281 |
-
Liste des transcriptions produites à chaque run (≥ 2).
|
| 282 |
-
reference:
|
| 283 |
-
Transcription de référence (GT). Si fournie, on calcule
|
| 284 |
-
``cer_per_run``, leur variance et leur coefficient de
|
| 285 |
-
variation.
|
| 286 |
-
|
| 287 |
-
Returns
|
| 288 |
-
-------
|
| 289 |
-
dict | None
|
| 290 |
-
``{
|
| 291 |
-
"n_runs": int,
|
| 292 |
-
"pairwise_disagreement_mean": float, # divergence moyenne
|
| 293 |
-
"pairwise_disagreement_max": float,
|
| 294 |
-
"identical_run_rate": float, # paires identiques / total
|
| 295 |
-
"cer_per_run": Optional[list[float]],
|
| 296 |
-
"cer_mean": Optional[float],
|
| 297 |
-
"cer_stdev": Optional[float],
|
| 298 |
-
"cer_cv": Optional[float], # cv = stdev / mean
|
| 299 |
-
"n_distinct_outputs": int,
|
| 300 |
-
}``
|
| 301 |
-
ou ``None`` si moins de 2 runs.
|
| 302 |
-
"""
|
| 303 |
-
if len(runs) < 2:
|
| 304 |
-
return None
|
| 305 |
-
runs_list = list(runs)
|
| 306 |
-
# Divergence pairwise (token-level Jaccard distance)
|
| 307 |
-
n = len(runs_list)
|
| 308 |
-
n_pairs = 0
|
| 309 |
-
sum_disagree = 0.0
|
| 310 |
-
max_disagree = 0.0
|
| 311 |
-
n_identical = 0
|
| 312 |
-
for i in range(n):
|
| 313 |
-
for j in range(i + 1, n):
|
| 314 |
-
n_pairs += 1
|
| 315 |
-
tokens_i = set(_split_words(runs_list[i]))
|
| 316 |
-
tokens_j = set(_split_words(runs_list[j]))
|
| 317 |
-
union = tokens_i | tokens_j
|
| 318 |
-
if not union:
|
| 319 |
-
disagree = 0.0
|
| 320 |
-
else:
|
| 321 |
-
disagree = 1.0 - len(tokens_i & tokens_j) / len(union)
|
| 322 |
-
sum_disagree += disagree
|
| 323 |
-
if disagree > max_disagree:
|
| 324 |
-
max_disagree = disagree
|
| 325 |
-
if runs_list[i] == runs_list[j]:
|
| 326 |
-
n_identical += 1
|
| 327 |
-
pairwise_mean = sum_disagree / n_pairs if n_pairs else 0.0
|
| 328 |
-
identical_rate = n_identical / n_pairs if n_pairs else 0.0
|
| 329 |
-
distinct = len(set(runs_list))
|
| 330 |
-
|
| 331 |
-
cer_per_run: Optional[list[float]] = None
|
| 332 |
-
cer_mean: Optional[float] = None
|
| 333 |
-
cer_stdev: Optional[float] = None
|
| 334 |
-
cer_cv: Optional[float] = None
|
| 335 |
-
if reference is not None:
|
| 336 |
-
from picarones.evaluation.metrics.text_metrics import _cer_from_strings
|
| 337 |
-
cer_per_run = [_cer_from_strings(reference, r) for r in runs_list]
|
| 338 |
-
cer_per_run = [v for v in cer_per_run if v is not None]
|
| 339 |
-
if cer_per_run:
|
| 340 |
-
cer_mean = statistics.fmean(cer_per_run)
|
| 341 |
-
if len(cer_per_run) >= 2:
|
| 342 |
-
cer_stdev = statistics.stdev(cer_per_run)
|
| 343 |
-
cer_cv = (
|
| 344 |
-
cer_stdev / cer_mean if cer_mean and cer_mean > 0
|
| 345 |
-
else None
|
| 346 |
-
)
|
| 347 |
-
return {
|
| 348 |
-
"n_runs": n,
|
| 349 |
-
"pairwise_disagreement_mean": pairwise_mean,
|
| 350 |
-
"pairwise_disagreement_max": max_disagree,
|
| 351 |
-
"identical_run_rate": identical_rate,
|
| 352 |
-
"n_distinct_outputs": distinct,
|
| 353 |
-
"cer_per_run": cer_per_run,
|
| 354 |
-
"cer_mean": cer_mean,
|
| 355 |
-
"cer_stdev": cer_stdev,
|
| 356 |
-
"cer_cv": cer_cv,
|
| 357 |
-
}
|
| 358 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 359 |
|
| 360 |
-
|
|
|
|
| 1 |
+
"""Shim de compatibilité — métrique relocalisée.
|
| 2 |
|
| 3 |
+
Sprint E.5 du plan v2.0 (mai 2026) — module migré depuis
|
| 4 |
+
``picarones.measurements.reliability`` vers
|
| 5 |
+
``picarones.evaluation.metrics.reliability`` (couche canonique).
|
| 6 |
+
Ce shim re-exporte l'API publique avec un ``DeprecationWarning``
|
| 7 |
+
et sera supprimé en 2.0.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 8 |
"""
|
| 9 |
|
| 10 |
from __future__ import annotations
|
| 11 |
|
| 12 |
+
import warnings
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 13 |
|
| 14 |
+
warnings.warn(
|
| 15 |
+
"picarones.measurements.reliability est obsolète et sera supprimé en 2.0. "
|
| 16 |
+
"Utiliser picarones.evaluation.metrics.reliability à la place.",
|
| 17 |
+
DeprecationWarning,
|
| 18 |
+
stacklevel=2,
|
| 19 |
+
)
|
| 20 |
|
| 21 |
+
from picarones.evaluation.metrics.reliability import * # noqa: F401, F403, E402
|
|
@@ -1,731 +1,21 @@
|
|
| 1 |
-
"""
|
| 2 |
|
| 3 |
-
|
| 4 |
-
|
| 5 |
-
|
| 6 |
-
|
| 7 |
-
|
| 8 |
-
- Rotation (angle croissant)
|
| 9 |
-
- Réduction de résolution (facteur de downscaling)
|
| 10 |
-
- Binarisation (seuillage Otsu ou fixe)
|
| 11 |
-
2. Exécution du moteur OCR sur chaque version dégradée
|
| 12 |
-
3. Calcul du CER pour chaque niveau de dégradation
|
| 13 |
-
4. Génération de courbes de robustesse (CER en fonction du niveau)
|
| 14 |
-
5. Identification du seuil critique (niveau à partir duquel CER > seuil)
|
| 15 |
-
|
| 16 |
-
Usage
|
| 17 |
-
-----
|
| 18 |
-
>>> from picarones.measurements.robustness import RobustnessAnalyzer
|
| 19 |
-
>>> analyzer = RobustnessAnalyzer(engine, degradation_types=["noise", "blur"])
|
| 20 |
-
>>> report = analyzer.analyze(corpus)
|
| 21 |
-
>>> print(report.critical_thresholds)
|
| 22 |
"""
|
| 23 |
|
| 24 |
from __future__ import annotations
|
| 25 |
|
| 26 |
-
import
|
| 27 |
-
import math
|
| 28 |
-
import os
|
| 29 |
-
import tempfile
|
| 30 |
-
from dataclasses import dataclass, field
|
| 31 |
-
from pathlib import Path
|
| 32 |
-
from typing import TYPE_CHECKING, Optional
|
| 33 |
-
|
| 34 |
-
if TYPE_CHECKING:
|
| 35 |
-
from picarones.evaluation.corpus import Corpus, Document
|
| 36 |
-
from picarones.adapters.legacy_engines.base import BaseOCREngine
|
| 37 |
-
|
| 38 |
-
logger = logging.getLogger(__name__)
|
| 39 |
-
|
| 40 |
-
|
| 41 |
-
# ---------------------------------------------------------------------------
|
| 42 |
-
# Paramètres de dégradation
|
| 43 |
-
# ---------------------------------------------------------------------------
|
| 44 |
-
|
| 45 |
-
# Niveaux de dégradation pour chaque type
|
| 46 |
-
DEGRADATION_LEVELS: dict[str, list] = {
|
| 47 |
-
"noise": [0, 5, 15, 30, 50, 80], # sigma du bruit gaussien
|
| 48 |
-
"blur": [0, 1, 2, 3, 5, 8], # rayon du flou gaussien (pixels)
|
| 49 |
-
"rotation": [0, 1, 2, 5, 10, 20], # angle de rotation (degrés)
|
| 50 |
-
"resolution": [1.0, 0.75, 0.5, 0.33, 0.25, 0.1], # facteur de résolution
|
| 51 |
-
"binarization": [0, 64, 96, 128, 160, 192], # seuil de binarisation (0 = Otsu)
|
| 52 |
-
}
|
| 53 |
-
|
| 54 |
-
DEGRADATION_LABELS: dict[str, list[str]] = {
|
| 55 |
-
"noise": ["original", "σ=5", "σ=15", "σ=30", "σ=50", "σ=80"],
|
| 56 |
-
"blur": ["original", "r=1", "r=2", "r=3", "r=5", "r=8"],
|
| 57 |
-
"rotation": ["0°", "1°", "2°", "5°", "10°", "20°"],
|
| 58 |
-
"resolution": ["100%", "75%", "50%", "33%", "25%", "10%"],
|
| 59 |
-
"binarization": ["original", "seuil=64", "seuil=96", "seuil=128", "seuil=160", "seuil=192"],
|
| 60 |
-
}
|
| 61 |
-
|
| 62 |
-
ALL_DEGRADATION_TYPES = list(DEGRADATION_LEVELS.keys())
|
| 63 |
-
|
| 64 |
-
|
| 65 |
-
# ---------------------------------------------------------------------------
|
| 66 |
-
# Dégradation d'image (pure Python + stdlib, optionnellement Pillow/NumPy)
|
| 67 |
-
# ---------------------------------------------------------------------------
|
| 68 |
-
|
| 69 |
-
def _apply_gaussian_noise(pixels: list[list[list[int]]], sigma: float, rng_seed: int = 0) -> list[list[list[int]]]:
|
| 70 |
-
"""Applique du bruit gaussien (pure Python)."""
|
| 71 |
-
import random
|
| 72 |
-
rng = random.Random(rng_seed)
|
| 73 |
-
h = len(pixels)
|
| 74 |
-
w = len(pixels[0]) if h > 0 else 0
|
| 75 |
-
result = []
|
| 76 |
-
for y in range(h):
|
| 77 |
-
row = []
|
| 78 |
-
for x in range(w):
|
| 79 |
-
pixel = []
|
| 80 |
-
for c in pixels[y][x]:
|
| 81 |
-
noise = rng.gauss(0, sigma)
|
| 82 |
-
val = int(c + noise)
|
| 83 |
-
pixel.append(max(0, min(255, val)))
|
| 84 |
-
row.append(pixel)
|
| 85 |
-
result.append(row)
|
| 86 |
-
return result
|
| 87 |
-
|
| 88 |
-
|
| 89 |
-
def _apply_box_blur(pixels: list[list[list[int]]], radius: int) -> list[list[list[int]]]:
|
| 90 |
-
"""Applique un flou de boîte (approximation du flou gaussien, pure Python)."""
|
| 91 |
-
if radius <= 0:
|
| 92 |
-
return pixels
|
| 93 |
-
h = len(pixels)
|
| 94 |
-
w = len(pixels[0]) if h > 0 else 0
|
| 95 |
-
channels = len(pixels[0][0]) if h > 0 and w > 0 else 3
|
| 96 |
-
|
| 97 |
-
def blur_pass(data: list[list[list[int]]]) -> list[list[list[int]]]:
|
| 98 |
-
out = []
|
| 99 |
-
for y in range(h):
|
| 100 |
-
row = []
|
| 101 |
-
for x in range(w):
|
| 102 |
-
totals = [0] * channels
|
| 103 |
-
count = 0
|
| 104 |
-
for dy in range(-radius, radius + 1):
|
| 105 |
-
for dx in range(-radius, radius + 1):
|
| 106 |
-
ny, nx = y + dy, x + dx
|
| 107 |
-
if 0 <= ny < h and 0 <= nx < w:
|
| 108 |
-
for c in range(channels):
|
| 109 |
-
totals[c] += data[ny][nx][c]
|
| 110 |
-
count += 1
|
| 111 |
-
row.append([t // count for t in totals])
|
| 112 |
-
out.append(row)
|
| 113 |
-
return out
|
| 114 |
-
|
| 115 |
-
return blur_pass(pixels)
|
| 116 |
-
|
| 117 |
-
|
| 118 |
-
def _apply_rotation_simple(pixels: list[list[list[int]]], angle_deg: float) -> list[list[list[int]]]:
|
| 119 |
-
"""Rotation avec interpolation au plus proche voisin (pure Python).
|
| 120 |
-
|
| 121 |
-
Pour des angles faibles, l'effet est réaliste.
|
| 122 |
-
"""
|
| 123 |
-
if angle_deg == 0:
|
| 124 |
-
return pixels
|
| 125 |
-
h = len(pixels)
|
| 126 |
-
w = len(pixels[0]) if h > 0 else 0
|
| 127 |
-
channels = len(pixels[0][0]) if h > 0 and w > 0 else 3
|
| 128 |
-
|
| 129 |
-
angle_rad = math.radians(angle_deg)
|
| 130 |
-
cos_a = math.cos(angle_rad)
|
| 131 |
-
sin_a = math.sin(angle_rad)
|
| 132 |
-
cx, cy = w / 2, h / 2
|
| 133 |
-
|
| 134 |
-
result = [[[245, 240, 232][:channels] for _ in range(w)] for _ in range(h)]
|
| 135 |
-
for y in range(h):
|
| 136 |
-
for x in range(w):
|
| 137 |
-
# Coordonnées source
|
| 138 |
-
sx = cos_a * (x - cx) + sin_a * (y - cy) + cx
|
| 139 |
-
sy = -sin_a * (x - cx) + cos_a * (y - cy) + cy
|
| 140 |
-
ix, iy = int(round(sx)), int(round(sy))
|
| 141 |
-
if 0 <= ix < w and 0 <= iy < h:
|
| 142 |
-
result[y][x] = list(pixels[iy][ix])
|
| 143 |
-
return result
|
| 144 |
-
|
| 145 |
-
|
| 146 |
-
def _apply_resolution_reduction(
|
| 147 |
-
pixels: list[list[list[int]]], factor: float
|
| 148 |
-
) -> list[list[list[int]]]:
|
| 149 |
-
"""Réduit la résolution puis remonte à la taille originale (pixelisation)."""
|
| 150 |
-
if factor >= 1.0:
|
| 151 |
-
return pixels
|
| 152 |
-
h = len(pixels)
|
| 153 |
-
w = len(pixels[0]) if h > 0 else 0
|
| 154 |
-
new_h = max(1, int(h * factor))
|
| 155 |
-
new_w = max(1, int(w * factor))
|
| 156 |
-
|
| 157 |
-
# Downscale
|
| 158 |
-
small = []
|
| 159 |
-
for y in range(new_h):
|
| 160 |
-
row = []
|
| 161 |
-
src_y = int(y / factor)
|
| 162 |
-
for x in range(new_w):
|
| 163 |
-
src_x = int(x / factor)
|
| 164 |
-
row.append(list(pixels[min(src_y, h - 1)][min(src_x, w - 1)]))
|
| 165 |
-
small.append(row)
|
| 166 |
-
|
| 167 |
-
# Upscale (nearest-neighbor)
|
| 168 |
-
result = []
|
| 169 |
-
for y in range(h):
|
| 170 |
-
row = []
|
| 171 |
-
src_y = min(int(y * factor), new_h - 1)
|
| 172 |
-
for x in range(w):
|
| 173 |
-
src_x = min(int(x * factor), new_w - 1)
|
| 174 |
-
row.append(list(small[src_y][src_x]))
|
| 175 |
-
result.append(row)
|
| 176 |
-
return result
|
| 177 |
-
|
| 178 |
-
|
| 179 |
-
def _apply_binarization(
|
| 180 |
-
pixels: list[list[list[int]]], threshold: int
|
| 181 |
-
) -> list[list[list[int]]]:
|
| 182 |
-
"""Binarise l'image (seuillage fixe sur luminosité)."""
|
| 183 |
-
h = len(pixels)
|
| 184 |
-
w = len(pixels[0]) if h > 0 else 0
|
| 185 |
-
result = []
|
| 186 |
-
|
| 187 |
-
# Calculer le seuil Otsu si threshold == 0
|
| 188 |
-
if threshold == 0:
|
| 189 |
-
histogram = [0] * 256
|
| 190 |
-
total = h * w
|
| 191 |
-
for y in range(h):
|
| 192 |
-
for x in range(w):
|
| 193 |
-
p = pixels[y][x]
|
| 194 |
-
lum = int(0.299 * p[0] + 0.587 * p[1] + 0.114 * p[2]) if len(p) >= 3 else p[0]
|
| 195 |
-
histogram[lum] += 1
|
| 196 |
-
# Otsu simplifié
|
| 197 |
-
best_thresh = 128
|
| 198 |
-
best_var = -1.0
|
| 199 |
-
total_sum = sum(i * histogram[i] for i in range(256))
|
| 200 |
-
w0, w1, sum0 = 0, total, 0.0
|
| 201 |
-
for t in range(256):
|
| 202 |
-
w0 += histogram[t]
|
| 203 |
-
if w0 == 0:
|
| 204 |
-
continue
|
| 205 |
-
w1 = total - w0
|
| 206 |
-
if w1 == 0:
|
| 207 |
-
break
|
| 208 |
-
sum0 += t * histogram[t]
|
| 209 |
-
mean0 = sum0 / w0
|
| 210 |
-
mean1 = (total_sum - sum0) / w1
|
| 211 |
-
var = w0 * w1 * (mean0 - mean1) ** 2
|
| 212 |
-
if var > best_var:
|
| 213 |
-
best_var = var
|
| 214 |
-
best_thresh = t
|
| 215 |
-
threshold = best_thresh
|
| 216 |
-
|
| 217 |
-
for y in range(h):
|
| 218 |
-
row = []
|
| 219 |
-
for x in range(w):
|
| 220 |
-
p = pixels[y][x]
|
| 221 |
-
lum = int(0.299 * p[0] + 0.587 * p[1] + 0.114 * p[2]) if len(p) >= 3 else p[0]
|
| 222 |
-
val = 255 if lum >= threshold else 0
|
| 223 |
-
row.append([val] * len(p))
|
| 224 |
-
result.append(row)
|
| 225 |
-
return result
|
| 226 |
-
|
| 227 |
-
|
| 228 |
-
def degrade_image_bytes(
|
| 229 |
-
png_bytes: bytes,
|
| 230 |
-
degradation_type: str,
|
| 231 |
-
level: float,
|
| 232 |
-
) -> bytes:
|
| 233 |
-
"""Dégrade une image PNG et retourne les bytes PNG modifiés.
|
| 234 |
-
|
| 235 |
-
Utilise Pillow si disponible, sinon utilise l'implémentation pure Python.
|
| 236 |
-
|
| 237 |
-
Parameters
|
| 238 |
-
----------
|
| 239 |
-
png_bytes:
|
| 240 |
-
Bytes de l'image PNG source.
|
| 241 |
-
degradation_type:
|
| 242 |
-
Type de dégradation (``"noise"``, ``"blur"``, ``"rotation"``,
|
| 243 |
-
``"resolution"``, ``"binarization"``).
|
| 244 |
-
level:
|
| 245 |
-
Niveau de dégradation (valeur numérique selon le type).
|
| 246 |
-
|
| 247 |
-
Returns
|
| 248 |
-
-------
|
| 249 |
-
bytes
|
| 250 |
-
Bytes de l'image PNG dégradée.
|
| 251 |
-
"""
|
| 252 |
-
try:
|
| 253 |
-
return _degrade_pillow(png_bytes, degradation_type, level)
|
| 254 |
-
except ImportError:
|
| 255 |
-
return _degrade_pure_python(png_bytes, degradation_type, level)
|
| 256 |
-
|
| 257 |
-
|
| 258 |
-
def _degrade_pillow(png_bytes: bytes, degradation_type: str, level: float) -> bytes:
|
| 259 |
-
"""Dégradation avec Pillow (meilleure qualité)."""
|
| 260 |
-
import io
|
| 261 |
-
from PIL import Image, ImageFilter
|
| 262 |
-
|
| 263 |
-
img = Image.open(io.BytesIO(png_bytes)).convert("RGB")
|
| 264 |
-
|
| 265 |
-
if degradation_type == "noise":
|
| 266 |
-
if level > 0:
|
| 267 |
-
import random
|
| 268 |
-
# RGB : 3 octets par pixel, tobytes() reste stable Pillow 10 → 14+
|
| 269 |
-
raw = img.tobytes()
|
| 270 |
-
rng = random.Random(0)
|
| 271 |
-
noisy = []
|
| 272 |
-
for i in range(0, len(raw), 3):
|
| 273 |
-
r, g, b = raw[i], raw[i + 1], raw[i + 2]
|
| 274 |
-
noisy.append((
|
| 275 |
-
max(0, min(255, int(r + rng.gauss(0, level)))),
|
| 276 |
-
max(0, min(255, int(g + rng.gauss(0, level)))),
|
| 277 |
-
max(0, min(255, int(b + rng.gauss(0, level)))),
|
| 278 |
-
))
|
| 279 |
-
img.putdata(noisy)
|
| 280 |
-
|
| 281 |
-
elif degradation_type == "blur":
|
| 282 |
-
if level > 0:
|
| 283 |
-
img = img.filter(ImageFilter.GaussianBlur(radius=level))
|
| 284 |
-
|
| 285 |
-
elif degradation_type == "rotation":
|
| 286 |
-
if level != 0:
|
| 287 |
-
img = img.rotate(-level, expand=False, fillcolor=(245, 240, 232))
|
| 288 |
-
|
| 289 |
-
elif degradation_type == "resolution":
|
| 290 |
-
if level < 1.0:
|
| 291 |
-
w, h = img.size
|
| 292 |
-
new_w, new_h = max(1, int(w * level)), max(1, int(h * level))
|
| 293 |
-
img = img.resize((new_w, new_h), Image.NEAREST)
|
| 294 |
-
img = img.resize((w, h), Image.NEAREST)
|
| 295 |
-
|
| 296 |
-
elif degradation_type == "binarization":
|
| 297 |
-
img = img.convert("L") # niveaux de gris
|
| 298 |
-
if level == 0:
|
| 299 |
-
# Seuillage Otsu : calcul du seuil optimal
|
| 300 |
-
histogram = img.histogram()
|
| 301 |
-
total = img.size[0] * img.size[1]
|
| 302 |
-
best_thresh, best_var = 128, -1.0
|
| 303 |
-
total_sum = sum(i * histogram[i] for i in range(256))
|
| 304 |
-
w0, sum0 = 0, 0.0
|
| 305 |
-
for t in range(256):
|
| 306 |
-
w0 += histogram[t]
|
| 307 |
-
if w0 == 0:
|
| 308 |
-
continue
|
| 309 |
-
w1 = total - w0
|
| 310 |
-
if w1 == 0:
|
| 311 |
-
break
|
| 312 |
-
sum0 += t * histogram[t]
|
| 313 |
-
var = w0 * w1 * (sum0 / w0 - (total_sum - sum0) / w1) ** 2
|
| 314 |
-
if var > best_var:
|
| 315 |
-
best_var = var
|
| 316 |
-
best_thresh = t
|
| 317 |
-
threshold = best_thresh
|
| 318 |
-
else:
|
| 319 |
-
threshold = int(level)
|
| 320 |
-
img = img.point(lambda p: 255 if p >= threshold else 0, "1").convert("RGB")
|
| 321 |
-
|
| 322 |
-
buf = io.BytesIO()
|
| 323 |
-
img.save(buf, format="PNG")
|
| 324 |
-
return buf.getvalue()
|
| 325 |
-
|
| 326 |
-
|
| 327 |
-
def _degrade_pure_python(png_bytes: bytes, degradation_type: str, level: float) -> bytes:
|
| 328 |
-
"""Dégradation en pur Python (sans Pillow).
|
| 329 |
-
|
| 330 |
-
Décode le PNG, applique la transformation, ré-encode en PNG.
|
| 331 |
-
Note : n'implémente pas le décodage PNG complet — utilise des stubs.
|
| 332 |
-
"""
|
| 333 |
-
# Pour l'implémentation pure Python, on applique des transformations
|
| 334 |
-
# minimales sur les bytes bruts en créant une image de test synthétique.
|
| 335 |
-
# En pratique, Pillow est presque toujours disponible dans l'environnement Picarones.
|
| 336 |
-
logger.warning(
|
| 337 |
-
"Pillow non disponible : dégradation '%s' appliquée en mode dégradé (stub)",
|
| 338 |
-
degradation_type,
|
| 339 |
-
)
|
| 340 |
-
# Retourner l'image originale légèrement modifiée (simulation)
|
| 341 |
-
return png_bytes
|
| 342 |
-
|
| 343 |
-
|
| 344 |
-
# ---------------------------------------------------------------------------
|
| 345 |
-
# Structures de résultats
|
| 346 |
-
# ---------------------------------------------------------------------------
|
| 347 |
-
|
| 348 |
-
@dataclass
|
| 349 |
-
class DegradationCurve:
|
| 350 |
-
"""Courbe CER vs niveau de dégradation pour un moteur et un type de dégradation."""
|
| 351 |
-
engine_name: str
|
| 352 |
-
degradation_type: str
|
| 353 |
-
levels: list[float]
|
| 354 |
-
labels: list[str]
|
| 355 |
-
cer_values: list[Optional[float]]
|
| 356 |
-
"""CER moyen (0-1) à chaque niveau. None si calcul impossible."""
|
| 357 |
-
critical_threshold_level: Optional[float] = None
|
| 358 |
-
"""Niveau à partir duquel CER > cer_threshold."""
|
| 359 |
-
cer_threshold: float = 0.20
|
| 360 |
-
"""Seuil de CER utilisé pour déterminer le niveau critique."""
|
| 361 |
-
|
| 362 |
-
def as_dict(self) -> dict:
|
| 363 |
-
return {
|
| 364 |
-
"engine_name": self.engine_name,
|
| 365 |
-
"degradation_type": self.degradation_type,
|
| 366 |
-
"levels": self.levels,
|
| 367 |
-
"labels": self.labels,
|
| 368 |
-
"cer_values": self.cer_values,
|
| 369 |
-
"critical_threshold_level": self.critical_threshold_level,
|
| 370 |
-
"cer_threshold": self.cer_threshold,
|
| 371 |
-
}
|
| 372 |
-
|
| 373 |
-
|
| 374 |
-
@dataclass
|
| 375 |
-
class RobustnessReport:
|
| 376 |
-
"""Rapport complet d'analyse de robustesse pour un ou plusieurs moteurs."""
|
| 377 |
-
engine_names: list[str]
|
| 378 |
-
corpus_name: str
|
| 379 |
-
degradation_types: list[str]
|
| 380 |
-
curves: list[DegradationCurve]
|
| 381 |
-
summary: dict = field(default_factory=dict)
|
| 382 |
-
"""Résumé : moteur le plus robuste par type de dégradation, seuils critiques…"""
|
| 383 |
-
|
| 384 |
-
def get_curves_for_engine(self, engine_name: str) -> list[DegradationCurve]:
|
| 385 |
-
return [c for c in self.curves if c.engine_name == engine_name]
|
| 386 |
-
|
| 387 |
-
def get_curves_for_type(self, degradation_type: str) -> list[DegradationCurve]:
|
| 388 |
-
return [c for c in self.curves if c.degradation_type == degradation_type]
|
| 389 |
-
|
| 390 |
-
def as_dict(self) -> dict:
|
| 391 |
-
return {
|
| 392 |
-
"engine_names": self.engine_names,
|
| 393 |
-
"corpus_name": self.corpus_name,
|
| 394 |
-
"degradation_types": self.degradation_types,
|
| 395 |
-
"curves": [c.as_dict() for c in self.curves],
|
| 396 |
-
"summary": self.summary,
|
| 397 |
-
}
|
| 398 |
-
|
| 399 |
-
|
| 400 |
-
# ---------------------------------------------------------------------------
|
| 401 |
-
# Analyseur de robustesse
|
| 402 |
-
# ---------------------------------------------------------------------------
|
| 403 |
-
|
| 404 |
-
class RobustnessAnalyzer:
|
| 405 |
-
"""Lance une analyse de robustesse sur un corpus.
|
| 406 |
-
|
| 407 |
-
Parameters
|
| 408 |
-
----------
|
| 409 |
-
engines:
|
| 410 |
-
Un ou plusieurs moteurs OCR (``BaseOCREngine``).
|
| 411 |
-
degradation_types:
|
| 412 |
-
Liste des types de dégradation à tester.
|
| 413 |
-
Par défaut : tous (``"noise"``, ``"blur"``, ``"rotation"``,
|
| 414 |
-
``"resolution"``, ``"binarization"``).
|
| 415 |
-
cer_threshold:
|
| 416 |
-
Seuil de CER pour définir le niveau critique (défaut : 0.20 = 20%).
|
| 417 |
-
custom_levels:
|
| 418 |
-
Niveaux personnalisés par type (remplace les valeurs par défaut).
|
| 419 |
-
|
| 420 |
-
Examples
|
| 421 |
-
--------
|
| 422 |
-
>>> from picarones.adapters.legacy_engines.tesseract import TesseractEngine
|
| 423 |
-
>>> from picarones.measurements.robustness import RobustnessAnalyzer
|
| 424 |
-
>>> engine = TesseractEngine(config={"lang": "fra"})
|
| 425 |
-
>>> analyzer = RobustnessAnalyzer([engine], degradation_types=["noise", "blur"])
|
| 426 |
-
>>> report = analyzer.analyze(corpus)
|
| 427 |
-
"""
|
| 428 |
-
|
| 429 |
-
def __init__(
|
| 430 |
-
self,
|
| 431 |
-
engines: "list[BaseOCREngine]",
|
| 432 |
-
degradation_types: Optional[list[str]] = None,
|
| 433 |
-
cer_threshold: float = 0.20,
|
| 434 |
-
custom_levels: Optional[dict[str, list]] = None,
|
| 435 |
-
) -> None:
|
| 436 |
-
if not isinstance(engines, list):
|
| 437 |
-
engines = [engines]
|
| 438 |
-
self.engines = engines
|
| 439 |
-
self.degradation_types = degradation_types or ALL_DEGRADATION_TYPES
|
| 440 |
-
self.cer_threshold = cer_threshold
|
| 441 |
-
self.levels = dict(DEGRADATION_LEVELS)
|
| 442 |
-
if custom_levels:
|
| 443 |
-
self.levels.update(custom_levels)
|
| 444 |
-
|
| 445 |
-
def analyze(
|
| 446 |
-
self,
|
| 447 |
-
corpus: "Corpus",
|
| 448 |
-
show_progress: bool = True,
|
| 449 |
-
max_docs: int = 10,
|
| 450 |
-
) -> RobustnessReport:
|
| 451 |
-
"""Lance l'analyse de robustesse sur le corpus.
|
| 452 |
-
|
| 453 |
-
Parameters
|
| 454 |
-
----------
|
| 455 |
-
corpus:
|
| 456 |
-
Corpus Picarones avec images et GT.
|
| 457 |
-
show_progress:
|
| 458 |
-
Affiche la progression.
|
| 459 |
-
max_docs:
|
| 460 |
-
Nombre maximum de documents à traiter (pour la rapidité).
|
| 461 |
-
|
| 462 |
-
Returns
|
| 463 |
-
-------
|
| 464 |
-
RobustnessReport
|
| 465 |
-
"""
|
| 466 |
-
from picarones.evaluation.metrics.text_metrics import compute_metrics
|
| 467 |
-
|
| 468 |
-
docs = corpus.documents[:max_docs]
|
| 469 |
-
curves: list[DegradationCurve] = []
|
| 470 |
-
|
| 471 |
-
for engine in self.engines:
|
| 472 |
-
for deg_type in self.degradation_types:
|
| 473 |
-
levels = self.levels[deg_type]
|
| 474 |
-
labels = DEGRADATION_LABELS.get(deg_type, [str(lv) for lv in levels])
|
| 475 |
-
|
| 476 |
-
cer_per_level: list[Optional[float]] = []
|
| 477 |
-
|
| 478 |
-
if show_progress:
|
| 479 |
-
try:
|
| 480 |
-
from tqdm import tqdm
|
| 481 |
-
level_iter = tqdm(
|
| 482 |
-
list(enumerate(levels)),
|
| 483 |
-
desc=f"{engine.name} / {deg_type}",
|
| 484 |
-
)
|
| 485 |
-
except ImportError:
|
| 486 |
-
level_iter = enumerate(levels)
|
| 487 |
-
else:
|
| 488 |
-
level_iter = enumerate(levels)
|
| 489 |
-
|
| 490 |
-
for lvl_idx, level in level_iter:
|
| 491 |
-
doc_cers: list[float] = []
|
| 492 |
-
|
| 493 |
-
for doc in docs:
|
| 494 |
-
gt = doc.ground_truth.strip()
|
| 495 |
-
if not gt:
|
| 496 |
-
continue
|
| 497 |
-
|
| 498 |
-
# Obtenir l'image (fichier ou data URI)
|
| 499 |
-
degraded_bytes = self._get_degraded_image(
|
| 500 |
-
doc, deg_type, level
|
| 501 |
-
)
|
| 502 |
-
if degraded_bytes is None:
|
| 503 |
-
continue
|
| 504 |
-
|
| 505 |
-
# Sauvegarder temporairement et OCR
|
| 506 |
-
with tempfile.NamedTemporaryFile(
|
| 507 |
-
suffix=".png", delete=False
|
| 508 |
-
) as tmp:
|
| 509 |
-
tmp.write(degraded_bytes)
|
| 510 |
-
tmp_path = tmp.name
|
| 511 |
-
|
| 512 |
-
try:
|
| 513 |
-
ocr_result = engine.run(tmp_path)
|
| 514 |
-
hypothesis = ocr_result.text
|
| 515 |
-
metrics = compute_metrics(gt, hypothesis)
|
| 516 |
-
doc_cers.append(metrics.cer)
|
| 517 |
-
except Exception as exc:
|
| 518 |
-
logger.debug(
|
| 519 |
-
"Erreur OCR %s niveau %s=%s: %s",
|
| 520 |
-
engine.name, deg_type, level, exc
|
| 521 |
-
)
|
| 522 |
-
finally:
|
| 523 |
-
try:
|
| 524 |
-
os.unlink(tmp_path)
|
| 525 |
-
except OSError:
|
| 526 |
-
pass
|
| 527 |
-
|
| 528 |
-
if doc_cers:
|
| 529 |
-
cer_per_level.append(sum(doc_cers) / len(doc_cers))
|
| 530 |
-
else:
|
| 531 |
-
cer_per_level.append(None)
|
| 532 |
-
|
| 533 |
-
# Calculer le niveau critique
|
| 534 |
-
critical = self._find_critical_level(
|
| 535 |
-
levels, cer_per_level, self.cer_threshold
|
| 536 |
-
)
|
| 537 |
-
|
| 538 |
-
curves.append(DegradationCurve(
|
| 539 |
-
engine_name=engine.name,
|
| 540 |
-
degradation_type=deg_type,
|
| 541 |
-
levels=levels,
|
| 542 |
-
labels=labels[:len(levels)],
|
| 543 |
-
cer_values=cer_per_level,
|
| 544 |
-
critical_threshold_level=critical,
|
| 545 |
-
cer_threshold=self.cer_threshold,
|
| 546 |
-
))
|
| 547 |
-
|
| 548 |
-
summary = self._build_summary(curves)
|
| 549 |
-
|
| 550 |
-
return RobustnessReport(
|
| 551 |
-
engine_names=[e.name for e in self.engines],
|
| 552 |
-
corpus_name=corpus.name,
|
| 553 |
-
degradation_types=self.degradation_types,
|
| 554 |
-
curves=curves,
|
| 555 |
-
summary=summary,
|
| 556 |
-
)
|
| 557 |
-
|
| 558 |
-
def _get_degraded_image(
|
| 559 |
-
self,
|
| 560 |
-
doc: "Document",
|
| 561 |
-
degradation_type: str,
|
| 562 |
-
level: float,
|
| 563 |
-
) -> Optional[bytes]:
|
| 564 |
-
"""Retourne les bytes PNG de l'image dégradée."""
|
| 565 |
-
# Charger l'image originale
|
| 566 |
-
original_bytes = self._load_image(doc)
|
| 567 |
-
if original_bytes is None:
|
| 568 |
-
return None
|
| 569 |
-
|
| 570 |
-
# Niveau 0 = image originale (sauf binarisation à 0 = Otsu)
|
| 571 |
-
if (degradation_type == "noise" and level == 0) or \
|
| 572 |
-
(degradation_type == "blur" and level == 0) or \
|
| 573 |
-
(degradation_type == "rotation" and level == 0) or \
|
| 574 |
-
(degradation_type == "resolution" and level >= 1.0):
|
| 575 |
-
return original_bytes
|
| 576 |
-
|
| 577 |
-
return degrade_image_bytes(original_bytes, degradation_type, level)
|
| 578 |
-
|
| 579 |
-
def _load_image(self, doc: "Document") -> Optional[bytes]:
|
| 580 |
-
"""Charge les bytes PNG de l'image d'un document."""
|
| 581 |
-
img_path = doc.image_path
|
| 582 |
-
|
| 583 |
-
# Data URI (base64)
|
| 584 |
-
if img_path.startswith("data:image/"):
|
| 585 |
-
import base64
|
| 586 |
-
try:
|
| 587 |
-
_, b64 = img_path.split(",", 1)
|
| 588 |
-
return base64.b64decode(b64)
|
| 589 |
-
except Exception as exc:
|
| 590 |
-
logger.debug("Impossible de décoder data URI: %s", exc)
|
| 591 |
-
return None
|
| 592 |
-
|
| 593 |
-
# Fichier local
|
| 594 |
-
path = Path(img_path)
|
| 595 |
-
if path.exists():
|
| 596 |
-
return path.read_bytes()
|
| 597 |
-
|
| 598 |
-
logger.debug("Image introuvable : %s", img_path)
|
| 599 |
-
return None
|
| 600 |
-
|
| 601 |
-
@staticmethod
|
| 602 |
-
def _find_critical_level(
|
| 603 |
-
levels: list[float],
|
| 604 |
-
cer_values: list[Optional[float]],
|
| 605 |
-
threshold: float,
|
| 606 |
-
) -> Optional[float]:
|
| 607 |
-
"""Trouve le niveau à partir duquel CER dépasse le seuil."""
|
| 608 |
-
for level, cer in zip(levels, cer_values):
|
| 609 |
-
if cer is not None and cer > threshold:
|
| 610 |
-
return level
|
| 611 |
-
return None
|
| 612 |
-
|
| 613 |
-
@staticmethod
|
| 614 |
-
def _build_summary(curves: list[DegradationCurve]) -> dict:
|
| 615 |
-
"""Construit le résumé de l'analyse."""
|
| 616 |
-
summary: dict = {}
|
| 617 |
-
|
| 618 |
-
# Par type de dégradation : moteur le plus robuste
|
| 619 |
-
by_type: dict[str, dict[str, list]] = {}
|
| 620 |
-
for curve in curves:
|
| 621 |
-
dt = curve.degradation_type
|
| 622 |
-
if dt not in by_type:
|
| 623 |
-
by_type[dt] = {}
|
| 624 |
-
valid_cers = [c for c in curve.cer_values if c is not None]
|
| 625 |
-
if valid_cers:
|
| 626 |
-
by_type[dt][curve.engine_name] = valid_cers
|
| 627 |
-
|
| 628 |
-
for dt, engine_cers in by_type.items():
|
| 629 |
-
if not engine_cers:
|
| 630 |
-
continue
|
| 631 |
-
# Robustesse = CER moyen sur tous les niveaux (plus bas = plus robuste)
|
| 632 |
-
best_engine = min(engine_cers, key=lambda e: sum(engine_cers[e]) / len(engine_cers[e]))
|
| 633 |
-
summary[f"most_robust_{dt}"] = best_engine
|
| 634 |
-
|
| 635 |
-
# Seuils critiques par moteur
|
| 636 |
-
for curve in curves:
|
| 637 |
-
key = f"critical_{curve.engine_name}_{curve.degradation_type}"
|
| 638 |
-
summary[key] = curve.critical_threshold_level
|
| 639 |
-
|
| 640 |
-
return summary
|
| 641 |
-
|
| 642 |
-
|
| 643 |
-
# ---------------------------------------------------------------------------
|
| 644 |
-
# Données de démonstration de robustesse
|
| 645 |
-
# ---------------------------------------------------------------------------
|
| 646 |
-
|
| 647 |
-
def generate_demo_robustness_report(
|
| 648 |
-
engine_names: Optional[list[str]] = None,
|
| 649 |
-
seed: int = 42,
|
| 650 |
-
) -> RobustnessReport:
|
| 651 |
-
"""Génère un rapport de robustesse fictif mais réaliste pour la démo.
|
| 652 |
-
|
| 653 |
-
Parameters
|
| 654 |
-
----------
|
| 655 |
-
engine_names:
|
| 656 |
-
Noms des moteurs à simuler (défaut : tesseract, pero_ocr).
|
| 657 |
-
seed:
|
| 658 |
-
Graine aléatoire.
|
| 659 |
-
|
| 660 |
-
Returns
|
| 661 |
-
-------
|
| 662 |
-
RobustnessReport
|
| 663 |
-
"""
|
| 664 |
-
import random
|
| 665 |
-
rng = random.Random(seed)
|
| 666 |
-
|
| 667 |
-
if engine_names is None:
|
| 668 |
-
engine_names = ["tesseract", "pero_ocr"]
|
| 669 |
-
|
| 670 |
-
# CER de base par moteur
|
| 671 |
-
base_cer = {
|
| 672 |
-
"tesseract": 0.12,
|
| 673 |
-
"pero_ocr": 0.07,
|
| 674 |
-
"ancien_moteur": 0.25,
|
| 675 |
-
}
|
| 676 |
-
|
| 677 |
-
# Sensibilité par type de dégradation (facteur multiplicatif par niveau)
|
| 678 |
-
sensitivity = {
|
| 679 |
-
"tesseract": {
|
| 680 |
-
"noise": 0.04, "blur": 0.05, "rotation": 0.06,
|
| 681 |
-
"resolution": 0.12, "binarization": 0.03,
|
| 682 |
-
},
|
| 683 |
-
"pero_ocr": {
|
| 684 |
-
"noise": 0.02, "blur": 0.03, "rotation": 0.04,
|
| 685 |
-
"resolution": 0.08, "binarization": 0.02,
|
| 686 |
-
},
|
| 687 |
-
"ancien_moteur": {
|
| 688 |
-
"noise": 0.06, "blur": 0.08, "rotation": 0.10,
|
| 689 |
-
"resolution": 0.15, "binarization": 0.05,
|
| 690 |
-
},
|
| 691 |
-
}
|
| 692 |
-
|
| 693 |
-
deg_types = ALL_DEGRADATION_TYPES
|
| 694 |
-
curves: list[DegradationCurve] = []
|
| 695 |
-
|
| 696 |
-
for engine_name in engine_names:
|
| 697 |
-
cer_base = base_cer.get(engine_name, 0.15)
|
| 698 |
-
sens = sensitivity.get(engine_name, {dt: 0.05 for dt in deg_types})
|
| 699 |
-
|
| 700 |
-
for deg_type in deg_types:
|
| 701 |
-
levels = DEGRADATION_LEVELS[deg_type]
|
| 702 |
-
labels = DEGRADATION_LABELS[deg_type]
|
| 703 |
-
s = sens.get(deg_type, 0.05)
|
| 704 |
-
|
| 705 |
-
cer_values = []
|
| 706 |
-
for i, level in enumerate(levels):
|
| 707 |
-
noise = rng.gauss(0, 0.005)
|
| 708 |
-
cer = min(1.0, cer_base + s * i + noise)
|
| 709 |
-
cer_values.append(round(max(0.0, cer), 4))
|
| 710 |
-
|
| 711 |
-
critical = RobustnessAnalyzer._find_critical_level(levels, cer_values, 0.20)
|
| 712 |
-
|
| 713 |
-
curves.append(DegradationCurve(
|
| 714 |
-
engine_name=engine_name,
|
| 715 |
-
degradation_type=deg_type,
|
| 716 |
-
levels=list(levels),
|
| 717 |
-
labels=labels[:len(levels)],
|
| 718 |
-
cer_values=cer_values,
|
| 719 |
-
critical_threshold_level=critical,
|
| 720 |
-
cer_threshold=0.20,
|
| 721 |
-
))
|
| 722 |
|
| 723 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 724 |
|
| 725 |
-
|
| 726 |
-
engine_names=engine_names,
|
| 727 |
-
corpus_name="Corpus de démonstration — Chroniques médiévales",
|
| 728 |
-
degradation_types=deg_types,
|
| 729 |
-
curves=curves,
|
| 730 |
-
summary=summary,
|
| 731 |
-
)
|
|
|
|
| 1 |
+
"""Shim de compatibilité — métrique relocalisée.
|
| 2 |
|
| 3 |
+
Sprint E.5 du plan v2.0 (mai 2026) — module migré depuis
|
| 4 |
+
``picarones.measurements.robustness`` vers
|
| 5 |
+
``picarones.evaluation.metrics.robustness`` (couche canonique).
|
| 6 |
+
Ce shim re-exporte l'API publique avec un ``DeprecationWarning``
|
| 7 |
+
et sera supprimé en 2.0.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 8 |
"""
|
| 9 |
|
| 10 |
from __future__ import annotations
|
| 11 |
|
| 12 |
+
import warnings
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 13 |
|
| 14 |
+
warnings.warn(
|
| 15 |
+
"picarones.measurements.robustness est obsolète et sera supprimé en 2.0. "
|
| 16 |
+
"Utiliser picarones.evaluation.metrics.robustness à la place.",
|
| 17 |
+
DeprecationWarning,
|
| 18 |
+
stacklevel=2,
|
| 19 |
+
)
|
| 20 |
|
| 21 |
+
from picarones.evaluation.metrics.robustness import * # noqa: F401, F403, E402
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
@@ -24,7 +24,7 @@ Module pur — l'utilisateur compose :
|
|
| 24 |
|
| 25 |
.. code-block:: python
|
| 26 |
|
| 27 |
-
from picarones.
|
| 28 |
from picarones.evaluation.metrics.longitudinal import compute_corpus_longitudinal
|
| 29 |
from picarones.reports_v2.html.renderers.longitudinal import build_longitudinal_html
|
| 30 |
|
|
|
|
| 24 |
|
| 25 |
.. code-block:: python
|
| 26 |
|
| 27 |
+
from picarones.evaluation.metrics.history import BenchmarkHistory
|
| 28 |
from picarones.evaluation.metrics.longitudinal import compute_corpus_longitudinal
|
| 29 |
from picarones.reports_v2.html.renderers.longitudinal import build_longitudinal_html
|
| 30 |
|
|
@@ -21,7 +21,7 @@ l'utilisateur compose :
|
|
| 21 |
|
| 22 |
.. code-block:: python
|
| 23 |
|
| 24 |
-
from picarones.
|
| 25 |
from picarones.reports_v2.html.renderers.multirun_stability import (
|
| 26 |
build_multirun_stability_html,
|
| 27 |
)
|
|
|
|
| 21 |
|
| 22 |
.. code-block:: python
|
| 23 |
|
| 24 |
+
from picarones.evaluation.metrics.reliability import compute_multirun_stability
|
| 25 |
from picarones.reports_v2.html.renderers.multirun_stability import (
|
| 26 |
build_multirun_stability_html,
|
| 27 |
)
|
|
@@ -20,7 +20,7 @@ l'utilisateur compose :
|
|
| 20 |
|
| 21 |
.. code-block:: python
|
| 22 |
|
| 23 |
-
from picarones.
|
| 24 |
from picarones.evaluation.metrics.robustness_projection import (
|
| 25 |
project_robustness_on_corpus,
|
| 26 |
aggregate_projection_per_engine,
|
|
|
|
| 20 |
|
| 21 |
.. code-block:: python
|
| 22 |
|
| 23 |
+
from picarones.evaluation.metrics.robustness import analyze_robustness
|
| 24 |
from picarones.evaluation.metrics.robustness_projection import (
|
| 25 |
project_robustness_on_corpus,
|
| 26 |
aggregate_projection_per_engine,
|
|
@@ -24,7 +24,7 @@ async def api_history_regressions(
|
|
| 24 |
db_path: Optional[str] = Query(default=None, description="Chemin SQLite history"),
|
| 25 |
) -> dict:
|
| 26 |
"""Liste les régressions détectées dans l'historique longitudinal."""
|
| 27 |
-
from picarones.
|
| 28 |
|
| 29 |
try:
|
| 30 |
history = BenchmarkHistory(db_path) if db_path else BenchmarkHistory()
|
|
|
|
| 24 |
db_path: Optional[str] = Query(default=None, description="Chemin SQLite history"),
|
| 25 |
) -> dict:
|
| 26 |
"""Liste les régressions détectées dans l'historique longitudinal."""
|
| 27 |
+
from picarones.evaluation.metrics.history import BenchmarkHistory
|
| 28 |
|
| 29 |
try:
|
| 30 |
history = BenchmarkHistory(db_path) if db_path else BenchmarkHistory()
|
|
@@ -78,6 +78,10 @@ FILE_BUDGETS: dict[str, int] = {
|
|
| 78 |
"picarones/evaluation/metrics/modern_archives.py": 700, # actuel 599
|
| 79 |
# Sprint E.4 du plan v2.0 — migré vers ``evaluation/metrics/``.
|
| 80 |
"picarones/evaluation/metrics/builtin_hooks.py": 700, # actuel 590
|
|
|
|
|
|
|
|
|
|
|
|
|
| 81 |
# (Phase 7.D — ``pipeline/legacy_runner.py`` et
|
| 82 |
# ``pipeline/legacy_pipeline_benchmark.py`` supprimés.)
|
| 83 |
# Phase 8 — importers IIIF/Gallica déplacés vers ``adapters/corpus/``.
|
|
|
|
| 78 |
"picarones/evaluation/metrics/modern_archives.py": 700, # actuel 599
|
| 79 |
# Sprint E.4 du plan v2.0 — migré vers ``evaluation/metrics/``.
|
| 80 |
"picarones/evaluation/metrics/builtin_hooks.py": 700, # actuel 590
|
| 81 |
+
# Sprint E.5 du plan v2.0 — modules ``history`` et ``robustness``
|
| 82 |
+
# migrés depuis ``measurements/`` vers la couche canonique.
|
| 83 |
+
"picarones/evaluation/metrics/history.py": 720, # actuel 615
|
| 84 |
+
"picarones/evaluation/metrics/robustness.py": 850, # actuel 742
|
| 85 |
# (Phase 7.D — ``pipeline/legacy_runner.py`` et
|
| 86 |
# ``pipeline/legacy_pipeline_benchmark.py`` supprimés.)
|
| 87 |
# Phase 8 — importers IIIF/Gallica déplacés vers ``adapters/corpus/``.
|
|
@@ -72,7 +72,7 @@ LEGACY_PACKAGES: tuple[str, ...] = (
|
|
| 72 |
#: :data:`LEGACY_PARITY` sans faire échouer le test. À diminuer
|
| 73 |
#: à chaque session de migration : on cible 0 quand le retrait
|
| 74 |
#: est complet.
|
| 75 |
-
BOOTSTRAP_BASELINE =
|
| 76 |
|
| 77 |
|
| 78 |
# ──────────────────────────────────────────────────────────────────
|
|
|
|
| 72 |
#: :data:`LEGACY_PARITY` sans faire échouer le test. À diminuer
|
| 73 |
#: à chaque session de migration : on cible 0 quand le retrait
|
| 74 |
#: est complet.
|
| 75 |
+
BOOTSTRAP_BASELINE = 0
|
| 76 |
|
| 77 |
|
| 78 |
# ──────────────────────────────────────────────────────────────────
|
|
@@ -92,6 +92,10 @@ TEST_ONLY_BASELINE: frozenset[str] = frozenset({
|
|
| 92 |
"philological_hooks",
|
| 93 |
"readability_hooks",
|
| 94 |
"searchability_hooks",
|
|
|
|
|
|
|
|
|
|
|
|
|
| 95 |
})
|
| 96 |
|
| 97 |
|
|
|
|
| 92 |
"philological_hooks",
|
| 93 |
"readability_hooks",
|
| 94 |
"searchability_hooks",
|
| 95 |
+
# Sprint E.5 du plan v2.0 — derniers shims (history,
|
| 96 |
+
# robustness) sans consommateur production direct.
|
| 97 |
+
"history",
|
| 98 |
+
"robustness",
|
| 99 |
})
|
| 100 |
|
| 101 |
|
|
@@ -29,7 +29,7 @@ from __future__ import annotations
|
|
| 29 |
|
| 30 |
import pytest
|
| 31 |
|
| 32 |
-
from picarones.
|
| 33 |
_aligned_char_pairs,
|
| 34 |
cohen_kappa,
|
| 35 |
compute_iaa,
|
|
|
|
| 29 |
|
| 30 |
import pytest
|
| 31 |
|
| 32 |
+
from picarones.evaluation.metrics.reliability import (
|
| 33 |
_aligned_char_pairs,
|
| 34 |
cohen_kappa,
|
| 35 |
compute_iaa,
|
|
@@ -29,11 +29,11 @@ class TestBenchmarkHistory:
|
|
| 29 |
|
| 30 |
@pytest.fixture
|
| 31 |
def db(self):
|
| 32 |
-
from picarones.
|
| 33 |
return BenchmarkHistory(":memory:")
|
| 34 |
|
| 35 |
def test_import_module(self):
|
| 36 |
-
from picarones.
|
| 37 |
assert BenchmarkHistory is not None
|
| 38 |
|
| 39 |
def test_init_in_memory(self, db):
|
|
@@ -142,11 +142,11 @@ class TestBenchmarkHistory:
|
|
| 142 |
class TestHistoryEntry:
|
| 143 |
|
| 144 |
def test_import(self):
|
| 145 |
-
from picarones.
|
| 146 |
assert HistoryEntry is not None
|
| 147 |
|
| 148 |
def test_cer_percent(self):
|
| 149 |
-
from picarones.
|
| 150 |
entry = HistoryEntry(
|
| 151 |
run_id="r1", timestamp="2025-01-01T00:00:00+00:00",
|
| 152 |
corpus_name="C", engine_name="tesseract",
|
|
@@ -155,12 +155,12 @@ class TestHistoryEntry:
|
|
| 155 |
assert abs(entry.cer_percent - 12.0) < 0.01
|
| 156 |
|
| 157 |
def test_cer_percent_none(self):
|
| 158 |
-
from picarones.
|
| 159 |
entry = HistoryEntry("r", "2025", "C", "e", None, None, 0)
|
| 160 |
assert entry.cer_percent is None
|
| 161 |
|
| 162 |
def test_as_dict_keys(self):
|
| 163 |
-
from picarones.
|
| 164 |
entry = HistoryEntry("r1", "2025-01-01", "C", "tesseract", 0.10, 0.18, 5)
|
| 165 |
d = entry.as_dict()
|
| 166 |
assert "run_id" in d
|
|
@@ -168,14 +168,14 @@ class TestHistoryEntry:
|
|
| 168 |
assert "engine_name" in d
|
| 169 |
|
| 170 |
def test_as_dict_metadata(self):
|
| 171 |
-
from picarones.
|
| 172 |
entry = HistoryEntry("r1", "2025-01-01", "C", "tesseract", 0.10, 0.18, 5,
|
| 173 |
metadata={"key": "value"})
|
| 174 |
d = entry.as_dict()
|
| 175 |
assert d["metadata"] == {"key": "value"}
|
| 176 |
|
| 177 |
def test_query_result_is_history_entry(self):
|
| 178 |
-
from picarones.
|
| 179 |
db = BenchmarkHistory(":memory:")
|
| 180 |
db.record_single("r1", "C", "tesseract", 0.10, 0.18, 5)
|
| 181 |
entries = db.query()
|
|
@@ -190,7 +190,7 @@ class TestRegressionResult:
|
|
| 190 |
|
| 191 |
@pytest.fixture
|
| 192 |
def db_with_runs(self):
|
| 193 |
-
from picarones.
|
| 194 |
db = BenchmarkHistory(":memory:")
|
| 195 |
db.record_single("r1", "C", "tesseract", 0.12, 0.20, 10, timestamp="2025-01-01T00:00:00+00:00")
|
| 196 |
db.record_single("r2", "C", "tesseract", 0.15, 0.25, 10, timestamp="2025-06-01T00:00:00+00:00")
|
|
@@ -212,7 +212,7 @@ class TestRegressionResult:
|
|
| 212 |
assert result.current_cer is not None
|
| 213 |
|
| 214 |
def test_detect_no_regression(self):
|
| 215 |
-
from picarones.
|
| 216 |
db = BenchmarkHistory(":memory:")
|
| 217 |
# CER diminue = amélioration = pas de régression
|
| 218 |
db.record_single("r1", "C", "tesseract", 0.15, 0.25, 5, timestamp="2025-01-01T00:00:00+00:00")
|
|
@@ -222,14 +222,14 @@ class TestRegressionResult:
|
|
| 222 |
assert result.is_regression is False
|
| 223 |
|
| 224 |
def test_detect_regression_none_if_single_run(self):
|
| 225 |
-
from picarones.
|
| 226 |
db = BenchmarkHistory(":memory:")
|
| 227 |
db.record_single("r1", "C", "tesseract", 0.12, 0.20, 5)
|
| 228 |
result = db.detect_regression("tesseract")
|
| 229 |
assert result is None
|
| 230 |
|
| 231 |
def test_detect_all_regressions(self):
|
| 232 |
-
from picarones.
|
| 233 |
db = BenchmarkHistory(":memory:")
|
| 234 |
db.record_single("r1", "C", "tesseract", 0.10, 0.18, 5, timestamp="2025-01-01T00:00:00+00:00")
|
| 235 |
db.record_single("r2", "C", "tesseract", 0.20, 0.35, 5, timestamp="2025-06-01T00:00:00+00:00")
|
|
@@ -244,7 +244,7 @@ class TestRegressionResult:
|
|
| 244 |
assert "engine_name" in d
|
| 245 |
|
| 246 |
def test_regression_threshold_respected(self):
|
| 247 |
-
from picarones.
|
| 248 |
db = BenchmarkHistory(":memory:")
|
| 249 |
db.record_single("r1", "C", "tesseract", 0.100, 0.18, 5, timestamp="2025-01-01T00:00:00+00:00")
|
| 250 |
db.record_single("r2", "C", "tesseract", 0.105, 0.19, 5, timestamp="2025-06-01T00:00:00+00:00")
|
|
@@ -264,27 +264,27 @@ class TestRegressionResult:
|
|
| 264 |
class TestGenerateDemoHistory:
|
| 265 |
|
| 266 |
def test_generate_fills_db(self):
|
| 267 |
-
from picarones.
|
| 268 |
db = BenchmarkHistory(":memory:")
|
| 269 |
generate_demo_history(db, n_runs=5)
|
| 270 |
assert db.count() > 0
|
| 271 |
|
| 272 |
def test_generate_creates_multiple_engines(self):
|
| 273 |
-
from picarones.
|
| 274 |
db = BenchmarkHistory(":memory:")
|
| 275 |
generate_demo_history(db, n_runs=4)
|
| 276 |
engines = db.list_engines()
|
| 277 |
assert len(engines) >= 2
|
| 278 |
|
| 279 |
def test_generate_n_runs(self):
|
| 280 |
-
from picarones.
|
| 281 |
db = BenchmarkHistory(":memory:")
|
| 282 |
generate_demo_history(db, n_runs=8)
|
| 283 |
# 8 runs × 3 moteurs = 24 entrées
|
| 284 |
assert db.count() == 8 * 3
|
| 285 |
|
| 286 |
def test_cer_values_in_range(self):
|
| 287 |
-
from picarones.
|
| 288 |
db = BenchmarkHistory(":memory:")
|
| 289 |
generate_demo_history(db, n_runs=5)
|
| 290 |
entries = db.query()
|
|
@@ -294,7 +294,7 @@ class TestGenerateDemoHistory:
|
|
| 294 |
|
| 295 |
def test_regression_detectable_in_demo(self):
|
| 296 |
"""La démo inclut une régression simulée au run 5 (tesseract)."""
|
| 297 |
-
from picarones.
|
| 298 |
db = BenchmarkHistory(":memory:")
|
| 299 |
generate_demo_history(db, n_runs=8, seed=42)
|
| 300 |
# Vérifier que l'historique a été créé
|
|
@@ -311,33 +311,33 @@ class TestGenerateDemoHistory:
|
|
| 311 |
class TestDegradationLevels:
|
| 312 |
|
| 313 |
def test_import_constants(self):
|
| 314 |
-
from picarones.
|
| 315 |
assert len(DEGRADATION_LEVELS) > 0
|
| 316 |
assert len(ALL_DEGRADATION_TYPES) > 0
|
| 317 |
|
| 318 |
def test_all_types_in_levels(self):
|
| 319 |
-
from picarones.
|
| 320 |
for t in ALL_DEGRADATION_TYPES:
|
| 321 |
assert t in DEGRADATION_LEVELS
|
| 322 |
|
| 323 |
def test_noise_levels(self):
|
| 324 |
-
from picarones.
|
| 325 |
levels = DEGRADATION_LEVELS["noise"]
|
| 326 |
assert len(levels) >= 2
|
| 327 |
assert 0 in levels # niveau original
|
| 328 |
|
| 329 |
def test_blur_levels(self):
|
| 330 |
-
from picarones.
|
| 331 |
levels = DEGRADATION_LEVELS["blur"]
|
| 332 |
assert 0 in levels
|
| 333 |
|
| 334 |
def test_resolution_levels_include_1(self):
|
| 335 |
-
from picarones.
|
| 336 |
levels = DEGRADATION_LEVELS["resolution"]
|
| 337 |
assert 1.0 in levels # résolution originale
|
| 338 |
|
| 339 |
def test_labels_match_levels(self):
|
| 340 |
-
from picarones.
|
| 341 |
for dtype in DEGRADATION_LEVELS:
|
| 342 |
if dtype in DEGRADATION_LABELS:
|
| 343 |
assert len(DEGRADATION_LABELS[dtype]) == len(DEGRADATION_LEVELS[dtype])
|
|
@@ -355,60 +355,60 @@ class TestDegradationFunctions:
|
|
| 355 |
return _make_placeholder_png(40, 30)
|
| 356 |
|
| 357 |
def test_degrade_image_bytes_imports(self):
|
| 358 |
-
from picarones.
|
| 359 |
assert callable(degrade_image_bytes)
|
| 360 |
|
| 361 |
def test_degrade_noise_returns_bytes(self):
|
| 362 |
-
from picarones.
|
| 363 |
png = self._make_png()
|
| 364 |
result = degrade_image_bytes(png, "noise", 0)
|
| 365 |
assert isinstance(result, bytes)
|
| 366 |
assert len(result) > 0
|
| 367 |
|
| 368 |
def test_degrade_blur_returns_bytes(self):
|
| 369 |
-
from picarones.
|
| 370 |
png = self._make_png()
|
| 371 |
result = degrade_image_bytes(png, "blur", 0)
|
| 372 |
assert isinstance(result, bytes)
|
| 373 |
|
| 374 |
def test_degrade_rotation_returns_bytes(self):
|
| 375 |
-
from picarones.
|
| 376 |
png = self._make_png()
|
| 377 |
result = degrade_image_bytes(png, "rotation", 0)
|
| 378 |
assert isinstance(result, bytes)
|
| 379 |
|
| 380 |
def test_degrade_resolution_returns_bytes(self):
|
| 381 |
-
from picarones.
|
| 382 |
png = self._make_png()
|
| 383 |
result = degrade_image_bytes(png, "resolution", 1.0)
|
| 384 |
assert isinstance(result, bytes)
|
| 385 |
|
| 386 |
def test_degrade_binarization_returns_bytes(self):
|
| 387 |
-
from picarones.
|
| 388 |
png = self._make_png()
|
| 389 |
result = degrade_image_bytes(png, "binarization", 0)
|
| 390 |
assert isinstance(result, bytes)
|
| 391 |
|
| 392 |
def test_degrade_noise_level_5(self):
|
| 393 |
-
from picarones.
|
| 394 |
png = self._make_png()
|
| 395 |
result = degrade_image_bytes(png, "noise", 5)
|
| 396 |
assert isinstance(result, bytes)
|
| 397 |
|
| 398 |
def test_degrade_blur_level_2(self):
|
| 399 |
-
from picarones.
|
| 400 |
png = self._make_png()
|
| 401 |
result = degrade_image_bytes(png, "blur", 2)
|
| 402 |
assert isinstance(result, bytes)
|
| 403 |
|
| 404 |
def test_degrade_resolution_half(self):
|
| 405 |
-
from picarones.
|
| 406 |
png = self._make_png()
|
| 407 |
result = degrade_image_bytes(png, "resolution", 0.5)
|
| 408 |
assert isinstance(result, bytes)
|
| 409 |
|
| 410 |
def test_degrade_rotation_10_degrees(self):
|
| 411 |
-
from picarones.
|
| 412 |
png = self._make_png()
|
| 413 |
result = degrade_image_bytes(png, "rotation", 10)
|
| 414 |
assert isinstance(result, bytes)
|
|
@@ -421,11 +421,11 @@ class TestDegradationFunctions:
|
|
| 421 |
class TestDegradationCurve:
|
| 422 |
|
| 423 |
def test_import(self):
|
| 424 |
-
from picarones.
|
| 425 |
assert DegradationCurve is not None
|
| 426 |
|
| 427 |
def test_as_dict_keys(self):
|
| 428 |
-
from picarones.
|
| 429 |
curve = DegradationCurve(
|
| 430 |
engine_name="tesseract",
|
| 431 |
degradation_type="noise",
|
|
@@ -440,7 +440,7 @@ class TestDegradationCurve:
|
|
| 440 |
assert "cer_values" in d
|
| 441 |
|
| 442 |
def test_critical_threshold(self):
|
| 443 |
-
from picarones.
|
| 444 |
curve = DegradationCurve(
|
| 445 |
engine_name="tesseract",
|
| 446 |
degradation_type="noise",
|
|
@@ -453,7 +453,7 @@ class TestDegradationCurve:
|
|
| 453 |
assert curve.critical_threshold_level == 15
|
| 454 |
|
| 455 |
def test_none_cer_allowed(self):
|
| 456 |
-
from picarones.
|
| 457 |
curve = DegradationCurve(
|
| 458 |
engine_name="e",
|
| 459 |
degradation_type="blur",
|
|
@@ -464,17 +464,17 @@ class TestDegradationCurve:
|
|
| 464 |
assert curve.cer_values[0] is None
|
| 465 |
|
| 466 |
def test_default_cer_threshold(self):
|
| 467 |
-
from picarones.
|
| 468 |
curve = DegradationCurve("e", "noise", [0], ["o"], [0.1])
|
| 469 |
assert curve.cer_threshold == 0.20
|
| 470 |
|
| 471 |
def test_engine_name_preserved(self):
|
| 472 |
-
from picarones.
|
| 473 |
curve = DegradationCurve("pero_ocr", "blur", [0, 1], ["o", "r=1"], [0.05, 0.08])
|
| 474 |
assert curve.engine_name == "pero_ocr"
|
| 475 |
|
| 476 |
def test_as_dict_roundtrip(self):
|
| 477 |
-
from picarones.
|
| 478 |
curve = DegradationCurve(
|
| 479 |
engine_name="tesseract",
|
| 480 |
degradation_type="rotation",
|
|
@@ -495,11 +495,11 @@ class TestDegradationCurve:
|
|
| 495 |
class TestRobustnessReport:
|
| 496 |
|
| 497 |
def test_import(self):
|
| 498 |
-
from picarones.
|
| 499 |
assert RobustnessReport is not None
|
| 500 |
|
| 501 |
def test_get_curves_for_engine(self):
|
| 502 |
-
from picarones.
|
| 503 |
c1 = DegradationCurve("tesseract", "noise", [0, 5], ["o", "σ=5"], [0.10, 0.15])
|
| 504 |
c2 = DegradationCurve("pero_ocr", "noise", [0, 5], ["o", "σ=5"], [0.07, 0.10])
|
| 505 |
report = RobustnessReport(["tesseract", "pero_ocr"], "C", ["noise"], [c1, c2])
|
|
@@ -508,7 +508,7 @@ class TestRobustnessReport:
|
|
| 508 |
assert tess_curves[0].engine_name == "tesseract"
|
| 509 |
|
| 510 |
def test_get_curves_for_type(self):
|
| 511 |
-
from picarones.
|
| 512 |
c1 = DegradationCurve("tesseract", "noise", [0, 5], ["o", "σ=5"], [0.10, 0.15])
|
| 513 |
c2 = DegradationCurve("tesseract", "blur", [0, 2], ["o", "r=2"], [0.10, 0.14])
|
| 514 |
report = RobustnessReport(["tesseract"], "C", ["noise", "blur"], [c1, c2])
|
|
@@ -517,7 +517,7 @@ class TestRobustnessReport:
|
|
| 517 |
assert noise_curves[0].degradation_type == "noise"
|
| 518 |
|
| 519 |
def test_as_dict_keys(self):
|
| 520 |
-
from picarones.
|
| 521 |
report = RobustnessReport(["tesseract"], "C", ["noise"], [])
|
| 522 |
d = report.as_dict()
|
| 523 |
assert "engine_names" in d
|
|
@@ -525,7 +525,7 @@ class TestRobustnessReport:
|
|
| 525 |
assert "summary" in d
|
| 526 |
|
| 527 |
def test_as_dict_json_serializable(self):
|
| 528 |
-
from picarones.
|
| 529 |
c = DegradationCurve("e", "noise", [0, 5], ["o", "n5"], [0.1, 0.2])
|
| 530 |
report = RobustnessReport(["e"], "C", ["noise"], [c])
|
| 531 |
d = report.as_dict()
|
|
@@ -534,18 +534,18 @@ class TestRobustnessReport:
|
|
| 534 |
assert len(json_str) > 0
|
| 535 |
|
| 536 |
def test_summary_populated(self):
|
| 537 |
-
from picarones.
|
| 538 |
report = generate_demo_robustness_report(engine_names=["tesseract"], seed=1)
|
| 539 |
assert isinstance(report.summary, dict)
|
| 540 |
assert len(report.summary) > 0
|
| 541 |
|
| 542 |
def test_corpus_name_preserved(self):
|
| 543 |
-
from picarones.
|
| 544 |
report = RobustnessReport(["e"], "Mon Corpus", ["noise"], [])
|
| 545 |
assert report.corpus_name == "Mon Corpus"
|
| 546 |
|
| 547 |
def test_engine_names_list(self):
|
| 548 |
-
from picarones.
|
| 549 |
report = RobustnessReport(["tesseract", "pero_ocr"], "C", [], [])
|
| 550 |
assert "tesseract" in report.engine_names
|
| 551 |
assert "pero_ocr" in report.engine_names
|
|
@@ -558,17 +558,17 @@ class TestRobustnessReport:
|
|
| 558 |
class TestRobustnessAnalyzer:
|
| 559 |
|
| 560 |
def test_import(self):
|
| 561 |
-
from picarones.
|
| 562 |
assert RobustnessAnalyzer is not None
|
| 563 |
|
| 564 |
def test_init_single_engine(self):
|
| 565 |
-
from picarones.
|
| 566 |
mock_engine = type("E", (), {"name": "tesseract"})()
|
| 567 |
analyzer = RobustnessAnalyzer(mock_engine)
|
| 568 |
assert len(analyzer.engines) == 1
|
| 569 |
|
| 570 |
def test_init_list_engines(self):
|
| 571 |
-
from picarones.
|
| 572 |
engines = [
|
| 573 |
type("E", (), {"name": "tesseract"})(),
|
| 574 |
type("E", (), {"name": "pero_ocr"})(),
|
|
@@ -577,33 +577,33 @@ class TestRobustnessAnalyzer:
|
|
| 577 |
assert len(analyzer.engines) == 2
|
| 578 |
|
| 579 |
def test_default_degradation_types(self):
|
| 580 |
-
from picarones.
|
| 581 |
e = type("E", (), {"name": "e"})()
|
| 582 |
analyzer = RobustnessAnalyzer(e)
|
| 583 |
assert set(analyzer.degradation_types) == set(ALL_DEGRADATION_TYPES)
|
| 584 |
|
| 585 |
def test_custom_degradation_types(self):
|
| 586 |
-
from picarones.
|
| 587 |
e = type("E", (), {"name": "e"})()
|
| 588 |
analyzer = RobustnessAnalyzer(e, degradation_types=["noise", "blur"])
|
| 589 |
assert analyzer.degradation_types == ["noise", "blur"]
|
| 590 |
|
| 591 |
def test_find_critical_level_found(self):
|
| 592 |
-
from picarones.
|
| 593 |
levels = [0, 5, 15, 30]
|
| 594 |
cer_values = [0.10, 0.15, 0.22, 0.35]
|
| 595 |
critical = RobustnessAnalyzer._find_critical_level(levels, cer_values, 0.20)
|
| 596 |
assert critical == 15
|
| 597 |
|
| 598 |
def test_find_critical_level_none(self):
|
| 599 |
-
from picarones.
|
| 600 |
levels = [0, 5, 15]
|
| 601 |
cer_values = [0.05, 0.10, 0.15]
|
| 602 |
critical = RobustnessAnalyzer._find_critical_level(levels, cer_values, 0.20)
|
| 603 |
assert critical is None
|
| 604 |
|
| 605 |
def test_build_summary(self):
|
| 606 |
-
from picarones.
|
| 607 |
curves = [
|
| 608 |
DegradationCurve("tesseract", "noise", [0, 5], ["o", "n5"], [0.10, 0.20]),
|
| 609 |
DegradationCurve("pero_ocr", "noise", [0, 5], ["o", "n5"], [0.07, 0.12]),
|
|
@@ -620,33 +620,33 @@ class TestRobustnessAnalyzer:
|
|
| 620 |
class TestGenerateDemoRobustness:
|
| 621 |
|
| 622 |
def test_import(self):
|
| 623 |
-
from picarones.
|
| 624 |
assert callable(generate_demo_robustness_report)
|
| 625 |
|
| 626 |
def test_returns_report(self):
|
| 627 |
-
from picarones.
|
| 628 |
report = generate_demo_robustness_report()
|
| 629 |
assert isinstance(report, RobustnessReport)
|
| 630 |
|
| 631 |
def test_default_engines(self):
|
| 632 |
-
from picarones.
|
| 633 |
report = generate_demo_robustness_report()
|
| 634 |
assert "tesseract" in report.engine_names
|
| 635 |
assert "pero_ocr" in report.engine_names
|
| 636 |
|
| 637 |
def test_custom_engines(self):
|
| 638 |
-
from picarones.
|
| 639 |
report = generate_demo_robustness_report(engine_names=["moteur_custom"])
|
| 640 |
assert "moteur_custom" in report.engine_names
|
| 641 |
|
| 642 |
def test_all_degradation_types_present(self):
|
| 643 |
-
from picarones.
|
| 644 |
report = generate_demo_robustness_report()
|
| 645 |
types_in_report = {c.degradation_type for c in report.curves}
|
| 646 |
assert types_in_report == set(ALL_DEGRADATION_TYPES)
|
| 647 |
|
| 648 |
def test_cer_values_in_range(self):
|
| 649 |
-
from picarones.
|
| 650 |
report = generate_demo_robustness_report(seed=99)
|
| 651 |
for curve in report.curves:
|
| 652 |
for cer in curve.cer_values:
|
|
@@ -655,7 +655,7 @@ class TestGenerateDemoRobustness:
|
|
| 655 |
|
| 656 |
def test_cer_increases_with_degradation(self):
|
| 657 |
"""Pour la plupart des types, le CER doit augmenter avec le niveau de dégradation."""
|
| 658 |
-
from picarones.
|
| 659 |
report = generate_demo_robustness_report(seed=42)
|
| 660 |
for curve in report.curves:
|
| 661 |
valid = [c for c in curve.cer_values if c is not None]
|
|
@@ -667,18 +667,18 @@ class TestGenerateDemoRobustness:
|
|
| 667 |
)
|
| 668 |
|
| 669 |
def test_reproducible_with_seed(self):
|
| 670 |
-
from picarones.
|
| 671 |
r1 = generate_demo_robustness_report(seed=7)
|
| 672 |
r2 = generate_demo_robustness_report(seed=7)
|
| 673 |
assert r1.curves[0].cer_values == r2.curves[0].cer_values
|
| 674 |
|
| 675 |
def test_summary_contains_most_robust(self):
|
| 676 |
-
from picarones.
|
| 677 |
report = generate_demo_robustness_report()
|
| 678 |
assert any("most_robust" in k for k in report.summary)
|
| 679 |
|
| 680 |
def test_json_serializable(self):
|
| 681 |
-
from picarones.
|
| 682 |
report = generate_demo_robustness_report()
|
| 683 |
d = report.as_dict()
|
| 684 |
json_str = json.dumps(d, ensure_ascii=False)
|
|
|
|
| 29 |
|
| 30 |
@pytest.fixture
|
| 31 |
def db(self):
|
| 32 |
+
from picarones.evaluation.metrics.history import BenchmarkHistory
|
| 33 |
return BenchmarkHistory(":memory:")
|
| 34 |
|
| 35 |
def test_import_module(self):
|
| 36 |
+
from picarones.evaluation.metrics.history import BenchmarkHistory
|
| 37 |
assert BenchmarkHistory is not None
|
| 38 |
|
| 39 |
def test_init_in_memory(self, db):
|
|
|
|
| 142 |
class TestHistoryEntry:
|
| 143 |
|
| 144 |
def test_import(self):
|
| 145 |
+
from picarones.evaluation.metrics.history import HistoryEntry
|
| 146 |
assert HistoryEntry is not None
|
| 147 |
|
| 148 |
def test_cer_percent(self):
|
| 149 |
+
from picarones.evaluation.metrics.history import HistoryEntry
|
| 150 |
entry = HistoryEntry(
|
| 151 |
run_id="r1", timestamp="2025-01-01T00:00:00+00:00",
|
| 152 |
corpus_name="C", engine_name="tesseract",
|
|
|
|
| 155 |
assert abs(entry.cer_percent - 12.0) < 0.01
|
| 156 |
|
| 157 |
def test_cer_percent_none(self):
|
| 158 |
+
from picarones.evaluation.metrics.history import HistoryEntry
|
| 159 |
entry = HistoryEntry("r", "2025", "C", "e", None, None, 0)
|
| 160 |
assert entry.cer_percent is None
|
| 161 |
|
| 162 |
def test_as_dict_keys(self):
|
| 163 |
+
from picarones.evaluation.metrics.history import HistoryEntry
|
| 164 |
entry = HistoryEntry("r1", "2025-01-01", "C", "tesseract", 0.10, 0.18, 5)
|
| 165 |
d = entry.as_dict()
|
| 166 |
assert "run_id" in d
|
|
|
|
| 168 |
assert "engine_name" in d
|
| 169 |
|
| 170 |
def test_as_dict_metadata(self):
|
| 171 |
+
from picarones.evaluation.metrics.history import HistoryEntry
|
| 172 |
entry = HistoryEntry("r1", "2025-01-01", "C", "tesseract", 0.10, 0.18, 5,
|
| 173 |
metadata={"key": "value"})
|
| 174 |
d = entry.as_dict()
|
| 175 |
assert d["metadata"] == {"key": "value"}
|
| 176 |
|
| 177 |
def test_query_result_is_history_entry(self):
|
| 178 |
+
from picarones.evaluation.metrics.history import BenchmarkHistory, HistoryEntry
|
| 179 |
db = BenchmarkHistory(":memory:")
|
| 180 |
db.record_single("r1", "C", "tesseract", 0.10, 0.18, 5)
|
| 181 |
entries = db.query()
|
|
|
|
| 190 |
|
| 191 |
@pytest.fixture
|
| 192 |
def db_with_runs(self):
|
| 193 |
+
from picarones.evaluation.metrics.history import BenchmarkHistory
|
| 194 |
db = BenchmarkHistory(":memory:")
|
| 195 |
db.record_single("r1", "C", "tesseract", 0.12, 0.20, 10, timestamp="2025-01-01T00:00:00+00:00")
|
| 196 |
db.record_single("r2", "C", "tesseract", 0.15, 0.25, 10, timestamp="2025-06-01T00:00:00+00:00")
|
|
|
|
| 212 |
assert result.current_cer is not None
|
| 213 |
|
| 214 |
def test_detect_no_regression(self):
|
| 215 |
+
from picarones.evaluation.metrics.history import BenchmarkHistory
|
| 216 |
db = BenchmarkHistory(":memory:")
|
| 217 |
# CER diminue = amélioration = pas de régression
|
| 218 |
db.record_single("r1", "C", "tesseract", 0.15, 0.25, 5, timestamp="2025-01-01T00:00:00+00:00")
|
|
|
|
| 222 |
assert result.is_regression is False
|
| 223 |
|
| 224 |
def test_detect_regression_none_if_single_run(self):
|
| 225 |
+
from picarones.evaluation.metrics.history import BenchmarkHistory
|
| 226 |
db = BenchmarkHistory(":memory:")
|
| 227 |
db.record_single("r1", "C", "tesseract", 0.12, 0.20, 5)
|
| 228 |
result = db.detect_regression("tesseract")
|
| 229 |
assert result is None
|
| 230 |
|
| 231 |
def test_detect_all_regressions(self):
|
| 232 |
+
from picarones.evaluation.metrics.history import BenchmarkHistory
|
| 233 |
db = BenchmarkHistory(":memory:")
|
| 234 |
db.record_single("r1", "C", "tesseract", 0.10, 0.18, 5, timestamp="2025-01-01T00:00:00+00:00")
|
| 235 |
db.record_single("r2", "C", "tesseract", 0.20, 0.35, 5, timestamp="2025-06-01T00:00:00+00:00")
|
|
|
|
| 244 |
assert "engine_name" in d
|
| 245 |
|
| 246 |
def test_regression_threshold_respected(self):
|
| 247 |
+
from picarones.evaluation.metrics.history import BenchmarkHistory
|
| 248 |
db = BenchmarkHistory(":memory:")
|
| 249 |
db.record_single("r1", "C", "tesseract", 0.100, 0.18, 5, timestamp="2025-01-01T00:00:00+00:00")
|
| 250 |
db.record_single("r2", "C", "tesseract", 0.105, 0.19, 5, timestamp="2025-06-01T00:00:00+00:00")
|
|
|
|
| 264 |
class TestGenerateDemoHistory:
|
| 265 |
|
| 266 |
def test_generate_fills_db(self):
|
| 267 |
+
from picarones.evaluation.metrics.history import BenchmarkHistory, generate_demo_history
|
| 268 |
db = BenchmarkHistory(":memory:")
|
| 269 |
generate_demo_history(db, n_runs=5)
|
| 270 |
assert db.count() > 0
|
| 271 |
|
| 272 |
def test_generate_creates_multiple_engines(self):
|
| 273 |
+
from picarones.evaluation.metrics.history import BenchmarkHistory, generate_demo_history
|
| 274 |
db = BenchmarkHistory(":memory:")
|
| 275 |
generate_demo_history(db, n_runs=4)
|
| 276 |
engines = db.list_engines()
|
| 277 |
assert len(engines) >= 2
|
| 278 |
|
| 279 |
def test_generate_n_runs(self):
|
| 280 |
+
from picarones.evaluation.metrics.history import BenchmarkHistory, generate_demo_history
|
| 281 |
db = BenchmarkHistory(":memory:")
|
| 282 |
generate_demo_history(db, n_runs=8)
|
| 283 |
# 8 runs × 3 moteurs = 24 entrées
|
| 284 |
assert db.count() == 8 * 3
|
| 285 |
|
| 286 |
def test_cer_values_in_range(self):
|
| 287 |
+
from picarones.evaluation.metrics.history import BenchmarkHistory, generate_demo_history
|
| 288 |
db = BenchmarkHistory(":memory:")
|
| 289 |
generate_demo_history(db, n_runs=5)
|
| 290 |
entries = db.query()
|
|
|
|
| 294 |
|
| 295 |
def test_regression_detectable_in_demo(self):
|
| 296 |
"""La démo inclut une régression simulée au run 5 (tesseract)."""
|
| 297 |
+
from picarones.evaluation.metrics.history import BenchmarkHistory, generate_demo_history
|
| 298 |
db = BenchmarkHistory(":memory:")
|
| 299 |
generate_demo_history(db, n_runs=8, seed=42)
|
| 300 |
# Vérifier que l'historique a été créé
|
|
|
|
| 311 |
class TestDegradationLevels:
|
| 312 |
|
| 313 |
def test_import_constants(self):
|
| 314 |
+
from picarones.evaluation.metrics.robustness import DEGRADATION_LEVELS, ALL_DEGRADATION_TYPES
|
| 315 |
assert len(DEGRADATION_LEVELS) > 0
|
| 316 |
assert len(ALL_DEGRADATION_TYPES) > 0
|
| 317 |
|
| 318 |
def test_all_types_in_levels(self):
|
| 319 |
+
from picarones.evaluation.metrics.robustness import DEGRADATION_LEVELS, ALL_DEGRADATION_TYPES
|
| 320 |
for t in ALL_DEGRADATION_TYPES:
|
| 321 |
assert t in DEGRADATION_LEVELS
|
| 322 |
|
| 323 |
def test_noise_levels(self):
|
| 324 |
+
from picarones.evaluation.metrics.robustness import DEGRADATION_LEVELS
|
| 325 |
levels = DEGRADATION_LEVELS["noise"]
|
| 326 |
assert len(levels) >= 2
|
| 327 |
assert 0 in levels # niveau original
|
| 328 |
|
| 329 |
def test_blur_levels(self):
|
| 330 |
+
from picarones.evaluation.metrics.robustness import DEGRADATION_LEVELS
|
| 331 |
levels = DEGRADATION_LEVELS["blur"]
|
| 332 |
assert 0 in levels
|
| 333 |
|
| 334 |
def test_resolution_levels_include_1(self):
|
| 335 |
+
from picarones.evaluation.metrics.robustness import DEGRADATION_LEVELS
|
| 336 |
levels = DEGRADATION_LEVELS["resolution"]
|
| 337 |
assert 1.0 in levels # résolution originale
|
| 338 |
|
| 339 |
def test_labels_match_levels(self):
|
| 340 |
+
from picarones.evaluation.metrics.robustness import DEGRADATION_LEVELS, DEGRADATION_LABELS
|
| 341 |
for dtype in DEGRADATION_LEVELS:
|
| 342 |
if dtype in DEGRADATION_LABELS:
|
| 343 |
assert len(DEGRADATION_LABELS[dtype]) == len(DEGRADATION_LEVELS[dtype])
|
|
|
|
| 355 |
return _make_placeholder_png(40, 30)
|
| 356 |
|
| 357 |
def test_degrade_image_bytes_imports(self):
|
| 358 |
+
from picarones.evaluation.metrics.robustness import degrade_image_bytes
|
| 359 |
assert callable(degrade_image_bytes)
|
| 360 |
|
| 361 |
def test_degrade_noise_returns_bytes(self):
|
| 362 |
+
from picarones.evaluation.metrics.robustness import degrade_image_bytes
|
| 363 |
png = self._make_png()
|
| 364 |
result = degrade_image_bytes(png, "noise", 0)
|
| 365 |
assert isinstance(result, bytes)
|
| 366 |
assert len(result) > 0
|
| 367 |
|
| 368 |
def test_degrade_blur_returns_bytes(self):
|
| 369 |
+
from picarones.evaluation.metrics.robustness import degrade_image_bytes
|
| 370 |
png = self._make_png()
|
| 371 |
result = degrade_image_bytes(png, "blur", 0)
|
| 372 |
assert isinstance(result, bytes)
|
| 373 |
|
| 374 |
def test_degrade_rotation_returns_bytes(self):
|
| 375 |
+
from picarones.evaluation.metrics.robustness import degrade_image_bytes
|
| 376 |
png = self._make_png()
|
| 377 |
result = degrade_image_bytes(png, "rotation", 0)
|
| 378 |
assert isinstance(result, bytes)
|
| 379 |
|
| 380 |
def test_degrade_resolution_returns_bytes(self):
|
| 381 |
+
from picarones.evaluation.metrics.robustness import degrade_image_bytes
|
| 382 |
png = self._make_png()
|
| 383 |
result = degrade_image_bytes(png, "resolution", 1.0)
|
| 384 |
assert isinstance(result, bytes)
|
| 385 |
|
| 386 |
def test_degrade_binarization_returns_bytes(self):
|
| 387 |
+
from picarones.evaluation.metrics.robustness import degrade_image_bytes
|
| 388 |
png = self._make_png()
|
| 389 |
result = degrade_image_bytes(png, "binarization", 0)
|
| 390 |
assert isinstance(result, bytes)
|
| 391 |
|
| 392 |
def test_degrade_noise_level_5(self):
|
| 393 |
+
from picarones.evaluation.metrics.robustness import degrade_image_bytes
|
| 394 |
png = self._make_png()
|
| 395 |
result = degrade_image_bytes(png, "noise", 5)
|
| 396 |
assert isinstance(result, bytes)
|
| 397 |
|
| 398 |
def test_degrade_blur_level_2(self):
|
| 399 |
+
from picarones.evaluation.metrics.robustness import degrade_image_bytes
|
| 400 |
png = self._make_png()
|
| 401 |
result = degrade_image_bytes(png, "blur", 2)
|
| 402 |
assert isinstance(result, bytes)
|
| 403 |
|
| 404 |
def test_degrade_resolution_half(self):
|
| 405 |
+
from picarones.evaluation.metrics.robustness import degrade_image_bytes
|
| 406 |
png = self._make_png()
|
| 407 |
result = degrade_image_bytes(png, "resolution", 0.5)
|
| 408 |
assert isinstance(result, bytes)
|
| 409 |
|
| 410 |
def test_degrade_rotation_10_degrees(self):
|
| 411 |
+
from picarones.evaluation.metrics.robustness import degrade_image_bytes
|
| 412 |
png = self._make_png()
|
| 413 |
result = degrade_image_bytes(png, "rotation", 10)
|
| 414 |
assert isinstance(result, bytes)
|
|
|
|
| 421 |
class TestDegradationCurve:
|
| 422 |
|
| 423 |
def test_import(self):
|
| 424 |
+
from picarones.evaluation.metrics.robustness import DegradationCurve
|
| 425 |
assert DegradationCurve is not None
|
| 426 |
|
| 427 |
def test_as_dict_keys(self):
|
| 428 |
+
from picarones.evaluation.metrics.robustness import DegradationCurve
|
| 429 |
curve = DegradationCurve(
|
| 430 |
engine_name="tesseract",
|
| 431 |
degradation_type="noise",
|
|
|
|
| 440 |
assert "cer_values" in d
|
| 441 |
|
| 442 |
def test_critical_threshold(self):
|
| 443 |
+
from picarones.evaluation.metrics.robustness import DegradationCurve
|
| 444 |
curve = DegradationCurve(
|
| 445 |
engine_name="tesseract",
|
| 446 |
degradation_type="noise",
|
|
|
|
| 453 |
assert curve.critical_threshold_level == 15
|
| 454 |
|
| 455 |
def test_none_cer_allowed(self):
|
| 456 |
+
from picarones.evaluation.metrics.robustness import DegradationCurve
|
| 457 |
curve = DegradationCurve(
|
| 458 |
engine_name="e",
|
| 459 |
degradation_type="blur",
|
|
|
|
| 464 |
assert curve.cer_values[0] is None
|
| 465 |
|
| 466 |
def test_default_cer_threshold(self):
|
| 467 |
+
from picarones.evaluation.metrics.robustness import DegradationCurve
|
| 468 |
curve = DegradationCurve("e", "noise", [0], ["o"], [0.1])
|
| 469 |
assert curve.cer_threshold == 0.20
|
| 470 |
|
| 471 |
def test_engine_name_preserved(self):
|
| 472 |
+
from picarones.evaluation.metrics.robustness import DegradationCurve
|
| 473 |
curve = DegradationCurve("pero_ocr", "blur", [0, 1], ["o", "r=1"], [0.05, 0.08])
|
| 474 |
assert curve.engine_name == "pero_ocr"
|
| 475 |
|
| 476 |
def test_as_dict_roundtrip(self):
|
| 477 |
+
from picarones.evaluation.metrics.robustness import DegradationCurve
|
| 478 |
curve = DegradationCurve(
|
| 479 |
engine_name="tesseract",
|
| 480 |
degradation_type="rotation",
|
|
|
|
| 495 |
class TestRobustnessReport:
|
| 496 |
|
| 497 |
def test_import(self):
|
| 498 |
+
from picarones.evaluation.metrics.robustness import RobustnessReport
|
| 499 |
assert RobustnessReport is not None
|
| 500 |
|
| 501 |
def test_get_curves_for_engine(self):
|
| 502 |
+
from picarones.evaluation.metrics.robustness import RobustnessReport, DegradationCurve
|
| 503 |
c1 = DegradationCurve("tesseract", "noise", [0, 5], ["o", "σ=5"], [0.10, 0.15])
|
| 504 |
c2 = DegradationCurve("pero_ocr", "noise", [0, 5], ["o", "σ=5"], [0.07, 0.10])
|
| 505 |
report = RobustnessReport(["tesseract", "pero_ocr"], "C", ["noise"], [c1, c2])
|
|
|
|
| 508 |
assert tess_curves[0].engine_name == "tesseract"
|
| 509 |
|
| 510 |
def test_get_curves_for_type(self):
|
| 511 |
+
from picarones.evaluation.metrics.robustness import RobustnessReport, DegradationCurve
|
| 512 |
c1 = DegradationCurve("tesseract", "noise", [0, 5], ["o", "σ=5"], [0.10, 0.15])
|
| 513 |
c2 = DegradationCurve("tesseract", "blur", [0, 2], ["o", "r=2"], [0.10, 0.14])
|
| 514 |
report = RobustnessReport(["tesseract"], "C", ["noise", "blur"], [c1, c2])
|
|
|
|
| 517 |
assert noise_curves[0].degradation_type == "noise"
|
| 518 |
|
| 519 |
def test_as_dict_keys(self):
|
| 520 |
+
from picarones.evaluation.metrics.robustness import RobustnessReport
|
| 521 |
report = RobustnessReport(["tesseract"], "C", ["noise"], [])
|
| 522 |
d = report.as_dict()
|
| 523 |
assert "engine_names" in d
|
|
|
|
| 525 |
assert "summary" in d
|
| 526 |
|
| 527 |
def test_as_dict_json_serializable(self):
|
| 528 |
+
from picarones.evaluation.metrics.robustness import RobustnessReport, DegradationCurve
|
| 529 |
c = DegradationCurve("e", "noise", [0, 5], ["o", "n5"], [0.1, 0.2])
|
| 530 |
report = RobustnessReport(["e"], "C", ["noise"], [c])
|
| 531 |
d = report.as_dict()
|
|
|
|
| 534 |
assert len(json_str) > 0
|
| 535 |
|
| 536 |
def test_summary_populated(self):
|
| 537 |
+
from picarones.evaluation.metrics.robustness import generate_demo_robustness_report
|
| 538 |
report = generate_demo_robustness_report(engine_names=["tesseract"], seed=1)
|
| 539 |
assert isinstance(report.summary, dict)
|
| 540 |
assert len(report.summary) > 0
|
| 541 |
|
| 542 |
def test_corpus_name_preserved(self):
|
| 543 |
+
from picarones.evaluation.metrics.robustness import RobustnessReport
|
| 544 |
report = RobustnessReport(["e"], "Mon Corpus", ["noise"], [])
|
| 545 |
assert report.corpus_name == "Mon Corpus"
|
| 546 |
|
| 547 |
def test_engine_names_list(self):
|
| 548 |
+
from picarones.evaluation.metrics.robustness import RobustnessReport
|
| 549 |
report = RobustnessReport(["tesseract", "pero_ocr"], "C", [], [])
|
| 550 |
assert "tesseract" in report.engine_names
|
| 551 |
assert "pero_ocr" in report.engine_names
|
|
|
|
| 558 |
class TestRobustnessAnalyzer:
|
| 559 |
|
| 560 |
def test_import(self):
|
| 561 |
+
from picarones.evaluation.metrics.robustness import RobustnessAnalyzer
|
| 562 |
assert RobustnessAnalyzer is not None
|
| 563 |
|
| 564 |
def test_init_single_engine(self):
|
| 565 |
+
from picarones.evaluation.metrics.robustness import RobustnessAnalyzer
|
| 566 |
mock_engine = type("E", (), {"name": "tesseract"})()
|
| 567 |
analyzer = RobustnessAnalyzer(mock_engine)
|
| 568 |
assert len(analyzer.engines) == 1
|
| 569 |
|
| 570 |
def test_init_list_engines(self):
|
| 571 |
+
from picarones.evaluation.metrics.robustness import RobustnessAnalyzer
|
| 572 |
engines = [
|
| 573 |
type("E", (), {"name": "tesseract"})(),
|
| 574 |
type("E", (), {"name": "pero_ocr"})(),
|
|
|
|
| 577 |
assert len(analyzer.engines) == 2
|
| 578 |
|
| 579 |
def test_default_degradation_types(self):
|
| 580 |
+
from picarones.evaluation.metrics.robustness import RobustnessAnalyzer, ALL_DEGRADATION_TYPES
|
| 581 |
e = type("E", (), {"name": "e"})()
|
| 582 |
analyzer = RobustnessAnalyzer(e)
|
| 583 |
assert set(analyzer.degradation_types) == set(ALL_DEGRADATION_TYPES)
|
| 584 |
|
| 585 |
def test_custom_degradation_types(self):
|
| 586 |
+
from picarones.evaluation.metrics.robustness import RobustnessAnalyzer
|
| 587 |
e = type("E", (), {"name": "e"})()
|
| 588 |
analyzer = RobustnessAnalyzer(e, degradation_types=["noise", "blur"])
|
| 589 |
assert analyzer.degradation_types == ["noise", "blur"]
|
| 590 |
|
| 591 |
def test_find_critical_level_found(self):
|
| 592 |
+
from picarones.evaluation.metrics.robustness import RobustnessAnalyzer
|
| 593 |
levels = [0, 5, 15, 30]
|
| 594 |
cer_values = [0.10, 0.15, 0.22, 0.35]
|
| 595 |
critical = RobustnessAnalyzer._find_critical_level(levels, cer_values, 0.20)
|
| 596 |
assert critical == 15
|
| 597 |
|
| 598 |
def test_find_critical_level_none(self):
|
| 599 |
+
from picarones.evaluation.metrics.robustness import RobustnessAnalyzer
|
| 600 |
levels = [0, 5, 15]
|
| 601 |
cer_values = [0.05, 0.10, 0.15]
|
| 602 |
critical = RobustnessAnalyzer._find_critical_level(levels, cer_values, 0.20)
|
| 603 |
assert critical is None
|
| 604 |
|
| 605 |
def test_build_summary(self):
|
| 606 |
+
from picarones.evaluation.metrics.robustness import RobustnessAnalyzer, DegradationCurve
|
| 607 |
curves = [
|
| 608 |
DegradationCurve("tesseract", "noise", [0, 5], ["o", "n5"], [0.10, 0.20]),
|
| 609 |
DegradationCurve("pero_ocr", "noise", [0, 5], ["o", "n5"], [0.07, 0.12]),
|
|
|
|
| 620 |
class TestGenerateDemoRobustness:
|
| 621 |
|
| 622 |
def test_import(self):
|
| 623 |
+
from picarones.evaluation.metrics.robustness import generate_demo_robustness_report
|
| 624 |
assert callable(generate_demo_robustness_report)
|
| 625 |
|
| 626 |
def test_returns_report(self):
|
| 627 |
+
from picarones.evaluation.metrics.robustness import generate_demo_robustness_report, RobustnessReport
|
| 628 |
report = generate_demo_robustness_report()
|
| 629 |
assert isinstance(report, RobustnessReport)
|
| 630 |
|
| 631 |
def test_default_engines(self):
|
| 632 |
+
from picarones.evaluation.metrics.robustness import generate_demo_robustness_report
|
| 633 |
report = generate_demo_robustness_report()
|
| 634 |
assert "tesseract" in report.engine_names
|
| 635 |
assert "pero_ocr" in report.engine_names
|
| 636 |
|
| 637 |
def test_custom_engines(self):
|
| 638 |
+
from picarones.evaluation.metrics.robustness import generate_demo_robustness_report
|
| 639 |
report = generate_demo_robustness_report(engine_names=["moteur_custom"])
|
| 640 |
assert "moteur_custom" in report.engine_names
|
| 641 |
|
| 642 |
def test_all_degradation_types_present(self):
|
| 643 |
+
from picarones.evaluation.metrics.robustness import generate_demo_robustness_report, ALL_DEGRADATION_TYPES
|
| 644 |
report = generate_demo_robustness_report()
|
| 645 |
types_in_report = {c.degradation_type for c in report.curves}
|
| 646 |
assert types_in_report == set(ALL_DEGRADATION_TYPES)
|
| 647 |
|
| 648 |
def test_cer_values_in_range(self):
|
| 649 |
+
from picarones.evaluation.metrics.robustness import generate_demo_robustness_report
|
| 650 |
report = generate_demo_robustness_report(seed=99)
|
| 651 |
for curve in report.curves:
|
| 652 |
for cer in curve.cer_values:
|
|
|
|
| 655 |
|
| 656 |
def test_cer_increases_with_degradation(self):
|
| 657 |
"""Pour la plupart des types, le CER doit augmenter avec le niveau de dégradation."""
|
| 658 |
+
from picarones.evaluation.metrics.robustness import generate_demo_robustness_report
|
| 659 |
report = generate_demo_robustness_report(seed=42)
|
| 660 |
for curve in report.curves:
|
| 661 |
valid = [c for c in curve.cer_values if c is not None]
|
|
|
|
| 667 |
)
|
| 668 |
|
| 669 |
def test_reproducible_with_seed(self):
|
| 670 |
+
from picarones.evaluation.metrics.robustness import generate_demo_robustness_report
|
| 671 |
r1 = generate_demo_robustness_report(seed=7)
|
| 672 |
r2 = generate_demo_robustness_report(seed=7)
|
| 673 |
assert r1.curves[0].cer_values == r2.curves[0].cer_values
|
| 674 |
|
| 675 |
def test_summary_contains_most_robust(self):
|
| 676 |
+
from picarones.evaluation.metrics.robustness import generate_demo_robustness_report
|
| 677 |
report = generate_demo_robustness_report()
|
| 678 |
assert any("most_robust" in k for k in report.summary)
|
| 679 |
|
| 680 |
def test_json_serializable(self):
|
| 681 |
+
from picarones.evaluation.metrics.robustness import generate_demo_robustness_report
|
| 682 |
report = generate_demo_robustness_report()
|
| 683 |
d = report.as_dict()
|
| 684 |
json_str = json.dumps(d, ensure_ascii=False)
|