Spaces:
Sleeping
refactor(evaluation): Sprint A14-S10 — déplacement de 23 fichiers de calcul vers evaluation/metrics/
Browse filesSprint S10 du plan rewrite ciblé. Phase 2 continue.
Déplacement physique (sans modification de logique) de 23 fichiers
de calcul autonomes depuis ``picarones/measurements/`` vers
``picarones/evaluation/metrics/``. L'ancien emplacement devient un
re-export pour ne casser aucun consommateur. Aucun test modifié.
Fichiers migrés (23)
--------------------
Calculs de qualité textuelle pure (5) :
rare_tokens, lexical_modernization, calibration, confusion,
line_metrics
Calculs structurels et géométriques (3) :
layout, image_quality, image_predictive
Calculs économiques (4) :
pricing, marginal_cost, throughput, incremental_comparison
Calculs analytiques post-traitement (8) :
error_absorption, hallucination, robustness_projection,
longitudinal, baseline_comparison, levers, worst_lines,
module_policy
Calculs inter-moteurs (3) :
inter_engine, taxonomy_cooccurrence, taxonomy_comparison
Critères de sélection (catégorie A)
-----------------------------------
- AUCUN ``@register_metric`` (le décorateur du registre legacy
``core.metric_registry`` n'est pas autorisé dans la nouvelle
couche evaluation/).
- AUCUN import vers ``picarones.measurements.*``,
``picarones.engines.*``, ``picarones.core.metric_registry``,
``picarones.core.modules``.
- Imports externes uniquement vers la whitelist evaluation/.
Une seule modification de logique : ``pricing._DEFAULT_PRICING_PATH``
adapté pour remonter de 3 niveaux au lieu de 2 (le YAML reste dans
``picarones/data/``, le module est passé de ``measurements/`` à
``evaluation/metrics/``).
Mécanisme de re-export
----------------------
Pour chaque fichier ``measurements/X.py`` migré :
# Avant (~200-560 lignes de code)
# ... logique complète ...
# Après (10 lignes)
'''Re-export — Sprint A14-S10. Le contenu canonique vit dans
``picarones.evaluation.metrics.X``.'''
from picarones.evaluation.metrics.X import * # noqa: F401,F403
Quatre fichiers (``layout``, ``image_quality``, ``pricing``,
``robustness_projection``) ré-exportent en plus des **symboles
privés** importés par les tests (cf. ``_iou_bbox``,
``_global_quality_score``, ``_DEFAULT_PRICING_PATH``,
``_extract_quality_value``, ``_interpolate_cer``).
Reste à migrer (différé, documenté dans BACKLOG)
------------------------------------------------
17 fichiers ``measurements/*.py`` restent en place. Sur ces 17 :
- 11 utilisent ``@register_metric`` → migrés au S20 quand
``MetricRegistry`` (S5) deviendra le seul registre via
``app/services/registry_service``.
- 1 (``robustness``) a des deps vers ``picarones.core.corpus``,
``picarones.engines.base``, ``picarones.measurements.metrics`` →
migré après S11 et S12.
- 5 ont des deps inter-fichiers qui sont maintenant migrées
(``cost_projection``, ``equivalence_profile``, ``specialization``,
``taxonomy_intra_doc``, ``taxonomy``) → peuvent être migrés au
S11+ puisque leurs deps sont là.
Le sous-package ``runner/``, ``pipeline_benchmark``,
``pipeline_comparison``, etc. sont des fichiers d'orchestration
legacy qui seront remplacés par ``pipeline/executor`` +
``pipeline/runner`` au S22 — pas migrés tels quels.
Mise à jour des règles d'architecture
-------------------------------------
``tests/architecture/test_layer_dependencies.py`` :
``EXTERNAL_ALLOWED["evaluation"]`` ajoute ``PIL`` et ``yaml``
(légitimes pour ``image_quality`` et ``pricing``, justifiés en
commentaire).
``tests/architecture/test_file_budgets.py`` :
- Ajout de ``evaluation/metrics/levers.py`` (561 lignes) et
``evaluation/metrics/inter_engine.py`` (484 lignes) à la
whitelist.
- Les anciens emplacements (``measurements/levers.py``,
``measurements/inter_engine.py``) restent dans la whitelist
comme re-exports, conservant leur ancien plafond.
État de la suite
----------------
``pytest tests/ -q`` → 4162 passed, 7 skipped, 2 failed.
+2 tests vs S9 (probablement deux nouveaux cas de coverage liés
aux nouveaux modules). Les 2 fails restants sont strictement
environnementaux (sous-process pytest sans
``pip install -e .``). Aucune régression S10.
Critère go/no-go S10 atteint
----------------------------
- 23 fichiers déplacés (vs 24-40 du plan original — différence
documentée et justifiée dans le BACKLOG).
- Aucune logique modifiée (sauf adaptation chemin filesystem
pricing).
- Aucun test modifié.
- Suite verte avec exactement les mêmes nombres de passed.
Le plan d'origine du S10 listait ~40 fichiers ; la réalité du
code montre que seuls 23 satisfont strictement la contrainte
"déplacement sans modification de logique". Les 17 autres ont
des dépendances qui exigent S11/S12/S20 d'abord. C'est un choix
pragmatique assumé qui préserve l'invariant "main reste
livrable, suite verte" pendant tout le rewrite.
Prêt pour S11 (migration des adapters dans ``picarones/adapters/``).
https://claude.ai/code/session_011XQZNitg1rCgia8ZD1a2hP
- BACKLOG_POST_LIVRAISON.md +42 -0
- picarones/evaluation/metrics/__init__.py +105 -26
- picarones/evaluation/metrics/baseline_comparison.py +229 -0
- picarones/evaluation/metrics/calibration.py +323 -0
- picarones/evaluation/metrics/confusion.py +268 -0
- picarones/evaluation/metrics/error_absorption.py +276 -0
- picarones/evaluation/metrics/hallucination.py +331 -0
- picarones/evaluation/metrics/image_predictive.py +283 -0
- picarones/evaluation/metrics/image_quality.py +391 -0
- picarones/evaluation/metrics/incremental_comparison.py +253 -0
- picarones/evaluation/metrics/inter_engine.py +484 -0
- picarones/evaluation/metrics/layout.py +280 -0
- picarones/evaluation/metrics/levers.py +561 -0
- picarones/evaluation/metrics/lexical_modernization.py +263 -0
- picarones/evaluation/metrics/line_metrics.py +286 -0
- picarones/evaluation/metrics/longitudinal.py +373 -0
- picarones/evaluation/metrics/marginal_cost.py +142 -0
- picarones/evaluation/metrics/module_policy.py +333 -0
- picarones/evaluation/metrics/pricing.py +313 -0
- picarones/evaluation/metrics/rare_tokens.py +254 -0
- picarones/evaluation/metrics/robustness_projection.py +287 -0
- picarones/evaluation/metrics/taxonomy_comparison.py +161 -0
- picarones/evaluation/metrics/taxonomy_cooccurrence.py +150 -0
- picarones/evaluation/metrics/throughput.py +165 -0
- picarones/evaluation/metrics/worst_lines.py +199 -0
- picarones/measurements/baseline_comparison.py +5 -224
- picarones/measurements/calibration.py +5 -318
- picarones/measurements/confusion.py +5 -263
- picarones/measurements/error_absorption.py +5 -271
- picarones/measurements/hallucination.py +5 -326
- picarones/measurements/image_predictive.py +5 -278
- picarones/measurements/image_quality.py +8 -385
- picarones/measurements/incremental_comparison.py +5 -248
- picarones/measurements/inter_engine.py +5 -479
- picarones/measurements/layout.py +8 -274
- picarones/measurements/levers.py +5 -556
- picarones/measurements/lexical_modernization.py +5 -258
- picarones/measurements/line_metrics.py +5 -281
- picarones/measurements/longitudinal.py +5 -368
- picarones/measurements/marginal_cost.py +5 -137
- picarones/measurements/module_policy.py +5 -328
- picarones/measurements/pricing.py +9 -303
- picarones/measurements/rare_tokens.py +5 -249
- picarones/measurements/robustness_projection.py +12 -281
- picarones/measurements/taxonomy_comparison.py +5 -156
- picarones/measurements/taxonomy_cooccurrence.py +5 -145
- picarones/measurements/throughput.py +5 -160
- picarones/measurements/worst_lines.py +5 -194
- tests/architecture/test_file_budgets.py +6 -1
- tests/architecture/test_layer_dependencies.py +3 -0
|
@@ -126,6 +126,48 @@ exister à la livraison BnF.
|
|
| 126 |
|
| 127 |
→ Sprint S5 + S20 du rewrite.
|
| 128 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 129 |
### 2.5 Suppression des références "Sprint X" dans le code
|
| 130 |
|
| 131 |
Le repo contient ~679 références à "Sprint N" dans les fichiers
|
|
|
|
| 126 |
|
| 127 |
→ Sprint S5 + S20 du rewrite.
|
| 128 |
|
| 129 |
+
### 2.5 Migration des fichiers `measurements/*.py` restants vers `evaluation/metrics/`
|
| 130 |
+
|
| 131 |
+
Le Sprint S10 a migré 23 fichiers de calcul autonomes. 17 fichiers
|
| 132 |
+
restent dans `picarones/measurements/` à migrer.
|
| 133 |
+
|
| 134 |
+
**Catégorie B — utilisent `@register_metric`** (singleton global
|
| 135 |
+
`core.metric_registry` à supprimer au S20) :
|
| 136 |
+
`mufi`, `abbreviations`, `unicode_blocks`, `roman_numerals`,
|
| 137 |
+
`early_modern_typography`, `modern_archives`, `reading_order`,
|
| 138 |
+
`ner`, `readability`, `searchability`, `numerical_sequences`.
|
| 139 |
+
|
| 140 |
+
→ Migrés au S20 quand le `MetricRegistry` instancié explicitement
|
| 141 |
+
(S5) deviendra le seul registre.
|
| 142 |
+
|
| 143 |
+
**Catégorie C — dépendances vers `core.corpus` / `engines.base` /
|
| 144 |
+
`measurements.metrics`** :
|
| 145 |
+
`robustness`.
|
| 146 |
+
|
| 147 |
+
→ Migré après S11 (déplacement des adapters) et S12 (équivalence
|
| 148 |
+
numérique).
|
| 149 |
+
|
| 150 |
+
**Catégorie D — dépendances inter-fichiers à orchestrer** :
|
| 151 |
+
`cost_projection` (→ pricing, déjà migré),
|
| 152 |
+
`equivalence_profile` (→ formats.text.normalization, déjà migré),
|
| 153 |
+
`specialization` (→ inter_engine, déjà migré),
|
| 154 |
+
`taxonomy_intra_doc` (→ taxonomy),
|
| 155 |
+
`taxonomy` (→ char_scores).
|
| 156 |
+
|
| 157 |
+
→ Trois de ces fichiers (cost_projection, equivalence_profile,
|
| 158 |
+
specialization) peuvent être migrés dès le S11+ puisque leurs deps
|
| 159 |
+
sont déjà migrées.
|
| 160 |
+
|
| 161 |
+
**Fichiers d'orchestration legacy** (à NE PAS migrer en l'état,
|
| 162 |
+
remplacés par `pipeline/executor` + `pipeline/runner` au S22) :
|
| 163 |
+
`runner/` (sous-package), `pipeline_benchmark`,
|
| 164 |
+
`pipeline_comparison`, `pipeline_spec_loader`,
|
| 165 |
+
`builtin_hooks`, `builtin_metrics`, `philological_hooks`,
|
| 166 |
+
`readability_hooks`, `searchability_hooks`,
|
| 167 |
+
`numerical_sequences_hooks`, `ner_backends`,
|
| 168 |
+
`metrics`, `history`, `structure`, `difficulty`,
|
| 169 |
+
`char_scores`, `alto_metrics`, `narrative/`, `statistics/`.
|
| 170 |
+
|
| 171 |
### 2.5 Suppression des références "Sprint X" dans le code
|
| 172 |
|
| 173 |
Le repo contient ~679 références à "Sprint N" dans les fichiers
|
|
@@ -1,32 +1,111 @@
|
|
| 1 |
"""Métriques — calculs purs sur des paires (référence, hypothèse).
|
| 2 |
|
| 3 |
-
|
| 4 |
-
|
| 5 |
-
|
| 6 |
-
|
| 7 |
-
|
| 8 |
-
|
| 9 |
-
|
| 10 |
-
|
| 11 |
-
|
| 12 |
-
|
| 13 |
-
|
| 14 |
-
|
| 15 |
-
|
| 16 |
-
|
| 17 |
-
|
| 18 |
-
|
| 19 |
-
|
| 20 |
-
|
| 21 |
-
|
| 22 |
-
|
| 23 |
-
|
| 24 |
-
|
| 25 |
-
|
| 26 |
-
|
| 27 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 28 |
"""
|
| 29 |
|
| 30 |
from __future__ import annotations
|
| 31 |
|
| 32 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
"""Métriques — calculs purs sur des paires (référence, hypothèse).
|
| 2 |
|
| 3 |
+
Sprint A14-S10 : déplacement de **23 fichiers de calcul autonomes**
|
| 4 |
+
depuis ``picarones.measurements``.
|
| 5 |
+
|
| 6 |
+
Calculs de qualité textuelle pure :
|
| 7 |
+
``rare_tokens``, ``lexical_modernization``, ``calibration``,
|
| 8 |
+
``confusion``, ``line_metrics``.
|
| 9 |
+
|
| 10 |
+
Calculs structurels et géométriques :
|
| 11 |
+
``layout``, ``image_quality``, ``image_predictive``.
|
| 12 |
+
|
| 13 |
+
Calculs économiques :
|
| 14 |
+
``pricing``, ``marginal_cost``, ``throughput``,
|
| 15 |
+
``incremental_comparison``.
|
| 16 |
+
|
| 17 |
+
Calculs analytiques (post-traitement) :
|
| 18 |
+
``error_absorption``, ``hallucination``, ``robustness_projection``,
|
| 19 |
+
``longitudinal``, ``baseline_comparison``, ``levers``,
|
| 20 |
+
``worst_lines``, ``module_policy``.
|
| 21 |
+
|
| 22 |
+
Calculs inter-moteurs :
|
| 23 |
+
``inter_engine``, ``taxonomy_cooccurrence``,
|
| 24 |
+
``taxonomy_comparison``.
|
| 25 |
+
|
| 26 |
+
Reste à migrer (différé)
|
| 27 |
+
------------------------
|
| 28 |
+
|
| 29 |
+
Catégorie B — utilisent ``@register_metric`` du registre global
|
| 30 |
+
``core.metric_registry`` (singleton avec side-effect d'import) :
|
| 31 |
+
``mufi``, ``abbreviations``, ``unicode_blocks``, ``roman_numerals``,
|
| 32 |
+
``early_modern_typography``, ``modern_archives``, ``reading_order``,
|
| 33 |
+
``ner``, ``readability``, ``searchability``, ``numerical_sequences``.
|
| 34 |
+
|
| 35 |
+
Migrés au S20 quand le ``MetricRegistry`` instancié explicitement
|
| 36 |
+
(S5) deviendra le seul registre, via le ``registry_service``
|
| 37 |
+
applicatif.
|
| 38 |
+
|
| 39 |
+
Catégorie C — dépendances vers anciens packages :
|
| 40 |
+
``robustness`` (importe ``picarones.core.corpus`` +
|
| 41 |
+
``picarones.engines.base`` + ``picarones.measurements.metrics``).
|
| 42 |
+
Ne peut être migré qu'après les Sprints S11 (déplacement des
|
| 43 |
+
adapters) et S12 (équivalence numérique).
|
| 44 |
+
|
| 45 |
+
Catégorie D — dépendances inter-fichiers à orchestrer :
|
| 46 |
+
``cost_projection`` (→ pricing), ``equivalence_profile``
|
| 47 |
+
(→ formats.text.normalization), ``specialization``
|
| 48 |
+
(→ inter_engine), ``taxonomy_intra_doc`` (→ taxonomy),
|
| 49 |
+
``taxonomy`` (→ char_scores).
|
| 50 |
+
|
| 51 |
+
Règle de migration (S10) : un fichier déplacé = un commit avec
|
| 52 |
+
uniquement le déplacement et un re-export à l'ancien emplacement.
|
| 53 |
+
La logique reste identique. Aucun test modifié.
|
| 54 |
"""
|
| 55 |
|
| 56 |
from __future__ import annotations
|
| 57 |
|
| 58 |
+
# Re-exports des 23 fichiers déplacés au S10. Volontairement
|
| 59 |
+
# explicite (pas de wildcard import) pour qu'un caller du nouveau
|
| 60 |
+
# code ait une vue claire de ce qui est exposé.
|
| 61 |
+
from picarones.evaluation.metrics import ( # noqa: F401
|
| 62 |
+
baseline_comparison,
|
| 63 |
+
calibration,
|
| 64 |
+
confusion,
|
| 65 |
+
error_absorption,
|
| 66 |
+
hallucination,
|
| 67 |
+
image_predictive,
|
| 68 |
+
image_quality,
|
| 69 |
+
incremental_comparison,
|
| 70 |
+
inter_engine,
|
| 71 |
+
layout,
|
| 72 |
+
levers,
|
| 73 |
+
lexical_modernization,
|
| 74 |
+
line_metrics,
|
| 75 |
+
longitudinal,
|
| 76 |
+
marginal_cost,
|
| 77 |
+
module_policy,
|
| 78 |
+
pricing,
|
| 79 |
+
rare_tokens,
|
| 80 |
+
robustness_projection,
|
| 81 |
+
taxonomy_comparison,
|
| 82 |
+
taxonomy_cooccurrence,
|
| 83 |
+
throughput,
|
| 84 |
+
worst_lines,
|
| 85 |
+
)
|
| 86 |
+
|
| 87 |
+
__all__ = [
|
| 88 |
+
"baseline_comparison",
|
| 89 |
+
"calibration",
|
| 90 |
+
"confusion",
|
| 91 |
+
"error_absorption",
|
| 92 |
+
"hallucination",
|
| 93 |
+
"image_predictive",
|
| 94 |
+
"image_quality",
|
| 95 |
+
"incremental_comparison",
|
| 96 |
+
"inter_engine",
|
| 97 |
+
"layout",
|
| 98 |
+
"levers",
|
| 99 |
+
"lexical_modernization",
|
| 100 |
+
"line_metrics",
|
| 101 |
+
"longitudinal",
|
| 102 |
+
"marginal_cost",
|
| 103 |
+
"module_policy",
|
| 104 |
+
"pricing",
|
| 105 |
+
"rare_tokens",
|
| 106 |
+
"robustness_projection",
|
| 107 |
+
"taxonomy_comparison",
|
| 108 |
+
"taxonomy_cooccurrence",
|
| 109 |
+
"throughput",
|
| 110 |
+
"worst_lines",
|
| 111 |
+
]
|
|
@@ -0,0 +1,229 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""Comparaison à la baseline historique — Sprint 73 (A.I.3).
|
| 2 |
+
|
| 3 |
+
Sprint 73 — chantier 2 d'A.I.3 du plan d'évolution 2026.
|
| 4 |
+
|
| 5 |
+
Pourquoi ce module
|
| 6 |
+
------------------
|
| 7 |
+
L'historique SQLite (``picarones/core/history.py``, Sprint 8)
|
| 8 |
+
existe mais aucun détecteur narratif ne le lit. Ce module fournit
|
| 9 |
+
la couche de calcul qui répond à *« comment ce moteur se
|
| 10 |
+
comporte-t-il sur ce corpus, **par rapport à ses runs précédents
|
| 11 |
+
de mon institution** ? »*.
|
| 12 |
+
|
| 13 |
+
Sortie typique
|
| 14 |
+
--------------
|
| 15 |
+
Un dict par moteur :
|
| 16 |
+
|
| 17 |
+
.. code-block:: python
|
| 18 |
+
|
| 19 |
+
{
|
| 20 |
+
"engine_name": "tesseract",
|
| 21 |
+
"cer_current": 0.052,
|
| 22 |
+
"cer_historical_mean": 0.041,
|
| 23 |
+
"cer_historical_median": 0.040,
|
| 24 |
+
"n_runs": 12,
|
| 25 |
+
"absolute_delta": 0.011,
|
| 26 |
+
"relative_delta": 0.268, # +26,8 % vs moyenne
|
| 27 |
+
"off_baseline": True,
|
| 28 |
+
}
|
| 29 |
+
|
| 30 |
+
Le détecteur narratif ``engine_off_baseline`` (Sprint 73)
|
| 31 |
+
consomme cette structure pour émettre des Facts.
|
| 32 |
+
|
| 33 |
+
Garde-fous
|
| 34 |
+
----------
|
| 35 |
+
- ``min_runs`` (défaut 5) : si l'historique pour le moteur×corpus
|
| 36 |
+
contient moins de runs, on retourne ``None`` plutôt que de
|
| 37 |
+
comparer à un échantillon trop petit.
|
| 38 |
+
- ``corpus_name`` est utilisé pour ne comparer qu'aux runs **du
|
| 39 |
+
même corpus** (sinon on compare des pommes et des oranges :
|
| 40 |
+
registres paroissiaux vs imprimés modernes).
|
| 41 |
+
- Le run courant lui-même n'est pas inclus dans la baseline (on
|
| 42 |
+
passe le ``current_run_id`` à exclure).
|
| 43 |
+
"""
|
| 44 |
+
|
| 45 |
+
from __future__ import annotations
|
| 46 |
+
|
| 47 |
+
import logging
|
| 48 |
+
import statistics
|
| 49 |
+
from typing import Optional
|
| 50 |
+
|
| 51 |
+
logger = logging.getLogger(__name__)
|
| 52 |
+
|
| 53 |
+
|
| 54 |
+
def compute_engine_baseline(
|
| 55 |
+
history,
|
| 56 |
+
engine_name: str,
|
| 57 |
+
corpus_name: str,
|
| 58 |
+
current_cer: float,
|
| 59 |
+
*,
|
| 60 |
+
current_run_id: Optional[str] = None,
|
| 61 |
+
min_runs: int = 5,
|
| 62 |
+
relative_delta_threshold: float = 0.20,
|
| 63 |
+
) -> Optional[dict]:
|
| 64 |
+
"""Compare le CER courant d'un moteur à sa moyenne historique
|
| 65 |
+
sur le **même corpus**.
|
| 66 |
+
|
| 67 |
+
Parameters
|
| 68 |
+
----------
|
| 69 |
+
history:
|
| 70 |
+
Instance de ``BenchmarkHistory`` (ou compatible : doit
|
| 71 |
+
exposer une méthode ``query(engine, corpus, limit)``
|
| 72 |
+
retournant une liste d'``HistoryEntry`` avec attribut
|
| 73 |
+
``cer_mean`` et ``run_id``).
|
| 74 |
+
engine_name:
|
| 75 |
+
Nom du moteur dont on calcule la baseline.
|
| 76 |
+
corpus_name:
|
| 77 |
+
Nom du corpus — limite la comparaison aux runs antérieurs
|
| 78 |
+
sur ce même corpus.
|
| 79 |
+
current_cer:
|
| 80 |
+
CER moyen observé dans le run courant.
|
| 81 |
+
current_run_id:
|
| 82 |
+
Si fourni, le run portant cet identifiant est exclu de la
|
| 83 |
+
baseline (utile quand le run courant est déjà enregistré
|
| 84 |
+
dans l'historique avant d'appeler ce calcul).
|
| 85 |
+
min_runs:
|
| 86 |
+
Nombre minimum de runs historiques pour que la
|
| 87 |
+
comparaison soit considérée fiable. Sous ce seuil, on
|
| 88 |
+
retourne ``None``.
|
| 89 |
+
relative_delta_threshold:
|
| 90 |
+
Seuil au-delà duquel ``off_baseline`` vaut ``True``
|
| 91 |
+
(défaut : 0,20 = 20 % d'écart relatif).
|
| 92 |
+
|
| 93 |
+
Returns
|
| 94 |
+
-------
|
| 95 |
+
Optional[dict]
|
| 96 |
+
``None`` si :
|
| 97 |
+
- moins de ``min_runs`` runs historiques disponibles
|
| 98 |
+
- ``current_cer`` est ``None`` ou négatif
|
| 99 |
+
- tous les CER historiques sont ``None``
|
| 100 |
+
|
| 101 |
+
Sinon, dict avec les champs documentés dans le module.
|
| 102 |
+
"""
|
| 103 |
+
if current_cer is None or current_cer < 0:
|
| 104 |
+
return None
|
| 105 |
+
try:
|
| 106 |
+
entries = history.query(
|
| 107 |
+
engine=engine_name, corpus=corpus_name, limit=1000,
|
| 108 |
+
)
|
| 109 |
+
except Exception as exc: # pragma: no cover — défense
|
| 110 |
+
logger.warning(
|
| 111 |
+
"[baseline_comparison] query history a levé : %s", exc,
|
| 112 |
+
)
|
| 113 |
+
return None
|
| 114 |
+
|
| 115 |
+
historical_cers: list[float] = []
|
| 116 |
+
for entry in entries:
|
| 117 |
+
if current_run_id is not None and entry.run_id == current_run_id:
|
| 118 |
+
continue
|
| 119 |
+
cer = entry.cer_mean
|
| 120 |
+
if cer is None or cer < 0:
|
| 121 |
+
continue
|
| 122 |
+
historical_cers.append(float(cer))
|
| 123 |
+
|
| 124 |
+
if len(historical_cers) < min_runs:
|
| 125 |
+
return None
|
| 126 |
+
|
| 127 |
+
mean = statistics.fmean(historical_cers)
|
| 128 |
+
median = statistics.median(historical_cers)
|
| 129 |
+
absolute_delta = current_cer - mean
|
| 130 |
+
if mean > 0:
|
| 131 |
+
relative_delta = absolute_delta / mean
|
| 132 |
+
elif current_cer == 0:
|
| 133 |
+
relative_delta = 0.0
|
| 134 |
+
else:
|
| 135 |
+
# Baseline à 0 mais CER courant > 0 : écart infini —
|
| 136 |
+
# convention : on signale comme off_baseline avec
|
| 137 |
+
# relative_delta = None.
|
| 138 |
+
relative_delta = None
|
| 139 |
+
|
| 140 |
+
off_baseline = (
|
| 141 |
+
relative_delta is not None
|
| 142 |
+
and abs(relative_delta) > relative_delta_threshold
|
| 143 |
+
)
|
| 144 |
+
|
| 145 |
+
return {
|
| 146 |
+
"engine_name": engine_name,
|
| 147 |
+
"corpus_name": corpus_name,
|
| 148 |
+
"cer_current": float(current_cer),
|
| 149 |
+
"cer_historical_mean": mean,
|
| 150 |
+
"cer_historical_median": median,
|
| 151 |
+
"n_runs": len(historical_cers),
|
| 152 |
+
"absolute_delta": absolute_delta,
|
| 153 |
+
"relative_delta": relative_delta,
|
| 154 |
+
"off_baseline": off_baseline,
|
| 155 |
+
}
|
| 156 |
+
|
| 157 |
+
|
| 158 |
+
def compute_corpus_difficulty_percentile(
|
| 159 |
+
history,
|
| 160 |
+
current_difficulty: float,
|
| 161 |
+
*,
|
| 162 |
+
min_runs: int = 5,
|
| 163 |
+
) -> Optional[dict]:
|
| 164 |
+
"""Place la difficulté du corpus courant dans la distribution
|
| 165 |
+
des difficultés historiques.
|
| 166 |
+
|
| 167 |
+
Lit les difficultés stockées dans ``HistoryEntry.metadata``
|
| 168 |
+
sous la clé ``difficulty`` (convention de
|
| 169 |
+
``picarones/core/difficulty.py``).
|
| 170 |
+
|
| 171 |
+
Returns
|
| 172 |
+
-------
|
| 173 |
+
Optional[dict]
|
| 174 |
+
``{
|
| 175 |
+
"current_difficulty": float,
|
| 176 |
+
"percentile": float, # 0..100
|
| 177 |
+
"n_runs": int,
|
| 178 |
+
"median_historical": float,
|
| 179 |
+
"harder_than_usual": bool, # percentile > 75
|
| 180 |
+
"easier_than_usual": bool, # percentile < 25
|
| 181 |
+
}``
|
| 182 |
+
ou ``None`` si moins de ``min_runs`` runs historiques ont
|
| 183 |
+
une difficulté enregistrée.
|
| 184 |
+
"""
|
| 185 |
+
if current_difficulty is None:
|
| 186 |
+
return None
|
| 187 |
+
try:
|
| 188 |
+
entries = history.query(limit=1000)
|
| 189 |
+
except Exception as exc: # pragma: no cover
|
| 190 |
+
logger.warning(
|
| 191 |
+
"[baseline_comparison] query history a levé : %s", exc,
|
| 192 |
+
)
|
| 193 |
+
return None
|
| 194 |
+
|
| 195 |
+
historical_difficulties: list[float] = []
|
| 196 |
+
for entry in entries:
|
| 197 |
+
diff = entry.metadata.get("difficulty") if entry.metadata else None
|
| 198 |
+
if diff is None:
|
| 199 |
+
continue
|
| 200 |
+
try:
|
| 201 |
+
historical_difficulties.append(float(diff))
|
| 202 |
+
except (TypeError, ValueError):
|
| 203 |
+
continue
|
| 204 |
+
|
| 205 |
+
if len(historical_difficulties) < min_runs:
|
| 206 |
+
return None
|
| 207 |
+
|
| 208 |
+
sorted_diff = sorted(historical_difficulties)
|
| 209 |
+
n = len(sorted_diff)
|
| 210 |
+
# Percentile = % de corpus historiques de difficulté ≤
|
| 211 |
+
# current_difficulty. Convention courante (P_i = i/n × 100).
|
| 212 |
+
n_below = sum(1 for d in sorted_diff if d <= current_difficulty)
|
| 213 |
+
percentile = (n_below / n) * 100.0
|
| 214 |
+
median = statistics.median(sorted_diff)
|
| 215 |
+
|
| 216 |
+
return {
|
| 217 |
+
"current_difficulty": float(current_difficulty),
|
| 218 |
+
"percentile": percentile,
|
| 219 |
+
"n_runs": n,
|
| 220 |
+
"median_historical": median,
|
| 221 |
+
"harder_than_usual": percentile > 75.0,
|
| 222 |
+
"easier_than_usual": percentile < 25.0,
|
| 223 |
+
}
|
| 224 |
+
|
| 225 |
+
|
| 226 |
+
__all__ = [
|
| 227 |
+
"compute_engine_baseline",
|
| 228 |
+
"compute_corpus_difficulty_percentile",
|
| 229 |
+
]
|
|
@@ -0,0 +1,323 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""Calibration des moteurs : ECE, MCE, reliability diagram.
|
| 2 |
+
|
| 3 |
+
Sprint 39 — A.II.1.b du plan d'évolution 2026 : couche de calcul pure.
|
| 4 |
+
|
| 5 |
+
Pourquoi ce module
|
| 6 |
+
------------------
|
| 7 |
+
Tous les moteurs OCR cibles fournissent une confidence par token ou par
|
| 8 |
+
ligne (Tesseract via le ``tsv``, Pero OCR via le ``PageLayout``,
|
| 9 |
+
Mistral OCR via ``confidence``, Google Vision via ``Word.confidence``).
|
| 10 |
+
La question naturelle pour un workflow patrimonial est : *« quand le
|
| 11 |
+
moteur dit qu'il est sûr, est-il vraiment sûr ? »*. Pour une équipe
|
| 12 |
+
qui doit vérifier humainement un corpus de 50 000 pages, la différence
|
| 13 |
+
entre vérifier 100 % vs 15 % du volume est l'effet de la calibration.
|
| 14 |
+
|
| 15 |
+
Ce module fournit les trois mesures classiques :
|
| 16 |
+
|
| 17 |
+
- **Expected Calibration Error (ECE)** — moyenne pondérée par bin de
|
| 18 |
+
l'écart absolu entre confiance moyenne et précision moyenne.
|
| 19 |
+
``ECE = 0`` ↔ moteur parfaitement calibré ; ``ECE`` élevé ↔ écart
|
| 20 |
+
systématique entre confiance affichée et fiabilité réelle.
|
| 21 |
+
- **Maximum Calibration Error (MCE)** — max de cet écart sur les bins.
|
| 22 |
+
Utile pour repérer le pire mensonge du moteur (ex. il dit toujours
|
| 23 |
+
95 % de confiance et il a tort une fois sur deux).
|
| 24 |
+
- **Reliability diagram** — table ``[(bin_low, bin_high, avg_conf,
|
| 25 |
+
accuracy, count)]`` qui peut être rendue en SVG côté serveur ou en
|
| 26 |
+
Chart.js côté navigateur dans un sprint suivant.
|
| 27 |
+
|
| 28 |
+
Stratégie de découpage
|
| 29 |
+
----------------------
|
| 30 |
+
Comme pour le NER (Sprint 38) et la divergence (Sprints 35-37),
|
| 31 |
+
on découpe :
|
| 32 |
+
|
| 33 |
+
- **Sprint 39** (ici) — couche de calcul pure : entrée = deux listes
|
| 34 |
+
parallèles ``confidences`` (∈ [0, 1]) et ``is_correct`` (bool/0-1).
|
| 35 |
+
Aucune dépendance externe.
|
| 36 |
+
- **Sprint à venir** — exposition de ``token_confidences`` sur
|
| 37 |
+
``EngineResult``, alignement caractère/token avec la GT pour produire
|
| 38 |
+
``is_correct``, intégration dans le runner et vue HTML reliability.
|
| 39 |
+
|
| 40 |
+
Ce qui est explicitement hors scope
|
| 41 |
+
-----------------------------------
|
| 42 |
+
Ce sprint ne touche **aucun adaptateur OCR**. Aucune confiance n'est
|
| 43 |
+
extraite ; on calcule uniquement à partir de séquences de prédictions
|
| 44 |
+
fournies en entrée. C'est ce qui permet de tester rigoureusement les
|
| 45 |
+
invariants mathématiques (ECE = 0 ↔ calibré, ECE = |bias| pour bias
|
| 46 |
+
constant, etc.) sans dépendre d'un backend.
|
| 47 |
+
"""
|
| 48 |
+
|
| 49 |
+
from __future__ import annotations
|
| 50 |
+
|
| 51 |
+
import logging
|
| 52 |
+
from dataclasses import dataclass
|
| 53 |
+
from typing import Iterable
|
| 54 |
+
|
| 55 |
+
logger = logging.getLogger(__name__)
|
| 56 |
+
|
| 57 |
+
|
| 58 |
+
# ──────────────────────────────────────────────────────────────────────────
|
| 59 |
+
# Modèle de données
|
| 60 |
+
# ──────────────────────────────────────────────────────────────────────────
|
| 61 |
+
|
| 62 |
+
|
| 63 |
+
@dataclass(frozen=True)
|
| 64 |
+
class CalibrationBin:
|
| 65 |
+
"""Un bin du reliability diagram.
|
| 66 |
+
|
| 67 |
+
Attributs
|
| 68 |
+
---------
|
| 69 |
+
bin_low, bin_high:
|
| 70 |
+
Bornes du bin sur l'axe de confiance (``[bin_low, bin_high)`` —
|
| 71 |
+
sauf le dernier bin qui inclut ``1.0``).
|
| 72 |
+
avg_confidence:
|
| 73 |
+
Moyenne des confidences des prédictions tombées dans le bin.
|
| 74 |
+
``None`` si le bin est vide.
|
| 75 |
+
accuracy:
|
| 76 |
+
Fraction de prédictions correctes dans le bin (``∈ [0, 1]``).
|
| 77 |
+
``None`` si le bin est vide.
|
| 78 |
+
count:
|
| 79 |
+
Nombre de prédictions dans le bin.
|
| 80 |
+
"""
|
| 81 |
+
|
| 82 |
+
bin_low: float
|
| 83 |
+
bin_high: float
|
| 84 |
+
avg_confidence: float | None
|
| 85 |
+
accuracy: float | None
|
| 86 |
+
count: int
|
| 87 |
+
|
| 88 |
+
@property
|
| 89 |
+
def gap(self) -> float | None:
|
| 90 |
+
"""Écart absolu ``|confidence - accuracy|`` ou ``None`` si vide."""
|
| 91 |
+
if self.avg_confidence is None or self.accuracy is None:
|
| 92 |
+
return None
|
| 93 |
+
return abs(self.avg_confidence - self.accuracy)
|
| 94 |
+
|
| 95 |
+
|
| 96 |
+
# ──────────────────────────────────────────────────────────────────────────
|
| 97 |
+
# Validation
|
| 98 |
+
# ──────────────────────────────────────────────────────────────────────────
|
| 99 |
+
|
| 100 |
+
|
| 101 |
+
def _validate_inputs(
|
| 102 |
+
confidences: list[float],
|
| 103 |
+
is_correct: list[bool | int],
|
| 104 |
+
) -> None:
|
| 105 |
+
if len(confidences) != len(is_correct):
|
| 106 |
+
raise ValueError(
|
| 107 |
+
f"Longueurs incompatibles : confidences={len(confidences)} "
|
| 108 |
+
f"vs is_correct={len(is_correct)}"
|
| 109 |
+
)
|
| 110 |
+
for i, c in enumerate(confidences):
|
| 111 |
+
if not (0.0 <= float(c) <= 1.0):
|
| 112 |
+
raise ValueError(
|
| 113 |
+
f"Confiance hors [0, 1] à l'index {i} : {c!r}"
|
| 114 |
+
)
|
| 115 |
+
|
| 116 |
+
|
| 117 |
+
# ──────────────────────────────────────────���───────────────────────────────
|
| 118 |
+
# Reliability diagram (binning)
|
| 119 |
+
# ──────────────────────────────────────────────────────────────────────────
|
| 120 |
+
|
| 121 |
+
|
| 122 |
+
def reliability_diagram(
|
| 123 |
+
confidences: Iterable[float],
|
| 124 |
+
is_correct: Iterable[bool | int],
|
| 125 |
+
n_bins: int = 10,
|
| 126 |
+
) -> list[CalibrationBin]:
|
| 127 |
+
"""Découpe les prédictions en ``n_bins`` bins équidistants par confiance
|
| 128 |
+
et calcule pour chacun la confiance moyenne, la précision et le compte.
|
| 129 |
+
|
| 130 |
+
Parameters
|
| 131 |
+
----------
|
| 132 |
+
confidences:
|
| 133 |
+
Confidences des prédictions, ``∈ [0, 1]``.
|
| 134 |
+
is_correct:
|
| 135 |
+
Indicateur booléen (1 = prédiction correcte, 0 = incorrecte).
|
| 136 |
+
n_bins:
|
| 137 |
+
Nombre de bins (défaut : 10). Bornes : ``[k/n_bins, (k+1)/n_bins)``
|
| 138 |
+
sauf le dernier bin qui inclut ``1.0``.
|
| 139 |
+
|
| 140 |
+
Returns
|
| 141 |
+
-------
|
| 142 |
+
list[CalibrationBin]
|
| 143 |
+
Liste de ``n_bins`` bins, dans l'ordre croissant des confidences.
|
| 144 |
+
"""
|
| 145 |
+
if n_bins < 1:
|
| 146 |
+
raise ValueError(f"n_bins doit être ≥ 1 — reçu {n_bins}")
|
| 147 |
+
|
| 148 |
+
confs = [float(c) for c in confidences]
|
| 149 |
+
correct = [int(bool(x)) for x in is_correct]
|
| 150 |
+
_validate_inputs(confs, correct)
|
| 151 |
+
|
| 152 |
+
bin_width = 1.0 / n_bins
|
| 153 |
+
sums: list[float] = [0.0] * n_bins
|
| 154 |
+
correct_counts: list[int] = [0] * n_bins
|
| 155 |
+
counts: list[int] = [0] * n_bins
|
| 156 |
+
|
| 157 |
+
for c, ok in zip(confs, correct):
|
| 158 |
+
# Calcul du bin index par multiplication ``c * n_bins`` plutôt que
|
| 159 |
+
# division ``c / bin_width`` pour éviter les pièges de
|
| 160 |
+
# représentation flottante (ex. ``0.6 / 0.1 = 5.999…`` en IEEE 754
|
| 161 |
+
# qui placerait 0.6 dans le bin [0.5, 0.6) au lieu de [0.6, 0.7)).
|
| 162 |
+
if c >= 1.0:
|
| 163 |
+
idx = n_bins - 1
|
| 164 |
+
else:
|
| 165 |
+
idx = int(c * n_bins)
|
| 166 |
+
# Garde-fou en cas d'arrondi flottant
|
| 167 |
+
if idx >= n_bins:
|
| 168 |
+
idx = n_bins - 1
|
| 169 |
+
elif idx < 0:
|
| 170 |
+
idx = 0
|
| 171 |
+
sums[idx] += c
|
| 172 |
+
correct_counts[idx] += ok
|
| 173 |
+
counts[idx] += 1
|
| 174 |
+
|
| 175 |
+
bins: list[CalibrationBin] = []
|
| 176 |
+
for k in range(n_bins):
|
| 177 |
+
low = k * bin_width
|
| 178 |
+
high = (k + 1) * bin_width
|
| 179 |
+
n = counts[k]
|
| 180 |
+
if n == 0:
|
| 181 |
+
bins.append(CalibrationBin(low, high, None, None, 0))
|
| 182 |
+
else:
|
| 183 |
+
bins.append(CalibrationBin(
|
| 184 |
+
bin_low=low,
|
| 185 |
+
bin_high=high,
|
| 186 |
+
avg_confidence=sums[k] / n,
|
| 187 |
+
accuracy=correct_counts[k] / n,
|
| 188 |
+
count=n,
|
| 189 |
+
))
|
| 190 |
+
return bins
|
| 191 |
+
|
| 192 |
+
|
| 193 |
+
# ──────────────────────────────────────────────────────────────────────────
|
| 194 |
+
# ECE et MCE
|
| 195 |
+
# ──────────────────────────────────────────────────────────────────────────
|
| 196 |
+
|
| 197 |
+
|
| 198 |
+
def expected_calibration_error(
|
| 199 |
+
confidences: Iterable[float],
|
| 200 |
+
is_correct: Iterable[bool | int],
|
| 201 |
+
n_bins: int = 10,
|
| 202 |
+
) -> float:
|
| 203 |
+
"""Expected Calibration Error : moyenne pondérée par bin de l'écart
|
| 204 |
+
absolu confiance ↔ précision.
|
| 205 |
+
|
| 206 |
+
``ECE = sum_k (n_k / N) * |avg_conf_k - accuracy_k|``
|
| 207 |
+
|
| 208 |
+
où la somme porte sur les bins non vides.
|
| 209 |
+
|
| 210 |
+
Returns
|
| 211 |
+
-------
|
| 212 |
+
float
|
| 213 |
+
``∈ [0, 1]``. ``0`` ↔ calibration parfaite.
|
| 214 |
+
"""
|
| 215 |
+
bins = reliability_diagram(confidences, is_correct, n_bins=n_bins)
|
| 216 |
+
total = sum(b.count for b in bins)
|
| 217 |
+
if total == 0:
|
| 218 |
+
return 0.0
|
| 219 |
+
ece = 0.0
|
| 220 |
+
for b in bins:
|
| 221 |
+
if b.count == 0 or b.gap is None:
|
| 222 |
+
continue
|
| 223 |
+
ece += (b.count / total) * b.gap
|
| 224 |
+
return ece
|
| 225 |
+
|
| 226 |
+
|
| 227 |
+
def maximum_calibration_error(
|
| 228 |
+
confidences: Iterable[float],
|
| 229 |
+
is_correct: Iterable[bool | int],
|
| 230 |
+
n_bins: int = 10,
|
| 231 |
+
) -> float:
|
| 232 |
+
"""Maximum Calibration Error : pire écart confiance ↔ précision sur
|
| 233 |
+
tous les bins non vides.
|
| 234 |
+
|
| 235 |
+
Utile pour repérer un mensonge ponctuel du moteur (ex. il dit 95 %
|
| 236 |
+
de confiance et il a tort une fois sur deux dans ce bin).
|
| 237 |
+
|
| 238 |
+
Returns
|
| 239 |
+
-------
|
| 240 |
+
float
|
| 241 |
+
``∈ [0, 1]``. ``0`` ↔ calibration parfaite.
|
| 242 |
+
"""
|
| 243 |
+
bins = reliability_diagram(confidences, is_correct, n_bins=n_bins)
|
| 244 |
+
gaps = [b.gap for b in bins if b.gap is not None]
|
| 245 |
+
return max(gaps) if gaps else 0.0
|
| 246 |
+
|
| 247 |
+
|
| 248 |
+
# ──────────────────────────────────────────────────────────────────────────
|
| 249 |
+
# Vue agrégée
|
| 250 |
+
# ──────────────────────────────────────────────────────────────────────��───
|
| 251 |
+
|
| 252 |
+
|
| 253 |
+
def compute_calibration_metrics(
|
| 254 |
+
confidences: Iterable[float],
|
| 255 |
+
is_correct: Iterable[bool | int],
|
| 256 |
+
n_bins: int = 10,
|
| 257 |
+
) -> dict:
|
| 258 |
+
"""Calcule l'ensemble des métriques de calibration en un appel.
|
| 259 |
+
|
| 260 |
+
Returns
|
| 261 |
+
-------
|
| 262 |
+
dict
|
| 263 |
+
``{
|
| 264 |
+
"ece": float,
|
| 265 |
+
"mce": float,
|
| 266 |
+
"n_bins": int,
|
| 267 |
+
"n_predictions": int,
|
| 268 |
+
"overall_accuracy": float,
|
| 269 |
+
"overall_confidence": float,
|
| 270 |
+
"bins": [
|
| 271 |
+
{"bin_low", "bin_high", "avg_confidence",
|
| 272 |
+
"accuracy", "count", "gap"},
|
| 273 |
+
...
|
| 274 |
+
],
|
| 275 |
+
}``
|
| 276 |
+
"""
|
| 277 |
+
confs = list(confidences)
|
| 278 |
+
correct = list(is_correct)
|
| 279 |
+
bins = reliability_diagram(confs, correct, n_bins=n_bins)
|
| 280 |
+
total = sum(b.count for b in bins)
|
| 281 |
+
overall_acc = (
|
| 282 |
+
sum(int(bool(x)) for x in correct) / total if total > 0 else 0.0
|
| 283 |
+
)
|
| 284 |
+
overall_conf = (
|
| 285 |
+
sum(float(c) for c in confs) / total if total > 0 else 0.0
|
| 286 |
+
)
|
| 287 |
+
|
| 288 |
+
ece = 0.0
|
| 289 |
+
if total > 0:
|
| 290 |
+
for b in bins:
|
| 291 |
+
if b.gap is None:
|
| 292 |
+
continue
|
| 293 |
+
ece += (b.count / total) * b.gap
|
| 294 |
+
mce = max((b.gap for b in bins if b.gap is not None), default=0.0)
|
| 295 |
+
|
| 296 |
+
return {
|
| 297 |
+
"ece": ece,
|
| 298 |
+
"mce": mce,
|
| 299 |
+
"n_bins": n_bins,
|
| 300 |
+
"n_predictions": total,
|
| 301 |
+
"overall_accuracy": overall_acc,
|
| 302 |
+
"overall_confidence": overall_conf,
|
| 303 |
+
"bins": [
|
| 304 |
+
{
|
| 305 |
+
"bin_low": b.bin_low,
|
| 306 |
+
"bin_high": b.bin_high,
|
| 307 |
+
"avg_confidence": b.avg_confidence,
|
| 308 |
+
"accuracy": b.accuracy,
|
| 309 |
+
"count": b.count,
|
| 310 |
+
"gap": b.gap,
|
| 311 |
+
}
|
| 312 |
+
for b in bins
|
| 313 |
+
],
|
| 314 |
+
}
|
| 315 |
+
|
| 316 |
+
|
| 317 |
+
__all__ = [
|
| 318 |
+
"CalibrationBin",
|
| 319 |
+
"reliability_diagram",
|
| 320 |
+
"expected_calibration_error",
|
| 321 |
+
"maximum_calibration_error",
|
| 322 |
+
"compute_calibration_metrics",
|
| 323 |
+
]
|
|
@@ -0,0 +1,268 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""Matrice de confusion unicode pour l'analyse fine des erreurs OCR.
|
| 2 |
+
|
| 3 |
+
Pour chaque moteur, on calcule quels caractères du GT sont transcrits par
|
| 4 |
+
quels caractères OCR (substitutions). Cette "empreinte d'erreur" est
|
| 5 |
+
caractéristique de chaque moteur ou pipeline.
|
| 6 |
+
|
| 7 |
+
Méthode
|
| 8 |
+
-------
|
| 9 |
+
L'alignement caractère par caractère utilise les opérations d'édition
|
| 10 |
+
de la distance de Levenshtein (via difflib.SequenceMatcher), ce qui permet
|
| 11 |
+
d'identifier les substitutions, insertions et suppressions.
|
| 12 |
+
|
| 13 |
+
La matrice est stockée comme un dict de dict :
|
| 14 |
+
``{gt_char: {ocr_char: count}}``
|
| 15 |
+
|
| 16 |
+
La valeur spéciale ``"∅"`` (U+2205) représente un caractère vide :
|
| 17 |
+
- ``{"a": {"∅": 3}}`` → 'a' supprimé 3 fois dans l'OCR
|
| 18 |
+
- ``{"∅": {"x": 2}}`` → 'x' inséré 2 fois dans l'OCR (absent du GT)
|
| 19 |
+
"""
|
| 20 |
+
|
| 21 |
+
from __future__ import annotations
|
| 22 |
+
|
| 23 |
+
import difflib
|
| 24 |
+
from collections import defaultdict
|
| 25 |
+
from dataclasses import dataclass, field
|
| 26 |
+
|
| 27 |
+
# Symbole représentant un caractère absent (insertion / suppression)
|
| 28 |
+
EMPTY_CHAR = "∅"
|
| 29 |
+
|
| 30 |
+
# Caractères non pertinents à ignorer dans la matrice (espaces, sauts de ligne)
|
| 31 |
+
_WHITESPACE = set(" \t\n\r")
|
| 32 |
+
|
| 33 |
+
|
| 34 |
+
@dataclass
|
| 35 |
+
class ConfusionMatrix:
|
| 36 |
+
"""Matrice de confusion unicode pour une paire (GT, OCR)."""
|
| 37 |
+
|
| 38 |
+
matrix: dict[str, dict[str, int]] = field(default_factory=dict)
|
| 39 |
+
"""Clé externe = char GT ; clé interne = char OCR ; valeur = count."""
|
| 40 |
+
|
| 41 |
+
total_substitutions: int = 0
|
| 42 |
+
total_insertions: int = 0
|
| 43 |
+
total_deletions: int = 0
|
| 44 |
+
|
| 45 |
+
@property
|
| 46 |
+
def total_errors(self) -> int:
|
| 47 |
+
return self.total_substitutions + self.total_insertions + self.total_deletions
|
| 48 |
+
|
| 49 |
+
def top_confusions(self, n: int = 20) -> list[dict]:
|
| 50 |
+
"""Retourne les n confusions les plus fréquentes (substitutions uniquement)."""
|
| 51 |
+
pairs: list[tuple[str, str, int]] = []
|
| 52 |
+
for gt_char, ocr_counts in self.matrix.items():
|
| 53 |
+
if gt_char == EMPTY_CHAR:
|
| 54 |
+
continue # insertions
|
| 55 |
+
for ocr_char, count in ocr_counts.items():
|
| 56 |
+
if ocr_char == EMPTY_CHAR:
|
| 57 |
+
continue # suppressions
|
| 58 |
+
if gt_char != ocr_char:
|
| 59 |
+
pairs.append((gt_char, ocr_char, count))
|
| 60 |
+
pairs.sort(key=lambda x: -x[2])
|
| 61 |
+
return [
|
| 62 |
+
{"gt": gt, "ocr": ocr, "count": cnt}
|
| 63 |
+
for gt, ocr, cnt in pairs[:n]
|
| 64 |
+
]
|
| 65 |
+
|
| 66 |
+
def as_compact_dict(self, min_count: int = 1) -> dict:
|
| 67 |
+
"""Sérialise la matrice en éliminant les entrées rares."""
|
| 68 |
+
compact: dict[str, dict[str, int]] = {}
|
| 69 |
+
for gt_char, ocr_counts in self.matrix.items():
|
| 70 |
+
filtered = {
|
| 71 |
+
oc: cnt for oc, cnt in ocr_counts.items()
|
| 72 |
+
if cnt >= min_count
|
| 73 |
+
}
|
| 74 |
+
if filtered:
|
| 75 |
+
compact[gt_char] = filtered
|
| 76 |
+
return {
|
| 77 |
+
"matrix": compact,
|
| 78 |
+
"total_substitutions": self.total_substitutions,
|
| 79 |
+
"total_insertions": self.total_insertions,
|
| 80 |
+
"total_deletions": self.total_deletions,
|
| 81 |
+
}
|
| 82 |
+
|
| 83 |
+
def as_dict(self) -> dict:
|
| 84 |
+
return self.as_compact_dict(min_count=1)
|
| 85 |
+
|
| 86 |
+
|
| 87 |
+
def build_confusion_matrix(
|
| 88 |
+
ground_truth: str,
|
| 89 |
+
hypothesis: str,
|
| 90 |
+
ignore_whitespace: bool = True,
|
| 91 |
+
ignore_correct: bool = True,
|
| 92 |
+
) -> ConfusionMatrix:
|
| 93 |
+
"""Construit la matrice de confusion unicode pour une paire GT/OCR.
|
| 94 |
+
|
| 95 |
+
Parameters
|
| 96 |
+
----------
|
| 97 |
+
ground_truth:
|
| 98 |
+
Texte de référence (vérité terrain).
|
| 99 |
+
hypothesis:
|
| 100 |
+
Texte produit par l'OCR.
|
| 101 |
+
ignore_whitespace:
|
| 102 |
+
Si True, ignore les espaces, tabulations et sauts de ligne.
|
| 103 |
+
ignore_correct:
|
| 104 |
+
Si True, n'enregistre pas les paires identiques (gt_char == ocr_char).
|
| 105 |
+
Par défaut True pour réduire la taille de la matrice.
|
| 106 |
+
|
| 107 |
+
Returns
|
| 108 |
+
-------
|
| 109 |
+
ConfusionMatrix
|
| 110 |
+
"""
|
| 111 |
+
matrix: dict[str, dict[str, int]] = defaultdict(lambda: defaultdict(int))
|
| 112 |
+
n_subs = n_ins = n_dels = 0
|
| 113 |
+
|
| 114 |
+
if not ground_truth and not hypothesis:
|
| 115 |
+
return ConfusionMatrix(dict(matrix), 0, 0, 0)
|
| 116 |
+
|
| 117 |
+
# SequenceMatcher sur listes de chars pour un alignement précis
|
| 118 |
+
matcher = difflib.SequenceMatcher(None, ground_truth, hypothesis, autojunk=False)
|
| 119 |
+
|
| 120 |
+
for tag, i1, i2, j1, j2 in matcher.get_opcodes():
|
| 121 |
+
if tag == "equal":
|
| 122 |
+
if not ignore_correct:
|
| 123 |
+
for ch in ground_truth[i1:i2]:
|
| 124 |
+
if ignore_whitespace and ch in _WHITESPACE:
|
| 125 |
+
continue
|
| 126 |
+
matrix[ch][ch] += 1
|
| 127 |
+
elif tag == "replace":
|
| 128 |
+
# Aligner char par char les séquences de longueurs différentes
|
| 129 |
+
gt_seg = ground_truth[i1:i2]
|
| 130 |
+
oc_seg = hypothesis[j1:j2]
|
| 131 |
+
_align_segments(gt_seg, oc_seg, matrix, ignore_whitespace)
|
| 132 |
+
# Substitutions = longueur commune, surplus = insertions ou suppressions
|
| 133 |
+
n_subs += min(len(gt_seg), len(oc_seg))
|
| 134 |
+
surplus = abs(len(gt_seg) - len(oc_seg))
|
| 135 |
+
if len(gt_seg) > len(oc_seg):
|
| 136 |
+
n_dels += surplus
|
| 137 |
+
else:
|
| 138 |
+
n_ins += surplus
|
| 139 |
+
elif tag == "delete":
|
| 140 |
+
for ch in ground_truth[i1:i2]:
|
| 141 |
+
if ignore_whitespace and ch in _WHITESPACE:
|
| 142 |
+
continue
|
| 143 |
+
matrix[ch][EMPTY_CHAR] += 1
|
| 144 |
+
n_dels += 1
|
| 145 |
+
elif tag == "insert":
|
| 146 |
+
for ch in hypothesis[j1:j2]:
|
| 147 |
+
if ignore_whitespace and ch in _WHITESPACE:
|
| 148 |
+
continue
|
| 149 |
+
matrix[EMPTY_CHAR][ch] += 1
|
| 150 |
+
n_ins += 1
|
| 151 |
+
|
| 152 |
+
# Convertir defaultdict en dict normal
|
| 153 |
+
result_matrix: dict[str, dict[str, int]] = {
|
| 154 |
+
k: dict(v) for k, v in matrix.items()
|
| 155 |
+
}
|
| 156 |
+
|
| 157 |
+
return ConfusionMatrix(
|
| 158 |
+
matrix=result_matrix,
|
| 159 |
+
total_substitutions=n_subs,
|
| 160 |
+
total_insertions=n_ins,
|
| 161 |
+
total_deletions=n_dels,
|
| 162 |
+
)
|
| 163 |
+
|
| 164 |
+
|
| 165 |
+
def _align_segments(
|
| 166 |
+
gt_seg: str,
|
| 167 |
+
oc_seg: str,
|
| 168 |
+
matrix: dict,
|
| 169 |
+
ignore_whitespace: bool,
|
| 170 |
+
) -> None:
|
| 171 |
+
"""Aligne deux segments de longueurs potentiellement différentes."""
|
| 172 |
+
if not gt_seg:
|
| 173 |
+
for ch in oc_seg:
|
| 174 |
+
if ignore_whitespace and ch in _WHITESPACE:
|
| 175 |
+
continue
|
| 176 |
+
matrix[EMPTY_CHAR][ch] += 1
|
| 177 |
+
return
|
| 178 |
+
if not oc_seg:
|
| 179 |
+
for ch in gt_seg:
|
| 180 |
+
if ignore_whitespace and ch in _WHITESPACE:
|
| 181 |
+
continue
|
| 182 |
+
matrix[ch][EMPTY_CHAR] += 1
|
| 183 |
+
return
|
| 184 |
+
|
| 185 |
+
if len(gt_seg) == len(oc_seg):
|
| 186 |
+
# Substitutions 1-pour-1
|
| 187 |
+
for g, o in zip(gt_seg, oc_seg):
|
| 188 |
+
if ignore_whitespace and (g in _WHITESPACE or o in _WHITESPACE):
|
| 189 |
+
continue
|
| 190 |
+
matrix[g][o] += 1
|
| 191 |
+
else:
|
| 192 |
+
# Longueurs différentes : utiliser SequenceMatcher récursif sur segments courts
|
| 193 |
+
sub = difflib.SequenceMatcher(None, gt_seg, oc_seg, autojunk=False)
|
| 194 |
+
for tag2, i1, i2, j1, j2 in sub.get_opcodes():
|
| 195 |
+
if tag2 == "equal":
|
| 196 |
+
pass
|
| 197 |
+
elif tag2 == "replace":
|
| 198 |
+
# Régression simple : aligner par troncature
|
| 199 |
+
for g, o in zip(gt_seg[i1:i2], oc_seg[j1:j2]):
|
| 200 |
+
if ignore_whitespace and (g in _WHITESPACE or o in _WHITESPACE):
|
| 201 |
+
continue
|
| 202 |
+
matrix[g][o] += 1
|
| 203 |
+
elif tag2 == "delete":
|
| 204 |
+
for g in gt_seg[i1:i2]:
|
| 205 |
+
if ignore_whitespace and g in _WHITESPACE:
|
| 206 |
+
continue
|
| 207 |
+
matrix[g][EMPTY_CHAR] += 1
|
| 208 |
+
elif tag2 == "insert":
|
| 209 |
+
for o in oc_seg[j1:j2]:
|
| 210 |
+
if ignore_whitespace and o in _WHITESPACE:
|
| 211 |
+
continue
|
| 212 |
+
matrix[EMPTY_CHAR][o] += 1
|
| 213 |
+
|
| 214 |
+
|
| 215 |
+
def aggregate_confusion_matrices(matrices: list[ConfusionMatrix]) -> ConfusionMatrix:
|
| 216 |
+
"""Agrège plusieurs matrices de confusion en une seule.
|
| 217 |
+
|
| 218 |
+
Utile pour obtenir la matrice agrégée sur l'ensemble du corpus.
|
| 219 |
+
"""
|
| 220 |
+
combined: dict[str, dict[str, int]] = defaultdict(lambda: defaultdict(int))
|
| 221 |
+
total_subs = total_ins = total_dels = 0
|
| 222 |
+
|
| 223 |
+
for cm in matrices:
|
| 224 |
+
for gt_char, ocr_counts in cm.matrix.items():
|
| 225 |
+
for ocr_char, count in ocr_counts.items():
|
| 226 |
+
combined[gt_char][ocr_char] += count
|
| 227 |
+
total_subs += cm.total_substitutions
|
| 228 |
+
total_ins += cm.total_insertions
|
| 229 |
+
total_dels += cm.total_deletions
|
| 230 |
+
|
| 231 |
+
return ConfusionMatrix(
|
| 232 |
+
matrix={k: dict(v) for k, v in combined.items()},
|
| 233 |
+
total_substitutions=total_subs,
|
| 234 |
+
total_insertions=total_ins,
|
| 235 |
+
total_deletions=total_dels,
|
| 236 |
+
)
|
| 237 |
+
|
| 238 |
+
|
| 239 |
+
def top_confused_chars(
|
| 240 |
+
matrix: ConfusionMatrix,
|
| 241 |
+
n: int = 15,
|
| 242 |
+
exclude_empty: bool = True,
|
| 243 |
+
) -> list[dict]:
|
| 244 |
+
"""Retourne les caractères GT les plus souvent confondus.
|
| 245 |
+
|
| 246 |
+
Retourne une liste triée par nombre total d'erreurs décroissant :
|
| 247 |
+
``[{"char": "ſ", "total_errors": 47, "top_substitutes": [...]}, ...]``
|
| 248 |
+
"""
|
| 249 |
+
char_stats: dict[str, dict] = {}
|
| 250 |
+
for gt_char, ocr_counts in matrix.matrix.items():
|
| 251 |
+
if exclude_empty and gt_char == EMPTY_CHAR:
|
| 252 |
+
continue
|
| 253 |
+
error_count = sum(
|
| 254 |
+
cnt for oc, cnt in ocr_counts.items()
|
| 255 |
+
if (oc != gt_char) and (not exclude_empty or oc != EMPTY_CHAR)
|
| 256 |
+
)
|
| 257 |
+
if error_count > 0:
|
| 258 |
+
top_subs = sorted(
|
| 259 |
+
[{"ocr": oc, "count": cnt} for oc, cnt in ocr_counts.items() if oc != gt_char],
|
| 260 |
+
key=lambda x: -x["count"],
|
| 261 |
+
)[:5]
|
| 262 |
+
char_stats[gt_char] = {
|
| 263 |
+
"char": gt_char,
|
| 264 |
+
"total_errors": error_count,
|
| 265 |
+
"top_substitutes": top_subs,
|
| 266 |
+
}
|
| 267 |
+
|
| 268 |
+
return sorted(char_stats.values(), key=lambda x: -x["total_errors"])[:n]
|
|
@@ -0,0 +1,276 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""Métrique d'absorption d'erreur — Sprint 94 (B.3).
|
| 2 |
+
|
| 3 |
+
Sprint 94 — B.3 du plan d'évolution 2026.
|
| 4 |
+
|
| 5 |
+
Pourquoi ce module
|
| 6 |
+
------------------
|
| 7 |
+
Quand un module de post-correction LLM aplatit les différences
|
| 8 |
+
entre OCR amont, ce n'est pas qu'il « améliore » tous les
|
| 9 |
+
moteurs — c'est qu'il introduit ses propres biais qui dominent
|
| 10 |
+
ceux de l'OCR. Mesurer la dégradation par étape ne suffit
|
| 11 |
+
pas : il faut **séparer** les deux flux.
|
| 12 |
+
|
| 13 |
+
À chaque jonction où un module transforme un artefact, on
|
| 14 |
+
mesure :
|
| 15 |
+
|
| 16 |
+
- **Taux de correction** : parmi les erreurs présentes en
|
| 17 |
+
entrée du module, combien sont corrigées en sortie ?
|
| 18 |
+
- **Taux d'introduction** : parmi les erreurs présentes en
|
| 19 |
+
sortie, combien sont **nouvelles** (absentes en entrée) ?
|
| 20 |
+
|
| 21 |
+
C'est la généralisation du score de sur-normalisation
|
| 22 |
+
(chantier A.I.7) à toute jonction. La formule s'applique
|
| 23 |
+
uniformément à OCR→LLM, OCR→reconstructor, VLM→ALTO_mapper —
|
| 24 |
+
toute jonction qui transforme un artefact en un autre du même
|
| 25 |
+
type.
|
| 26 |
+
|
| 27 |
+
Méthode (token-level)
|
| 28 |
+
---------------------
|
| 29 |
+
On split en tokens whitespace ``reference``, ``before``,
|
| 30 |
+
``after``. On compare en **multiset** (un token GT consommé
|
| 31 |
+
au plus une fois) :
|
| 32 |
+
|
| 33 |
+
- ``errors_before`` = tokens GT non retrouvés dans ``before``
|
| 34 |
+
- ``errors_after`` = tokens GT non retrouvés dans ``after``
|
| 35 |
+
- ``corrected`` = ``errors_before \\ errors_after``
|
| 36 |
+
(présents avant, absents après → corrigés)
|
| 37 |
+
- ``introduced`` = ``errors_after \\ errors_before``
|
| 38 |
+
(absents avant, présents après → introduits)
|
| 39 |
+
|
| 40 |
+
Garde-fou : le module ne classe pas les erreurs (visuelles,
|
| 41 |
+
abréviations, etc.) — c'est une métrique d'**absorption de
|
| 42 |
+
volume**, pas de qualité éditoriale. L'intersection sémantique
|
| 43 |
+
avec ``taxonomy`` (Sprint 5) est documentée dans le glossaire.
|
| 44 |
+
|
| 45 |
+
Sortie
|
| 46 |
+
------
|
| 47 |
+
``compute_error_absorption(reference, before, after)`` retourne :
|
| 48 |
+
|
| 49 |
+
.. code-block:: text
|
| 50 |
+
|
| 51 |
+
{
|
| 52 |
+
"n_gt_tokens": int,
|
| 53 |
+
"n_errors_before": int,
|
| 54 |
+
"n_errors_after": int,
|
| 55 |
+
"n_corrected": int,
|
| 56 |
+
"n_introduced": int,
|
| 57 |
+
"n_kept_wrong": int,
|
| 58 |
+
"correction_rate": float | None, # n_corrected / n_errors_before
|
| 59 |
+
"introduction_rate": float | None, # n_introduced / n_errors_after
|
| 60 |
+
"net_improvement": int, # n_corrected - n_introduced
|
| 61 |
+
"corrected_tokens": list[str],
|
| 62 |
+
"introduced_tokens": list[str],
|
| 63 |
+
}
|
| 64 |
+
|
| 65 |
+
``aggregate_error_absorption(per_doc_results)`` somme les
|
| 66 |
+
compteurs corpus-wide et recalcule les taux *micro*.
|
| 67 |
+
"""
|
| 68 |
+
|
| 69 |
+
from __future__ import annotations
|
| 70 |
+
|
| 71 |
+
import logging
|
| 72 |
+
from collections import Counter
|
| 73 |
+
from typing import Iterable, Optional
|
| 74 |
+
|
| 75 |
+
logger = logging.getLogger(__name__)
|
| 76 |
+
|
| 77 |
+
|
| 78 |
+
def _split_words(text: Optional[str]) -> list[str]:
|
| 79 |
+
if not text:
|
| 80 |
+
return []
|
| 81 |
+
return text.split()
|
| 82 |
+
|
| 83 |
+
|
| 84 |
+
def _missing_tokens(
|
| 85 |
+
reference: list[str], hypothesis: list[str],
|
| 86 |
+
) -> Counter:
|
| 87 |
+
"""Tokens GT manquants en hypothèse au sens multiset.
|
| 88 |
+
|
| 89 |
+
Un token GT compte plusieurs fois s'il apparaît plusieurs
|
| 90 |
+
fois ; chaque occurrence en hypothèse en absorbe au plus
|
| 91 |
+
une. Retourne un Counter ``{token: nb_occurrences_manquees}``.
|
| 92 |
+
"""
|
| 93 |
+
ref_count = Counter(reference)
|
| 94 |
+
hyp_count = Counter(hypothesis)
|
| 95 |
+
missing: Counter = Counter()
|
| 96 |
+
for token, n_ref in ref_count.items():
|
| 97 |
+
n_hyp = hyp_count.get(token, 0)
|
| 98 |
+
if n_hyp < n_ref:
|
| 99 |
+
missing[token] = n_ref - n_hyp
|
| 100 |
+
return missing
|
| 101 |
+
|
| 102 |
+
|
| 103 |
+
def compute_error_absorption(
|
| 104 |
+
reference: Optional[str],
|
| 105 |
+
before: Optional[str],
|
| 106 |
+
after: Optional[str],
|
| 107 |
+
*,
|
| 108 |
+
case_sensitive: bool = False,
|
| 109 |
+
) -> Optional[dict]:
|
| 110 |
+
"""Mesure l'absorption d'erreur entre ``before`` et ``after``.
|
| 111 |
+
|
| 112 |
+
Parameters
|
| 113 |
+
----------
|
| 114 |
+
reference:
|
| 115 |
+
GT (vérité terrain).
|
| 116 |
+
before:
|
| 117 |
+
Sortie de l'étape précédente (typiquement OCR amont).
|
| 118 |
+
after:
|
| 119 |
+
Sortie de l'étape courante (typiquement post-correction LLM).
|
| 120 |
+
case_sensitive:
|
| 121 |
+
Si False (défaut), match case-insensitive — la sortie
|
| 122 |
+
``corrected_tokens``/``introduced_tokens`` reste en casse
|
| 123 |
+
GT originale.
|
| 124 |
+
|
| 125 |
+
Returns
|
| 126 |
+
-------
|
| 127 |
+
dict | None
|
| 128 |
+
``None`` si la GT est vide ou ne contient aucun token.
|
| 129 |
+
"""
|
| 130 |
+
ref_tokens = _split_words(reference)
|
| 131 |
+
if not ref_tokens:
|
| 132 |
+
return None
|
| 133 |
+
before_tokens = _split_words(before)
|
| 134 |
+
after_tokens = _split_words(after)
|
| 135 |
+
|
| 136 |
+
if case_sensitive:
|
| 137 |
+
ref_match = list(ref_tokens)
|
| 138 |
+
before_match = list(before_tokens)
|
| 139 |
+
after_match = list(after_tokens)
|
| 140 |
+
else:
|
| 141 |
+
ref_match = [t.lower() for t in ref_tokens]
|
| 142 |
+
before_match = [t.lower() for t in before_tokens]
|
| 143 |
+
after_match = [t.lower() for t in after_tokens]
|
| 144 |
+
|
| 145 |
+
# Map case-insensitive token → liste de casses GT originales
|
| 146 |
+
ref_orig_by_match: dict[str, list[str]] = {}
|
| 147 |
+
for orig, m in zip(ref_tokens, ref_match):
|
| 148 |
+
ref_orig_by_match.setdefault(m, []).append(orig)
|
| 149 |
+
|
| 150 |
+
missing_before = _missing_tokens(ref_match, before_match)
|
| 151 |
+
missing_after = _missing_tokens(ref_match, after_match)
|
| 152 |
+
|
| 153 |
+
n_errors_before = sum(missing_before.values())
|
| 154 |
+
n_errors_after = sum(missing_after.values())
|
| 155 |
+
|
| 156 |
+
# Calcul corrigé / introduit en multiset
|
| 157 |
+
corrected_counter: Counter = Counter()
|
| 158 |
+
introduced_counter: Counter = Counter()
|
| 159 |
+
kept_wrong_counter: Counter = Counter()
|
| 160 |
+
all_tokens = set(missing_before) | set(missing_after)
|
| 161 |
+
for tok in all_tokens:
|
| 162 |
+
nb = missing_before.get(tok, 0)
|
| 163 |
+
na = missing_after.get(tok, 0)
|
| 164 |
+
if nb > na:
|
| 165 |
+
corrected_counter[tok] = nb - na
|
| 166 |
+
kept_wrong_counter[tok] = na
|
| 167 |
+
elif na > nb:
|
| 168 |
+
introduced_counter[tok] = na - nb
|
| 169 |
+
kept_wrong_counter[tok] = nb
|
| 170 |
+
else:
|
| 171 |
+
kept_wrong_counter[tok] = nb
|
| 172 |
+
|
| 173 |
+
n_corrected = sum(corrected_counter.values())
|
| 174 |
+
n_introduced = sum(introduced_counter.values())
|
| 175 |
+
n_kept_wrong = sum(kept_wrong_counter.values())
|
| 176 |
+
|
| 177 |
+
correction_rate = (
|
| 178 |
+
n_corrected / n_errors_before
|
| 179 |
+
if n_errors_before > 0 else None
|
| 180 |
+
)
|
| 181 |
+
introduction_rate = (
|
| 182 |
+
n_introduced / n_errors_after
|
| 183 |
+
if n_errors_after > 0 else None
|
| 184 |
+
)
|
| 185 |
+
|
| 186 |
+
def _expand(counter: Counter) -> list[str]:
|
| 187 |
+
out: list[str] = []
|
| 188 |
+
for tok, count in counter.items():
|
| 189 |
+
origs = ref_orig_by_match.get(tok, [tok])
|
| 190 |
+
# Ne renvoie que la casse représentative GT
|
| 191 |
+
display = origs[0] if origs else tok
|
| 192 |
+
out.extend([display] * count)
|
| 193 |
+
return out
|
| 194 |
+
|
| 195 |
+
return {
|
| 196 |
+
"n_gt_tokens": len(ref_tokens),
|
| 197 |
+
"n_errors_before": n_errors_before,
|
| 198 |
+
"n_errors_after": n_errors_after,
|
| 199 |
+
"n_corrected": n_corrected,
|
| 200 |
+
"n_introduced": n_introduced,
|
| 201 |
+
"n_kept_wrong": n_kept_wrong,
|
| 202 |
+
"correction_rate": correction_rate,
|
| 203 |
+
"introduction_rate": introduction_rate,
|
| 204 |
+
"net_improvement": n_corrected - n_introduced,
|
| 205 |
+
"corrected_tokens": _expand(corrected_counter),
|
| 206 |
+
"introduced_tokens": _expand(introduced_counter),
|
| 207 |
+
}
|
| 208 |
+
|
| 209 |
+
|
| 210 |
+
def aggregate_error_absorption(
|
| 211 |
+
per_doc: Iterable[Optional[dict]],
|
| 212 |
+
*,
|
| 213 |
+
sample_tokens: int = 50,
|
| 214 |
+
) -> Optional[dict]:
|
| 215 |
+
"""Agrège les compteurs corpus-wide et recalcule les taux
|
| 216 |
+
*micro*.
|
| 217 |
+
|
| 218 |
+
Parameters
|
| 219 |
+
----------
|
| 220 |
+
per_doc:
|
| 221 |
+
Itérable de sorties de ``compute_error_absorption`` (ou
|
| 222 |
+
``None`` pour les docs sans GT).
|
| 223 |
+
sample_tokens:
|
| 224 |
+
Nombre maximal de tokens corrigés/introduits gardés dans
|
| 225 |
+
l'échantillon (cap pour ne pas exploser le JSON).
|
| 226 |
+
|
| 227 |
+
Returns
|
| 228 |
+
-------
|
| 229 |
+
dict | None
|
| 230 |
+
``None`` si aucune entry valide.
|
| 231 |
+
"""
|
| 232 |
+
docs = [d for d in per_doc if d]
|
| 233 |
+
if not docs:
|
| 234 |
+
return None
|
| 235 |
+
n_gt = sum(int(d.get("n_gt_tokens") or 0) for d in docs)
|
| 236 |
+
n_errors_before = sum(int(d.get("n_errors_before") or 0) for d in docs)
|
| 237 |
+
n_errors_after = sum(int(d.get("n_errors_after") or 0) for d in docs)
|
| 238 |
+
n_corrected = sum(int(d.get("n_corrected") or 0) for d in docs)
|
| 239 |
+
n_introduced = sum(int(d.get("n_introduced") or 0) for d in docs)
|
| 240 |
+
n_kept_wrong = sum(int(d.get("n_kept_wrong") or 0) for d in docs)
|
| 241 |
+
correction_rate = (
|
| 242 |
+
n_corrected / n_errors_before if n_errors_before > 0 else None
|
| 243 |
+
)
|
| 244 |
+
introduction_rate = (
|
| 245 |
+
n_introduced / n_errors_after if n_errors_after > 0 else None
|
| 246 |
+
)
|
| 247 |
+
corrected_sample: list[str] = []
|
| 248 |
+
introduced_sample: list[str] = []
|
| 249 |
+
for d in docs:
|
| 250 |
+
corrected_sample.extend(d.get("corrected_tokens") or [])
|
| 251 |
+
introduced_sample.extend(d.get("introduced_tokens") or [])
|
| 252 |
+
if (
|
| 253 |
+
len(corrected_sample) >= sample_tokens
|
| 254 |
+
and len(introduced_sample) >= sample_tokens
|
| 255 |
+
):
|
| 256 |
+
break
|
| 257 |
+
return {
|
| 258 |
+
"n_docs": len(docs),
|
| 259 |
+
"n_gt_tokens": n_gt,
|
| 260 |
+
"n_errors_before": n_errors_before,
|
| 261 |
+
"n_errors_after": n_errors_after,
|
| 262 |
+
"n_corrected": n_corrected,
|
| 263 |
+
"n_introduced": n_introduced,
|
| 264 |
+
"n_kept_wrong": n_kept_wrong,
|
| 265 |
+
"correction_rate": correction_rate,
|
| 266 |
+
"introduction_rate": introduction_rate,
|
| 267 |
+
"net_improvement": n_corrected - n_introduced,
|
| 268 |
+
"corrected_tokens_sample": corrected_sample[:sample_tokens],
|
| 269 |
+
"introduced_tokens_sample": introduced_sample[:sample_tokens],
|
| 270 |
+
}
|
| 271 |
+
|
| 272 |
+
|
| 273 |
+
__all__ = [
|
| 274 |
+
"compute_error_absorption",
|
| 275 |
+
"aggregate_error_absorption",
|
| 276 |
+
]
|
|
@@ -0,0 +1,331 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""Détection des hallucinations VLM/LLM — Sprint 10.
|
| 2 |
+
|
| 3 |
+
Métriques calculées
|
| 4 |
+
-------------------
|
| 5 |
+
- Taux d'insertion net : mots/caractères ajoutés absents du GT, distinct du WIL existant
|
| 6 |
+
- Ratio de longueur : len(hyp) / len(gt) — ratio > 1.2 → hallucination potentielle
|
| 7 |
+
- Score d'ancrage : proportion des n-grammes (trigrammes) de la sortie présents dans le GT
|
| 8 |
+
- Blocs hallucinés : segments continus de la sortie sans correspondance GT au-delà d'un seuil
|
| 9 |
+
- Badge hallucination : True si ancrage faible ou ratio de longueur anormal
|
| 10 |
+
"""
|
| 11 |
+
|
| 12 |
+
from __future__ import annotations
|
| 13 |
+
|
| 14 |
+
import re
|
| 15 |
+
from dataclasses import dataclass
|
| 16 |
+
|
| 17 |
+
|
| 18 |
+
# ---------------------------------------------------------------------------
|
| 19 |
+
# Helpers texte
|
| 20 |
+
# ---------------------------------------------------------------------------
|
| 21 |
+
|
| 22 |
+
def _tokenize(text: str) -> list[str]:
|
| 23 |
+
"""Découpe en mots (minuscules, sans ponctuation)."""
|
| 24 |
+
return re.findall(r"[^\s]+", text.lower())
|
| 25 |
+
|
| 26 |
+
|
| 27 |
+
def _ngrams(tokens: list[str], n: int) -> list[tuple[str, ...]]:
|
| 28 |
+
"""Génère les n-grammes d'une liste de tokens."""
|
| 29 |
+
if len(tokens) < n:
|
| 30 |
+
return [tuple(tokens)] if tokens else []
|
| 31 |
+
return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
|
| 32 |
+
|
| 33 |
+
|
| 34 |
+
# ---------------------------------------------------------------------------
|
| 35 |
+
# Blocs hallucinés (segments continus sans ancrage)
|
| 36 |
+
# ---------------------------------------------------------------------------
|
| 37 |
+
|
| 38 |
+
@dataclass
|
| 39 |
+
class HallucinatedBlock:
|
| 40 |
+
"""Segment continu de la sortie sans correspondance dans le GT."""
|
| 41 |
+
start_token: int
|
| 42 |
+
end_token: int
|
| 43 |
+
text: str
|
| 44 |
+
length: int # nombre de tokens
|
| 45 |
+
|
| 46 |
+
def as_dict(self) -> dict:
|
| 47 |
+
return {
|
| 48 |
+
"start_token": self.start_token,
|
| 49 |
+
"end_token": self.end_token,
|
| 50 |
+
"text": self.text,
|
| 51 |
+
"length": self.length,
|
| 52 |
+
}
|
| 53 |
+
|
| 54 |
+
|
| 55 |
+
def _detect_hallucinated_blocks(
|
| 56 |
+
hyp_tokens: list[str],
|
| 57 |
+
gt_token_set: set[str],
|
| 58 |
+
tolerance: int = 3,
|
| 59 |
+
min_block_length: int = 4,
|
| 60 |
+
) -> list[HallucinatedBlock]:
|
| 61 |
+
"""Détecte les blocs de tokens hypothèse sans correspondance dans le GT.
|
| 62 |
+
|
| 63 |
+
Un bloc est un segment contigu de tokens hypothèse dont aucun n'est présent
|
| 64 |
+
dans le vocabulaire GT. Une tolérance de ``tolerance`` tokens connus interrompus
|
| 65 |
+
est acceptée avant de clore un bloc.
|
| 66 |
+
|
| 67 |
+
Parameters
|
| 68 |
+
----------
|
| 69 |
+
hyp_tokens:
|
| 70 |
+
Tokens de la sortie OCR/VLM.
|
| 71 |
+
gt_token_set:
|
| 72 |
+
Ensemble des tokens du GT (pour recherche O(1)).
|
| 73 |
+
tolerance:
|
| 74 |
+
Nombre de tokens connus consécutifs interrompant un bloc avant de le clore.
|
| 75 |
+
min_block_length:
|
| 76 |
+
Longueur minimale (tokens) pour qu'un bloc soit signalé.
|
| 77 |
+
|
| 78 |
+
Returns
|
| 79 |
+
-------
|
| 80 |
+
list[HallucinatedBlock]
|
| 81 |
+
"""
|
| 82 |
+
blocks: list[HallucinatedBlock] = []
|
| 83 |
+
if not hyp_tokens:
|
| 84 |
+
return blocks
|
| 85 |
+
|
| 86 |
+
in_block = False
|
| 87 |
+
block_start = 0
|
| 88 |
+
consecutive_known = 0
|
| 89 |
+
|
| 90 |
+
for i, tok in enumerate(hyp_tokens):
|
| 91 |
+
is_unknown = tok not in gt_token_set
|
| 92 |
+
if is_unknown:
|
| 93 |
+
if not in_block:
|
| 94 |
+
in_block = True
|
| 95 |
+
block_start = i
|
| 96 |
+
consecutive_known = 0
|
| 97 |
+
else:
|
| 98 |
+
consecutive_known = 0
|
| 99 |
+
else:
|
| 100 |
+
if in_block:
|
| 101 |
+
consecutive_known += 1
|
| 102 |
+
if consecutive_known >= tolerance:
|
| 103 |
+
# Clore le bloc
|
| 104 |
+
end = i - consecutive_known
|
| 105 |
+
length = end - block_start + 1
|
| 106 |
+
if length >= min_block_length:
|
| 107 |
+
text = " ".join(hyp_tokens[block_start:end + 1])
|
| 108 |
+
blocks.append(HallucinatedBlock(
|
| 109 |
+
start_token=block_start,
|
| 110 |
+
end_token=end,
|
| 111 |
+
text=text,
|
| 112 |
+
length=length,
|
| 113 |
+
))
|
| 114 |
+
in_block = False
|
| 115 |
+
consecutive_known = 0
|
| 116 |
+
|
| 117 |
+
# Bloc non terminé
|
| 118 |
+
if in_block:
|
| 119 |
+
end = len(hyp_tokens) - 1
|
| 120 |
+
length = end - block_start + 1
|
| 121 |
+
if length >= min_block_length:
|
| 122 |
+
text = " ".join(hyp_tokens[block_start:end + 1])
|
| 123 |
+
blocks.append(HallucinatedBlock(
|
| 124 |
+
start_token=block_start,
|
| 125 |
+
end_token=end,
|
| 126 |
+
text=text,
|
| 127 |
+
length=length,
|
| 128 |
+
))
|
| 129 |
+
|
| 130 |
+
return blocks
|
| 131 |
+
|
| 132 |
+
|
| 133 |
+
# ---------------------------------------------------------------------------
|
| 134 |
+
# Résultat structuré
|
| 135 |
+
# ---------------------------------------------------------------------------
|
| 136 |
+
|
| 137 |
+
@dataclass
|
| 138 |
+
class HallucinationMetrics:
|
| 139 |
+
"""Métriques de détection des hallucinations pour une paire (GT, hypothèse)."""
|
| 140 |
+
|
| 141 |
+
net_insertion_rate: float
|
| 142 |
+
"""Taux d'insertion nette : tokens hypothèse absents du GT / total tokens hypothèse."""
|
| 143 |
+
|
| 144 |
+
length_ratio: float
|
| 145 |
+
"""Ratio de longueur : len(hyp) / len(gt) en caractères. > 1.2 = signal d'hallucination."""
|
| 146 |
+
|
| 147 |
+
anchor_score: float
|
| 148 |
+
"""Score d'ancrage : proportion des trigrammes hypothèse présents dans les trigrammes GT.
|
| 149 |
+
Score élevé → l'hypothèse s'ancre bien dans le GT. Score faible → hallucinations probables."""
|
| 150 |
+
|
| 151 |
+
hallucinated_blocks: list[HallucinatedBlock]
|
| 152 |
+
"""Segments continus de la sortie sans correspondance GT (au-dessus du seuil de tolérance)."""
|
| 153 |
+
|
| 154 |
+
is_hallucinating: bool
|
| 155 |
+
"""True si anchor_score < anchor_threshold OU length_ratio > length_ratio_threshold."""
|
| 156 |
+
|
| 157 |
+
# Détails supplémentaires
|
| 158 |
+
gt_word_count: int = 0
|
| 159 |
+
hyp_word_count: int = 0
|
| 160 |
+
net_inserted_words: int = 0
|
| 161 |
+
anchor_threshold_used: float = 0.5
|
| 162 |
+
length_ratio_threshold_used: float = 1.2
|
| 163 |
+
ngram_size_used: int = 3
|
| 164 |
+
|
| 165 |
+
def as_dict(self) -> dict:
|
| 166 |
+
return {
|
| 167 |
+
"net_insertion_rate": round(self.net_insertion_rate, 6),
|
| 168 |
+
"length_ratio": round(self.length_ratio, 6),
|
| 169 |
+
"anchor_score": round(self.anchor_score, 6),
|
| 170 |
+
"hallucinated_blocks": [b.as_dict() for b in self.hallucinated_blocks],
|
| 171 |
+
"is_hallucinating": self.is_hallucinating,
|
| 172 |
+
"gt_word_count": self.gt_word_count,
|
| 173 |
+
"hyp_word_count": self.hyp_word_count,
|
| 174 |
+
"net_inserted_words": self.net_inserted_words,
|
| 175 |
+
"anchor_threshold_used": self.anchor_threshold_used,
|
| 176 |
+
"length_ratio_threshold_used": self.length_ratio_threshold_used,
|
| 177 |
+
"ngram_size_used": self.ngram_size_used,
|
| 178 |
+
}
|
| 179 |
+
|
| 180 |
+
@classmethod
|
| 181 |
+
def from_dict(cls, d: dict) -> "HallucinationMetrics":
|
| 182 |
+
blocks = [
|
| 183 |
+
HallucinatedBlock(**b) for b in d.get("hallucinated_blocks", [])
|
| 184 |
+
]
|
| 185 |
+
return cls(
|
| 186 |
+
net_insertion_rate=d.get("net_insertion_rate", 0.0),
|
| 187 |
+
length_ratio=d.get("length_ratio", 1.0),
|
| 188 |
+
anchor_score=d.get("anchor_score", 1.0),
|
| 189 |
+
hallucinated_blocks=blocks,
|
| 190 |
+
is_hallucinating=d.get("is_hallucinating", False),
|
| 191 |
+
gt_word_count=d.get("gt_word_count", 0),
|
| 192 |
+
hyp_word_count=d.get("hyp_word_count", 0),
|
| 193 |
+
net_inserted_words=d.get("net_inserted_words", 0),
|
| 194 |
+
anchor_threshold_used=d.get("anchor_threshold_used", 0.5),
|
| 195 |
+
length_ratio_threshold_used=d.get("length_ratio_threshold_used", 1.2),
|
| 196 |
+
ngram_size_used=d.get("ngram_size_used", 3),
|
| 197 |
+
)
|
| 198 |
+
|
| 199 |
+
|
| 200 |
+
# ---------------------------------------------------------------------------
|
| 201 |
+
# Calcul principal
|
| 202 |
+
# ---------------------------------------------------------------------------
|
| 203 |
+
|
| 204 |
+
def compute_hallucination_metrics(
|
| 205 |
+
reference: str,
|
| 206 |
+
hypothesis: str,
|
| 207 |
+
n: int = 3,
|
| 208 |
+
length_ratio_threshold: float = 1.2,
|
| 209 |
+
anchor_threshold: float = 0.5,
|
| 210 |
+
block_tolerance: int = 3,
|
| 211 |
+
min_block_length: int = 4,
|
| 212 |
+
) -> HallucinationMetrics:
|
| 213 |
+
"""Calcule les métriques de détection des hallucinations VLM/LLM.
|
| 214 |
+
|
| 215 |
+
Parameters
|
| 216 |
+
----------
|
| 217 |
+
reference:
|
| 218 |
+
Texte de vérité terrain (GT).
|
| 219 |
+
hypothesis:
|
| 220 |
+
Texte produit par le modèle.
|
| 221 |
+
n:
|
| 222 |
+
Taille des n-grammes pour le score d'ancrage (défaut : trigrammes).
|
| 223 |
+
length_ratio_threshold:
|
| 224 |
+
Seuil de ratio de longueur au-dessus duquel on signale une hallucination potentielle.
|
| 225 |
+
anchor_threshold:
|
| 226 |
+
Seuil de score d'ancrage en dessous duquel on signale une hallucination potentielle.
|
| 227 |
+
block_tolerance:
|
| 228 |
+
Nombre de tokens connus consécutifs acceptés dans un bloc halluciné.
|
| 229 |
+
min_block_length:
|
| 230 |
+
Longueur minimale (tokens) pour signaler un bloc halluciné.
|
| 231 |
+
|
| 232 |
+
Returns
|
| 233 |
+
-------
|
| 234 |
+
HallucinationMetrics
|
| 235 |
+
"""
|
| 236 |
+
gt_tokens = _tokenize(reference)
|
| 237 |
+
hyp_tokens = _tokenize(hypothesis)
|
| 238 |
+
|
| 239 |
+
gt_len_chars = len(reference.strip())
|
| 240 |
+
hyp_len_chars = len(hypothesis.strip())
|
| 241 |
+
|
| 242 |
+
# ── Ratio de longueur ────────────────────────────────────────────────
|
| 243 |
+
if gt_len_chars == 0:
|
| 244 |
+
length_ratio = 1.0 if hyp_len_chars == 0 else float("inf")
|
| 245 |
+
else:
|
| 246 |
+
length_ratio = hyp_len_chars / gt_len_chars
|
| 247 |
+
|
| 248 |
+
# ── Taux d'insertion nette ───────────────────────────────────────────
|
| 249 |
+
gt_token_set = set(gt_tokens)
|
| 250 |
+
hyp_token_count = len(hyp_tokens)
|
| 251 |
+
|
| 252 |
+
if hyp_token_count == 0:
|
| 253 |
+
net_insertion_rate = 0.0
|
| 254 |
+
net_inserted_words = 0
|
| 255 |
+
else:
|
| 256 |
+
net_inserted = [t for t in hyp_tokens if t not in gt_token_set]
|
| 257 |
+
net_inserted_words = len(net_inserted)
|
| 258 |
+
net_insertion_rate = net_inserted_words / hyp_token_count
|
| 259 |
+
|
| 260 |
+
# ── Score d'ancrage (n-grammes) ──────────────────────────────────────
|
| 261 |
+
gt_ngrams = set(_ngrams(gt_tokens, n))
|
| 262 |
+
hyp_ngrams = _ngrams(hyp_tokens, n)
|
| 263 |
+
|
| 264 |
+
if not hyp_ngrams:
|
| 265 |
+
# Pas de n-grammes dans l'hypothèse → ancrage parfait (hypothèse vide ou trop courte)
|
| 266 |
+
anchor_score = 1.0 if not gt_ngrams else 0.0
|
| 267 |
+
elif not gt_ngrams:
|
| 268 |
+
anchor_score = 0.0
|
| 269 |
+
else:
|
| 270 |
+
anchored = sum(1 for ng in hyp_ngrams if ng in gt_ngrams)
|
| 271 |
+
anchor_score = anchored / len(hyp_ngrams)
|
| 272 |
+
|
| 273 |
+
# ── Blocs hallucinés ─────────────────────────────────────────────────
|
| 274 |
+
blocks = _detect_hallucinated_blocks(
|
| 275 |
+
hyp_tokens=hyp_tokens,
|
| 276 |
+
gt_token_set=gt_token_set,
|
| 277 |
+
tolerance=block_tolerance,
|
| 278 |
+
min_block_length=min_block_length,
|
| 279 |
+
)
|
| 280 |
+
|
| 281 |
+
# ── Badge hallucination ──────────────────────────────────────────────
|
| 282 |
+
is_hallucinating = (
|
| 283 |
+
anchor_score < anchor_threshold
|
| 284 |
+
or length_ratio > length_ratio_threshold
|
| 285 |
+
)
|
| 286 |
+
|
| 287 |
+
return HallucinationMetrics(
|
| 288 |
+
net_insertion_rate=net_insertion_rate,
|
| 289 |
+
length_ratio=min(length_ratio, 9.99), # plafonner pour la sérialisation
|
| 290 |
+
anchor_score=anchor_score,
|
| 291 |
+
hallucinated_blocks=blocks,
|
| 292 |
+
is_hallucinating=is_hallucinating,
|
| 293 |
+
gt_word_count=len(gt_tokens),
|
| 294 |
+
hyp_word_count=hyp_token_count,
|
| 295 |
+
net_inserted_words=net_inserted_words,
|
| 296 |
+
anchor_threshold_used=anchor_threshold,
|
| 297 |
+
length_ratio_threshold_used=length_ratio_threshold,
|
| 298 |
+
ngram_size_used=n,
|
| 299 |
+
)
|
| 300 |
+
|
| 301 |
+
|
| 302 |
+
# ---------------------------------------------------------------------------
|
| 303 |
+
# Agrégation sur un corpus
|
| 304 |
+
# ---------------------------------------------------------------------------
|
| 305 |
+
|
| 306 |
+
def aggregate_hallucination_metrics(results: list[HallucinationMetrics]) -> dict:
|
| 307 |
+
"""Agrège les métriques d'hallucination sur un corpus.
|
| 308 |
+
|
| 309 |
+
Returns
|
| 310 |
+
-------
|
| 311 |
+
dict
|
| 312 |
+
Statistiques agrégées : anchor_score moyen, taux de documents hallucinés…
|
| 313 |
+
"""
|
| 314 |
+
if not results:
|
| 315 |
+
return {}
|
| 316 |
+
|
| 317 |
+
n = len(results)
|
| 318 |
+
anchor_values = [r.anchor_score for r in results]
|
| 319 |
+
ratio_values = [r.length_ratio for r in results]
|
| 320 |
+
insertion_values = [r.net_insertion_rate for r in results]
|
| 321 |
+
hallucinating_count = sum(1 for r in results if r.is_hallucinating)
|
| 322 |
+
|
| 323 |
+
return {
|
| 324 |
+
"anchor_score_mean": round(sum(anchor_values) / n, 6),
|
| 325 |
+
"anchor_score_min": round(min(anchor_values), 6),
|
| 326 |
+
"length_ratio_mean": round(sum(ratio_values) / n, 6),
|
| 327 |
+
"net_insertion_rate_mean": round(sum(insertion_values) / n, 6),
|
| 328 |
+
"hallucinating_doc_count": hallucinating_count,
|
| 329 |
+
"hallucinating_doc_rate": round(hallucinating_count / n, 6),
|
| 330 |
+
"document_count": n,
|
| 331 |
+
}
|
|
@@ -0,0 +1,283 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""Métriques d'image prédictives — Sprint 93 (A.II.7).
|
| 2 |
+
|
| 3 |
+
Sprint 93 — A.II.7 du plan d'évolution 2026.
|
| 4 |
+
|
| 5 |
+
Pourquoi ce module
|
| 6 |
+
------------------
|
| 7 |
+
``image_quality`` (Sprint 5) mesure des features d'image
|
| 8 |
+
indépendamment ; ce module **les combine** pour produire deux
|
| 9 |
+
indicateurs corpus-level :
|
| 10 |
+
|
| 11 |
+
1. **Score de complexité paléographique** ∈ [0, 1]. Combine
|
| 12 |
+
bruit, faible netteté, faible contraste et rotation en un
|
| 13 |
+
indicateur unique de la difficulté intrinsèque pour un OCR.
|
| 14 |
+
0 = document trivial, 1 = document extrême. Permet
|
| 15 |
+
d'expliquer une partie du CER observé.
|
| 16 |
+
|
| 17 |
+
2. **Score d'homogénéité du corpus** ∈ [0, 1]. Variance des
|
| 18 |
+
features entre documents. 0 = corpus uniforme (la moyenne
|
| 19 |
+
globale du benchmark est fiable), 1 = corpus hétérogène
|
| 20 |
+
(la moyenne ment, il faut stratifier). Couplé au détecteur
|
| 21 |
+
``stratification_recommended`` (Sprint 46) qui agit sur
|
| 22 |
+
``script_type``.
|
| 23 |
+
|
| 24 |
+
Pondérations
|
| 25 |
+
------------
|
| 26 |
+
La roadmap propose une combinaison **pondérée** sans fixer les
|
| 27 |
+
poids — on adopte une convention éditoriale documentée :
|
| 28 |
+
|
| 29 |
+
- ``noise_level`` : poids 0.30 (bruit franc → CER ↑)
|
| 30 |
+
- ``1 - sharpness_score`` : poids 0.30 (flou → CER ↑)
|
| 31 |
+
- ``1 - contrast_score`` : poids 0.20 (faible contraste → CER ↑)
|
| 32 |
+
- ``|rotation_degrees|/30`` : poids 0.20 (rotation > 30° = pire)
|
| 33 |
+
|
| 34 |
+
Les poids somment à 1. L'utilisateur peut surcharger via
|
| 35 |
+
``weights={...}``.
|
| 36 |
+
|
| 37 |
+
Pas de prédiction CER absolue
|
| 38 |
+
-----------------------------
|
| 39 |
+
On ne prétend **pas** prédire une valeur CER en pourcentage —
|
| 40 |
+
ça demanderait un modèle entraîné par moteur, ce que la
|
| 41 |
+
philosophie banc d'essai exclut. On fournit un score relatif
|
| 42 |
+
qui se corrèle au CER observé pour une **lecture
|
| 43 |
+
diagnostique** : *« le document A est ~3× plus complexe que le
|
| 44 |
+
document B, ce qui est cohérent avec le CER observé. »*
|
| 45 |
+
"""
|
| 46 |
+
|
| 47 |
+
from __future__ import annotations
|
| 48 |
+
|
| 49 |
+
import logging
|
| 50 |
+
import math
|
| 51 |
+
import statistics
|
| 52 |
+
from typing import Iterable, Optional
|
| 53 |
+
|
| 54 |
+
logger = logging.getLogger(__name__)
|
| 55 |
+
|
| 56 |
+
|
| 57 |
+
# Poids éditoriaux par défaut.
|
| 58 |
+
DEFAULT_COMPLEXITY_WEIGHTS = {
|
| 59 |
+
"noise_level": 0.30,
|
| 60 |
+
"blur": 0.30, # 1 - sharpness_score
|
| 61 |
+
"low_contrast": 0.20, # 1 - contrast_score
|
| 62 |
+
"rotation": 0.20, # |rotation_degrees| / 30
|
| 63 |
+
}
|
| 64 |
+
|
| 65 |
+
|
| 66 |
+
# Plage de saturation pour la rotation. Au-delà de 30°, on
|
| 67 |
+
# considère que c'est aussi pire que pire.
|
| 68 |
+
_ROTATION_SATURATION_DEG = 30.0
|
| 69 |
+
|
| 70 |
+
|
| 71 |
+
def _clip01(x: float) -> float:
|
| 72 |
+
return max(0.0, min(1.0, x))
|
| 73 |
+
|
| 74 |
+
|
| 75 |
+
def _extract_feature(
|
| 76 |
+
quality: dict, key: str, default: float = 0.0,
|
| 77 |
+
) -> float:
|
| 78 |
+
val = quality.get(key, default)
|
| 79 |
+
if val is None:
|
| 80 |
+
return default
|
| 81 |
+
try:
|
| 82 |
+
return float(val)
|
| 83 |
+
except (TypeError, ValueError):
|
| 84 |
+
return default
|
| 85 |
+
|
| 86 |
+
|
| 87 |
+
def compute_paleographic_complexity(
|
| 88 |
+
quality: dict,
|
| 89 |
+
*,
|
| 90 |
+
weights: Optional[dict[str, float]] = None,
|
| 91 |
+
) -> Optional[dict]:
|
| 92 |
+
"""Score de complexité paléographique d'une image.
|
| 93 |
+
|
| 94 |
+
Parameters
|
| 95 |
+
----------
|
| 96 |
+
quality:
|
| 97 |
+
Dict ``ImageQualityResult.as_dict()`` ou compatible.
|
| 98 |
+
Champs lus : ``noise_level``, ``sharpness_score``,
|
| 99 |
+
``contrast_score``, ``rotation_degrees``.
|
| 100 |
+
weights:
|
| 101 |
+
Poids surchargeant les défauts. Doit contenir les
|
| 102 |
+
4 clés ``noise_level``, ``blur``, ``low_contrast``,
|
| 103 |
+
``rotation``. Les poids sont normalisés (somme = 1).
|
| 104 |
+
|
| 105 |
+
Returns
|
| 106 |
+
-------
|
| 107 |
+
dict | None
|
| 108 |
+
``{
|
| 109 |
+
"score": float, # ∈ [0, 1]
|
| 110 |
+
"components": {
|
| 111 |
+
"noise": float, "blur": float,
|
| 112 |
+
"low_contrast": float, "rotation": float,
|
| 113 |
+
},
|
| 114 |
+
"weights_used": dict,
|
| 115 |
+
}`` ou ``None`` si ``quality`` est falsy.
|
| 116 |
+
"""
|
| 117 |
+
if not quality:
|
| 118 |
+
return None
|
| 119 |
+
w = dict(DEFAULT_COMPLEXITY_WEIGHTS)
|
| 120 |
+
if weights:
|
| 121 |
+
for k in w:
|
| 122 |
+
if k in weights:
|
| 123 |
+
w[k] = float(weights[k])
|
| 124 |
+
total = sum(w.values())
|
| 125 |
+
if total <= 0:
|
| 126 |
+
return None
|
| 127 |
+
w = {k: v / total for k, v in w.items()}
|
| 128 |
+
noise = _clip01(_extract_feature(quality, "noise_level"))
|
| 129 |
+
sharpness = _clip01(_extract_feature(quality, "sharpness_score"))
|
| 130 |
+
contrast = _clip01(_extract_feature(quality, "contrast_score"))
|
| 131 |
+
rotation_deg = abs(_extract_feature(quality, "rotation_degrees"))
|
| 132 |
+
blur = 1.0 - sharpness
|
| 133 |
+
low_contrast = 1.0 - contrast
|
| 134 |
+
rotation = _clip01(rotation_deg / _ROTATION_SATURATION_DEG)
|
| 135 |
+
score = (
|
| 136 |
+
w["noise_level"] * noise
|
| 137 |
+
+ w["blur"] * blur
|
| 138 |
+
+ w["low_contrast"] * low_contrast
|
| 139 |
+
+ w["rotation"] * rotation
|
| 140 |
+
)
|
| 141 |
+
return {
|
| 142 |
+
"score": _clip01(score),
|
| 143 |
+
"components": {
|
| 144 |
+
"noise": noise,
|
| 145 |
+
"blur": blur,
|
| 146 |
+
"low_contrast": low_contrast,
|
| 147 |
+
"rotation": rotation,
|
| 148 |
+
},
|
| 149 |
+
"weights_used": w,
|
| 150 |
+
}
|
| 151 |
+
|
| 152 |
+
|
| 153 |
+
def compute_corpus_homogeneity(
|
| 154 |
+
image_qualities: Iterable[dict],
|
| 155 |
+
) -> Optional[dict]:
|
| 156 |
+
"""Score d'homogénéité du corpus ∈ [0, 1].
|
| 157 |
+
|
| 158 |
+
0 = corpus uniforme (faible variance entre documents),
|
| 159 |
+
1 = corpus hétérogène.
|
| 160 |
+
|
| 161 |
+
Méthode : pour chaque feature dans ``noise_level``,
|
| 162 |
+
``sharpness_score``, ``contrast_score``, ``rotation_degrees``,
|
| 163 |
+
on calcule l'écart-type *normalisé* sur les documents (par
|
| 164 |
+
une plage de référence), puis on prend la moyenne des 4.
|
| 165 |
+
|
| 166 |
+
Plages de normalisation :
|
| 167 |
+
- ``noise_level``, ``sharpness_score``, ``contrast_score``
|
| 168 |
+
∈ [0, 1] → écart-type / 0.5 (max théorique de l'écart-type
|
| 169 |
+
d'une distribution sur [0,1]) borné à 1.
|
| 170 |
+
- ``rotation_degrees`` → écart-type / 10°.
|
| 171 |
+
|
| 172 |
+
Parameters
|
| 173 |
+
----------
|
| 174 |
+
image_qualities:
|
| 175 |
+
Itérable de dicts ``ImageQualityResult.as_dict()``.
|
| 176 |
+
|
| 177 |
+
Returns
|
| 178 |
+
-------
|
| 179 |
+
dict | None
|
| 180 |
+
``{
|
| 181 |
+
"score": float, # ∈ [0, 1]
|
| 182 |
+
"n_docs": int,
|
| 183 |
+
"per_feature": {
|
| 184 |
+
feature: {"mean": float, "stdev": float,
|
| 185 |
+
"normalised": float},
|
| 186 |
+
},
|
| 187 |
+
}`` ou ``None`` si moins de 2 documents.
|
| 188 |
+
"""
|
| 189 |
+
docs = [q for q in image_qualities if q]
|
| 190 |
+
if len(docs) < 2:
|
| 191 |
+
return None
|
| 192 |
+
features = (
|
| 193 |
+
("noise_level", 0.5),
|
| 194 |
+
("sharpness_score", 0.5),
|
| 195 |
+
("contrast_score", 0.5),
|
| 196 |
+
("rotation_degrees", 10.0),
|
| 197 |
+
)
|
| 198 |
+
per_feature: dict[str, dict] = {}
|
| 199 |
+
norm_stdevs: list[float] = []
|
| 200 |
+
for key, divisor in features:
|
| 201 |
+
values = [
|
| 202 |
+
_extract_feature(q, key)
|
| 203 |
+
for q in docs
|
| 204 |
+
]
|
| 205 |
+
if not values:
|
| 206 |
+
continue
|
| 207 |
+
mean = statistics.fmean(values)
|
| 208 |
+
try:
|
| 209 |
+
stdev = statistics.stdev(values) if len(values) >= 2 else 0.0
|
| 210 |
+
except statistics.StatisticsError:
|
| 211 |
+
stdev = 0.0
|
| 212 |
+
normalised = _clip01(stdev / divisor) if divisor > 0 else 0.0
|
| 213 |
+
per_feature[key] = {
|
| 214 |
+
"mean": mean,
|
| 215 |
+
"stdev": stdev,
|
| 216 |
+
"normalised": normalised,
|
| 217 |
+
}
|
| 218 |
+
norm_stdevs.append(normalised)
|
| 219 |
+
if not norm_stdevs:
|
| 220 |
+
return None
|
| 221 |
+
score = statistics.fmean(norm_stdevs)
|
| 222 |
+
return {
|
| 223 |
+
"score": _clip01(score),
|
| 224 |
+
"n_docs": len(docs),
|
| 225 |
+
"per_feature": per_feature,
|
| 226 |
+
}
|
| 227 |
+
|
| 228 |
+
|
| 229 |
+
def aggregate_corpus_predictive(
|
| 230 |
+
image_qualities: Iterable[dict],
|
| 231 |
+
*,
|
| 232 |
+
weights: Optional[dict[str, float]] = None,
|
| 233 |
+
) -> Optional[dict]:
|
| 234 |
+
"""Synthèse corpus-wide : complexité moyenne + homogénéité.
|
| 235 |
+
|
| 236 |
+
Returns
|
| 237 |
+
-------
|
| 238 |
+
dict | None
|
| 239 |
+
``{
|
| 240 |
+
"n_docs": int,
|
| 241 |
+
"complexity_mean": float,
|
| 242 |
+
"complexity_median": float,
|
| 243 |
+
"complexity_min": float,
|
| 244 |
+
"complexity_max": float,
|
| 245 |
+
"complexity_stdev": float,
|
| 246 |
+
"homogeneity": dict, # sortie de
|
| 247 |
+
# compute_corpus_homogeneity
|
| 248 |
+
}`` ou ``None`` si moins d'un document.
|
| 249 |
+
"""
|
| 250 |
+
docs = [q for q in image_qualities if q]
|
| 251 |
+
if not docs:
|
| 252 |
+
return None
|
| 253 |
+
scores: list[float] = []
|
| 254 |
+
for q in docs:
|
| 255 |
+
result = compute_paleographic_complexity(q, weights=weights)
|
| 256 |
+
if result is not None:
|
| 257 |
+
scores.append(float(result["score"]))
|
| 258 |
+
if not scores:
|
| 259 |
+
return None
|
| 260 |
+
homogeneity = compute_corpus_homogeneity(docs)
|
| 261 |
+
return {
|
| 262 |
+
"n_docs": len(docs),
|
| 263 |
+
"complexity_mean": statistics.fmean(scores),
|
| 264 |
+
"complexity_median": statistics.median(scores),
|
| 265 |
+
"complexity_min": min(scores),
|
| 266 |
+
"complexity_max": max(scores),
|
| 267 |
+
"complexity_stdev": (
|
| 268 |
+
statistics.stdev(scores) if len(scores) >= 2 else 0.0
|
| 269 |
+
),
|
| 270 |
+
"homogeneity": homogeneity,
|
| 271 |
+
}
|
| 272 |
+
|
| 273 |
+
|
| 274 |
+
__all__ = [
|
| 275 |
+
"DEFAULT_COMPLEXITY_WEIGHTS",
|
| 276 |
+
"compute_paleographic_complexity",
|
| 277 |
+
"compute_corpus_homogeneity",
|
| 278 |
+
"aggregate_corpus_predictive",
|
| 279 |
+
]
|
| 280 |
+
|
| 281 |
+
|
| 282 |
+
# Évite warning import inutilisé
|
| 283 |
+
_ = math
|
|
@@ -0,0 +1,391 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""Analyse automatique de la qualité des images de documents numérisés.
|
| 2 |
+
|
| 3 |
+
Métriques
|
| 4 |
+
---------
|
| 5 |
+
- **Score de netteté** : variance du laplacien (plus élevé = plus net)
|
| 6 |
+
- **Niveau de bruit** : écart-type des résidus haute-fréquence
|
| 7 |
+
- **Angle de rotation résiduel** : estimé par projection horizontale
|
| 8 |
+
- **Score de contraste** : ratio Michelson entre zones sombres (encre) et claires (fond)
|
| 9 |
+
- **Score de qualité global** : combinaison normalisée des métriques ci-dessus
|
| 10 |
+
|
| 11 |
+
Ces calculs sont réalisés en pur Python + bibliothèques stdlib ou Pillow.
|
| 12 |
+
NumPy est utilisé si disponible (calculs plus rapides), mais les méthodes
|
| 13 |
+
de fallback n'en dépendent pas.
|
| 14 |
+
|
| 15 |
+
Note
|
| 16 |
+
----
|
| 17 |
+
Pour les images placeholder (fixtures), des valeurs fictives cohérentes
|
| 18 |
+
sont générées via `generate_mock_quality_scores()`.
|
| 19 |
+
"""
|
| 20 |
+
|
| 21 |
+
from __future__ import annotations
|
| 22 |
+
|
| 23 |
+
import logging
|
| 24 |
+
import math
|
| 25 |
+
import statistics
|
| 26 |
+
from dataclasses import dataclass
|
| 27 |
+
from pathlib import Path
|
| 28 |
+
from typing import Optional
|
| 29 |
+
|
| 30 |
+
logger = logging.getLogger(__name__)
|
| 31 |
+
|
| 32 |
+
|
| 33 |
+
@dataclass
|
| 34 |
+
class ImageQualityResult:
|
| 35 |
+
"""Métriques de qualité d'une image de document."""
|
| 36 |
+
|
| 37 |
+
sharpness_score: float = 0.0
|
| 38 |
+
"""Score de netteté [0, 1]. Basé sur la variance du laplacien normalisée."""
|
| 39 |
+
|
| 40 |
+
noise_level: float = 0.0
|
| 41 |
+
"""Niveau de bruit [0, 1]. 0 = pas de bruit, 1 = très bruité."""
|
| 42 |
+
|
| 43 |
+
rotation_degrees: float = 0.0
|
| 44 |
+
"""Angle de rotation résiduel estimé en degrés (positif = sens horaire)."""
|
| 45 |
+
|
| 46 |
+
contrast_score: float = 0.0
|
| 47 |
+
"""Score de contraste [0, 1]. Ratio Michelson encre/fond."""
|
| 48 |
+
|
| 49 |
+
quality_score: float = 0.0
|
| 50 |
+
"""Score de qualité global [0, 1]. Combinaison pondérée des autres métriques."""
|
| 51 |
+
|
| 52 |
+
analysis_method: str = "none"
|
| 53 |
+
"""Méthode d'analyse utilisée : 'pillow', 'numpy', 'mock'."""
|
| 54 |
+
|
| 55 |
+
error: Optional[str] = None
|
| 56 |
+
"""Erreur si l'analyse a échoué."""
|
| 57 |
+
|
| 58 |
+
@property
|
| 59 |
+
def is_good_quality(self) -> bool:
|
| 60 |
+
"""Vrai si le score de qualité global est ≥ 0.7."""
|
| 61 |
+
return self.quality_score >= 0.7
|
| 62 |
+
|
| 63 |
+
@property
|
| 64 |
+
def quality_tier(self) -> str:
|
| 65 |
+
"""Catégorie de qualité : 'good', 'medium', 'poor'."""
|
| 66 |
+
if self.quality_score >= 0.7:
|
| 67 |
+
return "good"
|
| 68 |
+
elif self.quality_score >= 0.4:
|
| 69 |
+
return "medium"
|
| 70 |
+
return "poor"
|
| 71 |
+
|
| 72 |
+
def as_dict(self) -> dict:
|
| 73 |
+
d = {
|
| 74 |
+
"sharpness_score": round(self.sharpness_score, 4),
|
| 75 |
+
"noise_level": round(self.noise_level, 4),
|
| 76 |
+
"rotation_degrees": round(self.rotation_degrees, 2),
|
| 77 |
+
"contrast_score": round(self.contrast_score, 4),
|
| 78 |
+
"quality_score": round(self.quality_score, 4),
|
| 79 |
+
"quality_tier": self.quality_tier,
|
| 80 |
+
"analysis_method": self.analysis_method,
|
| 81 |
+
}
|
| 82 |
+
if self.error:
|
| 83 |
+
d["error"] = self.error
|
| 84 |
+
return d
|
| 85 |
+
|
| 86 |
+
@classmethod
|
| 87 |
+
def from_dict(cls, data: dict) -> "ImageQualityResult":
|
| 88 |
+
return cls(
|
| 89 |
+
sharpness_score=data.get("sharpness_score", 0.0),
|
| 90 |
+
noise_level=data.get("noise_level", 0.0),
|
| 91 |
+
rotation_degrees=data.get("rotation_degrees", 0.0),
|
| 92 |
+
contrast_score=data.get("contrast_score", 0.0),
|
| 93 |
+
quality_score=data.get("quality_score", 0.0),
|
| 94 |
+
analysis_method=data.get("analysis_method", "none"),
|
| 95 |
+
error=data.get("error"),
|
| 96 |
+
)
|
| 97 |
+
|
| 98 |
+
|
| 99 |
+
def analyze_image_quality(image_path: str | Path) -> ImageQualityResult:
|
| 100 |
+
"""Analyse la qualité d'une image de document numérisé.
|
| 101 |
+
|
| 102 |
+
Essaie successivement :
|
| 103 |
+
1. Pillow + NumPy (méthode complète)
|
| 104 |
+
2. Pillow seul (méthode simplifiée)
|
| 105 |
+
3. Fallback : retourne un résultat vide avec erreur
|
| 106 |
+
|
| 107 |
+
Parameters
|
| 108 |
+
----------
|
| 109 |
+
image_path:
|
| 110 |
+
Chemin vers l'image (JPG, PNG, TIFF…).
|
| 111 |
+
|
| 112 |
+
Returns
|
| 113 |
+
-------
|
| 114 |
+
ImageQualityResult
|
| 115 |
+
"""
|
| 116 |
+
path = Path(image_path)
|
| 117 |
+
if not path.exists():
|
| 118 |
+
return ImageQualityResult(
|
| 119 |
+
error=f"Fichier image introuvable : {image_path}",
|
| 120 |
+
analysis_method="none",
|
| 121 |
+
)
|
| 122 |
+
|
| 123 |
+
# Essai avec Pillow + NumPy
|
| 124 |
+
try:
|
| 125 |
+
import numpy as np
|
| 126 |
+
from PIL import Image
|
| 127 |
+
return _analyze_with_numpy(path, np, Image)
|
| 128 |
+
except ImportError:
|
| 129 |
+
pass
|
| 130 |
+
|
| 131 |
+
# Essai avec Pillow seul
|
| 132 |
+
try:
|
| 133 |
+
from PIL import Image
|
| 134 |
+
return _analyze_with_pillow(path, Image)
|
| 135 |
+
except ImportError:
|
| 136 |
+
pass
|
| 137 |
+
|
| 138 |
+
return ImageQualityResult(
|
| 139 |
+
error="Pillow non disponible (pip install Pillow)",
|
| 140 |
+
analysis_method="none",
|
| 141 |
+
quality_score=0.5, # valeur neutre
|
| 142 |
+
)
|
| 143 |
+
|
| 144 |
+
|
| 145 |
+
def _analyze_with_numpy(path: Path, np, Image) -> ImageQualityResult:
|
| 146 |
+
"""Analyse complète avec NumPy."""
|
| 147 |
+
img = Image.open(path).convert("L") # niveaux de gris
|
| 148 |
+
arr = np.array(img, dtype=np.float32)
|
| 149 |
+
|
| 150 |
+
# 1. Netteté : variance du laplacien
|
| 151 |
+
laplacian = _laplacian_variance_numpy(arr, np)
|
| 152 |
+
# Normalisation empirique : variance > 500 = très net, < 50 = flou
|
| 153 |
+
sharpness = min(1.0, laplacian / 500.0)
|
| 154 |
+
|
| 155 |
+
# 2. Bruit : écart-type des résidus (différence image - image lissée)
|
| 156 |
+
noise = _noise_level_numpy(arr, np)
|
| 157 |
+
|
| 158 |
+
# 3. Rotation : angle d'inclinaison estimé
|
| 159 |
+
rotation = _estimate_rotation_numpy(arr, np)
|
| 160 |
+
|
| 161 |
+
# 4. Contraste : ratio Michelson
|
| 162 |
+
contrast = _contrast_score_numpy(arr, np)
|
| 163 |
+
|
| 164 |
+
# 5. Score global pondéré
|
| 165 |
+
quality = _global_quality_score(sharpness, noise, abs(rotation), contrast)
|
| 166 |
+
|
| 167 |
+
return ImageQualityResult(
|
| 168 |
+
sharpness_score=float(sharpness),
|
| 169 |
+
noise_level=float(noise),
|
| 170 |
+
rotation_degrees=float(rotation),
|
| 171 |
+
contrast_score=float(contrast),
|
| 172 |
+
quality_score=float(quality),
|
| 173 |
+
analysis_method="numpy",
|
| 174 |
+
)
|
| 175 |
+
|
| 176 |
+
|
| 177 |
+
def _analyze_with_pillow(path: Path, Image) -> ImageQualityResult:
|
| 178 |
+
"""Analyse simplifiée avec Pillow seul (sans NumPy)."""
|
| 179 |
+
img = Image.open(path).convert("L")
|
| 180 |
+
pixels = list(img.tobytes()) # mode "L" = 1 byte/pixel
|
| 181 |
+
w, h = img.size
|
| 182 |
+
|
| 183 |
+
if not pixels:
|
| 184 |
+
return ImageQualityResult(quality_score=0.5, analysis_method="pillow")
|
| 185 |
+
|
| 186 |
+
# Contraste : étendue des valeurs
|
| 187 |
+
min_val = min(pixels)
|
| 188 |
+
max_val = max(pixels)
|
| 189 |
+
if max_val + min_val > 0:
|
| 190 |
+
contrast = (max_val - min_val) / (max_val + min_val)
|
| 191 |
+
else:
|
| 192 |
+
contrast = 0.0
|
| 193 |
+
|
| 194 |
+
# Netteté approximée : variance globale des pixels
|
| 195 |
+
try:
|
| 196 |
+
variance = statistics.variance(pixels)
|
| 197 |
+
except statistics.StatisticsError:
|
| 198 |
+
variance = 0.0
|
| 199 |
+
sharpness = min(1.0, math.sqrt(variance) / 128.0)
|
| 200 |
+
|
| 201 |
+
# Bruit : approximation grossière
|
| 202 |
+
noise = min(1.0, statistics.stdev(pixels[:min(1000, len(pixels))]) / 64.0) if len(pixels) > 1 else 0.0
|
| 203 |
+
|
| 204 |
+
quality = _global_quality_score(sharpness, noise, 0.0, contrast)
|
| 205 |
+
|
| 206 |
+
return ImageQualityResult(
|
| 207 |
+
sharpness_score=sharpness,
|
| 208 |
+
noise_level=noise,
|
| 209 |
+
rotation_degrees=0.0, # non calculé sans NumPy
|
| 210 |
+
contrast_score=contrast,
|
| 211 |
+
quality_score=quality,
|
| 212 |
+
analysis_method="pillow",
|
| 213 |
+
)
|
| 214 |
+
|
| 215 |
+
|
| 216 |
+
def _laplacian_variance_numpy(arr, np) -> float:
|
| 217 |
+
"""Calcule la variance du laplacien (mesure de netteté)."""
|
| 218 |
+
# Convolution laplacien 3x3 via slicing (bordures ignorées)
|
| 219 |
+
h, w = arr.shape
|
| 220 |
+
if h < 3 or w < 3:
|
| 221 |
+
return float(np.var(arr))
|
| 222 |
+
|
| 223 |
+
# Utiliser une convolution rapide avec slicing
|
| 224 |
+
center = arr[1:-1, 1:-1]
|
| 225 |
+
top = arr[:-2, 1:-1]
|
| 226 |
+
bottom = arr[2:, 1:-1]
|
| 227 |
+
left = arr[1:-1, :-2]
|
| 228 |
+
right = arr[1:-1, 2:]
|
| 229 |
+
lap = top + bottom + left + right - 4 * center
|
| 230 |
+
|
| 231 |
+
return float(np.var(lap))
|
| 232 |
+
|
| 233 |
+
|
| 234 |
+
def _noise_level_numpy(arr, np) -> float:
|
| 235 |
+
"""Estime le niveau de bruit par la MAD (Median Absolute Deviation) des gradients."""
|
| 236 |
+
h, w = arr.shape
|
| 237 |
+
if h < 2 or w < 2:
|
| 238 |
+
return 0.0
|
| 239 |
+
# Différences horizontales et verticales
|
| 240 |
+
diff_h = np.abs(arr[:, 1:] - arr[:, :-1])
|
| 241 |
+
diff_v = np.abs(arr[1:, :] - arr[:-1, :])
|
| 242 |
+
noise_std = float(np.median(np.concatenate([diff_h.ravel(), diff_v.ravel()])))
|
| 243 |
+
# Normaliser : 0 = pas de bruit, 1 = très bruité (seuil à ~30)
|
| 244 |
+
return min(1.0, noise_std / 30.0)
|
| 245 |
+
|
| 246 |
+
|
| 247 |
+
def _estimate_rotation_numpy(arr, np) -> float:
|
| 248 |
+
"""Estime l'angle de rotation par projection horizontale simplifiée.
|
| 249 |
+
|
| 250 |
+
Retourne l'angle estimé en degrés [-45, 45].
|
| 251 |
+
"""
|
| 252 |
+
# Méthode simplifiée : analyse de la variance des projections à différents angles
|
| 253 |
+
# Limiter à quelques angles pour la performance
|
| 254 |
+
h, w = arr.shape
|
| 255 |
+
if h < 20 or w < 20:
|
| 256 |
+
return 0.0
|
| 257 |
+
|
| 258 |
+
# Sous-échantillonnage pour la performance
|
| 259 |
+
step = max(1, h // 100)
|
| 260 |
+
sample = arr[::step, :]
|
| 261 |
+
|
| 262 |
+
best_angle = 0.0
|
| 263 |
+
best_var = -1.0
|
| 264 |
+
|
| 265 |
+
for angle_deg in range(-5, 6): # ±5 degrés, pas de 1°
|
| 266 |
+
angle_rad = math.radians(angle_deg)
|
| 267 |
+
# Projection horizontale après rotation approximative
|
| 268 |
+
# (approximation linéaire rapide)
|
| 269 |
+
offsets = np.round(
|
| 270 |
+
np.arange(sample.shape[0]) * math.tan(angle_rad)
|
| 271 |
+
).astype(int)
|
| 272 |
+
offsets = np.clip(offsets, 0, w - 1)
|
| 273 |
+
|
| 274 |
+
# Variance des sommes de lignes décalées
|
| 275 |
+
try:
|
| 276 |
+
row_sums = np.array([
|
| 277 |
+
float(np.sum(sample[i, max(0, offsets[i]):min(w, offsets[i]+w)]))
|
| 278 |
+
for i in range(sample.shape[0])
|
| 279 |
+
])
|
| 280 |
+
var = float(np.var(row_sums))
|
| 281 |
+
if var > best_var:
|
| 282 |
+
best_var = var
|
| 283 |
+
best_angle = float(angle_deg)
|
| 284 |
+
except Exception as e:
|
| 285 |
+
logger.warning(
|
| 286 |
+
"[image_quality] projection à %d° indisponible : %s",
|
| 287 |
+
angle_deg, e,
|
| 288 |
+
)
|
| 289 |
+
|
| 290 |
+
return best_angle
|
| 291 |
+
|
| 292 |
+
|
| 293 |
+
def _contrast_score_numpy(arr, np) -> float:
|
| 294 |
+
"""Score de contraste Michelson [0, 1]."""
|
| 295 |
+
p5 = float(np.percentile(arr, 5)) # fond clair
|
| 296 |
+
p95 = float(np.percentile(arr, 95)) # encre sombre
|
| 297 |
+
if p5 + p95 == 0:
|
| 298 |
+
return 0.0
|
| 299 |
+
# Michelson : (Imax - Imin) / (Imax + Imin)
|
| 300 |
+
return float((p95 - p5) / (p95 + p5))
|
| 301 |
+
|
| 302 |
+
|
| 303 |
+
def _global_quality_score(
|
| 304 |
+
sharpness: float,
|
| 305 |
+
noise: float,
|
| 306 |
+
rotation_abs: float,
|
| 307 |
+
contrast: float,
|
| 308 |
+
) -> float:
|
| 309 |
+
"""Calcule le score de qualité global pondéré."""
|
| 310 |
+
# Poids : netteté (40%), contraste (30%), bruit (20%), rotation (10%)
|
| 311 |
+
score = (
|
| 312 |
+
0.40 * sharpness
|
| 313 |
+
+ 0.30 * contrast
|
| 314 |
+
+ 0.20 * (1.0 - noise) # moins de bruit = mieux
|
| 315 |
+
+ 0.10 * max(0.0, 1.0 - rotation_abs / 10.0) # ±10° max
|
| 316 |
+
)
|
| 317 |
+
return round(min(1.0, max(0.0, score)), 4)
|
| 318 |
+
|
| 319 |
+
|
| 320 |
+
# ---------------------------------------------------------------------------
|
| 321 |
+
# Données fictives pour les fixtures de démo
|
| 322 |
+
# ---------------------------------------------------------------------------
|
| 323 |
+
|
| 324 |
+
def generate_mock_quality_scores(
|
| 325 |
+
doc_id: str,
|
| 326 |
+
seed: Optional[int] = None,
|
| 327 |
+
) -> ImageQualityResult:
|
| 328 |
+
"""Génère des métriques de qualité fictives mais cohérentes pour un document.
|
| 329 |
+
|
| 330 |
+
Utilisé par les fixtures de démo pour simuler une diversité réaliste
|
| 331 |
+
de qualités d'image (bonne, moyenne, dégradée).
|
| 332 |
+
|
| 333 |
+
Parameters
|
| 334 |
+
----------
|
| 335 |
+
doc_id:
|
| 336 |
+
Identifiant du document (utilisé pour la reproductibilité).
|
| 337 |
+
seed:
|
| 338 |
+
Graine aléatoire optionnelle.
|
| 339 |
+
"""
|
| 340 |
+
import random
|
| 341 |
+
rng = random.Random(seed or hash(doc_id) % 2**32)
|
| 342 |
+
|
| 343 |
+
# Générer une qualité cohérente : certains docs sont plus difficiles
|
| 344 |
+
base_quality = 0.3 + rng.random() * 0.6 # 0.3 à 0.9
|
| 345 |
+
|
| 346 |
+
sharpness = max(0.1, min(1.0, base_quality + rng.gauss(0, 0.1)))
|
| 347 |
+
noise = max(0.0, min(1.0, (1.0 - base_quality) * 0.8 + rng.gauss(0, 0.05)))
|
| 348 |
+
rotation = rng.gauss(0, 1.5) # ±1.5° typique
|
| 349 |
+
contrast = max(0.2, min(1.0, base_quality + rng.gauss(0, 0.15)))
|
| 350 |
+
|
| 351 |
+
quality = _global_quality_score(sharpness, noise, abs(rotation), contrast)
|
| 352 |
+
|
| 353 |
+
return ImageQualityResult(
|
| 354 |
+
sharpness_score=round(sharpness, 4),
|
| 355 |
+
noise_level=round(noise, 4),
|
| 356 |
+
rotation_degrees=round(rotation, 2),
|
| 357 |
+
contrast_score=round(contrast, 4),
|
| 358 |
+
quality_score=round(quality, 4),
|
| 359 |
+
analysis_method="mock",
|
| 360 |
+
)
|
| 361 |
+
|
| 362 |
+
|
| 363 |
+
def aggregate_image_quality(results: list[ImageQualityResult]) -> dict:
|
| 364 |
+
"""Agrège les métriques de qualité image sur un corpus."""
|
| 365 |
+
if not results:
|
| 366 |
+
return {}
|
| 367 |
+
|
| 368 |
+
valid = [r for r in results if r.error is None]
|
| 369 |
+
if not valid:
|
| 370 |
+
return {"error": "Aucune analyse réussie"}
|
| 371 |
+
|
| 372 |
+
def _mean(vals: list[float]) -> float:
|
| 373 |
+
return round(statistics.mean(vals), 4) if vals else 0.0
|
| 374 |
+
|
| 375 |
+
quality_scores = [r.quality_score for r in valid]
|
| 376 |
+
sharpness_scores = [r.sharpness_score for r in valid]
|
| 377 |
+
noise_levels = [r.noise_level for r in valid]
|
| 378 |
+
|
| 379 |
+
# Distribution par tier
|
| 380 |
+
tiers = {"good": 0, "medium": 0, "poor": 0}
|
| 381 |
+
for r in valid:
|
| 382 |
+
tiers[r.quality_tier] += 1
|
| 383 |
+
|
| 384 |
+
return {
|
| 385 |
+
"mean_quality_score": _mean(quality_scores),
|
| 386 |
+
"mean_sharpness": _mean(sharpness_scores),
|
| 387 |
+
"mean_noise_level": _mean(noise_levels),
|
| 388 |
+
"quality_distribution": tiers,
|
| 389 |
+
"document_count": len(valid),
|
| 390 |
+
"scores": [r.quality_score for r in valid], # pour scatter plot
|
| 391 |
+
}
|
|
@@ -0,0 +1,253 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""Comparaison incrémentale de pipelines composées — Sprint 96 (B.5).
|
| 2 |
+
|
| 3 |
+
Sprint 96 — B.5 du plan d'évolution 2026.
|
| 4 |
+
|
| 5 |
+
Pourquoi ce module
|
| 6 |
+
------------------
|
| 7 |
+
Avec 5 OCR × 3 reconstructeurs × 4 post-correcteurs × 3
|
| 8 |
+
mappeurs = 180 pipelines à comparer, le rapport noie
|
| 9 |
+
l'information. Il faut un mécanisme de **comparaison
|
| 10 |
+
contrôlée** type design d'expérience.
|
| 11 |
+
|
| 12 |
+
Méthode
|
| 13 |
+
-------
|
| 14 |
+
Pour mesurer l'effet isolé d'un slot ``varying`` :
|
| 15 |
+
|
| 16 |
+
1. Fixer les valeurs des autres slots (``fixed``).
|
| 17 |
+
2. Pour chaque combinaison des fixed, comparer les pipelines
|
| 18 |
+
qui ne diffèrent que sur le slot varying.
|
| 19 |
+
3. Agréger : pour chaque valeur du slot varying, calculer
|
| 20 |
+
sa moyenne, son écart-type, son rang moyen sur les groupes.
|
| 21 |
+
|
| 22 |
+
C'est presque un Latin square automatisé. Sans ça, le
|
| 23 |
+
rapport sur 180 pipelines est inutilisable.
|
| 24 |
+
|
| 25 |
+
Pas de tests statistiques scipy
|
| 26 |
+
-------------------------------
|
| 27 |
+
On ne reconstruit pas Friedman/Nemenyi (déjà dans Sprint 18) ;
|
| 28 |
+
on agrège ici les données nécessaires pour qu'un
|
| 29 |
+
tests statistique externe puisse les consommer. Le rapport
|
| 30 |
+
existant reste libre de brancher
|
| 31 |
+
``picarones.measurements.statistics.friedman_test`` sur la sortie de
|
| 32 |
+
ce module.
|
| 33 |
+
|
| 34 |
+
Sortie
|
| 35 |
+
------
|
| 36 |
+
``compare_isolated_effect(runs, varying_slot)`` retourne :
|
| 37 |
+
|
| 38 |
+
.. code-block:: text
|
| 39 |
+
|
| 40 |
+
{
|
| 41 |
+
"varying_slot": str,
|
| 42 |
+
"n_runs": int,
|
| 43 |
+
"n_groups": int, # combinaisons fixed distinctes
|
| 44 |
+
"values": list[str], # valeurs distinctes du slot
|
| 45 |
+
"per_value": {value: {
|
| 46 |
+
"n_observations": int,
|
| 47 |
+
"mean": float | None,
|
| 48 |
+
"stdev": float | None,
|
| 49 |
+
"min": float, "max": float,
|
| 50 |
+
"mean_rank": float | None,
|
| 51 |
+
}},
|
| 52 |
+
"best_value": str | None,
|
| 53 |
+
"worst_value": str | None,
|
| 54 |
+
"groups": list[dict], # détail par groupe
|
| 55 |
+
}
|
| 56 |
+
"""
|
| 57 |
+
|
| 58 |
+
from __future__ import annotations
|
| 59 |
+
|
| 60 |
+
import logging
|
| 61 |
+
import statistics
|
| 62 |
+
from dataclasses import dataclass
|
| 63 |
+
from typing import Optional
|
| 64 |
+
|
| 65 |
+
logger = logging.getLogger(__name__)
|
| 66 |
+
|
| 67 |
+
|
| 68 |
+
@dataclass(frozen=True)
|
| 69 |
+
class PipelineRun:
|
| 70 |
+
"""Un run de pipeline composée pour la comparaison contrôlée.
|
| 71 |
+
|
| 72 |
+
Attributes
|
| 73 |
+
----------
|
| 74 |
+
name:
|
| 75 |
+
Nom du run (libre — informatif uniquement).
|
| 76 |
+
slots:
|
| 77 |
+
Map ``{slot_name: module_name}`` décrivant la pipeline
|
| 78 |
+
(ex. ``{"ocr": "tess", "llm": "gpt-4o"}``).
|
| 79 |
+
score:
|
| 80 |
+
Métrique numérique à comparer (CER moyen typiquement).
|
| 81 |
+
Plus bas = meilleur par convention sauf si
|
| 82 |
+
``higher_is_better=True`` est passé à
|
| 83 |
+
``compare_isolated_effect``.
|
| 84 |
+
"""
|
| 85 |
+
|
| 86 |
+
name: str
|
| 87 |
+
slots: dict[str, str]
|
| 88 |
+
score: float
|
| 89 |
+
|
| 90 |
+
def as_dict(self) -> dict:
|
| 91 |
+
return {
|
| 92 |
+
"name": self.name,
|
| 93 |
+
"slots": dict(self.slots),
|
| 94 |
+
"score": self.score,
|
| 95 |
+
}
|
| 96 |
+
|
| 97 |
+
|
| 98 |
+
def _normalise_runs(runs) -> list[PipelineRun]:
|
| 99 |
+
"""Accepte une liste de ``PipelineRun`` ou de dicts compatibles."""
|
| 100 |
+
out: list[PipelineRun] = []
|
| 101 |
+
for r in runs:
|
| 102 |
+
if isinstance(r, PipelineRun):
|
| 103 |
+
out.append(r)
|
| 104 |
+
continue
|
| 105 |
+
if not isinstance(r, dict):
|
| 106 |
+
continue
|
| 107 |
+
slots = r.get("slots") or {}
|
| 108 |
+
if not isinstance(slots, dict):
|
| 109 |
+
continue
|
| 110 |
+
try:
|
| 111 |
+
score = float(r.get("score"))
|
| 112 |
+
except (TypeError, ValueError):
|
| 113 |
+
continue
|
| 114 |
+
out.append(PipelineRun(
|
| 115 |
+
name=str(r.get("name") or ""),
|
| 116 |
+
slots={str(k): str(v) for k, v in slots.items()},
|
| 117 |
+
score=score,
|
| 118 |
+
))
|
| 119 |
+
return out
|
| 120 |
+
|
| 121 |
+
|
| 122 |
+
def compare_isolated_effect(
|
| 123 |
+
runs,
|
| 124 |
+
varying_slot: str,
|
| 125 |
+
*,
|
| 126 |
+
higher_is_better: bool = False,
|
| 127 |
+
) -> Optional[dict]:
|
| 128 |
+
"""Mesure l'effet isolé du slot ``varying_slot``.
|
| 129 |
+
|
| 130 |
+
Parameters
|
| 131 |
+
----------
|
| 132 |
+
runs:
|
| 133 |
+
Liste de ``PipelineRun`` (ou dicts compatibles).
|
| 134 |
+
varying_slot:
|
| 135 |
+
Nom du slot dont on veut isoler l'effet. Les autres
|
| 136 |
+
slots constituent les groupes de contrôle.
|
| 137 |
+
higher_is_better:
|
| 138 |
+
Si ``True``, on inverse la convention de classement
|
| 139 |
+
(rang 1 = score le plus haut). Défaut ``False`` =
|
| 140 |
+
rang 1 = score le plus bas (CER).
|
| 141 |
+
|
| 142 |
+
Returns
|
| 143 |
+
-------
|
| 144 |
+
dict | None
|
| 145 |
+
``None`` si moins de 2 runs ou si ``varying_slot``
|
| 146 |
+
n'est présent dans aucun run.
|
| 147 |
+
"""
|
| 148 |
+
runs_list = _normalise_runs(runs)
|
| 149 |
+
if len(runs_list) < 2:
|
| 150 |
+
return None
|
| 151 |
+
runs_list = [r for r in runs_list if varying_slot in r.slots]
|
| 152 |
+
if not runs_list:
|
| 153 |
+
return None
|
| 154 |
+
|
| 155 |
+
# Constitue les groupes par valeurs des slots fixed
|
| 156 |
+
groups: dict[tuple, list[PipelineRun]] = {}
|
| 157 |
+
fixed_slot_names: list[str] = []
|
| 158 |
+
for r in runs_list:
|
| 159 |
+
other_slots = sorted(k for k in r.slots if k != varying_slot)
|
| 160 |
+
if not fixed_slot_names:
|
| 161 |
+
fixed_slot_names = other_slots
|
| 162 |
+
# Skip runs avec un schéma de slots incompatible
|
| 163 |
+
if other_slots != fixed_slot_names:
|
| 164 |
+
continue
|
| 165 |
+
key = tuple((k, r.slots[k]) for k in other_slots)
|
| 166 |
+
groups.setdefault(key, []).append(r)
|
| 167 |
+
|
| 168 |
+
if not groups:
|
| 169 |
+
return None
|
| 170 |
+
|
| 171 |
+
# Pour chaque groupe : ranking des runs par score
|
| 172 |
+
per_value: dict[str, dict] = {}
|
| 173 |
+
group_details: list[dict] = []
|
| 174 |
+
for key, members in groups.items():
|
| 175 |
+
members_sorted = sorted(
|
| 176 |
+
members, key=lambda x: x.score, reverse=higher_is_better,
|
| 177 |
+
)
|
| 178 |
+
# Rangs : runs ex aequo partagent la moyenne des rangs
|
| 179 |
+
ranks: dict[str, float] = {}
|
| 180 |
+
i = 0
|
| 181 |
+
while i < len(members_sorted):
|
| 182 |
+
j = i
|
| 183 |
+
while (
|
| 184 |
+
j + 1 < len(members_sorted)
|
| 185 |
+
and members_sorted[j + 1].score == members_sorted[i].score
|
| 186 |
+
):
|
| 187 |
+
j += 1
|
| 188 |
+
avg_rank = (i + 1 + j + 1) / 2
|
| 189 |
+
for k in range(i, j + 1):
|
| 190 |
+
value = members_sorted[k].slots[varying_slot]
|
| 191 |
+
ranks[value] = avg_rank
|
| 192 |
+
i = j + 1
|
| 193 |
+
|
| 194 |
+
for r in members:
|
| 195 |
+
value = r.slots[varying_slot]
|
| 196 |
+
slot = per_value.setdefault(value, {
|
| 197 |
+
"scores": [],
|
| 198 |
+
"ranks": [],
|
| 199 |
+
})
|
| 200 |
+
slot["scores"].append(r.score)
|
| 201 |
+
slot["ranks"].append(ranks[value])
|
| 202 |
+
group_details.append({
|
| 203 |
+
"fixed_slots": dict(key),
|
| 204 |
+
"n_members": len(members),
|
| 205 |
+
"values": [r.slots[varying_slot] for r in members_sorted],
|
| 206 |
+
"scores": [r.score for r in members_sorted],
|
| 207 |
+
})
|
| 208 |
+
|
| 209 |
+
# Calcul mean/stdev/min/max + rang moyen par valeur
|
| 210 |
+
summary: dict[str, dict] = {}
|
| 211 |
+
for value, slot in per_value.items():
|
| 212 |
+
scores = slot["scores"]
|
| 213 |
+
ranks = slot["ranks"]
|
| 214 |
+
summary[value] = {
|
| 215 |
+
"n_observations": len(scores),
|
| 216 |
+
"mean": statistics.fmean(scores) if scores else None,
|
| 217 |
+
"stdev": (
|
| 218 |
+
statistics.stdev(scores) if len(scores) >= 2 else None
|
| 219 |
+
),
|
| 220 |
+
"min": min(scores),
|
| 221 |
+
"max": max(scores),
|
| 222 |
+
"mean_rank": (
|
| 223 |
+
statistics.fmean(ranks) if ranks else None
|
| 224 |
+
),
|
| 225 |
+
}
|
| 226 |
+
|
| 227 |
+
# Best/worst : sur la mean (convention CER : plus bas = meilleur)
|
| 228 |
+
by_mean = sorted(
|
| 229 |
+
((v, d["mean"]) for v, d in summary.items()
|
| 230 |
+
if d["mean"] is not None),
|
| 231 |
+
key=lambda kv: kv[1],
|
| 232 |
+
reverse=higher_is_better,
|
| 233 |
+
)
|
| 234 |
+
best_value = by_mean[0][0] if by_mean else None
|
| 235 |
+
worst_value = by_mean[-1][0] if by_mean else None
|
| 236 |
+
|
| 237 |
+
return {
|
| 238 |
+
"varying_slot": varying_slot,
|
| 239 |
+
"n_runs": len(runs_list),
|
| 240 |
+
"n_groups": len(groups),
|
| 241 |
+
"values": sorted(per_value.keys()),
|
| 242 |
+
"per_value": summary,
|
| 243 |
+
"best_value": best_value,
|
| 244 |
+
"worst_value": worst_value,
|
| 245 |
+
"groups": group_details,
|
| 246 |
+
"higher_is_better": higher_is_better,
|
| 247 |
+
}
|
| 248 |
+
|
| 249 |
+
|
| 250 |
+
__all__ = [
|
| 251 |
+
"PipelineRun",
|
| 252 |
+
"compare_isolated_effect",
|
| 253 |
+
]
|
|
@@ -0,0 +1,484 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""Métriques inter-moteurs (Sprint 35 — Étape 2 du plan d'évolution).
|
| 2 |
+
|
| 3 |
+
Deux familles de mesures qui répondent à des questions différentes mais
|
| 4 |
+
liées :
|
| 5 |
+
|
| 6 |
+
1. **Divergence taxonomique** (`kl_divergence`, `jensen_shannon_divergence`,
|
| 7 |
+
`taxonomy_divergence_matrix`) — *à quel point les moteurs font-ils des
|
| 8 |
+
erreurs de natures différentes ?* Une divergence élevée signale des
|
| 9 |
+
moteurs spécialisés sur des classes d'erreurs distinctes (visual vs
|
| 10 |
+
abréviation vs casse) et donc des candidats pour un voting ensemble.
|
| 11 |
+
|
| 12 |
+
2. **Complémentarité** (`oracle_token_recall`, `complementarity_gap`,
|
| 13 |
+
`pairwise_disagreement_rate`) — *quel CER serait atteignable si on
|
| 14 |
+
combinait les moteurs ?* La borne inférieure du CER atteignable par
|
| 15 |
+
un voting majoritaire token-level est ``1 - oracle_token_recall``.
|
| 16 |
+
Si elle est très inférieure au CER du meilleur moteur seul, l'effort
|
| 17 |
+
d'un pipeline d'ensemble se justifie. Sinon non.
|
| 18 |
+
|
| 19 |
+
Convention de typage
|
| 20 |
+
--------------------
|
| 21 |
+
Toutes les fonctions sont enregistrables dans le registre Sprint 34 si
|
| 22 |
+
on les wrappe par un adaptateur ``(input_types=(TEXT, TEXT))``. Pour
|
| 23 |
+
limiter le bruit, on ne les enregistre **pas** automatiquement : ce sont
|
| 24 |
+
des métriques d'agrégation (multi-moteurs ou multi-documents) qui ne
|
| 25 |
+
correspondent pas au modèle « une jonction = une métrique » du runner.
|
| 26 |
+
Elles sont consommées par les détecteurs narratifs et le rapport HTML.
|
| 27 |
+
|
| 28 |
+
Note sur l'oracle
|
| 29 |
+
-----------------
|
| 30 |
+
La métrique ``oracle_token_recall`` retournée ici utilise un alignement
|
| 31 |
+
bag-of-words pondéré par multiplicité. Ce n'est **pas** une vraie
|
| 32 |
+
borne atteignable par voting majoritaire séquentiel — c'est une borne
|
| 33 |
+
supérieure (proxy optimiste). La vraie borne demanderait un
|
| 34 |
+
alignement séquentiel des hypothèses, ce qui est plus coûteux. Pour
|
| 35 |
+
le diagnostic « ensemble vaut-il le coup ? », le proxy suffit
|
| 36 |
+
largement ; on documente clairement la limite dans le glossaire et le
|
| 37 |
+
rapport.
|
| 38 |
+
"""
|
| 39 |
+
|
| 40 |
+
from __future__ import annotations
|
| 41 |
+
|
| 42 |
+
import logging
|
| 43 |
+
import math
|
| 44 |
+
from collections import Counter
|
| 45 |
+
|
| 46 |
+
logger = logging.getLogger(__name__)
|
| 47 |
+
|
| 48 |
+
|
| 49 |
+
# ──────────────────────────────────────────────────────────────────────────
|
| 50 |
+
# Divergence taxonomique (KL / Jensen-Shannon)
|
| 51 |
+
# ──────────────────────────────────────────────────────────────────────────
|
| 52 |
+
|
| 53 |
+
|
| 54 |
+
def _smoothed_distribution(
|
| 55 |
+
distribution: dict[str, float],
|
| 56 |
+
keys: list[str],
|
| 57 |
+
epsilon: float = 1e-12,
|
| 58 |
+
) -> list[float]:
|
| 59 |
+
"""Aligne une distribution sur l'ordre de ``keys`` et lisse les zéros.
|
| 60 |
+
|
| 61 |
+
Le lissage évite ``log(0)`` dans la KL. ``epsilon`` est volontairement
|
| 62 |
+
minuscule pour ne pas modifier le résultat de manière sensible.
|
| 63 |
+
"""
|
| 64 |
+
smoothed = [max(distribution.get(k, 0.0), epsilon) for k in keys]
|
| 65 |
+
total = sum(smoothed)
|
| 66 |
+
return [v / total for v in smoothed]
|
| 67 |
+
|
| 68 |
+
|
| 69 |
+
def kl_divergence(p: dict[str, float], q: dict[str, float]) -> float:
|
| 70 |
+
"""KL-divergence ``D(P||Q)`` en bits, sur l'union des clés.
|
| 71 |
+
|
| 72 |
+
Les distributions n'ont pas besoin de partager exactement les mêmes
|
| 73 |
+
clés ; les clés manquantes sont lissées à ``epsilon`` puis
|
| 74 |
+
renormalisées.
|
| 75 |
+
|
| 76 |
+
Returns
|
| 77 |
+
-------
|
| 78 |
+
float
|
| 79 |
+
``D(P||Q) ≥ 0``. Vaut 0 si et seulement si P == Q. N'est pas
|
| 80 |
+
symétrique : ``kl(p, q) != kl(q, p)`` en général.
|
| 81 |
+
"""
|
| 82 |
+
keys = sorted(set(p.keys()) | set(q.keys()))
|
| 83 |
+
if not keys:
|
| 84 |
+
return 0.0
|
| 85 |
+
p_vec = _smoothed_distribution(p, keys)
|
| 86 |
+
q_vec = _smoothed_distribution(q, keys)
|
| 87 |
+
return sum(pi * math.log2(pi / qi) for pi, qi in zip(p_vec, q_vec))
|
| 88 |
+
|
| 89 |
+
|
| 90 |
+
def jensen_shannon_divergence(
|
| 91 |
+
p: dict[str, float],
|
| 92 |
+
q: dict[str, float],
|
| 93 |
+
) -> float:
|
| 94 |
+
"""JS-divergence symétrique en bits, bornée dans ``[0, 1]``.
|
| 95 |
+
|
| 96 |
+
``JS(P, Q) = ½ D(P||M) + ½ D(Q||M)`` avec ``M = (P + Q) / 2``.
|
| 97 |
+
Symétrique et bornée — préférable à la KL pour construire une
|
| 98 |
+
matrice triangulaire de divergences entre moteurs.
|
| 99 |
+
"""
|
| 100 |
+
keys = sorted(set(p.keys()) | set(q.keys()))
|
| 101 |
+
if not keys:
|
| 102 |
+
return 0.0
|
| 103 |
+
p_vec = _smoothed_distribution(p, keys)
|
| 104 |
+
q_vec = _smoothed_distribution(q, keys)
|
| 105 |
+
m_vec = [(pi + qi) / 2.0 for pi, qi in zip(p_vec, q_vec)]
|
| 106 |
+
|
| 107 |
+
def _kl(a: list[float], b: list[float]) -> float:
|
| 108 |
+
return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)
|
| 109 |
+
|
| 110 |
+
js = 0.5 * _kl(p_vec, m_vec) + 0.5 * _kl(q_vec, m_vec)
|
| 111 |
+
# Borne théorique : JS ∈ [0, 1] en bits. Clamp pour absorber les
|
| 112 |
+
# erreurs d'arrondi flottant.
|
| 113 |
+
return max(0.0, min(1.0, js))
|
| 114 |
+
|
| 115 |
+
|
| 116 |
+
def taxonomy_divergence_matrix(
|
| 117 |
+
distributions: dict[str, dict[str, float]],
|
| 118 |
+
metric: str = "js",
|
| 119 |
+
) -> dict[str, dict[str, float]]:
|
| 120 |
+
"""Construit la matrice de divergence triangulaire entre moteurs.
|
| 121 |
+
|
| 122 |
+
Parameters
|
| 123 |
+
----------
|
| 124 |
+
distributions:
|
| 125 |
+
``{engine_name: {error_class: probability}}``. Chaque
|
| 126 |
+
distribution doit sommer à environ 1 (pas de validation stricte
|
| 127 |
+
— les distributions taxonomiques de Picarones sont déjà
|
| 128 |
+
normalisées par ``aggregate_taxonomy``).
|
| 129 |
+
metric:
|
| 130 |
+
``"js"`` (défaut, symétrique) ou ``"kl"`` (asymétrique).
|
| 131 |
+
|
| 132 |
+
Returns
|
| 133 |
+
-------
|
| 134 |
+
dict[str, dict[str, float]]
|
| 135 |
+
Matrice ``{engine_a: {engine_b: divergence}}`` symétrique pour
|
| 136 |
+
``js``, asymétrique pour ``kl``. La diagonale vaut 0.
|
| 137 |
+
"""
|
| 138 |
+
if metric not in ("js", "kl"):
|
| 139 |
+
raise ValueError(f"metric doit être 'js' ou 'kl' — reçu {metric!r}")
|
| 140 |
+
fn = jensen_shannon_divergence if metric == "js" else kl_divergence
|
| 141 |
+
|
| 142 |
+
engines = sorted(distributions.keys())
|
| 143 |
+
matrix: dict[str, dict[str, float]] = {a: {} for a in engines}
|
| 144 |
+
for a in engines:
|
| 145 |
+
for b in engines:
|
| 146 |
+
if a == b:
|
| 147 |
+
matrix[a][b] = 0.0
|
| 148 |
+
elif metric == "js" and b in matrix and a in matrix[b]:
|
| 149 |
+
# Symétrique : recopie pour éviter de recalculer
|
| 150 |
+
matrix[a][b] = matrix[b][a]
|
| 151 |
+
else:
|
| 152 |
+
matrix[a][b] = fn(distributions[a], distributions[b])
|
| 153 |
+
return matrix
|
| 154 |
+
|
| 155 |
+
|
| 156 |
+
# ──────────────────────────────────────────────────────────────────────────
|
| 157 |
+
# Complémentarité (oracle token recall)
|
| 158 |
+
# ──────────────────────────────────────────────────────────────────────────
|
| 159 |
+
|
| 160 |
+
|
| 161 |
+
def _word_multiset(text: str) -> Counter[str]:
|
| 162 |
+
"""Décomposition en multiset de tokens (séparateur whitespace)."""
|
| 163 |
+
return Counter(tok for tok in text.split() if tok)
|
| 164 |
+
|
| 165 |
+
|
| 166 |
+
def oracle_token_recall(
|
| 167 |
+
reference: str,
|
| 168 |
+
hypotheses: dict[str, str],
|
| 169 |
+
) -> float:
|
| 170 |
+
"""Borne supérieure (proxy bag-of-words) du token-recall atteignable
|
| 171 |
+
par un voting majoritaire entre tous les moteurs fournis.
|
| 172 |
+
|
| 173 |
+
Pour chaque token de la référence (avec sa multiplicité), on
|
| 174 |
+
considère qu'il est "préservé" par l'ensemble si au moins un moteur
|
| 175 |
+
en produit une occurrence non encore comptée. Le score est le ratio
|
| 176 |
+
d'occurrences GT préservées sur le total.
|
| 177 |
+
|
| 178 |
+
Parameters
|
| 179 |
+
----------
|
| 180 |
+
reference:
|
| 181 |
+
Texte GT.
|
| 182 |
+
hypotheses:
|
| 183 |
+
``{engine_name: hypothesis_text}``.
|
| 184 |
+
|
| 185 |
+
Returns
|
| 186 |
+
-------
|
| 187 |
+
float
|
| 188 |
+
Ratio dans ``[0, 1]``. ``1.0`` = chaque token GT est présent
|
| 189 |
+
dans au moins une hypothèse à hauteur de sa multiplicité.
|
| 190 |
+
|
| 191 |
+
Note
|
| 192 |
+
----
|
| 193 |
+
Cette borne est **optimiste** (supérieure à la vraie borne par
|
| 194 |
+
voting séquentiel) car elle ignore l'ordre d'apparition. Pour le
|
| 195 |
+
diagnostic « un voting vaut-il l'effort ? » le proxy suffit ; pour
|
| 196 |
+
une vraie borne il faudrait un alignement séquentiel.
|
| 197 |
+
"""
|
| 198 |
+
ref_counter = _word_multiset(reference)
|
| 199 |
+
if not ref_counter or not hypotheses:
|
| 200 |
+
return 1.0 if not ref_counter else 0.0
|
| 201 |
+
|
| 202 |
+
hyp_counters = [_word_multiset(h) for h in hypotheses.values()]
|
| 203 |
+
total_ref = sum(ref_counter.values())
|
| 204 |
+
preserved = 0
|
| 205 |
+
for token, gt_count in ref_counter.items():
|
| 206 |
+
# Pour chaque moteur, le nombre d'occurrences disponibles, plafonné
|
| 207 |
+
# à la multiplicité GT. L'oracle prend le max sur les moteurs.
|
| 208 |
+
best = max((min(gt_count, hc.get(token, 0)) for hc in hyp_counters), default=0)
|
| 209 |
+
preserved += best
|
| 210 |
+
return preserved / total_ref
|
| 211 |
+
|
| 212 |
+
|
| 213 |
+
def complementarity_gap(
|
| 214 |
+
reference: str,
|
| 215 |
+
hypotheses: dict[str, str],
|
| 216 |
+
) -> dict[str, float]:
|
| 217 |
+
"""Compare l'oracle au meilleur moteur seul.
|
| 218 |
+
|
| 219 |
+
Returns
|
| 220 |
+
-------
|
| 221 |
+
dict
|
| 222 |
+
``{
|
| 223 |
+
"oracle_recall": float, # bag-of-words recall de l'oracle
|
| 224 |
+
"best_single_recall": float, # meilleur recall token d'un moteur seul
|
| 225 |
+
"best_engine": str, # nom du moteur correspondant
|
| 226 |
+
"absolute_gap": float, # oracle - best_single (toujours ≥ 0)
|
| 227 |
+
"relative_gap": float, # absolute_gap / (1 - best_single + ε)
|
| 228 |
+
# = fraction des erreurs encore évitables
|
| 229 |
+
# par un ensemble
|
| 230 |
+
}``
|
| 231 |
+
"""
|
| 232 |
+
ref_counter = _word_multiset(reference)
|
| 233 |
+
total = sum(ref_counter.values())
|
| 234 |
+
if not total:
|
| 235 |
+
return {
|
| 236 |
+
"oracle_recall": 1.0,
|
| 237 |
+
"best_single_recall": 1.0,
|
| 238 |
+
"best_engine": "",
|
| 239 |
+
"absolute_gap": 0.0,
|
| 240 |
+
"relative_gap": 0.0,
|
| 241 |
+
}
|
| 242 |
+
|
| 243 |
+
def _single_recall(hyp_text: str) -> float:
|
| 244 |
+
hc = _word_multiset(hyp_text)
|
| 245 |
+
preserved = sum(min(gt, hc.get(tok, 0)) for tok, gt in ref_counter.items())
|
| 246 |
+
return preserved / total
|
| 247 |
+
|
| 248 |
+
if not hypotheses:
|
| 249 |
+
return {
|
| 250 |
+
"oracle_recall": 0.0,
|
| 251 |
+
"best_single_recall": 0.0,
|
| 252 |
+
"best_engine": "",
|
| 253 |
+
"absolute_gap": 0.0,
|
| 254 |
+
"relative_gap": 0.0,
|
| 255 |
+
}
|
| 256 |
+
|
| 257 |
+
per_engine = {name: _single_recall(h) for name, h in hypotheses.items()}
|
| 258 |
+
best_engine, best_recall = max(per_engine.items(), key=lambda kv: kv[1])
|
| 259 |
+
oracle = oracle_token_recall(reference, hypotheses)
|
| 260 |
+
|
| 261 |
+
absolute_gap = max(0.0, oracle - best_recall)
|
| 262 |
+
# relative_gap : fraction des erreurs du meilleur moteur que l'ensemble
|
| 263 |
+
# serait théoriquement capable de récupérer (∈ [0, 1])
|
| 264 |
+
headroom = max(1.0 - best_recall, 1e-12)
|
| 265 |
+
relative_gap = min(1.0, absolute_gap / headroom)
|
| 266 |
+
|
| 267 |
+
return {
|
| 268 |
+
"oracle_recall": oracle,
|
| 269 |
+
"best_single_recall": best_recall,
|
| 270 |
+
"best_engine": best_engine,
|
| 271 |
+
"absolute_gap": absolute_gap,
|
| 272 |
+
"relative_gap": relative_gap,
|
| 273 |
+
}
|
| 274 |
+
|
| 275 |
+
|
| 276 |
+
def pairwise_disagreement_rate(
|
| 277 |
+
reference: str,
|
| 278 |
+
hyp_a: str,
|
| 279 |
+
hyp_b: str,
|
| 280 |
+
) -> float:
|
| 281 |
+
"""Fraction de tokens GT pour lesquels A et B sont en désaccord.
|
| 282 |
+
|
| 283 |
+
Un désaccord = (l'un préserve le token, l'autre non) OU
|
| 284 |
+
(les deux le ratent mais avec des substitutions différentes — non
|
| 285 |
+
capturé ici, on reste sur la version simple présence/absence).
|
| 286 |
+
|
| 287 |
+
Returns
|
| 288 |
+
-------
|
| 289 |
+
float
|
| 290 |
+
Ratio dans ``[0, 1]``. ``0`` = A et B font les mêmes choix
|
| 291 |
+
(pas de gain d'ensemble). ``1`` = A et B sont toujours en
|
| 292 |
+
désaccord (gain d'ensemble maximal).
|
| 293 |
+
"""
|
| 294 |
+
ref_counter = _word_multiset(reference)
|
| 295 |
+
if not ref_counter:
|
| 296 |
+
return 0.0
|
| 297 |
+
a = _word_multiset(hyp_a)
|
| 298 |
+
b = _word_multiset(hyp_b)
|
| 299 |
+
total = sum(ref_counter.values())
|
| 300 |
+
disagree = 0
|
| 301 |
+
for tok, gt_count in ref_counter.items():
|
| 302 |
+
a_pres = min(gt_count, a.get(tok, 0))
|
| 303 |
+
b_pres = min(gt_count, b.get(tok, 0))
|
| 304 |
+
# Compte les positions où A et B donnent une réponse différente
|
| 305 |
+
disagree += abs(a_pres - b_pres)
|
| 306 |
+
return disagree / total
|
| 307 |
+
|
| 308 |
+
|
| 309 |
+
# ──────────────────────────────────────────────────────────────────────────
|
| 310 |
+
# Agrégation au niveau benchmark (Sprint 36)
|
| 311 |
+
# ──────────────────────────────────────────────────────────────────────────
|
| 312 |
+
|
| 313 |
+
|
| 314 |
+
def compute_inter_engine_analysis(
|
| 315 |
+
*,
|
| 316 |
+
per_engine_outputs: dict[str, dict[str, str]],
|
| 317 |
+
ground_truths: dict[str, str],
|
| 318 |
+
taxonomy_distributions: dict[str, dict[str, float]] | None = None,
|
| 319 |
+
divergence_metric: str = "js",
|
| 320 |
+
) -> dict:
|
| 321 |
+
"""Agrège les métriques inter-moteurs sur l'ensemble du corpus.
|
| 322 |
+
|
| 323 |
+
Parameters
|
| 324 |
+
----------
|
| 325 |
+
per_engine_outputs:
|
| 326 |
+
``{engine_name: {doc_id: hypothesis_text}}``. Une entrée par
|
| 327 |
+
moteur, avec une hypothèse par document. Les documents absents
|
| 328 |
+
d'un moteur (échecs, timeouts) sont simplement ignorés pour ce
|
| 329 |
+
moteur — l'oracle est calculé sur les moteurs qui ont produit
|
| 330 |
+
une sortie pour le doc.
|
| 331 |
+
ground_truths:
|
| 332 |
+
``{doc_id: ground_truth_text}``. La GT est la même pour tous
|
| 333 |
+
les moteurs ; on la passe une seule fois.
|
| 334 |
+
taxonomy_distributions:
|
| 335 |
+
``{engine_name: {error_class: probability}}`` — typiquement
|
| 336 |
+
``EngineReport.aggregated_taxonomy["class_distribution"]``. Si
|
| 337 |
+
``None`` ou vide, la divergence taxonomique n'est pas calculée.
|
| 338 |
+
divergence_metric:
|
| 339 |
+
``"js"`` (défaut, symétrique) ou ``"kl"``.
|
| 340 |
+
|
| 341 |
+
Returns
|
| 342 |
+
-------
|
| 343 |
+
dict
|
| 344 |
+
Structure stable consommable par les détecteurs narratifs et le
|
| 345 |
+
rapport HTML :
|
| 346 |
+
``{
|
| 347 |
+
"complementarity": {
|
| 348 |
+
"oracle_recall": float,
|
| 349 |
+
"best_single_recall": float,
|
| 350 |
+
"best_engine": str,
|
| 351 |
+
"absolute_gap": float,
|
| 352 |
+
"relative_gap": float,
|
| 353 |
+
"doc_count": int,
|
| 354 |
+
"per_doc": [{doc_id, oracle, best, gap}, ...] # max 50 docs
|
| 355 |
+
},
|
| 356 |
+
"taxonomy_divergence": {
|
| 357 |
+
"metric": "js"|"kl",
|
| 358 |
+
"matrix": {engine_a: {engine_b: divergence}},
|
| 359 |
+
"max_pair": [engine_a, engine_b, value] # paire la plus divergente
|
| 360 |
+
} | None,
|
| 361 |
+
"engines": [...], # liste des moteurs analysés (ordre stable)
|
| 362 |
+
}``
|
| 363 |
+
"""
|
| 364 |
+
engines = sorted(per_engine_outputs.keys())
|
| 365 |
+
result: dict = {"engines": engines}
|
| 366 |
+
|
| 367 |
+
# ── Complémentarité agrégée doc par doc ──────────────────────────────
|
| 368 |
+
if not engines:
|
| 369 |
+
result["complementarity"] = None
|
| 370 |
+
else:
|
| 371 |
+
total_oracle_preserved = 0
|
| 372 |
+
total_ref_tokens = 0
|
| 373 |
+
per_engine_preserved: dict[str, int] = {name: 0 for name in engines}
|
| 374 |
+
per_doc_records: list[dict] = []
|
| 375 |
+
|
| 376 |
+
for doc_id, gt in ground_truths.items():
|
| 377 |
+
ref_counter = _word_multiset(gt)
|
| 378 |
+
ref_total = sum(ref_counter.values())
|
| 379 |
+
if not ref_total:
|
| 380 |
+
continue
|
| 381 |
+
total_ref_tokens += ref_total
|
| 382 |
+
|
| 383 |
+
doc_hyps: dict[str, str] = {}
|
| 384 |
+
for name in engines:
|
| 385 |
+
hyp = per_engine_outputs.get(name, {}).get(doc_id)
|
| 386 |
+
if hyp is not None:
|
| 387 |
+
doc_hyps[name] = hyp
|
| 388 |
+
|
| 389 |
+
if not doc_hyps:
|
| 390 |
+
continue
|
| 391 |
+
|
| 392 |
+
hyp_counters = {n: _word_multiset(h) for n, h in doc_hyps.items()}
|
| 393 |
+
|
| 394 |
+
doc_oracle = 0
|
| 395 |
+
doc_best_per_engine: dict[str, int] = {n: 0 for n in doc_hyps}
|
| 396 |
+
for tok, gt_count in ref_counter.items():
|
| 397 |
+
# Oracle : meilleur des moteurs sur ce token
|
| 398 |
+
best_for_token = 0
|
| 399 |
+
for name, hc in hyp_counters.items():
|
| 400 |
+
preserved = min(gt_count, hc.get(tok, 0))
|
| 401 |
+
doc_best_per_engine[name] += preserved
|
| 402 |
+
if preserved > best_for_token:
|
| 403 |
+
best_for_token = preserved
|
| 404 |
+
doc_oracle += best_for_token
|
| 405 |
+
|
| 406 |
+
total_oracle_preserved += doc_oracle
|
| 407 |
+
for name, count in doc_best_per_engine.items():
|
| 408 |
+
per_engine_preserved[name] += count
|
| 409 |
+
|
| 410 |
+
doc_best = max(doc_best_per_engine.values()) if doc_best_per_engine else 0
|
| 411 |
+
per_doc_records.append({
|
| 412 |
+
"doc_id": doc_id,
|
| 413 |
+
"oracle_recall": doc_oracle / ref_total,
|
| 414 |
+
"best_single_recall": doc_best / ref_total,
|
| 415 |
+
"absolute_gap": (doc_oracle - doc_best) / ref_total,
|
| 416 |
+
})
|
| 417 |
+
|
| 418 |
+
if total_ref_tokens == 0:
|
| 419 |
+
result["complementarity"] = None
|
| 420 |
+
else:
|
| 421 |
+
oracle_recall = total_oracle_preserved / total_ref_tokens
|
| 422 |
+
recalls = {
|
| 423 |
+
name: per_engine_preserved[name] / total_ref_tokens
|
| 424 |
+
for name in engines
|
| 425 |
+
}
|
| 426 |
+
best_engine, best_recall = max(recalls.items(), key=lambda kv: kv[1])
|
| 427 |
+
absolute_gap = max(0.0, oracle_recall - best_recall)
|
| 428 |
+
headroom = max(1.0 - best_recall, 1e-12)
|
| 429 |
+
relative_gap = min(1.0, absolute_gap / headroom)
|
| 430 |
+
|
| 431 |
+
# Garder les ``per_doc_records`` les plus instructifs : tri par
|
| 432 |
+
# gap absolu décroissant, top 50. Les détecteurs narratifs
|
| 433 |
+
# n'en consomment que quelques-uns.
|
| 434 |
+
per_doc_records.sort(key=lambda r: r["absolute_gap"], reverse=True)
|
| 435 |
+
per_doc_top = per_doc_records[:50]
|
| 436 |
+
|
| 437 |
+
result["complementarity"] = {
|
| 438 |
+
"oracle_recall": oracle_recall,
|
| 439 |
+
"best_single_recall": best_recall,
|
| 440 |
+
"best_engine": best_engine,
|
| 441 |
+
"absolute_gap": absolute_gap,
|
| 442 |
+
"relative_gap": relative_gap,
|
| 443 |
+
"doc_count": len(per_doc_records),
|
| 444 |
+
"per_engine_recall": recalls,
|
| 445 |
+
"per_doc": per_doc_top,
|
| 446 |
+
}
|
| 447 |
+
|
| 448 |
+
# ── Divergence taxonomique ─────────────────────────────────────────
|
| 449 |
+
if not taxonomy_distributions:
|
| 450 |
+
result["taxonomy_divergence"] = None
|
| 451 |
+
else:
|
| 452 |
+
matrix = taxonomy_divergence_matrix(
|
| 453 |
+
taxonomy_distributions,
|
| 454 |
+
metric=divergence_metric,
|
| 455 |
+
)
|
| 456 |
+
# Cherche la paire la plus divergente (utile pour la synthèse
|
| 457 |
+
# narrative qui veut nommer les deux moteurs candidats à
|
| 458 |
+
# l'ensemble).
|
| 459 |
+
max_pair: tuple[str, str, float] = ("", "", 0.0)
|
| 460 |
+
names = sorted(matrix.keys())
|
| 461 |
+
for i, a in enumerate(names):
|
| 462 |
+
for b in names[i + 1:]:
|
| 463 |
+
v = matrix[a][b]
|
| 464 |
+
if v > max_pair[2]:
|
| 465 |
+
max_pair = (a, b, v)
|
| 466 |
+
|
| 467 |
+
result["taxonomy_divergence"] = {
|
| 468 |
+
"metric": divergence_metric,
|
| 469 |
+
"matrix": matrix,
|
| 470 |
+
"max_pair": list(max_pair) if max_pair[2] > 0 else None,
|
| 471 |
+
}
|
| 472 |
+
|
| 473 |
+
return result
|
| 474 |
+
|
| 475 |
+
|
| 476 |
+
__all__ = [
|
| 477 |
+
"kl_divergence",
|
| 478 |
+
"jensen_shannon_divergence",
|
| 479 |
+
"taxonomy_divergence_matrix",
|
| 480 |
+
"oracle_token_recall",
|
| 481 |
+
"complementarity_gap",
|
| 482 |
+
"pairwise_disagreement_rate",
|
| 483 |
+
"compute_inter_engine_analysis",
|
| 484 |
+
]
|
|
@@ -0,0 +1,280 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""Layout F1 par type de région — Sprint 54.
|
| 2 |
+
|
| 3 |
+
Sprint 54 — A.II.2.2 du plan d'évolution 2026.
|
| 4 |
+
|
| 5 |
+
Pourquoi ce module
|
| 6 |
+
------------------
|
| 7 |
+
Un médiéviste qui édite un manuscrit glosé veut savoir : *« le moteur
|
| 8 |
+
sépare-t-il bien le texte principal de la glose ? »*. Le score de
|
| 9 |
+
structure global de Picarones (Sprint 5) agrège fusion/fragmentation
|
| 10 |
+
de lignes en un seul nombre — utile mais non typé. Ce module
|
| 11 |
+
discrimine par **type de région** ALTO/PAGE (``TextRegion``,
|
| 12 |
+
``MarginNote``, ``Header``, ``Footer``, ``Drop-Cap``...) en
|
| 13 |
+
appliquant le pattern ICDAR layout standard :
|
| 14 |
+
|
| 15 |
+
- **TP** : région GT et région hypothèse de **même type** avec
|
| 16 |
+
chevauchement IoU ≥ seuil (alignement greedy par IoU décroissant),
|
| 17 |
+
- **FN** : région GT non matchée,
|
| 18 |
+
- **FP** : région hypothèse non matchée,
|
| 19 |
+
- F1 calculé global et par type.
|
| 20 |
+
|
| 21 |
+
Le pattern d'alignement est le même que pour le NER (Sprint 38) — on
|
| 22 |
+
réutilise une approche éprouvée plutôt que d'en inventer une nouvelle.
|
| 23 |
+
|
| 24 |
+
Stratégie de découpage
|
| 25 |
+
----------------------
|
| 26 |
+
Cohérente avec NER (Sprint 38), Flesch (Sprint 52), Reading order F1
|
| 27 |
+
(Sprint 53) : couche de calcul pure d'abord. L'utilisateur fournit
|
| 28 |
+
deux listes de ``Region`` (typiquement extraites de ALTO/PAGE par un
|
| 29 |
+
parser amont — le parser ALTO/PAGE standard de Picarones suivra
|
| 30 |
+
dans un sprint dédié). Pas de câblage runner ni de vue HTML ici.
|
| 31 |
+
|
| 32 |
+
Convention de coordonnées
|
| 33 |
+
-------------------------
|
| 34 |
+
Une bbox est un tuple ``(x, y, width, height)`` en pixels (origine
|
| 35 |
+
en haut à gauche, axe y vers le bas — convention ALTO et PAGE
|
| 36 |
+
standard). L'IoU est calculée sur l'aire d'intersection / union des
|
| 37 |
+
rectangles.
|
| 38 |
+
"""
|
| 39 |
+
|
| 40 |
+
from __future__ import annotations
|
| 41 |
+
|
| 42 |
+
import logging
|
| 43 |
+
from dataclasses import dataclass
|
| 44 |
+
from typing import Iterable
|
| 45 |
+
|
| 46 |
+
logger = logging.getLogger(__name__)
|
| 47 |
+
|
| 48 |
+
|
| 49 |
+
# ──────────────────────────────────────────────────────────────────────────
|
| 50 |
+
# Modèle de données
|
| 51 |
+
# ──────────────────────────────────────────────────────────────────────────
|
| 52 |
+
|
| 53 |
+
|
| 54 |
+
@dataclass(frozen=True)
|
| 55 |
+
class Region:
|
| 56 |
+
"""Une région ALTO/PAGE alignable sur sa GT.
|
| 57 |
+
|
| 58 |
+
Attributs
|
| 59 |
+
---------
|
| 60 |
+
id:
|
| 61 |
+
Identifiant unique au sein de la séquence (ex. ``"r_1"``,
|
| 62 |
+
``"region_main"``). Informatif — l'alignement se fait par IoU,
|
| 63 |
+
pas par ID.
|
| 64 |
+
type:
|
| 65 |
+
Catégorie de la région (``"TextRegion"``, ``"MarginNote"``,
|
| 66 |
+
``"Header"``, etc.). Comparaison **case-insensitive**.
|
| 67 |
+
bbox:
|
| 68 |
+
Rectangle ``(x, y, width, height)`` en pixels, origine en haut
|
| 69 |
+
à gauche. Doit avoir width > 0 et height > 0.
|
| 70 |
+
"""
|
| 71 |
+
|
| 72 |
+
id: str
|
| 73 |
+
type: str
|
| 74 |
+
bbox: tuple[int, int, int, int]
|
| 75 |
+
|
| 76 |
+
def __post_init__(self) -> None:
|
| 77 |
+
x, y, w, h = self.bbox
|
| 78 |
+
if w <= 0 or h <= 0:
|
| 79 |
+
raise ValueError(
|
| 80 |
+
f"Region {self.id!r} : bbox invalide (w={w}, h={h}). "
|
| 81 |
+
"width et height doivent être strictement positifs."
|
| 82 |
+
)
|
| 83 |
+
|
| 84 |
+
@property
|
| 85 |
+
def area(self) -> int:
|
| 86 |
+
_, _, w, h = self.bbox
|
| 87 |
+
return w * h
|
| 88 |
+
|
| 89 |
+
|
| 90 |
+
def _to_region(obj: Region | dict) -> Region:
|
| 91 |
+
"""Coerce un dict en ``Region`` (clés ``id``, ``type``, ``bbox``)."""
|
| 92 |
+
if isinstance(obj, Region):
|
| 93 |
+
return obj
|
| 94 |
+
return Region(
|
| 95 |
+
id=str(obj["id"]),
|
| 96 |
+
type=str(obj["type"]),
|
| 97 |
+
bbox=tuple(obj["bbox"]), # type: ignore[arg-type]
|
| 98 |
+
)
|
| 99 |
+
|
| 100 |
+
|
| 101 |
+
# ──────────────────────────────────────────────────────────────────────────
|
| 102 |
+
# IoU + alignement greedy
|
| 103 |
+
# ──────────────────────────────────────────────────────────────────────────
|
| 104 |
+
|
| 105 |
+
|
| 106 |
+
def _iou_bbox(a: Region, b: Region) -> float:
|
| 107 |
+
"""Intersection-over-Union de deux bboxes ``(x, y, w, h)``."""
|
| 108 |
+
ax, ay, aw, ah = a.bbox
|
| 109 |
+
bx, by, bw, bh = b.bbox
|
| 110 |
+
inter_x = max(ax, bx)
|
| 111 |
+
inter_y = max(ay, by)
|
| 112 |
+
inter_x_end = min(ax + aw, bx + bw)
|
| 113 |
+
inter_y_end = min(ay + ah, by + bh)
|
| 114 |
+
inter_w = max(0, inter_x_end - inter_x)
|
| 115 |
+
inter_h = max(0, inter_y_end - inter_y)
|
| 116 |
+
inter = inter_w * inter_h
|
| 117 |
+
if inter == 0:
|
| 118 |
+
return 0.0
|
| 119 |
+
union = a.area + b.area - inter
|
| 120 |
+
if union <= 0:
|
| 121 |
+
return 0.0
|
| 122 |
+
return inter / union
|
| 123 |
+
|
| 124 |
+
|
| 125 |
+
def _align_regions(
|
| 126 |
+
references: list[Region],
|
| 127 |
+
hypotheses: list[Region],
|
| 128 |
+
iou_threshold: float,
|
| 129 |
+
) -> tuple[list[tuple[int, int, float]], set[int], set[int]]:
|
| 130 |
+
"""Appareillage greedy par IoU décroissant ; same type requis.
|
| 131 |
+
|
| 132 |
+
Renvoie ``(matches, unmatched_refs, unmatched_hyps)`` —
|
| 133 |
+
``matches`` est une liste de ``(idx_ref, idx_hyp, iou)``.
|
| 134 |
+
"""
|
| 135 |
+
candidates: list[tuple[float, int, int]] = []
|
| 136 |
+
for i, r in enumerate(references):
|
| 137 |
+
for j, h in enumerate(hypotheses):
|
| 138 |
+
if r.type.casefold() != h.type.casefold():
|
| 139 |
+
continue
|
| 140 |
+
iou = _iou_bbox(r, h)
|
| 141 |
+
if iou >= iou_threshold:
|
| 142 |
+
candidates.append((iou, i, j))
|
| 143 |
+
|
| 144 |
+
# Tri stable : IoU décroissant, puis indices croissants pour
|
| 145 |
+
# déterminisme sur égalités.
|
| 146 |
+
candidates.sort(key=lambda t: (-t[0], t[1], t[2]))
|
| 147 |
+
|
| 148 |
+
matched_refs: set[int] = set()
|
| 149 |
+
matched_hyps: set[int] = set()
|
| 150 |
+
matches: list[tuple[int, int, float]] = []
|
| 151 |
+
for iou, i, j in candidates:
|
| 152 |
+
if i in matched_refs or j in matched_hyps:
|
| 153 |
+
continue
|
| 154 |
+
matched_refs.add(i)
|
| 155 |
+
matched_hyps.add(j)
|
| 156 |
+
matches.append((i, j, iou))
|
| 157 |
+
|
| 158 |
+
unmatched_refs = set(range(len(references))) - matched_refs
|
| 159 |
+
unmatched_hyps = set(range(len(hypotheses))) - matched_hyps
|
| 160 |
+
return matches, unmatched_refs, unmatched_hyps
|
| 161 |
+
|
| 162 |
+
|
| 163 |
+
# ──────────────────────────────────────────────────────────────────────────
|
| 164 |
+
# Métrique principale
|
| 165 |
+
# ──────────────────────────────────────────────────────────────────────────
|
| 166 |
+
|
| 167 |
+
|
| 168 |
+
def _prf(tp: int, fp: int, fn: int) -> dict[str, float]:
|
| 169 |
+
p = tp / (tp + fp) if (tp + fp) > 0 else 0.0
|
| 170 |
+
r = tp / (tp + fn) if (tp + fn) > 0 else 0.0
|
| 171 |
+
f1 = 2 * p * r / (p + r) if (p + r) > 0 else 0.0
|
| 172 |
+
return {"precision": p, "recall": r, "f1": f1, "support": tp + fn}
|
| 173 |
+
|
| 174 |
+
|
| 175 |
+
def compute_layout_metrics(
|
| 176 |
+
reference_regions: Iterable[Region | dict] | None,
|
| 177 |
+
hypothesis_regions: Iterable[Region | dict] | None,
|
| 178 |
+
iou_threshold: float = 0.5,
|
| 179 |
+
) -> dict:
|
| 180 |
+
"""Calcule precision/recall/F1 sur le layout par type de région.
|
| 181 |
+
|
| 182 |
+
Parameters
|
| 183 |
+
----------
|
| 184 |
+
reference_regions:
|
| 185 |
+
Liste de régions GT (``Region`` ou dict ``{id, type, bbox}``).
|
| 186 |
+
hypothesis_regions:
|
| 187 |
+
Liste de régions produites par le moteur OCR/HTR ou un
|
| 188 |
+
layout-detector.
|
| 189 |
+
iou_threshold:
|
| 190 |
+
Seuil de chevauchement minimal pour déclarer un appariement
|
| 191 |
+
(défaut : 0,5 — convention ICDAR).
|
| 192 |
+
|
| 193 |
+
Returns
|
| 194 |
+
-------
|
| 195 |
+
dict
|
| 196 |
+
``{
|
| 197 |
+
"global": {"precision", "recall", "f1", "support"},
|
| 198 |
+
"per_type": {type_name: {"precision", ...}},
|
| 199 |
+
"true_positives": int,
|
| 200 |
+
"false_positives": int,
|
| 201 |
+
"false_negatives": int,
|
| 202 |
+
"missed_regions": list[dict], # GT non matchées
|
| 203 |
+
"hallucinated_regions": list[dict], # hyp non matchées
|
| 204 |
+
"iou_threshold": float,
|
| 205 |
+
}``
|
| 206 |
+
|
| 207 |
+
Cas dégénérés
|
| 208 |
+
-------------
|
| 209 |
+
- Deux listes vides → F1 = 0 et tous compteurs à 0.
|
| 210 |
+
- GT vide + hyp non-vide → F1 = 0 (toutes hyp = FP).
|
| 211 |
+
- hyp vide + GT non-vide → F1 = 0 (toutes GT = FN).
|
| 212 |
+
"""
|
| 213 |
+
refs = [_to_region(r) for r in (reference_regions or [])]
|
| 214 |
+
hyps = [_to_region(h) for h in (hypothesis_regions or [])]
|
| 215 |
+
|
| 216 |
+
matches, unmatched_refs, unmatched_hyps = _align_regions(
|
| 217 |
+
refs, hyps, iou_threshold,
|
| 218 |
+
)
|
| 219 |
+
|
| 220 |
+
tp = len(matches)
|
| 221 |
+
fn = len(unmatched_refs)
|
| 222 |
+
fp = len(unmatched_hyps)
|
| 223 |
+
|
| 224 |
+
cat_tp: dict[str, int] = {}
|
| 225 |
+
cat_fn: dict[str, int] = {}
|
| 226 |
+
cat_fp: dict[str, int] = {}
|
| 227 |
+
for i, _j, _iou in matches:
|
| 228 |
+
cat = refs[i].type
|
| 229 |
+
cat_tp[cat] = cat_tp.get(cat, 0) + 1
|
| 230 |
+
for i in unmatched_refs:
|
| 231 |
+
cat = refs[i].type
|
| 232 |
+
cat_fn[cat] = cat_fn.get(cat, 0) + 1
|
| 233 |
+
for j in unmatched_hyps:
|
| 234 |
+
cat = hyps[j].type
|
| 235 |
+
cat_fp[cat] = cat_fp.get(cat, 0) + 1
|
| 236 |
+
|
| 237 |
+
all_categories = sorted(set(cat_tp) | set(cat_fn) | set(cat_fp))
|
| 238 |
+
per_type = {
|
| 239 |
+
cat: _prf(
|
| 240 |
+
cat_tp.get(cat, 0),
|
| 241 |
+
cat_fp.get(cat, 0),
|
| 242 |
+
cat_fn.get(cat, 0),
|
| 243 |
+
)
|
| 244 |
+
for cat in all_categories
|
| 245 |
+
}
|
| 246 |
+
|
| 247 |
+
return {
|
| 248 |
+
"global": _prf(tp, fp, fn),
|
| 249 |
+
"per_type": per_type,
|
| 250 |
+
"true_positives": tp,
|
| 251 |
+
"false_positives": fp,
|
| 252 |
+
"false_negatives": fn,
|
| 253 |
+
"missed_regions": [
|
| 254 |
+
{"id": refs[i].id, "type": refs[i].type, "bbox": list(refs[i].bbox)}
|
| 255 |
+
for i in sorted(unmatched_refs)
|
| 256 |
+
],
|
| 257 |
+
"hallucinated_regions": [
|
| 258 |
+
{"id": hyps[j].id, "type": hyps[j].type, "bbox": list(hyps[j].bbox)}
|
| 259 |
+
for j in sorted(unmatched_hyps)
|
| 260 |
+
],
|
| 261 |
+
"iou_threshold": iou_threshold,
|
| 262 |
+
}
|
| 263 |
+
|
| 264 |
+
|
| 265 |
+
def layout_f1(
|
| 266 |
+
reference_regions: Iterable[Region | dict] | None,
|
| 267 |
+
hypothesis_regions: Iterable[Region | dict] | None,
|
| 268 |
+
iou_threshold: float = 0.5,
|
| 269 |
+
) -> float:
|
| 270 |
+
"""Raccourci : F1 global du layout."""
|
| 271 |
+
return compute_layout_metrics(
|
| 272 |
+
reference_regions, hypothesis_regions, iou_threshold,
|
| 273 |
+
)["global"]["f1"]
|
| 274 |
+
|
| 275 |
+
|
| 276 |
+
__all__ = [
|
| 277 |
+
"Region",
|
| 278 |
+
"compute_layout_metrics",
|
| 279 |
+
"layout_f1",
|
| 280 |
+
]
|
|
@@ -0,0 +1,561 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""Section « Leviers d'amélioration » — Sprint 82 (A.I.9).
|
| 2 |
+
|
| 3 |
+
Sprint 82 — A.I.9 du plan d'évolution 2026.
|
| 4 |
+
|
| 5 |
+
Pourquoi ce module
|
| 6 |
+
------------------
|
| 7 |
+
Le moteur narratif (Sprint 19) émet des `Fact` qui décrivent **ce
|
| 8 |
+
qui s'est passé** dans le benchmark : qui gagne, qui s'effondre,
|
| 9 |
+
qui est fragile. Ce sprint répond à une question
|
| 10 |
+
complémentaire : **sur quelle dimension le bénéfice attendu d'une
|
| 11 |
+
amélioration serait-il le plus visible ?**
|
| 12 |
+
|
| 13 |
+
Pas de prescription
|
| 14 |
+
-------------------
|
| 15 |
+
Picarones est un **outil de recherche**, pas un atelier de
|
| 16 |
+
production. Le module ne dit jamais *« faites X »* ni
|
| 17 |
+
*« utilisez le moteur Y »* ; il agrège des **observations
|
| 18 |
+
factuelles** déjà calculées dans d'autres modules (Sprints 75-81)
|
| 19 |
+
et les présente comme un récapitulatif compact en bas du rapport.
|
| 20 |
+
Le chercheur lit, juge et arbitre.
|
| 21 |
+
|
| 22 |
+
Exemples de leviers émis
|
| 23 |
+
------------------------
|
| 24 |
+
- *« 65 % des erreurs de Tesseract sont de classe récupérable
|
| 25 |
+
(case_error, ligature_error, abbreviation_error) — un
|
| 26 |
+
post-processing trivial absorberait une partie. »*
|
| 27 |
+
- *« 12 % de vos documents concentrent 78 % du CER total
|
| 28 |
+
(Pareto-CER). »*
|
| 29 |
+
- *« Le déficit projeté du moteur le plus fragile sur le corpus
|
| 30 |
+
réel est de 4,2 points de CER (Sprint 81). »*
|
| 31 |
+
- *« Le top-3 des tokens GT systématiquement modernisés est
|
| 32 |
+
maistre, nostre, veoir (Sprint 80). »*
|
| 33 |
+
|
| 34 |
+
Structure
|
| 35 |
+
---------
|
| 36 |
+
Module parallèle au registre narratif Sprint 19 : `Lever` est la
|
| 37 |
+
dataclass équivalente à `Fact`, `LeverImportance` reprend la
|
| 38 |
+
sémantique de `FactImportance`, `@register_lever` indexe les
|
| 39 |
+
détecteurs. Garde-fou anti-hallucination identique : chaque
|
| 40 |
+
nombre rendu doit être présent dans le `payload` du `Lever`.
|
| 41 |
+
|
| 42 |
+
Les détecteurs lisent **uniquement** des structures déjà
|
| 43 |
+
construites par le pipeline du benchmark — ils ne calculent rien
|
| 44 |
+
de nouveau, ils synthétisent. C'est pourquoi le module est
|
| 45 |
+
résolument optionnel : si un benchmark n'expose pas
|
| 46 |
+
`taxonomy_aggregated`, `inter_engine_analysis`, `corpus_difficulty`,
|
| 47 |
+
`lexical_modernization` ou `robustness_projection`, le détecteur
|
| 48 |
+
correspondant retourne tout simplement `[]`.
|
| 49 |
+
"""
|
| 50 |
+
|
| 51 |
+
from __future__ import annotations
|
| 52 |
+
|
| 53 |
+
import logging
|
| 54 |
+
import threading
|
| 55 |
+
from dataclasses import dataclass
|
| 56 |
+
from enum import Enum
|
| 57 |
+
from typing import Callable
|
| 58 |
+
|
| 59 |
+
logger = logging.getLogger(__name__)
|
| 60 |
+
|
| 61 |
+
|
| 62 |
+
# ──────────────────────────────────────────────────────────────────────────
|
| 63 |
+
# Modèle
|
| 64 |
+
# ──────────────────────────────────────────────────────────────────────────
|
| 65 |
+
|
| 66 |
+
|
| 67 |
+
class LeverType(str, Enum):
|
| 68 |
+
"""Types de leviers détectés."""
|
| 69 |
+
|
| 70 |
+
DOMINANT_RECOVERABLE_CLASS = "dominant_recoverable_class"
|
| 71 |
+
"""Une part importante des erreurs d'un moteur est dans des classes
|
| 72 |
+
catégorisées « récupérables » (Sprint 77)."""
|
| 73 |
+
|
| 74 |
+
PARETO_CONCENTRATION = "pareto_concentration"
|
| 75 |
+
"""Une fraction minoritaire de documents concentre une fraction
|
| 76 |
+
majoritaire du CER total — l'inspection ciblée est rentable."""
|
| 77 |
+
|
| 78 |
+
COMPLEMENTARITY_OBSERVATION = "complementarity_observation"
|
| 79 |
+
"""Le `complementarity_gap` (Sprint 35) entre l'oracle et le
|
| 80 |
+
meilleur moteur seul est non négligeable — observation factuelle,
|
| 81 |
+
aucune recommandation d'ensemble."""
|
| 82 |
+
|
| 83 |
+
LEXICAL_MODERNIZATION_OBSERVATION = "lexical_modernization_observation"
|
| 84 |
+
"""Top-N des tokens GT systématiquement modernisés (Sprint 80)."""
|
| 85 |
+
|
| 86 |
+
ROBUSTNESS_PROJECTION_OBSERVATION = "robustness_projection_observation"
|
| 87 |
+
"""Déficit projeté global le plus important pour un moteur sur
|
| 88 |
+
le corpus réel (Sprint 81)."""
|
| 89 |
+
|
| 90 |
+
|
| 91 |
+
class LeverImportance(int, Enum):
|
| 92 |
+
"""Importance éditoriale d'un levier."""
|
| 93 |
+
|
| 94 |
+
HIGH = 70
|
| 95 |
+
MEDIUM = 40
|
| 96 |
+
LOW = 10
|
| 97 |
+
|
| 98 |
+
|
| 99 |
+
@dataclass
|
| 100 |
+
class Lever:
|
| 101 |
+
"""Observation factuelle synthétisable en encart « Leviers ».
|
| 102 |
+
|
| 103 |
+
Attributes
|
| 104 |
+
----------
|
| 105 |
+
type:
|
| 106 |
+
Le type de levier (voir `LeverType`).
|
| 107 |
+
importance:
|
| 108 |
+
Score qui décide l'ordre d'affichage.
|
| 109 |
+
payload:
|
| 110 |
+
Données brutes — **tout chiffre rendu dans le HTML doit
|
| 111 |
+
provenir d'ici**, jamais d'un calcul du renderer.
|
| 112 |
+
engines_involved:
|
| 113 |
+
Noms des moteurs concernés (peut être vide pour un levier
|
| 114 |
+
corpus-wide).
|
| 115 |
+
"""
|
| 116 |
+
|
| 117 |
+
type: LeverType
|
| 118 |
+
importance: LeverImportance
|
| 119 |
+
payload: dict
|
| 120 |
+
engines_involved: tuple[str, ...] = ()
|
| 121 |
+
|
| 122 |
+
def as_dict(self) -> dict:
|
| 123 |
+
return {
|
| 124 |
+
"type": self.type.value,
|
| 125 |
+
"importance": int(self.importance),
|
| 126 |
+
"payload": self.payload,
|
| 127 |
+
"engines_involved": list(self.engines_involved),
|
| 128 |
+
}
|
| 129 |
+
|
| 130 |
+
|
| 131 |
+
# ──────────────────────────────────────────────────────────────────────────
|
| 132 |
+
# Registre
|
| 133 |
+
# ──────────────────────────────────────────────────────────────────────────
|
| 134 |
+
|
| 135 |
+
|
| 136 |
+
LeverDetectorFn = Callable[[dict], list[Lever]]
|
| 137 |
+
|
| 138 |
+
|
| 139 |
+
@dataclass(frozen=True)
|
| 140 |
+
class LeverDetectorEntry:
|
| 141 |
+
lever_type: LeverType
|
| 142 |
+
fn: LeverDetectorFn
|
| 143 |
+
priority: int
|
| 144 |
+
|
| 145 |
+
|
| 146 |
+
_LEVER_REGISTRY: dict[LeverType, LeverDetectorEntry] = {}
|
| 147 |
+
_LEVER_REGISTRY_LOCK = threading.Lock()
|
| 148 |
+
|
| 149 |
+
|
| 150 |
+
def register_lever(
|
| 151 |
+
lever_type: LeverType,
|
| 152 |
+
*,
|
| 153 |
+
priority: int,
|
| 154 |
+
) -> Callable[[LeverDetectorFn], LeverDetectorFn]:
|
| 155 |
+
"""Décorateur : enregistre un détecteur de levier.
|
| 156 |
+
|
| 157 |
+
Une seule fonction par type — réenregistrer lève `ValueError`.
|
| 158 |
+
"""
|
| 159 |
+
def _decorator(fn: LeverDetectorFn) -> LeverDetectorFn:
|
| 160 |
+
with _LEVER_REGISTRY_LOCK:
|
| 161 |
+
if lever_type in _LEVER_REGISTRY:
|
| 162 |
+
raise ValueError(
|
| 163 |
+
f"Détecteur déjà enregistré pour {lever_type.value!r} : "
|
| 164 |
+
f"{_LEVER_REGISTRY[lever_type].fn.__name__}."
|
| 165 |
+
)
|
| 166 |
+
_LEVER_REGISTRY[lever_type] = LeverDetectorEntry(
|
| 167 |
+
lever_type=lever_type, fn=fn, priority=int(priority),
|
| 168 |
+
)
|
| 169 |
+
return fn
|
| 170 |
+
return _decorator
|
| 171 |
+
|
| 172 |
+
|
| 173 |
+
def unregister_lever(lever_type: LeverType) -> None:
|
| 174 |
+
with _LEVER_REGISTRY_LOCK:
|
| 175 |
+
_LEVER_REGISTRY.pop(lever_type, None)
|
| 176 |
+
|
| 177 |
+
|
| 178 |
+
def iter_lever_detectors() -> list[LeverDetectorEntry]:
|
| 179 |
+
with _LEVER_REGISTRY_LOCK:
|
| 180 |
+
entries = list(_LEVER_REGISTRY.values())
|
| 181 |
+
entries.sort(key=lambda e: e.priority)
|
| 182 |
+
return entries
|
| 183 |
+
|
| 184 |
+
|
| 185 |
+
def detect_levers(benchmark_data: dict) -> list[Lever]:
|
| 186 |
+
"""Applique tous les détecteurs enregistrés et trie par importance
|
| 187 |
+
décroissante puis priorité d'enregistrement croissante."""
|
| 188 |
+
levers: list[Lever] = []
|
| 189 |
+
for entry in iter_lever_detectors():
|
| 190 |
+
try:
|
| 191 |
+
result = entry.fn(benchmark_data)
|
| 192 |
+
except Exception as e:
|
| 193 |
+
logger.warning(
|
| 194 |
+
"[levers.detector.%s] fonctionnalité dégradée : %s",
|
| 195 |
+
entry.lever_type.value, e,
|
| 196 |
+
)
|
| 197 |
+
continue
|
| 198 |
+
if result:
|
| 199 |
+
levers.extend(result)
|
| 200 |
+
# Tri stable : importance décroissante d'abord
|
| 201 |
+
levers.sort(key=lambda lv: -int(lv.importance))
|
| 202 |
+
return levers
|
| 203 |
+
|
| 204 |
+
|
| 205 |
+
# ──────────────────────────────────────────────────────────────────────────
|
| 206 |
+
# Détecteurs
|
| 207 |
+
# ──────────────────────────────────────────────────────────────────────────
|
| 208 |
+
|
| 209 |
+
|
| 210 |
+
# Catégorisation reprise du Sprint 77 (taxonomy_comparison.py).
|
| 211 |
+
# Volontairement dupliquée ici pour ne pas introduire d'import
|
| 212 |
+
# circulaire — la sémantique est gelée.
|
| 213 |
+
_RECOVERABILITY: dict[str, str] = {
|
| 214 |
+
"case_error": "recoverable",
|
| 215 |
+
"ligature_error": "recoverable",
|
| 216 |
+
"abbreviation_error": "recoverable",
|
| 217 |
+
"diacritic_error": "difficult",
|
| 218 |
+
"visual_confusion": "difficult",
|
| 219 |
+
"hapax": "difficult",
|
| 220 |
+
"lacuna": "irrecoverable",
|
| 221 |
+
"oov_character": "irrecoverable",
|
| 222 |
+
"segmentation_error": "irrecoverable",
|
| 223 |
+
}
|
| 224 |
+
|
| 225 |
+
|
| 226 |
+
@register_lever(LeverType.DOMINANT_RECOVERABLE_CLASS, priority=10)
|
| 227 |
+
def detect_dominant_recoverable_class(
|
| 228 |
+
benchmark_data: dict,
|
| 229 |
+
*,
|
| 230 |
+
threshold: float = 0.30,
|
| 231 |
+
) -> list[Lever]:
|
| 232 |
+
"""Émet un levier si ≥ `threshold` des erreurs d'un moteur sont
|
| 233 |
+
classifiées récupérables (catégorisation Sprint 77).
|
| 234 |
+
|
| 235 |
+
Lit `benchmark_data["engines"][i]["aggregated_taxonomy"]` —
|
| 236 |
+
structure produite par le runner historique. Si absent, retourne
|
| 237 |
+
[].
|
| 238 |
+
"""
|
| 239 |
+
engines = benchmark_data.get("engines") or []
|
| 240 |
+
out: list[Lever] = []
|
| 241 |
+
for engine in engines:
|
| 242 |
+
taxonomy = engine.get("aggregated_taxonomy")
|
| 243 |
+
if not taxonomy:
|
| 244 |
+
continue
|
| 245 |
+
# `taxonomy` peut être {class_name: int} ou un dict avec une
|
| 246 |
+
# sous-clé "counts" — on accepte les deux conventions.
|
| 247 |
+
counts = taxonomy.get("counts") if isinstance(taxonomy, dict) and "counts" in taxonomy else taxonomy
|
| 248 |
+
if not isinstance(counts, dict) or not counts:
|
| 249 |
+
continue
|
| 250 |
+
try:
|
| 251 |
+
int_counts = {k: int(v) for k, v in counts.items() if isinstance(v, (int, float))}
|
| 252 |
+
except (TypeError, ValueError):
|
| 253 |
+
continue
|
| 254 |
+
total = sum(int_counts.values())
|
| 255 |
+
if total <= 0:
|
| 256 |
+
continue
|
| 257 |
+
recoverable_total = sum(
|
| 258 |
+
v for k, v in int_counts.items()
|
| 259 |
+
if _RECOVERABILITY.get(k) == "recoverable"
|
| 260 |
+
)
|
| 261 |
+
share = recoverable_total / total
|
| 262 |
+
if share < threshold:
|
| 263 |
+
continue
|
| 264 |
+
# Classes récupérables non vides triées par count décroissant
|
| 265 |
+
breakdown = sorted(
|
| 266 |
+
(
|
| 267 |
+
(k, v) for k, v in int_counts.items()
|
| 268 |
+
if _RECOVERABILITY.get(k) == "recoverable" and v > 0
|
| 269 |
+
),
|
| 270 |
+
key=lambda kv: -kv[1],
|
| 271 |
+
)
|
| 272 |
+
importance = (
|
| 273 |
+
LeverImportance.HIGH if share >= 0.50 else LeverImportance.MEDIUM
|
| 274 |
+
)
|
| 275 |
+
out.append(Lever(
|
| 276 |
+
type=LeverType.DOMINANT_RECOVERABLE_CLASS,
|
| 277 |
+
importance=importance,
|
| 278 |
+
payload={
|
| 279 |
+
"engine": engine.get("name") or "?",
|
| 280 |
+
"share_recoverable": share,
|
| 281 |
+
"share_recoverable_pct": round(share * 100, 1),
|
| 282 |
+
"n_recoverable": recoverable_total,
|
| 283 |
+
"n_total_errors": total,
|
| 284 |
+
"top_classes": [
|
| 285 |
+
{"class": k, "count": v} for k, v in breakdown[:3]
|
| 286 |
+
],
|
| 287 |
+
},
|
| 288 |
+
engines_involved=(engine.get("name") or "?",),
|
| 289 |
+
))
|
| 290 |
+
return out
|
| 291 |
+
|
| 292 |
+
|
| 293 |
+
@register_lever(LeverType.PARETO_CONCENTRATION, priority=20)
|
| 294 |
+
def detect_pareto_concentration(
|
| 295 |
+
benchmark_data: dict,
|
| 296 |
+
*,
|
| 297 |
+
top_share: float = 0.20,
|
| 298 |
+
cer_share_threshold: float = 0.50,
|
| 299 |
+
) -> list[Lever]:
|
| 300 |
+
"""Émet un levier si une fraction minoritaire de documents
|
| 301 |
+
(`top_share`) concentre plus de `cer_share_threshold` du CER
|
| 302 |
+
total cumulé sur le moteur leader.
|
| 303 |
+
|
| 304 |
+
Lit `benchmark_data["per_doc_cer"][engine_name]` ou tente de
|
| 305 |
+
reconstruire depuis `benchmark_data["engines"][...]["per_doc"]`.
|
| 306 |
+
Si rien d'exploitable, retourne [].
|
| 307 |
+
"""
|
| 308 |
+
ranking = benchmark_data.get("ranking") or []
|
| 309 |
+
if not ranking:
|
| 310 |
+
return []
|
| 311 |
+
leader = ranking[0]
|
| 312 |
+
leader_name = leader.get("engine")
|
| 313 |
+
if not leader_name:
|
| 314 |
+
return []
|
| 315 |
+
|
| 316 |
+
per_doc_cer: list[float] = []
|
| 317 |
+
# Voie 1 : structure plate "per_doc_cer"
|
| 318 |
+
flat = benchmark_data.get("per_doc_cer") or {}
|
| 319 |
+
if isinstance(flat, dict) and leader_name in flat and isinstance(flat[leader_name], list):
|
| 320 |
+
per_doc_cer = [float(x) for x in flat[leader_name] if isinstance(x, (int, float))]
|
| 321 |
+
else:
|
| 322 |
+
# Voie 2 : engine.per_doc liste de dicts {cer: float}
|
| 323 |
+
for engine in benchmark_data.get("engines") or []:
|
| 324 |
+
if engine.get("name") != leader_name:
|
| 325 |
+
continue
|
| 326 |
+
per_doc = engine.get("per_doc") or []
|
| 327 |
+
for entry in per_doc:
|
| 328 |
+
if isinstance(entry, dict) and isinstance(entry.get("cer"), (int, float)):
|
| 329 |
+
per_doc_cer.append(float(entry["cer"]))
|
| 330 |
+
break
|
| 331 |
+
|
| 332 |
+
if not per_doc_cer:
|
| 333 |
+
return []
|
| 334 |
+
total_cer = sum(per_doc_cer)
|
| 335 |
+
if total_cer <= 0:
|
| 336 |
+
return []
|
| 337 |
+
|
| 338 |
+
sorted_cer = sorted(per_doc_cer, reverse=True)
|
| 339 |
+
n = len(sorted_cer)
|
| 340 |
+
n_top = max(1, int(round(top_share * n)))
|
| 341 |
+
top_cer_sum = sum(sorted_cer[:n_top])
|
| 342 |
+
share_of_total = top_cer_sum / total_cer
|
| 343 |
+
if share_of_total < cer_share_threshold:
|
| 344 |
+
return []
|
| 345 |
+
importance = (
|
| 346 |
+
LeverImportance.HIGH if share_of_total >= 0.75
|
| 347 |
+
else LeverImportance.MEDIUM
|
| 348 |
+
)
|
| 349 |
+
return [Lever(
|
| 350 |
+
type=LeverType.PARETO_CONCENTRATION,
|
| 351 |
+
importance=importance,
|
| 352 |
+
payload={
|
| 353 |
+
"engine": leader_name,
|
| 354 |
+
"n_docs": n,
|
| 355 |
+
"n_docs_top": n_top,
|
| 356 |
+
"top_share_pct": round((n_top / n) * 100, 1),
|
| 357 |
+
"cer_share_of_total": share_of_total,
|
| 358 |
+
"cer_share_pct": round(share_of_total * 100, 1),
|
| 359 |
+
},
|
| 360 |
+
engines_involved=(leader_name,),
|
| 361 |
+
)]
|
| 362 |
+
|
| 363 |
+
|
| 364 |
+
@register_lever(LeverType.COMPLEMENTARITY_OBSERVATION, priority=30)
|
| 365 |
+
def detect_complementarity_observation(
|
| 366 |
+
benchmark_data: dict,
|
| 367 |
+
*,
|
| 368 |
+
min_relative_gap: float = 0.20,
|
| 369 |
+
) -> list[Lever]:
|
| 370 |
+
"""Reformule factuellement le `complementarity_gap` (Sprint 35).
|
| 371 |
+
|
| 372 |
+
Lit `benchmark_data["inter_engine_analysis"]`. Garde-fou : ne
|
| 373 |
+
déclenche que si `relative_gap` ≥ `min_relative_gap`. **Aucune
|
| 374 |
+
recommandation d'ensemble** — le levier dit factuellement
|
| 375 |
+
« X points séparent l'oracle du meilleur moteur », c'est tout.
|
| 376 |
+
"""
|
| 377 |
+
inter = benchmark_data.get("inter_engine_analysis") or {}
|
| 378 |
+
cgap = inter.get("complementarity_gap") or {}
|
| 379 |
+
relative_gap = cgap.get("relative_gap")
|
| 380 |
+
absolute_gap = cgap.get("absolute_gap")
|
| 381 |
+
if relative_gap is None or absolute_gap is None:
|
| 382 |
+
return []
|
| 383 |
+
try:
|
| 384 |
+
rg = float(relative_gap)
|
| 385 |
+
ag = float(absolute_gap)
|
| 386 |
+
except (TypeError, ValueError):
|
| 387 |
+
return []
|
| 388 |
+
if rg < min_relative_gap:
|
| 389 |
+
return []
|
| 390 |
+
importance = (
|
| 391 |
+
LeverImportance.HIGH if rg >= 0.50 else LeverImportance.MEDIUM
|
| 392 |
+
)
|
| 393 |
+
payload: dict = {
|
| 394 |
+
"absolute_gap": ag,
|
| 395 |
+
"absolute_gap_pct": round(ag * 100, 1),
|
| 396 |
+
"relative_gap": rg,
|
| 397 |
+
"relative_gap_pct": round(rg * 100, 1),
|
| 398 |
+
}
|
| 399 |
+
best_engine = cgap.get("best_engine") or inter.get("best_engine")
|
| 400 |
+
best_recall = cgap.get("best_recall") or inter.get("best_engine_recall")
|
| 401 |
+
oracle_recall = cgap.get("oracle_recall") or inter.get("oracle_recall")
|
| 402 |
+
engines_involved: tuple[str, ...] = ()
|
| 403 |
+
if best_engine:
|
| 404 |
+
payload["best_engine"] = str(best_engine)
|
| 405 |
+
engines_involved = (str(best_engine),)
|
| 406 |
+
if isinstance(best_recall, (int, float)):
|
| 407 |
+
payload["best_recall"] = float(best_recall)
|
| 408 |
+
if isinstance(oracle_recall, (int, float)):
|
| 409 |
+
payload["oracle_recall"] = float(oracle_recall)
|
| 410 |
+
return [Lever(
|
| 411 |
+
type=LeverType.COMPLEMENTARITY_OBSERVATION,
|
| 412 |
+
importance=importance,
|
| 413 |
+
payload=payload,
|
| 414 |
+
engines_involved=engines_involved,
|
| 415 |
+
)]
|
| 416 |
+
|
| 417 |
+
|
| 418 |
+
@register_lever(LeverType.LEXICAL_MODERNIZATION_OBSERVATION, priority=40)
|
| 419 |
+
def detect_lexical_modernization_observation(
|
| 420 |
+
benchmark_data: dict,
|
| 421 |
+
*,
|
| 422 |
+
top_n: int = 3,
|
| 423 |
+
min_total: int = 3,
|
| 424 |
+
min_rate: float = 0.50,
|
| 425 |
+
) -> list[Lever]:
|
| 426 |
+
"""Pour chaque moteur disposant de `lexical_modernization`,
|
| 427 |
+
émet un levier listant les `top_n` tokens GT les plus modernisés.
|
| 428 |
+
|
| 429 |
+
Lit `benchmark_data["engines"][i]["lexical_modernization"]` qui
|
| 430 |
+
suit la forme produite par `compute_lexical_modernization` du
|
| 431 |
+
Sprint 80 (`{"n_gt_tokens": int, "tokens": dict}`).
|
| 432 |
+
"""
|
| 433 |
+
out: list[Lever] = []
|
| 434 |
+
for engine in benchmark_data.get("engines") or []:
|
| 435 |
+
data = engine.get("lexical_modernization")
|
| 436 |
+
if not isinstance(data, dict):
|
| 437 |
+
continue
|
| 438 |
+
tokens = data.get("tokens") or {}
|
| 439 |
+
if not isinstance(tokens, dict) or not tokens:
|
| 440 |
+
continue
|
| 441 |
+
candidates: list[tuple[str, dict]] = []
|
| 442 |
+
for gt_token, slot in tokens.items():
|
| 443 |
+
if not isinstance(slot, dict):
|
| 444 |
+
continue
|
| 445 |
+
n_total = slot.get("n_total")
|
| 446 |
+
rate = slot.get("rate_modernized")
|
| 447 |
+
if not isinstance(n_total, (int, float)) or not isinstance(rate, (int, float)):
|
| 448 |
+
continue
|
| 449 |
+
if int(n_total) < min_total:
|
| 450 |
+
continue
|
| 451 |
+
if float(rate) < min_rate:
|
| 452 |
+
continue
|
| 453 |
+
candidates.append((gt_token, dict(slot)))
|
| 454 |
+
if not candidates:
|
| 455 |
+
continue
|
| 456 |
+
candidates.sort(
|
| 457 |
+
key=lambda kv: (-float(kv[1].get("rate_modernized", 0.0)),
|
| 458 |
+
-int(kv[1].get("n_total", 0)),
|
| 459 |
+
kv[0]),
|
| 460 |
+
)
|
| 461 |
+
top = candidates[:top_n]
|
| 462 |
+
engine_name = engine.get("name") or "?"
|
| 463 |
+
max_rate = max(float(slot.get("rate_modernized", 0.0)) for _, slot in top)
|
| 464 |
+
importance = (
|
| 465 |
+
LeverImportance.HIGH if max_rate >= 0.90 else LeverImportance.MEDIUM
|
| 466 |
+
)
|
| 467 |
+
out.append(Lever(
|
| 468 |
+
type=LeverType.LEXICAL_MODERNIZATION_OBSERVATION,
|
| 469 |
+
importance=importance,
|
| 470 |
+
payload={
|
| 471 |
+
"engine": engine_name,
|
| 472 |
+
"top_tokens": [
|
| 473 |
+
{
|
| 474 |
+
"gt_token": gt,
|
| 475 |
+
"n_total": int(slot.get("n_total", 0)),
|
| 476 |
+
"rate_modernized": float(slot.get("rate_modernized", 0.0)),
|
| 477 |
+
"rate_modernized_pct": round(
|
| 478 |
+
float(slot.get("rate_modernized", 0.0)) * 100, 1,
|
| 479 |
+
),
|
| 480 |
+
}
|
| 481 |
+
for gt, slot in top
|
| 482 |
+
],
|
| 483 |
+
},
|
| 484 |
+
engines_involved=(engine_name,),
|
| 485 |
+
))
|
| 486 |
+
return out
|
| 487 |
+
|
| 488 |
+
|
| 489 |
+
@register_lever(LeverType.ROBUSTNESS_PROJECTION_OBSERVATION, priority=50)
|
| 490 |
+
def detect_robustness_projection_observation(
|
| 491 |
+
benchmark_data: dict,
|
| 492 |
+
*,
|
| 493 |
+
min_total_deficit: float = 0.02,
|
| 494 |
+
) -> list[Lever]:
|
| 495 |
+
"""Lit l'agrégation par moteur de la projection de robustesse
|
| 496 |
+
(Sprint 81). Émet le levier pour le moteur dont
|
| 497 |
+
`total_expected_deficit` est ≥ `min_total_deficit` (par défaut
|
| 498 |
+
2 points de CER).
|
| 499 |
+
|
| 500 |
+
Lit `benchmark_data["robustness_projection_aggregated"]` —
|
| 501 |
+
structure produite par `aggregate_projection_per_engine`.
|
| 502 |
+
"""
|
| 503 |
+
agg = benchmark_data.get("robustness_projection_aggregated") or {}
|
| 504 |
+
if not isinstance(agg, dict) or not agg:
|
| 505 |
+
return []
|
| 506 |
+
out: list[Lever] = []
|
| 507 |
+
for engine_name, info in agg.items():
|
| 508 |
+
if not isinstance(info, dict):
|
| 509 |
+
continue
|
| 510 |
+
total_deficit = info.get("total_expected_deficit")
|
| 511 |
+
worst_type = info.get("worst_degradation_type")
|
| 512 |
+
worst_deficit = info.get("worst_degradation_deficit")
|
| 513 |
+
if not isinstance(total_deficit, (int, float)):
|
| 514 |
+
continue
|
| 515 |
+
if float(total_deficit) < min_total_deficit:
|
| 516 |
+
continue
|
| 517 |
+
importance = (
|
| 518 |
+
LeverImportance.HIGH if float(total_deficit) >= 0.05
|
| 519 |
+
else LeverImportance.MEDIUM
|
| 520 |
+
)
|
| 521 |
+
payload: dict = {
|
| 522 |
+
"engine": engine_name,
|
| 523 |
+
"total_expected_deficit": float(total_deficit),
|
| 524 |
+
"total_expected_deficit_pct": round(float(total_deficit) * 100, 1),
|
| 525 |
+
"n_degradation_types": int(info.get("n_degradation_types") or 0),
|
| 526 |
+
}
|
| 527 |
+
if isinstance(worst_type, str):
|
| 528 |
+
payload["worst_degradation_type"] = worst_type
|
| 529 |
+
if isinstance(worst_deficit, (int, float)):
|
| 530 |
+
payload["worst_degradation_deficit"] = float(worst_deficit)
|
| 531 |
+
payload["worst_degradation_deficit_pct"] = round(
|
| 532 |
+
float(worst_deficit) * 100, 1,
|
| 533 |
+
)
|
| 534 |
+
out.append(Lever(
|
| 535 |
+
type=LeverType.ROBUSTNESS_PROJECTION_OBSERVATION,
|
| 536 |
+
importance=importance,
|
| 537 |
+
payload=payload,
|
| 538 |
+
engines_involved=(engine_name,),
|
| 539 |
+
))
|
| 540 |
+
# Tri par déficit décroissant pour stabilité d'affichage.
|
| 541 |
+
out.sort(
|
| 542 |
+
key=lambda lv: -float(lv.payload.get("total_expected_deficit") or 0.0),
|
| 543 |
+
)
|
| 544 |
+
return out
|
| 545 |
+
|
| 546 |
+
|
| 547 |
+
__all__ = [
|
| 548 |
+
"Lever",
|
| 549 |
+
"LeverImportance",
|
| 550 |
+
"LeverType",
|
| 551 |
+
"LeverDetectorEntry",
|
| 552 |
+
"register_lever",
|
| 553 |
+
"unregister_lever",
|
| 554 |
+
"iter_lever_detectors",
|
| 555 |
+
"detect_levers",
|
| 556 |
+
"detect_dominant_recoverable_class",
|
| 557 |
+
"detect_pareto_concentration",
|
| 558 |
+
"detect_complementarity_observation",
|
| 559 |
+
"detect_lexical_modernization_observation",
|
| 560 |
+
"detect_robustness_projection_observation",
|
| 561 |
+
]
|
|
@@ -0,0 +1,263 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""Détection de la sur-normalisation lexicale par les LLM/VLM —
|
| 2 |
+
Sprint 80 (A.I.7).
|
| 3 |
+
|
| 4 |
+
Sprint 80 — A.I.7 du plan d'évolution 2026.
|
| 5 |
+
|
| 6 |
+
Pourquoi ce module
|
| 7 |
+
------------------
|
| 8 |
+
Le détecteur ``llm_hallucination_flag`` (Sprint 19) signale qu'un
|
| 9 |
+
moteur sur-normalise (« 0,05 % »). Mais ce score agrégé ne dit
|
| 10 |
+
rien sur **quoi** corriger dans le prompt. Ce module produit
|
| 11 |
+
une **table de fréquences détaillée** :
|
| 12 |
+
|
| 13 |
+
+----------------------+--------------------+------+----------+
|
| 14 |
+
| Forme historique GT | Forme modernisée | n GT | % modern |
|
| 15 |
+
+======================+====================+======+==========+
|
| 16 |
+
| maistre | maître | 47 | 85 % |
|
| 17 |
+
| nostre | nostre | 92 | 8 % |
|
| 18 |
+
| veoir | voir | 23 | 100 % |
|
| 19 |
+
+----------------------+--------------------+------+----------+
|
| 20 |
+
|
| 21 |
+
Lecture immédiate : *« le LLM modernise systématiquement
|
| 22 |
+
maistre → maître ; pour préserver l'orthographe historique, ajouter
|
| 23 |
+
au prompt "ne pas moderniser maistre, nostre, veoir" »*.
|
| 24 |
+
|
| 25 |
+
Méthode
|
| 26 |
+
-------
|
| 27 |
+
Alignement mot-à-mot via ``difflib.SequenceMatcher``. Chaque
|
| 28 |
+
``replace`` ou ``equal`` produit une paire ``(gt_token,
|
| 29 |
+
hyp_token)``. On accumule pour chaque ``gt_token`` :
|
| 30 |
+
|
| 31 |
+
- ``n_total`` : nombre d'occurrences du token dans la GT
|
| 32 |
+
- ``n_modernized`` : nombre d'occurrences où ``hyp_token != gt_token``
|
| 33 |
+
- ``variants`` : dict des hyp_tokens observés avec leur count
|
| 34 |
+
|
| 35 |
+
Stop-list
|
| 36 |
+
---------
|
| 37 |
+
L'utilisateur peut passer ``stop_list`` (ensemble de tokens GT à
|
| 38 |
+
ignorer). Par défaut, vide — le module ne tente pas de deviner ce
|
| 39 |
+
qui est « moderne » ou « historique », c'est au chercheur de
|
| 40 |
+
fournir le filtre adapté à son corpus.
|
| 41 |
+
|
| 42 |
+
Sortie
|
| 43 |
+
------
|
| 44 |
+
``compute_lexical_modernization`` retourne une structure adaptée
|
| 45 |
+
au rendu HTML. ``aggregate_lexical_modernization`` agrège
|
| 46 |
+
plusieurs documents.
|
| 47 |
+
|
| 48 |
+
Limites documentées
|
| 49 |
+
-------------------
|
| 50 |
+
- Tokenisation au niveau mot (split sur espace) — cohérent avec
|
| 51 |
+
``taxonomy.py`` et autres modules. Pas de stemming ni de
|
| 52 |
+
lemmatisation.
|
| 53 |
+
- La métrique mesure la **réécriture lexicale** ; elle n'attrape
|
| 54 |
+
pas les modernisations infra-mot (perte du s long ſ qui se
|
| 55 |
+
fond dans la même forme). Pour ça, voir ``early_modern_typography``
|
| 56 |
+
(Sprint 58) et ``equivalence_profile`` (Sprint 78).
|
| 57 |
+
"""
|
| 58 |
+
|
| 59 |
+
from __future__ import annotations
|
| 60 |
+
|
| 61 |
+
import difflib
|
| 62 |
+
import logging
|
| 63 |
+
from typing import Iterable, Optional
|
| 64 |
+
|
| 65 |
+
logger = logging.getLogger(__name__)
|
| 66 |
+
|
| 67 |
+
|
| 68 |
+
def _split_words(text: Optional[str]) -> list[str]:
|
| 69 |
+
"""Tokenisation simple par split sur whitespace."""
|
| 70 |
+
if not text:
|
| 71 |
+
return []
|
| 72 |
+
return text.split()
|
| 73 |
+
|
| 74 |
+
|
| 75 |
+
def compute_lexical_modernization(
|
| 76 |
+
reference: Optional[str],
|
| 77 |
+
hypothesis: Optional[str],
|
| 78 |
+
*,
|
| 79 |
+
stop_list: Optional[Iterable[str]] = None,
|
| 80 |
+
case_sensitive: bool = False,
|
| 81 |
+
) -> dict:
|
| 82 |
+
"""Calcule le tableau de modernisation lexicale pour un document.
|
| 83 |
+
|
| 84 |
+
Returns
|
| 85 |
+
-------
|
| 86 |
+
dict
|
| 87 |
+
``{
|
| 88 |
+
"n_gt_tokens": int,
|
| 89 |
+
"tokens": {
|
| 90 |
+
gt_token: {
|
| 91 |
+
"n_total": int,
|
| 92 |
+
"n_modernized": int,
|
| 93 |
+
"rate_modernized": float, # ∈ [0, 1]
|
| 94 |
+
"variants": {hyp_token: count, ...},
|
| 95 |
+
},
|
| 96 |
+
...
|
| 97 |
+
},
|
| 98 |
+
}``
|
| 99 |
+
Si ``reference`` est vide → ``tokens == {}``.
|
| 100 |
+
"""
|
| 101 |
+
ref_tokens = _split_words(reference)
|
| 102 |
+
hyp_tokens = _split_words(hypothesis)
|
| 103 |
+
if not ref_tokens:
|
| 104 |
+
return {"n_gt_tokens": 0, "tokens": {}}
|
| 105 |
+
|
| 106 |
+
if not case_sensitive:
|
| 107 |
+
ref_for_match = [t.lower() for t in ref_tokens]
|
| 108 |
+
hyp_for_match = [t.lower() for t in hyp_tokens]
|
| 109 |
+
else:
|
| 110 |
+
ref_for_match = ref_tokens
|
| 111 |
+
hyp_for_match = hyp_tokens
|
| 112 |
+
|
| 113 |
+
stop = frozenset(
|
| 114 |
+
(t.lower() if not case_sensitive else t)
|
| 115 |
+
for t in (stop_list or [])
|
| 116 |
+
)
|
| 117 |
+
|
| 118 |
+
# On accumule par gt_token (forme display = forme originale,
|
| 119 |
+
# match key = forme casée selon ``case_sensitive``).
|
| 120 |
+
tokens_data: dict[str, dict] = {}
|
| 121 |
+
|
| 122 |
+
matcher = difflib.SequenceMatcher(
|
| 123 |
+
None, ref_for_match, hyp_for_match, autojunk=False,
|
| 124 |
+
)
|
| 125 |
+
for tag, i1, i2, j1, j2 in matcher.get_opcodes():
|
| 126 |
+
if tag == "equal":
|
| 127 |
+
for k in range(i2 - i1):
|
| 128 |
+
gt_orig = ref_tokens[i1 + k]
|
| 129 |
+
gt_match = ref_for_match[i1 + k]
|
| 130 |
+
if gt_match in stop:
|
| 131 |
+
continue
|
| 132 |
+
slot = tokens_data.setdefault(
|
| 133 |
+
gt_orig,
|
| 134 |
+
{"n_total": 0, "n_modernized": 0, "variants": {}},
|
| 135 |
+
)
|
| 136 |
+
slot["n_total"] += 1
|
| 137 |
+
elif tag == "replace":
|
| 138 |
+
# Apparier 1-à-1 quand possible
|
| 139 |
+
paired = min(i2 - i1, j2 - j1)
|
| 140 |
+
for k in range(paired):
|
| 141 |
+
gt_orig = ref_tokens[i1 + k]
|
| 142 |
+
gt_match = ref_for_match[i1 + k]
|
| 143 |
+
if gt_match in stop:
|
| 144 |
+
continue
|
| 145 |
+
hyp_orig = hyp_tokens[j1 + k]
|
| 146 |
+
slot = tokens_data.setdefault(
|
| 147 |
+
gt_orig,
|
| 148 |
+
{"n_total": 0, "n_modernized": 0, "variants": {}},
|
| 149 |
+
)
|
| 150 |
+
slot["n_total"] += 1
|
| 151 |
+
slot["n_modernized"] += 1
|
| 152 |
+
slot["variants"][hyp_orig] = slot["variants"].get(hyp_orig, 0) + 1
|
| 153 |
+
# Si plus de gt que de hyp, le reste des gt_tokens est
|
| 154 |
+
# « perdu » — on les compte comme totaux mais pas comme
|
| 155 |
+
# modernisés (on ne sait pas en quoi).
|
| 156 |
+
for k in range(paired, i2 - i1):
|
| 157 |
+
gt_orig = ref_tokens[i1 + k]
|
| 158 |
+
gt_match = ref_for_match[i1 + k]
|
| 159 |
+
if gt_match in stop:
|
| 160 |
+
continue
|
| 161 |
+
slot = tokens_data.setdefault(
|
| 162 |
+
gt_orig,
|
| 163 |
+
{"n_total": 0, "n_modernized": 0, "variants": {}},
|
| 164 |
+
)
|
| 165 |
+
slot["n_total"] += 1
|
| 166 |
+
slot["n_modernized"] += 1
|
| 167 |
+
slot["variants"]["∅"] = slot["variants"].get("∅", 0) + 1
|
| 168 |
+
elif tag == "delete":
|
| 169 |
+
# gt présent, pas en hyp → modernisation par
|
| 170 |
+
# suppression (ou perte pure)
|
| 171 |
+
for k in range(i2 - i1):
|
| 172 |
+
gt_orig = ref_tokens[i1 + k]
|
| 173 |
+
gt_match = ref_for_match[i1 + k]
|
| 174 |
+
if gt_match in stop:
|
| 175 |
+
continue
|
| 176 |
+
slot = tokens_data.setdefault(
|
| 177 |
+
gt_orig,
|
| 178 |
+
{"n_total": 0, "n_modernized": 0, "variants": {}},
|
| 179 |
+
)
|
| 180 |
+
slot["n_total"] += 1
|
| 181 |
+
slot["n_modernized"] += 1
|
| 182 |
+
slot["variants"]["∅"] = slot["variants"].get("∅", 0) + 1
|
| 183 |
+
|
| 184 |
+
# Calcul du taux par token
|
| 185 |
+
for slot in tokens_data.values():
|
| 186 |
+
total = slot["n_total"]
|
| 187 |
+
slot["rate_modernized"] = (
|
| 188 |
+
slot["n_modernized"] / total if total > 0 else 0.0
|
| 189 |
+
)
|
| 190 |
+
|
| 191 |
+
return {
|
| 192 |
+
"n_gt_tokens": len(ref_tokens),
|
| 193 |
+
"tokens": tokens_data,
|
| 194 |
+
}
|
| 195 |
+
|
| 196 |
+
|
| 197 |
+
def aggregate_lexical_modernization(
|
| 198 |
+
per_doc_results: Iterable[dict],
|
| 199 |
+
) -> dict:
|
| 200 |
+
"""Agrège des ``compute_lexical_modernization`` per-doc.
|
| 201 |
+
|
| 202 |
+
Renvoie la structure agrégée corpus-wide avec la même forme
|
| 203 |
+
que ``compute_lexical_modernization``.
|
| 204 |
+
"""
|
| 205 |
+
agg_tokens: dict[str, dict] = {}
|
| 206 |
+
n_gt_total = 0
|
| 207 |
+
for doc_result in per_doc_results:
|
| 208 |
+
if not doc_result:
|
| 209 |
+
continue
|
| 210 |
+
n_gt_total += doc_result.get("n_gt_tokens", 0)
|
| 211 |
+
for gt, data in (doc_result.get("tokens") or {}).items():
|
| 212 |
+
slot = agg_tokens.setdefault(
|
| 213 |
+
gt, {"n_total": 0, "n_modernized": 0, "variants": {}},
|
| 214 |
+
)
|
| 215 |
+
slot["n_total"] += data.get("n_total", 0)
|
| 216 |
+
slot["n_modernized"] += data.get("n_modernized", 0)
|
| 217 |
+
for hyp_t, count in (data.get("variants") or {}).items():
|
| 218 |
+
slot["variants"][hyp_t] = slot["variants"].get(hyp_t, 0) + count
|
| 219 |
+
|
| 220 |
+
for slot in agg_tokens.values():
|
| 221 |
+
total = slot["n_total"]
|
| 222 |
+
slot["rate_modernized"] = (
|
| 223 |
+
slot["n_modernized"] / total if total > 0 else 0.0
|
| 224 |
+
)
|
| 225 |
+
return {
|
| 226 |
+
"n_gt_tokens": n_gt_total,
|
| 227 |
+
"tokens": agg_tokens,
|
| 228 |
+
}
|
| 229 |
+
|
| 230 |
+
|
| 231 |
+
def top_modernized_tokens(
|
| 232 |
+
data: dict,
|
| 233 |
+
*,
|
| 234 |
+
n: int = 20,
|
| 235 |
+
min_total: int = 1,
|
| 236 |
+
) -> list[tuple[str, dict]]:
|
| 237 |
+
"""Top-N tokens GT par taux de modernisation.
|
| 238 |
+
|
| 239 |
+
Filtre les tokens dont ``n_total < min_total`` (anecdotiques).
|
| 240 |
+
Tri par ``rate_modernized`` décroissant, tie-break par
|
| 241 |
+
``n_total`` décroissant.
|
| 242 |
+
"""
|
| 243 |
+
tokens = data.get("tokens") or {}
|
| 244 |
+
candidates = [
|
| 245 |
+
(gt, slot) for gt, slot in tokens.items()
|
| 246 |
+
if slot.get("n_total", 0) >= min_total
|
| 247 |
+
and slot.get("n_modernized", 0) > 0
|
| 248 |
+
]
|
| 249 |
+
candidates.sort(
|
| 250 |
+
key=lambda pair: (
|
| 251 |
+
-pair[1].get("rate_modernized", 0.0),
|
| 252 |
+
-pair[1].get("n_total", 0),
|
| 253 |
+
pair[0],
|
| 254 |
+
),
|
| 255 |
+
)
|
| 256 |
+
return candidates[:n]
|
| 257 |
+
|
| 258 |
+
|
| 259 |
+
__all__ = [
|
| 260 |
+
"compute_lexical_modernization",
|
| 261 |
+
"aggregate_lexical_modernization",
|
| 262 |
+
"top_modernized_tokens",
|
| 263 |
+
]
|
|
@@ -0,0 +1,286 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""Distribution des erreurs CER par ligne — Sprint 10.
|
| 2 |
+
|
| 3 |
+
Métriques calculées
|
| 4 |
+
-------------------
|
| 5 |
+
- CER par ligne : distance d'édition caractère/longueur GT sur chaque paire de lignes
|
| 6 |
+
- Percentiles : p50, p75, p90, p95, p99 sur la distribution des CER ligne
|
| 7 |
+
- Taux catastrophiques : % de lignes dépassant des seuils configurables (30 %, 50 %, 100 %)
|
| 8 |
+
- Coefficient de Gini : concentration des erreurs (0 = uniformes, 1 = toutes concentrées)
|
| 9 |
+
- Carte thermique : CER moyen par tranche de position dans le document
|
| 10 |
+
"""
|
| 11 |
+
|
| 12 |
+
from __future__ import annotations
|
| 13 |
+
|
| 14 |
+
import unicodedata
|
| 15 |
+
from dataclasses import dataclass
|
| 16 |
+
from typing import Optional
|
| 17 |
+
|
| 18 |
+
|
| 19 |
+
# ---------------------------------------------------------------------------
|
| 20 |
+
# CER d'une paire de lignes (distance d'édition Levenshtein normalisée)
|
| 21 |
+
# ---------------------------------------------------------------------------
|
| 22 |
+
|
| 23 |
+
def _edit_distance(a: str, b: str) -> int:
|
| 24 |
+
"""Distance de Levenshtein entre deux chaînes."""
|
| 25 |
+
if not a:
|
| 26 |
+
return len(b)
|
| 27 |
+
if not b:
|
| 28 |
+
return len(a)
|
| 29 |
+
prev = list(range(len(b) + 1))
|
| 30 |
+
for i, ca in enumerate(a, 1):
|
| 31 |
+
curr = [i]
|
| 32 |
+
for j, cb in enumerate(b, 1):
|
| 33 |
+
cost = 0 if ca == cb else 1
|
| 34 |
+
curr.append(min(curr[j - 1] + 1, prev[j] + 1, prev[j - 1] + cost))
|
| 35 |
+
prev = curr
|
| 36 |
+
return prev[-1]
|
| 37 |
+
|
| 38 |
+
|
| 39 |
+
def _line_cer(ref_line: str, hyp_line: str) -> float:
|
| 40 |
+
"""CER pour une paire de lignes. Retourne 1.0 si le GT est vide et que l'hyp ne l'est pas."""
|
| 41 |
+
ref = unicodedata.normalize("NFC", ref_line.strip())
|
| 42 |
+
hyp = unicodedata.normalize("NFC", hyp_line.strip())
|
| 43 |
+
if not ref:
|
| 44 |
+
return 0.0 if not hyp else 1.0
|
| 45 |
+
dist = _edit_distance(ref, hyp)
|
| 46 |
+
return dist / len(ref)
|
| 47 |
+
|
| 48 |
+
|
| 49 |
+
# ---------------------------------------------------------------------------
|
| 50 |
+
# Percentiles (implémentation pur-Python, sans numpy)
|
| 51 |
+
# ---------------------------------------------------------------------------
|
| 52 |
+
|
| 53 |
+
def _percentile(sorted_values: list[float], p: float) -> float:
|
| 54 |
+
"""Retourne le p-ième percentile (0 ≤ p ≤ 100) d'une liste triée."""
|
| 55 |
+
if not sorted_values:
|
| 56 |
+
return 0.0
|
| 57 |
+
n = len(sorted_values)
|
| 58 |
+
index = p / 100 * (n - 1)
|
| 59 |
+
lo = int(index)
|
| 60 |
+
hi = min(lo + 1, n - 1)
|
| 61 |
+
frac = index - lo
|
| 62 |
+
return sorted_values[lo] + frac * (sorted_values[hi] - sorted_values[lo])
|
| 63 |
+
|
| 64 |
+
|
| 65 |
+
# ---------------------------------------------------------------------------
|
| 66 |
+
# Coefficient de Gini
|
| 67 |
+
# ---------------------------------------------------------------------------
|
| 68 |
+
|
| 69 |
+
def _gini(values: list[float]) -> float:
|
| 70 |
+
"""Coefficient de Gini des erreurs (0 = uniformes, 1 = toutes concentrées).
|
| 71 |
+
|
| 72 |
+
Formule : G = (2 * Σ i*x_i) / (n * Σ x_i) - (n+1)/n
|
| 73 |
+
sur les valeurs triées par ordre croissant.
|
| 74 |
+
"""
|
| 75 |
+
if not values:
|
| 76 |
+
return 0.0
|
| 77 |
+
xs = sorted(max(v, 0.0) for v in values)
|
| 78 |
+
n = len(xs)
|
| 79 |
+
total = sum(xs)
|
| 80 |
+
if total == 0.0:
|
| 81 |
+
return 0.0
|
| 82 |
+
weighted_sum = sum((i + 1) * x for i, x in enumerate(xs))
|
| 83 |
+
return (2.0 * weighted_sum) / (n * total) - (n + 1) / n
|
| 84 |
+
|
| 85 |
+
|
| 86 |
+
# ---------------------------------------------------------------------------
|
| 87 |
+
# Résultat structuré
|
| 88 |
+
# ---------------------------------------------------------------------------
|
| 89 |
+
|
| 90 |
+
@dataclass
|
| 91 |
+
class LineMetrics:
|
| 92 |
+
"""Distribution des erreurs CER par ligne pour une paire (GT, hypothèse)."""
|
| 93 |
+
|
| 94 |
+
cer_per_line: list[float]
|
| 95 |
+
"""CER de chaque ligne (longueur = nombre de lignes GT)."""
|
| 96 |
+
|
| 97 |
+
percentiles: dict[str, float]
|
| 98 |
+
"""Percentiles : p50, p75, p90, p95, p99."""
|
| 99 |
+
|
| 100 |
+
catastrophic_rate: dict[str, float]
|
| 101 |
+
"""Taux de lignes catastrophiques pour chaque seuil (ex. {0.3: 0.12, 0.5: 0.07, 1.0: 0.02})."""
|
| 102 |
+
|
| 103 |
+
gini: float
|
| 104 |
+
"""Coefficient de Gini des erreurs (0 → uniforme, 1 → concentrées)."""
|
| 105 |
+
|
| 106 |
+
heatmap: list[float]
|
| 107 |
+
"""CER moyen par tranche de position dans le document (longueur = heatmap_bins)."""
|
| 108 |
+
|
| 109 |
+
line_count: int
|
| 110 |
+
"""Nombre de lignes GT traitées."""
|
| 111 |
+
|
| 112 |
+
mean_cer: float
|
| 113 |
+
"""CER moyen sur l'ensemble des lignes."""
|
| 114 |
+
|
| 115 |
+
def as_dict(self) -> dict:
|
| 116 |
+
return {
|
| 117 |
+
"cer_per_line": [round(v, 6) for v in self.cer_per_line],
|
| 118 |
+
"percentiles": {k: round(v, 6) for k, v in self.percentiles.items()},
|
| 119 |
+
"catastrophic_rate": {str(k): round(v, 6) for k, v in self.catastrophic_rate.items()},
|
| 120 |
+
"gini": round(self.gini, 6),
|
| 121 |
+
"heatmap": [round(v, 6) for v in self.heatmap],
|
| 122 |
+
"line_count": self.line_count,
|
| 123 |
+
"mean_cer": round(self.mean_cer, 6),
|
| 124 |
+
}
|
| 125 |
+
|
| 126 |
+
@classmethod
|
| 127 |
+
def from_dict(cls, d: dict) -> "LineMetrics":
|
| 128 |
+
return cls(
|
| 129 |
+
cer_per_line=d.get("cer_per_line", []),
|
| 130 |
+
percentiles=d.get("percentiles", {}),
|
| 131 |
+
catastrophic_rate={float(k): v for k, v in d.get("catastrophic_rate", {}).items()},
|
| 132 |
+
gini=d.get("gini", 0.0),
|
| 133 |
+
heatmap=d.get("heatmap", []),
|
| 134 |
+
line_count=d.get("line_count", 0),
|
| 135 |
+
mean_cer=d.get("mean_cer", 0.0),
|
| 136 |
+
)
|
| 137 |
+
|
| 138 |
+
|
| 139 |
+
# ---------------------------------------------------------------------------
|
| 140 |
+
# Calcul principal
|
| 141 |
+
# ---------------------------------------------------------------------------
|
| 142 |
+
|
| 143 |
+
def compute_line_metrics(
|
| 144 |
+
reference: str,
|
| 145 |
+
hypothesis: str,
|
| 146 |
+
thresholds: Optional[list[float]] = None,
|
| 147 |
+
heatmap_bins: int = 10,
|
| 148 |
+
) -> LineMetrics:
|
| 149 |
+
"""Calcule la distribution des erreurs CER ligne par ligne.
|
| 150 |
+
|
| 151 |
+
Parameters
|
| 152 |
+
----------
|
| 153 |
+
reference:
|
| 154 |
+
Texte de vérité terrain (GT) avec sauts de ligne.
|
| 155 |
+
hypothesis:
|
| 156 |
+
Texte produit par le moteur OCR.
|
| 157 |
+
thresholds:
|
| 158 |
+
Seuils CER pour le taux catastrophique. Défaut : [0.30, 0.50, 1.00].
|
| 159 |
+
heatmap_bins:
|
| 160 |
+
Nombre de tranches de position pour la carte thermique.
|
| 161 |
+
|
| 162 |
+
Returns
|
| 163 |
+
-------
|
| 164 |
+
LineMetrics
|
| 165 |
+
"""
|
| 166 |
+
if thresholds is None:
|
| 167 |
+
thresholds = [0.30, 0.50, 1.00]
|
| 168 |
+
|
| 169 |
+
ref_lines = reference.splitlines()
|
| 170 |
+
hyp_lines = hypothesis.splitlines()
|
| 171 |
+
|
| 172 |
+
# Aligner les lignes GT / hypothèse — on prend au moins autant de lignes que le GT
|
| 173 |
+
n = len(ref_lines)
|
| 174 |
+
if n == 0:
|
| 175 |
+
# Pas de lignes : retourner des métriques neutres
|
| 176 |
+
return LineMetrics(
|
| 177 |
+
cer_per_line=[],
|
| 178 |
+
percentiles={f"p{p}": 0.0 for p in (50, 75, 90, 95, 99)},
|
| 179 |
+
catastrophic_rate={t: 0.0 for t in thresholds},
|
| 180 |
+
gini=0.0,
|
| 181 |
+
heatmap=[0.0] * heatmap_bins,
|
| 182 |
+
line_count=0,
|
| 183 |
+
mean_cer=0.0,
|
| 184 |
+
)
|
| 185 |
+
|
| 186 |
+
# Aligner en ignorant les lignes d'hypothèse supplémentaires
|
| 187 |
+
# Si l'hypothèse a moins de lignes, les lignes manquantes comptent comme supprimées (CER = 1.0)
|
| 188 |
+
cer_per_line: list[float] = []
|
| 189 |
+
for i, ref_line in enumerate(ref_lines):
|
| 190 |
+
hyp_line = hyp_lines[i] if i < len(hyp_lines) else ""
|
| 191 |
+
cer_per_line.append(min(_line_cer(ref_line, hyp_line), 1.0))
|
| 192 |
+
|
| 193 |
+
sorted_cer = sorted(cer_per_line)
|
| 194 |
+
|
| 195 |
+
# Percentiles
|
| 196 |
+
percentiles = {
|
| 197 |
+
f"p{p}": _percentile(sorted_cer, p)
|
| 198 |
+
for p in (50, 75, 90, 95, 99)
|
| 199 |
+
}
|
| 200 |
+
|
| 201 |
+
# Taux catastrophiques
|
| 202 |
+
catastrophic_rate: dict[float, float] = {}
|
| 203 |
+
for t in thresholds:
|
| 204 |
+
count = sum(1 for v in cer_per_line if v > t)
|
| 205 |
+
catastrophic_rate[t] = count / n
|
| 206 |
+
|
| 207 |
+
# Gini
|
| 208 |
+
gini = _gini(cer_per_line)
|
| 209 |
+
|
| 210 |
+
# Carte thermique par tranche de position
|
| 211 |
+
bins = heatmap_bins
|
| 212 |
+
heatmap: list[float] = []
|
| 213 |
+
for b in range(bins):
|
| 214 |
+
start = int(b * n / bins)
|
| 215 |
+
end = int((b + 1) * n / bins)
|
| 216 |
+
slice_ = cer_per_line[start:end]
|
| 217 |
+
heatmap.append(sum(slice_) / len(slice_) if slice_ else 0.0)
|
| 218 |
+
|
| 219 |
+
mean_cer = sum(cer_per_line) / n
|
| 220 |
+
|
| 221 |
+
return LineMetrics(
|
| 222 |
+
cer_per_line=cer_per_line,
|
| 223 |
+
percentiles=percentiles,
|
| 224 |
+
catastrophic_rate=catastrophic_rate,
|
| 225 |
+
gini=gini,
|
| 226 |
+
heatmap=heatmap,
|
| 227 |
+
line_count=n,
|
| 228 |
+
mean_cer=mean_cer,
|
| 229 |
+
)
|
| 230 |
+
|
| 231 |
+
|
| 232 |
+
# ---------------------------------------------------------------------------
|
| 233 |
+
# Agrégation sur un corpus
|
| 234 |
+
# ---------------------------------------------------------------------------
|
| 235 |
+
|
| 236 |
+
def aggregate_line_metrics(results: list[LineMetrics]) -> dict:
|
| 237 |
+
"""Agrège les métriques de distribution par ligne sur un corpus.
|
| 238 |
+
|
| 239 |
+
Returns
|
| 240 |
+
-------
|
| 241 |
+
dict
|
| 242 |
+
Statistiques agrégées : Gini moyen, percentiles moyens, taux catastrophiques moyens.
|
| 243 |
+
"""
|
| 244 |
+
if not results:
|
| 245 |
+
return {}
|
| 246 |
+
|
| 247 |
+
import statistics as _stats
|
| 248 |
+
|
| 249 |
+
gini_values = [r.gini for r in results]
|
| 250 |
+
mean_cer_values = [r.mean_cer for r in results]
|
| 251 |
+
|
| 252 |
+
# Percentiles moyens
|
| 253 |
+
pct_keys = ["p50", "p75", "p90", "p95", "p99"]
|
| 254 |
+
avg_percentiles = {}
|
| 255 |
+
for k in pct_keys:
|
| 256 |
+
vals = [r.percentiles.get(k, 0.0) for r in results]
|
| 257 |
+
avg_percentiles[k] = round(sum(vals) / len(vals), 6) if vals else 0.0
|
| 258 |
+
|
| 259 |
+
# Taux catastrophiques moyens (union des seuils)
|
| 260 |
+
all_thresholds: set[float] = set()
|
| 261 |
+
for r in results:
|
| 262 |
+
all_thresholds.update(r.catastrophic_rate.keys())
|
| 263 |
+
avg_catastrophic: dict[str, float] = {}
|
| 264 |
+
for t in sorted(all_thresholds):
|
| 265 |
+
vals = [r.catastrophic_rate.get(t, 0.0) for r in results]
|
| 266 |
+
avg_catastrophic[str(t)] = round(sum(vals) / len(vals), 6) if vals else 0.0
|
| 267 |
+
|
| 268 |
+
# Heatmap moyenne (longueur = max des longueurs)
|
| 269 |
+
if results and results[0].heatmap:
|
| 270 |
+
n_bins = len(results[0].heatmap)
|
| 271 |
+
heatmap_avg = []
|
| 272 |
+
for b in range(n_bins):
|
| 273 |
+
vals = [r.heatmap[b] for r in results if b < len(r.heatmap)]
|
| 274 |
+
heatmap_avg.append(round(sum(vals) / len(vals), 6) if vals else 0.0)
|
| 275 |
+
else:
|
| 276 |
+
heatmap_avg = []
|
| 277 |
+
|
| 278 |
+
return {
|
| 279 |
+
"gini_mean": round(sum(gini_values) / len(gini_values), 6),
|
| 280 |
+
"gini_stdev": round(_stats.stdev(gini_values), 6) if len(gini_values) > 1 else 0.0,
|
| 281 |
+
"mean_cer_mean": round(sum(mean_cer_values) / len(mean_cer_values), 6),
|
| 282 |
+
"percentiles": avg_percentiles,
|
| 283 |
+
"catastrophic_rate": avg_catastrophic,
|
| 284 |
+
"heatmap": heatmap_avg,
|
| 285 |
+
"document_count": len(results),
|
| 286 |
+
}
|
|
@@ -0,0 +1,373 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""Métriques longitudinales — Sprint 92 (A.II.9).
|
| 2 |
+
|
| 3 |
+
Sprint 92 — A.II.9 du plan d'évolution 2026.
|
| 4 |
+
|
| 5 |
+
Pourquoi ce module
|
| 6 |
+
------------------
|
| 7 |
+
L'historique SQLite (`core/history.py`, Sprint 8) collecte les
|
| 8 |
+
résultats de chaque run de benchmark, mais aucune métrique
|
| 9 |
+
n'en sortait dans le rapport. Ce module exploite la série
|
| 10 |
+
temporelle des CER d'un moteur pour répondre à deux
|
| 11 |
+
questions :
|
| 12 |
+
|
| 13 |
+
1. **Y a-t-il une tendance ?** Régression linéaire simple
|
| 14 |
+
(méthode des moindres carrés) sur ``(t, CER)`` — pente,
|
| 15 |
+
ordonnée à l'origine, R², n_runs. Une pente > 0 signale
|
| 16 |
+
une régression progressive ; une pente < 0 une amélioration.
|
| 17 |
+
|
| 18 |
+
2. **Y a-t-il un point de rupture ?** Algorithme de
|
| 19 |
+
change-point pur Python (différence de moyennes maximale,
|
| 20 |
+
variante de Pettitt simplifiée). Identifie l'index où la
|
| 21 |
+
série se sépare en deux segments avec moyennes les plus
|
| 22 |
+
différentes — typiquement le run où un modèle a changé de
|
| 23 |
+
comportement.
|
| 24 |
+
|
| 25 |
+
Pas de scipy
|
| 26 |
+
------------
|
| 27 |
+
Pour rester sans dépendance lourde, on implémente :
|
| 28 |
+
- la régression linéaire en pur Python (closed-form OLS) ;
|
| 29 |
+
- le change-point par balayage exhaustif (O(N) pour de petits
|
| 30 |
+
N — l'historique d'une institution dépasse rarement quelques
|
| 31 |
+
centaines de runs).
|
| 32 |
+
"""
|
| 33 |
+
|
| 34 |
+
from __future__ import annotations
|
| 35 |
+
|
| 36 |
+
import logging
|
| 37 |
+
import math
|
| 38 |
+
import statistics
|
| 39 |
+
from dataclasses import dataclass
|
| 40 |
+
from datetime import datetime
|
| 41 |
+
from typing import Iterable, Optional
|
| 42 |
+
|
| 43 |
+
logger = logging.getLogger(__name__)
|
| 44 |
+
|
| 45 |
+
|
| 46 |
+
@dataclass
|
| 47 |
+
class LinearTrend:
|
| 48 |
+
"""Résultat d'une régression linéaire sur une série CER."""
|
| 49 |
+
slope: float
|
| 50 |
+
"""Pente (CER par jour). Positif = régression."""
|
| 51 |
+
intercept: float
|
| 52 |
+
"""Ordonnée à l'origine."""
|
| 53 |
+
r_squared: float
|
| 54 |
+
"""Qualité de l'ajustement, ∈ [0, 1]."""
|
| 55 |
+
n_runs: int
|
| 56 |
+
"""Nombre de points utilisés."""
|
| 57 |
+
|
| 58 |
+
def as_dict(self) -> dict:
|
| 59 |
+
return {
|
| 60 |
+
"slope": self.slope,
|
| 61 |
+
"intercept": self.intercept,
|
| 62 |
+
"r_squared": self.r_squared,
|
| 63 |
+
"n_runs": self.n_runs,
|
| 64 |
+
}
|
| 65 |
+
|
| 66 |
+
|
| 67 |
+
@dataclass
|
| 68 |
+
class ChangePointResult:
|
| 69 |
+
"""Résultat d'une détection de point de rupture."""
|
| 70 |
+
index: int
|
| 71 |
+
"""Index de la rupture (0-based, le segment 1 est [0:index],
|
| 72 |
+
le segment 2 est [index:N])."""
|
| 73 |
+
timestamp: str
|
| 74 |
+
"""Timestamp du run à la rupture."""
|
| 75 |
+
mean_before: float
|
| 76 |
+
mean_after: float
|
| 77 |
+
delta: float
|
| 78 |
+
"""``mean_after - mean_before``. Positif = régression."""
|
| 79 |
+
n_before: int
|
| 80 |
+
n_after: int
|
| 81 |
+
|
| 82 |
+
def as_dict(self) -> dict:
|
| 83 |
+
return {
|
| 84 |
+
"index": self.index,
|
| 85 |
+
"timestamp": self.timestamp,
|
| 86 |
+
"mean_before": self.mean_before,
|
| 87 |
+
"mean_after": self.mean_after,
|
| 88 |
+
"delta": self.delta,
|
| 89 |
+
"n_before": self.n_before,
|
| 90 |
+
"n_after": self.n_after,
|
| 91 |
+
}
|
| 92 |
+
|
| 93 |
+
|
| 94 |
+
def _parse_timestamp(ts: str) -> Optional[float]:
|
| 95 |
+
"""Parse un ISO timestamp en jour ordinal float.
|
| 96 |
+
|
| 97 |
+
Tolère ``YYYY-MM-DD`` et ``YYYY-MM-DDTHH:MM:SS``. Retourne
|
| 98 |
+
``None`` si non parsable.
|
| 99 |
+
"""
|
| 100 |
+
if not ts:
|
| 101 |
+
return None
|
| 102 |
+
formats = (
|
| 103 |
+
"%Y-%m-%dT%H:%M:%S.%f",
|
| 104 |
+
"%Y-%m-%dT%H:%M:%S",
|
| 105 |
+
"%Y-%m-%d %H:%M:%S",
|
| 106 |
+
"%Y-%m-%d",
|
| 107 |
+
)
|
| 108 |
+
for fmt in formats:
|
| 109 |
+
try:
|
| 110 |
+
dt = datetime.strptime(ts.split("+")[0].split("Z")[0], fmt)
|
| 111 |
+
return dt.toordinal() + (
|
| 112 |
+
dt.hour * 3600 + dt.minute * 60 + dt.second
|
| 113 |
+
) / 86400.0
|
| 114 |
+
except ValueError:
|
| 115 |
+
continue
|
| 116 |
+
return None
|
| 117 |
+
|
| 118 |
+
|
| 119 |
+
def compute_linear_trend(
|
| 120 |
+
cer_series: Iterable[tuple[str, float]],
|
| 121 |
+
) -> Optional[LinearTrend]:
|
| 122 |
+
"""Régression linéaire OLS sur une série temporelle de CER.
|
| 123 |
+
|
| 124 |
+
Parameters
|
| 125 |
+
----------
|
| 126 |
+
cer_series:
|
| 127 |
+
Itérable de ``(timestamp_iso, cer)``. Au moins 2 points
|
| 128 |
+
valides requis.
|
| 129 |
+
|
| 130 |
+
Returns
|
| 131 |
+
-------
|
| 132 |
+
LinearTrend | None
|
| 133 |
+
``None`` si moins de 2 points ou si tous les timestamps
|
| 134 |
+
sont identiques (variance nulle sur t).
|
| 135 |
+
"""
|
| 136 |
+
points: list[tuple[float, float]] = []
|
| 137 |
+
for ts, cer in cer_series:
|
| 138 |
+
t = _parse_timestamp(ts)
|
| 139 |
+
if t is None or cer is None:
|
| 140 |
+
continue
|
| 141 |
+
try:
|
| 142 |
+
cer_f = float(cer)
|
| 143 |
+
except (TypeError, ValueError):
|
| 144 |
+
continue
|
| 145 |
+
points.append((t, cer_f))
|
| 146 |
+
n = len(points)
|
| 147 |
+
if n < 2:
|
| 148 |
+
return None
|
| 149 |
+
xs = [p[0] for p in points]
|
| 150 |
+
ys = [p[1] for p in points]
|
| 151 |
+
x_mean = statistics.fmean(xs)
|
| 152 |
+
y_mean = statistics.fmean(ys)
|
| 153 |
+
sxx = sum((x - x_mean) ** 2 for x in xs)
|
| 154 |
+
sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
|
| 155 |
+
if sxx == 0:
|
| 156 |
+
return None
|
| 157 |
+
slope = sxy / sxx
|
| 158 |
+
intercept = y_mean - slope * x_mean
|
| 159 |
+
syy = sum((y - y_mean) ** 2 for y in ys)
|
| 160 |
+
if syy == 0:
|
| 161 |
+
# Tous les CER sont égaux → R² mathématiquement indéfini ;
|
| 162 |
+
# on retourne 1.0 (parfaite "non-tendance").
|
| 163 |
+
r_squared = 1.0
|
| 164 |
+
else:
|
| 165 |
+
ss_res = sum(
|
| 166 |
+
(y - (slope * x + intercept)) ** 2
|
| 167 |
+
for x, y in zip(xs, ys)
|
| 168 |
+
)
|
| 169 |
+
r_squared = max(0.0, 1.0 - ss_res / syy)
|
| 170 |
+
return LinearTrend(
|
| 171 |
+
slope=slope,
|
| 172 |
+
intercept=intercept,
|
| 173 |
+
r_squared=r_squared,
|
| 174 |
+
n_runs=n,
|
| 175 |
+
)
|
| 176 |
+
|
| 177 |
+
|
| 178 |
+
def detect_change_point(
|
| 179 |
+
cer_series: Iterable[tuple[str, float]],
|
| 180 |
+
min_segment_size: int = 3,
|
| 181 |
+
) -> Optional[ChangePointResult]:
|
| 182 |
+
"""Détecte le point de rupture maximisant l'écart de moyennes.
|
| 183 |
+
|
| 184 |
+
Algorithme : balayage des indices ``i`` où la série se
|
| 185 |
+
sépare en deux segments d'au moins ``min_segment_size``
|
| 186 |
+
points chacun ; on retient l'index où ``|mean_after -
|
| 187 |
+
mean_before|`` est maximal. Variante simplifiée de Pettitt.
|
| 188 |
+
|
| 189 |
+
Parameters
|
| 190 |
+
----------
|
| 191 |
+
cer_series:
|
| 192 |
+
Itérable de ``(timestamp_iso, cer)``.
|
| 193 |
+
min_segment_size:
|
| 194 |
+
Taille minimale des deux segments. Défaut 3.
|
| 195 |
+
|
| 196 |
+
Returns
|
| 197 |
+
-------
|
| 198 |
+
ChangePointResult | None
|
| 199 |
+
``None`` si la série a moins de ``2 × min_segment_size``
|
| 200 |
+
points valides.
|
| 201 |
+
"""
|
| 202 |
+
points: list[tuple[str, float, float]] = []
|
| 203 |
+
for ts, cer in cer_series:
|
| 204 |
+
t = _parse_timestamp(ts)
|
| 205 |
+
if t is None or cer is None:
|
| 206 |
+
continue
|
| 207 |
+
try:
|
| 208 |
+
cer_f = float(cer)
|
| 209 |
+
except (TypeError, ValueError):
|
| 210 |
+
continue
|
| 211 |
+
points.append((ts, t, cer_f))
|
| 212 |
+
if len(points) < 2 * min_segment_size:
|
| 213 |
+
return None
|
| 214 |
+
points.sort(key=lambda p: p[1])
|
| 215 |
+
n = len(points)
|
| 216 |
+
best_index = -1
|
| 217 |
+
best_abs_delta = -1.0
|
| 218 |
+
best_delta = 0.0
|
| 219 |
+
best_mean_before = 0.0
|
| 220 |
+
best_mean_after = 0.0
|
| 221 |
+
for i in range(min_segment_size, n - min_segment_size + 1):
|
| 222 |
+
before = [p[2] for p in points[:i]]
|
| 223 |
+
after = [p[2] for p in points[i:]]
|
| 224 |
+
mean_b = statistics.fmean(before)
|
| 225 |
+
mean_a = statistics.fmean(after)
|
| 226 |
+
delta = mean_a - mean_b
|
| 227 |
+
abs_delta = abs(delta)
|
| 228 |
+
if abs_delta > best_abs_delta:
|
| 229 |
+
best_abs_delta = abs_delta
|
| 230 |
+
best_index = i
|
| 231 |
+
best_delta = delta
|
| 232 |
+
best_mean_before = mean_b
|
| 233 |
+
best_mean_after = mean_a
|
| 234 |
+
if best_index < 0:
|
| 235 |
+
return None
|
| 236 |
+
return ChangePointResult(
|
| 237 |
+
index=best_index,
|
| 238 |
+
timestamp=points[best_index][0],
|
| 239 |
+
mean_before=best_mean_before,
|
| 240 |
+
mean_after=best_mean_after,
|
| 241 |
+
delta=best_delta,
|
| 242 |
+
n_before=best_index,
|
| 243 |
+
n_after=n - best_index,
|
| 244 |
+
)
|
| 245 |
+
|
| 246 |
+
|
| 247 |
+
def compute_engine_longitudinal(
|
| 248 |
+
history_entries: Iterable,
|
| 249 |
+
engine_name: str,
|
| 250 |
+
corpus_name: Optional[str] = None,
|
| 251 |
+
*,
|
| 252 |
+
min_runs_for_trend: int = 3,
|
| 253 |
+
min_segment_size: int = 3,
|
| 254 |
+
change_point_threshold: float = 0.01,
|
| 255 |
+
) -> Optional[dict]:
|
| 256 |
+
"""Calcule trend + change_point pour un moteur.
|
| 257 |
+
|
| 258 |
+
Parameters
|
| 259 |
+
----------
|
| 260 |
+
history_entries:
|
| 261 |
+
Liste de ``HistoryEntry`` (ou dicts compatibles).
|
| 262 |
+
engine_name:
|
| 263 |
+
Filtre sur le nom du moteur.
|
| 264 |
+
corpus_name:
|
| 265 |
+
Filtre optionnel sur le corpus. ``None`` (défaut) : tous
|
| 266 |
+
les corpus.
|
| 267 |
+
min_runs_for_trend:
|
| 268 |
+
Minimum de runs pour calculer une tendance.
|
| 269 |
+
min_segment_size:
|
| 270 |
+
Taille minimale des segments pour le change-point.
|
| 271 |
+
change_point_threshold:
|
| 272 |
+
Magnitude absolue minimale du delta (en CER) pour
|
| 273 |
+
retenir le change-point. Défaut 0.01 (1 point de CER).
|
| 274 |
+
|
| 275 |
+
Returns
|
| 276 |
+
-------
|
| 277 |
+
dict | None
|
| 278 |
+
``{
|
| 279 |
+
"engine_name", "corpus_name", "n_runs", "trend",
|
| 280 |
+
"change_point", # ou None
|
| 281 |
+
"first_timestamp", "last_timestamp",
|
| 282 |
+
"first_cer", "last_cer", "absolute_delta_pct",
|
| 283 |
+
}`` ou ``None`` si moins de ``min_runs_for_trend`` runs.
|
| 284 |
+
"""
|
| 285 |
+
series: list[tuple[str, float]] = []
|
| 286 |
+
for entry in history_entries:
|
| 287 |
+
if hasattr(entry, "as_dict"):
|
| 288 |
+
data = entry.as_dict()
|
| 289 |
+
else:
|
| 290 |
+
data = entry
|
| 291 |
+
if data.get("engine_name") != engine_name:
|
| 292 |
+
continue
|
| 293 |
+
if corpus_name is not None and data.get("corpus_name") != corpus_name:
|
| 294 |
+
continue
|
| 295 |
+
cer = data.get("cer_mean")
|
| 296 |
+
ts = data.get("timestamp")
|
| 297 |
+
if cer is None or ts is None:
|
| 298 |
+
continue
|
| 299 |
+
series.append((ts, float(cer)))
|
| 300 |
+
if len(series) < min_runs_for_trend:
|
| 301 |
+
return None
|
| 302 |
+
series.sort(key=lambda p: _parse_timestamp(p[0]) or 0.0)
|
| 303 |
+
trend = compute_linear_trend(series)
|
| 304 |
+
cp = detect_change_point(series, min_segment_size=min_segment_size)
|
| 305 |
+
if cp is not None and abs(cp.delta) < change_point_threshold:
|
| 306 |
+
cp = None
|
| 307 |
+
first_ts, first_cer = series[0]
|
| 308 |
+
last_ts, last_cer = series[-1]
|
| 309 |
+
return {
|
| 310 |
+
"engine_name": engine_name,
|
| 311 |
+
"corpus_name": corpus_name,
|
| 312 |
+
"n_runs": len(series),
|
| 313 |
+
"trend": trend.as_dict() if trend else None,
|
| 314 |
+
"change_point": cp.as_dict() if cp else None,
|
| 315 |
+
"first_timestamp": first_ts,
|
| 316 |
+
"last_timestamp": last_ts,
|
| 317 |
+
"first_cer": first_cer,
|
| 318 |
+
"last_cer": last_cer,
|
| 319 |
+
"absolute_delta": last_cer - first_cer,
|
| 320 |
+
"absolute_delta_pct": round((last_cer - first_cer) * 100, 2),
|
| 321 |
+
}
|
| 322 |
+
|
| 323 |
+
|
| 324 |
+
def compute_corpus_longitudinal(
|
| 325 |
+
history_entries: Iterable,
|
| 326 |
+
corpus_name: Optional[str] = None,
|
| 327 |
+
*,
|
| 328 |
+
min_runs_for_trend: int = 3,
|
| 329 |
+
min_segment_size: int = 3,
|
| 330 |
+
change_point_threshold: float = 0.01,
|
| 331 |
+
) -> list[dict]:
|
| 332 |
+
"""Pour chaque moteur présent dans l'historique sur ``corpus_name``,
|
| 333 |
+
calcule trend + change_point.
|
| 334 |
+
|
| 335 |
+
Returns
|
| 336 |
+
-------
|
| 337 |
+
list[dict]
|
| 338 |
+
Une entrée par moteur (filtrée), liste vide si rien.
|
| 339 |
+
"""
|
| 340 |
+
entries = list(history_entries)
|
| 341 |
+
engines: set[str] = set()
|
| 342 |
+
for entry in entries:
|
| 343 |
+
data = entry.as_dict() if hasattr(entry, "as_dict") else entry
|
| 344 |
+
if corpus_name is not None and data.get("corpus_name") != corpus_name:
|
| 345 |
+
continue
|
| 346 |
+
name = data.get("engine_name")
|
| 347 |
+
if name:
|
| 348 |
+
engines.add(name)
|
| 349 |
+
out: list[dict] = []
|
| 350 |
+
for engine in sorted(engines):
|
| 351 |
+
result = compute_engine_longitudinal(
|
| 352 |
+
entries, engine, corpus_name=corpus_name,
|
| 353 |
+
min_runs_for_trend=min_runs_for_trend,
|
| 354 |
+
min_segment_size=min_segment_size,
|
| 355 |
+
change_point_threshold=change_point_threshold,
|
| 356 |
+
)
|
| 357 |
+
if result is not None:
|
| 358 |
+
out.append(result)
|
| 359 |
+
return out
|
| 360 |
+
|
| 361 |
+
|
| 362 |
+
__all__ = [
|
| 363 |
+
"LinearTrend",
|
| 364 |
+
"ChangePointResult",
|
| 365 |
+
"compute_linear_trend",
|
| 366 |
+
"detect_change_point",
|
| 367 |
+
"compute_engine_longitudinal",
|
| 368 |
+
"compute_corpus_longitudinal",
|
| 369 |
+
]
|
| 370 |
+
|
| 371 |
+
|
| 372 |
+
# Marqueur d'évitement d'import inutilisé (math)
|
| 373 |
+
_ = math
|
|
@@ -0,0 +1,142 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""Coût marginal par erreur évitée — Sprint 91 (A.II.6 chantier 2).
|
| 2 |
+
|
| 3 |
+
Sprint 91 — A.II.6 chantier 2 du plan d'évolution 2026.
|
| 4 |
+
|
| 5 |
+
Pourquoi ce module
|
| 6 |
+
------------------
|
| 7 |
+
La vue Pareto (Sprint 20) trace CER vs coût mais n'arbitre pas
|
| 8 |
+
quel surcoût est *raisonnable* pour quelle réduction d'erreur.
|
| 9 |
+
Une institution avec un budget contraint a besoin d'une
|
| 10 |
+
réponse opérationnelle :
|
| 11 |
+
|
| 12 |
+
*« Passer de Tesseract à Mistral OCR coûte 0,83 € par
|
| 13 |
+
erreur évitée — décider selon votre budget par millier
|
| 14 |
+
d'erreurs corrigées. »*
|
| 15 |
+
|
| 16 |
+
Formule
|
| 17 |
+
-------
|
| 18 |
+
Pour deux moteurs A et B où B fait **moins** d'erreurs que A
|
| 19 |
+
(donc B est plus précis) :
|
| 20 |
+
|
| 21 |
+
.. code::
|
| 22 |
+
|
| 23 |
+
coût_marginal = (coût_B − coût_A) / (errors_A − errors_B)
|
| 24 |
+
|
| 25 |
+
- Si ``cost_B > cost_A`` et ``errors_B < errors_A`` :
|
| 26 |
+
``cost_per_avoided_error > 0`` (cas standard, B coûte plus
|
| 27 |
+
pour moins d'erreurs).
|
| 28 |
+
- Si ``cost_B ≤ cost_A`` et ``errors_B < errors_A`` :
|
| 29 |
+
``cost_per_avoided_error ≤ 0`` (cas idéal, B est strictement
|
| 30 |
+
meilleur).
|
| 31 |
+
- Si ``errors_B ≥ errors_A`` : non comparable dans ce sens
|
| 32 |
+
(B n'évite pas d'erreur), retourne ``None``.
|
| 33 |
+
|
| 34 |
+
Sortie
|
| 35 |
+
------
|
| 36 |
+
``compute_marginal_cost(cost_a, errors_a, cost_b, errors_b)``
|
| 37 |
+
retourne ``{cost_per_avoided_error, n_errors_avoided,
|
| 38 |
+
cost_delta, dominated}`` ou ``None`` si non comparable.
|
| 39 |
+
|
| 40 |
+
``compute_marginal_cost_matrix(per_engine)`` retourne, pour
|
| 41 |
+
chaque paire ordonnée ``(A → B)`` où B est plus précis, le
|
| 42 |
+
coût marginal correspondant. Trié par coût marginal croissant
|
| 43 |
+
(meilleur ratio en tête).
|
| 44 |
+
"""
|
| 45 |
+
|
| 46 |
+
from __future__ import annotations
|
| 47 |
+
|
| 48 |
+
import logging
|
| 49 |
+
from typing import Optional
|
| 50 |
+
|
| 51 |
+
logger = logging.getLogger(__name__)
|
| 52 |
+
|
| 53 |
+
|
| 54 |
+
def compute_marginal_cost(
|
| 55 |
+
cost_a: float,
|
| 56 |
+
errors_a: float,
|
| 57 |
+
cost_b: float,
|
| 58 |
+
errors_b: float,
|
| 59 |
+
) -> Optional[dict]:
|
| 60 |
+
"""Coût marginal du passage A → B (B plus précis).
|
| 61 |
+
|
| 62 |
+
Retourne ``None`` si :
|
| 63 |
+
- ``errors_b >= errors_a`` (B n'évite pas d'erreur) ;
|
| 64 |
+
- les valeurs ne sont pas finies.
|
| 65 |
+
"""
|
| 66 |
+
try:
|
| 67 |
+
ca = float(cost_a)
|
| 68 |
+
cb = float(cost_b)
|
| 69 |
+
ea = float(errors_a)
|
| 70 |
+
eb = float(errors_b)
|
| 71 |
+
except (TypeError, ValueError):
|
| 72 |
+
return None
|
| 73 |
+
if ea <= eb:
|
| 74 |
+
# B ne fait pas mieux que A → pas de gain à mesurer.
|
| 75 |
+
return None
|
| 76 |
+
n_avoided = ea - eb
|
| 77 |
+
cost_delta = cb - ca
|
| 78 |
+
cost_per_avoided = cost_delta / n_avoided
|
| 79 |
+
dominated = cost_delta <= 0 # B aussi cher ou moins → cas idéal
|
| 80 |
+
return {
|
| 81 |
+
"cost_per_avoided_error": cost_per_avoided,
|
| 82 |
+
"n_errors_avoided": n_avoided,
|
| 83 |
+
"cost_delta": cost_delta,
|
| 84 |
+
"dominated": dominated,
|
| 85 |
+
}
|
| 86 |
+
|
| 87 |
+
|
| 88 |
+
def compute_marginal_cost_matrix(
|
| 89 |
+
per_engine: dict[str, dict],
|
| 90 |
+
) -> Optional[dict]:
|
| 91 |
+
"""Pour chaque paire A → B où B fait moins d'erreurs, calcule
|
| 92 |
+
le coût marginal.
|
| 93 |
+
|
| 94 |
+
Parameters
|
| 95 |
+
----------
|
| 96 |
+
per_engine:
|
| 97 |
+
Map ``{engine_name: {"cost": float, "errors": float}}``.
|
| 98 |
+
|
| 99 |
+
Returns
|
| 100 |
+
-------
|
| 101 |
+
dict | None
|
| 102 |
+
``{
|
| 103 |
+
"pairs": list[
|
| 104 |
+
{"engine_a", "engine_b", "cost_per_avoided_error",
|
| 105 |
+
"n_errors_avoided", "cost_delta", "dominated"}
|
| 106 |
+
], # triée par cost_per_avoided_error croissant
|
| 107 |
+
}``
|
| 108 |
+
ou ``None`` si moins de 2 moteurs.
|
| 109 |
+
"""
|
| 110 |
+
if not per_engine or len(per_engine) < 2:
|
| 111 |
+
return None
|
| 112 |
+
engines = sorted(per_engine.keys())
|
| 113 |
+
pairs: list[dict] = []
|
| 114 |
+
for a in engines:
|
| 115 |
+
for b in engines:
|
| 116 |
+
if a == b:
|
| 117 |
+
continue
|
| 118 |
+
data_a = per_engine[a]
|
| 119 |
+
data_b = per_engine[b]
|
| 120 |
+
try:
|
| 121 |
+
ca = float(data_a.get("cost"))
|
| 122 |
+
ea = float(data_a.get("errors"))
|
| 123 |
+
cb = float(data_b.get("cost"))
|
| 124 |
+
eb = float(data_b.get("errors"))
|
| 125 |
+
except (TypeError, ValueError):
|
| 126 |
+
continue
|
| 127 |
+
result = compute_marginal_cost(ca, ea, cb, eb)
|
| 128 |
+
if result is None:
|
| 129 |
+
continue
|
| 130 |
+
entry = {"engine_a": a, "engine_b": b}
|
| 131 |
+
entry.update(result)
|
| 132 |
+
pairs.append(entry)
|
| 133 |
+
if not pairs:
|
| 134 |
+
return None
|
| 135 |
+
pairs.sort(key=lambda p: p["cost_per_avoided_error"])
|
| 136 |
+
return {"pairs": pairs}
|
| 137 |
+
|
| 138 |
+
|
| 139 |
+
__all__ = [
|
| 140 |
+
"compute_marginal_cost",
|
| 141 |
+
"compute_marginal_cost_matrix",
|
| 142 |
+
]
|
|
@@ -0,0 +1,333 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""Politique de modules contribués — Sprint 97 (B.6).
|
| 2 |
+
|
| 3 |
+
Sprint 97 — B.6 du plan d'évolution 2026.
|
| 4 |
+
|
| 5 |
+
Pourquoi ce module
|
| 6 |
+
------------------
|
| 7 |
+
Avant d'ouvrir Picarones aux contributions externes (axe B —
|
| 8 |
+
modules tiers que l'utilisateur amène), il faut un cadre de
|
| 9 |
+
qualité explicite : *« un module qui ne passe pas l'audit
|
| 10 |
+
n'est pas exécutable. »*
|
| 11 |
+
|
| 12 |
+
Ce module fournit l'**enveloppe d'audit** :
|
| 13 |
+
|
| 14 |
+
- ``ModuleManifest`` — métadonnées obligatoires (auteur,
|
| 15 |
+
licence, version, citation, contrat d'entrée/sortie typé).
|
| 16 |
+
- ``validate_manifest(manifest)`` — vérifie que tous les champs
|
| 17 |
+
obligatoires sont présents et bien formés.
|
| 18 |
+
- ``audit_module(module_class_or_instance, manifest)`` —
|
| 19 |
+
vérifie en plus que la classe respecte le contrat ``BaseModule``
|
| 20 |
+
et que ``input_types``/``output_types`` correspondent au
|
| 21 |
+
manifeste.
|
| 22 |
+
- ``AuditResult`` — verdict structuré ``passed/failed`` + liste
|
| 23 |
+
des checks détaillés.
|
| 24 |
+
|
| 25 |
+
Stratégie d'ouverture
|
| 26 |
+
---------------------
|
| 27 |
+
Phase fermée actuelle : modules officiels uniquement,
|
| 28 |
+
contributions via PR sur le repo principal. Phase ouverte
|
| 29 |
+
future : une fois 5–6 modules officiels stables, ouverture via
|
| 30 |
+
``entry_points`` sur PyPI (``picarones-module-X``). Ce module
|
| 31 |
+
prépare la phase ouverte sans la déclencher : tout module
|
| 32 |
+
externe devra fournir un ``ModuleManifest`` valide pour être
|
| 33 |
+
exécuté.
|
| 34 |
+
|
| 35 |
+
Pas de SPDX validator
|
| 36 |
+
---------------------
|
| 37 |
+
On vérifie la présence et la non-vacuité des champs licence ;
|
| 38 |
+
on ne valide pas la conformité SPDX du nom (``MIT`` vs
|
| 39 |
+
``mit-license`` vs ``MIT License``). Le chercheur reste
|
| 40 |
+
responsable du choix de licence ; l'outil documente, il ne
|
| 41 |
+
juge pas.
|
| 42 |
+
"""
|
| 43 |
+
|
| 44 |
+
from __future__ import annotations
|
| 45 |
+
|
| 46 |
+
import logging
|
| 47 |
+
from dataclasses import dataclass, field
|
| 48 |
+
from typing import Any, Optional
|
| 49 |
+
|
| 50 |
+
logger = logging.getLogger(__name__)
|
| 51 |
+
|
| 52 |
+
|
| 53 |
+
# Champs obligatoires d'un ManifestModule (texte non-vide).
|
| 54 |
+
_REQUIRED_TEXT_FIELDS = (
|
| 55 |
+
"name", "version", "author", "license",
|
| 56 |
+
"description",
|
| 57 |
+
)
|
| 58 |
+
|
| 59 |
+
|
| 60 |
+
@dataclass
|
| 61 |
+
class ModuleManifest:
|
| 62 |
+
"""Métadonnées d'un module contribué.
|
| 63 |
+
|
| 64 |
+
Attributes
|
| 65 |
+
----------
|
| 66 |
+
name:
|
| 67 |
+
Identifiant unique du module (ex. ``"my-llm-correcteur"``).
|
| 68 |
+
version:
|
| 69 |
+
Version sémantique (ex. ``"1.2.0"``).
|
| 70 |
+
author:
|
| 71 |
+
Auteur ou institution responsable.
|
| 72 |
+
license:
|
| 73 |
+
Identifiant de licence (SPDX recommandé, non validé).
|
| 74 |
+
description:
|
| 75 |
+
Description courte (≤ 1 phrase).
|
| 76 |
+
input_types:
|
| 77 |
+
Liste des types d'entrée (chaînes). Doit correspondre
|
| 78 |
+
à ``module.input_types`` (Sprint 33).
|
| 79 |
+
output_types:
|
| 80 |
+
Liste des types de sortie. Doit correspondre à
|
| 81 |
+
``module.output_types``.
|
| 82 |
+
citation:
|
| 83 |
+
Citation académique (BibTeX, DOI, ou texte libre).
|
| 84 |
+
Optionnel.
|
| 85 |
+
homepage:
|
| 86 |
+
URL du dépôt ou de la page projet. Optionnel.
|
| 87 |
+
picarones_min_version:
|
| 88 |
+
Version minimale de Picarones requise. Optionnel.
|
| 89 |
+
extra:
|
| 90 |
+
Métadonnées libres (clé → valeur).
|
| 91 |
+
"""
|
| 92 |
+
|
| 93 |
+
name: str
|
| 94 |
+
version: str
|
| 95 |
+
author: str
|
| 96 |
+
license: str
|
| 97 |
+
description: str
|
| 98 |
+
input_types: list[str] = field(default_factory=list)
|
| 99 |
+
output_types: list[str] = field(default_factory=list)
|
| 100 |
+
citation: Optional[str] = None
|
| 101 |
+
homepage: Optional[str] = None
|
| 102 |
+
picarones_min_version: Optional[str] = None
|
| 103 |
+
extra: dict = field(default_factory=dict)
|
| 104 |
+
|
| 105 |
+
def as_dict(self) -> dict:
|
| 106 |
+
return {
|
| 107 |
+
"name": self.name,
|
| 108 |
+
"version": self.version,
|
| 109 |
+
"author": self.author,
|
| 110 |
+
"license": self.license,
|
| 111 |
+
"description": self.description,
|
| 112 |
+
"input_types": list(self.input_types),
|
| 113 |
+
"output_types": list(self.output_types),
|
| 114 |
+
"citation": self.citation,
|
| 115 |
+
"homepage": self.homepage,
|
| 116 |
+
"picarones_min_version": self.picarones_min_version,
|
| 117 |
+
"extra": dict(self.extra),
|
| 118 |
+
}
|
| 119 |
+
|
| 120 |
+
|
| 121 |
+
@dataclass
|
| 122 |
+
class AuditCheck:
|
| 123 |
+
"""Un check individuel de l'audit."""
|
| 124 |
+
|
| 125 |
+
name: str
|
| 126 |
+
passed: bool
|
| 127 |
+
detail: Optional[str] = None
|
| 128 |
+
|
| 129 |
+
def as_dict(self) -> dict:
|
| 130 |
+
return {
|
| 131 |
+
"name": self.name,
|
| 132 |
+
"passed": self.passed,
|
| 133 |
+
"detail": self.detail,
|
| 134 |
+
}
|
| 135 |
+
|
| 136 |
+
|
| 137 |
+
@dataclass
|
| 138 |
+
class AuditResult:
|
| 139 |
+
"""Résultat global d'un audit de module."""
|
| 140 |
+
|
| 141 |
+
module_name: str
|
| 142 |
+
passed: bool
|
| 143 |
+
checks: list[AuditCheck] = field(default_factory=list)
|
| 144 |
+
|
| 145 |
+
@property
|
| 146 |
+
def n_passed(self) -> int:
|
| 147 |
+
return sum(1 for c in self.checks if c.passed)
|
| 148 |
+
|
| 149 |
+
@property
|
| 150 |
+
def n_failed(self) -> int:
|
| 151 |
+
return sum(1 for c in self.checks if not c.passed)
|
| 152 |
+
|
| 153 |
+
def as_dict(self) -> dict:
|
| 154 |
+
return {
|
| 155 |
+
"module_name": self.module_name,
|
| 156 |
+
"passed": self.passed,
|
| 157 |
+
"n_passed": self.n_passed,
|
| 158 |
+
"n_failed": self.n_failed,
|
| 159 |
+
"checks": [c.as_dict() for c in self.checks],
|
| 160 |
+
}
|
| 161 |
+
|
| 162 |
+
|
| 163 |
+
def validate_manifest(manifest: ModuleManifest) -> list[AuditCheck]:
|
| 164 |
+
"""Vérifie qu'un manifest est complet et bien formé.
|
| 165 |
+
|
| 166 |
+
Returns
|
| 167 |
+
-------
|
| 168 |
+
list[AuditCheck]
|
| 169 |
+
Un check par champ obligatoire + un check pour
|
| 170 |
+
``input_types``/``output_types`` non vides.
|
| 171 |
+
"""
|
| 172 |
+
checks: list[AuditCheck] = []
|
| 173 |
+
for field_name in _REQUIRED_TEXT_FIELDS:
|
| 174 |
+
value = getattr(manifest, field_name, None)
|
| 175 |
+
ok = isinstance(value, str) and bool(value.strip())
|
| 176 |
+
checks.append(AuditCheck(
|
| 177 |
+
name=f"manifest.{field_name}",
|
| 178 |
+
passed=ok,
|
| 179 |
+
detail=None if ok else f"champ '{field_name}' vide ou absent",
|
| 180 |
+
))
|
| 181 |
+
# input_types / output_types : au moins une entrée chacun
|
| 182 |
+
in_ok = (
|
| 183 |
+
isinstance(manifest.input_types, list)
|
| 184 |
+
and len(manifest.input_types) > 0
|
| 185 |
+
and all(
|
| 186 |
+
isinstance(t, str) and t for t in manifest.input_types
|
| 187 |
+
)
|
| 188 |
+
)
|
| 189 |
+
checks.append(AuditCheck(
|
| 190 |
+
name="manifest.input_types",
|
| 191 |
+
passed=in_ok,
|
| 192 |
+
detail=None if in_ok else "input_types vide ou non-string",
|
| 193 |
+
))
|
| 194 |
+
out_ok = (
|
| 195 |
+
isinstance(manifest.output_types, list)
|
| 196 |
+
and len(manifest.output_types) > 0
|
| 197 |
+
and all(
|
| 198 |
+
isinstance(t, str) and t for t in manifest.output_types
|
| 199 |
+
)
|
| 200 |
+
)
|
| 201 |
+
checks.append(AuditCheck(
|
| 202 |
+
name="manifest.output_types",
|
| 203 |
+
passed=out_ok,
|
| 204 |
+
detail=None if out_ok else "output_types vide ou non-string",
|
| 205 |
+
))
|
| 206 |
+
return checks
|
| 207 |
+
|
| 208 |
+
|
| 209 |
+
def _is_base_module(cls: Any) -> bool:
|
| 210 |
+
"""Best-effort : vérifie que cls hérite de BaseModule.
|
| 211 |
+
|
| 212 |
+
On ne **pas** importer ``BaseModule`` au top-level pour
|
| 213 |
+
éviter les cycles : on inspecte la chaîne de classes par
|
| 214 |
+
leur nom.
|
| 215 |
+
"""
|
| 216 |
+
try:
|
| 217 |
+
for base in cls.__mro__:
|
| 218 |
+
if base.__name__ == "BaseModule":
|
| 219 |
+
return True
|
| 220 |
+
except AttributeError:
|
| 221 |
+
return False
|
| 222 |
+
return False
|
| 223 |
+
|
| 224 |
+
|
| 225 |
+
def audit_module(
|
| 226 |
+
module_class_or_instance: Any,
|
| 227 |
+
manifest: ModuleManifest,
|
| 228 |
+
) -> AuditResult:
|
| 229 |
+
"""Audite un module contribué : interface + manifest.
|
| 230 |
+
|
| 231 |
+
Parameters
|
| 232 |
+
----------
|
| 233 |
+
module_class_or_instance:
|
| 234 |
+
Soit la classe ``BaseModule`` (Sprint 33), soit une
|
| 235 |
+
instance.
|
| 236 |
+
manifest:
|
| 237 |
+
``ModuleManifest`` correspondant au module.
|
| 238 |
+
|
| 239 |
+
Returns
|
| 240 |
+
-------
|
| 241 |
+
AuditResult
|
| 242 |
+
``passed=True`` ssi tous les checks passent.
|
| 243 |
+
"""
|
| 244 |
+
checks = validate_manifest(manifest)
|
| 245 |
+
|
| 246 |
+
# Check : héritage de BaseModule
|
| 247 |
+
cls = (
|
| 248 |
+
type(module_class_or_instance)
|
| 249 |
+
if not isinstance(module_class_or_instance, type)
|
| 250 |
+
else module_class_or_instance
|
| 251 |
+
)
|
| 252 |
+
inherits_base = _is_base_module(cls)
|
| 253 |
+
checks.append(AuditCheck(
|
| 254 |
+
name="module.inherits_base_module",
|
| 255 |
+
passed=inherits_base,
|
| 256 |
+
detail=(
|
| 257 |
+
None if inherits_base
|
| 258 |
+
else "la classe n'hérite pas de picarones.core.modules.BaseModule"
|
| 259 |
+
),
|
| 260 |
+
))
|
| 261 |
+
|
| 262 |
+
# Check : input_types / output_types correspondent
|
| 263 |
+
declared_in: list[str] = []
|
| 264 |
+
declared_out: list[str] = []
|
| 265 |
+
try:
|
| 266 |
+
instance = (
|
| 267 |
+
module_class_or_instance
|
| 268 |
+
if not isinstance(module_class_or_instance, type)
|
| 269 |
+
else None
|
| 270 |
+
)
|
| 271 |
+
attr_in = getattr(cls, "input_types", None)
|
| 272 |
+
attr_out = getattr(cls, "output_types", None)
|
| 273 |
+
if instance is not None:
|
| 274 |
+
attr_in = getattr(instance, "input_types", attr_in)
|
| 275 |
+
attr_out = getattr(instance, "output_types", attr_out)
|
| 276 |
+
if attr_in is not None:
|
| 277 |
+
declared_in = [
|
| 278 |
+
getattr(t, "value", str(t)) for t in attr_in
|
| 279 |
+
]
|
| 280 |
+
if attr_out is not None:
|
| 281 |
+
declared_out = [
|
| 282 |
+
getattr(t, "value", str(t)) for t in attr_out
|
| 283 |
+
]
|
| 284 |
+
except Exception: # noqa: BLE001
|
| 285 |
+
pass
|
| 286 |
+
# Comparaison case-insensitive : on accepte "TEXT" ou "text"
|
| 287 |
+
# côté manifest, le contrat sémantique est le même.
|
| 288 |
+
declared_in_lower = sorted(t.lower() for t in declared_in)
|
| 289 |
+
declared_out_lower = sorted(t.lower() for t in declared_out)
|
| 290 |
+
manifest_in_lower = sorted(t.lower() for t in manifest.input_types)
|
| 291 |
+
manifest_out_lower = sorted(t.lower() for t in manifest.output_types)
|
| 292 |
+
in_match = declared_in_lower == manifest_in_lower
|
| 293 |
+
checks.append(AuditCheck(
|
| 294 |
+
name="module.input_types_match_manifest",
|
| 295 |
+
passed=in_match,
|
| 296 |
+
detail=(
|
| 297 |
+
None if in_match
|
| 298 |
+
else f"déclaré {declared_in} vs manifest {manifest.input_types}"
|
| 299 |
+
),
|
| 300 |
+
))
|
| 301 |
+
out_match = declared_out_lower == manifest_out_lower
|
| 302 |
+
checks.append(AuditCheck(
|
| 303 |
+
name="module.output_types_match_manifest",
|
| 304 |
+
passed=out_match,
|
| 305 |
+
detail=(
|
| 306 |
+
None if out_match
|
| 307 |
+
else f"déclaré {declared_out} vs manifest {manifest.output_types}"
|
| 308 |
+
),
|
| 309 |
+
))
|
| 310 |
+
|
| 311 |
+
# Check : process callable
|
| 312 |
+
has_process = callable(getattr(cls, "process", None))
|
| 313 |
+
checks.append(AuditCheck(
|
| 314 |
+
name="module.has_process",
|
| 315 |
+
passed=has_process,
|
| 316 |
+
detail=None if has_process else "méthode process() absente",
|
| 317 |
+
))
|
| 318 |
+
|
| 319 |
+
passed = all(c.passed for c in checks)
|
| 320 |
+
return AuditResult(
|
| 321 |
+
module_name=manifest.name,
|
| 322 |
+
passed=passed,
|
| 323 |
+
checks=checks,
|
| 324 |
+
)
|
| 325 |
+
|
| 326 |
+
|
| 327 |
+
__all__ = [
|
| 328 |
+
"ModuleManifest",
|
| 329 |
+
"AuditCheck",
|
| 330 |
+
"AuditResult",
|
| 331 |
+
"validate_manifest",
|
| 332 |
+
"audit_module",
|
| 333 |
+
]
|
|
@@ -0,0 +1,313 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""Modélisation des coûts — APIs cloud et temps d'inférence local.
|
| 2 |
+
|
| 3 |
+
Sert uniquement à la vue Pareto coût/qualité du rapport (Sprint 5).
|
| 4 |
+
Les prix sont indicatifs et vieillissent vite : voir ``picarones/data/pricing.yaml``
|
| 5 |
+
pour les hypothèses, dates et URLs de référence.
|
| 6 |
+
|
| 7 |
+
Conventions
|
| 8 |
+
-----------
|
| 9 |
+
- Unité monétaire : EUR (conversion indicative depuis USD quand applicable).
|
| 10 |
+
- Coût exprimé par **1 000 pages** traitées.
|
| 11 |
+
- Coût local = temps moyen d'inférence × taux horaire (paramétrable).
|
| 12 |
+
- Empreinte carbone optionnelle : kWh × intensité g CO₂/kWh du réseau
|
| 13 |
+
d'exécution (mix France bas carbone par défaut pour le local,
|
| 14 |
+
moyenne cloud hyperscaler pour les APIs).
|
| 15 |
+
"""
|
| 16 |
+
|
| 17 |
+
from __future__ import annotations
|
| 18 |
+
|
| 19 |
+
import logging
|
| 20 |
+
from dataclasses import dataclass, field
|
| 21 |
+
from pathlib import Path
|
| 22 |
+
from typing import Optional
|
| 23 |
+
|
| 24 |
+
import yaml
|
| 25 |
+
|
| 26 |
+
logger = logging.getLogger(__name__)
|
| 27 |
+
|
| 28 |
+
# Sprint A14-S10 — chemin ajusté après déplacement de
|
| 29 |
+
# ``picarones/measurements/pricing.py`` vers
|
| 30 |
+
# ``picarones/evaluation/metrics/pricing.py``. Le YAML reste dans
|
| 31 |
+
# ``picarones/data/``, donc on remonte de 3 niveaux au lieu de 2.
|
| 32 |
+
_DEFAULT_PRICING_PATH = Path(__file__).parent.parent.parent / "data" / "pricing.yaml"
|
| 33 |
+
|
| 34 |
+
|
| 35 |
+
@dataclass(frozen=True)
|
| 36 |
+
class PricingDefaults:
|
| 37 |
+
"""Valeurs par défaut du fichier de prix (section ``meta``)."""
|
| 38 |
+
|
| 39 |
+
last_updated: Optional[str] = None
|
| 40 |
+
currency: str = "EUR"
|
| 41 |
+
hourly_rate_local_cpu_eur: float = 0.08
|
| 42 |
+
hourly_rate_local_gpu_eur: float = 1.20
|
| 43 |
+
grid_intensity_local: float = 58.0
|
| 44 |
+
grid_intensity_cloud: float = 380.0
|
| 45 |
+
|
| 46 |
+
|
| 47 |
+
@dataclass
|
| 48 |
+
class EngineCost:
|
| 49 |
+
"""Coût estimé d'un moteur sur 1 000 pages, avec traçabilité des hypothèses.
|
| 50 |
+
|
| 51 |
+
La représentation est immuable après construction : une fois que l'utilisateur
|
| 52 |
+
a choisi un taux horaire local, toutes les instances partagent cette
|
| 53 |
+
hypothèse par injection explicite dans ``build_costs_for_benchmark``.
|
| 54 |
+
"""
|
| 55 |
+
|
| 56 |
+
engine_key: str
|
| 57 |
+
"""Nom ou modèle servant de clé dans la table (ex. ``"gpt-4o"``, ``"tesseract"``)."""
|
| 58 |
+
|
| 59 |
+
type: str # "local" | "cloud_api" | "unknown"
|
| 60 |
+
|
| 61 |
+
cost_per_1k_pages_eur: Optional[float] = None
|
| 62 |
+
"""Coût par 1 000 pages en euros. ``None`` si les données sont insuffisantes."""
|
| 63 |
+
|
| 64 |
+
currency: str = "EUR"
|
| 65 |
+
|
| 66 |
+
# Source / date
|
| 67 |
+
pricing_source_url: Optional[str] = None
|
| 68 |
+
pricing_date: Optional[str] = None
|
| 69 |
+
|
| 70 |
+
# Pour les APIs cloud : prix brut
|
| 71 |
+
api_price_per_1k_pages: Optional[float] = None
|
| 72 |
+
|
| 73 |
+
# Pour le local : temps d'inférence et taux horaire utilisés
|
| 74 |
+
local_mean_seconds_per_page: Optional[float] = None
|
| 75 |
+
hourly_rate_eur: Optional[float] = None
|
| 76 |
+
|
| 77 |
+
# Empreinte carbone (estimation — étiquetée "expérimentale" dans le rapport)
|
| 78 |
+
kwh_per_1k_pages: Optional[float] = None
|
| 79 |
+
grid_intensity_g_co2_per_kwh: Optional[float] = None
|
| 80 |
+
co2_per_1k_pages_g: Optional[float] = None
|
| 81 |
+
|
| 82 |
+
notes: Optional[str] = None
|
| 83 |
+
|
| 84 |
+
assumptions: list[str] = field(default_factory=list)
|
| 85 |
+
"""Liste d'hypothèses textuelles à afficher sous le graphique."""
|
| 86 |
+
|
| 87 |
+
def as_dict(self) -> dict:
|
| 88 |
+
return {
|
| 89 |
+
"engine_key": self.engine_key,
|
| 90 |
+
"type": self.type,
|
| 91 |
+
"cost_per_1k_pages_eur": self.cost_per_1k_pages_eur,
|
| 92 |
+
"currency": self.currency,
|
| 93 |
+
"pricing_source_url": self.pricing_source_url,
|
| 94 |
+
"pricing_date": self.pricing_date,
|
| 95 |
+
"api_price_per_1k_pages": self.api_price_per_1k_pages,
|
| 96 |
+
"local_mean_seconds_per_page": self.local_mean_seconds_per_page,
|
| 97 |
+
"hourly_rate_eur": self.hourly_rate_eur,
|
| 98 |
+
"kwh_per_1k_pages": self.kwh_per_1k_pages,
|
| 99 |
+
"grid_intensity_g_co2_per_kwh": self.grid_intensity_g_co2_per_kwh,
|
| 100 |
+
"co2_per_1k_pages_g": self.co2_per_1k_pages_g,
|
| 101 |
+
"notes": self.notes,
|
| 102 |
+
"assumptions": list(self.assumptions),
|
| 103 |
+
}
|
| 104 |
+
|
| 105 |
+
|
| 106 |
+
def load_pricing_database(path: Optional[Path] = None) -> tuple[PricingDefaults, dict]:
|
| 107 |
+
"""Charge la table de prix YAML.
|
| 108 |
+
|
| 109 |
+
Retourne ``(defaults, engines_table)`` où ``engines_table`` est un dict
|
| 110 |
+
``{engine_key: raw_entry}``.
|
| 111 |
+
"""
|
| 112 |
+
path = Path(path) if path else _DEFAULT_PRICING_PATH
|
| 113 |
+
if not path.exists():
|
| 114 |
+
logger.warning("[pricing] fichier %s introuvable", path)
|
| 115 |
+
return PricingDefaults(), {}
|
| 116 |
+
try:
|
| 117 |
+
with path.open(encoding="utf-8") as fh:
|
| 118 |
+
data = yaml.safe_load(fh) or {}
|
| 119 |
+
except yaml.YAMLError as e:
|
| 120 |
+
logger.warning("[pricing] échec parsing %s : %s", path, e)
|
| 121 |
+
return PricingDefaults(), {}
|
| 122 |
+
|
| 123 |
+
meta = data.get("meta", {}) or {}
|
| 124 |
+
defaults = PricingDefaults(
|
| 125 |
+
last_updated=meta.get("last_updated"),
|
| 126 |
+
currency=meta.get("currency", "EUR"),
|
| 127 |
+
hourly_rate_local_cpu_eur=float(meta.get("default_hourly_rate_local_cpu_eur", 0.08)),
|
| 128 |
+
hourly_rate_local_gpu_eur=float(meta.get("default_hourly_rate_local_gpu_eur", 1.20)),
|
| 129 |
+
grid_intensity_local=float(meta.get("default_grid_intensity_g_co2_per_kwh", 58.0)),
|
| 130 |
+
grid_intensity_cloud=float(meta.get("cloud_grid_intensity_g_co2_per_kwh", 380.0)),
|
| 131 |
+
)
|
| 132 |
+
engines_table = data.get("engines", {}) or {}
|
| 133 |
+
return defaults, engines_table
|
| 134 |
+
|
| 135 |
+
|
| 136 |
+
def _match_key(engine_name: str, llm_model: Optional[str], table: dict) -> Optional[str]:
|
| 137 |
+
"""Cherche la meilleure clé pour ce moteur dans la table.
|
| 138 |
+
|
| 139 |
+
Stratégie : d'abord le nom du modèle LLM (pour les pipelines), puis le
|
| 140 |
+
nom OCR, puis un match partiel (substring) comme filet de sécurité.
|
| 141 |
+
"""
|
| 142 |
+
candidates = [llm_model, engine_name]
|
| 143 |
+
for c in candidates:
|
| 144 |
+
if c and c in table:
|
| 145 |
+
return c
|
| 146 |
+
# Matching partiel — utile pour "tesseract → gpt-4o" ou "gpt-4o-vision"
|
| 147 |
+
for c in candidates:
|
| 148 |
+
if not c:
|
| 149 |
+
continue
|
| 150 |
+
for key in table:
|
| 151 |
+
if key in c:
|
| 152 |
+
return key
|
| 153 |
+
return None
|
| 154 |
+
|
| 155 |
+
|
| 156 |
+
def estimate_cost(
|
| 157 |
+
engine_name: str,
|
| 158 |
+
*,
|
| 159 |
+
llm_model: Optional[str] = None,
|
| 160 |
+
is_pipeline: bool = False,
|
| 161 |
+
measured_seconds_per_page: Optional[float] = None,
|
| 162 |
+
table: Optional[dict] = None,
|
| 163 |
+
defaults: Optional[PricingDefaults] = None,
|
| 164 |
+
hourly_rate_override_eur: Optional[float] = None,
|
| 165 |
+
) -> EngineCost:
|
| 166 |
+
"""Calcule le ``EngineCost`` pour un moteur donné.
|
| 167 |
+
|
| 168 |
+
Parameters
|
| 169 |
+
----------
|
| 170 |
+
engine_name:
|
| 171 |
+
Nom public du moteur (ex. ``"tesseract"``, ``"tesseract → gpt-4o"``).
|
| 172 |
+
llm_model:
|
| 173 |
+
Si pipeline OCR+LLM, le modèle LLM utilisé — prioritaire pour la
|
| 174 |
+
lookup car c'est lui qui domine le coût.
|
| 175 |
+
is_pipeline:
|
| 176 |
+
Indique un pipeline OCR+LLM (change la sémantique de lookup).
|
| 177 |
+
measured_seconds_per_page:
|
| 178 |
+
Temps moyen observé sur le benchmark courant. Remplace la valeur
|
| 179 |
+
indicative de la table si fournie (plus fiable).
|
| 180 |
+
table, defaults:
|
| 181 |
+
Overrides pour tests ou usage institutionnel.
|
| 182 |
+
hourly_rate_override_eur:
|
| 183 |
+
Taux horaire à utiliser pour le calcul local (sinon valeur table
|
| 184 |
+
ou défaut).
|
| 185 |
+
"""
|
| 186 |
+
if table is None or defaults is None:
|
| 187 |
+
_defaults, _table = load_pricing_database()
|
| 188 |
+
defaults = defaults or _defaults
|
| 189 |
+
table = table or _table
|
| 190 |
+
|
| 191 |
+
key = _match_key(engine_name, llm_model if is_pipeline else None, table)
|
| 192 |
+
if key is None:
|
| 193 |
+
return EngineCost(
|
| 194 |
+
engine_key=engine_name,
|
| 195 |
+
type="unknown",
|
| 196 |
+
assumptions=["Aucune entrée dans la table de prix pour ce moteur."],
|
| 197 |
+
)
|
| 198 |
+
|
| 199 |
+
entry = table[key]
|
| 200 |
+
etype = str(entry.get("type", "unknown"))
|
| 201 |
+
notes = entry.get("notes")
|
| 202 |
+
assumptions: list[str] = []
|
| 203 |
+
currency = defaults.currency
|
| 204 |
+
|
| 205 |
+
cost_eur: Optional[float] = None
|
| 206 |
+
api_price: Optional[float] = None
|
| 207 |
+
local_seconds = measured_seconds_per_page
|
| 208 |
+
hourly_rate = None
|
| 209 |
+
|
| 210 |
+
if etype == "cloud_api":
|
| 211 |
+
api_price = entry.get("api_price_per_1k_pages")
|
| 212 |
+
if api_price is not None:
|
| 213 |
+
cost_eur = float(api_price)
|
| 214 |
+
assumptions.append(
|
| 215 |
+
f"Prix API indicatif : {cost_eur:.2f} €/1000 pages "
|
| 216 |
+
f"(source : {entry.get('pricing_source_url', '—')}, {entry.get('pricing_date', 'date inconnue')})."
|
| 217 |
+
)
|
| 218 |
+
elif etype == "local":
|
| 219 |
+
indicative_seconds = entry.get("local_mean_seconds_per_page")
|
| 220 |
+
if local_seconds is None and indicative_seconds is not None:
|
| 221 |
+
local_seconds = float(indicative_seconds)
|
| 222 |
+
assumptions.append(
|
| 223 |
+
f"Temps d'inférence indicatif : {local_seconds:.1f} s/page (non mesuré sur ce benchmark)."
|
| 224 |
+
)
|
| 225 |
+
elif local_seconds is not None:
|
| 226 |
+
assumptions.append(
|
| 227 |
+
f"Temps d'inférence mesuré : {local_seconds:.1f} s/page (moyenne sur le corpus)."
|
| 228 |
+
)
|
| 229 |
+
|
| 230 |
+
hourly_rate = (
|
| 231 |
+
hourly_rate_override_eur
|
| 232 |
+
if hourly_rate_override_eur is not None
|
| 233 |
+
else entry.get("hourly_rate_override_eur")
|
| 234 |
+
)
|
| 235 |
+
if hourly_rate is None:
|
| 236 |
+
# Heuristique : si l'entrée précise un override GPU, sinon CPU
|
| 237 |
+
hourly_rate = (
|
| 238 |
+
defaults.hourly_rate_local_gpu_eur
|
| 239 |
+
if "gpu" in str(notes or "").lower()
|
| 240 |
+
else defaults.hourly_rate_local_cpu_eur
|
| 241 |
+
)
|
| 242 |
+
hourly_rate = float(hourly_rate)
|
| 243 |
+
|
| 244 |
+
if local_seconds is not None and hourly_rate is not None:
|
| 245 |
+
cost_eur = (local_seconds / 3600.0) * hourly_rate * 1000.0
|
| 246 |
+
assumptions.append(
|
| 247 |
+
f"Taux horaire appliqué : {hourly_rate:.2f} €/h "
|
| 248 |
+
f"(défaut {'GPU' if hourly_rate >= 0.5 else 'CPU'})."
|
| 249 |
+
)
|
| 250 |
+
|
| 251 |
+
# Empreinte carbone optionnelle
|
| 252 |
+
kwh_1k = entry.get("kwh_per_1k_pages")
|
| 253 |
+
grid = (
|
| 254 |
+
entry.get("grid_intensity_g_co2_per_kwh")
|
| 255 |
+
or (defaults.grid_intensity_cloud if etype == "cloud_api" else defaults.grid_intensity_local)
|
| 256 |
+
)
|
| 257 |
+
co2_g = None
|
| 258 |
+
if kwh_1k is not None and grid is not None:
|
| 259 |
+
co2_g = float(kwh_1k) * float(grid)
|
| 260 |
+
|
| 261 |
+
return EngineCost(
|
| 262 |
+
engine_key=key,
|
| 263 |
+
type=etype,
|
| 264 |
+
cost_per_1k_pages_eur=cost_eur,
|
| 265 |
+
currency=currency,
|
| 266 |
+
pricing_source_url=entry.get("pricing_source_url"),
|
| 267 |
+
pricing_date=entry.get("pricing_date"),
|
| 268 |
+
api_price_per_1k_pages=api_price,
|
| 269 |
+
local_mean_seconds_per_page=local_seconds,
|
| 270 |
+
hourly_rate_eur=hourly_rate,
|
| 271 |
+
kwh_per_1k_pages=float(kwh_1k) if kwh_1k is not None else None,
|
| 272 |
+
grid_intensity_g_co2_per_kwh=float(grid) if grid is not None else None,
|
| 273 |
+
co2_per_1k_pages_g=co2_g,
|
| 274 |
+
notes=notes,
|
| 275 |
+
assumptions=assumptions,
|
| 276 |
+
)
|
| 277 |
+
|
| 278 |
+
|
| 279 |
+
def build_costs_for_benchmark(
|
| 280 |
+
engines_summary: list[dict],
|
| 281 |
+
durations_by_engine: dict[str, float],
|
| 282 |
+
*,
|
| 283 |
+
hourly_rate_local_eur: Optional[float] = None,
|
| 284 |
+
pricing_path: Optional[Path] = None,
|
| 285 |
+
) -> dict[str, dict]:
|
| 286 |
+
"""Calcule le coût de chaque moteur d'un benchmark.
|
| 287 |
+
|
| 288 |
+
Returns
|
| 289 |
+
-------
|
| 290 |
+
dict ``{engine_name: EngineCost.as_dict()}``.
|
| 291 |
+
"""
|
| 292 |
+
defaults, table = load_pricing_database(pricing_path)
|
| 293 |
+
out: dict[str, dict] = {}
|
| 294 |
+
for e in engines_summary:
|
| 295 |
+
name = e.get("name")
|
| 296 |
+
if not name:
|
| 297 |
+
continue
|
| 298 |
+
measured = durations_by_engine.get(name)
|
| 299 |
+
llm_model = None
|
| 300 |
+
pipeline_info = e.get("pipeline_info") or {}
|
| 301 |
+
if pipeline_info:
|
| 302 |
+
llm_model = pipeline_info.get("llm_model")
|
| 303 |
+
cost = estimate_cost(
|
| 304 |
+
engine_name=name,
|
| 305 |
+
llm_model=llm_model,
|
| 306 |
+
is_pipeline=bool(e.get("is_pipeline")),
|
| 307 |
+
measured_seconds_per_page=measured,
|
| 308 |
+
table=table,
|
| 309 |
+
defaults=defaults,
|
| 310 |
+
hourly_rate_override_eur=hourly_rate_local_eur,
|
| 311 |
+
)
|
| 312 |
+
out[name] = cost.as_dict()
|
| 313 |
+
return out
|
|
@@ -0,0 +1,254 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""Rare-token recall — Sprint 71 (A.I.1 chantier 2 du plan 2026).
|
| 2 |
+
|
| 3 |
+
Pourquoi ce module
|
| 4 |
+
------------------
|
| 5 |
+
Le CER global d'un moteur peut sembler bon (ex. 5 %) tout en
|
| 6 |
+
masquant des **erreurs systématiques sur les tokens rares** : noms
|
| 7 |
+
propres, toponymes peu fréquents, mots techniques, formules latines
|
| 8 |
+
récurrentes mais pas dominantes. Pour un usage prosopographique
|
| 9 |
+
(indexation de noms, recherche généalogique), ce sont précisément
|
| 10 |
+
ces tokens-là qui comptent.
|
| 11 |
+
|
| 12 |
+
Ce module mesure le **rappel sur les tokens rares** d'un corpus —
|
| 13 |
+
défaut : tokens dont la fréquence corpus-wide est ≤ 2 (hapax +
|
| 14 |
+
dis legomena, terminologie de lexicométrie classique).
|
| 15 |
+
|
| 16 |
+
Hypothèse à valider expérimentalement
|
| 17 |
+
-------------------------------------
|
| 18 |
+
La conjecture du plan A.I.1 : *« cette métrique discrimine plus
|
| 19 |
+
les moteurs que le CER global »*. Si confirmée sur un corpus
|
| 20 |
+
patrimonial réel, elle gagne sa place dans le tableau de
|
| 21 |
+
classement principal — décision laissée au chercheur après
|
| 22 |
+
observation.
|
| 23 |
+
|
| 24 |
+
Stratégie de découpage
|
| 25 |
+
----------------------
|
| 26 |
+
Cohérente avec NER (38), Flesch (52), philologie (55-60) : couche
|
| 27 |
+
de calcul pure d'abord, sans intégration runner. La vue HTML
|
| 28 |
+
« worst lines / rare tokens manqués » suit dans un sprint dédié.
|
| 29 |
+
|
| 30 |
+
Pas d'enregistrement dans le registre typé Sprint 34
|
| 31 |
+
----------------------------------------------------
|
| 32 |
+
La métrique exige **trois entrées** (reference, hypothesis, set
|
| 33 |
+
des tokens rares) et le set des rares est calculé corpus-wide
|
| 34 |
+
(donc connu seulement après itération sur tout le corpus). La
|
| 35 |
+
signature ne rentre pas dans ``(TEXT, TEXT)``. L'utilisateur
|
| 36 |
+
appelle explicitement ``compute_rare_token_recall`` avec le set
|
| 37 |
+
qu'il a calculé.
|
| 38 |
+
"""
|
| 39 |
+
|
| 40 |
+
from __future__ import annotations
|
| 41 |
+
|
| 42 |
+
import logging
|
| 43 |
+
import re
|
| 44 |
+
from collections import Counter
|
| 45 |
+
from typing import Iterable, Optional
|
| 46 |
+
|
| 47 |
+
logger = logging.getLogger(__name__)
|
| 48 |
+
|
| 49 |
+
|
| 50 |
+
# ──────────────────────────────────────────────────────────────────────────
|
| 51 |
+
# Tokenisation Unicode-aware
|
| 52 |
+
# ──────────────────────────────────────────────────────────────────────────
|
| 53 |
+
|
| 54 |
+
# Token = séquence maximale de caractères de mot Unicode (\w en
|
| 55 |
+
# Python 3 utilise déjà la table Unicode), incluant l'apostrophe
|
| 56 |
+
# typographique '’' à l'intérieur (« l'an », « d’une ») et les
|
| 57 |
+
# tirets internes (« peut-être »). La ponctuation isolée et les
|
| 58 |
+
# espaces sont des séparateurs.
|
| 59 |
+
|
| 60 |
+
_TOKEN_RE = re.compile(
|
| 61 |
+
r"\w+(?:[’'\-]\w+)*",
|
| 62 |
+
flags=re.UNICODE,
|
| 63 |
+
)
|
| 64 |
+
|
| 65 |
+
|
| 66 |
+
def tokenize(text: Optional[str]) -> list[str]:
|
| 67 |
+
"""Tokenisation Unicode-aware.
|
| 68 |
+
|
| 69 |
+
Conserve les contractions (``l'an``, ``d’une``) et les mots
|
| 70 |
+
composés (``peut-être``, ``c'est-à-dire``) comme un seul token.
|
| 71 |
+
Casse préservée — l'utilisateur normalise lui-même via
|
| 72 |
+
``case_sensitive=False`` dans les fonctions aval s'il le veut.
|
| 73 |
+
"""
|
| 74 |
+
if not text:
|
| 75 |
+
return []
|
| 76 |
+
return _TOKEN_RE.findall(text)
|
| 77 |
+
|
| 78 |
+
|
| 79 |
+
# ──────────────────────────────────────────────────────────────────────────
|
| 80 |
+
# Distribution de fréquence corpus-wide
|
| 81 |
+
# ──────────────────────────────────────────────────────────────────────────
|
| 82 |
+
|
| 83 |
+
|
| 84 |
+
def frequency_distribution(
|
| 85 |
+
documents: Iterable[str],
|
| 86 |
+
*,
|
| 87 |
+
case_sensitive: bool = False,
|
| 88 |
+
) -> Counter[str]:
|
| 89 |
+
"""Calcule ``{token: count}`` sur l'ensemble du corpus.
|
| 90 |
+
|
| 91 |
+
Parameters
|
| 92 |
+
----------
|
| 93 |
+
documents:
|
| 94 |
+
Itérable de textes (typiquement les ``ground_truth`` des
|
| 95 |
+
documents du corpus).
|
| 96 |
+
case_sensitive:
|
| 97 |
+
Si ``False`` (défaut), tous les tokens sont mis en
|
| 98 |
+
minuscule avant comptage.
|
| 99 |
+
"""
|
| 100 |
+
counter: Counter[str] = Counter()
|
| 101 |
+
for doc in documents:
|
| 102 |
+
tokens = tokenize(doc)
|
| 103 |
+
if not case_sensitive:
|
| 104 |
+
tokens = [t.lower() for t in tokens]
|
| 105 |
+
counter.update(tokens)
|
| 106 |
+
return counter
|
| 107 |
+
|
| 108 |
+
|
| 109 |
+
def extract_rare_tokens(
|
| 110 |
+
documents: Iterable[str],
|
| 111 |
+
*,
|
| 112 |
+
max_freq: int = 2,
|
| 113 |
+
case_sensitive: bool = False,
|
| 114 |
+
) -> frozenset[str]:
|
| 115 |
+
"""Retourne l'ensemble des tokens dont la fréquence
|
| 116 |
+
corpus-wide est ``≤ max_freq``.
|
| 117 |
+
|
| 118 |
+
Convention de lexicométrie : ``max_freq=1`` retourne uniquement
|
| 119 |
+
les hapax legomena (1 occurrence) ; ``max_freq=2`` retourne
|
| 120 |
+
hapax + dis legomena (≤ 2 occurrences) — défaut.
|
| 121 |
+
|
| 122 |
+
Les tokens qui n'apparaissent **jamais** dans le corpus ne sont
|
| 123 |
+
évidemment pas inclus (le ``Counter`` ne les liste pas).
|
| 124 |
+
"""
|
| 125 |
+
if max_freq < 1:
|
| 126 |
+
raise ValueError("max_freq doit être ≥ 1")
|
| 127 |
+
counter = frequency_distribution(
|
| 128 |
+
documents, case_sensitive=case_sensitive,
|
| 129 |
+
)
|
| 130 |
+
return frozenset(t for t, c in counter.items() if c <= max_freq)
|
| 131 |
+
|
| 132 |
+
|
| 133 |
+
# ──────────────────────────────────────────────────────────────────────────
|
| 134 |
+
# Calcul du rappel par document
|
| 135 |
+
# ──────────────────────────────────────────────────────────────────────────
|
| 136 |
+
|
| 137 |
+
|
| 138 |
+
def compute_rare_token_recall(
|
| 139 |
+
reference: Optional[str],
|
| 140 |
+
hypothesis: Optional[str],
|
| 141 |
+
rare_tokens: Iterable[str],
|
| 142 |
+
*,
|
| 143 |
+
case_sensitive: bool = False,
|
| 144 |
+
) -> dict:
|
| 145 |
+
"""Calcule le rappel sur les tokens rares présents dans la GT.
|
| 146 |
+
|
| 147 |
+
Parameters
|
| 148 |
+
----------
|
| 149 |
+
reference:
|
| 150 |
+
Texte GT du document.
|
| 151 |
+
hypothesis:
|
| 152 |
+
Texte produit par l'OCR.
|
| 153 |
+
rare_tokens:
|
| 154 |
+
Itérable des tokens rares — typiquement le résultat de
|
| 155 |
+
``extract_rare_tokens`` sur le corpus complet.
|
| 156 |
+
case_sensitive:
|
| 157 |
+
Si ``False`` (défaut), la comparaison se fait sur les
|
| 158 |
+
formes minuscules.
|
| 159 |
+
|
| 160 |
+
Returns
|
| 161 |
+
-------
|
| 162 |
+
dict
|
| 163 |
+
``{
|
| 164 |
+
"n_rare_tokens_in_reference": int,
|
| 165 |
+
# nombre d'**occurrences** de tokens rares dans la GT
|
| 166 |
+
# (multiplicité préservée — un token rare présent 2
|
| 167 |
+
# fois compte 2)
|
| 168 |
+
"n_rare_tokens_recalled": int,
|
| 169 |
+
# nombre d'occurrences correctement présentes dans hyp
|
| 170 |
+
# (alignement bag-of-tokens : min(count_ref, count_hyp))
|
| 171 |
+
"recall": float,
|
| 172 |
+
# ratio dans [0, 1], ou 0.0 si aucun rare en GT
|
| 173 |
+
"missed_tokens": list[str],
|
| 174 |
+
# liste des tokens rares **manqués** (avec multiplicité,
|
| 175 |
+
# ex. "Dupont" présent 2 fois en GT et 1 fois en hyp →
|
| 176 |
+
# missed_tokens contient ["Dupont"] une fois)
|
| 177 |
+
}``
|
| 178 |
+
|
| 179 |
+
Cas dégénérés
|
| 180 |
+
-------------
|
| 181 |
+
- GT vide ou aucun token rare présent → recall = 0.0, listes
|
| 182 |
+
vides (convention : on ne récompense pas l'absence de
|
| 183 |
+
tokens rares).
|
| 184 |
+
- Hyp vide avec rares en GT → tous manqués, recall = 0.0.
|
| 185 |
+
"""
|
| 186 |
+
ref = reference or ""
|
| 187 |
+
hyp = hypothesis or ""
|
| 188 |
+
|
| 189 |
+
if case_sensitive:
|
| 190 |
+
rare_set = frozenset(rare_tokens)
|
| 191 |
+
ref_tokens = tokenize(ref)
|
| 192 |
+
hyp_tokens = tokenize(hyp)
|
| 193 |
+
else:
|
| 194 |
+
rare_set = frozenset(t.lower() for t in rare_tokens)
|
| 195 |
+
ref_tokens = [t.lower() for t in tokenize(ref)]
|
| 196 |
+
hyp_tokens = [t.lower() for t in tokenize(hyp)]
|
| 197 |
+
|
| 198 |
+
# Multiplicité : on compte uniquement les rares présents dans la GT
|
| 199 |
+
ref_rare_counts: Counter[str] = Counter(
|
| 200 |
+
t for t in ref_tokens if t in rare_set
|
| 201 |
+
)
|
| 202 |
+
n_rare_in_ref = sum(ref_rare_counts.values())
|
| 203 |
+
if n_rare_in_ref == 0:
|
| 204 |
+
return {
|
| 205 |
+
"n_rare_tokens_in_reference": 0,
|
| 206 |
+
"n_rare_tokens_recalled": 0,
|
| 207 |
+
"recall": 0.0,
|
| 208 |
+
"missed_tokens": [],
|
| 209 |
+
}
|
| 210 |
+
|
| 211 |
+
# Bag-of-tokens dans hyp pour les tokens rares uniquement
|
| 212 |
+
hyp_rare_counts: Counter[str] = Counter(
|
| 213 |
+
t for t in hyp_tokens if t in rare_set
|
| 214 |
+
)
|
| 215 |
+
# Recall multiplicitaire : pour chaque token, min(ref_count, hyp_count)
|
| 216 |
+
n_recalled = 0
|
| 217 |
+
missed: list[str] = []
|
| 218 |
+
for token, ref_count in ref_rare_counts.items():
|
| 219 |
+
hyp_count = hyp_rare_counts.get(token, 0)
|
| 220 |
+
recalled = min(ref_count, hyp_count)
|
| 221 |
+
n_recalled += recalled
|
| 222 |
+
missed_count = ref_count - recalled
|
| 223 |
+
if missed_count > 0:
|
| 224 |
+
missed.extend([token] * missed_count)
|
| 225 |
+
|
| 226 |
+
return {
|
| 227 |
+
"n_rare_tokens_in_reference": n_rare_in_ref,
|
| 228 |
+
"n_rare_tokens_recalled": n_recalled,
|
| 229 |
+
"recall": n_recalled / n_rare_in_ref,
|
| 230 |
+
"missed_tokens": missed,
|
| 231 |
+
}
|
| 232 |
+
|
| 233 |
+
|
| 234 |
+
def rare_token_recall(
|
| 235 |
+
reference: Optional[str],
|
| 236 |
+
hypothesis: Optional[str],
|
| 237 |
+
rare_tokens: Iterable[str],
|
| 238 |
+
*,
|
| 239 |
+
case_sensitive: bool = False,
|
| 240 |
+
) -> float:
|
| 241 |
+
"""Raccourci : retourne uniquement le rappel ∈ [0, 1]."""
|
| 242 |
+
return compute_rare_token_recall(
|
| 243 |
+
reference, hypothesis, rare_tokens,
|
| 244 |
+
case_sensitive=case_sensitive,
|
| 245 |
+
)["recall"]
|
| 246 |
+
|
| 247 |
+
|
| 248 |
+
__all__ = [
|
| 249 |
+
"tokenize",
|
| 250 |
+
"frequency_distribution",
|
| 251 |
+
"extract_rare_tokens",
|
| 252 |
+
"compute_rare_token_recall",
|
| 253 |
+
"rare_token_recall",
|
| 254 |
+
]
|
|
@@ -0,0 +1,287 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""Projection de robustesse synthétique sur le corpus réel —
|
| 2 |
+
Sprint 81 (A.I.8).
|
| 3 |
+
|
| 4 |
+
Sprint 81 — A.I.8 du plan d'évolution 2026.
|
| 5 |
+
|
| 6 |
+
Pourquoi ce module
|
| 7 |
+
------------------
|
| 8 |
+
Le module ``picarones/core/robustness.py`` (Sprint 8) génère des
|
| 9 |
+
courbes CER vs niveau de dégradation **synthétique** (bruit, flou,
|
| 10 |
+
rotation, résolution). ``picarones/core/image_quality.py`` mesure
|
| 11 |
+
le bruit/flou/contraste **réels** des images du corpus. Ce
|
| 12 |
+
sprint **projette** les caractéristiques réelles sur les courbes
|
| 13 |
+
synthétiques pour estimer le **déficit attendu de CER** sur le
|
| 14 |
+
corpus dans son état actuel.
|
| 15 |
+
|
| 16 |
+
Lecture concrète
|
| 17 |
+
----------------
|
| 18 |
+
*« 30 % de vos documents ont un bruit équivalent à σ=15 où
|
| 19 |
+
Tesseract perd 8 points de CER — soit un déficit attendu global
|
| 20 |
+
de 2,4 points (30 % × 8 points). »*
|
| 21 |
+
|
| 22 |
+
Méthode
|
| 23 |
+
-------
|
| 24 |
+
1. Pour chaque document, on extrait la valeur de qualité réelle
|
| 25 |
+
(``noise_level``, ``blur_score``, ``contrast_score``…) depuis
|
| 26 |
+
``ImageQualityResult``.
|
| 27 |
+
2. Pour chaque type de dégradation, on interpole linéairement la
|
| 28 |
+
``DegradationCurve`` synthétique : CER attendu à ce niveau.
|
| 29 |
+
3. On agrège : CER moyen attendu, % docs au-dessus du seuil
|
| 30 |
+
critique de la courbe, déficit projeté = CER_attendu -
|
| 31 |
+
CER_baseline (niveau nul).
|
| 32 |
+
|
| 33 |
+
Sortie
|
| 34 |
+
------
|
| 35 |
+
``project_robustness_on_corpus(curves, image_qualities)`` retourne
|
| 36 |
+
``{engine_name: {degradation_type: {expected_cer_mean,
|
| 37 |
+
deficit_vs_baseline, n_docs_above_critical, n_docs}}}``.
|
| 38 |
+
|
| 39 |
+
Limites
|
| 40 |
+
-------
|
| 41 |
+
- Mapping ``image_quality → degradation level`` : on suppose que
|
| 42 |
+
``noise_level`` (ImageQualityResult) correspond à σ
|
| 43 |
+
(DegradationCurve), et idem pour ``blur_score`` ↔ rayon de
|
| 44 |
+
flou. Si un corpus expose ces valeurs avec une échelle
|
| 45 |
+
différente, le mapping est documenté et l'utilisateur peut
|
| 46 |
+
passer ``quality_to_level`` custom.
|
| 47 |
+
- Interpolation **linéaire** entre les points de la courbe. Au-
|
| 48 |
+
delà des bornes, on **clip** au point extrême (pas
|
| 49 |
+
d'extrapolation hasardeuse).
|
| 50 |
+
"""
|
| 51 |
+
|
| 52 |
+
from __future__ import annotations
|
| 53 |
+
|
| 54 |
+
import logging
|
| 55 |
+
import statistics
|
| 56 |
+
from typing import Callable, Iterable, Optional
|
| 57 |
+
|
| 58 |
+
logger = logging.getLogger(__name__)
|
| 59 |
+
|
| 60 |
+
|
| 61 |
+
# Mapping par défaut entre attributs ImageQualityResult et types
|
| 62 |
+
# de dégradation synthétique. L'utilisateur peut passer un dict
|
| 63 |
+
# custom pour modifier ce mapping.
|
| 64 |
+
_DEFAULT_QUALITY_FIELD: dict[str, str] = {
|
| 65 |
+
"noise": "noise_level", # σ
|
| 66 |
+
"blur": "blur_score", # Variance laplacienne (inverse)
|
| 67 |
+
"contrast": "contrast_score",
|
| 68 |
+
"rotation": "rotation_angle",
|
| 69 |
+
"resolution": "resolution_score", # peut être absent
|
| 70 |
+
}
|
| 71 |
+
|
| 72 |
+
|
| 73 |
+
def _interpolate_cer(
|
| 74 |
+
levels: list[float],
|
| 75 |
+
cer_values: list[Optional[float]],
|
| 76 |
+
target_level: float,
|
| 77 |
+
) -> Optional[float]:
|
| 78 |
+
"""Interpolation linéaire : retourne CER attendu à
|
| 79 |
+
``target_level``.
|
| 80 |
+
|
| 81 |
+
- Si ``target_level`` est en-dessous du minimum de levels,
|
| 82 |
+
retourne le CER au minimum (clip).
|
| 83 |
+
- Si au-dessus du maximum, retourne le CER au maximum.
|
| 84 |
+
- Sinon, interpolation linéaire entre les deux points
|
| 85 |
+
encadrants.
|
| 86 |
+
- Retourne ``None`` si aucun ``cer_value`` valide.
|
| 87 |
+
"""
|
| 88 |
+
if not levels:
|
| 89 |
+
return None
|
| 90 |
+
# Filtrer les paires (level, cer) où cer est None
|
| 91 |
+
pairs = [
|
| 92 |
+
(lvl, cer) for lvl, cer in zip(levels, cer_values)
|
| 93 |
+
if cer is not None
|
| 94 |
+
]
|
| 95 |
+
if not pairs:
|
| 96 |
+
return None
|
| 97 |
+
pairs.sort(key=lambda p: p[0])
|
| 98 |
+
# Clip
|
| 99 |
+
if target_level <= pairs[0][0]:
|
| 100 |
+
return pairs[0][1]
|
| 101 |
+
if target_level >= pairs[-1][0]:
|
| 102 |
+
return pairs[-1][1]
|
| 103 |
+
# Interpolation
|
| 104 |
+
for i in range(len(pairs) - 1):
|
| 105 |
+
lo_lvl, lo_cer = pairs[i]
|
| 106 |
+
hi_lvl, hi_cer = pairs[i + 1]
|
| 107 |
+
if lo_lvl <= target_level <= hi_lvl:
|
| 108 |
+
if hi_lvl == lo_lvl:
|
| 109 |
+
return lo_cer
|
| 110 |
+
ratio = (target_level - lo_lvl) / (hi_lvl - lo_lvl)
|
| 111 |
+
return lo_cer + (hi_cer - lo_cer) * ratio
|
| 112 |
+
return None # ne devrait pas arriver
|
| 113 |
+
|
| 114 |
+
|
| 115 |
+
def _extract_quality_value(
|
| 116 |
+
quality: dict, degradation_type: str,
|
| 117 |
+
custom_mapping: Optional[dict[str, str]] = None,
|
| 118 |
+
) -> Optional[float]:
|
| 119 |
+
"""Extrait la valeur de qualité pertinente pour un type de
|
| 120 |
+
dégradation depuis un ``ImageQualityResult.as_dict()``."""
|
| 121 |
+
mapping = custom_mapping or _DEFAULT_QUALITY_FIELD
|
| 122 |
+
field = mapping.get(degradation_type)
|
| 123 |
+
if field is None:
|
| 124 |
+
return None
|
| 125 |
+
value = quality.get(field)
|
| 126 |
+
if value is None:
|
| 127 |
+
return None
|
| 128 |
+
try:
|
| 129 |
+
return float(value)
|
| 130 |
+
except (TypeError, ValueError):
|
| 131 |
+
return None
|
| 132 |
+
|
| 133 |
+
|
| 134 |
+
def project_robustness_on_corpus(
|
| 135 |
+
curves: Iterable,
|
| 136 |
+
image_qualities: list[dict],
|
| 137 |
+
*,
|
| 138 |
+
quality_to_level: Optional[Callable[[dict, str], Optional[float]]] = None,
|
| 139 |
+
critical_threshold: Optional[float] = None,
|
| 140 |
+
) -> dict:
|
| 141 |
+
"""Projette les courbes de robustesse sur les qualités réelles.
|
| 142 |
+
|
| 143 |
+
Parameters
|
| 144 |
+
----------
|
| 145 |
+
curves:
|
| 146 |
+
Itérable de ``DegradationCurve`` (ou dicts compatibles
|
| 147 |
+
avec ``engine_name``, ``degradation_type``, ``levels``,
|
| 148 |
+
``cer_values``, ``critical_threshold_level``).
|
| 149 |
+
image_qualities:
|
| 150 |
+
Liste de dicts ``ImageQualityResult.as_dict()`` (un par
|
| 151 |
+
document). Si vide, retourne une projection vide.
|
| 152 |
+
quality_to_level:
|
| 153 |
+
Fonction custom ``(quality_dict, degradation_type) →
|
| 154 |
+
Optional[float]`` pour adapter le mapping qualité→niveau.
|
| 155 |
+
Par défaut, utilise ``_DEFAULT_QUALITY_FIELD``.
|
| 156 |
+
critical_threshold:
|
| 157 |
+
Override pour le seuil critique de CER (défaut : utilise
|
| 158 |
+
``DegradationCurve.cer_threshold``).
|
| 159 |
+
|
| 160 |
+
Returns
|
| 161 |
+
-------
|
| 162 |
+
dict
|
| 163 |
+
``{
|
| 164 |
+
engine_name: {
|
| 165 |
+
degradation_type: {
|
| 166 |
+
"n_docs": int,
|
| 167 |
+
"n_docs_with_data": int, # qualité disponible
|
| 168 |
+
"expected_cer_mean": float, # moyenne CER attendu
|
| 169 |
+
"expected_cer_median": float,
|
| 170 |
+
"baseline_cer": float, # CER à niveau min
|
| 171 |
+
"deficit_vs_baseline": float,
|
| 172 |
+
"n_docs_above_critical": int,
|
| 173 |
+
"critical_threshold_level": float | None,
|
| 174 |
+
"critical_threshold_cer": float,
|
| 175 |
+
},
|
| 176 |
+
},
|
| 177 |
+
}``
|
| 178 |
+
"""
|
| 179 |
+
extractor = quality_to_level or (
|
| 180 |
+
lambda q, dt: _extract_quality_value(q, dt)
|
| 181 |
+
)
|
| 182 |
+
out: dict[str, dict] = {}
|
| 183 |
+
|
| 184 |
+
for curve in curves:
|
| 185 |
+
# Accepter dict ou DegradationCurve
|
| 186 |
+
if hasattr(curve, "as_dict"):
|
| 187 |
+
data = curve.as_dict()
|
| 188 |
+
else:
|
| 189 |
+
data = curve
|
| 190 |
+
engine = data.get("engine_name")
|
| 191 |
+
deg_type = data.get("degradation_type")
|
| 192 |
+
levels = data.get("levels") or []
|
| 193 |
+
cer_values = data.get("cer_values") or []
|
| 194 |
+
crit_lvl = data.get("critical_threshold_level")
|
| 195 |
+
crit_cer = (
|
| 196 |
+
critical_threshold
|
| 197 |
+
if critical_threshold is not None
|
| 198 |
+
else data.get("cer_threshold", 0.20)
|
| 199 |
+
)
|
| 200 |
+
if not engine or not deg_type:
|
| 201 |
+
continue
|
| 202 |
+
|
| 203 |
+
per_doc_cer: list[float] = []
|
| 204 |
+
n_docs_with_data = 0
|
| 205 |
+
n_above_critical = 0
|
| 206 |
+
for quality in image_qualities:
|
| 207 |
+
level = extractor(quality, deg_type)
|
| 208 |
+
if level is None:
|
| 209 |
+
continue
|
| 210 |
+
n_docs_with_data += 1
|
| 211 |
+
cer = _interpolate_cer(levels, cer_values, level)
|
| 212 |
+
if cer is None:
|
| 213 |
+
continue
|
| 214 |
+
per_doc_cer.append(cer)
|
| 215 |
+
if cer > crit_cer:
|
| 216 |
+
n_above_critical += 1
|
| 217 |
+
|
| 218 |
+
if not per_doc_cer:
|
| 219 |
+
continue
|
| 220 |
+
|
| 221 |
+
# Baseline = CER au niveau minimum (sans dégradation)
|
| 222 |
+
baseline = _interpolate_cer(
|
| 223 |
+
levels, cer_values,
|
| 224 |
+
min(levels) if levels else 0.0,
|
| 225 |
+
)
|
| 226 |
+
expected_mean = statistics.fmean(per_doc_cer)
|
| 227 |
+
expected_median = statistics.median(per_doc_cer)
|
| 228 |
+
deficit = (
|
| 229 |
+
expected_mean - baseline
|
| 230 |
+
if baseline is not None else None
|
| 231 |
+
)
|
| 232 |
+
|
| 233 |
+
out.setdefault(engine, {})[deg_type] = {
|
| 234 |
+
"n_docs": len(image_qualities),
|
| 235 |
+
"n_docs_with_data": n_docs_with_data,
|
| 236 |
+
"expected_cer_mean": expected_mean,
|
| 237 |
+
"expected_cer_median": expected_median,
|
| 238 |
+
"baseline_cer": baseline,
|
| 239 |
+
"deficit_vs_baseline": deficit,
|
| 240 |
+
"n_docs_above_critical": n_above_critical,
|
| 241 |
+
"critical_threshold_level": crit_lvl,
|
| 242 |
+
"critical_threshold_cer": crit_cer,
|
| 243 |
+
}
|
| 244 |
+
return out
|
| 245 |
+
|
| 246 |
+
|
| 247 |
+
def aggregate_projection_per_engine(projection: dict) -> dict:
|
| 248 |
+
"""Pour chaque moteur, agrège le déficit projeté en sommant
|
| 249 |
+
sur tous les types de dégradation.
|
| 250 |
+
|
| 251 |
+
Lecture : *« déficit total attendu pour Tesseract = 5,2 points
|
| 252 |
+
de CER si on considère les 4 dégradations indépendamment »*.
|
| 253 |
+
|
| 254 |
+
Note : la sommation **suppose l'indépendance** des
|
| 255 |
+
dégradations, ce qui n'est pas strictement vrai mais reste
|
| 256 |
+
une approximation utile pour le diagnostic.
|
| 257 |
+
"""
|
| 258 |
+
out: dict[str, dict] = {}
|
| 259 |
+
for engine, per_type in projection.items():
|
| 260 |
+
total_deficit = 0.0
|
| 261 |
+
n_types_with_data = 0
|
| 262 |
+
max_deficit_type: Optional[tuple[str, float]] = None
|
| 263 |
+
for deg_type, stats in per_type.items():
|
| 264 |
+
deficit = stats.get("deficit_vs_baseline")
|
| 265 |
+
if deficit is None:
|
| 266 |
+
continue
|
| 267 |
+
total_deficit += deficit
|
| 268 |
+
n_types_with_data += 1
|
| 269 |
+
if max_deficit_type is None or deficit > max_deficit_type[1]:
|
| 270 |
+
max_deficit_type = (deg_type, deficit)
|
| 271 |
+
out[engine] = {
|
| 272 |
+
"total_expected_deficit": total_deficit,
|
| 273 |
+
"n_degradation_types": n_types_with_data,
|
| 274 |
+
"worst_degradation_type": (
|
| 275 |
+
max_deficit_type[0] if max_deficit_type else None
|
| 276 |
+
),
|
| 277 |
+
"worst_degradation_deficit": (
|
| 278 |
+
max_deficit_type[1] if max_deficit_type else None
|
| 279 |
+
),
|
| 280 |
+
}
|
| 281 |
+
return out
|
| 282 |
+
|
| 283 |
+
|
| 284 |
+
__all__ = [
|
| 285 |
+
"project_robustness_on_corpus",
|
| 286 |
+
"aggregate_projection_per_engine",
|
| 287 |
+
]
|
|
@@ -0,0 +1,161 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""Taxonomie comparative entre deux moteurs — Sprint 77 (A.I.4 chantier 3).
|
| 2 |
+
|
| 3 |
+
Sprint 77 — A.I.4 chantier 3 du plan d'évolution 2026 (clôture A.I.4).
|
| 4 |
+
|
| 5 |
+
Pourquoi ce module
|
| 6 |
+
------------------
|
| 7 |
+
Le détecteur narratif ``error_profile_outlier`` (Sprint 19) signale
|
| 8 |
+
qu'un moteur a un profil taxonomique éloigné de ses concurrents,
|
| 9 |
+
mais le rapport n'expose pas cette différence visuellement. Ce
|
| 10 |
+
sprint répond à *« deux moteurs ont le même CER global, mais lequel
|
| 11 |
+
fait des erreurs plus récupérables ? »*.
|
| 12 |
+
|
| 13 |
+
Lecture concrète
|
| 14 |
+
----------------
|
| 15 |
+
- Moteur A : 80 % d'erreurs ``case_error`` → toutes corrigeables
|
| 16 |
+
par un post-processing trivial (récupérables).
|
| 17 |
+
- Moteur B : 80 % d'erreurs ``lacuna`` (mots manquants) →
|
| 18 |
+
irrécupérables sans relire l'image.
|
| 19 |
+
|
| 20 |
+
À CER égal, A est massivement préférable pour un workflow
|
| 21 |
+
d'édition critique. Cette vue rend la différence visible.
|
| 22 |
+
|
| 23 |
+
Catégorisation des classes
|
| 24 |
+
--------------------------
|
| 25 |
+
On annote chaque classe d'erreur d'un degré de **récupérabilité**
|
| 26 |
+
(critère éditorial pragmatique, pas verdict imposé) :
|
| 27 |
+
|
| 28 |
+
- ``recoverable`` : récupérable par post-processing trivial
|
| 29 |
+
(case_error, ligature_error, abbreviation_error)
|
| 30 |
+
- ``difficult`` : récupérable au prix d'un effort
|
| 31 |
+
(diacritic_error, visual_confusion, hapax)
|
| 32 |
+
- ``irrecoverable`` : impossible à corriger sans l'image
|
| 33 |
+
(lacuna, oov_character, segmentation_error)
|
| 34 |
+
|
| 35 |
+
L'utilisateur consulte ces catégories comme un guide, pas un
|
| 36 |
+
verdict — c'est lui qui juge selon ses besoins éditoriaux.
|
| 37 |
+
"""
|
| 38 |
+
|
| 39 |
+
from __future__ import annotations
|
| 40 |
+
|
| 41 |
+
import logging
|
| 42 |
+
from typing import Optional
|
| 43 |
+
|
| 44 |
+
logger = logging.getLogger(__name__)
|
| 45 |
+
|
| 46 |
+
|
| 47 |
+
# Classification éditoriale. Documentée dans la docstring.
|
| 48 |
+
RECOVERABILITY: dict[str, str] = {
|
| 49 |
+
"case_error": "recoverable",
|
| 50 |
+
"ligature_error": "recoverable",
|
| 51 |
+
"abbreviation_error": "recoverable",
|
| 52 |
+
"diacritic_error": "difficult",
|
| 53 |
+
"visual_confusion": "difficult",
|
| 54 |
+
"hapax": "difficult",
|
| 55 |
+
"lacuna": "irrecoverable",
|
| 56 |
+
"oov_character": "irrecoverable",
|
| 57 |
+
"segmentation_error": "irrecoverable",
|
| 58 |
+
}
|
| 59 |
+
|
| 60 |
+
|
| 61 |
+
def _normalize_counts(counts: dict[str, int]) -> dict[str, float]:
|
| 62 |
+
"""Convertit un dict de comptes en proportions [0, 1]."""
|
| 63 |
+
total = sum(counts.values())
|
| 64 |
+
if total <= 0:
|
| 65 |
+
return {k: 0.0 for k in counts}
|
| 66 |
+
return {k: v / total for k, v in counts.items()}
|
| 67 |
+
|
| 68 |
+
|
| 69 |
+
def compare_taxonomies(
|
| 70 |
+
engine_a_name: str,
|
| 71 |
+
engine_a_counts: dict[str, int],
|
| 72 |
+
engine_b_name: str,
|
| 73 |
+
engine_b_counts: dict[str, int],
|
| 74 |
+
) -> Optional[dict]:
|
| 75 |
+
"""Compare deux profils taxonomiques.
|
| 76 |
+
|
| 77 |
+
Parameters
|
| 78 |
+
----------
|
| 79 |
+
engine_a_name, engine_b_name:
|
| 80 |
+
Noms d'identification des moteurs (utilisés dans le rendu).
|
| 81 |
+
engine_a_counts, engine_b_counts:
|
| 82 |
+
Maps ``{class_name: count}`` produites par
|
| 83 |
+
``aggregate_taxonomy``.
|
| 84 |
+
|
| 85 |
+
Returns
|
| 86 |
+
-------
|
| 87 |
+
Optional[dict]
|
| 88 |
+
``{
|
| 89 |
+
"engine_a": str, "engine_b": str,
|
| 90 |
+
"total_a": int, "total_b": int,
|
| 91 |
+
"classes": list[str], # classes apparaissant chez A ou B
|
| 92 |
+
"proportions_a": dict[str, float],
|
| 93 |
+
"proportions_b": dict[str, float],
|
| 94 |
+
"deltas": dict[str, float], # prop_b - prop_a (signé)
|
| 95 |
+
"recoverability": dict[str, str], # mapping class → niveau
|
| 96 |
+
"totals_by_recoverability": {
|
| 97 |
+
"recoverable": {"a": float, "b": float},
|
| 98 |
+
"difficult": {"a": float, "b": float},
|
| 99 |
+
"irrecoverable": {"a": float, "b": float},
|
| 100 |
+
},
|
| 101 |
+
}``
|
| 102 |
+
Ou ``None`` si les deux moteurs ont 0 erreur chacun.
|
| 103 |
+
"""
|
| 104 |
+
if engine_a_name == engine_b_name:
|
| 105 |
+
# On accepte des comparaisons même si les noms sont
|
| 106 |
+
# identiques (cas tests), mais on émet un warning.
|
| 107 |
+
logger.warning(
|
| 108 |
+
"[taxonomy_comparison] engine_a et engine_b ont le même nom : %s",
|
| 109 |
+
engine_a_name,
|
| 110 |
+
)
|
| 111 |
+
|
| 112 |
+
total_a = sum(engine_a_counts.values()) if engine_a_counts else 0
|
| 113 |
+
total_b = sum(engine_b_counts.values()) if engine_b_counts else 0
|
| 114 |
+
if total_a == 0 and total_b == 0:
|
| 115 |
+
return None
|
| 116 |
+
|
| 117 |
+
classes = sorted(set(engine_a_counts) | set(engine_b_counts))
|
| 118 |
+
if not classes:
|
| 119 |
+
return None
|
| 120 |
+
|
| 121 |
+
prop_a = _normalize_counts(
|
| 122 |
+
{c: engine_a_counts.get(c, 0) for c in classes},
|
| 123 |
+
)
|
| 124 |
+
prop_b = _normalize_counts(
|
| 125 |
+
{c: engine_b_counts.get(c, 0) for c in classes},
|
| 126 |
+
)
|
| 127 |
+
deltas = {c: prop_b[c] - prop_a[c] for c in classes}
|
| 128 |
+
|
| 129 |
+
# Agrégat par récupérabilité (utile pour la lecture rapide)
|
| 130 |
+
totals_recov: dict[str, dict[str, float]] = {
|
| 131 |
+
"recoverable": {"a": 0.0, "b": 0.0},
|
| 132 |
+
"difficult": {"a": 0.0, "b": 0.0},
|
| 133 |
+
"irrecoverable": {"a": 0.0, "b": 0.0},
|
| 134 |
+
}
|
| 135 |
+
for cls in classes:
|
| 136 |
+
level = RECOVERABILITY.get(cls, "difficult")
|
| 137 |
+
if level not in totals_recov:
|
| 138 |
+
level = "difficult"
|
| 139 |
+
totals_recov[level]["a"] += prop_a[cls]
|
| 140 |
+
totals_recov[level]["b"] += prop_b[cls]
|
| 141 |
+
|
| 142 |
+
return {
|
| 143 |
+
"engine_a": engine_a_name,
|
| 144 |
+
"engine_b": engine_b_name,
|
| 145 |
+
"total_a": total_a,
|
| 146 |
+
"total_b": total_b,
|
| 147 |
+
"classes": classes,
|
| 148 |
+
"proportions_a": prop_a,
|
| 149 |
+
"proportions_b": prop_b,
|
| 150 |
+
"deltas": deltas,
|
| 151 |
+
"recoverability": {
|
| 152 |
+
cls: RECOVERABILITY.get(cls, "difficult") for cls in classes
|
| 153 |
+
},
|
| 154 |
+
"totals_by_recoverability": totals_recov,
|
| 155 |
+
}
|
| 156 |
+
|
| 157 |
+
|
| 158 |
+
__all__ = [
|
| 159 |
+
"RECOVERABILITY",
|
| 160 |
+
"compare_taxonomies",
|
| 161 |
+
]
|
|
@@ -0,0 +1,150 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""Co-occurrence des classes taxonomiques d'erreur — Sprint 75 (A.I.4 chantier 1).
|
| 2 |
+
|
| 3 |
+
Sprint 75 — A.I.4 chantier 1 du plan d'évolution 2026.
|
| 4 |
+
|
| 5 |
+
Pourquoi ce module
|
| 6 |
+
------------------
|
| 7 |
+
La taxonomie d'erreurs (10 classes, ``picarones/core/taxonomy.py``)
|
| 8 |
+
est calculée par document mais le rapport actuel ne montre qu'un
|
| 9 |
+
seul histogramme global. La roadmap A.I.4 demande trois lectures
|
| 10 |
+
plus fines de cette taxonomie ; ce sprint livre la première :
|
| 11 |
+
**co-occurrence**.
|
| 12 |
+
|
| 13 |
+
Si ``ligature_error`` et ``abbreviation_error`` co-occurrent
|
| 14 |
+
toujours dans les mêmes documents, c'est un signal de scribe
|
| 15 |
+
particulier — utile pour stratifier le corpus *a posteriori*
|
| 16 |
+
(qu'est-ce qui caractérise les documents difficiles ?).
|
| 17 |
+
|
| 18 |
+
Mesure
|
| 19 |
+
------
|
| 20 |
+
Indice de **Jaccard** entre paires de classes au niveau
|
| 21 |
+
**document** :
|
| 22 |
+
|
| 23 |
+
.. math::
|
| 24 |
+
|
| 25 |
+
J(A, B) = \\frac{|D_A \\cap D_B|}{|D_A \\cup D_B|}
|
| 26 |
+
|
| 27 |
+
où ``D_X`` est l'ensemble des documents qui contiennent au moins
|
| 28 |
+
une erreur de classe ``X``.
|
| 29 |
+
|
| 30 |
+
- ``J(A, B) = 1`` : A et B apparaissent toujours ensemble (et
|
| 31 |
+
jamais l'un sans l'autre).
|
| 32 |
+
- ``J(A, B) = 0`` : A et B ne co-occurrent jamais.
|
| 33 |
+
- ``J(A, B) = 0,5`` : A et B partagent la moitié de leur union.
|
| 34 |
+
|
| 35 |
+
Stratégie de découpage
|
| 36 |
+
----------------------
|
| 37 |
+
Couche de calcul pure d'abord (pattern Sprint 35, 38, 52-58).
|
| 38 |
+
Le rendu HTML (heatmap SVG) est livré dans le même sprint pour
|
| 39 |
+
boucler la dimension ; les chantiers 2 et 3 d'A.I.4 (évolution
|
| 40 |
+
intra-document, taxonomie comparative) suivent.
|
| 41 |
+
"""
|
| 42 |
+
|
| 43 |
+
from __future__ import annotations
|
| 44 |
+
|
| 45 |
+
import logging
|
| 46 |
+
from typing import Iterable, Optional
|
| 47 |
+
|
| 48 |
+
logger = logging.getLogger(__name__)
|
| 49 |
+
|
| 50 |
+
|
| 51 |
+
def compute_taxonomy_cooccurrence(
|
| 52 |
+
per_doc_classes: Iterable[Iterable[str]],
|
| 53 |
+
*,
|
| 54 |
+
min_doc_count: int = 1,
|
| 55 |
+
top_n_pairs: int = 10,
|
| 56 |
+
) -> Optional[dict]:
|
| 57 |
+
"""Calcule la matrice de Jaccard inter-classes au niveau document.
|
| 58 |
+
|
| 59 |
+
Parameters
|
| 60 |
+
----------
|
| 61 |
+
per_doc_classes:
|
| 62 |
+
Itérable de docs, chaque doc étant un itérable de noms de
|
| 63 |
+
classes taxonomiques détectées (set, list, tuple…).
|
| 64 |
+
Les doublons à l'intérieur d'un doc sont ignorés (présence
|
| 65 |
+
binaire au niveau doc).
|
| 66 |
+
min_doc_count:
|
| 67 |
+
Nombre minimum de documents dans lesquels une classe doit
|
| 68 |
+
apparaître pour figurer dans la matrice (défaut 1).
|
| 69 |
+
Permet d'écarter les classes anecdotiques.
|
| 70 |
+
top_n_pairs:
|
| 71 |
+
Nombre de paires retournées dans ``top_pairs`` (triées par
|
| 72 |
+
Jaccard décroissant). Défaut 10.
|
| 73 |
+
|
| 74 |
+
Returns
|
| 75 |
+
-------
|
| 76 |
+
Optional[dict]
|
| 77 |
+
``{
|
| 78 |
+
"classes": list[str], # triées alpha
|
| 79 |
+
"n_documents": int,
|
| 80 |
+
"doc_count": dict[str, int], # nb docs par classe
|
| 81 |
+
"cooccurrence_matrix": dict[str, dict[str, float]],
|
| 82 |
+
# symétrique, diagonale = 1.0 (sauf classe vide)
|
| 83 |
+
"top_pairs": list[tuple[str, str, float]],
|
| 84 |
+
# paires les plus co-occurrentes (Jaccard désc.)
|
| 85 |
+
}``
|
| 86 |
+
ou ``None`` si aucune classe ne dépasse ``min_doc_count``
|
| 87 |
+
ou si l'itérable est vide.
|
| 88 |
+
"""
|
| 89 |
+
docs: list[frozenset[str]] = []
|
| 90 |
+
for doc_classes in per_doc_classes:
|
| 91 |
+
if doc_classes is None:
|
| 92 |
+
continue
|
| 93 |
+
cleaned = frozenset(c for c in doc_classes if c)
|
| 94 |
+
docs.append(cleaned)
|
| 95 |
+
if not docs:
|
| 96 |
+
return None
|
| 97 |
+
|
| 98 |
+
# Comptage par classe
|
| 99 |
+
doc_count: dict[str, int] = {}
|
| 100 |
+
for doc in docs:
|
| 101 |
+
for cls in doc:
|
| 102 |
+
doc_count[cls] = doc_count.get(cls, 0) + 1
|
| 103 |
+
|
| 104 |
+
# Filtrage min_doc_count
|
| 105 |
+
classes = sorted(
|
| 106 |
+
c for c, n in doc_count.items() if n >= min_doc_count
|
| 107 |
+
)
|
| 108 |
+
if not classes:
|
| 109 |
+
return None
|
| 110 |
+
|
| 111 |
+
# Matrice de Jaccard
|
| 112 |
+
matrix: dict[str, dict[str, float]] = {
|
| 113 |
+
c: {} for c in classes
|
| 114 |
+
}
|
| 115 |
+
for i, ca in enumerate(classes):
|
| 116 |
+
docs_a = {idx for idx, d in enumerate(docs) if ca in d}
|
| 117 |
+
for cb in classes[i:]:
|
| 118 |
+
if ca == cb:
|
| 119 |
+
# Diagonale : Jaccard(X, X) = 1 si X est présent
|
| 120 |
+
matrix[ca][cb] = 1.0 if docs_a else 0.0
|
| 121 |
+
continue
|
| 122 |
+
docs_b = {idx for idx, d in enumerate(docs) if cb in d}
|
| 123 |
+
inter = len(docs_a & docs_b)
|
| 124 |
+
union = len(docs_a | docs_b)
|
| 125 |
+
jaccard = inter / union if union > 0 else 0.0
|
| 126 |
+
matrix[ca][cb] = jaccard
|
| 127 |
+
matrix[cb][ca] = jaccard # symétrique
|
| 128 |
+
|
| 129 |
+
# Top paires (hors diagonale)
|
| 130 |
+
pairs: list[tuple[str, str, float]] = []
|
| 131 |
+
for i, ca in enumerate(classes):
|
| 132 |
+
for cb in classes[i + 1:]:
|
| 133 |
+
j = matrix[ca][cb]
|
| 134 |
+
if j > 0:
|
| 135 |
+
pairs.append((ca, cb, j))
|
| 136 |
+
pairs.sort(key=lambda p: (-p[2], p[0], p[1]))
|
| 137 |
+
top_pairs = pairs[:top_n_pairs]
|
| 138 |
+
|
| 139 |
+
return {
|
| 140 |
+
"classes": classes,
|
| 141 |
+
"n_documents": len(docs),
|
| 142 |
+
"doc_count": doc_count,
|
| 143 |
+
"cooccurrence_matrix": matrix,
|
| 144 |
+
"top_pairs": top_pairs,
|
| 145 |
+
}
|
| 146 |
+
|
| 147 |
+
|
| 148 |
+
__all__ = [
|
| 149 |
+
"compute_taxonomy_cooccurrence",
|
| 150 |
+
]
|
|
@@ -0,0 +1,165 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""Throughput effectif (Sprint 91 — A.II.6).
|
| 2 |
+
|
| 3 |
+
Sprint 91 — A.II.6 du plan d'évolution 2026.
|
| 4 |
+
|
| 5 |
+
Pourquoi ce module
|
| 6 |
+
------------------
|
| 7 |
+
Le throughput brut (pages/heure d'OCR pur) ment quand un moteur
|
| 8 |
+
est rapide mais imprécis : la correction humaine *post hoc*
|
| 9 |
+
absorbe le gain. La **vraie** vitesse opérationnelle inclut
|
| 10 |
+
le temps de correction. Cette métrique discrimine fortement
|
| 11 |
+
entre un cloud rapide à 30 % de timeouts/erreurs et un local
|
| 12 |
+
lent à 100 % de fiabilité.
|
| 13 |
+
|
| 14 |
+
Formule
|
| 15 |
+
-------
|
| 16 |
+
.. code::
|
| 17 |
+
|
| 18 |
+
pages_par_heure_utilisable =
|
| 19 |
+
pages_traitées / (durée_totale + temps_correction_humaine)
|
| 20 |
+
|
| 21 |
+
Le temps de correction est estimé linéairement :
|
| 22 |
+
``temps_par_erreur × nombre_d_erreurs``. Le défaut
|
| 23 |
+
``time_per_error_seconds=5.0`` correspond aux études HTR-United
|
| 24 |
+
(saisie manuelle d'une correction de mot par un opérateur
|
| 25 |
+
formé : ≈ 5 s par erreur). L'utilisateur peut le surcharger
|
| 26 |
+
pour son institution.
|
| 27 |
+
|
| 28 |
+
Sortie
|
| 29 |
+
------
|
| 30 |
+
``compute_effective_throughput(n_pages, duration_seconds,
|
| 31 |
+
n_errors, time_per_error_seconds=5.0)`` retourne ``{n_pages,
|
| 32 |
+
duration_seconds, n_errors, time_per_error_seconds,
|
| 33 |
+
correction_time_seconds, total_seconds, pages_per_hour_raw,
|
| 34 |
+
pages_per_hour_effective, drag_ratio}``.
|
| 35 |
+
|
| 36 |
+
``aggregate_effective_throughput(per_engine_data)`` agrège par
|
| 37 |
+
moteur sur l'ensemble du corpus.
|
| 38 |
+
"""
|
| 39 |
+
|
| 40 |
+
from __future__ import annotations
|
| 41 |
+
|
| 42 |
+
import logging
|
| 43 |
+
from typing import Iterable, Optional
|
| 44 |
+
|
| 45 |
+
logger = logging.getLogger(__name__)
|
| 46 |
+
|
| 47 |
+
|
| 48 |
+
_DEFAULT_TIME_PER_ERROR_SECONDS = 5.0
|
| 49 |
+
|
| 50 |
+
|
| 51 |
+
def compute_effective_throughput(
|
| 52 |
+
n_pages: int,
|
| 53 |
+
duration_seconds: float,
|
| 54 |
+
n_errors: int,
|
| 55 |
+
*,
|
| 56 |
+
time_per_error_seconds: float = _DEFAULT_TIME_PER_ERROR_SECONDS,
|
| 57 |
+
) -> Optional[dict]:
|
| 58 |
+
"""Throughput effectif (pages/heure utilisables).
|
| 59 |
+
|
| 60 |
+
Parameters
|
| 61 |
+
----------
|
| 62 |
+
n_pages:
|
| 63 |
+
Nombre de pages traitées.
|
| 64 |
+
duration_seconds:
|
| 65 |
+
Durée totale de l'OCR (somme des durées par doc).
|
| 66 |
+
n_errors:
|
| 67 |
+
Nombre d'erreurs (au niveau mot, typiquement
|
| 68 |
+
``WER × n_words_total``).
|
| 69 |
+
time_per_error_seconds:
|
| 70 |
+
Temps moyen de correction humaine par erreur. Défaut
|
| 71 |
+
5 s (HTR-United). Doit être ≥ 0.
|
| 72 |
+
|
| 73 |
+
Returns
|
| 74 |
+
-------
|
| 75 |
+
dict | None
|
| 76 |
+
``None`` si ``n_pages == 0`` ou ``total_seconds == 0``
|
| 77 |
+
(pas de division par zéro).
|
| 78 |
+
"""
|
| 79 |
+
if n_pages <= 0:
|
| 80 |
+
return None
|
| 81 |
+
if duration_seconds < 0 or n_errors < 0 or time_per_error_seconds < 0:
|
| 82 |
+
raise ValueError(
|
| 83 |
+
"duration_seconds, n_errors et time_per_error_seconds "
|
| 84 |
+
"doivent être ≥ 0",
|
| 85 |
+
)
|
| 86 |
+
correction_seconds = float(n_errors) * float(time_per_error_seconds)
|
| 87 |
+
total_seconds = float(duration_seconds) + correction_seconds
|
| 88 |
+
if total_seconds <= 0:
|
| 89 |
+
# Aucun temps écoulé : impossible de définir un throughput
|
| 90 |
+
return None
|
| 91 |
+
pages_per_hour_raw = (
|
| 92 |
+
n_pages / duration_seconds * 3600.0
|
| 93 |
+
if duration_seconds > 0 else None
|
| 94 |
+
)
|
| 95 |
+
pages_per_hour_effective = n_pages / total_seconds * 3600.0
|
| 96 |
+
drag_ratio = (
|
| 97 |
+
correction_seconds / total_seconds if total_seconds > 0 else 0.0
|
| 98 |
+
)
|
| 99 |
+
return {
|
| 100 |
+
"n_pages": int(n_pages),
|
| 101 |
+
"duration_seconds": float(duration_seconds),
|
| 102 |
+
"n_errors": int(n_errors),
|
| 103 |
+
"time_per_error_seconds": float(time_per_error_seconds),
|
| 104 |
+
"correction_time_seconds": correction_seconds,
|
| 105 |
+
"total_seconds": total_seconds,
|
| 106 |
+
"pages_per_hour_raw": pages_per_hour_raw,
|
| 107 |
+
"pages_per_hour_effective": pages_per_hour_effective,
|
| 108 |
+
"drag_ratio": drag_ratio,
|
| 109 |
+
}
|
| 110 |
+
|
| 111 |
+
|
| 112 |
+
def aggregate_effective_throughput(
|
| 113 |
+
per_engine: Iterable[dict],
|
| 114 |
+
*,
|
| 115 |
+
time_per_error_seconds: float = _DEFAULT_TIME_PER_ERROR_SECONDS,
|
| 116 |
+
) -> Optional[dict]:
|
| 117 |
+
"""Agrège le throughput effectif par moteur.
|
| 118 |
+
|
| 119 |
+
Parameters
|
| 120 |
+
----------
|
| 121 |
+
per_engine:
|
| 122 |
+
Itérable de dicts ``{engine_name, n_pages,
|
| 123 |
+
duration_seconds, n_errors}``.
|
| 124 |
+
|
| 125 |
+
Returns
|
| 126 |
+
-------
|
| 127 |
+
dict | None
|
| 128 |
+
``{
|
| 129 |
+
"engines": [
|
| 130 |
+
{"engine_name", ..., compute_effective_throughput
|
| 131 |
+
fields},
|
| 132 |
+
...
|
| 133 |
+
],
|
| 134 |
+
"time_per_error_seconds": float,
|
| 135 |
+
}`` ou ``None`` si aucun moteur exploitable.
|
| 136 |
+
"""
|
| 137 |
+
rows: list[dict] = []
|
| 138 |
+
for entry in per_engine:
|
| 139 |
+
if not isinstance(entry, dict):
|
| 140 |
+
continue
|
| 141 |
+
name = entry.get("engine_name") or entry.get("engine")
|
| 142 |
+
if not name:
|
| 143 |
+
continue
|
| 144 |
+
result = compute_effective_throughput(
|
| 145 |
+
int(entry.get("n_pages") or 0),
|
| 146 |
+
float(entry.get("duration_seconds") or 0.0),
|
| 147 |
+
int(entry.get("n_errors") or 0),
|
| 148 |
+
time_per_error_seconds=time_per_error_seconds,
|
| 149 |
+
)
|
| 150 |
+
if result is None:
|
| 151 |
+
continue
|
| 152 |
+
result["engine_name"] = str(name)
|
| 153 |
+
rows.append(result)
|
| 154 |
+
if not rows:
|
| 155 |
+
return None
|
| 156 |
+
return {
|
| 157 |
+
"engines": rows,
|
| 158 |
+
"time_per_error_seconds": float(time_per_error_seconds),
|
| 159 |
+
}
|
| 160 |
+
|
| 161 |
+
|
| 162 |
+
__all__ = [
|
| 163 |
+
"compute_effective_throughput",
|
| 164 |
+
"aggregate_effective_throughput",
|
| 165 |
+
]
|
|
@@ -0,0 +1,199 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""Extraction transversale des « Worst lines » du corpus — Sprint 72.
|
| 2 |
+
|
| 3 |
+
Sprint 72 — A.I.1 chantier 1 du plan d'évolution 2026.
|
| 4 |
+
|
| 5 |
+
Pourquoi ce module
|
| 6 |
+
------------------
|
| 7 |
+
Le percentile p95 du CER ligne (calculé par ``line_metrics.py``,
|
| 8 |
+
Sprint 10) est un nombre abstrait : *« 5 % de mes lignes ont un
|
| 9 |
+
CER > 0,42 »*. Le chercheur veut **voir** ces lignes : leur
|
| 10 |
+
texte, leur diff, leur document parent, pour comprendre ce qui
|
| 11 |
+
casse.
|
| 12 |
+
|
| 13 |
+
Ce module fournit la requête transversale qui collecte, depuis un
|
| 14 |
+
``BenchmarkResult``, les **N lignes les plus mal transcrites de
|
| 15 |
+
tout le corpus**, classées par CER ligne. Filtrable par moteur
|
| 16 |
+
et par strate.
|
| 17 |
+
|
| 18 |
+
Limite documentée
|
| 19 |
+
-----------------
|
| 20 |
+
``DocumentResult.line_metrics`` ne stocke que les CER par ligne,
|
| 21 |
+
**pas le texte des lignes**. Pour récupérer les textes GT/hyp
|
| 22 |
+
on resplitte ``ground_truth`` et ``hypothesis`` du
|
| 23 |
+
``DocumentResult`` à l'index de la ligne. Cette logique
|
| 24 |
+
**suppose un BenchmarkResult non-compacté** — après ``compact()``
|
| 25 |
+
les textes sont tronqués à 200 caractères et les lignes au-delà
|
| 26 |
+
de cette troncature ne sont plus accessibles. En pratique on
|
| 27 |
+
extrait les worst lines **avant** la sérialisation/compactage.
|
| 28 |
+
"""
|
| 29 |
+
|
| 30 |
+
from __future__ import annotations
|
| 31 |
+
|
| 32 |
+
import logging
|
| 33 |
+
from dataclasses import dataclass
|
| 34 |
+
from typing import Optional
|
| 35 |
+
|
| 36 |
+
logger = logging.getLogger(__name__)
|
| 37 |
+
|
| 38 |
+
|
| 39 |
+
@dataclass
|
| 40 |
+
class WorstLineEntry:
|
| 41 |
+
"""Une ligne du corpus identifiée comme mal transcrite.
|
| 42 |
+
|
| 43 |
+
Champs
|
| 44 |
+
------
|
| 45 |
+
rank:
|
| 46 |
+
Position dans le classement (1-based, 1 = pire CER).
|
| 47 |
+
cer:
|
| 48 |
+
CER de la ligne ∈ [0, 1].
|
| 49 |
+
engine_name:
|
| 50 |
+
Nom du moteur ayant produit cette hypothèse.
|
| 51 |
+
doc_id:
|
| 52 |
+
Identifiant du document parent.
|
| 53 |
+
line_index:
|
| 54 |
+
Index 0-based de la ligne dans le document GT.
|
| 55 |
+
gt_line:
|
| 56 |
+
Texte de la ligne dans la GT.
|
| 57 |
+
hyp_line:
|
| 58 |
+
Texte correspondant dans l'hypothèse (peut être ``""``
|
| 59 |
+
si l'OCR a sauté la ligne).
|
| 60 |
+
script_type:
|
| 61 |
+
Strate du document si disponible (``script_type``
|
| 62 |
+
capturé par le runner pour la stratification A.III).
|
| 63 |
+
"""
|
| 64 |
+
|
| 65 |
+
rank: int
|
| 66 |
+
cer: float
|
| 67 |
+
engine_name: str
|
| 68 |
+
doc_id: str
|
| 69 |
+
line_index: int
|
| 70 |
+
gt_line: str
|
| 71 |
+
hyp_line: str
|
| 72 |
+
script_type: Optional[str] = None
|
| 73 |
+
|
| 74 |
+
|
| 75 |
+
def _split_lines(text: Optional[str]) -> list[str]:
|
| 76 |
+
"""Splitte un texte en lignes (cohérent avec ``line_metrics``).
|
| 77 |
+
|
| 78 |
+
Supporte les fins de ligne ``\\n``, ``\\r\\n``, ``\\r``. Les
|
| 79 |
+
lignes vides sont préservées. Retourne une liste vide si le
|
| 80 |
+
texte est None ou vide.
|
| 81 |
+
"""
|
| 82 |
+
if not text:
|
| 83 |
+
return []
|
| 84 |
+
# ``splitlines`` gère \r\n et \r correctement
|
| 85 |
+
return text.splitlines()
|
| 86 |
+
|
| 87 |
+
|
| 88 |
+
def _line_at(text: Optional[str], index: int) -> str:
|
| 89 |
+
"""Retourne la ligne à l'index demandé, ou ``""`` si l'index
|
| 90 |
+
est hors borne (cas où l'OCR a moins de lignes que la GT)."""
|
| 91 |
+
lines = _split_lines(text)
|
| 92 |
+
if 0 <= index < len(lines):
|
| 93 |
+
return lines[index]
|
| 94 |
+
return ""
|
| 95 |
+
|
| 96 |
+
|
| 97 |
+
def extract_worst_lines(
|
| 98 |
+
benchmark,
|
| 99 |
+
*,
|
| 100 |
+
top_n: int = 20,
|
| 101 |
+
engine_filter: Optional[str] = None,
|
| 102 |
+
script_type_filter: Optional[str] = None,
|
| 103 |
+
) -> list[WorstLineEntry]:
|
| 104 |
+
"""Extrait les ``top_n`` lignes les plus mal transcrites du
|
| 105 |
+
corpus, transversalement à tous les moteurs et documents.
|
| 106 |
+
|
| 107 |
+
Parameters
|
| 108 |
+
----------
|
| 109 |
+
benchmark:
|
| 110 |
+
``BenchmarkResult`` non-compacté (cf. limite ci-dessus).
|
| 111 |
+
L'objet doit exposer ``engine_reports`` (liste de
|
| 112 |
+
``EngineReport``) et optionnellement ``doc_strata``
|
| 113 |
+
(map ``{doc_id: script_type}``, Sprint 45).
|
| 114 |
+
top_n:
|
| 115 |
+
Nombre de lignes à retourner. Défaut : 20.
|
| 116 |
+
engine_filter:
|
| 117 |
+
Si fourni, n'inclut que les lignes produites par ce moteur
|
| 118 |
+
(match exact sur ``engine_name``).
|
| 119 |
+
script_type_filter:
|
| 120 |
+
Si fourni, n'inclut que les lignes des documents de cette
|
| 121 |
+
strate (nécessite ``benchmark.doc_strata``).
|
| 122 |
+
|
| 123 |
+
Returns
|
| 124 |
+
-------
|
| 125 |
+
list[WorstLineEntry]
|
| 126 |
+
Liste triée par CER décroissant (pire en premier),
|
| 127 |
+
rang 1-based attribué après tri. Vide si aucune ligne
|
| 128 |
+
exploitable.
|
| 129 |
+
"""
|
| 130 |
+
if top_n <= 0:
|
| 131 |
+
return []
|
| 132 |
+
|
| 133 |
+
doc_strata = getattr(benchmark, "doc_strata", None) or {}
|
| 134 |
+
candidates: list[tuple[float, str, str, int, str, str, Optional[str]]] = []
|
| 135 |
+
|
| 136 |
+
for engine_report in getattr(benchmark, "engine_reports", []):
|
| 137 |
+
engine_name = engine_report.engine_name
|
| 138 |
+
if engine_filter is not None and engine_name != engine_filter:
|
| 139 |
+
continue
|
| 140 |
+
for dr in engine_report.document_results:
|
| 141 |
+
line_metrics = getattr(dr, "line_metrics", None)
|
| 142 |
+
if not line_metrics:
|
| 143 |
+
continue
|
| 144 |
+
cer_per_line = line_metrics.get("cer_per_line") if isinstance(
|
| 145 |
+
line_metrics, dict,
|
| 146 |
+
) else getattr(line_metrics, "cer_per_line", None)
|
| 147 |
+
if not cer_per_line:
|
| 148 |
+
continue
|
| 149 |
+
doc_id = dr.doc_id
|
| 150 |
+
doc_strata_value = doc_strata.get(doc_id)
|
| 151 |
+
if (
|
| 152 |
+
script_type_filter is not None
|
| 153 |
+
and doc_strata_value != script_type_filter
|
| 154 |
+
):
|
| 155 |
+
continue
|
| 156 |
+
for idx, cer in enumerate(cer_per_line):
|
| 157 |
+
if cer <= 0.0:
|
| 158 |
+
continue
|
| 159 |
+
gt_line = _line_at(dr.ground_truth, idx)
|
| 160 |
+
hyp_line = _line_at(dr.hypothesis, idx)
|
| 161 |
+
if not gt_line and not hyp_line:
|
| 162 |
+
continue
|
| 163 |
+
candidates.append((
|
| 164 |
+
float(cer), engine_name, doc_id, idx,
|
| 165 |
+
gt_line, hyp_line, doc_strata_value,
|
| 166 |
+
))
|
| 167 |
+
|
| 168 |
+
if not candidates:
|
| 169 |
+
return []
|
| 170 |
+
|
| 171 |
+
# Tri par CER décroissant ; en cas d'égalité, ordre stable
|
| 172 |
+
# (engine, doc_id, line_index) pour reproductibilité.
|
| 173 |
+
candidates.sort(
|
| 174 |
+
key=lambda c: (-c[0], c[1], c[2], c[3]),
|
| 175 |
+
)
|
| 176 |
+
selected = candidates[:top_n]
|
| 177 |
+
|
| 178 |
+
return [
|
| 179 |
+
WorstLineEntry(
|
| 180 |
+
rank=i + 1,
|
| 181 |
+
cer=cer,
|
| 182 |
+
engine_name=engine,
|
| 183 |
+
doc_id=doc_id,
|
| 184 |
+
line_index=line_index,
|
| 185 |
+
gt_line=gt_line,
|
| 186 |
+
hyp_line=hyp_line,
|
| 187 |
+
script_type=script_type,
|
| 188 |
+
)
|
| 189 |
+
for i, (
|
| 190 |
+
cer, engine, doc_id, line_index,
|
| 191 |
+
gt_line, hyp_line, script_type,
|
| 192 |
+
) in enumerate(selected)
|
| 193 |
+
]
|
| 194 |
+
|
| 195 |
+
|
| 196 |
+
__all__ = [
|
| 197 |
+
"WorstLineEntry",
|
| 198 |
+
"extract_worst_lines",
|
| 199 |
+
]
|
|
@@ -1,229 +1,10 @@
|
|
| 1 |
-
"""
|
|
|
|
| 2 |
|
| 3 |
-
|
| 4 |
-
|
| 5 |
-
Pourquoi ce module
|
| 6 |
-
------------------
|
| 7 |
-
L'historique SQLite (``picarones/core/history.py``, Sprint 8)
|
| 8 |
-
existe mais aucun détecteur narratif ne le lit. Ce module fournit
|
| 9 |
-
la couche de calcul qui répond à *« comment ce moteur se
|
| 10 |
-
comporte-t-il sur ce corpus, **par rapport à ses runs précédents
|
| 11 |
-
de mon institution** ? »*.
|
| 12 |
-
|
| 13 |
-
Sortie typique
|
| 14 |
-
--------------
|
| 15 |
-
Un dict par moteur :
|
| 16 |
-
|
| 17 |
-
.. code-block:: python
|
| 18 |
-
|
| 19 |
-
{
|
| 20 |
-
"engine_name": "tesseract",
|
| 21 |
-
"cer_current": 0.052,
|
| 22 |
-
"cer_historical_mean": 0.041,
|
| 23 |
-
"cer_historical_median": 0.040,
|
| 24 |
-
"n_runs": 12,
|
| 25 |
-
"absolute_delta": 0.011,
|
| 26 |
-
"relative_delta": 0.268, # +26,8 % vs moyenne
|
| 27 |
-
"off_baseline": True,
|
| 28 |
-
}
|
| 29 |
-
|
| 30 |
-
Le détecteur narratif ``engine_off_baseline`` (Sprint 73)
|
| 31 |
-
consomme cette structure pour émettre des Facts.
|
| 32 |
-
|
| 33 |
-
Garde-fous
|
| 34 |
-
----------
|
| 35 |
-
- ``min_runs`` (défaut 5) : si l'historique pour le moteur×corpus
|
| 36 |
-
contient moins de runs, on retourne ``None`` plutôt que de
|
| 37 |
-
comparer à un échantillon trop petit.
|
| 38 |
-
- ``corpus_name`` est utilisé pour ne comparer qu'aux runs **du
|
| 39 |
-
même corpus** (sinon on compare des pommes et des oranges :
|
| 40 |
-
registres paroissiaux vs imprimés modernes).
|
| 41 |
-
- Le run courant lui-même n'est pas inclus dans la baseline (on
|
| 42 |
-
passe le ``current_run_id`` à exclure).
|
| 43 |
"""
|
| 44 |
|
| 45 |
from __future__ import annotations
|
| 46 |
|
| 47 |
-
import
|
| 48 |
-
import statistics
|
| 49 |
-
from typing import Optional
|
| 50 |
-
|
| 51 |
-
logger = logging.getLogger(__name__)
|
| 52 |
-
|
| 53 |
-
|
| 54 |
-
def compute_engine_baseline(
|
| 55 |
-
history,
|
| 56 |
-
engine_name: str,
|
| 57 |
-
corpus_name: str,
|
| 58 |
-
current_cer: float,
|
| 59 |
-
*,
|
| 60 |
-
current_run_id: Optional[str] = None,
|
| 61 |
-
min_runs: int = 5,
|
| 62 |
-
relative_delta_threshold: float = 0.20,
|
| 63 |
-
) -> Optional[dict]:
|
| 64 |
-
"""Compare le CER courant d'un moteur à sa moyenne historique
|
| 65 |
-
sur le **même corpus**.
|
| 66 |
-
|
| 67 |
-
Parameters
|
| 68 |
-
----------
|
| 69 |
-
history:
|
| 70 |
-
Instance de ``BenchmarkHistory`` (ou compatible : doit
|
| 71 |
-
exposer une méthode ``query(engine, corpus, limit)``
|
| 72 |
-
retournant une liste d'``HistoryEntry`` avec attribut
|
| 73 |
-
``cer_mean`` et ``run_id``).
|
| 74 |
-
engine_name:
|
| 75 |
-
Nom du moteur dont on calcule la baseline.
|
| 76 |
-
corpus_name:
|
| 77 |
-
Nom du corpus — limite la comparaison aux runs antérieurs
|
| 78 |
-
sur ce même corpus.
|
| 79 |
-
current_cer:
|
| 80 |
-
CER moyen observé dans le run courant.
|
| 81 |
-
current_run_id:
|
| 82 |
-
Si fourni, le run portant cet identifiant est exclu de la
|
| 83 |
-
baseline (utile quand le run courant est déjà enregistré
|
| 84 |
-
dans l'historique avant d'appeler ce calcul).
|
| 85 |
-
min_runs:
|
| 86 |
-
Nombre minimum de runs historiques pour que la
|
| 87 |
-
comparaison soit considérée fiable. Sous ce seuil, on
|
| 88 |
-
retourne ``None``.
|
| 89 |
-
relative_delta_threshold:
|
| 90 |
-
Seuil au-delà duquel ``off_baseline`` vaut ``True``
|
| 91 |
-
(défaut : 0,20 = 20 % d'écart relatif).
|
| 92 |
-
|
| 93 |
-
Returns
|
| 94 |
-
-------
|
| 95 |
-
Optional[dict]
|
| 96 |
-
``None`` si :
|
| 97 |
-
- moins de ``min_runs`` runs historiques disponibles
|
| 98 |
-
- ``current_cer`` est ``None`` ou négatif
|
| 99 |
-
- tous les CER historiques sont ``None``
|
| 100 |
-
|
| 101 |
-
Sinon, dict avec les champs documentés dans le module.
|
| 102 |
-
"""
|
| 103 |
-
if current_cer is None or current_cer < 0:
|
| 104 |
-
return None
|
| 105 |
-
try:
|
| 106 |
-
entries = history.query(
|
| 107 |
-
engine=engine_name, corpus=corpus_name, limit=1000,
|
| 108 |
-
)
|
| 109 |
-
except Exception as exc: # pragma: no cover — défense
|
| 110 |
-
logger.warning(
|
| 111 |
-
"[baseline_comparison] query history a levé : %s", exc,
|
| 112 |
-
)
|
| 113 |
-
return None
|
| 114 |
-
|
| 115 |
-
historical_cers: list[float] = []
|
| 116 |
-
for entry in entries:
|
| 117 |
-
if current_run_id is not None and entry.run_id == current_run_id:
|
| 118 |
-
continue
|
| 119 |
-
cer = entry.cer_mean
|
| 120 |
-
if cer is None or cer < 0:
|
| 121 |
-
continue
|
| 122 |
-
historical_cers.append(float(cer))
|
| 123 |
-
|
| 124 |
-
if len(historical_cers) < min_runs:
|
| 125 |
-
return None
|
| 126 |
-
|
| 127 |
-
mean = statistics.fmean(historical_cers)
|
| 128 |
-
median = statistics.median(historical_cers)
|
| 129 |
-
absolute_delta = current_cer - mean
|
| 130 |
-
if mean > 0:
|
| 131 |
-
relative_delta = absolute_delta / mean
|
| 132 |
-
elif current_cer == 0:
|
| 133 |
-
relative_delta = 0.0
|
| 134 |
-
else:
|
| 135 |
-
# Baseline à 0 mais CER courant > 0 : écart infini —
|
| 136 |
-
# convention : on signale comme off_baseline avec
|
| 137 |
-
# relative_delta = None.
|
| 138 |
-
relative_delta = None
|
| 139 |
-
|
| 140 |
-
off_baseline = (
|
| 141 |
-
relative_delta is not None
|
| 142 |
-
and abs(relative_delta) > relative_delta_threshold
|
| 143 |
-
)
|
| 144 |
-
|
| 145 |
-
return {
|
| 146 |
-
"engine_name": engine_name,
|
| 147 |
-
"corpus_name": corpus_name,
|
| 148 |
-
"cer_current": float(current_cer),
|
| 149 |
-
"cer_historical_mean": mean,
|
| 150 |
-
"cer_historical_median": median,
|
| 151 |
-
"n_runs": len(historical_cers),
|
| 152 |
-
"absolute_delta": absolute_delta,
|
| 153 |
-
"relative_delta": relative_delta,
|
| 154 |
-
"off_baseline": off_baseline,
|
| 155 |
-
}
|
| 156 |
-
|
| 157 |
-
|
| 158 |
-
def compute_corpus_difficulty_percentile(
|
| 159 |
-
history,
|
| 160 |
-
current_difficulty: float,
|
| 161 |
-
*,
|
| 162 |
-
min_runs: int = 5,
|
| 163 |
-
) -> Optional[dict]:
|
| 164 |
-
"""Place la difficulté du corpus courant dans la distribution
|
| 165 |
-
des difficultés historiques.
|
| 166 |
-
|
| 167 |
-
Lit les difficultés stockées dans ``HistoryEntry.metadata``
|
| 168 |
-
sous la clé ``difficulty`` (convention de
|
| 169 |
-
``picarones/core/difficulty.py``).
|
| 170 |
-
|
| 171 |
-
Returns
|
| 172 |
-
-------
|
| 173 |
-
Optional[dict]
|
| 174 |
-
``{
|
| 175 |
-
"current_difficulty": float,
|
| 176 |
-
"percentile": float, # 0..100
|
| 177 |
-
"n_runs": int,
|
| 178 |
-
"median_historical": float,
|
| 179 |
-
"harder_than_usual": bool, # percentile > 75
|
| 180 |
-
"easier_than_usual": bool, # percentile < 25
|
| 181 |
-
}``
|
| 182 |
-
ou ``None`` si moins de ``min_runs`` runs historiques ont
|
| 183 |
-
une difficulté enregistrée.
|
| 184 |
-
"""
|
| 185 |
-
if current_difficulty is None:
|
| 186 |
-
return None
|
| 187 |
-
try:
|
| 188 |
-
entries = history.query(limit=1000)
|
| 189 |
-
except Exception as exc: # pragma: no cover
|
| 190 |
-
logger.warning(
|
| 191 |
-
"[baseline_comparison] query history a levé : %s", exc,
|
| 192 |
-
)
|
| 193 |
-
return None
|
| 194 |
-
|
| 195 |
-
historical_difficulties: list[float] = []
|
| 196 |
-
for entry in entries:
|
| 197 |
-
diff = entry.metadata.get("difficulty") if entry.metadata else None
|
| 198 |
-
if diff is None:
|
| 199 |
-
continue
|
| 200 |
-
try:
|
| 201 |
-
historical_difficulties.append(float(diff))
|
| 202 |
-
except (TypeError, ValueError):
|
| 203 |
-
continue
|
| 204 |
-
|
| 205 |
-
if len(historical_difficulties) < min_runs:
|
| 206 |
-
return None
|
| 207 |
-
|
| 208 |
-
sorted_diff = sorted(historical_difficulties)
|
| 209 |
-
n = len(sorted_diff)
|
| 210 |
-
# Percentile = % de corpus historiques de difficulté ≤
|
| 211 |
-
# current_difficulty. Convention courante (P_i = i/n × 100).
|
| 212 |
-
n_below = sum(1 for d in sorted_diff if d <= current_difficulty)
|
| 213 |
-
percentile = (n_below / n) * 100.0
|
| 214 |
-
median = statistics.median(sorted_diff)
|
| 215 |
-
|
| 216 |
-
return {
|
| 217 |
-
"current_difficulty": float(current_difficulty),
|
| 218 |
-
"percentile": percentile,
|
| 219 |
-
"n_runs": n,
|
| 220 |
-
"median_historical": median,
|
| 221 |
-
"harder_than_usual": percentile > 75.0,
|
| 222 |
-
"easier_than_usual": percentile < 25.0,
|
| 223 |
-
}
|
| 224 |
-
|
| 225 |
-
|
| 226 |
-
__all__ = [
|
| 227 |
-
"compute_engine_baseline",
|
| 228 |
-
"compute_corpus_difficulty_percentile",
|
| 229 |
-
]
|
|
|
|
| 1 |
+
"""Re-export — Sprint A14-S10. Le contenu canonique vit dans
|
| 2 |
+
``picarones.evaluation.metrics.baseline_comparison``.
|
| 3 |
|
| 4 |
+
L'ancien chemin ``picarones.measurements.baseline_comparison`` est conservé pour
|
| 5 |
+
ne casser aucun consommateur. Au S22, ce re-export disparaîtra.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 6 |
"""
|
| 7 |
|
| 8 |
from __future__ import annotations
|
| 9 |
|
| 10 |
+
from picarones.evaluation.metrics.baseline_comparison import * # noqa: F401,F403
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
@@ -1,323 +1,10 @@
|
|
| 1 |
-
"""
|
|
|
|
| 2 |
|
| 3 |
-
|
| 4 |
-
|
| 5 |
-
Pourquoi ce module
|
| 6 |
-
------------------
|
| 7 |
-
Tous les moteurs OCR cibles fournissent une confidence par token ou par
|
| 8 |
-
ligne (Tesseract via le ``tsv``, Pero OCR via le ``PageLayout``,
|
| 9 |
-
Mistral OCR via ``confidence``, Google Vision via ``Word.confidence``).
|
| 10 |
-
La question naturelle pour un workflow patrimonial est : *« quand le
|
| 11 |
-
moteur dit qu'il est sûr, est-il vraiment sûr ? »*. Pour une équipe
|
| 12 |
-
qui doit vérifier humainement un corpus de 50 000 pages, la différence
|
| 13 |
-
entre vérifier 100 % vs 15 % du volume est l'effet de la calibration.
|
| 14 |
-
|
| 15 |
-
Ce module fournit les trois mesures classiques :
|
| 16 |
-
|
| 17 |
-
- **Expected Calibration Error (ECE)** — moyenne pondérée par bin de
|
| 18 |
-
l'écart absolu entre confiance moyenne et précision moyenne.
|
| 19 |
-
``ECE = 0`` ↔ moteur parfaitement calibré ; ``ECE`` élevé ↔ écart
|
| 20 |
-
systématique entre confiance affichée et fiabilité réelle.
|
| 21 |
-
- **Maximum Calibration Error (MCE)** — max de cet écart sur les bins.
|
| 22 |
-
Utile pour repérer le pire mensonge du moteur (ex. il dit toujours
|
| 23 |
-
95 % de confiance et il a tort une fois sur deux).
|
| 24 |
-
- **Reliability diagram** — table ``[(bin_low, bin_high, avg_conf,
|
| 25 |
-
accuracy, count)]`` qui peut être rendue en SVG côté serveur ou en
|
| 26 |
-
Chart.js côté navigateur dans un sprint suivant.
|
| 27 |
-
|
| 28 |
-
Stratégie de découpage
|
| 29 |
-
----------------------
|
| 30 |
-
Comme pour le NER (Sprint 38) et la divergence (Sprints 35-37),
|
| 31 |
-
on découpe :
|
| 32 |
-
|
| 33 |
-
- **Sprint 39** (ici) — couche de calcul pure : entrée = deux listes
|
| 34 |
-
parallèles ``confidences`` (∈ [0, 1]) et ``is_correct`` (bool/0-1).
|
| 35 |
-
Aucune dépendance externe.
|
| 36 |
-
- **Sprint à venir** — exposition de ``token_confidences`` sur
|
| 37 |
-
``EngineResult``, alignement caractère/token avec la GT pour produire
|
| 38 |
-
``is_correct``, intégration dans le runner et vue HTML reliability.
|
| 39 |
-
|
| 40 |
-
Ce qui est explicitement hors scope
|
| 41 |
-
-----------------------------------
|
| 42 |
-
Ce sprint ne touche **aucun adaptateur OCR**. Aucune confiance n'est
|
| 43 |
-
extraite ; on calcule uniquement à partir de séquences de prédictions
|
| 44 |
-
fournies en entrée. C'est ce qui permet de tester rigoureusement les
|
| 45 |
-
invariants mathématiques (ECE = 0 ↔ calibré, ECE = |bias| pour bias
|
| 46 |
-
constant, etc.) sans dépendre d'un backend.
|
| 47 |
"""
|
| 48 |
|
| 49 |
from __future__ import annotations
|
| 50 |
|
| 51 |
-
import
|
| 52 |
-
from dataclasses import dataclass
|
| 53 |
-
from typing import Iterable
|
| 54 |
-
|
| 55 |
-
logger = logging.getLogger(__name__)
|
| 56 |
-
|
| 57 |
-
|
| 58 |
-
# ──────────────────────────────────────────────────────────────────────────
|
| 59 |
-
# Modèle de données
|
| 60 |
-
# ──────────────────────────────────────────────────────────────────────────
|
| 61 |
-
|
| 62 |
-
|
| 63 |
-
@dataclass(frozen=True)
|
| 64 |
-
class CalibrationBin:
|
| 65 |
-
"""Un bin du reliability diagram.
|
| 66 |
-
|
| 67 |
-
Attributs
|
| 68 |
-
---------
|
| 69 |
-
bin_low, bin_high:
|
| 70 |
-
Bornes du bin sur l'axe de confiance (``[bin_low, bin_high)`` —
|
| 71 |
-
sauf le dernier bin qui inclut ``1.0``).
|
| 72 |
-
avg_confidence:
|
| 73 |
-
Moyenne des confidences des prédictions tombées dans le bin.
|
| 74 |
-
``None`` si le bin est vide.
|
| 75 |
-
accuracy:
|
| 76 |
-
Fraction de prédictions correctes dans le bin (``∈ [0, 1]``).
|
| 77 |
-
``None`` si le bin est vide.
|
| 78 |
-
count:
|
| 79 |
-
Nombre de prédictions dans le bin.
|
| 80 |
-
"""
|
| 81 |
-
|
| 82 |
-
bin_low: float
|
| 83 |
-
bin_high: float
|
| 84 |
-
avg_confidence: float | None
|
| 85 |
-
accuracy: float | None
|
| 86 |
-
count: int
|
| 87 |
-
|
| 88 |
-
@property
|
| 89 |
-
def gap(self) -> float | None:
|
| 90 |
-
"""Écart absolu ``|confidence - accuracy|`` ou ``None`` si vide."""
|
| 91 |
-
if self.avg_confidence is None or self.accuracy is None:
|
| 92 |
-
return None
|
| 93 |
-
return abs(self.avg_confidence - self.accuracy)
|
| 94 |
-
|
| 95 |
-
|
| 96 |
-
# ──────────────────────────────────────────────────────────────────────────
|
| 97 |
-
# Validation
|
| 98 |
-
# ──────────────────────────────────────────────────────────────────────────
|
| 99 |
-
|
| 100 |
-
|
| 101 |
-
def _validate_inputs(
|
| 102 |
-
confidences: list[float],
|
| 103 |
-
is_correct: list[bool | int],
|
| 104 |
-
) -> None:
|
| 105 |
-
if len(confidences) != len(is_correct):
|
| 106 |
-
raise ValueError(
|
| 107 |
-
f"Longueurs incompatibles : confidences={len(confidences)} "
|
| 108 |
-
f"vs is_correct={len(is_correct)}"
|
| 109 |
-
)
|
| 110 |
-
for i, c in enumerate(confidences):
|
| 111 |
-
if not (0.0 <= float(c) <= 1.0):
|
| 112 |
-
raise ValueError(
|
| 113 |
-
f"Confiance hors [0, 1] à l'index {i} : {c!r}"
|
| 114 |
-
)
|
| 115 |
-
|
| 116 |
-
|
| 117 |
-
# ──────────────────────────────────────────────────────────────────────────
|
| 118 |
-
# Reliability diagram (binning)
|
| 119 |
-
# ──────────────────────────────────────────────────────────────────────────
|
| 120 |
-
|
| 121 |
-
|
| 122 |
-
def reliability_diagram(
|
| 123 |
-
confidences: Iterable[float],
|
| 124 |
-
is_correct: Iterable[bool | int],
|
| 125 |
-
n_bins: int = 10,
|
| 126 |
-
) -> list[CalibrationBin]:
|
| 127 |
-
"""Découpe les prédictions en ``n_bins`` bins équidistants par confiance
|
| 128 |
-
et calcule pour chacun la confiance moyenne, la précision et le compte.
|
| 129 |
-
|
| 130 |
-
Parameters
|
| 131 |
-
----------
|
| 132 |
-
confidences:
|
| 133 |
-
Confidences des prédictions, ``∈ [0, 1]``.
|
| 134 |
-
is_correct:
|
| 135 |
-
Indicateur booléen (1 = prédiction correcte, 0 = incorrecte).
|
| 136 |
-
n_bins:
|
| 137 |
-
Nombre de bins (défaut : 10). Bornes : ``[k/n_bins, (k+1)/n_bins)``
|
| 138 |
-
sauf le dernier bin qui inclut ``1.0``.
|
| 139 |
-
|
| 140 |
-
Returns
|
| 141 |
-
-------
|
| 142 |
-
list[CalibrationBin]
|
| 143 |
-
Liste de ``n_bins`` bins, dans l'ordre croissant des confidences.
|
| 144 |
-
"""
|
| 145 |
-
if n_bins < 1:
|
| 146 |
-
raise ValueError(f"n_bins doit être ≥ 1 — reçu {n_bins}")
|
| 147 |
-
|
| 148 |
-
confs = [float(c) for c in confidences]
|
| 149 |
-
correct = [int(bool(x)) for x in is_correct]
|
| 150 |
-
_validate_inputs(confs, correct)
|
| 151 |
-
|
| 152 |
-
bin_width = 1.0 / n_bins
|
| 153 |
-
sums: list[float] = [0.0] * n_bins
|
| 154 |
-
correct_counts: list[int] = [0] * n_bins
|
| 155 |
-
counts: list[int] = [0] * n_bins
|
| 156 |
-
|
| 157 |
-
for c, ok in zip(confs, correct):
|
| 158 |
-
# Calcul du bin index par multiplication ``c * n_bins`` plutôt que
|
| 159 |
-
# division ``c / bin_width`` pour éviter les pièges de
|
| 160 |
-
# représentation flottante (ex. ``0.6 / 0.1 = 5.999…`` en IEEE 754
|
| 161 |
-
# qui placerait 0.6 dans le bin [0.5, 0.6) au lieu de [0.6, 0.7)).
|
| 162 |
-
if c >= 1.0:
|
| 163 |
-
idx = n_bins - 1
|
| 164 |
-
else:
|
| 165 |
-
idx = int(c * n_bins)
|
| 166 |
-
# Garde-fou en cas d'arrondi flottant
|
| 167 |
-
if idx >= n_bins:
|
| 168 |
-
idx = n_bins - 1
|
| 169 |
-
elif idx < 0:
|
| 170 |
-
idx = 0
|
| 171 |
-
sums[idx] += c
|
| 172 |
-
correct_counts[idx] += ok
|
| 173 |
-
counts[idx] += 1
|
| 174 |
-
|
| 175 |
-
bins: list[CalibrationBin] = []
|
| 176 |
-
for k in range(n_bins):
|
| 177 |
-
low = k * bin_width
|
| 178 |
-
high = (k + 1) * bin_width
|
| 179 |
-
n = counts[k]
|
| 180 |
-
if n == 0:
|
| 181 |
-
bins.append(CalibrationBin(low, high, None, None, 0))
|
| 182 |
-
else:
|
| 183 |
-
bins.append(CalibrationBin(
|
| 184 |
-
bin_low=low,
|
| 185 |
-
bin_high=high,
|
| 186 |
-
avg_confidence=sums[k] / n,
|
| 187 |
-
accuracy=correct_counts[k] / n,
|
| 188 |
-
count=n,
|
| 189 |
-
))
|
| 190 |
-
return bins
|
| 191 |
-
|
| 192 |
-
|
| 193 |
-
# ──────────────────────────────────────────────────────────────────────────
|
| 194 |
-
# ECE et MCE
|
| 195 |
-
# ──────────────────────────────────────────────────────────────────────────
|
| 196 |
-
|
| 197 |
-
|
| 198 |
-
def expected_calibration_error(
|
| 199 |
-
confidences: Iterable[float],
|
| 200 |
-
is_correct: Iterable[bool | int],
|
| 201 |
-
n_bins: int = 10,
|
| 202 |
-
) -> float:
|
| 203 |
-
"""Expected Calibration Error : moyenne pondérée par bin de l'écart
|
| 204 |
-
absolu confiance ↔ précision.
|
| 205 |
-
|
| 206 |
-
``ECE = sum_k (n_k / N) * |avg_conf_k - accuracy_k|``
|
| 207 |
-
|
| 208 |
-
où la somme porte sur les bins non vides.
|
| 209 |
-
|
| 210 |
-
Returns
|
| 211 |
-
-------
|
| 212 |
-
float
|
| 213 |
-
``∈ [0, 1]``. ``0`` ↔ calibration parfaite.
|
| 214 |
-
"""
|
| 215 |
-
bins = reliability_diagram(confidences, is_correct, n_bins=n_bins)
|
| 216 |
-
total = sum(b.count for b in bins)
|
| 217 |
-
if total == 0:
|
| 218 |
-
return 0.0
|
| 219 |
-
ece = 0.0
|
| 220 |
-
for b in bins:
|
| 221 |
-
if b.count == 0 or b.gap is None:
|
| 222 |
-
continue
|
| 223 |
-
ece += (b.count / total) * b.gap
|
| 224 |
-
return ece
|
| 225 |
-
|
| 226 |
-
|
| 227 |
-
def maximum_calibration_error(
|
| 228 |
-
confidences: Iterable[float],
|
| 229 |
-
is_correct: Iterable[bool | int],
|
| 230 |
-
n_bins: int = 10,
|
| 231 |
-
) -> float:
|
| 232 |
-
"""Maximum Calibration Error : pire écart confiance ↔ précision sur
|
| 233 |
-
tous les bins non vides.
|
| 234 |
-
|
| 235 |
-
Utile pour repérer un mensonge ponctuel du moteur (ex. il dit 95 %
|
| 236 |
-
de confiance et il a tort une fois sur deux dans ce bin).
|
| 237 |
-
|
| 238 |
-
Returns
|
| 239 |
-
-------
|
| 240 |
-
float
|
| 241 |
-
``∈ [0, 1]``. ``0`` ↔ calibration parfaite.
|
| 242 |
-
"""
|
| 243 |
-
bins = reliability_diagram(confidences, is_correct, n_bins=n_bins)
|
| 244 |
-
gaps = [b.gap for b in bins if b.gap is not None]
|
| 245 |
-
return max(gaps) if gaps else 0.0
|
| 246 |
-
|
| 247 |
-
|
| 248 |
-
# ──────────────────────────────────────────────────────────────────────────
|
| 249 |
-
# Vue agrégée
|
| 250 |
-
# ──────────────────────────────────────────────────────────────────────────
|
| 251 |
-
|
| 252 |
-
|
| 253 |
-
def compute_calibration_metrics(
|
| 254 |
-
confidences: Iterable[float],
|
| 255 |
-
is_correct: Iterable[bool | int],
|
| 256 |
-
n_bins: int = 10,
|
| 257 |
-
) -> dict:
|
| 258 |
-
"""Calcule l'ensemble des métriques de calibration en un appel.
|
| 259 |
-
|
| 260 |
-
Returns
|
| 261 |
-
-------
|
| 262 |
-
dict
|
| 263 |
-
``{
|
| 264 |
-
"ece": float,
|
| 265 |
-
"mce": float,
|
| 266 |
-
"n_bins": int,
|
| 267 |
-
"n_predictions": int,
|
| 268 |
-
"overall_accuracy": float,
|
| 269 |
-
"overall_confidence": float,
|
| 270 |
-
"bins": [
|
| 271 |
-
{"bin_low", "bin_high", "avg_confidence",
|
| 272 |
-
"accuracy", "count", "gap"},
|
| 273 |
-
...
|
| 274 |
-
],
|
| 275 |
-
}``
|
| 276 |
-
"""
|
| 277 |
-
confs = list(confidences)
|
| 278 |
-
correct = list(is_correct)
|
| 279 |
-
bins = reliability_diagram(confs, correct, n_bins=n_bins)
|
| 280 |
-
total = sum(b.count for b in bins)
|
| 281 |
-
overall_acc = (
|
| 282 |
-
sum(int(bool(x)) for x in correct) / total if total > 0 else 0.0
|
| 283 |
-
)
|
| 284 |
-
overall_conf = (
|
| 285 |
-
sum(float(c) for c in confs) / total if total > 0 else 0.0
|
| 286 |
-
)
|
| 287 |
-
|
| 288 |
-
ece = 0.0
|
| 289 |
-
if total > 0:
|
| 290 |
-
for b in bins:
|
| 291 |
-
if b.gap is None:
|
| 292 |
-
continue
|
| 293 |
-
ece += (b.count / total) * b.gap
|
| 294 |
-
mce = max((b.gap for b in bins if b.gap is not None), default=0.0)
|
| 295 |
-
|
| 296 |
-
return {
|
| 297 |
-
"ece": ece,
|
| 298 |
-
"mce": mce,
|
| 299 |
-
"n_bins": n_bins,
|
| 300 |
-
"n_predictions": total,
|
| 301 |
-
"overall_accuracy": overall_acc,
|
| 302 |
-
"overall_confidence": overall_conf,
|
| 303 |
-
"bins": [
|
| 304 |
-
{
|
| 305 |
-
"bin_low": b.bin_low,
|
| 306 |
-
"bin_high": b.bin_high,
|
| 307 |
-
"avg_confidence": b.avg_confidence,
|
| 308 |
-
"accuracy": b.accuracy,
|
| 309 |
-
"count": b.count,
|
| 310 |
-
"gap": b.gap,
|
| 311 |
-
}
|
| 312 |
-
for b in bins
|
| 313 |
-
],
|
| 314 |
-
}
|
| 315 |
-
|
| 316 |
-
|
| 317 |
-
__all__ = [
|
| 318 |
-
"CalibrationBin",
|
| 319 |
-
"reliability_diagram",
|
| 320 |
-
"expected_calibration_error",
|
| 321 |
-
"maximum_calibration_error",
|
| 322 |
-
"compute_calibration_metrics",
|
| 323 |
-
]
|
|
|
|
| 1 |
+
"""Re-export — Sprint A14-S10. Le contenu canonique vit dans
|
| 2 |
+
``picarones.evaluation.metrics.calibration``.
|
| 3 |
|
| 4 |
+
L'ancien chemin ``picarones.measurements.calibration`` est conservé pour
|
| 5 |
+
ne casser aucun consommateur. Au S22, ce re-export disparaîtra.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 6 |
"""
|
| 7 |
|
| 8 |
from __future__ import annotations
|
| 9 |
|
| 10 |
+
from picarones.evaluation.metrics.calibration import * # noqa: F401,F403
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
@@ -1,268 +1,10 @@
|
|
| 1 |
-
"""
|
|
|
|
| 2 |
|
| 3 |
-
|
| 4 |
-
|
| 5 |
-
caractéristique de chaque moteur ou pipeline.
|
| 6 |
-
|
| 7 |
-
Méthode
|
| 8 |
-
-------
|
| 9 |
-
L'alignement caractère par caractère utilise les opérations d'édition
|
| 10 |
-
de la distance de Levenshtein (via difflib.SequenceMatcher), ce qui permet
|
| 11 |
-
d'identifier les substitutions, insertions et suppressions.
|
| 12 |
-
|
| 13 |
-
La matrice est stockée comme un dict de dict :
|
| 14 |
-
``{gt_char: {ocr_char: count}}``
|
| 15 |
-
|
| 16 |
-
La valeur spéciale ``"∅"`` (U+2205) représente un caractère vide :
|
| 17 |
-
- ``{"a": {"∅": 3}}`` → 'a' supprimé 3 fois dans l'OCR
|
| 18 |
-
- ``{"∅": {"x": 2}}`` → 'x' inséré 2 fois dans l'OCR (absent du GT)
|
| 19 |
"""
|
| 20 |
|
| 21 |
from __future__ import annotations
|
| 22 |
|
| 23 |
-
import
|
| 24 |
-
from collections import defaultdict
|
| 25 |
-
from dataclasses import dataclass, field
|
| 26 |
-
|
| 27 |
-
# Symbole représentant un caractère absent (insertion / suppression)
|
| 28 |
-
EMPTY_CHAR = "∅"
|
| 29 |
-
|
| 30 |
-
# Caractères non pertinents à ignorer dans la matrice (espaces, sauts de ligne)
|
| 31 |
-
_WHITESPACE = set(" \t\n\r")
|
| 32 |
-
|
| 33 |
-
|
| 34 |
-
@dataclass
|
| 35 |
-
class ConfusionMatrix:
|
| 36 |
-
"""Matrice de confusion unicode pour une paire (GT, OCR)."""
|
| 37 |
-
|
| 38 |
-
matrix: dict[str, dict[str, int]] = field(default_factory=dict)
|
| 39 |
-
"""Clé externe = char GT ; clé interne = char OCR ; valeur = count."""
|
| 40 |
-
|
| 41 |
-
total_substitutions: int = 0
|
| 42 |
-
total_insertions: int = 0
|
| 43 |
-
total_deletions: int = 0
|
| 44 |
-
|
| 45 |
-
@property
|
| 46 |
-
def total_errors(self) -> int:
|
| 47 |
-
return self.total_substitutions + self.total_insertions + self.total_deletions
|
| 48 |
-
|
| 49 |
-
def top_confusions(self, n: int = 20) -> list[dict]:
|
| 50 |
-
"""Retourne les n confusions les plus fréquentes (substitutions uniquement)."""
|
| 51 |
-
pairs: list[tuple[str, str, int]] = []
|
| 52 |
-
for gt_char, ocr_counts in self.matrix.items():
|
| 53 |
-
if gt_char == EMPTY_CHAR:
|
| 54 |
-
continue # insertions
|
| 55 |
-
for ocr_char, count in ocr_counts.items():
|
| 56 |
-
if ocr_char == EMPTY_CHAR:
|
| 57 |
-
continue # suppressions
|
| 58 |
-
if gt_char != ocr_char:
|
| 59 |
-
pairs.append((gt_char, ocr_char, count))
|
| 60 |
-
pairs.sort(key=lambda x: -x[2])
|
| 61 |
-
return [
|
| 62 |
-
{"gt": gt, "ocr": ocr, "count": cnt}
|
| 63 |
-
for gt, ocr, cnt in pairs[:n]
|
| 64 |
-
]
|
| 65 |
-
|
| 66 |
-
def as_compact_dict(self, min_count: int = 1) -> dict:
|
| 67 |
-
"""Sérialise la matrice en éliminant les entrées rares."""
|
| 68 |
-
compact: dict[str, dict[str, int]] = {}
|
| 69 |
-
for gt_char, ocr_counts in self.matrix.items():
|
| 70 |
-
filtered = {
|
| 71 |
-
oc: cnt for oc, cnt in ocr_counts.items()
|
| 72 |
-
if cnt >= min_count
|
| 73 |
-
}
|
| 74 |
-
if filtered:
|
| 75 |
-
compact[gt_char] = filtered
|
| 76 |
-
return {
|
| 77 |
-
"matrix": compact,
|
| 78 |
-
"total_substitutions": self.total_substitutions,
|
| 79 |
-
"total_insertions": self.total_insertions,
|
| 80 |
-
"total_deletions": self.total_deletions,
|
| 81 |
-
}
|
| 82 |
-
|
| 83 |
-
def as_dict(self) -> dict:
|
| 84 |
-
return self.as_compact_dict(min_count=1)
|
| 85 |
-
|
| 86 |
-
|
| 87 |
-
def build_confusion_matrix(
|
| 88 |
-
ground_truth: str,
|
| 89 |
-
hypothesis: str,
|
| 90 |
-
ignore_whitespace: bool = True,
|
| 91 |
-
ignore_correct: bool = True,
|
| 92 |
-
) -> ConfusionMatrix:
|
| 93 |
-
"""Construit la matrice de confusion unicode pour une paire GT/OCR.
|
| 94 |
-
|
| 95 |
-
Parameters
|
| 96 |
-
----------
|
| 97 |
-
ground_truth:
|
| 98 |
-
Texte de référence (vérité terrain).
|
| 99 |
-
hypothesis:
|
| 100 |
-
Texte produit par l'OCR.
|
| 101 |
-
ignore_whitespace:
|
| 102 |
-
Si True, ignore les espaces, tabulations et sauts de ligne.
|
| 103 |
-
ignore_correct:
|
| 104 |
-
Si True, n'enregistre pas les paires identiques (gt_char == ocr_char).
|
| 105 |
-
Par défaut True pour réduire la taille de la matrice.
|
| 106 |
-
|
| 107 |
-
Returns
|
| 108 |
-
-------
|
| 109 |
-
ConfusionMatrix
|
| 110 |
-
"""
|
| 111 |
-
matrix: dict[str, dict[str, int]] = defaultdict(lambda: defaultdict(int))
|
| 112 |
-
n_subs = n_ins = n_dels = 0
|
| 113 |
-
|
| 114 |
-
if not ground_truth and not hypothesis:
|
| 115 |
-
return ConfusionMatrix(dict(matrix), 0, 0, 0)
|
| 116 |
-
|
| 117 |
-
# SequenceMatcher sur listes de chars pour un alignement précis
|
| 118 |
-
matcher = difflib.SequenceMatcher(None, ground_truth, hypothesis, autojunk=False)
|
| 119 |
-
|
| 120 |
-
for tag, i1, i2, j1, j2 in matcher.get_opcodes():
|
| 121 |
-
if tag == "equal":
|
| 122 |
-
if not ignore_correct:
|
| 123 |
-
for ch in ground_truth[i1:i2]:
|
| 124 |
-
if ignore_whitespace and ch in _WHITESPACE:
|
| 125 |
-
continue
|
| 126 |
-
matrix[ch][ch] += 1
|
| 127 |
-
elif tag == "replace":
|
| 128 |
-
# Aligner char par char les séquences de longueurs différentes
|
| 129 |
-
gt_seg = ground_truth[i1:i2]
|
| 130 |
-
oc_seg = hypothesis[j1:j2]
|
| 131 |
-
_align_segments(gt_seg, oc_seg, matrix, ignore_whitespace)
|
| 132 |
-
# Substitutions = longueur commune, surplus = insertions ou suppressions
|
| 133 |
-
n_subs += min(len(gt_seg), len(oc_seg))
|
| 134 |
-
surplus = abs(len(gt_seg) - len(oc_seg))
|
| 135 |
-
if len(gt_seg) > len(oc_seg):
|
| 136 |
-
n_dels += surplus
|
| 137 |
-
else:
|
| 138 |
-
n_ins += surplus
|
| 139 |
-
elif tag == "delete":
|
| 140 |
-
for ch in ground_truth[i1:i2]:
|
| 141 |
-
if ignore_whitespace and ch in _WHITESPACE:
|
| 142 |
-
continue
|
| 143 |
-
matrix[ch][EMPTY_CHAR] += 1
|
| 144 |
-
n_dels += 1
|
| 145 |
-
elif tag == "insert":
|
| 146 |
-
for ch in hypothesis[j1:j2]:
|
| 147 |
-
if ignore_whitespace and ch in _WHITESPACE:
|
| 148 |
-
continue
|
| 149 |
-
matrix[EMPTY_CHAR][ch] += 1
|
| 150 |
-
n_ins += 1
|
| 151 |
-
|
| 152 |
-
# Convertir defaultdict en dict normal
|
| 153 |
-
result_matrix: dict[str, dict[str, int]] = {
|
| 154 |
-
k: dict(v) for k, v in matrix.items()
|
| 155 |
-
}
|
| 156 |
-
|
| 157 |
-
return ConfusionMatrix(
|
| 158 |
-
matrix=result_matrix,
|
| 159 |
-
total_substitutions=n_subs,
|
| 160 |
-
total_insertions=n_ins,
|
| 161 |
-
total_deletions=n_dels,
|
| 162 |
-
)
|
| 163 |
-
|
| 164 |
-
|
| 165 |
-
def _align_segments(
|
| 166 |
-
gt_seg: str,
|
| 167 |
-
oc_seg: str,
|
| 168 |
-
matrix: dict,
|
| 169 |
-
ignore_whitespace: bool,
|
| 170 |
-
) -> None:
|
| 171 |
-
"""Aligne deux segments de longueurs potentiellement différentes."""
|
| 172 |
-
if not gt_seg:
|
| 173 |
-
for ch in oc_seg:
|
| 174 |
-
if ignore_whitespace and ch in _WHITESPACE:
|
| 175 |
-
continue
|
| 176 |
-
matrix[EMPTY_CHAR][ch] += 1
|
| 177 |
-
return
|
| 178 |
-
if not oc_seg:
|
| 179 |
-
for ch in gt_seg:
|
| 180 |
-
if ignore_whitespace and ch in _WHITESPACE:
|
| 181 |
-
continue
|
| 182 |
-
matrix[ch][EMPTY_CHAR] += 1
|
| 183 |
-
return
|
| 184 |
-
|
| 185 |
-
if len(gt_seg) == len(oc_seg):
|
| 186 |
-
# Substitutions 1-pour-1
|
| 187 |
-
for g, o in zip(gt_seg, oc_seg):
|
| 188 |
-
if ignore_whitespace and (g in _WHITESPACE or o in _WHITESPACE):
|
| 189 |
-
continue
|
| 190 |
-
matrix[g][o] += 1
|
| 191 |
-
else:
|
| 192 |
-
# Longueurs différentes : utiliser SequenceMatcher récursif sur segments courts
|
| 193 |
-
sub = difflib.SequenceMatcher(None, gt_seg, oc_seg, autojunk=False)
|
| 194 |
-
for tag2, i1, i2, j1, j2 in sub.get_opcodes():
|
| 195 |
-
if tag2 == "equal":
|
| 196 |
-
pass
|
| 197 |
-
elif tag2 == "replace":
|
| 198 |
-
# Régression simple : aligner par troncature
|
| 199 |
-
for g, o in zip(gt_seg[i1:i2], oc_seg[j1:j2]):
|
| 200 |
-
if ignore_whitespace and (g in _WHITESPACE or o in _WHITESPACE):
|
| 201 |
-
continue
|
| 202 |
-
matrix[g][o] += 1
|
| 203 |
-
elif tag2 == "delete":
|
| 204 |
-
for g in gt_seg[i1:i2]:
|
| 205 |
-
if ignore_whitespace and g in _WHITESPACE:
|
| 206 |
-
continue
|
| 207 |
-
matrix[g][EMPTY_CHAR] += 1
|
| 208 |
-
elif tag2 == "insert":
|
| 209 |
-
for o in oc_seg[j1:j2]:
|
| 210 |
-
if ignore_whitespace and o in _WHITESPACE:
|
| 211 |
-
continue
|
| 212 |
-
matrix[EMPTY_CHAR][o] += 1
|
| 213 |
-
|
| 214 |
-
|
| 215 |
-
def aggregate_confusion_matrices(matrices: list[ConfusionMatrix]) -> ConfusionMatrix:
|
| 216 |
-
"""Agrège plusieurs matrices de confusion en une seule.
|
| 217 |
-
|
| 218 |
-
Utile pour obtenir la matrice agrégée sur l'ensemble du corpus.
|
| 219 |
-
"""
|
| 220 |
-
combined: dict[str, dict[str, int]] = defaultdict(lambda: defaultdict(int))
|
| 221 |
-
total_subs = total_ins = total_dels = 0
|
| 222 |
-
|
| 223 |
-
for cm in matrices:
|
| 224 |
-
for gt_char, ocr_counts in cm.matrix.items():
|
| 225 |
-
for ocr_char, count in ocr_counts.items():
|
| 226 |
-
combined[gt_char][ocr_char] += count
|
| 227 |
-
total_subs += cm.total_substitutions
|
| 228 |
-
total_ins += cm.total_insertions
|
| 229 |
-
total_dels += cm.total_deletions
|
| 230 |
-
|
| 231 |
-
return ConfusionMatrix(
|
| 232 |
-
matrix={k: dict(v) for k, v in combined.items()},
|
| 233 |
-
total_substitutions=total_subs,
|
| 234 |
-
total_insertions=total_ins,
|
| 235 |
-
total_deletions=total_dels,
|
| 236 |
-
)
|
| 237 |
-
|
| 238 |
-
|
| 239 |
-
def top_confused_chars(
|
| 240 |
-
matrix: ConfusionMatrix,
|
| 241 |
-
n: int = 15,
|
| 242 |
-
exclude_empty: bool = True,
|
| 243 |
-
) -> list[dict]:
|
| 244 |
-
"""Retourne les caractères GT les plus souvent confondus.
|
| 245 |
-
|
| 246 |
-
Retourne une liste triée par nombre total d'erreurs décroissant :
|
| 247 |
-
``[{"char": "ſ", "total_errors": 47, "top_substitutes": [...]}, ...]``
|
| 248 |
-
"""
|
| 249 |
-
char_stats: dict[str, dict] = {}
|
| 250 |
-
for gt_char, ocr_counts in matrix.matrix.items():
|
| 251 |
-
if exclude_empty and gt_char == EMPTY_CHAR:
|
| 252 |
-
continue
|
| 253 |
-
error_count = sum(
|
| 254 |
-
cnt for oc, cnt in ocr_counts.items()
|
| 255 |
-
if (oc != gt_char) and (not exclude_empty or oc != EMPTY_CHAR)
|
| 256 |
-
)
|
| 257 |
-
if error_count > 0:
|
| 258 |
-
top_subs = sorted(
|
| 259 |
-
[{"ocr": oc, "count": cnt} for oc, cnt in ocr_counts.items() if oc != gt_char],
|
| 260 |
-
key=lambda x: -x["count"],
|
| 261 |
-
)[:5]
|
| 262 |
-
char_stats[gt_char] = {
|
| 263 |
-
"char": gt_char,
|
| 264 |
-
"total_errors": error_count,
|
| 265 |
-
"top_substitutes": top_subs,
|
| 266 |
-
}
|
| 267 |
-
|
| 268 |
-
return sorted(char_stats.values(), key=lambda x: -x["total_errors"])[:n]
|
|
|
|
| 1 |
+
"""Re-export — Sprint A14-S10. Le contenu canonique vit dans
|
| 2 |
+
``picarones.evaluation.metrics.confusion``.
|
| 3 |
|
| 4 |
+
L'ancien chemin ``picarones.measurements.confusion`` est conservé pour
|
| 5 |
+
ne casser aucun consommateur. Au S22, ce re-export disparaîtra.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 6 |
"""
|
| 7 |
|
| 8 |
from __future__ import annotations
|
| 9 |
|
| 10 |
+
from picarones.evaluation.metrics.confusion import * # noqa: F401,F403
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
@@ -1,276 +1,10 @@
|
|
| 1 |
-
"""
|
|
|
|
| 2 |
|
| 3 |
-
|
| 4 |
-
|
| 5 |
-
Pourquoi ce module
|
| 6 |
-
------------------
|
| 7 |
-
Quand un module de post-correction LLM aplatit les différences
|
| 8 |
-
entre OCR amont, ce n'est pas qu'il « améliore » tous les
|
| 9 |
-
moteurs — c'est qu'il introduit ses propres biais qui dominent
|
| 10 |
-
ceux de l'OCR. Mesurer la dégradation par étape ne suffit
|
| 11 |
-
pas : il faut **séparer** les deux flux.
|
| 12 |
-
|
| 13 |
-
À chaque jonction où un module transforme un artefact, on
|
| 14 |
-
mesure :
|
| 15 |
-
|
| 16 |
-
- **Taux de correction** : parmi les erreurs présentes en
|
| 17 |
-
entrée du module, combien sont corrigées en sortie ?
|
| 18 |
-
- **Taux d'introduction** : parmi les erreurs présentes en
|
| 19 |
-
sortie, combien sont **nouvelles** (absentes en entrée) ?
|
| 20 |
-
|
| 21 |
-
C'est la généralisation du score de sur-normalisation
|
| 22 |
-
(chantier A.I.7) à toute jonction. La formule s'applique
|
| 23 |
-
uniformément à OCR→LLM, OCR→reconstructor, VLM→ALTO_mapper —
|
| 24 |
-
toute jonction qui transforme un artefact en un autre du même
|
| 25 |
-
type.
|
| 26 |
-
|
| 27 |
-
Méthode (token-level)
|
| 28 |
-
---------------------
|
| 29 |
-
On split en tokens whitespace ``reference``, ``before``,
|
| 30 |
-
``after``. On compare en **multiset** (un token GT consommé
|
| 31 |
-
au plus une fois) :
|
| 32 |
-
|
| 33 |
-
- ``errors_before`` = tokens GT non retrouvés dans ``before``
|
| 34 |
-
- ``errors_after`` = tokens GT non retrouvés dans ``after``
|
| 35 |
-
- ``corrected`` = ``errors_before \\ errors_after``
|
| 36 |
-
(présents avant, absents après → corrigés)
|
| 37 |
-
- ``introduced`` = ``errors_after \\ errors_before``
|
| 38 |
-
(absents avant, présents après → introduits)
|
| 39 |
-
|
| 40 |
-
Garde-fou : le module ne classe pas les erreurs (visuelles,
|
| 41 |
-
abréviations, etc.) — c'est une métrique d'**absorption de
|
| 42 |
-
volume**, pas de qualité éditoriale. L'intersection sémantique
|
| 43 |
-
avec ``taxonomy`` (Sprint 5) est documentée dans le glossaire.
|
| 44 |
-
|
| 45 |
-
Sortie
|
| 46 |
-
------
|
| 47 |
-
``compute_error_absorption(reference, before, after)`` retourne :
|
| 48 |
-
|
| 49 |
-
.. code-block:: text
|
| 50 |
-
|
| 51 |
-
{
|
| 52 |
-
"n_gt_tokens": int,
|
| 53 |
-
"n_errors_before": int,
|
| 54 |
-
"n_errors_after": int,
|
| 55 |
-
"n_corrected": int,
|
| 56 |
-
"n_introduced": int,
|
| 57 |
-
"n_kept_wrong": int,
|
| 58 |
-
"correction_rate": float | None, # n_corrected / n_errors_before
|
| 59 |
-
"introduction_rate": float | None, # n_introduced / n_errors_after
|
| 60 |
-
"net_improvement": int, # n_corrected - n_introduced
|
| 61 |
-
"corrected_tokens": list[str],
|
| 62 |
-
"introduced_tokens": list[str],
|
| 63 |
-
}
|
| 64 |
-
|
| 65 |
-
``aggregate_error_absorption(per_doc_results)`` somme les
|
| 66 |
-
compteurs corpus-wide et recalcule les taux *micro*.
|
| 67 |
"""
|
| 68 |
|
| 69 |
from __future__ import annotations
|
| 70 |
|
| 71 |
-
import
|
| 72 |
-
from collections import Counter
|
| 73 |
-
from typing import Iterable, Optional
|
| 74 |
-
|
| 75 |
-
logger = logging.getLogger(__name__)
|
| 76 |
-
|
| 77 |
-
|
| 78 |
-
def _split_words(text: Optional[str]) -> list[str]:
|
| 79 |
-
if not text:
|
| 80 |
-
return []
|
| 81 |
-
return text.split()
|
| 82 |
-
|
| 83 |
-
|
| 84 |
-
def _missing_tokens(
|
| 85 |
-
reference: list[str], hypothesis: list[str],
|
| 86 |
-
) -> Counter:
|
| 87 |
-
"""Tokens GT manquants en hypothèse au sens multiset.
|
| 88 |
-
|
| 89 |
-
Un token GT compte plusieurs fois s'il apparaît plusieurs
|
| 90 |
-
fois ; chaque occurrence en hypothèse en absorbe au plus
|
| 91 |
-
une. Retourne un Counter ``{token: nb_occurrences_manquees}``.
|
| 92 |
-
"""
|
| 93 |
-
ref_count = Counter(reference)
|
| 94 |
-
hyp_count = Counter(hypothesis)
|
| 95 |
-
missing: Counter = Counter()
|
| 96 |
-
for token, n_ref in ref_count.items():
|
| 97 |
-
n_hyp = hyp_count.get(token, 0)
|
| 98 |
-
if n_hyp < n_ref:
|
| 99 |
-
missing[token] = n_ref - n_hyp
|
| 100 |
-
return missing
|
| 101 |
-
|
| 102 |
-
|
| 103 |
-
def compute_error_absorption(
|
| 104 |
-
reference: Optional[str],
|
| 105 |
-
before: Optional[str],
|
| 106 |
-
after: Optional[str],
|
| 107 |
-
*,
|
| 108 |
-
case_sensitive: bool = False,
|
| 109 |
-
) -> Optional[dict]:
|
| 110 |
-
"""Mesure l'absorption d'erreur entre ``before`` et ``after``.
|
| 111 |
-
|
| 112 |
-
Parameters
|
| 113 |
-
----------
|
| 114 |
-
reference:
|
| 115 |
-
GT (vérité terrain).
|
| 116 |
-
before:
|
| 117 |
-
Sortie de l'étape précédente (typiquement OCR amont).
|
| 118 |
-
after:
|
| 119 |
-
Sortie de l'étape courante (typiquement post-correction LLM).
|
| 120 |
-
case_sensitive:
|
| 121 |
-
Si False (défaut), match case-insensitive — la sortie
|
| 122 |
-
``corrected_tokens``/``introduced_tokens`` reste en casse
|
| 123 |
-
GT originale.
|
| 124 |
-
|
| 125 |
-
Returns
|
| 126 |
-
-------
|
| 127 |
-
dict | None
|
| 128 |
-
``None`` si la GT est vide ou ne contient aucun token.
|
| 129 |
-
"""
|
| 130 |
-
ref_tokens = _split_words(reference)
|
| 131 |
-
if not ref_tokens:
|
| 132 |
-
return None
|
| 133 |
-
before_tokens = _split_words(before)
|
| 134 |
-
after_tokens = _split_words(after)
|
| 135 |
-
|
| 136 |
-
if case_sensitive:
|
| 137 |
-
ref_match = list(ref_tokens)
|
| 138 |
-
before_match = list(before_tokens)
|
| 139 |
-
after_match = list(after_tokens)
|
| 140 |
-
else:
|
| 141 |
-
ref_match = [t.lower() for t in ref_tokens]
|
| 142 |
-
before_match = [t.lower() for t in before_tokens]
|
| 143 |
-
after_match = [t.lower() for t in after_tokens]
|
| 144 |
-
|
| 145 |
-
# Map case-insensitive token → liste de casses GT originales
|
| 146 |
-
ref_orig_by_match: dict[str, list[str]] = {}
|
| 147 |
-
for orig, m in zip(ref_tokens, ref_match):
|
| 148 |
-
ref_orig_by_match.setdefault(m, []).append(orig)
|
| 149 |
-
|
| 150 |
-
missing_before = _missing_tokens(ref_match, before_match)
|
| 151 |
-
missing_after = _missing_tokens(ref_match, after_match)
|
| 152 |
-
|
| 153 |
-
n_errors_before = sum(missing_before.values())
|
| 154 |
-
n_errors_after = sum(missing_after.values())
|
| 155 |
-
|
| 156 |
-
# Calcul corrigé / introduit en multiset
|
| 157 |
-
corrected_counter: Counter = Counter()
|
| 158 |
-
introduced_counter: Counter = Counter()
|
| 159 |
-
kept_wrong_counter: Counter = Counter()
|
| 160 |
-
all_tokens = set(missing_before) | set(missing_after)
|
| 161 |
-
for tok in all_tokens:
|
| 162 |
-
nb = missing_before.get(tok, 0)
|
| 163 |
-
na = missing_after.get(tok, 0)
|
| 164 |
-
if nb > na:
|
| 165 |
-
corrected_counter[tok] = nb - na
|
| 166 |
-
kept_wrong_counter[tok] = na
|
| 167 |
-
elif na > nb:
|
| 168 |
-
introduced_counter[tok] = na - nb
|
| 169 |
-
kept_wrong_counter[tok] = nb
|
| 170 |
-
else:
|
| 171 |
-
kept_wrong_counter[tok] = nb
|
| 172 |
-
|
| 173 |
-
n_corrected = sum(corrected_counter.values())
|
| 174 |
-
n_introduced = sum(introduced_counter.values())
|
| 175 |
-
n_kept_wrong = sum(kept_wrong_counter.values())
|
| 176 |
-
|
| 177 |
-
correction_rate = (
|
| 178 |
-
n_corrected / n_errors_before
|
| 179 |
-
if n_errors_before > 0 else None
|
| 180 |
-
)
|
| 181 |
-
introduction_rate = (
|
| 182 |
-
n_introduced / n_errors_after
|
| 183 |
-
if n_errors_after > 0 else None
|
| 184 |
-
)
|
| 185 |
-
|
| 186 |
-
def _expand(counter: Counter) -> list[str]:
|
| 187 |
-
out: list[str] = []
|
| 188 |
-
for tok, count in counter.items():
|
| 189 |
-
origs = ref_orig_by_match.get(tok, [tok])
|
| 190 |
-
# Ne renvoie que la casse représentative GT
|
| 191 |
-
display = origs[0] if origs else tok
|
| 192 |
-
out.extend([display] * count)
|
| 193 |
-
return out
|
| 194 |
-
|
| 195 |
-
return {
|
| 196 |
-
"n_gt_tokens": len(ref_tokens),
|
| 197 |
-
"n_errors_before": n_errors_before,
|
| 198 |
-
"n_errors_after": n_errors_after,
|
| 199 |
-
"n_corrected": n_corrected,
|
| 200 |
-
"n_introduced": n_introduced,
|
| 201 |
-
"n_kept_wrong": n_kept_wrong,
|
| 202 |
-
"correction_rate": correction_rate,
|
| 203 |
-
"introduction_rate": introduction_rate,
|
| 204 |
-
"net_improvement": n_corrected - n_introduced,
|
| 205 |
-
"corrected_tokens": _expand(corrected_counter),
|
| 206 |
-
"introduced_tokens": _expand(introduced_counter),
|
| 207 |
-
}
|
| 208 |
-
|
| 209 |
-
|
| 210 |
-
def aggregate_error_absorption(
|
| 211 |
-
per_doc: Iterable[Optional[dict]],
|
| 212 |
-
*,
|
| 213 |
-
sample_tokens: int = 50,
|
| 214 |
-
) -> Optional[dict]:
|
| 215 |
-
"""Agrège les compteurs corpus-wide et recalcule les taux
|
| 216 |
-
*micro*.
|
| 217 |
-
|
| 218 |
-
Parameters
|
| 219 |
-
----------
|
| 220 |
-
per_doc:
|
| 221 |
-
Itérable de sorties de ``compute_error_absorption`` (ou
|
| 222 |
-
``None`` pour les docs sans GT).
|
| 223 |
-
sample_tokens:
|
| 224 |
-
Nombre maximal de tokens corrigés/introduits gardés dans
|
| 225 |
-
l'échantillon (cap pour ne pas exploser le JSON).
|
| 226 |
-
|
| 227 |
-
Returns
|
| 228 |
-
-------
|
| 229 |
-
dict | None
|
| 230 |
-
``None`` si aucune entry valide.
|
| 231 |
-
"""
|
| 232 |
-
docs = [d for d in per_doc if d]
|
| 233 |
-
if not docs:
|
| 234 |
-
return None
|
| 235 |
-
n_gt = sum(int(d.get("n_gt_tokens") or 0) for d in docs)
|
| 236 |
-
n_errors_before = sum(int(d.get("n_errors_before") or 0) for d in docs)
|
| 237 |
-
n_errors_after = sum(int(d.get("n_errors_after") or 0) for d in docs)
|
| 238 |
-
n_corrected = sum(int(d.get("n_corrected") or 0) for d in docs)
|
| 239 |
-
n_introduced = sum(int(d.get("n_introduced") or 0) for d in docs)
|
| 240 |
-
n_kept_wrong = sum(int(d.get("n_kept_wrong") or 0) for d in docs)
|
| 241 |
-
correction_rate = (
|
| 242 |
-
n_corrected / n_errors_before if n_errors_before > 0 else None
|
| 243 |
-
)
|
| 244 |
-
introduction_rate = (
|
| 245 |
-
n_introduced / n_errors_after if n_errors_after > 0 else None
|
| 246 |
-
)
|
| 247 |
-
corrected_sample: list[str] = []
|
| 248 |
-
introduced_sample: list[str] = []
|
| 249 |
-
for d in docs:
|
| 250 |
-
corrected_sample.extend(d.get("corrected_tokens") or [])
|
| 251 |
-
introduced_sample.extend(d.get("introduced_tokens") or [])
|
| 252 |
-
if (
|
| 253 |
-
len(corrected_sample) >= sample_tokens
|
| 254 |
-
and len(introduced_sample) >= sample_tokens
|
| 255 |
-
):
|
| 256 |
-
break
|
| 257 |
-
return {
|
| 258 |
-
"n_docs": len(docs),
|
| 259 |
-
"n_gt_tokens": n_gt,
|
| 260 |
-
"n_errors_before": n_errors_before,
|
| 261 |
-
"n_errors_after": n_errors_after,
|
| 262 |
-
"n_corrected": n_corrected,
|
| 263 |
-
"n_introduced": n_introduced,
|
| 264 |
-
"n_kept_wrong": n_kept_wrong,
|
| 265 |
-
"correction_rate": correction_rate,
|
| 266 |
-
"introduction_rate": introduction_rate,
|
| 267 |
-
"net_improvement": n_corrected - n_introduced,
|
| 268 |
-
"corrected_tokens_sample": corrected_sample[:sample_tokens],
|
| 269 |
-
"introduced_tokens_sample": introduced_sample[:sample_tokens],
|
| 270 |
-
}
|
| 271 |
-
|
| 272 |
-
|
| 273 |
-
__all__ = [
|
| 274 |
-
"compute_error_absorption",
|
| 275 |
-
"aggregate_error_absorption",
|
| 276 |
-
]
|
|
|
|
| 1 |
+
"""Re-export — Sprint A14-S10. Le contenu canonique vit dans
|
| 2 |
+
``picarones.evaluation.metrics.error_absorption``.
|
| 3 |
|
| 4 |
+
L'ancien chemin ``picarones.measurements.error_absorption`` est conservé pour
|
| 5 |
+
ne casser aucun consommateur. Au S22, ce re-export disparaîtra.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 6 |
"""
|
| 7 |
|
| 8 |
from __future__ import annotations
|
| 9 |
|
| 10 |
+
from picarones.evaluation.metrics.error_absorption import * # noqa: F401,F403
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
@@ -1,331 +1,10 @@
|
|
| 1 |
-
"""
|
|
|
|
| 2 |
|
| 3 |
-
|
| 4 |
-
-
|
| 5 |
-
- Taux d'insertion net : mots/caractères ajoutés absents du GT, distinct du WIL existant
|
| 6 |
-
- Ratio de longueur : len(hyp) / len(gt) — ratio > 1.2 → hallucination potentielle
|
| 7 |
-
- Score d'ancrage : proportion des n-grammes (trigrammes) de la sortie présents dans le GT
|
| 8 |
-
- Blocs hallucinés : segments continus de la sortie sans correspondance GT au-delà d'un seuil
|
| 9 |
-
- Badge hallucination : True si ancrage faible ou ratio de longueur anormal
|
| 10 |
"""
|
| 11 |
|
| 12 |
from __future__ import annotations
|
| 13 |
|
| 14 |
-
import
|
| 15 |
-
from dataclasses import dataclass
|
| 16 |
-
|
| 17 |
-
|
| 18 |
-
# ---------------------------------------------------------------------------
|
| 19 |
-
# Helpers texte
|
| 20 |
-
# ---------------------------------------------------------------------------
|
| 21 |
-
|
| 22 |
-
def _tokenize(text: str) -> list[str]:
|
| 23 |
-
"""Découpe en mots (minuscules, sans ponctuation)."""
|
| 24 |
-
return re.findall(r"[^\s]+", text.lower())
|
| 25 |
-
|
| 26 |
-
|
| 27 |
-
def _ngrams(tokens: list[str], n: int) -> list[tuple[str, ...]]:
|
| 28 |
-
"""Génère les n-grammes d'une liste de tokens."""
|
| 29 |
-
if len(tokens) < n:
|
| 30 |
-
return [tuple(tokens)] if tokens else []
|
| 31 |
-
return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
|
| 32 |
-
|
| 33 |
-
|
| 34 |
-
# ---------------------------------------------------------------------------
|
| 35 |
-
# Blocs hallucinés (segments continus sans ancrage)
|
| 36 |
-
# ---------------------------------------------------------------------------
|
| 37 |
-
|
| 38 |
-
@dataclass
|
| 39 |
-
class HallucinatedBlock:
|
| 40 |
-
"""Segment continu de la sortie sans correspondance dans le GT."""
|
| 41 |
-
start_token: int
|
| 42 |
-
end_token: int
|
| 43 |
-
text: str
|
| 44 |
-
length: int # nombre de tokens
|
| 45 |
-
|
| 46 |
-
def as_dict(self) -> dict:
|
| 47 |
-
return {
|
| 48 |
-
"start_token": self.start_token,
|
| 49 |
-
"end_token": self.end_token,
|
| 50 |
-
"text": self.text,
|
| 51 |
-
"length": self.length,
|
| 52 |
-
}
|
| 53 |
-
|
| 54 |
-
|
| 55 |
-
def _detect_hallucinated_blocks(
|
| 56 |
-
hyp_tokens: list[str],
|
| 57 |
-
gt_token_set: set[str],
|
| 58 |
-
tolerance: int = 3,
|
| 59 |
-
min_block_length: int = 4,
|
| 60 |
-
) -> list[HallucinatedBlock]:
|
| 61 |
-
"""Détecte les blocs de tokens hypothèse sans correspondance dans le GT.
|
| 62 |
-
|
| 63 |
-
Un bloc est un segment contigu de tokens hypothèse dont aucun n'est présent
|
| 64 |
-
dans le vocabulaire GT. Une tolérance de ``tolerance`` tokens connus interrompus
|
| 65 |
-
est acceptée avant de clore un bloc.
|
| 66 |
-
|
| 67 |
-
Parameters
|
| 68 |
-
----------
|
| 69 |
-
hyp_tokens:
|
| 70 |
-
Tokens de la sortie OCR/VLM.
|
| 71 |
-
gt_token_set:
|
| 72 |
-
Ensemble des tokens du GT (pour recherche O(1)).
|
| 73 |
-
tolerance:
|
| 74 |
-
Nombre de tokens connus consécutifs interrompant un bloc avant de le clore.
|
| 75 |
-
min_block_length:
|
| 76 |
-
Longueur minimale (tokens) pour qu'un bloc soit signalé.
|
| 77 |
-
|
| 78 |
-
Returns
|
| 79 |
-
-------
|
| 80 |
-
list[HallucinatedBlock]
|
| 81 |
-
"""
|
| 82 |
-
blocks: list[HallucinatedBlock] = []
|
| 83 |
-
if not hyp_tokens:
|
| 84 |
-
return blocks
|
| 85 |
-
|
| 86 |
-
in_block = False
|
| 87 |
-
block_start = 0
|
| 88 |
-
consecutive_known = 0
|
| 89 |
-
|
| 90 |
-
for i, tok in enumerate(hyp_tokens):
|
| 91 |
-
is_unknown = tok not in gt_token_set
|
| 92 |
-
if is_unknown:
|
| 93 |
-
if not in_block:
|
| 94 |
-
in_block = True
|
| 95 |
-
block_start = i
|
| 96 |
-
consecutive_known = 0
|
| 97 |
-
else:
|
| 98 |
-
consecutive_known = 0
|
| 99 |
-
else:
|
| 100 |
-
if in_block:
|
| 101 |
-
consecutive_known += 1
|
| 102 |
-
if consecutive_known >= tolerance:
|
| 103 |
-
# Clore le bloc
|
| 104 |
-
end = i - consecutive_known
|
| 105 |
-
length = end - block_start + 1
|
| 106 |
-
if length >= min_block_length:
|
| 107 |
-
text = " ".join(hyp_tokens[block_start:end + 1])
|
| 108 |
-
blocks.append(HallucinatedBlock(
|
| 109 |
-
start_token=block_start,
|
| 110 |
-
end_token=end,
|
| 111 |
-
text=text,
|
| 112 |
-
length=length,
|
| 113 |
-
))
|
| 114 |
-
in_block = False
|
| 115 |
-
consecutive_known = 0
|
| 116 |
-
|
| 117 |
-
# Bloc non terminé
|
| 118 |
-
if in_block:
|
| 119 |
-
end = len(hyp_tokens) - 1
|
| 120 |
-
length = end - block_start + 1
|
| 121 |
-
if length >= min_block_length:
|
| 122 |
-
text = " ".join(hyp_tokens[block_start:end + 1])
|
| 123 |
-
blocks.append(HallucinatedBlock(
|
| 124 |
-
start_token=block_start,
|
| 125 |
-
end_token=end,
|
| 126 |
-
text=text,
|
| 127 |
-
length=length,
|
| 128 |
-
))
|
| 129 |
-
|
| 130 |
-
return blocks
|
| 131 |
-
|
| 132 |
-
|
| 133 |
-
# ---------------------------------------------------------------------------
|
| 134 |
-
# Résultat structuré
|
| 135 |
-
# ---------------------------------------------------------------------------
|
| 136 |
-
|
| 137 |
-
@dataclass
|
| 138 |
-
class HallucinationMetrics:
|
| 139 |
-
"""Métriques de détection des hallucinations pour une paire (GT, hypothèse)."""
|
| 140 |
-
|
| 141 |
-
net_insertion_rate: float
|
| 142 |
-
"""Taux d'insertion nette : tokens hypothèse absents du GT / total tokens hypothèse."""
|
| 143 |
-
|
| 144 |
-
length_ratio: float
|
| 145 |
-
"""Ratio de longueur : len(hyp) / len(gt) en caractères. > 1.2 = signal d'hallucination."""
|
| 146 |
-
|
| 147 |
-
anchor_score: float
|
| 148 |
-
"""Score d'ancrage : proportion des trigrammes hypothèse présents dans les trigrammes GT.
|
| 149 |
-
Score élevé → l'hypothèse s'ancre bien dans le GT. Score faible → hallucinations probables."""
|
| 150 |
-
|
| 151 |
-
hallucinated_blocks: list[HallucinatedBlock]
|
| 152 |
-
"""Segments continus de la sortie sans correspondance GT (au-dessus du seuil de tolérance)."""
|
| 153 |
-
|
| 154 |
-
is_hallucinating: bool
|
| 155 |
-
"""True si anchor_score < anchor_threshold OU length_ratio > length_ratio_threshold."""
|
| 156 |
-
|
| 157 |
-
# Détails supplémentaires
|
| 158 |
-
gt_word_count: int = 0
|
| 159 |
-
hyp_word_count: int = 0
|
| 160 |
-
net_inserted_words: int = 0
|
| 161 |
-
anchor_threshold_used: float = 0.5
|
| 162 |
-
length_ratio_threshold_used: float = 1.2
|
| 163 |
-
ngram_size_used: int = 3
|
| 164 |
-
|
| 165 |
-
def as_dict(self) -> dict:
|
| 166 |
-
return {
|
| 167 |
-
"net_insertion_rate": round(self.net_insertion_rate, 6),
|
| 168 |
-
"length_ratio": round(self.length_ratio, 6),
|
| 169 |
-
"anchor_score": round(self.anchor_score, 6),
|
| 170 |
-
"hallucinated_blocks": [b.as_dict() for b in self.hallucinated_blocks],
|
| 171 |
-
"is_hallucinating": self.is_hallucinating,
|
| 172 |
-
"gt_word_count": self.gt_word_count,
|
| 173 |
-
"hyp_word_count": self.hyp_word_count,
|
| 174 |
-
"net_inserted_words": self.net_inserted_words,
|
| 175 |
-
"anchor_threshold_used": self.anchor_threshold_used,
|
| 176 |
-
"length_ratio_threshold_used": self.length_ratio_threshold_used,
|
| 177 |
-
"ngram_size_used": self.ngram_size_used,
|
| 178 |
-
}
|
| 179 |
-
|
| 180 |
-
@classmethod
|
| 181 |
-
def from_dict(cls, d: dict) -> "HallucinationMetrics":
|
| 182 |
-
blocks = [
|
| 183 |
-
HallucinatedBlock(**b) for b in d.get("hallucinated_blocks", [])
|
| 184 |
-
]
|
| 185 |
-
return cls(
|
| 186 |
-
net_insertion_rate=d.get("net_insertion_rate", 0.0),
|
| 187 |
-
length_ratio=d.get("length_ratio", 1.0),
|
| 188 |
-
anchor_score=d.get("anchor_score", 1.0),
|
| 189 |
-
hallucinated_blocks=blocks,
|
| 190 |
-
is_hallucinating=d.get("is_hallucinating", False),
|
| 191 |
-
gt_word_count=d.get("gt_word_count", 0),
|
| 192 |
-
hyp_word_count=d.get("hyp_word_count", 0),
|
| 193 |
-
net_inserted_words=d.get("net_inserted_words", 0),
|
| 194 |
-
anchor_threshold_used=d.get("anchor_threshold_used", 0.5),
|
| 195 |
-
length_ratio_threshold_used=d.get("length_ratio_threshold_used", 1.2),
|
| 196 |
-
ngram_size_used=d.get("ngram_size_used", 3),
|
| 197 |
-
)
|
| 198 |
-
|
| 199 |
-
|
| 200 |
-
# ---------------------------------------------------------------------------
|
| 201 |
-
# Calcul principal
|
| 202 |
-
# ---------------------------------------------------------------------------
|
| 203 |
-
|
| 204 |
-
def compute_hallucination_metrics(
|
| 205 |
-
reference: str,
|
| 206 |
-
hypothesis: str,
|
| 207 |
-
n: int = 3,
|
| 208 |
-
length_ratio_threshold: float = 1.2,
|
| 209 |
-
anchor_threshold: float = 0.5,
|
| 210 |
-
block_tolerance: int = 3,
|
| 211 |
-
min_block_length: int = 4,
|
| 212 |
-
) -> HallucinationMetrics:
|
| 213 |
-
"""Calcule les métriques de détection des hallucinations VLM/LLM.
|
| 214 |
-
|
| 215 |
-
Parameters
|
| 216 |
-
----------
|
| 217 |
-
reference:
|
| 218 |
-
Texte de vérité terrain (GT).
|
| 219 |
-
hypothesis:
|
| 220 |
-
Texte produit par le modèle.
|
| 221 |
-
n:
|
| 222 |
-
Taille des n-grammes pour le score d'ancrage (défaut : trigrammes).
|
| 223 |
-
length_ratio_threshold:
|
| 224 |
-
Seuil de ratio de longueur au-dessus duquel on signale une hallucination potentielle.
|
| 225 |
-
anchor_threshold:
|
| 226 |
-
Seuil de score d'ancrage en dessous duquel on signale une hallucination potentielle.
|
| 227 |
-
block_tolerance:
|
| 228 |
-
Nombre de tokens connus consécutifs acceptés dans un bloc halluciné.
|
| 229 |
-
min_block_length:
|
| 230 |
-
Longueur minimale (tokens) pour signaler un bloc halluciné.
|
| 231 |
-
|
| 232 |
-
Returns
|
| 233 |
-
-------
|
| 234 |
-
HallucinationMetrics
|
| 235 |
-
"""
|
| 236 |
-
gt_tokens = _tokenize(reference)
|
| 237 |
-
hyp_tokens = _tokenize(hypothesis)
|
| 238 |
-
|
| 239 |
-
gt_len_chars = len(reference.strip())
|
| 240 |
-
hyp_len_chars = len(hypothesis.strip())
|
| 241 |
-
|
| 242 |
-
# ── Ratio de longueur ────────────────────────────────────────────────
|
| 243 |
-
if gt_len_chars == 0:
|
| 244 |
-
length_ratio = 1.0 if hyp_len_chars == 0 else float("inf")
|
| 245 |
-
else:
|
| 246 |
-
length_ratio = hyp_len_chars / gt_len_chars
|
| 247 |
-
|
| 248 |
-
# ── Taux d'insertion nette ───────────────────────────────────────────
|
| 249 |
-
gt_token_set = set(gt_tokens)
|
| 250 |
-
hyp_token_count = len(hyp_tokens)
|
| 251 |
-
|
| 252 |
-
if hyp_token_count == 0:
|
| 253 |
-
net_insertion_rate = 0.0
|
| 254 |
-
net_inserted_words = 0
|
| 255 |
-
else:
|
| 256 |
-
net_inserted = [t for t in hyp_tokens if t not in gt_token_set]
|
| 257 |
-
net_inserted_words = len(net_inserted)
|
| 258 |
-
net_insertion_rate = net_inserted_words / hyp_token_count
|
| 259 |
-
|
| 260 |
-
# ── Score d'ancrage (n-grammes) ──────────────────────────────────────
|
| 261 |
-
gt_ngrams = set(_ngrams(gt_tokens, n))
|
| 262 |
-
hyp_ngrams = _ngrams(hyp_tokens, n)
|
| 263 |
-
|
| 264 |
-
if not hyp_ngrams:
|
| 265 |
-
# Pas de n-grammes dans l'hypothèse → ancrage parfait (hypothèse vide ou trop courte)
|
| 266 |
-
anchor_score = 1.0 if not gt_ngrams else 0.0
|
| 267 |
-
elif not gt_ngrams:
|
| 268 |
-
anchor_score = 0.0
|
| 269 |
-
else:
|
| 270 |
-
anchored = sum(1 for ng in hyp_ngrams if ng in gt_ngrams)
|
| 271 |
-
anchor_score = anchored / len(hyp_ngrams)
|
| 272 |
-
|
| 273 |
-
# ── Blocs hallucinés ─────────────────────────────────────────────────
|
| 274 |
-
blocks = _detect_hallucinated_blocks(
|
| 275 |
-
hyp_tokens=hyp_tokens,
|
| 276 |
-
gt_token_set=gt_token_set,
|
| 277 |
-
tolerance=block_tolerance,
|
| 278 |
-
min_block_length=min_block_length,
|
| 279 |
-
)
|
| 280 |
-
|
| 281 |
-
# ── Badge hallucination ──────────────────────────────────────────────
|
| 282 |
-
is_hallucinating = (
|
| 283 |
-
anchor_score < anchor_threshold
|
| 284 |
-
or length_ratio > length_ratio_threshold
|
| 285 |
-
)
|
| 286 |
-
|
| 287 |
-
return HallucinationMetrics(
|
| 288 |
-
net_insertion_rate=net_insertion_rate,
|
| 289 |
-
length_ratio=min(length_ratio, 9.99), # plafonner pour la sérialisation
|
| 290 |
-
anchor_score=anchor_score,
|
| 291 |
-
hallucinated_blocks=blocks,
|
| 292 |
-
is_hallucinating=is_hallucinating,
|
| 293 |
-
gt_word_count=len(gt_tokens),
|
| 294 |
-
hyp_word_count=hyp_token_count,
|
| 295 |
-
net_inserted_words=net_inserted_words,
|
| 296 |
-
anchor_threshold_used=anchor_threshold,
|
| 297 |
-
length_ratio_threshold_used=length_ratio_threshold,
|
| 298 |
-
ngram_size_used=n,
|
| 299 |
-
)
|
| 300 |
-
|
| 301 |
-
|
| 302 |
-
# ---------------------------------------------------------------------------
|
| 303 |
-
# Agrégation sur un corpus
|
| 304 |
-
# ---------------------------------------------------------------------------
|
| 305 |
-
|
| 306 |
-
def aggregate_hallucination_metrics(results: list[HallucinationMetrics]) -> dict:
|
| 307 |
-
"""Agrège les métriques d'hallucination sur un corpus.
|
| 308 |
-
|
| 309 |
-
Returns
|
| 310 |
-
-------
|
| 311 |
-
dict
|
| 312 |
-
Statistiques agrégées : anchor_score moyen, taux de documents hallucinés…
|
| 313 |
-
"""
|
| 314 |
-
if not results:
|
| 315 |
-
return {}
|
| 316 |
-
|
| 317 |
-
n = len(results)
|
| 318 |
-
anchor_values = [r.anchor_score for r in results]
|
| 319 |
-
ratio_values = [r.length_ratio for r in results]
|
| 320 |
-
insertion_values = [r.net_insertion_rate for r in results]
|
| 321 |
-
hallucinating_count = sum(1 for r in results if r.is_hallucinating)
|
| 322 |
-
|
| 323 |
-
return {
|
| 324 |
-
"anchor_score_mean": round(sum(anchor_values) / n, 6),
|
| 325 |
-
"anchor_score_min": round(min(anchor_values), 6),
|
| 326 |
-
"length_ratio_mean": round(sum(ratio_values) / n, 6),
|
| 327 |
-
"net_insertion_rate_mean": round(sum(insertion_values) / n, 6),
|
| 328 |
-
"hallucinating_doc_count": hallucinating_count,
|
| 329 |
-
"hallucinating_doc_rate": round(hallucinating_count / n, 6),
|
| 330 |
-
"document_count": n,
|
| 331 |
-
}
|
|
|
|
| 1 |
+
"""Re-export — Sprint A14-S10. Le contenu canonique vit dans
|
| 2 |
+
``picarones.evaluation.metrics.hallucination``.
|
| 3 |
|
| 4 |
+
L'ancien chemin ``picarones.measurements.hallucination`` est conservé
|
| 5 |
+
pour ne casser aucun consommateur. Au S22, ce re-export disparaîtra.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 6 |
"""
|
| 7 |
|
| 8 |
from __future__ import annotations
|
| 9 |
|
| 10 |
+
from picarones.evaluation.metrics.hallucination import * # noqa: F401,F403
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
@@ -1,283 +1,10 @@
|
|
| 1 |
-
"""
|
|
|
|
| 2 |
|
| 3 |
-
|
| 4 |
-
|
| 5 |
-
Pourquoi ce module
|
| 6 |
-
------------------
|
| 7 |
-
``image_quality`` (Sprint 5) mesure des features d'image
|
| 8 |
-
indépendamment ; ce module **les combine** pour produire deux
|
| 9 |
-
indicateurs corpus-level :
|
| 10 |
-
|
| 11 |
-
1. **Score de complexité paléographique** ∈ [0, 1]. Combine
|
| 12 |
-
bruit, faible netteté, faible contraste et rotation en un
|
| 13 |
-
indicateur unique de la difficulté intrinsèque pour un OCR.
|
| 14 |
-
0 = document trivial, 1 = document extrême. Permet
|
| 15 |
-
d'expliquer une partie du CER observé.
|
| 16 |
-
|
| 17 |
-
2. **Score d'homogénéité du corpus** ∈ [0, 1]. Variance des
|
| 18 |
-
features entre documents. 0 = corpus uniforme (la moyenne
|
| 19 |
-
globale du benchmark est fiable), 1 = corpus hétérogène
|
| 20 |
-
(la moyenne ment, il faut stratifier). Couplé au détecteur
|
| 21 |
-
``stratification_recommended`` (Sprint 46) qui agit sur
|
| 22 |
-
``script_type``.
|
| 23 |
-
|
| 24 |
-
Pondérations
|
| 25 |
-
------------
|
| 26 |
-
La roadmap propose une combinaison **pondérée** sans fixer les
|
| 27 |
-
poids — on adopte une convention éditoriale documentée :
|
| 28 |
-
|
| 29 |
-
- ``noise_level`` : poids 0.30 (bruit franc → CER ↑)
|
| 30 |
-
- ``1 - sharpness_score`` : poids 0.30 (flou → CER ↑)
|
| 31 |
-
- ``1 - contrast_score`` : poids 0.20 (faible contraste → CER ↑)
|
| 32 |
-
- ``|rotation_degrees|/30`` : poids 0.20 (rotation > 30° = pire)
|
| 33 |
-
|
| 34 |
-
Les poids somment à 1. L'utilisateur peut surcharger via
|
| 35 |
-
``weights={...}``.
|
| 36 |
-
|
| 37 |
-
Pas de prédiction CER absolue
|
| 38 |
-
-----------------------------
|
| 39 |
-
On ne prétend **pas** prédire une valeur CER en pourcentage —
|
| 40 |
-
ça demanderait un modèle entraîné par moteur, ce que la
|
| 41 |
-
philosophie banc d'essai exclut. On fournit un score relatif
|
| 42 |
-
qui se corrèle au CER observé pour une **lecture
|
| 43 |
-
diagnostique** : *« le document A est ~3× plus complexe que le
|
| 44 |
-
document B, ce qui est cohérent avec le CER observé. »*
|
| 45 |
"""
|
| 46 |
|
| 47 |
from __future__ import annotations
|
| 48 |
|
| 49 |
-
import
|
| 50 |
-
import math
|
| 51 |
-
import statistics
|
| 52 |
-
from typing import Iterable, Optional
|
| 53 |
-
|
| 54 |
-
logger = logging.getLogger(__name__)
|
| 55 |
-
|
| 56 |
-
|
| 57 |
-
# Poids éditoriaux par défaut.
|
| 58 |
-
DEFAULT_COMPLEXITY_WEIGHTS = {
|
| 59 |
-
"noise_level": 0.30,
|
| 60 |
-
"blur": 0.30, # 1 - sharpness_score
|
| 61 |
-
"low_contrast": 0.20, # 1 - contrast_score
|
| 62 |
-
"rotation": 0.20, # |rotation_degrees| / 30
|
| 63 |
-
}
|
| 64 |
-
|
| 65 |
-
|
| 66 |
-
# Plage de saturation pour la rotation. Au-delà de 30°, on
|
| 67 |
-
# considère que c'est aussi pire que pire.
|
| 68 |
-
_ROTATION_SATURATION_DEG = 30.0
|
| 69 |
-
|
| 70 |
-
|
| 71 |
-
def _clip01(x: float) -> float:
|
| 72 |
-
return max(0.0, min(1.0, x))
|
| 73 |
-
|
| 74 |
-
|
| 75 |
-
def _extract_feature(
|
| 76 |
-
quality: dict, key: str, default: float = 0.0,
|
| 77 |
-
) -> float:
|
| 78 |
-
val = quality.get(key, default)
|
| 79 |
-
if val is None:
|
| 80 |
-
return default
|
| 81 |
-
try:
|
| 82 |
-
return float(val)
|
| 83 |
-
except (TypeError, ValueError):
|
| 84 |
-
return default
|
| 85 |
-
|
| 86 |
-
|
| 87 |
-
def compute_paleographic_complexity(
|
| 88 |
-
quality: dict,
|
| 89 |
-
*,
|
| 90 |
-
weights: Optional[dict[str, float]] = None,
|
| 91 |
-
) -> Optional[dict]:
|
| 92 |
-
"""Score de complexité paléographique d'une image.
|
| 93 |
-
|
| 94 |
-
Parameters
|
| 95 |
-
----------
|
| 96 |
-
quality:
|
| 97 |
-
Dict ``ImageQualityResult.as_dict()`` ou compatible.
|
| 98 |
-
Champs lus : ``noise_level``, ``sharpness_score``,
|
| 99 |
-
``contrast_score``, ``rotation_degrees``.
|
| 100 |
-
weights:
|
| 101 |
-
Poids surchargeant les défauts. Doit contenir les
|
| 102 |
-
4 clés ``noise_level``, ``blur``, ``low_contrast``,
|
| 103 |
-
``rotation``. Les poids sont normalisés (somme = 1).
|
| 104 |
-
|
| 105 |
-
Returns
|
| 106 |
-
-------
|
| 107 |
-
dict | None
|
| 108 |
-
``{
|
| 109 |
-
"score": float, # ∈ [0, 1]
|
| 110 |
-
"components": {
|
| 111 |
-
"noise": float, "blur": float,
|
| 112 |
-
"low_contrast": float, "rotation": float,
|
| 113 |
-
},
|
| 114 |
-
"weights_used": dict,
|
| 115 |
-
}`` ou ``None`` si ``quality`` est falsy.
|
| 116 |
-
"""
|
| 117 |
-
if not quality:
|
| 118 |
-
return None
|
| 119 |
-
w = dict(DEFAULT_COMPLEXITY_WEIGHTS)
|
| 120 |
-
if weights:
|
| 121 |
-
for k in w:
|
| 122 |
-
if k in weights:
|
| 123 |
-
w[k] = float(weights[k])
|
| 124 |
-
total = sum(w.values())
|
| 125 |
-
if total <= 0:
|
| 126 |
-
return None
|
| 127 |
-
w = {k: v / total for k, v in w.items()}
|
| 128 |
-
noise = _clip01(_extract_feature(quality, "noise_level"))
|
| 129 |
-
sharpness = _clip01(_extract_feature(quality, "sharpness_score"))
|
| 130 |
-
contrast = _clip01(_extract_feature(quality, "contrast_score"))
|
| 131 |
-
rotation_deg = abs(_extract_feature(quality, "rotation_degrees"))
|
| 132 |
-
blur = 1.0 - sharpness
|
| 133 |
-
low_contrast = 1.0 - contrast
|
| 134 |
-
rotation = _clip01(rotation_deg / _ROTATION_SATURATION_DEG)
|
| 135 |
-
score = (
|
| 136 |
-
w["noise_level"] * noise
|
| 137 |
-
+ w["blur"] * blur
|
| 138 |
-
+ w["low_contrast"] * low_contrast
|
| 139 |
-
+ w["rotation"] * rotation
|
| 140 |
-
)
|
| 141 |
-
return {
|
| 142 |
-
"score": _clip01(score),
|
| 143 |
-
"components": {
|
| 144 |
-
"noise": noise,
|
| 145 |
-
"blur": blur,
|
| 146 |
-
"low_contrast": low_contrast,
|
| 147 |
-
"rotation": rotation,
|
| 148 |
-
},
|
| 149 |
-
"weights_used": w,
|
| 150 |
-
}
|
| 151 |
-
|
| 152 |
-
|
| 153 |
-
def compute_corpus_homogeneity(
|
| 154 |
-
image_qualities: Iterable[dict],
|
| 155 |
-
) -> Optional[dict]:
|
| 156 |
-
"""Score d'homogénéité du corpus ∈ [0, 1].
|
| 157 |
-
|
| 158 |
-
0 = corpus uniforme (faible variance entre documents),
|
| 159 |
-
1 = corpus hétérogène.
|
| 160 |
-
|
| 161 |
-
Méthode : pour chaque feature dans ``noise_level``,
|
| 162 |
-
``sharpness_score``, ``contrast_score``, ``rotation_degrees``,
|
| 163 |
-
on calcule l'écart-type *normalisé* sur les documents (par
|
| 164 |
-
une plage de référence), puis on prend la moyenne des 4.
|
| 165 |
-
|
| 166 |
-
Plages de normalisation :
|
| 167 |
-
- ``noise_level``, ``sharpness_score``, ``contrast_score``
|
| 168 |
-
∈ [0, 1] → écart-type / 0.5 (max théorique de l'écart-type
|
| 169 |
-
d'une distribution sur [0,1]) borné à 1.
|
| 170 |
-
- ``rotation_degrees`` → écart-type / 10°.
|
| 171 |
-
|
| 172 |
-
Parameters
|
| 173 |
-
----------
|
| 174 |
-
image_qualities:
|
| 175 |
-
Itérable de dicts ``ImageQualityResult.as_dict()``.
|
| 176 |
-
|
| 177 |
-
Returns
|
| 178 |
-
-------
|
| 179 |
-
dict | None
|
| 180 |
-
``{
|
| 181 |
-
"score": float, # ∈ [0, 1]
|
| 182 |
-
"n_docs": int,
|
| 183 |
-
"per_feature": {
|
| 184 |
-
feature: {"mean": float, "stdev": float,
|
| 185 |
-
"normalised": float},
|
| 186 |
-
},
|
| 187 |
-
}`` ou ``None`` si moins de 2 documents.
|
| 188 |
-
"""
|
| 189 |
-
docs = [q for q in image_qualities if q]
|
| 190 |
-
if len(docs) < 2:
|
| 191 |
-
return None
|
| 192 |
-
features = (
|
| 193 |
-
("noise_level", 0.5),
|
| 194 |
-
("sharpness_score", 0.5),
|
| 195 |
-
("contrast_score", 0.5),
|
| 196 |
-
("rotation_degrees", 10.0),
|
| 197 |
-
)
|
| 198 |
-
per_feature: dict[str, dict] = {}
|
| 199 |
-
norm_stdevs: list[float] = []
|
| 200 |
-
for key, divisor in features:
|
| 201 |
-
values = [
|
| 202 |
-
_extract_feature(q, key)
|
| 203 |
-
for q in docs
|
| 204 |
-
]
|
| 205 |
-
if not values:
|
| 206 |
-
continue
|
| 207 |
-
mean = statistics.fmean(values)
|
| 208 |
-
try:
|
| 209 |
-
stdev = statistics.stdev(values) if len(values) >= 2 else 0.0
|
| 210 |
-
except statistics.StatisticsError:
|
| 211 |
-
stdev = 0.0
|
| 212 |
-
normalised = _clip01(stdev / divisor) if divisor > 0 else 0.0
|
| 213 |
-
per_feature[key] = {
|
| 214 |
-
"mean": mean,
|
| 215 |
-
"stdev": stdev,
|
| 216 |
-
"normalised": normalised,
|
| 217 |
-
}
|
| 218 |
-
norm_stdevs.append(normalised)
|
| 219 |
-
if not norm_stdevs:
|
| 220 |
-
return None
|
| 221 |
-
score = statistics.fmean(norm_stdevs)
|
| 222 |
-
return {
|
| 223 |
-
"score": _clip01(score),
|
| 224 |
-
"n_docs": len(docs),
|
| 225 |
-
"per_feature": per_feature,
|
| 226 |
-
}
|
| 227 |
-
|
| 228 |
-
|
| 229 |
-
def aggregate_corpus_predictive(
|
| 230 |
-
image_qualities: Iterable[dict],
|
| 231 |
-
*,
|
| 232 |
-
weights: Optional[dict[str, float]] = None,
|
| 233 |
-
) -> Optional[dict]:
|
| 234 |
-
"""Synthèse corpus-wide : complexité moyenne + homogénéité.
|
| 235 |
-
|
| 236 |
-
Returns
|
| 237 |
-
-------
|
| 238 |
-
dict | None
|
| 239 |
-
``{
|
| 240 |
-
"n_docs": int,
|
| 241 |
-
"complexity_mean": float,
|
| 242 |
-
"complexity_median": float,
|
| 243 |
-
"complexity_min": float,
|
| 244 |
-
"complexity_max": float,
|
| 245 |
-
"complexity_stdev": float,
|
| 246 |
-
"homogeneity": dict, # sortie de
|
| 247 |
-
# compute_corpus_homogeneity
|
| 248 |
-
}`` ou ``None`` si moins d'un document.
|
| 249 |
-
"""
|
| 250 |
-
docs = [q for q in image_qualities if q]
|
| 251 |
-
if not docs:
|
| 252 |
-
return None
|
| 253 |
-
scores: list[float] = []
|
| 254 |
-
for q in docs:
|
| 255 |
-
result = compute_paleographic_complexity(q, weights=weights)
|
| 256 |
-
if result is not None:
|
| 257 |
-
scores.append(float(result["score"]))
|
| 258 |
-
if not scores:
|
| 259 |
-
return None
|
| 260 |
-
homogeneity = compute_corpus_homogeneity(docs)
|
| 261 |
-
return {
|
| 262 |
-
"n_docs": len(docs),
|
| 263 |
-
"complexity_mean": statistics.fmean(scores),
|
| 264 |
-
"complexity_median": statistics.median(scores),
|
| 265 |
-
"complexity_min": min(scores),
|
| 266 |
-
"complexity_max": max(scores),
|
| 267 |
-
"complexity_stdev": (
|
| 268 |
-
statistics.stdev(scores) if len(scores) >= 2 else 0.0
|
| 269 |
-
),
|
| 270 |
-
"homogeneity": homogeneity,
|
| 271 |
-
}
|
| 272 |
-
|
| 273 |
-
|
| 274 |
-
__all__ = [
|
| 275 |
-
"DEFAULT_COMPLEXITY_WEIGHTS",
|
| 276 |
-
"compute_paleographic_complexity",
|
| 277 |
-
"compute_corpus_homogeneity",
|
| 278 |
-
"aggregate_corpus_predictive",
|
| 279 |
-
]
|
| 280 |
-
|
| 281 |
-
|
| 282 |
-
# Évite warning import inutilisé
|
| 283 |
-
_ = math
|
|
|
|
| 1 |
+
"""Re-export — Sprint A14-S10. Le contenu canonique vit dans
|
| 2 |
+
``picarones.evaluation.metrics.image_predictive``.
|
| 3 |
|
| 4 |
+
L'ancien chemin ``picarones.measurements.image_predictive`` est conservé pour
|
| 5 |
+
ne casser aucun consommateur. Au S22, ce re-export disparaîtra.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 6 |
"""
|
| 7 |
|
| 8 |
from __future__ import annotations
|
| 9 |
|
| 10 |
+
from picarones.evaluation.metrics.image_predictive import * # noqa: F401,F403
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
@@ -1,391 +1,14 @@
|
|
| 1 |
-
"""
|
|
|
|
| 2 |
|
| 3 |
-
|
| 4 |
-
-
|
| 5 |
-
- **Score de netteté** : variance du laplacien (plus élevé = plus net)
|
| 6 |
-
- **Niveau de bruit** : écart-type des résidus haute-fréquence
|
| 7 |
-
- **Angle de rotation résiduel** : estimé par projection horizontale
|
| 8 |
-
- **Score de contraste** : ratio Michelson entre zones sombres (encre) et claires (fond)
|
| 9 |
-
- **Score de qualité global** : combinaison normalisée des métriques ci-dessus
|
| 10 |
|
| 11 |
-
|
| 12 |
-
|
| 13 |
-
de fallback n'en dépendent pas.
|
| 14 |
-
|
| 15 |
-
Note
|
| 16 |
-
----
|
| 17 |
-
Pour les images placeholder (fixtures), des valeurs fictives cohérentes
|
| 18 |
-
sont générées via `generate_mock_quality_scores()`.
|
| 19 |
"""
|
| 20 |
|
| 21 |
from __future__ import annotations
|
| 22 |
|
| 23 |
-
import
|
| 24 |
-
import
|
| 25 |
-
import statistics
|
| 26 |
-
from dataclasses import dataclass
|
| 27 |
-
from pathlib import Path
|
| 28 |
-
from typing import Optional
|
| 29 |
-
|
| 30 |
-
logger = logging.getLogger(__name__)
|
| 31 |
-
|
| 32 |
-
|
| 33 |
-
@dataclass
|
| 34 |
-
class ImageQualityResult:
|
| 35 |
-
"""Métriques de qualité d'une image de document."""
|
| 36 |
-
|
| 37 |
-
sharpness_score: float = 0.0
|
| 38 |
-
"""Score de netteté [0, 1]. Basé sur la variance du laplacien normalisée."""
|
| 39 |
-
|
| 40 |
-
noise_level: float = 0.0
|
| 41 |
-
"""Niveau de bruit [0, 1]. 0 = pas de bruit, 1 = très bruité."""
|
| 42 |
-
|
| 43 |
-
rotation_degrees: float = 0.0
|
| 44 |
-
"""Angle de rotation résiduel estimé en degrés (positif = sens horaire)."""
|
| 45 |
-
|
| 46 |
-
contrast_score: float = 0.0
|
| 47 |
-
"""Score de contraste [0, 1]. Ratio Michelson encre/fond."""
|
| 48 |
-
|
| 49 |
-
quality_score: float = 0.0
|
| 50 |
-
"""Score de qualité global [0, 1]. Combinaison pondérée des autres métriques."""
|
| 51 |
-
|
| 52 |
-
analysis_method: str = "none"
|
| 53 |
-
"""Méthode d'analyse utilisée : 'pillow', 'numpy', 'mock'."""
|
| 54 |
-
|
| 55 |
-
error: Optional[str] = None
|
| 56 |
-
"""Erreur si l'analyse a échoué."""
|
| 57 |
-
|
| 58 |
-
@property
|
| 59 |
-
def is_good_quality(self) -> bool:
|
| 60 |
-
"""Vrai si le score de qualité global est ≥ 0.7."""
|
| 61 |
-
return self.quality_score >= 0.7
|
| 62 |
-
|
| 63 |
-
@property
|
| 64 |
-
def quality_tier(self) -> str:
|
| 65 |
-
"""Catégorie de qualité : 'good', 'medium', 'poor'."""
|
| 66 |
-
if self.quality_score >= 0.7:
|
| 67 |
-
return "good"
|
| 68 |
-
elif self.quality_score >= 0.4:
|
| 69 |
-
return "medium"
|
| 70 |
-
return "poor"
|
| 71 |
-
|
| 72 |
-
def as_dict(self) -> dict:
|
| 73 |
-
d = {
|
| 74 |
-
"sharpness_score": round(self.sharpness_score, 4),
|
| 75 |
-
"noise_level": round(self.noise_level, 4),
|
| 76 |
-
"rotation_degrees": round(self.rotation_degrees, 2),
|
| 77 |
-
"contrast_score": round(self.contrast_score, 4),
|
| 78 |
-
"quality_score": round(self.quality_score, 4),
|
| 79 |
-
"quality_tier": self.quality_tier,
|
| 80 |
-
"analysis_method": self.analysis_method,
|
| 81 |
-
}
|
| 82 |
-
if self.error:
|
| 83 |
-
d["error"] = self.error
|
| 84 |
-
return d
|
| 85 |
-
|
| 86 |
-
@classmethod
|
| 87 |
-
def from_dict(cls, data: dict) -> "ImageQualityResult":
|
| 88 |
-
return cls(
|
| 89 |
-
sharpness_score=data.get("sharpness_score", 0.0),
|
| 90 |
-
noise_level=data.get("noise_level", 0.0),
|
| 91 |
-
rotation_degrees=data.get("rotation_degrees", 0.0),
|
| 92 |
-
contrast_score=data.get("contrast_score", 0.0),
|
| 93 |
-
quality_score=data.get("quality_score", 0.0),
|
| 94 |
-
analysis_method=data.get("analysis_method", "none"),
|
| 95 |
-
error=data.get("error"),
|
| 96 |
-
)
|
| 97 |
-
|
| 98 |
-
|
| 99 |
-
def analyze_image_quality(image_path: str | Path) -> ImageQualityResult:
|
| 100 |
-
"""Analyse la qualité d'une image de document numérisé.
|
| 101 |
-
|
| 102 |
-
Essaie successivement :
|
| 103 |
-
1. Pillow + NumPy (méthode complète)
|
| 104 |
-
2. Pillow seul (méthode simplifiée)
|
| 105 |
-
3. Fallback : retourne un résultat vide avec erreur
|
| 106 |
-
|
| 107 |
-
Parameters
|
| 108 |
-
----------
|
| 109 |
-
image_path:
|
| 110 |
-
Chemin vers l'image (JPG, PNG, TIFF…).
|
| 111 |
-
|
| 112 |
-
Returns
|
| 113 |
-
-------
|
| 114 |
-
ImageQualityResult
|
| 115 |
-
"""
|
| 116 |
-
path = Path(image_path)
|
| 117 |
-
if not path.exists():
|
| 118 |
-
return ImageQualityResult(
|
| 119 |
-
error=f"Fichier image introuvable : {image_path}",
|
| 120 |
-
analysis_method="none",
|
| 121 |
-
)
|
| 122 |
-
|
| 123 |
-
# Essai avec Pillow + NumPy
|
| 124 |
-
try:
|
| 125 |
-
import numpy as np
|
| 126 |
-
from PIL import Image
|
| 127 |
-
return _analyze_with_numpy(path, np, Image)
|
| 128 |
-
except ImportError:
|
| 129 |
-
pass
|
| 130 |
-
|
| 131 |
-
# Essai avec Pillow seul
|
| 132 |
-
try:
|
| 133 |
-
from PIL import Image
|
| 134 |
-
return _analyze_with_pillow(path, Image)
|
| 135 |
-
except ImportError:
|
| 136 |
-
pass
|
| 137 |
-
|
| 138 |
-
return ImageQualityResult(
|
| 139 |
-
error="Pillow non disponible (pip install Pillow)",
|
| 140 |
-
analysis_method="none",
|
| 141 |
-
quality_score=0.5, # valeur neutre
|
| 142 |
-
)
|
| 143 |
-
|
| 144 |
-
|
| 145 |
-
def _analyze_with_numpy(path: Path, np, Image) -> ImageQualityResult:
|
| 146 |
-
"""Analyse complète avec NumPy."""
|
| 147 |
-
img = Image.open(path).convert("L") # niveaux de gris
|
| 148 |
-
arr = np.array(img, dtype=np.float32)
|
| 149 |
-
|
| 150 |
-
# 1. Netteté : variance du laplacien
|
| 151 |
-
laplacian = _laplacian_variance_numpy(arr, np)
|
| 152 |
-
# Normalisation empirique : variance > 500 = très net, < 50 = flou
|
| 153 |
-
sharpness = min(1.0, laplacian / 500.0)
|
| 154 |
-
|
| 155 |
-
# 2. Bruit : écart-type des résidus (différence image - image lissée)
|
| 156 |
-
noise = _noise_level_numpy(arr, np)
|
| 157 |
-
|
| 158 |
-
# 3. Rotation : angle d'inclinaison estimé
|
| 159 |
-
rotation = _estimate_rotation_numpy(arr, np)
|
| 160 |
-
|
| 161 |
-
# 4. Contraste : ratio Michelson
|
| 162 |
-
contrast = _contrast_score_numpy(arr, np)
|
| 163 |
-
|
| 164 |
-
# 5. Score global pondéré
|
| 165 |
-
quality = _global_quality_score(sharpness, noise, abs(rotation), contrast)
|
| 166 |
-
|
| 167 |
-
return ImageQualityResult(
|
| 168 |
-
sharpness_score=float(sharpness),
|
| 169 |
-
noise_level=float(noise),
|
| 170 |
-
rotation_degrees=float(rotation),
|
| 171 |
-
contrast_score=float(contrast),
|
| 172 |
-
quality_score=float(quality),
|
| 173 |
-
analysis_method="numpy",
|
| 174 |
-
)
|
| 175 |
-
|
| 176 |
-
|
| 177 |
-
def _analyze_with_pillow(path: Path, Image) -> ImageQualityResult:
|
| 178 |
-
"""Analyse simplifiée avec Pillow seul (sans NumPy)."""
|
| 179 |
-
img = Image.open(path).convert("L")
|
| 180 |
-
pixels = list(img.tobytes()) # mode "L" = 1 byte/pixel
|
| 181 |
-
w, h = img.size
|
| 182 |
-
|
| 183 |
-
if not pixels:
|
| 184 |
-
return ImageQualityResult(quality_score=0.5, analysis_method="pillow")
|
| 185 |
-
|
| 186 |
-
# Contraste : étendue des valeurs
|
| 187 |
-
min_val = min(pixels)
|
| 188 |
-
max_val = max(pixels)
|
| 189 |
-
if max_val + min_val > 0:
|
| 190 |
-
contrast = (max_val - min_val) / (max_val + min_val)
|
| 191 |
-
else:
|
| 192 |
-
contrast = 0.0
|
| 193 |
-
|
| 194 |
-
# Netteté approximée : variance globale des pixels
|
| 195 |
-
try:
|
| 196 |
-
variance = statistics.variance(pixels)
|
| 197 |
-
except statistics.StatisticsError:
|
| 198 |
-
variance = 0.0
|
| 199 |
-
sharpness = min(1.0, math.sqrt(variance) / 128.0)
|
| 200 |
-
|
| 201 |
-
# Bruit : approximation grossière
|
| 202 |
-
noise = min(1.0, statistics.stdev(pixels[:min(1000, len(pixels))]) / 64.0) if len(pixels) > 1 else 0.0
|
| 203 |
-
|
| 204 |
-
quality = _global_quality_score(sharpness, noise, 0.0, contrast)
|
| 205 |
-
|
| 206 |
-
return ImageQualityResult(
|
| 207 |
-
sharpness_score=sharpness,
|
| 208 |
-
noise_level=noise,
|
| 209 |
-
rotation_degrees=0.0, # non calculé sans NumPy
|
| 210 |
-
contrast_score=contrast,
|
| 211 |
-
quality_score=quality,
|
| 212 |
-
analysis_method="pillow",
|
| 213 |
-
)
|
| 214 |
-
|
| 215 |
-
|
| 216 |
-
def _laplacian_variance_numpy(arr, np) -> float:
|
| 217 |
-
"""Calcule la variance du laplacien (mesure de netteté)."""
|
| 218 |
-
# Convolution laplacien 3x3 via slicing (bordures ignorées)
|
| 219 |
-
h, w = arr.shape
|
| 220 |
-
if h < 3 or w < 3:
|
| 221 |
-
return float(np.var(arr))
|
| 222 |
-
|
| 223 |
-
# Utiliser une convolution rapide avec slicing
|
| 224 |
-
center = arr[1:-1, 1:-1]
|
| 225 |
-
top = arr[:-2, 1:-1]
|
| 226 |
-
bottom = arr[2:, 1:-1]
|
| 227 |
-
left = arr[1:-1, :-2]
|
| 228 |
-
right = arr[1:-1, 2:]
|
| 229 |
-
lap = top + bottom + left + right - 4 * center
|
| 230 |
-
|
| 231 |
-
return float(np.var(lap))
|
| 232 |
-
|
| 233 |
-
|
| 234 |
-
def _noise_level_numpy(arr, np) -> float:
|
| 235 |
-
"""Estime le niveau de bruit par la MAD (Median Absolute Deviation) des gradients."""
|
| 236 |
-
h, w = arr.shape
|
| 237 |
-
if h < 2 or w < 2:
|
| 238 |
-
return 0.0
|
| 239 |
-
# Différences horizontales et verticales
|
| 240 |
-
diff_h = np.abs(arr[:, 1:] - arr[:, :-1])
|
| 241 |
-
diff_v = np.abs(arr[1:, :] - arr[:-1, :])
|
| 242 |
-
noise_std = float(np.median(np.concatenate([diff_h.ravel(), diff_v.ravel()])))
|
| 243 |
-
# Normaliser : 0 = pas de bruit, 1 = très bruité (seuil à ~30)
|
| 244 |
-
return min(1.0, noise_std / 30.0)
|
| 245 |
-
|
| 246 |
-
|
| 247 |
-
def _estimate_rotation_numpy(arr, np) -> float:
|
| 248 |
-
"""Estime l'angle de rotation par projection horizontale simplifiée.
|
| 249 |
-
|
| 250 |
-
Retourne l'angle estimé en degrés [-45, 45].
|
| 251 |
-
"""
|
| 252 |
-
# Méthode simplifiée : analyse de la variance des projections à différents angles
|
| 253 |
-
# Limiter à quelques angles pour la performance
|
| 254 |
-
h, w = arr.shape
|
| 255 |
-
if h < 20 or w < 20:
|
| 256 |
-
return 0.0
|
| 257 |
-
|
| 258 |
-
# Sous-échantillonnage pour la performance
|
| 259 |
-
step = max(1, h // 100)
|
| 260 |
-
sample = arr[::step, :]
|
| 261 |
-
|
| 262 |
-
best_angle = 0.0
|
| 263 |
-
best_var = -1.0
|
| 264 |
-
|
| 265 |
-
for angle_deg in range(-5, 6): # ±5 degrés, pas de 1°
|
| 266 |
-
angle_rad = math.radians(angle_deg)
|
| 267 |
-
# Projection horizontale après rotation approximative
|
| 268 |
-
# (approximation linéaire rapide)
|
| 269 |
-
offsets = np.round(
|
| 270 |
-
np.arange(sample.shape[0]) * math.tan(angle_rad)
|
| 271 |
-
).astype(int)
|
| 272 |
-
offsets = np.clip(offsets, 0, w - 1)
|
| 273 |
-
|
| 274 |
-
# Variance des sommes de lignes décalées
|
| 275 |
-
try:
|
| 276 |
-
row_sums = np.array([
|
| 277 |
-
float(np.sum(sample[i, max(0, offsets[i]):min(w, offsets[i]+w)]))
|
| 278 |
-
for i in range(sample.shape[0])
|
| 279 |
-
])
|
| 280 |
-
var = float(np.var(row_sums))
|
| 281 |
-
if var > best_var:
|
| 282 |
-
best_var = var
|
| 283 |
-
best_angle = float(angle_deg)
|
| 284 |
-
except Exception as e:
|
| 285 |
-
logger.warning(
|
| 286 |
-
"[image_quality] projection à %d° indisponible : %s",
|
| 287 |
-
angle_deg, e,
|
| 288 |
-
)
|
| 289 |
-
|
| 290 |
-
return best_angle
|
| 291 |
-
|
| 292 |
-
|
| 293 |
-
def _contrast_score_numpy(arr, np) -> float:
|
| 294 |
-
"""Score de contraste Michelson [0, 1]."""
|
| 295 |
-
p5 = float(np.percentile(arr, 5)) # fond clair
|
| 296 |
-
p95 = float(np.percentile(arr, 95)) # encre sombre
|
| 297 |
-
if p5 + p95 == 0:
|
| 298 |
-
return 0.0
|
| 299 |
-
# Michelson : (Imax - Imin) / (Imax + Imin)
|
| 300 |
-
return float((p95 - p5) / (p95 + p5))
|
| 301 |
-
|
| 302 |
-
|
| 303 |
-
def _global_quality_score(
|
| 304 |
-
sharpness: float,
|
| 305 |
-
noise: float,
|
| 306 |
-
rotation_abs: float,
|
| 307 |
-
contrast: float,
|
| 308 |
-
) -> float:
|
| 309 |
-
"""Calcule le score de qualité global pondéré."""
|
| 310 |
-
# Poids : netteté (40%), contraste (30%), bruit (20%), rotation (10%)
|
| 311 |
-
score = (
|
| 312 |
-
0.40 * sharpness
|
| 313 |
-
+ 0.30 * contrast
|
| 314 |
-
+ 0.20 * (1.0 - noise) # moins de bruit = mieux
|
| 315 |
-
+ 0.10 * max(0.0, 1.0 - rotation_abs / 10.0) # ±10° max
|
| 316 |
-
)
|
| 317 |
-
return round(min(1.0, max(0.0, score)), 4)
|
| 318 |
-
|
| 319 |
-
|
| 320 |
-
# ---------------------------------------------------------------------------
|
| 321 |
-
# Données fictives pour les fixtures de démo
|
| 322 |
-
# ---------------------------------------------------------------------------
|
| 323 |
-
|
| 324 |
-
def generate_mock_quality_scores(
|
| 325 |
-
doc_id: str,
|
| 326 |
-
seed: Optional[int] = None,
|
| 327 |
-
) -> ImageQualityResult:
|
| 328 |
-
"""Génère des métriques de qualité fictives mais cohérentes pour un document.
|
| 329 |
-
|
| 330 |
-
Utilisé par les fixtures de démo pour simuler une diversité réaliste
|
| 331 |
-
de qualités d'image (bonne, moyenne, dégradée).
|
| 332 |
-
|
| 333 |
-
Parameters
|
| 334 |
-
----------
|
| 335 |
-
doc_id:
|
| 336 |
-
Identifiant du document (utilisé pour la reproductibilité).
|
| 337 |
-
seed:
|
| 338 |
-
Graine aléatoire optionnelle.
|
| 339 |
-
"""
|
| 340 |
-
import random
|
| 341 |
-
rng = random.Random(seed or hash(doc_id) % 2**32)
|
| 342 |
-
|
| 343 |
-
# Générer une qualité cohérente : certains docs sont plus difficiles
|
| 344 |
-
base_quality = 0.3 + rng.random() * 0.6 # 0.3 à 0.9
|
| 345 |
-
|
| 346 |
-
sharpness = max(0.1, min(1.0, base_quality + rng.gauss(0, 0.1)))
|
| 347 |
-
noise = max(0.0, min(1.0, (1.0 - base_quality) * 0.8 + rng.gauss(0, 0.05)))
|
| 348 |
-
rotation = rng.gauss(0, 1.5) # ±1.5° typique
|
| 349 |
-
contrast = max(0.2, min(1.0, base_quality + rng.gauss(0, 0.15)))
|
| 350 |
-
|
| 351 |
-
quality = _global_quality_score(sharpness, noise, abs(rotation), contrast)
|
| 352 |
-
|
| 353 |
-
return ImageQualityResult(
|
| 354 |
-
sharpness_score=round(sharpness, 4),
|
| 355 |
-
noise_level=round(noise, 4),
|
| 356 |
-
rotation_degrees=round(rotation, 2),
|
| 357 |
-
contrast_score=round(contrast, 4),
|
| 358 |
-
quality_score=round(quality, 4),
|
| 359 |
-
analysis_method="mock",
|
| 360 |
-
)
|
| 361 |
-
|
| 362 |
-
|
| 363 |
-
def aggregate_image_quality(results: list[ImageQualityResult]) -> dict:
|
| 364 |
-
"""Agrège les métriques de qualité image sur un corpus."""
|
| 365 |
-
if not results:
|
| 366 |
-
return {}
|
| 367 |
-
|
| 368 |
-
valid = [r for r in results if r.error is None]
|
| 369 |
-
if not valid:
|
| 370 |
-
return {"error": "Aucune analyse réussie"}
|
| 371 |
-
|
| 372 |
-
def _mean(vals: list[float]) -> float:
|
| 373 |
-
return round(statistics.mean(vals), 4) if vals else 0.0
|
| 374 |
-
|
| 375 |
-
quality_scores = [r.quality_score for r in valid]
|
| 376 |
-
sharpness_scores = [r.sharpness_score for r in valid]
|
| 377 |
-
noise_levels = [r.noise_level for r in valid]
|
| 378 |
-
|
| 379 |
-
# Distribution par tier
|
| 380 |
-
tiers = {"good": 0, "medium": 0, "poor": 0}
|
| 381 |
-
for r in valid:
|
| 382 |
-
tiers[r.quality_tier] += 1
|
| 383 |
-
|
| 384 |
-
return {
|
| 385 |
-
"mean_quality_score": _mean(quality_scores),
|
| 386 |
-
"mean_sharpness": _mean(sharpness_scores),
|
| 387 |
-
"mean_noise_level": _mean(noise_levels),
|
| 388 |
-
"quality_distribution": tiers,
|
| 389 |
-
"document_count": len(valid),
|
| 390 |
-
"scores": [r.quality_score for r in valid], # pour scatter plot
|
| 391 |
-
}
|
|
|
|
| 1 |
+
"""Re-export — Sprint A14-S10. Le contenu canonique vit dans
|
| 2 |
+
``picarones.evaluation.metrics.image_quality``.
|
| 3 |
|
| 4 |
+
L'ancien chemin ``picarones.measurements.image_quality`` est conservé
|
| 5 |
+
pour ne casser aucun consommateur. Au S22, ce re-export disparaîtra.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 6 |
|
| 7 |
+
Ré-expose explicitement ``_global_quality_score`` (symbole privé
|
| 8 |
+
utilisé downstream).
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 9 |
"""
|
| 10 |
|
| 11 |
from __future__ import annotations
|
| 12 |
|
| 13 |
+
from picarones.evaluation.metrics.image_quality import * # noqa: F401,F403
|
| 14 |
+
from picarones.evaluation.metrics.image_quality import _global_quality_score # noqa: F401
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
@@ -1,253 +1,10 @@
|
|
| 1 |
-
"""
|
|
|
|
| 2 |
|
| 3 |
-
|
| 4 |
-
|
| 5 |
-
Pourquoi ce module
|
| 6 |
-
------------------
|
| 7 |
-
Avec 5 OCR × 3 reconstructeurs × 4 post-correcteurs × 3
|
| 8 |
-
mappeurs = 180 pipelines à comparer, le rapport noie
|
| 9 |
-
l'information. Il faut un mécanisme de **comparaison
|
| 10 |
-
contrôlée** type design d'expérience.
|
| 11 |
-
|
| 12 |
-
Méthode
|
| 13 |
-
-------
|
| 14 |
-
Pour mesurer l'effet isolé d'un slot ``varying`` :
|
| 15 |
-
|
| 16 |
-
1. Fixer les valeurs des autres slots (``fixed``).
|
| 17 |
-
2. Pour chaque combinaison des fixed, comparer les pipelines
|
| 18 |
-
qui ne diffèrent que sur le slot varying.
|
| 19 |
-
3. Agréger : pour chaque valeur du slot varying, calculer
|
| 20 |
-
sa moyenne, son écart-type, son rang moyen sur les groupes.
|
| 21 |
-
|
| 22 |
-
C'est presque un Latin square automatisé. Sans ça, le
|
| 23 |
-
rapport sur 180 pipelines est inutilisable.
|
| 24 |
-
|
| 25 |
-
Pas de tests statistiques scipy
|
| 26 |
-
-------------------------------
|
| 27 |
-
On ne reconstruit pas Friedman/Nemenyi (déjà dans Sprint 18) ;
|
| 28 |
-
on agrège ici les données nécessaires pour qu'un
|
| 29 |
-
tests statistique externe puisse les consommer. Le rapport
|
| 30 |
-
existant reste libre de brancher
|
| 31 |
-
``picarones.measurements.statistics.friedman_test`` sur la sortie de
|
| 32 |
-
ce module.
|
| 33 |
-
|
| 34 |
-
Sortie
|
| 35 |
-
------
|
| 36 |
-
``compare_isolated_effect(runs, varying_slot)`` retourne :
|
| 37 |
-
|
| 38 |
-
.. code-block:: text
|
| 39 |
-
|
| 40 |
-
{
|
| 41 |
-
"varying_slot": str,
|
| 42 |
-
"n_runs": int,
|
| 43 |
-
"n_groups": int, # combinaisons fixed distinctes
|
| 44 |
-
"values": list[str], # valeurs distinctes du slot
|
| 45 |
-
"per_value": {value: {
|
| 46 |
-
"n_observations": int,
|
| 47 |
-
"mean": float | None,
|
| 48 |
-
"stdev": float | None,
|
| 49 |
-
"min": float, "max": float,
|
| 50 |
-
"mean_rank": float | None,
|
| 51 |
-
}},
|
| 52 |
-
"best_value": str | None,
|
| 53 |
-
"worst_value": str | None,
|
| 54 |
-
"groups": list[dict], # détail par groupe
|
| 55 |
-
}
|
| 56 |
"""
|
| 57 |
|
| 58 |
from __future__ import annotations
|
| 59 |
|
| 60 |
-
import
|
| 61 |
-
import statistics
|
| 62 |
-
from dataclasses import dataclass
|
| 63 |
-
from typing import Optional
|
| 64 |
-
|
| 65 |
-
logger = logging.getLogger(__name__)
|
| 66 |
-
|
| 67 |
-
|
| 68 |
-
@dataclass(frozen=True)
|
| 69 |
-
class PipelineRun:
|
| 70 |
-
"""Un run de pipeline composée pour la comparaison contrôlée.
|
| 71 |
-
|
| 72 |
-
Attributes
|
| 73 |
-
----------
|
| 74 |
-
name:
|
| 75 |
-
Nom du run (libre — informatif uniquement).
|
| 76 |
-
slots:
|
| 77 |
-
Map ``{slot_name: module_name}`` décrivant la pipeline
|
| 78 |
-
(ex. ``{"ocr": "tess", "llm": "gpt-4o"}``).
|
| 79 |
-
score:
|
| 80 |
-
Métrique numérique à comparer (CER moyen typiquement).
|
| 81 |
-
Plus bas = meilleur par convention sauf si
|
| 82 |
-
``higher_is_better=True`` est passé à
|
| 83 |
-
``compare_isolated_effect``.
|
| 84 |
-
"""
|
| 85 |
-
|
| 86 |
-
name: str
|
| 87 |
-
slots: dict[str, str]
|
| 88 |
-
score: float
|
| 89 |
-
|
| 90 |
-
def as_dict(self) -> dict:
|
| 91 |
-
return {
|
| 92 |
-
"name": self.name,
|
| 93 |
-
"slots": dict(self.slots),
|
| 94 |
-
"score": self.score,
|
| 95 |
-
}
|
| 96 |
-
|
| 97 |
-
|
| 98 |
-
def _normalise_runs(runs) -> list[PipelineRun]:
|
| 99 |
-
"""Accepte une liste de ``PipelineRun`` ou de dicts compatibles."""
|
| 100 |
-
out: list[PipelineRun] = []
|
| 101 |
-
for r in runs:
|
| 102 |
-
if isinstance(r, PipelineRun):
|
| 103 |
-
out.append(r)
|
| 104 |
-
continue
|
| 105 |
-
if not isinstance(r, dict):
|
| 106 |
-
continue
|
| 107 |
-
slots = r.get("slots") or {}
|
| 108 |
-
if not isinstance(slots, dict):
|
| 109 |
-
continue
|
| 110 |
-
try:
|
| 111 |
-
score = float(r.get("score"))
|
| 112 |
-
except (TypeError, ValueError):
|
| 113 |
-
continue
|
| 114 |
-
out.append(PipelineRun(
|
| 115 |
-
name=str(r.get("name") or ""),
|
| 116 |
-
slots={str(k): str(v) for k, v in slots.items()},
|
| 117 |
-
score=score,
|
| 118 |
-
))
|
| 119 |
-
return out
|
| 120 |
-
|
| 121 |
-
|
| 122 |
-
def compare_isolated_effect(
|
| 123 |
-
runs,
|
| 124 |
-
varying_slot: str,
|
| 125 |
-
*,
|
| 126 |
-
higher_is_better: bool = False,
|
| 127 |
-
) -> Optional[dict]:
|
| 128 |
-
"""Mesure l'effet isolé du slot ``varying_slot``.
|
| 129 |
-
|
| 130 |
-
Parameters
|
| 131 |
-
----------
|
| 132 |
-
runs:
|
| 133 |
-
Liste de ``PipelineRun`` (ou dicts compatibles).
|
| 134 |
-
varying_slot:
|
| 135 |
-
Nom du slot dont on veut isoler l'effet. Les autres
|
| 136 |
-
slots constituent les groupes de contrôle.
|
| 137 |
-
higher_is_better:
|
| 138 |
-
Si ``True``, on inverse la convention de classement
|
| 139 |
-
(rang 1 = score le plus haut). Défaut ``False`` =
|
| 140 |
-
rang 1 = score le plus bas (CER).
|
| 141 |
-
|
| 142 |
-
Returns
|
| 143 |
-
-------
|
| 144 |
-
dict | None
|
| 145 |
-
``None`` si moins de 2 runs ou si ``varying_slot``
|
| 146 |
-
n'est présent dans aucun run.
|
| 147 |
-
"""
|
| 148 |
-
runs_list = _normalise_runs(runs)
|
| 149 |
-
if len(runs_list) < 2:
|
| 150 |
-
return None
|
| 151 |
-
runs_list = [r for r in runs_list if varying_slot in r.slots]
|
| 152 |
-
if not runs_list:
|
| 153 |
-
return None
|
| 154 |
-
|
| 155 |
-
# Constitue les groupes par valeurs des slots fixed
|
| 156 |
-
groups: dict[tuple, list[PipelineRun]] = {}
|
| 157 |
-
fixed_slot_names: list[str] = []
|
| 158 |
-
for r in runs_list:
|
| 159 |
-
other_slots = sorted(k for k in r.slots if k != varying_slot)
|
| 160 |
-
if not fixed_slot_names:
|
| 161 |
-
fixed_slot_names = other_slots
|
| 162 |
-
# Skip runs avec un schéma de slots incompatible
|
| 163 |
-
if other_slots != fixed_slot_names:
|
| 164 |
-
continue
|
| 165 |
-
key = tuple((k, r.slots[k]) for k in other_slots)
|
| 166 |
-
groups.setdefault(key, []).append(r)
|
| 167 |
-
|
| 168 |
-
if not groups:
|
| 169 |
-
return None
|
| 170 |
-
|
| 171 |
-
# Pour chaque groupe : ranking des runs par score
|
| 172 |
-
per_value: dict[str, dict] = {}
|
| 173 |
-
group_details: list[dict] = []
|
| 174 |
-
for key, members in groups.items():
|
| 175 |
-
members_sorted = sorted(
|
| 176 |
-
members, key=lambda x: x.score, reverse=higher_is_better,
|
| 177 |
-
)
|
| 178 |
-
# Rangs : runs ex aequo partagent la moyenne des rangs
|
| 179 |
-
ranks: dict[str, float] = {}
|
| 180 |
-
i = 0
|
| 181 |
-
while i < len(members_sorted):
|
| 182 |
-
j = i
|
| 183 |
-
while (
|
| 184 |
-
j + 1 < len(members_sorted)
|
| 185 |
-
and members_sorted[j + 1].score == members_sorted[i].score
|
| 186 |
-
):
|
| 187 |
-
j += 1
|
| 188 |
-
avg_rank = (i + 1 + j + 1) / 2
|
| 189 |
-
for k in range(i, j + 1):
|
| 190 |
-
value = members_sorted[k].slots[varying_slot]
|
| 191 |
-
ranks[value] = avg_rank
|
| 192 |
-
i = j + 1
|
| 193 |
-
|
| 194 |
-
for r in members:
|
| 195 |
-
value = r.slots[varying_slot]
|
| 196 |
-
slot = per_value.setdefault(value, {
|
| 197 |
-
"scores": [],
|
| 198 |
-
"ranks": [],
|
| 199 |
-
})
|
| 200 |
-
slot["scores"].append(r.score)
|
| 201 |
-
slot["ranks"].append(ranks[value])
|
| 202 |
-
group_details.append({
|
| 203 |
-
"fixed_slots": dict(key),
|
| 204 |
-
"n_members": len(members),
|
| 205 |
-
"values": [r.slots[varying_slot] for r in members_sorted],
|
| 206 |
-
"scores": [r.score for r in members_sorted],
|
| 207 |
-
})
|
| 208 |
-
|
| 209 |
-
# Calcul mean/stdev/min/max + rang moyen par valeur
|
| 210 |
-
summary: dict[str, dict] = {}
|
| 211 |
-
for value, slot in per_value.items():
|
| 212 |
-
scores = slot["scores"]
|
| 213 |
-
ranks = slot["ranks"]
|
| 214 |
-
summary[value] = {
|
| 215 |
-
"n_observations": len(scores),
|
| 216 |
-
"mean": statistics.fmean(scores) if scores else None,
|
| 217 |
-
"stdev": (
|
| 218 |
-
statistics.stdev(scores) if len(scores) >= 2 else None
|
| 219 |
-
),
|
| 220 |
-
"min": min(scores),
|
| 221 |
-
"max": max(scores),
|
| 222 |
-
"mean_rank": (
|
| 223 |
-
statistics.fmean(ranks) if ranks else None
|
| 224 |
-
),
|
| 225 |
-
}
|
| 226 |
-
|
| 227 |
-
# Best/worst : sur la mean (convention CER : plus bas = meilleur)
|
| 228 |
-
by_mean = sorted(
|
| 229 |
-
((v, d["mean"]) for v, d in summary.items()
|
| 230 |
-
if d["mean"] is not None),
|
| 231 |
-
key=lambda kv: kv[1],
|
| 232 |
-
reverse=higher_is_better,
|
| 233 |
-
)
|
| 234 |
-
best_value = by_mean[0][0] if by_mean else None
|
| 235 |
-
worst_value = by_mean[-1][0] if by_mean else None
|
| 236 |
-
|
| 237 |
-
return {
|
| 238 |
-
"varying_slot": varying_slot,
|
| 239 |
-
"n_runs": len(runs_list),
|
| 240 |
-
"n_groups": len(groups),
|
| 241 |
-
"values": sorted(per_value.keys()),
|
| 242 |
-
"per_value": summary,
|
| 243 |
-
"best_value": best_value,
|
| 244 |
-
"worst_value": worst_value,
|
| 245 |
-
"groups": group_details,
|
| 246 |
-
"higher_is_better": higher_is_better,
|
| 247 |
-
}
|
| 248 |
-
|
| 249 |
-
|
| 250 |
-
__all__ = [
|
| 251 |
-
"PipelineRun",
|
| 252 |
-
"compare_isolated_effect",
|
| 253 |
-
]
|
|
|
|
| 1 |
+
"""Re-export — Sprint A14-S10. Le contenu canonique vit dans
|
| 2 |
+
``picarones.evaluation.metrics.incremental_comparison``.
|
| 3 |
|
| 4 |
+
L'ancien chemin ``picarones.measurements.incremental_comparison`` est conservé pour
|
| 5 |
+
ne casser aucun consommateur. Au S22, ce re-export disparaîtra.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 6 |
"""
|
| 7 |
|
| 8 |
from __future__ import annotations
|
| 9 |
|
| 10 |
+
from picarones.evaluation.metrics.incremental_comparison import * # noqa: F401,F403
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
@@ -1,484 +1,10 @@
|
|
| 1 |
-
"""
|
|
|
|
| 2 |
|
| 3 |
-
|
| 4 |
-
|
| 5 |
-
|
| 6 |
-
1. **Divergence taxonomique** (`kl_divergence`, `jensen_shannon_divergence`,
|
| 7 |
-
`taxonomy_divergence_matrix`) — *à quel point les moteurs font-ils des
|
| 8 |
-
erreurs de natures différentes ?* Une divergence élevée signale des
|
| 9 |
-
moteurs spécialisés sur des classes d'erreurs distinctes (visual vs
|
| 10 |
-
abréviation vs casse) et donc des candidats pour un voting ensemble.
|
| 11 |
-
|
| 12 |
-
2. **Complémentarité** (`oracle_token_recall`, `complementarity_gap`,
|
| 13 |
-
`pairwise_disagreement_rate`) — *quel CER serait atteignable si on
|
| 14 |
-
combinait les moteurs ?* La borne inférieure du CER atteignable par
|
| 15 |
-
un voting majoritaire token-level est ``1 - oracle_token_recall``.
|
| 16 |
-
Si elle est très inférieure au CER du meilleur moteur seul, l'effort
|
| 17 |
-
d'un pipeline d'ensemble se justifie. Sinon non.
|
| 18 |
-
|
| 19 |
-
Convention de typage
|
| 20 |
-
--------------------
|
| 21 |
-
Toutes les fonctions sont enregistrables dans le registre Sprint 34 si
|
| 22 |
-
on les wrappe par un adaptateur ``(input_types=(TEXT, TEXT))``. Pour
|
| 23 |
-
limiter le bruit, on ne les enregistre **pas** automatiquement : ce sont
|
| 24 |
-
des métriques d'agrégation (multi-moteurs ou multi-documents) qui ne
|
| 25 |
-
correspondent pas au modèle « une jonction = une métrique » du runner.
|
| 26 |
-
Elles sont consommées par les détecteurs narratifs et le rapport HTML.
|
| 27 |
-
|
| 28 |
-
Note sur l'oracle
|
| 29 |
-
-----------------
|
| 30 |
-
La métrique ``oracle_token_recall`` retournée ici utilise un alignement
|
| 31 |
-
bag-of-words pondéré par multiplicité. Ce n'est **pas** une vraie
|
| 32 |
-
borne atteignable par voting majoritaire séquentiel — c'est une borne
|
| 33 |
-
supérieure (proxy optimiste). La vraie borne demanderait un
|
| 34 |
-
alignement séquentiel des hypothèses, ce qui est plus coûteux. Pour
|
| 35 |
-
le diagnostic « ensemble vaut-il le coup ? », le proxy suffit
|
| 36 |
-
largement ; on documente clairement la limite dans le glossaire et le
|
| 37 |
-
rapport.
|
| 38 |
"""
|
| 39 |
|
| 40 |
from __future__ import annotations
|
| 41 |
|
| 42 |
-
import
|
| 43 |
-
import math
|
| 44 |
-
from collections import Counter
|
| 45 |
-
|
| 46 |
-
logger = logging.getLogger(__name__)
|
| 47 |
-
|
| 48 |
-
|
| 49 |
-
# ──────────────────────────────────────────────────────────────────────────
|
| 50 |
-
# Divergence taxonomique (KL / Jensen-Shannon)
|
| 51 |
-
# ──────────────────────────────────────────────────────────────────────────
|
| 52 |
-
|
| 53 |
-
|
| 54 |
-
def _smoothed_distribution(
|
| 55 |
-
distribution: dict[str, float],
|
| 56 |
-
keys: list[str],
|
| 57 |
-
epsilon: float = 1e-12,
|
| 58 |
-
) -> list[float]:
|
| 59 |
-
"""Aligne une distribution sur l'ordre de ``keys`` et lisse les zéros.
|
| 60 |
-
|
| 61 |
-
Le lissage évite ``log(0)`` dans la KL. ``epsilon`` est volontairement
|
| 62 |
-
minuscule pour ne pas modifier le résultat de manière sensible.
|
| 63 |
-
"""
|
| 64 |
-
smoothed = [max(distribution.get(k, 0.0), epsilon) for k in keys]
|
| 65 |
-
total = sum(smoothed)
|
| 66 |
-
return [v / total for v in smoothed]
|
| 67 |
-
|
| 68 |
-
|
| 69 |
-
def kl_divergence(p: dict[str, float], q: dict[str, float]) -> float:
|
| 70 |
-
"""KL-divergence ``D(P||Q)`` en bits, sur l'union des clés.
|
| 71 |
-
|
| 72 |
-
Les distributions n'ont pas besoin de partager exactement les mêmes
|
| 73 |
-
clés ; les clés manquantes sont lissées à ``epsilon`` puis
|
| 74 |
-
renormalisées.
|
| 75 |
-
|
| 76 |
-
Returns
|
| 77 |
-
-------
|
| 78 |
-
float
|
| 79 |
-
``D(P||Q) ≥ 0``. Vaut 0 si et seulement si P == Q. N'est pas
|
| 80 |
-
symétrique : ``kl(p, q) != kl(q, p)`` en général.
|
| 81 |
-
"""
|
| 82 |
-
keys = sorted(set(p.keys()) | set(q.keys()))
|
| 83 |
-
if not keys:
|
| 84 |
-
return 0.0
|
| 85 |
-
p_vec = _smoothed_distribution(p, keys)
|
| 86 |
-
q_vec = _smoothed_distribution(q, keys)
|
| 87 |
-
return sum(pi * math.log2(pi / qi) for pi, qi in zip(p_vec, q_vec))
|
| 88 |
-
|
| 89 |
-
|
| 90 |
-
def jensen_shannon_divergence(
|
| 91 |
-
p: dict[str, float],
|
| 92 |
-
q: dict[str, float],
|
| 93 |
-
) -> float:
|
| 94 |
-
"""JS-divergence symétrique en bits, bornée dans ``[0, 1]``.
|
| 95 |
-
|
| 96 |
-
``JS(P, Q) = ½ D(P||M) + ½ D(Q||M)`` avec ``M = (P + Q) / 2``.
|
| 97 |
-
Symétrique et bornée — préférable à la KL pour construire une
|
| 98 |
-
matrice triangulaire de divergences entre moteurs.
|
| 99 |
-
"""
|
| 100 |
-
keys = sorted(set(p.keys()) | set(q.keys()))
|
| 101 |
-
if not keys:
|
| 102 |
-
return 0.0
|
| 103 |
-
p_vec = _smoothed_distribution(p, keys)
|
| 104 |
-
q_vec = _smoothed_distribution(q, keys)
|
| 105 |
-
m_vec = [(pi + qi) / 2.0 for pi, qi in zip(p_vec, q_vec)]
|
| 106 |
-
|
| 107 |
-
def _kl(a: list[float], b: list[float]) -> float:
|
| 108 |
-
return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)
|
| 109 |
-
|
| 110 |
-
js = 0.5 * _kl(p_vec, m_vec) + 0.5 * _kl(q_vec, m_vec)
|
| 111 |
-
# Borne théorique : JS ∈ [0, 1] en bits. Clamp pour absorber les
|
| 112 |
-
# erreurs d'arrondi flottant.
|
| 113 |
-
return max(0.0, min(1.0, js))
|
| 114 |
-
|
| 115 |
-
|
| 116 |
-
def taxonomy_divergence_matrix(
|
| 117 |
-
distributions: dict[str, dict[str, float]],
|
| 118 |
-
metric: str = "js",
|
| 119 |
-
) -> dict[str, dict[str, float]]:
|
| 120 |
-
"""Construit la matrice de divergence triangulaire entre moteurs.
|
| 121 |
-
|
| 122 |
-
Parameters
|
| 123 |
-
----------
|
| 124 |
-
distributions:
|
| 125 |
-
``{engine_name: {error_class: probability}}``. Chaque
|
| 126 |
-
distribution doit sommer à environ 1 (pas de validation stricte
|
| 127 |
-
— les distributions taxonomiques de Picarones sont déjà
|
| 128 |
-
normalisées par ``aggregate_taxonomy``).
|
| 129 |
-
metric:
|
| 130 |
-
``"js"`` (défaut, symétrique) ou ``"kl"`` (asymétrique).
|
| 131 |
-
|
| 132 |
-
Returns
|
| 133 |
-
-------
|
| 134 |
-
dict[str, dict[str, float]]
|
| 135 |
-
Matrice ``{engine_a: {engine_b: divergence}}`` symétrique pour
|
| 136 |
-
``js``, asymétrique pour ``kl``. La diagonale vaut 0.
|
| 137 |
-
"""
|
| 138 |
-
if metric not in ("js", "kl"):
|
| 139 |
-
raise ValueError(f"metric doit être 'js' ou 'kl' — reçu {metric!r}")
|
| 140 |
-
fn = jensen_shannon_divergence if metric == "js" else kl_divergence
|
| 141 |
-
|
| 142 |
-
engines = sorted(distributions.keys())
|
| 143 |
-
matrix: dict[str, dict[str, float]] = {a: {} for a in engines}
|
| 144 |
-
for a in engines:
|
| 145 |
-
for b in engines:
|
| 146 |
-
if a == b:
|
| 147 |
-
matrix[a][b] = 0.0
|
| 148 |
-
elif metric == "js" and b in matrix and a in matrix[b]:
|
| 149 |
-
# Symétrique : recopie pour éviter de recalculer
|
| 150 |
-
matrix[a][b] = matrix[b][a]
|
| 151 |
-
else:
|
| 152 |
-
matrix[a][b] = fn(distributions[a], distributions[b])
|
| 153 |
-
return matrix
|
| 154 |
-
|
| 155 |
-
|
| 156 |
-
# ──────────────────────────────────────────────────────────────────────────
|
| 157 |
-
# Complémentarité (oracle token recall)
|
| 158 |
-
# ──────────────────────────────────────────────────────────────────────────
|
| 159 |
-
|
| 160 |
-
|
| 161 |
-
def _word_multiset(text: str) -> Counter[str]:
|
| 162 |
-
"""Décomposition en multiset de tokens (séparateur whitespace)."""
|
| 163 |
-
return Counter(tok for tok in text.split() if tok)
|
| 164 |
-
|
| 165 |
-
|
| 166 |
-
def oracle_token_recall(
|
| 167 |
-
reference: str,
|
| 168 |
-
hypotheses: dict[str, str],
|
| 169 |
-
) -> float:
|
| 170 |
-
"""Borne supérieure (proxy bag-of-words) du token-recall atteignable
|
| 171 |
-
par un voting majoritaire entre tous les moteurs fournis.
|
| 172 |
-
|
| 173 |
-
Pour chaque token de la référence (avec sa multiplicité), on
|
| 174 |
-
considère qu'il est "préservé" par l'ensemble si au moins un moteur
|
| 175 |
-
en produit une occurrence non encore comptée. Le score est le ratio
|
| 176 |
-
d'occurrences GT préservées sur le total.
|
| 177 |
-
|
| 178 |
-
Parameters
|
| 179 |
-
----------
|
| 180 |
-
reference:
|
| 181 |
-
Texte GT.
|
| 182 |
-
hypotheses:
|
| 183 |
-
``{engine_name: hypothesis_text}``.
|
| 184 |
-
|
| 185 |
-
Returns
|
| 186 |
-
-------
|
| 187 |
-
float
|
| 188 |
-
Ratio dans ``[0, 1]``. ``1.0`` = chaque token GT est présent
|
| 189 |
-
dans au moins une hypothèse à hauteur de sa multiplicité.
|
| 190 |
-
|
| 191 |
-
Note
|
| 192 |
-
----
|
| 193 |
-
Cette borne est **optimiste** (supérieure à la vraie borne par
|
| 194 |
-
voting séquentiel) car elle ignore l'ordre d'apparition. Pour le
|
| 195 |
-
diagnostic « un voting vaut-il l'effort ? » le proxy suffit ; pour
|
| 196 |
-
une vraie borne il faudrait un alignement séquentiel.
|
| 197 |
-
"""
|
| 198 |
-
ref_counter = _word_multiset(reference)
|
| 199 |
-
if not ref_counter or not hypotheses:
|
| 200 |
-
return 1.0 if not ref_counter else 0.0
|
| 201 |
-
|
| 202 |
-
hyp_counters = [_word_multiset(h) for h in hypotheses.values()]
|
| 203 |
-
total_ref = sum(ref_counter.values())
|
| 204 |
-
preserved = 0
|
| 205 |
-
for token, gt_count in ref_counter.items():
|
| 206 |
-
# Pour chaque moteur, le nombre d'occurrences disponibles, plafonné
|
| 207 |
-
# à la multiplicité GT. L'oracle prend le max sur les moteurs.
|
| 208 |
-
best = max((min(gt_count, hc.get(token, 0)) for hc in hyp_counters), default=0)
|
| 209 |
-
preserved += best
|
| 210 |
-
return preserved / total_ref
|
| 211 |
-
|
| 212 |
-
|
| 213 |
-
def complementarity_gap(
|
| 214 |
-
reference: str,
|
| 215 |
-
hypotheses: dict[str, str],
|
| 216 |
-
) -> dict[str, float]:
|
| 217 |
-
"""Compare l'oracle au meilleur moteur seul.
|
| 218 |
-
|
| 219 |
-
Returns
|
| 220 |
-
-------
|
| 221 |
-
dict
|
| 222 |
-
``{
|
| 223 |
-
"oracle_recall": float, # bag-of-words recall de l'oracle
|
| 224 |
-
"best_single_recall": float, # meilleur recall token d'un moteur seul
|
| 225 |
-
"best_engine": str, # nom du moteur correspondant
|
| 226 |
-
"absolute_gap": float, # oracle - best_single (toujours ≥ 0)
|
| 227 |
-
"relative_gap": float, # absolute_gap / (1 - best_single + ε)
|
| 228 |
-
# = fraction des erreurs encore évitables
|
| 229 |
-
# par un ensemble
|
| 230 |
-
}``
|
| 231 |
-
"""
|
| 232 |
-
ref_counter = _word_multiset(reference)
|
| 233 |
-
total = sum(ref_counter.values())
|
| 234 |
-
if not total:
|
| 235 |
-
return {
|
| 236 |
-
"oracle_recall": 1.0,
|
| 237 |
-
"best_single_recall": 1.0,
|
| 238 |
-
"best_engine": "",
|
| 239 |
-
"absolute_gap": 0.0,
|
| 240 |
-
"relative_gap": 0.0,
|
| 241 |
-
}
|
| 242 |
-
|
| 243 |
-
def _single_recall(hyp_text: str) -> float:
|
| 244 |
-
hc = _word_multiset(hyp_text)
|
| 245 |
-
preserved = sum(min(gt, hc.get(tok, 0)) for tok, gt in ref_counter.items())
|
| 246 |
-
return preserved / total
|
| 247 |
-
|
| 248 |
-
if not hypotheses:
|
| 249 |
-
return {
|
| 250 |
-
"oracle_recall": 0.0,
|
| 251 |
-
"best_single_recall": 0.0,
|
| 252 |
-
"best_engine": "",
|
| 253 |
-
"absolute_gap": 0.0,
|
| 254 |
-
"relative_gap": 0.0,
|
| 255 |
-
}
|
| 256 |
-
|
| 257 |
-
per_engine = {name: _single_recall(h) for name, h in hypotheses.items()}
|
| 258 |
-
best_engine, best_recall = max(per_engine.items(), key=lambda kv: kv[1])
|
| 259 |
-
oracle = oracle_token_recall(reference, hypotheses)
|
| 260 |
-
|
| 261 |
-
absolute_gap = max(0.0, oracle - best_recall)
|
| 262 |
-
# relative_gap : fraction des erreurs du meilleur moteur que l'ensemble
|
| 263 |
-
# serait théoriquement capable de récupérer (∈ [0, 1])
|
| 264 |
-
headroom = max(1.0 - best_recall, 1e-12)
|
| 265 |
-
relative_gap = min(1.0, absolute_gap / headroom)
|
| 266 |
-
|
| 267 |
-
return {
|
| 268 |
-
"oracle_recall": oracle,
|
| 269 |
-
"best_single_recall": best_recall,
|
| 270 |
-
"best_engine": best_engine,
|
| 271 |
-
"absolute_gap": absolute_gap,
|
| 272 |
-
"relative_gap": relative_gap,
|
| 273 |
-
}
|
| 274 |
-
|
| 275 |
-
|
| 276 |
-
def pairwise_disagreement_rate(
|
| 277 |
-
reference: str,
|
| 278 |
-
hyp_a: str,
|
| 279 |
-
hyp_b: str,
|
| 280 |
-
) -> float:
|
| 281 |
-
"""Fraction de tokens GT pour lesquels A et B sont en désaccord.
|
| 282 |
-
|
| 283 |
-
Un désaccord = (l'un préserve le token, l'autre non) OU
|
| 284 |
-
(les deux le ratent mais avec des substitutions différentes — non
|
| 285 |
-
capturé ici, on reste sur la version simple présence/absence).
|
| 286 |
-
|
| 287 |
-
Returns
|
| 288 |
-
-------
|
| 289 |
-
float
|
| 290 |
-
Ratio dans ``[0, 1]``. ``0`` = A et B font les mêmes choix
|
| 291 |
-
(pas de gain d'ensemble). ``1`` = A et B sont toujours en
|
| 292 |
-
désaccord (gain d'ensemble maximal).
|
| 293 |
-
"""
|
| 294 |
-
ref_counter = _word_multiset(reference)
|
| 295 |
-
if not ref_counter:
|
| 296 |
-
return 0.0
|
| 297 |
-
a = _word_multiset(hyp_a)
|
| 298 |
-
b = _word_multiset(hyp_b)
|
| 299 |
-
total = sum(ref_counter.values())
|
| 300 |
-
disagree = 0
|
| 301 |
-
for tok, gt_count in ref_counter.items():
|
| 302 |
-
a_pres = min(gt_count, a.get(tok, 0))
|
| 303 |
-
b_pres = min(gt_count, b.get(tok, 0))
|
| 304 |
-
# Compte les positions où A et B donnent une réponse différente
|
| 305 |
-
disagree += abs(a_pres - b_pres)
|
| 306 |
-
return disagree / total
|
| 307 |
-
|
| 308 |
-
|
| 309 |
-
# ──────────────────────────────────────────────────────────────────────────
|
| 310 |
-
# Agrégation au niveau benchmark (Sprint 36)
|
| 311 |
-
# ──────────────────────────────────────────────────────────────────────────
|
| 312 |
-
|
| 313 |
-
|
| 314 |
-
def compute_inter_engine_analysis(
|
| 315 |
-
*,
|
| 316 |
-
per_engine_outputs: dict[str, dict[str, str]],
|
| 317 |
-
ground_truths: dict[str, str],
|
| 318 |
-
taxonomy_distributions: dict[str, dict[str, float]] | None = None,
|
| 319 |
-
divergence_metric: str = "js",
|
| 320 |
-
) -> dict:
|
| 321 |
-
"""Agrège les métriques inter-moteurs sur l'ensemble du corpus.
|
| 322 |
-
|
| 323 |
-
Parameters
|
| 324 |
-
----------
|
| 325 |
-
per_engine_outputs:
|
| 326 |
-
``{engine_name: {doc_id: hypothesis_text}}``. Une entrée par
|
| 327 |
-
moteur, avec une hypothèse par document. Les documents absents
|
| 328 |
-
d'un moteur (échecs, timeouts) sont simplement ignorés pour ce
|
| 329 |
-
moteur — l'oracle est calculé sur les moteurs qui ont produit
|
| 330 |
-
une sortie pour le doc.
|
| 331 |
-
ground_truths:
|
| 332 |
-
``{doc_id: ground_truth_text}``. La GT est la même pour tous
|
| 333 |
-
les moteurs ; on la passe une seule fois.
|
| 334 |
-
taxonomy_distributions:
|
| 335 |
-
``{engine_name: {error_class: probability}}`` — typiquement
|
| 336 |
-
``EngineReport.aggregated_taxonomy["class_distribution"]``. Si
|
| 337 |
-
``None`` ou vide, la divergence taxonomique n'est pas calculée.
|
| 338 |
-
divergence_metric:
|
| 339 |
-
``"js"`` (défaut, symétrique) ou ``"kl"``.
|
| 340 |
-
|
| 341 |
-
Returns
|
| 342 |
-
-------
|
| 343 |
-
dict
|
| 344 |
-
Structure stable consommable par les détecteurs narratifs et le
|
| 345 |
-
rapport HTML :
|
| 346 |
-
``{
|
| 347 |
-
"complementarity": {
|
| 348 |
-
"oracle_recall": float,
|
| 349 |
-
"best_single_recall": float,
|
| 350 |
-
"best_engine": str,
|
| 351 |
-
"absolute_gap": float,
|
| 352 |
-
"relative_gap": float,
|
| 353 |
-
"doc_count": int,
|
| 354 |
-
"per_doc": [{doc_id, oracle, best, gap}, ...] # max 50 docs
|
| 355 |
-
},
|
| 356 |
-
"taxonomy_divergence": {
|
| 357 |
-
"metric": "js"|"kl",
|
| 358 |
-
"matrix": {engine_a: {engine_b: divergence}},
|
| 359 |
-
"max_pair": [engine_a, engine_b, value] # paire la plus divergente
|
| 360 |
-
} | None,
|
| 361 |
-
"engines": [...], # liste des moteurs analysés (ordre stable)
|
| 362 |
-
}``
|
| 363 |
-
"""
|
| 364 |
-
engines = sorted(per_engine_outputs.keys())
|
| 365 |
-
result: dict = {"engines": engines}
|
| 366 |
-
|
| 367 |
-
# ── Complémentarité agrégée doc par doc ──────────────────────────────
|
| 368 |
-
if not engines:
|
| 369 |
-
result["complementarity"] = None
|
| 370 |
-
else:
|
| 371 |
-
total_oracle_preserved = 0
|
| 372 |
-
total_ref_tokens = 0
|
| 373 |
-
per_engine_preserved: dict[str, int] = {name: 0 for name in engines}
|
| 374 |
-
per_doc_records: list[dict] = []
|
| 375 |
-
|
| 376 |
-
for doc_id, gt in ground_truths.items():
|
| 377 |
-
ref_counter = _word_multiset(gt)
|
| 378 |
-
ref_total = sum(ref_counter.values())
|
| 379 |
-
if not ref_total:
|
| 380 |
-
continue
|
| 381 |
-
total_ref_tokens += ref_total
|
| 382 |
-
|
| 383 |
-
doc_hyps: dict[str, str] = {}
|
| 384 |
-
for name in engines:
|
| 385 |
-
hyp = per_engine_outputs.get(name, {}).get(doc_id)
|
| 386 |
-
if hyp is not None:
|
| 387 |
-
doc_hyps[name] = hyp
|
| 388 |
-
|
| 389 |
-
if not doc_hyps:
|
| 390 |
-
continue
|
| 391 |
-
|
| 392 |
-
hyp_counters = {n: _word_multiset(h) for n, h in doc_hyps.items()}
|
| 393 |
-
|
| 394 |
-
doc_oracle = 0
|
| 395 |
-
doc_best_per_engine: dict[str, int] = {n: 0 for n in doc_hyps}
|
| 396 |
-
for tok, gt_count in ref_counter.items():
|
| 397 |
-
# Oracle : meilleur des moteurs sur ce token
|
| 398 |
-
best_for_token = 0
|
| 399 |
-
for name, hc in hyp_counters.items():
|
| 400 |
-
preserved = min(gt_count, hc.get(tok, 0))
|
| 401 |
-
doc_best_per_engine[name] += preserved
|
| 402 |
-
if preserved > best_for_token:
|
| 403 |
-
best_for_token = preserved
|
| 404 |
-
doc_oracle += best_for_token
|
| 405 |
-
|
| 406 |
-
total_oracle_preserved += doc_oracle
|
| 407 |
-
for name, count in doc_best_per_engine.items():
|
| 408 |
-
per_engine_preserved[name] += count
|
| 409 |
-
|
| 410 |
-
doc_best = max(doc_best_per_engine.values()) if doc_best_per_engine else 0
|
| 411 |
-
per_doc_records.append({
|
| 412 |
-
"doc_id": doc_id,
|
| 413 |
-
"oracle_recall": doc_oracle / ref_total,
|
| 414 |
-
"best_single_recall": doc_best / ref_total,
|
| 415 |
-
"absolute_gap": (doc_oracle - doc_best) / ref_total,
|
| 416 |
-
})
|
| 417 |
-
|
| 418 |
-
if total_ref_tokens == 0:
|
| 419 |
-
result["complementarity"] = None
|
| 420 |
-
else:
|
| 421 |
-
oracle_recall = total_oracle_preserved / total_ref_tokens
|
| 422 |
-
recalls = {
|
| 423 |
-
name: per_engine_preserved[name] / total_ref_tokens
|
| 424 |
-
for name in engines
|
| 425 |
-
}
|
| 426 |
-
best_engine, best_recall = max(recalls.items(), key=lambda kv: kv[1])
|
| 427 |
-
absolute_gap = max(0.0, oracle_recall - best_recall)
|
| 428 |
-
headroom = max(1.0 - best_recall, 1e-12)
|
| 429 |
-
relative_gap = min(1.0, absolute_gap / headroom)
|
| 430 |
-
|
| 431 |
-
# Garder les ``per_doc_records`` les plus instructifs : tri par
|
| 432 |
-
# gap absolu décroissant, top 50. Les détecteurs narratifs
|
| 433 |
-
# n'en consomment que quelques-uns.
|
| 434 |
-
per_doc_records.sort(key=lambda r: r["absolute_gap"], reverse=True)
|
| 435 |
-
per_doc_top = per_doc_records[:50]
|
| 436 |
-
|
| 437 |
-
result["complementarity"] = {
|
| 438 |
-
"oracle_recall": oracle_recall,
|
| 439 |
-
"best_single_recall": best_recall,
|
| 440 |
-
"best_engine": best_engine,
|
| 441 |
-
"absolute_gap": absolute_gap,
|
| 442 |
-
"relative_gap": relative_gap,
|
| 443 |
-
"doc_count": len(per_doc_records),
|
| 444 |
-
"per_engine_recall": recalls,
|
| 445 |
-
"per_doc": per_doc_top,
|
| 446 |
-
}
|
| 447 |
-
|
| 448 |
-
# ── Divergence taxonomique ─────────────────────────────────────────
|
| 449 |
-
if not taxonomy_distributions:
|
| 450 |
-
result["taxonomy_divergence"] = None
|
| 451 |
-
else:
|
| 452 |
-
matrix = taxonomy_divergence_matrix(
|
| 453 |
-
taxonomy_distributions,
|
| 454 |
-
metric=divergence_metric,
|
| 455 |
-
)
|
| 456 |
-
# Cherche la paire la plus divergente (utile pour la synthèse
|
| 457 |
-
# narrative qui veut nommer les deux moteurs candidats à
|
| 458 |
-
# l'ensemble).
|
| 459 |
-
max_pair: tuple[str, str, float] = ("", "", 0.0)
|
| 460 |
-
names = sorted(matrix.keys())
|
| 461 |
-
for i, a in enumerate(names):
|
| 462 |
-
for b in names[i + 1:]:
|
| 463 |
-
v = matrix[a][b]
|
| 464 |
-
if v > max_pair[2]:
|
| 465 |
-
max_pair = (a, b, v)
|
| 466 |
-
|
| 467 |
-
result["taxonomy_divergence"] = {
|
| 468 |
-
"metric": divergence_metric,
|
| 469 |
-
"matrix": matrix,
|
| 470 |
-
"max_pair": list(max_pair) if max_pair[2] > 0 else None,
|
| 471 |
-
}
|
| 472 |
-
|
| 473 |
-
return result
|
| 474 |
-
|
| 475 |
-
|
| 476 |
-
__all__ = [
|
| 477 |
-
"kl_divergence",
|
| 478 |
-
"jensen_shannon_divergence",
|
| 479 |
-
"taxonomy_divergence_matrix",
|
| 480 |
-
"oracle_token_recall",
|
| 481 |
-
"complementarity_gap",
|
| 482 |
-
"pairwise_disagreement_rate",
|
| 483 |
-
"compute_inter_engine_analysis",
|
| 484 |
-
]
|
|
|
|
| 1 |
+
"""Re-export — Sprint A14-S10. Le contenu canonique vit dans
|
| 2 |
+
``picarones.evaluation.metrics.inter_engine``.
|
| 3 |
|
| 4 |
+
L'ancien chemin ``picarones.measurements.inter_engine`` est conservé pour
|
| 5 |
+
ne casser aucun consommateur. Au S22, ce re-export disparaîtra.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 6 |
"""
|
| 7 |
|
| 8 |
from __future__ import annotations
|
| 9 |
|
| 10 |
+
from picarones.evaluation.metrics.inter_engine import * # noqa: F401,F403
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
@@ -1,280 +1,14 @@
|
|
| 1 |
-
"""
|
|
|
|
| 2 |
|
| 3 |
-
|
|
|
|
| 4 |
|
| 5 |
-
|
| 6 |
-
|
| 7 |
-
Un médiéviste qui édite un manuscrit glosé veut savoir : *« le moteur
|
| 8 |
-
sépare-t-il bien le texte principal de la glose ? »*. Le score de
|
| 9 |
-
structure global de Picarones (Sprint 5) agrège fusion/fragmentation
|
| 10 |
-
de lignes en un seul nombre — utile mais non typé. Ce module
|
| 11 |
-
discrimine par **type de région** ALTO/PAGE (``TextRegion``,
|
| 12 |
-
``MarginNote``, ``Header``, ``Footer``, ``Drop-Cap``...) en
|
| 13 |
-
appliquant le pattern ICDAR layout standard :
|
| 14 |
-
|
| 15 |
-
- **TP** : région GT et région hypothèse de **même type** avec
|
| 16 |
-
chevauchement IoU ≥ seuil (alignement greedy par IoU décroissant),
|
| 17 |
-
- **FN** : région GT non matchée,
|
| 18 |
-
- **FP** : région hypothèse non matchée,
|
| 19 |
-
- F1 calculé global et par type.
|
| 20 |
-
|
| 21 |
-
Le pattern d'alignement est le même que pour le NER (Sprint 38) — on
|
| 22 |
-
réutilise une approche éprouvée plutôt que d'en inventer une nouvelle.
|
| 23 |
-
|
| 24 |
-
Stratégie de découpage
|
| 25 |
-
----------------------
|
| 26 |
-
Cohérente avec NER (Sprint 38), Flesch (Sprint 52), Reading order F1
|
| 27 |
-
(Sprint 53) : couche de calcul pure d'abord. L'utilisateur fournit
|
| 28 |
-
deux listes de ``Region`` (typiquement extraites de ALTO/PAGE par un
|
| 29 |
-
parser amont — le parser ALTO/PAGE standard de Picarones suivra
|
| 30 |
-
dans un sprint dédié). Pas de câblage runner ni de vue HTML ici.
|
| 31 |
-
|
| 32 |
-
Convention de coordonnées
|
| 33 |
-
-------------------------
|
| 34 |
-
Une bbox est un tuple ``(x, y, width, height)`` en pixels (origine
|
| 35 |
-
en haut à gauche, axe y vers le bas — convention ALTO et PAGE
|
| 36 |
-
standard). L'IoU est calculée sur l'aire d'intersection / union des
|
| 37 |
-
rectangles.
|
| 38 |
"""
|
| 39 |
|
| 40 |
from __future__ import annotations
|
| 41 |
|
| 42 |
-
import
|
| 43 |
-
from
|
| 44 |
-
from typing import Iterable
|
| 45 |
-
|
| 46 |
-
logger = logging.getLogger(__name__)
|
| 47 |
-
|
| 48 |
-
|
| 49 |
-
# ──────────────────────────────────────────────────────────────────────────
|
| 50 |
-
# Modèle de données
|
| 51 |
-
# ──────────────────────────────────────────────────────────────────────────
|
| 52 |
-
|
| 53 |
-
|
| 54 |
-
@dataclass(frozen=True)
|
| 55 |
-
class Region:
|
| 56 |
-
"""Une région ALTO/PAGE alignable sur sa GT.
|
| 57 |
-
|
| 58 |
-
Attributs
|
| 59 |
-
---------
|
| 60 |
-
id:
|
| 61 |
-
Identifiant unique au sein de la séquence (ex. ``"r_1"``,
|
| 62 |
-
``"region_main"``). Informatif — l'alignement se fait par IoU,
|
| 63 |
-
pas par ID.
|
| 64 |
-
type:
|
| 65 |
-
Catégorie de la région (``"TextRegion"``, ``"MarginNote"``,
|
| 66 |
-
``"Header"``, etc.). Comparaison **case-insensitive**.
|
| 67 |
-
bbox:
|
| 68 |
-
Rectangle ``(x, y, width, height)`` en pixels, origine en haut
|
| 69 |
-
à gauche. Doit avoir width > 0 et height > 0.
|
| 70 |
-
"""
|
| 71 |
-
|
| 72 |
-
id: str
|
| 73 |
-
type: str
|
| 74 |
-
bbox: tuple[int, int, int, int]
|
| 75 |
-
|
| 76 |
-
def __post_init__(self) -> None:
|
| 77 |
-
x, y, w, h = self.bbox
|
| 78 |
-
if w <= 0 or h <= 0:
|
| 79 |
-
raise ValueError(
|
| 80 |
-
f"Region {self.id!r} : bbox invalide (w={w}, h={h}). "
|
| 81 |
-
"width et height doivent être strictement positifs."
|
| 82 |
-
)
|
| 83 |
-
|
| 84 |
-
@property
|
| 85 |
-
def area(self) -> int:
|
| 86 |
-
_, _, w, h = self.bbox
|
| 87 |
-
return w * h
|
| 88 |
-
|
| 89 |
-
|
| 90 |
-
def _to_region(obj: Region | dict) -> Region:
|
| 91 |
-
"""Coerce un dict en ``Region`` (clés ``id``, ``type``, ``bbox``)."""
|
| 92 |
-
if isinstance(obj, Region):
|
| 93 |
-
return obj
|
| 94 |
-
return Region(
|
| 95 |
-
id=str(obj["id"]),
|
| 96 |
-
type=str(obj["type"]),
|
| 97 |
-
bbox=tuple(obj["bbox"]), # type: ignore[arg-type]
|
| 98 |
-
)
|
| 99 |
-
|
| 100 |
-
|
| 101 |
-
# ──────────────────────────────────────────────────────────────────────────
|
| 102 |
-
# IoU + alignement greedy
|
| 103 |
-
# ──────────────────────────────────────────────────────────────────────────
|
| 104 |
-
|
| 105 |
-
|
| 106 |
-
def _iou_bbox(a: Region, b: Region) -> float:
|
| 107 |
-
"""Intersection-over-Union de deux bboxes ``(x, y, w, h)``."""
|
| 108 |
-
ax, ay, aw, ah = a.bbox
|
| 109 |
-
bx, by, bw, bh = b.bbox
|
| 110 |
-
inter_x = max(ax, bx)
|
| 111 |
-
inter_y = max(ay, by)
|
| 112 |
-
inter_x_end = min(ax + aw, bx + bw)
|
| 113 |
-
inter_y_end = min(ay + ah, by + bh)
|
| 114 |
-
inter_w = max(0, inter_x_end - inter_x)
|
| 115 |
-
inter_h = max(0, inter_y_end - inter_y)
|
| 116 |
-
inter = inter_w * inter_h
|
| 117 |
-
if inter == 0:
|
| 118 |
-
return 0.0
|
| 119 |
-
union = a.area + b.area - inter
|
| 120 |
-
if union <= 0:
|
| 121 |
-
return 0.0
|
| 122 |
-
return inter / union
|
| 123 |
-
|
| 124 |
-
|
| 125 |
-
def _align_regions(
|
| 126 |
-
references: list[Region],
|
| 127 |
-
hypotheses: list[Region],
|
| 128 |
-
iou_threshold: float,
|
| 129 |
-
) -> tuple[list[tuple[int, int, float]], set[int], set[int]]:
|
| 130 |
-
"""Appareillage greedy par IoU décroissant ; same type requis.
|
| 131 |
-
|
| 132 |
-
Renvoie ``(matches, unmatched_refs, unmatched_hyps)`` —
|
| 133 |
-
``matches`` est une liste de ``(idx_ref, idx_hyp, iou)``.
|
| 134 |
-
"""
|
| 135 |
-
candidates: list[tuple[float, int, int]] = []
|
| 136 |
-
for i, r in enumerate(references):
|
| 137 |
-
for j, h in enumerate(hypotheses):
|
| 138 |
-
if r.type.casefold() != h.type.casefold():
|
| 139 |
-
continue
|
| 140 |
-
iou = _iou_bbox(r, h)
|
| 141 |
-
if iou >= iou_threshold:
|
| 142 |
-
candidates.append((iou, i, j))
|
| 143 |
-
|
| 144 |
-
# Tri stable : IoU décroissant, puis indices croissants pour
|
| 145 |
-
# déterminisme sur égalités.
|
| 146 |
-
candidates.sort(key=lambda t: (-t[0], t[1], t[2]))
|
| 147 |
-
|
| 148 |
-
matched_refs: set[int] = set()
|
| 149 |
-
matched_hyps: set[int] = set()
|
| 150 |
-
matches: list[tuple[int, int, float]] = []
|
| 151 |
-
for iou, i, j in candidates:
|
| 152 |
-
if i in matched_refs or j in matched_hyps:
|
| 153 |
-
continue
|
| 154 |
-
matched_refs.add(i)
|
| 155 |
-
matched_hyps.add(j)
|
| 156 |
-
matches.append((i, j, iou))
|
| 157 |
-
|
| 158 |
-
unmatched_refs = set(range(len(references))) - matched_refs
|
| 159 |
-
unmatched_hyps = set(range(len(hypotheses))) - matched_hyps
|
| 160 |
-
return matches, unmatched_refs, unmatched_hyps
|
| 161 |
-
|
| 162 |
-
|
| 163 |
-
# ──────────────────────────────────────────────────────────────────────────
|
| 164 |
-
# Métrique principale
|
| 165 |
-
# ──────────────────────────────────────────────────────────────────────────
|
| 166 |
-
|
| 167 |
-
|
| 168 |
-
def _prf(tp: int, fp: int, fn: int) -> dict[str, float]:
|
| 169 |
-
p = tp / (tp + fp) if (tp + fp) > 0 else 0.0
|
| 170 |
-
r = tp / (tp + fn) if (tp + fn) > 0 else 0.0
|
| 171 |
-
f1 = 2 * p * r / (p + r) if (p + r) > 0 else 0.0
|
| 172 |
-
return {"precision": p, "recall": r, "f1": f1, "support": tp + fn}
|
| 173 |
-
|
| 174 |
-
|
| 175 |
-
def compute_layout_metrics(
|
| 176 |
-
reference_regions: Iterable[Region | dict] | None,
|
| 177 |
-
hypothesis_regions: Iterable[Region | dict] | None,
|
| 178 |
-
iou_threshold: float = 0.5,
|
| 179 |
-
) -> dict:
|
| 180 |
-
"""Calcule precision/recall/F1 sur le layout par type de région.
|
| 181 |
-
|
| 182 |
-
Parameters
|
| 183 |
-
----------
|
| 184 |
-
reference_regions:
|
| 185 |
-
Liste de régions GT (``Region`` ou dict ``{id, type, bbox}``).
|
| 186 |
-
hypothesis_regions:
|
| 187 |
-
Liste de régions produites par le moteur OCR/HTR ou un
|
| 188 |
-
layout-detector.
|
| 189 |
-
iou_threshold:
|
| 190 |
-
Seuil de chevauchement minimal pour déclarer un appariement
|
| 191 |
-
(défaut : 0,5 — convention ICDAR).
|
| 192 |
-
|
| 193 |
-
Returns
|
| 194 |
-
-------
|
| 195 |
-
dict
|
| 196 |
-
``{
|
| 197 |
-
"global": {"precision", "recall", "f1", "support"},
|
| 198 |
-
"per_type": {type_name: {"precision", ...}},
|
| 199 |
-
"true_positives": int,
|
| 200 |
-
"false_positives": int,
|
| 201 |
-
"false_negatives": int,
|
| 202 |
-
"missed_regions": list[dict], # GT non matchées
|
| 203 |
-
"hallucinated_regions": list[dict], # hyp non matchées
|
| 204 |
-
"iou_threshold": float,
|
| 205 |
-
}``
|
| 206 |
-
|
| 207 |
-
Cas dégénérés
|
| 208 |
-
-------------
|
| 209 |
-
- Deux listes vides → F1 = 0 et tous compteurs à 0.
|
| 210 |
-
- GT vide + hyp non-vide → F1 = 0 (toutes hyp = FP).
|
| 211 |
-
- hyp vide + GT non-vide → F1 = 0 (toutes GT = FN).
|
| 212 |
-
"""
|
| 213 |
-
refs = [_to_region(r) for r in (reference_regions or [])]
|
| 214 |
-
hyps = [_to_region(h) for h in (hypothesis_regions or [])]
|
| 215 |
-
|
| 216 |
-
matches, unmatched_refs, unmatched_hyps = _align_regions(
|
| 217 |
-
refs, hyps, iou_threshold,
|
| 218 |
-
)
|
| 219 |
-
|
| 220 |
-
tp = len(matches)
|
| 221 |
-
fn = len(unmatched_refs)
|
| 222 |
-
fp = len(unmatched_hyps)
|
| 223 |
-
|
| 224 |
-
cat_tp: dict[str, int] = {}
|
| 225 |
-
cat_fn: dict[str, int] = {}
|
| 226 |
-
cat_fp: dict[str, int] = {}
|
| 227 |
-
for i, _j, _iou in matches:
|
| 228 |
-
cat = refs[i].type
|
| 229 |
-
cat_tp[cat] = cat_tp.get(cat, 0) + 1
|
| 230 |
-
for i in unmatched_refs:
|
| 231 |
-
cat = refs[i].type
|
| 232 |
-
cat_fn[cat] = cat_fn.get(cat, 0) + 1
|
| 233 |
-
for j in unmatched_hyps:
|
| 234 |
-
cat = hyps[j].type
|
| 235 |
-
cat_fp[cat] = cat_fp.get(cat, 0) + 1
|
| 236 |
-
|
| 237 |
-
all_categories = sorted(set(cat_tp) | set(cat_fn) | set(cat_fp))
|
| 238 |
-
per_type = {
|
| 239 |
-
cat: _prf(
|
| 240 |
-
cat_tp.get(cat, 0),
|
| 241 |
-
cat_fp.get(cat, 0),
|
| 242 |
-
cat_fn.get(cat, 0),
|
| 243 |
-
)
|
| 244 |
-
for cat in all_categories
|
| 245 |
-
}
|
| 246 |
-
|
| 247 |
-
return {
|
| 248 |
-
"global": _prf(tp, fp, fn),
|
| 249 |
-
"per_type": per_type,
|
| 250 |
-
"true_positives": tp,
|
| 251 |
-
"false_positives": fp,
|
| 252 |
-
"false_negatives": fn,
|
| 253 |
-
"missed_regions": [
|
| 254 |
-
{"id": refs[i].id, "type": refs[i].type, "bbox": list(refs[i].bbox)}
|
| 255 |
-
for i in sorted(unmatched_refs)
|
| 256 |
-
],
|
| 257 |
-
"hallucinated_regions": [
|
| 258 |
-
{"id": hyps[j].id, "type": hyps[j].type, "bbox": list(hyps[j].bbox)}
|
| 259 |
-
for j in sorted(unmatched_hyps)
|
| 260 |
-
],
|
| 261 |
-
"iou_threshold": iou_threshold,
|
| 262 |
-
}
|
| 263 |
-
|
| 264 |
-
|
| 265 |
-
def layout_f1(
|
| 266 |
-
reference_regions: Iterable[Region | dict] | None,
|
| 267 |
-
hypothesis_regions: Iterable[Region | dict] | None,
|
| 268 |
-
iou_threshold: float = 0.5,
|
| 269 |
-
) -> float:
|
| 270 |
-
"""Raccourci : F1 global du layout."""
|
| 271 |
-
return compute_layout_metrics(
|
| 272 |
-
reference_regions, hypothesis_regions, iou_threshold,
|
| 273 |
-
)["global"]["f1"]
|
| 274 |
-
|
| 275 |
-
|
| 276 |
-
__all__ = [
|
| 277 |
-
"Region",
|
| 278 |
-
"compute_layout_metrics",
|
| 279 |
-
"layout_f1",
|
| 280 |
-
]
|
|
|
|
| 1 |
+
"""Re-export — Sprint A14-S10. Le contenu canonique vit dans
|
| 2 |
+
``picarones.evaluation.metrics.layout``.
|
| 3 |
|
| 4 |
+
L'ancien chemin ``picarones.measurements.layout`` est conservé pour
|
| 5 |
+
ne casser aucun consommateur. Au S22, ce re-export disparaîtra.
|
| 6 |
|
| 7 |
+
Ré-expose explicitement le symbole privé ``_iou_bbox`` qu'au moins
|
| 8 |
+
un test importe directement.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 9 |
"""
|
| 10 |
|
| 11 |
from __future__ import annotations
|
| 12 |
|
| 13 |
+
from picarones.evaluation.metrics.layout import * # noqa: F401,F403
|
| 14 |
+
from picarones.evaluation.metrics.layout import _iou_bbox # noqa: F401
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
@@ -1,561 +1,10 @@
|
|
| 1 |
-
"""
|
|
|
|
| 2 |
|
| 3 |
-
|
| 4 |
-
|
| 5 |
-
Pourquoi ce module
|
| 6 |
-
------------------
|
| 7 |
-
Le moteur narratif (Sprint 19) émet des `Fact` qui décrivent **ce
|
| 8 |
-
qui s'est passé** dans le benchmark : qui gagne, qui s'effondre,
|
| 9 |
-
qui est fragile. Ce sprint répond à une question
|
| 10 |
-
complémentaire : **sur quelle dimension le bénéfice attendu d'une
|
| 11 |
-
amélioration serait-il le plus visible ?**
|
| 12 |
-
|
| 13 |
-
Pas de prescription
|
| 14 |
-
-------------------
|
| 15 |
-
Picarones est un **outil de recherche**, pas un atelier de
|
| 16 |
-
production. Le module ne dit jamais *« faites X »* ni
|
| 17 |
-
*« utilisez le moteur Y »* ; il agrège des **observations
|
| 18 |
-
factuelles** déjà calculées dans d'autres modules (Sprints 75-81)
|
| 19 |
-
et les présente comme un récapitulatif compact en bas du rapport.
|
| 20 |
-
Le chercheur lit, juge et arbitre.
|
| 21 |
-
|
| 22 |
-
Exemples de leviers émis
|
| 23 |
-
------------------------
|
| 24 |
-
- *« 65 % des erreurs de Tesseract sont de classe récupérable
|
| 25 |
-
(case_error, ligature_error, abbreviation_error) — un
|
| 26 |
-
post-processing trivial absorberait une partie. »*
|
| 27 |
-
- *« 12 % de vos documents concentrent 78 % du CER total
|
| 28 |
-
(Pareto-CER). »*
|
| 29 |
-
- *« Le déficit projeté du moteur le plus fragile sur le corpus
|
| 30 |
-
réel est de 4,2 points de CER (Sprint 81). »*
|
| 31 |
-
- *« Le top-3 des tokens GT systématiquement modernisés est
|
| 32 |
-
maistre, nostre, veoir (Sprint 80). »*
|
| 33 |
-
|
| 34 |
-
Structure
|
| 35 |
-
---------
|
| 36 |
-
Module parallèle au registre narratif Sprint 19 : `Lever` est la
|
| 37 |
-
dataclass équivalente à `Fact`, `LeverImportance` reprend la
|
| 38 |
-
sémantique de `FactImportance`, `@register_lever` indexe les
|
| 39 |
-
détecteurs. Garde-fou anti-hallucination identique : chaque
|
| 40 |
-
nombre rendu doit être présent dans le `payload` du `Lever`.
|
| 41 |
-
|
| 42 |
-
Les détecteurs lisent **uniquement** des structures déjà
|
| 43 |
-
construites par le pipeline du benchmark — ils ne calculent rien
|
| 44 |
-
de nouveau, ils synthétisent. C'est pourquoi le module est
|
| 45 |
-
résolument optionnel : si un benchmark n'expose pas
|
| 46 |
-
`taxonomy_aggregated`, `inter_engine_analysis`, `corpus_difficulty`,
|
| 47 |
-
`lexical_modernization` ou `robustness_projection`, le détecteur
|
| 48 |
-
correspondant retourne tout simplement `[]`.
|
| 49 |
"""
|
| 50 |
|
| 51 |
from __future__ import annotations
|
| 52 |
|
| 53 |
-
import
|
| 54 |
-
import threading
|
| 55 |
-
from dataclasses import dataclass
|
| 56 |
-
from enum import Enum
|
| 57 |
-
from typing import Callable
|
| 58 |
-
|
| 59 |
-
logger = logging.getLogger(__name__)
|
| 60 |
-
|
| 61 |
-
|
| 62 |
-
# ──────────────────────────────────────────────────────────────────────────
|
| 63 |
-
# Modèle
|
| 64 |
-
# ──────────────────────────────────────────────────────────────────────────
|
| 65 |
-
|
| 66 |
-
|
| 67 |
-
class LeverType(str, Enum):
|
| 68 |
-
"""Types de leviers détectés."""
|
| 69 |
-
|
| 70 |
-
DOMINANT_RECOVERABLE_CLASS = "dominant_recoverable_class"
|
| 71 |
-
"""Une part importante des erreurs d'un moteur est dans des classes
|
| 72 |
-
catégorisées « récupérables » (Sprint 77)."""
|
| 73 |
-
|
| 74 |
-
PARETO_CONCENTRATION = "pareto_concentration"
|
| 75 |
-
"""Une fraction minoritaire de documents concentre une fraction
|
| 76 |
-
majoritaire du CER total — l'inspection ciblée est rentable."""
|
| 77 |
-
|
| 78 |
-
COMPLEMENTARITY_OBSERVATION = "complementarity_observation"
|
| 79 |
-
"""Le `complementarity_gap` (Sprint 35) entre l'oracle et le
|
| 80 |
-
meilleur moteur seul est non négligeable — observation factuelle,
|
| 81 |
-
aucune recommandation d'ensemble."""
|
| 82 |
-
|
| 83 |
-
LEXICAL_MODERNIZATION_OBSERVATION = "lexical_modernization_observation"
|
| 84 |
-
"""Top-N des tokens GT systématiquement modernisés (Sprint 80)."""
|
| 85 |
-
|
| 86 |
-
ROBUSTNESS_PROJECTION_OBSERVATION = "robustness_projection_observation"
|
| 87 |
-
"""Déficit projeté global le plus important pour un moteur sur
|
| 88 |
-
le corpus réel (Sprint 81)."""
|
| 89 |
-
|
| 90 |
-
|
| 91 |
-
class LeverImportance(int, Enum):
|
| 92 |
-
"""Importance éditoriale d'un levier."""
|
| 93 |
-
|
| 94 |
-
HIGH = 70
|
| 95 |
-
MEDIUM = 40
|
| 96 |
-
LOW = 10
|
| 97 |
-
|
| 98 |
-
|
| 99 |
-
@dataclass
|
| 100 |
-
class Lever:
|
| 101 |
-
"""Observation factuelle synthétisable en encart « Leviers ».
|
| 102 |
-
|
| 103 |
-
Attributes
|
| 104 |
-
----------
|
| 105 |
-
type:
|
| 106 |
-
Le type de levier (voir `LeverType`).
|
| 107 |
-
importance:
|
| 108 |
-
Score qui décide l'ordre d'affichage.
|
| 109 |
-
payload:
|
| 110 |
-
Données brutes — **tout chiffre rendu dans le HTML doit
|
| 111 |
-
provenir d'ici**, jamais d'un calcul du renderer.
|
| 112 |
-
engines_involved:
|
| 113 |
-
Noms des moteurs concernés (peut être vide pour un levier
|
| 114 |
-
corpus-wide).
|
| 115 |
-
"""
|
| 116 |
-
|
| 117 |
-
type: LeverType
|
| 118 |
-
importance: LeverImportance
|
| 119 |
-
payload: dict
|
| 120 |
-
engines_involved: tuple[str, ...] = ()
|
| 121 |
-
|
| 122 |
-
def as_dict(self) -> dict:
|
| 123 |
-
return {
|
| 124 |
-
"type": self.type.value,
|
| 125 |
-
"importance": int(self.importance),
|
| 126 |
-
"payload": self.payload,
|
| 127 |
-
"engines_involved": list(self.engines_involved),
|
| 128 |
-
}
|
| 129 |
-
|
| 130 |
-
|
| 131 |
-
# ──────────────────────────────────────────────────────────────────────────
|
| 132 |
-
# Registre
|
| 133 |
-
# ──────────────────────────────────────────────────────────────────────────
|
| 134 |
-
|
| 135 |
-
|
| 136 |
-
LeverDetectorFn = Callable[[dict], list[Lever]]
|
| 137 |
-
|
| 138 |
-
|
| 139 |
-
@dataclass(frozen=True)
|
| 140 |
-
class LeverDetectorEntry:
|
| 141 |
-
lever_type: LeverType
|
| 142 |
-
fn: LeverDetectorFn
|
| 143 |
-
priority: int
|
| 144 |
-
|
| 145 |
-
|
| 146 |
-
_LEVER_REGISTRY: dict[LeverType, LeverDetectorEntry] = {}
|
| 147 |
-
_LEVER_REGISTRY_LOCK = threading.Lock()
|
| 148 |
-
|
| 149 |
-
|
| 150 |
-
def register_lever(
|
| 151 |
-
lever_type: LeverType,
|
| 152 |
-
*,
|
| 153 |
-
priority: int,
|
| 154 |
-
) -> Callable[[LeverDetectorFn], LeverDetectorFn]:
|
| 155 |
-
"""Décorateur : enregistre un détecteur de levier.
|
| 156 |
-
|
| 157 |
-
Une seule fonction par type — réenregistrer lève `ValueError`.
|
| 158 |
-
"""
|
| 159 |
-
def _decorator(fn: LeverDetectorFn) -> LeverDetectorFn:
|
| 160 |
-
with _LEVER_REGISTRY_LOCK:
|
| 161 |
-
if lever_type in _LEVER_REGISTRY:
|
| 162 |
-
raise ValueError(
|
| 163 |
-
f"Détecteur déjà enregistré pour {lever_type.value!r} : "
|
| 164 |
-
f"{_LEVER_REGISTRY[lever_type].fn.__name__}."
|
| 165 |
-
)
|
| 166 |
-
_LEVER_REGISTRY[lever_type] = LeverDetectorEntry(
|
| 167 |
-
lever_type=lever_type, fn=fn, priority=int(priority),
|
| 168 |
-
)
|
| 169 |
-
return fn
|
| 170 |
-
return _decorator
|
| 171 |
-
|
| 172 |
-
|
| 173 |
-
def unregister_lever(lever_type: LeverType) -> None:
|
| 174 |
-
with _LEVER_REGISTRY_LOCK:
|
| 175 |
-
_LEVER_REGISTRY.pop(lever_type, None)
|
| 176 |
-
|
| 177 |
-
|
| 178 |
-
def iter_lever_detectors() -> list[LeverDetectorEntry]:
|
| 179 |
-
with _LEVER_REGISTRY_LOCK:
|
| 180 |
-
entries = list(_LEVER_REGISTRY.values())
|
| 181 |
-
entries.sort(key=lambda e: e.priority)
|
| 182 |
-
return entries
|
| 183 |
-
|
| 184 |
-
|
| 185 |
-
def detect_levers(benchmark_data: dict) -> list[Lever]:
|
| 186 |
-
"""Applique tous les détecteurs enregistrés et trie par importance
|
| 187 |
-
décroissante puis priorité d'enregistrement croissante."""
|
| 188 |
-
levers: list[Lever] = []
|
| 189 |
-
for entry in iter_lever_detectors():
|
| 190 |
-
try:
|
| 191 |
-
result = entry.fn(benchmark_data)
|
| 192 |
-
except Exception as e:
|
| 193 |
-
logger.warning(
|
| 194 |
-
"[levers.detector.%s] fonctionnalité dégradée : %s",
|
| 195 |
-
entry.lever_type.value, e,
|
| 196 |
-
)
|
| 197 |
-
continue
|
| 198 |
-
if result:
|
| 199 |
-
levers.extend(result)
|
| 200 |
-
# Tri stable : importance décroissante d'abord
|
| 201 |
-
levers.sort(key=lambda lv: -int(lv.importance))
|
| 202 |
-
return levers
|
| 203 |
-
|
| 204 |
-
|
| 205 |
-
# ──────────────────────────────────────────────────────────────────────────
|
| 206 |
-
# Détecteurs
|
| 207 |
-
# ──────────────────────────────────────────────────────────────────────────
|
| 208 |
-
|
| 209 |
-
|
| 210 |
-
# Catégorisation reprise du Sprint 77 (taxonomy_comparison.py).
|
| 211 |
-
# Volontairement dupliquée ici pour ne pas introduire d'import
|
| 212 |
-
# circulaire — la sémantique est gelée.
|
| 213 |
-
_RECOVERABILITY: dict[str, str] = {
|
| 214 |
-
"case_error": "recoverable",
|
| 215 |
-
"ligature_error": "recoverable",
|
| 216 |
-
"abbreviation_error": "recoverable",
|
| 217 |
-
"diacritic_error": "difficult",
|
| 218 |
-
"visual_confusion": "difficult",
|
| 219 |
-
"hapax": "difficult",
|
| 220 |
-
"lacuna": "irrecoverable",
|
| 221 |
-
"oov_character": "irrecoverable",
|
| 222 |
-
"segmentation_error": "irrecoverable",
|
| 223 |
-
}
|
| 224 |
-
|
| 225 |
-
|
| 226 |
-
@register_lever(LeverType.DOMINANT_RECOVERABLE_CLASS, priority=10)
|
| 227 |
-
def detect_dominant_recoverable_class(
|
| 228 |
-
benchmark_data: dict,
|
| 229 |
-
*,
|
| 230 |
-
threshold: float = 0.30,
|
| 231 |
-
) -> list[Lever]:
|
| 232 |
-
"""Émet un levier si ≥ `threshold` des erreurs d'un moteur sont
|
| 233 |
-
classifiées récupérables (catégorisation Sprint 77).
|
| 234 |
-
|
| 235 |
-
Lit `benchmark_data["engines"][i]["aggregated_taxonomy"]` —
|
| 236 |
-
structure produite par le runner historique. Si absent, retourne
|
| 237 |
-
[].
|
| 238 |
-
"""
|
| 239 |
-
engines = benchmark_data.get("engines") or []
|
| 240 |
-
out: list[Lever] = []
|
| 241 |
-
for engine in engines:
|
| 242 |
-
taxonomy = engine.get("aggregated_taxonomy")
|
| 243 |
-
if not taxonomy:
|
| 244 |
-
continue
|
| 245 |
-
# `taxonomy` peut être {class_name: int} ou un dict avec une
|
| 246 |
-
# sous-clé "counts" — on accepte les deux conventions.
|
| 247 |
-
counts = taxonomy.get("counts") if isinstance(taxonomy, dict) and "counts" in taxonomy else taxonomy
|
| 248 |
-
if not isinstance(counts, dict) or not counts:
|
| 249 |
-
continue
|
| 250 |
-
try:
|
| 251 |
-
int_counts = {k: int(v) for k, v in counts.items() if isinstance(v, (int, float))}
|
| 252 |
-
except (TypeError, ValueError):
|
| 253 |
-
continue
|
| 254 |
-
total = sum(int_counts.values())
|
| 255 |
-
if total <= 0:
|
| 256 |
-
continue
|
| 257 |
-
recoverable_total = sum(
|
| 258 |
-
v for k, v in int_counts.items()
|
| 259 |
-
if _RECOVERABILITY.get(k) == "recoverable"
|
| 260 |
-
)
|
| 261 |
-
share = recoverable_total / total
|
| 262 |
-
if share < threshold:
|
| 263 |
-
continue
|
| 264 |
-
# Classes récupérables non vides triées par count décroissant
|
| 265 |
-
breakdown = sorted(
|
| 266 |
-
(
|
| 267 |
-
(k, v) for k, v in int_counts.items()
|
| 268 |
-
if _RECOVERABILITY.get(k) == "recoverable" and v > 0
|
| 269 |
-
),
|
| 270 |
-
key=lambda kv: -kv[1],
|
| 271 |
-
)
|
| 272 |
-
importance = (
|
| 273 |
-
LeverImportance.HIGH if share >= 0.50 else LeverImportance.MEDIUM
|
| 274 |
-
)
|
| 275 |
-
out.append(Lever(
|
| 276 |
-
type=LeverType.DOMINANT_RECOVERABLE_CLASS,
|
| 277 |
-
importance=importance,
|
| 278 |
-
payload={
|
| 279 |
-
"engine": engine.get("name") or "?",
|
| 280 |
-
"share_recoverable": share,
|
| 281 |
-
"share_recoverable_pct": round(share * 100, 1),
|
| 282 |
-
"n_recoverable": recoverable_total,
|
| 283 |
-
"n_total_errors": total,
|
| 284 |
-
"top_classes": [
|
| 285 |
-
{"class": k, "count": v} for k, v in breakdown[:3]
|
| 286 |
-
],
|
| 287 |
-
},
|
| 288 |
-
engines_involved=(engine.get("name") or "?",),
|
| 289 |
-
))
|
| 290 |
-
return out
|
| 291 |
-
|
| 292 |
-
|
| 293 |
-
@register_lever(LeverType.PARETO_CONCENTRATION, priority=20)
|
| 294 |
-
def detect_pareto_concentration(
|
| 295 |
-
benchmark_data: dict,
|
| 296 |
-
*,
|
| 297 |
-
top_share: float = 0.20,
|
| 298 |
-
cer_share_threshold: float = 0.50,
|
| 299 |
-
) -> list[Lever]:
|
| 300 |
-
"""Émet un levier si une fraction minoritaire de documents
|
| 301 |
-
(`top_share`) concentre plus de `cer_share_threshold` du CER
|
| 302 |
-
total cumulé sur le moteur leader.
|
| 303 |
-
|
| 304 |
-
Lit `benchmark_data["per_doc_cer"][engine_name]` ou tente de
|
| 305 |
-
reconstruire depuis `benchmark_data["engines"][...]["per_doc"]`.
|
| 306 |
-
Si rien d'exploitable, retourne [].
|
| 307 |
-
"""
|
| 308 |
-
ranking = benchmark_data.get("ranking") or []
|
| 309 |
-
if not ranking:
|
| 310 |
-
return []
|
| 311 |
-
leader = ranking[0]
|
| 312 |
-
leader_name = leader.get("engine")
|
| 313 |
-
if not leader_name:
|
| 314 |
-
return []
|
| 315 |
-
|
| 316 |
-
per_doc_cer: list[float] = []
|
| 317 |
-
# Voie 1 : structure plate "per_doc_cer"
|
| 318 |
-
flat = benchmark_data.get("per_doc_cer") or {}
|
| 319 |
-
if isinstance(flat, dict) and leader_name in flat and isinstance(flat[leader_name], list):
|
| 320 |
-
per_doc_cer = [float(x) for x in flat[leader_name] if isinstance(x, (int, float))]
|
| 321 |
-
else:
|
| 322 |
-
# Voie 2 : engine.per_doc liste de dicts {cer: float}
|
| 323 |
-
for engine in benchmark_data.get("engines") or []:
|
| 324 |
-
if engine.get("name") != leader_name:
|
| 325 |
-
continue
|
| 326 |
-
per_doc = engine.get("per_doc") or []
|
| 327 |
-
for entry in per_doc:
|
| 328 |
-
if isinstance(entry, dict) and isinstance(entry.get("cer"), (int, float)):
|
| 329 |
-
per_doc_cer.append(float(entry["cer"]))
|
| 330 |
-
break
|
| 331 |
-
|
| 332 |
-
if not per_doc_cer:
|
| 333 |
-
return []
|
| 334 |
-
total_cer = sum(per_doc_cer)
|
| 335 |
-
if total_cer <= 0:
|
| 336 |
-
return []
|
| 337 |
-
|
| 338 |
-
sorted_cer = sorted(per_doc_cer, reverse=True)
|
| 339 |
-
n = len(sorted_cer)
|
| 340 |
-
n_top = max(1, int(round(top_share * n)))
|
| 341 |
-
top_cer_sum = sum(sorted_cer[:n_top])
|
| 342 |
-
share_of_total = top_cer_sum / total_cer
|
| 343 |
-
if share_of_total < cer_share_threshold:
|
| 344 |
-
return []
|
| 345 |
-
importance = (
|
| 346 |
-
LeverImportance.HIGH if share_of_total >= 0.75
|
| 347 |
-
else LeverImportance.MEDIUM
|
| 348 |
-
)
|
| 349 |
-
return [Lever(
|
| 350 |
-
type=LeverType.PARETO_CONCENTRATION,
|
| 351 |
-
importance=importance,
|
| 352 |
-
payload={
|
| 353 |
-
"engine": leader_name,
|
| 354 |
-
"n_docs": n,
|
| 355 |
-
"n_docs_top": n_top,
|
| 356 |
-
"top_share_pct": round((n_top / n) * 100, 1),
|
| 357 |
-
"cer_share_of_total": share_of_total,
|
| 358 |
-
"cer_share_pct": round(share_of_total * 100, 1),
|
| 359 |
-
},
|
| 360 |
-
engines_involved=(leader_name,),
|
| 361 |
-
)]
|
| 362 |
-
|
| 363 |
-
|
| 364 |
-
@register_lever(LeverType.COMPLEMENTARITY_OBSERVATION, priority=30)
|
| 365 |
-
def detect_complementarity_observation(
|
| 366 |
-
benchmark_data: dict,
|
| 367 |
-
*,
|
| 368 |
-
min_relative_gap: float = 0.20,
|
| 369 |
-
) -> list[Lever]:
|
| 370 |
-
"""Reformule factuellement le `complementarity_gap` (Sprint 35).
|
| 371 |
-
|
| 372 |
-
Lit `benchmark_data["inter_engine_analysis"]`. Garde-fou : ne
|
| 373 |
-
déclenche que si `relative_gap` ≥ `min_relative_gap`. **Aucune
|
| 374 |
-
recommandation d'ensemble** — le levier dit factuellement
|
| 375 |
-
« X points séparent l'oracle du meilleur moteur », c'est tout.
|
| 376 |
-
"""
|
| 377 |
-
inter = benchmark_data.get("inter_engine_analysis") or {}
|
| 378 |
-
cgap = inter.get("complementarity_gap") or {}
|
| 379 |
-
relative_gap = cgap.get("relative_gap")
|
| 380 |
-
absolute_gap = cgap.get("absolute_gap")
|
| 381 |
-
if relative_gap is None or absolute_gap is None:
|
| 382 |
-
return []
|
| 383 |
-
try:
|
| 384 |
-
rg = float(relative_gap)
|
| 385 |
-
ag = float(absolute_gap)
|
| 386 |
-
except (TypeError, ValueError):
|
| 387 |
-
return []
|
| 388 |
-
if rg < min_relative_gap:
|
| 389 |
-
return []
|
| 390 |
-
importance = (
|
| 391 |
-
LeverImportance.HIGH if rg >= 0.50 else LeverImportance.MEDIUM
|
| 392 |
-
)
|
| 393 |
-
payload: dict = {
|
| 394 |
-
"absolute_gap": ag,
|
| 395 |
-
"absolute_gap_pct": round(ag * 100, 1),
|
| 396 |
-
"relative_gap": rg,
|
| 397 |
-
"relative_gap_pct": round(rg * 100, 1),
|
| 398 |
-
}
|
| 399 |
-
best_engine = cgap.get("best_engine") or inter.get("best_engine")
|
| 400 |
-
best_recall = cgap.get("best_recall") or inter.get("best_engine_recall")
|
| 401 |
-
oracle_recall = cgap.get("oracle_recall") or inter.get("oracle_recall")
|
| 402 |
-
engines_involved: tuple[str, ...] = ()
|
| 403 |
-
if best_engine:
|
| 404 |
-
payload["best_engine"] = str(best_engine)
|
| 405 |
-
engines_involved = (str(best_engine),)
|
| 406 |
-
if isinstance(best_recall, (int, float)):
|
| 407 |
-
payload["best_recall"] = float(best_recall)
|
| 408 |
-
if isinstance(oracle_recall, (int, float)):
|
| 409 |
-
payload["oracle_recall"] = float(oracle_recall)
|
| 410 |
-
return [Lever(
|
| 411 |
-
type=LeverType.COMPLEMENTARITY_OBSERVATION,
|
| 412 |
-
importance=importance,
|
| 413 |
-
payload=payload,
|
| 414 |
-
engines_involved=engines_involved,
|
| 415 |
-
)]
|
| 416 |
-
|
| 417 |
-
|
| 418 |
-
@register_lever(LeverType.LEXICAL_MODERNIZATION_OBSERVATION, priority=40)
|
| 419 |
-
def detect_lexical_modernization_observation(
|
| 420 |
-
benchmark_data: dict,
|
| 421 |
-
*,
|
| 422 |
-
top_n: int = 3,
|
| 423 |
-
min_total: int = 3,
|
| 424 |
-
min_rate: float = 0.50,
|
| 425 |
-
) -> list[Lever]:
|
| 426 |
-
"""Pour chaque moteur disposant de `lexical_modernization`,
|
| 427 |
-
émet un levier listant les `top_n` tokens GT les plus modernisés.
|
| 428 |
-
|
| 429 |
-
Lit `benchmark_data["engines"][i]["lexical_modernization"]` qui
|
| 430 |
-
suit la forme produite par `compute_lexical_modernization` du
|
| 431 |
-
Sprint 80 (`{"n_gt_tokens": int, "tokens": dict}`).
|
| 432 |
-
"""
|
| 433 |
-
out: list[Lever] = []
|
| 434 |
-
for engine in benchmark_data.get("engines") or []:
|
| 435 |
-
data = engine.get("lexical_modernization")
|
| 436 |
-
if not isinstance(data, dict):
|
| 437 |
-
continue
|
| 438 |
-
tokens = data.get("tokens") or {}
|
| 439 |
-
if not isinstance(tokens, dict) or not tokens:
|
| 440 |
-
continue
|
| 441 |
-
candidates: list[tuple[str, dict]] = []
|
| 442 |
-
for gt_token, slot in tokens.items():
|
| 443 |
-
if not isinstance(slot, dict):
|
| 444 |
-
continue
|
| 445 |
-
n_total = slot.get("n_total")
|
| 446 |
-
rate = slot.get("rate_modernized")
|
| 447 |
-
if not isinstance(n_total, (int, float)) or not isinstance(rate, (int, float)):
|
| 448 |
-
continue
|
| 449 |
-
if int(n_total) < min_total:
|
| 450 |
-
continue
|
| 451 |
-
if float(rate) < min_rate:
|
| 452 |
-
continue
|
| 453 |
-
candidates.append((gt_token, dict(slot)))
|
| 454 |
-
if not candidates:
|
| 455 |
-
continue
|
| 456 |
-
candidates.sort(
|
| 457 |
-
key=lambda kv: (-float(kv[1].get("rate_modernized", 0.0)),
|
| 458 |
-
-int(kv[1].get("n_total", 0)),
|
| 459 |
-
kv[0]),
|
| 460 |
-
)
|
| 461 |
-
top = candidates[:top_n]
|
| 462 |
-
engine_name = engine.get("name") or "?"
|
| 463 |
-
max_rate = max(float(slot.get("rate_modernized", 0.0)) for _, slot in top)
|
| 464 |
-
importance = (
|
| 465 |
-
LeverImportance.HIGH if max_rate >= 0.90 else LeverImportance.MEDIUM
|
| 466 |
-
)
|
| 467 |
-
out.append(Lever(
|
| 468 |
-
type=LeverType.LEXICAL_MODERNIZATION_OBSERVATION,
|
| 469 |
-
importance=importance,
|
| 470 |
-
payload={
|
| 471 |
-
"engine": engine_name,
|
| 472 |
-
"top_tokens": [
|
| 473 |
-
{
|
| 474 |
-
"gt_token": gt,
|
| 475 |
-
"n_total": int(slot.get("n_total", 0)),
|
| 476 |
-
"rate_modernized": float(slot.get("rate_modernized", 0.0)),
|
| 477 |
-
"rate_modernized_pct": round(
|
| 478 |
-
float(slot.get("rate_modernized", 0.0)) * 100, 1,
|
| 479 |
-
),
|
| 480 |
-
}
|
| 481 |
-
for gt, slot in top
|
| 482 |
-
],
|
| 483 |
-
},
|
| 484 |
-
engines_involved=(engine_name,),
|
| 485 |
-
))
|
| 486 |
-
return out
|
| 487 |
-
|
| 488 |
-
|
| 489 |
-
@register_lever(LeverType.ROBUSTNESS_PROJECTION_OBSERVATION, priority=50)
|
| 490 |
-
def detect_robustness_projection_observation(
|
| 491 |
-
benchmark_data: dict,
|
| 492 |
-
*,
|
| 493 |
-
min_total_deficit: float = 0.02,
|
| 494 |
-
) -> list[Lever]:
|
| 495 |
-
"""Lit l'agrégation par moteur de la projection de robustesse
|
| 496 |
-
(Sprint 81). Émet le levier pour le moteur dont
|
| 497 |
-
`total_expected_deficit` est ≥ `min_total_deficit` (par défaut
|
| 498 |
-
2 points de CER).
|
| 499 |
-
|
| 500 |
-
Lit `benchmark_data["robustness_projection_aggregated"]` —
|
| 501 |
-
structure produite par `aggregate_projection_per_engine`.
|
| 502 |
-
"""
|
| 503 |
-
agg = benchmark_data.get("robustness_projection_aggregated") or {}
|
| 504 |
-
if not isinstance(agg, dict) or not agg:
|
| 505 |
-
return []
|
| 506 |
-
out: list[Lever] = []
|
| 507 |
-
for engine_name, info in agg.items():
|
| 508 |
-
if not isinstance(info, dict):
|
| 509 |
-
continue
|
| 510 |
-
total_deficit = info.get("total_expected_deficit")
|
| 511 |
-
worst_type = info.get("worst_degradation_type")
|
| 512 |
-
worst_deficit = info.get("worst_degradation_deficit")
|
| 513 |
-
if not isinstance(total_deficit, (int, float)):
|
| 514 |
-
continue
|
| 515 |
-
if float(total_deficit) < min_total_deficit:
|
| 516 |
-
continue
|
| 517 |
-
importance = (
|
| 518 |
-
LeverImportance.HIGH if float(total_deficit) >= 0.05
|
| 519 |
-
else LeverImportance.MEDIUM
|
| 520 |
-
)
|
| 521 |
-
payload: dict = {
|
| 522 |
-
"engine": engine_name,
|
| 523 |
-
"total_expected_deficit": float(total_deficit),
|
| 524 |
-
"total_expected_deficit_pct": round(float(total_deficit) * 100, 1),
|
| 525 |
-
"n_degradation_types": int(info.get("n_degradation_types") or 0),
|
| 526 |
-
}
|
| 527 |
-
if isinstance(worst_type, str):
|
| 528 |
-
payload["worst_degradation_type"] = worst_type
|
| 529 |
-
if isinstance(worst_deficit, (int, float)):
|
| 530 |
-
payload["worst_degradation_deficit"] = float(worst_deficit)
|
| 531 |
-
payload["worst_degradation_deficit_pct"] = round(
|
| 532 |
-
float(worst_deficit) * 100, 1,
|
| 533 |
-
)
|
| 534 |
-
out.append(Lever(
|
| 535 |
-
type=LeverType.ROBUSTNESS_PROJECTION_OBSERVATION,
|
| 536 |
-
importance=importance,
|
| 537 |
-
payload=payload,
|
| 538 |
-
engines_involved=(engine_name,),
|
| 539 |
-
))
|
| 540 |
-
# Tri par déficit décroissant pour stabilité d'affichage.
|
| 541 |
-
out.sort(
|
| 542 |
-
key=lambda lv: -float(lv.payload.get("total_expected_deficit") or 0.0),
|
| 543 |
-
)
|
| 544 |
-
return out
|
| 545 |
-
|
| 546 |
-
|
| 547 |
-
__all__ = [
|
| 548 |
-
"Lever",
|
| 549 |
-
"LeverImportance",
|
| 550 |
-
"LeverType",
|
| 551 |
-
"LeverDetectorEntry",
|
| 552 |
-
"register_lever",
|
| 553 |
-
"unregister_lever",
|
| 554 |
-
"iter_lever_detectors",
|
| 555 |
-
"detect_levers",
|
| 556 |
-
"detect_dominant_recoverable_class",
|
| 557 |
-
"detect_pareto_concentration",
|
| 558 |
-
"detect_complementarity_observation",
|
| 559 |
-
"detect_lexical_modernization_observation",
|
| 560 |
-
"detect_robustness_projection_observation",
|
| 561 |
-
]
|
|
|
|
| 1 |
+
"""Re-export — Sprint A14-S10. Le contenu canonique vit dans
|
| 2 |
+
``picarones.evaluation.metrics.levers``.
|
| 3 |
|
| 4 |
+
L'ancien chemin ``picarones.measurements.levers`` est conservé pour
|
| 5 |
+
ne casser aucun consommateur. Au S22, ce re-export disparaîtra.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 6 |
"""
|
| 7 |
|
| 8 |
from __future__ import annotations
|
| 9 |
|
| 10 |
+
from picarones.evaluation.metrics.levers import * # noqa: F401,F403
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
@@ -1,263 +1,10 @@
|
|
| 1 |
-
"""
|
| 2 |
-
|
| 3 |
|
| 4 |
-
|
| 5 |
-
|
| 6 |
-
Pourquoi ce module
|
| 7 |
-
------------------
|
| 8 |
-
Le détecteur ``llm_hallucination_flag`` (Sprint 19) signale qu'un
|
| 9 |
-
moteur sur-normalise (« 0,05 % »). Mais ce score agrégé ne dit
|
| 10 |
-
rien sur **quoi** corriger dans le prompt. Ce module produit
|
| 11 |
-
une **table de fréquences détaillée** :
|
| 12 |
-
|
| 13 |
-
+----------------------+--------------------+------+----------+
|
| 14 |
-
| Forme historique GT | Forme modernisée | n GT | % modern |
|
| 15 |
-
+======================+====================+======+==========+
|
| 16 |
-
| maistre | maître | 47 | 85 % |
|
| 17 |
-
| nostre | nostre | 92 | 8 % |
|
| 18 |
-
| veoir | voir | 23 | 100 % |
|
| 19 |
-
+----------------------+--------------------+------+----------+
|
| 20 |
-
|
| 21 |
-
Lecture immédiate : *« le LLM modernise systématiquement
|
| 22 |
-
maistre → maître ; pour préserver l'orthographe historique, ajouter
|
| 23 |
-
au prompt "ne pas moderniser maistre, nostre, veoir" »*.
|
| 24 |
-
|
| 25 |
-
Méthode
|
| 26 |
-
-------
|
| 27 |
-
Alignement mot-à-mot via ``difflib.SequenceMatcher``. Chaque
|
| 28 |
-
``replace`` ou ``equal`` produit une paire ``(gt_token,
|
| 29 |
-
hyp_token)``. On accumule pour chaque ``gt_token`` :
|
| 30 |
-
|
| 31 |
-
- ``n_total`` : nombre d'occurrences du token dans la GT
|
| 32 |
-
- ``n_modernized`` : nombre d'occurrences où ``hyp_token != gt_token``
|
| 33 |
-
- ``variants`` : dict des hyp_tokens observés avec leur count
|
| 34 |
-
|
| 35 |
-
Stop-list
|
| 36 |
-
---------
|
| 37 |
-
L'utilisateur peut passer ``stop_list`` (ensemble de tokens GT à
|
| 38 |
-
ignorer). Par défaut, vide — le module ne tente pas de deviner ce
|
| 39 |
-
qui est « moderne » ou « historique », c'est au chercheur de
|
| 40 |
-
fournir le filtre adapté à son corpus.
|
| 41 |
-
|
| 42 |
-
Sortie
|
| 43 |
-
------
|
| 44 |
-
``compute_lexical_modernization`` retourne une structure adaptée
|
| 45 |
-
au rendu HTML. ``aggregate_lexical_modernization`` agrège
|
| 46 |
-
plusieurs documents.
|
| 47 |
-
|
| 48 |
-
Limites documentées
|
| 49 |
-
-------------------
|
| 50 |
-
- Tokenisation au niveau mot (split sur espace) — cohérent avec
|
| 51 |
-
``taxonomy.py`` et autres modules. Pas de stemming ni de
|
| 52 |
-
lemmatisation.
|
| 53 |
-
- La métrique mesure la **réécriture lexicale** ; elle n'attrape
|
| 54 |
-
pas les modernisations infra-mot (perte du s long ſ qui se
|
| 55 |
-
fond dans la même forme). Pour ça, voir ``early_modern_typography``
|
| 56 |
-
(Sprint 58) et ``equivalence_profile`` (Sprint 78).
|
| 57 |
"""
|
| 58 |
|
| 59 |
from __future__ import annotations
|
| 60 |
|
| 61 |
-
import
|
| 62 |
-
import logging
|
| 63 |
-
from typing import Iterable, Optional
|
| 64 |
-
|
| 65 |
-
logger = logging.getLogger(__name__)
|
| 66 |
-
|
| 67 |
-
|
| 68 |
-
def _split_words(text: Optional[str]) -> list[str]:
|
| 69 |
-
"""Tokenisation simple par split sur whitespace."""
|
| 70 |
-
if not text:
|
| 71 |
-
return []
|
| 72 |
-
return text.split()
|
| 73 |
-
|
| 74 |
-
|
| 75 |
-
def compute_lexical_modernization(
|
| 76 |
-
reference: Optional[str],
|
| 77 |
-
hypothesis: Optional[str],
|
| 78 |
-
*,
|
| 79 |
-
stop_list: Optional[Iterable[str]] = None,
|
| 80 |
-
case_sensitive: bool = False,
|
| 81 |
-
) -> dict:
|
| 82 |
-
"""Calcule le tableau de modernisation lexicale pour un document.
|
| 83 |
-
|
| 84 |
-
Returns
|
| 85 |
-
-------
|
| 86 |
-
dict
|
| 87 |
-
``{
|
| 88 |
-
"n_gt_tokens": int,
|
| 89 |
-
"tokens": {
|
| 90 |
-
gt_token: {
|
| 91 |
-
"n_total": int,
|
| 92 |
-
"n_modernized": int,
|
| 93 |
-
"rate_modernized": float, # ∈ [0, 1]
|
| 94 |
-
"variants": {hyp_token: count, ...},
|
| 95 |
-
},
|
| 96 |
-
...
|
| 97 |
-
},
|
| 98 |
-
}``
|
| 99 |
-
Si ``reference`` est vide → ``tokens == {}``.
|
| 100 |
-
"""
|
| 101 |
-
ref_tokens = _split_words(reference)
|
| 102 |
-
hyp_tokens = _split_words(hypothesis)
|
| 103 |
-
if not ref_tokens:
|
| 104 |
-
return {"n_gt_tokens": 0, "tokens": {}}
|
| 105 |
-
|
| 106 |
-
if not case_sensitive:
|
| 107 |
-
ref_for_match = [t.lower() for t in ref_tokens]
|
| 108 |
-
hyp_for_match = [t.lower() for t in hyp_tokens]
|
| 109 |
-
else:
|
| 110 |
-
ref_for_match = ref_tokens
|
| 111 |
-
hyp_for_match = hyp_tokens
|
| 112 |
-
|
| 113 |
-
stop = frozenset(
|
| 114 |
-
(t.lower() if not case_sensitive else t)
|
| 115 |
-
for t in (stop_list or [])
|
| 116 |
-
)
|
| 117 |
-
|
| 118 |
-
# On accumule par gt_token (forme display = forme originale,
|
| 119 |
-
# match key = forme casée selon ``case_sensitive``).
|
| 120 |
-
tokens_data: dict[str, dict] = {}
|
| 121 |
-
|
| 122 |
-
matcher = difflib.SequenceMatcher(
|
| 123 |
-
None, ref_for_match, hyp_for_match, autojunk=False,
|
| 124 |
-
)
|
| 125 |
-
for tag, i1, i2, j1, j2 in matcher.get_opcodes():
|
| 126 |
-
if tag == "equal":
|
| 127 |
-
for k in range(i2 - i1):
|
| 128 |
-
gt_orig = ref_tokens[i1 + k]
|
| 129 |
-
gt_match = ref_for_match[i1 + k]
|
| 130 |
-
if gt_match in stop:
|
| 131 |
-
continue
|
| 132 |
-
slot = tokens_data.setdefault(
|
| 133 |
-
gt_orig,
|
| 134 |
-
{"n_total": 0, "n_modernized": 0, "variants": {}},
|
| 135 |
-
)
|
| 136 |
-
slot["n_total"] += 1
|
| 137 |
-
elif tag == "replace":
|
| 138 |
-
# Apparier 1-à-1 quand possible
|
| 139 |
-
paired = min(i2 - i1, j2 - j1)
|
| 140 |
-
for k in range(paired):
|
| 141 |
-
gt_orig = ref_tokens[i1 + k]
|
| 142 |
-
gt_match = ref_for_match[i1 + k]
|
| 143 |
-
if gt_match in stop:
|
| 144 |
-
continue
|
| 145 |
-
hyp_orig = hyp_tokens[j1 + k]
|
| 146 |
-
slot = tokens_data.setdefault(
|
| 147 |
-
gt_orig,
|
| 148 |
-
{"n_total": 0, "n_modernized": 0, "variants": {}},
|
| 149 |
-
)
|
| 150 |
-
slot["n_total"] += 1
|
| 151 |
-
slot["n_modernized"] += 1
|
| 152 |
-
slot["variants"][hyp_orig] = slot["variants"].get(hyp_orig, 0) + 1
|
| 153 |
-
# Si plus de gt que de hyp, le reste des gt_tokens est
|
| 154 |
-
# « perdu » — on les compte comme totaux mais pas comme
|
| 155 |
-
# modernisés (on ne sait pas en quoi).
|
| 156 |
-
for k in range(paired, i2 - i1):
|
| 157 |
-
gt_orig = ref_tokens[i1 + k]
|
| 158 |
-
gt_match = ref_for_match[i1 + k]
|
| 159 |
-
if gt_match in stop:
|
| 160 |
-
continue
|
| 161 |
-
slot = tokens_data.setdefault(
|
| 162 |
-
gt_orig,
|
| 163 |
-
{"n_total": 0, "n_modernized": 0, "variants": {}},
|
| 164 |
-
)
|
| 165 |
-
slot["n_total"] += 1
|
| 166 |
-
slot["n_modernized"] += 1
|
| 167 |
-
slot["variants"]["∅"] = slot["variants"].get("∅", 0) + 1
|
| 168 |
-
elif tag == "delete":
|
| 169 |
-
# gt présent, pas en hyp → modernisation par
|
| 170 |
-
# suppression (ou perte pure)
|
| 171 |
-
for k in range(i2 - i1):
|
| 172 |
-
gt_orig = ref_tokens[i1 + k]
|
| 173 |
-
gt_match = ref_for_match[i1 + k]
|
| 174 |
-
if gt_match in stop:
|
| 175 |
-
continue
|
| 176 |
-
slot = tokens_data.setdefault(
|
| 177 |
-
gt_orig,
|
| 178 |
-
{"n_total": 0, "n_modernized": 0, "variants": {}},
|
| 179 |
-
)
|
| 180 |
-
slot["n_total"] += 1
|
| 181 |
-
slot["n_modernized"] += 1
|
| 182 |
-
slot["variants"]["∅"] = slot["variants"].get("∅", 0) + 1
|
| 183 |
-
|
| 184 |
-
# Calcul du taux par token
|
| 185 |
-
for slot in tokens_data.values():
|
| 186 |
-
total = slot["n_total"]
|
| 187 |
-
slot["rate_modernized"] = (
|
| 188 |
-
slot["n_modernized"] / total if total > 0 else 0.0
|
| 189 |
-
)
|
| 190 |
-
|
| 191 |
-
return {
|
| 192 |
-
"n_gt_tokens": len(ref_tokens),
|
| 193 |
-
"tokens": tokens_data,
|
| 194 |
-
}
|
| 195 |
-
|
| 196 |
-
|
| 197 |
-
def aggregate_lexical_modernization(
|
| 198 |
-
per_doc_results: Iterable[dict],
|
| 199 |
-
) -> dict:
|
| 200 |
-
"""Agrège des ``compute_lexical_modernization`` per-doc.
|
| 201 |
-
|
| 202 |
-
Renvoie la structure agrégée corpus-wide avec la même forme
|
| 203 |
-
que ``compute_lexical_modernization``.
|
| 204 |
-
"""
|
| 205 |
-
agg_tokens: dict[str, dict] = {}
|
| 206 |
-
n_gt_total = 0
|
| 207 |
-
for doc_result in per_doc_results:
|
| 208 |
-
if not doc_result:
|
| 209 |
-
continue
|
| 210 |
-
n_gt_total += doc_result.get("n_gt_tokens", 0)
|
| 211 |
-
for gt, data in (doc_result.get("tokens") or {}).items():
|
| 212 |
-
slot = agg_tokens.setdefault(
|
| 213 |
-
gt, {"n_total": 0, "n_modernized": 0, "variants": {}},
|
| 214 |
-
)
|
| 215 |
-
slot["n_total"] += data.get("n_total", 0)
|
| 216 |
-
slot["n_modernized"] += data.get("n_modernized", 0)
|
| 217 |
-
for hyp_t, count in (data.get("variants") or {}).items():
|
| 218 |
-
slot["variants"][hyp_t] = slot["variants"].get(hyp_t, 0) + count
|
| 219 |
-
|
| 220 |
-
for slot in agg_tokens.values():
|
| 221 |
-
total = slot["n_total"]
|
| 222 |
-
slot["rate_modernized"] = (
|
| 223 |
-
slot["n_modernized"] / total if total > 0 else 0.0
|
| 224 |
-
)
|
| 225 |
-
return {
|
| 226 |
-
"n_gt_tokens": n_gt_total,
|
| 227 |
-
"tokens": agg_tokens,
|
| 228 |
-
}
|
| 229 |
-
|
| 230 |
-
|
| 231 |
-
def top_modernized_tokens(
|
| 232 |
-
data: dict,
|
| 233 |
-
*,
|
| 234 |
-
n: int = 20,
|
| 235 |
-
min_total: int = 1,
|
| 236 |
-
) -> list[tuple[str, dict]]:
|
| 237 |
-
"""Top-N tokens GT par taux de modernisation.
|
| 238 |
-
|
| 239 |
-
Filtre les tokens dont ``n_total < min_total`` (anecdotiques).
|
| 240 |
-
Tri par ``rate_modernized`` décroissant, tie-break par
|
| 241 |
-
``n_total`` décroissant.
|
| 242 |
-
"""
|
| 243 |
-
tokens = data.get("tokens") or {}
|
| 244 |
-
candidates = [
|
| 245 |
-
(gt, slot) for gt, slot in tokens.items()
|
| 246 |
-
if slot.get("n_total", 0) >= min_total
|
| 247 |
-
and slot.get("n_modernized", 0) > 0
|
| 248 |
-
]
|
| 249 |
-
candidates.sort(
|
| 250 |
-
key=lambda pair: (
|
| 251 |
-
-pair[1].get("rate_modernized", 0.0),
|
| 252 |
-
-pair[1].get("n_total", 0),
|
| 253 |
-
pair[0],
|
| 254 |
-
),
|
| 255 |
-
)
|
| 256 |
-
return candidates[:n]
|
| 257 |
-
|
| 258 |
-
|
| 259 |
-
__all__ = [
|
| 260 |
-
"compute_lexical_modernization",
|
| 261 |
-
"aggregate_lexical_modernization",
|
| 262 |
-
"top_modernized_tokens",
|
| 263 |
-
]
|
|
|
|
| 1 |
+
"""Re-export — Sprint A14-S10. Le contenu canonique vit dans
|
| 2 |
+
``picarones.evaluation.metrics.lexical_modernization``.
|
| 3 |
|
| 4 |
+
L'ancien chemin ``picarones.measurements.lexical_modernization`` est conservé pour
|
| 5 |
+
ne casser aucun consommateur. Au S22, ce re-export disparaîtra.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 6 |
"""
|
| 7 |
|
| 8 |
from __future__ import annotations
|
| 9 |
|
| 10 |
+
from picarones.evaluation.metrics.lexical_modernization import * # noqa: F401,F403
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
@@ -1,286 +1,10 @@
|
|
| 1 |
-
"""
|
|
|
|
| 2 |
|
| 3 |
-
|
| 4 |
-
-
|
| 5 |
-
- CER par ligne : distance d'édition caractère/longueur GT sur chaque paire de lignes
|
| 6 |
-
- Percentiles : p50, p75, p90, p95, p99 sur la distribution des CER ligne
|
| 7 |
-
- Taux catastrophiques : % de lignes dépassant des seuils configurables (30 %, 50 %, 100 %)
|
| 8 |
-
- Coefficient de Gini : concentration des erreurs (0 = uniformes, 1 = toutes concentrées)
|
| 9 |
-
- Carte thermique : CER moyen par tranche de position dans le document
|
| 10 |
"""
|
| 11 |
|
| 12 |
from __future__ import annotations
|
| 13 |
|
| 14 |
-
import
|
| 15 |
-
from dataclasses import dataclass
|
| 16 |
-
from typing import Optional
|
| 17 |
-
|
| 18 |
-
|
| 19 |
-
# ---------------------------------------------------------------------------
|
| 20 |
-
# CER d'une paire de lignes (distance d'édition Levenshtein normalisée)
|
| 21 |
-
# ---------------------------------------------------------------------------
|
| 22 |
-
|
| 23 |
-
def _edit_distance(a: str, b: str) -> int:
|
| 24 |
-
"""Distance de Levenshtein entre deux chaînes."""
|
| 25 |
-
if not a:
|
| 26 |
-
return len(b)
|
| 27 |
-
if not b:
|
| 28 |
-
return len(a)
|
| 29 |
-
prev = list(range(len(b) + 1))
|
| 30 |
-
for i, ca in enumerate(a, 1):
|
| 31 |
-
curr = [i]
|
| 32 |
-
for j, cb in enumerate(b, 1):
|
| 33 |
-
cost = 0 if ca == cb else 1
|
| 34 |
-
curr.append(min(curr[j - 1] + 1, prev[j] + 1, prev[j - 1] + cost))
|
| 35 |
-
prev = curr
|
| 36 |
-
return prev[-1]
|
| 37 |
-
|
| 38 |
-
|
| 39 |
-
def _line_cer(ref_line: str, hyp_line: str) -> float:
|
| 40 |
-
"""CER pour une paire de lignes. Retourne 1.0 si le GT est vide et que l'hyp ne l'est pas."""
|
| 41 |
-
ref = unicodedata.normalize("NFC", ref_line.strip())
|
| 42 |
-
hyp = unicodedata.normalize("NFC", hyp_line.strip())
|
| 43 |
-
if not ref:
|
| 44 |
-
return 0.0 if not hyp else 1.0
|
| 45 |
-
dist = _edit_distance(ref, hyp)
|
| 46 |
-
return dist / len(ref)
|
| 47 |
-
|
| 48 |
-
|
| 49 |
-
# ---------------------------------------------------------------------------
|
| 50 |
-
# Percentiles (implémentation pur-Python, sans numpy)
|
| 51 |
-
# ---------------------------------------------------------------------------
|
| 52 |
-
|
| 53 |
-
def _percentile(sorted_values: list[float], p: float) -> float:
|
| 54 |
-
"""Retourne le p-ième percentile (0 ≤ p ≤ 100) d'une liste triée."""
|
| 55 |
-
if not sorted_values:
|
| 56 |
-
return 0.0
|
| 57 |
-
n = len(sorted_values)
|
| 58 |
-
index = p / 100 * (n - 1)
|
| 59 |
-
lo = int(index)
|
| 60 |
-
hi = min(lo + 1, n - 1)
|
| 61 |
-
frac = index - lo
|
| 62 |
-
return sorted_values[lo] + frac * (sorted_values[hi] - sorted_values[lo])
|
| 63 |
-
|
| 64 |
-
|
| 65 |
-
# ---------------------------------------------------------------------------
|
| 66 |
-
# Coefficient de Gini
|
| 67 |
-
# ---------------------------------------------------------------------------
|
| 68 |
-
|
| 69 |
-
def _gini(values: list[float]) -> float:
|
| 70 |
-
"""Coefficient de Gini des erreurs (0 = uniformes, 1 = toutes concentrées).
|
| 71 |
-
|
| 72 |
-
Formule : G = (2 * Σ i*x_i) / (n * Σ x_i) - (n+1)/n
|
| 73 |
-
sur les valeurs triées par ordre croissant.
|
| 74 |
-
"""
|
| 75 |
-
if not values:
|
| 76 |
-
return 0.0
|
| 77 |
-
xs = sorted(max(v, 0.0) for v in values)
|
| 78 |
-
n = len(xs)
|
| 79 |
-
total = sum(xs)
|
| 80 |
-
if total == 0.0:
|
| 81 |
-
return 0.0
|
| 82 |
-
weighted_sum = sum((i + 1) * x for i, x in enumerate(xs))
|
| 83 |
-
return (2.0 * weighted_sum) / (n * total) - (n + 1) / n
|
| 84 |
-
|
| 85 |
-
|
| 86 |
-
# ---------------------------------------------------------------------------
|
| 87 |
-
# Résultat structuré
|
| 88 |
-
# ---------------------------------------------------------------------------
|
| 89 |
-
|
| 90 |
-
@dataclass
|
| 91 |
-
class LineMetrics:
|
| 92 |
-
"""Distribution des erreurs CER par ligne pour une paire (GT, hypothèse)."""
|
| 93 |
-
|
| 94 |
-
cer_per_line: list[float]
|
| 95 |
-
"""CER de chaque ligne (longueur = nombre de lignes GT)."""
|
| 96 |
-
|
| 97 |
-
percentiles: dict[str, float]
|
| 98 |
-
"""Percentiles : p50, p75, p90, p95, p99."""
|
| 99 |
-
|
| 100 |
-
catastrophic_rate: dict[str, float]
|
| 101 |
-
"""Taux de lignes catastrophiques pour chaque seuil (ex. {0.3: 0.12, 0.5: 0.07, 1.0: 0.02})."""
|
| 102 |
-
|
| 103 |
-
gini: float
|
| 104 |
-
"""Coefficient de Gini des erreurs (0 → uniforme, 1 → concentrées)."""
|
| 105 |
-
|
| 106 |
-
heatmap: list[float]
|
| 107 |
-
"""CER moyen par tranche de position dans le document (longueur = heatmap_bins)."""
|
| 108 |
-
|
| 109 |
-
line_count: int
|
| 110 |
-
"""Nombre de lignes GT traitées."""
|
| 111 |
-
|
| 112 |
-
mean_cer: float
|
| 113 |
-
"""CER moyen sur l'ensemble des lignes."""
|
| 114 |
-
|
| 115 |
-
def as_dict(self) -> dict:
|
| 116 |
-
return {
|
| 117 |
-
"cer_per_line": [round(v, 6) for v in self.cer_per_line],
|
| 118 |
-
"percentiles": {k: round(v, 6) for k, v in self.percentiles.items()},
|
| 119 |
-
"catastrophic_rate": {str(k): round(v, 6) for k, v in self.catastrophic_rate.items()},
|
| 120 |
-
"gini": round(self.gini, 6),
|
| 121 |
-
"heatmap": [round(v, 6) for v in self.heatmap],
|
| 122 |
-
"line_count": self.line_count,
|
| 123 |
-
"mean_cer": round(self.mean_cer, 6),
|
| 124 |
-
}
|
| 125 |
-
|
| 126 |
-
@classmethod
|
| 127 |
-
def from_dict(cls, d: dict) -> "LineMetrics":
|
| 128 |
-
return cls(
|
| 129 |
-
cer_per_line=d.get("cer_per_line", []),
|
| 130 |
-
percentiles=d.get("percentiles", {}),
|
| 131 |
-
catastrophic_rate={float(k): v for k, v in d.get("catastrophic_rate", {}).items()},
|
| 132 |
-
gini=d.get("gini", 0.0),
|
| 133 |
-
heatmap=d.get("heatmap", []),
|
| 134 |
-
line_count=d.get("line_count", 0),
|
| 135 |
-
mean_cer=d.get("mean_cer", 0.0),
|
| 136 |
-
)
|
| 137 |
-
|
| 138 |
-
|
| 139 |
-
# ---------------------------------------------------------------------------
|
| 140 |
-
# Calcul principal
|
| 141 |
-
# ---------------------------------------------------------------------------
|
| 142 |
-
|
| 143 |
-
def compute_line_metrics(
|
| 144 |
-
reference: str,
|
| 145 |
-
hypothesis: str,
|
| 146 |
-
thresholds: Optional[list[float]] = None,
|
| 147 |
-
heatmap_bins: int = 10,
|
| 148 |
-
) -> LineMetrics:
|
| 149 |
-
"""Calcule la distribution des erreurs CER ligne par ligne.
|
| 150 |
-
|
| 151 |
-
Parameters
|
| 152 |
-
----------
|
| 153 |
-
reference:
|
| 154 |
-
Texte de vérité terrain (GT) avec sauts de ligne.
|
| 155 |
-
hypothesis:
|
| 156 |
-
Texte produit par le moteur OCR.
|
| 157 |
-
thresholds:
|
| 158 |
-
Seuils CER pour le taux catastrophique. Défaut : [0.30, 0.50, 1.00].
|
| 159 |
-
heatmap_bins:
|
| 160 |
-
Nombre de tranches de position pour la carte thermique.
|
| 161 |
-
|
| 162 |
-
Returns
|
| 163 |
-
-------
|
| 164 |
-
LineMetrics
|
| 165 |
-
"""
|
| 166 |
-
if thresholds is None:
|
| 167 |
-
thresholds = [0.30, 0.50, 1.00]
|
| 168 |
-
|
| 169 |
-
ref_lines = reference.splitlines()
|
| 170 |
-
hyp_lines = hypothesis.splitlines()
|
| 171 |
-
|
| 172 |
-
# Aligner les lignes GT / hypothèse — on prend au moins autant de lignes que le GT
|
| 173 |
-
n = len(ref_lines)
|
| 174 |
-
if n == 0:
|
| 175 |
-
# Pas de lignes : retourner des métriques neutres
|
| 176 |
-
return LineMetrics(
|
| 177 |
-
cer_per_line=[],
|
| 178 |
-
percentiles={f"p{p}": 0.0 for p in (50, 75, 90, 95, 99)},
|
| 179 |
-
catastrophic_rate={t: 0.0 for t in thresholds},
|
| 180 |
-
gini=0.0,
|
| 181 |
-
heatmap=[0.0] * heatmap_bins,
|
| 182 |
-
line_count=0,
|
| 183 |
-
mean_cer=0.0,
|
| 184 |
-
)
|
| 185 |
-
|
| 186 |
-
# Aligner en ignorant les lignes d'hypothèse supplémentaires
|
| 187 |
-
# Si l'hypothèse a moins de lignes, les lignes manquantes comptent comme supprimées (CER = 1.0)
|
| 188 |
-
cer_per_line: list[float] = []
|
| 189 |
-
for i, ref_line in enumerate(ref_lines):
|
| 190 |
-
hyp_line = hyp_lines[i] if i < len(hyp_lines) else ""
|
| 191 |
-
cer_per_line.append(min(_line_cer(ref_line, hyp_line), 1.0))
|
| 192 |
-
|
| 193 |
-
sorted_cer = sorted(cer_per_line)
|
| 194 |
-
|
| 195 |
-
# Percentiles
|
| 196 |
-
percentiles = {
|
| 197 |
-
f"p{p}": _percentile(sorted_cer, p)
|
| 198 |
-
for p in (50, 75, 90, 95, 99)
|
| 199 |
-
}
|
| 200 |
-
|
| 201 |
-
# Taux catastrophiques
|
| 202 |
-
catastrophic_rate: dict[float, float] = {}
|
| 203 |
-
for t in thresholds:
|
| 204 |
-
count = sum(1 for v in cer_per_line if v > t)
|
| 205 |
-
catastrophic_rate[t] = count / n
|
| 206 |
-
|
| 207 |
-
# Gini
|
| 208 |
-
gini = _gini(cer_per_line)
|
| 209 |
-
|
| 210 |
-
# Carte thermique par tranche de position
|
| 211 |
-
bins = heatmap_bins
|
| 212 |
-
heatmap: list[float] = []
|
| 213 |
-
for b in range(bins):
|
| 214 |
-
start = int(b * n / bins)
|
| 215 |
-
end = int((b + 1) * n / bins)
|
| 216 |
-
slice_ = cer_per_line[start:end]
|
| 217 |
-
heatmap.append(sum(slice_) / len(slice_) if slice_ else 0.0)
|
| 218 |
-
|
| 219 |
-
mean_cer = sum(cer_per_line) / n
|
| 220 |
-
|
| 221 |
-
return LineMetrics(
|
| 222 |
-
cer_per_line=cer_per_line,
|
| 223 |
-
percentiles=percentiles,
|
| 224 |
-
catastrophic_rate=catastrophic_rate,
|
| 225 |
-
gini=gini,
|
| 226 |
-
heatmap=heatmap,
|
| 227 |
-
line_count=n,
|
| 228 |
-
mean_cer=mean_cer,
|
| 229 |
-
)
|
| 230 |
-
|
| 231 |
-
|
| 232 |
-
# ---------------------------------------------------------------------------
|
| 233 |
-
# Agrégation sur un corpus
|
| 234 |
-
# ---------------------------------------------------------------------------
|
| 235 |
-
|
| 236 |
-
def aggregate_line_metrics(results: list[LineMetrics]) -> dict:
|
| 237 |
-
"""Agrège les métriques de distribution par ligne sur un corpus.
|
| 238 |
-
|
| 239 |
-
Returns
|
| 240 |
-
-------
|
| 241 |
-
dict
|
| 242 |
-
Statistiques agrégées : Gini moyen, percentiles moyens, taux catastrophiques moyens.
|
| 243 |
-
"""
|
| 244 |
-
if not results:
|
| 245 |
-
return {}
|
| 246 |
-
|
| 247 |
-
import statistics as _stats
|
| 248 |
-
|
| 249 |
-
gini_values = [r.gini for r in results]
|
| 250 |
-
mean_cer_values = [r.mean_cer for r in results]
|
| 251 |
-
|
| 252 |
-
# Percentiles moyens
|
| 253 |
-
pct_keys = ["p50", "p75", "p90", "p95", "p99"]
|
| 254 |
-
avg_percentiles = {}
|
| 255 |
-
for k in pct_keys:
|
| 256 |
-
vals = [r.percentiles.get(k, 0.0) for r in results]
|
| 257 |
-
avg_percentiles[k] = round(sum(vals) / len(vals), 6) if vals else 0.0
|
| 258 |
-
|
| 259 |
-
# Taux catastrophiques moyens (union des seuils)
|
| 260 |
-
all_thresholds: set[float] = set()
|
| 261 |
-
for r in results:
|
| 262 |
-
all_thresholds.update(r.catastrophic_rate.keys())
|
| 263 |
-
avg_catastrophic: dict[str, float] = {}
|
| 264 |
-
for t in sorted(all_thresholds):
|
| 265 |
-
vals = [r.catastrophic_rate.get(t, 0.0) for r in results]
|
| 266 |
-
avg_catastrophic[str(t)] = round(sum(vals) / len(vals), 6) if vals else 0.0
|
| 267 |
-
|
| 268 |
-
# Heatmap moyenne (longueur = max des longueurs)
|
| 269 |
-
if results and results[0].heatmap:
|
| 270 |
-
n_bins = len(results[0].heatmap)
|
| 271 |
-
heatmap_avg = []
|
| 272 |
-
for b in range(n_bins):
|
| 273 |
-
vals = [r.heatmap[b] for r in results if b < len(r.heatmap)]
|
| 274 |
-
heatmap_avg.append(round(sum(vals) / len(vals), 6) if vals else 0.0)
|
| 275 |
-
else:
|
| 276 |
-
heatmap_avg = []
|
| 277 |
-
|
| 278 |
-
return {
|
| 279 |
-
"gini_mean": round(sum(gini_values) / len(gini_values), 6),
|
| 280 |
-
"gini_stdev": round(_stats.stdev(gini_values), 6) if len(gini_values) > 1 else 0.0,
|
| 281 |
-
"mean_cer_mean": round(sum(mean_cer_values) / len(mean_cer_values), 6),
|
| 282 |
-
"percentiles": avg_percentiles,
|
| 283 |
-
"catastrophic_rate": avg_catastrophic,
|
| 284 |
-
"heatmap": heatmap_avg,
|
| 285 |
-
"document_count": len(results),
|
| 286 |
-
}
|
|
|
|
| 1 |
+
"""Re-export — Sprint A14-S10. Le contenu canonique vit dans
|
| 2 |
+
``picarones.evaluation.metrics.line_metrics``.
|
| 3 |
|
| 4 |
+
L'ancien chemin ``picarones.measurements.line_metrics`` est conservé pour
|
| 5 |
+
ne casser aucun consommateur. Au S22, ce re-export disparaîtra.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 6 |
"""
|
| 7 |
|
| 8 |
from __future__ import annotations
|
| 9 |
|
| 10 |
+
from picarones.evaluation.metrics.line_metrics import * # noqa: F401,F403
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
@@ -1,373 +1,10 @@
|
|
| 1 |
-
"""
|
|
|
|
| 2 |
|
| 3 |
-
|
| 4 |
-
|
| 5 |
-
Pourquoi ce module
|
| 6 |
-
------------------
|
| 7 |
-
L'historique SQLite (`core/history.py`, Sprint 8) collecte les
|
| 8 |
-
résultats de chaque run de benchmark, mais aucune métrique
|
| 9 |
-
n'en sortait dans le rapport. Ce module exploite la série
|
| 10 |
-
temporelle des CER d'un moteur pour répondre à deux
|
| 11 |
-
questions :
|
| 12 |
-
|
| 13 |
-
1. **Y a-t-il une tendance ?** Régression linéaire simple
|
| 14 |
-
(méthode des moindres carrés) sur ``(t, CER)`` — pente,
|
| 15 |
-
ordonnée à l'origine, R², n_runs. Une pente > 0 signale
|
| 16 |
-
une régression progressive ; une pente < 0 une amélioration.
|
| 17 |
-
|
| 18 |
-
2. **Y a-t-il un point de rupture ?** Algorithme de
|
| 19 |
-
change-point pur Python (différence de moyennes maximale,
|
| 20 |
-
variante de Pettitt simplifiée). Identifie l'index où la
|
| 21 |
-
série se sépare en deux segments avec moyennes les plus
|
| 22 |
-
différentes — typiquement le run où un modèle a changé de
|
| 23 |
-
comportement.
|
| 24 |
-
|
| 25 |
-
Pas de scipy
|
| 26 |
-
------------
|
| 27 |
-
Pour rester sans dépendance lourde, on implémente :
|
| 28 |
-
- la régression linéaire en pur Python (closed-form OLS) ;
|
| 29 |
-
- le change-point par balayage exhaustif (O(N) pour de petits
|
| 30 |
-
N — l'historique d'une institution dépasse rarement quelques
|
| 31 |
-
centaines de runs).
|
| 32 |
"""
|
| 33 |
|
| 34 |
from __future__ import annotations
|
| 35 |
|
| 36 |
-
import
|
| 37 |
-
import math
|
| 38 |
-
import statistics
|
| 39 |
-
from dataclasses import dataclass
|
| 40 |
-
from datetime import datetime
|
| 41 |
-
from typing import Iterable, Optional
|
| 42 |
-
|
| 43 |
-
logger = logging.getLogger(__name__)
|
| 44 |
-
|
| 45 |
-
|
| 46 |
-
@dataclass
|
| 47 |
-
class LinearTrend:
|
| 48 |
-
"""Résultat d'une régression linéaire sur une série CER."""
|
| 49 |
-
slope: float
|
| 50 |
-
"""Pente (CER par jour). Positif = régression."""
|
| 51 |
-
intercept: float
|
| 52 |
-
"""Ordonnée à l'origine."""
|
| 53 |
-
r_squared: float
|
| 54 |
-
"""Qualité de l'ajustement, ∈ [0, 1]."""
|
| 55 |
-
n_runs: int
|
| 56 |
-
"""Nombre de points utilisés."""
|
| 57 |
-
|
| 58 |
-
def as_dict(self) -> dict:
|
| 59 |
-
return {
|
| 60 |
-
"slope": self.slope,
|
| 61 |
-
"intercept": self.intercept,
|
| 62 |
-
"r_squared": self.r_squared,
|
| 63 |
-
"n_runs": self.n_runs,
|
| 64 |
-
}
|
| 65 |
-
|
| 66 |
-
|
| 67 |
-
@dataclass
|
| 68 |
-
class ChangePointResult:
|
| 69 |
-
"""Résultat d'une détection de point de rupture."""
|
| 70 |
-
index: int
|
| 71 |
-
"""Index de la rupture (0-based, le segment 1 est [0:index],
|
| 72 |
-
le segment 2 est [index:N])."""
|
| 73 |
-
timestamp: str
|
| 74 |
-
"""Timestamp du run à la rupture."""
|
| 75 |
-
mean_before: float
|
| 76 |
-
mean_after: float
|
| 77 |
-
delta: float
|
| 78 |
-
"""``mean_after - mean_before``. Positif = régression."""
|
| 79 |
-
n_before: int
|
| 80 |
-
n_after: int
|
| 81 |
-
|
| 82 |
-
def as_dict(self) -> dict:
|
| 83 |
-
return {
|
| 84 |
-
"index": self.index,
|
| 85 |
-
"timestamp": self.timestamp,
|
| 86 |
-
"mean_before": self.mean_before,
|
| 87 |
-
"mean_after": self.mean_after,
|
| 88 |
-
"delta": self.delta,
|
| 89 |
-
"n_before": self.n_before,
|
| 90 |
-
"n_after": self.n_after,
|
| 91 |
-
}
|
| 92 |
-
|
| 93 |
-
|
| 94 |
-
def _parse_timestamp(ts: str) -> Optional[float]:
|
| 95 |
-
"""Parse un ISO timestamp en jour ordinal float.
|
| 96 |
-
|
| 97 |
-
Tolère ``YYYY-MM-DD`` et ``YYYY-MM-DDTHH:MM:SS``. Retourne
|
| 98 |
-
``None`` si non parsable.
|
| 99 |
-
"""
|
| 100 |
-
if not ts:
|
| 101 |
-
return None
|
| 102 |
-
formats = (
|
| 103 |
-
"%Y-%m-%dT%H:%M:%S.%f",
|
| 104 |
-
"%Y-%m-%dT%H:%M:%S",
|
| 105 |
-
"%Y-%m-%d %H:%M:%S",
|
| 106 |
-
"%Y-%m-%d",
|
| 107 |
-
)
|
| 108 |
-
for fmt in formats:
|
| 109 |
-
try:
|
| 110 |
-
dt = datetime.strptime(ts.split("+")[0].split("Z")[0], fmt)
|
| 111 |
-
return dt.toordinal() + (
|
| 112 |
-
dt.hour * 3600 + dt.minute * 60 + dt.second
|
| 113 |
-
) / 86400.0
|
| 114 |
-
except ValueError:
|
| 115 |
-
continue
|
| 116 |
-
return None
|
| 117 |
-
|
| 118 |
-
|
| 119 |
-
def compute_linear_trend(
|
| 120 |
-
cer_series: Iterable[tuple[str, float]],
|
| 121 |
-
) -> Optional[LinearTrend]:
|
| 122 |
-
"""Régression linéaire OLS sur une série temporelle de CER.
|
| 123 |
-
|
| 124 |
-
Parameters
|
| 125 |
-
----------
|
| 126 |
-
cer_series:
|
| 127 |
-
Itérable de ``(timestamp_iso, cer)``. Au moins 2 points
|
| 128 |
-
valides requis.
|
| 129 |
-
|
| 130 |
-
Returns
|
| 131 |
-
-------
|
| 132 |
-
LinearTrend | None
|
| 133 |
-
``None`` si moins de 2 points ou si tous les timestamps
|
| 134 |
-
sont identiques (variance nulle sur t).
|
| 135 |
-
"""
|
| 136 |
-
points: list[tuple[float, float]] = []
|
| 137 |
-
for ts, cer in cer_series:
|
| 138 |
-
t = _parse_timestamp(ts)
|
| 139 |
-
if t is None or cer is None:
|
| 140 |
-
continue
|
| 141 |
-
try:
|
| 142 |
-
cer_f = float(cer)
|
| 143 |
-
except (TypeError, ValueError):
|
| 144 |
-
continue
|
| 145 |
-
points.append((t, cer_f))
|
| 146 |
-
n = len(points)
|
| 147 |
-
if n < 2:
|
| 148 |
-
return None
|
| 149 |
-
xs = [p[0] for p in points]
|
| 150 |
-
ys = [p[1] for p in points]
|
| 151 |
-
x_mean = statistics.fmean(xs)
|
| 152 |
-
y_mean = statistics.fmean(ys)
|
| 153 |
-
sxx = sum((x - x_mean) ** 2 for x in xs)
|
| 154 |
-
sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
|
| 155 |
-
if sxx == 0:
|
| 156 |
-
return None
|
| 157 |
-
slope = sxy / sxx
|
| 158 |
-
intercept = y_mean - slope * x_mean
|
| 159 |
-
syy = sum((y - y_mean) ** 2 for y in ys)
|
| 160 |
-
if syy == 0:
|
| 161 |
-
# Tous les CER sont égaux → R² mathématiquement indéfini ;
|
| 162 |
-
# on retourne 1.0 (parfaite "non-tendance").
|
| 163 |
-
r_squared = 1.0
|
| 164 |
-
else:
|
| 165 |
-
ss_res = sum(
|
| 166 |
-
(y - (slope * x + intercept)) ** 2
|
| 167 |
-
for x, y in zip(xs, ys)
|
| 168 |
-
)
|
| 169 |
-
r_squared = max(0.0, 1.0 - ss_res / syy)
|
| 170 |
-
return LinearTrend(
|
| 171 |
-
slope=slope,
|
| 172 |
-
intercept=intercept,
|
| 173 |
-
r_squared=r_squared,
|
| 174 |
-
n_runs=n,
|
| 175 |
-
)
|
| 176 |
-
|
| 177 |
-
|
| 178 |
-
def detect_change_point(
|
| 179 |
-
cer_series: Iterable[tuple[str, float]],
|
| 180 |
-
min_segment_size: int = 3,
|
| 181 |
-
) -> Optional[ChangePointResult]:
|
| 182 |
-
"""Détecte le point de rupture maximisant l'écart de moyennes.
|
| 183 |
-
|
| 184 |
-
Algorithme : balayage des indices ``i`` où la série se
|
| 185 |
-
sépare en deux segments d'au moins ``min_segment_size``
|
| 186 |
-
points chacun ; on retient l'index où ``|mean_after -
|
| 187 |
-
mean_before|`` est maximal. Variante simplifiée de Pettitt.
|
| 188 |
-
|
| 189 |
-
Parameters
|
| 190 |
-
----------
|
| 191 |
-
cer_series:
|
| 192 |
-
Itérable de ``(timestamp_iso, cer)``.
|
| 193 |
-
min_segment_size:
|
| 194 |
-
Taille minimale des deux segments. Défaut 3.
|
| 195 |
-
|
| 196 |
-
Returns
|
| 197 |
-
-------
|
| 198 |
-
ChangePointResult | None
|
| 199 |
-
``None`` si la série a moins de ``2 × min_segment_size``
|
| 200 |
-
points valides.
|
| 201 |
-
"""
|
| 202 |
-
points: list[tuple[str, float, float]] = []
|
| 203 |
-
for ts, cer in cer_series:
|
| 204 |
-
t = _parse_timestamp(ts)
|
| 205 |
-
if t is None or cer is None:
|
| 206 |
-
continue
|
| 207 |
-
try:
|
| 208 |
-
cer_f = float(cer)
|
| 209 |
-
except (TypeError, ValueError):
|
| 210 |
-
continue
|
| 211 |
-
points.append((ts, t, cer_f))
|
| 212 |
-
if len(points) < 2 * min_segment_size:
|
| 213 |
-
return None
|
| 214 |
-
points.sort(key=lambda p: p[1])
|
| 215 |
-
n = len(points)
|
| 216 |
-
best_index = -1
|
| 217 |
-
best_abs_delta = -1.0
|
| 218 |
-
best_delta = 0.0
|
| 219 |
-
best_mean_before = 0.0
|
| 220 |
-
best_mean_after = 0.0
|
| 221 |
-
for i in range(min_segment_size, n - min_segment_size + 1):
|
| 222 |
-
before = [p[2] for p in points[:i]]
|
| 223 |
-
after = [p[2] for p in points[i:]]
|
| 224 |
-
mean_b = statistics.fmean(before)
|
| 225 |
-
mean_a = statistics.fmean(after)
|
| 226 |
-
delta = mean_a - mean_b
|
| 227 |
-
abs_delta = abs(delta)
|
| 228 |
-
if abs_delta > best_abs_delta:
|
| 229 |
-
best_abs_delta = abs_delta
|
| 230 |
-
best_index = i
|
| 231 |
-
best_delta = delta
|
| 232 |
-
best_mean_before = mean_b
|
| 233 |
-
best_mean_after = mean_a
|
| 234 |
-
if best_index < 0:
|
| 235 |
-
return None
|
| 236 |
-
return ChangePointResult(
|
| 237 |
-
index=best_index,
|
| 238 |
-
timestamp=points[best_index][0],
|
| 239 |
-
mean_before=best_mean_before,
|
| 240 |
-
mean_after=best_mean_after,
|
| 241 |
-
delta=best_delta,
|
| 242 |
-
n_before=best_index,
|
| 243 |
-
n_after=n - best_index,
|
| 244 |
-
)
|
| 245 |
-
|
| 246 |
-
|
| 247 |
-
def compute_engine_longitudinal(
|
| 248 |
-
history_entries: Iterable,
|
| 249 |
-
engine_name: str,
|
| 250 |
-
corpus_name: Optional[str] = None,
|
| 251 |
-
*,
|
| 252 |
-
min_runs_for_trend: int = 3,
|
| 253 |
-
min_segment_size: int = 3,
|
| 254 |
-
change_point_threshold: float = 0.01,
|
| 255 |
-
) -> Optional[dict]:
|
| 256 |
-
"""Calcule trend + change_point pour un moteur.
|
| 257 |
-
|
| 258 |
-
Parameters
|
| 259 |
-
----------
|
| 260 |
-
history_entries:
|
| 261 |
-
Liste de ``HistoryEntry`` (ou dicts compatibles).
|
| 262 |
-
engine_name:
|
| 263 |
-
Filtre sur le nom du moteur.
|
| 264 |
-
corpus_name:
|
| 265 |
-
Filtre optionnel sur le corpus. ``None`` (défaut) : tous
|
| 266 |
-
les corpus.
|
| 267 |
-
min_runs_for_trend:
|
| 268 |
-
Minimum de runs pour calculer une tendance.
|
| 269 |
-
min_segment_size:
|
| 270 |
-
Taille minimale des segments pour le change-point.
|
| 271 |
-
change_point_threshold:
|
| 272 |
-
Magnitude absolue minimale du delta (en CER) pour
|
| 273 |
-
retenir le change-point. Défaut 0.01 (1 point de CER).
|
| 274 |
-
|
| 275 |
-
Returns
|
| 276 |
-
-------
|
| 277 |
-
dict | None
|
| 278 |
-
``{
|
| 279 |
-
"engine_name", "corpus_name", "n_runs", "trend",
|
| 280 |
-
"change_point", # ou None
|
| 281 |
-
"first_timestamp", "last_timestamp",
|
| 282 |
-
"first_cer", "last_cer", "absolute_delta_pct",
|
| 283 |
-
}`` ou ``None`` si moins de ``min_runs_for_trend`` runs.
|
| 284 |
-
"""
|
| 285 |
-
series: list[tuple[str, float]] = []
|
| 286 |
-
for entry in history_entries:
|
| 287 |
-
if hasattr(entry, "as_dict"):
|
| 288 |
-
data = entry.as_dict()
|
| 289 |
-
else:
|
| 290 |
-
data = entry
|
| 291 |
-
if data.get("engine_name") != engine_name:
|
| 292 |
-
continue
|
| 293 |
-
if corpus_name is not None and data.get("corpus_name") != corpus_name:
|
| 294 |
-
continue
|
| 295 |
-
cer = data.get("cer_mean")
|
| 296 |
-
ts = data.get("timestamp")
|
| 297 |
-
if cer is None or ts is None:
|
| 298 |
-
continue
|
| 299 |
-
series.append((ts, float(cer)))
|
| 300 |
-
if len(series) < min_runs_for_trend:
|
| 301 |
-
return None
|
| 302 |
-
series.sort(key=lambda p: _parse_timestamp(p[0]) or 0.0)
|
| 303 |
-
trend = compute_linear_trend(series)
|
| 304 |
-
cp = detect_change_point(series, min_segment_size=min_segment_size)
|
| 305 |
-
if cp is not None and abs(cp.delta) < change_point_threshold:
|
| 306 |
-
cp = None
|
| 307 |
-
first_ts, first_cer = series[0]
|
| 308 |
-
last_ts, last_cer = series[-1]
|
| 309 |
-
return {
|
| 310 |
-
"engine_name": engine_name,
|
| 311 |
-
"corpus_name": corpus_name,
|
| 312 |
-
"n_runs": len(series),
|
| 313 |
-
"trend": trend.as_dict() if trend else None,
|
| 314 |
-
"change_point": cp.as_dict() if cp else None,
|
| 315 |
-
"first_timestamp": first_ts,
|
| 316 |
-
"last_timestamp": last_ts,
|
| 317 |
-
"first_cer": first_cer,
|
| 318 |
-
"last_cer": last_cer,
|
| 319 |
-
"absolute_delta": last_cer - first_cer,
|
| 320 |
-
"absolute_delta_pct": round((last_cer - first_cer) * 100, 2),
|
| 321 |
-
}
|
| 322 |
-
|
| 323 |
-
|
| 324 |
-
def compute_corpus_longitudinal(
|
| 325 |
-
history_entries: Iterable,
|
| 326 |
-
corpus_name: Optional[str] = None,
|
| 327 |
-
*,
|
| 328 |
-
min_runs_for_trend: int = 3,
|
| 329 |
-
min_segment_size: int = 3,
|
| 330 |
-
change_point_threshold: float = 0.01,
|
| 331 |
-
) -> list[dict]:
|
| 332 |
-
"""Pour chaque moteur présent dans l'historique sur ``corpus_name``,
|
| 333 |
-
calcule trend + change_point.
|
| 334 |
-
|
| 335 |
-
Returns
|
| 336 |
-
-------
|
| 337 |
-
list[dict]
|
| 338 |
-
Une entrée par moteur (filtrée), liste vide si rien.
|
| 339 |
-
"""
|
| 340 |
-
entries = list(history_entries)
|
| 341 |
-
engines: set[str] = set()
|
| 342 |
-
for entry in entries:
|
| 343 |
-
data = entry.as_dict() if hasattr(entry, "as_dict") else entry
|
| 344 |
-
if corpus_name is not None and data.get("corpus_name") != corpus_name:
|
| 345 |
-
continue
|
| 346 |
-
name = data.get("engine_name")
|
| 347 |
-
if name:
|
| 348 |
-
engines.add(name)
|
| 349 |
-
out: list[dict] = []
|
| 350 |
-
for engine in sorted(engines):
|
| 351 |
-
result = compute_engine_longitudinal(
|
| 352 |
-
entries, engine, corpus_name=corpus_name,
|
| 353 |
-
min_runs_for_trend=min_runs_for_trend,
|
| 354 |
-
min_segment_size=min_segment_size,
|
| 355 |
-
change_point_threshold=change_point_threshold,
|
| 356 |
-
)
|
| 357 |
-
if result is not None:
|
| 358 |
-
out.append(result)
|
| 359 |
-
return out
|
| 360 |
-
|
| 361 |
-
|
| 362 |
-
__all__ = [
|
| 363 |
-
"LinearTrend",
|
| 364 |
-
"ChangePointResult",
|
| 365 |
-
"compute_linear_trend",
|
| 366 |
-
"detect_change_point",
|
| 367 |
-
"compute_engine_longitudinal",
|
| 368 |
-
"compute_corpus_longitudinal",
|
| 369 |
-
]
|
| 370 |
-
|
| 371 |
-
|
| 372 |
-
# Marqueur d'évitement d'import inutilisé (math)
|
| 373 |
-
_ = math
|
|
|
|
| 1 |
+
"""Re-export — Sprint A14-S10. Le contenu canonique vit dans
|
| 2 |
+
``picarones.evaluation.metrics.longitudinal``.
|
| 3 |
|
| 4 |
+
L'ancien chemin ``picarones.measurements.longitudinal`` est conservé pour
|
| 5 |
+
ne casser aucun consommateur. Au S22, ce re-export disparaîtra.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 6 |
"""
|
| 7 |
|
| 8 |
from __future__ import annotations
|
| 9 |
|
| 10 |
+
from picarones.evaluation.metrics.longitudinal import * # noqa: F401,F403
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
@@ -1,142 +1,10 @@
|
|
| 1 |
-
"""
|
|
|
|
| 2 |
|
| 3 |
-
|
| 4 |
-
|
| 5 |
-
Pourquoi ce module
|
| 6 |
-
------------------
|
| 7 |
-
La vue Pareto (Sprint 20) trace CER vs coût mais n'arbitre pas
|
| 8 |
-
quel surcoût est *raisonnable* pour quelle réduction d'erreur.
|
| 9 |
-
Une institution avec un budget contraint a besoin d'une
|
| 10 |
-
réponse opérationnelle :
|
| 11 |
-
|
| 12 |
-
*« Passer de Tesseract à Mistral OCR coûte 0,83 € par
|
| 13 |
-
erreur évitée — décider selon votre budget par millier
|
| 14 |
-
d'erreurs corrigées. »*
|
| 15 |
-
|
| 16 |
-
Formule
|
| 17 |
-
-------
|
| 18 |
-
Pour deux moteurs A et B où B fait **moins** d'erreurs que A
|
| 19 |
-
(donc B est plus précis) :
|
| 20 |
-
|
| 21 |
-
.. code::
|
| 22 |
-
|
| 23 |
-
coût_marginal = (coût_B − coût_A) / (errors_A − errors_B)
|
| 24 |
-
|
| 25 |
-
- Si ``cost_B > cost_A`` et ``errors_B < errors_A`` :
|
| 26 |
-
``cost_per_avoided_error > 0`` (cas standard, B coûte plus
|
| 27 |
-
pour moins d'erreurs).
|
| 28 |
-
- Si ``cost_B ≤ cost_A`` et ``errors_B < errors_A`` :
|
| 29 |
-
``cost_per_avoided_error ≤ 0`` (cas idéal, B est strictement
|
| 30 |
-
meilleur).
|
| 31 |
-
- Si ``errors_B ≥ errors_A`` : non comparable dans ce sens
|
| 32 |
-
(B n'évite pas d'erreur), retourne ``None``.
|
| 33 |
-
|
| 34 |
-
Sortie
|
| 35 |
-
------
|
| 36 |
-
``compute_marginal_cost(cost_a, errors_a, cost_b, errors_b)``
|
| 37 |
-
retourne ``{cost_per_avoided_error, n_errors_avoided,
|
| 38 |
-
cost_delta, dominated}`` ou ``None`` si non comparable.
|
| 39 |
-
|
| 40 |
-
``compute_marginal_cost_matrix(per_engine)`` retourne, pour
|
| 41 |
-
chaque paire ordonnée ``(A → B)`` où B est plus précis, le
|
| 42 |
-
coût marginal correspondant. Trié par coût marginal croissant
|
| 43 |
-
(meilleur ratio en tête).
|
| 44 |
"""
|
| 45 |
|
| 46 |
from __future__ import annotations
|
| 47 |
|
| 48 |
-
import
|
| 49 |
-
from typing import Optional
|
| 50 |
-
|
| 51 |
-
logger = logging.getLogger(__name__)
|
| 52 |
-
|
| 53 |
-
|
| 54 |
-
def compute_marginal_cost(
|
| 55 |
-
cost_a: float,
|
| 56 |
-
errors_a: float,
|
| 57 |
-
cost_b: float,
|
| 58 |
-
errors_b: float,
|
| 59 |
-
) -> Optional[dict]:
|
| 60 |
-
"""Coût marginal du passage A → B (B plus précis).
|
| 61 |
-
|
| 62 |
-
Retourne ``None`` si :
|
| 63 |
-
- ``errors_b >= errors_a`` (B n'évite pas d'erreur) ;
|
| 64 |
-
- les valeurs ne sont pas finies.
|
| 65 |
-
"""
|
| 66 |
-
try:
|
| 67 |
-
ca = float(cost_a)
|
| 68 |
-
cb = float(cost_b)
|
| 69 |
-
ea = float(errors_a)
|
| 70 |
-
eb = float(errors_b)
|
| 71 |
-
except (TypeError, ValueError):
|
| 72 |
-
return None
|
| 73 |
-
if ea <= eb:
|
| 74 |
-
# B ne fait pas mieux que A → pas de gain à mesurer.
|
| 75 |
-
return None
|
| 76 |
-
n_avoided = ea - eb
|
| 77 |
-
cost_delta = cb - ca
|
| 78 |
-
cost_per_avoided = cost_delta / n_avoided
|
| 79 |
-
dominated = cost_delta <= 0 # B aussi cher ou moins → cas idéal
|
| 80 |
-
return {
|
| 81 |
-
"cost_per_avoided_error": cost_per_avoided,
|
| 82 |
-
"n_errors_avoided": n_avoided,
|
| 83 |
-
"cost_delta": cost_delta,
|
| 84 |
-
"dominated": dominated,
|
| 85 |
-
}
|
| 86 |
-
|
| 87 |
-
|
| 88 |
-
def compute_marginal_cost_matrix(
|
| 89 |
-
per_engine: dict[str, dict],
|
| 90 |
-
) -> Optional[dict]:
|
| 91 |
-
"""Pour chaque paire A → B où B fait moins d'erreurs, calcule
|
| 92 |
-
le coût marginal.
|
| 93 |
-
|
| 94 |
-
Parameters
|
| 95 |
-
----------
|
| 96 |
-
per_engine:
|
| 97 |
-
Map ``{engine_name: {"cost": float, "errors": float}}``.
|
| 98 |
-
|
| 99 |
-
Returns
|
| 100 |
-
-------
|
| 101 |
-
dict | None
|
| 102 |
-
``{
|
| 103 |
-
"pairs": list[
|
| 104 |
-
{"engine_a", "engine_b", "cost_per_avoided_error",
|
| 105 |
-
"n_errors_avoided", "cost_delta", "dominated"}
|
| 106 |
-
], # triée par cost_per_avoided_error croissant
|
| 107 |
-
}``
|
| 108 |
-
ou ``None`` si moins de 2 moteurs.
|
| 109 |
-
"""
|
| 110 |
-
if not per_engine or len(per_engine) < 2:
|
| 111 |
-
return None
|
| 112 |
-
engines = sorted(per_engine.keys())
|
| 113 |
-
pairs: list[dict] = []
|
| 114 |
-
for a in engines:
|
| 115 |
-
for b in engines:
|
| 116 |
-
if a == b:
|
| 117 |
-
continue
|
| 118 |
-
data_a = per_engine[a]
|
| 119 |
-
data_b = per_engine[b]
|
| 120 |
-
try:
|
| 121 |
-
ca = float(data_a.get("cost"))
|
| 122 |
-
ea = float(data_a.get("errors"))
|
| 123 |
-
cb = float(data_b.get("cost"))
|
| 124 |
-
eb = float(data_b.get("errors"))
|
| 125 |
-
except (TypeError, ValueError):
|
| 126 |
-
continue
|
| 127 |
-
result = compute_marginal_cost(ca, ea, cb, eb)
|
| 128 |
-
if result is None:
|
| 129 |
-
continue
|
| 130 |
-
entry = {"engine_a": a, "engine_b": b}
|
| 131 |
-
entry.update(result)
|
| 132 |
-
pairs.append(entry)
|
| 133 |
-
if not pairs:
|
| 134 |
-
return None
|
| 135 |
-
pairs.sort(key=lambda p: p["cost_per_avoided_error"])
|
| 136 |
-
return {"pairs": pairs}
|
| 137 |
-
|
| 138 |
-
|
| 139 |
-
__all__ = [
|
| 140 |
-
"compute_marginal_cost",
|
| 141 |
-
"compute_marginal_cost_matrix",
|
| 142 |
-
]
|
|
|
|
| 1 |
+
"""Re-export — Sprint A14-S10. Le contenu canonique vit dans
|
| 2 |
+
``picarones.evaluation.metrics.marginal_cost``.
|
| 3 |
|
| 4 |
+
L'ancien chemin ``picarones.measurements.marginal_cost`` est conservé pour
|
| 5 |
+
ne casser aucun consommateur. Au S22, ce re-export disparaîtra.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 6 |
"""
|
| 7 |
|
| 8 |
from __future__ import annotations
|
| 9 |
|
| 10 |
+
from picarones.evaluation.metrics.marginal_cost import * # noqa: F401,F403
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
@@ -1,333 +1,10 @@
|
|
| 1 |
-
"""
|
|
|
|
| 2 |
|
| 3 |
-
|
| 4 |
-
|
| 5 |
-
Pourquoi ce module
|
| 6 |
-
------------------
|
| 7 |
-
Avant d'ouvrir Picarones aux contributions externes (axe B —
|
| 8 |
-
modules tiers que l'utilisateur amène), il faut un cadre de
|
| 9 |
-
qualité explicite : *« un module qui ne passe pas l'audit
|
| 10 |
-
n'est pas exécutable. »*
|
| 11 |
-
|
| 12 |
-
Ce module fournit l'**enveloppe d'audit** :
|
| 13 |
-
|
| 14 |
-
- ``ModuleManifest`` — métadonnées obligatoires (auteur,
|
| 15 |
-
licence, version, citation, contrat d'entrée/sortie typé).
|
| 16 |
-
- ``validate_manifest(manifest)`` — vérifie que tous les champs
|
| 17 |
-
obligatoires sont présents et bien formés.
|
| 18 |
-
- ``audit_module(module_class_or_instance, manifest)`` —
|
| 19 |
-
vérifie en plus que la classe respecte le contrat ``BaseModule``
|
| 20 |
-
et que ``input_types``/``output_types`` correspondent au
|
| 21 |
-
manifeste.
|
| 22 |
-
- ``AuditResult`` — verdict structuré ``passed/failed`` + liste
|
| 23 |
-
des checks détaillés.
|
| 24 |
-
|
| 25 |
-
Stratégie d'ouverture
|
| 26 |
-
---------------------
|
| 27 |
-
Phase fermée actuelle : modules officiels uniquement,
|
| 28 |
-
contributions via PR sur le repo principal. Phase ouverte
|
| 29 |
-
future : une fois 5–6 modules officiels stables, ouverture via
|
| 30 |
-
``entry_points`` sur PyPI (``picarones-module-X``). Ce module
|
| 31 |
-
prépare la phase ouverte sans la déclencher : tout module
|
| 32 |
-
externe devra fournir un ``ModuleManifest`` valide pour être
|
| 33 |
-
exécuté.
|
| 34 |
-
|
| 35 |
-
Pas de SPDX validator
|
| 36 |
-
---------------------
|
| 37 |
-
On vérifie la présence et la non-vacuité des champs licence ;
|
| 38 |
-
on ne valide pas la conformité SPDX du nom (``MIT`` vs
|
| 39 |
-
``mit-license`` vs ``MIT License``). Le chercheur reste
|
| 40 |
-
responsable du choix de licence ; l'outil documente, il ne
|
| 41 |
-
juge pas.
|
| 42 |
"""
|
| 43 |
|
| 44 |
from __future__ import annotations
|
| 45 |
|
| 46 |
-
import
|
| 47 |
-
from dataclasses import dataclass, field
|
| 48 |
-
from typing import Any, Optional
|
| 49 |
-
|
| 50 |
-
logger = logging.getLogger(__name__)
|
| 51 |
-
|
| 52 |
-
|
| 53 |
-
# Champs obligatoires d'un ManifestModule (texte non-vide).
|
| 54 |
-
_REQUIRED_TEXT_FIELDS = (
|
| 55 |
-
"name", "version", "author", "license",
|
| 56 |
-
"description",
|
| 57 |
-
)
|
| 58 |
-
|
| 59 |
-
|
| 60 |
-
@dataclass
|
| 61 |
-
class ModuleManifest:
|
| 62 |
-
"""Métadonnées d'un module contribué.
|
| 63 |
-
|
| 64 |
-
Attributes
|
| 65 |
-
----------
|
| 66 |
-
name:
|
| 67 |
-
Identifiant unique du module (ex. ``"my-llm-correcteur"``).
|
| 68 |
-
version:
|
| 69 |
-
Version sémantique (ex. ``"1.2.0"``).
|
| 70 |
-
author:
|
| 71 |
-
Auteur ou institution responsable.
|
| 72 |
-
license:
|
| 73 |
-
Identifiant de licence (SPDX recommandé, non validé).
|
| 74 |
-
description:
|
| 75 |
-
Description courte (≤ 1 phrase).
|
| 76 |
-
input_types:
|
| 77 |
-
Liste des types d'entrée (chaînes). Doit correspondre
|
| 78 |
-
à ``module.input_types`` (Sprint 33).
|
| 79 |
-
output_types:
|
| 80 |
-
Liste des types de sortie. Doit correspondre à
|
| 81 |
-
``module.output_types``.
|
| 82 |
-
citation:
|
| 83 |
-
Citation académique (BibTeX, DOI, ou texte libre).
|
| 84 |
-
Optionnel.
|
| 85 |
-
homepage:
|
| 86 |
-
URL du dépôt ou de la page projet. Optionnel.
|
| 87 |
-
picarones_min_version:
|
| 88 |
-
Version minimale de Picarones requise. Optionnel.
|
| 89 |
-
extra:
|
| 90 |
-
Métadonnées libres (clé → valeur).
|
| 91 |
-
"""
|
| 92 |
-
|
| 93 |
-
name: str
|
| 94 |
-
version: str
|
| 95 |
-
author: str
|
| 96 |
-
license: str
|
| 97 |
-
description: str
|
| 98 |
-
input_types: list[str] = field(default_factory=list)
|
| 99 |
-
output_types: list[str] = field(default_factory=list)
|
| 100 |
-
citation: Optional[str] = None
|
| 101 |
-
homepage: Optional[str] = None
|
| 102 |
-
picarones_min_version: Optional[str] = None
|
| 103 |
-
extra: dict = field(default_factory=dict)
|
| 104 |
-
|
| 105 |
-
def as_dict(self) -> dict:
|
| 106 |
-
return {
|
| 107 |
-
"name": self.name,
|
| 108 |
-
"version": self.version,
|
| 109 |
-
"author": self.author,
|
| 110 |
-
"license": self.license,
|
| 111 |
-
"description": self.description,
|
| 112 |
-
"input_types": list(self.input_types),
|
| 113 |
-
"output_types": list(self.output_types),
|
| 114 |
-
"citation": self.citation,
|
| 115 |
-
"homepage": self.homepage,
|
| 116 |
-
"picarones_min_version": self.picarones_min_version,
|
| 117 |
-
"extra": dict(self.extra),
|
| 118 |
-
}
|
| 119 |
-
|
| 120 |
-
|
| 121 |
-
@dataclass
|
| 122 |
-
class AuditCheck:
|
| 123 |
-
"""Un check individuel de l'audit."""
|
| 124 |
-
|
| 125 |
-
name: str
|
| 126 |
-
passed: bool
|
| 127 |
-
detail: Optional[str] = None
|
| 128 |
-
|
| 129 |
-
def as_dict(self) -> dict:
|
| 130 |
-
return {
|
| 131 |
-
"name": self.name,
|
| 132 |
-
"passed": self.passed,
|
| 133 |
-
"detail": self.detail,
|
| 134 |
-
}
|
| 135 |
-
|
| 136 |
-
|
| 137 |
-
@dataclass
|
| 138 |
-
class AuditResult:
|
| 139 |
-
"""Résultat global d'un audit de module."""
|
| 140 |
-
|
| 141 |
-
module_name: str
|
| 142 |
-
passed: bool
|
| 143 |
-
checks: list[AuditCheck] = field(default_factory=list)
|
| 144 |
-
|
| 145 |
-
@property
|
| 146 |
-
def n_passed(self) -> int:
|
| 147 |
-
return sum(1 for c in self.checks if c.passed)
|
| 148 |
-
|
| 149 |
-
@property
|
| 150 |
-
def n_failed(self) -> int:
|
| 151 |
-
return sum(1 for c in self.checks if not c.passed)
|
| 152 |
-
|
| 153 |
-
def as_dict(self) -> dict:
|
| 154 |
-
return {
|
| 155 |
-
"module_name": self.module_name,
|
| 156 |
-
"passed": self.passed,
|
| 157 |
-
"n_passed": self.n_passed,
|
| 158 |
-
"n_failed": self.n_failed,
|
| 159 |
-
"checks": [c.as_dict() for c in self.checks],
|
| 160 |
-
}
|
| 161 |
-
|
| 162 |
-
|
| 163 |
-
def validate_manifest(manifest: ModuleManifest) -> list[AuditCheck]:
|
| 164 |
-
"""Vérifie qu'un manifest est complet et bien formé.
|
| 165 |
-
|
| 166 |
-
Returns
|
| 167 |
-
-------
|
| 168 |
-
list[AuditCheck]
|
| 169 |
-
Un check par champ obligatoire + un check pour
|
| 170 |
-
``input_types``/``output_types`` non vides.
|
| 171 |
-
"""
|
| 172 |
-
checks: list[AuditCheck] = []
|
| 173 |
-
for field_name in _REQUIRED_TEXT_FIELDS:
|
| 174 |
-
value = getattr(manifest, field_name, None)
|
| 175 |
-
ok = isinstance(value, str) and bool(value.strip())
|
| 176 |
-
checks.append(AuditCheck(
|
| 177 |
-
name=f"manifest.{field_name}",
|
| 178 |
-
passed=ok,
|
| 179 |
-
detail=None if ok else f"champ '{field_name}' vide ou absent",
|
| 180 |
-
))
|
| 181 |
-
# input_types / output_types : au moins une entrée chacun
|
| 182 |
-
in_ok = (
|
| 183 |
-
isinstance(manifest.input_types, list)
|
| 184 |
-
and len(manifest.input_types) > 0
|
| 185 |
-
and all(
|
| 186 |
-
isinstance(t, str) and t for t in manifest.input_types
|
| 187 |
-
)
|
| 188 |
-
)
|
| 189 |
-
checks.append(AuditCheck(
|
| 190 |
-
name="manifest.input_types",
|
| 191 |
-
passed=in_ok,
|
| 192 |
-
detail=None if in_ok else "input_types vide ou non-string",
|
| 193 |
-
))
|
| 194 |
-
out_ok = (
|
| 195 |
-
isinstance(manifest.output_types, list)
|
| 196 |
-
and len(manifest.output_types) > 0
|
| 197 |
-
and all(
|
| 198 |
-
isinstance(t, str) and t for t in manifest.output_types
|
| 199 |
-
)
|
| 200 |
-
)
|
| 201 |
-
checks.append(AuditCheck(
|
| 202 |
-
name="manifest.output_types",
|
| 203 |
-
passed=out_ok,
|
| 204 |
-
detail=None if out_ok else "output_types vide ou non-string",
|
| 205 |
-
))
|
| 206 |
-
return checks
|
| 207 |
-
|
| 208 |
-
|
| 209 |
-
def _is_base_module(cls: Any) -> bool:
|
| 210 |
-
"""Best-effort : vérifie que cls hérite de BaseModule.
|
| 211 |
-
|
| 212 |
-
On ne **pas** importer ``BaseModule`` au top-level pour
|
| 213 |
-
éviter les cycles : on inspecte la chaîne de classes par
|
| 214 |
-
leur nom.
|
| 215 |
-
"""
|
| 216 |
-
try:
|
| 217 |
-
for base in cls.__mro__:
|
| 218 |
-
if base.__name__ == "BaseModule":
|
| 219 |
-
return True
|
| 220 |
-
except AttributeError:
|
| 221 |
-
return False
|
| 222 |
-
return False
|
| 223 |
-
|
| 224 |
-
|
| 225 |
-
def audit_module(
|
| 226 |
-
module_class_or_instance: Any,
|
| 227 |
-
manifest: ModuleManifest,
|
| 228 |
-
) -> AuditResult:
|
| 229 |
-
"""Audite un module contribué : interface + manifest.
|
| 230 |
-
|
| 231 |
-
Parameters
|
| 232 |
-
----------
|
| 233 |
-
module_class_or_instance:
|
| 234 |
-
Soit la classe ``BaseModule`` (Sprint 33), soit une
|
| 235 |
-
instance.
|
| 236 |
-
manifest:
|
| 237 |
-
``ModuleManifest`` correspondant au module.
|
| 238 |
-
|
| 239 |
-
Returns
|
| 240 |
-
-------
|
| 241 |
-
AuditResult
|
| 242 |
-
``passed=True`` ssi tous les checks passent.
|
| 243 |
-
"""
|
| 244 |
-
checks = validate_manifest(manifest)
|
| 245 |
-
|
| 246 |
-
# Check : héritage de BaseModule
|
| 247 |
-
cls = (
|
| 248 |
-
type(module_class_or_instance)
|
| 249 |
-
if not isinstance(module_class_or_instance, type)
|
| 250 |
-
else module_class_or_instance
|
| 251 |
-
)
|
| 252 |
-
inherits_base = _is_base_module(cls)
|
| 253 |
-
checks.append(AuditCheck(
|
| 254 |
-
name="module.inherits_base_module",
|
| 255 |
-
passed=inherits_base,
|
| 256 |
-
detail=(
|
| 257 |
-
None if inherits_base
|
| 258 |
-
else "la classe n'hérite pas de picarones.core.modules.BaseModule"
|
| 259 |
-
),
|
| 260 |
-
))
|
| 261 |
-
|
| 262 |
-
# Check : input_types / output_types correspondent
|
| 263 |
-
declared_in: list[str] = []
|
| 264 |
-
declared_out: list[str] = []
|
| 265 |
-
try:
|
| 266 |
-
instance = (
|
| 267 |
-
module_class_or_instance
|
| 268 |
-
if not isinstance(module_class_or_instance, type)
|
| 269 |
-
else None
|
| 270 |
-
)
|
| 271 |
-
attr_in = getattr(cls, "input_types", None)
|
| 272 |
-
attr_out = getattr(cls, "output_types", None)
|
| 273 |
-
if instance is not None:
|
| 274 |
-
attr_in = getattr(instance, "input_types", attr_in)
|
| 275 |
-
attr_out = getattr(instance, "output_types", attr_out)
|
| 276 |
-
if attr_in is not None:
|
| 277 |
-
declared_in = [
|
| 278 |
-
getattr(t, "value", str(t)) for t in attr_in
|
| 279 |
-
]
|
| 280 |
-
if attr_out is not None:
|
| 281 |
-
declared_out = [
|
| 282 |
-
getattr(t, "value", str(t)) for t in attr_out
|
| 283 |
-
]
|
| 284 |
-
except Exception: # noqa: BLE001
|
| 285 |
-
pass
|
| 286 |
-
# Comparaison case-insensitive : on accepte "TEXT" ou "text"
|
| 287 |
-
# côté manifest, le contrat sémantique est le même.
|
| 288 |
-
declared_in_lower = sorted(t.lower() for t in declared_in)
|
| 289 |
-
declared_out_lower = sorted(t.lower() for t in declared_out)
|
| 290 |
-
manifest_in_lower = sorted(t.lower() for t in manifest.input_types)
|
| 291 |
-
manifest_out_lower = sorted(t.lower() for t in manifest.output_types)
|
| 292 |
-
in_match = declared_in_lower == manifest_in_lower
|
| 293 |
-
checks.append(AuditCheck(
|
| 294 |
-
name="module.input_types_match_manifest",
|
| 295 |
-
passed=in_match,
|
| 296 |
-
detail=(
|
| 297 |
-
None if in_match
|
| 298 |
-
else f"déclaré {declared_in} vs manifest {manifest.input_types}"
|
| 299 |
-
),
|
| 300 |
-
))
|
| 301 |
-
out_match = declared_out_lower == manifest_out_lower
|
| 302 |
-
checks.append(AuditCheck(
|
| 303 |
-
name="module.output_types_match_manifest",
|
| 304 |
-
passed=out_match,
|
| 305 |
-
detail=(
|
| 306 |
-
None if out_match
|
| 307 |
-
else f"déclaré {declared_out} vs manifest {manifest.output_types}"
|
| 308 |
-
),
|
| 309 |
-
))
|
| 310 |
-
|
| 311 |
-
# Check : process callable
|
| 312 |
-
has_process = callable(getattr(cls, "process", None))
|
| 313 |
-
checks.append(AuditCheck(
|
| 314 |
-
name="module.has_process",
|
| 315 |
-
passed=has_process,
|
| 316 |
-
detail=None if has_process else "méthode process() absente",
|
| 317 |
-
))
|
| 318 |
-
|
| 319 |
-
passed = all(c.passed for c in checks)
|
| 320 |
-
return AuditResult(
|
| 321 |
-
module_name=manifest.name,
|
| 322 |
-
passed=passed,
|
| 323 |
-
checks=checks,
|
| 324 |
-
)
|
| 325 |
-
|
| 326 |
-
|
| 327 |
-
__all__ = [
|
| 328 |
-
"ModuleManifest",
|
| 329 |
-
"AuditCheck",
|
| 330 |
-
"AuditResult",
|
| 331 |
-
"validate_manifest",
|
| 332 |
-
"audit_module",
|
| 333 |
-
]
|
|
|
|
| 1 |
+
"""Re-export — Sprint A14-S10. Le contenu canonique vit dans
|
| 2 |
+
``picarones.evaluation.metrics.module_policy``.
|
| 3 |
|
| 4 |
+
L'ancien chemin ``picarones.measurements.module_policy`` est conservé pour
|
| 5 |
+
ne casser aucun consommateur. Au S22, ce re-export disparaîtra.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 6 |
"""
|
| 7 |
|
| 8 |
from __future__ import annotations
|
| 9 |
|
| 10 |
+
from picarones.evaluation.metrics.module_policy import * # noqa: F401,F403
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
@@ -1,309 +1,15 @@
|
|
| 1 |
-
"""
|
|
|
|
| 2 |
|
| 3 |
-
|
| 4 |
-
|
| 5 |
-
pour les hypothèses, dates et URLs de référence.
|
| 6 |
|
| 7 |
-
|
| 8 |
-
|
| 9 |
-
|
| 10 |
-
- Coût exprimé par **1 000 pages** traitées.
|
| 11 |
-
- Coût local = temps moyen d'inférence × taux horaire (paramétrable).
|
| 12 |
-
- Empreinte carbone optionnelle : kWh × intensité g CO₂/kWh du réseau
|
| 13 |
-
d'exécution (mix France bas carbone par défaut pour le local,
|
| 14 |
-
moyenne cloud hyperscaler pour les APIs).
|
| 15 |
"""
|
| 16 |
|
| 17 |
from __future__ import annotations
|
| 18 |
|
| 19 |
-
import
|
| 20 |
-
from
|
| 21 |
-
from pathlib import Path
|
| 22 |
-
from typing import Optional
|
| 23 |
-
|
| 24 |
-
import yaml
|
| 25 |
-
|
| 26 |
-
logger = logging.getLogger(__name__)
|
| 27 |
-
|
| 28 |
-
_DEFAULT_PRICING_PATH = Path(__file__).parent.parent / "data" / "pricing.yaml"
|
| 29 |
-
|
| 30 |
-
|
| 31 |
-
@dataclass(frozen=True)
|
| 32 |
-
class PricingDefaults:
|
| 33 |
-
"""Valeurs par défaut du fichier de prix (section ``meta``)."""
|
| 34 |
-
|
| 35 |
-
last_updated: Optional[str] = None
|
| 36 |
-
currency: str = "EUR"
|
| 37 |
-
hourly_rate_local_cpu_eur: float = 0.08
|
| 38 |
-
hourly_rate_local_gpu_eur: float = 1.20
|
| 39 |
-
grid_intensity_local: float = 58.0
|
| 40 |
-
grid_intensity_cloud: float = 380.0
|
| 41 |
-
|
| 42 |
-
|
| 43 |
-
@dataclass
|
| 44 |
-
class EngineCost:
|
| 45 |
-
"""Coût estimé d'un moteur sur 1 000 pages, avec traçabilité des hypothèses.
|
| 46 |
-
|
| 47 |
-
La représentation est immuable après construction : une fois que l'utilisateur
|
| 48 |
-
a choisi un taux horaire local, toutes les instances partagent cette
|
| 49 |
-
hypothèse par injection explicite dans ``build_costs_for_benchmark``.
|
| 50 |
-
"""
|
| 51 |
-
|
| 52 |
-
engine_key: str
|
| 53 |
-
"""Nom ou modèle servant de clé dans la table (ex. ``"gpt-4o"``, ``"tesseract"``)."""
|
| 54 |
-
|
| 55 |
-
type: str # "local" | "cloud_api" | "unknown"
|
| 56 |
-
|
| 57 |
-
cost_per_1k_pages_eur: Optional[float] = None
|
| 58 |
-
"""Coût par 1 000 pages en euros. ``None`` si les données sont insuffisantes."""
|
| 59 |
-
|
| 60 |
-
currency: str = "EUR"
|
| 61 |
-
|
| 62 |
-
# Source / date
|
| 63 |
-
pricing_source_url: Optional[str] = None
|
| 64 |
-
pricing_date: Optional[str] = None
|
| 65 |
-
|
| 66 |
-
# Pour les APIs cloud : prix brut
|
| 67 |
-
api_price_per_1k_pages: Optional[float] = None
|
| 68 |
-
|
| 69 |
-
# Pour le local : temps d'inférence et taux horaire utilisés
|
| 70 |
-
local_mean_seconds_per_page: Optional[float] = None
|
| 71 |
-
hourly_rate_eur: Optional[float] = None
|
| 72 |
-
|
| 73 |
-
# Empreinte carbone (estimation — étiquetée "expérimentale" dans le rapport)
|
| 74 |
-
kwh_per_1k_pages: Optional[float] = None
|
| 75 |
-
grid_intensity_g_co2_per_kwh: Optional[float] = None
|
| 76 |
-
co2_per_1k_pages_g: Optional[float] = None
|
| 77 |
-
|
| 78 |
-
notes: Optional[str] = None
|
| 79 |
-
|
| 80 |
-
assumptions: list[str] = field(default_factory=list)
|
| 81 |
-
"""Liste d'hypothèses textuelles à afficher sous le graphique."""
|
| 82 |
-
|
| 83 |
-
def as_dict(self) -> dict:
|
| 84 |
-
return {
|
| 85 |
-
"engine_key": self.engine_key,
|
| 86 |
-
"type": self.type,
|
| 87 |
-
"cost_per_1k_pages_eur": self.cost_per_1k_pages_eur,
|
| 88 |
-
"currency": self.currency,
|
| 89 |
-
"pricing_source_url": self.pricing_source_url,
|
| 90 |
-
"pricing_date": self.pricing_date,
|
| 91 |
-
"api_price_per_1k_pages": self.api_price_per_1k_pages,
|
| 92 |
-
"local_mean_seconds_per_page": self.local_mean_seconds_per_page,
|
| 93 |
-
"hourly_rate_eur": self.hourly_rate_eur,
|
| 94 |
-
"kwh_per_1k_pages": self.kwh_per_1k_pages,
|
| 95 |
-
"grid_intensity_g_co2_per_kwh": self.grid_intensity_g_co2_per_kwh,
|
| 96 |
-
"co2_per_1k_pages_g": self.co2_per_1k_pages_g,
|
| 97 |
-
"notes": self.notes,
|
| 98 |
-
"assumptions": list(self.assumptions),
|
| 99 |
-
}
|
| 100 |
-
|
| 101 |
-
|
| 102 |
-
def load_pricing_database(path: Optional[Path] = None) -> tuple[PricingDefaults, dict]:
|
| 103 |
-
"""Charge la table de prix YAML.
|
| 104 |
-
|
| 105 |
-
Retourne ``(defaults, engines_table)`` où ``engines_table`` est un dict
|
| 106 |
-
``{engine_key: raw_entry}``.
|
| 107 |
-
"""
|
| 108 |
-
path = Path(path) if path else _DEFAULT_PRICING_PATH
|
| 109 |
-
if not path.exists():
|
| 110 |
-
logger.warning("[pricing] fichier %s introuvable", path)
|
| 111 |
-
return PricingDefaults(), {}
|
| 112 |
-
try:
|
| 113 |
-
with path.open(encoding="utf-8") as fh:
|
| 114 |
-
data = yaml.safe_load(fh) or {}
|
| 115 |
-
except yaml.YAMLError as e:
|
| 116 |
-
logger.warning("[pricing] échec parsing %s : %s", path, e)
|
| 117 |
-
return PricingDefaults(), {}
|
| 118 |
-
|
| 119 |
-
meta = data.get("meta", {}) or {}
|
| 120 |
-
defaults = PricingDefaults(
|
| 121 |
-
last_updated=meta.get("last_updated"),
|
| 122 |
-
currency=meta.get("currency", "EUR"),
|
| 123 |
-
hourly_rate_local_cpu_eur=float(meta.get("default_hourly_rate_local_cpu_eur", 0.08)),
|
| 124 |
-
hourly_rate_local_gpu_eur=float(meta.get("default_hourly_rate_local_gpu_eur", 1.20)),
|
| 125 |
-
grid_intensity_local=float(meta.get("default_grid_intensity_g_co2_per_kwh", 58.0)),
|
| 126 |
-
grid_intensity_cloud=float(meta.get("cloud_grid_intensity_g_co2_per_kwh", 380.0)),
|
| 127 |
-
)
|
| 128 |
-
engines_table = data.get("engines", {}) or {}
|
| 129 |
-
return defaults, engines_table
|
| 130 |
-
|
| 131 |
-
|
| 132 |
-
def _match_key(engine_name: str, llm_model: Optional[str], table: dict) -> Optional[str]:
|
| 133 |
-
"""Cherche la meilleure clé pour ce moteur dans la table.
|
| 134 |
-
|
| 135 |
-
Stratégie : d'abord le nom du modèle LLM (pour les pipelines), puis le
|
| 136 |
-
nom OCR, puis un match partiel (substring) comme filet de sécurité.
|
| 137 |
-
"""
|
| 138 |
-
candidates = [llm_model, engine_name]
|
| 139 |
-
for c in candidates:
|
| 140 |
-
if c and c in table:
|
| 141 |
-
return c
|
| 142 |
-
# Matching partiel — utile pour "tesseract → gpt-4o" ou "gpt-4o-vision"
|
| 143 |
-
for c in candidates:
|
| 144 |
-
if not c:
|
| 145 |
-
continue
|
| 146 |
-
for key in table:
|
| 147 |
-
if key in c:
|
| 148 |
-
return key
|
| 149 |
-
return None
|
| 150 |
-
|
| 151 |
-
|
| 152 |
-
def estimate_cost(
|
| 153 |
-
engine_name: str,
|
| 154 |
-
*,
|
| 155 |
-
llm_model: Optional[str] = None,
|
| 156 |
-
is_pipeline: bool = False,
|
| 157 |
-
measured_seconds_per_page: Optional[float] = None,
|
| 158 |
-
table: Optional[dict] = None,
|
| 159 |
-
defaults: Optional[PricingDefaults] = None,
|
| 160 |
-
hourly_rate_override_eur: Optional[float] = None,
|
| 161 |
-
) -> EngineCost:
|
| 162 |
-
"""Calcule le ``EngineCost`` pour un moteur donné.
|
| 163 |
-
|
| 164 |
-
Parameters
|
| 165 |
-
----------
|
| 166 |
-
engine_name:
|
| 167 |
-
Nom public du moteur (ex. ``"tesseract"``, ``"tesseract → gpt-4o"``).
|
| 168 |
-
llm_model:
|
| 169 |
-
Si pipeline OCR+LLM, le modèle LLM utilisé — prioritaire pour la
|
| 170 |
-
lookup car c'est lui qui domine le coût.
|
| 171 |
-
is_pipeline:
|
| 172 |
-
Indique un pipeline OCR+LLM (change la sémantique de lookup).
|
| 173 |
-
measured_seconds_per_page:
|
| 174 |
-
Temps moyen observé sur le benchmark courant. Remplace la valeur
|
| 175 |
-
indicative de la table si fournie (plus fiable).
|
| 176 |
-
table, defaults:
|
| 177 |
-
Overrides pour tests ou usage institutionnel.
|
| 178 |
-
hourly_rate_override_eur:
|
| 179 |
-
Taux horaire à utiliser pour le calcul local (sinon valeur table
|
| 180 |
-
ou défaut).
|
| 181 |
-
"""
|
| 182 |
-
if table is None or defaults is None:
|
| 183 |
-
_defaults, _table = load_pricing_database()
|
| 184 |
-
defaults = defaults or _defaults
|
| 185 |
-
table = table or _table
|
| 186 |
-
|
| 187 |
-
key = _match_key(engine_name, llm_model if is_pipeline else None, table)
|
| 188 |
-
if key is None:
|
| 189 |
-
return EngineCost(
|
| 190 |
-
engine_key=engine_name,
|
| 191 |
-
type="unknown",
|
| 192 |
-
assumptions=["Aucune entrée dans la table de prix pour ce moteur."],
|
| 193 |
-
)
|
| 194 |
-
|
| 195 |
-
entry = table[key]
|
| 196 |
-
etype = str(entry.get("type", "unknown"))
|
| 197 |
-
notes = entry.get("notes")
|
| 198 |
-
assumptions: list[str] = []
|
| 199 |
-
currency = defaults.currency
|
| 200 |
-
|
| 201 |
-
cost_eur: Optional[float] = None
|
| 202 |
-
api_price: Optional[float] = None
|
| 203 |
-
local_seconds = measured_seconds_per_page
|
| 204 |
-
hourly_rate = None
|
| 205 |
-
|
| 206 |
-
if etype == "cloud_api":
|
| 207 |
-
api_price = entry.get("api_price_per_1k_pages")
|
| 208 |
-
if api_price is not None:
|
| 209 |
-
cost_eur = float(api_price)
|
| 210 |
-
assumptions.append(
|
| 211 |
-
f"Prix API indicatif : {cost_eur:.2f} €/1000 pages "
|
| 212 |
-
f"(source : {entry.get('pricing_source_url', '—')}, {entry.get('pricing_date', 'date inconnue')})."
|
| 213 |
-
)
|
| 214 |
-
elif etype == "local":
|
| 215 |
-
indicative_seconds = entry.get("local_mean_seconds_per_page")
|
| 216 |
-
if local_seconds is None and indicative_seconds is not None:
|
| 217 |
-
local_seconds = float(indicative_seconds)
|
| 218 |
-
assumptions.append(
|
| 219 |
-
f"Temps d'inférence indicatif : {local_seconds:.1f} s/page (non mesuré sur ce benchmark)."
|
| 220 |
-
)
|
| 221 |
-
elif local_seconds is not None:
|
| 222 |
-
assumptions.append(
|
| 223 |
-
f"Temps d'inférence mesuré : {local_seconds:.1f} s/page (moyenne sur le corpus)."
|
| 224 |
-
)
|
| 225 |
-
|
| 226 |
-
hourly_rate = (
|
| 227 |
-
hourly_rate_override_eur
|
| 228 |
-
if hourly_rate_override_eur is not None
|
| 229 |
-
else entry.get("hourly_rate_override_eur")
|
| 230 |
-
)
|
| 231 |
-
if hourly_rate is None:
|
| 232 |
-
# Heuristique : si l'entrée précise un override GPU, sinon CPU
|
| 233 |
-
hourly_rate = (
|
| 234 |
-
defaults.hourly_rate_local_gpu_eur
|
| 235 |
-
if "gpu" in str(notes or "").lower()
|
| 236 |
-
else defaults.hourly_rate_local_cpu_eur
|
| 237 |
-
)
|
| 238 |
-
hourly_rate = float(hourly_rate)
|
| 239 |
-
|
| 240 |
-
if local_seconds is not None and hourly_rate is not None:
|
| 241 |
-
cost_eur = (local_seconds / 3600.0) * hourly_rate * 1000.0
|
| 242 |
-
assumptions.append(
|
| 243 |
-
f"Taux horaire appliqué : {hourly_rate:.2f} €/h "
|
| 244 |
-
f"(défaut {'GPU' if hourly_rate >= 0.5 else 'CPU'})."
|
| 245 |
-
)
|
| 246 |
-
|
| 247 |
-
# Empreinte carbone optionnelle
|
| 248 |
-
kwh_1k = entry.get("kwh_per_1k_pages")
|
| 249 |
-
grid = (
|
| 250 |
-
entry.get("grid_intensity_g_co2_per_kwh")
|
| 251 |
-
or (defaults.grid_intensity_cloud if etype == "cloud_api" else defaults.grid_intensity_local)
|
| 252 |
-
)
|
| 253 |
-
co2_g = None
|
| 254 |
-
if kwh_1k is not None and grid is not None:
|
| 255 |
-
co2_g = float(kwh_1k) * float(grid)
|
| 256 |
-
|
| 257 |
-
return EngineCost(
|
| 258 |
-
engine_key=key,
|
| 259 |
-
type=etype,
|
| 260 |
-
cost_per_1k_pages_eur=cost_eur,
|
| 261 |
-
currency=currency,
|
| 262 |
-
pricing_source_url=entry.get("pricing_source_url"),
|
| 263 |
-
pricing_date=entry.get("pricing_date"),
|
| 264 |
-
api_price_per_1k_pages=api_price,
|
| 265 |
-
local_mean_seconds_per_page=local_seconds,
|
| 266 |
-
hourly_rate_eur=hourly_rate,
|
| 267 |
-
kwh_per_1k_pages=float(kwh_1k) if kwh_1k is not None else None,
|
| 268 |
-
grid_intensity_g_co2_per_kwh=float(grid) if grid is not None else None,
|
| 269 |
-
co2_per_1k_pages_g=co2_g,
|
| 270 |
-
notes=notes,
|
| 271 |
-
assumptions=assumptions,
|
| 272 |
-
)
|
| 273 |
-
|
| 274 |
-
|
| 275 |
-
def build_costs_for_benchmark(
|
| 276 |
-
engines_summary: list[dict],
|
| 277 |
-
durations_by_engine: dict[str, float],
|
| 278 |
-
*,
|
| 279 |
-
hourly_rate_local_eur: Optional[float] = None,
|
| 280 |
-
pricing_path: Optional[Path] = None,
|
| 281 |
-
) -> dict[str, dict]:
|
| 282 |
-
"""Calcule le coût de chaque moteur d'un benchmark.
|
| 283 |
-
|
| 284 |
-
Returns
|
| 285 |
-
-------
|
| 286 |
-
dict ``{engine_name: EngineCost.as_dict()}``.
|
| 287 |
-
"""
|
| 288 |
-
defaults, table = load_pricing_database(pricing_path)
|
| 289 |
-
out: dict[str, dict] = {}
|
| 290 |
-
for e in engines_summary:
|
| 291 |
-
name = e.get("name")
|
| 292 |
-
if not name:
|
| 293 |
-
continue
|
| 294 |
-
measured = durations_by_engine.get(name)
|
| 295 |
-
llm_model = None
|
| 296 |
-
pipeline_info = e.get("pipeline_info") or {}
|
| 297 |
-
if pipeline_info:
|
| 298 |
-
llm_model = pipeline_info.get("llm_model")
|
| 299 |
-
cost = estimate_cost(
|
| 300 |
-
engine_name=name,
|
| 301 |
-
llm_model=llm_model,
|
| 302 |
-
is_pipeline=bool(e.get("is_pipeline")),
|
| 303 |
-
measured_seconds_per_page=measured,
|
| 304 |
-
table=table,
|
| 305 |
-
defaults=defaults,
|
| 306 |
-
hourly_rate_override_eur=hourly_rate_local_eur,
|
| 307 |
-
)
|
| 308 |
-
out[name] = cost.as_dict()
|
| 309 |
-
return out
|
|
|
|
| 1 |
+
"""Re-export — Sprint A14-S10. Le contenu canonique vit dans
|
| 2 |
+
``picarones.evaluation.metrics.pricing``.
|
| 3 |
|
| 4 |
+
L'ancien chemin ``picarones.measurements.pricing`` est conservé pour
|
| 5 |
+
ne casser aucun consommateur. Au S22, ce re-export disparaîtra.
|
|
|
|
| 6 |
|
| 7 |
+
Ce module ré-expose **explicitement** le symbole privé
|
| 8 |
+
``_DEFAULT_PRICING_PATH`` qu'au moins un consommateur importe
|
| 9 |
+
directement (cf. tests).
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 10 |
"""
|
| 11 |
|
| 12 |
from __future__ import annotations
|
| 13 |
|
| 14 |
+
from picarones.evaluation.metrics.pricing import * # noqa: F401,F403
|
| 15 |
+
from picarones.evaluation.metrics.pricing import _DEFAULT_PRICING_PATH # noqa: F401
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
@@ -1,254 +1,10 @@
|
|
| 1 |
-
"""
|
|
|
|
| 2 |
|
| 3 |
-
|
| 4 |
-
-
|
| 5 |
-
Le CER global d'un moteur peut sembler bon (ex. 5 %) tout en
|
| 6 |
-
masquant des **erreurs systématiques sur les tokens rares** : noms
|
| 7 |
-
propres, toponymes peu fréquents, mots techniques, formules latines
|
| 8 |
-
récurrentes mais pas dominantes. Pour un usage prosopographique
|
| 9 |
-
(indexation de noms, recherche généalogique), ce sont précisément
|
| 10 |
-
ces tokens-là qui comptent.
|
| 11 |
-
|
| 12 |
-
Ce module mesure le **rappel sur les tokens rares** d'un corpus —
|
| 13 |
-
défaut : tokens dont la fréquence corpus-wide est ≤ 2 (hapax +
|
| 14 |
-
dis legomena, terminologie de lexicométrie classique).
|
| 15 |
-
|
| 16 |
-
Hypothèse à valider expérimentalement
|
| 17 |
-
-------------------------------------
|
| 18 |
-
La conjecture du plan A.I.1 : *« cette métrique discrimine plus
|
| 19 |
-
les moteurs que le CER global »*. Si confirmée sur un corpus
|
| 20 |
-
patrimonial réel, elle gagne sa place dans le tableau de
|
| 21 |
-
classement principal — décision laissée au chercheur après
|
| 22 |
-
observation.
|
| 23 |
-
|
| 24 |
-
Stratégie de découpage
|
| 25 |
-
----------------------
|
| 26 |
-
Cohérente avec NER (38), Flesch (52), philologie (55-60) : couche
|
| 27 |
-
de calcul pure d'abord, sans intégration runner. La vue HTML
|
| 28 |
-
« worst lines / rare tokens manqués » suit dans un sprint dédié.
|
| 29 |
-
|
| 30 |
-
Pas d'enregistrement dans le registre typé Sprint 34
|
| 31 |
-
----------------------------------------------------
|
| 32 |
-
La métrique exige **trois entrées** (reference, hypothesis, set
|
| 33 |
-
des tokens rares) et le set des rares est calculé corpus-wide
|
| 34 |
-
(donc connu seulement après itération sur tout le corpus). La
|
| 35 |
-
signature ne rentre pas dans ``(TEXT, TEXT)``. L'utilisateur
|
| 36 |
-
appelle explicitement ``compute_rare_token_recall`` avec le set
|
| 37 |
-
qu'il a calculé.
|
| 38 |
"""
|
| 39 |
|
| 40 |
from __future__ import annotations
|
| 41 |
|
| 42 |
-
import
|
| 43 |
-
import re
|
| 44 |
-
from collections import Counter
|
| 45 |
-
from typing import Iterable, Optional
|
| 46 |
-
|
| 47 |
-
logger = logging.getLogger(__name__)
|
| 48 |
-
|
| 49 |
-
|
| 50 |
-
# ──────────────────────────────────────────────────────────────────────────
|
| 51 |
-
# Tokenisation Unicode-aware
|
| 52 |
-
# ──────────────────────────────────────────────────────────────────────────
|
| 53 |
-
|
| 54 |
-
# Token = séquence maximale de caractères de mot Unicode (\w en
|
| 55 |
-
# Python 3 utilise déjà la table Unicode), incluant l'apostrophe
|
| 56 |
-
# typographique '’' à l'intérieur (« l'an », « d’une ») et les
|
| 57 |
-
# tirets internes (« peut-être »). La ponctuation isolée et les
|
| 58 |
-
# espaces sont des séparateurs.
|
| 59 |
-
|
| 60 |
-
_TOKEN_RE = re.compile(
|
| 61 |
-
r"\w+(?:[’'\-]\w+)*",
|
| 62 |
-
flags=re.UNICODE,
|
| 63 |
-
)
|
| 64 |
-
|
| 65 |
-
|
| 66 |
-
def tokenize(text: Optional[str]) -> list[str]:
|
| 67 |
-
"""Tokenisation Unicode-aware.
|
| 68 |
-
|
| 69 |
-
Conserve les contractions (``l'an``, ``d’une``) et les mots
|
| 70 |
-
composés (``peut-être``, ``c'est-à-dire``) comme un seul token.
|
| 71 |
-
Casse préservée — l'utilisateur normalise lui-même via
|
| 72 |
-
``case_sensitive=False`` dans les fonctions aval s'il le veut.
|
| 73 |
-
"""
|
| 74 |
-
if not text:
|
| 75 |
-
return []
|
| 76 |
-
return _TOKEN_RE.findall(text)
|
| 77 |
-
|
| 78 |
-
|
| 79 |
-
# ──────────────────────────────────────────────────────────────────────────
|
| 80 |
-
# Distribution de fréquence corpus-wide
|
| 81 |
-
# ──────────────────────────────────────────────────────────────────────────
|
| 82 |
-
|
| 83 |
-
|
| 84 |
-
def frequency_distribution(
|
| 85 |
-
documents: Iterable[str],
|
| 86 |
-
*,
|
| 87 |
-
case_sensitive: bool = False,
|
| 88 |
-
) -> Counter[str]:
|
| 89 |
-
"""Calcule ``{token: count}`` sur l'ensemble du corpus.
|
| 90 |
-
|
| 91 |
-
Parameters
|
| 92 |
-
----------
|
| 93 |
-
documents:
|
| 94 |
-
Itérable de textes (typiquement les ``ground_truth`` des
|
| 95 |
-
documents du corpus).
|
| 96 |
-
case_sensitive:
|
| 97 |
-
Si ``False`` (défaut), tous les tokens sont mis en
|
| 98 |
-
minuscule avant comptage.
|
| 99 |
-
"""
|
| 100 |
-
counter: Counter[str] = Counter()
|
| 101 |
-
for doc in documents:
|
| 102 |
-
tokens = tokenize(doc)
|
| 103 |
-
if not case_sensitive:
|
| 104 |
-
tokens = [t.lower() for t in tokens]
|
| 105 |
-
counter.update(tokens)
|
| 106 |
-
return counter
|
| 107 |
-
|
| 108 |
-
|
| 109 |
-
def extract_rare_tokens(
|
| 110 |
-
documents: Iterable[str],
|
| 111 |
-
*,
|
| 112 |
-
max_freq: int = 2,
|
| 113 |
-
case_sensitive: bool = False,
|
| 114 |
-
) -> frozenset[str]:
|
| 115 |
-
"""Retourne l'ensemble des tokens dont la fréquence
|
| 116 |
-
corpus-wide est ``≤ max_freq``.
|
| 117 |
-
|
| 118 |
-
Convention de lexicométrie : ``max_freq=1`` retourne uniquement
|
| 119 |
-
les hapax legomena (1 occurrence) ; ``max_freq=2`` retourne
|
| 120 |
-
hapax + dis legomena (≤ 2 occurrences) — défaut.
|
| 121 |
-
|
| 122 |
-
Les tokens qui n'apparaissent **jamais** dans le corpus ne sont
|
| 123 |
-
évidemment pas inclus (le ``Counter`` ne les liste pas).
|
| 124 |
-
"""
|
| 125 |
-
if max_freq < 1:
|
| 126 |
-
raise ValueError("max_freq doit être ≥ 1")
|
| 127 |
-
counter = frequency_distribution(
|
| 128 |
-
documents, case_sensitive=case_sensitive,
|
| 129 |
-
)
|
| 130 |
-
return frozenset(t for t, c in counter.items() if c <= max_freq)
|
| 131 |
-
|
| 132 |
-
|
| 133 |
-
# ──────────────────────────────────────────────────────────────────────────
|
| 134 |
-
# Calcul du rappel par document
|
| 135 |
-
# ──────────────────────────────────────────────────────────────────────────
|
| 136 |
-
|
| 137 |
-
|
| 138 |
-
def compute_rare_token_recall(
|
| 139 |
-
reference: Optional[str],
|
| 140 |
-
hypothesis: Optional[str],
|
| 141 |
-
rare_tokens: Iterable[str],
|
| 142 |
-
*,
|
| 143 |
-
case_sensitive: bool = False,
|
| 144 |
-
) -> dict:
|
| 145 |
-
"""Calcule le rappel sur les tokens rares présents dans la GT.
|
| 146 |
-
|
| 147 |
-
Parameters
|
| 148 |
-
----------
|
| 149 |
-
reference:
|
| 150 |
-
Texte GT du document.
|
| 151 |
-
hypothesis:
|
| 152 |
-
Texte produit par l'OCR.
|
| 153 |
-
rare_tokens:
|
| 154 |
-
Itérable des tokens rares — typiquement le résultat de
|
| 155 |
-
``extract_rare_tokens`` sur le corpus complet.
|
| 156 |
-
case_sensitive:
|
| 157 |
-
Si ``False`` (défaut), la comparaison se fait sur les
|
| 158 |
-
formes minuscules.
|
| 159 |
-
|
| 160 |
-
Returns
|
| 161 |
-
-------
|
| 162 |
-
dict
|
| 163 |
-
``{
|
| 164 |
-
"n_rare_tokens_in_reference": int,
|
| 165 |
-
# nombre d'**occurrences** de tokens rares dans la GT
|
| 166 |
-
# (multiplicité préservée — un token rare présent 2
|
| 167 |
-
# fois compte 2)
|
| 168 |
-
"n_rare_tokens_recalled": int,
|
| 169 |
-
# nombre d'occurrences correctement présentes dans hyp
|
| 170 |
-
# (alignement bag-of-tokens : min(count_ref, count_hyp))
|
| 171 |
-
"recall": float,
|
| 172 |
-
# ratio dans [0, 1], ou 0.0 si aucun rare en GT
|
| 173 |
-
"missed_tokens": list[str],
|
| 174 |
-
# liste des tokens rares **manqués** (avec multiplicité,
|
| 175 |
-
# ex. "Dupont" présent 2 fois en GT et 1 fois en hyp →
|
| 176 |
-
# missed_tokens contient ["Dupont"] une fois)
|
| 177 |
-
}``
|
| 178 |
-
|
| 179 |
-
Cas dégénérés
|
| 180 |
-
-------------
|
| 181 |
-
- GT vide ou aucun token rare présent → recall = 0.0, listes
|
| 182 |
-
vides (convention : on ne récompense pas l'absence de
|
| 183 |
-
tokens rares).
|
| 184 |
-
- Hyp vide avec rares en GT → tous manqués, recall = 0.0.
|
| 185 |
-
"""
|
| 186 |
-
ref = reference or ""
|
| 187 |
-
hyp = hypothesis or ""
|
| 188 |
-
|
| 189 |
-
if case_sensitive:
|
| 190 |
-
rare_set = frozenset(rare_tokens)
|
| 191 |
-
ref_tokens = tokenize(ref)
|
| 192 |
-
hyp_tokens = tokenize(hyp)
|
| 193 |
-
else:
|
| 194 |
-
rare_set = frozenset(t.lower() for t in rare_tokens)
|
| 195 |
-
ref_tokens = [t.lower() for t in tokenize(ref)]
|
| 196 |
-
hyp_tokens = [t.lower() for t in tokenize(hyp)]
|
| 197 |
-
|
| 198 |
-
# Multiplicité : on compte uniquement les rares présents dans la GT
|
| 199 |
-
ref_rare_counts: Counter[str] = Counter(
|
| 200 |
-
t for t in ref_tokens if t in rare_set
|
| 201 |
-
)
|
| 202 |
-
n_rare_in_ref = sum(ref_rare_counts.values())
|
| 203 |
-
if n_rare_in_ref == 0:
|
| 204 |
-
return {
|
| 205 |
-
"n_rare_tokens_in_reference": 0,
|
| 206 |
-
"n_rare_tokens_recalled": 0,
|
| 207 |
-
"recall": 0.0,
|
| 208 |
-
"missed_tokens": [],
|
| 209 |
-
}
|
| 210 |
-
|
| 211 |
-
# Bag-of-tokens dans hyp pour les tokens rares uniquement
|
| 212 |
-
hyp_rare_counts: Counter[str] = Counter(
|
| 213 |
-
t for t in hyp_tokens if t in rare_set
|
| 214 |
-
)
|
| 215 |
-
# Recall multiplicitaire : pour chaque token, min(ref_count, hyp_count)
|
| 216 |
-
n_recalled = 0
|
| 217 |
-
missed: list[str] = []
|
| 218 |
-
for token, ref_count in ref_rare_counts.items():
|
| 219 |
-
hyp_count = hyp_rare_counts.get(token, 0)
|
| 220 |
-
recalled = min(ref_count, hyp_count)
|
| 221 |
-
n_recalled += recalled
|
| 222 |
-
missed_count = ref_count - recalled
|
| 223 |
-
if missed_count > 0:
|
| 224 |
-
missed.extend([token] * missed_count)
|
| 225 |
-
|
| 226 |
-
return {
|
| 227 |
-
"n_rare_tokens_in_reference": n_rare_in_ref,
|
| 228 |
-
"n_rare_tokens_recalled": n_recalled,
|
| 229 |
-
"recall": n_recalled / n_rare_in_ref,
|
| 230 |
-
"missed_tokens": missed,
|
| 231 |
-
}
|
| 232 |
-
|
| 233 |
-
|
| 234 |
-
def rare_token_recall(
|
| 235 |
-
reference: Optional[str],
|
| 236 |
-
hypothesis: Optional[str],
|
| 237 |
-
rare_tokens: Iterable[str],
|
| 238 |
-
*,
|
| 239 |
-
case_sensitive: bool = False,
|
| 240 |
-
) -> float:
|
| 241 |
-
"""Raccourci : retourne uniquement le rappel ∈ [0, 1]."""
|
| 242 |
-
return compute_rare_token_recall(
|
| 243 |
-
reference, hypothesis, rare_tokens,
|
| 244 |
-
case_sensitive=case_sensitive,
|
| 245 |
-
)["recall"]
|
| 246 |
-
|
| 247 |
-
|
| 248 |
-
__all__ = [
|
| 249 |
-
"tokenize",
|
| 250 |
-
"frequency_distribution",
|
| 251 |
-
"extract_rare_tokens",
|
| 252 |
-
"compute_rare_token_recall",
|
| 253 |
-
"rare_token_recall",
|
| 254 |
-
]
|
|
|
|
| 1 |
+
"""Re-export — Sprint A14-S10. Le contenu canonique vit dans
|
| 2 |
+
``picarones.evaluation.metrics.rare_tokens``.
|
| 3 |
|
| 4 |
+
L'ancien chemin ``picarones.measurements.rare_tokens`` est conservé pour
|
| 5 |
+
ne casser aucun consommateur. Au S22, ce re-export disparaîtra.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 6 |
"""
|
| 7 |
|
| 8 |
from __future__ import annotations
|
| 9 |
|
| 10 |
+
from picarones.evaluation.metrics.rare_tokens import * # noqa: F401,F403
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
@@ -1,287 +1,18 @@
|
|
| 1 |
-
"""
|
| 2 |
-
|
| 3 |
|
| 4 |
-
|
|
|
|
|
|
|
| 5 |
|
| 6 |
-
|
| 7 |
-
|
| 8 |
-
Le module ``picarones/core/robustness.py`` (Sprint 8) génère des
|
| 9 |
-
courbes CER vs niveau de dégradation **synthétique** (bruit, flou,
|
| 10 |
-
rotation, résolution). ``picarones/core/image_quality.py`` mesure
|
| 11 |
-
le bruit/flou/contraste **réels** des images du corpus. Ce
|
| 12 |
-
sprint **projette** les caractéristiques réelles sur les courbes
|
| 13 |
-
synthétiques pour estimer le **déficit attendu de CER** sur le
|
| 14 |
-
corpus dans son état actuel.
|
| 15 |
-
|
| 16 |
-
Lecture concrète
|
| 17 |
-
----------------
|
| 18 |
-
*« 30 % de vos documents ont un bruit équivalent à σ=15 où
|
| 19 |
-
Tesseract perd 8 points de CER — soit un déficit attendu global
|
| 20 |
-
de 2,4 points (30 % × 8 points). »*
|
| 21 |
-
|
| 22 |
-
Méthode
|
| 23 |
-
-------
|
| 24 |
-
1. Pour chaque document, on extrait la valeur de qualité réelle
|
| 25 |
-
(``noise_level``, ``blur_score``, ``contrast_score``…) depuis
|
| 26 |
-
``ImageQualityResult``.
|
| 27 |
-
2. Pour chaque type de dégradation, on interpole linéairement la
|
| 28 |
-
``DegradationCurve`` synthétique : CER attendu à ce niveau.
|
| 29 |
-
3. On agrège : CER moyen attendu, % docs au-dessus du seuil
|
| 30 |
-
critique de la courbe, déficit projeté = CER_attendu -
|
| 31 |
-
CER_baseline (niveau nul).
|
| 32 |
-
|
| 33 |
-
Sortie
|
| 34 |
-
------
|
| 35 |
-
``project_robustness_on_corpus(curves, image_qualities)`` retourne
|
| 36 |
-
``{engine_name: {degradation_type: {expected_cer_mean,
|
| 37 |
-
deficit_vs_baseline, n_docs_above_critical, n_docs}}}``.
|
| 38 |
-
|
| 39 |
-
Limites
|
| 40 |
-
-------
|
| 41 |
-
- Mapping ``image_quality → degradation level`` : on suppose que
|
| 42 |
-
``noise_level`` (ImageQualityResult) correspond à σ
|
| 43 |
-
(DegradationCurve), et idem pour ``blur_score`` ↔ rayon de
|
| 44 |
-
flou. Si un corpus expose ces valeurs avec une échelle
|
| 45 |
-
différente, le mapping est documenté et l'utilisateur peut
|
| 46 |
-
passer ``quality_to_level`` custom.
|
| 47 |
-
- Interpolation **linéaire** entre les points de la courbe. Au-
|
| 48 |
-
delà des bornes, on **clip** au point extrême (pas
|
| 49 |
-
d'extrapolation hasardeuse).
|
| 50 |
"""
|
| 51 |
|
| 52 |
from __future__ import annotations
|
| 53 |
|
| 54 |
-
import
|
| 55 |
-
import
|
| 56 |
-
|
| 57 |
-
|
| 58 |
-
|
| 59 |
-
|
| 60 |
-
|
| 61 |
-
# Mapping par défaut entre attributs ImageQualityResult et types
|
| 62 |
-
# de dégradation synthétique. L'utilisateur peut passer un dict
|
| 63 |
-
# custom pour modifier ce mapping.
|
| 64 |
-
_DEFAULT_QUALITY_FIELD: dict[str, str] = {
|
| 65 |
-
"noise": "noise_level", # σ
|
| 66 |
-
"blur": "blur_score", # Variance laplacienne (inverse)
|
| 67 |
-
"contrast": "contrast_score",
|
| 68 |
-
"rotation": "rotation_angle",
|
| 69 |
-
"resolution": "resolution_score", # peut être absent
|
| 70 |
-
}
|
| 71 |
-
|
| 72 |
-
|
| 73 |
-
def _interpolate_cer(
|
| 74 |
-
levels: list[float],
|
| 75 |
-
cer_values: list[Optional[float]],
|
| 76 |
-
target_level: float,
|
| 77 |
-
) -> Optional[float]:
|
| 78 |
-
"""Interpolation linéaire : retourne CER attendu à
|
| 79 |
-
``target_level``.
|
| 80 |
-
|
| 81 |
-
- Si ``target_level`` est en-dessous du minimum de levels,
|
| 82 |
-
retourne le CER au minimum (clip).
|
| 83 |
-
- Si au-dessus du maximum, retourne le CER au maximum.
|
| 84 |
-
- Sinon, interpolation linéaire entre les deux points
|
| 85 |
-
encadrants.
|
| 86 |
-
- Retourne ``None`` si aucun ``cer_value`` valide.
|
| 87 |
-
"""
|
| 88 |
-
if not levels:
|
| 89 |
-
return None
|
| 90 |
-
# Filtrer les paires (level, cer) où cer est None
|
| 91 |
-
pairs = [
|
| 92 |
-
(lvl, cer) for lvl, cer in zip(levels, cer_values)
|
| 93 |
-
if cer is not None
|
| 94 |
-
]
|
| 95 |
-
if not pairs:
|
| 96 |
-
return None
|
| 97 |
-
pairs.sort(key=lambda p: p[0])
|
| 98 |
-
# Clip
|
| 99 |
-
if target_level <= pairs[0][0]:
|
| 100 |
-
return pairs[0][1]
|
| 101 |
-
if target_level >= pairs[-1][0]:
|
| 102 |
-
return pairs[-1][1]
|
| 103 |
-
# Interpolation
|
| 104 |
-
for i in range(len(pairs) - 1):
|
| 105 |
-
lo_lvl, lo_cer = pairs[i]
|
| 106 |
-
hi_lvl, hi_cer = pairs[i + 1]
|
| 107 |
-
if lo_lvl <= target_level <= hi_lvl:
|
| 108 |
-
if hi_lvl == lo_lvl:
|
| 109 |
-
return lo_cer
|
| 110 |
-
ratio = (target_level - lo_lvl) / (hi_lvl - lo_lvl)
|
| 111 |
-
return lo_cer + (hi_cer - lo_cer) * ratio
|
| 112 |
-
return None # ne devrait pas arriver
|
| 113 |
-
|
| 114 |
-
|
| 115 |
-
def _extract_quality_value(
|
| 116 |
-
quality: dict, degradation_type: str,
|
| 117 |
-
custom_mapping: Optional[dict[str, str]] = None,
|
| 118 |
-
) -> Optional[float]:
|
| 119 |
-
"""Extrait la valeur de qualité pertinente pour un type de
|
| 120 |
-
dégradation depuis un ``ImageQualityResult.as_dict()``."""
|
| 121 |
-
mapping = custom_mapping or _DEFAULT_QUALITY_FIELD
|
| 122 |
-
field = mapping.get(degradation_type)
|
| 123 |
-
if field is None:
|
| 124 |
-
return None
|
| 125 |
-
value = quality.get(field)
|
| 126 |
-
if value is None:
|
| 127 |
-
return None
|
| 128 |
-
try:
|
| 129 |
-
return float(value)
|
| 130 |
-
except (TypeError, ValueError):
|
| 131 |
-
return None
|
| 132 |
-
|
| 133 |
-
|
| 134 |
-
def project_robustness_on_corpus(
|
| 135 |
-
curves: Iterable,
|
| 136 |
-
image_qualities: list[dict],
|
| 137 |
-
*,
|
| 138 |
-
quality_to_level: Optional[Callable[[dict, str], Optional[float]]] = None,
|
| 139 |
-
critical_threshold: Optional[float] = None,
|
| 140 |
-
) -> dict:
|
| 141 |
-
"""Projette les courbes de robustesse sur les qualités réelles.
|
| 142 |
-
|
| 143 |
-
Parameters
|
| 144 |
-
----------
|
| 145 |
-
curves:
|
| 146 |
-
Itérable de ``DegradationCurve`` (ou dicts compatibles
|
| 147 |
-
avec ``engine_name``, ``degradation_type``, ``levels``,
|
| 148 |
-
``cer_values``, ``critical_threshold_level``).
|
| 149 |
-
image_qualities:
|
| 150 |
-
Liste de dicts ``ImageQualityResult.as_dict()`` (un par
|
| 151 |
-
document). Si vide, retourne une projection vide.
|
| 152 |
-
quality_to_level:
|
| 153 |
-
Fonction custom ``(quality_dict, degradation_type) →
|
| 154 |
-
Optional[float]`` pour adapter le mapping qualité→niveau.
|
| 155 |
-
Par défaut, utilise ``_DEFAULT_QUALITY_FIELD``.
|
| 156 |
-
critical_threshold:
|
| 157 |
-
Override pour le seuil critique de CER (défaut : utilise
|
| 158 |
-
``DegradationCurve.cer_threshold``).
|
| 159 |
-
|
| 160 |
-
Returns
|
| 161 |
-
-------
|
| 162 |
-
dict
|
| 163 |
-
``{
|
| 164 |
-
engine_name: {
|
| 165 |
-
degradation_type: {
|
| 166 |
-
"n_docs": int,
|
| 167 |
-
"n_docs_with_data": int, # qualité disponible
|
| 168 |
-
"expected_cer_mean": float, # moyenne CER attendu
|
| 169 |
-
"expected_cer_median": float,
|
| 170 |
-
"baseline_cer": float, # CER à niveau min
|
| 171 |
-
"deficit_vs_baseline": float,
|
| 172 |
-
"n_docs_above_critical": int,
|
| 173 |
-
"critical_threshold_level": float | None,
|
| 174 |
-
"critical_threshold_cer": float,
|
| 175 |
-
},
|
| 176 |
-
},
|
| 177 |
-
}``
|
| 178 |
-
"""
|
| 179 |
-
extractor = quality_to_level or (
|
| 180 |
-
lambda q, dt: _extract_quality_value(q, dt)
|
| 181 |
-
)
|
| 182 |
-
out: dict[str, dict] = {}
|
| 183 |
-
|
| 184 |
-
for curve in curves:
|
| 185 |
-
# Accepter dict ou DegradationCurve
|
| 186 |
-
if hasattr(curve, "as_dict"):
|
| 187 |
-
data = curve.as_dict()
|
| 188 |
-
else:
|
| 189 |
-
data = curve
|
| 190 |
-
engine = data.get("engine_name")
|
| 191 |
-
deg_type = data.get("degradation_type")
|
| 192 |
-
levels = data.get("levels") or []
|
| 193 |
-
cer_values = data.get("cer_values") or []
|
| 194 |
-
crit_lvl = data.get("critical_threshold_level")
|
| 195 |
-
crit_cer = (
|
| 196 |
-
critical_threshold
|
| 197 |
-
if critical_threshold is not None
|
| 198 |
-
else data.get("cer_threshold", 0.20)
|
| 199 |
-
)
|
| 200 |
-
if not engine or not deg_type:
|
| 201 |
-
continue
|
| 202 |
-
|
| 203 |
-
per_doc_cer: list[float] = []
|
| 204 |
-
n_docs_with_data = 0
|
| 205 |
-
n_above_critical = 0
|
| 206 |
-
for quality in image_qualities:
|
| 207 |
-
level = extractor(quality, deg_type)
|
| 208 |
-
if level is None:
|
| 209 |
-
continue
|
| 210 |
-
n_docs_with_data += 1
|
| 211 |
-
cer = _interpolate_cer(levels, cer_values, level)
|
| 212 |
-
if cer is None:
|
| 213 |
-
continue
|
| 214 |
-
per_doc_cer.append(cer)
|
| 215 |
-
if cer > crit_cer:
|
| 216 |
-
n_above_critical += 1
|
| 217 |
-
|
| 218 |
-
if not per_doc_cer:
|
| 219 |
-
continue
|
| 220 |
-
|
| 221 |
-
# Baseline = CER au niveau minimum (sans dégradation)
|
| 222 |
-
baseline = _interpolate_cer(
|
| 223 |
-
levels, cer_values,
|
| 224 |
-
min(levels) if levels else 0.0,
|
| 225 |
-
)
|
| 226 |
-
expected_mean = statistics.fmean(per_doc_cer)
|
| 227 |
-
expected_median = statistics.median(per_doc_cer)
|
| 228 |
-
deficit = (
|
| 229 |
-
expected_mean - baseline
|
| 230 |
-
if baseline is not None else None
|
| 231 |
-
)
|
| 232 |
-
|
| 233 |
-
out.setdefault(engine, {})[deg_type] = {
|
| 234 |
-
"n_docs": len(image_qualities),
|
| 235 |
-
"n_docs_with_data": n_docs_with_data,
|
| 236 |
-
"expected_cer_mean": expected_mean,
|
| 237 |
-
"expected_cer_median": expected_median,
|
| 238 |
-
"baseline_cer": baseline,
|
| 239 |
-
"deficit_vs_baseline": deficit,
|
| 240 |
-
"n_docs_above_critical": n_above_critical,
|
| 241 |
-
"critical_threshold_level": crit_lvl,
|
| 242 |
-
"critical_threshold_cer": crit_cer,
|
| 243 |
-
}
|
| 244 |
-
return out
|
| 245 |
-
|
| 246 |
-
|
| 247 |
-
def aggregate_projection_per_engine(projection: dict) -> dict:
|
| 248 |
-
"""Pour chaque moteur, agrège le déficit projeté en sommant
|
| 249 |
-
sur tous les types de dégradation.
|
| 250 |
-
|
| 251 |
-
Lecture : *« déficit total attendu pour Tesseract = 5,2 points
|
| 252 |
-
de CER si on considère les 4 dégradations indépendamment »*.
|
| 253 |
-
|
| 254 |
-
Note : la sommation **suppose l'indépendance** des
|
| 255 |
-
dégradations, ce qui n'est pas strictement vrai mais reste
|
| 256 |
-
une approximation utile pour le diagnostic.
|
| 257 |
-
"""
|
| 258 |
-
out: dict[str, dict] = {}
|
| 259 |
-
for engine, per_type in projection.items():
|
| 260 |
-
total_deficit = 0.0
|
| 261 |
-
n_types_with_data = 0
|
| 262 |
-
max_deficit_type: Optional[tuple[str, float]] = None
|
| 263 |
-
for deg_type, stats in per_type.items():
|
| 264 |
-
deficit = stats.get("deficit_vs_baseline")
|
| 265 |
-
if deficit is None:
|
| 266 |
-
continue
|
| 267 |
-
total_deficit += deficit
|
| 268 |
-
n_types_with_data += 1
|
| 269 |
-
if max_deficit_type is None or deficit > max_deficit_type[1]:
|
| 270 |
-
max_deficit_type = (deg_type, deficit)
|
| 271 |
-
out[engine] = {
|
| 272 |
-
"total_expected_deficit": total_deficit,
|
| 273 |
-
"n_degradation_types": n_types_with_data,
|
| 274 |
-
"worst_degradation_type": (
|
| 275 |
-
max_deficit_type[0] if max_deficit_type else None
|
| 276 |
-
),
|
| 277 |
-
"worst_degradation_deficit": (
|
| 278 |
-
max_deficit_type[1] if max_deficit_type else None
|
| 279 |
-
),
|
| 280 |
-
}
|
| 281 |
-
return out
|
| 282 |
-
|
| 283 |
-
|
| 284 |
-
__all__ = [
|
| 285 |
-
"project_robustness_on_corpus",
|
| 286 |
-
"aggregate_projection_per_engine",
|
| 287 |
-
]
|
|
|
|
| 1 |
+
"""Re-export — Sprint A14-S10. Le contenu canonique vit dans
|
| 2 |
+
``picarones.evaluation.metrics.robustness_projection``.
|
| 3 |
|
| 4 |
+
L'ancien chemin ``picarones.measurements.robustness_projection`` est
|
| 5 |
+
conservé pour ne casser aucun consommateur. Au S22, ce re-export
|
| 6 |
+
disparaîtra.
|
| 7 |
|
| 8 |
+
Ré-expose explicitement ``_extract_quality_value`` et
|
| 9 |
+
``_interpolate_cer`` (symboles privés utilisés downstream).
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 10 |
"""
|
| 11 |
|
| 12 |
from __future__ import annotations
|
| 13 |
|
| 14 |
+
from picarones.evaluation.metrics.robustness_projection import * # noqa: F401,F403
|
| 15 |
+
from picarones.evaluation.metrics.robustness_projection import ( # noqa: F401
|
| 16 |
+
_extract_quality_value,
|
| 17 |
+
_interpolate_cer,
|
| 18 |
+
)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
@@ -1,161 +1,10 @@
|
|
| 1 |
-
"""
|
|
|
|
| 2 |
|
| 3 |
-
|
| 4 |
-
|
| 5 |
-
Pourquoi ce module
|
| 6 |
-
------------------
|
| 7 |
-
Le détecteur narratif ``error_profile_outlier`` (Sprint 19) signale
|
| 8 |
-
qu'un moteur a un profil taxonomique éloigné de ses concurrents,
|
| 9 |
-
mais le rapport n'expose pas cette différence visuellement. Ce
|
| 10 |
-
sprint répond à *« deux moteurs ont le même CER global, mais lequel
|
| 11 |
-
fait des erreurs plus récupérables ? »*.
|
| 12 |
-
|
| 13 |
-
Lecture concrète
|
| 14 |
-
----------------
|
| 15 |
-
- Moteur A : 80 % d'erreurs ``case_error`` → toutes corrigeables
|
| 16 |
-
par un post-processing trivial (récupérables).
|
| 17 |
-
- Moteur B : 80 % d'erreurs ``lacuna`` (mots manquants) →
|
| 18 |
-
irrécupérables sans relire l'image.
|
| 19 |
-
|
| 20 |
-
À CER égal, A est massivement préférable pour un workflow
|
| 21 |
-
d'édition critique. Cette vue rend la différence visible.
|
| 22 |
-
|
| 23 |
-
Catégorisation des classes
|
| 24 |
-
--------------------------
|
| 25 |
-
On annote chaque classe d'erreur d'un degré de **récupérabilité**
|
| 26 |
-
(critère éditorial pragmatique, pas verdict imposé) :
|
| 27 |
-
|
| 28 |
-
- ``recoverable`` : récupérable par post-processing trivial
|
| 29 |
-
(case_error, ligature_error, abbreviation_error)
|
| 30 |
-
- ``difficult`` : récupérable au prix d'un effort
|
| 31 |
-
(diacritic_error, visual_confusion, hapax)
|
| 32 |
-
- ``irrecoverable`` : impossible à corriger sans l'image
|
| 33 |
-
(lacuna, oov_character, segmentation_error)
|
| 34 |
-
|
| 35 |
-
L'utilisateur consulte ces catégories comme un guide, pas un
|
| 36 |
-
verdict — c'est lui qui juge selon ses besoins éditoriaux.
|
| 37 |
"""
|
| 38 |
|
| 39 |
from __future__ import annotations
|
| 40 |
|
| 41 |
-
import
|
| 42 |
-
from typing import Optional
|
| 43 |
-
|
| 44 |
-
logger = logging.getLogger(__name__)
|
| 45 |
-
|
| 46 |
-
|
| 47 |
-
# Classification éditoriale. Documentée dans la docstring.
|
| 48 |
-
RECOVERABILITY: dict[str, str] = {
|
| 49 |
-
"case_error": "recoverable",
|
| 50 |
-
"ligature_error": "recoverable",
|
| 51 |
-
"abbreviation_error": "recoverable",
|
| 52 |
-
"diacritic_error": "difficult",
|
| 53 |
-
"visual_confusion": "difficult",
|
| 54 |
-
"hapax": "difficult",
|
| 55 |
-
"lacuna": "irrecoverable",
|
| 56 |
-
"oov_character": "irrecoverable",
|
| 57 |
-
"segmentation_error": "irrecoverable",
|
| 58 |
-
}
|
| 59 |
-
|
| 60 |
-
|
| 61 |
-
def _normalize_counts(counts: dict[str, int]) -> dict[str, float]:
|
| 62 |
-
"""Convertit un dict de comptes en proportions [0, 1]."""
|
| 63 |
-
total = sum(counts.values())
|
| 64 |
-
if total <= 0:
|
| 65 |
-
return {k: 0.0 for k in counts}
|
| 66 |
-
return {k: v / total for k, v in counts.items()}
|
| 67 |
-
|
| 68 |
-
|
| 69 |
-
def compare_taxonomies(
|
| 70 |
-
engine_a_name: str,
|
| 71 |
-
engine_a_counts: dict[str, int],
|
| 72 |
-
engine_b_name: str,
|
| 73 |
-
engine_b_counts: dict[str, int],
|
| 74 |
-
) -> Optional[dict]:
|
| 75 |
-
"""Compare deux profils taxonomiques.
|
| 76 |
-
|
| 77 |
-
Parameters
|
| 78 |
-
----------
|
| 79 |
-
engine_a_name, engine_b_name:
|
| 80 |
-
Noms d'identification des moteurs (utilisés dans le rendu).
|
| 81 |
-
engine_a_counts, engine_b_counts:
|
| 82 |
-
Maps ``{class_name: count}`` produites par
|
| 83 |
-
``aggregate_taxonomy``.
|
| 84 |
-
|
| 85 |
-
Returns
|
| 86 |
-
-------
|
| 87 |
-
Optional[dict]
|
| 88 |
-
``{
|
| 89 |
-
"engine_a": str, "engine_b": str,
|
| 90 |
-
"total_a": int, "total_b": int,
|
| 91 |
-
"classes": list[str], # classes apparaissant chez A ou B
|
| 92 |
-
"proportions_a": dict[str, float],
|
| 93 |
-
"proportions_b": dict[str, float],
|
| 94 |
-
"deltas": dict[str, float], # prop_b - prop_a (signé)
|
| 95 |
-
"recoverability": dict[str, str], # mapping class → niveau
|
| 96 |
-
"totals_by_recoverability": {
|
| 97 |
-
"recoverable": {"a": float, "b": float},
|
| 98 |
-
"difficult": {"a": float, "b": float},
|
| 99 |
-
"irrecoverable": {"a": float, "b": float},
|
| 100 |
-
},
|
| 101 |
-
}``
|
| 102 |
-
Ou ``None`` si les deux moteurs ont 0 erreur chacun.
|
| 103 |
-
"""
|
| 104 |
-
if engine_a_name == engine_b_name:
|
| 105 |
-
# On accepte des comparaisons même si les noms sont
|
| 106 |
-
# identiques (cas tests), mais on émet un warning.
|
| 107 |
-
logger.warning(
|
| 108 |
-
"[taxonomy_comparison] engine_a et engine_b ont le même nom : %s",
|
| 109 |
-
engine_a_name,
|
| 110 |
-
)
|
| 111 |
-
|
| 112 |
-
total_a = sum(engine_a_counts.values()) if engine_a_counts else 0
|
| 113 |
-
total_b = sum(engine_b_counts.values()) if engine_b_counts else 0
|
| 114 |
-
if total_a == 0 and total_b == 0:
|
| 115 |
-
return None
|
| 116 |
-
|
| 117 |
-
classes = sorted(set(engine_a_counts) | set(engine_b_counts))
|
| 118 |
-
if not classes:
|
| 119 |
-
return None
|
| 120 |
-
|
| 121 |
-
prop_a = _normalize_counts(
|
| 122 |
-
{c: engine_a_counts.get(c, 0) for c in classes},
|
| 123 |
-
)
|
| 124 |
-
prop_b = _normalize_counts(
|
| 125 |
-
{c: engine_b_counts.get(c, 0) for c in classes},
|
| 126 |
-
)
|
| 127 |
-
deltas = {c: prop_b[c] - prop_a[c] for c in classes}
|
| 128 |
-
|
| 129 |
-
# Agrégat par récupérabilité (utile pour la lecture rapide)
|
| 130 |
-
totals_recov: dict[str, dict[str, float]] = {
|
| 131 |
-
"recoverable": {"a": 0.0, "b": 0.0},
|
| 132 |
-
"difficult": {"a": 0.0, "b": 0.0},
|
| 133 |
-
"irrecoverable": {"a": 0.0, "b": 0.0},
|
| 134 |
-
}
|
| 135 |
-
for cls in classes:
|
| 136 |
-
level = RECOVERABILITY.get(cls, "difficult")
|
| 137 |
-
if level not in totals_recov:
|
| 138 |
-
level = "difficult"
|
| 139 |
-
totals_recov[level]["a"] += prop_a[cls]
|
| 140 |
-
totals_recov[level]["b"] += prop_b[cls]
|
| 141 |
-
|
| 142 |
-
return {
|
| 143 |
-
"engine_a": engine_a_name,
|
| 144 |
-
"engine_b": engine_b_name,
|
| 145 |
-
"total_a": total_a,
|
| 146 |
-
"total_b": total_b,
|
| 147 |
-
"classes": classes,
|
| 148 |
-
"proportions_a": prop_a,
|
| 149 |
-
"proportions_b": prop_b,
|
| 150 |
-
"deltas": deltas,
|
| 151 |
-
"recoverability": {
|
| 152 |
-
cls: RECOVERABILITY.get(cls, "difficult") for cls in classes
|
| 153 |
-
},
|
| 154 |
-
"totals_by_recoverability": totals_recov,
|
| 155 |
-
}
|
| 156 |
-
|
| 157 |
-
|
| 158 |
-
__all__ = [
|
| 159 |
-
"RECOVERABILITY",
|
| 160 |
-
"compare_taxonomies",
|
| 161 |
-
]
|
|
|
|
| 1 |
+
"""Re-export — Sprint A14-S10. Le contenu canonique vit dans
|
| 2 |
+
``picarones.evaluation.metrics.taxonomy_comparison``.
|
| 3 |
|
| 4 |
+
L'ancien chemin ``picarones.measurements.taxonomy_comparison`` est conservé pour
|
| 5 |
+
ne casser aucun consommateur. Au S22, ce re-export disparaîtra.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 6 |
"""
|
| 7 |
|
| 8 |
from __future__ import annotations
|
| 9 |
|
| 10 |
+
from picarones.evaluation.metrics.taxonomy_comparison import * # noqa: F401,F403
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
@@ -1,150 +1,10 @@
|
|
| 1 |
-
"""
|
|
|
|
| 2 |
|
| 3 |
-
|
| 4 |
-
|
| 5 |
-
Pourquoi ce module
|
| 6 |
-
------------------
|
| 7 |
-
La taxonomie d'erreurs (10 classes, ``picarones/core/taxonomy.py``)
|
| 8 |
-
est calculée par document mais le rapport actuel ne montre qu'un
|
| 9 |
-
seul histogramme global. La roadmap A.I.4 demande trois lectures
|
| 10 |
-
plus fines de cette taxonomie ; ce sprint livre la première :
|
| 11 |
-
**co-occurrence**.
|
| 12 |
-
|
| 13 |
-
Si ``ligature_error`` et ``abbreviation_error`` co-occurrent
|
| 14 |
-
toujours dans les mêmes documents, c'est un signal de scribe
|
| 15 |
-
particulier — utile pour stratifier le corpus *a posteriori*
|
| 16 |
-
(qu'est-ce qui caractérise les documents difficiles ?).
|
| 17 |
-
|
| 18 |
-
Mesure
|
| 19 |
-
------
|
| 20 |
-
Indice de **Jaccard** entre paires de classes au niveau
|
| 21 |
-
**document** :
|
| 22 |
-
|
| 23 |
-
.. math::
|
| 24 |
-
|
| 25 |
-
J(A, B) = \\frac{|D_A \\cap D_B|}{|D_A \\cup D_B|}
|
| 26 |
-
|
| 27 |
-
où ``D_X`` est l'ensemble des documents qui contiennent au moins
|
| 28 |
-
une erreur de classe ``X``.
|
| 29 |
-
|
| 30 |
-
- ``J(A, B) = 1`` : A et B apparaissent toujours ensemble (et
|
| 31 |
-
jamais l'un sans l'autre).
|
| 32 |
-
- ``J(A, B) = 0`` : A et B ne co-occurrent jamais.
|
| 33 |
-
- ``J(A, B) = 0,5`` : A et B partagent la moitié de leur union.
|
| 34 |
-
|
| 35 |
-
Stratégie de découpage
|
| 36 |
-
----------------------
|
| 37 |
-
Couche de calcul pure d'abord (pattern Sprint 35, 38, 52-58).
|
| 38 |
-
Le rendu HTML (heatmap SVG) est livré dans le même sprint pour
|
| 39 |
-
boucler la dimension ; les chantiers 2 et 3 d'A.I.4 (évolution
|
| 40 |
-
intra-document, taxonomie comparative) suivent.
|
| 41 |
"""
|
| 42 |
|
| 43 |
from __future__ import annotations
|
| 44 |
|
| 45 |
-
import
|
| 46 |
-
from typing import Iterable, Optional
|
| 47 |
-
|
| 48 |
-
logger = logging.getLogger(__name__)
|
| 49 |
-
|
| 50 |
-
|
| 51 |
-
def compute_taxonomy_cooccurrence(
|
| 52 |
-
per_doc_classes: Iterable[Iterable[str]],
|
| 53 |
-
*,
|
| 54 |
-
min_doc_count: int = 1,
|
| 55 |
-
top_n_pairs: int = 10,
|
| 56 |
-
) -> Optional[dict]:
|
| 57 |
-
"""Calcule la matrice de Jaccard inter-classes au niveau document.
|
| 58 |
-
|
| 59 |
-
Parameters
|
| 60 |
-
----------
|
| 61 |
-
per_doc_classes:
|
| 62 |
-
Itérable de docs, chaque doc étant un itérable de noms de
|
| 63 |
-
classes taxonomiques détectées (set, list, tuple…).
|
| 64 |
-
Les doublons à l'intérieur d'un doc sont ignorés (présence
|
| 65 |
-
binaire au niveau doc).
|
| 66 |
-
min_doc_count:
|
| 67 |
-
Nombre minimum de documents dans lesquels une classe doit
|
| 68 |
-
apparaître pour figurer dans la matrice (défaut 1).
|
| 69 |
-
Permet d'écarter les classes anecdotiques.
|
| 70 |
-
top_n_pairs:
|
| 71 |
-
Nombre de paires retournées dans ``top_pairs`` (triées par
|
| 72 |
-
Jaccard décroissant). Défaut 10.
|
| 73 |
-
|
| 74 |
-
Returns
|
| 75 |
-
-------
|
| 76 |
-
Optional[dict]
|
| 77 |
-
``{
|
| 78 |
-
"classes": list[str], # triées alpha
|
| 79 |
-
"n_documents": int,
|
| 80 |
-
"doc_count": dict[str, int], # nb docs par classe
|
| 81 |
-
"cooccurrence_matrix": dict[str, dict[str, float]],
|
| 82 |
-
# symétrique, diagonale = 1.0 (sauf classe vide)
|
| 83 |
-
"top_pairs": list[tuple[str, str, float]],
|
| 84 |
-
# paires les plus co-occurrentes (Jaccard désc.)
|
| 85 |
-
}``
|
| 86 |
-
ou ``None`` si aucune classe ne dépasse ``min_doc_count``
|
| 87 |
-
ou si l'itérable est vide.
|
| 88 |
-
"""
|
| 89 |
-
docs: list[frozenset[str]] = []
|
| 90 |
-
for doc_classes in per_doc_classes:
|
| 91 |
-
if doc_classes is None:
|
| 92 |
-
continue
|
| 93 |
-
cleaned = frozenset(c for c in doc_classes if c)
|
| 94 |
-
docs.append(cleaned)
|
| 95 |
-
if not docs:
|
| 96 |
-
return None
|
| 97 |
-
|
| 98 |
-
# Comptage par classe
|
| 99 |
-
doc_count: dict[str, int] = {}
|
| 100 |
-
for doc in docs:
|
| 101 |
-
for cls in doc:
|
| 102 |
-
doc_count[cls] = doc_count.get(cls, 0) + 1
|
| 103 |
-
|
| 104 |
-
# Filtrage min_doc_count
|
| 105 |
-
classes = sorted(
|
| 106 |
-
c for c, n in doc_count.items() if n >= min_doc_count
|
| 107 |
-
)
|
| 108 |
-
if not classes:
|
| 109 |
-
return None
|
| 110 |
-
|
| 111 |
-
# Matrice de Jaccard
|
| 112 |
-
matrix: dict[str, dict[str, float]] = {
|
| 113 |
-
c: {} for c in classes
|
| 114 |
-
}
|
| 115 |
-
for i, ca in enumerate(classes):
|
| 116 |
-
docs_a = {idx for idx, d in enumerate(docs) if ca in d}
|
| 117 |
-
for cb in classes[i:]:
|
| 118 |
-
if ca == cb:
|
| 119 |
-
# Diagonale : Jaccard(X, X) = 1 si X est présent
|
| 120 |
-
matrix[ca][cb] = 1.0 if docs_a else 0.0
|
| 121 |
-
continue
|
| 122 |
-
docs_b = {idx for idx, d in enumerate(docs) if cb in d}
|
| 123 |
-
inter = len(docs_a & docs_b)
|
| 124 |
-
union = len(docs_a | docs_b)
|
| 125 |
-
jaccard = inter / union if union > 0 else 0.0
|
| 126 |
-
matrix[ca][cb] = jaccard
|
| 127 |
-
matrix[cb][ca] = jaccard # symétrique
|
| 128 |
-
|
| 129 |
-
# Top paires (hors diagonale)
|
| 130 |
-
pairs: list[tuple[str, str, float]] = []
|
| 131 |
-
for i, ca in enumerate(classes):
|
| 132 |
-
for cb in classes[i + 1:]:
|
| 133 |
-
j = matrix[ca][cb]
|
| 134 |
-
if j > 0:
|
| 135 |
-
pairs.append((ca, cb, j))
|
| 136 |
-
pairs.sort(key=lambda p: (-p[2], p[0], p[1]))
|
| 137 |
-
top_pairs = pairs[:top_n_pairs]
|
| 138 |
-
|
| 139 |
-
return {
|
| 140 |
-
"classes": classes,
|
| 141 |
-
"n_documents": len(docs),
|
| 142 |
-
"doc_count": doc_count,
|
| 143 |
-
"cooccurrence_matrix": matrix,
|
| 144 |
-
"top_pairs": top_pairs,
|
| 145 |
-
}
|
| 146 |
-
|
| 147 |
-
|
| 148 |
-
__all__ = [
|
| 149 |
-
"compute_taxonomy_cooccurrence",
|
| 150 |
-
]
|
|
|
|
| 1 |
+
"""Re-export — Sprint A14-S10. Le contenu canonique vit dans
|
| 2 |
+
``picarones.evaluation.metrics.taxonomy_cooccurrence``.
|
| 3 |
|
| 4 |
+
L'ancien chemin ``picarones.measurements.taxonomy_cooccurrence`` est conservé pour
|
| 5 |
+
ne casser aucun consommateur. Au S22, ce re-export disparaîtra.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 6 |
"""
|
| 7 |
|
| 8 |
from __future__ import annotations
|
| 9 |
|
| 10 |
+
from picarones.evaluation.metrics.taxonomy_cooccurrence import * # noqa: F401,F403
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
@@ -1,165 +1,10 @@
|
|
| 1 |
-
"""
|
|
|
|
| 2 |
|
| 3 |
-
|
| 4 |
-
|
| 5 |
-
Pourquoi ce module
|
| 6 |
-
------------------
|
| 7 |
-
Le throughput brut (pages/heure d'OCR pur) ment quand un moteur
|
| 8 |
-
est rapide mais imprécis : la correction humaine *post hoc*
|
| 9 |
-
absorbe le gain. La **vraie** vitesse opérationnelle inclut
|
| 10 |
-
le temps de correction. Cette métrique discrimine fortement
|
| 11 |
-
entre un cloud rapide à 30 % de timeouts/erreurs et un local
|
| 12 |
-
lent à 100 % de fiabilité.
|
| 13 |
-
|
| 14 |
-
Formule
|
| 15 |
-
-------
|
| 16 |
-
.. code::
|
| 17 |
-
|
| 18 |
-
pages_par_heure_utilisable =
|
| 19 |
-
pages_traitées / (durée_totale + temps_correction_humaine)
|
| 20 |
-
|
| 21 |
-
Le temps de correction est estimé linéairement :
|
| 22 |
-
``temps_par_erreur × nombre_d_erreurs``. Le défaut
|
| 23 |
-
``time_per_error_seconds=5.0`` correspond aux études HTR-United
|
| 24 |
-
(saisie manuelle d'une correction de mot par un opérateur
|
| 25 |
-
formé : ≈ 5 s par erreur). L'utilisateur peut le surcharger
|
| 26 |
-
pour son institution.
|
| 27 |
-
|
| 28 |
-
Sortie
|
| 29 |
-
------
|
| 30 |
-
``compute_effective_throughput(n_pages, duration_seconds,
|
| 31 |
-
n_errors, time_per_error_seconds=5.0)`` retourne ``{n_pages,
|
| 32 |
-
duration_seconds, n_errors, time_per_error_seconds,
|
| 33 |
-
correction_time_seconds, total_seconds, pages_per_hour_raw,
|
| 34 |
-
pages_per_hour_effective, drag_ratio}``.
|
| 35 |
-
|
| 36 |
-
``aggregate_effective_throughput(per_engine_data)`` agrège par
|
| 37 |
-
moteur sur l'ensemble du corpus.
|
| 38 |
"""
|
| 39 |
|
| 40 |
from __future__ import annotations
|
| 41 |
|
| 42 |
-
import
|
| 43 |
-
from typing import Iterable, Optional
|
| 44 |
-
|
| 45 |
-
logger = logging.getLogger(__name__)
|
| 46 |
-
|
| 47 |
-
|
| 48 |
-
_DEFAULT_TIME_PER_ERROR_SECONDS = 5.0
|
| 49 |
-
|
| 50 |
-
|
| 51 |
-
def compute_effective_throughput(
|
| 52 |
-
n_pages: int,
|
| 53 |
-
duration_seconds: float,
|
| 54 |
-
n_errors: int,
|
| 55 |
-
*,
|
| 56 |
-
time_per_error_seconds: float = _DEFAULT_TIME_PER_ERROR_SECONDS,
|
| 57 |
-
) -> Optional[dict]:
|
| 58 |
-
"""Throughput effectif (pages/heure utilisables).
|
| 59 |
-
|
| 60 |
-
Parameters
|
| 61 |
-
----------
|
| 62 |
-
n_pages:
|
| 63 |
-
Nombre de pages traitées.
|
| 64 |
-
duration_seconds:
|
| 65 |
-
Durée totale de l'OCR (somme des durées par doc).
|
| 66 |
-
n_errors:
|
| 67 |
-
Nombre d'erreurs (au niveau mot, typiquement
|
| 68 |
-
``WER × n_words_total``).
|
| 69 |
-
time_per_error_seconds:
|
| 70 |
-
Temps moyen de correction humaine par erreur. Défaut
|
| 71 |
-
5 s (HTR-United). Doit être ≥ 0.
|
| 72 |
-
|
| 73 |
-
Returns
|
| 74 |
-
-------
|
| 75 |
-
dict | None
|
| 76 |
-
``None`` si ``n_pages == 0`` ou ``total_seconds == 0``
|
| 77 |
-
(pas de division par zéro).
|
| 78 |
-
"""
|
| 79 |
-
if n_pages <= 0:
|
| 80 |
-
return None
|
| 81 |
-
if duration_seconds < 0 or n_errors < 0 or time_per_error_seconds < 0:
|
| 82 |
-
raise ValueError(
|
| 83 |
-
"duration_seconds, n_errors et time_per_error_seconds "
|
| 84 |
-
"doivent être ≥ 0",
|
| 85 |
-
)
|
| 86 |
-
correction_seconds = float(n_errors) * float(time_per_error_seconds)
|
| 87 |
-
total_seconds = float(duration_seconds) + correction_seconds
|
| 88 |
-
if total_seconds <= 0:
|
| 89 |
-
# Aucun temps écoulé : impossible de définir un throughput
|
| 90 |
-
return None
|
| 91 |
-
pages_per_hour_raw = (
|
| 92 |
-
n_pages / duration_seconds * 3600.0
|
| 93 |
-
if duration_seconds > 0 else None
|
| 94 |
-
)
|
| 95 |
-
pages_per_hour_effective = n_pages / total_seconds * 3600.0
|
| 96 |
-
drag_ratio = (
|
| 97 |
-
correction_seconds / total_seconds if total_seconds > 0 else 0.0
|
| 98 |
-
)
|
| 99 |
-
return {
|
| 100 |
-
"n_pages": int(n_pages),
|
| 101 |
-
"duration_seconds": float(duration_seconds),
|
| 102 |
-
"n_errors": int(n_errors),
|
| 103 |
-
"time_per_error_seconds": float(time_per_error_seconds),
|
| 104 |
-
"correction_time_seconds": correction_seconds,
|
| 105 |
-
"total_seconds": total_seconds,
|
| 106 |
-
"pages_per_hour_raw": pages_per_hour_raw,
|
| 107 |
-
"pages_per_hour_effective": pages_per_hour_effective,
|
| 108 |
-
"drag_ratio": drag_ratio,
|
| 109 |
-
}
|
| 110 |
-
|
| 111 |
-
|
| 112 |
-
def aggregate_effective_throughput(
|
| 113 |
-
per_engine: Iterable[dict],
|
| 114 |
-
*,
|
| 115 |
-
time_per_error_seconds: float = _DEFAULT_TIME_PER_ERROR_SECONDS,
|
| 116 |
-
) -> Optional[dict]:
|
| 117 |
-
"""Agrège le throughput effectif par moteur.
|
| 118 |
-
|
| 119 |
-
Parameters
|
| 120 |
-
----------
|
| 121 |
-
per_engine:
|
| 122 |
-
Itérable de dicts ``{engine_name, n_pages,
|
| 123 |
-
duration_seconds, n_errors}``.
|
| 124 |
-
|
| 125 |
-
Returns
|
| 126 |
-
-------
|
| 127 |
-
dict | None
|
| 128 |
-
``{
|
| 129 |
-
"engines": [
|
| 130 |
-
{"engine_name", ..., compute_effective_throughput
|
| 131 |
-
fields},
|
| 132 |
-
...
|
| 133 |
-
],
|
| 134 |
-
"time_per_error_seconds": float,
|
| 135 |
-
}`` ou ``None`` si aucun moteur exploitable.
|
| 136 |
-
"""
|
| 137 |
-
rows: list[dict] = []
|
| 138 |
-
for entry in per_engine:
|
| 139 |
-
if not isinstance(entry, dict):
|
| 140 |
-
continue
|
| 141 |
-
name = entry.get("engine_name") or entry.get("engine")
|
| 142 |
-
if not name:
|
| 143 |
-
continue
|
| 144 |
-
result = compute_effective_throughput(
|
| 145 |
-
int(entry.get("n_pages") or 0),
|
| 146 |
-
float(entry.get("duration_seconds") or 0.0),
|
| 147 |
-
int(entry.get("n_errors") or 0),
|
| 148 |
-
time_per_error_seconds=time_per_error_seconds,
|
| 149 |
-
)
|
| 150 |
-
if result is None:
|
| 151 |
-
continue
|
| 152 |
-
result["engine_name"] = str(name)
|
| 153 |
-
rows.append(result)
|
| 154 |
-
if not rows:
|
| 155 |
-
return None
|
| 156 |
-
return {
|
| 157 |
-
"engines": rows,
|
| 158 |
-
"time_per_error_seconds": float(time_per_error_seconds),
|
| 159 |
-
}
|
| 160 |
-
|
| 161 |
-
|
| 162 |
-
__all__ = [
|
| 163 |
-
"compute_effective_throughput",
|
| 164 |
-
"aggregate_effective_throughput",
|
| 165 |
-
]
|
|
|
|
| 1 |
+
"""Re-export — Sprint A14-S10. Le contenu canonique vit dans
|
| 2 |
+
``picarones.evaluation.metrics.throughput``.
|
| 3 |
|
| 4 |
+
L'ancien chemin ``picarones.measurements.throughput`` est conservé pour
|
| 5 |
+
ne casser aucun consommateur. Au S22, ce re-export disparaîtra.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 6 |
"""
|
| 7 |
|
| 8 |
from __future__ import annotations
|
| 9 |
|
| 10 |
+
from picarones.evaluation.metrics.throughput import * # noqa: F401,F403
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
@@ -1,199 +1,10 @@
|
|
| 1 |
-
"""
|
|
|
|
| 2 |
|
| 3 |
-
|
| 4 |
-
|
| 5 |
-
Pourquoi ce module
|
| 6 |
-
------------------
|
| 7 |
-
Le percentile p95 du CER ligne (calculé par ``line_metrics.py``,
|
| 8 |
-
Sprint 10) est un nombre abstrait : *« 5 % de mes lignes ont un
|
| 9 |
-
CER > 0,42 »*. Le chercheur veut **voir** ces lignes : leur
|
| 10 |
-
texte, leur diff, leur document parent, pour comprendre ce qui
|
| 11 |
-
casse.
|
| 12 |
-
|
| 13 |
-
Ce module fournit la requête transversale qui collecte, depuis un
|
| 14 |
-
``BenchmarkResult``, les **N lignes les plus mal transcrites de
|
| 15 |
-
tout le corpus**, classées par CER ligne. Filtrable par moteur
|
| 16 |
-
et par strate.
|
| 17 |
-
|
| 18 |
-
Limite documentée
|
| 19 |
-
-----------------
|
| 20 |
-
``DocumentResult.line_metrics`` ne stocke que les CER par ligne,
|
| 21 |
-
**pas le texte des lignes**. Pour récupérer les textes GT/hyp
|
| 22 |
-
on resplitte ``ground_truth`` et ``hypothesis`` du
|
| 23 |
-
``DocumentResult`` à l'index de la ligne. Cette logique
|
| 24 |
-
**suppose un BenchmarkResult non-compacté** — après ``compact()``
|
| 25 |
-
les textes sont tronqués à 200 caractères et les lignes au-delà
|
| 26 |
-
de cette troncature ne sont plus accessibles. En pratique on
|
| 27 |
-
extrait les worst lines **avant** la sérialisation/compactage.
|
| 28 |
"""
|
| 29 |
|
| 30 |
from __future__ import annotations
|
| 31 |
|
| 32 |
-
import
|
| 33 |
-
from dataclasses import dataclass
|
| 34 |
-
from typing import Optional
|
| 35 |
-
|
| 36 |
-
logger = logging.getLogger(__name__)
|
| 37 |
-
|
| 38 |
-
|
| 39 |
-
@dataclass
|
| 40 |
-
class WorstLineEntry:
|
| 41 |
-
"""Une ligne du corpus identifiée comme mal transcrite.
|
| 42 |
-
|
| 43 |
-
Champs
|
| 44 |
-
------
|
| 45 |
-
rank:
|
| 46 |
-
Position dans le classement (1-based, 1 = pire CER).
|
| 47 |
-
cer:
|
| 48 |
-
CER de la ligne ∈ [0, 1].
|
| 49 |
-
engine_name:
|
| 50 |
-
Nom du moteur ayant produit cette hypothèse.
|
| 51 |
-
doc_id:
|
| 52 |
-
Identifiant du document parent.
|
| 53 |
-
line_index:
|
| 54 |
-
Index 0-based de la ligne dans le document GT.
|
| 55 |
-
gt_line:
|
| 56 |
-
Texte de la ligne dans la GT.
|
| 57 |
-
hyp_line:
|
| 58 |
-
Texte correspondant dans l'hypothèse (peut être ``""``
|
| 59 |
-
si l'OCR a sauté la ligne).
|
| 60 |
-
script_type:
|
| 61 |
-
Strate du document si disponible (``script_type``
|
| 62 |
-
capturé par le runner pour la stratification A.III).
|
| 63 |
-
"""
|
| 64 |
-
|
| 65 |
-
rank: int
|
| 66 |
-
cer: float
|
| 67 |
-
engine_name: str
|
| 68 |
-
doc_id: str
|
| 69 |
-
line_index: int
|
| 70 |
-
gt_line: str
|
| 71 |
-
hyp_line: str
|
| 72 |
-
script_type: Optional[str] = None
|
| 73 |
-
|
| 74 |
-
|
| 75 |
-
def _split_lines(text: Optional[str]) -> list[str]:
|
| 76 |
-
"""Splitte un texte en lignes (cohérent avec ``line_metrics``).
|
| 77 |
-
|
| 78 |
-
Supporte les fins de ligne ``\\n``, ``\\r\\n``, ``\\r``. Les
|
| 79 |
-
lignes vides sont préservées. Retourne une liste vide si le
|
| 80 |
-
texte est None ou vide.
|
| 81 |
-
"""
|
| 82 |
-
if not text:
|
| 83 |
-
return []
|
| 84 |
-
# ``splitlines`` gère \r\n et \r correctement
|
| 85 |
-
return text.splitlines()
|
| 86 |
-
|
| 87 |
-
|
| 88 |
-
def _line_at(text: Optional[str], index: int) -> str:
|
| 89 |
-
"""Retourne la ligne à l'index demandé, ou ``""`` si l'index
|
| 90 |
-
est hors borne (cas où l'OCR a moins de lignes que la GT)."""
|
| 91 |
-
lines = _split_lines(text)
|
| 92 |
-
if 0 <= index < len(lines):
|
| 93 |
-
return lines[index]
|
| 94 |
-
return ""
|
| 95 |
-
|
| 96 |
-
|
| 97 |
-
def extract_worst_lines(
|
| 98 |
-
benchmark,
|
| 99 |
-
*,
|
| 100 |
-
top_n: int = 20,
|
| 101 |
-
engine_filter: Optional[str] = None,
|
| 102 |
-
script_type_filter: Optional[str] = None,
|
| 103 |
-
) -> list[WorstLineEntry]:
|
| 104 |
-
"""Extrait les ``top_n`` lignes les plus mal transcrites du
|
| 105 |
-
corpus, transversalement à tous les moteurs et documents.
|
| 106 |
-
|
| 107 |
-
Parameters
|
| 108 |
-
----------
|
| 109 |
-
benchmark:
|
| 110 |
-
``BenchmarkResult`` non-compacté (cf. limite ci-dessus).
|
| 111 |
-
L'objet doit exposer ``engine_reports`` (liste de
|
| 112 |
-
``EngineReport``) et optionnellement ``doc_strata``
|
| 113 |
-
(map ``{doc_id: script_type}``, Sprint 45).
|
| 114 |
-
top_n:
|
| 115 |
-
Nombre de lignes à retourner. Défaut : 20.
|
| 116 |
-
engine_filter:
|
| 117 |
-
Si fourni, n'inclut que les lignes produites par ce moteur
|
| 118 |
-
(match exact sur ``engine_name``).
|
| 119 |
-
script_type_filter:
|
| 120 |
-
Si fourni, n'inclut que les lignes des documents de cette
|
| 121 |
-
strate (nécessite ``benchmark.doc_strata``).
|
| 122 |
-
|
| 123 |
-
Returns
|
| 124 |
-
-------
|
| 125 |
-
list[WorstLineEntry]
|
| 126 |
-
Liste triée par CER décroissant (pire en premier),
|
| 127 |
-
rang 1-based attribué après tri. Vide si aucune ligne
|
| 128 |
-
exploitable.
|
| 129 |
-
"""
|
| 130 |
-
if top_n <= 0:
|
| 131 |
-
return []
|
| 132 |
-
|
| 133 |
-
doc_strata = getattr(benchmark, "doc_strata", None) or {}
|
| 134 |
-
candidates: list[tuple[float, str, str, int, str, str, Optional[str]]] = []
|
| 135 |
-
|
| 136 |
-
for engine_report in getattr(benchmark, "engine_reports", []):
|
| 137 |
-
engine_name = engine_report.engine_name
|
| 138 |
-
if engine_filter is not None and engine_name != engine_filter:
|
| 139 |
-
continue
|
| 140 |
-
for dr in engine_report.document_results:
|
| 141 |
-
line_metrics = getattr(dr, "line_metrics", None)
|
| 142 |
-
if not line_metrics:
|
| 143 |
-
continue
|
| 144 |
-
cer_per_line = line_metrics.get("cer_per_line") if isinstance(
|
| 145 |
-
line_metrics, dict,
|
| 146 |
-
) else getattr(line_metrics, "cer_per_line", None)
|
| 147 |
-
if not cer_per_line:
|
| 148 |
-
continue
|
| 149 |
-
doc_id = dr.doc_id
|
| 150 |
-
doc_strata_value = doc_strata.get(doc_id)
|
| 151 |
-
if (
|
| 152 |
-
script_type_filter is not None
|
| 153 |
-
and doc_strata_value != script_type_filter
|
| 154 |
-
):
|
| 155 |
-
continue
|
| 156 |
-
for idx, cer in enumerate(cer_per_line):
|
| 157 |
-
if cer <= 0.0:
|
| 158 |
-
continue
|
| 159 |
-
gt_line = _line_at(dr.ground_truth, idx)
|
| 160 |
-
hyp_line = _line_at(dr.hypothesis, idx)
|
| 161 |
-
if not gt_line and not hyp_line:
|
| 162 |
-
continue
|
| 163 |
-
candidates.append((
|
| 164 |
-
float(cer), engine_name, doc_id, idx,
|
| 165 |
-
gt_line, hyp_line, doc_strata_value,
|
| 166 |
-
))
|
| 167 |
-
|
| 168 |
-
if not candidates:
|
| 169 |
-
return []
|
| 170 |
-
|
| 171 |
-
# Tri par CER décroissant ; en cas d'égalité, ordre stable
|
| 172 |
-
# (engine, doc_id, line_index) pour reproductibilité.
|
| 173 |
-
candidates.sort(
|
| 174 |
-
key=lambda c: (-c[0], c[1], c[2], c[3]),
|
| 175 |
-
)
|
| 176 |
-
selected = candidates[:top_n]
|
| 177 |
-
|
| 178 |
-
return [
|
| 179 |
-
WorstLineEntry(
|
| 180 |
-
rank=i + 1,
|
| 181 |
-
cer=cer,
|
| 182 |
-
engine_name=engine,
|
| 183 |
-
doc_id=doc_id,
|
| 184 |
-
line_index=line_index,
|
| 185 |
-
gt_line=gt_line,
|
| 186 |
-
hyp_line=hyp_line,
|
| 187 |
-
script_type=script_type,
|
| 188 |
-
)
|
| 189 |
-
for i, (
|
| 190 |
-
cer, engine, doc_id, line_index,
|
| 191 |
-
gt_line, hyp_line, script_type,
|
| 192 |
-
) in enumerate(selected)
|
| 193 |
-
]
|
| 194 |
-
|
| 195 |
-
|
| 196 |
-
__all__ = [
|
| 197 |
-
"WorstLineEntry",
|
| 198 |
-
"extract_worst_lines",
|
| 199 |
-
]
|
|
|
|
| 1 |
+
"""Re-export — Sprint A14-S10. Le contenu canonique vit dans
|
| 2 |
+
``picarones.evaluation.metrics.worst_lines``.
|
| 3 |
|
| 4 |
+
L'ancien chemin ``picarones.measurements.worst_lines`` est conservé pour
|
| 5 |
+
ne casser aucun consommateur. Au S22, ce re-export disparaîtra.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 6 |
"""
|
| 7 |
|
| 8 |
from __future__ import annotations
|
| 9 |
|
| 10 |
+
from picarones.evaluation.metrics.worst_lines import * # noqa: F401,F403
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
@@ -61,7 +61,12 @@ FILE_BUDGETS: dict[str, int] = {
|
|
| 61 |
"picarones/core/pipeline.py": 675, # actuel 571
|
| 62 |
"picarones/extras/importers/iiif.py": 675, # actuel 567
|
| 63 |
"picarones/extras/importers/gallica.py": 675, # actuel 563
|
| 64 |
-
"picarones/measurements/levers.py": 675, # actuel 561
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 65 |
"picarones/extras/importers/escriptorium.py": 650, # actuel 553
|
| 66 |
# Sprint A14-S1 — A.I.0 P0 : ajout de validated_path,
|
| 67 |
# validated_prompt_filename, safe_report_name et compute_workspace_roots.
|
|
|
|
| 61 |
"picarones/core/pipeline.py": 675, # actuel 571
|
| 62 |
"picarones/extras/importers/iiif.py": 675, # actuel 567
|
| 63 |
"picarones/extras/importers/gallica.py": 675, # actuel 563
|
| 64 |
+
"picarones/measurements/levers.py": 675, # actuel 561 (re-export S10)
|
| 65 |
+
# Sprint A14-S10 — déplacés depuis measurements/, l'ancien
|
| 66 |
+
# emplacement est désormais un re-export. Le contenu canonique
|
| 67 |
+
# vit dans evaluation/metrics/.
|
| 68 |
+
"picarones/evaluation/metrics/levers.py": 675, # actuel 561
|
| 69 |
+
"picarones/evaluation/metrics/inter_engine.py": 575, # actuel 484
|
| 70 |
"picarones/extras/importers/escriptorium.py": 650, # actuel 553
|
| 71 |
# Sprint A14-S1 — A.I.0 P0 : ajout de validated_path,
|
| 72 |
# validated_prompt_filename, safe_report_name et compute_workspace_roots.
|
|
@@ -86,6 +86,9 @@ EXTERNAL_ALLOWED: dict[str, frozenset[str]] = {
|
|
| 86 |
"evaluation": frozenset({
|
| 87 |
"pydantic", "typing_extensions", "annotated_types",
|
| 88 |
"numpy", "scipy", "jiwer", "rapidfuzz",
|
|
|
|
|
|
|
|
|
|
| 89 |
}),
|
| 90 |
"pipeline": frozenset({
|
| 91 |
"pydantic", "typing_extensions", "annotated_types",
|
|
|
|
| 86 |
"evaluation": frozenset({
|
| 87 |
"pydantic", "typing_extensions", "annotated_types",
|
| 88 |
"numpy", "scipy", "jiwer", "rapidfuzz",
|
| 89 |
+
# S10 — fichiers de calcul migrés depuis measurements/ :
|
| 90 |
+
"PIL", # image_quality utilise Pillow pour analyser les images
|
| 91 |
+
"yaml", # pricing charge sa table de coûts depuis YAML
|
| 92 |
}),
|
| 93 |
"pipeline": frozenset({
|
| 94 |
"pydantic", "typing_extensions", "annotated_types",
|