feat(6-volet1): pipelines/over_normalization.py → evaluation/metrics/
Phase 6, part 1 (volet 1): the detection of LLM over-normalization
(class 10 of the error taxonomy) is relocated from ``pipelines/``
to the canonical ``evaluation/metrics/`` layer.
The module is pure Python (only ``dataclass`` and ``Optional``) with
no external dependency, so it is fully compatible with the import
whitelist of ``evaluation/``.
Changes
-------
- Created ``picarones/evaluation/metrics/over_normalization.py``
(121 LOC, identical copy of the legacy module plus a Phase 6 header).
- ``picarones/pipelines/over_normalization.py`` reduced to a
30-line shim that emits a ``DeprecationWarning`` on import and
explicitly re-exports ``OverNormalizationResult``,
``detect_over_normalization``, ``aggregate_over_normalization``.
- ``picarones/fixtures.py`` (1 import): caller migrated to the
canonical module.
- ``picarones/measurements/runner/document.py`` (1 lazy import):
caller migrated.
- ``picarones/evaluation/metrics/taxonomy.py``: docstring reference
updated (``pipelines/`` → ``evaluation/metrics/``).
- ``tests/engines/test_sprint3_llm_pipelines.py``: 5 imports
migrated to the canonical module (the ``OCRLLMPipeline`` tests in the
same file are unchanged; they belong to part 2).
Architecture
------------
- ``BOOTSTRAP_BASELINE`` in
``tests/architecture/test_legacy_canonical_parity.py`` lowered
from 104 to 101 (3 public symbols leave the debt: the
``OverNormalizationResult`` class and the 2 module-level functions).
Part 2 deferred
---------------
Migrating ``pipelines/base.OCRLLMPipeline`` to composed YAML
``PipelineSpec`` definitions remains to be done (3 modes, cross-step
``inputs_from``, refactor of ``web/benchmark_utils.py`` and
``measurements/runner/orchestration.py``). The master plan estimates
3-5 days of effort; it will land in a separate commit under ``6-volet2``.
Summary
-------
- ``pytest tests/``: 4715 passed, 0 failed.
- ``ruff check``: clean.
- 1 canonical file created, 5 callers migrated, 1 shim kept.
https://claude.ai/code/session_011XQZNitg1rCgia8ZD1a2hP
- picarones/evaluation/metrics/over_normalization.py +128 -0
- picarones/evaluation/metrics/taxonomy.py +2 -2
- picarones/fixtures.py +1 -1
- picarones/measurements/runner/document.py +1 -1
- picarones/pipelines/over_normalization.py +25 -116
- tests/architecture/test_legacy_canonical_parity.py +1 -1
- tests/engines/test_sprint3_llm_pipelines.py +7 -7
@@ -0,0 +1,128 @@
+"""LLM over-normalization detection: class 10 of the error taxonomy.
+
+Phase 6 (May 2026): module relocated from
+``picarones/pipelines/over_normalization.py`` to
+``picarones/evaluation/metrics/over_normalization.py``.
+
+The ``pipelines/over_normalization.py`` shim stays importable while
+external callers migrate; it will be removed in 2.0.
+
+Over-normalization is the case where the LLM wrongly "corrects" passages
+that the OCR had already transcribed correctly, in particular:
+- modernization of legitimate medieval spellings (nostre → notre, faict → fait)
+- normalization of authentic historical spelling variants
+- modification of proper nouns or rare terms with no initial OCR error
+
+Measure:
+    score = number of words (OCR correct → LLM modified) / number of correct OCR words
+
+A high score indicates that the prompt should be refined to better preserve
+the original spelling.
+"""
+
+from __future__ import annotations
+
+from dataclasses import dataclass, field
+from typing import Optional
+
+
+@dataclass
+class OverNormalizationResult:
+    """Over-normalization detection result for one document."""
+
+    total_correct_ocr_words: int
+    over_normalized_count: int
+    over_normalized_passages: list[dict] = field(default_factory=list)
+    # Each entry: {"gt": str, "ocr": str, "llm": str}
+
+    @property
+    def score(self) -> float:
+        """Over-normalization score between 0 (no degradation) and 1 (everything degraded)."""
+        if self.total_correct_ocr_words == 0:
+            return 0.0
+        return round(self.over_normalized_count / self.total_correct_ocr_words, 4)
+
+    def as_dict(self) -> dict:
+        return {
+            "score": self.score,
+            "total_correct_ocr_words": self.total_correct_ocr_words,
+            "over_normalized_count": self.over_normalized_count,
+            "over_normalized_passages": self.over_normalized_passages[:20],
+        }
+
+
+def detect_over_normalization(
+    ground_truth: str,
+    ocr_text: str,
+    llm_text: str,
+    *,
+    max_examples: int = 20,
+) -> OverNormalizationResult:
+    """Detect LLM over-normalization at the word level.
+
+    Algorithm (simple positional alignment, suited to short texts):
+    For each position i below min(len(GT), len(OCR), len(LLM)):
+      - If ocr[i] == gt[i] → the word was correct in the OCR output
+      - If, in addition, llm[i] != gt[i] → the LLM degraded this correct word → over-normalization
+
+    Parameters
+    ----------
+    ground_truth:
+        Reference transcription.
+    ocr_text:
+        Raw output of the OCR engine (before LLM correction).
+    llm_text:
+        Output after correction by the LLM.
+    max_examples:
+        Maximum number of over-normalization examples to keep.
+
+    Returns
+    -------
+    OverNormalizationResult
+    """
+    gt_words = ground_truth.split()
+    ocr_words = ocr_text.split()
+    llm_words = llm_text.split()
+
+    n = min(len(gt_words), len(ocr_words), len(llm_words))
+
+    correct_ocr = 0
+    over_norm = 0
+    passages: list[dict] = []
+
+    for i in range(n):
+        gt_w = gt_words[i]
+        ocr_w = ocr_words[i]
+        llm_w = llm_words[i]
+
+        if ocr_w == gt_w:
+            correct_ocr += 1
+            if llm_w != gt_w and len(passages) < max_examples:
+                over_norm += 1
+                passages.append({"gt": gt_w, "ocr": ocr_w, "llm": llm_w})
+            elif llm_w != gt_w:
+                over_norm += 1
+
+    return OverNormalizationResult(
+        total_correct_ocr_words=correct_ocr,
+        over_normalized_count=over_norm,
+        over_normalized_passages=passages,
+    )
+
+
+def aggregate_over_normalization(results: list[Optional[OverNormalizationResult]]) -> dict:
+    """Aggregate over-normalization results over a set of documents."""
+    valid = [r for r in results if r is not None]
+    if not valid:
+        return {"score": None, "total_correct_ocr_words": 0, "over_normalized_count": 0}
+
+    total_correct = sum(r.total_correct_ocr_words for r in valid)
+    total_over = sum(r.over_normalized_count for r in valid)
+    score = round(total_over / total_correct, 4) if total_correct > 0 else 0.0
+
+    return {
+        "score": score,
+        "total_correct_ocr_words": total_correct,
+        "over_normalized_count": total_over,
+        "document_count": len(valid),
+    }
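The score formula in the module docstring can be checked by hand on the strings used in the test suite. The block below is a minimal standalone sketch, not the committed implementation: `over_normalization_counts` is a hypothetical helper that reproduces only the counting logic (no example passages, no `max_examples`). The third call uses a made-up sentence to illustrate why the docstring limits positional alignment to short texts.

```python
def over_normalization_counts(ground_truth: str, ocr_text: str, llm_text: str) -> tuple:
    """Return (correct_ocr_words, over_normalized_words, score).

    Hypothetical helper for illustration; mirrors the positional
    alignment described in the docstring above.
    """
    gt_words = ground_truth.split()
    ocr_words = ocr_text.split()
    llm_words = llm_text.split()

    correct = 0
    over = 0
    # zip() stops at the shortest list, like range(min(len(...), ...)).
    for gt_w, ocr_w, llm_w in zip(gt_words, ocr_words, llm_words):
        if ocr_w == gt_w:          # the OCR had this word right
            correct += 1
            if llm_w != gt_w:      # ...and the LLM changed it anyway
                over += 1

    score = round(over / correct, 4) if correct else 0.0
    return correct, over, score


# OCR correct everywhere, LLM modernizes one word: 1 / 4.
print(over_normalization_counts(
    "nostre seigneur le roy", "nostre seigneur le roy", "notre seigneur le roy"))
# (4, 1, 0.25)

# OCR wrong on the first word, LLM repairs it: nothing is over-normalized.
print(over_normalization_counts(
    "nostre seigneur le roy", "noltre seigneur le roy", "nostre seigneur le roy"))
# (3, 0, 0.0)

# One word inserted by the LLM shifts every later position.
print(over_normalization_counts(
    "le roy de france", "le roy de france", "le bon roy de france"))
# (4, 3, 0.75)
```

In the last call a single inserted word is counted as three over-normalized words, because every later position is compared against its shifted neighbor. A real edit-distance alignment would avoid this, at the cost of a more complex module.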
@@ -14,9 +14,9 @@ the Picarones taxonomy:
 | 7 | segmentation_error| Merging or splitting of tokens (words/lines) |
 | 8 | oov_character | Character outside the engine's vocabulary |
 | 9 | lacuna | Text present in the GT but missing from the OCR |
-| 10 | over_normalization| LLM over-normalization (see
+| 10 | over_normalization| LLM over-normalization (see evaluation/metrics/) |
 
-Note: class 10 is computed by picarones/
+Note: class 10 is computed by picarones/evaluation/metrics/over_normalization.py.
 """
 
 from __future__ import annotations
@@ -15,7 +15,7 @@ import zlib
 
 from picarones.evaluation.metric_result import MetricsResult
 from picarones.evaluation.benchmark_result import BenchmarkResult, DocumentResult, EngineReport
-from picarones.
+from picarones.evaluation.metrics.over_normalization import detect_over_normalization
 # Sprint 5: advanced metrics
 from picarones.evaluation.metrics.confusion import build_confusion_matrix
 from picarones.evaluation.metrics.char_scores import compute_ligature_score, compute_diacritic_score
@@ -101,7 +101,7 @@ def _compute_document_result(
     }
     if ocr_intermediate is not None and ocr_result.success:
         try:
-            from picarones.
+            from picarones.evaluation.metrics.over_normalization import detect_over_normalization
             over_norm = detect_over_normalization(
                 ground_truth=ground_truth,
                 ocr_text=ocr_intermediate,
@@ -1,121 +1,30 @@
-"""
+"""Compatibility shim: LLM over-normalization detection.
 
-
-
-
-- normalization of authentic historical spelling variants
-- modification of proper nouns or rare terms with no initial OCR error
-
-Measure:
-    score = number of words (OCR correct → LLM modified) / number of correct OCR words
-
-A high score indicates that the prompt should be refined to better preserve
-the original spelling.
+Phase 6 (May 2026): the canonical implementation now lives in
+``picarones.evaluation.metrics.over_normalization``. This shim re-exports
+the public API with a ``DeprecationWarning`` and will be removed in 2.0.
 """
 
 from __future__ import annotations
 
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-        return {
-            "score": self.score,
-            "total_correct_ocr_words": self.total_correct_ocr_words,
-            "over_normalized_count": self.over_normalized_count,
-            "over_normalized_passages": self.over_normalized_passages[:20],
-        }
-
-
-def detect_over_normalization(
-    ground_truth: str,
-    ocr_text: str,
-    llm_text: str,
-    *,
-    max_examples: int = 20,
-) -> OverNormalizationResult:
-    """Detect LLM over-normalization at the word level.
-
-    Algorithm (simple positional alignment, suited to short texts):
-    For each position i below min(len(GT), len(OCR), len(LLM)):
-      - If ocr[i] == gt[i] → the word was correct in the OCR output
-      - If, in addition, llm[i] != gt[i] → the LLM degraded this correct word → over-normalization
-
-    Parameters
-    ----------
-    ground_truth:
-        Reference transcription.
-    ocr_text:
-        Raw output of the OCR engine (before LLM correction).
-    llm_text:
-        Output after correction by the LLM.
-    max_examples:
-        Maximum number of over-normalization examples to keep.
-
-    Returns
-    -------
-    OverNormalizationResult
-    """
-    gt_words = ground_truth.split()
-    ocr_words = ocr_text.split()
-    llm_words = llm_text.split()
-
-    n = min(len(gt_words), len(ocr_words), len(llm_words))
-
-    correct_ocr = 0
-    over_norm = 0
-    passages: list[dict] = []
-
-    for i in range(n):
-        gt_w = gt_words[i]
-        ocr_w = ocr_words[i]
-        llm_w = llm_words[i]
-
-        if ocr_w == gt_w:
-            correct_ocr += 1
-            if llm_w != gt_w and len(passages) < max_examples:
-                over_norm += 1
-                passages.append({"gt": gt_w, "ocr": ocr_w, "llm": llm_w})
-            elif llm_w != gt_w:
-                over_norm += 1
-
-    return OverNormalizationResult(
-        total_correct_ocr_words=correct_ocr,
-        over_normalized_count=over_norm,
-        over_normalized_passages=passages,
-    )
-
-
-def aggregate_over_normalization(results: list[Optional[OverNormalizationResult]]) -> dict:
-    """Aggregate over-normalization results over a set of documents."""
-    valid = [r for r in results if r is not None]
-    if not valid:
-        return {"score": None, "total_correct_ocr_words": 0, "over_normalized_count": 0}
-
-    total_correct = sum(r.total_correct_ocr_words for r in valid)
-    total_over = sum(r.over_normalized_count for r in valid)
-    score = round(total_over / total_correct, 4) if total_correct > 0 else 0.0
-
-    return {
-        "score": score,
-        "total_correct_ocr_words": total_correct,
-        "over_normalized_count": total_over,
-        "document_count": len(valid),
-    }
+import warnings
+
+warnings.warn(
+    "picarones.pipelines.over_normalization est obsolète et sera supprimé en 2.0. "
+    "Utiliser picarones.evaluation.metrics.over_normalization à la place.",
+    DeprecationWarning,
+    stacklevel=2,
+)
+
+from picarones.evaluation.metrics.over_normalization import *  # noqa: F401, F403, E402
+from picarones.evaluation.metrics.over_normalization import (  # noqa: E402
+    OverNormalizationResult,
+    aggregate_over_normalization,
+    detect_over_normalization,
+)
+
+__all__ = [
+    "OverNormalizationResult",
+    "aggregate_over_normalization",
+    "detect_over_normalization",
+]
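The shim pattern above (a module-level `warnings.warn` followed by re-exports) can be tried without installing the package. The sketch below is an illustration under stated assumptions: `demo_canonical_metrics` and `demo_legacy_metrics` are hypothetical throwaway modules written to a temporary directory, standing in for the canonical module and its shim. It shows that the warning fires on the first import only, and that the legacy name resolves to the same function object.

```python
import importlib
import sys
import tempfile
import warnings
from pathlib import Path

# Hypothetical module sources, used only for this demonstration.
CANONICAL_SRC = "def detect(text):\n    return text.split()\n"
SHIM_SRC = '''\
import warnings

warnings.warn(
    "demo_legacy_metrics is deprecated; import demo_canonical_metrics instead.",
    DeprecationWarning,
    stacklevel=2,
)

from demo_canonical_metrics import detect  # noqa: E402

__all__ = ["detect"]
'''

# Forget earlier imports so the demo behaves the same on every run.
for name in ("demo_canonical_metrics", "demo_legacy_metrics"):
    sys.modules.pop(name, None)

with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "demo_canonical_metrics.py").write_text(CANONICAL_SRC, encoding="utf-8")
    Path(tmp, "demo_legacy_metrics.py").write_text(SHIM_SRC, encoding="utf-8")
    sys.path.insert(0, tmp)
    importlib.invalidate_caches()
    try:
        with warnings.catch_warnings(record=True) as caught:
            warnings.simplefilter("always")
            legacy = importlib.import_module("demo_legacy_metrics")   # runs the shim body
            importlib.import_module("demo_legacy_metrics")            # cached in sys.modules
            canonical = importlib.import_module("demo_canonical_metrics")
    finally:
        sys.path.remove(tmp)

deprecations = [
    w for w in caught
    if issubclass(w.category, DeprecationWarning)
    and "demo_legacy_metrics" in str(w.message)
]
print(len(deprecations))                   # 1
print(legacy.detect is canonical.detect)   # True
```

Because the module body runs once per process, a caller that imports the legacy path in several places sees a single warning. Python's default filters also hide `DeprecationWarning` outside `__main__`, so test suites usually need `-W error::DeprecationWarning` or an equivalent filter to surface remaining legacy imports.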
@@ -73,7 +73,7 @@ LEGACY_PACKAGES: tuple[str, ...] = (
 #: :data:`LEGACY_PARITY` without failing the test. To be lowered
 #: at each migration session: the target is 0 once the removal
 #: is complete.
-BOOTSTRAP_BASELINE =
+BOOTSTRAP_BASELINE = 101
 
 
 # ──────────────────────────────────────────────────────────────────
@@ -22,7 +22,7 @@ import pytest
 class TestOverNormalization:
 
     def test_no_over_normalization(self):
-        from picarones.
+        from picarones.evaluation.metrics.over_normalization import detect_over_normalization
         gt = "nostre seigneur le roy"
         ocr = "noltre seigneur le roy" # OCR error on 'nostre'
         llm = "nostre seigneur le roy" # LLM fixes it → correct
@@ -31,7 +31,7 @@ class TestOverNormalization:
         assert result.over_normalized_count == 0
 
     def test_perfect_llm_no_over_norm(self):
-        from picarones.
+        from picarones.evaluation.metrics.over_normalization import detect_over_normalization
         gt = "nostre seigneur le roy"
         ocr = "nostre seigneur le roy" # OCR correct
         llm = "nostre seigneur le roy" # LLM preserves it
@@ -40,7 +40,7 @@ class TestOverNormalization:
         assert result.total_correct_ocr_words == 4
 
     def test_over_normalization_detected(self):
-        from picarones.
+        from picarones.evaluation.metrics.over_normalization import detect_over_normalization
         gt = "nostre seigneur le roy"
         ocr = "nostre seigneur le roy" # OCR correct
         llm = "notre seigneur le roy" # LLM changes 'nostre' → 'notre': over-normalization
@@ -54,7 +54,7 @@ class TestOverNormalization:
         assert passage["llm"] == "notre"
 
     def test_over_normalization_score_formula(self):
-        from picarones.
+        from picarones.evaluation.metrics.over_normalization import detect_over_normalization
         # 4 words, OCR correct on all, LLM changes 2 → score = 2/4 = 0.5
         gt = "maistre jehan nostre dame"
         ocr = "maistre jehan nostre dame"
@@ -65,7 +65,7 @@ class TestOverNormalization:
         assert result.score == pytest.approx(0.5)
 
     def test_as_dict_keys(self):
-        from picarones.
+        from picarones.evaluation.metrics.over_normalization import detect_over_normalization
         result = detect_over_normalization("foo bar", "foo baz", "foo baz")
         d = result.as_dict()
         assert "score" in d
@@ -74,12 +74,12 @@ class TestOverNormalization:
         assert "over_normalized_passages" in d
 
     def test_empty_texts(self):
-        from picarones.
+        from picarones.evaluation.metrics.over_normalization import detect_over_normalization
         result = detect_over_normalization("", "", "")
         assert result.score == 0.0
 
     def test_aggregate_over_normalization(self):
-        from picarones.
+        from picarones.evaluation.metrics.over_normalization import (
             OverNormalizationResult,
             aggregate_over_normalization,
         )
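The aggregation test is cut off by the hunk boundary, so the arithmetic it exercises is sketched here instead. This is a standalone illustration with made-up counts, not the project's test: `pooled_score` is a hypothetical helper that applies the same rule as `aggregate_over_normalization` (sum the counts over documents, then divide), and `mean_of_scores` is shown only for contrast.

```python
from typing import Optional

# Each entry is (total_correct_ocr_words, over_normalized_count),
# or None for a document where no result could be computed.
Counts = Optional[tuple]


def pooled_score(per_document: list) -> Optional[float]:
    """Sum the counts first, then divide (what the aggregate function does)."""
    valid = [d for d in per_document if d is not None]
    if not valid:
        return None
    total_correct = sum(correct for correct, _ in valid)
    total_over = sum(over for _, over in valid)
    return round(total_over / total_correct, 4) if total_correct > 0 else 0.0


def mean_of_scores(per_document: list) -> Optional[float]:
    """Average the per-document ratios (shown for contrast only)."""
    ratios = [over / correct for d in per_document if d is not None
              for correct, over in [d] if correct > 0]
    return sum(ratios) / len(ratios) if ratios else None


# Made-up counts: a 4-word snippet, a skipped document, a 96-word page.
docs = [(4, 1), None, (96, 3)]
print(pooled_score(docs))      # 0.04
print(mean_of_scores(docs))    # 0.140625
```

Pooling weights each document by its number of correct OCR words, so a short snippet with one altered word does not dominate the corpus-level score the way it does in a plain average of ratios.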