Ma-Ri-Ba-Ku / Picarones

Claude committed on May 9

Commit 74646e0 (unverified) · 1 parent: a705e16

feat(sprint-D.2.c-f): NER, over-normalization, profile validation

Sprint D.2.c-f of the v2.0 plan: wires up the last legacy parameters
that ``run_benchmark_via_service`` previously ignored.

Audit
-----
| Sub-phase | Feature | Before | After |
|-----------|---------|--------|-------|
| D.2.c | ``output_json`` | ✅ already active (D.1.d) | unchanged |
| D.2.d | ``over_normalization`` | ❌ not computed | ✅ wired |
| D.2.e | ``entity_extractor`` (NER) | ❌ ignored | ✅ wired |
| D.2.f | ``profile`` validation | ❌ ignored | ✅ wired |

D.2.d: over_normalization (OCR+LLM pipelines)
---------------------------------------------
For composed pipelines (upstream OCR + correcting LLM),
``DocumentResult.pipeline_metadata`` now carries an
``over_normalization`` key produced by
``picarones.evaluation.metrics.over_normalization.detect_over_normalization``.
The wiring lives in ``_build_pipeline_metadata``, which now receives
``ground_truth`` and ``hypothesis`` from the D.1.c converter.

Cases not covered (no ``over_normalization`` emitted):
- Standalone OCR engines (no ``is_pipeline``).
- Zero-shot pipelines (direct VLM, no ``ocr_intermediate``).

Exact functional equivalent of
``picarones.measurements.runner.document._compute_doc_result``
lines 102-112 (legacy code removed in D.6.b).
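
The gating rule above (no metadata for standalone OCR, no ``over_normalization`` for zero-shot) can be sketched in isolation. This is an illustration only, not the adapter's code: ``toy_detector`` is a hypothetical stand-in for ``detect_over_normalization``, and the ``regressed_tokens`` field is invented for the example.

```python
from typing import Callable, Optional


def toy_detector(ground_truth: str, ocr_text: str, llm_text: str) -> dict:
    """Hypothetical stand-in: counts tokens the OCR got right but the LLM changed."""
    triples = zip(ground_truth.split(), ocr_text.split(), llm_text.split())
    regressed = sum(1 for gt, ocr, llm in triples if ocr == gt and llm != gt)
    return {"regressed_tokens": regressed}


def build_pipeline_metadata_sketch(
    *,
    is_pipeline: bool,
    ocr_intermediate: Optional[str],
    ground_truth: str = "",
    hypothesis: str = "",
    detector: Callable[[str, str, str], dict] = toy_detector,
) -> dict:
    # Standalone OCR engine: no pipeline metadata at all.
    if not is_pipeline:
        return {}
    metadata: dict = {"is_pipeline": True}
    # Zero-shot pipelines have no upstream OCR text, so there is nothing to compare.
    if ocr_intermediate is not None:
        metadata["ocr_intermediate"] = ocr_intermediate
        metadata["over_normalization"] = detector(
            ground_truth, ocr_intermediate, hypothesis
        )
    return metadata


text_only = build_pipeline_metadata_sketch(
    is_pipeline=True,
    ocr_intermediate="maistre Jehan",
    ground_truth="maistre Jehan",
    hypothesis="maître Jean",
)
zero_shot = build_pipeline_metadata_sketch(is_pipeline=True, ocr_intermediate=None)
ocr_only = build_pipeline_metadata_sketch(is_pipeline=False, ocr_intermediate=None)
```

In the text-only case the LLM modernized two tokens that the OCR had transcribed correctly, which is the kind of signal the key is meant to expose.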

D.2.e: NER attach via entity_extractor
--------------------------------------
When ``entity_extractor: Callable[[str], list[dict]]`` is provided,
the service runs a post-bench pass. For each ``DocumentResult`` of an
engine that is not in error:

- if the document's GT has an ``ENTITIES`` level (Sprint 32
  multi-level GT),
- it runs ``entity_extractor(dr.hypothesis)``, then
  ``compute_ner_metrics`` against the GT entities,
- and attaches the result as ``DocumentResult.ner_metrics``.

Once all documents of an engine are processed,
``_aggregate_ner_metrics`` recomputes *micro* precision/recall/F1,
the per_category breakdown, and the hallucinated/missed counters from
the global sums. This is functionally equivalent to
``measurements.runner.ner_attach._aggregate_ner`` (legacy code
removed in D.6.b). The result is attached to
``EngineReport.aggregated_ner``.
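
As a rough illustration of what *micro* aggregation means here, the sketch below sums the per-document counts before computing the ratios. The ``true_positives`` / ``false_positives`` / ``false_negatives`` keys follow the ones read in the diff; the function name ``micro_prf`` and the sample numbers are made up for the example.

```python
from typing import Optional


def micro_prf(doc_metrics: list) -> Optional[dict]:
    """Micro-average: sum the counts first, then compute the ratios once."""
    if not doc_metrics:
        return None
    tp = sum(int(m.get("true_positives", 0)) for m in doc_metrics)
    fp = sum(int(m.get("false_positives", 0)) for m in doc_metrics)
    fn = sum(int(m.get("false_negatives", 0)) for m in doc_metrics)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "support": tp + fn}


docs = [
    {"true_positives": 8, "false_positives": 2, "false_negatives": 0},
    {"true_positives": 0, "false_positives": 1, "false_negatives": 1},
]
micro = micro_prf(docs)
# A macro average would instead average the per-document precisions (0.8 and 0.0).
macro_precision = (8 / 10 + 0 / 1) / 2
```

With these numbers the micro precision is 8/11, whereas the macro precision is 0.4: a small document with no matches weighs much less under micro averaging.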

Tolerance: if the extractor crashes on a given document, the failure
is downgraded to a warning (``logger.warning``) and the bench
continues. ``ner_metrics`` stays ``None`` for that document.
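
A minimal sketch of this per-document tolerance, assuming a plain dict of hypotheses rather than the real ``DocumentResult`` objects; ``extract_with_tolerance`` and ``flaky_extractor`` are hypothetical names used only for illustration.

```python
import logging
from typing import Callable, Dict, List, Optional

logger = logging.getLogger("ner_attach_sketch")


def extract_with_tolerance(
    hypotheses: Dict[str, str],
    entity_extractor: Callable[[str], List[dict]],
) -> Dict[str, Optional[List[dict]]]:
    results: Dict[str, Optional[List[dict]]] = {}
    for doc_id, text in hypotheses.items():
        try:
            results[doc_id] = entity_extractor(text) or []
        except Exception as exc:  # broad on purpose: any extractor failure is tolerated
            logger.warning("[ner.attach] %s: NER degraded: %s", doc_id, exc)
            results[doc_id] = None  # this document simply has no NER output
    return results


def flaky_extractor(text: str) -> List[dict]:
    if "boom" in text:
        raise RuntimeError("extractor crashed")
    return [{"label": "PER", "start": 0, "end": len(text), "text": text}]


results = extract_with_tolerance({"d0": "Jean", "d1": "boom"}, flaky_extractor)
```

The failing document ends up with ``None`` while the other one keeps its entities, so one bad input does not abort the whole run.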

NER results are not persisted in the partial NDJSON (D.2.b),
consistent with the legacy runner, which computed NER after the
document loop. On resume, NER is recomputed for all documents
(idempotent; it costs a few seconds in exchange for a lighter
partial format).

D.2.f: validate_profile at startup
----------------------------------
``validate_profile(profile)`` is called at the top of
``run_benchmark_via_service``. An unknown profile raises a
``ValueError`` BEFORE the bench starts (no OCR is run). Currently
valid profiles: ``standard, full, minimal, pipeline,
diagnostics, philological, economics``.

The value of ``profile`` does not yet affect the document-level
hooks; that is left to a later sprint, outside v2.0.
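
The fail-fast behaviour can be sketched as follows. This is a hypothetical re-implementation for illustration: the profile names are the ones listed above, but ``validate_profile_sketch``, ``run_bench_sketch`` and the error message are assumptions, not the project's ``validate_profile``.

```python
VALID_PROFILES = frozenset(
    {"standard", "full", "minimal", "pipeline",
     "diagnostics", "philological", "economics"}
)


def validate_profile_sketch(profile: str) -> None:
    # Hypothetical check; the real one lives in validate_profile.
    if profile not in VALID_PROFILES:
        raise ValueError(
            f"unknown profile {profile!r}; expected one of {sorted(VALID_PROFILES)}"
        )


def run_bench_sketch(documents: list, ocr, profile: str = "standard") -> list:
    validate_profile_sketch(profile)  # fail fast, before any OCR call
    return [ocr(doc) for doc in documents]


calls = []


def counting_ocr(doc: str) -> str:
    calls.append(doc)
    return doc.upper()


try:
    run_bench_sketch(["a", "b"], counting_ocr, profile="oops")
    raised = False
except ValueError:
    raised = True
calls_after_bad_profile = len(calls)
ok = run_bench_sketch(["a", "b"], counting_ocr)
```

Validating before the loop means a misspelled profile name costs nothing: the OCR callable is never invoked in the failing call.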

Changes
-------
- ``run_benchmark_via_service``:
  - Signature: ``entity_extractor`` and ``profile`` are no longer
    marked ``# noqa: ARG001``. ``max_workers`` remains the only
    accepted-but-ignored parameter (the ``CorpusRunner`` rewrite has
    its own ``max_in_flight``).
  - Body: ``validate_profile(profile)`` runs first;
    ``_attach_ner_metrics_to_benchmark`` runs after the
    ``BenchmarkResult`` is computed, when ``entity_extractor`` is set.
- ``_build_pipeline_metadata`` accepts ``ground_truth`` and
  ``hypothesis``; it computes ``over_normalization`` when an
  ``ocr_intermediate`` exists.
- ``run_result_to_benchmark_result`` passes GT/hypothesis through to
  ``_build_pipeline_metadata``.
- New: ``_attach_ner_metrics_to_benchmark`` (NER post-processing),
  ``_aggregate_ner_metrics`` (micro aggregation).

Tests
-----
- ``tests/app/test_sprint_d2cdef_features.py`` (new, 15 tests):
  - ``TestProfileValidation`` (4): unknown profile raises, standard
    profile accepted, default = standard, validation happens
    pre-bench (no OCR run).
  - ``TestOverNormalization`` (3): OCR only → no
    over_normalization, text_only pipeline → present, zero_shot
    pipeline → absent.
  - ``TestNERAttach`` (5): no extractor → no ner_metrics, extractor
    provided → metrics attached, aggregated_ner populated, doc
    without ENTITIES GT skipped, extractor crash downgraded to a
    warning.
  - ``TestAggregateNERMetrics`` (3): empty → None, micro P/R/F1
    aggregation, per_category.

Existing tests stay green:
- ``test_sprint_d_legacy_runner_adapter``: 43 passed.
- ``test_sprint_d2b_partial_dir_resume``: 25 passed.

Lint/budgets
------------
- ``ruff check``: All checks passed.
- ``test_file_budgets``: budget for
  ``_legacy_runner_adapter.py`` raised 1450 → 1700 (currently 1461,
  ~16 % headroom).
- ``gen_readme_tables.py``: test counter updated.

Total: 4651 passed, 9 skipped, 24 deselected.

Sprint D complete
-----------------
All sub-phases D.0-D.6 are now ✅:
- D.0 audit ✅
- D.1.a-e compatibility adapter ✅ (Sprint D)
- D.2.a progress_callback ✅ (earlier Sprint D work)
- D.2.b resume after interruption ✅ (commit a705e16)
- D.2.c output_json ✅ (D.1.d)
- D.2.d over_normalization ✅ (this commit)
- D.2.e entity_extractor ✅ (this commit)
- D.2.f profile validation ✅ (this commit)
- D.3-D.5 caller migration ✅ (commits c86ae5f, 99d1901, 839d7a0)
- D.6 removal of measurements/runner/ ✅ (commits 91e3038, 2a2fef0)

Remaining for v2.0
------------------
- H.2.b-d: remove OCRLLMPipeline + adapters/legacy_*.
- H.4: rework interfaces/{cli,web}/_legacy/ to consume the pure
  rewrite.
- H.6: version bump + v2.0.0 tag.

https://claude.ai/code/session_01NxyVKqg2SowXLZdM4H1ZDE

Files changed (5)

CLAUDE.md +3 -3
README.md +1 -1
picarones/app/services/_legacy_runner_adapter.py +202 -10
tests/app/test_sprint_d2cdef_features.py +464 -0
tests/architecture/test_file_budgets.py +4 -1

CLAUDE.md CHANGED

@@ -123,7 +123,7 @@ picarones/
 ## Test status and historical bugs
-`pytest tests/` → **4660 passed, 12 skipped, 8 deselected, 0 failed**
+`pytest tests/` → **4680 passed, 12 skipped, 8 deselected, 0 failed**
 (post-S59).  The deselected tests are the `live` markers (5 integration
 tests against a real API/binary) + `network` (3 tests that hit the real
 network), opt-in locally via `pytest -m live` or `pytest -m network`.  The
@@ -252,7 +252,7 @@ Express summary:
 1. `git branch --show-current` → `claude/repo-analysis-cukvm`.
 2. `git status` → working tree clean.
-3. `pytest tests/ -q --no-header --tb=line` → 4660 passed.
+3. `pytest tests/ -q --no-header --tb=line` → 4680 passed.
 4. `git log -1 --format=%B` → describes the next sub-phase.
 **Critical architecture rules** (learned the hard way):
@@ -340,7 +340,7 @@ detects, arbitrates, renders.
 ## Development context
 - **Environment**: GitHub Codespaces, Python 3.11+
-- **Tests**: `pytest tests/ -q` → 4660 passed, 12 skipped, 24
+- **Tests**: `pytest tests/ -q` → 4680 passed, 12 skipped, 24
   deselected, 0 failed (at the time the session was paused).
 - **Active evolution plan**: [`docs/roadmap/evolution-2026.md`](docs/roadmap/evolution-2026.md).
 - **Legacy retirement plan (master)**: [`docs/migration/legacy-retirement-plan.md`](docs/migration/legacy-retirement-plan.md).

README.md CHANGED

@@ -395,7 +395,7 @@ ruff check picarones/ tests/
 python -m mypy picarones/core/
 ```
-**Test suite**: ~4660 tests, ~3 min on a modern laptop. Coverage
+**Test suite**: ~4680 tests, ~3 min on a modern laptop. Coverage
 floor at 85% (currently ~87%). The `network` marker excludes tests
 requiring live HTTP. A handful of tests depend on optional engines
 (`pero-ocr`, `pytesseract`) and are skipped/fail gracefully when

picarones/app/services/_legacy_runner_adapter.py CHANGED

@@ -319,6 +319,8 @@ def run_result_to_benchmark_result(
             pipeline_metadata = _build_pipeline_metadata(
                 engine=engine,
                 ocr_intermediate=ocr_intermediate,
             )
             doc_results.append(
@@ -407,8 +409,18 @@ def _build_pipeline_metadata(
     *,
     engine: Any,
     ocr_intermediate: str | None,
 ) -> dict:
-    """Rebuilds the legacy ``pipeline_metadata`` for a DocumentResult."""
     if not getattr(engine, "is_pipeline", False):
         return {}
     metadata: dict = {
@@ -425,6 +437,23 @@ def _build_pipeline_metadata(
         metadata["llm_provider"] = llm_adapter.name
     if ocr_intermediate is not None:
         metadata["ocr_intermediate"] = ocr_intermediate
     return metadata
@@ -762,12 +791,13 @@ def run_benchmark_via_service(
     timeout_seconds: float = 60.0,
     cancel_event: Any | None = None,
     partial_dir: str | Path | None = None,
     # ---- Legacy parameters not yet ported to BenchmarkService ----
-    # Sprint D.2 of the v2.0 plan: the missing features will be
-    # added to ``BenchmarkService`` in a later session.
     max_workers: int = 4,  # noqa: ARG001
-    entity_extractor: Any | None = None,  # noqa: ARG001
-    profile: str = "standard",  # noqa: ARG001
 ) -> Any:
     """Compatibility adapter: legacy ``run_benchmark`` →
     ``BenchmarkService`` rewrite.
@@ -793,13 +823,27 @@ def run_benchmark_via_service(
     Deferred scope (D.2)
     --------------------
     The following parameters are **accepted but ignored** in
-    this MVP; porting them to ``BenchmarkService`` is the scope of
-    Sprint D.2:
     - ``show_progress`` (tqdm),
-    - ``max_workers`` (intra-engine parallelism),
-    - ``entity_extractor`` (NER computation),
-    - ``profile`` (measurement-profile validation).
     Resume after interruption (D.2.b)
     ---------------------------------
@@ -851,6 +895,13 @@ def run_benchmark_via_service(
         If the engines do not all declare a unique ``name``
         (see ``build_adapter_resolver``).
     """
     if code_version is None:
         # The architecture scanner rejects ``from picarones import __version__``
         # because it classifies ``picarones`` (without a sub-package) as a
@@ -887,6 +938,15 @@ def run_benchmark_via_service(
             cancel_event=cancel_event,
         )
     # Optional JSON serialization
     if output_json is not None:
         _persist_benchmark_result_json(benchmark_result, Path(output_json))
@@ -894,6 +954,138 @@ def run_benchmark_via_service(
     return benchmark_result
 def _run_benchmark_unified(
     *,
     corpus: "Corpus",

             pipeline_metadata = _build_pipeline_metadata(
                 engine=engine,
                 ocr_intermediate=ocr_intermediate,
+                ground_truth=document.ground_truth,
+                hypothesis=text_final,
             )
             doc_results.append(
     *,
     engine: Any,
     ocr_intermediate: str | None,
+    ground_truth: str = "",
+    hypothesis: str = "",
 ) -> dict:
+    """Rebuilds the legacy ``pipeline_metadata`` for a DocumentResult.
+    Sprint D.2.d: for composed OCR+LLM pipelines, computes
+    ``over_normalization`` (detects cases where the LLM over-normalized
+    the text relative to the GT) when ``ocr_intermediate`` is
+    available.  Functionally equivalent to
+    ``picarones.measurements.runner.document._compute_doc_result``
+    lines 102-112 (legacy code removed in D.6.b).
+    """
     if not getattr(engine, "is_pipeline", False):
         return {}
     metadata: dict = {
         metadata["llm_provider"] = llm_adapter.name
     if ocr_intermediate is not None:
         metadata["ocr_intermediate"] = ocr_intermediate
+        # D.2.d: over_normalization is computed for pipelines with an
+        # upstream OCR step; zero-shot offers no usable signal.
+        try:
+            from picarones.evaluation.metrics.over_normalization import (
+                detect_over_normalization,
+            )
+            over_norm = detect_over_normalization(
+                ground_truth=ground_truth,
+                ocr_text=ocr_intermediate,
+                llm_text=hypothesis,
+            )
+            metadata["over_normalization"] = over_norm.as_dict()
+        except Exception as exc:  # noqa: BLE001
+            logger.warning(
+                "[over_normalization] fonctionnalité dégradée : %s",
+                exc,
+            )
     return metadata
     timeout_seconds: float = 60.0,
     cancel_event: Any | None = None,
     partial_dir: str | Path | None = None,
+    entity_extractor: Callable[[str], list[dict]] | None = None,
+    profile: str = "standard",
     # ---- Legacy parameters not yet ported to BenchmarkService ----
+    # Sprint D.2 of the v2.0 plan, remaining marginal features:
+    # ``max_workers`` (the rewrite has its own max_in_flight via
+    # ``CorpusRunner``).
     max_workers: int = 4,  # noqa: ARG001
 ) -> Any:
     """Compatibility adapter: legacy ``run_benchmark`` →
     ``BenchmarkService`` rewrite.
     Deferred scope (D.2)
     --------------------
     The following parameters are **accepted but ignored** in
+    this MVP, since the rewrite handles these aspects natively:
     - ``show_progress`` (tqdm),
+    - ``max_workers`` (the ``CorpusRunner`` rewrite has its own
+      ``max_in_flight``, set to 2 by default).
+    Measurement profile (D.2.f)
+    ---------------------------
+    ``profile`` is validated at startup via
+    ``picarones.evaluation.metric_hooks.validate_profile``.  An
+    unknown profile raises ``PicaronesError``.  The value does not
+    yet affect the document-level hooks (that is left to a later
+    sprint, outside the v2.0 scope).
+    NER attach (D.2.e)
+    ------------------
+    If ``entity_extractor`` is provided, once the
+    ``DocumentResult`` objects are computed, the service calls the
+    extractor on each OCR hypothesis for documents whose GT has an
+    ``ENTITIES`` level, then attaches the NER metrics (``ner_metrics``
+    per document, ``aggregated_ner`` at engine level).
     Resume after interruption (D.2.b)
     ---------------------------------
         If the engines do not all declare a unique ``name``
         (see ``build_adapter_resolver``).
     """
+    # D.2.f: validate ``profile`` early. An unknown name raises
+    # ``PicaronesError`` before the bench starts, rather than
+    # degrading silently later on.
+    from picarones.evaluation.metric_hooks import validate_profile
+    validate_profile(profile)
     if code_version is None:
         # The architecture scanner rejects ``from picarones import __version__``
         # because it classifies ``picarones`` (without a sub-package) as a
             cancel_event=cancel_event,
         )
+    # D.2.e: NER attach post-processing.  Idempotent: recomputed on
+    # every run, even in resume mode (ner_metrics are not persisted
+    # in the partial NDJSON, consistent with the legacy runner, which
+    # computed NER after the doc loop).
+    if entity_extractor is not None:
+        _attach_ner_metrics_to_benchmark(
+            benchmark_result, corpus, entity_extractor,
+        )
     # Optional JSON serialization
     if output_json is not None:
         _persist_benchmark_result_json(benchmark_result, Path(output_json))
     return benchmark_result
+def _attach_ner_metrics_to_benchmark(
+    benchmark_result: Any,
+    corpus: "Corpus",
+    entity_extractor: Callable[[str], list[dict]],
+) -> None:
+    """Sprint D.2.e: computes and attaches NER metrics post-bench.
+    Iterates over the ``DocumentResult`` objects of each
+    ``EngineReport`` and, for each doc whose GT has an ``ENTITIES``
+    level, calls ``entity_extractor(hypothesis)`` then
+    ``compute_ner_metrics`` against the GT entities.  The
+    result is attached to ``dr.ner_metrics``.  Per-engine
+    aggregates are computed via ``_aggregate_ner_metrics`` and
+    stored on ``EngineReport.aggregated_ner``.
+    Tolerance: a failed extraction or computation on a given
+    doc is downgraded to a warning; the bench is not
+    interrupted.
+    """
+    from picarones.domain.artifacts import ArtifactType
+    from picarones.evaluation.metrics.ner import compute_ner_metrics
+    docs_by_id = {d.doc_id: d for d in corpus.documents}
+    for report in benchmark_result.engine_reports:
+        n_done = 0
+        for dr in report.document_results:
+            if dr.engine_error is not None or not dr.hypothesis:
+                continue
+            doc = docs_by_id.get(dr.doc_id)
+            if doc is None or not doc.has_gt(ArtifactType.ENTITIES):
+                continue
+            try:
+                gt_payload = doc.get_gt(ArtifactType.ENTITIES)
+                gt_entities = (
+                    list(gt_payload.entities) if gt_payload else []
+                )
+                hyp_entities = entity_extractor(dr.hypothesis) or []
+                dr.ner_metrics = compute_ner_metrics(
+                    gt_entities, hyp_entities,
+                )
+                n_done += 1
+            except Exception as exc:  # noqa: BLE001
+                logger.warning(
+                    "[ner.attach] %s/%s : extraction/comparaison "
+                    "NER dégradée : %s",
+                    report.engine_name, dr.doc_id, exc,
+                )
+        if n_done > 0:
+            report.aggregated_ner = _aggregate_ner_metrics(
+                report.document_results,
+            )
+            logger.info(
+                "[ner] %d documents évalués pour engine '%s'.",
+                n_done, report.engine_name,
+            )
+def _aggregate_ner_metrics(doc_results: list) -> dict | None:
+    """Sprint D.2.e: aggregates ``ner_metrics`` at engine level.
+    Recomputes *micro* precision/recall/F1 from the global
+    TP/FP/FN sums, plus the per-category breakdown and the
+    total counts of hallucinated and missed entities.
+    Functionally equivalent to
+    ``picarones.measurements.runner.ner_attach._aggregate_ner``
+    (legacy code removed in D.6.b).
+    """
+    relevant = [
+        dr for dr in doc_results if dr.ner_metrics is not None
+    ]
+    if not relevant:
+        return None
+    total_tp = 0
+    total_fp = 0
+    total_fn = 0
+    cat_tp: dict[str, int] = {}
+    cat_fp: dict[str, int] = {}
+    cat_fn: dict[str, int] = {}
+    total_hallucinated = 0
+    total_missed = 0
+    iou_threshold = 0.5
+    for dr in relevant:
+        m = dr.ner_metrics
+        total_tp += int(m.get("true_positives", 0))
+        total_fp += int(m.get("false_positives", 0))
+        total_fn += int(m.get("false_negatives", 0))
+        total_hallucinated += len(m.get("hallucinated_entities", []) or [])
+        total_missed += len(m.get("missed_entities", []) or [])
+        iou_threshold = float(m.get("iou_threshold", iou_threshold))
+        for cat, stats in (m.get("per_category") or {}).items():
+            cat_tp.setdefault(cat, 0)
+            cat_fp.setdefault(cat, 0)
+            cat_fn.setdefault(cat, 0)
+            support = int(stats.get("support", 0))
+            recall = float(stats.get("recall", 0.0))
+            precision = float(stats.get("precision", 0.0))
+            tp_cat = round(support * recall) if support > 0 else 0
+            fn_cat = max(0, support - tp_cat)
+            fp_cat = (
+                round(tp_cat * (1 - precision) / precision)
+                if precision > 0 else 0
+            )
+            cat_tp[cat] += tp_cat
+            cat_fp[cat] += fp_cat
+            cat_fn[cat] += fn_cat
+    def _prf(tp: int, fp: int, fn: int) -> dict[str, float]:
+        p = tp / (tp + fp) if (tp + fp) > 0 else 0.0
+        r = tp / (tp + fn) if (tp + fn) > 0 else 0.0
+        f1 = 2 * p * r / (p + r) if (p + r) > 0 else 0.0
+        return {
+            "precision": p, "recall": r, "f1": f1, "support": tp + fn,
+        }
+    return {
+        "global": _prf(total_tp, total_fp, total_fn),
+        "per_category": {
+            cat: _prf(cat_tp[cat], cat_fp[cat], cat_fn[cat])
+            for cat in sorted(cat_tp)
+        },
+        "n_documents": len(relevant),
+        "total_hallucinated": total_hallucinated,
+        "total_missed": total_missed,
+        "iou_threshold": iou_threshold,
+    }
 def _run_benchmark_unified(
     *,
     corpus: "Corpus",

tests/app/test_sprint_d2cdef_features.py ADDED

@@ -0,0 +1,464 @@

+"""Sprint D.2.c-f: additional features in
+``run_benchmark_via_service``.
+Covers the legacy parameters that were previously ignored:
+- D.2.c (``output_json``): already active since D.1.d, covered by
+  ``test_sprint_d_legacy_runner_adapter::test_output_json_persists_to_disk``.
+- D.2.d (``over_normalization``): for OCR+LLM pipelines with an
+  upstream OCR step, ``DocumentResult.pipeline_metadata`` now
+  carries an ``over_normalization`` key.
+- D.2.e (``entity_extractor``): for documents with an ``ENTITIES``
+  GT, NER metrics are computed and attached.
+- D.2.f (``profile``): an unknown profile raises ``PicaronesError``
+  when the bench starts.
+"""
+from __future__ import annotations
+from pathlib import Path
+import pytest
+from picarones.adapters.legacy_engines.base import BaseOCREngine
+from picarones.adapters.llm.base import BaseLLMAdapter
+from picarones.app.services._legacy_runner_adapter import (
+    _aggregate_ner_metrics,
+    run_benchmark_via_service,
+)
+from picarones.evaluation.corpus import (
+    Corpus,
+    Document,
+    EntitiesGT,
+)
+# ──────────────────────────────────────────────────────────────────────
+# Mocks
+# ──────────────────────────────────────────────────────────────────────
+class _MockOCR(BaseOCREngine):
+    def __init__(self, name: str = "mock_ocr", text: str = "ocr") -> None:
+        super().__init__(config={})
+        self._name = name
+        self._text = text
+    @property
+    def name(self) -> str:  # type: ignore[override]
+        return self._name
+    def version(self) -> str:
+        return "1.0"
+    def _run_ocr(self, image_path):
+        return self._text
+class _MockLLM(BaseLLMAdapter):
+    def __init__(self, model: str = "mock-1", text: str = "corrected") -> None:
+        super().__init__(model=model, config={})
+        self._text = text
+    @property
+    def name(self) -> str:
+        return "mock_llm"
+    @property
+    def default_model(self) -> str:
+        return "mock-1"
+    def _call(self, prompt, image_b64=None):
+        return self._text
+def _make_simple_corpus(tmp_path: Path, n: int = 1) -> Corpus:
+    docs = []
+    for i in range(n):
+        img = tmp_path / f"doc{i}.png"
+        img.write_bytes(b"x")
+        docs.append(Document(
+            image_path=img,
+            ground_truth=f"texte {i}",
+            doc_id=f"doc{i}",
+        ))
+    return Corpus(name="cdef_test", documents=docs)
+# ──────────────────────────────────────────────────────────────────────
+# D.2.f — profile validation
+# ──────────────────────────────────────────────────────────────────────
+class TestProfileValidation:
+    """Sprint D.2.f: ``profile`` is validated at startup."""
+    def test_unknown_profile_raises(self, tmp_path: Path) -> None:
+        corpus = _make_simple_corpus(tmp_path)
+        ocr = _MockOCR()
+        with pytest.raises(ValueError, match="profil"):
+            run_benchmark_via_service(
+                corpus, [ocr], profile="not_a_real_profile",
+            )
+    def test_standard_profile_accepted(self, tmp_path: Path) -> None:
+        corpus = _make_simple_corpus(tmp_path)
+        ocr = _MockOCR()
+        bm = run_benchmark_via_service(corpus, [ocr], profile="standard")
+        assert bm.engine_reports
+    def test_default_profile_is_standard(self, tmp_path: Path) -> None:
+        """No kwarg = uses ``standard``, which passes validation."""
+        corpus = _make_simple_corpus(tmp_path)
+        ocr = _MockOCR()
+        bm = run_benchmark_via_service(corpus, [ocr])
+        assert bm.engine_reports
+    def test_validation_happens_before_bench(self, tmp_path: Path) -> None:
+        """An invalid profile raises BEFORE any OCR runs (otherwise
+        compute time is wasted on a misspelled name)."""
+        corpus = _make_simple_corpus(tmp_path)
+        call_counter = {"n": 0}
+        class _CountingOCR(_MockOCR):
+            def _run_ocr(self, image_path):
+                call_counter["n"] += 1
+                return "ocr"
+        ocr = _CountingOCR()
+        with pytest.raises(ValueError):
+            run_benchmark_via_service(
+                corpus, [ocr], profile="oops",
+            )
+        # OCR never called.
+        assert call_counter["n"] == 0
+# ──────────────────────────────────────────────────────────────────────
+# D.2.d — over_normalization
+# ──────────────────────────────────────────────────────────────────────
+class TestOverNormalization:
+    """Sprint D.2.d: OCR+LLM pipelines with an upstream OCR step have
+    an ``over_normalization`` key in ``pipeline_metadata``."""
+    def test_ocr_only_has_no_over_normalization(self, tmp_path: Path) -> None:
+        """A standalone OCR engine (no pipeline) has no
+        ``over_normalization`` since there is no LLM."""
+        corpus = _make_simple_corpus(tmp_path)
+        ocr = _MockOCR(text="texte 0")
+        bm = run_benchmark_via_service(corpus, [ocr])
+        dr = bm.engine_reports[0].document_results[0]
+        assert "over_normalization" not in dr.pipeline_metadata
+    def test_pipeline_text_only_computes_over_normalization(
+        self, tmp_path: Path,
+    ) -> None:
+        """OCR+LLM pipeline in ``text_only`` mode: the LLM receives the
+        OCR text and corrects it.  ``over_normalization`` must
+        appear in pipeline_metadata."""
+        from picarones.adapters.legacy_pipelines.base import (
+            OCRLLMPipeline,
+            PipelineMode,
+        )
+        corpus = _make_simple_corpus(tmp_path)
+        ocr = _MockOCR(name="upstream_ocr", text="texto 0")  # 1 error
+        llm = _MockLLM(model="m1", text="texte 0")  # corrects it properly
+        pipeline = OCRLLMPipeline(
+            ocr_engine=ocr,
+            llm_adapter=llm,
+            mode=PipelineMode.TEXT_ONLY,
+        )
+        bm = run_benchmark_via_service(corpus, [pipeline])
+        dr = bm.engine_reports[0].document_results[0]
+        assert dr.pipeline_metadata.get("is_pipeline") is True
+        assert "over_normalization" in dr.pipeline_metadata
+        # The payload is a dict produced by OverNormalizationResult.as_dict().
+        ov = dr.pipeline_metadata["over_normalization"]
+        assert isinstance(ov, dict)
+    def test_pipeline_zero_shot_has_no_over_normalization(
+        self, tmp_path: Path,
+    ) -> None:
+        """Zero-shot pipeline: the VLM receives the image directly, with
+        no upstream OCR, so there is no ``ocr_intermediate`` and no
+        ``over_normalization``."""
+        from picarones.adapters.legacy_pipelines.base import (
+            OCRLLMPipeline,
+            PipelineMode,
+        )
+        corpus = _make_simple_corpus(tmp_path)
+        llm = _MockLLM(model="vlm-1", text="texte 0")
+        pipeline = OCRLLMPipeline(
+            llm_adapter=llm,
+            mode=PipelineMode.ZERO_SHOT,
+        )
+        bm = run_benchmark_via_service(corpus, [pipeline])
+        dr = bm.engine_reports[0].document_results[0]
+        # Pipeline but no upstream OCR → no over_normalization.
+        assert "over_normalization" not in dr.pipeline_metadata
+# ──────────────────────────────────────────────────────────────────────
+# D.2.e — NER attach via entity_extractor
+# ──────────────────────────────────────────────────────────────────────
+class TestNERAttach:
+    """Sprint D.2.e: when ``entity_extractor`` is provided, documents
+    with an ``ENTITIES`` GT receive ``ner_metrics`` and the
+    engine_report has an ``aggregated_ner``."""
+    def _make_corpus_with_entities(
+        self, tmp_path: Path, n: int = 2,
+    ) -> Corpus:
+        from picarones.domain.artifacts import ArtifactType
+        docs = []
+        for i in range(n):
+            img = tmp_path / f"d{i}.png"
+            img.write_bytes(b"x")
+            doc = Document(
+                image_path=img,
+                ground_truth=f"Jean {i} habite Paris",
+                doc_id=f"d{i}",
+            )
+            doc.ground_truths[ArtifactType.ENTITIES] = EntitiesGT(
+                entities=[
+                    {"label": "PER", "start": 0, "end": 5 + len(str(i)),
+                     "text": f"Jean {i}"},
+                    {"label": "LOC", "start": 13 + len(str(i)),
+                     "end": 18 + len(str(i)), "text": "Paris"},
+                ],
+            )
+            docs.append(doc)
+        return Corpus(name="ner_test", documents=docs)
+    def test_no_extractor_no_ner_metrics(self, tmp_path: Path) -> None:
+        corpus = self._make_corpus_with_entities(tmp_path)
+        ocr = _MockOCR(text="Jean 0 habite Paris")
+        bm = run_benchmark_via_service(corpus, [ocr])
+        report = bm.engine_reports[0]
+        for dr in report.document_results:
+            assert dr.ner_metrics is None
+        assert report.aggregated_ner is None
+    def test_extractor_attaches_metrics_to_doc(self, tmp_path: Path) -> None:
+        """When the extractor returns entities for the hypothesis,
+        ``ner_metrics`` appears on the DocumentResult."""
+        corpus = self._make_corpus_with_entities(tmp_path)
+        ocr = _MockOCR(text="Jean 0 habite Paris")  # perfect match
+        def extractor(text: str) -> list[dict]:
+            # Reproduces the GT entities on the hypothesis.
+            ents = []
+            if "Jean 0" in text:
+                ents.append({"label": "PER", "start": 0, "end": 6,
+                             "text": "Jean 0"})
+            if "Paris" in text:
+                idx = text.find("Paris")
+                ents.append({"label": "LOC", "start": idx,
+                             "end": idx + 5, "text": "Paris"})
+            return ents
+        bm = run_benchmark_via_service(
+            corpus, [ocr], entity_extractor=extractor,
+        )
+        report = bm.engine_reports[0]
+        d0 = next(d for d in report.document_results if d.doc_id == "d0")
+        assert d0.ner_metrics is not None
+        # The entities match → tp > 0.
+        assert d0.ner_metrics["true_positives"] > 0
+    def test_aggregated_ner_present_when_any_doc_evaluated(
+        self, tmp_path: Path,
+    ) -> None:
+        corpus = self._make_corpus_with_entities(tmp_path)
+        ocr = _MockOCR(text="Jean 0 habite Paris")
+        def extractor(text: str) -> list[dict]:
+            return [{"label": "PER", "start": 0, "end": 6, "text": "Jean 0"}]
+        bm = run_benchmark_via_service(
+            corpus, [ocr], entity_extractor=extractor,
+        )
+        report = bm.engine_reports[0]
+        assert report.aggregated_ner is not None
+        assert "global" in report.aggregated_ner
+        assert "precision" in report.aggregated_ner["global"]
+    def test_doc_without_entities_gt_skipped(self, tmp_path: Path) -> None:
+        """A document without ``ENTITIES`` GT is not NER-evaluated:
+        ``ner_metrics`` stays ``None`` even when an extractor is
+        provided."""
+        # Mixed corpus: one doc with ENTITIES, one without.
+        from picarones.domain.artifacts import ArtifactType
+        img1 = tmp_path / "d1.png"
+        img1.write_bytes(b"x")
+        doc_with = Document(
+            image_path=img1, ground_truth="Jean", doc_id="with_ent",
+        )
+        doc_with.ground_truths[ArtifactType.ENTITIES] = EntitiesGT(
+            entities=[{"label": "PER", "start": 0, "end": 4, "text": "Jean"}],
+        )
+        img2 = tmp_path / "d2.png"
+        img2.write_bytes(b"x")
+        doc_without = Document(
+            image_path=img2, ground_truth="rien", doc_id="without_ent",
+        )
+        corpus = Corpus(
+            name="mixed", documents=[doc_with, doc_without],
+        )
+        ocr = _MockOCR(text="Jean")
+        def extractor(text: str) -> list[dict]:
+            return [{"label": "PER", "start": 0, "end": 4, "text": "Jean"}]
+        bm = run_benchmark_via_service(
+            corpus, [ocr], entity_extractor=extractor,
+        )
+        report = bm.engine_reports[0]
+        d_with = next(
+            d for d in report.document_results if d.doc_id == "with_ent"
+        )
+        d_without = next(
+            d for d in report.document_results if d.doc_id == "without_ent"
+        )
+        assert d_with.ner_metrics is not None
+        assert d_without.ner_metrics is None
+    def test_extractor_exception_does_not_crash_bench(
+        self, tmp_path: Path, caplog: pytest.LogCaptureFixture,
+    ) -> None:
+        corpus = self._make_corpus_with_entities(tmp_path, n=1)
+        ocr = _MockOCR(text="Jean 0 habite Paris")
+        def buggy_extractor(text: str) -> list[dict]:
+            raise RuntimeError("NER backend down")
+        with caplog.at_level("WARNING"):
+            bm = run_benchmark_via_service(
+                corpus, [ocr], entity_extractor=buggy_extractor,
+            )
+        report = bm.engine_reports[0]
+        # The benchmark completed: no exception propagated.
+        assert len(report.document_results) == 1
+        # ner_metrics is not attached because the extractor crashed.
+        assert report.document_results[0].ner_metrics is None
+# ──────────────────────────────────────────────────────────────────────
+# D.2.e: NER aggregation (internal helper tested directly)
+# ──────────────────────────────────────────────────────────────────────
+class TestAggregateNERMetrics:
+    """Unit tests for ``_aggregate_ner_metrics``, the functional
+    equivalent of the former ``measurements.runner.ner_attach._aggregate_ner``."""
+    def test_empty_returns_none(self) -> None:
+        from picarones.evaluation.benchmark_result import (
+            DocumentResult,
+        )
+        from picarones.evaluation.metric_result import MetricsResult
+        # No ner_metrics on any doc.
+        drs = [
+            DocumentResult(
+                doc_id="d", image_path="", ground_truth="",
+                hypothesis="", metrics=MetricsResult(), duration_seconds=0,
+            ),
+        ]
+        assert _aggregate_ner_metrics(drs) is None
+    def test_aggregates_global_prf(self) -> None:
+        from picarones.evaluation.benchmark_result import (
+            DocumentResult,
+        )
+        from picarones.evaluation.metric_result import MetricsResult
+        dr1 = DocumentResult(
+            doc_id="d1", image_path="", ground_truth="",
+            hypothesis="", metrics=MetricsResult(), duration_seconds=0,
+        )
+        dr1.ner_metrics = {
+            "true_positives": 5,
+            "false_positives": 1,
+            "false_negatives": 2,
+            "per_category": {},
+            "hallucinated_entities": [],
+            "missed_entities": [],
+        }
+        dr2 = DocumentResult(
+            doc_id="d2", image_path="", ground_truth="",
+            hypothesis="", metrics=MetricsResult(), duration_seconds=0,
+        )
+        dr2.ner_metrics = {
+            "true_positives": 3,
+            "false_positives": 0,
+            "false_negatives": 1,
+            "per_category": {},
+            "hallucinated_entities": [],
+            "missed_entities": [],
+        }
+        agg = _aggregate_ner_metrics([dr1, dr2])
+        assert agg is not None
+        # tp=8, fp=1, fn=3 → P=8/9, R=8/11, F1=2*P*R/(P+R)
+        assert agg["global"]["precision"] == pytest.approx(8 / 9, abs=1e-4)
+        assert agg["global"]["recall"] == pytest.approx(8 / 11, abs=1e-4)
+        assert agg["n_documents"] == 2
+    def test_per_category_aggregation(self) -> None:
+        from picarones.evaluation.benchmark_result import (
+            DocumentResult,
+        )
+        from picarones.evaluation.metric_result import MetricsResult
+        dr = DocumentResult(
+            doc_id="d", image_path="", ground_truth="",
+            hypothesis="", metrics=MetricsResult(), duration_seconds=0,
+        )
+        dr.ner_metrics = {
+            "true_positives": 4,
+            "false_positives": 1,
+            "false_negatives": 1,
+            "per_category": {
+                "PER": {
+                    "support": 3, "recall": 1.0, "precision": 1.0,
+                    "f1": 1.0,
+                },
+                "LOC": {
+                    "support": 2, "recall": 0.5, "precision": 0.5,
+                    "f1": 0.5,
+                },
+            },
+            "hallucinated_entities": [],
+            "missed_entities": [],
+        }
+        agg = _aggregate_ner_metrics([dr])
+        assert "PER" in agg["per_category"]
+        assert "LOC" in agg["per_category"]
+        # PER: 3/3 → P=R=F1=1.0
+        assert agg["per_category"]["PER"]["recall"] == pytest.approx(1.0)
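The arithmetic checked in `test_aggregates_global_prf` is micro-averaging: true positives, false positives and false negatives are summed over the evaluated documents before precision, recall and F1 are derived. The sketch below illustrates that computation only. `micro_prf` is a hypothetical helper written for this example, not the `_aggregate_ner_metrics` from the commit; the dictionary keys mirror those used in the tests, and the zero-division fallbacks to `0.0` are an assumption.

```python
from typing import Iterable, Optional


def micro_prf(doc_metrics: Iterable[Optional[dict]]) -> Optional[dict]:
    """Hypothetical helper: micro-averaged P/R/F1 over per-document counts."""
    # Documents that were never NER-evaluated carry None and are skipped.
    evaluated = [m for m in doc_metrics if m is not None]
    if not evaluated:
        return None

    # Sum the raw counts first; ratios are computed once, on the totals.
    tp = sum(m["true_positives"] for m in evaluated)
    fp = sum(m["false_positives"] for m in evaluated)
    fn = sum(m["false_negatives"] for m in evaluated)

    # Assumption: an empty denominator yields 0.0 rather than an error.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0

    return {
        "global": {"precision": precision, "recall": recall, "f1": f1},
        "n_documents": len(evaluated),
    }
```

With the counts from the test (tp=8, fp=1, fn=3), this gives P=8/9, R=8/11 and F1=0.8.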

tests/architecture/test_file_budgets.py CHANGED

@@ -43,7 +43,10 @@ FILE_BUDGETS: dict[str, int] = {
     # removed in H.4 along with interfaces/{cli,web}/_legacy/.
     # Sprint D.2.b added ~260 LOC for the resumable branch
     # (``_run_benchmark_with_partial``).
-    "picarones/app/services/_legacy_runner_adapter.py": 1450,  # currently 1269
     # --- God-modules: current budget + 15% margin.
     # Shrinking them will be the subject of a dedicated refactor sprint.
     # statistics.py (1128 lines) was split into a sub-package

     # removed in H.4 along with interfaces/{cli,web}/_legacy/.
     # Sprint D.2.b added ~260 LOC for the resumable branch
     # (``_run_benchmark_with_partial``).
+    # Sprint D.2.c-f added ~190 LOC: NER attach (post-process +
+    # _aggregate_ner_metrics) + over_normalization in
+    # _build_pipeline_metadata + validate_profile.
+    "picarones/app/services/_legacy_runner_adapter.py": 1700,  # currently 1461
     # --- God-modules: current budget + 15% margin.
     # Shrinking them will be the subject of a dedicated refactor sprint.
     # statistics.py (1128 lines) was split into a sub-package