feat(sprint-D.2.c-f): NER, over-normalization, profile validation
Sprint D.2.c-f of the v2.0 plan: wires up the last legacy parameters
that ``run_benchmark_via_service`` previously ignored.
Audit
-----
| Sub-phase | Feature | State before | State after |
|-----------|---------|--------------|-------------|
| D.2.c | ``output_json`` | ✅ already active (D.1.d) | unchanged |
| D.2.d | ``over_normalization`` | ❌ not computed | ✅ wired |
| D.2.e | ``entity_extractor`` (NER) | ❌ ignored | ✅ wired |
| D.2.f | ``profile`` validation | ❌ ignored | ✅ wired |
D.2.d: over_normalization (OCR+LLM pipelines)
---------------------------------------------
For composite pipelines (upstream OCR + correcting LLM),
``DocumentResult.pipeline_metadata`` now carries an
``over_normalization`` key produced by
``picarones.evaluation.metrics.over_normalization.detect_over_normalization``.
The wiring lives in ``_build_pipeline_metadata``, which now receives
``ground_truth`` and ``hypothesis`` from the D.1.c converter.
Cases not covered (no ``over_normalization`` emitted):
- Standalone OCR engines (no ``is_pipeline``).
- Zero-shot pipelines (direct VLM, no ``ocr_intermediate``).
Exact functional equivalent of
``picarones.measurements.runner.document._compute_doc_result``
lines 102-112 (legacy code removed in D.6.b).
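The gating described above can be sketched in isolation. This is a minimal illustration, not the adapter's code: ``build_pipeline_metadata_sketch`` and ``toy_detector`` are hypothetical names, and the toy detector only stands in for ``detect_over_normalization``, whose real output format is not reproduced here.

```python
from typing import Callable, Optional


def build_pipeline_metadata_sketch(
    *,
    is_pipeline: bool,
    ocr_intermediate: Optional[str],
    ground_truth: str,
    hypothesis: str,
    detector: Callable[[str, str, str], dict],
) -> dict:
    """Emit ``over_normalization`` only when an upstream OCR text exists."""
    if not is_pipeline:
        # Standalone OCR engine: no pipeline metadata at all.
        return {}
    metadata: dict = {"is_pipeline": True}
    if ocr_intermediate is not None:
        metadata["ocr_intermediate"] = ocr_intermediate
        metadata["over_normalization"] = detector(
            ground_truth, ocr_intermediate, hypothesis
        )
    # Zero-shot pipelines fall through without the key.
    return metadata


def toy_detector(ground_truth: str, ocr_text: str, llm_text: str) -> dict:
    """Hypothetical stand-in: words the OCR got right but the LLM changed."""
    triples = zip(ground_truth.split(), ocr_text.split(), llm_text.split())
    changed = sum(1 for gt, ocr, llm in triples if gt == ocr and llm != gt)
    return {"over_normalized_words": changed}


text_only = build_pipeline_metadata_sketch(
    is_pipeline=True,
    ocr_intermediate="maistre Jehan",
    ground_truth="maistre Jehan",
    hypothesis="maître Jehan",
    detector=toy_detector,
)
zero_shot = build_pipeline_metadata_sketch(
    is_pipeline=True,
    ocr_intermediate=None,
    ground_truth="maistre Jehan",
    hypothesis="maître Jehan",
    detector=toy_detector,
)
```

In this sketch ``text_only`` carries the key because the LLM modernized a word the OCR had transcribed correctly, while ``zero_shot`` does not, since there is no intermediate OCR text to compare against.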
D.2.e: NER attach via entity_extractor
--------------------------------------
When ``entity_extractor: Callable[[str], list[dict]]`` is provided,
the service runs the following after the benchmark:
- for each ``DocumentResult`` of an engine that is not in error,
- if the document's GT has an ``ENTITIES`` level (Sprint 32
  multi-level GT),
- it runs ``entity_extractor(dr.hypothesis)``, then
  ``compute_ner_metrics`` against the GT entities,
- and attaches the result to ``DocumentResult.ner_metrics``.
Once all documents of an engine are processed,
``_aggregate_ner_metrics`` recomputes *micro* precision/recall/F1,
the per_category breakdown, and the hallucinated/missed counters
from the global sums; this is the functional equivalent of
``measurements.runner.ner_attach._aggregate_ner`` (legacy code
removed in D.6.b). The result is attached to
``EngineReport.aggregated_ner``.
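To make the *micro* aggregation concrete, here is a small sketch that sums TP/FP/FN over documents before computing the ratios. The ``true_positives`` / ``false_positives`` / ``false_negatives`` keys match the ones read in the diff; the function names ``prf`` and ``micro_aggregate`` are hypothetical and the per-category part is left out.

```python
from typing import Optional


def prf(tp: int, fp: int, fn: int) -> dict:
    """Precision, recall and F1 from raw counts, guarding empty denominators."""
    p = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    r = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) > 0 else 0.0
    return {"precision": p, "recall": r, "f1": f1}


def micro_aggregate(per_doc: list) -> Optional[dict]:
    """Micro average: pool the counts of all documents, then compute P/R/F1."""
    if not per_doc:
        return None
    tp = sum(int(m.get("true_positives", 0)) for m in per_doc)
    fp = sum(int(m.get("false_positives", 0)) for m in per_doc)
    fn = sum(int(m.get("false_negatives", 0)) for m in per_doc)
    return prf(tp, fp, fn)


docs = [
    {"true_positives": 8, "false_positives": 2, "false_negatives": 0},
    {"true_positives": 1, "false_positives": 0, "false_negatives": 3},
]
aggregated = micro_aggregate(docs)
```

With these two documents the pooled counts are TP=9, FP=2, FN=3, so micro precision is 9/11 and micro recall is 9/12. A macro average of the per-document scores would instead give 0.9 and 0.625, which is why a document with many entities weighs more under micro averaging.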
Tolerance: an extractor crash on a specific document is downgraded
to a warning (logger.warning) and the benchmark continues. The
unattached ``ner_metrics`` stays ``None`` for that document.
NER results are not persisted in the partial NDJSON (D.2.b), which is
consistent with the legacy runner computing NER after the loop. On a
resume, NER is recomputed over all documents (idempotent; a few
seconds of extra cost in exchange for simplicity).
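The tolerance rule can be illustrated with a short sketch, assuming a plain dict of hypotheses instead of the real ``DocumentResult`` objects. ``attach_ner_sketch``, ``flaky_extractor`` and ``count_entities`` are hypothetical helpers introduced only for this example.

```python
import logging
from typing import Callable, Dict, List, Optional

logger = logging.getLogger("ner_sketch")


def attach_ner_sketch(
    hypotheses: Dict[str, str],
    extractor: Callable[[str], List[dict]],
    scorer: Callable[[List[dict]], dict],
) -> Dict[str, Optional[dict]]:
    """Score each document; a failure leaves ``None`` and logs a warning."""
    results: Dict[str, Optional[dict]] = {}
    for doc_id, hypothesis in hypotheses.items():
        results[doc_id] = None  # default when extraction or scoring fails
        try:
            results[doc_id] = scorer(extractor(hypothesis))
        except Exception as exc:  # deliberately broad: one doc must not stop the run
            logger.warning("NER degraded for %s: %s", doc_id, exc)
    return results


def flaky_extractor(text: str) -> List[dict]:
    """Hypothetical extractor that crashes on one specific input."""
    if text == "boom":
        raise RuntimeError("extractor crashed")
    return [{"text": word, "label": "PER"} for word in text.split()]


def count_entities(entities: List[dict]) -> dict:
    """Hypothetical scorer: only counts the extracted entities."""
    return {"n_entities": len(entities)}


ner_results = attach_ner_sketch(
    {"doc0": "Jehan Froissart", "doc1": "boom"},
    flaky_extractor,
    count_entities,
)
```

Because nothing is persisted, calling the function again on a resumed run simply recomputes every entry, which is the idempotence property the commit message relies on.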
D.2.f: validate_profile at startup
----------------------------------
``validate_profile(profile)`` is called at the top of
``run_benchmark_via_service``. An unknown profile raises a
``ValueError`` BEFORE the benchmark starts (no OCR is run). Currently
valid profiles: ``standard, full, minimal, pipeline,
diagnostics, philological, economics``.
The value of ``profile`` does not yet affect the document-level
hooks; that would be the subject of a later sprint, outside v2.0.
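A minimal sketch of the fail-fast behaviour, assuming the profile list quoted above. ``validate_profile_sketch`` and ``run_benchmark_sketch`` are hypothetical stand-ins, not the real ``validate_profile`` or adapter, and the error message is invented for the example.

```python
from typing import Callable, Iterable, List

VALID_PROFILES = frozenset(
    {
        "standard",
        "full",
        "minimal",
        "pipeline",
        "diagnostics",
        "philological",
        "economics",
    }
)


def validate_profile_sketch(profile: str) -> None:
    """Raise ``ValueError`` for a profile name outside the known set."""
    if profile not in VALID_PROFILES:
        raise ValueError(
            f"unknown profile {profile!r}; expected one of {sorted(VALID_PROFILES)}"
        )


def run_benchmark_sketch(
    documents: Iterable[str],
    ocr: Callable[[str], str],
    profile: str = "standard",
) -> List[str]:
    """Validate first, so a misspelled profile costs no OCR call."""
    validate_profile_sketch(profile)
    return [ocr(doc) for doc in documents]


ocr_calls = {"n": 0}


def counting_ocr(doc: str) -> str:
    """Toy OCR that records how many times it was invoked."""
    ocr_calls["n"] += 1
    return doc.upper()


try:
    run_benchmark_sketch(["a", "b"], counting_ocr, profile="oops")
    raised = False
except ValueError:
    raised = True
calls_after_invalid = ocr_calls["n"]
valid_output = run_benchmark_sketch(["a", "b"], counting_ocr)
```

Placing the check before any document is touched mirrors the ``test_validation_happens_before_bench`` case: the counter stays at zero when the profile is rejected.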
Changes
-------
- ``run_benchmark_via_service``:
  - Signature: ``entity_extractor`` and ``profile`` are no longer
    ``# noqa: ARG001``. ``max_workers`` remains the only
    accepted-but-ignored parameter (the ``CorpusRunner`` rewrite has
    its own ``max_in_flight``).
  - Body: ``validate_profile(profile)`` first;
    ``_attach_ner_metrics_to_benchmark`` after the
    ``BenchmarkResult`` is computed when ``entity_extractor`` is set.
- ``_build_pipeline_metadata`` accepts ``ground_truth`` and
  ``hypothesis``; computes ``over_normalization`` when an
  ``ocr_intermediate`` exists.
- ``run_result_to_benchmark_result`` passes GT/hypothesis to
  ``_build_pipeline_metadata``.
- New: ``_attach_ner_metrics_to_benchmark`` (NER post-processing),
  ``_aggregate_ner_metrics`` (micro aggregation).
Tests
-----
- ``tests/app/test_sprint_d2cdef_features.py`` (new, 15 tests):
  - ``TestProfileValidation`` (4): unknown profile raises, standard
    profile accepted, default = standard, validation happens
    pre-bench (no OCR run).
  - ``TestOverNormalization`` (3): OCR only → no
    over_normalization, text_only pipeline → present, zero_shot
    pipeline → absent.
  - ``TestNERAttach`` (5): no extractor → no ner_metrics,
    extractor provided → metrics attached, aggregated_ner
    populated, document without ENTITIES GT skipped, extractor
    crash downgraded to a warning.
  - ``TestAggregateNERMetrics`` (3): empty → None,
    micro P/R/F1 aggregation, per_category.
Existing tests stay green:
- ``test_sprint_d_legacy_runner_adapter``: 43 passed.
- ``test_sprint_d2b_partial_dir_resume``: 25 passed.
Lint/budgets
------------
- ``ruff check``: All checks passed.
- ``test_file_budgets``: budget for
  ``_legacy_runner_adapter.py`` raised 1450 → 1700 (currently 1461,
  ~16 % headroom).
- ``gen_readme_tables.py``: test counter updated.
Total: 4651 passed, 9 skipped, 24 deselected.
Sprint D complete
-----------------
All sub-phases D.0-D.6 are now ✅:
- D.0 audit ✅
- D.1.a-e compatibility adapter ✅ (Sprint D)
- D.2.a progress_callback ✅ (previous Sprint D)
- D.2.b resume after interruption ✅ (commit a705e16)
- D.2.c output_json ✅ (D.1.d)
- D.2.d over_normalization ✅ (this commit)
- D.2.e entity_extractor ✅ (this commit)
- D.2.f profile validation ✅ (this commit)
- D.3-D.5 caller migration ✅ (commits c86ae5f, 99d1901, 839d7a0)
- D.6 removal of measurements/runner/ ✅ (commits 91e3038, 2a2fef0)
Remaining for v2.0
------------------
- H.2.b-d: removal of OCRLLMPipeline + adapters/legacy_*.
- H.4: rework of interfaces/{cli,web}/_legacy/ to consume the
  pure rewrite.
- H.6: version bump + v2.0.0 tag.
https://claude.ai/code/session_01NxyVKqg2SowXLZdM4H1ZDE
- CLAUDE.md +3 -3
- README.md +1 -1
- picarones/app/services/_legacy_runner_adapter.py +202 -10
- tests/app/test_sprint_d2cdef_features.py +464 -0
- tests/architecture/test_file_budgets.py +4 -1
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -123,7 +123,7 @@ picarones/
 
 ## Test status and historical bugs
 
-`pytest tests/` → **
+`pytest tests/` → **4680 passed, 12 skipped, 8 deselected, 0 failed**
 (post-S59). The deselected tests are the `live` markers (5 integration tests
 against a real API/binary) + `network` (3 tests that hit the real network),
 opt-in locally via `pytest -m live` or `pytest -m network`. The
@@ -252,7 +252,7 @@ Quick summary:
 
 1. `git branch --show-current` → `claude/repo-analysis-cukvm`.
 2. `git status` → working tree clean.
-3. `pytest tests/ -q --no-header --tb=line` →
+3. `pytest tests/ -q --no-header --tb=line` → 4680 passed.
 4. `git log -1 --format=%B` → describes the next sub-phase.
 
 **Critical architecture rules** (learned the hard way):
@@ -340,7 +340,7 @@ detects, arbitrates, renders.
 ## Development context
 
 - **Environment**: GitHub Codespaces, Python 3.11+
-- **Tests**: `pytest tests/ -q` →
+- **Tests**: `pytest tests/ -q` → 4680 passed, 12 skipped, 24
 deselected, 0 failed (at the time the session was paused).
 - **Active evolution plan**: [`docs/roadmap/evolution-2026.md`](docs/roadmap/evolution-2026.md).
 - **Legacy retirement plan (master)**: [`docs/migration/legacy-retirement-plan.md`](docs/migration/legacy-retirement-plan.md).
--- a/README.md
+++ b/README.md
@@ -395,7 +395,7 @@ ruff check picarones/ tests/
 python -m mypy picarones/core/
 ```
 
-**Test suite**: ~
+**Test suite**: ~4680 tests, ~3 min on a modern laptop. Coverage
 floor at 85% (currently ~87%). The `network` marker excludes tests
 requiring live HTTP. A handful of tests depend on optional engines
 (`pero-ocr`, `pytesseract`) and are skipped/fail gracefully when
--- a/picarones/app/services/_legacy_runner_adapter.py
+++ b/picarones/app/services/_legacy_runner_adapter.py
@@ -319,6 +319,8 @@ def run_result_to_benchmark_result(
         pipeline_metadata = _build_pipeline_metadata(
             engine=engine,
             ocr_intermediate=ocr_intermediate,
+            ground_truth=document.ground_truth,
+            hypothesis=text_final,
         )
 
         doc_results.append(
@@ -407,8 +409,18 @@ def _build_pipeline_metadata(
     *,
     engine: Any,
     ocr_intermediate: str | None,
+    ground_truth: str = "",
+    hypothesis: str = "",
 ) -> dict:
-    """Rebuild the legacy ``pipeline_metadata`` for a DocumentResult.
+    """Rebuild the legacy ``pipeline_metadata`` for a DocumentResult.
+
+    Sprint D.2.d: for composite OCR+LLM pipelines, computes
+    ``over_normalization`` (detects cases where the LLM over-normalized
+    the text relative to the GT) when ``ocr_intermediate`` is
+    available. Functional equivalent of
+    ``picarones.measurements.runner.document._compute_doc_result``
+    lines 102-112 (legacy code removed in D.6.b).
+    """
     if not getattr(engine, "is_pipeline", False):
         return {}
     metadata: dict = {
@@ -425,6 +437,23 @@ def _build_pipeline_metadata(
         metadata["llm_provider"] = llm_adapter.name
     if ocr_intermediate is not None:
         metadata["ocr_intermediate"] = ocr_intermediate
+        # D.2.d: over_normalization is computed for pipelines with an
+        # upstream OCR; there is no usable signal in zero-shot mode.
+        try:
+            from picarones.evaluation.metrics.over_normalization import (
+                detect_over_normalization,
+            )
+            over_norm = detect_over_normalization(
+                ground_truth=ground_truth,
+                ocr_text=ocr_intermediate,
+                llm_text=hypothesis,
+            )
+            metadata["over_normalization"] = over_norm.as_dict()
+        except Exception as exc:  # noqa: BLE001
+            logger.warning(
+                "[over_normalization] fonctionnalité dégradée : %s",
+                exc,
+            )
     return metadata
 
 
@@ -762,12 +791,13 @@ def run_benchmark_via_service(
     timeout_seconds: float = 60.0,
     cancel_event: Any | None = None,
     partial_dir: str | Path | None = None,
+    entity_extractor: Callable[[str], list[dict]] | None = None,
+    profile: str = "standard",
     # ---- Legacy parameters not yet ported to BenchmarkService ----
-    # Sprint D.2 of the v2.0 plan -
-    #
+    # Sprint D.2 of the v2.0 plan, remaining marginal features:
+    # ``max_workers`` (the rewrite has its own max_in_flight via
+    # ``CorpusRunner``).
     max_workers: int = 4,  # noqa: ARG001
-    entity_extractor: Any | None = None,  # noqa: ARG001
-    profile: str = "standard",  # noqa: ARG001
 ) -> Any:
     """Compatibility adapter: legacy ``run_benchmark`` →
     ``BenchmarkService`` rewrite.
@@ -793,13 +823,27 @@ def run_benchmark_via_service(
     Deferred scope (D.2)
     --------------------
     The following parameters are **accepted but ignored** in
-    this MVP -
-    Sprint D.2:
+    this MVP; the rewrite handles these aspects natively:
 
     - ``show_progress`` (tqdm),
-    - ``max_workers`` (
-
-
+    - ``max_workers`` (the ``CorpusRunner`` rewrite has its own
+      ``max_in_flight``, set to 2 by default).
+
+    Measurement profile (D.2.f)
+    ---------------------------
+    ``profile`` is validated at startup via
+    ``picarones.evaluation.metric_hooks.validate_profile``. An
+    unknown profile raises ``PicaronesError``. The value does not
+    yet affect the document-level hooks (that would be the subject
+    of a later sprint, outside the v2.0 scope).
+
+    NER attach (D.2.e)
+    ------------------
+    If ``entity_extractor`` is provided, once the ``DocumentResult``
+    objects are computed, the service calls the extractor on each
+    OCR hypothesis for documents whose GT has an ``ENTITIES``
+    level, then attaches the NER metrics (``ner_metrics``
+    per document, ``aggregated_ner`` at engine level).
 
     Resume after interruption (D.2.b)
     ---------------------------------
@@ -851,6 +895,13 @@ def run_benchmark_via_service(
         If the engines do not all declare a unique ``name``
         (cf. ``build_adapter_resolver``).
     """
+    # D.2.f: validate ``profile`` early; an unknown name raises
+    # ``PicaronesError`` before the benchmark starts, rather than
+    # degrading silently later on.
+    from picarones.evaluation.metric_hooks import validate_profile
+
+    validate_profile(profile)
+
     if code_version is None:
         # The architecture scanner rejects ``from picarones import __version__``
         # because it classifies ``picarones`` (without a sub-package) as a
@@ -887,6 +938,15 @@ def run_benchmark_via_service(
         cancel_event=cancel_event,
     )
 
+    # D.2.e: NER attach post-processing. Idempotent: recomputed on
+    # every run, even in resume mode (ner_metrics are not
+    # persisted in the partial NDJSON, consistent with the legacy
+    # runner, which computed NER after the document loop).
+    if entity_extractor is not None:
+        _attach_ner_metrics_to_benchmark(
+            benchmark_result, corpus, entity_extractor,
+        )
+
     # Optional JSON serialization
     if output_json is not None:
         _persist_benchmark_result_json(benchmark_result, Path(output_json))
@@ -894,6 +954,138 @@ def run_benchmark_via_service(
     return benchmark_result
 
 
+def _attach_ner_metrics_to_benchmark(
+    benchmark_result: Any,
+    corpus: "Corpus",
+    entity_extractor: Callable[[str], list[dict]],
+) -> None:
+    """Sprint D.2.e: compute and attach NER metrics post-bench.
+
+    Walks the ``DocumentResult`` objects of each ``EngineReport`` and,
+    for each document whose GT has an ``ENTITIES`` level,
+    calls ``entity_extractor(hypothesis)`` then
+    ``compute_ner_metrics`` against the GT entities. The
+    result is attached to ``dr.ner_metrics``. Per-engine
+    aggregates are computed via ``_aggregate_ner_metrics`` and
+    stored on ``EngineReport.aggregated_ner``.
+
+    Tolerance: an extraction or computation failure on a specific
+    document is downgraded to a warning; the benchmark is not
+    interrupted.
+    """
+    from picarones.domain.artifacts import ArtifactType
+    from picarones.evaluation.metrics.ner import compute_ner_metrics
+
+    docs_by_id = {d.doc_id: d for d in corpus.documents}
+
+    for report in benchmark_result.engine_reports:
+        n_done = 0
+        for dr in report.document_results:
+            if dr.engine_error is not None or not dr.hypothesis:
+                continue
+            doc = docs_by_id.get(dr.doc_id)
+            if doc is None or not doc.has_gt(ArtifactType.ENTITIES):
+                continue
+            try:
+                gt_payload = doc.get_gt(ArtifactType.ENTITIES)
+                gt_entities = (
+                    list(gt_payload.entities) if gt_payload else []
+                )
+                hyp_entities = entity_extractor(dr.hypothesis) or []
+                dr.ner_metrics = compute_ner_metrics(
+                    gt_entities, hyp_entities,
+                )
+                n_done += 1
+            except Exception as exc:  # noqa: BLE001
+                logger.warning(
+                    "[ner.attach] %s/%s : extraction/comparaison "
+                    "NER dégradée : %s",
+                    report.engine_name, dr.doc_id, exc,
+                )
+
+        if n_done > 0:
+            report.aggregated_ner = _aggregate_ner_metrics(
+                report.document_results,
+            )
+            logger.info(
+                "[ner] %d documents évalués pour engine '%s'.",
+                n_done, report.engine_name,
+            )
+
+
+def _aggregate_ner_metrics(doc_results: list) -> dict | None:
+    """Sprint D.2.e: aggregate ``ner_metrics`` at engine level.
+
+    Recomputes *micro* precision/recall/F1 from the global
+    TP/FP/FN sums, plus the per-category breakdown and the
+    total counts of hallucinated and missed entities.
+
+    Functional equivalent of
+    ``picarones.measurements.runner.ner_attach._aggregate_ner``
+    (legacy code removed in D.6.b).
+    """
+    relevant = [
+        dr for dr in doc_results if dr.ner_metrics is not None
+    ]
+    if not relevant:
+        return None
+
+    total_tp = 0
+    total_fp = 0
+    total_fn = 0
+    cat_tp: dict[str, int] = {}
+    cat_fp: dict[str, int] = {}
+    cat_fn: dict[str, int] = {}
+    total_hallucinated = 0
+    total_missed = 0
+    iou_threshold = 0.5
+
+    for dr in relevant:
+        m = dr.ner_metrics
+        total_tp += int(m.get("true_positives", 0))
+        total_fp += int(m.get("false_positives", 0))
+        total_fn += int(m.get("false_negatives", 0))
+        total_hallucinated += len(m.get("hallucinated_entities", []) or [])
+        total_missed += len(m.get("missed_entities", []) or [])
+        iou_threshold = float(m.get("iou_threshold", iou_threshold))
+        for cat, stats in (m.get("per_category") or {}).items():
+            cat_tp.setdefault(cat, 0)
+            cat_fp.setdefault(cat, 0)
+            cat_fn.setdefault(cat, 0)
+            support = int(stats.get("support", 0))
+            recall = float(stats.get("recall", 0.0))
+            precision = float(stats.get("precision", 0.0))
+            tp_cat = round(support * recall) if support > 0 else 0
+            fn_cat = max(0, support - tp_cat)
+            fp_cat = (
+                round(tp_cat * (1 - precision) / precision)
+                if precision > 0 else 0
+            )
+            cat_tp[cat] += tp_cat
+            cat_fp[cat] += fp_cat
+            cat_fn[cat] += fn_cat
+
+    def _prf(tp: int, fp: int, fn: int) -> dict[str, float]:
+        p = tp / (tp + fp) if (tp + fp) > 0 else 0.0
+        r = tp / (tp + fn) if (tp + fn) > 0 else 0.0
+        f1 = 2 * p * r / (p + r) if (p + r) > 0 else 0.0
+        return {
+            "precision": p, "recall": r, "f1": f1, "support": tp + fn,
+        }
+
+    return {
+        "global": _prf(total_tp, total_fp, total_fn),
+        "per_category": {
+            cat: _prf(cat_tp[cat], cat_fp[cat], cat_fn[cat])
+            for cat in sorted(cat_tp)
+        },
+        "n_documents": len(relevant),
+        "total_hallucinated": total_hallucinated,
+        "total_missed": total_missed,
+        "iou_threshold": iou_threshold,
+    }
+
+
 def _run_benchmark_unified(
     *,
     corpus: "Corpus",
@@ -0,0 +1,464 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""Sprint D.2.c-f — features additionnelles dans
|
| 2 |
+
``run_benchmark_via_service``.
|
| 3 |
+
|
| 4 |
+
Couvre les paramètres legacy auparavant ignorés :
|
| 5 |
+
|
| 6 |
+
- D.2.c (``output_json``) : déjà actif depuis D.1.d, couvert par
|
| 7 |
+
``test_sprint_d_legacy_runner_adapter::test_output_json_persists_to_disk``.
|
| 8 |
+
- D.2.d (``over_normalization``) : pour les pipelines OCR+LLM avec
|
| 9 |
+
étape OCR amont, ``DocumentResult.pipeline_metadata`` porte
|
| 10 |
+
désormais une clé ``over_normalization``.
|
| 11 |
+
- D.2.e (``entity_extractor``) : pour les documents avec une GT
|
| 12 |
+
``ENTITIES``, les métriques NER sont calculées + attachées.
|
| 13 |
+
- D.2.f (``profile``) : un profil inconnu lève ``PicaronesError``
|
| 14 |
+
au démarrage du bench.
|
| 15 |
+
"""
|
| 16 |
+
|
| 17 |
+
from __future__ import annotations
|
| 18 |
+
|
| 19 |
+
from pathlib import Path
|
| 20 |
+
|
| 21 |
+
import pytest
|
| 22 |
+
|
| 23 |
+
from picarones.adapters.legacy_engines.base import BaseOCREngine
|
| 24 |
+
from picarones.adapters.llm.base import BaseLLMAdapter
|
| 25 |
+
from picarones.app.services._legacy_runner_adapter import (
|
| 26 |
+
_aggregate_ner_metrics,
|
| 27 |
+
run_benchmark_via_service,
|
| 28 |
+
)
|
| 29 |
+
from picarones.evaluation.corpus import (
|
| 30 |
+
Corpus,
|
| 31 |
+
Document,
|
| 32 |
+
EntitiesGT,
|
| 33 |
+
)
|
| 34 |
+
|
| 35 |
+
|
| 36 |
+
# ──────────────────────────────────────────────────────────────────────
|
| 37 |
+
# Mocks
|
| 38 |
+
# ──────────────────────────────────────────────────────────────────────
|
| 39 |
+
|
| 40 |
+
|
| 41 |
+
class _MockOCR(BaseOCREngine):
|
| 42 |
+
def __init__(self, name: str = "mock_ocr", text: str = "ocr") -> None:
|
| 43 |
+
super().__init__(config={})
|
| 44 |
+
self._name = name
|
| 45 |
+
self._text = text
|
| 46 |
+
|
| 47 |
+
@property
|
| 48 |
+
def name(self) -> str: # type: ignore[override]
|
| 49 |
+
return self._name
|
| 50 |
+
|
| 51 |
+
def version(self) -> str:
|
| 52 |
+
return "1.0"
|
| 53 |
+
|
| 54 |
+
def _run_ocr(self, image_path):
|
| 55 |
+
return self._text
|
| 56 |
+
|
| 57 |
+
|
| 58 |
+
class _MockLLM(BaseLLMAdapter):
|
| 59 |
+
def __init__(self, model: str = "mock-1", text: str = "corrected") -> None:
|
| 60 |
+
super().__init__(model=model, config={})
|
| 61 |
+
self._text = text
|
| 62 |
+
|
| 63 |
+
@property
|
| 64 |
+
def name(self) -> str:
|
| 65 |
+
return "mock_llm"
|
| 66 |
+
|
| 67 |
+
@property
|
| 68 |
+
def default_model(self) -> str:
|
| 69 |
+
return "mock-1"
|
| 70 |
+
|
| 71 |
+
def _call(self, prompt, image_b64=None):
|
| 72 |
+
return self._text
|
| 73 |
+
|
| 74 |
+
|
| 75 |
+
def _make_simple_corpus(tmp_path: Path, n: int = 1) -> Corpus:
|
| 76 |
+
docs = []
|
| 77 |
+
for i in range(n):
|
| 78 |
+
img = tmp_path / f"doc{i}.png"
|
| 79 |
+
img.write_bytes(b"x")
|
| 80 |
+
docs.append(Document(
|
| 81 |
+
image_path=img,
|
| 82 |
+
ground_truth=f"texte {i}",
|
| 83 |
+
doc_id=f"doc{i}",
|
| 84 |
+
))
|
| 85 |
+
return Corpus(name="cdef_test", documents=docs)
|
| 86 |
+
|
| 87 |
+
|
| 88 |
+
# ──────────────────────────────────────────────────────────────────────
|
| 89 |
+
# D.2.f — profile validation
|
| 90 |
+
# ──────────────────────────────────────────────────────────────────────
|
| 91 |
+
|
| 92 |
+
|
| 93 |
+
class TestProfileValidation:
|
| 94 |
+
"""Sprint D.2.f — ``profile`` est validé au démarrage."""
|
| 95 |
+
|
| 96 |
+
def test_unknown_profile_raises(self, tmp_path: Path) -> None:
|
| 97 |
+
corpus = _make_simple_corpus(tmp_path)
|
| 98 |
+
ocr = _MockOCR()
|
| 99 |
+
|
| 100 |
+
with pytest.raises(ValueError, match="profil"):
|
| 101 |
+
run_benchmark_via_service(
|
| 102 |
+
corpus, [ocr], profile="not_a_real_profile",
|
| 103 |
+
)
|
| 104 |
+
|
| 105 |
+
def test_standard_profile_accepted(self, tmp_path: Path) -> None:
|
| 106 |
+
corpus = _make_simple_corpus(tmp_path)
|
| 107 |
+
ocr = _MockOCR()
|
| 108 |
+
bm = run_benchmark_via_service(corpus, [ocr], profile="standard")
|
| 109 |
+
assert bm.engine_reports
|
| 110 |
+
|
| 111 |
+
def test_default_profile_is_standard(self, tmp_path: Path) -> None:
|
| 112 |
+
"""Pas de kwarg = utilise ``standard``, qui passe la validation."""
|
| 113 |
+
corpus = _make_simple_corpus(tmp_path)
|
| 114 |
+
ocr = _MockOCR()
|
| 115 |
+
bm = run_benchmark_via_service(corpus, [ocr])
|
| 116 |
+
assert bm.engine_reports
|
| 117 |
+
|
| 118 |
+
def test_validation_happens_before_bench(self, tmp_path: Path) -> None:
|
| 119 |
+
"""Le profil invalide lève AVANT toute exécution OCR (sinon on
|
| 120 |
+
gâche du temps de calcul pour un nom mal orthographié)."""
|
| 121 |
+
corpus = _make_simple_corpus(tmp_path)
|
| 122 |
+
|
| 123 |
+
call_counter = {"n": 0}
|
| 124 |
+
|
| 125 |
+
class _CountingOCR(_MockOCR):
|
| 126 |
+
def _run_ocr(self, image_path):
|
| 127 |
+
call_counter["n"] += 1
|
| 128 |
+
return "ocr"
|
| 129 |
+
|
| 130 |
+
ocr = _CountingOCR()
|
| 131 |
+
with pytest.raises(ValueError):
|
| 132 |
+
run_benchmark_via_service(
|
| 133 |
+
corpus, [ocr], profile="oops",
|
| 134 |
+
)
|
| 135 |
+
# OCR jamais appelé.
|
| 136 |
+
assert call_counter["n"] == 0
|
| 137 |
+
|
| 138 |
+
|
| 139 |
+
# ──────────────────────────────────────────────────────────────────────
|
| 140 |
+
# D.2.d — over_normalization
|
| 141 |
+
# ──────────────────────────────────────────────────────────────────────
|
| 142 |
+
|
| 143 |
+
|
| 144 |
+
class TestOverNormalization:
|
| 145 |
+
"""Sprint D.2.d: OCR+LLM pipelines with an upstream OCR have
|
| 146 |
+
an ``over_normalization`` key in ``pipeline_metadata``."""
|
| 147 |
+
|
| 148 |
+
def test_ocr_only_has_no_over_normalization(self, tmp_path: Path) -> None:
|
| 149 |
+
"""A standalone OCR engine (no pipeline) has no
|
| 150 |
+
``over_normalization`` since there is no LLM."""
|
| 151 |
+
corpus = _make_simple_corpus(tmp_path)
|
| 152 |
+
ocr = _MockOCR(text="texte 0")
|
| 153 |
+
bm = run_benchmark_via_service(corpus, [ocr])
|
| 154 |
+
|
| 155 |
+
dr = bm.engine_reports[0].document_results[0]
|
| 156 |
+
assert "over_normalization" not in dr.pipeline_metadata
|
| 157 |
+
|
| 158 |
+
def test_pipeline_text_only_computes_over_normalization(
|
| 159 |
+
self, tmp_path: Path,
|
| 160 |
+
) -> None:
|
| 161 |
+
"""OCR+LLM pipeline in ``text_only`` mode: the LLM receives the
|
| 162 |
+
OCR text and corrects it. ``over_normalization`` must
|
| 163 |
+
appear in pipeline_metadata."""
|
| 164 |
+
from picarones.adapters.legacy_pipelines.base import (
|
| 165 |
+
OCRLLMPipeline,
|
| 166 |
+
PipelineMode,
|
| 167 |
+
)
|
| 168 |
+
|
| 169 |
+
corpus = _make_simple_corpus(tmp_path)
|
| 170 |
+
ocr = _MockOCR(name="upstream_ocr", text="texto 0") # 1 error
|
| 171 |
+
llm = _MockLLM(model="m1", text="texte 0") # corrects it
|
| 172 |
+
pipeline = OCRLLMPipeline(
|
| 173 |
+
ocr_engine=ocr,
|
| 174 |
+
llm_adapter=llm,
|
| 175 |
+
mode=PipelineMode.TEXT_ONLY,
|
| 176 |
+
)
|
| 177 |
+
|
| 178 |
+
bm = run_benchmark_via_service(corpus, [pipeline])
|
| 179 |
+
|
| 180 |
+
dr = bm.engine_reports[0].document_results[0]
|
| 181 |
+
assert dr.pipeline_metadata.get("is_pipeline") is True
|
| 182 |
+
assert "over_normalization" in dr.pipeline_metadata
|
| 183 |
+
# The payload is a dict produced by OverNormalizationResult.as_dict().
|
| 184 |
+
ov = dr.pipeline_metadata["over_normalization"]
|
| 185 |
+
assert isinstance(ov, dict)
|
| 186 |
+
|
| 187 |
+
def test_pipeline_zero_shot_has_no_over_normalization(
|
| 188 |
+
self, tmp_path: Path,
|
| 189 |
+
) -> None:
|
| 190 |
+
"""Zero-shot pipeline: the VLM receives the image directly, with no
|
| 191 |
+
upstream OCR, so there is no ``ocr_intermediate`` and no
|
| 192 |
+
``over_normalization``."""
|
| 193 |
+
from picarones.adapters.legacy_pipelines.base import (
|
| 194 |
+
OCRLLMPipeline,
|
| 195 |
+
PipelineMode,
|
| 196 |
+
)
|
| 197 |
+
|
| 198 |
+
corpus = _make_simple_corpus(tmp_path)
|
| 199 |
+
llm = _MockLLM(model="vlm-1", text="texte 0")
|
| 200 |
+
pipeline = OCRLLMPipeline(
|
| 201 |
+
llm_adapter=llm,
|
| 202 |
+
mode=PipelineMode.ZERO_SHOT,
|
| 203 |
+
)
|
| 204 |
+
|
| 205 |
+
bm = run_benchmark_via_service(corpus, [pipeline])
|
| 206 |
+
dr = bm.engine_reports[0].document_results[0]
|
| 207 |
+
# A pipeline, but with no upstream OCR, hence no over_normalization.
|
| 208 |
+
assert "over_normalization" not in dr.pipeline_metadata
|
| 209 |
+
|
| 210 |
+
|
| 211 |
+
# ──────────────────────────────────────────────────────────────────────
|
| 212 |
+
# D.2.e — NER attach via entity_extractor
|
| 213 |
+
# ──────────────────────────────────────────────────────────────────────
|
| 214 |
+
|
| 215 |
+
|
| 216 |
+
class TestNERAttach:
|
| 217 |
+
"""Sprint D.2.e: when ``entity_extractor`` is provided,
|
| 218 |
+
documents with an ``ENTITIES`` GT receive ``ner_metrics``
|
| 219 |
+
and the engine report gets an ``aggregated_ner``."""
|
| 220 |
+
|
| 221 |
+
def _make_corpus_with_entities(
|
| 222 |
+
self, tmp_path: Path, n: int = 2,
|
| 223 |
+
) -> Corpus:
|
| 224 |
+
from picarones.domain.artifacts import ArtifactType
|
| 225 |
+
|
| 226 |
+
docs = []
|
| 227 |
+
for i in range(n):
|
| 228 |
+
img = tmp_path / f"d{i}.png"
|
| 229 |
+
img.write_bytes(b"x")
|
| 230 |
+
doc = Document(
|
| 231 |
+
image_path=img,
|
| 232 |
+
ground_truth=f"Jean {i} habite Paris",
|
| 233 |
+
doc_id=f"d{i}",
|
| 234 |
+
)
|
| 235 |
+
doc.ground_truths[ArtifactType.ENTITIES] = EntitiesGT(
|
| 236 |
+
entities=[
|
| 237 |
+
{"label": "PER", "start": 0, "end": 5 + len(str(i)),
|
| 238 |
+
"text": f"Jean {i}"},
|
| 239 |
+
{"label": "LOC", "start": 13 + len(str(i)),
|
| 240 |
+
"end": 18 + len(str(i)), "text": "Paris"},
|
| 241 |
+
],
|
| 242 |
+
)
|
| 243 |
+
docs.append(doc)
|
| 244 |
+
return Corpus(name="ner_test", documents=docs)
|
| 245 |
+
|
| 246 |
+
def test_no_extractor_no_ner_metrics(self, tmp_path: Path) -> None:
|
| 247 |
+
corpus = self._make_corpus_with_entities(tmp_path)
|
| 248 |
+
ocr = _MockOCR(text="Jean 0 habite Paris")
|
| 249 |
+
|
| 250 |
+
bm = run_benchmark_via_service(corpus, [ocr])
|
| 251 |
+
report = bm.engine_reports[0]
|
| 252 |
+
for dr in report.document_results:
|
| 253 |
+
assert dr.ner_metrics is None
|
| 254 |
+
assert report.aggregated_ner is None
|
| 255 |
+
|
| 256 |
+
def test_extractor_attaches_metrics_to_doc(self, tmp_path: Path) -> None:
|
| 257 |
+
"""When the extractor returns entities for the hypothesis,
|
| 258 |
+
``ner_metrics`` is set on the DocumentResult."""
|
| 259 |
+
corpus = self._make_corpus_with_entities(tmp_path)
|
| 260 |
+
ocr = _MockOCR(text="Jean 0 habite Paris") # perfect match
|
| 261 |
+
|
| 262 |
+
def extractor(text: str) -> list[dict]:
|
| 263 |
+
# Reproduces the GT entities on the hypothesis.
|
| 264 |
+
ents = []
|
| 265 |
+
if "Jean 0" in text:
|
| 266 |
+
ents.append({"label": "PER", "start": 0, "end": 6,
|
| 267 |
+
"text": "Jean 0"})
|
| 268 |
+
if "Paris" in text:
|
| 269 |
+
idx = text.find("Paris")
|
| 270 |
+
ents.append({"label": "LOC", "start": idx,
|
| 271 |
+
"end": idx + 5, "text": "Paris"})
|
| 272 |
+
return ents
|
| 273 |
+
|
| 274 |
+
bm = run_benchmark_via_service(
|
| 275 |
+
corpus, [ocr], entity_extractor=extractor,
|
| 276 |
+
)
|
| 277 |
+
|
| 278 |
+
report = bm.engine_reports[0]
|
| 279 |
+
d0 = next(d for d in report.document_results if d.doc_id == "d0")
|
| 280 |
+
assert d0.ner_metrics is not None
|
| 281 |
+
# The entities match, so tp > 0.
|
| 282 |
+
assert d0.ner_metrics["true_positives"] > 0
|
| 283 |
+
|
| 284 |
+
def test_aggregated_ner_present_when_any_doc_evaluated(
|
| 285 |
+
self, tmp_path: Path,
|
| 286 |
+
) -> None:
|
| 287 |
+
corpus = self._make_corpus_with_entities(tmp_path)
|
| 288 |
+
ocr = _MockOCR(text="Jean 0 habite Paris")
|
| 289 |
+
|
| 290 |
+
def extractor(text: str) -> list[dict]:
|
| 291 |
+
return [{"label": "PER", "start": 0, "end": 6, "text": "Jean 0"}]
|
| 292 |
+
|
| 293 |
+
bm = run_benchmark_via_service(
|
| 294 |
+
corpus, [ocr], entity_extractor=extractor,
|
| 295 |
+
)
|
| 296 |
+
|
| 297 |
+
report = bm.engine_reports[0]
|
| 298 |
+
assert report.aggregated_ner is not None
|
| 299 |
+
assert "global" in report.aggregated_ner
|
| 300 |
+
assert "precision" in report.aggregated_ner["global"]
|
| 301 |
+
|
| 302 |
+
def test_doc_without_entities_gt_skipped(self, tmp_path: Path) -> None:
|
| 303 |
+
"""A document without an ``ENTITIES`` GT is not NER-evaluated:
|
| 304 |
+
``ner_metrics`` stays ``None`` even when an extractor is
|
| 305 |
+
provided."""
|
| 306 |
+
# Mixed corpus: 1 doc with ENTITIES, 1 without.
|
| 307 |
+
from picarones.domain.artifacts import ArtifactType
|
| 308 |
+
|
| 309 |
+
img1 = tmp_path / "d1.png"
|
| 310 |
+
img1.write_bytes(b"x")
|
| 311 |
+
doc_with = Document(
|
| 312 |
+
image_path=img1, ground_truth="Jean", doc_id="with_ent",
|
| 313 |
+
)
|
| 314 |
+
doc_with.ground_truths[ArtifactType.ENTITIES] = EntitiesGT(
|
| 315 |
+
entities=[{"label": "PER", "start": 0, "end": 4, "text": "Jean"}],
|
| 316 |
+
)
|
| 317 |
+
|
| 318 |
+
img2 = tmp_path / "d2.png"
|
| 319 |
+
img2.write_bytes(b"x")
|
| 320 |
+
doc_without = Document(
|
| 321 |
+
image_path=img2, ground_truth="rien", doc_id="without_ent",
|
| 322 |
+
)
|
| 323 |
+
|
| 324 |
+
corpus = Corpus(
|
| 325 |
+
name="mixed", documents=[doc_with, doc_without],
|
| 326 |
+
)
|
| 327 |
+
ocr = _MockOCR(text="Jean")
|
| 328 |
+
|
| 329 |
+
def extractor(text: str) -> list[dict]:
|
| 330 |
+
return [{"label": "PER", "start": 0, "end": 4, "text": "Jean"}]
|
| 331 |
+
|
| 332 |
+
bm = run_benchmark_via_service(
|
| 333 |
+
corpus, [ocr], entity_extractor=extractor,
|
| 334 |
+
)
|
| 335 |
+
|
| 336 |
+
report = bm.engine_reports[0]
|
| 337 |
+
d_with = next(
|
| 338 |
+
d for d in report.document_results if d.doc_id == "with_ent"
|
| 339 |
+
)
|
| 340 |
+
d_without = next(
|
| 341 |
+
d for d in report.document_results if d.doc_id == "without_ent"
|
| 342 |
+
)
|
| 343 |
+
|
| 344 |
+
assert d_with.ner_metrics is not None
|
| 345 |
+
assert d_without.ner_metrics is None
|
| 346 |
+
|
| 347 |
+
def test_extractor_exception_does_not_crash_bench(
|
| 348 |
+
self, tmp_path: Path, caplog: pytest.LogCaptureFixture,
|
| 349 |
+
) -> None:
|
| 350 |
+
corpus = self._make_corpus_with_entities(tmp_path, n=1)
|
| 351 |
+
ocr = _MockOCR(text="Jean 0 habite Paris")
|
| 352 |
+
|
| 353 |
+
def buggy_extractor(text: str) -> list[dict]:
|
| 354 |
+
raise RuntimeError("NER backend down")
|
| 355 |
+
|
| 356 |
+
with caplog.at_level("WARNING"):
|
| 357 |
+
bm = run_benchmark_via_service(
|
| 358 |
+
corpus, [ocr], entity_extractor=buggy_extractor,
|
| 359 |
+
)
|
| 360 |
+
|
| 361 |
+
report = bm.engine_reports[0]
|
| 362 |
+
# The benchmark completed; no exception was propagated.
|
| 363 |
+
assert len(report.document_results) == 1
|
| 364 |
+
# ner_metrics was not attached because the extractor crashed.
|
| 365 |
+
assert report.document_results[0].ner_metrics is None
|
| 366 |
+
|
| 367 |
+
|
| 368 |
+
# ──────────────────────────────────────────────────────────────────────
|
| 369 |
+
# D.2.e: NER aggregation (internal helper, tested directly)
|
| 370 |
+
# ──────────────────────────────────────────────────────────────────────
|
| 371 |
+
|
| 372 |
+
|
| 373 |
+
class TestAggregateNERMetrics:
|
| 374 |
+
"""Unit tests for ``_aggregate_ner_metrics``, the functional
|
| 375 |
+
equivalent of the former ``measurements.runner.ner_attach._aggregate_ner``."""
|
| 376 |
+
|
| 377 |
+
def test_empty_returns_none(self) -> None:
|
| 378 |
+
from picarones.evaluation.benchmark_result import (
|
| 379 |
+
DocumentResult,
|
| 380 |
+
)
|
| 381 |
+
from picarones.evaluation.metric_result import MetricsResult
|
| 382 |
+
|
| 383 |
+
# No ner_metrics on any doc.
|
| 384 |
+
drs = [
|
| 385 |
+
DocumentResult(
|
| 386 |
+
doc_id="d", image_path="", ground_truth="",
|
| 387 |
+
hypothesis="", metrics=MetricsResult(), duration_seconds=0,
|
| 388 |
+
),
|
| 389 |
+
]
|
| 390 |
+
assert _aggregate_ner_metrics(drs) is None
|
| 391 |
+
|
| 392 |
+
def test_aggregates_global_prf(self) -> None:
|
| 393 |
+
from picarones.evaluation.benchmark_result import (
|
| 394 |
+
DocumentResult,
|
| 395 |
+
)
|
| 396 |
+
from picarones.evaluation.metric_result import MetricsResult
|
| 397 |
+
|
| 398 |
+
dr1 = DocumentResult(
|
| 399 |
+
doc_id="d1", image_path="", ground_truth="",
|
| 400 |
+
hypothesis="", metrics=MetricsResult(), duration_seconds=0,
|
| 401 |
+
)
|
| 402 |
+
dr1.ner_metrics = {
|
| 403 |
+
"true_positives": 5,
|
| 404 |
+
"false_positives": 1,
|
| 405 |
+
"false_negatives": 2,
|
| 406 |
+
"per_category": {},
|
| 407 |
+
"hallucinated_entities": [],
|
| 408 |
+
"missed_entities": [],
|
| 409 |
+
}
|
| 410 |
+
dr2 = DocumentResult(
|
| 411 |
+
doc_id="d2", image_path="", ground_truth="",
|
| 412 |
+
hypothesis="", metrics=MetricsResult(), duration_seconds=0,
|
| 413 |
+
)
|
| 414 |
+
dr2.ner_metrics = {
|
| 415 |
+
"true_positives": 3,
|
| 416 |
+
"false_positives": 0,
|
| 417 |
+
"false_negatives": 1,
|
| 418 |
+
"per_category": {},
|
| 419 |
+
"hallucinated_entities": [],
|
| 420 |
+
"missed_entities": [],
|
| 421 |
+
}
|
| 422 |
+
|
| 423 |
+
agg = _aggregate_ner_metrics([dr1, dr2])
|
| 424 |
+
|
| 425 |
+
assert agg is not None
|
| 426 |
+
# tp=8, fp=1, fn=3 → P=8/9, R=8/11, F1=2*P*R/(P+R)
|
| 427 |
+
assert agg["global"]["precision"] == pytest.approx(8 / 9, abs=1e-4)
|
| 428 |
+
assert agg["global"]["recall"] == pytest.approx(8 / 11, abs=1e-4)
|
| 429 |
+
assert agg["n_documents"] == 2
|
| 430 |
+
|
| 431 |
+
def test_per_category_aggregation(self) -> None:
|
| 432 |
+
from picarones.evaluation.benchmark_result import (
|
| 433 |
+
DocumentResult,
|
| 434 |
+
)
|
| 435 |
+
from picarones.evaluation.metric_result import MetricsResult
|
| 436 |
+
|
| 437 |
+
dr = DocumentResult(
|
| 438 |
+
doc_id="d", image_path="", ground_truth="",
|
| 439 |
+
hypothesis="", metrics=MetricsResult(), duration_seconds=0,
|
| 440 |
+
)
|
| 441 |
+
dr.ner_metrics = {
|
| 442 |
+
"true_positives": 4,
|
| 443 |
+
"false_positives": 1,
|
| 444 |
+
"false_negatives": 1,
|
| 445 |
+
"per_category": {
|
| 446 |
+
"PER": {
|
| 447 |
+
"support": 3, "recall": 1.0, "precision": 1.0,
|
| 448 |
+
"f1": 1.0,
|
| 449 |
+
},
|
| 450 |
+
"LOC": {
|
| 451 |
+
"support": 2, "recall": 0.5, "precision": 0.5,
|
| 452 |
+
"f1": 0.5,
|
| 453 |
+
},
|
| 454 |
+
},
|
| 455 |
+
"hallucinated_entities": [],
|
| 456 |
+
"missed_entities": [],
|
| 457 |
+
}
|
| 458 |
+
|
| 459 |
+
agg = _aggregate_ner_metrics([dr])
|
| 460 |
+
|
| 461 |
+
assert "PER" in agg["per_category"]
|
| 462 |
+
assert "LOC" in agg["per_category"]
|
| 463 |
+
# PER: 3/3 → P=R=F1=1.0
|
| 464 |
+
assert agg["per_category"]["PER"]["recall"] == pytest.approx(1.0)
|
|
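The micro-averaged aggregation exercised by `test_aggregates_global_prf` can be sketched as follows. This is a minimal illustration, not the actual `_aggregate_ner_metrics` implementation: the helper name `micro_prf` and its return shape are assumptions, and only the `true_positives` / `false_positives` / `false_negatives` keys are taken from the tests above. Counts are summed across documents first, and precision, recall, and F1 are derived from those global sums.

```python
from typing import Optional


def micro_prf(per_doc_counts: list[dict]) -> Optional[dict]:
    """Hypothetical helper: micro-averaged P/R/F1 from per-document counts."""
    if not per_doc_counts:
        # Nothing was evaluated, so there is nothing to aggregate.
        return None

    # Sum the raw counts over all documents before computing any ratio.
    tp = sum(c["true_positives"] for c in per_doc_counts)
    fp = sum(c["false_positives"] for c in per_doc_counts)
    fn = sum(c["false_negatives"] for c in per_doc_counts)

    # Guard against empty denominators (no predictions or no GT entities).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "n_documents": len(per_doc_counts),
    }


# Same counts as dr1 and dr2 in the test: tp=8, fp=1, fn=3.
result = micro_prf([
    {"true_positives": 5, "false_positives": 1, "false_negatives": 2},
    {"true_positives": 3, "false_positives": 0, "false_negatives": 1},
])
```

With these inputs the sums are tp=8, fp=1, fn=3, which gives P=8/9 and R=8/11, matching the values asserted in the test. Micro-averaging weights every entity equally, so a document with many entities influences the result more than a short one.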
@@ -43,7 +43,10 @@ FILE_BUDGETS: dict[str, int] = {
|
|
| 43 |
# removed in H.4 along with interfaces/{cli,web}/_legacy/.
|
| 44 |
# Sprint D.2.b added ~260 LOC for the resumable branch
|
| 45 |
# (``_run_benchmark_with_partial``).
|
| 46 |
-
|
| 47 |
# --- God-modules: current budget + 15% margin.
|
| 48 |
# Shrinking them will be handled in a dedicated refactoring sprint.
|
| 49 |
# statistics.py (1128 lines) was split into a sub-package
|
|
|
|
| 43 |
# removed in H.4 along with interfaces/{cli,web}/_legacy/.
|
| 44 |
# Sprint D.2.b added ~260 LOC for the resumable branch
|
| 45 |
# (``_run_benchmark_with_partial``).
|
| 46 |
+
# Sprint D.2.c-f added ~190 LOC: NER attach (post-processing +
|
| 47 |
+
# _aggregate_ner_metrics) + over_normalization in
|
| 48 |
+
# _build_pipeline_metadata + validate_profile.
|
| 49 |
+
"picarones/app/services/_legacy_runner_adapter.py": 1700, # currently 1461
|
| 50 |
# --- God-modules: current budget + 15% margin.
|
| 51 |
# Shrinking them will be handled in a dedicated refactoring sprint.
|
| 52 |
# statistics.py (1128 lines) was split into a sub-package
|