Spaces: Ma-Ri-Ba-Ku/Picarones

Marcel Bautista-Kuljevan committed on May 3

Commit 9993409 (unverified) · 2 Parent(s): 6221160 31ef91a

Merge pull request #54 from maribakulj/claude/repo-analysis-a319T

This view is limited to 50 files because it contains too many changes.

Files changed (50)

CLAUDE.md +0 -0
README.md +1 -1
SPECS.md +1 -1
docs/architecture.md +37 -1
docs/cli-workflows.md +1 -1
docs/developer/index.md +10 -2
docs/profiles.md +1 -1
docs/roadmap/evolution-2026.md +2 -2
docs/user/writing-a-pipeline-module.md +1 -1
picarones/measurements/__init__.py +25 -0
picarones/measurements/runner/__init__.py +103 -0
picarones/measurements/runner/aggregation.py +82 -0
picarones/measurements/runner/document.py +190 -0
picarones/measurements/runner/ner_attach.py +133 -0
picarones/measurements/{runner.py → runner/orchestration.py} +33 -558
picarones/measurements/runner/partial.py +140 -0
picarones/measurements/runner/workers.py +107 -0
picarones/measurements/statistics.py +0 -1128
picarones/measurements/statistics/__init__.py +82 -0
picarones/measurements/statistics/bootstrap.py +47 -0
picarones/measurements/statistics/cdd_render.py +171 -0
picarones/measurements/statistics/clustering.py +158 -0
picarones/measurements/statistics/correlation.py +75 -0
picarones/measurements/statistics/distributions.py +88 -0
picarones/measurements/statistics/friedman_nemenyi.py +350 -0
picarones/measurements/statistics/pareto.py +87 -0
picarones/measurements/statistics/wilcoxon.py +227 -0
picarones/report/assets.py +203 -0
picarones/report/calibration_render.py +2 -16
picarones/report/error_absorption_render.py +17 -51
picarones/report/generator.py +178 -775
picarones/report/image_predictive_render.py +4 -18
picarones/report/incremental_comparison_render.py +17 -19
picarones/report/inter_engine_render.py +8 -15
picarones/report/levers_render.py +9 -1
picarones/report/lexical_modernization_render.py +5 -10
picarones/report/longitudinal_render.py +15 -24
picarones/report/marginal_cost_render.py +111 -0
picarones/report/multirun_stability_render.py +2 -16
picarones/report/ner_render.py +3 -22
picarones/report/numerical_sequences_render.py +3 -18
picarones/report/philological_render.py +5 -25
picarones/report/pipeline_render.py +14 -24
picarones/report/rare_token_recall_render.py +116 -0
picarones/report/readability_render.py +13 -18
picarones/report/render_helpers.py +422 -0
picarones/report/report_data/__init__.py +132 -0
picarones/report/report_data/_helpers.py +30 -0
picarones/report/report_data/documents.py +167 -0
picarones/report/report_data/engines.py +103 -0

CLAUDE.md CHANGED

The diff for this file is too large to render.

README.md CHANGED

@@ -385,7 +385,7 @@ ruff check picarones/ tests/
 python -m mypy picarones/core/
 ```
-**Test suite**: ~3763 tests, ~3 min on a modern laptop. Coverage
 floor at 85% (currently ~87%). The `network` marker excludes tests
 requiring live HTTP.

 python -m mypy picarones/core/
 ```
+**Test suite**: ~3871 tests, ~3 min on a modern laptop. Coverage
 floor at 85% (currently ~87%). The `network` marker excludes tests
 requiring live HTTP.

SPECS.md CHANGED

@@ -425,7 +425,7 @@ column) and `picarones/report/glossary/{fr,en}.yaml`.
 **Traceability note**: the primary references (Demšar 2006,
 Wilcoxon 1945, Efron 1979, etc.) are cited in the docstrings
-of every function in `picarones/measurements/statistics.py`.
 The contextual glossary links each metric to its canonical
 publication (`reference` field).

 **Traceability note**: the primary references (Demšar 2006,
 Wilcoxon 1945, Efron 1979, etc.) are cited in the docstrings
+of every function in `picarones/measurements/statistics/`.
 The contextual glossary links each metric to its canonical
 publication (`reference` field).

docs/architecture.md CHANGED

@@ -41,7 +41,7 @@ The implementations shipped by default in the `picarones` package.
 | Category | Modules |
 |---|---|
-| Core | `metrics.py`, `statistics.py`, `runner.py`, `builtin_hooks.py`, `builtin_metrics.py`, `normalization.py` |
 | Errors | `confusion.py`, `taxonomy.py`, `taxonomy_comparison.py`, `taxonomy_cooccurrence.py`, `taxonomy_intra_doc.py` |
 | Lines/structure | `line_metrics.py`, `structure.py`, `worst_lines.py`, `char_scores.py` |
 | Calibration/reliability | `calibration.py`, `reliability.py`, `hallucination.py` |
@@ -141,3 +141,39 @@ Organized by circle: `tests/core/`, `tests/measurements/`,
 A circle-N test **does not import** the implementations of
 circles > N (except `tests/integration/`).

 | Category | Modules |
 |---|---|
+| Core | `metrics.py`, `statistics/` (sub-package), `runner.py`, `builtin_hooks.py`, `builtin_metrics.py`, `normalization.py` |
 | Errors | `confusion.py`, `taxonomy.py`, `taxonomy_comparison.py`, `taxonomy_cooccurrence.py`, `taxonomy_intra_doc.py` |
 | Lines/structure | `line_metrics.py`, `structure.py`, `worst_lines.py`, `char_scores.py` |
 | Calibration/reliability | `calibration.py`, `reliability.py`, `hallucination.py` |
 A circle-N test **does not import** the implementations of
 circles > N (except `tests/integration/`).
+## Convention for splitting modules > 400 lines
+When a Python module exceeds 400 lines AND holds several
+disjoint responsibilities, split it into a **sub-package** rather
+than into several flat modules. Reference model:
+[`picarones/measurements/statistics/`](../picarones/measurements/statistics/)
+from the "statistics.py split" sprint (May 2026).
+Convention:
+1. **Rename** `X.py` to `X/__init__.py` via `git mv` (preserves
+   the original file's history).
+2. **Create** in `X/` one sub-module per functional family
+   (`bootstrap.py`, `wilcoxon.py`, `friedman_nemenyi.py`, etc.).
+   Each sub-module should stay under ~400 lines; otherwise
+   split it again.
+3. **`X/__init__.py`** contains ONLY backward-compatible re-exports:
+   every public symbol of the old `X.py` must remain
+   importable via `from picarones.X import …`. Re-exported private
+   symbols must be those **actually** consumed by the
+   tests (verified by grep, not by assumption).
+4. **`__all__`** is explicit in every sub-module and in the
+   `__init__.py`.
+5. **Architecture tests** (`tests/architecture/test_*.py`) must
+   keep passing: if needed, extend `_measurements_modules()`
+   or `_imports_target_*` to recognize sub-packages.
+6. **Prefix rendering modules** with their domain
+   (`cdd_render.py` rather than `render_cdd.py`) for consistency with
+   `picarones/report/*_render.py`.
+**When NOT to split**: if the responsibilities are tightly
+coupled (e.g. an orchestrator that calls 12 sub-functions in
+one place), keeping a single file > 400 lines is
+acceptable. The per-file budget (`tests/architecture/test_file_budgets.py`)
+documents these deliberate exceptions.

docs/cli-workflows.md CHANGED

@@ -133,7 +133,7 @@ picarones import iiif \
 Downloads a IIIF v2/v3 manifest (BnF Gallica, Bodleian, Vatican…) and
 creates a local corpus with `.gt.txt` files extracted from the ALTO OCR when present.
 Since workstream 4, IIIF and Gallica use the same factored-out HTTP helpers
-([`picarones/importers/_http.py`](../picarones/importers/_http.py))
 with a guard against `file://`/`ftp://`/`javascript://` URLs.
 ## Utility tools

 Downloads a IIIF v2/v3 manifest (BnF Gallica, Bodleian, Vatican…) and
 creates a local corpus with `.gt.txt` files extracted from the ALTO OCR when present.
 Since workstream 4, IIIF and Gallica use the same factored-out HTTP helpers
+([`picarones/extras/importers/_http.py`](../picarones/extras/importers/_http.py))
 with a guard against `file://`/`ftp://`/`javascript://` URLs.
 ## Utility tools

docs/developer/index.md CHANGED

@@ -10,10 +10,18 @@ modules. In short:
 ```
 picarones/
-├── core/                # pure-Python analytical core
 │   ├── runner.py        # ThreadPool/ProcessPool orchestration
 │   ├── metrics.py       # CER/WER/MER/WIL via jiwer
-│   ├── statistics.py    # Wilcoxon, Friedman, Nemenyi, Pareto
 │   ├── narrative/       # factual summary engine
 │   ├── pricing.py       # cost model for the Pareto view
 │   └── …

 ```
 picarones/
+├── core/                # pure-Python analytical core (Circle 1)
+│   ├── pipeline.py      # PipelineRunner for composed pipelines
+│   ├── corpus.py        # Document, Corpus, GTLevel
+│   ├── results.py       # DocumentResult, EngineReport, BenchmarkResult
+│   ├── modules.py       # BaseModule, ArtifactType
+│   ├── facts.py         # Fact, FactType, narrative registry
+│   └── …
+├── measurements/        # official metrics (Circle 2)
 │   ├── runner.py        # ThreadPool/ProcessPool orchestration
 │   ├── metrics.py       # CER/WER/MER/WIL via jiwer
+│   ├── statistics/      # Wilcoxon, Friedman, Nemenyi, Pareto
+│   │   (sub-package since the "statistics.py split" sprint)
 │   ├── narrative/       # factual summary engine
 │   ├── pricing.py       # cost model for the Pareto view
 │   └── …

docs/profiles.md CHANGED

@@ -150,7 +150,7 @@ def my_hook(*, ground_truth, hypothesis, image_path, corpus_lang, ocr_result):
 - [`picarones/core/metric_hooks.py`](../picarones/core/metric_hooks.py)
   -- registry, profiles, `run_document_hooks()`, `run_corpus_aggregators()`.
-- [`picarones/core/builtin_hooks.py`](../picarones/core/builtin_hooks.py)
   -- the 12 document hooks + 12 aggregators built into Picarones.
 - [`tests/test_metric_hooks.py`](../tests/test_metric_hooks.py)
   -- unit tests + backward compatibility of the `standard` profile.

 - [`picarones/core/metric_hooks.py`](../picarones/core/metric_hooks.py)
   -- registry, profiles, `run_document_hooks()`, `run_corpus_aggregators()`.
+- [`picarones/measurements/builtin_hooks.py`](../picarones/measurements/builtin_hooks.py)
   -- the 12 document hooks + 12 aggregators built into Picarones.
 - [`tests/test_metric_hooks.py`](../tests/test_metric_hooks.py)
   -- unit tests + backward compatibility of the `standard` profile.

docs/roadmap/evolution-2026.md CHANGED

@@ -442,7 +442,7 @@ new in the report.
 **A.II.1.a -- Named-entity accuracy (NER).**
-New module `picarones/core/ner.py`. Backends: multilingual spaCy,
 Stanza, HIPE model for historical corpora. The choice is set per
 profile (`fr_core_news_lg`, `xx_ent_wiki_sm`, `hipe2022`).
@@ -464,7 +464,7 @@ glossary (`ner_score` entry).
 **A.II.1.b -- Engine calibration score.**
-New module `picarones/core/calibration.py`. All target engines
 provide a per-token or per-line confidence (Tesseract `tsv`
 output, Pero OCR via `PageLayout`, Mistral OCR via `confidence`, Google
 Vision via `Word.confidence`). Adds a field

 **A.II.1.a -- Named-entity accuracy (NER).**
+New module `picarones/measurements/ner.py`. Backends: multilingual spaCy,
 Stanza, HIPE model for historical corpora. The choice is set per
 profile (`fr_core_news_lg`, `xx_ent_wiki_sm`, `hipe2022`).
 **A.II.1.b -- Engine calibration score.**
+New module `picarones/measurements/calibration.py`. All target engines
 provide a per-token or per-line confidence (Tesseract `tsv`
 output, Pero OCR via `PageLayout`, Mistral OCR via `confidence`, Google
 Vision via `Word.confidence`). Adds a field

docs/user/writing-a-pipeline-module.md CHANGED

@@ -350,7 +350,7 @@ plug it into the pipeline and measure.
 ### 6.b "What if I just want to test an OCR pipeline on its own, with no downstream steps?"
 That is exactly what the historical OCR runner does
-(`run_benchmark` in `picarones/core/runner.py`): it is
 still there, has not changed, and remains the recommended route for
 single-stage OCR benchmarks.

 ### 6.b "What if I just want to test an OCR pipeline on its own, with no downstream steps?"
 That is exactly what the historical OCR runner does
+(`run_benchmark` in `picarones/measurements/runner/`): it is
 still there, has not changed, and remains the recommended route for
 single-stage OCR benchmarks.

picarones/measurements/__init__.py CHANGED

@@ -151,3 +151,28 @@ from picarones.measurements import reading_order  # noqa: F401
 # Workstream 1 (post-Sprint 97): (ALTO, ALTO) metrics for evaluating
 # ALTO reconstructors against an ALTO GT of the document.
 from picarones.measurements import alto_metrics  # noqa: F401

 # Workstream 1 (post-Sprint 97): (ALTO, ALTO) metrics for evaluating
 # ALTO reconstructors against an ALTO GT of the document.
 from picarones.measurements import alto_metrics  # noqa: F401
+# ──────────────────────────────────────────────────────────────────────────
+# "Zero actionable debt" sprint (May 2026): modules that the main OCR
+# runner never calls automatically but that belong to the public API
+# of ``picarones.measurements``. Importing them here makes them
+# reachable via ``from picarones.measurements import X`` and guarantees
+# that none silently becomes "test-only" (cf.
+# ``tests/architecture/test_module_coverage.py``).
+#
+# Scope distinction:
+# - Computation modules used through the composable HTML renderers
+#   (users compose them themselves to fit their use case):
+from picarones.measurements import baseline_comparison  # noqa: F401  # SQLite history
+from picarones.measurements import cost_projection  # noqa: F401  # user's target volume
+from picarones.measurements import equivalence_profile  # noqa: F401  # HTML slider
+from picarones.measurements import error_absorption  # noqa: F401  # composed-pipeline junction
+from picarones.measurements import layout  # noqa: F401  # ALTO GT required (axis B)
+from picarones.measurements import longitudinal  # noqa: F401  # SQLite history
+from picarones.measurements import marginal_cost  # noqa: F401  # engine pairs
+from picarones.measurements import module_policy  # noqa: F401  # audit tool
+from picarones.measurements import ner_backends  # noqa: F401  # NER backend factory
+from picarones.measurements import rare_tokens  # noqa: F401  # corpus-wide
+from picarones.measurements import reliability  # noqa: F401  # multi-runs
+from picarones.measurements import taxonomy_cooccurrence  # noqa: F401  # split out of taxonomy
+from picarones.measurements import taxonomy_intra_doc  # noqa: F401  # split out of taxonomy

picarones/measurements/runner/__init__.py ADDED

@@ -0,0 +1,103 @@

+"""Benchmark orchestrator.
+Runs the OCR/HTR engines over the corpus in parallel:
+- ``ProcessPoolExecutor`` for CPU-bound engines (Tesseract, Pero OCR,
+  Kraken); the picklable workers live in :mod:`workers`.
+- ``ThreadPoolExecutor`` for IO-bound / API engines (Mistral, Google,
+  Azure, LLMs).
+Before the "runner.py split" sprint (May 2026) this module was
+a single 1019-line file. The sub-package splits the
+responsibilities by concern:
+- :mod:`document`: builds a :class:`DocumentResult` from an
+  OCR output (main metrics + hooks via ``run_document_hooks(profile)``).
+- :mod:`workers`: module-level functions for ``ProcessPoolExecutor``
+  (:func:`_cpu_doc_worker`) and ``ThreadPoolExecutor`` (:func:`_io_doc_worker`).
+- :mod:`partial`: NDJSON persistence of partial results so that a run
+  can resume after an interruption.
+- :mod:`orchestration`: :func:`run_benchmark` (main loop,
+  pools, per-engine aggregation) + :func:`_build_pipeline_info`.
+- :mod:`aggregation`: backward-compatible delegations to the aggregators in
+  ``builtin_hooks`` (workstream 2, post-Sprint 97).
+- :mod:`ner_attach`: NER wiring in post-processing (Sprint 40).
+This ``__init__.py`` re-exports the whole historical public API so that
+the ~25 files importing from ``picarones.measurements.runner``
+keep working unchanged. The private symbols
+``_compute_document_result``, ``_load_partial``, ``_partial_path``,
+``_aggregate_*``, ``_calibration_from_engine_result`` are re-exported
+because the Sprint 13/40/42 tests consume them directly.
+"""
+from picarones.measurements.runner.aggregation import (
+    _aggregate_calibration,
+    _aggregate_char_scores,
+    _aggregate_confusion,
+    _aggregate_hallucination,
+    _aggregate_image_quality,
+    _aggregate_line_metrics,
+    _aggregate_structure,
+    _aggregate_taxonomy,
+)
+from picarones.measurements.runner.document import (
+    _calibration_from_engine_result,
+    _compute_document_result,
+    _make_error_doc_result,
+    _make_timeout_doc_result,
+)
+from picarones.measurements.runner.ner_attach import (
+    _aggregate_ner,
+    _attach_ner_metrics,
+)
+from picarones.measurements.runner.orchestration import (
+    _build_pipeline_info,
+    run_benchmark,
+)
+from picarones.measurements.runner.partial import (
+    _delete_partial,
+    _load_partial,
+    _partial_path,
+    _partial_write_lock,
+    _sanitize_filename,
+    _save_partial_line,
+)
+from picarones.measurements.runner.workers import (
+    _cpu_doc_worker,
+    _io_doc_worker,
+)
+__all__ = [
+    # Main public API
+    "run_benchmark",
+    # Document computation helpers
+    "_compute_document_result",
+    "_calibration_from_engine_result",
+    "_make_error_doc_result",
+    "_make_timeout_doc_result",
+    # Picklable workers
+    "_cpu_doc_worker",
+    "_io_doc_worker",
+    # Partial-result persistence
+    "_partial_path",
+    "_load_partial",
+    "_save_partial_line",
+    "_delete_partial",
+    "_sanitize_filename",
+    "_partial_write_lock",
+    # Orchestration helper
+    "_build_pipeline_info",
+    # Aggregation delegations (backward compat for Sprint 13/42 tests)
+    "_aggregate_calibration",
+    "_aggregate_char_scores",
+    "_aggregate_confusion",
+    "_aggregate_hallucination",
+    "_aggregate_image_quality",
+    "_aggregate_line_metrics",
+    "_aggregate_structure",
+    "_aggregate_taxonomy",
+    # NER (Sprint 40)
+    "_aggregate_ner",
+    "_attach_ner_metrics",
+]
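
Because this `__init__.py` keeps a long hand-written `__all__` in sync with six sub-modules, a small consistency check can catch a name that is declared but never imported. The sketch below is an assumption about how such a check might look, not a test from the repository: `missing_exports` and `make_fake_runner` are hypothetical names, and the module under test is an in-memory stand-in rather than `picarones.measurements.runner` itself.

```python
import types


def missing_exports(module: types.ModuleType) -> list[str]:
    """Return names listed in `__all__` that the module does not define."""
    declared = getattr(module, "__all__", [])
    return sorted(name for name in declared if not hasattr(module, name))


def make_fake_runner() -> types.ModuleType:
    """Build an in-memory stand-in for a re-exporting `__init__.py`."""
    mod = types.ModuleType("fake_runner")
    mod.run_benchmark = lambda: "ok"
    mod._load_partial = lambda: {}
    # `_partial_path` is declared below but deliberately left undefined.
    mod.__all__ = ["run_benchmark", "_load_partial", "_partial_path"]
    return mod
```

Calling `missing_exports(make_fake_runner())` reports the one declared name that has no definition; against a real package the same function could be applied to the imported module object.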

picarones/measurements/runner/aggregation.py ADDED

@@ -0,0 +1,82 @@

+"""Backward-compatible delegations to ``builtin_hooks._aggregate_*``.
+Workstream 2 (post-Sprint 97): the per-engine aggregation logic for
+all metrics (confusion, taxonomy, structure, image_quality,
+line_metrics, hallucination, calibration, char_scores) now lives
+in :mod:`picarones.measurements.builtin_hooks` (single source of truth,
+exposed through the :mod:`picarones.core.metric_hooks` registry).
+The names below remain available from
+``picarones.measurements.runner`` for backward compatibility with the
+Sprint 13 / 42 tests that import them directly.
+"""
+from __future__ import annotations
+from typing import Optional
+def _aggregate_confusion(doc_results: list) -> Optional[dict]:
+    """Delegates to :func:`builtin_hooks._aggregate_confusion`."""
+    from picarones.measurements.builtin_hooks import _aggregate_confusion as _impl
+    return _impl(doc_results)
+def _aggregate_char_scores(doc_results: list) -> Optional[dict]:
+    """Delegates to :func:`builtin_hooks._aggregate_char_scores`."""
+    from picarones.measurements.builtin_hooks import _aggregate_char_scores as _impl
+    return _impl(doc_results)
+def _aggregate_taxonomy(doc_results: list) -> Optional[dict]:
+    """Delegates to :func:`builtin_hooks._aggregate_taxonomy`."""
+    from picarones.measurements.builtin_hooks import _aggregate_taxonomy as _impl
+    return _impl(doc_results)
+def _aggregate_structure(doc_results: list) -> Optional[dict]:
+    """Delegates to :func:`builtin_hooks._aggregate_structure`."""
+    from picarones.measurements.builtin_hooks import _aggregate_structure as _impl
+    return _impl(doc_results)
+def _aggregate_image_quality(doc_results: list) -> Optional[dict]:
+    """Delegates to :func:`builtin_hooks._aggregate_image_quality`."""
+    from picarones.measurements.builtin_hooks import _aggregate_image_quality as _impl
+    return _impl(doc_results)
+def _aggregate_line_metrics(doc_results: list) -> Optional[dict]:
+    """Delegates to :func:`builtin_hooks._aggregate_line_metrics`."""
+    from picarones.measurements.builtin_hooks import _aggregate_line_metrics as _impl
+    return _impl(doc_results)
+def _aggregate_hallucination(doc_results: list) -> Optional[dict]:
+    """Delegates to :func:`builtin_hooks._aggregate_hallucination`."""
+    from picarones.measurements.builtin_hooks import _aggregate_hallucination as _impl
+    return _impl(doc_results)
+def _aggregate_calibration(doc_results: list) -> Optional[dict]:
+    """Delegates to :func:`builtin_hooks._aggregate_calibration`.
+    Kept for backward compatibility with the ``test_sprint42_calibration_runner``
+    test, which imports directly from ``picarones.measurements.runner``. The
+    actual logic lives in :mod:`picarones.measurements.builtin_hooks`
+    (workstream 2, post-Sprint 97).
+    """
+    from picarones.measurements.builtin_hooks import _aggregate_calibration as _impl
+    return _impl(doc_results)
+__all__ = [
+    "_aggregate_calibration",
+    "_aggregate_char_scores",
+    "_aggregate_confusion",
+    "_aggregate_hallucination",
+    "_aggregate_image_quality",
+    "_aggregate_line_metrics",
+    "_aggregate_structure",
+    "_aggregate_taxonomy",
+]
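
The eight wrappers in `aggregation.py` share one shape: import the real implementation on first call, then forward the arguments. The sketch below shows that lazy-delegation pattern in generic form. It is an illustration of the technique, not a proposed replacement: `make_lazy_delegate` is a hypothetical helper, and the standard-library `statistics.mean` stands in for a `builtin_hooks` aggregator so the example runs without Picarones installed.

```python
import importlib
from typing import Any, Callable


def make_lazy_delegate(module_name: str, attr: str) -> Callable[..., Any]:
    """Return a function that imports `module_name` only when first called."""

    def _delegate(*args: Any, **kwargs: Any) -> Any:
        # The import happens here, at call time, not at definition time.
        impl = getattr(importlib.import_module(module_name), attr)
        return impl(*args, **kwargs)

    _delegate.__name__ = attr
    _delegate.__doc__ = f"Delegates to {module_name}.{attr} (imported on call)."
    return _delegate


# Stand-in target: the stdlib `statistics.mean`, not a Picarones aggregator.
lazy_mean = make_lazy_delegate("statistics", "mean")
```

The explicit wrappers in the commit trade some repetition for properties a factory does not give for free: each name is greppable, carries its own type annotations, and is a plain module-level function, whereas functions nested inside a factory generally cannot be pickled by the standard `pickle` module.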

picarones/measurements/runner/document.py ADDED

@@ -0,0 +1,190 @@

+"""Builds a :class:`DocumentResult` from an OCR output.
+Centralizes the computation of every metric attached to a single
+document: main metrics (CER/WER/MER/WIL via jiwer), optional
+hooks (calibration, taxonomy, philological, etc., run through
+``run_document_hooks(profile)``), and OCR+LLM pipeline metadata.
+Also provides helpers that build synthetic ``DocumentResult`` objects
+when an engine times out or fails (``_make_timeout_doc_result``,
+``_make_error_doc_result``).
+"""
+from __future__ import annotations
+from typing import Optional
+from picarones.core.results import DocumentResult
+from picarones.engines.base import EngineResult
+from picarones.measurements.metrics import MetricsResult, compute_metrics
+def _calibration_from_engine_result(
+    ground_truth: str,
+    token_confidences: list,
+) -> Optional[dict]:
+    """Delegates to
+    :func:`picarones.measurements.builtin_hooks.calibration_from_engine_result`.
+    Kept for backward compatibility with the Sprint 42 tests, which do
+    ``from picarones.measurements.runner import _calibration_from_engine_result``.
+    Any change to the computation must be made in ``builtin_hooks``.
+    """
+    from picarones.measurements.builtin_hooks import calibration_from_engine_result
+    return calibration_from_engine_result(ground_truth, token_confidences)
+def _compute_document_result(
+    doc_id: str,
+    image_path: str,
+    ground_truth: str,
+    ocr_result: EngineResult,
+    char_exclude: Optional[frozenset],
+    corpus_lang: str = "fr",
+    profile: str = "standard",
+) -> DocumentResult:
+    """Computes every metric for one document and returns a DocumentResult.
+    Usable both in the main process (IO-bound) and in the
+    sub-processes created by ProcessPoolExecutor (CPU-bound).
+    Heavy imports are deferred to speed up sub-process start-up.
+    Workstream 2 (post-Sprint 97): rework
+    -------------------------------------
+    The 11 hard-coded ``try/except`` blocks (Sprints 5+10+39+42+61+86+87) are
+    now centralized in ``picarones.measurements.builtin_hooks`` and
+    selected via ``run_document_hooks(profile)``.  The
+    ``"standard"`` profile (default) strictly reproduces the
+    pre-workstream-2 behavior.  The ``"minimal"``, ``"philological"``,
+    ``"diagnostics"``, ``"economics"``, ``"pipeline"``, and ``"full"`` profiles
+    let users trade off computation cost.
+    """
+    import logging as _logging
+    _logger = _logging.getLogger(__name__)
+    # Eager-load the built-in hooks to populate the registry in the
+    # pool's sub-processes (the runner's top-level ``import`` skips this
+    # so that start-up of minimal engines is not slowed down).
+    import picarones.measurements.builtin_hooks  # noqa: F401
+    from picarones.core.metric_hooks import run_document_hooks
+    if ocr_result.success:
+        metrics = compute_metrics(ground_truth, ocr_result.text, char_exclude=char_exclude)
+    else:
+        metrics = MetricsResult(
+            cer=1.0, cer_nfc=1.0, cer_caseless=1.0,
+            wer=1.0, wer_normalized=1.0, mer=1.0, wil=1.0,
+            reference_length=len(ground_truth),
+            hypothesis_length=0,
+            error=ocr_result.error,
+        )
+    ocr_intermediate = ocr_result.metadata.get("ocr_intermediate")
+    pipeline_meta: dict = {}
+    if ocr_result.metadata.get("is_pipeline"):
+        pipeline_meta = {
+            "pipeline_mode": ocr_result.metadata.get("pipeline_mode"),
+            "prompt_file": ocr_result.metadata.get("prompt_file"),
+            "llm_model": ocr_result.metadata.get("llm_model"),
+            "llm_provider": ocr_result.metadata.get("llm_provider"),
+        }
+        if ocr_intermediate is not None and ocr_result.success:
+            try:
+                from picarones.pipelines.over_normalization import detect_over_normalization
+                over_norm = detect_over_normalization(
+                    ground_truth=ground_truth,
+                    ocr_text=ocr_intermediate,
+                    llm_text=ocr_result.text,
+                )
+                pipeline_meta["over_normalization"] = over_norm.as_dict()
+            except Exception as e:
+                _logger.warning("[over_normalization] fonctionnalité dégradée : %s", e)
+    # Document-level hooks: each hook produces one named attribute of the
+    # ``DocumentResult``.  Hooks that do not apply in this context (OCR
+    # failure for ``requires_success`` hooks, missing
+    # ``token_confidences`` for ``calibration``) are skipped
+    # silently.  Exceptions raised by a hook are
+    # caught and logged as warnings by ``run_document_hooks``.
+    extras = run_document_hooks(
+        profile,
+        ground_truth=ground_truth,
+        hypothesis=ocr_result.text,
+        image_path=image_path,
+        corpus_lang=corpus_lang,
+        ocr_result=ocr_result,
+    )
+    return DocumentResult(
+        doc_id=doc_id,
+        image_path=image_path,
+        ground_truth=ground_truth,
+        hypothesis=ocr_result.text,
+        metrics=metrics,
+        duration_seconds=ocr_result.duration_seconds,
+        engine_error=ocr_result.error,
+        ocr_intermediate=ocr_intermediate,
+        pipeline_metadata=pipeline_meta,
+        confusion_matrix=extras.get("confusion_matrix"),
+        char_scores=extras.get("char_scores"),
+        taxonomy=extras.get("taxonomy"),
+        structure=extras.get("structure"),
+        image_quality=extras.get("image_quality"),
+        line_metrics=extras.get("line_metrics"),
+        hallucination_metrics=extras.get("hallucination_metrics"),
+        calibration_metrics=extras.get("calibration_metrics"),
+        philological_metrics=extras.get("philological_metrics"),
+        searchability_metrics=extras.get("searchability_metrics"),
+        numerical_sequence_metrics=extras.get("numerical_sequence_metrics"),
+        readability_metrics=extras.get("readability_metrics"),
+    )
+def _make_timeout_doc_result(doc: object, timeout_seconds: float) -> DocumentResult:
+    """Synthetic DocumentResult for a document that exceeded the timeout."""
+    err = f"timeout ({timeout_seconds:.0f}s)"
+    metrics = MetricsResult(
+        cer=1.0, cer_nfc=1.0, cer_caseless=1.0,
+        wer=1.0, wer_normalized=1.0, mer=1.0, wil=1.0,
+        reference_length=len(doc.ground_truth),  # type: ignore[attr-defined]
+        hypothesis_length=0,
+        error=err,
+    )
+    return DocumentResult(
+        doc_id=doc.doc_id,  # type: ignore[attr-defined]
+        image_path=str(doc.image_path),  # type: ignore[attr-defined]
+        ground_truth=doc.ground_truth,  # type: ignore[attr-defined]
+        hypothesis="",
+        metrics=metrics,
+        duration_seconds=timeout_seconds,
+        engine_error=err,
+    )
+def _make_error_doc_result(doc: object, error_msg: str) -> DocumentResult:
+    """Synthetic DocumentResult for an error raised during an engine call."""
+    metrics = MetricsResult(
+        cer=1.0, cer_nfc=1.0, cer_caseless=1.0,
+        wer=1.0, wer_normalized=1.0, mer=1.0, wil=1.0,
+        reference_length=len(doc.ground_truth),  # type: ignore[attr-defined]
+        hypothesis_length=0,
+        error=error_msg,
+    )
+    return DocumentResult(
+        doc_id=doc.doc_id,  # type: ignore[attr-defined]
+        image_path=str(doc.image_path),  # type: ignore[attr-defined]
+        ground_truth=doc.ground_truth,  # type: ignore[attr-defined]
+        hypothesis="",
+        metrics=metrics,
+        duration_seconds=0.0,
+        engine_error=error_msg,
+    )
+__all__ = [
+    "_calibration_from_engine_result",
+    "_compute_document_result",
+    "_make_error_doc_result",
+    "_make_timeout_doc_result",
+]
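
`document.py` builds the same worst-case metrics (every error rate at 1.0, empty hypothesis) in three places: the failed-OCR branch, the timeout helper and the error helper. The sketch below shows that sentinel pattern as one function. It is a stand-alone illustration under stated assumptions: `DemoMetrics` is a hypothetical stand-in that mirrors only the `MetricsResult` fields visible in this diff, and `worst_case_metrics` is not a function from the repository.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DemoMetrics:
    """Hypothetical stand-in for the MetricsResult fields shown in the diff."""

    cer: float
    cer_nfc: float
    cer_caseless: float
    wer: float
    wer_normalized: float
    mer: float
    wil: float
    reference_length: int
    hypothesis_length: int
    error: Optional[str] = None


def worst_case_metrics(ground_truth: str, error: Optional[str]) -> DemoMetrics:
    """Metrics for a document with no usable hypothesis (failure or timeout)."""
    return DemoMetrics(
        cer=1.0, cer_nfc=1.0, cer_caseless=1.0,
        wer=1.0, wer_normalized=1.0, mer=1.0, wil=1.0,
        reference_length=len(ground_truth),
        hypothesis_length=0,
        error=error,
    )


def timeout_message(timeout_seconds: float) -> str:
    """Same message format as the timeout helper in the diff."""
    return f"timeout ({timeout_seconds:.0f}s)"
```

Scoring a failed document at 1.0 rather than dropping it keeps the document count identical across engines, so per-engine averages stay comparable.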

picarones/measurements/runner/ner_attach.py ADDED

@@ -0,0 +1,133 @@

+"""NER wiring in the benchmark's post-processing (Sprint 40).
+The runner calls :func:`_attach_ner_metrics` once all
+documents have been computed, for engines whose GT has an
+``ENTITIES`` level (Sprint 32, multi-level GT).
+The NER extractor is typically a :class:`SpacyEntityExtractor` wrapper
+built via :func:`picarones.measurements.ner_backends.get_extractor`.
+"""
+from __future__ import annotations
+import logging
+from typing import Callable
+from picarones.core.corpus import Corpus
+logger = logging.getLogger(__name__)
+def _attach_ner_metrics(
+    corpus: Corpus,
+    doc_results: list,
+    entity_extractor: Callable[[str], list | None],
+) -> None:
+    """Computes and attaches ``DocumentResult.ner_metrics`` for each doc
+    whose GT has an ``ENTITIES`` level (Sprint 32).
+    The extractor is called on the OCR hypothesis ``dr.hypothesis``.
+    Errors are downgraded to warnings (not propagated) so that
+    one document crashing the NER does not break the whole
+    benchmark.
+    """
+    try:
+        from picarones.core.corpus import GTLevel
+        from picarones.measurements.ner import compute_ner_metrics
+    except ImportError as exc:
+        logger.warning("[ner.attach] imports indisponibles : %s", exc)
+        return
+    docs_by_id = {d.doc_id: d for d in corpus.documents}
+    n_done = 0
+    for dr in doc_results:
+        if dr.engine_error is not None or not dr.hypothesis:
+            continue
+        doc = docs_by_id.get(dr.doc_id)
+        if doc is None or not doc.has_gt(GTLevel.ENTITIES):
+            continue
+        try:
+            gt_payload = doc.get_gt(GTLevel.ENTITIES)
+            gt_entities = list(gt_payload.entities) if gt_payload else []
+            hyp_entities = entity_extractor(dr.hypothesis) or []
+            dr.ner_metrics = compute_ner_metrics(gt_entities, hyp_entities)
+            n_done += 1
+        except Exception as exc:  # noqa: BLE001
+            logger.warning(
+                "[ner.attach] %s : extraction/comparaison NER dégradée : %s",
+                dr.doc_id, exc,
+            )
+    if n_done > 0:
+        logger.info("[ner] %d documents évalués pour NER.", n_done)
+def _aggregate_ner(doc_results: list) -> "dict | None":
+    """Aggregates NER metrics at the engine level.
+    Recomputes *micro* precision/recall/F1 from the global sums
+    of TP/FP/FN, plus the per-category breakdown and the total
+    counts of hallucinated and missed entities.
+    """
+    relevant = [dr for dr in doc_results if dr.ner_metrics is not None]
+    if not relevant:
+        return None
+    total_tp = 0
+    total_fp = 0
+    total_fn = 0
+    cat_tp: dict[str, int] = {}
+    cat_fp: dict[str, int] = {}
+    cat_fn: dict[str, int] = {}
+    total_hallucinated = 0
+    total_missed = 0
+    iou_threshold = 0.5
+    for dr in relevant:
+        m = dr.ner_metrics
+        total_tp += int(m.get("true_positives", 0))
+        total_fp += int(m.get("false_positives", 0))
+        total_fn += int(m.get("false_negatives", 0))
+        total_hallucinated += len(m.get("hallucinated_entities", []) or [])
+        total_missed += len(m.get("missed_entities", []) or [])
+        iou_threshold = float(m.get("iou_threshold", iou_threshold))
+        for cat, stats in (m.get("per_category") or {}).items():
+            cat_tp[cat] = cat_tp.get(cat, 0)
+            cat_fp[cat] = cat_fp.get(cat, 0)
+            cat_fn[cat] = cat_fn.get(cat, 0)
+            # Reconstitue les sommes par catégorie via support et P/R
+            support = int(stats.get("support", 0))
+            recall = float(stats.get("recall", 0.0))
+            precision = float(stats.get("precision", 0.0))
+            tp_cat = round(support * recall) if support > 0 else 0
+            fn_cat = max(0, support - tp_cat)
+            fp_cat = (
+                round(tp_cat * (1 - precision) / precision)
+                if precision > 0 else 0
+            )
+            cat_tp[cat] += tp_cat
+            cat_fp[cat] += fp_cat
+            cat_fn[cat] += fn_cat
+    def _prf(tp: int, fp: int, fn: int) -> dict[str, float]:
+        p = tp / (tp + fp) if (tp + fp) > 0 else 0.0
+        r = tp / (tp + fn) if (tp + fn) > 0 else 0.0
+        f1 = 2 * p * r / (p + r) if (p + r) > 0 else 0.0
+        return {"precision": p, "recall": r, "f1": f1, "support": tp + fn}
+    return {
+        "global": _prf(total_tp, total_fp, total_fn),
+        "per_category": {
+            cat: _prf(cat_tp[cat], cat_fp[cat], cat_fn[cat])
+            for cat in sorted(set(cat_tp) | set(cat_fp) | set(cat_fn))
+        },
+        "true_positives": total_tp,
+        "false_positives": total_fp,
+        "false_negatives": total_fn,
+        "hallucinated_total": total_hallucinated,
+        "missed_total": total_missed,
+        "doc_count": len(relevant),
+        "iou_threshold": iou_threshold,
+    }
+__all__ = ["_aggregate_ner", "_attach_ner_metrics"]
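
Review note on the aggregation above: a minimal, standalone sketch of the two computations `_aggregate_ner` relies on, namely micro-averaged precision/recall/F1 from summed counts and the reconstruction of per-category counts from `support`, `precision` and `recall`. The helper names `micro_prf` and `counts_from_rates` and the sample documents are hypothetical and are not part of the commit.

```python
from __future__ import annotations


def micro_prf(tp: int, fp: int, fn: int) -> dict[str, float]:
    """Micro-averaged precision/recall/F1 from summed TP/FP/FN counts."""
    p = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    r = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) > 0 else 0.0
    return {"precision": p, "recall": r, "f1": f1, "support": tp + fn}


def counts_from_rates(
    support: int, precision: float, recall: float
) -> tuple[int, int, int]:
    """Approximate (tp, fp, fn) from support and P/R, mirroring the diff."""
    tp = round(support * recall) if support > 0 else 0
    fn = max(0, support - tp)
    # With precision == 0 the false positives cannot be recovered: 0 is used.
    fp = round(tp * (1 - precision) / precision) if precision > 0 else 0
    return tp, fp, fn


# Hypothetical per-document NER metrics (illustration only).
docs = [
    {"true_positives": 8, "false_positives": 2, "false_negatives": 2},
    {"true_positives": 1, "false_positives": 0, "false_negatives": 9},
]
total_tp = sum(d["true_positives"] for d in docs)
total_fp = sum(d["false_positives"] for d in docs)
total_fn = sum(d["false_negatives"] for d in docs)
micro = micro_prf(total_tp, total_fp, total_fn)
rebuilt = counts_from_rates(support=10, precision=0.8, recall=0.8)
lossy = counts_from_rates(support=5, precision=0.0, recall=0.0)
```

Because the per-category counts are rebuilt from rounded rates rather than read directly, they are approximations. In particular, a category whose precision is 0 contributes no false positives to the sum, whatever the real number was.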

picarones/measurements/{runner.py → runner/orchestration.py} RENAMED

@@ -1,21 +1,25 @@
-"""Orchestrateur du benchmark.
-Exécute les moteurs OCR/HTR sur le corpus de manière parallèle :
-- ``ProcessPoolExecutor`` pour les moteurs CPU-bound (Tesseract, Pero OCR, Kraken)
-- ``ThreadPoolExecutor``  pour les moteurs IO-bound / API (Mistral, Google, Azure, LLMs)
-Les résultats partiels sont sauvegardés après chaque document dans un fichier
-``{partial_dir}/{corpus}_{engine}.partial.json`` (NDJSON).  Si le benchmark est
-interrompu, la prochaine exécution reprend automatiquement depuis ce fichier.
 """
 from __future__ import annotations
 import concurrent.futures
-import json
 import logging
-import re
-import tempfile
 import threading
 import time
 from pathlib import Path
@@ -24,379 +28,28 @@ from typing import Optional
 from tqdm import tqdm
 from picarones.core.corpus import Corpus
-from picarones.measurements.metrics import MetricsResult, compute_metrics
 from picarones.core.results import BenchmarkResult, DocumentResult, EngineReport
-from picarones.engines.base import BaseOCREngine, EngineResult
 logger = logging.getLogger(__name__)
-# Lock pour la sérialisation des écritures de résultats partiels
-_partial_write_lock = threading.Lock()
-# ---------------------------------------------------------------------------
-# Workers de niveau module (requis pour ProcessPoolExecutor — picklables)
-# ---------------------------------------------------------------------------
-def _cpu_doc_worker(args: tuple) -> "DocumentResult":
-    """Worker pour ProcessPoolExecutor (moteurs CPU-bound).
-    Instancie le moteur dans le sous-processus, exécute l'OCR et calcule
-    toutes les métriques.  Doit être une fonction de niveau module pour être
-    sérialisable par ``pickle``.
-    Le tuple ``args`` peut contenir, par compatibilité ascendante :
-    - 7 éléments : legacy (Sprint 13)
-    - 8 éléments : + ``corpus_lang`` (Sprint 87)
-    - 9 éléments : + ``profile`` (chantier 2 post-Sprint 97)
-    """
-    if len(args) == 9:
-        (engine_module, engine_class_name, engine_config, doc_id,
-         image_path, ground_truth, char_exclude_chars, corpus_lang,
-         profile) = args
-    elif len(args) == 8:
-        (engine_module, engine_class_name, engine_config, doc_id,
-         image_path, ground_truth, char_exclude_chars, corpus_lang) = args
-        profile = "standard"
-    else:
-        (engine_module, engine_class_name, engine_config, doc_id,
-         image_path, ground_truth, char_exclude_chars) = args
-        corpus_lang = "fr"
-        profile = "standard"
-    import importlib
-    mod = importlib.import_module(engine_module)
-    engine_cls = getattr(mod, engine_class_name)
-    engine = engine_cls(config=engine_config)
-    ocr_result = engine.run(image_path)
-    char_exclude = frozenset(char_exclude_chars) if char_exclude_chars else None
-    return _compute_document_result(
-        doc_id=doc_id,
-        image_path=image_path,
-        ground_truth=ground_truth,
-        ocr_result=ocr_result,
-        char_exclude=char_exclude,
-        corpus_lang=corpus_lang,
-        profile=profile,
-    )
-def _io_doc_worker(
-    engine: BaseOCREngine,
-    doc: object,
-    char_exclude: Optional[frozenset],
-    corpus_lang: str = "fr",
-    profile: str = "standard",
-) -> "DocumentResult":
-    """Worker pour ThreadPoolExecutor (moteurs IO-bound / API).
-    Exécute l'OCR et calcule les métriques dans un thread.  L'instance du
-    moteur est partagée entre les threads — les adaptateurs HTTP sont
-    généralement sans état mutable entre les appels.
-    Si le document possède un texte OCR pré-calculé (corpus triplet) et que
-    le moteur est un pipeline OCR+LLM, utilise ``run_with_ocr_text()`` pour
-    court-circuiter l'étape OCR et tester directement la post-correction LLM.
-    """
-    doc_ocr_text = getattr(doc, "ocr_text", None)
-    if doc_ocr_text is not None:
-        # Corpus triplet — vérifier si le moteur supporte run_with_ocr_text
-        run_with = getattr(engine, "run_with_ocr_text", None)
-        if run_with is not None:
-            ocr_result = run_with(doc.image_path, doc_ocr_text)  # type: ignore[attr-defined]
-        else:
-            # Moteur OCR classique — ignorer le texte OCR pré-calculé
-            ocr_result = engine.run(doc.image_path)  # type: ignore[attr-defined]
-    else:
-        ocr_result = engine.run(doc.image_path)  # type: ignore[attr-defined]
-    return _compute_document_result(
-        doc_id=doc.doc_id,  # type: ignore[attr-defined]
-        image_path=str(doc.image_path),  # type: ignore[attr-defined]
-        ground_truth=doc.ground_truth,  # type: ignore[attr-defined]
-        ocr_result=ocr_result,
-        char_exclude=char_exclude,
-        corpus_lang=corpus_lang,
-        profile=profile,
-    )
-# ---------------------------------------------------------------------------
-# Calcul documentaire centralisé
-# ---------------------------------------------------------------------------
-# Chantier 2 (post-Sprint 97) — la logique du helper calibration vit
-# désormais dans :mod:`picarones.measurements.builtin_hooks`. Ce nom reste exposé
-# ici pour la rétrocompat des tests Sprint 42 qui font
-# ``from picarones.measurements.runner import _calibration_from_engine_result``.
-def _calibration_from_engine_result(
-    ground_truth: str,
-    token_confidences: list,
-) -> Optional[dict]:
-    """Délégation vers :func:`picarones.measurements.builtin_hooks.calibration_from_engine_result`.
-    Conservé pour la rétrocompat des tests existants ; toute évolution
-    du calcul doit se faire dans ``builtin_hooks``.
-    """
-    from picarones.measurements.builtin_hooks import calibration_from_engine_result
-    return calibration_from_engine_result(ground_truth, token_confidences)
-def _compute_document_result(
-    doc_id: str,
-    image_path: str,
-    ground_truth: str,
-    ocr_result: EngineResult,
-    char_exclude: Optional[frozenset],
-    corpus_lang: str = "fr",
-    profile: str = "standard",
-) -> DocumentResult:
-    """Calcule toutes les métriques pour un document et retourne un DocumentResult.
-    Utilisable à la fois dans le processus principal (IO-bound) et dans les
-    sous-processus créés par ProcessPoolExecutor (CPU-bound).
-    Les imports lourds sont différés pour accélérer le démarrage des sous-processus.
-    Chantier 2 (post-Sprint 97) — refonte
-    ------------------------------------
-    Les 11 ``try/except`` codés en dur (Sprints 5+10+39+42+61+86+87) sont
-    désormais centralisés dans ``picarones.measurements.builtin_hooks`` et
-    sélectionnés via ``run_document_hooks(profile)``.  Le profil
-    ``"standard"`` (défaut) reproduit strictement le comportement
-    pré-chantier-2.  Les profils ``"minimal"``, ``"philological"``,
-    ``"diagnostics"``, ``"economics"``, ``"pipeline"``, ``"full"``
-    permettent à l'utilisateur de moduler le coût de calcul.
-    """
-    import logging as _logging
-    _logger = _logging.getLogger(__name__)
-    # Eager-load des hooks natifs pour peupler le registre dans les
-    # sous-processus du pool (le top-level ``import`` du runner ne le fait
-    # pas pour ne pas pénaliser le démarrage des moteurs minimaux).
-    import picarones.measurements.builtin_hooks  # noqa: F401
-    from picarones.core.metric_hooks import run_document_hooks
-    if ocr_result.success:
-        metrics = compute_metrics(ground_truth, ocr_result.text, char_exclude=char_exclude)
-    else:
-        metrics = MetricsResult(
-            cer=1.0, cer_nfc=1.0, cer_caseless=1.0,
-            wer=1.0, wer_normalized=1.0, mer=1.0, wil=1.0,
-            reference_length=len(ground_truth),
-            hypothesis_length=0,
-            error=ocr_result.error,
-        )
-    ocr_intermediate = ocr_result.metadata.get("ocr_intermediate")
-    pipeline_meta: dict = {}
-    if ocr_result.metadata.get("is_pipeline"):
-        pipeline_meta = {
-            "pipeline_mode": ocr_result.metadata.get("pipeline_mode"),
-            "prompt_file": ocr_result.metadata.get("prompt_file"),
-            "llm_model": ocr_result.metadata.get("llm_model"),
-            "llm_provider": ocr_result.metadata.get("llm_provider"),
-        }
-        if ocr_intermediate is not None and ocr_result.success:
-            try:
-                from picarones.pipelines.over_normalization import detect_over_normalization
-                over_norm = detect_over_normalization(
-                    ground_truth=ground_truth,
-                    ocr_text=ocr_intermediate,
-                    llm_text=ocr_result.text,
-                )
-                pipeline_meta["over_normalization"] = over_norm.as_dict()
-            except Exception as e:
-                _logger.warning("[over_normalization] fonctionnalité dégradée : %s", e)
-    # Hooks document-level — chaque hook produit un attribut nommé du
-    # ``DocumentResult``.  Les hooks invalides pour ce contexte (échec
-    # OCR pour les hooks ``requires_success``, absence de
-    # ``token_confidences`` pour ``calibration``) sont sautés
-    # silencieusement.  Les exceptions levées par un hook sont
-    # capturées et loggées en warning par ``run_document_hooks``.
-    extras = run_document_hooks(
-        profile,
-        ground_truth=ground_truth,
-        hypothesis=ocr_result.text,
-        image_path=image_path,
-        corpus_lang=corpus_lang,
-        ocr_result=ocr_result,
-    )
-    return DocumentResult(
-        doc_id=doc_id,
-        image_path=image_path,
-        ground_truth=ground_truth,
-        hypothesis=ocr_result.text,
-        metrics=metrics,
-        duration_seconds=ocr_result.duration_seconds,
-        engine_error=ocr_result.error,
-        ocr_intermediate=ocr_intermediate,
-        pipeline_metadata=pipeline_meta,
-        confusion_matrix=extras.get("confusion_matrix"),
-        char_scores=extras.get("char_scores"),
-        taxonomy=extras.get("taxonomy"),
-        structure=extras.get("structure"),
-        image_quality=extras.get("image_quality"),
-        line_metrics=extras.get("line_metrics"),
-        hallucination_metrics=extras.get("hallucination_metrics"),
-        calibration_metrics=extras.get("calibration_metrics"),
-        philological_metrics=extras.get("philological_metrics"),
-        searchability_metrics=extras.get("searchability_metrics"),
-        numerical_sequence_metrics=extras.get("numerical_sequence_metrics"),
-        readability_metrics=extras.get("readability_metrics"),
-    )
-def _make_timeout_doc_result(doc: object, timeout_seconds: float) -> DocumentResult:
-    """DocumentResult synthétique pour un document ayant dépassé le timeout."""
-    err = f"timeout ({timeout_seconds:.0f}s)"
-    metrics = MetricsResult(
-        cer=1.0, cer_nfc=1.0, cer_caseless=1.0,
-        wer=1.0, wer_normalized=1.0, mer=1.0, wil=1.0,
-        reference_length=len(doc.ground_truth),  # type: ignore[attr-defined]
-        hypothesis_length=0,
-        error=err,
-    )
-    return DocumentResult(
-        doc_id=doc.doc_id,  # type: ignore[attr-defined]
-        image_path=str(doc.image_path),  # type: ignore[attr-defined]
-        ground_truth=doc.ground_truth,  # type: ignore[attr-defined]
-        hypothesis="",
-        metrics=metrics,
-        duration_seconds=timeout_seconds,
-        engine_error=err,
-    )
-def _make_error_doc_result(doc: object, error_msg: str) -> DocumentResult:
-    """DocumentResult synthétique pour un document en erreur inattendue."""
-    metrics = MetricsResult(
-        cer=1.0, cer_nfc=1.0, cer_caseless=1.0,
-        wer=1.0, wer_normalized=1.0, mer=1.0, wil=1.0,
-        reference_length=len(doc.ground_truth),  # type: ignore[attr-defined]
-        hypothesis_length=0,
-        error=error_msg,
-    )
-    return DocumentResult(
-        doc_id=doc.doc_id,  # type: ignore[attr-defined]
-        image_path=str(doc.image_path),  # type: ignore[attr-defined]
-        ground_truth=doc.ground_truth,  # type: ignore[attr-defined]
-        hypothesis="",
-        metrics=metrics,
-        duration_seconds=0.0,
-        engine_error=error_msg,
-    )
-# ---------------------------------------------------------------------------
-# Résultats partiels (sauvegarde / reprise)
-# ---------------------------------------------------------------------------
-def _sanitize_filename(s: str) -> str:
-    return re.sub(r"[^\w\-]", "_", s)[:64]
-def _partial_path(
-    corpus_name: str,
-    engine_name: str,
-    partial_dir: Optional[str | Path],
-) -> Path:
-    base = Path(partial_dir) if partial_dir else Path(tempfile.gettempdir())
-    name = (
-        f"picarones_{_sanitize_filename(corpus_name)}"
-        f"_{_sanitize_filename(engine_name)}.partial.json"
-    )
-    return base / name
-def _load_partial(
-    corpus_name: str,
-    engine_name: str,
-    partial_dir: Optional[str | Path],
-) -> tuple[Path, list[DocumentResult]]:
-    """Charge les résultats partiels d'une exécution précédente interrompue.
-    Returns
-    -------
-    (path, results) — chemin du fichier partiel et liste des DocumentResult déjà calculés.
-    """
-    path = _partial_path(corpus_name, engine_name, partial_dir)
-    results: list[DocumentResult] = []
-    if not path.exists():
-        return path, results
-    try:
-        with path.open("r", encoding="utf-8") as fh:
-            for line in fh:
-                line = line.strip()
-                if not line:
-                    continue
-                d = json.loads(line)
-                m = d.get("metrics", {})
-                metrics = MetricsResult(
-                    cer=m.get("cer", 1.0),
-                    cer_nfc=m.get("cer_nfc", 1.0),
-                    cer_caseless=m.get("cer_caseless", 1.0),
-                    wer=m.get("wer", 1.0),
-                    wer_normalized=m.get("wer_normalized", 1.0),
-                    mer=m.get("mer", 1.0),
-                    wil=m.get("wil", 1.0),
-                    reference_length=m.get("reference_length", 0),
-                    hypothesis_length=m.get("hypothesis_length", 0),
-                    error=m.get("error"),
-                )
-                results.append(DocumentResult(
-                    doc_id=d["doc_id"],
-                    image_path=d.get("image_path", ""),
-                    ground_truth=d.get("ground_truth", ""),
-                    hypothesis=d.get("hypothesis", ""),
-                    metrics=metrics,
-                    duration_seconds=d.get("duration_seconds", 0.0),
-                    engine_error=d.get("engine_error"),
-                    ocr_intermediate=d.get("ocr_intermediate"),
-                    pipeline_metadata=d.get("pipeline_metadata", {}),
-                    confusion_matrix=d.get("confusion_matrix"),
-                    char_scores=d.get("char_scores"),
-                    taxonomy=d.get("taxonomy"),
-                    structure=d.get("structure"),
-                    image_quality=d.get("image_quality"),
-                    line_metrics=d.get("line_metrics"),
-                    hallucination_metrics=d.get("hallucination_metrics"),
-                ))
-    except Exception as e:
-        logger.warning("Impossible de charger les résultats partiels '%s' : %s", path, e)
-        results = []
-    return path, results
-def _save_partial_line(partial_path: Path, doc_result: DocumentResult) -> None:
-    """Ajoute une entrée NDJSON au fichier de résultats partiels (thread-safe)."""
-    try:
-        line = json.dumps(doc_result.as_dict(), ensure_ascii=False) + "\n"
-        with _partial_write_lock:
-            with partial_path.open("a", encoding="utf-8") as fh:
-                fh.write(line)
-    except Exception as e:
-        logger.warning("Impossible d'écrire dans le fichier partiel '%s' : %s", partial_path, e)
-def _delete_partial(partial_path: Path) -> None:
-    """Supprime le fichier de résultats partiels à la fin d'un moteur."""
-    try:
-        if partial_path.exists():
-            partial_path.unlink()
-    except Exception as e:
-        logger.warning("Impossible de supprimer le fichier partiel '%s' : %s", partial_path, e)
-# ---------------------------------------------------------------------------
-# Benchmark principal
-# ---------------------------------------------------------------------------
 def run_benchmark(
     corpus: Corpus,
@@ -838,182 +491,4 @@ def _build_pipeline_info(engine: BaseOCREngine, doc_results: list[DocumentResult
     return info
-# ---------------------------------------------------------------------------
-# Helpers d'agrégation — délégations rétrocompat
-# ---------------------------------------------------------------------------
-# Chantier 2 (post-Sprint 97) : les implémentations vivent désormais dans
-# :mod:`picarones.measurements.builtin_hooks` (single source of truth, exposé via
-# le registre :mod:`picarones.core.metric_hooks`).  Les noms ci-dessous
-# restent disponibles depuis ``picarones.measurements.runner`` pour la rétrocompat
-# des tests Sprint 13 / 42 qui les importent directement.
-def _aggregate_confusion(doc_results: list) -> Optional[dict]:
-    """Délégation vers :func:`builtin_hooks._aggregate_confusion`."""
-    from picarones.measurements.builtin_hooks import _aggregate_confusion as _impl
-    return _impl(doc_results)
-def _aggregate_char_scores(doc_results: list) -> Optional[dict]:
-    """Délégation vers :func:`builtin_hooks._aggregate_char_scores`."""
-    from picarones.measurements.builtin_hooks import _aggregate_char_scores as _impl
-    return _impl(doc_results)
-def _aggregate_taxonomy(doc_results: list) -> Optional[dict]:
-    """Délégation vers :func:`builtin_hooks._aggregate_taxonomy`."""
-    from picarones.measurements.builtin_hooks import _aggregate_taxonomy as _impl
-    return _impl(doc_results)
-def _aggregate_structure(doc_results: list) -> Optional[dict]:
-    """Délégation vers :func:`builtin_hooks._aggregate_structure`."""
-    from picarones.measurements.builtin_hooks import _aggregate_structure as _impl
-    return _impl(doc_results)
-def _aggregate_image_quality(doc_results: list) -> Optional[dict]:
-    """Délégation vers :func:`builtin_hooks._aggregate_image_quality`."""
-    from picarones.measurements.builtin_hooks import _aggregate_image_quality as _impl
-    return _impl(doc_results)
-def _aggregate_line_metrics(doc_results: list) -> Optional[dict]:
-    """Délégation vers :func:`builtin_hooks._aggregate_line_metrics`."""
-    from picarones.measurements.builtin_hooks import _aggregate_line_metrics as _impl
-    return _impl(doc_results)
-def _aggregate_hallucination(doc_results: list) -> Optional[dict]:
-    """Délégation vers :func:`builtin_hooks._aggregate_hallucination`."""
-    from picarones.measurements.builtin_hooks import _aggregate_hallucination as _impl
-    return _impl(doc_results)
-# ──────────────────────────────────────────────────────────────────────────
-# Sprint 40 — extraction NER au post-process et agrégation
-# ──────────────────────────────────────────────────────────────────────────
-def _attach_ner_metrics(
-    corpus: Corpus,
-    doc_results: list,
-    entity_extractor: callable,
-) -> None:
-    """Calcule et attache ``DocumentResult.ner_metrics`` pour chaque doc
-    dont la GT possède un niveau ``ENTITIES`` (Sprint 32).
-    L'extracteur est appelé sur l'hypothèse OCR ``dr.hypothesis``.
-    Les erreurs sont dégradées en warnings (pas de propagation) afin
-    de ne pas casser le benchmark si un document spécifique fait
-    crasher le NER.
-    """
-    try:
-        from picarones.core.corpus import GTLevel
-        from picarones.measurements.ner import compute_ner_metrics
-    except ImportError as exc:
-        logger.warning("[ner.attach] imports indisponibles : %s", exc)
-        return
-    docs_by_id = {d.doc_id: d for d in corpus.documents}
-    n_done = 0
-    for dr in doc_results:
-        if dr.engine_error is not None or not dr.hypothesis:
-            continue
-        doc = docs_by_id.get(dr.doc_id)
-        if doc is None or not doc.has_gt(GTLevel.ENTITIES):
-            continue
-        try:
-            gt_payload = doc.get_gt(GTLevel.ENTITIES)
-            gt_entities = list(gt_payload.entities) if gt_payload else []
-            hyp_entities = entity_extractor(dr.hypothesis) or []
-            dr.ner_metrics = compute_ner_metrics(gt_entities, hyp_entities)
-            n_done += 1
-        except Exception as exc:  # noqa: BLE001
-            logger.warning(
-                "[ner.attach] %s : extraction/comparaison NER dégradée : %s",
-                dr.doc_id, exc,
-            )
-    if n_done > 0:
-        logger.info("[ner] %d documents évalués pour NER.", n_done)
-def _aggregate_calibration(doc_results: list) -> Optional[dict]:
-    """Délégation vers :func:`builtin_hooks._aggregate_calibration`.
-    Conservé pour la rétrocompat du test ``test_sprint42_calibration_runner``
-    qui importe directement depuis ``picarones.measurements.runner``. La logique
-    réelle vit dans :mod:`picarones.measurements.builtin_hooks` (chantier 2
-    post-Sprint 97).
-    """
-    from picarones.measurements.builtin_hooks import _aggregate_calibration as _impl
-    return _impl(doc_results)
-def _aggregate_ner(doc_results: list) -> Optional[dict]:
-    """Agrège les métriques NER au niveau du moteur.
-    Recalcule precision/recall/F1 *micro* à partir des sommes globales
-    de TP/FP/FN, plus le détail par catégorie, plus les compteurs
-    totaux d'hallucinations et d'entités manquées.
-    """
-    relevant = [dr for dr in doc_results if dr.ner_metrics is not None]
-    if not relevant:
-        return None
-    total_tp = 0
-    total_fp = 0
-    total_fn = 0
-    cat_tp: dict[str, int] = {}
-    cat_fp: dict[str, int] = {}
-    cat_fn: dict[str, int] = {}
-    total_hallucinated = 0
-    total_missed = 0
-    iou_threshold = 0.5
-    for dr in relevant:
-        m = dr.ner_metrics
-        total_tp += int(m.get("true_positives", 0))
-        total_fp += int(m.get("false_positives", 0))
-        total_fn += int(m.get("false_negatives", 0))
-        total_hallucinated += len(m.get("hallucinated_entities", []) or [])
-        total_missed += len(m.get("missed_entities", []) or [])
-        iou_threshold = float(m.get("iou_threshold", iou_threshold))
-        for cat, stats in (m.get("per_category") or {}).items():
-            cat_tp[cat] = cat_tp.get(cat, 0)
-            cat_fp[cat] = cat_fp.get(cat, 0)
-            cat_fn[cat] = cat_fn.get(cat, 0)
-            # Reconstitue les sommes par catégorie via support et P/R
-            support = int(stats.get("support", 0))
-            recall = float(stats.get("recall", 0.0))
-            precision = float(stats.get("precision", 0.0))
-            tp_cat = round(support * recall) if support > 0 else 0
-            fn_cat = max(0, support - tp_cat)
-            fp_cat = (
-                round(tp_cat * (1 - precision) / precision)
-                if precision > 0 else 0
-            )
-            cat_tp[cat] += tp_cat
-            cat_fp[cat] += fp_cat
-            cat_fn[cat] += fn_cat
-    def _prf(tp: int, fp: int, fn: int) -> dict[str, float]:
-        p = tp / (tp + fp) if (tp + fp) > 0 else 0.0
-        r = tp / (tp + fn) if (tp + fn) > 0 else 0.0
-        f1 = 2 * p * r / (p + r) if (p + r) > 0 else 0.0
-        return {"precision": p, "recall": r, "f1": f1, "support": tp + fn}
-    return {
-        "global": _prf(total_tp, total_fp, total_fn),
-        "per_category": {
-            cat: _prf(cat_tp[cat], cat_fp[cat], cat_fn[cat])
-            for cat in sorted(set(cat_tp) | set(cat_fp) | set(cat_fn))
-        },
-        "true_positives": total_tp,
-        "false_positives": total_fp,
-        "false_negatives": total_fn,
-        "hallucinated_total": total_hallucinated,
-        "missed_total": total_missed,
-        "doc_count": len(relevant),
-        "iou_threshold": iou_threshold,
-    }

+"""Main benchmark orchestrator.
+Contains :func:`run_benchmark` and its helper :func:`_build_pipeline_info`.
+The runner executes each engine in the list over the whole corpus:
+- For CPU-bound engines (``execution_mode == "cpu"``:
+  Tesseract, Pero OCR, Kraken), it uses a ``ProcessPoolExecutor``
+  and delegates to the picklable workers in :mod:`workers`.
+- For IO-bound engines (Mistral, Google Vision, Azure, LLMs),
+  it uses a ``ThreadPoolExecutor``.
+Partial results (one NDJSON file per engine) are handled by
+:mod:`partial`; the computation of an individual :class:`DocumentResult`
+by :mod:`document`; the final aggregation by the hooks delegated to
+:mod:`builtin_hooks` (chantier 2, post-Sprint 97).
 """
 from __future__ import annotations
 import concurrent.futures
 import logging
 import threading
 import time
 from pathlib import Path
 from tqdm import tqdm
 from picarones.core.corpus import Corpus
 from picarones.core.results import BenchmarkResult, DocumentResult, EngineReport
+from picarones.engines.base import BaseOCREngine
+from picarones.measurements.runner.document import (
+    _make_error_doc_result,
+    _make_timeout_doc_result,
+)
+from picarones.measurements.runner.ner_attach import (
+    _aggregate_ner,
+    _attach_ner_metrics,
+)
+from picarones.measurements.runner.partial import (
+    _delete_partial,
+    _load_partial,
+    _save_partial_line,
+)
+from picarones.measurements.runner.workers import (
+    _cpu_doc_worker,
+    _io_doc_worker,
+)
 logger = logging.getLogger(__name__)
 def run_benchmark(
     corpus: Corpus,
     return info
+__all__ = ["_build_pipeline_info", "run_benchmark"]
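
Review note on the orchestrator docstring: the executor choice it describes can be sketched as below. This is an illustration under stated assumptions, not the body of `run_benchmark` (which this view elides). `pick_executor`, `fake_ocr` and `run_all` are hypothetical names; only the `execution_mode == "cpu"` test and the two executor classes come from the diff.

```python
from __future__ import annotations

from concurrent.futures import (
    Executor,
    ProcessPoolExecutor,
    ThreadPoolExecutor,
    as_completed,
)


def pick_executor(execution_mode: str, max_workers: int) -> Executor:
    """Processes for CPU-bound engines, threads for IO-bound ones."""
    if execution_mode == "cpu":
        return ProcessPoolExecutor(max_workers=max_workers)
    return ThreadPoolExecutor(max_workers=max_workers)


def fake_ocr(doc_id: str) -> tuple[str, int]:
    """Stand-in for an engine call; defined at module level so it pickles."""
    return doc_id, len(doc_id)


def run_all(
    doc_ids: list[str], execution_mode: str = "io", max_workers: int = 2
) -> dict[str, int]:
    """Submit one task per document and collect results as they complete."""
    results: dict[str, int] = {}
    with pick_executor(execution_mode, max_workers) as pool:
        futures = {pool.submit(fake_ocr, d): d for d in doc_ids}
        for fut in as_completed(futures):
            doc_id, value = fut.result()
            results[doc_id] = value
    return results


io_results = run_all(["a", "bbb"], execution_mode="io")
```

The process-pool branch needs workers that are importable by name, which is consistent with the commit moving `_cpu_doc_worker` to module level in `workers.py`.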

picarones/measurements/runner/partial.py ADDED

@@ -0,0 +1,140 @@

+"""Persistence of partial benchmark results (NDJSON).
+While the runner processes a corpus, it writes each ``DocumentResult``
+to a file ``{partial_dir}/picarones_{corpus}_{engine}.partial.json``
+in NDJSON format. If the benchmark is interrupted (Ctrl+C, crash, kill),
+the next run resumes from this file without losing the work already
+done.
+Thread-safe: the module uses a single :class:`threading.Lock` shared
+by all writes to serialize the appends.
+"""
+from __future__ import annotations
+import json
+import logging
+import re
+import tempfile
+import threading
+from pathlib import Path
+from typing import Optional
+from picarones.core.results import DocumentResult
+from picarones.measurements.metrics import MetricsResult
+logger = logging.getLogger(__name__)
+# Lock pour la sérialisation des écritures de résultats partiels.
+# Partagé entre tous les call sites (workers IO et CPU se relayent
+# sur la même file).
+_partial_write_lock = threading.Lock()
+def _sanitize_filename(s: str) -> str:
+    return re.sub(r"[^\w\-]", "_", s)[:64]
+def _partial_path(
+    corpus_name: str,
+    engine_name: str,
+    partial_dir: Optional[str | Path],
+) -> Path:
+    base = Path(partial_dir) if partial_dir else Path(tempfile.gettempdir())
+    name = (
+        f"picarones_{_sanitize_filename(corpus_name)}"
+        f"_{_sanitize_filename(engine_name)}.partial.json"
+    )
+    return base / name
+def _load_partial(
+    corpus_name: str,
+    engine_name: str,
+    partial_dir: Optional[str | Path],
+) -> tuple[Path, list[DocumentResult]]:
+    """Load the partial results of a previous, interrupted run.
+    Returns
+    -------
+    (path, results): path of the partial file and list of the
+    DocumentResult objects already computed.
+    """
+    path = _partial_path(corpus_name, engine_name, partial_dir)
+    results: list[DocumentResult] = []
+    if not path.exists():
+        return path, results
+    try:
+        with path.open("r", encoding="utf-8") as fh:
+            for line in fh:
+                line = line.strip()
+                if not line:
+                    continue
+                d = json.loads(line)
+                m = d.get("metrics", {})
+                metrics = MetricsResult(
+                    cer=m.get("cer", 1.0),
+                    cer_nfc=m.get("cer_nfc", 1.0),
+                    cer_caseless=m.get("cer_caseless", 1.0),
+                    wer=m.get("wer", 1.0),
+                    wer_normalized=m.get("wer_normalized", 1.0),
+                    mer=m.get("mer", 1.0),
+                    wil=m.get("wil", 1.0),
+                    reference_length=m.get("reference_length", 0),
+                    hypothesis_length=m.get("hypothesis_length", 0),
+                    error=m.get("error"),
+                )
+                results.append(DocumentResult(
+                    doc_id=d["doc_id"],
+                    image_path=d.get("image_path", ""),
+                    ground_truth=d.get("ground_truth", ""),
+                    hypothesis=d.get("hypothesis", ""),
+                    metrics=metrics,
+                    duration_seconds=d.get("duration_seconds", 0.0),
+                    engine_error=d.get("engine_error"),
+                    ocr_intermediate=d.get("ocr_intermediate"),
+                    pipeline_metadata=d.get("pipeline_metadata", {}),
+                    confusion_matrix=d.get("confusion_matrix"),
+                    char_scores=d.get("char_scores"),
+                    taxonomy=d.get("taxonomy"),
+                    structure=d.get("structure"),
+                    image_quality=d.get("image_quality"),
+                    line_metrics=d.get("line_metrics"),
+                    hallucination_metrics=d.get("hallucination_metrics"),
+                ))
+    except Exception as e:
+        logger.warning("Impossible de charger les résultats partiels '%s' : %s", path, e)
+        results = []
+    return path, results
+def _save_partial_line(partial_path: Path, doc_result: DocumentResult) -> None:
+    """Append one NDJSON entry to the partial-results file (thread-safe)."""
+    try:
+        line = json.dumps(doc_result.as_dict(), ensure_ascii=False) + "\n"
+        with _partial_write_lock:
+            with partial_path.open("a", encoding="utf-8") as fh:
+                fh.write(line)
+    except Exception as e:
+        logger.warning("Impossible d'écrire dans le fichier partiel '%s' : %s", partial_path, e)
+def _delete_partial(partial_path: Path) -> None:
+    """Delete the partial-results file once an engine has finished."""
+    try:
+        if partial_path.exists():
+            partial_path.unlink()
+    except Exception as e:
+        logger.warning("Impossible de supprimer le fichier partiel '%s' : %s", partial_path, e)
+__all__ = [
+    "_delete_partial",
+    "_load_partial",
+    "_partial_path",
+    "_partial_write_lock",
+    "_sanitize_filename",
+    "_save_partial_line",
+]

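The partial-results helpers above append one JSON object per line and read the file back line by line. A minimal, self-contained sketch of that NDJSON round trip follows. The names `append_record`, `load_records`, `demo` and the file name are hypothetical stand-ins, not the Picarones functions, and plain dicts replace `DocumentResult`.

```python
import json
import tempfile
import threading
from pathlib import Path

# Hypothetical lock, playing the role of _partial_write_lock in the diff.
_write_lock = threading.Lock()


def append_record(path: Path, record: dict) -> None:
    """Append one JSON object as a single NDJSON line."""
    line = json.dumps(record, ensure_ascii=False) + "\n"
    with _write_lock:
        with path.open("a", encoding="utf-8") as fh:
            fh.write(line)


def load_records(path: Path) -> list[dict]:
    """Read back every non-empty line; return [] if the file is missing."""
    if not path.exists():
        return []
    with path.open("r", encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def demo() -> list[dict]:
    """Write two records to a temporary file and load them again."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "engine.partial.ndjson"
        append_record(path, {"doc_id": "p1", "metrics": {"cer": 0.12}})
        append_record(path, {"doc_id": "p2", "metrics": {}})
        return load_records(path)
```

Serializing the line before taking the lock keeps the critical section down to the file write, which is the same ordering `_save_partial_line` uses.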
picarones/measurements/runner/workers.py ADDED

@@ -0,0 +1,107 @@

+"""Module-level workers for the execution pools.
+Two workers, one per execution mode:
+- :func:`_cpu_doc_worker`: for ``ProcessPoolExecutor`` (CPU-bound
+  engines, instantiated in the subprocess). It must be picklable,
+  which is why it is defined at module level.
+- :func:`_io_doc_worker`: for ``ThreadPoolExecutor`` (IO-bound
+  engines / HTTP APIs). The engine instance is shared between
+  threads.
+Both end up calling :func:`_compute_document_result` from the
+:mod:`document` submodule to compute all metrics.
+"""
+from __future__ import annotations
+from typing import Optional
+from picarones.core.results import DocumentResult
+from picarones.engines.base import BaseOCREngine
+from picarones.measurements.runner.document import _compute_document_result
+def _cpu_doc_worker(args: tuple) -> "DocumentResult":
+    """Worker for ProcessPoolExecutor (CPU-bound engines).
+    Instantiates the engine in the subprocess, runs OCR and computes
+    all metrics.  Must be a module-level function so that ``pickle``
+    can serialize it.
+    For backward compatibility, the ``args`` tuple may contain:
+    - 7 elements: legacy (Sprint 13)
+    - 8 elements: + ``corpus_lang`` (Sprint 87)
+    - 9 elements: + ``profile`` (workstream 2, post-Sprint 97)
+    """
+    if len(args) == 9:
+        (engine_module, engine_class_name, engine_config, doc_id,
+         image_path, ground_truth, char_exclude_chars, corpus_lang,
+         profile) = args
+    elif len(args) == 8:
+        (engine_module, engine_class_name, engine_config, doc_id,
+         image_path, ground_truth, char_exclude_chars, corpus_lang) = args
+        profile = "standard"
+    else:
+        (engine_module, engine_class_name, engine_config, doc_id,
+         image_path, ground_truth, char_exclude_chars) = args
+        corpus_lang = "fr"
+        profile = "standard"
+    import importlib
+    mod = importlib.import_module(engine_module)
+    engine_cls = getattr(mod, engine_class_name)
+    engine = engine_cls(config=engine_config)
+    ocr_result = engine.run(image_path)
+    char_exclude = frozenset(char_exclude_chars) if char_exclude_chars else None
+    return _compute_document_result(
+        doc_id=doc_id,
+        image_path=image_path,
+        ground_truth=ground_truth,
+        ocr_result=ocr_result,
+        char_exclude=char_exclude,
+        corpus_lang=corpus_lang,
+        profile=profile,
+    )
+def _io_doc_worker(
+    engine: BaseOCREngine,
+    doc: object,
+    char_exclude: Optional[frozenset],
+    corpus_lang: str = "fr",
+    profile: str = "standard",
+) -> "DocumentResult":
+    """Worker for ThreadPoolExecutor (IO-bound engines / APIs).
+    Runs OCR and computes the metrics in a thread.  The engine instance
+    is shared between threads; HTTP adapters generally hold no mutable
+    state between calls.
+    If the document carries precomputed OCR text (triplet corpus) and the
+    engine is an OCR+LLM pipeline, uses ``run_with_ocr_text()`` to skip
+    the OCR step and test the LLM post-correction directly.
+    """
+    doc_ocr_text = getattr(doc, "ocr_text", None)
+    if doc_ocr_text is not None:
+        # Triplet corpus: check whether the engine supports run_with_ocr_text
+        run_with = getattr(engine, "run_with_ocr_text", None)
+        if run_with is not None:
+            ocr_result = run_with(doc.image_path, doc_ocr_text)  # type: ignore[attr-defined]
+        else:
+            # Plain OCR engine: ignore the precomputed OCR text
+            ocr_result = engine.run(doc.image_path)  # type: ignore[attr-defined]
+    else:
+        ocr_result = engine.run(doc.image_path)  # type: ignore[attr-defined]
+    return _compute_document_result(
+        doc_id=doc.doc_id,  # type: ignore[attr-defined]
+        image_path=str(doc.image_path),  # type: ignore[attr-defined]
+        ground_truth=doc.ground_truth,  # type: ignore[attr-defined]
+        ocr_result=ocr_result,
+        char_exclude=char_exclude,
+        corpus_lang=corpus_lang,
+        profile=profile,
+    )
+__all__ = ["_cpu_doc_worker", "_io_doc_worker"]

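`_cpu_doc_worker` accepts 7-, 8- or 9-element tuples and fills in `corpus_lang="fr"` and `profile="standard"` when they are absent. As an illustration only, the same backward-compatibility rule can be written as a padding step. `normalize_worker_args` and the sample tuples below are hypothetical and not part of this commit. Unlike the `else` branch in the diff, this sketch rejects unexpected lengths explicitly.

```python
# Defaults taken from the diff: corpus_lang="fr", profile="standard".
_TAIL_DEFAULTS = ("fr", "standard")
_FULL_LEN = 9


def normalize_worker_args(args: tuple) -> tuple:
    """Pad a 7- or 8-element legacy tuple to the current 9-element layout."""
    if not 7 <= len(args) <= _FULL_LEN:
        raise ValueError(f"expected 7 to 9 worker arguments, got {len(args)}")
    missing = _FULL_LEN - len(args)
    if missing == 0:
        return args
    # Take only the defaults for the trailing fields that are absent.
    return args + _TAIL_DEFAULTS[len(_TAIL_DEFAULTS) - missing:]


# Hypothetical argument tuples in the legacy (7) and Sprint 87 (8) layouts.
legacy = ("pkg.engine", "Engine", {}, "doc1", "a.png", "truth", "")
sprint87 = legacy + ("la",)
```

After normalization the worker could unpack a single 9-element layout, at the cost of one extra helper.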
picarones/measurements/statistics.py DELETED

@@ -1,1128 +0,0 @@
-"""Statistical tests and error clustering for Picarones.
-Functions provided
-------------------
-- wilcoxon_test(a, b)                  : Wilcoxon signed-rank (2 paired engines)
-- bootstrap_ci(values, ...)            : 95% confidence interval by bootstrap
-- compute_pairwise_stats(...)          : Wilcoxon matrix over all pairs
-- friedman_test(engine_cer_map)        : Friedman (k engines, n documents)       [Sprint 17]
-- nemenyi_posthoc(engine_cer_map)      : Nemenyi post-hoc with critical distance [Sprint 17]
-- build_critical_difference_svg(...)   : SVG rendering of the CDD (Demšar 2006)  [Sprint 17]
-- compute_pareto_front(points, ...)    : multi-objective Pareto front            [Sprint 19]
-- cluster_errors(...)                  : grouping of error patterns
-- compute_correlation_matrix(...)      : metric correlation matrix
-- compute_reliability_curve(...)       : CER vs. % of easiest documents
-- compute_venn_data(...)               : Venn diagram for 2/3 engines
-"""
-from __future__ import annotations
-import math
-import random
-import re
-from collections import defaultdict
-from dataclasses import dataclass
-from typing import Optional
-# Optional scipy import: used for the Wilcoxon test when available
-# (exact method for n ≤ 25, normal approximation for n > 25).
-# Without it, the native implementation (normal approximation for n ≥ 10)
-# is used automatically.
-try:
-    from scipy.stats import wilcoxon as _scipy_wilcoxon  # type: ignore[import-untyped]
-    _SCIPY_AVAILABLE = True
-except ImportError:
-    _SCIPY_AVAILABLE = False
-# ---------------------------------------------------------------------------
-# Bootstrap CI
-# ---------------------------------------------------------------------------
-def bootstrap_ci(
-    values: list[float],
-    n_iter: int = 1000,
-    ci: float = 0.95,
-    seed: int = 42,
-) -> tuple[float, float]:
-    """Bootstrap confidence interval for the mean (percentile method).
-    Parameters
-    ----------
-    values : list of values (e.g. per-document CER)
-    n_iter : number of bootstrap iterations (default 1000)
-    ci     : confidence level (default 0.95, i.e. 95%)
-    seed   : RNG seed for reproducibility
-    Returns
-    -------
-    (lower, upper): the bounds of the CI at level ``ci``
-    """
-    if not values:
-        return (0.0, 0.0)
-    rng = random.Random(seed)
-    n = len(values)
-    means = []
-    for _ in range(n_iter):
-        sample = [values[rng.randint(0, n - 1)] for _ in range(n)]
-        means.append(sum(sample) / n)
-    means.sort()
-    alpha = (1.0 - ci) / 2.0
-    lo_idx = max(0, int(alpha * n_iter))
-    hi_idx = min(n_iter - 1, int((1.0 - alpha) * n_iter))
-    return (means[lo_idx], means[hi_idx])
-# ---------------------------------------------------------------------------
-# Wilcoxon signed-rank test (pure-Python implementation)
-# ---------------------------------------------------------------------------
-def wilcoxon_test(
-    a: list[float],
-    b: list[float],
-    zero_method: str = "wilcox",
-) -> dict:
-    """Wilcoxon signed-rank test between two paired CER series.
-    Returns a dict with:
-      - statistic     : W = min(W⁺, W⁻)
-      - p_value       : two-sided p-value
-      - significant   : bool (p < 0.05)
-      - interpretation : human-readable sentence
-      - n_pairs       : number of pairs used (after dropping zeros)
-      - W_plus        : sum of the ranks of positive differences (absent if n_pairs = 0)
-      - W_minus       : sum of the ranks of negative differences (absent if n_pairs = 0)
-    Assumptions and limits
-    ----------------------
-    * Observations are paired (same corpus, two different engines).
-    * The test is non-parametric: no normality assumption on the CERs.
-    * ``zero_method="wilcox"`` (default): pairs with no difference (aᵢ = bᵢ)
-      are simply excluded.  The other methods (``"pratt"``, ``"zsplit"``)
-      require scipy.
-    * **Normal approximation** (native implementation, n ≥ 10):
-      the approximation is reasonable for n ≥ 10 and converges to the
-      exact distribution.  For n < 10, a simplified critical table is
-      used (p ∈ {0.04, 0.20}), a **conservative** result.
-    * **scipy** (if installed): ``scipy.stats.wilcoxon`` is used instead
-      of the native approximation.  scipy uses the exact method for n ≤ 25
-      and the normal approximation for n > 25, which is more accurate.
-    * **Validity**: the test assumes the distribution of the differences
-      is symmetric.  With very small n (< 5), results are unreliable
-      whatever the method.
-    Parameters
-    ----------
-    a, b : CER series (same length, same document order)
-    zero_method : handling of zero-difference pairs (default: ``"wilcox"``)
-    """
-    if len(a) != len(b):
-        raise ValueError("Les deux listes doivent avoir la même longueur")
-    diffs = [x - y for x, y in zip(a, b)]
-    # Drop zero differences ("wilcox" method)
-    if zero_method == "wilcox":
-        diffs = [d for d in diffs if d != 0.0]
-    n = len(diffs)
-    if n == 0:
-        return {
-            "statistic": 0.0,
-            "p_value": 1.0,
-            "significant": False,
-            "interpretation": "Aucune différence entre les deux concurrents.",
-            "n_pairs": 0,
-        }
-    # Ranks of the absolute values
-    abs_diffs = [abs(d) for d in diffs]
-    indexed = sorted(enumerate(abs_diffs), key=lambda x: x[1])
-    # Tie handling: average rank
-    ranks = [0.0] * n
-    i = 0
-    while i < n:
-        j = i
-        while j < n and abs_diffs[indexed[j][0]] == abs_diffs[indexed[i][0]]:
-            j += 1
-        avg_rank = (i + j + 1) / 2.0  # average rank (1-based)
-        for k in range(i, j):
-            ranks[indexed[k][0]] = avg_rank
-        i = j
-    W_plus  = sum(ranks[k] for k in range(n) if diffs[k] > 0)
-    W_minus = sum(ranks[k] for k in range(n) if diffs[k] < 0)
-    W = min(W_plus, W_minus)
-    # p-value computation: scipy if available, otherwise the native approximation
-    if _SCIPY_AVAILABLE:
-        try:
-            scipy_res = _scipy_wilcoxon(diffs, zero_method=zero_method)
-            p_value = float(scipy_res.pvalue)
-        except Exception:
-            # Fall back to the native implementation if scipy raises
-            p_value = _native_p_value(n, W)
-    else:
-        p_value = _native_p_value(n, W)
-    significant = p_value < 0.05
-    if significant:
-        better = "premier" if W_plus < W_minus else "second"
-        interpretation = (
-            f"Différence statistiquement significative (p = {p_value:.4f} < 0.05). "
-            f"Le {better} concurrent obtient de meilleurs scores."
-        )
-    else:
-        interpretation = (
-            f"Différence non significative (p = {p_value:.4f} ≥ 0.05). "
-            "On ne peut pas conclure que l'un surpasse l'autre."
-        )
-    return {
-        "statistic": round(W, 4),
-        "p_value": round(p_value, 6),
-        "significant": significant,
-        "interpretation": interpretation,
-        "n_pairs": n,
-        "W_plus": round(W_plus, 4),
-        "W_minus": round(W_minus, 4),
-    }
-def _normal_sf(z: float) -> float:
-    """Survival function of the standard normal distribution (1 - CDF)."""
-    # Abramowitz & Stegun 26.2.17 approximation
-    t = 1.0 / (1.0 + 0.2316419 * abs(z))
-    poly = t * (0.319381530 + t * (-0.356563782 + t * (1.781477937
-           + t * (-1.821255978 + t * 1.330274429))))
-    phi_z = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
-    p = phi_z * poly
-    return p if z >= 0 else 1.0 - p
-# Critical values of W for two-sided α=0.05 (exact test, source: Wilcoxon tables)
-_W_CRITICAL = {1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 2, 8: 3, 9: 5}
-def _wilcoxon_exact_p(n: int, w: float) -> float:
-    """Approximate p-value for small n (< 10) via a simplified critical table.
-    Note: **conservative** result; only two values are returned:
-    0.04 (significant at 5%) or 0.20 (not significant).
-    Prefer scipy for exact p-values.
-    """
-    critical = _W_CRITICAL.get(n, 0)
-    if w <= critical:
-        return 0.04  # significant at 5%
-    return 0.20      # not significant (conservative approximation)
-def _native_p_value(n: int, W: float) -> float:
-    """Compute the p-value via the normal approximation (n ≥ 10) or the critical table (n < 10)."""
-    if n >= 10:
-        mu = n * (n + 1) / 4.0
-        sigma2 = n * (n + 1) * (2 * n + 1) / 24.0
-        if sigma2 <= 0:
-            return 1.0
-        z = abs((W + 0.5) - mu) / math.sqrt(sigma2)  # continuity correction
-        return 2.0 * _normal_sf(z)  # two-sided test
-    return _wilcoxon_exact_p(n, W)
-# ---------------------------------------------------------------------------
-# Pairwise test matrix
-# ---------------------------------------------------------------------------
-def compute_pairwise_stats(
-    engine_cer_map: dict[str, list[float]],
-) -> list[dict]:
-    """Run Wilcoxon tests between all pairs of competitors.
-    Parameters
-    ----------
-    engine_cer_map : dict {engine_name → [cer_doc1, cer_doc2, ...]}
-    Returns
-    -------
-    List of dicts, one per pair:
-      - engine_a, engine_b, statistic, p_value, significant, interpretation
-    """
-    names = list(engine_cer_map.keys())
-    results = []
-    for i in range(len(names)):
-        for j in range(i + 1, len(names)):
-            a_name, b_name = names[i], names[j]
-            a_vals = engine_cer_map[a_name]
-            b_vals = engine_cer_map[b_name]
-            # Align lengths by truncating to the shorter series
-            min_len = min(len(a_vals), len(b_vals))
-            if min_len < 2:
-                continue
-            res = wilcoxon_test(a_vals[:min_len], b_vals[:min_len])
-            results.append({
-                "engine_a": a_name,
-                "engine_b": b_name,
-                **res,
-            })
-    return results
-# ---------------------------------------------------------------------------
-# Friedman test + Nemenyi post-hoc (Sprint 17)
-# ---------------------------------------------------------------------------
-#
-# Reference: Demšar, J. (2006), "Statistical Comparisons of Classifiers over
-# Multiple Data Sets", Journal of Machine Learning Research 7:1-30. De facto
-# standard for comparing several systems over several datasets; here,
-# several OCR engines over several documents. The CDD (critical difference
-# diagram) derived from Nemenyi is the canonical rendering.
-# Critical values of the Studentized Range distribution divided by √2,
-# for df = ∞ (the usual approximation for Nemenyi). Source: Tukey tables.
-# Key: number of treatments k; value: q_α for α ∈ {0.05, 0.01}.
-_NEMENYI_Q_TABLE = {
-    # k   q_0.05   q_0.01
-    2:  (1.960, 2.576),
-    3:  (2.343, 2.913),
-    4:  (2.569, 3.113),
-    5:  (2.728, 3.255),
-    6:  (2.850, 3.364),
-    7:  (2.949, 3.452),
-    8:  (3.031, 3.526),
-    9:  (3.102, 3.590),
-    10: (3.164, 3.646),
-    11: (3.219, 3.696),
-    12: (3.268, 3.741),
-    13: (3.313, 3.781),
-    14: (3.354, 3.818),
-    15: (3.391, 3.853),
-    16: (3.426, 3.886),
-    17: (3.458, 3.916),
-    18: (3.489, 3.944),
-    19: (3.517, 3.970),
-    20: (3.544, 3.995),
-    25: (3.658, 4.095),
-    30: (3.739, 4.167),
-    40: (3.858, 4.272),
-    50: (3.945, 4.349),
-}
-def _chi_square_sf(x: float, df: int) -> float:
-    """Survival function of the chi² distribution, 1 - CDF(x).
-    Uses scipy when available (exact method), otherwise Wilson-Hilferty
-    (normal approximation, accurate from df ≥ 3).
-    """
-    if x <= 0 or df <= 0:
-        return 1.0
-    try:
-        from scipy.stats import chi2 as _chi2  # type: ignore[import-untyped]
-        return float(_chi2.sf(x, df))
-    except ImportError:
-        pass
-    # Wilson-Hilferty: maps chi² onto a normal approximation
-    z = (((x / df) ** (1.0 / 3.0)) - (1.0 - 2.0 / (9.0 * df))) / math.sqrt(2.0 / (9.0 * df))
-    return _normal_sf(z)
-def _rank_row(values: list[float]) -> list[float]:
-    """Ranks within a row; smallest = rank 1. Ties get average ranks."""
-    n = len(values)
-    indexed = sorted(range(n), key=lambda i: values[i])
-    ranks = [0.0] * n
-    i = 0
-    while i < n:
-        j = i
-        while j < n and values[indexed[j]] == values[indexed[i]]:
-            j += 1
-        avg_rank = (i + j + 1) / 2.0  # 1-based
-        for k in range(i, j):
-            ranks[indexed[k]] = avg_rank
-        i = j
-    return ranks
-def _aligned_cer_matrix(
-    engine_cer_map: dict[str, list[float]],
-) -> tuple[list[str], list[list[float]]]:
-    """Build the (k engines × n documents) matrix aligned on the minimum
-    length. Returns ``(names, matrix)`` with one row per engine.
-    Friedman requires complete blocks (documents): if the engines were not
-    all run on the same documents, series are truncated to the minimum
-    length, reported in the result via ``n_blocks``.
-    """
-    names = list(engine_cer_map.keys())
-    if not names:
-        return [], []
-    min_len = min(len(v) for v in engine_cer_map.values())
-    if min_len == 0:
-        return names, []
-    matrix = [engine_cer_map[n][:min_len] for n in names]
-    return names, matrix
-def friedman_test(engine_cer_map: dict[str, list[float]]) -> dict:
-    """Friedman test: k engines over n paired documents.
-    Non-parametric equivalent of repeated-measures ANOVA for ordinal
-    data. Null hypothesis: all engines have the same average
-    performance. Rejection means at least one engine differs from the others.
-    Parameters
-    ----------
-    engine_cer_map:
-        Dict ``{engine_name → [cer_doc1, cer_doc2, ...]}``. All engines
-        must have been evaluated on the same documents (in the same order).
-    Returns
-    -------
-    dict with:
-      - ``statistic``     : Q corrected for ties
-      - ``p_value``       : p-value (scipy if available, else Wilson-Hilferty)
-      - ``significant``   : bool, p < 0.05
-      - ``df``            : degrees of freedom = k - 1
-      - ``n_blocks``      : number of documents (blocks) used
-      - ``n_engines``     : number of engines (k)
-      - ``mean_ranks``    : dict ``{engine: mean_rank}``
-      - ``interpretation``: human-readable sentence
-      - ``error``         : message if the test is not applicable
-    """
-    names, matrix = _aligned_cer_matrix(engine_cer_map)
-    k = len(names)
-    n = len(matrix[0]) if matrix else 0
-    if k < 2:
-        return {
-            "statistic": 0.0, "p_value": 1.0, "significant": False,
-            "df": 0, "n_blocks": n, "n_engines": k,
-            "mean_ranks": {names[0]: 1.0} if k == 1 else {},
-            "interpretation": "Test de Friedman non applicable : il faut au moins 2 moteurs.",
-            "error": "not_enough_engines",
-        }
-    if n < 2:
-        return {
-            "statistic": 0.0, "p_value": 1.0, "significant": False,
-            "df": k - 1, "n_blocks": n, "n_engines": k,
-            "mean_ranks": {name: 1.0 for name in names},
-            "interpretation": "Test de Friedman non applicable : il faut au moins 2 documents communs.",
-            "error": "not_enough_blocks",
-        }
-    # Ranks per block (document): for each doc, rank the k engines
-    ranks_by_engine: list[list[float]] = [[] for _ in range(k)]
-    for j in range(n):
-        row = [matrix[i][j] for i in range(k)]
-        row_ranks = _rank_row(row)
-        for i in range(k):
-            ranks_by_engine[i].append(row_ranks[i])
-    rank_sums = [sum(r) for r in ranks_by_engine]
-    mean_ranks = {names[i]: rank_sums[i] / n for i in range(k)}
-    # Uncorrected Q statistic (no ties)
-    #   Q = 12 / (n·k·(k+1)) · Σ R_j² − 3·n·(k+1)
-    Q = (12.0 / (n * k * (k + 1))) * sum(rs ** 2 for rs in rank_sums) - 3.0 * n * (k + 1)
-    # Tie correction (ties factor): adjusts when ranks are shared within
-    # some blocks. Formula: Q_corr = Q / (1 - T/(n·(k³−k)))
-    # where T = Σ (tⱼ³ − tⱼ) over all tie groups.
-    tie_correction = 0.0
-    for j in range(n):
-        row = [matrix[i][j] for i in range(k)]
-        sorted_row = sorted(row)
-        i = 0
-        while i < len(sorted_row):
-            count = 1
-            while i + count < len(sorted_row) and sorted_row[i + count] == sorted_row[i]:
-                count += 1
-            if count > 1:
-                tie_correction += count ** 3 - count
-            i += count
-    denom = 1.0 - tie_correction / (n * (k ** 3 - k)) if k >= 2 else 1.0
-    if denom > 0:
-        Q = Q / denom
-    df = k - 1
-    p_value = _chi_square_sf(Q, df)
-    significant = p_value < 0.05
-    if significant:
-        interpretation = (
-            f"Test de Friedman significatif (Q = {Q:.3f}, df = {df}, p = {p_value:.4f}). "
-            f"Au moins un moteur diffère des autres — utiliser le post-hoc Nemenyi "
-            f"pour identifier les paires distinguables."
-        )
-    else:
-        interpretation = (
-            f"Test de Friedman non significatif (Q = {Q:.3f}, df = {df}, p = {p_value:.4f}). "
-            f"Aucune différence globale détectée entre les moteurs sur ce corpus."
-        )
-    return {
-        "statistic": round(Q, 4),
-        "p_value": round(p_value, 6),
-        "significant": significant,
-        "df": df,
-        "n_blocks": n,
-        "n_engines": k,
-        "mean_ranks": {k_: round(v, 4) for k_, v in mean_ranks.items()},
-        "interpretation": interpretation,
-    }
-def _nemenyi_critical_value(k: int, alpha: float = 0.05) -> Optional[float]:
-    """Critical value q_α for k treatments, df = ∞.
-    Returns ``None`` if k < 2; for k > 50 the k = 50 value is reused.
-    """
-    if k < 2:
-        return None
-    if k in _NEMENYI_Q_TABLE:
-        q05, q01 = _NEMENYI_Q_TABLE[k]
-        return q05 if alpha == 0.05 else q01 if alpha == 0.01 else q05
-    # Beyond the table: reuse the largest tabulated k (this underestimates q_α)
-    max_k = max(_NEMENYI_Q_TABLE.keys())
-    if k > max_k:
-        q05, q01 = _NEMENYI_Q_TABLE[max_k]
-        return q05 if alpha == 0.05 else q01
-    # Between two keys: linear interpolation
-    keys = sorted(_NEMENYI_Q_TABLE.keys())
-    for i in range(len(keys) - 1):
-        if keys[i] < k < keys[i + 1]:
-            lo, hi = keys[i], keys[i + 1]
-            q_lo = _NEMENYI_Q_TABLE[lo][0 if alpha == 0.05 else 1]
-            q_hi = _NEMENYI_Q_TABLE[hi][0 if alpha == 0.05 else 1]
-            frac = (k - lo) / (hi - lo)
-            return q_lo + frac * (q_hi - q_lo)
-    return None
-def nemenyi_posthoc(
-    engine_cer_map: dict[str, list[float]],
-    alpha: float = 0.05,
-) -> dict:
-    """Nemenyi post-hoc: identifies the pairs of engines that are
-    statistically indistinguishable after a Friedman test.
-    Computes the *critical distance* CD = q_α · √(k·(k+1) / (6·n)). Two engines
-    whose mean ranks differ by less than CD are **not**
-    statistically distinguishable at level α.
-    Returns
-    -------
-    dict with:
-      - ``alpha``               : level used
-      - ``critical_distance``   : computed CD
-      - ``q_alpha``             : critical value q_α taken from the table
-      - ``n_blocks``, ``n_engines``
-      - ``mean_ranks``          : mean ranks per engine (dict)
-      - ``engines_sorted``      : list of engines sorted by increasing rank
-      - ``significant_matrix``  : bool matrix (list[list[bool]]),
-                                  ``True`` = significantly different pair
-      - ``tied_groups``         : list of lists of indistinguishable engines
-                                  (maximal groups of practical ties)
-      - ``error``               : present if the test is not applicable
-    """
-    names, matrix = _aligned_cer_matrix(engine_cer_map)
-    k = len(names)
-    n = len(matrix[0]) if matrix else 0
-    if k < 2 or n < 2:
-        return {
-            "alpha": alpha,
-            "critical_distance": 0.0,
-            "q_alpha": 0.0,
-            "n_blocks": n,
-            "n_engines": k,
-            "mean_ranks": {name: 1.0 for name in names},
-            "engines_sorted": list(names),
-            "significant_matrix": [[False] * k for _ in range(k)],
-            "tied_groups": [list(names)] if names else [],
-            "error": "not_enough_data",
-        }
-    # Friedman already provides the mean ranks; they are recomputed here to stay
-    # self-contained (without forcing the caller to chain the two calls).
-    ranks_by_engine: list[list[float]] = [[] for _ in range(k)]
-    for j in range(n):
-        row = [matrix[i][j] for i in range(k)]
-        row_ranks = _rank_row(row)
-        for i in range(k):
-            ranks_by_engine[i].append(row_ranks[i])
-    mean_ranks_list = [sum(r) / n for r in ranks_by_engine]
-    mean_ranks = {names[i]: round(mean_ranks_list[i], 4) for i in range(k)}
-    q_alpha = _nemenyi_critical_value(k, alpha) or 0.0
-    critical_distance = q_alpha * math.sqrt(k * (k + 1) / (6.0 * n))
-    # Significance matrix: pair (i,j) is significant if |R_i - R_j| > CD
-    significant_matrix = [
-        [
-            (i != j) and (abs(mean_ranks_list[i] - mean_ranks_list[j]) > critical_distance)
-            for j in range(k)
-        ]
-        for i in range(k)
-    ]
-    # Practical tie groups: sliding window over the sorted ranks.
-    # Two engines are in the same group if their gap is ≤ CD.
-    order = sorted(range(k), key=lambda i: mean_ranks_list[i])
-    sorted_names = [names[i] for i in order]
-    sorted_ranks = [mean_ranks_list[i] for i in order]
-    tied_groups: list[list[str]] = []
-    i = 0
-    while i < len(sorted_names):
-        # extend the group while the next engine is within CD of the group's first member
-        j = i
-        while j + 1 < len(sorted_names) and (sorted_ranks[j + 1] - sorted_ranks[i]) <= critical_distance:
-            j += 1
-        tied_groups.append(sorted_names[i:j + 1])
-        i = j + 1 if j > i else i + 1
-    return {
-        "alpha": alpha,
-        "critical_distance": round(critical_distance, 4),
-        "q_alpha": round(q_alpha, 4),
-        "n_blocks": n,
-        "n_engines": k,
-        "mean_ranks": mean_ranks,
-        "engines_sorted": sorted_names,
-        "significant_matrix": significant_matrix,
-        "tied_groups": tied_groups,
-    }
-# ---------------------------------------------------------------------------
-# Critical Difference Diagram: SVG rendering (Sprint 17)
-# ---------------------------------------------------------------------------
-def build_critical_difference_svg(
-    nemenyi_result: dict,
-    width: int = 780,
-    row_height: int = 22,
-) -> str:
-    """Generate the SVG of the Critical Difference Diagram (Demšar 2006).
-    The diagram shows:
-      * a horizontal axis of mean ranks (1 to k),
-      * each engine placed on the axis at its mean rank,
-      * thick horizontal bars joining engines that are statistically
-        indistinguishable (distance ≤ CD),
-      * the length of CD drawn above the axis for reference.
-    Parameters
-    ----------
-    nemenyi_result:
-        Result of ``nemenyi_posthoc``.
-    width:
-        Total width of the SVG in pixels.
-    row_height:
-        Height of each engine label row (the total height adapts automatically).
-    Returns
-    -------
-    String containing the SVG (root tag ``<svg>…</svg>``).
-    """
-    k = nemenyi_result.get("n_engines", 0)
-    if k < 2 or nemenyi_result.get("error"):
-        return (
-            '<svg xmlns="http://www.w3.org/2000/svg" width="100%" height="40" '
-            'role="img" aria-label="Critical Difference Diagram indisponible">'
-            '<text x="10" y="24" font-family="sans-serif" font-size="12" fill="#666">'
-            'Critical Difference Diagram non calculable — données insuffisantes.'
-            '</text></svg>'
-        )
-    engines_sorted: list[str] = list(nemenyi_result.get("engines_sorted", []))
-    mean_ranks: dict[str, float] = dict(nemenyi_result.get("mean_ranks", {}))
-    tied_groups: list[list[str]] = list(nemenyi_result.get("tied_groups", []))
-    cd: float = float(nemenyi_result.get("critical_distance", 0.0))
-    # Dimensions
-    left_pad, right_pad = 40, 40
-    top_pad = 50   # room for the CD display
-    axis_y = top_pad + 10
-    bars_start_y = axis_y + 20  # first tie bar below the axis
-    # Stack one row per group + one row per label
-    label_rows = k  # each engine gets its own label row
-    bars_count = len(tied_groups)
-    total_h = bars_start_y + bars_count * 10 + label_rows * row_height + 20
-    axis_x0, axis_x1 = left_pad, width - right_pad
-    axis_width = axis_x1 - axis_x0
-    def x_for_rank(r: float) -> float:
-        # Rank 1 on the left, rank k on the right
-        if k <= 1:
-            return axis_x0
-        return axis_x0 + (r - 1.0) / (k - 1.0) * axis_width
-    parts: list[str] = []
-    parts.append(
-        f'<svg xmlns="http://www.w3.org/2000/svg" width="100%" viewBox="0 0 {width} {total_h}" '
-        f'role="img" aria-label="Critical Difference Diagram (Friedman-Nemenyi)" '
-        f'font-family="system-ui, -apple-system, sans-serif">'
-    )
-    parts.append('<style>.cd-axis{stroke:#334155;stroke-width:1.5}.cd-tick{stroke:#334155;stroke-width:1}'
-                 '.cd-label{fill:#0f172a;font-size:11px}'
-                 '.cd-tie{stroke:#0f172a;stroke-width:4;stroke-linecap:round}'
-                 '.cd-cd-bar{stroke:#dc2626;stroke-width:2}'
-                 '.cd-cd-txt{fill:#dc2626;font-size:11px;font-weight:600}'
-                 '.cd-name{fill:#0f172a;font-size:12px}'
-                 '.cd-rank{fill:#64748b;font-size:10px}'
-                 '</style>')
-    # Reference CD bar (top, at the left end of the axis)
-    if cd > 0 and k >= 2:
-        cd_bar_x0 = axis_x0
-        cd_bar_x1 = axis_x0 + (cd / max(1, k - 1)) * axis_width
-        cd_y = top_pad - 20
-        parts.append(f'<line class="cd-cd-bar" x1="{cd_bar_x0:.1f}" y1="{cd_y}" '
-                     f'x2="{cd_bar_x1:.1f}" y2="{cd_y}"/>')
-        parts.append(f'<line class="cd-cd-bar" x1="{cd_bar_x0:.1f}" y1="{cd_y - 4}" '
-                     f'x2="{cd_bar_x0:.1f}" y2="{cd_y + 4}"/>')
-        parts.append(f'<line class="cd-cd-bar" x1="{cd_bar_x1:.1f}" y1="{cd_y - 4}" '
-                     f'x2="{cd_bar_x1:.1f}" y2="{cd_y + 4}"/>')
-        parts.append(f'<text class="cd-cd-txt" x="{(cd_bar_x0 + cd_bar_x1)/2:.1f}" y="{cd_y - 8}" '
-                     f'text-anchor="middle">CD = {cd:.3f}</text>')
-    # Main axis
-    parts.append(f'<line class="cd-axis" x1="{axis_x0}" y1="{axis_y}" '
-                 f'x2="{axis_x1}" y2="{axis_y}"/>')
-    # Integer ticks
-    for r in range(1, k + 1):
-        xt = x_for_rank(r)
-        parts.append(f'<line class="cd-tick" x1="{xt:.1f}" y1="{axis_y - 5}" '
-                     f'x2="{xt:.1f}" y2="{axis_y + 5}"/>')
-        parts.append(f'<text class="cd-label" x="{xt:.1f}" y="{axis_y - 9}" '
-                     f'text-anchor="middle">{r}</text>')
-    # Bars connecting indistinguishable groups
-    for i, group in enumerate(tied_groups):
-        if len(group) < 2:
-            continue
-        rs = [mean_ranks[n] for n in group]
-        x0 = x_for_rank(min(rs))
-        x1 = x_for_rank(max(rs))
-        y_bar = bars_start_y + i * 10
-        parts.append(f'<line class="cd-tie" x1="{x0 - 3:.1f}" y1="{y_bar}" '
-                     f'x2="{x1 + 3:.1f}" y2="{y_bar}"/>')
-    # Engine labels: the half with the lowest ranks on the left, the rest on the right
-    labels_y_base = bars_start_y + bars_count * 10 + 15
-    half = (len(engines_sorted) + 1) // 2
-    left_engines = engines_sorted[:half]
-    right_engines = engines_sorted[half:]
-    for idx, name in enumerate(left_engines):
-        r = mean_ranks[name]
-        x = x_for_rank(r)
-        y_label = labels_y_base + idx * row_height
-        # Connector line from the engine to the axis
-        parts.append(f'<line class="cd-tick" x1="{x:.1f}" y1="{axis_y + 6}" '
-                     f'x2="{x:.1f}" y2="{y_label - 4}"/>')
-        parts.append(f'<line class="cd-tick" x1="{x:.1f}" y1="{y_label - 4}" '
-                     f'x2="{axis_x0 - 4:.1f}" y2="{y_label - 4}"/>')
-        parts.append(f'<text class="cd-name" x="{axis_x0 - 6:.1f}" y="{y_label}" '
-                     f'text-anchor="end">{_svg_escape(name)} '
-                     f'<tspan class="cd-rank">({r:.2f})</tspan></text>')
-    for idx, name in enumerate(right_engines):
-        r = mean_ranks[name]
-        x = x_for_rank(r)
-        y_label = labels_y_base + idx * row_height
-        parts.append(f'<line class="cd-tick" x1="{x:.1f}" y1="{axis_y + 6}" '
-                     f'x2="{x:.1f}" y2="{y_label - 4}"/>')
-        parts.append(f'<line class="cd-tick" x1="{x:.1f}" y1="{y_label - 4}" '
-                     f'x2="{axis_x1 + 4:.1f}" y2="{y_label - 4}"/>')
-        parts.append(f'<text class="cd-name" x="{axis_x1 + 6:.1f}" y="{y_label}" '
-                     f'text-anchor="start">{_svg_escape(name)} '
-                     f'<tspan class="cd-rank">({r:.2f})</tspan></text>')
-    parts.append('</svg>')
-    return "".join(parts)
-def _svg_escape(text: str) -> str:
-    """Escape text for safe inclusion in an SVG/XML node."""
-    return (text.replace("&", "&amp;")
-                .replace("<", "&lt;")
-                .replace(">", "&gt;")
-                .replace('"', "&quot;")
-                .replace("'", "&#39;"))
-# ---------------------------------------------------------------------------
-# Pareto front (Sprint 19)
-# ---------------------------------------------------------------------------
-def compute_pareto_front(
-    points: list[dict],
-    objectives: tuple[str, ...] = ("cer", "cost"),
-    name_key: str = "engine",
-    minimize: Optional[tuple[bool, ...]] = None,
-) -> list[str]:
-    """Compute the Pareto front over ``len(objectives)`` dimensions.
-    A point ``p`` is on the Pareto front (non-dominated) if no other point
-    is at least as good on ALL objectives AND strictly better on at least
-    one of them.
-    Parameters
-    ----------
-    points:
-        List of dicts. Each dict must contain ``name_key`` and every key
-        in ``objectives``. Points with a missing, ``None``, or non-numeric
-        objective value are skipped (no comparison possible).
-    objectives:
-        Keys of the objectives to minimize or maximize.
-    name_key:
-        Key identifying the point (default ``"engine"``).
-    minimize:
-        For each objective, ``True`` = minimize (e.g. CER, cost),
-        ``False`` = maximize (e.g. anchoring). Must have the same length
-        as ``objectives``.
-    Returns
-    -------
-    List of the ``name`` values of the points on the Pareto front, in the
-    stable order of ``points``.
-    """
-    if minimize is None:
-        minimize = tuple(True for _ in objectives)
-    if len(minimize) != len(objectives):
-        raise ValueError("`minimize` doit avoir la même longueur que `objectives`")
-    valid = []
-    for p in points:
-        try:
-            vals = tuple(float(p[k]) for k in objectives)
-        except (KeyError, TypeError, ValueError):
-            continue
-        valid.append((p[name_key], vals))
-    front: list[str] = []
-    for name_a, vals_a in valid:
-        dominated = False
-        for name_b, vals_b in valid:
-            if name_a == name_b:
-                continue
-            # B dominates A if B is at least as good everywhere AND strictly better somewhere
-            better_or_equal_everywhere = True
-            strictly_better_somewhere = False
-            for va, vb, mini in zip(vals_a, vals_b, minimize):
-                if mini:
-                    if vb > va:
-                        better_or_equal_everywhere = False
-                        break
-                    if vb < va:
-                        strictly_better_somewhere = True
-                else:  # maximize
-                    if vb < va:
-                        better_or_equal_everywhere = False
-                        break
-                    if vb > va:
-                        strictly_better_somewhere = True
-            if better_or_equal_everywhere and strictly_better_somewhere:
-                dominated = True
-                break
-        if not dominated:
-            front.append(name_a)
-    return front
-# ---------------------------------------------------------------------------
-# Clustering of error patterns
-# ---------------------------------------------------------------------------
-# Frequent error patterns (OCR + HTR on heritage documents)
-_ERROR_PATTERNS = [
-    # (pattern_re, label)
-    (r"\brn\b.*\bm\b|\bm\b.*\brn\b|rn→m|m→rn",       "confusion rn/m"),
-    (r"[lI]→1|1→[lI]|l→1|1→l|I→1|1→I",               "confusion l/1/I"),
-    (r"u→n|n→u|v→u|u→v",                              "confusion u/n/v"),
-    (r"[oO]→0|0→[oO]",                                "confusion O/0"),
-    (r"ſ→[fs]|[fs]→ſ",                                "confusion ſ/f/s"),
-    (r"é→e|è→e|ê→e|e→[éèê]",                          "erreur diacritique é/e"),
-    (r"œ→oe|oe→œ|æ→ae|ae→æ",                          "ligature œ/æ"),
-    (r"[fF]i→fi|fi→[fF]i",                            "ligature fi"),
-    (r"[fF]l→fl|fl→[fF]l",                            "ligature fl"),
-    (r"\s+→''|''→\s+",                                "segmentation espace"),
-]
-def _extract_error_pairs(gt: str, hyp: str) -> list[tuple[str, str]]:
-    """Extract the (gt_char_seq, hyp_char_seq) pairs for substitution, deletion, and insertion errors."""
-    # Sprint A3 (B-1): imported from Circle 1, so no more Circle 2→3 violation.
-    from picarones.core.diff_utils import compute_word_diff
-    ops = compute_word_diff(gt, hyp)
-    pairs = []
-    for op in ops:
-        if op["op"] == "replace":
-            pairs.append((op["old"], op["new"]))
-        elif op["op"] == "delete":
-            pairs.append((op["text"], ""))
-        elif op["op"] == "insert":
-            pairs.append(("", op["text"]))
-    return pairs
-@dataclass
-class ErrorCluster:
-    """A cluster of similar errors."""
-    cluster_id: int
-    label: str
-    """Human-readable description of the pattern (e.g. 'confusion rn/m')."""
-    count: int
-    examples: list[dict]
-    """List of {engine, gt_fragment, ocr_fragment}."""
-    def as_dict(self) -> dict:
-        return {
-            "cluster_id": self.cluster_id,
-            "label": self.label,
-            "count": self.count,
-            "examples": self.examples[:5],  # at most 5 examples
-        }
-def cluster_errors(
-    error_data: list[dict],
-    max_clusters: int = 8,
-) -> list[ErrorCluster]:
-    """Group errors into clusters with readable labels.
-    Parameters
-    ----------
-    error_data : list of dicts {engine, gt, hypothesis}
-    max_clusters : maximum number of clusters to return
-    Returns
-    -------
-    List of ErrorCluster sorted by descending count.
-    """
-    # Collect all error patterns with their context
-    # Key: error category → list of examples
-    bucket: dict[str, list[dict]] = defaultdict(list)
-    other_pairs: list[dict] = []
-    for item in error_data:
-        engine = item.get("engine", "")
-        gt = item.get("gt", "")
-        hyp = item.get("hypothesis", "")
-        pairs = _extract_error_pairs(gt, hyp)
-        for old, new in pairs:
-            if not old and not new:
-                continue
-            matched = False
-            # Try to match a known pattern
-            probe = f"{old}→{new}"
-            for _pat, label in _ERROR_PATTERNS:
-                try:
-                    if re.search(_pat, probe, re.IGNORECASE):
-                        bucket[label].append({
-                            "engine": engine,
-                            "gt_fragment": old,
-                            "ocr_fragment": new,
-                        })
-                        matched = True
-                        break
-                except re.error:
-                    pass
-            if not matched:
-                # Group the remaining substitutions by character pair
-                if len(old) <= 3 and len(new) <= 3:
-                    key = f"{old}→{new}" if (old and new) else (f"—→{new}" if new else f"{old}→—")
-                    bucket[key].append({
-                        "engine": engine,
-                        "gt_fragment": old,
-                        "ocr_fragment": new,
-                    })
-                else:
-                    other_pairs.append({
-                        "engine": engine,
-                        "gt_fragment": old,
-                        "ocr_fragment": new,
-                    })
-    # Build the clusters, sorted by frequency
-    clusters: list[ErrorCluster] = []
-    cluster_id = 1
-    sorted_buckets = sorted(bucket.items(), key=lambda x: -len(x[1]))
-    for label, examples in sorted_buckets[:max_clusters - 1]:
-        clusters.append(ErrorCluster(
-            cluster_id=cluster_id,
-            label=label,
-            count=len(examples),
-            examples=examples,
-        ))
-        cluster_id += 1
-    # "Other" cluster
-    if other_pairs:
-        clusters.append(ErrorCluster(
-            cluster_id=cluster_id,
-            label="autres substitutions",
-            count=len(other_pairs),
-            examples=other_pairs,
-        ))
-    # Sort by descending count and truncate
-    clusters.sort(key=lambda c: -c.count)
-    return clusters[:max_clusters]
-# ---------------------------------------------------------------------------
-# Correlation matrix between metrics
-# ---------------------------------------------------------------------------
-def _pearson(x: list[float], y: list[float]) -> float:
-    """Pearson correlation coefficient."""
-    n = len(x)
-    if n < 2:
-        return 0.0
-    mx = sum(x) / n
-    my = sum(y) / n
-    num = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
-    den = math.sqrt(
-        sum((xi - mx) ** 2 for xi in x) * sum((yi - my) ** 2 for yi in y)
-    )
-    return num / den if den > 0 else 0.0
-def compute_correlation_matrix(
-    metrics_per_doc: list[dict],
-    metric_keys: Optional[list[str]] = None,
-) -> dict:
-    """Compute the correlation matrix between all numeric metrics.
-    Parameters
-    ----------
-    metrics_per_doc : list of dicts, one per document, holding the metrics
-    metric_keys     : keys to include (None → all numeric keys)
-    Returns
-    -------
-    {
-      "labels": [...],
-      "matrix": [[r_ij, ...], ...]   // Pearson coefficients
-    }
-    """
-    if not metrics_per_doc:
-        return {"labels": [], "matrix": []}
-    if metric_keys is None:
-        # Infer the numeric keys from the first document
-        sample = metrics_per_doc[0]
-        metric_keys = [k for k, v in sample.items() if isinstance(v, (int, float))]
-    # Build the vectors
-    vectors: dict[str, list[float]] = {k: [] for k in metric_keys}
-    for doc in metrics_per_doc:
-        for k in metric_keys:
-            v = doc.get(k)
-            vectors[k].append(float(v) if v is not None else 0.0)
-    # Compute the matrix
-    labels = metric_keys
-    n = len(labels)
-    matrix = []
-    for i in range(n):
-        row = []
-        for j in range(n):
-            r = _pearson(vectors[labels[i]], vectors[labels[j]])
-            row.append(round(r, 4))
-        matrix.append(row)
-    return {"labels": labels, "matrix": matrix}
-# ---------------------------------------------------------------------------
-# Reliability curve
-# ---------------------------------------------------------------------------
-def compute_reliability_curve(
-    cer_values: list[float],
-    steps: int = 20,
-) -> list[dict]:
-    """For the easiest X% of documents, what is the mean CER?
-    Returns
-    -------
-    List of {pct_docs: float, mean_cer: float}
-    """
-    if not cer_values:
-        return []
-    sorted_cer = sorted(cer_values)
-    n = len(sorted_cer)
-    points = []
-    for step in range(1, steps + 1):
-        pct = step / steps
-        cutoff = max(1, int(pct * n))
-        subset = sorted_cer[:cutoff]
-        mean_cer = sum(subset) / len(subset)
-        points.append({"pct_docs": round(pct * 100, 1), "mean_cer": round(mean_cer, 6)})
-    return points
-# ---------------------------------------------------------------------------
-# Data for the Venn diagram (shared / exclusive errors)
-# ---------------------------------------------------------------------------
-def compute_venn_data(
-    engine_error_sets: dict[str, set[str]],
-) -> dict:
-    """Compute the cardinalities for a Venn diagram between 2 or 3 competitors.
-    Parameters
-    ----------
-    engine_error_sets : {engine_name → set of doc_id:error_token_pair strings}
-    Returns
-    -------
-    For 2 competitors:
-      {type, only_a, only_b, both, label_a, label_b}
-    For 3 competitors:
-      {type, only_a, only_b, only_c, ab, ac, bc, abc, label_a, label_b, label_c}
-    """
-    names = list(engine_error_sets.keys())[:3]  # at most 3 for a readable Venn
-    if len(names) < 2:
-        return {}
-    sets = {n: engine_error_sets[n] for n in names}
-    if len(names) == 2:
-        a, b = names
-        sa, sb = sets[a], sets[b]
-        return {
-            "type": "venn2",
-            "label_a": a,
-            "label_b": b,
-            "only_a": len(sa - sb),
-            "only_b": len(sb - sa),
-            "both": len(sa & sb),
-        }
-    else:
-        a, b, c = names
-        sa, sb, sc = sets[a], sets[b], sets[c]
-        return {
-            "type": "venn3",
-            "label_a": a,
-            "label_b": b,
-            "label_c": c,
-            "only_a": len(sa - sb - sc),
-            "only_b": len(sb - sa - sc),
-            "only_c": len(sc - sa - sb),
-            "ab": len((sa & sb) - sc),
-            "ac": len((sa & sc) - sb),
-            "bc": len((sb & sc) - sa),
-            "abc": len(sa & sb & sc),
-        }
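The dominance rule applied by `compute_pareto_front` in the removed file above can be illustrated in isolation. The sketch below is not the project's code: the helper names `dominates` and `pareto_front` and the sample figures are hypothetical, and every objective is assumed to be minimized.

```python
from typing import Sequence


def dominates(b: Sequence[float], a: Sequence[float]) -> bool:
    """True if b is at least as good as a everywhere and strictly better somewhere.

    Assumption for this sketch: lower is better on every objective.
    """
    pairs = list(zip(a, b))
    return all(vb <= va for va, vb in pairs) and any(vb < va for va, vb in pairs)


def pareto_front(points: dict[str, tuple[float, ...]]) -> list[str]:
    """Names of the non-dominated points, in insertion order."""
    return [
        name
        for name, vals in points.items()
        if not any(
            dominates(other, vals)
            for other_name, other in points.items()
            if other_name != name
        )
    ]


# Hypothetical (cer, cost) pairs, both to be minimized.
engines = {
    "engine_a": (0.05, 10.0),
    "engine_b": (0.08, 2.0),
    "engine_c": (0.09, 5.0),
}
front = pareto_front(engines)
```

Here `engine_c` is dropped because `engine_b` is better on both objectives, while `engine_a` and `engine_b` trade accuracy against cost and both stay on the front.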

picarones/measurements/statistics/__init__.py ADDED

	@@ -0,0 +1,82 @@

+"""Statistical tests and error clustering for Picarones.
+Before the "split statistics.py" sprint (2026-05-02), this module
+was a single 1128-line file mixing Wilcoxon, Friedman,
+Nemenyi, bootstrap, Pareto, clustering, correlation, distribution
+curves, and SVG rendering of the Critical Difference Diagram.
+The subpackage splits responsibility by statistical family:
+- :mod:`bootstrap`: bootstrap confidence intervals by resampling.
+- :mod:`wilcoxon`: signed-rank test plus pairwise matrix.
+- :mod:`friedman_nemenyi`: multi-engine Friedman test plus Nemenyi post-hoc
+  (computation only, no rendering).
+- :mod:`cdd_render`: SVG rendering of the Critical Difference Diagram.
+- :mod:`pareto`: multi-objective Pareto front.
+- :mod:`clustering`: grouping of OCR/HTR error patterns.
+- :mod:`correlation`: correlation matrix between metrics.
+- :mod:`distributions`: reliability curve and 2/3-set Venn data.
+This ``__init__.py`` re-exports the entire historical public API so
+that the ~30 files importing from
+``picarones.measurements.statistics`` keep working without
+modification. The private symbols ``_SCIPY_AVAILABLE``,
+``_chi_square_sf``, ``_nemenyi_critical_value``, and ``_rank_row`` are
+also re-exported because some tests use them directly.
+"""
+from picarones.measurements.statistics.bootstrap import bootstrap_ci
+from picarones.measurements.statistics.cdd_render import (
+    build_critical_difference_svg,
+)
+from picarones.measurements.statistics.clustering import (
+    ErrorCluster,
+    cluster_errors,
+)
+from picarones.measurements.statistics.correlation import (
+    compute_correlation_matrix,
+)
+from picarones.measurements.statistics.distributions import (
+    compute_reliability_curve,
+    compute_venn_data,
+)
+from picarones.measurements.statistics.friedman_nemenyi import (
+    _chi_square_sf,
+    _nemenyi_critical_value,
+    _rank_row,
+    friedman_test,
+    nemenyi_posthoc,
+)
+from picarones.measurements.statistics.pareto import compute_pareto_front
+from picarones.measurements.statistics.wilcoxon import (
+    _SCIPY_AVAILABLE,
+    compute_pairwise_stats,
+    wilcoxon_test,
+)
+__all__ = [
+    # Bootstrap
+    "bootstrap_ci",
+    # Wilcoxon
+    "wilcoxon_test",
+    "compute_pairwise_stats",
+    # Friedman / Nemenyi
+    "friedman_test",
+    "nemenyi_posthoc",
+    "build_critical_difference_svg",
+    # Pareto
+    "compute_pareto_front",
+    # Clustering
+    "ErrorCluster",
+    "cluster_errors",
+    # Correlation
+    "compute_correlation_matrix",
+    # Distributions
+    "compute_reliability_curve",
+    "compute_venn_data",
+    # Re-exported private symbols (used by some tests)
+    "_SCIPY_AVAILABLE",
+    "_chi_square_sf",
+    "_nemenyi_critical_value",
+    "_rank_row",
+]

picarones/measurements/statistics/bootstrap.py ADDED

	@@ -0,0 +1,47 @@

+"""Bootstrap confidence interval (Sprint 7).
+Non-parametric resampling method. No assumption of a normal
+distribution, which suits the skewed CER distributions
+typical of heritage corpora.
+"""
+from __future__ import annotations
+import random
+def bootstrap_ci(
+    values: list[float],
+    n_iter: int = 1000,
+    ci: float = 0.95,
+    seed: int = 42,
+) -> tuple[float, float]:
+    """Bootstrap confidence interval for the mean.
+    Parameters
+    ----------
+    values : list of values (e.g. per-document CER)
+    n_iter : number of bootstrap iterations (default 1000)
+    ci     : confidence level (default 0.95, i.e. 95 %)
+    seed   : RNG seed for reproducibility
+    Returns
+    -------
+    (lower, upper): the bounds of the confidence interval at level ``ci``
+    """
+    if not values:
+        return (0.0, 0.0)
+    rng = random.Random(seed)
+    n = len(values)
+    means = []
+    for _ in range(n_iter):
+        sample = [values[rng.randint(0, n - 1)] for _ in range(n)]
+        means.append(sum(sample) / n)
+    means.sort()
+    alpha = (1.0 - ci) / 2.0
+    lo_idx = max(0, int(alpha * n_iter))
+    hi_idx = min(n_iter - 1, int((1.0 - alpha) * n_iter))
+    return (means[lo_idx], means[hi_idx])
+__all__ = ["bootstrap_ci"]
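As an illustration of the percentile bootstrap described in this module, here is a minimal standalone sketch. It follows the same steps (resample with replacement, sort the means, read off the two percentile indices) but is not the project's function: the name `percentile_bootstrap_ci` and the CER values are hypothetical, and the input is assumed to be non-empty.

```python
import random
import statistics


def percentile_bootstrap_ci(values, n_iter=1000, ci=0.95, seed=42):
    """Percentile bootstrap CI for the mean (sketch; assumes non-empty values)."""
    rng = random.Random(seed)  # seeded so that repeated calls give the same interval
    n = len(values)
    # One resampled mean per iteration, sorted so percentiles are simple index lookups.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_iter)
    )
    alpha = (1.0 - ci) / 2.0
    lo = means[max(0, int(alpha * n_iter))]
    hi = means[min(n_iter - 1, int((1.0 - alpha) * n_iter))]
    return lo, hi


# Hypothetical skewed sample: four easy documents and one very hard one.
cer = [0.02, 0.03, 0.05, 0.04, 0.40]
lower, upper = percentile_bootstrap_ci(cer)
mean_cer = statistics.fmean(cer)
```

With a single outlier like this, the interval is noticeably asymmetric around the mean, which is the situation the module docstring has in mind when it mentions skewed CER distributions.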

picarones/measurements/statistics/cdd_render.py ADDED

	@@ -0,0 +1,171 @@

+"""SVG rendering of the Critical Difference Diagram (Sprint 17).
+Canonical visualization of the Friedman-Nemenyi result (Demšar 2006):
+a horizontal axis of mean ranks plus horizontal bars connecting the
+engines that are statistically indistinguishable at level α.
+Kept separate from the computation (:mod:`friedman_nemenyi`) to respect the
+"computation vs presentation" distinction: a PNG, PDF, or other
+renderer could be added without touching the computation.
+"""
+from __future__ import annotations
+def build_critical_difference_svg(
+    nemenyi_result: dict,
+    width: int = 780,
+    row_height: int = 22,
+) -> str:
+    """Generate the SVG of the Critical Difference Diagram (Demšar 2006).
+    The diagram shows:
+      * a horizontal axis of mean ranks (1 to k),
+      * each engine positioned on the axis at its mean rank,
+      * thick horizontal bars connecting engines that are statistically
+        indistinguishable (distance ≤ CD),
+      * the length of CD displayed above the axis for reference.
+    Parameters
+    ----------
+    nemenyi_result:
+        Result of ``nemenyi_posthoc``.
+    width:
+        Total width of the SVG in pixels.
+    row_height:
+        Height of each engine label row (auto-adaptive).
+    Returns
+    -------
+    String containing the SVG (root element ``<svg>…</svg>``).
+    """
+    k = nemenyi_result.get("n_engines", 0)
+    if k < 2 or nemenyi_result.get("error"):
+        return (
+            '<svg xmlns="http://www.w3.org/2000/svg" width="100%" height="40" '
+            'role="img" aria-label="Critical Difference Diagram indisponible">'
+            '<text x="10" y="24" font-family="sans-serif" font-size="12" fill="#666">'
+            'Critical Difference Diagram non calculable — données insuffisantes.'
+            '</text></svg>'
+        )
+    engines_sorted: list[str] = list(nemenyi_result.get("engines_sorted", []))
+    mean_ranks: dict[str, float] = dict(nemenyi_result.get("mean_ranks", {}))
+    tied_groups: list[list[str]] = list(nemenyi_result.get("tied_groups", []))
+    cd: float = float(nemenyi_result.get("critical_distance", 0.0))
+    # Dimensions
+    left_pad, right_pad = 40, 40
+    top_pad = 50   # room for the CD display
+    axis_y = top_pad + 10
+    bars_start_y = axis_y + 20  # first tie bar below the axis
+    # Stack one row per group plus one row per label
+    label_rows = k  # each engine gets its own label row
+    bars_count = len(tied_groups)
+    total_h = bars_start_y + bars_count * 10 + label_rows * row_height + 20
+    axis_x0, axis_x1 = left_pad, width - right_pad
+    axis_width = axis_x1 - axis_x0
+    def x_for_rank(r: float) -> float:
+        # Rank 1 on the left, rank k on the right
+        if k <= 1:
+            return axis_x0
+        return axis_x0 + (r - 1.0) / (k - 1.0) * axis_width
+    parts: list[str] = []
+    parts.append(
+        f'<svg xmlns="http://www.w3.org/2000/svg" width="100%" viewBox="0 0 {width} {total_h}" '
+        f'role="img" aria-label="Critical Difference Diagram (Friedman-Nemenyi)" '
+        f'font-family="system-ui, -apple-system, sans-serif">'
+    )
+    parts.append('<style>.cd-axis{stroke:#334155;stroke-width:1.5}.cd-tick{stroke:#334155;stroke-width:1}'
+                 '.cd-label{fill:#0f172a;font-size:11px}'
+                 '.cd-tie{stroke:#0f172a;stroke-width:4;stroke-linecap:round}'
+                 '.cd-cd-bar{stroke:#dc2626;stroke-width:2}'
+                 '.cd-cd-txt{fill:#dc2626;font-size:11px;font-weight:600}'
+                 '.cd-name{fill:#0f172a;font-size:12px}'
+                 '.cd-rank{fill:#64748b;font-size:10px}'
+                 '</style>')
+    # Reference CD bar (at the top, at the left end of the axis)
+    if cd > 0 and k >= 2:
+        cd_bar_x0 = axis_x0
+        cd_bar_x1 = axis_x0 + (cd / max(1, k - 1)) * axis_width
+        cd_y = top_pad - 20
+        parts.append(f'<line class="cd-cd-bar" x1="{cd_bar_x0:.1f}" y1="{cd_y}" '
+                     f'x2="{cd_bar_x1:.1f}" y2="{cd_y}"/>')
+        parts.append(f'<line class="cd-cd-bar" x1="{cd_bar_x0:.1f}" y1="{cd_y - 4}" '
+                     f'x2="{cd_bar_x0:.1f}" y2="{cd_y + 4}"/>')
+        parts.append(f'<line class="cd-cd-bar" x1="{cd_bar_x1:.1f}" y1="{cd_y - 4}" '
+                     f'x2="{cd_bar_x1:.1f}" y2="{cd_y + 4}"/>')
+        parts.append(f'<text class="cd-cd-txt" x="{(cd_bar_x0 + cd_bar_x1)/2:.1f}" y="{cd_y - 8}" '
+                     f'text-anchor="middle">CD = {cd:.3f}</text>')
+    # Main axis
+    parts.append(f'<line class="cd-axis" x1="{axis_x0}" y1="{axis_y}" '
+                 f'x2="{axis_x1}" y2="{axis_y}"/>')
+    # Integer ticks
+    for r in range(1, k + 1):
+        xt = x_for_rank(r)
+        parts.append(f'<line class="cd-tick" x1="{xt:.1f}" y1="{axis_y - 5}" '
+                     f'x2="{xt:.1f}" y2="{axis_y + 5}"/>')
+        parts.append(f'<text class="cd-label" x="{xt:.1f}" y="{axis_y - 9}" '
+                     f'text-anchor="middle">{r}</text>')
+    # Bars connecting indistinguishable groups
+    for i, group in enumerate(tied_groups):
+        if len(group) < 2:
+            continue
+        rs = [mean_ranks[n] for n in group]
+        x0 = x_for_rank(min(rs))
+        x1 = x_for_rank(max(rs))
+        y_bar = bars_start_y + i * 10
+        parts.append(f'<line class="cd-tie" x1="{x0 - 3:.1f}" y1="{y_bar}" '
+                     f'x2="{x1 + 3:.1f}" y2="{y_bar}"/>')
+    # Engine labels: the half with the lowest ranks on the left, the rest on the right
+    labels_y_base = bars_start_y + bars_count * 10 + 15
+    half = (len(engines_sorted) + 1) // 2
+    left_engines = engines_sorted[:half]
+    right_engines = engines_sorted[half:]
+    for idx, name in enumerate(left_engines):
+        r = mean_ranks[name]
+        x = x_for_rank(r)
+        y_label = labels_y_base + idx * row_height
+        # Connector line from the engine to the axis
+        parts.append(f'<line class="cd-tick" x1="{x:.1f}" y1="{axis_y + 6}" '
+                     f'x2="{x:.1f}" y2="{y_label - 4}"/>')
+        parts.append(f'<line class="cd-tick" x1="{x:.1f}" y1="{y_label - 4}" '
+                     f'x2="{axis_x0 - 4:.1f}" y2="{y_label - 4}"/>')
+        parts.append(f'<text class="cd-name" x="{axis_x0 - 6:.1f}" y="{y_label}" '
+                     f'text-anchor="end">{_svg_escape(name)} '
+                     f'<tspan class="cd-rank">({r:.2f})</tspan></text>')
+    for idx, name in enumerate(right_engines):
+        r = mean_ranks[name]
+        x = x_for_rank(r)
+        y_label = labels_y_base + idx * row_height
+        parts.append(f'<line class="cd-tick" x1="{x:.1f}" y1="{axis_y + 6}" '
+                     f'x2="{x:.1f}" y2="{y_label - 4}"/>')
+        parts.append(f'<line class="cd-tick" x1="{x:.1f}" y1="{y_label - 4}" '
+                     f'x2="{axis_x1 + 4:.1f}" y2="{y_label - 4}"/>')
+        parts.append(f'<text class="cd-name" x="{axis_x1 + 6:.1f}" y="{y_label}" '
+                     f'text-anchor="start">{_svg_escape(name)} '
+                     f'<tspan class="cd-rank">({r:.2f})</tspan></text>')
+    parts.append('</svg>')
+    return "".join(parts)
+def _svg_escape(text: str) -> str:
+    """Escape text for safe inclusion in an SVG/XML node."""
+    return (text.replace("&", "&amp;")
+                .replace("<", "&lt;")
+                .replace(">", "&gt;")
+                .replace('"', "&quot;")
+                .replace("'", "&#39;"))
+__all__ = ["build_critical_difference_svg"]
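The geometry of the diagram reduces to one linear mapping from mean rank to x coordinate. The sketch below reproduces that mapping outside the renderer, using the default width and paddings from the code above. The standalone signature of `x_for_rank`, the engine count `k = 4`, and the critical distance `cd = 1.5` are assumptions made for illustration only.

```python
def x_for_rank(rank: float, k: int, axis_x0: float, axis_x1: float) -> float:
    """Map a mean rank in [1, k] linearly onto [axis_x0, axis_x1]."""
    if k <= 1:
        return axis_x0
    return axis_x0 + (rank - 1.0) / (k - 1.0) * (axis_x1 - axis_x0)


# Same defaults as the renderer: 780 px wide, 40 px padding on each side.
width, left_pad, right_pad = 780, 40, 40
axis_x0, axis_x1 = left_pad, width - right_pad

k = 4  # hypothetical number of engines
positions = [x_for_rank(r, k, axis_x0, axis_x1) for r in (1.0, 2.5, 4.0)]

# The red reference bar uses the same scale: one rank unit spans axis_width / (k - 1).
cd = 1.5  # hypothetical critical distance
cd_bar_length = cd / (k - 1) * (axis_x1 - axis_x0)
```

Because the tie bars and the CD bar share this scale, a reader can compare the gap between two engines directly against the length of the reference bar.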

picarones/measurements/statistics/clustering.py ADDED

	@@ -0,0 +1,158 @@

+"""Clustering of error patterns (Sprint 7).
+Groups frequent OCR/HTR substitutions into readable clusters
+("confusion rn/m", "ligature œ/æ", etc.) for the HTML report.
+"""
+from __future__ import annotations
+import re
+from collections import defaultdict
+from dataclasses import dataclass
+from picarones.core.diff_utils import compute_word_diff
+# Frequent error patterns (OCR + HTR on heritage documents)
+_ERROR_PATTERNS = [
+    # (pattern_re, label)
+    (r"\brn\b.*\bm\b|\bm\b.*\brn\b|rn→m|m→rn",       "confusion rn/m"),
+    (r"[lI]→1|1→[lI]|l→1|1→l|I→1|1→I",               "confusion l/1/I"),
+    (r"u→n|n→u|v→u|u→v",                              "confusion u/n/v"),
+    (r"[oO]→0|0→[oO]",                                "confusion O/0"),
+    (r"ſ→[fs]|[fs]→ſ",                                "confusion ſ/f/s"),
+    (r"é→e|è→e|ê→e|e→[éèê]",                          "erreur diacritique é/e"),
+    (r"œ→oe|oe→œ|æ→ae|ae→æ",                          "ligature œ/æ"),
+    (r"[fF]i→fi|fi→[fF]i",                            "ligature fi"),
+    (r"[fF]l→fl|fl→[fF]l",                            "ligature fl"),
+    (r"\s+→''|''→\s+",                                "segmentation espace"),
+]
+def _extract_error_pairs(gt: str, hyp: str) -> list[tuple[str, str]]:
+    """Extract the (gt_char_seq, hyp_char_seq) pairs for substitution, deletion, and insertion errors.
+    ``compute_word_diff`` is imported at module top level
+    (circle 1 → circle 2, the allowed direction). The import used to be lazy
+    to work around a circle violation (Sprint A3) that no longer exists.
+    """
+    ops = compute_word_diff(gt, hyp)
+    pairs = []
+    for op in ops:
+        if op["op"] == "replace":
+            pairs.append((op["old"], op["new"]))
+        elif op["op"] == "delete":
+            pairs.append((op["text"], ""))
+        elif op["op"] == "insert":
+            pairs.append(("", op["text"]))
+    return pairs
+@dataclass
+class ErrorCluster:
+    """A cluster of similar errors."""
+    cluster_id: int
+    label: str
+    """Human-readable description of the pattern (e.g. 'confusion rn/m')."""
+    count: int
+    examples: list[dict]
+    """List of {engine, gt_fragment, ocr_fragment}."""
+    def as_dict(self) -> dict:
+        return {
+            "cluster_id": self.cluster_id,
+            "label": self.label,
+            "count": self.count,
+            "examples": self.examples[:5],  # at most 5 examples
+        }
+def cluster_errors(
+    error_data: list[dict],
+    max_clusters: int = 8,
+) -> list[ErrorCluster]:
+    """Group errors into clusters with readable labels.
+    Parameters
+    ----------
+    error_data : list of dicts {engine, gt, hypothesis}
+    max_clusters : maximum number of clusters to return
+    Returns
+    -------
+    List of ErrorCluster sorted by descending count.
+    """
+    # Collect all error patterns with their context
+    # Key: error category → list of examples
+    bucket: dict[str, list[dict]] = defaultdict(list)
+    other_pairs: list[dict] = []
+    for item in error_data:
+        engine = item.get("engine", "")
+        gt = item.get("gt", "")
+        hyp = item.get("hypothesis", "")
+        pairs = _extract_error_pairs(gt, hyp)
+        for old, new in pairs:
+            if not old and not new:
+                continue
+            matched = False
+            # Try to match a known pattern
+            probe = f"{old}→{new}"
+            for _pat, label in _ERROR_PATTERNS:
+                try:
+                    if re.search(_pat, probe, re.IGNORECASE):
+                        bucket[label].append({
+                            "engine": engine,
+                            "gt_fragment": old,
+                            "ocr_fragment": new,
+                        })
+                        matched = True
+                        break
+                except re.error:
+                    pass
+            if not matched:
+                # Group the remaining substitutions by character pair
+                if len(old) <= 3 and len(new) <= 3:
+                    key = f"{old}→{new}" if (old and new) else (f"—→{new}" if new else f"{old}→—")
+                    bucket[key].append({
+                        "engine": engine,
+                        "gt_fragment": old,
+                        "ocr_fragment": new,
+                    })
+                else:
+                    other_pairs.append({
+                        "engine": engine,
+                        "gt_fragment": old,
+                        "ocr_fragment": new,
+                    })
+    # Build the clusters, sorted by frequency
+    clusters: list[ErrorCluster] = []
+    cluster_id = 1
+    sorted_buckets = sorted(bucket.items(), key=lambda x: -len(x[1]))
+    for label, examples in sorted_buckets[:max_clusters - 1]:
+        clusters.append(ErrorCluster(
+            cluster_id=cluster_id,
+            label=label,
+            count=len(examples),
+            examples=examples,
+        ))
+        cluster_id += 1
+    # Catch-all cluster for the longer, unmatched substitutions
+    if other_pairs:
+        clusters.append(ErrorCluster(
+            cluster_id=cluster_id,
+            label="autres substitutions",
+            count=len(other_pairs),
+            examples=other_pairs,
+        ))
+    # Sort by descending count and cap at max_clusters
+    clusters.sort(key=lambda c: -c.count)
+    return clusters[:max_clusters]
+__all__ = ["ErrorCluster", "cluster_errors"]
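The fallback branch of `cluster_errors` keys short, unmatched substitutions by their character pair, with a dash standing in for an empty side. A minimal sketch of that keying rule, assuming a hypothetical helper `bucket_key` (it does not exist in the module) and made-up sample pairs:

```python
from collections import Counter

# The module uses an em dash (U+2014) as the placeholder for an empty side.
PLACEHOLDER = "\u2014"


def bucket_key(old: str, new: str) -> str:
    """Hypothetical helper mirroring the key built for short substitutions."""
    if old and new:
        return f"{old}→{new}"             # substitution
    if new:
        return f"{PLACEHOLDER}→{new}"     # insertion: nothing in the ground truth
    return f"{old}→{PLACEHOLDER}"         # deletion: nothing in the OCR output


# Illustrative (gt_fragment, ocr_fragment) pairs, not taken from a real corpus.
pairs = [("rn", "m"), ("rn", "m"), ("", "."), ("e", "")]
counts = Counter(bucket_key(old, new) for old, new in pairs)
top_label, top_count = counts.most_common(1)[0]
print(top_label, top_count)
```

Sorting the buckets by size, as the function does, would put the `rn→m` confusion first in this toy input.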

picarones/measurements/statistics/correlation.py ADDED

	@@ -0,0 +1,75 @@

+"""Correlation matrix between metrics (Sprint 7).
+Pearson coefficient between all numeric metrics of a
+DocumentResult: exposes redundant metrics (CER ↔ WER ≈ 1) and
+only partially correlated dimensions (CER ↔ image_quality ≈ 0.5).
+"""
+from __future__ import annotations
+import math
+from typing import Optional
+def _pearson(x: list[float], y: list[float]) -> float:
+    """Pearson correlation coefficient (0.0 if n < 2 or either series is constant)."""
+    n = len(x)
+    if n < 2:
+        return 0.0
+    mx = sum(x) / n
+    my = sum(y) / n
+    num = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
+    den = math.sqrt(
+        sum((xi - mx) ** 2 for xi in x) * sum((yi - my) ** 2 for yi in y)
+    )
+    return num / den if den > 0 else 0.0
+def compute_correlation_matrix(
+    metrics_per_doc: list[dict],
+    metric_keys: Optional[list[str]] = None,
+) -> dict:
+    """Compute the correlation matrix between all numeric metrics.
+    Parameters
+    ----------
+    metrics_per_doc : list of dicts, one per document, holding the metrics
+    metric_keys     : keys to include (None → every numeric key of the first document)
+    Returns
+    -------
+    {
+      "labels": [...],
+      "matrix": [[r_ij, ...], ...]   // Pearson coefficients, rounded to 4 decimals
+    }
+    Missing or ``None`` values are replaced by 0.0 before correlating.
+    """
+    if not metrics_per_doc:
+        return {"labels": [], "matrix": []}
+    if metric_keys is None:
+        # Infer the numeric keys from the first document
+        sample = metrics_per_doc[0]
+        metric_keys = [k for k, v in sample.items() if isinstance(v, (int, float))]
+    # Build one vector per metric
+    vectors: dict[str, list[float]] = {k: [] for k in metric_keys}
+    for doc in metrics_per_doc:
+        for k in metric_keys:
+            v = doc.get(k)
+            vectors[k].append(float(v) if v is not None else 0.0)
+    # Compute the matrix
+    labels = metric_keys
+    n = len(labels)
+    matrix = []
+    for i in range(n):
+        row = []
+        for j in range(n):
+            r = _pearson(vectors[labels[i]], vectors[labels[j]])
+            row.append(round(r, 4))
+        matrix.append(row)
+    return {"labels": labels, "matrix": matrix}
+__all__ = ["compute_correlation_matrix"]
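One consequence of the zero-variance guard in `_pearson` is worth seeing on data: a metric that is constant across documents correlates as 0.0 with everything, itself included, so the diagonal of the matrix is not always 1. The sketch below uses a local copy of the formula (named `pearson` here, for illustration only) and made-up metric values:

```python
import math


def pearson(x: list, y: list) -> float:
    """Local copy of the Pearson formula from the module, for illustration only."""
    n = len(x)
    if n < 2:
        return 0.0
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den if den > 0 else 0.0


# Made-up per-document metrics; "pages" is constant on purpose,
# and "wer" is exactly 2.5 times "cer".
docs = [
    {"cer": 0.02, "wer": 0.05, "pages": 1},
    {"cer": 0.10, "wer": 0.25, "pages": 1},
    {"cer": 0.04, "wer": 0.10, "pages": 1},
]
keys = ["cer", "wer", "pages"]
cols = {k: [float(d[k]) for d in docs] for k in keys}
matrix = [[round(pearson(cols[a], cols[b]), 4) for b in keys] for a in keys]
for label, row in zip(keys, matrix):
    print(label, row)
```

A renderer that assumes a unit diagonal may want to special-case constant metrics, or drop them before calling the function.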

picarones/measurements/statistics/distributions.py ADDED

	@@ -0,0 +1,88 @@

+"""Performance distribution curves (Sprint 7).
+- :func:`compute_reliability_curve`: for the X % easiest documents,
+  what is the mean CER? Reveals whether an engine has a long,
+  catastrophic tail.
+- :func:`compute_venn_data`: cardinalities for a 2- or 3-engine
+  Venn diagram over the sets of errors each engine makes.
+"""
+from __future__ import annotations
+def compute_reliability_curve(
+    cer_values: list[float],
+    steps: int = 20,
+) -> list[dict]:
+    """For the X % easiest documents, what is the mean CER?
+    Returns
+    -------
+    List of {pct_docs: float, mean_cer: float}, one entry per step;
+    ``pct_docs`` is a percentage (5.0 to 100.0 with the default 20 steps).
+    """
+    if not cer_values:
+        return []
+    sorted_cer = sorted(cer_values)
+    n = len(sorted_cer)
+    points = []
+    for step in range(1, steps + 1):
+        pct = step / steps
+        cutoff = max(1, int(pct * n))
+        subset = sorted_cer[:cutoff]
+        mean_cer = sum(subset) / len(subset)
+        points.append({"pct_docs": round(pct * 100, 1), "mean_cer": round(mean_cer, 6)})
+    return points
+def compute_venn_data(
+    engine_error_sets: dict[str, set[str]],
+) -> dict:
+    """Compute the cardinalities for a Venn diagram between 2 or 3 competitors.
+    Parameters
+    ----------
+    engine_error_sets : {engine_name → set of doc_id:error_token_pair strings}
+    Returns
+    -------
+    For 2 competitors:
+      {type: "venn2", only_a, only_b, both, label_a, label_b}
+    For 3 competitors:
+      {type: "venn3", only_a, only_b, only_c, ab, ac, bc, abc, label_a, label_b, label_c}
+    An empty dict if fewer than 2 competitors are given; competitors beyond
+    the third are ignored.
+    """
+    names = list(engine_error_sets.keys())[:3]  # at most 3 for a readable Venn diagram
+    if len(names) < 2:
+        return {}
+    sets = {n: engine_error_sets[n] for n in names}
+    if len(names) == 2:
+        a, b = names
+        sa, sb = sets[a], sets[b]
+        return {
+            "type": "venn2",
+            "label_a": a,
+            "label_b": b,
+            "only_a": len(sa - sb),
+            "only_b": len(sb - sa),
+            "both": len(sa & sb),
+        }
+    else:
+        a, b, c = names
+        sa, sb, sc = sets[a], sets[b], sets[c]
+        return {
+            "type": "venn3",
+            "label_a": a,
+            "label_b": b,
+            "label_c": c,
+            "only_a": len(sa - sb - sc),
+            "only_b": len(sb - sa - sc),
+            "only_c": len(sc - sa - sb),
+            "ab": len((sa & sb) - sc),
+            "ac": len((sa & sc) - sb),
+            "bc": len((sb & sc) - sa),
+            "abc": len(sa & sb & sc),
+        }
+__all__ = ["compute_reliability_curve", "compute_venn_data"]
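The seven regions returned for three engines are disjoint and together cover the union of the three error sets, which gives a cheap sanity check on the set arithmetic. A small sketch with made-up error identifiers (the `doc_id:gt→ocr` format is only an assumption about what the strings look like):

```python
# Made-up error identifiers, one set per engine.
errors = {
    "engine_a": {"d1:e→c", "d1:rn→m", "d2:s→f"},
    "engine_b": {"d1:rn→m", "d2:s→f", "d3:u→n"},
    "engine_c": {"d2:s→f", "d4:l→1"},
}
sa, sb, sc = errors["engine_a"], errors["engine_b"], errors["engine_c"]

# Same region definitions as the venn3 branch of compute_venn_data.
regions = {
    "only_a": len(sa - sb - sc),
    "only_b": len(sb - sa - sc),
    "only_c": len(sc - sa - sb),
    "ab": len((sa & sb) - sc),
    "ac": len((sa & sc) - sb),
    "bc": len((sb & sc) - sa),
    "abc": len(sa & sb & sc),
}
union_size = len(sa | sb | sc)
print(regions, union_size)
```

Here `d2:s→f` is the only error shared by all three engines, and the region counts add up to the five distinct errors in the union.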

picarones/measurements/statistics/friedman_nemenyi.py ADDED

	@@ -0,0 +1,350 @@

+"""Friedman test + Nemenyi post-hoc (Sprint 17).
+Reference: Demšar, J. (2006), "Statistical Comparisons of Classifiers
+over Multiple Data Sets", Journal of Machine Learning Research 7:1-30.
+De facto standard for comparing several systems over several
+datasets; here, several OCR engines over several documents.
+The canonical visual rendering (Critical Difference Diagram) lives in
+:mod:`picarones.measurements.statistics.cdd_render`, which keeps
+computation (this module) separate from presentation (that one).
+"""
+from __future__ import annotations
+import math
+from typing import Optional
+from picarones.measurements.statistics.wilcoxon import _normal_sf
+# Critical values of the Studentized Range distribution divided by √2,
+# for df = ∞ (the usual approximation for Nemenyi). Source: Tukey's tables.
+# Key: number of treatments k; value: q_α for α ∈ {0.05, 0.01}.
+_NEMENYI_Q_TABLE = {
+    # k   q_0.05   q_0.01
+    2:  (1.960, 2.576),
+    3:  (2.343, 2.913),
+    4:  (2.569, 3.113),
+    5:  (2.728, 3.255),
+    6:  (2.850, 3.364),
+    7:  (2.949, 3.452),
+    8:  (3.031, 3.526),
+    9:  (3.102, 3.590),
+    10: (3.164, 3.646),
+    11: (3.219, 3.696),
+    12: (3.268, 3.741),
+    13: (3.313, 3.781),
+    14: (3.354, 3.818),
+    15: (3.391, 3.853),
+    16: (3.426, 3.886),
+    17: (3.458, 3.916),
+    18: (3.489, 3.944),
+    19: (3.517, 3.970),
+    20: (3.544, 3.995),
+    25: (3.658, 4.095),
+    30: (3.739, 4.167),
+    40: (3.858, 4.272),
+    50: (3.945, 4.349),
+}
+def _chi_square_sf(x: float, df: int) -> float:
+    """Survival function of the chi-square distribution, 1 - CDF(x).
+    Uses scipy when available (exact), otherwise Wilson-Hilferty
+    (a normal approximation that is accurate from df ≥ 3).
+    """
+    if x <= 0 or df <= 0:
+        return 1.0
+    try:
+        from scipy.stats import chi2 as _chi2  # type: ignore[import-untyped]
+        return float(_chi2.sf(x, df))
+    except ImportError:
+        pass
+    # Wilson-Hilferty: maps the chi-square variate to an approximately normal one
+    z = (((x / df) ** (1.0 / 3.0)) - (1.0 - 2.0 / (9.0 * df))) / math.sqrt(2.0 / (9.0 * df))
+    return _normal_sf(z)
+def _rank_row(values: list[float]) -> list[float]:
+    """Ranks within one row (smallest value = rank 1). Ties get their average rank."""
+    n = len(values)
+    indexed = sorted(range(n), key=lambda i: values[i])
+    ranks = [0.0] * n
+    i = 0
+    while i < n:
+        j = i
+        while j < n and values[indexed[j]] == values[indexed[i]]:
+            j += 1
+        avg_rank = (i + j + 1) / 2.0  # 1-based
+        for k in range(i, j):
+            ranks[indexed[k]] = avg_rank
+        i = j
+    return ranks
+def _aligned_cer_matrix(
+    engine_cer_map: dict[str, list[float]],
+) -> tuple[list[str], list[list[float]]]:
+    """Build the (k engines × n documents) matrix, truncated to the shortest
+    series. Returns ``(names, matrix)`` with one row per engine.
+    Friedman requires complete blocks (documents): if the engines were not
+    all run on the same documents, every series is truncated to the minimum
+    length, which the result reports as ``n_blocks``. Truncation only aligns
+    lengths; callers must still pass documents in the same order.
+    """
+    names = list(engine_cer_map.keys())
+    if not names:
+        return [], []
+    min_len = min(len(v) for v in engine_cer_map.values())
+    if min_len == 0:
+        return names, []
+    matrix = [engine_cer_map[n][:min_len] for n in names]
+    return names, matrix
+def friedman_test(engine_cer_map: dict[str, list[float]]) -> dict:
+    """Friedman test: k engines on n paired documents.
+    Non-parametric counterpart of repeated-measures ANOVA, based on
+    within-document ranks. Null hypothesis: all engines perform equally
+    (same expected rank). Rejection → at least one engine differs.
+    Parameters
+    ----------
+    engine_cer_map:
+        Dict ``{engine_name → [cer_doc1, cer_doc2, ...]}``. All engines
+        must have been evaluated on the same documents, in the same order.
+    Returns
+    -------
+    dict with:
+      - ``statistic``     : Q, corrected for ties
+      - ``p_value``       : p-value (scipy if available, else Wilson-Hilferty)
+      - ``significant``   : bool, p < 0.05
+      - ``df``            : degrees of freedom = k - 1
+      - ``n_blocks``      : number of documents (blocks) used
+      - ``n_engines``     : number of engines (k)
+      - ``mean_ranks``    : dict ``{engine: mean_rank}``
+      - ``interpretation``: human-readable sentence
+      - ``error``         : error code, present only when the test is not applicable
+    """
+    names, matrix = _aligned_cer_matrix(engine_cer_map)
+    k = len(names)
+    n = len(matrix[0]) if matrix else 0
+    if k < 2:
+        return {
+            "statistic": 0.0, "p_value": 1.0, "significant": False,
+            "df": 0, "n_blocks": n, "n_engines": k,
+            "mean_ranks": {names[0]: 1.0} if k == 1 else {},
+            "interpretation": "Test de Friedman non applicable : il faut au moins 2 moteurs.",
+            "error": "not_enough_engines",
+        }
+    if n < 2:
+        return {
+            "statistic": 0.0, "p_value": 1.0, "significant": False,
+            "df": k - 1, "n_blocks": n, "n_engines": k,
+            "mean_ranks": {name: 1.0 for name in names},
+            "interpretation": "Test de Friedman non applicable : il faut au moins 2 documents communs.",
+            "error": "not_enough_blocks",
+        }
+    # Ranks per block (document): for each document, rank the k engines
+    ranks_by_engine: list[list[float]] = [[] for _ in range(k)]
+    for j in range(n):
+        row = [matrix[i][j] for i in range(k)]
+        row_ranks = _rank_row(row)
+        for i in range(k):
+            ranks_by_engine[i].append(row_ranks[i])
+    rank_sums = [sum(r) for r in ranks_by_engine]
+    mean_ranks = {names[i]: rank_sums[i] / n for i in range(k)}
+    # Uncorrected Q statistic (assumes no ties)
+    #   Q = 12 / (n·k·(k+1)) · Σ R_j² − 3·n·(k+1)
+    Q = (12.0 / (n * k * (k + 1))) * sum(rs ** 2 for rs in rank_sums) - 3.0 * n * (k + 1)
+    # Tie correction: adjusts Q when ranks are shared within some blocks.
+    # Formula: Q_corr = Q / (1 - T/(n·(k³−k)))
+    # where T = Σ (tⱼ³ − tⱼ) over all groups of tied values.
+    tie_correction = 0.0
+    for j in range(n):
+        row = [matrix[i][j] for i in range(k)]
+        sorted_row = sorted(row)
+        i = 0
+        while i < len(sorted_row):
+            count = 1
+            while i + count < len(sorted_row) and sorted_row[i + count] == sorted_row[i]:
+                count += 1
+            if count > 1:
+                tie_correction += count ** 3 - count
+            i += count
+    denom = 1.0 - tie_correction / (n * (k ** 3 - k)) if k >= 2 else 1.0
+    if denom > 0:
+        Q = Q / denom
+    df = k - 1
+    p_value = _chi_square_sf(Q, df)
+    significant = p_value < 0.05
+    if significant:
+        interpretation = (
+            f"Test de Friedman significatif (Q = {Q:.3f}, df = {df}, p = {p_value:.4f}). "
+            f"Au moins un moteur diffère des autres — utiliser le post-hoc Nemenyi "
+            f"pour identifier les paires distinguables."
+        )
+    else:
+        interpretation = (
+            f"Test de Friedman non significatif (Q = {Q:.3f}, df = {df}, p = {p_value:.4f}). "
+            f"Aucune différence globale détectée entre les moteurs sur ce corpus."
+        )
+    return {
+        "statistic": round(Q, 4),
+        "p_value": round(p_value, 6),
+        "significant": significant,
+        "df": df,
+        "n_blocks": n,
+        "n_engines": k,
+        "mean_ranks": {k_: round(v, 4) for k_, v in mean_ranks.items()},
+        "interpretation": interpretation,
+    }
+def _nemenyi_critical_value(k: int, alpha: float = 0.05) -> Optional[float]:
+    """Critical value q_α for k treatments, df = ∞.
+    Returns ``None`` when k < 2. Only α = 0.05 and α = 0.01 are tabulated;
+    any other α falls back to the 0.05 column. For k above the largest
+    tabulated key (50), the value for k = 50 is returned: this
+    *underestimates* q_α, so the resulting critical distance is too small
+    and the test is anti-conservative for very large k.
+    """
+    if k < 2:
+        return None
+    col = 1 if alpha == 0.01 else 0  # column index: 0 → q_0.05, 1 → q_0.01
+    if k in _NEMENYI_Q_TABLE:
+        return _NEMENYI_Q_TABLE[k][col]
+    keys = sorted(_NEMENYI_Q_TABLE)
+    # Beyond the table: clamp to the last tabulated value (a lower bound on q_α)
+    if k > keys[-1]:
+        return _NEMENYI_Q_TABLE[keys[-1]][col]
+    # Between two tabulated keys: linear interpolation
+    for lo, hi in zip(keys, keys[1:]):
+        if lo < k < hi:
+            q_lo = _NEMENYI_Q_TABLE[lo][col]
+            q_hi = _NEMENYI_Q_TABLE[hi][col]
+            frac = (k - lo) / (hi - lo)
+            return q_lo + frac * (q_hi - q_lo)
+    return None
+def nemenyi_posthoc(
+    engine_cer_map: dict[str, list[float]],
+    alpha: float = 0.05,
+) -> dict:
+    """Nemenyi post-hoc: identifies the engine pairs that are statistically
+    indistinguishable after a Friedman test.
+    Computes the *critical distance* CD = q_α · √(k·(k+1) / (6·n)). Two engines
+    whose mean ranks differ by less than CD are **not** statistically
+    distinguishable at level α.
+    Returns
+    -------
+    dict with:
+      - ``alpha``               : level used
+      - ``critical_distance``   : computed CD
+      - ``q_alpha``             : critical value q_α taken from the table
+      - ``n_blocks``, ``n_engines``
+      - ``mean_ranks``          : mean rank per engine (dict)
+      - ``engines_sorted``      : engines sorted by ascending mean rank
+      - ``significant_matrix``  : bool matrix (list[list[bool]]),
+                                  ``True`` = significantly different pair
+      - ``tied_groups``         : list of lists of indistinguishable engines
+                                  (greedy, non-overlapping groups: each engine
+                                  is within CD of the first engine of its group)
+      - ``error``               : present when the test is not applicable
+    """
+    names, matrix = _aligned_cer_matrix(engine_cer_map)
+    k = len(names)
+    n = len(matrix[0]) if matrix else 0
+    if k < 2 or n < 2:
+        return {
+            "alpha": alpha,
+            "critical_distance": 0.0,
+            "q_alpha": 0.0,
+            "n_blocks": n,
+            "n_engines": k,
+            "mean_ranks": {name: 1.0 for name in names},
+            "engines_sorted": list(names),
+            "significant_matrix": [[False] * k for _ in range(k)],
+            "tied_groups": [list(names)] if names else [],
+            "error": "not_enough_data",
+        }
+    # Friedman also yields the mean ranks; they are recomputed here so the
+    # function stays standalone (callers need not chain the two calls).
+    ranks_by_engine: list[list[float]] = [[] for _ in range(k)]
+    for j in range(n):
+        row = [matrix[i][j] for i in range(k)]
+        row_ranks = _rank_row(row)
+        for i in range(k):
+            ranks_by_engine[i].append(row_ranks[i])
+    mean_ranks_list = [sum(r) / n for r in ranks_by_engine]
+    mean_ranks = {names[i]: round(mean_ranks_list[i], 4) for i in range(k)}
+    q_alpha = _nemenyi_critical_value(k, alpha) or 0.0
+    critical_distance = q_alpha * math.sqrt(k * (k + 1) / (6.0 * n))
+    # Significance matrix: pair (i, j) is significant if |R_i - R_j| > CD
+    significant_matrix = [
+        [
+            (i != j) and (abs(mean_ranks_list[i] - mean_ranks_list[j]) > critical_distance)
+            for j in range(k)
+        ]
+        for i in range(k)
+    ]
+    # Practical tie groups: greedy sweep over the sorted mean ranks.
+    # An engine joins the current group if it lies within CD of the group's first engine.
+    order = sorted(range(k), key=lambda i: mean_ranks_list[i])
+    sorted_names = [names[i] for i in order]
+    sorted_ranks = [mean_ranks_list[i] for i in order]
+    tied_groups: list[list[str]] = []
+    i = 0
+    while i < len(sorted_names):
+        # extend the group while the next engine is within CD of the group's first engine
+        j = i
+        while j + 1 < len(sorted_names) and (sorted_ranks[j + 1] - sorted_ranks[i]) <= critical_distance:
+            j += 1
+        tied_groups.append(sorted_names[i:j + 1])
+        i = j + 1
+    return {
+        "alpha": alpha,
+        "critical_distance": round(critical_distance, 4),
+        "q_alpha": round(q_alpha, 4),
+        "n_blocks": n,
+        "n_engines": k,
+        "mean_ranks": mean_ranks,
+        "engines_sorted": sorted_names,
+        "significant_matrix": significant_matrix,
+        "tied_groups": tied_groups,
+    }
+__all__ = [
+    # Public symbols.
+    "friedman_test",
+    "nemenyi_posthoc",
+    # Re-exported private symbols (used by the Sprint 18 tests).
+    # Note: ``_aligned_cer_matrix`` stays strictly internal to this module
+    # (used only by friedman_test and nemenyi_posthoc); it is neither in
+    # __all__ nor re-exported by the sub-package's __init__.py.
+    "_chi_square_sf",
+    "_nemenyi_critical_value",
+    "_rank_row",
+]
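The Q statistic and the critical distance can be followed by hand on a tiny input. The sketch below assumes made-up ranks for k = 3 engines on n = 4 documents with no ties, and uses the closed form of the chi-square survival function for df = 2, exp(-x/2), instead of scipy. With only four documents the chi-square approximation is not reliable in practice; the sizes are chosen purely so the arithmetic stays readable.

```python
import math

# Within-document ranks (rank 1 = lowest CER). Made-up data in which engine A
# always wins and engine C always loses, so no tie correction is needed.
ranks = {"A": [1, 1, 1, 1], "B": [2, 2, 2, 2], "C": [3, 3, 3, 3]}
k, n = len(ranks), 4

rank_sums = [sum(r) for r in ranks.values()]  # [4, 8, 12]
# Q = 12 / (n·k·(k+1)) · Σ R_j² − 3·n·(k+1)
q_stat = 12.0 / (n * k * (k + 1)) * sum(s ** 2 for s in rank_sums) - 3.0 * n * (k + 1)

# For df = k - 1 = 2 the chi-square survival function is exactly exp(-x / 2).
p_value = math.exp(-q_stat / 2.0)

# Nemenyi critical distance, with q_0.05 = 2.343 for k = 3 (from _NEMENYI_Q_TABLE).
cd = 2.343 * math.sqrt(k * (k + 1) / (6.0 * n))

mean_ranks = {name: sum(r) / n for name, r in ranks.items()}
a_vs_c_distinguishable = abs(mean_ranks["A"] - mean_ranks["C"]) > cd
a_vs_b_distinguishable = abs(mean_ranks["A"] - mean_ranks["B"]) > cd
print(q_stat, round(p_value, 4), round(cd, 4), a_vs_c_distinguishable, a_vs_b_distinguishable)
```

Even though A beats B on every document, their mean ranks differ by 1, which is below the critical distance of about 1.66, so only the A versus C pair is flagged. This is the usual behaviour of Nemenyi on small corpora: the global test can be significant while most pairs remain indistinguishable.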

picarones/measurements/statistics/pareto.py ADDED

	@@ -0,0 +1,87 @@

+"""Multi-objective Pareto front (Sprint 19).
+Generic algorithm over N objectives (CER, cost, speed, CO₂…).
+Returns the names of the non-dominated points.
+"""
+from __future__ import annotations
+from typing import Optional
+def compute_pareto_front(
+    points: list[dict],
+    objectives: tuple[str, ...] = ("cer", "cost"),
+    name_key: str = "engine",
+    minimize: Optional[tuple[bool, ...]] = None,
+) -> list[str]:
+    """Compute the Pareto front over ``len(objectives)`` dimensions.
+    A point ``p`` is on the front (non-dominated) if no other point is at
+    least as good on ALL objectives AND strictly better on at least one.
+    Parameters
+    ----------
+    points:
+        List of dicts. Each dict must contain ``name_key`` and every key
+        in ``objectives``. Points with a missing, ``None`` or non-numeric
+        objective value are skipped (no comparison possible).
+    objectives:
+        Keys of the objectives to minimise/maximise.
+    name_key:
+        Key identifying the point (default ``"engine"``).
+    minimize:
+        For each objective, ``True`` = minimise (e.g. CER, cost),
+        ``False`` = maximise (e.g. anchoring). Must have the same length
+        as ``objectives``. ``None`` minimises every objective.
+    Returns
+    -------
+    List of the ``name`` of each point on the Pareto front, in the order
+    they appear in ``points``.
+    """
+    if minimize is None:
+        minimize = tuple(True for _ in objectives)
+    if len(minimize) != len(objectives):
+        raise ValueError("`minimize` doit avoir la même longueur que `objectives`")
+    valid = []
+    for p in points:
+        try:
+            vals = tuple(float(p[k]) for k in objectives)
+        except (KeyError, TypeError, ValueError):
+            continue
+        valid.append((p[name_key], vals))
+    front: list[str] = []
+    for name_a, vals_a in valid:
+        dominated = False
+        for name_b, vals_b in valid:
+            if name_a == name_b:
+                continue
+            # B dominates A if B is at least as good everywhere AND strictly better somewhere
+            better_or_equal_everywhere = True
+            strictly_better_somewhere = False
+            for va, vb, mini in zip(vals_a, vals_b, minimize):
+                if mini:
+                    if vb > va:
+                        better_or_equal_everywhere = False
+                        break
+                    if vb < va:
+                        strictly_better_somewhere = True
+                else:  # maximise
+                    if vb < va:
+                        better_or_equal_everywhere = False
+                        break
+                    if vb > va:
+                        strictly_better_somewhere = True
+            if better_or_equal_everywhere and strictly_better_somewhere:
+                dominated = True
+                break
+        if not dominated:
+            front.append(name_a)
+    return front
+__all__ = ["compute_pareto_front"]
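The dominance rule is easiest to check on a handful of points. The following is a condensed sketch, not the module's code: it assumes every objective is minimised, uses a hypothetical `dominates` helper, and the (CER, cost) values are invented. Two identical points are included to show that neither dominates the other, so both stay on the front.

```python
from __future__ import annotations


def dominates(b: tuple[float, ...], a: tuple[float, ...]) -> bool:
    """True if b is at least as good as a everywhere and strictly better somewhere.

    Hypothetical helper; all objectives are minimised in this sketch.
    """
    no_worse = all(vb <= va for va, vb in zip(a, b))
    strictly_better = any(vb < va for va, vb in zip(a, b))
    return no_worse and strictly_better


# Made-up (cer, cost) pairs, both to be minimised.
points = {
    "fast_cheap": (0.08, 1.0),
    "accurate": (0.03, 5.0),
    "dominated": (0.09, 6.0),  # worse than fast_cheap on both objectives
    "tie_a": (0.05, 3.0),
    "tie_b": (0.05, 3.0),      # identical to tie_a
}
front = [
    name
    for name, vals in points.items()
    if not any(dominates(other, vals) for o_name, other in points.items() if o_name != name)
]
print(front)
```

The quadratic scan is the same shape as in `compute_pareto_front` and is perfectly adequate for a few dozen engines.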

picarones/measurements/statistics/wilcoxon.py ADDED

	@@ -0,0 +1,227 @@

+"""Wilcoxon signed-rank test + pairwise tests (Sprint 7).
+Non-parametric test comparing 2 paired series (same documents,
+two different engines). Uses scipy when available (exact method for
+small samples), otherwise a native normal approximation (n ≥ 10)
+or a simplified critical-value table for very small n.
+"""
+from __future__ import annotations
+import math
+# Optional scipy import, used for the Wilcoxon test when available
+# (exact method for small samples, normal approximation for larger ones;
+# the cut-off depends on the scipy version).
+# Without it, the native implementation (normal approximation for n ≥ 10)
+# is used automatically.
+try:
+    from scipy.stats import wilcoxon as _scipy_wilcoxon  # type: ignore[import-untyped]
+    _SCIPY_AVAILABLE = True
+except ImportError:
+    _SCIPY_AVAILABLE = False
+def wilcoxon_test(
+    a: list[float],
+    b: list[float],
+    zero_method: str = "wilcox",
+) -> dict:
+    """Wilcoxon signed-rank test between two paired CER series.
+    Returns a dict with:
+      - statistic     : W = min(W⁺, W⁻)
+      - p_value       : two-sided p-value
+      - significant   : bool (p < 0.05)
+      - interpretation : human-readable sentence
+      - n_pairs       : number of pairs used (after dropping zeros)
+      - W_plus        : rank sum of the positive differences
+      - W_minus       : rank sum of the negative differences
+    Assumptions and limits
+    ----------------------
+    * Observations are paired (same corpus, two different engines).
+    * The test is non-parametric: no normality assumption on the CERs.
+    * ``zero_method="wilcox"`` (default): pairs with no difference (aᵢ = bᵢ)
+      are simply dropped. The other methods (``"pratt"``, ``"zsplit"``)
+      require scipy.
+    * **Normal approximation** (native implementation, n ≥ 10):
+      reasonable from n ≥ 10 and converging to the exact distribution as
+      n grows. For n < 10, a simplified critical-value table is used
+      (p ∈ {0.04, 0.20}), which only says on which side of 0.05 the
+      result falls.
+    * **scipy** (if installed): ``scipy.stats.wilcoxon`` replaces the
+      native approximation. scipy chooses between the exact method (small
+      samples) and the normal approximation (larger samples); the cut-off
+      depends on the scipy version.
+    * **Validity**: the test assumes the distribution of differences is
+      symmetric. With very small n (< 5), results are unreliable whatever
+      the method.
+    Parameters
+    ----------
+    a, b : CER series (same length, same document order)
+    zero_method : handling of zero-difference pairs (default: ``"wilcox"``)
+    """
+    if len(a) != len(b):
+        raise ValueError("Les deux listes doivent avoir la même longueur")
+    diffs = [x - y for x, y in zip(a, b)]
+    # Drop zero differences ("wilcox" method)
+    if zero_method == "wilcox":
+        diffs = [d for d in diffs if d != 0.0]
+    n = len(diffs)
+    if n == 0:
+        return {
+            "statistic": 0.0,
+            "p_value": 1.0,
+            "significant": False,
+            "interpretation": "Aucune différence entre les deux concurrents.",
+            "n_pairs": 0,
+            "W_plus": 0.0,
+            "W_minus": 0.0,
+        }
+    # Ranks of the absolute differences
+    abs_diffs = [abs(d) for d in diffs]
+    indexed = sorted(enumerate(abs_diffs), key=lambda x: x[1])
+    # Ties get their average rank
+    ranks = [0.0] * n
+    i = 0
+    while i < n:
+        j = i
+        while j < n and abs_diffs[indexed[j][0]] == abs_diffs[indexed[i][0]]:
+            j += 1
+        avg_rank = (i + j + 1) / 2.0  # average rank (1-based)
+        for k in range(i, j):
+            ranks[indexed[k][0]] = avg_rank
+        i = j
+    W_plus  = sum(ranks[k] for k in range(n) if diffs[k] > 0)
+    W_minus = sum(ranks[k] for k in range(n) if diffs[k] < 0)
+    W = min(W_plus, W_minus)
+    # p-value: scipy when available, otherwise the native approximation
+    if _SCIPY_AVAILABLE:
+        try:
+            scipy_res = _scipy_wilcoxon(diffs, zero_method=zero_method)
+            p_value = float(scipy_res.pvalue)
+        except Exception:  # noqa: BLE001
+            # Graceful fallback to the native implementation if scipy raises
+            p_value = _native_p_value(n, W)
+    else:
+        p_value = _native_p_value(n, W)
+    significant = p_value < 0.05
+    if significant:
+        better = "premier" if W_plus < W_minus else "second"
+        interpretation = (
+            f"Différence statistiquement significative (p = {p_value:.4f} < 0.05). "
+            f"Le {better} concurrent obtient de meilleurs scores."
+        )
+    else:
+        interpretation = (
+            f"Différence non significative (p = {p_value:.4f} ≥ 0.05). "
+            "On ne peut pas conclure que l'un surpasse l'autre."
+        )
+    return {
+        "statistic": round(W, 4),
+        "p_value": round(p_value, 6),
+        "significant": significant,
+        "interpretation": interpretation,
+        "n_pairs": n,
+        "W_plus": round(W_plus, 4),
+        "W_minus": round(W_minus, 4),
+    }
+def _normal_sf(z: float) -> float:
+    """Survival function of the standard normal distribution (1 - CDF).
+    Abramowitz & Stegun approximation 26.2.17. Used by this module
+    for Wilcoxon AND by friedman_nemenyi for the Wilson-Hilferty
+    fallback when scipy is not available.
+    """
+    t = 1.0 / (1.0 + 0.2316419 * abs(z))
+    poly = t * (0.319381530 + t * (-0.356563782 + t * (1.781477937
+           + t * (-1.821255978 + t * 1.330274429))))
+    phi_z = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
+    p = phi_z * poly
+    return p if z >= 0 else 1.0 - p
+# Critical values of W for a two-sided α = 0.05 (exact test, source: Wilcoxon tables).
+# No entry for n < 6: with so few pairs the smallest attainable two-sided
+# p-value, 2 / 2**n, already exceeds 0.05.
+_W_CRITICAL = {6: 0, 7: 2, 8: 3, 9: 5}
+def _wilcoxon_exact_p(n: int, w: float) -> float:
+    """Coarse p-value for small n (< 10) from a simplified critical-value table.
+    Only two values are returned: 0.04 (significant at the 5 % level) or
+    0.20 (not significant). These are placeholders on either side of the
+    threshold, not exact p-values; prefer scipy when exact values matter.
+    """
+    critical = _W_CRITICAL.get(n)
+    if critical is not None and w <= critical:
+        return 0.04  # significant at the 5 % level
+    return 0.20      # not significant (or n too small to ever reach significance)
+def _native_p_value(n: int, W: float) -> float:
+    """p-value from the normal approximation (n ≥ 10) or the critical-value table (n < 10)."""
+    if n >= 10:
+        mu = n * (n + 1) / 4.0
+        sigma2 = n * (n + 1) * (2 * n + 1) / 24.0
+        if sigma2 <= 0:
+            return 1.0
+        z = abs((W + 0.5) - mu) / math.sqrt(sigma2)  # continuity correction
+        return 2.0 * _normal_sf(z)  # two-sided test
+    return _wilcoxon_exact_p(n, W)
+def compute_pairwise_stats(
+    engine_cer_map: dict[str, list[float]],
+) -> list[dict]:
+    """Run the Wilcoxon test between every pair of competitors.
+    Parameters
+    ----------
+    engine_cer_map : dict {engine_name → [cer_doc1, cer_doc2, ...]}
+    Returns
+    -------
+    List of dicts, one per pair:
+      - engine_a, engine_b, plus every key returned by :func:`wilcoxon_test`
+    Pairs with fewer than 2 common documents are skipped.
+    """
+    names = list(engine_cer_map.keys())
+    results = []
+    for i in range(len(names)):
+        for j in range(i + 1, len(names)):
+            a_name, b_name = names[i], names[j]
+            a_vals = engine_cer_map[a_name]
+            b_vals = engine_cer_map[b_name]
+            # Align lengths (truncate to the shorter series)
+            min_len = min(len(a_vals), len(b_vals))
+            if min_len < 2:
+                continue
+            res = wilcoxon_test(a_vals[:min_len], b_vals[:min_len])
+            results.append({
+                "engine_a": a_name,
+                "engine_b": b_name,
+                **res,
+            })
+    return results
+__all__ = [
+    # Public symbols: stable signature, consumed directly by the
+    # tests through the re-export in ``picarones.measurements.statistics``.
+    "compute_pairwise_stats",
+    "wilcoxon_test",
+    # Re-exported private symbols (used by some tests):
+    # ``_SCIPY_AVAILABLE`` is used to skip the scipy tests when the
+    # dependency is not installed. ``_normal_sf`` is also imported
+    # by :mod:`friedman_nemenyi` as a pure math utility.
+    "_SCIPY_AVAILABLE",
+    "_normal_sf",
+]
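For small n the exact null distribution of the signed-rank statistic is cheap to enumerate: under the null hypothesis each of the 2**n sign assignments is equally likely. The sketch below is a standalone illustration, not the module's code; `exact_two_sided_p` is a hypothetical helper that assumes no zero differences and no tied absolute differences, and the CER differences are made up.

```python
from itertools import product


def exact_two_sided_p(diffs: list) -> float:
    """Exact two-sided p-value by enumerating all 2**n sign assignments.

    Hypothetical helper; assumes no zero differences and no tied
    absolute differences, so the ranks are simply 1..n.
    """
    n = len(diffs)
    total = n * (n + 1) // 2
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0] * n
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_obs = min(w_plus, total - w_plus)
    hits = 0
    for signs in product((0, 1), repeat=n):
        wp = sum(r for r, s in zip(ranks, signs) if s)
        if min(wp, total - wp) <= w_obs:
            hits += 1
    return hits / 2 ** n


# Made-up CER differences (engine A minus engine B), all negative: A always wins.
p5 = exact_two_sided_p([-0.01, -0.02, -0.03, -0.04, -0.05])
p6 = exact_two_sided_p([-0.01, -0.02, -0.03, -0.04, -0.05, -0.06])
print(p5, p6)
```

Even when one engine wins on every document, five pairs give a two-sided p of 0.0625, so significance at the 5 % level needs at least six non-zero pairs.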

picarones/report/assets.py ADDED

	@@ -0,0 +1,203 @@

+"""Loading and preparation of the HTML report's assets.
+This module gathers everything related to the binary resources
+embedded in or referenced by the report:
+- ``load_vendor_js`` reads a vendored JS file (Chart.js, etc.).
+- ``encode_image_b64`` resizes an image and encodes it as a data URI.
+- ``encode_images_b64_from_result`` iterates over a BenchmarkResult.
+- ``externalize_images_to_dir`` writes the images to disk next to
+  the HTML (``--lazy-images`` mode from Sprint A5).
+Extracted from ``picarones/report/generator.py`` during the
+file-splitting sprint: isolates image and vendor I/O from the rest of the orchestration.
+"""
+from __future__ import annotations
+import base64
+import io
+import logging
+from pathlib import Path
+from typing import TYPE_CHECKING
+if TYPE_CHECKING:
+    from picarones.core.results import BenchmarkResult
+logger = logging.getLogger(__name__)
+#: Directory holding the embedded JS resources.
+_VENDOR_DIR = Path(__file__).parent / "vendor"
+def load_vendor_js(name: str) -> str:
+    """Read a vendored JS file and return its contents.
+    If the file does not exist, return a JS comment that keeps the
+    report valid (no SyntaxError in the browser).
+    """
+    p = _VENDOR_DIR / name
+    if p.exists():
+        return p.read_text(encoding="utf-8")
+    return f"/* vendor/{name} non trouvé */"
+def encode_image_b64(image_path: str, max_width: int = 1200) -> str:
+    """Read an image, resize it if needed, and return a base64 data URI.
+    Returns ``""`` if the image cannot be found or if encoding fails
+    (Pillow unavailable, unsupported format, corrupted file).
+    Logs a warning when encoding fails; the report remains functional
+    but the image will be missing from the gallery.
+    Distinguishes ``ImportError`` (Pillow not installed, an environment
+    problem) from everything else (a per-image problem) to help with
+    diagnosis in production logs.
+    """
+    p = Path(image_path)
+    if not p.exists():
+        return ""
+    try:
+        from PIL import Image
+    except ImportError as exc:
+        logger.warning(
+            "[report] Pillow indisponible : %s — toutes les images "
+            "du rapport seront omises. Installer ``pip install Pillow`` "
+            "ou ``pip install picarones[report]``.",
+            exc,
+        )
+        return ""
+    try:
+        with Image.open(p) as img:
+            if img.width > max_width:
+                ratio = max_width / img.width
+                new_h = max(1, int(img.height * ratio))
+                img = img.resize((max_width, new_h), Image.LANCZOS)
+            # Convert to RGB to avoid mode problems (RGBA, palette…)
+            if img.mode not in ("RGB", "L"):
+                img = img.convert("RGB")
+            buf = io.BytesIO()
+            fmt = "JPEG" if p.suffix.lower() in (".jpg", ".jpeg") else "PNG"
+            img.save(buf, format=fmt, optimize=True, quality=85)
+            b64 = base64.b64encode(buf.getvalue()).decode("ascii")
+            mime = "image/jpeg" if fmt == "JPEG" else "image/png"
+            return f"data:{mime};base64,{b64}"
+    except Exception as exc:  # noqa: BLE001 — fallback gracieux + warning
+        logger.warning(
+            "[report] échec d'encodage base64 de l'image %s : %s — "
+            "le rapport ignorera cette image",
+            image_path,
+            exc,
+        )
+        return ""
+def encode_images_b64_from_result(
+    benchmark: "BenchmarkResult", max_width: int = 1200,
+) -> dict[str, str]:
+    """Encode the images of a BenchmarkResult as base64.
+    Only the document results of the first engine report are read.
+    Returns
+    -------
+    dict
+        ``{doc_id: data_uri}``
+    """
+    images: dict[str, str] = {}
+    if not benchmark.engine_reports:
+        return images
+    for dr in benchmark.engine_reports[0].document_results:
+        if dr.image_path and dr.doc_id not in images:
+            uri = encode_image_b64(dr.image_path, max_width=max_width)
+            if uri:
+                images[dr.doc_id] = uri
+    return images
+def externalize_images_to_dir(
+    benchmark: "BenchmarkResult",
+    output_dir: Path,
+    max_width: int = 1200,
+    asset_subdir: str = "report-assets",
+) -> dict[str, str]:
+    """Sprint A5 (item M-16): write the images to disk in a
+    subdirectory next to the HTML, and return ``{doc_id: url_relative}``.
+    "Lazy loading" mode: instead of embedding every image as base64
+    in the HTML (50 MB+ for a 100-document corpus, ~200 MB+ for
+    1,000 documents), the images are externalized as local PNG/JPEG
+    files. The HTML references them via
+    ``<img src="report-assets/…">`` with ``loading="lazy"`` on the
+    browser side.
+    The report stays self-contained as long as the user copies the
+    ``report-assets/`` directory alongside the HTML (see CLI ``--lazy-images``).
+    Parameters
+    ----------
+    benchmark:
+        Benchmark result (reads the ``image_path`` of each DocumentResult).
+    output_dir:
+        Directory where the HTML will be written; the assets
+        subdirectory is created next to the HTML.
+    max_width:
+        Maximum width for resizing (consistent with
+        ``encode_image_b64``).
+    asset_subdir:
+        Name of the assets subdirectory (default ``"report-assets"``).
+    Returns
+    -------
+    dict[str, str]
+        ``{doc_id: "report-assets/<safe_id><suffix>"}`` (relative URL
+        usable directly in an HTML ``src`` attribute; the file name is
+        the sanitized doc_id plus the source suffix, ``.png`` if none).
+    """
+    from PIL import Image
+    assets_dir = output_dir / asset_subdir
+    assets_dir.mkdir(parents=True, exist_ok=True)
+    out: dict[str, str] = {}
+    seen_ids: set[str] = set()
+    for engine_report in benchmark.engine_reports:
+        for dr in engine_report.document_results:
+            doc_id = dr.doc_id
+            if doc_id in seen_ids:
+                continue
+            seen_ids.add(doc_id)
+            try:
+                src = Path(dr.image_path)
+                if not src.exists():
+                    continue
+                # File name derived from the doc_id, normalized to drop
+                # characters that are unsafe for the filesystem.
+                safe_id = "".join(
+                    c if c.isalnum() or c in "._-" else "_" for c in doc_id
+                )
+                dest = assets_dir / f"{safe_id}{src.suffix.lower() or '.png'}"
+                with Image.open(src) as img:
+                    if img.width > max_width:
+                        ratio = max_width / img.width
+                        new_h = max(1, int(img.height * ratio))
+                        img = img.resize((max_width, new_h), Image.LANCZOS)
+                    if img.mode not in ("RGB", "L"):
+                        img = img.convert("RGB")
+                    fmt = "JPEG" if dest.suffix in (".jpg", ".jpeg") else "PNG"
+                    img.save(dest, format=fmt, optimize=True, quality=85)
+                # Relative URL (POSIX style even on Windows, for HTML).
+                out[doc_id] = f"{asset_subdir}/{dest.name}"
+            except Exception as exc:  # noqa: BLE001 — fallback silencieux + warning
+                logger.warning(
+                    "[report] échec d'externalisation de l'image %s : %s — "
+                    "le rapport ignorera cette image",
+                    dr.image_path,
+                    exc,
+                )
+    return out
+__all__ = [
+    "load_vendor_js",
+    "encode_image_b64",
+    "encode_images_b64_from_result",
+    "externalize_images_to_dir",
+]
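
The file-name normalization used by `externalize_images_to_dir` can be tried in isolation. The sketch below copies the same expression into a hypothetical helper, `safe_asset_name`, which is not a function of the module, to show which characters survive. Two things are worth noting: `str.isalnum()` also accepts non-ASCII letters, and distinct ids can collapse to the same file name.

```python
def safe_asset_name(doc_id: str, suffix: str) -> str:
    """Hypothetical helper mirroring the normalization in the diff above."""
    # Keep alphanumerics plus ".", "_" and "-"; replace everything else.
    safe_id = "".join(c if c.isalnum() or c in "._-" else "_" for c in doc_id)
    # Fall back to ".png" when the source file has no suffix.
    return f"{safe_id}{suffix.lower() or '.png'}"


name_jpg = safe_asset_name("folio 12/recto", ".JPG")
name_default = safe_asset_name("ms-é.01", "")
collision = safe_asset_name("a/b", ".png") == safe_asset_name("a b", ".png")
```

Here `name_jpg` is `"folio_12_recto.jpg"` and `name_default` is `"ms-é.01.png"`. `collision` is `True`: with ids such as `a/b` and `a b`, the second image would overwrite the first on disk, so a caller with such ids may want to add a disambiguating suffix.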

picarones/report/calibration_render.py CHANGED

@@ -28,21 +28,7 @@ from __future__ import annotations
 from html import escape as _e
 from typing import Optional
-def _color_for_ece(ece: float) -> str:
-    """Gradient vert (ECE = 0, bien calibré) → rouge (ECE = 0.5+)."""
-    f = max(0.0, min(1.0, ece * 2.0))  # ECE > 0.5 → rouge max
-    if f <= 0.5:
-        ratio = f / 0.5
-        r = int(130 + (240 - 130) * ratio)
-        g = int(200 + (220 - 200) * ratio)
-        b = int(130 + (130 - 130) * ratio)
-    else:
-        ratio = (f - 0.5) / 0.5
-        r = int(240 + (220 - 240) * ratio)
-        g = int(220 + (100 - 220) * ratio)
-        b = int(130 + (100 - 130) * ratio)
-    return f"#{r:02x}{g:02x}{b:02x}"
 def _engines_with_calibration(engines_summary: list[dict]) -> list[dict]:
@@ -98,7 +84,7 @@ def build_calibration_summary_html(
         acc = float(agg.get("overall_accuracy") or 0.0)
         conf = float(agg.get("overall_confidence") or 0.0)
         doc_count = int(agg.get("doc_count") or 0)
-        bg = _color_for_ece(ece)
         parts.append("<tr>")
         parts.append(
             f'<td style="padding:.3rem .5rem;font-weight:600">'

 from html import escape as _e
 from typing import Optional
+from picarones.report.render_helpers import color_traffic_light
 def _engines_with_calibration(engines_summary: list[dict]) -> list[dict]:
         acc = float(agg.get("overall_accuracy") or 0.0)
         conf = float(agg.get("overall_confidence") or 0.0)
         doc_count = int(agg.get("doc_count") or 0)
+        bg = color_traffic_light(ece, low_is_good=True, scale_max=0.5)
         parts.append("<tr>")
         parts.append(
             f'<td style="padding:.3rem .5rem;font-weight:600">'
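
The removed `_color_for_ece` helper and its replacement call `color_traffic_light(ece, low_is_good=True, scale_max=0.5)` both map a bounded score to a green-to-red gradient. The implementation of `color_traffic_light` is not shown in this diff, so the following is only a sketch of how such a helper could work. The function name `traffic_light_sketch`, the palette constants and the piecewise-linear interpolation are assumptions, not the code in `render_helpers`.

```python
GREEN = (60, 160, 90)     # assumed palette, not taken from render_helpers
YELLOW = (235, 200, 70)
RED = (220, 50, 50)


def lerp_rgb(c0, c1, t):
    """Linear interpolation between two RGB triples, t in [0, 1]."""
    return tuple(round(a + (b - a) * t) for a, b in zip(c0, c1))


def traffic_light_sketch(value, low_is_good=False, scale_max=1.0):
    """Map value in [0, scale_max] to a hex color (hypothetical helper)."""
    f = max(0.0, min(1.0, value / scale_max))  # clamp to [0, 1]
    badness = f if low_is_good else 1.0 - f    # 0 = best, 1 = worst
    if badness <= 0.5:
        rgb = lerp_rgb(GREEN, YELLOW, badness / 0.5)
    else:
        rgb = lerp_rgb(YELLOW, RED, (badness - 0.5) / 0.5)
    return "#{:02x}{:02x}{:02x}".format(*rgb)


well_calibrated = traffic_light_sketch(0.0, low_is_good=True, scale_max=0.5)
mid_ece = traffic_light_sketch(0.25, low_is_good=True, scale_max=0.5)
saturated = traffic_light_sketch(0.8, low_is_good=True, scale_max=0.5)
```

As with the removed helper, an ECE at or above `scale_max` saturates at the red end of the scale.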

picarones/report/error_absorption_render.py CHANGED

@@ -51,55 +51,15 @@ from __future__ import annotations
 from html import escape as _e
 from typing import Optional
-def _color_for_correction(rate: float) -> str:
-    """Faible (rouge) → élevé (vert) — bon = beaucoup corrigées."""
-    f = max(0.0, min(1.0, rate))
-    if f < 0.5:
-        t = f / 0.5
-        r = 235
-        g = int(70 + (200 - 70) * t)
-        b = 70
-    else:
-        t = (f - 0.5) / 0.5
-        r = int(235 + (60 - 235) * t)
-        g = int(200 + (160 - 200) * t)
-        b = int(70 + (90 - 70) * t)
-    return f"#{r:02x}{g:02x}{b:02x}"
-def _color_for_introduction(rate: float) -> str:
-    """Faible (vert) → élevé (rouge) — bon = peu introduites."""
-    f = max(0.0, min(1.0, rate))
-    if f < 0.5:
-        t = f / 0.5
-        r = int(60 + (235 - 60) * t)
-        g = int(160 + (180 - 160) * t)
-        b = int(90 + (60 - 90) * t)
-    else:
-        t = (f - 0.5) / 0.5
-        r = int(235 + (220 - 235) * t)
-        g = int(180 + (50 - 180) * t)
-        b = int(60 + (50 - 60) * t)
-    return f"#{r:02x}{g:02x}{b:02x}"
-def _color_for_net(net: int, max_abs: int) -> str:
-    """Vert si positif, rouge si négatif. Saturation à max_abs."""
-    if max_abs <= 0 or net == 0:
-        return "#a7f0a7"
-    f = max(-1.0, min(1.0, net / max_abs))
-    if f >= 0:
-        # vert clair → vert profond
-        r = int(167 + (90 - 167) * f)
-        g = int(240 + (200 - 240) * f)
-        b = int(167 + (90 - 167) * f)
-    else:
-        f = -f
-        r = int(167 + (220 - 167) * f)
-        g = int(240 + (50 - 240) * f)
-        b = int(167 + (50 - 167) * f)
-    return f"#{r:02x}{g:02x}{b:02x}"
 def build_error_absorption_html(
@@ -186,7 +146,7 @@ def build_error_absorption_html(
         intro_rate = entry.get("introduction_rate")
         if isinstance(corr_rate, (int, float)):
             corr_rate_str = f"{corr_rate * 100:.1f}%"
-            corr_color = _color_for_correction(float(corr_rate))
             corr_cell = (
                 f'<td style="padding:.4rem .6rem;text-align:right;'
                 f'background:{corr_color};font-family:monospace;'
@@ -199,7 +159,7 @@ def build_error_absorption_html(
             )
         if isinstance(intro_rate, (int, float)):
             intro_rate_str = f"{intro_rate * 100:.1f}%"
-            intro_color = _color_for_introduction(float(intro_rate))
             intro_cell = (
                 f'<td style="padding:.4rem .6rem;text-align:right;'
                 f'background:{intro_color};font-family:monospace;'
@@ -210,7 +170,13 @@ def build_error_absorption_html(
                 '<td style="padding:.4rem .6rem;text-align:right;'
                 'opacity:.4">—</td>'
             )
-        net_color = _color_for_net(net, max_abs_net)
         intro_sample = entry.get("introduced_tokens_sample") or []
         sample_cell_text = ", ".join(
             _e(str(t)) for t in intro_sample[:sample_max]

 from html import escape as _e
 from typing import Optional
+from picarones.report.render_helpers import color_diverging, color_traffic_light
+# "Net improvement" palette: light green at the center, deep green
+# when favorable (net > 0), red when unfavorable (net < 0).  Centered
+# on light green because a zero delta already means "no regression".
+_NET_NEUTRAL_RGB = (167, 240, 167)
+_NET_POSITIVE_RGB = (90, 200, 90)
+_NET_NEGATIVE_RGB = (220, 50, 50)
 def build_error_absorption_html(
         intro_rate = entry.get("introduction_rate")
         if isinstance(corr_rate, (int, float)):
             corr_rate_str = f"{corr_rate * 100:.1f}%"
+            corr_color = color_traffic_light(float(corr_rate))
             corr_cell = (
                 f'<td style="padding:.4rem .6rem;text-align:right;'
                 f'background:{corr_color};font-family:monospace;'
             )
         if isinstance(intro_rate, (int, float)):
             intro_rate_str = f"{intro_rate * 100:.1f}%"
+            intro_color = color_traffic_light(float(intro_rate), low_is_good=True)
             intro_cell = (
                 f'<td style="padding:.4rem .6rem;text-align:right;'
                 f'background:{intro_color};font-family:monospace;'
                 '<td style="padding:.4rem .6rem;text-align:right;'
                 'opacity:.4">—</td>'
             )
+        net_color = color_diverging(
+            float(net),
+            max_abs=float(max_abs_net) if max_abs_net else 1.0,
+            neutral_rgb=_NET_NEUTRAL_RGB,
+            positive_rgb=_NET_POSITIVE_RGB,
+            negative_rgb=_NET_NEGATIVE_RGB,
+        )
         intro_sample = entry.get("introduced_tokens_sample") or []
         sample_cell_text = ", ".join(
             _e(str(t)) for t in intro_sample[:sample_max]
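
The removed `_color_for_net` shows the logic that the new `color_diverging(...)` call is expected to cover: clamp `net / max_abs` to [-1, 1], then interpolate from a neutral color toward a positive or a negative color. A minimal sketch of that generalization follows, modeled on the removed helper. The name `diverging_sketch` is hypothetical, and the actual `color_diverging` in `render_helpers` may differ.

```python
def diverging_sketch(value, max_abs, neutral_rgb, positive_rgb, negative_rgb):
    """Interpolate neutral -> positive/negative color (hypothetical helper)."""
    if max_abs <= 0:
        f = 0.0  # no scale available: stay on the neutral color
    else:
        f = max(-1.0, min(1.0, value / max_abs))  # saturate at +/- max_abs
    target = positive_rgb if f >= 0 else negative_rgb
    t = abs(f)
    r, g, b = (int(n + (c - n) * t) for n, c in zip(neutral_rgb, target))
    return f"#{r:02x}{g:02x}{b:02x}"


NEUTRAL = (167, 240, 167)   # same triples as the _NET_* constants above
POSITIVE = (90, 200, 90)
NEGATIVE = (220, 50, 50)

zero = diverging_sketch(0, 10, NEUTRAL, POSITIVE, NEGATIVE)
best = diverging_sketch(10, 10, NEUTRAL, POSITIVE, NEGATIVE)
worst = diverging_sketch(-20, 10, NEUTRAL, POSITIVE, NEGATIVE)
```

A zero delta yields `#a7f0a7`, the constant the removed helper returned for `net == 0`, and values beyond `max_abs` saturate at the end colors.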

picarones/report/generator.py CHANGED

@@ -11,667 +11,56 @@ Vues disponibles
 2. Galerie     — grille d'images avec badge CER coloré
 3. Document    — image zoomable + diff coloré GT / OCR par moteur
 4. Analyses    — histogramme CER + graphique radar
 """
 from __future__ import annotations
-import base64
-import io
 import json
 import logging
 from pathlib import Path
 from typing import Any, Optional
-logger = logging.getLogger(__name__)
-# ---------------------------------------------------------------------------
-# Ressources vendor (embarquées dans le rapport HTML)
-# ---------------------------------------------------------------------------
-_VENDOR_DIR = Path(__file__).parent / "vendor"
-def _load_vendor_js(name: str) -> str:
-    """Lit un fichier JS vendorisé et retourne son contenu."""
-    p = _VENDOR_DIR / name
-    if p.exists():
-        return p.read_text(encoding="utf-8")
-    return f"/* vendor/{name} non trouvé */"
 from picarones.core.results import BenchmarkResult
-from picarones.core.diff_utils import compute_char_diff, compute_word_diff
-from picarones.measurements.statistics import (
-    compute_pairwise_stats,
-    compute_reliability_curve,
-    compute_correlation_matrix,
-    compute_venn_data,
-    cluster_errors,
-    bootstrap_ci,
-    friedman_test,
-    nemenyi_posthoc,
-    build_critical_difference_svg,
-    compute_pareto_front,
 )
-from picarones.measurements.pricing import build_costs_for_benchmark, load_pricing_database
-from picarones.measurements.difficulty import compute_all_difficulties, difficulty_label
-# ---------------------------------------------------------------------------
-# Helpers
-# ---------------------------------------------------------------------------
-def _encode_image_b64(image_path: str, max_width: int = 1200) -> str:
-    """Lit une image, la redimensionne si besoin, et retourne un data-URI base64."""
-    try:
-        from PIL import Image
-        p = Path(image_path)
-        if not p.exists():
-            return ""
-        with Image.open(p) as img:
-            if img.width > max_width:
-                ratio = max_width / img.width
-                new_h = max(1, int(img.height * ratio))
-                img = img.resize((max_width, new_h), Image.LANCZOS)
-            # Convertir en RGB pour éviter les problèmes de mode (RGBA, palette…)
-            if img.mode not in ("RGB", "L"):
-                img = img.convert("RGB")
-            buf = io.BytesIO()
-            fmt = "JPEG" if p.suffix.lower() in (".jpg", ".jpeg") else "PNG"
-            img.save(buf, format=fmt, optimize=True, quality=85)
-            b64 = base64.b64encode(buf.getvalue()).decode("ascii")
-            mime = "image/jpeg" if fmt == "JPEG" else "image/png"
-            return f"data:{mime};base64,{b64}"
-    except Exception:
-        return ""
-def _externalize_images_to_dir(
-    benchmark: "BenchmarkResult",
-    output_dir: Path,
-    max_width: int = 1200,
-    asset_subdir: str = "report-assets",
-) -> dict[str, str]:
-    """Sprint A5 (item M-16) — écrit les images sur disque dans un
-    sous-dossier à côté du HTML, et retourne ``{doc_id: url_relative}``.
-    Mode « lazy loading » : au lieu d'embarquer chaque image en
-    base64 dans le HTML (50 MB+ pour un corpus de 100 documents,
-    ~200 MB+ pour 1 000 documents), on les externalise en fichiers
-    PNG/JPEG locaux. Le HTML les référence via ``<img src="report-assets/…">``
-    avec ``loading="lazy"`` côté navigateur.
-    Le rapport reste auto-portant si l'utilisateur copie le dossier
-    ``report-assets/`` à côté du HTML (cf. CLI ``--lazy-images``).
-    Parameters
-    ----------
-    benchmark:
-        Résultat de benchmark (lit ``image_path`` de chaque DocumentResult).
-    output_dir:
-        Dossier où le HTML sera écrit ; le sous-dossier d'assets sera
-        créé à côté.
-    max_width:
-        Largeur max du redimensionnement (cohérent avec
-        ``_encode_image_b64``).
-    asset_subdir:
-        Nom du sous-dossier d'assets (défaut ``"report-assets"``).
-    Returns
-    -------
-    dict[str, str]
-        ``{doc_id: "report-assets/<doc_id>.png"}`` (URL relative
-        consommable directement dans un attribut HTML ``src``).
-    """
-    from PIL import Image
-    assets_dir = output_dir / asset_subdir
-    assets_dir.mkdir(parents=True, exist_ok=True)
-    out: dict[str, str] = {}
-    seen_ids: set[str] = set()
-    for engine_report in benchmark.engine_reports:
-        for dr in engine_report.document_results:
-            doc_id = dr.doc_id
-            if doc_id in seen_ids:
-                continue
-            seen_ids.add(doc_id)
-            try:
-                src = Path(dr.image_path)
-                if not src.exists():
-                    continue
-                # Nom de fichier dérivé du doc_id, normalisé sans
-                # caractères dangereux pour le filesystem.
-                safe_id = "".join(
-                    c if c.isalnum() or c in "._-" else "_" for c in doc_id
-                )
-                dest = assets_dir / f"{safe_id}{src.suffix.lower() or '.png'}"
-                with Image.open(src) as img:
-                    if img.width > max_width:
-                        ratio = max_width / img.width
-                        new_h = max(1, int(img.height * ratio))
-                        img = img.resize((max_width, new_h), Image.LANCZOS)
-                    if img.mode not in ("RGB", "L"):
-                        img = img.convert("RGB")
-                    fmt = "JPEG" if dest.suffix in (".jpg", ".jpeg") else "PNG"
-                    img.save(dest, format=fmt, optimize=True, quality=85)
-                # URL relative (POSIX style même sur Windows pour HTML).
-                out[doc_id] = f"{asset_subdir}/{dest.name}"
-            except Exception as exc:  # noqa: BLE001 — fallback silencieux + warning
-                logger.warning(
-                    "[report] échec d'externalisation de l'image %s : %s — "
-                    "le rapport ignorera cette image",
-                    dr.image_path,
-                    exc,
-                )
-    return out
-def _encode_images_b64_from_result(benchmark: "BenchmarkResult", max_width: int = 1200) -> dict[str, str]:
-    """Encode toutes les images d'un BenchmarkResult en base64.
-    Returns
-    -------
-    dict
-        ``{doc_id: data_uri}``
-    """
-    images: dict[str, str] = {}
-    if not benchmark.engine_reports:
-        return images
-    for dr in benchmark.engine_reports[0].document_results:
-        if dr.image_path and dr.doc_id not in images:
-            uri = _encode_image_b64(dr.image_path, max_width=max_width)
-            if uri:
-                images[dr.doc_id] = uri
-    return images
-def _cer_color(cer: float) -> str:
-    """Retourne une couleur CSS pour un score CER donné (0→vert, 1→rouge)."""
-    from picarones.report.colors import COLOR_GREEN, COLOR_YELLOW, COLOR_ORANGE, COLOR_RED
-    if cer < 0.05:
-        return COLOR_GREEN
-    if cer < 0.15:
-        return COLOR_YELLOW
-    if cer < 0.30:
-        return COLOR_ORANGE
-    return COLOR_RED
-def _cer_bg(cer: float) -> str:
-    from picarones.report.colors import BG_GREEN, BG_YELLOW, BG_ORANGE, BG_RED
-    if cer < 0.05:
-        return BG_GREEN
-    if cer < 0.15:
-        return BG_YELLOW
-    if cer < 0.30:
-        return BG_ORANGE
-    return BG_RED
-def _pct(v: Optional[float], decimals: int = 2) -> str:
-    if v is None:
-        return "—"
-    return f"{v * 100:.{decimals}f} %"
-def _safe(v: Optional[float], decimals: int = 4) -> float:
-    return round(v or 0.0, decimals)
-# ---------------------------------------------------------------------------
-# Préparation des données
-# ---------------------------------------------------------------------------
-def _build_report_data(benchmark: BenchmarkResult, images_b64: dict[str, str]) -> dict:
-    """Transforme un BenchmarkResult en dict JSON pour le rapport HTML."""
-    engines_summary = []
-    for report in benchmark.engine_reports:
-        agg = report.aggregated_metrics
-        diplo_agg = agg.get("cer_diplomatic", {})
-        entry: dict = {
-            "name": report.engine_name,
-            "version": report.engine_version,
-            "cer":  _safe(agg.get("cer", {}).get("mean")),
-            "wer":  _safe(agg.get("wer", {}).get("mean")),
-            "mer":  _safe(agg.get("mer", {}).get("mean")),
-            "wil":  _safe(agg.get("wil", {}).get("mean")),
-            "cer_median": _safe(agg.get("cer", {}).get("median")),
-            "cer_min":    _safe(agg.get("cer", {}).get("min")),
-            "cer_max":    _safe(agg.get("cer", {}).get("max")),
-            "doc_count":  agg.get("document_count", 0),
-            "failed":     agg.get("failed_count", 0),
-            # CER diplomatique (après normalisation historique : ſ=s, u=v, i=j…)
-            "cer_diplomatic": _safe(diplo_agg.get("mean")) if diplo_agg else None,
-            "cer_diplomatic_profile": diplo_agg.get("profile"),
-            # Distribution pour l'histogramme : liste des CER individuels
-            "cer_values": [
-                _safe(dr.metrics.cer)
-                for dr in report.document_results
-                if dr.metrics.error is None
-            ],
-            "cer_diplomatic_values": [
-                _safe(dr.metrics.cer_diplomatic)
-                for dr in report.document_results
-                if dr.metrics.error is None and dr.metrics.cer_diplomatic is not None
-            ],
-            # Champs pipeline OCR+LLM (vides pour les moteurs OCR seuls)
-            "is_pipeline": report.is_pipeline,
-            "pipeline_info": report.pipeline_info,
-            # Sprint 5 — métriques avancées patrimoniales
-            "ligature_score": _safe(report.ligature_score) if report.ligature_score is not None else None,
-            "diacritic_score": _safe(report.diacritic_score) if report.diacritic_score is not None else None,
-            "aggregated_confusion": report.aggregated_confusion,
-            "aggregated_taxonomy": report.aggregated_taxonomy,
-            "aggregated_structure": report.aggregated_structure,
-            "aggregated_image_quality": report.aggregated_image_quality,
-            # Sprint 10 — distribution des erreurs + hallucinations VLM
-            "gini": _safe(report.aggregated_line_metrics.get("gini_mean")) if report.aggregated_line_metrics else None,
-            "cer_p90": _safe(report.aggregated_line_metrics.get("percentiles", {}).get("p90")) if report.aggregated_line_metrics else None,
-            "cer_p99": _safe(report.aggregated_line_metrics.get("percentiles", {}).get("p99")) if report.aggregated_line_metrics else None,
-            "catastrophic_rate_30": _safe(report.aggregated_line_metrics.get("catastrophic_rate", {}).get("0.3")) if report.aggregated_line_metrics else None,
-            "aggregated_line_metrics": report.aggregated_line_metrics,
-            "anchor_score": _safe(report.aggregated_hallucination.get("anchor_score_mean")) if report.aggregated_hallucination else None,
-            "length_ratio": _safe(report.aggregated_hallucination.get("length_ratio_mean")) if report.aggregated_hallucination else None,
-            "hallucinating_doc_rate": _safe(report.aggregated_hallucination.get("hallucinating_doc_rate")) if report.aggregated_hallucination else None,
-            "aggregated_hallucination": report.aggregated_hallucination,
-            # Sprint 41 — NER agrégé (None si aucun calcul effectué)
-            "aggregated_ner": report.aggregated_ner,
-            # Sprint 43 — calibration agrégée (None si aucune confidence
-            # n'a été exposée par le moteur sur ce corpus)
-            "aggregated_calibration": report.aggregated_calibration,
-            # Sprint 62 — profil philologique agrégé (None si aucun
-            # signal philologique sur le corpus pour ce moteur)
-            "aggregated_philological": report.aggregated_philological,
-            # Sprint 86 — A.II.5 (recherchabilité fuzzy + séquences
-            # numériques). None si aucun document n'a de signal.
-            "aggregated_searchability": report.aggregated_searchability,
-            "aggregated_numerical_sequences": (
-                report.aggregated_numerical_sequences
-            ),
-            # Sprint 87 — A.II.2 (delta Flesch agrégé)
-            "aggregated_readability": report.aggregated_readability,
-            "is_vlm": report.pipeline_info.get("is_vlm", False) if report.pipeline_info else False,
-        }
-        engines_summary.append(entry)
-    # Documents (vue galerie + vue détail)
-    # On collecte tous les doc_ids depuis l'union de tous les moteurs,
-    # en préservant l'ordre d'apparition (premier moteur d'abord, puis compléments).
-    seen_doc_ids: set[str] = set()
-    doc_ids_ordered: list[str] = []
-    for report in benchmark.engine_reports:
-        for dr in report.document_results:
-            if dr.doc_id not in seen_doc_ids:
-                seen_doc_ids.add(dr.doc_id)
-                doc_ids_ordered.append(dr.doc_id)
-    # Index croisé : doc_id → {engine_name → DocumentResult}
-    doc_engine_map: dict[str, dict] = {did: {} for did in doc_ids_ordered}
-    for report in benchmark.engine_reports:
-        for dr in report.document_results:
-            doc_engine_map.setdefault(dr.doc_id, {})[report.engine_name] = dr
-    documents = []
-    for doc_id in doc_ids_ordered:
-        engine_results = []
-        gt = ""
-        image_path = ""
-        for engine_name in [r.engine_name for r in benchmark.engine_reports]:
-            dr = doc_engine_map[doc_id].get(engine_name)
-            if dr is None:
-                continue
-            gt = dr.ground_truth
-            image_path = dr.image_path
-            diff_ops = compute_char_diff(dr.ground_truth, dr.hypothesis)
-            er_entry: dict = {
-                "engine": engine_name,
-                "hypothesis": dr.hypothesis,
-                "cer": _safe(dr.metrics.cer),
-                "cer_diplomatic": _safe(dr.metrics.cer_diplomatic) if dr.metrics.cer_diplomatic is not None else None,
-                "wer": _safe(dr.metrics.wer),
-                "mer": _safe(dr.metrics.mer),
-                "wil": _safe(dr.metrics.wil),
-                "duration": dr.duration_seconds,
-                "error": dr.engine_error,
-                "diff": diff_ops,
-            }
-            # Champs spécifiques aux pipelines OCR+LLM
-            if dr.ocr_intermediate is not None:
-                er_entry["ocr_intermediate"] = dr.ocr_intermediate
-                er_entry["ocr_diff"] = compute_word_diff(dr.ground_truth, dr.ocr_intermediate)
-                er_entry["llm_correction_diff"] = compute_word_diff(dr.ocr_intermediate, dr.hypothesis)
-            if dr.pipeline_metadata:
-                on = dr.pipeline_metadata.get("over_normalization")
-                if on is not None:
-                    er_entry["over_normalization"] = on
-                er_entry["pipeline_mode"] = dr.pipeline_metadata.get("pipeline_mode")
-            # Sprint 5 — métriques avancées par document
-            if dr.char_scores is not None:
-                er_entry["ligature_score"] = _safe(dr.char_scores.get("ligature", {}).get("score"))
-                er_entry["diacritic_score"] = _safe(dr.char_scores.get("diacritic", {}).get("score"))
-            if dr.taxonomy is not None:
-                er_entry["taxonomy"] = dr.taxonomy
-            if dr.structure is not None:
-                er_entry["structure"] = dr.structure
-            if dr.image_quality is not None:
-                er_entry["image_quality"] = dr.image_quality
-            # Sprint 10
-            if dr.line_metrics is not None:
-                er_entry["line_metrics"] = dr.line_metrics
-            if dr.hallucination_metrics is not None:
-                er_entry["hallucination_metrics"] = dr.hallucination_metrics
-            engine_results.append(er_entry)
-        # CER moyen sur ce document (pour le badge galerie)
-        cer_values = [er["cer"] for er in engine_results if er["error"] is None]
-        mean_cer = sum(cer_values) / len(cer_values) if cer_values else 1.0
-        best_engine = min(engine_results, key=lambda x: x["cer"], default=None)
-        # Script type (depuis metadata par document si disponible)
-        script_type = ""
-        first_dr = doc_engine_map[doc_id].get(
-            benchmark.engine_reports[0].engine_name if benchmark.engine_reports else None
-        )
-        if first_dr and first_dr.image_quality:
-            script_type = first_dr.image_quality.get("script_type", "")
-        documents.append({
-            "doc_id": doc_id,
-            "image_path": image_path,
-            "image_b64": images_b64.get(doc_id, ""),
-            "ground_truth": gt,
-            "mean_cer": _safe(mean_cer),
-            "best_engine": best_engine["engine"] if best_engine else "",
-            "engine_results": engine_results,
-            "script_type": script_type,
-        })
-    # ── Sprint 7 — Score de difficulté intrinsèque ───────────────────────
-    gt_map = {d["doc_id"]: d["ground_truth"] for d in documents}
-    cer_map: dict[str, dict[str, float]] = {d["doc_id"]: {} for d in documents}
-    iq_map: dict[str, float] = {}
-    for report in benchmark.engine_reports:
-        for dr in report.document_results:
-            cer_map.setdefault(dr.doc_id, {})[report.engine_name] = _safe(dr.metrics.cer)
-            if dr.image_quality and "quality_score" in dr.image_quality:
-                iq_map[dr.doc_id] = dr.image_quality["quality_score"]
-    difficulty_scores = compute_all_difficulties(
-        doc_ids=doc_ids_ordered,
-        ground_truths=gt_map,
-        cer_map=cer_map,
-        image_quality_map=iq_map or None,
-    )
-    # Ajouter difficulty_score à chaque document
-    for doc in documents:
-        ds = difficulty_scores.get(doc["doc_id"])
-        if ds:
-            doc["difficulty_score"] = _safe(ds.score)
-            doc["difficulty_label"] = difficulty_label(ds.score)
-        else:
-            doc["difficulty_score"] = 0.5
-            doc["difficulty_label"] = "Modéré"
-    # ── Sprint 7 — Tests statistiques (Wilcoxon pairwise + bootstrap CI) ─
-    engine_cer_map_stats: dict[str, list[float]] = {}
-    for report in benchmark.engine_reports:
-        vals = [_safe(dr.metrics.cer) for dr in report.document_results if dr.metrics.error is None]
-        if vals:
-            engine_cer_map_stats[report.engine_name] = vals
-    pairwise_stats = compute_pairwise_stats(engine_cer_map_stats)
-    # ── Sprint 17: Friedman + Nemenyi ───────────────────────────────────
-    # Strict alignment on a single document order: the map is rebuilt from
-    # the documents common to all engines; otherwise the Friedman test is
-    # not applicable.
-    engine_cer_aligned: dict[str, list[float]] = {}
-    common_doc_ids: Optional[set[str]] = None
-    for report in benchmark.engine_reports:
-        doc_ids = {dr.doc_id for dr in report.document_results if dr.metrics.error is None}
-        common_doc_ids = doc_ids if common_doc_ids is None else common_doc_ids & doc_ids
-    if common_doc_ids:
-        ordered_common = [d for d in doc_ids_ordered if d in common_doc_ids]
-        for report in benchmark.engine_reports:
-            dr_by_id = {dr.doc_id: dr for dr in report.document_results}
-            engine_cer_aligned[report.engine_name] = [
-                _safe(dr_by_id[d].metrics.cer) for d in ordered_common
-            ]
-    friedman = friedman_test(engine_cer_aligned) if engine_cer_aligned else {
-        "statistic": 0.0, "p_value": 1.0, "significant": False,
-        "df": 0, "n_blocks": 0, "n_engines": 0, "mean_ranks": {},
-        "interpretation": "Test de Friedman non calculé — aucun document commun.",
-        "error": "no_common_documents",
-    }
-    nemenyi = nemenyi_posthoc(engine_cer_aligned) if engine_cer_aligned else {
-        "alpha": 0.05, "critical_distance": 0.0, "q_alpha": 0.0,
-        "n_blocks": 0, "n_engines": 0, "mean_ranks": {},
-        "engines_sorted": [], "significant_matrix": [], "tied_groups": [],
-        "error": "no_common_documents",
-    }
-    bootstrap_cis: list[dict] = []
-    for engine_name, vals in engine_cer_map_stats.items():
-        lo, hi = bootstrap_ci(vals)
-        mean_v = sum(vals) / len(vals) if vals else 0.0
-        bootstrap_cis.append({
-            "engine": engine_name,
-            "mean": _safe(mean_v),
-            "ci_lower": _safe(lo),
-            "ci_upper": _safe(hi),
-        })
-    # ── Sprint 7: reliability curves ─────────────────────────────────────
-    reliability_curves: list[dict] = []
-    for report in benchmark.engine_reports:
-        vals = [_safe(dr.metrics.cer) for dr in report.document_results if dr.metrics.error is None]
-        curve = compute_reliability_curve(vals)
-        reliability_curves.append({
-            "engine": report.engine_name,
-            "points": curve,
-        })
-    # ── Sprint 7: Venn of shared / exclusive errors ──────────────────────
-    # Build the per-engine error sets: {engine → set(doc_id:gt_tok:hyp_tok)}
-    venn_error_sets: dict[str, set[str]] = {}
-    for report in benchmark.engine_reports:
-        error_set: set[str] = set()
-        for dr in report.document_results:
-            ops = compute_word_diff(dr.ground_truth, dr.hypothesis)
-            for op in ops:
-                if op["op"] in ("replace", "delete", "insert"):
-                    key = f"{dr.doc_id}:{op.get('old', op.get('text',''))}:{op.get('new', op.get('text',''))}"
-                    error_set.add(key)
-        venn_error_sets[report.engine_name] = error_set
-    venn_data = compute_venn_data(venn_error_sets)
-    # ── Sprint 7: clustering of error patterns ───────────────────────────
-    error_data_all: list[dict] = []
-    for report in benchmark.engine_reports:
-        for dr in report.document_results:
-            error_data_all.append({
-                "engine": report.engine_name,
-                "gt": dr.ground_truth,
-                "hypothesis": dr.hypothesis,
-            })
-    error_clusters_raw = cluster_errors(error_data_all, max_clusters=8)
-    error_clusters = [c.as_dict() for c in error_clusters_raw]
-    # ── Sprint 7: correlation matrix ─────────────────────────────────────
-    # For each engine: a list of per-document metric dicts
-    correlation_per_engine: list[dict] = []
-    for report in benchmark.engine_reports:
-        metrics_list = []
-        for dr in report.document_results:
-            if dr.metrics.error is not None:
-                continue
-            entry: dict[str, float] = {
-                "cer": _safe(dr.metrics.cer),
-                "wer": _safe(dr.metrics.wer),
-                "mer": _safe(dr.metrics.mer),
-                "wil": _safe(dr.metrics.wil),
-            }
-            if dr.image_quality:
-                entry["quality_score"] = _safe(dr.image_quality.get("quality_score", 0.5))
-                entry["sharpness"] = _safe(dr.image_quality.get("sharpness_score", 0.5))
-            if dr.char_scores:
-                entry["ligature"] = _safe(dr.char_scores.get("ligature", {}).get("score", 0.5))
-                entry["diacritic"] = _safe(dr.char_scores.get("diacritic", {}).get("score", 0.5))
-            metrics_list.append(entry)
-        if metrics_list:
-            corr = compute_correlation_matrix(metrics_list)
-            correlation_per_engine.append({
-                "engine": report.engine_name,
-                **corr,
-            })
-    # ── Sprint 10: scatter-plot data ─────────────────────────────────────
-    # Scatter 1: Gini vs. mean CER (engines)
-    gini_vs_cer = []
-    for report in benchmark.engine_reports:
-        gini_val = report.aggregated_line_metrics.get("gini_mean") if report.aggregated_line_metrics else None
-        cer_val = report.mean_cer
-        if gini_val is not None and cer_val is not None:
-            gini_vs_cer.append({
-                "engine": report.engine_name,
-                "cer": _safe(cer_val),
-                "gini": _safe(gini_val),
-                "is_pipeline": report.is_pipeline,
-            })
-    # ── Sprint 19: costs and Pareto front ───────────────────────────────
-    # Mean duration measured per engine on the current benchmark (s/page)
-    durations_by_engine: dict[str, float] = {}
-    for report in benchmark.engine_reports:
-        durs = [dr.duration_seconds for dr in report.document_results
-                if dr.duration_seconds is not None]
-        if durs:
-            durations_by_engine[report.engine_name] = sum(durs) / len(durs)
-    pricing_defaults, _ = load_pricing_database()
-    costs_by_engine = build_costs_for_benchmark(
-        engines_summary, durations_by_engine,
-    )
-    # Annotate each engine summary with its cost and duration
-    for entry in engines_summary:
-        name = entry["name"]
-        entry["mean_duration_seconds"] = round(durations_by_engine.get(name, 0.0), 4) \
-            if name in durations_by_engine else None
-        entry["cost"] = costs_by_engine.get(name)
-    # Pareto front on (mean CER, cost in € / 1000 pages), for engines that have both
-    pareto_points = []
-    for entry in engines_summary:
-        cer = entry.get("cer")
-        cost = (entry.get("cost") or {}).get("cost_per_1k_pages_eur")
-        if cer is None or cost is None:
-            continue
-        pareto_points.append({"engine": entry["name"], "cer": cer, "cost": cost})
-    pareto_front_engines = compute_pareto_front(
-        pareto_points, objectives=("cer", "cost"),
-    )
-    # Secondary Pareto front (CER, speed) for the "speed" toggle
-    pareto_speed_points = []
-    for entry in engines_summary:
-        cer = entry.get("cer")
-        dur = entry.get("mean_duration_seconds")
-        if cer is None or dur is None:
-            continue
-        pareto_speed_points.append({"engine": entry["name"], "cer": cer, "dur": dur})
-    pareto_front_speed = compute_pareto_front(
-        pareto_speed_points, objectives=("cer", "dur"),
-    )
-    # Carbon Pareto front (CER, g CO2 / 1000 pages), labelled experimental
-    pareto_co2_points = []
-    for entry in engines_summary:
-        cer = entry.get("cer")
-        co2 = (entry.get("cost") or {}).get("co2_per_1k_pages_g")
-        if cer is None or co2 is None:
-            continue
-        pareto_co2_points.append({"engine": entry["name"], "cer": cer, "co2": co2})
-    pareto_front_co2 = compute_pareto_front(
-        pareto_co2_points, objectives=("cer", "co2"),
-    )
-    pareto_data = {
-        "cost": {
-            "points": pareto_points,
-            "front": pareto_front_engines,
-            "axis_label": "Coût (€ / 1000 pages)",
-        },
-        "speed": {
-            "points": pareto_speed_points,
-            "front": pareto_front_speed,
-            "axis_label": "Temps moyen (s / page)",
-        },
-        "co2": {
-            "points": pareto_co2_points,
-            "front": pareto_front_co2,
-            "axis_label": "Empreinte carbone (g CO₂ / 1000 pages, expérimental)",
-        },
-        "pricing_meta": {
-            "last_updated": pricing_defaults.last_updated,
-            "currency": pricing_defaults.currency,
-            "hourly_rate_local_cpu_eur": pricing_defaults.hourly_rate_local_cpu_eur,
-            "hourly_rate_local_gpu_eur": pricing_defaults.hourly_rate_local_gpu_eur,
-            "grid_intensity_local": pricing_defaults.grid_intensity_local,
-            "grid_intensity_cloud": pricing_defaults.grid_intensity_cloud,
-        },
-    }
-    # Scatter 2: length ratio vs. anchor score (engines)
-    ratio_vs_anchor = []
-    for report in benchmark.engine_reports:
-        if report.aggregated_hallucination:
-            ratio_vs_anchor.append({
-                "engine": report.engine_name,
-                "length_ratio": _safe(report.aggregated_hallucination.get("length_ratio_mean", 1.0)),
-                "anchor_score": _safe(report.aggregated_hallucination.get("anchor_score_mean", 1.0)),
-                "hallucinating_rate": _safe(report.aggregated_hallucination.get("hallucinating_doc_rate", 0.0)),
-                "is_vlm": report.pipeline_info.get("is_vlm", False) if report.pipeline_info else False,
-            })
-    return {
-        "meta": {
-            "corpus_name": benchmark.corpus_name,
-            "corpus_source": benchmark.corpus_source,
-            "document_count": benchmark.document_count,
-            "run_date": benchmark.run_date,
-            "picarones_version": benchmark.picarones_version,
-            "metadata": benchmark.metadata,
-        },
-        "ranking": benchmark.ranking(),
-        "engines": engines_summary,
-        "documents": documents,
-        # Sprint 7
-        "statistics": {
-            "pairwise_wilcoxon": pairwise_stats,
-            "bootstrap_cis": bootstrap_cis,
-            # Sprint 17: multi-engine Friedman + Nemenyi post-hoc + CDD
-            "friedman": friedman,
-            "nemenyi": nemenyi,
-        },
-        "reliability_curves": reliability_curves,
-        "venn_data": venn_data,
-        "error_clusters": error_clusters,
-        "correlation_per_engine": correlation_per_engine,
-        # Sprint 10
-        "gini_vs_cer": gini_vs_cer,
-        "ratio_vs_anchor": ratio_vs_anchor,
-        # Sprint 19: cost/quality Pareto view with axis variants
-        "pareto": pareto_data,
-        # Sprint 36: inter-engine analysis (taxonomic divergence +
-        # complementarity / oracle).  ``None`` with fewer than 2 engines.
-        "inter_engine_analysis": benchmark.inter_engine_analysis,
-        # Sprint 45-46: stratification by script_type
-        "available_strata": benchmark.available_strata(),
-        "stratified_ranking": benchmark.stratified_ranking() or None,
-        "corpus_homogeneity": benchmark.corpus_homogeneity(),
-    }
 # ---------------------------------------------------------------------------
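
The removed body above calls `compute_pareto_front(points, objectives=("cer", "cost"))` three times (cost, speed, CO₂) without showing what the helper does. A minimal sketch of a non-dominated filter follows, assuming both objectives are minimised and that the helper returns engine names; the name `pareto_front_sketch` and that return shape are illustrative assumptions, not the actual code in `picarones.measurements.statistics.pareto`.

```python
from typing import Sequence


def pareto_front_sketch(
    points: Sequence[dict],
    objectives: tuple[str, ...] = ("cer", "cost"),
) -> list[str]:
    """Return the engines that no other point dominates.

    Hypothetical stand-in: every objective is treated as "lower is better".
    """
    front: list[str] = []
    for p in points:
        # q dominates p when q is no worse on every objective
        # and strictly better on at least one.
        dominated = any(
            all(q[o] <= p[o] for o in objectives)
            and any(q[o] < p[o] for o in objectives)
            for q in points
        )
        if not dominated:
            front.append(p["engine"])
    return front


points = [
    {"engine": "A", "cer": 0.05, "cost": 10.0},
    {"engine": "B", "cer": 0.08, "cost": 2.0},
    {"engine": "C", "cer": 0.09, "cost": 5.0},  # worse than B on both axes
]
print(pareto_front_sketch(points))  # ['A', 'B']
```

The quadratic scan is adequate for a handful of engines; the same function serves the speed and CO₂ variants by passing `objectives=("cer", "dur")` or `objectives=("cer", "co2")`.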
@@ -691,8 +80,8 @@ def _build_jinja_env():
     Autoescape is disabled: the behaviour matches that of the historical
     ``_HTML_TEMPLATE.format()``. The injected variables (embedded JSON,
     generated SVG, narrative synthesis from internal templates) are
-    all produced by Picarones code and do not require
-    HTML escaping.
     """
     from jinja2 import Environment, FileSystemLoader
     env = Environment(
@@ -834,174 +223,188 @@ class ReportGenerator:
         glossary = load_glossary(self.lang)
         glossary_json = json.dumps(glossary, ensure_ascii=False, separators=(",", ":"))
-        # Sprint 37: inter-engine section (divergence matrix + oracle),
-        # rendered server-side. Empty with fewer than 2 engines or no taxonomy.
         from picarones.report.inter_engine_render import (
             build_divergence_matrix_html,
             build_oracle_gap_html,
         )
-        divergence_matrix_html = build_divergence_matrix_html(
-            report_data.get("inter_engine_analysis"),
-            labels=labels,
-        )
-        oracle_gap_html = build_oracle_gap_html(
-            report_data.get("inter_engine_analysis"),
-            labels=labels,
-        )
-        # Sprint 41: NER section (per-engine F1 summary + per-category
-        # heatmap). Empty if no engine has an aggregated_ner.
         from picarones.report.ner_render import (
             build_ner_per_category_html,
             build_ner_summary_html,
         )
-        ner_summary_html = build_ner_summary_html(
-            report_data.get("engines", []),
-            labels=labels,
-        )
-        ner_per_category_html = build_ner_per_category_html(
-            report_data.get("engines", []),
-            labels=labels,
-        )
         # Sprint 43: calibration section (ECE/MCE table + grid of
-        # per-engine reliability diagrams). Empty if no engine has
-        # an aggregated_calibration.
         from picarones.report.calibration_render import (
             build_calibration_summary_html,
             build_reliability_diagrams_grid_html,
         )
-        calibration_summary_html = build_calibration_summary_html(
-            report_data.get("engines", []),
-            labels=labels,
-        )
-        reliability_diagrams_html = build_reliability_diagrams_grid_html(
-            report_data.get("engines", []),
-            labels=labels,
-        )
-        # Sprint 46: stratified section (per-stratum table). Empty if
-        # no stratum is available.
         from picarones.report.stratification_render import (
             build_stratified_ranking_html,
         )
-        stratified_ranking_html = build_stratified_ranking_html(
-            report_data.get("stratified_ranking"),
-            report_data.get("available_strata"),
-            report_data.get("corpus_homogeneity"),
-            labels=labels,
-        )
-        # Sprint 62: philological profile (6 adaptive sections over the
-        # philological modules from Sprints 55-60). Empty if no engine
-        # has an aggregated_philological.
         from picarones.report.philological_render import (
             build_philological_profile_html,
         )
-        philological_profile_html = build_philological_profile_html(
-            report_data.get("engines", []),
-            labels=labels,
-        )
-        # Sprint 86, A.II.5: fuzzy searchability +
-        # numerical sequences. Adaptive: "" if there is no signal.
         from picarones.report.searchability_render import (
             build_searchability_summary_html,
         )
         from picarones.report.numerical_sequences_render import (
             build_numerical_sequences_html,
         )
-        searchability_html = build_searchability_summary_html(
-            report_data.get("engines", []), labels=labels,
-        )
-        numerical_sequences_html = build_numerical_sequences_html(
-            report_data.get("engines", []), labels=labels,
-        )
         # Sprint 87, A.II.2: readability (Flesch delta).
-        # Adaptive: "" if no engine has a signal.
         from picarones.report.readability_render import (
             build_readability_summary_html,
         )
-        readability_html = build_readability_summary_html(
-            report_data.get("engines", []), labels=labels,
-        )
         # Sprint 89, A.II.8b: inter-engine specialization.
-        # Adaptive: "" with fewer than 2 engines that have a taxonomy.
         from picarones.report.specialization_render import (
             build_specialization_html,
         )
-        # Build a {engine: counts} map from the
-        # ``aggregated_taxonomy`` entries; an engine without a taxonomy
-        # is excluded.
-        _taxos: dict = {}
-        for eng in report_data.get("engines", []):
-            tax = eng.get("aggregated_taxonomy")
-            if isinstance(tax, dict):
-                counts = tax.get("counts") if "counts" in tax else tax
-                if isinstance(counts, dict) and counts:
-                    _taxos[eng.get("name", "?")] = {
-                        k: float(v) for k, v in counts.items()
-                        if isinstance(v, (int, float))
-                    }
-        specialization_html = build_specialization_html(
-            _taxos, labels=labels,
-        )
-        # Chantier 3 (post-Sprint 97): 3 new thematic views
-        # that group the orphaned renderers into collapsible
-        # sections. Adaptive: returns "" if no subsection
-        # has a signal, so the template card is hidden.
         from picarones.report.views import (
             build_advanced_taxonomy_view_html,
             build_diagnostics_view_html,
             build_economics_view_html,
         )
-        economics_view_html = build_economics_view_html(
-            report_data, labels=labels,
-            engine_reports=self.benchmark.engine_reports,
         )
-        advanced_taxonomy_view_html = build_advanced_taxonomy_view_html(
-            report_data, labels=labels,
         )
-        diagnostics_view_html = build_diagnostics_view_html(
-            report_data, labels=labels,
         )
-        env = _build_jinja_env()
-        template = env.get_template("base.html.j2")
-        html = template.render(
-            corpus_name=self.benchmark.corpus_name,
-            picarones_version=self.benchmark.picarones_version,
-            report_data_json=report_json,
-            i18n_json=i18n_json,
-            html_lang=labels.get("html_lang", "fr"),
-            chartjs_inline=chartjs_js,
-            critical_difference_svg=cdd_svg,
-            friedman=report_data.get("statistics", {}).get("friedman", {}),
-            synthesis=synthesis,
-            glossary_json=glossary_json,
-            divergence_matrix_html=divergence_matrix_html,
-            oracle_gap_html=oracle_gap_html,
-            ner_summary_html=ner_summary_html,
-            ner_per_category_html=ner_per_category_html,
-            calibration_summary_html=calibration_summary_html,
-            reliability_diagrams_html=reliability_diagrams_html,
-            stratified_ranking_html=stratified_ranking_html,
-            philological_profile_html=philological_profile_html,
-            searchability_html=searchability_html,
-            numerical_sequences_html=numerical_sequences_html,
-            readability_html=readability_html,
-            specialization_html=specialization_html,
-            # Chantier 3: composed thematic views
-            economics_view_html=economics_view_html,
-            advanced_taxonomy_view_html=advanced_taxonomy_view_html,
-            diagnostics_view_html=diagnostics_view_html,
         )
-        output_path.write_text(html, encoding="utf-8")
-        return output_path.resolve()
     @classmethod
     def from_json(cls, json_path: str | Path, **kwargs) -> "ReportGenerator":

 2. Galerie     — grille d'images avec badge CER coloré
 3. Document    — image zoomable + diff coloré GT / OCR par moteur
 4. Analyses    — histogramme CER + graphique radar
+Architecture
+------------
+This module is the **orchestrator**. The heavy responsibilities are
+split into submodules:
+- :mod:`picarones.report.assets`: vendor.js loading, base64 image
+  encoding, lazy externalization.
+- :mod:`picarones.report.report_data`: construction of the JSON dict
+  passed to the template (engines, documents, statistics, Pareto, etc.).
+- :mod:`picarones.report.render_helpers`: shared colours / SVG.
+Backward compatibility
+----------------------
+Two historical names are **still imported by tests** under
+their ``_`` prefix and must be preserved:
+- ``_build_report_data`` (imported by 14 test files).
+- ``_cer_color`` (imported by ``tests/report/test_report.py``).
+The other names ``_pct``, ``_safe``, ``_cer_bg``, ``_encode_image_b64``,
+``_encode_images_b64_from_result``, ``_externalize_images_to_dir``,
+``_load_vendor_js`` are either used internally (the last 3,
+see :meth:`ReportGenerator.generate`) or reachable under their
+canonical name in :mod:`picarones.report.assets` or
+:mod:`picarones.report.render_helpers`.
 """
 from __future__ import annotations
 import json
 import logging
 from pathlib import Path
 from typing import Any, Optional
 from picarones.core.results import BenchmarkResult
+from picarones.measurements.statistics import build_critical_difference_svg
+from picarones.report.assets import (
+    encode_images_b64_from_result as _encode_images_b64_from_result,
+    externalize_images_to_dir as _externalize_images_to_dir,
+    load_vendor_js as _load_vendor_js,
 )
+# Backward-compat re-exports consumed by external tests (see the module
+# docstring). The end-of-line directive documents the re-export
+# intent and keeps ruff from flagging the import as unused.
+from picarones.report.render_helpers import cer_step_color as _cer_color  # noqa: F401
+from picarones.report.report_data import build_report_data as _build_report_data  # noqa: F401
+logger = logging.getLogger(__name__)
 # ---------------------------------------------------------------------------
     Autoescape is disabled: the behaviour matches that of the historical
     ``_HTML_TEMPLATE.format()``. The injected variables (embedded JSON,
     generated SVG, narrative synthesis from internal templates) are
+    all produced by Picarones code and do not
+    require HTML escaping.
     """
     from jinja2 import Environment, FileSystemLoader
     env = Environment(
         glossary = load_glossary(self.lang)
         glossary_json = json.dumps(glossary, ensure_ascii=False, separators=(",", ":"))
+        section_html = self._build_section_html(report_data, labels)
+        env = _build_jinja_env()
+        template = env.get_template("base.html.j2")
+        html = template.render(
+            corpus_name=self.benchmark.corpus_name,
+            picarones_version=self.benchmark.picarones_version,
+            report_data_json=report_json,
+            i18n_json=i18n_json,
+            html_lang=labels.get("html_lang", "fr"),
+            chartjs_inline=chartjs_js,
+            critical_difference_svg=cdd_svg,
+            friedman=report_data.get("statistics", {}).get("friedman", {}),
+            synthesis=synthesis,
+            glossary_json=glossary_json,
+            **section_html,
+        )
+        output_path.write_text(html, encoding="utf-8")
+        return output_path.resolve()
+    def _build_section_html(
+        self, report_data: dict, labels: dict[str, str],
+    ) -> dict[str, str]:
+        """Build all the conditional HTML sections of the report.
+        Each renderer (NER, calibration, philology, etc.) is called
+        independently. A section returns ``""`` if no
+        engine has a signal for it; the template handles the conditional
+        display.
+        Returns
+        -------
+        dict[str, str]
+            Map ``{section_name: html}`` to splat into
+            ``template.render(**section_html)``.
+        """
+        engines = report_data.get("engines", [])
+        # Sprint 37: inter-engine section (divergence matrix + oracle).
         from picarones.report.inter_engine_render import (
             build_divergence_matrix_html,
             build_oracle_gap_html,
         )
+        # Sprint 41: NER section (per-engine F1 summary + per-category heatmap).
         from picarones.report.ner_render import (
             build_ner_per_category_html,
             build_ner_summary_html,
         )
         # Sprint 43: calibration section (ECE/MCE table + grid of
+        # per-engine reliability diagrams).
         from picarones.report.calibration_render import (
             build_calibration_summary_html,
             build_reliability_diagrams_grid_html,
         )
+        # Sprint 46: stratified section (per-stratum table).
         from picarones.report.stratification_render import (
             build_stratified_ranking_html,
         )
+        # Sprint 62: philological profile (6 adaptive sections).
         from picarones.report.philological_render import (
             build_philological_profile_html,
         )
+        # Sprint 86, A.II.5: fuzzy searchability + numerical sequences.
         from picarones.report.searchability_render import (
             build_searchability_summary_html,
         )
         from picarones.report.numerical_sequences_render import (
             build_numerical_sequences_html,
         )
         # Sprint 87, A.II.2: readability (Flesch delta).
         from picarones.report.readability_render import (
             build_readability_summary_html,
         )
         # Sprint 89, A.II.8b: inter-engine specialization.
         from picarones.report.specialization_render import (
             build_specialization_html,
         )
+        # Chantier 3 (post-Sprint 97): 3 composed thematic views.
         from picarones.report.views import (
             build_advanced_taxonomy_view_html,
             build_diagnostics_view_html,
             build_economics_view_html,
         )
+        # "Wiring of the test-only modules" sprint (May 2026): sections
+        # that consume the new metrics computed in
+        # ``report_data.extra_metrics``.
+        from picarones.report.marginal_cost_render import (
+            build_marginal_cost_html,
         )
+        from picarones.report.rare_token_recall_render import (
+            build_rare_token_recall_html,
         )
+        from picarones.report.taxonomy_cooccurrence_render import (
+            build_taxonomy_cooccurrence_html,
         )
+        from picarones.report.taxonomy_intra_doc_render import (
+            build_taxonomy_intra_doc_html,
         )
+        # Specialization: build a {engine: counts} map from the
+        # ``aggregated_taxonomy`` entries; an engine without a taxonomy is excluded.
+        taxos: dict = {}
+        for eng in engines:
+            tax = eng.get("aggregated_taxonomy")
+            if isinstance(tax, dict):
+                counts = tax.get("counts") if "counts" in tax else tax
+                if isinstance(counts, dict) and counts:
+                    taxos[eng.get("name", "?")] = {
+                        k: float(v) for k, v in counts.items()
+                        if isinstance(v, (int, float))
+                    }
+        return {
+            # Sprint 37
+            "divergence_matrix_html": build_divergence_matrix_html(
+                report_data.get("inter_engine_analysis"), labels=labels,
+            ),
+            "oracle_gap_html": build_oracle_gap_html(
+                report_data.get("inter_engine_analysis"), labels=labels,
+            ),
+            # Sprint 41
+            "ner_summary_html": build_ner_summary_html(engines, labels=labels),
+            "ner_per_category_html": build_ner_per_category_html(engines, labels=labels),
+            # Sprint 43
+            "calibration_summary_html": build_calibration_summary_html(
+                engines, labels=labels,
+            ),
+            "reliability_diagrams_html": build_reliability_diagrams_grid_html(
+                engines, labels=labels,
+            ),
+            # Sprint 46
+            "stratified_ranking_html": build_stratified_ranking_html(
+                report_data.get("stratified_ranking"),
+                report_data.get("available_strata"),
+                report_data.get("corpus_homogeneity"),
+                labels=labels,
+            ),
+            # Sprint 62
+            "philological_profile_html": build_philological_profile_html(
+                engines, labels=labels,
+            ),
+            # Sprint 86
+            "searchability_html": build_searchability_summary_html(
+                engines, labels=labels,
+            ),
+            "numerical_sequences_html": build_numerical_sequences_html(
+                engines, labels=labels,
+            ),
+            # Sprint 87
+            "readability_html": build_readability_summary_html(
+                engines, labels=labels,
+            ),
+            # Sprint 89
+            "specialization_html": build_specialization_html(taxos, labels=labels),
+            # Chantier 3: composed thematic views
+            "economics_view_html": build_economics_view_html(
+                report_data, labels=labels,
+                engine_reports=self.benchmark.engine_reports,
+            ),
+            "advanced_taxonomy_view_html": build_advanced_taxonomy_view_html(
+                report_data, labels=labels,
+            ),
+            "diagnostics_view_html": build_diagnostics_view_html(
+                report_data, labels=labels,
+            ),
+            # "Wiring of the test-only modules" sprint (May 2026):
+            # 4 new sections for the modules wired into
+            # ``report_data.extra_metrics``. Adaptive: "" if there is no signal.
+            "taxonomy_cooccurrence_html": build_taxonomy_cooccurrence_html(
+                report_data.get("taxonomy_cooccurrence"), labels=labels,
+            ),
+            "taxonomy_intra_doc_html": build_taxonomy_intra_doc_html(
+                report_data.get("taxonomy_intra_doc"), labels=labels,
+            ),
+            "rare_token_recall_html": build_rare_token_recall_html(
+                report_data.get("rare_token_recall"), labels=labels,
+            ),
+            "marginal_cost_html": build_marginal_cost_html(
+                report_data.get("marginal_cost"), labels=labels,
+            ),
+        }
     @classmethod
     def from_json(cls, json_path: str | Path, **kwargs) -> "ReportGenerator":
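
The refactor above replaces some twenty named keyword arguments to `template.render(...)` with a single dict returned by `_build_section_html` and splatted as `**section_html`. The pattern can be sketched with the standard library alone; `build_sections`, `render` and `PAGE` below are hypothetical names, and `string.Template` stands in for the Jinja2 template so the example runs without third-party packages.

```python
from html import escape
from string import Template

# Stand-in for base.html.j2: one placeholder per section key.
PAGE = Template("<h1>$corpus_name</h1>$ner_summary_html$readability_html")


def build_sections(engines: list[dict]) -> dict[str, str]:
    """Hypothetical analogue of `_build_section_html`.

    Each section is "" when no engine carries a signal for it, so the
    page simply shows nothing in that slot.
    """
    ner_names = [e["name"] for e in engines if e.get("aggregated_ner")]
    ner_html = ""
    if ner_names:
        items = "".join(f"<li>{escape(n)}</li>" for n in ner_names)
        ner_html = f"<ul>{items}</ul>"
    return {
        "ner_summary_html": ner_html,
        "readability_html": "",  # no readability signal in this toy data
    }


def render(corpus_name: str, **section_html: str) -> str:
    # Adding a section means adding one dict key, not one more parameter.
    return PAGE.substitute(corpus_name=escape(corpus_name), **section_html)


engines = [{"name": "tess", "aggregated_ner": {"f1": 0.9}}, {"name": "other"}]
print(render("demo", **build_sections(engines)))
# <h1>demo</h1><ul><li>tess</li></ul>
```

One consequence of this design is that the dict keys and the template placeholders must stay in sync: `Template.substitute` raises `KeyError` on a missing key, whereas Jinja2's default `Undefined` renders a missing variable as an empty string.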

picarones/report/image_predictive_render.py CHANGED

@@ -36,21 +36,7 @@ from __future__ import annotations
 from html import escape as _e
 from typing import Optional
-def _color_for_score(score: float) -> str:
-    """Green (low) → orange → red (high)."""
-    f = max(0.0, min(1.0, score))
-    if f < 0.5:
-        t = f / 0.5
-        r = int(167 + (235 - 167) * t)
-        g = int(240 + (180 - 240) * t)
-        b = int(167 + (60 - 167) * t)
-    else:
-        t = (f - 0.5) / 0.5
-        r = int(235 + (220 - 235) * t)
-        g = int(180 + (50 - 180) * t)
-        b = int(60 + (50 - 60) * t)
-    return f"#{r:02x}{g:02x}{b:02x}"
 _FEATURE_LABEL_KEYS = {
@@ -79,7 +65,7 @@ def _render_complexity_block(
     mx = float(aggregated.get("complexity_max") or 0.0)
     sd = float(aggregated.get("complexity_stdev") or 0.0)
     n_docs = int(aggregated.get("n_docs") or 0)
-    color_mean = _color_for_score(mean)
     return (
         f'<div style="font-weight:600;margin:.4rem 0 .3rem 0">'
         f'{_e(h_complex)}</div>'
@@ -130,7 +116,7 @@ def _render_homogeneity_block(
         "imgpred_feat_norm", "Contribution normalisée",
     )
     score = float(homogeneity.get("score") or 0.0)
-    color = _color_for_score(score)
     parts = [
         f'<div style="font-weight:600;margin:.4rem 0 .3rem 0">'
         f'{_e(h_homo)} : '
@@ -157,7 +143,7 @@ def _render_homogeneity_block(
         feat_mean = float(slot.get("mean") or 0.0)
         feat_stdev = float(slot.get("stdev") or 0.0)
         feat_norm = float(slot.get("normalised") or 0.0)
-        norm_color = _color_for_score(feat_norm)
         parts.append(
             f'<tr>'
             f'<td style="padding:.4rem .6rem">{_e(feat_label)}</td>'

 from html import escape as _e
 from typing import Optional
+from picarones.report.render_helpers import color_traffic_light
 _FEATURE_LABEL_KEYS = {
     mx = float(aggregated.get("complexity_max") or 0.0)
     sd = float(aggregated.get("complexity_stdev") or 0.0)
     n_docs = int(aggregated.get("n_docs") or 0)
+    color_mean = color_traffic_light(mean, low_is_good=True)
     return (
         f'<div style="font-weight:600;margin:.4rem 0 .3rem 0">'
         f'{_e(h_complex)}</div>'
         "imgpred_feat_norm", "Contribution normalisée",
     )
     score = float(homogeneity.get("score") or 0.0)
+    color = color_traffic_light(score, low_is_good=True)
     parts = [
         f'<div style="font-weight:600;margin:.4rem 0 .3rem 0">'
         f'{_e(h_homo)} : '
         feat_mean = float(slot.get("mean") or 0.0)
         feat_stdev = float(slot.get("stdev") or 0.0)
         feat_norm = float(slot.get("normalised") or 0.0)
+        norm_color = color_traffic_light(feat_norm, low_is_good=True)
         parts.append(
             f'<tr>'
             f'<td style="padding:.4rem .6rem">{_e(feat_label)}</td>'
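
Reviewer note: `render_helpers.py` is not shown in this part of the diff, so the body of `color_traffic_light` is unknown here. The sketch below is a guess at its behaviour, reconstructed from the keyword arguments used at the call sites (`low_is_good`, `scale_min`, `scale_max`) and from the green → orange → red anchors of the removed `_color_for_score`. The defaults and the palette are assumptions; the real helper may differ.

```python
# Hypothetical reimplementation, NOT the code from render_helpers.py.
# Anchor colours copied from the removed _color_for_score helper.
_GREEN, _ORANGE, _RED = (167, 240, 167), (235, 180, 60), (220, 50, 50)


def _lerp_rgb(a, b, t):
    """Linear interpolation between two RGB triples, truncated to int."""
    return tuple(int(x + (y - x) * t) for x, y in zip(a, b))


def color_traffic_light(value, low_is_good=False, scale_min=0.0, scale_max=1.0):
    """Map value on [scale_min, scale_max] to a green/orange/red hex colour."""
    if scale_max == scale_min:
        rel = 0.0
    else:
        rel = (value - scale_min) / (scale_max - scale_min)
    rel = max(0.0, min(1.0, rel))
    # "badness" is 0.0 for the best value and 1.0 for the worst one.
    badness = rel if low_is_good else 1.0 - rel
    if badness < 0.5:
        rgb = _lerp_rgb(_GREEN, _ORANGE, badness / 0.5)
    else:
        rgb = _lerp_rgb(_ORANGE, _RED, (badness - 0.5) / 0.5)
    return "#{:02x}{:02x}{:02x}".format(*rgb)
```

Under these assumptions `low_is_good=True` reproduces the removed helper exactly, and `color_traffic_light(1.0)` with the defaults returns the light green `#a7f0a7`.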

picarones/report/incremental_comparison_render.py CHANGED

@@ -41,28 +41,26 @@ from __future__ import annotations
 from html import escape as _e
 from typing import Optional
-def _color_for_score(
     score: float, low: float, high: float, higher_is_better: bool,
 ) -> str:
-    """Green (best) → orange → red (worst)."""
     if high == low:
-        return "#a7f0a7"
-    rel = (score - low) / (high - low)
-    if higher_is_better:
-        rel = 1.0 - rel
-    rel = max(0.0, min(1.0, rel))
-    if rel < 0.5:
-        t = rel / 0.5
-        r = int(167 + (235 - 167) * t)
-        g = int(240 + (180 - 240) * t)
-        b = int(167 + (60 - 167) * t)
-    else:
-        t = (rel - 0.5) / 0.5
-        r = int(235 + (220 - 235) * t)
-        g = int(180 + (50 - 180) * t)
-        b = int(60 + (50 - 60) * t)
-    return f"#{r:02x}{g:02x}{b:02x}"
 def _format_score(value: Optional[float]) -> str:
@@ -160,7 +158,7 @@ def build_incremental_comparison_html(
         rank = d.get("mean_rank")
         n_obs = int(d.get("n_observations") or 0)
         if isinstance(mean, (int, float)):
-            color = _color_for_score(
                 float(mean), low, high, higher_is_better,
             )
             mean_cell = (

 from html import escape as _e
 from typing import Optional
+from picarones.report.render_helpers import color_traffic_light
+def _bg_for_relative_score(
     score: float, low: float, high: float, higher_is_better: bool,
 ) -> str:
+    """Map ``score`` onto the range [low, high] and return a traffic-light
+    cell colour.
+    If ``higher_is_better=True``, ``score=high`` is green; otherwise
+    ``score=low`` is green.
+    """
     if high == low:
+        return color_traffic_light(1.0)  # neutral light green
+    return color_traffic_light(
+        score,
+        low_is_good=not higher_is_better,
+        scale_min=low,
+        scale_max=high,
+    )
 def _format_score(value: Optional[float]) -> str:
         rank = d.get("mean_rank")
         n_obs = int(d.get("n_observations") or 0)
         if isinstance(mean, (int, float)):
+            color = _bg_for_relative_score(
                 float(mean), low, high, higher_is_better,
             )
             mean_cell = (

picarones/report/inter_engine_render.py CHANGED

@@ -21,20 +21,10 @@ from __future__ import annotations
 from html import escape as _e
 from typing import Optional
-def _color_for(value: float, vmax: float) -> str:
-    """White → red gradient proportional to ``value/vmax``.
-    Returns a CSS hex colour.  ``vmax = 0`` → white.
-    """
-    if vmax <= 0:
-        return "#ffffff"
-    ratio = max(0.0, min(1.0, value / vmax))
-    # White (255,255,255) to strong red (200, 60, 60)
-    r = int(255 - (255 - 200) * ratio)
-    g = int(255 - (255 - 60) * ratio)
-    b = int(255 - (255 - 60) * ratio)
-    return f"#{r:02x}{g:02x}{b:02x}"
 def build_divergence_matrix_html(
@@ -126,7 +116,10 @@ def build_divergence_matrix_html(
                     f'font-style:italic">{_e(diag_label)}</td>'
                 )
             else:
-                bg = _color_for(v, vmax)
                 # Dark text stays readable (no hard threshold on light red).
                 parts.append(
                     f'<td style="padding:.3rem .5rem;text-align:center;'

 from html import escape as _e
 from typing import Optional
+from picarones.report.render_helpers import (
+    GRADIENT_TARGET_RED,
+    color_single_gradient,
+)
 def build_divergence_matrix_html(
                     f'font-style:italic">{_e(diag_label)}</td>'
                 )
             else:
+                bg = (
+                    color_single_gradient(v, end_rgb=GRADIENT_TARGET_RED, max_value=vmax)
+                    if vmax > 0 else "#ffffff"
+                )
                 # Dark text stays readable (no hard threshold on light red).
                 parts.append(
                     f'<td style="padding:.3rem .5rem;text-align:center;'

picarones/report/levers_render.py CHANGED

@@ -25,9 +25,12 @@ recommandation : la phrase est purement descriptive.
 from __future__ import annotations
 from html import escape as _e
 from typing import Iterable, Optional
 def _lever_label(lever_type: str, labels: dict[str, str]) -> str:
     return labels.get(f"levers_label_{lever_type}", lever_type)
@@ -223,7 +226,12 @@ def build_levers_section_html(
             continue
         try:
             sentence = formatter(payload, labels)
-        except Exception:
             continue
         if not sentence:
             continue

 from __future__ import annotations
+import logging
 from html import escape as _e
 from typing import Iterable, Optional
+logger = logging.getLogger(__name__)
 def _lever_label(lever_type: str, labels: dict[str, str]) -> str:
     return labels.get(f"levers_label_{lever_type}", lever_type)
             continue
         try:
             sentence = formatter(payload, labels)
+        except Exception as exc:  # noqa: BLE001 - a broken formatter must not break the section
+            logger.warning(
+                "[levers_render] formatter %r a échoué sur payload=%r : %s — "
+                "ce levier sera omis du rapport",
+                lv_type, payload, exc,
+            )
             continue
         if not sentence:
             continue
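
Reviewer note: the change above replaces a silent `except Exception: continue` with a logged warning. A minimal, self-contained sketch of that pattern follows. The function name `render_sentences` and the shape of `levers` and `formatters` are hypothetical stand-ins, not the actual structures used by `build_levers_section_html`.

```python
import logging

logger = logging.getLogger("levers_demo")  # hypothetical logger name


def render_sentences(levers, formatters, labels):
    """Format each lever; log and skip any formatter that raises."""
    sentences = []
    for lv_type, payload in levers:
        formatter = formatters.get(lv_type)
        if formatter is None:
            continue
        try:
            sentence = formatter(payload, labels)
        except Exception as exc:  # a broken formatter must not break the section
            # Lazy %-style arguments: the message is only built if emitted.
            logger.warning(
                "formatter %r failed on payload=%r: %s", lv_type, payload, exc,
            )
            continue
        if sentence:
            sentences.append(sentence)
    return sentences


demo_formatters = {
    "ok": lambda payload, labels: f"ok {payload['x']}",
    "bad": lambda payload, labels: payload["missing"],  # raises KeyError
}
demo = render_sentences(
    [("ok", {"x": 1}), ("bad", {}), ("unknown", {})], demo_formatters, {},
)
```

The failing lever is still omitted from the output, but the failure now leaves a trace in the logs instead of disappearing.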

picarones/report/lexical_modernization_render.py CHANGED

@@ -19,15 +19,10 @@ from html import escape as _e
 from typing import Optional
 from picarones.measurements.lexical_modernization import top_modernized_tokens
-def _color_for_rate(rate: float) -> str:
-    """White → deep orange gradient for rate ∈ [0, 1]."""
-    f = max(0.0, min(1.0, rate))
-    r = int(255 + (194 - 255) * f)
-    g = int(255 + (65 - 255) * f)
-    b = int(255 + (12 - 255) * f)
-    return f"#{r:02x}{g:02x}{b:02x}"
 def _format_variants(variants: dict, max_show: int = 3) -> str:
@@ -96,7 +91,7 @@ def build_lexical_modernization_html(
         rate = slot.get("rate_modernized", 0.0)
         n_total = slot.get("n_total", 0)
         variants_str = _format_variants(slot.get("variants") or {})
-        rate_color = _color_for_rate(rate)
         parts.append(
             f'<tr>'
             f'<td style="padding:.3rem .5rem;font-family:monospace">'

 from typing import Optional
 from picarones.measurements.lexical_modernization import top_modernized_tokens
+from picarones.report.render_helpers import (
+    GRADIENT_TARGET_ORANGE,
+    color_single_gradient,
+)
 def _format_variants(variants: dict, max_show: int = 3) -> str:
         rate = slot.get("rate_modernized", 0.0)
         n_total = slot.get("n_total", 0)
         variants_str = _format_variants(slot.get("variants") or {})
+        rate_color = color_single_gradient(rate, end_rgb=GRADIENT_TARGET_ORANGE)
         parts.append(
             f'<tr>'
             f'<td style="padding:.3rem .5rem;font-family:monospace">'
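
Reviewer note: both `inter_engine_render.py` and this file now delegate to `color_single_gradient`. Its body is not part of this excerpt; the sketch below assumes it interpolates from white to `end_rgb`, as the two removed helpers did, and that the two constants hold the end colours of those helpers. The default `max_value=1.0` and the constant values are assumptions.

```python
# Hypothetical reconstruction, NOT the code from render_helpers.py.
GRADIENT_TARGET_RED = (200, 60, 60)     # end colour of the removed _color_for
GRADIENT_TARGET_ORANGE = (194, 65, 12)  # end colour of the removed _color_for_rate


def color_single_gradient(value, end_rgb, max_value=1.0):
    """White → end_rgb gradient, proportional to value / max_value."""
    if max_value <= 0:
        return "#ffffff"
    ratio = max(0.0, min(1.0, value / max_value))
    r, g, b = (int(255 + (c - 255) * ratio) for c in end_rgb)
    return f"#{r:02x}{g:02x}{b:02x}"
```

If the real helper already guards against `max_value <= 0` like this sketch, the explicit `if vmax > 0 else "#ffffff"` at the divergence-matrix call site is redundant but harmless.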

picarones/report/longitudinal_render.py CHANGED

@@ -33,32 +33,23 @@ from __future__ import annotations
 from html import escape as _e
 from typing import Optional
-def _color_for_delta(delta_pct: float) -> str:
-    """Green (≈0) → orange → red (≥ +5 CER points);
-    green → blue (≤ -5 CER points, improvement)."""
     if abs(delta_pct) < 1.0:
         return "#a7f0a7"
-    f = max(-1.0, min(1.0, delta_pct / 5.0))
-    if f >= 0:
-        # green → deep orange → deep red
-        if f < 0.5:
-            t = f / 0.5
-            r = int(167 + (235 - 167) * t)
-            g = int(240 + (180 - 240) * t)
-            b = int(167 + (60 - 167) * t)
-        else:
-            t = (f - 0.5) / 0.5
-            r = int(235 + (220 - 235) * t)
-            g = int(180 + (50 - 180) * t)
-            b = int(60 + (50 - 60) * t)
-    else:
-        # green → blue (improvement)
-        f = -f
-        r = int(167 + (90 - 167) * f)
-        g = int(240 + (160 - 240) * f)
-        b = int(167 + (210 - 167) * f)
-    return f"#{r:02x}{g:02x}{b:02x}"
 def build_longitudinal_html(
@@ -126,7 +117,7 @@ def build_longitudinal_html(
         first_cer = float(entry.get("first_cer") or 0.0)
         last_cer = float(entry.get("last_cer") or 0.0)
         delta_pct = float(entry.get("absolute_delta_pct") or 0.0)
-        delta_color = _color_for_delta(delta_pct)
         trend = entry.get("trend") or {}
         slope = trend.get("slope")
         r2 = trend.get("r_squared")

 from html import escape as _e
 from typing import Optional
+from picarones.report.render_helpers import color_diverging
+def _bg_for_cer_delta(delta_pct: float) -> str:
+    """Cell colour for a CER delta in percentage points:
+    green when delta ≈ 0, orange/red on regression, blue on improvement.
+    Saturates at ±5 points.
+    """
     if abs(delta_pct) < 1.0:
         return "#a7f0a7"
+    return color_diverging(
+        delta_pct,
+        max_abs=5.0,
+        neutral_rgb=(167, 240, 167),
+        positive_rgb=(220, 50, 50),
+        negative_rgb=(90, 160, 210),
+    )
 def build_longitudinal_html(
         first_cer = float(entry.get("first_cer") or 0.0)
         last_cer = float(entry.get("last_cer") or 0.0)
         delta_pct = float(entry.get("absolute_delta_pct") or 0.0)
+        delta_color = _bg_for_cer_delta(delta_pct)
         trend = entry.get("trend") or {}
         slope = trend.get("slope")
         r2 = trend.get("r_squared")
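
Reviewer note: `color_diverging` is imported from `render_helpers`, whose body is not visible here. The sketch below assumes a plain linear interpolation from `neutral_rgb` towards `positive_rgb` or `negative_rgb`, saturating at `±max_abs`. If the real helper works this way, positive deltas no longer pass through the orange midpoint that the removed `_color_for_delta` used, which would be a small visual change worth checking.

```python
# Hypothetical reconstruction, NOT the code from render_helpers.py.
def color_diverging(value, max_abs, neutral_rgb, positive_rgb, negative_rgb):
    """Neutral colour at 0, positive_rgb at +max_abs, negative_rgb at -max_abs."""
    if max_abs <= 0:
        return "#{:02x}{:02x}{:02x}".format(*neutral_rgb)
    f = max(-1.0, min(1.0, value / max_abs))
    target = positive_rgb if f >= 0 else negative_rgb
    t = abs(f)
    r, g, b = (int(n + (c - n) * t) for n, c in zip(neutral_rgb, target))
    return f"#{r:02x}{g:02x}{b:02x}"


# Same arguments as _bg_for_cer_delta passes in the diff above.
CER_KW = dict(
    max_abs=5.0,
    neutral_rgb=(167, 240, 167),
    positive_rgb=(220, 50, 50),
    negative_rgb=(90, 160, 210),
)
```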

picarones/report/marginal_cost_render.py ADDED

@@ -0,0 +1,111 @@

+"""HTML rendering of the inter-engine marginal cost (Sprint 91, A.II.6).
+Summary table of the (A → B) pairs with the additional cost
+per avoided error. Adaptive: returns ``""`` when there are fewer than
+2 engines or when no pair has usable cost/error data.
+Lets an archivist see: *"switching from Tesseract to GPT-4o
+costs X € more per avoided error; is that justified for my
+budget?"*
+"""
+from __future__ import annotations
+from html import escape as _e
+from typing import Optional
+def build_marginal_cost_html(
+    matrix: Optional[list[dict]],
+    labels: Optional[dict[str, str]] = None,
+) -> str:
+    """Build the inter-engine marginal cost table.
+    Parameters
+    ----------
+    matrix:
+        Output of
+        :func:`picarones.report.report_data.extra_metrics.compute_marginal_cost_section`.
+        List of dicts sorted by increasing marginal cost. If ``None``
+        or empty, returns ``""``.
+    labels:
+        Optional i18n dict.
+    """
+    if not matrix:
+        return ""
+    labels = labels or {}
+    title = labels.get(
+        "marginal_cost_title",
+        "Coût marginal inter-moteurs (€ par erreur évitée)",
+    )
+    note = labels.get(
+        "marginal_cost_note",
+        "Pour chaque paire de moteurs (A → B), coût additionnel par "
+        "erreur évitée en passant de A à B. Valeur basse = changement "
+        "rentable. ‘Dominé’ = B est moins cher ET plus précis. Estimation "
+        "des erreurs basée sur ``cer × 1000`` (proxy par 1000 pages).",
+    )
+    h_from = labels.get("marginal_cost_from", "Depuis")
+    h_to = labels.get("marginal_cost_to", "Vers")
+    h_avoided = labels.get("marginal_cost_avoided", "Erreurs évitées")
+    h_delta = labels.get("marginal_cost_delta", "Coût Δ (€)")
+    h_per_err = labels.get("marginal_cost_per_err", "€ / erreur évitée")
+    h_dominated = labels.get("marginal_cost_dominated", "Dominé ?")
+    parts = [
+        '<section class="marginal-cost-section" style="margin:1rem 0">',
+        f'<h3 style="margin:0 0 .3rem 0">{_e(title)}</h3>',
+        f'<div style="font-size:.85rem;opacity:.75;margin-bottom:.5rem">'
+        f'{_e(note)}</div>',
+        '<table style="border-collapse:collapse;width:100%;'
+        'font-size:.9rem">',
+        '<thead><tr>',
+    ]
+    for h in (h_from, h_to, h_avoided, h_delta, h_per_err, h_dominated):
+        parts.append(
+            f'<th scope="col" style="padding:.4rem .6rem;text-align:left;'
+            f'border-bottom:1px solid #ccc;font-weight:600">{_e(h)}</th>'
+        )
+    parts.append('</tr></thead><tbody>')
+    for row in matrix:
+        engine_a = row.get("engine_a") or row.get("from") or "?"
+        engine_b = row.get("engine_b") or row.get("to") or "?"
+        n_avoided = row.get("n_errors_avoided")
+        cost_delta = row.get("cost_delta")
+        cost_per_err = row.get("cost_per_avoided_error")
+        dominated = row.get("dominated", False)
+        n_avoided_cell = (
+            f"{int(n_avoided)}" if isinstance(n_avoided, (int, float)) else "—"
+        )
+        cost_delta_cell = (
+            f"{cost_delta:+.2f}" if isinstance(cost_delta, (int, float)) else "—"
+        )
+        if isinstance(cost_per_err, (int, float)):
+            cost_per_err_cell = f"{cost_per_err:.2f}"
+        else:
+            cost_per_err_cell = "—"
+        dominated_cell = (
+            '<span style="color:#16a34a;font-weight:600">✓ A dominé par B</span>'
+            if dominated else "—"
+        )
+        parts.append(
+            f'<tr>'
+            f'<td style="padding:.4rem .6rem">{_e(str(engine_a))}</td>'
+            f'<td style="padding:.4rem .6rem">{_e(str(engine_b))}</td>'
+            f'<td style="padding:.4rem .6rem;text-align:right;'
+            f'font-family:monospace">{n_avoided_cell}</td>'
+            f'<td style="padding:.4rem .6rem;text-align:right;'
+            f'font-family:monospace">{cost_delta_cell}</td>'
+            f'<td style="padding:.4rem .6rem;text-align:right;'
+            f'font-family:monospace;font-weight:600">{cost_per_err_cell}</td>'
+            f'<td style="padding:.4rem .6rem">{dominated_cell}</td>'
+            f'</tr>'
+        )
+    parts.append('</tbody></table></section>')
+    return "".join(parts)
+__all__ = ["build_marginal_cost_html"]
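
Reviewer note: the renderer only consumes rows; the computation lives in `compute_marginal_cost_section`, which is not in this excerpt. To make the column semantics concrete, here is a hypothetical sketch of how one row could be derived, using the `cer × 1000` error proxy quoted in the note text. The function name, its parameters and the exact "dominated" rule (B strictly cheaper and more accurate) are assumptions.

```python
from typing import Optional


def marginal_cost_row(
    engine_a: str, cer_a: float, cost_a: float,
    engine_b: str, cer_b: float, cost_b: float,
) -> dict:
    """Build one (A → B) row with the keys read by the renderer."""
    # Error proxy from the note: cer × 1000, rounded to whole errors.
    errors_a = round(cer_a * 1000)
    errors_b = round(cer_b * 1000)
    n_avoided = errors_a - errors_b
    cost_delta = cost_b - cost_a
    # Assumed reading of "dominated": B is cheaper AND more accurate.
    dominated = n_avoided > 0 and cost_delta < 0
    cost_per_err: Optional[float] = None
    if n_avoided > 0 and not dominated:
        cost_per_err = cost_delta / n_avoided
    return {
        "engine_a": engine_a,
        "engine_b": engine_b,
        "n_errors_avoided": n_avoided,
        "cost_delta": cost_delta,
        "cost_per_avoided_error": cost_per_err,
        "dominated": dominated,
    }


upgrade = marginal_cost_row("tesseract", 0.08, 2.0, "gpt-4o", 0.03, 12.0)
downgrade = marginal_cost_row("gpt-4o", 0.08, 12.0, "tesseract", 0.03, 2.0)
```

Rows with `cost_per_avoided_error` set to `None` are rendered as "—" by the table code above.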

picarones/report/multirun_stability_render.py CHANGED

@@ -43,21 +43,7 @@ from __future__ import annotations
 from html import escape as _e
 from typing import Optional
-def _color_for_cv(cv: float) -> str:
-    """Green (≈0) → orange (10 %) → red (≥ 25 %)."""
-    f = max(0.0, min(1.0, cv / 0.25))
-    if f < 0.5:
-        t = f / 0.5
-        r = int(167 + (235 - 167) * t)
-        g = int(240 + (180 - 240) * t)
-        b = int(167 + (60 - 167) * t)
-    else:
-        t = (f - 0.5) / 0.5
-        r = int(235 + (220 - 235) * t)
-        g = int(180 + (50 - 180) * t)
-        b = int(60 + (50 - 60) * t)
-    return f"#{r:02x}{g:02x}{b:02x}"
 def build_multirun_stability_html(
@@ -128,7 +114,7 @@ def build_multirun_stability_html(
         else:
             cer_str = "—"
         if isinstance(cer_cv, (int, float)):
-            cv_color = _color_for_cv(float(cer_cv))
             cv_cell = (
                 f'<td style="padding:.4rem .6rem;text-align:right;'
                 f'background:{cv_color};font-family:monospace;'

 from html import escape as _e
 from typing import Optional
+from picarones.report.render_helpers import color_traffic_light
 def build_multirun_stability_html(
         else:
             cer_str = "—"
         if isinstance(cer_cv, (int, float)):
+            cv_color = color_traffic_light(float(cer_cv), low_is_good=True, scale_max=0.25)
             cv_cell = (
                 f'<td style="padding:.4rem .6rem;text-align:right;'
                 f'background:{cv_color};font-family:monospace;'

picarones/report/ner_render.py CHANGED

@@ -23,26 +23,7 @@ from __future__ import annotations
 from html import escape as _e
 from typing import Optional
-def _color_for_f1(f1: float) -> str:
-    """Red → yellow → green gradient proportional to ``f1`` ∈ [0, 1].
-    F1 = 0 → light red, F1 = 0.5 → pale yellow, F1 = 1 → light green.
-    """
-    f = max(0.0, min(1.0, f1))
-    # Two-segment linear interpolation:
-    # 0 → (220, 100, 100) (red), 0.5 → (240, 220, 130), 1 → (130, 200, 130) (green)
-    if f <= 0.5:
-        ratio = f / 0.5
-        r = int(220 + (240 - 220) * ratio)
-        g = int(100 + (220 - 100) * ratio)
-        b = int(100 + (130 - 100) * ratio)
-    else:
-        ratio = (f - 0.5) / 0.5
-        r = int(240 + (130 - 240) * ratio)
-        g = int(220 + (200 - 220) * ratio)
-        b = int(130 + (130 - 130) * ratio)
-    return f"#{r:02x}{g:02x}{b:02x}"
 def _engines_with_ner(engines_summary: list[dict]) -> list[dict]:
@@ -110,7 +91,7 @@ def build_ner_summary_html(
         doc_count = int(agg.get("doc_count") or 0)
         hallucinated = int(agg.get("hallucinated_total") or 0)
         missed = int(agg.get("missed_total") or 0)
-        bg = _color_for_f1(f1)
         parts.append("<tr>")
         parts.append(
             f'<td style="padding:.3rem .5rem;font-weight:600">'
@@ -222,7 +203,7 @@ def build_ner_per_category_html(
             else:
                 f1 = float(stats.get("f1") or 0.0)
                 support = int(stats.get("support", 0))
-                bg = _color_for_f1(f1)
                 parts.append(
                     f'<td style="padding:.3rem .5rem;text-align:center;'
                     f'background:{bg};color:#222;'

 from html import escape as _e
 from typing import Optional
+from picarones.report.render_helpers import color_traffic_light
 def _engines_with_ner(engines_summary: list[dict]) -> list[dict]:
         doc_count = int(agg.get("doc_count") or 0)
         hallucinated = int(agg.get("hallucinated_total") or 0)
         missed = int(agg.get("missed_total") or 0)
+        bg = color_traffic_light(f1)
         parts.append("<tr>")
         parts.append(
             f'<td style="padding:.3rem .5rem;font-weight:600">'
             else:
                 f1 = float(stats.get("f1") or 0.0)
                 support = int(stats.get("support", 0))
+                bg = color_traffic_light(f1)
                 parts.append(
                     f'<td style="padding:.3rem .5rem;text-align:center;'
                     f'background:{bg};color:#222;'

picarones/report/numerical_sequences_render.py CHANGED

@@ -24,22 +24,7 @@ from html import escape as _e
 from typing import Optional
 from picarones.measurements.numerical_sequences import CATEGORIES
-def _color_for_score(score: float) -> str:
-    """Red → yellow → green gradient."""
-    f = max(0.0, min(1.0, score))
-    if f < 0.5:
-        t = f / 0.5
-        r = 235
-        g = int(70 + (200 - 70) * t)
-        b = 70
-    else:
-        t = (f - 0.5) / 0.5
-        r = int(235 + (60 - 235) * t)
-        g = int(200 + (160 - 200) * t)
-        b = int(70 + (90 - 70) * t)
-    return f"#{r:02x}{g:02x}{b:02x}"
 def _category_columns_with_signal(rows: list[dict]) -> list[str]:
@@ -125,7 +110,7 @@ def build_numerical_sequences_html(
         global_strict = float(agg.get("global_strict_score") or 0.0)
         global_value = float(agg.get("global_value_score") or 0.0)
         n_total = int(agg.get("n_total") or 0)
-        global_color = _color_for_score(global_strict)
         parts.append(
             f'<tr>'
             f'<td style="padding:.4rem .6rem">{_e(str(name))}</td>'
@@ -148,7 +133,7 @@ def build_numerical_sequences_html(
                 continue
             strict = float(cat_data.get("strict_score") or 0.0)
             value = float(cat_data.get("value_score") or 0.0)
-            color = _color_for_score(strict)
             parts.append(
                 f'<td style="padding:.4rem .6rem;text-align:right;'
                 f'background:{color};font-family:monospace">'

 from typing import Optional
 from picarones.measurements.numerical_sequences import CATEGORIES
+from picarones.report.render_helpers import color_traffic_light
 def _category_columns_with_signal(rows: list[dict]) -> list[str]:
         global_strict = float(agg.get("global_strict_score") or 0.0)
         global_value = float(agg.get("global_value_score") or 0.0)
         n_total = int(agg.get("n_total") or 0)
+        global_color = color_traffic_light(global_strict)
         parts.append(
             f'<tr>'
             f'<td style="padding:.4rem .6rem">{_e(str(name))}</td>'
                 continue
             strict = float(cat_data.get("strict_score") or 0.0)
             value = float(cat_data.get("value_score") or 0.0)
+            color = color_traffic_light(strict)
             parts.append(
                 f'<td style="padding:.4rem .6rem;text-align:right;'
                 f'background:{color};font-family:monospace">'

picarones/report/philological_render.py CHANGED

@@ -36,34 +36,14 @@ from __future__ import annotations
 from html import escape as _e
 from typing import Optional
 # ──────────────────────────────────────────────────────────────────────────
 # Colouring helpers
 # ──────────────────────────────────────────────────────────────────────────
-def _color_for_score(score: float) -> str:
-    """Red → yellow → green gradient proportional to ``score`` ∈ [0, 1].
-    Identical to ``ner_render._color_for_f1``.  The philological
-    scores (preservation, coverage, accuracy) share the same
-    "higher is better" semantics, so the gradient
-    is valid.
-    """
-    f = max(0.0, min(1.0, score))
-    if f <= 0.5:
-        ratio = f / 0.5
-        r = int(220 + (240 - 220) * ratio)
-        g = int(100 + (220 - 100) * ratio)
-        b = int(100 + (130 - 100) * ratio)
-    else:
-        ratio = (f - 0.5) / 0.5
-        r = int(240 + (130 - 240) * ratio)
-        g = int(220 + (200 - 220) * ratio)
-        b = int(130 + (130 - 130) * ratio)
-    return f"#{r:02x}{g:02x}{b:02x}"
 def _engines_with_module(
     engines_summary: list[dict], module: str,
 ) -> list[dict]:
@@ -83,7 +63,7 @@ def _score_cell(score: Optional[float], extra: str = "") -> str:
             '<td style="padding:.3rem .5rem;text-align:center;'
             'background:#f0f0f0;color:#999">—</td>'
         )
-    color = _color_for_score(score)
     text = f"{score * 100:.1f}%"
     if extra:
         text += f" <span style=\"opacity:.6;font-size:.85em\">({_e(extra)})</span>"
@@ -539,8 +519,8 @@ def build_roman_numerals_section(
                 # the semantics "the higher, the more the OCR
                 # adopted this status".
                 color = (
-                    _color_for_score(1.0 - ratio) if status == "lost"
-                    else _color_for_score(ratio)
                 )
                 parts.append(
                     f'<td style="padding:.3rem .5rem;text-align:center;'

 from html import escape as _e
 from typing import Optional
+from picarones.report.render_helpers import color_traffic_light
 # ──────────────────────────────────────────────────────────────────────────
 # Colouring helpers
 # ──────────────────────────────────────────────────────────────────────────
 def _engines_with_module(
     engines_summary: list[dict], module: str,
 ) -> list[dict]:
             '<td style="padding:.3rem .5rem;text-align:center;'
             'background:#f0f0f0;color:#999">—</td>'
         )
+    color = color_traffic_light(score)
     text = f"{score * 100:.1f}%"
     if extra:
         text += f" <span style=\"opacity:.6;font-size:.85em\">({_e(extra)})</span>"
                 # the semantics "the higher, the more the OCR
                 # adopted this status".
                 color = (
+                    color_traffic_light(1.0 - ratio) if status == "lost"
+                    else color_traffic_light(ratio)
                 )
                 parts.append(
                     f'<td style="padding:.3rem .5rem;text-align:center;'

picarones/report/pipeline_render.py CHANGED

@@ -50,6 +50,7 @@ from typing import Optional
 from picarones.core.modules import ArtifactType
 from picarones.measurements.pipeline_benchmark import PipelineBenchmarkResult
 from picarones.measurements.pipeline_comparison import PipelineComparisonResult
 # ──────────────────────────────────────────────────────────────────────────
@@ -57,22 +58,6 @@ from picarones.measurements.pipeline_comparison import PipelineComparisonResult
 # ──────────────────────────────────────────────────────────────────────────
-def _color_for_success_rate(rate: float) -> str:
-    """Red → yellow → green gradient for the success rate."""
-    f = max(0.0, min(1.0, rate))
-    if f <= 0.5:
-        ratio = f / 0.5
-        r = int(220 + (240 - 220) * ratio)
-        g = int(100 + (220 - 100) * ratio)
-        b = int(100 + (130 - 100) * ratio)
-    else:
-        ratio = (f - 0.5) / 0.5
-        r = int(240 + (130 - 240) * ratio)
-        g = int(220 + (200 - 220) * ratio)
-        b = int(130 + (130 - 130) * ratio)
-    return f"#{r:02x}{g:02x}{b:02x}"
 def _format_duration(seconds: float) -> str:
     """Format a duration in ms if < 1 s, in s otherwise."""
     if seconds < 1.0:
@@ -109,7 +94,7 @@ def build_pipeline_summary_html(
     failed = bench.n_pipelines_failed
     total = bench.n_docs
     rate = success / total if total > 0 else 0.0
-    color = _color_for_success_rate(rate)
     parts = [
         '<div class="pipeline-summary" '
@@ -195,7 +180,7 @@ def build_pipeline_steps_table_html(
     for agg in bench.per_step_aggregates:
         rate = agg.success_rate
-        rate_color = _color_for_success_rate(rate)
         # Junction metrics: for each artifact type,
         # list of the mean metrics
         metrics_cells: list[str] = []
@@ -381,12 +366,17 @@ class RankingSpec:
         return f"{self.artifact_type.value}.{self.metric_name}"
-def _color_for_rank(rank: int, total: int) -> str:
-    """Green (1st) → red (last) gradient for the rank cell."""
     if total <= 1:
-        return _color_for_success_rate(1.0)
-    score = 1.0 - (rank - 1) / (total - 1)
-    return _color_for_success_rate(score)
 def build_pipeline_ranking_table_html(
@@ -444,7 +434,7 @@ def build_pipeline_ranking_table_html(
             rank += 1
             rank_str = str(rank)
             value_str = f"{value:.4f}"
-            rank_color = _color_for_rank(rank, n_with_value)
         parts.append(
             f'<tr>'
             f'<td style="padding:.3rem .5rem;text-align:center;'

 from picarones.core.modules import ArtifactType
 from picarones.measurements.pipeline_benchmark import PipelineBenchmarkResult
 from picarones.measurements.pipeline_comparison import PipelineComparisonResult
+from picarones.report.render_helpers import color_traffic_light
 # ──────────────────────────────────────────────────────────────────────────
 # ──────────────────────────────────────────────────────────────────────────
 def _format_duration(seconds: float) -> str:
     """Format a duration in ms if < 1 s, in s otherwise."""
     if seconds < 1.0:
     failed = bench.n_pipelines_failed
     total = bench.n_docs
     rate = success / total if total > 0 else 0.0
+    color = color_traffic_light(rate)
     parts = [
         '<div class="pipeline-summary" '
     for agg in bench.per_step_aggregates:
         rate = agg.success_rate
+        rate_color = color_traffic_light(rate)
         # Junction metrics: for each artifact type,
         # list of the mean metrics
         metrics_cells: list[str] = []
         return f"{self.artifact_type.value}.{self.metric_name}"
+def _bg_for_rank(rank: int, total: int) -> str:
+    """Green (rank 1) → red (last rank) gradient.
+    Mapping: ``rank ∈ [1, total]`` → ``color_traffic_light`` with
+    ``low_is_good=True`` (low rank = good).
+    """
     if total <= 1:
+        return color_traffic_light(1.0)
+    return color_traffic_light(
+        float(rank), low_is_good=True, scale_min=1.0, scale_max=float(total),
+    )
 def build_pipeline_ranking_table_html(
             rank += 1
             rank_str = str(rank)
             value_str = f"{value:.4f}"
+            rank_color = _bg_for_rank(rank, n_with_value)
         parts.append(
             f'<tr>'
             f'<td style="padding:.3rem .5rem;text-align:center;'
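
Reviewer note: the removed `_color_for_rank` turned a rank into a "higher is better" score, while `_bg_for_rank` passes the raw rank with `low_is_good=True` and a `[1, total]` scale. The check below shows that the two normalisations are mirror images of each other, assuming `color_traffic_light` rescales linearly between `scale_min` and `scale_max` (an assumption, since the helper is not shown). Both function names are hypothetical.

```python
def old_rank_score(rank: int, total: int) -> float:
    """Score used by the removed helper: 1.0 for rank 1, 0.0 for the last rank."""
    return 1.0 - (rank - 1) / (total - 1)


def new_rank_position(rank: int, total: int) -> float:
    """Position of rank on [scale_min=1, scale_max=total]: 0.0 for rank 1."""
    return (float(rank) - 1.0) / (float(total) - 1.0)


def mappings_agree(total: int) -> bool:
    """True when old score and new position always sum to 1 (same ordering)."""
    return all(
        abs(old_rank_score(r, total) + new_rank_position(r, total) - 1.0) < 1e-12
        for r in range(1, total + 1)
    )
```

Whether the resulting colours are identical still depends on the palette used by the shared helper, since the removed function used the red → yellow → green anchors of `_color_for_success_rate`.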

picarones/report/rare_token_recall_render.py ADDED

@@ -0,0 +1,116 @@

+"""HTML rendering of rare-token recall (Sprint 71, A.I.1).
+Small summary table engine × {n_rare_tokens, n_recalled,
+recall, n_docs}. Adaptive: returns ``""`` when there is no data.
+Critical for prosopographic indexing: an OCR engine that systematically
+misses rare proper nouns yields a corpus that is
+unusable for research, even with a respectable overall CER.
+"""
+from __future__ import annotations
+from html import escape as _e
+from typing import Optional
+from picarones.report.render_helpers import color_traffic_light
+def build_rare_token_recall_html(
+    per_engine: Optional[dict[str, dict]],
+    labels: Optional[dict[str, str]] = None,
+) -> str:
+    """Build the summary table of rare-token recall.
+    Parameters
+    ----------
+    per_engine:
+        Output of
+        :func:`picarones.report.report_data.extra_metrics.compute_rare_token_recall_per_engine`.
+        Dict ``{engine_name: {n_rare_tokens, n_recalled, recall, n_docs, max_freq}}``.
+        If ``None`` or empty, returns ``""``.
+    labels:
+        Optional i18n dict.
+    """
+    if not per_engine:
+        return ""
+    labels = labels or {}
+    title = labels.get(
+        "rare_token_title", "Recall sur tokens rares (hapax + dis legomena)",
+    )
+    note = labels.get(
+        "rare_token_note",
+        "Pour chaque moteur, fraction des tokens rares (apparaissant ≤ 2 "
+        "fois dans la GT du corpus) effectivement transcrits. Critique "
+        "pour l'indexation prosopographique — un OCR qui rate les noms "
+        "propres rares rend le corpus inutilisable pour la recherche.",
+    )
+    h_engine = labels.get("rare_token_engine", "Moteur")
+    h_recall = labels.get("rare_token_recall", "Recall")
+    h_recalled = labels.get("rare_token_recalled", "Tokens recalled")
+    h_total = labels.get("rare_token_total", "Tokens rares (corpus)")
+    h_docs = labels.get("rare_token_docs", "Docs évalués")
+    rows = [
+        (engine, info)
+        for engine, info in per_engine.items()
+        if isinstance(info, dict)
+    ]
+    if not rows:
+        return ""
+    parts = [
+        '<section class="rare-token-section" style="margin:1rem 0">',
+        f'<h3 style="margin:0 0 .3rem 0">{_e(title)}</h3>',
+        f'<div style="font-size:.85rem;opacity:.75;margin-bottom:.5rem">'
+        f'{_e(note)}</div>',
+        '<table style="border-collapse:collapse;width:100%;'
+        'font-size:.9rem">',
+        '<thead><tr>',
+    ]
+    for h in (h_engine, h_recall, h_recalled, h_total, h_docs):
+        parts.append(
+            f'<th scope="col" style="padding:.4rem .6rem;text-align:left;'
+            f'border-bottom:1px solid #ccc;font-weight:600">{_e(h)}</th>'
+        )
+    parts.append('</tr></thead><tbody>')
+    # Sort by decreasing recall (best first, None last).
+    sorted_rows = sorted(
+        rows,
+        key=lambda kv: -(kv[1].get("recall") or -1.0),
+    )
+    for engine, info in sorted_rows:
+        recall = info.get("recall")
+        n_recalled = int(info.get("n_recalled") or 0)
+        n_total = int(info.get("n_rare_tokens") or 0)
+        n_docs = int(info.get("n_docs") or 0)
+        if isinstance(recall, (int, float)):
+            recall_color = color_traffic_light(float(recall))
+            recall_cell = (
+                f'<td style="padding:.4rem .6rem;text-align:right;'
+                f'background:{recall_color};font-family:monospace;'
+                f'font-weight:600">{recall * 100:.1f} %</td>'
+            )
+        else:
+            recall_cell = (
+                '<td style="padding:.4rem .6rem;text-align:right;'
+                'opacity:.4">—</td>'
+            )
+        parts.append(
+            f'<tr>'
+            f'<td style="padding:.4rem .6rem">{_e(str(engine))}</td>'
+            f'{recall_cell}'
+            f'<td style="padding:.4rem .6rem;text-align:right;'
+            f'font-family:monospace">{n_recalled}</td>'
+            f'<td style="padding:.4rem .6rem;text-align:right;'
+            f'font-family:monospace">{n_total}</td>'
+            f'<td style="padding:.4rem .6rem;text-align:right;'
+            f'font-family:monospace">{n_docs}</td>'
+            f'</tr>'
+        )
+    parts.append('</tbody></table></section>')
+    return "".join(parts)
+__all__ = ["build_rare_token_recall_html"]
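Reviewer sketch: the note rendered above describes the metric as the fraction of rare tokens (at most 2 occurrences in the corpus ground truth) that an engine actually transcribes. The code below is one possible reading of that definition, not the project's implementation. Whitespace tokenization, occurrence-level matching and the function name `rare_token_recall` are assumptions; only the output keys (`recall`, `n_recalled`, `n_rare_tokens`, `n_docs`) come from the renderer in this diff.

```python
from collections import Counter


def rare_token_recall(ground_truths, hypotheses, max_corpus_count=2):
    """Occurrence-level recall on tokens seen at most `max_corpus_count` times in the GT.

    Hypothetical helper: tokenization is a plain whitespace split.
    """
    # Corpus-wide token frequencies, computed on the ground truth only.
    corpus_counts = Counter()
    for text in ground_truths.values():
        corpus_counts.update(text.split())
    rare = {tok for tok, n in corpus_counts.items() if n <= max_corpus_count}

    n_rare = 0
    n_recalled = 0
    for doc_id, gt in ground_truths.items():
        gt_counts = Counter(t for t in gt.split() if t in rare)
        hyp_counts = Counter(hypotheses.get(doc_id, "").split())
        n_rare += sum(gt_counts.values())
        # A rare occurrence counts as recalled when the hypothesis of the
        # same document contains a matching token occurrence.
        n_recalled += sum(min(n, hyp_counts[tok]) for tok, n in gt_counts.items())

    return {
        "recall": n_recalled / n_rare if n_rare else None,
        "n_recalled": n_recalled,
        "n_rare_tokens": n_rare,
        "n_docs": len(ground_truths),
    }


gts = {
    "d1": "le roi de france",
    "d2": "le duc de bourgogne",
    "d3": "le roi et le duc",
}
hyps = {
    "d1": "le roi de franee",
    "d2": "le duc de bourgogne",
    "d3": "le roi et le due",
}
info = rare_token_recall(gts, hyps)
```

In this toy corpus only `le` (4 occurrences) is frequent. The two misread proper-noun-like tokens (`france`, `duc`) are the kind of loss the note warns about.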

picarones/report/readability_render.py CHANGED

@@ -25,27 +25,22 @@ from __future__ import annotations
 from html import escape as _e
 from typing import Optional
+from picarones.report.render_helpers import color_diverging
-def _color_for_delta(delta: float) -> str:
-    """Green at the center, orange if over-normalized, blue if under-normalized.
-    Saturation range: ±15 Flesch points.
+def _bg_for_flesch_delta(delta: float) -> str:
+    """Green at the center (delta ≈ 0), orange for over-normalization (delta > 0),
+    blue for under-normalization (delta < 0). Saturates at ±15 Flesch points.
     """
     if abs(delta) <= 1.0:
-        return "#a7f0a7"  # light green
-    f = max(-1.0, min(1.0, delta / 15.0))
-    if f >= 0:
-        # green → deep orange
-        r = int(167 + (220 - 167) * f)
-        g = int(240 + (140 - 240) * f)
-        b = int(167 + (60 - 167) * f)
-    else:
-        f = -f
-        # green → deep blue
-        r = int(167 + (90 - 167) * f)
-        g = int(240 + (160 - 240) * f)
-        b = int(167 + (210 - 167) * f)
-    return f"#{r:02x}{g:02x}{b:02x}"
+        return "#a7f0a7"  # neutral light green, indistinguishable from noise
+    return color_diverging(
+        delta,
+        max_abs=15.0,
+        neutral_rgb=(167, 240, 167),
+        positive_rgb=(220, 140, 60),
+        negative_rgb=(90, 160, 210),
+    )
 def build_readability_summary_html(
@@ -107,7 +102,7 @@ def build_readability_summary_html(
         over_rate = float(agg.get("over_normalized_rate") or 0.0)
         n_under = int(agg.get("n_under_normalized") or 0)
         n_docs = int(agg.get("n_docs") or 0)
-        color = _color_for_delta(delta_mean)
+        color = _bg_for_flesch_delta(delta_mean)
         parts.append(
             f'<tr>'
             f'<td style="padding:.4rem .6rem">{_e(str(name))}</td>'
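Reviewer sketch: a quick way to check that this refactor keeps the Flesch-delta colors unchanged is to replay the diverging interpolation with the three RGB anchors passed in this file. The block below is a standalone re-implementation for illustration only. The names `diverging_hex` and `bg_for_flesch_delta` are hypothetical stand-ins and nothing is imported from `picarones`.

```python
NEUTRAL = (167, 240, 167)   # light green, delta close to 0
POSITIVE = (220, 140, 60)   # orange, over-normalization
NEGATIVE = (90, 160, 210)   # blue, under-normalization


def _interp(a, b, t):
    # Linear interpolation on one channel, truncated and clamped to [0, 255].
    return max(0, min(255, int(a + (b - a) * t)))


def diverging_hex(value, max_abs, neutral, positive, negative):
    # Normalize to [-1, 1], then blend from neutral toward the matching extreme.
    f = max(-1.0, min(1.0, value / max_abs))
    target = positive if f >= 0 else negative
    r, g, b = (_interp(n, c, abs(f)) for n, c in zip(neutral, target))
    return f"#{r:02x}{g:02x}{b:02x}"


def bg_for_flesch_delta(delta):
    # Deltas within ±1 point are treated as noise and keep the neutral color.
    if abs(delta) <= 1.0:
        return "#a7f0a7"
    return diverging_hex(delta, 15.0, NEUTRAL, POSITIVE, NEGATIVE)


samples = {d: bg_for_flesch_delta(d) for d in (0.5, 7.5, -15.0, 30.0)}
```

Halfway to saturation (`delta = 7.5`) each channel sits midway between the neutral and orange anchors, and any delta beyond ±15 returns the anchor itself.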

picarones/report/render_helpers.py ADDED

	@@ -0,0 +1,422 @@

+"""Shared rendering helpers.
+Centralizes the coloring functions and the SVG grid builder that were
+previously duplicated in every ``*_render.py``. Before this
+consolidation, the project had 25 different versions of
+``_color_for_*`` (all slightly different red/yellow/green or
+white/color gradients) and 2 versions of ``_build_heatmap_svg``
+(class × position matrix). The test
+``tests/architecture/test_render_helpers.py`` measures this duplication
+and blocks its reappearance.
+API
+---
+- :func:`color_traffic_light`: red → yellow → green gradient. Covers
+  most of the report's cells (CER, F1, recall, ECE, deficit,
+  drag, CV, etc.). The ``low_is_good`` argument inverts the semantics.
+- :func:`color_single_gradient`: white → intense color gradient.
+  Used for the Jaccard, density and lexical modernization heatmaps.
+- :func:`color_diverging`: signed gradient (negative → neutral → positive).
+  Used for Flesch deltas, net improvement, over/under-normalization.
+- :func:`text_color_for_bg`: black or white depending on background brightness.
+- :func:`build_grid_svg`: parameterized SVG heatmap builder.
+Bound conventions
+-----------------
+Three parameterization conventions coexist (by design, not by
+accident):
+- :func:`color_traffic_light` accepts ``scale_min`` + ``scale_max``
+  because the cells concerned (CER, ECE, deficit) may
+  start at a non-zero lower bound (rank 1 = green, or
+  ``scale_min=0.30`` to start the gradient at a threshold).
+- :func:`color_single_gradient` accepts ``max_value`` because these
+  cells (Jaccard, density) are always bounded below by 0, so
+  ``scale_min`` is not needed.
+- :func:`color_diverging` accepts ``max_abs`` because these cells
+  (signed deltas) are symmetric around 0: the bound is the
+  same on both sides.
+The choice of colors reflects the domain semantics:
+- **Traffic-light** red/yellow/green: a long-standing convention,
+  widely understood under normal trichromatic vision. **Accepted
+  accessibility trade-off**: red/green confusion affects ~8 %
+  of men (deuteranopia/protanopia). A migration to the
+  Okabe-Ito palette of :mod:`picarones.report.colors` is tracked
+  as debt for a dedicated sprint.
+- **Diverging** blue/green/orange by default: green at the center =
+  neutral, semantically opposite extremes, and these 3 hues
+  remain distinguishable under deuteranopia. Chosen
+  because diverging cells are fewer and
+  could be written from scratch.
+Palette
+-------
+The RGB bounds of the traffic-light gradients are the average of the
+ad hoc palettes that populated the 25 original helpers. This unifies
+the visual appearance while staying close to the previous rendering
+(≤ 10 RGB units of difference on most bounds), so as not to break the
+existing HTML integration tests.
+"""
+from __future__ import annotations
+from html import escape as _e
+from typing import Callable, Optional
+# ──────────────────────────────────────────────────────────────────
+# Palettes: RGB bounds shared by all gradients.
+#
+# Editorial choice: we keep the "red → yellow → green" spirit of the
+# historical helpers rather than the colorblind-friendly
+# Okabe-Ito palette from ``colors.py`` (used for the main badges).
+# Migrating table cells to Okabe-Ito would be a dedicated
+# accessibility sprint, out of scope for this consolidation.
+# ──────────────────────────────────────────────────────────────────
+GRADIENT_RED_RGB: tuple[int, int, int] = (220, 100, 100)
+GRADIENT_YELLOW_RGB: tuple[int, int, int] = (240, 220, 130)
+GRADIENT_GREEN_RGB: tuple[int, int, int] = (130, 200, 130)
+#: Target colors for the frequent single-gradients.
+GRADIENT_TARGET_BLUE: tuple[int, int, int] = (30, 58, 138)      # Jaccard, specialization
+GRADIENT_TARGET_ORANGE: tuple[int, int, int] = (194, 65, 12)    # density, lexical mod.
+GRADIENT_TARGET_RED: tuple[int, int, int] = (200, 60, 60)       # inter-engine divergence
+#: Target colors for the diverging gradients.
+DIVERGING_NEGATIVE_RGB: tuple[int, int, int] = (95, 145, 215)   # blue (under-norm)
+DIVERGING_NEUTRAL_RGB: tuple[int, int, int] = (130, 200, 130)   # green (center, OK)
+DIVERGING_POSITIVE_RGB: tuple[int, int, int] = (220, 130, 60)   # orange (over-norm)
+# ──────────────────────────────────────────────────────────────────
+# Internal helpers
+# ──────────────────────────────────────────────────────────────────
+def _interp(a: int, b: int, t: float) -> int:
+    """Linear interpolation clamped to an RGB channel in [0, 255]."""
+    return max(0, min(255, int(a + (b - a) * t)))
+def _rgb_to_hex(r: int, g: int, b: int) -> str:
+    return f"#{r:02x}{g:02x}{b:02x}"
+# ──────────────────────────────────────────────────────────────────
+# Public API: colors
+# ──────────────────────────────────────────────────────────────────
+def color_traffic_light(
+    value: float,
+    *,
+    low_is_good: bool = False,
+    scale_max: float = 1.0,
+    scale_min: float = 0.0,
+) -> str:
+    """Red → yellow → green gradient proportional to ``value``.
+    Parameters
+    ----------
+    value : float
+        Value to color.
+    low_is_good : bool, default ``False``
+        If ``True``, ``value = scale_min`` → green and ``value = scale_max``
+        → red ("lower is better" semantics: ECE,
+        deficit, drag, CV, error introduction rate…).
+        If ``False`` (default), the opposite holds ("higher is
+        better" semantics: F1, recall, correction rate…).
+    scale_max : float, default ``1.0``
+        Upper bound of the scale. Beyond it, the color saturates.
+    scale_min : float, default ``0.0``
+        Lower bound of the scale.
+    Returns
+    -------
+    str
+        Hex color in ``#rrggbb`` format.
+    """
+    span = scale_max - scale_min
+    if span <= 0:
+        f = 0.5
+    else:
+        f = (value - scale_min) / span
+    f = max(0.0, min(1.0, f))
+    if low_is_good:
+        f = 1.0 - f
+    if f <= 0.5:
+        t = f / 0.5
+        r = _interp(GRADIENT_RED_RGB[0], GRADIENT_YELLOW_RGB[0], t)
+        g = _interp(GRADIENT_RED_RGB[1], GRADIENT_YELLOW_RGB[1], t)
+        b = _interp(GRADIENT_RED_RGB[2], GRADIENT_YELLOW_RGB[2], t)
+    else:
+        t = (f - 0.5) / 0.5
+        r = _interp(GRADIENT_YELLOW_RGB[0], GRADIENT_GREEN_RGB[0], t)
+        g = _interp(GRADIENT_YELLOW_RGB[1], GRADIENT_GREEN_RGB[1], t)
+        b = _interp(GRADIENT_YELLOW_RGB[2], GRADIENT_GREEN_RGB[2], t)
+    return _rgb_to_hex(r, g, b)
+def color_single_gradient(
+    value: float,
+    *,
+    end_rgb: tuple[int, int, int],
+    max_value: float = 1.0,
+    start_rgb: tuple[int, int, int] = (255, 255, 255),
+) -> str:
+    """Simple ``start_rgb`` → ``end_rgb`` gradient proportional to ``value/max_value``.
+    Used for heatmaps that carry no "good/bad" semantics,
+    only an intensity (Jaccard, occurrence density, lexical
+    modernization rate).
+    """
+    if max_value <= 0:
+        f = 0.0
+    else:
+        f = max(0.0, min(1.0, value / max_value))
+    r = _interp(start_rgb[0], end_rgb[0], f)
+    g = _interp(start_rgb[1], end_rgb[1], f)
+    b = _interp(start_rgb[2], end_rgb[2], f)
+    return _rgb_to_hex(r, g, b)
+def color_diverging(
+    value: float,
+    *,
+    max_abs: float = 1.0,
+    negative_rgb: tuple[int, int, int] = DIVERGING_NEGATIVE_RGB,
+    neutral_rgb: tuple[int, int, int] = DIVERGING_NEUTRAL_RGB,
+    positive_rgb: tuple[int, int, int] = DIVERGING_POSITIVE_RGB,
+) -> str:
+    """Signed gradient: ``value < 0`` → ``negative_rgb`` (blue by default),
+    ``value ≈ 0`` → ``neutral_rgb`` (green by default),
+    ``value > 0`` → ``positive_rgb`` (orange by default).
+    Saturates at ``|value| = max_abs``.
+    """
+    if max_abs <= 0:
+        return _rgb_to_hex(*neutral_rgb)
+    f = max(-1.0, min(1.0, value / max_abs))
+    if f >= 0:
+        r = _interp(neutral_rgb[0], positive_rgb[0], f)
+        g = _interp(neutral_rgb[1], positive_rgb[1], f)
+        b = _interp(neutral_rgb[2], positive_rgb[2], f)
+    else:
+        t = -f
+        r = _interp(neutral_rgb[0], negative_rgb[0], t)
+        g = _interp(neutral_rgb[1], negative_rgb[1], t)
+        b = _interp(neutral_rgb[2], negative_rgb[2], t)
+    return _rgb_to_hex(r, g, b)
+def text_color_for_bg(intensity: float, *, threshold: float = 0.55) -> str:
+    """Returns ``"#fff"`` on a dark background, ``"#222"`` on a light one.
+    ``intensity`` ∈ [0, 1]: 0 = light background, 1 = very dark background.
+    For single-gradient heatmaps, this is typically the same value
+    as the one passed to :func:`color_single_gradient`.
+    """
+    return "#fff" if intensity > threshold else "#222"
+# ──────────────────────────────────────────────────────────────────
+# Public API: stepped CER scale (report badges)
+# ──────────────────────────────────────────────────────────────────
+#
+# The report's quality badges (gallery, ranking table)
+# do not use a continuous gradient but a discrete 4-step
+# scale calibrated on the usual editorial thresholds:
+#
+#   < 5 %  : green   (directly publishable quality)
+#   < 15 % : yellow  (light human proofreading)
+#   < 30 % : orange  (systematic human proofreading)
+#   ≥ 30 % : red     (catastrophic, to be redone)
+#
+# The colors are imported from :mod:`picarones.report.colors`
+# (colorblind-friendly Okabe-Ito palette, active by default).
+def cer_step_color(cer: float) -> str:
+    """CSS text color for a CER score, by steps.
+    See the scale in the documentation block above.
+    """
+    from picarones.report.colors import (
+        COLOR_GREEN,
+        COLOR_ORANGE,
+        COLOR_RED,
+        COLOR_YELLOW,
+    )
+    if cer < 0.05:
+        return COLOR_GREEN
+    if cer < 0.15:
+        return COLOR_YELLOW
+    if cer < 0.30:
+        return COLOR_ORANGE
+    return COLOR_RED
+def cer_step_bg(cer: float) -> str:
+    """CSS background color associated with :func:`cer_step_color`."""
+    from picarones.report.colors import (
+        BG_GREEN,
+        BG_ORANGE,
+        BG_RED,
+        BG_YELLOW,
+    )
+    if cer < 0.05:
+        return BG_GREEN
+    if cer < 0.15:
+        return BG_YELLOW
+    if cer < 0.30:
+        return BG_ORANGE
+    return BG_RED
+# ──────────────────────────────────────────────────────────────────
+# Public API: SVG grid
+# ──────────────────────────────────────────────────────────────────
+def build_grid_svg(
+    *,
+    n_rows: int,
+    n_cols: int,
+    row_label_fn: Callable[[int], str],
+    col_label_fn: Callable[[int], str],
+    cell_color_fn: Callable[[int, int], str],
+    cell_text_fn: Callable[[int, int], Optional[str]] = lambda r, c: None,
+    cell_text_color_fn: Callable[[int, int], str] = lambda r, c: "#222",
+    cell_w: int = 36,
+    cell_h: int = 36,
+    label_left: int = 130,
+    label_top: int = 80,
+    rotate_col_labels: bool = False,
+    aria_label: str = "Heatmap",
+    x_axis_title: Optional[str] = None,
+) -> str:
+    """Builds a parameterizable SVG heatmap.
+    Common architecture of the two historical `_build_heatmap_svg`
+    (taxonomy_cooccurrence and taxonomy_intra_doc), shared here.
+    Parameters
+    ----------
+    n_rows, n_cols : int
+        Grid dimensions.
+    row_label_fn, col_label_fn : Callable[[int], str]
+        Row labels (left) and column labels (top).
+    cell_color_fn : Callable[[int, int], str]
+        Returns the background hex color for cell (row, col).
+    cell_text_fn : Callable[[int, int], Optional[str]]
+        Text to display in the cell, or ``None`` to display nothing.
+    cell_text_color_fn : Callable[[int, int], str]
+        Text color of the cell (typically obtained via
+        :func:`text_color_for_bg`).
+    cell_w, cell_h : int
+        Dimensions of each cell in pixels.
+    label_left, label_top : int
+        Margins reserved for the labels.
+    rotate_col_labels : bool
+        If ``True``, column labels are rotated by -45°
+        (useful when they are long).
+    aria_label : str
+        Accessibility label of the SVG.
+    x_axis_title : Optional[str]
+        Optional title of the horizontal axis, shown at the bottom of the SVG.
+    Returns
+    -------
+    str
+        Complete SVG, or ``""`` if the grid is empty.
+    """
+    if n_rows == 0 or n_cols == 0:
+        return ""
+    extra_bottom = 30 if x_axis_title else 10
+    width = label_left + n_cols * cell_w + 10
+    height = label_top + n_rows * cell_h + extra_bottom
+    parts: list[str] = [
+        f'<svg xmlns="http://www.w3.org/2000/svg" '
+        f'width="{width}" height="{height}" '
+        f'viewBox="0 0 {width} {height}" '
+        f'role="img" aria-label="{_e(aria_label)}">',
+    ]
+    # Column labels
+    for j in range(n_cols):
+        cx = label_left + j * cell_w + cell_w // 2
+        cy = label_top - 6
+        label = _e(col_label_fn(j))
+        if rotate_col_labels:
+            parts.append(
+                f'<text x="{cx}" y="{cy}" '
+                f'transform="rotate(-45 {cx} {cy})" '
+                f'font-size="11" fill="#333" text-anchor="start">'
+                f'{label}</text>'
+            )
+        else:
+            parts.append(
+                f'<text x="{cx}" y="{cy}" '
+                f'font-size="10" fill="#666" text-anchor="middle">'
+                f'{label}</text>'
+            )
+    # Cells + row labels
+    for i in range(n_rows):
+        rx = label_left - 6
+        ry = label_top + i * cell_h + cell_h // 2 + 4
+        parts.append(
+            f'<text x="{rx}" y="{ry}" '
+            f'font-size="11" fill="#333" text-anchor="end">'
+            f'{_e(row_label_fn(i))}</text>'
+        )
+        for j in range(n_cols):
+            x = label_left + j * cell_w
+            y = label_top + i * cell_h
+            color = cell_color_fn(i, j)
+            parts.append(
+                f'<rect x="{x}" y="{y}" '
+                f'width="{cell_w}" height="{cell_h}" '
+                f'fill="{color}" stroke="#ddd" stroke-width="0.5"/>'
+            )
+            text = cell_text_fn(i, j)
+            if text is not None:
+                text_color = cell_text_color_fn(i, j)
+                parts.append(
+                    f'<text x="{x + cell_w // 2}" '
+                    f'y="{y + cell_h // 2 + 4}" '
+                    f'font-size="10" fill="{text_color}" '
+                    f'text-anchor="middle">'
+                    f'{_e(text)}</text>'
+                )
+    if x_axis_title:
+        cx_axis = label_left + (n_cols * cell_w) // 2
+        cy_axis = height - 6
+        parts.append(
+            f'<text x="{cx_axis}" y="{cy_axis}" '
+            f'font-size="11" fill="#666" text-anchor="middle" '
+            f'font-style="italic">'
+            f'{_e(x_axis_title)}</text>'
+        )
+    parts.append("</svg>")
+    return "".join(parts)
+__all__ = [
+    "GRADIENT_RED_RGB",
+    "GRADIENT_YELLOW_RGB",
+    "GRADIENT_GREEN_RGB",
+    "GRADIENT_TARGET_BLUE",
+    "GRADIENT_TARGET_ORANGE",
+    "GRADIENT_TARGET_RED",
+    "DIVERGING_NEGATIVE_RGB",
+    "DIVERGING_NEUTRAL_RGB",
+    "DIVERGING_POSITIVE_RGB",
+    "cer_step_color",
+    "cer_step_bg",
+    "color_traffic_light",
+    "color_single_gradient",
+    "color_diverging",
+    "text_color_for_bg",
+    "build_grid_svg",
+]
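Reviewer sketch: the bound conventions described in the module docstring are easier to check with a few concrete values. The block below is a standalone, simplified copy of the piecewise red → yellow → green interpolation, using the three RGB bounds defined in this file. The function name `traffic_light` and the helper `_mix` are illustrative only, and the package itself is not imported.

```python
RED, YELLOW, GREEN = (220, 100, 100), (240, 220, 130), (130, 200, 130)


def _mix(a, b, t):
    # Channel-wise linear interpolation, truncated and clamped to [0, 255].
    return tuple(max(0, min(255, int(x + (y - x) * t))) for x, y in zip(a, b))


def traffic_light(value, *, low_is_good=False, scale_min=0.0, scale_max=1.0):
    span = scale_max - scale_min
    # A degenerate scale falls back to the midpoint (yellow).
    f = 0.5 if span <= 0 else (value - scale_min) / span
    f = max(0.0, min(1.0, f))
    if low_is_good:
        f = 1.0 - f
    # Two linear segments: red → yellow on [0, 0.5], yellow → green on (0.5, 1].
    if f <= 0.5:
        rgb = _mix(RED, YELLOW, f / 0.5)
    else:
        rgb = _mix(YELLOW, GREEN, (f - 0.5) / 0.5)
    return "#{:02x}{:02x}{:02x}".format(*rgb)


recall_cell = traffic_light(0.75)                      # higher is better
cer_cell = traffic_light(0.0, low_is_good=True)        # lower is better
threshold_cell = traffic_light(0.30, scale_min=0.30)   # gradient starts at 0.30
```

With `scale_min=0.30`, every value at or below the threshold maps to the red end, which is the "start the gradient at a threshold" case mentioned in the docstring.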

picarones/report/report_data/__init__.py ADDED

	@@ -0,0 +1,132 @@

+"""Builds the data dict consumed by the Jinja template.
+Before the split, ``picarones.report.generator._build_report_data``
+ran to 463 lines to turn a :class:`BenchmarkResult` into a
+Jinja-ready dict. Sprint after sprint, that function piled up
+independent blocks: engines, documents, statistics, scatter plots,
+Pareto front, etc.
+This subpackage breaks the construction into thematic modules:
+- :mod:`engines`: per-engine summary (``engines_summary``).
+- :mod:`documents`: gallery view + detail view + Sprint 7 difficulty.
+- :mod:`statistics`: Wilcoxon, Friedman, Nemenyi, bootstrap CIs,
+  reliability curves, Venn, error clusters, correlations.
+- :mod:`scatter` (Sprint 10): Gini vs CER, ratio vs anchor.
+- :mod:`pareto` (Sprint 19): 3 Pareto fronts + pricing metadata.
+  Exposes two separate functions: :func:`attach_engine_costs`
+  (mutating) and :func:`build_pareto_section` (pure).
+The public API :func:`build_report_data` orchestrates these modules in
+the right order. The two-step Pareto sequence
+(``attach_engine_costs`` → ``build_pareto_section``) makes the
+mutation explicit: the ``build_*`` functions of the subpackage
+are pure, and ``attach_engine_costs`` mutates, as its name says.
+"""
+from __future__ import annotations
+from typing import TYPE_CHECKING
+if TYPE_CHECKING:
+    from picarones.core.results import BenchmarkResult
+from picarones.report.report_data.documents import (
+    annotate_documents_with_difficulty,
+    build_documents,
+)
+from picarones.report.report_data.engines import build_engines_summary
+from picarones.report.report_data.extra_metrics import (
+    compute_marginal_cost_section,
+    compute_rare_token_recall_per_engine,
+    compute_taxonomy_cooccurrence_section,
+    compute_taxonomy_intra_doc_section,
+)
+from picarones.report.report_data.pareto import (
+    attach_engine_costs,
+    build_pareto_section,
+)
+from picarones.report.report_data.scatter import (
+    build_gini_vs_cer,
+    build_ratio_vs_anchor,
+)
+from picarones.report.report_data.statistics import (
+    build_bootstrap_cis,
+    build_correlation_per_engine,
+    build_error_clusters,
+    build_friedman_and_nemenyi,
+    build_pairwise_wilcoxon,
+    build_reliability_curves,
+    build_venn_data,
+)
+def build_report_data(
+    benchmark: "BenchmarkResult", images_b64: dict[str, str],
+) -> dict:
+    """Turns a :class:`BenchmarkResult` into a dict for the HTML report.
+    Critical order:
+    1. Build ``engines_summary`` (pure).
+    2. Build ``documents``, then annotate them with difficulty (mutates
+       ``documents``).
+    3. **Attach** the costs to ``engines_summary`` (mutates, explicit
+       name).
+    4. **Build** the Pareto block (pure, reads the attached costs).
+    """
+    engines_summary = build_engines_summary(benchmark)
+    documents = build_documents(benchmark, images_b64)
+    annotate_documents_with_difficulty(benchmark, documents)
+    attach_engine_costs(engines_summary, benchmark)
+    pareto_data = build_pareto_section(engines_summary)
+    return {
+        "meta": {
+            "corpus_name": benchmark.corpus_name,
+            "corpus_source": benchmark.corpus_source,
+            "document_count": benchmark.document_count,
+            "run_date": benchmark.run_date,
+            "picarones_version": benchmark.picarones_version,
+            "metadata": benchmark.metadata,
+        },
+        "ranking": benchmark.ranking(),
+        "engines": engines_summary,
+        "documents": documents,
+        # Sprint 7
+        "statistics": {
+            "pairwise_wilcoxon": build_pairwise_wilcoxon(benchmark),
+            "bootstrap_cis": build_bootstrap_cis(benchmark),
+            **build_friedman_and_nemenyi(benchmark),
+        },
+        "reliability_curves": build_reliability_curves(benchmark),
+        "venn_data": build_venn_data(benchmark),
+        "error_clusters": build_error_clusters(benchmark),
+        "correlation_per_engine": build_correlation_per_engine(benchmark),
+        # Sprint 10
+        "gini_vs_cer": build_gini_vs_cer(benchmark),
+        "ratio_vs_anchor": build_ratio_vs_anchor(benchmark),
+        # Sprint 19: cost/quality Pareto view with axis variants
+        "pareto": pareto_data,
+        # Sprint 36: inter-engine analysis (taxonomic divergence +
+        # complementarity / oracle).  ``None`` with fewer than 2 engines.
+        "inter_engine_analysis": benchmark.inter_engine_analysis,
+        # Sprint 45-46: stratification by script_type
+        "available_strata": benchmark.available_strata(),
+        "stratified_ranking": benchmark.stratified_ranking() or None,
+        "corpus_homogeneity": benchmark.corpus_homogeneity(),
+        # "Wiring of test-only modules" sprint (May 2026): corpus-wide
+        # metrics that were not surfaced in the report until now.
+        # Sprint 71 (A.I.1): recall on rare tokens (hapax + dis legomena).
+        "rare_token_recall": compute_rare_token_recall_per_engine(benchmark),
+        # Sprint 75 (A.I.4): inter-class taxonomic co-occurrence.
+        "taxonomy_cooccurrence": compute_taxonomy_cooccurrence_section(benchmark),
+        # Sprint 76 (A.I.4): class × position heatmap (intra-document).
+        "taxonomy_intra_doc": compute_taxonomy_intra_doc_section(benchmark),
+        # Sprint 91 (A.II.6): marginal cost matrix between engine pairs.
+        "marginal_cost": compute_marginal_cost_section(engines_summary),
+    }
+__all__ = ["build_report_data"]
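Reviewer sketch: the "attach, then build" ordering documented in `build_report_data` can be illustrated with a toy version of the two steps. This is a sketch under stated assumptions, not the contents of `report_data/pareto.py`, which this view does not show. The field names `name`, `cer` and `cost`, the single cost axis, and both function names are hypothetical.

```python
def attach_costs(engines, cost_by_engine):
    """Mutating step: adds a hypothetical ``cost`` field to each engine dict."""
    for engine in engines:
        engine["cost"] = cost_by_engine.get(engine["name"])


def build_pareto_front(engines):
    """Pure step: names of engines not dominated on (cost, cer), lower is better."""
    candidates = [e for e in engines if e.get("cost") is not None]
    front = []
    for e in candidates:
        # `o` dominates `e` if it is at least as good on both axes
        # and strictly better on at least one.
        dominated = any(
            o["cost"] <= e["cost"]
            and o["cer"] <= e["cer"]
            and (o["cost"] < e["cost"] or o["cer"] < e["cer"])
            for o in candidates
        )
        if not dominated:
            front.append(e["name"])
    return front


engines = [
    {"name": "A", "cer": 0.05},
    {"name": "B", "cer": 0.10},
    {"name": "C", "cer": 0.12},
    {"name": "D", "cer": 0.20},
]
front_before = build_pareto_front(engines)      # no costs attached yet
attach_costs(engines, {"A": 10.0, "B": 2.0, "C": 5.0})
front_after = build_pareto_front(engines)
```

Calling the pure step before the mutating one yields an empty front, which is why the docstring calls the order critical.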

picarones/report/report_data/_helpers.py ADDED

	@@ -0,0 +1,30 @@

+"""Numeric helpers internal to the report_data subpackage.
+Small utility functions shared by all section builders
+(engines, documents, statistics, scatter, pareto). Do not
+import them from outside the subpackage: these helpers are
+specific to the conventions of the JSON dict consumed by the template.
+"""
+from __future__ import annotations
+from typing import Optional
+def safe_round(v: Optional[float], decimals: int = 4) -> float:
+    """Rounds an optional float; ``None`` becomes ``0.0``."""
+    return round(v or 0.0, decimals)
+def percent_string(v: Optional[float], decimals: int = 2) -> str:
+    """Formats a ratio ∈ [0, 1] as a percentage string: ``0.4723 → "47.23 %"``.
+    ``None`` → ``"—"``. Kept for backward compatibility with possible
+    external callers (historical Sprint 7).
+    """
+    if v is None:
+        return "—"
+    return f"{v * 100:.{decimals}f} %"
+__all__ = ["safe_round", "percent_string"]

picarones/report/report_data/documents.py ADDED

	@@ -0,0 +1,167 @@

+"""Builds the ``documents`` list (gallery view + detail view).
+For each document in the corpus, aggregates the hypotheses of all
+engines with their metrics, the character-by-character diff, and
+the fields specific to OCR+LLM pipelines (intermediate output, mode,
+over-normalization).
+:func:`annotate_documents_with_difficulty` then enriches each
+document with its intrinsic difficulty score (Sprint 7).
+"""
+from __future__ import annotations
+from typing import TYPE_CHECKING
+from picarones.core.diff_utils import compute_char_diff, compute_word_diff
+from picarones.measurements.difficulty import (
+    compute_all_difficulties,
+    difficulty_label,
+)
+from picarones.report.report_data._helpers import safe_round
+if TYPE_CHECKING:
+    from picarones.core.results import BenchmarkResult
+def build_documents(
+    benchmark: "BenchmarkResult", images_b64: dict[str, str],
+) -> list[dict]:
+    """Returns the ordered list of documents ready for the template.
+    Documents keep their order of first appearance (first engine
+    first, then additions from the following engines if some
+    documents are not covered by every engine).
+    """
+    seen_doc_ids: set[str] = set()
+    doc_ids_ordered: list[str] = []
+    for report in benchmark.engine_reports:
+        for dr in report.document_results:
+            if dr.doc_id not in seen_doc_ids:
+                seen_doc_ids.add(dr.doc_id)
+                doc_ids_ordered.append(dr.doc_id)
+    # Cross index: doc_id → {engine_name → DocumentResult}
+    doc_engine_map: dict[str, dict] = {did: {} for did in doc_ids_ordered}
+    for report in benchmark.engine_reports:
+        for dr in report.document_results:
+            doc_engine_map.setdefault(dr.doc_id, {})[report.engine_name] = dr
+    documents: list[dict] = []
+    engine_names = [r.engine_name for r in benchmark.engine_reports]
+    for doc_id in doc_ids_ordered:
+        engine_results: list[dict] = []
+        gt = ""
+        image_path = ""
+        for engine_name in engine_names:
+            dr = doc_engine_map[doc_id].get(engine_name)
+            if dr is None:
+                continue
+            gt = dr.ground_truth
+            image_path = dr.image_path
+            er_entry = _build_engine_result_entry(engine_name, dr)
+            engine_results.append(er_entry)
+        # Mean CER on this document (for the gallery badge)
+        cer_values = [er["cer"] for er in engine_results if er["error"] is None]
+        mean_cer = sum(cer_values) / len(cer_values) if cer_values else 1.0
+        best_engine = min(engine_results, key=lambda x: x["cer"], default=None)
+        # Script type (from per-document metadata if available)
+        script_type = ""
+        first_engine = engine_names[0] if engine_names else None
+        first_dr = doc_engine_map[doc_id].get(first_engine)
+        if first_dr and first_dr.image_quality:
+            script_type = first_dr.image_quality.get("script_type", "")
+        documents.append({
+            "doc_id": doc_id,
+            "image_path": image_path,
+            "image_b64": images_b64.get(doc_id, ""),
+            "ground_truth": gt,
+            "mean_cer": safe_round(mean_cer),
+            "best_engine": best_engine["engine"] if best_engine else "",
+            "engine_results": engine_results,
+            "script_type": script_type,
+        })
+    return documents
+def _build_engine_result_entry(engine_name: str, dr) -> dict:
+    """Builds one engine entry for a given document (extracted for readability)."""
+    diff_ops = compute_char_diff(dr.ground_truth, dr.hypothesis)
+    er_entry: dict = {
+        "engine": engine_name,
+        "hypothesis": dr.hypothesis,
+        "cer": safe_round(dr.metrics.cer),
+        "cer_diplomatic": safe_round(dr.metrics.cer_diplomatic) if dr.metrics.cer_diplomatic is not None else None,
+        "wer": safe_round(dr.metrics.wer),
+        "mer": safe_round(dr.metrics.mer),
+        "wil": safe_round(dr.metrics.wil),
+        "duration": dr.duration_seconds,
+        "error": dr.engine_error,
+        "diff": diff_ops,
+    }
+    # Fields specific to OCR+LLM pipelines
+    if dr.ocr_intermediate is not None:
+        er_entry["ocr_intermediate"] = dr.ocr_intermediate
+        er_entry["ocr_diff"] = compute_word_diff(dr.ground_truth, dr.ocr_intermediate)
+        er_entry["llm_correction_diff"] = compute_word_diff(dr.ocr_intermediate, dr.hypothesis)
+    if dr.pipeline_metadata:
+        on = dr.pipeline_metadata.get("over_normalization")
+        if on is not None:
+            er_entry["over_normalization"] = on
+        er_entry["pipeline_mode"] = dr.pipeline_metadata.get("pipeline_mode")
+    # Sprint 5: advanced per-document metrics
+    if dr.char_scores is not None:
+        er_entry["ligature_score"] = safe_round(dr.char_scores.get("ligature", {}).get("score"))
+        er_entry["diacritic_score"] = safe_round(dr.char_scores.get("diacritic", {}).get("score"))
+    if dr.taxonomy is not None:
+        er_entry["taxonomy"] = dr.taxonomy
+    if dr.structure is not None:
+        er_entry["structure"] = dr.structure
+    if dr.image_quality is not None:
+        er_entry["image_quality"] = dr.image_quality
+    # Sprint 10
+    if dr.line_metrics is not None:
+        er_entry["line_metrics"] = dr.line_metrics
+    if dr.hallucination_metrics is not None:
+        er_entry["hallucination_metrics"] = dr.hallucination_metrics
+    return er_entry
+def annotate_documents_with_difficulty(
+    benchmark: "BenchmarkResult", documents: list[dict],
+) -> None:
+    """Annotates each document of the dict with its difficulty score (Sprint 7).
+    Modifies ``documents`` in place. The default values ``0.5`` /
+    ``"Modéré"`` are used if the difficulty could not be
+    computed (for example on a degenerate corpus).
+    """
+    doc_ids_ordered = [d["doc_id"] for d in documents]
+    gt_map = {d["doc_id"]: d["ground_truth"] for d in documents}
+    cer_map: dict[str, dict[str, float]] = {d["doc_id"]: {} for d in documents}
+    iq_map: dict[str, float] = {}
+    for report in benchmark.engine_reports:
+        for dr in report.document_results:
+            cer_map.setdefault(dr.doc_id, {})[report.engine_name] = safe_round(dr.metrics.cer)
+            if dr.image_quality and "quality_score" in dr.image_quality:
+                iq_map[dr.doc_id] = dr.image_quality["quality_score"]
+    difficulty_scores = compute_all_difficulties(
+        doc_ids=doc_ids_ordered,
+        ground_truths=gt_map,
+        cer_map=cer_map,
+        image_quality_map=iq_map or None,
+    )
+    for doc in documents:
+        ds = difficulty_scores.get(doc["doc_id"])
+        if ds:
+            doc["difficulty_score"] = safe_round(ds.score)
+            doc["difficulty_label"] = difficulty_label(ds.score)
+        else:
+            doc["difficulty_score"] = 0.5
+            doc["difficulty_label"] = "Modéré"
+__all__ = ["build_documents", "annotate_documents_with_difficulty"]
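
The fallback behaviour described in the docstring of `annotate_documents_with_difficulty` can be sketched in isolation. This is a minimal illustration under stated assumptions, not the project's code: `annotate_with_fallback`, the plain `scores` dict and the `label_for` callback are hypothetical stand-ins for `compute_all_difficulties` and `difficulty_label`, whose real signatures are not shown in this diff.

```python
from typing import Callable

# Defaults taken from the docstring above.
DEFAULT_SCORE = 0.5
DEFAULT_LABEL = "Modéré"


def annotate_with_fallback(
    documents: list[dict],
    scores: dict[str, float],            # hypothetical: doc_id -> difficulty score
    label_for: Callable[[float], str],   # hypothetical stand-in for difficulty_label
) -> None:
    """Annotate each document in place, using defaults when no score exists."""
    for doc in documents:
        score = scores.get(doc["doc_id"])
        # An explicit None check is used here because a float score of 0.0
        # is falsy and would otherwise be replaced by the default.
        if score is not None:
            doc["difficulty_score"] = round(score, 4)
            doc["difficulty_label"] = label_for(score)
        else:
            doc["difficulty_score"] = DEFAULT_SCORE
            doc["difficulty_label"] = DEFAULT_LABEL
```

Mutating the list in place mirrors the original function, which returns `None`; callers that need the untouched input would have to copy it first.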

picarones/report/report_data/engines.py ADDED

	@@ -0,0 +1,103 @@

+"""Build the per-engine summary (``engines_summary``).
+For each ``EngineReport``, collects the aggregated metrics (CER, WER,
+MER, WIL), the CER distribution for the histogram, advanced
+heritage-text metrics (Sprint 5), the error distribution (Sprint 10), NER
+(Sprint 41), calibration (Sprint 43), philological profile (Sprint
+62), searchability + numerical sequences (Sprint 86), readability
+(Sprint 87) and OCR+LLM pipeline indicators.
+Costs (mean duration, price per 1k pages, CO₂) are added
+later by :mod:`picarones.report.report_data.pareto`, which
+needs them to compute the Pareto fronts.
+"""
+from __future__ import annotations
+from typing import TYPE_CHECKING
+from picarones.report.report_data._helpers import safe_round
+if TYPE_CHECKING:
+    from picarones.core.results import BenchmarkResult
+def build_engines_summary(benchmark: "BenchmarkResult") -> list[dict]:
+    """Return the list of engine dicts, one entry per ``EngineReport``."""
+    engines_summary: list[dict] = []
+    for report in benchmark.engine_reports:
+        agg = report.aggregated_metrics
+        diplo_agg = agg.get("cer_diplomatic", {})
+        line_metrics = report.aggregated_line_metrics
+        halluc = report.aggregated_hallucination
+        entry: dict = {
+            "name": report.engine_name,
+            "version": report.engine_version,
+            "cer":  safe_round(agg.get("cer", {}).get("mean")),
+            "wer":  safe_round(agg.get("wer", {}).get("mean")),
+            "mer":  safe_round(agg.get("mer", {}).get("mean")),
+            "wil":  safe_round(agg.get("wil", {}).get("mean")),
+            "cer_median": safe_round(agg.get("cer", {}).get("median")),
+            "cer_min":    safe_round(agg.get("cer", {}).get("min")),
+            "cer_max":    safe_round(agg.get("cer", {}).get("max")),
+            "doc_count":  agg.get("document_count", 0),
+            "failed":     agg.get("failed_count", 0),
+            # Diplomatic CER (after historical normalization: ſ=s, u=v, i=j…)
+            "cer_diplomatic": safe_round(diplo_agg.get("mean")) if diplo_agg else None,
+            "cer_diplomatic_profile": diplo_agg.get("profile"),
+            # Distribution for the histogram: list of per-document CER values
+            "cer_values": [
+                safe_round(dr.metrics.cer)
+                for dr in report.document_results
+                if dr.metrics.error is None
+            ],
+            "cer_diplomatic_values": [
+                safe_round(dr.metrics.cer_diplomatic)
+                for dr in report.document_results
+                if dr.metrics.error is None and dr.metrics.cer_diplomatic is not None
+            ],
+            # OCR+LLM pipeline fields (empty for OCR-only engines)
+            "is_pipeline": report.is_pipeline,
+            "pipeline_info": report.pipeline_info,
+            # Sprint 5: advanced heritage-text metrics
+            "ligature_score": safe_round(report.ligature_score) if report.ligature_score is not None else None,
+            "diacritic_score": safe_round(report.diacritic_score) if report.diacritic_score is not None else None,
+            "aggregated_confusion": report.aggregated_confusion,
+            "aggregated_taxonomy": report.aggregated_taxonomy,
+            "aggregated_structure": report.aggregated_structure,
+            "aggregated_image_quality": report.aggregated_image_quality,
+            # Sprint 10: error distribution + VLM hallucinations
+            "gini": safe_round(line_metrics.get("gini_mean")) if line_metrics else None,
+            "cer_p90": safe_round(line_metrics.get("percentiles", {}).get("p90")) if line_metrics else None,
+            "cer_p99": safe_round(line_metrics.get("percentiles", {}).get("p99")) if line_metrics else None,
+            "catastrophic_rate_30": safe_round(line_metrics.get("catastrophic_rate", {}).get("0.3")) if line_metrics else None,
+            "aggregated_line_metrics": line_metrics,
+            "anchor_score": safe_round(halluc.get("anchor_score_mean")) if halluc else None,
+            "length_ratio": safe_round(halluc.get("length_ratio_mean")) if halluc else None,
+            "hallucinating_doc_rate": safe_round(halluc.get("hallucinating_doc_rate")) if halluc else None,
+            "aggregated_hallucination": halluc,
+            # Sprint 41: aggregated NER (None if nothing was computed)
+            "aggregated_ner": report.aggregated_ner,
+            # Sprint 43: aggregated calibration (None if the engine exposed
+            # no confidence values on this corpus)
+            "aggregated_calibration": report.aggregated_calibration,
+            # Sprint 62: aggregated philological profile (None if there is no
+            # philological signal in the corpus for this engine)
+            "aggregated_philological": report.aggregated_philological,
+            # Sprint 86: A.II.5 (fuzzy searchability + numerical
+            # sequences). None if no document carries a signal.
+            "aggregated_searchability": report.aggregated_searchability,
+            "aggregated_numerical_sequences": (
+                report.aggregated_numerical_sequences
+            ),
+            # Sprint 87: A.II.2 (aggregated Flesch delta)
+            "aggregated_readability": report.aggregated_readability,
+            "is_vlm": report.pipeline_info.get("is_vlm", False) if report.pipeline_info else False,
+        }
+        engines_summary.append(entry)
+    return engines_summary
+__all__ = ["build_engines_summary"]
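
The summary builder leans on one recurring pattern: a chained `.get()` with an empty-dict default, wrapped in `safe_round`, so that a missing metric becomes `None` rather than raising. The sketch below illustrates that pattern only. The real `safe_round` lives in `picarones.report.report_data._helpers` and is not shown in this diff, so the version here (including its 4-digit precision) is an assumption, and `summarize` is a hypothetical helper.

```python
from typing import Optional


def safe_round(value: Optional[float], ndigits: int = 4) -> Optional[float]:
    """Assumed behaviour: round a number, pass None through unchanged."""
    if value is None:
        return None
    return round(value, ndigits)


def summarize(agg: dict) -> dict:
    """Hypothetical helper: extract a few aggregated metrics, tolerating gaps."""
    return {
        # agg.get("cer", {}) yields an empty dict when the metric is absent,
        # so the second .get() returns None instead of raising.
        "cer": safe_round(agg.get("cer", {}).get("mean")),
        "cer_median": safe_round(agg.get("cer", {}).get("median")),
        # Counters default to 0 rather than None.
        "doc_count": agg.get("document_count", 0),
    }


if __name__ == "__main__":
    print(summarize({}))
    print(summarize({"cer": {"mean": 0.123456, "median": 0.1}, "document_count": 3}))
```

The pattern assumes that a metric key, when present, always maps to a dict; a key explicitly set to `None` would still raise `AttributeError` on the second `.get()`.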