Spaces:

Ma-Ri-Ba-Ku
/

Picarones

Sleeping

Claude commited on May 13

Commit

5e48c0b

unverified ·

1 Parent(s): b24a43b

post-rewrite wiring audit: Phases 1-5 (sécurité, méthodologie, moteurs, zombie, naming)

Réconciliation post-rewrite v2.0 d'après l'audit de chemins UI/API/runner.
Le rewrite avait laissé des options ignorées, des moteurs annoncés sans
backend, des chemins libres exposés, et un round-trip JSON appauvri.

# Phase 1 — Sécurité P0
- `output_dir` validé via ``validated_path`` dans
``api_htr_united_import`` et ``api_huggingface_import`` (path traversal).
- `db_path` validé dans ``/api/history/regressions`` ; env
``PICARONES_HISTORY_DB`` pour racines externes.
- ``flatten_zip_to_dir`` : détection des collisions de basename
(``a/img.png`` + ``b/img.png`` → renommage avec préfixe slug du dirname)
et appel à ``validate_image_safe`` sur les images extraites (anti zip
bomb passant les 500 Mo brut).

# Phase 2 — Méthodologie P0
- ``CompetitorConfig.pipeline_mode`` typé ``Literal[text_only,
text_and_image, zero_shot]`` ; suppression du fallback silencieux
``mode_map.get(..., "text_only")`` qui aliasait toute chaîne invalide.
- ``BenchmarkResult.from_dict`` / ``from_json_object`` restaurent
fidèlement les analyses avancées (confusion, taxonomy, NER, calibration,
philological, searchability, …) ; ``ReportGenerator.from_json`` délègue
→ rapport régénéré indistinguable du in-memory.
- ``partial_store`` : fingerprint SHA-256 stable (engine_config,
normalization_profile, char_exclude, fichiers corpus + mtime/size,
code version) suffixé au nom du partial → plus de réutilisation
illégale entre runs avec configs différentes.

# Phase 3 — Moteurs fantômes implémentés
- Nouveaux ``KrakenAdapter`` et ``CalamariAdapter`` (couche 5), lazy
imports + extras pyproject ``[kraken]`` / ``[calamari]``.
- ``ocr_adapter_from_name`` les expose ; ``_OCR_KWARGS_BUILDERS``
les mappe. CLI ``engines`` source-de-vérité unique avec
``/api/engines`` (matrice CLI ≡ Web).

# Phase 4 — Code zombie / wiring
- ``upload_purge_task`` (RGPD) branché au lifespan FastAPI ;
``create_job(payload={"corpus": req.corpus_path})`` pour que la
purge identifie les corpus actifs.
- ``/api/benchmark/start`` délègue à ``run_benchmark_thread_v2`` après
conversion ``BenchmarkRequest → BenchmarkRunRequest`` ; un seul worker
à patcher, ``/start`` marqué deprecated.
- ``HTRUnitedCatalogue.from_remote(timeout=5)`` avec fallback demo et
champ ``is_demo`` exposé pour l'UI ; ``PICARONES_HTR_UNITED_OFFLINE``
pour CI.

# Phase 5 — Naming
- ``CompetitorConfig`` → ``PipelineConfig`` (rupture immédiate, 11
fichiers touchés).

Tests : 4643 passed (vs 4600 avant), 12 skipped, 0 failed.
Nouveau fichier ``tests/security/test_phase1_post_rewrite_wiring.py``
(40 tests) couvre les 5 phases.

https://claude.ai/code/session_01ArfZ8kcgv7Cyda7VbJVmpn

Files changed (36) hide show

CLAUDE.md +2 -2
README.md +3 -1
picarones/adapters/ocr/__init__.py +8 -0
picarones/adapters/ocr/calamari.py +249 -0
picarones/adapters/ocr/factory.py +17 -0
picarones/adapters/ocr/kraken.py +236 -0
picarones/app/services/benchmark_runner.py +65 -3
picarones/app/services/partial_store.py +178 -40
picarones/evaluation/benchmark_result.py +132 -3
picarones/evaluation/metric_result.py +25 -0
picarones/interfaces/cli/__init__.py +47 -15
picarones/interfaces/web/app.py +30 -2
picarones/interfaces/web/benchmark_utils.py +97 -127
picarones/interfaces/web/corpus_utils.py +112 -16
picarones/interfaces/web/models.py +30 -4
picarones/interfaces/web/routers/benchmark.py +12 -5
picarones/interfaces/web/routers/history.py +43 -2
picarones/interfaces/web/routers/importers.py +64 -11
picarones/reports/html/generator.py +11 -48
pyproject.toml +1 -0
scripts/gen_readme_tables.py +2 -0
tests/app/test_s9_resolver_collision.py +1 -1
tests/app/test_sprint_d2b_partial_dir_resume.py +32 -5
tests/architecture/test_file_budgets.py +11 -2
tests/docs/test_readme_consistency.py +7 -0
tests/docs/test_readme_dual_lang.py +6 -4
tests/evaluation/metrics/test_sprint12_nouvelles_fonctionnalites.py +10 -4
tests/integration/test_s9_prompt_loading_defenses.py +3 -3
tests/security/test_phase1_post_rewrite_wiring.py +1013 -0
tests/security/test_s1_zip_slip_attack.py +15 -6
tests/web/routers/test_s4_history_router.py +31 -9
tests/web/routers/test_s8_benchmark_router_branches.py +1 -1
tests/web/test_s8_benchmark_utils_factory.py +58 -24
tests/web/test_s9_ocr_engine_naming_contract.py +5 -5
tests/web/test_s9_prompt_loading.py +4 -4
tests/web/test_sprint6_web_interface.py +28 -7

CLAUDE.md CHANGED Viewed

@@ -116,7 +116,7 @@ picarones/
 ## État des tests et bugs historiques
-`pytest tests/` → **4600 passed, 12 skipped, 8 deselected, 0 failed**
 (post-S59).  Les deselected sont les markers `live` (5 tests d'intégration
 contre vraie API/binaire) + `network` (3 tests qui hit le réseau réel),
 opt-in en local via `pytest -m live` ou `pytest -m network`.  Le
@@ -268,7 +268,7 @@ détecte, arbitre, rend.
 ## Contexte développement
 - **Environnement** : GitHub Codespaces, Python 3.11+
-- **Tests** : `pytest tests/ -q` → 4600 passed, 9 skipped, 24
   deselected, 0 failed (post-v2.0).
 - **Manifeste architecture** : [`docs/explanation/architecture.md`](docs/explanation/architecture.md).
 - **API publique stable** : [`docs/reference/api-stable.md`](docs/reference/api-stable.md).

 ## État des tests et bugs historiques
+`pytest tests/` → **4650 passed, 12 skipped, 8 deselected, 0 failed**
 (post-S59).  Les deselected sont les markers `live` (5 tests d'intégration
 contre vraie API/binaire) + `network` (3 tests qui hit le réseau réel),
 opt-in en local via `pytest -m live` ou `pytest -m network`.  Le
 ## Contexte développement
 - **Environnement** : GitHub Codespaces, Python 3.11+
+- **Tests** : `pytest tests/ -q` → 4650 passed, 9 skipped, 24
   deselected, 0 failed (post-v2.0).
 - **Manifeste architecture** : [`docs/explanation/architecture.md`](docs/explanation/architecture.md).
 - **API publique stable** : [`docs/reference/api-stable.md`](docs/reference/api-stable.md).

README.md CHANGED Viewed

@@ -201,7 +201,9 @@ For Docker, institutional deployment, or HuggingFace Spaces, see
 | Engine | Type | Installation |
 |--------|------|-------------|
 | **Azure Doc Intelligence** | Cloud API | `AZURE_DOC_INTEL_ENDPOINT` + `AZURE_DOC_INTEL_KEY` |
 | **Google Vision** | Cloud API | `GOOGLE_APPLICATION_CREDENTIALS` env var |
 | **Mistral OCR** | Cloud API | `MISTRAL_API_KEY` env var |
 | **Pero OCR** | Local Python | `pip install -e .[pero]` |
 | **Tesseract 5** | Local CLI | `pip install pytesseract` + system binary |
@@ -395,7 +397,7 @@ ruff check picarones/ tests/
 python -m mypy picarones/core/
 ```
-**Test suite**: ~4600 tests, ~3 min on a modern laptop. Coverage
 floor at 85% (currently ~87%). The `network` marker excludes tests
 requiring live HTTP. A handful of tests depend on optional engines
 (`pero-ocr`, `pytesseract`) and are skipped/fail gracefully when

 | Engine | Type | Installation |
 |--------|------|-------------|
 | **Azure Doc Intelligence** | Cloud API | `AZURE_DOC_INTEL_ENDPOINT` + `AZURE_DOC_INTEL_KEY` |
+| **Calamari OCR** | Local Python | `pip install -e .[calamari]` + checkpoint |
 | **Google Vision** | Cloud API | `GOOGLE_APPLICATION_CREDENTIALS` env var |
+| **Kraken HTR** | Local Python | `pip install -e .[kraken]` + modèle `.mlmodel` |
 | **Mistral OCR** | Cloud API | `MISTRAL_API_KEY` env var |
 | **Pero OCR** | Local Python | `pip install -e .[pero]` |
 | **Tesseract 5** | Local CLI | `pip install pytesseract` + system binary |
 python -m mypy picarones/core/
 ```
+**Test suite**: ~4650 tests, ~3 min on a modern laptop. Coverage
 floor at 85% (currently ~87%). The `network` marker excludes tests
 requiring live HTTP. A handful of tests depend on optional engines
 (`pero-ocr`, `pytesseract`) and are skipped/fail gracefully when

picarones/adapters/ocr/__init__.py CHANGED Viewed

@@ -8,6 +8,10 @@ Implémentations livrées
 -----------------------
 - ``TesseractAdapter`` — Tesseract 5 (OSS, CPU-bound).
 - ``PeroOCRAdapter`` — Pero OCR (manuscrits, GPU recommandé).
 - ``MistralOCRAdapter`` — Mistral OCR API (cloud).
 - ``GoogleVisionAdapter`` — Google Vision API (cloud).
 - ``AzureDocIntelAdapter`` — Azure Document Intelligence (cloud).
@@ -20,8 +24,10 @@ from __future__ import annotations
 from picarones.adapters.ocr.azure_doc_intel import AzureDocIntelAdapter
 from picarones.adapters.ocr.base import BaseOCRAdapter, OCRAdapterError
 from picarones.adapters.ocr.factory import ocr_adapter_from_name
 from picarones.adapters.ocr.google_vision import GoogleVisionAdapter
 from picarones.adapters.ocr.mistral_ocr import MistralOCRAdapter
 from picarones.adapters.ocr.pero_ocr import PeroOCRAdapter
 from picarones.adapters.ocr.precomputed import PrecomputedTextAdapter
@@ -31,7 +37,9 @@ __all__ = [
     "BaseOCRAdapter",
     "OCRAdapterError",
     "AzureDocIntelAdapter",
     "GoogleVisionAdapter",
     "MistralOCRAdapter",
     "PeroOCRAdapter",
     "PrecomputedTextAdapter",

 -----------------------
 - ``TesseractAdapter`` — Tesseract 5 (OSS, CPU-bound).
 - ``PeroOCRAdapter`` — Pero OCR (manuscrits, GPU recommandé).
+- ``KrakenAdapter`` — Kraken HTR (manuscrits + imprimés anciens,
+  écosystème HTR-United). Phase 3 chantier post-rewrite.
+- ``CalamariAdapter`` — Calamari OCR (imprimés historiques,
+  TensorFlow). Phase 3 chantier post-rewrite.
 - ``MistralOCRAdapter`` — Mistral OCR API (cloud).
 - ``GoogleVisionAdapter`` — Google Vision API (cloud).
 - ``AzureDocIntelAdapter`` — Azure Document Intelligence (cloud).
 from picarones.adapters.ocr.azure_doc_intel import AzureDocIntelAdapter
 from picarones.adapters.ocr.base import BaseOCRAdapter, OCRAdapterError
+from picarones.adapters.ocr.calamari import CalamariAdapter
 from picarones.adapters.ocr.factory import ocr_adapter_from_name
 from picarones.adapters.ocr.google_vision import GoogleVisionAdapter
+from picarones.adapters.ocr.kraken import KrakenAdapter
 from picarones.adapters.ocr.mistral_ocr import MistralOCRAdapter
 from picarones.adapters.ocr.pero_ocr import PeroOCRAdapter
 from picarones.adapters.ocr.precomputed import PrecomputedTextAdapter
     "BaseOCRAdapter",
     "OCRAdapterError",
     "AzureDocIntelAdapter",
+    "CalamariAdapter",
     "GoogleVisionAdapter",
+    "KrakenAdapter",
     "MistralOCRAdapter",
     "PeroOCRAdapter",
     "PrecomputedTextAdapter",

picarones/adapters/ocr/calamari.py ADDED Viewed

	@@ -0,0 +1,249 @@

+"""``CalamariAdapter`` — adapter pour Calamari OCR.
+Implémente le contrat ``BaseOCRAdapter`` (couche 5) :
+``execute(inputs, params, context) → dict[ArtifactType, Artifact]``.
+Cas d'usage BnF
+---------------
+Calamari est un OCR open-source basé TensorFlow / Keras, conçu pour
+les imprimés historiques et la transcription ligne par ligne.
+Modèles disponibles via OCR-D, Wikisource, et le hub Calamari.
+Particulièrement performant en ensemble (vote multi-modèles).
+Configuration
+-------------
+Constructeur :
+- ``name`` (défaut ``"calamari"``) : identifiant de l'instance.
+- ``checkpoint`` (obligatoire) : chemin vers le modèle Calamari
+  (fichier ``.ckpt`` ou répertoire de modèles pour le voting).
+- ``voter`` (défaut ``"confidence_voter_default_ctc"``) : stratégie
+  de vote quand plusieurs modèles sont passés en ensemble.
+- ``batch_size`` (défaut ``1``) : taille de batch pour l'inférence
+  ligne par ligne.  ``1`` privilégie la simplicité ; augmenter pour
+  un gain de débit GPU.
+Comportement
+------------
+1. Vérifie qu'un ``Artifact`` ``IMAGE`` est présent.
+2. Lazy-import de ``calamari_ocr`` et ``PIL``.
+3. Charge le ``Predictor`` (cache par instance).
+4. Calamari attend des **lignes** d'image, pas des pages.  L'adapter
+   ne fait pas de segmentation : il OCRise l'image entière comme
+   une ligne unique.  Pour un workflow page → lignes, l'utilisateur
+   doit pré-segmenter (Kraken pageseg ou OCR-D segmenter) et appeler
+   Calamari sur chaque ligne séparément — futur enrichissement à
+   prévoir quand un consommateur en aura besoin.
+5. Écrit la prédiction dans ``<stem>.<name>.txt``.
+Anti-sur-ingénierie
+-------------------
+- Pas de segmentation embarquée — Calamari est un *line recognizer*,
+  pas un page OCR.  L'utilisateur compose avec un segmenter externe
+  s'il a besoin du flux page → lignes.
+- Pas de confidences pour l'instant — Calamari expose
+  ``Prediction.avg_char_probability`` qui pourra alimenter un
+  ``CONFIDENCES`` artifact dans une itération future.
+- Modèle chargé une fois par instance, partagé entre appels successifs
+  (Predictor TensorFlow non recréé à chaque image).
+"""
+from __future__ import annotations
+from pathlib import Path
+from typing import Any
+from picarones.adapters.ocr.base import BaseOCRAdapter, OCRAdapterError
+from picarones.adapters.output_paths import resolve_output_path
+from picarones.domain.artifacts import Artifact, ArtifactType
+class CalamariAdapter(BaseOCRAdapter):
+    """Adapter Calamari OCR (Sprint Phase 3 — chantier post-rewrite).
+    Parameters
+    ----------
+    name:
+        Identifiant lisible de l'instance.  Défaut ``"calamari"``.
+        Doit être alphanumérique + ``_-``.
+    checkpoint:
+        Chemin vers le checkpoint Calamari (``.ckpt`` ou dossier de
+        modèles pour ensemble voting).  **Obligatoire** : Calamari
+        n'embarque pas de modèle par défaut.
+    voter:
+        Nom de la stratégie de vote pour ensembles multi-modèles.
+        Défaut ``"confidence_voter_default_ctc"``.
+    batch_size:
+        Taille de batch pour l'inférence.  Défaut 1.
+    Raises
+    ------
+    OCRAdapterError
+        Si ``name`` invalide, ``checkpoint`` vide ou inexistant.
+    """
+    input_types = frozenset({ArtifactType.IMAGE})
+    output_types = frozenset({ArtifactType.RAW_TEXT})
+    execution_mode = "cpu"
+    def __init__(
+        self,
+        *,
+        name: str = "calamari",
+        checkpoint: str | Path | None = None,
+        voter: str = "confidence_voter_default_ctc",
+        batch_size: int = 1,
+    ) -> None:
+        if not name or not name.strip():
+            raise OCRAdapterError(
+                "CalamariAdapter : name vide non autorisé.",
+            )
+        if not all(c.isalnum() or c in "_-" for c in name):
+            raise OCRAdapterError(
+                f"CalamariAdapter : name invalide {name!r} — "
+                "alphanumérique + _ - uniquement.",
+            )
+        if not checkpoint:
+            raise OCRAdapterError(
+                "CalamariAdapter : checkpoint est obligatoire — "
+                "Calamari n'embarque pas de modèle par défaut.  "
+                "Télécharger un modèle depuis le hub Calamari et "
+                "pointer son chemin (``.ckpt`` ou dossier).",
+            )
+        if batch_size < 1:
+            raise OCRAdapterError(
+                f"CalamariAdapter : batch_size doit être ≥ 1, reçu "
+                f"{batch_size}.",
+            )
+        self._name = name
+        self._checkpoint = Path(checkpoint)
+        self._voter = voter
+        self._batch_size = batch_size
+        # Predictor chargé paresseusement.
+        self._predictor: Any | None = None
+    @property
+    def name(self) -> str:
+        return self._name
+    @property
+    def checkpoint(self) -> Path:
+        return self._checkpoint
+    @property
+    def voter(self) -> str:
+        return self._voter
+    @property
+    def batch_size(self) -> int:
+        return self._batch_size
+    def execute(
+        self,
+        inputs: dict[ArtifactType, Artifact],
+        params: dict[str, Any],
+        context: Any,
+    ) -> dict[ArtifactType, Artifact]:
+        """Exécute Calamari sur l'image fournie.
+        Raises
+        ------
+        OCRAdapterError
+            - input ``IMAGE`` absent ou sans URI ;
+            - fichier image / checkpoint introuvable ;
+            - ``calamari_ocr`` non installé ;
+            - erreur Calamari (modèle invalide, inférence).
+        """
+        if ArtifactType.IMAGE not in inputs:
+            raise OCRAdapterError(f"{self.name} : input IMAGE manquant.")
+        image_artifact = inputs[ArtifactType.IMAGE]
+        if image_artifact.uri is None:
+            raise OCRAdapterError(
+                f"{self.name} : artefact image {image_artifact.id!r} "
+                "sans URI.",
+            )
+        image_path = Path(image_artifact.uri)
+        if not image_path.exists():
+            raise OCRAdapterError(
+                f"{self.name} : image introuvable {image_path!r}.",
+            )
+        if not self._checkpoint.exists():
+            raise OCRAdapterError(
+                f"{self.name} : checkpoint introuvable "
+                f"{self._checkpoint!r}.",
+            )
+        # Lazy-import — message explicite si dépendance absente.
+        try:
+            import numpy as np
+            from calamari_ocr.ocr.predict.predictor import (  # type: ignore[import-not-found]
+                Predictor,
+                PredictorParams,
+            )
+            from PIL import Image
+        except ImportError as exc:
+            raise OCRAdapterError(
+                f"{self.name} : calamari-ocr non installé.  "
+                "Installer avec : pip install 'picarones[calamari]' "
+                "(ou 'pip install calamari-ocr>=2.0').",
+            ) from exc
+        # Charger le Predictor une seule fois.
+        if self._predictor is None:
+            try:
+                params = PredictorParams()
+                params.silent = True
+                self._predictor = Predictor.from_checkpoint(
+                    params=params,
+                    checkpoint=str(self._checkpoint),
+                )
+            except Exception as exc:
+                raise OCRAdapterError(
+                    f"{self.name} : chargement checkpoint "
+                    f"{self._checkpoint!r} échoué : "
+                    f"{type(exc).__name__}: {exc}",
+                ) from exc
+        # OCR ligne : Calamari attend des numpy arrays grayscale.
+        try:
+            with Image.open(image_path) as image:
+                img_array = np.array(image.convert("L"))
+            results = list(self._predictor.predict_raw([img_array]))
+            if not results:
+                text = ""
+            else:
+                # Calamari ≥ 2.0 retourne des PredictionResult avec
+                # ``.outputs.sentence`` (post-voting) ou ``.sentence``.
+                result = results[0]
+                if hasattr(result, "outputs"):
+                    text = getattr(result.outputs, "sentence", "")
+                else:
+                    text = getattr(result, "sentence", "")
+        except Exception as exc:
+            raise OCRAdapterError(
+                f"{self.name} : Calamari a levé sur "
+                f"{image_path!r} : {type(exc).__name__}: {exc}",
+            ) from exc
+        text = (text or "").strip()
+        text_path = resolve_output_path(
+            input_path=image_path,
+            adapter_name=self.name,
+            suffix="txt",
+            context=context,
+        )
+        text_path.write_text(text, encoding="utf-8")
+        return {
+            ArtifactType.RAW_TEXT: Artifact(
+                id=f"{context.document_id}:{self.name}:raw_text",
+                document_id=context.document_id,
+                type=ArtifactType.RAW_TEXT,
+                produced_by_step="ocr",
+                uri=str(text_path),
+            ),
+        }
+__all__ = ["CalamariAdapter"]

picarones/adapters/ocr/factory.py CHANGED Viewed

@@ -46,12 +46,15 @@ _ALIASES: dict[str, str] = {
     "gv": "google_vision",
     "azure": "azure_doc_intel",
     "adi": "azure_doc_intel",
 }
 #: Liste des noms canoniques supportés pour les messages d'erreur.
 _SUPPORTED: tuple[str, ...] = (
     "tesseract",
     "pero_ocr",
     "mistral_ocr",
     "google_vision",
     "azure_doc_intel",
@@ -126,6 +129,20 @@ def ocr_adapter_from_name(
             ) from exc
         return PeroOCRAdapter(**kwargs)
     if canonical == "mistral_ocr":
         try:
             from picarones.adapters.ocr.mistral_ocr import MistralOCRAdapter

     "gv": "google_vision",
     "azure": "azure_doc_intel",
     "adi": "azure_doc_intel",
+    "calamari_ocr": "calamari",
 }
 #: Liste des noms canoniques supportés pour les messages d'erreur.
 _SUPPORTED: tuple[str, ...] = (
     "tesseract",
     "pero_ocr",
+    "kraken",
+    "calamari",
     "mistral_ocr",
     "google_vision",
     "azure_doc_intel",
             ) from exc
         return PeroOCRAdapter(**kwargs)
+    if canonical == "kraken":
+        # Phase 3 chantier post-rewrite : implémentation réelle de
+        # l'adapter ``kraken``, qui était annoncé par ``/api/engines``
+        # mais sans backend.  Lazy-import : la classe elle-même est
+        # importable sans ``kraken``, c'est ``execute()`` qui exigera
+        # la dépendance.
+        from picarones.adapters.ocr.kraken import KrakenAdapter
+        return KrakenAdapter(**kwargs)
+    if canonical == "calamari":
+        # Phase 3 chantier post-rewrite : implémentation réelle.
+        from picarones.adapters.ocr.calamari import CalamariAdapter
+        return CalamariAdapter(**kwargs)
     if canonical == "mistral_ocr":
         try:
             from picarones.adapters.ocr.mistral_ocr import MistralOCRAdapter

picarones/adapters/ocr/kraken.py ADDED Viewed

	@@ -0,0 +1,236 @@

+"""``KrakenAdapter`` — adapter pour Kraken HTR (manuscrits + imprimés).
+Implémente le contrat ``BaseOCRAdapter`` (couche 5) :
+``execute(inputs, params, context) → dict[ArtifactType, Artifact]``.
+Cas d'usage BnF
+---------------
+Kraken est l'engine open-source de référence pour les manuscrits et
+imprimés anciens où Tesseract ne fonctionne pas — segmentation par
+ligne de base + reconnaissance LSTM.  C'est l'OCR ciblé par
+HTR-United, l'écosystème de partage de modèles HTR pour le
+patrimoine écrit.
+Configuration
+-------------
+Constructeur :
+- ``name`` (défaut ``"kraken"``) : identifiant de l'instance.
+- ``model_path`` (obligatoire) : chemin vers le modèle ``.mlmodel``.
+  Kraken ne fournit pas de modèle par défaut — l'utilisateur doit en
+  pointer un (téléchargeable depuis https://htr-united.github.io/ ou
+  https://zenodo.org).
+- ``binarize`` (défaut ``True``) : applique la binarisation
+  ``nlbin`` avant segmentation.
+- ``text_direction`` (défaut ``"horizontal-lr"``) : direction du
+  texte (passée à ``pageseg.segment``).
+Comportement
+------------
+1. Vérifie qu'un ``Artifact`` ``IMAGE`` est présent.
+2. Lazy-import de ``kraken`` et ``PIL``.
+3. Charge le modèle (cache par instance).
+4. Binarise + segmente l'image.
+5. Reconnaît chaque ligne, concatène avec un saut de ligne.
+6. Écrit le résultat dans ``<stem>.<name>.txt`` à côté de l'image.
+Anti-sur-ingénierie
+-------------------
+- Pas d'extraction de confidences pour l'instant — Kraken expose des
+  scores par caractère via ``rpred``, à brancher quand un caller en
+  aura besoin (les VOC types de confidences sont par-token, ici on a
+  par-char).
+- Pas de support batch — un appel par image.
+- Pas de retry — si Kraken plante, on remonte ``OCRAdapterError``.
+- ``execution_mode="cpu"`` même si Kraken peut tourner sur GPU :
+  la décision pool est laissée au runner (un opérateur GPU peut
+  exporter ``CUDA_VISIBLE_DEVICES`` et tourner en ThreadPool sans
+  conflit).
+"""
+from __future__ import annotations
+from pathlib import Path
+from typing import Any
+from picarones.adapters.ocr.base import BaseOCRAdapter, OCRAdapterError
+from picarones.adapters.output_paths import resolve_output_path
+from picarones.domain.artifacts import Artifact, ArtifactType
+class KrakenAdapter(BaseOCRAdapter):
+    """Adapter Kraken HTR (Sprint Phase 3 — chantier post-rewrite).
+    Parameters
+    ----------
+    name:
+        Identifiant lisible de l'instance.  Défaut ``"kraken"``.
+        Doit être alphanumérique + ``_-``.
+    model_path:
+        Chemin vers le modèle ``.mlmodel`` Kraken.  **Obligatoire** :
+        Kraken n'embarque pas de modèle par défaut.
+    binarize:
+        Si ``True`` (défaut), applique ``binarization.nlbin`` avant
+        segmentation.  À désactiver pour des images déjà binarisées.
+    text_direction:
+        Direction de lecture passée à ``pageseg.segment``.  Défaut
+        ``"horizontal-lr"`` (gauche-droite horizontal).
+    Raises
+    ------
+    OCRAdapterError
+        Si ``name`` invalide, ``model_path`` vide ou inexistant.
+    """
+    input_types = frozenset({ArtifactType.IMAGE})
+    output_types = frozenset({ArtifactType.RAW_TEXT})
+    execution_mode = "cpu"
+    def __init__(
+        self,
+        *,
+        name: str = "kraken",
+        model_path: str | Path | None = None,
+        binarize: bool = True,
+        text_direction: str = "horizontal-lr",
+    ) -> None:
+        if not name or not name.strip():
+            raise OCRAdapterError(
+                "KrakenAdapter : name vide non autorisé.",
+            )
+        if not all(c.isalnum() or c in "_-" for c in name):
+            raise OCRAdapterError(
+                f"KrakenAdapter : name invalide {name!r} — "
+                "alphanumérique + _ - uniquement.",
+            )
+        if not model_path:
+            raise OCRAdapterError(
+                "KrakenAdapter : model_path est obligatoire — Kraken "
+                "n'embarque pas de modèle par défaut.  Télécharger un "
+                "modèle ``.mlmodel`` depuis HTR-United "
+                "(https://htr-united.github.io/) et pointer son chemin.",
+            )
+        self._name = name
+        self._model_path = Path(model_path)
+        self._binarize = binarize
+        self._text_direction = text_direction
+        # Modèle chargé paresseusement à la première utilisation
+        # — partagé entre les appels successifs de la même instance.
+        self._model: Any | None = None
+    @property
+    def name(self) -> str:
+        return self._name
+    @property
+    def model_path(self) -> Path:
+        return self._model_path
+    @property
+    def binarize(self) -> bool:
+        return self._binarize
+    @property
+    def text_direction(self) -> str:
+        return self._text_direction
+    def execute(
+        self,
+        inputs: dict[ArtifactType, Artifact],
+        params: dict[str, Any],
+        context: Any,
+    ) -> dict[ArtifactType, Artifact]:
+        """Exécute Kraken sur l'image fournie.
+        Raises
+        ------
+        OCRAdapterError
+            - input ``IMAGE`` absent ou sans URI ;
+            - fichier image / modèle introuvable ;
+            - ``kraken`` ou ``PIL`` non installé ;
+            - erreur Kraken (segmentation, reconnaissance).
+        """
+        if ArtifactType.IMAGE not in inputs:
+            raise OCRAdapterError(f"{self.name} : input IMAGE manquant.")
+        image_artifact = inputs[ArtifactType.IMAGE]
+        if image_artifact.uri is None:
+            raise OCRAdapterError(
+                f"{self.name} : artefact image {image_artifact.id!r} "
+                "sans URI.",
+            )
+        image_path = Path(image_artifact.uri)
+        if not image_path.exists():
+            raise OCRAdapterError(
+                f"{self.name} : image introuvable {image_path!r}.",
+            )
+        if not self._model_path.exists():
+            raise OCRAdapterError(
+                f"{self.name} : modèle introuvable "
+                f"{self._model_path!r}.",
+            )
+        # Lazy-import de kraken + PIL — si absents, message explicite.
+        try:
+            from kraken import binarization, pageseg, rpred  # type: ignore[import-not-found]
+            from kraken.lib import models  # type: ignore[import-not-found]
+            from PIL import Image
+        except ImportError as exc:
+            raise OCRAdapterError(
+                f"{self.name} : kraken/Pillow non installés.  "
+                "Installer avec : pip install 'picarones[kraken]' "
+                "(ou 'pip install kraken>=4.0').",
+            ) from exc
+        # Charger le modèle (une seule fois par instance).
+        if self._model is None:
+            try:
+                self._model = models.load_any(str(self._model_path))
+            except Exception as exc:
+                raise OCRAdapterError(
+                    f"{self.name} : chargement modèle "
+                    f"{self._model_path!r} échoué : "
+                    f"{type(exc).__name__}: {exc}",
+                ) from exc
+        # Pipeline Kraken : binarisation → segmentation → reco.
+        try:
+            with Image.open(image_path) as image:
+                proc_image = (
+                    binarization.nlbin(image) if self._binarize else image
+                )
+                segmentation = pageseg.segment(
+                    proc_image, text_direction=self._text_direction,
+                )
+                predictions = rpred.rpred(
+                    self._model, image, segmentation,
+                )
+                lines = [p.prediction for p in predictions if p.prediction]
+                text = "\n".join(lines)
+        except Exception as exc:
+            raise OCRAdapterError(
+                f"{self.name} : Kraken a levé sur "
+                f"{image_path!r} : {type(exc).__name__}: {exc}",
+            ) from exc
+        text = text.strip()
+        text_path = resolve_output_path(
+            input_path=image_path,
+            adapter_name=self.name,
+            suffix="txt",
+            context=context,
+        )
+        text_path.write_text(text, encoding="utf-8")
+        return {
+            ArtifactType.RAW_TEXT: Artifact(
+                id=f"{context.document_id}:{self.name}:raw_text",
+                document_id=context.document_id,
+                type=ArtifactType.RAW_TEXT,
+                produced_by_step="ocr",
+                uri=str(text_path),
+            ),
+        }
+__all__ = ["KrakenAdapter"]

picarones/app/services/benchmark_runner.py CHANGED Viewed

@@ -18,6 +18,7 @@ mono-call ergonomique et restitue un ``BenchmarkResult``.
 from __future__ import annotations
 import logging
 from dataclasses import dataclass
 from pathlib import Path
@@ -499,6 +500,53 @@ def _build_pipeline_info(engine: Any) -> dict:
     return info
 def _safe_engine_version(engine: Any) -> str:
     """Retourne ``engine.version()`` ou ``"unknown"`` en cas d'erreur.
@@ -733,7 +781,7 @@ def build_adapter_resolver(
         """Deux executors sont *fonctionnellement* équivalents s'ils
         ont le même type et le même état (``__dict__`` complet).
-        Cas concret : deux ``CompetitorConfig`` qui utilisent
         ``tesseract`` avec la même langue — l'un en mode OCR seul,
         l'autre encapsulé dans un pipeline OCR+LLM.  Le factory web
         leur donne le même ``name`` (dérivé de la config) → la 2e
@@ -1333,8 +1381,8 @@ def _run_benchmark_with_partial(
     from picarones.app.services.partial_store import (
         _delete_partial,
         _load_partial,
-        _partial_path,
         _save_partial_line,
     )
     from picarones.evaluation.benchmark_result import (
         BenchmarkResult,
@@ -1367,7 +1415,21 @@ def _run_benchmark_with_partial(
             )
             break
-        partial_path = _partial_path(corpus.name, engine.name, partial_dir)
         loaded_results = _load_partial(partial_path)
         loaded_doc_ids = {dr.doc_id for dr in loaded_results}

 from __future__ import annotations
+import hashlib
 import logging
 from dataclasses import dataclass
 from pathlib import Path
     return info
+def _engine_config_for_fingerprint(engine: Any) -> dict:
+    """Extrait une config sérialisable d'un engine pour le fingerprint.
+    Phase 2.3 : utilisé par
+    :func:`partial_store.compute_run_fingerprint` pour distinguer deux
+    runs avec le même couple ``(corpus, engine.name)`` mais des
+    paramètres internes différents (psm/lang Tesseract, modèle LLM,
+    prompt_template, mode pipeline, …).  Un changement non capturé
+    par ce dict = potentiel faux résultat en reprise.
+    Stratégie : sonder les attributs canoniques connus, repli sur
+    ``repr`` pour les types non sérialisables.  ``json.dumps`` finalise
+    via ``default=str`` côté ``compute_run_fingerprint`` — la
+    granularité est conservatrice (toute différence visible → nouveau
+    fingerprint).
+    """
+    cfg: dict = {"engine_name": getattr(engine, "name", "")}
+    # Pipeline composé : capturer le mode + prompt + LLM model
+    # (sources de différence majeure des résultats).
+    if getattr(engine, "is_pipeline", False):
+        mode = getattr(engine, "mode", None)
+        cfg["mode"] = mode.value if hasattr(mode, "value") else mode
+        prompt = getattr(engine, "prompt_template", None)
+        if prompt is not None:
+            # Hasher le prompt pour éviter de polluer le nom du fichier
+            # partiel avec un prompt multi-lignes (et de fuiter le
+            # contenu d'un prompt institutionnel dans un nom de fichier).
+            cfg["prompt_sha1"] = hashlib.sha1(
+                str(prompt).encode("utf-8"),
+            ).hexdigest()[:12]
+        llm = getattr(engine, "llm_adapter", None)
+        if llm is not None:
+            cfg["llm_model"] = getattr(llm, "model", "")
+            cfg["llm_provider"] = getattr(llm, "name", "")
+        ocr = getattr(engine, "ocr_adapter", None)
+        if ocr is not None:
+            cfg["ocr_name"] = getattr(ocr, "name", "")
+    else:
+        # Adapter OCR seul : sonder les attributs courants.
+        for attr in ("lang", "psm", "model", "model_id", "feature_type"):
+            value = getattr(engine, attr, None)
+            if value is not None:
+                cfg[attr] = value
+    return cfg
 def _safe_engine_version(engine: Any) -> str:
     """Retourne ``engine.version()`` ou ``"unknown"`` en cas d'erreur.
         """Deux executors sont *fonctionnellement* équivalents s'ils
         ont le même type et le même état (``__dict__`` complet).
+        Cas concret : deux ``PipelineConfig`` qui utilisent
         ``tesseract`` avec la même langue — l'un en mode OCR seul,
         l'autre encapsulé dans un pipeline OCR+LLM.  Le factory web
         leur donne le même ``name`` (dérivé de la config) → la 2e
     from picarones.app.services.partial_store import (
         _delete_partial,
         _load_partial,
         _save_partial_line,
+        partial_path_for_engine,
     )
     from picarones.evaluation.benchmark_result import (
         BenchmarkResult,
             )
             break
+        # Phase 2.3 — fingerprint inclut config moteur + profil
+        # normalisation + char_exclude + corpus files (mtime/size) +
+        # version code.  Deux runs avec configs différentes →
+        # fichiers partiels distincts → pas de réutilisation
+        # silencieuse de résultats incompatibles.
+        partial_path = partial_path_for_engine(
+            corpus=corpus,
+            engine=engine,
+            partial_dir=partial_dir,
+            engine_config=_engine_config_for_fingerprint(engine),
+            normalization_profile=normalization_profile,
+            char_exclude=char_exclude,
+            profile=profile,
+            code_version=code_version,
+        )
         loaded_results = _load_partial(partial_path)
         loaded_doc_ids = {dr.doc_id for dr in loaded_results}

picarones/app/services/partial_store.py CHANGED Viewed

@@ -7,9 +7,9 @@ travail déjà fait.
 Contrat
 -------
 Pour chaque couple ``(corpus_name, engine_name)``, un fichier
-``{partial_dir}/picarones_{corpus}_{engine}.partial.jsonl`` accumule
-une ligne JSON par ``DocumentResult`` au fur et à mesure de leur
-calcul.  Au redémarrage, ``run_benchmark_via_service`` charge ce
 fichier, identifie les ``doc_id`` déjà traités, et n'invoque le
 ``BenchmarkService`` que sur les documents restants.
@@ -18,6 +18,21 @@ partiel est supprimé.  Si un crash interrompt le run mid-engine,
 le fichier persiste : la prochaine exécution reprendra exactement
 où l'on s'est arrêté.
 Anti-sur-ingénierie
 -------------------
 - Format JSONL plat (une ligne = un ``DocumentResult.as_dict()``),
@@ -28,15 +43,22 @@ Anti-sur-ingénierie
   partage inter-process (chaque process a son propre tempdir).
 - Pas de checksum ni de validation de schéma — best-effort.  Une
   ligne corrompue = warning + ligne ignorée + on continue.
 """
 from __future__ import annotations
 import json
 import logging
 import re
 import tempfile
 import threading
 from pathlib import Path
 from typing import TYPE_CHECKING, Any, Optional
@@ -66,6 +88,8 @@ def _partial_path(
     corpus_name: str,
     engine_name: str,
     partial_dir: Optional[str | Path],
 ) -> Path:
     """Construit le chemin du fichier partiel pour ``(corpus, engine)``.
@@ -73,15 +97,93 @@ def _partial_path(
     ``tempfile.gettempdir()`` — utile pour les tests qui ne veulent
     pas configurer un répertoire dédié mais bénéficient quand même
     de la reprise intra-process.
     """
     base = Path(partial_dir) if partial_dir else Path(tempfile.gettempdir())
     name = (
         f"picarones_{_sanitize_filename(corpus_name)}"
-        f"_{_sanitize_filename(engine_name)}.partial.jsonl"
     )
     return base / name
 def _load_partial(
     partial_path: Path,
 ) -> list[DocumentResult]:
@@ -98,7 +200,6 @@ def _load_partial(
     travail antérieur.
     """
     from picarones.evaluation.benchmark_result import DocumentResult
-    from picarones.evaluation.metric_result import MetricsResult
     results: list[DocumentResult] = []
     if not partial_path.exists():
@@ -128,41 +229,12 @@ def _load_partial(
             )
             continue
         try:
-            metrics_dict = d.get("metrics", {}) or {}
-            metrics = MetricsResult(
-                cer=metrics_dict.get("cer"),
-                cer_nfc=metrics_dict.get("cer_nfc"),
-                cer_caseless=metrics_dict.get("cer_caseless"),
-                wer=metrics_dict.get("wer"),
-                wer_normalized=metrics_dict.get("wer_normalized"),
-                mer=metrics_dict.get("mer"),
-                wil=metrics_dict.get("wil"),
-                reference_length=metrics_dict.get("reference_length", 0),
-                hypothesis_length=metrics_dict.get("hypothesis_length", 0),
-                error=metrics_dict.get("error"),
-                cer_diplomatic=metrics_dict.get("cer_diplomatic"),
-                diplomatic_profile_name=metrics_dict.get(
-                    "diplomatic_profile_name",
-                ),
-            )
-            results.append(DocumentResult(
-                doc_id=d["doc_id"],
-                image_path=d.get("image_path", ""),
-                ground_truth=d.get("ground_truth", ""),
-                hypothesis=d.get("hypothesis", ""),
-                metrics=metrics,
-                duration_seconds=d.get("duration_seconds", 0.0),
-                engine_error=d.get("engine_error"),
-                ocr_intermediate=d.get("ocr_intermediate"),
-                pipeline_metadata=d.get("pipeline_metadata", {}) or {},
-                confusion_matrix=d.get("confusion_matrix"),
-                char_scores=d.get("char_scores"),
-                taxonomy=d.get("taxonomy"),
-                structure=d.get("structure"),
-                image_quality=d.get("image_quality"),
-                line_metrics=d.get("line_metrics"),
-                hallucination_metrics=d.get("hallucination_metrics"),
-            ))
         except (KeyError, TypeError) as exc:
             logger.warning(
                 "[partial_dir] ligne %d malformée dans '%s' : %s "
@@ -212,6 +284,70 @@ def _delete_partial(partial_path: Path) -> None:
         )
 __all__ = [
     "_delete_partial",
     "_load_partial",
@@ -219,4 +355,6 @@ __all__ = [
     "_partial_write_lock",
     "_sanitize_filename",
     "_save_partial_line",
 ]

 Contrat
 -------
 Pour chaque couple ``(corpus_name, engine_name)``, un fichier
+``{partial_dir}/picarones_{corpus}_{engine}_{fingerprint}.partial.jsonl``
+accumule une ligne JSON par ``DocumentResult`` au fur et à mesure de
+leur calcul.  Au redémarrage, ``run_benchmark_via_service`` charge ce
 fichier, identifie les ``doc_id`` déjà traités, et n'invoque le
 ``BenchmarkService`` que sur les documents restants.
 le fichier persiste : la prochaine exécution reprendra exactement
 où l'on s'est arrêté.
+Phase 2.3 — Fingerprint anti-collision
+---------------------------------------
+Auparavant, la clé partial était ``(corpus.name, engine.name)`` —
+insuffisant : deux runs successifs avec le même corpus et le même
+engine **mais des configs différentes** (psm Tesseract, langue,
+profil de normalisation, char_exclude, version code) réutilisaient
+silencieusement les résultats du run précédent.  Reproductibilité
+scientifique cassée.
+Désormais :func:`compute_run_fingerprint` calcule un SHA-256 stable
+de la config complète (engine_config, normalization_profile,
+char_exclude, fichiers du corpus + mtime/size, version code).  Le
+préfixe 16 hex est suffixé au nom du fichier partiel : un changement
+de config = un fichier différent = pas de réutilisation illégale.
 Anti-sur-ingénierie
 -------------------
 - Format JSONL plat (une ligne = un ``DocumentResult.as_dict()``),
   partage inter-process (chaque process a son propre tempdir).
 - Pas de checksum ni de validation de schéma — best-effort.  Une
   ligne corrompue = warning + ligne ignorée + on continue.
+- Fingerprint basé sur ``(path, size, mtime)`` pour les fichiers
+  corpus, pas sur le contenu lui-même : 100× plus rapide, suffisant
+  pour détecter une modification.  Si un attaquant ``touch`` un
+  fichier sans changer son contenu, le partial est invalidé (acceptable,
+  conservative).
 """
 from __future__ import annotations
+import hashlib
 import json
 import logging
 import re
 import tempfile
 import threading
+from collections.abc import Iterable, Mapping
 from pathlib import Path
 from typing import TYPE_CHECKING, Any, Optional
     corpus_name: str,
     engine_name: str,
     partial_dir: Optional[str | Path],
+    *,
+    fingerprint: Optional[str] = None,
 ) -> Path:
     """Construit le chemin du fichier partiel pour ``(corpus, engine)``.
     ``tempfile.gettempdir()`` — utile pour les tests qui ne veulent
     pas configurer un répertoire dédié mais bénéficient quand même
     de la reprise intra-process.
+    Phase 2.3 — Si ``fingerprint`` est fourni, il est suffixé au nom :
+    ``picarones_{corpus}_{engine}_{fingerprint}.partial.jsonl``.  Cela
+    garantit que deux runs avec le même couple ``(corpus, engine)``
+    mais des configs différentes ne partagent **jamais** leur fichier
+    partiel.  Sans ``fingerprint``, le comportement legacy est
+    préservé pour rétrocompatibilité tests.
     """
     base = Path(partial_dir) if partial_dir else Path(tempfile.gettempdir())
+    fp_suffix = f"_{fingerprint}" if fingerprint else ""
     name = (
         f"picarones_{_sanitize_filename(corpus_name)}"
+        f"_{_sanitize_filename(engine_name)}{fp_suffix}.partial.jsonl"
     )
     return base / name
+def compute_run_fingerprint(
+    *,
+    engine_config: Mapping[str, Any] | None = None,
+    normalization_profile: str | None = None,
+    char_exclude: str | None = None,
+    corpus_files: Iterable[str | Path] | None = None,
+    code_version: str | None = None,
+    extra: Mapping[str, Any] | None = None,
+) -> str:
+    """Calcule un fingerprint stable pour identifier un run.
+    Composantes intégrées au hash :
+    - ``engine_config`` — dict de paramètres moteur (lang, psm,
+      model, etc.).  Encodé en JSON trié pour stabilité.
+    - ``normalization_profile`` — identifiant du profil de
+      normalisation Unicode.  Différents profils → métriques
+      différentes → fingerprint différent.
+    - ``char_exclude`` — caractères ignorés au calcul (CER/WER).
+      Idem.
+    - ``corpus_files`` — itérable de chemins.  Pour chaque, on
+      hashe le chemin + ``stat.st_size`` + ``stat.st_mtime``.
+      Détecte les modifs sans coût du hash de contenu.
+    - ``code_version`` — version de Picarones courante.
+    - ``extra`` — dict additionnel libre pour des éléments
+      spécifiques à un pipeline (prompt_template, llm_params).
+    Returns
+    -------
+    str
+        Empreinte hexadécimale tronquée à 16 caractères — collision
+        négligeable pour un usage par-utilisateur, lisible humainement.
+    """
+    hasher = hashlib.sha256()
+    def _update(key: str, value: Any) -> None:
+        hasher.update(b"\x00")
+        hasher.update(key.encode("utf-8"))
+        hasher.update(b"\x01")
+        try:
+            payload = json.dumps(value, sort_keys=True, default=str)
+        except TypeError:
+            payload = repr(value)
+        hasher.update(payload.encode("utf-8"))
+    _update("engine_config", dict(engine_config or {}))
+    _update("normalization_profile", normalization_profile or "")
+    _update("char_exclude", char_exclude or "")
+    _update("code_version", code_version or "")
+    _update("extra", dict(extra or {}))
+    if corpus_files is not None:
+        # Tri pour stabilité indépendamment de l'ordre d'itération.
+        for fpath in sorted(str(p) for p in corpus_files):
+            hasher.update(b"\x02")
+            hasher.update(fpath.encode("utf-8"))
+            try:
+                stat = Path(fpath).stat()
+                hasher.update(
+                    f":{stat.st_size}:{int(stat.st_mtime)}".encode("utf-8"),
+                )
+            except OSError:
+                # Fichier disparu / inaccessible — ignoré au fingerprint.
+                # Si le file disparait pendant la course, on prend ce
+                # qu'on peut.
+                continue
+    return hasher.hexdigest()[:16]
 def _load_partial(
     partial_path: Path,
 ) -> list[DocumentResult]:
     travail antérieur.
     """
     from picarones.evaluation.benchmark_result import DocumentResult
     results: list[DocumentResult] = []
     if not partial_path.exists():
             )
             continue
         try:
+            # Phase 2.2 — Utilise ``DocumentResult.from_dict`` au lieu
+            # de la reconstruction manuelle qui perdait
+            # ``taxonomy``/``ner_metrics``/``calibration_metrics``/etc.
+            # à la reprise — un partial chargé puis re-sérialisé
+            # devait conserver l'intégralité du payload.
+            results.append(DocumentResult.from_dict(d))
         except (KeyError, TypeError) as exc:
             logger.warning(
                 "[partial_dir] ligne %d malformée dans '%s' : %s "
         )
+def partial_path_for_engine(
+    *,
+    corpus: Any,
+    engine: Any,
+    partial_dir: Optional[str | Path],
+    engine_config: Mapping[str, Any] | None = None,
+    normalization_profile: Any | None = None,
+    char_exclude: Any | None = None,
+    profile: str | None = None,
+    code_version: str | None = None,
+) -> Path:
+    """Helper public qui calcule le ``Path`` du fichier partiel pour
+    un couple ``(corpus, engine)`` en intégrant le fingerprint complet.
+    Encapsule la combinaison ``_partial_path`` +
+    :func:`compute_run_fingerprint` pour que le runner et les tests
+    utilisent la **même** logique de nommage — sinon les tests ne
+    peuvent pas pré-remplir un partial que le runner saura
+    retrouver.
+    Parameters
+    ----------
+    corpus:
+        Doit exposer ``.name`` et ``.documents`` (chaque doc ayant
+        ``.image_path``).
+    engine:
+        Doit exposer ``.name``.  ``engine_config`` peut être fourni
+        séparément si la caller veut surcharger l'introspection.
+    partial_dir:
+        Dossier où vit le partial ; ``None`` → tempdir.
+    engine_config:
+        Si fourni, utilisé tel quel ; sinon l'appelant peut sonder
+        l'engine via :func:`benchmark_runner._engine_config_for_fingerprint`
+        avant d'appeler.
+    normalization_profile, char_exclude, profile, code_version:
+        Composantes incluses dans le fingerprint.  Passer ``None``
+        pour ne pas contribuer (deux runs avec et sans normalisation
+        auront alors des fingerprints différents seulement si l'un
+        des deux est ``None``).
+    """
+    corpus_files = [
+        doc.image_path for doc in getattr(corpus, "documents", [])
+        if getattr(doc, "image_path", None)
+    ]
+    fp = compute_run_fingerprint(
+        engine_config=engine_config or {"engine_name": getattr(engine, "name", "")},
+        normalization_profile=(
+            getattr(normalization_profile, "name", None)
+            if normalization_profile is not None
+            else None
+        ),
+        char_exclude=(
+            "".join(sorted(char_exclude)) if char_exclude else None
+        ),
+        corpus_files=corpus_files,
+        code_version=code_version,
+        extra={"profile": profile} if profile else None,
+    )
+    return _partial_path(
+        getattr(corpus, "name", ""), getattr(engine, "name", ""),
+        partial_dir, fingerprint=fp,
+    )
 __all__ = [
     "_delete_partial",
     "_load_partial",
     "_partial_write_lock",
     "_sanitize_filename",
     "_save_partial_line",
+    "compute_run_fingerprint",
+    "partial_path_for_engine",
 ]

picarones/evaluation/benchmark_result.py CHANGED Viewed

@@ -180,6 +180,50 @@ class DocumentResult:
             d["readability_metrics"] = self.readability_metrics
         return d
     def compact(
         self,
         text_limit: Optional[int] = None,
@@ -408,6 +452,43 @@ class EngineReport:
             d["aggregated_readability"] = self.aggregated_readability
         return d
 @dataclass
 class BenchmarkResult:
@@ -686,12 +767,60 @@ class BenchmarkResult:
             json.dump(self.as_dict(), fh, ensure_ascii=False, indent=indent)
         return output_path.resolve()
     @classmethod
     def from_json(cls, path: str | Path) -> dict:
-        """Charge un résultat JSON brut depuis le disque (pour le rapport HTML).
-        Retourne le dict Python — la reconstruction complète en objets
-        est réservée aux sprints suivants.
         """
         with Path(path).open(encoding="utf-8") as fh:
             return json.load(fh)

             d["readability_metrics"] = self.readability_metrics
         return d
+    @classmethod
+    def from_dict(cls, data: dict) -> "DocumentResult":
+        """Reconstruit un :class:`DocumentResult` depuis ``as_dict()``.
+        Phase 2.2 du chantier post-rewrite : restauration fidèle de
+        tous les champs avancés (confusion_matrix, taxonomy, structure,
+        hallucination_metrics, ner_metrics, calibration_metrics,
+        philological_metrics, searchability_metrics,
+        numerical_sequence_metrics, readability_metrics,
+        pipeline_metadata, ocr_intermediate).
+        Avant ce durcissement, ``ReportGenerator.from_json`` faisait sa
+        propre reconstruction qui ne couvrait que CER/WER/MER/WIL +
+        doc_id/image_path/ground_truth/hypothesis — toutes les
+        analyses détaillées étaient perdues, donc le rapport régénéré
+        depuis JSON n'avait plus accès aux vues taxonomy, NER,
+        calibration, etc.  La reproductibilité scientifique était
+        cassée.
+        """
+        return cls(
+            doc_id=data["doc_id"],
+            image_path=data["image_path"],
+            ground_truth=data["ground_truth"],
+            hypothesis=data["hypothesis"],
+            metrics=MetricsResult.from_dict(data["metrics"]),
+            duration_seconds=data.get("duration_seconds", 0.0),
+            engine_error=data.get("engine_error"),
+            ocr_intermediate=data.get("ocr_intermediate"),
+            pipeline_metadata=data.get("pipeline_metadata", {}) or {},
+            confusion_matrix=data.get("confusion_matrix"),
+            char_scores=data.get("char_scores"),
+            taxonomy=data.get("taxonomy"),
+            structure=data.get("structure"),
+            image_quality=data.get("image_quality"),
+            line_metrics=data.get("line_metrics"),
+            hallucination_metrics=data.get("hallucination_metrics"),
+            ner_metrics=data.get("ner_metrics"),
+            calibration_metrics=data.get("calibration_metrics"),
+            philological_metrics=data.get("philological_metrics"),
+            searchability_metrics=data.get("searchability_metrics"),
+            numerical_sequence_metrics=data.get("numerical_sequence_metrics"),
+            readability_metrics=data.get("readability_metrics"),
+        )
     def compact(
         self,
         text_limit: Optional[int] = None,
             d["aggregated_readability"] = self.aggregated_readability
         return d
+    @classmethod
+    def from_dict(cls, data: dict) -> "EngineReport":
+        """Reconstruit un :class:`EngineReport` depuis ``as_dict()``.
+        Phase 2.2 du chantier post-rewrite : restauration fidèle des
+        ``aggregated_*`` (confusion, char_scores, taxonomy, structure,
+        image_quality, line_metrics, hallucination, ner, calibration,
+        philological, searchability, numerical_sequences, readability)
+        et de ``pipeline_info``.
+        """
+        return cls(
+            engine_name=data["engine_name"],
+            engine_version=data.get("engine_version", "unknown"),
+            engine_config=data.get("engine_config", {}),
+            document_results=[
+                DocumentResult.from_dict(dr)
+                for dr in data.get("document_results", [])
+            ],
+            aggregated_metrics=data.get("aggregated_metrics", {}) or {},
+            pipeline_info=data.get("pipeline_info", {}) or {},
+            aggregated_confusion=data.get("aggregated_confusion"),
+            aggregated_char_scores=data.get("aggregated_char_scores"),
+            aggregated_taxonomy=data.get("aggregated_taxonomy"),
+            aggregated_structure=data.get("aggregated_structure"),
+            aggregated_image_quality=data.get("aggregated_image_quality"),
+            aggregated_line_metrics=data.get("aggregated_line_metrics"),
+            aggregated_hallucination=data.get("aggregated_hallucination"),
+            aggregated_ner=data.get("aggregated_ner"),
+            aggregated_calibration=data.get("aggregated_calibration"),
+            aggregated_philological=data.get("aggregated_philological"),
+            aggregated_searchability=data.get("aggregated_searchability"),
+            aggregated_numerical_sequences=data.get(
+                "aggregated_numerical_sequences",
+            ),
+            aggregated_readability=data.get("aggregated_readability"),
+        )
 @dataclass
 class BenchmarkResult:
             json.dump(self.as_dict(), fh, ensure_ascii=False, indent=indent)
         return output_path.resolve()
+    @classmethod
+    def from_dict(cls, data: dict) -> "BenchmarkResult":
+        """Reconstruit un :class:`BenchmarkResult` complet depuis
+        ``as_dict()``.
+        Phase 2.2 du chantier post-rewrite : fidélité du round-trip
+        ``to_json → from_dict``.  Auparavant, ``from_json`` retournait
+        le dict brut et l'appelant devait reconstruire à la main —
+        d'où la dérive entre ``ReportGenerator.__init__`` (objets) et
+        ``ReportGenerator.from_json`` (dicts appauvris).  Désormais, un
+        seul chemin canonique : ``BenchmarkResult.from_dict(dict)`` →
+        objet complet, indistinguable d'un benchmark fraîchement
+        exécuté.
+        """
+        corpus_info = data.get("corpus", {}) or {}
+        return cls(
+            corpus_name=corpus_info.get("name", "Corpus"),
+            corpus_source=corpus_info.get("source"),
+            document_count=corpus_info.get("document_count", 0),
+            engine_reports=[
+                EngineReport.from_dict(er)
+                for er in data.get("engine_reports", [])
+            ],
+            run_date=data.get("run_date", ""),
+            picarones_version=data.get("picarones_version", ""),
+            metadata=data.get("metadata", {}) or {},
+        )
     @classmethod
     def from_json(cls, path: str | Path) -> dict:
+        """Charge le JSON brut (dict Python) — rétrocompatibilité.
+        Pour reconstruire un :class:`BenchmarkResult` complet (objets),
+        utiliser :meth:`from_dict` après :meth:`from_json`, ou
+        directement :meth:`from_json_object` ci-dessous.
+        Cette méthode est conservée parce que de nombreux consommateurs
+        (tests, ``ReportGenerator.from_json`` legacy, scripts CLI ad
+        hoc) attendent encore un dict.  Le rewrite v2.0 préfère les
+        objets reconstruits ; les nouveaux callers doivent utiliser
+        :meth:`from_json_object`.
         """
         with Path(path).open(encoding="utf-8") as fh:
             return json.load(fh)
+    @classmethod
+    def from_json_object(cls, path: str | Path) -> "BenchmarkResult":
+        """Charge un JSON et reconstruit un :class:`BenchmarkResult`
+        complet (objets), avec toutes les analyses avancées préservées.
+        Round-trip garanti : ``BenchmarkResult.from_json_object(
+        bm.to_json(p)) == bm`` au sens structurel (les champs
+        ``aggregated_metrics`` peuvent être recalculés par
+        ``__post_init__`` si absents, sinon préservés).
+        """
+        with Path(path).open(encoding="utf-8") as fh:
+            return cls.from_dict(json.load(fh))

picarones/evaluation/metric_result.py CHANGED Viewed

@@ -79,6 +79,31 @@ class MetricsResult:
     def wer_percent(self) -> Optional[float]:
         return None if self.wer is None else round(self.wer * 100, 2)
 def aggregate_metrics(results: list[MetricsResult]) -> dict:
     """Calcule les statistiques agrégées sur un ensemble de résultats.

     def wer_percent(self) -> Optional[float]:
         return None if self.wer is None else round(self.wer * 100, 2)
+    @classmethod
+    def from_dict(cls, data: dict) -> "MetricsResult":
+        """Reconstruit depuis le dict produit par :meth:`as_dict`.
+        Phase 2.2 du chantier post-rewrite : fidélité du round-trip
+        ``as_dict → from_dict``.  Auparavant, ``ReportGenerator.from_json``
+        contenait sa propre reconstruction partielle qui perdait
+        ``cer_diplomatic`` et ``diplomatic_profile_name``.  Centraliser
+        la désérialisation ici évite la dérive.
+        """
+        return cls(
+            cer=data.get("cer"),
+            cer_nfc=data.get("cer_nfc"),
+            cer_caseless=data.get("cer_caseless"),
+            wer=data.get("wer"),
+            wer_normalized=data.get("wer_normalized"),
+            mer=data.get("mer"),
+            wil=data.get("wil"),
+            reference_length=data.get("reference_length", 0),
+            hypothesis_length=data.get("hypothesis_length", 0),
+            error=data.get("error"),
+            cer_diplomatic=data.get("cer_diplomatic"),
+            diplomatic_profile_name=data.get("diplomatic_profile_name"),
+        )
 def aggregate_metrics(results: list[MetricsResult]) -> dict:
     """Calcule les statistiques agrégées sur un ensemble de résultats.

picarones/interfaces/cli/__init__.py CHANGED Viewed

@@ -151,27 +151,59 @@ def metrics_cmd(reference: str, hypothesis: str, json_output: bool) -> None:
 # picarones engines
 # ---------------------------------------------------------------------------
 @cli.command("engines")
 def engines_cmd() -> None:
-    """Liste les moteurs OCR disponibles et vérifie leur installation."""
-    engines = [
-        ("tesseract", "Tesseract 5 (pytesseract)", "pytesseract"),
-        ("pero_ocr", "Pero OCR", "pero_ocr"),
-    ]
     click.echo("Moteurs OCR disponibles :\n")
-    for engine_id, label, module in engines:
-        try:
-            __import__(module)
-            status = click.style("✓ disponible", fg="green")
-        except ImportError:
-            status = click.style("✗ non installé", fg="red")
-        click.echo(f"  {engine_id:<15} {label:<35} {status}")
     click.echo(
-        "\nPour installer un moteur manquant :\n"
-        "  pip install pytesseract\n"
-        "  pip install pero-ocr"
     )

 # picarones engines
 # ---------------------------------------------------------------------------
+#: Catalogue source-de-vérité des moteurs OCR exposés.
+#:
+#: Phase 3 chantier post-rewrite : remplace l'ancienne liste hardcodée
+#: ``[tesseract, pero_ocr]`` qui divergeait du web (``/api/engines``
+#: annonçait 8 engines, dont kraken/calamari sans backend, dont
+#: mistral_ocr/google_vision/azure_doc_intel jamais exposés à la CLI).
+#: Désormais la liste est dérivée de la factory canonique
+#: ``picarones.adapters.ocr.factory._SUPPORTED`` ; ajouter un engine
+#: nécessite (1) un adapter dans ``adapters/ocr/`` et (2) une entrée
+#: factory — pas de divergence possible avec l'API web.
+_CLI_ENGINE_CATALOG: tuple[tuple[str, str, str, str], ...] = (
+    ("tesseract", "Tesseract 5", "pytesseract", "[dev]"),
+    ("pero_ocr", "Pero OCR", "pero_ocr", "[pero]"),
+    ("kraken", "Kraken HTR", "kraken", "[kraken]"),
+    ("calamari", "Calamari OCR", "calamari_ocr", "[calamari]"),
+    ("mistral_ocr", "Mistral OCR (cloud)", "mistralai", "[llm]"),
+    ("google_vision", "Google Vision (cloud)", "google.cloud.vision", "[ocr-cloud]"),
+    ("azure_doc_intel", "Azure Doc Intel (cloud)",
+     "azure.ai.documentintelligence", "[ocr-cloud]"),
+    ("precomputed", "Précalculé (OCR pré-existant)", "", ""),
+)
 @cli.command("engines")
 def engines_cmd() -> None:
+    """Liste les moteurs OCR disponibles et vérifie leur installation.
+    Source de vérité unique avec ``/api/engines`` (Phase 3 du chantier
+    post-rewrite) : tous les moteurs listés ici sont effectivement
+    instanciables via ``picarones.adapters.ocr.factory``.
+    """
+    from picarones.adapters.ocr.factory import _SUPPORTED
     click.echo("Moteurs OCR disponibles :\n")
+    for engine_id, label, module, extra in _CLI_ENGINE_CATALOG:
+        # Garde-fou de cohérence : l'entrée CLI ne doit jamais
+        # référencer un engine inconnu de la factory canonique.
+        if engine_id not in _SUPPORTED:
+            continue
+        if not module:
+            status = click.style("✓ intégré", fg="green")
+        else:
+            try:
+                __import__(module)
+                status = click.style("✓ disponible", fg="green")
+            except ImportError:
+                hint = f" (pip install picarones{extra})" if extra else ""
+                status = click.style(f"✗ non installé{hint}", fg="red")
+        click.echo(f"  {engine_id:<18} {label:<32} {status}")
     click.echo(
+        "\nNote : kraken/calamari exigent un modèle utilisateur "
+        "(``.mlmodel``/``.ckpt``) — pas de modèle par défaut.",
     )

picarones/interfaces/web/app.py CHANGED Viewed

@@ -60,7 +60,8 @@ _logger = logging.getLogger(__name__)
 @asynccontextmanager
 async def _lifespan(app: FastAPI):
-    """Hook de démarrage : valide la config + nettoie les jobs orphelins.
     1. Sprint S6.9 — ``validate_csrf_config()`` : refuse de démarrer
        si ``PICARONES_CSRF_REQUIRED=1`` sans ``PICARONES_CSRF_SECRET``
@@ -71,7 +72,14 @@ async def _lifespan(app: FastAPI):
        précédent est mort sans les finir).  On les bascule en
        ``interrupted`` pour ne pas laisser d'état mensonger sur le
        tableau de bord.
     """
     # Étape 1 — validation config (échec rapide si dangereux).
     from picarones.interfaces.web.security import validate_csrf_config
     validate_csrf_config()
@@ -91,7 +99,27 @@ async def _lifespan(app: FastAPI):
             "base SQLite inaccessible (%s) : le tableau de bord "
             "affichera des jobs zombies.", exc,
         )
-    yield
 # ──────────────────────────────────────────────────────────────────────────

 @asynccontextmanager
 async def _lifespan(app: FastAPI):
+    """Hook de démarrage : valide la config + nettoie les jobs orphelins
+    + démarre la tâche RGPD de purge des uploads.
     1. Sprint S6.9 — ``validate_csrf_config()`` : refuse de démarrer
        si ``PICARONES_CSRF_REQUIRED=1`` sans ``PICARONES_CSRF_SECRET``
        précédent est mort sans les finir).  On les bascule en
        ``interrupted`` pour ne pas laisser d'état mensonger sur le
        tableau de bord.
+    3. Phase 4 du chantier post-rewrite — démarrage explicite de
+       :func:`upload_purge_task` (RGPD).  Auparavant définie dans
+       ``maintenance.py`` mais jamais lancée par ce lifespan, elle
+       était du code zombie.  Désormais lancée comme tâche asyncio
+       de fond ; annulation propre au shutdown.
     """
+    import asyncio
     # Étape 1 — validation config (échec rapide si dangereux).
     from picarones.interfaces.web.security import validate_csrf_config
     validate_csrf_config()
             "base SQLite inaccessible (%s) : le tableau de bord "
             "affichera des jobs zombies.", exc,
         )
+    # Étape 3 — démarrage tâche de purge RGPD.
+    from picarones.interfaces.web.maintenance import upload_purge_task
+    purge_task = asyncio.create_task(upload_purge_task(state.UPLOADS_DIR))
+    try:
+        yield
+    finally:
+        # Annulation propre au shutdown ; on attend l'acquittement de
+        # la CancelledError pour éviter le warning "Task was destroyed
+        # but it is pending".  ``asyncio.shield`` n'est pas nécessaire :
+        # on accepte la perte d'une éventuelle passe de purge en cours
+        # (idempotente, sera reprise au prochain démarrage).
+        purge_task.cancel()
+        try:
+            await purge_task
+        except (asyncio.CancelledError, Exception) as exc:  # noqa: BLE001
+            if not isinstance(exc, asyncio.CancelledError):
+                _logger.warning(
+                    "[maintenance] tâche de purge arrêtée sur erreur : %s",
+                    exc,
+                )
 # ──────────────────────────────────────────────────────────────────────────

picarones/interfaces/web/benchmark_utils.py CHANGED Viewed

@@ -11,9 +11,9 @@ API publique
 Helpers internes (préfixe ``_``)
 --------------------------------
 - ``_build_llm_adapter`` : factory adapter LLM depuis une config
-  ``CompetitorConfig``.
 - ``_engine_from_competitor`` : factory moteur OCR ou pipeline
-  OCR+LLM depuis une ``CompetitorConfig``.
 Ces utilitaires sont consommés par le router ``/api/benchmark/*``.
 """
@@ -28,7 +28,7 @@ from typing import Any, Optional
 from picarones.interfaces.web.models import (
     BenchmarkRequest,
     BenchmarkRunRequest,
-    CompetitorConfig,
 )
 from picarones.interfaces.web.state import BenchmarkJob, iso_now
@@ -91,7 +91,7 @@ def sse_format(event_type: str, data: Any, seq: Optional[int] = None) -> str:
     return f"{head}event: {event_type}\ndata: {payload}\n\n"
-def _build_llm_adapter(comp: CompetitorConfig) -> Any:
     """Instancie un adaptateur LLM depuis la config d'un concurrent."""
     if comp.llm_provider == "openai":
         from picarones.adapters.llm.openai_adapter import OpenAIAdapter
@@ -126,7 +126,7 @@ def _sanitize_name_suffix(value: str) -> str:
 def _ocr_adapter_name(engine_id: str, ocr_model: str) -> str:
     """Nom canonique de l'adapter OCR pour un couple ``(engine, model)``.
-    Deux ``CompetitorConfig`` qui partagent exactement le même couple
     obtiennent le même ``name`` (donc le resolver les déduplique
     proprement).  Deux configs différentes obtiennent des noms
     distincts — pas de collision silencieuse, pas de bricolage côté
@@ -170,6 +170,21 @@ _OCR_KWARGS_BUILDERS: dict[str, Any] = {
         "lang": model or "fra",
         "psm": 6,
     },
     "mistral_ocr": lambda model: {
         "model": model or "mistral-ocr-latest",
     },
@@ -200,8 +215,8 @@ def _build_ocr_kwargs(engine_id: str, ocr_model: str) -> dict[str, Any]:
     return kwargs
-def _engine_from_competitor(comp: CompetitorConfig) -> Any:
-    """Instancie un moteur OCR (ou pipeline OCR+LLM) depuis une CompetitorConfig.
     Modes supportés :
@@ -226,7 +241,7 @@ def _engine_from_competitor(comp: CompetitorConfig) -> Any:
     # des constructeurs ``BaseOCREngine`` legacy.  Les adapters
     # canoniques ont des kwargs nommés (pas de dict ``config``) — la
     # conversion se fait ici en respectant les noms historiques des
-    # champs ``CompetitorConfig.ocr_model``.
     ocr = None
     if not is_corpus_ocr:
         from picarones.adapters.ocr.factory import ocr_adapter_from_name
@@ -248,14 +263,20 @@ def _engine_from_competitor(comp: CompetitorConfig) -> Any:
     # Pipeline OCR+LLM (live ou post-correction) — ``OCRLLMPipelineConfig``
     # canonique remplace l'ex-``OCRLLMPipeline`` legacy.
-    mode_map = {
-        "text_only": "text_only",
-        "post_correction_text": "text_only",
-        "text_and_image": "text_and_image",
-        "post_correction_image": "text_and_image",
-        "zero_shot": "zero_shot",
-    }
-    mode = mode_map.get(comp.pipeline_mode, "text_only")
     llm = _build_llm_adapter(comp)
@@ -283,7 +304,7 @@ def _engine_from_competitor(comp: CompetitorConfig) -> Any:
 def run_benchmark_thread_v2(job: BenchmarkJob, req: BenchmarkRunRequest) -> None:
-    """Exécute un benchmark à partir d'une liste de ``CompetitorConfig``."""
     job.set_status("running")
     job.started_at = iso_now()
     job.add_event("start", {"message": "Démarrage du benchmark…", "corpus": req.corpus_path})
@@ -394,123 +415,72 @@ def run_benchmark_thread_v2(job: BenchmarkJob, req: BenchmarkRunRequest) -> None
         job.add_event("error", {"message": f"Erreur : {exc}"})
-def run_benchmark_thread(job: BenchmarkJob, req: BenchmarkRequest) -> None:
-    """Exécute le benchmark legacy (route ``/api/benchmark/start``)."""
-    job.set_status("running")
-    job.started_at = iso_now()
-    job.add_event("start", {"message": "Démarrage du benchmark…", "corpus": req.corpus_path})
-    try:
-        from picarones.app.services.benchmark_runner import (
-            run_benchmark_via_service,
-        )
-        from picarones.evaluation.corpus import load_corpus_from_directory
-        # Charger le corpus
-        job.add_event("log", {"message": f"Chargement du corpus : {req.corpus_path}"})
-        corpus = load_corpus_from_directory(req.corpus_path)
-        job.total_docs = len(corpus)
-        job.add_event("log", {"message": f"{job.total_docs} documents chargés."})
-        if job.status == "cancelled":
-            return
-        # Sprint H.2.b.4 — instanciation via la factory canonique
-        # ``ocr_adapter_from_name`` (retourne ``BaseOCRAdapter``).
-        from picarones.adapters.ocr.factory import ocr_adapter_from_name
-        ocr_engines = []
-        for engine_name in req.engines:
-            try:
-                if engine_name.lower() in {"tesseract", "tess"}:
-                    eng = ocr_adapter_from_name(
-                        engine_name, lang=req.lang, psm=6,
-                    )
-                else:
-                    eng = ocr_adapter_from_name(engine_name)
-                ocr_engines.append(eng)
-                job.add_event("log", {"message": f"Moteur chargé : {engine_name}"})
-            except Exception as exc:
-                job.add_event("warning", {"message": f"Moteur ignoré '{engine_name}' : {exc}"})
-        if not ocr_engines:
-            raise ValueError("Aucun moteur valide disponible.")
-        # Répertoire de sortie
-        # Sprint A14-S1 — A.I.0 P0 : ``output_dir`` a déjà été validé
-        # par le router (validated_path).  ``report_name`` est sanitizé
-        # ici pour défense en profondeur (refuse ``../``, séparateurs,
-        # caractères de contrôle) avant concaténation à output_dir.
-        from picarones.interfaces.web.security import safe_report_name
-        output_dir = Path(req.output_dir)
-        output_dir.mkdir(parents=True, exist_ok=True)
-        raw_name = req.report_name or f"rapport_{datetime.now().strftime('%Y%m%d_%H%M%S')}"
-        report_name = safe_report_name(raw_name)
-        output_json = str(output_dir / f"{report_name}.json")
-        output_html = str(output_dir / f"{report_name}.html")
-        # Callback de progression
-        n_engines = len(ocr_engines)
-        total_steps = job.total_docs * n_engines
-        step_counter = [0]
-        def _progress_callback(engine_name: str, doc_idx: int, doc_id: str) -> None:
-            if job.status == "cancelled":
-                return
-            step_counter[0] += 1
-            job.current_engine = engine_name
-            job.processed_docs = doc_idx
-            job.progress = step_counter[0] / max(total_steps, 1)
-            job.add_event("progress", {
-                "engine": engine_name,
-                "doc_idx": doc_idx,
-                "doc_id": doc_id,
-                "progress": job.progress,
-                "processed": step_counter[0],
-                "total": total_steps,
-            })
-        from picarones.evaluation.metrics.normalization import _parse_exclude_chars
-        char_excl = _parse_exclude_chars(req.char_exclude) if req.char_exclude else None
-        # Sprint D.4 du plan v2.0 — migration ``run_benchmark_thread``
-        # (legacy v1) vers ``run_benchmark_via_service`` (rewrite),
-        # cohérent avec la migration v2 (Sprint D.3).
-        result = run_benchmark_via_service(
-            corpus=corpus,
-            engines=ocr_engines,
-            output_json=output_json,
-            show_progress=False,
-            progress_callback=_progress_callback,
-            char_exclude=char_excl,
-            cancel_event=job._cancel_event,
-            normalization_profile=req.normalization_profile,
         )
-        if job.status == "cancelled":
-            return
-        job.add_event("log", {"message": "Génération du rapport HTML…"})
-        from picarones.reports.html.generator import ReportGenerator
-        report_lang = getattr(req, "report_lang", "fr")
-        gen = ReportGenerator(result, lang=report_lang)
-        gen.generate(output_html)
-        job.output_path = output_html
-        job.progress = 1.0
-        job.set_status("complete")
-        ranking = result.ranking()
-        job.add_event("complete", {
-            "message": "Benchmark terminé.",
-            "output_html": output_html,
-            "output_json": output_json,
-            "ranking": ranking,
-        })
-    except Exception as exc:  # noqa: BLE001
-        job.set_status("error", error=str(exc))
-        job.add_event("error", {"message": f"Erreur : {exc}"})
 __all__ = [

 Helpers internes (préfixe ``_``)
 --------------------------------
 - ``_build_llm_adapter`` : factory adapter LLM depuis une config
+  ``PipelineConfig``.
 - ``_engine_from_competitor`` : factory moteur OCR ou pipeline
+  OCR+LLM depuis une ``PipelineConfig``.
 Ces utilitaires sont consommés par le router ``/api/benchmark/*``.
 """
 from picarones.interfaces.web.models import (
     BenchmarkRequest,
     BenchmarkRunRequest,
+    PipelineConfig,
 )
 from picarones.interfaces.web.state import BenchmarkJob, iso_now
     return f"{head}event: {event_type}\ndata: {payload}\n\n"
+def _build_llm_adapter(comp: PipelineConfig) -> Any:
     """Instancie un adaptateur LLM depuis la config d'un concurrent."""
     if comp.llm_provider == "openai":
         from picarones.adapters.llm.openai_adapter import OpenAIAdapter
 def _ocr_adapter_name(engine_id: str, ocr_model: str) -> str:
     """Nom canonique de l'adapter OCR pour un couple ``(engine, model)``.
+    Deux ``PipelineConfig`` qui partagent exactement le même couple
     obtiennent le même ``name`` (donc le resolver les déduplique
     proprement).  Deux configs différentes obtiennent des noms
     distincts — pas de collision silencieuse, pas de bricolage côté
         "lang": model or "fra",
         "psm": 6,
     },
+    "pero_ocr": lambda model: {
+        "config_path": model or "",
+    },
+    # Phase 3 chantier post-rewrite : kraken/calamari étaient annoncés
+    # par ``/api/engines`` mais sans factory branchée → benchmark web
+    # échouait silencieusement.  Le ``ocr_model`` côté UI véhicule
+    # désormais le chemin du modèle (Kraken ``.mlmodel`` ou Calamari
+    # checkpoint).  Si vide, l'adapter lève une OCRAdapterError
+    # explicite à ``execute`` — pas de fallback silencieux.
+    "kraken": lambda model: {
+        "model_path": model or "",
+    },
+    "calamari": lambda model: {
+        "checkpoint": model or "",
+    },
     "mistral_ocr": lambda model: {
         "model": model or "mistral-ocr-latest",
     },
     return kwargs
+def _engine_from_competitor(comp: PipelineConfig) -> Any:
+    """Instancie un moteur OCR (ou pipeline OCR+LLM) depuis une PipelineConfig.
     Modes supportés :
     # des constructeurs ``BaseOCREngine`` legacy.  Les adapters
     # canoniques ont des kwargs nommés (pas de dict ``config``) — la
     # conversion se fait ici en respectant les noms historiques des
+    # champs ``PipelineConfig.ocr_model``.
     ocr = None
     if not is_corpus_ocr:
         from picarones.adapters.ocr.factory import ocr_adapter_from_name
     # Pipeline OCR+LLM (live ou post-correction) — ``OCRLLMPipelineConfig``
     # canonique remplace l'ex-``OCRLLMPipeline`` legacy.
+    #
+    # Phase 2 chantier post-rewrite : suppression de l'ancien ``mode_map``
+    # qui aliasait silencieusement (``post_correction_text`` →
+    # ``text_only``, valeur inconnue → ``text_only``).  Désormais le
+    # typage Pydantic ``PipelineMode`` rejette en 422 toute chaîne hors
+    # de la matrice {``text_only``, ``text_and_image``, ``zero_shot``},
+    # et un éventuel client API qui passerait outre la validation
+    # (test legacy, payload forgé) reçoit ici une ``ValueError``.
+    mode = comp.pipeline_mode
+    if mode not in ("text_only", "text_and_image", "zero_shot"):
+        raise ValueError(
+            f"pipeline_mode invalide : {comp.pipeline_mode!r}.  "
+            "Valeurs acceptées : 'text_only', 'text_and_image', 'zero_shot'.",
+        )
     llm = _build_llm_adapter(comp)
 def run_benchmark_thread_v2(job: BenchmarkJob, req: BenchmarkRunRequest) -> None:
+    """Exécute un benchmark à partir d'une liste de ``PipelineConfig``."""
     job.set_status("running")
     job.started_at = iso_now()
     job.add_event("start", {"message": "Démarrage du benchmark…", "corpus": req.corpus_path})
         job.add_event("error", {"message": f"Erreur : {exc}"})
+def _legacy_request_to_run_request(req: BenchmarkRequest) -> BenchmarkRunRequest:
+    """Convertit un ``BenchmarkRequest`` legacy en ``BenchmarkRunRequest``.
+    Phase 4 du chantier post-rewrite : ``/api/benchmark/start`` est
+    rétrocompatible mais délègue désormais au worker v2 unifié.  La
+    conversion mappe chaque ``engine_name`` en ``PipelineConfig``
+    (OCR seul, sans LLM) en préservant ``lang`` pour Tesseract.
+    Garantit qu'un patch sécurité/méthodologique appliqué au chemin
+    canonique (v2) s'applique aussi au chemin legacy — l'éviction
+    progressive de ``/start`` peut se faire sans double maintenance.
+    """
+    competitors: list[PipelineConfig] = []
+    for engine_name in req.engines:
+        # ``ocr_model`` véhicule le ``lang`` Tesseract via la registry
+        # ``_OCR_KWARGS_BUILDERS`` ; pour les autres engines on laisse
+        # vide (l'adapter utilise son défaut).
+        model = req.lang if engine_name.lower() in ("tesseract", "tess") else ""
+        competitors.append(
+            PipelineConfig(
+                name="",
+                ocr_engine=engine_name,
+                ocr_model=model,
+                llm_provider="",
+                llm_model="",
+                pipeline_mode="",
+                prompt_file="",
+            ),
         )
+    return BenchmarkRunRequest(
+        corpus_path=req.corpus_path,
+        competitors=competitors,
+        normalization_profile=req.normalization_profile,
+        char_exclude=req.char_exclude,
+        output_dir=req.output_dir,
+        report_name=req.report_name,
+        report_lang=req.report_lang,
+    )
+def run_benchmark_thread(job: BenchmarkJob, req: BenchmarkRequest) -> None:
+    """Worker historique de ``/api/benchmark/start``.
+    Phase 4 du chantier post-rewrite : unifié avec ``run_benchmark_thread_v2``
+    via conversion ``BenchmarkRequest → BenchmarkRunRequest``.  Avant
+    cette unification, deux workers indépendants implémentaient
+    presque la même logique → tout patch (sécurité, méthodologie)
+    devait être dupliqué, et il était facile d'en oublier un.
+    Marqué deprecated dans les logs ; à supprimer dans une release
+    future après que tous les consommateurs aient migré vers
+    ``/api/benchmark/run``.
+    """
+    import logging as _logging
+    _logging.getLogger(__name__).warning(
+        "[benchmark] /api/benchmark/start est déprécié — utiliser "
+        "/api/benchmark/run (PipelineConfig).  Phase 4 du chantier "
+        "post-rewrite : le worker legacy délègue désormais au v2 unifié.",
+    )
+    job.add_event("log", {
+        "message": (
+            "Note : /api/benchmark/start est déprécié — utiliser "
+            "/api/benchmark/run pour les nouveaux clients."
+        ),
+    })
+    return run_benchmark_thread_v2(job, _legacy_request_to_run_request(req))
 __all__ = [

picarones/interfaces/web/corpus_utils.py CHANGED Viewed

@@ -2,12 +2,14 @@
 Détection ALTO/PAGE, extraction de texte GT, analyse de la structure
 d'un dossier corpus, extraction de ZIP avec garde-fous (taille
-décompressée, nombre de fichiers). Le parsing XML sécurisé délègue
 à :func:`picarones.formats._xml_utils.safe_parse_xml`.
 """
 from __future__ import annotations
 import xml.etree.ElementTree as ET
 import zipfile
 from pathlib import Path
@@ -15,6 +17,8 @@ from pathlib import Path
 from picarones.formats._xml_utils import safe_parse_xml
 from picarones.interfaces.web.state import IMAGE_EXTS
 # Garde-fous ZIP-bomb pour l'upload
 MAX_ZIP_TOTAL_SIZE = 500 * 1024 * 1024
 """500 Mo décompressé maximum."""
@@ -165,17 +169,83 @@ def analyze_corpus_dir(path: Path) -> dict:
 # Extraction ZIP sécurisée
 # ──────────────────────────────────────────────────────────────────────────
-def flatten_zip_to_dir(zf: zipfile.ZipFile, dest: Path) -> None:
     """Extrait un ZIP en aplatissant les paires image/.gt.txt/.xml dans ``dest``.
     Garde-fous :
     - Ignore les fichiers cachés macOS (préfixe ``.`` ou ``__MACOSX``).
     - Refuse si la taille décompressée totale dépasse ``MAX_ZIP_TOTAL_SIZE``.
     - Refuse si le nombre de fichiers extraits dépasse ``MAX_ZIP_FILES``.
     """
     dest.mkdir(parents=True, exist_ok=True)
     total_size = 0
     file_count = 0
     for member in zf.infolist():
         if member.is_dir():
             continue
@@ -183,23 +253,49 @@ def flatten_zip_to_dir(zf: zipfile.ZipFile, dest: Path) -> None:
         name = p.name
         if name.startswith("."):
             continue
-        if (
-            p.suffix.lower() in IMAGE_EXTS
             or name.endswith(".gt.txt")
             or name.endswith(".ocr.txt")
-            or p.suffix.lower() == ".xml"
         ):
-            total_size += member.file_size
-            if total_size > MAX_ZIP_TOTAL_SIZE:
-                raise ValueError(
-                    f"ZIP trop volumineux : taille décompressée > "
-                    f"{MAX_ZIP_TOTAL_SIZE // (1024*1024)} Mo"
-                )
-            file_count += 1
-            if file_count > MAX_ZIP_FILES:
-                raise ValueError(f"ZIP contient trop de fichiers (> {MAX_ZIP_FILES})")
-            data = zf.read(member.filename)
-            (dest / name).write_bytes(data)
 __all__ = [

 Détection ALTO/PAGE, extraction de texte GT, analyse de la structure
 d'un dossier corpus, extraction de ZIP avec garde-fous (taille
+décompressée, nombre de fichiers, validation image extraite,
+détection de collision de basename). Le parsing XML sécurisé délègue
 à :func:`picarones.formats._xml_utils.safe_parse_xml`.
 """
 from __future__ import annotations
+import logging
 import xml.etree.ElementTree as ET
 import zipfile
 from pathlib import Path
 from picarones.formats._xml_utils import safe_parse_xml
 from picarones.interfaces.web.state import IMAGE_EXTS
+logger = logging.getLogger(__name__)
 # Garde-fous ZIP-bomb pour l'upload
 MAX_ZIP_TOTAL_SIZE = 500 * 1024 * 1024
 """500 Mo décompressé maximum."""
 # Extraction ZIP sécurisée
 # ──────────────────────────────────────────────────────────────────────────
+def _slug_dirname(source_path: Path) -> str:
+    """Slugifie le ``dirname`` d'une entrée ZIP pour préfixer en cas de collision.
+    ``a/b/img.png`` → ``a_b``.  Caractères non sûrs (``..``, séparateurs)
+    sont normalisés en ``_``.  Vide si l'entrée est à la racine du ZIP.
+    """
+    parent = source_path.parent
+    if parent == Path() or str(parent) == ".":
+        return ""
+    parts = [
+        part.replace("..", "_").replace("/", "_").replace("\\", "_")
+        for part in parent.parts
+        if part not in ("", "/", "\\")
+    ]
+    return "_".join(p for p in parts if p)
+def _resolve_collision(
+    name: str, source_path: Path, taken: set[str],
+) -> str:
+    """Renomme ``name`` pour éviter une collision avec ``taken``.
+    Stratégie :
+    1. Préfixe avec le slug du dirname source (traçabilité).  Si pas de
+       dirname ou si déjà pris, ajoute un suffixe numérique.
+    2. Lève ``ValueError`` après 1000 tentatives (corpus pathologique).
+    """
+    slug = _slug_dirname(source_path)
+    if slug:
+        candidate = f"{slug}__{name}"
+        if candidate not in taken:
+            return candidate
+    stem = Path(name).stem
+    suffix = "".join(Path(name).suffixes)
+    for n in range(2, 1001):
+        candidate = f"{stem}_{n}{suffix}"
+        if candidate not in taken:
+            return candidate
+    raise ValueError(
+        f"Impossible de résoudre la collision de basename pour {name!r} "
+        f"après 1000 tentatives — corpus pathologique ?",
+    )
+def flatten_zip_to_dir(
+    zf: zipfile.ZipFile,
+    dest: Path,
+    *,
+    validate_images: bool = True,
+) -> None:
     """Extrait un ZIP en aplatissant les paires image/.gt.txt/.xml dans ``dest``.
     Garde-fous :
     - Ignore les fichiers cachés macOS (préfixe ``.`` ou ``__MACOSX``).
     - Refuse si la taille décompressée totale dépasse ``MAX_ZIP_TOTAL_SIZE``.
     - Refuse si le nombre de fichiers extraits dépasse ``MAX_ZIP_FILES``.
+    - **Détection de collision de basename** : ``a/img.png`` et
+      ``b/img.png`` ne s'écrasent plus silencieusement — le second est
+      renommé avec un préfixe dérivé de son dossier source (ex.
+      ``b__img.png``) et un warning est loggué.  Sans ce garde-fou,
+      l'utilisateur pouvait associer silencieusement une image à une
+      GT incorrecte.
+    - **Validation image** : chaque image extraite passe par
+      :func:`validate_image_safe` (Pillow.verify, anti-bombe), de la
+      même manière que les uploads directs.  Désactivable via
+      ``validate_images=False`` (utile aux tests qui ne fournissent
+      pas de PNG complets).
     """
+    # Import retardé : ``security`` dépend de ``state`` qui dépend de
+    # ``corpus_utils`` → circulaire si import toplevel.
+    from picarones.interfaces.web.security import validate_image_safe
     dest.mkdir(parents=True, exist_ok=True)
     total_size = 0
     file_count = 0
+    written_names: set[str] = set()
     for member in zf.infolist():
         if member.is_dir():
             continue
         name = p.name
         if name.startswith("."):
             continue
+        suffix_lower = p.suffix.lower()
+        is_image = suffix_lower in IMAGE_EXTS
+        if not (
+            is_image
             or name.endswith(".gt.txt")
             or name.endswith(".ocr.txt")
+            or suffix_lower == ".xml"
         ):
+            continue
+        total_size += member.file_size
+        if total_size > MAX_ZIP_TOTAL_SIZE:
+            raise ValueError(
+                f"ZIP trop volumineux : taille décompressée > "
+                f"{MAX_ZIP_TOTAL_SIZE // (1024*1024)} Mo"
+            )
+        file_count += 1
+        if file_count > MAX_ZIP_FILES:
+            raise ValueError(f"ZIP contient trop de fichiers (> {MAX_ZIP_FILES})")
+        data = zf.read(member.filename)
+        # Validation image après extraction (les images directes sont
+        # déjà validées par ``api_corpus_upload``, mais celles extraites
+        # d'un ZIP ne passaient pas par cette vérification — vecteur de
+        # zip bomb passant les 500 Mo brut).
+        if is_image and validate_images:
+            validate_image_safe(data, filename=name)
+        # Détection de collision : ``a/img.png`` et ``b/img.png`` ne
+        # doivent pas s'écraser silencieusement (vecteur de
+        # mauvaise association image/GT après aplatissement).
+        if name in written_names:
+            new_name = _resolve_collision(name, p, written_names)
+            logger.warning(
+                "[flatten_zip] collision de basename %r — renommé en %r "
+                "(source ZIP : %r)",
+                name, new_name, member.filename,
+            )
+            name = new_name
+        written_names.add(name)
+        (dest / name).write_bytes(data)
 __all__ = [

picarones/interfaces/web/models.py CHANGED Viewed

@@ -67,6 +67,24 @@ Liste alignée sur ``measurements.normalization.NORMALIZATION_PROFILES``
 répercutée ici sous peine de rejet Pydantic au niveau API web.
 Sprint A14-S1 — alignement README ↔ web models ↔ runtime."""
 class BenchmarkRequest(BaseModel):
     corpus_path: str = Field(min_length=1, max_length=_MAX_PATH)
@@ -94,7 +112,7 @@ class HuggingFaceImportRequest(BaseModel):
     max_samples: int = Field(default=100, ge=1, le=10_000)
-class CompetitorConfig(BaseModel):
     name: str = Field(default="", max_length=_MAX_NAME)
     ocr_engine: str = Field(default="", max_length=_MAX_NAME)
     """Moteur OCR : ``tesseract``, ``mistral_ocr``, … ou ``corpus``
@@ -102,13 +120,20 @@ class CompetitorConfig(BaseModel):
     ocr_model: str = Field(default="", max_length=_MAX_NAME)
     llm_provider: str = Field(default="", max_length=_MAX_NAME)
     llm_model: str = Field(default="", max_length=_MAX_NAME)
-    pipeline_mode: str = Field(default="", max_length=_MAX_NAME)
     prompt_file: str = Field(default="", max_length=_MAX_PROMPT_FILENAME)
 class BenchmarkRunRequest(BaseModel):
     corpus_path: str = Field(min_length=1, max_length=_MAX_PATH)
-    competitors: list[CompetitorConfig] = Field(
         min_length=1, max_length=_MAX_COMPETITORS,
     )
     normalization_profile: NormalizationProfileId = "nfc"
@@ -122,9 +147,10 @@ __all__ = [
     "TesseractLang",
     "ReportLang",
     "NormalizationProfileId",
     "BenchmarkRequest",
     "HTRUnitedImportRequest",
     "HuggingFaceImportRequest",
-    "CompetitorConfig",
     "BenchmarkRunRequest",
 ]

 répercutée ici sous peine de rejet Pydantic au niveau API web.
 Sprint A14-S1 — alignement README ↔ web models ↔ runtime."""
+PipelineMode = Literal["text_only", "text_and_image", "zero_shot"]
+"""Modes de pipeline OCR+LLM acceptés par ``PipelineConfig``.
+Aligné sur :class:`picarones.pipeline.llm_pipeline_config.OCRLLMMode` —
+toute valeur hors de ces 3 littéraux est rejetée 422 par Pydantic.
+Sémantique :
+- ``text_only`` — l'OCR amont produit un texte brut, le LLM le corrige
+  sans voir l'image (post-correction texte).
+- ``text_and_image`` — l'OCR amont produit un texte ; le VLM le corrige
+  en s'appuyant sur l'image (post-correction multimodale).
+- ``zero_shot`` — pas d'OCR amont ; un VLM transcrit l'image directement.
+Phase 2 du chantier post-rewrite : suppression du fallback silencieux
+``mode_map.get(comp.pipeline_mode, 'text_only')`` qui acceptait toute
+chaîne arbitraire et la mappait sur ``text_only``."""
 class BenchmarkRequest(BaseModel):
     corpus_path: str = Field(min_length=1, max_length=_MAX_PATH)
     max_samples: int = Field(default=100, ge=1, le=10_000)
+class PipelineConfig(BaseModel):
     name: str = Field(default="", max_length=_MAX_NAME)
     ocr_engine: str = Field(default="", max_length=_MAX_NAME)
     """Moteur OCR : ``tesseract``, ``mistral_ocr``, … ou ``corpus``
     ocr_model: str = Field(default="", max_length=_MAX_NAME)
     llm_provider: str = Field(default="", max_length=_MAX_NAME)
     llm_model: str = Field(default="", max_length=_MAX_NAME)
+    pipeline_mode: PipelineMode | Literal[""] = ""
+    """Mode du pipeline OCR+LLM — vide si pas de pipeline LLM (OCR seul).
+    Typage strict (Phase 2 chantier post-rewrite) : Pydantic rejette
+    en 422 toute valeur hors de la matrice canonique au lieu d'aliaser
+    silencieusement sur ``text_only``.  La chaîne vide (``""``) reste
+    autorisée pour indiquer qu'aucun LLM n'est attaché au moteur OCR.
+    """
     prompt_file: str = Field(default="", max_length=_MAX_PROMPT_FILENAME)
 class BenchmarkRunRequest(BaseModel):
     corpus_path: str = Field(min_length=1, max_length=_MAX_PATH)
+    competitors: list[PipelineConfig] = Field(
         min_length=1, max_length=_MAX_COMPETITORS,
     )
     normalization_profile: NormalizationProfileId = "nfc"
     "TesseractLang",
     "ReportLang",
     "NormalizationProfileId",
+    "PipelineMode",
     "BenchmarkRequest",
     "HTRUnitedImportRequest",
     "HuggingFaceImportRequest",
+    "PipelineConfig",
     "BenchmarkRunRequest",
 ]

picarones/interfaces/web/routers/benchmark.py CHANGED Viewed

@@ -2,7 +2,7 @@
 Le ``stream`` SSE supporte la reprise via ``Last-Event-ID`` (Sprint 26).
 ``start`` lance un benchmark à liste de moteurs ; ``run`` accepte des
-``CompetitorConfig`` composés (OCR + LLM, pipelines mutualisés) —
 deux endpoints distincts pour deux UX historiquement séparées.
 """
@@ -107,7 +107,13 @@ async def api_benchmark_start(req: BenchmarkRequest, request: Request) -> dict:
     job_id = str(uuid.uuid4())
     job = state.BenchmarkJob(job_id=job_id, _store=state.JOB_STORE)
-    state.JOB_STORE.create_job(job_id)
     state.register_job(job)
     state.cleanup_old_jobs()
@@ -116,14 +122,14 @@ async def api_benchmark_start(req: BenchmarkRequest, request: Request) -> dict:
 # ──────────────────────────────────────────────────────────────────────────
-# Lancement composé : liste de CompetitorConfig (BenchmarkRunRequest)
 # ──────────────────────────────────────────────────────────────────────────
 @router.post("/api/benchmark/run")
 async def api_benchmark_run(req: BenchmarkRunRequest, request: Request) -> dict:
     """Lance un benchmark à concurrents composés (OCR + LLM, pipelines).
-    Chaque ``CompetitorConfig`` peut combiner un moteur OCR et un
     provider LLM (mode post-correction, zero-shot, ou OCR seul).
     """
     # ``competitors`` non vide est garanti par Pydantic ``min_length=1``.
@@ -177,7 +183,8 @@ async def api_benchmark_run(req: BenchmarkRunRequest, request: Request) -> dict:
     job_id = str(uuid.uuid4())
     job = state.BenchmarkJob(job_id=job_id, _store=state.JOB_STORE)
-    state.JOB_STORE.create_job(job_id)
     state.register_job(job)
     _start_job_thread(job, run_benchmark_thread_v2, req)

 Le ``stream`` SSE supporte la reprise via ``Last-Event-ID`` (Sprint 26).
 ``start`` lance un benchmark à liste de moteurs ; ``run`` accepte des
+``PipelineConfig`` composés (OCR + LLM, pipelines mutualisés) —
 deux endpoints distincts pour deux UX historiquement séparées.
 """
     job_id = str(uuid.uuid4())
     job = state.BenchmarkJob(job_id=job_id, _store=state.JOB_STORE)
+    # Phase 4 du chantier post-rewrite : le payload du job contient
+    # désormais le ``corpus_path`` actif, pour que la tâche de purge
+    # ``upload_purge_task`` puisse identifier les corpus référencés
+    # par des jobs en cours et ne pas les supprimer.  Avant ce
+    # branchement, la purge supprimait potentiellement des corpus
+    # actifs dont les uploads étaient plus anciens que la rétention.
+    state.JOB_STORE.create_job(job_id, payload={"corpus": req.corpus_path})
     state.register_job(job)
     state.cleanup_old_jobs()
 # ──────────────────────────────────────────────────────────────────────────
+# Lancement composé : liste de PipelineConfig (BenchmarkRunRequest)
 # ──────────────────────────────────────────────────────────────────────────
 @router.post("/api/benchmark/run")
 async def api_benchmark_run(req: BenchmarkRunRequest, request: Request) -> dict:
     """Lance un benchmark à concurrents composés (OCR + LLM, pipelines).
+    Chaque ``PipelineConfig`` peut combiner un moteur OCR et un
     provider LLM (mode post-correction, zero-shot, ou OCR seul).
     """
     # ``competitors`` non vide est garanti par Pydantic ``min_length=1``.
     job_id = str(uuid.uuid4())
     job = state.BenchmarkJob(job_id=job_id, _store=state.JOB_STORE)
+    # Phase 4 — payload incluant le corpus actif pour la purge auto.
+    state.JOB_STORE.create_job(job_id, payload={"corpus": req.corpus_path})
     state.register_job(job)
     _start_job_thread(job, run_benchmark_thread_v2, req)

picarones/interfaces/web/routers/history.py CHANGED Viewed

@@ -4,15 +4,32 @@ Surface de l'infrastructure ``BenchmarkHistory`` qui était
 limitée au CLI ``picarones history --regression``. Le rapport HTML
 peut désormais consommer cet endpoint pour afficher un encart
 *« ⚠ Tesseract a régressé de 0,8 pp depuis le 12 janvier »* en tête.
 """
 from __future__ import annotations
 import logging
 from typing import Any, Optional
 from fastapi import APIRouter, HTTPException, Query
 router = APIRouter()
 _logger = logging.getLogger(__name__)
@@ -21,13 +38,37 @@ _logger = logging.getLogger(__name__)
 async def api_history_regressions(
     engine: Optional[str] = Query(default=None, description="Filtre par moteur"),
     threshold: float = Query(default=0.01, description="Seuil régression CER absolu"),
-    db_path: Optional[str] = Query(default=None, description="Chemin SQLite history"),
 ) -> dict:
     """Liste les régressions détectées dans l'historique longitudinal."""
     from picarones.evaluation.metrics.history import BenchmarkHistory
     try:
-        history = BenchmarkHistory(db_path) if db_path else BenchmarkHistory()
     except Exception as exc:  # noqa: BLE001
         raise HTTPException(
             status_code=500, detail=f"Ouverture historique échouée : {exc}",

 limitée au CLI ``picarones history --regression``. Le rapport HTML
 peut désormais consommer cet endpoint pour afficher un encart
 *« ⚠ Tesseract a régressé de 0,8 pp depuis le 12 janvier »* en tête.
+Sécurité — paramètre ``db_path``
+---------------------------------
+Le paramètre ``db_path`` est validé contre les racines workspace
+autorisées via :func:`validated_path`. Sans ce garde-fou, l'endpoint
+acceptait un chemin SQLite libre — vecteur de lecture filesystem
+arbitraire (path traversal).  Pour pointer une base alternative à
+l'extérieur des workspaces, exporter ``PICARONES_HISTORY_DB`` plutôt
+que de passer ``db_path`` par query string.
 """
 from __future__ import annotations
 import logging
+import os
 from typing import Any, Optional
 from fastapi import APIRouter, HTTPException, Query
+from picarones.interfaces.web.security import (
+    PathValidationError,
+    compute_workspace_roots,
+    validated_path,
+)
+from picarones.interfaces.web.state import UPLOADS_DIR
 router = APIRouter()
 _logger = logging.getLogger(__name__)
 async def api_history_regressions(
     engine: Optional[str] = Query(default=None, description="Filtre par moteur"),
     threshold: float = Query(default=0.01, description="Seuil régression CER absolu"),
+    db_path: Optional[str] = Query(
+        default=None,
+        description=(
+            "Chemin SQLite history (validé contre les workspace roots ; "
+            "préférer la variable d'env PICARONES_HISTORY_DB)."
+        ),
+    ),
 ) -> dict:
     """Liste les régressions détectées dans l'historique longitudinal."""
     from picarones.evaluation.metrics.history import BenchmarkHistory
+    if db_path:
+        try:
+            resolved = validated_path(
+                db_path,
+                allowed_roots=compute_workspace_roots(UPLOADS_DIR),
+                must_exist=False,
+            )
+        except PathValidationError as exc:
+            raise HTTPException(status_code=400, detail=str(exc)) from exc
+        effective_db_path: Optional[str] = str(resolved)
+    else:
+        env_db = os.environ.get("PICARONES_HISTORY_DB", "").strip()
+        effective_db_path = env_db or None
     try:
+        history = (
+            BenchmarkHistory(effective_db_path)
+            if effective_db_path
+            else BenchmarkHistory()
+        )
     except Exception as exc:  # noqa: BLE001
         raise HTTPException(
             status_code=500, detail=f"Ouverture historique échouée : {exc}",

picarones/interfaces/web/routers/importers.py CHANGED Viewed

@@ -2,13 +2,65 @@
 from __future__ import annotations
 from fastapi import APIRouter, HTTPException, Query
 from picarones.interfaces.web.models import HTRUnitedImportRequest, HuggingFaceImportRequest
 router = APIRouter()
 # ──────────────────────────────────────────────────────────────────────────
 # HTR-United
 # ──────────────────────────────────────────────────────────────────────────
@@ -19,10 +71,8 @@ async def api_htr_united_catalogue(
     language: str = Query(default="", description="Filtre langue"),
     script: str = Query(default="", description="Filtre type d'écriture"),
 ) -> dict:
-    """Catalogue HTR-United filtrable."""
-    from picarones.adapters.corpus.htr_united import HTRUnitedCatalogue
-    cat = HTRUnitedCatalogue.from_demo()
     results = cat.search(
         query=query,
         language=language or None,
@@ -30,6 +80,10 @@ async def api_htr_united_catalogue(
     )
     return {
         "source": cat.source,
         "total": len(results),
         "entries": [e.as_dict() for e in results],
         "available_languages": cat.available_languages(),
@@ -40,12 +94,10 @@ async def api_htr_united_catalogue(
 @router.post("/api/htr-united/import")
 async def api_htr_united_import(req: HTRUnitedImportRequest) -> dict:
     """Importe une entrée HTR-United dans ``req.output_dir``."""
-    from picarones.adapters.corpus.htr_united import (
-        HTRUnitedCatalogue,
-        import_htr_united_corpus,
-    )
-    cat = HTRUnitedCatalogue.from_demo()
     entry = cat.get_by_id(req.entry_id)
     if not entry:
         raise HTTPException(
@@ -54,7 +106,7 @@ async def api_htr_united_import(req: HTRUnitedImportRequest) -> dict:
     return import_htr_united_corpus(
         entry=entry,
-        output_dir=req.output_dir,
         max_samples=req.max_samples,
     )
@@ -92,10 +144,11 @@ async def api_huggingface_import(req: HuggingFaceImportRequest) -> dict:
     """Importe un dataset HuggingFace dans ``req.output_dir``."""
     from picarones.adapters.corpus.huggingface import HuggingFaceImporter
     importer = HuggingFaceImporter()
     return importer.import_dataset(
         dataset_id=req.dataset_id,
-        output_dir=req.output_dir,
         split=req.split,
         max_samples=req.max_samples,
     )

 from __future__ import annotations
+import os
 from fastapi import APIRouter, HTTPException, Query
 from picarones.interfaces.web.models import HTRUnitedImportRequest, HuggingFaceImportRequest
+from picarones.interfaces.web.security import (
+    PathValidationError,
+    compute_workspace_roots,
+    validated_path,
+)
+from picarones.interfaces.web.state import UPLOADS_DIR
 router = APIRouter()
+def _htr_united_catalogue():
+    """Récupère le catalogue HTR-United (remote ou demo).
+    Phase 4.4 du chantier post-rewrite : auparavant le router
+    appelait ``HTRUnitedCatalogue.from_demo()`` exclusivement —
+    l'UI annonçait "catalogue HTR-United" alors qu'on chargeait
+    un échantillon embarqué.  Désormais ``from_remote()`` est
+    utilisé (avec fallback automatique sur demo si offline), et
+    le champ ``source`` (``"remote" | "demo"``) est exposé dans
+    la réponse pour que l'UI puisse signaler clairement le mode
+    actif.
+    En CI / déploiement sans réseau, exporter
+    ``PICARONES_HTR_UNITED_OFFLINE=1`` force le mode démo et
+    évite un timeout de 10s à chaque GET catalogue.
+    """
+    from picarones.adapters.corpus.htr_united import HTRUnitedCatalogue
+    if os.environ.get("PICARONES_HTR_UNITED_OFFLINE", "").strip() in (
+        "1", "true", "yes",
+    ):
+        return HTRUnitedCatalogue.from_demo()
+    return HTRUnitedCatalogue.from_remote(timeout=5)
+def _validated_output_dir(user_path: str) -> str:
+    """Valide ``output_dir`` reçu d'un importer contre les racines workspace.
+    Les endpoints d'import écrivent un corpus distant sur le filesystem
+    du serveur — un ``output_dir`` libre permet d'écrire arbitrairement
+    (path traversal). On valide ici contre :func:`compute_workspace_roots`
+    avant de passer la chaîne au backend.
+    """
+    try:
+        resolved = validated_path(
+            user_path,
+            allowed_roots=compute_workspace_roots(UPLOADS_DIR),
+            must_exist=False,
+        )
+    except PathValidationError as exc:
+        raise HTTPException(status_code=400, detail=str(exc)) from exc
+    return str(resolved)
 # ──────────────────────────────────────────────────────────────────────────
 # HTR-United
 # ──────────────────────────────────────────────────────────────────────────
     language: str = Query(default="", description="Filtre langue"),
     script: str = Query(default="", description="Filtre type d'écriture"),
 ) -> dict:
+    """Catalogue HTR-United filtrable (remote, fallback demo si offline)."""
+    cat = _htr_united_catalogue()
     results = cat.search(
         query=query,
         language=language or None,
     )
     return {
         "source": cat.source,
+        # Indication explicite du mode pour l'UI : "demo" si on charge
+        # le catalogue embarqué (réseau indisponible ou variable
+        # ``PICARONES_HTR_UNITED_OFFLINE=1`` exportée).
+        "is_demo": cat.source == "demo",
         "total": len(results),
         "entries": [e.as_dict() for e in results],
         "available_languages": cat.available_languages(),
 @router.post("/api/htr-united/import")
 async def api_htr_united_import(req: HTRUnitedImportRequest) -> dict:
     """Importe une entrée HTR-United dans ``req.output_dir``."""
+    from picarones.adapters.corpus.htr_united import import_htr_united_corpus
+    output_dir = _validated_output_dir(req.output_dir)
+    cat = _htr_united_catalogue()
     entry = cat.get_by_id(req.entry_id)
     if not entry:
         raise HTTPException(
     return import_htr_united_corpus(
         entry=entry,
+        output_dir=output_dir,
         max_samples=req.max_samples,
     )
     """Importe un dataset HuggingFace dans ``req.output_dir``."""
     from picarones.adapters.corpus.huggingface import HuggingFaceImporter
+    output_dir = _validated_output_dir(req.output_dir)
     importer = HuggingFaceImporter()
     return importer.import_dataset(
         dataset_id=req.dataset_id,
+        output_dir=output_dir,
         split=req.split,
         max_samples=req.max_samples,
     )

picarones/reports/html/generator.py CHANGED Viewed

@@ -436,54 +436,17 @@ class ReportGenerator:
         Compatible avec les fichiers produits par ``BenchmarkResult.to_json()``.
         Les images base64 doivent être passées via ``kwargs["images_b64"]``
         si elles ne sont pas dans le JSON.
-        """
-        import json as _json
-        data = _json.loads(Path(json_path).read_text(encoding="utf-8"))
-        # Reconstruction minimale d'un BenchmarkResult depuis le dict
-        from picarones.evaluation.metric_result import MetricsResult
-        from picarones.evaluation.benchmark_result import DocumentResult, EngineReport
-        engine_reports = []
-        for er_data in data.get("engine_reports", []):
-            doc_results = []
-            for dr_data in er_data.get("document_results", []):
-                m = dr_data["metrics"]
-                metrics = MetricsResult(
-                    cer=m["cer"], cer_nfc=m["cer_nfc"], cer_caseless=m["cer_caseless"],
-                    wer=m["wer"], wer_normalized=m["wer_normalized"],
-                    mer=m["mer"], wil=m["wil"],
-                    reference_length=m["reference_length"],
-                    hypothesis_length=m["hypothesis_length"],
-                    error=m.get("error"),
-                )
-                doc_results.append(DocumentResult(
-                    doc_id=dr_data["doc_id"],
-                    image_path=dr_data["image_path"],
-                    ground_truth=dr_data["ground_truth"],
-                    hypothesis=dr_data["hypothesis"],
-                    metrics=metrics,
-                    duration_seconds=dr_data.get("duration_seconds", 0.0),
-                    engine_error=dr_data.get("engine_error"),
-                ))
-            engine_reports.append(EngineReport(
-                engine_name=er_data["engine_name"],
-                engine_version=er_data.get("engine_version", "unknown"),
-                engine_config=er_data.get("engine_config", {}),
-                document_results=doc_results,
-            ))
-        corpus_info = data.get("corpus", {})
-        bm = BenchmarkResult(
-            corpus_name=corpus_info.get("name", "Corpus"),
-            corpus_source=corpus_info.get("source"),
-            document_count=corpus_info.get("document_count", 0),
-            engine_reports=engine_reports,
-            run_date=data.get("run_date", ""),
-            picarones_version=data.get("picarones_version", ""),
-            metadata=data.get("metadata", {}),
-        )
         images_b64 = kwargs.pop("images_b64", {})
         return cls(bm, images_b64=images_b64, **kwargs)

         Compatible avec les fichiers produits par ``BenchmarkResult.to_json()``.
         Les images base64 doivent être passées via ``kwargs["images_b64"]``
         si elles ne sont pas dans le JSON.
+        Phase 2.2 du chantier post-rewrite : délégué à
+        :meth:`BenchmarkResult.from_json_object` qui reconstruit tous
+        les champs avancés (confusion_matrix, taxonomy, structure,
+        hallucination_metrics, ner_metrics, calibration_metrics,
+        philological_metrics, searchability_metrics,
+        numerical_sequence_metrics, readability_metrics,
+        pipeline_metadata, ocr_intermediate + leurs équivalents
+        ``aggregated_*`` au niveau EngineReport).  Le rapport régénéré
+        depuis JSON est désormais indistinguable du rapport in-memory.
+        """
+        bm = BenchmarkResult.from_json_object(json_path)
         images_b64 = kwargs.pop("images_b64", {})
         return cls(bm, images_b64=images_b64, **kwargs)

pyproject.toml CHANGED Viewed

@@ -105,6 +105,7 @@ docs = [
 # Moteurs OCR optionnels
 pero = ["pero-ocr>=0.1.0"]
 kraken = ["kraken>=4.0.0"]
 # Adaptateurs LLM
 llm = [
     "openai>=1.0.0",

 # Moteurs OCR optionnels
 pero = ["pero-ocr>=0.1.0"]
 kraken = ["kraken>=4.0.0"]
+calamari = ["calamari-ocr>=2.0.0"]
 # Adaptateurs LLM
 llm = [
     "openai>=1.0.0",

scripts/gen_readme_tables.py CHANGED Viewed

@@ -65,6 +65,8 @@ _ENGINE_DESCRIPTIONS: dict[str, tuple[str, str, str]] = {
     # name → (display_name, type, install_hint)
     "tesseract": ("Tesseract 5", "Local CLI", "`pip install pytesseract` + system binary"),
     "pero_ocr": ("Pero OCR", "Local Python", "`pip install -e .[pero]`"),
     "mistral_ocr": ("Mistral OCR", "Cloud API", "`MISTRAL_API_KEY` env var"),
     "google_vision": ("Google Vision", "Cloud API", "`GOOGLE_APPLICATION_CREDENTIALS` env var"),
     "azure_doc_intel": ("Azure Doc Intelligence", "Cloud API", "`AZURE_DOC_INTEL_ENDPOINT` + `AZURE_DOC_INTEL_KEY`"),

     # name → (display_name, type, install_hint)
     "tesseract": ("Tesseract 5", "Local CLI", "`pip install pytesseract` + system binary"),
     "pero_ocr": ("Pero OCR", "Local Python", "`pip install -e .[pero]`"),
+    "kraken": ("Kraken HTR", "Local Python", "`pip install -e .[kraken]` + modèle `.mlmodel`"),
+    "calamari": ("Calamari OCR", "Local Python", "`pip install -e .[calamari]` + checkpoint"),
     "mistral_ocr": ("Mistral OCR", "Cloud API", "`MISTRAL_API_KEY` env var"),
     "google_vision": ("Google Vision", "Cloud API", "`GOOGLE_APPLICATION_CREDENTIALS` env var"),
     "azure_doc_intel": ("Azure Doc Intelligence", "Cloud API", "`AZURE_DOC_INTEL_ENDPOINT` + `AZURE_DOC_INTEL_KEY`"),

tests/app/test_s9_resolver_collision.py CHANGED Viewed

@@ -12,7 +12,7 @@ Bug observé en prod (interface web, 2026-05-10) :
     avec des instances différentes — collision impossible à résoudre.
 Cause : ``_engine_from_competitor`` crée une instance ``TesseractAdapter``
-fraîche pour chaque ``CompetitorConfig``.  Quand deux concurrents
 partagent le même moteur OCR (l'un seul, l'autre dans un pipeline),
 ``build_adapter_resolver`` voyait deux instances Python distinctes
 sous le même ``name="tesseract"`` et levait ``PicaronesError`` à tort

     avec des instances différentes — collision impossible à résoudre.
 Cause : ``_engine_from_competitor`` crée une instance ``TesseractAdapter``
+fraîche pour chaque ``PipelineConfig``.  Quand deux concurrents
 partagent le même moteur OCR (l'un seul, l'autre dans un pipeline),
 ``build_adapter_resolver`` voyait deux instances Python distinctes
 sous le même ``name="tesseract"`` et levait ``PicaronesError`` à tort

tests/app/test_sprint_d2b_partial_dir_resume.py CHANGED Viewed

@@ -27,10 +27,37 @@ from picarones.app.services.partial_store import (
     _partial_path,
     _sanitize_filename,
     _save_partial_line,
 )
 from picarones.app.services.benchmark_runner import (
     run_benchmark_via_service,
 )
 from picarones.evaluation.benchmark_result import DocumentResult
 from picarones.evaluation.corpus import Corpus, Document
 from picarones.evaluation.metric_result import MetricsResult
@@ -272,7 +299,7 @@ class TestResumeViaPartialDir:
         assert bm.document_count == 2
         # Plus aucun fichier partial pour cet engine après succès.
-        partial_path = _partial_path(corpus.name, ocr.name, partial_dir)
         assert not partial_path.exists()
     def test_resume_skips_already_done_docs(self, tmp_path: Path) -> None:
@@ -296,7 +323,7 @@ class TestResumeViaPartialDir:
         # Pré-écrire un partial pour doc0 avec une CER fictive de 0.99
         # pour vérifier qu'on prend la valeur du partial, pas une
         # nouvelle exécution.
-        partial_path = _partial_path(corpus.name, ocr.name, partial_dir)
         pre_existing = _make_doc_result("doc0", hyp="from_partial", cer=0.99)
         _save_partial_line(partial_path, pre_existing)
@@ -327,7 +354,7 @@ class TestResumeViaPartialDir:
             "Engine ne devrait pas être appelé — tout est dans le partial.",
         )
-        partial_path = _partial_path(corpus.name, ocr.name, partial_dir)
         for i in range(2):
             _save_partial_line(
                 partial_path, _make_doc_result(f"doc{i}", hyp=f"prefilled{i}"),
@@ -358,7 +385,7 @@ class TestResumeViaPartialDir:
         ocr_b._run_ocr = lambda p: "from_b"
         # Pré-remplir uniquement le partial de engine_a pour doc0.
-        partial_a = _partial_path(corpus.name, ocr_a.name, partial_dir)
         _save_partial_line(
             partial_a, _make_doc_result("doc0", hyp="A_pre"),
         )
@@ -423,7 +450,7 @@ class TestResumeViaPartialDir:
         # avec doc0 mais pas doc1.  cancel_event signalé avant
         # l'engine suivant.
         ocr_b = _MockOCR(name="incomplete_b")
-        partial_b = _partial_path(corpus.name, ocr_b.name, partial_dir)
         _save_partial_line(
             partial_b, _make_doc_result("doc0", hyp="B0_pre"),
         )

     _partial_path,
     _sanitize_filename,
     _save_partial_line,
+    partial_path_for_engine,
 )
 from picarones.app.services.benchmark_runner import (
+    _engine_config_for_fingerprint,
     run_benchmark_via_service,
 )
+def _partial_path_for_run(corpus, engine, partial_dir):
+    """Helper test — calcule le chemin partial avec le fingerprint
+    que le runner utilisera par défaut (pas de normalisation, pas
+    de char_exclude, profil ``standard``).  Phase 2.3 du chantier
+    post-rewrite : la clé partial inclut désormais un fingerprint
+    pour empêcher la réutilisation accidentelle entre runs avec
+    configs différentes."""
+    import importlib
+    try:
+        code_version = importlib.import_module("picarones").__version__
+    except (ImportError, AttributeError):
+        code_version = "unknown"
+    return partial_path_for_engine(
+        corpus=corpus,
+        engine=engine,
+        partial_dir=partial_dir,
+        engine_config=_engine_config_for_fingerprint(engine),
+        normalization_profile=None,
+        char_exclude=None,
+        profile="standard",
+        code_version=code_version,
+    )
 from picarones.evaluation.benchmark_result import DocumentResult
 from picarones.evaluation.corpus import Corpus, Document
 from picarones.evaluation.metric_result import MetricsResult
         assert bm.document_count == 2
         # Plus aucun fichier partial pour cet engine après succès.
+        partial_path = _partial_path_for_run(corpus, ocr, partial_dir)
         assert not partial_path.exists()
     def test_resume_skips_already_done_docs(self, tmp_path: Path) -> None:
         # Pré-écrire un partial pour doc0 avec une CER fictive de 0.99
         # pour vérifier qu'on prend la valeur du partial, pas une
         # nouvelle exécution.
+        partial_path = _partial_path_for_run(corpus, ocr, partial_dir)
         pre_existing = _make_doc_result("doc0", hyp="from_partial", cer=0.99)
         _save_partial_line(partial_path, pre_existing)
             "Engine ne devrait pas être appelé — tout est dans le partial.",
         )
+        partial_path = _partial_path_for_run(corpus, ocr, partial_dir)
         for i in range(2):
             _save_partial_line(
                 partial_path, _make_doc_result(f"doc{i}", hyp=f"prefilled{i}"),
         ocr_b._run_ocr = lambda p: "from_b"
         # Pré-remplir uniquement le partial de engine_a pour doc0.
+        partial_a = _partial_path_for_run(corpus, ocr_a, partial_dir)
         _save_partial_line(
             partial_a, _make_doc_result("doc0", hyp="A_pre"),
         )
         # avec doc0 mais pas doc1.  cancel_event signalé avant
         # l'engine suivant.
         ocr_b = _MockOCR(name="incomplete_b")
+        partial_b = _partial_path_for_run(corpus, ocr_b, partial_dir)
         _save_partial_line(
             partial_b, _make_doc_result("doc0", hyp="B0_pre"),
         )

tests/architecture/test_file_budgets.py CHANGED Viewed

@@ -45,7 +45,11 @@ FILE_BUDGETS: dict[str, int] = {
     # Sprint H.4 — module renommé ``_legacy_runner_adapter`` →
     # ``benchmark_runner`` (drop le préfixe legacy : c'est l'entry
     # point canonique des interfaces vers ``BenchmarkService``).
-    "picarones/app/services/benchmark_runner.py": 1700,  # actuel ~1450
     # --- God-modules : budget actuel + 15 % de marge.
     # Le rétrécissement sera l'objet d'un sprint de refactor dédié.
     # statistics.py (1128 lignes) a été éclaté en sous-package
@@ -71,7 +75,12 @@ FILE_BUDGETS: dict[str, int] = {
     # (≤ 25 l).  Le contenu canonique vit dans ``evaluation/`` ;
     # même budget pour la même raison historique (modèles
     # BenchmarkResult/EngineReport/DocumentResult).
-    "picarones/evaluation/benchmark_result.py": 750,      # actuel 702
     # Phase 5.C : ``report/philological_render.py`` est désormais
     # un shim (≤ 25 l).  Le contenu canonique vit dans
     # ``reports/html/renderers/philological.py``.

     # Sprint H.4 — module renommé ``_legacy_runner_adapter`` →
     # ``benchmark_runner`` (drop le préfixe legacy : c'est l'entry
     # point canonique des interfaces vers ``BenchmarkService``).
+    # Phase 2.3 du chantier post-rewrite — ajout de
+    # ``_engine_config_for_fingerprint`` (~50 LOC) pour distinguer les
+    # runs avec configs différentes (psm/lang/model/prompt) au niveau
+    # du fichier partial.
+    "picarones/app/services/benchmark_runner.py": 1750,  # actuel ~1700
     # --- God-modules : budget actuel + 15 % de marge.
     # Le rétrécissement sera l'objet d'un sprint de refactor dédié.
     # statistics.py (1128 lignes) a été éclaté en sous-package
     # (≤ 25 l).  Le contenu canonique vit dans ``evaluation/`` ;
     # même budget pour la même raison historique (modèles
     # BenchmarkResult/EngineReport/DocumentResult).
+    # Phase 2.2 du chantier post-rewrite — ajout de
+    # ``DocumentResult.from_dict``, ``EngineReport.from_dict``,
+    # ``BenchmarkResult.from_dict`` et ``BenchmarkResult.from_json_object``
+    # pour restaurer la fidélité du round-trip JSON (taxonomy,
+    # hallucination, philological, etc.).
+    "picarones/evaluation/benchmark_result.py": 880,      # actuel ~826
     # Phase 5.C : ``report/philological_render.py`` est désormais
     # un shim (≤ 25 l).  Le contenu canonique vit dans
     # ``reports/html/renderers/philological.py``.

tests/docs/test_readme_consistency.py CHANGED Viewed

@@ -101,6 +101,13 @@ def _normalize_engine_name(name: str) -> str:
     aliases = {
         "azure document intelligence": "azure_doc_intel",
         "azure doc intelligence": "azure_doc_intel",
     }
     if n in aliases:
         return aliases[n]

     aliases = {
         "azure document intelligence": "azure_doc_intel",
         "azure doc intelligence": "azure_doc_intel",
+        # Phase 3 du chantier post-rewrite : kraken/calamari sont
+        # listés dans le README avec leur nom commercial complet,
+        # mais leur module Python s'appelle juste ``kraken.py`` /
+        # ``calamari.py`` (cohérent avec ``pero_ocr.py`` qui ne
+        # s'appelle pas ``pero_ocr_htr.py``).
+        "kraken htr": "kraken",
+        "calamari ocr": "calamari",
     }
     if n in aliases:
         return aliases[n]

tests/docs/test_readme_dual_lang.py CHANGED Viewed

@@ -173,12 +173,14 @@ def test_copyright_year_range() -> None:
 def test_readme_under_500_lines() -> None:
-    """Le README doit rester compact (Sprint A13 vise < 500 lignes,
-    versus 786 avant la refonte)."""
     text = _read_readme()
     n_lines = len(text.splitlines())
-    assert n_lines < 500, (
-        f"README à {n_lines} lignes — au-dessus du seuil 500. "
         "Déléguer le détail vers docs/."
     )

 def test_readme_under_500_lines() -> None:
+    """Le README doit rester compact (Sprint A13 visait < 500 lignes ;
+    Phase 3 du chantier post-rewrite a ajouté kraken/calamari dans la
+    matrice produit, +2 lignes — seuil relevé à 510 pour absorber
+    cette extension légitime).  Versus 786 avant la refonte initiale."""
     text = _read_readme()
     n_lines = len(text.splitlines())
+    assert n_lines < 510, (
+        f"README à {n_lines} lignes — au-dessus du seuil 510. "
         "Déléguer le détail vers docs/."
     )

tests/evaluation/metrics/test_sprint12_nouvelles_fonctionnalites.py CHANGED Viewed

@@ -17,11 +17,17 @@ import pytest
 # 1. Filtrage des fichiers cachés macOS
 # ---------------------------------------------------------------------------
 FAKE_PNG = (
-    b"\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00\x00\x00\x01"
-    b"\x00\x00\x00\x01\x08\x02\x00\x00\x00\x90wS\xde\x00\x00"
-    b"\x00\x0cIDATx\x9cc\xf8\x0f\x00\x00\x01\x01\x00\x05\x18"
-    b"\xd8N\x00\x00\x00\x00IEND\xaeB`\x82"
 )

 # 1. Filtrage des fichiers cachés macOS
 # ---------------------------------------------------------------------------
+# PNG 1×1 RGBA validé par ``Pillow.verify()`` — l'ancien ``FAKE_PNG``
+# avait un mauvais checksum IDAT, masqué tant que ``flatten_zip_to_dir``
+# n'appelait pas ``validate_image_safe`` sur les images extraites
+# (durcissement Phase 1 du chantier post-rewrite).
 FAKE_PNG = (
+    b"\x89PNG\r\n\x1a\n"
+    b"\x00\x00\x00\rIHDR"
+    b"\x00\x00\x00\x01\x00\x00\x00\x01\x08\x06\x00\x00\x00"
+    b"\x1f\x15\xc4\x89"
+    b"\x00\x00\x00\nIDATx\x9cc\x00\x01\x00\x00\x05\x00\x01"
+    b"\r\n-\xb4\x00\x00\x00\x00IEND\xaeB`\x82"
 )

tests/integration/test_s9_prompt_loading_defenses.py CHANGED Viewed

@@ -171,16 +171,16 @@ class TestEndToEndPromptReachesLLM:
         )
     def test_web_factory_to_pipeline_to_llm_flow(self) -> None:
-        """End-to-end depuis ``CompetitorConfig`` (UI) jusqu'au LLM :
         le prompt arrivé au LLM doit être le CONTENU du fichier,
         pas le filename.  C'est le chemin exact du bug en prod."""
         from picarones.interfaces.web.benchmark_utils import (
             _engine_from_competitor,
         )
-        from picarones.interfaces.web.models import CompetitorConfig
         from picarones.adapters.llm.base import _substitute_prompt_variables
-        comp = CompetitorConfig(
             ocr_engine="tesseract", ocr_model="fra",
             llm_provider="mistral", llm_model="mistral-small-latest",
             pipeline_mode="text_only",

         )
     def test_web_factory_to_pipeline_to_llm_flow(self) -> None:
+        """End-to-end depuis ``PipelineConfig`` (UI) jusqu'au LLM :
         le prompt arrivé au LLM doit être le CONTENU du fichier,
         pas le filename.  C'est le chemin exact du bug en prod."""
         from picarones.interfaces.web.benchmark_utils import (
             _engine_from_competitor,
         )
+        from picarones.interfaces.web.models import PipelineConfig
         from picarones.adapters.llm.base import _substitute_prompt_variables
+        comp = PipelineConfig(
             ocr_engine="tesseract", ocr_model="fra",
             llm_provider="mistral", llm_model="mistral-small-latest",
             pipeline_mode="text_only",

tests/security/test_phase1_post_rewrite_wiring.py ADDED Viewed

	@@ -0,0 +1,1013 @@

+"""Phase 1 du chantier post-rewrite — durcissements sécurité P0.
+Couvre trois durcissements introduits pour fermer des surfaces filesystem
+laissées ouvertes par le rewrite :
+1. **Path traversal ``output_dir`` dans les importers HTR-United/HuggingFace.**
+   Avant durcissement : un POST ``output_dir="/etc/picarones_pwned"``
+   passait directement à l'importer, vecteur d'écriture filesystem
+   arbitraire.  Désormais ``validated_path`` rejette en 400 avant délégation.
+2. **Path traversal ``db_path`` dans ``/api/history/regressions``.**
+   Avant durcissement : ``db_path=/etc/passwd`` ouvrait un SQLite
+   arbitraire (lecture libre, log d'erreur informatif).  Désormais
+   ``validated_path`` rejette en 400 ; pour pointer une base hors
+   workspace, exporter ``PICARONES_HISTORY_DB``.
+3. **ZIP basename collision + validation image extraite.**
+   Avant durcissement : ``a/img.png`` et ``b/img.png`` s'écrasaient
+   silencieusement après aplatissement ; les images extraites n'étaient
+   pas passées à ``validate_image_safe`` (vecteur zip bomb jusqu'à
+   500 Mo brut).  Désormais : collision → renommage avec préfixe slug
+   du dirname + warning ; image invalide → ``ValueError`` (HTTP 415).
+"""
+from __future__ import annotations
+import io
+import zipfile
+from pathlib import Path
+from unittest.mock import patch
+import pytest
+# PNG 1x1 minimal valide pour passer Pillow.verify.
+_MINIMAL_PNG = (
+    b"\x89PNG\r\n\x1a\n"
+    b"\x00\x00\x00\rIHDR"
+    b"\x00\x00\x00\x01\x00\x00\x00\x01\x08\x06\x00\x00\x00"
+    b"\x1f\x15\xc4\x89"
+    b"\x00\x00\x00\nIDATx\x9cc\x00\x01\x00\x00\x05\x00\x01"
+    b"\r\n-\xb4\x00\x00\x00\x00IEND\xaeB`\x82"
+)
+def _make_importers_app():
+    from fastapi import FastAPI
+    from picarones.interfaces.web.routers import importers as imp_router
+    app = FastAPI()
+    app.include_router(imp_router.router)
+    return app
+def _make_history_app():
+    from fastapi import FastAPI
+    from picarones.interfaces.web.routers import history as hist_router
+    app = FastAPI()
+    app.include_router(hist_router.router)
+    return app
+# ──────────────────────────────────────────────────────────────────────
+# 1. output_dir path traversal — HTR-United + HuggingFace
+# ──────────────────────────────────────────────────────────────────────
+class TestImportersOutputDirTraversal:
+    """Aucun ``output_dir`` libre hors des racines workspace.
+    Important : on n'utilise PAS ``patch`` sur l'importer — la validation
+    doit échouer AVANT toute délégation au backend.  Si la validation
+    laisse passer, le mock ne sera pas appelé mais la requête sera
+    acceptée — c'est ce qu'on doit empêcher.
+    """
+    def test_htr_united_rejects_absolute_path_outside_workspace(self) -> None:
+        from fastapi.testclient import TestClient
+        app = _make_importers_app()
+        with TestClient(app) as client:
+            r = client.post(
+                "/api/htr-united/import",
+                json={
+                    "entry_id": "any_id",
+                    "output_dir": "/etc/picarones_pwned",
+                    "max_samples": 1,
+                },
+            )
+            # 400 = PathValidationError mappée par le handler.
+            assert r.status_code == 400, (
+                f"Attendu 400 (path validation), reçu {r.status_code} : "
+                f"{r.text}"
+            )
+            assert "hors zone autorisée" in r.json()["detail"]
+    def test_htr_united_rejects_traversal(self) -> None:
+        from fastapi.testclient import TestClient
+        app = _make_importers_app()
+        with TestClient(app) as client:
+            r = client.post(
+                "/api/htr-united/import",
+                json={
+                    "entry_id": "any_id",
+                    "output_dir": "../../../etc/passwd",
+                    "max_samples": 1,
+                },
+            )
+            assert r.status_code == 400
+            # Le message peut citer la racine ou le chemin original ;
+            # on vérifie juste qu'on n'a pas réussi à passer.
+            detail = r.json()["detail"]
+            assert "hors zone" in detail or "invalide" in detail
+    def test_huggingface_rejects_absolute_path_outside_workspace(
+        self,
+    ) -> None:
+        from fastapi.testclient import TestClient
+        app = _make_importers_app()
+        with TestClient(app) as client:
+            r = client.post(
+                "/api/huggingface/import",
+                json={
+                    "dataset_id": "any/dataset",
+                    "output_dir": "/var/lib/pwned",
+                    "split": "train",
+                    "max_samples": 1,
+                },
+            )
+            assert r.status_code == 400
+            assert "hors zone autorisée" in r.json()["detail"]
+    def test_huggingface_rejects_traversal(self) -> None:
+        from fastapi.testclient import TestClient
+        app = _make_importers_app()
+        with TestClient(app) as client:
+            r = client.post(
+                "/api/huggingface/import",
+                json={
+                    "dataset_id": "any/dataset",
+                    "output_dir": "../../../etc/passwd_dir",
+                    "split": "train",
+                    "max_samples": 1,
+                },
+            )
+            assert r.status_code == 400
+    def test_huggingface_accepts_path_under_tmp(self, tmp_path: Path) -> None:
+        """``tmp_path`` est sous ``tempfile.gettempdir()`` donc dans les
+        racines workspace par défaut (mode dev).  On vérifie que la
+        validation laisse passer une cible légitime."""
+        from fastapi.testclient import TestClient
+        app = _make_importers_app()
+        with patch(
+            "picarones.adapters.corpus.huggingface.HuggingFaceImporter.import_dataset",
+        ) as mock_import:
+            mock_import.return_value = {
+                "imported": 1, "output_dir": str(tmp_path),
+            }
+            with TestClient(app) as client:
+                r = client.post(
+                    "/api/huggingface/import",
+                    json={
+                        "dataset_id": "test/dataset",
+                        "output_dir": str(tmp_path),
+                        "split": "train",
+                        "max_samples": 1,
+                    },
+                )
+                assert r.status_code == 200, r.text
+                # Vérifie que la valeur passée à l'importer est résolue
+                # (str du Path absolu) — pas la chaîne brute si elle
+                # avait été relative.
+                assert mock_import.called
+# ──────────────────────────────────────────────────────────────────────
+# 2. db_path path traversal — /api/history/regressions
+# ──────────────────────────────────────────────────────────────────────
+class TestHistoryRegressionsDbPathTraversal:
+    """``db_path`` doit être sous une racine workspace ou refusé en 400.
+    Sans ce garde-fou, l'endpoint ouvrait silencieusement n'importe quel
+    SQLite lisible par le process (lecture filesystem arbitraire via
+    paramètres SQL).
+    """
+    def test_absolute_path_outside_workspace_rejected(self) -> None:
+        from fastapi.testclient import TestClient
+        app = _make_history_app()
+        with TestClient(app) as client:
+            r = client.get(
+                "/api/history/regressions",
+                params={"db_path": "/etc/passwd"},
+            )
+            assert r.status_code == 400, r.text
+            assert "hors zone autorisée" in r.json()["detail"]
+    def test_traversal_rejected(self) -> None:
+        from fastapi.testclient import TestClient
+        app = _make_history_app()
+        with TestClient(app) as client:
+            r = client.get(
+                "/api/history/regressions",
+                params={"db_path": "../../../etc/passwd"},
+            )
+            assert r.status_code == 400
+    def test_no_db_path_uses_default(self) -> None:
+        """Sans ``db_path``, l'endpoint utilise le défaut ``BenchmarkHistory()``
+        (~/.picarones/history.db).  Pas de 400, retourne une liste vide
+        si la base n'existe pas (cas frais)."""
+        from fastapi.testclient import TestClient
+        app = _make_history_app()
+        with TestClient(app) as client:
+            r = client.get("/api/history/regressions")
+            # Soit 200 (base existe, pas de régression), soit 500 (base
+            # absente).  On accepte les deux — c'est le comportement
+            # historique, hors scope du durcissement de chemin.
+            assert r.status_code in (200, 500), r.text
+# ──────────────────────────────────────────────────────────────────────
+# 3. ZIP basename collision + validation image extraite
+# ──────────────────────────────────────────────────────────────────────
+def _zip_with_entries(entries: dict[str, bytes]) -> bytes:
+    """ZIP en mémoire à partir de ``{nom: bytes}``."""
+    buf = io.BytesIO()
+    with zipfile.ZipFile(buf, mode="w", compression=zipfile.ZIP_DEFLATED) as zf:
+        for name, data in entries.items():
+            zf.writestr(name, data)
+    return buf.getvalue()
+class TestZipBasenameCollision:
+    """``a/img.png`` et ``b/img.png`` ne doivent plus s'écraser
+    silencieusement après aplatissement par basename."""
+    def test_collision_resolved_with_dirname_prefix(self, tmp_path: Path) -> None:
+        from picarones.interfaces.web.corpus_utils import flatten_zip_to_dir
+        zip_bytes = _zip_with_entries({
+            "folder_a/page_001.png": _MINIMAL_PNG,
+            "folder_a/page_001.gt.txt": b"GT A",
+            "folder_b/page_001.png": _MINIMAL_PNG,
+            "folder_b/page_001.gt.txt": b"GT B",
+        })
+        dest = tmp_path / "extract"
+        with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
+            flatten_zip_to_dir(zf, dest)
+        names = {p.name for p in dest.iterdir()}
+        # La première occurrence garde le nom brut ; les suivantes sont
+        # préfixées par le slug du dirname source.
+        assert "page_001.png" in names
+        # Le second doit avoir été renommé — par slug ``folder_b``.
+        renamed_png = {n for n in names if n.endswith("page_001.png")}
+        assert len(renamed_png) == 2, (
+            f"Attendu 2 images distinctes (1 nominale + 1 renommée), "
+            f"trouvé {renamed_png}"
+        )
+        # On vérifie qu'au moins une variante porte un slug de dossier.
+        assert any(
+            "folder_a" in n or "folder_b" in n
+            for n in renamed_png - {"page_001.png"}
+        )
+    def test_no_silent_overwrite_of_image_pairs(self, tmp_path: Path) -> None:
+        """Garantie fonctionnelle : 4 fichiers entrent → 4 fichiers sortent."""
+        from picarones.interfaces.web.corpus_utils import flatten_zip_to_dir
+        zip_bytes = _zip_with_entries({
+            "a/img.png": _MINIMAL_PNG,
+            "a/img.gt.txt": b"A",
+            "b/img.png": _MINIMAL_PNG,
+            "b/img.gt.txt": b"B",
+        })
+        dest = tmp_path / "extract"
+        with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
+            flatten_zip_to_dir(zf, dest)
+        files = list(dest.iterdir())
+        # 4 fichiers entrent dans le ZIP, 4 doivent ressortir (les
+        # collisions sont résolues, pas écrasées).
+        assert len(files) == 4, (
+            f"Attendu 4 fichiers (anti-collision), trouvé "
+            f"{[p.name for p in files]}"
+        )
+class TestZipExtractedImageValidation:
+    """Les images extraites du ZIP doivent passer ``validate_image_safe``
+    — sans ce garde-fou, un attaquant pouvait emballer une fausse image
+    (DecompressionBombError, format invalide) jusqu'à 500 Mo non
+    vérifiés."""
+    def test_invalid_extracted_image_rejected(self, tmp_path: Path) -> None:
+        from picarones.interfaces.web.corpus_utils import flatten_zip_to_dir
+        zip_bytes = _zip_with_entries({
+            # Header PNG seul mais sans IHDR — invalide.
+            "fake.png": b"\x89PNG\r\n\x1a\nFAKE_NOT_A_REAL_PNG",
+        })
+        dest = tmp_path / "extract"
+        with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
+            with pytest.raises(ValueError) as excinfo:
+                flatten_zip_to_dir(zf, dest)
+        # Le message doit mentionner le filename pour aider au debug.
+        assert "fake.png" in str(excinfo.value)
+    def test_valid_extracted_image_passes(self, tmp_path: Path) -> None:
+        from picarones.interfaces.web.corpus_utils import flatten_zip_to_dir
+        zip_bytes = _zip_with_entries({
+            "ok.png": _MINIMAL_PNG,
+            "ok.gt.txt": b"Hello",
+        })
+        dest = tmp_path / "extract"
+        with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
+            flatten_zip_to_dir(zf, dest)
+        assert (dest / "ok.png").exists()
+        assert (dest / "ok.gt.txt").exists()
+    def test_validate_images_false_skips_validation(
+        self, tmp_path: Path,
+    ) -> None:
+        """Le kwarg ``validate_images=False`` désactive la vérification —
+        utilisé par certains tests qui se concentrent sur d'autres
+        propriétés (path traversal, par exemple) sans avoir besoin de
+        fournir un PNG complet."""
+        from picarones.interfaces.web.corpus_utils import flatten_zip_to_dir
+        zip_bytes = _zip_with_entries({
+            "skipme.png": b"\x89PNG_FAKE",
+        })
+        dest = tmp_path / "extract"
+        with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
+            flatten_zip_to_dir(zf, dest, validate_images=False)
+        assert (dest / "skipme.png").exists()
+# ──────────────────────────────────────────────────────────────────────
+# 4. Phase 2 — pipeline_mode strict (rupture API)
+# ──────────────────────────────────────────────────────────────────────
+def _make_benchmark_app():
+    """App FastAPI minimale pour tester le rejet 422 au niveau router."""
+    from fastapi import FastAPI
+    from picarones.interfaces.web.routers import benchmark as bench_router
+    app = FastAPI()
+    app.include_router(bench_router.router)
+    return app
+class TestPipelineModeStrictAPI:
+    """Phase 2 du chantier post-rewrite : le typage ``Literal`` de
+    ``PipelineConfig.pipeline_mode`` rejette en 422 toute valeur
+    hors de la matrice canonique avant même que le router ne soit
+    appelé.  Avant ce durcissement, le ``mode_map.get(...,
+    "text_only")`` aliasait silencieusement.
+    """
+    def test_invalid_pipeline_mode_returns_422(self, tmp_path: Path) -> None:
+        from fastapi.testclient import TestClient
+        app = _make_benchmark_app()
+        with TestClient(app) as client:
+            r = client.post(
+                "/api/benchmark/run",
+                json={
+                    "corpus_path": str(tmp_path),
+                    "competitors": [
+                        {
+                            "name": "p",
+                            "ocr_engine": "tesseract",
+                            "ocr_model": "fra",
+                            "llm_provider": "mistral",
+                            "llm_model": "ministral-3b-latest",
+                            "pipeline_mode": "magic_unknown_mode",
+                            "prompt_file": "",
+                        },
+                    ],
+                    "normalization_profile": "nfc",
+                    "output_dir": str(tmp_path),
+                    "report_name": "test",
+                    "report_lang": "fr",
+                },
+            )
+            assert r.status_code == 422, r.text
+    def test_legacy_alias_post_correction_text_rejected_422(
+        self, tmp_path: Path,
+    ) -> None:
+        from fastapi.testclient import TestClient
+        app = _make_benchmark_app()
+        with TestClient(app) as client:
+            r = client.post(
+                "/api/benchmark/run",
+                json={
+                    "corpus_path": str(tmp_path),
+                    "competitors": [
+                        {
+                            "name": "p",
+                            "ocr_engine": "tesseract",
+                            "ocr_model": "fra",
+                            "llm_provider": "mistral",
+                            "llm_model": "ministral-3b-latest",
+                            # Alias supprimé Phase 2.
+                            "pipeline_mode": "post_correction_text",
+                            "prompt_file": "",
+                        },
+                    ],
+                    "normalization_profile": "nfc",
+                    "output_dir": str(tmp_path),
+                    "report_name": "test",
+                    "report_lang": "fr",
+                },
+            )
+            assert r.status_code == 422, r.text
+    @pytest.mark.parametrize(
+        "valid_mode", ["text_only", "text_and_image", "zero_shot"],
+    )
+    def test_canonical_modes_pass_pydantic(self, valid_mode: str) -> None:
+        """Les 3 modes canoniques sont acceptés par Pydantic — la
+        suite (instanciation moteur, exécution) peut échouer pour
+        d'autres raisons mais ce n'est pas notre test."""
+        from picarones.interfaces.web.models import PipelineConfig
+        comp = PipelineConfig(
+            name="t", ocr_engine="tesseract",
+            llm_provider="mistral", llm_model="m",
+            pipeline_mode=valid_mode,
+        )
+        assert comp.pipeline_mode == valid_mode
+    def test_empty_mode_pass_pydantic_for_ocr_only(self) -> None:
+        """``pipeline_mode=""`` (défaut) doit rester accepté pour les
+        configs OCR seul (sans ``llm_provider``)."""
+        from picarones.interfaces.web.models import PipelineConfig
+        comp = PipelineConfig(
+            name="t", ocr_engine="tesseract", llm_provider="",
+        )
+        assert comp.pipeline_mode == ""
+# ──────────────────────────────────────────────────────────────────────
+# 5. Phase 2.2 — from_json fidèle (round-trip complet)
+# ──────────────────────────────────────────────────────────────────────
+class TestBenchmarkResultRoundTrip:
+    """Phase 2.2 du chantier post-rewrite : ``BenchmarkResult.to_json``
+    suivi de :meth:`BenchmarkResult.from_json_object` doit restaurer
+    **tous** les champs avancés (taxonomy, structure, hallucination,
+    NER, calibration, philological, searchability, numerical,
+    readability, pipeline_metadata, ocr_intermediate + leurs
+    ``aggregated_*`` correspondants).
+    Avant ce durcissement, ``ReportGenerator.from_json`` faisait sa
+    propre reconstruction qui ne couvrait que CER/WER + textes — toutes
+    les analyses étaient perdues, ce qui rendait le rapport régénéré
+    différent du rapport in-memory.  Reproductibilité scientifique
+    cassée.
+    """
+    def _make_rich_benchmark(self):
+        from picarones.evaluation.benchmark_result import (
+            BenchmarkResult, DocumentResult, EngineReport,
+        )
+        from picarones.evaluation.metric_result import MetricsResult
+        metrics = MetricsResult(
+            cer=0.15, cer_nfc=0.14, cer_caseless=0.13,
+            wer=0.20, wer_normalized=0.19,
+            mer=0.16, wil=0.18,
+            reference_length=100, hypothesis_length=95,
+            cer_diplomatic=0.12,
+            diplomatic_profile_name="medieval_french",
+        )
+        dr = DocumentResult(
+            doc_id="doc1",
+            image_path="/tmp/doc1.png",
+            ground_truth="Hello world",
+            hypothesis="He11o world",
+            metrics=metrics,
+            duration_seconds=1.5,
+            ocr_intermediate="He11o w0rld",
+            pipeline_metadata={"mode": "text_only", "prompt_file": "x.txt"},
+            confusion_matrix={"l→1": 2},
+            char_scores={"ligature": {"score": 0.95}},
+            taxonomy={"classes": {"1": 3, "2": 1}},
+            structure={"line_count": 5},
+            image_quality={"contrast": 0.75},
+            line_metrics={"cer_per_line": [0.1, 0.2, 0.3]},
+            hallucination_metrics={"anchoring": 0.85, "n_blocks": 1},
+            ner_metrics={"f1_micro": 0.80, "per_category": {"PER": 0.9}},
+            calibration_metrics={"ece": 0.05, "mce": 0.10},
+            philological_metrics={"mufi": {"coverage": 0.92}},
+            searchability_metrics={
+                "n_gt_tokens": 2, "n_searchable": 2, "recall": 1.0,
+            },
+            numerical_sequence_metrics={
+                "global_strict_score": 1.0, "n_total": 0,
+            },
+            readability_metrics={
+                "lang": "fr", "flesch_delta": -5.2, "n_words_reference": 100,
+            },
+        )
+        er = EngineReport(
+            engine_name="tesseract",
+            engine_version="5.3.0",
+            engine_config={"lang": "fra"},
+            document_results=[dr],
+            pipeline_info={"mode": "text_only"},
+            aggregated_confusion={"l→1": 2},
+            aggregated_char_scores={"ligature": {"score": 0.95}},
+            aggregated_taxonomy={"classes": {"1": 3}},
+            aggregated_structure={"line_count_total": 5},
+            aggregated_image_quality={"contrast_mean": 0.75},
+            aggregated_line_metrics={"gini_mean": 0.3},
+            aggregated_hallucination={"anchoring_mean": 0.85},
+            aggregated_ner={"f1_micro": 0.80},
+            aggregated_calibration={"ece": 0.05},
+            aggregated_philological={"mufi": {"coverage": 0.92}},
+            aggregated_searchability={"recall": 1.0},
+            aggregated_numerical_sequences={"global_strict_score": 1.0},
+            aggregated_readability={"delta_mean": -5.2},
+        )
+        return BenchmarkResult(
+            corpus_name="rich-corpus",
+            corpus_source="tests",
+            document_count=1,
+            engine_reports=[er],
+            run_date="2026-05-12T12:00:00Z",
+            picarones_version="2.0.0",
+            metadata={"context": "phase2_test"},
+        )
+    def test_round_trip_preserves_all_document_level_fields(
+        self, tmp_path: Path,
+    ) -> None:
+        from picarones.evaluation.benchmark_result import BenchmarkResult
+        bm = self._make_rich_benchmark()
+        path = tmp_path / "rich.json"
+        bm.to_json(path)
+        loaded = BenchmarkResult.from_json_object(path)
+        orig = bm.engine_reports[0].document_results[0]
+        rebuilt = loaded.engine_reports[0].document_results[0]
+        assert rebuilt.doc_id == orig.doc_id
+        assert rebuilt.ground_truth == orig.ground_truth
+        assert rebuilt.hypothesis == orig.hypothesis
+        assert rebuilt.ocr_intermediate == orig.ocr_intermediate
+        assert rebuilt.pipeline_metadata == orig.pipeline_metadata
+        assert rebuilt.confusion_matrix == orig.confusion_matrix
+        assert rebuilt.char_scores == orig.char_scores
+        assert rebuilt.taxonomy == orig.taxonomy
+        assert rebuilt.structure == orig.structure
+        assert rebuilt.image_quality == orig.image_quality
+        assert rebuilt.line_metrics == orig.line_metrics
+        assert rebuilt.hallucination_metrics == orig.hallucination_metrics
+        assert rebuilt.ner_metrics == orig.ner_metrics
+        assert rebuilt.calibration_metrics == orig.calibration_metrics
+        assert rebuilt.philological_metrics == orig.philological_metrics
+        assert rebuilt.searchability_metrics == orig.searchability_metrics
+        assert (
+            rebuilt.numerical_sequence_metrics
+            == orig.numerical_sequence_metrics
+        )
+        assert rebuilt.readability_metrics == orig.readability_metrics
+        # Métriques diplomatiques (anciennement perdues).
+        assert rebuilt.metrics.cer_diplomatic == orig.metrics.cer_diplomatic
+        assert (
+            rebuilt.metrics.diplomatic_profile_name
+            == orig.metrics.diplomatic_profile_name
+        )
+    def test_round_trip_preserves_aggregated_engine_fields(
+        self, tmp_path: Path,
+    ) -> None:
+        from picarones.evaluation.benchmark_result import BenchmarkResult
+        bm = self._make_rich_benchmark()
+        path = tmp_path / "rich.json"
+        bm.to_json(path)
+        loaded = BenchmarkResult.from_json_object(path)
+        orig = bm.engine_reports[0]
+        rebuilt = loaded.engine_reports[0]
+        assert rebuilt.pipeline_info == orig.pipeline_info
+        assert rebuilt.aggregated_confusion == orig.aggregated_confusion
+        assert rebuilt.aggregated_char_scores == orig.aggregated_char_scores
+        assert rebuilt.aggregated_taxonomy == orig.aggregated_taxonomy
+        assert rebuilt.aggregated_structure == orig.aggregated_structure
+        assert (
+            rebuilt.aggregated_image_quality == orig.aggregated_image_quality
+        )
+        assert rebuilt.aggregated_line_metrics == orig.aggregated_line_metrics
+        assert (
+            rebuilt.aggregated_hallucination == orig.aggregated_hallucination
+        )
+        assert rebuilt.aggregated_ner == orig.aggregated_ner
+        assert rebuilt.aggregated_calibration == orig.aggregated_calibration
+        assert (
+            rebuilt.aggregated_philological == orig.aggregated_philological
+        )
+        assert (
+            rebuilt.aggregated_searchability == orig.aggregated_searchability
+        )
+        assert (
+            rebuilt.aggregated_numerical_sequences
+            == orig.aggregated_numerical_sequences
+        )
+        assert rebuilt.aggregated_readability == orig.aggregated_readability
+    def test_report_generator_from_json_uses_rich_reconstruction(
+        self, tmp_path: Path,
+    ) -> None:
+        """``ReportGenerator.from_json`` doit désormais accéder aux
+        champs avancés (avant Phase 2.2 il les perdait)."""
+        from picarones.reports.html.generator import ReportGenerator
+        bm = self._make_rich_benchmark()
+        path = tmp_path / "rich.json"
+        bm.to_json(path)
+        gen = ReportGenerator.from_json(path)
+        dr = gen.benchmark.engine_reports[0].document_results[0]
+        # Champs qui étaient à None avant Phase 2.2.
+        assert dr.taxonomy is not None
+        assert dr.hallucination_metrics is not None
+        assert dr.philological_metrics is not None
+        assert dr.calibration_metrics is not None
+        assert dr.searchability_metrics is not None
+# ──────────────────────────────────────────────────────────────────────
+# 6. Phase 2.3 — partial store fingerprint
+# ──────────────────────────────────────────────────────────────────────
+class TestPartialStoreFingerprint:
+    """Phase 2.3 du chantier post-rewrite : la clé du fichier partiel
+    inclut désormais un fingerprint SHA-256 stable de la config
+    complète (engine_config, normalization_profile, char_exclude,
+    fichiers corpus + mtime/size, version code).
+    Avant ce durcissement, la clé était ``(corpus.name, engine.name)``
+    seule — deux runs avec configs différentes recyclaient
+    silencieusement les résultats du précédent.  Reproductibilité
+    scientifique brisée.
+    """
+    def test_fingerprint_stable_for_same_config(self, tmp_path: Path) -> None:
+        from picarones.app.services.partial_store import (
+            compute_run_fingerprint,
+        )
+        f1 = tmp_path / "a.png"
+        f1.write_bytes(b"\x00" * 100)
+        fp1 = compute_run_fingerprint(
+            engine_config={"lang": "fra", "psm": 6},
+            normalization_profile="medieval_french",
+            char_exclude="',-",
+            corpus_files=[f1],
+            code_version="1.0",
+        )
+        fp2 = compute_run_fingerprint(
+            engine_config={"psm": 6, "lang": "fra"},  # ordre différent
+            normalization_profile="medieval_french",
+            char_exclude="',-",
+            corpus_files=[f1],
+            code_version="1.0",
+        )
+        assert fp1 == fp2, "Le fingerprint doit être insensible à l'ordre dict"
+    def test_fingerprint_changes_with_engine_config(
+        self, tmp_path: Path,
+    ) -> None:
+        from picarones.app.services.partial_store import (
+            compute_run_fingerprint,
+        )
+        f1 = tmp_path / "a.png"
+        f1.write_bytes(b"\x00" * 100)
+        fp_psm6 = compute_run_fingerprint(
+            engine_config={"lang": "fra", "psm": 6},
+            corpus_files=[f1],
+            code_version="1.0",
+        )
+        fp_psm3 = compute_run_fingerprint(
+            engine_config={"lang": "fra", "psm": 3},
+            corpus_files=[f1],
+            code_version="1.0",
+        )
+        assert fp_psm6 != fp_psm3, (
+            "Un changement de psm doit changer le fingerprint"
+        )
+    def test_fingerprint_changes_with_normalization_profile(
+        self, tmp_path: Path,
+    ) -> None:
+        from picarones.app.services.partial_store import (
+            compute_run_fingerprint,
+        )
+        f1 = tmp_path / "a.png"
+        f1.write_bytes(b"\x00" * 100)
+        fp_med = compute_run_fingerprint(
+            engine_config={"lang": "fra"},
+            normalization_profile="medieval_french",
+            corpus_files=[f1],
+        )
+        fp_nfc = compute_run_fingerprint(
+            engine_config={"lang": "fra"},
+            normalization_profile="nfc",
+            corpus_files=[f1],
+        )
+        assert fp_med != fp_nfc
+    def test_fingerprint_changes_with_char_exclude(
+        self, tmp_path: Path,
+    ) -> None:
+        from picarones.app.services.partial_store import (
+            compute_run_fingerprint,
+        )
+        fp_with = compute_run_fingerprint(
+            engine_config={"lang": "fra"},
+            char_exclude="',-",
+        )
+        fp_without = compute_run_fingerprint(
+            engine_config={"lang": "fra"},
+            char_exclude="",
+        )
+        assert fp_with != fp_without
+    def test_fingerprint_changes_with_corpus_content(
+        self, tmp_path: Path,
+    ) -> None:
+        """Si un fichier change de taille / mtime, le fingerprint change.
+        Détection légère (pas de hash du contenu) mais suffit pour
+        invalider la reprise après modification utilisateur du corpus.
+        """
+        import os
+        import time
+        from picarones.app.services.partial_store import (
+            compute_run_fingerprint,
+        )
+        f1 = tmp_path / "a.png"
+        f1.write_bytes(b"\x00" * 100)
+        fp_v1 = compute_run_fingerprint(
+            engine_config={"lang": "fra"},
+            corpus_files=[f1],
+        )
+        # Réécrire avec une taille différente.
+        f1.write_bytes(b"\x00" * 200)
+        # Forcer un mtime différent (certains FS ont une résolution
+        # de seconde, on attend > 1 s).
+        new_mtime = time.time() + 5
+        os.utime(f1, (new_mtime, new_mtime))
+        fp_v2 = compute_run_fingerprint(
+            engine_config={"lang": "fra"},
+            corpus_files=[f1],
+        )
+        assert fp_v1 != fp_v2
+    def test_partial_path_uses_fingerprint_suffix(
+        self, tmp_path: Path,
+    ) -> None:
+        from picarones.app.services.partial_store import _partial_path
+        path_with = _partial_path(
+            "my_corpus", "tesseract", tmp_path, fingerprint="abc123",
+        )
+        path_without = _partial_path(
+            "my_corpus", "tesseract", tmp_path,
+        )
+        assert path_with != path_without
+        assert "abc123" in path_with.name
+        # Le format historique reste pour la rétrocompat.
+        assert path_without.name == "picarones_my_corpus_tesseract.partial.jsonl"
+    def test_engine_config_for_fingerprint_distinguishes_psm(self) -> None:
+        """``_engine_config_for_fingerprint`` capture les attributs
+        opérationnels d'un adapter OCR (lang, psm, model, …)."""
+        from picarones.app.services.benchmark_runner import (
+            _engine_config_for_fingerprint,
+        )
+        class _FakeOCR:
+            name = "tesseract"
+            lang = "fra"
+            psm = 6
+            is_pipeline = False
+        class _FakeOCRDiff:
+            name = "tesseract"
+            lang = "fra"
+            psm = 3
+            is_pipeline = False
+        c1 = _engine_config_for_fingerprint(_FakeOCR())
+        c2 = _engine_config_for_fingerprint(_FakeOCRDiff())
+        assert c1 != c2
+        assert c1["psm"] == 6
+        assert c2["psm"] == 3
+# ──────────────────────────────────────────────────────────────────────
+# 7. Phase 3 — Adapters kraken et calamari (moteurs fantômes implémentés)
+# ──────────────────────────────────────────────────────────────────────
+class TestKrakenAdapter:
+    """Phase 3 du chantier post-rewrite : ``KrakenAdapter`` rend
+    l'engine ``kraken`` réellement utilisable (au lieu d'être
+    juste annoncé par ``/api/engines``)."""
+    def test_kraken_requires_model_path(self) -> None:
+        from picarones.adapters.ocr import KrakenAdapter
+        from picarones.adapters.ocr.base import OCRAdapterError
+        with pytest.raises(OCRAdapterError, match="model_path est obligatoire"):
+            KrakenAdapter()
+    def test_kraken_via_factory(self, tmp_path: Path) -> None:
+        from picarones.adapters.ocr import KrakenAdapter
+        from picarones.adapters.ocr.factory import ocr_adapter_from_name
+        # Modèle factice — l'adapter ne le charge qu'à execute().
+        model = tmp_path / "fake.mlmodel"
+        model.write_bytes(b"fake")
+        adapter = ocr_adapter_from_name("kraken", model_path=str(model))
+        assert isinstance(adapter, KrakenAdapter)
+        assert adapter.name == "kraken"
+        assert adapter.model_path == model
+    def test_kraken_validates_name(self) -> None:
+        from picarones.adapters.ocr import KrakenAdapter
+        from picarones.adapters.ocr.base import OCRAdapterError
+        with pytest.raises(OCRAdapterError, match="name invalide"):
+            KrakenAdapter(name="bad name with spaces", model_path="x")
+class TestCalamariAdapter:
+    """Phase 3 du chantier post-rewrite : ``CalamariAdapter`` rend
+    l'engine ``calamari`` réellement utilisable."""
+    def test_calamari_requires_checkpoint(self) -> None:
+        from picarones.adapters.ocr import CalamariAdapter
+        from picarones.adapters.ocr.base import OCRAdapterError
+        with pytest.raises(OCRAdapterError, match="checkpoint est obligatoire"):
+            CalamariAdapter()
+    def test_calamari_via_factory(self, tmp_path: Path) -> None:
+        from picarones.adapters.ocr import CalamariAdapter
+        from picarones.adapters.ocr.factory import ocr_adapter_from_name
+        ckpt = tmp_path / "fake.ckpt"
+        ckpt.write_bytes(b"fake")
+        adapter = ocr_adapter_from_name("calamari", checkpoint=str(ckpt))
+        assert isinstance(adapter, CalamariAdapter)
+        assert adapter.name == "calamari"
+        assert adapter.checkpoint == ckpt
+    def test_calamari_validates_batch_size(self) -> None:
+        from picarones.adapters.ocr import CalamariAdapter
+        from picarones.adapters.ocr.base import OCRAdapterError
+        with pytest.raises(OCRAdapterError, match="batch_size doit être"):
+            CalamariAdapter(checkpoint="x", batch_size=0)
+class TestEngineMatrixCoherence:
+    """Phase 3 du chantier post-rewrite : la matrice des moteurs est
+    cohérente entre ``/api/engines``, la factory canonique, le
+    builder web ``_OCR_KWARGS_BUILDERS`` et l'index public."""
+    def test_kraken_and_calamari_in_factory_supported_list(self) -> None:
+        from picarones.adapters.ocr.factory import _SUPPORTED
+        assert "kraken" in _SUPPORTED
+        assert "calamari" in _SUPPORTED
+    def test_kraken_and_calamari_in_web_builders(self) -> None:
+        from picarones.interfaces.web.benchmark_utils import (
+            _OCR_KWARGS_BUILDERS,
+        )
+        assert "kraken" in _OCR_KWARGS_BUILDERS
+        assert "calamari" in _OCR_KWARGS_BUILDERS
+    def test_kraken_calamari_exposed_at_package_root(self) -> None:
+        from picarones.adapters.ocr import (
+            CalamariAdapter,
+            KrakenAdapter,
+        )
+        assert KrakenAdapter.__name__ == "KrakenAdapter"
+        assert CalamariAdapter.__name__ == "CalamariAdapter"
+# ──────────────────────────────────────────────────────────────────────
+# 8. Phase 4 — upload_purge_task branché au lifespan
+# ──────────────────────────────────────────────────────────────────────
+class TestUploadPurgeTaskWired:
+    """Phase 4 du chantier post-rewrite : la tâche
+    ``upload_purge_task`` est désormais démarrée par le lifespan de
+    ``picarones.interfaces.web.app`` (auparavant définie mais jamais
+    lancée — code zombie)."""
+    def test_lifespan_starts_purge_task(self, monkeypatch) -> None:
+        """Au démarrage de l'app FastAPI, un ``asyncio.create_task`` doit
+        emballer ``upload_purge_task``.  On patch la fonction pour
+        l'observer puis on enclenche le lifespan."""
+        from fastapi.testclient import TestClient
+        observed: dict = {"started": False, "uploads_root": None}
+        async def _fake_purge_task(uploads_root):
+            observed["started"] = True
+            observed["uploads_root"] = uploads_root
+            # Boucle infinie minimale — annulée au shutdown.
+            import asyncio
+            try:
+                while True:
+                    await asyncio.sleep(3600)
+            except asyncio.CancelledError:
+                raise
+        monkeypatch.setattr(
+            "picarones.interfaces.web.maintenance.upload_purge_task",
+            _fake_purge_task,
+        )
+        # Forcer la rétention pour ne pas que la fonction réelle short-circuit.
+        monkeypatch.setenv("PICARONES_UPLOAD_RETENTION_DAYS", "7")
+        from picarones.interfaces.web.app import app
+        with TestClient(app):
+            # Le lifespan a démarré ; la tâche tourne en arrière-plan.
+            # On laisse à asyncio le temps de la lancer.
+            import time
+            time.sleep(0.05)
+        assert observed["started"] is True, (
+            "upload_purge_task aurait dû être démarrée par le lifespan"
+        )
+    def test_purge_protects_active_corpus(self, tmp_path: Path) -> None:
+        """Si un job ``pending``/``running`` référence un corpus_id, la
+        purge ne supprime pas ce dossier — même s'il est ancien."""
+        import time
+        from picarones.interfaces.web.maintenance import purge_old_uploads
+        # 2 corpus : un actif (référencé), un orphelin.
+        active = tmp_path / "active_corpus"
+        orphan = tmp_path / "orphan_corpus"
+        active.mkdir()
+        orphan.mkdir()
+        # Vieillir les deux pour qu'ils passent la rétention de 0 jour.
+        old = time.time() - 86400 * 30
+        import os
+        os.utime(active, (old, old))
+        os.utime(orphan, (old, old))
+        purged = purge_old_uploads(
+            tmp_path,
+            retention_days=7,
+            active_corpus_ids={"active_corpus"},
+        )
+        purged_names = [p.name for p in purged]
+        assert "orphan_corpus" in purged_names
+        assert "active_corpus" not in purged_names
+        # Vérification physique
+        assert active.exists()
+        assert not orphan.exists()

tests/security/test_s1_zip_slip_attack.py CHANGED Viewed

@@ -24,10 +24,19 @@ from pathlib import Path
 import pytest
-# ──────────────────────────────────────────────────────────────────────
-# Helpers — construire des ZIP malicieux en mémoire
-# ──────────────────────────────────────────────────────────────────────
 def _zip_with_entry(name: str, data: bytes = b"PWNED") -> bytes:
@@ -151,7 +160,7 @@ class TestFlattenZipToDir:
         sous ``dest``, pas dans ``/tmp/``."""
         from picarones.interfaces.web.corpus_utils import flatten_zip_to_dir
-        zip_bytes = _zip_with_entry("../../../tmp/x.png", b"\x89PNG")
         dest = tmp_path / "extract"
         with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
@@ -168,7 +177,7 @@ class TestFlattenZipToDir:
     ) -> None:
         from picarones.interfaces.web.corpus_utils import flatten_zip_to_dir
-        zip_bytes = _zip_with_entry("/etc/passwd_clone.png", b"\x89PNG")
         dest = tmp_path / "extract"
         with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:

 import pytest
+#: PNG minimal valide — utilisé là où le contenu doit passer
+#: ``validate_image_safe`` (Pillow.verify).  Avant ce durcissement,
+#: les tests utilisaient ``b"\\x89PNG"`` (signature seule), mais le
+#: durcissement Phase 1 valide chaque image extraite d'un ZIP — d'où
+#: l'utilisation d'un PNG 1×1 réellement décodable ici.
+_MINIMAL_PNG = (
+    b"\x89PNG\r\n\x1a\n"
+    b"\x00\x00\x00\rIHDR"
+    b"\x00\x00\x00\x01\x00\x00\x00\x01\x08\x06\x00\x00\x00"
+    b"\x1f\x15\xc4\x89"
+    b"\x00\x00\x00\nIDATx\x9cc\x00\x01\x00\x00\x05\x00\x01"
+    b"\r\n-\xb4\x00\x00\x00\x00IEND\xaeB`\x82"
+)
 def _zip_with_entry(name: str, data: bytes = b"PWNED") -> bytes:
         sous ``dest``, pas dans ``/tmp/``."""
         from picarones.interfaces.web.corpus_utils import flatten_zip_to_dir
+        zip_bytes = _zip_with_entry("../../../tmp/x.png", _MINIMAL_PNG)
         dest = tmp_path / "extract"
         with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
     ) -> None:
         from picarones.interfaces.web.corpus_utils import flatten_zip_to_dir
+        zip_bytes = _zip_with_entry("/etc/passwd_clone.png", _MINIMAL_PNG)
         dest = tmp_path / "extract"
         with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:

tests/web/routers/test_s4_history_router.py CHANGED Viewed

@@ -182,16 +182,19 @@ class TestHistoryWithRegression:
 class TestDBErrorHandling:
-    def test_db_path_unwritable_returns_500_or_empty(
-        self, tmp_path: Path,
-    ) -> None:
-        """db_path qui pointe sur un répertoire inexistant + non
-        créable doit produire une erreur compréhensible (500 ou
-        body avec count=0 mais sans crash silencieux)."""
         from fastapi.testclient import TestClient
         app = _make_app()
-        # Chemin qui devrait être impossible à créer (sous /proc).
         impossible_path = "/proc/cannot_write/history.sqlite"
         with TestClient(app, raise_server_exceptions=False) as client:
@@ -199,8 +202,27 @@ class TestDBErrorHandling:
                 "/api/history/regressions",
                 params={"db_path": impossible_path},
             )
-            # Soit 500 (le bon comportement), soit 200 mais avec
-            # count=0.  Pas de crash, pas de stack trace au client.
             assert r.status_code in (200, 500)
             if r.status_code == 500:
                 body = r.json()

 class TestDBErrorHandling:
+    def test_db_path_outside_workspace_rejected(self, tmp_path: Path) -> None:
+        """db_path hors workspace est désormais rejeté en 400 par le
+        durcissement Phase 1 (validation contre compute_workspace_roots).
+        Avant Phase 1 : 500 silencieux après tentative d'ouverture
+        SQLite — vecteur de lecture filesystem arbitraire.
+        Après Phase 1 : 400 avec ``PathValidationError`` AVANT
+        toute interaction filesystem.
+        """
         from fastapi.testclient import TestClient
         app = _make_app()
+        # Chemin hors zone workspace.
         impossible_path = "/proc/cannot_write/history.sqlite"
         with TestClient(app, raise_server_exceptions=False) as client:
                 "/api/history/regressions",
                 params={"db_path": impossible_path},
             )
+            assert r.status_code == 400, r.text
+            assert "hors zone autorisée" in r.json()["detail"]
+    def test_db_path_inside_workspace_but_unwritable(
+        self, tmp_path: Path,
+    ) -> None:
+        """db_path valide (sous tmp_path) mais pointant sur un fichier
+        inexistant en sous-dossier inaccessible : 500 propre, pas de
+        crash silencieux."""
+        from fastapi.testclient import TestClient
+        app = _make_app()
+        # Sous-dossier inexistant sous tmp_path — SQLite va échouer
+        # à créer le fichier, mais la validation de chemin passe.
+        bad_under_workspace = tmp_path / "no_such_subdir" / "history.sqlite"
+        with TestClient(app, raise_server_exceptions=False) as client:
+            r = client.get(
+                "/api/history/regressions",
+                params={"db_path": str(bad_under_workspace)},
+            )
             assert r.status_code in (200, 500)
             if r.status_code == 500:
                 body = r.json()

tests/web/routers/test_s8_benchmark_router_branches.py CHANGED Viewed

@@ -6,7 +6,7 @@ Cible : lignes 100, 163, 170, 223 de
 - 100 : ``/api/benchmark/start`` retourne 429 quand le sémaphore
   des jobs concurrents est plein ;
 - 163 : ``validated_prompt_filename`` est appelé pour chaque
-  ``CompetitorConfig.prompt_file`` non-vide → un nom de prompt
   invalide doit être rejeté en 400 (vecteur d'exfiltration LLM) ;
 - 170 : ``/api/benchmark/run`` retourne 429 quand le sémaphore
   est plein ;

 - 100 : ``/api/benchmark/start`` retourne 429 quand le sémaphore
   des jobs concurrents est plein ;
 - 163 : ``validated_prompt_filename`` est appelé pour chaque
+  ``PipelineConfig.prompt_file`` non-vide → un nom de prompt
   invalide doit être rejeté en 400 (vecteur d'exfiltration LLM) ;
 - 170 : ``/api/benchmark/run`` retourne 429 quand le sémaphore
   est plein ;

tests/web/test_s8_benchmark_utils_factory.py CHANGED Viewed

@@ -4,7 +4,7 @@
 Pourquoi ce fichier
 -------------------
 ``_build_llm_adapter`` et ``_engine_from_competitor`` sont les
-points de **routage** entre la config web (``CompetitorConfig``)
 et les adapters concrets : si une régression silencieusement
 fait passer ``mistral`` au lieu de ``openai``, ou ``tesseract``
 au lieu de ``mistral_ocr``, le benchmark tourne mais avec le
@@ -36,7 +36,7 @@ from picarones.interfaces.web.benchmark_utils import (
     _engine_from_competitor,
     sse_format,
 )
-from picarones.interfaces.web.models import CompetitorConfig
 # ──────────────────────────────────────────────────────────────────────
@@ -61,7 +61,7 @@ class TestBuildLLMAdapterRouting:
     def test_provider_routes_to_expected_adapter(
         self, provider: str, expected_class_name: str,
     ) -> None:
-        comp = CompetitorConfig(
             name="t", ocr_engine="", llm_provider=provider, llm_model="m",
         )
         adapter = _build_llm_adapter(comp)
@@ -71,7 +71,7 @@ class TestBuildLLMAdapterRouting:
         )
     def test_unknown_provider_raises_value_error(self) -> None:
-        comp = CompetitorConfig(
             name="t", ocr_engine="",
             llm_provider="some_made_up_provider", llm_model="x",
         )
@@ -82,7 +82,7 @@ class TestBuildLLMAdapterRouting:
         """Quand ``llm_model`` est vide, on passe ``None`` à
         l'adapter (qui utilise son default interne) — pas une
         chaîne vide qui serait rejetée par l'API."""
-        comp = CompetitorConfig(
             name="t", ocr_engine="", llm_provider="openai", llm_model="",
         )
         adapter = _build_llm_adapter(comp)
@@ -103,7 +103,7 @@ class TestEngineFromCompetitorOCROnly:
         """Le ``name`` est dérivé de ``(engine_id, ocr_model)`` pour
         que deux configs distinctes obtiennent automatiquement des
         identifiants différents au resolver (cf. S9 fix)."""
-        comp = CompetitorConfig(
             name="t", ocr_engine="tesseract", llm_provider="",
             ocr_model="fra",
         )
@@ -113,10 +113,10 @@ class TestEngineFromCompetitorOCROnly:
     def test_tesseract_only_different_lang_distinct_name(self) -> None:
         """Garantie anti-collision : ``lang=eng`` et ``lang=fra``
         produisent des ``name`` distincts au resolver."""
-        comp_fra = CompetitorConfig(
             ocr_engine="tesseract", llm_provider="", ocr_model="fra",
         )
-        comp_eng = CompetitorConfig(
             ocr_engine="tesseract", llm_provider="", ocr_model="eng",
         )
         assert _engine_from_competitor(comp_fra).name == "tesseract_fra"
@@ -126,7 +126,7 @@ class TestEngineFromCompetitorOCROnly:
         """``RuntimeError`` (et pas ``ValueError`` brut) — c'est le
         contrat documenté pour que le worker thread puisse
         loguer ``warning`` et passer au concurrent suivant."""
-        comp = CompetitorConfig(
             name="t", ocr_engine="not_an_engine", llm_provider="",
         )
         with pytest.raises(RuntimeError, match="inconnu"):
@@ -141,30 +141,61 @@ class TestEngineFromCompetitorPipeline:
         ("pipeline_mode", "expected_mode"),
         [
             ("text_only", "text_only"),
-            ("post_correction_text", "text_only"),
             ("text_and_image", "text_and_image"),
-            ("post_correction_image", "text_and_image"),
-            ("", "text_only"),  # fallback
         ],
     )
-    def test_pipeline_mode_mapping_with_ocr(
         self, pipeline_mode: str, expected_mode: str,
     ) -> None:
-        """Modes qui exigent un OCR amont (``text_only``,
-        ``text_and_image``) — testés avec ``tesseract`` réel."""
-        comp = CompetitorConfig(
             name="t", ocr_engine="tesseract", llm_provider="mistral",
             llm_model="m", ocr_model="fra", pipeline_mode=pipeline_mode,
         )
         pipeline = _engine_from_competitor(comp)
         assert pipeline.mode == expected_mode
     def test_zero_shot_mode_requires_corpus_ocr(self) -> None:
         """Le mode ``zero_shot`` exige ``ocr_adapter=None`` au niveau
         du pipeline (le VLM lit l'image directement) — donc côté
         factory web, il doit être combiné avec ``ocr_engine=corpus``
         ou ``""``, pas avec un moteur live."""
-        comp = CompetitorConfig(
             name="t", ocr_engine="corpus", llm_provider="mistral",
             llm_model="m", pipeline_mode="zero_shot",
         )
@@ -173,18 +204,20 @@ class TestEngineFromCompetitorPipeline:
         assert pipeline.ocr_adapter is None
     def test_pipeline_name_from_explicit_name(self) -> None:
-        comp = CompetitorConfig(
             name="my-pipeline", ocr_engine="tesseract",
             llm_provider="mistral", llm_model="m", ocr_model="fra",
         )
         pipeline = _engine_from_competitor(comp)
         assert pipeline.pipeline_name == "my-pipeline"
     def test_pipeline_name_default_format(self) -> None:
         """Sans ``name`` explicite, format ``{engine} → {model}``."""
-        comp = CompetitorConfig(
             name="", ocr_engine="tesseract", llm_provider="mistral",
             llm_model="ministral-3b-latest", ocr_model="fra",
         )
         pipeline = _engine_from_competitor(comp)
         assert "tesseract" in pipeline.pipeline_name
@@ -195,9 +228,10 @@ class TestEngineFromCompetitorPipeline:
         par défaut (``correction_medieval_french.txt``).  Cf. S9 :
         ``prompt_template`` contient désormais le CONTENU lu sur
         disque, pas le filename brut."""
-        comp = CompetitorConfig(
             name="t", ocr_engine="tesseract", llm_provider="mistral",
             llm_model="m", ocr_model="fra", prompt_file="",
         )
         pipeline = _engine_from_competitor(comp)
         # Le template ne doit PAS être le filename littéral.
@@ -220,7 +254,7 @@ class TestEngineFromCompetitorCorpusOCR:
     def test_corpus_or_empty_without_llm_raises(
         self, ocr_engine: str,
     ) -> None:
-        comp = CompetitorConfig(
             name="t", ocr_engine=ocr_engine, llm_provider="",
         )
         with pytest.raises(ValueError, match="llm_provider"):
@@ -233,7 +267,7 @@ class TestEngineFromCompetitorCorpusOCR:
         """Mode corpus + LLM → pipeline ``zero_shot`` (le LLM/VLM
         traite l'image ou l'OCR pré-calculé, l'``ocr_adapter`` est
         ``None``)."""
-        comp = CompetitorConfig(
             name="post-corr", ocr_engine=ocr_engine,
             llm_provider="mistral", llm_model="m",
             pipeline_mode="zero_shot",
@@ -247,7 +281,7 @@ class TestEngineFromCompetitorCorpusOCR:
     def test_corpus_pipeline_name_format(self) -> None:
         """Sans ``name``, format ``corpus_ocr → {model}``."""
-        comp = CompetitorConfig(
             name="", ocr_engine="corpus", llm_provider="mistral",
             llm_model="ministral-3b-latest",
             pipeline_mode="zero_shot",
@@ -273,7 +307,7 @@ class TestEngineFromCompetitorCloudWithoutSDK:
     def test_cloud_engine_without_sdk_runtime_error(
         self, engine: str, module_path: str,
     ) -> None:
-        comp = CompetitorConfig(
             name="t", ocr_engine=engine, llm_provider="",
         )
         with patch.dict(sys.modules, {module_path: None}):

 Pourquoi ce fichier
 -------------------
 ``_build_llm_adapter`` et ``_engine_from_competitor`` sont les
+points de **routage** entre la config web (``PipelineConfig``)
 et les adapters concrets : si une régression silencieusement
 fait passer ``mistral`` au lieu de ``openai``, ou ``tesseract``
 au lieu de ``mistral_ocr``, le benchmark tourne mais avec le
     _engine_from_competitor,
     sse_format,
 )
+from picarones.interfaces.web.models import PipelineConfig
 # ──────────────────────────────────────────────────────────────────────
     def test_provider_routes_to_expected_adapter(
         self, provider: str, expected_class_name: str,
     ) -> None:
+        comp = PipelineConfig(
             name="t", ocr_engine="", llm_provider=provider, llm_model="m",
         )
         adapter = _build_llm_adapter(comp)
         )
     def test_unknown_provider_raises_value_error(self) -> None:
+        comp = PipelineConfig(
             name="t", ocr_engine="",
             llm_provider="some_made_up_provider", llm_model="x",
         )
         """Quand ``llm_model`` est vide, on passe ``None`` à
         l'adapter (qui utilise son default interne) — pas une
         chaîne vide qui serait rejetée par l'API."""
+        comp = PipelineConfig(
             name="t", ocr_engine="", llm_provider="openai", llm_model="",
         )
         adapter = _build_llm_adapter(comp)
         """Le ``name`` est dérivé de ``(engine_id, ocr_model)`` pour
         que deux configs distinctes obtiennent automatiquement des
         identifiants différents au resolver (cf. S9 fix)."""
+        comp = PipelineConfig(
             name="t", ocr_engine="tesseract", llm_provider="",
             ocr_model="fra",
         )
     def test_tesseract_only_different_lang_distinct_name(self) -> None:
         """Garantie anti-collision : ``lang=eng`` et ``lang=fra``
         produisent des ``name`` distincts au resolver."""
+        comp_fra = PipelineConfig(
             ocr_engine="tesseract", llm_provider="", ocr_model="fra",
         )
+        comp_eng = PipelineConfig(
             ocr_engine="tesseract", llm_provider="", ocr_model="eng",
         )
         assert _engine_from_competitor(comp_fra).name == "tesseract_fra"
         """``RuntimeError`` (et pas ``ValueError`` brut) — c'est le
         contrat documenté pour que le worker thread puisse
         loguer ``warning`` et passer au concurrent suivant."""
+        comp = PipelineConfig(
             name="t", ocr_engine="not_an_engine", llm_provider="",
         )
         with pytest.raises(RuntimeError, match="inconnu"):
         ("pipeline_mode", "expected_mode"),
         [
             ("text_only", "text_only"),
             ("text_and_image", "text_and_image"),
         ],
     )
+    def test_pipeline_mode_passes_through_with_ocr(
         self, pipeline_mode: str, expected_mode: str,
     ) -> None:
+        """Modes canoniques qui exigent un OCR amont — Phase 2 du
+        chantier post-rewrite : plus de mapping/alias.  Les 3 valeurs
+        de :class:`PipelineMode` traversent telles quelles vers le
+        ``OCRLLMPipelineConfig`` (``zero_shot`` testé séparément car
+        il refuse l'OCR amont)."""
+        comp = PipelineConfig(
             name="t", ocr_engine="tesseract", llm_provider="mistral",
             llm_model="m", ocr_model="fra", pipeline_mode=pipeline_mode,
         )
         pipeline = _engine_from_competitor(comp)
         assert pipeline.mode == expected_mode
+    @pytest.mark.parametrize(
+        "deprecated_mode",
+        ["post_correction_text", "post_correction_image", "POST_CORRECTION_TEXT"],
+    )
+    def test_legacy_aliases_rejected_at_pydantic_level(
+        self, deprecated_mode: str,
+    ) -> None:
+        """Phase 2 rupture API : les anciens alias
+        (``post_correction_text``/``post_correction_image``) sont
+        rejetés par Pydantic au niveau ``PipelineConfig`` — plus de
+        mapping silencieux vers ``text_only`` / ``text_and_image``."""
+        from pydantic import ValidationError
+        with pytest.raises(ValidationError):
+            PipelineConfig(
+                name="t", ocr_engine="tesseract", llm_provider="mistral",
+                llm_model="m", ocr_model="fra",
+                pipeline_mode=deprecated_mode,
+            )
+    def test_empty_pipeline_mode_with_llm_raises(self) -> None:
+        """Phase 2 rupture API : un client qui combine ``llm_provider``
+        non vide avec ``pipeline_mode=""`` reçoit désormais une
+        ``ValueError`` claire — l'ancien fallback silencieux vers
+        ``text_only`` masquait la config incomplète."""
+        comp = PipelineConfig(
+            name="t", ocr_engine="tesseract", llm_provider="mistral",
+            llm_model="m", ocr_model="fra", pipeline_mode="",
+        )
+        with pytest.raises(ValueError, match="pipeline_mode invalide"):
+            _engine_from_competitor(comp)
     def test_zero_shot_mode_requires_corpus_ocr(self) -> None:
         """Le mode ``zero_shot`` exige ``ocr_adapter=None`` au niveau
         du pipeline (le VLM lit l'image directement) — donc côté
         factory web, il doit être combiné avec ``ocr_engine=corpus``
         ou ``""``, pas avec un moteur live."""
+        comp = PipelineConfig(
             name="t", ocr_engine="corpus", llm_provider="mistral",
             llm_model="m", pipeline_mode="zero_shot",
         )
         assert pipeline.ocr_adapter is None
     def test_pipeline_name_from_explicit_name(self) -> None:
+        comp = PipelineConfig(
             name="my-pipeline", ocr_engine="tesseract",
             llm_provider="mistral", llm_model="m", ocr_model="fra",
+            pipeline_mode="text_only",
         )
         pipeline = _engine_from_competitor(comp)
         assert pipeline.pipeline_name == "my-pipeline"
     def test_pipeline_name_default_format(self) -> None:
         """Sans ``name`` explicite, format ``{engine} → {model}``."""
+        comp = PipelineConfig(
             name="", ocr_engine="tesseract", llm_provider="mistral",
             llm_model="ministral-3b-latest", ocr_model="fra",
+            pipeline_mode="text_only",
         )
         pipeline = _engine_from_competitor(comp)
         assert "tesseract" in pipeline.pipeline_name
         par défaut (``correction_medieval_french.txt``).  Cf. S9 :
         ``prompt_template`` contient désormais le CONTENU lu sur
         disque, pas le filename brut."""
+        comp = PipelineConfig(
             name="t", ocr_engine="tesseract", llm_provider="mistral",
             llm_model="m", ocr_model="fra", prompt_file="",
+            pipeline_mode="text_only",
         )
         pipeline = _engine_from_competitor(comp)
         # Le template ne doit PAS être le filename littéral.
     def test_corpus_or_empty_without_llm_raises(
         self, ocr_engine: str,
     ) -> None:
+        comp = PipelineConfig(
             name="t", ocr_engine=ocr_engine, llm_provider="",
         )
         with pytest.raises(ValueError, match="llm_provider"):
         """Mode corpus + LLM → pipeline ``zero_shot`` (le LLM/VLM
         traite l'image ou l'OCR pré-calculé, l'``ocr_adapter`` est
         ``None``)."""
+        comp = PipelineConfig(
             name="post-corr", ocr_engine=ocr_engine,
             llm_provider="mistral", llm_model="m",
             pipeline_mode="zero_shot",
     def test_corpus_pipeline_name_format(self) -> None:
         """Sans ``name``, format ``corpus_ocr → {model}``."""
+        comp = PipelineConfig(
             name="", ocr_engine="corpus", llm_provider="mistral",
             llm_model="ministral-3b-latest",
             pipeline_mode="zero_shot",
     def test_cloud_engine_without_sdk_runtime_error(
         self, engine: str, module_path: str,
     ) -> None:
+        comp = PipelineConfig(
             name="t", ocr_engine=engine, llm_provider="",
         )
         with patch.dict(sys.modules, {module_path: None}):

tests/web/test_s9_ocr_engine_naming_contract.py CHANGED Viewed

@@ -31,7 +31,7 @@ from picarones.interfaces.web.benchmark_utils import (
     _OCR_KWARGS_BUILDERS,
     _engine_from_competitor,
 )
-from picarones.interfaces.web.models import CompetitorConfig
 # ``cfg_a`` et ``cfg_b`` sont passés tels quels au constructeur de
@@ -48,10 +48,10 @@ def test_two_distinct_configs_coexist_in_resolver(
     """Deux competitors avec ``ocr_model`` distincts doivent recevoir
     des ``name`` distincts au resolver — le bug Tesseract initial,
     généralisé à tous les moteurs supportés."""
-    comp_a = CompetitorConfig(
         ocr_engine=engine_id, ocr_model="cfg_a", llm_provider="",
     )
-    comp_b = CompetitorConfig(
         ocr_engine=engine_id, ocr_model="cfg_b", llm_provider="",
     )
     try:
@@ -82,10 +82,10 @@ def test_standalone_plus_pipeline_same_config_coexist(
     seul + un competitor pipeline OCR+LLM partageant la même config
     OCR.  Le resolver doit accepter (les 2 instances Python sont
     fonctionnellement équivalentes, déduplication idempotente)."""
-    comp_standalone = CompetitorConfig(
         ocr_engine=engine_id, ocr_model="same_config", llm_provider="",
     )
-    comp_pipeline = CompetitorConfig(
         ocr_engine=engine_id, ocr_model="same_config",
         llm_provider="mistral", llm_model="mistral-small-latest",
         pipeline_mode="text_only",

     _OCR_KWARGS_BUILDERS,
     _engine_from_competitor,
 )
+from picarones.interfaces.web.models import PipelineConfig
 # ``cfg_a`` et ``cfg_b`` sont passés tels quels au constructeur de
     """Deux competitors avec ``ocr_model`` distincts doivent recevoir
     des ``name`` distincts au resolver — le bug Tesseract initial,
     généralisé à tous les moteurs supportés."""
+    comp_a = PipelineConfig(
         ocr_engine=engine_id, ocr_model="cfg_a", llm_provider="",
     )
+    comp_b = PipelineConfig(
         ocr_engine=engine_id, ocr_model="cfg_b", llm_provider="",
     )
     try:
     seul + un competitor pipeline OCR+LLM partageant la même config
     OCR.  Le resolver doit accepter (les 2 instances Python sont
     fonctionnellement équivalentes, déduplication idempotente)."""
+    comp_standalone = PipelineConfig(
         ocr_engine=engine_id, ocr_model="same_config", llm_provider="",
     )
+    comp_pipeline = PipelineConfig(
         ocr_engine=engine_id, ocr_model="same_config",
         llm_provider="mistral", llm_model="mistral-small-latest",
         pipeline_mode="text_only",

tests/web/test_s9_prompt_loading.py CHANGED Viewed

@@ -41,7 +41,7 @@ from picarones.interfaces.web.benchmark_utils import (
     _engine_from_competitor,
     _load_prompt_content,
 )
-from picarones.interfaces.web.models import CompetitorConfig
 class TestLoadPromptContent:
@@ -113,7 +113,7 @@ class TestEngineFromCompetitorPassesPromptContent:
     pas le filename brut."""
     def test_pipeline_template_contains_file_content(self) -> None:
-        comp = CompetitorConfig(
             name="t",
             ocr_engine="tesseract",
             ocr_model="fra",
@@ -133,7 +133,7 @@ class TestEngineFromCompetitorPassesPromptContent:
     def test_default_prompt_loaded_when_none_specified(self) -> None:
         """``prompt_file`` vide → default
         ``correction_medieval_french.txt`` chargé."""
-        comp = CompetitorConfig(
             ocr_engine="tesseract", ocr_model="fra",
             llm_provider="mistral", llm_model="m",
             pipeline_mode="text_only", prompt_file="",
@@ -146,7 +146,7 @@ class TestEngineFromCompetitorPassesPromptContent:
         """Si le frontend envoie un filename qui n'existe pas, le
         factory doit lever proprement (pas continuer avec le filename
         comme prompt — c'est le bug d'origine)."""
-        comp = CompetitorConfig(
             ocr_engine="tesseract", ocr_model="fra",
             llm_provider="mistral", llm_model="m",
             pipeline_mode="text_only",

     _engine_from_competitor,
     _load_prompt_content,
 )
+from picarones.interfaces.web.models import PipelineConfig
 class TestLoadPromptContent:
     pas le filename brut."""
     def test_pipeline_template_contains_file_content(self) -> None:
+        comp = PipelineConfig(
             name="t",
             ocr_engine="tesseract",
             ocr_model="fra",
     def test_default_prompt_loaded_when_none_specified(self) -> None:
         """``prompt_file`` vide → default
         ``correction_medieval_french.txt`` chargé."""
+        comp = PipelineConfig(
             ocr_engine="tesseract", ocr_model="fra",
             llm_provider="mistral", llm_model="m",
             pipeline_mode="text_only", prompt_file="",
         """Si le frontend envoie un filename qui n'existe pas, le
         factory doit lever proprement (pas continuer avec le filename
         comme prompt — c'est le bug d'origine)."""
+        comp = PipelineConfig(
             ocr_engine="tesseract", ocr_model="fra",
             llm_provider="mistral", llm_model="m",
             pipeline_mode="text_only",

tests/web/test_sprint6_web_interface.py CHANGED Viewed

@@ -25,6 +25,7 @@ TestRunnerProgressCallback   (5 tests)  — progress_callback injecté dans run_
 from __future__ import annotations
 import json
 import os
 from pathlib import Path
@@ -33,6 +34,26 @@ from unittest.mock import patch
 import pytest
 from click.testing import CliRunner
 from fastapi.testclient import TestClient
 # ---------------------------------------------------------------------------
 # Fixtures
@@ -1277,9 +1298,9 @@ class TestFastAPICorpusUpload:
         buf = io.BytesIO()
         with zipfile.ZipFile(buf, "w") as zf:
-            zf.writestr("page001.jpg", b"\xff\xd8\xff")        # fake JPEG
             zf.writestr("page001.gt.txt", "Texte de la page 1")
-            zf.writestr("page002.png", b"\x89PNG")             # fake PNG
             zf.writestr("page002.gt.txt", "Texte de la page 2")
         buf.seek(0)
         return buf.getvalue()
@@ -1292,9 +1313,9 @@ class TestFastAPICorpusUpload:
         buf = io.BytesIO()
         with zipfile.ZipFile(buf, "w") as zf:
-            zf.writestr("page001.jpg", b"\xff\xd8\xff")
             zf.writestr("page001.gt.txt", "GT ok")
-            zf.writestr("page002.png", b"\x89PNG")             # pas de GT
         buf.seek(0)
         return buf.getvalue()
@@ -1427,7 +1448,7 @@ class TestFastAPICorpusUpload:
         buf = io.BytesIO()
         with zipfile.ZipFile(buf, "w") as zf:
-            zf.writestr("page001.png", b"\x89PNG")
             zf.writestr("page001.xml", alto_xml_bytes)
         buf.seek(0)
         return buf.getvalue()
@@ -1502,7 +1523,7 @@ class TestFastAPICorpusUpload:
         buf = io.BytesIO()
         with zipfile.ZipFile(buf, "w") as zf:
-            zf.writestr("page002.png", b"\x89PNG")
             zf.writestr("page002.xml", page_xml_bytes)
         buf.seek(0)
         return buf.getvalue()
@@ -1553,7 +1574,7 @@ class TestFastAPICorpusUpload:
         unknown_xml = b'<?xml version="1.0"?><root><item>foo</item></root>'
         buf = io.BytesIO()
         with zipfile.ZipFile(buf, "w") as zf:
-            zf.writestr("pageX.png", b"\x89PNG")
             zf.writestr("pageX.xml", unknown_xml)
         buf.seek(0)
         r = client.post(

 from __future__ import annotations
+import io
 import json
 import os
 from pathlib import Path
 import pytest
 from click.testing import CliRunner
 from fastapi.testclient import TestClient
+from PIL import Image as _PILImage
+def _minimal_image_bytes(fmt: str) -> bytes:
+    """Génère une image 1×1 valide qui passe ``validate_image_safe``.
+    Le durcissement Phase 1 du chantier post-rewrite appelle
+    ``Pillow.verify()`` sur chaque image extraite d'un ZIP — les
+    anciens placeholders ``b"\\xff\\xd8\\xff"`` (signature seule) sont
+    désormais rejetés.  Cette fonction produit l'image minimale au
+    setup des fixtures.
+    """
+    buf = io.BytesIO()
+    _PILImage.new("RGB", (1, 1), color=(200, 200, 200)).save(buf, fmt)
+    return buf.getvalue()
+_MINIMAL_PNG_BYTES = _minimal_image_bytes("PNG")
+_MINIMAL_JPEG_BYTES = _minimal_image_bytes("JPEG")
 # ---------------------------------------------------------------------------
 # Fixtures
         buf = io.BytesIO()
         with zipfile.ZipFile(buf, "w") as zf:
+            zf.writestr("page001.jpg", _MINIMAL_JPEG_BYTES)
             zf.writestr("page001.gt.txt", "Texte de la page 1")
+            zf.writestr("page002.png", _MINIMAL_PNG_BYTES)
             zf.writestr("page002.gt.txt", "Texte de la page 2")
         buf.seek(0)
         return buf.getvalue()
         buf = io.BytesIO()
         with zipfile.ZipFile(buf, "w") as zf:
+            zf.writestr("page001.jpg", _MINIMAL_JPEG_BYTES)
             zf.writestr("page001.gt.txt", "GT ok")
+            zf.writestr("page002.png", _MINIMAL_PNG_BYTES)
         buf.seek(0)
         return buf.getvalue()
         buf = io.BytesIO()
         with zipfile.ZipFile(buf, "w") as zf:
+            zf.writestr("page001.png", _MINIMAL_PNG_BYTES)
             zf.writestr("page001.xml", alto_xml_bytes)
         buf.seek(0)
         return buf.getvalue()
         buf = io.BytesIO()
         with zipfile.ZipFile(buf, "w") as zf:
+            zf.writestr("page002.png", _MINIMAL_PNG_BYTES)
             zf.writestr("page002.xml", page_xml_bytes)
         buf.seek(0)
         return buf.getvalue()
         unknown_xml = b'<?xml version="1.0"?><root><item>foo</item></root>'
         buf = io.BytesIO()
         with zipfile.ZipFile(buf, "w") as zf:
+            zf.writestr("pageX.png", _MINIMAL_PNG_BYTES)
             zf.writestr("pageX.xml", unknown_xml)
         buf.seek(0)
         r = client.post(