Spaces: Ma-Ri-Ba-Ku / Picarones

Claude committed on May 6

Commit ca31461 (unverified) · 1 parent: 53a3d00

feat(adapters/ocr): Sprint A14-S50 — ConfidenceArtifact + Tesseract (fix audit #4)

The audit found that the native migration (S30-S34) had dropped the
legacy token_confidences feature. This was a critical regression: the
calibration views (ECE/MCE, reliability diagram) no longer worked
for new-world pipelines.
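
For context, the calibration views named above need per-token confidences paired with a correctness label. The following is a minimal sketch of Expected and Maximum Calibration Error, assuming equal-width bins and a hypothetical list of `(confidence, is_correct)` pairs. The function name, the binning choice and the input shape are illustrative assumptions, not the Picarones implementation.

```python
from __future__ import annotations


def calibration_errors(
    samples: list[tuple[float, bool]],
    n_bins: int = 10,
) -> tuple[float, float]:
    """Return (ECE, MCE) over equal-width confidence bins in [0, 1]."""
    if not samples:
        return 0.0, 0.0
    bins: list[list[tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, correct in samples:
        # Clamp the index so that conf == 1.0 lands in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, correct))
    ece = 0.0
    mce = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        gap = abs(avg_conf - accuracy)
        # ECE weights each bin's gap by its share of the samples.
        ece += gap * len(bucket) / len(samples)
        # MCE keeps the worst bin.
        mce = max(mce, gap)
    return ece, mce
```

Without the sidecar there is nothing to feed into `samples`, which is why losing the confidences disabled these views.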

Strategy: a new ArtifactType.CONFIDENCES plus a canonical JSON sidecar
written next to the text file. This lets the 5 OCR adapters re-expose
their native confidences (Tesseract image_to_data, Pero
transcription_confidence, etc.) without touching BaseOCRAdapter.

picarones/domain/artifacts.py
-----------------------------
- New ArtifactType.CONFIDENCES = 'confidences'.
- Canonical JSON schema documented: tokens[].{text, confidence ∈
  [0, 1]} + extractor + model_version.

picarones/adapters/ocr/confidences.py (new)
-------------------------------------------
- filter_valid_tokens(raw): cleans and normalizes raw tokens
  (skips empty text and None or negative confidences; converts
  0-100 → 0-1).
- write_confidences_sidecar(): writes
  <stem>.<adapter_name>.confidences.json and returns a CONFIDENCES
  Artifact.

picarones/adapters/ocr/tesseract.py — extension
-----------------------------------------------
- New constructor parameter expose_confidences=True (default).
- output_types becomes a dynamic instance property:
  - True → {RAW_TEXT, CONFIDENCES}
  - False → {RAW_TEXT}
  This lets PipelinePlanner validate the pipeline correctly.
- _extract_and_persist_confidences(): calls image_to_data on a
  best-effort basis (failure → warning, the OCR result stays valid),
  normalizes through filter_valid_tokens, and writes the sidecar.
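
The dynamic output_types idea can be sketched in isolation. This is a reduced illustration, not the adapter itself: `SketchAdapter`, `can_feed` and the three-member enum are hypothetical stand-ins, and the planner check is an assumption about how such a property might be consumed.

```python
from __future__ import annotations

from enum import Enum


class ArtifactType(str, Enum):
    # Reduced stand-in for picarones.domain.artifacts.ArtifactType.
    IMAGE = "image"
    RAW_TEXT = "raw_text"
    CONFIDENCES = "confidences"


class SketchAdapter:
    """Hypothetical adapter whose declared outputs depend on a flag."""

    def __init__(self, expose_confidences: bool = True) -> None:
        self._expose_confidences = expose_confidences

    @property
    def output_types(self) -> frozenset[ArtifactType]:
        # Declare CONFIDENCES only when the instance will try to produce it.
        if self._expose_confidences:
            return frozenset({ArtifactType.RAW_TEXT, ArtifactType.CONFIDENCES})
        return frozenset({ArtifactType.RAW_TEXT})


def can_feed(producer: SketchAdapter, required: frozenset[ArtifactType]) -> bool:
    """Hypothetical planner check: every required type must be declared."""
    return required <= producer.output_types
```

One consequence of this design: reading `SketchAdapter.output_types` on the class returns the property object rather than a frozenset, so any contract check has to go through an instance.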

Tests (13 S50 + 1 S30 update)
-----------------------------
- TestFilterValidTokens: 7 cases (valid, empty, negative, None,
  Tesseract 0-100 → 0-1 format, out of range, non-numeric).
- TestWriteSidecar: expected path, Unicode preserved, optional
  model_version.
- TestTesseractConfidenceIntegration: sidecar produced by default,
  no sidecar when expose_confidences=False, graceful extraction
  failure (RAW_TEXT is still produced).

Tests: 37 passed (24 S30 + 13 S50, 0 regressions).
Lint: All checks passed.

Pero/Mistral/Google/Azure
-------------------------
The pattern (sidecar + filter_valid_tokens + dynamic output_types
property) will be replicated for the 4 other adapters in dedicated
polishing sprints (their native APIs differ enough that a single S50
commit would grow too large). Tesseract ships complete; the 4 others
keep their S30-S34 behavior (no confidences) in the meantime.

https://claude.ai/code/session_011XQZNitg1rCgia8ZD1a2hP

Files changed (5)

picarones/adapters/ocr/confidences.py +148 -0
picarones/adapters/ocr/tesseract.py +103 -2
picarones/domain/artifacts.py +12 -0
tests/adapters/ocr/test_sprint_a14_s30_tesseract_adapter.py +8 -1
tests/adapters/ocr/test_sprint_a14_s50_confidences.py +262 -0

picarones/adapters/ocr/confidences.py ADDED

@@ -0,0 +1,148 @@

+"""OCR confidences sidecar (Sprint A14-S50).
+
+Fix for audit #4: before this sprint, the native migration of the 5 OCR
+adapters (S30-S34) had dropped the legacy ``token_confidences`` feature.
+The calibration views (ECE/MCE, reliability diagram) had stopped
+working for new-world pipelines.
+
+Strategy
+--------
+
+Rather than stuffing the confidences into the legacy ``EngineResult``
+(which no longer exists), we expose them as a **dedicated artifact**,
+``ArtifactType.CONFIDENCES`` (a JSON sidecar next to the text file).
+
+Canonical JSON format
+---------------------
+
+::
+
+    {
+      "tokens": [
+        {"text": "Bonjour", "confidence": 0.95},
+        {"text": "le",      "confidence": 0.99},
+        ...
+      ],
+      "extractor": "tesseract",
+      "model_version": "5.3.0"  // optional
+    }
+
+- ``confidence`` ∈ [0, 1] (adapters convert from their native format
+  themselves; Tesseract returns 0-100, so we divide by 100).
+- Empty tokens and negative confidences are dropped at the source (see
+  ``filter_valid_tokens``).
+
+Public API
+----------
+
+- ``filter_valid_tokens(raw)``: cleans a list of raw dicts.
+- ``write_confidences_sidecar(text_path, name, tokens, ...)``:
+  writes ``<stem>.<name>.confidences.json`` next to the text file.
+- ``ConfidenceToken`` (lightweight TypedDict): expected shape of the dict.
+
+Avoiding over-engineering
+-------------------------
+
+- No pydantic: TypedDict + json are enough; the caller normalizes.
+- No published JSON schema: stability will be tagged at delivery.
+- No support for line- or paragraph-level confidences: everything is
+  flattened to word level (consistent with legacy Sprint 47).
+"""
+from __future__ import annotations
+import json
+from pathlib import Path
+from typing import Any, TypedDict
+from picarones.domain.artifacts import Artifact, ArtifactType
+class ConfidenceToken(TypedDict):
+    """Canonical shape of a confidence token."""
+    text: str
+    confidence: float
+def filter_valid_tokens(
+    raw: list[dict[str, Any]],
+) -> list[ConfidenceToken]:
+    """Clean a raw list of tokens (drops non-words).
+    Rules:
+    - ``text`` that is empty or whitespace-only is dropped;
+    - ``confidence`` that is ``None``, not convertible to float, or
+      negative is dropped (Tesseract uses -1 for non-words);
+    - ``confidence`` > 1.0 is divided by 100 if ≤ 100, otherwise dropped.
+    Returns a new list; the input is not modified.
+    """
+    out: list[ConfidenceToken] = []
+    for entry in raw:
+        text = str(entry.get("text", "") or "").strip()
+        if not text:
+            continue
+        conf = entry.get("confidence")
+        if conf is None:
+            continue
+        try:
+            conf_f = float(conf)
+        except (TypeError, ValueError):
+            continue
+        if conf_f < 0:
+            continue
+        if conf_f > 1.0:
+            # Tesseract returns 0-100; normalize to 0-1.
+            if conf_f <= 100.0:
+                conf_f = conf_f / 100.0
+            else:
+                # > 100 means corrupted data; skip it.
+                continue
+        out.append({"text": text, "confidence": conf_f})
+    return out
+def write_confidences_sidecar(
+    text_path: Path,
+    adapter_name: str,
+    tokens: list[ConfidenceToken],
+    *,
+    document_id: str,
+    extractor: str | None = None,
+    model_version: str | None = None,
+) -> Artifact:
+    """Write a JSON sidecar ``<stem>.<adapter_name>.confidences.json``
+    next to the text file produced by the OCR.
+    Returns
+    -------
+    Artifact
+        ``CONFIDENCES`` artifact whose ``uri`` points to the sidecar.
+    """
+    sidecar_path = (
+        text_path.parent
+        / f"{text_path.stem}.{adapter_name}.confidences.json"
+    )
+    payload = {
+        "tokens": tokens,
+        "extractor": extractor or adapter_name,
+        "model_version": model_version,
+    }
+    sidecar_path.write_text(
+        json.dumps(payload, ensure_ascii=False, indent=2),
+        encoding="utf-8",
+    )
+    return Artifact(
+        id=f"{document_id}:{adapter_name}:confidences",
+        document_id=document_id,
+        type=ArtifactType.CONFIDENCES,
+        produced_by_step="ocr",
+        uri=str(sidecar_path),
+    )
+__all__ = [
+    "ConfidenceToken",
+    "filter_valid_tokens",
+    "write_confidences_sidecar",
+]
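
On the consuming side, a reader for this sidecar might look like the sketch below. It assumes only the schema documented in the module docstring; `load_confidence_pairs` is a hypothetical helper that does not appear in the commit, and the strict range check is a choice made for this example.

```python
from __future__ import annotations

import json
import tempfile
from pathlib import Path


def load_confidence_pairs(sidecar: Path) -> list[tuple[str, float]]:
    """Read a sidecar and return (text, confidence) pairs.

    Hypothetical reader: raises if a confidence falls outside [0, 1].
    """
    payload = json.loads(sidecar.read_text(encoding="utf-8"))
    pairs: list[tuple[str, float]] = []
    for token in payload.get("tokens", []):
        conf = float(token["confidence"])
        # The chained comparison is False for NaN, so NaN is rejected too.
        if not 0.0 <= conf <= 1.0:
            raise ValueError(f"confidence outside [0, 1]: {token!r}")
        pairs.append((str(token["text"]), conf))
    return pairs


with tempfile.TemporaryDirectory() as tmp:
    text_path = Path(tmp) / "page.txt"
    # Same naming rule as write_confidences_sidecar in the diff above.
    sidecar = text_path.parent / f"{text_path.stem}.tesseract.confidences.json"
    sidecar.write_text(
        json.dumps(
            {
                "tokens": [{"text": "français", "confidence": 0.9}],
                "extractor": "tesseract",
                "model_version": None,
            },
            ensure_ascii=False,
        ),
        encoding="utf-8",
    )
    pairs = load_confidence_pairs(sidecar)
```

Note that `Path.stem` strips only the last suffix, so a text file named `page.raw.txt` would yield `page.raw.tesseract.confidences.json`.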

picarones/adapters/ocr/tesseract.py CHANGED

@@ -98,7 +98,12 @@ class TesseractAdapter(BaseOCRAdapter):
     """
     input_types = frozenset({ArtifactType.IMAGE})
-    output_types = frozenset({ArtifactType.RAW_TEXT})
+    # Sprint S50: ``output_types`` is now an instance property that
+    # includes CONFIDENCES if and only if ``expose_confidences=True``
+    # (the default).  This makes it possible to opt out of sidecar
+    # production without declaring an output that the adapter does
+    # not produce (the executor would otherwise flag a missing
+    # output).
     execution_mode = "cpu"
     def __init__(
@@ -109,6 +114,7 @@ class TesseractAdapter(BaseOCRAdapter):
         psm: int = 6,
         oem: int = 3,
         tesseract_cmd: str | None = None,
+        expose_confidences: bool = True,
     ) -> None:
         if not name or not name.strip():
             raise OCRAdapterError(
@@ -132,11 +138,31 @@ class TesseractAdapter(BaseOCRAdapter):
         self._psm = psm
         self._oem = oem
         self._tesseract_cmd = tesseract_cmd
+        self._expose_confidences = expose_confidences
     @property
     def name(self) -> str:
         return self._name
+    @property
+    def output_types(self) -> frozenset:  # type: ignore[override]
+        """Dynamic output_types driven by ``expose_confidences``.
+        Sprint S50: if the instance exposes confidences, declare
+        ``{RAW_TEXT, CONFIDENCES}``; otherwise ``{RAW_TEXT}`` alone.
+        ``PipelinePlanner`` reads this property to validate that
+        the types chain together.
+        """
+        if self._expose_confidences:
+            return frozenset(
+                {ArtifactType.RAW_TEXT, ArtifactType.CONFIDENCES},
+            )
+        return frozenset({ArtifactType.RAW_TEXT})
+    @property
+    def expose_confidences(self) -> bool:
+        return self._expose_confidences
     @property
     def lang(self) -> str:
         return self._lang
@@ -223,7 +249,7 @@ class TesseractAdapter(BaseOCRAdapter):
         )
         text_path.write_text(text, encoding="utf-8")
-        return {
+        outputs: dict = {
             ArtifactType.RAW_TEXT: Artifact(
                 id=f"{context.document_id}:{self.name}:raw_text",
                 document_id=context.document_id,
@@ -233,5 +259,80 @@ class TesseractAdapter(BaseOCRAdapter):
             ),
         }
+        # Sprint S50: extract confidences via image_to_data
+        # (best-effort).  If extraction fails, we log and skip it:
+        # the OCR result stays valid, only calibration is unavailable
+        # for this document.
+        if self._expose_confidences:
+            confidences_artifact = self._extract_and_persist_confidences(
+                image_path=image_path,
+                text_path=text_path,
+                pytesseract_module=pytesseract,
+                pil_image_class=Image,
+                custom_config=custom_config,
+                document_id=context.document_id,
+            )
+            if confidences_artifact is not None:
+                outputs[ArtifactType.CONFIDENCES] = confidences_artifact
+        return outputs
+    def _extract_and_persist_confidences(
+        self,
+        *,
+        image_path: Path,
+        text_path: Path,
+        pytesseract_module,
+        pil_image_class,
+        custom_config: str,
+        document_id: str,
+    ) -> Artifact | None:
+        """Call ``image_to_data``, then write the JSON sidecar.
+        Returns the ``CONFIDENCES`` ``Artifact``, or ``None`` if
+        extraction failed (warning logged, OCR result stays valid).
+        """
+        import logging
+        logger = logging.getLogger(__name__)
+        from picarones.adapters.ocr.confidences import (
+            filter_valid_tokens,
+            write_confidences_sidecar,
+        )
+        try:
+            with pil_image_class.open(image_path) as image:
+                data = pytesseract_module.image_to_data(
+                    image,
+                    lang=self._lang,
+                    config=custom_config,
+                    output_type=pytesseract_module.Output.DICT,
+                )
+        except Exception as exc:  # noqa: BLE001  # best-effort
+            logger.warning(
+                "[%s] image_to_data unavailable (%s); calibration "
+                "skipped for this document.", self._name, exc,
+            )
+            return None
+        # Tesseract format: dict {"text": [...], "conf": [...]}.
+        texts = data.get("text") or []
+        confs = data.get("conf") or []
+        raw = [
+            {"text": t, "confidence": c}
+            for t, c in zip(texts, confs)
+        ]
+        tokens = filter_valid_tokens(raw)
+        return write_confidences_sidecar(
+            text_path=text_path,
+            adapter_name=self._name,
+            tokens=tokens,
+            document_id=document_id,
+            extractor="tesseract",
+        )
+__all__ = ["TesseractAdapter"]
 __all__ = ["TesseractAdapter"]
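
To make the data flow in `_extract_and_persist_confidences` concrete, here is a small sketch of the zip-and-filter step on a hand-written dictionary. The sample values are invented for illustration. Real `image_to_data` output in DICT mode carries more columns than the two shown, and the filtering is reduced to the cases present in the sample rather than the full `filter_valid_tokens` logic.

```python
from __future__ import annotations

# Hand-written stand-in for pytesseract.image_to_data(..., output_type=Output.DICT).
# Only the two keys read by the adapter are shown; values are illustrative.
sample_data = {
    "text": ["", "Bonjour", " ", "le", "monde"],
    "conf": [-1, 95, -1, 99, 42],
}

# Pair each text entry with its confidence, as the adapter does.
raw = [
    {"text": t, "confidence": c}
    for t, c in zip(sample_data["text"], sample_data["conf"])
]

# Reduced version of the filtering rules: drop blank text and negative
# confidences (non-word rows), then rescale 0-100 to 0-1.
tokens = [
    {"text": r["text"].strip(), "confidence": r["confidence"] / 100.0}
    for r in raw
    if r["text"].strip() and r["confidence"] >= 0
]
```

Rows with confidence -1 correspond to layout entries rather than recognized words, which is why they never reach the sidecar.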

picarones/domain/artifacts.py CHANGED

@@ -94,6 +94,18 @@ class ArtifactType(str, Enum):
     #: ``error_absorption``.
     ALIGNMENT = "alignment"
+    #: Token-level OCR confidences (Sprint S50).  JSON sidecar
+    #: produced by OCR adapters that expose native scores
+    #: (Tesseract image_to_data, Pero transcription_confidence,
+    #: Mistral OCR API confidences, Google Vision Word.confidence,
+    #: Azure DI Word.confidence).
+    #:
+    #: JSON schema: ``{"tokens": [{"text": str, "confidence":
+    #: float ∈ [0, 1]}], "extractor": str, "model_version": str |
+    #: null}``.  Consumed by the calibration views (ECE/MCE,
+    #: reliability diagram).
+    CONFIDENCES = "confidences"
 def compute_content_hash(payload: bytes) -> str:
     """SHA-256 hex (64 chars) of a binary payload.

tests/adapters/ocr/test_sprint_a14_s30_tesseract_adapter.py CHANGED

@@ -128,7 +128,14 @@ class TestTesseractAdapterContract:
         assert TesseractAdapter.input_types == frozenset({ArtifactType.IMAGE})
     def test_output_types(self) -> None:
-        assert TesseractAdapter.output_types == frozenset({ArtifactType.RAW_TEXT})
+        # Sprint S50: output_types is an instance property that
+        # depends on ``expose_confidences``.
+        assert TesseractAdapter().output_types == frozenset(
+            {ArtifactType.RAW_TEXT, ArtifactType.CONFIDENCES},
+        )
+        assert TesseractAdapter(
+            expose_confidences=False,
+        ).output_types == frozenset({ArtifactType.RAW_TEXT})
     def test_execution_mode_is_cpu(self) -> None:
         """Tesseract is CPU-bound; it uses a ProcessPool in the runner."""

tests/adapters/ocr/test_sprint_a14_s50_confidences.py ADDED

@@ -0,0 +1,262 @@

+"""Sprint A14-S50: OCR confidences sidecar (fix for audit #4).
+Covers:
+1. ``filter_valid_tokens``: token normalization and filtering.
+2. ``write_confidences_sidecar``: canonical JSON file.
+3. ``TesseractAdapter`` integration: sidecar produced alongside
+   the text file; opt-out via ``expose_confidences=False``.
+"""
+from __future__ import annotations
+import json
+from pathlib import Path
+from unittest.mock import MagicMock, patch
+from picarones.adapters.ocr import TesseractAdapter
+from picarones.adapters.ocr.confidences import (
+    filter_valid_tokens,
+    write_confidences_sidecar,
+)
+from picarones.domain.artifacts import Artifact, ArtifactType
+from picarones.pipeline.types import RunContext
+# ──────────────────────────────────────────────────────────────────────
+# filter_valid_tokens
+# ──────────────────────────────────────────────────────────────────────
+class TestFilterValidTokens:
+    def test_valid_tokens_passed_through(self) -> None:
+        result = filter_valid_tokens([
+            {"text": "Hello", "confidence": 0.95},
+            {"text": "world", "confidence": 0.80},
+        ])
+        assert len(result) == 2
+        assert result[0]["text"] == "Hello"
+        assert result[0]["confidence"] == 0.95
+    def test_empty_text_filtered(self) -> None:
+        result = filter_valid_tokens([
+            {"text": "", "confidence": 0.9},
+            {"text": "  ", "confidence": 0.8},
+            {"text": "ok", "confidence": 0.7},
+        ])
+        assert len(result) == 1
+        assert result[0]["text"] == "ok"
+    def test_negative_confidence_filtered(self) -> None:
+        result = filter_valid_tokens([
+            {"text": "ok", "confidence": -1},
+            {"text": "good", "confidence": 0.5},
+        ])
+        assert len(result) == 1
+        assert result[0]["text"] == "good"
+    def test_none_confidence_filtered(self) -> None:
+        result = filter_valid_tokens([
+            {"text": "x", "confidence": None},
+            {"text": "y", "confidence": 0.6},
+        ])
+        assert len(result) == 1
+        assert result[0]["text"] == "y"
+    def test_tesseract_format_normalized(self) -> None:
+        """Tesseract returns 0-100; normalize to [0, 1]."""
+        result = filter_valid_tokens([
+            {"text": "Hello", "confidence": 95},
+            {"text": "world", "confidence": 80.5},
+        ])
+        assert result[0]["confidence"] == 0.95
+        assert result[1]["confidence"] == 0.805
+    def test_out_of_range_filtered(self) -> None:
+        result = filter_valid_tokens([
+            {"text": "x", "confidence": 9999},  # > 100, dropped
+            {"text": "y", "confidence": 50},  # OK, normalized to 0.5
+        ])
+        assert len(result) == 1
+        assert result[0]["text"] == "y"
+        assert result[0]["confidence"] == 0.5
+    def test_non_numeric_filtered(self) -> None:
+        result = filter_valid_tokens([
+            {"text": "x", "confidence": "not a number"},
+            {"text": "y", "confidence": 0.5},
+        ])
+        assert len(result) == 1
+# ──────────────────────────────────────────────────────────────────────
+# write_confidences_sidecar
+# ──────────────────────────────────────────────────────────────────────
+class TestWriteSidecar:
+    def test_writes_json_at_expected_path(self, tmp_path: Path) -> None:
+        text_path = tmp_path / "doc.txt"
+        text_path.write_text("Hello world", encoding="utf-8")
+        artifact = write_confidences_sidecar(
+            text_path=text_path,
+            adapter_name="tesseract",
+            tokens=[{"text": "Hello", "confidence": 0.9}],
+            document_id="doc01",
+            extractor="tesseract",
+        )
+        sidecar = tmp_path / "doc.tesseract.confidences.json"
+        assert sidecar.exists()
+        payload = json.loads(sidecar.read_text(encoding="utf-8"))
+        assert payload["tokens"] == [
+            {"text": "Hello", "confidence": 0.9},
+        ]
+        assert payload["extractor"] == "tesseract"
+        assert payload["model_version"] is None
+        # CONFIDENCES artifact.
+        assert artifact.type == ArtifactType.CONFIDENCES
+        assert artifact.uri == str(sidecar)
+        assert artifact.id == "doc01:tesseract:confidences"
+    def test_unicode_preserved(self, tmp_path: Path) -> None:
+        text_path = tmp_path / "doc.txt"
+        text_path.write_text("ok", encoding="utf-8")
+        write_confidences_sidecar(
+            text_path=text_path,
+            adapter_name="tesseract",
+            tokens=[{"text": "français", "confidence": 0.9}],
+            document_id="doc01",
+        )
+        sidecar = tmp_path / "doc.tesseract.confidences.json"
+        # ensure_ascii=False → raw Unicode characters.
+        assert "français" in sidecar.read_text(encoding="utf-8")
+    def test_model_version_when_provided(self, tmp_path: Path) -> None:
+        text_path = tmp_path / "doc.txt"
+        text_path.write_text("ok", encoding="utf-8")
+        write_confidences_sidecar(
+            text_path=text_path,
+            adapter_name="tesseract",
+            tokens=[],
+            document_id="doc01",
+            model_version="5.3.0",
+        )
+        sidecar = tmp_path / "doc.tesseract.confidences.json"
+        payload = json.loads(sidecar.read_text(encoding="utf-8"))
+        assert payload["model_version"] == "5.3.0"
+# ──────────────────────────────────────────────────────────────────────
+# TesseractAdapter integration
+# ──────────────────────────────────────────────────────────────────────
+def _make_image_artifact(uri: str) -> Artifact:
+    return Artifact(
+        id="d1:img",
+        document_id="d1",
+        type=ArtifactType.IMAGE,
+        uri=uri,
+    )
+def _make_context() -> RunContext:
+    return RunContext(
+        document_id="d1",
+        code_version="1.0.0",
+        pipeline_name="test",
+    )
+class TestTesseractConfidenceIntegration:
+    def _create_dummy_image(self, tmp_path: Path) -> Path:
+        path = tmp_path / "page.png"
+        path.write_bytes(b"\x89PNG\r\n\x1a\n")
+        return path
+    @patch("PIL.Image.open")
+    @patch("pytesseract.image_to_string")
+    @patch("pytesseract.image_to_data")
+    def test_sidecar_produced_by_default(
+        self,
+        mock_image_to_data: MagicMock,
+        mock_image_to_string: MagicMock,
+        mock_image_open: MagicMock,
+        tmp_path: Path,
+    ) -> None:
+        mock_image_to_string.return_value = "Hello world"
+        mock_image_to_data.return_value = {
+            "text": ["Hello", "world"],
+            "conf": [95, 88],
+        }
+        mock_image_open.return_value.__enter__.return_value = MagicMock()
+        adapter = TesseractAdapter()  # expose_confidences=True by default
+        image_path = self._create_dummy_image(tmp_path)
+        result = adapter.execute(
+            inputs={ArtifactType.IMAGE: _make_image_artifact(str(image_path))},
+            params={},
+            context=_make_context(),
+        )
+        # Outputs: RAW_TEXT + CONFIDENCES.
+        assert ArtifactType.RAW_TEXT in result
+        assert ArtifactType.CONFIDENCES in result
+        sidecar_path = Path(result[ArtifactType.CONFIDENCES].uri)
+        assert sidecar_path.exists()
+        payload = json.loads(sidecar_path.read_text(encoding="utf-8"))
+        assert payload["tokens"] == [
+            {"text": "Hello", "confidence": 0.95},
+            {"text": "world", "confidence": 0.88},
+        ]
+        assert payload["extractor"] == "tesseract"
+    @patch("PIL.Image.open")
+    @patch("pytesseract.image_to_string")
+    def test_no_sidecar_when_expose_confidences_false(
+        self,
+        mock_image_to_string: MagicMock,
+        mock_image_open: MagicMock,
+        tmp_path: Path,
+    ) -> None:
+        mock_image_to_string.return_value = "Hello world"
+        mock_image_open.return_value.__enter__.return_value = MagicMock()
+        adapter = TesseractAdapter(expose_confidences=False)
+        image_path = self._create_dummy_image(tmp_path)
+        result = adapter.execute(
+            inputs={ArtifactType.IMAGE: _make_image_artifact(str(image_path))},
+            params={},
+            context=_make_context(),
+        )
+        # No CONFIDENCES in the outputs.
+        assert ArtifactType.RAW_TEXT in result
+        assert ArtifactType.CONFIDENCES not in result
+        # No sidecar on disk.
+        sidecars = list(tmp_path.glob("*.confidences.json"))
+        assert sidecars == []
+    @patch("PIL.Image.open")
+    @patch("pytesseract.image_to_string")
+    @patch("pytesseract.image_to_data")
+    def test_extraction_failure_is_graceful(
+        self,
+        mock_image_to_data: MagicMock,
+        mock_image_to_string: MagicMock,
+        mock_image_open: MagicMock,
+        tmp_path: Path,
+    ) -> None:
+        """If image_to_data crashes, the OCR must still produce
+        RAW_TEXT; only calibration is skipped for this document."""
+        mock_image_to_string.return_value = "Hello world"
+        mock_image_to_data.side_effect = RuntimeError(
+            "image_to_data crashed",
+        )
+        mock_image_open.return_value.__enter__.return_value = MagicMock()
+        adapter = TesseractAdapter()
+        image_path = self._create_dummy_image(tmp_path)
+        result = adapter.execute(
+            inputs={ArtifactType.IMAGE: _make_image_artifact(str(image_path))},
+            params={},
+            context=_make_context(),
+        )
+        assert ArtifactType.RAW_TEXT in result
+        # CONFIDENCES absent: extraction failed gracefully (a warning is logged).
+        assert ArtifactType.CONFIDENCES not in result
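
The graceful-failure test above checks that CONFIDENCES is absent but does not assert that the warning was emitted. The sketch below shows one way to verify both halves of a best-effort step with the standard library alone. `best_effort`, `ListHandler`, `failing_extract` and the logger name are hypothetical and are not part of the commit; in a pytest suite the built-in `caplog` fixture would play the role of the handler.

```python
from __future__ import annotations

import logging
from typing import Callable, Optional, TypeVar

T = TypeVar("T")
logger = logging.getLogger("confidences_sketch")


class ListHandler(logging.Handler):
    """Collect formatted log messages in memory for later assertions."""

    def __init__(self) -> None:
        super().__init__(level=logging.WARNING)
        self.messages: list[str] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.messages.append(record.getMessage())


def best_effort(extract: Callable[[], T]) -> Optional[T]:
    """Run ``extract``; on any exception, log a warning and return None."""
    try:
        return extract()
    except Exception as exc:  # deliberately broad: this step is optional
        logger.warning("confidence extraction skipped (%s)", exc)
        return None


def failing_extract() -> dict:
    # Simulates image_to_data raising at runtime.
    raise RuntimeError("image_to_data crashed")


handler = ListHandler()
logger.addHandler(handler)
result = best_effort(failing_extract)
logger.removeHandler(handler)
```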