Ma-Ri-Ba-Ku/Picarones

Claude committed on May 1

Commit eca43d9 (unverified) · 1 Parent(s): 00924d0
refactor(engines): unify the token_confidences API under a single canonical name

Before this commit, each OCR adapter exposed two equivalent APIs:

- Tesseract: ``_extract_token_confidences(image_path)`` (Sprint 47) +
``_extract_raw_confidences(native)`` (workstream 1)
- Pero: ``_extract_token_confidences_from_layout(layout)`` (Sprint 48) +
``_extract_raw_confidences(native)``
- Mistral: ``_extract_token_confidences_from_response(response)`` (Sprint 49) +
``_extract_raw_confidences(native)`` + ``_run_ocr_with_response(image_path)``
(which ``_run_with_native`` delegated to)
- Google Vision: ``_extract_token_confidences_from_full_text(full)`` (Sprint 50)
+ ``_extract_raw_confidences(native)`` + ``_run_ocr_with_full_annotation``
- Azure DI: ``_extract_token_confidences_from_result(result)`` (Sprint 51)
+ ``_extract_raw_confidences(native)`` + ``_run_ocr_with_result``

This dual API doubled the API surface, blurred the contract, and forced
anyone changing an adapter to touch two names.

This commit removes the legacy names and keeps **a single canonical
API** per engine:

- ``_run_with_native(image_path) → (text, native)``: the single API call
- ``_extract_raw_confidences(native) → list[dict] | None``: parsing
- ``_normalize_token_confidences(raw)``: final filtering (inherited from
``BaseOCREngine``)
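
The three hooks above can be pictured as a small pipeline. The following is only a sketch: ``SketchEngine``, its canned ``native`` payload and the ``token_confidences`` helper are hypothetical, and the filtering rules are inferred from the ``_normalize_token_confidences`` docstring rather than copied from the real ``BaseOCREngine``.

```python
from pathlib import Path
from typing import Any, Optional


class SketchEngine:
    """Hypothetical engine illustrating the three canonical hooks."""

    def _run_with_native(self, image_path: Path) -> tuple[str, Optional[dict]]:
        # Stand-in for the single API call; a real adapter would query its OCR backend.
        native = {"words": [("Bonjour", 0.98), ("", 0.50), ("monde", -1.0)]}
        return "Bonjour monde", native

    def _extract_raw_confidences(self, native: Any) -> Optional[list[dict[str, Any]]]:
        # Parsing only: no filtering and no rescaling at this stage.
        if not native:
            return None
        out = [{"token": t, "confidence": c} for t, c in native.get("words", [])]
        return out or None

    @staticmethod
    def _normalize_token_confidences(
        raw: Optional[list[dict[str, Any]]],
    ) -> Optional[list[dict[str, Any]]]:
        # Assumed rules: drop empty tokens, negative confidences and
        # values that cannot be converted to float; keep the native scale.
        if not raw:
            return None
        kept = []
        for item in raw:
            token = item.get("token")
            if not token:
                continue
            try:
                conf = float(item.get("confidence"))
            except (TypeError, ValueError):
                continue
            if conf < 0:
                continue
            kept.append({"token": token, "confidence": conf})
        return kept or None

    def token_confidences(self, image_path: Path) -> Optional[list[dict[str, Any]]]:
        # Hypothetical helper chaining the three hooks in order.
        _text, native = self._run_with_native(image_path)
        raw = self._extract_raw_confidences(native)
        return self._normalize_token_confidences(raw)
```

With one name per step, a change to parsing or filtering touches exactly one method per engine.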

Docstring update for ``_normalize_token_confidences``: it now states
that the native scale ([0, 100] for Tesseract, [0, 1] for the others)
is preserved. The final normalization, applied when calibration is
computed, is done in
``picarones.measurements.builtin_hooks.calibration_from_engine_result``.
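
To illustrate what normalizing at calibration time could look like, here is a minimal sketch. The ``to_unit_interval`` helper and the ``NATIVE_MAX`` table are hypothetical and are not the actual body of ``calibration_from_engine_result``; only the two scales come from this commit message.

```python
from typing import Any

# Assumed native upper bounds: [0, 100] for Tesseract, [0, 1] for every other engine.
NATIVE_MAX = {"tesseract": 100.0}


def to_unit_interval(
    token_confidences: list[dict[str, Any]], engine_name: str,
) -> list[float]:
    """Rescale native confidences to [0, 1] (hypothetical helper)."""
    scale = NATIVE_MAX.get(engine_name, 1.0)
    # Clamp after rescaling so slightly out-of-range values stay usable.
    return [
        min(max(float(item["confidence"]) / scale, 0.0), 1.0)
        for item in token_confidences
    ]
```

Keeping the rescaling in one place means ``EngineResult.token_confidences`` still carries what the engine actually reported.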

The Sprint 47-51 tests were migrated to the new API in the previous
commit.
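
As an illustration of what such a migrated test might look like, the sketch below stubs the network call under its canonical name ``_run_with_native`` instead of a legacy one. ``NetworkEngine`` and ``run_with_stubbed_backend`` are hypothetical stand-ins, not code from the Picarones test suite.

```python
from pathlib import Path
from typing import Optional
from unittest import mock


class NetworkEngine:
    """Hypothetical adapter whose only network entry point is ``_run_with_native``."""

    def _run_with_native(self, image_path: Path) -> tuple[str, Optional[dict]]:
        raise RuntimeError("network call not available in tests")

    def _run_ocr(self, image_path: Path) -> str:
        # Same delegation pattern as the adapters in this commit.
        text, _native = self._run_with_native(image_path)
        return text


def run_with_stubbed_backend() -> str:
    engine = NetworkEngine()
    fake = ("recognized text", {"pages": []})
    # Patch the canonical hook on the instance; no legacy alias is needed.
    with mock.patch.object(engine, "_run_with_native", return_value=fake) as stub:
        text = engine._run_ocr(Path("page.png"))
        stub.assert_called_once_with(Path("page.png"))
    return text
```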

https://claude.ai/code/session_01Hsd7kL8yeCbXn1mA7GQK9L

Files changed (6)

picarones/engines/azure_doc_intel.py +1 -21
picarones/engines/base.py +6 -10
picarones/engines/google_vision.py +1 -21
picarones/engines/mistral_ocr.py +1 -21
picarones/engines/pero_ocr.py +0 -11
picarones/engines/tesseract.py +0 -22

picarones/engines/azure_doc_intel.py

@@ -80,21 +80,12 @@ class AzureDocIntelEngine(BaseOCREngine):
         self._api_version: str = self.config.get("api_version", "2024-02-29-preview")
     def _run_ocr(self, image_path: Path) -> str:
-        """Backward-compat API: returns only the text."""
+        """Returns only the text (``BaseOCREngine`` interface)."""
         text, _result = self._run_with_native(image_path)
         return text
     def _run_with_native(
         self, image_path: Path,
-    ) -> tuple[str, Optional[dict]]:
-        """Framework hook (workstream 1): delegates to ``_run_ocr_with_result``
-        so that the Sprint 51 tests can monkeypatch the network call
-        under its legacy name.
-        """
-        return self._run_ocr_with_result(image_path)
-    def _run_ocr_with_result(
-        self, image_path: Path,
     ) -> tuple[str, Optional[dict]]:
         """Runs OCR and returns ``(text, analyze_result_dict)``.
@@ -252,14 +243,3 @@ class AzureDocIntelEngine(BaseOCREngine):
                     continue
                 out.append({"token": content, "confidence": conf})
         return out or None
-    def _extract_token_confidences_from_result(
-        self, result: Any,
-    ) -> Optional[list[dict[str, Any]]]:
-        """Backward-compat alias (Sprint 51): extracts confidences from an ``analyzeResult``.
-        Wrapper that chains ``_extract_raw_confidences`` and then
-        ``_normalize_token_confidences`` (filters out empty tokens / negative values).
-        """
-        raw = self._extract_raw_confidences(result)
-        return self._normalize_token_confidences(raw)

picarones/engines/base.py

@@ -172,22 +172,18 @@ class BaseOCREngine(BaseModule):
     def _normalize_token_confidences(
         raw: Optional[list[dict[str, Any]]],
     ) -> Optional[list[dict[str, Any]]]:
-        """Filters raw confidences (without changing the scale).
+        """Filters raw confidences (native scale preserved).
         - Empty or ``None`` tokens → dropped.
         - Negative confidences (Tesseract uses -1 for non-words) → dropped.
         - Confidences that cannot be converted to float → dropped.
-        The engines' **native scale** (Tesseract in [0, 100],
-        Google/Anthropic/Mistral in [0, 1]) is **preserved**. The
-        Sprint 42 runner (``_calibration_from_engine_result`` in
-        ``builtin_hooks``) normalizes on its own when computing
-        calibration. This discipline preserves backward compatibility for
-        the Sprint 47-51 tests that inspect ``EngineResult.token_confidences``.
-        Returns ``None`` if no entry is usable (rather than
-        an empty list), which signals the runner to skip
-        the calibration computation for this document.
+        The engines' native scale ([0, 100] for Tesseract,
+        [0, 1] for the others) is preserved. The final normalization,
+        applied when calibration is computed, is done in
+        :func:`picarones.measurements.builtin_hooks.calibration_from_engine_result`.
+        Returns ``None`` if no entry is usable.
         """
         if not raw:
             return None

picarones/engines/google_vision.py

@@ -78,21 +78,12 @@ class GoogleVisionEngine(BaseOCREngine):
         self._feature_type: str = self.config.get("feature_type", "DOCUMENT_TEXT_DETECTION")
     def _run_ocr(self, image_path: Path) -> str:
-        """Backward-compat API: returns only the text."""
+        """Returns only the text (``BaseOCREngine`` interface)."""
         text, _full = self._run_with_native(image_path)
         return text
     def _run_with_native(
         self, image_path: Path,
-    ) -> tuple[str, Optional[dict]]:
-        """Framework hook (workstream 1): delegates to ``_run_ocr_with_full_annotation``
-        so that the Sprint 50 tests can monkeypatch the network call
-        under its legacy name.
-        """
-        return self._run_ocr_with_full_annotation(image_path)
-    def _run_ocr_with_full_annotation(
-        self, image_path: Path,
     ) -> tuple[str, Optional[dict]]:
         """Runs OCR and returns ``(text, full_text_annotation_dict)``.
@@ -263,14 +254,3 @@ class GoogleVisionEngine(BaseOCREngine):
                             continue
                         out.append({"token": text, "confidence": conf})
         return out or None
-    def _extract_token_confidences_from_full_text(
-        self, full: Any,
-    ) -> Optional[list[dict[str, Any]]]:
-        """Backward-compat alias (Sprint 50): extracts confidences from a ``fullTextAnnotation``.
-        Wrapper that chains ``_extract_raw_confidences`` and then
-        ``_normalize_token_confidences`` (filters out empty tokens / negative values).
-        """
-        raw = self._extract_raw_confidences(full)
-        return self._normalize_token_confidences(raw)

picarones/engines/mistral_ocr.py

@@ -76,21 +76,12 @@ class MistralOCREngine(BaseOCREngine):
         self._max_tokens = int(self.config.get("max_tokens", 4096))
     def _run_ocr(self, image_path: Path) -> str:
-        """Backward-compat API: returns only the text."""
+        """Returns only the text (``BaseOCREngine`` interface)."""
         text, _raw = self._run_with_native(image_path)
         return text
     def _run_with_native(
         self, image_path: Path,
-    ) -> tuple[str, Optional[dict]]:
-        """Framework hook (workstream 1): delegates to ``_run_ocr_with_response``
-        so that the Sprint 49 tests can monkeypatch the network call
-        under its legacy name.
-        """
-        return self._run_ocr_with_response(image_path)
-    def _run_ocr_with_response(
-        self, image_path: Path,
     ) -> tuple[str, Optional[dict]]:
         """Runs OCR and returns ``(text, raw_response)``.
@@ -238,14 +229,3 @@ class MistralOCREngine(BaseOCREngine):
         for word in text.split():
             if word:
                 out.append({"token": word, "confidence": conf})
-    def _extract_token_confidences_from_response(
-        self, response: Any,
-    ) -> Optional[list[dict[str, Any]]]:
-        """Backward-compat alias (Sprint 49): extracts confidences from a JSON response.
-        Wrapper that chains ``_extract_raw_confidences`` and then
-        ``_normalize_token_confidences`` (filters out empty tokens / negative values).
-        """
-        raw = self._extract_raw_confidences(response)
-        return self._normalize_token_confidences(raw)

picarones/engines/pero_ocr.py

@@ -177,17 +177,6 @@ class PeroOCREngine(BaseOCREngine):
                         out.append({"token": word, "confidence": conf})
         return out or None
-    def _extract_token_confidences_from_layout(
-        self, layout: Any,
-    ) -> Optional[list[dict[str, Any]]]:
-        """Backward-compat alias (Sprint 48): extracts confidences from a ``page_layout``.
-        Wrapper that chains ``_extract_raw_confidences`` and then
-        ``_normalize_token_confidences`` (filters out empty tokens / negative values).
-        """
-        raw = self._extract_raw_confidences(layout)
-        return self._normalize_token_confidences(raw)
     @classmethod
     def from_config(cls, config: Optional[dict] = None) -> "PeroOCREngine":
         return cls(config=config or {})

picarones/engines/tesseract.py

@@ -172,28 +172,6 @@ class TesseractEngine(BaseOCREngine):
             out.append({"token": tok_text, "confidence": conf})
         return out or None
-    def _extract_token_confidences(
-        self, image_path: Path,
-    ) -> Optional[list[dict[str, Any]]]:
-        """Backward-compat alias (Sprint 47): extracts confidences from ``image_path``.
-        Internal pipeline of workstream 1: ``_run_with_native`` → ``_extract_raw_confidences``
-        → ``_normalize_token_confidences``. Returns ``None`` if pytesseract is
-        missing or if extraction fails (signals the runner to skip calibration).
-        """
-        if not _PYTESSERACT_AVAILABLE:
-            return None
-        try:
-            _text, native = self._run_with_native(Path(image_path))
-            raw = self._extract_raw_confidences(native)
-            return self._normalize_token_confidences(raw)
-        except Exception as exc:  # noqa: BLE001
-            logger.warning(
-                "[tesseract] token_confidences extraction unavailable: %s",
-                exc,
-            )
-            return None
     @classmethod
     def from_config(cls, config: Optional[dict] = None) -> "TesseractEngine":
         return cls(config=config or {})