Commit 4c12d6d by Claude · unverified · 1 parent: b5c2eaf

feat(adapters/ocr): Sprint A14-S34: native AzureDocIntelAdapter (Phase 2 done)


Native migration of the legacy picarones.engines.azure_doc_intel to
BaseOCRAdapter (S26). Not a shim. **Phase 2 complete**: all 5 OCR
engines are now migrated (Tesseract S30, Pero S31, Mistral S32,
Google Vision S33, Azure DI S34).

picarones/adapters/ocr/azure_doc_intel.py
-----------------------------------------
- AzureDocIntelAdapter(BaseOCRAdapter), execution_mode = "io".
- Keyword-only constructor: name, endpoint/api_key (override
  AZURE_DOC_INTEL_ENDPOINT/AZURE_DOC_INTEL_KEY), model_id (default
  "prebuilt-read"), locale (default "fr-FR"), api_version, timeout_seconds,
  max_polling_attempts, polling_interval_base.
- Constructor validation: name is alphanumeric plus _ and -,
  timeout_seconds and max_polling_attempts > 0, polling_interval_base >= 0.
- Routing: tries the azure-ai-documentintelligence SDK; falls back to REST
  on ImportError.
- Asynchronous Azure API: POST → Operation-Location → polling while
  status="running" (interval = base + 0.5 * attempt). Linear backoff,
  identical to the legacy.
- Statuses: succeeded → extracts pages.lines.content; failed/canceled →
  OCRAdapterError.
- _SDKMissing sentinel for a clean fallback: caught internally, never
  leaks to the caller.
- HTTP/SDK errors are wrapped in OCRAdapterError with type + message.
- Writes to <stem>.<name>.txt next to the image.
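
To make the polling schedule concrete, here is a minimal standalone sketch of the linear backoff and terminal-status handling described in the list above. It is not the adapter's code: `polling_delays` and `poll_until_terminal` are hypothetical names, the injected `sleep` parameter is an assumption added for testability, and plain `RuntimeError`/`TimeoutError` stand in for `OCRAdapterError`.

```python
import time
from collections.abc import Callable


def polling_delays(base: float, max_attempts: int, step: float = 0.5) -> list[float]:
    """Return the sleep before each poll: base + step * attempt."""
    return [base + step * attempt for attempt in range(max_attempts)]


def poll_until_terminal(
    fetch_status: Callable[[], dict],
    *,
    base: float = 1.0,
    max_attempts: int = 30,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll fetch_status until it reports succeeded, failed or canceled."""
    for delay in polling_delays(base, max_attempts):
        sleep(delay)
        result = fetch_status()
        status = result.get("status", "")
        if status == "succeeded":
            return result
        if status in {"failed", "canceled"}:
            raise RuntimeError(f"analysis {status}")
        # Any other status (for example "running"): keep polling.
    raise TimeoutError(f"no terminal status after {max_attempts} polls")


# Worst-case cumulative sleep with the defaults listed above (base 1.0, 30 polls).
print(sum(polling_delays(1.0, 30)))  # 247.5
```

With the default values, a job that never leaves "running" therefore sleeps for 247.5 s in total, plus the time spent in the HTTP requests, before the timeout error is raised.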

Dedicated S34 tests (31 new)
----------------------------
- Constructor: defaults, custom name/model_id, rejection of empty/
  invalid name, rejection of invalid timeout/max_polling/polling_interval.
- Contract: isinstance BaseOCRAdapter, input/output_types,
  execution_mode = "io".
- Auth: explicit api_key/endpoint priority, env fallback, no key/
  endpoint → OCRAdapterError, .rstrip("/") on endpoint.
- InputValidation: IMAGE missing, no URI, nonexistent image → all
  OCRAdapterError.
- REST: succeeded → text concatenated line by line, running then succeeded
  (polling), failed/canceled → OCRAdapterError, polling timeout after
  max_attempts → OCRAdapterError, no Operation-Location → OCRAdapterError,
  writes <stem>.<name>.txt.
- SDK: SDK call succeeds with begin_analyze_document, SDK exception →
  wrapped OCRAdapterError.
- ArtifactID: uses the adapter name.
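
The REST tests force the fallback path by making the SDK look uninstalled. A minimal sketch of that technique, assuming only the standard library: mapping a module name to `None` in `sys.modules` makes any import of it raise `ImportError`. The helper `can_import` is a hypothetical name, and `json` is used here only as a convenient stand-in for the Azure SDK modules.

```python
import importlib
import sys
from unittest.mock import patch


def can_import(module_name: str) -> bool:
    """Return True if the module can be imported, False on ImportError."""
    try:
        importlib.import_module(module_name)
    except ImportError:
        return False
    return True


# A None entry in sys.modules halts the import with ModuleNotFoundError,
# which is a subclass of ImportError.
with patch.dict(sys.modules, {"json": None}):
    hidden = can_import("json")

# patch.dict restores sys.modules on exit, so the import works again.
restored = can_import("json")
```

Because `patch.dict` undoes its changes when the block exits, the simulated absence cannot leak into other tests.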

Phase 2 summary (5 OCR engines migrated)
----------------------------------------
| Sprint | Adapter       | Tests | Delta |
|--------|---------------|-------|-------|
| S30    | Tesseract     | 24    | +24   |
| S31    | Pero OCR      | 21    | +21   |
| S32    | Mistral OCR   | 28    | +28   |
| S33    | Google Vision | 29    | +29   |
| S34    | Azure DI      | 31    | +31   |

Phase 2 total: 133 new tests, 5 adapters migrated natively.

Tests: 4731 passed, 11 skipped (vs. 4700 before: +31 from S34).
Lint: ruff check picarones/ tests/ → All checks passed.

https://claude.ai/code/session_011XQZNitg1rCgia8ZD1a2hP

README.md CHANGED
@@ -396,7 +396,7 @@ ruff check picarones/ tests/
  python -m mypy picarones/core/
  ```

- **Test suite**: ~4720 tests, ~3 min on a modern laptop. Coverage
+ **Test suite**: ~4750 tests, ~3 min on a modern laptop. Coverage
  floor at 85% (currently ~87%). The `network` marker excludes tests
  requiring live HTTP. A handful of tests depend on optional engines
  (`pero-ocr`, `pytesseract`) and are skipped/fail gracefully when
picarones/adapters/ocr/__init__.py CHANGED
@@ -19,6 +19,7 @@ dédiés, **natifs** au nouveau contrat (pas de shim sur le legacy

  from __future__ import annotations

+ from picarones.adapters.ocr.azure_doc_intel import AzureDocIntelAdapter
  from picarones.adapters.ocr.base import BaseOCRAdapter, OCRAdapterError
  from picarones.adapters.ocr.google_vision import GoogleVisionAdapter
  from picarones.adapters.ocr.mistral_ocr import MistralOCRAdapter
@@ -29,6 +30,7 @@ from picarones.adapters.ocr.tesseract import TesseractAdapter
  __all__ = [
      "BaseOCRAdapter",
      "OCRAdapterError",
+     "AzureDocIntelAdapter",
      "GoogleVisionAdapter",
      "MistralOCRAdapter",
      "PeroOCRAdapter",
picarones/adapters/ocr/azure_doc_intel.py ADDED
@@ -0,0 +1,368 @@
+ """Native ``AzureDocIntelAdapter``, Sprint A14-S34.
+
+ Native migration of the legacy ``picarones.engines.azure_doc_intel`` to
+ ``BaseOCRAdapter`` (S26). **Not a shim**.
+
+ The legacy module stays in place until S46.
+
+ BnF use case
+ ------------
+ Azure Document Intelligence (formerly Form Recognizer) offers
+ several pretrained models:
+
+ - ``prebuilt-read`` (default): generic reading optimized for
+   dense text documents.
+ - ``prebuilt-document``: layout + field extraction.
+ - ``prebuilt-layout``: layout analysis.
+ - custom trained models.
+
+ The API is asynchronous: the image is posted, then a status
+ endpoint is polled until the result is available.
+
+ The adapter routes automatically to the SDK
+ (``azure-ai-documentintelligence``) when available, otherwise to
+ direct REST via ``urllib`` (with polling).
+
+ Configuration
+ -------------
+ Constructor:
+
+ - ``name`` (default ``"azure_doc_intel"``).
+ - ``endpoint``: endpoint URL (overrides
+   ``AZURE_DOC_INTEL_ENDPOINT``).
+ - ``api_key``: API key (overrides ``AZURE_DOC_INTEL_KEY``).
+ - ``model_id`` (default ``"prebuilt-read"``).
+ - ``locale`` (default ``"fr-FR"``).
+ - ``api_version`` (default ``"2024-02-29-preview"``).
+ - ``timeout_seconds`` (default 60): timeout per HTTP request.
+ - ``max_polling_attempts`` (default 30): max number of REST polls.
+ - ``polling_interval_base`` (default 1.0): base interval between
+   polls (increased by 0.5 s per attempt; linear backoff
+   identical to the legacy).
+
+ Behavior
+ --------
+ 1. Validates the IMAGE input.
+ 2. Resolves endpoint + api_key (explicit > env).
+ 3. Tries the SDK; on ImportError, falls back to REST.
+ 4. For REST: POST → Operation-Location → poll until
+    ``succeeded`` / ``failed`` / ``canceled``.
+ 5. Extracts the text line by line in pages × lines order.
+ 6. Writes to ``<stem>.<name>.txt`` next to the image.
+
+ Avoiding over-engineering
+ -------------------------
+ - No confidence extraction (legacy S51, deferred).
+ - No multi-language support within a single request.
+ - No retry beyond polling (which is an implicit retry).
+ """
+
+ from __future__ import annotations
+
+ import json
+ import os
+ import time
+ import urllib.error
+ import urllib.request
+ from pathlib import Path
+ from typing import Any
+
+ from picarones.adapters.ocr.base import BaseOCRAdapter, OCRAdapterError
+ from picarones.domain.artifacts import Artifact, ArtifactType
+
+
+ class AzureDocIntelAdapter(BaseOCRAdapter):
+     """Azure Document Intelligence adapter native to the S26 contract.
+
+     Parameters
+     ----------
+     name:
+         Human-readable identifier. Default ``"azure_doc_intel"``.
+     endpoint:
+         Azure URL (overrides ``AZURE_DOC_INTEL_ENDPOINT``).
+     api_key:
+         Azure API key (overrides ``AZURE_DOC_INTEL_KEY``).
+     model_id:
+         ``"prebuilt-read"`` (default), ``"prebuilt-document"``,
+         ``"prebuilt-layout"``, or a custom trained model.
+     locale:
+         Azure locale (default ``"fr-FR"``).
+     api_version:
+         Azure API version (default ``"2024-02-29-preview"``).
+     timeout_seconds:
+         HTTP timeout (default 60).
+     max_polling_attempts:
+         Max number of REST polls (default 30).
+     polling_interval_base:
+         Base interval between polls (default 1.0 s, +0.5 s per attempt).
+
+     Raises
+     ------
+     OCRAdapterError
+         In the constructor if name is invalid or parameters are out of range.
+     """
+
+     input_types = frozenset({ArtifactType.IMAGE})
+     output_types = frozenset({ArtifactType.RAW_TEXT})
+     execution_mode = "io"
+
+     def __init__(
+         self,
+         *,
+         name: str = "azure_doc_intel",
+         endpoint: str | None = None,
+         api_key: str | None = None,
+         model_id: str = "prebuilt-read",
+         locale: str = "fr-FR",
+         api_version: str = "2024-02-29-preview",
+         timeout_seconds: float = 60.0,
+         max_polling_attempts: int = 30,
+         polling_interval_base: float = 1.0,
+     ) -> None:
+         if not name or not name.strip():
+             raise OCRAdapterError(
+                 "AzureDocIntelAdapter : name vide non autorisé.",
+             )
+         if not all(c.isalnum() or c in "_-" for c in name):
+             raise OCRAdapterError(
+                 f"AzureDocIntelAdapter : name invalide {name!r} — "
+                 "alphanumérique + _ - uniquement.",
+             )
+         if timeout_seconds <= 0:
+             raise OCRAdapterError(
+                 f"AzureDocIntelAdapter : timeout_seconds doit être > 0, "
+                 f"reçu {timeout_seconds}.",
+             )
+         if max_polling_attempts <= 0:
+             raise OCRAdapterError(
+                 f"AzureDocIntelAdapter : max_polling_attempts doit être "
+                 f"> 0, reçu {max_polling_attempts}.",
+             )
+         if polling_interval_base < 0:
+             raise OCRAdapterError(
+                 f"AzureDocIntelAdapter : polling_interval_base doit être "
+                 f">= 0, reçu {polling_interval_base}.",
+             )
+         self._name = name
+         self._explicit_endpoint = endpoint
+         self._explicit_api_key = api_key
+         self._model_id = model_id
+         self._locale = locale
+         self._api_version = api_version
+         self._timeout = timeout_seconds
+         self._max_polling_attempts = max_polling_attempts
+         self._polling_base = polling_interval_base
+
+     @property
+     def name(self) -> str:
+         return self._name
+
+     @property
+     def model_id(self) -> str:
+         return self._model_id
+
+     def _resolve_api_key(self) -> str:
+         key = self._explicit_api_key or os.environ.get("AZURE_DOC_INTEL_KEY")
+         if not key:
+             raise OCRAdapterError(
+                 f"{self.name} : clé API Azure manquante. Définir "
+                 "AZURE_DOC_INTEL_KEY ou passer api_key= au constructeur.",
+             )
+         return key
+
+     def _resolve_endpoint(self) -> str:
+         endpoint = (
+             self._explicit_endpoint
+             or os.environ.get("AZURE_DOC_INTEL_ENDPOINT", "")
+         ).rstrip("/")
+         if not endpoint:
+             raise OCRAdapterError(
+                 f"{self.name} : endpoint Azure manquant. Définir "
+                 "AZURE_DOC_INTEL_ENDPOINT ou passer endpoint= au "
+                 "constructeur.",
+             )
+         return endpoint
+
+     def execute(
+         self,
+         inputs: dict[ArtifactType, Artifact],
+         params: dict[str, Any],
+         context: Any,
+     ) -> dict[ArtifactType, Artifact]:
+         if ArtifactType.IMAGE not in inputs:
+             raise OCRAdapterError(
+                 f"{self.name} : input IMAGE manquant.",
+             )
+         image_artifact = inputs[ArtifactType.IMAGE]
+         if image_artifact.uri is None:
+             raise OCRAdapterError(
+                 f"{self.name} : artefact image "
+                 f"{image_artifact.id!r} sans URI.",
+             )
+         image_path = Path(image_artifact.uri)
+         if not image_path.exists():
+             raise OCRAdapterError(
+                 f"{self.name} : image introuvable {image_path!r}.",
+             )
+
+         api_key = self._resolve_api_key()
+         endpoint = self._resolve_endpoint()
+
+         # Try the SDK first; on ImportError, fall back to REST.
+         try:
+             text = self._call_via_sdk(image_path, endpoint, api_key)
+         except _SDKMissing:
+             text = self._call_via_rest(image_path, endpoint, api_key)
+
+         text_path = (
+             image_path.parent / f"{image_path.stem}.{self.name}.txt"
+         )
+         text_path.write_text(text, encoding="utf-8")
+
+         return {
+             ArtifactType.RAW_TEXT: Artifact(
+                 id=f"{context.document_id}:{self.name}:raw_text",
+                 document_id=context.document_id,
+                 type=ArtifactType.RAW_TEXT,
+                 produced_by_step="ocr",
+                 uri=str(text_path),
+             ),
+         }
+
+     # ──────────────────────────────────────────────────────────────
+     # SDK
+     # ──────────────────────────────────────────────────────────────
+
+     def _call_via_sdk(
+         self, image_path: Path, endpoint: str, api_key: str,
+     ) -> str:
+         try:
+             from azure.ai.documentintelligence import (
+                 DocumentIntelligenceClient,
+             )
+             from azure.core.credentials import AzureKeyCredential
+         except ImportError as exc:
+             raise _SDKMissing() from exc
+
+         try:
+             client = DocumentIntelligenceClient(
+                 endpoint=endpoint,
+                 credential=AzureKeyCredential(api_key),
+             )
+             with open(image_path, "rb") as f:
+                 poller = client.begin_analyze_document(
+                     model_id=self._model_id,
+                     body=f,
+                     locale=self._locale,
+                     content_type="application/octet-stream",
+                 )
+             result = poller.result()
+             text = "\n".join(
+                 line.content
+                 for page in result.pages
+                 for line in (page.lines or [])
+             )
+         except _SDKMissing:
+             raise
+         except Exception as exc:
+             raise OCRAdapterError(
+                 f"{self.name} : SDK Azure a levé : "
+                 f"{type(exc).__name__}: {exc}",
+             ) from exc
+         return text
+
+     # ──────────────────────────────────────────────────────────────
+     # REST with polling
+     # ──────────────────────────────────────────────────────────────
+
+     def _call_via_rest(
+         self, image_path: Path, endpoint: str, api_key: str,
+     ) -> str:
+         image_bytes = image_path.read_bytes()
+         analyze_url = (
+             f"{endpoint}/documentintelligence/documentModels/"
+             f"{self._model_id}:analyze"
+             f"?api-version={self._api_version}&locale={self._locale}"
+         )
+         req = urllib.request.Request(
+             analyze_url,
+             data=image_bytes,
+             headers={
+                 "Ocp-Apim-Subscription-Key": api_key,
+                 "Content-Type": "application/octet-stream",
+             },
+         )
+         try:
+             with urllib.request.urlopen(req, timeout=self._timeout) as resp:
+                 operation_url = resp.headers.get("Operation-Location", "")
+         except urllib.error.HTTPError as exc:
+             body = ""
+             try:
+                 body = exc.read().decode("utf-8")
+             except Exception:  # noqa: BLE001
+                 pass
+             raise OCRAdapterError(
+                 f"{self.name} : Azure Document Intelligence erreur "
+                 f"{exc.code} : {body}",
+             ) from exc
+         except Exception as exc:
+             raise OCRAdapterError(
+                 f"{self.name} : erreur API Azure : "
+                 f"{type(exc).__name__}: {exc}",
+             ) from exc
+
+         if not operation_url:
+             raise OCRAdapterError(
+                 f"{self.name} : Azure n'a pas retourné Operation-Location.",
+             )
+
+         # Poll for the result (Azure is asynchronous).
+         headers = {"Ocp-Apim-Subscription-Key": api_key}
+         for attempt in range(self._max_polling_attempts):
+             time.sleep(self._polling_base + attempt * 0.5)
+             poll_req = urllib.request.Request(operation_url, headers=headers)
+             try:
+                 with urllib.request.urlopen(
+                     poll_req, timeout=self._timeout,
+                 ) as resp:
+                     result = json.loads(resp.read().decode("utf-8"))
+             except Exception as exc:
+                 raise OCRAdapterError(
+                     f"{self.name} : erreur de polling Azure : "
+                     f"{type(exc).__name__}: {exc}",
+                 ) from exc
+             status = result.get("status", "")
+             if status == "succeeded":
+                 return self._extract_text_from_rest_result(result)
+             if status in {"failed", "canceled"}:
+                 raise OCRAdapterError(
+                     f"{self.name} : analyse Azure {status} : "
+                     f"{result.get('error', {})}",
+                 )
+             # running → continue
+         raise OCRAdapterError(
+             f"{self.name} : timeout polling Azure après "
+             f"{self._max_polling_attempts} tentatives.",
+         )
+
+     @staticmethod
+     def _extract_text_from_rest_result(result: dict) -> str:
+         pages = result.get("analyzeResult", {}).get("pages", [])
+         lines: list[str] = []
+         for page in pages:
+             for line in page.get("lines", []):
+                 content = line.get("content", "")
+                 if content:
+                     lines.append(content)
+         return "\n".join(lines)
+
+
+ class _SDKMissing(Exception):
+     """Internal sentinel signaling that the Azure SDK is not
+     installed. Caught by ``execute`` to fall back to REST.
+
+     Never leaks to the caller; it is an implementation detail.
+     """
+
+
+ __all__ = ["AzureDocIntelAdapter"]
tests/adapters/ocr/test_sprint_a14_s34_azure_doc_intel_adapter.py ADDED
@@ -0,0 +1,536 @@
+ """Sprint A14-S34: ``AzureDocIntelAdapter`` native to the S26 contract."""
+
+ from __future__ import annotations
+
+ import json
+ import sys
+ from pathlib import Path
+ from unittest.mock import MagicMock, patch
+
+ import pytest
+
+ from picarones.adapters.ocr import (
+     AzureDocIntelAdapter,
+     BaseOCRAdapter,
+     OCRAdapterError,
+ )
+ from picarones.domain.artifacts import Artifact, ArtifactType
+ from picarones.pipeline.types import RunContext
+
+
+ def _make_image_artifact(uri: str) -> Artifact:
+     return Artifact(
+         id="d1:img",
+         document_id="d1",
+         type=ArtifactType.IMAGE,
+         uri=uri,
+     )
+
+
+ def _make_context() -> RunContext:
+     return RunContext(
+         document_id="d1",
+         code_version="1.0.0",
+         pipeline_name="test",
+     )
+
+
+ def _make_dummy_image(tmp_path: Path) -> Path:
+     path = tmp_path / "page.png"
+     path.write_bytes(b"PNG_FAKE_BYTES")
+     return path
+
+
+ # ──────────────────────────────────────────────────────────────────────
+ # Constructor
+ # ──────────────────────────────────────────────────────────────────────
+
+
+ class TestAzureDocIntelConstructor:
+     def test_defaults(self) -> None:
+         adapter = AzureDocIntelAdapter()
+         assert adapter.name == "azure_doc_intel"
+         assert adapter.model_id == "prebuilt-read"
+
+     def test_custom_name(self) -> None:
+         adapter = AzureDocIntelAdapter(name="my_azure")
+         assert adapter.name == "my_azure"
+
+     def test_custom_model_id(self) -> None:
+         adapter = AzureDocIntelAdapter(model_id="prebuilt-document")
+         assert adapter.model_id == "prebuilt-document"
+
+     def test_rejects_empty_name(self) -> None:
+         with pytest.raises(OCRAdapterError, match="vide"):
+             AzureDocIntelAdapter(name="")
+
+     def test_rejects_invalid_chars_in_name(self) -> None:
+         with pytest.raises(OCRAdapterError, match="invalide"):
+             AzureDocIntelAdapter(name="bad name")
+
+     def test_rejects_non_positive_timeout(self) -> None:
+         with pytest.raises(OCRAdapterError, match="timeout_seconds"):
+             AzureDocIntelAdapter(timeout_seconds=0)
+
+     def test_rejects_non_positive_max_polling(self) -> None:
+         with pytest.raises(OCRAdapterError, match="max_polling_attempts"):
+             AzureDocIntelAdapter(max_polling_attempts=0)
+
+     def test_rejects_negative_polling_interval(self) -> None:
+         with pytest.raises(OCRAdapterError, match="polling_interval_base"):
+             AzureDocIntelAdapter(polling_interval_base=-1.0)
+
+
+ # ──────────────────────────────────────────────────────────────────────
+ # BaseOCRAdapter contract
+ # ──────────────────────────────────────────────────────────────────────
+
+
+ class TestAzureDocIntelContract:
+     def test_inherits_base_adapter(self) -> None:
+         adapter = AzureDocIntelAdapter()
+         assert isinstance(adapter, BaseOCRAdapter)
+
+     def test_input_types(self) -> None:
+         assert AzureDocIntelAdapter.input_types == frozenset(
+             {ArtifactType.IMAGE},
+         )
+
+     def test_output_types(self) -> None:
+         assert AzureDocIntelAdapter.output_types == frozenset(
+             {ArtifactType.RAW_TEXT},
+         )
+
+     def test_execution_mode_is_io(self) -> None:
+         assert AzureDocIntelAdapter.execution_mode == "io"
+
+
+ # ──────────────────────────────────────────────────────────────────────
+ # Auth resolution
+ # ──────────────────────────────────────────────────────────────────────
+
+
+ class TestAzureDocIntelAuth:
+     def test_explicit_api_key_takes_priority(self) -> None:
+         adapter = AzureDocIntelAdapter(api_key="explicit")
+         with patch.dict("os.environ", {"AZURE_DOC_INTEL_KEY": "env"}):
+             assert adapter._resolve_api_key() == "explicit"
+
+     def test_env_api_key_fallback(self) -> None:
+         adapter = AzureDocIntelAdapter()
+         with patch.dict("os.environ", {"AZURE_DOC_INTEL_KEY": "env_key"}):
+             assert adapter._resolve_api_key() == "env_key"
+
+     def test_no_api_key_raises(self) -> None:
+         adapter = AzureDocIntelAdapter()
+         with patch.dict("os.environ", {}, clear=True):
+             with pytest.raises(OCRAdapterError, match="AZURE_DOC_INTEL_KEY"):
+                 adapter._resolve_api_key()
+
+     def test_explicit_endpoint_takes_priority(self) -> None:
+         adapter = AzureDocIntelAdapter(endpoint="https://explicit.azure.com")
+         with patch.dict(
+             "os.environ", {"AZURE_DOC_INTEL_ENDPOINT": "https://env.azure.com"},
+         ):
+             assert adapter._resolve_endpoint() == "https://explicit.azure.com"
+
+     def test_env_endpoint_fallback(self) -> None:
+         adapter = AzureDocIntelAdapter()
+         with patch.dict(
+             "os.environ", {"AZURE_DOC_INTEL_ENDPOINT": "https://env.azure.com/"},
+         ):
+             # Note: .rstrip("/") strips the trailing slash.
+             assert adapter._resolve_endpoint() == "https://env.azure.com"
+
+     def test_no_endpoint_raises(self) -> None:
+         adapter = AzureDocIntelAdapter()
+         with patch.dict("os.environ", {}, clear=True):
+             with pytest.raises(
+                 OCRAdapterError, match="AZURE_DOC_INTEL_ENDPOINT",
+             ):
+                 adapter._resolve_endpoint()
+
+
+ # ──────────────────────────────────────────────────────────────────────
+ # Input validation
+ # ──────────────────────────────────────────────────────────────────────
+
+
+ class TestAzureDocIntelInputValidation:
+     def test_missing_image_input_raises(self) -> None:
+         adapter = AzureDocIntelAdapter(
+             api_key="x", endpoint="https://test.azure.com",
+         )
+         with pytest.raises(OCRAdapterError, match="IMAGE manquant"):
+             adapter.execute(inputs={}, params={}, context=_make_context())
+
+     def test_image_artifact_without_uri_raises(self) -> None:
+         adapter = AzureDocIntelAdapter(
+             api_key="x", endpoint="https://test.azure.com",
+         )
+         artifact = Artifact(
+             id="d1:img",
+             document_id="d1",
+             type=ArtifactType.IMAGE,
+             uri=None,
+         )
+         with pytest.raises(OCRAdapterError, match="sans URI"):
+             adapter.execute(
+                 inputs={ArtifactType.IMAGE: artifact},
+                 params={},
+                 context=_make_context(),
+             )
+
+     def test_image_path_does_not_exist_raises(self) -> None:
+         adapter = AzureDocIntelAdapter(
+             api_key="x", endpoint="https://test.azure.com",
+         )
+         artifact = _make_image_artifact("/nonexistent/img.png")
+         with pytest.raises(OCRAdapterError, match="introuvable"):
+             adapter.execute(
+                 inputs={ArtifactType.IMAGE: artifact},
+                 params={},
+                 context=_make_context(),
+             )
+
+
+ # ──────────────────────────────────────────────────────────────────────
+ # REST path
+ # ──────────────────────────────────────────────────────────────────────
+
+
+ class TestAzureDocIntelREST:
+     def _patch_no_sdk(self):
+         """Mock the Azure SDK as absent → REST fallback."""
+         return patch.dict(sys.modules, {
+             "azure.ai.documentintelligence": None,
+             "azure.core.credentials": None,
+         })
+
+     def _make_initial_response(self):
+         """Mock the initial POST response returning Operation-Location."""
+         mock_resp = MagicMock()
+         mock_resp.headers = {"Operation-Location": "https://op-status-url"}
+         mock_resp.__enter__.return_value = mock_resp
+         return mock_resp
+
+     def _make_polling_response(self, status: str, text_lines: list[str] | None = None):
+         """Mock a polling response with the given status."""
+         result = {"status": status}
+         if status == "succeeded":
+             result["analyzeResult"] = {
+                 "pages": [{
+                     "lines": [{"content": line} for line in (text_lines or [])],
+                 }],
+             }
+         mock_resp = MagicMock()
+         mock_resp.read.return_value = json.dumps(result).encode("utf-8")
+         mock_resp.__enter__.return_value = mock_resp
+         return mock_resp
+
+     def test_succeeded_returns_text(self, tmp_path: Path) -> None:
+         adapter = AzureDocIntelAdapter(
+             api_key="k", endpoint="https://e.azure.com",
+             polling_interval_base=0,  # no sleep in tests
+         )
+         image_path = _make_dummy_image(tmp_path)
+         artifact = _make_image_artifact(str(image_path))
+
+         initial = self._make_initial_response()
+         succeeded = self._make_polling_response(
+             "succeeded", text_lines=["Ligne 1", "Ligne 2"],
+         )
+
+         with self._patch_no_sdk(), patch(
+             "urllib.request.urlopen",
+             side_effect=[initial, succeeded],
+         ):
+             result = adapter.execute(
+                 inputs={ArtifactType.IMAGE: artifact},
+                 params={},
+                 context=_make_context(),
+             )
+         out_text = Path(result[ArtifactType.RAW_TEXT].uri).read_text(
+             encoding="utf-8",
+         )
+         assert out_text == "Ligne 1\nLigne 2"
+
+     def test_running_then_succeeded(self, tmp_path: Path) -> None:
+         adapter = AzureDocIntelAdapter(
+             api_key="k", endpoint="https://e.azure.com",
+             polling_interval_base=0,
+         )
+         image_path = _make_dummy_image(tmp_path)
+         artifact = _make_image_artifact(str(image_path))
+
+         with self._patch_no_sdk(), patch(
+             "urllib.request.urlopen",
+             side_effect=[
+                 self._make_initial_response(),
+                 self._make_polling_response("running"),
+                 self._make_polling_response("running"),
+                 self._make_polling_response(
+                     "succeeded", text_lines=["Done"],
+                 ),
+             ],
+         ):
+             result = adapter.execute(
+                 inputs={ArtifactType.IMAGE: artifact},
+                 params={},
+                 context=_make_context(),
+             )
+         out_text = Path(result[ArtifactType.RAW_TEXT].uri).read_text(
+             encoding="utf-8",
+         )
+         assert out_text == "Done"
+
+     def test_failed_status_raises(self, tmp_path: Path) -> None:
+         adapter = AzureDocIntelAdapter(
+             api_key="k", endpoint="https://e.azure.com",
+             polling_interval_base=0,
+         )
+         image_path = _make_dummy_image(tmp_path)
+         artifact = _make_image_artifact(str(image_path))
+
+         with self._patch_no_sdk(), patch(
+             "urllib.request.urlopen",
+             side_effect=[
+                 self._make_initial_response(),
+                 self._make_polling_response("failed"),
+             ],
+         ):
+             with pytest.raises(OCRAdapterError, match="failed"):
+                 adapter.execute(
+                     inputs={ArtifactType.IMAGE: artifact},
+                     params={},
+                     context=_make_context(),
+                 )
+
+     def test_canceled_status_raises(self, tmp_path: Path) -> None:
+         adapter = AzureDocIntelAdapter(
+             api_key="k", endpoint="https://e.azure.com",
+             polling_interval_base=0,
+         )
+         image_path = _make_dummy_image(tmp_path)
+         artifact = _make_image_artifact(str(image_path))
+
+         with self._patch_no_sdk(), patch(
+             "urllib.request.urlopen",
+             side_effect=[
+                 self._make_initial_response(),
+                 self._make_polling_response("canceled"),
+             ],
+         ):
+             with pytest.raises(OCRAdapterError, match="canceled"):
+                 adapter.execute(
+                     inputs={ArtifactType.IMAGE: artifact},
+                     params={},
+                     context=_make_context(),
+                 )
+
+     def test_polling_timeout_raises(self, tmp_path: Path) -> None:
+         adapter = AzureDocIntelAdapter(
+             api_key="k", endpoint="https://e.azure.com",
+             polling_interval_base=0,
+             max_polling_attempts=2,
+         )
+         image_path = _make_dummy_image(tmp_path)
+         artifact = _make_image_artifact(str(image_path))
+
+         with self._patch_no_sdk(), patch(
+             "urllib.request.urlopen",
+             side_effect=[
+                 self._make_initial_response(),
+                 self._make_polling_response("running"),
+                 self._make_polling_response("running"),
+             ],
+         ):
+             with pytest.raises(OCRAdapterError, match="timeout polling"):
+                 adapter.execute(
+                     inputs={ArtifactType.IMAGE: artifact},
+                     params={},
+                     context=_make_context(),
+                 )
+
+     def test_no_operation_location_raises(self, tmp_path: Path) -> None:
+         adapter = AzureDocIntelAdapter(
+             api_key="k", endpoint="https://e.azure.com",
+             polling_interval_base=0,
+         )
+         image_path = _make_dummy_image(tmp_path)
+         artifact = _make_image_artifact(str(image_path))
+
+         # Initial POST without Operation-Location.
+         bad_initial = MagicMock()
+         bad_initial.headers = {}
+         bad_initial.__enter__.return_value = bad_initial
+
+         with self._patch_no_sdk(), patch(
+             "urllib.request.urlopen",
+             side_effect=[bad_initial],
+         ):
+             with pytest.raises(OCRAdapterError, match="Operation-Location"):
+                 adapter.execute(
+                     inputs={ArtifactType.IMAGE: artifact},
+                     params={},
+                     context=_make_context(),
+                 )
+
+     def test_writes_to_stem_name_pattern(self, tmp_path: Path) -> None:
+         adapter = AzureDocIntelAdapter(
+             api_key="k", endpoint="https://e.azure.com",
+             polling_interval_base=0,
+             name="my_azure",
+         )
+         image_path = _make_dummy_image(tmp_path)
+         artifact = _make_image_artifact(str(image_path))
+
+         with self._patch_no_sdk(), patch(
+             "urllib.request.urlopen",
+             side_effect=[
+                 self._make_initial_response(),
+                 self._make_polling_response("succeeded", text_lines=["x"]),
+             ],
+         ):
+             result = adapter.execute(
+                 inputs={ArtifactType.IMAGE: artifact},
+                 params={},
+                 context=_make_context(),
+             )
+         out_path = Path(result[ArtifactType.RAW_TEXT].uri)
+         assert out_path.name == "page.my_azure.txt"
+
403
+
404
+ # ──────────────────────────────────────────────────────────────────────
405
+ # SDK path
406
+ # ──────────────────────────────────────────────────────────────────────
407
+
408
+
409
+class TestAzureDocIntelSDK:
+    def test_sdk_call_succeeds(self, tmp_path: Path) -> None:
+        adapter = AzureDocIntelAdapter(
+            api_key="k", endpoint="https://e.azure.com",
+        )
+        image_path = _make_dummy_image(tmp_path)
+        artifact = _make_image_artifact(str(image_path))
+
+        # Mock the SDK result, exposing pages.lines.content.
+        mock_line_a = MagicMock()
+        mock_line_a.content = "Ligne A"
+        mock_line_b = MagicMock()
+        mock_line_b.content = "Ligne B"
+        mock_page = MagicMock()
+        mock_page.lines = [mock_line_a, mock_line_b]
+        mock_result = MagicMock()
+        mock_result.pages = [mock_page]
+
+        mock_poller = MagicMock()
+        mock_poller.result.return_value = mock_result
+        mock_client = MagicMock()
+        mock_client.begin_analyze_document.return_value = mock_poller
+
+        fake_di_module = MagicMock()
+        fake_di_module.DocumentIntelligenceClient = MagicMock(
+            return_value=mock_client,
+        )
+        fake_creds_module = MagicMock()
+        fake_creds_module.AzureKeyCredential = MagicMock(return_value="creds")
+
+        with patch.dict(sys.modules, {
+            "azure": MagicMock(),
+            "azure.ai": MagicMock(),
+            "azure.ai.documentintelligence": fake_di_module,
+            "azure.core": MagicMock(),
+            "azure.core.credentials": fake_creds_module,
+        }):
+            result = adapter.execute(
+                inputs={ArtifactType.IMAGE: artifact},
+                params={},
+                context=_make_context(),
+            )
+        out_text = Path(result[ArtifactType.RAW_TEXT].uri).read_text(
+            encoding="utf-8",
+        )
+        assert out_text == "Ligne A\nLigne B"
+
+    def test_sdk_internal_error_wrapped(self, tmp_path: Path) -> None:
+        adapter = AzureDocIntelAdapter(
+            api_key="k", endpoint="https://e.azure.com",
+        )
+        image_path = _make_dummy_image(tmp_path)
+        artifact = _make_image_artifact(str(image_path))
+
+        mock_client = MagicMock()
+        mock_client.begin_analyze_document.side_effect = RuntimeError(
+            "Azure boom",
+        )
+
+        fake_di_module = MagicMock()
+        fake_di_module.DocumentIntelligenceClient = MagicMock(
+            return_value=mock_client,
+        )
+        fake_creds_module = MagicMock()
+        fake_creds_module.AzureKeyCredential = MagicMock(return_value="creds")
+
+        with patch.dict(sys.modules, {
+            "azure": MagicMock(),
+            "azure.ai": MagicMock(),
+            "azure.ai.documentintelligence": fake_di_module,
+            "azure.core": MagicMock(),
+            "azure.core.credentials": fake_creds_module,
+        }):
+            with pytest.raises(OCRAdapterError, match="RuntimeError.*Azure boom"):
+                adapter.execute(
+                    inputs={ArtifactType.IMAGE: artifact},
+                    params={},
+                    context=_make_context(),
+                )
+
+
+# ──────────────────────────────────────────────────────────────────────
+# Artifact ID
+# ──────────────────────────────────────────────────────────────────────
+
+
+class TestAzureDocIntelArtifactID:
+    def test_artifact_id_uses_adapter_name(self, tmp_path: Path) -> None:
+        adapter = AzureDocIntelAdapter(
+            api_key="k", endpoint="https://e.azure.com",
+            polling_interval_base=0,
+            name="custom_az",
+        )
+        image_path = _make_dummy_image(tmp_path)
+        artifact = _make_image_artifact(str(image_path))
+
+        mock_resp_initial = MagicMock()
+        mock_resp_initial.headers = {"Operation-Location": "https://op"}
+        mock_resp_initial.__enter__.return_value = mock_resp_initial
+
+        result_payload = {
+            "status": "succeeded",
+            "analyzeResult": {
+                "pages": [{"lines": [{"content": "x"}]}],
+            },
+        }
+        mock_resp_polling = MagicMock()
+        mock_resp_polling.read.return_value = json.dumps(
+            result_payload,
+        ).encode("utf-8")
+        mock_resp_polling.__enter__.return_value = mock_resp_polling
+
+        with patch.dict(sys.modules, {
+            "azure.ai.documentintelligence": None,
+            "azure.core.credentials": None,
+        }), patch(
+            "urllib.request.urlopen",
+            side_effect=[mock_resp_initial, mock_resp_polling],
+        ):
+            result = adapter.execute(
+                inputs={ArtifactType.IMAGE: artifact},
+                params={},
+                context=_make_context(),
+            )
+        produced = result[ArtifactType.RAW_TEXT]
+        assert produced.id == "d1:custom_az:raw_text"
+        assert produced.document_id == "d1"
+        assert produced.produced_by_step == "ocr"