fix(security): Phase 1 — SSRF eScriptorium + Tesseract lang + bandit nosec
Three security hardening measures identified by the code-quality audit.
**1.1 Residual SSRF in eScriptorium**
In ``adapters/corpus/escriptorium.py``, ``_get``, ``_post`` and the
``urllib.request.urlretrieve(part.image_url)`` call on line 410 did
not go through ``validate_http_url`` (which already exists in
``_http.py`` and is used by IIIF/Gallica/HTR-United). A
manifest pointing to ``image_url=http://169.254.169.254/...``
could exfiltrate cloud metadata, ``http://127.0.0.1:6379/...``
could talk to the local Redis, and ``http://10.x/...`` could reach RFC 1918 services.
- ``_get`` and ``_post`` now call ``validate_http_url``
before building the ``Request``.
- The image download uses ``download_url`` (the ``_http`` helper
with retry + validation) instead of ``urlretrieve``.
- This is consistent with the other corpus importers.
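The kind of check described above can be sketched as follows. This is a minimal illustration, not the project's ``validate_http_url``: the function name ``sketch_validate_http_url`` and the ``BLOCKED_HOSTNAMES`` set are assumptions, and the real helper in ``_http.py`` may apply different or stricter rules.

```python
import ipaddress
from urllib.parse import urlparse

# Hypothetical blocklist for hostnames that never parse as IP literals.
BLOCKED_HOSTNAMES = {"localhost", "metadata.google.internal"}


def sketch_validate_http_url(url: str) -> str:
    """Return ``url`` unchanged, or raise ValueError if it looks internal."""
    parsed = urlparse(url)
    # Only plain web schemes are allowed; file://, ftp://, etc. are refused.
    if parsed.scheme not in ("http", "https"):
        raise ValueError(f"scheme refused: {parsed.scheme!r}")
    host = (parsed.hostname or "").lower()
    if not host or host in BLOCKED_HOSTNAMES:
        raise ValueError(f"host refused: {host!r}")
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal. A production check would also resolve the
        # name and re-test every returned address (DNS rebinding).
        return url
    # Loopback, link-local (cloud metadata), RFC 1918 and 0.0.0.0.
    if ip.is_loopback or ip.is_link_local or ip.is_private or ip.is_unspecified:
        raise ValueError(f"internal address refused: {ip}")
    return url
```

Running the check before any request object is built means a refused URL never reaches the network layer, which is the property the new tests assert with ``mock_urlopen.assert_not_called()``.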
**1.2 Tesseract command-line injection**
``adapters/ocr/tesseract.py:__init__`` performed no validation
on ``lang``, which ends up concatenated into ``tesseract -l <lang>``. A
programmatic caller passing ``lang="fra --user-words
/etc/passwd"`` could read an arbitrary file (the ``--user-words`` flag
is honored by Tesseract).
Added a regex ``^[a-zA-Z]{3,}(\+[a-zA-Z]{3,})*$`` that accepts
ISO 639-3 codes (``fra``, ``eng``) and script names (``Latin``), optionally
combined with ``+`` (``fra+eng``), and rejects any exploitable
character (spaces, ``--``, ``/``, ``;``, ``|``, backticks, ``$``,
newlines, etc.).
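The behavior of that pattern can be checked in isolation. The sketch below reuses the regex from this commit; the helper name ``is_valid_lang`` is hypothetical and only exists for the demonstration.

```python
import re

# Same pattern as the commit: 3+ ASCII letters, optionally joined by "+".
LANG_RE = re.compile(r"^[a-zA-Z]{3,}(?:\+[a-zA-Z]{3,})*$")


def is_valid_lang(lang: str) -> bool:
    """Hypothetical helper: True if ``lang`` is safe to pass to ``-l``."""
    # fullmatch (rather than match) also rejects a trailing newline,
    # which ``$`` alone would tolerate.
    return LANG_RE.fullmatch(lang) is not None


for candidate in ["fra", "fra+eng", "Latin", "fra --user-words /etc/passwd", "fr", "fra+"]:
    print(f"{candidate!r}: {is_valid_lang(candidate)}")
```

An allowlist of this shape is stricter than trying to enumerate dangerous characters: anything that is not a letter or a single ``+`` separator is refused, including characters nobody thought to blacklist.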
**1.3 bandit B608 false positives**
``interfaces/web/jobs.py:235`` and
``evaluation/metrics/history.py:341`` build SQL queries
with f-strings, but the interpolated ``fields``/``clauses`` are
internal literals (``"status = ?"``, ``"engine_name = ?"``...);
the *values* all use ``?`` placeholders. No
exploitable SQLi. Documented with ``# nosec B608`` plus an explanatory
comment.
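The pattern that bandit flags here can be illustrated with a small self-contained example. The function name ``query_runs`` and the three-column table schema are assumptions made for the sketch; only the ``runs`` table name, the clause literals and the query shape come from the diff.

```python
import sqlite3


def query_runs(conn, engine=None, corpus=None, limit=10):
    """Build the WHERE clause from fixed literals; bind every value."""
    clauses = []
    params = []
    if engine is not None:
        clauses.append("engine_name = ?")  # literal, never user text
        params.append(engine)
    if corpus is not None:
        clauses.append("corpus_name = ?")
        params.append(corpus)
    where = ("WHERE " + " AND ".join(clauses)) if clauses else ""
    params.append(limit)
    # The f-string only interpolates the literals above, hence the nosec.
    sql = f"SELECT * FROM runs {where} ORDER BY timestamp ASC LIMIT ?"  # nosec B608
    return conn.execute(sql, params).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE runs (engine_name TEXT, corpus_name TEXT, timestamp REAL)")
conn.executemany(
    "INSERT INTO runs VALUES (?, ?, ?)",
    [("tesseract", "bnf", 1.0), ("pero", "bnf", 2.0)],
)
print(query_runs(conn, engine="tesseract"))
print(query_runs(conn, engine="x' OR '1'='1"))
```

Because the hostile string is bound as a value, it is compared literally against ``engine_name`` and matches no row. The suppression stays sound only as long as no caller-supplied text is ever appended to ``clauses``.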
**Tests added**
- ``tests/security/test_escriptorium_ssrf.py`` (14 tests: 8 internal
IPs blocked on ``_get``, 3 on ``_post``, 1 on ``download_url``,
import guard).
- ``tests/adapters/ocr/test_tesseract_lang_validation.py`` (30
tests: 9 valid langs + 20 blocked injections + default).
Bandit re-scan: 2 MEDIUM issues (B608) → 0 MEDIUM, 2 explicitly
skipped via ``# nosec``.
**Counters**
The ``test_claude_md_count_close_to_reality`` test failed after
47 tests were added (±50 tolerance exceeded). Regenerated
CLAUDE.md + README.md via ``scripts/gen_readme_tables.py``, which
foreshadows Phase 2.1 of the plan (orphan script to be wired into CI).
Test suite: 4,731 passed, 16 skipped, 8 deselected, 2 xfailed.
Ruff clean, bandit clean (1 harmless residual LOW).
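A drift check of this kind can be sketched as follows. The function names and the regular expression are assumptions for illustration; the actual ``test_claude_md_count_close_to_reality`` implementation is not part of this diff and may work differently.

```python
import re


def documented_passed_count(markdown: str) -> int:
    """Extract the first '<N> passed' figure from a documentation file."""
    match = re.search(r"(\d+) passed", markdown)
    if match is None:
        raise ValueError("no '<N> passed' figure found")
    return int(match.group(1))


def count_is_close(documented: int, actual: int, tolerance: int = 50) -> bool:
    """True when the documented count is within ±tolerance of reality."""
    return abs(documented - actual) <= tolerance


line = "`pytest tests/` → **4750 passed, 12 skipped, 8 deselected, 0 failed**"
print(count_is_close(documented_passed_count(line), actual=4731))
```

A tolerance band keeps the documentation from failing CI on every new test while still catching the slow drift that triggered the regeneration here.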
- CLAUDE.md +2 -2
- README.md +1 -1
- picarones/adapters/corpus/escriptorium.py +23 -5
- picarones/adapters/ocr/tesseract.py +18 -0
- picarones/evaluation/metrics/history.py +6 -1
- picarones/interfaces/web/jobs.py +6 -1
- tests/adapters/ocr/test_tesseract_lang_validation.py +92 -0
- tests/security/test_escriptorium_ssrf.py +135 -0
CLAUDE.md
@@ -116,7 +116,7 @@ picarones/
 
 ## Test status and historical bugs
 
-`pytest tests/` → **
+`pytest tests/` → **4750 passed, 12 skipped, 8 deselected, 0 failed**
 (post-S59). The deselected tests are the `live` markers (5 integration tests
 against a real API/binary) + `network` (3 tests that hit the real network),
 opt-in locally via `pytest -m live` or `pytest -m network`. The
@@ -302,7 +302,7 @@ detects, arbitrates, renders.
 ## Development context
 
 - **Environment**: GitHub Codespaces, Python 3.11+
-- **Tests**: `pytest tests/ -q` →
+- **Tests**: `pytest tests/ -q` → 4750 passed, 9 skipped, 24
 deselected, 0 failed (post-v2.0).
 - **Architecture manifest**: [`docs/explanation/architecture.md`](docs/explanation/architecture.md).
 - **Stable public API**: [`docs/reference/api-stable.md`](docs/reference/api-stable.md).
README.md
@@ -397,7 +397,7 @@ ruff check picarones/ tests/
 python -m mypy picarones/core/
 ```
 
-**Test suite**: ~
+**Test suite**: ~4750 tests, ~3 min on a modern laptop. Coverage
 floor at 85% (currently ~87%). The `network` marker excludes tests
 requiring live HTTP. A handful of tests depend on optional engines
 (`pero-ocr`, `pytesseract`) and are skipped/fail gracefully when
picarones/adapters/corpus/escriptorium.py
@@ -54,6 +54,7 @@ warnings.warn(
 )
 
 
+from picarones.adapters.corpus._http import download_url, validate_http_url
 from picarones.evaluation.corpus import Corpus, Document
 
 if TYPE_CHECKING:
@@ -162,9 +163,15 @@ class EScriptoriumClient:
         url = f"{self.base_url}/api/{path.lstrip('/')}"
         if params:
             url += "?" + urllib.parse.urlencode(params)
+        # Anti-SSRF: refuse loopback, link-local, RFC 1918, cloud metadata.
+        # Consistent with IIIF/Gallica/HTR-United, which go through _http.
+        try:
+            validate_http_url(url)
+        except ValueError as exc:
+            raise RuntimeError(str(exc)) from exc
         req = urllib.request.Request(url, headers=self._headers())
         try:
-            with urllib.request.urlopen(req, timeout=self.timeout) as resp:
+            with urllib.request.urlopen(req, timeout=self.timeout) as resp:  # noqa: S310
                 return json.loads(resp.read().decode("utf-8"))
         except urllib.error.HTTPError as exc:
             raise RuntimeError(
@@ -178,12 +185,17 @@ class EScriptoriumClient:
     def _post(self, path: str, payload: dict) -> dict:
         """Perform a POST request with a JSON payload."""
         url = f"{self.base_url}/api/{path.lstrip('/')}"
+        # Anti-SSRF: see _get.
+        try:
+            validate_http_url(url)
+        except ValueError as exc:
+            raise RuntimeError(str(exc)) from exc
         data = json.dumps(payload).encode("utf-8")
         req = urllib.request.Request(
             url, data=data, headers=self._headers(), method="POST"
         )
         try:
-            with urllib.request.urlopen(req, timeout=self.timeout) as resp:
+            with urllib.request.urlopen(req, timeout=self.timeout) as resp:  # noqa: S310
                 body = resp.read().decode("utf-8")
                 return json.loads(body) if body else {}
         except urllib.error.HTTPError as exc:
@@ -406,11 +418,17 @@ class EScriptoriumClient:
             if out_path and part.image_url and download_images:
                 ext = Path(urllib.parse.urlparse(part.image_url).path).suffix or ".jpg"
                 local_img = out_path / f"part_{part.pk:05d}{ext}"
+                # Anti-SSRF + exponential retry: use download_url rather
+                # than urlretrieve, which does not validate the URL.
                 try:
-
+                    image_bytes = download_url(part.image_url)
+                    local_img.write_bytes(image_bytes)
                     image_path = str(local_img)
-                except
-                    logger.warning(
+                except (ValueError, RuntimeError) as exc:
+                    logger.warning(
+                        "[escriptorium] Impossible de télécharger l'image %s : %s",
+                        part.image_url, exc,
+                    )
 
             # Save the GT
             gt_path = out_path / f"part_{part.pk:05d}.gt.txt"
picarones/adapters/ocr/tesseract.py
@@ -56,6 +56,7 @@ Anti-over-engineering
 
 from __future__ import annotations
 
+import re
 from pathlib import Path
 from typing import Any
 
@@ -63,6 +64,14 @@ from picarones.adapters.ocr.base import BaseOCRAdapter, OCRAdapterError
 from picarones.adapters.output_paths import resolve_output_path
 from picarones.domain.artifacts import Artifact, ArtifactType
 
+#: Accepted Tesseract language codes: ISO 639-3 (3 ASCII letters),
+#: optionally combined with ``+`` (e.g. ``"fra+eng"``). Since ``lang``
+#: is ultimately passed to the Tesseract command line via
+#: pytesseract, we reject any character that could be interpreted
+#: as a flag or a separator (spaces, ``--``, ``/``, etc.).
+#: Phase 1.2 of the code-quality audit (2026-05).
+_TESSERACT_LANG_RE = re.compile(r"^[a-zA-Z]{3,}(?:\+[a-zA-Z]{3,})*$")
+
 
 class TesseractAdapter(BaseOCRAdapter):
     """Tesseract 5 adapter native to the new contract (S26).
@@ -123,6 +132,15 @@ class TesseractAdapter(BaseOCRAdapter):
                 f"TesseractAdapter : name invalide {name!r} — "
                 "alphanumérique + _ - uniquement.",
             )
+        # Anti-injection for the Tesseract command line: reject
+        # spaces, ``--user-words``, ``/``, etc. ``lang`` ends up
+        # concatenated into ``tesseract -l <lang>``.
+        if not _TESSERACT_LANG_RE.fullmatch(lang):
+            raise OCRAdapterError(
+                f"TesseractAdapter : lang invalide {lang!r} — "
+                "format attendu : code ISO 639-3 (3+ lettres ASCII), "
+                "optionnellement combiné via ``+`` (ex. ``fra+eng``).",
+            )
         if not 0 <= psm <= 13:
             raise OCRAdapterError(
                 f"TesseractAdapter : psm doit être ∈ [0, 13], reçu {psm}.",
picarones/evaluation/metrics/history.py
@@ -337,8 +337,13 @@ class BenchmarkHistory:
         params.append(limit)
 
         conn = self._connect()
+        # bandit B608 false positive: ``clauses`` is built from
+        # internal literals (``"engine_name = ?"``, ``"corpus_name = ?"``,
+        # ``"timestamp >= ?"``), so no user input is
+        # concatenated into the query. The *values* (engine, corpus,
+        # since, limit) go through ``?`` placeholders.
         rows = conn.execute(
-            f"SELECT * FROM runs {where} ORDER BY timestamp ASC LIMIT ?",
+            f"SELECT * FROM runs {where} ORDER BY timestamp ASC LIMIT ?",  # nosec B608
             params,
         ).fetchall()
 
picarones/interfaces/web/jobs.py
@@ -231,8 +231,13 @@ class JobStore:
         values.append(time.time())
         values.append(job_id)
         with self._conn() as c:
+            # bandit B608 false positive: ``fields`` is built
+            # only from internal literals (``"status = ?"``,
+            # ``"total_docs = ?"`` etc.), so no user input
+            # is concatenated into the query. The *values* all
+            # go through ``?`` placeholders (the ``values`` parameter).
             c.execute(
-                f"UPDATE jobs SET {', '.join(fields)} WHERE job_id = ?",
+                f"UPDATE jobs SET {', '.join(fields)} WHERE job_id = ?",  # nosec B608
                 values,
             )
 
tests/adapters/ocr/test_tesseract_lang_validation.py
@@ -0,0 +1,92 @@
+"""Phase 1.2 of the audit plan: TesseractAdapter validates the format
+of ``lang`` at construction time (rejects CLI injections).
+
+Risk addressed: ``lang`` is ultimately concatenated by pytesseract into
+the command line ``tesseract -l <lang>``. Without validation, a
+caller passing ``lang="fra --user-words /etc/passwd"`` would read
+an arbitrary file (Tesseract honors this flag).
+
+UI-side validation (``get_tesseract_langs()``) protected the
+web path, but not programmatic usage or the CLI. Phase
+1.2 adds a local defense in ``__init__``.
+"""
+
+from __future__ import annotations
+
+import pytest
+
+from picarones.adapters.ocr.base import OCRAdapterError
+from picarones.adapters.ocr.tesseract import TesseractAdapter
+
+
+class TestTesseractLangAccepted:
+    """Canonical Tesseract codes are accepted."""
+
+    @pytest.mark.parametrize(
+        "lang",
+        [
+            "fra",
+            "eng",
+            "lat",
+            "frk",  # Fraktur
+            "deu",
+            "fra+eng",  # standard combination
+            "lat+deu+eng",
+            "Latin",  # script (3+ letters)
+            "Cyrillic",
+        ],
+    )
+    def test_valid_lang_accepted(self, lang: str) -> None:
+        adapter = TesseractAdapter(lang=lang)
+        assert adapter.lang == lang
+
+
+class TestTesseractLangRejected:
+    """Any value exploitable for CLI injection must raise."""
+
+    @pytest.mark.parametrize(
+        "lang",
+        [
+            # Classic injection: a space makes it possible to add a
+            # Tesseract flag that reads an arbitrary file.
+            "fra --user-words /etc/passwd",
+            "fra --tessdata-dir /tmp",
+            # Double dashes without a space = same attack.
+            "fra--user-words",
+            # Slash: path / path traversal.
+            "fra/eng",
+            "../etc",
+            # Shell separator characters.
+            "fra;ls",
+            "fra|cat",
+            "fra`whoami`",
+            "fra$IFS",
+            "fra\nrm",
+            # Empty or too short.
+            "",
+            "f",
+            "fr",
+            # Non-ASCII characters (can bypass a naive regex).
+            "frà",
+            "français",
+            # Malformed combination.
+            "fra+",
+            "+fra",
+            "fra++eng",
+            # With digits (not an ISO 639-3 code).
+            "fra1",
+            "1fra",
+        ],
+    )
+    def test_invalid_lang_raises(self, lang: str) -> None:
+        with pytest.raises(OCRAdapterError, match="lang invalide"):
+            TesseractAdapter(lang=lang)
+
+
+def test_default_lang_is_valid() -> None:
+    """Regression: the default ``"fra"`` must always pass
+    validation (otherwise TesseractAdapter() would crash without
+    arguments).
+    """
+    adapter = TesseractAdapter()
+    assert adapter.lang == "fra"
tests/security/test_escriptorium_ssrf.py
@@ -0,0 +1,135 @@
+"""Phase 1.1 of the audit plan: the eScriptorium adapter now goes
+through ``validate_http_url`` for GET/POST fetches and through
+``download_url`` for image downloads.
+
+Code-quality audit (2026-05): ``escriptorium._get/_post`` and the
+``urllib.request.urlretrieve(part.image_url)`` call on line 410 fetched
+without validating the URL: a manifest pointing to
+``http://169.254.169.254/...`` exfiltrated cloud metadata,
+``http://127.0.0.1:6379/...`` talked to the local Redis, etc. The
+``validate_http_url`` helper already existed for IIIF/Gallica/
+HTR-United but was not wired in for eScriptorium.
+"""
+
+from __future__ import annotations
+
+from unittest.mock import patch
+
+import pytest
+
+from picarones.adapters.corpus.escriptorium import EScriptoriumClient
+
+
+@pytest.fixture
+def client() -> EScriptoriumClient:
+    """eScriptorium client configured on a valid dummy host.
+
+    The constructor performs no fetch, so we can build
+    a client with a dummy public URL and test the methods
+    individually.
+    """
+    return EScriptoriumClient("https://escriptorium.example.org", token="dummy")
+
+
+# --------------------------------------------------------------------------
+# _get / _post: blocked hostnames
+# --------------------------------------------------------------------------
+
+
+class TestGetBlocksDangerousHosts:
+    """``_get`` must refuse internal hostnames before any fetch."""
+
+    @pytest.mark.parametrize(
+        "base_url",
+        [
+            "http://localhost:8000",
+            "http://127.0.0.1:8000",
+            "http://169.254.169.254",  # AWS metadata
+            "http://metadata.google.internal",  # GCP metadata
+            "http://10.0.0.42",  # RFC 1918
+            "http://192.168.1.1",  # RFC 1918
+            "http://172.16.0.5",  # RFC 1918
+            "http://0.0.0.0",  # unspecified
+        ],
+    )
+    def test_get_refuses_internal_host(self, base_url: str) -> None:
+        """Each internal IP/host raises RuntimeError without any fetch."""
+        client = EScriptoriumClient(base_url, token="dummy")
+        with patch("urllib.request.urlopen") as mock_urlopen:
+            with pytest.raises(RuntimeError, match="(anti-SSRF|refusé|Schéma)"):
+                client._get("projects/")
+            # The fetch must never happen.
+            mock_urlopen.assert_not_called()
+
+    def test_get_refuses_file_scheme(self) -> None:
+        """The ``file://`` scheme is refused before any fetch."""
+        client = EScriptoriumClient("file:///etc/passwd", token="dummy")
+        with patch("urllib.request.urlopen") as mock_urlopen:
+            with pytest.raises(RuntimeError):
+                client._get("anything")
+            mock_urlopen.assert_not_called()
+
+
+class TestPostBlocksDangerousHosts:
+    """``_post`` (OCR layer creation) must validate as well."""
+
+    @pytest.mark.parametrize(
+        "base_url",
+        [
+            "http://169.254.169.254",
+            "http://localhost",
+            "http://10.0.0.1",
+        ],
+    )
+    def test_post_refuses_internal_host(self, base_url: str) -> None:
+        client = EScriptoriumClient(base_url, token="dummy")
+        with patch("urllib.request.urlopen") as mock_urlopen:
+            with pytest.raises(RuntimeError, match="(anti-SSRF|refusé|Schéma)"):
+                client._post("documents/1/parts/2/transcriptions/", {"key": "value"})
+            mock_urlopen.assert_not_called()
+
+
+# --------------------------------------------------------------------------
+# Image download via download_url (Phase 1.1): anti-SSRF
+# --------------------------------------------------------------------------
+
+
+class TestImageDownloadValidatesURL:
+    """``import_document`` must refuse to fetch an image whose
+    ``image_url`` points to an internal host.
+
+    Only the sub-routine that downloads the image is tested here
+    (the ``download_url`` helper raises the ``ValueError`` from validate_http_url).
+    """
+
+    def test_download_url_rejects_metadata_host(self) -> None:
+        """Direct check of the invariant: download_url
+        does not fetch a cloud metadata URL."""
+        from picarones.adapters.corpus._http import download_url
+
+        with patch("urllib.request.urlopen") as mock_urlopen:
+            with pytest.raises(ValueError, match="(anti-SSRF|refusé)"):
+                download_url("http://169.254.169.254/latest/meta-data/")
+            mock_urlopen.assert_not_called()
+
+
+# --------------------------------------------------------------------------
+# Safeguard: importing the module does not crash
+# --------------------------------------------------------------------------
+
+
+def test_module_imports_validate_http_url() -> None:
+    """The ``escriptorium`` module must have imported ``validate_http_url``
+    at top level: protection against a lazy-import regression
+    that would bypass the check.
+    """
+    import picarones.adapters.corpus.escriptorium as mod
+
+    assert hasattr(mod, "validate_http_url"), (
+        "escriptorium.py n'importe plus validate_http_url — "
+        "régression Phase 1.1 de l'audit code-quality."
+    )
+    assert hasattr(mod, "download_url"), (
+        "escriptorium.py n'importe plus download_url — "
+        "régression Phase 1.1 de l'audit code-quality."
+    )