Spaces: Ma-Ri-Ba-Ku / Picarones

Claude committed on May 4

Commit 4f9f5f6 (unverified) · 1 Parent(s): 3a4fc3a

feat(app): Sprint A14-S20, CorpusService (sandboxed ZIP import + image/GT pattern detection)

Second application service of the rewrite, after ``BenchmarkService``
(S17) and ``WorkspaceManager`` (S19): it takes an uploaded ZIP blob
(web or CLI) as input and produces a ``CorpusSpec`` that the
``BenchmarkService`` can consume directly.

Pattern detection
-----------------
Naming conventions aligned with the legacy implementation (Sprint 32):

::

mon_doc.png → source image
mon_doc.gt.txt → RAW_TEXT GT
mon_doc.gt.alto.xml → ALTO_XML GT
mon_doc.gt.page.xml → PAGE_XML GT
mon_doc.gt.entities.json → ENTITIES GT
mon_doc.gt.reading_order.json → READING_ORDER GT

All GTs that share the image's stem are attached to the same
``DocumentRef``. An image without GT is still included (warning,
counted in ``n_images_without_gt``); an orphan GT is not attached
(warning, counted in ``n_gt_without_image``).
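
The pairing rule above can be illustrated with a minimal, self-contained sketch. It is not the committed implementation: the function name ``pair_by_stem`` is hypothetical, and plain strings stand in for the ``ArtifactType`` members so that the example runs without the ``picarones`` package.

```python
from collections import defaultdict

# Suffix -> GT level, mirroring the naming convention listed above.
# Plain strings are an assumption of this sketch (the commit uses ArtifactType).
GT_SUFFIXES = {
    ".gt.txt": "RAW_TEXT",
    ".gt.alto.xml": "ALTO_XML",
    ".gt.page.xml": "PAGE_XML",
    ".gt.entities.json": "ENTITIES",
    ".gt.reading_order.json": "READING_ORDER",
}
IMAGE_EXTENSIONS = (".png", ".jpg", ".jpeg", ".tif", ".tiff", ".webp", ".bmp")


def pair_by_stem(names):
    """Group entry names into documents keyed by their shared stem."""
    images = {}
    gts = defaultdict(dict)
    for name in names:
        lower = name.lower()
        gt_suffix = next((s for s in GT_SUFFIXES if lower.endswith(s)), None)
        if gt_suffix is not None:
            stem = name[: -len(gt_suffix)]
            gts[stem][GT_SUFFIXES[gt_suffix]] = name
            continue
        ext = next((e for e in IMAGE_EXTENSIONS if lower.endswith(e)), None)
        if ext is not None:
            # The first image wins when two images share a stem.
            images.setdefault(name[: -len(ext)], name)
    # Attach every GT level found for a stem to the image with that stem.
    documents = {stem: (img, gts.pop(stem, {})) for stem, img in images.items()}
    orphans = sorted(gts)  # GT stems left without a matching image
    return documents, orphans
```

An image without GT ends up with an empty GT mapping, and any stem left in ``orphans`` corresponds to what the report counts as ``n_gt_without_image``.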

Security (each case tested)
---------------------------
- **Path traversal** (``../etc/passwd``) → ``CorpusImportError``.
- **Absolute path** (Unix ``/etc/passwd`` or Windows ``C:/evil``)
  → error.
- **Symlink** in the ZIP (UNIX mode ``S_IFLNK`` detected via
  ``external_attr``) → error.
- **Null byte** in an entry name → error.
- **Final safeguard**: every resolved path must stay under
  ``extract_dir`` (defense in depth, checked on the resolved path
  before the file is written).
- **Blob size cap** (``max_zip_size_bytes``, default 100 MiB).
- **Entry count cap** (``max_entry_count``, default 5000), against
  zip bombs by entry count.
- **Uncompressed size cap** (``max_uncompressed_bytes``,
  default 500 MiB), against zip bombs by expansion.
- **Corrupted archive** (``BadZipFile``) → typed error.
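
As an illustration of the checks listed above, here is a minimal sketch that relies only on the standard library. ``UnsafeArchive``, ``check_entry`` and ``check_archive`` are hypothetical names chosen for this example. The committed service raises ``CorpusImportError`` and compares the file-type bits to ``0xA000``, whereas this sketch uses ``stat.S_ISLNK`` for the same purpose.

```python
import io
import stat
import zipfile


class UnsafeArchive(Exception):
    """Hypothetical error type used only in this sketch."""


def check_entry(info: zipfile.ZipInfo) -> None:
    """Reject entry names or modes that could escape the extraction folder."""
    name = info.filename
    if not name or "\x00" in name:
        raise UnsafeArchive(f"empty name or null byte: {name!r}")
    # Absolute paths: Unix "/..." or a Windows drive prefix such as "C:/".
    if name.startswith(("/", "\\")):
        raise UnsafeArchive(f"absolute path: {name!r}")
    if len(name) >= 3 and name[1] == ":" and name[2] in ("/", "\\"):
        raise UnsafeArchive(f"Windows absolute path: {name!r}")
    # Traversal: ".." used as a path component.
    if ".." in name.replace("\\", "/").split("/"):
        raise UnsafeArchive(f"path traversal: {name!r}")
    # The upper 16 bits of external_attr carry the Unix mode, when present.
    if stat.S_ISLNK(info.external_attr >> 16):
        raise UnsafeArchive(f"symlink: {name!r}")


def check_archive(
    zip_bytes: bytes,
    *,
    max_zip_size: int = 100 * 1024 * 1024,
    max_entries: int = 5000,
    max_uncompressed: int = 500 * 1024 * 1024,
) -> int:
    """Apply the global caps, then validate every entry. Returns the entry count."""
    if len(zip_bytes) > max_zip_size:
        raise UnsafeArchive("ZIP blob too large")
    try:
        zf = zipfile.ZipFile(io.BytesIO(zip_bytes))
    except zipfile.BadZipFile as exc:
        raise UnsafeArchive(f"not a valid ZIP: {exc}") from exc
    with zf:
        infos = zf.infolist()
        if len(infos) > max_entries:
            raise UnsafeArchive("too many entries")
        # file_size is the uncompressed size declared in the archive headers.
        if sum(i.file_size for i in infos) > max_uncompressed:
            raise UnsafeArchive("uncompressed size too large")
        for info in infos:
            check_entry(info)
    return len(infos)
```

All checks run before anything is written to disk, so a rejected archive leaves no partially extracted files behind in this sketch.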

Silent filtering of OS artifacts
--------------------------------
Detected and skipped without a warning (routine noise in a ZIP
produced on a heritage-institution workstation):

- ``__MACOSX/`` and ``__MACOSX/._*``.
- ``._*`` (Apple resource forks).
- ``.DS_Store``.
- ``Thumbs.db`` (case-insensitive).

These skips are counted in ``n_skipped_noise`` so that the user can
check the triage.
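
A short sketch of this filter, reusing the four regular expressions that appear in ``corpus_service.py`` in this commit. The helper name ``split_noise`` is hypothetical and exists only for this example.

```python
import re

# Same patterns as _OS_NOISE_PATTERNS in the committed module.
NOISE_PATTERNS = (
    re.compile(r"(^|/)__MACOSX(/|$)"),
    re.compile(r"(^|/)\._[^/]*$"),
    re.compile(r"(^|/)\.DS_Store$"),
    re.compile(r"(^|/)Thumbs\.db$", re.IGNORECASE),
)


def split_noise(names):
    """Return (kept_names, n_skipped_noise) for a list of ZIP entry names."""
    kept = [n for n in names if not any(p.search(n) for p in NOISE_PATTERNS)]
    return kept, len(names) - len(kept)
```

Each pattern is anchored on a path component, so a regular file such as ``my._file.txt`` is kept: ``._`` only counts as noise at the start of a file name.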

Anti-over-engineering
---------------------
- No OCR at import time (the service organizes files, it does not
  read them).
- No ALTO/PAGE schema validation at import time (too heavy; it stays
  on demand in the projectors/loaders).
- No per-user quotas or rate limiting (the web/CLI caller's
  responsibility).
- No magic image-format autodetection (extension only); Pillow will
  guard against mislabeled files later, on the web side.

Sandbox
-------
Every extraction goes into a ``corpus_<safe_name>`` subfolder of the
injected ``WorkspaceManager``. Several imports can coexist in the
same workspace without collision (test included).
``corpus_name`` is sanitized via ``safe_report_name``.
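
The containment guarantee can be sketched independently of ``WorkspaceManager``, whose API is not shown in full here. The function ``safe_target`` below is a hypothetical helper that applies the same resolve-then-``relative_to`` check as the extraction loop in this commit.

```python
import tempfile
from pathlib import Path


def safe_target(extract_dir: Path, arc_name: str) -> Path:
    """Resolve the on-disk target of a ZIP entry, or raise if it escapes."""
    root = extract_dir.resolve()
    target = (root / arc_name).resolve()
    try:
        # relative_to raises ValueError when target is not located under root.
        target.relative_to(root)
    except ValueError as exc:
        raise ValueError(f"{arc_name!r} escapes {root}") from exc
    return target


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        sandbox = Path(tmp) / "corpus_demo"
        sandbox.mkdir()
        print(safe_target(sandbox, "volA/folio_001.png"))
```

Because both paths are resolved before the comparison, a name that contains ``..`` or an absolute path fails the check even if an earlier name-based filter missed it.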

Tests
-----
32 tests in ``tests/security/test_sprint_a14_s20_corpus_service.py``:

- Basic import (image + GT → 1 doc, sandboxed extraction,
  ``corpus_name`` sanitization).
- Parametrized detection of the 5 GT levels.
- Multi-level GT for the same stem.
- Pairing: image without GT, orphan GT, duplicate image (first one
  kept), ``volA/folio_001`` hierarchy preserved in the doc_id.
- OS filtering: __MACOSX, ._*, .DS_Store, Thumbs.db (case-insensitive).
- Security: 8 cases (traversal, Unix/Windows absolute path, corrupted
  ZIP, blob too large, too many entries, uncompressed size too large,
  symlink).
- Edge cases: empty ZIP, unknown extension skipped, invalid
  characters in doc_id replaced, metadata pass-through, multiple
  imports without collision.
- Smoke: an imported corpus can be consumed directly by the
  BenchmarkService (end-to-end check of the public API).
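
For the doc_id cases in the list above (hierarchy preserved, invalid characters replaced), a minimal sketch of the sanitization follows. It reuses the regular expression from ``corpus_service.py`` in this commit; the public name ``doc_id_from_stem`` is chosen for this example only.

```python
import re

# Same character class as _DOC_ID_INVALID_RE in the committed module.
_INVALID = re.compile(r"[^A-Za-z0-9_.\-/]")


def doc_id_from_stem(stem: str) -> str:
    """Replace every character outside the allowed alphabet with '_'."""
    cleaned = _INVALID.sub("_", stem)
    # An empty stem falls back to a fixed placeholder id.
    return cleaned or "doc"
```

Slashes are part of the allowed alphabet, which is why a stem such as ``volA/folio_001`` keeps its hierarchy, while spaces, accents and parentheses each become one underscore.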

439 sprint_a14 tests pass (407 S1-S19 + 32 S20).

https://claude.ai/code/session_011XQZNitg1rCgia8ZD1a2hP

Files changed (3)

picarones/app/services/__init__.py +8 -0
picarones/app/services/corpus_service.py +540 -0
tests/security/test_sprint_a14_s20_corpus_service.py +467 -0

picarones/app/services/__init__.py CHANGED

@@ -30,6 +30,11 @@ from picarones.app.services.benchmark_service import (
     GroundTruthFactory,
     PipelineInputsFactory,
 )
+from picarones.app.services.corpus_service import (
+    CorpusImportError,
+    CorpusImportReport,
+    CorpusService,
+)
 from picarones.app.services.path_security import (
     PathValidationError,
     WorkspaceManager,
@@ -41,6 +46,9 @@ from picarones.app.services.path_security import (
 __all__ = [
     "BenchmarkService",
     "ContextFactory",
+    "CorpusImportError",
+    "CorpusImportReport",
+    "CorpusService",
     "GroundTruthFactory",
     "PathValidationError",
     "PipelineInputsFactory",

picarones/app/services/corpus_service.py ADDED

	@@ -0,0 +1,540 @@

+"""``CorpusService``: sandboxed ZIP upload + detection of image/GT pairs.
+Sprint A14-S20 of the targeted rewrite.
+The application service that takes a ZIP blob (uploaded through the
+web or the CLI) as input and produces a ``CorpusSpec`` that the
+``BenchmarkService`` (S17) can consume directly, with:
+- **Sandboxed extraction** into a subfolder of a
+  ``WorkspaceManager`` (S19): path traversal, symlinks and zip bombs
+  are rejected.
+- **Pair detection** (image / GT) by naming convention, aligned with
+  the legacy implementation (Sprint 32):
+  ::
+      mon_doc.png
+      mon_doc.gt.txt
+      mon_doc.gt.alto.xml
+      mon_doc.gt.page.xml
+      mon_doc.gt.entities.json
+      mon_doc.gt.reading_order.json
+  All GTs that share the **same stem** as the image are attached
+  to the same ``DocumentRef``.
+- **Silent filtering** of macOS / Windows artifacts (``__MACOSX/``,
+  ``._*``, ``.DS_Store``, ``Thumbs.db``), the routine noise of a ZIP
+  produced on a heritage-institution workstation.
+- **Report**: a ``CorpusImportReport`` that aggregates warnings (image
+  without GT, orphan GT) and counts skipped entries, so that the user
+  can check that the corpus was interpreted correctly.
+Anti-over-engineering
+---------------------
+- No OCR at import time.  The service does not read file contents, it
+  organizes them.
+- No ALTO/PAGE schema validation at import time (too heavy).
+  Files are only catalogued; validation happens on demand in the
+  projectors/loaders.
+- No per-user quotas or rate limiting (the web/CLI caller's
+  responsibility; the constructor's ``max_*`` parameters are absolute
+  defensive caps).
+- No image-format autodetection (PNG vs JPEG vs TIFF): files are
+  recognized by extension.  If an attacker ships an EXE named
+  ``.png``, Pillow will guard against it later (S21+ for the web).
+"""
+from __future__ import annotations
+import io
+import logging
+import re
+import zipfile
+from dataclasses import dataclass, field
+from pathlib import Path
+from picarones.app.services.path_security import (
+    WorkspaceManager,
+    safe_report_name,
+)
+from picarones.domain.artifacts import ArtifactType
+from picarones.domain.corpus import CorpusSpec
+from picarones.domain.documents import DocumentRef, GroundTruthRef
+logger = logging.getLogger(__name__)
+class CorpusImportError(Exception):
+    """Raised when the ZIP import fails irrecoverably.
+    Typical cases:
+    - Corrupted / non-ZIP archive.
+    - Path traversal detected.
+    - Symlink detected.
+    - Size or entry-count cap exceeded (zip bomb).
+    """
+# ──────────────────────────────────────────────────────────────────────
+# GT naming conventions (aligned with picarones/core/corpus.py,
+# Sprint 32, but expressed as ``ArtifactType`` for the rewrite).
+# ──────────────────────────────────────────────────────────────────────
+#: Recognized GT suffixes, listed from most to least specific.
+#: Matching uses ``str.endswith`` and no suffix here ends with
+#: another one, so the order is a convention, not a requirement.
+_GT_SUFFIX_TO_TYPE: tuple[tuple[str, ArtifactType], ...] = (
+    (".gt.alto.xml", ArtifactType.ALTO_XML),
+    (".gt.page.xml", ArtifactType.PAGE_XML),
+    (".gt.entities.json", ArtifactType.ENTITIES),
+    (".gt.reading_order.json", ArtifactType.READING_ORDER),
+    (".gt.txt", ArtifactType.RAW_TEXT),
+)
+#: Recognized image extensions (case-insensitive).  ``_classify`` also
+#: requires that the path contain no ``.gt.``: a GT name such as
+#: ``foo.gt.alto.xml`` would not match these extensions anyway, so
+#: that check is purely defensive.
+_IMAGE_EXTENSIONS: frozenset[str] = frozenset({
+    ".png", ".jpg", ".jpeg", ".tif", ".tiff", ".webp", ".bmp",
+})
+#: Patterns to ignore silently (OS artifacts).
+_OS_NOISE_PATTERNS: tuple[re.Pattern[str], ...] = (
+    re.compile(r"(^|/)__MACOSX(/|$)"),
+    re.compile(r"(^|/)\._[^/]*$"),
+    re.compile(r"(^|/)\.DS_Store$"),
+    re.compile(r"(^|/)Thumbs\.db$", re.IGNORECASE),
+)
+# ──────────────────────────────────────────────────────────────────────
+# Import report
+# ──────────────────────────────────────────────────────────────────────
+@dataclass(frozen=True)
+class CorpusImportReport:
+    """Human-readable result of an ``import_zip`` call.
+    Attributes
+    ----------
+    spec:
+        The ``CorpusSpec`` that was built, ready to be passed to the
+        ``BenchmarkService``.
+    extracted_dir:
+        Absolute filesystem path of the subfolder the ZIP was
+        extracted into.  Lives under ``WorkspaceManager.root``.
+    n_documents:
+        Number of documents with at least one image (equal to the
+        length of ``spec.documents``).
+    n_images_without_gt:
+        Number of images found without GT.  These documents are still
+        included in the corpus (the user may only want to run OCR,
+        not evaluate it).
+    n_gt_without_image:
+        Number of orphan GTs (stem with no matching image).
+        Reported as warnings and not attached: they take no part
+        in the corpus.
+    n_skipped_noise:
+        Number of entries skipped silently (OS artifacts).
+    warnings:
+        Human-readable messages to show to the caller (the web UI
+        displays them in a banner, the CLI prints them to stderr).
+    skipped_paths:
+        List of paths (relative to the ZIP root) that were skipped
+        or left unattached; useful when debugging an import that
+        lost files.
+    """
+    spec: CorpusSpec
+    extracted_dir: Path
+    n_documents: int
+    n_images_without_gt: int
+    n_gt_without_image: int
+    n_skipped_noise: int
+    warnings: tuple[str, ...] = field(default_factory=tuple)
+    skipped_paths: tuple[str, ...] = field(default_factory=tuple)
+# ──────────────────────────────────────────────────────────────────────
+# Service
+# ──────────────────────────────────────────────────────────────────────
+class CorpusService:
+    """Service that imports a corpus and analyzes its structure.
+    Parameters
+    ----------
+    workspace:
+        ``WorkspaceManager`` in which to extract the ZIP.  The service
+        creates one subfolder per import, so several imports can
+        coexist in the same workspace.
+    max_zip_size_bytes:
+        Cap on the **size of the ZIP blob** itself (before
+        extraction).  Default 100 MiB.  The caller (web layer) should
+        ideally also check this upstream via
+        ``Content-Length``.
+    max_entry_count:
+        Cap on the number of entries in the ZIP (guards against
+        bombs by entry count).  Default 5000.
+    max_uncompressed_bytes:
+        Cap on the total **uncompressed** size (guards against
+        bombs by expansion).  Default 500 MiB.
+    """
+    def __init__(
+        self,
+        workspace: WorkspaceManager,
+        *,
+        max_zip_size_bytes: int = 100 * 1024 * 1024,
+        max_entry_count: int = 5000,
+        max_uncompressed_bytes: int = 500 * 1024 * 1024,
+    ) -> None:
+        self._workspace = workspace
+        self._max_zip_size = max_zip_size_bytes
+        self._max_entries = max_entry_count
+        self._max_uncompressed = max_uncompressed_bytes
+    # ──────────────────────────────────────────────────────────────────
+    # Public API
+    # ──────────────────────────────────────────────────────────────────
+    def import_zip(
+        self,
+        zip_bytes: bytes,
+        *,
+        corpus_name: str,
+        metadata: dict[str, str] | None = None,
+    ) -> CorpusImportReport:
+        """Extract a ZIP and build the corresponding ``CorpusSpec``.
+        Steps:
+        1. Validate the caps (blob size, entry count, uncompressed
+           size as declared in the entry headers).
+        2. Validate each entry (reject traversal, symlinks).
+        3. Extract safely into a dedicated subfolder.
+        4. Catalogue: detect images + GT and pair them by stem.
+        5. Build the ``CorpusSpec``.
+        ``corpus_name`` is cleaned via :func:`safe_report_name`
+        (the caller may pass a user-supplied name without validating it).
+        """
+        if len(zip_bytes) > self._max_zip_size:
+            raise CorpusImportError(
+                f"ZIP trop volumineux : {len(zip_bytes)} octets > "
+                f"plafond {self._max_zip_size}.",
+            )
+        safe_name = safe_report_name(corpus_name, max_length=64)
+        # Extraction subfolder dedicated to this import, which lets
+        # several imports coexist without collision.
+        extract_dir = self._workspace.subpath(f"corpus_{safe_name}")
+        extract_dir.mkdir(parents=True, exist_ok=True)
+        try:
+            zf = zipfile.ZipFile(io.BytesIO(zip_bytes))
+        except zipfile.BadZipFile as exc:
+            raise CorpusImportError(f"Archive ZIP invalide : {exc}") from exc
+        with zf:
+            self._validate_archive(zf)
+            extracted_files, n_noise = self._extract_safely(zf, extract_dir)
+        spec, warnings, n_orphan_gt, n_no_gt, skipped_paths = (
+            self._build_corpus_spec(
+                extracted_files=extracted_files,
+                corpus_name=safe_name,
+                extract_dir=extract_dir,
+                metadata=metadata or {},
+            )
+        )
+        return CorpusImportReport(
+            spec=spec,
+            extracted_dir=extract_dir,
+            n_documents=len(spec.documents),
+            n_images_without_gt=n_no_gt,
+            n_gt_without_image=n_orphan_gt,
+            n_skipped_noise=n_noise,
+            warnings=tuple(warnings),
+            skipped_paths=tuple(skipped_paths),
+        )
+    # ──────────────────────────────────────────────────────────────────
+    # Step 1: global validation of the archive
+    # ──────────────────────────────────────────────────────────────────
+    def _validate_archive(self, zf: zipfile.ZipFile) -> None:
+        """Check the global caps (entry count, uncompressed size)."""
+        infos = zf.infolist()
+        if len(infos) > self._max_entries:
+            raise CorpusImportError(
+                f"ZIP contient trop d'entrées : {len(infos)} > "
+                f"plafond {self._max_entries} (zip bomb suspectée).",
+            )
+        total_uncompressed = sum(info.file_size for info in infos)
+        if total_uncompressed > self._max_uncompressed:
+            raise CorpusImportError(
+                f"ZIP décompressé trop volumineux : {total_uncompressed} "
+                f"octets > plafond {self._max_uncompressed} (zip bomb "
+                "suspectée).",
+            )
+    # ──────────────────────────────────────────────────────────────────
+    # Steps 2 + 3: safe extraction
+    # ──────────────────────────────────────────────────────────────────
+    def _extract_safely(
+        self,
+        zf: zipfile.ZipFile,
+        extract_dir: Path,
+    ) -> tuple[list[tuple[str, Path]], int]:
+        """Extract each file after validating its target path.
+        Returns
+        -------
+        tuple[list[tuple[str, Path]], int]
+            ``(extracted_files, n_skipped_noise)``: the list of
+            ``(relative_in_zip, absolute_on_disk)`` pairs for the
+            files actually extracted, and the count of entries
+            skipped as OS artifacts.
+        """
+        out: list[tuple[str, Path]] = []
+        n_noise = 0
+        for info in zf.infolist():
+            arc_name = info.filename
+            # Skip bare directories.
+            if arc_name.endswith("/"):
+                continue
+            # Skip OS artifacts (silent by design).
+            if _is_os_noise(arc_name):
+                n_noise += 1
+                continue
+            # Reject absolute paths, traversals, null bytes.
+            self._reject_unsafe_arcname(arc_name)
+            # Reject symlinks (UNIX file-type bits S_IFLNK = 0xA000).
+            unix_mode = (info.external_attr >> 16) & 0xF000
+            if unix_mode == 0xA000:
+                raise CorpusImportError(
+                    f"Symlink dans le ZIP refusé : {arc_name!r}.",
+                )
+            target = (extract_dir / arc_name).resolve()
+            # Final safeguard: the resolved path must stay under extract_dir.
+            try:
+                target.relative_to(extract_dir.resolve())
+            except ValueError as exc:
+                raise CorpusImportError(
+                    f"Entrée ZIP {arc_name!r} sort du dossier "
+                    f"d'extraction après résolution.",
+                ) from exc
+            target.parent.mkdir(parents=True, exist_ok=True)
+            with zf.open(info) as src, target.open("wb") as dst:
+                while True:
+                    chunk = src.read(64 * 1024)
+                    if not chunk:
+                        break
+                    dst.write(chunk)
+            out.append((arc_name, target))
+        return out, n_noise
+    @staticmethod
+    def _reject_unsafe_arcname(arc_name: str) -> None:
+        if not arc_name:
+            raise CorpusImportError("Entrée ZIP au nom vide.")
+        if "\x00" in arc_name:
+            raise CorpusImportError(
+                f"Entrée ZIP avec octet nul dans le nom : {arc_name!r}.",
+            )
+        # Reject absolute paths (Unix ``/`` or Windows ``C:\``).
+        if arc_name.startswith("/") or arc_name.startswith("\\"):
+            raise CorpusImportError(
+                f"Chemin absolu interdit dans le ZIP : {arc_name!r}.",
+            )
+        if len(arc_name) >= 3 and arc_name[1] == ":" and arc_name[2] in ("/", "\\"):
+            raise CorpusImportError(
+                f"Chemin absolu Windows interdit dans le ZIP : "
+                f"{arc_name!r}.",
+            )
+        # Reject traversals (``..`` as a path component).
+        parts = arc_name.replace("\\", "/").split("/")
+        if any(p == ".." for p in parts):
+            raise CorpusImportError(
+                f"Traversal détecté dans le ZIP : {arc_name!r}.",
+            )
+    # ──────────────────────────────────────────────────────────────────
+    # Steps 4 + 5: cataloguing and building the spec
+    # ──────────────────────────────────────────────────────────────────
+    def _build_corpus_spec(
+        self,
+        *,
+        extracted_files: list[tuple[str, Path]],
+        corpus_name: str,
+        extract_dir: Path,
+        metadata: dict[str, str],
+    ) -> tuple[CorpusSpec, list[str], int, int, list[str]]:
+        """Catalogue images and GTs, then build the ``CorpusSpec``.
+        Returns
+        -------
+        tuple[CorpusSpec, warnings, n_orphan_gt, n_no_gt, skipped_paths]
+        """
+        images_by_stem: dict[str, Path] = {}
+        gts_by_stem: dict[str, dict[ArtifactType, Path]] = {}
+        skipped_paths: list[str] = []
+        warnings_list: list[str] = []
+        for arc_name, abs_path in extracted_files:
+            # Keep arc_name as the "source path" for the doc id
+            # (relative, readable).  image_uri / gt.uri use the absolute path.
+            kind = _classify(arc_name)
+            if kind is None:
+                skipped_paths.append(arc_name)
+                continue
+            if isinstance(kind, ArtifactType):
+                # GT
+                stem = _strip_gt_suffix(arc_name, kind)
+                if stem is None:
+                    skipped_paths.append(arc_name)
+                    continue
+                gts_by_stem.setdefault(stem, {})[kind] = abs_path
+            else:
+                # Image
+                stem = _strip_image_extension(arc_name)
+                if stem in images_by_stem:
+                    warnings_list.append(
+                        f"Plusieurs images partagent le stem "
+                        f"{stem!r} — première gardée, "
+                        f"{arc_name!r} ignorée.",
+                    )
+                    skipped_paths.append(arc_name)
+                    continue
+                images_by_stem[stem] = abs_path
+        # Pairing.
+        documents: list[DocumentRef] = []
+        n_no_gt = 0
+        for stem in sorted(images_by_stem):
+            image_path = images_by_stem[stem]
+            gts = gts_by_stem.pop(stem, {})
+            if not gts:
+                n_no_gt += 1
+                warnings_list.append(
+                    f"Image {stem!r} sans GT — incluse mais non "
+                    "évaluable.",
+                )
+            ground_truths = tuple(
+                GroundTruthRef(type=art_type, uri=str(path))
+                for art_type, path in sorted(
+                    gts.items(), key=lambda kv: kv[0].value,
+                )
+            )
+            doc_id = _doc_id_from_stem(stem)
+            documents.append(
+                DocumentRef(
+                    id=doc_id,
+                    image_uri=str(image_path),
+                    ground_truths=ground_truths,
+                ),
+            )
+        # Orphan GTs (stems with no matching image).
+        n_orphan_gt = 0
+        for stem, gts in gts_by_stem.items():
+            for art_type in gts:
+                n_orphan_gt += 1
+                warnings_list.append(
+                    f"GT orpheline (pas d'image pour stem "
+                    f"{stem!r}) : niveau {art_type.value!r}.",
+                )
+        spec = CorpusSpec(
+            name=corpus_name,
+            documents=tuple(documents),
+            metadata=metadata,
+        )
+        return spec, warnings_list, n_orphan_gt, n_no_gt, skipped_paths
+# ──────────────────────────────────────────────────────────────────────
+# Classification helpers
+# ──────────────────────────────────────────────────────────────────────
+def _is_os_noise(arc_name: str) -> bool:
+    return any(p.search(arc_name) for p in _OS_NOISE_PATTERNS)
+def _classify(arc_name: str) -> ArtifactType | str | None:
+    """Classify an entry as an ``ArtifactType`` (GT) or ``"image"``.
+    Returns
+    -------
+    ArtifactType if a GT is recognized, "image" if an image is
+    recognized, None if the entry cannot be classified.
+    """
+    lower = arc_name.lower()
+    for suffix, art_type in _GT_SUFFIX_TO_TYPE:
+        if lower.endswith(suffix):
+            return art_type
+    # Images: recognized extension AND no ``.gt.`` in the path.
+    # (``foo.gt.png`` is not a valid convention, but we guard
+    # against it anyway.)
+    if ".gt." in lower:
+        return None
+    for ext in _IMAGE_EXTENSIONS:
+        if lower.endswith(ext):
+            return "image"
+    return None
+def _strip_gt_suffix(arc_name: str, art_type: ArtifactType) -> str | None:
+    """Strip the GT suffix and return the stem.  ``None`` if no match."""
+    lower = arc_name.lower()
+    for suffix, t in _GT_SUFFIX_TO_TYPE:
+        if t is art_type and lower.endswith(suffix):
+            return arc_name[: len(arc_name) - len(suffix)]
+    return None
+def _strip_image_extension(arc_name: str) -> str:
+    """Strip the image extension (case-insensitive)."""
+    lower = arc_name.lower()
+    for ext in _IMAGE_EXTENSIONS:
+        if lower.endswith(ext):
+            return arc_name[: len(arc_name) - len(ext)]
+    return arc_name
+_DOC_ID_INVALID_RE = re.compile(r"[^A-Za-z0-9_.\-/]")
+def _doc_id_from_stem(stem: str) -> str:
+    """Convert a stem (relative path) into a valid ``DocumentRef.id``.
+    The ``DocumentRef`` validator requires
+    ``[A-Za-z0-9_.\\-/]+``, so every character outside that
+    alphabet is replaced with ``_`` (typically spaces, accents and
+    parentheses in BnF names).
+    """
+    cleaned = _DOC_ID_INVALID_RE.sub("_", stem)
+    if not cleaned:
+        return "doc"
+    return cleaned
+__all__ = [
+    "CorpusImportError",
+    "CorpusImportReport",
+    "CorpusService",
+]

tests/security/test_sprint_a14_s20_corpus_service.py ADDED

	@@ -0,0 +1,467 @@

+"""Sprint A14-S20: ``CorpusService`` (sandboxed ZIP import +
+detection of image/GT pairs).
+Coverage:
+- Basic import: 1 image + 1 GT → 1 doc.
+- Detection of every GT level (alto, page, entities,
+  reading_order, txt).
+- Multi-level GT for the same stem → a single doc with several
+  GroundTruthRef.
+- Image without GT → doc included + warning, ``n_images_without_gt`` > 0.
+- Orphan GT (no image) → warning + not attached,
+  ``n_gt_without_image`` > 0.
+- Silent filtering of macOS / Windows artifacts (``__MACOSX/``, ``._*``,
+  ``.DS_Store``, ``Thumbs.db``).
+Security:
+- Path traversal (``../etc/passwd``) → ``CorpusImportError``.
+- Unix absolute path (``/etc/passwd``) → ``CorpusImportError``.
+- Windows absolute path (``C:\\evil``) → ``CorpusImportError``.
+- Null byte in the name → ``CorpusImportError``.
+- Symlink in the archive → ``CorpusImportError``.
+- ZIP larger than ``max_zip_size_bytes`` → error.
+- Too many entries (zip bomb by entry count) → error.
+- Uncompressed size too large (zip bomb by expansion) → error.
+- Corrupted / non-ZIP archive → error.
+Edge cases:
+- Empty ZIP → empty corpus, no error.
+- corpus_name with special characters → sanitized via
+  ``safe_report_name``.
+- ZIP with a hierarchy (``volA/folio.png``) → doc_id preserves the
+  hierarchy.
+- Duplicate image (same stem, two extensions) → first kept +
+  warning.
+"""
+from __future__ import annotations
+import io
+import zipfile
+from pathlib import Path
+import pytest
+from picarones.app.services import (
+    CorpusImportError,
+    CorpusImportReport,
+    CorpusService,
+    WorkspaceManager,
+)
+from picarones.domain.artifacts import ArtifactType
+# ──────────────────────────────────────────────────────────────────
+# Fixtures
+# ──────────────────────────────────────────────────────────────────
+@pytest.fixture
+def workspace(tmp_path: Path) -> WorkspaceManager:
+    return WorkspaceManager(tmp_path)
+@pytest.fixture
+def service(workspace: WorkspaceManager) -> CorpusService:
+    return CorpusService(workspace)
+def _make_zip(entries: dict[str, bytes]) -> bytes:
+    """Build an in-memory ZIP from a ``{arcname: bytes}`` dict."""
+    buf = io.BytesIO()
+    with zipfile.ZipFile(buf, mode="w", compression=zipfile.ZIP_DEFLATED) as zf:
+        for name, data in entries.items():
+            zf.writestr(name, data)
+    return buf.getvalue()
+def _png_bytes() -> bytes:
+    """Minimal valid PNG header (signature + IHDR), enough for
+    tests that do not validate the image."""
+    return (
+        b"\x89PNG\r\n\x1a\n"
+        b"\x00\x00\x00\rIHDR"
+        b"\x00\x00\x00\x01\x00\x00\x00\x01\x08\x06\x00\x00\x00"
+        b"\x1f\x15\xc4\x89"
+    )
+# ──────────────────────────────────────────────────────────────────
+# Basic import + GT detection
+# ──────────────────────────────────────────────────────────────────
+class TestBasicImport:
+    def test_image_plus_text_gt_creates_one_doc(
+        self, service: CorpusService,
+    ) -> None:
+        zip_bytes = _make_zip({
+            "doc01.png": _png_bytes(),
+            "doc01.gt.txt": "Hello world".encode("utf-8"),
+        })
+        report = service.import_zip(zip_bytes, corpus_name="test_corpus")
+        assert isinstance(report, CorpusImportReport)
+        assert report.n_documents == 1
+        doc = report.spec.documents[0]
+        assert doc.id == "doc01"
+        assert doc.image_uri is not None
+        assert Path(doc.image_uri).name == "doc01.png"
+        assert len(doc.ground_truths) == 1
+        gt = doc.ground_truths[0]
+        assert gt.type == ArtifactType.RAW_TEXT
+        assert Path(gt.uri).name == "doc01.gt.txt"
+    def test_extracted_dir_lives_inside_workspace(
+        self,
+        service: CorpusService,
+        workspace: WorkspaceManager,
+    ) -> None:
+        zip_bytes = _make_zip({"doc.png": _png_bytes()})
+        report = service.import_zip(zip_bytes, corpus_name="x")
+        # Sandbox guarantee: the extracted dir is under the workspace root.
+        report.extracted_dir.relative_to(workspace.root)
+    def test_corpus_name_is_sanitized(
+        self, service: CorpusService,
+    ) -> None:
+        zip_bytes = _make_zip({"doc.png": _png_bytes()})
+        report = service.import_zip(
+            zip_bytes,
+            corpus_name="my/corpus/with/slashes",
+        )
+        # The slashes are removed by safe_report_name.
+        assert "/" not in report.spec.name
+        assert report.spec.name == "mycorpuswithslashes"
+class TestGTLevelDetection:
+    @pytest.mark.parametrize(
+        "suffix,expected_type",
+        [
+            (".gt.alto.xml", ArtifactType.ALTO_XML),
+            (".gt.page.xml", ArtifactType.PAGE_XML),
+            (".gt.entities.json", ArtifactType.ENTITIES),
+            (".gt.reading_order.json", ArtifactType.READING_ORDER),
+            (".gt.txt", ArtifactType.RAW_TEXT),
+        ],
+    )
+    def test_each_gt_suffix_is_recognized(
+        self,
+        service: CorpusService,
+        suffix: str,
+        expected_type: ArtifactType,
+    ) -> None:
+        zip_bytes = _make_zip({
+            "doc.png": _png_bytes(),
+            f"doc{suffix}": b"<gt></gt>",
+        })
+        report = service.import_zip(zip_bytes, corpus_name="x")
+        assert report.n_documents == 1
+        doc = report.spec.documents[0]
+        assert len(doc.ground_truths) == 1
+        assert doc.ground_truths[0].type == expected_type
+    def test_multi_level_gt_for_same_stem(
+        self, service: CorpusService,
+    ) -> None:
+        zip_bytes = _make_zip({
+            "doc.png": _png_bytes(),
+            "doc.gt.txt": b"text",
+            "doc.gt.alto.xml": b"<alto></alto>",
+            "doc.gt.entities.json": b"[]",
+        })
+        report = service.import_zip(zip_bytes, corpus_name="x")
+        assert report.n_documents == 1
+        doc = report.spec.documents[0]
+        types = {gt.type for gt in doc.ground_truths}
+        assert types == {
+            ArtifactType.RAW_TEXT,
+            ArtifactType.ALTO_XML,
+            ArtifactType.ENTITIES,
+        }
+    def test_case_insensitive_extension_for_image(
+        self, service: CorpusService,
+    ) -> None:
+        zip_bytes = _make_zip({
+            "doc.PNG": _png_bytes(),
+            "doc.gt.txt": b"x",
+        })
+        report = service.import_zip(zip_bytes, corpus_name="x")
+        assert report.n_documents == 1
+class TestPairing:
+    def test_image_without_gt_is_included_with_warning(
+        self, service: CorpusService,
+    ) -> None:
+        zip_bytes = _make_zip({"only_image.png": _png_bytes()})
+        report = service.import_zip(zip_bytes, corpus_name="x")
+        assert report.n_documents == 1
+        assert report.n_images_without_gt == 1
+        assert any("sans GT" in w for w in report.warnings)
+    def test_gt_without_image_is_orphan(
+        self, service: CorpusService,
+    ) -> None:
+        zip_bytes = _make_zip({"orphan.gt.txt": b"text"})
+        report = service.import_zip(zip_bytes, corpus_name="x")
+        assert report.n_documents == 0
+        assert report.n_gt_without_image == 1
+        assert any("orpheline" in w for w in report.warnings)
+    def test_duplicate_image_stem_keeps_first(
+        self, service: CorpusService,
+    ) -> None:
+        zip_bytes = _make_zip({
+            "doc.png": _png_bytes(),
+            "doc.jpg": b"jpeg-bytes",
+            "doc.gt.txt": b"text",
+        })
+        report = service.import_zip(zip_bytes, corpus_name="x")
+        assert report.n_documents == 1
+        # One of the two images is skipped (with a warning).
+        assert any("partagent le stem" in w for w in report.warnings)
+    def test_hierarchical_paths_preserved_in_doc_id(
+        self, service: CorpusService,
+    ) -> None:
+        zip_bytes = _make_zip({
+            "volA/folio_001.png": _png_bytes(),
+            "volA/folio_001.gt.txt": b"x",
+            "volB/folio_002.png": _png_bytes(),
+            "volB/folio_002.gt.txt": b"y",
+        })
+        report = service.import_zip(zip_bytes, corpus_name="x")
+        assert report.n_documents == 2
+        doc_ids = sorted(d.id for d in report.spec.documents)
+        assert doc_ids == ["volA/folio_001", "volB/folio_002"]
+# ──────────────────────────────────────────────────────────────────
+# Silent filtering of OS artifacts
+# ──────────────────────────────────────────────────────────────────
+class TestOSNoiseFiltering:
+    def test_macosx_dir_is_skipped(self, service: CorpusService) -> None:
+        zip_bytes = _make_zip({
+            "doc.png": _png_bytes(),
+            "doc.gt.txt": b"x",
+            "__MACOSX/doc.png": b"macos-meta",
+            "__MACOSX/._doc.png": b"macos-meta-fork",
+        })
+        report = service.import_zip(zip_bytes, corpus_name="x")
+        assert report.n_documents == 1
+        assert report.n_skipped_noise >= 1
+    def test_dotunderscore_files_skipped(
+        self, service: CorpusService,
+    ) -> None:
+        zip_bytes = _make_zip({
+            "doc.png": _png_bytes(),
+            "._doc.png": b"resource-fork",
+        })
+        report = service.import_zip(zip_bytes, corpus_name="x")
+        assert report.n_documents == 1
+    def test_dsstore_skipped(self, service: CorpusService) -> None:
+        zip_bytes = _make_zip({
+            "doc.png": _png_bytes(),
+            ".DS_Store": b"finder-metadata",
+        })
+        report = service.import_zip(zip_bytes, corpus_name="x")
+        assert report.n_documents == 1
+        assert report.n_skipped_noise >= 1
+    def test_thumbsdb_skipped_case_insensitive(
+        self, service: CorpusService,
+    ) -> None:
+        zip_bytes = _make_zip({
+            "doc.png": _png_bytes(),
+            "Thumbs.db": b"win-thumbs",
+            "subdir/THUMBS.DB": b"more",
+        })
+        report = service.import_zip(zip_bytes, corpus_name="x")
+        assert report.n_documents == 1
+        assert report.n_skipped_noise >= 2
+# ──────────────────────────────────────────────────────────────────
+# Security: hard rejections
+# ──────────────────────────────────────────────────────────────────
+class TestSecurityRejections:
+    def test_traversal_in_arcname_is_rejected(
+        self, service: CorpusService,
+    ) -> None:
+        zip_bytes = _make_zip({"../escape.txt": b"evil"})
+        with pytest.raises(CorpusImportError, match="Traversal"):
+            service.import_zip(zip_bytes, corpus_name="x")
+    def test_absolute_unix_path_is_rejected(
+        self, service: CorpusService,
+    ) -> None:
+        zip_bytes = _make_zip({"/etc/passwd": b"root:x:0:0::/root:/bin/bash"})
+        with pytest.raises(CorpusImportError, match="absolu"):
+            service.import_zip(zip_bytes, corpus_name="x")
+    def test_absolute_windows_path_is_rejected(
+        self, service: CorpusService,
+    ) -> None:
+        zip_bytes = _make_zip({"C:/evil.txt": b"evil"})
+        with pytest.raises(CorpusImportError, match="absolu"):
+            service.import_zip(zip_bytes, corpus_name="x")
+    def test_corrupt_zip_raises(self, service: CorpusService) -> None:
+        with pytest.raises(CorpusImportError, match="invalide"):
+            service.import_zip(b"not a zip", corpus_name="x")
+    def test_zip_too_large_raises(
+        self, workspace: WorkspaceManager,
+    ) -> None:
+        small_service = CorpusService(workspace, max_zip_size_bytes=10)
+        zip_bytes = _make_zip({"doc.png": _png_bytes()})
+        assert len(zip_bytes) > 10
+        with pytest.raises(CorpusImportError, match="trop volumineux"):
+            small_service.import_zip(zip_bytes, corpus_name="x")
+    def test_too_many_entries_raises(
+        self, workspace: WorkspaceManager,
+    ) -> None:
+        cap_service = CorpusService(workspace, max_entry_count=3)
+        zip_bytes = _make_zip({f"f{i}.png": _png_bytes() for i in range(5)})
+        with pytest.raises(CorpusImportError, match="trop d'entrées"):
+            cap_service.import_zip(zip_bytes, corpus_name="x")
+    def test_uncompressed_too_large_raises(
+        self, workspace: WorkspaceManager,
+    ) -> None:
+        # 3 files of 100 bytes each, cap at 200 → rejected.
+        cap_service = CorpusService(
+            workspace, max_uncompressed_bytes=200,
+        )
+        zip_bytes = _make_zip({
+            f"f{i}.png": b"x" * 100 for i in range(3)
+        })
+        with pytest.raises(CorpusImportError, match="décompressé trop volumineux"):
+            cap_service.import_zip(zip_bytes, corpus_name="x")
+    def test_symlink_entry_rejected(
+        self, service: CorpusService,
+    ) -> None:
+        # Manually build a ZIP with an entry flagged as a
+        # symlink (UNIX mode 0xA000).
+        buf = io.BytesIO()
+        with zipfile.ZipFile(buf, mode="w") as zf:
+            info = zipfile.ZipInfo("evil_link")
+            info.external_attr = 0xA000 << 16  # S_IFLNK
+            zf.writestr(info, "/etc/passwd")
+        with pytest.raises(CorpusImportError, match="Symlink"):
+            service.import_zip(buf.getvalue(), corpus_name="x")
+# ──────────────────────────────────────────────────────────────────
+# Edge cases
+# ──────────────────────────────────────────────────────────────────
+class TestEdgeCases:
+    def test_empty_zip_yields_empty_corpus(
+        self, service: CorpusService,
+    ) -> None:
+        zip_bytes = _make_zip({})
+        report = service.import_zip(zip_bytes, corpus_name="x")
+        assert report.n_documents == 0
+        assert report.n_images_without_gt == 0
+        assert report.n_gt_without_image == 0
+    def test_unrecognized_extension_is_skipped(
+        self, service: CorpusService,
+    ) -> None:
+        zip_bytes = _make_zip({
+            "doc.png": _png_bytes(),
+            "doc.gt.txt": b"x",
+            "readme.md": b"# readme",
+        })
+        report = service.import_zip(zip_bytes, corpus_name="x")
+        assert report.n_documents == 1
+        # readme.md is skipped: neither an image nor a recognized GT.
+        assert "readme.md" in report.skipped_paths
+    def test_invalid_chars_in_doc_id_are_replaced(
+        self, service: CorpusService,
+    ) -> None:
+        # Spaces, parentheses, accented characters → replaced with _.
+        zip_bytes = _make_zip({
+            "doc avec espaces (BnF).png": _png_bytes(),
+            "doc avec espaces (BnF).gt.txt": b"x",
+        })
+        report = service.import_zip(zip_bytes, corpus_name="x")
+        assert report.n_documents == 1
+        doc = report.spec.documents[0]
+        # The doc_id no longer contains spaces or parentheses.
+        assert " " not in doc.id
+        assert "(" not in doc.id
+        assert ")" not in doc.id
+    def test_metadata_passes_through(
+        self, service: CorpusService,
+    ) -> None:
+        zip_bytes = _make_zip({"doc.png": _png_bytes()})
+        report = service.import_zip(
+            zip_bytes,
+            corpus_name="x",
+            metadata={"language": "fr", "period": "early_modern"},
+        )
+        assert report.spec.metadata == {
+            "language": "fr",
+            "period": "early_modern",
+        }
+    def test_multiple_imports_dont_collide(
+        self, service: CorpusService,
+    ) -> None:
+        """Two imports with distinct corpus_name values coexist."""
+        zb = _make_zip({"doc.png": _png_bytes()})
+        r1 = service.import_zip(zb, corpus_name="alpha")
+        r2 = service.import_zip(zb, corpus_name="beta")
+        assert r1.extracted_dir != r2.extracted_dir
+        assert r1.extracted_dir.exists()
+        assert r2.extracted_dir.exists()
+# ──────────────────────────────────────────────────────────────────
+# Smoke test: end-to-end import, then consumption by BenchmarkService
+# ──────────────────────────────────────────────────────────────────
+class TestSmokeIntegration:
+    def test_imported_corpus_is_consumable_by_benchmark_service(
+        self, service: CorpusService,
+    ) -> None:
+        """The import yields a CorpusSpec that is immediately usable;
+        checks the API end to end without running a real benchmark."""
+        zip_bytes = _make_zip({
+            "doc01.png": _png_bytes(),
+            "doc01.gt.txt": "première page".encode("utf-8"),
+            "doc02.png": _png_bytes(),
+            "doc02.gt.txt": "deuxième page".encode("utf-8"),
+            "doc02.gt.alto.xml": b"<alto/>",
+        })
+        report = service.import_zip(
+            zip_bytes,
+            corpus_name="bnf_test",
+            metadata={"language": "fr"},
+        )
+        assert report.n_documents == 2
+        # One doc with 1 GT (text), one with 2 GTs (text + ALTO).
+        gts_by_doc = {d.id: d.available_gt_types for d in report.spec.documents}
+        assert ArtifactType.RAW_TEXT in gts_by_doc["doc01"]
+        assert set(gts_by_doc["doc02"]) == {
+            ArtifactType.RAW_TEXT, ArtifactType.ALTO_XML,
+        }