feat(adapters): atomic_write, eliminates partial files on kill
Phase 9 of the ADR-0001 workstream: guarantees atomic disk
writes in the OCR/LLM/VLM adapters. Today, an adapter
killed during ``path.write_text(content)`` leaves a
partial/corrupted file (the first N bytes written, the rest missing).
For an institutional benchmark that reads these files downstream
(validation, metrics, report), an inconsistent state is
unacceptable.
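To make the failure mode concrete, here is a minimal sketch of what a naive write leaves behind when it is interrupted. The helper `simulate_interrupted_write` and the file name are hypothetical; the "kill" is imitated by simply stopping after the first few characters.

```python
import tempfile
from pathlib import Path


def simulate_interrupted_write(path: Path, content: str, cut_at: int) -> None:
    """Imitate a worker killed mid-write: only ``cut_at`` characters land."""
    with open(path, "w", encoding="utf-8") as f:
        # open(..., "w") has already truncated the previous content here.
        f.write(content[:cut_at])
        # The imagined kill happens now: the rest is never written.


with tempfile.TemporaryDirectory() as d:
    target = Path(d) / "page_0001.txt"  # hypothetical adapter output
    target.write_text("previous complete result", encoding="utf-8")
    simulate_interrupted_write(target, "new complete result", cut_at=3)
    leftover = target.read_text(encoding="utf-8")
```

After this runs, `leftover` holds only a fragment of the new content, and the previous complete result is gone. That is exactly the state a downstream reader must never see.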
## New helper ``adapters/_atomic_io.py``
- ``atomic_write_text(path, content, encoding="utf-8")``:
"write-to-tmp + fsync + rename" pattern which guarantees that
``path`` either holds the COMPLETE content or is left as it was.
- ``atomic_write_bytes(path, content)``: bytes variant.
- Tmp named ``<path>.<pid>.<tid>.tmp`` (collision-safe between
workers).
- ``os.fsync`` forces the OS flush to disk before the rename.
- ``os.replace`` (available on POSIX and Windows since Python 3.3;
atomic per POSIX ``rename(2)``) performs the final swap.
- Best-effort cleanup of the tmp on failure; the original
``path`` is preserved.
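The helper removes its tmp when the write fails with an exception, but a hard kill between the write and the rename can still leave a ``.tmp`` behind. A minimal sketch of a sweep that could run at the start of the next run follows. `sweep_orphan_tmps` is a hypothetical helper, not part of this commit, and it assumes no writer is active in the directory while it runs.

```python
from __future__ import annotations

import re
import tempfile
from pathlib import Path

# Matches "<name>.<pid>.<tid>.tmp", the naming scheme described above.
_TMP_RE = re.compile(r"^.+\.\d+\.\d+\.tmp$")


def sweep_orphan_tmps(directory: Path) -> list[Path]:
    """Delete leftover tmp files in ``directory`` and return what was removed.

    Assumption: nothing is writing into ``directory`` right now, otherwise
    an in-flight tmp could be deleted under its writer.
    """
    removed = []
    for entry in sorted(directory.iterdir()):
        if entry.is_file() and _TMP_RE.match(entry.name):
            entry.unlink(missing_ok=True)
            removed.append(entry)
    return removed


with tempfile.TemporaryDirectory() as d:
    root = Path(d)
    (root / "page.txt").write_text("complete", encoding="utf-8")
    (root / "page.txt.4242.139872.tmp").write_text("part", encoding="utf-8")
    (root / "notes.tmp").write_text("unrelated", encoding="utf-8")
    removed_names = [p.name for p in sweep_orphan_tmps(root)]
    remaining_names = sorted(p.name for p in root.iterdir())
```

The pattern is deliberately strict: a file such as `notes.tmp`, which lacks the PID and thread-ID components, is left alone.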
## Mechanical migration
A Python script migrated 11 ``write_text`` sites in the
pipeline-context adapters (where the cross-thread kill is real):
- 8 OCR adapters, 9 sites (tesseract×2, pero, kraken, calamari,
mistral, google, azure, precomputed)
- 1 LLM (base: site shared by the 4 concrete adapters)
- 1 VLM (base: site shared by the 4 concrete adapters)
The linter applied the changes; a broken multi-line import
in ``llm/base.py`` was caught by the audit (the script had inserted
the import in the middle of a ``from ... import (\n...\n)``) and was
then fixed manually.
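The broken import came from inserting text without regard to statement boundaries. One way a migration script could avoid that is to let `ast` report where the last top-level import ends. This is a sketch of that idea, not a description of the script actually used. `insert_import` and the `pkg` module names are hypothetical.

```python
import ast


def insert_import(source: str, import_line: str) -> str:
    """Insert ``import_line`` right after the last top-level import statement."""
    tree = ast.parse(source)
    last_end = 0  # 1-based line number where the last import statement ends
    for node in tree.body:
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            # end_lineno covers the closing ")" of a parenthesized import.
            last_end = max(last_end, node.end_lineno)
    lines = source.splitlines(keepends=True)
    if last_end and not lines[last_end - 1].endswith("\n"):
        lines[last_end - 1] += "\n"
    lines.insert(last_end, import_line + "\n")
    return "".join(lines)


SOURCE = (
    "import logging\n"
    "from pkg._retry import (\n"
    "    DEFAULT_BACKOFF_BASE as _DEFAULT_BACKOFF_BASE,\n"
    ")\n"
    "\n"
    "logger = logging.getLogger(__name__)\n"
)
patched = insert_import(SOURCE, "from pkg._atomic_io import atomic_write_text")
ast.parse(patched)  # raises SyntaxError if the insertion broke a statement
```

Because the insertion point is taken from the parsed statement's end line, the new import lands after the closing parenthesis and never inside the parenthesized list.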
## Out of scope
The **corpus importers** (IIIF, Gallica, HTR-United, eScript,
HuggingFace; 6 ``write_text``/``write_bytes`` sites) remain
unchanged. They run in one-shot mode outside the killable
pipeline context (a caller running ``picarones import gallica``
is not exposed to a cross-thread cancel). Phase 9b if a concrete
need arises.
The ``ArtifactStore`` already uses a manual tmp pattern (not
migrated, to avoid duplication).
## Validation
- 14 atomic_io tests (basic write, overwrite, unicode, empty,
error handling with a simulated full disk, atomicity: path
does not exist during the write, tmp includes PID+TID,
concurrent writes don't corrupt)
- 564 adapter tests all green (non-regression: the migration
is transparent with respect to the contract)
- **6151 tests in total**, 0 regressions (vs 6137 pre-9)
- ``ruff`` clean, architecture 184 green, sprint narrative
stable at 477
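The atomicity tests check the invariant in-process. A complementary check is to let a child process die between the tmp write and the rename, then confirm the target is untouched. The sketch below assumes that spawning a child interpreter is acceptable in the test environment. The child script inlines the pattern rather than importing `picarones`, its tmp name is simplified, and `os._exit` stands in for a kill.

```python
import subprocess
import sys
import tempfile
import textwrap
from pathlib import Path

# Child script: writes and fsyncs the tmp, then dies before the rename.
CHILD = textwrap.dedent(
    """
    import os, sys
    target = sys.argv[1]
    tmp = target + ".tmp"
    with open(tmp, "w", encoding="utf-8") as f:
        f.write("new content")
        f.flush()
        os.fsync(f.fileno())
    os._exit(1)  # abrupt exit standing in for a kill, before the rename
    os.replace(tmp, target)  # never reached
    """
)

with tempfile.TemporaryDirectory() as d:
    target = Path(d) / "out.txt"
    target.write_text("original content", encoding="utf-8")
    proc = subprocess.run([sys.executable, "-c", CHILD, str(target)])
    exit_code = proc.returncode
    surviving = target.read_text(encoding="utf-8")
    orphan_exists = Path(str(target) + ".tmp").exists()
```

The target keeps its original content, while the tmp stays behind as an orphan. This matches the "all or nothing" contract described in the module docstring.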
## Remaining for Phase 9b (future)
Richer resource reclamation:
- Tracking of the artifacts written per task (via the ``ArtifactStore``,
which timestamps every write).
- Cleanup of the artifacts of a zombie that completes late
(already flagged ``DEADLINE_EXCEEDED_ZOMBIE`` on the outcome side: the
information is there, the action is missing).
- Cleanup of subprocess scratch dirs on kill.
- Server-side ``SDK.cancel`` when available (OpenAI, Anthropic)
to save billed tokens.
https://claude.ai/code/session_01B93huMjNh4CG2rNcexgDeL
- picarones/adapters/_atomic_io.py +151 -0
- picarones/adapters/llm/base.py +2 -1
- picarones/adapters/ocr/azure_doc_intel.py +2 -1
- picarones/adapters/ocr/calamari.py +2 -1
- picarones/adapters/ocr/google_vision.py +2 -1
- picarones/adapters/ocr/kraken.py +2 -1
- picarones/adapters/ocr/mistral_ocr.py +2 -1
- picarones/adapters/ocr/pero_ocr.py +2 -1
- picarones/adapters/ocr/precomputed.py +2 -1
- picarones/adapters/ocr/tesseract.py +3 -2
- picarones/adapters/vlm/base.py +2 -1
- tests/adapters/test_atomic_io.py +245 -0
@@ -0,0 +1,151 @@
+"""Atomic write helpers for the adapters.
+
+Problem addressed
+-----------------
+The naive ``path.write_text(content)`` pattern is not atomic:
+if the worker is killed during the write (cross-thread cancel,
+SIGKILL of a subprocess, OS crash, power loss), the resulting
+file may be:
+
+- **partial** (the first N bytes written, the rest missing);
+- **empty** (the file was truncated to 0 by open(..., "w") but
+  the write never happened);
+- **mixed** with earlier content (rare case where the OS has not
+  flushed the old content).
+
+A consumer reading this file (validation, metric, report) sees
+an inconsistent state, which is unacceptable for an
+institutional benchmark.
+
+Solution
+--------
+"Write to tmp + rename" pattern:
+
+1. Write the complete content to a temporary file next to the
+   target path (``<path>.<pid>.<tid>.tmp``).
+2. ``fsync`` to force the flush to disk.
+3. Atomic ``rename`` to the target path.
+
+Guarantees:
+
+- **All or nothing**: either ``path`` holds the complete content,
+  or ``path`` is left as it was (and any orphaned ``.tmp``
+  can be cleaned up on the next run).
+- On POSIX, ``os.replace()`` is atomic (guaranteed by
+  ``rename(2)``). On Windows, ``os.replace()`` (available since
+  Python 3.3) also replaces an existing target.
+- If the rename fails, the ``.tmp`` is removed on a best-effort
+  basis so that no orphan is left behind.
+
+Cases not covered
+-----------------
+- The file system does not survive a power loss during
+  ``fsync`` (rare, depends on the FS). Out of scope: this is
+  an OS-level guarantee.
+- Several processes/threads write the same ``path``
+  simultaneously: the last write wins (normal POSIX
+  semantics). The caller must avoid this through its orchestration.
+"""
+
+from __future__ import annotations
+
+import logging
+import os
+import threading
+from pathlib import Path
+from typing import Union
+
+logger = logging.getLogger(__name__)
+
+
+def _tmp_path_for(path: Path) -> Path:
+    """Build a temporary path next to ``path``.
+
+    Includes PID + thread ID to avoid collisions when several
+    workers write into the same directory (a hypothetical case,
+    but better safe than sorry).
+    """
+    suffix = f".{os.getpid()}.{threading.get_ident()}.tmp"
+    return path.parent / (path.name + suffix)
+
+
+def atomic_write_text(
+    path: Union[str, Path],
+    content: str,
+    *,
+    encoding: str = "utf-8",
+) -> None:
+    """Write ``content`` to ``path`` atomically.
+
+    Pattern: write-to-tmp + fsync + rename.
+
+    If writing the tmp fails (disk full, permission, etc.), an
+    exception is propagated and the tmp is removed (best effort).
+    The original ``path`` (if it existed) is left unchanged.
+
+    If the final rename fails (unlikely in practice except for
+    permission denied on the directory), same behavior: tmp
+    removed, ``path`` unchanged.
+    """
+    path = Path(path)
+    tmp_path = _tmp_path_for(path)
+    try:
+        # The ``with`` guarantees the close before the rename, which
+        # matters on Windows where an open file cannot be renamed.
+        with open(tmp_path, "w", encoding=encoding, newline="") as f:
+            f.write(content)
+            f.flush()
+            # ``os.fsync`` forces the OS buffer to be flushed to disk.
+            # Without it, a hardware crash around the rename can
+            # leave an empty or partial file. Cost: a few ms per
+            # write, negligible next to benchmarks that take several
+            # seconds per document.
+            os.fsync(f.fileno())
+        # ``os.replace`` (rather than ``rename``) because it
+        # overwrites any existing ``path`` atomically and works
+        # cross-OS (POSIX + Windows, since Python 3.3).
+        os.replace(tmp_path, path)
+    except Exception:
+        # Best-effort cleanup of the tmp on failure.
+        try:
+            if tmp_path.exists():
+                tmp_path.unlink()
+        except OSError as cleanup_exc:
+            logger.warning(
+                "[atomic_io] échec cleanup du tmp %s : %s",
+                tmp_path, cleanup_exc,
+            )
+        raise
+
+
+def atomic_write_bytes(
+    path: Union[str, Path],
+    content: bytes,
+) -> None:
+    """Bytes variant of :func:`atomic_write_text`.
+
+    Same guarantee: ``path`` holds the complete content, or is
+    left as it was. Useful for non-text artifacts (intermediate
+    images, binary JSON blobs).
+    """
+    path = Path(path)
+    tmp_path = _tmp_path_for(path)
+    try:
+        with open(tmp_path, "wb") as f:
+            f.write(content)
+            f.flush()
+            os.fsync(f.fileno())
+        os.replace(tmp_path, path)
+    except Exception:
+        try:
+            if tmp_path.exists():
+                tmp_path.unlink()
+        except OSError as cleanup_exc:
+            logger.warning(
+                "[atomic_io] échec cleanup du tmp %s : %s",
+                tmp_path, cleanup_exc,
+            )
+        raise
+
+
+__all__ = ["atomic_write_text", "atomic_write_bytes"]
@@ -11,6 +11,7 @@ from typing import Any, Optional
 logger = logging.getLogger(__name__)
 
 
+from picarones.adapters._atomic_io import atomic_write_text
 from picarones.adapters._retry import (
     DEFAULT_BACKOFF_BASE as _DEFAULT_BACKOFF_BASE,
 )
@@ -561,7 +562,7 @@ class BaseLLMAdapter(ABC):
             suffix="corrected.txt",
             context=context,
         )
-
+        atomic_write_text(out_path, result.text, encoding="utf-8")
 
         return {
             ArtifactType.CORRECTED_TEXT: Artifact(
@@ -65,6 +65,7 @@ from pathlib import Path
 from typing import Any
 
 from picarones.adapters._retry import call_with_retry
+from picarones.adapters._atomic_io import atomic_write_text
 from picarones.adapters.ocr.base import BaseOCRAdapter, OCRAdapterError
 from picarones.adapters.output_paths import resolve_output_path
 from picarones.domain.artifacts import Artifact, ArtifactType
@@ -239,7 +240,7 @@ class AzureDocIntelAdapter(BaseOCRAdapter):
             suffix="txt",
             context=context,
         )
-
+        atomic_write_text(text_path, text, encoding="utf-8")
 
         return {
             ArtifactType.RAW_TEXT: Artifact(
@@ -54,6 +54,7 @@ from pathlib import Path
 from typing import Any
 
 from picarones.adapters.ocr.base import BaseOCRAdapter, OCRAdapterError
+from picarones.adapters._atomic_io import atomic_write_text
 from picarones.adapters.output_paths import resolve_output_path
 from picarones.domain.artifacts import Artifact, ArtifactType
 from picarones.pipeline.run_control import RunControl
@@ -235,7 +236,7 @@ class CalamariAdapter(BaseOCRAdapter):
             suffix="txt",
             context=context,
         )
-
+        atomic_write_text(text_path, text, encoding="utf-8")
 
         return {
             ArtifactType.RAW_TEXT: Artifact(
@@ -48,6 +48,7 @@ from pathlib import Path
 from typing import Any
 
 from picarones.adapters._retry import call_with_retry
+from picarones.adapters._atomic_io import atomic_write_text
 from picarones.adapters.ocr.base import BaseOCRAdapter, OCRAdapterError
 
 logger = logging.getLogger(__name__)
@@ -207,7 +208,7 @@ class GoogleVisionAdapter(BaseOCRAdapter):
             suffix="txt",
             context=context,
         )
-
+        atomic_write_text(text_path, text, encoding="utf-8")
 
         return {
             ArtifactType.RAW_TEXT: Artifact(
@@ -54,6 +54,7 @@ from pathlib import Path
 from typing import Any
 
 from picarones.adapters.ocr.base import BaseOCRAdapter, OCRAdapterError
+from picarones.adapters._atomic_io import atomic_write_text
 from picarones.adapters.output_paths import resolve_output_path
 from picarones.domain.artifacts import Artifact, ArtifactType
 from picarones.pipeline.run_control import RunControl
@@ -222,7 +223,7 @@ class KrakenAdapter(BaseOCRAdapter):
             suffix="txt",
             context=context,
         )
-
+        atomic_write_text(text_path, text, encoding="utf-8")
 
         return {
             ArtifactType.RAW_TEXT: Artifact(
@@ -58,6 +58,7 @@ from pathlib import Path
 from typing import Any
 
 from picarones.adapters._retry import call_with_retry
+from picarones.adapters._atomic_io import atomic_write_text
 from picarones.adapters.ocr.base import BaseOCRAdapter, OCRAdapterError
 from picarones.adapters.output_paths import resolve_output_path
 from picarones.domain.artifacts import Artifact, ArtifactType
@@ -245,7 +246,7 @@ class MistralOCRAdapter(BaseOCRAdapter):
             suffix="txt",
             context=context,
         )
-
+        atomic_write_text(text_path, text, encoding="utf-8")
 
         return {
             ArtifactType.RAW_TEXT: Artifact(
@@ -47,6 +47,7 @@ from pathlib import Path
 from typing import Any
 
 from picarones.adapters.ocr.base import BaseOCRAdapter, OCRAdapterError
+from picarones.adapters._atomic_io import atomic_write_text
 from picarones.adapters.output_paths import resolve_output_path
 from picarones.domain.artifacts import Artifact, ArtifactType
 from picarones.pipeline.run_control import RunControl
@@ -210,7 +211,7 @@ class PeroOCRAdapter(BaseOCRAdapter):
             suffix="txt",
             context=context,
         )
-
+        atomic_write_text(text_path, text, encoding="utf-8")
 
         return {
             ArtifactType.RAW_TEXT: Artifact(
@@ -97,6 +97,7 @@ from pathlib import Path
 from typing import Any, Literal
 
 from picarones.adapters.ocr.base import BaseOCRAdapter, OCRAdapterError
+from picarones.adapters._atomic_io import atomic_write_text
 from picarones.domain.artifacts import Artifact, ArtifactType
 from picarones.pipeline.run_control import RunControl
 
@@ -188,7 +189,7 @@ class PrecomputedTextAdapter(BaseOCRAdapter):
             # We create the empty file to stay consistent: every
             # ``Artifact`` produced has a URI pointing to a
             # readable file.
-
+            atomic_write_text(text_path, "", encoding="utf-8")
         else:
             raise OCRAdapterError(
                 f"{self.name} : fichier pré-calculé introuvable "
@@ -61,6 +61,7 @@ from pathlib import Path
 from typing import Any
 
 from picarones.adapters.ocr.base import BaseOCRAdapter, OCRAdapterError
+from picarones.adapters._atomic_io import atomic_write_text
 from picarones.adapters.output_paths import resolve_output_path
 from picarones.domain.artifacts import Artifact, ArtifactType
 from picarones.pipeline.run_control import RunControl
@@ -328,7 +329,7 @@ class TesseractAdapter(BaseOCRAdapter):
             suffix="txt",
             context=context,
         )
-
+        atomic_write_text(text_path, text, encoding="utf-8")
 
         outputs: dict = {
             ArtifactType.RAW_TEXT: Artifact(
@@ -528,7 +529,7 @@ class TesseractAdapter(BaseOCRAdapter):
         # ``write_confidences_sidecar`` : ``<stem>.<name>.alto.xml``).
         alto_path = text_path.with_suffix(".alto.xml")
         try:
-
+            atomic_write_text(alto_path, alto_xml, encoding="utf-8")
         except OSError as exc:
             logger.warning(
                 "[%s] ALTO non persisté (%s) — ALTO sauté.",
@@ -33,6 +33,7 @@ from pathlib import Path
 from typing import Any
 
 from picarones.adapters.llm.base import BaseLLMAdapter
+from picarones.adapters._atomic_io import atomic_write_text
 from picarones.domain.artifacts import Artifact, ArtifactType
 from picarones.domain.errors import AdapterStepError
 from picarones.pipeline.run_control import RunControl
@@ -228,7 +229,7 @@ class BaseVLMAdapter(BaseLLMAdapter):
             suffix="txt",
             context=context,
         )
-
+        atomic_write_text(out_path, result.text, encoding="utf-8")
 
         return {
             ArtifactType.RAW_TEXT: Artifact(
@@ -0,0 +1,245 @@
+"""Tests for ``picarones.adapters._atomic_io`` (Phase 9 ADR-0001).
+
+Ensures that ``atomic_write_text`` / ``atomic_write_bytes``:
+
+- Write the complete content.
+- Survive a mid-write kill without leaving a partial file.
+- Clean up the tmp on error (simulated full disk).
+- Are compatible with an existing ``path`` (rename replaces).
+- Work with unicode + bytes.
+"""
+
+from __future__ import annotations
+
+from pathlib import Path
+from unittest.mock import patch
+
+import pytest
+
+from picarones.adapters._atomic_io import (
+    atomic_write_bytes,
+    atomic_write_text,
+)
+
+
+class TestBasicWrite:
+    def test_write_text_creates_file(self, tmp_path: Path) -> None:
+        target = tmp_path / "out.txt"
+        atomic_write_text(target, "hello world")
+        assert target.read_text(encoding="utf-8") == "hello world"
+
+    def test_write_bytes_creates_file(self, tmp_path: Path) -> None:
+        target = tmp_path / "out.bin"
+        atomic_write_bytes(target, b"\x00\x01\x02\xff")
+        assert target.read_bytes() == b"\x00\x01\x02\xff"
+
+    def test_write_text_unicode(self, tmp_path: Path) -> None:
+        target = tmp_path / "unicode.txt"
+        content = "café — médiéval (œuvre du XIVᵉ siècle) — ⚜"
+        atomic_write_text(target, content)
+        assert target.read_text(encoding="utf-8") == content
+
+    def test_write_text_empty(self, tmp_path: Path) -> None:
+        """An empty file is valid content."""
+        target = tmp_path / "empty.txt"
+        atomic_write_text(target, "")
+        assert target.exists()
+        assert target.read_text() == ""
+
+    def test_write_text_accepts_str_path(self, tmp_path: Path) -> None:
+        """``path`` can be a str or a Path (Pathlike
+        consistency)."""
+        target = tmp_path / "from_str.txt"
+        atomic_write_text(str(target), "ok")
+        assert target.read_text() == "ok"
+
+
+class TestOverwrite:
+    def test_overwrite_existing_file(self, tmp_path: Path) -> None:
+        """``atomic_write_text`` replaces an existing file."""
+        target = tmp_path / "existing.txt"
+        target.write_text("old content")
+        atomic_write_text(target, "new content")
+        assert target.read_text() == "new content"
+
+    def test_overwrite_does_not_leak_tmp(self, tmp_path: Path) -> None:
+        """After an overwrite, no orphaned ``.tmp`` file is left
+        in the directory."""
+        target = tmp_path / "out.txt"
+        atomic_write_text(target, "v1")
+        atomic_write_text(target, "v2")
+        tmp_files = [
+            p for p in tmp_path.iterdir() if ".tmp" in p.name
+        ]
+        assert tmp_files == [], (
+            f"fichiers tmp orphelins : {tmp_files}"
+        )
+
+
+class TestErrorHandling:
+    def test_write_to_nonexistent_dir_raises(self, tmp_path: Path) -> None:
+        """Writing into a nonexistent directory must raise: the
+        helper does not create intermediate directories (creating
+        them is the caller's job)."""
+        target = tmp_path / "nonexistent" / "out.txt"
+        with pytest.raises(FileNotFoundError):
+            atomic_write_text(target, "x")
+
+    def test_target_unchanged_after_write_failure(
+        self, tmp_path: Path,
+    ) -> None:
+        """If writing the tmp fails, the original ``path`` (if it
+        existed) is left unchanged. Guarantees the
+        "all or nothing" semantics."""
+        target = tmp_path / "existing.txt"
+        target.write_text("original content")
+
+        # Simulate a failure by substituting an ``open`` that raises.
+        original_open = open
+
+        def failing_open(file, *args, **kwargs):
+            # Only fail on the tmp; let every other open through.
+            if str(file).endswith(".tmp"):
+                raise OSError("disk full simulated")
+            return original_open(file, *args, **kwargs)
+
+        with patch("builtins.open", side_effect=failing_open):
+            with pytest.raises(OSError, match="disk full"):
+                atomic_write_text(target, "new content")
+
+        # The original content MUST be preserved.
+        assert target.read_text() == "original content"
+
+    def test_tmp_cleaned_after_write_failure(
+        self, tmp_path: Path,
+    ) -> None:
+        """If the write fails, the tmp must be removed so that no
+        orphan is left on the filesystem."""
+        target = tmp_path / "out.txt"
+
+        original_open = open
+        opened_tmp_paths: list[str] = []
+
+        def failing_after_open(file, *args, **kwargs):
+            if str(file).endswith(".tmp"):
+                opened_tmp_paths.append(str(file))
+                f = original_open(file, *args, **kwargs)
+                # Make close raise (simulates a disk error).
+                original_close = f.close
+                def _close():
+                    original_close()
+                    raise OSError("disk error during close")
+                f.close = _close
+                return f
+            return original_open(file, *args, **kwargs)
+
+        # Simpler approach, used instead of the wrapper above: mock os.fsync so it raises.
+        from picarones.adapters import _atomic_io
+
+        def failing_fsync(fd):
+            raise OSError("fsync failed")
+
+        with patch.object(_atomic_io.os, "fsync", side_effect=failing_fsync):
+            with pytest.raises(OSError, match="fsync failed"):
+                atomic_write_text(target, "content")
+
+        # No orphaned tmp.
+        tmp_files = [
+            p for p in tmp_path.iterdir() if ".tmp" in p.name
+        ]
+        assert tmp_files == [], (
+            f"tmp orphelin après échec : {tmp_files}"
+        )
+
+
+class TestAtomicity:
+    """The key contract: a kill between write and rename never
+    leaves ``path`` in a partial state.
+
+    Hard to test directly (a Python subprocess cannot be
+    SIGKILLed cleanly from pytest), but the invariant can be
+    tested: during the write, ``path`` does not exist
+    (the content is in the tmp), and after the rename,
+    ``path`` exists with the complete content.
+    """
+
+    def test_path_does_not_exist_during_write(
+        self, tmp_path: Path,
+    ) -> None:
+        """While the tmp is being opened, ``path`` must not exist.
+        Checked via a side_effect that asserts when the tmp is opened."""
+        target = tmp_path / "out.txt"
+        assert not target.exists()
+
+        original_open = open
+
+        def open_with_check(file, *args, **kwargs):
+            f = original_open(file, *args, **kwargs)
+            if str(file).endswith(".tmp"):
+                # At this point, ``target`` must still not exist.
+                assert not target.exists(), (
+                    f"target {target} existe pendant l'écriture du tmp"
+                )
+            return f
+
+        with patch("builtins.open", side_effect=open_with_check):
+            atomic_write_text(target, "content")
+
+        assert target.exists()
+        assert target.read_text() == "content"
+
+    def test_tmp_path_is_in_same_dir(self, tmp_path: Path) -> None:
+        """The tmp must be in the SAME directory as the target
+        ``path``, otherwise the rename could cross filesystems
+        (rename(2) is atomic only within the same FS)."""
+        from picarones.adapters._atomic_io import _tmp_path_for
+
+        target = tmp_path / "subdir" / "out.txt"
+        target.parent.mkdir()
+        tmp = _tmp_path_for(target)
+        assert tmp.parent == target.parent
+
+    def test_tmp_path_includes_pid_and_tid(self) -> None:
+        """The tmp includes PID + thread ID to avoid collisions
+        between workers of the same pool."""
+        import os
+        import threading
+        from picarones.adapters._atomic_io import _tmp_path_for
+
+        target = Path("/tmp/test_collision.txt")
+        tmp = _tmp_path_for(target)
+        assert str(os.getpid()) in tmp.name
+        assert str(threading.get_ident()) in tmp.name
+
+
+class TestConcurrentWrites:
+    """Without blocking the caller: we only make sure that
+    concurrent writes to the SAME path do not corrupt
+    the result (semantics: the last one wins).
+    """
+
+    def test_concurrent_writes_yield_one_of_the_contents(
+        self, tmp_path: Path,
+    ) -> None:
+        import threading
+        target = tmp_path / "racy.txt"
+        contents = [f"writer-{i}" for i in range(8)]
+        threads = [
+            threading.Thread(
+                target=atomic_write_text,
+                args=(target, c),
+            )
+            for c in contents
+        ]
+        for t in threads:
+            t.start()
+        for t in threads:
+            t.join(timeout=5.0)
+        # The final content must be one of the written contents.
+        # No orphaned tmp.
+        final = target.read_text()
+        assert final in contents
+        tmp_files = [
+            p for p in tmp_path.iterdir() if ".tmp" in p.name
+        ]
+        assert tmp_files == [], f"tmp orphelins : {tmp_files}"