Spaces:
Sleeping
post-rewrite wiring audit: Phases 1-5 (sécurité, méthodologie, moteurs, zombie, naming)
Browse filesRéconciliation post-rewrite v2.0 d'après l'audit de chemins UI/API/runner.
Le rewrite avait laissé des options ignorées, des moteurs annoncés sans
backend, des chemins libres exposés, et un round-trip JSON appauvri.
# Phase 1 — Sécurité P0
- `output_dir` validé via ``validated_path`` dans
``api_htr_united_import`` et ``api_huggingface_import`` (path traversal).
- `db_path` validé dans ``/api/history/regressions`` ; env
``PICARONES_HISTORY_DB`` pour racines externes.
- ``flatten_zip_to_dir`` : détection des collisions de basename
(``a/img.png`` + ``b/img.png`` → renommage avec préfixe slug du dirname)
et appel à ``validate_image_safe`` sur les images extraites (anti zip
bomb passant les 500 Mo brut).
# Phase 2 — Méthodologie P0
- ``CompetitorConfig.pipeline_mode`` typé ``Literal[text_only,
text_and_image, zero_shot]`` ; suppression du fallback silencieux
``mode_map.get(..., "text_only")`` qui aliasait toute chaîne invalide.
- ``BenchmarkResult.from_dict`` / ``from_json_object`` restaurent
fidèlement les analyses avancées (confusion, taxonomy, NER, calibration,
philological, searchability, …) ; ``ReportGenerator.from_json`` délègue
→ rapport régénéré indistinguable du in-memory.
- ``partial_store`` : fingerprint SHA-256 stable (engine_config,
normalization_profile, char_exclude, fichiers corpus + mtime/size,
code version) suffixé au nom du partial → plus de réutilisation
illégale entre runs avec configs différentes.
# Phase 3 — Moteurs fantômes implémentés
- Nouveaux ``KrakenAdapter`` et ``CalamariAdapter`` (couche 5), lazy
imports + extras pyproject ``[kraken]`` / ``[calamari]``.
- ``ocr_adapter_from_name`` les expose ; ``_OCR_KWARGS_BUILDERS``
les mappe. CLI ``engines`` source-de-vérité unique avec
``/api/engines`` (matrice CLI ≡ Web).
# Phase 4 — Code zombie / wiring
- ``upload_purge_task`` (RGPD) branché au lifespan FastAPI ;
``create_job(payload={"corpus": req.corpus_path})`` pour que la
purge identifie les corpus actifs.
- ``/api/benchmark/start`` délègue à ``run_benchmark_thread_v2`` après
conversion ``BenchmarkRequest → BenchmarkRunRequest`` ; un seul worker
à patcher, ``/start`` marqué deprecated.
- ``HTRUnitedCatalogue.from_remote(timeout=5)`` avec fallback demo et
champ ``is_demo`` exposé pour l'UI ; ``PICARONES_HTR_UNITED_OFFLINE``
pour CI.
# Phase 5 — Naming
- ``CompetitorConfig`` → ``PipelineConfig`` (rupture immédiate, 11
fichiers touchés).
Tests : 4643 passed (vs 4600 avant), 12 skipped, 0 failed.
Nouveau fichier ``tests/security/test_phase1_post_rewrite_wiring.py``
(40 tests) couvre les 5 phases.
https://claude.ai/code/session_01ArfZ8kcgv7Cyda7VbJVmpn
- CLAUDE.md +2 -2
- README.md +3 -1
- picarones/adapters/ocr/__init__.py +8 -0
- picarones/adapters/ocr/calamari.py +249 -0
- picarones/adapters/ocr/factory.py +17 -0
- picarones/adapters/ocr/kraken.py +236 -0
- picarones/app/services/benchmark_runner.py +65 -3
- picarones/app/services/partial_store.py +178 -40
- picarones/evaluation/benchmark_result.py +132 -3
- picarones/evaluation/metric_result.py +25 -0
- picarones/interfaces/cli/__init__.py +47 -15
- picarones/interfaces/web/app.py +30 -2
- picarones/interfaces/web/benchmark_utils.py +97 -127
- picarones/interfaces/web/corpus_utils.py +112 -16
- picarones/interfaces/web/models.py +30 -4
- picarones/interfaces/web/routers/benchmark.py +12 -5
- picarones/interfaces/web/routers/history.py +43 -2
- picarones/interfaces/web/routers/importers.py +64 -11
- picarones/reports/html/generator.py +11 -48
- pyproject.toml +1 -0
- scripts/gen_readme_tables.py +2 -0
- tests/app/test_s9_resolver_collision.py +1 -1
- tests/app/test_sprint_d2b_partial_dir_resume.py +32 -5
- tests/architecture/test_file_budgets.py +11 -2
- tests/docs/test_readme_consistency.py +7 -0
- tests/docs/test_readme_dual_lang.py +6 -4
- tests/evaluation/metrics/test_sprint12_nouvelles_fonctionnalites.py +10 -4
- tests/integration/test_s9_prompt_loading_defenses.py +3 -3
- tests/security/test_phase1_post_rewrite_wiring.py +1013 -0
- tests/security/test_s1_zip_slip_attack.py +15 -6
- tests/web/routers/test_s4_history_router.py +31 -9
- tests/web/routers/test_s8_benchmark_router_branches.py +1 -1
- tests/web/test_s8_benchmark_utils_factory.py +58 -24
- tests/web/test_s9_ocr_engine_naming_contract.py +5 -5
- tests/web/test_s9_prompt_loading.py +4 -4
- tests/web/test_sprint6_web_interface.py +28 -7
|
@@ -116,7 +116,7 @@ picarones/
|
|
| 116 |
|
| 117 |
## État des tests et bugs historiques
|
| 118 |
|
| 119 |
-
`pytest tests/` → **
|
| 120 |
(post-S59). Les deselected sont les markers `live` (5 tests d'intégration
|
| 121 |
contre vraie API/binaire) + `network` (3 tests qui hit le réseau réel),
|
| 122 |
opt-in en local via `pytest -m live` ou `pytest -m network`. Le
|
|
@@ -268,7 +268,7 @@ détecte, arbitre, rend.
|
|
| 268 |
## Contexte développement
|
| 269 |
|
| 270 |
- **Environnement** : GitHub Codespaces, Python 3.11+
|
| 271 |
-
- **Tests** : `pytest tests/ -q` →
|
| 272 |
deselected, 0 failed (post-v2.0).
|
| 273 |
- **Manifeste architecture** : [`docs/explanation/architecture.md`](docs/explanation/architecture.md).
|
| 274 |
- **API publique stable** : [`docs/reference/api-stable.md`](docs/reference/api-stable.md).
|
|
|
|
| 116 |
|
| 117 |
## État des tests et bugs historiques
|
| 118 |
|
| 119 |
+
`pytest tests/` → **4650 passed, 12 skipped, 8 deselected, 0 failed**
|
| 120 |
(post-S59). Les deselected sont les markers `live` (5 tests d'intégration
|
| 121 |
contre vraie API/binaire) + `network` (3 tests qui hit le réseau réel),
|
| 122 |
opt-in en local via `pytest -m live` ou `pytest -m network`. Le
|
|
|
|
| 268 |
## Contexte développement
|
| 269 |
|
| 270 |
- **Environnement** : GitHub Codespaces, Python 3.11+
|
| 271 |
+
- **Tests** : `pytest tests/ -q` → 4650 passed, 9 skipped, 24
|
| 272 |
deselected, 0 failed (post-v2.0).
|
| 273 |
- **Manifeste architecture** : [`docs/explanation/architecture.md`](docs/explanation/architecture.md).
|
| 274 |
- **API publique stable** : [`docs/reference/api-stable.md`](docs/reference/api-stable.md).
|
|
@@ -201,7 +201,9 @@ For Docker, institutional deployment, or HuggingFace Spaces, see
|
|
| 201 |
| Engine | Type | Installation |
|
| 202 |
|--------|------|-------------|
|
| 203 |
| **Azure Doc Intelligence** | Cloud API | `AZURE_DOC_INTEL_ENDPOINT` + `AZURE_DOC_INTEL_KEY` |
|
|
|
|
| 204 |
| **Google Vision** | Cloud API | `GOOGLE_APPLICATION_CREDENTIALS` env var |
|
|
|
|
| 205 |
| **Mistral OCR** | Cloud API | `MISTRAL_API_KEY` env var |
|
| 206 |
| **Pero OCR** | Local Python | `pip install -e .[pero]` |
|
| 207 |
| **Tesseract 5** | Local CLI | `pip install pytesseract` + system binary |
|
|
@@ -395,7 +397,7 @@ ruff check picarones/ tests/
|
|
| 395 |
python -m mypy picarones/core/
|
| 396 |
```
|
| 397 |
|
| 398 |
-
**Test suite**: ~
|
| 399 |
floor at 85% (currently ~87%). The `network` marker excludes tests
|
| 400 |
requiring live HTTP. A handful of tests depend on optional engines
|
| 401 |
(`pero-ocr`, `pytesseract`) and are skipped/fail gracefully when
|
|
|
|
| 201 |
| Engine | Type | Installation |
|
| 202 |
|--------|------|-------------|
|
| 203 |
| **Azure Doc Intelligence** | Cloud API | `AZURE_DOC_INTEL_ENDPOINT` + `AZURE_DOC_INTEL_KEY` |
|
| 204 |
+
| **Calamari OCR** | Local Python | `pip install -e .[calamari]` + checkpoint |
|
| 205 |
| **Google Vision** | Cloud API | `GOOGLE_APPLICATION_CREDENTIALS` env var |
|
| 206 |
+
| **Kraken HTR** | Local Python | `pip install -e .[kraken]` + modèle `.mlmodel` |
|
| 207 |
| **Mistral OCR** | Cloud API | `MISTRAL_API_KEY` env var |
|
| 208 |
| **Pero OCR** | Local Python | `pip install -e .[pero]` |
|
| 209 |
| **Tesseract 5** | Local CLI | `pip install pytesseract` + system binary |
|
|
|
|
| 397 |
python -m mypy picarones/core/
|
| 398 |
```
|
| 399 |
|
| 400 |
+
**Test suite**: ~4650 tests, ~3 min on a modern laptop. Coverage
|
| 401 |
floor at 85% (currently ~87%). The `network` marker excludes tests
|
| 402 |
requiring live HTTP. A handful of tests depend on optional engines
|
| 403 |
(`pero-ocr`, `pytesseract`) and are skipped/fail gracefully when
|
|
@@ -8,6 +8,10 @@ Implémentations livrées
|
|
| 8 |
-----------------------
|
| 9 |
- ``TesseractAdapter`` — Tesseract 5 (OSS, CPU-bound).
|
| 10 |
- ``PeroOCRAdapter`` — Pero OCR (manuscrits, GPU recommandé).
|
|
|
|
|
|
|
|
|
|
|
|
|
| 11 |
- ``MistralOCRAdapter`` — Mistral OCR API (cloud).
|
| 12 |
- ``GoogleVisionAdapter`` — Google Vision API (cloud).
|
| 13 |
- ``AzureDocIntelAdapter`` — Azure Document Intelligence (cloud).
|
|
@@ -20,8 +24,10 @@ from __future__ import annotations
|
|
| 20 |
|
| 21 |
from picarones.adapters.ocr.azure_doc_intel import AzureDocIntelAdapter
|
| 22 |
from picarones.adapters.ocr.base import BaseOCRAdapter, OCRAdapterError
|
|
|
|
| 23 |
from picarones.adapters.ocr.factory import ocr_adapter_from_name
|
| 24 |
from picarones.adapters.ocr.google_vision import GoogleVisionAdapter
|
|
|
|
| 25 |
from picarones.adapters.ocr.mistral_ocr import MistralOCRAdapter
|
| 26 |
from picarones.adapters.ocr.pero_ocr import PeroOCRAdapter
|
| 27 |
from picarones.adapters.ocr.precomputed import PrecomputedTextAdapter
|
|
@@ -31,7 +37,9 @@ __all__ = [
|
|
| 31 |
"BaseOCRAdapter",
|
| 32 |
"OCRAdapterError",
|
| 33 |
"AzureDocIntelAdapter",
|
|
|
|
| 34 |
"GoogleVisionAdapter",
|
|
|
|
| 35 |
"MistralOCRAdapter",
|
| 36 |
"PeroOCRAdapter",
|
| 37 |
"PrecomputedTextAdapter",
|
|
|
|
| 8 |
-----------------------
|
| 9 |
- ``TesseractAdapter`` — Tesseract 5 (OSS, CPU-bound).
|
| 10 |
- ``PeroOCRAdapter`` — Pero OCR (manuscrits, GPU recommandé).
|
| 11 |
+
- ``KrakenAdapter`` — Kraken HTR (manuscrits + imprimés anciens,
|
| 12 |
+
écosystème HTR-United). Phase 3 chantier post-rewrite.
|
| 13 |
+
- ``CalamariAdapter`` — Calamari OCR (imprimés historiques,
|
| 14 |
+
TensorFlow). Phase 3 chantier post-rewrite.
|
| 15 |
- ``MistralOCRAdapter`` — Mistral OCR API (cloud).
|
| 16 |
- ``GoogleVisionAdapter`` — Google Vision API (cloud).
|
| 17 |
- ``AzureDocIntelAdapter`` — Azure Document Intelligence (cloud).
|
|
|
|
| 24 |
|
| 25 |
from picarones.adapters.ocr.azure_doc_intel import AzureDocIntelAdapter
|
| 26 |
from picarones.adapters.ocr.base import BaseOCRAdapter, OCRAdapterError
|
| 27 |
+
from picarones.adapters.ocr.calamari import CalamariAdapter
|
| 28 |
from picarones.adapters.ocr.factory import ocr_adapter_from_name
|
| 29 |
from picarones.adapters.ocr.google_vision import GoogleVisionAdapter
|
| 30 |
+
from picarones.adapters.ocr.kraken import KrakenAdapter
|
| 31 |
from picarones.adapters.ocr.mistral_ocr import MistralOCRAdapter
|
| 32 |
from picarones.adapters.ocr.pero_ocr import PeroOCRAdapter
|
| 33 |
from picarones.adapters.ocr.precomputed import PrecomputedTextAdapter
|
|
|
|
| 37 |
"BaseOCRAdapter",
|
| 38 |
"OCRAdapterError",
|
| 39 |
"AzureDocIntelAdapter",
|
| 40 |
+
"CalamariAdapter",
|
| 41 |
"GoogleVisionAdapter",
|
| 42 |
+
"KrakenAdapter",
|
| 43 |
"MistralOCRAdapter",
|
| 44 |
"PeroOCRAdapter",
|
| 45 |
"PrecomputedTextAdapter",
|
|
@@ -0,0 +1,249 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""``CalamariAdapter`` — adapter pour Calamari OCR.
|
| 2 |
+
|
| 3 |
+
Implémente le contrat ``BaseOCRAdapter`` (couche 5) :
|
| 4 |
+
``execute(inputs, params, context) → dict[ArtifactType, Artifact]``.
|
| 5 |
+
|
| 6 |
+
Cas d'usage BnF
|
| 7 |
+
---------------
|
| 8 |
+
Calamari est un OCR open-source basé TensorFlow / Keras, conçu pour
|
| 9 |
+
les imprimés historiques et la transcription ligne par ligne.
|
| 10 |
+
Modèles disponibles via OCR-D, Wikisource, et le hub Calamari.
|
| 11 |
+
Particulièrement performant en ensemble (vote multi-modèles).
|
| 12 |
+
|
| 13 |
+
Configuration
|
| 14 |
+
-------------
|
| 15 |
+
Constructeur :
|
| 16 |
+
|
| 17 |
+
- ``name`` (défaut ``"calamari"``) : identifiant de l'instance.
|
| 18 |
+
- ``checkpoint`` (obligatoire) : chemin vers le modèle Calamari
|
| 19 |
+
(fichier ``.ckpt`` ou répertoire de modèles pour le voting).
|
| 20 |
+
- ``voter`` (défaut ``"confidence_voter_default_ctc"``) : stratégie
|
| 21 |
+
de vote quand plusieurs modèles sont passés en ensemble.
|
| 22 |
+
- ``batch_size`` (défaut ``1``) : taille de batch pour l'inférence
|
| 23 |
+
ligne par ligne. ``1`` privilégie la simplicité ; augmenter pour
|
| 24 |
+
un gain de débit GPU.
|
| 25 |
+
|
| 26 |
+
Comportement
|
| 27 |
+
------------
|
| 28 |
+
1. Vérifie qu'un ``Artifact`` ``IMAGE`` est présent.
|
| 29 |
+
2. Lazy-import de ``calamari_ocr`` et ``PIL``.
|
| 30 |
+
3. Charge le ``Predictor`` (cache par instance).
|
| 31 |
+
4. Calamari attend des **lignes** d'image, pas des pages. L'adapter
|
| 32 |
+
ne fait pas de segmentation : il OCRise l'image entière comme
|
| 33 |
+
une ligne unique. Pour un workflow page → lignes, l'utilisateur
|
| 34 |
+
doit pré-segmenter (Kraken pageseg ou OCR-D segmenter) et appeler
|
| 35 |
+
Calamari sur chaque ligne séparément — futur enrichissement à
|
| 36 |
+
prévoir quand un consommateur en aura besoin.
|
| 37 |
+
5. Écrit la prédiction dans ``<stem>.<name>.txt``.
|
| 38 |
+
|
| 39 |
+
Anti-sur-ingénierie
|
| 40 |
+
-------------------
|
| 41 |
+
- Pas de segmentation embarquée — Calamari est un *line recognizer*,
|
| 42 |
+
pas un page OCR. L'utilisateur compose avec un segmenter externe
|
| 43 |
+
s'il a besoin du flux page → lignes.
|
| 44 |
+
- Pas de confidences pour l'instant — Calamari expose
|
| 45 |
+
``Prediction.avg_char_probability`` qui pourra alimenter un
|
| 46 |
+
``CONFIDENCES`` artifact dans une itération future.
|
| 47 |
+
- Modèle chargé une fois par instance, partagé entre appels successifs
|
| 48 |
+
(Predictor TensorFlow non recréé à chaque image).
|
| 49 |
+
"""
|
| 50 |
+
|
| 51 |
+
from __future__ import annotations
|
| 52 |
+
|
| 53 |
+
from pathlib import Path
|
| 54 |
+
from typing import Any
|
| 55 |
+
|
| 56 |
+
from picarones.adapters.ocr.base import BaseOCRAdapter, OCRAdapterError
|
| 57 |
+
from picarones.adapters.output_paths import resolve_output_path
|
| 58 |
+
from picarones.domain.artifacts import Artifact, ArtifactType
|
| 59 |
+
|
| 60 |
+
|
| 61 |
+
class CalamariAdapter(BaseOCRAdapter):
|
| 62 |
+
"""Adapter Calamari OCR (Sprint Phase 3 — chantier post-rewrite).
|
| 63 |
+
|
| 64 |
+
Parameters
|
| 65 |
+
----------
|
| 66 |
+
name:
|
| 67 |
+
Identifiant lisible de l'instance. Défaut ``"calamari"``.
|
| 68 |
+
Doit être alphanumérique + ``_-``.
|
| 69 |
+
checkpoint:
|
| 70 |
+
Chemin vers le checkpoint Calamari (``.ckpt`` ou dossier de
|
| 71 |
+
modèles pour ensemble voting). **Obligatoire** : Calamari
|
| 72 |
+
n'embarque pas de modèle par défaut.
|
| 73 |
+
voter:
|
| 74 |
+
Nom de la stratégie de vote pour ensembles multi-modèles.
|
| 75 |
+
Défaut ``"confidence_voter_default_ctc"``.
|
| 76 |
+
batch_size:
|
| 77 |
+
Taille de batch pour l'inférence. Défaut 1.
|
| 78 |
+
|
| 79 |
+
Raises
|
| 80 |
+
------
|
| 81 |
+
OCRAdapterError
|
| 82 |
+
Si ``name`` invalide, ``checkpoint`` vide ou inexistant.
|
| 83 |
+
"""
|
| 84 |
+
|
| 85 |
+
input_types = frozenset({ArtifactType.IMAGE})
|
| 86 |
+
output_types = frozenset({ArtifactType.RAW_TEXT})
|
| 87 |
+
execution_mode = "cpu"
|
| 88 |
+
|
| 89 |
+
def __init__(
|
| 90 |
+
self,
|
| 91 |
+
*,
|
| 92 |
+
name: str = "calamari",
|
| 93 |
+
checkpoint: str | Path | None = None,
|
| 94 |
+
voter: str = "confidence_voter_default_ctc",
|
| 95 |
+
batch_size: int = 1,
|
| 96 |
+
) -> None:
|
| 97 |
+
if not name or not name.strip():
|
| 98 |
+
raise OCRAdapterError(
|
| 99 |
+
"CalamariAdapter : name vide non autorisé.",
|
| 100 |
+
)
|
| 101 |
+
if not all(c.isalnum() or c in "_-" for c in name):
|
| 102 |
+
raise OCRAdapterError(
|
| 103 |
+
f"CalamariAdapter : name invalide {name!r} — "
|
| 104 |
+
"alphanumérique + _ - uniquement.",
|
| 105 |
+
)
|
| 106 |
+
if not checkpoint:
|
| 107 |
+
raise OCRAdapterError(
|
| 108 |
+
"CalamariAdapter : checkpoint est obligatoire — "
|
| 109 |
+
"Calamari n'embarque pas de modèle par défaut. "
|
| 110 |
+
"Télécharger un modèle depuis le hub Calamari et "
|
| 111 |
+
"pointer son chemin (``.ckpt`` ou dossier).",
|
| 112 |
+
)
|
| 113 |
+
if batch_size < 1:
|
| 114 |
+
raise OCRAdapterError(
|
| 115 |
+
f"CalamariAdapter : batch_size doit être ≥ 1, reçu "
|
| 116 |
+
f"{batch_size}.",
|
| 117 |
+
)
|
| 118 |
+
self._name = name
|
| 119 |
+
self._checkpoint = Path(checkpoint)
|
| 120 |
+
self._voter = voter
|
| 121 |
+
self._batch_size = batch_size
|
| 122 |
+
# Predictor chargé paresseusement.
|
| 123 |
+
self._predictor: Any | None = None
|
| 124 |
+
|
| 125 |
+
@property
|
| 126 |
+
def name(self) -> str:
|
| 127 |
+
return self._name
|
| 128 |
+
|
| 129 |
+
@property
|
| 130 |
+
def checkpoint(self) -> Path:
|
| 131 |
+
return self._checkpoint
|
| 132 |
+
|
| 133 |
+
@property
|
| 134 |
+
def voter(self) -> str:
|
| 135 |
+
return self._voter
|
| 136 |
+
|
| 137 |
+
@property
|
| 138 |
+
def batch_size(self) -> int:
|
| 139 |
+
return self._batch_size
|
| 140 |
+
|
| 141 |
+
def execute(
|
| 142 |
+
self,
|
| 143 |
+
inputs: dict[ArtifactType, Artifact],
|
| 144 |
+
params: dict[str, Any],
|
| 145 |
+
context: Any,
|
| 146 |
+
) -> dict[ArtifactType, Artifact]:
|
| 147 |
+
"""Exécute Calamari sur l'image fournie.
|
| 148 |
+
|
| 149 |
+
Raises
|
| 150 |
+
------
|
| 151 |
+
OCRAdapterError
|
| 152 |
+
- input ``IMAGE`` absent ou sans URI ;
|
| 153 |
+
- fichier image / checkpoint introuvable ;
|
| 154 |
+
- ``calamari_ocr`` non installé ;
|
| 155 |
+
- erreur Calamari (modèle invalide, inférence).
|
| 156 |
+
"""
|
| 157 |
+
if ArtifactType.IMAGE not in inputs:
|
| 158 |
+
raise OCRAdapterError(f"{self.name} : input IMAGE manquant.")
|
| 159 |
+
image_artifact = inputs[ArtifactType.IMAGE]
|
| 160 |
+
if image_artifact.uri is None:
|
| 161 |
+
raise OCRAdapterError(
|
| 162 |
+
f"{self.name} : artefact image {image_artifact.id!r} "
|
| 163 |
+
"sans URI.",
|
| 164 |
+
)
|
| 165 |
+
image_path = Path(image_artifact.uri)
|
| 166 |
+
if not image_path.exists():
|
| 167 |
+
raise OCRAdapterError(
|
| 168 |
+
f"{self.name} : image introuvable {image_path!r}.",
|
| 169 |
+
)
|
| 170 |
+
if not self._checkpoint.exists():
|
| 171 |
+
raise OCRAdapterError(
|
| 172 |
+
f"{self.name} : checkpoint introuvable "
|
| 173 |
+
f"{self._checkpoint!r}.",
|
| 174 |
+
)
|
| 175 |
+
|
| 176 |
+
# Lazy-import — message explicite si dépendance absente.
|
| 177 |
+
try:
|
| 178 |
+
import numpy as np
|
| 179 |
+
from calamari_ocr.ocr.predict.predictor import ( # type: ignore[import-not-found]
|
| 180 |
+
Predictor,
|
| 181 |
+
PredictorParams,
|
| 182 |
+
)
|
| 183 |
+
from PIL import Image
|
| 184 |
+
except ImportError as exc:
|
| 185 |
+
raise OCRAdapterError(
|
| 186 |
+
f"{self.name} : calamari-ocr non installé. "
|
| 187 |
+
"Installer avec : pip install 'picarones[calamari]' "
|
| 188 |
+
"(ou 'pip install calamari-ocr>=2.0').",
|
| 189 |
+
) from exc
|
| 190 |
+
|
| 191 |
+
# Charger le Predictor une seule fois.
|
| 192 |
+
if self._predictor is None:
|
| 193 |
+
try:
|
| 194 |
+
params = PredictorParams()
|
| 195 |
+
params.silent = True
|
| 196 |
+
self._predictor = Predictor.from_checkpoint(
|
| 197 |
+
params=params,
|
| 198 |
+
checkpoint=str(self._checkpoint),
|
| 199 |
+
)
|
| 200 |
+
except Exception as exc:
|
| 201 |
+
raise OCRAdapterError(
|
| 202 |
+
f"{self.name} : chargement checkpoint "
|
| 203 |
+
f"{self._checkpoint!r} échoué : "
|
| 204 |
+
f"{type(exc).__name__}: {exc}",
|
| 205 |
+
) from exc
|
| 206 |
+
|
| 207 |
+
# OCR ligne : Calamari attend des numpy arrays grayscale.
|
| 208 |
+
try:
|
| 209 |
+
with Image.open(image_path) as image:
|
| 210 |
+
img_array = np.array(image.convert("L"))
|
| 211 |
+
results = list(self._predictor.predict_raw([img_array]))
|
| 212 |
+
if not results:
|
| 213 |
+
text = ""
|
| 214 |
+
else:
|
| 215 |
+
# Calamari ≥ 2.0 retourne des PredictionResult avec
|
| 216 |
+
# ``.outputs.sentence`` (post-voting) ou ``.sentence``.
|
| 217 |
+
result = results[0]
|
| 218 |
+
if hasattr(result, "outputs"):
|
| 219 |
+
text = getattr(result.outputs, "sentence", "")
|
| 220 |
+
else:
|
| 221 |
+
text = getattr(result, "sentence", "")
|
| 222 |
+
except Exception as exc:
|
| 223 |
+
raise OCRAdapterError(
|
| 224 |
+
f"{self.name} : Calamari a levé sur "
|
| 225 |
+
f"{image_path!r} : {type(exc).__name__}: {exc}",
|
| 226 |
+
) from exc
|
| 227 |
+
|
| 228 |
+
text = (text or "").strip()
|
| 229 |
+
|
| 230 |
+
text_path = resolve_output_path(
|
| 231 |
+
input_path=image_path,
|
| 232 |
+
adapter_name=self.name,
|
| 233 |
+
suffix="txt",
|
| 234 |
+
context=context,
|
| 235 |
+
)
|
| 236 |
+
text_path.write_text(text, encoding="utf-8")
|
| 237 |
+
|
| 238 |
+
return {
|
| 239 |
+
ArtifactType.RAW_TEXT: Artifact(
|
| 240 |
+
id=f"{context.document_id}:{self.name}:raw_text",
|
| 241 |
+
document_id=context.document_id,
|
| 242 |
+
type=ArtifactType.RAW_TEXT,
|
| 243 |
+
produced_by_step="ocr",
|
| 244 |
+
uri=str(text_path),
|
| 245 |
+
),
|
| 246 |
+
}
|
| 247 |
+
|
| 248 |
+
|
| 249 |
+
__all__ = ["CalamariAdapter"]
|
|
@@ -46,12 +46,15 @@ _ALIASES: dict[str, str] = {
|
|
| 46 |
"gv": "google_vision",
|
| 47 |
"azure": "azure_doc_intel",
|
| 48 |
"adi": "azure_doc_intel",
|
|
|
|
| 49 |
}
|
| 50 |
|
| 51 |
#: Liste des noms canoniques supportés pour les messages d'erreur.
|
| 52 |
_SUPPORTED: tuple[str, ...] = (
|
| 53 |
"tesseract",
|
| 54 |
"pero_ocr",
|
|
|
|
|
|
|
| 55 |
"mistral_ocr",
|
| 56 |
"google_vision",
|
| 57 |
"azure_doc_intel",
|
|
@@ -126,6 +129,20 @@ def ocr_adapter_from_name(
|
|
| 126 |
) from exc
|
| 127 |
return PeroOCRAdapter(**kwargs)
|
| 128 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 129 |
if canonical == "mistral_ocr":
|
| 130 |
try:
|
| 131 |
from picarones.adapters.ocr.mistral_ocr import MistralOCRAdapter
|
|
|
|
| 46 |
"gv": "google_vision",
|
| 47 |
"azure": "azure_doc_intel",
|
| 48 |
"adi": "azure_doc_intel",
|
| 49 |
+
"calamari_ocr": "calamari",
|
| 50 |
}
|
| 51 |
|
| 52 |
#: Liste des noms canoniques supportés pour les messages d'erreur.
|
| 53 |
_SUPPORTED: tuple[str, ...] = (
|
| 54 |
"tesseract",
|
| 55 |
"pero_ocr",
|
| 56 |
+
"kraken",
|
| 57 |
+
"calamari",
|
| 58 |
"mistral_ocr",
|
| 59 |
"google_vision",
|
| 60 |
"azure_doc_intel",
|
|
|
|
| 129 |
) from exc
|
| 130 |
return PeroOCRAdapter(**kwargs)
|
| 131 |
|
| 132 |
+
if canonical == "kraken":
|
| 133 |
+
# Phase 3 chantier post-rewrite : implémentation réelle de
|
| 134 |
+
# l'adapter ``kraken``, qui était annoncé par ``/api/engines``
|
| 135 |
+
# mais sans backend. Lazy-import : la classe elle-même est
|
| 136 |
+
# importable sans ``kraken``, c'est ``execute()`` qui exigera
|
| 137 |
+
# la dépendance.
|
| 138 |
+
from picarones.adapters.ocr.kraken import KrakenAdapter
|
| 139 |
+
return KrakenAdapter(**kwargs)
|
| 140 |
+
|
| 141 |
+
if canonical == "calamari":
|
| 142 |
+
# Phase 3 chantier post-rewrite : implémentation réelle.
|
| 143 |
+
from picarones.adapters.ocr.calamari import CalamariAdapter
|
| 144 |
+
return CalamariAdapter(**kwargs)
|
| 145 |
+
|
| 146 |
if canonical == "mistral_ocr":
|
| 147 |
try:
|
| 148 |
from picarones.adapters.ocr.mistral_ocr import MistralOCRAdapter
|
|
@@ -0,0 +1,236 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""``KrakenAdapter`` — adapter pour Kraken HTR (manuscrits + imprimés).
|
| 2 |
+
|
| 3 |
+
Implémente le contrat ``BaseOCRAdapter`` (couche 5) :
|
| 4 |
+
``execute(inputs, params, context) → dict[ArtifactType, Artifact]``.
|
| 5 |
+
|
| 6 |
+
Cas d'usage BnF
|
| 7 |
+
---------------
|
| 8 |
+
Kraken est l'engine open-source de référence pour les manuscrits et
|
| 9 |
+
imprimés anciens où Tesseract ne fonctionne pas — segmentation par
|
| 10 |
+
ligne de base + reconnaissance LSTM. C'est l'OCR ciblé par
|
| 11 |
+
HTR-United, l'écosystème de partage de modèles HTR pour le
|
| 12 |
+
patrimoine écrit.
|
| 13 |
+
|
| 14 |
+
Configuration
|
| 15 |
+
-------------
|
| 16 |
+
Constructeur :
|
| 17 |
+
|
| 18 |
+
- ``name`` (défaut ``"kraken"``) : identifiant de l'instance.
|
| 19 |
+
- ``model_path`` (obligatoire) : chemin vers le modèle ``.mlmodel``.
|
| 20 |
+
Kraken ne fournit pas de modèle par défaut — l'utilisateur doit en
|
| 21 |
+
pointer un (téléchargeable depuis https://htr-united.github.io/ ou
|
| 22 |
+
https://zenodo.org).
|
| 23 |
+
- ``binarize`` (défaut ``True``) : applique la binarisation
|
| 24 |
+
``nlbin`` avant segmentation.
|
| 25 |
+
- ``text_direction`` (défaut ``"horizontal-lr"``) : direction du
|
| 26 |
+
texte (passée à ``pageseg.segment``).
|
| 27 |
+
|
| 28 |
+
Comportement
|
| 29 |
+
------------
|
| 30 |
+
1. Vérifie qu'un ``Artifact`` ``IMAGE`` est présent.
|
| 31 |
+
2. Lazy-import de ``kraken`` et ``PIL``.
|
| 32 |
+
3. Charge le modèle (cache par instance).
|
| 33 |
+
4. Binarise + segmente l'image.
|
| 34 |
+
5. Reconnaît chaque ligne, concatène avec un saut de ligne.
|
| 35 |
+
6. Écrit le résultat dans ``<stem>.<name>.txt`` à côté de l'image.
|
| 36 |
+
|
| 37 |
+
Anti-sur-ingénierie
|
| 38 |
+
-------------------
|
| 39 |
+
- Pas d'extraction de confidences pour l'instant — Kraken expose des
|
| 40 |
+
scores par caractère via ``rpred``, à brancher quand un caller en
|
| 41 |
+
aura besoin (les VOC types de confidences sont par-token, ici on a
|
| 42 |
+
par-char).
|
| 43 |
+
- Pas de support batch — un appel par image.
|
| 44 |
+
- Pas de retry — si Kraken plante, on remonte ``OCRAdapterError``.
|
| 45 |
+
- ``execution_mode="cpu"`` même si Kraken peut tourner sur GPU :
|
| 46 |
+
la décision pool est laissée au runner (un opérateur GPU peut
|
| 47 |
+
exporter ``CUDA_VISIBLE_DEVICES`` et tourner en ThreadPool sans
|
| 48 |
+
conflit).
|
| 49 |
+
"""
|
| 50 |
+
|
| 51 |
+
from __future__ import annotations
|
| 52 |
+
|
| 53 |
+
from pathlib import Path
|
| 54 |
+
from typing import Any
|
| 55 |
+
|
| 56 |
+
from picarones.adapters.ocr.base import BaseOCRAdapter, OCRAdapterError
|
| 57 |
+
from picarones.adapters.output_paths import resolve_output_path
|
| 58 |
+
from picarones.domain.artifacts import Artifact, ArtifactType
|
| 59 |
+
|
| 60 |
+
|
| 61 |
+
class KrakenAdapter(BaseOCRAdapter):
|
| 62 |
+
"""Adapter Kraken HTR (Sprint Phase 3 — chantier post-rewrite).
|
| 63 |
+
|
| 64 |
+
Parameters
|
| 65 |
+
----------
|
| 66 |
+
name:
|
| 67 |
+
Identifiant lisible de l'instance. Défaut ``"kraken"``.
|
| 68 |
+
Doit être alphanumérique + ``_-``.
|
| 69 |
+
model_path:
|
| 70 |
+
Chemin vers le modèle ``.mlmodel`` Kraken. **Obligatoire** :
|
| 71 |
+
Kraken n'embarque pas de modèle par défaut.
|
| 72 |
+
binarize:
|
| 73 |
+
Si ``True`` (défaut), applique ``binarization.nlbin`` avant
|
| 74 |
+
segmentation. À désactiver pour des images déjà binarisées.
|
| 75 |
+
text_direction:
|
| 76 |
+
Direction de lecture passée à ``pageseg.segment``. Défaut
|
| 77 |
+
``"horizontal-lr"`` (gauche-droite horizontal).
|
| 78 |
+
|
| 79 |
+
Raises
|
| 80 |
+
------
|
| 81 |
+
OCRAdapterError
|
| 82 |
+
Si ``name`` invalide, ``model_path`` vide ou inexistant.
|
| 83 |
+
"""
|
| 84 |
+
|
| 85 |
+
input_types = frozenset({ArtifactType.IMAGE})
|
| 86 |
+
output_types = frozenset({ArtifactType.RAW_TEXT})
|
| 87 |
+
execution_mode = "cpu"
|
| 88 |
+
|
| 89 |
+
def __init__(
|
| 90 |
+
self,
|
| 91 |
+
*,
|
| 92 |
+
name: str = "kraken",
|
| 93 |
+
model_path: str | Path | None = None,
|
| 94 |
+
binarize: bool = True,
|
| 95 |
+
text_direction: str = "horizontal-lr",
|
| 96 |
+
) -> None:
|
| 97 |
+
if not name or not name.strip():
|
| 98 |
+
raise OCRAdapterError(
|
| 99 |
+
"KrakenAdapter : name vide non autorisé.",
|
| 100 |
+
)
|
| 101 |
+
if not all(c.isalnum() or c in "_-" for c in name):
|
| 102 |
+
raise OCRAdapterError(
|
| 103 |
+
f"KrakenAdapter : name invalide {name!r} — "
|
| 104 |
+
"alphanumérique + _ - uniquement.",
|
| 105 |
+
)
|
| 106 |
+
if not model_path:
|
| 107 |
+
raise OCRAdapterError(
|
| 108 |
+
"KrakenAdapter : model_path est obligatoire — Kraken "
|
| 109 |
+
"n'embarque pas de modèle par défaut. Télécharger un "
|
| 110 |
+
"modèle ``.mlmodel`` depuis HTR-United "
|
| 111 |
+
"(https://htr-united.github.io/) et pointer son chemin.",
|
| 112 |
+
)
|
| 113 |
+
self._name = name
|
| 114 |
+
self._model_path = Path(model_path)
|
| 115 |
+
self._binarize = binarize
|
| 116 |
+
self._text_direction = text_direction
|
| 117 |
+
# Modèle chargé paresseusement à la première utilisation
|
| 118 |
+
# — partagé entre les appels successifs de la même instance.
|
| 119 |
+
self._model: Any | None = None
|
| 120 |
+
|
| 121 |
+
@property
|
| 122 |
+
def name(self) -> str:
|
| 123 |
+
return self._name
|
| 124 |
+
|
| 125 |
+
@property
|
| 126 |
+
def model_path(self) -> Path:
|
| 127 |
+
return self._model_path
|
| 128 |
+
|
| 129 |
+
@property
|
| 130 |
+
def binarize(self) -> bool:
|
| 131 |
+
return self._binarize
|
| 132 |
+
|
| 133 |
+
@property
|
| 134 |
+
def text_direction(self) -> str:
|
| 135 |
+
return self._text_direction
|
| 136 |
+
|
| 137 |
+
def execute(
|
| 138 |
+
self,
|
| 139 |
+
inputs: dict[ArtifactType, Artifact],
|
| 140 |
+
params: dict[str, Any],
|
| 141 |
+
context: Any,
|
| 142 |
+
) -> dict[ArtifactType, Artifact]:
|
| 143 |
+
"""Exécute Kraken sur l'image fournie.
|
| 144 |
+
|
| 145 |
+
Raises
|
| 146 |
+
------
|
| 147 |
+
OCRAdapterError
|
| 148 |
+
- input ``IMAGE`` absent ou sans URI ;
|
| 149 |
+
- fichier image / modèle introuvable ;
|
| 150 |
+
- ``kraken`` ou ``PIL`` non installé ;
|
| 151 |
+
- erreur Kraken (segmentation, reconnaissance).
|
| 152 |
+
"""
|
| 153 |
+
if ArtifactType.IMAGE not in inputs:
|
| 154 |
+
raise OCRAdapterError(f"{self.name} : input IMAGE manquant.")
|
| 155 |
+
image_artifact = inputs[ArtifactType.IMAGE]
|
| 156 |
+
if image_artifact.uri is None:
|
| 157 |
+
raise OCRAdapterError(
|
| 158 |
+
f"{self.name} : artefact image {image_artifact.id!r} "
|
| 159 |
+
"sans URI.",
|
| 160 |
+
)
|
| 161 |
+
image_path = Path(image_artifact.uri)
|
| 162 |
+
if not image_path.exists():
|
| 163 |
+
raise OCRAdapterError(
|
| 164 |
+
f"{self.name} : image introuvable {image_path!r}.",
|
| 165 |
+
)
|
| 166 |
+
if not self._model_path.exists():
|
| 167 |
+
raise OCRAdapterError(
|
| 168 |
+
f"{self.name} : modèle introuvable "
|
| 169 |
+
f"{self._model_path!r}.",
|
| 170 |
+
)
|
| 171 |
+
|
| 172 |
+
# Lazy-import de kraken + PIL — si absents, message explicite.
|
| 173 |
+
try:
|
| 174 |
+
from kraken import binarization, pageseg, rpred # type: ignore[import-not-found]
|
| 175 |
+
from kraken.lib import models # type: ignore[import-not-found]
|
| 176 |
+
from PIL import Image
|
| 177 |
+
except ImportError as exc:
|
| 178 |
+
raise OCRAdapterError(
|
| 179 |
+
f"{self.name} : kraken/Pillow non installés. "
|
| 180 |
+
"Installer avec : pip install 'picarones[kraken]' "
|
| 181 |
+
"(ou 'pip install kraken>=4.0').",
|
| 182 |
+
) from exc
|
| 183 |
+
|
| 184 |
+
# Charger le modèle (une seule fois par instance).
|
| 185 |
+
if self._model is None:
|
| 186 |
+
try:
|
| 187 |
+
self._model = models.load_any(str(self._model_path))
|
| 188 |
+
except Exception as exc:
|
| 189 |
+
raise OCRAdapterError(
|
| 190 |
+
f"{self.name} : chargement modèle "
|
| 191 |
+
f"{self._model_path!r} échoué : "
|
| 192 |
+
f"{type(exc).__name__}: {exc}",
|
| 193 |
+
) from exc
|
| 194 |
+
|
| 195 |
+
# Pipeline Kraken : binarisation → segmentation → reco.
|
| 196 |
+
try:
|
| 197 |
+
with Image.open(image_path) as image:
|
| 198 |
+
proc_image = (
|
| 199 |
+
binarization.nlbin(image) if self._binarize else image
|
| 200 |
+
)
|
| 201 |
+
segmentation = pageseg.segment(
|
| 202 |
+
proc_image, text_direction=self._text_direction,
|
| 203 |
+
)
|
| 204 |
+
predictions = rpred.rpred(
|
| 205 |
+
self._model, image, segmentation,
|
| 206 |
+
)
|
| 207 |
+
lines = [p.prediction for p in predictions if p.prediction]
|
| 208 |
+
text = "\n".join(lines)
|
| 209 |
+
except Exception as exc:
|
| 210 |
+
raise OCRAdapterError(
|
| 211 |
+
f"{self.name} : Kraken a levé sur "
|
| 212 |
+
f"{image_path!r} : {type(exc).__name__}: {exc}",
|
| 213 |
+
) from exc
|
| 214 |
+
|
| 215 |
+
text = text.strip()
|
| 216 |
+
|
| 217 |
+
text_path = resolve_output_path(
|
| 218 |
+
input_path=image_path,
|
| 219 |
+
adapter_name=self.name,
|
| 220 |
+
suffix="txt",
|
| 221 |
+
context=context,
|
| 222 |
+
)
|
| 223 |
+
text_path.write_text(text, encoding="utf-8")
|
| 224 |
+
|
| 225 |
+
return {
|
| 226 |
+
ArtifactType.RAW_TEXT: Artifact(
|
| 227 |
+
id=f"{context.document_id}:{self.name}:raw_text",
|
| 228 |
+
document_id=context.document_id,
|
| 229 |
+
type=ArtifactType.RAW_TEXT,
|
| 230 |
+
produced_by_step="ocr",
|
| 231 |
+
uri=str(text_path),
|
| 232 |
+
),
|
| 233 |
+
}
|
| 234 |
+
|
| 235 |
+
|
| 236 |
+
__all__ = ["KrakenAdapter"]
|
|
@@ -18,6 +18,7 @@ mono-call ergonomique et restitue un ``BenchmarkResult``.
|
|
| 18 |
|
| 19 |
from __future__ import annotations
|
| 20 |
|
|
|
|
| 21 |
import logging
|
| 22 |
from dataclasses import dataclass
|
| 23 |
from pathlib import Path
|
|
@@ -499,6 +500,53 @@ def _build_pipeline_info(engine: Any) -> dict:
|
|
| 499 |
return info
|
| 500 |
|
| 501 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 502 |
def _safe_engine_version(engine: Any) -> str:
|
| 503 |
"""Retourne ``engine.version()`` ou ``"unknown"`` en cas d'erreur.
|
| 504 |
|
|
@@ -733,7 +781,7 @@ def build_adapter_resolver(
|
|
| 733 |
"""Deux executors sont *fonctionnellement* équivalents s'ils
|
| 734 |
ont le même type et le même état (``__dict__`` complet).
|
| 735 |
|
| 736 |
-
Cas concret : deux ``
|
| 737 |
``tesseract`` avec la même langue — l'un en mode OCR seul,
|
| 738 |
l'autre encapsulé dans un pipeline OCR+LLM. Le factory web
|
| 739 |
leur donne le même ``name`` (dérivé de la config) → la 2e
|
|
@@ -1333,8 +1381,8 @@ def _run_benchmark_with_partial(
|
|
| 1333 |
from picarones.app.services.partial_store import (
|
| 1334 |
_delete_partial,
|
| 1335 |
_load_partial,
|
| 1336 |
-
_partial_path,
|
| 1337 |
_save_partial_line,
|
|
|
|
| 1338 |
)
|
| 1339 |
from picarones.evaluation.benchmark_result import (
|
| 1340 |
BenchmarkResult,
|
|
@@ -1367,7 +1415,21 @@ def _run_benchmark_with_partial(
|
|
| 1367 |
)
|
| 1368 |
break
|
| 1369 |
|
| 1370 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1371 |
loaded_results = _load_partial(partial_path)
|
| 1372 |
loaded_doc_ids = {dr.doc_id for dr in loaded_results}
|
| 1373 |
|
|
|
|
| 18 |
|
| 19 |
from __future__ import annotations
|
| 20 |
|
| 21 |
+
import hashlib
|
| 22 |
import logging
|
| 23 |
from dataclasses import dataclass
|
| 24 |
from pathlib import Path
|
|
|
|
| 500 |
return info
|
| 501 |
|
| 502 |
|
| 503 |
+
def _engine_config_for_fingerprint(engine: Any) -> dict:
|
| 504 |
+
"""Extrait une config sérialisable d'un engine pour le fingerprint.
|
| 505 |
+
|
| 506 |
+
Phase 2.3 : utilisé par
|
| 507 |
+
:func:`partial_store.compute_run_fingerprint` pour distinguer deux
|
| 508 |
+
runs avec le même couple ``(corpus, engine.name)`` mais des
|
| 509 |
+
paramètres internes différents (psm/lang Tesseract, modèle LLM,
|
| 510 |
+
prompt_template, mode pipeline, …). Un changement non capturé
|
| 511 |
+
par ce dict = potentiel faux résultat en reprise.
|
| 512 |
+
|
| 513 |
+
Stratégie : sonder les attributs canoniques connus, repli sur
|
| 514 |
+
``repr`` pour les types non sérialisables. ``json.dumps`` finalise
|
| 515 |
+
via ``default=str`` côté ``compute_run_fingerprint`` — la
|
| 516 |
+
granularité est conservatrice (toute différence visible → nouveau
|
| 517 |
+
fingerprint).
|
| 518 |
+
"""
|
| 519 |
+
cfg: dict = {"engine_name": getattr(engine, "name", "")}
|
| 520 |
+
|
| 521 |
+
# Pipeline composé : capturer le mode + prompt + LLM model
|
| 522 |
+
# (sources de différence majeure des résultats).
|
| 523 |
+
if getattr(engine, "is_pipeline", False):
|
| 524 |
+
mode = getattr(engine, "mode", None)
|
| 525 |
+
cfg["mode"] = mode.value if hasattr(mode, "value") else mode
|
| 526 |
+
prompt = getattr(engine, "prompt_template", None)
|
| 527 |
+
if prompt is not None:
|
| 528 |
+
# Hasher le prompt pour éviter de polluer le nom du fichier
|
| 529 |
+
# partiel avec un prompt multi-lignes (et de fuiter le
|
| 530 |
+
# contenu d'un prompt institutionnel dans un nom de fichier).
|
| 531 |
+
cfg["prompt_sha1"] = hashlib.sha1(
|
| 532 |
+
str(prompt).encode("utf-8"),
|
| 533 |
+
).hexdigest()[:12]
|
| 534 |
+
llm = getattr(engine, "llm_adapter", None)
|
| 535 |
+
if llm is not None:
|
| 536 |
+
cfg["llm_model"] = getattr(llm, "model", "")
|
| 537 |
+
cfg["llm_provider"] = getattr(llm, "name", "")
|
| 538 |
+
ocr = getattr(engine, "ocr_adapter", None)
|
| 539 |
+
if ocr is not None:
|
| 540 |
+
cfg["ocr_name"] = getattr(ocr, "name", "")
|
| 541 |
+
else:
|
| 542 |
+
# Adapter OCR seul : sonder les attributs courants.
|
| 543 |
+
for attr in ("lang", "psm", "model", "model_id", "feature_type"):
|
| 544 |
+
value = getattr(engine, attr, None)
|
| 545 |
+
if value is not None:
|
| 546 |
+
cfg[attr] = value
|
| 547 |
+
return cfg
|
| 548 |
+
|
| 549 |
+
|
| 550 |
def _safe_engine_version(engine: Any) -> str:
|
| 551 |
"""Retourne ``engine.version()`` ou ``"unknown"`` en cas d'erreur.
|
| 552 |
|
|
|
|
| 781 |
"""Deux executors sont *fonctionnellement* équivalents s'ils
|
| 782 |
ont le même type et le même état (``__dict__`` complet).
|
| 783 |
|
| 784 |
+
Cas concret : deux ``PipelineConfig`` qui utilisent
|
| 785 |
``tesseract`` avec la même langue — l'un en mode OCR seul,
|
| 786 |
l'autre encapsulé dans un pipeline OCR+LLM. Le factory web
|
| 787 |
leur donne le même ``name`` (dérivé de la config) → la 2e
|
|
|
|
| 1381 |
from picarones.app.services.partial_store import (
|
| 1382 |
_delete_partial,
|
| 1383 |
_load_partial,
|
|
|
|
| 1384 |
_save_partial_line,
|
| 1385 |
+
partial_path_for_engine,
|
| 1386 |
)
|
| 1387 |
from picarones.evaluation.benchmark_result import (
|
| 1388 |
BenchmarkResult,
|
|
|
|
| 1415 |
)
|
| 1416 |
break
|
| 1417 |
|
| 1418 |
+
# Phase 2.3 — fingerprint inclut config moteur + profil
|
| 1419 |
+
# normalisation + char_exclude + corpus files (mtime/size) +
|
| 1420 |
+
# version code. Deux runs avec configs différentes →
|
| 1421 |
+
# fichiers partiels distincts → pas de réutilisation
|
| 1422 |
+
# silencieuse de résultats incompatibles.
|
| 1423 |
+
partial_path = partial_path_for_engine(
|
| 1424 |
+
corpus=corpus,
|
| 1425 |
+
engine=engine,
|
| 1426 |
+
partial_dir=partial_dir,
|
| 1427 |
+
engine_config=_engine_config_for_fingerprint(engine),
|
| 1428 |
+
normalization_profile=normalization_profile,
|
| 1429 |
+
char_exclude=char_exclude,
|
| 1430 |
+
profile=profile,
|
| 1431 |
+
code_version=code_version,
|
| 1432 |
+
)
|
| 1433 |
loaded_results = _load_partial(partial_path)
|
| 1434 |
loaded_doc_ids = {dr.doc_id for dr in loaded_results}
|
| 1435 |
|
|
@@ -7,9 +7,9 @@ travail déjà fait.
|
|
| 7 |
Contrat
|
| 8 |
-------
|
| 9 |
Pour chaque couple ``(corpus_name, engine_name)``, un fichier
|
| 10 |
-
``{partial_dir}/picarones_{corpus}_{engine}.partial.jsonl``
|
| 11 |
-
une ligne JSON par ``DocumentResult`` au fur et à mesure de
|
| 12 |
-
calcul. Au redémarrage, ``run_benchmark_via_service`` charge ce
|
| 13 |
fichier, identifie les ``doc_id`` déjà traités, et n'invoque le
|
| 14 |
``BenchmarkService`` que sur les documents restants.
|
| 15 |
|
|
@@ -18,6 +18,21 @@ partiel est supprimé. Si un crash interrompt le run mid-engine,
|
|
| 18 |
le fichier persiste : la prochaine exécution reprendra exactement
|
| 19 |
où l'on s'est arrêté.
|
| 20 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 21 |
Anti-sur-ingénierie
|
| 22 |
-------------------
|
| 23 |
- Format JSONL plat (une ligne = un ``DocumentResult.as_dict()``),
|
|
@@ -28,15 +43,22 @@ Anti-sur-ingénierie
|
|
| 28 |
partage inter-process (chaque process a son propre tempdir).
|
| 29 |
- Pas de checksum ni de validation de schéma — best-effort. Une
|
| 30 |
ligne corrompue = warning + ligne ignorée + on continue.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 31 |
"""
|
| 32 |
|
| 33 |
from __future__ import annotations
|
| 34 |
|
|
|
|
| 35 |
import json
|
| 36 |
import logging
|
| 37 |
import re
|
| 38 |
import tempfile
|
| 39 |
import threading
|
|
|
|
| 40 |
from pathlib import Path
|
| 41 |
from typing import TYPE_CHECKING, Any, Optional
|
| 42 |
|
|
@@ -66,6 +88,8 @@ def _partial_path(
|
|
| 66 |
corpus_name: str,
|
| 67 |
engine_name: str,
|
| 68 |
partial_dir: Optional[str | Path],
|
|
|
|
|
|
|
| 69 |
) -> Path:
|
| 70 |
"""Construit le chemin du fichier partiel pour ``(corpus, engine)``.
|
| 71 |
|
|
@@ -73,15 +97,93 @@ def _partial_path(
|
|
| 73 |
``tempfile.gettempdir()`` — utile pour les tests qui ne veulent
|
| 74 |
pas configurer un répertoire dédié mais bénéficient quand même
|
| 75 |
de la reprise intra-process.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 76 |
"""
|
| 77 |
base = Path(partial_dir) if partial_dir else Path(tempfile.gettempdir())
|
|
|
|
| 78 |
name = (
|
| 79 |
f"picarones_{_sanitize_filename(corpus_name)}"
|
| 80 |
-
f"_{_sanitize_filename(engine_name)}.partial.jsonl"
|
| 81 |
)
|
| 82 |
return base / name
|
| 83 |
|
| 84 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 85 |
def _load_partial(
|
| 86 |
partial_path: Path,
|
| 87 |
) -> list[DocumentResult]:
|
|
@@ -98,7 +200,6 @@ def _load_partial(
|
|
| 98 |
travail antérieur.
|
| 99 |
"""
|
| 100 |
from picarones.evaluation.benchmark_result import DocumentResult
|
| 101 |
-
from picarones.evaluation.metric_result import MetricsResult
|
| 102 |
|
| 103 |
results: list[DocumentResult] = []
|
| 104 |
if not partial_path.exists():
|
|
@@ -128,41 +229,12 @@ def _load_partial(
|
|
| 128 |
)
|
| 129 |
continue
|
| 130 |
try:
|
| 131 |
-
|
| 132 |
-
|
| 133 |
-
|
| 134 |
-
|
| 135 |
-
|
| 136 |
-
|
| 137 |
-
wer_normalized=metrics_dict.get("wer_normalized"),
|
| 138 |
-
mer=metrics_dict.get("mer"),
|
| 139 |
-
wil=metrics_dict.get("wil"),
|
| 140 |
-
reference_length=metrics_dict.get("reference_length", 0),
|
| 141 |
-
hypothesis_length=metrics_dict.get("hypothesis_length", 0),
|
| 142 |
-
error=metrics_dict.get("error"),
|
| 143 |
-
cer_diplomatic=metrics_dict.get("cer_diplomatic"),
|
| 144 |
-
diplomatic_profile_name=metrics_dict.get(
|
| 145 |
-
"diplomatic_profile_name",
|
| 146 |
-
),
|
| 147 |
-
)
|
| 148 |
-
results.append(DocumentResult(
|
| 149 |
-
doc_id=d["doc_id"],
|
| 150 |
-
image_path=d.get("image_path", ""),
|
| 151 |
-
ground_truth=d.get("ground_truth", ""),
|
| 152 |
-
hypothesis=d.get("hypothesis", ""),
|
| 153 |
-
metrics=metrics,
|
| 154 |
-
duration_seconds=d.get("duration_seconds", 0.0),
|
| 155 |
-
engine_error=d.get("engine_error"),
|
| 156 |
-
ocr_intermediate=d.get("ocr_intermediate"),
|
| 157 |
-
pipeline_metadata=d.get("pipeline_metadata", {}) or {},
|
| 158 |
-
confusion_matrix=d.get("confusion_matrix"),
|
| 159 |
-
char_scores=d.get("char_scores"),
|
| 160 |
-
taxonomy=d.get("taxonomy"),
|
| 161 |
-
structure=d.get("structure"),
|
| 162 |
-
image_quality=d.get("image_quality"),
|
| 163 |
-
line_metrics=d.get("line_metrics"),
|
| 164 |
-
hallucination_metrics=d.get("hallucination_metrics"),
|
| 165 |
-
))
|
| 166 |
except (KeyError, TypeError) as exc:
|
| 167 |
logger.warning(
|
| 168 |
"[partial_dir] ligne %d malformée dans '%s' : %s "
|
|
@@ -212,6 +284,70 @@ def _delete_partial(partial_path: Path) -> None:
|
|
| 212 |
)
|
| 213 |
|
| 214 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 215 |
__all__ = [
|
| 216 |
"_delete_partial",
|
| 217 |
"_load_partial",
|
|
@@ -219,4 +355,6 @@ __all__ = [
|
|
| 219 |
"_partial_write_lock",
|
| 220 |
"_sanitize_filename",
|
| 221 |
"_save_partial_line",
|
|
|
|
|
|
|
| 222 |
]
|
|
|
|
| 7 |
Contrat
|
| 8 |
-------
|
| 9 |
Pour chaque couple ``(corpus_name, engine_name)``, un fichier
|
| 10 |
+
``{partial_dir}/picarones_{corpus}_{engine}_{fingerprint}.partial.jsonl``
|
| 11 |
+
accumule une ligne JSON par ``DocumentResult`` au fur et à mesure de
|
| 12 |
+
leur calcul. Au redémarrage, ``run_benchmark_via_service`` charge ce
|
| 13 |
fichier, identifie les ``doc_id`` déjà traités, et n'invoque le
|
| 14 |
``BenchmarkService`` que sur les documents restants.
|
| 15 |
|
|
|
|
| 18 |
le fichier persiste : la prochaine exécution reprendra exactement
|
| 19 |
où l'on s'est arrêté.
|
| 20 |
|
| 21 |
+
Phase 2.3 — Fingerprint anti-collision
|
| 22 |
+
---------------------------------------
|
| 23 |
+
Auparavant, la clé partial était ``(corpus.name, engine.name)`` —
|
| 24 |
+
insuffisant : deux runs successifs avec le même corpus et le même
|
| 25 |
+
engine **mais des configs différentes** (psm Tesseract, langue,
|
| 26 |
+
profil de normalisation, char_exclude, version code) réutilisaient
|
| 27 |
+
silencieusement les résultats du run précédent. Reproductibilité
|
| 28 |
+
scientifique cassée.
|
| 29 |
+
|
| 30 |
+
Désormais :func:`compute_run_fingerprint` calcule un SHA-256 stable
|
| 31 |
+
de la config complète (engine_config, normalization_profile,
|
| 32 |
+
char_exclude, fichiers du corpus + mtime/size, version code). Le
|
| 33 |
+
préfixe 16 hex est suffixé au nom du fichier partiel : un changement
|
| 34 |
+
de config = un fichier différent = pas de réutilisation illégale.
|
| 35 |
+
|
| 36 |
Anti-sur-ingénierie
|
| 37 |
-------------------
|
| 38 |
- Format JSONL plat (une ligne = un ``DocumentResult.as_dict()``),
|
|
|
|
| 43 |
partage inter-process (chaque process a son propre tempdir).
|
| 44 |
- Pas de checksum ni de validation de schéma — best-effort. Une
|
| 45 |
ligne corrompue = warning + ligne ignorée + on continue.
|
| 46 |
+
- Fingerprint basé sur ``(path, size, mtime)`` pour les fichiers
|
| 47 |
+
corpus, pas sur le contenu lui-même : 100× plus rapide, suffisant
|
| 48 |
+
pour détecter une modification. Si un attaquant ``touch`` un
|
| 49 |
+
fichier sans changer son contenu, le partial est invalidé (acceptable,
|
| 50 |
+
conservative).
|
| 51 |
"""
|
| 52 |
|
| 53 |
from __future__ import annotations
|
| 54 |
|
| 55 |
+
import hashlib
|
| 56 |
import json
|
| 57 |
import logging
|
| 58 |
import re
|
| 59 |
import tempfile
|
| 60 |
import threading
|
| 61 |
+
from collections.abc import Iterable, Mapping
|
| 62 |
from pathlib import Path
|
| 63 |
from typing import TYPE_CHECKING, Any, Optional
|
| 64 |
|
|
|
|
| 88 |
corpus_name: str,
|
| 89 |
engine_name: str,
|
| 90 |
partial_dir: Optional[str | Path],
|
| 91 |
+
*,
|
| 92 |
+
fingerprint: Optional[str] = None,
|
| 93 |
) -> Path:
|
| 94 |
"""Construit le chemin du fichier partiel pour ``(corpus, engine)``.
|
| 95 |
|
|
|
|
| 97 |
``tempfile.gettempdir()`` — utile pour les tests qui ne veulent
|
| 98 |
pas configurer un répertoire dédié mais bénéficient quand même
|
| 99 |
de la reprise intra-process.
|
| 100 |
+
|
| 101 |
+
Phase 2.3 — Si ``fingerprint`` est fourni, il est suffixé au nom :
|
| 102 |
+
``picarones_{corpus}_{engine}_{fingerprint}.partial.jsonl``. Cela
|
| 103 |
+
garantit que deux runs avec le même couple ``(corpus, engine)``
|
| 104 |
+
mais des configs différentes ne partagent **jamais** leur fichier
|
| 105 |
+
partiel. Sans ``fingerprint``, le comportement legacy est
|
| 106 |
+
préservé pour rétrocompatibilité tests.
|
| 107 |
"""
|
| 108 |
base = Path(partial_dir) if partial_dir else Path(tempfile.gettempdir())
|
| 109 |
+
fp_suffix = f"_{fingerprint}" if fingerprint else ""
|
| 110 |
name = (
|
| 111 |
f"picarones_{_sanitize_filename(corpus_name)}"
|
| 112 |
+
f"_{_sanitize_filename(engine_name)}{fp_suffix}.partial.jsonl"
|
| 113 |
)
|
| 114 |
return base / name
|
| 115 |
|
| 116 |
|
| 117 |
+
def compute_run_fingerprint(
|
| 118 |
+
*,
|
| 119 |
+
engine_config: Mapping[str, Any] | None = None,
|
| 120 |
+
normalization_profile: str | None = None,
|
| 121 |
+
char_exclude: str | None = None,
|
| 122 |
+
corpus_files: Iterable[str | Path] | None = None,
|
| 123 |
+
code_version: str | None = None,
|
| 124 |
+
extra: Mapping[str, Any] | None = None,
|
| 125 |
+
) -> str:
|
| 126 |
+
"""Calcule un fingerprint stable pour identifier un run.
|
| 127 |
+
|
| 128 |
+
Composantes intégrées au hash :
|
| 129 |
+
|
| 130 |
+
- ``engine_config`` — dict de paramètres moteur (lang, psm,
|
| 131 |
+
model, etc.). Encodé en JSON trié pour stabilité.
|
| 132 |
+
- ``normalization_profile`` — identifiant du profil de
|
| 133 |
+
normalisation Unicode. Différents profils → métriques
|
| 134 |
+
différentes → fingerprint différent.
|
| 135 |
+
- ``char_exclude`` — caractères ignorés au calcul (CER/WER).
|
| 136 |
+
Idem.
|
| 137 |
+
- ``corpus_files`` — itérable de chemins. Pour chaque, on
|
| 138 |
+
hashe le chemin + ``stat.st_size`` + ``stat.st_mtime``.
|
| 139 |
+
Détecte les modifs sans coût du hash de contenu.
|
| 140 |
+
- ``code_version`` — version de Picarones courante.
|
| 141 |
+
- ``extra`` — dict additionnel libre pour des éléments
|
| 142 |
+
spécifiques à un pipeline (prompt_template, llm_params).
|
| 143 |
+
|
| 144 |
+
Returns
|
| 145 |
+
-------
|
| 146 |
+
str
|
| 147 |
+
Empreinte hexadécimale tronquée à 16 caractères — collision
|
| 148 |
+
négligeable pour un usage par-utilisateur, lisible humainement.
|
| 149 |
+
"""
|
| 150 |
+
hasher = hashlib.sha256()
|
| 151 |
+
|
| 152 |
+
def _update(key: str, value: Any) -> None:
|
| 153 |
+
hasher.update(b"\x00")
|
| 154 |
+
hasher.update(key.encode("utf-8"))
|
| 155 |
+
hasher.update(b"\x01")
|
| 156 |
+
try:
|
| 157 |
+
payload = json.dumps(value, sort_keys=True, default=str)
|
| 158 |
+
except TypeError:
|
| 159 |
+
payload = repr(value)
|
| 160 |
+
hasher.update(payload.encode("utf-8"))
|
| 161 |
+
|
| 162 |
+
_update("engine_config", dict(engine_config or {}))
|
| 163 |
+
_update("normalization_profile", normalization_profile or "")
|
| 164 |
+
_update("char_exclude", char_exclude or "")
|
| 165 |
+
_update("code_version", code_version or "")
|
| 166 |
+
_update("extra", dict(extra or {}))
|
| 167 |
+
|
| 168 |
+
if corpus_files is not None:
|
| 169 |
+
# Tri pour stabilité indépendamment de l'ordre d'itération.
|
| 170 |
+
for fpath in sorted(str(p) for p in corpus_files):
|
| 171 |
+
hasher.update(b"\x02")
|
| 172 |
+
hasher.update(fpath.encode("utf-8"))
|
| 173 |
+
try:
|
| 174 |
+
stat = Path(fpath).stat()
|
| 175 |
+
hasher.update(
|
| 176 |
+
f":{stat.st_size}:{int(stat.st_mtime)}".encode("utf-8"),
|
| 177 |
+
)
|
| 178 |
+
except OSError:
|
| 179 |
+
# Fichier disparu / inaccessible — ignoré au fingerprint.
|
| 180 |
+
# Si le file disparait pendant la course, on prend ce
|
| 181 |
+
# qu'on peut.
|
| 182 |
+
continue
|
| 183 |
+
|
| 184 |
+
return hasher.hexdigest()[:16]
|
| 185 |
+
|
| 186 |
+
|
| 187 |
def _load_partial(
|
| 188 |
partial_path: Path,
|
| 189 |
) -> list[DocumentResult]:
|
|
|
|
| 200 |
travail antérieur.
|
| 201 |
"""
|
| 202 |
from picarones.evaluation.benchmark_result import DocumentResult
|
|
|
|
| 203 |
|
| 204 |
results: list[DocumentResult] = []
|
| 205 |
if not partial_path.exists():
|
|
|
|
| 229 |
)
|
| 230 |
continue
|
| 231 |
try:
|
| 232 |
+
# Phase 2.2 — Utilise ``DocumentResult.from_dict`` au lieu
|
| 233 |
+
# de la reconstruction manuelle qui perdait
|
| 234 |
+
# ``taxonomy``/``ner_metrics``/``calibration_metrics``/etc.
|
| 235 |
+
# à la reprise — un partial chargé puis re-sérialisé
|
| 236 |
+
# devait conserver l'intégralité du payload.
|
| 237 |
+
results.append(DocumentResult.from_dict(d))
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 238 |
except (KeyError, TypeError) as exc:
|
| 239 |
logger.warning(
|
| 240 |
"[partial_dir] ligne %d malformée dans '%s' : %s "
|
|
|
|
| 284 |
)
|
| 285 |
|
| 286 |
|
| 287 |
+
def partial_path_for_engine(
|
| 288 |
+
*,
|
| 289 |
+
corpus: Any,
|
| 290 |
+
engine: Any,
|
| 291 |
+
partial_dir: Optional[str | Path],
|
| 292 |
+
engine_config: Mapping[str, Any] | None = None,
|
| 293 |
+
normalization_profile: Any | None = None,
|
| 294 |
+
char_exclude: Any | None = None,
|
| 295 |
+
profile: str | None = None,
|
| 296 |
+
code_version: str | None = None,
|
| 297 |
+
) -> Path:
|
| 298 |
+
"""Helper public qui calcule le ``Path`` du fichier partiel pour
|
| 299 |
+
un couple ``(corpus, engine)`` en intégrant le fingerprint complet.
|
| 300 |
+
|
| 301 |
+
Encapsule la combinaison ``_partial_path`` +
|
| 302 |
+
:func:`compute_run_fingerprint` pour que le runner et les tests
|
| 303 |
+
utilisent la **même** logique de nommage — sinon les tests ne
|
| 304 |
+
peuvent pas pré-remplir un partial que le runner saura
|
| 305 |
+
retrouver.
|
| 306 |
+
|
| 307 |
+
Parameters
|
| 308 |
+
----------
|
| 309 |
+
corpus:
|
| 310 |
+
Doit exposer ``.name`` et ``.documents`` (chaque doc ayant
|
| 311 |
+
``.image_path``).
|
| 312 |
+
engine:
|
| 313 |
+
Doit exposer ``.name``. ``engine_config`` peut être fourni
|
| 314 |
+
séparément si la caller veut surcharger l'introspection.
|
| 315 |
+
partial_dir:
|
| 316 |
+
Dossier où vit le partial ; ``None`` → tempdir.
|
| 317 |
+
engine_config:
|
| 318 |
+
Si fourni, utilisé tel quel ; sinon l'appelant peut sonder
|
| 319 |
+
l'engine via :func:`benchmark_runner._engine_config_for_fingerprint`
|
| 320 |
+
avant d'appeler.
|
| 321 |
+
normalization_profile, char_exclude, profile, code_version:
|
| 322 |
+
Composantes incluses dans le fingerprint. Passer ``None``
|
| 323 |
+
pour ne pas contribuer (deux runs avec et sans normalisation
|
| 324 |
+
auront alors des fingerprints différents seulement si l'un
|
| 325 |
+
des deux est ``None``).
|
| 326 |
+
"""
|
| 327 |
+
corpus_files = [
|
| 328 |
+
doc.image_path for doc in getattr(corpus, "documents", [])
|
| 329 |
+
if getattr(doc, "image_path", None)
|
| 330 |
+
]
|
| 331 |
+
fp = compute_run_fingerprint(
|
| 332 |
+
engine_config=engine_config or {"engine_name": getattr(engine, "name", "")},
|
| 333 |
+
normalization_profile=(
|
| 334 |
+
getattr(normalization_profile, "name", None)
|
| 335 |
+
if normalization_profile is not None
|
| 336 |
+
else None
|
| 337 |
+
),
|
| 338 |
+
char_exclude=(
|
| 339 |
+
"".join(sorted(char_exclude)) if char_exclude else None
|
| 340 |
+
),
|
| 341 |
+
corpus_files=corpus_files,
|
| 342 |
+
code_version=code_version,
|
| 343 |
+
extra={"profile": profile} if profile else None,
|
| 344 |
+
)
|
| 345 |
+
return _partial_path(
|
| 346 |
+
getattr(corpus, "name", ""), getattr(engine, "name", ""),
|
| 347 |
+
partial_dir, fingerprint=fp,
|
| 348 |
+
)
|
| 349 |
+
|
| 350 |
+
|
| 351 |
__all__ = [
|
| 352 |
"_delete_partial",
|
| 353 |
"_load_partial",
|
|
|
|
| 355 |
"_partial_write_lock",
|
| 356 |
"_sanitize_filename",
|
| 357 |
"_save_partial_line",
|
| 358 |
+
"compute_run_fingerprint",
|
| 359 |
+
"partial_path_for_engine",
|
| 360 |
]
|
|
@@ -180,6 +180,50 @@ class DocumentResult:
|
|
| 180 |
d["readability_metrics"] = self.readability_metrics
|
| 181 |
return d
|
| 182 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 183 |
def compact(
|
| 184 |
self,
|
| 185 |
text_limit: Optional[int] = None,
|
|
@@ -408,6 +452,43 @@ class EngineReport:
|
|
| 408 |
d["aggregated_readability"] = self.aggregated_readability
|
| 409 |
return d
|
| 410 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 411 |
|
| 412 |
@dataclass
|
| 413 |
class BenchmarkResult:
|
|
@@ -686,12 +767,60 @@ class BenchmarkResult:
|
|
| 686 |
json.dump(self.as_dict(), fh, ensure_ascii=False, indent=indent)
|
| 687 |
return output_path.resolve()
|
| 688 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 689 |
@classmethod
|
| 690 |
def from_json(cls, path: str | Path) -> dict:
|
| 691 |
-
"""Charge
|
| 692 |
|
| 693 |
-
|
| 694 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 695 |
"""
|
| 696 |
with Path(path).open(encoding="utf-8") as fh:
|
| 697 |
return json.load(fh)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 180 |
d["readability_metrics"] = self.readability_metrics
|
| 181 |
return d
|
| 182 |
|
| 183 |
+
@classmethod
|
| 184 |
+
def from_dict(cls, data: dict) -> "DocumentResult":
|
| 185 |
+
"""Reconstruit un :class:`DocumentResult` depuis ``as_dict()``.
|
| 186 |
+
|
| 187 |
+
Phase 2.2 du chantier post-rewrite : restauration fidèle de
|
| 188 |
+
tous les champs avancés (confusion_matrix, taxonomy, structure,
|
| 189 |
+
hallucination_metrics, ner_metrics, calibration_metrics,
|
| 190 |
+
philological_metrics, searchability_metrics,
|
| 191 |
+
numerical_sequence_metrics, readability_metrics,
|
| 192 |
+
pipeline_metadata, ocr_intermediate).
|
| 193 |
+
|
| 194 |
+
Avant ce durcissement, ``ReportGenerator.from_json`` faisait sa
|
| 195 |
+
propre reconstruction qui ne couvrait que CER/WER/MER/WIL +
|
| 196 |
+
doc_id/image_path/ground_truth/hypothesis — toutes les
|
| 197 |
+
analyses détaillées étaient perdues, donc le rapport régénéré
|
| 198 |
+
depuis JSON n'avait plus accès aux vues taxonomy, NER,
|
| 199 |
+
calibration, etc. La reproductibilité scientifique était
|
| 200 |
+
cassée.
|
| 201 |
+
"""
|
| 202 |
+
return cls(
|
| 203 |
+
doc_id=data["doc_id"],
|
| 204 |
+
image_path=data["image_path"],
|
| 205 |
+
ground_truth=data["ground_truth"],
|
| 206 |
+
hypothesis=data["hypothesis"],
|
| 207 |
+
metrics=MetricsResult.from_dict(data["metrics"]),
|
| 208 |
+
duration_seconds=data.get("duration_seconds", 0.0),
|
| 209 |
+
engine_error=data.get("engine_error"),
|
| 210 |
+
ocr_intermediate=data.get("ocr_intermediate"),
|
| 211 |
+
pipeline_metadata=data.get("pipeline_metadata", {}) or {},
|
| 212 |
+
confusion_matrix=data.get("confusion_matrix"),
|
| 213 |
+
char_scores=data.get("char_scores"),
|
| 214 |
+
taxonomy=data.get("taxonomy"),
|
| 215 |
+
structure=data.get("structure"),
|
| 216 |
+
image_quality=data.get("image_quality"),
|
| 217 |
+
line_metrics=data.get("line_metrics"),
|
| 218 |
+
hallucination_metrics=data.get("hallucination_metrics"),
|
| 219 |
+
ner_metrics=data.get("ner_metrics"),
|
| 220 |
+
calibration_metrics=data.get("calibration_metrics"),
|
| 221 |
+
philological_metrics=data.get("philological_metrics"),
|
| 222 |
+
searchability_metrics=data.get("searchability_metrics"),
|
| 223 |
+
numerical_sequence_metrics=data.get("numerical_sequence_metrics"),
|
| 224 |
+
readability_metrics=data.get("readability_metrics"),
|
| 225 |
+
)
|
| 226 |
+
|
| 227 |
def compact(
|
| 228 |
self,
|
| 229 |
text_limit: Optional[int] = None,
|
|
|
|
| 452 |
d["aggregated_readability"] = self.aggregated_readability
|
| 453 |
return d
|
| 454 |
|
| 455 |
+
@classmethod
|
| 456 |
+
def from_dict(cls, data: dict) -> "EngineReport":
|
| 457 |
+
"""Reconstruit un :class:`EngineReport` depuis ``as_dict()``.
|
| 458 |
+
|
| 459 |
+
Phase 2.2 du chantier post-rewrite : restauration fidèle des
|
| 460 |
+
``aggregated_*`` (confusion, char_scores, taxonomy, structure,
|
| 461 |
+
image_quality, line_metrics, hallucination, ner, calibration,
|
| 462 |
+
philological, searchability, numerical_sequences, readability)
|
| 463 |
+
et de ``pipeline_info``.
|
| 464 |
+
"""
|
| 465 |
+
return cls(
|
| 466 |
+
engine_name=data["engine_name"],
|
| 467 |
+
engine_version=data.get("engine_version", "unknown"),
|
| 468 |
+
engine_config=data.get("engine_config", {}),
|
| 469 |
+
document_results=[
|
| 470 |
+
DocumentResult.from_dict(dr)
|
| 471 |
+
for dr in data.get("document_results", [])
|
| 472 |
+
],
|
| 473 |
+
aggregated_metrics=data.get("aggregated_metrics", {}) or {},
|
| 474 |
+
pipeline_info=data.get("pipeline_info", {}) or {},
|
| 475 |
+
aggregated_confusion=data.get("aggregated_confusion"),
|
| 476 |
+
aggregated_char_scores=data.get("aggregated_char_scores"),
|
| 477 |
+
aggregated_taxonomy=data.get("aggregated_taxonomy"),
|
| 478 |
+
aggregated_structure=data.get("aggregated_structure"),
|
| 479 |
+
aggregated_image_quality=data.get("aggregated_image_quality"),
|
| 480 |
+
aggregated_line_metrics=data.get("aggregated_line_metrics"),
|
| 481 |
+
aggregated_hallucination=data.get("aggregated_hallucination"),
|
| 482 |
+
aggregated_ner=data.get("aggregated_ner"),
|
| 483 |
+
aggregated_calibration=data.get("aggregated_calibration"),
|
| 484 |
+
aggregated_philological=data.get("aggregated_philological"),
|
| 485 |
+
aggregated_searchability=data.get("aggregated_searchability"),
|
| 486 |
+
aggregated_numerical_sequences=data.get(
|
| 487 |
+
"aggregated_numerical_sequences",
|
| 488 |
+
),
|
| 489 |
+
aggregated_readability=data.get("aggregated_readability"),
|
| 490 |
+
)
|
| 491 |
+
|
| 492 |
|
| 493 |
@dataclass
|
| 494 |
class BenchmarkResult:
|
|
|
|
| 767 |
json.dump(self.as_dict(), fh, ensure_ascii=False, indent=indent)
|
| 768 |
return output_path.resolve()
|
| 769 |
|
| 770 |
+
@classmethod
|
| 771 |
+
def from_dict(cls, data: dict) -> "BenchmarkResult":
|
| 772 |
+
"""Reconstruit un :class:`BenchmarkResult` complet depuis
|
| 773 |
+
``as_dict()``.
|
| 774 |
+
|
| 775 |
+
Phase 2.2 du chantier post-rewrite : fidélité du round-trip
|
| 776 |
+
``to_json → from_dict``. Auparavant, ``from_json`` retournait
|
| 777 |
+
le dict brut et l'appelant devait reconstruire à la main —
|
| 778 |
+
d'où la dérive entre ``ReportGenerator.__init__`` (objets) et
|
| 779 |
+
``ReportGenerator.from_json`` (dicts appauvris). Désormais, un
|
| 780 |
+
seul chemin canonique : ``BenchmarkResult.from_dict(dict)`` →
|
| 781 |
+
objet complet, indistinguable d'un benchmark fraîchement
|
| 782 |
+
exécuté.
|
| 783 |
+
"""
|
| 784 |
+
corpus_info = data.get("corpus", {}) or {}
|
| 785 |
+
return cls(
|
| 786 |
+
corpus_name=corpus_info.get("name", "Corpus"),
|
| 787 |
+
corpus_source=corpus_info.get("source"),
|
| 788 |
+
document_count=corpus_info.get("document_count", 0),
|
| 789 |
+
engine_reports=[
|
| 790 |
+
EngineReport.from_dict(er)
|
| 791 |
+
for er in data.get("engine_reports", [])
|
| 792 |
+
],
|
| 793 |
+
run_date=data.get("run_date", ""),
|
| 794 |
+
picarones_version=data.get("picarones_version", ""),
|
| 795 |
+
metadata=data.get("metadata", {}) or {},
|
| 796 |
+
)
|
| 797 |
+
|
| 798 |
@classmethod
|
| 799 |
def from_json(cls, path: str | Path) -> dict:
|
| 800 |
+
"""Charge le JSON brut (dict Python) — rétrocompatibilité.
|
| 801 |
|
| 802 |
+
Pour reconstruire un :class:`BenchmarkResult` complet (objets),
|
| 803 |
+
utiliser :meth:`from_dict` après :meth:`from_json`, ou
|
| 804 |
+
directement :meth:`from_json_object` ci-dessous.
|
| 805 |
+
|
| 806 |
+
Cette méthode est conservée parce que de nombreux consommateurs
|
| 807 |
+
(tests, ``ReportGenerator.from_json`` legacy, scripts CLI ad
|
| 808 |
+
hoc) attendent encore un dict. Le rewrite v2.0 préfère les
|
| 809 |
+
objets reconstruits ; les nouveaux callers doivent utiliser
|
| 810 |
+
:meth:`from_json_object`.
|
| 811 |
"""
|
| 812 |
with Path(path).open(encoding="utf-8") as fh:
|
| 813 |
return json.load(fh)
|
| 814 |
+
|
| 815 |
+
@classmethod
|
| 816 |
+
def from_json_object(cls, path: str | Path) -> "BenchmarkResult":
|
| 817 |
+
"""Charge un JSON et reconstruit un :class:`BenchmarkResult`
|
| 818 |
+
complet (objets), avec toutes les analyses avancées préservées.
|
| 819 |
+
|
| 820 |
+
Round-trip garanti : ``BenchmarkResult.from_json_object(
|
| 821 |
+
bm.to_json(p)) == bm`` au sens structurel (les champs
|
| 822 |
+
``aggregated_metrics`` peuvent être recalculés par
|
| 823 |
+
``__post_init__`` si absents, sinon préservés).
|
| 824 |
+
"""
|
| 825 |
+
with Path(path).open(encoding="utf-8") as fh:
|
| 826 |
+
return cls.from_dict(json.load(fh))
|
|
@@ -79,6 +79,31 @@ class MetricsResult:
|
|
| 79 |
def wer_percent(self) -> Optional[float]:
|
| 80 |
return None if self.wer is None else round(self.wer * 100, 2)
|
| 81 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 82 |
|
| 83 |
def aggregate_metrics(results: list[MetricsResult]) -> dict:
|
| 84 |
"""Calcule les statistiques agrégées sur un ensemble de résultats.
|
|
|
|
| 79 |
def wer_percent(self) -> Optional[float]:
|
| 80 |
return None if self.wer is None else round(self.wer * 100, 2)
|
| 81 |
|
| 82 |
+
@classmethod
|
| 83 |
+
def from_dict(cls, data: dict) -> "MetricsResult":
|
| 84 |
+
"""Reconstruit depuis le dict produit par :meth:`as_dict`.
|
| 85 |
+
|
| 86 |
+
Phase 2.2 du chantier post-rewrite : fidélité du round-trip
|
| 87 |
+
``as_dict → from_dict``. Auparavant, ``ReportGenerator.from_json``
|
| 88 |
+
contenait sa propre reconstruction partielle qui perdait
|
| 89 |
+
``cer_diplomatic`` et ``diplomatic_profile_name``. Centraliser
|
| 90 |
+
la désérialisation ici évite la dérive.
|
| 91 |
+
"""
|
| 92 |
+
return cls(
|
| 93 |
+
cer=data.get("cer"),
|
| 94 |
+
cer_nfc=data.get("cer_nfc"),
|
| 95 |
+
cer_caseless=data.get("cer_caseless"),
|
| 96 |
+
wer=data.get("wer"),
|
| 97 |
+
wer_normalized=data.get("wer_normalized"),
|
| 98 |
+
mer=data.get("mer"),
|
| 99 |
+
wil=data.get("wil"),
|
| 100 |
+
reference_length=data.get("reference_length", 0),
|
| 101 |
+
hypothesis_length=data.get("hypothesis_length", 0),
|
| 102 |
+
error=data.get("error"),
|
| 103 |
+
cer_diplomatic=data.get("cer_diplomatic"),
|
| 104 |
+
diplomatic_profile_name=data.get("diplomatic_profile_name"),
|
| 105 |
+
)
|
| 106 |
+
|
| 107 |
|
| 108 |
def aggregate_metrics(results: list[MetricsResult]) -> dict:
|
| 109 |
"""Calcule les statistiques agrégées sur un ensemble de résultats.
|
|
@@ -151,27 +151,59 @@ def metrics_cmd(reference: str, hypothesis: str, json_output: bool) -> None:
|
|
| 151 |
# picarones engines
|
| 152 |
# ---------------------------------------------------------------------------
|
| 153 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 154 |
@cli.command("engines")
|
| 155 |
def engines_cmd() -> None:
|
| 156 |
-
"""Liste les moteurs OCR disponibles et vérifie leur installation.
|
| 157 |
-
|
| 158 |
-
|
| 159 |
-
|
| 160 |
-
|
|
|
|
|
|
|
| 161 |
|
| 162 |
click.echo("Moteurs OCR disponibles :\n")
|
| 163 |
-
for engine_id, label, module in
|
| 164 |
-
|
| 165 |
-
|
| 166 |
-
|
| 167 |
-
|
| 168 |
-
|
| 169 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 170 |
|
| 171 |
click.echo(
|
| 172 |
-
"\
|
| 173 |
-
"
|
| 174 |
-
" pip install pero-ocr"
|
| 175 |
)
|
| 176 |
|
| 177 |
|
|
|
|
| 151 |
# picarones engines
|
| 152 |
# ---------------------------------------------------------------------------
|
| 153 |
|
| 154 |
+
#: Catalogue source-de-vérité des moteurs OCR exposés.
|
| 155 |
+
#:
|
| 156 |
+
#: Phase 3 chantier post-rewrite : remplace l'ancienne liste hardcodée
|
| 157 |
+
#: ``[tesseract, pero_ocr]`` qui divergeait du web (``/api/engines``
|
| 158 |
+
#: annonçait 8 engines, dont kraken/calamari sans backend, dont
|
| 159 |
+
#: mistral_ocr/google_vision/azure_doc_intel jamais exposés à la CLI).
|
| 160 |
+
#: Désormais la liste est dérivée de la factory canonique
|
| 161 |
+
#: ``picarones.adapters.ocr.factory._SUPPORTED`` ; ajouter un engine
|
| 162 |
+
#: nécessite (1) un adapter dans ``adapters/ocr/`` et (2) une entrée
|
| 163 |
+
#: factory — pas de divergence possible avec l'API web.
|
| 164 |
+
_CLI_ENGINE_CATALOG: tuple[tuple[str, str, str, str], ...] = (
|
| 165 |
+
("tesseract", "Tesseract 5", "pytesseract", "[dev]"),
|
| 166 |
+
("pero_ocr", "Pero OCR", "pero_ocr", "[pero]"),
|
| 167 |
+
("kraken", "Kraken HTR", "kraken", "[kraken]"),
|
| 168 |
+
("calamari", "Calamari OCR", "calamari_ocr", "[calamari]"),
|
| 169 |
+
("mistral_ocr", "Mistral OCR (cloud)", "mistralai", "[llm]"),
|
| 170 |
+
("google_vision", "Google Vision (cloud)", "google.cloud.vision", "[ocr-cloud]"),
|
| 171 |
+
("azure_doc_intel", "Azure Doc Intel (cloud)",
|
| 172 |
+
"azure.ai.documentintelligence", "[ocr-cloud]"),
|
| 173 |
+
("precomputed", "Précalculé (OCR pré-existant)", "", ""),
|
| 174 |
+
)
|
| 175 |
+
|
| 176 |
+
|
| 177 |
@cli.command("engines")
|
| 178 |
def engines_cmd() -> None:
|
| 179 |
+
"""Liste les moteurs OCR disponibles et vérifie leur installation.
|
| 180 |
+
|
| 181 |
+
Source de vérité unique avec ``/api/engines`` (Phase 3 du chantier
|
| 182 |
+
post-rewrite) : tous les moteurs listés ici sont effectivement
|
| 183 |
+
instanciables via ``picarones.adapters.ocr.factory``.
|
| 184 |
+
"""
|
| 185 |
+
from picarones.adapters.ocr.factory import _SUPPORTED
|
| 186 |
|
| 187 |
click.echo("Moteurs OCR disponibles :\n")
|
| 188 |
+
for engine_id, label, module, extra in _CLI_ENGINE_CATALOG:
|
| 189 |
+
# Garde-fou de cohérence : l'entrée CLI ne doit jamais
|
| 190 |
+
# référencer un engine inconnu de la factory canonique.
|
| 191 |
+
if engine_id not in _SUPPORTED:
|
| 192 |
+
continue
|
| 193 |
+
if not module:
|
| 194 |
+
status = click.style("✓ intégré", fg="green")
|
| 195 |
+
else:
|
| 196 |
+
try:
|
| 197 |
+
__import__(module)
|
| 198 |
+
status = click.style("✓ disponible", fg="green")
|
| 199 |
+
except ImportError:
|
| 200 |
+
hint = f" (pip install picarones{extra})" if extra else ""
|
| 201 |
+
status = click.style(f"✗ non installé{hint}", fg="red")
|
| 202 |
+
click.echo(f" {engine_id:<18} {label:<32} {status}")
|
| 203 |
|
| 204 |
click.echo(
|
| 205 |
+
"\nNote : kraken/calamari exigent un modèle utilisateur "
|
| 206 |
+
"(``.mlmodel``/``.ckpt``) — pas de modèle par défaut.",
|
|
|
|
| 207 |
)
|
| 208 |
|
| 209 |
|
|
@@ -60,7 +60,8 @@ _logger = logging.getLogger(__name__)
|
|
| 60 |
|
| 61 |
@asynccontextmanager
|
| 62 |
async def _lifespan(app: FastAPI):
|
| 63 |
-
"""Hook de démarrage : valide la config + nettoie les jobs orphelins
|
|
|
|
| 64 |
|
| 65 |
1. Sprint S6.9 — ``validate_csrf_config()`` : refuse de démarrer
|
| 66 |
si ``PICARONES_CSRF_REQUIRED=1`` sans ``PICARONES_CSRF_SECRET``
|
|
@@ -71,7 +72,14 @@ async def _lifespan(app: FastAPI):
|
|
| 71 |
précédent est mort sans les finir). On les bascule en
|
| 72 |
``interrupted`` pour ne pas laisser d'état mensonger sur le
|
| 73 |
tableau de bord.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 74 |
"""
|
|
|
|
|
|
|
| 75 |
# Étape 1 — validation config (échec rapide si dangereux).
|
| 76 |
from picarones.interfaces.web.security import validate_csrf_config
|
| 77 |
validate_csrf_config()
|
|
@@ -91,7 +99,27 @@ async def _lifespan(app: FastAPI):
|
|
| 91 |
"base SQLite inaccessible (%s) : le tableau de bord "
|
| 92 |
"affichera des jobs zombies.", exc,
|
| 93 |
)
|
| 94 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 95 |
|
| 96 |
|
| 97 |
# ──────────────────────────────────────────────────────────────────────────
|
|
|
|
| 60 |
|
| 61 |
@asynccontextmanager
|
| 62 |
async def _lifespan(app: FastAPI):
|
| 63 |
+
"""Hook de démarrage : valide la config + nettoie les jobs orphelins
|
| 64 |
+
+ démarre la tâche RGPD de purge des uploads.
|
| 65 |
|
| 66 |
1. Sprint S6.9 — ``validate_csrf_config()`` : refuse de démarrer
|
| 67 |
si ``PICARONES_CSRF_REQUIRED=1`` sans ``PICARONES_CSRF_SECRET``
|
|
|
|
| 72 |
précédent est mort sans les finir). On les bascule en
|
| 73 |
``interrupted`` pour ne pas laisser d'état mensonger sur le
|
| 74 |
tableau de bord.
|
| 75 |
+
3. Phase 4 du chantier post-rewrite — démarrage explicite de
|
| 76 |
+
:func:`upload_purge_task` (RGPD). Auparavant définie dans
|
| 77 |
+
``maintenance.py`` mais jamais lancée par ce lifespan, elle
|
| 78 |
+
était du code zombie. Désormais lancée comme tâche asyncio
|
| 79 |
+
de fond ; annulation propre au shutdown.
|
| 80 |
"""
|
| 81 |
+
import asyncio
|
| 82 |
+
|
| 83 |
# Étape 1 — validation config (échec rapide si dangereux).
|
| 84 |
from picarones.interfaces.web.security import validate_csrf_config
|
| 85 |
validate_csrf_config()
|
|
|
|
| 99 |
"base SQLite inaccessible (%s) : le tableau de bord "
|
| 100 |
"affichera des jobs zombies.", exc,
|
| 101 |
)
|
| 102 |
+
|
| 103 |
+
# Étape 3 — démarrage tâche de purge RGPD.
|
| 104 |
+
from picarones.interfaces.web.maintenance import upload_purge_task
|
| 105 |
+
purge_task = asyncio.create_task(upload_purge_task(state.UPLOADS_DIR))
|
| 106 |
+
try:
|
| 107 |
+
yield
|
| 108 |
+
finally:
|
| 109 |
+
# Annulation propre au shutdown ; on attend l'acquittement de
|
| 110 |
+
# la CancelledError pour éviter le warning "Task was destroyed
|
| 111 |
+
# but it is pending". ``asyncio.shield`` n'est pas nécessaire :
|
| 112 |
+
# on accepte la perte d'une éventuelle passe de purge en cours
|
| 113 |
+
# (idempotente, sera reprise au prochain démarrage).
|
| 114 |
+
purge_task.cancel()
|
| 115 |
+
try:
|
| 116 |
+
await purge_task
|
| 117 |
+
except (asyncio.CancelledError, Exception) as exc: # noqa: BLE001
|
| 118 |
+
if not isinstance(exc, asyncio.CancelledError):
|
| 119 |
+
_logger.warning(
|
| 120 |
+
"[maintenance] tâche de purge arrêtée sur erreur : %s",
|
| 121 |
+
exc,
|
| 122 |
+
)
|
| 123 |
|
| 124 |
|
| 125 |
# ──────────────────────────────────────────────────────────────────────────
|
|
@@ -11,9 +11,9 @@ API publique
|
|
| 11 |
Helpers internes (préfixe ``_``)
|
| 12 |
--------------------------------
|
| 13 |
- ``_build_llm_adapter`` : factory adapter LLM depuis une config
|
| 14 |
-
``
|
| 15 |
- ``_engine_from_competitor`` : factory moteur OCR ou pipeline
|
| 16 |
-
OCR+LLM depuis une ``
|
| 17 |
|
| 18 |
Ces utilitaires sont consommés par le router ``/api/benchmark/*``.
|
| 19 |
"""
|
|
@@ -28,7 +28,7 @@ from typing import Any, Optional
|
|
| 28 |
from picarones.interfaces.web.models import (
|
| 29 |
BenchmarkRequest,
|
| 30 |
BenchmarkRunRequest,
|
| 31 |
-
|
| 32 |
)
|
| 33 |
from picarones.interfaces.web.state import BenchmarkJob, iso_now
|
| 34 |
|
|
@@ -91,7 +91,7 @@ def sse_format(event_type: str, data: Any, seq: Optional[int] = None) -> str:
|
|
| 91 |
return f"{head}event: {event_type}\ndata: {payload}\n\n"
|
| 92 |
|
| 93 |
|
| 94 |
-
def _build_llm_adapter(comp:
|
| 95 |
"""Instancie un adaptateur LLM depuis la config d'un concurrent."""
|
| 96 |
if comp.llm_provider == "openai":
|
| 97 |
from picarones.adapters.llm.openai_adapter import OpenAIAdapter
|
|
@@ -126,7 +126,7 @@ def _sanitize_name_suffix(value: str) -> str:
|
|
| 126 |
def _ocr_adapter_name(engine_id: str, ocr_model: str) -> str:
|
| 127 |
"""Nom canonique de l'adapter OCR pour un couple ``(engine, model)``.
|
| 128 |
|
| 129 |
-
Deux ``
|
| 130 |
obtiennent le même ``name`` (donc le resolver les déduplique
|
| 131 |
proprement). Deux configs différentes obtiennent des noms
|
| 132 |
distincts — pas de collision silencieuse, pas de bricolage côté
|
|
@@ -170,6 +170,21 @@ _OCR_KWARGS_BUILDERS: dict[str, Any] = {
|
|
| 170 |
"lang": model or "fra",
|
| 171 |
"psm": 6,
|
| 172 |
},
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 173 |
"mistral_ocr": lambda model: {
|
| 174 |
"model": model or "mistral-ocr-latest",
|
| 175 |
},
|
|
@@ -200,8 +215,8 @@ def _build_ocr_kwargs(engine_id: str, ocr_model: str) -> dict[str, Any]:
|
|
| 200 |
return kwargs
|
| 201 |
|
| 202 |
|
| 203 |
-
def _engine_from_competitor(comp:
|
| 204 |
-
"""Instancie un moteur OCR (ou pipeline OCR+LLM) depuis une
|
| 205 |
|
| 206 |
Modes supportés :
|
| 207 |
|
|
@@ -226,7 +241,7 @@ def _engine_from_competitor(comp: CompetitorConfig) -> Any:
|
|
| 226 |
# des constructeurs ``BaseOCREngine`` legacy. Les adapters
|
| 227 |
# canoniques ont des kwargs nommés (pas de dict ``config``) — la
|
| 228 |
# conversion se fait ici en respectant les noms historiques des
|
| 229 |
-
# champs ``
|
| 230 |
ocr = None
|
| 231 |
if not is_corpus_ocr:
|
| 232 |
from picarones.adapters.ocr.factory import ocr_adapter_from_name
|
|
@@ -248,14 +263,20 @@ def _engine_from_competitor(comp: CompetitorConfig) -> Any:
|
|
| 248 |
|
| 249 |
# Pipeline OCR+LLM (live ou post-correction) — ``OCRLLMPipelineConfig``
|
| 250 |
# canonique remplace l'ex-``OCRLLMPipeline`` legacy.
|
| 251 |
-
|
| 252 |
-
|
| 253 |
-
|
| 254 |
-
|
| 255 |
-
|
| 256 |
-
|
| 257 |
-
|
| 258 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 259 |
|
| 260 |
llm = _build_llm_adapter(comp)
|
| 261 |
|
|
@@ -283,7 +304,7 @@ def _engine_from_competitor(comp: CompetitorConfig) -> Any:
|
|
| 283 |
|
| 284 |
|
| 285 |
def run_benchmark_thread_v2(job: BenchmarkJob, req: BenchmarkRunRequest) -> None:
|
| 286 |
-
"""Exécute un benchmark à partir d'une liste de ``
|
| 287 |
job.set_status("running")
|
| 288 |
job.started_at = iso_now()
|
| 289 |
job.add_event("start", {"message": "Démarrage du benchmark…", "corpus": req.corpus_path})
|
|
@@ -394,123 +415,72 @@ def run_benchmark_thread_v2(job: BenchmarkJob, req: BenchmarkRunRequest) -> None
|
|
| 394 |
job.add_event("error", {"message": f"Erreur : {exc}"})
|
| 395 |
|
| 396 |
|
| 397 |
-
def
|
| 398 |
-
"""
|
| 399 |
-
job.set_status("running")
|
| 400 |
-
job.started_at = iso_now()
|
| 401 |
-
job.add_event("start", {"message": "Démarrage du benchmark…", "corpus": req.corpus_path})
|
| 402 |
-
|
| 403 |
-
try:
|
| 404 |
-
from picarones.app.services.benchmark_runner import (
|
| 405 |
-
run_benchmark_via_service,
|
| 406 |
-
)
|
| 407 |
-
from picarones.evaluation.corpus import load_corpus_from_directory
|
| 408 |
-
|
| 409 |
-
# Charger le corpus
|
| 410 |
-
job.add_event("log", {"message": f"Chargement du corpus : {req.corpus_path}"})
|
| 411 |
-
corpus = load_corpus_from_directory(req.corpus_path)
|
| 412 |
-
job.total_docs = len(corpus)
|
| 413 |
-
job.add_event("log", {"message": f"{job.total_docs} documents chargés."})
|
| 414 |
-
|
| 415 |
-
if job.status == "cancelled":
|
| 416 |
-
return
|
| 417 |
-
|
| 418 |
-
# Sprint H.2.b.4 — instanciation via la factory canonique
|
| 419 |
-
# ``ocr_adapter_from_name`` (retourne ``BaseOCRAdapter``).
|
| 420 |
-
from picarones.adapters.ocr.factory import ocr_adapter_from_name
|
| 421 |
-
|
| 422 |
-
ocr_engines = []
|
| 423 |
-
for engine_name in req.engines:
|
| 424 |
-
try:
|
| 425 |
-
if engine_name.lower() in {"tesseract", "tess"}:
|
| 426 |
-
eng = ocr_adapter_from_name(
|
| 427 |
-
engine_name, lang=req.lang, psm=6,
|
| 428 |
-
)
|
| 429 |
-
else:
|
| 430 |
-
eng = ocr_adapter_from_name(engine_name)
|
| 431 |
-
ocr_engines.append(eng)
|
| 432 |
-
job.add_event("log", {"message": f"Moteur chargé : {engine_name}"})
|
| 433 |
-
except Exception as exc:
|
| 434 |
-
job.add_event("warning", {"message": f"Moteur ignoré '{engine_name}' : {exc}"})
|
| 435 |
-
|
| 436 |
-
if not ocr_engines:
|
| 437 |
-
raise ValueError("Aucun moteur valide disponible.")
|
| 438 |
-
|
| 439 |
-
# Répertoire de sortie
|
| 440 |
-
# Sprint A14-S1 — A.I.0 P0 : ``output_dir`` a déjà été validé
|
| 441 |
-
# par le router (validated_path). ``report_name`` est sanitizé
|
| 442 |
-
# ici pour défense en profondeur (refuse ``../``, séparateurs,
|
| 443 |
-
# caractères de contrôle) avant concaténation à output_dir.
|
| 444 |
-
from picarones.interfaces.web.security import safe_report_name
|
| 445 |
-
output_dir = Path(req.output_dir)
|
| 446 |
-
output_dir.mkdir(parents=True, exist_ok=True)
|
| 447 |
-
raw_name = req.report_name or f"rapport_{datetime.now().strftime('%Y%m%d_%H%M%S')}"
|
| 448 |
-
report_name = safe_report_name(raw_name)
|
| 449 |
-
output_json = str(output_dir / f"{report_name}.json")
|
| 450 |
-
output_html = str(output_dir / f"{report_name}.html")
|
| 451 |
-
|
| 452 |
-
# Callback de progression
|
| 453 |
-
n_engines = len(ocr_engines)
|
| 454 |
-
total_steps = job.total_docs * n_engines
|
| 455 |
-
step_counter = [0]
|
| 456 |
-
|
| 457 |
-
def _progress_callback(engine_name: str, doc_idx: int, doc_id: str) -> None:
|
| 458 |
-
if job.status == "cancelled":
|
| 459 |
-
return
|
| 460 |
-
step_counter[0] += 1
|
| 461 |
-
job.current_engine = engine_name
|
| 462 |
-
job.processed_docs = doc_idx
|
| 463 |
-
job.progress = step_counter[0] / max(total_steps, 1)
|
| 464 |
-
job.add_event("progress", {
|
| 465 |
-
"engine": engine_name,
|
| 466 |
-
"doc_idx": doc_idx,
|
| 467 |
-
"doc_id": doc_id,
|
| 468 |
-
"progress": job.progress,
|
| 469 |
-
"processed": step_counter[0],
|
| 470 |
-
"total": total_steps,
|
| 471 |
-
})
|
| 472 |
|
| 473 |
-
|
| 474 |
-
|
|
|
|
|
|
|
| 475 |
|
| 476 |
-
|
| 477 |
-
|
| 478 |
-
|
| 479 |
-
|
| 480 |
-
|
| 481 |
-
|
| 482 |
-
|
| 483 |
-
|
| 484 |
-
|
| 485 |
-
|
| 486 |
-
|
| 487 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 488 |
)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 489 |
|
| 490 |
-
if job.status == "cancelled":
|
| 491 |
-
return
|
| 492 |
-
|
| 493 |
-
job.add_event("log", {"message": "Génération du rapport HTML…"})
|
| 494 |
-
from picarones.reports.html.generator import ReportGenerator
|
| 495 |
-
report_lang = getattr(req, "report_lang", "fr")
|
| 496 |
-
gen = ReportGenerator(result, lang=report_lang)
|
| 497 |
-
gen.generate(output_html)
|
| 498 |
|
| 499 |
-
|
| 500 |
-
|
| 501 |
-
job.set_status("complete")
|
| 502 |
|
| 503 |
-
|
| 504 |
-
|
| 505 |
-
|
| 506 |
-
|
| 507 |
-
|
| 508 |
-
"ranking": ranking,
|
| 509 |
-
})
|
| 510 |
|
| 511 |
-
|
| 512 |
-
|
| 513 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 514 |
|
| 515 |
|
| 516 |
__all__ = [
|
|
|
|
| 11 |
Helpers internes (préfixe ``_``)
|
| 12 |
--------------------------------
|
| 13 |
- ``_build_llm_adapter`` : factory adapter LLM depuis une config
|
| 14 |
+
``PipelineConfig``.
|
| 15 |
- ``_engine_from_competitor`` : factory moteur OCR ou pipeline
|
| 16 |
+
OCR+LLM depuis une ``PipelineConfig``.
|
| 17 |
|
| 18 |
Ces utilitaires sont consommés par le router ``/api/benchmark/*``.
|
| 19 |
"""
|
|
|
|
| 28 |
from picarones.interfaces.web.models import (
|
| 29 |
BenchmarkRequest,
|
| 30 |
BenchmarkRunRequest,
|
| 31 |
+
PipelineConfig,
|
| 32 |
)
|
| 33 |
from picarones.interfaces.web.state import BenchmarkJob, iso_now
|
| 34 |
|
|
|
|
| 91 |
return f"{head}event: {event_type}\ndata: {payload}\n\n"
|
| 92 |
|
| 93 |
|
| 94 |
+
def _build_llm_adapter(comp: PipelineConfig) -> Any:
|
| 95 |
"""Instancie un adaptateur LLM depuis la config d'un concurrent."""
|
| 96 |
if comp.llm_provider == "openai":
|
| 97 |
from picarones.adapters.llm.openai_adapter import OpenAIAdapter
|
|
|
|
| 126 |
def _ocr_adapter_name(engine_id: str, ocr_model: str) -> str:
|
| 127 |
"""Nom canonique de l'adapter OCR pour un couple ``(engine, model)``.
|
| 128 |
|
| 129 |
+
Deux ``PipelineConfig`` qui partagent exactement le même couple
|
| 130 |
obtiennent le même ``name`` (donc le resolver les déduplique
|
| 131 |
proprement). Deux configs différentes obtiennent des noms
|
| 132 |
distincts — pas de collision silencieuse, pas de bricolage côté
|
|
|
|
| 170 |
"lang": model or "fra",
|
| 171 |
"psm": 6,
|
| 172 |
},
|
| 173 |
+
"pero_ocr": lambda model: {
|
| 174 |
+
"config_path": model or "",
|
| 175 |
+
},
|
| 176 |
+
# Phase 3 chantier post-rewrite : kraken/calamari étaient annoncés
|
| 177 |
+
# par ``/api/engines`` mais sans factory branchée → benchmark web
|
| 178 |
+
# échouait silencieusement. Le ``ocr_model`` côté UI véhicule
|
| 179 |
+
# désormais le chemin du modèle (Kraken ``.mlmodel`` ou Calamari
|
| 180 |
+
# checkpoint). Si vide, l'adapter lève une OCRAdapterError
|
| 181 |
+
# explicite à ``execute`` — pas de fallback silencieux.
|
| 182 |
+
"kraken": lambda model: {
|
| 183 |
+
"model_path": model or "",
|
| 184 |
+
},
|
| 185 |
+
"calamari": lambda model: {
|
| 186 |
+
"checkpoint": model or "",
|
| 187 |
+
},
|
| 188 |
"mistral_ocr": lambda model: {
|
| 189 |
"model": model or "mistral-ocr-latest",
|
| 190 |
},
|
|
|
|
| 215 |
return kwargs
|
| 216 |
|
| 217 |
|
| 218 |
+
def _engine_from_competitor(comp: PipelineConfig) -> Any:
|
| 219 |
+
"""Instancie un moteur OCR (ou pipeline OCR+LLM) depuis une PipelineConfig.
|
| 220 |
|
| 221 |
Modes supportés :
|
| 222 |
|
|
|
|
| 241 |
# des constructeurs ``BaseOCREngine`` legacy. Les adapters
|
| 242 |
# canoniques ont des kwargs nommés (pas de dict ``config``) — la
|
| 243 |
# conversion se fait ici en respectant les noms historiques des
|
| 244 |
+
# champs ``PipelineConfig.ocr_model``.
|
| 245 |
ocr = None
|
| 246 |
if not is_corpus_ocr:
|
| 247 |
from picarones.adapters.ocr.factory import ocr_adapter_from_name
|
|
|
|
| 263 |
|
| 264 |
# Pipeline OCR+LLM (live ou post-correction) — ``OCRLLMPipelineConfig``
|
| 265 |
# canonique remplace l'ex-``OCRLLMPipeline`` legacy.
|
| 266 |
+
#
|
| 267 |
+
# Phase 2 chantier post-rewrite : suppression de l'ancien ``mode_map``
|
| 268 |
+
# qui aliasait silencieusement (``post_correction_text`` →
|
| 269 |
+
# ``text_only``, valeur inconnue → ``text_only``). Désormais le
|
| 270 |
+
# typage Pydantic ``PipelineMode`` rejette en 422 toute chaîne hors
|
| 271 |
+
# de la matrice {``text_only``, ``text_and_image``, ``zero_shot``},
|
| 272 |
+
# et un éventuel client API qui passerait outre la validation
|
| 273 |
+
# (test legacy, payload forgé) reçoit ici une ``ValueError``.
|
| 274 |
+
mode = comp.pipeline_mode
|
| 275 |
+
if mode not in ("text_only", "text_and_image", "zero_shot"):
|
| 276 |
+
raise ValueError(
|
| 277 |
+
f"pipeline_mode invalide : {comp.pipeline_mode!r}. "
|
| 278 |
+
"Valeurs acceptées : 'text_only', 'text_and_image', 'zero_shot'.",
|
| 279 |
+
)
|
| 280 |
|
| 281 |
llm = _build_llm_adapter(comp)
|
| 282 |
|
|
|
|
| 304 |
|
| 305 |
|
| 306 |
def run_benchmark_thread_v2(job: BenchmarkJob, req: BenchmarkRunRequest) -> None:
|
| 307 |
+
"""Exécute un benchmark à partir d'une liste de ``PipelineConfig``."""
|
| 308 |
job.set_status("running")
|
| 309 |
job.started_at = iso_now()
|
| 310 |
job.add_event("start", {"message": "Démarrage du benchmark…", "corpus": req.corpus_path})
|
|
|
|
| 415 |
job.add_event("error", {"message": f"Erreur : {exc}"})
|
| 416 |
|
| 417 |
|
| 418 |
+
def _legacy_request_to_run_request(req: BenchmarkRequest) -> BenchmarkRunRequest:
|
| 419 |
+
"""Convertit un ``BenchmarkRequest`` legacy en ``BenchmarkRunRequest``.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 420 |
|
| 421 |
+
Phase 4 du chantier post-rewrite : ``/api/benchmark/start`` est
|
| 422 |
+
rétrocompatible mais délègue désormais au worker v2 unifié. La
|
| 423 |
+
conversion mappe chaque ``engine_name`` en ``PipelineConfig``
|
| 424 |
+
(OCR seul, sans LLM) en préservant ``lang`` pour Tesseract.
|
| 425 |
|
| 426 |
+
Garantit qu'un patch sécurité/méthodologique appliqué au chemin
|
| 427 |
+
canonique (v2) s'applique aussi au chemin legacy — l'éviction
|
| 428 |
+
progressive de ``/start`` peut se faire sans double maintenance.
|
| 429 |
+
"""
|
| 430 |
+
competitors: list[PipelineConfig] = []
|
| 431 |
+
for engine_name in req.engines:
|
| 432 |
+
# ``ocr_model`` véhicule le ``lang`` Tesseract via la registry
|
| 433 |
+
# ``_OCR_KWARGS_BUILDERS`` ; pour les autres engines on laisse
|
| 434 |
+
# vide (l'adapter utilise son défaut).
|
| 435 |
+
model = req.lang if engine_name.lower() in ("tesseract", "tess") else ""
|
| 436 |
+
competitors.append(
|
| 437 |
+
PipelineConfig(
|
| 438 |
+
name="",
|
| 439 |
+
ocr_engine=engine_name,
|
| 440 |
+
ocr_model=model,
|
| 441 |
+
llm_provider="",
|
| 442 |
+
llm_model="",
|
| 443 |
+
pipeline_mode="",
|
| 444 |
+
prompt_file="",
|
| 445 |
+
),
|
| 446 |
)
|
| 447 |
+
return BenchmarkRunRequest(
|
| 448 |
+
corpus_path=req.corpus_path,
|
| 449 |
+
competitors=competitors,
|
| 450 |
+
normalization_profile=req.normalization_profile,
|
| 451 |
+
char_exclude=req.char_exclude,
|
| 452 |
+
output_dir=req.output_dir,
|
| 453 |
+
report_name=req.report_name,
|
| 454 |
+
report_lang=req.report_lang,
|
| 455 |
+
)
|
| 456 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 457 |
|
| 458 |
+
def run_benchmark_thread(job: BenchmarkJob, req: BenchmarkRequest) -> None:
|
| 459 |
+
"""Worker historique de ``/api/benchmark/start``.
|
|
|
|
| 460 |
|
| 461 |
+
Phase 4 du chantier post-rewrite : unifié avec ``run_benchmark_thread_v2``
|
| 462 |
+
via conversion ``BenchmarkRequest → BenchmarkRunRequest``. Avant
|
| 463 |
+
cette unification, deux workers indépendants implémentaient
|
| 464 |
+
presque la même logique → tout patch (sécurité, méthodologie)
|
| 465 |
+
devait être dupliqué, et il était facile d'en oublier un.
|
|
|
|
|
|
|
| 466 |
|
| 467 |
+
Marqué deprecated dans les logs ; à supprimer dans une release
|
| 468 |
+
future après que tous les consommateurs aient migré vers
|
| 469 |
+
``/api/benchmark/run``.
|
| 470 |
+
"""
|
| 471 |
+
import logging as _logging
|
| 472 |
+
_logging.getLogger(__name__).warning(
|
| 473 |
+
"[benchmark] /api/benchmark/start est déprécié — utiliser "
|
| 474 |
+
"/api/benchmark/run (PipelineConfig). Phase 4 du chantier "
|
| 475 |
+
"post-rewrite : le worker legacy délègue désormais au v2 unifié.",
|
| 476 |
+
)
|
| 477 |
+
job.add_event("log", {
|
| 478 |
+
"message": (
|
| 479 |
+
"Note : /api/benchmark/start est déprécié — utiliser "
|
| 480 |
+
"/api/benchmark/run pour les nouveaux clients."
|
| 481 |
+
),
|
| 482 |
+
})
|
| 483 |
+
return run_benchmark_thread_v2(job, _legacy_request_to_run_request(req))
|
| 484 |
|
| 485 |
|
| 486 |
__all__ = [
|
|
@@ -2,12 +2,14 @@
|
|
| 2 |
|
| 3 |
Détection ALTO/PAGE, extraction de texte GT, analyse de la structure
|
| 4 |
d'un dossier corpus, extraction de ZIP avec garde-fous (taille
|
| 5 |
-
décompressée, nombre de fichiers
|
|
|
|
| 6 |
à :func:`picarones.formats._xml_utils.safe_parse_xml`.
|
| 7 |
"""
|
| 8 |
|
| 9 |
from __future__ import annotations
|
| 10 |
|
|
|
|
| 11 |
import xml.etree.ElementTree as ET
|
| 12 |
import zipfile
|
| 13 |
from pathlib import Path
|
|
@@ -15,6 +17,8 @@ from pathlib import Path
|
|
| 15 |
from picarones.formats._xml_utils import safe_parse_xml
|
| 16 |
from picarones.interfaces.web.state import IMAGE_EXTS
|
| 17 |
|
|
|
|
|
|
|
| 18 |
# Garde-fous ZIP-bomb pour l'upload
|
| 19 |
MAX_ZIP_TOTAL_SIZE = 500 * 1024 * 1024
|
| 20 |
"""500 Mo décompressé maximum."""
|
|
@@ -165,17 +169,83 @@ def analyze_corpus_dir(path: Path) -> dict:
|
|
| 165 |
# Extraction ZIP sécurisée
|
| 166 |
# ──────────────────────────────────────────────────────────────────────────
|
| 167 |
|
| 168 |
-
def
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 169 |
"""Extrait un ZIP en aplatissant les paires image/.gt.txt/.xml dans ``dest``.
|
| 170 |
|
| 171 |
Garde-fous :
|
|
|
|
| 172 |
- Ignore les fichiers cachés macOS (préfixe ``.`` ou ``__MACOSX``).
|
| 173 |
- Refuse si la taille décompressée totale dépasse ``MAX_ZIP_TOTAL_SIZE``.
|
| 174 |
- Refuse si le nombre de fichiers extraits dépasse ``MAX_ZIP_FILES``.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 175 |
"""
|
|
|
|
|
|
|
|
|
|
|
|
|
| 176 |
dest.mkdir(parents=True, exist_ok=True)
|
| 177 |
total_size = 0
|
| 178 |
file_count = 0
|
|
|
|
| 179 |
for member in zf.infolist():
|
| 180 |
if member.is_dir():
|
| 181 |
continue
|
|
@@ -183,23 +253,49 @@ def flatten_zip_to_dir(zf: zipfile.ZipFile, dest: Path) -> None:
|
|
| 183 |
name = p.name
|
| 184 |
if name.startswith("."):
|
| 185 |
continue
|
| 186 |
-
|
| 187 |
-
|
|
|
|
|
|
|
| 188 |
or name.endswith(".gt.txt")
|
| 189 |
or name.endswith(".ocr.txt")
|
| 190 |
-
or
|
| 191 |
):
|
| 192 |
-
|
| 193 |
-
|
| 194 |
-
|
| 195 |
-
|
| 196 |
-
|
| 197 |
-
|
| 198 |
-
|
| 199 |
-
|
| 200 |
-
|
| 201 |
-
|
| 202 |
-
(
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 203 |
|
| 204 |
|
| 205 |
__all__ = [
|
|
|
|
| 2 |
|
| 3 |
Détection ALTO/PAGE, extraction de texte GT, analyse de la structure
|
| 4 |
d'un dossier corpus, extraction de ZIP avec garde-fous (taille
|
| 5 |
+
décompressée, nombre de fichiers, validation image extraite,
|
| 6 |
+
détection de collision de basename). Le parsing XML sécurisé délègue
|
| 7 |
à :func:`picarones.formats._xml_utils.safe_parse_xml`.
|
| 8 |
"""
|
| 9 |
|
| 10 |
from __future__ import annotations
|
| 11 |
|
| 12 |
+
import logging
|
| 13 |
import xml.etree.ElementTree as ET
|
| 14 |
import zipfile
|
| 15 |
from pathlib import Path
|
|
|
|
| 17 |
from picarones.formats._xml_utils import safe_parse_xml
|
| 18 |
from picarones.interfaces.web.state import IMAGE_EXTS
|
| 19 |
|
| 20 |
+
logger = logging.getLogger(__name__)
|
| 21 |
+
|
| 22 |
# Garde-fous ZIP-bomb pour l'upload
|
| 23 |
MAX_ZIP_TOTAL_SIZE = 500 * 1024 * 1024
|
| 24 |
"""500 Mo décompressé maximum."""
|
|
|
|
| 169 |
# Extraction ZIP sécurisée
|
| 170 |
# ──────────────────────────────────────────────────────────────────────────
|
| 171 |
|
| 172 |
+
def _slug_dirname(source_path: Path) -> str:
|
| 173 |
+
"""Slugifie le ``dirname`` d'une entrée ZIP pour préfixer en cas de collision.
|
| 174 |
+
|
| 175 |
+
``a/b/img.png`` → ``a_b``. Caractères non sûrs (``..``, séparateurs)
|
| 176 |
+
sont normalisés en ``_``. Vide si l'entrée est à la racine du ZIP.
|
| 177 |
+
"""
|
| 178 |
+
parent = source_path.parent
|
| 179 |
+
if parent == Path() or str(parent) == ".":
|
| 180 |
+
return ""
|
| 181 |
+
parts = [
|
| 182 |
+
part.replace("..", "_").replace("/", "_").replace("\\", "_")
|
| 183 |
+
for part in parent.parts
|
| 184 |
+
if part not in ("", "/", "\\")
|
| 185 |
+
]
|
| 186 |
+
return "_".join(p for p in parts if p)
|
| 187 |
+
|
| 188 |
+
|
| 189 |
+
def _resolve_collision(
|
| 190 |
+
name: str, source_path: Path, taken: set[str],
|
| 191 |
+
) -> str:
|
| 192 |
+
"""Renomme ``name`` pour éviter une collision avec ``taken``.
|
| 193 |
+
|
| 194 |
+
Stratégie :
|
| 195 |
+
1. Préfixe avec le slug du dirname source (traçabilité). Si pas de
|
| 196 |
+
dirname ou si déjà pris, ajoute un suffixe numérique.
|
| 197 |
+
2. Lève ``ValueError`` après 1000 tentatives (corpus pathologique).
|
| 198 |
+
"""
|
| 199 |
+
slug = _slug_dirname(source_path)
|
| 200 |
+
if slug:
|
| 201 |
+
candidate = f"{slug}__{name}"
|
| 202 |
+
if candidate not in taken:
|
| 203 |
+
return candidate
|
| 204 |
+
stem = Path(name).stem
|
| 205 |
+
suffix = "".join(Path(name).suffixes)
|
| 206 |
+
for n in range(2, 1001):
|
| 207 |
+
candidate = f"{stem}_{n}{suffix}"
|
| 208 |
+
if candidate not in taken:
|
| 209 |
+
return candidate
|
| 210 |
+
raise ValueError(
|
| 211 |
+
f"Impossible de résoudre la collision de basename pour {name!r} "
|
| 212 |
+
f"après 1000 tentatives — corpus pathologique ?",
|
| 213 |
+
)
|
| 214 |
+
|
| 215 |
+
|
| 216 |
+
def flatten_zip_to_dir(
|
| 217 |
+
zf: zipfile.ZipFile,
|
| 218 |
+
dest: Path,
|
| 219 |
+
*,
|
| 220 |
+
validate_images: bool = True,
|
| 221 |
+
) -> None:
|
| 222 |
"""Extrait un ZIP en aplatissant les paires image/.gt.txt/.xml dans ``dest``.
|
| 223 |
|
| 224 |
Garde-fous :
|
| 225 |
+
|
| 226 |
- Ignore les fichiers cachés macOS (préfixe ``.`` ou ``__MACOSX``).
|
| 227 |
- Refuse si la taille décompressée totale dépasse ``MAX_ZIP_TOTAL_SIZE``.
|
| 228 |
- Refuse si le nombre de fichiers extraits dépasse ``MAX_ZIP_FILES``.
|
| 229 |
+
- **Détection de collision de basename** : ``a/img.png`` et
|
| 230 |
+
``b/img.png`` ne s'écrasent plus silencieusement — le second est
|
| 231 |
+
renommé avec un préfixe dérivé de son dossier source (ex.
|
| 232 |
+
``b__img.png``) et un warning est loggué. Sans ce garde-fou,
|
| 233 |
+
l'utilisateur pouvait associer silencieusement une image à une
|
| 234 |
+
GT incorrecte.
|
| 235 |
+
- **Validation image** : chaque image extraite passe par
|
| 236 |
+
:func:`validate_image_safe` (Pillow.verify, anti-bombe), de la
|
| 237 |
+
même manière que les uploads directs. Désactivable via
|
| 238 |
+
``validate_images=False`` (utile aux tests qui ne fournissent
|
| 239 |
+
pas de PNG complets).
|
| 240 |
"""
|
| 241 |
+
# Import retardé : ``security`` dépend de ``state`` qui dépend de
|
| 242 |
+
# ``corpus_utils`` → circulaire si import toplevel.
|
| 243 |
+
from picarones.interfaces.web.security import validate_image_safe
|
| 244 |
+
|
| 245 |
dest.mkdir(parents=True, exist_ok=True)
|
| 246 |
total_size = 0
|
| 247 |
file_count = 0
|
| 248 |
+
written_names: set[str] = set()
|
| 249 |
for member in zf.infolist():
|
| 250 |
if member.is_dir():
|
| 251 |
continue
|
|
|
|
| 253 |
name = p.name
|
| 254 |
if name.startswith("."):
|
| 255 |
continue
|
| 256 |
+
suffix_lower = p.suffix.lower()
|
| 257 |
+
is_image = suffix_lower in IMAGE_EXTS
|
| 258 |
+
if not (
|
| 259 |
+
is_image
|
| 260 |
or name.endswith(".gt.txt")
|
| 261 |
or name.endswith(".ocr.txt")
|
| 262 |
+
or suffix_lower == ".xml"
|
| 263 |
):
|
| 264 |
+
continue
|
| 265 |
+
|
| 266 |
+
total_size += member.file_size
|
| 267 |
+
if total_size > MAX_ZIP_TOTAL_SIZE:
|
| 268 |
+
raise ValueError(
|
| 269 |
+
f"ZIP trop volumineux : taille décompressée > "
|
| 270 |
+
f"{MAX_ZIP_TOTAL_SIZE // (1024*1024)} Mo"
|
| 271 |
+
)
|
| 272 |
+
file_count += 1
|
| 273 |
+
if file_count > MAX_ZIP_FILES:
|
| 274 |
+
raise ValueError(f"ZIP contient trop de fichiers (> {MAX_ZIP_FILES})")
|
| 275 |
+
|
| 276 |
+
data = zf.read(member.filename)
|
| 277 |
+
|
| 278 |
+
# Validation image après extraction (les images directes sont
|
| 279 |
+
# déjà validées par ``api_corpus_upload``, mais celles extraites
|
| 280 |
+
# d'un ZIP ne passaient pas par cette vérification — vecteur de
|
| 281 |
+
# zip bomb passant les 500 Mo brut).
|
| 282 |
+
if is_image and validate_images:
|
| 283 |
+
validate_image_safe(data, filename=name)
|
| 284 |
+
|
| 285 |
+
# Détection de collision : ``a/img.png`` et ``b/img.png`` ne
|
| 286 |
+
# doivent pas s'écraser silencieusement (vecteur de
|
| 287 |
+
# mauvaise association image/GT après aplatissement).
|
| 288 |
+
if name in written_names:
|
| 289 |
+
new_name = _resolve_collision(name, p, written_names)
|
| 290 |
+
logger.warning(
|
| 291 |
+
"[flatten_zip] collision de basename %r — renommé en %r "
|
| 292 |
+
"(source ZIP : %r)",
|
| 293 |
+
name, new_name, member.filename,
|
| 294 |
+
)
|
| 295 |
+
name = new_name
|
| 296 |
+
written_names.add(name)
|
| 297 |
+
|
| 298 |
+
(dest / name).write_bytes(data)
|
| 299 |
|
| 300 |
|
| 301 |
__all__ = [
|
|
@@ -67,6 +67,24 @@ Liste alignée sur ``measurements.normalization.NORMALIZATION_PROFILES``
|
|
| 67 |
répercutée ici sous peine de rejet Pydantic au niveau API web.
|
| 68 |
Sprint A14-S1 — alignement README ↔ web models ↔ runtime."""
|
| 69 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 70 |
|
| 71 |
class BenchmarkRequest(BaseModel):
|
| 72 |
corpus_path: str = Field(min_length=1, max_length=_MAX_PATH)
|
|
@@ -94,7 +112,7 @@ class HuggingFaceImportRequest(BaseModel):
|
|
| 94 |
max_samples: int = Field(default=100, ge=1, le=10_000)
|
| 95 |
|
| 96 |
|
| 97 |
-
class
|
| 98 |
name: str = Field(default="", max_length=_MAX_NAME)
|
| 99 |
ocr_engine: str = Field(default="", max_length=_MAX_NAME)
|
| 100 |
"""Moteur OCR : ``tesseract``, ``mistral_ocr``, … ou ``corpus``
|
|
@@ -102,13 +120,20 @@ class CompetitorConfig(BaseModel):
|
|
| 102 |
ocr_model: str = Field(default="", max_length=_MAX_NAME)
|
| 103 |
llm_provider: str = Field(default="", max_length=_MAX_NAME)
|
| 104 |
llm_model: str = Field(default="", max_length=_MAX_NAME)
|
| 105 |
-
pipeline_mode:
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 106 |
prompt_file: str = Field(default="", max_length=_MAX_PROMPT_FILENAME)
|
| 107 |
|
| 108 |
|
| 109 |
class BenchmarkRunRequest(BaseModel):
|
| 110 |
corpus_path: str = Field(min_length=1, max_length=_MAX_PATH)
|
| 111 |
-
competitors: list[
|
| 112 |
min_length=1, max_length=_MAX_COMPETITORS,
|
| 113 |
)
|
| 114 |
normalization_profile: NormalizationProfileId = "nfc"
|
|
@@ -122,9 +147,10 @@ __all__ = [
|
|
| 122 |
"TesseractLang",
|
| 123 |
"ReportLang",
|
| 124 |
"NormalizationProfileId",
|
|
|
|
| 125 |
"BenchmarkRequest",
|
| 126 |
"HTRUnitedImportRequest",
|
| 127 |
"HuggingFaceImportRequest",
|
| 128 |
-
"
|
| 129 |
"BenchmarkRunRequest",
|
| 130 |
]
|
|
|
|
| 67 |
répercutée ici sous peine de rejet Pydantic au niveau API web.
|
| 68 |
Sprint A14-S1 — alignement README ↔ web models ↔ runtime."""
|
| 69 |
|
| 70 |
+
PipelineMode = Literal["text_only", "text_and_image", "zero_shot"]
|
| 71 |
+
"""Modes de pipeline OCR+LLM acceptés par ``PipelineConfig``.
|
| 72 |
+
|
| 73 |
+
Aligné sur :class:`picarones.pipeline.llm_pipeline_config.OCRLLMMode` —
|
| 74 |
+
toute valeur hors de ces 3 littéraux est rejetée 422 par Pydantic.
|
| 75 |
+
|
| 76 |
+
Sémantique :
|
| 77 |
+
|
| 78 |
+
- ``text_only`` — l'OCR amont produit un texte brut, le LLM le corrige
|
| 79 |
+
sans voir l'image (post-correction texte).
|
| 80 |
+
- ``text_and_image`` — l'OCR amont produit un texte ; le VLM le corrige
|
| 81 |
+
en s'appuyant sur l'image (post-correction multimodale).
|
| 82 |
+
- ``zero_shot`` — pas d'OCR amont ; un VLM transcrit l'image directement.
|
| 83 |
+
|
| 84 |
+
Phase 2 du chantier post-rewrite : suppression du fallback silencieux
|
| 85 |
+
``mode_map.get(comp.pipeline_mode, 'text_only')`` qui acceptait toute
|
| 86 |
+
chaîne arbitraire et la mappait sur ``text_only``."""
|
| 87 |
+
|
| 88 |
|
| 89 |
class BenchmarkRequest(BaseModel):
|
| 90 |
corpus_path: str = Field(min_length=1, max_length=_MAX_PATH)
|
|
|
|
| 112 |
max_samples: int = Field(default=100, ge=1, le=10_000)
|
| 113 |
|
| 114 |
|
| 115 |
+
class PipelineConfig(BaseModel):
|
| 116 |
name: str = Field(default="", max_length=_MAX_NAME)
|
| 117 |
ocr_engine: str = Field(default="", max_length=_MAX_NAME)
|
| 118 |
"""Moteur OCR : ``tesseract``, ``mistral_ocr``, … ou ``corpus``
|
|
|
|
| 120 |
ocr_model: str = Field(default="", max_length=_MAX_NAME)
|
| 121 |
llm_provider: str = Field(default="", max_length=_MAX_NAME)
|
| 122 |
llm_model: str = Field(default="", max_length=_MAX_NAME)
|
| 123 |
+
pipeline_mode: PipelineMode | Literal[""] = ""
|
| 124 |
+
"""Mode du pipeline OCR+LLM — vide si pas de pipeline LLM (OCR seul).
|
| 125 |
+
|
| 126 |
+
Typage strict (Phase 2 chantier post-rewrite) : Pydantic rejette
|
| 127 |
+
en 422 toute valeur hors de la matrice canonique au lieu d'aliaser
|
| 128 |
+
silencieusement sur ``text_only``. La chaîne vide (``""``) reste
|
| 129 |
+
autorisée pour indiquer qu'aucun LLM n'est attaché au moteur OCR.
|
| 130 |
+
"""
|
| 131 |
prompt_file: str = Field(default="", max_length=_MAX_PROMPT_FILENAME)
|
| 132 |
|
| 133 |
|
| 134 |
class BenchmarkRunRequest(BaseModel):
|
| 135 |
corpus_path: str = Field(min_length=1, max_length=_MAX_PATH)
|
| 136 |
+
competitors: list[PipelineConfig] = Field(
|
| 137 |
min_length=1, max_length=_MAX_COMPETITORS,
|
| 138 |
)
|
| 139 |
normalization_profile: NormalizationProfileId = "nfc"
|
|
|
|
| 147 |
"TesseractLang",
|
| 148 |
"ReportLang",
|
| 149 |
"NormalizationProfileId",
|
| 150 |
+
"PipelineMode",
|
| 151 |
"BenchmarkRequest",
|
| 152 |
"HTRUnitedImportRequest",
|
| 153 |
"HuggingFaceImportRequest",
|
| 154 |
+
"PipelineConfig",
|
| 155 |
"BenchmarkRunRequest",
|
| 156 |
]
|
|
@@ -2,7 +2,7 @@
|
|
| 2 |
|
| 3 |
Le ``stream`` SSE supporte la reprise via ``Last-Event-ID`` (Sprint 26).
|
| 4 |
``start`` lance un benchmark à liste de moteurs ; ``run`` accepte des
|
| 5 |
-
``
|
| 6 |
deux endpoints distincts pour deux UX historiquement séparées.
|
| 7 |
"""
|
| 8 |
|
|
@@ -107,7 +107,13 @@ async def api_benchmark_start(req: BenchmarkRequest, request: Request) -> dict:
|
|
| 107 |
|
| 108 |
job_id = str(uuid.uuid4())
|
| 109 |
job = state.BenchmarkJob(job_id=job_id, _store=state.JOB_STORE)
|
| 110 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 111 |
state.register_job(job)
|
| 112 |
state.cleanup_old_jobs()
|
| 113 |
|
|
@@ -116,14 +122,14 @@ async def api_benchmark_start(req: BenchmarkRequest, request: Request) -> dict:
|
|
| 116 |
|
| 117 |
|
| 118 |
# ──────────────────────────────────────────────────────────────────────────
|
| 119 |
-
# Lancement composé : liste de
|
| 120 |
# ──────────────────────────────────────────────────────────────────────────
|
| 121 |
|
| 122 |
@router.post("/api/benchmark/run")
|
| 123 |
async def api_benchmark_run(req: BenchmarkRunRequest, request: Request) -> dict:
|
| 124 |
"""Lance un benchmark à concurrents composés (OCR + LLM, pipelines).
|
| 125 |
|
| 126 |
-
Chaque ``
|
| 127 |
provider LLM (mode post-correction, zero-shot, ou OCR seul).
|
| 128 |
"""
|
| 129 |
# ``competitors`` non vide est garanti par Pydantic ``min_length=1``.
|
|
@@ -177,7 +183,8 @@ async def api_benchmark_run(req: BenchmarkRunRequest, request: Request) -> dict:
|
|
| 177 |
|
| 178 |
job_id = str(uuid.uuid4())
|
| 179 |
job = state.BenchmarkJob(job_id=job_id, _store=state.JOB_STORE)
|
| 180 |
-
|
|
|
|
| 181 |
state.register_job(job)
|
| 182 |
|
| 183 |
_start_job_thread(job, run_benchmark_thread_v2, req)
|
|
|
|
| 2 |
|
| 3 |
Le ``stream`` SSE supporte la reprise via ``Last-Event-ID`` (Sprint 26).
|
| 4 |
``start`` lance un benchmark à liste de moteurs ; ``run`` accepte des
|
| 5 |
+
``PipelineConfig`` composés (OCR + LLM, pipelines mutualisés) —
|
| 6 |
deux endpoints distincts pour deux UX historiquement séparées.
|
| 7 |
"""
|
| 8 |
|
|
|
|
| 107 |
|
| 108 |
job_id = str(uuid.uuid4())
|
| 109 |
job = state.BenchmarkJob(job_id=job_id, _store=state.JOB_STORE)
|
| 110 |
+
# Phase 4 du chantier post-rewrite : le payload du job contient
|
| 111 |
+
# désormais le ``corpus_path`` actif, pour que la tâche de purge
|
| 112 |
+
# ``upload_purge_task`` puisse identifier les corpus référencés
|
| 113 |
+
# par des jobs en cours et ne pas les supprimer. Avant ce
|
| 114 |
+
# branchement, la purge supprimait potentiellement des corpus
|
| 115 |
+
# actifs dont les uploads étaient plus anciens que la rétention.
|
| 116 |
+
state.JOB_STORE.create_job(job_id, payload={"corpus": req.corpus_path})
|
| 117 |
state.register_job(job)
|
| 118 |
state.cleanup_old_jobs()
|
| 119 |
|
|
|
|
| 122 |
|
| 123 |
|
| 124 |
# ──────────────────────────────────────────────────────────────────────────
|
| 125 |
+
# Lancement composé : liste de PipelineConfig (BenchmarkRunRequest)
|
| 126 |
# ──────────────────────────────────────────────────────────────────────────
|
| 127 |
|
| 128 |
@router.post("/api/benchmark/run")
|
| 129 |
async def api_benchmark_run(req: BenchmarkRunRequest, request: Request) -> dict:
|
| 130 |
"""Lance un benchmark à concurrents composés (OCR + LLM, pipelines).
|
| 131 |
|
| 132 |
+
Chaque ``PipelineConfig`` peut combiner un moteur OCR et un
|
| 133 |
provider LLM (mode post-correction, zero-shot, ou OCR seul).
|
| 134 |
"""
|
| 135 |
# ``competitors`` non vide est garanti par Pydantic ``min_length=1``.
|
|
|
|
| 183 |
|
| 184 |
job_id = str(uuid.uuid4())
|
| 185 |
job = state.BenchmarkJob(job_id=job_id, _store=state.JOB_STORE)
|
| 186 |
+
# Phase 4 — payload incluant le corpus actif pour la purge auto.
|
| 187 |
+
state.JOB_STORE.create_job(job_id, payload={"corpus": req.corpus_path})
|
| 188 |
state.register_job(job)
|
| 189 |
|
| 190 |
_start_job_thread(job, run_benchmark_thread_v2, req)
|
|
@@ -4,15 +4,32 @@ Surface de l'infrastructure ``BenchmarkHistory`` qui était
|
|
| 4 |
limitée au CLI ``picarones history --regression``. Le rapport HTML
|
| 5 |
peut désormais consommer cet endpoint pour afficher un encart
|
| 6 |
*« ⚠ Tesseract a régressé de 0,8 pp depuis le 12 janvier »* en tête.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 7 |
"""
|
| 8 |
|
| 9 |
from __future__ import annotations
|
| 10 |
|
| 11 |
import logging
|
|
|
|
| 12 |
from typing import Any, Optional
|
| 13 |
|
| 14 |
from fastapi import APIRouter, HTTPException, Query
|
| 15 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 16 |
router = APIRouter()
|
| 17 |
_logger = logging.getLogger(__name__)
|
| 18 |
|
|
@@ -21,13 +38,37 @@ _logger = logging.getLogger(__name__)
|
|
| 21 |
async def api_history_regressions(
|
| 22 |
engine: Optional[str] = Query(default=None, description="Filtre par moteur"),
|
| 23 |
threshold: float = Query(default=0.01, description="Seuil régression CER absolu"),
|
| 24 |
-
db_path: Optional[str] = Query(
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 25 |
) -> dict:
|
| 26 |
"""Liste les régressions détectées dans l'historique longitudinal."""
|
| 27 |
from picarones.evaluation.metrics.history import BenchmarkHistory
|
| 28 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 29 |
try:
|
| 30 |
-
history =
|
|
|
|
|
|
|
|
|
|
|
|
|
| 31 |
except Exception as exc: # noqa: BLE001
|
| 32 |
raise HTTPException(
|
| 33 |
status_code=500, detail=f"Ouverture historique échouée : {exc}",
|
|
|
|
| 4 |
limitée au CLI ``picarones history --regression``. Le rapport HTML
|
| 5 |
peut désormais consommer cet endpoint pour afficher un encart
|
| 6 |
*« ⚠ Tesseract a régressé de 0,8 pp depuis le 12 janvier »* en tête.
|
| 7 |
+
|
| 8 |
+
Sécurité — paramètre ``db_path``
|
| 9 |
+
---------------------------------
|
| 10 |
+
Le paramètre ``db_path`` est validé contre les racines workspace
|
| 11 |
+
autorisées via :func:`validated_path`. Sans ce garde-fou, l'endpoint
|
| 12 |
+
acceptait un chemin SQLite libre — vecteur de lecture filesystem
|
| 13 |
+
arbitraire (path traversal). Pour pointer une base alternative à
|
| 14 |
+
l'extérieur des workspaces, exporter ``PICARONES_HISTORY_DB`` plutôt
|
| 15 |
+
que de passer ``db_path`` par query string.
|
| 16 |
"""
|
| 17 |
|
| 18 |
from __future__ import annotations
|
| 19 |
|
| 20 |
import logging
|
| 21 |
+
import os
|
| 22 |
from typing import Any, Optional
|
| 23 |
|
| 24 |
from fastapi import APIRouter, HTTPException, Query
|
| 25 |
|
| 26 |
+
from picarones.interfaces.web.security import (
|
| 27 |
+
PathValidationError,
|
| 28 |
+
compute_workspace_roots,
|
| 29 |
+
validated_path,
|
| 30 |
+
)
|
| 31 |
+
from picarones.interfaces.web.state import UPLOADS_DIR
|
| 32 |
+
|
| 33 |
router = APIRouter()
|
| 34 |
_logger = logging.getLogger(__name__)
|
| 35 |
|
|
|
|
| 38 |
async def api_history_regressions(
|
| 39 |
engine: Optional[str] = Query(default=None, description="Filtre par moteur"),
|
| 40 |
threshold: float = Query(default=0.01, description="Seuil régression CER absolu"),
|
| 41 |
+
db_path: Optional[str] = Query(
|
| 42 |
+
default=None,
|
| 43 |
+
description=(
|
| 44 |
+
"Chemin SQLite history (validé contre les workspace roots ; "
|
| 45 |
+
"préférer la variable d'env PICARONES_HISTORY_DB)."
|
| 46 |
+
),
|
| 47 |
+
),
|
| 48 |
) -> dict:
|
| 49 |
"""Liste les régressions détectées dans l'historique longitudinal."""
|
| 50 |
from picarones.evaluation.metrics.history import BenchmarkHistory
|
| 51 |
|
| 52 |
+
if db_path:
|
| 53 |
+
try:
|
| 54 |
+
resolved = validated_path(
|
| 55 |
+
db_path,
|
| 56 |
+
allowed_roots=compute_workspace_roots(UPLOADS_DIR),
|
| 57 |
+
must_exist=False,
|
| 58 |
+
)
|
| 59 |
+
except PathValidationError as exc:
|
| 60 |
+
raise HTTPException(status_code=400, detail=str(exc)) from exc
|
| 61 |
+
effective_db_path: Optional[str] = str(resolved)
|
| 62 |
+
else:
|
| 63 |
+
env_db = os.environ.get("PICARONES_HISTORY_DB", "").strip()
|
| 64 |
+
effective_db_path = env_db or None
|
| 65 |
+
|
| 66 |
try:
|
| 67 |
+
history = (
|
| 68 |
+
BenchmarkHistory(effective_db_path)
|
| 69 |
+
if effective_db_path
|
| 70 |
+
else BenchmarkHistory()
|
| 71 |
+
)
|
| 72 |
except Exception as exc: # noqa: BLE001
|
| 73 |
raise HTTPException(
|
| 74 |
status_code=500, detail=f"Ouverture historique échouée : {exc}",
|
|
@@ -2,13 +2,65 @@
|
|
| 2 |
|
| 3 |
from __future__ import annotations
|
| 4 |
|
|
|
|
|
|
|
| 5 |
from fastapi import APIRouter, HTTPException, Query
|
| 6 |
|
| 7 |
from picarones.interfaces.web.models import HTRUnitedImportRequest, HuggingFaceImportRequest
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 8 |
|
| 9 |
router = APIRouter()
|
| 10 |
|
| 11 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 12 |
# ──────────────────────────────────────────────────────────────────────────
|
| 13 |
# HTR-United
|
| 14 |
# ──────────────────────────────────────────────────────────────────────────
|
|
@@ -19,10 +71,8 @@ async def api_htr_united_catalogue(
|
|
| 19 |
language: str = Query(default="", description="Filtre langue"),
|
| 20 |
script: str = Query(default="", description="Filtre type d'écriture"),
|
| 21 |
) -> dict:
|
| 22 |
-
"""Catalogue HTR-United filtrable."""
|
| 23 |
-
|
| 24 |
-
|
| 25 |
-
cat = HTRUnitedCatalogue.from_demo()
|
| 26 |
results = cat.search(
|
| 27 |
query=query,
|
| 28 |
language=language or None,
|
|
@@ -30,6 +80,10 @@ async def api_htr_united_catalogue(
|
|
| 30 |
)
|
| 31 |
return {
|
| 32 |
"source": cat.source,
|
|
|
|
|
|
|
|
|
|
|
|
|
| 33 |
"total": len(results),
|
| 34 |
"entries": [e.as_dict() for e in results],
|
| 35 |
"available_languages": cat.available_languages(),
|
|
@@ -40,12 +94,10 @@ async def api_htr_united_catalogue(
|
|
| 40 |
@router.post("/api/htr-united/import")
|
| 41 |
async def api_htr_united_import(req: HTRUnitedImportRequest) -> dict:
|
| 42 |
"""Importe une entrée HTR-United dans ``req.output_dir``."""
|
| 43 |
-
from picarones.adapters.corpus.htr_united import
|
| 44 |
-
HTRUnitedCatalogue,
|
| 45 |
-
import_htr_united_corpus,
|
| 46 |
-
)
|
| 47 |
|
| 48 |
-
|
|
|
|
| 49 |
entry = cat.get_by_id(req.entry_id)
|
| 50 |
if not entry:
|
| 51 |
raise HTTPException(
|
|
@@ -54,7 +106,7 @@ async def api_htr_united_import(req: HTRUnitedImportRequest) -> dict:
|
|
| 54 |
|
| 55 |
return import_htr_united_corpus(
|
| 56 |
entry=entry,
|
| 57 |
-
output_dir=
|
| 58 |
max_samples=req.max_samples,
|
| 59 |
)
|
| 60 |
|
|
@@ -92,10 +144,11 @@ async def api_huggingface_import(req: HuggingFaceImportRequest) -> dict:
|
|
| 92 |
"""Importe un dataset HuggingFace dans ``req.output_dir``."""
|
| 93 |
from picarones.adapters.corpus.huggingface import HuggingFaceImporter
|
| 94 |
|
|
|
|
| 95 |
importer = HuggingFaceImporter()
|
| 96 |
return importer.import_dataset(
|
| 97 |
dataset_id=req.dataset_id,
|
| 98 |
-
output_dir=
|
| 99 |
split=req.split,
|
| 100 |
max_samples=req.max_samples,
|
| 101 |
)
|
|
|
|
| 2 |
|
| 3 |
from __future__ import annotations
|
| 4 |
|
| 5 |
+
import os
|
| 6 |
+
|
| 7 |
from fastapi import APIRouter, HTTPException, Query
|
| 8 |
|
| 9 |
from picarones.interfaces.web.models import HTRUnitedImportRequest, HuggingFaceImportRequest
|
| 10 |
+
from picarones.interfaces.web.security import (
|
| 11 |
+
PathValidationError,
|
| 12 |
+
compute_workspace_roots,
|
| 13 |
+
validated_path,
|
| 14 |
+
)
|
| 15 |
+
from picarones.interfaces.web.state import UPLOADS_DIR
|
| 16 |
|
| 17 |
router = APIRouter()
|
| 18 |
|
| 19 |
|
| 20 |
+
def _htr_united_catalogue():
|
| 21 |
+
"""Récupère le catalogue HTR-United (remote ou demo).
|
| 22 |
+
|
| 23 |
+
Phase 4.4 du chantier post-rewrite : auparavant le router
|
| 24 |
+
appelait ``HTRUnitedCatalogue.from_demo()`` exclusivement —
|
| 25 |
+
l'UI annonçait "catalogue HTR-United" alors qu'on chargeait
|
| 26 |
+
un échantillon embarqué. Désormais ``from_remote()`` est
|
| 27 |
+
utilisé (avec fallback automatique sur demo si offline), et
|
| 28 |
+
le champ ``source`` (``"remote" | "demo"``) est exposé dans
|
| 29 |
+
la réponse pour que l'UI puisse signaler clairement le mode
|
| 30 |
+
actif.
|
| 31 |
+
|
| 32 |
+
En CI / déploiement sans réseau, exporter
|
| 33 |
+
``PICARONES_HTR_UNITED_OFFLINE=1`` force le mode démo et
|
| 34 |
+
évite un timeout de 10s à chaque GET catalogue.
|
| 35 |
+
"""
|
| 36 |
+
from picarones.adapters.corpus.htr_united import HTRUnitedCatalogue
|
| 37 |
+
|
| 38 |
+
if os.environ.get("PICARONES_HTR_UNITED_OFFLINE", "").strip() in (
|
| 39 |
+
"1", "true", "yes",
|
| 40 |
+
):
|
| 41 |
+
return HTRUnitedCatalogue.from_demo()
|
| 42 |
+
return HTRUnitedCatalogue.from_remote(timeout=5)
|
| 43 |
+
|
| 44 |
+
|
| 45 |
+
def _validated_output_dir(user_path: str) -> str:
|
| 46 |
+
"""Valide ``output_dir`` reçu d'un importer contre les racines workspace.
|
| 47 |
+
|
| 48 |
+
Les endpoints d'import écrivent un corpus distant sur le filesystem
|
| 49 |
+
du serveur — un ``output_dir`` libre permet d'écrire arbitrairement
|
| 50 |
+
(path traversal). On valide ici contre :func:`compute_workspace_roots`
|
| 51 |
+
avant de passer la chaîne au backend.
|
| 52 |
+
"""
|
| 53 |
+
try:
|
| 54 |
+
resolved = validated_path(
|
| 55 |
+
user_path,
|
| 56 |
+
allowed_roots=compute_workspace_roots(UPLOADS_DIR),
|
| 57 |
+
must_exist=False,
|
| 58 |
+
)
|
| 59 |
+
except PathValidationError as exc:
|
| 60 |
+
raise HTTPException(status_code=400, detail=str(exc)) from exc
|
| 61 |
+
return str(resolved)
|
| 62 |
+
|
| 63 |
+
|
| 64 |
# ──────────────────────────────────────────────────────────────────────────
|
| 65 |
# HTR-United
|
| 66 |
# ──────────────────────────────────────────────────────────────────────────
|
|
|
|
| 71 |
language: str = Query(default="", description="Filtre langue"),
|
| 72 |
script: str = Query(default="", description="Filtre type d'écriture"),
|
| 73 |
) -> dict:
|
| 74 |
+
"""Catalogue HTR-United filtrable (remote, fallback demo si offline)."""
|
| 75 |
+
cat = _htr_united_catalogue()
|
|
|
|
|
|
|
| 76 |
results = cat.search(
|
| 77 |
query=query,
|
| 78 |
language=language or None,
|
|
|
|
| 80 |
)
|
| 81 |
return {
|
| 82 |
"source": cat.source,
|
| 83 |
+
# Indication explicite du mode pour l'UI : "demo" si on charge
|
| 84 |
+
# le catalogue embarqué (réseau indisponible ou variable
|
| 85 |
+
# ``PICARONES_HTR_UNITED_OFFLINE=1`` exportée).
|
| 86 |
+
"is_demo": cat.source == "demo",
|
| 87 |
"total": len(results),
|
| 88 |
"entries": [e.as_dict() for e in results],
|
| 89 |
"available_languages": cat.available_languages(),
|
|
|
|
| 94 |
@router.post("/api/htr-united/import")
|
| 95 |
async def api_htr_united_import(req: HTRUnitedImportRequest) -> dict:
|
| 96 |
"""Importe une entrée HTR-United dans ``req.output_dir``."""
|
| 97 |
+
from picarones.adapters.corpus.htr_united import import_htr_united_corpus
|
|
|
|
|
|
|
|
|
|
| 98 |
|
| 99 |
+
output_dir = _validated_output_dir(req.output_dir)
|
| 100 |
+
cat = _htr_united_catalogue()
|
| 101 |
entry = cat.get_by_id(req.entry_id)
|
| 102 |
if not entry:
|
| 103 |
raise HTTPException(
|
|
|
|
| 106 |
|
| 107 |
return import_htr_united_corpus(
|
| 108 |
entry=entry,
|
| 109 |
+
output_dir=output_dir,
|
| 110 |
max_samples=req.max_samples,
|
| 111 |
)
|
| 112 |
|
|
|
|
| 144 |
"""Importe un dataset HuggingFace dans ``req.output_dir``."""
|
| 145 |
from picarones.adapters.corpus.huggingface import HuggingFaceImporter
|
| 146 |
|
| 147 |
+
output_dir = _validated_output_dir(req.output_dir)
|
| 148 |
importer = HuggingFaceImporter()
|
| 149 |
return importer.import_dataset(
|
| 150 |
dataset_id=req.dataset_id,
|
| 151 |
+
output_dir=output_dir,
|
| 152 |
split=req.split,
|
| 153 |
max_samples=req.max_samples,
|
| 154 |
)
|
|
@@ -436,54 +436,17 @@ class ReportGenerator:
|
|
| 436 |
Compatible avec les fichiers produits par ``BenchmarkResult.to_json()``.
|
| 437 |
Les images base64 doivent être passées via ``kwargs["images_b64"]``
|
| 438 |
si elles ne sont pas dans le JSON.
|
| 439 |
-
"""
|
| 440 |
-
import json as _json
|
| 441 |
-
|
| 442 |
-
data = _json.loads(Path(json_path).read_text(encoding="utf-8"))
|
| 443 |
-
|
| 444 |
-
# Reconstruction minimale d'un BenchmarkResult depuis le dict
|
| 445 |
-
from picarones.evaluation.metric_result import MetricsResult
|
| 446 |
-
from picarones.evaluation.benchmark_result import DocumentResult, EngineReport
|
| 447 |
-
|
| 448 |
-
engine_reports = []
|
| 449 |
-
for er_data in data.get("engine_reports", []):
|
| 450 |
-
doc_results = []
|
| 451 |
-
for dr_data in er_data.get("document_results", []):
|
| 452 |
-
m = dr_data["metrics"]
|
| 453 |
-
metrics = MetricsResult(
|
| 454 |
-
cer=m["cer"], cer_nfc=m["cer_nfc"], cer_caseless=m["cer_caseless"],
|
| 455 |
-
wer=m["wer"], wer_normalized=m["wer_normalized"],
|
| 456 |
-
mer=m["mer"], wil=m["wil"],
|
| 457 |
-
reference_length=m["reference_length"],
|
| 458 |
-
hypothesis_length=m["hypothesis_length"],
|
| 459 |
-
error=m.get("error"),
|
| 460 |
-
)
|
| 461 |
-
doc_results.append(DocumentResult(
|
| 462 |
-
doc_id=dr_data["doc_id"],
|
| 463 |
-
image_path=dr_data["image_path"],
|
| 464 |
-
ground_truth=dr_data["ground_truth"],
|
| 465 |
-
hypothesis=dr_data["hypothesis"],
|
| 466 |
-
metrics=metrics,
|
| 467 |
-
duration_seconds=dr_data.get("duration_seconds", 0.0),
|
| 468 |
-
engine_error=dr_data.get("engine_error"),
|
| 469 |
-
))
|
| 470 |
-
engine_reports.append(EngineReport(
|
| 471 |
-
engine_name=er_data["engine_name"],
|
| 472 |
-
engine_version=er_data.get("engine_version", "unknown"),
|
| 473 |
-
engine_config=er_data.get("engine_config", {}),
|
| 474 |
-
document_results=doc_results,
|
| 475 |
-
))
|
| 476 |
-
|
| 477 |
-
corpus_info = data.get("corpus", {})
|
| 478 |
-
bm = BenchmarkResult(
|
| 479 |
-
corpus_name=corpus_info.get("name", "Corpus"),
|
| 480 |
-
corpus_source=corpus_info.get("source"),
|
| 481 |
-
document_count=corpus_info.get("document_count", 0),
|
| 482 |
-
engine_reports=engine_reports,
|
| 483 |
-
run_date=data.get("run_date", ""),
|
| 484 |
-
picarones_version=data.get("picarones_version", ""),
|
| 485 |
-
metadata=data.get("metadata", {}),
|
| 486 |
-
)
|
| 487 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 488 |
images_b64 = kwargs.pop("images_b64", {})
|
| 489 |
return cls(bm, images_b64=images_b64, **kwargs)
|
|
|
|
| 436 |
Compatible avec les fichiers produits par ``BenchmarkResult.to_json()``.
|
| 437 |
Les images base64 doivent être passées via ``kwargs["images_b64"]``
|
| 438 |
si elles ne sont pas dans le JSON.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 439 |
|
| 440 |
+
Phase 2.2 du chantier post-rewrite : délégué à
|
| 441 |
+
:meth:`BenchmarkResult.from_json_object` qui reconstruit tous
|
| 442 |
+
les champs avancés (confusion_matrix, taxonomy, structure,
|
| 443 |
+
hallucination_metrics, ner_metrics, calibration_metrics,
|
| 444 |
+
philological_metrics, searchability_metrics,
|
| 445 |
+
numerical_sequence_metrics, readability_metrics,
|
| 446 |
+
pipeline_metadata, ocr_intermediate + leurs équivalents
|
| 447 |
+
``aggregated_*`` au niveau EngineReport). Le rapport régénéré
|
| 448 |
+
depuis JSON est désormais indistinguable du rapport in-memory.
|
| 449 |
+
"""
|
| 450 |
+
bm = BenchmarkResult.from_json_object(json_path)
|
| 451 |
images_b64 = kwargs.pop("images_b64", {})
|
| 452 |
return cls(bm, images_b64=images_b64, **kwargs)
|
|
@@ -105,6 +105,7 @@ docs = [
|
|
| 105 |
# Moteurs OCR optionnels
|
| 106 |
pero = ["pero-ocr>=0.1.0"]
|
| 107 |
kraken = ["kraken>=4.0.0"]
|
|
|
|
| 108 |
# Adaptateurs LLM
|
| 109 |
llm = [
|
| 110 |
"openai>=1.0.0",
|
|
|
|
| 105 |
# Moteurs OCR optionnels
|
| 106 |
pero = ["pero-ocr>=0.1.0"]
|
| 107 |
kraken = ["kraken>=4.0.0"]
|
| 108 |
+
calamari = ["calamari-ocr>=2.0.0"]
|
| 109 |
# Adaptateurs LLM
|
| 110 |
llm = [
|
| 111 |
"openai>=1.0.0",
|
|
@@ -65,6 +65,8 @@ _ENGINE_DESCRIPTIONS: dict[str, tuple[str, str, str]] = {
|
|
| 65 |
# name → (display_name, type, install_hint)
|
| 66 |
"tesseract": ("Tesseract 5", "Local CLI", "`pip install pytesseract` + system binary"),
|
| 67 |
"pero_ocr": ("Pero OCR", "Local Python", "`pip install -e .[pero]`"),
|
|
|
|
|
|
|
| 68 |
"mistral_ocr": ("Mistral OCR", "Cloud API", "`MISTRAL_API_KEY` env var"),
|
| 69 |
"google_vision": ("Google Vision", "Cloud API", "`GOOGLE_APPLICATION_CREDENTIALS` env var"),
|
| 70 |
"azure_doc_intel": ("Azure Doc Intelligence", "Cloud API", "`AZURE_DOC_INTEL_ENDPOINT` + `AZURE_DOC_INTEL_KEY`"),
|
|
|
|
| 65 |
# name → (display_name, type, install_hint)
|
| 66 |
"tesseract": ("Tesseract 5", "Local CLI", "`pip install pytesseract` + system binary"),
|
| 67 |
"pero_ocr": ("Pero OCR", "Local Python", "`pip install -e .[pero]`"),
|
| 68 |
+
"kraken": ("Kraken HTR", "Local Python", "`pip install -e .[kraken]` + modèle `.mlmodel`"),
|
| 69 |
+
"calamari": ("Calamari OCR", "Local Python", "`pip install -e .[calamari]` + checkpoint"),
|
| 70 |
"mistral_ocr": ("Mistral OCR", "Cloud API", "`MISTRAL_API_KEY` env var"),
|
| 71 |
"google_vision": ("Google Vision", "Cloud API", "`GOOGLE_APPLICATION_CREDENTIALS` env var"),
|
| 72 |
"azure_doc_intel": ("Azure Doc Intelligence", "Cloud API", "`AZURE_DOC_INTEL_ENDPOINT` + `AZURE_DOC_INTEL_KEY`"),
|
|
@@ -12,7 +12,7 @@ Bug observé en prod (interface web, 2026-05-10) :
|
|
| 12 |
avec des instances différentes — collision impossible à résoudre.
|
| 13 |
|
| 14 |
Cause : ``_engine_from_competitor`` crée une instance ``TesseractAdapter``
|
| 15 |
-
fraîche pour chaque ``
|
| 16 |
partagent le même moteur OCR (l'un seul, l'autre dans un pipeline),
|
| 17 |
``build_adapter_resolver`` voyait deux instances Python distinctes
|
| 18 |
sous le même ``name="tesseract"`` et levait ``PicaronesError`` à tort
|
|
|
|
| 12 |
avec des instances différentes — collision impossible à résoudre.
|
| 13 |
|
| 14 |
Cause : ``_engine_from_competitor`` crée une instance ``TesseractAdapter``
|
| 15 |
+
fraîche pour chaque ``PipelineConfig``. Quand deux concurrents
|
| 16 |
partagent le même moteur OCR (l'un seul, l'autre dans un pipeline),
|
| 17 |
``build_adapter_resolver`` voyait deux instances Python distinctes
|
| 18 |
sous le même ``name="tesseract"`` et levait ``PicaronesError`` à tort
|
|
@@ -27,10 +27,37 @@ from picarones.app.services.partial_store import (
|
|
| 27 |
_partial_path,
|
| 28 |
_sanitize_filename,
|
| 29 |
_save_partial_line,
|
|
|
|
| 30 |
)
|
| 31 |
from picarones.app.services.benchmark_runner import (
|
|
|
|
| 32 |
run_benchmark_via_service,
|
| 33 |
)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 34 |
from picarones.evaluation.benchmark_result import DocumentResult
|
| 35 |
from picarones.evaluation.corpus import Corpus, Document
|
| 36 |
from picarones.evaluation.metric_result import MetricsResult
|
|
@@ -272,7 +299,7 @@ class TestResumeViaPartialDir:
|
|
| 272 |
|
| 273 |
assert bm.document_count == 2
|
| 274 |
# Plus aucun fichier partial pour cet engine après succès.
|
| 275 |
-
partial_path =
|
| 276 |
assert not partial_path.exists()
|
| 277 |
|
| 278 |
def test_resume_skips_already_done_docs(self, tmp_path: Path) -> None:
|
|
@@ -296,7 +323,7 @@ class TestResumeViaPartialDir:
|
|
| 296 |
# Pré-écrire un partial pour doc0 avec une CER fictive de 0.99
|
| 297 |
# pour vérifier qu'on prend la valeur du partial, pas une
|
| 298 |
# nouvelle exécution.
|
| 299 |
-
partial_path =
|
| 300 |
pre_existing = _make_doc_result("doc0", hyp="from_partial", cer=0.99)
|
| 301 |
_save_partial_line(partial_path, pre_existing)
|
| 302 |
|
|
@@ -327,7 +354,7 @@ class TestResumeViaPartialDir:
|
|
| 327 |
"Engine ne devrait pas être appelé — tout est dans le partial.",
|
| 328 |
)
|
| 329 |
|
| 330 |
-
partial_path =
|
| 331 |
for i in range(2):
|
| 332 |
_save_partial_line(
|
| 333 |
partial_path, _make_doc_result(f"doc{i}", hyp=f"prefilled{i}"),
|
|
@@ -358,7 +385,7 @@ class TestResumeViaPartialDir:
|
|
| 358 |
ocr_b._run_ocr = lambda p: "from_b"
|
| 359 |
|
| 360 |
# Pré-remplir uniquement le partial de engine_a pour doc0.
|
| 361 |
-
partial_a =
|
| 362 |
_save_partial_line(
|
| 363 |
partial_a, _make_doc_result("doc0", hyp="A_pre"),
|
| 364 |
)
|
|
@@ -423,7 +450,7 @@ class TestResumeViaPartialDir:
|
|
| 423 |
# avec doc0 mais pas doc1. cancel_event signalé avant
|
| 424 |
# l'engine suivant.
|
| 425 |
ocr_b = _MockOCR(name="incomplete_b")
|
| 426 |
-
partial_b =
|
| 427 |
_save_partial_line(
|
| 428 |
partial_b, _make_doc_result("doc0", hyp="B0_pre"),
|
| 429 |
)
|
|
|
|
| 27 |
_partial_path,
|
| 28 |
_sanitize_filename,
|
| 29 |
_save_partial_line,
|
| 30 |
+
partial_path_for_engine,
|
| 31 |
)
|
| 32 |
from picarones.app.services.benchmark_runner import (
|
| 33 |
+
_engine_config_for_fingerprint,
|
| 34 |
run_benchmark_via_service,
|
| 35 |
)
|
| 36 |
+
|
| 37 |
+
|
| 38 |
+
def _partial_path_for_run(corpus, engine, partial_dir):
|
| 39 |
+
"""Helper test — calcule le chemin partial avec le fingerprint
|
| 40 |
+
que le runner utilisera par défaut (pas de normalisation, pas
|
| 41 |
+
de char_exclude, profil ``standard``). Phase 2.3 du chantier
|
| 42 |
+
post-rewrite : la clé partial inclut désormais un fingerprint
|
| 43 |
+
pour empêcher la réutilisation accidentelle entre runs avec
|
| 44 |
+
configs différentes."""
|
| 45 |
+
import importlib
|
| 46 |
+
|
| 47 |
+
try:
|
| 48 |
+
code_version = importlib.import_module("picarones").__version__
|
| 49 |
+
except (ImportError, AttributeError):
|
| 50 |
+
code_version = "unknown"
|
| 51 |
+
return partial_path_for_engine(
|
| 52 |
+
corpus=corpus,
|
| 53 |
+
engine=engine,
|
| 54 |
+
partial_dir=partial_dir,
|
| 55 |
+
engine_config=_engine_config_for_fingerprint(engine),
|
| 56 |
+
normalization_profile=None,
|
| 57 |
+
char_exclude=None,
|
| 58 |
+
profile="standard",
|
| 59 |
+
code_version=code_version,
|
| 60 |
+
)
|
| 61 |
from picarones.evaluation.benchmark_result import DocumentResult
|
| 62 |
from picarones.evaluation.corpus import Corpus, Document
|
| 63 |
from picarones.evaluation.metric_result import MetricsResult
|
|
|
|
| 299 |
|
| 300 |
assert bm.document_count == 2
|
| 301 |
# Plus aucun fichier partial pour cet engine après succès.
|
| 302 |
+
partial_path = _partial_path_for_run(corpus, ocr, partial_dir)
|
| 303 |
assert not partial_path.exists()
|
| 304 |
|
| 305 |
def test_resume_skips_already_done_docs(self, tmp_path: Path) -> None:
|
|
|
|
| 323 |
# Pré-écrire un partial pour doc0 avec une CER fictive de 0.99
|
| 324 |
# pour vérifier qu'on prend la valeur du partial, pas une
|
| 325 |
# nouvelle exécution.
|
| 326 |
+
partial_path = _partial_path_for_run(corpus, ocr, partial_dir)
|
| 327 |
pre_existing = _make_doc_result("doc0", hyp="from_partial", cer=0.99)
|
| 328 |
_save_partial_line(partial_path, pre_existing)
|
| 329 |
|
|
|
|
| 354 |
"Engine ne devrait pas être appelé — tout est dans le partial.",
|
| 355 |
)
|
| 356 |
|
| 357 |
+
partial_path = _partial_path_for_run(corpus, ocr, partial_dir)
|
| 358 |
for i in range(2):
|
| 359 |
_save_partial_line(
|
| 360 |
partial_path, _make_doc_result(f"doc{i}", hyp=f"prefilled{i}"),
|
|
|
|
| 385 |
ocr_b._run_ocr = lambda p: "from_b"
|
| 386 |
|
| 387 |
# Pré-remplir uniquement le partial de engine_a pour doc0.
|
| 388 |
+
partial_a = _partial_path_for_run(corpus, ocr_a, partial_dir)
|
| 389 |
_save_partial_line(
|
| 390 |
partial_a, _make_doc_result("doc0", hyp="A_pre"),
|
| 391 |
)
|
|
|
|
| 450 |
# avec doc0 mais pas doc1. cancel_event signalé avant
|
| 451 |
# l'engine suivant.
|
| 452 |
ocr_b = _MockOCR(name="incomplete_b")
|
| 453 |
+
partial_b = _partial_path_for_run(corpus, ocr_b, partial_dir)
|
| 454 |
_save_partial_line(
|
| 455 |
partial_b, _make_doc_result("doc0", hyp="B0_pre"),
|
| 456 |
)
|
|
@@ -45,7 +45,11 @@ FILE_BUDGETS: dict[str, int] = {
|
|
| 45 |
# Sprint H.4 — module renommé ``_legacy_runner_adapter`` →
|
| 46 |
# ``benchmark_runner`` (drop le préfixe legacy : c'est l'entry
|
| 47 |
# point canonique des interfaces vers ``BenchmarkService``).
|
| 48 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
| 49 |
# --- God-modules : budget actuel + 15 % de marge.
|
| 50 |
# Le rétrécissement sera l'objet d'un sprint de refactor dédié.
|
| 51 |
# statistics.py (1128 lignes) a été éclaté en sous-package
|
|
@@ -71,7 +75,12 @@ FILE_BUDGETS: dict[str, int] = {
|
|
| 71 |
# (≤ 25 l). Le contenu canonique vit dans ``evaluation/`` ;
|
| 72 |
# même budget pour la même raison historique (modèles
|
| 73 |
# BenchmarkResult/EngineReport/DocumentResult).
|
| 74 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 75 |
# Phase 5.C : ``report/philological_render.py`` est désormais
|
| 76 |
# un shim (≤ 25 l). Le contenu canonique vit dans
|
| 77 |
# ``reports/html/renderers/philological.py``.
|
|
|
|
| 45 |
# Sprint H.4 — module renommé ``_legacy_runner_adapter`` →
|
| 46 |
# ``benchmark_runner`` (drop le préfixe legacy : c'est l'entry
|
| 47 |
# point canonique des interfaces vers ``BenchmarkService``).
|
| 48 |
+
# Phase 2.3 du chantier post-rewrite — ajout de
|
| 49 |
+
# ``_engine_config_for_fingerprint`` (~50 LOC) pour distinguer les
|
| 50 |
+
# runs avec configs différentes (psm/lang/model/prompt) au niveau
|
| 51 |
+
# du fichier partial.
|
| 52 |
+
"picarones/app/services/benchmark_runner.py": 1750, # actuel ~1700
|
| 53 |
# --- God-modules : budget actuel + 15 % de marge.
|
| 54 |
# Le rétrécissement sera l'objet d'un sprint de refactor dédié.
|
| 55 |
# statistics.py (1128 lignes) a été éclaté en sous-package
|
|
|
|
| 75 |
# (≤ 25 l). Le contenu canonique vit dans ``evaluation/`` ;
|
| 76 |
# même budget pour la même raison historique (modèles
|
| 77 |
# BenchmarkResult/EngineReport/DocumentResult).
|
| 78 |
+
# Phase 2.2 du chantier post-rewrite — ajout de
|
| 79 |
+
# ``DocumentResult.from_dict``, ``EngineReport.from_dict``,
|
| 80 |
+
# ``BenchmarkResult.from_dict`` et ``BenchmarkResult.from_json_object``
|
| 81 |
+
# pour restaurer la fidélité du round-trip JSON (taxonomy,
|
| 82 |
+
# hallucination, philological, etc.).
|
| 83 |
+
"picarones/evaluation/benchmark_result.py": 880, # actuel ~826
|
| 84 |
# Phase 5.C : ``report/philological_render.py`` est désormais
|
| 85 |
# un shim (≤ 25 l). Le contenu canonique vit dans
|
| 86 |
# ``reports/html/renderers/philological.py``.
|
|
@@ -101,6 +101,13 @@ def _normalize_engine_name(name: str) -> str:
|
|
| 101 |
aliases = {
|
| 102 |
"azure document intelligence": "azure_doc_intel",
|
| 103 |
"azure doc intelligence": "azure_doc_intel",
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 104 |
}
|
| 105 |
if n in aliases:
|
| 106 |
return aliases[n]
|
|
|
|
| 101 |
aliases = {
|
| 102 |
"azure document intelligence": "azure_doc_intel",
|
| 103 |
"azure doc intelligence": "azure_doc_intel",
|
| 104 |
+
# Phase 3 du chantier post-rewrite : kraken/calamari sont
|
| 105 |
+
# listés dans le README avec leur nom commercial complet,
|
| 106 |
+
# mais leur module Python s'appelle juste ``kraken.py`` /
|
| 107 |
+
# ``calamari.py`` (cohérent avec ``pero_ocr.py`` qui ne
|
| 108 |
+
# s'appelle pas ``pero_ocr_htr.py``).
|
| 109 |
+
"kraken htr": "kraken",
|
| 110 |
+
"calamari ocr": "calamari",
|
| 111 |
}
|
| 112 |
if n in aliases:
|
| 113 |
return aliases[n]
|
|
@@ -173,12 +173,14 @@ def test_copyright_year_range() -> None:
|
|
| 173 |
|
| 174 |
|
| 175 |
def test_readme_under_500_lines() -> None:
|
| 176 |
-
"""Le README doit rester compact (Sprint A13
|
| 177 |
-
|
|
|
|
|
|
|
| 178 |
text = _read_readme()
|
| 179 |
n_lines = len(text.splitlines())
|
| 180 |
-
assert n_lines <
|
| 181 |
-
f"README à {n_lines} lignes — au-dessus du seuil
|
| 182 |
"Déléguer le détail vers docs/."
|
| 183 |
)
|
| 184 |
|
|
|
|
| 173 |
|
| 174 |
|
| 175 |
def test_readme_under_500_lines() -> None:
|
| 176 |
+
"""Le README doit rester compact (Sprint A13 visait < 500 lignes ;
|
| 177 |
+
Phase 3 du chantier post-rewrite a ajouté kraken/calamari dans la
|
| 178 |
+
matrice produit, +2 lignes — seuil relevé à 510 pour absorber
|
| 179 |
+
cette extension légitime). Versus 786 avant la refonte initiale."""
|
| 180 |
text = _read_readme()
|
| 181 |
n_lines = len(text.splitlines())
|
| 182 |
+
assert n_lines < 510, (
|
| 183 |
+
f"README à {n_lines} lignes — au-dessus du seuil 510. "
|
| 184 |
"Déléguer le détail vers docs/."
|
| 185 |
)
|
| 186 |
|
|
@@ -17,11 +17,17 @@ import pytest
|
|
| 17 |
# 1. Filtrage des fichiers cachés macOS
|
| 18 |
# ---------------------------------------------------------------------------
|
| 19 |
|
|
|
|
|
|
|
|
|
|
|
|
|
| 20 |
FAKE_PNG = (
|
| 21 |
-
b"\x89PNG\r\n\x1a\n
|
| 22 |
-
b"\x00\x00\x00\
|
| 23 |
-
b"\x00\
|
| 24 |
-
b"\
|
|
|
|
|
|
|
| 25 |
)
|
| 26 |
|
| 27 |
|
|
|
|
| 17 |
# 1. Filtrage des fichiers cachés macOS
|
| 18 |
# ---------------------------------------------------------------------------
|
| 19 |
|
| 20 |
+
# PNG 1×1 RGBA validé par ``Pillow.verify()`` — l'ancien ``FAKE_PNG``
|
| 21 |
+
# avait un mauvais checksum IDAT, masqué tant que ``flatten_zip_to_dir``
|
| 22 |
+
# n'appelait pas ``validate_image_safe`` sur les images extraites
|
| 23 |
+
# (durcissement Phase 1 du chantier post-rewrite).
|
| 24 |
FAKE_PNG = (
|
| 25 |
+
b"\x89PNG\r\n\x1a\n"
|
| 26 |
+
b"\x00\x00\x00\rIHDR"
|
| 27 |
+
b"\x00\x00\x00\x01\x00\x00\x00\x01\x08\x06\x00\x00\x00"
|
| 28 |
+
b"\x1f\x15\xc4\x89"
|
| 29 |
+
b"\x00\x00\x00\nIDATx\x9cc\x00\x01\x00\x00\x05\x00\x01"
|
| 30 |
+
b"\r\n-\xb4\x00\x00\x00\x00IEND\xaeB`\x82"
|
| 31 |
)
|
| 32 |
|
| 33 |
|
|
@@ -171,16 +171,16 @@ class TestEndToEndPromptReachesLLM:
|
|
| 171 |
)
|
| 172 |
|
| 173 |
def test_web_factory_to_pipeline_to_llm_flow(self) -> None:
|
| 174 |
-
"""End-to-end depuis ``
|
| 175 |
le prompt arrivé au LLM doit être le CONTENU du fichier,
|
| 176 |
pas le filename. C'est le chemin exact du bug en prod."""
|
| 177 |
from picarones.interfaces.web.benchmark_utils import (
|
| 178 |
_engine_from_competitor,
|
| 179 |
)
|
| 180 |
-
from picarones.interfaces.web.models import
|
| 181 |
from picarones.adapters.llm.base import _substitute_prompt_variables
|
| 182 |
|
| 183 |
-
comp =
|
| 184 |
ocr_engine="tesseract", ocr_model="fra",
|
| 185 |
llm_provider="mistral", llm_model="mistral-small-latest",
|
| 186 |
pipeline_mode="text_only",
|
|
|
|
| 171 |
)
|
| 172 |
|
| 173 |
def test_web_factory_to_pipeline_to_llm_flow(self) -> None:
|
| 174 |
+
"""End-to-end depuis ``PipelineConfig`` (UI) jusqu'au LLM :
|
| 175 |
le prompt arrivé au LLM doit être le CONTENU du fichier,
|
| 176 |
pas le filename. C'est le chemin exact du bug en prod."""
|
| 177 |
from picarones.interfaces.web.benchmark_utils import (
|
| 178 |
_engine_from_competitor,
|
| 179 |
)
|
| 180 |
+
from picarones.interfaces.web.models import PipelineConfig
|
| 181 |
from picarones.adapters.llm.base import _substitute_prompt_variables
|
| 182 |
|
| 183 |
+
comp = PipelineConfig(
|
| 184 |
ocr_engine="tesseract", ocr_model="fra",
|
| 185 |
llm_provider="mistral", llm_model="mistral-small-latest",
|
| 186 |
pipeline_mode="text_only",
|
|
@@ -0,0 +1,1013 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""Phase 1 du chantier post-rewrite — durcissements sécurité P0.
|
| 2 |
+
|
| 3 |
+
Couvre trois durcissements introduits pour fermer des surfaces filesystem
|
| 4 |
+
laissées ouvertes par le rewrite :
|
| 5 |
+
|
| 6 |
+
1. **Path traversal ``output_dir`` dans les importers HTR-United/HuggingFace.**
|
| 7 |
+
Avant durcissement : un POST ``output_dir="/etc/picarones_pwned"``
|
| 8 |
+
passait directement à l'importer, vecteur d'écriture filesystem
|
| 9 |
+
arbitraire. Désormais ``validated_path`` rejette en 400 avant délégation.
|
| 10 |
+
|
| 11 |
+
2. **Path traversal ``db_path`` dans ``/api/history/regressions``.**
|
| 12 |
+
Avant durcissement : ``db_path=/etc/passwd`` ouvrait un SQLite
|
| 13 |
+
arbitraire (lecture libre, log d'erreur informatif). Désormais
|
| 14 |
+
``validated_path`` rejette en 400 ; pour pointer une base hors
|
| 15 |
+
workspace, exporter ``PICARONES_HISTORY_DB``.
|
| 16 |
+
|
| 17 |
+
3. **ZIP basename collision + validation image extraite.**
|
| 18 |
+
Avant durcissement : ``a/img.png`` et ``b/img.png`` s'écrasaient
|
| 19 |
+
silencieusement après aplatissement ; les images extraites n'étaient
|
| 20 |
+
pas passées à ``validate_image_safe`` (vecteur zip bomb jusqu'à
|
| 21 |
+
500 Mo brut). Désormais : collision → renommage avec préfixe slug
|
| 22 |
+
du dirname + warning ; image invalide → ``ValueError`` (HTTP 415).
|
| 23 |
+
"""
|
| 24 |
+
|
| 25 |
+
from __future__ import annotations
|
| 26 |
+
|
| 27 |
+
import io
|
| 28 |
+
import zipfile
|
| 29 |
+
from pathlib import Path
|
| 30 |
+
from unittest.mock import patch
|
| 31 |
+
|
| 32 |
+
import pytest
|
| 33 |
+
|
| 34 |
+
|
| 35 |
+
# PNG 1x1 minimal valide pour passer Pillow.verify.
|
| 36 |
+
_MINIMAL_PNG = (
|
| 37 |
+
b"\x89PNG\r\n\x1a\n"
|
| 38 |
+
b"\x00\x00\x00\rIHDR"
|
| 39 |
+
b"\x00\x00\x00\x01\x00\x00\x00\x01\x08\x06\x00\x00\x00"
|
| 40 |
+
b"\x1f\x15\xc4\x89"
|
| 41 |
+
b"\x00\x00\x00\nIDATx\x9cc\x00\x01\x00\x00\x05\x00\x01"
|
| 42 |
+
b"\r\n-\xb4\x00\x00\x00\x00IEND\xaeB`\x82"
|
| 43 |
+
)
|
| 44 |
+
|
| 45 |
+
|
| 46 |
+
def _make_importers_app():
|
| 47 |
+
from fastapi import FastAPI
|
| 48 |
+
|
| 49 |
+
from picarones.interfaces.web.routers import importers as imp_router
|
| 50 |
+
|
| 51 |
+
app = FastAPI()
|
| 52 |
+
app.include_router(imp_router.router)
|
| 53 |
+
return app
|
| 54 |
+
|
| 55 |
+
|
| 56 |
+
def _make_history_app():
|
| 57 |
+
from fastapi import FastAPI
|
| 58 |
+
|
| 59 |
+
from picarones.interfaces.web.routers import history as hist_router
|
| 60 |
+
|
| 61 |
+
app = FastAPI()
|
| 62 |
+
app.include_router(hist_router.router)
|
| 63 |
+
return app
|
| 64 |
+
|
| 65 |
+
|
| 66 |
+
# ──────────────────────────────────────────────────────────────────────
|
| 67 |
+
# 1. output_dir path traversal — HTR-United + HuggingFace
|
| 68 |
+
# ──────────────────────────────────────────────────────────────────────
|
| 69 |
+
|
| 70 |
+
|
| 71 |
+
class TestImportersOutputDirTraversal:
|
| 72 |
+
"""Aucun ``output_dir`` libre hors des racines workspace.
|
| 73 |
+
|
| 74 |
+
Important : on n'utilise PAS ``patch`` sur l'importer — la validation
|
| 75 |
+
doit échouer AVANT toute délégation au backend. Si la validation
|
| 76 |
+
laisse passer, le mock ne sera pas appelé mais la requête sera
|
| 77 |
+
acceptée — c'est ce qu'on doit empêcher.
|
| 78 |
+
"""
|
| 79 |
+
|
| 80 |
+
def test_htr_united_rejects_absolute_path_outside_workspace(self) -> None:
|
| 81 |
+
from fastapi.testclient import TestClient
|
| 82 |
+
|
| 83 |
+
app = _make_importers_app()
|
| 84 |
+
with TestClient(app) as client:
|
| 85 |
+
r = client.post(
|
| 86 |
+
"/api/htr-united/import",
|
| 87 |
+
json={
|
| 88 |
+
"entry_id": "any_id",
|
| 89 |
+
"output_dir": "/etc/picarones_pwned",
|
| 90 |
+
"max_samples": 1,
|
| 91 |
+
},
|
| 92 |
+
)
|
| 93 |
+
# 400 = PathValidationError mappée par le handler.
|
| 94 |
+
assert r.status_code == 400, (
|
| 95 |
+
f"Attendu 400 (path validation), reçu {r.status_code} : "
|
| 96 |
+
f"{r.text}"
|
| 97 |
+
)
|
| 98 |
+
assert "hors zone autorisée" in r.json()["detail"]
|
| 99 |
+
|
| 100 |
+
def test_htr_united_rejects_traversal(self) -> None:
|
| 101 |
+
from fastapi.testclient import TestClient
|
| 102 |
+
|
| 103 |
+
app = _make_importers_app()
|
| 104 |
+
with TestClient(app) as client:
|
| 105 |
+
r = client.post(
|
| 106 |
+
"/api/htr-united/import",
|
| 107 |
+
json={
|
| 108 |
+
"entry_id": "any_id",
|
| 109 |
+
"output_dir": "../../../etc/passwd",
|
| 110 |
+
"max_samples": 1,
|
| 111 |
+
},
|
| 112 |
+
)
|
| 113 |
+
assert r.status_code == 400
|
| 114 |
+
# Le message peut citer la racine ou le chemin original ;
|
| 115 |
+
# on vérifie juste qu'on n'a pas réussi à passer.
|
| 116 |
+
detail = r.json()["detail"]
|
| 117 |
+
assert "hors zone" in detail or "invalide" in detail
|
| 118 |
+
|
| 119 |
+
def test_huggingface_rejects_absolute_path_outside_workspace(
|
| 120 |
+
self,
|
| 121 |
+
) -> None:
|
| 122 |
+
from fastapi.testclient import TestClient
|
| 123 |
+
|
| 124 |
+
app = _make_importers_app()
|
| 125 |
+
with TestClient(app) as client:
|
| 126 |
+
r = client.post(
|
| 127 |
+
"/api/huggingface/import",
|
| 128 |
+
json={
|
| 129 |
+
"dataset_id": "any/dataset",
|
| 130 |
+
"output_dir": "/var/lib/pwned",
|
| 131 |
+
"split": "train",
|
| 132 |
+
"max_samples": 1,
|
| 133 |
+
},
|
| 134 |
+
)
|
| 135 |
+
assert r.status_code == 400
|
| 136 |
+
assert "hors zone autorisée" in r.json()["detail"]
|
| 137 |
+
|
| 138 |
+
def test_huggingface_rejects_traversal(self) -> None:
|
| 139 |
+
from fastapi.testclient import TestClient
|
| 140 |
+
|
| 141 |
+
app = _make_importers_app()
|
| 142 |
+
with TestClient(app) as client:
|
| 143 |
+
r = client.post(
|
| 144 |
+
"/api/huggingface/import",
|
| 145 |
+
json={
|
| 146 |
+
"dataset_id": "any/dataset",
|
| 147 |
+
"output_dir": "../../../etc/passwd_dir",
|
| 148 |
+
"split": "train",
|
| 149 |
+
"max_samples": 1,
|
| 150 |
+
},
|
| 151 |
+
)
|
| 152 |
+
assert r.status_code == 400
|
| 153 |
+
|
| 154 |
+
def test_huggingface_accepts_path_under_tmp(self, tmp_path: Path) -> None:
|
| 155 |
+
"""``tmp_path`` est sous ``tempfile.gettempdir()`` donc dans les
|
| 156 |
+
racines workspace par défaut (mode dev). On vérifie que la
|
| 157 |
+
validation laisse passer une cible légitime."""
|
| 158 |
+
from fastapi.testclient import TestClient
|
| 159 |
+
|
| 160 |
+
app = _make_importers_app()
|
| 161 |
+
with patch(
|
| 162 |
+
"picarones.adapters.corpus.huggingface.HuggingFaceImporter.import_dataset",
|
| 163 |
+
) as mock_import:
|
| 164 |
+
mock_import.return_value = {
|
| 165 |
+
"imported": 1, "output_dir": str(tmp_path),
|
| 166 |
+
}
|
| 167 |
+
with TestClient(app) as client:
|
| 168 |
+
r = client.post(
|
| 169 |
+
"/api/huggingface/import",
|
| 170 |
+
json={
|
| 171 |
+
"dataset_id": "test/dataset",
|
| 172 |
+
"output_dir": str(tmp_path),
|
| 173 |
+
"split": "train",
|
| 174 |
+
"max_samples": 1,
|
| 175 |
+
},
|
| 176 |
+
)
|
| 177 |
+
assert r.status_code == 200, r.text
|
| 178 |
+
# Vérifie que la valeur passée à l'importer est résolue
|
| 179 |
+
# (str du Path absolu) — pas la chaîne brute si elle
|
| 180 |
+
# avait été relative.
|
| 181 |
+
assert mock_import.called
|
| 182 |
+
|
| 183 |
+
|
| 184 |
+
# ──────────────────────────────────────────────────────────────────────
|
| 185 |
+
# 2. db_path path traversal — /api/history/regressions
|
| 186 |
+
# ──────────────────────────────────────────────────────────────────────
|
| 187 |
+
|
| 188 |
+
|
| 189 |
+
class TestHistoryRegressionsDbPathTraversal:
|
| 190 |
+
"""``db_path`` doit être sous une racine workspace ou refusé en 400.
|
| 191 |
+
|
| 192 |
+
Sans ce garde-fou, l'endpoint ouvrait silencieusement n'importe quel
|
| 193 |
+
SQLite lisible par le process (lecture filesystem arbitraire via
|
| 194 |
+
paramètres SQL).
|
| 195 |
+
"""
|
| 196 |
+
|
| 197 |
+
def test_absolute_path_outside_workspace_rejected(self) -> None:
|
| 198 |
+
from fastapi.testclient import TestClient
|
| 199 |
+
|
| 200 |
+
app = _make_history_app()
|
| 201 |
+
with TestClient(app) as client:
|
| 202 |
+
r = client.get(
|
| 203 |
+
"/api/history/regressions",
|
| 204 |
+
params={"db_path": "/etc/passwd"},
|
| 205 |
+
)
|
| 206 |
+
assert r.status_code == 400, r.text
|
| 207 |
+
assert "hors zone autorisée" in r.json()["detail"]
|
| 208 |
+
|
| 209 |
+
def test_traversal_rejected(self) -> None:
|
| 210 |
+
from fastapi.testclient import TestClient
|
| 211 |
+
|
| 212 |
+
app = _make_history_app()
|
| 213 |
+
with TestClient(app) as client:
|
| 214 |
+
r = client.get(
|
| 215 |
+
"/api/history/regressions",
|
| 216 |
+
params={"db_path": "../../../etc/passwd"},
|
| 217 |
+
)
|
| 218 |
+
assert r.status_code == 400
|
| 219 |
+
|
| 220 |
+
def test_no_db_path_uses_default(self) -> None:
|
| 221 |
+
"""Sans ``db_path``, l'endpoint utilise le défaut ``BenchmarkHistory()``
|
| 222 |
+
(~/.picarones/history.db). Pas de 400, retourne une liste vide
|
| 223 |
+
si la base n'existe pas (cas frais)."""
|
| 224 |
+
from fastapi.testclient import TestClient
|
| 225 |
+
|
| 226 |
+
app = _make_history_app()
|
| 227 |
+
with TestClient(app) as client:
|
| 228 |
+
r = client.get("/api/history/regressions")
|
| 229 |
+
# Soit 200 (base existe, pas de régression), soit 500 (base
|
| 230 |
+
# absente). On accepte les deux — c'est le comportement
|
| 231 |
+
# historique, hors scope du durcissement de chemin.
|
| 232 |
+
assert r.status_code in (200, 500), r.text
|
| 233 |
+
|
| 234 |
+
|
| 235 |
+
# ──────────────────────────────────────────────────────────────────────
|
| 236 |
+
# 3. ZIP basename collision + validation image extraite
|
| 237 |
+
# ──────────────────────────────────────────────────────────────────────
|
| 238 |
+
|
| 239 |
+
|
| 240 |
+
def _zip_with_entries(entries: dict[str, bytes]) -> bytes:
|
| 241 |
+
"""ZIP en mémoire à partir de ``{nom: bytes}``."""
|
| 242 |
+
buf = io.BytesIO()
|
| 243 |
+
with zipfile.ZipFile(buf, mode="w", compression=zipfile.ZIP_DEFLATED) as zf:
|
| 244 |
+
for name, data in entries.items():
|
| 245 |
+
zf.writestr(name, data)
|
| 246 |
+
return buf.getvalue()
|
| 247 |
+
|
| 248 |
+
|
| 249 |
+
class TestZipBasenameCollision:
|
| 250 |
+
"""``a/img.png`` et ``b/img.png`` ne doivent plus s'écraser
|
| 251 |
+
silencieusement après aplatissement par basename."""
|
| 252 |
+
|
| 253 |
+
def test_collision_resolved_with_dirname_prefix(self, tmp_path: Path) -> None:
|
| 254 |
+
from picarones.interfaces.web.corpus_utils import flatten_zip_to_dir
|
| 255 |
+
|
| 256 |
+
zip_bytes = _zip_with_entries({
|
| 257 |
+
"folder_a/page_001.png": _MINIMAL_PNG,
|
| 258 |
+
"folder_a/page_001.gt.txt": b"GT A",
|
| 259 |
+
"folder_b/page_001.png": _MINIMAL_PNG,
|
| 260 |
+
"folder_b/page_001.gt.txt": b"GT B",
|
| 261 |
+
})
|
| 262 |
+
dest = tmp_path / "extract"
|
| 263 |
+
|
| 264 |
+
with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
|
| 265 |
+
flatten_zip_to_dir(zf, dest)
|
| 266 |
+
|
| 267 |
+
names = {p.name for p in dest.iterdir()}
|
| 268 |
+
# La première occurrence garde le nom brut ; les suivantes sont
|
| 269 |
+
# préfixées par le slug du dirname source.
|
| 270 |
+
assert "page_001.png" in names
|
| 271 |
+
# Le second doit avoir été renommé — par slug ``folder_b``.
|
| 272 |
+
renamed_png = {n for n in names if n.endswith("page_001.png")}
|
| 273 |
+
assert len(renamed_png) == 2, (
|
| 274 |
+
f"Attendu 2 images distinctes (1 nominale + 1 renommée), "
|
| 275 |
+
f"trouvé {renamed_png}"
|
| 276 |
+
)
|
| 277 |
+
# On vérifie qu'au moins une variante porte un slug de dossier.
|
| 278 |
+
assert any(
|
| 279 |
+
"folder_a" in n or "folder_b" in n
|
| 280 |
+
for n in renamed_png - {"page_001.png"}
|
| 281 |
+
)
|
| 282 |
+
|
| 283 |
+
def test_no_silent_overwrite_of_image_pairs(self, tmp_path: Path) -> None:
|
| 284 |
+
"""Garantie fonctionnelle : 4 fichiers entrent → 4 fichiers sortent."""
|
| 285 |
+
from picarones.interfaces.web.corpus_utils import flatten_zip_to_dir
|
| 286 |
+
|
| 287 |
+
zip_bytes = _zip_with_entries({
|
| 288 |
+
"a/img.png": _MINIMAL_PNG,
|
| 289 |
+
"a/img.gt.txt": b"A",
|
| 290 |
+
"b/img.png": _MINIMAL_PNG,
|
| 291 |
+
"b/img.gt.txt": b"B",
|
| 292 |
+
})
|
| 293 |
+
dest = tmp_path / "extract"
|
| 294 |
+
with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
|
| 295 |
+
flatten_zip_to_dir(zf, dest)
|
| 296 |
+
|
| 297 |
+
files = list(dest.iterdir())
|
| 298 |
+
# 4 fichiers entrent dans le ZIP, 4 doivent ressortir (les
|
| 299 |
+
# collisions sont résolues, pas écrasées).
|
| 300 |
+
assert len(files) == 4, (
|
| 301 |
+
f"Attendu 4 fichiers (anti-collision), trouvé "
|
| 302 |
+
f"{[p.name for p in files]}"
|
| 303 |
+
)
|
| 304 |
+
|
| 305 |
+
|
| 306 |
+
class TestZipExtractedImageValidation:
|
| 307 |
+
"""Les images extraites du ZIP doivent passer ``validate_image_safe``
|
| 308 |
+
— sans ce garde-fou, un attaquant pouvait emballer une fausse image
|
| 309 |
+
(DecompressionBombError, format invalide) jusqu'à 500 Mo non
|
| 310 |
+
vérifiés."""
|
| 311 |
+
|
| 312 |
+
def test_invalid_extracted_image_rejected(self, tmp_path: Path) -> None:
|
| 313 |
+
from picarones.interfaces.web.corpus_utils import flatten_zip_to_dir
|
| 314 |
+
|
| 315 |
+
zip_bytes = _zip_with_entries({
|
| 316 |
+
# Header PNG seul mais sans IHDR — invalide.
|
| 317 |
+
"fake.png": b"\x89PNG\r\n\x1a\nFAKE_NOT_A_REAL_PNG",
|
| 318 |
+
})
|
| 319 |
+
dest = tmp_path / "extract"
|
| 320 |
+
|
| 321 |
+
with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
|
| 322 |
+
with pytest.raises(ValueError) as excinfo:
|
| 323 |
+
flatten_zip_to_dir(zf, dest)
|
| 324 |
+
# Le message doit mentionner le filename pour aider au debug.
|
| 325 |
+
assert "fake.png" in str(excinfo.value)
|
| 326 |
+
|
| 327 |
+
def test_valid_extracted_image_passes(self, tmp_path: Path) -> None:
|
| 328 |
+
from picarones.interfaces.web.corpus_utils import flatten_zip_to_dir
|
| 329 |
+
|
| 330 |
+
zip_bytes = _zip_with_entries({
|
| 331 |
+
"ok.png": _MINIMAL_PNG,
|
| 332 |
+
"ok.gt.txt": b"Hello",
|
| 333 |
+
})
|
| 334 |
+
dest = tmp_path / "extract"
|
| 335 |
+
|
| 336 |
+
with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
|
| 337 |
+
flatten_zip_to_dir(zf, dest)
|
| 338 |
+
|
| 339 |
+
assert (dest / "ok.png").exists()
|
| 340 |
+
assert (dest / "ok.gt.txt").exists()
|
| 341 |
+
|
| 342 |
+
def test_validate_images_false_skips_validation(
|
| 343 |
+
self, tmp_path: Path,
|
| 344 |
+
) -> None:
|
| 345 |
+
"""Le kwarg ``validate_images=False`` désactive la vérification —
|
| 346 |
+
utilisé par certains tests qui se concentrent sur d'autres
|
| 347 |
+
propriétés (path traversal, par exemple) sans avoir besoin de
|
| 348 |
+
fournir un PNG complet."""
|
| 349 |
+
from picarones.interfaces.web.corpus_utils import flatten_zip_to_dir
|
| 350 |
+
|
| 351 |
+
zip_bytes = _zip_with_entries({
|
| 352 |
+
"skipme.png": b"\x89PNG_FAKE",
|
| 353 |
+
})
|
| 354 |
+
dest = tmp_path / "extract"
|
| 355 |
+
with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
|
| 356 |
+
flatten_zip_to_dir(zf, dest, validate_images=False)
|
| 357 |
+
assert (dest / "skipme.png").exists()
|
| 358 |
+
|
| 359 |
+
|
| 360 |
+
# ──────────────────────────────────────────────────────────────────────
|
| 361 |
+
# 4. Phase 2 — pipeline_mode strict (rupture API)
|
| 362 |
+
# ──────────────────────────────────────────────────────────────────────
|
| 363 |
+
|
| 364 |
+
|
| 365 |
+
def _make_benchmark_app():
|
| 366 |
+
"""App FastAPI minimale pour tester le rejet 422 au niveau router."""
|
| 367 |
+
from fastapi import FastAPI
|
| 368 |
+
|
| 369 |
+
from picarones.interfaces.web.routers import benchmark as bench_router
|
| 370 |
+
|
| 371 |
+
app = FastAPI()
|
| 372 |
+
app.include_router(bench_router.router)
|
| 373 |
+
return app
|
| 374 |
+
|
| 375 |
+
|
| 376 |
+
class TestPipelineModeStrictAPI:
|
| 377 |
+
"""Phase 2 du chantier post-rewrite : le typage ``Literal`` de
|
| 378 |
+
``PipelineConfig.pipeline_mode`` rejette en 422 toute valeur
|
| 379 |
+
hors de la matrice canonique avant même que le router ne soit
|
| 380 |
+
appelé. Avant ce durcissement, le ``mode_map.get(...,
|
| 381 |
+
"text_only")`` aliasait silencieusement.
|
| 382 |
+
"""
|
| 383 |
+
|
| 384 |
+
def test_invalid_pipeline_mode_returns_422(self, tmp_path: Path) -> None:
|
| 385 |
+
from fastapi.testclient import TestClient
|
| 386 |
+
|
| 387 |
+
app = _make_benchmark_app()
|
| 388 |
+
with TestClient(app) as client:
|
| 389 |
+
r = client.post(
|
| 390 |
+
"/api/benchmark/run",
|
| 391 |
+
json={
|
| 392 |
+
"corpus_path": str(tmp_path),
|
| 393 |
+
"competitors": [
|
| 394 |
+
{
|
| 395 |
+
"name": "p",
|
| 396 |
+
"ocr_engine": "tesseract",
|
| 397 |
+
"ocr_model": "fra",
|
| 398 |
+
"llm_provider": "mistral",
|
| 399 |
+
"llm_model": "ministral-3b-latest",
|
| 400 |
+
"pipeline_mode": "magic_unknown_mode",
|
| 401 |
+
"prompt_file": "",
|
| 402 |
+
},
|
| 403 |
+
],
|
| 404 |
+
"normalization_profile": "nfc",
|
| 405 |
+
"output_dir": str(tmp_path),
|
| 406 |
+
"report_name": "test",
|
| 407 |
+
"report_lang": "fr",
|
| 408 |
+
},
|
| 409 |
+
)
|
| 410 |
+
assert r.status_code == 422, r.text
|
| 411 |
+
|
| 412 |
+
def test_legacy_alias_post_correction_text_rejected_422(
|
| 413 |
+
self, tmp_path: Path,
|
| 414 |
+
) -> None:
|
| 415 |
+
from fastapi.testclient import TestClient
|
| 416 |
+
|
| 417 |
+
app = _make_benchmark_app()
|
| 418 |
+
with TestClient(app) as client:
|
| 419 |
+
r = client.post(
|
| 420 |
+
"/api/benchmark/run",
|
| 421 |
+
json={
|
| 422 |
+
"corpus_path": str(tmp_path),
|
| 423 |
+
"competitors": [
|
| 424 |
+
{
|
| 425 |
+
"name": "p",
|
| 426 |
+
"ocr_engine": "tesseract",
|
| 427 |
+
"ocr_model": "fra",
|
| 428 |
+
"llm_provider": "mistral",
|
| 429 |
+
"llm_model": "ministral-3b-latest",
|
| 430 |
+
# Alias supprimé Phase 2.
|
| 431 |
+
"pipeline_mode": "post_correction_text",
|
| 432 |
+
"prompt_file": "",
|
| 433 |
+
},
|
| 434 |
+
],
|
| 435 |
+
"normalization_profile": "nfc",
|
| 436 |
+
"output_dir": str(tmp_path),
|
| 437 |
+
"report_name": "test",
|
| 438 |
+
"report_lang": "fr",
|
| 439 |
+
},
|
| 440 |
+
)
|
| 441 |
+
assert r.status_code == 422, r.text
|
| 442 |
+
|
| 443 |
+
@pytest.mark.parametrize(
|
| 444 |
+
"valid_mode", ["text_only", "text_and_image", "zero_shot"],
|
| 445 |
+
)
|
| 446 |
+
def test_canonical_modes_pass_pydantic(self, valid_mode: str) -> None:
|
| 447 |
+
"""Les 3 modes canoniques sont acceptés par Pydantic — la
|
| 448 |
+
suite (instanciation moteur, exécution) peut échouer pour
|
| 449 |
+
d'autres raisons mais ce n'est pas notre test."""
|
| 450 |
+
from picarones.interfaces.web.models import PipelineConfig
|
| 451 |
+
|
| 452 |
+
comp = PipelineConfig(
|
| 453 |
+
name="t", ocr_engine="tesseract",
|
| 454 |
+
llm_provider="mistral", llm_model="m",
|
| 455 |
+
pipeline_mode=valid_mode,
|
| 456 |
+
)
|
| 457 |
+
assert comp.pipeline_mode == valid_mode
|
| 458 |
+
|
| 459 |
+
def test_empty_mode_pass_pydantic_for_ocr_only(self) -> None:
|
| 460 |
+
"""``pipeline_mode=""`` (défaut) doit rester accepté pour les
|
| 461 |
+
configs OCR seul (sans ``llm_provider``)."""
|
| 462 |
+
from picarones.interfaces.web.models import PipelineConfig
|
| 463 |
+
|
| 464 |
+
comp = PipelineConfig(
|
| 465 |
+
name="t", ocr_engine="tesseract", llm_provider="",
|
| 466 |
+
)
|
| 467 |
+
assert comp.pipeline_mode == ""
|
| 468 |
+
|
| 469 |
+
|
| 470 |
+
# ──────────────────────────────────────────────────────────────────────
|
| 471 |
+
# 5. Phase 2.2 — from_json fidèle (round-trip complet)
|
| 472 |
+
# ──────────────────────────────────────────────────────────────────────
|
| 473 |
+
|
| 474 |
+
|
| 475 |
+
class TestBenchmarkResultRoundTrip:
|
| 476 |
+
"""Phase 2.2 du chantier post-rewrite : ``BenchmarkResult.to_json``
|
| 477 |
+
suivi de :meth:`BenchmarkResult.from_json_object` doit restaurer
|
| 478 |
+
**tous** les champs avancés (taxonomy, structure, hallucination,
|
| 479 |
+
NER, calibration, philological, searchability, numerical,
|
| 480 |
+
readability, pipeline_metadata, ocr_intermediate + leurs
|
| 481 |
+
``aggregated_*`` correspondants).
|
| 482 |
+
|
| 483 |
+
Avant ce durcissement, ``ReportGenerator.from_json`` faisait sa
|
| 484 |
+
propre reconstruction qui ne couvrait que CER/WER + textes — toutes
|
| 485 |
+
les analyses étaient perdues, ce qui rendait le rapport régénéré
|
| 486 |
+
différent du rapport in-memory. Reproductibilité scientifique
|
| 487 |
+
cassée.
|
| 488 |
+
"""
|
| 489 |
+
|
| 490 |
+
def _make_rich_benchmark(self):
|
| 491 |
+
from picarones.evaluation.benchmark_result import (
|
| 492 |
+
BenchmarkResult, DocumentResult, EngineReport,
|
| 493 |
+
)
|
| 494 |
+
from picarones.evaluation.metric_result import MetricsResult
|
| 495 |
+
|
| 496 |
+
metrics = MetricsResult(
|
| 497 |
+
cer=0.15, cer_nfc=0.14, cer_caseless=0.13,
|
| 498 |
+
wer=0.20, wer_normalized=0.19,
|
| 499 |
+
mer=0.16, wil=0.18,
|
| 500 |
+
reference_length=100, hypothesis_length=95,
|
| 501 |
+
cer_diplomatic=0.12,
|
| 502 |
+
diplomatic_profile_name="medieval_french",
|
| 503 |
+
)
|
| 504 |
+
dr = DocumentResult(
|
| 505 |
+
doc_id="doc1",
|
| 506 |
+
image_path="/tmp/doc1.png",
|
| 507 |
+
ground_truth="Hello world",
|
| 508 |
+
hypothesis="He11o world",
|
| 509 |
+
metrics=metrics,
|
| 510 |
+
duration_seconds=1.5,
|
| 511 |
+
ocr_intermediate="He11o w0rld",
|
| 512 |
+
pipeline_metadata={"mode": "text_only", "prompt_file": "x.txt"},
|
| 513 |
+
confusion_matrix={"l→1": 2},
|
| 514 |
+
char_scores={"ligature": {"score": 0.95}},
|
| 515 |
+
taxonomy={"classes": {"1": 3, "2": 1}},
|
| 516 |
+
structure={"line_count": 5},
|
| 517 |
+
image_quality={"contrast": 0.75},
|
| 518 |
+
line_metrics={"cer_per_line": [0.1, 0.2, 0.3]},
|
| 519 |
+
hallucination_metrics={"anchoring": 0.85, "n_blocks": 1},
|
| 520 |
+
ner_metrics={"f1_micro": 0.80, "per_category": {"PER": 0.9}},
|
| 521 |
+
calibration_metrics={"ece": 0.05, "mce": 0.10},
|
| 522 |
+
philological_metrics={"mufi": {"coverage": 0.92}},
|
| 523 |
+
searchability_metrics={
|
| 524 |
+
"n_gt_tokens": 2, "n_searchable": 2, "recall": 1.0,
|
| 525 |
+
},
|
| 526 |
+
numerical_sequence_metrics={
|
| 527 |
+
"global_strict_score": 1.0, "n_total": 0,
|
| 528 |
+
},
|
| 529 |
+
readability_metrics={
|
| 530 |
+
"lang": "fr", "flesch_delta": -5.2, "n_words_reference": 100,
|
| 531 |
+
},
|
| 532 |
+
)
|
| 533 |
+
er = EngineReport(
|
| 534 |
+
engine_name="tesseract",
|
| 535 |
+
engine_version="5.3.0",
|
| 536 |
+
engine_config={"lang": "fra"},
|
| 537 |
+
document_results=[dr],
|
| 538 |
+
pipeline_info={"mode": "text_only"},
|
| 539 |
+
aggregated_confusion={"l→1": 2},
|
| 540 |
+
aggregated_char_scores={"ligature": {"score": 0.95}},
|
| 541 |
+
aggregated_taxonomy={"classes": {"1": 3}},
|
| 542 |
+
aggregated_structure={"line_count_total": 5},
|
| 543 |
+
aggregated_image_quality={"contrast_mean": 0.75},
|
| 544 |
+
aggregated_line_metrics={"gini_mean": 0.3},
|
| 545 |
+
aggregated_hallucination={"anchoring_mean": 0.85},
|
| 546 |
+
aggregated_ner={"f1_micro": 0.80},
|
| 547 |
+
aggregated_calibration={"ece": 0.05},
|
| 548 |
+
aggregated_philological={"mufi": {"coverage": 0.92}},
|
| 549 |
+
aggregated_searchability={"recall": 1.0},
|
| 550 |
+
aggregated_numerical_sequences={"global_strict_score": 1.0},
|
| 551 |
+
aggregated_readability={"delta_mean": -5.2},
|
| 552 |
+
)
|
| 553 |
+
return BenchmarkResult(
|
| 554 |
+
corpus_name="rich-corpus",
|
| 555 |
+
corpus_source="tests",
|
| 556 |
+
document_count=1,
|
| 557 |
+
engine_reports=[er],
|
| 558 |
+
run_date="2026-05-12T12:00:00Z",
|
| 559 |
+
picarones_version="2.0.0",
|
| 560 |
+
metadata={"context": "phase2_test"},
|
| 561 |
+
)
|
| 562 |
+
|
| 563 |
+
def test_round_trip_preserves_all_document_level_fields(
|
| 564 |
+
self, tmp_path: Path,
|
| 565 |
+
) -> None:
|
| 566 |
+
from picarones.evaluation.benchmark_result import BenchmarkResult
|
| 567 |
+
|
| 568 |
+
bm = self._make_rich_benchmark()
|
| 569 |
+
path = tmp_path / "rich.json"
|
| 570 |
+
bm.to_json(path)
|
| 571 |
+
loaded = BenchmarkResult.from_json_object(path)
|
| 572 |
+
|
| 573 |
+
orig = bm.engine_reports[0].document_results[0]
|
| 574 |
+
rebuilt = loaded.engine_reports[0].document_results[0]
|
| 575 |
+
|
| 576 |
+
assert rebuilt.doc_id == orig.doc_id
|
| 577 |
+
assert rebuilt.ground_truth == orig.ground_truth
|
| 578 |
+
assert rebuilt.hypothesis == orig.hypothesis
|
| 579 |
+
assert rebuilt.ocr_intermediate == orig.ocr_intermediate
|
| 580 |
+
assert rebuilt.pipeline_metadata == orig.pipeline_metadata
|
| 581 |
+
assert rebuilt.confusion_matrix == orig.confusion_matrix
|
| 582 |
+
assert rebuilt.char_scores == orig.char_scores
|
| 583 |
+
assert rebuilt.taxonomy == orig.taxonomy
|
| 584 |
+
assert rebuilt.structure == orig.structure
|
| 585 |
+
assert rebuilt.image_quality == orig.image_quality
|
| 586 |
+
assert rebuilt.line_metrics == orig.line_metrics
|
| 587 |
+
assert rebuilt.hallucination_metrics == orig.hallucination_metrics
|
| 588 |
+
assert rebuilt.ner_metrics == orig.ner_metrics
|
| 589 |
+
assert rebuilt.calibration_metrics == orig.calibration_metrics
|
| 590 |
+
assert rebuilt.philological_metrics == orig.philological_metrics
|
| 591 |
+
assert rebuilt.searchability_metrics == orig.searchability_metrics
|
| 592 |
+
assert (
|
| 593 |
+
rebuilt.numerical_sequence_metrics
|
| 594 |
+
== orig.numerical_sequence_metrics
|
| 595 |
+
)
|
| 596 |
+
assert rebuilt.readability_metrics == orig.readability_metrics
|
| 597 |
+
# Métriques diplomatiques (anciennement perdues).
|
| 598 |
+
assert rebuilt.metrics.cer_diplomatic == orig.metrics.cer_diplomatic
|
| 599 |
+
assert (
|
| 600 |
+
rebuilt.metrics.diplomatic_profile_name
|
| 601 |
+
== orig.metrics.diplomatic_profile_name
|
| 602 |
+
)
|
| 603 |
+
|
| 604 |
+
def test_round_trip_preserves_aggregated_engine_fields(
|
| 605 |
+
self, tmp_path: Path,
|
| 606 |
+
) -> None:
|
| 607 |
+
from picarones.evaluation.benchmark_result import BenchmarkResult
|
| 608 |
+
|
| 609 |
+
bm = self._make_rich_benchmark()
|
| 610 |
+
path = tmp_path / "rich.json"
|
| 611 |
+
bm.to_json(path)
|
| 612 |
+
loaded = BenchmarkResult.from_json_object(path)
|
| 613 |
+
|
| 614 |
+
orig = bm.engine_reports[0]
|
| 615 |
+
rebuilt = loaded.engine_reports[0]
|
| 616 |
+
assert rebuilt.pipeline_info == orig.pipeline_info
|
| 617 |
+
assert rebuilt.aggregated_confusion == orig.aggregated_confusion
|
| 618 |
+
assert rebuilt.aggregated_char_scores == orig.aggregated_char_scores
|
| 619 |
+
assert rebuilt.aggregated_taxonomy == orig.aggregated_taxonomy
|
| 620 |
+
assert rebuilt.aggregated_structure == orig.aggregated_structure
|
| 621 |
+
assert (
|
| 622 |
+
rebuilt.aggregated_image_quality == orig.aggregated_image_quality
|
| 623 |
+
)
|
| 624 |
+
assert rebuilt.aggregated_line_metrics == orig.aggregated_line_metrics
|
| 625 |
+
assert (
|
| 626 |
+
rebuilt.aggregated_hallucination == orig.aggregated_hallucination
|
| 627 |
+
)
|
| 628 |
+
assert rebuilt.aggregated_ner == orig.aggregated_ner
|
| 629 |
+
assert rebuilt.aggregated_calibration == orig.aggregated_calibration
|
| 630 |
+
assert (
|
| 631 |
+
rebuilt.aggregated_philological == orig.aggregated_philological
|
| 632 |
+
)
|
| 633 |
+
assert (
|
| 634 |
+
rebuilt.aggregated_searchability == orig.aggregated_searchability
|
| 635 |
+
)
|
| 636 |
+
assert (
|
| 637 |
+
rebuilt.aggregated_numerical_sequences
|
| 638 |
+
== orig.aggregated_numerical_sequences
|
| 639 |
+
)
|
| 640 |
+
assert rebuilt.aggregated_readability == orig.aggregated_readability
|
| 641 |
+
|
| 642 |
+
def test_report_generator_from_json_uses_rich_reconstruction(
|
| 643 |
+
self, tmp_path: Path,
|
| 644 |
+
) -> None:
|
| 645 |
+
"""``ReportGenerator.from_json`` doit désormais accéder aux
|
| 646 |
+
champs avancés (avant Phase 2.2 il les perdait)."""
|
| 647 |
+
from picarones.reports.html.generator import ReportGenerator
|
| 648 |
+
|
| 649 |
+
bm = self._make_rich_benchmark()
|
| 650 |
+
path = tmp_path / "rich.json"
|
| 651 |
+
bm.to_json(path)
|
| 652 |
+
|
| 653 |
+
gen = ReportGenerator.from_json(path)
|
| 654 |
+
dr = gen.benchmark.engine_reports[0].document_results[0]
|
| 655 |
+
# Champs qui étaient à None avant Phase 2.2.
|
| 656 |
+
assert dr.taxonomy is not None
|
| 657 |
+
assert dr.hallucination_metrics is not None
|
| 658 |
+
assert dr.philological_metrics is not None
|
| 659 |
+
assert dr.calibration_metrics is not None
|
| 660 |
+
assert dr.searchability_metrics is not None
|
| 661 |
+
|
| 662 |
+
|
| 663 |
+
# ──────────────────────────────────────────────────────────────────────
|
| 664 |
+
# 6. Phase 2.3 — partial store fingerprint
|
| 665 |
+
# ──────────────────────────────────────────────────────────────────────
|
| 666 |
+
|
| 667 |
+
|
| 668 |
+
class TestPartialStoreFingerprint:
|
| 669 |
+
"""Phase 2.3 du chantier post-rewrite : la clé du fichier partiel
|
| 670 |
+
inclut désormais un fingerprint SHA-256 stable de la config
|
| 671 |
+
complète (engine_config, normalization_profile, char_exclude,
|
| 672 |
+
fichiers corpus + mtime/size, version code).
|
| 673 |
+
|
| 674 |
+
Avant ce durcissement, la clé était ``(corpus.name, engine.name)``
|
| 675 |
+
seule — deux runs avec configs différentes recyclaient
|
| 676 |
+
silencieusement les résultats du précédent. Reproductibilité
|
| 677 |
+
scientifique brisée.
|
| 678 |
+
"""
|
| 679 |
+
|
| 680 |
+
def test_fingerprint_stable_for_same_config(self, tmp_path: Path) -> None:
|
| 681 |
+
from picarones.app.services.partial_store import (
|
| 682 |
+
compute_run_fingerprint,
|
| 683 |
+
)
|
| 684 |
+
|
| 685 |
+
f1 = tmp_path / "a.png"
|
| 686 |
+
f1.write_bytes(b"\x00" * 100)
|
| 687 |
+
fp1 = compute_run_fingerprint(
|
| 688 |
+
engine_config={"lang": "fra", "psm": 6},
|
| 689 |
+
normalization_profile="medieval_french",
|
| 690 |
+
char_exclude="',-",
|
| 691 |
+
corpus_files=[f1],
|
| 692 |
+
code_version="1.0",
|
| 693 |
+
)
|
| 694 |
+
fp2 = compute_run_fingerprint(
|
| 695 |
+
engine_config={"psm": 6, "lang": "fra"}, # ordre différent
|
| 696 |
+
normalization_profile="medieval_french",
|
| 697 |
+
char_exclude="',-",
|
| 698 |
+
corpus_files=[f1],
|
| 699 |
+
code_version="1.0",
|
| 700 |
+
)
|
| 701 |
+
assert fp1 == fp2, "Le fingerprint doit être insensible à l'ordre dict"
|
| 702 |
+
|
| 703 |
+
def test_fingerprint_changes_with_engine_config(
|
| 704 |
+
self, tmp_path: Path,
|
| 705 |
+
) -> None:
|
| 706 |
+
from picarones.app.services.partial_store import (
|
| 707 |
+
compute_run_fingerprint,
|
| 708 |
+
)
|
| 709 |
+
|
| 710 |
+
f1 = tmp_path / "a.png"
|
| 711 |
+
f1.write_bytes(b"\x00" * 100)
|
| 712 |
+
fp_psm6 = compute_run_fingerprint(
|
| 713 |
+
engine_config={"lang": "fra", "psm": 6},
|
| 714 |
+
corpus_files=[f1],
|
| 715 |
+
code_version="1.0",
|
| 716 |
+
)
|
| 717 |
+
fp_psm3 = compute_run_fingerprint(
|
| 718 |
+
engine_config={"lang": "fra", "psm": 3},
|
| 719 |
+
corpus_files=[f1],
|
| 720 |
+
code_version="1.0",
|
| 721 |
+
)
|
| 722 |
+
assert fp_psm6 != fp_psm3, (
|
| 723 |
+
"Un changement de psm doit changer le fingerprint"
|
| 724 |
+
)
|
| 725 |
+
|
| 726 |
+
def test_fingerprint_changes_with_normalization_profile(
|
| 727 |
+
self, tmp_path: Path,
|
| 728 |
+
) -> None:
|
| 729 |
+
from picarones.app.services.partial_store import (
|
| 730 |
+
compute_run_fingerprint,
|
| 731 |
+
)
|
| 732 |
+
|
| 733 |
+
f1 = tmp_path / "a.png"
|
| 734 |
+
f1.write_bytes(b"\x00" * 100)
|
| 735 |
+
fp_med = compute_run_fingerprint(
|
| 736 |
+
engine_config={"lang": "fra"},
|
| 737 |
+
normalization_profile="medieval_french",
|
| 738 |
+
corpus_files=[f1],
|
| 739 |
+
)
|
| 740 |
+
fp_nfc = compute_run_fingerprint(
|
| 741 |
+
engine_config={"lang": "fra"},
|
| 742 |
+
normalization_profile="nfc",
|
| 743 |
+
corpus_files=[f1],
|
| 744 |
+
)
|
| 745 |
+
assert fp_med != fp_nfc
|
| 746 |
+
|
| 747 |
+
def test_fingerprint_changes_with_char_exclude(
|
| 748 |
+
self, tmp_path: Path,
|
| 749 |
+
) -> None:
|
| 750 |
+
from picarones.app.services.partial_store import (
|
| 751 |
+
compute_run_fingerprint,
|
| 752 |
+
)
|
| 753 |
+
|
| 754 |
+
fp_with = compute_run_fingerprint(
|
| 755 |
+
engine_config={"lang": "fra"},
|
| 756 |
+
char_exclude="',-",
|
| 757 |
+
)
|
| 758 |
+
fp_without = compute_run_fingerprint(
|
| 759 |
+
engine_config={"lang": "fra"},
|
| 760 |
+
char_exclude="",
|
| 761 |
+
)
|
| 762 |
+
assert fp_with != fp_without
|
| 763 |
+
|
| 764 |
+
def test_fingerprint_changes_with_corpus_content(
|
| 765 |
+
self, tmp_path: Path,
|
| 766 |
+
) -> None:
|
| 767 |
+
"""Si un fichier change de taille / mtime, le fingerprint change.
|
| 768 |
+
|
| 769 |
+
Détection légère (pas de hash du contenu) mais suffit pour
|
| 770 |
+
invalider la reprise après modification utilisateur du corpus.
|
| 771 |
+
"""
|
| 772 |
+
import os
|
| 773 |
+
import time
|
| 774 |
+
|
| 775 |
+
from picarones.app.services.partial_store import (
|
| 776 |
+
compute_run_fingerprint,
|
| 777 |
+
)
|
| 778 |
+
|
| 779 |
+
f1 = tmp_path / "a.png"
|
| 780 |
+
f1.write_bytes(b"\x00" * 100)
|
| 781 |
+
fp_v1 = compute_run_fingerprint(
|
| 782 |
+
engine_config={"lang": "fra"},
|
| 783 |
+
corpus_files=[f1],
|
| 784 |
+
)
|
| 785 |
+
# Réécrire avec une taille différente.
|
| 786 |
+
f1.write_bytes(b"\x00" * 200)
|
| 787 |
+
# Forcer un mtime différent (certains FS ont une résolution
|
| 788 |
+
# de seconde, on attend > 1 s).
|
| 789 |
+
new_mtime = time.time() + 5
|
| 790 |
+
os.utime(f1, (new_mtime, new_mtime))
|
| 791 |
+
fp_v2 = compute_run_fingerprint(
|
| 792 |
+
engine_config={"lang": "fra"},
|
| 793 |
+
corpus_files=[f1],
|
| 794 |
+
)
|
| 795 |
+
assert fp_v1 != fp_v2
|
| 796 |
+
|
| 797 |
+
def test_partial_path_uses_fingerprint_suffix(
|
| 798 |
+
self, tmp_path: Path,
|
| 799 |
+
) -> None:
|
| 800 |
+
from picarones.app.services.partial_store import _partial_path
|
| 801 |
+
|
| 802 |
+
path_with = _partial_path(
|
| 803 |
+
"my_corpus", "tesseract", tmp_path, fingerprint="abc123",
|
| 804 |
+
)
|
| 805 |
+
path_without = _partial_path(
|
| 806 |
+
"my_corpus", "tesseract", tmp_path,
|
| 807 |
+
)
|
| 808 |
+
assert path_with != path_without
|
| 809 |
+
assert "abc123" in path_with.name
|
| 810 |
+
# Le format historique reste pour la rétrocompat.
|
| 811 |
+
assert path_without.name == "picarones_my_corpus_tesseract.partial.jsonl"
|
| 812 |
+
|
| 813 |
+
def test_engine_config_for_fingerprint_distinguishes_psm(self) -> None:
|
| 814 |
+
"""``_engine_config_for_fingerprint`` capture les attributs
|
| 815 |
+
opérationnels d'un adapter OCR (lang, psm, model, …)."""
|
| 816 |
+
from picarones.app.services.benchmark_runner import (
|
| 817 |
+
_engine_config_for_fingerprint,
|
| 818 |
+
)
|
| 819 |
+
|
| 820 |
+
class _FakeOCR:
|
| 821 |
+
name = "tesseract"
|
| 822 |
+
lang = "fra"
|
| 823 |
+
psm = 6
|
| 824 |
+
is_pipeline = False
|
| 825 |
+
|
| 826 |
+
class _FakeOCRDiff:
|
| 827 |
+
name = "tesseract"
|
| 828 |
+
lang = "fra"
|
| 829 |
+
psm = 3
|
| 830 |
+
is_pipeline = False
|
| 831 |
+
|
| 832 |
+
c1 = _engine_config_for_fingerprint(_FakeOCR())
|
| 833 |
+
c2 = _engine_config_for_fingerprint(_FakeOCRDiff())
|
| 834 |
+
assert c1 != c2
|
| 835 |
+
assert c1["psm"] == 6
|
| 836 |
+
assert c2["psm"] == 3
|
| 837 |
+
|
| 838 |
+
|
| 839 |
+
# ──────────────────────────────────────────────────────────────────────
|
| 840 |
+
# 7. Phase 3 — Adapters kraken et calamari (moteurs fantômes implémentés)
|
| 841 |
+
# ──────────────────────────────────────────────────────────────────────
|
| 842 |
+
|
| 843 |
+
|
| 844 |
+
class TestKrakenAdapter:
|
| 845 |
+
"""Phase 3 du chantier post-rewrite : ``KrakenAdapter`` rend
|
| 846 |
+
l'engine ``kraken`` réellement utilisable (au lieu d'être
|
| 847 |
+
juste annoncé par ``/api/engines``)."""
|
| 848 |
+
|
| 849 |
+
def test_kraken_requires_model_path(self) -> None:
|
| 850 |
+
from picarones.adapters.ocr import KrakenAdapter
|
| 851 |
+
from picarones.adapters.ocr.base import OCRAdapterError
|
| 852 |
+
|
| 853 |
+
with pytest.raises(OCRAdapterError, match="model_path est obligatoire"):
|
| 854 |
+
KrakenAdapter()
|
| 855 |
+
|
| 856 |
+
def test_kraken_via_factory(self, tmp_path: Path) -> None:
|
| 857 |
+
from picarones.adapters.ocr import KrakenAdapter
|
| 858 |
+
from picarones.adapters.ocr.factory import ocr_adapter_from_name
|
| 859 |
+
|
| 860 |
+
# Modèle factice — l'adapter ne le charge qu'à execute().
|
| 861 |
+
model = tmp_path / "fake.mlmodel"
|
| 862 |
+
model.write_bytes(b"fake")
|
| 863 |
+
adapter = ocr_adapter_from_name("kraken", model_path=str(model))
|
| 864 |
+
assert isinstance(adapter, KrakenAdapter)
|
| 865 |
+
assert adapter.name == "kraken"
|
| 866 |
+
assert adapter.model_path == model
|
| 867 |
+
|
| 868 |
+
def test_kraken_validates_name(self) -> None:
|
| 869 |
+
from picarones.adapters.ocr import KrakenAdapter
|
| 870 |
+
from picarones.adapters.ocr.base import OCRAdapterError
|
| 871 |
+
|
| 872 |
+
with pytest.raises(OCRAdapterError, match="name invalide"):
|
| 873 |
+
KrakenAdapter(name="bad name with spaces", model_path="x")
|
| 874 |
+
|
| 875 |
+
|
| 876 |
+
class TestCalamariAdapter:
|
| 877 |
+
"""Phase 3 du chantier post-rewrite : ``CalamariAdapter`` rend
|
| 878 |
+
l'engine ``calamari`` réellement utilisable."""
|
| 879 |
+
|
| 880 |
+
def test_calamari_requires_checkpoint(self) -> None:
|
| 881 |
+
from picarones.adapters.ocr import CalamariAdapter
|
| 882 |
+
from picarones.adapters.ocr.base import OCRAdapterError
|
| 883 |
+
|
| 884 |
+
with pytest.raises(OCRAdapterError, match="checkpoint est obligatoire"):
|
| 885 |
+
CalamariAdapter()
|
| 886 |
+
|
| 887 |
+
def test_calamari_via_factory(self, tmp_path: Path) -> None:
|
| 888 |
+
from picarones.adapters.ocr import CalamariAdapter
|
| 889 |
+
from picarones.adapters.ocr.factory import ocr_adapter_from_name
|
| 890 |
+
|
| 891 |
+
ckpt = tmp_path / "fake.ckpt"
|
| 892 |
+
ckpt.write_bytes(b"fake")
|
| 893 |
+
adapter = ocr_adapter_from_name("calamari", checkpoint=str(ckpt))
|
| 894 |
+
assert isinstance(adapter, CalamariAdapter)
|
| 895 |
+
assert adapter.name == "calamari"
|
| 896 |
+
assert adapter.checkpoint == ckpt
|
| 897 |
+
|
| 898 |
+
def test_calamari_validates_batch_size(self) -> None:
|
| 899 |
+
from picarones.adapters.ocr import CalamariAdapter
|
| 900 |
+
from picarones.adapters.ocr.base import OCRAdapterError
|
| 901 |
+
|
| 902 |
+
with pytest.raises(OCRAdapterError, match="batch_size doit être"):
|
| 903 |
+
CalamariAdapter(checkpoint="x", batch_size=0)
|
| 904 |
+
|
| 905 |
+
|
| 906 |
+
class TestEngineMatrixCoherence:
|
| 907 |
+
"""Phase 3 du chantier post-rewrite : la matrice des moteurs est
|
| 908 |
+
cohérente entre ``/api/engines``, la factory canonique, le
|
| 909 |
+
builder web ``_OCR_KWARGS_BUILDERS`` et l'index public."""
|
| 910 |
+
|
| 911 |
+
def test_kraken_and_calamari_in_factory_supported_list(self) -> None:
|
| 912 |
+
from picarones.adapters.ocr.factory import _SUPPORTED
|
| 913 |
+
|
| 914 |
+
assert "kraken" in _SUPPORTED
|
| 915 |
+
assert "calamari" in _SUPPORTED
|
| 916 |
+
|
| 917 |
+
def test_kraken_and_calamari_in_web_builders(self) -> None:
|
| 918 |
+
from picarones.interfaces.web.benchmark_utils import (
|
| 919 |
+
_OCR_KWARGS_BUILDERS,
|
| 920 |
+
)
|
| 921 |
+
|
| 922 |
+
assert "kraken" in _OCR_KWARGS_BUILDERS
|
| 923 |
+
assert "calamari" in _OCR_KWARGS_BUILDERS
|
| 924 |
+
|
| 925 |
+
def test_kraken_calamari_exposed_at_package_root(self) -> None:
|
| 926 |
+
from picarones.adapters.ocr import (
|
| 927 |
+
CalamariAdapter,
|
| 928 |
+
KrakenAdapter,
|
| 929 |
+
)
|
| 930 |
+
|
| 931 |
+
assert KrakenAdapter.__name__ == "KrakenAdapter"
|
| 932 |
+
assert CalamariAdapter.__name__ == "CalamariAdapter"
|
| 933 |
+
|
| 934 |
+
|
| 935 |
+
# ──────────────────────────────────────────────────────────────────────
|
| 936 |
+
# 8. Phase 4 — upload_purge_task branché au lifespan
|
| 937 |
+
# ──────────────────────────────────────────────────────────────────────
|
| 938 |
+
|
| 939 |
+
|
| 940 |
+
class TestUploadPurgeTaskWired:
|
| 941 |
+
"""Phase 4 du chantier post-rewrite : la tâche
|
| 942 |
+
``upload_purge_task`` est désormais démarrée par le lifespan de
|
| 943 |
+
``picarones.interfaces.web.app`` (auparavant définie mais jamais
|
| 944 |
+
lancée — code zombie)."""
|
| 945 |
+
|
| 946 |
+
def test_lifespan_starts_purge_task(self, monkeypatch) -> None:
|
| 947 |
+
"""Au démarrage de l'app FastAPI, un ``asyncio.create_task`` doit
|
| 948 |
+
emballer ``upload_purge_task``. On patch la fonction pour
|
| 949 |
+
l'observer puis on enclenche le lifespan."""
|
| 950 |
+
from fastapi.testclient import TestClient
|
| 951 |
+
|
| 952 |
+
observed: dict = {"started": False, "uploads_root": None}
|
| 953 |
+
|
| 954 |
+
async def _fake_purge_task(uploads_root):
|
| 955 |
+
observed["started"] = True
|
| 956 |
+
observed["uploads_root"] = uploads_root
|
| 957 |
+
# Boucle infinie minimale — annulée au shutdown.
|
| 958 |
+
import asyncio
|
| 959 |
+
try:
|
| 960 |
+
while True:
|
| 961 |
+
await asyncio.sleep(3600)
|
| 962 |
+
except asyncio.CancelledError:
|
| 963 |
+
raise
|
| 964 |
+
|
| 965 |
+
monkeypatch.setattr(
|
| 966 |
+
"picarones.interfaces.web.maintenance.upload_purge_task",
|
| 967 |
+
_fake_purge_task,
|
| 968 |
+
)
|
| 969 |
+
# Forcer la rétention pour ne pas que la fonction réelle short-circuit.
|
| 970 |
+
monkeypatch.setenv("PICARONES_UPLOAD_RETENTION_DAYS", "7")
|
| 971 |
+
|
| 972 |
+
from picarones.interfaces.web.app import app
|
| 973 |
+
|
| 974 |
+
with TestClient(app):
|
| 975 |
+
# Le lifespan a démarré ; la tâche tourne en arrière-plan.
|
| 976 |
+
# On laisse à asyncio le temps de la lancer.
|
| 977 |
+
import time
|
| 978 |
+
time.sleep(0.05)
|
| 979 |
+
|
| 980 |
+
assert observed["started"] is True, (
|
| 981 |
+
"upload_purge_task aurait dû être démarrée par le lifespan"
|
| 982 |
+
)
|
| 983 |
+
|
| 984 |
+
def test_purge_protects_active_corpus(self, tmp_path: Path) -> None:
|
| 985 |
+
"""Si un job ``pending``/``running`` référence un corpus_id, la
|
| 986 |
+
purge ne supprime pas ce dossier — même s'il est ancien."""
|
| 987 |
+
import time
|
| 988 |
+
|
| 989 |
+
from picarones.interfaces.web.maintenance import purge_old_uploads
|
| 990 |
+
|
| 991 |
+
# 2 corpus : un actif (référencé), un orphelin.
|
| 992 |
+
active = tmp_path / "active_corpus"
|
| 993 |
+
orphan = tmp_path / "orphan_corpus"
|
| 994 |
+
active.mkdir()
|
| 995 |
+
orphan.mkdir()
|
| 996 |
+
# Vieillir les deux pour qu'ils passent la rétention de 0 jour.
|
| 997 |
+
old = time.time() - 86400 * 30
|
| 998 |
+
import os
|
| 999 |
+
os.utime(active, (old, old))
|
| 1000 |
+
os.utime(orphan, (old, old))
|
| 1001 |
+
|
| 1002 |
+
purged = purge_old_uploads(
|
| 1003 |
+
tmp_path,
|
| 1004 |
+
retention_days=7,
|
| 1005 |
+
active_corpus_ids={"active_corpus"},
|
| 1006 |
+
)
|
| 1007 |
+
|
| 1008 |
+
purged_names = [p.name for p in purged]
|
| 1009 |
+
assert "orphan_corpus" in purged_names
|
| 1010 |
+
assert "active_corpus" not in purged_names
|
| 1011 |
+
# Vérification physique
|
| 1012 |
+
assert active.exists()
|
| 1013 |
+
assert not orphan.exists()
|
|
@@ -24,10 +24,19 @@ from pathlib import Path
|
|
| 24 |
|
| 25 |
import pytest
|
| 26 |
|
| 27 |
-
|
| 28 |
-
#
|
| 29 |
-
#
|
| 30 |
-
#
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 31 |
|
| 32 |
|
| 33 |
def _zip_with_entry(name: str, data: bytes = b"PWNED") -> bytes:
|
|
@@ -151,7 +160,7 @@ class TestFlattenZipToDir:
|
|
| 151 |
sous ``dest``, pas dans ``/tmp/``."""
|
| 152 |
from picarones.interfaces.web.corpus_utils import flatten_zip_to_dir
|
| 153 |
|
| 154 |
-
zip_bytes = _zip_with_entry("../../../tmp/x.png",
|
| 155 |
dest = tmp_path / "extract"
|
| 156 |
|
| 157 |
with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
|
|
@@ -168,7 +177,7 @@ class TestFlattenZipToDir:
|
|
| 168 |
) -> None:
|
| 169 |
from picarones.interfaces.web.corpus_utils import flatten_zip_to_dir
|
| 170 |
|
| 171 |
-
zip_bytes = _zip_with_entry("/etc/passwd_clone.png",
|
| 172 |
dest = tmp_path / "extract"
|
| 173 |
|
| 174 |
with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
|
|
|
|
| 24 |
|
| 25 |
import pytest
|
| 26 |
|
| 27 |
+
#: PNG minimal valide — utilisé là où le contenu doit passer
|
| 28 |
+
#: ``validate_image_safe`` (Pillow.verify). Avant ce durcissement,
|
| 29 |
+
#: les tests utilisaient ``b"\\x89PNG"`` (signature seule), mais le
|
| 30 |
+
#: durcissement Phase 1 valide chaque image extraite d'un ZIP — d'où
|
| 31 |
+
#: l'utilisation d'un PNG 1×1 réellement décodable ici.
|
| 32 |
+
_MINIMAL_PNG = (
|
| 33 |
+
b"\x89PNG\r\n\x1a\n"
|
| 34 |
+
b"\x00\x00\x00\rIHDR"
|
| 35 |
+
b"\x00\x00\x00\x01\x00\x00\x00\x01\x08\x06\x00\x00\x00"
|
| 36 |
+
b"\x1f\x15\xc4\x89"
|
| 37 |
+
b"\x00\x00\x00\nIDATx\x9cc\x00\x01\x00\x00\x05\x00\x01"
|
| 38 |
+
b"\r\n-\xb4\x00\x00\x00\x00IEND\xaeB`\x82"
|
| 39 |
+
)
|
| 40 |
|
| 41 |
|
| 42 |
def _zip_with_entry(name: str, data: bytes = b"PWNED") -> bytes:
|
|
|
|
| 160 |
sous ``dest``, pas dans ``/tmp/``."""
|
| 161 |
from picarones.interfaces.web.corpus_utils import flatten_zip_to_dir
|
| 162 |
|
| 163 |
+
zip_bytes = _zip_with_entry("../../../tmp/x.png", _MINIMAL_PNG)
|
| 164 |
dest = tmp_path / "extract"
|
| 165 |
|
| 166 |
with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
|
|
|
|
| 177 |
) -> None:
|
| 178 |
from picarones.interfaces.web.corpus_utils import flatten_zip_to_dir
|
| 179 |
|
| 180 |
+
zip_bytes = _zip_with_entry("/etc/passwd_clone.png", _MINIMAL_PNG)
|
| 181 |
dest = tmp_path / "extract"
|
| 182 |
|
| 183 |
with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
|
|
@@ -182,16 +182,19 @@ class TestHistoryWithRegression:
|
|
| 182 |
|
| 183 |
|
| 184 |
class TestDBErrorHandling:
|
| 185 |
-
def
|
| 186 |
-
|
| 187 |
-
|
| 188 |
-
|
| 189 |
-
|
| 190 |
-
|
|
|
|
|
|
|
|
|
|
| 191 |
from fastapi.testclient import TestClient
|
| 192 |
|
| 193 |
app = _make_app()
|
| 194 |
-
# Chemin
|
| 195 |
impossible_path = "/proc/cannot_write/history.sqlite"
|
| 196 |
|
| 197 |
with TestClient(app, raise_server_exceptions=False) as client:
|
|
@@ -199,8 +202,27 @@ class TestDBErrorHandling:
|
|
| 199 |
"/api/history/regressions",
|
| 200 |
params={"db_path": impossible_path},
|
| 201 |
)
|
| 202 |
-
|
| 203 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 204 |
assert r.status_code in (200, 500)
|
| 205 |
if r.status_code == 500:
|
| 206 |
body = r.json()
|
|
|
|
| 182 |
|
| 183 |
|
| 184 |
class TestDBErrorHandling:
|
| 185 |
+
def test_db_path_outside_workspace_rejected(self, tmp_path: Path) -> None:
|
| 186 |
+
"""db_path hors workspace est désormais rejeté en 400 par le
|
| 187 |
+
durcissement Phase 1 (validation contre compute_workspace_roots).
|
| 188 |
+
|
| 189 |
+
Avant Phase 1 : 500 silencieux après tentative d'ouverture
|
| 190 |
+
SQLite — vecteur de lecture filesystem arbitraire.
|
| 191 |
+
Après Phase 1 : 400 avec ``PathValidationError`` AVANT
|
| 192 |
+
toute interaction filesystem.
|
| 193 |
+
"""
|
| 194 |
from fastapi.testclient import TestClient
|
| 195 |
|
| 196 |
app = _make_app()
|
| 197 |
+
# Chemin hors zone workspace.
|
| 198 |
impossible_path = "/proc/cannot_write/history.sqlite"
|
| 199 |
|
| 200 |
with TestClient(app, raise_server_exceptions=False) as client:
|
|
|
|
| 202 |
"/api/history/regressions",
|
| 203 |
params={"db_path": impossible_path},
|
| 204 |
)
|
| 205 |
+
assert r.status_code == 400, r.text
|
| 206 |
+
assert "hors zone autorisée" in r.json()["detail"]
|
| 207 |
+
|
| 208 |
+
def test_db_path_inside_workspace_but_unwritable(
|
| 209 |
+
self, tmp_path: Path,
|
| 210 |
+
) -> None:
|
| 211 |
+
"""db_path valide (sous tmp_path) mais pointant sur un fichier
|
| 212 |
+
inexistant en sous-dossier inaccessible : 500 propre, pas de
|
| 213 |
+
crash silencieux."""
|
| 214 |
+
from fastapi.testclient import TestClient
|
| 215 |
+
|
| 216 |
+
app = _make_app()
|
| 217 |
+
# Sous-dossier inexistant sous tmp_path — SQLite va échouer
|
| 218 |
+
# à créer le fichier, mais la validation de chemin passe.
|
| 219 |
+
bad_under_workspace = tmp_path / "no_such_subdir" / "history.sqlite"
|
| 220 |
+
|
| 221 |
+
with TestClient(app, raise_server_exceptions=False) as client:
|
| 222 |
+
r = client.get(
|
| 223 |
+
"/api/history/regressions",
|
| 224 |
+
params={"db_path": str(bad_under_workspace)},
|
| 225 |
+
)
|
| 226 |
assert r.status_code in (200, 500)
|
| 227 |
if r.status_code == 500:
|
| 228 |
body = r.json()
|
|
@@ -6,7 +6,7 @@ Cible : lignes 100, 163, 170, 223 de
|
|
| 6 |
- 100 : ``/api/benchmark/start`` retourne 429 quand le sémaphore
|
| 7 |
des jobs concurrents est plein ;
|
| 8 |
- 163 : ``validated_prompt_filename`` est appelé pour chaque
|
| 9 |
-
``
|
| 10 |
invalide doit être rejeté en 400 (vecteur d'exfiltration LLM) ;
|
| 11 |
- 170 : ``/api/benchmark/run`` retourne 429 quand le sémaphore
|
| 12 |
est plein ;
|
|
|
|
| 6 |
- 100 : ``/api/benchmark/start`` retourne 429 quand le sémaphore
|
| 7 |
des jobs concurrents est plein ;
|
| 8 |
- 163 : ``validated_prompt_filename`` est appelé pour chaque
|
| 9 |
+
``PipelineConfig.prompt_file`` non-vide → un nom de prompt
|
| 10 |
invalide doit être rejeté en 400 (vecteur d'exfiltration LLM) ;
|
| 11 |
- 170 : ``/api/benchmark/run`` retourne 429 quand le sémaphore
|
| 12 |
est plein ;
|
|
@@ -4,7 +4,7 @@
|
|
| 4 |
Pourquoi ce fichier
|
| 5 |
-------------------
|
| 6 |
``_build_llm_adapter`` et ``_engine_from_competitor`` sont les
|
| 7 |
-
points de **routage** entre la config web (``
|
| 8 |
et les adapters concrets : si une régression silencieusement
|
| 9 |
fait passer ``mistral`` au lieu de ``openai``, ou ``tesseract``
|
| 10 |
au lieu de ``mistral_ocr``, le benchmark tourne mais avec le
|
|
@@ -36,7 +36,7 @@ from picarones.interfaces.web.benchmark_utils import (
|
|
| 36 |
_engine_from_competitor,
|
| 37 |
sse_format,
|
| 38 |
)
|
| 39 |
-
from picarones.interfaces.web.models import
|
| 40 |
|
| 41 |
|
| 42 |
# ──────────────────────────────────────────────────────────────────────
|
|
@@ -61,7 +61,7 @@ class TestBuildLLMAdapterRouting:
|
|
| 61 |
def test_provider_routes_to_expected_adapter(
|
| 62 |
self, provider: str, expected_class_name: str,
|
| 63 |
) -> None:
|
| 64 |
-
comp =
|
| 65 |
name="t", ocr_engine="", llm_provider=provider, llm_model="m",
|
| 66 |
)
|
| 67 |
adapter = _build_llm_adapter(comp)
|
|
@@ -71,7 +71,7 @@ class TestBuildLLMAdapterRouting:
|
|
| 71 |
)
|
| 72 |
|
| 73 |
def test_unknown_provider_raises_value_error(self) -> None:
|
| 74 |
-
comp =
|
| 75 |
name="t", ocr_engine="",
|
| 76 |
llm_provider="some_made_up_provider", llm_model="x",
|
| 77 |
)
|
|
@@ -82,7 +82,7 @@ class TestBuildLLMAdapterRouting:
|
|
| 82 |
"""Quand ``llm_model`` est vide, on passe ``None`` à
|
| 83 |
l'adapter (qui utilise son default interne) — pas une
|
| 84 |
chaîne vide qui serait rejetée par l'API."""
|
| 85 |
-
comp =
|
| 86 |
name="t", ocr_engine="", llm_provider="openai", llm_model="",
|
| 87 |
)
|
| 88 |
adapter = _build_llm_adapter(comp)
|
|
@@ -103,7 +103,7 @@ class TestEngineFromCompetitorOCROnly:
|
|
| 103 |
"""Le ``name`` est dérivé de ``(engine_id, ocr_model)`` pour
|
| 104 |
que deux configs distinctes obtiennent automatiquement des
|
| 105 |
identifiants différents au resolver (cf. S9 fix)."""
|
| 106 |
-
comp =
|
| 107 |
name="t", ocr_engine="tesseract", llm_provider="",
|
| 108 |
ocr_model="fra",
|
| 109 |
)
|
|
@@ -113,10 +113,10 @@ class TestEngineFromCompetitorOCROnly:
|
|
| 113 |
def test_tesseract_only_different_lang_distinct_name(self) -> None:
|
| 114 |
"""Garantie anti-collision : ``lang=eng`` et ``lang=fra``
|
| 115 |
produisent des ``name`` distincts au resolver."""
|
| 116 |
-
comp_fra =
|
| 117 |
ocr_engine="tesseract", llm_provider="", ocr_model="fra",
|
| 118 |
)
|
| 119 |
-
comp_eng =
|
| 120 |
ocr_engine="tesseract", llm_provider="", ocr_model="eng",
|
| 121 |
)
|
| 122 |
assert _engine_from_competitor(comp_fra).name == "tesseract_fra"
|
|
@@ -126,7 +126,7 @@ class TestEngineFromCompetitorOCROnly:
|
|
| 126 |
"""``RuntimeError`` (et pas ``ValueError`` brut) — c'est le
|
| 127 |
contrat documenté pour que le worker thread puisse
|
| 128 |
loguer ``warning`` et passer au concurrent suivant."""
|
| 129 |
-
comp =
|
| 130 |
name="t", ocr_engine="not_an_engine", llm_provider="",
|
| 131 |
)
|
| 132 |
with pytest.raises(RuntimeError, match="inconnu"):
|
|
@@ -141,30 +141,61 @@ class TestEngineFromCompetitorPipeline:
|
|
| 141 |
("pipeline_mode", "expected_mode"),
|
| 142 |
[
|
| 143 |
("text_only", "text_only"),
|
| 144 |
-
("post_correction_text", "text_only"),
|
| 145 |
("text_and_image", "text_and_image"),
|
| 146 |
-
("post_correction_image", "text_and_image"),
|
| 147 |
-
("", "text_only"), # fallback
|
| 148 |
],
|
| 149 |
)
|
| 150 |
-
def
|
| 151 |
self, pipeline_mode: str, expected_mode: str,
|
| 152 |
) -> None:
|
| 153 |
-
"""Modes qui exigent un OCR amont
|
| 154 |
-
|
| 155 |
-
|
|
|
|
|
|
|
|
|
|
| 156 |
name="t", ocr_engine="tesseract", llm_provider="mistral",
|
| 157 |
llm_model="m", ocr_model="fra", pipeline_mode=pipeline_mode,
|
| 158 |
)
|
| 159 |
pipeline = _engine_from_competitor(comp)
|
| 160 |
assert pipeline.mode == expected_mode
|
| 161 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 162 |
def test_zero_shot_mode_requires_corpus_ocr(self) -> None:
|
| 163 |
"""Le mode ``zero_shot`` exige ``ocr_adapter=None`` au niveau
|
| 164 |
du pipeline (le VLM lit l'image directement) — donc côté
|
| 165 |
factory web, il doit être combiné avec ``ocr_engine=corpus``
|
| 166 |
ou ``""``, pas avec un moteur live."""
|
| 167 |
-
comp =
|
| 168 |
name="t", ocr_engine="corpus", llm_provider="mistral",
|
| 169 |
llm_model="m", pipeline_mode="zero_shot",
|
| 170 |
)
|
|
@@ -173,18 +204,20 @@ class TestEngineFromCompetitorPipeline:
|
|
| 173 |
assert pipeline.ocr_adapter is None
|
| 174 |
|
| 175 |
def test_pipeline_name_from_explicit_name(self) -> None:
|
| 176 |
-
comp =
|
| 177 |
name="my-pipeline", ocr_engine="tesseract",
|
| 178 |
llm_provider="mistral", llm_model="m", ocr_model="fra",
|
|
|
|
| 179 |
)
|
| 180 |
pipeline = _engine_from_competitor(comp)
|
| 181 |
assert pipeline.pipeline_name == "my-pipeline"
|
| 182 |
|
| 183 |
def test_pipeline_name_default_format(self) -> None:
|
| 184 |
"""Sans ``name`` explicite, format ``{engine} → {model}``."""
|
| 185 |
-
comp =
|
| 186 |
name="", ocr_engine="tesseract", llm_provider="mistral",
|
| 187 |
llm_model="ministral-3b-latest", ocr_model="fra",
|
|
|
|
| 188 |
)
|
| 189 |
pipeline = _engine_from_competitor(comp)
|
| 190 |
assert "tesseract" in pipeline.pipeline_name
|
|
@@ -195,9 +228,10 @@ class TestEngineFromCompetitorPipeline:
|
|
| 195 |
par défaut (``correction_medieval_french.txt``). Cf. S9 :
|
| 196 |
``prompt_template`` contient désormais le CONTENU lu sur
|
| 197 |
disque, pas le filename brut."""
|
| 198 |
-
comp =
|
| 199 |
name="t", ocr_engine="tesseract", llm_provider="mistral",
|
| 200 |
llm_model="m", ocr_model="fra", prompt_file="",
|
|
|
|
| 201 |
)
|
| 202 |
pipeline = _engine_from_competitor(comp)
|
| 203 |
# Le template ne doit PAS être le filename littéral.
|
|
@@ -220,7 +254,7 @@ class TestEngineFromCompetitorCorpusOCR:
|
|
| 220 |
def test_corpus_or_empty_without_llm_raises(
|
| 221 |
self, ocr_engine: str,
|
| 222 |
) -> None:
|
| 223 |
-
comp =
|
| 224 |
name="t", ocr_engine=ocr_engine, llm_provider="",
|
| 225 |
)
|
| 226 |
with pytest.raises(ValueError, match="llm_provider"):
|
|
@@ -233,7 +267,7 @@ class TestEngineFromCompetitorCorpusOCR:
|
|
| 233 |
"""Mode corpus + LLM → pipeline ``zero_shot`` (le LLM/VLM
|
| 234 |
traite l'image ou l'OCR pré-calculé, l'``ocr_adapter`` est
|
| 235 |
``None``)."""
|
| 236 |
-
comp =
|
| 237 |
name="post-corr", ocr_engine=ocr_engine,
|
| 238 |
llm_provider="mistral", llm_model="m",
|
| 239 |
pipeline_mode="zero_shot",
|
|
@@ -247,7 +281,7 @@ class TestEngineFromCompetitorCorpusOCR:
|
|
| 247 |
|
| 248 |
def test_corpus_pipeline_name_format(self) -> None:
|
| 249 |
"""Sans ``name``, format ``corpus_ocr → {model}``."""
|
| 250 |
-
comp =
|
| 251 |
name="", ocr_engine="corpus", llm_provider="mistral",
|
| 252 |
llm_model="ministral-3b-latest",
|
| 253 |
pipeline_mode="zero_shot",
|
|
@@ -273,7 +307,7 @@ class TestEngineFromCompetitorCloudWithoutSDK:
|
|
| 273 |
def test_cloud_engine_without_sdk_runtime_error(
|
| 274 |
self, engine: str, module_path: str,
|
| 275 |
) -> None:
|
| 276 |
-
comp =
|
| 277 |
name="t", ocr_engine=engine, llm_provider="",
|
| 278 |
)
|
| 279 |
with patch.dict(sys.modules, {module_path: None}):
|
|
|
|
| 4 |
Pourquoi ce fichier
|
| 5 |
-------------------
|
| 6 |
``_build_llm_adapter`` et ``_engine_from_competitor`` sont les
|
| 7 |
+
points de **routage** entre la config web (``PipelineConfig``)
|
| 8 |
et les adapters concrets : si une régression silencieusement
|
| 9 |
fait passer ``mistral`` au lieu de ``openai``, ou ``tesseract``
|
| 10 |
au lieu de ``mistral_ocr``, le benchmark tourne mais avec le
|
|
|
|
| 36 |
_engine_from_competitor,
|
| 37 |
sse_format,
|
| 38 |
)
|
| 39 |
+
from picarones.interfaces.web.models import PipelineConfig
|
| 40 |
|
| 41 |
|
| 42 |
# ──────────────────────────────────────────────────────────────────────
|
|
|
|
| 61 |
def test_provider_routes_to_expected_adapter(
|
| 62 |
self, provider: str, expected_class_name: str,
|
| 63 |
) -> None:
|
| 64 |
+
comp = PipelineConfig(
|
| 65 |
name="t", ocr_engine="", llm_provider=provider, llm_model="m",
|
| 66 |
)
|
| 67 |
adapter = _build_llm_adapter(comp)
|
|
|
|
| 71 |
)
|
| 72 |
|
| 73 |
def test_unknown_provider_raises_value_error(self) -> None:
|
| 74 |
+
comp = PipelineConfig(
|
| 75 |
name="t", ocr_engine="",
|
| 76 |
llm_provider="some_made_up_provider", llm_model="x",
|
| 77 |
)
|
|
|
|
| 82 |
"""Quand ``llm_model`` est vide, on passe ``None`` à
|
| 83 |
l'adapter (qui utilise son default interne) — pas une
|
| 84 |
chaîne vide qui serait rejetée par l'API."""
|
| 85 |
+
comp = PipelineConfig(
|
| 86 |
name="t", ocr_engine="", llm_provider="openai", llm_model="",
|
| 87 |
)
|
| 88 |
adapter = _build_llm_adapter(comp)
|
|
|
|
| 103 |
"""Le ``name`` est dérivé de ``(engine_id, ocr_model)`` pour
|
| 104 |
que deux configs distinctes obtiennent automatiquement des
|
| 105 |
identifiants différents au resolver (cf. S9 fix)."""
|
| 106 |
+
comp = PipelineConfig(
|
| 107 |
name="t", ocr_engine="tesseract", llm_provider="",
|
| 108 |
ocr_model="fra",
|
| 109 |
)
|
|
|
|
| 113 |
def test_tesseract_only_different_lang_distinct_name(self) -> None:
|
| 114 |
"""Garantie anti-collision : ``lang=eng`` et ``lang=fra``
|
| 115 |
produisent des ``name`` distincts au resolver."""
|
| 116 |
+
comp_fra = PipelineConfig(
|
| 117 |
ocr_engine="tesseract", llm_provider="", ocr_model="fra",
|
| 118 |
)
|
| 119 |
+
comp_eng = PipelineConfig(
|
| 120 |
ocr_engine="tesseract", llm_provider="", ocr_model="eng",
|
| 121 |
)
|
| 122 |
assert _engine_from_competitor(comp_fra).name == "tesseract_fra"
|
|
|
|
| 126 |
"""``RuntimeError`` (et pas ``ValueError`` brut) — c'est le
|
| 127 |
contrat documenté pour que le worker thread puisse
|
| 128 |
loguer ``warning`` et passer au concurrent suivant."""
|
| 129 |
+
comp = PipelineConfig(
|
| 130 |
name="t", ocr_engine="not_an_engine", llm_provider="",
|
| 131 |
)
|
| 132 |
with pytest.raises(RuntimeError, match="inconnu"):
|
|
|
|
| 141 |
("pipeline_mode", "expected_mode"),
|
| 142 |
[
|
| 143 |
("text_only", "text_only"),
|
|
|
|
| 144 |
("text_and_image", "text_and_image"),
|
|
|
|
|
|
|
| 145 |
],
|
| 146 |
)
|
| 147 |
+
def test_pipeline_mode_passes_through_with_ocr(
|
| 148 |
self, pipeline_mode: str, expected_mode: str,
|
| 149 |
) -> None:
|
| 150 |
+
"""Modes canoniques qui exigent un OCR amont — Phase 2 du
|
| 151 |
+
chantier post-rewrite : plus de mapping/alias. Les 3 valeurs
|
| 152 |
+
de :class:`PipelineMode` traversent telles quelles vers le
|
| 153 |
+
``OCRLLMPipelineConfig`` (``zero_shot`` testé séparément car
|
| 154 |
+
il refuse l'OCR amont)."""
|
| 155 |
+
comp = PipelineConfig(
|
| 156 |
name="t", ocr_engine="tesseract", llm_provider="mistral",
|
| 157 |
llm_model="m", ocr_model="fra", pipeline_mode=pipeline_mode,
|
| 158 |
)
|
| 159 |
pipeline = _engine_from_competitor(comp)
|
| 160 |
assert pipeline.mode == expected_mode
|
| 161 |
|
| 162 |
+
@pytest.mark.parametrize(
|
| 163 |
+
"deprecated_mode",
|
| 164 |
+
["post_correction_text", "post_correction_image", "POST_CORRECTION_TEXT"],
|
| 165 |
+
)
|
| 166 |
+
def test_legacy_aliases_rejected_at_pydantic_level(
|
| 167 |
+
self, deprecated_mode: str,
|
| 168 |
+
) -> None:
|
| 169 |
+
"""Phase 2 rupture API : les anciens alias
|
| 170 |
+
(``post_correction_text``/``post_correction_image``) sont
|
| 171 |
+
rejetés par Pydantic au niveau ``PipelineConfig`` — plus de
|
| 172 |
+
mapping silencieux vers ``text_only`` / ``text_and_image``."""
|
| 173 |
+
from pydantic import ValidationError
|
| 174 |
+
with pytest.raises(ValidationError):
|
| 175 |
+
PipelineConfig(
|
| 176 |
+
name="t", ocr_engine="tesseract", llm_provider="mistral",
|
| 177 |
+
llm_model="m", ocr_model="fra",
|
| 178 |
+
pipeline_mode=deprecated_mode,
|
| 179 |
+
)
|
| 180 |
+
|
| 181 |
+
def test_empty_pipeline_mode_with_llm_raises(self) -> None:
|
| 182 |
+
"""Phase 2 rupture API : un client qui combine ``llm_provider``
|
| 183 |
+
non vide avec ``pipeline_mode=""`` reçoit désormais une
|
| 184 |
+
``ValueError`` claire — l'ancien fallback silencieux vers
|
| 185 |
+
``text_only`` masquait la config incomplète."""
|
| 186 |
+
comp = PipelineConfig(
|
| 187 |
+
name="t", ocr_engine="tesseract", llm_provider="mistral",
|
| 188 |
+
llm_model="m", ocr_model="fra", pipeline_mode="",
|
| 189 |
+
)
|
| 190 |
+
with pytest.raises(ValueError, match="pipeline_mode invalide"):
|
| 191 |
+
_engine_from_competitor(comp)
|
| 192 |
+
|
| 193 |
def test_zero_shot_mode_requires_corpus_ocr(self) -> None:
|
| 194 |
"""Le mode ``zero_shot`` exige ``ocr_adapter=None`` au niveau
|
| 195 |
du pipeline (le VLM lit l'image directement) — donc côté
|
| 196 |
factory web, il doit être combiné avec ``ocr_engine=corpus``
|
| 197 |
ou ``""``, pas avec un moteur live."""
|
| 198 |
+
comp = PipelineConfig(
|
| 199 |
name="t", ocr_engine="corpus", llm_provider="mistral",
|
| 200 |
llm_model="m", pipeline_mode="zero_shot",
|
| 201 |
)
|
|
|
|
| 204 |
assert pipeline.ocr_adapter is None
|
| 205 |
|
| 206 |
def test_pipeline_name_from_explicit_name(self) -> None:
|
| 207 |
+
comp = PipelineConfig(
|
| 208 |
name="my-pipeline", ocr_engine="tesseract",
|
| 209 |
llm_provider="mistral", llm_model="m", ocr_model="fra",
|
| 210 |
+
pipeline_mode="text_only",
|
| 211 |
)
|
| 212 |
pipeline = _engine_from_competitor(comp)
|
| 213 |
assert pipeline.pipeline_name == "my-pipeline"
|
| 214 |
|
| 215 |
def test_pipeline_name_default_format(self) -> None:
|
| 216 |
"""Sans ``name`` explicite, format ``{engine} → {model}``."""
|
| 217 |
+
comp = PipelineConfig(
|
| 218 |
name="", ocr_engine="tesseract", llm_provider="mistral",
|
| 219 |
llm_model="ministral-3b-latest", ocr_model="fra",
|
| 220 |
+
pipeline_mode="text_only",
|
| 221 |
)
|
| 222 |
pipeline = _engine_from_competitor(comp)
|
| 223 |
assert "tesseract" in pipeline.pipeline_name
|
|
|
|
| 228 |
par défaut (``correction_medieval_french.txt``). Cf. S9 :
|
| 229 |
``prompt_template`` contient désormais le CONTENU lu sur
|
| 230 |
disque, pas le filename brut."""
|
| 231 |
+
comp = PipelineConfig(
|
| 232 |
name="t", ocr_engine="tesseract", llm_provider="mistral",
|
| 233 |
llm_model="m", ocr_model="fra", prompt_file="",
|
| 234 |
+
pipeline_mode="text_only",
|
| 235 |
)
|
| 236 |
pipeline = _engine_from_competitor(comp)
|
| 237 |
# Le template ne doit PAS être le filename littéral.
|
|
|
|
| 254 |
def test_corpus_or_empty_without_llm_raises(
|
| 255 |
self, ocr_engine: str,
|
| 256 |
) -> None:
|
| 257 |
+
comp = PipelineConfig(
|
| 258 |
name="t", ocr_engine=ocr_engine, llm_provider="",
|
| 259 |
)
|
| 260 |
with pytest.raises(ValueError, match="llm_provider"):
|
|
|
|
| 267 |
"""Mode corpus + LLM → pipeline ``zero_shot`` (le LLM/VLM
|
| 268 |
traite l'image ou l'OCR pré-calculé, l'``ocr_adapter`` est
|
| 269 |
``None``)."""
|
| 270 |
+
comp = PipelineConfig(
|
| 271 |
name="post-corr", ocr_engine=ocr_engine,
|
| 272 |
llm_provider="mistral", llm_model="m",
|
| 273 |
pipeline_mode="zero_shot",
|
|
|
|
| 281 |
|
| 282 |
def test_corpus_pipeline_name_format(self) -> None:
|
| 283 |
"""Sans ``name``, format ``corpus_ocr → {model}``."""
|
| 284 |
+
comp = PipelineConfig(
|
| 285 |
name="", ocr_engine="corpus", llm_provider="mistral",
|
| 286 |
llm_model="ministral-3b-latest",
|
| 287 |
pipeline_mode="zero_shot",
|
|
|
|
| 307 |
def test_cloud_engine_without_sdk_runtime_error(
|
| 308 |
self, engine: str, module_path: str,
|
| 309 |
) -> None:
|
| 310 |
+
comp = PipelineConfig(
|
| 311 |
name="t", ocr_engine=engine, llm_provider="",
|
| 312 |
)
|
| 313 |
with patch.dict(sys.modules, {module_path: None}):
|
|
@@ -31,7 +31,7 @@ from picarones.interfaces.web.benchmark_utils import (
|
|
| 31 |
_OCR_KWARGS_BUILDERS,
|
| 32 |
_engine_from_competitor,
|
| 33 |
)
|
| 34 |
-
from picarones.interfaces.web.models import
|
| 35 |
|
| 36 |
|
| 37 |
# ``cfg_a`` et ``cfg_b`` sont passés tels quels au constructeur de
|
|
@@ -48,10 +48,10 @@ def test_two_distinct_configs_coexist_in_resolver(
|
|
| 48 |
"""Deux competitors avec ``ocr_model`` distincts doivent recevoir
|
| 49 |
des ``name`` distincts au resolver — le bug Tesseract initial,
|
| 50 |
généralisé à tous les moteurs supportés."""
|
| 51 |
-
comp_a =
|
| 52 |
ocr_engine=engine_id, ocr_model="cfg_a", llm_provider="",
|
| 53 |
)
|
| 54 |
-
comp_b =
|
| 55 |
ocr_engine=engine_id, ocr_model="cfg_b", llm_provider="",
|
| 56 |
)
|
| 57 |
try:
|
|
@@ -82,10 +82,10 @@ def test_standalone_plus_pipeline_same_config_coexist(
|
|
| 82 |
seul + un competitor pipeline OCR+LLM partageant la même config
|
| 83 |
OCR. Le resolver doit accepter (les 2 instances Python sont
|
| 84 |
fonctionnellement équivalentes, déduplication idempotente)."""
|
| 85 |
-
comp_standalone =
|
| 86 |
ocr_engine=engine_id, ocr_model="same_config", llm_provider="",
|
| 87 |
)
|
| 88 |
-
comp_pipeline =
|
| 89 |
ocr_engine=engine_id, ocr_model="same_config",
|
| 90 |
llm_provider="mistral", llm_model="mistral-small-latest",
|
| 91 |
pipeline_mode="text_only",
|
|
|
|
| 31 |
_OCR_KWARGS_BUILDERS,
|
| 32 |
_engine_from_competitor,
|
| 33 |
)
|
| 34 |
+
from picarones.interfaces.web.models import PipelineConfig
|
| 35 |
|
| 36 |
|
| 37 |
# ``cfg_a`` et ``cfg_b`` sont passés tels quels au constructeur de
|
|
|
|
| 48 |
"""Deux competitors avec ``ocr_model`` distincts doivent recevoir
|
| 49 |
des ``name`` distincts au resolver — le bug Tesseract initial,
|
| 50 |
généralisé à tous les moteurs supportés."""
|
| 51 |
+
comp_a = PipelineConfig(
|
| 52 |
ocr_engine=engine_id, ocr_model="cfg_a", llm_provider="",
|
| 53 |
)
|
| 54 |
+
comp_b = PipelineConfig(
|
| 55 |
ocr_engine=engine_id, ocr_model="cfg_b", llm_provider="",
|
| 56 |
)
|
| 57 |
try:
|
|
|
|
| 82 |
seul + un competitor pipeline OCR+LLM partageant la même config
|
| 83 |
OCR. Le resolver doit accepter (les 2 instances Python sont
|
| 84 |
fonctionnellement équivalentes, déduplication idempotente)."""
|
| 85 |
+
comp_standalone = PipelineConfig(
|
| 86 |
ocr_engine=engine_id, ocr_model="same_config", llm_provider="",
|
| 87 |
)
|
| 88 |
+
comp_pipeline = PipelineConfig(
|
| 89 |
ocr_engine=engine_id, ocr_model="same_config",
|
| 90 |
llm_provider="mistral", llm_model="mistral-small-latest",
|
| 91 |
pipeline_mode="text_only",
|
|
@@ -41,7 +41,7 @@ from picarones.interfaces.web.benchmark_utils import (
|
|
| 41 |
_engine_from_competitor,
|
| 42 |
_load_prompt_content,
|
| 43 |
)
|
| 44 |
-
from picarones.interfaces.web.models import
|
| 45 |
|
| 46 |
|
| 47 |
class TestLoadPromptContent:
|
|
@@ -113,7 +113,7 @@ class TestEngineFromCompetitorPassesPromptContent:
|
|
| 113 |
pas le filename brut."""
|
| 114 |
|
| 115 |
def test_pipeline_template_contains_file_content(self) -> None:
|
| 116 |
-
comp =
|
| 117 |
name="t",
|
| 118 |
ocr_engine="tesseract",
|
| 119 |
ocr_model="fra",
|
|
@@ -133,7 +133,7 @@ class TestEngineFromCompetitorPassesPromptContent:
|
|
| 133 |
def test_default_prompt_loaded_when_none_specified(self) -> None:
|
| 134 |
"""``prompt_file`` vide → default
|
| 135 |
``correction_medieval_french.txt`` chargé."""
|
| 136 |
-
comp =
|
| 137 |
ocr_engine="tesseract", ocr_model="fra",
|
| 138 |
llm_provider="mistral", llm_model="m",
|
| 139 |
pipeline_mode="text_only", prompt_file="",
|
|
@@ -146,7 +146,7 @@ class TestEngineFromCompetitorPassesPromptContent:
|
|
| 146 |
"""Si le frontend envoie un filename qui n'existe pas, le
|
| 147 |
factory doit lever proprement (pas continuer avec le filename
|
| 148 |
comme prompt — c'est le bug d'origine)."""
|
| 149 |
-
comp =
|
| 150 |
ocr_engine="tesseract", ocr_model="fra",
|
| 151 |
llm_provider="mistral", llm_model="m",
|
| 152 |
pipeline_mode="text_only",
|
|
|
|
| 41 |
_engine_from_competitor,
|
| 42 |
_load_prompt_content,
|
| 43 |
)
|
| 44 |
+
from picarones.interfaces.web.models import PipelineConfig
|
| 45 |
|
| 46 |
|
| 47 |
class TestLoadPromptContent:
|
|
|
|
| 113 |
pas le filename brut."""
|
| 114 |
|
| 115 |
def test_pipeline_template_contains_file_content(self) -> None:
|
| 116 |
+
comp = PipelineConfig(
|
| 117 |
name="t",
|
| 118 |
ocr_engine="tesseract",
|
| 119 |
ocr_model="fra",
|
|
|
|
| 133 |
def test_default_prompt_loaded_when_none_specified(self) -> None:
|
| 134 |
"""``prompt_file`` vide → default
|
| 135 |
``correction_medieval_french.txt`` chargé."""
|
| 136 |
+
comp = PipelineConfig(
|
| 137 |
ocr_engine="tesseract", ocr_model="fra",
|
| 138 |
llm_provider="mistral", llm_model="m",
|
| 139 |
pipeline_mode="text_only", prompt_file="",
|
|
|
|
| 146 |
"""Si le frontend envoie un filename qui n'existe pas, le
|
| 147 |
factory doit lever proprement (pas continuer avec le filename
|
| 148 |
comme prompt — c'est le bug d'origine)."""
|
| 149 |
+
comp = PipelineConfig(
|
| 150 |
ocr_engine="tesseract", ocr_model="fra",
|
| 151 |
llm_provider="mistral", llm_model="m",
|
| 152 |
pipeline_mode="text_only",
|
|
@@ -25,6 +25,7 @@ TestRunnerProgressCallback (5 tests) — progress_callback injecté dans run_
|
|
| 25 |
|
| 26 |
from __future__ import annotations
|
| 27 |
|
|
|
|
| 28 |
import json
|
| 29 |
import os
|
| 30 |
from pathlib import Path
|
|
@@ -33,6 +34,26 @@ from unittest.mock import patch
|
|
| 33 |
import pytest
|
| 34 |
from click.testing import CliRunner
|
| 35 |
from fastapi.testclient import TestClient
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 36 |
|
| 37 |
# ---------------------------------------------------------------------------
|
| 38 |
# Fixtures
|
|
@@ -1277,9 +1298,9 @@ class TestFastAPICorpusUpload:
|
|
| 1277 |
|
| 1278 |
buf = io.BytesIO()
|
| 1279 |
with zipfile.ZipFile(buf, "w") as zf:
|
| 1280 |
-
zf.writestr("page001.jpg",
|
| 1281 |
zf.writestr("page001.gt.txt", "Texte de la page 1")
|
| 1282 |
-
zf.writestr("page002.png",
|
| 1283 |
zf.writestr("page002.gt.txt", "Texte de la page 2")
|
| 1284 |
buf.seek(0)
|
| 1285 |
return buf.getvalue()
|
|
@@ -1292,9 +1313,9 @@ class TestFastAPICorpusUpload:
|
|
| 1292 |
|
| 1293 |
buf = io.BytesIO()
|
| 1294 |
with zipfile.ZipFile(buf, "w") as zf:
|
| 1295 |
-
zf.writestr("page001.jpg",
|
| 1296 |
zf.writestr("page001.gt.txt", "GT ok")
|
| 1297 |
-
zf.writestr("page002.png",
|
| 1298 |
buf.seek(0)
|
| 1299 |
return buf.getvalue()
|
| 1300 |
|
|
@@ -1427,7 +1448,7 @@ class TestFastAPICorpusUpload:
|
|
| 1427 |
|
| 1428 |
buf = io.BytesIO()
|
| 1429 |
with zipfile.ZipFile(buf, "w") as zf:
|
| 1430 |
-
zf.writestr("page001.png",
|
| 1431 |
zf.writestr("page001.xml", alto_xml_bytes)
|
| 1432 |
buf.seek(0)
|
| 1433 |
return buf.getvalue()
|
|
@@ -1502,7 +1523,7 @@ class TestFastAPICorpusUpload:
|
|
| 1502 |
|
| 1503 |
buf = io.BytesIO()
|
| 1504 |
with zipfile.ZipFile(buf, "w") as zf:
|
| 1505 |
-
zf.writestr("page002.png",
|
| 1506 |
zf.writestr("page002.xml", page_xml_bytes)
|
| 1507 |
buf.seek(0)
|
| 1508 |
return buf.getvalue()
|
|
@@ -1553,7 +1574,7 @@ class TestFastAPICorpusUpload:
|
|
| 1553 |
unknown_xml = b'<?xml version="1.0"?><root><item>foo</item></root>'
|
| 1554 |
buf = io.BytesIO()
|
| 1555 |
with zipfile.ZipFile(buf, "w") as zf:
|
| 1556 |
-
zf.writestr("pageX.png",
|
| 1557 |
zf.writestr("pageX.xml", unknown_xml)
|
| 1558 |
buf.seek(0)
|
| 1559 |
r = client.post(
|
|
|
|
| 25 |
|
| 26 |
from __future__ import annotations
|
| 27 |
|
| 28 |
+
import io
|
| 29 |
import json
|
| 30 |
import os
|
| 31 |
from pathlib import Path
|
|
|
|
| 34 |
import pytest
|
| 35 |
from click.testing import CliRunner
|
| 36 |
from fastapi.testclient import TestClient
|
| 37 |
+
from PIL import Image as _PILImage
|
| 38 |
+
|
| 39 |
+
|
| 40 |
+
def _minimal_image_bytes(fmt: str) -> bytes:
|
| 41 |
+
"""Génère une image 1×1 valide qui passe ``validate_image_safe``.
|
| 42 |
+
|
| 43 |
+
Le durcissement Phase 1 du chantier post-rewrite appelle
|
| 44 |
+
``Pillow.verify()`` sur chaque image extraite d'un ZIP — les
|
| 45 |
+
anciens placeholders ``b"\\xff\\xd8\\xff"`` (signature seule) sont
|
| 46 |
+
désormais rejetés. Cette fonction produit l'image minimale au
|
| 47 |
+
setup des fixtures.
|
| 48 |
+
"""
|
| 49 |
+
buf = io.BytesIO()
|
| 50 |
+
_PILImage.new("RGB", (1, 1), color=(200, 200, 200)).save(buf, fmt)
|
| 51 |
+
return buf.getvalue()
|
| 52 |
+
|
| 53 |
+
|
| 54 |
+
_MINIMAL_PNG_BYTES = _minimal_image_bytes("PNG")
|
| 55 |
+
_MINIMAL_JPEG_BYTES = _minimal_image_bytes("JPEG")
|
| 56 |
+
|
| 57 |
|
| 58 |
# ---------------------------------------------------------------------------
|
| 59 |
# Fixtures
|
|
|
|
| 1298 |
|
| 1299 |
buf = io.BytesIO()
|
| 1300 |
with zipfile.ZipFile(buf, "w") as zf:
|
| 1301 |
+
zf.writestr("page001.jpg", _MINIMAL_JPEG_BYTES)
|
| 1302 |
zf.writestr("page001.gt.txt", "Texte de la page 1")
|
| 1303 |
+
zf.writestr("page002.png", _MINIMAL_PNG_BYTES)
|
| 1304 |
zf.writestr("page002.gt.txt", "Texte de la page 2")
|
| 1305 |
buf.seek(0)
|
| 1306 |
return buf.getvalue()
|
|
|
|
| 1313 |
|
| 1314 |
buf = io.BytesIO()
|
| 1315 |
with zipfile.ZipFile(buf, "w") as zf:
|
| 1316 |
+
zf.writestr("page001.jpg", _MINIMAL_JPEG_BYTES)
|
| 1317 |
zf.writestr("page001.gt.txt", "GT ok")
|
| 1318 |
+
zf.writestr("page002.png", _MINIMAL_PNG_BYTES)
|
| 1319 |
buf.seek(0)
|
| 1320 |
return buf.getvalue()
|
| 1321 |
|
|
|
|
| 1448 |
|
| 1449 |
buf = io.BytesIO()
|
| 1450 |
with zipfile.ZipFile(buf, "w") as zf:
|
| 1451 |
+
zf.writestr("page001.png", _MINIMAL_PNG_BYTES)
|
| 1452 |
zf.writestr("page001.xml", alto_xml_bytes)
|
| 1453 |
buf.seek(0)
|
| 1454 |
return buf.getvalue()
|
|
|
|
| 1523 |
|
| 1524 |
buf = io.BytesIO()
|
| 1525 |
with zipfile.ZipFile(buf, "w") as zf:
|
| 1526 |
+
zf.writestr("page002.png", _MINIMAL_PNG_BYTES)
|
| 1527 |
zf.writestr("page002.xml", page_xml_bytes)
|
| 1528 |
buf.seek(0)
|
| 1529 |
return buf.getvalue()
|
|
|
|
| 1574 |
unknown_xml = b'<?xml version="1.0"?><root><item>foo</item></root>'
|
| 1575 |
buf = io.BytesIO()
|
| 1576 |
with zipfile.ZipFile(buf, "w") as zf:
|
| 1577 |
+
zf.writestr("pageX.png", _MINIMAL_PNG_BYTES)
|
| 1578 |
zf.writestr("pageX.xml", unknown_xml)
|
| 1579 |
buf.seek(0)
|
| 1580 |
r = client.post(
|