Ma-Ri-Ba-Ku / Picarones

Claude committed on May 14

Commit bddfd89 (unverified) · 1 parent: 76683ea

test(migration): Phase B0, foundations for the Option B migration (RunOrchestrator)

Groundwork for migrating `run_benchmark_via_service` →
`RunOrchestrator.execute(RunSpec)`. Three deliverables that will act as
safety nets for phases B1-B8:

- **B0.1 invariance test** (`tests/integration/test_migration_invariance.py`):
runs a deterministic benchmark (PrecomputedTextAdapter × 2 docs) through
the current facade, serializes the `BenchmarkResult`, and compares it to a
JSON snapshot. Volatile fields (paths, timestamps, durations) are
normalized first. The snapshot is stored in `tests/integration/snapshots/`.
Update mode is enabled with `PICARONES_UPDATE_SNAPSHOT=1`.

- **B0.2 feature-parity skeleton** (`tests/app/services/test_run_orchestrator_feature_parity.py`):
12 tests skipped with the reason `TODO Phase B2.X`, each documenting
precisely one of the 7 features to port (progress_callback, cancel_event,
partial_dir, entity_extractor, char_exclude+normalization_profile,
profile hooks, legacy output_json). Each test is un-skipped as Phase B2
progresses; together they gate Checkpoint C1.

- **B0.3 test inventory** (`docs/migration/option_b_test_inventory.md`):
classifies the 26 test files that either call
`run_benchmark_via_service` (category A, 10 files) or consume
`BenchmarkResult` without calling the runner (category B, 16 files).
No file patches the runner via monkeypatch, which simplifies migration B4.

Tests: 60 passed, 12 skipped (B2 placeholders), 0 failed.
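
The snapshot mechanism described in B0.1 can be sketched as follows. This is an illustration only, not the project's `_normalize_for_snapshot`: the function names `normalize` and `matches_snapshot`, the `fixtures_root` parameter, and the key sets are assumptions inferred from the committed snapshot, where version and date fields read `PINNED`, image paths start with `FIXTURES/`, durations are `0.0`, and floats carry six decimals.

```python
import json
import os
from pathlib import Path
from typing import Any

# Assumed field names, inferred from the committed snapshot (hypothetical).
PINNED_KEYS = frozenset({"run_date", "picarones_version", "engine_version"})
ZEROED_KEYS = frozenset({"duration_seconds"})


def normalize(obj: Any, fixtures_root: str) -> Any:
    """Return a copy of ``obj`` with volatile values replaced by stable ones."""
    if isinstance(obj, dict):
        out = {}
        for key, value in obj.items():
            if key in PINNED_KEYS:
                out[key] = "PINNED"
            elif key in ZEROED_KEYS:
                out[key] = 0.0
            else:
                out[key] = normalize(value, fixtures_root)
        return out
    if isinstance(obj, list):
        return [normalize(item, fixtures_root) for item in obj]
    if isinstance(obj, float):
        # Rounding avoids spurious diffs in the last float digits.
        return round(obj, 6)
    if isinstance(obj, str) and obj.startswith(fixtures_root):
        # Machine-specific temp paths become a stable placeholder.
        return "FIXTURES" + obj[len(fixtures_root):]
    return obj


def matches_snapshot(result: dict, snapshot_path: Path, fixtures_root: str) -> bool:
    """Compare a normalized result to the snapshot, or rewrite it in update mode."""
    normalized = normalize(result, fixtures_root)
    if os.environ.get("PICARONES_UPDATE_SNAPSHOT") == "1":
        text = json.dumps(normalized, indent=2, sort_keys=True, ensure_ascii=False)
        snapshot_path.write_text(text + "\n", encoding="utf-8")
        return True
    expected = json.loads(snapshot_path.read_text(encoding="utf-8"))
    return normalized == expected
```

Writing the snapshot with `sort_keys=True` keeps the file diff-friendly, so a review of an intentional snapshot update shows only the values that changed.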

Files changed (4)

docs/migration/option_b_test_inventory.md +130 -0
tests/app/services/test_run_orchestrator_feature_parity.py +279 -0
tests/integration/snapshots/migration_invariance.json +470 -0
tests/integration/test_migration_invariance.py +289 -0

docs/migration/option_b_test_inventory.md ADDED

	@@ -0,0 +1,130 @@

+# Inventory of tests to migrate: Option B (RunOrchestrator)
+Reference document for Phase B4 of the migration
+`run_benchmark_via_service` → `RunOrchestrator.execute(RunSpec)`.
+Compiled in Phase B0 by an exhaustive scan of `tests/`:
+- `grep -rln "run_benchmark_via_service" tests/` → 10 files (category A)
+- `grep -rln "BenchmarkResult" tests/` → 20 files, of which 10 overlap with A and 10 are in category B only
+No file **patches** `run_benchmark_via_service` via `monkeypatch.setattr`
+or `mock.patch` (verified in Phase B0). All call sites are direct
+calls. This simplifies the migration strategy: there are no indirect targets
+to rewire.
+---
+## Category A: tests that call `run_benchmark_via_service` directly
+These tests must be migrated to `RunOrchestrator.execute(spec)` or to the
+`build_run_spec_from_engines()` helper (Phase B3.1) for cases that start
+from in-memory adapter instances.
+| # | File | Size | Occurrences | B4 priority | Notes |
+|---|---|---|---|---|---|
+| A1 | `tests/app/test_sprint_d2b_partial_dir_resume.py` | 506 LOC | 12 | **High** | Tests `partial_dir` resume. Must validate the port to `_orchestrator_partial.py` (Phase B2.3). Core of the non-regression effort. |
+| A2 | `tests/app/test_sprint_d2cdef_features.py` | 473 LOC | 14 | **High** | Tests the 7 extended parameters (`profile`, `entity_extractor`, `cancel_event`, etc.). Must validate each feature ported in Phase B2. |
+| A3 | `tests/web/test_sprint6_web_interface.py` | 1392 LOC | 10 | **High** | Web integration test. Will confirm that migrating `run_benchmark_thread_v2` breaks nothing on the UI side. |
+| A4 | `tests/app/test_character_analysis_in_runner.py` | 246 LOC | 12 | Medium | Tests per-engine character analysis. Mechanical conversion. |
+| A5 | `tests/app/test_sprint_h2b_canonical_in_runner.py` | 191 LOC | 9 | Medium | Tests extraction of the `CANONICAL_DOCUMENT`. To be adapted to the new ViewExecutor. |
+| A6 | `tests/evaluation/test_public_api.py` | n/a | 7 | Medium | Public API. Will include a presence test for `RunOrchestrator`. |
+| A7 | `tests/evaluation/metrics/test_sprint12_nouvelles_fonctionnalites.py` | 288 LOC | 4 | Low | Mechanical conversion. |
+| A8 | `tests/evaluation/metrics/test_sprint_a14_s1_normalization_propagation.py` | n/a | 2 | Low | Checks `normalization_profile`; validates Phase B2.5 (propagation via `EvaluationView`). |
+| A9 | `tests/evaluation/test_metric_hooks.py` | n/a | 1 | Low | Trivial. One-line conversion. |
+| A10 | `tests/architecture/test_file_budgets.py` | n/a | (reference only) | Low | Budgets for the `_benchmark_*.py` modules to be updated after Phase B2/B7. |
+**Category A total**: 10 files, ~3500 LOC, ~71 occurrences.
+---
+## Category B: tests that consume `BenchmarkResult` without calling the runner
+These tests **need no change** as long as the
+`RunResult → BenchmarkResult` converter (`_benchmark_converter.py`) stays in place
+after the migration. They consume either a hand-built `BenchmarkResult`
+(fixture) or a `BenchmarkResult` produced by a runner call
+made inside a shared fixture.
+| # | File | Role |
+|---|---|---|
+| B1 | `tests/golden/test_s5_benchmark_result_json_stable.py` | Stable JSON round-trip. Unchanged as long as `BenchmarkResult.from_json_object`/`to_dict` remain. |
+| B2 | `tests/reports/test_report.py` | HTML rendering. Unchanged as long as `ReportGenerator(result)` accepts `BenchmarkResult`. |
+| B3 | `tests/reports/test_extra_metrics.py` | Additional metrics attached to the report. |
+| B4 | `tests/reports/test_sprint72_worst_lines.py` | Worst-N lines (consumes a non-compacted `BenchmarkResult`). |
+| B5 | `tests/evaluation/metrics/test_results.py` | `MetricsResult` / `aggregate_metrics` API. |
+| B6 | `tests/evaluation/metrics/test_sprint36_ensemble_narrative.py` | Narrative engine. Reads the `benchmark_data` dict. |
+| B7 | `tests/evaluation/metrics/test_sprint44_median_default.py` | Median/Pareto. |
+| B8 | `tests/evaluation/metrics/test_sprint45_stratification.py` | Corpus stratification. |
+| B9 | `tests/evaluation/test_sprint14_robust_filtering.py` | Robustness filter. |
+| B10 | `tests/adapters/corpus/test_sprint8_escriptorium_gallica.py` | eScriptorium / Gallica importer. |
+| B11 | `tests/integration/test_importer_fallback_wiring.py` | Importer fallback. Integration test. |
+| B12 | `tests/integration/test_s5_disk_full_simulation.py` | Disk full. |
+| B13 | `tests/security/test_phase1_post_rewrite_wiring.py` | Post-rewrite security. |
+| B14 | `tests/security/test_s1_xss_in_reports.py` | XSS in reports. |
+| B15 | `tests/test_minimal_install.py` | Minimal install (smoke test). |
+| B16 | `tests/web/test_sprint28_ux_save_compare.py` | Web save/compare UX. |
+**Category B total**: 16 files, **NO change required** for Option B.
+---
+## Category C: tests that already use `RunOrchestrator`
+For information: these tests serve as templates for the migration.
+| # | File | Role |
+|---|---|---|
+| C1 | `tests/app/test_run_orchestrator.py` | Full unit tests of `RunOrchestrator`. Reference template. |
+| C2 | `tests/integration/test_runner_profiles.py` | Hook profiles through `RunOrchestrator`. |
+| C3 | `tests/integration/test_html_views.py` | HTML views through `RunOrchestrator`. |
+| C4 | `tests/integration/test_narrative_and_views.py` | Narrative engine + views. |
+| C5 | `tests/integration/test_engines_and_llm.py` | Engines + LLM through `RunOrchestrator`. |
+---
+## Overall Phase B4 strategy
+1. **Step 1** (0.5 d): shared fixtures in `tests/conftest.py`:
+   - `make_minimal_corpus_zip()`: deterministic 2-doc corpus zip
+   - `run_orchestrator_factory(tmp_path)`: `RunOrchestrator(workspace)`
+   - `build_minimal_run_spec(corpus_zip, output_dir, adapters)`: helper
+   - `assert_benchmark_results_equal(a, b, *, ignore=("started_at", "completed_at"))`
+2. **Step 2** (2 d): migrate category A, starting with the high-priority files:
+   - A1 (partial resume) + A2 (extended features) + A3 (web) → 1 d
+   - A4, A5, A6 → 0.5 d
+   - A7, A8, A9 → 0.5 d (trivial)
+3. **Step 3** (0.5 d): update `tests/architecture/test_file_budgets.py` (A10):
+   - Mark the `_benchmark_*.py` modules as deprecated
+   - Raise the budget of `run_orchestrator.py` (~+300 LOC after Phase B2)
+4. **Step 4** (1 d): full suite run + adjustments:
+   - `pytest tests/ -q --tb=short`
+   - The invariance snapshot (`test_migration_invariance.py`) must stay green
+   - Feature parity (`test_run_orchestrator_feature_parity.py`) all green
+**Total Phase B4 estimate**: 3-4 days, in line with the plan.
+---
+## Post-migration validation
+At the end of Phase B4:
+```bash
+# All green
+python -m pytest tests/ -q --tb=short
+# Invariance snapshot unchanged
+python -m pytest tests/integration/test_migration_invariance.py -v
+# 7 parity features ported
+python -m pytest tests/app/services/test_run_orchestrator_feature_parity.py -v
+# No residual occurrence of run_benchmark_via_service outside the legacy module
+grep -rln "run_benchmark_via_service" tests/ picarones/ | \
+  grep -v "benchmark_runner.py\|_benchmark_" | \
+  wc -l
+# Expected: 0 (or 1 if a deprecated public export is kept in __init__.py)
+```
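
The inventory lists an `assert_benchmark_results_equal(a, b, *, ignore=("started_at", "completed_at"))` helper among the shared fixtures without showing it. A minimal sketch of what such a helper could look like is given below. It assumes both results have already been serialized to plain dicts and lists; the real helper would presumably accept `BenchmarkResult` objects, and `_strip_keys` is a hypothetical name.

```python
from typing import Any, Iterable


def _strip_keys(obj: Any, ignore: frozenset) -> Any:
    """Recursively drop the ignored keys from nested dicts and lists."""
    if isinstance(obj, dict):
        return {k: _strip_keys(v, ignore) for k, v in obj.items() if k not in ignore}
    if isinstance(obj, list):
        return [_strip_keys(item, ignore) for item in obj]
    return obj


def assert_benchmark_results_equal(
    a: dict,
    b: dict,
    *,
    ignore: Iterable[str] = ("started_at", "completed_at"),
) -> None:
    """Fail if the two serialized results differ outside the ignored keys."""
    ignored = frozenset(ignore)
    left = _strip_keys(a, ignored)
    right = _strip_keys(b, ignored)
    assert left == right, f"results differ outside ignored keys {sorted(ignored)}"
```

Dropping keys at every nesting level, rather than only at the top, matters here because timestamps can appear inside per-engine or per-document entries.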

tests/app/services/test_run_orchestrator_feature_parity.py ADDED

	@@ -0,0 +1,279 @@

+"""Skeleton of the feature-parity tests between ``run_benchmark_via_service``
+and ``RunOrchestrator.execute(RunSpec)``.
+Phase B0 of the Option B migration.
+Role
+----
+This module lists the **7 features** that ``run_benchmark_via_service``
+exposes today and that ``RunOrchestrator`` must port during
+Phase B2.  Each test is documented precisely (what must be
+verified) and marked ``pytest.mark.skip`` until the corresponding
+feature is ported.
+As Phase B2 progresses, remove the skip marker from the corresponding
+test and implement its logic.  At the end of B2, all tests must be
+green: **Checkpoint C1** is then reached.
+Convention
+----------
+Each test compares:
+1. ``run_benchmark_via_service(feature_X=value)``: legacy path
+2. ``RunOrchestrator().execute(spec_with_feature_X=value)``: rewrite
+   path
+and checks that the resulting ``BenchmarkResult`` is numerically
+identical (modulo normalization of volatile fields).
+Mapping to the Option B plan
+----------------------------
+- B2.1 ``progress_callback``      → ``test_parity_progress_callback``
+- B2.2 ``cancel_event``           → ``test_parity_cancel_event``
+- B2.3 ``partial_dir``            → ``test_parity_partial_dir_*`` (3 tests)
+- B2.4 ``entity_extractor``       → ``test_parity_entity_extractor_ner``
+- B2.5 ``char_exclude`` +         → ``test_parity_char_exclude``,
+       ``normalization_profile``    ``test_parity_normalization_profile``
+- B2.6 ``profile`` (hooks)        → ``test_parity_profile_*`` (2 tests)
+- B2.7 ``output_json``            → ``test_parity_output_json_legacy_format``
+- All features combined           → ``test_parity_all_features_combined``
+"""
+from __future__ import annotations
+from pathlib import Path
+import pytest
+SKIP_REASON_PREFIX = "TODO Phase B2."
+# ──────────────────────────────────────────────────────────────────────
+# B2.1 — progress_callback
+# ──────────────────────────────────────────────────────────────────────
+@pytest.mark.skip(reason=f"{SKIP_REASON_PREFIX}1 — port progress_callback")
+def test_parity_progress_callback(tmp_path: Path) -> None:
+    """``progress_callback`` is called with ``(engine_name, doc_idx,
+    doc_id)`` in both paths.
+    Spec
+    ----
+    - Run a benchmark with 1 engine × 3 docs.
+    - The callback is invoked exactly 3 times (once per doc).
+    - The arguments match: ``engine_name`` = adapter name,
+      ``doc_idx`` = increasing global counter (0, 1, 2), ``doc_id``
+      = document ID.
+    - The counter is shared across threads behind a lock
+      (cf. ``_benchmark_execution.py:109-139``).
+    Port target
+    -----------
+    Extend ``RunOrchestrator._make_context_factory`` so that it
+    accepts a ``progress_callback`` and reproduces the pattern
+    (lock + ``doc_idx`` counter).
+    """
+# ──────────────────────────────────────────────────────────────────────
+# B2.2 — cancel_event
+# ──────────────────────────────────────────────────────────────────────
+@pytest.mark.skip(reason=f"{SKIP_REASON_PREFIX}2 — port cancel_event")
+def test_parity_cancel_event(tmp_path: Path) -> None:
+    """Calling ``threading.Event.set()`` stops the run in progress.
+    Spec
+    ----
+    - Run a benchmark with 1 engine × 10 docs.
+    - After ~2 docs have been processed, call ``cancel_event.set()``.
+    - The run must stop quickly (within a margin of < 1 s).
+    - The returned ``BenchmarkResult`` contains the first 2 docs
+      (or more, depending on timing) but not all 10.
+    Port target
+    -----------
+    Wrap ``CorpusRunner.run`` in ``RunOrchestrator`` so that it
+    injects the ``cancel_event`` into its kwargs (cf.
+    ``_benchmark_execution.py:142-149``).
+    """
+# ──────────────────────────────────────────────────────────────────────
+# B2.3 — partial_dir resume
+# ──────────────────────────────────────────────────────────────────────
+@pytest.mark.skip(reason=f"{SKIP_REASON_PREFIX}3 — port partial_dir resume")
+def test_parity_partial_dir_resume_fresh_start(tmp_path: Path) -> None:
+    """First run with no pre-existing partial file → same behavior
+    as a run without ``partial_dir``.
+    Spec
+    ----
+    - ``partial_dir`` = empty directory.
+    - Run the benchmark.
+    - At the end, the file
+      ``{partial_dir}/picarones_{corpus}_{engine}.partial.jsonl``
+      has been deleted (full success).
+    - The ``BenchmarkResult`` is identical to the run without ``partial_dir``.
+    """
+@pytest.mark.skip(reason=f"{SKIP_REASON_PREFIX}3 — port partial_dir resume")
+def test_parity_partial_dir_resume_after_crash(tmp_path: Path) -> None:
+    """Resume after a partial crash: 3 docs out of 5 already persisted →
+    only the 2 remaining ones are submitted to the runner.
+    Spec
+    ----
+    - Pre-write a partial JSONL with 3 valid ``DocumentResult`` entries.
+    - Run the benchmark on the 5-doc corpus.
+    - ``CorpusRunner.run`` is called on **2 docs only**
+      (check with a spy).
+    - The final ``BenchmarkResult`` aggregates all 5 docs (3 reused +
+      2 new).
+    """
+@pytest.mark.skip(reason=f"{SKIP_REASON_PREFIX}3 — port partial_dir resume")
+def test_parity_partial_dir_fingerprint_invalidates(tmp_path: Path) -> None:
+    """A mismatched fingerprint invalidates the partial (recompute from scratch).
+    Spec
+    ----
+    - Pre-write a partial with a different ``code_version``.
+    - Run the benchmark.
+    - The partial is ignored and all 5 docs are recomputed.
+    """
+# ──────────────────────────────────────────────────────────────────────
+# B2.4 — entity_extractor (NER attach)
+# ──────────────────────────────────────────────────────────────────────
+@pytest.mark.skip(reason=f"{SKIP_REASON_PREFIX}4 — port entity_extractor")
+def test_parity_entity_extractor_ner(tmp_path: Path) -> None:
+    """When an ``entity_extractor`` is provided, NER metrics are
+    attached to the ``BenchmarkResult``.
+    Spec
+    ----
+    - Corpus with ``EntitiesGT`` (at least 1 doc with the ENTITIES level).
+    - ``entity_extractor`` = mock returning fixed entities.
+    - The ``BenchmarkResult`` contains ``DocumentResult.ner_metrics``:
+      ``precision``, ``recall``, ``f1`` per entity type.
+    - The ``EngineReport.aggregated_ner`` aggregate is computed.
+    """
+# ──────────────────────────────────────────────────────────────────────
+# B2.5 — char_exclude + normalization_profile
+# ──────────────────────────────────────────────────────────────────────
+@pytest.mark.skip(reason=f"{SKIP_REASON_PREFIX}5 — port normalization propagation")
+def test_parity_char_exclude(tmp_path: Path) -> None:
+    """``char_exclude`` filters characters out before CER/WER is computed.
+    Spec
+    ----
+    - GT = ``"Bonjour!"``, OCR = ``"Bonjour."``.
+    - Without ``char_exclude``: CER = 1/8 = 0.125.
+    - With ``char_exclude="!."``: CER = 0.0 (the 2 filtered
+      characters are the only ones that differ).
+    """
+@pytest.mark.skip(reason=f"{SKIP_REASON_PREFIX}5 — port normalization propagation")
+def test_parity_normalization_profile(tmp_path: Path) -> None:
+    """``normalization_profile="caseless"`` makes the comparison case-insensitive.
+    Spec
+    ----
+    - GT = ``"Bonjour"``, OCR = ``"BONJOUR"``.
+    - Without a profile: CER = 6/7 ≈ 0.857 (every letter except the
+      initial ``B`` differs).
+    - With ``caseless``: CER = 0.0.
+    """
+# ──────────────────────────────────────────────────────────────────────
+# B2.6 — profile (hooks document-level / corpus aggregators)
+# ──────────────────────────────────────────────────────────────────────
+@pytest.mark.skip(reason=f"{SKIP_REASON_PREFIX}6 — port profile hooks")
+def test_parity_profile_validation(tmp_path: Path) -> None:
+    """``profile="unknown"`` raises ``ValueError`` BEFORE the run.
+    Spec
+    ----
+    - Same behavior as the 3 ``TestProfileValidation`` tests
+      in
+      ``tests/app/test_sprint_d2cdef_features.py``.
+    """
+@pytest.mark.skip(reason=f"{SKIP_REASON_PREFIX}6 — port profile hooks")
+def test_parity_profile_standard_runs_hooks(tmp_path: Path) -> None:
+    """``profile="standard"`` runs the document-level hooks
+    registered via ``@register_document_metric``.
+    Spec
+    ----
+    - Register a test hook ``@register_document_metric("standard")``
+      that returns ``{"hooked": True}``.
+    - Run the benchmark.
+    - ``DocumentResult.hook_values["hooked"] is True``.
+    """
+# ──────────────────────────────────────────────────────────────────────
+# B2.7 — output_json (legacy BenchmarkResult JSON)
+# ──────────────────────────────────────────────────────────────────────
+@pytest.mark.skip(reason=f"{SKIP_REASON_PREFIX}7 — port output_json legacy")
+def test_parity_output_json_legacy_format(tmp_path: Path) -> None:
+    """When ``output_json`` is provided, a JSON file in the
+    ``BenchmarkResult.as_dict()`` format is written in addition to the
+    4 native JSONL files of ``RunOrchestrator``.
+    Spec
+    ----
+    - Run ``RunOrchestrator().execute(spec_with_output_json)``.
+    - Check that ``output_json`` exists and contains JSON that can be
+      deserialized with ``BenchmarkResult.from_json_object``.
+    - Check that the 4 native JSONL files are written as well
+      (both outputs coexist).
+    """
+# ──────────────────────────────────────────────────────────────────────
+# Test global de feature parity — vérification croisée
+# ──────────────────────────────────────────────────────────────────────
+@pytest.mark.skip(reason=f"{SKIP_REASON_PREFIX}* — all features ported")
+def test_parity_all_features_combined(tmp_path: Path) -> None:
+    """Runs both paths with every feature enabled and checks the
+    numerical equality of the ``BenchmarkResult``.
+    Spec
+    ----
+    - Build a ``RunSpec`` with: ``profile="standard"``,
+      ``partial_dir=tmp_path/"partial"``, ``output_json=tmp_path/
+      "bm.json"``, ``char_exclude="!."``,
+      ``normalization_profile="caseless"``.
+    - Run ``run_benchmark_via_service`` with the same parameters.
+    - Run ``RunOrchestrator().execute(spec)``.
+    - Normalize both ``BenchmarkResult`` objects (cf.
+      ``test_migration_invariance.py:_normalize_for_snapshot``).
+    - Check ``a == b``.
+    This test is the **final gate of Checkpoint C1**.  Once it passes,
+    Phase B2 is complete and B3 (migration of the call sites) can
+    start.
+    """
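
The CER figures quoted in the B2.5 specs can be checked by hand with a small reference implementation. The sketch below is not the Picarones metric code: `levenshtein` and `cer` are hypothetical helpers, CER is assumed to be the character edit distance divided by the ground-truth length, `caseless` is modeled with `str.casefold()`, and the empty-ground-truth convention is an arbitrary choice made for the example.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free when equal
            ))
        prev = curr
    return prev[-1]


def cer(gt: str, ocr: str, *, char_exclude: str = "", caseless: bool = False) -> float:
    """Edit distance over ground-truth length, after optional filtering."""
    if char_exclude:
        drop = set(char_exclude)
        gt = "".join(c for c in gt if c not in drop)
        ocr = "".join(c for c in ocr if c not in drop)
    if caseless:
        gt, ocr = gt.casefold(), ocr.casefold()
    if not gt:
        # Assumed convention for an empty ground truth.
        return 0.0 if not ocr else 1.0
    return levenshtein(gt, ocr) / len(gt)
```

Under these assumptions, `cer("Bonjour!", "Bonjour.")` is 1/8, and `cer("Bonjour", "BONJOUR")` is 6/7 because the initial `B` is already uppercase in both strings; both drop to 0 once the corresponding option is enabled.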

tests/integration/snapshots/migration_invariance.json ADDED

	@@ -0,0 +1,470 @@

+{
+  "corpus": {
+    "document_count": 2,
+    "name": "invariance_corpus",
+    "source": null
+  },
+  "engine_reports": [
+    {
+      "aggregated_char_scores": {
+        "diacritic": {
+          "correctly_recognized": 0,
+          "score": 1.0,
+          "total_in_gt": 0
+        },
+        "ligature": {
+          "correctly_recognized": 0,
+          "per_ligature": {},
+          "score": 1.0,
+          "total_in_gt": 0
+        }
+      },
+      "aggregated_confusion": {
+        "matrix": {},
+        "total_deletions": 0,
+        "total_insertions": 0,
+        "total_substitutions": 1
+      },
+      "aggregated_hallucination": {
+        "anchor_score_mean": 0.5,
+        "anchor_score_min": 0.0,
+        "document_count": 2,
+        "hallucinating_doc_count": 1,
+        "hallucinating_doc_rate": 0.5,
+        "length_ratio_mean": 1.0,
+        "net_insertion_rate_mean": 0.25
+      },
+      "aggregated_line_metrics": {
+        "catastrophic_rate": {
+          "0.3": 0.0,
+          "0.5": 0.0,
+          "1.0": 0.0
+        },
+        "document_count": 2,
+        "gini_mean": 0.0,
+        "gini_stdev": 0.0,
+        "heatmap": [
+          0.0,
+          0.0,
+          0.0,
+          0.0,
+          0.0,
+          0.0,
+          0.0,
+          0.0,
+          0.0,
+          0.045455
+        ],
+        "mean_cer_mean": 0.045455,
+        "percentiles": {
+          "p50": 0.045455,
+          "p75": 0.045455,
+          "p90": 0.045455,
+          "p95": 0.045455,
+          "p99": 0.045455
+        }
+      },
+      "aggregated_metrics": {
+        "cer": {
+          "max": 0.090909,
+          "mean": 0.045455,
+          "median": 0.045455,
+          "min": 0.0,
+          "stdev": 0.064282
+        },
+        "cer_caseless": {
+          "max": 0.090909,
+          "mean": 0.045455,
+          "median": 0.045455,
+          "min": 0.0,
+          "stdev": 0.064282
+        },
+        "cer_diplomatic": {
+          "max": 0.090909,
+          "mean": 0.045455,
+          "median": 0.045455,
+          "min": 0.0,
+          "profile": "medieval_french",
+          "stdev": 0.064282
+        },
+        "cer_nfc": {
+          "max": 0.090909,
+          "mean": 0.045455,
+          "median": 0.045455,
+          "min": 0.0,
+          "stdev": 0.064282
+        },
+        "document_count": 2,
+        "failed_count": 0,
+        "mer": {
+          "max": 0.5,
+          "mean": 0.25,
+          "median": 0.25,
+          "min": 0.0,
+          "stdev": 0.353553
+        },
+        "wer": {
+          "max": 0.5,
+          "mean": 0.25,
+          "median": 0.25,
+          "min": 0.0,
+          "stdev": 0.353553
+        },
+        "wer_normalized": {
+          "max": 0.5,
+          "mean": 0.25,
+          "median": 0.25,
+          "min": 0.0,
+          "stdev": 0.353553
+        },
+        "wil": {
+          "max": 0.75,
+          "mean": 0.375,
+          "median": 0.375,
+          "min": 0.0,
+          "stdev": 0.53033
+        }
+      },
+      "aggregated_searchability": {
+        "max_distance": 2,
+        "missed_tokens_sample": [],
+        "n_docs": 2,
+        "n_gt_tokens": 5,
+        "n_searchable": 5,
+        "recall": 1.0
+      },
+      "aggregated_structure": {
+        "document_count": 2,
+        "mean_line_accuracy": 1.0,
+        "mean_line_fragmentation_rate": 0.0,
+        "mean_line_fusion_rate": 0.0,
+        "mean_paragraph_conservation": 1.0,
+        "mean_reading_order_score": 0.75
+      },
+      "aggregated_taxonomy": {
+        "class_distribution": {
+          "abbreviation_error": 0.0,
+          "case_error": 0.0,
+          "diacritic_error": 0.0,
+          "hapax": 1.0,
+          "lacuna": 0.0,
+          "ligature_error": 0.0,
+          "oov_character": 0.0,
+          "segmentation_error": 0.0,
+          "visual_confusion": 0.0
+        },
+        "counts": {
+          "abbreviation_error": 0,
+          "case_error": 0,
+          "diacritic_error": 0,
+          "hapax": 1,
+          "lacuna": 0,
+          "ligature_error": 0,
+          "oov_character": 0,
+          "segmentation_error": 0,
+          "visual_confusion": 0
+        },
+        "total_errors": 1
+      },
+      "document_results": [
+        {
+          "char_scores": {
+            "diacritic": {
+              "correctly_recognized": 0,
+              "per_diacritic": {},
+              "score": 1.0,
+              "total_in_gt": 0
+            },
+            "ligature": {
+              "correctly_recognized": 0,
+              "per_ligature": {},
+              "score": 1.0,
+              "total_in_gt": 0
+            }
+          },
+          "confusion_matrix": {
+            "matrix": {},
+            "total_deletions": 0,
+            "total_insertions": 0,
+            "total_substitutions": 0
+          },
+          "doc_id": "doc1",
+          "duration_seconds": 0.0,
+          "engine_error": null,
+          "ground_truth": "Bonjour le monde",
+          "hallucination_metrics": {
+            "anchor_score": 1.0,
+            "anchor_threshold_used": 0.5,
+            "gt_word_count": 3,
+            "hallucinated_blocks": [],
+            "hyp_word_count": 3,
+            "is_hallucinating": false,
+            "length_ratio": 1.0,
+            "length_ratio_threshold_used": 1.2,
+            "net_inserted_words": 0,
+            "net_insertion_rate": 0.0,
+            "ngram_size_used": 3
+          },
+          "hypothesis": "Bonjour le monde",
+          "image_path": "FIXTURES/doc1.png",
+          "line_metrics": {
+            "catastrophic_rate": {
+              "0.3": 0.0,
+              "0.5": 0.0,
+              "1.0": 0.0
+            },
+            "cer_per_line": [
+              0.0
+            ],
+            "gini": 0.0,
+            "heatmap": [
+              0.0,
+              0.0,
+              0.0,
+              0.0,
+              0.0,
+              0.0,
+              0.0,
+              0.0,
+              0.0,
+              0.0
+            ],
+            "line_count": 1,
+            "mean_cer": 0.0,
+            "percentiles": {
+              "p50": 0.0,
+              "p75": 0.0,
+              "p90": 0.0,
+              "p95": 0.0,
+              "p99": 0.0
+            }
+          },
+          "metrics": {
+            "cer": 0.0,
+            "cer_caseless": 0.0,
+            "cer_diplomatic": 0.0,
+            "cer_nfc": 0.0,
+            "diplomatic_profile_name": "medieval_french",
+            "error": null,
+            "hypothesis_length": 16,
+            "mer": 0.0,
+            "reference_length": 16,
+            "wer": 0.0,
+            "wer_normalized": 0.0,
+            "wil": 0.0
+          },
+          "searchability_metrics": {
+            "max_distance": 2,
+            "missed_tokens": [],
+            "n_gt_tokens": 3,
+            "n_searchable": 3,
+            "recall": 1.0
+          },
+          "structure": {
+            "gt_line_count": 1,
+            "line_accuracy": 1.0,
+            "line_fragmentation_count": 0,
+            "line_fragmentation_rate": 0.0,
+            "line_fusion_count": 0,
+            "line_fusion_rate": 0.0,
+            "ocr_line_count": 1,
+            "paragraph_conservation_score": 1.0,
+            "reading_order_score": 1.0
+          },
+          "taxonomy": {
+            "class_distribution": {},
+            "counts": {
+              "abbreviation_error": 0,
+              "case_error": 0,
+              "diacritic_error": 0,
+              "hapax": 0,
+              "lacuna": 0,
+              "ligature_error": 0,
+              "oov_character": 0,
+              "segmentation_error": 0,
+              "visual_confusion": 0
+            },
+            "examples": {
+              "abbreviation_error": [],
+              "case_error": [],
+              "diacritic_error": [],
+              "hapax": [],
+              "lacuna": [],
+              "ligature_error": [],
+              "oov_character": [],
+              "segmentation_error": [],
+              "visual_confusion": []
+            },
+            "total_errors": 0
+          }
+        },
+        {
+          "char_scores": {
+            "diacritic": {
+              "correctly_recognized": 0,
+              "per_diacritic": {},
+              "score": 1.0,
+              "total_in_gt": 0
+            },
+            "ligature": {
+              "correctly_recognized": 0,
+              "per_ligature": {},
+              "score": 1.0,
+              "total_in_gt": 0
+            }
+          },
+          "confusion_matrix": {
+            "matrix": {
+              "l": {
+                "i": 1
+              }
+            },
+            "total_deletions": 0,
+            "total_insertions": 0,
+            "total_substitutions": 1
+          },
+          "doc_id": "doc2",
+          "duration_seconds": 0.0,
+          "engine_error": null,
+          "ground_truth": "Hello world",
+          "hallucination_metrics": {
+            "anchor_score": 0.0,
+            "anchor_threshold_used": 0.5,
+            "gt_word_count": 2,
+            "hallucinated_blocks": [],
+            "hyp_word_count": 2,
+            "is_hallucinating": true,
+            "length_ratio": 1.0,
+            "length_ratio_threshold_used": 1.2,
+            "net_inserted_words": 1,
+            "net_insertion_rate": 0.5,
+            "ngram_size_used": 3
+          },
+          "hypothesis": "Helio world",
+          "image_path": "FIXTURES/doc2.png",
+          "line_metrics": {
+            "catastrophic_rate": {
+              "0.3": 0.0,
+              "0.5": 0.0,
+              "1.0": 0.0
+            },
+            "cer_per_line": [
+              0.090909
+            ],
+            "gini": 0.0,
+            "heatmap": [
+              0.0,
+              0.0,
+              0.0,
+              0.0,
+              0.0,
+              0.0,
+              0.0,
+              0.0,
+              0.0,
+              0.090909
+            ],
+            "line_count": 1,
+            "mean_cer": 0.090909,
+            "percentiles": {
+              "p50": 0.090909,
+              "p75": 0.090909,
+              "p90": 0.090909,
+              "p95": 0.090909,
+              "p99": 0.090909
+            }
+          },
+          "metrics": {
+            "cer": 0.090909,
+            "cer_caseless": 0.090909,
+            "cer_diplomatic": 0.090909,
+            "cer_nfc": 0.090909,
+            "diplomatic_profile_name": "medieval_french",
+            "error": null,
+            "hypothesis_length": 11,
+            "mer": 0.5,
+            "reference_length": 11,
+            "wer": 0.5,
+            "wer_normalized": 0.5,
+            "wil": 0.75
+          },
+          "searchability_metrics": {
+            "max_distance": 2,
+            "missed_tokens": [],
+            "n_gt_tokens": 2,
+            "n_searchable": 2,
+            "recall": 1.0
+          },
+          "structure": {
+            "gt_line_count": 1,
+            "line_accuracy": 1.0,
+            "line_fragmentation_count": 0,
+            "line_fragmentation_rate": 0.0,
+            "line_fusion_count": 0,
+            "line_fusion_rate": 0.0,
+            "ocr_line_count": 1,
+            "paragraph_conservation_score": 1.0,
+            "reading_order_score": 0.5
+          },
+          "taxonomy": {
+            "class_distribution": {
+              "abbreviation_error": 0.0,
+              "case_error": 0.0,
+              "diacritic_error": 0.0,
+              "hapax": 1.0,
+              "lacuna": 0.0,
+              "ligature_error": 0.0,
+              "oov_character": 0.0,
+              "segmentation_error": 0.0,
+              "visual_confusion": 0.0
+            },
+            "counts": {
+              "abbreviation_error": 0,
+              "case_error": 0,
+              "diacritic_error": 0,
+              "hapax": 1,
+              "lacuna": 0,
+              "ligature_error": 0,
+              "oov_character": 0,
+              "segmentation_error": 0,
+              "visual_confusion": 0
+            },
+            "examples": {
+              "abbreviation_error": [],
+              "case_error": [],
+              "diacritic_error": [],
+              "hapax": [
+                {
+                  "gt": "Hello",
+                  "ocr": "Helio"
+                }
+              ],
+              "lacuna": [],
+              "ligature_error": [],
+              "oov_character": [],
+              "segmentation_error": [],
+              "visual_confusion": []
+            },
+            "total_errors": 1
+          }
+        }
+      ],
+      "engine_config": {},
+      "engine_name": "precomputed_invariance",
+      "engine_version": "PINNED"
+    }
+  ],
+  "metadata": {},
+  "picarones_version": "PINNED",
+  "ranking": [
+    {
+      "documents": 2,
+      "engine": "precomputed_invariance",
+      "failed": 0,
+      "mean_cer": 0.045455,
+      "mean_wer": 0.25,
+      "median_cer": 0.045455
+    }
+  ],
+  "run_date": "PINNED"
+}
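
The `ranking` figures in this snapshot can be checked by hand against the two fixture documents defined in `test_migration_invariance.py` ("Bonjour le monde" read perfectly, "Hello world" read as "Helio world"). The sketch below is a minimal reconstruction, assuming CER is the character-level Levenshtein distance divided by the reference length and WER is the same ratio over whitespace-separated words. The helpers `levenshtein` and `error_rate` are illustrative names, not Picarones APIs, and the real implementation may differ in detail.

```python
from statistics import mean, median


def levenshtein(ref, hyp):
    """Edit distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def error_rate(ref, hyp):
    """Edit distance normalized by the reference length (assumed definition)."""
    return levenshtein(ref, hyp) / len(ref)


# (ground truth, OCR output) pairs taken from the test fixtures.
pairs = [
    ("Bonjour le monde", "Bonjour le monde"),
    ("Hello world", "Helio world"),
]
cers = [error_rate(gt, ocr) for gt, ocr in pairs]
wers = [error_rate(gt.split(), ocr.split()) for gt, ocr in pairs]

summary = {
    "mean_cer": round(mean(cers), 6),
    "median_cer": round(median(cers), 6),
    "mean_wer": round(mean(wers), 6),
}
print(summary)
# → {'mean_cer': 0.045455, 'median_cer': 0.045455, 'mean_wer': 0.25}
```

The second document contributes one substitution over 11 characters (1/11 ≈ 0.090909) and one wrong word out of two (0.5). Averaging with the error-free first document gives the `mean_cer`, `median_cer` and `mean_wer` values recorded above.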

tests/integration/test_migration_invariance.py ADDED

	@@ -0,0 +1,289 @@

+"""Run-to-run invariance test for the Option B migration.
+
+Phase B0 of the migration from ``run_benchmark_via_service`` to
+``RunOrchestrator.execute(RunSpec)``.
+
+Role
+----
+This test runs a **deterministic** benchmark (a 2-document mini corpus +
+``PrecomputedTextAdapter``) through the current facade
+``run_benchmark_via_service`` and compares its normalized
+``BenchmarkResult`` to a JSON snapshot stored in
+``tests/integration/snapshots/migration_invariance.json``.
+
+Why
+---
+During the migration to ``RunOrchestrator``, 7 features are ported
+(``progress_callback``, ``cancel_event``, ``partial_dir``,
+``entity_extractor``, ``char_exclude`` + ``normalization_profile``,
+``profile`` hooks, legacy ``output_json``).  Each port must preserve
+**exactly** the numerical behavior of the existing path.  This test
+is the safety net: if an internal refactoring changes the result
+(CER, aggregation, engine order, JSON structure), the snapshot
+diverges and CI fails.
+
+The test uses **no** external dependency (no Tesseract, no network).
+The ``PrecomputedTextAdapter`` reads a text file written to disk, so
+its output is fully deterministic.
+
+Updating the snapshot
+---------------------
+If an **intentional** change alters the result (e.g. a new field in
+``BenchmarkResult``), regenerate the snapshot:
+
+    PICARONES_UPDATE_SNAPSHOT=1 python -m pytest \
+        tests/integration/test_migration_invariance.py
+
+Then inspect the git diff of the snapshot before committing.
+
+Normalization
+-------------
+Volatile fields are neutralized before comparison:
+
+- ``picarones_version`` → ``"PINNED"``
+- ``run_date`` → ``"PINNED"``
+- ``engine_version`` (in each engine report) → ``"PINNED"``
+- ``corpus.source`` → ``"FIXTURES/corpus"``
+- ``image_path`` → ``"FIXTURES/docN.png"``
+- ``duration_seconds`` → ``0.0``
+- Any other string value containing ``tmp_path`` → that prefix is
+  replaced by ``"FIXTURES"``
+- Other floats are rounded to 6 decimal places
+
+This is meant to keep the snapshot stable across runs and across
+operating systems.
+"""
+from __future__ import annotations
+import json
+import os
+import re
+from pathlib import Path
+from typing import Any
+import pytest
+from picarones.adapters.ocr.precomputed import PrecomputedTextAdapter
+from picarones.app.services.benchmark_runner import run_benchmark_via_service
+from picarones.evaluation.corpus import Corpus, Document
+SNAPSHOT_PATH = (
+    Path(__file__).parent / "snapshots" / "migration_invariance.json"
+)
+# ──────────────────────────────────────────────────────────────────────
+# Deterministic fixtures
+# ──────────────────────────────────────────────────────────────────────
+def _make_invariance_corpus(tmp_path: Path) -> Corpus:
+    """Mini corpus of 2 documents with GT + precomputed text.
+
+    For doc2, the precomputed text differs slightly from the GT so that
+    the CER/WER metrics are non-trivial (and therefore more
+    discriminating in the snapshot); doc1 is an exact match.
+    """
+    documents: list[Document] = []
+    # Doc 1: GT = "Bonjour le monde", OCR = "Bonjour le monde" → CER 0.0
+    doc1_img = tmp_path / "doc1.png"
+    doc1_img.write_bytes(b"\x89PNG\r\n\x1a\n")  # minimal stub: PNG signature only
+    doc1_ocr = tmp_path / "doc1.invariance.txt"
+    doc1_ocr.write_text("Bonjour le monde", encoding="utf-8")
+    documents.append(Document(
+        image_path=doc1_img,
+        ground_truth="Bonjour le monde",
+        doc_id="doc1",
+    ))
+    # Doc 2: GT = "Hello world", OCR = "Helio world" → non-zero CER
+    doc2_img = tmp_path / "doc2.png"
+    doc2_img.write_bytes(b"\x89PNG\r\n\x1a\n")
+    doc2_ocr = tmp_path / "doc2.invariance.txt"
+    doc2_ocr.write_text("Helio world", encoding="utf-8")
+    documents.append(Document(
+        image_path=doc2_img,
+        ground_truth="Hello world",
+        doc_id="doc2",
+    ))
+    return Corpus(name="invariance_corpus", documents=documents)
+def _make_invariance_engine() -> PrecomputedTextAdapter:
+    """``PrecomputedTextAdapter`` that reads ``<stem>.invariance.txt``."""
+    return PrecomputedTextAdapter(source_label="invariance")
+# ──────────────────────────────────────────────────────────────────────
+# Snapshot normalization
+# ──────────────────────────────────────────────────────────────────────
+def _normalize_for_snapshot(data: Any, tmp_path: Path) -> Any:
+    """Recursively normalize the volatile fields of the ``BenchmarkResult``.
+
+    Replaces ``tmp_path`` with ``"FIXTURES"`` in every string value.
+    Neutralizes the explicitly volatile fields (``duration_seconds``,
+    ``run_date``, ``picarones_version``, ``engine_version``,
+    ``code_version``) and rounds the remaining floats to 6 decimal places.
+    """
+    tmp_str = str(tmp_path)
+    # Pattern matching the tmp_path prefix (for absolute paths that
+    # appear as string values rather than as keys).
+    tmp_re = re.compile(re.escape(tmp_str))
+    def _normalize(value: Any, *, key: str | None = None) -> Any:
+        if isinstance(value, dict):
+            return {k: _normalize(v, key=k) for k, v in value.items()}
+        if isinstance(value, list):
+            return [_normalize(item) for item in value]
+        if isinstance(value, str):
+            return tmp_re.sub("FIXTURES", value)
+        if isinstance(value, float):
+            # Neutralize durations (they vary from one run to the next).
+            if key == "duration_seconds":
+                return 0.0
+            # Keep other floats at a reasonable precision to absorb
+            # minimal floating-point noise.
+            return round(value, 6)
+        return value
+    normalized = _normalize(data)
+    # Root-level volatile fields, neutralized in post-processing
+    # because their value does not contain ``tmp_path``.
+    if isinstance(normalized, dict):
+        for volatile_key in ("picarones_version", "run_date"):
+            if volatile_key in normalized:
+                normalized[volatile_key] = "PINNED"
+        # engine_version may appear in each engine_report.
+        for report in normalized.get("engine_reports", []):
+            if "engine_version" in report:
+                report["engine_version"] = "PINNED"
+            # pipeline_info sometimes carries paths or metadata.
+            pipeline_info = report.get("pipeline_info")
+            if isinstance(pipeline_info, dict):
+                if "code_version" in pipeline_info:
+                    pipeline_info["code_version"] = "PINNED"
+    return normalized
+# ──────────────────────────────────────────────────────────────────────
+# Snapshot comparison
+# ──────────────────────────────────────────────────────────────────────
+def _load_snapshot() -> dict | None:
+    if not SNAPSHOT_PATH.exists():
+        return None
+    return json.loads(SNAPSHOT_PATH.read_text(encoding="utf-8"))
+def _write_snapshot(data: dict) -> None:
+    SNAPSHOT_PATH.parent.mkdir(parents=True, exist_ok=True)
+    SNAPSHOT_PATH.write_text(
+        json.dumps(data, ensure_ascii=False, indent=2, sort_keys=True),
+        encoding="utf-8",
+    )
+def _should_update_snapshot() -> bool:
+    return os.environ.get("PICARONES_UPDATE_SNAPSHOT") == "1"
+# ──────────────────────────────────────────────────────────────────────
+# Main test
+# ──────────────────────────────────────────────────────────────────────
+def test_run_benchmark_via_service_invariance(tmp_path: Path) -> None:
+    """Invariance snapshot of the current behavior.
+
+    This test is the safety net for the Option B migration.  It must
+    stay green at every step of the work (B1, B2, B3, B4, ...) as long
+    as ``run_benchmark_via_service`` is the public facade.
+
+    Once the migration is complete and ``run_benchmark_via_service``
+    is removed (Phase B8), this test will be retired or migrated to
+    ``RunOrchestrator.execute()``.
+    """
+    corpus = _make_invariance_corpus(tmp_path)
+    engine = _make_invariance_engine()
+    benchmark_result = run_benchmark_via_service(
+        corpus=corpus,
+        engines=[engine],
+        code_version="invariance-test-1.0.0",
+    )
+    actual_normalized = _normalize_for_snapshot(
+        benchmark_result.as_dict(), tmp_path,
+    )
+    snapshot = _load_snapshot()
+    if snapshot is None or _should_update_snapshot():
+        _write_snapshot(actual_normalized)
+        if snapshot is None:
+            pytest.skip(
+                # Absolute path: relative_to(Path.cwd()) raises ValueError
+                # when pytest is launched from outside the repository.
+                f"Snapshot created for the first time at {SNAPSHOT_PATH}. "
+                "Check its contents, then re-run the test."
+            )
+        else:
+            # Explicit update mode: the snapshot has been written and the
+            # test passes without further verification.  The operator is
+            # responsible for inspecting the git diff.
+            return
+    assert actual_normalized == snapshot, (
+        "BenchmarkResult diverges from the invariance snapshot.\n"
+        f"Snapshot: {SNAPSHOT_PATH}\n"
+        "If the divergence is intentional, regenerate with:\n"
+        "    PICARONES_UPDATE_SNAPSHOT=1 python -m pytest "
+        # Absolute path: relative_to(Path.cwd()) could raise ValueError
+        # here and mask the real assertion failure.
+        f"{Path(__file__)}\n"
+        "and inspect the git diff of the snapshot before committing."
+    )
+# ──────────────────────────────────────────────────────────────────────
+# Auxiliary test: checks that the normalization itself is stable
+# ──────────────────────────────────────────────────────────────────────
+def test_normalization_is_idempotent(tmp_path: Path) -> None:
+    """Normalizing an already-normalized dict leaves it unchanged.
+
+    Guarantees that the normalization can be re-applied without drift.
+    Illustrative test of the snapshot mechanics.
+    """
+    sample = {
+        "picarones_version": "2.0.0",
+        "run_date": "2026-05-14T12:00:00Z",
+        "corpus": {"source": str(tmp_path / "corpus.zip")},
+        "engine_reports": [
+            {
+                "engine_version": "1.2.3",
+                "document_results": [
+                    {
+                        "image_path": str(tmp_path / "doc1.png"),
+                        "duration_seconds": 0.123456,
+                        "metrics": {"cer": 0.05},
+                    },
+                ],
+            },
+        ],
+    }
+    once = _normalize_for_snapshot(sample, tmp_path)
+    twice = _normalize_for_snapshot(once, tmp_path)
+    assert once == twice
+    assert once["picarones_version"] == "PINNED"
+    assert once["run_date"] == "PINNED"
+    assert once["engine_reports"][0]["engine_version"] == "PINNED"
+    assert once["engine_reports"][0]["document_results"][0]["duration_seconds"] == 0.0
+    assert "FIXTURES" in once["corpus"]["source"]
+    assert "FIXTURES" in once["engine_reports"][0]["document_results"][0]["image_path"]
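
The create / compare / update cycle used by the main test can be seen in isolation in the standalone sketch below. It relies only on the standard library. The function `check_snapshot` and its `update` flag are hypothetical stand-ins for the `_load_snapshot`, `_write_snapshot` and `PICARONES_UPDATE_SNAPSHOT` logic in the test file, not part of the commit.

```python
import json
import tempfile
from pathlib import Path


def check_snapshot(actual: dict, path: Path, *, update: bool = False) -> str:
    """Compare ``actual`` to the JSON snapshot stored at ``path``.

    Returns "created", "updated" or "matched"; raises AssertionError
    when the data diverges from an existing snapshot.
    """
    exists = path.exists()
    if not exists or update:
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(
            json.dumps(actual, ensure_ascii=False, indent=2, sort_keys=True),
            encoding="utf-8",
        )
        return "updated" if exists else "created"
    expected = json.loads(path.read_text(encoding="utf-8"))
    assert actual == expected, f"result diverges from snapshot {path}"
    return "matched"


with tempfile.TemporaryDirectory() as tmp:
    snap = Path(tmp) / "snapshots" / "demo.json"
    result = {"ranking": [{"engine": "demo", "mean_cer": 0.045455}]}
    outcomes = [
        check_snapshot(result, snap),                        # first run writes the file
        check_snapshot(result, snap),                        # identical data matches
        check_snapshot({"ranking": []}, snap, update=True),  # intentional change
    ]

print(outcomes)
# → ['created', 'matched', 'updated']
```

Writing with `sort_keys=True` keeps the key order of the file stable, which is what makes the git diff of a regenerated snapshot readable during review.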