test(migration): Phase B0, foundations for the Option B migration (RunOrchestrator)
Prepares the migration from `run_benchmark_via_service` to
`RunOrchestrator.execute(RunSpec)`. Three deliverables serve as safety
nets for Phases B1-B8:
- **B0.1 invariance test** (`tests/integration/test_migration_invariance.py`):
runs a deterministic benchmark (PrecomputedTextAdapter × 2 docs) through
the current facade, serializes the `BenchmarkResult`, and compares it to a
JSON snapshot. Volatile fields (paths, timestamps, durations) are
normalized. The snapshot is stored in `tests/integration/snapshots/`.
Update mode is enabled with `PICARONES_UPDATE_SNAPSHOT=1`.
- **B0.2 feature-parity skeleton** (`tests/app/services/test_run_orchestrator_feature_parity.py`):
12 tests marked `@pytest.mark.skip` (reason `TODO Phase B2.X`) that document
precisely the 7 features to port (progress_callback, cancel_event, partial_dir,
entity_extractor, char_exclude+normalization_profile, profile hooks,
output_json legacy). Each test is un-skipped as Phase B2 progresses;
together they gate checkpoint C1.
- **B0.3 test inventory** (`docs/migration/option_b_test_inventory.md`):
classifies the 25 test files that either reference
`run_benchmark_via_service` (category A, 10 files) or consume
`BenchmarkResult` without calling the runner (category B, 16 files).
No file patches the runner via monkeypatch, which simplifies the B4 migration.
Tests: 60 passed, 12 skipped (B2 placeholders), 0 failed.
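
The snapshot check described in B0.1 can be pictured with the minimal sketch below. It is not the code of `test_migration_invariance.py`: the helper names `normalize_volatile` and `check_snapshot` and the list of volatile keys are assumptions made for illustration. Only the `PICARONES_UPDATE_SNAPSHOT=1` switch comes from the commit message.

```python
import json
import os
from pathlib import Path
from typing import Any

# Assumed volatile fields and the fixed values that replace them.
VOLATILE_DEFAULTS = {
    "duration_seconds": 0.0,
    "started_at": None,
    "completed_at": None,
    "output_dir": None,
}


def normalize_volatile(obj: Any) -> Any:
    """Recursively replace volatile values so that two runs compare equal."""
    if isinstance(obj, dict):
        return {
            key: VOLATILE_DEFAULTS[key] if key in VOLATILE_DEFAULTS else normalize_volatile(value)
            for key, value in obj.items()
        }
    if isinstance(obj, list):
        return [normalize_volatile(item) for item in obj]
    return obj


def check_snapshot(result: dict, snapshot_path: Path) -> None:
    """Compare a serialized result to the stored snapshot, or rewrite it in update mode."""
    normalized = normalize_volatile(result)
    if os.environ.get("PICARONES_UPDATE_SNAPSHOT") == "1":
        # Update mode: overwrite the snapshot instead of asserting.
        snapshot_path.write_text(
            json.dumps(normalized, indent=2, sort_keys=True, ensure_ascii=False),
            encoding="utf-8",
        )
        return
    expected = json.loads(snapshot_path.read_text(encoding="utf-8"))
    assert normalized == expected, "serialized result drifted from the snapshot"
```

Sorting keys on write keeps the snapshot diff-friendly when it is regenerated.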
@@ -0,0 +1,130 @@
+# Inventory of tests to migrate: Option B (RunOrchestrator)
+
+Reference document for Phase B4 of the migration from
+`run_benchmark_via_service` to `RunOrchestrator.execute(RunSpec)`.
+
+Established in Phase B0 by an exhaustive scan of `tests/`:
+- `grep -rln "run_benchmark_via_service" tests/` → 10 files (category A)
+- `grep -rln "BenchmarkResult" tests/` → 20 files, 10 of which overlap with A and 10 of which are category B only
+
+No file **patches** `run_benchmark_via_service` via `monkeypatch.setattr`
+or `mock.patch` (verified in Phase B0). Every call site is a direct
+call. This simplifies the migration strategy: there are no indirect targets
+to rewire.
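
The two `grep` passes behind categories A and B can be approximated in Python as sketched below. The function name `classify_test_files` is hypothetical, and unlike `grep -rln` this sketch only scans `*.py` files, so counts may differ from the inventory.

```python
from pathlib import Path
from typing import Dict, List


def classify_test_files(tests_root: Path) -> Dict[str, List[str]]:
    """Split test files into category A (reference the runner) and B (result only)."""
    categories: Dict[str, List[str]] = {"A": [], "B": []}
    for path in sorted(tests_root.rglob("*.py")):
        text = path.read_text(encoding="utf-8", errors="ignore")
        relative = path.relative_to(tests_root).as_posix()
        if "run_benchmark_via_service" in text:
            # Category A wins even when the file also mentions BenchmarkResult.
            categories["A"].append(relative)
        elif "BenchmarkResult" in text:
            categories["B"].append(relative)
    return categories
```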
+
+---
+
+## Category A: tests that call `run_benchmark_via_service` directly
+
+These tests must be migrated to `RunOrchestrator.execute(spec)` or to the
+`build_run_spec_from_engines()` helper (Phase B3.1) for cases that start
+from in-memory adapter instances.
+
+| # | File | Size | Occurrences | B4 priority | Notes |
+|---|---|---|---|---|---|
+| A1 | `tests/app/test_sprint_d2b_partial_dir_resume.py` | 506 LOC | 12 | **High** | Tests the `partial_dir` resume. Must validate the port to `_orchestrator_partial.py` (Phase B2.3). Core of the non-regression suite. |
+| A2 | `tests/app/test_sprint_d2cdef_features.py` | 473 LOC | 14 | **High** | Tests the 7 extended parameters (`profile`, `entity_extractor`, `cancel_event`, etc.). Must validate each feature ported in Phase B2. |
+| A3 | `tests/web/test_sprint6_web_interface.py` | 1392 LOC | 10 | **High** | Web integration test. Will confirm that migrating `run_benchmark_thread_v2` breaks nothing on the UI side. |
+| A4 | `tests/app/test_character_analysis_in_runner.py` | 246 LOC | 12 | Medium | Tests per-engine character analysis. Mechanical conversion. |
+| A5 | `tests/app/test_sprint_h2b_canonical_in_runner.py` | 191 LOC | 9 | Medium | Tests extraction of the `CANONICAL_DOCUMENT`. To be adapted to the new ViewExecutor. |
+| A6 | `tests/evaluation/test_public_api.py` | — | 7 | Medium | Public API. Will include a presence test for `RunOrchestrator`. |
+| A7 | `tests/evaluation/metrics/test_sprint12_nouvelles_fonctionnalites.py` | 288 LOC | 4 | Low | Mechanical conversion. |
+| A8 | `tests/evaluation/metrics/test_sprint_a14_s1_normalization_propagation.py` | — | 2 | Low | Checks `normalization_profile`; validates Phase B2.5 (propagation via `EvaluationView`). |
+| A9 | `tests/evaluation/test_metric_hooks.py` | — | 1 | Low | Trivial. One-line conversion. |
+| A10 | `tests/architecture/test_file_budgets.py` | — | (reference only) | Low | Budgets of the `_benchmark_*.py` modules to update after Phase B2/B7. |
+
+**Category A total**: 10 files, ~3500 LOC, ~71 occurrences.
+
+---
+
+## Category B: tests that consume `BenchmarkResult` without calling the runner
+
+These tests **require no change** as long as the
+`RunResult → BenchmarkResult` converter (`_benchmark_converter.py`) stays in place
+after the migration. They consume either a `BenchmarkResult` built
+by hand (fixture) or a `BenchmarkResult` produced by a runner call
+made in a shared fixture.
+
+| # | File | Role |
+|---|---|---|
+| B1 | `tests/golden/test_s5_benchmark_result_json_stable.py` | Stable JSON round-trip. Unchanged as long as `BenchmarkResult.from_json_object`/`to_dict` remain. |
+| B2 | `tests/reports/test_report.py` | HTML rendering. Unchanged as long as `ReportGenerator(result)` accepts `BenchmarkResult`. |
+| B3 | `tests/reports/test_extra_metrics.py` | Additional metrics attached to the report. |
+| B4 | `tests/reports/test_sprint72_worst_lines.py` | Worst-N lines (consumes a non-compacted `BenchmarkResult`). |
+| B5 | `tests/evaluation/metrics/test_results.py` | `MetricsResult` / `aggregate_metrics` API. |
+| B6 | `tests/evaluation/metrics/test_sprint36_ensemble_narrative.py` | Narrative engine. Reads the `benchmark_data` dict. |
+| B7 | `tests/evaluation/metrics/test_sprint44_median_default.py` | Median/Pareto. |
+| B8 | `tests/evaluation/metrics/test_sprint45_stratification.py` | Corpus stratification. |
+| B9 | `tests/evaluation/test_sprint14_robust_filtering.py` | Robustness filter. |
+| B10 | `tests/adapters/corpus/test_sprint8_escriptorium_gallica.py` | eScriptorium / Gallica importer. |
+| B11 | `tests/integration/test_importer_fallback_wiring.py` | Importer fallback. Integration test. |
+| B12 | `tests/integration/test_s5_disk_full_simulation.py` | Disk full. |
+| B13 | `tests/security/test_phase1_post_rewrite_wiring.py` | Post-rewrite security. |
+| B14 | `tests/security/test_s1_xss_in_reports.py` | XSS in reports. |
+| B15 | `tests/test_minimal_install.py` | Minimal install (smoke test). |
+| B16 | `tests/web/test_sprint28_ux_save_compare.py` | Web save/compare UX. |
+
+**Category B total**: 16 files, **NO change required** for Option B.
+
+---
+
+## Category C: tests that already use `RunOrchestrator`
+
+For information: these tests serve as a model/template for the migration.
+
+| # | File | Role |
+|---|---|---|
+| C1 | `tests/app/test_run_orchestrator.py` | Complete unit tests of `RunOrchestrator`. Reference model. |
+| C2 | `tests/integration/test_runner_profiles.py` | Hook profiles via `RunOrchestrator`. |
+| C3 | `tests/integration/test_html_views.py` | HTML views via `RunOrchestrator`. |
+| C4 | `tests/integration/test_narrative_and_views.py` | Narrative engine + views. |
+| C5 | `tests/integration/test_engines_and_llm.py` | Engines + LLM via `RunOrchestrator`. |
+
+---
+
+## Overall strategy for Phase B4
+
+1. **Step 1** (0.5 d): shared fixtures in `tests/conftest.py`:
+   - `make_minimal_corpus_zip()`: deterministic 2-doc corpus zip
+   - `run_orchestrator_factory(tmp_path)`: `RunOrchestrator(workspace)`
+   - `build_minimal_run_spec(corpus_zip, output_dir, adapters)`: helper
+   - `assert_benchmark_results_equal(a, b, *, ignore=("started_at", "completed_at"))`
+
+2. **Step 2** (2 d): migrate category A, starting with high priority:
+   - A1 (partial resume) + A2 (extended features) + A3 (web) → 1 d
+   - A4, A5, A6 → 0.5 d
+   - A7, A8, A9 → 0.5 d (trivial)
+
+3. **Step 3** (0.5 d): update `tests/architecture/test_file_budgets.py` (A10):
+   - Mark the `_benchmark_*.py` modules as deprecated
+   - Raise the budget of `run_orchestrator.py` (~+300 LOC after Phase B2)
+
+4. **Step 4** (1 d): full suite run + adjustments:
+   - `pytest tests/ -q --tb=short`
+   - The invariance snapshot (`test_migration_invariance.py`) must stay green
+   - Feature parity (`test_run_orchestrator_feature_parity.py`): all green
+
+**Total estimate for Phase B4**: 3-4 days, in line with the plan.
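
One possible shape for the `assert_benchmark_results_equal` helper named in Step 1 is sketched below. Only the signature comes from the inventory. The body is an assumption: it works on plain dicts (for example the output of a `to_dict`-style serialization), and `first_difference` is a hypothetical companion that reports where two results diverge.

```python
from typing import Any, Iterable, Optional


def first_difference(a: Any, b: Any, ignore: frozenset, path: str = "$") -> Optional[str]:
    """Return a description of the first mismatch, or None if a and b agree."""
    if isinstance(a, dict) and isinstance(b, dict):
        for key in sorted(set(a) | set(b)):
            if key in ignore:
                continue
            if key not in a or key not in b:
                return f"{path}.{key}: missing on one side"
            diff = first_difference(a[key], b[key], ignore, f"{path}.{key}")
            if diff:
                return diff
        return None
    if isinstance(a, list) and isinstance(b, list):
        if len(a) != len(b):
            return f"{path}: length {len(a)} != {len(b)}"
        for index, (left, right) in enumerate(zip(a, b)):
            diff = first_difference(left, right, ignore, f"{path}[{index}]")
            if diff:
                return diff
        return None
    return None if a == b else f"{path}: {a!r} != {b!r}"


def assert_benchmark_results_equal(
    a: dict, b: dict, *, ignore: Iterable[str] = ("started_at", "completed_at")
) -> None:
    """Assert equality of two serialized results, skipping the ignored keys."""
    diff = first_difference(a, b, frozenset(ignore))
    assert diff is None, f"results differ at {diff}"
```

Reporting the path of the first mismatch tends to be more useful in a migration than a bare `a == b`, because the compared structures are deeply nested.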
+
+---
+
+## Post-migration validation
+
+At the end of Phase B4:
+
+```bash
+# All green
+python -m pytest tests/ -q --tb=short
+
+# Invariance snapshot unchanged
+python -m pytest tests/integration/test_migration_invariance.py -v
+
+# The 7 parity features are ported
+python -m pytest tests/app/services/test_run_orchestrator_feature_parity.py -v
+
+# No residual occurrence of run_benchmark_via_service outside the legacy module
+grep -rln "run_benchmark_via_service" tests/ picarones/ | \
+  grep -v "benchmark_runner.py\|_benchmark_" | \
+  wc -l
+# Expected: 0 (or 1 if a deprecated public export is kept in __init__.py)
+```
@@ -0,0 +1,279 @@
+"""Skeleton of the feature-parity tests between ``run_benchmark_via_service``
+and ``RunOrchestrator.execute(RunSpec)``.
+
+Phase B0 of the Option B migration.
+
+Role
+----
+This module lists the **7 features** that ``run_benchmark_via_service``
+exposes today and that ``RunOrchestrator`` must port during
+Phase B2. Each test is documented precisely (what must be
+verified) and marked ``pytest.mark.skip`` until the corresponding
+feature is ported.
+
+As Phase B2 progresses, remove the ``pytest.mark.skip`` from the matching
+test and implement its logic. At the end of B2, all tests
+must be green, which means **Checkpoint C1** has been reached.
+
+Convention
+----------
+Each test compares:
+
+1. ``run_benchmark_via_service(feature_X=value)``: legacy path
+2. ``RunOrchestrator().execute(spec_with_feature_X=value)``: rewrite
+   path
+
+and checks that the resulting ``BenchmarkResult`` is numerically
+identical (modulo normalization of volatile fields).
+
+Mapping to the Option B plan
+----------------------------
+- B2.1 ``progress_callback``   → ``test_parity_progress_callback``
+- B2.2 ``cancel_event``        → ``test_parity_cancel_event``
+- B2.3 ``partial_dir``         → ``test_parity_partial_dir_*``
+- B2.4 ``entity_extractor``    → ``test_parity_entity_extractor_ner``
+- B2.5 ``char_exclude`` +      → ``test_parity_char_exclude``,
+       ``normalization_profile``  ``test_parity_normalization_profile``
+- B2.6 ``profile`` (hooks)     → ``test_parity_profile_*``
+- B2.7 ``output_json``         → ``test_parity_output_json_legacy_format``
+"""
+
+from __future__ import annotations
+
+from pathlib import Path
+
+import pytest
+
+
+SKIP_REASON_PREFIX = "TODO Phase B2."
+
+
+# ──────────────────────────────────────────────────────────────────────
+# B2.1: progress_callback
+# ──────────────────────────────────────────────────────────────────────
+
+
+@pytest.mark.skip(reason=f"{SKIP_REASON_PREFIX}1 — port progress_callback")
+def test_parity_progress_callback(tmp_path: Path) -> None:
+    """``progress_callback`` is called with ``(engine_name, doc_idx,
+    doc_id)`` in both paths.
+
+    Spec
+    ----
+    - Run a benchmark with 1 engine × 3 docs.
+    - The callback is invoked exactly 3 times (once per doc).
+    - The arguments match: ``engine_name`` = adapter name,
+      ``doc_idx`` = increasing global counter (0, 1, 2), ``doc_id``
+      = document ID.
+    - The counter is shared across threads behind a lock
+      (cf. ``_benchmark_execution.py:109-139``).
+
+    Port target
+    -----------
+    Extend ``RunOrchestrator._make_context_factory`` so that it
+    accepts a ``progress_callback`` and reproduces the pattern
+    (lock + ``doc_idx`` counter).
+    """
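
The lock + counter pattern that this spec refers to can be sketched as follows. This is a standalone illustration, not the code of `_benchmark_execution.py`: `make_progress_reporter` and the `"precomputed"` engine name are hypothetical, and only the callback signature `(engine_name, doc_idx, doc_id)` comes from the docstring.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple


def make_progress_reporter(
    engine_name: str, callback: Callable[[str, int, str], None]
) -> Callable[[str], None]:
    """Wrap a progress callback with a thread-safe, globally increasing doc_idx."""
    lock = threading.Lock()
    next_idx = 0

    def report(doc_id: str) -> None:
        nonlocal next_idx
        # Counter update and callback happen under the same lock, so the
        # callback observes indices in strictly increasing order.
        with lock:
            doc_idx = next_idx
            next_idx += 1
            callback(engine_name, doc_idx, doc_id)

    return report


calls: List[Tuple[str, int, str]] = []
report = make_progress_reporter("precomputed", lambda *args: calls.append(args))
with ThreadPoolExecutor(max_workers=3) as pool:
    list(pool.map(report, ["doc1", "doc2", "doc3"]))
```

Which document receives which index depends on thread scheduling; what the lock guarantees is that the indices are unique and delivered in order.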
+
+
+# ──────────────────────────────────────────────────────────────────────
+# B2.2: cancel_event
+# ──────────────────────────────────────────────────────────────────────
+
+
+@pytest.mark.skip(reason=f"{SKIP_REASON_PREFIX}2 — port cancel_event")
+def test_parity_cancel_event(tmp_path: Path) -> None:
+    """Calling ``set()`` on the ``threading.Event`` stops the run in progress.
+
+    Spec
+    ----
+    - Run a benchmark with 1 engine × 10 docs.
+    - After ~2 docs have been processed, call ``cancel_event.set()``.
+    - The run must stop quickly (< 1 s of slack).
+    - The returned ``BenchmarkResult`` contains the first 2 docs
+      (or more, depending on timing) but not all 10.
+
+    Port target
+    -----------
+    Wrap ``CorpusRunner.run`` in ``RunOrchestrator`` so that it
+    injects the ``cancel_event`` into its kwargs (cf.
+    ``_benchmark_execution.py:142-149``).
+    """
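
Cooperative cancellation of this kind usually comes down to checking the event between documents. The sketch below is a single-threaded stand-in with hypothetical names (`run_until_cancelled`, `process`); it does not reproduce how `CorpusRunner.run` consumes `cancel_event`.

```python
import threading
from typing import Callable, Iterable, List


def run_until_cancelled(
    doc_ids: Iterable[str],
    process: Callable[[str], str],
    cancel_event: threading.Event,
) -> List[str]:
    """Process documents in order, stopping as soon as the event is set."""
    results: List[str] = []
    for doc_id in doc_ids:
        if cancel_event.is_set():
            break  # keep what was already computed, skip the rest
        results.append(process(doc_id))
    return results


cancel_event = threading.Event()
processed: List[str] = []


def process(doc_id: str) -> str:
    """Fake engine call that requests cancellation after the second document."""
    processed.append(doc_id)
    if len(processed) == 2:
        cancel_event.set()
    return doc_id.upper()


partial = run_until_cancelled([f"doc{i}" for i in range(10)], process, cancel_event)
```

Because the check happens only between documents, a document already in flight still completes, which matches the "2 docs or more, but not all 10" wording of the spec.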
+
+
+# ──────────────────────────────────────────────────────────────────────
+# B2.3: partial_dir resume
+# ──────────────────────────────────────────────────────────────────────
+
+
+@pytest.mark.skip(reason=f"{SKIP_REASON_PREFIX}3 — port partial_dir resume")
+def test_parity_partial_dir_resume_fresh_start(tmp_path: Path) -> None:
+    """First run with a non-existent ``partial_dir`` → behaviour
+    identical to a run without ``partial_dir``.
+
+    Spec
+    ----
+    - ``partial_dir`` = empty directory.
+    - Run the benchmark.
+    - At the end, the file ``{partial_dir}/picarones_{corpus}_{engine}
+      .partial.jsonl`` has been deleted (full success).
+    - The ``BenchmarkResult`` is identical to the run without ``partial_dir``.
+    """
+
+
+@pytest.mark.skip(reason=f"{SKIP_REASON_PREFIX}3 — port partial_dir resume")
+def test_parity_partial_dir_resume_after_crash(tmp_path: Path) -> None:
+    """Resume after a partial crash: 3 docs out of 5 already persisted →
+    only the 2 remaining docs are submitted to the runner.
+
+    Spec
+    ----
+    - Pre-write a partial JSONL with 3 valid ``DocumentResult`` entries.
+    - Run the benchmark on the 5-doc corpus.
+    - ``CorpusRunner.run`` is called on **2 docs only**
+      (verify with a spy).
+    - The final ``BenchmarkResult`` aggregates the 5 docs (3 reused +
+      2 new).
+    """
+
+
+@pytest.mark.skip(reason=f"{SKIP_REASON_PREFIX}3 — port partial_dir resume")
+def test_parity_partial_dir_fingerprint_invalidates(tmp_path: Path) -> None:
+    """A mismatched fingerprint invalidates the partial (recompute from scratch).
+
+    Spec
+    ----
+    - Pre-write a partial with a different ``code_version``.
+    - Run the benchmark.
+    - The partial is ignored and all 5 docs are recomputed.
+    """
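
The resume and fingerprint rules of the three specs above can be illustrated with the sketch below. The on-disk layout is an assumption made for the example (first line holds a `fingerprint` object, each following line holds one result with a `doc_id`), and `load_partial` / `pending_docs` are hypothetical names; the actual partial format is not shown in this commit.

```python
import json
from pathlib import Path
from typing import Dict, List


def load_partial(path: Path, fingerprint: dict) -> Dict[str, dict]:
    """Return persisted results keyed by doc_id, or {} if the partial is unusable."""
    if not path.exists():
        return {}
    lines = path.read_text(encoding="utf-8").splitlines()
    # A missing or mismatched fingerprint invalidates the whole partial.
    if not lines or json.loads(lines[0]).get("fingerprint") != fingerprint:
        return {}
    records = (json.loads(line) for line in lines[1:] if line.strip())
    return {record["doc_id"]: record for record in records}


def pending_docs(doc_ids: List[str], done: Dict[str, dict]) -> List[str]:
    """Documents that still have to be submitted to the runner."""
    return [doc_id for doc_id in doc_ids if doc_id not in done]
```

With 3 of 5 documents persisted under a matching fingerprint, `pending_docs` returns the 2 remaining ones; with a different `code_version` it returns all 5.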
+
+
+# ──────────────────────────────────────────────────────────────────────
+# B2.4: entity_extractor (NER attach)
+# ──────────────────────────────────────────────────────────────────────
+
+
+@pytest.mark.skip(reason=f"{SKIP_REASON_PREFIX}4 — port entity_extractor")
+def test_parity_entity_extractor_ner(tmp_path: Path) -> None:
+    """When an ``entity_extractor`` is provided, NER metrics
+    are attached to the ``BenchmarkResult``.
+
+    Spec
+    ----
+    - Corpus with ``EntitiesGT`` (at least 1 doc at the ENTITIES level).
+    - ``entity_extractor`` = mock that returns fixed entities.
+    - The ``BenchmarkResult`` contains ``DocumentResult.ner_metrics``:
+      ``precision``, ``recall``, ``f1`` per entity type.
+    - The ``EngineReport.aggregated_ner`` aggregation is computed.
+    """
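
For readers unfamiliar with the per-type metrics named in this spec, here is a minimal sketch of precision, recall and F1 per entity type under exact matching. The `(type, text)` entity representation and the `ner_scores` name are assumptions; the matching rule actually used by the project may differ.

```python
from typing import Dict, Iterable, Tuple

Entity = Tuple[str, str]  # (entity type, surface text), assumed representation


def ner_scores(gold: Iterable[Entity], predicted: Iterable[Entity]) -> Dict[str, Dict[str, float]]:
    """Exact-match precision, recall and F1 for each entity type."""
    gold_set, pred_set = set(gold), set(predicted)
    scores: Dict[str, Dict[str, float]] = {}
    for entity_type in sorted({etype for etype, _ in gold_set | pred_set}):
        gold_t = {e for e in gold_set if e[0] == entity_type}
        pred_t = {e for e in pred_set if e[0] == entity_type}
        true_positives = len(gold_t & pred_t)
        precision = true_positives / len(pred_t) if pred_t else 0.0
        recall = true_positives / len(gold_t) if gold_t else 0.0
        f1 = 2 * precision * recall / (precision + recall) if true_positives else 0.0
        scores[entity_type] = {"precision": precision, "recall": recall, "f1": f1}
    return scores
```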
+
+
+# ──────────────────────────────────────────────────────────────────────
+# B2.5: char_exclude + normalization_profile
+# ──────────────────────────────────────────────────────────────────────
+
+
+@pytest.mark.skip(reason=f"{SKIP_REASON_PREFIX}5 — port normalization propagation")
+def test_parity_char_exclude(tmp_path: Path) -> None:
+    """``char_exclude`` filters characters out before CER/WER is computed.
+
+    Spec
+    ----
+    - GT = ``"Bonjour!"``, OCR = ``"Bonjour."``.
+    - Without ``char_exclude``: CER = 1/8 = 0.125.
+    - With ``char_exclude="!."``: CER = 0.0 (the 2 filtered
+      characters are the only ones that differ).
+    """
+
+
+@pytest.mark.skip(reason=f"{SKIP_REASON_PREFIX}5 — port normalization propagation")
+def test_parity_normalization_profile(tmp_path: Path) -> None:
+    """``normalization_profile="caseless"`` makes the comparison case-insensitive.
+
+    Spec
+    ----
+    - GT = ``"Bonjour"``, OCR = ``"BONJOUR"``.
+    - Without a profile: CER = 6/7 ≈ 0.857 (every letter but the initial ``B`` differs).
+    - With ``caseless``: CER = 0.0.
+    """
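
The CER values quoted in these two specs can be checked by hand with a plain Levenshtein distance. The sketch below assumes CER = edit distance / length of the ground truth; the `cer` helper and its `char_exclude` / `caseless` keywords are illustrative stand-ins, not the Picarones metric implementation.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertion, deletion and substitution."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                       # deletion
                curr[j - 1] + 1,                   # insertion
                prev[j - 1] + (char_a != char_b),  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(gt: str, hyp: str, *, char_exclude: str = "", caseless: bool = False) -> float:
    """Character error rate after optional filtering and case folding."""
    if char_exclude:
        table = str.maketrans("", "", char_exclude)
        gt, hyp = gt.translate(table), hyp.translate(table)
    if caseless:
        gt, hyp = gt.casefold(), hyp.casefold()
    if not gt:
        return 0.0 if not hyp else 1.0
    return levenshtein(gt, hyp) / len(gt)
```

Under these assumptions, `"Bonjour!"` against `"Bonjour."` gives 1/8 = 0.125 and drops to 0.0 once `!` and `.` are excluded. `"Bonjour"` against `"BONJOUR"` gives 6/7, because the initial `B` already matches, and 0.0 with case folding.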
+
+
+# ──────────────────────────────────────────────────────────────────────
+# B2.6: profile (document-level hooks / corpus aggregators)
+# ──────────────────────────────────────────────────────────────────────
+
+
+@pytest.mark.skip(reason=f"{SKIP_REASON_PREFIX}6 — port profile hooks")
+def test_parity_profile_validation(tmp_path: Path) -> None:
+    """``profile="unknown"`` raises ``ValueError`` BEFORE the run.
+
+    Spec
+    ----
+    - Same behaviour as the 3 tests in
+      ``TestProfileValidation`` from
+      ``tests/app/test_sprint_d2cdef_features.py``.
+    """
+
+
+@pytest.mark.skip(reason=f"{SKIP_REASON_PREFIX}6 — port profile hooks")
+def test_parity_profile_standard_runs_hooks(tmp_path: Path) -> None:
+    """``profile="standard"`` runs the document-level hooks
+    registered via ``@register_document_metric``.
+
+    Spec
+    ----
+    - Register a test hook ``@register_document_metric("standard")``
+      that returns ``{"hooked": True}``.
+    - Run the benchmark.
+    - ``DocumentResult.hook_values["hooked"] is True``.
+    """
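
To make the two profile specs concrete, here is a toy registry that mimics the decorator named in the docstring. It is a stand-in written for illustration: the real `register_document_metric` signature, the hook arguments, and the set of known profiles are not shown in this commit and are assumed here.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Hook = Callable[[str, str], dict]

# Toy registry: profile name -> document-level hooks (assumed structure).
_HOOKS: Dict[str, List[Hook]] = defaultdict(list)
KNOWN_PROFILES = {"standard"}  # assumed list of valid profiles


def register_document_metric(profile: str) -> Callable[[Hook], Hook]:
    """Decorator that registers a document-level hook under a profile."""
    def decorator(func: Hook) -> Hook:
        _HOOKS[profile].append(func)
        return func
    return decorator


def run_document_hooks(profile: str, ground_truth: str, hypothesis: str) -> dict:
    """Validate the profile first, then merge the values returned by its hooks."""
    if profile not in KNOWN_PROFILES:
        raise ValueError(f"unknown profile: {profile!r}")
    values: dict = {}
    for hook in _HOOKS[profile]:
        values.update(hook(ground_truth, hypothesis))
    return values


@register_document_metric("standard")
def hooked(ground_truth: str, hypothesis: str) -> dict:
    return {"hooked": True}
```

Validating the profile before any hook runs is what lets an unknown profile fail fast, before documents are processed.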
+
+
+# ──────────────────────────────────────────────────────────────────────
+# B2.7: output_json (legacy BenchmarkResult JSON)
+# ──────────────────────────────────────────────────────────────────────
+
+
+@pytest.mark.skip(reason=f"{SKIP_REASON_PREFIX}7 — port output_json legacy")
+def test_parity_output_json_legacy_format(tmp_path: Path) -> None:
+    """When ``output_json`` is provided, a JSON file in the
+    ``BenchmarkResult.as_dict()`` format is written in addition to the 4
+    native JSONL files of the ``RunOrchestrator``.
+
+    Spec
+    ----
+    - Run ``RunOrchestrator().execute(spec_with_output_json)``.
+    - Check that ``output_json`` exists and contains JSON that
+      deserializes via ``BenchmarkResult.from_json_object``.
+    - Check that the 4 native JSONL files are also written
+      (both outputs coexist).
+    """
+
+
+# ──────────────────────────────────────────────────────────────────────
+# Global feature-parity test: cross-check
+# ──────────────────────────────────────────────────────────────────────
+
+
+@pytest.mark.skip(reason=f"{SKIP_REASON_PREFIX}* — toutes features portées")
+def test_parity_all_features_combined(tmp_path: Path) -> None:
+    """Run both paths with every feature enabled and
+    check numerical equality of the ``BenchmarkResult``.
+
+    Spec
+    ----
+    - Build a ``RunSpec`` with: ``profile="standard"``,
+      ``partial_dir=tmp_path/"partial"``, ``output_json=tmp_path/
+      "bm.json"``, ``char_exclude="!."``,
+      ``normalization_profile="caseless"``.
+    - Run ``run_benchmark_via_service`` with the same parameters.
+    - Run ``RunOrchestrator().execute(spec)``.
+    - Normalize both ``BenchmarkResult`` objects (cf.
+      ``test_migration_invariance.py:_normalize_for_snapshot``).
+    - Check ``a == b``.
+
+    This test is the **final gate of Checkpoint C1**. Once it passes,
+    Phase B2 is complete and B3 (migration of the
+    call sites) can begin.
+    """
@@ -0,0 +1,470 @@
| 1 |
+
{
|
| 2 |
+
"corpus": {
|
| 3 |
+
"document_count": 2,
|
| 4 |
+
"name": "invariance_corpus",
|
| 5 |
+
"source": null
|
| 6 |
+
},
|
| 7 |
+
"engine_reports": [
|
| 8 |
+
{
|
| 9 |
+
"aggregated_char_scores": {
|
| 10 |
+
"diacritic": {
|
| 11 |
+
"correctly_recognized": 0,
|
| 12 |
+
"score": 1.0,
|
| 13 |
+
"total_in_gt": 0
|
| 14 |
+
},
|
| 15 |
+
"ligature": {
|
| 16 |
+
"correctly_recognized": 0,
|
| 17 |
+
"per_ligature": {},
|
| 18 |
+
"score": 1.0,
|
| 19 |
+
"total_in_gt": 0
|
| 20 |
+
}
|
| 21 |
+
},
|
| 22 |
+
"aggregated_confusion": {
|
| 23 |
+
"matrix": {},
|
| 24 |
+
"total_deletions": 0,
|
| 25 |
+
"total_insertions": 0,
|
| 26 |
+
"total_substitutions": 1
|
| 27 |
+
},
|
| 28 |
+
"aggregated_hallucination": {
|
| 29 |
+
"anchor_score_mean": 0.5,
|
| 30 |
+
"anchor_score_min": 0.0,
|
| 31 |
+
"document_count": 2,
|
| 32 |
+
"hallucinating_doc_count": 1,
|
| 33 |
+
"hallucinating_doc_rate": 0.5,
|
| 34 |
+
"length_ratio_mean": 1.0,
|
| 35 |
+
"net_insertion_rate_mean": 0.25
|
| 36 |
+
},
|
| 37 |
+
"aggregated_line_metrics": {
|
| 38 |
+
"catastrophic_rate": {
|
| 39 |
+
"0.3": 0.0,
|
| 40 |
+
"0.5": 0.0,
|
| 41 |
+
"1.0": 0.0
|
| 42 |
+
},
|
| 43 |
+
"document_count": 2,
|
| 44 |
+
"gini_mean": 0.0,
|
| 45 |
+
"gini_stdev": 0.0,
|
| 46 |
+
"heatmap": [
|
| 47 |
+
0.0,
|
| 48 |
+
0.0,
|
| 49 |
+
0.0,
|
| 50 |
+
0.0,
|
| 51 |
+
0.0,
|
| 52 |
+
0.0,
|
| 53 |
+
0.0,
|
| 54 |
+
0.0,
|
| 55 |
+
0.0,
|
| 56 |
+
0.045455
|
| 57 |
+
],
|
| 58 |
+
"mean_cer_mean": 0.045455,
|
| 59 |
+
"percentiles": {
|
| 60 |
+
"p50": 0.045455,
|
| 61 |
+
"p75": 0.045455,
|
| 62 |
+
"p90": 0.045455,
|
| 63 |
+
"p95": 0.045455,
|
| 64 |
+
"p99": 0.045455
|
| 65 |
+
}
|
| 66 |
+
},
|
| 67 |
+
"aggregated_metrics": {
|
| 68 |
+
"cer": {
|
| 69 |
+
"max": 0.090909,
|
| 70 |
+
"mean": 0.045455,
|
| 71 |
+
"median": 0.045455,
|
| 72 |
+
"min": 0.0,
|
| 73 |
+
"stdev": 0.064282
|
| 74 |
+
},
|
| 75 |
+
"cer_caseless": {
|
| 76 |
+
"max": 0.090909,
|
| 77 |
+
"mean": 0.045455,
|
| 78 |
+
"median": 0.045455,
|
| 79 |
+
"min": 0.0,
|
| 80 |
+
"stdev": 0.064282
|
| 81 |
+
},
|
| 82 |
+
"cer_diplomatic": {
|
| 83 |
+
"max": 0.090909,
|
| 84 |
+
"mean": 0.045455,
|
| 85 |
+
"median": 0.045455,
|
| 86 |
+
"min": 0.0,
|
| 87 |
+
"profile": "medieval_french",
|
| 88 |
+
"stdev": 0.064282
|
| 89 |
+
},
|
| 90 |
+
"cer_nfc": {
|
| 91 |
+
"max": 0.090909,
|
| 92 |
+
"mean": 0.045455,
|
| 93 |
+
"median": 0.045455,
|
| 94 |
+
"min": 0.0,
|
| 95 |
+
"stdev": 0.064282
|
| 96 |
+
},
|
| 97 |
+
"document_count": 2,
|
| 98 |
+
"failed_count": 0,
|
| 99 |
+
"mer": {
|
| 100 |
+
"max": 0.5,
|
| 101 |
+
"mean": 0.25,
|
| 102 |
+
"median": 0.25,
|
| 103 |
+
"min": 0.0,
|
| 104 |
+
"stdev": 0.353553
|
| 105 |
+
},
|
| 106 |
+
"wer": {
|
| 107 |
+
"max": 0.5,
|
| 108 |
+
"mean": 0.25,
|
| 109 |
+
"median": 0.25,
|
| 110 |
+
"min": 0.0,
|
| 111 |
+
"stdev": 0.353553
|
| 112 |
+
},
|
| 113 |
+
"wer_normalized": {
|
| 114 |
+
"max": 0.5,
|
| 115 |
+
"mean": 0.25,
|
| 116 |
+
"median": 0.25,
|
| 117 |
+
"min": 0.0,
|
| 118 |
+
"stdev": 0.353553
|
| 119 |
+
},
|
| 120 |
+
"wil": {
|
| 121 |
+
"max": 0.75,
|
| 122 |
+
"mean": 0.375,
|
| 123 |
+
"median": 0.375,
|
| 124 |
+
"min": 0.0,
|
| 125 |
+
"stdev": 0.53033
|
| 126 |
+
}
|
| 127 |
+
},
|
+      "aggregated_searchability": {
+        "max_distance": 2,
+        "missed_tokens_sample": [],
+        "n_docs": 2,
+        "n_gt_tokens": 5,
+        "n_searchable": 5,
+        "recall": 1.0
+      },
+      "aggregated_structure": {
+        "document_count": 2,
+        "mean_line_accuracy": 1.0,
+        "mean_line_fragmentation_rate": 0.0,
+        "mean_line_fusion_rate": 0.0,
+        "mean_paragraph_conservation": 1.0,
+        "mean_reading_order_score": 0.75
+      },
+      "aggregated_taxonomy": {
+        "class_distribution": {
+          "abbreviation_error": 0.0,
+          "case_error": 0.0,
+          "diacritic_error": 0.0,
+          "hapax": 1.0,
+          "lacuna": 0.0,
+          "ligature_error": 0.0,
+          "oov_character": 0.0,
+          "segmentation_error": 0.0,
+          "visual_confusion": 0.0
+        },
+        "counts": {
+          "abbreviation_error": 0,
+          "case_error": 0,
+          "diacritic_error": 0,
+          "hapax": 1,
+          "lacuna": 0,
+          "ligature_error": 0,
+          "oov_character": 0,
+          "segmentation_error": 0,
+          "visual_confusion": 0
+        },
+        "total_errors": 1
+      },
+      "document_results": [
+        {
+          "char_scores": {
+            "diacritic": {
+              "correctly_recognized": 0,
+              "per_diacritic": {},
+              "score": 1.0,
+              "total_in_gt": 0
+            },
+            "ligature": {
+              "correctly_recognized": 0,
+              "per_ligature": {},
+              "score": 1.0,
+              "total_in_gt": 0
+            }
+          },
+          "confusion_matrix": {
+            "matrix": {},
+            "total_deletions": 0,
+            "total_insertions": 0,
+            "total_substitutions": 0
+          },
+          "doc_id": "doc1",
+          "duration_seconds": 0.0,
+          "engine_error": null,
+          "ground_truth": "Bonjour le monde",
+          "hallucination_metrics": {
+            "anchor_score": 1.0,
+            "anchor_threshold_used": 0.5,
+            "gt_word_count": 3,
+            "hallucinated_blocks": [],
+            "hyp_word_count": 3,
+            "is_hallucinating": false,
+            "length_ratio": 1.0,
+            "length_ratio_threshold_used": 1.2,
+            "net_inserted_words": 0,
+            "net_insertion_rate": 0.0,
+            "ngram_size_used": 3
+          },
+          "hypothesis": "Bonjour le monde",
+          "image_path": "FIXTURES/doc1.png",
+          "line_metrics": {
+            "catastrophic_rate": {
+              "0.3": 0.0,
+              "0.5": 0.0,
+              "1.0": 0.0
+            },
+            "cer_per_line": [
+              0.0
+            ],
+            "gini": 0.0,
+            "heatmap": [
+              0.0,
+              0.0,
+              0.0,
+              0.0,
+              0.0,
+              0.0,
+              0.0,
+              0.0,
+              0.0,
+              0.0
+            ],
+            "line_count": 1,
+            "mean_cer": 0.0,
+            "percentiles": {
+              "p50": 0.0,
+              "p75": 0.0,
+              "p90": 0.0,
+              "p95": 0.0,
+              "p99": 0.0
+            }
+          },
+          "metrics": {
+            "cer": 0.0,
+            "cer_caseless": 0.0,
+            "cer_diplomatic": 0.0,
+            "cer_nfc": 0.0,
+            "diplomatic_profile_name": "medieval_french",
+            "error": null,
+            "hypothesis_length": 16,
+            "mer": 0.0,
+            "reference_length": 16,
+            "wer": 0.0,
+            "wer_normalized": 0.0,
+            "wil": 0.0
+          },
+          "searchability_metrics": {
+            "max_distance": 2,
+            "missed_tokens": [],
+            "n_gt_tokens": 3,
+            "n_searchable": 3,
+            "recall": 1.0
+          },
+          "structure": {
+            "gt_line_count": 1,
+            "line_accuracy": 1.0,
+            "line_fragmentation_count": 0,
+            "line_fragmentation_rate": 0.0,
+            "line_fusion_count": 0,
+            "line_fusion_rate": 0.0,
+            "ocr_line_count": 1,
+            "paragraph_conservation_score": 1.0,
+            "reading_order_score": 1.0
+          },
+          "taxonomy": {
+            "class_distribution": {},
+            "counts": {
+              "abbreviation_error": 0,
+              "case_error": 0,
+              "diacritic_error": 0,
+              "hapax": 0,
+              "lacuna": 0,
+              "ligature_error": 0,
+              "oov_character": 0,
+              "segmentation_error": 0,
+              "visual_confusion": 0
+            },
+            "examples": {
+              "abbreviation_error": [],
+              "case_error": [],
+              "diacritic_error": [],
+              "hapax": [],
+              "lacuna": [],
+              "ligature_error": [],
+              "oov_character": [],
+              "segmentation_error": [],
+              "visual_confusion": []
+            },
+            "total_errors": 0
+          }
+        },
+        {
+          "char_scores": {
+            "diacritic": {
+              "correctly_recognized": 0,
+              "per_diacritic": {},
+              "score": 1.0,
+              "total_in_gt": 0
+            },
+            "ligature": {
+              "correctly_recognized": 0,
+              "per_ligature": {},
+              "score": 1.0,
+              "total_in_gt": 0
+            }
+          },
+          "confusion_matrix": {
+            "matrix": {
+              "l": {
+                "i": 1
+              }
+            },
+            "total_deletions": 0,
+            "total_insertions": 0,
+            "total_substitutions": 1
+          },
+          "doc_id": "doc2",
+          "duration_seconds": 0.0,
+          "engine_error": null,
+          "ground_truth": "Hello world",
+          "hallucination_metrics": {
+            "anchor_score": 0.0,
+            "anchor_threshold_used": 0.5,
+            "gt_word_count": 2,
+            "hallucinated_blocks": [],
+            "hyp_word_count": 2,
+            "is_hallucinating": true,
+            "length_ratio": 1.0,
+            "length_ratio_threshold_used": 1.2,
+            "net_inserted_words": 1,
+            "net_insertion_rate": 0.5,
+            "ngram_size_used": 3
+          },
+          "hypothesis": "Helio world",
+          "image_path": "FIXTURES/doc2.png",
+          "line_metrics": {
+            "catastrophic_rate": {
+              "0.3": 0.0,
+              "0.5": 0.0,
+              "1.0": 0.0
+            },
+            "cer_per_line": [
+              0.090909
+            ],
+            "gini": 0.0,
+            "heatmap": [
+              0.0,
+              0.0,
+              0.0,
+              0.0,
+              0.0,
+              0.0,
+              0.0,
+              0.0,
+              0.0,
+              0.090909
+            ],
+            "line_count": 1,
+            "mean_cer": 0.090909,
+            "percentiles": {
+              "p50": 0.090909,
+              "p75": 0.090909,
+              "p90": 0.090909,
+              "p95": 0.090909,
+              "p99": 0.090909
+            }
+          },
+          "metrics": {
+            "cer": 0.090909,
+            "cer_caseless": 0.090909,
+            "cer_diplomatic": 0.090909,
+            "cer_nfc": 0.090909,
+            "diplomatic_profile_name": "medieval_french",
+            "error": null,
+            "hypothesis_length": 11,
+            "mer": 0.5,
+            "reference_length": 11,
+            "wer": 0.5,
+            "wer_normalized": 0.5,
+            "wil": 0.75
+          },
+          "searchability_metrics": {
+            "max_distance": 2,
+            "missed_tokens": [],
+            "n_gt_tokens": 2,
+            "n_searchable": 2,
+            "recall": 1.0
+          },
+          "structure": {
+            "gt_line_count": 1,
+            "line_accuracy": 1.0,
+            "line_fragmentation_count": 0,
+            "line_fragmentation_rate": 0.0,
+            "line_fusion_count": 0,
+            "line_fusion_rate": 0.0,
+            "ocr_line_count": 1,
+            "paragraph_conservation_score": 1.0,
+            "reading_order_score": 0.5
+          },
+          "taxonomy": {
+            "class_distribution": {
+              "abbreviation_error": 0.0,
+              "case_error": 0.0,
+              "diacritic_error": 0.0,
+              "hapax": 1.0,
+              "lacuna": 0.0,
+              "ligature_error": 0.0,
+              "oov_character": 0.0,
+              "segmentation_error": 0.0,
+              "visual_confusion": 0.0
+            },
+            "counts": {
+              "abbreviation_error": 0,
+              "case_error": 0,
+              "diacritic_error": 0,
+              "hapax": 1,
+              "lacuna": 0,
+              "ligature_error": 0,
+              "oov_character": 0,
+              "segmentation_error": 0,
+              "visual_confusion": 0
+            },
+            "examples": {
+              "abbreviation_error": [],
+              "case_error": [],
+              "diacritic_error": [],
+              "hapax": [
+                {
+                  "gt": "Hello",
+                  "ocr": "Helio"
+                }
+              ],
+              "lacuna": [],
+              "ligature_error": [],
+              "oov_character": [],
+              "segmentation_error": [],
+              "visual_confusion": []
+            },
+            "total_errors": 1
+          }
+        }
+      ],
+      "engine_config": {},
+      "engine_name": "precomputed_invariance",
+      "engine_version": "PINNED"
+    }
+  ],
+  "metadata": {},
+  "picarones_version": "PINNED",
+  "ranking": [
+    {
+      "documents": 2,
+      "engine": "precomputed_invariance",
+      "failed": 0,
+      "mean_cer": 0.045455,
+      "mean_wer": 0.25,
+      "median_cer": 0.045455
+    }
+  ],
+  "run_date": "PINNED"
+}
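The aggregate figures in this snapshot can be cross-checked by hand from the two fixture documents. The following is a minimal sketch, not the picarones implementation: it assumes that CER is the character-level Levenshtein distance divided by the reference length and that `stdev` is the sample standard deviation. The helper `levenshtein` is hypothetical and written only for this illustration.

```python
import statistics


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (hypothetical helper)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (0 if equal)
            ))
        prev = curr
    return prev[-1]


# (ground truth, OCR hypothesis) pairs taken from the snapshot.
pairs = [
    ("Bonjour le monde", "Bonjour le monde"),
    ("Hello world", "Helio world"),
]
cers = [levenshtein(gt, hyp) / len(gt) for gt, hyp in pairs]

# Same rounding (6 decimals) as the values stored in the snapshot.
summary = {
    "max": round(max(cers), 6),
    "mean": round(statistics.mean(cers), 6),
    "median": round(statistics.median(cers), 6),
    "min": round(min(cers), 6),
    "stdev": round(statistics.stdev(cers), 6),
}
print(summary)
```

Under these assumptions the sketch reproduces the `max`, `mean`, `median`, `min` and `stdev` values that the snapshot reports for the CER variants, which suggests the stored `stdev` is a sample (n - 1) standard deviation over the two documents.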
@@ -0,0 +1,289 @@
+"""Run-to-run invariance test for the Option B migration.
+
+Phase B0 of the ``run_benchmark_via_service`` →
+``RunOrchestrator.execute(RunSpec)`` migration.
+
+Role
+----
+This test runs a **deterministic** benchmark (a mini corpus of 2 docs +
+``PrecomputedTextAdapter``) through the current facade
+``run_benchmark_via_service`` and compares its normalized
+``BenchmarkResult`` to a JSON snapshot stored in
+``tests/integration/snapshots/migration_invariance.json``.
+
+Why
+---
+During the migration to ``RunOrchestrator``, 7 features are ported
+(``progress_callback``, ``cancel_event``, ``partial_dir``,
+``entity_extractor``, ``char_exclude`` + ``normalization_profile``,
+``profile``, ``output_json``). Each port must preserve **exactly**
+the numerical behavior of the existing path. This test is the
+safety net: if an internal refactoring changes the result (CER,
+aggregation, engine order, JSON structure), the snapshot diverges
+and CI fails.
+
+The test uses **no** external dependency (no Tesseract, no
+network). The ``PrecomputedTextAdapter`` reads a text file written to
+disk, so its output is 100% deterministic.
+
+Updating the snapshot
+---------------------
+If an **intentional** change alters the result (e.g. a new field
+in ``BenchmarkResult``), regenerate the snapshot:
+
+    PICARONES_UPDATE_SNAPSHOT=1 python -m pytest \
+        tests/integration/test_migration_invariance.py
+
+Then inspect the git diff of the snapshot before committing.
+
+Normalization
+-------------
+Volatile fields are neutralized before comparison:
+
+- ``picarones_version`` → ``"PINNED"``
+- ``run_date`` → ``"PINNED"``
+- ``corpus.source`` → ``"FIXTURES/corpus"``
+- ``image_path`` → ``"FIXTURES/docN.png"``
+- ``duration_seconds`` → ``0.0``
+- Any other field containing ``tmp_path`` → replaced by
+  ``"FIXTURES/..."``
+
+This keeps the snapshot stable across OSes and across runs.
+"""
+
+from __future__ import annotations
+
+import json
+import os
+import re
+from pathlib import Path
+from typing import Any
+
+import pytest
+
+from picarones.adapters.ocr.precomputed import PrecomputedTextAdapter
+from picarones.app.services.benchmark_runner import run_benchmark_via_service
+from picarones.evaluation.corpus import Corpus, Document
+
+
+SNAPSHOT_PATH = (
+    Path(__file__).parent / "snapshots" / "migration_invariance.json"
+)
+
+
+# ──────────────────────────────────────────────────────────────────────
+# Deterministic fixtures
+# ──────────────────────────────────────────────────────────────────────
+
+
+def _make_invariance_corpus(tmp_path: Path) -> Corpus:
+    """Mini corpus of 2 documents with GT + precomputed text.
+
+    The precomputed text differs slightly from the GT (in doc 2) so as
+    to produce non-trivial CER/WER metrics (and therefore a more
+    discriminating snapshot).
+    """
+    documents: list[Document] = []
+
+    # Doc 1: GT = "Bonjour le monde", OCR = "Bonjour le monde" → CER 0.0
+    doc1_img = tmp_path / "doc1.png"
+    doc1_img.write_bytes(b"\x89PNG\r\n\x1a\n")  # minimal PNG header
+    doc1_ocr = tmp_path / "doc1.invariance.txt"
+    doc1_ocr.write_text("Bonjour le monde", encoding="utf-8")
+    documents.append(Document(
+        image_path=doc1_img,
+        ground_truth="Bonjour le monde",
+        doc_id="doc1",
+    ))
+
+    # Doc 2: GT = "Hello world", OCR = "Helio world" → non-zero CER
+    doc2_img = tmp_path / "doc2.png"
+    doc2_img.write_bytes(b"\x89PNG\r\n\x1a\n")
+    doc2_ocr = tmp_path / "doc2.invariance.txt"
+    doc2_ocr.write_text("Helio world", encoding="utf-8")
+    documents.append(Document(
+        image_path=doc2_img,
+        ground_truth="Hello world",
+        doc_id="doc2",
+    ))
+
+    return Corpus(name="invariance_corpus", documents=documents)
+
+
+def _make_invariance_engine() -> PrecomputedTextAdapter:
+    """``PrecomputedTextAdapter`` that reads ``<stem>.invariance.txt``."""
+    return PrecomputedTextAdapter(source_label="invariance")
+
+
+# ──────────────────────────────────────────────────────────────────────
+# Snapshot normalization
+# ──────────────────────────────────────────────────────────────────────
+
+
+def _normalize_for_snapshot(data: Any, tmp_path: Path) -> Any:
+    """Recursively normalize the volatile fields of the ``BenchmarkResult``.
+
+    Replaces ``tmp_path`` with ``"FIXTURES"`` in every string value.
+    Neutralizes the explicitly volatile fields
+    (``duration_seconds``, ``run_date``, ``picarones_version``,
+    ``engine_version``, ``code_version``).
+    """
+    tmp_str = str(tmp_path)
+    # Pattern matching tmp_path/anything (for absolute paths that
+    # appear as string values rather than as keys).
+    tmp_re = re.compile(re.escape(tmp_str))
+
+    def _normalize(value: Any, *, key: str | None = None) -> Any:
+        if isinstance(value, dict):
+            return {k: _normalize(v, key=k) for k, v in value.items()}
+        if isinstance(value, list):
+            return [_normalize(item) for item in value]
+        if isinstance(value, str):
+            return tmp_re.sub("FIXTURES", value)
+        if isinstance(value, float):
+            # Neutralize durations (volatile from one run to the next).
+            if key == "duration_seconds":
+                return 0.0
+            # Keep other floats at a reasonable precision to absorb
+            # minimal floating-point noise.
+            return round(value, 6)
+        return value
+
+    normalized = _normalize(data)
+
+    # Root-level volatile fields: neutralized in post-processing
+    # because their value does not contain ``tmp_path``.
+    if isinstance(normalized, dict):
+        for volatile_key in ("picarones_version", "run_date"):
+            if volatile_key in normalized:
+                normalized[volatile_key] = "PINNED"
+
+        # engine_version may appear in each engine_report.
+        for report in normalized.get("engine_reports", []):
+            if "engine_version" in report:
+                report["engine_version"] = "PINNED"
+            # pipeline_info sometimes carries paths or metadata.
+            pipeline_info = report.get("pipeline_info")
+            if isinstance(pipeline_info, dict):
+                if "code_version" in pipeline_info:
+                    pipeline_info["code_version"] = "PINNED"
+
+    return normalized
+
+
+# ──────────────────────────────────────────────────────────────────────
+# Snapshot comparison
+# ──────────────────────────────────────────────────────────────────────
+
+
+def _load_snapshot() -> dict | None:
+    if not SNAPSHOT_PATH.exists():
+        return None
+    return json.loads(SNAPSHOT_PATH.read_text(encoding="utf-8"))
+
+
+def _write_snapshot(data: dict) -> None:
+    SNAPSHOT_PATH.parent.mkdir(parents=True, exist_ok=True)
+    SNAPSHOT_PATH.write_text(
+        json.dumps(data, ensure_ascii=False, indent=2, sort_keys=True),
+        encoding="utf-8",
+    )
+
+
+def _should_update_snapshot() -> bool:
+    return os.environ.get("PICARONES_UPDATE_SNAPSHOT") == "1"
+
+
+# ──────────────────────────────────────────────────────────────────────
+# Main test
+# ──────────────────────────────────────────────────────────────────────
+
+
+def test_run_benchmark_via_service_invariance(tmp_path: Path) -> None:
+    """Invariance snapshot of the current behavior.
+
+    This test is the safety net of the Option B migration. It must
+    stay green at every step of the work (B1, B2, B3, B4, ...) as long
+    as ``run_benchmark_via_service`` is the public facade.
+
+    Once the migration is complete and ``run_benchmark_via_service``
+    is removed (Phase B8), this test will be dropped or migrated to
+    ``RunOrchestrator.execute()``.
+    """
+    corpus = _make_invariance_corpus(tmp_path)
+    engine = _make_invariance_engine()
+
+    benchmark_result = run_benchmark_via_service(
+        corpus=corpus,
+        engines=[engine],
+        code_version="invariance-test-1.0.0",
+    )
+
+    actual_normalized = _normalize_for_snapshot(
+        benchmark_result.as_dict(), tmp_path,
+    )
+
+    snapshot = _load_snapshot()
+    if snapshot is None or _should_update_snapshot():
+        _write_snapshot(actual_normalized)
+        if snapshot is None:
+            pytest.skip(
+                f"Snapshot created for the first time at "
+                f"{SNAPSHOT_PATH.relative_to(Path.cwd())}. "
+                f"Check its contents, then re-run the test."
+            )
+        else:
+            # Explicit update mode: the snapshot was written, so the
+            # test passes without further checks. The operator is
+            # responsible for inspecting the git diff.
+            return
+
+    assert actual_normalized == snapshot, (
+        "BenchmarkResult diverges from the invariance snapshot.\n"
+        f"Snapshot: {SNAPSHOT_PATH}\n"
+        "If the divergence is intentional, regenerate with:\n"
+        " PICARONES_UPDATE_SNAPSHOT=1 python -m pytest "
+        f"{Path(__file__).relative_to(Path.cwd())}\n"
+        "and inspect the git diff of the snapshot before committing."
+    )
+
+
+# ──────────────────────────────────────────────────────────────────────
+# Auxiliary test: checks that the normalization itself is stable
+# ──────────────────────────────────────────────────────────────────────
+
+
+def test_normalization_is_idempotent(tmp_path: Path) -> None:
+    """Normalizing an already-normalized dict leaves it unchanged.
+
+    Guarantees that normalization can be re-applied without drift.
+    Pedagogical test of the snapshot mechanics.
+    """
+    sample = {
+        "picarones_version": "2.0.0",
+        "run_date": "2026-05-14T12:00:00Z",
+        "corpus": {"source": str(tmp_path / "corpus.zip")},
+        "engine_reports": [
+            {
+                "engine_version": "1.2.3",
+                "document_results": [
+                    {
+                        "image_path": str(tmp_path / "doc1.png"),
+                        "duration_seconds": 0.123456,
+                        "metrics": {"cer": 0.05},
+                    },
+                ],
+            },
+        ],
+    }
+
+    once = _normalize_for_snapshot(sample, tmp_path)
+    twice = _normalize_for_snapshot(once, tmp_path)
+
+    assert once == twice
+    assert once["picarones_version"] == "PINNED"
+    assert once["run_date"] == "PINNED"
+    assert once["engine_reports"][0]["engine_version"] == "PINNED"
+    assert once["engine_reports"][0]["document_results"][0]["duration_seconds"] == 0.0
+    assert "FIXTURES" in once["corpus"]["source"]
+    assert "FIXTURES" in once["engine_reports"][0]["document_results"][0]["image_path"]
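One detail of the messages in the main test is worth noting: `Path.relative_to(Path.cwd())` raises `ValueError` when the path is not located under the current working directory, for instance if pytest is launched from another directory. The following is a minimal sketch of a tolerant fallback; `display_path` is a hypothetical helper introduced for illustration and is not part of this commit.

```python
from __future__ import annotations

from pathlib import Path


def display_path(path: Path, base: Path | None = None) -> str:
    """Return ``path`` relative to ``base`` when possible, else unchanged.

    Hypothetical helper: ``base`` defaults to the current working
    directory, mirroring the ``relative_to(Path.cwd())`` calls in the test.
    """
    base = Path.cwd() if base is None else base
    try:
        return str(path.relative_to(base))
    except ValueError:
        # ``path`` is not located under ``base``: keep it as given.
        return str(path)


print(display_path(Path("/repo/tests/snapshots/s.json"), Path("/repo")))
print(display_path(Path("/elsewhere/s.json"), Path("/repo")))
```

With such a helper, a skip or assertion message would still be produced when the test is run from outside the repository root, at the cost of showing an absolute path in that case.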