Sprint 4 of the report plan: complete narrative engine + factual summary
Sprint 4 of phase 0. A 3-5 sentence summary at the top of the report,
deterministic, with no LLM; every number is traceable to the results JSON.
Detectors (`core/narrative/detectors.py`), 9 new:
- `detect_global_leader_cer`: leader of the ranking; the payload carries cer_pct,
  n_docs and runner_up so the fact can be merged with `significant_gap`.
- `detect_significant_gap`: reads `statistics.pairwise_wilcoxon` and emits a fact if
  leader vs runner-up is significant (p < 0.05).
- `detect_stratum_winner`: aggregates CER by (engine, script_type); flags when
  an engine is at least 25% better than the second-best on a stratum of ≥ 3 docs.
- `detect_stratum_collapse`: flags when an engine's local CER > 2× its global CER
  on a stratum of ≥ 3 documents.
- `detect_error_profile_outlier`: compares `aggregated_taxonomy.distribution`
  per class across engines; flags when an engine exceeds 2× the median with a
  share ≥ 15%.
- `detect_llm_hallucination_flag`: pipelines/VLMs only; flags when
  hallucinating_doc_rate > 30%, anchor_score < 0.60 or length_ratio > 1.30.
- `detect_robustness_fragile`: reads `benchmark_data.robustness` if present;
  flags when CER at the maximum level ≥ 3× baseline CER.
- `detect_speed_winner`: engine at least 3× faster than the median,
  in the same Nemenyi group as the leader OR with CER ≤ 1.1 × the leader's CER.
- `detect_confidence_warning`: 95% CI width > 3× |leader − runner-up|
  OR > 5 CER points, which signals a fragile ranking.
`pareto_alternative` and `cost_outlier` remain stubs until Sprint 5.
`register_default_detectors(registry)` registers all 12 types in the
default registry (stubs included; they are safe and return []).
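As an illustration of how one of these threshold rules could work, here is a minimal sketch of the `detect_stratum_collapse` criterion (local CER > 2× global CER on a stratum of ≥ 3 documents). The input shape (one `(engine, stratum, cer)` tuple per document), the helper name `find_stratum_collapses` and the choice of a per-document mean as the "global CER" are assumptions made for the example, not the actual Picarones implementation.

```python
from collections import defaultdict
from statistics import mean

MIN_DOCS = 3        # a stratum must contain at least 3 documents
COLLAPSE_RATIO = 2  # local CER must exceed 2x the engine's global CER


def find_stratum_collapses(rows):
    """rows: iterable of (engine, stratum, cer) tuples, one per document (assumed shape)."""
    by_engine = defaultdict(list)
    by_stratum = defaultdict(list)
    for engine, stratum, cer in rows:
        by_engine[engine].append(cer)
        by_stratum[(engine, stratum)].append(cer)

    collapses = []
    # sorted() keeps the output order independent of the input order
    for (engine, stratum), cers in sorted(by_stratum.items()):
        if len(cers) < MIN_DOCS:
            continue
        local, overall = mean(cers), mean(by_engine[engine])
        if local > COLLAPSE_RATIO * overall:
            collapses.append((engine, stratum, local, overall))
    return collapses


# Toy data: engine "a" is fine on print but collapses on handwriting.
rows = (
    [("a", "print", 1.0)] * 7
    + [("a", "handwritten", 20.0)] * 3
    + [("b", "print", 5.0)] * 7
    + [("b", "handwritten", 5.0)] * 3
)
collapses = find_stratum_collapses(rows)
print(collapses)
```

Only values computed from the input rows end up in the result, which is the property the payload traceability rule relies on.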
Arbiter (`core/narrative/arbiter.py`):
- Stable sort by (−importance, canonical type order, engines, stratum).
- Non-redundancy: a single fact per engine, except for complementary pairs
  (leader + gap, leader + speed, leader + confidence, tie + speed).
- `_remove_contradictions`: if a `STATISTICAL_TIE` (Nemenyi, corrected for
  multiple comparisons) includes two engines, any `SIGNIFICANT_GAP`
  (uncorrected Wilcoxon) between those same engines is dropped. Nemenyi
  wins, to avoid stating both "A beats B" and "A and B are indistinguishable".
- Limit: ≤ max_facts (default 5). Threshold: min_importance = MEDIUM.
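The deterministic ordering used by the arbiter can be illustrated with a small sketch. Facts are modelled here as plain tuples, the numeric importance values and the shortened type list are assumptions for the example (the real code uses `FactImportance` and a 12-entry `_TYPE_ORDER`), and the check simply confirms that every input permutation sorts to the same sequence.

```python
from itertools import permutations

# Assumed numeric levels; only their relative order matters here.
IMPORTANCE = {"LOW": 1, "MEDIUM": 2, "HIGH": 3, "CRITICAL": 4}
# Shortened canonical order (same relative order as the arbiter's list).
TYPE_ORDER = ["GLOBAL_LEADER_CER", "STATISTICAL_TIE", "SIGNIFICANT_GAP", "SPEED_WINNER"]


def sort_key(fact):
    """fact = (importance, type, engines, stratum), a hypothetical tuple layout."""
    importance, fact_type, engines, stratum = fact
    return (
        -IMPORTANCE[importance],        # most important first
        TYPE_ORDER.index(fact_type),    # canonical type order breaks ties
        tuple(sorted(engines)),         # engine names, order-insensitive
        stratum or "",                  # None sorts as the empty string
    )


facts = [
    ("HIGH", "SPEED_WINNER", ("tesseract",), None),
    ("CRITICAL", "GLOBAL_LEADER_CER", ("pero_ocr",), None),
    ("HIGH", "SIGNIFICANT_GAP", ("pero_ocr", "tesseract"), None),
    ("MEDIUM", "STATISTICAL_TIE", ("tesseract", "pero_ocr"), None),
]

# Whatever order the detectors emit the facts in, the ranking is identical.
orderings = {tuple(sorted(p, key=sort_key)) for p in permutations(facts)}
ranked = sorted(facts, key=sort_key)
print([f[1] for f in ranked])
```

Because no two facts share a key in this toy set, the result does not depend on sort stability; with duplicate keys, Python's stable `sorted` would preserve the detectors' emission order.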
Renderer (`core/narrative/renderer.py`):
- Loads the YAML templates `templates/{lang}.yaml` (one template per type).
- Uses `str.format_map` with a `_SafeFormatMap` that returns "?" for
  missing keys and logs a warning. No exception propagates.
- `extract_numbers(text)` for the traceability tests.
Templates (`core/narrative/templates/{fr,en}.yaml`):
- 10 bilingual templates (one per implemented type).
- Strict rule: no numeric value or name outside the payload fields.
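The `str.format_map` approach with a forgiving mapping can be sketched as follows. This is a minimal illustration, not the project's `_SafeFormatMap`: the class name `SafeFormatMap`, the logger name and the template text are made up for the example. It relies on the documented behaviour that `dict` subclasses may define `__missing__`, which `format_map` honours because it looks keys up with `mapping[key]`.

```python
import logging

logger = logging.getLogger("narrative.renderer")  # hypothetical logger name


class SafeFormatMap(dict):
    """Mapping for str.format_map that never raises on a missing key."""

    def __missing__(self, key):
        # Log the gap, then substitute a visible placeholder.
        logger.warning("missing template key: %s", key)
        return "?"


# Hypothetical template; the real ones live in the YAML files.
template = "{engine} reaches a mean CER of {cer_pct} % on {n_docs} documents."
payload = {"engine": "pero_ocr", "cer_pct": 0.13}  # n_docs deliberately absent

sentence = template.format_map(SafeFormatMap(payload))
print(sentence)  # pero_ocr reaches a mean CER of 0.13 % on ? documents.
```

One caveat of this sketch: a template field with a numeric format spec (for instance `{cer_pct:.2f}`) would still raise if the key were missing, since the `"?"` placeholder is a string.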
Report integration:
- `_narrative_summary.html`: Jinja2 partial that renders the sentences as `<li>` items.
- Placed in `base.html.j2` between `_header.html` and `_critical_difference.html`.
- `ReportGenerator.generate` calls `build_synthesis` and passes the result on.
- CSS `.synth-card` with a blue left border and blue bullet markers.
- i18n FR/EN: 2 new keys, `synth_title` and `synth_hint`.
- Presentation mode hides the `hint`.
- **Jinja2 autoescape disabled**: equivalent to the historical
  `_HTML_TEMPLATE.format()`. All injected content comes from Picarones code.
Packaging:
- `pyproject.toml`: `core/narrative/templates/*.yaml` added as package-data.
- `MANIFEST.in`: same inclusion.
Tests (`test_sprint19_narrative_engine.py`), 32 new:
- Individual detectors: canonical cases + empty cases for each.
- Arbiter: sort by importance, max_facts limit, dedup on same engine+type,
  complementary pairs kept, LOW filtered out, Nemenyi vs Wilcoxon
  rule.
- Renderer: templates loaded, language respected, a missing key does not crash,
  determinism.
- E2E `build_synthesis`: produces sentences, reproducible.
- **Anti-hallucination**: parses each rendered sentence and checks that every
  number is in the payload of a retained Fact (or in the closed list
  of template constants {"95", "100"}). Payloads are the results of
  deterministic computations on the input, so the traceability chain is complete.
- Report integration: summary section present, byte-for-byte determinism,
  default registry populated (12 types), EN locale renders in English.
Full suite: 1174 passed, 2 skipped (vs 1142 before). Zero regressions.
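The anti-hallucination check listed in the tests can be sketched roughly as below. The helper names echo `extract_numbers` and `_numbers_in_payload` from the commit, but these bodies are hypothetical re-creations written for illustration; the regular expression, the handling of digits inside names, and the string comparison of numbers are all assumptions.

```python
import re

TEMPLATE_CONSTANTS = {"95", "100"}          # closed allow-list from the commit message
_NUMBER_RE = re.compile(r"\d+(?:[.,]\d+)?")  # integers and decimals (dot or comma)


def extract_numbers(text):
    """Return numeric tokens found in text, with decimal commas normalised."""
    return [m.replace(",", ".") for m in _NUMBER_RE.findall(text)]


def numbers_in_payload(value):
    """Collect the string form of every number nested anywhere in a payload."""
    if isinstance(value, bool):
        return set()
    if isinstance(value, (int, float)):
        return {str(value)}
    if isinstance(value, str):
        # Names such as "gpt-4o" contain digits that will show up in the sentence.
        return set(extract_numbers(value))
    if isinstance(value, dict):
        return set().union(*(numbers_in_payload(v) for v in value.values()))
    if isinstance(value, (list, tuple)):
        return set().union(*(numbers_in_payload(v) for v in value))
    return set()


def untraceable_numbers(sentence, payload):
    """Numbers in the sentence that are neither in the payload nor allow-listed."""
    allowed = numbers_in_payload(payload) | TEMPLATE_CONSTANTS
    return [n for n in extract_numbers(sentence) if n not in allowed]


payload = {"engine": "pero_ocr", "cer_pct": 0.13, "n_docs": 8}
good = "On this corpus of 8 documents, pero_ocr achieves the lowest mean CER (0.13 %)."
bad = "On this corpus of 8 documents, pero_ocr achieves the lowest mean CER (0.15 %)."

print(untraceable_numbers(good, payload))
print(untraceable_numbers(bad, payload))
```

A real test would also have to decide how rounding in templates is handled (for example a payload value of `8.0` rendered as `8`); this sketch compares string forms only.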
Example summary on the demo (8 docs, 3 engines + pipelines):
• On this corpus of 8 documents, pero_ocr achieves the lowest mean CER
  (0.13 %).
• The engines pero_ocr, tesseract → gpt-4o, gpt-4o-vision (zero-shot),
  tesseract are not statistically distinguishable (Friedman-Nemenyi,
  α = 0.05, n = 8 documents, CD = 2.157).
https://claude.ai/code/session_0162FdNNJyNvBuYzkgtsr9VB
- CLAUDE.md +28 -13
- MANIFEST.in +1 -0
- picarones/core/narrative/__init__.py +58 -6
- picarones/core/narrative/arbiter.py +136 -0
- picarones/core/narrative/detectors.py +481 -54
- picarones/core/narrative/renderer.py +105 -0
- picarones/core/narrative/templates/en.yaml +46 -0
- picarones/core/narrative/templates/fr.yaml +50 -0
- picarones/report/generator.py +12 -5
- picarones/report/i18n/en.json +2 -0
- picarones/report/i18n/fr.json +2 -0
- picarones/report/templates/_narrative_summary.html +16 -0
- picarones/report/templates/_styles.css +37 -0
- picarones/report/templates/base.html.j2 +2 -0
- pyproject.toml +1 -0
- tests/test_sprint19_narrative_engine.py +597 -0
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -193,6 +193,7 @@ AZURE_DOC_INTEL_KEY=...
 | 16 | **Sprint 1 of the report plan**: wiring of `line_metrics` and `hallucination` into the runner and the `EngineReport` aggregation, foundations of the narrative engine (`core/narrative/` with the `Fact` model and a detector registry), quality fixes (Pillow deprecation `getdata` → `tobytes`, two `except Exception: pass` replaced with explicit warnings) |
 | 17 | **Sprint 2 of the report plan**: refactor of `generator.py` (3690 → 617 lines) via Jinja2. The `_HTML_TEMPLATE` monolith is split into 10 external files in `picarones/report/templates/` (base + 5 views + header/footer + CSS + JS). The i18n module `i18n.py` (a Python dict with 101 keys) is migrated to `picarones/report/i18n/{fr,en}.json`, loaded at import time. 16 non-regression tests added (structure, determinism, i18n, guards against duplicated tags). |
 | 18 | **Sprint 3 of the report plan**: multi-engine Friedman test + Nemenyi post-hoc + Critical Difference Diagram (Demšar 2006). New module `core/statistics.py`: `friedman_test`, `nemenyi_posthoc`, `build_critical_difference_svg` with a Nemenyi table (k=2 to 50, α=0.05 and 0.01), pure-Python fallback (Wilson-Hilferty for chi²), optional scipy support (extra `stats`). Partial `_critical_difference.html` inserted at the top of the report, SVG rendered server-side (no JS), i18n FR/EN for the help texts. Narrative detector `detect_statistical_tie` enabled (reads `nemenyi.tied_groups`). 41 tests added (canonical cases, degenerate cases, SVG, report integration). |
+| 19 | **Sprint 4 of the report plan**: complete narrative engine + factual summary at the top. 9 detectors implemented (global_leader_cer, significant_gap, stratum_winner/collapse, error_profile_outlier, llm_hallucination_flag, robustness_fragile, speed_winner, confidence_warning). Arbiter (`arbiter.py`) with sort by importance, non-redundancy, removal of Wilcoxon/Nemenyi contradictions. Renderer (`renderer.py`) reads the YAML templates `core/narrative/templates/{fr,en}.yaml` (10 templates per language) and renders deterministically via `str.format_map`. New partial `_narrative_summary.html` placed at the top of the report (between the header and the CDD). Anti-hallucination guard tested: every rendered number is traceable to the payload of the associated Fact. 32 tests (detector unit tests, arbiter, renderer, E2E, traceability, HTML integration). `pareto_alternative` and `cost_outlier` remain stubs for Sprint 5. |

 ---

@@ -202,30 +203,44 @@ Foundations in place in `picarones/core/narrative/`:

 ```
 core/narrative/
-├── __init__.py # Public API
-├── facts.py # Model
-
+├── __init__.py # Public API + build_synthesis pipeline
+├── facts.py # Fact model, FactType (12 types), FactImportance, DetectorRegistry
+├── detectors.py # 10 implemented detectors (Sprint 19) + 2 stubs (Sprint 5)
+├── arbiter.py # Sort by importance, non-redundancy, anti-contradiction
+├── renderer.py # Renders YAML templates via str.format_map (deterministic)
+└── templates/
+    ├── fr.yaml # 10 French templates
+    └── en.yaml # 10 English templates
 ```

 **Anti-hallucination principle**: every numeric value or entity name in the
-`payload` of a `Fact` must come
-
-
+`payload` of a `Fact` must come from the input JSON. The test `test_sprint19_narrative_engine.py`
+parses the rendered summary and checks that every number is traceable to the payload
+(via `_numbers_in_payload`), extended with a closed allow-list of template
+constants (`95`, `100`).

-**Detectors
-
-- Sprint 3: `statistical_tie`, **implemented** (reads `nemenyi.tied_groups`)
+**Detectors enabled in the default registry (Sprint 19)**:
+- Sprint 3: `statistical_tie`
 - Sprint 4: `global_leader_cer`, `significant_gap`, `stratum_winner`, `stratum_collapse`,
-`error_profile_outlier`, `llm_hallucination_flag`, `robustness_fragile`,
-`
-- Sprint 5: `pareto_alternative`, `cost_outlier`
+`error_profile_outlier`, `llm_hallucination_flag`, `robustness_fragile`,
+`speed_winner`, `confidence_warning`
+- Sprint 5: `pareto_alternative`, `cost_outlier` (stubs, return `[]`)
+
+**Anti-contradiction rule** (arbiter): if a `SIGNIFICANT_GAP` (uncorrected Wilcoxon)
+and a `STATISTICAL_TIE` (corrected Nemenyi) concern the same engines, Nemenyi
+wins; we do not want to state both "A beats B significantly" AND
+"A and B are indistinguishable".
+
+**Pipeline**: `build_synthesis(benchmark_data, lang, max_facts=5)` detects,
+arbitrates, renders. `ReportGenerator.generate` calls it and passes the result
+to the `_narrative_summary.html` template (placed between `_header.html` and `_critical_difference.html`).

 ---

 ## Development context

 - **Environment**: GitHub Codespaces (`/workspaces/Picarones`), Python 3.12
-- **Tests**:
+- **Tests**: 1174 passed, 2 skipped (Sprint 19)
 - **Active branch**: `claude/review-picarones-benchmarks-E3J42`
 - **Transcript of the development conversation**:
 `/mnt/transcripts/2026-03-11-14-01-41-picarones-ocr-bench-project.txt`
--- a/MANIFEST.in
+++ b/MANIFEST.in
@@ -7,3 +7,4 @@ recursive-include picarones/web/static *.css
 recursive-include picarones *.json *.yaml *.yml
 recursive-include picarones/report/templates *.j2 *.html *.css *.js
 recursive-include picarones/report/i18n *.json
+recursive-include picarones/core/narrative/templates *.yaml
--- a/picarones/core/narrative/__init__.py
+++ b/picarones/core/narrative/__init__.py
@@ -1,12 +1,17 @@
 """Factual narrative engine: deterministic summary generation.

-
-
-
-input results.
+Extracts the salient facts from a ``BenchmarkResult`` and renders them as short
+sentences via external YAML templates. No LLM: every number or name
+appearing in the summary is traceable to the input results JSON.

-
-
+Public API
+----------
+- ``Fact``, ``FactType``, ``FactImportance``: data model
+- ``DetectorRegistry``: detector registry
+- ``detect_all(data)``: applies the default registry
+- ``select_facts(facts, max_facts=5)``: selection arbiter
+- ``render_synthesis(facts, lang="fr")``: renders a list of sentences
+- ``build_synthesis(data, lang="fr")``: full pipeline (Sprint 4)
 """

 from picarones.core.narrative.facts import (
@@ -15,7 +20,47 @@ from picarones.core.narrative.facts import (
     FactImportance,
     DetectorRegistry,
     detect_all,
+    _DEFAULT_REGISTRY,
 )
+from picarones.core.narrative.arbiter import select_facts
+from picarones.core.narrative.renderer import (
+    render_fact,
+    render_synthesis,
+    extract_numbers,
+)
+from picarones.core.narrative.detectors import (
+    register_default_detectors,
+    DETECTORS_BY_TYPE,
+)
+
+
+# Enable the default registry (Sprint 4)
+register_default_detectors(_DEFAULT_REGISTRY)
+
+
+def build_synthesis(
+    benchmark_data: dict,
+    lang: str = "fr",
+    max_facts: int = 5,
+) -> dict:
+    """Full pipeline: detection → arbitration → rendering.
+
+    Returns
+    -------
+    dict with:
+        - ``sentences``: list of sentences ready for display
+        - ``facts``: list of ``Fact.as_dict()`` dicts for traceability
+        - ``lang``: language used
+    """
+    all_facts = detect_all(benchmark_data)
+    selected = select_facts(all_facts, max_facts=max_facts)
+    sentences = render_synthesis(selected, lang=lang)
+    return {
+        "sentences": sentences,
+        "facts": [f.as_dict() for f in selected],
+        "lang": lang,
+    }
+

 __all__ = [
     "Fact",
@@ -23,4 +68,11 @@ __all__ = [
     "FactImportance",
     "DetectorRegistry",
     "detect_all",
+    "select_facts",
+    "render_fact",
+    "render_synthesis",
+    "extract_numbers",
+    "build_synthesis",
+    "register_default_detectors",
+    "DETECTORS_BY_TYPE",
 ]
--- /dev/null
+++ b/picarones/core/narrative/arbiter.py
@@ -0,0 +1,136 @@
+"""Arbiter for selecting narrative facts.
+
+The arbiter turns a potentially long list of detected ``Fact`` objects
+into a short summary (3 to 5 sentences) suited to opening the report.
+
+Selection rules:
+1. Sort by decreasing importance, then by type (canonical order).
+2. Non-redundancy: a single fact per engine, unless the types are
+   complementary (e.g. ``GLOBAL_LEADER_CER`` + ``SIGNIFICANT_GAP``
+   both concern the leader but carry different information).
+3. Limit: at most ``max_facts`` facts retained (default 5).
+4. Determinism: stable sort on (−importance, canonical type order,
+   engine names) to guarantee bit-for-bit identical output.
+
+Detectors may emit several facts of the same type (e.g. several
+``STATISTICAL_TIE`` facts if there are several distinct groups). The arbiter
+does not merge them but may limit them per type.
+"""
+
+from __future__ import annotations
+
+from typing import Iterable
+
+from picarones.core.narrative.facts import Fact, FactImportance, FactType
+
+
+# Canonical type order, used to break ties at equal importance.
+_TYPE_ORDER: tuple[FactType, ...] = (
+    FactType.GLOBAL_LEADER_CER,
+    FactType.STATISTICAL_TIE,
+    FactType.SIGNIFICANT_GAP,
+    FactType.STRATUM_WINNER,
+    FactType.STRATUM_COLLAPSE,
+    FactType.ERROR_PROFILE_OUTLIER,
+    FactType.LLM_HALLUCINATION_FLAG,
+    FactType.ROBUSTNESS_FRAGILE,
+    FactType.PARETO_ALTERNATIVE,
+    FactType.SPEED_WINNER,
+    FactType.COST_OUTLIER,
+    FactType.CONFIDENCE_WARNING,
+)
+_TYPE_INDEX: dict[FactType, int] = {t: i for i, t in enumerate(_TYPE_ORDER)}
+
+
+# Pairs of types that are NOT considered redundant even when
+# they concern the same engine. Any other pair → a single fact is kept
+# for the engine (the most important one).
+_COMPLEMENTARY_PAIRS: frozenset[frozenset[FactType]] = frozenset({
+    frozenset({FactType.GLOBAL_LEADER_CER, FactType.SIGNIFICANT_GAP}),
+    frozenset({FactType.GLOBAL_LEADER_CER, FactType.SPEED_WINNER}),
+    frozenset({FactType.GLOBAL_LEADER_CER, FactType.CONFIDENCE_WARNING}),
+    frozenset({FactType.STATISTICAL_TIE, FactType.SPEED_WINNER}),
+})
+
+
+def _sort_key(fact: Fact) -> tuple:
+    """Stable sort key: importance (desc), canonical type, engines."""
+    return (
+        -int(fact.importance),
+        _TYPE_INDEX.get(fact.type, len(_TYPE_ORDER)),
+        tuple(sorted(fact.engines_involved)),
+        fact.stratum or "",
+    )
+
+
+def _is_redundant(candidate: Fact, kept: Fact) -> bool:
+    """True if ``candidate`` adds too little compared with ``kept``.
+
+    Two facts are redundant if they concern exactly the same engine,
+    have the same type, and the same stratum (if any). Different types
+    on the same engine are considered redundant only if they
+    do not belong to the complementary pairs (e.g. a leader can
+    also be fast; that is complementary).
+    """
+    if candidate.type == kept.type and candidate.stratum == kept.stratum:
+        return set(candidate.engines_involved) == set(kept.engines_involved)
+    if set(candidate.engines_involved) == set(kept.engines_involved):
+        pair = frozenset({candidate.type, kept.type})
+        return pair not in _COMPLEMENTARY_PAIRS
+    return False
+
+
+def _remove_contradictions(facts: list[Fact]) -> list[Fact]:
+    """Removes facts that are statistically inconsistent.
+
+    Core rule: if Nemenyi (post-hoc corrected for multiple comparisons)
+    places two engines in the same tie group, then a ``SIGNIFICANT_GAP``
+    based on uncorrected Wilcoxon between those same two engines is misleading
+    for a reader who is not a statistician. Nemenyi wins.
+    """
+    tied_groups: list[set[str]] = []
+    for f in facts:
+        if f.type == FactType.STATISTICAL_TIE:
+            tied_groups.append(set(f.engines_involved))
+
+    def _is_contradicted(fact: Fact) -> bool:
+        if fact.type != FactType.SIGNIFICANT_GAP:
+            return False
+        pair = set(fact.engines_involved)
+        return any(pair <= group for group in tied_groups)
+
+    return [f for f in facts if not _is_contradicted(f)]
+
+
+def select_facts(
+    facts: Iterable[Fact],
+    max_facts: int = 5,
+    min_importance: FactImportance = FactImportance.MEDIUM,
+) -> list[Fact]:
+    """Selects the final summary from a raw list of facts.
+
+    Parameters
+    ----------
+    facts:
+        Raw list of ``Fact`` objects produced by ``DetectorRegistry.run``.
+    max_facts:
+        Maximum number of facts retained (default: 5).
+    min_importance:
+        Minimum importance threshold. ``LOW`` facts are excluded by default.
+
+    Returns
+    -------
+    Ordered list, ready to be rendered. Always ≤ ``max_facts`` items.
+    """
+    facts_list = [f for f in facts if int(f.importance) >= int(min_importance)]
+    facts_list = _remove_contradictions(facts_list)
+    ranked = sorted(facts_list, key=_sort_key)
+
+    selected: list[Fact] = []
+    for fact in ranked:
+        if any(_is_redundant(fact, kept) for kept in selected):
+            continue
+        selected.append(fact)
+        if len(selected) >= max_facts:
+            break
+    return selected
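To see the Nemenyi-over-Wilcoxon rule in action without installing the package, the following self-contained sketch reproduces the subset test on toy data. `ToyFact` is a stand-in for the real `Fact` model, the type names are plain strings rather than `FactType` members, and `kraken` is a hypothetical extra engine added only so that one gap survives.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyFact:
    """Minimal stand-in for Fact: just a type and the engines it mentions."""
    type: str
    engines_involved: tuple


def drop_contradicted_gaps(facts):
    """Drop every SIGNIFICANT_GAP whose engines all sit inside one tie group."""
    tied_groups = [
        set(f.engines_involved) for f in facts if f.type == "STATISTICAL_TIE"
    ]
    return [
        f
        for f in facts
        if not (
            f.type == "SIGNIFICANT_GAP"
            and any(set(f.engines_involved) <= group for group in tied_groups)
        )
    ]


facts = [
    ToyFact("STATISTICAL_TIE", ("pero_ocr", "tesseract", "gpt-4o-vision")),
    ToyFact("SIGNIFICANT_GAP", ("pero_ocr", "tesseract")),  # inside the tie group
    ToyFact("SIGNIFICANT_GAP", ("pero_ocr", "kraken")),     # kraken is outside it
]

kept = drop_contradicted_gaps(facts)
for fact in kept:
    print(fact.type, fact.engines_involved)
```

The subset test (`<=`) is what makes the rule apply to groups of any size: a gap is contradicted as soon as both of its engines belong to the same tie group, even when that group contains other engines as well.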
--- a/picarones/core/narrative/detectors.py
+++ b/picarones/core/narrative/detectors.py
@@ -1,40 +1,86 @@
-"""Fact detectors:

 Each detector is a pure function ``(benchmark_data: dict) -> list[Fact]``.
-The sprint that implements each detector is given in its docstring.
-
 Convention: a detector that finds nothing returns an empty list. It must
 never raise an exception; error handling is centralised in
 ``DetectorRegistry.run``.
 """

 from __future__ import annotations

 from picarones.core.narrative.facts import Fact, FactImportance, FactType


 # ---------------------------------------------------------------------------
-#
 # ---------------------------------------------------------------------------

 def detect_global_leader_cer(benchmark_data: dict) -> list[Fact]:
-    """

-
-
     """
-


-def detect_statistical_tie(benchmark_data: dict) -> list[Fact]:
-    """Detects groups of statistically indistinguishable engines.

-
-
-    each non-trivial tie group (≥ 2 engines). The presence of this fact
-    is an important signal for the summary: "engines X, Y, Z are
-    statistically indistinguishable at the α = 0.05 threshold".
-    """
     nemenyi = benchmark_data.get("statistics", {}).get("nemenyi", {})
     if not nemenyi or nemenyi.get("error"):
         return []
@@ -48,9 +94,7 @@ def detect_statistical_tie(benchmark_data: dict) -> list[Fact]:
     facts: list[Fact] = []
     for group in tied_groups:
         if len(group) < 2:
-            continue
-        # Importance: a group that includes the leader (lowest rank) is critical
-        # (it strongly qualifies the ordinal ranking); the others are HIGH.
         is_leader_tie = min(mean_ranks.get(n, 999) for n in group) == min(
             mean_ranks.values(), default=0
         )
@@ -61,11 +105,13 @@ def detect_statistical_tie(benchmark_data: dict) -> list[Fact]:
             importance=importance,
             payload={
                 "engines": list(group),
                 "mean_ranks": {n: mean_ranks.get(n) for n in group},
-                "critical_distance": cd,
                 "alpha": alpha,
                 "n_blocks": n_blocks,
                 "includes_leader": is_leader_tie,
             },
             engines_involved=tuple(group),
         ))
@@ -73,73 +119,447 @@ def detect_statistical_tie(benchmark_data: dict) -> list[Fact]:


 def detect_significant_gap(benchmark_data: dict) -> list[Fact]:
-    """
-


 def detect_pareto_alternative(benchmark_data: dict) -> list[Fact]:
-    """

-
     """
-


 def detect_stratum_winner(benchmark_data: dict) -> list[Fact]:
-    """
-


 def detect_stratum_collapse(benchmark_data: dict) -> list[Fact]:
-    """
-


 def detect_error_profile_outlier(benchmark_data: dict) -> list[Fact]:
-    """

-
-
     """
-


 def detect_llm_hallucination_flag(benchmark_data: dict) -> list[Fact]:
-    """

-
-    ``
-    emits a Fact if an engine significantly exceeds the median.
-    Full implementation in Sprint 4.
     """
-


 def detect_robustness_fragile(benchmark_data: dict) -> list[Fact]:
-    """
-    return []


-
-    ""

-
-
     return []


 def detect_speed_winner(benchmark_data: dict) -> list[Fact]:
-    """
-


 def detect_confidence_warning(benchmark_data: dict) -> list[Fact]:
-    """
-


 # ---------------------------------------------------------------------------
-# Default registration:
 # ---------------------------------------------------------------------------

 DETECTORS_BY_TYPE = {
@@ -156,7 +576,14 @@ DETECTORS_BY_TYPE = {
     FactType.SPEED_WINNER: detect_speed_winner,
     FactType.CONFIDENCE_WARNING: detect_confidence_warning,
 }
-
-
-
-
| 1 |
+
"""Détecteurs de faits — implémentations Sprint 4.
|
| 2 |
|
| 3 |
Chaque détecteur est une fonction pure ``(benchmark_data: dict) -> list[Fact]``.
|
|
|
|
|
|
|
| 4 |
Convention: a detector that finds nothing returns an empty list. It must
|
| 5 |
never raise an exception; error handling is centralized in
|
| 6 |
``DetectorRegistry.run``.
|
| 7 |
+
|
| 8 |
+
Anti-hallucination rule: every number or name placed in ``payload`` must
|
| 9 |
+
come directly from the input JSON (never from an interpolation). The Sprint 4
|
| 10 |
+
tests parse the rendered synthesis and check that every numeric value
|
| 11 |
+
it contains is traceable.
|
| 12 |
"""
|
| 13 |
|
| 14 |
from __future__ import annotations
|
| 15 |
|
| 16 |
+
import statistics as _stats
|
| 17 |
+
from typing import Optional
|
| 18 |
+
|
| 19 |
from picarones.core.narrative.facts import Fact, FactImportance, FactType
|
| 20 |
|
| 21 |
|
| 22 |
# ---------------------------------------------------------------------------
|
| 23 |
+
# Internal helpers
|
| 24 |
+
# ---------------------------------------------------------------------------
|
| 25 |
+
|
| 26 |
+
def _engines_summary(data: dict) -> list[dict]:
|
| 27 |
+
"""Normalized access to the list of per-engine summaries."""
|
| 28 |
+
return data.get("engines", []) or []
|
| 29 |
+
|
| 30 |
+
|
| 31 |
+
def _engine_by_name(data: dict, name: str) -> Optional[dict]:
|
| 32 |
+
for e in _engines_summary(data):
|
| 33 |
+
if e.get("name") == name:
|
| 34 |
+
return e
|
| 35 |
+
return None
|
| 36 |
+
|
| 37 |
+
|
| 38 |
+
def _n_docs(data: dict) -> int:
|
| 39 |
+
meta = data.get("meta", {}) or {}
|
| 40 |
+
return int(meta.get("document_count") or 0)
|
| 41 |
+
|
| 42 |
+
|
| 43 |
+
# ---------------------------------------------------------------------------
|
| 44 |
+
# Sprint 4: implemented detectors
|
| 45 |
# ---------------------------------------------------------------------------
|
| 46 |
|
| 47 |
def detect_global_leader_cer(benchmark_data: dict) -> list[Fact]:
|
| 48 |
+
"""Engine with the lowest mean CER over the whole corpus.
|
| 49 |
|
| 50 |
+
Emits a CRITICAL Fact as soon as one engine has a computed CER. When a
|
| 51 |
+
second engine is ranked, it is attached so the arbiter can merge with ``significant_gap``.
|
| 52 |
"""
|
| 53 |
+
ranking = benchmark_data.get("ranking") or []
|
| 54 |
+
# Drop entries without a computed CER
|
| 55 |
+
valid = [r for r in ranking if r.get("mean_cer") is not None]
|
| 56 |
+
if len(valid) < 1:
|
| 57 |
+
return []
|
| 58 |
|
| 59 |
+
leader = valid[0]
|
| 60 |
+
runner_up = valid[1] if len(valid) >= 2 else None
|
| 61 |
+
|
| 62 |
+
payload = {
|
| 63 |
+
"engine": leader["engine"],
|
| 64 |
+
"cer": float(leader["mean_cer"]),
|
| 65 |
+
"cer_pct": round(float(leader["mean_cer"]) * 100, 2),
|
| 66 |
+
"n_engines": len(valid),
|
| 67 |
+
"n_docs": _n_docs(benchmark_data),
|
| 68 |
+
}
|
| 69 |
+
if runner_up is not None:
|
| 70 |
+
payload["runner_up"] = runner_up["engine"]
|
| 71 |
+
payload["runner_up_cer"] = float(runner_up["mean_cer"])
|
| 72 |
+
payload["runner_up_cer_pct"] = round(float(runner_up["mean_cer"]) * 100, 2)
|
| 73 |
+
|
| 74 |
+
return [Fact(
|
| 75 |
+
type=FactType.GLOBAL_LEADER_CER,
|
| 76 |
+
importance=FactImportance.CRITICAL,
|
| 77 |
+
payload=payload,
|
| 78 |
+
engines_involved=(leader["engine"],),
|
| 79 |
+
)]
|
| 80 |
|
|
|
|
|
|
|
| 81 |
|
| 82 |
+
def detect_statistical_tie(benchmark_data: dict) -> list[Fact]:
|
| 83 |
+
"""Groups of statistically indistinguishable engines (Nemenyi)."""
|
|
|
|
|
|
|
|
|
|
|
|
|
| 84 |
nemenyi = benchmark_data.get("statistics", {}).get("nemenyi", {})
|
| 85 |
if not nemenyi or nemenyi.get("error"):
|
| 86 |
return []
|
|
|
|
| 94 |
facts: list[Fact] = []
|
| 95 |
for group in tied_groups:
|
| 96 |
if len(group) < 2:
|
| 97 |
+
continue
|
|
|
|
|
|
|
| 98 |
is_leader_tie = min(mean_ranks.get(n, 999) for n in group) == min(
|
| 99 |
mean_ranks.values(), default=0
|
| 100 |
)
|
|
|
|
| 105 |
importance=importance,
|
| 106 |
payload={
|
| 107 |
"engines": list(group),
|
| 108 |
+
"engines_list": ", ".join(group),
|
| 109 |
"mean_ranks": {n: mean_ranks.get(n) for n in group},
|
| 110 |
+
"critical_distance": round(cd, 3),
|
| 111 |
"alpha": alpha,
|
| 112 |
"n_blocks": n_blocks,
|
| 113 |
"includes_leader": is_leader_tie,
|
| 114 |
+
"n_tied": len(group),
|
| 115 |
},
|
| 116 |
engines_involved=tuple(group),
|
| 117 |
))
|
|
|
|
| 119 |
|
| 120 |
|
| 121 |
def detect_significant_gap(benchmark_data: dict) -> list[Fact]:
|
| 122 |
+
"""Statistically significant gap between the 1st and the 2nd of the ranking.
|
| 123 |
+
|
| 124 |
+
Reads the pairwise Wilcoxon matrix and checks whether the pair (leader,
|
| 125 |
+
runner-up) appears there with ``significant = True``.
|
| 126 |
+
"""
|
| 127 |
+
ranking = benchmark_data.get("ranking") or []
|
| 128 |
+
valid = [r for r in ranking if r.get("mean_cer") is not None]
|
| 129 |
+
if len(valid) < 2:
|
| 130 |
+
return []
|
| 131 |
+
|
| 132 |
+
leader = valid[0]["engine"]
|
| 133 |
+
runner_up = valid[1]["engine"]
|
| 134 |
+
|
| 135 |
+
pairwise = benchmark_data.get("statistics", {}).get("pairwise_wilcoxon") or []
|
| 136 |
+
match = None
|
| 137 |
+
for p in pairwise:
|
| 138 |
+
names = {p.get("engine_a"), p.get("engine_b")}
|
| 139 |
+
if names == {leader, runner_up}:
|
| 140 |
+
match = p
|
| 141 |
+
break
|
| 142 |
+
if match is None:
|
| 143 |
+
return []
|
| 144 |
+
|
| 145 |
+
if not match.get("significant"):
|
| 146 |
+
return []  # no significant gap, nothing to report here
|
| 147 |
+
|
| 148 |
+
delta_cer = abs(float(valid[0]["mean_cer"]) - float(valid[1]["mean_cer"]))
|
| 149 |
+
return [Fact(
|
| 150 |
+
type=FactType.SIGNIFICANT_GAP,
|
| 151 |
+
importance=FactImportance.CRITICAL,
|
| 152 |
+
payload={
|
| 153 |
+
"leader": leader,
|
| 154 |
+
"runner_up": runner_up,
|
| 155 |
+
"p_value": float(match.get("p_value", 0.0)),
|
| 156 |
+
"delta_cer": round(delta_cer, 4),
|
| 157 |
+
"delta_cer_pct": round(delta_cer * 100, 2),
|
| 158 |
+
"n_pairs": int(match.get("n_pairs", 0)),
|
| 159 |
+
},
|
| 160 |
+
engines_involved=(leader, runner_up),
|
| 161 |
+
)]
|
| 162 |
|
| 163 |
|
| 164 |
def detect_pareto_alternative(benchmark_data: dict) -> list[Fact]:
|
| 165 |
+
"""Pareto-dominant engine other than the CER leader. Sprint 5."""
|
| 166 |
+
return []
|
| 167 |
|
| 168 |
+
|
| 169 |
+
def _stratum_cer_by_engine(benchmark_data: dict) -> dict[str, dict[str, list[float]]]:
|
| 170 |
+
"""Aggregates CER values by (engine, stratum).
|
| 171 |
+
|
| 172 |
+
Stratum = ``document["script_type"]`` when present. Returns ``{}`` if no
|
| 173 |
+
document exposes a stratum (nothing can be emitted).
|
| 174 |
"""
|
| 175 |
+
out: dict[str, dict[str, list[float]]] = {}
|
| 176 |
+
for doc in benchmark_data.get("documents") or []:
|
| 177 |
+
stratum = doc.get("script_type")
|
| 178 |
+
if not stratum:
|
| 179 |
+
continue
|
| 180 |
+
for er in doc.get("engine_results") or []:
|
| 181 |
+
if er.get("error"):
|
| 182 |
+
continue
|
| 183 |
+
cer = er.get("cer")
|
| 184 |
+
if cer is None:
|
| 185 |
+
continue
|
| 186 |
+
name = er.get("engine")
|
| 187 |
+
out.setdefault(name, {}).setdefault(stratum, []).append(float(cer))
|
| 188 |
+
return out
|
| 189 |
|
| 190 |
|
| 191 |
def detect_stratum_winner(benchmark_data: dict) -> list[Fact]:
|
| 192 |
+
"""Engine that clearly dominates on a stratum (≥ 3 documents, CER
|
| 193 |
+
at least 25 % lower than the second engine on that stratum).
|
| 194 |
+
"""
|
| 195 |
+
agg = _stratum_cer_by_engine(benchmark_data)
|
| 196 |
+
if not agg:
|
| 197 |
+
return []
|
| 198 |
+
|
| 199 |
+
# Invert: {stratum: {engine: mean_cer}}
|
| 200 |
+
by_stratum: dict[str, dict[str, float]] = {}
|
| 201 |
+
for engine, strata in agg.items():
|
| 202 |
+
for stratum, vals in strata.items():
|
| 203 |
+
if len(vals) < 3:
|
| 204 |
+
continue
|
| 205 |
+
by_stratum.setdefault(stratum, {})[engine] = sum(vals) / len(vals)
|
| 206 |
+
|
| 207 |
+
facts: list[Fact] = []
|
| 208 |
+
for stratum, engine_cer in by_stratum.items():
|
| 209 |
+
if len(engine_cer) < 2:
|
| 210 |
+
continue
|
| 211 |
+
ordered = sorted(engine_cer.items(), key=lambda kv: kv[1])
|
| 212 |
+
best_name, best_cer = ordered[0]
|
| 213 |
+
second_cer = ordered[1][1]
|
| 214 |
+
if second_cer == 0:
|
| 215 |
+
continue
|
| 216 |
+
if best_cer <= second_cer * 0.75:  # CER at least 25 % lower than the second engine
|
| 217 |
+
facts.append(Fact(
|
| 218 |
+
type=FactType.STRATUM_WINNER,
|
| 219 |
+
importance=FactImportance.HIGH,
|
| 220 |
+
payload={
|
| 221 |
+
"engine": best_name,
|
| 222 |
+
"stratum": stratum,
|
| 223 |
+
"cer": round(best_cer, 4),
|
| 224 |
+
"cer_pct": round(best_cer * 100, 2),
|
| 225 |
+
"second_engine": ordered[1][0],
|
| 226 |
+
"second_cer": round(second_cer, 4),
|
| 227 |
+
"second_cer_pct": round(second_cer * 100, 2),
|
| 228 |
+
"n_docs_stratum": len(agg[best_name][stratum]),
|
| 229 |
+
},
|
| 230 |
+
engines_involved=(best_name,),
|
| 231 |
+
stratum=stratum,
|
| 232 |
+
))
|
| 233 |
+
return facts
|
| 234 |
|
| 235 |
|
| 236 |
def detect_stratum_collapse(benchmark_data: dict) -> list[Fact]:
|
| 237 |
+
"""Globally competitive engine that collapses on one stratum.
|
| 238 |
+
|
| 239 |
+
Triggered when, for an engine, the mean CER on a stratum of ≥ 3 documents
|
| 240 |
+
is more than twice that engine's global CER and more than 5 CER points above it.
|
| 241 |
+
"""
|
| 242 |
+
agg = _stratum_cer_by_engine(benchmark_data)
|
| 243 |
+
if not agg:
|
| 244 |
+
return []
|
| 245 |
+
|
| 246 |
+
facts: list[Fact] = []
|
| 247 |
+
for engine_name, strata in agg.items():
|
| 248 |
+
summary = _engine_by_name(benchmark_data, engine_name) or {}
|
| 249 |
+
global_cer = summary.get("cer")
|
| 250 |
+
if global_cer is None:
|
| 251 |
+
continue
|
| 252 |
+
global_cer = float(global_cer)
|
| 253 |
+
if global_cer <= 0:
|
| 254 |
+
continue
|
| 255 |
+
for stratum, vals in strata.items():
|
| 256 |
+
if len(vals) < 3:
|
| 257 |
+
continue
|
| 258 |
+
local_cer = sum(vals) / len(vals)
|
| 259 |
+
if local_cer > 2.0 * global_cer and (local_cer - global_cer) > 0.05:
|
| 260 |
+
facts.append(Fact(
|
| 261 |
+
type=FactType.STRATUM_COLLAPSE,
|
| 262 |
+
importance=FactImportance.HIGH,
|
| 263 |
+
payload={
|
| 264 |
+
"engine": engine_name,
|
| 265 |
+
"stratum": stratum,
|
| 266 |
+
"local_cer": round(local_cer, 4),
|
| 267 |
+
"local_cer_pct": round(local_cer * 100, 2),
|
| 268 |
+
"global_cer": round(global_cer, 4),
|
| 269 |
+
"global_cer_pct": round(global_cer * 100, 2),
|
| 270 |
+
"delta_cer_pct": round((local_cer - global_cer) * 100, 2),
|
| 271 |
+
"n_docs_stratum": len(vals),
|
| 272 |
+
},
|
| 273 |
+
engines_involved=(engine_name,),
|
| 274 |
+
stratum=stratum,
|
| 275 |
+
))
|
| 276 |
+
return facts
|
| 277 |
|
| 278 |
|
| 279 |
def detect_error_profile_outlier(benchmark_data: dict) -> list[Fact]:
|
| 280 |
+
"""Engine with an atypical taxonomic error profile.
|
| 281 |
|
| 282 |
+
Emits a Fact when, for a given engine and error class, the relative share
|
| 283 |
+
is at least 2× the median across all engines (the engine itself included)
|
| 284 |
+
and at least 15 % of the total, which filters out marginal classes.
|
| 285 |
"""
|
| 286 |
+
engines = _engines_summary(benchmark_data)
|
| 287 |
+
# {engine: {class_name: proportion}}
|
| 288 |
+
profiles: dict[str, dict[str, float]] = {}
|
| 289 |
+
for e in engines:
|
| 290 |
+
tax = e.get("aggregated_taxonomy") or {}
|
| 291 |
+
distribution = tax.get("distribution") or tax.get("proportions") or {}
|
| 292 |
+
if not distribution:
|
| 293 |
+
continue
|
| 294 |
+
profiles[e["name"]] = {k: float(v) for k, v in distribution.items()}
|
| 295 |
+
if len(profiles) < 2:
|
| 296 |
+
return []
|
| 297 |
+
|
| 298 |
+
# Collect every error class encountered
|
| 299 |
+
all_classes: set[str] = set()
|
| 300 |
+
for p in profiles.values():
|
| 301 |
+
all_classes.update(p.keys())
|
| 302 |
+
|
| 303 |
+
facts: list[Fact] = []
|
| 304 |
+
for cls in sorted(all_classes):  # sorted: set order varies between runs, output must be deterministic
|
| 305 |
+
values = [(name, p.get(cls, 0.0)) for name, p in profiles.items()]
|
| 306 |
+
props = [v for _, v in values]
|
| 307 |
+
if not props:
|
| 308 |
+
continue
|
| 309 |
+
median_prop = _stats.median(props)
|
| 310 |
+
for name, v in values:
|
| 311 |
+
if v < 0.15:  # too marginal to be noteworthy
|
| 312 |
+
continue
|
| 313 |
+
if median_prop <= 0:
|
| 314 |
+
continue
|
| 315 |
+
if v >= 2.0 * median_prop:
|
| 316 |
+
facts.append(Fact(
|
| 317 |
+
type=FactType.ERROR_PROFILE_OUTLIER,
|
| 318 |
+
importance=FactImportance.HIGH,
|
| 319 |
+
payload={
|
| 320 |
+
"engine": name,
|
| 321 |
+
"error_class": cls,
|
| 322 |
+
"proportion": round(v, 4),
|
| 323 |
+
"proportion_pct": round(v * 100, 1),
|
| 324 |
+
"median_proportion": round(median_prop, 4),
|
| 325 |
+
"median_proportion_pct": round(median_prop * 100, 1),
|
| 326 |
+
"ratio_to_median": round(v / median_prop, 2) if median_prop else None,
|
| 327 |
+
},
|
| 328 |
+
engines_involved=(name,),
|
| 329 |
+
))
|
| 330 |
+
return facts
|
| 331 |
|
| 332 |
|
| 333 |
def detect_llm_hallucination_flag(benchmark_data: dict) -> list[Fact]:
|
| 334 |
+
"""LLM/VLM with a notably high hallucination rate.
|
| 335 |
|
| 336 |
+
Triggered when ``hallucinating_doc_rate`` > 30 %, ``anchor_score_mean`` < 0.6 or
|
| 337 |
+
``length_ratio_mean`` > 1.3, for an engine whose ``is_pipeline`` or ``is_vlm`` field is ``True``.
|
|
|
|
|
|
|
| 338 |
"""
|
| 339 |
+
facts: list[Fact] = []
|
| 340 |
+
for e in _engines_summary(benchmark_data):
|
| 341 |
+
agg = e.get("aggregated_hallucination") or {}
|
| 342 |
+
if not agg:
|
| 343 |
+
continue
|
| 344 |
+
rate = agg.get("hallucinating_doc_rate")
|
| 345 |
+
anchor = agg.get("anchor_score_mean")
|
| 346 |
+
length_ratio = agg.get("length_ratio_mean")
|
| 347 |
+
# Only signal when the engine is an LLM pipeline or a VLM
|
| 348 |
+
is_llm = bool(e.get("is_pipeline")) or bool(e.get("is_vlm"))
|
| 349 |
+
if not is_llm:
|
| 350 |
+
continue
|
| 351 |
+
|
| 352 |
+
flagged = False
|
| 353 |
+
reasons = []
|
| 354 |
+
if rate is not None and float(rate) > 0.30:
|
| 355 |
+
flagged = True
|
| 356 |
+
reasons.append("taux de documents hallucinés")
|
| 357 |
+
if anchor is not None and float(anchor) < 0.60:
|
| 358 |
+
flagged = True
|
| 359 |
+
reasons.append("ancrage faible")
|
| 360 |
+
if length_ratio is not None and float(length_ratio) > 1.30:
|
| 361 |
+
flagged = True
|
| 362 |
+
reasons.append("sortie anormalement longue")
|
| 363 |
+
if not flagged:
|
| 364 |
+
continue
|
| 365 |
+
|
| 366 |
+
facts.append(Fact(
|
| 367 |
+
type=FactType.LLM_HALLUCINATION_FLAG,
|
| 368 |
+
importance=FactImportance.HIGH,
|
| 369 |
+
payload={
|
| 370 |
+
"engine": e["name"],
|
| 371 |
+
"hallucinating_rate": round(float(rate or 0.0), 4),
|
| 372 |
+
"hallucinating_rate_pct": round(float(rate or 0.0) * 100, 1),
|
| 373 |
+
"anchor_score": round(float(anchor), 3) if anchor is not None else None,
|
| 374 |
+
"length_ratio": round(float(length_ratio), 3) if length_ratio is not None else None,
|
| 375 |
+
"reasons": reasons,
|
| 376 |
+
"reasons_list": ", ".join(reasons),
|
| 377 |
+
},
|
| 378 |
+
engines_involved=(e["name"],),
|
| 379 |
+
))
|
| 380 |
+
return facts
|
| 381 |
|
| 382 |
|
| 383 |
def detect_robustness_fragile(benchmark_data: dict) -> list[Fact]:
|
| 384 |
+
"""Engine that degrades sharply above a noise/blur threshold.
|
|
|
|
| 385 |
|
| 386 |
+
Enabled when robustness data is embedded in
|
| 387 |
+
``benchmark_data["robustness"]`` (outside the scope of the regular benchmark,
|
| 388 |
+
produced by ``picarones robustness`` and optionally injected).
|
| 389 |
+
"""
|
| 390 |
+
robustness = benchmark_data.get("robustness")
|
| 391 |
+
if not robustness:
|
| 392 |
+
return []
|
| 393 |
|
| 394 |
+
facts: list[Fact] = []
|
| 395 |
+
curves = robustness.get("curves") or robustness.get("engines") or []
|
| 396 |
+
# Expected structure: [{engine, degradation_type, points: [{level, cer}]}]
|
| 397 |
+
# Flag: CER at the highest level ≥ 3× CER at the lowest level, and above 15 %.
|
| 398 |
+
for entry in curves:
|
| 399 |
+
engine = entry.get("engine")
|
| 400 |
+
dtype = entry.get("degradation_type")
|
| 401 |
+
points = entry.get("points") or []
|
| 402 |
+
if not engine or not points or len(points) < 2:
|
| 403 |
+
continue
|
| 404 |
+
try:
|
| 405 |
+
sorted_pts = sorted(points, key=lambda p: float(p["level"]))
|
| 406 |
+
except (KeyError, TypeError, ValueError):
|
| 407 |
+
continue
|
| 408 |
+
first, last = sorted_pts[0], sorted_pts[-1]
|
| 409 |
+
c0 = float(first.get("cer") or 0.0)
|
| 410 |
+
c1 = float(last.get("cer") or 0.0)
|
| 411 |
+
if c0 <= 0.01:  # avoid dividing by a near-zero baseline
|
| 412 |
+
continue
|
| 413 |
+
if c1 >= 3.0 * c0 and c1 > 0.15:
|
| 414 |
+
facts.append(Fact(
|
| 415 |
+
type=FactType.ROBUSTNESS_FRAGILE,
|
| 416 |
+
importance=FactImportance.HIGH,
|
| 417 |
+
payload={
|
| 418 |
+
"engine": engine,
|
| 419 |
+
"degradation": dtype,
|
| 420 |
+
"cer_baseline": round(c0, 4),
|
| 421 |
+
"cer_baseline_pct": round(c0 * 100, 1),
|
| 422 |
+
"cer_degraded": round(c1, 4),
|
| 423 |
+
"cer_degraded_pct": round(c1 * 100, 1),
|
| 424 |
+
"ratio": round(c1 / c0, 1),
|
| 425 |
+
"level_max": float(last.get("level") or 0),
|
| 426 |
+
},
|
| 427 |
+
engines_involved=(engine,),
|
| 428 |
+
))
|
| 429 |
+
return facts
|
| 430 |
|
| 431 |
+
|
| 432 |
+
def detect_cost_outlier(benchmark_data: dict) -> list[Fact]:
|
| 433 |
+
"""Engine with a very unfavorable cost/quality ratio. Sprint 5."""
|
| 434 |
return []
|
| 435 |
|
| 436 |
|
| 437 |
+
def _mean_duration_per_engine(benchmark_data: dict) -> dict[str, float]:
|
| 438 |
+
"""Mean execution time per engine (in seconds per document)."""
|
| 439 |
+
durations: dict[str, list[float]] = {}
|
| 440 |
+
for doc in benchmark_data.get("documents") or []:
|
| 441 |
+
for er in doc.get("engine_results") or []:
|
| 442 |
+
d = er.get("duration")
|
| 443 |
+
if d is None:
|
| 444 |
+
continue
|
| 445 |
+
durations.setdefault(er["engine"], []).append(float(d))
|
| 446 |
+
return {k: sum(v) / len(v) for k, v in durations.items() if v}
|
| 447 |
+
|
| 448 |
+
|
| 449 |
def detect_speed_winner(benchmark_data: dict) -> list[Fact]:
|
| 450 |
+
"""Engine that is significantly faster at comparable quality.
|
| 451 |
+
|
| 452 |
+
Triggered when an engine is at least 3× faster than the median AND its
|
| 453 |
+
CER is not significantly worse (same Nemenyi group as
|
| 454 |
+
the leader OR CER ≤ 1.1 × the leader's CER).
|
| 455 |
+
"""
|
| 456 |
+
durations = _mean_duration_per_engine(benchmark_data)
|
| 457 |
+
if len(durations) < 2:
|
| 458 |
+
return []
|
| 459 |
+
|
| 460 |
+
values = list(durations.values())
|
| 461 |
+
median_dur = _stats.median(values)
|
| 462 |
+
if median_dur <= 0:
|
| 463 |
+
return []
|
| 464 |
+
|
| 465 |
+
ranking = benchmark_data.get("ranking") or []
|
| 466 |
+
valid = [r for r in ranking if r.get("mean_cer") is not None]
|
| 467 |
+
if not valid:
|
| 468 |
+
return []
|
| 469 |
+
leader_cer = float(valid[0]["mean_cer"])
|
| 470 |
+
quality_ceiling = max(0.01, leader_cer * 1.10)
|
| 471 |
+
|
| 472 |
+
tied_groups = benchmark_data.get("statistics", {}).get("nemenyi", {}).get("tied_groups") or []
|
| 473 |
+
leader_group: set[str] = set()
|
| 474 |
+
for g in tied_groups:
|
| 475 |
+
if valid[0]["engine"] in g:
|
| 476 |
+
leader_group = set(g)
|
| 477 |
+
break
|
| 478 |
+
|
| 479 |
+
facts: list[Fact] = []
|
| 480 |
+
candidates = sorted(durations.items(), key=lambda kv: kv[1])
|
| 481 |
+
for engine, dur in candidates:
|
| 482 |
+
if dur * 3.0 > median_dur:
|
| 483 |
+
break  # the remaining candidates are even slower
|
| 484 |
+
summary = _engine_by_name(benchmark_data, engine) or {}
|
| 485 |
+
engine_cer = summary.get("cer")
|
| 486 |
+
if engine_cer is None:
|
| 487 |
+
continue
|
| 488 |
+
acceptable_quality = (
|
| 489 |
+
engine in leader_group or float(engine_cer) <= quality_ceiling
|
| 490 |
+
)
|
| 491 |
+
if not acceptable_quality:
|
| 492 |
+
continue
|
| 493 |
+
facts.append(Fact(
|
| 494 |
+
type=FactType.SPEED_WINNER,
|
| 495 |
+
importance=FactImportance.MEDIUM,
|
| 496 |
+
payload={
|
| 497 |
+
"engine": engine,
|
| 498 |
+
"mean_duration": round(dur, 3),
|
| 499 |
+
"median_duration": round(median_dur, 3),
|
| 500 |
+
"speedup": round(median_dur / dur, 1) if dur > 0 else None,
|
| 501 |
+
"cer": round(float(engine_cer), 4),
|
| 502 |
+
"cer_pct": round(float(engine_cer) * 100, 2),
|
| 503 |
+
},
|
| 504 |
+
engines_involved=(engine,),
|
| 505 |
+
))
|
| 506 |
+
return facts[:1]  # only the fastest engine, to avoid noise
|
| 507 |
|
| 508 |
|
| 509 |
def detect_confidence_warning(benchmark_data: dict) -> list[Fact]:
|
| 510 |
+
"""Wide confidence interval, hence an unreliable ranking.
|
| 511 |
+
|
| 512 |
+
Triggered when, for the leader or the runner-up, the 95 % CI width
|
| 513 |
+
is more than three times the gap |leader − runner-up| OR exceeds 5 CER points.
|
| 514 |
+
"""
|
| 515 |
+
stats = benchmark_data.get("statistics", {}) or {}
|
| 516 |
+
cis = stats.get("bootstrap_cis") or []
|
| 517 |
+
if len(cis) < 2:
|
| 518 |
+
return []
|
| 519 |
+
|
| 520 |
+
ranking = benchmark_data.get("ranking") or []
|
| 521 |
+
valid = [r for r in ranking if r.get("mean_cer") is not None]
|
| 522 |
+
if len(valid) < 2:
|
| 523 |
+
return []
|
| 524 |
+
|
| 525 |
+
by_name = {c["engine"]: c for c in cis if "engine" in c}
|
| 526 |
+
leader = valid[0]["engine"]
|
| 527 |
+
runner_up = valid[1]["engine"]
|
| 528 |
+
leader_ci = by_name.get(leader)
|
| 529 |
+
runner_ci = by_name.get(runner_up)
|
| 530 |
+
if not leader_ci or not runner_ci:
|
| 531 |
+
return []
|
| 532 |
+
|
| 533 |
+
gap = abs(float(valid[0]["mean_cer"]) - float(valid[1]["mean_cer"]))
|
| 534 |
+
facts: list[Fact] = []
|
| 535 |
+
for engine_name, ci in ((leader, leader_ci), (runner_up, runner_ci)):
|
| 536 |
+
lo = float(ci.get("ci_lower") or 0.0)
|
| 537 |
+
hi = float(ci.get("ci_upper") or 0.0)
|
| 538 |
+
width = hi - lo
|
| 539 |
+
wide_vs_gap = gap > 0 and width > 3.0 * gap
|
| 540 |
+
wide_absolute = width > 0.05
|
| 541 |
+
if wide_vs_gap or wide_absolute:
|
| 542 |
+
facts.append(Fact(
|
| 543 |
+
type=FactType.CONFIDENCE_WARNING,
|
| 544 |
+
importance=FactImportance.MEDIUM,
|
| 545 |
+
payload={
|
| 546 |
+
"engine": engine_name,
|
| 547 |
+
"ci_lower": round(lo, 4),
|
| 548 |
+
"ci_upper": round(hi, 4),
|
| 549 |
+
"ci_width": round(width, 4),
|
| 550 |
+
"ci_width_pct": round(width * 100, 2),
|
| 551 |
+
"mean_cer": round(float(ci.get("mean") or 0.0), 4),
|
| 552 |
+
"mean_cer_pct": round(float(ci.get("mean") or 0.0) * 100, 2),
|
| 553 |
+
"gap_to_runner_up_pct": round(gap * 100, 2),
|
| 554 |
+
},
|
| 555 |
+
engines_involved=(engine_name,),
|
| 556 |
+
))
|
| 557 |
+
break  # a single warning is enough
|
| 558 |
+
return facts
|
| 559 |
|
| 560 |
|
| 561 |
# ---------------------------------------------------------------------------
|
| 562 |
+
# Default registration, enabled in Sprint 4
|
| 563 |
# ---------------------------------------------------------------------------
|
| 564 |
|
| 565 |
DETECTORS_BY_TYPE = {
|
|
|
|
| 576 |
FactType.SPEED_WINNER: detect_speed_winner,
|
| 577 |
FactType.CONFIDENCE_WARNING: detect_confidence_warning,
|
| 578 |
}
|
| 579 |
+
|
| 580 |
+
|
| 581 |
+
def register_default_detectors(registry) -> None:
|
| 582 |
+
"""Registers the Sprint 4 detectors in a ``DetectorRegistry``.
|
| 583 |
+
|
| 584 |
+
The ``PARETO_ALTERNATIVE`` and ``COST_OUTLIER`` types remain stubs
|
| 585 |
+
until Sprint 5: registering them now has no visible effect
|
| 586 |
+
(they always return an empty list), which is safe and keeps the traversal simple.
|
| 587 |
+
"""
|
| 588 |
+
for fact_type, fn in DETECTORS_BY_TYPE.items():
|
| 589 |
+
registry.register(fact_type, fn)
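
The module docstring states that detectors never handle their own failures and that error handling is centralized in ``DetectorRegistry.run``. That class is not part of this diff, so the following is only a minimal sketch of how such a registry could behave. ``SketchRegistry``, ``detect_leader`` and ``detect_broken`` are hypothetical names, and facts are plain dicts here rather than ``Fact`` objects.

```python
from typing import Callable

Detector = Callable[[dict], list]


class SketchRegistry:
    """Hypothetical stand-in for DetectorRegistry (the real class is not shown in this diff)."""

    def __init__(self) -> None:
        self._detectors: dict[str, Detector] = {}

    def register(self, fact_type: str, fn: Detector) -> None:
        self._detectors[fact_type] = fn

    def run(self, data: dict) -> list:
        facts: list = []
        for fn in self._detectors.values():
            try:
                facts.extend(fn(data))
            except Exception:
                # Assumption: a failing detector is skipped so the others still run.
                continue
        return facts


def detect_leader(data: dict) -> list:
    """Simplified leader detector: first ranked entry with a computed CER."""
    ranking = [r for r in data.get("ranking") or [] if r.get("mean_cer") is not None]
    if not ranking:
        return []
    return [{"type": "global_leader_cer", "engine": ranking[0]["engine"]}]


def detect_broken(data: dict) -> list:
    """Deliberately faulty detector, used to show the centralized handling."""
    raise KeyError("missing field")


registry = SketchRegistry()
registry.register("global_leader_cer", detect_leader)
registry.register("broken", detect_broken)
facts = registry.run({"ranking": [{"engine": "tesseract", "mean_cer": 0.04}]})
```

With this shape, a detector that raises costs one missing fact rather than the whole synthesis, which is one plausible reading of the convention stated in the docstring.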
|
|
@@ -0,0 +1,105 @@
| 1 |
+
"""Rendering of narrative facts as readable text.
|
| 2 |
+
|
| 3 |
+
Templates are loaded from ``templates/{lang}.yaml`` on first access.
|
| 4 |
+
Rendering applies ``str.format_map`` to the ``payload`` of the ``Fact``. No LLM,
|
| 5 |
+
no generation: the output is the concatenation of templates filled with
|
| 6 |
+
values that come strictly from the input JSON.
|
| 7 |
+
"""
|
| 8 |
+
|
| 9 |
+
from __future__ import annotations
|
| 10 |
+
|
| 11 |
+
import logging
|
| 12 |
+
import re
|
| 13 |
+
from pathlib import Path
|
| 14 |
+
from typing import Iterable
|
| 15 |
+
|
| 16 |
+
import yaml
|
| 17 |
+
|
| 18 |
+
from picarones.core.narrative.facts import Fact, FactType
|
| 19 |
+
|
| 20 |
+
logger = logging.getLogger(__name__)
|
| 21 |
+
|
| 22 |
+
_TEMPLATES_DIR = Path(__file__).parent / "templates"
|
| 23 |
+
_TEMPLATES_CACHE: dict[str, dict[str, str]] = {}
|
| 24 |
+
|
| 25 |
+
|
| 26 |
+
def _load_templates(lang: str) -> dict[str, str]:
|
| 27 |
+
"""Loads and caches the templates for the requested language.
|
| 28 |
+
|
| 29 |
+
Fallback: if the language does not exist, returns the FR templates. If FR
|
| 30 |
+
is missing as well (installation problem), returns an empty dict.
|
| 31 |
+
"""
|
| 32 |
+
if lang in _TEMPLATES_CACHE:
|
| 33 |
+
return _TEMPLATES_CACHE[lang]
|
| 34 |
+
|
| 35 |
+
path = _TEMPLATES_DIR / f"{lang}.yaml"
|
| 36 |
+
if not path.exists():
|
| 37 |
+
if lang != "fr":
|
| 38 |
+
return _load_templates("fr")
|
| 39 |
+
_TEMPLATES_CACHE[lang] = {}
|
| 40 |
+
return _TEMPLATES_CACHE[lang]
|
| 41 |
+
|
| 42 |
+
try:
|
| 43 |
+
with path.open(encoding="utf-8") as fh:
|
| 44 |
+
data = yaml.safe_load(fh) or {}
|
| 45 |
+
if not isinstance(data, dict):
|
| 46 |
+
logger.warning("[narrative] %s n'est pas un dict YAML — ignoré", path)
|
| 47 |
+
_TEMPLATES_CACHE[lang] = {}
|
| 48 |
+
else:
|
| 49 |
+
_TEMPLATES_CACHE[lang] = {str(k): str(v).strip() for k, v in data.items()}
|
| 50 |
+
except yaml.YAMLError as e:
|
| 51 |
+
logger.warning("[narrative] échec parsing %s : %s", path, e)
|
| 52 |
+
_TEMPLATES_CACHE[lang] = {}
|
| 53 |
+
|
| 54 |
+
return _TEMPLATES_CACHE[lang]
|
| 55 |
+
|
| 56 |
+
|
| 57 |
+
class _SafeFormatMap(dict):
|
| 58 |
+
"""Dict that returns ``'?'`` for keys a template requests but the payload lacks.
|
| 59 |
+
|
| 60 |
+
Prevents a poorly documented detector from crashing the rendering. In practice
|
| 61 |
+
the tests cover the expected keys, but robustness comes first.
|
| 62 |
+
"""
|
| 63 |
+
|
| 64 |
+
def __missing__(self, key: str) -> str:
|
| 65 |
+
logger.warning("[narrative] clé manquante dans payload : %r", key)
|
| 66 |
+
return "?"
|
| 67 |
+
|
| 68 |
+
|
| 69 |
+
def render_fact(fact: Fact, lang: str = "fr") -> str:
|
| 70 |
+
"""Renders a Fact as one sentence in the requested language.
|
| 71 |
+
|
| 72 |
+
Returns ``""`` if no template exists for this type.
|
| 73 |
+
"""
|
| 74 |
+
templates = _load_templates(lang)
|
| 75 |
+
tpl = templates.get(fact.type.value)
|
| 76 |
+
if not tpl:
|
| 77 |
+
return ""
|
| 78 |
+
|
| 79 |
+
try:
|
| 80 |
+
return tpl.format_map(_SafeFormatMap(fact.payload))
|
| 81 |
+
except (ValueError, KeyError, TypeError) as e:  # TypeError: e.g. a None value used with a numeric format spec
|
| 82 |
+
logger.warning(
|
| 83 |
+
"[narrative] rendu impossible pour %s : %s", fact.type.value, e,
|
| 84 |
+
)
|
| 85 |
+
return ""
|
| 86 |
+
|
| 87 |
+
|
| 88 |
+
def render_synthesis(facts: Iterable[Fact], lang: str = "fr") -> list[str]:
|
| 89 |
+
"""Renders a list of Facts as a list of sentences (order preserved)."""
|
| 90 |
+
out: list[str] = []
|
| 91 |
+
for fact in facts:
|
| 92 |
+
phrase = render_fact(fact, lang)
|
| 93 |
+
phrase = re.sub(r"\s+", " ", phrase).strip()
|
| 94 |
+
if phrase:
|
| 95 |
+
out.append(phrase)
|
| 96 |
+
return out
|
| 97 |
+
|
| 98 |
+
|
| 99 |
+
def extract_numbers(text: str) -> list[str]:
|
| 100 |
+
"""Extracts the numbers (decimal or integer) found in a sentence.
|
| 101 |
+
|
| 102 |
+
Used by the traceability test: every number surfaced in the synthesis
|
| 103 |
+
must be present in the input JSON.
|
| 104 |
+
"""
|
| 105 |
+
return re.findall(r"\d+(?:[.,]\d+)?", text)
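
The docstring of ``extract_numbers`` mentions a traceability test, which is not included in this diff. A minimal sketch of what such a check could look like follows, assuming that "traceable" means the number's string form matches a numeric payload value. ``collect_payload_numbers`` and ``untraceable_numbers`` are hypothetical helpers, and the payload values are made up for illustration.

```python
import re


def extract_numbers(text: str) -> list[str]:
    """Same regex as in the renderer: integers and decimals with '.' or ','."""
    return re.findall(r"\d+(?:[.,]\d+)?", text)


def collect_payload_numbers(payload: dict) -> set[str]:
    """String forms of every numeric payload value (hypothetical helper)."""
    out: set[str] = set()
    for value in payload.values():
        if isinstance(value, bool):
            continue  # bool is a subclass of int, but is not a reported number
        if isinstance(value, (int, float)):
            out.add(str(value))
    return out


def untraceable_numbers(sentence: str, payload: dict) -> list[str]:
    """Numbers that appear in the sentence but not among the payload values."""
    allowed = collect_payload_numbers(payload)
    return [n for n in extract_numbers(sentence) if n.replace(",", ".") not in allowed]


payload = {"engine": "kraken", "n_docs": 42, "cer_pct": 3.75}
template = (
    "On this corpus of {n_docs} documents, {engine} achieves the lowest mean CER "
    "({cer_pct} %)."
)
sentence = template.format_map(payload)
missing = untraceable_numbers(sentence, payload)
```

A real test would need a few exceptions that this sketch ignores: digits inside engine names, and constants written directly in a template such as the "95 %" of ``confidence_warning``.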
|
|
@@ -0,0 +1,46 @@
| 1 |
+
# Narrative rendering templates — English.
|
| 2 |
+
# Anti-hallucination rule: never introduce a number or entity name that is not
|
| 3 |
+
# already in the Fact ``payload``. Tests verify traceability of every number
|
| 4 |
+
# appearing in the rendered synthesis.
|
| 5 |
+
|
| 6 |
+
global_leader_cer: >-
|
| 7 |
+
On this corpus of {n_docs} documents, {engine} achieves the lowest mean CER
|
| 8 |
+
({cer_pct} %).
|
| 9 |
+
|
| 10 |
+
statistical_tie: >-
|
| 11 |
+
Engines {engines_list} are not statistically distinguishable
|
| 12 |
+
(Friedman-Nemenyi, α = {alpha}, n = {n_blocks} documents, CD = {critical_distance}).
|
| 13 |
+
|
| 14 |
+
significant_gap: >-
|
| 15 |
+
The gap between {leader} and {runner_up} is statistically significant
|
| 16 |
+
(Wilcoxon, p = {p_value:.4f}, Δ CER = {delta_cer_pct} points over {n_pairs} pairs).
|
| 17 |
+
|
| 18 |
+
stratum_winner: >-
|
| 19 |
+
On stratum "{stratum}" ({n_docs_stratum} documents), {engine} clearly
|
| 20 |
+
dominates with a CER of {cer_pct} % vs. {second_cer_pct} % for {second_engine}.
|
| 21 |
+
|
| 22 |
+
stratum_collapse: >-
|
| 23 |
+
{engine} is globally competitive ({global_cer_pct} %) but collapses on
|
| 24 |
+
stratum "{stratum}" ({local_cer_pct} % over {n_docs_stratum} documents,
|
| 25 |
+
i.e. {delta_cer_pct} points above its own average).
|
| 26 |
+
|
| 27 |
+
error_profile_outlier: >-
|
| 28 |
+
{engine} has an atypical error profile: {proportion_pct} % of errors fall
|
| 29 |
+
into class "{error_class}", vs. a median of {median_proportion_pct} % across
|
| 30 |
+
other engines (×{ratio_to_median} the median).
|
| 31 |
+
|
| 32 |
+
llm_hallucination_flag: >-
|
| 33 |
+
Hallucination signal on {engine} ({reasons_list}) —
|
| 34 |
+
{hallucinating_rate_pct} % of documents above alert thresholds.
|
| 35 |
+
|
| 36 |
+
robustness_fragile: >-
|
| 37 |
+
{engine} is fragile under "{degradation}" degradation: its CER rises from
|
| 38 |
+
{cer_baseline_pct} % to {cer_degraded_pct} % at maximum level (×{ratio}).
|
| 39 |
+
|
| 40 |
+
speed_winner: >-
|
| 41 |
+
{engine} is the fastest ({mean_duration} s/doc, ×{speedup} faster than the
|
| 42 |
+
median) for comparable quality (CER {cer_pct} %).
|
| 43 |
+
|
| 44 |
+
confidence_warning: >-
|
| 45 |
+
Ranking is fragile: the 95 % confidence interval of {engine} spans
|
| 46 |
+
{ci_width_pct} CER points, compared with a gap of {gap_to_runner_up_pct} points to the runner-up.
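
To illustrate how templates like these behave once filled with ``str.format_map``, here is a small self-contained sketch. ``SafeMap`` and ``render`` are simplified stand-ins for ``_SafeFormatMap`` and ``render_fact`` (no logging, no YAML loading), and the payload values are invented for the example.

```python
class SafeMap(dict):
    """Returns '?' for any key the payload does not provide."""

    def __missing__(self, key: str) -> str:
        return "?"


def render(template: str, payload: dict) -> str:
    """Fills a template; returns '' when formatting fails."""
    try:
        return template.format_map(SafeMap(payload))
    except (ValueError, KeyError, TypeError):
        return ""


tpl_plain = "{engine} is the fastest ({mean_duration} s/doc, x{speedup} faster than the median)."
tpl_spec = "Wilcoxon, p = {p_value:.4f}"

# Complete payload: every placeholder is filled.
full = render(tpl_plain, {"engine": "kraken", "mean_duration": 0.412, "speedup": 3.4})
# Incomplete payload: missing keys degrade to '?'.
partial = render(tpl_plain, {"engine": "kraken"})
# A format spec is applied to the payload value.
formatted = render(tpl_spec, {"p_value": 0.00312})
# Missing key plus a numeric format spec: '?' cannot be formatted as a float,
# so the ValueError is caught and the sentence is dropped.
missing_with_spec = render(tpl_spec, {})
```

The last case shows that the ``'?'`` fallback only protects placeholders without a format spec; a placeholder such as ``{p_value:.4f}`` still relies on the ``except`` branch.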
|
|
@@ -0,0 +1,50 @@
| 1 |
+
# Narrative rendering templates: French.
|
| 2 |
+
#
|
| 3 |
+
# Each key matches a ``FactType`` value. The value is a Python
|
| 4 |
+
# ``.format()``-style template that consumes the fields of ``Fact.payload``.
|
| 5 |
+
#
|
| 6 |
+
# Anti-hallucination rule: never introduce a numeric value or entity
|
| 7 |
+
# name that is not in the ``payload``. The tests parse the rendered
|
| 8 |
+
# synthesis and check traceability.
|
| 9 |
+
|
| 10 |
+
global_leader_cer: >-
|
| 11 |
+
Sur ce corpus de {n_docs} documents, {engine} obtient le CER moyen le plus
|
| 12 |
+
bas ({cer_pct} %).
|
| 13 |
+
|
| 14 |
+
statistical_tie: >-
|
| 15 |
+
Les moteurs {engines_list} ne sont pas statistiquement distinguables
|
| 16 |
+
(Friedman-Nemenyi, α = {alpha}, n = {n_blocks} documents, CD = {critical_distance}).
|
| 17 |
+
|
| 18 |
+
significant_gap: >-
|
| 19 |
+
L'écart entre {leader} et {runner_up} est statistiquement significatif
|
| 20 |
+
(Wilcoxon, p = {p_value:.4f}, Δ CER = {delta_cer_pct} points sur {n_pairs} paires).
|
| 21 |
+
|
| 22 |
+
stratum_winner: >-
|
| 23 |
+
Sur la strate « {stratum} » ({n_docs_stratum} documents), {engine} domine
|
| 24 |
+
nettement avec un CER de {cer_pct} % contre {second_cer_pct} % pour {second_engine}.
|
| 25 |
+
|
| 26 |
+
stratum_collapse: >-
|
| 27 |
+
{engine} est globalement compétitif ({global_cer_pct} %) mais s'effondre sur
|
| 28 |
+
la strate « {stratum} » ({local_cer_pct} % sur {n_docs_stratum} documents,
|
| 29 |
+
soit {delta_cer_pct} points au-dessus de sa moyenne).
|
| 30 |
+
|
| 31 |
+
error_profile_outlier: >-
|
| 32 |
+
Le profil d'erreurs de {engine} est atypique : {proportion_pct} % de la
|
| 33 |
+
classe « {error_class} », contre une médiane de {median_proportion_pct} %
|
| 34 |
+
sur les autres moteurs (ratio ×{ratio_to_median}).
|
| 35 |
+
|
| 36 |
+
llm_hallucination_flag: >-
|
| 37 |
+
Signal d'hallucination sur {engine} ({reasons_list}) —
|
| 38 |
+
{hallucinating_rate_pct} % de documents au-dessus des seuils d'alerte.
|
| 39 |
+
|
| 40 |
+
robustness_fragile: >-
|
| 41 |
+
{engine} est fragile à la dégradation « {degradation} » : son CER passe de
|
| 42 |
+
{cer_baseline_pct} % à {cer_degraded_pct} % au niveau maximal (ratio ×{ratio}).
|
| 43 |
+
|
| 44 |
+
speed_winner: >-
|
| 45 |
+
{engine} est le plus rapide ({mean_duration} s / doc, ×{speedup} plus vite
|
| 46 |
+
que la médiane) pour un CER comparable ({cer_pct} %).
|
| 47 |
+
|
| 48 |
+
confidence_warning: >-
|
| 49 |
+
Classement fragile : l'intervalle de confiance à 95 % de {engine} s'étend
|
| 50 |
+
sur {ci_width_pct} points de CER, à comparer à l'écart de {gap_to_runner_up_pct} points avec le second.
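The header comment above says each template is a Python `.format()` string fed from `Fact.payload`, and the test suite expects an incomplete payload not to crash the renderer. The actual `render_fact` code is not part of this diff, so the following is only a minimal sketch of one way to get that behaviour. The names `_Missing`, `_SafePayload` and `render_template` are hypothetical.

```python
from __future__ import annotations


class _Missing:
    """Stand-in for an absent payload field; renders as the raw placeholder."""

    def __init__(self, key: str) -> None:
        self.key = key

    def __format__(self, spec: str) -> str:
        # Ignore any format spec (e.g. ".4f") so a missing value never raises.
        return "{" + self.key + "}"


class _SafePayload(dict):
    """dict that returns a _Missing marker instead of raising KeyError."""

    def __missing__(self, key: str) -> _Missing:
        return _Missing(key)


def render_template(template: str, payload: dict) -> str:
    """Fill a ``.format()`` template from ``payload`` (hypothetical helper)."""
    return template.format_map(_SafePayload(payload))


# Template text copied from the French ``global_leader_cer`` entry
# (the YAML ``>-`` scalar folds the two lines into one).
LEADER_FR = (
    "Sur ce corpus de {n_docs} documents, {engine} obtient le CER moyen le plus "
    "bas ({cer_pct} %)."
)

if __name__ == "__main__":
    print(render_template(LEADER_FR, {"engine": "A", "cer_pct": 5.0, "n_docs": 10}))
    print(render_template(LEADER_FR, {"engine": "only_name"}))
```

Leaving the unresolved placeholder visible, rather than substituting a default number, keeps the anti-hallucination rule intact: no value appears in the sentence unless it came from the payload.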
|
|
@@ -495,14 +495,16 @@ _TEMPLATES_DIR = Path(__file__).parent / "templates"
|
|
| 495 |
def _build_jinja_env():
|
| 496 |
"""Build the Jinja2 Environment for the report.
|
| 497 |
|
| 498 |
-
|
| 499 |
-
|
| 500 |
-
|
|
|
|
|
|
|
| 501 |
"""
|
| 502 |
-
from jinja2 import Environment, FileSystemLoader
|
| 503 |
env = Environment(
|
| 504 |
loader=FileSystemLoader(str(_TEMPLATES_DIR)),
|
| 505 |
-
autoescape=
|
| 506 |
keep_trailing_newline=True,
|
| 507 |
)
|
| 508 |
return env
|
|
@@ -584,6 +586,10 @@ class ReportGenerator:
|
|
| 584 |
report_data.get("statistics", {}).get("nemenyi", {}),
|
| 585 |
)
|
| 586 |
|
|
|
|
|
|
|
|
|
|
|
|
|
| 587 |
env = _build_jinja_env()
|
| 588 |
template = env.get_template("base.html.j2")
|
| 589 |
html = template.render(
|
|
@@ -595,6 +601,7 @@ class ReportGenerator:
|
|
| 595 |
chartjs_inline=chartjs_js,
|
| 596 |
critical_difference_svg=cdd_svg,
|
| 597 |
friedman=report_data.get("statistics", {}).get("friedman", {}),
|
|
|
|
| 598 |
)
|
| 599 |
|
| 600 |
output_path.write_text(html, encoding="utf-8")
|
|
|
|
| 495 |
def _build_jinja_env():
|
| 496 |
"""Build the Jinja2 Environment for the report.
|
| 497 |
|
| 498 |
+
Autoescape disabled: behaviour is equivalent to the legacy
|
| 499 |
+
``_HTML_TEMPLATE.format()``. The injected variables
|
| 500 |
+
(embedded JSON, generated SVG, narrative summary built from internal
|
| 501 |
+
templates) are all produced by Picarones code and do not require
|
| 502 |
+
HTML escaping.
|
| 503 |
"""
|
| 504 |
+
from jinja2 import Environment, FileSystemLoader
|
| 505 |
env = Environment(
|
| 506 |
loader=FileSystemLoader(str(_TEMPLATES_DIR)),
|
| 507 |
+
autoescape=False,
|
| 508 |
keep_trailing_newline=True,
|
| 509 |
)
|
| 510 |
return env
|
|
|
|
| 586 |
report_data.get("statistics", {}).get("nemenyi", {}),
|
| 587 |
)
|
| 588 |
|
| 589 |
+
# Sprint 18: factual narrative summary (deterministic, no LLM)
|
| 590 |
+
from picarones.core.narrative import build_synthesis
|
| 591 |
+
synthesis = build_synthesis(report_data, lang=self.lang)
|
| 592 |
+
|
| 593 |
env = _build_jinja_env()
|
| 594 |
template = env.get_template("base.html.j2")
|
| 595 |
html = template.render(
|
|
|
|
| 601 |
chartjs_inline=chartjs_js,
|
| 602 |
critical_difference_svg=cdd_svg,
|
| 603 |
friedman=report_data.get("statistics", {}).get("friedman", {}),
|
| 604 |
+
synthesis=synthesis,
|
| 605 |
)
|
| 606 |
|
| 607 |
output_path.write_text(html, encoding="utf-8")
|
|
@@ -97,6 +97,8 @@
|
|
| 97 |
"ratio_anchor_note": "X-axis = trigram anchor score [0–1]. Y-axis = output/GT length ratio. ⚠️ Zone: anchor < 0.5 or ratio > 1.2 → probable hallucinations.",
|
| 98 |
"ratio_anchor_subtitle": "— VLM hallucinations",
|
| 99 |
"reliability_note": "For the X% easiest documents (sorted by ascending CER), what is the cumulative mean CER? A low curve = engine performing well even on easy documents.",
|
|
|
|
|
|
|
| 100 |
"tab_analyses": "Analyses",
|
| 101 |
"tab_characters": "Characters",
|
| 102 |
"tab_document": "Document",
|
|
|
|
| 97 |
"ratio_anchor_note": "X-axis = trigram anchor score [0–1]. Y-axis = output/GT length ratio. ⚠️ Zone: anchor < 0.5 or ratio > 1.2 → probable hallucinations.",
|
| 98 |
"ratio_anchor_subtitle": "— VLM hallucinations",
|
| 99 |
"reliability_note": "For the X% easiest documents (sorted by ascending CER), what is the cumulative mean CER? A low curve = engine performing well even on easy documents.",
|
| 100 |
+
"synth_hint": "Generated mechanically from results — no LLM, reproducible.",
|
| 101 |
+
"synth_title": "Factual summary",
|
| 102 |
"tab_analyses": "Analyses",
|
| 103 |
"tab_characters": "Characters",
|
| 104 |
"tab_document": "Document",
|
|
@@ -97,6 +97,8 @@
|
|
| 97 |
"ratio_anchor_note": "Axe X = score d'ancrage trigrammes [0–1]. Axe Y = ratio longueur sortie/GT. Zone ⚠️ : ancrage < 0.5 ou ratio > 1.2 → hallucinations probables.",
|
| 98 |
"ratio_anchor_subtitle": "— hallucinations VLM",
|
| 99 |
"reliability_note": "Pour les X% documents les plus faciles (triés par CER croissant), quel est le CER moyen cumulé ? Une courbe basse = moteur performant même sur les documents faciles.",
|
|
|
|
|
|
|
| 100 |
"tab_analyses": "Analyses",
|
| 101 |
"tab_characters": "Caractères",
|
| 102 |
"tab_document": "Document",
|
|
|
|
| 97 |
"ratio_anchor_note": "Axe X = score d'ancrage trigrammes [0–1]. Axe Y = ratio longueur sortie/GT. Zone ⚠️ : ancrage < 0.5 ou ratio > 1.2 → hallucinations probables.",
|
| 98 |
"ratio_anchor_subtitle": "— hallucinations VLM",
|
| 99 |
"reliability_note": "Pour les X% documents les plus faciles (triés par CER croissant), quel est le CER moyen cumulé ? Une courbe basse = moteur performant même sur les documents faciles.",
|
| 100 |
+
"synth_hint": "Générée mécaniquement depuis les résultats — aucun LLM, reproductible.",
|
| 101 |
+
"synth_title": "Synthèse factuelle",
|
| 102 |
"tab_analyses": "Analyses",
|
| 103 |
"tab_characters": "Caractères",
|
| 104 |
"tab_document": "Document",
|
|
@@ -0,0 +1,16 @@
|
|
| 1 |
+
<!-- ── Synthèse factuelle (Sprint 18) ─────────────────────────────── -->
|
| 2 |
+
{% if synthesis and synthesis.sentences %}
|
| 3 |
+
<section class="synth-card" aria-labelledby="synth-title">
|
| 4 |
+
<header class="synth-header">
|
| 5 |
+
<h2 id="synth-title" data-i18n="synth_title">Synthèse factuelle</h2>
|
| 6 |
+
<span class="synth-hint" data-i18n="synth_hint">
|
| 7 |
+
Générée mécaniquement depuis les résultats — aucun LLM, reproductible.
|
| 8 |
+
</span>
|
| 9 |
+
</header>
|
| 10 |
+
<ul class="synth-list">
|
| 11 |
+
{% for sentence in synthesis.sentences %}
|
| 12 |
+
<li>{{ sentence }}</li>
|
| 13 |
+
{% endfor %}
|
| 14 |
+
</ul>
|
| 15 |
+
</section>
|
| 16 |
+
{% endif %}
|
|
@@ -632,3 +632,40 @@ body.present-mode nav .meta { display: none; }
|
|
| 632 |
|
| 633 |
body.present-mode .cdd-info-btn,
|
| 634 |
body.present-mode .cdd-help { display: none !important; }
|
|
|
|
| 632 |
|
| 633 |
body.present-mode .cdd-info-btn,
|
| 634 |
body.present-mode .cdd-help { display: none !important; }
|
| 635 |
+
|
| 636 |
+
/* ── Sprint 18: factual narrative summary ──────────────────── */
|
| 637 |
+
.synth-card {
|
| 638 |
+
background: var(--panel, #fff);
|
| 639 |
+
border: 1px solid var(--border, #e2e8f0);
|
| 640 |
+
border-left: 4px solid #2563eb;
|
| 641 |
+
border-radius: 8px;
|
| 642 |
+
padding: 1rem 1.5rem 1.25rem;
|
| 643 |
+
margin: 1rem 1.5rem 0;
|
| 644 |
+
}
|
| 645 |
+
.synth-header {
|
| 646 |
+
display: flex; align-items: baseline; gap: .75rem;
|
| 647 |
+
margin-bottom: .5rem;
|
| 648 |
+
flex-wrap: wrap;
|
| 649 |
+
}
|
| 650 |
+
.synth-header h2 {
|
| 651 |
+
margin: 0;
|
| 652 |
+
font-size: 1rem;
|
| 653 |
+
font-weight: 600;
|
| 654 |
+
color: var(--text, #0f172a);
|
| 655 |
+
}
|
| 656 |
+
.synth-hint {
|
| 657 |
+
font-size: .75rem;
|
| 658 |
+
color: var(--text-muted, #64748b);
|
| 659 |
+
font-style: italic;
|
| 660 |
+
}
|
| 661 |
+
.synth-list {
|
| 662 |
+
margin: 0;
|
| 663 |
+
padding-left: 1.25rem;
|
| 664 |
+
line-height: 1.5;
|
| 665 |
+
font-size: .92rem;
|
| 666 |
+
color: var(--text, #0f172a);
|
| 667 |
+
}
|
| 668 |
+
.synth-list li { margin: .25rem 0; }
|
| 669 |
+
.synth-list li::marker { color: #2563eb; }
|
| 670 |
+
|
| 671 |
+
body.present-mode .synth-hint { display: none; }
|
|
@@ -17,6 +17,8 @@
|
|
| 17 |
|
| 18 |
{% include '_header.html' %}
|
| 19 |
|
|
|
|
|
|
|
| 20 |
{% include '_critical_difference.html' %}
|
| 21 |
|
| 22 |
{% include 'view_ranking.html' %}
|
|
|
|
| 17 |
|
| 18 |
{% include '_header.html' %}
|
| 19 |
|
| 20 |
+
{% include '_narrative_summary.html' %}
|
| 21 |
+
|
| 22 |
{% include '_critical_difference.html' %}
|
| 23 |
|
| 24 |
{% include 'view_ranking.html' %}
|
|
@@ -86,6 +86,7 @@ picarones = [
|
|
| 86 |
"report/templates/*.css",
|
| 87 |
"report/templates/*.js",
|
| 88 |
"report/i18n/*.json",
|
|
|
|
| 89 |
]
|
| 90 |
|
| 91 |
[tool.pytest.ini_options]
|
|
|
|
| 86 |
"report/templates/*.css",
|
| 87 |
"report/templates/*.js",
|
| 88 |
"report/i18n/*.json",
|
| 89 |
+
"core/narrative/templates/*.yaml",
|
| 90 |
]
|
| 91 |
|
| 92 |
[tool.pytest.ini_options]
|
|
@@ -0,0 +1,597 @@
|
|
| 1 |
+
"""Sprint 19 tests: full narrative engine (detectors + arbiter + rendering).
|
| 2 |
+
|
| 3 |
+
Sprint 4 of the report plan. Covers:
|
| 4 |
+
1. The 9 implemented detectors (canonical scenarios + empty cases).
|
| 5 |
+
2. The arbiter: sorting by importance, non-redundancy, Nemenyi/Wilcoxon contradiction.
|
| 6 |
+
3. The renderer: loading of the YAML templates, determinism.
|
| 7 |
+
4. The anti-hallucination safeguard: every rendered number exists in the JSON.
|
| 8 |
+
5. Integration into the HTML report (summary section, reproducibility).
|
| 9 |
+
"""
|
| 10 |
+
|
| 11 |
+
from __future__ import annotations
|
| 12 |
+
|
| 13 |
+
import hashlib
|
| 14 |
+
import re
|
| 15 |
+
|
| 16 |
+
import pytest
|
| 17 |
+
|
| 18 |
+
from picarones.core.narrative import (
|
| 19 |
+
DetectorRegistry,
|
| 20 |
+
Fact,
|
| 21 |
+
FactImportance,
|
| 22 |
+
FactType,
|
| 23 |
+
build_synthesis,
|
| 24 |
+
detect_all,
|
| 25 |
+
extract_numbers,
|
| 26 |
+
register_default_detectors,
|
| 27 |
+
render_fact,
|
| 28 |
+
render_synthesis,
|
| 29 |
+
select_facts,
|
| 30 |
+
)
|
| 31 |
+
from picarones.core.narrative.detectors import (
|
| 32 |
+
detect_confidence_warning,
|
| 33 |
+
detect_error_profile_outlier,
|
| 34 |
+
detect_global_leader_cer,
|
| 35 |
+
detect_llm_hallucination_flag,
|
| 36 |
+
detect_robustness_fragile,
|
| 37 |
+
detect_significant_gap,
|
| 38 |
+
detect_speed_winner,
|
| 39 |
+
detect_statistical_tie,
|
| 40 |
+
detect_stratum_collapse,
|
| 41 |
+
detect_stratum_winner,
|
| 42 |
+
)
|
| 43 |
+
|
| 44 |
+
|
| 45 |
+
# ---------------------------------------------------------------------------
|
| 46 |
+
# Fixtures: minimal, controlled benchmark data
|
| 47 |
+
# ---------------------------------------------------------------------------
|
| 48 |
+
|
| 49 |
+
def _minimal_data(**overrides) -> dict:
|
| 50 |
+
base = {
|
| 51 |
+
"meta": {"document_count": 10},
|
| 52 |
+
"ranking": [
|
| 53 |
+
{"engine": "A", "mean_cer": 0.05, "mean_wer": 0.15, "documents": 10, "failed": 0},
|
| 54 |
+
{"engine": "B", "mean_cer": 0.12, "mean_wer": 0.25, "documents": 10, "failed": 0},
|
| 55 |
+
{"engine": "C", "mean_cer": 0.30, "mean_wer": 0.50, "documents": 10, "failed": 0},
|
| 56 |
+
],
|
| 57 |
+
"engines": [
|
| 58 |
+
{"name": "A", "cer": 0.05, "wer": 0.15, "is_pipeline": False, "is_vlm": False},
|
| 59 |
+
{"name": "B", "cer": 0.12, "wer": 0.25, "is_pipeline": False, "is_vlm": False},
|
| 60 |
+
{"name": "C", "cer": 0.30, "wer": 0.50, "is_pipeline": False, "is_vlm": False},
|
| 61 |
+
],
|
| 62 |
+
"documents": [],
|
| 63 |
+
"statistics": {
|
| 64 |
+
"pairwise_wilcoxon": [],
|
| 65 |
+
"bootstrap_cis": [],
|
| 66 |
+
"friedman": {},
|
| 67 |
+
"nemenyi": {"tied_groups": [], "mean_ranks": {}, "critical_distance": 0.0},
|
| 68 |
+
},
|
| 69 |
+
}
|
| 70 |
+
base.update(overrides)
|
| 71 |
+
return base
|
| 72 |
+
|
| 73 |
+
|
| 74 |
+
# ---------------------------------------------------------------------------
|
| 75 |
+
# Individual detectors
|
| 76 |
+
# ---------------------------------------------------------------------------
|
| 77 |
+
|
| 78 |
+
class TestGlobalLeaderCer:
|
| 79 |
+
def test_emits_fact_with_cer_pct_and_n_docs(self):
|
| 80 |
+
facts = detect_global_leader_cer(_minimal_data())
|
| 81 |
+
assert len(facts) == 1
|
| 82 |
+
f = facts[0]
|
| 83 |
+
assert f.type == FactType.GLOBAL_LEADER_CER
|
| 84 |
+
assert f.importance == FactImportance.CRITICAL
|
| 85 |
+
assert f.payload["engine"] == "A"
|
| 86 |
+
assert f.payload["cer_pct"] == 5.0
|
| 87 |
+
assert f.payload["n_docs"] == 10
|
| 88 |
+
assert f.payload["runner_up"] == "B"
|
| 89 |
+
|
| 90 |
+
def test_empty_when_no_ranking(self):
|
| 91 |
+
assert detect_global_leader_cer(_minimal_data(ranking=[])) == []
|
| 92 |
+
|
| 93 |
+
|
| 94 |
+
class TestSignificantGap:
|
| 95 |
+
def test_emits_when_leader_vs_runnerup_is_significant(self):
|
| 96 |
+
data = _minimal_data(statistics={
|
| 97 |
+
"pairwise_wilcoxon": [
|
| 98 |
+
{"engine_a": "A", "engine_b": "B", "p_value": 0.002,
|
| 99 |
+
"significant": True, "n_pairs": 10},
|
| 100 |
+
],
|
| 101 |
+
"bootstrap_cis": [], "friedman": {},
|
| 102 |
+
"nemenyi": {"tied_groups": [], "mean_ranks": {}},
|
| 103 |
+
})
|
| 104 |
+
facts = detect_significant_gap(data)
|
| 105 |
+
assert len(facts) == 1
|
| 106 |
+
assert facts[0].payload["leader"] == "A"
|
| 107 |
+
assert facts[0].payload["runner_up"] == "B"
|
| 108 |
+
assert facts[0].payload["p_value"] == pytest.approx(0.002)
|
| 109 |
+
|
| 110 |
+
def test_empty_when_not_significant(self):
|
| 111 |
+
data = _minimal_data(statistics={
|
| 112 |
+
"pairwise_wilcoxon": [
|
| 113 |
+
{"engine_a": "A", "engine_b": "B", "p_value": 0.4,
|
| 114 |
+
"significant": False, "n_pairs": 10},
|
| 115 |
+
],
|
| 116 |
+
"bootstrap_cis": [], "friedman": {},
|
| 117 |
+
"nemenyi": {"tied_groups": [], "mean_ranks": {}},
|
| 118 |
+
})
|
| 119 |
+
assert detect_significant_gap(data) == []
|
| 120 |
+
|
| 121 |
+
|
| 122 |
+
class TestStatisticalTie:
|
| 123 |
+
def test_emits_for_each_tied_group(self):
|
| 124 |
+
data = _minimal_data(statistics={
|
| 125 |
+
"pairwise_wilcoxon": [],
|
| 126 |
+
"bootstrap_cis": [],
|
| 127 |
+
"friedman": {},
|
| 128 |
+
"nemenyi": {
|
| 129 |
+
"tied_groups": [["A", "B"], ["C"]],
|
| 130 |
+
"mean_ranks": {"A": 1.2, "B": 1.5, "C": 3.0},
|
| 131 |
+
"critical_distance": 0.8,
|
| 132 |
+
"alpha": 0.05,
|
| 133 |
+
"n_blocks": 10,
|
| 134 |
+
},
|
| 135 |
+
})
|
| 136 |
+
facts = detect_statistical_tie(data)
|
| 137 |
+
assert len(facts) == 1
|
| 138 |
+
assert set(facts[0].engines_involved) == {"A", "B"}
|
| 139 |
+
assert facts[0].payload["includes_leader"] is True
|
| 140 |
+
|
| 141 |
+
|
| 142 |
+
class TestErrorProfileOutlier:
|
| 143 |
+
def test_flags_engine_with_atypical_profile(self):
|
| 144 |
+
engines = [
|
| 145 |
+
{"name": "A", "aggregated_taxonomy": {"distribution": {"visual_confusion": 0.50, "abbreviation_error": 0.10}}},
|
| 146 |
+
{"name": "B", "aggregated_taxonomy": {"distribution": {"visual_confusion": 0.20, "abbreviation_error": 0.10}}},
|
| 147 |
+
{"name": "C", "aggregated_taxonomy": {"distribution": {"visual_confusion": 0.15, "abbreviation_error": 0.10}}},
|
| 148 |
+
]
|
| 149 |
+
data = _minimal_data(engines=engines)
|
| 150 |
+
facts = detect_error_profile_outlier(data)
|
| 151 |
+
flagged = [f for f in facts if f.payload["engine"] == "A"]
|
| 152 |
+
assert flagged
|
| 153 |
+
assert flagged[0].payload["error_class"] == "visual_confusion"
|
| 154 |
+
|
| 155 |
+
def test_empty_when_no_taxonomy(self):
|
| 156 |
+
assert detect_error_profile_outlier(_minimal_data()) == []
|
| 157 |
+
|
| 158 |
+
|
| 159 |
+
class TestLlmHallucinationFlag:
|
| 160 |
+
def test_flags_pipeline_with_high_rate(self):
|
| 161 |
+
engines = [
|
| 162 |
+
{"name": "tesseract", "aggregated_hallucination": {"hallucinating_doc_rate": 0.05},
|
| 163 |
+
"is_pipeline": False, "is_vlm": False},
|
| 164 |
+
{"name": "gpt-4o", "aggregated_hallucination": {
|
| 165 |
+
"hallucinating_doc_rate": 0.45, "anchor_score_mean": 0.55, "length_ratio_mean": 1.4},
|
| 166 |
+
"is_pipeline": True, "is_vlm": True},
|
| 167 |
+
]
|
| 168 |
+
data = _minimal_data(engines=engines)
|
| 169 |
+
facts = detect_llm_hallucination_flag(data)
|
| 170 |
+
assert len(facts) == 1
|
| 171 |
+
assert facts[0].payload["engine"] == "gpt-4o"
|
| 172 |
+
assert facts[0].payload["hallucinating_rate_pct"] == 45.0
|
| 173 |
+
|
| 174 |
+
def test_ignores_non_llm_engines(self):
|
| 175 |
+
engines = [
|
| 176 |
+
{"name": "tesseract", "aggregated_hallucination": {"hallucinating_doc_rate": 0.9},
|
| 177 |
+
"is_pipeline": False, "is_vlm": False},
|
| 178 |
+
]
|
| 179 |
+
data = _minimal_data(engines=engines)
|
| 180 |
+
assert detect_llm_hallucination_flag(data) == []
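The commit message gives the rule exercised by these two tests: only pipeline/VLM engines are considered, and the flag fires when `hallucinating_doc_rate` exceeds 30 %, the anchor score falls below 0.60, or the length ratio exceeds 1.30. The detector body is not shown in this section, so the sketch below is an assumption about how those three checks could be combined. The helper name `hallucination_reasons` and the reason labels are hypothetical; the field names are taken from the test fixtures.

```python
from __future__ import annotations

# Thresholds as stated in the commit message.
MAX_HALLUCINATING_DOC_RATE = 0.30
MIN_ANCHOR_SCORE = 0.60
MAX_LENGTH_RATIO = 1.30


def hallucination_reasons(engine: dict) -> list[str]:
    """Return the thresholds an engine crosses (empty list means no flag)."""
    # Classical OCR engines are never flagged, whatever their metrics say.
    if not (engine.get("is_pipeline") or engine.get("is_vlm")):
        return []

    agg = engine.get("aggregated_hallucination") or {}
    reasons: list[str] = []

    rate = agg.get("hallucinating_doc_rate")
    if rate is not None and rate > MAX_HALLUCINATING_DOC_RATE:
        reasons.append("hallucinating_doc_rate")

    anchor = agg.get("anchor_score_mean")
    if anchor is not None and anchor < MIN_ANCHOR_SCORE:
        reasons.append("anchor_score")

    ratio = agg.get("length_ratio_mean")
    if ratio is not None and ratio > MAX_LENGTH_RATIO:
        reasons.append("length_ratio")

    return reasons


if __name__ == "__main__":
    vlm = {
        "name": "gpt-4o",
        "aggregated_hallucination": {
            "hallucinating_doc_rate": 0.45,
            "anchor_score_mean": 0.55,
            "length_ratio_mean": 1.4,
        },
        "is_pipeline": True,
        "is_vlm": True,
    }
    print(hallucination_reasons(vlm))
```

A list of reasons like this would map naturally onto the `{reasons_list}` placeholder of the `llm_hallucination_flag` template, although how the real detector formats that field is not visible here.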
|
| 181 |
+
|
| 182 |
+
|
| 183 |
+
class TestStratumDetectors:
|
| 184 |
+
def _docs_with_strata(self):
|
| 185 |
+
# 6 docs: 3 in "gothique", 3 in "humaniste"
|
| 186 |
+
# Engine A is very good on humaniste, average on gothique
|
| 187 |
+
# Engine B is average everywhere
|
| 188 |
+
docs = []
|
| 189 |
+
for i in range(3):
|
| 190 |
+
docs.append({
|
| 191 |
+
"doc_id": f"goth{i}",
|
| 192 |
+
"script_type": "gothique",
|
| 193 |
+
"engine_results": [
|
| 194 |
+
{"engine": "A", "cer": 0.12, "error": None},
|
| 195 |
+
{"engine": "B", "cer": 0.15, "error": None},
|
| 196 |
+
],
|
| 197 |
+
})
|
| 198 |
+
for i in range(3):
|
| 199 |
+
docs.append({
|
| 200 |
+
"doc_id": f"hum{i}",
|
| 201 |
+
"script_type": "humaniste",
|
| 202 |
+
"engine_results": [
|
| 203 |
+
{"engine": "A", "cer": 0.02, "error": None},
|
| 204 |
+
{"engine": "B", "cer": 0.10, "error": None},
|
| 205 |
+
],
|
| 206 |
+
})
|
| 207 |
+
return docs
|
| 208 |
+
|
| 209 |
+
def test_stratum_winner_detected(self):
|
| 210 |
+
docs = self._docs_with_strata()
|
| 211 |
+
engines = [{"name": "A", "cer": 0.07}, {"name": "B", "cer": 0.12}]
|
| 212 |
+
data = _minimal_data(documents=docs, engines=engines)
|
| 213 |
+
facts = detect_stratum_winner(data)
|
| 214 |
+
humanist = [f for f in facts if f.stratum == "humaniste"]
|
| 215 |
+
assert humanist
|
| 216 |
+
assert humanist[0].payload["engine"] == "A"
|
| 217 |
+
|
| 218 |
+
def test_stratum_collapse_detected(self):
|
| 219 |
+
# Engine A is good overall (global CER 0.10 in this fixture) but collapses on "cursive" (0.30)
|
| 220 |
+
docs = []
|
| 221 |
+
for i in range(5):
|
| 222 |
+
docs.append({
|
| 223 |
+
"doc_id": f"good{i}",
|
| 224 |
+
"script_type": "textualis",
|
| 225 |
+
"engine_results": [{"engine": "A", "cer": 0.04, "error": None}],
|
| 226 |
+
})
|
| 227 |
+
for i in range(3):
|
| 228 |
+
docs.append({
|
| 229 |
+
"doc_id": f"bad{i}",
|
| 230 |
+
"script_type": "cursive",
|
| 231 |
+
"engine_results": [{"engine": "A", "cer": 0.30, "error": None}],
|
| 232 |
+
})
|
| 233 |
+
engines = [{"name": "A", "cer": 0.10}]
|
| 234 |
+
data = _minimal_data(documents=docs, engines=engines)
|
| 235 |
+
facts = detect_stratum_collapse(data)
|
| 236 |
+
assert any(f.stratum == "cursive" for f in facts)
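According to the commit message, `detect_stratum_collapse` flags an engine when its CER on one stratum is more than 2× its global CER, provided the stratum holds at least 3 documents. The real implementation is not shown in this section. The sketch below is an assumed reconstruction that groups per-document results by `script_type`, as the fixtures do; the function name `stratum_collapses` and its return shape are hypothetical.

```python
from __future__ import annotations

from collections import defaultdict

MIN_DOCS_PER_STRATUM = 3   # stratum size floor stated in the commit message
COLLAPSE_RATIO = 2.0       # local CER must exceed 2x the global CER


def stratum_collapses(data: dict) -> list[tuple[str, str]]:
    """Return (engine, stratum) pairs where the engine collapses locally."""
    global_cer = {e["name"]: e["cer"] for e in data.get("engines", []) if "cer" in e}

    # Collect per-document CER values keyed by (engine, stratum).
    cers: dict[tuple[str, str], list[float]] = defaultdict(list)
    for doc in data.get("documents", []):
        stratum = doc.get("script_type")
        if stratum is None:
            continue
        for res in doc.get("engine_results", []):
            if res.get("error") is None and res.get("cer") is not None:
                cers[(res["engine"], stratum)].append(res["cer"])

    collapses = []
    for (engine, stratum), values in sorted(cers.items()):
        if len(values) < MIN_DOCS_PER_STRATUM or engine not in global_cer:
            continue
        local = sum(values) / len(values)
        if local > COLLAPSE_RATIO * global_cer[engine]:
            collapses.append((engine, stratum))
    return collapses


if __name__ == "__main__":
    docs = [
        {"doc_id": f"good{i}", "script_type": "textualis",
         "engine_results": [{"engine": "A", "cer": 0.04, "error": None}]}
        for i in range(5)
    ] + [
        {"doc_id": f"bad{i}", "script_type": "cursive",
         "engine_results": [{"engine": "A", "cer": 0.30, "error": None}]}
        for i in range(3)
    ]
    print(stratum_collapses({"documents": docs, "engines": [{"name": "A", "cer": 0.10}]}))
```

Sorting the keys before iterating keeps the output order stable across runs, which matters for a summary that is meant to be reproducible.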
|
| 237 |
+
|
| 238 |
+
|
| 239 |
+
class TestSpeedWinner:
|
| 240 |
+
def test_detects_fast_engine_with_comparable_quality(self):
|
| 241 |
+
# "fast" is 50× faster AND its CER is only 6 % above the leader's
|
| 242 |
+
# (within the detector's quality tolerance margin).
|
| 243 |
+
docs = []
|
| 244 |
+
for i in range(5):
|
| 245 |
+
docs.append({
|
| 246 |
+
"doc_id": f"d{i}",
|
| 247 |
+
"engine_results": [
|
| 248 |
+
{"engine": "fast", "cer": 0.053, "error": None, "duration": 0.1},
|
| 249 |
+
{"engine": "slow", "cer": 0.050, "error": None, "duration": 5.0},
|
| 250 |
+
],
|
| 251 |
+
})
|
| 252 |
+
engines = [{"name": "fast", "cer": 0.053}, {"name": "slow", "cer": 0.050}]
|
| 253 |
+
ranking = [
|
| 254 |
+
{"engine": "slow", "mean_cer": 0.050, "documents": 5, "failed": 0},
|
| 255 |
+
{"engine": "fast", "mean_cer": 0.053, "documents": 5, "failed": 0},
|
| 256 |
+
]
|
| 257 |
+
data = _minimal_data(documents=docs, engines=engines, ranking=ranking)
|
| 258 |
+
facts = detect_speed_winner(data)
|
| 259 |
+
assert facts, "speed_winner should detect an engine that is 50× faster"
|
| 260 |
+
assert facts[0].payload["engine"] == "fast"
|
| 261 |
+
assert facts[0].payload["speedup"] >= 3.0
|
| 262 |
+
|
| 263 |
+
def test_ignores_fast_engine_with_bad_quality(self):
|
| 264 |
+
# "fast" is fast but its CER is 3× the leader's, so it is not a speed winner
|
| 265 |
+
docs = [{
|
| 266 |
+
"doc_id": f"d{i}",
|
| 267 |
+
"engine_results": [
|
| 268 |
+
{"engine": "fast", "cer": 0.15, "error": None, "duration": 0.1},
|
| 269 |
+
{"engine": "slow", "cer": 0.05, "error": None, "duration": 5.0},
|
| 270 |
+
],
|
| 271 |
+
} for i in range(5)]
|
| 272 |
+
engines = [{"name": "fast", "cer": 0.15}, {"name": "slow", "cer": 0.05}]
|
| 273 |
+
ranking = [
|
| 274 |
+
{"engine": "slow", "mean_cer": 0.05, "documents": 5, "failed": 0},
|
| 275 |
+
{"engine": "fast", "mean_cer": 0.15, "documents": 5, "failed": 0},
|
| 276 |
+
]
|
| 277 |
+
data = _minimal_data(documents=docs, engines=engines, ranking=ranking)
|
| 278 |
+
assert detect_speed_winner(data) == []
|
| 279 |
+
|
| 280 |
+
|
| 281 |
+
class TestConfidenceWarning:
|
| 282 |
+
def test_wide_ci_triggers_warning(self):
|
| 283 |
+
cis = [
|
| 284 |
+
{"engine": "A", "mean": 0.05, "ci_lower": 0.01, "ci_upper": 0.25},
|
| 285 |
+
{"engine": "B", "mean": 0.12, "ci_lower": 0.08, "ci_upper": 0.16},
|
| 286 |
+
]
|
| 287 |
+
data = _minimal_data(statistics={
|
| 288 |
+
"pairwise_wilcoxon": [], "bootstrap_cis": cis,
|
| 289 |
+
"friedman": {}, "nemenyi": {"tied_groups": [], "mean_ranks": {}},
|
| 290 |
+
})
|
| 291 |
+
facts = detect_confidence_warning(data)
|
| 292 |
+
assert len(facts) == 1
|
| 293 |
+
assert facts[0].payload["engine"] == "A"
|
| 294 |
+
|
| 295 |
+
|
| 296 |
+
class TestRobustnessFragile:
|
| 297 |
+
def test_detects_collapse_under_high_degradation(self):
|
| 298 |
+
data = _minimal_data(robustness={
|
| 299 |
+
"curves": [
|
| 300 |
+
{"engine": "X", "degradation_type": "noise", "points": [
|
| 301 |
+
{"level": 0, "cer": 0.05},
|
| 302 |
+
{"level": 80, "cer": 0.40},
|
| 303 |
+
]},
|
| 304 |
+
{"engine": "Y", "degradation_type": "noise", "points": [
|
| 305 |
+
{"level": 0, "cer": 0.05},
|
| 306 |
+
{"level": 80, "cer": 0.08},
|
| 307 |
+
]},
|
| 308 |
+
],
|
| 309 |
+
})
|
| 310 |
+
facts = detect_robustness_fragile(data)
|
| 311 |
+
names = {f.payload["engine"] for f in facts}
|
| 312 |
+
assert "X" in names
|
| 313 |
+
assert "Y" not in names
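The rule behind this test, per the commit message, is that an engine is fragile when its CER at the maximum degradation level is at least 3× its baseline CER. The sketch below is one possible reading of that rule applied to the `curves` structure used in the fixture. It assumes the baseline is the point with the lowest `level`; the function name `fragile_engines` and the output keys are hypothetical.

```python
from __future__ import annotations

FRAGILITY_RATIO = 3.0  # CER at max level >= 3x baseline, per the commit message


def fragile_engines(robustness: dict) -> list[dict]:
    """Return one record per (engine, degradation) curve that crosses the ratio."""
    flagged = []
    for curve in robustness.get("curves", []):
        # Assumption: the lowest level is the baseline, the highest is the worst case.
        points = sorted(curve.get("points", []), key=lambda p: p["level"])
        if len(points) < 2:
            continue
        baseline, degraded = points[0]["cer"], points[-1]["cer"]
        if baseline <= 0:
            continue  # a ratio against a zero baseline is undefined
        ratio = degraded / baseline
        if ratio >= FRAGILITY_RATIO:
            flagged.append({
                "engine": curve["engine"],
                "degradation": curve["degradation_type"],
                "ratio": ratio,
            })
    return flagged


if __name__ == "__main__":
    robustness = {"curves": [
        {"engine": "X", "degradation_type": "noise",
         "points": [{"level": 0, "cer": 0.05}, {"level": 80, "cer": 0.40}]},
        {"engine": "Y", "degradation_type": "noise",
         "points": [{"level": 0, "cer": 0.05}, {"level": 80, "cer": 0.08}]},
    ]}
    print([f["engine"] for f in fragile_engines(robustness)])
```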
|
| 314 |
+
|
| 315 |
+
|
| 316 |
+
# ---------------------------------------------------------------------------
|
| 317 |
+
# Arbiter
|
| 318 |
+
# ---------------------------------------------------------------------------
|
| 319 |
+
|
| 320 |
+
class TestArbiter:
|
| 321 |
+
def _fact(self, t, imp=FactImportance.HIGH, engines=("A",), stratum=None, payload=None):
|
| 322 |
+
return Fact(type=t, importance=imp, payload=payload or {},
|
| 323 |
+
engines_involved=tuple(engines), stratum=stratum)
|
| 324 |
+
|
| 325 |
+
def test_sort_by_importance_descending(self):
|
| 326 |
+
f1 = self._fact(FactType.SPEED_WINNER, imp=FactImportance.MEDIUM)
|
| 327 |
+
f2 = self._fact(FactType.GLOBAL_LEADER_CER, imp=FactImportance.CRITICAL, engines=("B",))
|
| 328 |
+
selected = select_facts([f1, f2])
|
| 329 |
+
assert selected[0].type == FactType.GLOBAL_LEADER_CER
|
| 330 |
+
|
| 331 |
+
def test_max_facts_limit(self):
|
| 332 |
+
facts = [self._fact(FactType.ERROR_PROFILE_OUTLIER, engines=(f"E{i}",)) for i in range(10)]
|
| 333 |
+
selected = select_facts(facts, max_facts=3)
|
| 334 |
+
assert len(selected) == 3
|
| 335 |
+
|
| 336 |
+
def test_deduplicates_same_engine_same_type(self):
|
| 337 |
+
f1 = self._fact(FactType.ERROR_PROFILE_OUTLIER, engines=("A",), payload={"x": 1})
|
| 338 |
+
f2 = self._fact(FactType.ERROR_PROFILE_OUTLIER, engines=("A",), payload={"x": 2})
|
| 339 |
+
selected = select_facts([f1, f2])
|
| 340 |
+
assert len(selected) == 1
|
| 341 |
+
|
| 342 |
+
def test_keeps_complementary_facts_for_same_engine(self):
|
| 343 |
+
leader = self._fact(FactType.GLOBAL_LEADER_CER, imp=FactImportance.CRITICAL, engines=("A",))
|
| 344 |
+
gap = self._fact(FactType.SIGNIFICANT_GAP, imp=FactImportance.CRITICAL, engines=("A", "B"))
|
| 345 |
+
selected = select_facts([leader, gap])
|
| 346 |
+
# Both must survive (complementary pair)
|
| 347 |
+
types = {f.type for f in selected}
|
| 348 |
+
assert FactType.GLOBAL_LEADER_CER in types
|
| 349 |
+
assert FactType.SIGNIFICANT_GAP in types
|
| 350 |
+
|
| 351 |
+
def test_low_importance_filtered(self):
|
| 352 |
+
low = Fact(type=FactType.SPEED_WINNER, importance=FactImportance.LOW,
|
| 353 |
+
payload={}, engines_involved=("A",))
|
| 354 |
+
high = self._fact(FactType.GLOBAL_LEADER_CER, imp=FactImportance.CRITICAL, engines=("A",))
|
| 355 |
+
selected = select_facts([low, high])
|
| 356 |
+
assert all(f.importance >= FactImportance.MEDIUM for f in selected)
|
| 357 |
+
|
| 358 |
+
def test_nemenyi_tie_suppresses_contradicting_wilcoxon_gap(self):
|
| 359 |
+
# If A and B are in the same Nemenyi group, we must not also display
|
| 360 |
+
# a SIGNIFICANT_GAP between A and B.
|
| 361 |
+
tie = self._fact(FactType.STATISTICAL_TIE, imp=FactImportance.CRITICAL,
|
| 362 |
+
engines=("A", "B", "C"))
|
| 363 |
+
gap = self._fact(FactType.SIGNIFICANT_GAP, imp=FactImportance.CRITICAL,
|
| 364 |
+
engines=("A", "B"))
|
| 365 |
+
selected = select_facts([tie, gap])
|
| 366 |
+
types = {f.type for f in selected}
|
| 367 |
+
assert FactType.STATISTICAL_TIE in types
|
| 368 |
+
assert FactType.SIGNIFICANT_GAP not in types
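The last test pins down the contradiction rule: when a `STATISTICAL_TIE` already groups two engines, a `SIGNIFICANT_GAP` between those same engines must be dropped. The arbiter's `_remove_contradictions` is not shown in this section, so what follows is a simplified sketch of that rule only. `MiniFact` is a hypothetical stand-in for the real `Fact` class, and plain strings replace `FactType` members.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class MiniFact:
    """Reduced stand-in for Fact: only the fields this rule needs."""
    type: str
    engines_involved: tuple[str, ...]


def remove_contradictions(facts: list[MiniFact]) -> list[MiniFact]:
    """Drop every significant_gap whose engines all sit in one tied group."""
    tie_groups = [set(f.engines_involved) for f in facts if f.type == "statistical_tie"]
    kept = []
    for fact in facts:
        contradicted = fact.type == "significant_gap" and any(
            set(fact.engines_involved) <= group for group in tie_groups
        )
        if not contradicted:
            kept.append(fact)  # input order is preserved, so the result is stable
    return kept


if __name__ == "__main__":
    tie = MiniFact("statistical_tie", ("A", "B", "C"))
    gap = MiniFact("significant_gap", ("A", "B"))
    print([f.type for f in remove_contradictions([tie, gap])])
```

The subset test means a gap involving an engine outside the tied group (for example A versus D) is left alone, since the tie says nothing about that pair.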
|
| 369 |
+
|
| 370 |
+
|
| 371 |
+
# ---------------------------------------------------------------------------
|
| 372 |
+
# Rendering and determinism
|
| 373 |
+
# ---------------------------------------------------------------------------
|
| 374 |
+
|
| 375 |
+
class TestRenderer:
|
| 376 |
+
def test_render_fact_with_known_template(self):
|
| 377 |
+
f = Fact(
|
| 378 |
+
type=FactType.GLOBAL_LEADER_CER,
|
| 379 |
+
importance=FactImportance.CRITICAL,
|
| 380 |
+
payload={"engine": "testseract", "cer_pct": 4.2, "n_docs": 50,
|
| 381 |
+
"cer": 0.042, "n_engines": 3},
|
| 382 |
+
engines_involved=("testseract",),
|
| 383 |
+
)
|
| 384 |
+
text = render_fact(f, "fr")
|
| 385 |
+
assert "testseract" in text
|
| 386 |
+
assert "4.2" in text
|
| 387 |
+
assert "50" in text
|
| 388 |
+
|
| 389 |
+
def test_render_respects_language(self):
|
| 390 |
+
f = Fact(
|
| 391 |
+
type=FactType.GLOBAL_LEADER_CER,
|
| 392 |
+
importance=FactImportance.CRITICAL,
|
| 393 |
+
payload={"engine": "X", "cer_pct": 1.0, "n_docs": 10,
|
| 394 |
+
"cer": 0.01, "n_engines": 2},
|
| 395 |
+
)
|
| 396 |
+
fr = render_fact(f, "fr")
|
| 397 |
+
en = render_fact(f, "en")
|
| 398 |
+
assert fr != en
|
| 399 |
+
assert "Sur ce corpus" in fr
|
| 400 |
+
assert "On this corpus" in en
|
| 401 |
+
|
| 402 |
+
def test_render_missing_key_does_not_crash(self):
|
| 403 |
+
# Payload incomplet volontairement
|
| 404 |
+
f = Fact(
|
| 405 |
+
type=FactType.GLOBAL_LEADER_CER,
|
| 406 |
+
importance=FactImportance.CRITICAL,
|
| 407 |
+
payload={"engine": "only_name"},
|
| 408 |
+
)
|
| 409 |
+
text = render_fact(f)
|
| 410 |
+
# Doit renvoyer une phrase non vide, même si certains placeholders sont manquants
|
| 411 |
+
assert "only_name" in text
|
| 412 |
+
|
| 413 |
+
def test_render_synthesis_deterministic(self):
|
| 414 |
+
facts = [
|
| 415 |
+
Fact(type=FactType.GLOBAL_LEADER_CER, importance=FactImportance.CRITICAL,
|
| 416 |
+
payload={"engine": "A", "cer_pct": 3.1, "n_docs": 20,
|
| 417 |
+
"cer": 0.031, "n_engines": 2},
|
| 418 |
+
engines_involved=("A",)),
|
| 419 |
+
]
|
| 420 |
+
s1 = render_synthesis(facts, "fr")
|
| 421 |
+
s2 = render_synthesis(facts, "fr")
|
| 422 |
+
assert s1 == s2
|
| 423 |
+
|
| 424 |
+
|
| 425 |
+
class TestBuildSynthesisE2E:
    def test_full_pipeline_produces_sentences(self):
        data = _minimal_data(statistics={
            "pairwise_wilcoxon": [
                {"engine_a": "A", "engine_b": "B", "p_value": 0.01,
                 "significant": True, "n_pairs": 10},
            ],
            "bootstrap_cis": [
                {"engine": "A", "mean": 0.05, "ci_lower": 0.04, "ci_upper": 0.06},
                {"engine": "B", "mean": 0.12, "ci_lower": 0.11, "ci_upper": 0.13},
            ],
            "friedman": {},
            "nemenyi": {"tied_groups": [["A"], ["B"], ["C"]],
                        "mean_ranks": {"A": 1.0, "B": 2.0, "C": 3.0},
                        "critical_distance": 0.5},
        })
        result = build_synthesis(data, "fr")
        assert "sentences" in result
        assert "facts" in result
        assert len(result["sentences"]) >= 1
        # At least the leader must be mentioned
        assert any("A" in s for s in result["sentences"])

    def test_pipeline_deterministic_across_calls(self):
        data = _minimal_data()
        s1 = build_synthesis(data, "fr")
        s2 = build_synthesis(data, "fr")
        assert s1 == s2


# ---------------------------------------------------------------------------
# Anti-hallucination safeguard: number traceability
# ---------------------------------------------------------------------------

def _numbers_in_payload(payload: dict) -> set[str]:
    """Collect every number in a Fact payload, in several textual forms.

    Includes the usual representations produced by ``str.format``:
    ``5``, ``5.0``, ``5.00``, ``5.000``, etc., so that both ``{x}`` and
    ``{x:.2f}`` formats in the templates are tolerated.
    """
    out: set[str] = set()

    def _add_variants(v):
        try:
            f = float(v)
        except (TypeError, ValueError):
            return
        out.add(str(v))
        out.add(str(f))
        # is_integer() is also safe for NaN and infinities, unlike int(f)
        if f.is_integer():
            out.add(str(int(f)))
        for dec in (1, 2, 3, 4):
            out.add(f"{f:.{dec}f}")

    def _walk(x):
        if isinstance(x, dict):
            for v in x.values():
                _walk(v)
        elif isinstance(x, (list, tuple)):
            for v in x:
                _walk(v)
        elif isinstance(x, bool):
            # bool is a subclass of int: skip it before the numeric branch
            return
        elif isinstance(x, (int, float)):
            _add_variants(x)
        elif isinstance(x, str):
            for n in re.findall(r"\d+(?:\.\d+)?", x):
                _add_variants(n)

    _walk(payload)
    return out


# Literal constants allowed in templates (not traceable to the payload
# because they are typographic elements, e.g. the 95 % threshold that
# corresponds to α = 0.05). Listing them here makes the rule explicit.
_TEMPLATE_CONSTANTS = {"95", "100"}


class TestAntiHallucinationTraceability:
    """Every number in the synthesis must come from the payload of a Fact
    (itself traceable to the input JSON by construction of the detectors)
    or belong to the closed list of template constants.
    """

    def test_every_number_in_synthesis_is_traceable(self):
        data = _minimal_data(statistics={
            "pairwise_wilcoxon": [
                {"engine_a": "A", "engine_b": "B", "p_value": 0.0123,
                 "significant": True, "n_pairs": 10},
            ],
            "bootstrap_cis": [
                {"engine": "A", "mean": 0.05, "ci_lower": 0.01, "ci_upper": 0.25},
                {"engine": "B", "mean": 0.12, "ci_lower": 0.11, "ci_upper": 0.13},
            ],
            "friedman": {"statistic": 5.2, "p_value": 0.07, "significant": False},
            "nemenyi": {
                "tied_groups": [["A", "B"]],
                "mean_ranks": {"A": 1.3, "B": 1.7, "C": 3.0},
                "critical_distance": 0.856,
                "alpha": 0.05,
                "n_blocks": 10,
            },
        })
        result = build_synthesis(data, "fr")
        # Gather the numbers from the payloads of all selected Facts
        allowed = set(_TEMPLATE_CONSTANTS)
        for f in result["facts"]:
            allowed |= _numbers_in_payload(f.get("payload", {}))

        unknown = []
        for sentence in result["sentences"]:
            for num in extract_numbers(sentence):
                # French rendering uses a decimal comma; normalize it to a dot
                num_norm = num.replace(",", ".")
                if num_norm not in allowed:
                    unknown.append((num, sentence))
        assert not unknown, f"Untraceable numbers: {unknown}"


# ---------------------------------------------------------------------------
# HTML report integration
# ---------------------------------------------------------------------------

@pytest.fixture(scope="module")
def benchmark_result():
    from picarones import fixtures
    return fixtures.generate_sample_benchmark(n_docs=8)


class TestReportIntegration:
    def test_report_contains_synthesis_section(self, benchmark_result, tmp_path):
        from picarones.report.generator import ReportGenerator
        out = tmp_path / "report.html"
        ReportGenerator(benchmark_result).generate(out)
        html = out.read_text(encoding="utf-8")
        assert 'class="synth-card"' in html
        assert 'id="synth-title"' in html
        # At least one rendered sentence
        assert re.search(r'<ul class="synth-list">\s*<li>', html)

    def test_report_synthesis_is_deterministic(self, benchmark_result, tmp_path):
        from picarones.report.generator import ReportGenerator
        out1 = tmp_path / "r1.html"
        out2 = tmp_path / "r2.html"
        ReportGenerator(benchmark_result).generate(out1)
        ReportGenerator(benchmark_result).generate(out2)
        # Extract the synthesis section from each report and compare
        h1 = out1.read_text(encoding="utf-8")
        h2 = out2.read_text(encoding="utf-8")
        s1 = re.search(r'<section class="synth-card".*?</section>', h1, re.DOTALL)
        s2 = re.search(r'<section class="synth-card".*?</section>', h2, re.DOTALL)
        assert s1 and s2
        assert hashlib.sha256(s1.group().encode()).hexdigest() == \
            hashlib.sha256(s2.group().encode()).hexdigest()

    def test_default_registry_has_all_types_registered(self):
        from picarones.core.narrative import _DEFAULT_REGISTRY
        registered = set(_DEFAULT_REGISTRY.registered_types())
        # All 12 types must be registered (including those that are still stubs)
        assert len(registered) == 12

    def test_english_locale_produces_english_sentences(self, benchmark_result, tmp_path):
        from picarones.report.generator import ReportGenerator
        out = tmp_path / "report_en.html"
        ReportGenerator(benchmark_result, lang="en").generate(out)
        html = out.read_text(encoding="utf-8")
        m = re.search(r'<ul class="synth-list">(.*?)</ul>', html, re.DOTALL)
        assert m
        ul_content = m.group(1)
        # Either "On this corpus" (leader), "Engines" (tie), or "The gap"
        assert any(marker in ul_content for marker in
                   ("On this corpus", "Engines ", "The gap", "statistically"))