phaseA: extras/ for Circle 3 modules + anti-verdict hygiene
First phase of the redesign into 3 concentric circles:
Circle 1 (invariant core) ⊂ Circle 2 (official modules) ⊂ Circle 3 (plugins)
Phase A focuses exclusively on extracting Circle 3
(niche modules, preventive governance, and their renderers).
Phases B (historical), C (importers), E (core/measurements split)
and D (stable API docs) will follow; see docs/architecture-cercles.md.
Files moved to picarones/extras/ (8 modules, ~1700 lines)
---------------------------------------------------------
extras/academic/ (modules with no direct production use case):
- taxonomy_intra_doc.py (class×position heatmap, a rarely asked question)
- taxonomy_cooccurrence.py (inter-class Jaccard matrix, academic)
- image_predictive.py (combined score with arbitrary editorial weights)
extras/governance/ (preventive governance):
- module_policy.py (manifest + audit for third-party modules; of no use
  until at least 5 modules have been contributed)
extras/render/ (matching renderers):
- taxonomy_intra_doc_render.py
- taxonomy_cooccurrence_render.py
- image_predictive_render.py
- module_audit_render.py
Full backward compatibility
---------------------------
For each moved module, a 16-line shim file stays at the old
location (``picarones/core/X.py`` or ``picarones/report/X.py``) and
re-exports the public names from the new path. Legacy imports such as:
from picarones.core.taxonomy_intra_doc import compute_taxonomy_position_heatmap
from picarones.report.module_audit_render import build_module_audit_html
keep working unchanged. Object identity is preserved
(``shim.X is extras.X``), so no logic is duplicated.
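
The identity guarantee can be illustrated without touching the real package. The sketch below builds two in-memory modules with `types.ModuleType`; the names `demo_pkg.extras.feature`, `demo_pkg.core.feature`, `compute` and `make_shim` are hypothetical stand-ins, not Picarones identifiers. It only illustrates the re-export idea and is not the shim code shipped in this commit.

```python
import types


def make_shim(old_name: str, target: types.ModuleType) -> types.ModuleType:
    """Build a thin alias module that re-exports the target's public names."""
    shim = types.ModuleType(old_name)
    # Prefer the target's declared __all__; fall back to non-underscore names.
    public = getattr(
        target, "__all__",
        [n for n in dir(target) if not n.startswith("_")],
    )
    for name in public:
        # Plain attribute binding: the shim points at the very same object.
        setattr(shim, name, getattr(target, name))
    shim.__all__ = list(public)
    return shim


# Hypothetical relocated module (stands in for the new extras/ path).
new_mod = types.ModuleType("demo_pkg.extras.feature")


def compute(x: int) -> int:
    return x * 2


new_mod.compute = compute
new_mod.__all__ = ["compute"]

# Hypothetical shim left behind at the old path.
old_mod = make_shim("demo_pkg.core.feature", new_mod)

assert old_mod.compute is new_mod.compute  # same object, no duplicated logic
```

Because the shim only binds names, any fix made in the new location is immediately visible through the old import path.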
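Anti-verdict hygiene (5 phrases reworded)
-----------------------------------------
The project claims "facts not verdicts"
(docs/user/reading-a-report.md). A few prescriptive phrases had
slipped in:
- Template ``stratum_winner``: « domine nettement » ("clearly dominates") →
  factual « obtient le CER le plus bas (X% contre Y%) » ("achieves the
  lowest CER (X% versus Y%)")
- Template ``confidence_warning``: « Classement fragile » ("fragile
  ranking") → « Incertitude statistique élevée » ("high statistical
  uncertainty")
- i18n ``gini_cer_ideal``: « idéal : bas-gauche » ("ideal: bottom-left") →
  « lecture : bas-gauche » ("reading: bottom-left")
- i18n ``gini_cer_note``: « moteur idéal a CER bas ET Gini bas » ("an ideal
  engine has low CER AND low Gini") → « Un moteur dans la zone bas-gauche
  combine CER bas ET Gini bas. Le choix selon ce graphe dépend du workflow
  visé. » ("An engine in the bottom-left zone combines low CER AND low
  Gini. The choice based on this chart depends on the target workflow.")
- i18n ``taxocomp_note``: « préférable pour une édition critique »
  ("preferable for a critical edition") → « tend à produire des erreurs
  plus facilement corrigées en édition critique » ("tends to produce
  errors that are easier to correct in a critical edition")
The FR and EN versions are kept consistent.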
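
A regression guard for these rewordings could be as simple as a substring scan. The sketch below is an assumption about how such a check might look, not the test shipped in this commit; `BANNED_PHRASES` and `find_verdicts` are hypothetical names, and the phrases are the French originals listed above.

```python
# Prescriptive phrases retired by this commit (French originals).
BANNED_PHRASES = (
    "domine nettement",
    "Classement fragile",
    "idéal : bas-gauche",
    "moteur idéal a CER bas ET Gini bas",
    "préférable pour une édition critique",
)


def find_verdicts(text: str, banned: tuple[str, ...] = BANNED_PHRASES) -> list[str]:
    """Return the banned phrases found in `text`, ignoring case."""
    lowered = text.casefold()
    return [phrase for phrase in banned if phrase.casefold() in lowered]


hits = find_verdicts("Le moteur A domine nettement le moteur B.")
clean = find_verdicts("Incertitude statistique élevée")
```

A plain substring match is deliberately naive: it would miss a phrase split across lines or written with different spacing, so a real check may need to normalize whitespace first.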
Mapping document
----------------
docs/architecture-cercles.md (250 lines):
- Description of the 3 circles and their criteria.
- Exhaustive list of the modules in each circle.
- Concrete tests for deciding between Circle 1, 2 and 3.
- Disclaimer: the mapping evolves through RFCs.
Validation: 7/7 in sandbox
--------------------------
- 12 legacy imports (Circle 3 through shims).
- 8 imports from the new paths (extras/ directly).
- Shim → new path identity preserved (``is`` test).
- The advanced_taxonomy view from workstream 3 works with opt-in
  ``intra_doc`` data now coming from extras/academic/.
- 5 reworded phrases detected in the files (regression guard).
- docs/architecture-cercles.md present and complete.
- 8 thin shims (16 lines each, no business logic).
Tests
-----
+250 lines in tests/test_phaseA_migration.py, organized into 7 classes:
TestRetrocompatHistoricalImports, TestNewExtrasImports,
TestIdentityThroughShim, TestChantier3ViewsStillWork,
TestAntiVerdictHygiene, TestArchitectureCerclesDoc, TestOriginalsAreShims.
Blocker lifted
--------------
Circle 3 now has a physical location distinct from the core. Modules
that do not serve the product's central question ("can this engine be
deployed to production?") are visibly separated. The architectural
discipline becomes enforceable through PR review ("does this module
belong in extras/?").
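
Review discipline of this kind can also be backed by an automated check. The sketch below assumes the rule stated in docs/architecture-cercles.md that the core never imports from `extras/`, and uses the standard `ast` module to list offending imports in a source string; `forbidden_imports` is a hypothetical helper, not part of Picarones. The compatibility shims described above import from `extras/` by design, so a real check would presumably need an allow-list for them.

```python
import ast


def forbidden_imports(source: str, banned: str = "picarones.extras") -> list[str]:
    """Return the banned module paths imported by `source` (one per statement)."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            candidates = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            # Cover both `from picarones.extras.x import y`
            # and `from picarones import extras`.
            candidates = [node.module] + [
                f"{node.module}.{alias.name}" for alias in node.names
            ]
        else:
            # Relative imports and non-import nodes are ignored in this sketch.
            continue
        matched = [
            c for c in candidates
            if c == banned or c.startswith(banned + ".")
        ]
        if matched:
            hits.append(matched[0])
    return hits


sample = (
    "import os\n"
    "from picarones.extras.academic import image_predictive\n"
    "from picarones import extras\n"
)
offenders = forbidden_imports(sample)
```

Working on the parsed syntax tree rather than on raw text avoids false positives from comments or docstrings that merely mention `picarones.extras`.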
Next phases (option 2: validation between each)
-----------------------------------------------
- Phase B: extras/historical/ (8 philological modules + renderers)
- Phase C: extras/importers/ (5 importers + experimental status)
- Phase E: core/ → core/ (15 modules) + measurements/ (~30 modules)
- Phase D: docs/api-stable.md + test_public_api.py + version 2.0
- docs/architecture-cercles.md +206 -0
- picarones/core/image_predictive.py +15 -278
- picarones/core/module_policy.py +15 -328
- picarones/core/narrative/templates/en.yaml +5 -4
- picarones/core/narrative/templates/fr.yaml +6 -4
- picarones/core/taxonomy_cooccurrence.py +15 -145
- picarones/core/taxonomy_intra_doc.py +15 -197
- picarones/extras/__init__.py +23 -0
- picarones/extras/academic/__init__.py +18 -0
- picarones/extras/academic/image_predictive.py +283 -0
- picarones/extras/academic/taxonomy_cooccurrence.py +150 -0
- picarones/extras/academic/taxonomy_intra_doc.py +202 -0
- picarones/extras/governance/__init__.py +8 -0
- picarones/extras/governance/module_policy.py +333 -0
- picarones/extras/render/__init__.py +13 -0
- picarones/extras/render/image_predictive_render.py +221 -0
- picarones/extras/render/module_audit_render.py +173 -0
- picarones/extras/render/taxonomy_cooccurrence_render.py +199 -0
- picarones/extras/render/taxonomy_intra_doc_render.py +182 -0
- picarones/report/i18n/en.json +3 -3
- picarones/report/i18n/fr.json +3 -3
- picarones/report/image_predictive_render.py +15 -216
- picarones/report/module_audit_render.py +15 -168
- picarones/report/taxonomy_cooccurrence_render.py +15 -194
- picarones/report/taxonomy_intra_doc_render.py +15 -177
- tests/test_phaseA_migration.py +318 -0
@@ -0,0 +1,206 @@
+# 3-circle architecture: redesign workstream following workstream 6
+
+This document **freezes the mapping** of each Picarones module to the
+circle it belongs to. It serves as a stable reference for future
+contributions: before adding a module, consult this document to
+identify which circle it should go in.
+
+## Principle: 3 concentric circles
+
+```
+┌──────────────────────────────────────────────────────────────┐
+│ Circle 3: Plugins (extras/)                                  │
+│  ┌────────────────────────────────────────────────────────┐  │
+│  │ Circle 2: Official modules                             │  │
+│  │  ┌──────────────────────────────────────────────────┐  │  │
+│  │  │ Circle 1: Invariant core (core/)                 │  │  │
+│  │  │ Stable public API, ~15 modules                   │  │  │
+│  │  └──────────────────────────────────────────────────┘  │  │
+│  │ Adapters, measurements, report, CLI, web               │  │
+│  │ ~30 metric modules + ~15 adapters/UI                   │  │
+│  └────────────────────────────────────────────────────────┘  │
+│ Niche modules, preventive governance, exotic importers       │
+│ Distributed via pip extras or separate packages later        │
+└──────────────────────────────────────────────────────────────┘
+```
+
+The further a module sits from the core, the more optional it is and
+the easier it is to remove, replace, or externalize.
+
+## Circle 1: Invariant core
+
+**Criteria**: whatever defines *what Picarones is*. Stable public
+API. Does not break between minor versions.
+
+**Location**: `picarones/core/` (after phase E), strictly
+~15 modules.
+
+**Contents**:
+
+| Module | Role |
+|---|---|
+| `corpus.py` | Document, Corpus, multi-level GTLevel |
+| `modules.py` | BaseModule, ArtifactType (single contract for third-party modules) |
+| `results.py` | BenchmarkResult, EngineReport, DocumentResult |
+| `metrics.py` | CER/WER/MER/WIL via jiwer (baseline metrics) |
+| `runner.py` | Orchestrator (parallelization, resume, timeout) |
+| `pipeline_runner.py` | Single-document test bench for composed pipelines |
+| `pipeline_benchmark.py` | Corpus-wide orchestration |
+| `pipeline_comparison.py` | Comparison of N pipelines |
+| `pipeline_spec_loader.py` | Declarative YAML loading |
+| `metric_registry.py` | Typed registry `(input_type, output_type) → metric` |
+| `metric_hooks.py` | Profiles + registry of document/corpus hooks |
+| `builtin_metrics.py` | CER/WER/MER/WIL registered in the typed registry |
+| `alto_metrics.py` | `(ALTO, ALTO)` metrics (workstream 1) |
+
+**Discipline**:
+- Any non-backward-compatible change requires an **RFC** and a major version bump.
+- A `test_public_api.py` test (to be created in phase D) fails if a name disappears.
+- No direct import from `extras/` or from optional modules.
+
+## Circle 2: Official modules
+
+**Criteria**: maintained by the Picarones maintainers and shipped by
+default, but could technically live elsewhere (a fork can replace
+it with an equivalent).
+
+**Location**:
+- `picarones/measurements/` (after phase E): metrics beyond baseline CER.
+- `picarones/engines/`: OCR adapters.
+- `picarones/llm/`: LLM adapters.
+- `picarones/modules/`: reference `BaseModule` modules (workstream 1).
+- `picarones/report/`: HTML generation.
+- `picarones/cli/`: CLI interface.
+- `picarones/web/`: FastAPI web interface.
+- `picarones/pipelines/`: legacy OCR+LLM pipelines (status to be decided in phase D).
+
+**Official metrics** (future `picarones/measurements/`):
+
+| Category | Modules |
+|---|---|
+| Text | `confusion`, `char_scores`, `taxonomy`, `structure`, `taxonomy_comparison` |
+| Lines | `line_metrics`, `hallucination` |
+| Reliability | `calibration`, `reliability`, `robustness`, `robustness_projection` |
+| ALTO/PAGE structure | `reading_order`, `layout`, `error_absorption` |
+| Search | `searchability`, `numerical_sequences`, `rare_tokens` |
+| Readability | `readability` (Flesch), `specialization` |
+| Cross-engine | `inter_engine`, `worst_lines` |
+| Economics | `throughput`, `cost_projection`, `marginal_cost`, `pricing` |
+| Comparison | `incremental_comparison` |
+| Narrative | `narrative/` (engine + 6 families of detectors) |
+| Hooks | `builtin_hooks` |
+| Corpus context | `history`, `difficulty`, `image_quality`, `normalization` |
+| Statistics | `statistics` |
+| Levers | `levers` |
+
+**Discipline**:
+- Free to modify without an RFC.
+- A new module must register via `@register_metric` or
+  `@register_document_metric` rather than being imported directly in `runner.py`.
+- Covers the product's 4 axes: production viability, VLM hallucinations,
+  composed pipelines, cost/speed projection.
+
+## Circle 3: Plugins
+
+**Criteria**: does not serve everyone and can be disabled without
+crippling the main product.
+
+**Location**: `picarones/extras/` (an internal subpackage for
+now; separate PyPI packages are possible later).
+
+**Subpackages**:
+
+### `extras/academic/`: technical modules with no production use case
+
+| Module | Why it is a plugin |
+|---|---|
+| `taxonomy_intra_doc.py` | Class×position heatmap. Rarely asked question, hardly actionable |
+| `taxonomy_cooccurrence.py` | Inter-class Jaccard. Academic, rarely needed information |
+| `image_predictive.py` | Combined score with arbitrary editorial weights |
+
+### `extras/governance/`: preventive governance
+
+| Module | Why it is a plugin |
+|---|---|
+| `module_policy.py` | Manifest + audit for externally contributed modules. Of no use until there are 5+ real third-party modules |
+
+### `extras/historical/`: philological metrics (phase B)
+
+| Module | Target audience |
+|---|---|
+| `unicode_blocks.py` | All periods |
+| `abbreviations.py` | Medieval (Cappelli) |
+| `mufi.py` | Medieval (PUA) |
+| `early_modern_typography.py` | 16th-18th centuries |
+| `modern_archives.py` | 19th-20th centuries |
+| `roman_numerals.py` | All periods |
+| `lexical_modernization.py` | Critical editing |
+| `philological_runner.py` | Orchestration of the 6 modules above |
+
+### `extras/importers/`: external imports (phase C)
+
+| Module | Status |
+|---|---|
+| `_http.py` | Shared HTTP helpers (workstream 4) |
+| `iiif.py` | Maintained |
+| `htr_united.py` | Maintained |
+| `gallica.py` | Maintained |
+| `huggingface.py` | Experimental (to be finished or marked unstable) |
+| `escriptorium.py` | Experimental (to be finished or marked unstable) |
+
+### `extras/render/`: matching renderers
+
+Atomic renderers for the `extras/` modules. Imported
+conditionally by the thematic views of workstream 3 (which
+themselves live in `report/views/`, hence Circle 2).
+
+## Telling a Circle 1 module from a Circle 2 module
+
+Concrete test: if this module is removed, does the sentence
+*"Picarones is a test bench for OCR/HTR/VLM pipelines"* remain
+true?
+
+- **Yes** → Circle 2 (the product exists without this module).
+- **No** → Circle 1 (the module is part of the very definition).
+
+Examples:
+- Without `corpus.py`: no corpus can be loaded → Circle 1.
+- Without `confusion.py`: the bench still works, just without a
+  confusion matrix → Circle 2.
+- Without `taxonomy_intra_doc.py`: the bench is still complete and
+  useful → Circle 3.
+
+## Telling a Circle 2 module from a Circle 3 module
+
+Concrete test: does this module help answer the question
+*"can this engine be deployed to production on this corpus within
+our constraints?"*, whether by measuring a risk (hallucinations,
+stability), projecting a cost (throughput, pricing), or
+evaluating quality (CER, calibration, structure)?
+
+- **Yes** → Circle 2.
+- **No** → Circle 3.
+
+Examples:
+- `hallucination.py`: measures a risk for VLM production use → Circle 2.
+- `throughput.py`: projects an operational cost → Circle 2.
+- `taxonomy_intra_doc.py`: describes a distribution with no bearing
+  on any decision → Circle 3.
+
+## Disclaimer
+
+This mapping is **a product decision**, not an absolute
+truth. It may change if real-world usage by institutions
+reveals that a Circle 3 module is in fact essential, or
+the other way around.
+
+Any challenge to it must go through a documented RFC, not
+a silent PR.
+
+## See also
+
+- [`docs/architecture.md`](architecture.md): overview after workstreams 1-6.
+- [`docs/profiles.md`](profiles.md): compute profiles (workstream 2).
+- [`docs/views.md`](views.md): HTML views of the report.
+- [`docs/cli-workflows.md`](cli-workflows.md): CLI commands.
+- `docs/api-stable.md` (*to be created in phase D*): public API commitment for Circle 1.
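
The two tests in the document compose into a single decision procedure: the first separates Circle 1 from everything else, and the second separates Circle 2 from Circle 3. The sketch below spells that out; `Circle`, `classify` and both boolean parameters are hypothetical names introduced for illustration, and the answers to the two questions remain a human judgment.

```python
from enum import Enum


class Circle(Enum):
    CORE = 1      # Circle 1: invariant core
    OFFICIAL = 2  # Circle 2: official modules
    PLUGIN = 3    # Circle 3: plugins


def classify(defines_product: bool, informs_deployment: bool) -> Circle:
    """Apply the two tests in order.

    defines_product:    removing the module makes "Picarones is a test bench
                        for OCR/HTR/VLM pipelines" false.
    informs_deployment: the module helps decide whether an engine can be
                        deployed to production (risk, cost, or quality).
    """
    if defines_product:
        return Circle.CORE
    if informs_deployment:
        return Circle.OFFICIAL
    return Circle.PLUGIN


# The examples from the document:
corpus = classify(defines_product=True, informs_deployment=True)
hallucination = classify(defines_product=False, informs_deployment=True)
taxonomy_intra_doc = classify(defines_product=False, informs_deployment=False)
```

The order matters: a module that defines the product lands in Circle 1 regardless of how the second question is answered.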
@@ -1,283 +1,20 @@
-"""
 
-
 
-
-
-``image_quality`` (Sprint 5) measures image features
-independently; this module **combines them** to produce two
-corpus-level indicators:
-
-1. **Paleographic complexity score** ∈ [0, 1]. Combines
-   noise, low sharpness, low contrast and rotation into a
-   single indicator of how intrinsically difficult a document
-   is for OCR. 0 = trivial document, 1 = extreme document.
-   Helps explain part of the observed CER.
-
-2. **Corpus homogeneity score** ∈ [0, 1]. Variance of the
-   features across documents. 0 = uniform corpus (the overall
-   benchmark mean is reliable), 1 = heterogeneous corpus
-   (the mean is misleading, stratification is needed). Paired
-   with the ``stratification_recommended`` detector (Sprint 46),
-   which acts on ``script_type``.
-
-Weights
--------
-The roadmap proposes a **weighted** combination without fixing the
-weights; a documented editorial convention is adopted:
-
-- ``noise_level``           : weight 0.30 (heavy noise → CER ↑)
-- ``1 - sharpness_score``   : weight 0.30 (blur → CER ↑)
-- ``1 - contrast_score``    : weight 0.20 (low contrast → CER ↑)
-- ``|rotation_degrees|/30`` : weight 0.20 (rotation beyond 30° = worst case)
-
-The weights sum to 1. The user can override them via
-``weights={...}``.
-
-No absolute CER prediction
---------------------------
-We do **not** claim to predict a CER value as a percentage;
-that would require a model trained per engine, which the
-test-bench philosophy rules out. We provide a relative score
-that correlates with the observed CER for a **diagnostic
-reading**: *"document A is ~3× more complex than document B,
-which is consistent with the observed CER."*
 """
 
-from __future__ import annotations
-
-import logging
-import math
-import statistics
-from typing import Iterable, Optional
-
-logger = logging.getLogger(__name__)
-
-
-# Default editorial weights.
-DEFAULT_COMPLEXITY_WEIGHTS = {
-    "noise_level": 0.30,
-    "blur": 0.30,          # 1 - sharpness_score
-    "low_contrast": 0.20,  # 1 - contrast_score
-    "rotation": 0.20,      # |rotation_degrees| / 30
-}
-
-
-# Saturation range for rotation. Beyond 30°, the penalty stays
-# at its maximum.
-_ROTATION_SATURATION_DEG = 30.0
-
-
-def _clip01(x: float) -> float:
-    return max(0.0, min(1.0, x))
-
-
-def _extract_feature(
-    quality: dict, key: str, default: float = 0.0,
-) -> float:
-    val = quality.get(key, default)
-    if val is None:
-        return default
-    try:
-        return float(val)
-    except (TypeError, ValueError):
-        return default
-
-
-def compute_paleographic_complexity(
-    quality: dict,
-    *,
-    weights: Optional[dict[str, float]] = None,
-) -> Optional[dict]:
-    """Paleographic complexity score of an image.
-
-    Parameters
-    ----------
-    quality:
-        ``ImageQualityResult.as_dict()`` dict or compatible.
-        Fields read: ``noise_level``, ``sharpness_score``,
-        ``contrast_score``, ``rotation_degrees``.
-    weights:
-        Weights overriding the defaults. Recognized keys:
-        ``noise_level``, ``blur``, ``low_contrast``, ``rotation``;
-        missing keys keep their default. Weights are normalized (sum = 1).
-
-    Returns
-    -------
-    dict | None
-        ``{
-            "score": float,  # ∈ [0, 1]
-            "components": {
-                "noise": float, "blur": float,
-                "low_contrast": float, "rotation": float,
-            },
-            "weights_used": dict,
-        }`` or ``None`` if ``quality`` is falsy or the weights sum to ≤ 0.
-    """
-    if not quality:
-        return None
-    w = dict(DEFAULT_COMPLEXITY_WEIGHTS)
-    if weights:
-        for k in w:
-            if k in weights:
-                w[k] = float(weights[k])
-    total = sum(w.values())
-    if total <= 0:
-        return None
-    w = {k: v / total for k, v in w.items()}
-    noise = _clip01(_extract_feature(quality, "noise_level"))
-    sharpness = _clip01(_extract_feature(quality, "sharpness_score"))
-    contrast = _clip01(_extract_feature(quality, "contrast_score"))
-    rotation_deg = abs(_extract_feature(quality, "rotation_degrees"))
-    blur = 1.0 - sharpness
-    low_contrast = 1.0 - contrast
-    rotation = _clip01(rotation_deg / _ROTATION_SATURATION_DEG)
-    score = (
-        w["noise_level"] * noise
-        + w["blur"] * blur
-        + w["low_contrast"] * low_contrast
-        + w["rotation"] * rotation
-    )
-    return {
-        "score": _clip01(score),
-        "components": {
-            "noise": noise,
-            "blur": blur,
-            "low_contrast": low_contrast,
-            "rotation": rotation,
-        },
-        "weights_used": w,
-    }
-
-
-def compute_corpus_homogeneity(
-    image_qualities: Iterable[dict],
-) -> Optional[dict]:
-    """Corpus homogeneity score ∈ [0, 1].
-
-    0 = uniform corpus (low variance across documents),
-    1 = heterogeneous corpus.
-
-    Method: for each feature among ``noise_level``,
-    ``sharpness_score``, ``contrast_score``, ``rotation_degrees``,
-    compute the standard deviation across documents, *normalized*
-    by a reference range, then take the mean of the 4.
-
-    Normalization ranges:
-    - ``noise_level``, ``sharpness_score``, ``contrast_score``
-      ∈ [0, 1] → standard deviation / 0.5 (theoretical maximum of the
-      standard deviation of a distribution on [0, 1]), capped at 1.
-    - ``rotation_degrees`` → standard deviation / 10°.
-
-    Parameters
-    ----------
-    image_qualities:
-        Iterable of ``ImageQualityResult.as_dict()`` dicts.
-
-    Returns
-    -------
-    dict | None
-        ``{
-            "score": float,  # ∈ [0, 1]
-            "n_docs": int,
-            "per_feature": {
-                feature: {"mean": float, "stdev": float,
-                          "normalised": float},
-            },
-        }`` or ``None`` if there are fewer than 2 documents.
-    """
-    docs = [q for q in image_qualities if q]
-    if len(docs) < 2:
-        return None
-    features = (
-        ("noise_level", 0.5),
-        ("sharpness_score", 0.5),
-        ("contrast_score", 0.5),
-        ("rotation_degrees", 10.0),
-    )
-    per_feature: dict[str, dict] = {}
-    norm_stdevs: list[float] = []
-    for key, divisor in features:
-        values = [
-            _extract_feature(q, key)
-            for q in docs
-        ]
-        if not values:
-            continue
-        mean = statistics.fmean(values)
-        try:
-            stdev = statistics.stdev(values) if len(values) >= 2 else 0.0
-        except statistics.StatisticsError:
-            stdev = 0.0
-        normalised = _clip01(stdev / divisor) if divisor > 0 else 0.0
-        per_feature[key] = {
-            "mean": mean,
-            "stdev": stdev,
-            "normalised": normalised,
-        }
-        norm_stdevs.append(normalised)
-    if not norm_stdevs:
-        return None
-    score = statistics.fmean(norm_stdevs)
-    return {
-        "score": _clip01(score),
-        "n_docs": len(docs),
-        "per_feature": per_feature,
-    }
-
-
-def aggregate_corpus_predictive(
-    image_qualities: Iterable[dict],
-    *,
-    weights: Optional[dict[str, float]] = None,
-) -> Optional[dict]:
-    """Corpus-wide summary: mean complexity + homogeneity.
-
-    Returns
-    -------
-    dict | None
-        ``{
-            "n_docs": int,
-            "complexity_mean": float,
-            "complexity_median": float,
-            "complexity_min": float,
-            "complexity_max": float,
-            "complexity_stdev": float,
-            "homogeneity": dict,  # output of
-                                  # compute_corpus_homogeneity
-        }`` or ``None`` if there are no documents.
-    """
-    docs = [q for q in image_qualities if q]
-    if not docs:
-        return None
-    scores: list[float] = []
-    for q in docs:
-        result = compute_paleographic_complexity(q, weights=weights)
-        if result is not None:
-            scores.append(float(result["score"]))
-    if not scores:
-        return None
-    homogeneity = compute_corpus_homogeneity(docs)
-    return {
-        "n_docs": len(docs),
-        "complexity_mean": statistics.fmean(scores),
-        "complexity_median": statistics.median(scores),
-        "complexity_min": min(scores),
-        "complexity_max": max(scores),
-        "complexity_stdev": (
-            statistics.stdev(scores) if len(scores) >= 2 else 0.0
-        ),
-        "homogeneity": homogeneity,
-    }
-
-
-__all__ = [
-    "DEFAULT_COMPLEXITY_WEIGHTS",
-    "compute_paleographic_complexity",
-    "compute_corpus_homogeneity",
-    "aggregate_corpus_predictive",
-]
-
 
-#
-
+"""Backward-compat alias: module moved to :mod:`picarones.extras.academic.image_predictive`.
 
+Phase A of the 3-circle redesign workstream (architecture-cercles.md).
+The content now lives in its Circle 3 home, ``extras/``. This alias
+lets legacy imports (``from picarones.core.image_predictive
+import ...``) keep working unchanged.
 
+See :doc:`docs/architecture-cercles.md` for the rationale behind
+placing this module in Circle 3.
 """
 
+from picarones.extras.academic.image_predictive import *  # noqa: F401, F403
 
+# Explicit re-export of any private names or modules accessed
+# directly by attribute (rare but possible). For most modules,
+# the ``import *`` above is enough.
+import picarones.extras.academic.image_predictive as _module
+__all__ = getattr(_module, "__all__", [
+    name for name in dir(_module) if not name.startswith("_")
+])
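
To make the weighting used by the relocated `compute_paleographic_complexity` concrete, here is a small worked example. It restates the formula from the module docstring (default weights 0.30 / 0.30 / 0.20 / 0.20, rotation saturating at 30°) in a standalone function; `complexity`, `clip01`, `WEIGHTS` and the input values are illustrative assumptions, not the module's API.

```python
# Default weights as listed in the module docstring.
WEIGHTS = {"noise_level": 0.30, "blur": 0.30, "low_contrast": 0.20, "rotation": 0.20}
ROTATION_SATURATION_DEG = 30.0


def clip01(x: float) -> float:
    return max(0.0, min(1.0, x))


def complexity(noise: float, sharpness: float, contrast: float, rotation_deg: float) -> float:
    """Weighted sum of four penalties, each clipped to [0, 1]."""
    components = {
        "noise_level": clip01(noise),
        "blur": 1.0 - clip01(sharpness),
        "low_contrast": 1.0 - clip01(contrast),
        "rotation": clip01(abs(rotation_deg) / ROTATION_SATURATION_DEG),
    }
    total = sum(WEIGHTS.values())  # normalize so the weights sum to 1
    return clip01(sum(WEIGHTS[k] / total * v for k, v in components.items()))


# 0.30*0.5 + 0.30*0.2 + 0.20*0.4 + 0.20*0.5 = 0.39
score = complexity(noise=0.5, sharpness=0.8, contrast=0.6, rotation_deg=15.0)

# A clean image rotated by 90° only pays the saturated rotation penalty: 0.20.
rotated = complexity(noise=0.0, sharpness=1.0, contrast=1.0, rotation_deg=90.0)
```

The saturation is what keeps a single extreme feature from dominating: however large the rotation, its contribution cannot exceed its weight.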
@@ -1,333 +1,20 @@
|
|
| 1 |
-
"""
|
| 2 |
|
| 3 |
-
|
|
|
|
|
|
|
|
|
|
| 4 |
|
| 5 |
-
|
| 6 |
-
|
| 7 |
-Before opening Picarones to external contributions (axis B:
-third-party modules supplied by the user), an explicit quality
-framework is needed: *"a module that does not pass the audit
-is not executable."*
-
-This module provides the **audit envelope**:
-
-- ``ModuleManifest``: mandatory metadata (author,
-  license, version, citation, typed input/output contract).
-- ``validate_manifest(manifest)``: checks that all mandatory
-  fields are present and well formed.
-- ``audit_module(module_class_or_instance, manifest)``:
-  additionally checks that the class honours the ``BaseModule`` contract
-  and that ``input_types``/``output_types`` match the
-  manifest.
-- ``AuditResult``: structured ``passed/failed`` verdict plus the list
-  of detailed checks.
-
-Opening strategy
-----------------
-Current closed phase: official modules only,
-contributions via PR on the main repository. Future open
-phase: once 5–6 official modules are stable, opening via
-``entry_points`` on PyPI (``picarones-module-X``). This module
-prepares the open phase without triggering it: any external
-module will have to supply a valid ``ModuleManifest`` in order to be
-executed.
-
-No SPDX validator
-------------------
-We check that the license fields are present and non-empty;
-we do not validate SPDX conformance of the name (``MIT`` vs
-``mit-license`` vs ``MIT License``). The researcher remains
-responsible for the choice of license; the tool documents, it does
-not judge.
|
| 42 |
"""
|
| 43 |
|
| 44 |
-
from
|
| 45 |
-
|
| 46 |
-
import logging
|
| 47 |
-
from dataclasses import dataclass, field
|
| 48 |
-
from typing import Any, Optional
|
| 49 |
-
|
| 50 |
-
logger = logging.getLogger(__name__)
|
| 51 |
-
|
| 52 |
-
|
| 53 |
-# Required fields of a ModuleManifest (non-empty text).
-_REQUIRED_TEXT_FIELDS = (
-    "name", "version", "author", "license",
-    "description",
-)
|
| 58 |
-
|
| 59 |
-
|
| 60 |
-@dataclass
-class ModuleManifest:
-    """Metadata for a contributed module.
-
-    Attributes
-    ----------
-    name:
-        Unique module identifier (e.g. ``"my-llm-correcteur"``).
-    version:
-        Semantic version (e.g. ``"1.2.0"``).
-    author:
-        Responsible author or institution.
-    license:
-        License identifier (SPDX recommended, not validated).
-    description:
-        Short description (at most one sentence).
-    input_types:
-        List of input types (strings). Must match
-        ``module.input_types`` (Sprint 33).
-    output_types:
-        List of output types. Must match
-        ``module.output_types``.
-    citation:
-        Academic citation (BibTeX, DOI, or free text).
-        Optional.
-    homepage:
-        URL of the repository or project page. Optional.
-    picarones_min_version:
-        Minimum required Picarones version. Optional.
-    extra:
-        Free-form metadata (key → value).
-    """
-
-    name: str
-    version: str
-    author: str
-    license: str
-    description: str
-    input_types: list[str] = field(default_factory=list)
-    output_types: list[str] = field(default_factory=list)
-    citation: Optional[str] = None
-    homepage: Optional[str] = None
-    picarones_min_version: Optional[str] = None
-    extra: dict = field(default_factory=dict)
-
-    def as_dict(self) -> dict:
-        return {
-            "name": self.name,
-            "version": self.version,
-            "author": self.author,
-            "license": self.license,
-            "description": self.description,
-            "input_types": list(self.input_types),
-            "output_types": list(self.output_types),
-            "citation": self.citation,
-            "homepage": self.homepage,
-            "picarones_min_version": self.picarones_min_version,
-            "extra": dict(self.extra),
-        }
|
| 119 |
-
|
| 120 |
-
|
| 121 |
-@dataclass
-class AuditCheck:
-    """A single audit check."""
-
-    name: str
-    passed: bool
-    detail: Optional[str] = None
-
-    def as_dict(self) -> dict:
-        return {
-            "name": self.name,
-            "passed": self.passed,
-            "detail": self.detail,
-        }
-
-
-@dataclass
-class AuditResult:
-    """Overall result of a module audit."""
-
-    module_name: str
-    passed: bool
-    checks: list[AuditCheck] = field(default_factory=list)
-
-    @property
-    def n_passed(self) -> int:
-        return sum(1 for c in self.checks if c.passed)
-
-    @property
-    def n_failed(self) -> int:
-        return sum(1 for c in self.checks if not c.passed)
-
-    def as_dict(self) -> dict:
-        return {
-            "module_name": self.module_name,
-            "passed": self.passed,
-            "n_passed": self.n_passed,
-            "n_failed": self.n_failed,
-            "checks": [c.as_dict() for c in self.checks],
-        }
|
| 161 |
-
|
| 162 |
-
|
| 163 |
-def validate_manifest(manifest: ModuleManifest) -> list[AuditCheck]:
-    """Check that a manifest is complete and well formed.
-
-    Returns
-    -------
-    list[AuditCheck]
-        One check per mandatory field, plus one check each for
-        non-empty ``input_types`` and ``output_types``.
-    """
-    checks: list[AuditCheck] = []
-    for field_name in _REQUIRED_TEXT_FIELDS:
-        value = getattr(manifest, field_name, None)
-        ok = isinstance(value, str) and bool(value.strip())
-        checks.append(AuditCheck(
-            name=f"manifest.{field_name}",
-            passed=ok,
-            detail=None if ok else f"champ '{field_name}' vide ou absent",
-        ))
-    # input_types / output_types: at least one entry each
-    in_ok = (
-        isinstance(manifest.input_types, list)
-        and len(manifest.input_types) > 0
-        and all(
-            isinstance(t, str) and t for t in manifest.input_types
-        )
-    )
-    checks.append(AuditCheck(
-        name="manifest.input_types",
-        passed=in_ok,
-        detail=None if in_ok else "input_types vide ou non-string",
-    ))
-    out_ok = (
-        isinstance(manifest.output_types, list)
-        and len(manifest.output_types) > 0
-        and all(
-            isinstance(t, str) and t for t in manifest.output_types
-        )
-    )
-    checks.append(AuditCheck(
-        name="manifest.output_types",
-        passed=out_ok,
-        detail=None if out_ok else "output_types vide ou non-string",
-    ))
-    return checks
|
| 207 |
-
|
| 208 |
-
|
| 209 |
-def _is_base_module(cls: Any) -> bool:
-    """Best effort: check that cls inherits from BaseModule.
-
-    We do **not** import ``BaseModule`` at top level, to
-    avoid import cycles: the class chain is inspected by
-    name instead.
-    """
-    try:
-        for base in cls.__mro__:
-            if base.__name__ == "BaseModule":
-                return True
-    except AttributeError:
-        return False
-    return False
|
| 223 |
-
|
| 224 |
-
|
| 225 |
-def audit_module(
-    module_class_or_instance: Any,
-    manifest: ModuleManifest,
-) -> AuditResult:
-    """Audit a contributed module: interface + manifest.
-
-    Parameters
-    ----------
-    module_class_or_instance:
-        Either the ``BaseModule`` class (Sprint 33) or an
-        instance.
-    manifest:
-        ``ModuleManifest`` corresponding to the module.
-
-    Returns
-    -------
-    AuditResult
-        ``passed=True`` iff all checks pass.
-    """
-    checks = validate_manifest(manifest)
-
-    # Check: inherits from BaseModule
-    cls = (
-        type(module_class_or_instance)
-        if not isinstance(module_class_or_instance, type)
-        else module_class_or_instance
-    )
-    inherits_base = _is_base_module(cls)
-    checks.append(AuditCheck(
-        name="module.inherits_base_module",
-        passed=inherits_base,
-        detail=(
-            None if inherits_base
-            else "la classe n'hérite pas de picarones.core.modules.BaseModule"
-        ),
-    ))
-
-    # Check: input_types / output_types match
-    declared_in: list[str] = []
-    declared_out: list[str] = []
-    try:
-        instance = (
-            module_class_or_instance
-            if not isinstance(module_class_or_instance, type)
-            else None
-        )
-        attr_in = getattr(cls, "input_types", None)
-        attr_out = getattr(cls, "output_types", None)
-        if instance is not None:
-            attr_in = getattr(instance, "input_types", attr_in)
-            attr_out = getattr(instance, "output_types", attr_out)
-        if attr_in is not None:
-            declared_in = [
-                getattr(t, "value", str(t)) for t in attr_in
-            ]
-        if attr_out is not None:
-            declared_out = [
-                getattr(t, "value", str(t)) for t in attr_out
-            ]
-    except Exception:  # noqa: BLE001
-        pass
-    # Case-insensitive comparison: "TEXT" or "text" is accepted
-    # on the manifest side; the semantic contract is the same.
-    declared_in_lower = sorted(t.lower() for t in declared_in)
-    declared_out_lower = sorted(t.lower() for t in declared_out)
-    manifest_in_lower = sorted(t.lower() for t in manifest.input_types)
-    manifest_out_lower = sorted(t.lower() for t in manifest.output_types)
-    in_match = declared_in_lower == manifest_in_lower
-    checks.append(AuditCheck(
-        name="module.input_types_match_manifest",
-        passed=in_match,
-        detail=(
-            None if in_match
-            else f"déclaré {declared_in} vs manifest {manifest.input_types}"
-        ),
-    ))
-    out_match = declared_out_lower == manifest_out_lower
-    checks.append(AuditCheck(
-        name="module.output_types_match_manifest",
-        passed=out_match,
-        detail=(
-            None if out_match
-            else f"déclaré {declared_out} vs manifest {manifest.output_types}"
-        ),
-    ))
-
-    # Check: process is callable
-    has_process = callable(getattr(cls, "process", None))
-    checks.append(AuditCheck(
-        name="module.has_process",
-        passed=has_process,
-        detail=None if has_process else "méthode process() absente",
-    ))
-
-    passed = all(c.passed for c in checks)
-    return AuditResult(
-        module_name=manifest.name,
-        passed=passed,
-        checks=checks,
-    )
|
| 325 |
-
|
| 326 |
|
| 327 |
-
|
| 328 |
-
|
| 329 |
-
|
| 330 |
-
|
| 331 |
-
|
| 332 |
-
"
|
| 333 |
-
]
|
|
|
|
| 1 |
+"""Backward-compatibility alias: module moved to :mod:`picarones.extras.governance.module_policy`.

+Phase A of the 3-circle refactoring effort (architecture-cercles.md).
+The content now lives in its circle 3 ``extras/``. This alias
+lets historical imports (``from picarones.core.module_policy
+import ...``) keep working without modification.

+See :doc:`docs/architecture-cercles.md` for the rationale behind
+classifying this module in Circle 3.
|
| 10 |
"""
|
| 11 |
|
| 12 |
+
from picarones.extras.governance.module_policy import * # noqa: F401, F403
|
| 13 |
|
| 14 |
+# Explicit re-export of any private names or modules accessed
+# directly as attributes (rare but possible). For most
+# modules, the ``import *`` above is sufficient.
+import picarones.extras.governance.module_policy as _module
+__all__ = getattr(_module, "__all__", [
+    name for name in dir(_module) if not name.startswith("_")
+])
|
|
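The removed `_is_base_module` helper matches base classes by name rather than by identity, in order to avoid importing `BaseModule`. A minimal sketch of what that implies, using stand-in classes (`BaseModule`, `GoodModule` and `Lookalike` below are illustrative, not the Picarones classes):

```python
def is_base_module(cls):
    """Name-based check: True if any class in the MRO is called 'BaseModule'."""
    try:
        return any(base.__name__ == "BaseModule" for base in cls.__mro__)
    except AttributeError:
        # Non-class objects (e.g. an int instance) have no __mro__.
        return False


class BaseModule:  # stand-in for the real base class
    pass


class GoodModule(BaseModule):
    pass


def make_lookalike():
    # An unrelated class that merely shares the name.
    class BaseModule:
        pass
    return BaseModule


Lookalike = make_lookalike()

results = {
    "subclass": is_base_module(GoodModule),
    "lookalike": is_base_module(Lookalike),
    "plain_object": is_base_module(object),
    "non_class": is_base_module(42),
}
print(results)
```

The lookalike passing is the trade-off of skipping the import: a name comparison cannot tell the real base class from a homonym, which is presumably why the docstring calls the check "best effort".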
@@ -16,8 +16,8 @@ significant_gap: >-
|
|
| 16 |
(Wilcoxon, p = {p_value:.4f}, Δ CER = {delta_cer_pct} points over {n_pairs} pairs).
|
| 17 |
|
| 18 |
stratum_winner: >-
|
| 19 |
-
On stratum "{stratum}" ({n_docs_stratum} documents), {engine}
|
| 20 |
-
|
| 21 |
|
| 22 |
stratum_collapse: >-
|
| 23 |
{engine} is globally competitive ({global_cer_pct} %) but collapses on
|
|
@@ -42,8 +42,9 @@ speed_winner: >-
|
|
| 42 |
median) for comparable quality (CER {cer_pct} %).
|
| 43 |
|
| 44 |
confidence_warning: >-
|
| 45 |
-
|
| 46 |
-
{ci_width_pct} CER points, compared with a gap of
|
|
|
|
| 47 |
|
| 48 |
pareto_alternative: >-
|
| 49 |
At much lower cost, {engine} offers an interesting trade-off ({cer_pct} %
|
|
|
|
| 16 |
(Wilcoxon, p = {p_value:.4f}, Δ CER = {delta_cer_pct} points over {n_pairs} pairs).
|
| 17 |
|
| 18 |
stratum_winner: >-
|
| 19 |
+
On stratum "{stratum}" ({n_docs_stratum} documents), {engine} achieves
|
| 20 |
+
the lowest CER ({cer_pct} % vs. {second_cer_pct} % for {second_engine}).
|
| 21 |
|
| 22 |
stratum_collapse: >-
|
| 23 |
{engine} is globally competitive ({global_cer_pct} %) but collapses on
|
|
|
|
| 42 |
median) for comparable quality (CER {cer_pct} %).
|
| 43 |
|
| 44 |
confidence_warning: >-
|
| 45 |
+
High statistical uncertainty: the {confidence_level} % confidence interval of
|
| 46 |
+
{engine} spans {ci_width_pct} CER points, compared with a gap of
|
| 47 |
+
{gap_to_runner_up_pct} points to the runner-up.
|
| 48 |
|
| 49 |
pareto_alternative: >-
|
| 50 |
At much lower cost, {engine} offers an interesting trade-off ({cer_pct} %
|
|
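The reworded `stratum_winner` template can be checked by filling its placeholders. This sketch assumes the templates are rendered with Python's `str.format` (the `{p_value:.4f}` spec in `significant_gap` suggests so, but the rendering code is not shown in this diff); all values are invented.

```python
# New English wording of the ``stratum_winner`` template, copied from the diff.
# The YAML ``>-`` folded scalar joins its two lines with a single space.
STRATUM_WINNER = (
    'On stratum "{stratum}" ({n_docs_stratum} documents), {engine} achieves '
    "the lowest CER ({cer_pct} % vs. {second_cer_pct} % for {second_engine})."
)

# Invented values, for illustration only.
sentence = STRATUM_WINNER.format(
    stratum="registers",
    n_docs_stratum=12,
    engine="engine_a",
    cer_pct="3.1",
    second_cer_pct="4.8",
    second_engine="engine_b",
)
print(sentence)
```

The new wording needs two placeholders that the old context lines do not mention (`second_cer_pct`, `second_engine`), so whatever code fills this template has to supply them.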
@@ -20,8 +20,9 @@ significant_gap: >-
|
|
| 20 |
(Wilcoxon, p = {p_value:.4f}, Δ CER = {delta_cer_pct} points sur {n_pairs} paires).
|
| 21 |
|
| 22 |
stratum_winner: >-
|
| 23 |
-
Sur la strate « {stratum} » ({n_docs_stratum} documents), {engine}
|
| 24 |
-
|
|
|
|
| 25 |
|
| 26 |
stratum_collapse: >-
|
| 27 |
{engine} est globalement compétitif ({global_cer_pct} %) mais s'effondre sur
|
|
@@ -46,8 +47,9 @@ speed_winner: >-
|
|
| 46 |
que la médiane) pour un CER comparable ({cer_pct} %).
|
| 47 |
|
| 48 |
confidence_warning: >-
|
| 49 |
-
|
| 50 |
-
sur {ci_width_pct} points de CER, à comparer à l'écart de
|
|
|
|
| 51 |
|
| 52 |
pareto_alternative: >-
|
| 53 |
À coût sensiblement inférieur, {engine} offre un compromis intéressant
|
|
|
|
| 20 |
(Wilcoxon, p = {p_value:.4f}, Δ CER = {delta_cer_pct} points sur {n_pairs} paires).
|
| 21 |
|
| 22 |
stratum_winner: >-
|
| 23 |
+
Sur la strate « {stratum} » ({n_docs_stratum} documents), {engine}
|
| 24 |
+
obtient le CER le plus bas ({cer_pct} % contre {second_cer_pct} %
|
| 25 |
+
pour {second_engine}).
|
| 26 |
|
| 27 |
stratum_collapse: >-
|
| 28 |
{engine} est globalement compétitif ({global_cer_pct} %) mais s'effondre sur
|
|
|
|
| 47 |
que la médiane) pour un CER comparable ({cer_pct} %).
|
| 48 |
|
| 49 |
confidence_warning: >-
|
| 50 |
+
Incertitude statistique élevée : l'intervalle de confiance à {confidence_level} %
|
| 51 |
+
de {engine} s'étend sur {ci_width_pct} points de CER, à comparer à l'écart de
|
| 52 |
+
{gap_to_runner_up_pct} points avec le second.
|
| 53 |
|
| 54 |
pareto_alternative: >-
|
| 55 |
À coût sensiblement inférieur, {engine} offre un compromis intéressant
|
|
@@ -1,150 +1,20 @@
|
|
| 1 |
-
"""
|
| 2 |
|
| 3 |
-
|
|
|
|
|
|
|
|
|
|
| 4 |
|
| 5 |
-
|
| 6 |
-
|
| 7 |
-The error taxonomy (10 classes, ``picarones/core/taxonomy.py``)
-is computed per document, but the current report shows only a
-single global histogram. Roadmap item A.I.4 asks for three finer
-readings of this taxonomy; this sprint delivers the first:
-**co-occurrence**.
-
-If ``ligature_error`` and ``abbreviation_error`` always co-occur
-in the same documents, that signals a particular
-scribe, which is useful for stratifying the corpus *a posteriori*
-(what characterises the difficult documents?).
-
-Measure
--------
-**Jaccard** index between pairs of classes at the
-**document** level:
-
-.. math::
-
-    J(A, B) = \\frac{|D_A \\cap D_B|}{|D_A \\cup D_B|}
-
-where ``D_X`` is the set of documents that contain at least
-one error of class ``X``.
-
-- ``J(A, B) = 1``: A and B always appear together (and
-  never one without the other).
-- ``J(A, B) = 0``: A and B never co-occur.
-- ``J(A, B) = 0.5``: A and B share half of their union.
-
-Splitting strategy
-------------------
-Pure computation layer first (pattern of Sprints 35, 38, 52-58).
-The HTML rendering (SVG heatmap) ships in the same sprint to
-close out the dimension; work items 2 and 3 of A.I.4 (intra-document
-evolution, comparative taxonomy) follow.
|
| 41 |
"""
|
| 42 |
|
| 43 |
-
from
|
| 44 |
-
|
| 45 |
-
import logging
|
| 46 |
-
from typing import Iterable, Optional
|
| 47 |
-
|
| 48 |
-
logger = logging.getLogger(__name__)
|
| 49 |
-
|
| 50 |
-
|
| 51 |
-def compute_taxonomy_cooccurrence(
-    per_doc_classes: Iterable[Iterable[str]],
-    *,
-    min_doc_count: int = 1,
-    top_n_pairs: int = 10,
-) -> Optional[dict]:
-    """Compute the inter-class Jaccard matrix at document level.
-
-    Parameters
-    ----------
-    per_doc_classes:
-        Iterable of docs, each doc being an iterable of the names of the
-        detected taxonomy classes (set, list, tuple…).
-        Duplicates within a doc are ignored (binary
-        presence at doc level).
-    min_doc_count:
-        Minimum number of documents in which a class must
-        appear to be included in the matrix (default 1).
-        Lets anecdotal classes be discarded.
-    top_n_pairs:
-        Number of pairs returned in ``top_pairs`` (sorted by
-        decreasing Jaccard). Default 10.
-
-    Returns
-    -------
-    Optional[dict]
-        ``{
-            "classes": list[str],          # sorted alphabetically
-            "n_documents": int,
-            "doc_count": dict[str, int],   # number of docs per class
-            "cooccurrence_matrix": dict[str, dict[str, float]],
-            # symmetric, diagonal = 1.0 (except empty class)
-            "top_pairs": list[tuple[str, str, float]],
-            # most co-occurring pairs (Jaccard desc.)
-        }``
-        or ``None`` if no class reaches ``min_doc_count``
-        or if the iterable is empty.
-    """
-    docs: list[frozenset[str]] = []
-    for doc_classes in per_doc_classes:
-        if doc_classes is None:
-            continue
-        cleaned = frozenset(c for c in doc_classes if c)
-        docs.append(cleaned)
-    if not docs:
-        return None
-
-    # Count per class
-    doc_count: dict[str, int] = {}
-    for doc in docs:
-        for cls in doc:
-            doc_count[cls] = doc_count.get(cls, 0) + 1
-
-    # min_doc_count filtering
-    classes = sorted(
-        c for c, n in doc_count.items() if n >= min_doc_count
-    )
-    if not classes:
-        return None
-
-    # Jaccard matrix
-    matrix: dict[str, dict[str, float]] = {
-        c: {} for c in classes
-    }
-    for i, ca in enumerate(classes):
-        docs_a = {idx for idx, d in enumerate(docs) if ca in d}
-        for cb in classes[i:]:
-            if ca == cb:
-                # Diagonal: Jaccard(X, X) = 1 if X is present
-                matrix[ca][cb] = 1.0 if docs_a else 0.0
-                continue
-            docs_b = {idx for idx, d in enumerate(docs) if cb in d}
-            inter = len(docs_a & docs_b)
-            union = len(docs_a | docs_b)
-            jaccard = inter / union if union > 0 else 0.0
-            matrix[ca][cb] = jaccard
-            matrix[cb][ca] = jaccard  # symmetric
-
-    # Top pairs (off-diagonal)
-    pairs: list[tuple[str, str, float]] = []
-    for i, ca in enumerate(classes):
-        for cb in classes[i + 1:]:
-            j = matrix[ca][cb]
-            if j > 0:
-                pairs.append((ca, cb, j))
-    pairs.sort(key=lambda p: (-p[2], p[0], p[1]))
-    top_pairs = pairs[:top_n_pairs]
-
-    return {
-        "classes": classes,
-        "n_documents": len(docs),
-        "doc_count": doc_count,
-        "cooccurrence_matrix": matrix,
-        "top_pairs": top_pairs,
-    }
|
| 146 |
-
|
| 147 |
|
| 148 |
-
|
| 149 |
-
|
| 150 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+"""Backward-compatibility alias: module moved to :mod:`picarones.extras.academic.taxonomy_cooccurrence`.

+Phase A of the 3-circle refactoring effort (architecture-cercles.md).
+The content now lives in its circle 3 ``extras/``. This alias
+lets historical imports (``from picarones.core.taxonomy_cooccurrence
+import ...``) keep working without modification.

+See :doc:`docs/architecture-cercles.md` for the rationale behind
+classifying this module in Circle 3.
|
| 10 |
"""
|
| 11 |
|
| 12 |
+
from picarones.extras.academic.taxonomy_cooccurrence import * # noqa: F401, F403
|
| 13 |
|
| 14 |
+# Explicit re-export of any private names or modules accessed
+# directly as attributes (rare but possible). For most
+# modules, the ``import *`` above is sufficient.
+import picarones.extras.academic.taxonomy_cooccurrence as _module
+__all__ = getattr(_module, "__all__", [
+    name for name in dir(_module) if not name.startswith("_")
+])
|
|
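The document-level Jaccard measure described in the `taxonomy_cooccurrence` docstring can be worked through on a toy corpus. The helper below is a compact re-statement for illustration, not the relocated implementation; `jaccard_by_document` and the three sample documents are invented.

```python
from itertools import combinations


def jaccard_by_document(per_doc_classes):
    """Pairwise Jaccard index over the sets of documents containing each class."""
    docs = [frozenset(c for c in doc if c) for doc in per_doc_classes]
    classes = sorted(set().union(*docs))
    # D_X: indices of the documents that contain class X at least once.
    docs_with = {c: {i for i, d in enumerate(docs) if c in d} for c in classes}
    scores = {}
    for a, b in combinations(classes, 2):
        union = docs_with[a] | docs_with[b]
        scores[(a, b)] = len(docs_with[a] & docs_with[b]) / len(union)
    return scores


# Three toy documents, each reduced to the set of error classes it contains.
toy_corpus = [
    {"ligature_error", "abbreviation_error"},
    {"ligature_error", "abbreviation_error", "case_error"},
    {"case_error"},
]
scores = jaccard_by_document(toy_corpus)
for pair, value in sorted(scores.items()):
    print(pair, round(value, 3))
```

Here `abbreviation_error` and `ligature_error` occur in exactly the same two documents, so their index is 1; each of them shares one document out of three with `case_error`, giving 1/3.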
@@ -1,202 +1,20 @@
|
|
| 1 |
-
"""
|
| 2 |
|
| 3 |
-
|
|
|
|
|
|
|
|
|
|
| 4 |
|
| 5 |
-
|
| 6 |
-
|
| 7 |
-The error taxonomy (10 classes, ``picarones/core/taxonomy.py``)
-is computed per document but aggregated into a single global
-histogram. ``line_metrics.py`` (Sprint 10) already has a heatmap of
-**CER per position slice** within the document. This sprint
-**extends that heatmap to all taxonomy classes**: where
-in the document does a given type of error appear?
-
-Concrete reading: if ``ligature_error`` is concentrated in the
-first slice, it is a **margin** error (top of page);
-if it is spread uniformly, it is a **scribe** error.
-
-Implementation
---------------
-The word-by-word classification is redone (consistent with
-``classify_errors``) while keeping the position of the GT word
-(``i1`` in the word-level diff). Each error is binned
-according to its position in the document (``bin = floor(i1 / n_gt_words *
-n_bins)``).
-
-Output
-------
-``compute_taxonomy_position_heatmap(reference, hypothesis,
-n_bins=10)`` returns a dict ``{class_name: list[float]}`` where
-each list has ``n_bins`` values representing the **count**
-of errors of that class in the corresponding slice.
-
-Splitting strategy
-------------------
-Computation layer + end-to-end HTML rendering, as in Sprint 75.
|
| 36 |
"""
|
| 37 |
|
| 38 |
-
from
|
| 39 |
-
|
| 40 |
-
import difflib
|
| 41 |
-
import logging
|
| 42 |
-
import unicodedata
|
| 43 |
-
from typing import Optional
|
| 44 |
-
|
| 45 |
-
from picarones.core.taxonomy import (
|
| 46 |
-
ERROR_CLASSES,
|
| 47 |
-
_is_abbreviation_error,
|
| 48 |
-
_is_diacritic_error,
|
| 49 |
-
_is_ligature_error,
|
| 50 |
-
_is_oov_word,
|
| 51 |
-
_is_visual_confusion,
|
| 52 |
-
)
|
| 53 |
-
|
| 54 |
-
logger = logging.getLogger(__name__)
|
| 55 |
-
|
| 56 |
-
|
| 57 |
-def _classify_word_pair(gt_word: str, hyp_word: str) -> str:
-    """Return the taxonomy class of a word-level error.
-
-    Mirrors the logic of ``taxonomy._classify_word_error`` without
-    modifying its internal counters; useful for obtaining
-    a ``(position, class)`` pair.
-    """
-    if gt_word.casefold() == hyp_word.casefold() and gt_word != hyp_word:
-        return "case_error"
-    gt_norm = unicodedata.normalize("NFC", gt_word)
-    hyp_norm = unicodedata.normalize("NFC", hyp_word)
-    if _is_ligature_error(gt_norm, hyp_norm):
-        return "ligature_error"
-    if _is_abbreviation_error(gt_norm, hyp_norm):
-        return "abbreviation_error"
-    if _is_diacritic_error(gt_norm, hyp_norm):
-        return "diacritic_error"
-    if _is_visual_confusion(gt_norm, hyp_norm):
-        return "visual_confusion"
-    if _is_oov_word(hyp_word):
-        return "oov_character"
-    return "hapax"
|
| 79 |
-
|
| 80 |
-
|
| 81 |
-def _bin_for_position(position: int, total: int, n_bins: int) -> int:
-    """Return the bin index for a (0-based) position out of a
-    total number of GT words. Bounds guard: if position == total
-    (possible for an insert at the end of the doc), clip to the last bin.
-    """
-    if total <= 0 or n_bins <= 0:
-        return 0
-    bin_idx = int((position / total) * n_bins)
-    if bin_idx >= n_bins:
-        bin_idx = n_bins - 1
-    if bin_idx < 0:
-        bin_idx = 0
-    return bin_idx
|
| 94 |
-
|
| 95 |
-
|
| 96 |
-def compute_taxonomy_position_heatmap(
-    reference: Optional[str],
-    hypothesis: Optional[str],
-    *,
-    n_bins: int = 10,
-) -> Optional[dict]:
-    """Compute the class × position heatmap for one document.
-
-    Parameters
-    ----------
-    reference:
-        Ground-truth (GT) text of the document.
-    hypothesis:
-        Text produced by the OCR.
-    n_bins:
-        Number of position slices (default 10, consistent with
-        ``line_metrics.heatmap``).
-
-    Returns
-    -------
-    Optional[dict]
-        ``{
-            "n_bins": int,
-            "n_words_gt": int,             # number of GT words
-            "total_errors": int,           # sum over all classes
-            "per_class": {
-                class_name: list[int],     # n_bins values (count per bin)
-            },
-            "totals_per_bin": list[int],   # total number of errors per bin
-        }``
-        Or ``None`` if the GT is empty.
-    """
-    if n_bins <= 0:
-        raise ValueError("n_bins doit être > 0")
-    ref = reference or ""
-    hyp = hypothesis or ""
-    gt_words = ref.split()
-    hyp_words = hyp.split()
-    n_gt = len(gt_words)
-    if n_gt == 0:
-        return None
-
-    per_class: dict[str, list[int]] = {
-        cls: [0] * n_bins for cls in ERROR_CLASSES
-    }
-    totals_per_bin: list[int] = [0] * n_bins
-    total_errors = 0
-
-    matcher = difflib.SequenceMatcher(
-        None, gt_words, hyp_words, autojunk=False,
-    )
-    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
-        if tag == "equal":
-            continue
-        if tag == "delete":
-            for offset in range(i2 - i1):
-                position = i1 + offset
-                bin_idx = _bin_for_position(position, n_gt, n_bins)
-                per_class["lacuna"][bin_idx] += 1
-                totals_per_bin[bin_idx] += 1
-                total_errors += 1
-        elif tag == "insert":
-            # An insert has no GT position of its own: assign it
-            # to the slice of the insertion position (i1).
-            for w in hyp_words[j1:j2]:
-                if not _is_oov_word(w):
-                    continue
-                position = min(i1, n_gt - 1)
-                bin_idx = _bin_for_position(position, n_gt, n_bins)
-                per_class["oov_character"][bin_idx] += 1
-                totals_per_bin[bin_idx] += 1
-                total_errors += 1
-        elif tag == "replace":
-            gt_seg = gt_words[i1:i2]
-            hyp_seg = hyp_words[j1:j2]
-            if len(hyp_seg) != len(gt_seg):
-                # Segmentation: count by length difference
-                n_seg = abs(len(gt_seg) - len(hyp_seg))
-                bin_idx = _bin_for_position(i1, n_gt, n_bins)
-                per_class["segmentation_error"][bin_idx] += n_seg
-                totals_per_bin[bin_idx] += n_seg
-                total_errors += n_seg
-            else:
-                for offset, (gt_w, hyp_w) in enumerate(
-                    zip(gt_seg, hyp_seg),
-                ):
-                    if gt_w == hyp_w:
-                        continue
-                    position = i1 + offset
-                    bin_idx = _bin_for_position(position, n_gt, n_bins)
-                    cls = _classify_word_pair(gt_w, hyp_w)
-                    per_class[cls][bin_idx] += 1
-                    totals_per_bin[bin_idx] += 1
-                    total_errors += 1
-
-    return {
-        "n_bins": n_bins,
-        "n_words_gt": n_gt,
-        "total_errors": total_errors,
-        "per_class": per_class,
-        "totals_per_bin": totals_per_bin,
-    }
|
| 198 |
-
|
| 199 |
|
| 200 |
-
|
| 201 |
-
|
| 202 |
-
|
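The binning rule from the removed `_bin_for_position` helper (`bin = floor(position / total * n_bins)`, clipped to the valid range) can be checked on a few positions. The function below is a compact re-statement for illustration; the 25-word document length is chosen arbitrarily.

```python
def bin_for_position(position, total, n_bins):
    """Map a 0-based word position to one of n_bins equal-width slices."""
    if total <= 0 or n_bins <= 0:
        return 0
    raw = int(position / total * n_bins)
    # Clip: position == total can occur for an insert at the end of the text.
    return max(0, min(n_bins - 1, raw))


# A 25-word document split into 10 slices.
positions = (0, 12, 24, 25)
bins = [bin_for_position(p, 25, 10) for p in positions]
print(dict(zip(positions, bins)))
```

Position 25 is one past the last word and would map to slice 10 without the clip; with it, the error is attributed to the last slice (index 9).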
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+"""Backward-compatibility alias: module moved to :mod:`picarones.extras.academic.taxonomy_intra_doc`.

+Phase A of the 3-circle refactoring effort (architecture-cercles.md).
+The content now lives in its circle 3 ``extras/``. This alias
+lets historical imports (``from picarones.core.taxonomy_intra_doc
+import ...``) keep working without modification.

+See :doc:`docs/architecture-cercles.md` for the rationale behind
+classifying this module in Circle 3.
|
| 10 |
"""
|
| 11 |
|
| 12 |
+
from picarones.extras.academic.taxonomy_intra_doc import * # noqa: F401, F403
|
| 13 |
|
| 14 |
+
# Mirror the target module's ``__all__`` (or, failing that, its
|
| 15 |
+
# public names) so this alias advertises the same public API. For
|
| 16 |
+
# most modules, the ``import *`` above is sufficient.
|
| 17 |
+
import picarones.extras.academic.taxonomy_intra_doc as _module
|
| 18 |
+
__all__ = getattr(_module, "__all__", [
|
| 19 |
+
name for name in dir(_module) if not name.startswith("_")
|
| 20 |
+
])
|
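The re-export shim above can be exercised in isolation. The following is a minimal sketch, assuming two hypothetical in-memory modules (`demo_extras_mod` standing in for the new location and `demo_core_mod` for the old one); neither name exists in Picarones, and the toy `compute` function is invented for illustration.

```python
import sys
import types

# Hypothetical module standing in for the new location (extras/).
new_mod = types.ModuleType("demo_extras_mod")


def compute(x):
    """Toy public function living at the new location."""
    return x * 2


new_mod.compute = compute
new_mod.__all__ = ["compute"]
sys.modules[new_mod.__name__] = new_mod

# Hypothetical shim at the old location: star-import plus a mirrored __all__,
# following the same two steps as the alias file in the diff.
shim = types.ModuleType("demo_core_mod")
exec(
    "from demo_extras_mod import *\n"
    "import demo_extras_mod as _module\n"
    "__all__ = getattr(_module, '__all__', "
    "[n for n in dir(_module) if not n.startswith('_')])\n",
    shim.__dict__,
)

# The shim exposes the very same function object, not a copy.
same_object = shim.compute is new_mod.compute
```

Because a star-import only binds names, the identity check holds without duplicating any logic, which is the property the commit message relies on.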
|
@@ -0,0 +1,23 @@
| 1 |
+
"""Picarones plugins: Circle 3 of the architecture.
|
| 2 |
+
|
| 3 |
+
Optional, niche, or preventive modules that do not directly serve
|
| 4 |
+
the product's central question ("can this engine be deployed in
|
| 5 |
+
production on this corpus?"). They are **separable**: removing
|
| 6 |
+
them does not break the standard benchmark.
|
| 7 |
+
|
| 8 |
+
Eventually, some of these subpackages may be distributed as
|
| 9 |
+
separate PyPI packages (``picarones-historical``, ``picarones-importers``).
|
| 10 |
+
For now they live as internal subpackages to limit
|
| 11 |
+
churn.
|
| 12 |
+
|
| 13 |
+
Backward-compatibility convention
|
| 14 |
+
---------------------------------
|
| 15 |
+
For each module moved from ``picarones/core/`` or
|
| 16 |
+
``picarones/report/`` to ``picarones/extras/``, a shim file is
|
| 17 |
+
left at the old location and re-exports the public names.
|
| 18 |
+
Legacy imports (``from picarones.core.taxonomy_intra_doc import
|
| 19 |
+
...``) keep working unchanged.
|
| 20 |
+
|
| 21 |
+
See :doc:`docs/architecture-cercles.md` for the full map
|
| 22 |
+
and the criteria for assignment to Circle 3.
|
| 23 |
+
"""
|
|
@@ -0,0 +1,18 @@
| 1 |
+
"""Technical modules with no direct production use case.
|
| 2 |
+
|
| 3 |
+
These 3 modules compute distributions of interest for
|
| 4 |
+
academic research but play no part in the decision
|
| 5 |
+
*"can this engine be deployed in production?"*.
|
| 6 |
+
|
| 7 |
+
Modules
|
| 8 |
+
-------
|
| 9 |
+
- :mod:`taxonomy_intra_doc`: class × position heatmap within a document.
|
| 10 |
+
- :mod:`taxonomy_cooccurrence`: document-level inter-class Jaccard matrix.
|
| 11 |
+
- :mod:`image_predictive`: paleographic complexity score (editorial weights).
|
| 12 |
+
|
| 13 |
+
Backward compatibility
|
| 14 |
+
----------------------
|
| 15 |
+
Legacy imports ``from picarones.core.taxonomy_intra_doc import
|
| 16 |
+
...`` keep working through shim files left at
|
| 17 |
+
the old location.
|
| 18 |
+
"""
|
|
@@ -0,0 +1,283 @@
| 1 |
+
"""Métriques d'image prédictives — Sprint 93 (A.II.7).
|
| 2 |
+
|
| 3 |
+
Sprint 93 — A.II.7 du plan d'évolution 2026.
|
| 4 |
+
|
| 5 |
+
Pourquoi ce module
|
| 6 |
+
------------------
|
| 7 |
+
``image_quality`` (Sprint 5) mesure des features d'image
|
| 8 |
+
indépendamment ; ce module **les combine** pour produire deux
|
| 9 |
+
indicateurs corpus-level :
|
| 10 |
+
|
| 11 |
+
1. **Score de complexité paléographique** ∈ [0, 1]. Combine
|
| 12 |
+
bruit, faible netteté, faible contraste et rotation en un
|
| 13 |
+
indicateur unique de la difficulté intrinsèque pour un OCR.
|
| 14 |
+
0 = document trivial, 1 = document extrême. Permet
|
| 15 |
+
d'expliquer une partie du CER observé.
|
| 16 |
+
|
| 17 |
+
2. **Score d'homogénéité du corpus** ∈ [0, 1]. Variance des
|
| 18 |
+
features entre documents. 0 = corpus uniforme (la moyenne
|
| 19 |
+
globale du benchmark est fiable), 1 = corpus hétérogène
|
| 20 |
+
(la moyenne ment, il faut stratifier). Couplé au détecteur
|
| 21 |
+
``stratification_recommended`` (Sprint 46) qui agit sur
|
| 22 |
+
``script_type``.
|
| 23 |
+
|
| 24 |
+
Pondérations
|
| 25 |
+
------------
|
| 26 |
+
La roadmap propose une combinaison **pondérée** sans fixer les
|
| 27 |
+
poids — on adopte une convention éditoriale documentée :
|
| 28 |
+
|
| 29 |
+
- ``noise_level`` : poids 0.30 (bruit franc → CER ↑)
|
| 30 |
+
- ``1 - sharpness_score`` : poids 0.30 (flou → CER ↑)
|
| 31 |
+
- ``1 - contrast_score`` : poids 0.20 (faible contraste → CER ↑)
|
| 32 |
+
- ``|rotation_degrees|/30`` : poids 0.20 (rotation > 30° = pire)
|
| 33 |
+
|
| 34 |
+
Les poids somment à 1. L'utilisateur peut surcharger via
|
| 35 |
+
``weights={...}``.
|
| 36 |
+
|
| 37 |
+
Pas de prédiction CER absolue
|
| 38 |
+
-----------------------------
|
| 39 |
+
On ne prétend **pas** prédire une valeur CER en pourcentage —
|
| 40 |
+
ça demanderait un modèle entraîné par moteur, ce que la
|
| 41 |
+
philosophie banc d'essai exclut. On fournit un score relatif
|
| 42 |
+
qui se corrèle au CER observé pour une **lecture
|
| 43 |
+
diagnostique** : *« le document A est ~3× plus complexe que le
|
| 44 |
+
document B, ce qui est cohérent avec le CER observé. »*
|
| 45 |
+
"""
|
| 46 |
+
|
| 47 |
+
from __future__ import annotations
|
| 48 |
+
|
| 49 |
+
import logging
|
| 50 |
+
import math
|
| 51 |
+
import statistics
|
| 52 |
+
from typing import Iterable, Optional
|
| 53 |
+
|
| 54 |
+
logger = logging.getLogger(__name__)
|
| 55 |
+
|
| 56 |
+
|
| 57 |
+
# Poids éditoriaux par défaut.
|
| 58 |
+
DEFAULT_COMPLEXITY_WEIGHTS = {
|
| 59 |
+
"noise_level": 0.30,
|
| 60 |
+
"blur": 0.30, # 1 - sharpness_score
|
| 61 |
+
"low_contrast": 0.20, # 1 - contrast_score
|
| 62 |
+
"rotation": 0.20, # |rotation_degrees| / 30
|
| 63 |
+
}
|
| 64 |
+
|
| 65 |
+
|
| 66 |
+
# Saturation range for rotation. Beyond 30°, the rotation
|
| 67 |
+
# component is capped at its maximum (1.0).
|
| 68 |
+
_ROTATION_SATURATION_DEG = 30.0
|
| 69 |
+
|
| 70 |
+
|
| 71 |
+
def _clip01(x: float) -> float:
|
| 72 |
+
return max(0.0, min(1.0, x))
|
| 73 |
+
|
| 74 |
+
|
| 75 |
+
def _extract_feature(
|
| 76 |
+
quality: dict, key: str, default: float = 0.0,
|
| 77 |
+
) -> float:
|
| 78 |
+
val = quality.get(key, default)
|
| 79 |
+
if val is None:
|
| 80 |
+
return default
|
| 81 |
+
try:
|
| 82 |
+
return float(val)
|
| 83 |
+
except (TypeError, ValueError):
|
| 84 |
+
return default
|
| 85 |
+
|
| 86 |
+
|
| 87 |
+
def compute_paleographic_complexity(
|
| 88 |
+
quality: dict,
|
| 89 |
+
*,
|
| 90 |
+
weights: Optional[dict[str, float]] = None,
|
| 91 |
+
) -> Optional[dict]:
|
| 92 |
+
"""Score de complexité paléographique d'une image.
|
| 93 |
+
|
| 94 |
+
Parameters
|
| 95 |
+
----------
|
| 96 |
+
quality:
|
| 97 |
+
Dict ``ImageQualityResult.as_dict()`` ou compatible.
|
| 98 |
+
Champs lus : ``noise_level``, ``sharpness_score``,
|
| 99 |
+
``contrast_score``, ``rotation_degrees``.
|
| 100 |
+
weights:
|
| 101 |
+
Poids surchargeant les défauts. Doit contenir les
|
| 102 |
+
4 clés ``noise_level``, ``blur``, ``low_contrast``,
|
| 103 |
+
``rotation``. Les poids sont normalisés (somme = 1).
|
| 104 |
+
|
| 105 |
+
Returns
|
| 106 |
+
-------
|
| 107 |
+
dict | None
|
| 108 |
+
``{
|
| 109 |
+
"score": float, # ∈ [0, 1]
|
| 110 |
+
"components": {
|
| 111 |
+
"noise": float, "blur": float,
|
| 112 |
+
"low_contrast": float, "rotation": float,
|
| 113 |
+
},
|
| 114 |
+
"weights_used": dict,
|
| 115 |
+
}`` ou ``None`` si ``quality`` est falsy.
|
| 116 |
+
"""
|
| 117 |
+
if not quality:
|
| 118 |
+
return None
|
| 119 |
+
w = dict(DEFAULT_COMPLEXITY_WEIGHTS)
|
| 120 |
+
if weights:
|
| 121 |
+
for k in w:
|
| 122 |
+
if k in weights:
|
| 123 |
+
w[k] = float(weights[k])
|
| 124 |
+
total = sum(w.values())
|
| 125 |
+
if total <= 0:
|
| 126 |
+
return None
|
| 127 |
+
w = {k: v / total for k, v in w.items()}
|
| 128 |
+
noise = _clip01(_extract_feature(quality, "noise_level"))
|
| 129 |
+
sharpness = _clip01(_extract_feature(quality, "sharpness_score"))
|
| 130 |
+
contrast = _clip01(_extract_feature(quality, "contrast_score"))
|
| 131 |
+
rotation_deg = abs(_extract_feature(quality, "rotation_degrees"))
|
| 132 |
+
blur = 1.0 - sharpness
|
| 133 |
+
low_contrast = 1.0 - contrast
|
| 134 |
+
rotation = _clip01(rotation_deg / _ROTATION_SATURATION_DEG)
|
| 135 |
+
score = (
|
| 136 |
+
w["noise_level"] * noise
|
| 137 |
+
+ w["blur"] * blur
|
| 138 |
+
+ w["low_contrast"] * low_contrast
|
| 139 |
+
+ w["rotation"] * rotation
|
| 140 |
+
)
|
| 141 |
+
return {
|
| 142 |
+
"score": _clip01(score),
|
| 143 |
+
"components": {
|
| 144 |
+
"noise": noise,
|
| 145 |
+
"blur": blur,
|
| 146 |
+
"low_contrast": low_contrast,
|
| 147 |
+
"rotation": rotation,
|
| 148 |
+
},
|
| 149 |
+
"weights_used": w,
|
| 150 |
+
}
|
| 151 |
+
|
| 152 |
+
|
| 153 |
+
def compute_corpus_homogeneity(
|
| 154 |
+
image_qualities: Iterable[dict],
|
| 155 |
+
) -> Optional[dict]:
|
| 156 |
+
"""Score d'homogénéité du corpus ∈ [0, 1].
|
| 157 |
+
|
| 158 |
+
0 = corpus uniforme (faible variance entre documents),
|
| 159 |
+
1 = corpus hétérogène.
|
| 160 |
+
|
| 161 |
+
Méthode : pour chaque feature dans ``noise_level``,
|
| 162 |
+
``sharpness_score``, ``contrast_score``, ``rotation_degrees``,
|
| 163 |
+
on calcule l'écart-type *normalisé* sur les documents (par
|
| 164 |
+
une plage de référence), puis on prend la moyenne des 4.
|
| 165 |
+
|
| 166 |
+
Plages de normalisation :
|
| 167 |
+
- ``noise_level``, ``sharpness_score``, ``contrast_score``
|
| 168 |
+
∈ [0, 1] → écart-type / 0.5 (max théorique de l'écart-type
|
| 169 |
+
d'une distribution sur [0,1]) borné à 1.
|
| 170 |
+
- ``rotation_degrees`` → écart-type / 10°.
|
| 171 |
+
|
| 172 |
+
Parameters
|
| 173 |
+
----------
|
| 174 |
+
image_qualities:
|
| 175 |
+
Itérable de dicts ``ImageQualityResult.as_dict()``.
|
| 176 |
+
|
| 177 |
+
Returns
|
| 178 |
+
-------
|
| 179 |
+
dict | None
|
| 180 |
+
``{
|
| 181 |
+
"score": float, # ∈ [0, 1]
|
| 182 |
+
"n_docs": int,
|
| 183 |
+
"per_feature": {
|
| 184 |
+
feature: {"mean": float, "stdev": float,
|
| 185 |
+
"normalised": float},
|
| 186 |
+
},
|
| 187 |
+
}`` ou ``None`` si moins de 2 documents.
|
| 188 |
+
"""
|
| 189 |
+
docs = [q for q in image_qualities if q]
|
| 190 |
+
if len(docs) < 2:
|
| 191 |
+
return None
|
| 192 |
+
features = (
|
| 193 |
+
("noise_level", 0.5),
|
| 194 |
+
("sharpness_score", 0.5),
|
| 195 |
+
("contrast_score", 0.5),
|
| 196 |
+
("rotation_degrees", 10.0),
|
| 197 |
+
)
|
| 198 |
+
per_feature: dict[str, dict] = {}
|
| 199 |
+
norm_stdevs: list[float] = []
|
| 200 |
+
for key, divisor in features:
|
| 201 |
+
values = [
|
| 202 |
+
_extract_feature(q, key)
|
| 203 |
+
for q in docs
|
| 204 |
+
]
|
| 205 |
+
if not values:
|
| 206 |
+
continue
|
| 207 |
+
mean = statistics.fmean(values)
|
| 208 |
+
try:
|
| 209 |
+
stdev = statistics.stdev(values) if len(values) >= 2 else 0.0
|
| 210 |
+
except statistics.StatisticsError:
|
| 211 |
+
stdev = 0.0
|
| 212 |
+
normalised = _clip01(stdev / divisor) if divisor > 0 else 0.0
|
| 213 |
+
per_feature[key] = {
|
| 214 |
+
"mean": mean,
|
| 215 |
+
"stdev": stdev,
|
| 216 |
+
"normalised": normalised,
|
| 217 |
+
}
|
| 218 |
+
norm_stdevs.append(normalised)
|
| 219 |
+
if not norm_stdevs:
|
| 220 |
+
return None
|
| 221 |
+
score = statistics.fmean(norm_stdevs)
|
| 222 |
+
return {
|
| 223 |
+
"score": _clip01(score),
|
| 224 |
+
"n_docs": len(docs),
|
| 225 |
+
"per_feature": per_feature,
|
| 226 |
+
}
|
| 227 |
+
|
| 228 |
+
|
| 229 |
+
def aggregate_corpus_predictive(
|
| 230 |
+
image_qualities: Iterable[dict],
|
| 231 |
+
*,
|
| 232 |
+
weights: Optional[dict[str, float]] = None,
|
| 233 |
+
) -> Optional[dict]:
|
| 234 |
+
"""Synthèse corpus-wide : complexité moyenne + homogénéité.
|
| 235 |
+
|
| 236 |
+
Returns
|
| 237 |
+
-------
|
| 238 |
+
dict | None
|
| 239 |
+
``{
|
| 240 |
+
"n_docs": int,
|
| 241 |
+
"complexity_mean": float,
|
| 242 |
+
"complexity_median": float,
|
| 243 |
+
"complexity_min": float,
|
| 244 |
+
"complexity_max": float,
|
| 245 |
+
"complexity_stdev": float,
|
| 246 |
+
"homogeneity": dict, # sortie de
|
| 247 |
+
# compute_corpus_homogeneity
|
| 248 |
+
}`` ou ``None`` si moins d'un document.
|
| 249 |
+
"""
|
| 250 |
+
docs = [q for q in image_qualities if q]
|
| 251 |
+
if not docs:
|
| 252 |
+
return None
|
| 253 |
+
scores: list[float] = []
|
| 254 |
+
for q in docs:
|
| 255 |
+
result = compute_paleographic_complexity(q, weights=weights)
|
| 256 |
+
if result is not None:
|
| 257 |
+
scores.append(float(result["score"]))
|
| 258 |
+
if not scores:
|
| 259 |
+
return None
|
| 260 |
+
homogeneity = compute_corpus_homogeneity(docs)
|
| 261 |
+
return {
|
| 262 |
+
"n_docs": len(docs),
|
| 263 |
+
"complexity_mean": statistics.fmean(scores),
|
| 264 |
+
"complexity_median": statistics.median(scores),
|
| 265 |
+
"complexity_min": min(scores),
|
| 266 |
+
"complexity_max": max(scores),
|
| 267 |
+
"complexity_stdev": (
|
| 268 |
+
statistics.stdev(scores) if len(scores) >= 2 else 0.0
|
| 269 |
+
),
|
| 270 |
+
"homogeneity": homogeneity,
|
| 271 |
+
}
|
| 272 |
+
|
| 273 |
+
|
| 274 |
+
__all__ = [
|
| 275 |
+
"DEFAULT_COMPLEXITY_WEIGHTS",
|
| 276 |
+
"compute_paleographic_complexity",
|
| 277 |
+
"compute_corpus_homogeneity",
|
| 278 |
+
"aggregate_corpus_predictive",
|
| 279 |
+
]
|
| 280 |
+
|
| 281 |
+
|
| 282 |
+
# Avoids an unused-import warning.
|
| 283 |
+
_ = math
|
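The weighted combination described in the module docstring can be worked through by hand. The sketch below re-implements only the arithmetic, assuming the default weights and the 30° saturation shown in the diff; the helper names (`clip01`, `complexity_score`) and the sample quality record are illustrative and are not part of the Picarones API.

```python
# Default editorial weights, copied from DEFAULT_COMPLEXITY_WEIGHTS in the diff.
WEIGHTS = {
    "noise_level": 0.30,
    "blur": 0.30,
    "low_contrast": 0.20,
    "rotation": 0.20,
}
ROTATION_SATURATION_DEG = 30.0


def clip01(x: float) -> float:
    """Clamp a value to the closed interval [0, 1]."""
    return max(0.0, min(1.0, x))


def complexity_score(quality: dict, weights: dict = WEIGHTS) -> float:
    """Weighted sum of the four components, with weights normalised to 1."""
    total = sum(weights.values())
    w = {k: v / total for k, v in weights.items()}
    components = {
        "noise_level": clip01(quality["noise_level"]),
        "blur": 1.0 - clip01(quality["sharpness_score"]),
        "low_contrast": 1.0 - clip01(quality["contrast_score"]),
        "rotation": clip01(
            abs(quality["rotation_degrees"]) / ROTATION_SATURATION_DEG
        ),
    }
    return clip01(sum(w[k] * components[k] for k in w))


# Hypothetical quality record (values invented for illustration).
example = {
    "noise_level": 0.2,
    "sharpness_score": 0.7,
    "contrast_score": 0.8,
    "rotation_degrees": -6.0,
}
score = complexity_score(example)
# 0.30*0.2 + 0.30*0.3 + 0.20*0.2 + 0.20*0.2 = 0.23, up to float rounding

# A clean image rotated by 90° saturates the rotation component only.
rotated = {
    "noise_level": 0.0,
    "sharpness_score": 1.0,
    "contrast_score": 1.0,
    "rotation_degrees": 90.0,
}
rotated_score = complexity_score(rotated)  # 0.20, the full rotation weight
```

The second record shows why the saturation matters: any rotation of 30° or more contributes exactly the rotation weight and nothing beyond it.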
|
@@ -0,0 +1,150 @@
| 1 |
+
"""Co-occurrence des classes taxonomiques d'erreur — Sprint 75 (A.I.4 chantier 1).
|
| 2 |
+
|
| 3 |
+
Sprint 75 — A.I.4 chantier 1 du plan d'évolution 2026.
|
| 4 |
+
|
| 5 |
+
Pourquoi ce module
|
| 6 |
+
------------------
|
| 7 |
+
La taxonomie d'erreurs (10 classes, ``picarones/core/taxonomy.py``)
|
| 8 |
+
est calculée par document mais le rapport actuel ne montre qu'un
|
| 9 |
+
seul histogramme global. La roadmap A.I.4 demande trois lectures
|
| 10 |
+
plus fines de cette taxonomie ; ce sprint livre la première :
|
| 11 |
+
**co-occurrence**.
|
| 12 |
+
|
| 13 |
+
Si ``ligature_error`` et ``abbreviation_error`` co-occurrent
|
| 14 |
+
toujours dans les mêmes documents, c'est un signal de scribe
|
| 15 |
+
particulier — utile pour stratifier le corpus *a posteriori*
|
| 16 |
+
(qu'est-ce qui caractérise les documents difficiles ?).
|
| 17 |
+
|
| 18 |
+
Mesure
|
| 19 |
+
------
|
| 20 |
+
Indice de **Jaccard** entre paires de classes au niveau
|
| 21 |
+
**document** :
|
| 22 |
+
|
| 23 |
+
.. math::
|
| 24 |
+
|
| 25 |
+
J(A, B) = \\frac{|D_A \\cap D_B|}{|D_A \\cup D_B|}
|
| 26 |
+
|
| 27 |
+
où ``D_X`` est l'ensemble des documents qui contiennent au moins
|
| 28 |
+
une erreur de classe ``X``.
|
| 29 |
+
|
| 30 |
+
- ``J(A, B) = 1`` : A et B apparaissent toujours ensemble (et
|
| 31 |
+
jamais l'un sans l'autre).
|
| 32 |
+
- ``J(A, B) = 0`` : A et B ne co-occurrent jamais.
|
| 33 |
+
- ``J(A, B) = 0,5`` : A et B partagent la moitié de leur union.
|
| 34 |
+
|
| 35 |
+
Stratégie de découpage
|
| 36 |
+
----------------------
|
| 37 |
+
Couche de calcul pure d'abord (pattern Sprint 35, 38, 52-58).
|
| 38 |
+
Le rendu HTML (heatmap SVG) est livré dans le même sprint pour
|
| 39 |
+
boucler la dimension ; les chantiers 2 et 3 d'A.I.4 (évolution
|
| 40 |
+
intra-document, taxonomie comparative) suivent.
|
| 41 |
+
"""
|
| 42 |
+
|
| 43 |
+
from __future__ import annotations
|
| 44 |
+
|
| 45 |
+
import logging
|
| 46 |
+
from typing import Iterable, Optional
|
| 47 |
+
|
| 48 |
+
logger = logging.getLogger(__name__)
|
| 49 |
+
|
| 50 |
+
|
| 51 |
+
def compute_taxonomy_cooccurrence(
|
| 52 |
+
per_doc_classes: Iterable[Iterable[str]],
|
| 53 |
+
*,
|
| 54 |
+
min_doc_count: int = 1,
|
| 55 |
+
top_n_pairs: int = 10,
|
| 56 |
+
) -> Optional[dict]:
|
| 57 |
+
"""Calcule la matrice de Jaccard inter-classes au niveau document.
|
| 58 |
+
|
| 59 |
+
Parameters
|
| 60 |
+
----------
|
| 61 |
+
per_doc_classes:
|
| 62 |
+
Itérable de docs, chaque doc étant un itérable de noms de
|
| 63 |
+
classes taxonomiques détectées (set, list, tuple…).
|
| 64 |
+
Les doublons à l'intérieur d'un doc sont ignorés (présence
|
| 65 |
+
binaire au niveau doc).
|
| 66 |
+
min_doc_count:
|
| 67 |
+
Nombre minimum de documents dans lesquels une classe doit
|
| 68 |
+
apparaître pour figurer dans la matrice (défaut 1).
|
| 69 |
+
Permet d'écarter les classes anecdotiques.
|
| 70 |
+
top_n_pairs:
|
| 71 |
+
Nombre de paires retournées dans ``top_pairs`` (triées par
|
| 72 |
+
Jaccard décroissant). Défaut 10.
|
| 73 |
+
|
| 74 |
+
Returns
|
| 75 |
+
-------
|
| 76 |
+
Optional[dict]
|
| 77 |
+
``{
|
| 78 |
+
"classes": list[str], # triées alpha
|
| 79 |
+
"n_documents": int,
|
| 80 |
+
"doc_count": dict[str, int], # nb docs par classe
|
| 81 |
+
"cooccurrence_matrix": dict[str, dict[str, float]],
|
| 82 |
+
# symétrique, diagonale = 1.0 (sauf classe vide)
|
| 83 |
+
"top_pairs": list[tuple[str, str, float]],
|
| 84 |
+
# paires les plus co-occurrentes (Jaccard désc.)
|
| 85 |
+
}``
|
| 86 |
+
ou ``None`` si aucune classe ne dépasse ``min_doc_count``
|
| 87 |
+
ou si l'itérable est vide.
|
| 88 |
+
"""
|
| 89 |
+
docs: list[frozenset[str]] = []
|
| 90 |
+
for doc_classes in per_doc_classes:
|
| 91 |
+
if doc_classes is None:
|
| 92 |
+
continue
|
| 93 |
+
cleaned = frozenset(c for c in doc_classes if c)
|
| 94 |
+
docs.append(cleaned)
|
| 95 |
+
if not docs:
|
| 96 |
+
return None
|
| 97 |
+
|
| 98 |
+
# Comptage par classe
|
| 99 |
+
doc_count: dict[str, int] = {}
|
| 100 |
+
for doc in docs:
|
| 101 |
+
for cls in doc:
|
| 102 |
+
doc_count[cls] = doc_count.get(cls, 0) + 1
|
| 103 |
+
|
| 104 |
+
# Filtrage min_doc_count
|
| 105 |
+
classes = sorted(
|
| 106 |
+
c for c, n in doc_count.items() if n >= min_doc_count
|
| 107 |
+
)
|
| 108 |
+
if not classes:
|
| 109 |
+
return None
|
| 110 |
+
|
| 111 |
+
# Matrice de Jaccard
|
| 112 |
+
matrix: dict[str, dict[str, float]] = {
|
| 113 |
+
c: {} for c in classes
|
| 114 |
+
}
|
| 115 |
+
for i, ca in enumerate(classes):
|
| 116 |
+
docs_a = {idx for idx, d in enumerate(docs) if ca in d}
|
| 117 |
+
for cb in classes[i:]:
|
| 118 |
+
if ca == cb:
|
| 119 |
+
# Diagonale : Jaccard(X, X) = 1 si X est présent
|
| 120 |
+
matrix[ca][cb] = 1.0 if docs_a else 0.0
|
| 121 |
+
continue
|
| 122 |
+
docs_b = {idx for idx, d in enumerate(docs) if cb in d}
|
| 123 |
+
inter = len(docs_a & docs_b)
|
| 124 |
+
union = len(docs_a | docs_b)
|
| 125 |
+
jaccard = inter / union if union > 0 else 0.0
|
| 126 |
+
matrix[ca][cb] = jaccard
|
| 127 |
+
matrix[cb][ca] = jaccard # symétrique
|
| 128 |
+
|
| 129 |
+
# Top paires (hors diagonale)
|
| 130 |
+
pairs: list[tuple[str, str, float]] = []
|
| 131 |
+
for i, ca in enumerate(classes):
|
| 132 |
+
for cb in classes[i + 1:]:
|
| 133 |
+
j = matrix[ca][cb]
|
| 134 |
+
if j > 0:
|
| 135 |
+
pairs.append((ca, cb, j))
|
| 136 |
+
pairs.sort(key=lambda p: (-p[2], p[0], p[1]))
|
| 137 |
+
top_pairs = pairs[:top_n_pairs]
|
| 138 |
+
|
| 139 |
+
return {
|
| 140 |
+
"classes": classes,
|
| 141 |
+
"n_documents": len(docs),
|
| 142 |
+
"doc_count": doc_count,
|
| 143 |
+
"cooccurrence_matrix": matrix,
|
| 144 |
+
"top_pairs": top_pairs,
|
| 145 |
+
}
|
| 146 |
+
|
| 147 |
+
|
| 148 |
+
__all__ = [
|
| 149 |
+
"compute_taxonomy_cooccurrence",
|
| 150 |
+
]
|
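The document-level Jaccard index used by this module is easy to verify on a toy corpus. A minimal sketch, assuming a standalone helper named `jaccard_between_classes` (hypothetical, not exported by Picarones) and four invented documents:

```python
def jaccard_between_classes(docs, class_a, class_b):
    """Jaccard index of the document sets containing class_a and class_b."""
    docs_a = {i for i, d in enumerate(docs) if class_a in d}
    docs_b = {i for i, d in enumerate(docs) if class_b in d}
    union = docs_a | docs_b
    return len(docs_a & docs_b) / len(union) if union else 0.0


# Each document is reduced to the set of error classes it contains.
docs = [
    {"ligature_error", "abbreviation_error"},
    {"ligature_error"},
    {"ligature_error", "abbreviation_error"},
    {"case_error"},
]

# D_ligature = {0, 1, 2}, D_abbreviation = {0, 2}: intersection 2, union 3.
j_lig_abbr = jaccard_between_classes(docs, "ligature_error", "abbreviation_error")

# D_case = {3} never overlaps with D_ligature.
j_lig_case = jaccard_between_classes(docs, "ligature_error", "case_error")
```

Here `j_lig_abbr` is 2/3 and `j_lig_case` is 0.0, matching the reading given in the docstring: presence is binary per document, so repeated errors inside one document do not change the index.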
|
@@ -0,0 +1,202 @@
| 1 |
+
"""Évolution intra-document des classes taxonomiques — Sprint 76 (A.I.4 chantier 2).
|
| 2 |
+
|
| 3 |
+
Sprint 76 — A.I.4 chantier 2 du plan d'évolution 2026.
|
| 4 |
+
|
| 5 |
+
Pourquoi ce module
|
| 6 |
+
------------------
|
| 7 |
+
La taxonomie d'erreurs (10 classes, ``picarones/core/taxonomy.py``)
|
| 8 |
+
est calculée par document mais agrégée en un seul histogramme
|
| 9 |
+
global. ``line_metrics.py`` (Sprint 10) a déjà une heatmap de
|
| 10 |
+
**CER par tranche de position** dans le document. Ce sprint
|
| 11 |
+
**étend cette heatmap à toutes les classes taxonomiques** : où
|
| 12 |
+
dans le document apparaît tel type d'erreur ?
|
| 13 |
+
|
| 14 |
+
Lecture concrète : si ``ligature_error`` est concentré dans la
|
| 15 |
+
première tranche, c'est une erreur de **marge** (haut de page) ;
|
| 16 |
+
si réparti uniformément, c'est une erreur de **scribe**.
|
| 17 |
+
|
| 18 |
+
Implémentation
|
| 19 |
+
--------------
|
| 20 |
+
On refait la classification mot-à-mot (cohérent avec
|
| 21 |
+
``classify_errors``) en gardant la position du mot GT
|
| 22 |
+
(``i1`` dans la diff word-level). Chaque erreur est binnifiée
|
| 23 |
+
selon sa position dans le document (``bin = floor(i1 / n_gt_words *
|
| 24 |
+
n_bins)``).
|
| 25 |
+
|
| 26 |
+
Output
|
| 27 |
+
------
|
| 28 |
+
``compute_taxonomy_position_heatmap(reference, hypothesis,
|
| 29 |
+
n_bins=10)`` returns a dict whose ``per_class`` entry maps each
|
| 30 |
+
class name to a ``list[int]`` of ``n_bins`` values: the **count**
|
| 31 |
+
of errors of that class in the corresponding bin.
|
| 32 |
+
|
| 33 |
+
Stratégie de découpage
|
| 34 |
+
----------------------
|
| 35 |
+
Couche de calcul + rendu HTML bout-en-bout, comme Sprint 75.
|
| 36 |
+
"""
|
| 37 |
+
|
| 38 |
+
from __future__ import annotations
|
| 39 |
+
|
| 40 |
+
import difflib
|
| 41 |
+
import logging
|
| 42 |
+
import unicodedata
|
| 43 |
+
from typing import Optional
|
| 44 |
+
|
| 45 |
+
from picarones.core.taxonomy import (
|
| 46 |
+
ERROR_CLASSES,
|
| 47 |
+
_is_abbreviation_error,
|
| 48 |
+
_is_diacritic_error,
|
| 49 |
+
_is_ligature_error,
|
| 50 |
+
_is_oov_word,
|
| 51 |
+
_is_visual_confusion,
|
| 52 |
+
)
|
| 53 |
+
|
| 54 |
+
logger = logging.getLogger(__name__)
|
| 55 |
+
|
| 56 |
+
|
| 57 |
+
def _classify_word_pair(gt_word: str, hyp_word: str) -> str:
|
| 58 |
+
"""Retourne la classe taxonomique d'une erreur mot-à-mot.
|
| 59 |
+
|
| 60 |
+
Reproduit la logique de ``taxonomy._classify_word_error`` sans
|
| 61 |
+
modifier ses compteurs internes — utile pour avoir
|
| 62 |
+
``(position, class)`` paire.
|
| 63 |
+
"""
|
| 64 |
+
if gt_word.casefold() == hyp_word.casefold() and gt_word != hyp_word:
|
| 65 |
+
return "case_error"
|
| 66 |
+
gt_norm = unicodedata.normalize("NFC", gt_word)
|
| 67 |
+
hyp_norm = unicodedata.normalize("NFC", hyp_word)
|
| 68 |
+
if _is_ligature_error(gt_norm, hyp_norm):
|
| 69 |
+
return "ligature_error"
|
| 70 |
+
if _is_abbreviation_error(gt_norm, hyp_norm):
|
| 71 |
+
return "abbreviation_error"
|
| 72 |
+
if _is_diacritic_error(gt_norm, hyp_norm):
|
| 73 |
+
return "diacritic_error"
|
| 74 |
+
if _is_visual_confusion(gt_norm, hyp_norm):
|
| 75 |
+
return "visual_confusion"
|
| 76 |
+
if _is_oov_word(hyp_word):
|
| 77 |
+
return "oov_character"
|
| 78 |
+
return "hapax"
|
| 79 |
+
|
| 80 |
+
|
| 81 |
+
def _bin_for_position(position: int, total: int, n_bins: int) -> int:
|
| 82 |
+
"""Retourne l'index de bin pour une position (0-based) sur un
|
| 83 |
+
total de mots GT. Garde-fou sur les bornes : si position == total
|
| 84 |
+
(peut arriver pour insert en fin de doc), on clip au dernier bin.
|
| 85 |
+
"""
|
| 86 |
+
if total <= 0 or n_bins <= 0:
|
| 87 |
+
return 0
|
| 88 |
+
bin_idx = int((position / total) * n_bins)
|
| 89 |
+
if bin_idx >= n_bins:
|
| 90 |
+
bin_idx = n_bins - 1
|
| 91 |
+
if bin_idx < 0:
|
| 92 |
+
bin_idx = 0
|
| 93 |
+
return bin_idx
|
| 94 |
+
|
| 95 |
+
|
| 96 |
+
def compute_taxonomy_position_heatmap(
|
| 97 |
+
reference: Optional[str],
|
| 98 |
+
hypothesis: Optional[str],
|
| 99 |
+
*,
|
| 100 |
+
n_bins: int = 10,
|
| 101 |
+
) -> Optional[dict]:
|
| 102 |
+
"""Calcule la heatmap class × position pour un document.
|
| 103 |
+
|
| 104 |
+
Parameters
|
| 105 |
+
----------
|
| 106 |
+
reference:
|
| 107 |
+
Texte GT du document.
|
| 108 |
+
hypothesis:
|
| 109 |
+
Texte produit par l'OCR.
|
| 110 |
+
n_bins:
|
| 111 |
+
Nombre de tranches de position (défaut 10, cohérent avec
|
| 112 |
+
``line_metrics.heatmap``).
|
| 113 |
+
|
| 114 |
+
Returns
|
| 115 |
+
-------
|
| 116 |
+
Optional[dict]
|
| 117 |
+
``{
|
| 118 |
+
"n_bins": int,
|
| 119 |
+
"n_words_gt": int, # nb mots GT
|
| 120 |
+
"total_errors": int, # somme sur toutes classes
|
| 121 |
+
"per_class": {
|
| 122 |
+
class_name: list[int], # n_bins valeurs (compte par bin)
|
| 123 |
+
},
|
| 124 |
+
"totals_per_bin": list[int], # nb total d'erreurs par bin
|
| 125 |
+
}``
|
| 126 |
+
Ou ``None`` si la GT est vide.
|
| 127 |
+
"""
|
| 128 |
+
if n_bins <= 0:
|
| 129 |
+
raise ValueError("n_bins doit être > 0")
|
| 130 |
+
ref = reference or ""
|
| 131 |
+
hyp = hypothesis or ""
|
| 132 |
+
gt_words = ref.split()
|
| 133 |
+
hyp_words = hyp.split()
|
| 134 |
+
n_gt = len(gt_words)
|
| 135 |
+
if n_gt == 0:
|
| 136 |
+
return None
|
| 137 |
+
|
| 138 |
+
+    per_class: dict[str, list[int]] = {
+        cls: [0] * n_bins for cls in ERROR_CLASSES
+    }
+    totals_per_bin: list[int] = [0] * n_bins
+    total_errors = 0
+
+    matcher = difflib.SequenceMatcher(
+        None, gt_words, hyp_words, autojunk=False,
+    )
+    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
+        if tag == "equal":
+            continue
+        if tag == "delete":
+            for offset in range(i2 - i1):
+                position = i1 + offset
+                bin_idx = _bin_for_position(position, n_gt, n_bins)
+                per_class["lacuna"][bin_idx] += 1
+                totals_per_bin[bin_idx] += 1
+                total_errors += 1
+        elif tag == "insert":
+            # An insert has no GT position of its own: it is attributed
+            # to the bin of the insertion position (i1).
+            for w in hyp_words[j1:j2]:
+                if not _is_oov_word(w):
+                    continue
+                position = min(i1, n_gt - 1)
+                bin_idx = _bin_for_position(position, n_gt, n_bins)
+                per_class["oov_character"][bin_idx] += 1
+                totals_per_bin[bin_idx] += 1
+                total_errors += 1
+        elif tag == "replace":
+            gt_seg = gt_words[i1:i2]
+            hyp_seg = hyp_words[j1:j2]
+            if len(hyp_seg) != len(gt_seg):
+                # Segmentation: counted as the length difference
+                n_seg = abs(len(gt_seg) - len(hyp_seg))
+                bin_idx = _bin_for_position(i1, n_gt, n_bins)
+                per_class["segmentation_error"][bin_idx] += n_seg
+                totals_per_bin[bin_idx] += n_seg
+                total_errors += n_seg
+            else:
+                for offset, (gt_w, hyp_w) in enumerate(
+                    zip(gt_seg, hyp_seg),
+                ):
+                    if gt_w == hyp_w:
+                        continue
+                    position = i1 + offset
+                    bin_idx = _bin_for_position(position, n_gt, n_bins)
+                    cls = _classify_word_pair(gt_w, hyp_w)
+                    per_class[cls][bin_idx] += 1
+                    totals_per_bin[bin_idx] += 1
+                    total_errors += 1
+
+    return {
+        "n_bins": n_bins,
+        "n_words_gt": n_gt,
+        "total_errors": total_errors,
+        "per_class": per_class,
+        "totals_per_bin": totals_per_bin,
+    }
+
+
+__all__ = [
+    "compute_taxonomy_position_heatmap",
+]
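The binning used by the heatmap can be illustrated on its own. The sketch below is a minimal example and not the project's code: `bin_for_position` is a hypothetical stand-in for `_bin_for_position` (whose body is not shown in this diff) and assumes equal-width bins over the ground-truth word positions; `deletion_bins` only reproduces the `delete` branch.

```python
import difflib


def bin_for_position(position: int, n_total: int, n_bins: int) -> int:
    """Hypothetical stand-in for `_bin_for_position` (equal-width bins)."""
    if n_total <= 0:
        return 0
    # Integer arithmetic keeps the last position inside the last bin.
    return min(n_bins - 1, max(0, position) * n_bins // n_total)


def deletion_bins(gt_words: list, hyp_words: list, n_bins: int) -> list:
    """Count, per bin, the GT words that are missing from the hypothesis."""
    counts = [0] * n_bins
    matcher = difflib.SequenceMatcher(None, gt_words, hyp_words, autojunk=False)
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag != "delete":
            continue
        for position in range(i1, i2):
            counts[bin_for_position(position, len(gt_words), n_bins)] += 1
    return counts


gt = "a b c d e f g h".split()
hyp = "a b c d e f".split()
tail_loss = deletion_bins(gt, hyp, n_bins=4)  # the two lost words fall in the last bin
```

With four bins over eight ground-truth words, the two words dropped at the end of the hypothesis both land in the last bin, which is the kind of positional concentration the heatmap is meant to expose.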
@@ -0,0 +1,8 @@
+"""Preventive governance for externally contributed modules.
+
+Picarones does not yet have any third-party modules contributed by
+external users. The ``module_policy`` module is shipped here ahead of
+time to prepare for the (distant) opening phase.
+
+It will move back into Cercle 2 if and when 5+ third-party modules
+are published.
+"""
@@ -0,0 +1,333 @@
+"""Contributed-module policy: Sprint 97 (B.6) of the 2026 evolution plan.
+
+Why this module
+---------------
+Before opening Picarones to external contributions (axis B:
+third-party modules brought by the user), an explicit quality
+framework is needed: *"a module that does not pass the audit
+is not executable."*
+
+This module provides the **audit envelope**:
+
+- ``ModuleManifest``: required metadata (author, licence,
+  version, typed input/output contract) plus optional fields
+  such as the citation.
+- ``validate_manifest(manifest)``: checks that every required
+  field is present and well formed.
+- ``audit_module(module_class_or_instance, manifest)``:
+  additionally checks that the class honours the ``BaseModule``
+  contract and that ``input_types``/``output_types`` match the
+  manifest.
+- ``AuditResult``: structured ``passed/failed`` verdict plus the
+  list of detailed checks.
+
+Opening strategy
+----------------
+Current closed phase: official modules only, contributions via
+PR on the main repo. Future open phase: once 5–6 official
+modules are stable, opening via ``entry_points`` on PyPI
+(``picarones-module-X``). This module prepares the open phase
+without triggering it: every external module will have to
+provide a valid ``ModuleManifest`` in order to run.
+
+No SPDX validation
+------------------
+The licence field is checked for presence and non-emptiness;
+the SPDX conformance of the name is not validated (``MIT`` vs
+``mit-license`` vs ``MIT License``). The researcher remains
+responsible for the choice of licence; the tool documents, it
+does not judge.
+"""
+
+from __future__ import annotations
+
+import logging
+from dataclasses import dataclass, field
+from typing import Any, Optional
+
+logger = logging.getLogger(__name__)
+
+
+# Required fields of a ModuleManifest (non-empty text).
+_REQUIRED_TEXT_FIELDS = (
+    "name", "version", "author", "license",
+    "description",
+)
+
+
+@dataclass
+class ModuleManifest:
+    """Metadata of a contributed module.
+
+    Attributes
+    ----------
+    name:
+        Unique identifier of the module (e.g. ``"my-llm-correcteur"``).
+    version:
+        Semantic version (e.g. ``"1.2.0"``).
+    author:
+        Responsible author or institution.
+    license:
+        Licence identifier (SPDX recommended, not validated).
+    description:
+        Short description (at most one sentence).
+    input_types:
+        List of input types (strings). Must match
+        ``module.input_types`` (Sprint 33).
+    output_types:
+        List of output types. Must match
+        ``module.output_types``.
+    citation:
+        Academic citation (BibTeX, DOI, or free text).
+        Optional.
+    homepage:
+        URL of the repository or project page. Optional.
+    picarones_min_version:
+        Minimum required Picarones version. Optional.
+    extra:
+        Free-form metadata (key → value).
+    """
+
+    name: str
+    version: str
+    author: str
+    license: str
+    description: str
+    input_types: list[str] = field(default_factory=list)
+    output_types: list[str] = field(default_factory=list)
+    citation: Optional[str] = None
+    homepage: Optional[str] = None
+    picarones_min_version: Optional[str] = None
+    extra: dict = field(default_factory=dict)
+
+    def as_dict(self) -> dict:
+        return {
+            "name": self.name,
+            "version": self.version,
+            "author": self.author,
+            "license": self.license,
+            "description": self.description,
+            "input_types": list(self.input_types),
+            "output_types": list(self.output_types),
+            "citation": self.citation,
+            "homepage": self.homepage,
+            "picarones_min_version": self.picarones_min_version,
+            "extra": dict(self.extra),
+        }
+
+
+@dataclass
+class AuditCheck:
+    """A single check of the audit."""
+
+    name: str
+    passed: bool
+    detail: Optional[str] = None
+
+    def as_dict(self) -> dict:
+        return {
+            "name": self.name,
+            "passed": self.passed,
+            "detail": self.detail,
+        }
+
+
+@dataclass
+class AuditResult:
+    """Overall result of a module audit."""
+
+    module_name: str
+    passed: bool
+    checks: list[AuditCheck] = field(default_factory=list)
+
+    @property
+    def n_passed(self) -> int:
+        return sum(1 for c in self.checks if c.passed)
+
+    @property
+    def n_failed(self) -> int:
+        return sum(1 for c in self.checks if not c.passed)
+
+    def as_dict(self) -> dict:
+        return {
+            "module_name": self.module_name,
+            "passed": self.passed,
+            "n_passed": self.n_passed,
+            "n_failed": self.n_failed,
+            "checks": [c.as_dict() for c in self.checks],
+        }
+
+
+def validate_manifest(manifest: ModuleManifest) -> list[AuditCheck]:
+    """Check that a manifest is complete and well formed.
+
+    Returns
+    -------
+    list[AuditCheck]
+        One check per required text field, plus one check each
+        for non-empty ``input_types`` and ``output_types``.
+    """
+    checks: list[AuditCheck] = []
+    for field_name in _REQUIRED_TEXT_FIELDS:
+        value = getattr(manifest, field_name, None)
+        ok = isinstance(value, str) and bool(value.strip())
+        checks.append(AuditCheck(
+            name=f"manifest.{field_name}",
+            passed=ok,
+            detail=None if ok else f"champ '{field_name}' vide ou absent",
+        ))
+    # input_types / output_types: at least one entry each
+    in_ok = (
+        isinstance(manifest.input_types, list)
+        and len(manifest.input_types) > 0
+        and all(
+            isinstance(t, str) and t for t in manifest.input_types
+        )
+    )
+    checks.append(AuditCheck(
+        name="manifest.input_types",
+        passed=in_ok,
+        detail=None if in_ok else "input_types vide ou non-string",
+    ))
+    out_ok = (
+        isinstance(manifest.output_types, list)
+        and len(manifest.output_types) > 0
+        and all(
+            isinstance(t, str) and t for t in manifest.output_types
+        )
+    )
+    checks.append(AuditCheck(
+        name="manifest.output_types",
+        passed=out_ok,
+        detail=None if out_ok else "output_types vide ou non-string",
+    ))
+    return checks
+
+
+def _is_base_module(cls: Any) -> bool:
+    """Best effort: check that cls inherits from BaseModule.
+
+    ``BaseModule`` is deliberately **not** imported at the top
+    level, to avoid import cycles: the class chain is inspected
+    by name instead.
+    """
+    try:
+        for base in cls.__mro__:
+            if base.__name__ == "BaseModule":
+                return True
+    except AttributeError:
+        return False
+    return False
+
+
+def audit_module(
+    module_class_or_instance: Any,
+    manifest: ModuleManifest,
+) -> AuditResult:
+    """Audit a contributed module: interface + manifest.
+
+    Parameters
+    ----------
+    module_class_or_instance:
+        Either a ``BaseModule`` class (Sprint 33) or an
+        instance of one.
+    manifest:
+        ``ModuleManifest`` matching the module.
+
+    Returns
+    -------
+    AuditResult
+        ``passed=True`` if and only if all checks pass.
+    """
+    checks = validate_manifest(manifest)
+
+    # Check: inherits from BaseModule
+    cls = (
+        type(module_class_or_instance)
+        if not isinstance(module_class_or_instance, type)
+        else module_class_or_instance
+    )
+    inherits_base = _is_base_module(cls)
+    checks.append(AuditCheck(
+        name="module.inherits_base_module",
+        passed=inherits_base,
+        detail=(
+            None if inherits_base
+            else "la classe n'hérite pas de picarones.core.modules.BaseModule"
+        ),
+    ))
+
+    # Check: input_types / output_types match the manifest
+    declared_in: list[str] = []
+    declared_out: list[str] = []
+    try:
+        instance = (
+            module_class_or_instance
+            if not isinstance(module_class_or_instance, type)
+            else None
+        )
+        attr_in = getattr(cls, "input_types", None)
+        attr_out = getattr(cls, "output_types", None)
+        if instance is not None:
+            attr_in = getattr(instance, "input_types", attr_in)
+            attr_out = getattr(instance, "output_types", attr_out)
+        if attr_in is not None:
+            declared_in = [
+                getattr(t, "value", str(t)) for t in attr_in
+            ]
+        if attr_out is not None:
+            declared_out = [
+                getattr(t, "value", str(t)) for t in attr_out
+            ]
+    except Exception:  # noqa: BLE001
+        pass
+    # Case-insensitive comparison: "TEXT" or "text" is accepted on the
+    # manifest side, the semantic contract is the same. Entries are
+    # coerced with str() so that a malformed manifest (already flagged
+    # by validate_manifest) yields a failed check instead of raising.
+    declared_in_lower = sorted(str(t).lower() for t in declared_in)
+    declared_out_lower = sorted(str(t).lower() for t in declared_out)
+    manifest_in_lower = sorted(
+        str(t).lower() for t in (manifest.input_types or [])
+    )
+    manifest_out_lower = sorted(
+        str(t).lower() for t in (manifest.output_types or [])
+    )
+    in_match = declared_in_lower == manifest_in_lower
+    checks.append(AuditCheck(
+        name="module.input_types_match_manifest",
+        passed=in_match,
+        detail=(
+            None if in_match
+            else f"déclaré {declared_in} vs manifest {manifest.input_types}"
+        ),
+    ))
+    out_match = declared_out_lower == manifest_out_lower
+    checks.append(AuditCheck(
+        name="module.output_types_match_manifest",
+        passed=out_match,
+        detail=(
+            None if out_match
+            else f"déclaré {declared_out} vs manifest {manifest.output_types}"
+        ),
+    ))
+
+    # Check: process is callable
+    has_process = callable(getattr(cls, "process", None))
+    checks.append(AuditCheck(
+        name="module.has_process",
+        passed=has_process,
+        detail=None if has_process else "méthode process() absente",
+    ))
+
+    passed = all(c.passed for c in checks)
+    return AuditResult(
+        module_name=manifest.name,
+        passed=passed,
+        checks=checks,
+    )
+
+
+__all__ = [
+    "ModuleManifest",
+    "AuditCheck",
+    "AuditResult",
+    "validate_manifest",
+    "audit_module",
+]
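The two ideas behind the audit (non-empty required text fields, and a case-insensitive, order-insensitive comparison of declared types against the manifest) can be tried without installing the package. The sketch below is self-contained and uses hypothetical names: `Manifest` is a reduced stand-in for `ModuleManifest`, and `missing_text_fields` / `types_match` are illustrative helpers, not functions of the module above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Manifest:
    """Hypothetical, reduced stand-in for ModuleManifest."""

    name: str
    version: str
    license: str
    input_types: List[str] = field(default_factory=list)
    output_types: List[str] = field(default_factory=list)


def missing_text_fields(manifest, required=("name", "version", "license")):
    """Return the required fields that are absent, non-string or blank."""
    missing = []
    for field_name in required:
        value = getattr(manifest, field_name, None)
        if not (isinstance(value, str) and value.strip()):
            missing.append(field_name)
    return missing


def types_match(declared, manifest_types):
    """Compare two type lists ignoring case and order."""
    normalise = lambda items: sorted(str(t).lower() for t in items)
    return normalise(declared) == normalise(manifest_types)


demo = Manifest(name="demo", version=" ", license="MIT",
                input_types=["TEXT"], output_types=["text"])
problems = missing_text_fields(demo)              # blank version is reported
io_ok = types_match(["text"], demo.input_types)   # "TEXT" and "text" agree
```

A whitespace-only version is treated the same way as a missing one, which mirrors the `value.strip()` test in `validate_manifest`.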
@@ -0,0 +1,13 @@
+"""Atomic renderers for the ``extras/`` modules.
+
+Imported conditionally by the thematic views of workstream 3
+(``picarones.report.views.advanced_taxonomy``, etc.), which stay
+in Cercle 2. If the ``extras/academic/`` or
+``extras/governance/`` modules are absent, these renderers are not
+called and the view hides the sub-section.
+
+Backward compatibility
+----------------------
+Historical imports ``from picarones.report.taxonomy_intra_doc_render
+import ...`` keep working through shim files.
+"""
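The shim mechanism mentioned above can be sketched in isolation. The example below is an assumption-laden illustration, not the content of the project's shim files: the module names `demo_extras_render` and `demo_report_shim` are hypothetical and are registered in memory through `sys.modules` so that the snippet runs without any package on disk. It shows that a re-exporting shim hands back the very same function object, so no logic is duplicated.

```python
import sys
import types


def build_module_audit_html(audits, labels=None):
    """Placeholder standing in for the real renderer."""
    return ""


# New location (hypothetical name), registered in memory for the demo.
impl = types.ModuleType("demo_extras_render")
impl.build_module_audit_html = build_module_audit_html
sys.modules[impl.__name__] = impl

# Old location: the shim only re-exports, it defines no logic of its own.
SHIM_SOURCE = (
    "from demo_extras_render import build_module_audit_html\n"
    "__all__ = ['build_module_audit_html']\n"
)
shim = types.ModuleType("demo_report_shim")
exec(SHIM_SOURCE, shim.__dict__)
sys.modules[shim.__name__] = shim

# A historical import path resolves to the object defined at the new path.
from demo_report_shim import build_module_audit_html as via_old_path

same_object = via_old_path is impl.build_module_audit_html
```

Because the shim binds the name rather than wrapping it, identity checks and monkeypatching against either path see the same object.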
@@ -0,0 +1,221 @@
+"""HTML rendering of the corpus image profile view, Sprint 93 (A.II.7).
+
+Direct follow-up to ``picarones/core/image_predictive.py``. Same
+pattern as the other renderers: server-side, no JS, systematic
+escaping against injection.
+
+View
+----
+Two blocks in a single section:
+
+1. **Palaeographic complexity**: mean, median, min, max,
+   standard deviation over the whole corpus.
+2. **Corpus homogeneity**: combined score + per-feature
+   detail (mean, stdev, normalised contribution).
+
+Adaptive: returns ``""`` when there is no data.
+
+Integration note
+----------------
+Pure module; the caller composes:
+
+.. code-block:: python
+
+    from picarones.core.image_predictive import aggregate_corpus_predictive
+    from picarones.report.image_predictive_render import (
+        build_image_predictive_html,
+    )
+
+    qualities = [doc.image_quality.as_dict() for doc in benchmark.docs]
+    agg = aggregate_corpus_predictive(qualities)
+    html = build_image_predictive_html(agg, labels)
+"""
+
+from __future__ import annotations
+
+from html import escape as _e
+from typing import Optional
+
+
+def _color_for_score(score: float) -> str:
+    """Green (low) → orange → red (high)."""
+    f = max(0.0, min(1.0, score))
+    if f < 0.5:
+        t = f / 0.5
+        r = int(167 + (235 - 167) * t)
+        g = int(240 + (180 - 240) * t)
+        b = int(167 + (60 - 167) * t)
+    else:
+        t = (f - 0.5) / 0.5
+        r = int(235 + (220 - 235) * t)
+        g = int(180 + (50 - 180) * t)
+        b = int(60 + (50 - 60) * t)
+    return f"#{r:02x}{g:02x}{b:02x}"
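The colour ramp is a piecewise-linear interpolation between three RGB stops. The sketch below restates it with an explicit stop table so the anchor colours are easy to read off; `STOPS` and `ramp` are illustrative names, not part of the module, and the numeric stops are copied from `_color_for_score`.

```python
# RGB stops read off `_color_for_score`: green at 0.0, orange at 0.5, red at 1.0.
STOPS = [(167, 240, 167), (235, 180, 60), (220, 50, 50)]


def ramp(score: float) -> str:
    """Map a score in [0, 1] (clamped) to a hex colour along the three stops."""
    f = max(0.0, min(1.0, score))
    if f < 0.5:
        lo, hi, t = STOPS[0], STOPS[1], f / 0.5
    else:
        lo, hi, t = STOPS[1], STOPS[2], (f - 0.5) / 0.5
    # Same arithmetic as the original: linear blend, truncated with int().
    r, g, b = (int(a + (z - a) * t) for a, z in zip(lo, hi))
    return f"#{r:02x}{g:02x}{b:02x}"


anchors = [ramp(0.0), ramp(0.5), ramp(1.0)]
```

Out-of-range scores are clamped before interpolation, so a score of 2.0 renders exactly like 1.0.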
+
+
+_FEATURE_LABEL_KEYS = {
+    "noise_level": "imgpred_feat_noise",
+    "sharpness_score": "imgpred_feat_sharpness",
+    "contrast_score": "imgpred_feat_contrast",
+    "rotation_degrees": "imgpred_feat_rotation",
+}
+
+
+def _render_complexity_block(
+    aggregated: dict, labels: dict[str, str],
+) -> str:
+    h_complex = labels.get(
+        "imgpred_complexity", "Complexité paléographique",
+    )
+    h_mean = labels.get("imgpred_mean", "Moyenne")
+    h_median = labels.get("imgpred_median", "Médiane")
+    h_min = labels.get("imgpred_min", "Min")
+    h_max = labels.get("imgpred_max", "Max")
+    h_stdev = labels.get("imgpred_stdev", "Écart-type")
+    h_docs = labels.get("imgpred_docs", "Docs")
+    mean = float(aggregated.get("complexity_mean") or 0.0)
+    median = float(aggregated.get("complexity_median") or 0.0)
+    mn = float(aggregated.get("complexity_min") or 0.0)
+    mx = float(aggregated.get("complexity_max") or 0.0)
+    sd = float(aggregated.get("complexity_stdev") or 0.0)
+    n_docs = int(aggregated.get("n_docs") or 0)
+    color_mean = _color_for_score(mean)
+    return (
+        f'<div style="font-weight:600;margin:.4rem 0 .3rem 0">'
+        f'{_e(h_complex)}</div>'
+        '<table style="border-collapse:collapse;width:100%;'
+        'font-size:.9rem;margin-bottom:.8rem">'
+        f'<thead><tr>'
+        f'<th style="padding:.4rem .6rem;text-align:right;'
+        f'border-bottom:1px solid #ccc;font-weight:600">{_e(h_mean)}</th>'
+        f'<th style="padding:.4rem .6rem;text-align:right;'
+        f'border-bottom:1px solid #ccc;font-weight:600">{_e(h_median)}</th>'
+        f'<th style="padding:.4rem .6rem;text-align:right;'
+        f'border-bottom:1px solid #ccc;font-weight:600">{_e(h_min)}</th>'
+        f'<th style="padding:.4rem .6rem;text-align:right;'
+        f'border-bottom:1px solid #ccc;font-weight:600">{_e(h_max)}</th>'
+        f'<th style="padding:.4rem .6rem;text-align:right;'
+        f'border-bottom:1px solid #ccc;font-weight:600">{_e(h_stdev)}</th>'
+        f'<th style="padding:.4rem .6rem;text-align:right;'
+        f'border-bottom:1px solid #ccc;font-weight:600">{_e(h_docs)}</th>'
+        f'</tr></thead>'
+        f'<tbody><tr>'
+        f'<td style="padding:.4rem .6rem;text-align:right;'
+        f'background:{color_mean};font-family:monospace;font-weight:600">'
+        f'{mean:.3f}</td>'
+        f'<td style="padding:.4rem .6rem;text-align:right;'
+        f'font-family:monospace">{median:.3f}</td>'
+        f'<td style="padding:.4rem .6rem;text-align:right;'
+        f'font-family:monospace">{mn:.3f}</td>'
+        f'<td style="padding:.4rem .6rem;text-align:right;'
+        f'font-family:monospace">{mx:.3f}</td>'
+        f'<td style="padding:.4rem .6rem;text-align:right;'
+        f'font-family:monospace">{sd:.3f}</td>'
+        f'<td style="padding:.4rem .6rem;text-align:right;'
+        f'font-family:monospace">{n_docs}</td>'
+        f'</tr></tbody></table>'
+    )
+
+
+def _render_homogeneity_block(
+    homogeneity: dict, labels: dict[str, str],
+) -> str:
+    h_homo = labels.get(
+        "imgpred_homogeneity", "Homogénéité du corpus",
+    )
+    h_feat = labels.get("imgpred_feature", "Feature")
+    h_mean = labels.get("imgpred_feat_mean", "Moyenne")
+    h_stdev = labels.get("imgpred_feat_stdev", "Écart-type")
+    h_norm = labels.get(
+        "imgpred_feat_norm", "Contribution normalisée",
+    )
+    score = float(homogeneity.get("score") or 0.0)
+    color = _color_for_score(score)
+    parts = [
+        f'<div style="font-weight:600;margin:.4rem 0 .3rem 0">'
+        f'{_e(h_homo)} : '
+        f'<span style="background:{color};padding:.1rem .4rem;'
+        f'border-radius:.3rem;font-family:monospace">{score:.3f}</span>'
+        f'</div>',
+        '<table style="border-collapse:collapse;width:100%;'
+        'font-size:.9rem">',
+        '<thead><tr>',
+    ]
+    for col in (h_feat, h_mean, h_stdev, h_norm):
+        parts.append(
+            f'<th style="padding:.4rem .6rem;text-align:left;'
+            f'border-bottom:1px solid #ccc;font-weight:600">'
+            f'{_e(col)}</th>'
+        )
+    parts.append("</tr></thead><tbody>")
+    per_feat = homogeneity.get("per_feature") or {}
+    for key, label_key in _FEATURE_LABEL_KEYS.items():
+        if key not in per_feat:
+            continue
+        slot = per_feat[key]
+        feat_label = labels.get(label_key, key)
+        feat_mean = float(slot.get("mean") or 0.0)
+        feat_stdev = float(slot.get("stdev") or 0.0)
+        feat_norm = float(slot.get("normalised") or 0.0)
+        norm_color = _color_for_score(feat_norm)
+        parts.append(
+            f'<tr>'
+            f'<td style="padding:.4rem .6rem">{_e(feat_label)}</td>'
+            f'<td style="padding:.4rem .6rem;text-align:right;'
+            f'font-family:monospace">{feat_mean:.3f}</td>'
+            f'<td style="padding:.4rem .6rem;text-align:right;'
+            f'font-family:monospace">{feat_stdev:.3f}</td>'
+            f'<td style="padding:.4rem .6rem;text-align:right;'
+            f'background:{norm_color};font-family:monospace">'
+            f'{feat_norm:.3f}</td>'
+            f'</tr>'
+        )
+    parts.append("</tbody></table>")
+    return "".join(parts)
+
+
+def build_image_predictive_html(
+    aggregated: Optional[dict],
+    labels: Optional[dict[str, str]] = None,
+) -> str:
+    """Build the HTML view of the corpus image profile.
+
+    Parameters
+    ----------
+    aggregated:
+        Output of ``aggregate_corpus_predictive``. If ``None``
+        or ``n_docs == 0``, returns ``""``.
+    labels:
+        i18n dict. Keys use the ``imgpred_*`` prefix.
+    """
+    if not aggregated:
+        return ""
+    if not aggregated.get("n_docs"):
+        return ""
+    labels = labels or {}
+    title = labels.get(
+        "imgpred_title", "Profil d'image du corpus",
+    )
+    note = labels.get(
+        "imgpred_note",
+        "Score de complexité paléographique combinant bruit, "
+        "flou, faible contraste et rotation. Le score "
+        "d'homogénéité signale si la moyenne globale est fiable "
+        "(corpus uniforme) ou trompeuse (corpus hétérogène — "
+        "voir alors la vue stratifiée).",
+    )
+    parts = [
+        '<section class="imgpred-section" style="margin:1rem 0">',
+        f'<h3 style="margin:0 0 .3rem 0">{_e(title)}</h3>',
+        f'<div style="font-size:.85rem;opacity:.75;margin-bottom:.6rem">'
+        f'{_e(note)}</div>',
+    ]
+    parts.append(_render_complexity_block(aggregated, labels))
+    homo = aggregated.get("homogeneity")
+    if isinstance(homo, dict):
+        parts.append(_render_homogeneity_block(homo, labels))
+    parts.append("</section>")
+    return "".join(parts)
+
+
+__all__ = ["build_image_predictive_html"]
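The renderer above only reads a handful of keys from its `aggregated` argument (`n_docs` and the `complexity_*` statistics, plus an optional `homogeneity` dict). As a rough sketch of a compatible input, the helper below summarises a list of per-document complexity scores into those keys. `summarise_complexity` is a hypothetical name, and the use of the population standard deviation is an assumption: this diff does not show how `aggregate_corpus_predictive` computes its statistics.

```python
import statistics


def summarise_complexity(scores):
    """Build the complexity keys read by the renderer (hypothetical helper)."""
    if not scores:
        # The renderer returns "" when n_docs is missing or zero.
        return {"n_docs": 0}
    return {
        "n_docs": len(scores),
        "complexity_mean": statistics.fmean(scores),
        "complexity_median": statistics.median(scores),
        "complexity_min": min(scores),
        "complexity_max": max(scores),
        # Assumption: population standard deviation over the corpus.
        "complexity_stdev": statistics.pstdev(scores),
    }


summary = summarise_complexity([0.2, 0.4, 0.6])
empty = summarise_complexity([])
```

Leaving out the `homogeneity` key is enough for the second block to be skipped, since the renderer only draws it when the value is a dict.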
@@ -0,0 +1,173 @@
+"""HTML rendering of the audited modules view, Sprint 97 (B.6).
+
+Direct follow-up to ``picarones/core/module_policy.py``. Same
+pattern as the other renderers: server-side, no JS, systematic
+escaping against injection.
+
+View
+----
+Summary table of the modules used in a composed pipeline,
+each with:
+
+- Audit status (green ✓ if all checks pass, red ✗ otherwise,
+  with the number of failures);
+- Metadata: version, author, licence;
+- Academic citation, if provided;
+- Homepage URL, if provided (shown as escaped text, not as a link).
+
+Adaptive: returns ``""`` when the list is empty.
+
+Integration note
+----------------
+Pure module; the caller builds the list from its
+``PipelineSpec`` augmented with the ``ModuleManifest`` objects:
+
+.. code-block:: python
+
+    from picarones.core.module_policy import audit_module
+    from picarones.report.module_audit_render import build_module_audit_html
+
+    audits = []
+    for step in pipeline.steps:
+        manifest = step.module.manifest  # application-level convention
+        result = audit_module(step.module, manifest)
+        audits.append({
+            "manifest": manifest.as_dict(),
+            "audit": result.as_dict(),
+        })
+    html = build_module_audit_html(audits, labels)
+"""
+
+from __future__ import annotations
+
+from html import escape as _e
+from typing import Optional
+
+
+def _passed_badge(passed: bool, n_failed: int, label_pass: str,
+                  label_fail: str) -> str:
+    if passed:
+        return (
+            f'<span style="color:#16a34a;font-weight:700">'
+            f'✓ {_e(label_pass)}</span>'
+        )
+    return (
+        f'<span style="color:#dc2626;font-weight:700">'
+        f'✗ {_e(label_fail)} ({n_failed})</span>'
+    )
|
| 58 |
+
|
| 59 |
+
|
| 60 |
+
def build_module_audit_html(
|
| 61 |
+
audits: Optional[list],
|
| 62 |
+
labels: Optional[dict[str, str]] = None,
|
| 63 |
+
) -> str:
|
| 64 |
+
"""Construit la vue HTML « Modules audités ».
|
| 65 |
+
|
| 66 |
+
Parameters
|
| 67 |
+
----------
|
| 68 |
+
audits:
|
| 69 |
+
Liste de dicts ``{"manifest": ManifestDict, "audit":
|
| 70 |
+
AuditResultDict}``. Si vide ou ``None``, retourne ``""``.
|
| 71 |
+
labels:
|
| 72 |
+
Dict i18n. Clés sous le préfixe ``audit_*``.
|
| 73 |
+
"""
|
| 74 |
+
if not audits:
|
| 75 |
+
return ""
|
| 76 |
+
rows = [
|
| 77 |
+
a for a in audits
|
| 78 |
+
if isinstance(a, dict)
|
| 79 |
+
and isinstance(a.get("manifest"), dict)
|
| 80 |
+
and isinstance(a.get("audit"), dict)
|
| 81 |
+
]
|
| 82 |
+
if not rows:
|
| 83 |
+
return ""
|
| 84 |
+
labels = labels or {}
|
| 85 |
+
title = labels.get("audit_title", "Modules audités")
|
| 86 |
+
note = labels.get(
|
| 87 |
+
"audit_note",
|
| 88 |
+
"Récapitulatif des modules utilisés dans la pipeline "
|
| 89 |
+
"composée. Un module qui ne passe pas l'audit n'est "
|
| 90 |
+
"pas exécutable. Métadonnées issues du manifest fourni "
|
| 91 |
+
"par le contributeur (auteur, licence, citation).",
|
| 92 |
+
)
|
| 93 |
+
label_pass = labels.get("audit_pass", "audit OK")
|
| 94 |
+
label_fail = labels.get("audit_fail", "checks échoués")
|
| 95 |
+
h_module = labels.get("audit_module", "Module")
|
| 96 |
+
h_status = labels.get("audit_status", "Audit")
|
| 97 |
+
h_version = labels.get("audit_version", "Version")
|
| 98 |
+
h_author = labels.get("audit_author", "Auteur")
|
| 99 |
+
h_license = labels.get("audit_license", "Licence")
|
| 100 |
+
h_io = labels.get("audit_io", "Entrée → sortie")
|
| 101 |
+
h_citation = labels.get("audit_citation", "Citation")
|
| 102 |
+
h_homepage = labels.get("audit_homepage", "Page projet")
|
| 103 |
+
|
| 104 |
+
parts = [
|
| 105 |
+
'<section class="audit-section" style="margin:1rem 0">',
|
| 106 |
+
f'<h3 style="margin:0 0 .3rem 0">{_e(title)}</h3>',
|
| 107 |
+
f'<div style="font-size:.85rem;opacity:.75;margin-bottom:.5rem">'
|
| 108 |
+
f'{_e(note)}</div>',
|
| 109 |
+
'<table style="border-collapse:collapse;width:100%;'
|
| 110 |
+
'font-size:.9rem">',
|
| 111 |
+
'<thead><tr>',
|
| 112 |
+
]
|
| 113 |
+
for col in (h_module, h_status, h_version, h_author,
|
| 114 |
+
h_license, h_io, h_citation, h_homepage):
|
| 115 |
+
parts.append(
|
| 116 |
+
f'<th style="padding:.4rem .6rem;text-align:left;'
|
| 117 |
+
f'border-bottom:1px solid #ccc;font-weight:600">'
|
| 118 |
+
f'{_e(col)}</th>'
|
| 119 |
+
)
|
| 120 |
+
parts.append("</tr></thead><tbody>")
|
| 121 |
+
|
| 122 |
+
for entry in rows:
|
| 123 |
+
manifest = entry["manifest"]
|
| 124 |
+
audit = entry["audit"]
|
| 125 |
+
name = str(manifest.get("name") or "?")
|
| 126 |
+
version = str(manifest.get("version") or "—")
|
| 127 |
+
author = str(manifest.get("author") or "—")
|
| 128 |
+
license_ = str(manifest.get("license") or "—")
|
| 129 |
+
in_types = ", ".join(manifest.get("input_types") or []) or "—"
|
| 130 |
+
out_types = ", ".join(manifest.get("output_types") or []) or "—"
|
| 131 |
+
citation = manifest.get("citation") or ""
|
| 132 |
+
homepage = manifest.get("homepage") or ""
|
| 133 |
+
passed = bool(audit.get("passed"))
|
| 134 |
+
n_failed = int(audit.get("n_failed") or 0)
|
| 135 |
+
status_cell = _passed_badge(
|
| 136 |
+
passed, n_failed, label_pass, label_fail,
|
| 137 |
+
)
|
| 138 |
+
# Citation: truncated if too long
|
| 139 |
+
citation_str = str(citation)[:120]
|
| 140 |
+
if len(str(citation)) > 120:
|
| 141 |
+
citation_str += "…"
|
| 142 |
+
citation_cell = (
|
| 143 |
+
_e(citation_str) if citation_str.strip() else "—"
|
| 144 |
+
)
|
| 145 |
+
# Homepage: deliberately **not** auto-linked (anti-injection +
|
| 146 |
+
# honesty: the URL may point elsewhere). The escaped text
|
| 147 |
+
# is displayed as is.
|
| 148 |
+
homepage_cell = (
|
| 149 |
+
_e(str(homepage)[:80]) + ("…" if len(str(homepage)) > 80 else "")
|
| 150 |
+
) if str(homepage).strip() else "—"
|
| 151 |
+
parts.append(
|
| 152 |
+
f'<tr>'
|
| 153 |
+
f'<td style="padding:.4rem .6rem;font-family:monospace">'
|
| 154 |
+
f'{_e(name)}</td>'
|
| 155 |
+
f'<td style="padding:.4rem .6rem">{status_cell}</td>'
|
| 156 |
+
f'<td style="padding:.4rem .6rem;font-family:monospace">'
|
| 157 |
+
f'{_e(version)}</td>'
|
| 158 |
+
f'<td style="padding:.4rem .6rem">{_e(author)}</td>'
|
| 159 |
+
f'<td style="padding:.4rem .6rem;font-family:monospace">'
|
| 160 |
+
f'{_e(license_)}</td>'
|
| 161 |
+
f'<td style="padding:.4rem .6rem;font-family:monospace;'
|
| 162 |
+
f'font-size:.8rem">{_e(in_types)} → {_e(out_types)}</td>'
|
| 163 |
+
f'<td style="padding:.4rem .6rem;font-size:.8rem;'
|
| 164 |
+
f'opacity:.85">{citation_cell}</td>'
|
| 165 |
+
f'<td style="padding:.4rem .6rem;font-family:monospace;'
|
| 166 |
+
f'font-size:.8rem">{homepage_cell}</td>'
|
| 167 |
+
f'</tr>'
|
| 168 |
+
)
|
| 169 |
+
parts.append("</tbody></table></section>")
|
| 170 |
+
return "".join(parts)
|
| 171 |
+
|
| 172 |
+
|
| 173 |
+
__all__ = ["build_module_audit_html"]
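When a table cell is truncated for display, the order of slicing and escaping matters: slicing text that is already escaped can cut an HTML entity such as `&amp;` in half. The following is a minimal sketch of the slice-then-escape order. The helper name `truncate_then_escape` and the example URL are hypothetical and not part of this commit.

```python
from html import escape


def truncate_then_escape(text: str, limit: int) -> str:
    """Cut `text` to `limit` characters, then HTML-escape the result."""
    raw = str(text)
    suffix = "…" if len(raw) > limit else ""
    return escape(raw[:limit]) + suffix


url = "https://example.org/?a=1&b=2"  # illustrative value only
# Escaping first and slicing afterwards can split the "&amp;" entity.
broken = escape(url)[:26]
# Slicing first keeps every entity whole.
safe = truncate_then_escape(url, 26)
print(broken)
print(safe)
```

With this order the length limit applies to the visible characters rather than to the escaped markup.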
|
|
@@ -0,0 +1,199 @@
|
|
| 1 |
+
"""HTML rendering of the taxonomic co-occurrence heatmap (Sprint 75).
|
| 2 |
+
|
| 3 |
+
A.I.4, workstream 1 of the 2026 evolution plan.
|
| 4 |
+
|
| 5 |
+
Direct follow-up to ``picarones/core/taxonomy_cooccurrence.py``. Same
|
| 6 |
+
pattern as the other renderers (Sprints 41/43/62/67/72/74):
|
| 7 |
+
**server-side**, no JavaScript, systematic anti-injection.
|
| 8 |
+
|
| 9 |
+
Typical output
|
| 10 |
+
--------------
|
| 11 |
+
- ``build_taxonomy_cooccurrence_html(data, labels)`` produces a
|
| 12 |
+
complete block: title + usage note + SVG heatmap + table of the
|
| 13 |
+
most co-occurring pairs.
|
| 14 |
+
- ``""`` is returned if ``data is None`` or if the matrix is empty
|
| 15 |
+
(adaptive report).
|
| 16 |
+
"""
|
| 17 |
+
|
| 18 |
+
from __future__ import annotations
|
| 19 |
+
|
| 20 |
+
from html import escape as _e
|
| 21 |
+
from typing import Optional
|
| 22 |
+
|
| 23 |
+
|
| 24 |
+
def _color_for_jaccard(j: float) -> str:
|
| 25 |
+
"""White → deep blue gradient for Jaccard ∈ [0, 1].
|
| 26 |
+
|
| 27 |
+
Interpolation between #ffffff (j=0) and #1e3a8a (j=1).
|
| 28 |
+
"""
|
| 29 |
+
f = max(0.0, min(1.0, j))
|
| 30 |
+
r = int(255 + (30 - 255) * f)
|
| 31 |
+
g = int(255 + (58 - 255) * f)
|
| 32 |
+
b = int(255 + (138 - 255) * f)
|
| 33 |
+
return f"#{r:02x}{g:02x}{b:02x}"
|
| 34 |
+
|
| 35 |
+
|
| 36 |
+
def _text_color_for_bg(j: float) -> str:
|
| 37 |
+
"""White text on a dark background, dark text otherwise (readability)."""
|
| 38 |
+
return "#fff" if j > 0.55 else "#222"
|
| 39 |
+
|
| 40 |
+
|
| 41 |
+
def _build_heatmap_svg(
|
| 42 |
+
classes: list[str],
|
| 43 |
+
matrix: dict[str, dict[str, float]],
|
| 44 |
+
*,
|
| 45 |
+
cell_size: int = 36,
|
| 46 |
+
label_left: int = 130,
|
| 47 |
+
label_top: int = 80,
|
| 48 |
+
) -> str:
|
| 49 |
+
"""Build the SVG heatmap.
|
| 50 |
+
|
| 51 |
+
Cell = square colored by ``_color_for_jaccard``; the Jaccard value
|
| 52 |
+
is printed as digits if > 0.05. Class labels are placed on the
|
| 53 |
+
columns (top) and on the rows (left).
|
| 54 |
+
"""
|
| 55 |
+
n = len(classes)
|
| 56 |
+
if n == 0:
|
| 57 |
+
return ""
|
| 58 |
+
width = label_left + n * cell_size + 10
|
| 59 |
+
height = label_top + n * cell_size + 10
|
| 60 |
+
|
| 61 |
+
parts = [
|
| 62 |
+
f'<svg xmlns="http://www.w3.org/2000/svg" '
|
| 63 |
+
f'width="{width}" height="{height}" '
|
| 64 |
+
f'viewBox="0 0 {width} {height}" '
|
| 65 |
+
f'role="img" aria-label="Heatmap Jaccard co-occurrence taxonomique">',
|
| 66 |
+
]
|
| 67 |
+
# Column labels (rotated -45°)
|
| 68 |
+
for j, cls in enumerate(classes):
|
| 69 |
+
cx = label_left + j * cell_size + cell_size // 2
|
| 70 |
+
cy = label_top - 6
|
| 71 |
+
parts.append(
|
| 72 |
+
f'<text x="{cx}" y="{cy}" '
|
| 73 |
+
f'transform="rotate(-45 {cx} {cy})" '
|
| 74 |
+
f'font-size="11" fill="#333" text-anchor="start">'
|
| 75 |
+
f'{_e(cls)}</text>'
|
| 76 |
+
)
|
| 77 |
+
# Row labels
|
| 78 |
+
for i, cls in enumerate(classes):
|
| 79 |
+
rx = label_left - 6
|
| 80 |
+
ry = label_top + i * cell_size + cell_size // 2 + 4
|
| 81 |
+
parts.append(
|
| 82 |
+
f'<text x="{rx}" y="{ry}" '
|
| 83 |
+
f'font-size="11" fill="#333" text-anchor="end">'
|
| 84 |
+
f'{_e(cls)}</text>'
|
| 85 |
+
)
|
| 86 |
+
# Cells
|
| 87 |
+
for i, ca in enumerate(classes):
|
| 88 |
+
for j, cb in enumerate(classes):
|
| 89 |
+
value = matrix.get(ca, {}).get(cb, 0.0)
|
| 90 |
+
x = label_left + j * cell_size
|
| 91 |
+
y = label_top + i * cell_size
|
| 92 |
+
color = _color_for_jaccard(value)
|
| 93 |
+
text_color = _text_color_for_bg(value)
|
| 94 |
+
parts.append(
|
| 95 |
+
f'<rect x="{x}" y="{y}" '
|
| 96 |
+
f'width="{cell_size}" height="{cell_size}" '
|
| 97 |
+
f'fill="{color}" stroke="#ddd" stroke-width="0.5"/>'
|
| 98 |
+
)
|
| 99 |
+
if value > 0.05:
|
| 100 |
+
parts.append(
|
| 101 |
+
f'<text x="{x + cell_size // 2}" '
|
| 102 |
+
f'y="{y + cell_size // 2 + 4}" '
|
| 103 |
+
f'font-size="10" fill="{text_color}" '
|
| 104 |
+
f'text-anchor="middle">'
|
| 105 |
+
f'{value:.2f}</text>'
|
| 106 |
+
)
|
| 107 |
+
parts.append("</svg>")
|
| 108 |
+
return "".join(parts)
|
| 109 |
+
|
| 110 |
+
|
| 111 |
+
def _build_top_pairs_table(
|
| 112 |
+
top_pairs: list,
|
| 113 |
+
labels: dict,
|
| 114 |
+
) -> str:
|
| 115 |
+
"""Build the HTML table of the most co-occurring pairs."""
|
| 116 |
+
if not top_pairs:
|
| 117 |
+
return ""
|
| 118 |
+
pair_label = labels.get("taxocooc_pair_label", "Paire")
|
| 119 |
+
jaccard_label = labels.get("taxocooc_jaccard_label", "Jaccard")
|
| 120 |
+
|
| 121 |
+
parts = [
|
| 122 |
+
'<table style="border-collapse:collapse;font-size:.85rem;'
|
| 123 |
+
'margin-top:.5rem">',
|
| 124 |
+
'<thead><tr>',
|
| 125 |
+
f'<th style="padding:.3rem .5rem;text-align:left;'
|
| 126 |
+
f'border-bottom:1px solid #ccc;font-weight:600">'
|
| 127 |
+
f'{_e(pair_label)}</th>',
|
| 128 |
+
f'<th style="padding:.3rem .5rem;text-align:right;'
|
| 129 |
+
f'border-bottom:1px solid #ccc;font-weight:600">'
|
| 130 |
+
f'{_e(jaccard_label)}</th>',
|
| 131 |
+
'</tr></thead><tbody>',
|
| 132 |
+
]
|
| 133 |
+
for ca, cb, j in top_pairs:
|
| 134 |
+
parts.append(
|
| 135 |
+
f'<tr>'
|
| 136 |
+
f'<td style="padding:.2rem .5rem">'
|
| 137 |
+
f'<code>{_e(ca)}</code> ↔ <code>{_e(cb)}</code></td>'
|
| 138 |
+
f'<td style="padding:.2rem .5rem;text-align:right;'
|
| 139 |
+
f'font-family:monospace;background:{_color_for_jaccard(j)};'
|
| 140 |
+
f'color:{_text_color_for_bg(j)}">{j:.2f}</td>'
|
| 141 |
+
f'</tr>'
|
| 142 |
+
)
|
| 143 |
+
parts.append("</tbody></table>")
|
| 144 |
+
return "".join(parts)
|
| 145 |
+
|
| 146 |
+
|
| 147 |
+
def build_taxonomy_cooccurrence_html(
|
| 148 |
+
data: Optional[dict],
|
| 149 |
+
labels: Optional[dict[str, str]] = None,
|
| 150 |
+
) -> str:
|
| 151 |
+
"""Build the complete taxonomic co-occurrence HTML block.
|
| 152 |
+
|
| 153 |
+
Returns ``""`` if ``data`` is ``None`` or empty, or if the matrix is empty.
|
| 154 |
+
"""
|
| 155 |
+
if not data:
|
| 156 |
+
return ""
|
| 157 |
+
classes = data.get("classes") or []
|
| 158 |
+
matrix = data.get("cooccurrence_matrix") or {}
|
| 159 |
+
if not classes or not matrix:
|
| 160 |
+
return ""
|
| 161 |
+
labels = labels or {}
|
| 162 |
+
title = labels.get(
|
| 163 |
+
"taxocooc_title",
|
| 164 |
+
"Co-occurrence des classes d'erreur",
|
| 165 |
+
)
|
| 166 |
+
note = labels.get(
|
| 167 |
+
"taxocooc_note",
|
| 168 |
+
"Indice de Jaccard au niveau document : 1,00 = ces deux classes "
|
| 169 |
+
"apparaissent toujours ensemble ; 0,00 = jamais. Lecture par paires "
|
| 170 |
+
"co-occurrentes ci-dessous.",
|
| 171 |
+
)
|
| 172 |
+
n_docs = data.get("n_documents", 0)
|
| 173 |
+
n_docs_label_template = labels.get(
|
| 174 |
+
"taxocooc_n_docs", "Calculé sur {n_docs} documents.",
|
| 175 |
+
)
|
| 176 |
+
n_docs_phrase = n_docs_label_template.format(n_docs=n_docs)
|
| 177 |
+
|
| 178 |
+
svg = _build_heatmap_svg(classes, matrix)
|
| 179 |
+
top_table = _build_top_pairs_table(
|
| 180 |
+
data.get("top_pairs") or [], labels,
|
| 181 |
+
)
|
| 182 |
+
|
| 183 |
+
parts = [
|
| 184 |
+
'<div class="taxocooc" style="margin:1rem 0">',
|
| 185 |
+
f'<div style="font-weight:600;margin-bottom:.4rem">{_e(title)}</div>',
|
| 186 |
+
f'<div style="font-size:.85rem;opacity:.75;margin-bottom:.5rem">'
|
| 187 |
+
f'{_e(note)}</div>',
|
| 188 |
+
f'<div style="font-size:.8rem;opacity:.7;margin-bottom:.5rem">'
|
| 189 |
+
f'{_e(n_docs_phrase)}</div>',
|
| 190 |
+
svg,
|
| 191 |
+
top_table,
|
| 192 |
+
"</div>",
|
| 193 |
+
]
|
| 194 |
+
return "".join(parts)
|
| 195 |
+
|
| 196 |
+
|
| 197 |
+
__all__ = [
|
| 198 |
+
"build_taxonomy_cooccurrence_html",
|
| 199 |
+
]
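The renderer above reads four keys from `data`: `classes`, `cooccurrence_matrix`, `top_pairs` and `n_documents`. The core module that produces them is not part of this excerpt, so the following is only a sketch, under the assumption that each document is reduced to the set of error classes observed in it, of how a document-level Jaccard matrix with that shape could be computed. The function name `jaccard_cooccurrence`, the `top_n` parameter and the diagonal value of 1.0 are assumptions.

```python
from itertools import combinations


def jaccard_cooccurrence(docs: list[set[str]], top_n: int = 5) -> dict:
    """Document-level Jaccard index between error classes (hypothetical helper).

    `docs` holds, for each document, the set of error classes seen in it.
    """
    classes = sorted(set().union(*docs)) if docs else []
    # For each class, the indices of the documents in which it appears.
    seen = {c: {i for i, d in enumerate(docs) if c in d} for c in classes}
    # Assumption: a class always co-occurs with itself, so the diagonal is 1.0.
    matrix = {c: {c: 1.0} for c in classes}
    pairs = []
    for a, b in combinations(classes, 2):
        union = seen[a] | seen[b]
        j = len(seen[a] & seen[b]) / len(union) if union else 0.0
        matrix[a][b] = matrix[b][a] = j
        pairs.append((a, b, j))
    pairs.sort(key=lambda p: p[2], reverse=True)
    return {
        "classes": classes,
        "cooccurrence_matrix": matrix,
        "top_pairs": pairs[:top_n],
        "n_documents": len(docs),
    }


data = jaccard_cooccurrence([
    {"ligature_error", "visual_confusion"},
    {"ligature_error", "visual_confusion"},
    {"ligature_error"},
])
print(data["top_pairs"])
```

Here the two classes share two of the three documents in which either appears, so their index is 2/3.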
|
|
@@ -0,0 +1,182 @@
|
|
| 1 |
+
"""HTML rendering of the class × position heatmap (Sprint 76).
|
| 2 |
+
|
| 3 |
+
A.I.4, workstream 2 of the 2026 evolution plan.
|
| 4 |
+
|
| 5 |
+
Direct follow-up to ``picarones/core/taxonomy_intra_doc.py``. Same
|
| 6 |
+
pattern as the other renderers (Sprints 41/43/62/67/72/74/75):
|
| 7 |
+
**server-side**, no JavaScript, systematic anti-injection.
|
| 8 |
+
|
| 9 |
+
Typical output
|
| 10 |
+
--------------
|
| 11 |
+
An N_classes × N_bins grid where each cell gives the density of
|
| 12 |
+
errors of that class at that position in the document.
|
| 13 |
+
Reading at a glance: « ligature_error concentrated in the first
|
| 14 |
+
bin → margin error; visual_confusion evenly distributed
|
| 15 |
+
→ scribe error ».
|
| 16 |
+
|
| 17 |
+
Adaptive: if ``data is None`` or if every class has 0
|
| 18 |
+
errors, returns ``""``.
|
| 19 |
+
"""
|
| 20 |
+
|
| 21 |
+
from __future__ import annotations
|
| 22 |
+
|
| 23 |
+
from html import escape as _e
|
| 24 |
+
from typing import Optional
|
| 25 |
+
|
| 26 |
+
|
| 27 |
+
def _color_for_density(density: float) -> str:
|
| 28 |
+
"""White → deep orange gradient for density ∈ [0, 1].
|
| 29 |
+
|
| 30 |
+
Interpolation between #ffffff (0) and #c2410c (1).
|
| 31 |
+
"""
|
| 32 |
+
f = max(0.0, min(1.0, density))
|
| 33 |
+
r = int(255 + (194 - 255) * f)
|
| 34 |
+
g = int(255 + (65 - 255) * f)
|
| 35 |
+
b = int(255 + (12 - 255) * f)
|
| 36 |
+
return f"#{r:02x}{g:02x}{b:02x}"
|
| 37 |
+
|
| 38 |
+
|
| 39 |
+
def _text_color_for_bg(density: float) -> str:
|
| 40 |
+
return "#fff" if density > 0.55 else "#222"
|
| 41 |
+
|
| 42 |
+
|
| 43 |
+
def _build_heatmap_svg(
|
| 44 |
+
classes_with_errors: list[str],
|
| 45 |
+
per_class: dict[str, list[int]],
|
| 46 |
+
n_bins: int,
|
| 47 |
+
*,
|
| 48 |
+
cell_w: int = 36,
|
| 49 |
+
cell_h: int = 26,
|
| 50 |
+
label_left: int = 150,
|
| 51 |
+
label_top: int = 30,
|
| 52 |
+
) -> str:
|
| 53 |
+
"""Build the class × position SVG heatmap."""
|
| 54 |
+
n_rows = len(classes_with_errors)
|
| 55 |
+
if n_rows == 0:
|
| 56 |
+
return ""
|
| 57 |
+
width = label_left + n_bins * cell_w + 10
|
| 58 |
+
height = label_top + n_rows * cell_h + 30 # +30 for the X-axis label
|
| 59 |
+
|
| 60 |
+
# Normalization: for each class, density relative to the maximum
|
| 61 |
+
# of that class (highlights concentrated positions).
|
| 62 |
+
parts = [
|
| 63 |
+
f'<svg xmlns="http://www.w3.org/2000/svg" '
|
| 64 |
+
f'width="{width}" height="{height}" '
|
| 65 |
+
f'viewBox="0 0 {width} {height}" '
|
| 66 |
+
f'role="img" aria-label="Heatmap class taxonomique × position">',
|
| 67 |
+
]
|
| 68 |
+
# Column labels (positions)
|
| 69 |
+
for j in range(n_bins):
|
| 70 |
+
cx = label_left + j * cell_w + cell_w // 2
|
| 71 |
+
cy = label_top - 6
|
| 72 |
+
parts.append(
|
| 73 |
+
f'<text x="{cx}" y="{cy}" '
|
| 74 |
+
f'font-size="10" fill="#666" text-anchor="middle">'
|
| 75 |
+
f'{j + 1}</text>'
|
| 76 |
+
)
|
| 77 |
+
# Cells
|
| 78 |
+
for i, cls in enumerate(classes_with_errors):
|
| 79 |
+
# Row label (class)
|
| 80 |
+
rx = label_left - 6
|
| 81 |
+
ry = label_top + i * cell_h + cell_h // 2 + 4
|
| 82 |
+
parts.append(
|
| 83 |
+
f'<text x="{rx}" y="{ry}" '
|
| 84 |
+
f'font-size="11" fill="#333" text-anchor="end">'
|
| 85 |
+
f'{_e(cls)}</text>'
|
| 86 |
+
)
|
| 87 |
+
counts = per_class.get(cls, [0] * n_bins)
|
| 88 |
+
max_count = max(counts) if counts else 0
|
| 89 |
+
for j in range(n_bins):
|
| 90 |
+
x = label_left + j * cell_w
|
| 91 |
+
y = label_top + i * cell_h
|
| 92 |
+
count = counts[j] if j < len(counts) else 0
|
| 93 |
+
density = (count / max_count) if max_count > 0 else 0.0
|
| 94 |
+
color = _color_for_density(density)
|
| 95 |
+
text_color = _text_color_for_bg(density)
|
| 96 |
+
parts.append(
|
| 97 |
+
f'<rect x="{x}" y="{y}" '
|
| 98 |
+
f'width="{cell_w}" height="{cell_h}" '
|
| 99 |
+
f'fill="{color}" stroke="#ddd" stroke-width="0.5"/>'
|
| 100 |
+
)
|
| 101 |
+
if count > 0:
|
| 102 |
+
parts.append(
|
| 103 |
+
f'<text x="{x + cell_w // 2}" '
|
| 104 |
+
f'y="{y + cell_h // 2 + 4}" '
|
| 105 |
+
f'font-size="10" fill="{text_color}" '
|
| 106 |
+
f'text-anchor="middle">{count}</text>'
|
| 107 |
+
)
|
| 108 |
+
# X-axis label at the bottom
|
| 109 |
+
cx_axis = label_left + (n_bins * cell_w) // 2
|
| 110 |
+
cy_axis = height - 6
|
| 111 |
+
parts.append(
|
| 112 |
+
f'<text x="{cx_axis}" y="{cy_axis}" '
|
| 113 |
+
f'font-size="11" fill="#666" text-anchor="middle" '
|
| 114 |
+
f'font-style="italic">'
|
| 115 |
+
f'Position dans le document (1 = début)</text>'
|
| 116 |
+
)
|
| 117 |
+
parts.append("</svg>")
|
| 118 |
+
return "".join(parts)
|
| 119 |
+
|
| 120 |
+
|
| 121 |
+
def build_taxonomy_intra_doc_html(
|
| 122 |
+
data: Optional[dict],
|
| 123 |
+
labels: Optional[dict[str, str]] = None,
|
| 124 |
+
) -> str:
|
| 125 |
+
"""Build the complete HTML block for the intra-document heatmap.
|
| 126 |
+
|
| 127 |
+
Returns ``""`` if ``data`` is ``None`` or empty, or if there are no errors.
|
| 128 |
+
"""
|
| 129 |
+
if not data:
|
| 130 |
+
return ""
|
| 131 |
+
n_bins = data.get("n_bins", 0)
|
| 132 |
+
per_class = data.get("per_class") or {}
|
| 133 |
+
total_errors = data.get("total_errors", 0)
|
| 134 |
+
if total_errors == 0 or n_bins <= 0:
|
| 135 |
+
return ""
|
| 136 |
+
# Filter: keep only the classes with at least one error
|
| 137 |
+
classes_with_errors = [
|
| 138 |
+
cls for cls, counts in per_class.items()
|
| 139 |
+
if isinstance(counts, list) and sum(counts) > 0
|
| 140 |
+
]
|
| 141 |
+
if not classes_with_errors:
|
| 142 |
+
return ""
|
| 143 |
+
|
| 144 |
+
labels = labels or {}
|
| 145 |
+
title = labels.get(
|
| 146 |
+
"intradoc_title",
|
| 147 |
+
"Évolution intra-document des classes d'erreur",
|
| 148 |
+
)
|
| 149 |
+
note = labels.get(
|
| 150 |
+
"intradoc_note",
|
| 151 |
+
"Heatmap class × position : densité relative par classe "
|
| 152 |
+
"(plus foncé = concentré). Une classe concentrée dans la "
|
| 153 |
+
"première colonne suggère une erreur de marge ; "
|
| 154 |
+
"une distribution uniforme suggère une erreur de scribe.",
|
| 155 |
+
)
|
| 156 |
+
n_words_gt = data.get("n_words_gt", 0)
|
| 157 |
+
n_words_template = labels.get(
|
| 158 |
+
"intradoc_n_words",
|
| 159 |
+
"Calculé sur {n_words_gt} mots GT, répartis en {n_bins} tranches.",
|
| 160 |
+
)
|
| 161 |
+
n_words_phrase = n_words_template.format(
|
| 162 |
+
n_words_gt=n_words_gt, n_bins=n_bins,
|
| 163 |
+
)
|
| 164 |
+
|
| 165 |
+
svg = _build_heatmap_svg(classes_with_errors, per_class, n_bins)
|
| 166 |
+
|
| 167 |
+
parts = [
|
| 168 |
+
'<div class="intradoc" style="margin:1rem 0">',
|
| 169 |
+
f'<div style="font-weight:600;margin-bottom:.4rem">{_e(title)}</div>',
|
| 170 |
+
f'<div style="font-size:.85rem;opacity:.75;margin-bottom:.5rem">'
|
| 171 |
+
f'{_e(note)}</div>',
|
| 172 |
+
f'<div style="font-size:.8rem;opacity:.7;margin-bottom:.5rem">'
|
| 173 |
+
f'{_e(n_words_phrase)}</div>',
|
| 174 |
+
svg,
|
| 175 |
+
"</div>",
|
| 176 |
+
]
|
| 177 |
+
return "".join(parts)
|
| 178 |
+
|
| 179 |
+
|
| 180 |
+
__all__ = [
|
| 181 |
+
"build_taxonomy_intra_doc_html",
|
| 182 |
+
]
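The heatmap above expects `per_class` to map each class to a list of `n_bins` counts, and it colors each row by density relative to that row's maximum. The core module that builds this structure is not shown in this excerpt. The sketch below assumes errors arrive as `(class_name, word_index)` pairs with 0-based word indices, and the helper names `bin_error_positions` and `relative_density` are hypothetical.

```python
def bin_error_positions(
    errors: list[tuple[str, int]], n_words_gt: int, n_bins: int,
) -> dict:
    """Count errors per class and per position bin (hypothetical helper)."""
    per_class: dict[str, list[int]] = {}
    for cls, idx in errors:
        # Map the word index to a bin; clamp so the last word stays in range.
        b = min(n_bins - 1, idx * n_bins // max(1, n_words_gt))
        per_class.setdefault(cls, [0] * n_bins)[b] += 1
    return {
        "n_bins": n_bins,
        "per_class": per_class,
        "total_errors": len(errors),
        "n_words_gt": n_words_gt,
    }


def relative_density(counts: list[int]) -> list[float]:
    """Scale counts by the row maximum, mirroring the per-class normalization."""
    top = max(counts) if counts else 0
    return [c / top if top > 0 else 0.0 for c in counts]


data = bin_error_positions(
    [("ligature_error", 0), ("ligature_error", 3), ("visual_confusion", 95)],
    n_words_gt=100, n_bins=10,
)
print(data["per_class"])
print(relative_density(data["per_class"]["ligature_error"]))
```

Because each row is scaled by its own maximum, a rare class and a frequent class can both reach the darkest color; the printed counts in the cells carry the absolute scale.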
|
|
@@ -60,8 +60,8 @@
|
|
| 60 |
"gallery_sort_difficulty": "Difficulty",
|
| 61 |
"gallery_sort_id": "Identifier",
|
| 62 |
"gallery_sort_label": "Sort by:",
|
| 63 |
-
"gini_cer_ideal": "—
|
| 64 |
-
"gini_cer_note": "X-axis = mean CER, Y-axis = Gini coefficient. An
|
| 65 |
"glossary_definition": "Definition",
|
| 66 |
"glossary_empty": "No entry for this term.",
|
| 67 |
"glossary_limits": "Limits",
|
|
@@ -252,7 +252,7 @@
|
|
| 252 |
"intradoc_note": "Heatmap class × position: relative density per class (darker = concentrated). A class concentrated in the first column suggests a margin error; a uniform distribution suggests a scribe error.",
|
| 253 |
"intradoc_n_words": "Computed on {n_words_gt} GT words, split into {n_bins} bins.",
|
| 254 |
"taxocomp_title": "Taxonomic profile: {engine_a} vs {engine_b}",
|
| 255 |
-
"taxocomp_note": "Mirror chart of error proportions per class. Color by editorial recoverability (green = correctable, red = irrecoverable). At equal global CER, an engine whose errors are mostly green
|
| 256 |
"taxocomp_level_label": "Category",
|
| 257 |
"taxocomp_recoverable": "Recoverable",
|
| 258 |
"taxocomp_difficult": "Difficult",
|
|
|
|
| 60 |
"gallery_sort_difficulty": "Difficulty",
|
| 61 |
"gallery_sort_id": "Identifier",
|
| 62 |
"gallery_sort_label": "Sort by:",
|
| 63 |
+
"gini_cer_ideal": "— reading: bottom-left",
|
| 64 |
+
"gini_cer_note": "X-axis = mean CER, Y-axis = Gini coefficient. An engine in the bottom-left area combines low CER AND low Gini (rare, uniformly distributed errors). The right choice depends on the target workflow.",
|
| 65 |
"glossary_definition": "Definition",
|
| 66 |
"glossary_empty": "No entry for this term.",
|
| 67 |
"glossary_limits": "Limits",
|
|
|
|
| 252 |
"intradoc_note": "Heatmap class × position: relative density per class (darker = concentrated). A class concentrated in the first column suggests a margin error; a uniform distribution suggests a scribe error.",
|
| 253 |
"intradoc_n_words": "Computed on {n_words_gt} GT words, split into {n_bins} bins.",
|
| 254 |
"taxocomp_title": "Taxonomic profile: {engine_a} vs {engine_b}",
|
| 255 |
+
"taxocomp_note": "Mirror chart of error proportions per class. Color by editorial recoverability (green = correctable, red = irrecoverable). At equal global CER, an engine whose errors are mostly green tends to produce errors more easily corrected in a critical edition workflow.",
|
| 256 |
"taxocomp_level_label": "Category",
|
| 257 |
"taxocomp_recoverable": "Recoverable",
|
| 258 |
"taxocomp_difficult": "Difficult",
|
|
@@ -60,8 +60,8 @@
|
|
| 60 |
"gallery_sort_difficulty": "Difficulté",
|
| 61 |
"gallery_sort_id": "Identifiant",
|
| 62 |
"gallery_sort_label": "Trier par :",
|
| 63 |
-
"gini_cer_ideal": "—
|
| 64 |
-
"gini_cer_note": "Axe X = CER moyen, Axe Y = coefficient de Gini. Un moteur
|
| 65 |
"glossary_definition": "Définition",
|
| 66 |
"glossary_empty": "Aucune entrée pour ce terme.",
|
| 67 |
"glossary_limits": "Limites",
|
|
@@ -252,7 +252,7 @@
|
|
| 252 |
"intradoc_note": "Heatmap class × position : densité relative par classe (plus foncé = concentré). Une classe concentrée dans la première colonne suggère une erreur de marge ; une distribution uniforme suggère une erreur de scribe.",
|
| 253 |
"intradoc_n_words": "Calculé sur {n_words_gt} mots GT, répartis en {n_bins} tranches.",
|
| 254 |
"taxocomp_title": "Profil taxonomique : {engine_a} vs {engine_b}",
|
| 255 |
-
"taxocomp_note": "Diagramme miroir des proportions d'erreurs par classe. Couleur selon récupérabilité éditoriale (vert = corrigeable, rouge = irrécupérable). À CER global égal, un moteur dont les erreurs sont majoritairement vertes
|
| 256 |
"taxocomp_level_label": "Catégorie",
|
| 257 |
"taxocomp_recoverable": "Récupérable",
|
| 258 |
"taxocomp_difficult": "Difficile",
|
|
|
|
| 60 |
"gallery_sort_difficulty": "Difficulté",
|
| 61 |
"gallery_sort_id": "Identifiant",
|
| 62 |
"gallery_sort_label": "Trier par :",
|
| 63 |
+
"gini_cer_ideal": "— lecture : bas-gauche",
|
| 64 |
+
"gini_cer_note": "Axe X = CER moyen, Axe Y = coefficient de Gini. Un moteur dans la zone bas-gauche combine CER bas ET Gini bas (erreurs rares et uniformément réparties). Le choix selon ce graphe dépend du workflow visé.",
|
| 65 |
"glossary_definition": "Définition",
|
| 66 |
"glossary_empty": "Aucune entrée pour ce terme.",
|
| 67 |
"glossary_limits": "Limites",
|
|
|
|
| 252 |
"intradoc_note": "Heatmap class × position : densité relative par classe (plus foncé = concentré). Une classe concentrée dans la première colonne suggère une erreur de marge ; une distribution uniforme suggère une erreur de scribe.",
|
| 253 |
"intradoc_n_words": "Calculé sur {n_words_gt} mots GT, répartis en {n_bins} tranches.",
|
| 254 |
"taxocomp_title": "Profil taxonomique : {engine_a} vs {engine_b}",
|
| 255 |
+
"taxocomp_note": "Diagramme miroir des proportions d'erreurs par classe. Couleur selon récupérabilité éditoriale (vert = corrigeable, rouge = irrécupérable). À CER global égal, un moteur dont les erreurs sont majoritairement vertes tend à produire des erreurs plus facilement corrigées en édition critique.",
|
| 256 |
"taxocomp_level_label": "Catégorie",
|
| 257 |
"taxocomp_recoverable": "Récupérable",
|
| 258 |
"taxocomp_difficult": "Difficile",
|
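Labels such as `intradoc_n_words` are `str.format` templates, so a translation whose placeholder name is misspelled raises `KeyError` at render time. The following is a minimal sketch of a defensive fallback to the built-in default template. The helper `format_label` is hypothetical and does not exist in the commit.

```python
def format_label(labels: dict[str, str], key: str, default: str, **values) -> str:
    """Format a translated template, falling back to `default` on a bad placeholder."""
    template = labels.get(key, default)
    try:
        return template.format(**values)
    except (KeyError, IndexError, ValueError):
        # The translated template is unusable; use the default wording instead.
        return default.format(**values)


default = "Computed on {n_words_gt} GT words, split into {n_bins} bins."
# Illustrative broken translation: "{n_word}" is not a value the renderer passes.
labels = {"intradoc_n_words": "Computed on {n_word} GT words."}
print(format_label(labels, "intradoc_n_words", default, n_words_gt=120, n_bins=10))
```

Whether a silent fallback is preferable to a loud failure is a design choice; a test that formats every label in each locale file would catch the same mistake earlier.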
|
@@ -1,221 +1,20 @@
|
|
| 1 |
-
"""
|
| 2 |
|
| 3 |
-
|
| 4 |
-
|
| 5 |
-
|
|
|
|
| 6 |
|
| 7 |
-
|
| 8 |
-
|
| 9 |
-
Two blocks in a single section:
|
| 10 |
-
|
| 11 |
-
1. **Paleographic complexity**: mean, median, min, max,
|
| 12 |
-
standard deviation over the whole corpus.
|
| 13 |
-
2. **Corpus homogeneity**: combined score + per-feature
|
| 14 |
-
detail (mean, stdev, normalized contribution).
|
| 15 |
-
|
| 16 |
-
Adaptive: ``""`` if there is no data.
|
| 17 |
-
|
| 18 |
-
Integration note
|
| 19 |
-
------------------
|
| 20 |
-
Pure module: the user composes the calls:
|
| 21 |
-
|
| 22 |
-
.. code-block:: python
|
| 23 |
-
|
| 24 |
-
from picarones.core.image_predictive import aggregate_corpus_predictive
|
| 25 |
-
from picarones.report.image_predictive_render import (
|
| 26 |
-
build_image_predictive_html,
|
| 27 |
-
)
|
| 28 |
-
|
| 29 |
-
qualities = [doc.image_quality.as_dict() for doc in benchmark.docs]
|
| 30 |
-
agg = aggregate_corpus_predictive(qualities)
|
| 31 |
-
html = build_image_predictive_html(agg, labels)
|
| 32 |
"""
|
| 33 |
|
| 34 |
-
from
|
| 35 |
-
|
| 36 |
-
from html import escape as _e
|
| 37 |
-
from typing import Optional
|
| 38 |
-
|
| 39 |
-
|
| 40 |
-
def _color_for_score(score: float) -> str:
|
| 41 |
-
"""Green (low) → orange → red (high)."""
|
| 42 |
-
f = max(0.0, min(1.0, score))
|
| 43 |
-
if f < 0.5:
|
| 44 |
-
t = f / 0.5
|
| 45 |
-
r = int(167 + (235 - 167) * t)
|
| 46 |
-
g = int(240 + (180 - 240) * t)
|
| 47 |
-
b = int(167 + (60 - 167) * t)
|
| 48 |
-
else:
|
| 49 |
-
t = (f - 0.5) / 0.5
|
| 50 |
-
r = int(235 + (220 - 235) * t)
|
| 51 |
-
g = int(180 + (50 - 180) * t)
|
| 52 |
-
b = int(60 + (50 - 60) * t)
|
| 53 |
-
return f"#{r:02x}{g:02x}{b:02x}"
|
| 54 |
-
|
| 55 |
-
|
| 56 |
-
_FEATURE_LABEL_KEYS = {
|
| 57 |
-
"noise_level": "imgpred_feat_noise",
|
| 58 |
-
"sharpness_score": "imgpred_feat_sharpness",
|
| 59 |
-
"contrast_score": "imgpred_feat_contrast",
|
| 60 |
-
"rotation_degrees": "imgpred_feat_rotation",
|
| 61 |
-
}
|
| 62 |
-
|
| 63 |
-
|
| 64 |
-
def _render_complexity_block(
|
| 65 |
-
aggregated: dict, labels: dict[str, str],
|
| 66 |
-
) -> str:
|
| 67 |
-
h_complex = labels.get(
|
| 68 |
-
"imgpred_complexity", "Complexité paléographique",
|
| 69 |
-
)
|
| 70 |
-
h_mean = labels.get("imgpred_mean", "Moyenne")
|
| 71 |
-
h_median = labels.get("imgpred_median", "Médiane")
|
| 72 |
-
h_min = labels.get("imgpred_min", "Min")
|
| 73 |
-
h_max = labels.get("imgpred_max", "Max")
|
| 74 |
-
h_stdev = labels.get("imgpred_stdev", "Écart-type")
|
| 75 |
-
h_docs = labels.get("imgpred_docs", "Docs")
|
| 76 |
-
mean = float(aggregated.get("complexity_mean") or 0.0)
|
| 77 |
-
median = float(aggregated.get("complexity_median") or 0.0)
|
| 78 |
-
mn = float(aggregated.get("complexity_min") or 0.0)
|
| 79 |
-
mx = float(aggregated.get("complexity_max") or 0.0)
|
| 80 |
-
sd = float(aggregated.get("complexity_stdev") or 0.0)
|
| 81 |
-
n_docs = int(aggregated.get("n_docs") or 0)
|
| 82 |
-
color_mean = _color_for_score(mean)
|
| 83 |
-
return (
|
| 84 |
-
f'<div style="font-weight:600;margin:.4rem 0 .3rem 0">'
|
| 85 |
-
f'{_e(h_complex)}</div>'
|
| 86 |
-
'<table style="border-collapse:collapse;width:100%;'
|
| 87 |
-
'font-size:.9rem;margin-bottom:.8rem">'
|
| 88 |
-
f'<thead><tr>'
|
| 89 |
-
f'<th style="padding:.4rem .6rem;text-align:right;'
|
| 90 |
-
f'border-bottom:1px solid #ccc;font-weight:600">{_e(h_mean)}</th>'
|
| 91 |
-
f'<th style="padding:.4rem .6rem;text-align:right;'
|
| 92 |
-
f'border-bottom:1px solid #ccc;font-weight:600">{_e(h_median)}</th>'
|
| 93 |
-
f'<th style="padding:.4rem .6rem;text-align:right;'
|
| 94 |
-
f'border-bottom:1px solid #ccc;font-weight:600">{_e(h_min)}</th>'
|
| 95 |
-
f'<th style="padding:.4rem .6rem;text-align:right;'
|
| 96 |
-
f'border-bottom:1px solid #ccc;font-weight:600">{_e(h_max)}</th>'
|
| 97 |
-
f'<th style="padding:.4rem .6rem;text-align:right;'
|
| 98 |
-
f'border-bottom:1px solid #ccc;font-weight:600">{_e(h_stdev)}</th>'
|
| 99 |
-
f'<th style="padding:.4rem .6rem;text-align:right;'
|
| 100 |
-
f'border-bottom:1px solid #ccc;font-weight:600">{_e(h_docs)}</th>'
|
| 101 |
-
f'</tr></thead>'
|
| 102 |
-
f'<tbody><tr>'
|
| 103 |
-
f'<td style="padding:.4rem .6rem;text-align:right;'
|
| 104 |
-
f'background:{color_mean};font-family:monospace;font-weight:600">'
|
| 105 |
-
f'{mean:.3f}</td>'
|
| 106 |
-
f'<td style="padding:.4rem .6rem;text-align:right;'
|
| 107 |
-
f'font-family:monospace">{median:.3f}</td>'
|
| 108 |
-
f'<td style="padding:.4rem .6rem;text-align:right;'
|
| 109 |
-
f'font-family:monospace">{mn:.3f}</td>'
|
| 110 |
-
f'<td style="padding:.4rem .6rem;text-align:right;'
|
| 111 |
-
f'font-family:monospace">{mx:.3f}</td>'
|
| 112 |
-
f'<td style="padding:.4rem .6rem;text-align:right;'
|
| 113 |
-
f'font-family:monospace">{sd:.3f}</td>'
|
| 114 |
-
f'<td style="padding:.4rem .6rem;text-align:right;'
|
| 115 |
-
f'font-family:monospace">{n_docs}</td>'
|
| 116 |
-
f'</tr></tbody></table>'
|
| 117 |
-
)
|
| 118 |
-
|
| 119 |
-
|
| 120 |
-
def _render_homogeneity_block(
|
| 121 |
-
homogeneity: dict, labels: dict[str, str],
|
| 122 |
-
) -> str:
|
| 123 |
-
h_homo = labels.get(
|
| 124 |
-
"imgpred_homogeneity", "Homogénéité du corpus",
|
| 125 |
-
)
|
| 126 |
-
h_feat = labels.get("imgpred_feature", "Feature")
|
| 127 |
-
h_mean = labels.get("imgpred_feat_mean", "Moyenne")
|
| 128 |
-
h_stdev = labels.get("imgpred_feat_stdev", "Écart-type")
|
| 129 |
-
h_norm = labels.get(
|
| 130 |
-
"imgpred_feat_norm", "Contribution normalisée",
|
| 131 |
-
)
|
| 132 |
-
score = float(homogeneity.get("score") or 0.0)
|
| 133 |
-
color = _color_for_score(score)
|
| 134 |
-
parts = [
|
| 135 |
-
f'<div style="font-weight:600;margin:.4rem 0 .3rem 0">'
|
| 136 |
-
f'{_e(h_homo)} : '
|
| 137 |
-
f'<span style="background:{color};padding:.1rem .4rem;'
|
| 138 |
-
f'border-radius:.3rem;font-family:monospace">{score:.3f}</span>'
|
| 139 |
-
f'</div>',
|
| 140 |
-
'<table style="border-collapse:collapse;width:100%;'
|
| 141 |
-
'font-size:.9rem">',
|
| 142 |
-
'<thead><tr>',
|
| 143 |
-
]
|
| 144 |
-
for col in (h_feat, h_mean, h_stdev, h_norm):
|
| 145 |
-
parts.append(
|
| 146 |
-
f'<th style="padding:.4rem .6rem;text-align:left;'
|
| 147 |
-
f'border-bottom:1px solid #ccc;font-weight:600">'
|
| 148 |
-
f'{_e(col)}</th>'
|
| 149 |
-
)
|
| 150 |
-
parts.append("</tr></thead><tbody>")
|
| 151 |
-
per_feat = homogeneity.get("per_feature") or {}
|
| 152 |
-
for key, label_key in _FEATURE_LABEL_KEYS.items():
|
| 153 |
-
if key not in per_feat:
|
| 154 |
-
continue
|
| 155 |
-
slot = per_feat[key]
|
| 156 |
-
feat_label = labels.get(label_key, key)
|
| 157 |
-
feat_mean = float(slot.get("mean") or 0.0)
|
| 158 |
-
feat_stdev = float(slot.get("stdev") or 0.0)
|
| 159 |
-
feat_norm = float(slot.get("normalised") or 0.0)
|
| 160 |
-
norm_color = _color_for_score(feat_norm)
|
| 161 |
-
parts.append(
|
| 162 |
-
f'<tr>'
|
| 163 |
-
f'<td style="padding:.4rem .6rem">{_e(feat_label)}</td>'
|
| 164 |
-
f'<td style="padding:.4rem .6rem;text-align:right;'
|
| 165 |
-
f'font-family:monospace">{feat_mean:.3f}</td>'
|
| 166 |
-
f'<td style="padding:.4rem .6rem;text-align:right;'
|
| 167 |
-
f'font-family:monospace">{feat_stdev:.3f}</td>'
|
| 168 |
-
f'<td style="padding:.4rem .6rem;text-align:right;'
|
| 169 |
-
f'background:{norm_color};font-family:monospace">'
|
| 170 |
-
f'{feat_norm:.3f}</td>'
|
| 171 |
-
f'</tr>'
|
| 172 |
-
)
|
| 173 |
-
parts.append("</tbody></table>")
|
| 174 |
-
return "".join(parts)
|
| 175 |
-
|
| 176 |
-
|
| 177 |
-
def build_image_predictive_html(
|
| 178 |
-
aggregated: Optional[dict],
|
| 179 |
-
labels: Optional[dict[str, str]] = None,
|
| 180 |
-
) -> str:
|
| 181 |
-
"""Build the HTML view « Profil d'image du corpus » (corpus image profile).
|
| 182 |
-
|
| 183 |
-
Parameters
|
| 184 |
-
----------
|
| 185 |
-
aggregated:
|
| 186 |
-
Output of ``aggregate_corpus_predictive``. If ``None``
|
| 187 |
-
or if ``n_docs == 0``, returns ``""``.
|
| 188 |
-
labels:
|
| 189 |
-
i18n dict. Keys use the ``imgpred_*`` prefix.
|
| 190 |
-
"""
|
| 191 |
-
if not aggregated:
|
| 192 |
-
return ""
|
| 193 |
-
if not aggregated.get("n_docs"):
|
| 194 |
-
return ""
|
| 195 |
-
labels = labels or {}
|
| 196 |
-
title = labels.get(
|
| 197 |
-
"imgpred_title", "Profil d'image du corpus",
|
| 198 |
-
)
|
| 199 |
-
note = labels.get(
|
| 200 |
-
"imgpred_note",
|
| 201 |
-
"Score de complexité paléographique combinant bruit, "
|
| 202 |
-
"flou, faible contraste et rotation. Le score "
|
| 203 |
-
"d'homogénéité signale si la moyenne globale est fiable "
|
| 204 |
-
"(corpus uniforme) ou trompeuse (corpus hétérogène — "
|
| 205 |
-
"voir alors la vue stratifiée).",
|
| 206 |
-
)
|
| 207 |
-
parts = [
|
| 208 |
-
'<section class="imgpred-section" style="margin:1rem 0">',
|
| 209 |
-
f'<h3 style="margin:0 0 .3rem 0">{_e(title)}</h3>',
|
| 210 |
-
f'<div style="font-size:.85rem;opacity:.75;margin-bottom:.6rem">'
|
| 211 |
-
f'{_e(note)}</div>',
|
| 212 |
-
]
|
| 213 |
-
parts.append(_render_complexity_block(aggregated, labels))
|
| 214 |
-
homo = aggregated.get("homogeneity")
|
| 215 |
-
if isinstance(homo, dict):
|
| 216 |
-
parts.append(_render_homogeneity_block(homo, labels))
|
| 217 |
-
parts.append("</section>")
|
| 218 |
-
return "".join(parts)
|
| 219 |
-
|
| 220 |
|
| 221 |
-
|
| 1 |
+
"""Backward-compatibility alias: module moved to :mod:`picarones.extras.render.image_predictive_render`.
|
| 2 |
|
| 3 |
+
Phase A of the three-circle refactoring (architecture-cercles.md).
|
| 4 |
+
The content now lives in circle 3, ``extras/``. This alias lets
|
| 5 |
+
historical imports (``from picarones.report.image_predictive_render
|
| 6 |
+
import ...``) keep working without modification.
|
| 7 |
|
| 8 |
+
See :doc:`docs/architecture-cercles.md` for the rationale behind
|
| 9 |
+
placing this module in Circle 3.
|
| 10 |
"""
|
| 11 |
|
| 12 |
+
from picarones.extras.render.image_predictive_render import * # noqa: F401, F403
|
| 13 |
|
| 14 |
+
# Explicit re-export of any private names or modules accessed
|
| 15 |
+
# directly by attribute (rare but possible). For most
|
| 16 |
+
# modules, the ``import *`` above is enough.
|
| 17 |
+
import picarones.extras.render.image_predictive_render as _module
|
| 18 |
+
__all__ = getattr(_module, "__all__", [
|
| 19 |
+
name for name in dir(_module) if not name.startswith("_")
|
| 20 |
+
])
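
The shim pattern above (star-import from the new location, then an explicit `__all__`) can be tried in isolation. The following is a minimal sketch, assuming two throwaway in-memory modules; `demo_extras_render`, `demo_report_render` and `build_html` are hypothetical names used only for illustration and are not part of picarones. It checks that a name reached through the shim is the same object as the one in the relocated module.

```python
import sys
import types

# Hypothetical stand-in for the relocated module (illustrative names only).
new_mod = types.ModuleType("demo_extras_render")
exec(
    "def build_html(data):\n"
    "    return '' if not data else '<section></section>'\n",
    new_mod.__dict__,
)
sys.modules[new_mod.__name__] = new_mod

# Hypothetical shim at the old location: star-import plus explicit __all__,
# mirroring the structure of the alias files in this diff.
shim = types.ModuleType("demo_report_render")
exec(
    "from demo_extras_render import *\n"
    "import demo_extras_render as _module\n"
    "__all__ = getattr(_module, '__all__', [\n"
    "    name for name in dir(_module) if not name.startswith('_')\n"
    "])\n",
    shim.__dict__,
)
sys.modules[shim.__name__] = shim

# The shim re-exports the very same function object, not a copy.
same_object = shim.build_html is new_mod.build_html
```

Because the star-import binds the existing objects rather than redefining them, no logic is duplicated between the old and new paths.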
|
|
@@ -1,173 +1,20 @@
|
|
| 1 |
-
"""
|
| 2 |
|
| 3 |
-
|
| 4 |
-
|
| 5 |
-
|
| 6 |
|
| 7 |
-
|
| 8 |
-
|
| 9 |
-
Summary table of the modules used in a composed
|
| 10 |
-
pipeline, each with:
|
| 11 |
-
|
| 12 |
-
- Audit status (green ✓ if every check passes, red ✗
|
| 13 |
-
otherwise, with the number of failures);
|
| 14 |
-
- Metadata: version, author, licence;
|
| 15 |
-
- Academic citation, if provided;
|
| 16 |
-
- Homepage URL, if provided.
|
| 17 |
-
|
| 18 |
-
Adaptive: ``""`` if the list is empty.
|
| 19 |
-
|
| 20 |
-
Integration note
|
| 21 |
-
----------------
|
| 22 |
-
Pure module: the user builds the list from their
|
| 23 |
-
``PipelineSpec`` augmented with ``ModuleManifest`` objects:
|
| 24 |
-
|
| 25 |
-
.. code-block:: python
|
| 26 |
-
|
| 27 |
-
from picarones.core.module_policy import audit_module
|
| 28 |
-
from picarones.report.module_audit_render import build_module_audit_html
|
| 29 |
-
|
| 30 |
-
audits = []
|
| 31 |
-
for step in pipeline.steps:
|
| 32 |
-
manifest = step.module.manifest # application-level convention
|
| 33 |
-
result = audit_module(step.module, manifest)
|
| 34 |
-
audits.append({
|
| 35 |
-
"manifest": manifest.as_dict(),
|
| 36 |
-
"audit": result.as_dict(),
|
| 37 |
-
})
|
| 38 |
-
html = build_module_audit_html(audits, labels)
|
| 39 |
"""
|
| 40 |
|
| 41 |
-
from
|
| 42 |
-
|
| 43 |
-
from html import escape as _e
|
| 44 |
-
from typing import Optional
|
| 45 |
-
|
| 46 |
-
|
| 47 |
-
def _passed_badge(passed: bool, n_failed: int, label_pass: str,
|
| 48 |
-
label_fail: str) -> str:
|
| 49 |
-
if passed:
|
| 50 |
-
return (
|
| 51 |
-
f'<span style="color:#16a34a;font-weight:700">'
|
| 52 |
-
f'✓ {_e(label_pass)}</span>'
|
| 53 |
-
)
|
| 54 |
-
return (
|
| 55 |
-
f'<span style="color:#dc2626;font-weight:700">'
|
| 56 |
-
f'✗ {_e(label_fail)} ({n_failed})</span>'
|
| 57 |
-
)
|
| 58 |
-
|
| 59 |
-
|
| 60 |
-
def build_module_audit_html(
|
| 61 |
-
audits: Optional[list],
|
| 62 |
-
labels: Optional[dict[str, str]] = None,
|
| 63 |
-
) -> str:
|
| 64 |
-
"""Build the HTML view « Modules audités » (audited modules).
|
| 65 |
-
|
| 66 |
-
Parameters
|
| 67 |
-
----------
|
| 68 |
-
audits:
|
| 69 |
-
List of dicts ``{"manifest": ManifestDict, "audit":
|
| 70 |
-
AuditResultDict}``. If empty or ``None``, returns ``""``.
|
| 71 |
-
labels:
|
| 72 |
-
i18n dict. Keys use the ``audit_*`` prefix.
|
| 73 |
-
"""
|
| 74 |
-
if not audits:
|
| 75 |
-
return ""
|
| 76 |
-
rows = [
|
| 77 |
-
a for a in audits
|
| 78 |
-
if isinstance(a, dict)
|
| 79 |
-
and isinstance(a.get("manifest"), dict)
|
| 80 |
-
and isinstance(a.get("audit"), dict)
|
| 81 |
-
]
|
| 82 |
-
if not rows:
|
| 83 |
-
return ""
|
| 84 |
-
labels = labels or {}
|
| 85 |
-
title = labels.get("audit_title", "Modules audités")
|
| 86 |
-
note = labels.get(
|
| 87 |
-
"audit_note",
|
| 88 |
-
"Récapitulatif des modules utilisés dans la pipeline "
|
| 89 |
-
"composée. Un module qui ne passe pas l'audit n'est "
|
| 90 |
-
"pas exécutable. Métadonnées issues du manifest fourni "
|
| 91 |
-
"par le contributeur (auteur, licence, citation).",
|
| 92 |
-
)
|
| 93 |
-
label_pass = labels.get("audit_pass", "audit OK")
|
| 94 |
-
label_fail = labels.get("audit_fail", "checks échoués")
|
| 95 |
-
h_module = labels.get("audit_module", "Module")
|
| 96 |
-
h_status = labels.get("audit_status", "Audit")
|
| 97 |
-
h_version = labels.get("audit_version", "Version")
|
| 98 |
-
h_author = labels.get("audit_author", "Auteur")
|
| 99 |
-
h_license = labels.get("audit_license", "Licence")
|
| 100 |
-
h_io = labels.get("audit_io", "Entrée → sortie")
|
| 101 |
-
h_citation = labels.get("audit_citation", "Citation")
|
| 102 |
-
h_homepage = labels.get("audit_homepage", "Page projet")
|
| 103 |
-
|
| 104 |
-
parts = [
|
| 105 |
-
'<section class="audit-section" style="margin:1rem 0">',
|
| 106 |
-
f'<h3 style="margin:0 0 .3rem 0">{_e(title)}</h3>',
|
| 107 |
-
f'<div style="font-size:.85rem;opacity:.75;margin-bottom:.5rem">'
|
| 108 |
-
f'{_e(note)}</div>',
|
| 109 |
-
'<table style="border-collapse:collapse;width:100%;'
|
| 110 |
-
'font-size:.9rem">',
|
| 111 |
-
'<thead><tr>',
|
| 112 |
-
]
|
| 113 |
-
for col in (h_module, h_status, h_version, h_author,
|
| 114 |
-
h_license, h_io, h_citation, h_homepage):
|
| 115 |
-
parts.append(
|
| 116 |
-
f'<th style="padding:.4rem .6rem;text-align:left;'
|
| 117 |
-
f'border-bottom:1px solid #ccc;font-weight:600">'
|
| 118 |
-
f'{_e(col)}</th>'
|
| 119 |
-
)
|
| 120 |
-
parts.append("</tr></thead><tbody>")
|
| 121 |
-
|
| 122 |
-
for entry in rows:
|
| 123 |
-
manifest = entry["manifest"]
|
| 124 |
-
audit = entry["audit"]
|
| 125 |
-
name = str(manifest.get("name") or "?")
|
| 126 |
-
version = str(manifest.get("version") or "—")
|
| 127 |
-
author = str(manifest.get("author") or "—")
|
| 128 |
-
license_ = str(manifest.get("license") or "—")
|
| 129 |
-
in_types = ", ".join(manifest.get("input_types") or []) or "—"
|
| 130 |
-
out_types = ", ".join(manifest.get("output_types") or []) or "—"
|
| 131 |
-
citation = manifest.get("citation") or ""
|
| 132 |
-
homepage = manifest.get("homepage") or ""
|
| 133 |
-
passed = bool(audit.get("passed"))
|
| 134 |
-
n_failed = int(audit.get("n_failed") or 0)
|
| 135 |
-
status_cell = _passed_badge(
|
| 136 |
-
passed, n_failed, label_pass, label_fail,
|
| 137 |
-
)
|
| 138 |
-
# Citation: truncated if too long
|
| 139 |
-
citation_str = str(citation)[:120]
|
| 140 |
-
if len(str(citation)) > 120:
|
| 141 |
-
citation_str += "…"
|
| 142 |
-
citation_cell = (
|
| 143 |
-
_e(citation_str) if citation_str.strip() else "—"
|
| 144 |
-
)
|
| 145 |
-
# Homepage: we do **not** auto-link (anti-injection +
|
| 146 |
-
# honesty: the URL may point elsewhere). We display
|
| 147 |
-
# the escaped text as-is.
|
| 148 |
-
homepage_cell = (
|
| 149 |
-
_e(str(homepage))[:80] + ("…" if len(str(homepage)) > 80 else "")
|
| 150 |
-
) if str(homepage).strip() else "—"
|
| 151 |
-
parts.append(
|
| 152 |
-
f'<tr>'
|
| 153 |
-
f'<td style="padding:.4rem .6rem;font-family:monospace">'
|
| 154 |
-
f'{_e(name)}</td>'
|
| 155 |
-
f'<td style="padding:.4rem .6rem">{status_cell}</td>'
|
| 156 |
-
f'<td style="padding:.4rem .6rem;font-family:monospace">'
|
| 157 |
-
f'{_e(version)}</td>'
|
| 158 |
-
f'<td style="padding:.4rem .6rem">{_e(author)}</td>'
|
| 159 |
-
f'<td style="padding:.4rem .6rem;font-family:monospace">'
|
| 160 |
-
f'{_e(license_)}</td>'
|
| 161 |
-
f'<td style="padding:.4rem .6rem;font-family:monospace;'
|
| 162 |
-
f'font-size:.8rem">{_e(in_types)} → {_e(out_types)}</td>'
|
| 163 |
-
f'<td style="padding:.4rem .6rem;font-size:.8rem;'
|
| 164 |
-
f'opacity:.85">{citation_cell}</td>'
|
| 165 |
-
f'<td style="padding:.4rem .6rem;font-family:monospace;'
|
| 166 |
-
f'font-size:.8rem">{homepage_cell}</td>'
|
| 167 |
-
f'</tr>'
|
| 168 |
-
)
|
| 169 |
-
parts.append("</tbody></table></section>")
|
| 170 |
-
return "".join(parts)
|
| 171 |
-
|
| 172 |
|
| 173 |
-
|
| 1 |
+
"""Backward-compatibility alias: module moved to :mod:`picarones.extras.render.module_audit_render`.
|
| 2 |
|
| 3 |
+
Phase A of the three-circle refactoring (architecture-cercles.md).
|
| 4 |
+
The content now lives in circle 3, ``extras/``. This alias lets
|
| 5 |
+
historical imports (``from picarones.report.module_audit_render
|
| 6 |
+
import ...``) keep working without modification.
|
| 7 |
|
| 8 |
+
See :doc:`docs/architecture-cercles.md` for the rationale behind
|
| 9 |
+
placing this module in Circle 3.
|
| 10 |
"""
|
| 11 |
|
| 12 |
+
from picarones.extras.render.module_audit_render import * # noqa: F401, F403
|
| 13 |
|
| 14 |
+
# Explicit re-export of any private names or modules accessed
|
| 15 |
+
# directly by attribute (rare but possible). For most
|
| 16 |
+
# modules, the ``import *`` above is enough.
|
| 17 |
+
import picarones.extras.render.module_audit_render as _module
|
| 18 |
+
__all__ = getattr(_module, "__all__", [
|
| 19 |
+
name for name in dir(_module) if not name.startswith("_")
|
| 20 |
+
])
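
In the removed renderer above, the citation is truncated before escaping, whereas the homepage is escaped first and sliced afterwards, which can cut an HTML entity in half. A minimal sketch of the truncate-then-escape order follows; the helper name `truncate_then_escape` and its `placeholder` parameter are assumptions for illustration, not part of the picarones API.

```python
from html import escape


def truncate_then_escape(text, limit: int, placeholder: str = "-") -> str:
    """Clip raw text to ``limit`` characters, then HTML-escape the result.

    Hypothetical helper: clipping before escaping guarantees that an
    entity such as ``&amp;`` is never split by the slice.
    """
    raw = str(text)
    if not raw.strip():
        # Empty or whitespace-only input renders as a placeholder.
        return placeholder
    clipped = raw[:limit]
    suffix = "…" if len(raw) > limit else ""
    return escape(clipped) + suffix


# Escaping first and slicing afterwards can leave a broken entity:
broken = escape("a&b")[:3]
# Clipping first keeps every entity whole:
safe = truncate_then_escape("a&b", 2)
```

The limit then applies to the visible characters rather than to the escaped markup, so the rendered length is also more predictable.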
|
|
@@ -1,199 +1,20 @@
|
|
| 1 |
-
"""
|
| 2 |
|
| 3 |
-
A
|
| 4 |
|
| 5 |
-
|
| 6 |
-
|
| 7 |
-
**server-side**, no JavaScript, systematic anti-injection escaping.
|
| 8 |
-
|
| 9 |
-
Typical output
|
| 10 |
-
--------------
|
| 11 |
-
- ``build_taxonomy_cooccurrence_html(data, labels)`` produces a
|
| 12 |
-
complete block: title + usage note + SVG heatmap + table of the
|
| 13 |
-
most co-occurring pairs.
|
| 14 |
-
- ``""`` is returned if ``data is None`` or if the matrix is empty
|
| 15 |
-
(adaptive report).
|
| 16 |
"""
|
| 17 |
|
| 18 |
-
from
|
| 19 |
-
|
| 20 |
-
from html import escape as _e
|
| 21 |
-
from typing import Optional
|
| 22 |
-
|
| 23 |
-
|
| 24 |
-
def _color_for_jaccard(j: float) -> str:
|
| 25 |
-
"""White → deep blue gradient for Jaccard ∈ [0, 1].
|
| 26 |
-
|
| 27 |
-
Linear interpolation between #ffffff (j=0) and #1e3a8a (j=1).
|
| 28 |
-
"""
|
| 29 |
-
f = max(0.0, min(1.0, j))
|
| 30 |
-
r = int(255 + (30 - 255) * f)
|
| 31 |
-
g = int(255 + (58 - 255) * f)
|
| 32 |
-
b = int(255 + (138 - 255) * f)
|
| 33 |
-
return f"#{r:02x}{g:02x}{b:02x}"
|
| 34 |
-
|
| 35 |
-
|
| 36 |
-
def _text_color_for_bg(j: float) -> str:
|
| 37 |
-
"""White text on a dark background, dark text otherwise (readability)."""
|
| 38 |
-
return "#fff" if j > 0.55 else "#222"
|
| 39 |
-
|
| 40 |
-
|
| 41 |
-
def _build_heatmap_svg(
|
| 42 |
-
classes: list[str],
|
| 43 |
-
matrix: dict[str, dict[str, float]],
|
| 44 |
-
*,
|
| 45 |
-
cell_size: int = 36,
|
| 46 |
-
label_left: int = 130,
|
| 47 |
-
label_top: int = 80,
|
| 48 |
-
) -> str:
|
| 49 |
-
"""Build the SVG heatmap.
|
| 50 |
-
|
| 51 |
-
Cell = square coloured by ``_color_for_jaccard``; the Jaccard value
|
| 52 |
-
is printed as a number if > 0.05. Class labels appear as
|
| 53 |
-
columns (top) and rows (left).
|
| 54 |
-
"""
|
| 55 |
-
n = len(classes)
|
| 56 |
-
if n == 0:
|
| 57 |
-
return ""
|
| 58 |
-
width = label_left + n * cell_size + 10
|
| 59 |
-
height = label_top + n * cell_size + 10
|
| 60 |
-
|
| 61 |
-
parts = [
|
| 62 |
-
f'<svg xmlns="http://www.w3.org/2000/svg" '
|
| 63 |
-
f'width="{width}" height="{height}" '
|
| 64 |
-
f'viewBox="0 0 {width} {height}" '
|
| 65 |
-
f'role="img" aria-label="Heatmap Jaccard co-occurrence taxonomique">',
|
| 66 |
-
]
|
| 67 |
-
# Column labels (rotated -45°)
|
| 68 |
-
for j, cls in enumerate(classes):
|
| 69 |
-
cx = label_left + j * cell_size + cell_size // 2
|
| 70 |
-
cy = label_top - 6
|
| 71 |
-
parts.append(
|
| 72 |
-
f'<text x="{cx}" y="{cy}" '
|
| 73 |
-
f'transform="rotate(-45 {cx} {cy})" '
|
| 74 |
-
f'font-size="11" fill="#333" text-anchor="start">'
|
| 75 |
-
f'{_e(cls)}</text>'
|
| 76 |
-
)
|
| 77 |
-
# Row labels
|
| 78 |
-
for i, cls in enumerate(classes):
|
| 79 |
-
rx = label_left - 6
|
| 80 |
-
ry = label_top + i * cell_size + cell_size // 2 + 4
|
| 81 |
-
parts.append(
|
| 82 |
-
f'<text x="{rx}" y="{ry}" '
|
| 83 |
-
f'font-size="11" fill="#333" text-anchor="end">'
|
| 84 |
-
f'{_e(cls)}</text>'
|
| 85 |
-
)
|
| 86 |
-
# Cells
|
| 87 |
-
for i, ca in enumerate(classes):
|
| 88 |
-
for j, cb in enumerate(classes):
|
| 89 |
-
value = matrix.get(ca, {}).get(cb, 0.0)
|
| 90 |
-
x = label_left + j * cell_size
|
| 91 |
-
y = label_top + i * cell_size
|
| 92 |
-
color = _color_for_jaccard(value)
|
| 93 |
-
text_color = _text_color_for_bg(value)
|
| 94 |
-
parts.append(
|
| 95 |
-
f'<rect x="{x}" y="{y}" '
|
| 96 |
-
f'width="{cell_size}" height="{cell_size}" '
|
| 97 |
-
f'fill="{color}" stroke="#ddd" stroke-width="0.5"/>'
|
| 98 |
-
)
|
| 99 |
-
if value > 0.05:
|
| 100 |
-
parts.append(
|
| 101 |
-
f'<text x="{x + cell_size // 2}" '
|
| 102 |
-
f'y="{y + cell_size // 2 + 4}" '
|
| 103 |
-
f'font-size="10" fill="{text_color}" '
|
| 104 |
-
f'text-anchor="middle">'
|
| 105 |
-
f'{value:.2f}</text>'
|
| 106 |
-
)
|
| 107 |
-
parts.append("</svg>")
|
| 108 |
-
return "".join(parts)
|
| 109 |
-
|
| 110 |
-
|
| 111 |
-
def _build_top_pairs_table(
|
| 112 |
-
top_pairs: list,
|
| 113 |
-
labels: dict,
|
| 114 |
-
) -> str:
|
| 115 |
-
"""Build the HTML table of the most co-occurring pairs."""
|
| 116 |
-
if not top_pairs:
|
| 117 |
-
return ""
|
| 118 |
-
pair_label = labels.get("taxocooc_pair_label", "Paire")
|
| 119 |
-
jaccard_label = labels.get("taxocooc_jaccard_label", "Jaccard")
|
| 120 |
-
|
| 121 |
-
parts = [
|
| 122 |
-
'<table style="border-collapse:collapse;font-size:.85rem;'
|
| 123 |
-
'margin-top:.5rem">',
|
| 124 |
-
'<thead><tr>',
|
| 125 |
-
f'<th style="padding:.3rem .5rem;text-align:left;'
|
| 126 |
-
f'border-bottom:1px solid #ccc;font-weight:600">'
|
| 127 |
-
f'{_e(pair_label)}</th>',
|
| 128 |
-
f'<th style="padding:.3rem .5rem;text-align:right;'
|
| 129 |
-
f'border-bottom:1px solid #ccc;font-weight:600">'
|
| 130 |
-
f'{_e(jaccard_label)}</th>',
|
| 131 |
-
'</tr></thead><tbody>',
|
| 132 |
-
]
|
| 133 |
-
for ca, cb, j in top_pairs:
|
| 134 |
-
parts.append(
|
| 135 |
-
f'<tr>'
|
| 136 |
-
f'<td style="padding:.2rem .5rem">'
|
| 137 |
-
f'<code>{_e(ca)}</code> ↔ <code>{_e(cb)}</code></td>'
|
| 138 |
-
f'<td style="padding:.2rem .5rem;text-align:right;'
|
| 139 |
-
f'font-family:monospace;background:{_color_for_jaccard(j)};'
|
| 140 |
-
f'color:{_text_color_for_bg(j)}">{j:.2f}</td>'
|
| 141 |
-
f'</tr>'
|
| 142 |
-
)
|
| 143 |
-
parts.append("</tbody></table>")
|
| 144 |
-
return "".join(parts)
|
| 145 |
-
|
| 146 |
-
|
| 147 |
-
def build_taxonomy_cooccurrence_html(
|
| 148 |
-
data: Optional[dict],
|
| 149 |
-
labels: Optional[dict[str, str]] = None,
|
| 150 |
-
) -> str:
|
| 151 |
-
"""Build the complete HTML block for taxonomic co-occurrence.
|
| 152 |
-
|
| 153 |
-
Returns ``""`` if ``data is None`` or the matrix is empty.
|
| 154 |
-
"""
|
| 155 |
-
if not data:
|
| 156 |
-
return ""
|
| 157 |
-
classes = data.get("classes") or []
|
| 158 |
-
matrix = data.get("cooccurrence_matrix") or {}
|
| 159 |
-
if not classes or not matrix:
|
| 160 |
-
return ""
|
| 161 |
-
labels = labels or {}
|
| 162 |
-
title = labels.get(
|
| 163 |
-
"taxocooc_title",
|
| 164 |
-
"Co-occurrence des classes d'erreur",
|
| 165 |
-
)
|
| 166 |
-
note = labels.get(
|
| 167 |
-
"taxocooc_note",
|
| 168 |
-
"Indice de Jaccard au niveau document : 1,00 = ces deux classes "
|
| 169 |
-
"apparaissent toujours ensemble ; 0,00 = jamais. Lecture par paires "
|
| 170 |
-
"co-occurrentes ci-dessous.",
|
| 171 |
-
)
|
| 172 |
-
n_docs = data.get("n_documents", 0)
|
| 173 |
-
n_docs_label_template = labels.get(
|
| 174 |
-
"taxocooc_n_docs", "Calculé sur {n_docs} documents.",
|
| 175 |
-
)
|
| 176 |
-
n_docs_phrase = n_docs_label_template.format(n_docs=n_docs)
|
| 177 |
-
|
| 178 |
-
svg = _build_heatmap_svg(classes, matrix)
|
| 179 |
-
top_table = _build_top_pairs_table(
|
| 180 |
-
data.get("top_pairs") or [], labels,
|
| 181 |
-
)
|
| 182 |
-
|
| 183 |
-
parts = [
|
| 184 |
-
'<div class="taxocooc" style="margin:1rem 0">',
|
| 185 |
-
f'<div style="font-weight:600;margin-bottom:.4rem">{_e(title)}</div>',
|
| 186 |
-
f'<div style="font-size:.85rem;opacity:.75;margin-bottom:.5rem">'
|
| 187 |
-
f'{_e(note)}</div>',
|
| 188 |
-
f'<div style="font-size:.8rem;opacity:.7;margin-bottom:.5rem">'
|
| 189 |
-
f'{_e(n_docs_phrase)}</div>',
|
| 190 |
-
svg,
|
| 191 |
-
top_table,
|
| 192 |
-
"</div>",
|
| 193 |
-
]
|
| 194 |
-
return "".join(parts)
|
| 195 |
-
|
| 196 |
|
| 197 |
-
|
| 198 |
-
|
| 199 |
-
|
| 1 |
+
"""Backward-compatibility alias: module moved to :mod:`picarones.extras.render.taxonomy_cooccurrence_render`.
|
| 2 |
|
| 3 |
+
Phase A of the three-circle refactoring (architecture-cercles.md).
|
| 4 |
+
The content now lives in circle 3, ``extras/``. This alias lets
|
| 5 |
+
historical imports (``from picarones.report.taxonomy_cooccurrence_render
|
| 6 |
+
import ...``) keep working without modification.
|
| 7 |
|
| 8 |
+
See :doc:`docs/architecture-cercles.md` for the rationale behind
|
| 9 |
+
placing this module in Circle 3.
|
| 10 |
"""
|
| 11 |
|
| 12 |
+
from picarones.extras.render.taxonomy_cooccurrence_render import * # noqa: F401, F403
|
| 13 |
|
| 14 |
+
# Explicit re-export of any private names or modules accessed
|
| 15 |
+
# directly by attribute (rare but possible). For most
|
| 16 |
+
# modules, the ``import *`` above is enough.
|
| 17 |
+
import picarones.extras.render.taxonomy_cooccurrence_render as _module
|
| 18 |
+
__all__ = getattr(_module, "__all__", [
|
| 19 |
+
name for name in dir(_module) if not name.startswith("_")
|
| 20 |
+
])
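
The removed `_color_for_jaccard` helper interpolates each RGB channel linearly between white and a target colour, with the input clamped to [0, 1]. The sketch below generalises that idea to two arbitrary hex colours; `lerp_hex` is a hypothetical name introduced for illustration and does not exist in the source.

```python
def lerp_hex(start: str, end: str, t: float) -> str:
    """Linearly interpolate between two ``#rrggbb`` colours.

    ``t`` is clamped to [0, 1]; each channel is truncated to an int,
    matching the ``int(255 + (target - 255) * f)`` form used in the diff.
    """
    t = max(0.0, min(1.0, t))
    # Parse the three two-digit hex channels of each colour.
    s = [int(start[i:i + 2], 16) for i in (1, 3, 5)]
    e = [int(end[i:i + 2], 16) for i in (1, 3, 5)]
    channels = [int(a + (b - a) * t) for a, b in zip(s, e)]
    return "#" + "".join(f"{c:02x}" for c in channels)


# Same end points as the white → deep blue Jaccard gradient.
low = lerp_hex("#ffffff", "#1e3a8a", 0.0)
mid = lerp_hex("#ffffff", "#1e3a8a", 0.5)
high = lerp_hex("#ffffff", "#1e3a8a", 1.0)
```

With the end colour `#c2410c`, the same function reproduces the orange gradient used by the intra-document heatmap.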
|
|
@@ -1,182 +1,20 @@
|
|
| 1 |
-
"""
|
| 2 |
|
| 3 |
-
A
|
| 4 |
|
| 5 |
-
|
| 6 |
-
|
| 7 |
-
**server-side**, no JavaScript, systematic anti-injection escaping.
|
| 8 |
-
|
| 9 |
-
Typical output
|
| 10 |
-
--------------
|
| 11 |
-
An N_classes × N_bins grid in which each cell gives the density
|
| 12 |
-
of errors of that class at that position in the document.
|
| 13 |
-
At a glance: « ligature_error concentrated in the first
|
| 14 |
-
bin → margin error; visual_confusion uniformly distributed
|
| 15 |
-
→ scribe error ».
|
| 16 |
-
|
| 17 |
-
Adaptive: if ``data is None`` or every class has 0
|
| 18 |
-
errors, returns ``""``.
|
| 19 |
"""
|
| 20 |
|
| 21 |
-
from
|
| 22 |
-
|
| 23 |
-
from html import escape as _e
|
| 24 |
-
from typing import Optional
|
| 25 |
-
|
| 26 |
-
|
| 27 |
-
def _color_for_density(density: float) -> str:
|
| 28 |
-
"""White → deep orange gradient for density ∈ [0, 1].
|
| 29 |
-
|
| 30 |
-
Linear interpolation between #ffffff (0) and #c2410c (1).
|
| 31 |
-
"""
|
| 32 |
-
f = max(0.0, min(1.0, density))
|
| 33 |
-
r = int(255 + (194 - 255) * f)
|
| 34 |
-
g = int(255 + (65 - 255) * f)
|
| 35 |
-
b = int(255 + (12 - 255) * f)
|
| 36 |
-
return f"#{r:02x}{g:02x}{b:02x}"
|
| 37 |
-
|
| 38 |
-
|
| 39 |
-
def _text_color_for_bg(density: float) -> str:
|
| 40 |
-
return "#fff" if density > 0.55 else "#222"
|
| 41 |
-
|
| 42 |
-
|
| 43 |
-
def _build_heatmap_svg(
|
| 44 |
-
classes_with_errors: list[str],
|
| 45 |
-
per_class: dict[str, list[int]],
|
| 46 |
-
n_bins: int,
|
| 47 |
-
*,
|
| 48 |
-
cell_w: int = 36,
|
| 49 |
-
cell_h: int = 26,
|
| 50 |
-
label_left: int = 150,
|
| 51 |
-
label_top: int = 30,
|
| 52 |
-
) -> str:
|
| 53 |
-
"""Build the class × position SVG heatmap."""
|
| 54 |
-
n_rows = len(classes_with_errors)
|
| 55 |
-
if n_rows == 0:
|
| 56 |
-
return ""
|
| 57 |
-
width = label_left + n_bins * cell_w + 10
|
| 58 |
-
height = label_top + n_rows * cell_h + 30 # +30 for the X-axis label
|
| 59 |
-
|
| 60 |
-
# Normalisation: for each class, density relative to that
|
| 61 |
-
# class's maximum (highlights the positions where errors concentrate).
|
| 62 |
-
parts = [
|
| 63 |
-
f'<svg xmlns="http://www.w3.org/2000/svg" '
|
| 64 |
-
f'width="{width}" height="{height}" '
|
| 65 |
-
f'viewBox="0 0 {width} {height}" '
|
| 66 |
-
f'role="img" aria-label="Heatmap class taxonomique × position">',
|
| 67 |
-
]
|
| 68 |
-
# Column labels (positions)
|
| 69 |
-
for j in range(n_bins):
|
| 70 |
-
cx = label_left + j * cell_w + cell_w // 2
|
| 71 |
-
cy = label_top - 6
|
| 72 |
-
parts.append(
|
| 73 |
-
f'<text x="{cx}" y="{cy}" '
|
| 74 |
-
f'font-size="10" fill="#666" text-anchor="middle">'
|
| 75 |
-
f'{j + 1}</text>'
|
| 76 |
-
)
|
| 77 |
-
# Cells
|
| 78 |
-
for i, cls in enumerate(classes_with_errors):
|
| 79 |
-
# Row label (class)
|
| 80 |
-
rx = label_left - 6
|
| 81 |
-
ry = label_top + i * cell_h + cell_h // 2 + 4
|
| 82 |
-
parts.append(
|
| 83 |
-
f'<text x="{rx}" y="{ry}" '
|
| 84 |
-
f'font-size="11" fill="#333" text-anchor="end">'
|
| 85 |
-
f'{_e(cls)}</text>'
|
| 86 |
-
)
|
| 87 |
-
counts = per_class.get(cls, [0] * n_bins)
|
| 88 |
-
max_count = max(counts) if counts else 0
|
| 89 |
-
for j in range(n_bins):
|
| 90 |
-
x = label_left + j * cell_w
|
| 91 |
-
y = label_top + i * cell_h
|
| 92 |
-
count = counts[j] if j < len(counts) else 0
|
| 93 |
-
density = (count / max_count) if max_count > 0 else 0.0
|
| 94 |
-
color = _color_for_density(density)
|
| 95 |
-
text_color = _text_color_for_bg(density)
|
| 96 |
-
parts.append(
|
| 97 |
-
f'<rect x="{x}" y="{y}" '
|
| 98 |
-
f'width="{cell_w}" height="{cell_h}" '
|
| 99 |
-
f'fill="{color}" stroke="#ddd" stroke-width="0.5"/>'
|
| 100 |
-
)
|
| 101 |
-
if count > 0:
|
| 102 |
-
parts.append(
|
| 103 |
-
f'<text x="{x + cell_w // 2}" '
|
| 104 |
-
f'y="{y + cell_h // 2 + 4}" '
|
| 105 |
-
f'font-size="10" fill="{text_color}" '
|
| 106 |
-
f'text-anchor="middle">{count}</text>'
|
| 107 |
-
)
|
| 108 |
-
# X-axis label at the bottom
|
| 109 |
-
cx_axis = label_left + (n_bins * cell_w) // 2
|
| 110 |
-
cy_axis = height - 6
|
| 111 |
-
parts.append(
|
| 112 |
-
f'<text x="{cx_axis}" y="{cy_axis}" '
|
| 113 |
-
f'font-size="11" fill="#666" text-anchor="middle" '
|
| 114 |
-
f'font-style="italic">'
|
| 115 |
-
f'Position dans le document (1 = début)</text>'
|
| 116 |
-
)
|
| 117 |
-
parts.append("</svg>")
|
| 118 |
-
return "".join(parts)
|
| 119 |
-
|
| 120 |
-
|
| 121 |
-
def build_taxonomy_intra_doc_html(
|
| 122 |
-
data: Optional[dict],
|
| 123 |
-
labels: Optional[dict[str, str]] = None,
|
| 124 |
-
) -> str:
|
| 125 |
-
"""Build the complete HTML block for the intra-document heatmap.
|
| 126 |
-
|
| 127 |
-
Returns ``""`` if ``data is None`` or there are no errors.
|
| 128 |
-
"""
|
| 129 |
-
if not data:
|
| 130 |
-
return ""
|
| 131 |
-
n_bins = data.get("n_bins", 0)
|
| 132 |
-
per_class = data.get("per_class") or {}
|
| 133 |
-
total_errors = data.get("total_errors", 0)
|
| 134 |
-
if total_errors == 0 or n_bins <= 0:
|
| 135 |
-
return ""
|
| 136 |
-
# Filter: only classes with at least one error
|
| 137 |
-
classes_with_errors = [
|
| 138 |
-
cls for cls, counts in per_class.items()
|
| 139 |
-
if isinstance(counts, list) and sum(counts) > 0
|
| 140 |
-
]
|
| 141 |
-
if not classes_with_errors:
|
| 142 |
-
return ""
|
| 143 |
-
|
| 144 |
-
labels = labels or {}
|
| 145 |
-
title = labels.get(
|
| 146 |
-
"intradoc_title",
|
| 147 |
-
"Évolution intra-document des classes d'erreur",
|
| 148 |
-
)
|
| 149 |
-
note = labels.get(
|
| 150 |
-
"intradoc_note",
|
| 151 |
-
"Heatmap class × position : densité relative par classe "
|
| 152 |
-
"(plus foncé = concentré). Une classe concentrée dans la "
|
| 153 |
-
"première colonne suggère une erreur de marge ; "
|
| 154 |
-
"une distribution uniforme suggère une erreur de scribe.",
|
| 155 |
-
)
|
| 156 |
-
n_words_gt = data.get("n_words_gt", 0)
|
| 157 |
-
n_words_template = labels.get(
|
| 158 |
-
"intradoc_n_words",
|
| 159 |
-
"Calculé sur {n_words_gt} mots GT, répartis en {n_bins} tranches.",
|
| 160 |
-
)
|
| 161 |
-
n_words_phrase = n_words_template.format(
|
| 162 |
-
n_words_gt=n_words_gt, n_bins=n_bins,
|
| 163 |
-
)
|
| 164 |
-
|
| 165 |
-
svg = _build_heatmap_svg(classes_with_errors, per_class, n_bins)
|
| 166 |
-
|
| 167 |
-
parts = [
|
| 168 |
-
'<div class="intradoc" style="margin:1rem 0">',
|
| 169 |
-
f'<div style="font-weight:600;margin-bottom:.4rem">{_e(title)}</div>',
|
| 170 |
-
f'<div style="font-size:.85rem;opacity:.75;margin-bottom:.5rem">'
|
| 171 |
-
f'{_e(note)}</div>',
|
| 172 |
-
f'<div style="font-size:.8rem;opacity:.7;margin-bottom:.5rem">'
|
| 173 |
-
f'{_e(n_words_phrase)}</div>',
|
| 174 |
-
svg,
|
| 175 |
-
"</div>",
|
| 176 |
-
]
|
| 177 |
-
return "".join(parts)
|
| 178 |
-
|
| 179 |
|
| 180 |
-
|
| 181 |
-
|
| 182 |
-
|
| 1 |
+
"""Backward-compatibility alias: module moved to :mod:`picarones.extras.render.taxonomy_intra_doc_render`.
|
| 2 |
|
| 3 |
+
Phase A of the three-circle refactoring (architecture-cercles.md).
|
| 4 |
+
The content now lives in circle 3, ``extras/``. This alias lets
|
| 5 |
+
historical imports (``from picarones.report.taxonomy_intra_doc_render
|
| 6 |
+
import ...``) keep working without modification.
|
| 7 |
|
| 8 |
+
See :doc:`docs/architecture-cercles.md` for the rationale behind
|
| 9 |
+
placing this module in Circle 3.
|
| 10 |
"""
|
| 11 |
|
| 12 |
+
from picarones.extras.render.taxonomy_intra_doc_render import * # noqa: F401, F403
|
| 13 |
|
| 14 |
+
# Explicit re-export of any private names or modules accessed
|
| 15 |
+
# directly by attribute (rare but possible). For most
|
| 16 |
+
# modules, the ``import *`` above is enough.
|
| 17 |
+
import picarones.extras.render.taxonomy_intra_doc_render as _module
|
| 18 |
+
__all__ = getattr(_module, "__all__", [
|
| 19 |
+
name for name in dir(_module) if not name.startswith("_")
|
| 20 |
+
])
|
|
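The shim pattern in the file above can be tried in isolation. The following is a minimal sketch, not the project's code: the module names `demo_extras_render` and `demo_report_render` and the functions inside them are hypothetical stand-ins built in memory. It shows that `import *` binds the same function object under the old name, and that names starting with an underscore are skipped both by `import *` and by the computed `__all__`.

```python
import sys
import types

# Hypothetical stand-in for the relocated module (the "new location").
new_mod = types.ModuleType("demo_extras_render")
exec(
    "def build_html(data):\n"
    "    return '<div>' + str(data) + '</div>'\n"
    "def _private_helper():\n"
    "    return None\n",
    new_mod.__dict__,
)
sys.modules["demo_extras_render"] = new_mod

# Hypothetical shim left at the old location: the same two steps as the
# shim in the diff (star import, then a computed __all__).
shim_source = (
    "from demo_extras_render import *  # noqa: F401, F403\n"
    "import demo_extras_render as _module\n"
    "__all__ = getattr(_module, '__all__', [\n"
    "    name for name in dir(_module) if not name.startswith('_')\n"
    "])\n"
)
shim = types.ModuleType("demo_report_render")
exec(shim_source, shim.__dict__)
sys.modules["demo_report_render"] = shim

# The old path and the new path expose the very same function object.
same_object = shim.build_html is new_mod.build_html
# Underscore-prefixed names are not carried over by the star import.
private_exported = hasattr(shim, "_private_helper")
```

If a caller really depends on a private name through the old path, the shim would need an explicit `from ... import _name` line for it; the star import alone does not cover that case.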
@@ -0,0 +1,318 @@
| 1 |
+
"""Phase A tests: three-circle refactoring (post-chantier 6).
|
| 2 |
+
|
| 3 |
+
Covers:
|
| 4 |
+
|
| 5 |
+
- 4 `core/` modules moved to `extras/academic/` or
|
| 6 |
+
`extras/governance/` with backward-compat shims.
|
| 7 |
+
- 4 `report/` renderers moved to `extras/render/` with shims.
|
| 8 |
+
- Identity preserved: ``shim.X is new_location.X`` (no duplication
|
| 9 |
+
or redefinition).
|
| 10 |
+
- Anti-verdict hygiene: 5 sentences rephrased in the narrative
|
| 11 |
+
templates and the report i18n.
|
| 12 |
+
- Document `docs/architecture-cercles.md` present and complete.
|
| 13 |
+
"""
|
| 14 |
+
|
| 15 |
+
from __future__ import annotations
|
| 16 |
+
|
| 17 |
+
from pathlib import Path
|
| 18 |
+
|
| 19 |
+
import pytest
|
| 20 |
+
|
| 21 |
+
|
| 22 |
+
# ──────────────────────────────────────────────────────────────────────────
|
| 23 |
+
# 1. Modules moved to extras/: backward compatibility of historical imports
|
| 24 |
+
# ──────────────────────────────────────────────────────────────────────────
|
| 25 |
+
|
| 26 |
+
|
| 27 |
+
class TestRetrocompatHistoricalImports:
|
| 28 |
+
"""Imports of the form `from picarones.core.X` must keep working
|
| 29 |
+
after the move to `picarones.extras.*`."""
|
| 30 |
+
|
| 31 |
+
@pytest.mark.parametrize("module_path, attribute", [
|
| 32 |
+
("picarones.core.taxonomy_intra_doc", "compute_taxonomy_position_heatmap"),
|
| 33 |
+
("picarones.core.taxonomy_cooccurrence", "compute_taxonomy_cooccurrence"),
|
| 34 |
+
("picarones.core.image_predictive", "compute_paleographic_complexity"),
|
| 35 |
+
("picarones.core.image_predictive", "compute_corpus_homogeneity"),
|
| 36 |
+
("picarones.core.image_predictive", "aggregate_corpus_predictive"),
|
| 37 |
+
("picarones.core.module_policy", "ModuleManifest"),
|
| 38 |
+
("picarones.core.module_policy", "validate_manifest"),
|
| 39 |
+
("picarones.core.module_policy", "audit_module"),
|
| 40 |
+
])
|
| 41 |
+
def test_core_alias_still_works(self, module_path: str, attribute: str):
|
| 42 |
+
import importlib
|
| 43 |
+
mod = importlib.import_module(module_path)
|
| 44 |
+
assert hasattr(mod, attribute), (
|
| 45 |
+
f"{module_path}.{attribute} disappeared after phase A: "
|
| 46 |
+
"the backward-compat shim is broken"
|
| 47 |
+
)
|
| 48 |
+
|
| 49 |
+
@pytest.mark.parametrize("module_path, attribute", [
|
| 50 |
+
("picarones.report.taxonomy_intra_doc_render", "build_taxonomy_intra_doc_html"),
|
| 51 |
+
("picarones.report.taxonomy_cooccurrence_render", "build_taxonomy_cooccurrence_html"),
|
| 52 |
+
("picarones.report.image_predictive_render", "build_image_predictive_html"),
|
| 53 |
+
("picarones.report.module_audit_render", "build_module_audit_html"),
|
| 54 |
+
])
|
| 55 |
+
def test_report_alias_still_works(self, module_path: str, attribute: str):
|
| 56 |
+
import importlib
|
| 57 |
+
mod = importlib.import_module(module_path)
|
| 58 |
+
assert hasattr(mod, attribute)
|
| 59 |
+
|
| 60 |
+
|
| 61 |
+
# ──────────────────────────────────────────────────────────────────────────
|
| 62 |
+
# 2. Modules reachable via their new extras/ path
|
| 63 |
+
# ──────────────────────────────────────────────────────────────────────────
|
| 64 |
+
|
| 65 |
+
|
| 66 |
+
class TestNewExtrasImports:
|
| 67 |
+
@pytest.mark.parametrize("new_path, attribute", [
|
| 68 |
+
("picarones.extras.academic.taxonomy_intra_doc", "compute_taxonomy_position_heatmap"),
|
| 69 |
+
("picarones.extras.academic.taxonomy_cooccurrence", "compute_taxonomy_cooccurrence"),
|
| 70 |
+
("picarones.extras.academic.image_predictive", "aggregate_corpus_predictive"),
|
| 71 |
+
("picarones.extras.governance.module_policy", "ModuleManifest"),
|
| 72 |
+
("picarones.extras.render.taxonomy_intra_doc_render", "build_taxonomy_intra_doc_html"),
|
| 73 |
+
("picarones.extras.render.taxonomy_cooccurrence_render", "build_taxonomy_cooccurrence_html"),
|
| 74 |
+
("picarones.extras.render.image_predictive_render", "build_image_predictive_html"),
|
| 75 |
+
("picarones.extras.render.module_audit_render", "build_module_audit_html"),
|
| 76 |
+
])
|
| 77 |
+
def test_extras_path_works(self, new_path: str, attribute: str):
|
| 78 |
+
import importlib
|
| 79 |
+
mod = importlib.import_module(new_path)
|
| 80 |
+
assert hasattr(mod, attribute)
|
| 81 |
+
|
| 82 |
+
|
| 83 |
+
# ──────────────────────────────────────────────────────────────────────────
|
| 84 |
+
# 3. Identity preserved: the shim does not redefine anything
|
| 85 |
+
# ──────────────────────────────────────────────────────────────────────────
|
| 86 |
+
|
| 87 |
+
|
| 88 |
+
class TestIdentityThroughShim:
|
| 89 |
+
"""The shim must re-export the function from the new path, not
|
| 90 |
+
redefine it. Otherwise a metric could be computed differently depending
|
| 91 |
+
on the import path."""
|
| 92 |
+
|
| 93 |
+
def test_taxonomy_intra_doc_identity(self):
|
| 94 |
+
from picarones.core.taxonomy_intra_doc import (
|
| 95 |
+
compute_taxonomy_position_heatmap as via_old,
|
| 96 |
+
)
|
| 97 |
+
from picarones.extras.academic.taxonomy_intra_doc import (
|
| 98 |
+
compute_taxonomy_position_heatmap as via_new,
|
| 99 |
+
)
|
| 100 |
+
assert via_old is via_new
|
| 101 |
+
|
| 102 |
+
def test_image_predictive_identity(self):
|
| 103 |
+
from picarones.core.image_predictive import (
|
| 104 |
+
aggregate_corpus_predictive as via_old,
|
| 105 |
+
)
|
| 106 |
+
from picarones.extras.academic.image_predictive import (
|
| 107 |
+
aggregate_corpus_predictive as via_new,
|
| 108 |
+
)
|
| 109 |
+
assert via_old is via_new
|
| 110 |
+
|
| 111 |
+
def test_module_policy_identity(self):
|
| 112 |
+
from picarones.core.module_policy import ModuleManifest as via_old
|
| 113 |
+
from picarones.extras.governance.module_policy import (
|
| 114 |
+
ModuleManifest as via_new,
|
| 115 |
+
)
|
| 116 |
+
assert via_old is via_new
|
| 117 |
+
|
| 118 |
+
def test_renderer_identity(self):
|
| 119 |
+
from picarones.report.taxonomy_intra_doc_render import (
|
| 120 |
+
build_taxonomy_intra_doc_html as via_old,
|
| 121 |
+
)
|
| 122 |
+
from picarones.extras.render.taxonomy_intra_doc_render import (
|
| 123 |
+
build_taxonomy_intra_doc_html as via_new,
|
| 124 |
+
)
|
| 125 |
+
assert via_old is via_new
|
| 126 |
+
|
| 127 |
+
|
| 128 |
+
# ──────────────────────────────────────────────────────────────────────────
|
| 129 |
+
# 4. Views from chantier 3: still functional
|
| 130 |
+
# ──────────────────────────────────────────────────────────────────────────
|
| 131 |
+
|
| 132 |
+
|
| 133 |
+
class TestChantier3ViewsStillWork:
|
| 134 |
+
"""The 5 views from chantier 3 import the moved modules (as opt-in
|
| 135 |
+
subsections). Check that they still run after the
|
| 136 |
+
migration."""
|
| 137 |
+
|
| 138 |
+
def test_views_import(self):
|
| 139 |
+
from picarones.report.views import (
|
| 140 |
+
build_advanced_taxonomy_view_html,
|
| 141 |
+
build_diagnostics_view_html,
|
| 142 |
+
build_economics_view_html,
|
| 143 |
+
build_pipeline_view_html,
|
| 144 |
+
build_robustness_view_html,
|
| 145 |
+
)
|
| 146 |
+
assert callable(build_advanced_taxonomy_view_html)
|
| 147 |
+
assert callable(build_diagnostics_view_html)
|
| 148 |
+
assert callable(build_economics_view_html)
|
| 149 |
+
assert callable(build_pipeline_view_html)
|
| 150 |
+
assert callable(build_robustness_view_html)
|
| 151 |
+
|
| 152 |
+
def test_advanced_taxonomy_with_intra_doc_data(self):
|
| 153 |
+
"""The advanced_taxonomy view accepts opt-in
|
| 154 |
+
``intra_doc`` data, now computed by
|
| 155 |
+
``picarones.extras.academic``."""
|
| 156 |
+
from picarones.extras.academic.taxonomy_intra_doc import (
|
| 157 |
+
compute_taxonomy_position_heatmap,
|
| 158 |
+
)
|
| 159 |
+
from picarones.report.views import build_advanced_taxonomy_view_html
|
| 160 |
+
|
| 161 |
+
# Compute a minimal heatmap
|
| 162 |
+
result = compute_taxonomy_position_heatmap(
|
| 163 |
+
"abc def ghi", "abx def ghi", n_bins=3,
|
| 164 |
+
)
|
| 165 |
+
# The view must be able to compose without crashing when given
|
| 166 |
+
# this opt-in data
|
| 167 |
+
report_data = {"engines": [
|
| 168 |
+
{"name": "tess", "cer": 0.05,
|
| 169 |
+
"aggregated_taxonomy": {"class_distribution": {"x": 5}}},
|
| 170 |
+
{"name": "pero", "cer": 0.08,
|
| 171 |
+
"aggregated_taxonomy": {"class_distribution": {"x": 8}}},
|
| 172 |
+
]}
|
| 173 |
+
html = build_advanced_taxonomy_view_html(
|
| 174 |
+
report_data, {}, intra_doc=result,
|
| 175 |
+
)
|
| 176 |
+
# No crash: the view returns a string (its content is not inspected here)
|
| 177 |
+
assert isinstance(html, str)
|
| 178 |
+
|
| 179 |
+
|
| 180 |
+
# ──────────────────────────────────────────────────────────────────────────
|
| 181 |
+
# 5. Anti-verdict hygiene: rephrased sentences
|
| 182 |
+
# ──────────────────────────────────────────────────────────────────────────
|
| 183 |
+
|
| 184 |
+
|
| 185 |
+
class TestAntiVerdictHygiene:
|
| 186 |
+
"""The 5 sentences identified as prescriptive have been rephrased
|
| 187 |
+
factually. Regression tests."""
|
| 188 |
+
|
| 189 |
+
@pytest.fixture
|
| 190 |
+
def fr_templates(self) -> str:
|
| 191 |
+
path = (Path(__file__).parent.parent
|
| 192 |
+
/ "picarones" / "core" / "narrative" / "templates" / "fr.yaml")
|
| 193 |
+
return path.read_text(encoding="utf-8")
|
| 194 |
+
|
| 195 |
+
@pytest.fixture
|
| 196 |
+
def en_templates(self) -> str:
|
| 197 |
+
path = (Path(__file__).parent.parent
|
| 198 |
+
/ "picarones" / "core" / "narrative" / "templates" / "en.yaml")
|
| 199 |
+
return path.read_text(encoding="utf-8")
|
| 200 |
+
|
| 201 |
+
@pytest.fixture
|
| 202 |
+
def fr_i18n(self) -> str:
|
| 203 |
+
path = (Path(__file__).parent.parent
|
| 204 |
+
/ "picarones" / "report" / "i18n" / "fr.json")
|
| 205 |
+
return path.read_text(encoding="utf-8")
|
| 206 |
+
|
| 207 |
+
@pytest.fixture
|
| 208 |
+
def en_i18n(self) -> str:
|
| 209 |
+
path = (Path(__file__).parent.parent
|
| 210 |
+
/ "picarones" / "report" / "i18n" / "en.json")
|
| 211 |
+
return path.read_text(encoding="utf-8")
|
| 212 |
+
|
| 213 |
+
def test_stratum_winner_no_dominate(self, fr_templates, en_templates):
|
| 214 |
+
"""`stratum_winner` no longer says « domine nettement » /
|
| 215 |
+
« clearly dominates ». Factual phrasing expected."""
|
| 216 |
+
assert "domine\n nettement" not in fr_templates
|
| 217 |
+
assert "domine nettement" not in fr_templates
|
| 218 |
+
assert "clearly\n dominates" not in en_templates
|
| 219 |
+
assert "clearly dominates" not in en_templates
|
| 220 |
+
# Confirm the new factual phrasing is present
|
| 221 |
+
assert "le CER le plus bas" in fr_templates
|
| 222 |
+
assert "the lowest CER" in en_templates
|
| 223 |
+
|
| 224 |
+
def test_confidence_warning_no_fragile(self, fr_templates, en_templates):
|
| 225 |
+
"""`confidence_warning` no longer says « fragile » but
|
| 226 |
+
« incertitude statistique élevée » (high statistical uncertainty)."""
|
| 227 |
+
assert "Classement fragile" not in fr_templates
|
| 228 |
+
assert "Ranking is fragile" not in en_templates
|
| 229 |
+
assert "Incertitude statistique" in fr_templates
|
| 230 |
+
assert "High statistical uncertainty" in en_templates
|
| 231 |
+
|
| 232 |
+
def test_gini_no_ideal(self, fr_i18n, en_i18n):
|
| 233 |
+
"""`gini_cer_ideal` and `gini_cer_note` no longer use
|
| 234 |
+
« idéal » / « ideal » but « lecture » / « reading »."""
|
| 235 |
+
assert "\"gini_cer_ideal\": \"— idéal" not in fr_i18n
|
| 236 |
+
assert "\"gini_cer_ideal\": \"— ideal" not in en_i18n
|
| 237 |
+
# Confirm the new phrasing
|
| 238 |
+
assert "lecture : bas-gauche" in fr_i18n
|
| 239 |
+
assert "reading: bottom-left" in en_i18n
|
| 240 |
+
|
| 241 |
+
def test_taxocomp_no_preferable(self, fr_i18n, en_i18n):
|
| 242 |
+
"""`taxocomp_note` no longer says « préférable » / « preferable »."""
|
| 243 |
+
assert "préférable pour une édition critique" not in fr_i18n
|
| 244 |
+
assert "preferable for a critical edition" not in en_i18n
|
| 245 |
+
# Factual phrasing
|
| 246 |
+
assert "tend à produire des erreurs plus facilement" in fr_i18n
|
| 247 |
+
assert "tends to produce errors more easily" in en_i18n
|
| 248 |
+
|
| 249 |
+
|
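The anti-verdict tests above check each banned phrase twice, once with a line break plus indentation and once on a single line, presumably because the YAML templates may wrap long sentences. A minimal sketch of an alternative, assuming a hypothetical helper `contains_phrase` that is not part of the project: collapse every run of whitespace before searching, so one assertion covers any wrapping.

```python
import re


def contains_phrase(text: str, phrase: str) -> bool:
    """Return True if `phrase` occurs in `text`, treating any run of
    whitespace (spaces, tabs, newlines) as a single space."""
    def normalize(value: str) -> str:
        return re.sub(r"\s+", " ", value).strip()

    return normalize(phrase) in normalize(text)


# Hypothetical template fragment where the banned phrase is wrapped
# across two lines with indentation.
sample_yaml = (
    "stratum_winner: >\n"
    "  {engine} domine\n"
    "      nettement la strate {stratum}.\n"
)

hit = contains_phrase(sample_yaml, "domine nettement")
miss = contains_phrase("{engine} a le CER le plus bas.", "domine nettement")
```

The match stays case-sensitive, like the plain `in` checks in the tests.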
| 250 |
+
# ──────────────────────────────────────────────────────────────────────────
|
| 251 |
+
# 6. docs/architecture-cercles.md is present and complete
|
| 252 |
+
# ──────────────────────────────────────────────────────────────────────────
|
| 253 |
+
|
| 254 |
+
|
| 255 |
+
class TestArchitectureCerclesDoc:
|
| 256 |
+
@pytest.fixture
|
| 257 |
+
def doc(self) -> str:
|
| 258 |
+
path = (Path(__file__).parent.parent / "docs" / "architecture-cercles.md")
|
| 259 |
+
return path.read_text(encoding="utf-8")
|
| 260 |
+
|
| 261 |
+
def test_doc_exists(self, doc):
|
| 262 |
+
assert len(doc) > 1000
|
| 263 |
+
|
| 264 |
+
def test_doc_describes_three_circles(self, doc):
|
| 265 |
+
assert "Cercle 1" in doc
|
| 266 |
+
assert "Cercle 2" in doc
|
| 267 |
+
assert "Cercle 3" in doc
|
| 268 |
+
assert "Noyau invariant" in doc or "noyau invariant" in doc
|
| 269 |
+
assert "Plugins" in doc or "plugins" in doc
|
| 270 |
+
|
| 271 |
+
def test_doc_assigns_specific_modules(self, doc):
|
| 272 |
+
"""The document must explicitly list the modules of each circle."""
|
| 273 |
+
# Circle 1: a few names
|
| 274 |
+
for name in ["corpus.py", "modules.py", "runner.py",
|
| 275 |
+
"metric_registry.py", "alto_metrics.py"]:
|
| 276 |
+
assert name in doc, f"{name} must be listed in the doc"
|
| 277 |
+
# Circle 3: modules moved in phase A
|
| 278 |
+
for name in ["taxonomy_intra_doc", "image_predictive",
|
| 279 |
+
"module_policy"]:
|
| 280 |
+
assert name in doc, f"{name} must be listed in the doc"
|
| 281 |
+
|
| 282 |
+
def test_doc_mentions_extras_path(self, doc):
|
| 283 |
+
"""The doc explains that Circle 3 modules live in `extras/`."""
|
| 284 |
+
assert "extras/academic" in doc
|
| 285 |
+
assert "extras/governance" in doc
|
| 286 |
+
assert "extras/render" in doc
|
| 287 |
+
|
| 288 |
+
|
| 289 |
+
# ──────────────────────────────────────────────────────────────────────────
|
| 290 |
+
# 7. Original modules no longer contain business logic
|
| 291 |
+
# ──────────────────────────────────────────────────────────────────────────
|
| 292 |
+
|
| 293 |
+
|
| 294 |
+
class TestOriginalsAreShims:
|
| 295 |
+
"""Checks that the files left at the old location are
|
| 296 |
+
thin shims, not copies of the logic."""
|
| 297 |
+
|
| 298 |
+
@pytest.mark.parametrize("path", [
|
| 299 |
+
"picarones/core/taxonomy_intra_doc.py",
|
| 300 |
+
"picarones/core/taxonomy_cooccurrence.py",
|
| 301 |
+
"picarones/core/image_predictive.py",
|
| 302 |
+
"picarones/core/module_policy.py",
|
| 303 |
+
"picarones/report/taxonomy_intra_doc_render.py",
|
| 304 |
+
"picarones/report/taxonomy_cooccurrence_render.py",
|
| 305 |
+
"picarones/report/image_predictive_render.py",
|
| 306 |
+
"picarones/report/module_audit_render.py",
|
| 307 |
+
])
|
| 308 |
+
def test_is_thin_shim(self, path):
|
| 309 |
+
repo_root = Path(__file__).parent.parent
|
| 310 |
+
content = (repo_root / path).read_text(encoding="utf-8")
|
| 311 |
+
# A shim has fewer than 30 non-blank lines (docstring + 2 imports + __all__)
|
| 312 |
+
n_lines = len([line for line in content.splitlines() if line.strip()])
|
| 313 |
+
assert n_lines < 30, (
|
| 314 |
+
f"{path} has {n_lines} lines: it should be a thin shim "
|
| 315 |
+
"(import + re-export, no business logic)"
|
| 316 |
+
)
|
| 317 |
+
# Must mention the move
|
| 318 |
+
assert "déplacé" in content or "extras" in content
|
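The thin-shim check in `test_is_thin_shim` can be reproduced outside pytest. This is a minimal sketch under stated assumptions: `count_nonblank_lines` and `looks_like_thin_shim` are hypothetical helper names, and both sample texts are made up for illustration. The two conditions mirror the test: fewer than 30 non-blank lines, and a mention of either "déplacé" or "extras".

```python
def count_nonblank_lines(text: str) -> int:
    """Count lines that contain something other than whitespace."""
    return sum(1 for line in text.splitlines() if line.strip())


def looks_like_thin_shim(text: str, max_lines: int = 30) -> bool:
    """Apply the same two conditions as the test: short, and mentions the move."""
    short_enough = count_nonblank_lines(text) < max_lines
    mentions_move = "déplacé" in text or "extras" in text
    return short_enough and mentions_move


# Hypothetical shim body, modelled on the shim shown earlier in this diff.
# It is only inspected as text here, never imported or executed.
shim_text = '''"""Alias: module moved to picarones.extras.render.demo."""

from picarones.extras.render.demo import *  # noqa: F401, F403

import picarones.extras.render.demo as _module
__all__ = getattr(_module, "__all__", [])
'''

# Hypothetical file that still carries logic: 40 statement lines.
fat_text = "\n".join(f"x{i} = {i}" for i in range(40))

shim_ok = looks_like_thin_shim(shim_text)
fat_ok = looks_like_thin_shim(fat_text)
```

A line-count threshold is a heuristic: a file can stay under 30 lines and still hold logic, which is why the identity tests (`via_old is via_new`) are the stronger guarantee.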