| """Engine calibration: ECE, MCE, reliability diagram. | |
| Sprint 39, item A.II.1.b of the 2026 evolution plan: pure computation layer. | |
| Why this module | |
| --------------- | |
| All target OCR engines report a confidence per token or per line | |
| (Tesseract via its ``tsv`` output, Pero OCR via ``PageLayout``, | |
| Mistral OCR via ``confidence``, Google Vision via ``Word.confidence``). | |
| The natural question for a heritage workflow is: *"when the engine says | |
| it is sure, is it really sure?"* For a team that has to manually verify | |
| a corpus of 50,000 pages, calibration is what makes the difference | |
| between checking 100% of the volume and checking 15%. | |
| This module provides the three classic measures: | |
| - **Expected Calibration Error (ECE)**: average, weighted by bin size, | |
|   of the absolute gap between mean confidence and mean accuracy. | |
|   ``ECE = 0``: perfectly calibrated engine; high ``ECE``: systematic gap | |
|   between displayed confidence and actual reliability. | |
| - **Maximum Calibration Error (MCE)**: maximum of that gap over the bins. | |
|   Useful for spotting the engine's worst lie (e.g. it always reports | |
|   95% confidence and is wrong half the time). | |
| - **Reliability diagram**: table ``[(bin_low, bin_high, avg_conf, | |
|   accuracy, count)]`` that can be rendered as SVG on the server side or | |
|   with Chart.js in the browser in a later sprint. | |
| Split strategy | |
| -------------- | |
| As with NER (Sprint 38) and divergence (Sprints 35-37), the work is | |
| split in two: | |
| - **Sprint 39** (this one): pure computation layer. Input = two parallel | |
|   lists, ``confidences`` (∈ [0, 1]) and ``is_correct`` (bool or 0/1). | |
|   No external dependency. | |
| - **Upcoming sprint**: expose ``token_confidences`` on ``EngineResult``, | |
|   align characters/tokens with the ground truth (GT) to produce | |
|   ``is_correct``, integrate into the runner and add the HTML reliability view. | |
| What is explicitly out of scope | |
| ------------------------------- | |
| This sprint touches **no OCR adapter**. No confidence is extracted; | |
| everything is computed from prediction sequences supplied as input. | |
| That is what makes it possible to test the mathematical invariants | |
| rigorously (ECE = 0 when perfectly calibrated, ECE = |bias| for a | |
| constant bias, etc.) without depending on a backend. | |
| """ | |
| from __future__ import annotations | |
| import logging | |
| from dataclasses import dataclass | |
| from typing import Iterable | |
| logger = logging.getLogger(__name__) | |
| # ────────────────────────────────────────────────────────────────────────── | |
| # Data model | |
| # ────────────────────────────────────────────────────────────────────────── | |
| @dataclass | |
| class CalibrationBin: | |
|     """One bin of the reliability diagram. | |
|     Attributes | |
|     ---------- | |
|     bin_low, bin_high: | |
|         Bounds of the bin on the confidence axis (``[bin_low, bin_high)``, | |
|         except the last bin, which includes ``1.0``). | |
|     avg_confidence: | |
|         Mean confidence of the predictions that fell into the bin. | |
|         ``None`` if the bin is empty. | |
|     accuracy: | |
|         Fraction of correct predictions in the bin (``∈ [0, 1]``). | |
|         ``None`` if the bin is empty. | |
|     count: | |
|         Number of predictions in the bin. | |
|     """ | |
|     bin_low: float | |
|     bin_high: float | |
|     avg_confidence: float | None | |
|     accuracy: float | None | |
|     count: int | |
|     @property | |
|     def gap(self) -> float | None: | |
|         """Absolute gap ``|confidence - accuracy|``, or ``None`` if empty.""" | |
|         if self.avg_confidence is None or self.accuracy is None: | |
|             return None | |
|         return abs(self.avg_confidence - self.accuracy) | |
| # ────────────────────────────────────────────────────────────────────────── | |
| # Validation | |
| # ────────────────────────────────────────────────────────────────────────── | |
| def _validate_inputs( | |
|     confidences: list[float], | |
|     is_correct: list[bool | int], | |
| ) -> None: | |
|     """Check that both lists have equal length and confidences lie in [0, 1].""" | |
|     if len(confidences) != len(is_correct): | |
|         raise ValueError( | |
|             f"Length mismatch: confidences={len(confidences)} " | |
|             f"vs is_correct={len(is_correct)}" | |
|         ) | |
|     for i, c in enumerate(confidences): | |
|         # NaN fails this comparison too, so it is rejected as well. | |
|         if not (0.0 <= float(c) <= 1.0): | |
|             raise ValueError( | |
|                 f"Confidence outside [0, 1] at index {i}: {c!r}" | |
|             ) | |
| # ────────────────────────────────────────────────────────────────────────── | |
| # Reliability diagram (binning) | |
| # ────────────────────────────────────────────────────────────────────────── | |
| def reliability_diagram( | |
|     confidences: Iterable[float], | |
|     is_correct: Iterable[bool | int], | |
|     n_bins: int = 10, | |
| ) -> list[CalibrationBin]: | |
|     """Split the predictions into ``n_bins`` equal-width confidence bins and | |
|     compute the mean confidence, accuracy and count of each bin. | |
|     Parameters | |
|     ---------- | |
|     confidences: | |
|         Prediction confidences, ``∈ [0, 1]``. | |
|     is_correct: | |
|         Boolean indicator (1 = correct prediction, 0 = incorrect). | |
|     n_bins: | |
|         Number of bins (default: 10). Bounds: ``[k/n_bins, (k+1)/n_bins)``, | |
|         except the last bin, which includes ``1.0``. | |
|     Returns | |
|     ------- | |
|     list[CalibrationBin] | |
|         List of ``n_bins`` bins, in increasing order of confidence. | |
|     """ | |
|     if n_bins < 1: | |
|         raise ValueError(f"n_bins must be ≥ 1, got {n_bins}") | |
|     confs = [float(c) for c in confidences] | |
|     correct = [int(bool(x)) for x in is_correct] | |
|     _validate_inputs(confs, correct) | |
|     bin_width = 1.0 / n_bins | |
|     sums: list[float] = [0.0] * n_bins | |
|     correct_counts: list[int] = [0] * n_bins | |
|     counts: list[int] = [0] * n_bins | |
|     for c, ok in zip(confs, correct): | |
|         # Compute the bin index by multiplication (``c * n_bins``) rather than | |
|         # division (``c / bin_width``) to avoid floating-point representation | |
|         # pitfalls (e.g. ``0.6 / 0.1 = 5.999…`` in IEEE 754, which would put | |
|         # 0.6 in the bin [0.5, 0.6) instead of [0.6, 0.7)). | |
|         if c >= 1.0: | |
|             idx = n_bins - 1 | |
|         else: | |
|             idx = int(c * n_bins) | |
|         # Safety net against floating-point rounding | |
|         if idx >= n_bins: | |
|             idx = n_bins - 1 | |
|         elif idx < 0: | |
|             idx = 0 | |
|         sums[idx] += c | |
|         correct_counts[idx] += ok | |
|         counts[idx] += 1 | |
|     bins: list[CalibrationBin] = [] | |
|     for k in range(n_bins): | |
|         low = k * bin_width | |
|         high = (k + 1) * bin_width | |
|         n = counts[k] | |
|         if n == 0: | |
|             bins.append(CalibrationBin(low, high, None, None, 0)) | |
|         else: | |
|             bins.append(CalibrationBin( | |
|                 bin_low=low, | |
|                 bin_high=high, | |
|                 avg_confidence=sums[k] / n, | |
|                 accuracy=correct_counts[k] / n, | |
|                 count=n, | |
|             )) | |
|     return bins | |
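The comment about the bin index can be checked directly. The following is a minimal sketch, assuming the same rule as the loop above (half-open bins, last bin closed at ``1.0``); ``bin_index`` is a hypothetical helper name used only for illustration and is not part of the module.

```python
def bin_index(c: float, n_bins: int = 10) -> int:
    """Hypothetical helper: bin of a confidence c in [0, 1], last bin closed at 1.0."""
    if c >= 1.0:
        return n_bins - 1
    # Multiply by n_bins instead of dividing by the bin width (1.0 / n_bins),
    # then clamp to the last bin as a safety net.
    return min(int(c * n_bins), n_bins - 1)


bin_width = 1.0 / 10

# Dividing by the width lands 0.6 one bin too low.
print(int(0.6 / bin_width))  # 5
print(bin_index(0.6))        # 6

# Boundaries: 0.0 opens the first bin, 1.0 is folded into the last one.
print(bin_index(0.0), bin_index(0.999), bin_index(1.0))  # 0 9 9
```

The ``min`` clamp plays the same role as the guard in the loop: it costs nothing and protects against an index equal to ``n_bins``.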
| # ────────────────────────────────────────────────────────────────────────── | |
| # ECE and MCE | |
| # ────────────────────────────────────────────────────────────────────────── | |
| def expected_calibration_error( | |
|     confidences: Iterable[float], | |
|     is_correct: Iterable[bool | int], | |
|     n_bins: int = 10, | |
| ) -> float: | |
|     """Expected Calibration Error: average, weighted by bin size, of the | |
|     absolute gap between confidence and accuracy. | |
|     ``ECE = sum_k (n_k / N) * |avg_conf_k - accuracy_k|`` | |
|     where the sum runs over the non-empty bins. | |
|     Returns | |
|     ------- | |
|     float | |
|         ``∈ [0, 1]``. ``0`` means perfect calibration; ``0.0`` is also | |
|         returned for empty input. | |
|     """ | |
|     bins = reliability_diagram(confidences, is_correct, n_bins=n_bins) | |
|     total = sum(b.count for b in bins) | |
|     if total == 0: | |
|         return 0.0 | |
|     ece = 0.0 | |
|     for b in bins: | |
|         if b.count == 0 or b.gap is None: | |
|             continue | |
|         ece += (b.count / total) * b.gap | |
|     return ece | |
| def maximum_calibration_error( | |
|     confidences: Iterable[float], | |
|     is_correct: Iterable[bool | int], | |
|     n_bins: int = 10, | |
| ) -> float: | |
|     """Maximum Calibration Error: worst gap between confidence and accuracy | |
|     over all non-empty bins. | |
|     Useful for spotting a localized lie of the engine (e.g. it reports 95% | |
|     confidence and is wrong half the time in that bin). | |
|     Returns | |
|     ------- | |
|     float | |
|         ``∈ [0, 1]``. ``0`` means perfect calibration; ``0.0`` is also | |
|         returned for empty input. | |
|     """ | |
|     bins = reliability_diagram(confidences, is_correct, n_bins=n_bins) | |
|     gaps = [b.gap for b in bins if b.gap is not None] | |
|     return max(gaps) if gaps else 0.0 | |
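A small worked example may help to see how the two measures differ. The sketch below re-implements the binning in compact form purely for illustration (``bin_gaps``, ``ece`` and ``mce`` are hypothetical names, not the module's functions) and uses made-up inputs. It checks the constant-bias invariant (ECE = |bias|) and the "95% confident, wrong half the time" case.

```python
import math


def bin_gaps(confs, correct, n_bins=10):
    """Illustrative helper: (count, |avg_conf - accuracy|) for each non-empty bin."""
    buckets = {}
    for c, ok in zip(confs, correct):
        idx = n_bins - 1 if c >= 1.0 else min(int(c * n_bins), n_bins - 1)
        buckets.setdefault(idx, []).append((c, int(bool(ok))))
    gaps = []
    for items in buckets.values():
        n = len(items)
        avg_conf = sum(c for c, _ in items) / n
        accuracy = sum(ok for _, ok in items) / n
        gaps.append((n, abs(avg_conf - accuracy)))
    return gaps


def ece(confs, correct, n_bins=10):
    gaps = bin_gaps(confs, correct, n_bins)
    total = sum(n for n, _ in gaps)
    return sum(n / total * g for n, g in gaps) if total else 0.0


def mce(confs, correct, n_bins=10):
    return max((g for _, g in bin_gaps(confs, correct, n_bins)), default=0.0)


# Constant bias: every prediction claims 0.9, but only 7 out of 10 are right.
bias_confs = [0.9] * 10
bias_correct = [1] * 7 + [0] * 3
assert math.isclose(ece(bias_confs, bias_correct), 0.2, abs_tol=1e-9)

# Localized lie: 80 calibrated predictions at 0.5, plus 20 predictions that
# claim 0.95 and are right only half the time.
lie_confs = [0.5] * 80 + [0.95] * 20
lie_correct = [1] * 40 + [0] * 40 + [1] * 10 + [0] * 10
print(round(ece(lie_confs, lie_correct), 4))  # 0.09
print(round(mce(lie_confs, lie_correct), 4))  # 0.45
```

In the second case the ECE stays small because the faulty bin holds only 20% of the predictions, while the MCE exposes the full 0.45 gap.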
| # ────────────────────────────────────────────────────────────────────────── | |
| # Aggregated view | |
| # ────────────────────────────────────────────────────────────────────────── | |
| def compute_calibration_metrics( | |
|     confidences: Iterable[float], | |
|     is_correct: Iterable[bool | int], | |
|     n_bins: int = 10, | |
| ) -> dict: | |
|     """Compute all calibration metrics in a single call. | |
|     Returns | |
|     ------- | |
|     dict | |
|         ``{ | |
|             "ece": float, | |
|             "mce": float, | |
|             "n_bins": int, | |
|             "n_predictions": int, | |
|             "overall_accuracy": float, | |
|             "overall_confidence": float, | |
|             "bins": [ | |
|                 {"bin_low", "bin_high", "avg_confidence", | |
|                  "accuracy", "count", "gap"}, | |
|                 ... | |
|             ], | |
|         }`` | |
|     """ | |
|     # Materialize the inputs once: they are traversed several times below. | |
|     confs = list(confidences) | |
|     correct = list(is_correct) | |
|     bins = reliability_diagram(confs, correct, n_bins=n_bins) | |
|     total = sum(b.count for b in bins) | |
|     overall_acc = ( | |
|         sum(int(bool(x)) for x in correct) / total if total > 0 else 0.0 | |
|     ) | |
|     overall_conf = ( | |
|         sum(float(c) for c in confs) / total if total > 0 else 0.0 | |
|     ) | |
|     ece = 0.0 | |
|     if total > 0: | |
|         for b in bins: | |
|             if b.gap is None: | |
|                 continue | |
|             ece += (b.count / total) * b.gap | |
|     mce = max((b.gap for b in bins if b.gap is not None), default=0.0) | |
|     return { | |
|         "ece": ece, | |
|         "mce": mce, | |
|         "n_bins": n_bins, | |
|         "n_predictions": total, | |
|         "overall_accuracy": overall_acc, | |
|         "overall_confidence": overall_conf, | |
|         "bins": [ | |
|             { | |
|                 "bin_low": b.bin_low, | |
|                 "bin_high": b.bin_high, | |
|                 "avg_confidence": b.avg_confidence, | |
|                 "accuracy": b.accuracy, | |
|                 "count": b.count, | |
|                 "gap": b.gap, | |
|             } | |
|             for b in bins | |
|         ], | |
|     } | |
| __all__ = [ | |
|     "CalibrationBin", | |
|     "reliability_diagram", | |
|     "expected_calibration_error", | |
|     "maximum_calibration_error", | |
|     "compute_calibration_metrics", | |
| ] | |
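The module docstring leaves the rendering of the reliability table to a later sprint. As a stopgap for quick inspection in a terminal, the table can be printed as text. This is only a sketch under stated assumptions: ``render_reliability`` is a hypothetical helper, the rows follow the ``(bin_low, bin_high, avg_conf, accuracy, count)`` layout described in the docstring, and the sample rows are made up.

```python
def render_reliability(rows, width=20):
    """Hypothetical helper: one text line per bin, with a '#' bar for accuracy."""
    lines = []
    for low, high, avg_conf, accuracy, count in rows:
        label = f"{low:.1f}-{high:.1f}"
        if count == 0:
            # Empty bins carry None for avg_conf and accuracy.
            lines.append(f"{label}  (empty)")
            continue
        bar = "#" * round(accuracy * width)
        lines.append(
            f"{label}  {bar:<{width}}  "
            f"acc={accuracy:.2f} conf={avg_conf:.2f} n={count}"
        )
    return "\n".join(lines)


# Made-up rows, for illustration only.
sample_rows = [
    (0.7, 0.8, None, None, 0),
    (0.8, 0.9, 0.85, 0.80, 40),
    (0.9, 1.0, 0.95, 0.50, 20),
]
report = render_reliability(sample_rows)
print(report)
```

A bar much shorter than the reported ``conf`` value suggests is the visual sign of an overconfident bin.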