"""Sprint 39 tests: calibration metrics (ECE, MCE, reliability diagram).

The ``picarones.measurements.calibration`` module exposes:

- ``CalibrationBin``: one bin of the reliability diagram
- ``reliability_diagram(confidences, is_correct, n_bins=10)``
- ``expected_calibration_error`` (ECE)
- ``maximum_calibration_error`` (MCE)
- ``compute_calibration_metrics``: aggregated view

The tests check:

1. **Perfect calibration**: uniform confidences equal to the bin's
   accuracy → ECE = MCE = 0.
2. **Extreme overconfidence**: confidence = 1.0 but 50% correct →
   ECE = 0.5 and MCE = 0.5.
3. **Extreme underconfidence**: confidence = 0.5 but 100% correct →
   ECE = 0.5 and MCE = 0.5.
4. **Constant bias**: confidence = c, accuracy = a → ECE = |c - a|.
5. **Reliability diagram**: correct binning, correct bounds,
   confidence 1.0 included in the last bin.
6. **Empty bins** handled correctly (avg_confidence/accuracy = None,
   count = 0, gap = None).
7. **Empty lists** → ECE = 0, MCE = 0.
8. **Guards**: mismatched lengths → ValueError;
   confidence outside [0, 1] → ValueError; n_bins < 1 → ValueError.
9. **Configurable n_bins**: 5 bins vs 20 bins, bounds adjusted accordingly.
10. **compute_calibration_metrics**: complete return structure,
    consistent with the individual functions.
11. **CalibrationBin.gap**: expected behavior (None for an empty bin).
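
The conventions checked above can be sketched in plain Python. This is an
illustrative sketch only, not the picarones implementation: the name
``sketch_ece_mce`` is hypothetical, and it assumes equal-width bins
[low, high) with confidence 1.0 folded into the last bin, empty bins
skipped, and empty input yielding 0.

```python
# Hypothetical reference sketch; the real module may differ in details.
def sketch_ece_mce(confidences, is_correct, n_bins=10):
    # Per-bin accumulators: [sum of confidences, number correct, count].
    sums = [[0.0, 0, 0] for _ in range(n_bins)]
    for conf, ok in zip(confidences, is_correct):
        # Equal-width bins [low, high); min() folds 1.0 into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        sums[idx][0] += conf
        sums[idx][1] += int(ok)
        sums[idx][2] += 1
    total = len(confidences)
    ece, mce = 0.0, 0.0
    for conf_sum, n_ok, count in sums:
        if count == 0:
            continue  # empty bins contribute nothing
        gap = abs(conf_sum / count - n_ok / count)
        ece += (count / total) * gap  # ECE: count-weighted mean of gaps
        mce = max(mce, gap)           # MCE: largest gap over non-empty bins
    return ece, mce


# Extreme overconfidence: always "100% sure", right half the time.
ece, mce = sketch_ece_mce([1.0] * 10, [1] * 5 + [0] * 5)
```

Assigning bins with ``int(conf * n_bins)`` is sensitive to floating-point
rounding near bin edges, so an actual implementation may handle boundaries
differently.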
"""

from __future__ import annotations

import pytest

from picarones.measurements.calibration import (
    CalibrationBin,
    compute_calibration_metrics,
    expected_calibration_error,
    maximum_calibration_error,
    reliability_diagram,
)


# ────────────────────────────────────────────────────────────────────────────
# 1. Perfect calibration
# ────────────────────────────────────────────────────────────────────────────


class TestPerfectCalibration:
    def test_uniform_confidence_matching_accuracy_per_bin(self) -> None:
        """All predictions have confidence 0.75 and 75% of them are correct.
        The only non-empty bin is [0.7, 0.8), with gap = 0.
        """
        confs = [0.75] * 100
        correct = [1] * 75 + [0] * 25
        assert expected_calibration_error(confs, correct) == pytest.approx(0.0, abs=1e-9)
        assert maximum_calibration_error(confs, correct) == pytest.approx(0.0, abs=1e-9)

    def test_two_bins_each_perfectly_calibrated(self) -> None:
        # Bin [0.2, 0.3): 25% correct, confidence 0.25
        # Bin [0.8, 0.9): 85% correct, confidence 0.85
        confs = [0.25] * 100 + [0.85] * 100
        correct = [1] * 25 + [0] * 75 + [1] * 85 + [0] * 15
        assert expected_calibration_error(confs, correct) == pytest.approx(0.0, abs=1e-9)


# ────────────────────────────────────────────────────────────────────────────
# 2-3. Extreme cases
# ────────────────────────────────────────────────────────────────────────────


class TestExtremeCases:
    def test_extreme_overconfidence(self) -> None:
        # The engine claims to be "100% sure" but is wrong half the time
        confs = [1.0] * 10
        correct = [1] * 5 + [0] * 5
        assert expected_calibration_error(confs, correct) == pytest.approx(0.5)
        assert maximum_calibration_error(confs, correct) == pytest.approx(0.5)

    def test_extreme_underconfidence(self) -> None:
        # The engine claims to be "50% sure" but is always right
        confs = [0.5] * 10
        correct = [1] * 10
        assert expected_calibration_error(confs, correct) == pytest.approx(0.5)
        assert maximum_calibration_error(confs, correct) == pytest.approx(0.5)


# ────────────────────────────────────────────────────────────────────────────
# 4. Constant bias (gap = |c - a|)
# ────────────────────────────────────────────────────────────────────────────


class TestConstantBias:
    @pytest.mark.parametrize("conf,acc", [(0.6, 0.4), (0.3, 0.7), (0.95, 0.85)])
    def test_constant_bias_is_absolute_gap(
        self, conf: float, acc: float
    ) -> None:
        """With a single non-empty bin, ECE = |conf - acc|."""
        n = 100
        confs = [conf] * n
        n_correct = int(round(acc * n))
        correct = [1] * n_correct + [0] * (n - n_correct)
        ece = expected_calibration_error(confs, correct)
        # Effective accuracy = n_correct / n (may differ slightly from the
        # target acc because of integer rounding)
        actual_acc = n_correct / n
        assert ece == pytest.approx(abs(conf - actual_acc), abs=1e-9)


# ────────────────────────────────────────────────────────────────────────────
# 5. Reliability diagram: binning
# ────────────────────────────────────────────────────────────────────────────


class TestReliabilityDiagramBinning:
    def test_default_returns_10_bins(self) -> None:
        bins = reliability_diagram([0.5], [1])
        assert len(bins) == 10

    def test_bin_bounds_are_equidistant(self) -> None:
        bins = reliability_diagram([], [], n_bins=5)
        widths = [b.bin_high - b.bin_low for b in bins]
        for w in widths:
            assert w == pytest.approx(0.2, abs=1e-9)
        assert bins[0].bin_low == pytest.approx(0.0)
        assert bins[-1].bin_high == pytest.approx(1.0)

    def test_confidence_1_falls_in_last_bin(self) -> None:
        bins = reliability_diagram([1.0, 1.0, 1.0], [1, 0, 1], n_bins=10)
        # All predictions must land in the last bin
        assert bins[-1].count == 3
        assert sum(b.count for b in bins[:-1]) == 0

    def test_predictions_assigned_to_correct_bin(self) -> None:
        bins = reliability_diagram(
            [0.05, 0.15, 0.55, 0.95],
            [0, 1, 1, 0],
            n_bins=10,
        )
        # bin [0.0, 0.1) → 1 prediction
        assert bins[0].count == 1
        # bin [0.1, 0.2) → 1 prediction
        assert bins[1].count == 1
        # bin [0.5, 0.6) → 1 prediction
        assert bins[5].count == 1
        # bin [0.9, 1.0] → 1 prediction
        assert bins[9].count == 1

    def test_avg_confidence_and_accuracy_per_bin(self) -> None:
        # Bin [0.6, 0.7): confidences 0.6 and 0.65; correctness flags 1 and 0
        bins = reliability_diagram([0.6, 0.65], [1, 0], n_bins=10)
        b6 = bins[6]
        assert b6.count == 2
        assert b6.avg_confidence == pytest.approx((0.6 + 0.65) / 2)
        assert b6.accuracy == pytest.approx(0.5)


# ────────────────────────────────────────────────────────────────────────────
# 6. Empty bins
# ────────────────────────────────────────────────────────────────────────────


class TestEmptyBins:
    def test_empty_bin_has_none_avg_and_accuracy(self) -> None:
        bins = reliability_diagram([0.95], [1], n_bins=10)
        # Every bin except the last is empty
        for b in bins[:-1]:
            assert b.count == 0
            assert b.avg_confidence is None
            assert b.accuracy is None
            assert b.gap is None

    def test_ece_skips_empty_bins(self) -> None:
        # With a single non-empty bin (confidence 0.55, accuracy 0.6),
        # ECE equals that bin's gap, 0.05; empty bins contribute nothing.
        bins = reliability_diagram([0.55] * 10, [1] * 6 + [0] * 4)
        assert expected_calibration_error([0.55] * 10, [1] * 6 + [0] * 4) == \
            pytest.approx(0.05)
        # Confirm that the other nine bins are empty
        empty = [b for b in bins if b.count == 0]
        assert len(empty) == 9


# ────────────────────────────────────────────────────────────────────────────
# 7. Empty lists
# ────────────────────────────────────────────────────────────────────────────


class TestEmptyInputs:
    def test_empty_lists_return_zero(self) -> None:
        assert expected_calibration_error([], []) == 0.0
        assert maximum_calibration_error([], []) == 0.0

    def test_empty_reliability_diagram(self) -> None:
        bins = reliability_diagram([], [], n_bins=10)
        assert len(bins) == 10
        assert all(b.count == 0 for b in bins)


# ────────────────────────────────────────────────────────────────────────────
# 8. Guards
# ────────────────────────────────────────────────────────────────────────────


class TestGuards:
    def test_length_mismatch_raises(self) -> None:
        with pytest.raises(ValueError, match="Longueurs"):
            expected_calibration_error([0.5, 0.5], [1])

    def test_confidence_above_one_raises(self) -> None:
        with pytest.raises(ValueError, match="hors"):
            expected_calibration_error([1.5], [1])

    def test_negative_confidence_raises(self) -> None:
        with pytest.raises(ValueError, match="hors"):
            expected_calibration_error([-0.1], [1])

    def test_invalid_n_bins_raises(self) -> None:
        with pytest.raises(ValueError, match="n_bins"):
            reliability_diagram([0.5], [1], n_bins=0)

    def test_n_bins_negative_raises(self) -> None:
        with pytest.raises(ValueError, match="n_bins"):
            reliability_diagram([0.5], [1], n_bins=-3)


# ────────────────────────────────────────────────────────────────────────────
# 9. Configurable n_bins
# ────────────────────────────────────────────────────────────────────────────


class TestVariableNBins:
    @pytest.mark.parametrize("n_bins,expected_width", [
        (5, 0.2), (10, 0.1), (20, 0.05), (1, 1.0),
    ])
    def test_bin_width_scales_with_n_bins(
        self, n_bins: int, expected_width: float
    ) -> None:
        bins = reliability_diagram([], [], n_bins=n_bins)
        assert len(bins) == n_bins
        for b in bins:
            assert (b.bin_high - b.bin_low) == pytest.approx(expected_width)

    def test_finer_bins_can_only_increase_or_keep_ece(self) -> None:
        """For a fixed sample, a finer nested binning (each of the 5 bins
        splits into 4 of the 20) can reveal gaps that a coarse binning
        averages out, so ECE does not decrease."""
        confs = [0.6, 0.65, 0.7, 0.95, 0.95]
        correct = [1, 0, 1, 1, 0]
        ece_5 = expected_calibration_error(confs, correct, n_bins=5)
        ece_20 = expected_calibration_error(confs, correct, n_bins=20)
        assert ece_20 >= ece_5 - 1e-9


# ────────────────────────────────────────────────────────────────────────────
# 10. compute_calibration_metrics
# ────────────────────────────────────────────────────────────────────────────


class TestComputeCalibrationMetrics:
    def test_returns_full_structure(self) -> None:
        confs = [0.6, 0.7, 0.95, 0.95]
        correct = [1, 0, 1, 1]
        out = compute_calibration_metrics(confs, correct, n_bins=10)
        assert set(out.keys()) >= {
            "ece", "mce", "n_bins", "n_predictions",
            "overall_accuracy", "overall_confidence", "bins",
        }
        assert out["n_predictions"] == 4
        assert out["overall_accuracy"] == pytest.approx(3 / 4)
        assert out["overall_confidence"] == pytest.approx((0.6 + 0.7 + 0.95 + 0.95) / 4)
        assert len(out["bins"]) == 10

    def test_ece_matches_function(self) -> None:
        confs = [0.55, 0.65, 0.75, 0.85, 0.95]
        correct = [1, 0, 1, 0, 1]
        out = compute_calibration_metrics(confs, correct)
        assert out["ece"] == pytest.approx(
            expected_calibration_error(confs, correct), abs=1e-9
        )
        assert out["mce"] == pytest.approx(
            maximum_calibration_error(confs, correct), abs=1e-9
        )

    def test_bin_dicts_contain_gap(self) -> None:
        out = compute_calibration_metrics([0.55] * 4, [1, 1, 0, 1])
        # Bin [0.5, 0.6): avg_conf = 0.55, accuracy = 0.75, gap = 0.20
        b5 = out["bins"][5]
        assert b5["count"] == 4
        assert b5["gap"] == pytest.approx(0.20, abs=1e-9)


# ────────────────────────────────────────────────────────────────────────────
# 11. CalibrationBin.gap
# ────────────────────────────────────────────────────────────────────────────


class TestCalibrationBinGap:
    def test_gap_for_empty_bin_is_none(self) -> None:
        b = CalibrationBin(0.0, 0.1, None, None, 0)
        assert b.gap is None

    def test_gap_is_absolute_difference(self) -> None:
        b = CalibrationBin(0.5, 0.6, 0.55, 0.30, 10)
        assert b.gap == pytest.approx(0.25)

    def test_gap_symmetric(self) -> None:
        b1 = CalibrationBin(0.5, 0.6, 0.55, 0.30, 10)
        b2 = CalibrationBin(0.5, 0.6, 0.30, 0.55, 10)
        assert b1.gap == pytest.approx(b2.gap)