Spaces:
Sleeping
Sleeping
File size: 16,098 Bytes
76e79a0 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 | # Contextual glossary — English.
# Mirror of fr.yaml. Every term listed in fr.yaml must have an entry here.
cer:
title: "CER — Character Error Rate"
definition: >-
Character-level error rate, computed as the ratio of the Levenshtein
edit distance (substitutions + insertions + deletions) to the length
of the reference string. Expressed as a percentage.
measures: >-
Character-by-character fidelity between the predicted transcript and
the ground truth, without normalization.
usage: >-
The most common metric in OCR/HTR evaluation, used by ICDAR
competitions since the early 2000s.
limits: >-
Insensitive to graphic variants (ſ vs s, u vs v) which may be
preserved in a heritage corpus ground truth — see diplomatic CER.
reference: >-
Kay, M. (2007). "Optical Character Recognition". Handbook of Natural
Language Processing, 2nd ed.
cer_nfc:
title: "NFC CER"
definition: >-
CER computed after Unicode NFC (Canonical Decomposition, followed by
Canonical Composition) normalization of both reference and hypothesis.
measures: >-
Text fidelity while ignoring Unicode representation differences that
are semantically equivalent (e.g. precomposed é vs decomposed é).
usage: >-
Essential when ground truth and OCR output use different but
equivalent Unicode forms.
limits: >-
Does not solve semantically significant graphic variants (ſ, ligatures
that cannot be decomposed).
reference: >-
Unicode Technical Report #15 — Unicode Normalization Forms.
cer_caseless:
title: "Case-insensitive CER"
definition: >-
CER computed after lowercase folding (``casefold``) of reference and
hypothesis.
measures: >-
Text fidelity ignoring uppercase/lowercase differences.
usage: >-
Useful for corpora where case is not deemed significant (many early
printed books, inconsistent capitalization).
limits: >-
Masks editorial choices regarding proper nouns and sentence openings.
reference: >-
Ibid. — CER.
cer_diplomatic:
title: "Diplomatic CER"
definition: >-
CER computed after diplomatic normalization of a heritage corpus:
merging ``ſ=s``, ``u=v``, ``i=j``, expanding abbreviations, etc.
measures: >-
Substantial errors, while ignoring graphic variants codified by
editorial conventions (diplomatic vs normalized).
usage: >-
Often used when evaluating OCR/HTR on pre-19th-century corpora where
the ground truth preserves old spellings irrelevant to searchability.
limits: >-
Masks editorial choices that are relevant to strict philology. The
applied profile depends on conventions (MUFI, Capitains…) that vary
between communities.
reference: >-
Pierazzo, E. (2015). "Digital Scholarly Editing". Ashgate.
wer:
title: "WER — Word Error Rate"
definition: >-
Word-level error rate, computed as the word-to-word Levenshtein
distance divided by the number of words in the reference.
measures: >-
Word-to-word fidelity, sensitive to segmentation (a misplaced space
counts as two errors).
usage: >-
Historical standard in speech recognition, adopted in OCR/HTR to
assess full-text search usability.
limits: >-
Very sensitive to segmentation. A 5 % CER can translate to a 20 %
WER if errors touch different words each time.
reference: >-
Morris, A. C., Maier, V., & Green, P. (2004). "From WER and RIL to
MER and WIL". ICSLP.
mer:
title: "MER — Match Error Rate"
definition: >-
WER variant that caps error at 1 by accounting for insertions (WER
can exceed 1, MER cannot).
measures: >-
A more stable version of WER, bounded in [0, 1].
usage: >-
Proposed by Morris et al. (2004) to correct the asymmetry of WER in
the presence of excessive insertions.
limits: >-
Less widespread than WER — historical comparative tables often use
WER, not MER.
reference: >-
Morris, A. C., Maier, V., & Green, P. (2004). Ibid.
wil:
title: "WIL — Word Information Lost"
definition: >-
Measures word-level information loss; accounts for both correctly
recognized content and noise introduced by the system.
measures: >-
The amount of semantic information lost at the word level.
usage: >-
Useful alongside WER to diagnose noisy hypotheses (many unrelated
insertions).
limits: >-
Less intuitive than a simple error rate.
reference: >-
Morris, A. C., Maier, V., & Green, P. (2004). Ibid.
ligature_score:
title: "Ligature score"
definition: >-
Proportion of ligatures (``fi``, ``fl``, ``œ``, ``æ``, ``ꝑ``, ``ꝓ``…)
correctly rendered by the engine.
measures: >-
The engine's ability to recognize the fused characters typical of
early printed books and medieval manuscripts.
usage: >-
Strong indicator for critical editions and philology.
limits: >-
Depends on Picarones' ligature table — some rare ligatures may be
absent.
reference: >-
MUFI — Medieval Unicode Font Initiative, Recommendations v4.
diacritic_score:
title: "Diacritic score"
definition: >-
Rate of diacritic preservation (acute, grave, tilde, cedilla, dieresis…)
between ground truth and OCR output.
measures: >-
Diacritic fidelity, measured after NFD decomposition.
usage: >-
Important for multilingual corpora and philological transcriptions
where diacritics are meaningful.
limits: >-
An engine may place a diacritic on the wrong letter — this metric
alone will not detect it.
reference: >-
Unicode Technical Report #15.
taxonomy:
title: "Error taxonomy (9 classes)"
definition: >-
Systematic classification of each error into 9 categories: visual
confusion, diacritic error, case error, ligature error, abbreviation,
hapax, segmentation, OOV character, lacuna.
measures: >-
An engine's error profile — reveals its specific weaknesses.
usage: >-
Fine-grained diagnosis for a given engine, useful for deciding whether
to switch models or tune an LLM post-correction prompt.
limits: >-
The ``difflib``-based classification is heuristic; a character can fall
into several classes simultaneously.
reference: >-
Clausner, C., Antonacopoulos, A., Pletschacher, S. (2020). "ICDAR 2019
Competition on Recognition of Historical Arabic Scientific Manuscripts".
confusion_matrix:
title: "Unicode confusion matrix"
definition: >-
Cross-table listing substitutions (GT char → OCR char) and their
frequencies across the corpus.
measures: >-
Character-to-character substitution patterns, readable symmetrically
(which GT character was confused with what?).
usage: >-
Compare two engines' "genetic fingerprints": if they confuse the same
characters, they were likely trained on similar data.
limits: >-
Does not capture segmentation errors (spaces) nor unmatched insertions.
reference: >-
Pletschacher, S., Clausner, C., Antonacopoulos, A. (2015). "Performance
Analysis Framework for Layout Analysis Methods".
gini:
title: "Gini coefficient of errors"
definition: >-
Measures error concentration across a document (between 0 = uniformly
distributed errors and 1 = all errors on a single line).
measures: >-
The unequal distribution of errors within a document — a high Gini
signals that a small fraction of lines concentrates most errors.
usage: >-
Identifies hard regions (marginal notes, damaged areas) that would
benefit from targeted correction.
limits: >-
Sensitive to the number of lines — not very informative on very short
documents.
reference: >-
Gini, C. (1912). "Variabilità e mutabilità".
hallucination_score:
title: "Hallucination score (LLM/VLM)"
definition: >-
Composite indicator combining trigram anchoring (fraction of
hypothesis trigrams present in GT) and length ratio
(hypothesis/GT) to detect hallucinations in LLM and VLM pipelines.
measures: >-
How likely the model invented text instead of reading the image.
usage: >-
Essential for OCR+LLM pipelines and zero-shot VLMs, where CER alone
is misleading (low CER can hide a hallucinated paraphrase).
limits: >-
A faithful but rephrased output may be falsely flagged.
reference: >-
Wiland, A. et al. (2024). "Hallucination Detection for Visual Language
Models on Historical Documents". DHd.
anchor_score:
title: "Trigram anchor score"
definition: >-
Fraction of word-level trigrams from the OCR hypothesis that also
exist in the ground truth.
measures: >-
How "anchored" the output is in the source text. High score = faithful
transcription; low score = probable hallucinations.
usage: >-
Complements CER for LLM/VLM pipelines.
limits: >-
On very short outputs, the score can be noisy (few trigrams available).
reference: >-
Wiland, A. et al. (2024). Ibid.
length_ratio:
title: "Length ratio"
definition: >-
Ratio of hypothesis character length to GT character length. Ratios
> 1.2 or < 0.8 are warning signals.
measures: >-
Excess or deficit of text produced by the engine.
usage: >-
Used with anchor score to flag hallucinations (verbose LLMs) or
omissions (LLMs skipping hard passages).
limits: >-
Highly dependent on the GT style (abbreviated vs expanded).
reference: >-
Wiland, A. et al. (2024). Ibid.
bootstrap_ci:
title: "Bootstrap confidence interval"
definition: >-
95 % confidence interval of the mean CER, computed by resampling
documents with replacement (1000 iterations by default).
measures: >-
Uncertainty associated with the mean CER — the wider the interval,
the less reliable an ordinal ranking becomes.
usage: >-
Essential context for any mean CER; especially important on small
corpora (< 30 documents).
limits: >-
Assumes document independence — not strictly true for series (same
scribe, same manuscript).
reference: >-
Efron, B. (1979). "Bootstrap Methods: Another Look at the Jackknife".
Annals of Statistics.
wilcoxon:
title: "Wilcoxon signed-rank test"
definition: >-
Non-parametric test of equality between two series of paired
measurements (same documents, two different engines).
measures: >-
Statistical significance of an observed gap between two engines,
without assuming normality of distributions.
usage: >-
Pairwise comparison of two engines on a corpus.
limits: >-
When applied repeatedly across all pairs of k engines, the Type-I
error risk grows — prefer Friedman-Nemenyi to compare more than two
engines.
reference: >-
Wilcoxon, F. (1945). "Individual Comparisons by Ranking Methods".
Biometrics Bulletin.
friedman:
title: "Friedman test"
definition: >-
Non-parametric equivalent of repeated-measures ANOVA: tests whether
at least one engine among k differs from the others on n documents.
measures: >-
A global difference between k engines across n blocks (documents).
usage: >-
Prelude to the Nemenyi post-hoc. Recommended whenever more than two
engines are compared, to control the multi-comparison risk.
limits: >-
Does not identify which pairs differ — the post-hoc is required.
reference: >-
Friedman, M. (1937). "The Use of Ranks to Avoid the Assumption of
Normality Implicit in the Analysis of Variance".
nemenyi:
title: "Nemenyi post-hoc"
definition: >-
Post-hoc test applied after a Friedman test to identify distinguishable
engine pairs. Computes a ``critical distance`` (CD) depending on the
number of engines and documents.
measures: >-
Pairs of engines whose mean ranks differ significantly.
usage: >-
Basis of the Critical Difference Diagram (Demšar 2006).
limits: >-
Conservative by construction (corrects for multiple comparisons);
may miss real but subtle differences.
reference: >-
Nemenyi, P. (1963). "Distribution-free Multiple Comparisons".
cdd:
title: "Critical Difference Diagram"
definition: >-
Graphical rendering of Friedman-Nemenyi results: engines placed on a
horizontal axis (mean rank), connected by a bar if they are not
statistically distinguishable at the α level.
measures: >-
Global ordering of engines and indistinguishability groups.
usage: >-
De facto standard in ML since Demšar 2006 for comparing multiple
systems over multiple datasets.
limits: >-
Can be hard to read when several groups partially overlap.
reference: >-
Demšar, J. (2006). "Statistical Comparisons of Classifiers over
Multiple Data Sets". JMLR 7:1-30.
pareto_front:
title: "Pareto front"
definition: >-
Set of engines for which no other offers simultaneously better quality
AND better cost (or any other objective pair).
measures: >-
"Non-dominated" trade-offs — choosing outside the Pareto front is
always suboptimal, but choosing on the front depends on each
institution's priorities.
usage: >-
Core of the report's quality/cost view. Also applicable to
quality/speed or quality/carbon.
limits: >-
Costs used are indicative (see ``pricing.yaml``) and age quickly.
Always cross-check with real invoices before purchase decisions.
reference: >-
Pareto, V. (1906). "Manuale di economia politica".
difficulty_score:
title: "Intrinsic difficulty score"
definition: >-
Score in [0, 1] combining inter-engine CER variance, image quality,
and density of heritage-specific characters.
measures: >-
How intrinsically difficult a document is, independently of the
evaluation instrument.
usage: >-
Allows stratifying the report (easy vs hard documents) and
interpreting a global CER while accounting for corpus specifics.
limits: >-
Default weights (0.4, 0.35, 0.25) are heuristic and can be adjusted
to the context.
reference: >-
Stutzmann, D. (2017). "Clustering of medieval scripts through
computer image analysis".
normalization_profile:
title: "Normalization profile"
definition: >-
Set of transformation rules applied to GT and hypothesis before CER
computation: ſ=s merge, u=v, i=j, abbreviation expansion, character
exclusion, etc.
measures: >-
The choice of an editorial convention for the CER computation — does
not affect source data.
usage: >-
Picarones ships 9 preconfigured profiles (medieval_french,
early_modern_english, medieval_latin…). Additional profiles can be
loaded from YAML.
limits: >-
Too aggressive → masks real errors; too strict → overestimates error.
reference: >-
See ``picarones/core/normalization.py`` for the profile list.
structure:
title: "Structural scores"
definition: >-
Set of structure-level measures: line fusion rate, fragmentation rate,
reading-order (LCS), paragraph preservation.
measures: >-
The integrity of reconstructed layout, beyond character-level text.
usage: >-
Crucial for multi-column documents (newspapers, glossed Bibles) where
a low CER can hide a broken reading order.
limits: >-
Depends on structure annotations in the GT — not always available.
reference: >-
Antonacopoulos, A. et al. (2015). "ICDAR 2015 Competition on Text
Line Detection in Historical Documents".
image_quality:
title: "Image quality"
definition: >-
Composite [0, 1] score combining sharpness (Laplacian variance),
noise level, contrast, and rotation angle estimate.
measures: >-
The physical characteristics of the source image that may degrade
recognition.
usage: >-
Used to stratify results (good vs bad images) and to identify
documents that would benefit from rescanning.
limits: >-
Pure image-level score; does not capture paleographic difficulties
(cursive scripts, dense abbreviations).
reference: >-
Antonacopoulos, A., Bridson, D., Papadopoulos, C., Pletschacher, S.
(2009). "A Realistic Dataset for Performance Evaluation of Document
Layout Analysis".
|