| # Contextual glossary — English. | |
| # Mirror of fr.yaml. Every term listed in fr.yaml must have an entry here. | |
| cer: | |
| title: "CER — Character Error Rate" | |
| definition: >- | |
| Character-level error rate, computed as the ratio of the Levenshtein | |
| edit distance (substitutions + insertions + deletions) to the length | |
| of the reference string. Expressed as a percentage. | |
| measures: >- | |
| Character-by-character fidelity between the predicted transcript and | |
| the ground truth, without normalization. | |
| usage: >- | |
| The most common metric in OCR/HTR evaluation, used by ICDAR | |
| competitions since the early 2000s. | |
| limits: >- | |
| Penalizes graphic variants (ſ vs s, u vs v) that the ground truth of a | |
| heritage corpus may deliberately preserve; see diplomatic CER. | |
| reference: >- | |
| Kay, M. (2007). "Optical Character Recognition". Handbook of Natural | |
| Language Processing, 2nd ed. | |
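| # Illustrative worked example (hypothetical strings, not taken from fr.yaml): | |
| # reference "maison" (6 characters), hypothesis "maisom" (1 substitution), | |
| # so CER = 1 / 6 ≈ 0.167, reported as 16.7 %. | |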
| cer_nfc: | |
| title: "NFC CER" | |
| definition: >- | |
| CER computed after Unicode NFC (Canonical Decomposition, followed by | |
| Canonical Composition) normalization of both reference and hypothesis. | |
| measures: >- | |
| Text fidelity while ignoring Unicode representation differences that | |
| are semantically equivalent (e.g. precomposed é vs decomposed é). | |
| usage: >- | |
| Essential when ground truth and OCR output use different but | |
| equivalent Unicode forms. | |
| limits: >- | |
| Does not solve semantically significant graphic variants (ſ, ligatures | |
| that cannot be decomposed). | |
| reference: >- | |
| Unicode Standard Annex #15 (UAX #15), Unicode Normalization Forms. | |
| cer_caseless: | |
| title: "Case-insensitive CER" | |
| definition: >- | |
| CER computed after case folding (``casefold``) of reference and | |
| hypothesis. | |
| measures: >- | |
| Text fidelity ignoring uppercase/lowercase differences. | |
| usage: >- | |
| Useful for corpora where case is not deemed significant (many early | |
| printed books, inconsistent capitalization). | |
| limits: >- | |
| Masks editorial choices regarding proper nouns and sentence openings. | |
| reference: >- | |
| Ibid. — CER. | |
| cer_diplomatic: | |
| title: "Diplomatic CER" | |
| definition: >- | |
| CER computed after diplomatic normalization of a heritage corpus: | |
| merging ``ſ=s``, ``u=v``, ``i=j``, expanding abbreviations, etc. | |
| measures: >- | |
| Substantive errors, while ignoring graphic variants codified by | |
| editorial conventions (diplomatic vs normalized). | |
| usage: >- | |
| Often used when evaluating OCR/HTR on pre-19th-century corpora where | |
| the ground truth preserves old spellings irrelevant to searchability. | |
| limits: >- | |
| Masks editorial choices that are relevant to strict philology. The | |
| applied profile depends on conventions (MUFI, Capitains…) that vary | |
| between communities. | |
| reference: >- | |
| Pierazzo, E. (2015). "Digital Scholarly Editing". Ashgate. | |
| wer: | |
| title: "WER — Word Error Rate" | |
| definition: >- | |
| Word-level error rate, computed as the word-to-word Levenshtein | |
| distance divided by the number of words in the reference. | |
| measures: >- | |
| Word-to-word fidelity, sensitive to segmentation (a misplaced space | |
| counts as two errors). | |
| usage: >- | |
| Historical standard in speech recognition, adopted in OCR/HTR to | |
| assess full-text search usability. | |
| limits: >- | |
| Very sensitive to segmentation. A 5 % CER can translate to a 20 % | |
| WER if errors touch different words each time. | |
| reference: >- | |
| Morris, A. C., Maier, V., & Green, P. (2004). "From WER and RIL to | |
| MER and WIL". ICSLP. | |
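| # Illustrative worked example (hypothetical strings): reference "the cat", | |
| # hypothesis "thec at". The misplaced space makes both words wrong, so | |
| # WER = 2 / 2 = 100 %, although only 2 of the 7 characters need editing. | |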
| mer: | |
| title: "MER — Match Error Rate" | |
| definition: >- | |
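| WER variant bounded by 1: the number of errors (substitutions, | |
| deletions, insertions) is divided by hits plus errors rather than by | |
| the reference length, so insertions cannot push it above 1 (WER can | |
| exceed 1, MER cannot). | |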
| measures: >- | |
| A more stable version of WER, bounded in [0, 1]. | |
| usage: >- | |
| Proposed by Morris et al. (2004) to correct the asymmetry of WER in | |
| the presence of excessive insertions. | |
| limits: >- | |
| Less widespread than WER — historical comparative tables often use | |
| WER, not MER. | |
| reference: >- | |
| Morris, A. C., Maier, V., & Green, P. (2004). Ibid. | |
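| # Illustrative worked example (hypothetical counts): a 4-word reference is | |
| # fully recognized (4 hits), but the hypothesis adds 6 spurious words. | |
| # WER = 6 / 4 = 1.5, whereas MER = 6 / (4 + 6) = 0.6. | |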
| wil: | |
| title: "WIL — Word Information Lost" | |
| definition: >- | |
| Measures word-level information loss; accounts for both correctly | |
| recognized content and noise introduced by the system. | |
| measures: >- | |
| The amount of semantic information lost at the word level. | |
| usage: >- | |
| Useful alongside WER to diagnose noisy hypotheses (many unrelated | |
| insertions). | |
| limits: >- | |
| Less intuitive than a simple error rate. | |
| reference: >- | |
| Morris, A. C., Maier, V., & Green, P. (2004). Ibid. | |
| ligature_score: | |
| title: "Ligature score" | |
| definition: >- | |
| Proportion of ligatures (``fi``, ``fl``, ``œ``, ``æ``, ``ꝑ``, ``ꝓ``…) | |
| correctly rendered by the engine. | |
| measures: >- | |
| The engine's ability to recognize the fused characters typical of | |
| early printed books and medieval manuscripts. | |
| usage: >- | |
| Strong indicator for critical editions and philology. | |
| limits: >- | |
| Depends on Picarones' ligature table — some rare ligatures may be | |
| absent. | |
| reference: >- | |
| MUFI — Medieval Unicode Font Initiative, Recommendations v4. | |
| diacritic_score: | |
| title: "Diacritic score" | |
| definition: >- | |
| Rate of diacritic preservation (acute, grave, tilde, cedilla, dieresis…) | |
| between ground truth and OCR output. | |
| measures: >- | |
| Diacritic fidelity, measured after NFD decomposition. | |
| usage: >- | |
| Important for multilingual corpora and philological transcriptions | |
| where diacritics are meaningful. | |
| limits: >- | |
| An engine may place a diacritic on the wrong letter — this metric | |
| alone will not detect it. | |
| reference: >- | |
| Unicode Standard Annex #15 (UAX #15). | |
| taxonomy: | |
| title: "Error taxonomy (9 classes)" | |
| definition: >- | |
| Systematic classification of each error into 9 categories: visual | |
| confusion, diacritic error, case error, ligature error, abbreviation, | |
| hapax, segmentation, OOV character, lacuna. | |
| measures: >- | |
| An engine's error profile — reveals its specific weaknesses. | |
| usage: >- | |
| Fine-grained diagnosis for a given engine, useful for deciding whether | |
| to switch models or tune an LLM post-correction prompt. | |
| limits: >- | |
| The ``difflib``-based classification is heuristic; a character can fall | |
| into several classes simultaneously. | |
| reference: >- | |
| Clausner, C., Antonacopoulos, A., Pletschacher, S. (2020). "ICDAR 2019 | |
| Competition on Recognition of Historical Arabic Scientific Manuscripts". | |
| confusion_matrix: | |
| title: "Unicode confusion matrix" | |
| definition: >- | |
| Cross-table listing substitutions (GT char → OCR char) and their | |
| frequencies across the corpus. | |
| measures: >- | |
| Character-to-character substitution patterns, readable in both | |
| directions (what a GT character was turned into, and which GT | |
| characters an OCR character came from). | |
| usage: >- | |
| Compare two engines' "genetic fingerprints": if they confuse the same | |
| characters, they were likely trained on similar data. | |
| limits: >- | |
| Captures neither segmentation errors (spaces) nor unmatched insertions. | |
| reference: >- | |
| Pletschacher, S., Clausner, C., Antonacopoulos, A. (2015). "Performance | |
| Analysis Framework for Layout Analysis Methods". | |
| gini: | |
| title: "Gini coefficient of errors" | |
| definition: >- | |
| Measures error concentration across the lines of a document, from 0 | |
| (errors spread evenly over all lines) to 1 (all errors on a single line). | |
| measures: >- | |
| The unequal distribution of errors within a document — a high Gini | |
| signals that a small fraction of lines concentrates most errors. | |
| usage: >- | |
| Identifies hard regions (marginal notes, damaged areas) that would | |
| benefit from targeted correction. | |
| limits: >- | |
| Sensitive to the number of lines — not very informative on very short | |
| documents. | |
| reference: >- | |
| Gini, C. (1912). "Variabilità e mutabilità". | |
| hallucination_score: | |
| title: "Hallucination score (LLM/VLM)" | |
| definition: >- | |
| Composite indicator combining trigram anchoring (fraction of | |
| hypothesis trigrams present in GT) and length ratio | |
| (hypothesis/GT) to detect hallucinations in LLM and VLM pipelines. | |
| measures: >- | |
| How likely it is that the model invented text instead of reading the image. | |
| usage: >- | |
| Essential for OCR+LLM pipelines and zero-shot VLMs, where CER alone | |
| is misleading (low CER can hide a hallucinated paraphrase). | |
| limits: >- | |
| A faithful but rephrased output may be falsely flagged. | |
| reference: >- | |
| Wiland, A. et al. (2024). "Hallucination Detection for Visual Language | |
| Models on Historical Documents". DHd. | |
| anchor_score: | |
| title: "Trigram anchor score" | |
| definition: >- | |
| Fraction of word-level trigrams from the OCR hypothesis that also | |
| exist in the ground truth. | |
| measures: >- | |
| How "anchored" the output is in the source text. High score = faithful | |
| transcription; low score = probable hallucinations. | |
| usage: >- | |
| Complements CER for LLM/VLM pipelines. | |
| limits: >- | |
| On very short outputs, the score can be noisy (few trigrams available). | |
| reference: >- | |
| Wiland, A. et al. (2024). Ibid. | |
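| # Illustrative worked example (hypothetical strings): GT "the king is dead | |
| # this evening", hypothesis "the king is dead that evening". The hypothesis | |
| # has 4 word trigrams ("the king is", "king is dead", "is dead that", | |
| # "dead that evening"); the first 2 also occur in the GT, so the score is | |
| # 2 / 4 = 0.5. | |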
| length_ratio: | |
| title: "Length ratio" | |
| definition: >- | |
| Ratio of hypothesis character length to GT character length. Ratios | |
| > 1.2 or < 0.8 are warning signals. | |
| measures: >- | |
| Excess or deficit of text produced by the engine. | |
| usage: >- | |
| Used with anchor score to flag hallucinations (verbose LLMs) or | |
| omissions (LLMs skipping hard passages). | |
| limits: >- | |
| Highly dependent on the GT style (abbreviated vs expanded). | |
| reference: >- | |
| Wiland, A. et al. (2024). Ibid. | |
| bootstrap_ci: | |
| title: "Bootstrap confidence interval" | |
| definition: >- | |
| 95 % confidence interval of the mean CER, computed by resampling | |
| documents with replacement (1000 iterations by default). | |
| measures: >- | |
| Uncertainty associated with the mean CER — the wider the interval, | |
| the less reliable an ordinal ranking becomes. | |
| usage: >- | |
| Essential context for any mean CER; especially important on small | |
| corpora (< 30 documents). | |
| limits: >- | |
| Assumes document independence — not strictly true for series (same | |
| scribe, same manuscript). | |
| reference: >- | |
| Efron, B. (1979). "Bootstrap Methods: Another Look at the Jackknife". | |
| Annals of Statistics. | |
| wilcoxon: | |
| title: "Wilcoxon signed-rank test" | |
| definition: >- | |
| Non-parametric test of equality between two series of paired | |
| measurements (same documents, two different engines). | |
| measures: >- | |
| Statistical significance of an observed gap between two engines, | |
| without assuming normality of distributions. | |
| usage: >- | |
| Pairwise comparison of two engines on a corpus. | |
| limits: >- | |
| When applied repeatedly across all pairs of k engines, the Type-I | |
| error risk grows — prefer Friedman-Nemenyi to compare more than two | |
| engines. | |
| reference: >- | |
| Wilcoxon, F. (1945). "Individual Comparisons by Ranking Methods". | |
| Biometrics Bulletin. | |
| friedman: | |
| title: "Friedman test" | |
| definition: >- | |
| Non-parametric equivalent of repeated-measures ANOVA: tests whether | |
| at least one engine among k differs from the others on n documents. | |
| measures: >- | |
| A global difference between k engines across n blocks (documents). | |
| usage: >- | |
| Prelude to the Nemenyi post-hoc. Recommended whenever more than two | |
| engines are compared, to control the multiple-comparison risk. | |
| limits: >- | |
| Does not identify which pairs differ — the post-hoc is required. | |
| reference: >- | |
| Friedman, M. (1937). "The Use of Ranks to Avoid the Assumption of | |
| Normality Implicit in the Analysis of Variance". | |
| nemenyi: | |
| title: "Nemenyi post-hoc" | |
| definition: >- | |
| Post-hoc test applied after a Friedman test to identify distinguishable | |
| engine pairs. Computes a ``critical difference`` (CD) that depends on | |
| the number of engines and documents. | |
| measures: >- | |
| Pairs of engines whose mean ranks differ significantly. | |
| usage: >- | |
| Basis of the Critical Difference Diagram (Demšar 2006). | |
| limits: >- | |
| Conservative by construction (corrects for multiple comparisons); | |
| may miss real but subtle differences. | |
| reference: >- | |
| Nemenyi, P. (1963). "Distribution-free Multiple Comparisons". | |
| cdd: | |
| title: "Critical Difference Diagram" | |
| definition: >- | |
| Graphical rendering of Friedman-Nemenyi results: engines placed on a | |
| horizontal axis (mean rank), connected by a bar if they are not | |
| statistically distinguishable at the α level. | |
| measures: >- | |
| Global ordering of engines and indistinguishability groups. | |
| usage: >- | |
| De facto standard in ML since Demšar 2006 for comparing multiple | |
| systems over multiple datasets. | |
| limits: >- | |
| Can be hard to read when several groups partially overlap. | |
| reference: >- | |
| Demšar, J. (2006). "Statistical Comparisons of Classifiers over | |
| Multiple Data Sets". JMLR 7:1-30. | |
| pareto_front: | |
| title: "Pareto front" | |
| definition: >- | |
| Set of engines for which no other engine offers both better quality | |
| and lower cost (or does better on both objectives of any other pair). | |
| measures: >- | |
| "Non-dominated" trade-offs — choosing outside the Pareto front is | |
| always suboptimal, but choosing on the front depends on each | |
| institution's priorities. | |
| usage: >- | |
| Core of the report's quality/cost view. Also applicable to | |
| quality/speed or quality/carbon. | |
| limits: >- | |
| Costs used are indicative (see ``pricing.yaml``) and age quickly. | |
| Always cross-check with real invoices before purchase decisions. | |
| reference: >- | |
| Pareto, V. (1906). "Manuale di economia politica". | |
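| # Illustrative worked example (hypothetical engines and cost units): | |
| # A (CER 0.03, cost 10), B (CER 0.05, cost 2), C (CER 0.06, cost 4). | |
| # C is dominated by B (higher CER and higher cost), so the front is {A, B}; | |
| # choosing between A and B is a matter of priorities. | |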
| difficulty_score: | |
| title: "Intrinsic difficulty score" | |
| definition: >- | |
| Score in [0, 1] combining inter-engine CER variance, image quality, | |
| and density of heritage-specific characters. | |
| measures: >- | |
| How intrinsically difficult a document is, independently of the | |
| evaluation instrument. | |
| usage: >- | |
| Allows stratifying the report (easy vs hard documents) and | |
| interpreting a global CER while accounting for corpus specifics. | |
| limits: >- | |
| Default weights (0.4, 0.35, 0.25) are heuristic and can be adjusted | |
| to the context. | |
| reference: >- | |
| Stutzmann, D. (2017). "Clustering of medieval scripts through | |
| computer image analysis". | |
| normalization_profile: | |
| title: "Normalization profile" | |
| definition: >- | |
| Set of transformation rules applied to GT and hypothesis before CER | |
| computation: ſ=s merge, u=v, i=j, abbreviation expansion, character | |
| exclusion, etc. | |
| measures: >- | |
| The choice of an editorial convention for the CER computation — does | |
| not affect source data. | |
| usage: >- | |
| Picarones ships 9 preconfigured profiles (medieval_french, | |
| early_modern_english, medieval_latin…). Additional profiles can be | |
| loaded from YAML. | |
| limits: >- | |
| Too aggressive → masks real errors; too strict → overestimates error. | |
| reference: >- | |
| See ``picarones/core/normalization.py`` for the profile list. | |
| structure: | |
| title: "Structural scores" | |
| definition: >- | |
| Set of structure-level measures: line fusion rate, fragmentation rate, | |
| reading order (LCS), paragraph preservation. | |
| measures: >- | |
| The integrity of reconstructed layout, beyond character-level text. | |
| usage: >- | |
| Crucial for multi-column documents (newspapers, glossed Bibles) where | |
| a low CER can hide a broken reading order. | |
| limits: >- | |
| Depends on structure annotations in the GT — not always available. | |
| reference: >- | |
| Antonacopoulos, A. et al. (2015). "ICDAR 2015 Competition on Text | |
| Line Detection in Historical Documents". | |
| image_quality: | |
| title: "Image quality" | |
| definition: >- | |
| Composite [0, 1] score combining sharpness (Laplacian variance), | |
| noise level, contrast, and rotation angle estimate. | |
| measures: >- | |
| The physical characteristics of the source image that may degrade | |
| recognition. | |
| usage: >- | |
| Used to stratify results (good vs bad images) and to identify | |
| documents that would benefit from rescanning. | |
| limits: >- | |
| Pure image-level score; does not capture paleographic difficulties | |
| (cursive scripts, dense abbreviations). | |
| reference: >- | |
| Antonacopoulos, A., Bridson, D., Papadopoulos, C., Pletschacher, S. | |
| (2009). "A Realistic Dataset for Performance Evaluation of Document | |
| Layout Analysis". | |