Claude
Sprint 6 du plan rapport — glossaire contextuel + panneau personnalisation
76e79a0 unverified
Raw
History Blame
16.1 kB
# Contextual glossary — English.
# Mirror of fr.yaml. Every term listed in fr.yaml must have an entry here.
cer:
title: "CER — Character Error Rate"
definition: >-
Character-level error rate, computed as the ratio of the Levenshtein
edit distance (substitutions + insertions + deletions) to the length
of the reference string. Expressed as a percentage.
measures: >-
Character-by-character fidelity between the predicted transcript and
the ground truth, without normalization.
usage: >-
The most common metric in OCR/HTR evaluation, used by ICDAR
competitions since the early 2000s.
limits: >-
Insensitive to graphic variants (ſ vs s, u vs v) which may be
preserved in a heritage corpus ground truth — see diplomatic CER.
reference: >-
Kay, M. (2007). "Optical Character Recognition". Handbook of Natural
Language Processing, 2nd ed.
cer_nfc:
title: "NFC CER"
definition: >-
CER computed after Unicode NFC (Canonical Decomposition, followed by
Canonical Composition) normalization of both reference and hypothesis.
measures: >-
Text fidelity while ignoring Unicode representation differences that
are semantically equivalent (e.g. precomposed é vs decomposed é).
usage: >-
Essential when ground truth and OCR output use different but
equivalent Unicode forms.
limits: >-
Does not solve semantically significant graphic variants (ſ, ligatures
that cannot be decomposed).
reference: >-
Unicode Technical Report #15 — Unicode Normalization Forms.
cer_caseless:
title: "Case-insensitive CER"
definition: >-
CER computed after lowercase folding (``casefold``) of reference and
hypothesis.
measures: >-
Text fidelity ignoring uppercase/lowercase differences.
usage: >-
Useful for corpora where case is not deemed significant (many early
printed books, inconsistent capitalization).
limits: >-
Masks editorial choices regarding proper nouns and sentence openings.
reference: >-
Ibid. — CER.
cer_diplomatic:
title: "Diplomatic CER"
definition: >-
CER computed after diplomatic normalization of a heritage corpus:
merging ``ſ=s``, ``u=v``, ``i=j``, expanding abbreviations, etc.
measures: >-
Substantial errors, while ignoring graphic variants codified by
editorial conventions (diplomatic vs normalized).
usage: >-
Often used when evaluating OCR/HTR on pre-19th-century corpora where
the ground truth preserves old spellings irrelevant to searchability.
limits: >-
Masks editorial choices that are relevant to strict philology. The
applied profile depends on conventions (MUFI, Capitains…) that vary
between communities.
reference: >-
Pierazzo, E. (2015). "Digital Scholarly Editing". Ashgate.
wer:
title: "WER — Word Error Rate"
definition: >-
Word-level error rate, computed as the word-to-word Levenshtein
distance divided by the number of words in the reference.
measures: >-
Word-to-word fidelity, sensitive to segmentation (a misplaced space
counts as two errors).
usage: >-
Historical standard in speech recognition, adopted in OCR/HTR to
assess full-text search usability.
limits: >-
Very sensitive to segmentation. A 5 % CER can translate to a 20 %
WER if errors touch different words each time.
reference: >-
Morris, A. C., Maier, V., & Green, P. (2004). "From WER and RIL to
MER and WIL". ICSLP.
mer:
title: "MER — Match Error Rate"
definition: >-
WER variant that caps error at 1 by accounting for insertions (WER
can exceed 1, MER cannot).
measures: >-
A more stable version of WER, bounded in [0, 1].
usage: >-
Proposed by Morris et al. (2004) to correct the asymmetry of WER in
the presence of excessive insertions.
limits: >-
Less widespread than WER — historical comparative tables often use
WER, not MER.
reference: >-
Morris, A. C., Maier, V., & Green, P. (2004). Ibid.
wil:
title: "WIL — Word Information Lost"
definition: >-
Measures word-level information loss; accounts for both correctly
recognized content and noise introduced by the system.
measures: >-
The amount of semantic information lost at the word level.
usage: >-
Useful alongside WER to diagnose noisy hypotheses (many unrelated
insertions).
limits: >-
Less intuitive than a simple error rate.
reference: >-
Morris, A. C., Maier, V., & Green, P. (2004). Ibid.
ligature_score:
title: "Ligature score"
definition: >-
Proportion of ligatures (``fi``, ``fl``, ``œ``, ``æ``, ``ꝑ``, ``ꝓ``…)
correctly rendered by the engine.
measures: >-
The engine's ability to recognize the fused characters typical of
early printed books and medieval manuscripts.
usage: >-
Strong indicator for critical editions and philology.
limits: >-
Depends on Picarones' ligature table — some rare ligatures may be
absent.
reference: >-
MUFI — Medieval Unicode Font Initiative, Recommendations v4.
diacritic_score:
title: "Diacritic score"
definition: >-
Rate of diacritic preservation (acute, grave, tilde, cedilla, dieresis…)
between ground truth and OCR output.
measures: >-
Diacritic fidelity, measured after NFD decomposition.
usage: >-
Important for multilingual corpora and philological transcriptions
where diacritics are meaningful.
limits: >-
An engine may place a diacritic on the wrong letter — this metric
alone will not detect it.
reference: >-
Unicode Technical Report #15.
taxonomy:
title: "Error taxonomy (9 classes)"
definition: >-
Systematic classification of each error into 9 categories: visual
confusion, diacritic error, case error, ligature error, abbreviation,
hapax, segmentation, OOV character, lacuna.
measures: >-
An engine's error profile — reveals its specific weaknesses.
usage: >-
Fine-grained diagnosis for a given engine, useful for deciding whether
to switch models or tune an LLM post-correction prompt.
limits: >-
The ``difflib``-based classification is heuristic; a character can fall
into several classes simultaneously.
reference: >-
Clausner, C., Antonacopoulos, A., Pletschacher, S. (2020). "ICDAR 2019
Competition on Recognition of Historical Arabic Scientific Manuscripts".
confusion_matrix:
title: "Unicode confusion matrix"
definition: >-
Cross-table listing substitutions (GT char → OCR char) and their
frequencies across the corpus.
measures: >-
Character-to-character substitution patterns, readable symmetrically
(which GT character was confused with what?).
usage: >-
Compare two engines' "genetic fingerprints": if they confuse the same
characters, they were likely trained on similar data.
limits: >-
Does not capture segmentation errors (spaces) nor unmatched insertions.
reference: >-
Pletschacher, S., Clausner, C., Antonacopoulos, A. (2015). "Performance
Analysis Framework for Layout Analysis Methods".
gini:
title: "Gini coefficient of errors"
definition: >-
Measures error concentration across a document (between 0 = uniformly
distributed errors and 1 = all errors on a single line).
measures: >-
The unequal distribution of errors within a document — a high Gini
signals that a small fraction of lines concentrates most errors.
usage: >-
Identifies hard regions (marginal notes, damaged areas) that would
benefit from targeted correction.
limits: >-
Sensitive to the number of lines — not very informative on very short
documents.
reference: >-
Gini, C. (1912). "Variabilità e mutabilità".
hallucination_score:
title: "Hallucination score (LLM/VLM)"
definition: >-
Composite indicator combining trigram anchoring (fraction of
hypothesis trigrams present in GT) and length ratio
(hypothesis/GT) to detect hallucinations in LLM and VLM pipelines.
measures: >-
How likely the model invented text instead of reading the image.
usage: >-
Essential for OCR+LLM pipelines and zero-shot VLMs, where CER alone
is misleading (low CER can hide a hallucinated paraphrase).
limits: >-
A faithful but rephrased output may be falsely flagged.
reference: >-
Wiland, A. et al. (2024). "Hallucination Detection for Visual Language
Models on Historical Documents". DHd.
anchor_score:
title: "Trigram anchor score"
definition: >-
Fraction of word-level trigrams from the OCR hypothesis that also
exist in the ground truth.
measures: >-
How "anchored" the output is in the source text. High score = faithful
transcription; low score = probable hallucinations.
usage: >-
Complements CER for LLM/VLM pipelines.
limits: >-
On very short outputs, the score can be noisy (few trigrams available).
reference: >-
Wiland, A. et al. (2024). Ibid.
length_ratio:
title: "Length ratio"
definition: >-
Ratio of hypothesis character length to GT character length. Ratios
> 1.2 or < 0.8 are warning signals.
measures: >-
Excess or deficit of text produced by the engine.
usage: >-
Used with anchor score to flag hallucinations (verbose LLMs) or
omissions (LLMs skipping hard passages).
limits: >-
Highly dependent on the GT style (abbreviated vs expanded).
reference: >-
Wiland, A. et al. (2024). Ibid.
bootstrap_ci:
title: "Bootstrap confidence interval"
definition: >-
95 % confidence interval of the mean CER, computed by resampling
documents with replacement (1000 iterations by default).
measures: >-
Uncertainty associated with the mean CER — the wider the interval,
the less reliable an ordinal ranking becomes.
usage: >-
Essential context for any mean CER; especially important on small
corpora (< 30 documents).
limits: >-
Assumes document independence — not strictly true for series (same
scribe, same manuscript).
reference: >-
Efron, B. (1979). "Bootstrap Methods: Another Look at the Jackknife".
Annals of Statistics.
wilcoxon:
title: "Wilcoxon signed-rank test"
definition: >-
Non-parametric test of equality between two series of paired
measurements (same documents, two different engines).
measures: >-
Statistical significance of an observed gap between two engines,
without assuming normality of distributions.
usage: >-
Pairwise comparison of two engines on a corpus.
limits: >-
When applied repeatedly across all pairs of k engines, the Type-I
error risk grows — prefer Friedman-Nemenyi to compare more than two
engines.
reference: >-
Wilcoxon, F. (1945). "Individual Comparisons by Ranking Methods".
Biometrics Bulletin.
friedman:
title: "Friedman test"
definition: >-
Non-parametric equivalent of repeated-measures ANOVA: tests whether
at least one engine among k differs from the others on n documents.
measures: >-
A global difference between k engines across n blocks (documents).
usage: >-
Prelude to the Nemenyi post-hoc. Recommended whenever more than two
engines are compared, to control the multi-comparison risk.
limits: >-
Does not identify which pairs differ — the post-hoc is required.
reference: >-
Friedman, M. (1937). "The Use of Ranks to Avoid the Assumption of
Normality Implicit in the Analysis of Variance".
nemenyi:
title: "Nemenyi post-hoc"
definition: >-
Post-hoc test applied after a Friedman test to identify distinguishable
engine pairs. Computes a ``critical distance`` (CD) depending on the
number of engines and documents.
measures: >-
Pairs of engines whose mean ranks differ significantly.
usage: >-
Basis of the Critical Difference Diagram (Demšar 2006).
limits: >-
Conservative by construction (corrects for multiple comparisons);
may miss real but subtle differences.
reference: >-
Nemenyi, P. (1963). "Distribution-free Multiple Comparisons".
cdd:
title: "Critical Difference Diagram"
definition: >-
Graphical rendering of Friedman-Nemenyi results: engines placed on a
horizontal axis (mean rank), connected by a bar if they are not
statistically distinguishable at the α level.
measures: >-
Global ordering of engines and indistinguishability groups.
usage: >-
De facto standard in ML since Demšar 2006 for comparing multiple
systems over multiple datasets.
limits: >-
Can be hard to read when several groups partially overlap.
reference: >-
Demšar, J. (2006). "Statistical Comparisons of Classifiers over
Multiple Data Sets". JMLR 7:1-30.
pareto_front:
title: "Pareto front"
definition: >-
Set of engines for which no other offers simultaneously better quality
AND better cost (or any other objective pair).
measures: >-
"Non-dominated" trade-offs — choosing outside the Pareto front is
always suboptimal, but choosing on the front depends on each
institution's priorities.
usage: >-
Core of the report's quality/cost view. Also applicable to
quality/speed or quality/carbon.
limits: >-
Costs used are indicative (see ``pricing.yaml``) and age quickly.
Always cross-check with real invoices before purchase decisions.
reference: >-
Pareto, V. (1906). "Manuale di economia politica".
difficulty_score:
title: "Intrinsic difficulty score"
definition: >-
Score in [0, 1] combining inter-engine CER variance, image quality,
and density of heritage-specific characters.
measures: >-
How intrinsically difficult a document is, independently of the
evaluation instrument.
usage: >-
Allows stratifying the report (easy vs hard documents) and
interpreting a global CER while accounting for corpus specifics.
limits: >-
Default weights (0.4, 0.35, 0.25) are heuristic and can be adjusted
to the context.
reference: >-
Stutzmann, D. (2017). "Clustering of medieval scripts through
computer image analysis".
normalization_profile:
title: "Normalization profile"
definition: >-
Set of transformation rules applied to GT and hypothesis before CER
computation: ſ=s merge, u=v, i=j, abbreviation expansion, character
exclusion, etc.
measures: >-
The choice of an editorial convention for the CER computation — does
not affect source data.
usage: >-
Picarones ships 9 preconfigured profiles (medieval_french,
early_modern_english, medieval_latin…). Additional profiles can be
loaded from YAML.
limits: >-
Too aggressive → masks real errors; too strict → overestimates error.
reference: >-
See ``picarones/core/normalization.py`` for the profile list.
structure:
title: "Structural scores"
definition: >-
Set of structure-level measures: line fusion rate, fragmentation rate,
reading-order (LCS), paragraph preservation.
measures: >-
The integrity of reconstructed layout, beyond character-level text.
usage: >-
Crucial for multi-column documents (newspapers, glossed Bibles) where
a low CER can hide a broken reading order.
limits: >-
Depends on structure annotations in the GT — not always available.
reference: >-
Antonacopoulos, A. et al. (2015). "ICDAR 2015 Competition on Text
Line Detection in Historical Documents".
image_quality:
title: "Image quality"
definition: >-
Composite [0, 1] score combining sharpness (Laplacian variance),
noise level, contrast, and rotation angle estimate.
measures: >-
The physical characteristics of the source image that may degrade
recognition.
usage: >-
Used to stratify results (good vs bad images) and to identify
documents that would benefit from rescanning.
limits: >-
Pure image-level score; does not capture paleographic difficulties
(cursive scripts, dense abbreviations).
reference: >-
Antonacopoulos, A., Bridson, D., Papadopoulos, C., Pletschacher, S.
(2009). "A Realistic Dataset for Performance Evaluation of Document
Layout Analysis".