# Contextual glossary — English.
# Mirror of fr.yaml. Every term listed in fr.yaml must have an entry here.

cer:
  title: "CER — Character Error Rate"
  definition: >-
    Character-level error rate, computed as the ratio of the Levenshtein
    edit distance (substitutions + insertions + deletions) to the length
    of the reference string. Expressed as a percentage.
  measures: >-
    Character-by-character fidelity between the predicted transcript and
    the ground truth, without normalization.
  usage: >-
    The most common metric in OCR/HTR evaluation, used by ICDAR
    competitions since the early 2000s.
  limits: >-
    Insensitive to graphic variants (ſ vs s, u vs v) which may be
    preserved in a heritage corpus ground truth — see diplomatic CER.
  reference: >-
    Kay, M. (2007). "Optical Character Recognition". Handbook of Natural
    Language Processing, 2nd ed.

cer_nfc:
  title: "NFC CER"
  definition: >-
    CER computed after Unicode NFC (Canonical Decomposition, followed by
    Canonical Composition) normalization of both reference and hypothesis.
  measures: >-
    Text fidelity while ignoring Unicode representation differences that
    are semantically equivalent (e.g. precomposed é vs decomposed é).
  usage: >-
    Essential when ground truth and OCR output use different but
    equivalent Unicode forms.
  limits: >-
    Does not solve semantically significant graphic variants (ſ, ligatures
    that cannot be decomposed).
  reference: >-
    Unicode Technical Report #15 — Unicode Normalization Forms.

cer_caseless:
  title: "Case-insensitive CER"
  definition: >-
    CER computed after lowercase folding (``casefold``) of reference and
    hypothesis.
  measures: >-
    Text fidelity ignoring uppercase/lowercase differences.
  usage: >-
    Useful for corpora where case is not deemed significant (many early
    printed books, inconsistent capitalization).
  limits: >-
    Masks editorial choices regarding proper nouns and sentence openings.
  reference: >-
    Ibid. — CER.

cer_diplomatic:
  title: "Diplomatic CER"
  definition: >-
    CER computed after diplomatic normalization of a heritage corpus:
    merging ``ſ=s``, ``u=v``, ``i=j``, expanding abbreviations, etc.
  measures: >-
    Substantial errors, while ignoring graphic variants codified by
    editorial conventions (diplomatic vs normalized).
  usage: >-
    Often used when evaluating OCR/HTR on pre-19th-century corpora where
    the ground truth preserves old spellings irrelevant to searchability.
  limits: >-
    Masks editorial choices that are relevant to strict philology. The
    applied profile depends on conventions (MUFI, Capitains…) that vary
    between communities.
  reference: >-
    Pierazzo, E. (2015). "Digital Scholarly Editing". Ashgate.

wer:
  title: "WER — Word Error Rate"
  definition: >-
    Word-level error rate, computed as the word-to-word Levenshtein
    distance divided by the number of words in the reference.
  measures: >-
    Word-to-word fidelity, sensitive to segmentation (a misplaced space
    counts as two errors).
  usage: >-
    Historical standard in speech recognition, adopted in OCR/HTR to
    assess full-text search usability.
  limits: >-
    Very sensitive to segmentation. A 5 % CER can translate to a 20 %
    WER if errors touch different words each time.
  reference: >-
    Morris, A. C., Maier, V., & Green, P. (2004). "From WER and RIL to
    MER and WIL". ICSLP.

mer:
  title: "MER — Match Error Rate"
  definition: >-
    WER variant that caps error at 1 by accounting for insertions (WER
    can exceed 1, MER cannot).
  measures: >-
    A more stable version of WER, bounded in [0, 1].
  usage: >-
    Proposed by Morris et al. (2004) to correct the asymmetry of WER in
    the presence of excessive insertions.
  limits: >-
    Less widespread than WER — historical comparative tables often use
    WER, not MER.
  reference: >-
    Morris, A. C., Maier, V., & Green, P. (2004). Ibid.

wil:
  title: "WIL — Word Information Lost"
  definition: >-
    Measures word-level information loss; accounts for both correctly
    recognized content and noise introduced by the system.
  measures: >-
    The amount of semantic information lost at the word level.
  usage: >-
    Useful alongside WER to diagnose noisy hypotheses (many unrelated
    insertions).
  limits: >-
    Less intuitive than a simple error rate.
  reference: >-
    Morris, A. C., Maier, V., & Green, P. (2004). Ibid.

ligature_score:
  title: "Ligature score"
  definition: >-
    Proportion of ligatures (``fi``, ``fl``, ``œ``, ``æ``, ``ꝑ``, ``ꝓ``…)
    correctly rendered by the engine.
  measures: >-
    The engine's ability to recognize the fused characters typical of
    early printed books and medieval manuscripts.
  usage: >-
    Strong indicator for critical editions and philology.
  limits: >-
    Depends on Picarones' ligature table — some rare ligatures may be
    absent.
  reference: >-
    MUFI — Medieval Unicode Font Initiative, Recommendations v4.

diacritic_score:
  title: "Diacritic score"
  definition: >-
    Rate of diacritic preservation (acute, grave, tilde, cedilla, dieresis…)
    between ground truth and OCR output.
  measures: >-
    Diacritic fidelity, measured after NFD decomposition.
  usage: >-
    Important for multilingual corpora and philological transcriptions
    where diacritics are meaningful.
  limits: >-
    An engine may place a diacritic on the wrong letter — this metric
    alone will not detect it.
  reference: >-
    Unicode Technical Report #15.

taxonomy:
  title: "Error taxonomy (9 classes)"
  definition: >-
    Systematic classification of each error into 9 categories: visual
    confusion, diacritic error, case error, ligature error, abbreviation,
    hapax, segmentation, OOV character, lacuna.
  measures: >-
    An engine's error profile — reveals its specific weaknesses.
  usage: >-
    Fine-grained diagnosis for a given engine, useful for deciding whether
    to switch models or tune an LLM post-correction prompt.
  limits: >-
    The ``difflib``-based classification is heuristic; a character can fall
    into several classes simultaneously.
  reference: >-
    Clausner, C., Antonacopoulos, A., Pletschacher, S. (2020). "ICDAR 2019
    Competition on Recognition of Historical Arabic Scientific Manuscripts".

confusion_matrix:
  title: "Unicode confusion matrix"
  definition: >-
    Cross-table listing substitutions (GT char → OCR char) and their
    frequencies across the corpus.
  measures: >-
    Character-to-character substitution patterns, readable symmetrically
    (which GT character was confused with what?).
  usage: >-
    Compare two engines' "genetic fingerprints": if they confuse the same
    characters, they were likely trained on similar data.
  limits: >-
    Does not capture segmentation errors (spaces) nor unmatched insertions.
  reference: >-
    Pletschacher, S., Clausner, C., Antonacopoulos, A. (2015). "Performance
    Analysis Framework for Layout Analysis Methods".

gini:
  title: "Gini coefficient of errors"
  definition: >-
    Measures error concentration across a document (between 0 = uniformly
    distributed errors and 1 = all errors on a single line).
  measures: >-
    The unequal distribution of errors within a document — a high Gini
    signals that a small fraction of lines concentrates most errors.
  usage: >-
    Identifies hard regions (marginal notes, damaged areas) that would
    benefit from targeted correction.
  limits: >-
    Sensitive to the number of lines — not very informative on very short
    documents.
  reference: >-
    Gini, C. (1912). "Variabilità e mutabilità".

hallucination_score:
  title: "Hallucination score (LLM/VLM)"
  definition: >-
    Composite indicator combining trigram anchoring (fraction of
    hypothesis trigrams present in GT) and length ratio
    (hypothesis/GT) to detect hallucinations in LLM and VLM pipelines.
  measures: >-
    How likely the model invented text instead of reading the image.
  usage: >-
    Essential for OCR+LLM pipelines and zero-shot VLMs, where CER alone
    is misleading (low CER can hide a hallucinated paraphrase).
  limits: >-
    A faithful but rephrased output may be falsely flagged.
  reference: >-
    Wiland, A. et al. (2024). "Hallucination Detection for Visual Language
    Models on Historical Documents". DHd.

anchor_score:
  title: "Trigram anchor score"
  definition: >-
    Fraction of word-level trigrams from the OCR hypothesis that also
    exist in the ground truth.
  measures: >-
    How "anchored" the output is in the source text. High score = faithful
    transcription; low score = probable hallucinations.
  usage: >-
    Complements CER for LLM/VLM pipelines.
  limits: >-
    On very short outputs, the score can be noisy (few trigrams available).
  reference: >-
    Wiland, A. et al. (2024). Ibid.

length_ratio:
  title: "Length ratio"
  definition: >-
    Ratio of hypothesis character length to GT character length. Ratios
    > 1.2 or < 0.8 are warning signals.
  measures: >-
    Excess or deficit of text produced by the engine.
  usage: >-
    Used with anchor score to flag hallucinations (verbose LLMs) or
    omissions (LLMs skipping hard passages).
  limits: >-
    Highly dependent on the GT style (abbreviated vs expanded).
  reference: >-
    Wiland, A. et al. (2024). Ibid.

bootstrap_ci:
  title: "Bootstrap confidence interval"
  definition: >-
    95 % confidence interval of the mean CER, computed by resampling
    documents with replacement (1000 iterations by default).
  measures: >-
    Uncertainty associated with the mean CER — the wider the interval,
    the less reliable an ordinal ranking becomes.
  usage: >-
    Essential context for any mean CER; especially important on small
    corpora (< 30 documents).
  limits: >-
    Assumes document independence — not strictly true for series (same
    scribe, same manuscript).
  reference: >-
    Efron, B. (1979). "Bootstrap Methods: Another Look at the Jackknife".
    Annals of Statistics.

wilcoxon:
  title: "Wilcoxon signed-rank test"
  definition: >-
    Non-parametric test of equality between two series of paired
    measurements (same documents, two different engines).
  measures: >-
    Statistical significance of an observed gap between two engines,
    without assuming normality of distributions.
  usage: >-
    Pairwise comparison of two engines on a corpus.
  limits: >-
    When applied repeatedly across all pairs of k engines, the Type-I
    error risk grows — prefer Friedman-Nemenyi to compare more than two
    engines.
  reference: >-
    Wilcoxon, F. (1945). "Individual Comparisons by Ranking Methods".
    Biometrics Bulletin.

friedman:
  title: "Friedman test"
  definition: >-
    Non-parametric equivalent of repeated-measures ANOVA: tests whether
    at least one engine among k differs from the others on n documents.
  measures: >-
    A global difference between k engines across n blocks (documents).
  usage: >-
    Prelude to the Nemenyi post-hoc. Recommended whenever more than two
    engines are compared, to control the multi-comparison risk.
  limits: >-
    Does not identify which pairs differ — the post-hoc is required.
  reference: >-
    Friedman, M. (1937). "The Use of Ranks to Avoid the Assumption of
    Normality Implicit in the Analysis of Variance".

nemenyi:
  title: "Nemenyi post-hoc"
  definition: >-
    Post-hoc test applied after a Friedman test to identify distinguishable
    engine pairs. Computes a ``critical distance`` (CD) depending on the
    number of engines and documents.
  measures: >-
    Pairs of engines whose mean ranks differ significantly.
  usage: >-
    Basis of the Critical Difference Diagram (Demšar 2006).
  limits: >-
    Conservative by construction (corrects for multiple comparisons);
    may miss real but subtle differences.
  reference: >-
    Nemenyi, P. (1963). "Distribution-free Multiple Comparisons".

cdd:
  title: "Critical Difference Diagram"
  definition: >-
    Graphical rendering of Friedman-Nemenyi results: engines placed on a
    horizontal axis (mean rank), connected by a bar if they are not
    statistically distinguishable at the α level.
  measures: >-
    Global ordering of engines and indistinguishability groups.
  usage: >-
    De facto standard in ML since Demšar 2006 for comparing multiple
    systems over multiple datasets.
  limits: >-
    Can be hard to read when several groups partially overlap.
  reference: >-
    Demšar, J. (2006). "Statistical Comparisons of Classifiers over
    Multiple Data Sets". JMLR 7:1-30.

pareto_front:
  title: "Pareto front"
  definition: >-
    Set of engines for which no other offers simultaneously better quality
    AND better cost (or any other objective pair).
  measures: >-
    "Non-dominated" trade-offs — choosing outside the Pareto front is
    always suboptimal, but choosing on the front depends on each
    institution's priorities.
  usage: >-
    Core of the report's quality/cost view. Also applicable to
    quality/speed or quality/carbon.
  limits: >-
    Costs used are indicative (see ``pricing.yaml``) and age quickly.
    Always cross-check with real invoices before purchase decisions.
  reference: >-
    Pareto, V. (1906). "Manuale di economia politica".

difficulty_score:
  title: "Intrinsic difficulty score"
  definition: >-
    Score in [0, 1] combining inter-engine CER variance, image quality,
    and density of heritage-specific characters.
  measures: >-
    How intrinsically difficult a document is, independently of the
    evaluation instrument.
  usage: >-
    Allows stratifying the report (easy vs hard documents) and
    interpreting a global CER while accounting for corpus specifics.
  limits: >-
    Default weights (0.4, 0.35, 0.25) are heuristic and can be adjusted
    to the context.
  reference: >-
    Stutzmann, D. (2017). "Clustering of medieval scripts through
    computer image analysis".

normalization_profile:
  title: "Normalization profile"
  definition: >-
    Set of transformation rules applied to GT and hypothesis before CER
    computation: ſ=s merge, u=v, i=j, abbreviation expansion, character
    exclusion, etc.
  measures: >-
    The choice of an editorial convention for the CER computation — does
    not affect source data.
  usage: >-
    Picarones ships 9 preconfigured profiles (medieval_french,
    early_modern_english, medieval_latin…). Additional profiles can be
    loaded from YAML.
  limits: >-
    Too aggressive → masks real errors; too strict → overestimates error.
  reference: >-
    See ``picarones/core/normalization.py`` for the profile list.

structure:
  title: "Structural scores"
  definition: >-
    Set of structure-level measures: line fusion rate, fragmentation rate,
    reading-order (LCS), paragraph preservation.
  measures: >-
    The integrity of reconstructed layout, beyond character-level text.
  usage: >-
    Crucial for multi-column documents (newspapers, glossed Bibles) where
    a low CER can hide a broken reading order.
  limits: >-
    Depends on structure annotations in the GT — not always available.
  reference: >-
    Antonacopoulos, A. et al. (2015). "ICDAR 2015 Competition on Text
    Line Detection in Historical Documents".

image_quality:
  title: "Image quality"
  definition: >-
    Composite [0, 1] score combining sharpness (Laplacian variance),
    noise level, contrast, and rotation angle estimate.
  measures: >-
    The physical characteristics of the source image that may degrade
    recognition.
  usage: >-
    Used to stratify results (good vs bad images) and to identify
    documents that would benefit from rescanning.
  limits: >-
    Pure image-level score; does not capture paleographic difficulties
    (cursive scripts, dense abbreviations).
  reference: >-
    Antonacopoulos, A., Bridson, D., Papadopoulos, C., Pletschacher, S.
    (2009). "A Realistic Dataset for Performance Evaluation of Document
    Layout Analysis".