	# Contextual glossary — English.
	# Mirror of fr.yaml. Every term listed in fr.yaml must have an entry here.

	cer:
	title: "CER — Character Error Rate"
	definition: >-
	Character-level error rate, computed as the ratio of the Levenshtein
	edit distance (substitutions + insertions + deletions) to the length
	of the reference string. Expressed as a percentage.
	measures: >-
	Character-by-character fidelity between the predicted transcript and
	the ground truth, without normalization.
	usage: >-
	The most common metric in OCR/HTR evaluation, used by ICDAR
	competitions since the early 2000s.
	limits: >-
	Does not distinguish genuine recognition errors from graphic variants
	(ſ vs s, u vs v) that a heritage corpus ground truth may preserve;
	see diplomatic CER.
	reference: >-
	Kay, M. (2007). "Optical Character Recognition". Handbook of Natural
	Language Processing, 2nd ed.
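
# Illustration of the definition above: a minimal sketch of CER as
# Levenshtein distance over reference length, in percent. The function
# names are hypothetical and are not taken from the Picarones code base.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Edit distance counting substitutions, insertions and deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """CER as a percentage of the reference length, without normalization."""
    if not ref:
        raise ValueError("reference must not be empty")
    return 100.0 * levenshtein(ref, hyp) / len(ref)


print(cer("maison", "maifon"))  # one substitution over six characters
```

# Because the divisor is the reference length, a hypothesis much longer
# than the reference can yield a CER above 100 %.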

	cer_nfc:
	title: "NFC CER"
	definition: >-
	CER computed after Unicode NFC (Canonical Decomposition, followed by
	Canonical Composition) normalization of both reference and hypothesis.
	measures: >-
	Text fidelity while ignoring Unicode representation differences that
	are semantically equivalent (e.g. precomposed é vs decomposed é).
	usage: >-
	Essential when ground truth and OCR output use different but
	equivalent Unicode forms.
	limits: >-
	Does not solve semantically significant graphic variants (ſ, ligatures
	that cannot be decomposed).
	reference: >-
	Unicode Standard Annex #15 (UAX #15): Unicode Normalization Forms.
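
# Illustration of the NFC step described above, using only the standard
# library. The helper name `nfc` is hypothetical; the sketch shows what
# NFC reconciles and what it leaves untouched.

```python
import unicodedata


def nfc(text: str) -> str:
    """Apply Unicode NFC normalization."""
    return unicodedata.normalize("NFC", text)


precomposed = "caf\u00e9"   # é stored as a single code point
decomposed = "cafe\u0301"   # e followed by a combining acute accent

# The raw strings differ, so a raw CER would count an error here.
print(precomposed == decomposed)
# After NFC both sides are identical, so NFC CER counts none.
print(nfc(precomposed) == nfc(decomposed))
# Long s and the fi ligature only have compatibility mappings: NFC keeps them.
print(nfc("\u017f") == "s", nfc("\ufb01") == "fi")
```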

	cer_caseless:
	title: "Case-insensitive CER"
	definition: >-
	CER computed after lowercase folding (``casefold``) of reference and
	hypothesis.
	measures: >-
	Text fidelity ignoring uppercase/lowercase differences.
	usage: >-
	Useful for corpora where case is not deemed significant (many early
	printed books, inconsistent capitalization).
	limits: >-
	Masks editorial choices regarding proper nouns and sentence openings.
	reference: >-
	Same reference as the CER entry.

	cer_diplomatic:
	title: "Diplomatic CER"
	definition: >-
	CER computed after diplomatic normalization of a heritage corpus:
	merging ``ſ=s``, ``u=v``, ``i=j``, expanding abbreviations, etc.
	measures: >-
	Substantial errors, while ignoring graphic variants codified by
	editorial conventions (diplomatic vs normalized).
	usage: >-
	Often used when evaluating OCR/HTR on pre-19th-century corpora where
	the ground truth preserves old spellings irrelevant to searchability.
	limits: >-
	Masks editorial choices that are relevant to strict philology. The
	applied profile depends on conventions (MUFI, Capitains…) that vary
	between communities.
	reference: >-
	Pierazzo, E. (2015). "Digital Scholarly Editing". Ashgate.
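
# Illustration of the character merges named in the definition above.
# This is a sketch under assumptions: the mapping direction (v to u,
# j to i) and the function name are hypothetical, and abbreviation
# expansion is deliberately left out.

```python
# Hypothetical merge table: each variant is folded onto one representative.
MERGES = str.maketrans({"ſ": "s", "v": "u", "j": "i"})


def diplomatic_normalize(text: str) -> str:
    """Fold graphic variants so that they no longer count as errors."""
    return text.translate(MERGES)


gt = "iuſtice"
ocr = "justice"
# Both sides are normalized before the CER is computed.
print(diplomatic_normalize(gt), diplomatic_normalize(ocr))
```

# After normalization both strings are equal, so the diplomatic CER for
# this pair is 0 even though the raw CER is not.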

	wer:
	title: "WER — Word Error Rate"
	definition: >-
	Word-level error rate, computed as the word-to-word Levenshtein
	distance divided by the number of words in the reference.
	measures: >-
	Word-to-word fidelity, sensitive to segmentation (a misplaced space
	counts as two errors).
	usage: >-
	Historical standard in speech recognition, adopted in OCR/HTR to
	assess full-text search usability.
	limits: >-
	Very sensitive to segmentation. A 5 % CER can translate to a 20 %
	WER if errors touch different words each time.
	reference: >-
	Morris, A. C., Maier, V., & Green, P. (2004). "From WER and RIL to
	MER and WIL". ICSLP.

	mer:
	title: "MER — Match Error Rate"
	definition: >-
	WER variant that divides the error count (substitutions + deletions +
	insertions) by the errors plus the correct matches, so it cannot
	exceed 1 (WER can, when insertions are numerous).
	measures: >-
	A more stable version of WER, bounded in [0, 1].
	usage: >-
	Proposed by Morris et al. (2004) to correct the asymmetry of WER in
	the presence of excessive insertions.
	limits: >-
	Less widespread than WER — historical comparative tables often use
	WER, not MER.
	reference: >-
	Morris, A. C., Maier, V., & Green, P. (2004). Ibid.

	wil:
	title: "WIL — Word Information Lost"
	definition: >-
	Measures word-level information loss, computed as 1 - H^2 / (N_ref *
	N_hyp) where H is the number of correctly recognized words; accounts
	for both missed content and noise introduced by the system.
	measures: >-
	The amount of semantic information lost at the word level.
	usage: >-
	Useful alongside WER to diagnose noisy hypotheses (many unrelated
	insertions).
	limits: >-
	Less intuitive than a simple error rate.
	reference: >-
	Morris, A. C., Maier, V., & Green, P. (2004). Ibid.
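
# Illustration of the three word-level metrics above (WER, MER, WIL),
# following the formulas of Morris et al. (2004). A minimal sketch:
# function names are hypothetical, tokenization is a plain whitespace
# split, and empty inputs are not handled.

```python
def align_counts(ref: list[str], hyp: list[str]) -> tuple[int, int, int, int]:
    """Return (hits, substitutions, deletions, insertions) of a minimal alignment."""
    # Each cell holds (cost, H, S, D, I) for the prefixes ref[:i] and hyp[:j].
    prev = [(j, 0, 0, 0, j) for j in range(len(hyp) + 1)]
    for i, r in enumerate(ref, start=1):
        curr = [(i, 0, 0, i, 0)]
        for j, h in enumerate(hyp, start=1):
            c, H, S, D, I = prev[j - 1]
            diag = (c, H + 1, S, D, I) if r == h else (c + 1, H, S + 1, D, I)
            c, H, S, D, I = prev[j]
            up = (c + 1, H, S, D + 1, I)      # reference word deleted
            c, H, S, D, I = curr[j - 1]
            left = (c + 1, H, S, D, I + 1)    # hypothesis word inserted
            curr.append(min(diag, up, left, key=lambda cell: cell[0]))
        prev = curr
    return prev[-1][1:]


def word_metrics(reference: str, hypothesis: str) -> dict[str, float]:
    ref, hyp = reference.split(), hypothesis.split()
    H, S, D, I = align_counts(ref, hyp)
    errors = S + D + I
    return {
        "wer": errors / len(ref),                       # can exceed 1
        "mer": errors / (H + errors),                   # bounded in [0, 1]
        "wil": 1 - (H * H) / (len(ref) * len(hyp)),     # bounded in [0, 1]
    }


print(word_metrics("the cat sat", "the cat sat on the mat today"))
```

# With four inserted words against a three-word reference, WER exceeds 1
# while MER and WIL stay below it.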

	ligature_score:
	title: "Ligature score"
	definition: >-
	Proportion of ligatures (``fi``, ``fl``, ``œ``, ``æ``, ``ꝑ``, ``ꝓ``…)
	correctly rendered by the engine.
	measures: >-
	The engine's ability to recognize the fused characters typical of
	early printed books and medieval manuscripts.
	usage: >-
	Strong indicator for critical editions and philology.
	limits: >-
	Depends on Picarones' ligature table — some rare ligatures may be
	absent.
	reference: >-
	MUFI — Medieval Unicode Font Initiative, Recommendations v4.

	diacritic_score:
	title: "Diacritic score"
	definition: >-
	Rate of diacritic preservation (acute, grave, tilde, cedilla, dieresis…)
	between ground truth and OCR output.
	measures: >-
	Diacritic fidelity, measured after NFD decomposition.
	usage: >-
	Important for multilingual corpora and philological transcriptions
	where diacritics are meaningful.
	limits: >-
	An engine may place a diacritic on the wrong letter — this metric
	alone will not detect it.
	reference: >-
	Unicode Standard Annex #15 (UAX #15), Unicode Normalization Forms.

	taxonomy:
	title: "Error taxonomy (9 classes)"
	definition: >-
	Systematic classification of each error into 9 categories: visual
	confusion, diacritic error, case error, ligature error, abbreviation,
	hapax, segmentation, OOV character, lacuna.
	measures: >-
	An engine's error profile — reveals its specific weaknesses.
	usage: >-
	Fine-grained diagnosis for a given engine, useful for deciding whether
	to switch models or tune an LLM post-correction prompt.
	limits: >-
	The ``difflib``-based classification is heuristic; a character can fall
	into several classes simultaneously.
	reference: >-
	Clausner, C., Antonacopoulos, A., Pletschacher, S. (2020). "ICDAR 2019
	Competition on Recognition of Historical Arabic Scientific Manuscripts".

	confusion_matrix:
	title: "Unicode confusion matrix"
	definition: >-
	Cross-table listing substitutions (GT char → OCR char) and their
	frequencies across the corpus.
	measures: >-
	Character-to-character substitution patterns, readable in both
	directions (which GT character was confused with what, and which OCR
	character stands in for which GT characters).
	usage: >-
	Compare two engines' "genetic fingerprints": if they confuse the same
	characters, they were likely trained on similar data.
	limits: >-
	Does not capture segmentation errors (spaces) nor unmatched insertions.
	reference: >-
	Pletschacher, S., Clausner, C., Antonacopoulos, A. (2015). "Performance
	Analysis Framework for Layout Analysis Methods".
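
# Illustration of how substitution pairs (GT char, OCR char) could be
# counted. A sketch under assumptions: it aligns with the standard
# library's `difflib`, and the function name is hypothetical. It keeps
# only equal-length replacements, which matches the limit stated above
# (insertions and deletions are not captured).

```python
from collections import Counter
from difflib import SequenceMatcher


def confusion_counts(gt: str, ocr: str) -> Counter:
    """Count (GT char, OCR char) substitutions for one document."""
    counts: Counter = Counter()
    matcher = SequenceMatcher(None, gt, ocr, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Only equal-length replacements map one GT char to one OCR char.
        if tag == "replace" and (i2 - i1) == (j2 - j1):
            counts.update(zip(gt[i1:i2], ocr[j1:j2]))
    return counts


print(confusion_counts("maison", "maifon"))
```

# Summing these counters over all documents gives the corpus-level table.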

	gini:
	title: "Gini coefficient of errors"
	definition: >-
	Measures error concentration across a document (between 0 = uniformly
	distributed errors and 1 = all errors on a single line).
	measures: >-
	The unequal distribution of errors within a document — a high Gini
	signals that a small fraction of lines concentrates most errors.
	usage: >-
	Identifies hard regions (marginal notes, damaged areas) that would
	benefit from targeted correction.
	limits: >-
	Sensitive to the number of lines — not very informative on very short
	documents.
	reference: >-
	Gini, C. (1912). "Variabilità e mutabilità".
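
# Illustration of the Gini coefficient applied to per-line error counts.
# A minimal sketch using the standard formula on sorted values; the
# function name is hypothetical, and whether Picarones applies a
# small-sample correction is not stated here.

```python
def gini(errors_per_line: list[int]) -> float:
    """Gini coefficient of a list of non-negative error counts."""
    n = len(errors_per_line)
    total = sum(errors_per_line)
    if n == 0 or total == 0:
        return 0.0  # no lines or no errors: nothing is concentrated
    ordered = sorted(errors_per_line)
    # On sorted values x_1 <= ... <= x_n:
    # G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n
    weighted = sum(i * x for i, x in enumerate(ordered, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


print(gini([2, 2, 2, 2]))  # errors spread evenly
print(gini([0, 0, 0, 8]))  # all errors on one line
```

# With this uncorrected formula the maximum is (n - 1) / n, so it only
# approaches 1 as the number of lines grows, which is one reason the
# value is hard to interpret on very short documents.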

	hallucination_score:
	title: "Hallucination score (LLM/VLM)"
	definition: >-
	Composite indicator combining trigram anchoring (fraction of
	hypothesis trigrams present in GT) and length ratio
	(hypothesis/GT) to detect hallucinations in LLM and VLM pipelines.
	measures: >-
	How likely the model invented text instead of reading the image.
	usage: >-
	Essential for OCR+LLM pipelines and zero-shot VLMs, where CER alone
	is misleading (low CER can hide a hallucinated paraphrase).
	limits: >-
	A faithful but rephrased output may be falsely flagged.
	reference: >-
	Wiland, A. et al. (2024). "Hallucination Detection for Visual Language
	Models on Historical Documents". DHd.

	anchor_score:
	title: "Trigram anchor score"
	definition: >-
	Fraction of word-level trigrams from the OCR hypothesis that also
	exist in the ground truth.
	measures: >-
	How "anchored" the output is in the source text. High score = faithful
	transcription; low score = probable hallucinations.
	usage: >-
	Complements CER for LLM/VLM pipelines.
	limits: >-
	On very short outputs, the score can be noisy (few trigrams available).
	reference: >-
	Wiland, A. et al. (2024). Ibid.

	length_ratio:
	title: "Length ratio"
	definition: >-
	Ratio of hypothesis character length to GT character length. Ratios
	> 1.2 or < 0.8 are warning signals.
	measures: >-
	Excess or deficit of text produced by the engine.
	usage: >-
	Used with anchor score to flag hallucinations (verbose LLMs) or
	omissions (LLMs skipping hard passages).
	limits: >-
	Highly dependent on the GT style (abbreviated vs expanded).
	reference: >-
	Wiland, A. et al. (2024). Ibid.
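
# Illustration of the trigram anchor score and the length ratio defined
# above. A sketch under assumptions: function names are hypothetical,
# words are split on whitespace, and the `min_anchor` threshold is an
# invented placeholder. Only the 0.8 and 1.2 ratio bounds come from the
# length ratio entry; how the composite hallucination score weighs the
# two signals is not reproduced here.

```python
def word_trigrams(text: str) -> list[tuple[str, ...]]:
    words = text.split()
    return [tuple(words[i:i + 3]) for i in range(len(words) - 2)]


def anchor_score(gt: str, hyp: str) -> float:
    """Fraction of hypothesis word trigrams that also occur in the GT."""
    hyp_trigrams = word_trigrams(hyp)
    if not hyp_trigrams:
        return 1.0  # assumption: too short to judge, treated as anchored
    gt_trigrams = set(word_trigrams(gt))
    return sum(t in gt_trigrams for t in hyp_trigrams) / len(hyp_trigrams)


def length_ratio(gt: str, hyp: str) -> float:
    """Hypothesis character length divided by GT character length."""
    return len(hyp) / len(gt)


def is_flagged(gt: str, hyp: str, min_anchor: float = 0.5) -> bool:
    """Warn when the ratio leaves [0.8, 1.2] or anchoring is weak."""
    ratio = length_ratio(gt, hyp)
    return ratio > 1.2 or ratio < 0.8 or anchor_score(gt, hyp) < min_anchor


gt = "in the year of our lord"
hyp = "in the year of our lord and saviour amen"
print(anchor_score(gt, hyp), length_ratio(gt, hyp), is_flagged(gt, hyp))
```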

	bootstrap_ci:
	title: "Bootstrap confidence interval"
	definition: >-
	95 % confidence interval of the mean CER, computed by resampling
	documents with replacement (1000 iterations by default).
	measures: >-
	Uncertainty associated with the mean CER — the wider the interval,
	the less reliable an ordinal ranking becomes.
	usage: >-
	Essential context for any mean CER; especially important on small
	corpora (< 30 documents).
	limits: >-
	Assumes document independence — not strictly true for series (same
	scribe, same manuscript).
	reference: >-
	Efron, B. (1979). "Bootstrap Methods: Another Look at the Jackknife".
	Annals of Statistics.
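
# Illustration of the resampling described above: a percentile bootstrap
# of the mean CER over documents. A minimal sketch; the function name,
# the `seed` parameter and the choice of the percentile method are
# assumptions. Only the 95 % level and the 1000 iterations come from the
# definition.

```python
import random
import statistics


def bootstrap_ci(cers: list[float], iterations: int = 1000,
                 level: float = 0.95, seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap interval for the mean of per-document CERs."""
    rng = random.Random(seed)
    n = len(cers)
    # Resample n documents with replacement and record each resample's mean.
    means = sorted(
        statistics.fmean(rng.choices(cers, k=n)) for _ in range(iterations)
    )
    lo_idx = max(0, int((1 - level) / 2 * iterations))
    hi_idx = min(iterations - 1, int((1 + level) / 2 * iterations) - 1)
    return means[lo_idx], means[hi_idx]


low, high = bootstrap_ci([1.0, 2.0, 3.0, 4.0, 10.0])
print(low, high)
```

# Resampling whole documents is what encodes the independence assumption
# mentioned in the limits; for series from one scribe or manuscript, the
# resampling unit would have to be the series instead.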

	wilcoxon:
	title: "Wilcoxon signed-rank test"
	definition: >-
	Non-parametric test of equality between two series of paired
	measurements (same documents, two different engines).
	measures: >-
	Statistical significance of an observed gap between two engines,
	without assuming normality of distributions.
	usage: >-
	Pairwise comparison of two engines on a corpus.
	limits: >-
	When applied repeatedly across all pairs of k engines, the Type-I
	error risk grows — prefer Friedman-Nemenyi to compare more than two
	engines.
	reference: >-
	Wilcoxon, F. (1945). "Individual Comparisons by Ranking Methods".
	Biometrics Bulletin.

	friedman:
	title: "Friedman test"
	definition: >-
	Non-parametric equivalent of repeated-measures ANOVA: tests whether
	at least one engine among k differs from the others on n documents.
	measures: >-
	A global difference between k engines across n blocks (documents).
	usage: >-
	Prelude to the Nemenyi post-hoc. Recommended whenever more than two
	engines are compared, to control the multi-comparison risk.
	limits: >-
	Does not identify which pairs differ — the post-hoc is required.
	reference: >-
	Friedman, M. (1937). "The Use of Ranks to Avoid the Assumption of
	Normality Implicit in the Analysis of Variance".

	nemenyi:
	title: "Nemenyi post-hoc"
	definition: >-
	Post-hoc test applied after a Friedman test to identify distinguishable
	engine pairs. Computes a ``critical difference`` (CD) depending on the
	number of engines and documents.
	measures: >-
	Pairs of engines whose mean ranks differ significantly.
	usage: >-
	Basis of the Critical Difference Diagram (Demšar 2006).
	limits: >-
	Conservative by construction (corrects for multiple comparisons);
	may miss real but subtle differences.
	reference: >-
	Nemenyi, P. (1963). "Distribution-free Multiple Comparisons".
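
# Illustration of the Nemenyi threshold, CD = q_alpha * sqrt(k(k+1)/(6n)),
# as given in Demšar (2006). A sketch: function names are hypothetical,
# and the critical values below are the alpha = 0.05 table from that
# paper for k = 2 to 10 engines.

```python
import math

# Critical values q_0.05 for the two-tailed Nemenyi test (Demšar 2006).
Q_ALPHA_05 = {2: 1.960, 3: 2.343, 4: 2.569, 5: 2.728, 6: 2.850,
              7: 2.949, 8: 3.031, 9: 3.102, 10: 3.164}


def mean_ranks(cer_table: dict[str, list[float]]) -> dict[str, float]:
    """Mean rank per engine (rank 1 = lowest CER); ties share the mean rank."""
    engines = list(cer_table)
    n_docs = len(next(iter(cer_table.values())))
    totals = {e: 0.0 for e in engines}
    for d in range(n_docs):
        scores = sorted((cer_table[e][d], e) for e in engines)
        i = 0
        while i < len(scores):
            j = i
            while j + 1 < len(scores) and scores[j + 1][0] == scores[i][0]:
                j += 1
            shared = (i + j) / 2 + 1  # mean of ranks i+1 .. j+1
            for _, e in scores[i:j + 1]:
                totals[e] += shared
            i = j + 1
    return {e: totals[e] / n_docs for e in engines}


def critical_difference(k: int, n: int) -> float:
    """Minimum mean-rank gap for two of k engines to differ over n documents."""
    return Q_ALPHA_05[k] * math.sqrt(k * (k + 1) / (6 * n))


ranks = mean_ranks({"a": [1, 2, 3], "b": [2, 3, 4], "c": [3, 1, 4]})
print(ranks, critical_difference(k=3, n=3))
```

# Two engines are reported as distinguishable only when their mean ranks
# differ by more than the critical difference.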

	cdd:
	title: "Critical Difference Diagram"
	definition: >-
	Graphical rendering of Friedman-Nemenyi results: engines placed on a
	horizontal axis (mean rank), connected by a bar if they are not
	statistically distinguishable at the α level.
	measures: >-
	Global ordering of engines and indistinguishability groups.
	usage: >-
	De facto standard in ML since Demšar 2006 for comparing multiple
	systems over multiple datasets.
	limits: >-
	Can be hard to read when several groups partially overlap.
	reference: >-
	Demšar, J. (2006). "Statistical Comparisons of Classifiers over
	Multiple Data Sets". JMLR 7:1-30.

	pareto_front:
	title: "Pareto front"
	definition: >-
	Set of engines that no other engine dominates: no competitor is at
	least as good on both quality and cost while being strictly better on
	one of them (the same applies to any other objective pair).
	measures: >-
	"Non-dominated" trade-offs — choosing outside the Pareto front is
	always suboptimal, but choosing on the front depends on each
	institution's priorities.
	usage: >-
	Core of the report's quality/cost view. Also applicable to
	quality/speed or quality/carbon.
	limits: >-
	Costs used are indicative (see ``pricing.yaml``) and age quickly.
	Always cross-check with real invoices before purchase decisions.
	reference: >-
	Pareto, V. (1906). "Manuale di economia politica".
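
# Illustration of a quality/cost Pareto front. A minimal sketch: the
# function name, the engine names and the figures are invented for the
# example. An engine is dominated when another one is at least as good
# on both axes and differs on at least one; lower is better for both
# CER and cost.

```python
def pareto_front(engines: dict[str, tuple[float, float]]) -> list[str]:
    """Return the names of non-dominated engines; values are (cer, cost)."""
    def dominates(a: tuple[float, float], b: tuple[float, float]) -> bool:
        return a[0] <= b[0] and a[1] <= b[1] and a != b

    return sorted(
        name for name, point in engines.items()
        if not any(dominates(other, point) for other in engines.values())
    )


# Hypothetical engines with (mean CER in %, cost per 1000 pages).
candidates = {
    "engine_a": (5.0, 1.0),
    "engine_b": (3.0, 4.0),
    "engine_c": (6.0, 2.0),   # worse than engine_a on both axes
    "engine_d": (3.0, 5.0),   # same CER as engine_b, but more expensive
}
print(pareto_front(candidates))
```

# engine_a and engine_b remain: neither beats the other on both axes,
# so the choice between them depends on the institution's priorities.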

	difficulty_score:
	title: "Intrinsic difficulty score"
	definition: >-
	Score in [0, 1] combining inter-engine CER variance, image quality,
	and density of heritage-specific characters.
	measures: >-
	How intrinsically difficult a document is, independently of the
	evaluation instrument.
	usage: >-
	Allows stratifying the report (easy vs hard documents) and
	interpreting a global CER while accounting for corpus specifics.
	limits: >-
	Default weights (0.4, 0.35, 0.25) are heuristic and can be adjusted
	to the context.
	reference: >-
	Stutzmann, D. (2017). "Clustering of medieval scripts through
	computer image analysis".

	normalization_profile:
	title: "Normalization profile"
	definition: >-
	Set of transformation rules applied to GT and hypothesis before CER
	computation: ſ=s merge, u=v, i=j, abbreviation expansion, character
	exclusion, etc.
	measures: >-
	The choice of an editorial convention for the CER computation — does
	not affect source data.
	usage: >-
	Picarones ships 9 preconfigured profiles (medieval_french,
	early_modern_english, medieval_latin…). Additional profiles can be
	loaded from YAML.
	limits: >-
	Too aggressive → masks real errors; too strict → overestimates error.
	reference: >-
	See ``picarones/core/normalization.py`` for the profile list.

	structure:
	title: "Structural scores"
	definition: >-
	Set of structure-level measures: line fusion rate, fragmentation rate,
	reading-order (LCS), paragraph preservation.
	measures: >-
	The integrity of reconstructed layout, beyond character-level text.
	usage: >-
	Crucial for multi-column documents (newspapers, glossed Bibles) where
	a low CER can hide a broken reading order.
	limits: >-
	Depends on structure annotations in the GT — not always available.
	reference: >-
	Antonacopoulos, A. et al. (2015). "ICDAR 2015 Competition on Text
	Line Detection in Historical Documents".

	image_quality:
	title: "Image quality"
	definition: >-
	Composite [0, 1] score combining sharpness (Laplacian variance),
	noise level, contrast, and rotation angle estimate.
	measures: >-
	The physical characteristics of the source image that may degrade
	recognition.
	usage: >-
	Used to stratify results (good vs bad images) and to identify
	documents that would benefit from rescanning.
	limits: >-
	Pure image-level score; does not capture paleographic difficulties
	(cursive scripts, dense abbreviations).
	reference: >-
	Antonacopoulos, A., Bridson, D., Papadopoulos, C., Pletschacher, S.
	(2009). "A Realistic Dataset for Performance Evaluation of Document
	Layout Analysis".