# Narrative rendering templates: English.
# Anti-hallucination rule: never introduce a number or entity name that is not
# already in the Fact ``payload``. Tests verify traceability of every number
# appearing in the rendered synthesis.

global_leader_cer: >-
  On this corpus of {n_docs} documents, {engine} achieves the lowest mean CER
  ({cer_pct} %).

statistical_tie: >-
  Engines {engines_list} are not statistically distinguishable
  (Friedman-Nemenyi, α = {alpha}, n = {n_blocks} documents, CD = {critical_distance}).

significant_gap: >-
  The gap between {leader} and {runner_up} is statistically significant
  (Wilcoxon, p = {p_value:.4f}, Δ CER = {delta_cer_pct} points over {n_pairs} pairs).

stratum_winner: >-
  On stratum "{stratum}" ({n_docs_stratum} documents), {engine} clearly
  dominates with a CER of {cer_pct} % vs. {second_cer_pct} % for {second_engine}.

stratum_collapse: >-
  {engine} is globally competitive ({global_cer_pct} %) but collapses on
  stratum "{stratum}" ({local_cer_pct} % over {n_docs_stratum} documents,
  i.e. {delta_cer_pct} points above its own average).

error_profile_outlier: >-
  {engine} has an atypical error profile: {proportion_pct} % of errors fall
  into class "{error_class}", vs. a median of {median_proportion_pct} % across
  other engines (×{ratio_to_median} the median).

llm_hallucination_flag: >-
  Hallucination signal on {engine} ({reasons_list}):
  {hallucinating_rate_pct} % of documents are above alert thresholds.

robustness_fragile: >-
  {engine} is fragile under "{degradation}" degradation: its CER rises from
  {cer_baseline_pct} % to {cer_degraded_pct} % at maximum level (×{ratio}).

speed_winner: >-
  {engine} is the fastest ({mean_duration} s/doc, ×{speedup} faster than the
  median) for comparable quality (CER {cer_pct} %).

confidence_warning: >-
  Ranking is fragile: the {confidence_level} % confidence interval of {engine} spans
  {ci_width_pct} CER points, compared with a gap of {gap_to_runner_up_pct} points to the runner-up.

pareto_alternative: >-
  At much lower cost, {engine} offers an attractive trade-off ({cer_pct} %
  CER for {cost} €/{cost_unit_pages} pages, vs. {leader_cer_pct} % / {leader_cost} € for
  {leader}, i.e. ×{cost_saving_ratio} cheaper).

cost_outlier: >-
  Disproportionate cost for {engine} ({cost} €/{cost_unit_pages} pages, ×{ratio_to_median}
  the median) without a compensating quality advantage (CER {cer_pct} %).

ensemble_opportunity: >-
  Engines {pair_a} and {pair_b} have divergent error profiles
  ({divergence_metric}={divergence}). On this corpus of {doc_count} documents,
  {best_engine} preserves {best_recall_pct} % of tokens; a majority vote
  among the engines would preserve {oracle_recall_pct} %, i.e.
  {absolute_gap_pct} points recoverable ({relative_gap_pct} % of the best
  engine's errors).

median_mean_gap_warning: >-
  Asymmetric distribution for {engine}: median CER {median_cer_pct} %
  vs. mean {mean_cer_pct} % across {n_docs} documents (relative gap
  {relative_gap_pct} %). The mean is pulled up by a few catastrophic
  documents; the median (now used for default ranking) is more
  representative.

stratification_recommended: >-
  Heterogeneous corpus ({n_strata} strata): {leader} performs very
  differently depending on document type, with a median CER of
  {min_stratum_cer_pct} % on "{min_stratum}" vs.
  {max_stratum_cer_pct} % on "{max_stratum}", a gap of {gap_pct}
  points. The global ranking hides this disparity; consult the
  stratified view.

engine_off_baseline: >-
  {engine} achieved {cer_current_pct} % CER here, vs. {cer_historical_mean_pct} %
  on average over the last {n_runs} runs by your institution on this
  same corpus (relative delta {relative_delta_pct} %). This corpus is
  harder for it than usual.

engine_unstable: >-
  Over {n_runs} successive runs, {engine} produces variable outputs
  (CER coefficient of variation {cer_cv_pct} %, rate of identical run
  pairs {identical_run_rate_pct} %). Reproducibility is limited; interpret
  the average CER with caution.

regression_in_history: >-
  Over the {n_runs} historical runs for {engine}, the average CER
  moved from {first_cer_pct} % to {last_cer_pct} %
  (cumulative change {absolute_delta_pct} points). Investigate what
  changed in the pipeline or the models.
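
# Rendering sketch, kept as comments so the file stays valid YAML.
# Assumption (not confirmed by this file): placeholders are filled with
# Python's str.format, as the `{p_value:.4f}` format spec suggests. The `>-`
# indicator folds each template into one line with no trailing newline.
# The payload values below are hypothetical, for illustration only.
#
# significant_gap with the payload
#   leader: engine_a, runner_up: engine_b, p_value: 0.00312,
#   delta_cer_pct: 1.8, n_pairs: 120
# would then render as
#   The gap between engine_a and engine_b is statistically significant
#   (Wilcoxon, p = 0.0031, Δ CER = 1.8 points over 120 pairs).
#
# Every number in the rendered sentence comes from the payload, which is what
# the anti-hallucination rule at the top of this file requires.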