sprint92: A.II.9 - longitudinal metrics (regression + change-point + detector)
| # Narrative rendering templates - English. | |
| # Anti-hallucination rule: never introduce a number or entity name that is not | |
| # already in the Fact ``payload``. Tests verify traceability of every number | |
| # appearing in the rendered synthesis. | |
| global_leader_cer: >- | |
| On this corpus of {n_docs} documents, {engine} achieves the lowest mean CER | |
| ({cer_pct} %). | |
| statistical_tie: >- | |
| Engines {engines_list} are not statistically distinguishable | |
| (Friedman-Nemenyi, α = {alpha}, n = {n_blocks} documents, CD = {critical_distance}). | |
| significant_gap: >- | |
| The gap between {leader} and {runner_up} is statistically significant | |
| (Wilcoxon, p = {p_value:.4f}, Δ CER = {delta_cer_pct} points over {n_pairs} pairs). | |
| stratum_winner: >- | |
| On stratum "{stratum}" ({n_docs_stratum} documents), {engine} clearly | |
| dominates with a CER of {cer_pct} % vs. {second_cer_pct} % for {second_engine}. | |
| stratum_collapse: >- | |
| {engine} is globally competitive ({global_cer_pct} %) but collapses on | |
| stratum "{stratum}" ({local_cer_pct} % over {n_docs_stratum} documents, | |
| i.e. {delta_cer_pct} points above its own average). | |
| error_profile_outlier: >- | |
| {engine} has an atypical error profile: {proportion_pct} % of errors fall | |
| into class "{error_class}", vs. a median of {median_proportion_pct} % across | |
| other engines (×{ratio_to_median} the median). | |
| llm_hallucination_flag: >- | |
| Hallucination signal on {engine} ({reasons_list}): | |
| {hallucinating_rate_pct} % of documents above alert thresholds. | |
| robustness_fragile: >- | |
| {engine} is fragile under "{degradation}" degradation: its CER rises from | |
| {cer_baseline_pct} % to {cer_degraded_pct} % at maximum level (×{ratio}). | |
| speed_winner: >- | |
| {engine} is the fastest ({mean_duration} s/doc, ×{speedup} faster than the | |
| median) for comparable quality (CER {cer_pct} %). | |
| confidence_warning: >- | |
| Ranking is fragile: the {confidence_level} % confidence interval of {engine} spans | |
| {ci_width_pct} CER points, compared with a gap of {gap_to_runner_up_pct} points to the runner-up. | |
| pareto_alternative: >- | |
| At much lower cost, {engine} offers an interesting trade-off ({cer_pct} % | |
| CER for {cost} €/{cost_unit_pages} pages, vs {leader_cer_pct} % / {leader_cost} € for | |
| {leader}, i.e. ×{cost_saving_ratio} cheaper). | |
| cost_outlier: >- | |
| Disproportionate cost for {engine} ({cost} €/{cost_unit_pages} pages, ×{ratio_to_median} | |
| the median) without a compensating quality advantage (CER {cer_pct} %). | |
| ensemble_opportunity: >- | |
| Engines {pair_a} and {pair_b} have divergent error profiles | |
| ({divergence_metric}={divergence}). On this corpus of {doc_count} documents, | |
| {best_engine} preserves {best_recall_pct} % of tokens; a majority vote | |
| among the engines would preserve {oracle_recall_pct} %, i.e. | |
| {absolute_gap_pct} points recoverable ({relative_gap_pct} % of the best | |
| engine's errors). | |
| median_mean_gap_warning: >- | |
| Asymmetric distribution for {engine}: median CER {median_cer_pct} % | |
| vs mean {mean_cer_pct} % across {n_docs} documents (relative gap | |
| {relative_gap_pct} %). The mean is pulled by a few catastrophic | |
| documents; the median (now used for default ranking) is more | |
| representative. | |
| stratification_recommended: >- | |
| Heterogeneous corpus ({n_strata} strata): {leader} performs very | |
| differently depending on document type: median CER | |
| {min_stratum_cer_pct} % on "{min_stratum}" vs | |
| {max_stratum_cer_pct} % on "{max_stratum}", a gap of {gap_pct} | |
| points. The global ranking hides this disparity; consult the | |
| stratified view. | |
| engine_off_baseline: >- | |
| {engine} achieved {cer_current_pct} % CER here, vs {cer_historical_mean_pct} % | |
| on average over the last {n_runs} runs of your institution on this | |
| same corpus (relative delta {relative_delta_pct} %). This corpus is | |
| harder for it than usual. | |
| engine_unstable: >- | |
| Over {n_runs} successive runs, {engine} produces variable outputs | |
| (CER CV {cer_cv_pct} %, identical-run pair rate {identical_run_rate_pct} %). | |
| Reproducibility is limited; interpret the average CER with caution. | |
| regression_in_history: >- | |
| Over the {n_runs} historical runs for {engine}, the average CER | |
| moved from {first_cer_pct} % to {last_cer_pct} % | |
| (cumulative change {absolute_delta_pct} points). Investigate what | |
| changed in the pipeline or the models. | |
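
The header comment of this file states that tests verify the traceability of every number in the rendered synthesis. The project's actual test code is not shown here, so the following is only a minimal sketch of how such a check could work, assuming the templates are filled with Python's `str.format`. The helper names `render` and `untraceable_numbers` and the payload values are hypothetical.

```python
import re
from string import Formatter

# Matches integers and decimals such as "120" or "4.2".
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def render(template, payload):
    """Fill a template's {placeholders} from the Fact payload."""
    return template.format(**payload)


def untraceable_numbers(template, payload):
    """Return numbers in the rendered text that no payload field produced."""
    allowed = set()
    for _literal, field, spec, _conv in Formatter().parse(template):
        if field is None:  # trailing literal text, no placeholder
            continue
        # Format the value exactly as the template would (e.g. ":.4f").
        allowed.update(NUMBER.findall(format(payload[field], spec or "")))
    rendered = render(template, payload)
    return [n for n in NUMBER.findall(rendered) if n not in allowed]


# Template text copied from global_leader_cer; the payload is made up.
TEMPLATE = ("On this corpus of {n_docs} documents, {engine} achieves "
            "the lowest mean CER ({cer_pct} %).")
PAYLOAD = {"n_docs": 120, "engine": "engine_a", "cer_pct": 4.2}

print(render(TEMPLATE, PAYLOAD))
print(untraceable_numbers(TEMPLATE, PAYLOAD))                      # no violations
print(untraceable_numbers(TEMPLATE + " Top 3 engines.", PAYLOAD))  # flags the literal 3
```

The check works at the token level: a number passes only if it appears in some payload field once that field is formatted with the template's own format spec. A literal number typed into a template is therefore flagged, which matches the rule stated at the top of the file. Entity names would need a separate check that this sketch does not cover.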