arxiv:2605.12895

RISED: A Pre-Deployment Evaluation Framework for High-Stakes AI Decision-Support Systems, with Application to Healthcare

Published on May 30

Abstract

A comprehensive pre-deployment evaluation framework for clinical decision-support systems is introduced, assessing reliability, inclusivity, sensitivity, equity, and deployability across diverse datasets to reveal failures hidden by traditional accuracy metrics.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Clinical decision-support systems are expert systems whose recommendations clinicians act on directly, yet they are usually cleared on one aggregate accuracy number from a held-out test set. That number says nothing about input reliability under encoding shifts, subgroup gaps, threshold sensitivity, or operational feasibility. We present RISED, a pre-deployment evaluation framework operationalising five dimensions (Reliability, Inclusivity, Sensitivity, Equity, Deployability) through BCa bootstrap 95% confidence intervals, literature-grounded thresholds, and Holm-Bonferroni-corrected PASS / FAIL / INCONCLUSIVE verdicts; Equity is a proxy-dependence diagnostic rather than a gating test. Applied to seven cohorts spanning 35 years (n from 303 to 99,492), RISED surfaces failures invisible to AUROC: on Diabetes 130, Reliability passes by three orders of magnitude (PSS = 0.0004) while Inclusivity (AUC parity gap = 0.262) and Sensitivity (max threshold-flip rate 49.1%) fail decisively; both NHIS cohorts reproduce this. NHANES 2021-2023, with a complete feature profile, achieves INCONCLUSIVE verdicts; BRFSS 2024 produces the suite's most severe Sensitivity failure (max threshold-flip rate 64.2%) after instrument rotation removed hypertension and cholesterol. The pattern recurs on credit- and income-prediction cohorts, confirming domain-agnosticity; a multi-model check shows the failures are data-driven, not model-specific. RISED ships as an open-source Python package complementing TRIPOD+AI, FUTURE-AI, and Fairlearn with the structured numerical evidence those standards require but do not prescribe.
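The verdict logic described in the abstract can be illustrated with a minimal sketch. It is not the RISED package's API: the function names `holm_bonferroni` and `ci_verdict`, the "lower is better" verdict rule, and every number in the example are assumptions made for illustration only. The Holm step-down procedure itself is standard: sort the p-values, compare the k-th smallest (k starting at 0) with alpha / (m - k), and stop at the first non-rejection.

```python
from typing import Dict


def holm_bonferroni(p_values: Dict[str, float], alpha: float = 0.05) -> Dict[str, bool]:
    """Return {test name: rejected?} using the Holm step-down procedure."""
    ordered = sorted(p_values.items(), key=lambda kv: kv[1])
    m = len(ordered)
    rejected = {name: False for name in p_values}
    for rank, (name, p) in enumerate(ordered):
        # The k-th smallest p-value (k = 0..m-1) is compared with alpha / (m - k).
        if p <= alpha / (m - rank):
            rejected[name] = True
        else:
            break  # step-down: every larger p-value is also not rejected
    return rejected


def ci_verdict(ci_low: float, ci_high: float, threshold: float) -> str:
    """Three-way verdict for a 'lower is better' metric.

    Assumed rule (not taken from the paper): PASS if the whole confidence
    interval lies below the threshold, FAIL if it lies entirely above,
    INCONCLUSIVE if the interval straddles the threshold.
    """
    if ci_high < threshold:
        return "PASS"
    if ci_low > threshold:
        return "FAIL"
    return "INCONCLUSIVE"


# Hypothetical p-values for four gating dimensions (illustrative only).
example_p = {
    "Reliability": 0.001,
    "Inclusivity": 0.02,
    "Sensitivity": 0.04,
    "Deployability": 0.30,
}
print(holm_bonferroni(example_p))

# Hypothetical confidence intervals against a hypothetical threshold of 0.10.
print(ci_verdict(0.001, 0.004, 0.10))
print(ci_verdict(0.20, 0.31, 0.10))
print(ci_verdict(0.05, 0.15, 0.10))
```

In this example only the smallest p-value survives the correction: 0.02 would pass an uncorrected 0.05 cut-off but not the Holm bound of 0.05 / 3. The three-way verdict shows why an interval that straddles its threshold is reported as INCONCLUSIVE rather than forced into PASS or FAIL.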
