MedGRPO Team Claude Opus 4.7 (1M context) committed on
Commit · 03adf33
1 Parent(s): 3a73928
Rank against the union of official + community to keep ranks consistent
Same model was getting a different rank in the Official vs Community
views because each view ran sort_by_avg_rank against its own competitor
set (15 vs 35 models). With genuinely different avg-ranks per view, no
amount of tiebreaking could make the displayed rank match.
Fix: compute a single global ranking against the union of all
submissions (official ∪ community), then map each view's rows to their
global rank number. The same model now shows the same rank in either
table; the Official table just skips rank numbers for rows that aren't
in the official subset (e.g., rank 7 → 9 if rank 8 isn't promoted).
Update the About-tab explanation to describe the global ranking.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
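The effect described in the commit message can be reproduced with a small, self-contained sketch. Everything here is illustrative: the model names and scores are made up, `min_rank` and `rank_by_avg_rank` are hypothetical stand-ins for the real `sort_by_avg_rank` (whose body is not part of this diff), and ties on the average fall back to the model name only.

```python
from statistics import mean


def min_rank(scores):
    """Per-metric rank, 1 = best; ties share the smaller rank."""
    return {name: 1 + sum(other > s for other in scores.values())
            for name, s in scores.items()}


def rank_by_avg_rank(models):
    """Map {name: [metric scores]} to {name: position by average per-metric rank}.

    Simplified stand-in: ties on the average are broken by name only.
    """
    n_metrics = len(next(iter(models.values())))
    per_metric = [min_rank({m: s[i] for m, s in models.items()})
                  for i in range(n_metrics)]
    avg = {m: mean(r[m] for r in per_metric) for m in models}
    ordered = sorted(models, key=lambda m: (avg[m], m))
    return {m: pos for pos, m in enumerate(ordered, start=1)}


# Hypothetical scores on three metrics (higher is better).
community = {
    "A": [0.90, 0.90, 0.20],
    "B": [0.80, 0.80, 0.80],
    "C": [0.70, 0.70, 0.90],
    "D": [0.85, 0.85, 0.50],
}
official = {name: community[name] for name in ("A", "B", "C")}

# Old behaviour: each view ranks against its own competitor set.
per_view_official = rank_by_avg_rank(official)
per_view_community = rank_by_avg_rank(community)

# New behaviour: rank once against the union (official entries win on a
# name clash), then look up each visible row's global rank.
union = {**community, **official}
global_rank = rank_by_avg_rank(union)
official_view = {name: global_rank[name] for name in official}

print(per_view_official)   # {'A': 1, 'B': 2, 'C': 3}
print(per_view_community)  # {'A': 1, 'D': 2, 'B': 3, 'C': 4}
print(official_view)       # {'A': 1, 'B': 3, 'C': 4}
```

Under per-view ranking, model B is rank 2 in the official table but rank 3 in the community table. With the global lookup, the official table shows ranks 1, 3, 4 and simply skips rank 2, which belongs to the community-only model D.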
app.py
CHANGED
@@ -216,6 +216,89 @@ TEST_SET_STATS = {
 }


+def _read_leaderboard_json(filename: str) -> List[dict]:
+    """Read a leaderboard JSON (raw, no sort) from the private HF repo with
+    a local file fallback. Used by apply_global_rank to build the union of
+    official + community models so ranks are consistent across both tables.
+    """
+    try:
+        token = os.environ.get('HF_TOKEN')
+        if token:
+            try:
+                path = hf_hub_download(
+                    repo_id="UII-AI/MedVidBench-GroundTruth",
+                    filename=filename,
+                    repo_type="dataset",
+                    token=token,
+                    cache_dir="./cache",
+                )
+                with open(path) as f:
+                    return json.load(f) or []
+            except Exception:
+                pass
+    except Exception:
+        pass
+    local = PERSISTENT_DIR / filename
+    if local.exists():
+        with open(local) as f:
+            return json.load(f) or []
+    return []
+
+
+def _global_rank_lookup() -> dict:
+    """Compute a global rank for each model by ranking against the union of
+    all submissions (official ∪ community). Returns {model_name: global_rank}.
+
+    This ensures the same model gets the same rank number in either the
+    Official or Community leaderboard view, regardless of which set of
+    competitors is being displayed.
+    """
+    official_raw = _read_leaderboard_json("official_leaderboard.json")
+    community_raw = _read_leaderboard_json("leaderboard.json")
+
+    # Union by model_name; if a name is in both, prefer official entry's
+    # numeric values (they're the maintainer-verified ones, though in
+    # practice they're identical to community).
+    union_by_name = {}
+    for entry in community_raw:
+        name = entry.get("model_name")
+        if name:
+            union_by_name[name] = entry
+    for entry in official_raw:
+        name = entry.get("model_name")
+        if name:
+            union_by_name[name] = entry
+
+    if not union_by_name:
+        return {}
+
+    df = pd.DataFrame(list(union_by_name.values()))
+    if 'cvs_acc' not in df.columns:
+        return {}
+    df = sort_by_avg_rank(df)
+    return {row["model_name"]: int(row["rank"]) for _, row in df.iterrows()}
+
+
+def _apply_global_rank(df: pd.DataFrame) -> pd.DataFrame:
+    """Re-rank the rows of df using global ranks (computed against the union
+    of official + community models). Rows are re-sorted by global rank and
+    the 'rank' column is reassigned to display the global rank value.
+    """
+    if df.empty:
+        return df
+    lookup = _global_rank_lookup()
+    if not lookup:
+        return df
+    df = df.copy()
+    # Fall back to a large number for any model not in the lookup (shouldn't
+    # happen, but keeps the sort total-ordered).
+    df["_global_rank"] = df["model_name"].map(lookup).fillna(10**9).astype(int)
+    df = df.sort_values("_global_rank", ascending=True, kind="mergesort").reset_index(drop=True)
+    df["rank"] = df["_global_rank"]
+    df = df.drop(columns=["_global_rank"])
+    return df
+
+
 def sort_by_avg_rank(df: pd.DataFrame) -> pd.DataFrame:
     """Sort the leaderboard by average rank across all metrics.

@@ -307,9 +390,12 @@ def load_leaderboard() -> pd.DataFrame:
         if 'average' in df.columns:
             df = df.drop('average', axis=1)

-        # Sort by average rank across all metrics (lower avg rank = better)
+        # Sort by average rank across all metrics (lower avg rank = better).
+        # Then re-rank against the union of official + community so the
+        # same model gets the same rank in either view.
         if 'cvs_acc' in df.columns:
             df = sort_by_avg_rank(df)
+            df = _apply_global_rank(df)

         print(f"✓ Loaded leaderboard from private repo: {len(df)} entries")
         return df
@@ -330,9 +416,11 @@ def load_leaderboard() -> pd.DataFrame:
         if 'average' in df.columns:
             df = df.drop('average', axis=1)

-        # Sort by
+        # Sort by avg-rank, then apply union-based global rank so the
+        # number shown matches the official table for the same model.
         if 'cvs_acc' in df.columns:
             df = sort_by_avg_rank(df)
+            df = _apply_global_rank(df)

         print(f"✓ Loaded leaderboard from local file: {len(df)} entries")
         return df
@@ -428,6 +516,7 @@ def load_official_leaderboard() -> pd.DataFrame:
         df = pd.DataFrame(data)
         if 'cvs_acc' in df.columns:
             df = sort_by_avg_rank(df)
+            df = _apply_global_rank(df)
         print(f"✓ Loaded official leaderboard from private repo: {len(df)} entries")
         return df
     except Exception as e:
@@ -443,6 +532,7 @@ def load_official_leaderboard() -> pd.DataFrame:
         df = pd.DataFrame(data)
         if 'cvs_acc' in df.columns:
             df = sort_by_avg_rank(df)
+            df = _apply_global_rank(df)
         print(f"✓ Loaded official leaderboard from local file: {len(df)} entries")
         return df

@@ -2421,14 +2511,14 @@ with gr.Blocks(title="MedVidBench Leaderboard", theme=gr.themes.Soft()) as demo:

 Models are ranked by **average rank across all 10 metrics** – lower average rank = better. For each metric we rank every model (1 = best; ties share the smaller rank), then average those per-metric ranks. This is robust to different metric scales (accuracy 0–1 vs. LLM-judge 1–5) and rewards models that are strong across tasks rather than exceptional on one.

+**Global ranking across views:** the rank shown is computed against the **union of all submissions** (official ∪ community), so the same model gets the same rank number in either the Official or the Community table – even though each table only displays a subset of rows. The Official table omits rows from the global ranking; the rank column shows each row's position in the full ranking, not its position within the visible subset.
+
 **Tiebreakers** (applied in order when two models have the same average rank):
 1. **Number of metrics won outright** – a model that's #1 on more metrics wins over one that ties closely on many.
 2. **Sum of per-metric ranks** – catches near-ties where the mean rounded equal.
 3. **Sum of normalized scores** – favors the model with marginally higher absolute scores.
 4. **Model name alphabetical** – final fallback for full determinism.

-This guarantees the same model gets the same rank across the Official and Community tables regardless of submission order.
-
 ---

 ### Benchmark Tasks
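The tiebreaker order listed in the About-tab text can be read as a single composite sort key. The sketch below is only an illustration of that ordering, not the project's `sort_by_avg_rank` (its body is not shown in this diff). The field names `avg_rank`, `metrics_won`, `rank_sum`, and `norm_score_sum` are hypothetical, and the rows are invented and assumed to be precomputed.

```python
def tiebreak_key(row):
    """Composite sort key; a smaller tuple sorts first, i.e. ranks better."""
    return (
        row["avg_rank"],         # primary: lower average rank is better
        -row["metrics_won"],     # 1. more metrics won outright is better
        row["rank_sum"],         # 2. lower sum of per-metric ranks is better
        -row["norm_score_sum"],  # 3. higher sum of normalized scores is better
        row["model_name"],       # 4. alphabetical fallback for determinism
    )


# Hypothetical, precomputed per-model statistics.
rows = [
    {"model_name": "zeta",  "avg_rank": 2.0, "metrics_won": 1, "rank_sum": 20, "norm_score_sum": 7.1},
    {"model_name": "alpha", "avg_rank": 2.0, "metrics_won": 3, "rank_sum": 20, "norm_score_sum": 6.9},
    {"model_name": "beta",  "avg_rank": 2.0, "metrics_won": 1, "rank_sum": 20, "norm_score_sum": 7.4},
    {"model_name": "gamma", "avg_rank": 1.5, "metrics_won": 0, "rank_sum": 15, "norm_score_sum": 6.0},
    {"model_name": "delta", "avg_rank": 2.0, "metrics_won": 1, "rank_sum": 20, "norm_score_sum": 7.1},
]

ordered = sorted(rows, key=tiebreak_key)
names = [row["model_name"] for row in ordered]
print(names)  # ['gamma', 'alpha', 'beta', 'delta', 'zeta']
```

Here `gamma` leads on average rank alone. Among the four models tied at 2.0, `alpha` wins on metrics won, `beta` on normalized score sum, and `delta` precedes `zeta` only by name.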