MedGRPO Team Claude Opus 4.7 (1M context) committed on
Commit 03adf33 · 1 Parent(s): 3a73928

Rank against the union of official + community to keep ranks consistent

Same model was getting a different rank in the Official vs Community
views because each view ran sort_by_avg_rank against its own competitor
set (15 vs 35 models). With genuinely different avg-ranks per view, no
amount of tiebreaking could make the displayed rank match.
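
To see why per-view ranking cannot agree, here is a minimal sketch with made-up scores. `avg_rank_order` is a simplified stand-in for `sort_by_avg_rank` (its body is not shown in this diff, and the real function also applies tiebreakers). The model names, the metric columns `m1` to `m3`, and all values are hypothetical.

```python
import pandas as pd

def avg_rank_order(df, metrics):
    """Order models by mean per-metric rank (1 = best, ties share the smaller rank)."""
    per_metric = df[metrics].rank(ascending=False, method="min")
    out = df.assign(avg_rank=per_metric.mean(axis=1))
    out = out.sort_values(["avg_rank", "model_name"], kind="mergesort").reset_index(drop=True)
    out["rank"] = out.index + 1
    return out

metrics = ["m1", "m2", "m3"]
community = pd.DataFrame({
    "model_name": ["A", "B", "C", "D"],
    "m1": [0.90, 0.80, 0.85, 0.70],
    "m2": [0.90, 0.60, 0.85, 0.95],
    "m3": [0.88, 0.75, 0.92, 0.65],
})
# Hypothetical official subset: only B and D are promoted.
official = community[community["model_name"].isin(["B", "D"])]

rank_in_community = avg_rank_order(community, metrics).set_index("model_name")["rank"]
rank_in_official = avg_rank_order(official, metrics).set_index("model_name")["rank"]
print(rank_in_community.to_dict())
print(rank_in_official.to_dict())
```

With these numbers B sits behind D among four competitors but ahead of D when only B and D are ranked, because removing A and C changes both models' per-metric ranks.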

Fix: compute a single global ranking against the union of all
submissions (official ∪ community), then map each view's rows to their
global rank number. The same model now shows the same rank in either
table; the Official table just skips rank numbers for rows that aren't
in the official subset (e.g., rank 7 → 9 if rank 8 isn't promoted).
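
A minimal sketch of the union-then-map idea, under stated assumptions: `union_by_name` and `apply_rank_lookup` are illustrative helpers, not the functions in app.py; the single `score` column and the sort on it stand in for `sort_by_avg_rank`; and all entries are made up.

```python
import pandas as pd

def union_by_name(community, official):
    """Merge entries by model_name; official entries overwrite community ones."""
    merged = {}
    for entry in list(community) + list(official):
        name = entry.get("model_name")
        if name:
            merged[name] = entry
    return merged

def apply_rank_lookup(view, lookup, missing_rank=10**9):
    """Attach global ranks to a view and sort by them; unknown models sort last."""
    ranks = view["model_name"].map(lookup).fillna(missing_rank).astype(int)
    out = view.assign(rank=ranks)
    return out.sort_values("rank", kind="mergesort").reset_index(drop=True)

# Hypothetical data with a single made-up "score" column.
community = [
    {"model_name": "A", "score": 0.9},
    {"model_name": "B", "score": 0.6},
    {"model_name": "C", "score": 0.8},
    {"model_name": "D", "score": 0.7},
]
official = [{"model_name": "D", "score": 0.7}, {"model_name": "A", "score": 0.9}]

# Stand-in for sort_by_avg_rank: rank the union by the one score column.
union = pd.DataFrame(list(union_by_name(community, official).values()))
union = union.sort_values("score", ascending=False, kind="mergesort").reset_index(drop=True)
lookup = {name: i + 1 for i, name in enumerate(union["model_name"])}

official_view = apply_rank_lookup(pd.DataFrame(official), lookup)
community_view = apply_rank_lookup(pd.DataFrame(community), lookup)
print(official_view[["model_name", "rank"]])
```

The official view shows ranks 1 and 3: rank 2 belongs to C, which exists only in the community set, so the number is skipped rather than reassigned.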

Update the About-tab explanation to describe the global ranking.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Files changed (1)
  1. app.py +94 -4
app.py CHANGED
@@ -216,6 +216,89 @@ TEST_SET_STATS = {
 }


 def sort_by_avg_rank(df: pd.DataFrame) -> pd.DataFrame:
     """Sort the leaderboard by average rank across all metrics.

@@ -307,9 +390,12 @@ def load_leaderboard() -> pd.DataFrame:
         if 'average' in df.columns:
             df = df.drop('average', axis=1)

-        # Sort by average rank across all metrics (lower avg rank = better)
         if 'cvs_acc' in df.columns:
             df = sort_by_avg_rank(df)

         print(f"✓ Loaded leaderboard from private repo: {len(df)} entries")
         return df
@@ -330,9 +416,11 @@ def load_leaderboard() -> pd.DataFrame:
         if 'average' in df.columns:
             df = df.drop('average', axis=1)

-        # Sort by average rank across all metrics (lower avg rank = better)
         if 'cvs_acc' in df.columns:
             df = sort_by_avg_rank(df)

         print(f"✓ Loaded leaderboard from local file: {len(df)} entries")
         return df
@@ -428,6 +516,7 @@ def load_official_leaderboard() -> pd.DataFrame:
         df = pd.DataFrame(data)
         if 'cvs_acc' in df.columns:
             df = sort_by_avg_rank(df)
         print(f"✓ Loaded official leaderboard from private repo: {len(df)} entries")
         return df
     except Exception as e:
@@ -443,6 +532,7 @@ def load_official_leaderboard() -> pd.DataFrame:
         df = pd.DataFrame(data)
         if 'cvs_acc' in df.columns:
             df = sort_by_avg_rank(df)
         print(f"✓ Loaded official leaderboard from local file: {len(df)} entries")
         return df

@@ -2421,14 +2511,14 @@ with gr.Blocks(title="MedVidBench Leaderboard", theme=gr.themes.Soft()) as demo:

 Models are ranked by **average rank across all 10 metrics** — lower average rank = better. For each metric we rank every model (1 = best; ties share the smaller rank), then average those per-metric ranks. This is robust to different metric scales (accuracy 0–1 vs. LLM-judge 1–5) and rewards models that are strong across tasks rather than exceptional on one.

 **Tiebreakers** (applied in order when two models have the same average rank):
 1. **Number of metrics won outright** — a model that's #1 on more metrics wins over one that ties closely on many.
 2. **Sum of per-metric ranks** — catches near-ties where the mean rounded equal.
 3. **Sum of normalized scores** — favors the model with marginally higher absolute scores.
 4. **Model name alphabetical** — final fallback for full determinism.

-This guarantees the same model gets the same rank across the Official and Community tables regardless of submission order.
-
 ---

 ### Benchmark Tasks
 
 }


+def _read_leaderboard_json(filename: str) -> List[dict]:
+    """Read a leaderboard JSON (raw, no sort) from the private HF repo with
+    a local file fallback. Used by apply_global_rank to build the union of
+    official + community models so ranks are consistent across both tables.
+    """
+    try:
+        token = os.environ.get('HF_TOKEN')
+        if token:
+            try:
+                path = hf_hub_download(
+                    repo_id="UII-AI/MedVidBench-GroundTruth",
+                    filename=filename,
+                    repo_type="dataset",
+                    token=token,
+                    cache_dir="./cache",
+                )
+                with open(path) as f:
+                    return json.load(f) or []
+            except Exception:
+                pass
+    except Exception:
+        pass
+    local = PERSISTENT_DIR / filename
+    if local.exists():
+        with open(local) as f:
+            return json.load(f) or []
+    return []
+
+
+def _global_rank_lookup() -> dict:
+    """Compute a global rank for each model by ranking against the union of
+    all submissions (official ∪ community). Returns {model_name: global_rank}.
+
+    This ensures the same model gets the same rank number in either the
+    Official or Community leaderboard view, regardless of which set of
+    competitors is being displayed.
+    """
+    official_raw = _read_leaderboard_json("official_leaderboard.json")
+    community_raw = _read_leaderboard_json("leaderboard.json")
+
+    # Union by model_name; if a name is in both, prefer official entry's
+    # numeric values (they're the maintainer-verified ones, though in
+    # practice they're identical to community).
+    union_by_name = {}
+    for entry in community_raw:
+        name = entry.get("model_name")
+        if name:
+            union_by_name[name] = entry
+    for entry in official_raw:
+        name = entry.get("model_name")
+        if name:
+            union_by_name[name] = entry
+
+    if not union_by_name:
+        return {}
+
+    df = pd.DataFrame(list(union_by_name.values()))
+    if 'cvs_acc' not in df.columns:
+        return {}
+    df = sort_by_avg_rank(df)
+    return {row["model_name"]: int(row["rank"]) for _, row in df.iterrows()}
+
+
+def _apply_global_rank(df: pd.DataFrame) -> pd.DataFrame:
+    """Re-rank the rows of df using global ranks (computed against the union
+    of official + community models). Rows are re-sorted by global rank and
+    the 'rank' column is reassigned to display the global rank value.
+    """
+    if df.empty:
+        return df
+    lookup = _global_rank_lookup()
+    if not lookup:
+        return df
+    df = df.copy()
+    # Fall back to a large number for any model not in the lookup (shouldn't
+    # happen, but keeps the sort total-ordered).
+    df["_global_rank"] = df["model_name"].map(lookup).fillna(10**9).astype(int)
+    df = df.sort_values("_global_rank", ascending=True, kind="mergesort").reset_index(drop=True)
+    df["rank"] = df["_global_rank"]
+    df = df.drop(columns=["_global_rank"])
+    return df
+
+
 def sort_by_avg_rank(df: pd.DataFrame) -> pd.DataFrame:
     """Sort the leaderboard by average rank across all metrics.

 
         if 'average' in df.columns:
             df = df.drop('average', axis=1)

+        # Sort by average rank across all metrics (lower avg rank = better).
+        # Then re-rank against the union of official + community so the
+        # same model gets the same rank in either view.
         if 'cvs_acc' in df.columns:
             df = sort_by_avg_rank(df)
+            df = _apply_global_rank(df)

         print(f"✓ Loaded leaderboard from private repo: {len(df)} entries")
         return df
 
         if 'average' in df.columns:
             df = df.drop('average', axis=1)

+        # Sort by avg-rank, then apply union-based global rank so the
+        # number shown matches the official table for the same model.
         if 'cvs_acc' in df.columns:
             df = sort_by_avg_rank(df)
+            df = _apply_global_rank(df)

         print(f"✓ Loaded leaderboard from local file: {len(df)} entries")
         return df
 
         df = pd.DataFrame(data)
         if 'cvs_acc' in df.columns:
             df = sort_by_avg_rank(df)
+            df = _apply_global_rank(df)
         print(f"✓ Loaded official leaderboard from private repo: {len(df)} entries")
         return df
     except Exception as e:
 
         df = pd.DataFrame(data)
         if 'cvs_acc' in df.columns:
             df = sort_by_avg_rank(df)
+            df = _apply_global_rank(df)
         print(f"✓ Loaded official leaderboard from local file: {len(df)} entries")
         return df

 

 Models are ranked by **average rank across all 10 metrics** — lower average rank = better. For each metric we rank every model (1 = best; ties share the smaller rank), then average those per-metric ranks. This is robust to different metric scales (accuracy 0–1 vs. LLM-judge 1–5) and rewards models that are strong across tasks rather than exceptional on one.

+**Global ranking across views:** the rank shown is computed against the **union of all submissions** (official ∪ community), so the same model gets the same rank number in either the Official or the Community table — even though each table only displays a subset of rows. The Official table omits rows from the global ranking; the rank column shows each row's position in the full ranking, not its position within the visible subset.
+
 **Tiebreakers** (applied in order when two models have the same average rank):
 1. **Number of metrics won outright** — a model that's #1 on more metrics wins over one that ties closely on many.
 2. **Sum of per-metric ranks** — catches near-ties where the mean rounded equal.
 3. **Sum of normalized scores** — favors the model with marginally higher absolute scores.
 4. **Model name alphabetical** — final fallback for full determinism.

 ---

 ### Benchmark Tasks
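
The ranking rule and tiebreaker chain in the About text can be sketched as a single multi-key sort. This is an illustration only, not the body of `sort_by_avg_rank`, which this diff does not show. It assumes min-max scaling for "normalized scores" and reads "won outright" as sole first place on a metric; the model names and metric columns are made up.

```python
import pandas as pd

def rank_with_tiebreakers(df, metrics):
    """Order by average per-metric rank, then apply the four tiebreakers in turn."""
    # Per-metric ranks: 1 = best, ties share the smaller rank.
    ranks = df[metrics].rank(ascending=False, method="min")

    # Count metrics where a model holds first place alone (assumed meaning of "outright").
    is_first = ranks.eq(1)
    sole_first = is_first.loc[:, is_first.sum(axis=0).eq(1)]

    # Assumed normalization: min-max per metric, guarding against zero spread.
    lo, hi = df[metrics].min(), df[metrics].max()
    span = (hi - lo).replace(0, 1)
    normalized = (df[metrics] - lo) / span

    keyed = df.assign(
        avg_rank=ranks.mean(axis=1),
        wins=sole_first.sum(axis=1),
        rank_sum=ranks.sum(axis=1),
        norm_sum=normalized.sum(axis=1),
    )
    ordered = keyed.sort_values(
        ["avg_rank", "wins", "rank_sum", "norm_sum", "model_name"],
        ascending=[True, False, True, False, True],
        kind="mergesort",
    ).reset_index(drop=True)
    ordered["rank"] = ordered.index + 1
    return ordered

# Hypothetical scores: "zeta" and "alpha" tie on average rank (5/3 each).
scores = pd.DataFrame({
    "model_name": ["zeta", "alpha", "mid"],
    "m1": [0.9, 0.8, 0.7],
    "m2": [0.9, 0.8, 0.7],
    "m3": [0.5, 0.9, 0.7],
})
ordered = rank_with_tiebreakers(scores, ["m1", "m2", "m3"])
print(ordered[["model_name", "avg_rank", "wins", "rank"]])
```

In this data "zeta" is placed ahead of "alpha" despite the equal average rank, because it is the sole winner on two metrics against one; the alphabetical fallback is never reached.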