Spaces: UII-AI/MedVidBench-Leaderboard

MedGRPO Team Claude Opus 4.7 (1M context) committed 26 days ago

Commit 03adf33 (1 parent: 3a73928)

Rank against the union of official + community to keep ranks consistent

The same model was getting a different rank in the Official vs Community
views because each view ran sort_by_avg_rank against its own competitor
set (15 vs 35 models). Since the avg-ranks themselves differed per view,
no amount of tiebreaking could make the displayed ranks match.
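
A minimal sketch of the problem described above, not the project's actual code. The metric names `m1`/`m2`, the toy scores, and the helpers `avg_rank` and `display_rank` are hypothetical; the sketch only assumes that per-metric ranks use "ties share the smaller rank" and that higher scores are better. Ranking a model against a smaller competitor set changes its average rank, so the displayed position changes too.

```python
import pandas as pd

METRICS = ["m1", "m2"]  # hypothetical metric columns, higher is better


def avg_rank(df: pd.DataFrame) -> pd.Series:
    """Average of per-metric ranks (1 = best, ties share the smaller rank)."""
    return df[METRICS].rank(ascending=False, method="min").mean(axis=1)


def display_rank(df: pd.DataFrame) -> pd.Series:
    """Position 1..n after sorting by average rank (tiebreakers omitted)."""
    order = avg_rank(df).sort_values(kind="mergesort").index
    return pd.Series(list(range(1, len(df) + 1)), index=order)


# Toy "community" set of four models, indexed by model name.
community = pd.DataFrame(
    {"m1": [0.90, 0.95, 0.70, 0.60], "m2": [0.50, 0.90, 0.40, 0.30]},
    index=["A", "B", "C", "D"],
)
# Toy "official" subset containing only two of them.
official = community.loc[["A", "C"]]

community_rank = display_rank(community)
official_rank = display_rank(official)

# Model C is ranked against different competitors in each view,
# so its per-view rank differs.
print(community_rank["C"], official_rank["C"])
```

Here model C lands in a different position in each view even though its scores are identical, which is the inconsistency the commit addresses.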

Fix: compute a single global ranking against the union of all
submissions (official ∪ community), then map each view's rows to their
global rank number. The same model now shows the same rank in either
table; the Official table just skips rank numbers for rows that aren't
in the official subset (e.g., rank 7 → 9 if rank 8 isn't promoted).
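
The fix can be sketched as follows, under stated assumptions: the helper names `global_rank_lookup` and `apply_global_rank` mirror the functions in the diff but are simplified, hypothetical versions; the metrics and scores are toy data; and tiebreakers are omitted. The ranking is computed once over the union, and each view only looks up its rows' global rank numbers.

```python
import pandas as pd

METRICS = ["m1", "m2"]  # hypothetical metric columns, higher is better


def global_rank_lookup(all_models: pd.DataFrame) -> dict:
    """Rank every model against the full union; return {name: rank}."""
    per_metric = all_models[METRICS].rank(ascending=False, method="min")
    ordered = per_metric.mean(axis=1).sort_values(kind="mergesort").index
    return {name: pos for pos, name in enumerate(ordered, start=1)}


def apply_global_rank(view: pd.DataFrame, lookup: dict) -> pd.DataFrame:
    """Attach the global rank to a view's rows and sort by it."""
    out = view.copy()
    out["rank"] = [lookup[name] for name in out.index]
    return out.sort_values("rank", kind="mergesort")


# Union of official and community submissions, indexed by model name.
union = pd.DataFrame(
    {"m1": [0.90, 0.95, 0.70, 0.60], "m2": [0.50, 0.90, 0.40, 0.30]},
    index=["A", "B", "C", "D"],
)
lookup = global_rank_lookup(union)

official_view = apply_global_rank(union.loc[["B", "C"]], lookup)
community_view = apply_global_rank(union, lookup)

# The official view skips rank 2 because model A is not in the official subset.
print(official_view["rank"].tolist())
```

Model C shows rank 3 in both views; the official view simply has a gap where the non-official model A would sit.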

Update the About-tab explanation to describe the global ranking.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Files changed (1)

app.py +94 -4


@@ -216,6 +216,89 @@ TEST_SET_STATS = {
 }
 def sort_by_avg_rank(df: pd.DataFrame) -> pd.DataFrame:
     """Sort the leaderboard by average rank across all metrics.
@@ -307,9 +390,12 @@ def load_leaderboard() -> pd.DataFrame:
                     if 'average' in df.columns:
                         df = df.drop('average', axis=1)
-                    # Sort by average rank across all metrics (lower avg rank = better)
                     if 'cvs_acc' in df.columns:
                         df = sort_by_avg_rank(df)
                     print(f"✓ Loaded leaderboard from private repo: {len(df)} entries")
                     return df
@@ -330,9 +416,11 @@ def load_leaderboard() -> pd.DataFrame:
             if 'average' in df.columns:
                 df = df.drop('average', axis=1)
-            # Sort by average rank across all metrics (lower avg rank = better)
             if 'cvs_acc' in df.columns:
                 df = sort_by_avg_rank(df)
             print(f"✓ Loaded leaderboard from local file: {len(df)} entries")
             return df
@@ -428,6 +516,7 @@ def load_official_leaderboard() -> pd.DataFrame:
                     df = pd.DataFrame(data)
                     if 'cvs_acc' in df.columns:
                         df = sort_by_avg_rank(df)
                     print(f"✓ Loaded official leaderboard from private repo: {len(df)} entries")
                     return df
             except Exception as e:
@@ -443,6 +532,7 @@ def load_official_leaderboard() -> pd.DataFrame:
             df = pd.DataFrame(data)
             if 'cvs_acc' in df.columns:
                 df = sort_by_avg_rank(df)
             print(f"✓ Loaded official leaderboard from local file: {len(df)} entries")
             return df
@@ -2421,14 +2511,14 @@ with gr.Blocks(title="MedVidBench Leaderboard", theme=gr.themes.Soft()) as demo:
             Models are ranked by **average rank across all 10 metrics** — lower average rank = better. For each metric we rank every model (1 = best; ties share the smaller rank), then average those per-metric ranks. This is robust to different metric scales (accuracy 0–1 vs. LLM-judge 1–5) and rewards models that are strong across tasks rather than exceptional on one.
             **Tiebreakers** (applied in order when two models have the same average rank):
             1. **Number of metrics won outright** — a model that's #1 on more metrics wins over one that ties closely on many.
             2. **Sum of per-metric ranks** — catches near-ties where the mean rounded equal.
             3. **Sum of normalized scores** — favors the model with marginally higher absolute scores.
             4. **Model name alphabetical** — final fallback for full determinism.
-            This guarantees the same model gets the same rank across the Official and Community tables regardless of submission order.
             ---
             ### Benchmark Tasks

 }
+def _read_leaderboard_json(filename: str) -> List[dict]:
+    """Read a leaderboard JSON (raw, no sort) from the private HF repo with
+    a local file fallback. Used by apply_global_rank to build the union of
+    official + community models so ranks are consistent across both tables.
+    """
+    try:
+        token = os.environ.get('HF_TOKEN')
+        if token:
+            try:
+                path = hf_hub_download(
+                    repo_id="UII-AI/MedVidBench-GroundTruth",
+                    filename=filename,
+                    repo_type="dataset",
+                    token=token,
+                    cache_dir="./cache",
+                )
+                with open(path) as f:
+                    return json.load(f) or []
+            except Exception:
+                pass
+    except Exception:
+        pass
+    local = PERSISTENT_DIR / filename
+    if local.exists():
+        with open(local) as f:
+            return json.load(f) or []
+    return []
+def _global_rank_lookup() -> dict:
+    """Compute a global rank for each model by ranking against the union of
+    all submissions (official ∪ community). Returns {model_name: global_rank}.
+    This ensures the same model gets the same rank number in either the
+    Official or Community leaderboard view, regardless of which set of
+    competitors is being displayed.
+    """
+    official_raw = _read_leaderboard_json("official_leaderboard.json")
+    community_raw = _read_leaderboard_json("leaderboard.json")
+    # Union by model_name; if a name is in both, prefer official entry's
+    # numeric values (they're the maintainer-verified ones, though in
+    # practice they're identical to community).
+    union_by_name = {}
+    for entry in community_raw:
+        name = entry.get("model_name")
+        if name:
+            union_by_name[name] = entry
+    for entry in official_raw:
+        name = entry.get("model_name")
+        if name:
+            union_by_name[name] = entry
+    if not union_by_name:
+        return {}
+    df = pd.DataFrame(list(union_by_name.values()))
+    if 'cvs_acc' not in df.columns:
+        return {}
+    df = sort_by_avg_rank(df)
+    return {row["model_name"]: int(row["rank"]) for _, row in df.iterrows()}
+def _apply_global_rank(df: pd.DataFrame) -> pd.DataFrame:
+    """Re-rank the rows of df using global ranks (computed against the union
+    of official + community models). Rows are re-sorted by global rank and
+    the 'rank' column is reassigned to display the global rank value.
+    """
+    if df.empty:
+        return df
+    lookup = _global_rank_lookup()
+    if not lookup:
+        return df
+    df = df.copy()
+    # Fall back to a large number for any model not in the lookup (shouldn't
+    # happen, but keeps the sort total-ordered).
+    df["_global_rank"] = df["model_name"].map(lookup).fillna(10**9).astype(int)
+    df = df.sort_values("_global_rank", ascending=True, kind="mergesort").reset_index(drop=True)
+    df["rank"] = df["_global_rank"]
+    df = df.drop(columns=["_global_rank"])
+    return df
 def sort_by_avg_rank(df: pd.DataFrame) -> pd.DataFrame:
     """Sort the leaderboard by average rank across all metrics.
                     if 'average' in df.columns:
                         df = df.drop('average', axis=1)
+                    # Sort by average rank across all metrics (lower avg rank = better).
+                    # Then re-rank against the union of official + community so the
+                    # same model gets the same rank in either view.
                     if 'cvs_acc' in df.columns:
                         df = sort_by_avg_rank(df)
+                        df = _apply_global_rank(df)
                     print(f"✓ Loaded leaderboard from private repo: {len(df)} entries")
                     return df
             if 'average' in df.columns:
                 df = df.drop('average', axis=1)
+            # Sort by avg-rank, then apply union-based global rank so the
+            # number shown matches the official table for the same model.
             if 'cvs_acc' in df.columns:
                 df = sort_by_avg_rank(df)
+                df = _apply_global_rank(df)
             print(f"✓ Loaded leaderboard from local file: {len(df)} entries")
             return df
                     df = pd.DataFrame(data)
                     if 'cvs_acc' in df.columns:
                         df = sort_by_avg_rank(df)
+                        df = _apply_global_rank(df)
                     print(f"✓ Loaded official leaderboard from private repo: {len(df)} entries")
                     return df
             except Exception as e:
             df = pd.DataFrame(data)
             if 'cvs_acc' in df.columns:
                 df = sort_by_avg_rank(df)
+                df = _apply_global_rank(df)
             print(f"✓ Loaded official leaderboard from local file: {len(df)} entries")
             return df
             Models are ranked by **average rank across all 10 metrics** — lower average rank = better. For each metric we rank every model (1 = best; ties share the smaller rank), then average those per-metric ranks. This is robust to different metric scales (accuracy 0–1 vs. LLM-judge 1–5) and rewards models that are strong across tasks rather than exceptional on one.
+            **Global ranking across views:** the rank shown is computed against the **union of all submissions** (official ∪ community), so the same model gets the same rank number in either the Official or the Community table, even though each table only displays a subset of rows. The Official table shows only the official subset of the global ranking, so its rank numbers can skip values; the rank column shows each row's position in the full ranking, not its position within the visible subset.
             **Tiebreakers** (applied in order when two models have the same average rank):
             1. **Number of metrics won outright** — a model that's #1 on more metrics wins over one that ties closely on many.
             2. **Sum of per-metric ranks** — catches near-ties where the mean rounded equal.
             3. **Sum of normalized scores** — favors the model with marginally higher absolute scores.
             4. **Model name alphabetical** — final fallback for full determinism.
             ---
             ### Benchmark Tasks
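
The body of `sort_by_avg_rank` is not shown in this diff, so the following is only a sketch of how the ranking and tiebreaker order described in the About-tab text could be expressed with pandas. The function name `rank_with_tiebreakers`, the two toy metrics, the reading of "won outright" as being the sole holder of the best score, and the use of min-max normalization are all assumptions made for illustration.

```python
import pandas as pd

METRICS = ["m1", "m2"]  # hypothetical metric columns, higher is better


def rank_with_tiebreakers(df: pd.DataFrame) -> pd.DataFrame:
    scores = df[METRICS]
    # Per-metric ranks: 1 = best, ties share the smaller rank.
    ranks = scores.rank(ascending=False, method="min")
    # Outright wins (assumption): metrics where the model alone has the top score.
    is_best = scores.eq(scores.max())
    wins = is_best.loc[:, is_best.sum() == 1].sum(axis=1)
    # Min-max normalization per metric (assumption); constant columns map to 0.
    span = (scores.max() - scores.min()).replace(0, 1)
    normalized = (scores - scores.min()) / span
    keys = pd.DataFrame({
        "model_name": df["model_name"],
        "avg_rank": ranks.mean(axis=1),
        "wins": wins,
        "rank_sum": ranks.sum(axis=1),
        "norm_sum": normalized.sum(axis=1),
    })
    # Lower avg rank, more wins, lower rank sum, higher normalized sum, then name.
    order = keys.sort_values(
        by=["avg_rank", "wins", "rank_sum", "norm_sum", "model_name"],
        ascending=[True, False, True, False, True],
    ).index
    out = df.loc[order].reset_index(drop=True)
    out.insert(0, "rank", list(range(1, len(out) + 1)))
    return out


models = pd.DataFrame({
    "model_name": ["X", "Y", "Z"],
    "m1": [0.9, 0.8, 0.1],
    "m2": [0.5, 0.6, 0.1],
})
ranked = rank_with_tiebreakers(models)
print(ranked[["rank", "model_name"]])
```

In this toy data X and Y tie on average rank, outright wins, and rank sum, so the ordering between them falls through to the normalized-score sum.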