SaylorTwift (HF Staff) committed
Commit fa808e0 · verified · 1 parent: 40e90a6

Migrate to Gradio app with interactive features

Replace static HTML with Gradio application featuring:
- Interactive search and filtering
- Model size range slider
- Benchmark category filters
- Quick filter presets (size and category)
- Sortable columns
- CSV export
- Provider logos and clickable model links
- Color-coded scores by benchmark category
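
The search and size-range filtering listed above can be sketched roughly as follows. This is a minimal illustration only: `filter_models` and the sample rows are hypothetical, the real logic lives in `utils/filters.py` (not shown in this diff) and works on a pandas DataFrame, and dropping models of unknown size is an assumption. The field names `model_name`, `provider`, and `parameters_billions` are taken from `utils/data_loader.py`.

```python
# Hypothetical rows; the real app loads a pandas DataFrame from the dataset.
MODELS = [
    {"model_name": "alpha-7b", "provider": "acme", "parameters_billions": 7.0},
    {"model_name": "beta-70b", "provider": "acme", "parameters_billions": 70.0},
    {"model_name": "gamma-400b", "provider": "globex", "parameters_billions": 400.0},
    {"model_name": "delta", "provider": "globex", "parameters_billions": None},
]


def filter_models(rows, search_term, size_min, size_max):
    """Keep rows that match the search term and fall inside [size_min, size_max]."""
    term = (search_term or "").strip().lower()
    kept = []
    for row in rows:
        # Case-insensitive substring match on model name or provider.
        haystack = f"{row['model_name']} {row['provider']}".lower()
        if term and term not in haystack:
            continue
        size = row["parameters_billions"]
        # Assumption: models with an unknown size are dropped by the size filter.
        if size is None or not (size_min <= size <= size_max):
            continue
        kept.append(row)
    return kept


small = filter_models(MODELS, "", 0, 10)
print([row["model_name"] for row in small])
```

Both bounds are inclusive in this sketch, so a model of exactly 100B would appear under both the Medium (10-100B) and Large (100B+) presets.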

README.md CHANGED
@@ -3,100 +3,101 @@ title: Official Benchmarks Leaderboard 2026
3
  emoji: 🏆
4
  colorFrom: purple
5
  colorTo: blue
6
- sdk: static
 
 
7
  pinned: false
8
- hf_oauth: true
9
- hf_oauth_expiration_minutes: 480
10
- hf_oauth_scopes:
11
- - email
12
- - read-repos
13
- - gated-repos
14
  ---
15
 
16
- # Official Benchmarks Leaderboard 2026
17
 
18
- A unified leaderboard aggregating scores from 12 official HuggingFace benchmarks, covering diverse AI capabilities from mathematical reasoning to coding, vision, and language understanding.
19
 
20
- ## Features
21
 
22
- - 📊 **12 Official Benchmarks**: GSM8K, MMLU-Pro, GPQA, HLE, olmOCR, SWE-bench, and more
23
- - 🔓 **OAuth Authentication**: Sign in with HuggingFace to access gated datasets (GPQA, HLE)
24
- - 🎨 **Beautiful Design**: Modern gradient UI with dark mode support
25
- - 🔍 **Interactive Filters**: Search and filter models by provider and type
26
- - 📈 **Real-time Data**: Fetched directly from official HuggingFace APIs
27
- - 🏢 **Provider Logos**: Official organization avatars from HuggingFace
 
 
28
 
29
- ## Benchmarks Included
30
 
31
- ### Math & Reasoning
32
  - **GSM8K**: Grade School Math (8.5K problems)
33
  - **AIME 2026**: American Invitational Mathematics Examination
34
- - **HMMT Feb 2026**: Harvard-MIT Mathematics Tournament
35
 
36
- ### Knowledge & Understanding
37
- - **MMLU-Pro**: Massive Multi-task Language Understanding (57K questions)
38
- - **GPQA Diamond**: PhD-level expert questions (🔒 gated)
39
- - **HLE**: Humanity's Last Exam (🔒 gated)
40
 
41
- ### Coding
42
  - **SWE-bench Verified**: Real-world software engineering tasks
43
  - **SWE-bench Pro**: Advanced software engineering challenges
44
 
45
- ### Vision
46
  - **olmOCR**: OCR evaluation benchmark
47
 
48
- ### Other
49
  - **Terminal-Bench 2.0**: Terminal command understanding
50
- - **ArguAna**: MTEB text retrieval
51
- - **EvasionBench**: Language understanding challenges
52
 
53
- ## OAuth & Gated Datasets
 
54
 
55
- This Space uses OAuth to access gated datasets like GPQA and HLE.
56
 
57
- **To access all benchmarks:**
58
- 1. Click "Sign in with HuggingFace" button
59
- 2. Grant permissions to access gated repositories
60
- 3. The leaderboard will automatically fetch data from gated benchmarks
61
 
62
- **Required Scopes:**
63
- - `openid`, `profile`: User identification
64
- - `read-repos`: Access to your repositories
65
- - `gated-repos`: Access to gated datasets you've been granted access to
66
 
67
- ## Data Sources
68
 
69
- All scores are fetched from official HuggingFace leaderboard APIs:
70
- - API Pattern: `https://huggingface.co/api/datasets/{org}/{dataset}/leaderboard`
71
- - Provider logos: `https://huggingface.co/api/organizations/{org}/avatar`
72
 
73
- ## Development
74
 
75
- ### Fetching Latest Data
76
 
77
  ```bash
78
- # Fetch all public benchmarks
79
- python3 scripts/fetch_api_only.py
80
 
81
- # Fetch provider logos
82
- python3 scripts/fetch_provider_logos.py
83
  ```
84
 
85
- ### Project Structure
86
 
87
  ```
88
  .
89
- ├── benchmarks.html # Main leaderboard page
90
  ├── data/
91
- │ ├── leaderboard.json # Model scores and metadata
92
- └── provider_logos.json # Provider avatar URLs
93
- ├── scripts/
94
- │ ├── fetch_api_only.py # Fetch benchmark data
95
- │ └── fetch_provider_logos.py # Fetch provider logos
96
- └── README.md
97
  ```
98
 
99
- ## License
100
 
101
  Data is sourced from official HuggingFace benchmarks. Please refer to individual benchmark pages for specific licensing information.
102
 
 
3
  emoji: 🏆
4
  colorFrom: purple
5
  colorTo: blue
6
+ sdk: gradio
7
+ sdk_version: 5.50.0
8
+ app_file: app.py
9
  pinned: false
10
  ---
11
 
12
+ # 🏆 Official Benchmarks Leaderboard 2026
13
 
14
+ A unified leaderboard for **11 official HuggingFace benchmarks**. Compare AI models across math, coding, knowledge, vision, agent, and language tasks.
15
 
16
+ ## Features
17
 
18
+ - 📊 **11 Official Benchmarks**: GSM8K, MMLU-Pro, GPQA, HLE, SWE-bench, AIME, HMMT, and more
19
+ - 🎛️ **Quick Filters**: One-click presets for model sizes and benchmark categories
20
+ - 🔍 **Interactive Search**: Filter by model name or provider
21
+ - 📏 **Size Range Slider**: Filter models by parameter count (0-1100B+)
22
+ - 🎯 **Category Selection**: Choose specific benchmark categories to display
23
+ - 📥 **Export CSV**: Download filtered leaderboard data
24
+ - 🔄 **Sortable Columns**: Click any header to sort the table
25
+ - 🎨 **Modern Design**: Clean, responsive interface with provider logos
26
 
27
+ ## 🎯 Benchmarks Included
28
 
29
+ ### 📐 Math
30
  - **GSM8K**: Grade School Math (8.5K problems)
31
  - **AIME 2026**: American Invitational Mathematics Examination
32
+ - **HMMT 2026**: Harvard-MIT Mathematics Tournament
33
 
34
+ ### 🧠 Knowledge
35
+ - **MMLU-Pro**: Massive Multi-task Language Understanding
36
+ - **GPQA Diamond**: PhD-level expert questions
37
+ - **HLE**: Humanity's Last Exam
38
 
39
+ ### 💻 Coding
40
  - **SWE-bench Verified**: Real-world software engineering tasks
41
  - **SWE-bench Pro**: Advanced software engineering challenges
42
 
43
+ ### 👁️ Vision
44
  - **olmOCR**: OCR evaluation benchmark
45
 
46
+ ### 🤖 Agent
47
  - **Terminal-Bench 2.0**: Terminal command understanding
 
 
48
 
49
+ ### 💬 Language
50
+ - **EvasionBench**: Language understanding challenges
51
 
52
+ ## 🚀 Quick Start
53
 
54
+ The leaderboard loads automatically from the HuggingFace dataset: `OpenEvals/leaderboard-data`
 
 
 
55
 
56
+ **Quick Filters:**
57
+ - 🔹 **Small (<10B)**, 🔸 **Medium (10-100B)**, 🔶 **Large (100B+)** - Filter by model size
58
+ - 💻 **Coding**, 🧠 **Knowledge**, 📐 **Math**, etc. - Show only specific categories
 
59
 
60
+ ## 📊 Data Source
61
 
62
+ **Dataset**: [OpenEvals/leaderboard-data](https://huggingface.co/datasets/OpenEvals/leaderboard-data)
 
 
63
 
64
+ All scores are aggregated from official HuggingFace benchmark leaderboards. The dataset is updated regularly with the latest model evaluations.
65
 
66
+ ## 💻 Local Development
67
 
68
  ```bash
69
+ # Install dependencies
70
+ pip install -r requirements.txt
71
 
72
+ # Run the app
73
+ python app.py
74
  ```
75
 
76
+ ## 📁 Project Structure
77
 
78
  ```
79
  .
80
+ ├── app.py # Main Gradio application
81
+ ├── utils/
82
+ │ ├── data_loader.py # Load data from HuggingFace dataset
83
+ │ ├── filters.py # Filter and search logic
84
+ │ ├── formatters.py # Data formatting utilities
85
+ │ └── html_generator.py # Generate HTML leaderboard table
86
+ ├── static/
87
+ │ └── sortTable.js # Client-side table sorting
88
  ├── data/
89
+ │ └── provider_logos.json # Provider avatar URLs
90
+ └── requirements.txt # Python dependencies
91
  ```
92
 
93
+ ## 🔧 Technologies
94
+
95
+ - **Gradio 5.50.0**: Interactive web interface
96
+ - **Datasets**: HuggingFace datasets library
97
+ - **Pandas**: Data manipulation
98
+ - **RangeSlider**: Custom Gradio component for size filtering
99
+
100
+ ## 📝 License
101
 
102
  Data is sourced from official HuggingFace benchmarks. Please refer to individual benchmark pages for specific licensing information.
103
 
app.py ADDED
@@ -0,0 +1,597 @@
1
+ """
2
+ Official Benchmarks Leaderboard 2026 - Gradio App
3
+
4
+ A unified leaderboard aggregating scores from 11 official HuggingFace benchmarks.
5
+ """
6
+
7
+ import gradio as gr
8
+ import pandas as pd
9
+ from gradio_rangeslider import RangeSlider
10
+ from utils.data_loader import (
11
+ load_leaderboard_data,
12
+ get_benchmark_info,
13
+ load_provider_logos,
14
+ )
15
+ from utils.filters import filter_data, calculate_stats, parse_benchmark_selections
16
+ from utils.formatters import format_for_display, create_empty_table, prepare_export_data
17
+ from utils.html_generator import generate_leaderboard_html
18
+
19
+ # Global data cache
20
+ leaderboard_data = None
21
+ provider_logos = None
22
+
23
+
24
+ def initialize_data():
25
+ """Load initial data on app startup."""
26
+ global leaderboard_data, provider_logos
27
+ leaderboard_data = load_leaderboard_data()
28
+ provider_logos = load_provider_logos()
29
+ return leaderboard_data
30
+
31
+
32
+ def refresh_data():
33
+ """Reload data from HuggingFace dataset."""
34
+ global leaderboard_data
35
+ print("Refreshing data from HuggingFace...")
36
+ leaderboard_data = load_leaderboard_data()
37
+
38
+ # Show a toast; the caller re-renders the table with the current filters
39
+ gr.Info("Data refreshed successfully!")
40
+
41
+
42
+ def update_table(
43
+ search_term,
44
+ size_range,
45
+ bench_math,
46
+ bench_knowledge,
47
+ bench_coding,
48
+ bench_vision,
49
+ bench_agent,
50
+ bench_language,
51
+ ):
52
+ """
53
+ Update the leaderboard table based on all filters.
54
+
55
+ Returns:
56
+ tuple: (html_string, num_models, num_benchmarks, num_scores)
57
+ """
58
+ # Extract min and max from range slider tuple
59
+ size_min, size_max = size_range
60
+
61
+ # Parse benchmark selections from all checkbox groups
62
+ selected_benchmarks = parse_benchmark_selections(
63
+ bench_math,
64
+ bench_knowledge,
65
+ bench_coding,
66
+ bench_vision,
67
+ bench_agent,
68
+ bench_language,
69
+ )
70
+
71
+ # Handle case where no benchmarks are selected
72
+ if not selected_benchmarks:
73
+ empty_html = generate_leaderboard_html(pd.DataFrame(), [], provider_logos)
74
+ return empty_html, 0, 0, 0
75
+
76
+ # Filter the data
77
+ filtered_df = filter_data(
78
+ leaderboard_data, search_term, size_min, size_max, selected_benchmarks
79
+ )
80
+
81
+ # Calculate statistics
82
+ stats = calculate_stats(filtered_df, selected_benchmarks)
83
+
84
+ # Generate HTML table
85
+ html_table = generate_leaderboard_html(
86
+ filtered_df, selected_benchmarks, provider_logos
87
+ )
88
+
89
+ return (html_table, stats["models"], stats["benchmarks"], stats["scores"])
90
+
91
+
92
+ def select_all_benchmarks():
93
+ """Select all benchmark checkboxes."""
94
+ return (
95
+ ["GSM8K", "AIME 2026", "HMMT"], # Math
96
+ ["MMLU-Pro", "GPQA", "HLE"], # Knowledge
97
+ ["SWE-V", "SWE-Pro"], # Coding
98
+ ["olmOCR"], # Vision
99
+ ["TB 2.0"], # Agent
100
+ ["EvasionB"], # Language
101
+ )
102
+
103
+
104
+ def clear_all_benchmarks():
105
+ """Clear all benchmark checkboxes."""
106
+ return [], [], [], [], [], []
107
+
108
+
109
+ def export_to_csv(
110
+ search_term,
111
+ size_range,
112
+ bench_math,
113
+ bench_knowledge,
114
+ bench_coding,
115
+ bench_vision,
116
+ bench_agent,
117
+ bench_language,
118
+ ):
119
+ """
120
+ Export filtered data to CSV file.
121
+
122
+ Returns:
123
+ str: Path to temporary CSV file
124
+ """
125
+ # Extract min and max from range slider tuple
126
+ size_min, size_max = size_range
127
+
128
+ # Parse benchmark selections
129
+ selected_benchmarks = parse_benchmark_selections(
130
+ bench_math,
131
+ bench_knowledge,
132
+ bench_coding,
133
+ bench_vision,
134
+ bench_agent,
135
+ bench_language,
136
+ )
137
+
138
+ if not selected_benchmarks:
139
+ return None
140
+
141
+ # Filter the data
142
+ filtered_df = filter_data(
143
+ leaderboard_data, search_term, size_min, size_max, selected_benchmarks
144
+ )
145
+
146
+ # Prepare for export (without HTML/markdown)
147
+ export_df = prepare_export_data(filtered_df, selected_benchmarks)
148
+
149
+ # Save to temporary file
150
+ import tempfile
151
+
152
+ tmp_file = tempfile.NamedTemporaryFile(mode="w", delete=False, suffix=".csv")
153
+ export_df.to_csv(tmp_file.name, index=False)
154
+ tmp_file.close()
155
+
156
+ return tmp_file.name
157
+
158
+
159
+ # Minimal CSS - only for leaderboard table
160
+ custom_css = """
161
+ /* Leaderboard table container */
162
+ .leaderboard-html-container {
163
+ margin-top: 16px;
164
+ }
165
+ """
166
+
167
+ # JavaScript to enable table sorting
168
+ custom_js = """
169
+ function() {
170
+ // Load and execute the sorting script
171
+ const script = document.createElement('script');
172
+ script.textContent = `
173
+ let currentSortColumn = null;
174
+ let currentSortDirection = 'desc';
175
+
176
+ function sortTable(colIndex) {
177
+ const table = document.querySelector('#leaderboardTable');
178
+ if (!table) return;
179
+
180
+ const tbody = table.querySelector('tbody');
181
+ if (!tbody) return;
182
+
183
+ const rows = Array.from(tbody.querySelectorAll('tr'));
184
+
185
+ if (currentSortColumn === colIndex) {
186
+ currentSortDirection = currentSortDirection === 'desc' ? 'asc' : 'desc';
187
+ } else {
188
+ currentSortColumn = colIndex;
189
+ currentSortDirection = 'desc';
190
+ }
191
+
192
+ rows.sort((a, b) => {
193
+ if (colIndex === 0) {
194
+ const aVal = a.dataset.name || '';
195
+ const bVal = b.dataset.name || '';
196
+ return currentSortDirection === 'asc' ? aVal.localeCompare(bVal) : bVal.localeCompare(aVal);
197
+ } else {
198
+ const aCell = a.cells[colIndex];
199
+ const bCell = b.cells[colIndex];
200
+ const aText = aCell ? aCell.textContent.trim() : '';
201
+ const bText = bCell ? bCell.textContent.trim() : '';
202
+ const aScore = aText === '—' ? NaN : parseFloat(aText);
203
+ const bScore = bText === '—' ? NaN : parseFloat(bText);
204
+
205
+ if (isNaN(aScore) && isNaN(bScore)) return 0;
206
+ if (isNaN(aScore)) return 1;
207
+ if (isNaN(bScore)) return -1;
208
+
209
+ return currentSortDirection === 'desc' ? bScore - aScore : aScore - bScore;
210
+ }
211
+ });
212
+
213
+ rows.forEach(row => tbody.appendChild(row));
214
+ updateSortIndicators(colIndex);
215
+ }
216
+
217
+ function updateSortIndicators(colIndex) {
218
+ const headers = document.querySelectorAll('#leaderboardTable thead th');
219
+ headers.forEach((th, index) => {
220
+ const sortArrow = th.querySelector('.sa');
221
+ if (sortArrow) {
222
+ if (index === colIndex) {
223
+ sortArrow.textContent = currentSortDirection === 'desc' ? '↓' : '↑';
224
+ th.classList.add('sorted');
225
+ } else {
226
+ sortArrow.textContent = '↕';
227
+ th.classList.remove('sorted');
228
+ }
229
+ }
230
+ });
231
+ }
232
+ `;
233
+ document.head.appendChild(script);
234
+ }
235
+ """
236
+
237
+ # Build the Gradio interface
238
+ with gr.Blocks(
239
+ title="Official Benchmarks Leaderboard 2026", css=custom_css, js=custom_js
240
+ ) as app:
241
+ # Header
242
+ gr.Markdown("# 🏆 Official Benchmarks Leaderboard 2026")
243
+ gr.Markdown(
244
+ "Unified leaderboard for **11 official Hugging Face benchmarks**. "
245
+ "Compare AI models across math, coding, knowledge, vision, agent, and language tasks."
246
+ )
247
+
248
+ # Statistics row
249
+ with gr.Row():
250
+ stat_models = gr.Number(
251
+ label="📊 Models", value=0, precision=0, interactive=False
252
+ )
253
+ stat_benchmarks = gr.Number(
254
+ label="🎯 Benchmarks", value=11, precision=0, interactive=False
255
+ )
256
+ stat_scores = gr.Number(
257
+ label="✅ Total Scores", value=0, precision=0, interactive=False
258
+ )
259
+
260
+ # Quick filter presets
261
+ with gr.Row():
262
+ gr.Markdown("**Quick Filters:**")
263
+ preset_small = gr.Button("🔹 Small (<10B)", size="sm", variant="secondary")
264
+ preset_medium = gr.Button("🔸 Medium (10-100B)", size="sm", variant="secondary")
265
+ preset_large = gr.Button("🔶 Large (100B+)", size="sm", variant="secondary")
266
+
267
+ with gr.Row():
268
+ gr.Markdown("**By Category:**")
269
+ preset_coding = gr.Button("💻 Coding", size="sm", variant="secondary")
270
+ preset_knowledge = gr.Button("🧠 Knowledge", size="sm", variant="secondary")
271
+ preset_math = gr.Button("📐 Math", size="sm", variant="secondary")
272
+ preset_vision = gr.Button("👁️ Vision", size="sm", variant="secondary")
273
+ preset_agent = gr.Button("🤖 Agent", size="sm", variant="secondary")
274
+ preset_language = gr.Button("💬 Language", size="sm", variant="secondary")
275
+
276
+ # Filters Section
277
+ with gr.Accordion("🎛️ Filters & Settings", open=True):
278
+ # Search, Size Range, and Refresh on same row
279
+ with gr.Row():
280
+ search_box = gr.Textbox(
281
+ label="🔍 Search", placeholder="Try 'Llama', 'GPT', 'Qwen'...", scale=2
282
+ )
283
+ size_range = RangeSlider(
284
+ minimum=0,
285
+ maximum=1100,
286
+ value=(0, 1100),
287
+ step=10,
288
+ label="📏 Size Range (Billions)",
289
+ scale=2,
290
+ )
291
+ refresh_btn = gr.Button("🔄 Refresh", scale=1)
292
+
293
+ # Benchmark category filters
294
+ gr.Markdown("### 🎯 Benchmarks")
295
+
296
+ with gr.Row():
297
+ with gr.Column(scale=1):
298
+ bench_math = gr.CheckboxGroup(
299
+ choices=["GSM8K", "AIME 2026", "HMMT"],
300
+ value=["GSM8K", "AIME 2026", "HMMT"],
301
+ label="📐 Math",
302
+ )
303
+ with gr.Column(scale=1):
304
+ bench_knowledge = gr.CheckboxGroup(
305
+ choices=["MMLU-Pro", "GPQA", "HLE"],
306
+ value=["MMLU-Pro", "GPQA", "HLE"],
307
+ label="🧠 Knowledge",
308
+ )
309
+ with gr.Column(scale=1):
310
+ bench_coding = gr.CheckboxGroup(
311
+ choices=["SWE-V", "SWE-Pro"],
312
+ value=["SWE-V", "SWE-Pro"],
313
+ label="💻 Coding",
314
+ )
315
+ with gr.Column(scale=1):
316
+ bench_vision = gr.CheckboxGroup(
317
+ choices=["olmOCR"], value=[], label="👁️ Vision"
318
+ )
319
+ with gr.Column(scale=1):
320
+ bench_agent = gr.CheckboxGroup(
321
+ choices=["TB 2.0"], value=["TB 2.0"], label="🤖 Agent"
322
+ )
323
+ with gr.Column(scale=1):
324
+ bench_language = gr.CheckboxGroup(
325
+ choices=["EvasionB"], value=["EvasionB"], label="💬 Language"
326
+ )
327
+
328
+ # Quick actions for benchmark selection
329
+ with gr.Row():
330
+ select_all_btn = gr.Button("✓ Select All", size="sm")
331
+ clear_all_btn = gr.Button("✗ Clear All", size="sm")
332
+
333
+ # Status message for user feedback
334
+ status_msg = gr.Markdown("", visible=False)
335
+
336
+ # Main leaderboard table
337
+ gr.Markdown("## 📊 Leaderboard")
338
+ gr.Markdown("*💡 Tip: Click any column header to sort the table*")
339
+
340
+ leaderboard_table = gr.HTML(
341
+ value="<div style='text-align:center;padding:40px;color:#94a3b8;'>Loading leaderboard data...</div>",
342
+ label="",
343
+ elem_classes="leaderboard-html-container",
344
+ )
345
+
346
+ # Export button with better feedback
347
+ with gr.Row():
348
+ export_btn = gr.Button("📥 Export CSV", size="sm")
349
+ export_file = gr.File(label="Download", visible=False)
350
+
351
+ # Footer
352
+ gr.Markdown(
353
+ "---\n"
354
+ "**Data Source**: [OpenEvals/leaderboard-data](https://huggingface.co/datasets/OpenEvals/leaderboard-data) | "
355
+ "**Open Source Models Only** | "
356
+ "Made with ❤️ by the Benchmarks Team"
357
+ )
358
+
359
+ # Define all filter inputs
360
+ filter_inputs = [
361
+ search_box,
362
+ size_range,
363
+ bench_math,
364
+ bench_knowledge,
365
+ bench_coding,
366
+ bench_vision,
367
+ bench_agent,
368
+ bench_language,
369
+ ]
370
+
371
+ # Define all outputs
372
+ table_outputs = [leaderboard_table, stat_models, stat_benchmarks, stat_scores]
373
+
374
+ benchmark_outputs = [
375
+ bench_math,
376
+ bench_knowledge,
377
+ bench_coding,
378
+ bench_vision,
379
+ bench_agent,
380
+ bench_language,
381
+ ]
382
+
383
+ # Event handlers - attach update_table to all filter changes
384
+ # Use trigger_mode for smoother interactions (debounce on typing)
385
+ search_box.change(
386
+ fn=update_table,
387
+ inputs=filter_inputs,
388
+ outputs=table_outputs,
389
+ show_progress="hidden",
390
+ trigger_mode="always_last", # While a call is pending, queue only the latest change
391
+ )
392
+
393
+ # Other filters update immediately
394
+ for filter_input in [
395
+ size_range,
396
+ bench_math,
397
+ bench_knowledge,
398
+ bench_coding,
399
+ bench_vision,
400
+ bench_agent,
401
+ bench_language,
402
+ ]:
403
+ filter_input.change(
404
+ fn=update_table,
405
+ inputs=filter_inputs,
406
+ outputs=table_outputs,
407
+ show_progress="minimal",
408
+ )
409
+
410
+ # Refresh button - reloads data and updates table
411
+ def refresh_and_update(*filter_args):
412
+ refresh_data()
413
+ return update_table(*filter_args)
414
+
415
+ refresh_btn.click(
416
+ fn=refresh_and_update,
417
+ inputs=filter_inputs,
418
+ outputs=table_outputs,
419
+ show_progress="full",
420
+ )
421
+
422
+ # Select All / Clear All buttons
423
+ select_all_btn.click(fn=select_all_benchmarks, outputs=benchmark_outputs).then(
424
+ fn=update_table, inputs=filter_inputs, outputs=table_outputs
425
+ )
426
+
427
+ clear_all_btn.click(fn=clear_all_benchmarks, outputs=benchmark_outputs).then(
428
+ fn=update_table, inputs=filter_inputs, outputs=table_outputs
429
+ )
430
+
431
+ # Export button with success message
432
+ def export_with_feedback(*args):
433
+ filepath = export_to_csv(*args)
434
+ return gr.File(value=filepath, visible=True)
435
+
436
+ export_btn.click(
437
+ fn=export_with_feedback,
438
+ inputs=filter_inputs,
439
+ outputs=[export_file],
440
+ show_progress="minimal",
441
+ )
442
+
443
+ # Preset filter handlers
444
+ def apply_small_models():
445
+ return "", (0, 10) # search, size_range
446
+
447
+ def apply_medium_models():
448
+ return "", (10, 100)
449
+
450
+ def apply_large_models():
451
+ return "", (100, 1100)
452
+
453
+ # Category filter functions - deselect all except the chosen category
454
+ def apply_coding_filter():
455
+ return (
456
+ "",
457
+ (0, 1100),
458
+ [],
459
+ [],
460
+ ["SWE-V", "SWE-Pro"],
461
+ [],
462
+ [],
463
+ [],
464
+ ) # search, size_range, math, knowledge, coding, vision, agent, language
465
+
466
+ def apply_knowledge_filter():
467
+ return "", (0, 1100), [], ["MMLU-Pro", "GPQA", "HLE"], [], [], [], []
468
+
469
+ def apply_math_filter():
470
+ return "", (0, 1100), ["GSM8K", "AIME 2026", "HMMT"], [], [], [], [], []
471
+
472
+ def apply_vision_filter():
473
+ return "", (0, 1100), [], [], [], ["olmOCR"], [], []
474
+
475
+ def apply_agent_filter():
476
+ return "", (0, 1100), [], [], [], [], ["TB 2.0"], []
477
+
478
+ def apply_language_filter():
479
+ return "", (0, 1100), [], [], [], [], [], ["EvasionB"]
480
+
481
+ # Size preset handlers
482
+ preset_small.click(fn=apply_small_models, outputs=[search_box, size_range]).then(
483
+ fn=update_table, inputs=filter_inputs, outputs=table_outputs
484
+ )
485
+
486
+ preset_medium.click(fn=apply_medium_models, outputs=[search_box, size_range]).then(
487
+ fn=update_table, inputs=filter_inputs, outputs=table_outputs
488
+ )
489
+
490
+ preset_large.click(fn=apply_large_models, outputs=[search_box, size_range]).then(
491
+ fn=update_table, inputs=filter_inputs, outputs=table_outputs
492
+ )
493
+
494
+ # Category preset handlers
495
+ preset_coding.click(
496
+ fn=apply_coding_filter,
497
+ outputs=[
498
+ search_box,
499
+ size_range,
500
+ bench_math,
501
+ bench_knowledge,
502
+ bench_coding,
503
+ bench_vision,
504
+ bench_agent,
505
+ bench_language,
506
+ ],
507
+ ).then(fn=update_table, inputs=filter_inputs, outputs=table_outputs)
508
+
509
+ preset_knowledge.click(
510
+ fn=apply_knowledge_filter,
511
+ outputs=[
512
+ search_box,
513
+ size_range,
514
+ bench_math,
515
+ bench_knowledge,
516
+ bench_coding,
517
+ bench_vision,
518
+ bench_agent,
519
+ bench_language,
520
+ ],
521
+ ).then(fn=update_table, inputs=filter_inputs, outputs=table_outputs)
522
+
523
+ preset_math.click(
524
+ fn=apply_math_filter,
525
+ outputs=[
526
+ search_box,
527
+ size_range,
528
+ bench_math,
529
+ bench_knowledge,
530
+ bench_coding,
531
+ bench_vision,
532
+ bench_agent,
533
+ bench_language,
534
+ ],
535
+ ).then(fn=update_table, inputs=filter_inputs, outputs=table_outputs)
536
+
537
+ preset_vision.click(
538
+ fn=apply_vision_filter,
539
+ outputs=[
540
+ search_box,
541
+ size_range,
542
+ bench_math,
543
+ bench_knowledge,
544
+ bench_coding,
545
+ bench_vision,
546
+ bench_agent,
547
+ bench_language,
548
+ ],
549
+ ).then(fn=update_table, inputs=filter_inputs, outputs=table_outputs)
550
+
551
+ preset_agent.click(
552
+ fn=apply_agent_filter,
553
+ outputs=[
554
+ search_box,
555
+ size_range,
556
+ bench_math,
557
+ bench_knowledge,
558
+ bench_coding,
559
+ bench_vision,
560
+ bench_agent,
561
+ bench_language,
562
+ ],
563
+ ).then(fn=update_table, inputs=filter_inputs, outputs=table_outputs)
564
+
565
+ preset_language.click(
566
+ fn=apply_language_filter,
567
+ outputs=[
568
+ search_box,
569
+ size_range,
570
+ bench_math,
571
+ bench_knowledge,
572
+ bench_coding,
573
+ bench_vision,
574
+ bench_agent,
575
+ bench_language,
576
+ ],
577
+ ).then(fn=update_table, inputs=filter_inputs, outputs=table_outputs)
578
+
579
+ # Initialize data and populate table on app load
580
+ def init_wrapper():
581
+ initialize_data()
582
+ return None
583
+
584
+ app.load(
585
+ fn=init_wrapper, # Load data without returning it
586
+ outputs=None,
587
+ ).then(fn=update_table, inputs=filter_inputs, outputs=table_outputs)
588
+
589
+
590
+ if __name__ == "__main__":
591
+ # Initialize data before launching
592
+ print("Initializing leaderboard app...")
593
+ initialize_data()
594
+ print("✓ Data loaded successfully")
595
+ print("Launching Gradio app...")
596
+
597
+ app.launch(server_name="0.0.0.0", server_port=7860, share=False)
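
`app.py` passes the six checkbox-group values to `parse_benchmark_selections`, whose body is not part of this diff. One plausible shape for it is sketched below, purely as an assumption: the helper name `flatten_selections` and the label-to-column mapping are hypothetical. The labels come from the checkbox groups above, and the column names from `get_benchmark_columns()` in `utils/data_loader.py`.

```python
# Assumed mapping from checkbox labels (app.py) to score columns (data_loader.py).
LABEL_TO_COLUMN = {
    "GSM8K": "gsm8k_score",
    "AIME 2026": "aime2026_score",
    "HMMT": "hmmt2026_score",
    "MMLU-Pro": "mmluPro_score",
    "GPQA": "gpqa_score",
    "HLE": "hle_score",
    "SWE-V": "sweVerified_score",
    "SWE-Pro": "swePro_score",
    "olmOCR": "olmOcr_score",
    "TB 2.0": "terminalBench_score",
    "EvasionB": "evasionBench_score",
}


def flatten_selections(*groups):
    """Merge checkbox-group values into one ordered, de-duplicated column list."""
    columns = []
    for group in groups:
        # A cleared CheckboxGroup may arrive as an empty list or None.
        for label in group or []:
            column = LABEL_TO_COLUMN.get(label)
            # Unknown labels are skipped rather than raising.
            if column and column not in columns:
                columns.append(column)
    return columns


# Equivalent to clicking the "Coding" preset: only the coding group is populated.
print(flatten_selections([], [], ["SWE-V", "SWE-Pro"], [], [], []))
```

Keeping the mapping in one dictionary would also let `select_all_benchmarks` and the category presets derive their lists from a single source instead of repeating the label literals.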
requirements.txt ADDED
@@ -0,0 +1,12 @@
1
+ # Gradio Leaderboard Requirements
2
+
3
+ # Core dependencies
4
+ gradio>=4.0.0
5
+ gradio_rangeslider>=0.0.8
6
+ datasets>=2.14.0
7
+ pandas>=2.0.0
8
+ huggingface_hub>=0.19.0
9
+ pyarrow>=14.0.0
10
+
11
+ # Additional utilities
12
+ requests>=2.31.0
static/sortTable.js ADDED
@@ -0,0 +1,93 @@
1
+ /**
2
+ * Table sorting functionality for the leaderboard
3
+ * Allows clicking column headers to sort by model name or benchmark scores
4
+ */
5
+
6
+ let currentSortColumn = null;
7
+ let currentSortDirection = 'desc';
8
+
9
+ /**
10
+ * Sort the leaderboard table by the specified column index
11
+ * @param {number} colIndex - The column index to sort by (0 = model name, 1+ = benchmarks)
12
+ */
13
+ function sortTable(colIndex) {
14
+ const table = document.querySelector('#leaderboardTable');
15
+ if (!table) return;
16
+
17
+ const tbody = table.querySelector('tbody');
18
+ if (!tbody) return;
19
+
20
+ const rows = Array.from(tbody.querySelectorAll('tr'));
21
+
22
+ // Toggle sort direction if clicking same column
23
+ if (currentSortColumn === colIndex) {
24
+ currentSortDirection = currentSortDirection === 'desc' ? 'asc' : 'desc';
25
+ } else {
26
+ currentSortColumn = colIndex;
27
+ currentSortDirection = 'desc';
28
+ }
29
+
30
+ // Sort rows
31
+ rows.sort((a, b) => {
32
+ let aVal, bVal;
33
+
34
+ if (colIndex === 0) {
35
+ // Sort by model name (stored in data-name attribute)
36
+ aVal = a.dataset.name || '';
37
+ bVal = b.dataset.name || '';
38
+ return currentSortDirection === 'asc'
39
+ ? aVal.localeCompare(bVal)
40
+ : bVal.localeCompare(aVal);
41
+ } else {
42
+ // Sort by benchmark score
43
+ const aCell = a.cells[colIndex];
44
+ const bCell = b.cells[colIndex];
45
+
46
+ // Extract score from cell text content
47
+ const aText = aCell ? aCell.textContent.trim() : '';
48
+ const bText = bCell ? bCell.textContent.trim() : '';
49
+
50
+ // Parse scores (treat "—" as missing = NaN so it always sorts last)
51
+ const aScore = aText === '—' ? NaN : parseFloat(aText);
52
+ const bScore = bText === '—' ? NaN : parseFloat(bText);
53
+
54
+ // Handle missing scores - put them at the end
55
+ if (isNaN(aScore) && isNaN(bScore)) return 0;
56
+ if (isNaN(aScore)) return 1;
57
+ if (isNaN(bScore)) return -1;
58
+
59
+ // Both are numbers, compare them
60
+ return currentSortDirection === 'desc'
61
+ ? bScore - aScore
62
+ : aScore - bScore;
63
+ }
64
+ });
65
+
66
+ // Re-append rows in sorted order
67
+ rows.forEach(row => tbody.appendChild(row));
68
+
69
+ // Update sort indicators
70
+ updateSortIndicators(colIndex);
71
+ }
72
+
73
+ /**
74
+ * Update the sort direction indicators in column headers
75
+ * @param {number} colIndex - The currently sorted column index
76
+ */
77
+ function updateSortIndicators(colIndex) {
78
+ const headers = document.querySelectorAll('#leaderboardTable thead th');
79
+ headers.forEach((th, index) => {
80
+ const sortArrow = th.querySelector('.sa');
81
+ if (sortArrow) {
82
+ if (index === colIndex) {
83
+ // Update arrow for sorted column
84
+ sortArrow.textContent = currentSortDirection === 'desc' ? '↓' : '↑';
85
+ th.classList.add('sorted');
86
+ } else {
87
+ // Reset arrow for other columns
88
+ sortArrow.textContent = '↕';
89
+ th.classList.remove('sorted');
90
+ }
91
+ }
92
+ });
93
+ }
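
The intent stated in the sorting comments (missing scores go to the end regardless of direction) can be illustrated in Python. This is a sketch of the ordering rule only, not code from the repository: `parse_score` and `sort_rows` are hypothetical names, and rows are modelled as `(model_name, score_text)` tuples.

```python
import math


def parse_score(text):
    """Return the cell text as a float, or NaN for the "—" placeholder or junk."""
    try:
        return float(text.strip())
    except ValueError:
        return math.nan


def sort_rows(rows, descending=True):
    """Sort (name, score_text) rows by score, always placing missing scores last."""
    present = [row for row in rows if not math.isnan(parse_score(row[1]))]
    missing = [row for row in rows if math.isnan(parse_score(row[1]))]
    # Only rows with a real score are ordered; the missing ones keep their order.
    present.sort(key=lambda row: parse_score(row[1]), reverse=descending)
    return present + missing


rows = [("a", "71.2"), ("b", "—"), ("c", "88.0")]
print([name for name, _ in sort_rows(rows)])
print([name for name, _ in sort_rows(rows, descending=False)])
```

Splitting the rows into present and missing groups avoids relying on a sentinel such as `-1`, which would sort missing entries first in ascending order.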
utils/__init__.py ADDED
@@ -0,0 +1,14 @@
1
+ """
2
+ Utility modules for the Gradio leaderboard app.
3
+ """
4
+
5
+ from .data_loader import load_leaderboard_data
6
+ from .filters import filter_data
7
+ from .formatters import format_for_display, format_score
8
+
9
+ __all__ = [
10
+ "load_leaderboard_data",
11
+ "filter_data",
12
+ "format_for_display",
13
+ "format_score",
14
+ ]
utils/data_loader.py ADDED
@@ -0,0 +1,205 @@
1
+ """
2
+ Data loading utilities for the leaderboard.
3
+ Loads data from HuggingFace dataset and integrates provider logos.
4
+ """
5
+
6
+ import json
7
+ import os
8
+ import pandas as pd
9
+ from datasets import load_dataset
10
+
11
+
12
+ def load_provider_logos():
13
+ """
14
+ Load provider logos from data/provider_logos.json
15
+
16
+ Returns:
17
+ dict: Provider name -> logo URL mapping
18
+ """
19
+ logos_path = os.path.join(
20
+ os.path.dirname(__file__), "..", "data", "provider_logos.json"
21
+ )
22
+
23
+ try:
24
+ with open(logos_path, "r") as f:
25
+ logos = json.load(f)
26
+ return logos
27
+ except FileNotFoundError:
28
+ print(f"Warning: Provider logos file not found at {logos_path}")
29
+ return {}
30
+ except json.JSONDecodeError as e:
31
+ print(f"Warning: Could not parse provider logos JSON: {e}")
32
+ return {}
+
+
+ def format_params(param_billions):
+     """
+     Format parameter count for display.
+
+     Args:
+         param_billions: Parameter count in billions (float or None)
+
+     Returns:
+         str: Formatted parameter string (e.g., "72.7B", "Unknown")
+     """
+     if pd.isna(param_billions) or param_billions is None:
+         return "Unknown"
+
+     if param_billions >= 1000:
+         return f"{param_billions:.0f}B"
+     elif param_billions >= 100:
+         return f"{param_billions:.0f}B"
+     elif param_billions >= 10:
+         return f"{param_billions:.1f}B"
+     else:
+         return f"{param_billions:.2f}B"
+
+
+ def load_leaderboard_data():
+     """
+     Load leaderboard data from HuggingFace dataset.
+
+     Returns:
+         pandas.DataFrame: Complete leaderboard data with:
+             - All model metadata
+             - All benchmark scores
+             - Provider logos
+             - Formatted parameters
+     """
+     print("Loading leaderboard data from HuggingFace dataset...")
+
+     # Load dataset from HF
+     try:
+         ds = load_dataset("OpenEvals/leaderboard-data", split="train")
+         df = ds.to_pandas()
+         print(f"✓ Loaded {len(df)} models from dataset")
+     except Exception as e:
+         print(f"✗ Error loading dataset: {e}")
+         raise
+
+     # Load provider logos
+     logos = load_provider_logos()
+     print(f"✓ Loaded {len(logos)} provider logos")
+
+     # Add logo URLs to dataframe
+     df["logo_url"] = df["provider"].map(logos)
+
+     # Format parameters for display
+     df["parameters_display"] = df["parameters_billions"].apply(format_params)
+
+     # Sort by model name by default
+     df = df.sort_values("model_name").reset_index(drop=True)
+
+     print(f"✓ Data loaded successfully: {len(df)} models, {df.columns.size} columns")
+
+     return df
+
+
+ def get_benchmark_columns():
+     """
+     Get list of all benchmark score column names.
+
+     Returns:
+         list: Column names for benchmark scores
+     """
+     return [
+         "gsm8k_score",
+         "mmluPro_score",
+         "gpqa_score",
+         "hle_score",
+         "olmOcr_score",
+         "sweVerified_score",
+         "swePro_score",
+         "aime2026_score",
+         "terminalBench_score",
+         "evasionBench_score",
+         "hmmt2026_score",
+     ]
+
+
+ def get_benchmark_info():
+     """
+     Get metadata about each benchmark.
+
+     Returns:
+         dict: Benchmark key -> metadata mapping
+     """
+     return {
+         "gsm8k": {
+             "name": "GSM8K",
+             "full_name": "Grade School Math 8K",
+             "category": "math",
+             "color": "#7c3aed",
+             "url": "https://huggingface.co/datasets/openai/gsm8k",
+         },
+         "mmluPro": {
+             "name": "MMLU-Pro",
+             "full_name": "Massive Multi-task Language Understanding Pro",
+             "category": "knowledge",
+             "color": "#2563eb",
+             "url": "https://huggingface.co/datasets/TIGER-Lab/MMLU-Pro",
+         },
+         "gpqa": {
+             "name": "GPQA◆",
+             "full_name": "PhD-level Expert Questions",
+             "category": "knowledge",
+             "color": "#2563eb",
+             "url": "https://huggingface.co/datasets/Idavidrein/gpqa",
+         },
+         "hle": {
+             "name": "HLE",
+             "full_name": "Humanity's Last Exam",
+             "category": "knowledge",
+             "color": "#2563eb",
+             "url": "https://lastexam.ai",
+         },
+         "olmOcr": {
+             "name": "olmOCR",
+             "full_name": "OCR Evaluation Benchmark",
+             "category": "vision",
+             "color": "#db2777",
+             "url": "https://huggingface.co/datasets/allenai/olmOCR-bench",
+         },
+         "sweVerified": {
+             "name": "SWE-V",
+             "full_name": "SWE-bench Verified",
+             "category": "coding",
+             "color": "#059669",
+             "url": "https://www.swebench.com",
+         },
+         "swePro": {
+             "name": "SWE-Pro",
+             "full_name": "SWE-bench Pro",
+             "category": "coding",
+             "color": "#059669",
+             "url": "https://scale.com/leaderboard/swe_bench_pro_public",
+         },
+         "aime2026": {
+             "name": "AIME 2026",
+             "full_name": "American Invitational Mathematics Examination 2026",
+             "category": "math",
+             "color": "#7c3aed",
+             "url": "https://matharena.ai/?comp=aime--aime_2026",
+         },
+         "terminalBench": {
+             "name": "TB 2.0",
+             "full_name": "Terminal-Bench 2.0",
+             "category": "agent",
+             "color": "#0d9488",
+             "url": "https://www.tbench.ai/leaderboard/terminal-bench/2.0",
+         },
+         "evasionBench": {
+             "name": "EvasionB",
+             "full_name": "EvasionBench",
+             "category": "language",
+             "color": "#ea580c",
+             "url": "https://huggingface.co/datasets/FutureMa/EvasionBench",
+         },
+         "hmmt2026": {
+             "name": "HMMT",
+             "full_name": "Harvard-MIT Mathematics Tournament Feb 2026",
+             "category": "math",
+             "color": "#7c3aed",
+             "url": "https://matharena.ai/?comp=hmmt--hmmt_feb_2026",
+         },
+     }
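Review note: the display thresholds in `format_params` are easy to misread, since the `>= 1000` and `>= 100` branches produce the same output. The sketch below is a plain-Python stand-in, not the code in this commit: it swaps `pd.isna` for `math.isnan` so it runs without pandas, and it merges the two identical branches. It only illustrates which precision each size range receives.

```python
import math


def format_params(param_billions):
    """Stand-in for the diff's format_params (math.isnan replaces pd.isna)."""
    if param_billions is None or (
        isinstance(param_billions, float) and math.isnan(param_billions)
    ):
        return "Unknown"
    if param_billions >= 100:  # covers the original >= 1000 branch as well
        return f"{param_billions:.0f}B"
    if param_billions >= 10:
        return f"{param_billions:.1f}B"
    return f"{param_billions:.2f}B"


# Print one value per size range, plus the two "missing" cases.
for value in (0.5, 7.0, 72.7, 405.0, None, float("nan")):
    print(value, "->", format_params(value))
```

Models under 10B keep two decimals, models from 10B up to 100B keep one, and anything larger is rounded to a whole number.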
utils/filters.py ADDED
@@ -0,0 +1,151 @@
+ """
+ Filter logic for the leaderboard.
+ Handles search, size range, and benchmark selection filtering.
+ """
+
+ import pandas as pd
+
+
+ def filter_data(df, search_term, size_min, size_max, selected_benchmarks):
+     """
+     Apply all filters to the leaderboard dataframe.
+
+     Args:
+         df (pd.DataFrame): Original leaderboard dataframe
+         search_term (str): Search string matched literally against model names and providers (case-insensitive)
+         size_min (float): Minimum model size in billions
+         size_max (float): Maximum model size in billions
+         selected_benchmarks (list): List of benchmark keys to filter by
+
+     Returns:
+         pd.DataFrame: Filtered dataframe
+     """
+     filtered = df.copy()
+
+     # 1. Search filter - match model name or provider (regex=False so "c++" or "(" cannot raise)
+     if search_term and search_term.strip():
+         search_lower = search_term.lower().strip()
+         filtered = filtered[
+             filtered["model_name"].str.lower().str.contains(search_lower, na=False, regex=False)
+             | filtered["provider"].str.lower().str.contains(search_lower, na=False, regex=False)
+         ]
+
+     # 2. Size filter
+     # Include models with unknown sizes (they should always be visible)
+     size_mask = (
+         (filtered["parameters_billions"] >= size_min)
+         & (filtered["parameters_billions"] <= size_max)
+     ) | filtered["parameters_billions"].isna()
+
+     filtered = filtered[size_mask]
+
+     # 3. Benchmark filter - only show models with at least one score in selected benchmarks
+     if selected_benchmarks and len(selected_benchmarks) > 0:
+         benchmark_cols = [f"{bench}_score" for bench in selected_benchmarks]
+
+         # Filter to only include columns that exist in the dataframe
+         existing_benchmark_cols = [
+             col for col in benchmark_cols if col in filtered.columns
+         ]
+
+         if existing_benchmark_cols:
+             # Keep rows that have at least one non-null score in the selected benchmarks
+             has_score_mask = filtered[existing_benchmark_cols].notna().any(axis=1)
+             filtered = filtered[has_score_mask]
+     else:
+         # If no benchmarks selected, return empty dataframe
+         filtered = filtered.iloc[0:0]  # Empty with same structure
+
+     return filtered
+
+
+ def calculate_stats(df, selected_benchmarks):
+     """
+     Calculate statistics for the filtered data.
+
+     Args:
+         df (pd.DataFrame): Filtered leaderboard dataframe
+         selected_benchmarks (list): List of selected benchmark keys
+
+     Returns:
+         dict: Statistics with keys 'models', 'benchmarks', 'scores'
+     """
+     total_models = len(df)
+     total_benchmarks = len(selected_benchmarks) if selected_benchmarks else 0
+
+     # Count non-null scores in selected benchmarks
+     if selected_benchmarks and len(selected_benchmarks) > 0:
+         benchmark_cols = [f"{bench}_score" for bench in selected_benchmarks]
+         existing_cols = [col for col in benchmark_cols if col in df.columns]
+
+         if existing_cols and len(df) > 0:
+             total_scores = df[existing_cols].notna().sum().sum()
+         else:
+             total_scores = 0
+     else:
+         total_scores = 0
+
+     return {
+         "models": total_models,
+         "benchmarks": total_benchmarks,
+         "scores": int(total_scores),
+     }
+
+
+ def parse_benchmark_selections(*checkbox_groups):
+     """
+     Parse benchmark selections from multiple checkbox groups.
+     Converts display names back to benchmark keys.
+
+     Args:
+         *checkbox_groups: Variable number of lists containing selected display names
+
+     Returns:
+         list: List of benchmark keys
+     """
+     # Mapping from display names to benchmark keys
+     display_to_key = {
+         "GSM8K": "gsm8k",
+         "AIME 2026": "aime2026",
+         "HMMT": "hmmt2026",
+         "MMLU-Pro": "mmluPro",
+         "GPQA": "gpqa",
+         "HLE": "hle",
+         "SWE-V": "sweVerified",
+         "SWE-Pro": "swePro",
+         "olmOCR": "olmOcr",
+         "TB 2.0": "terminalBench",
+         "EvasionB": "evasionBench",
+     }
+
+     selected_keys = []
+
+     for group in checkbox_groups:
+         if group:  # Check if not None and not empty
+             for display_name in group:
+                 key = display_to_key.get(display_name)
+                 if key:
+                     selected_keys.append(key)
+
+     return selected_keys
+
+
+ def get_default_benchmarks():
+     """
+     Get the default set of selected benchmarks (all except olmOCR).
+
+     Returns:
+         list: List of default benchmark keys
+     """
+     return [
+         "gsm8k",
+         "mmluPro",
+         "gpqa",
+         "hle",
+         "sweVerified",
+         "swePro",
+         "aime2026",
+         "terminalBench",
+         "evasionBench",
+         "hmmt2026",
+     ]
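Review note: the three rules in `filter_data` interact in ways that are easy to miss, in particular that rows with an unknown size always pass the size filter and that an empty benchmark selection hides every row. The sketch below restates those rules over plain dicts so it runs without pandas. The `matches` helper and the sample rows are hypothetical, and it simplifies one case: it does not model the branch where none of the selected score columns exist in the dataframe.

```python
def matches(row, search_term, size_min, size_max, selected_benchmarks):
    """Return True if a row (a plain dict) passes all three filter rules."""
    # 1. Literal, case-insensitive substring search on model name or provider.
    term = (search_term or "").strip().lower()
    if term:
        haystacks = (row.get("model_name") or "", row.get("provider") or "")
        if not any(term in text.lower() for text in haystacks):
            return False

    # 2. Size range; rows with an unknown size (None) always pass.
    size = row.get("parameters_billions")
    if size is not None and not (size_min <= size <= size_max):
        return False

    # 3. An empty selection hides everything; otherwise at least one
    #    selected benchmark must have a score.
    if not selected_benchmarks:
        return False
    return any(row.get(f"{key}_score") is not None for key in selected_benchmarks)


# Hypothetical sample rows, not taken from the real dataset.
rows = [
    {"model_name": "org-a/model-7b", "provider": "Org A",
     "parameters_billions": 7.0, "gsm8k_score": 81.2},
    {"model_name": "org-b/model-c++", "provider": "Org B",
     "parameters_billions": None, "gsm8k_score": 55.0},
    {"model_name": "org-c/model-70b", "provider": "Org C",
     "parameters_billions": 70.0, "gsm8k_score": None},
]

kept = [r["model_name"] for r in rows if matches(r, "", 0, 10, ["gsm8k"])]
print(kept)
```

The second row survives a 0 to 10B size range because its size is unknown, and a search for `c++` is treated as literal text rather than as a regular expression.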
utils/formatters.py ADDED
@@ -0,0 +1,177 @@
+ """
+ Data formatting utilities for Gradio display.
+ Handles formatting of model names, scores, and table structure.
+ """
+
+ import pandas as pd
+ from .data_loader import get_benchmark_info
+
+
+ def format_score(score_value):
+     """
+     Format a benchmark score for display.
+
+     Args:
+         score_value: Score value (float, int, or None)
+
+     Returns:
+         str: Formatted score ("85.3" or "—" for missing)
+     """
+     if pd.isna(score_value) or score_value is None:
+         return "—"
+     return f"{score_value:.1f}"
+
+
+ def format_model_name_with_provider(row):
+     """
+     Create model name display with a provider-abbreviation prefix.
+
+     Args:
+         row: DataFrame row with 'model_name', 'provider' columns
+
+     Returns:
+         str: Model name with provider prefix (e.g., "[MIS] mistralai/Mistral-7B")
+     """
+     model_name = row["model_name"]
+     provider = row["provider"]
+
+     # Prefix with the provider's first three letters; missing or non-string providers become "???"
+     initials = provider[:3].upper() if isinstance(provider, str) and provider else "???"
+     return f"[{initials}] {model_name}"
+
+
+ def format_hf_link(row):
+     """
+     Create HuggingFace link for the model.
+
+     Args:
+         row: DataFrame row with 'model_name' column
+
+     Returns:
+         str: HuggingFace URL
+     """
+     model_name = row["model_name"]
+     return f"https://huggingface.co/{model_name}"
+
+
+ def format_for_display(df, selected_benchmarks):
+     """
+     Format dataframe for Gradio display.
+
+     Args:
+         df (pd.DataFrame): Filtered leaderboard dataframe
+         selected_benchmarks (list): List of benchmark keys to display
+
+     Returns:
+         pd.DataFrame: Formatted dataframe ready for gr.Dataframe
+     """
+     if df.empty:
+         # Return empty dataframe with proper structure
+         return pd.DataFrame(columns=["Model", "Parameters", "🔗 Link"])
+
+     display_df = df.copy()
+
+     # Create model column with provider prefix
+     display_df["Model"] = display_df.apply(format_model_name_with_provider, axis=1)
+
+     # Create HF link column
+     display_df["HF Link"] = display_df.apply(format_hf_link, axis=1)
+
+     # Start with base columns
+     columns_to_show = ["Model", "parameters_display", "HF Link"]
+     column_names = ["Model", "Parameters", "🔗 Link"]
+
+     # Get benchmark info for display names
+     benchmark_info = get_benchmark_info()
+
+     # Add selected benchmark columns
+     for bench_key in selected_benchmarks:
+         score_col = f"{bench_key}_score"
+
+         if score_col in display_df.columns:
+             # Get display name from benchmark info
+             display_name = benchmark_info.get(bench_key, {}).get("name", bench_key)
+
+             # Format scores
+             display_df[display_name] = display_df[score_col].apply(format_score)
+
+             columns_to_show.append(display_name)
+             column_names.append(display_name)
+
+     # Select only the columns we want to show
+     result_df = display_df[columns_to_show].copy()
+
+     # Rename columns for clarity
+     result_df.columns = column_names
+
+     return result_df
+
+
+ def create_empty_table(selected_benchmarks):
+     """
+     Create an empty table with proper column structure.
+
+     Args:
+         selected_benchmarks (list): List of benchmark keys
+
+     Returns:
+         pd.DataFrame: Empty dataframe with proper columns
+     """
+     benchmark_info = get_benchmark_info()
+
+     columns = ["Model", "Parameters", "🔗 Link"]
+
+     for bench_key in selected_benchmarks:
+         display_name = benchmark_info.get(bench_key, {}).get("name", bench_key)
+         columns.append(display_name)
+
+     return pd.DataFrame(columns=columns)
+
+
+ def get_column_datatypes(selected_benchmarks):
+     """
+     Get Gradio datatype specification for table columns.
+
+     Args:
+         selected_benchmarks (list): List of benchmark keys
+
+     Returns:
+         list: List of datatype strings for gr.Dataframe
+     """
+     # Base columns: Model (str), Parameters (str), Link (str)
+     datatypes = ["str", "str", "str"]
+
+     # Add 'str' for each benchmark column (they're pre-formatted as strings)
+     for _ in selected_benchmarks:
+         datatypes.append("str")
+
+     return datatypes
+
+
+ def prepare_export_data(df, selected_benchmarks):
+     """
+     Prepare data for CSV export (without HTML/markdown formatting).
+
+     Args:
+         df (pd.DataFrame): Filtered leaderboard dataframe
+         selected_benchmarks (list): List of benchmark keys
+
+     Returns:
+         pd.DataFrame: Clean dataframe for CSV export
+     """
+     if df.empty:
+         return pd.DataFrame()
+
+     export_df = df[["model_name", "provider", "parameters_display"]].copy()
+     export_df.columns = ["Model", "Provider", "Parameters"]
+
+     benchmark_info = get_benchmark_info()
+
+     # Add benchmark scores
+     for bench_key in selected_benchmarks:
+         score_col = f"{bench_key}_score"
+         if score_col in df.columns:
+             display_name = benchmark_info.get(bench_key, {}).get("name", bench_key)
+             export_df[display_name] = df[score_col].apply(format_score)
+
+     return export_df
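Review note: `format_for_display` labels columns with the `name` values from `get_benchmark_info()`, where GPQA is spelled `GPQA◆`, while `parse_benchmark_selections` in `utils/filters.py` looks up the label `GPQA`. Whether this matters depends on the checkbox labels in `app.py`, which this part of the diff does not show. One way to keep the two directions in sync is to derive the reverse lookup from a single mapping. The sketch below assumes a trimmed, hypothetical excerpt of the name data, and the names `BENCHMARK_NAMES`, `DISPLAY_TO_KEY`, and `keys_from_display_names` are illustrative only.

```python
# Hypothetical excerpt of the key -> display-name data in get_benchmark_info().
BENCHMARK_NAMES = {
    "gsm8k": "GSM8K",
    "gpqa": "GPQA◆",
    "sweVerified": "SWE-V",
}

# Derive the reverse lookup instead of maintaining a second table by hand.
DISPLAY_TO_KEY = {name: key for key, name in BENCHMARK_NAMES.items()}


def keys_from_display_names(display_names):
    """Translate selected labels back to benchmark keys, skipping unknown labels."""
    return [DISPLAY_TO_KEY[name] for name in display_names if name in DISPLAY_TO_KEY]


print(keys_from_display_names(["GSM8K", "GPQA◆", "not-a-benchmark"]))
```

With a derived lookup, a label that differs from the canonical name (such as plain `GPQA` here) is dropped visibly in one place instead of silently diverging between two hand-written tables.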
utils/html_generator.py ADDED
@@ -0,0 +1,292 @@
+ """
+ HTML table generator for the leaderboard.
+ Generates styled HTML tables with client-side sorting and provider logos.
+ """
+
+ import pandas as pd
+ from typing import Dict, List
+ from .data_loader import get_benchmark_info
+
+
+ # Benchmark to category mapping (for color coding)
+ BENCHMARK_CATEGORIES = {
+     "gsm8k": "math",
+     "aime2026": "math",
+     "hmmt2026": "math",
+     "mmluPro": "knowledge",
+     "gpqa": "knowledge",
+     "hle": "knowledge",
+     "sweVerified": "coding",
+     "swePro": "coding",
+     "olmOcr": "vision",
+     "terminalBench": "agent",
+     "evasionBench": "language",
+ }
+
+ # Category color mapping (for score styling)
+ CATEGORY_COLORS = {
+     "math": "#7c3aed",  # purple
+     "knowledge": "#2563eb",  # blue
+     "coding": "#059669",  # green
+     "agent": "#0d9488",  # teal
+     "language": "#ea580c",  # orange
+     "vision": "#db2777",  # pink
+ }
+
+
+ def get_table_css() -> str:
+     """
+     Returns the CSS styles for the leaderboard table (light mode only).
+     Extracted from index.html and adapted for Gradio embedding.
+     """
+     return """
+     *{margin:0;padding:0;box-sizing:border-box;}
+     :root{
+         --bg:#f9fafb;--bg2:#f3f4f6;--surface:#ffffff;--surface-alt:#f9fafb;
+         --border:#e5e7eb;--border-hover:#d1d5db;
+         --shadow-sm:0 1px 3px rgba(15,23,42,.04),0 1px 2px rgba(15,23,42,.06);
+         --shadow:0 4px 16px rgba(15,23,42,.06),0 1px 3px rgba(15,23,42,.08);
+         --shadow-lg:0 12px 40px rgba(15,23,42,.08),0 4px 12px rgba(15,23,42,.06);
+         --text:#111827;--text-sec:#6b7280;--text-muted:#9ca3af;
+         --ac:#6366f1;--ac2:#4f46e5;--ac-bg:rgba(99,102,241,.06);
+         --teal:#0d9488;--amber:#d97706;--green:#16a34a;--rose:#e11d48;--purple:#7c3aed;
+         --radius:16px;--radius-sm:10px;--radius-xs:6px;
+         --font:'Source Sans Pro',sans-serif;--font-mono:'IBM Plex Mono',monospace;
+         --tr:0.22s cubic-bezier(0.4,0,0.2,1);
+     }
+
+     /* TABLE */
+     .tw{background:var(--surface);border:1px solid var(--border);border-radius:var(--radius);overflow-x:auto;box-shadow:var(--shadow);margin-bottom:20px;}
+     table{width:100%;border-collapse:collapse;font-size:11px;font-family:var(--font);}
+     thead{background:var(--surface-alt);position:sticky;top:0;z-index:100;box-shadow:0 2px 4px rgba(0,0,0,0.1);}
+     thead tr{border-bottom:2px solid var(--border);}
+     th{padding:12px 8px;text-align:center;font-size:11px;font-family:var(--font-mono);text-transform:uppercase;letter-spacing:.5px;color:var(--text-muted);white-space:nowrap;cursor:pointer;user-select:none;vertical-align:bottom;line-height:1.6;font-weight:700;transition:var(--tr);}
+     th.c-model{text-align:left;padding-left:14px;min-width:180px;position:sticky;left:0;background:var(--surface-alt);z-index:101;}
+     th:hover{color:var(--ac);background:rgba(99,102,241,.08);transform:translateY(-1px);}
+     th.sorted{color:var(--ac);font-weight:800;}
+     .sa{opacity:.6;font-size:7px;margin-left:3px;}
+     th a{color:inherit;text-decoration:none;}
+     th a:hover{color:var(--ac);text-decoration:underline;}
+     tbody tr{border-bottom:1px solid var(--border);transition:background var(--tr);}
+     tbody tr:last-child{border-bottom:none;}
+     tbody tr:hover{background:rgba(99,102,241,.025);}
+     td{padding:10px 6px;text-align:center;vertical-align:middle;}
+     td.c-model{text-align:left;padding-left:14px;position:sticky;left:0;background:var(--surface);z-index:9;border-right:1px solid var(--border);}
+     tbody tr:hover td.c-model{background:rgba(99,102,241,.025);}
+
+     /* MODEL CELL */
+     .mc{display:flex;flex-direction:column;gap:2px;}
+     .mn{font-weight:700;font-size:12px;color:var(--text);display:flex;align-items:center;gap:5px;flex-wrap:wrap;}
+     .mn a{color:var(--text);text-decoration:none;transition:var(--tr);position:relative;}
+     .mn a:hover{color:var(--ac);text-decoration:none;}
+     .mn a::after{content:'';position:absolute;bottom:-2px;left:0;width:0;height:1px;background:var(--ac);transition:width 0.3s ease;}
+     .mn a:hover::after{width:100%;}
+     .ms{display:flex;gap:4px;align-items:center;margin-top:2px;}
+     .mp{font-size:8px;color:var(--text-muted);font-family:var(--font-mono);}
+
+     /* PROVIDER LOGO */
+     .provider-logo-inline{width:16px;height:16px;border-radius:50%;object-fit:cover;border:1px solid var(--border);box-shadow:var(--shadow-sm);margin-right:6px;vertical-align:middle;display:inline-block;}
+     .provider-logo-fallback-inline{width:16px;height:16px;border-radius:50%;background:var(--ac-bg);border:1px solid var(--border);display:inline-flex;align-items:center;justify-content:center;font-size:8px;font-weight:700;color:var(--ac);font-family:var(--font-mono);margin-right:6px;vertical-align:middle;}
+
+     /* SCORE CELL */
+     .sc{display:flex;flex-direction:column;align-items:center;gap:2px;}
+     .sn{font-family:var(--font-mono);font-size:11px;font-weight:700;}
+     .na{color:var(--text-muted);font-size:9px;font-family:var(--font-mono);}
+
+     /* EMPTY STATE */
+     .empty-state{text-align:center;padding:40px 20px;color:var(--text-muted);font-size:13px;}
+     .empty-state strong{color:var(--text-sec);font-size:15px;display:block;margin-bottom:8px;}
+     """
+
+
+ def get_benchmark_category_color(benchmark_key: str) -> str:
+     """
+     Get the color for a benchmark based on its category.
+
+     Args:
+         benchmark_key: The benchmark key (e.g., 'gsm8k', 'mmluPro')
+
+     Returns:
+         str: Hex color code for the category
+     """
+     category = BENCHMARK_CATEGORIES.get(benchmark_key, "knowledge")
+     return CATEGORY_COLORS.get(category, "#6366f1")
+
+
+ def generate_table_headers(selected_benchmarks: List[str]) -> str:
+     """
+     Generate HTML for table headers with sorting functionality.
+
+     Args:
+         selected_benchmarks: List of benchmark keys to display
+
+     Returns:
+         str: HTML string for <thead> element
+     """
+     benchmarks_info = get_benchmark_info()
+
+     # Start with model header (column 0)
+     headers_html = "<thead><tr>\n"
+     headers_html += ' <th class="c-model" onclick="sortTable(0)">Model <span class="sa">↕</span></th>\n'
+
+     # Add benchmark headers (columns 1+)
+     for idx, bench_key in enumerate(selected_benchmarks, start=1):
+         bench_info = benchmarks_info.get(bench_key, {})
+         bench_name = bench_info.get("name", bench_key)
+         headers_html += f' <th onclick="sortTable({idx})">{bench_name} <span class="sa">↕</span></th>\n'
+
+     headers_html += "</tr></thead>\n"
+     return headers_html
+
+
+ def generate_model_cell(row: pd.Series, provider_logos: Dict[str, str]) -> str:
+     """
+     Generate HTML for the model cell (sticky first column).
+
+     Args:
+         row: DataFrame row containing model data
+         provider_logos: Dictionary mapping provider names to logo URLs
+
+     Returns:
+         str: HTML string for model <td> element
+     """
+     model_id = row.get("model_id", "")
+     model_name = row.get("model_name", model_id)
+     provider = row.get("provider", "Unknown")
+     # Try parameters_display first (formatted), then fall back to a raw "parameters" field
+     params = row.get("parameters_display", row.get("parameters", "Unknown"))
+
+     # Get provider logo - first try logo_url column, then fallback to provider_logos dict
+     provider_logo_url = row.get("logo_url")
+     if not provider_logo_url or pd.isna(provider_logo_url):
+         provider_logo_url = provider_logos.get(provider)
+
+     if provider_logo_url:
+         logo_html = f'<img src="{provider_logo_url}" alt="{provider}" class="provider-logo-inline" title="{provider}" onerror="this.style.display=\'none\';">'
+     else:
+         # Fallback: show first 2 letters of provider name ("??" if missing or not a string)
+         initials = provider[:2].upper() if isinstance(provider, str) and provider and provider != "Unknown" else "??"
+         logo_html = f'<span class="provider-logo-fallback-inline" title="{provider}">{initials}</span>'
+
+     # Format HuggingFace link - use model_name which contains the repo path (e.g., "Meta/Llama-3")
+     hf_link = f"https://huggingface.co/{model_name}" if model_name else "#"
+
+     cell_html = f''' <td class="c-model">
+         <div class="mc">
+             <div class="mn">
+                 {logo_html}
+                 <a href="{hf_link}" target="_blank" rel="noopener noreferrer">{model_name}</a>
+             </div>
+             <div class="ms">
+                 <span class="mp">{provider}</span>
+                 <span class="mp">{params}</span>
+             </div>
+         </div>
+     </td>'''
+
+     return cell_html
+
+
+ def generate_score_cell(score, benchmark_key: str) -> str:
+     """
+     Generate HTML for a score cell with category-specific color.
+
+     Args:
+         score: The benchmark score (float, None, or NaN)
+         benchmark_key: The benchmark key (for color coding)
+
+     Returns:
+         str: HTML string for score <td> element
+     """
+     # Check if score is missing/invalid
+     if pd.isna(score) or score is None:
+         return ' <td><div class="sc"><span class="na">—</span></div></td>'
+
+     try:
+         score_float = float(score)
+         color = get_benchmark_category_color(benchmark_key)
+         score_display = f"{score_float:.1f}"
+
+         return f' <td><div class="sc"><div class="sn" style="color: {color};">{score_display}</div></div></td>'
+     except (ValueError, TypeError):
+         return ' <td><div class="sc"><span class="na">—</span></div></td>'
+
+
+ def generate_table_rows(
+     df: pd.DataFrame, selected_benchmarks: List[str], provider_logos: Dict[str, str]
+ ) -> str:
+     """
+     Generate HTML for all table rows.
+
+     Args:
+         df: DataFrame containing leaderboard data
+         selected_benchmarks: List of benchmark keys to display
+         provider_logos: Dictionary mapping provider names to logo URLs
+
+     Returns:
+         str: HTML string for <tbody> element
+     """
+     if df.empty:
+         return """<tbody>
+         <tr>
+             <td colspan="100" class="empty-state">
+                 <strong>No models match your criteria</strong>
+                 Try adjusting your search or filter settings
+             </td>
+         </tr>
+     </tbody>"""
+
+     rows_html = "<tbody>\n"
+
+     for _, row in df.iterrows():
+         model_name = row.get("model_name", row.get("model_id", "Unknown"))
+         rows_html += f'<tr data-name="{model_name}">\n'
+
+         # Model cell (sticky first column)
+         rows_html += generate_model_cell(row, provider_logos) + "\n"
+
+         # Score cells for each selected benchmark
+         for bench_key in selected_benchmarks:
+             score_col = f"{bench_key}_score"
+             score = row.get(score_col)
+             rows_html += generate_score_cell(score, bench_key) + "\n"
+
+         rows_html += "</tr>\n"
+
+     rows_html += "</tbody>\n"
+     return rows_html
+
+
+ def generate_leaderboard_html(
+     df: pd.DataFrame, selected_benchmarks: List[str], provider_logos: Dict[str, str]
+ ) -> str:
+     """
+     Generate complete HTML table for the leaderboard.
+
+     Args:
+         df: DataFrame containing filtered leaderboard data
+         selected_benchmarks: List of benchmark keys to display
+         provider_logos: Dictionary mapping provider names to logo URLs
+
+     Returns:
+         str: Complete HTML string with styles and table (sorting JavaScript is loaded separately)
+     """
+     css = get_table_css()
+     headers = generate_table_headers(selected_benchmarks)
+     rows = generate_table_rows(df, selected_benchmarks, provider_logos)
+
+     # Note: JavaScript for sorting is loaded via Gradio's js parameter in app.py
+     html = f"""
+     <style>
+     {css}
+     </style>
+
+     <div class="tw">
+         <table id="leaderboardTable">
+             {headers}
+             {rows}
+         </table>
+     </div>
+     """
+
+     return html
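Review note: `generate_model_cell` and `generate_table_rows` interpolate `model_name` and `provider` straight into HTML text and attribute values. Because those strings come from a dataset, a name containing `<`, `"`, or `&` could break the markup or inject content. A minimal sketch of escaping with the standard library's `html.escape` follows. The `render_model_link` helper is hypothetical and is not part of this commit, and percent-encoding of the URL path (for example with `urllib.parse.quote`) is left out for brevity.

```python
from html import escape


def render_model_link(model_name: str, provider: str) -> str:
    """Hypothetical helper: build the model link with escaped text and attributes."""
    # quote=True also escapes double and single quotes, which matters inside attributes.
    safe_name = escape(model_name, quote=True)
    safe_provider = escape(provider, quote=True)
    href = f"https://huggingface.co/{safe_name}"
    return (
        f'<a href="{href}" target="_blank" rel="noopener noreferrer" '
        f'title="{safe_provider}">{safe_name}</a>'
    )


# A deliberately hostile name and provider, to show that markup is neutralized.
link = render_model_link("org/<b>model</b>", 'Org "A"')
print(link)
```

The same treatment would apply to the `data-name` attribute on each `<tr>` and to the `alt` and `title` attributes of the provider logo.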