---
title: Official Benchmarks Leaderboard 2026
emoji: 🏆
colorFrom: purple
colorTo: blue
sdk: gradio
sdk_version: 5.50.0
app_file: app.py
pinned: false
---

# 🏆 Official Benchmarks Leaderboard 2026

A unified leaderboard for **11 official HuggingFace benchmarks**. Compare AI models across math, coding, knowledge, vision, agent, and language tasks.

## ✨ Features

- 📊 **11 Official Benchmarks**: GSM8K, MMLU-Pro, GPQA, HLE, SWE-bench, AIME, HMMT, and more
- 🎛️ **Quick Filters**: One-click presets for model sizes and benchmark categories
- 🔍 **Interactive Search**: Filter by model name or provider
- 📏 **Size Range Slider**: Filter models by parameter count (0-1100B+)
- 🎯 **Category Selection**: Choose specific benchmark categories to display
- 📥 **Export CSV**: Download filtered leaderboard data
- 🔄 **Sortable Columns**: Click any header to sort the table
- 🎨 **Modern Design**: Clean, responsive interface with provider logos
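
The search, size-range, and CSV export features above can be sketched with the standard library alone. This is a minimal illustration, not the code in `utils/filters.py`: the row fields (`model`, `provider`, `params_b`), the sample rows, and the function names `filter_rows` and `to_csv` are all hypothetical.

```python
import csv
import io

# Hypothetical sample rows; the real dataset's column names may differ.
ROWS = [
    {"model": "alpha-7b", "provider": "AcmeAI", "params_b": 7.0, "GSM8K": 81.2},
    {"model": "beta-70b", "provider": "Borealis", "params_b": 70.0, "GSM8K": 92.5},
    {"model": "gamma-400b", "provider": "AcmeAI", "params_b": 400.0, "GSM8K": 95.1},
]


def filter_rows(rows, query="", min_b=0.0, max_b=float("inf")):
    """Keep rows matching the search text whose size falls in [min_b, max_b]."""
    q = query.strip().lower()
    return [
        row
        for row in rows
        # An empty query matches everything, since "" is in every string.
        if (q in row["model"].lower() or q in row["provider"].lower())
        and min_b <= row["params_b"] <= max_b
    ]


def to_csv(rows):
    """Serialize the filtered rows to CSV text, with a header row."""
    if not rows:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    print(to_csv(filter_rows(ROWS, query="acme", max_b=100)))
```

Matching on a lowercased substring keeps the search case-insensitive across both model name and provider.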

## 🎯 Benchmarks Included

### 📐 Math
- **GSM8K**: Grade School Math (8.5K problems)
- **AIME 2026**: American Invitational Mathematics Examination
- **HMMT 2026**: Harvard-MIT Mathematics Tournament

### 🧠 Knowledge
- **MMLU-Pro**: A harder variant of Massive Multitask Language Understanding (MMLU)
- **GPQA Diamond**: Graduate-level, expert-written science questions (the Diamond subset of GPQA)
- **HLE**: Humanity's Last Exam

### 💻 Coding
- **SWE-bench Verified**: Real-world software engineering tasks
- **SWE-bench Pro**: Advanced software engineering challenges

### 👁️ Vision
- **olmOCR**: OCR evaluation benchmark

### 🤖 Agent
- **Terminal-Bench 2.0**: Agent tasks carried out in a terminal environment

### 💬 Language
- **EvasionBench**: Language understanding challenges
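
The grouping above maps naturally onto a category-to-benchmark lookup, which is one way category selection could decide which columns to show. This is a sketch under that assumption; `CATEGORIES` and `benchmarks_for` are illustrative names, not identifiers from the app.

```python
# Category -> benchmark names, copied from the list above.
CATEGORIES = {
    "Math": ["GSM8K", "AIME 2026", "HMMT 2026"],
    "Knowledge": ["MMLU-Pro", "GPQA Diamond", "HLE"],
    "Coding": ["SWE-bench Verified", "SWE-bench Pro"],
    "Vision": ["olmOCR"],
    "Agent": ["Terminal-Bench 2.0"],
    "Language": ["EvasionBench"],
}


def benchmarks_for(selected):
    """Return the benchmarks for the selected categories, in listing order."""
    return [
        name
        for category, names in CATEGORIES.items()
        if category in selected
        for name in names
    ]


if __name__ == "__main__":
    print(benchmarks_for(["Coding", "Math"]))
```

Iterating over `CATEGORIES` rather than over `selected` keeps the column order stable regardless of the order in which categories are picked.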

## 🚀 Quick Start

The leaderboard loads its data automatically from the HuggingFace dataset `OpenEvals/leaderboard-data`.

**Quick Filters:**
- 🔹 **Small (<10B)**, 🔸 **Medium (10-100B)**, 🔶 **Large (100B+)**: Filter by model size
- 💻 **Coding**, 🧠 **Knowledge**, 📐 **Math**, etc.: Show only specific categories
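
The size presets amount to bucketing a model by its parameter count. A minimal sketch follows; `size_bucket` is a hypothetical helper, and placing a model of exactly 100B in Large is an assumption, since the labels "10-100B" and "100B+" overlap at that boundary.

```python
def size_bucket(params_b):
    """Map a parameter count (in billions) to a quick-filter size preset."""
    if params_b < 10:
        return "Small"
    # Assumption: exactly 100B counts as Large, not Medium.
    if params_b < 100:
        return "Medium"
    return "Large"


if __name__ == "__main__":
    for size in (7, 70, 405):
        print(size, size_bucket(size))
```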

## 📊 Data Source

**Dataset**: [OpenEvals/leaderboard-data](https://huggingface.co/datasets/OpenEvals/leaderboard-data)

All scores are aggregated from official HuggingFace benchmark leaderboards. The dataset is updated regularly with the latest model evaluations.
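
To inspect the data outside the app, the dataset can probably be pulled with the `datasets` library listed under Technologies. This is a sketch only: it assumes the repository loads with its default configuration and that the first split holds the leaderboard rows, neither of which is confirmed here. `load_leaderboard` is a hypothetical helper, not the function in `utils/data_loader.py`.

```python
def load_leaderboard(repo_id="OpenEvals/leaderboard-data"):
    """Download the dataset and return its first split as a pandas DataFrame."""
    # Imported lazily so the module can be imported without `datasets` installed.
    from datasets import load_dataset

    # With no `split` argument, load_dataset returns a DatasetDict keyed by split name.
    dataset_dict = load_dataset(repo_id)
    # Assumption: the first split is the one that holds the leaderboard rows.
    first_split = next(iter(dataset_dict))
    return dataset_dict[first_split].to_pandas()


if __name__ == "__main__":
    frame = load_leaderboard()  # requires network access
    print(frame.head())
```

If the repository defines several configurations, `load_dataset` needs the configuration name as a second argument.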

## 💻 Local Development

```bash
# Install dependencies
pip install -r requirements.txt

# Run the app
python app.py
```

## 📁 Project Structure

```
.
├── app.py                    # Main Gradio application
├── utils/
│   ├── data_loader.py       # Load data from HuggingFace dataset
│   ├── filters.py           # Filter and search logic
│   ├── formatters.py        # Data formatting utilities
│   └── html_generator.py    # Generate HTML leaderboard table
├── static/
│   └── sortTable.js         # Client-side table sorting
├── data/
│   └── provider_logos.json  # Provider avatar URLs
└── requirements.txt         # Python dependencies
```

## 🔧 Technologies

- **Gradio 5.50.0**: Interactive web interface
- **Datasets**: HuggingFace datasets library
- **Pandas**: Data manipulation
- **RangeSlider**: Custom Gradio component for size filtering

## 📝 License

Data is sourced from official HuggingFace benchmarks. Please refer to individual benchmark pages for specific licensing information.

---

Made with ❤️ by the Benchmarks Team