---
title: Official Benchmarks Leaderboard 2026
emoji: 🏆
colorFrom: purple
colorTo: blue
sdk: gradio
sdk_version: 5.50.0
app_file: app.py
pinned: false
---

# 🏆 Official Benchmarks Leaderboard 2026

A unified leaderboard for **11 official HuggingFace benchmarks**. Compare AI models across math, coding, knowledge, vision, agent, and language tasks.

## ✨ Features

- 📊 **11 Official Benchmarks**: GSM8K, MMLU-Pro, GPQA, HLE, SWE-bench, AIME, HMMT, and more
- 🎛️ **Quick Filters**: One-click presets for model sizes and benchmark categories
- 🔍 **Interactive Search**: Filter by model name or provider
- 📏 **Size Range Slider**: Filter models by parameter count (0-1100B+)
- 🎯 **Category Selection**: Choose specific benchmark categories to display
- 📥 **Export CSV**: Download filtered leaderboard data
- 🔄 **Sortable Columns**: Click any header to sort the table
- 🎨 **Modern Design**: Clean, responsive interface with provider logos

## 🎯 Benchmarks Included

### 📐 Math

- **GSM8K**: Grade School Math (8.5K problems)
- **AIME 2026**: American Invitational Mathematics Examination
- **HMMT 2026**: Harvard-MIT Mathematics Tournament

### 🧠 Knowledge

- **MMLU-Pro**: Massive Multi-task Language Understanding
- **GPQA Diamond**: PhD-level expert questions
- **HLE**: Humanity's Last Exam

### 💻 Coding

- **SWE-bench Verified**: Real-world software engineering tasks
- **SWE-bench Pro**: Advanced software engineering challenges

### 👁️ Vision

- **olmOCR**: OCR evaluation benchmark

### 🤖 Agent

- **Terminal-Bench 2.0**: Terminal command understanding

### 💬 Language

- **EvasionBench**: Language understanding challenges

## 🚀 Quick Start

The leaderboard loads automatically from the HuggingFace dataset: `OpenEvals/leaderboard-data`

**Quick Filters:**

- 🔹 **Small (<10B)**, 🔸 **Medium (10-100B)**, 🔶 **Large (100B+)** - Filter by model size
- 💻 **Coding**, 🧠 **Knowledge**, 📐 **Math**, etc. - Show only specific categories

## 📊 Data Source

**Dataset**: [OpenEvals/leaderboard-data](https://huggingface.co/datasets/OpenEvals/leaderboard-data)

All scores are aggregated from official HuggingFace benchmark leaderboards. The dataset is updated regularly with the latest model evaluations.

## 💻 Local Development

```bash
# Install dependencies
pip install -r requirements.txt

# Run the app
python app.py
```

## 📁 Project Structure

```
.
├── app.py                    # Main Gradio application
├── utils/
│   ├── data_loader.py        # Load data from HuggingFace dataset
│   ├── filters.py            # Filter and search logic
│   ├── formatters.py         # Data formatting utilities
│   └── html_generator.py     # Generate HTML leaderboard table
├── static/
│   └── sortTable.js          # Client-side table sorting
├── data/
│   └── provider_logos.json   # Provider avatar URLs
└── requirements.txt          # Python dependencies
```

## 🔧 Technologies

- **Gradio 5.50.0**: Interactive web interface
- **Datasets**: HuggingFace datasets library
- **Pandas**: Data manipulation
- **RangeSlider**: Custom Gradio component for size filtering

## 📝 License

Data is sourced from official HuggingFace benchmarks. Please refer to individual benchmark pages for specific licensing information.

---

Made with ❤️ by the Benchmarks Team
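
For contributors curious how the size presets under Quick Filters might work, here is a minimal sketch. It is not the code in `utils/filters.py`. The row shape, the `params_b` field name, the `filter_by_size` helper, and the choice of half-open ranges (10B counts as Medium, 100B as Large) are all assumptions made for illustration.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical preset table mirroring the Quick Filters labels.
# Each entry is (inclusive lower bound, exclusive upper bound) in billions
# of parameters; None means "no upper bound".
SIZE_PRESETS: Dict[str, Tuple[float, Optional[float]]] = {
    "small": (0.0, 10.0),     # Small (<10B)
    "medium": (10.0, 100.0),  # Medium (10-100B)
    "large": (100.0, None),   # Large (100B+)
}


def filter_by_size(rows: List[dict], preset: str) -> List[dict]:
    """Return the rows whose parameter count falls inside the preset range."""
    low, high = SIZE_PRESETS[preset]
    return [
        row
        for row in rows
        if row["params_b"] >= low and (high is None or row["params_b"] < high)
    ]


# Made-up rows for demonstration only; "params_b" is an assumed column name
# and the real dataset schema may differ.
rows = [
    {"model": "example-7b", "params_b": 7.0},
    {"model": "example-10b", "params_b": 10.0},
    {"model": "example-70b", "params_b": 70.0},
    {"model": "example-405b", "params_b": 405.0},
]

for preset in SIZE_PRESETS:
    names = [row["model"] for row in filter_by_size(rows, preset)]
    print(preset, names)
```

Keeping the presets in one table means a new size bucket only needs a new dictionary entry, and the same ranges could feed the size range slider.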