📊 Benchmarks and Leaderboards - a society-ethics Collection

updated Aug 11, 2025

Running on CPU Upgrade

14k

Open LLM Leaderboard

🏆

Track, rank and evaluate open LLMs and chatbots

Runtime error

5

Zeno Evals Hub

🏃

Running on CPU Upgrade

7.49k

MTEB Leaderboard

📊

Embedding Leaderboard

Running

588

LLM-Perf Leaderboard

🏆

Compare LLM hardware performance and find the best model

Runtime error

135

Leaderboards

📈

Running on CPU Upgrade

1.38k

Open ASR Leaderboard

🏆

Explore and compare speech recognition model benchmarks

Running

1.51k

Big Code Models Leaderboard

📈

Explore and compare code model performance on a leaderboard

Running

4.92k

Arena Leaderboard

🏆

View the LMArena leaderboard in full-screen

Running

176

Open Object Detection Leaderboard

🏆

Request evaluation for a new model

Running

71

Toolbench Leaderboard

⚡

Display leaderboard of language models

Build error

85

SEED-Bench Leaderboard

🏆

Submit model evaluation results to leaderboard

Running

95

OpenCompass LLM Leaderboard

🚀

Serve a web page from a Flask server

nguha/legalbench

Viewer • Updated Mar 30 • 91.8k • 20.3k • 183

Running

6

Skillmix

🚀

Browse and compare AI model evaluations

Runtime error

145

Hallucinations Leaderboard

🔥

View and submit LLM evaluations

Running

43

MVBench Leaderboard

🐨

Submit model evaluations and view the leaderboard

Running

3

Mt Bench French Browser

📊

Compare model answers to questions in French

Running

54

NPHardEval Leaderboard

🥇

Explore and filter LLM benchmark results

Running

356

VBench Leaderboard

📊

Submit video model evaluation results to a public benchmark

Build error

105

Enterprise Scenarios Leaderboard

🥇

Running

192

Yet Another LLM Leaderboard

🌖

Launch a Streamlit web app interface

Running

72

CyberSecEvalTest

📈

Evaluate LLMs' cybersecurity risks and capabilities

Running

31

Contextual Leaderboard

🐨

Submit and evaluate models for contextual understanding tasks

Runtime error

56

Open Multilingual Llm Leaderboard

🐨

Search for model performance across languages and benchmarks

Running on CPU Upgrade

93

OpenLLM Turkish leaderboard

🥇

Explore and submit LLM benchmarks

Running on CPU Upgrade

1.02k

Open VLM Leaderboard

🌎

VLMEvalKit Evaluation Results Collection

Running

432

Reward Bench Leaderboard

📐

Explore and compare model scores on RewardBench benchmarks

Build error

63

Guardrails Arena

⚔

Jailbreak the LLM and privacy guardrails

Running

19

🐍💨 Data Contamination Database

🏭

Filter data on contamination in datasets and models

Running on CPU Upgrade

182

Open Arabic LLM Leaderboard

🏆

Track, rank and evaluate open Arabic LLMs and chatbots

Runtime error

76

AIR-Bench Leaderboard

🥇

Explore and compare QA and long doc benchmarks

Runtime error

23

MM-UPD Leaderboard

🥇

Submit and evaluate model results on MM-UPD benchmarks

Running

231

BigCodeBench Leaderboard

🥇

Explore code-generation model leaderboards and task details

Building on CPU Upgrade

76

La Leaderboard

🌸

Evaluate open LLMs in the languages of LATAM and Spain.

Running

136

Open FinLLM Leaderboard

🥇

Explore and compare LLM performance on financial benchmarks