Papers
arxiv:2406.04770

WildBench: Benchmarking LLMs with Challenging Tasks from Real Users in the Wild

Published on Jun 7, 2024
· Submitted by
AK
on Jun 10, 2024
#3 Paper of the day

Abstract

WildBench evaluates large language models using human-chatbot queries, providing tasks and metrics that correlate well with human-voted evaluations.

AI-generated summary

We introduce WildBench, an automated evaluation framework designed to benchmark large language models (LLMs) using challenging, real-world user queries. WildBench consists of 1,024 tasks carefully selected from over one million human-chatbot conversation logs. For automated evaluation with WildBench, we have developed two metrics, WB-Reward and WB-Score, which are computable using advanced LLMs such as GPT-4-turbo. WildBench evaluation uses task-specific checklists to evaluate model outputs systematically and provides structured explanations that justify the scores and comparisons, resulting in more reliable and interpretable automatic judgments. WB-Reward employs fine-grained pairwise comparisons between model responses, generating five potential outcomes: much better, slightly better, slightly worse, much worse, or a tie. Unlike previous evaluations that employed a single baseline model, we selected three baseline models at varying performance levels to ensure a comprehensive pairwise evaluation. Additionally, we propose a simple method to mitigate length bias, by converting outcomes of ``slightly better/worse'' to ``tie'' if the winner response exceeds the loser one by more than K characters. WB-Score evaluates the quality of model outputs individually, making it a fast and cost-efficient evaluation metric. WildBench results demonstrate a strong correlation with the human-voted Elo ratings from Chatbot Arena on hard tasks. Specifically, WB-Reward achieves a Pearson correlation of 0.98 with top-ranking models. Additionally, WB-Score reaches 0.95, surpassing both ArenaHard's 0.91 and AlpacaEval2.0's 0.89 for length-controlled win rates, as well as the 0.87 for regular win rates.

Community

@article {yuchen2024wildbench,
  title={WildBench: Benchmarking LLMs with Challenging Tasks from Real Users in the Wild},
  author={Yuchen Lin, Bill and Deng, Yuntian and Chandu, Khyathi and Brahman, Faeze and Ravichander, Abhilasha and Pyatkin, Valentina and Dziri, Nouha and Le Bras, Ronan and Choi, Yejin},
  journal={arXiv e-prints},
  pages={arXiv--2406},
  year={2024}
}

Sign up or log in to comment

Get this paper in your agent:

hf papers read 2406.04770
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2406.04770 in a model README.md to link it from this page.

Datasets citing this paper 10

Browse 10 datasets citing this paper

Spaces citing this paper 30

Browse 30 spaces citing this paper

Collections including this paper 6