arxiv:2406.04770

WildBench: Benchmarking LLMs with Challenging Tasks from Real Users in the Wild

Published on Jun 7, 2024

· Submitted by

AK on Jun 10, 2024

#3 Paper of the day

Upvote

Authors:

Bill Yuchen Lin ,

Yuntian Deng ,

Khyathi Chandu ,

Faeze Brahman ,

Abhilasha Ravichander ,

Valentina Pyatkin ,

Nouha Dziri ,

Ronan Le Bras ,

Yejin Choi

Abstract

WildBench evaluates large language models using human-chatbot queries, providing tasks and metrics that correlate well with human-voted evaluations.

AI-generated summary

We introduce WildBench, an automated evaluation framework designed to benchmark large language models (LLMs) using challenging, real-world user queries. WildBench consists of 1,024 tasks carefully selected from over one million human-chatbot conversation logs. For automated evaluation with WildBench, we have developed two metrics, WB-Reward and WB-Score, which are computable using advanced LLMs such as GPT-4-turbo. WildBench evaluation uses task-specific checklists to evaluate model outputs systematically and provides structured explanations that justify the scores and comparisons, resulting in more reliable and interpretable automatic judgments. WB-Reward employs fine-grained pairwise comparisons between model responses, generating five potential outcomes: much better, slightly better, slightly worse, much worse, or a tie. Unlike previous evaluations that employed a single baseline model, we selected three baseline models at varying performance levels to ensure a comprehensive pairwise evaluation. Additionally, we propose a simple method to mitigate length bias, by converting outcomes of ``slightly better/worse'' to ``tie'' if the winner response exceeds the loser one by more than K characters. WB-Score evaluates the quality of model outputs individually, making it a fast and cost-efficient evaluation metric. WildBench results demonstrate a strong correlation with the human-voted Elo ratings from Chatbot Arena on hard tasks. Specifically, WB-Reward achieves a Pearson correlation of 0.98 with top-ranking models. Additionally, WB-Score reaches 0.95, surpassing both ArenaHard's 0.91 and AlpacaEval2.0's 0.89 for length-controlled win rates, as well as the 0.87 for regular win rates.

View arXiv page View PDF GitHub 253 auto Add to collection

Community

yuchenlin

Paper author Jul 21, 2024

•

edited Jul 21, 2024

@article {yuchen2024wildbench,
  title={WildBench: Benchmarking LLMs with Challenging Tasks from Real Users in the Wild},
  author={Yuchen Lin, Bill and Deng, Yuntian and Chandu, Khyathi and Brahman, Faeze and Ravichander, Abhilasha and Pyatkin, Valentina and Dziri, Nouha and Le Bras, Ronan and Choi, Yejin},
  journal={arXiv e-prints},
  pages={arXiv--2406},
  year={2024}
}