Auto-deploy from GitHub
- LICENSE-DATA.txt +8 -12
- results-README.md +76 -102
- results.csv.timestamp +1 -1
LICENSE-DATA.txt
CHANGED
@@ -53,17 +53,13 @@ Full license text available at: https://cdla.dev/permissive-2-0/

 DATA ATTRIBUTION

-Source:
-Website: https://
-GitHub: https://github.com/
-
-
-Citation:
-[TODO: Add paper citation when available]
-For now, please cite the Exgentic project and website.
+Source: Open Agent Leaderboard
+Website: https://exgentic.ai
+GitHub: https://github.com/Exgentic/open-agent-leaderboard
+Paper: https://arxiv.org/abs/2602.22953

 Description:
-This dataset contains benchmark results for
-across
-data includes success rates, costs,
-
+This dataset contains benchmark results for general-purpose AI agent systems
+evaluated across six diverse tasks: AppWorld, BrowseComp+, SWE-bench, and
+TauBench (Airline, Retail, Telecom). The data includes success rates, costs,
+and performance metrics for different agent-model combinations.
results-README.md
CHANGED
@@ -1,102 +1,76 @@
-#
-
-## Overview
-
-
-## License
-This dataset is licensed under the **Community Data License Agreement
-
-- Full license text: [LICENSE-DATA.txt](./LICENSE-DATA.txt)
-- Official license page: https://cdla.dev/permissive-2-0/
-
-##
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-- `
-
-
-
-
--
-
-
--
--
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
--
--
-
-
-
-
-- Performance may vary with different prompts or configurations
-- Costs are approximate and may vary based on API pricing changes
-- Some benchmarks have limited task coverage
-
-## Updates and Versioning
-This dataset is periodically updated with new results. Check the `date` column for evaluation timestamps and the website for the latest version.
-
-## Contact
-For questions, issues, or contributions:
-- GitHub Issues: https://github.com/IBM/exgentic/issues
-- Website: https://ibm-research-ai.github.io/exgentic-website/
-
-## Acknowledgments
-This work builds upon the following benchmark datasets:
-- WebArena
-- AssistantBench
-- GAIA
-- OSWorld
-
-We thank the creators of these benchmarks for their contributions to the AI agent evaluation ecosystem.
-
----
-
-**License**: CDLA-Permissive-2.0
-**Version**: 1.0
-**Date**: February 25, 2026
+# Open Agent Leaderboard Data
+
+## Overview
+Benchmark evaluation results for general-purpose AI agent systems tested across six diverse real-world tasks. The data powers the [Open Agent Leaderboard](https://huggingface.co/spaces/open-agent-leaderboard/leaderboard).
+
+## License
+This dataset is licensed under the **Community Data License Agreement -- Permissive -- Version 2.0 (CDLA-Permissive-2.0)**.
+
+- Full license text: [LICENSE-DATA.txt](./LICENSE-DATA.txt)
+- Official license page: https://cdla.dev/permissive-2-0/
+
+## Citation
+
+```bibtex
+@inproceedings{bandel2026general,
+title={General Agent Evaluation},
+author={Bandel, Elron and Yehudai, Asaf and Jacovi, Michal and Katz, Yoav and Shmueli-Scheuer, Michal and Choshen, Leshem},
+booktitle={ICLR 2026 Workshop on LLM Agents},
+year={2026},
+url={https://arxiv.org/abs/2602.22953}
+}
+```
+
+## Data Description
+
+### File: `data/results.csv`
+Each row represents one agent-model-benchmark combination.
+
+**Key columns:**
+- `agent`, `agent_name`: Agent identifier and display name
+- `model`, `model_name`: LLM used
+- `benchmark`, `benchmark_name`: Benchmark evaluated
+- `benchmark_score`: Primary success rate (0--1)
+- `average_agent_cost`: Mean cost per task in USD
+- `average_steps`: Mean steps per task
+- `planned_sessions`, `successful_sessions`, `total_sessions`: Task counts
+- `percent_finished`, `percent_successful`: Completion and success rates
+
+### Benchmarks
+
+| Benchmark | Domain |
+|-----------|--------|
+| AppWorld | App-based task completion |
+| BrowseComp+ | Web research and information retrieval |
+| SWE-bench | Software engineering issue resolution |
+| TauBench-Airline | Customer service (airline) |
+| TauBench-Retail | Customer service (retail) |
+| TauBench-Telecom | Technical support (telecom) |
+
+### Agents
+
+| Agent | Framework |
+|-------|-----------|
+| Claude Code | [claude-code](https://github.com/anthropics/claude-code) |
+| OpenAI Solo | [openai-agents-python](https://github.com/openai/openai-agents-python) |
+| Smolagent | [smolagents](https://github.com/huggingface/smolagents) |
+| React | [litellm](https://github.com/BerriAI/litellm) |
+| React + Shortlisting | [litellm](https://github.com/BerriAI/litellm) |
+
+### Models
+
+Claude Opus 4.5, GPT-5.2, Gemini Pro 3.
+
+## Data Collection
+
+Each agent was evaluated using the [Exgentic](https://github.com/Exgentic/exgentic) framework with consistent evaluation protocols, resource limits, success criteria, and cost tracking. Agents were tested as general-purpose systems without benchmark-specific tuning.
+
+See the [paper](https://arxiv.org/abs/2602.22953) for full methodology.
+
+## Contact
+- GitHub: https://github.com/Exgentic/open-agent-leaderboard
+- Website: https://exgentic.ai
+
+---
+
+**License**: CDLA-Permissive-2.0
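The key columns that the new results-README.md lists for `data/results.csv` lend themselves to a simple per-agent summary. The snippet below is only a sketch under stated assumptions: the rows are invented placeholders (not leaderboard results), `summarize` is a hypothetical helper, and it assumes `benchmark_score` and `average_agent_cost` are always populated with numeric values.

```python
import csv
import io
from collections import defaultdict

# Made-up rows for illustration only; these are NOT values from data/results.csv.
# Only the column names are taken from the README's "Key columns" list.
SAMPLE_CSV = """agent,model,benchmark,benchmark_score,average_agent_cost
agent_a,model_x,bench_1,0.50,1.00
agent_a,model_x,bench_2,0.70,3.00
agent_b,model_x,bench_1,0.40,0.50
"""


def summarize(csv_file):
    """Average benchmark_score and average_agent_cost per (agent, model) pair."""
    scores = defaultdict(list)
    costs = defaultdict(list)
    for row in csv.DictReader(csv_file):
        key = (row["agent"], row["model"])
        # Assumption: both fields are non-empty decimal strings.
        scores[key].append(float(row["benchmark_score"]))
        costs[key].append(float(row["average_agent_cost"]))
    return {
        key: {
            "mean_score": sum(scores[key]) / len(scores[key]),
            "mean_cost_usd": sum(costs[key]) / len(costs[key]),
            "benchmarks": len(scores[key]),
        }
        for key in scores
    }


summary = summarize(io.StringIO(SAMPLE_CSV))
for (agent, model), stats in sorted(summary.items()):
    print(agent, model, stats)
```

To run this against the real file, pass an open file handle instead of the in-memory sample, for example `summarize(open("data/results.csv", newline=""))`. An unweighted mean across benchmarks is one possible design choice; it is not necessarily how the leaderboard itself aggregates scores.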
results.csv.timestamp
CHANGED
@@ -1 +1 @@
-Mon Mar 30 14:
+Mon Mar 30 14:50:35 UTC 2026
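The new value in results.csv.timestamp resembles the default output of `date -u`. Assuming that layout holds for every update, it could be parsed into a timezone-aware value roughly as follows; the format string and the `parse_timestamp` helper are assumptions inferred from this single line, not part of the repository.

```python
from datetime import datetime, timezone

# Assumed layout, inferred from the one timestamp shown in the diff.
# %a and %b match English day/month abbreviations under the default C locale.
TIMESTAMP_FORMAT = "%a %b %d %H:%M:%S UTC %Y"


def parse_timestamp(text: str) -> datetime:
    """Parse a timestamp line and attach UTC, which the line names literally."""
    naive = datetime.strptime(text.strip(), TIMESTAMP_FORMAT)
    return naive.replace(tzinfo=timezone.utc)


stamp = parse_timestamp("Mon Mar 30 14:50:35 UTC 2026")
print(stamp.isoformat())  # prints 2026-03-30T14:50:35+00:00
```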