Elron committed on
Commit 22f581d · verified · 1 Parent(s): 0e68bf0

Auto-deploy from GitHub

Files changed (3)
  1. LICENSE-DATA.txt +8 -12
  2. results-README.md +76 -102
  3. results.csv.timestamp +1 -1
LICENSE-DATA.txt CHANGED
@@ -53,17 +53,13 @@ Full license text available at: https://cdla.dev/permissive-2-0/

  DATA ATTRIBUTION

- Source: Exgentic - Open Leaderboard for General-Purpose AI Agents
- Website: https://ibm-research-ai.github.io/exgentic-website/
- GitHub: https://github.com/IBM/exgentic
- Date: February 25, 2026
-
- Citation:
- [TODO: Add paper citation when available]
- For now, please cite the Exgentic project and website.
+ Source: Open Agent Leaderboard
+ Website: https://exgentic.ai
+ GitHub: https://github.com/Exgentic/open-agent-leaderboard
+ Paper: https://arxiv.org/abs/2602.22953

  Description:
- This dataset contains benchmark results for various AI agent frameworks evaluated
- across multiple tasks including WebArena, AssistantBench, GAIA, and OSWorld. The
- data includes success rates, costs, and performance metrics for different
- model-agent combinations.
+ This dataset contains benchmark results for general-purpose AI agent systems
+ evaluated across six diverse tasks: AppWorld, BrowseComp+, SWE-bench, and
+ TauBench (Airline, Retail, Telecom). The data includes success rates, costs,
+ and performance metrics for different agent-model combinations.
results-README.md CHANGED
@@ -1,102 +1,76 @@
- # Exgentic Leaderboard Data
-
- ## Overview
- This dataset contains benchmark evaluation results for various AI agent frameworks tested across multiple real-world tasks. The data powers the Exgentic leaderboard, providing transparent performance comparisons for general-purpose AI agents.
-
- ## License
- This dataset is licensed under the **Community Data License Agreement Permissive Version 2.0 (CDLA-Permissive-2.0)**.
-
- - Full license text: [LICENSE-DATA.txt](./LICENSE-DATA.txt)
- - Official license page: https://cdla.dev/permissive-2-0/
-
- ### What This Means
- You are free to:
- - ✅ Use, reproduce, and modify the data
- - ✅ Create derivative works
- - Distribute and sublicense the data
- - Use for commercial purposes
-
- No attribution is legally required, but we appreciate citations (see below).
-
- ## Citation
- **[TODO: Update with paper citation when available]**
-
- For now, please cite:
- ```
- Exgentic: Open Leaderboard for General-Purpose AI Agents
- https://ibm-research-ai.github.io/exgentic-website/
- Accessed: [Date]
- ```
-
- ## Data Description
-
- ### File: `results.csv`
- Contains 91 evaluation runs across 15 model-agent combinations.
-
- **Columns:**
- - `model`: Base language model (e.g., gpt-4o, claude-3.5-sonnet)
- - `agent`: Agent framework (e.g., ReAct, Reflexion, OpenHands)
- - `benchmark`: Evaluation task (WebArena, AssistantBench, GAIA, OSWorld)
- - `task`: Specific task within benchmark
- - `success`: Binary success indicator (0 or 1)
- - `cost`: Execution cost in USD
- - `turns`: Number of interaction turns
- - `tokens_input`: Input tokens used
- - `tokens_output`: Output tokens generated
- - `date`: Evaluation date (YYYY-MM-DD format)
-
- ### Benchmarks Included
- 1. **WebArena**: Web navigation and interaction tasks
- 2. **AssistantBench**: General assistant capabilities
- 3. **GAIA**: Question answering and reasoning
- 4. **OSWorld**: Operating system interaction tasks
-
- ### Model-Agent Combinations
- 15 combinations tested, including:
- - GPT-4o with ReAct, Reflexion, OpenHands
- - Claude 3.5 Sonnet with ReAct, Reflexion, OpenHands
- - Llama 3.1 405B with ReAct, Reflexion, OpenHands
- - And more...
-
- ## Data Source
- - **Website**: https://ibm-research-ai.github.io/exgentic-website/
- - **GitHub Repository**: https://github.com/IBM/exgentic
- - **Last Updated**: February 25, 2026
-
- ## Data Collection Methodology
- [TODO: Add methodology details when available]
-
- Each agent was evaluated on standardized benchmark tasks with consistent:
- - Evaluation protocols
- - Resource limits
- - Success criteria
- - Cost tracking
-
- ## Known Limitations
- - Results represent snapshot evaluations at a specific point in time
- - Performance may vary with different prompts or configurations
- - Costs are approximate and may vary based on API pricing changes
- - Some benchmarks have limited task coverage
-
- ## Updates and Versioning
- This dataset is periodically updated with new results. Check the `date` column for evaluation timestamps and the website for the latest version.
-
- ## Contact
- For questions, issues, or contributions:
- - GitHub Issues: https://github.com/IBM/exgentic/issues
- - Website: https://ibm-research-ai.github.io/exgentic-website/
-
- ## Acknowledgments
- This work builds upon the following benchmark datasets:
- - WebArena
- - AssistantBench
- - GAIA
- - OSWorld
-
- We thank the creators of these benchmarks for their contributions to the AI agent evaluation ecosystem.
-
- ---
-
- **License**: CDLA-Permissive-2.0
- **Version**: 1.0
- **Date**: February 25, 2026
+ # Open Agent Leaderboard Data
+
+ ## Overview
+ Benchmark evaluation results for general-purpose AI agent systems tested across six diverse real-world tasks. The data powers the [Open Agent Leaderboard](https://huggingface.co/spaces/open-agent-leaderboard/leaderboard).
+
+ ## License
+ This dataset is licensed under the **Community Data License Agreement -- Permissive -- Version 2.0 (CDLA-Permissive-2.0)**.
+
+ - Full license text: [LICENSE-DATA.txt](./LICENSE-DATA.txt)
+ - Official license page: https://cdla.dev/permissive-2-0/
+
+ ## Citation
+
+ ```bibtex
+ @inproceedings{bandel2026general,
+ title={General Agent Evaluation},
+ author={Bandel, Elron and Yehudai, Asaf and Jacovi, Michal and Katz, Yoav and Shmueli-Scheuer, Michal and Choshen, Leshem},
+ booktitle={ICLR 2026 Workshop on LLM Agents},
+ year={2026},
+ url={https://arxiv.org/abs/2602.22953}
+ }
+ ```
+
+ ## Data Description
+
+ ### File: `data/results.csv`
+ Each row represents one agent-model-benchmark combination.
+
+ **Key columns:**
+ - `agent`, `agent_name`: Agent identifier and display name
+ - `model`, `model_name`: LLM used
+ - `benchmark`, `benchmark_name`: Benchmark evaluated
+ - `benchmark_score`: Primary success rate (0--1)
+ - `average_agent_cost`: Mean cost per task in USD
+ - `average_steps`: Mean steps per task
+ - `planned_sessions`, `successful_sessions`, `total_sessions`: Task counts
+ - `percent_finished`, `percent_successful`: Completion and success rates
+
+ ### Benchmarks
+
+ | Benchmark | Domain |
+ |-----------|--------|
+ | AppWorld | App-based task completion |
+ | BrowseComp+ | Web research and information retrieval |
+ | SWE-bench | Software engineering issue resolution |
+ | TauBench-Airline | Customer service (airline) |
+ | TauBench-Retail | Customer service (retail) |
+ | TauBench-Telecom | Technical support (telecom) |
+
+ ### Agents
+
+ | Agent | Framework |
+ |-------|-----------|
+ | Claude Code | [claude-code](https://github.com/anthropics/claude-code) |
+ | OpenAI Solo | [openai-agents-python](https://github.com/openai/openai-agents-python) |
+ | Smolagent | [smolagents](https://github.com/huggingface/smolagents) |
+ | React | [litellm](https://github.com/BerriAI/litellm) |
+ | React + Shortlisting | [litellm](https://github.com/BerriAI/litellm) |
+
+ ### Models
+
+ Claude Opus 4.5, GPT-5.2, Gemini Pro 3.
+
+ ## Data Collection
+
+ Each agent was evaluated using the [Exgentic](https://github.com/Exgentic/exgentic) framework with consistent evaluation protocols, resource limits, success criteria, and cost tracking. Agents were tested as general-purpose systems without benchmark-specific tuning.
+
+ See the [paper](https://arxiv.org/abs/2602.22953) for full methodology.
+
+ ## Contact
+ - GitHub: https://github.com/Exgentic/open-agent-leaderboard
+ - Website: https://exgentic.ai
+
+ ---
+
+ **License**: CDLA-Permissive-2.0
results.csv.timestamp CHANGED
@@ -1 +1 @@
- Mon Mar 30 14:41:14 UTC 2026
+ Mon Mar 30 14:50:35 UTC 2026
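
The new results-README.md describes one row per agent-model-benchmark combination, with columns such as `agent_name`, `benchmark_score`, and `average_agent_cost`. As a rough illustration of how those columns might be consumed, here is a minimal Python sketch. The helper name `summarize_by_agent` is hypothetical, the sample rows are made-up placeholders rather than real leaderboard results, and the unweighted per-row mean is an assumption of this sketch, not necessarily how the leaderboard aggregates scores.

```python
import csv
import io
from collections import defaultdict

# Placeholder rows for illustration only: these values are invented,
# not taken from data/results.csv. Column names follow the README.
SAMPLE_CSV = """agent_name,model_name,benchmark_name,benchmark_score,average_agent_cost
Agent A,Model X,AppWorld,0.50,1.20
Agent A,Model X,SWE-bench,0.70,2.80
Agent B,Model X,AppWorld,0.40,0.60
"""


def summarize_by_agent(csv_file):
    """Return {agent_name: (mean benchmark_score, mean average_agent_cost)}.

    Uses an unweighted mean over rows, which is an assumption of this sketch.
    """
    scores = defaultdict(list)
    costs = defaultdict(list)
    for row in csv.DictReader(csv_file):
        scores[row["agent_name"]].append(float(row["benchmark_score"]))
        costs[row["agent_name"]].append(float(row["average_agent_cost"]))
    return {
        agent: (
            sum(scores[agent]) / len(scores[agent]),
            sum(costs[agent]) / len(costs[agent]),
        )
        for agent in scores
    }


summary = summarize_by_agent(io.StringIO(SAMPLE_CSV))
for agent, (score, cost) in sorted(summary.items()):
    print(f"{agent}: mean score {score:.2f}, mean cost ${cost:.2f}")
```

To run the same summary on the published file, the in-memory sample could be swapped for `open("data/results.csv", newline="")`, assuming the file sits at the path given in the README.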