| WebArena: Self-hosted realistic websites where agents complete long-horizon tasks. | Agent / Tool use | Zhou et al., 2023, arXiv:2307.13854 | Apache-2.0 | Roadmap |
| VisualWebArena: WebArena tasks that require understanding images and visual layout. | Agent / Tool use | Koh et al., 2024, arXiv:2401.13649 | Apache-2.0 | Roadmap |
| AgentBench: Eight distinct environments measuring agent ability across domains. | Agent / Tool use | Liu et al., 2023, arXiv:2308.03688 | Apache-2.0 | Roadmap |
| AgentBoard: Fine-grained progress-rate metrics over partially solved agent tasks. | Agent / Tool use | Ma et al., 2024, arXiv:2401.13178 | Apache-2.0 | Roadmap |
| AssistantBench: Time-consuming real web tasks with verifiable short answers. | Agent / Tool use | Yoran et al., 2024, arXiv:2407.15711 | MIT | Roadmap |
| ToolBench: Tool use over thousands of real REST APIs, scored by a pass-rate judge. | Agent / Tool use | Qin et al., 2023, arXiv:2307.16789 | Apache-2.0 | Roadmap |
| API-Bank: Plan-and-call evaluation over a graded pool of tool APIs. | Agent / Tool use | Li et al., 2023, arXiv:2304.08244 | MIT | Roadmap |
| MLE-bench: 75 Kaggle competitions where agents build and submit ML solutions. | Agent / Tool use | Chan et al., 2024, arXiv:2410.07095 | Custom (OpenAI) | Roadmap |
| RE-Bench: Open-ended ML research tasks scored against expert human baselines. | Agent / Tool use | Wijk et al., 2024, arXiv:2411.15114 | MIT | Roadmap |
| OSWorld: Real operating-system tasks across apps in a live virtual machine. | Agent / Tool use | Xie et al., 2024, arXiv:2404.07972 | Apache-2.0 | Roadmap |
| AndroidWorld: 116 tasks across 20 Android apps in a live emulator with dynamic state. | Agent / Tool use | Rawles et al., 2024, arXiv:2405.14573 | Apache-2.0 | Roadmap |
| Windows Agent Arena: Windows desktop tasks in parallelizable cloud VMs. | Agent / Tool use | Bonatti et al., 2024, arXiv:2409.08264 | MIT | Roadmap |