| τ-bench: Customer-service episodes (retail and airline) judged by final DB state and communicated outputs. | Agent / Tool use | Yao et al., 2024, arXiv:2406.12045 | MIT | Live |
| SWE-bench Verified: 500 human-validated GitHub issues resolved by producing a working patch. | Agent / Tool use | Jimenez et al., 2023, arXiv:2310.06770 (Verified subset, OpenAI 2024) | MIT | Building |
| SWE-bench Lite: 300-issue subset of SWE-bench for cheaper, faster iteration. | Agent / Tool use | Jimenez et al., 2023, arXiv:2310.06770 (Lite subset) | MIT | Building |
| GAIA: Real-world assistant questions needing tools, web, and multi-step reasoning. | Agent / Tool use | Mialon et al., 2023, arXiv:2311.12983 | CC-BY-4.0 | Live |
| WebArena: Self-hosted realistic websites where agents complete long-horizon tasks. | Agent / Tool use | Zhou et al., 2023, arXiv:2307.13854 | Apache-2.0 | Roadmap |
| VisualWebArena: WebArena tasks that require understanding images and visual layout. | Agent / Tool use | Koh et al., 2024, arXiv:2401.13649 | Apache-2.0 | Roadmap |
| AgentBench: Eight distinct environments measuring agent ability across domains. | Agent / Tool use | Liu et al., 2023, arXiv:2308.03688 | Apache-2.0 | Roadmap |
| AgentBoard: Fine-grained progress-rate metrics over partially solved agent tasks. | Agent / Tool use | Ma et al., 2024, arXiv:2401.13178 | Apache-2.0 | Roadmap |
| AssistantBench: Time-consuming real web tasks with verifiable short answers. | Agent / Tool use | Yoran et al., 2024, arXiv:2407.15711 | MIT | Roadmap |
| ToolBench: Tool-use over thousands of real REST APIs with a pass-rate judge. | Agent / Tool use | Qin et al., 2023, arXiv:2307.16789 | Apache-2.0 | Roadmap |
| Berkeley Function-Calling Leaderboard: Function-calling graded by AST match (simple category live; multi-call coming). | Agent / Tool use | Yan et al., 2024 | Apache-2.0 | Live |
| API-Bank: Plan-and-call evaluation over a graded pool of tool APIs. | Agent / Tool use | Li et al., 2023, arXiv:2304.08244 | MIT | Roadmap |
| MLE-bench: 75 Kaggle competitions where agents build and submit ML solutions. | Agent / Tool use | Chan et al., 2024, arXiv:2410.07095 | Custom (OpenAI) | Roadmap |
| RE-Bench: Open-ended ML research tasks scored against expert human baselines. | Agent / Tool use | Wijk et al., 2024, arXiv:2411.15114 | MIT | Roadmap |
| OSWorld: Real operating-system tasks across apps in a live virtual machine. | Agent / Tool use | Xie et al., 2024, arXiv:2404.07972 | Apache-2.0 | Roadmap |
| AndroidWorld: 116 tasks across 20 Android apps in a live emulator with dynamic state. | Agent / Tool use | Rawles et al., 2024, arXiv:2405.14573 | Apache-2.0 | Roadmap |
| Windows Agent Arena: Windows desktop tasks in parallelizable cloud VMs. | Agent / Tool use | Bonatti et al., 2024, arXiv:2409.08264 | MIT | Roadmap |