Catalog

17 evals — Agent / Tool use

This is the same catalog/evals.yaml the CLI reads. Live means the eval runs end-to-end today; Building and Roadmap entries show what is coming, and contributions to them are welcome.


Eval	Description	Category	Paper	License	Status
τ-bench	Customer-service episodes (retail + airline) judged by final DB state and communicated outputs.	Agent / Tool use	Yao et al., 2024, arXiv:2406.12045	MIT	Live
SWE-bench Verified	500 human-validated GitHub issues resolved by producing a working patch.	Agent / Tool use	Jimenez et al., 2023, arXiv:2310.06770 (Verified subset, OpenAI 2024)	MIT	Live
SWE-bench Lite	300-issue subset of SWE-bench for cheaper, faster iteration.	Agent / Tool use	Jimenez et al., 2023, arXiv:2310.06770 (Lite subset)	MIT	Live
GAIA	Real-world assistant questions needing tools, web, and multi-step reasoning.	Agent / Tool use	Mialon et al., 2023, arXiv:2311.12983	CC-BY-4.0	Live
WebArena	Self-hosted realistic websites where agents complete long-horizon tasks.	Agent / Tool use	Zhou et al., 2023, arXiv:2307.13854	Apache-2.0	Roadmap
VisualWebArena	WebArena tasks that require understanding images and visual layout.	Agent / Tool use	Koh et al., 2024, arXiv:2401.13649	Apache-2.0	Roadmap
AgentBench	Eight distinct environments measuring agent ability across domains.	Agent / Tool use	Liu et al., 2023, arXiv:2308.03688	Apache-2.0	Roadmap
AgentBoard	Fine-grained progress-rate metrics over partially solved agent tasks.	Agent / Tool use	Ma et al., 2024, arXiv:2401.13178	Apache-2.0	Roadmap
AssistantBench	Time-consuming real web tasks with verifiable short answers.	Agent / Tool use	Yoran et al., 2024, arXiv:2407.15711	MIT	Roadmap
ToolBench	Tool use over thousands of real REST APIs with a pass-rate judge.	Agent / Tool use	Qin et al., 2023, arXiv:2307.16789	Apache-2.0	Roadmap
Berkeley Function-Calling Leaderboard	Function calling graded by AST match (simple category live; multi-call coming).	Agent / Tool use	Yan et al., 2024	Apache-2.0	Live
API-Bank	Plan-and-call evaluation over a graded pool of tool APIs.	Agent / Tool use	Li et al., 2023, arXiv:2304.08244	MIT	Roadmap
MLE-bench	75 Kaggle competitions where agents build and submit ML solutions.	Agent / Tool use	Chan et al., 2024, arXiv:2410.07095	Custom (OpenAI)	Roadmap
RE-Bench	Open-ended ML research tasks scored against expert human baselines.	Agent / Tool use	Wijk et al., 2024, arXiv:2411.15114	MIT	Roadmap
OSWorld	Real operating-system tasks across apps in a live virtual machine.	Agent / Tool use	Xie et al., 2024, arXiv:2404.07972	Apache-2.0	Roadmap
AndroidWorld	116 tasks across 20 Android apps in a live emulator with dynamic state.	Agent / Tool use	Rawles et al., 2024, arXiv:2405.14573	Apache-2.0	Roadmap
Windows Agent Arena	Windows desktop tasks in parallelizable cloud VMs.	Agent / Tool use	Bonatti et al., 2024, arXiv:2409.08264	MIT	Roadmap
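
For readers who want to work with the Status column programmatically, here is a minimal Python sketch of filtering and tallying entries by status. The schema of catalog/evals.yaml is not shown on this page, so the field names (`name`, `license`, `status`) and the helper functions are assumptions for illustration. The entries are a hand-copied subset of the table above, not the full catalog.

```python
from collections import Counter

# Hypothetical in-memory stand-in for a few catalog rows. The real
# catalog/evals.yaml layout may differ; these keys are assumed.
ENTRIES = [
    {"name": "τ-bench", "license": "MIT", "status": "Live"},
    {"name": "SWE-bench Verified", "license": "MIT", "status": "Live"},
    {"name": "GAIA", "license": "CC-BY-4.0", "status": "Live"},
    {"name": "WebArena", "license": "Apache-2.0", "status": "Roadmap"},
    {"name": "OSWorld", "license": "Apache-2.0", "status": "Roadmap"},
]


def by_status(entries, status):
    """Return the names of entries whose status matches, ignoring case."""
    wanted = status.lower()
    return [e["name"] for e in entries if e["status"].lower() == wanted]


def status_counts(entries):
    """Count entries per lowercased status value."""
    return Counter(e["status"].lower() for e in entries)


if __name__ == "__main__":
    print(by_status(ENTRIES, "live"))
    print(dict(status_counts(ENTRIES)))
```

Matching on a lowercased status keeps the filter tolerant of "Live" versus "live", since the page shows both spellings.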