Catalog

12 evals — Agent / Tool use

This catalog is the same catalog/evals.yaml that the CLI reads. Live means an eval runs end-to-end today; building and roadmap entries show what is coming, and contributions to them are welcome.

Eval	Description	Category	Paper	License	Status
WebArena	Self-hosted realistic websites where agents complete long-horizon tasks.	Agent / Tool use	Zhou et al., 2023, arXiv:2307.13854	Apache-2.0	Roadmap
VisualWebArena	WebArena tasks that require understanding images and visual layout.	Agent / Tool use	Koh et al., 2024, arXiv:2401.13649	Apache-2.0	Roadmap
AgentBench	Eight distinct environments measuring agent ability across domains.	Agent / Tool use	Liu et al., 2023, arXiv:2308.03688	Apache-2.0	Roadmap
AgentBoard	Fine-grained progress-rate metrics over partially solved agent tasks.	Agent / Tool use	Ma et al., 2024, arXiv:2401.13178	Apache-2.0	Roadmap
AssistantBench	Time-consuming real web tasks with verifiable short answers.	Agent / Tool use	Yoran et al., 2024, arXiv:2407.15711	MIT	Roadmap
ToolBench	Tool-use over thousands of real REST APIs with a pass-rate judge.	Agent / Tool use	Qin et al., 2023, arXiv:2307.16789	Apache-2.0	Roadmap
API-Bank	Plan-and-call evaluation over a graded pool of tool APIs.	Agent / Tool use	Li et al., 2023, arXiv:2304.08244	MIT	Roadmap
MLE-bench	75 Kaggle competitions where agents build and submit ML solutions.	Agent / Tool use	Chan et al., 2024, arXiv:2410.07095	Custom (OpenAI)	Roadmap
RE-Bench	Open-ended ML research tasks scored against expert human baselines.	Agent / Tool use	Wijk et al., 2024, arXiv:2411.15114	MIT	Roadmap
OSWorld	Real operating-system tasks across apps in a live virtual machine.	Agent / Tool use	Xie et al., 2024, arXiv:2404.07972	Apache-2.0	Roadmap
AndroidWorld	116 tasks across 20 Android apps in a live emulator with dynamic state.	Agent / Tool use	Rawles et al., 2024, arXiv:2405.14573	Apache-2.0	Roadmap
Windows Agent Arena	Windows desktop tasks in parallelizable cloud VMs.	Agent / Tool use	Bonatti et al., 2024, arXiv:2409.08264	MIT	Roadmap
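
The rows above can be treated as plain records and narrowed by category and status. The following is a minimal sketch only: the field names (name, category, license, status), the lowercase status values, and the helper names filter_evals and count_by_status are all hypothetical, chosen to mirror the table columns. The actual schema of catalog/evals.yaml is not shown on this page and may differ.

```python
from collections import Counter

# Hypothetical records mirroring three rows of the table above.
# Field names are assumptions, not the real catalog/evals.yaml schema.
CATALOG = [
    {"name": "WebArena", "category": "Agent / Tool use",
     "license": "Apache-2.0", "status": "roadmap"},
    {"name": "AssistantBench", "category": "Agent / Tool use",
     "license": "MIT", "status": "roadmap"},
    {"name": "OSWorld", "category": "Agent / Tool use",
     "license": "Apache-2.0", "status": "roadmap"},
]

# The three statuses named in the catalog description.
VALID_STATUSES = ("live", "building", "roadmap")


def filter_evals(entries, category=None, status=None):
    """Return entries matching the given category and/or status.

    A filter left as None is not applied.
    """
    if status is not None and status not in VALID_STATUSES:
        raise ValueError(f"unknown status: {status!r}")
    return [
        e for e in entries
        if (category is None or e["category"] == category)
        and (status is None or e["status"] == status)
    ]


def count_by_status(entries):
    """Count how many entries fall under each status."""
    return Counter(e["status"] for e in entries)


if __name__ == "__main__":
    for entry in filter_evals(CATALOG, category="Agent / Tool use", status="roadmap"):
        print(entry["name"], entry["license"])
    print(dict(count_by_status(CATALOG)))
```

Rejecting unknown status values up front is a design choice for the sketch: a typo such as "road-map" raises an error instead of silently returning an empty list.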