Catalog

20 evals

This page lists the same catalog/evals.yaml that the CLI reads. A live eval runs end-to-end today; building and roadmap entries show what is coming, and contributions to them are welcome.


Eval	Description	Category	Paper	License	Status
FrontierMath	Unpublished research-level math problems vetted by professional mathematicians.	Reasoning	Glazer et al., 2024, arXiv:2411.04872	Proprietary (Epoch AI)	Roadmap
WebArena	Self-hosted realistic websites where agents complete long-horizon tasks.	Agent / Tool use	Zhou et al., 2023, arXiv:2307.13854	Apache-2.0	Roadmap
VisualWebArena	WebArena tasks that require understanding images and visual layout.	Agent / Tool use	Koh et al., 2024, arXiv:2401.13649	Apache-2.0	Roadmap
AgentBench	Eight distinct environments measuring agent ability across domains.	Agent / Tool use	Liu et al., 2023, arXiv:2308.03688	Apache-2.0	Roadmap
AgentBoard	Fine-grained progress-rate metrics over partially solved agent tasks.	Agent / Tool use	Ma et al., 2024, arXiv:2401.13178	Apache-2.0	Roadmap
AssistantBench	Time-consuming real web tasks with verifiable short answers.	Agent / Tool use	Yoran et al., 2024, arXiv:2407.15711	MIT	Roadmap
ToolBench	Tool use over thousands of real REST APIs with a pass-rate judge.	Agent / Tool use	Qin et al., 2023, arXiv:2307.16789	Apache-2.0	Roadmap
API-Bank	Plan-and-call evaluation over a graded pool of tool APIs.	Agent / Tool use	Li et al., 2023, arXiv:2304.08244	MIT	Roadmap
MLE-bench	75 Kaggle competitions where agents build and submit ML solutions.	Agent / Tool use	Chan et al., 2024, arXiv:2410.07095	Custom (OpenAI)	Roadmap
RE-Bench	Open-ended ML research tasks scored against expert human baselines.	Agent / Tool use	Wijk et al., 2024, arXiv:2411.15114	MIT	Roadmap
OSWorld	Real operating-system tasks across apps in a live virtual machine.	Agent / Tool use	Xie et al., 2024, arXiv:2404.07972	Apache-2.0	Roadmap
AndroidWorld	116 tasks across 20 Android apps in a live emulator with dynamic state.	Agent / Tool use	Rawles et al., 2024, arXiv:2405.14573	Apache-2.0	Roadmap
Windows Agent Arena	Windows desktop tasks in parallelizable cloud VMs.	Agent / Tool use	Bonatti et al., 2024, arXiv:2409.08264	MIT	Roadmap
Habitat ObjectNav	Navigate photorealistic 3D homes to find an instance of a target object.	Embodied	Szot et al., 2021, arXiv:2106.14405	MIT	Roadmap
ALFRED	Follow language instructions to complete household tasks from vision.	Embodied	Shridhar et al., 2019, arXiv:1912.01734	MIT	Roadmap
ManiSkill 2	GPU-parallel manipulation tasks with dense and sparse rewards.	Robotics	Gu et al., 2023, arXiv:2302.04659	Apache-2.0	Roadmap
CALVIN	Long-horizon language-conditioned manipulation from undirected play data.	Robotics	Mees et al., 2021, arXiv:2112.03227	MIT	Roadmap
VIMA-Bench	Manipulation specified by interleaved text-and-image multimodal prompts.	Robotics	Jiang et al., 2022, arXiv:2210.03094	MIT	Roadmap
BEHAVIOR-1K	1,000 realistic household activities in high-fidelity simulation.	Robotics	Li et al., 2023	MIT	Roadmap
RoboBench	Broad manipulation suite spanning skills, objects, and scene variation.	Robotics	2024	Apache-2.0	Roadmap
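
The category and status filters on this page can be reproduced offline once the catalog is loaded into memory. The following is a minimal sketch, not the CLI's actual code: the field names (`name`, `category`, `status`), the `CATALOG` list, and the helpers `filter_evals` and `count_by` are all hypothetical, and the real catalog/evals.yaml schema may differ. The four sample rows are copied from the table above.

```python
from collections import Counter

# Hypothetical in-memory form of a few catalog rows. The field names
# ("name", "category", "status") are assumptions, not the real evals.yaml schema.
CATALOG = [
    {"name": "FrontierMath", "category": "Reasoning", "status": "roadmap"},
    {"name": "WebArena", "category": "Agent / Tool use", "status": "roadmap"},
    {"name": "ALFRED", "category": "Embodied", "status": "roadmap"},
    {"name": "CALVIN", "category": "Robotics", "status": "roadmap"},
]


def filter_evals(entries, category=None, status=None):
    """Return entries matching the given category and/or status (case-insensitive)."""
    def matches(entry):
        if category is not None and entry["category"].lower() != category.lower():
            return False
        if status is not None and entry["status"].lower() != status.lower():
            return False
        return True

    return [entry for entry in entries if matches(entry)]


def count_by(entries, field):
    """Tally entries by the value of one field, e.g. "category" or "status"."""
    return Counter(entry[field] for entry in entries)


if __name__ == "__main__":
    robotics = filter_evals(CATALOG, category="Robotics")
    print([entry["name"] for entry in robotics])
    print(len(filter_evals(CATALOG, status="live")))
    print(count_by(CATALOG, "category"))
```

Passing `None` for a filter leaves that dimension unconstrained, which mirrors the "All" option in a category filter. Matching is case-insensitive so that `Roadmap` in the table and `roadmap` in a file would compare equal.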