| FrontierMath Unpublished research-level math problems vetted by professional mathematicians. | Reasoning | Glazer et al., 2024, arXiv:2411.04872 | Proprietary (Epoch AI) | Roadmap |
| BigCodeBench Practical programming tasks chaining many real library function calls. | Code | Zhuo et al., 2024, arXiv:2406.15877 | Apache-2.0 | Roadmap |
| RepoBench Repository-level completion requiring cross-file retrieval and context. | Code | Liu et al., 2023, arXiv:2306.03091 | MIT | Roadmap |
| SWE-Lancer Real freelance software tasks priced by their actual payout. | Code | Miserendino et al., 2025, arXiv:2502.12115 | Custom (OpenAI) | Roadmap |
| WebArena Self-hosted realistic websites where agents complete long-horizon tasks. | Agent / Tool use | Zhou et al., 2023, arXiv:2307.13854 | Apache-2.0 | Roadmap |
| VisualWebArena WebArena tasks that require understanding images and visual layout. | Agent / Tool use | Koh et al., 2024, arXiv:2401.13649 | Apache-2.0 | Roadmap |
| AgentBench Eight distinct environments measuring agent ability across domains. | Agent / Tool use | Liu et al., 2023, arXiv:2308.03688 | Apache-2.0 | Roadmap |
| AgentBoard Fine-grained progress-rate metrics over partially solved agent tasks. | Agent / Tool use | Ma et al., 2024, arXiv:2401.13178 | Apache-2.0 | Roadmap |
| AssistantBench Time-consuming real web tasks with verifiable short answers. | Agent / Tool use | Yoran et al., 2024, arXiv:2407.15711 | MIT | Roadmap |
| ToolBench Tool use over thousands of real REST APIs with a pass-rate judge. | Agent / Tool use | Qin et al., 2023, arXiv:2307.16789 | Apache-2.0 | Roadmap |
| API-Bank Plan-and-call evaluation over a graded pool of tool APIs. | Agent / Tool use | Li et al., 2023, arXiv:2304.08244 | MIT | Roadmap |
| MLE-bench 75 Kaggle competitions where agents build and submit ML solutions. | Agent / Tool use | Chan et al., 2024, arXiv:2410.07095 | Custom (OpenAI) | Roadmap |
| RE-Bench Open-ended ML research tasks scored against expert human baselines. | Agent / Tool use | Wijk et al., 2024, arXiv:2411.15114 | MIT | Roadmap |
| OSWorld Real operating-system tasks across apps in a live virtual machine. | Agent / Tool use | Xie et al., 2024, arXiv:2404.07972 | Apache-2.0 | Roadmap |
| AndroidWorld 116 tasks across 20 Android apps in a live emulator with dynamic state. | Agent / Tool use | Rawles et al., 2024, arXiv:2405.14573 | Apache-2.0 | Roadmap |
| Windows Agent Arena Windows desktop tasks in parallelizable cloud VMs. | Agent / Tool use | Bonatti et al., 2024, arXiv:2409.08264 | MIT | Roadmap |
| Habitat ObjectNav Navigate photorealistic 3D homes to find an instance of a target object. | Embodied | Szot et al., 2021, arXiv:2106.14405 | MIT | Roadmap |
| ALFRED Follow language instructions to complete household tasks from vision. | Embodied | Shridhar et al., 2019, arXiv:1912.01734 | MIT | Roadmap |
| ManiSkill2 GPU-parallel manipulation tasks with dense and sparse rewards. | Robotics | Gu et al., 2023, arXiv:2302.04659 | Apache-2.0 | Roadmap |
| CALVIN Long-horizon language-conditioned manipulation from undirected play data. | Robotics | Mees et al., 2021, arXiv:2112.03227 | MIT | Roadmap |
| VIMA-Bench Manipulation specified by interleaved text-and-image multimodal prompts. | Robotics | Jiang et al., 2022, arXiv:2210.03094 | MIT | Roadmap |
| BEHAVIOR-1K 1,000 realistic household activities in high-fidelity simulation. | Robotics | Li et al., 2023 | MIT | Roadmap |
| RoboBench Broad manipulation suite spanning skills, objects, and scene variation. | Robotics | 2024 | Apache-2.0 | Roadmap |