| GPQA Diamond: 198 expert-written graduate science questions designed to be Google-proof. | Reasoning | Rein et al., 2023, arXiv:2311.12022 | CC-BY-4.0 | Live |
| MMLU-Pro: Harder MMLU with ten answer choices and reasoning-heavy questions. | Reasoning | Wang et al., 2024, arXiv:2406.01574 | MIT | Live |
| MATH: Competition mathematics graded on the final boxed answer. | Reasoning | Hendrycks et al., 2021, arXiv:2103.03874 | MIT | Live |
| AIME 2024: Invitational competition problems with integer answers from 0 to 999; graded by exact match. | Reasoning | MAA, 2024 | Educational use; see MAA terms | Live |
| BIG-Bench Hard: 23 BIG-Bench tasks (27 subtasks) where prior models underperformed the average human rater. | Reasoning | Suzgun et al., 2022, arXiv:2210.09261 | Apache-2.0 | Live |
| FrontierMath: Unpublished research-level math problems vetted by professional mathematicians. | Reasoning | Glazer et al., 2024, arXiv:2411.04872 | Proprietary (Epoch AI) | Roadmap |
| MuSR: Long narrative puzzles (murder mysteries, logistics) requiring soft reasoning. | Reasoning | Sprague et al., 2023, arXiv:2310.16049 | MIT | Live |
| ZebraLogic: Einstein-style logic-grid puzzles scored by full-grid correctness. | Reasoning | Lin et al., 2025, arXiv:2502.01100 | Apache-2.0 (solutions gated upstream) | Live |
| HumanEval+: HumanEval with 80x more tests to catch incorrect-but-plausible solutions. | Code | Liu et al., 2023, arXiv:2305.01210 | Apache-2.0 | Live |
| LiveCodeBench: Continuously updated contest problems to avoid training-set contamination. | Code | Jain et al., 2024, arXiv:2403.07974 | CC-BY-4.0 | Live |
| BigCodeBench: Practical programming tasks chaining many real library function calls. | Code | Zhuo et al., 2024, arXiv:2406.15877 | Apache-2.0 | Roadmap |
| RepoBench: Repository-level completion requiring cross-file retrieval and context. | Code | Liu et al., 2023, arXiv:2306.03091 | MIT | Roadmap |
| SWE-Lancer: Real freelance software tasks priced by their actual payout. | Code | Miserendino et al., 2025, arXiv:2502.12115 | Custom (OpenAI) | Roadmap |
| τ-bench: Customer-service episodes (retail+airline) judged by final DB state and communicated outputs. | Agent / Tool use | Yao et al., 2024, arXiv:2406.12045 | MIT | Live |
| SWE-bench Verified: 500 human-validated GitHub issues resolved by producing a working patch. | Agent / Tool use | Jimenez et al., 2023, arXiv:2310.06770 (Verified subset, OpenAI 2024) | MIT | Building |
| SWE-bench Lite: 300-issue subset of SWE-bench for cheaper, faster iteration. | Agent / Tool use | Jimenez et al., 2023, arXiv:2310.06770 (Lite subset) | MIT | Building |
| GAIA: Real-world assistant questions needing tools, web, and multi-step reasoning. | Agent / Tool use | Mialon et al., 2023, arXiv:2311.12983 | CC-BY-4.0 | Live |
| WebArena: Self-hosted realistic websites where agents complete long-horizon tasks. | Agent / Tool use | Zhou et al., 2023, arXiv:2307.13854 | Apache-2.0 | Roadmap |
| VisualWebArena: WebArena tasks that require understanding images and visual layout. | Agent / Tool use | Koh et al., 2024, arXiv:2401.13649 | Apache-2.0 | Roadmap |
| AgentBench: Eight distinct environments measuring agent ability across domains. | Agent / Tool use | Liu et al., 2023, arXiv:2308.03688 | Apache-2.0 | Roadmap |
| AgentBoard: Fine-grained progress-rate metrics over partially solved agent tasks. | Agent / Tool use | Ma et al., 2024, arXiv:2401.13178 | Apache-2.0 | Roadmap |
| AssistantBench: Time-consuming real web tasks with verifiable short answers. | Agent / Tool use | Yoran et al., 2024, arXiv:2407.15711 | MIT | Roadmap |
| ToolBench: Tool-use over thousands of real REST APIs with a pass-rate judge. | Agent / Tool use | Qin et al., 2023, arXiv:2307.16789 | Apache-2.0 | Roadmap |
| Berkeley Function-Calling Leaderboard: Function-calling graded by AST match (simple category live; multi-call coming). | Agent / Tool use | Yan et al., 2024 | Apache-2.0 | Live |
| API-Bank: Plan-and-call evaluation over a graded pool of tool APIs. | Agent / Tool use | Li et al., 2023, arXiv:2304.08244 | MIT | Roadmap |
| MLE-bench: 75 Kaggle competitions where agents build and submit ML solutions. | Agent / Tool use | Chan et al., 2024, arXiv:2410.07095 | Custom (OpenAI) | Roadmap |
| RE-Bench: Open-ended ML research tasks scored against expert human baselines. | Agent / Tool use | Wijk et al., 2024, arXiv:2411.15114 | MIT | Roadmap |
| OSWorld: Real operating-system tasks across apps in a live virtual machine. | Agent / Tool use | Xie et al., 2024, arXiv:2404.07972 | Apache-2.0 | Roadmap |
| AndroidWorld: 116 tasks across 20 Android apps in a live emulator with dynamic state. | Agent / Tool use | Rawles et al., 2024, arXiv:2405.14573 | Apache-2.0 | Roadmap |
| Windows Agent Arena: Windows desktop tasks in parallelizable cloud VMs. | Agent / Tool use | Bonatti et al., 2024, arXiv:2409.08264 | MIT | Roadmap |
| ALFWorld: Household tasks in the real TextWorld engine; success rate over 134 unseen games. | Embodied | Shridhar et al., 2020, arXiv:2010.03768 | MIT | Live |
| ScienceWorld: Interactive science experiments requiring procedural understanding. | Embodied | Wang et al., 2022, arXiv:2203.07540 | Apache-2.0 | Live |
| WebShop: Buy the right product from 1.18M items given a natural-language goal. | Embodied | Yao et al., 2022, arXiv:2207.01206 | MIT | Live |
| Habitat ObjectNav: Navigate photorealistic 3D homes to find an instance of a target object. | Embodied | Szot et al., 2021, arXiv:2106.14405 | MIT | Roadmap |
| ALFRED: Follow language instructions to complete household tasks from vision. | Embodied | Shridhar et al., 2019, arXiv:1912.01734 | MIT | Roadmap |
| LIBERO: Lifelong manipulation suites testing declarative and procedural transfer. | Robotics | Liu et al., 2023, arXiv:2306.03310 | MIT | Live |
| ManiSkill 2: GPU-parallel manipulation tasks with dense and sparse rewards. | Robotics | Gu et al., 2023, arXiv:2302.04659 | Apache-2.0 | Roadmap |
| CALVIN: Long-horizon language-conditioned manipulation from undirected play data. | Robotics | Mees et al., 2021, arXiv:2112.03227 | MIT | Roadmap |
| VIMA-Bench: Manipulation specified by interleaved text-and-image multimodal prompts. | Robotics | Jiang et al., 2022, arXiv:2210.03094 | MIT | Roadmap |
| BEHAVIOR-1K: 1,000 realistic household activities in high-fidelity simulation. | Robotics | Li et al., 2023 | MIT | Roadmap |
| RoboBench: Broad manipulation suite spanning skills, objects, and scene variation. | Robotics | 2024 | Apache-2.0 | Roadmap |
| Cybench: 40 professional CTF tasks measuring offensive cyber capability and risk. | Safety / Security | Zhang et al., 2024, arXiv:2408.08926 | MIT | Building |
| HarmBench: Red-teaming behaviors (standard+contextual), judged; score = 1 - ASR (attack success rate). | Safety / Security | Mazeika et al., 2024, arXiv:2402.04249 | MIT | Live |
| AILuminate: MLCommons hazard prompts (public practice set), judged safe-response rate. | Safety / Security | MLCommons, 2024, arXiv:2503.05731 | CC-BY-4.0 | Live |
| JailbreakBench: Refusal robustness on the JBB harmful behaviors (no-attack baseline; judged). | Safety / Security | Chao et al., 2024, arXiv:2404.01318 | MIT | Live |
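
Several rows above name a scoring rule rather than a dataset property: MATH is graded on the final boxed answer, AIME 2024 by exact match on an integer in 0-999, and HarmBench as 1 - ASR. The following is a minimal sketch of what those three rules could look like, not the code any of these benchmarks actually ships. The function names (`extract_boxed`, `grade_integer_answer`, `safety_score`) are hypothetical, and real harnesses typically normalize answers far more carefully than this.

```python
import re
from typing import Optional, Sequence


def extract_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in text, or None.

    Hypothetical helper: tracks brace depth so nested LaTeX such as
    \\boxed{\\frac{1}{2}} is returned whole.
    """
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1
    out = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(out).strip()
        out.append(ch)
    return None  # braces never closed


def grade_integer_answer(prediction: str, reference: int) -> bool:
    """Exact match for an integer answer in 0-999 (leading zeros allowed)."""
    s = prediction.strip()
    if not re.fullmatch(r"[0-9]{1,3}", s):
        return False
    return int(s) == reference


def safety_score(attack_successes: Sequence[bool]) -> float:
    """Compute 1 - ASR, where ASR is the fraction of judged behaviors
    on which the attack succeeded."""
    if not attack_successes:
        raise ValueError("need at least one judged behavior")
    asr = sum(attack_successes) / len(attack_successes)
    return 1.0 - asr
```

Under these assumptions, `grade_integer_answer("042", 42)` accepts a zero-padded answer while `"42.0"` or `"1000"` is rejected outright, which is the strictness that exact-match grading implies.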