| GPQA Diamond: 198 expert-written graduate science questions designed to be Google-proof. | Reasoning | Rein et al., 2023, arXiv:2311.12022 | CC-BY-4.0 | Live |
| MMLU-Pro: Harder MMLU with ten answer choices and reasoning-heavy questions. | Reasoning | Wang et al., 2024, arXiv:2406.01574 | MIT | Live |
| MATH: Competition mathematics graded on the final boxed answer. | Reasoning | Hendrycks et al., 2021, arXiv:2103.03874 | MIT | Live |
| AIME 2024: Olympiad problems with integer answers in 0-999; exact-match graded. | Reasoning | MAA, 2024 | Educational use; see MAA terms | Live |
| BIG-Bench Hard: 23 BIG-Bench tasks (27 counting subtasks) where prior models underperformed humans. | Reasoning | Suzgun et al., 2022, arXiv:2210.09261 | Apache-2.0 | Live |
| MuSR: Long narrative puzzles (murder mysteries, logistics) requiring soft reasoning. | Reasoning | Sprague et al., 2023, arXiv:2310.16049 | MIT | Live |
| ZebraLogic: Einstein-style logic-grid puzzles scored by full-grid correctness. | Reasoning | Lin et al., 2025, arXiv:2502.01100 | Apache-2.0 (solutions gated upstream) | Live |
| HumanEval+: HumanEval with 80x more tests to catch incorrect-but-plausible solutions. | Code | Liu et al., 2023, arXiv:2305.01210 | Apache-2.0 | Live |
| LiveCodeBench: Continuously updated contest problems to avoid training-set contamination. | Code | Jain et al., 2024, arXiv:2403.07974 | CC-BY-4.0 | Live |
| τ-bench: Customer-service episodes (retail and airline) judged by final DB state and communicated outputs. | Agent / Tool use | Yao et al., 2024, arXiv:2406.12045 | MIT | Live |
| GAIA: Real-world assistant questions needing tools, web, and multi-step reasoning. | Agent / Tool use | Mialon et al., 2023, arXiv:2311.12983 | CC-BY-4.0 | Live |
| Berkeley Function-Calling Leaderboard: Function calls graded by AST match (simple category live; multi-call coming). | Agent / Tool use | Yan et al., 2024 | Apache-2.0 | Live |
| ALFWorld: Household tasks in the real TextWorld engine; success rate over 134 unseen games. | Embodied | Shridhar et al., 2020, arXiv:2010.03768 | MIT | Live |
| ScienceWorld: Interactive science experiments requiring procedural understanding. | Embodied | Wang et al., 2022, arXiv:2203.07540 | Apache-2.0 | Live |
| WebShop: Buy the right product from 1.18M items given a natural-language goal. | Embodied | Yao et al., 2022, arXiv:2207.01206 | MIT | Live |
| LIBERO: Lifelong manipulation suites testing declarative and procedural transfer. | Robotics | Liu et al., 2023, arXiv:2306.03310 | MIT | Live |
| HarmBench: Red-teaming behaviors (standard and contextual), judged; score = 1 - ASR (attack success rate). | Safety / Security | Mazeika et al., 2024, arXiv:2402.04249 | MIT | Live |
| AILuminate: MLCommons hazard prompts (public practice set), scored by judged safe-response rate. | Safety / Security | MLCommons, 2024, arXiv:2503.05731 | CC-BY-4.0 | Live |
| JailbreakBench: Refusal robustness on the JBB harmful behaviors (no-attack baseline; judged). | Safety / Security | Chao et al., 2024, arXiv:2404.01318 | MIT | Live |
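
Several rows above name a scoring rule in passing: exact match on an integer in 0-999 (AIME 2024), grading on the final boxed answer (MATH), and score = 1 - ASR (HarmBench). The Python sketch below shows one way those three rules could be computed. It is illustrative only: the function names `aime_exact_match`, `last_boxed`, and `safety_score` are hypothetical, the input formats are assumptions, and none of this is taken from the benchmarks' official graders, which may normalize answers differently.

```python
from typing import Optional, Sequence


def aime_exact_match(prediction: str, reference: int) -> bool:
    """Exact-match grading for an integer answer in the range 0-999.

    Assumption: the prediction is the bare answer string; leading zeros
    such as "073" are accepted, anything non-numeric is rejected.
    """
    text = prediction.strip()
    if not (text.isascii() and text.isdigit()):
        return False  # rejects signs, decimals, and surrounding words
    value = int(text)
    return 0 <= value <= 999 and value == reference


def last_boxed(solution: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...}, allowing nested braces."""
    marker = "\\boxed{"
    start = solution.rfind(marker)
    if start == -1:
        return None
    depth = 1
    chars = []
    for ch in solution[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(ch)
    return None  # braces never balanced


def safety_score(attack_succeeded: Sequence[bool]) -> float:
    """Compute score = 1 - ASR, where ASR is the attack success rate.

    Assumption: each entry is a judge verdict, True if the attack succeeded.
    """
    if not attack_succeeded:
        raise ValueError("need at least one judged behavior")
    asr = sum(attack_succeeded) / len(attack_succeeded)
    return 1.0 - asr
```

For example, `safety_score([True, False, False, False])` corresponds to an ASR of 0.25, and `last_boxed` returns only the final boxed expression when a solution contains several.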