| GPQA Diamond: 198 expert-written, graduate-level science questions designed to be Google-proof. | Reasoning | Rein et al., 2023, arXiv:2311.12022 | CC-BY-4.0 | Live |
| MMLU-Pro: Harder variant of MMLU with ten answer choices and more reasoning-heavy questions. | Reasoning | Wang et al., 2024, arXiv:2406.01574 | MIT | Live |
| MATH: Competition mathematics problems graded on the final boxed answer. | Reasoning | Hendrycks et al., 2021, arXiv:2103.03874 | MIT | Live |
| AIME 2024: Problems from the American Invitational Mathematics Examination with integer answers from 0 to 999; graded by exact match. | Reasoning | MAA, 2024 | Educational use; see MAA terms | Live |
| BIG-Bench Hard: 23 BIG-Bench tasks (27 subtasks) on which prior models underperformed the average human rater. | Reasoning | Suzgun et al., 2022, arXiv:2210.09261 | Apache-2.0 | Live |
| MuSR: Long narrative puzzles (murder mysteries, object placements, team allocation) requiring multistep soft reasoning. | Reasoning | Sprague et al., 2023, arXiv:2310.16049 | MIT | Live |
| ZebraLogic: Einstein-style logic-grid puzzles scored by full-grid correctness. | Reasoning | Lin et al., 2025, arXiv:2502.01100 | Apache-2.0 (solutions gated upstream) | Live |