Catalog

7 evals — Reasoning

This is the same catalog/evals.yaml that the CLI reads. Live means the eval runs end-to-end today; building and roadmap entries show what is coming, and contributions to them are welcome.
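As a rough illustration of how entries might be narrowed by category and status, here is a minimal Python sketch. The field names (`name`, `category`, `status`), the `filter_evals` helper, and the roadmap placeholder entry are assumptions made for this example; the actual schema of catalog/evals.yaml is not shown on this page and may differ.

```python
from collections import Counter

# Hypothetical in-memory form of a few catalog entries. The real field names
# in catalog/evals.yaml may differ; the third entry is a made-up placeholder.
CATALOG = [
    {"name": "GPQA Diamond", "category": "Reasoning", "status": "live"},
    {"name": "MMLU-Pro", "category": "Reasoning", "status": "live"},
    {"name": "Example Roadmap Eval", "category": "Code", "status": "roadmap"},
]


def filter_evals(entries, category=None, status=None):
    """Return entries matching the given category and/or status (case-insensitive)."""
    def matches(entry):
        if category is not None and entry["category"].lower() != category.lower():
            return False
        if status is not None and entry["status"].lower() != status.lower():
            return False
        return True

    return [entry for entry in entries if matches(entry)]


# Entries that are both in the Reasoning category and runnable today.
live_reasoning = filter_evals(CATALOG, category="Reasoning", status="live")

# How many entries sit in each status bucket.
status_counts = Counter(entry["status"] for entry in CATALOG)
```

Passing `None` for either argument skips that filter, so the same helper covers an "All" view as well as a single category or status.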


Eval	Category	Paper	License	Status
GPQA Diamond: 198 expert-written graduate science questions designed to be Google-proof.	Reasoning	Rein et al., 2023, arXiv:2311.12022	CC-BY-4.0	Live
MMLU-Pro: Harder MMLU with ten answer choices and reasoning-heavy questions.	Reasoning	Wang et al., 2024, arXiv:2406.01574	MIT	Live
MATH: Competition mathematics graded on the final boxed answer.	Reasoning	Hendrycks et al., 2021, arXiv:2103.03874	MIT	Live
AIME 2024: Olympiad problems with integer answers in 0-999; exact-match graded.	Reasoning	MAA, 2024	Educational use; see MAA terms	Live
BIG-Bench Hard: 23 BIG-Bench tasks (27 subtasks) where prior models underperformed average human raters.	Reasoning	Suzgun et al., 2022, arXiv:2210.09261	Apache-2.0	Live
MuSR: Long narrative puzzles (murder mysteries, logistics) requiring soft reasoning.	Reasoning	Sprague et al., 2023, arXiv:2310.16049	MIT	Live
ZebraLogic: Einstein-style logic-grid puzzles scored by full-grid correctness.	Reasoning	Lin et al., 2025, arXiv:2502.01100	Apache-2.0 (solutions gated upstream)	Live
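
The table mentions three grading styles: the final boxed answer (MATH), exact match on an integer in 0-999 (AIME 2024), and full-grid correctness (ZebraLogic). The Python sketch below shows one way each could work. The function names and the grid representation are hypothetical and chosen for illustration; these are not the graders the CLI actually uses, and a real MATH grader would typically also normalize equivalent expressions.

```python
import re

BOXED_MARKER = r"\boxed{"


def extract_boxed(text):
    """Return the contents of the last \\boxed{...} in text, or None if absent."""
    start = text.rfind(BOXED_MARKER)
    if start == -1:
        return None
    i = start + len(BOXED_MARKER)
    depth = 1  # we are inside the opening brace of \boxed{
    chars = []
    while i < len(text):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars).strip()
        chars.append(ch)
        i += 1
    return None  # braces never balanced


def grade_aime(prediction, answer):
    """Exact match on an integer in 0..999; leading zeros are tolerated."""
    pred = prediction.strip()
    if not re.fullmatch(r"[0-9]{1,3}", pred):
        return False
    return int(pred) == int(answer)


def grade_grid(predicted, solution):
    """Full-grid correctness: every cell must match, with no partial credit."""
    return predicted == solution


# Assumed grid shape for this sketch: house label -> {attribute: value}.
solution = {"House 1": {"Name": "Alice", "Pet": "cat"},
            "House 2": {"Name": "Bob", "Pet": "dog"}}
one_cell_wrong = {"House 1": {"Name": "Alice", "Pet": "dog"},
                  "House 2": {"Name": "Bob", "Pet": "dog"}}
```

Tracking brace depth lets `extract_boxed` handle nested LaTeX such as `\boxed{\frac{1}{2}}`, which a simple non-greedy regular expression would cut short at the first closing brace.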