Catalog

24 evals

This page lists the same catalog/evals.yaml the CLI reads. Live means an eval runs end-to-end today; building and roadmap entries show what is coming, and contributions to them are welcome.

Eval	Category	Paper	License	Status
GPQA Diamond: 198 expert-written graduate science questions designed to be Google-proof.	Reasoning	Rein et al., 2023, arXiv:2311.12022	CC-BY-4.0	Live
MMLU-Pro: Harder MMLU with ten answer choices and reasoning-heavy questions.	Reasoning	Wang et al., 2024, arXiv:2406.01574	MIT	Live
MATH: Competition mathematics graded on the final boxed answer.	Reasoning	Hendrycks et al., 2021, arXiv:2103.03874	MIT	Live
AIME 2024: Olympiad problems with integer answers in 0-999; exact-match graded.	Reasoning	MAA, 2024	Educational use; see MAA terms	Live
BIG-Bench Hard: 23 BIG-Bench tasks on which prior models underperformed the average human rater.	Reasoning	Suzgun et al., 2022, arXiv:2210.09261	Apache-2.0	Live
MuSR: Long narrative puzzles (murder mysteries, logistics) requiring soft reasoning.	Reasoning	Sprague et al., 2023, arXiv:2310.16049	MIT	Live
ZebraLogic: Einstein-style logic-grid puzzles scored by full-grid correctness.	Reasoning	Lin et al., 2025, arXiv:2502.01100	Apache-2.0 (solutions gated upstream)	Live
HumanEval+: HumanEval with 80x more tests to catch incorrect-but-plausible solutions.	Code	Liu et al., 2023, arXiv:2305.01210	Apache-2.0	Live
LiveCodeBench: Continuously updated contest problems to avoid training-set contamination.	Code	Jain et al., 2024, arXiv:2403.07974	CC-BY-4.0	Live
BigCodeBench: Practical programming tasks chaining many real library function calls.	Code	Zhuo et al., 2024, arXiv:2406.15877	Apache-2.0	Live
RepoBench: Repository-level completion requiring cross-file retrieval and context.	Code	Liu et al., 2023, arXiv:2306.03091	MIT	Live
SWE-Lancer: Real freelance software tasks priced by their actual payout.	Code	Miserendino et al., 2025, arXiv:2502.12115	Custom (OpenAI)	Live
τ-bench: Customer-service episodes (retail and airline) judged by final DB state and communicated outputs.	Agent / Tool use	Yao et al., 2024, arXiv:2406.12045	MIT	Live
SWE-bench Verified: 500 human-validated GitHub issues resolved by producing a working patch.	Agent / Tool use	Jimenez et al., 2023, arXiv:2310.06770 (Verified subset, OpenAI 2024)	MIT	Live
SWE-bench Lite: 300-issue subset of SWE-bench for cheaper, faster iteration.	Agent / Tool use	Jimenez et al., 2023, arXiv:2310.06770 (Lite subset)	MIT	Live
GAIA: Real-world assistant questions needing tools, web, and multi-step reasoning.	Agent / Tool use	Mialon et al., 2023, arXiv:2311.12983	CC-BY-4.0	Live
Berkeley Function-Calling Leaderboard: Function calling graded by AST match (simple category live; multi-call coming).	Agent / Tool use	Yan et al., 2024	Apache-2.0	Live
ALFWorld: Household tasks in the real TextWorld engine; success rate over 134 unseen games.	Embodied	Shridhar et al., 2020, arXiv:2010.03768	MIT	Live
ScienceWorld: Interactive science experiments requiring procedural understanding.	Embodied	Wang et al., 2022, arXiv:2203.07540	Apache-2.0	Live
WebShop: Buy the right product from 1.18M items given a natural-language goal.	Embodied	Yao et al., 2022, arXiv:2207.01206	MIT	Live
LIBERO: Lifelong manipulation suites testing declarative and procedural transfer.	Robotics	Liu et al., 2023, arXiv:2306.03310	MIT	Live
HarmBench: Red-teaming behaviors (standard and contextual), judged; score = 1 - ASR (attack success rate).	Safety / Security	Mazeika et al., 2024, arXiv:2402.04249	MIT	Live
AILuminate: MLCommons hazard prompts (public practice set), scored by judged safe-response rate.	Safety / Security	MLCommons, 2025, arXiv:2503.05731	CC-BY-4.0	Live
JailbreakBench: Refusal robustness on the JBB harmful behaviors (no-attack baseline; judged).	Safety / Security	Chao et al., 2024, arXiv:2404.01318	MIT	Live
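
Several rows above name their scoring rule only in passing: MATH is graded on the final boxed answer, AIME 2024 by exact match on an integer in 0-999, and HarmBench as 1 - ASR. The sketch below illustrates those three rules in Python. The function names are hypothetical and this is not the catalog's actual grader code; real graders likely do more, such as normalizing equivalent LaTeX answers or calling a judge model to decide whether a response is harmful.

```python
import re
from typing import Optional, Sequence


def last_boxed(text: str) -> Optional[str]:
    # Return the contents of the last \boxed{...} in text, or None.
    # Braces are counted so nested groups such as \frac{1}{2} survive.
    marker = r"\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1
    chars = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(ch)
    return None  # braces never balanced


def aime_correct(prediction: str, answer: int) -> bool:
    # Exact match on an integer written with one to three digits (0-999).
    pred = prediction.strip()
    if re.fullmatch(r"[0-9]{1,3}", pred) is None:
        return False
    return int(pred) == answer


def harmbench_score(judged_harmful: Sequence[bool]) -> float:
    # ASR (attack success rate) is the fraction of responses judged harmful;
    # the reported score is 1 - ASR, so higher is safer.
    if not judged_harmful:
        raise ValueError("no judged responses")
    asr = sum(judged_harmful) / len(judged_harmful)
    return 1.0 - asr
```

For example, `last_boxed(r"so the answer is \boxed{\frac{1}{2}}")` yields the string `\frac{1}{2}`, `aime_correct("042", 42)` is true, and `harmbench_score([True, False, False, False])` gives 0.75.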