Catalog

45 evals

This page lists the same catalog/evals.yaml that the CLI reads. Live means an eval runs end-to-end today; Building and Roadmap entries show what is coming, and contributions to them are welcome.
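
A minimal sketch of how entries like these could be filtered by category and status. The field names (`name`, `category`, `status`) and the helper functions are hypothetical illustrations; the actual schema of catalog/evals.yaml is not shown on this page and may differ.

```python
from collections import Counter
from typing import Dict, Iterable, List, Optional

# Hypothetical entries mirroring the table columns on this page.
# The real catalog/evals.yaml schema is an assumption here.
ENTRIES: List[Dict[str, str]] = [
    {"name": "GPQA Diamond", "category": "Reasoning", "status": "live"},
    {"name": "FrontierMath", "category": "Reasoning", "status": "roadmap"},
    {"name": "Cybench", "category": "Safety / Security", "status": "building"},
    {"name": "HarmBench", "category": "Safety / Security", "status": "live"},
]


def filter_evals(
    entries: Iterable[Dict[str, str]],
    category: Optional[str] = None,
    status: Optional[str] = None,
) -> List[Dict[str, str]]:
    """Keep entries matching the given category and/or status (None means no filter)."""
    return [
        e
        for e in entries
        if (category is None or e["category"] == category)
        and (status is None or e["status"] == status)
    ]


def status_counts(entries: Iterable[Dict[str, str]]) -> Counter:
    """Count how many entries fall under each status."""
    return Counter(e["status"] for e in entries)


if __name__ == "__main__":
    for entry in filter_evals(ENTRIES, status="live"):
        print(entry["name"])
    print(dict(status_counts(ENTRIES)))
```

Passing `None` for a filter leaves that dimension unrestricted, which mirrors an "All" option when browsing the catalog.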

Eval	Description	Category	Paper	License	Status
GPQA Diamond	198 expert-written graduate science questions designed to be Google-proof.	Reasoning	Rein et al., 2023, arXiv:2311.12022	CC-BY-4.0	Live
MMLU-Pro	Harder MMLU with ten answer choices and reasoning-heavy questions.	Reasoning	Wang et al., 2024, arXiv:2406.01574	MIT	Live
MATH	Competition mathematics graded on the final boxed answer.	Reasoning	Hendrycks et al., 2021, arXiv:2103.03874	MIT	Live
AIME 2024	Olympiad problems with integer answers in 0-999; exact-match graded.	Reasoning	MAA, 2024	Educational use; see MAA terms	Live
BIG-Bench Hard	27 BIG-Bench tasks where prior models underperformed humans.	Reasoning	Suzgun et al., 2022, arXiv:2210.09261	Apache-2.0	Live
FrontierMath	Unpublished research-level math problems vetted by professional mathematicians.	Reasoning	Glazer et al., 2024, arXiv:2411.04872	Proprietary (Epoch AI)	Roadmap
MuSR	Long narrative puzzles (murder mysteries, logistics) requiring soft reasoning.	Reasoning	Sprague et al., 2023, arXiv:2310.16049	MIT	Live
ZebraLogic	Einstein-style logic-grid puzzles scored by full-grid correctness.	Reasoning	Lin et al., 2025, arXiv:2502.01100	Apache-2.0 (solutions gated upstream)	Live
HumanEval+	HumanEval with 80x more tests to catch incorrect-but-plausible solutions.	Code	Liu et al., 2023, arXiv:2305.01210	Apache-2.0	Live
LiveCodeBench	Continuously updated contest problems to avoid training-set contamination.	Code	Jain et al., 2024, arXiv:2403.07974	CC-BY-4.0	Live
BigCodeBench	Practical programming tasks chaining many real library function calls.	Code	Zhuo et al., 2024, arXiv:2406.15877	Apache-2.0	Live
RepoBench	Repository-level completion requiring cross-file retrieval and context.	Code	Liu et al., 2023, arXiv:2306.03091	MIT	Live
SWE-Lancer	Real freelance software tasks priced by their actual payout.	Code	Miserendino et al., 2025, arXiv:2502.12115	Custom (OpenAI)	Live
τ-bench	Customer-service episodes (retail+airline) judged by final DB state and communicated outputs.	Agent / Tool use	Yao et al., 2024, arXiv:2406.12045	MIT	Live
SWE-bench Verified	500 human-validated GitHub issues resolved by producing a working patch.	Agent / Tool use	Jimenez et al., 2023, arXiv:2310.06770 (Verified subset, OpenAI 2024)	MIT	Live
SWE-bench Lite	300-issue subset of SWE-bench for cheaper, faster iteration.	Agent / Tool use	Jimenez et al., 2023, arXiv:2310.06770 (Lite subset)	MIT	Live
GAIA	Real-world assistant questions needing tools, web, and multi-step reasoning.	Agent / Tool use	Mialon et al., 2023, arXiv:2311.12983	CC-BY-4.0	Live
WebArena	Self-hosted realistic websites where agents complete long-horizon tasks.	Agent / Tool use	Zhou et al., 2023, arXiv:2307.13854	Apache-2.0	Roadmap
VisualWebArena	WebArena tasks that require understanding images and visual layout.	Agent / Tool use	Koh et al., 2024, arXiv:2401.13649	Apache-2.0	Roadmap
AgentBench	Eight distinct environments measuring agent ability across domains.	Agent / Tool use	Liu et al., 2023, arXiv:2308.03688	Apache-2.0	Roadmap
AgentBoard	Fine-grained progress-rate metrics over partially solved agent tasks.	Agent / Tool use	Ma et al., 2024, arXiv:2401.13178	Apache-2.0	Roadmap
AssistantBench	Time-consuming real web tasks with verifiable short answers.	Agent / Tool use	Yoran et al., 2024, arXiv:2407.15711	MIT	Roadmap
ToolBench	Tool-use over thousands of real REST APIs with a pass-rate judge.	Agent / Tool use	Qin et al., 2023, arXiv:2307.16789	Apache-2.0	Roadmap
Berkeley Function-Calling Leaderboard	Function-calling graded by AST match (simple category live; multi-call coming).	Agent / Tool use	Yan et al., 2024	Apache-2.0	Live
API-Bank	Plan-and-call evaluation over a graded pool of tool APIs.	Agent / Tool use	Li et al., 2023, arXiv:2304.08244	MIT	Roadmap
MLE-bench	75 Kaggle competitions where agents build and submit ML solutions.	Agent / Tool use	Chan et al., 2024, arXiv:2410.07095	Custom (OpenAI)	Roadmap
RE-Bench	Open-ended ML research tasks scored against expert human baselines.	Agent / Tool use	Wijk et al., 2024, arXiv:2411.15114	MIT	Roadmap
OSWorld	Real operating-system tasks across apps in a live virtual machine.	Agent / Tool use	Xie et al., 2024, arXiv:2404.07972	Apache-2.0	Roadmap
AndroidWorld	116 tasks across 20 Android apps in a live emulator with dynamic state.	Agent / Tool use	Rawles et al., 2024, arXiv:2405.14573	Apache-2.0	Roadmap
Windows Agent Arena	Windows desktop tasks in parallelizable cloud VMs.	Agent / Tool use	Bonatti et al., 2024, arXiv:2409.08264	MIT	Roadmap
ALFWorld	Household tasks in the real TextWorld engine; success rate over 134 unseen games.	Embodied	Shridhar et al., 2020, arXiv:2010.03768	MIT	Live
ScienceWorld	Interactive science experiments requiring procedural understanding.	Embodied	Wang et al., 2022, arXiv:2203.07540	Apache-2.0	Live
WebShop	Buy the right product from 1.18M items given a natural-language goal.	Embodied	Yao et al., 2022, arXiv:2207.01206	MIT	Live
Habitat ObjectNav	Navigate photorealistic 3D homes to find an instance of a target object.	Embodied	Szot et al., 2021, arXiv:2106.14405	MIT	Roadmap
ALFRED	Follow language instructions to complete household tasks from vision.	Embodied	Shridhar et al., 2019, arXiv:1912.01734	MIT	Roadmap
LIBERO	Lifelong manipulation suites testing declarative and procedural transfer.	Robotics	Liu et al., 2023, arXiv:2306.03310	MIT	Live
ManiSkill 2	GPU-parallel manipulation tasks with dense and sparse rewards.	Robotics	Gu et al., 2023, arXiv:2302.04659	Apache-2.0	Roadmap
CALVIN	Long-horizon language-conditioned manipulation from undirected play data.	Robotics	Mees et al., 2021, arXiv:2112.03227	MIT	Roadmap
VIMA-Bench	Manipulation specified by interleaved text-and-image multimodal prompts.	Robotics	Jiang et al., 2022, arXiv:2210.03094	MIT	Roadmap
BEHAVIOR-1K	1,000 realistic household activities in high-fidelity simulation.	Robotics	Li et al., 2023	MIT	Roadmap
RoboBench	Broad manipulation suite spanning skills, objects, and scene variation.	Robotics	2024	Apache-2.0	Roadmap
Cybench	40 professional CTF tasks measuring offensive cyber capability and risk.	Safety / Security	Zhang et al., 2024, arXiv:2408.08926	MIT	Building
HarmBench	Red-teaming behaviors (standard+contextual), judged; score = 1 - ASR.	Safety / Security	Mazeika et al., 2024, arXiv:2402.04249	MIT	Live
AILuminate	MLCommons hazard prompts (public practice set), judged safe-response rate.	Safety / Security	MLCommons, 2024, arXiv:2503.05731	CC-BY-4.0	Live
JailbreakBench	Refusal robustness on the JBB harmful behaviors (no-attack baseline; judged).	Safety / Security	Chao et al., 2024, arXiv:2404.01318	MIT	Live
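
Several rows above name a grading rule: the final boxed answer for MATH, exact match on an integer in 0-999 for AIME 2024, and score = 1 - ASR for HarmBench, where ASR is the attack success rate. The sketch below illustrates those three rules under simple assumptions. The function names are hypothetical and this is not the catalog's actual grader, which may normalize answers or judge responses differently.

```python
import re
from typing import Optional, Sequence


def extract_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in text, allowing nested braces."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1
    chars = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(ch)
    return None  # braces never closed


def grade_aime(prediction: str, answer: int) -> bool:
    """Exact match on an integer answer in 0..999; leading zeros such as '042' are accepted."""
    s = prediction.strip()
    if re.fullmatch(r"[0-9]{1,3}", s) is None:
        return False
    return int(s) == answer


def harmbench_score(attack_succeeded: Sequence[bool]) -> float:
    """Score = 1 - ASR, where ASR is the fraction of behaviors judged as successful attacks."""
    if not attack_succeeded:
        raise ValueError("need at least one judged behavior")
    asr = sum(attack_succeeded) / len(attack_succeeded)
    return 1.0 - asr


if __name__ == "__main__":
    print(extract_boxed(r"The answer is \boxed{\frac{1}{2}}."))
    print(grade_aime("042", 42))
    print(harmbench_score([True, False, False, False]))
```

A real MATH grader would also need to compare the extracted string against the reference after normalization (for example, treating equivalent fractions as equal); this sketch stops at extraction.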