AGI·EVALS
Catalog

45 evals

This page lists the same catalog/evals.yaml that the CLI reads. Live means the eval runs end-to-end today; Building and Roadmap entries show what is coming, and contributions to them are welcome.

Eval                                   Status
GPQA Diamond                           Live
MMLU-Pro                               Live
MATH                                   Live
AIME 2024                              Live
BIG-Bench Hard                         Live
FrontierMath                           Roadmap
MuSR                                   Live
ZebraLogic                             Live
HumanEval+                             Live
LiveCodeBench                          Live
BigCodeBench                           Roadmap
RepoBench                              Roadmap
SWE-Lancer                             Roadmap
τ-bench                                Live
SWE-bench Verified                     Building
SWE-bench Lite                         Building
GAIA                                   Live
WebArena                               Roadmap
VisualWebArena                         Roadmap
AgentBench                             Roadmap
AgentBoard                             Roadmap
AssistantBench                         Roadmap
ToolBench                              Roadmap
Berkeley Function-Calling Leaderboard  Live
API-Bank                               Roadmap
MLE-bench                              Roadmap
RE-Bench                               Roadmap
OSWorld                                Roadmap
AndroidWorld                           Roadmap
Windows Agent Arena                    Roadmap
ALFWorld                               Live
ScienceWorld                           Live
WebShop                                Live
Habitat ObjectNav                      Roadmap
ALFRED                                 Roadmap
LIBERO                                 Live
ManiSkill 2                            Roadmap
CALVIN                                 Roadmap
VIMA-Bench                             Roadmap
BEHAVIOR-1K                            Roadmap
RoboBench                              Roadmap
Cybench                                Building
HarmBench                              Live
AILuminate                             Live
JailbreakBench                         Live
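
As a rough illustration of how the three statuses could be used, the sketch below filters a handful of entries from the table down to the ones that run today. The field names `name` and `status`, the lowercase status strings, and the helper functions are assumptions made for this example; the actual schema of catalog/evals.yaml is not shown on this page and may differ.

```python
from collections import Counter

# Hypothetical in-memory form of a few catalog entries.
# The keys "name" and "status" are assumed, not the confirmed
# schema of catalog/evals.yaml.
CATALOG = [
    {"name": "GPQA Diamond", "status": "live"},
    {"name": "FrontierMath", "status": "roadmap"},
    {"name": "SWE-bench Verified", "status": "building"},
    {"name": "GAIA", "status": "live"},
]


def runnable(entries):
    """Return the names of entries marked live (runnable end-to-end today)."""
    return [e["name"] for e in entries if e["status"] == "live"]


def status_counts(entries):
    """Count how many entries fall under each status."""
    return Counter(e["status"] for e in entries)


if __name__ == "__main__":
    print(runnable(CATALOG))
    print(dict(status_counts(CATALOG)))
```

Keeping the status as a plain string on each entry means the same list can drive both a page like this one and a CLI filter, without a separate index of what is runnable.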