AGI·EVALS
Catalog

17 evals — Agent / Tool use

This list mirrors the same catalog/evals.yaml that the CLI reads. Live means the eval runs end-to-end today; Building and Roadmap entries show what is coming, and contributions to them are welcome.

Eval | Status
τ-bench | Live
SWE-bench Verified | Building
SWE-bench Lite | Building
GAIA | Live
WebArena | Roadmap
VisualWebArena | Roadmap
AgentBench | Roadmap
AgentBoard | Roadmap
AssistantBench | Roadmap
ToolBench | Roadmap
Berkeley Function-Calling Leaderboard | Live
API-Bank | Roadmap
MLE-bench | Roadmap
RE-Bench | Roadmap
OSWorld | Roadmap
AndroidWorld | Roadmap
Windows Agent Arena | Roadmap
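
As a rough illustration of how the statuses above might be used, the sketch below groups the catalog by status and picks out the evals that run today. The actual schema of catalog/evals.yaml is not shown on this page, so the `CATALOG` list and the `by_status` helper are hypothetical stand-ins that simply mirror the table, not the CLI's real data model.

```python
from collections import Counter

# Hypothetical in-memory mirror of the table above.
# The real catalog/evals.yaml layout is an assumption and may differ.
CATALOG = [
    ("τ-bench", "Live"),
    ("SWE-bench Verified", "Building"),
    ("SWE-bench Lite", "Building"),
    ("GAIA", "Live"),
    ("WebArena", "Roadmap"),
    ("VisualWebArena", "Roadmap"),
    ("AgentBench", "Roadmap"),
    ("AgentBoard", "Roadmap"),
    ("AssistantBench", "Roadmap"),
    ("ToolBench", "Roadmap"),
    ("Berkeley Function-Calling Leaderboard", "Live"),
    ("API-Bank", "Roadmap"),
    ("MLE-bench", "Roadmap"),
    ("RE-Bench", "Roadmap"),
    ("OSWorld", "Roadmap"),
    ("AndroidWorld", "Roadmap"),
    ("Windows Agent Arena", "Roadmap"),
]


def by_status(catalog, status):
    """Return eval names with the given status, in catalog order."""
    return [name for name, s in catalog if s == status]


# Tally how many evals sit in each status bucket.
counts = Counter(status for _, status in CATALOG)

if __name__ == "__main__":
    for status in ("Live", "Building", "Roadmap"):
        print(f"{status}: {counts[status]}")
    print("Runnable today:", ", ".join(by_status(CATALOG, "Live")))
```

Filtering on "Live" gives the subset that can be run end-to-end now; the Building and Roadmap buckets are the ones open to contributions.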