AssistantBench
Roadmap
Time-consuming real web tasks with verifiable short answers.
Status
This eval is catalogued and on the roadmap. The protocols are stable: implementing it means writing an EvalRunner and adding a catalog entry.
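As a rough illustration of what "an EvalRunner with a catalog entry" could look like, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `EvalRunner` protocol shape, the `CatalogEntry` and `Task` fields, and the `AssistantBenchRunner` class are hypothetical and not taken from the actual codebase. The scoring is a normalized exact match used as a stand-in, not AssistantBench's official metric.

```python
from dataclasses import dataclass
from typing import Callable, Protocol


@dataclass(frozen=True)
class CatalogEntry:
    # Hypothetical catalog record; field names are illustrative only.
    name: str
    category: str
    status: str


@dataclass(frozen=True)
class Task:
    # One web task: a prompt for the agent and the short reference answer.
    prompt: str
    expected: str


class EvalRunner(Protocol):
    # Assumed shape of the runner protocol; the real interface may differ.
    def run(self, agent: Callable[[str], str]) -> float: ...


def normalize(answer: str) -> str:
    # Lowercase and collapse whitespace so trivial formatting
    # differences are not counted as wrong answers.
    return " ".join(answer.lower().split())


class AssistantBenchRunner:
    # Hypothetical runner that structurally satisfies EvalRunner.
    entry = CatalogEntry(
        name="AssistantBench",
        category="Agent / Tool use",
        status="roadmap",
    )

    def __init__(self, tasks: list[Task]) -> None:
        self.tasks = tasks

    def run(self, agent: Callable[[str], str]) -> float:
        # Fraction of tasks whose normalized answer matches the reference.
        if not self.tasks:
            return 0.0
        correct = sum(
            normalize(agent(task.prompt)) == normalize(task.expected)
            for task in self.tasks
        )
        return correct / len(self.tasks)
```

Because the answers are short and verifiable, the runner only needs a callable that maps a prompt to a string; how the agent browses the web is left outside this sketch.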