RE-Bench
Roadmap
Open-ended ML research tasks scored against expert human baselines.
Status
This eval is catalogued and on the roadmap. The protocols are stable, so implementing it comes down to an EvalRunner plus a catalog entry.