SWE-bench Verified
Building
500 human-validated GitHub issues resolved by producing a working patch.
Status
A runner for this eval is in progress. The protocols are stable: implementing it means writing an EvalRunner and adding a catalog entry.