Catalog
3 evals
This page shows the same catalog/evals.yaml the CLI reads. Live means an eval runs end-to-end today; Building and Roadmap entries show what is coming, and contributions to them are welcome.
| Eval | Category | Paper | License | Status |
|---|---|---|---|---|
| SWE-bench Verified | Agent / Tool use | | | Building |
| SWE-bench Lite | Agent / Tool use | | | Building |
| Cybench | Safety / Security | | | Building |
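
As a rough illustration of how the status column can be used, here is a minimal sketch that filters catalog entries by status. The schema of catalog/evals.yaml is not shown on this page, so the field names (`eval`, `category`, `status`), the `CATALOG` list, and the helper functions are all hypothetical; the entries simply mirror the table above.

```python
from collections import Counter

# Hypothetical in-memory form of the catalog. The real catalog/evals.yaml
# schema is not shown on this page, so these field names are assumptions.
CATALOG = [
    {"eval": "SWE-bench Verified", "category": "Agent / Tool use", "status": "building"},
    {"eval": "SWE-bench Lite", "category": "Agent / Tool use", "status": "building"},
    {"eval": "Cybench", "category": "Safety / Security", "status": "building"},
]


def by_status(catalog, status):
    """Return the names of evals whose status matches (case-insensitive)."""
    wanted = status.lower()
    return [entry["eval"] for entry in catalog if entry["status"].lower() == wanted]


def status_counts(catalog):
    """Count how many entries fall under each status."""
    return Counter(entry["status"].lower() for entry in catalog)


if __name__ == "__main__":
    print(by_status(CATALOG, "building"))  # evals still being built
    print(by_status(CATALOG, "live"))      # evals that run end-to-end today
    print(dict(status_counts(CATALOG)))
```

With the three entries listed here, filtering on `live` returns an empty list, since every eval is still marked Building.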