AGI·EVALS

SWE-bench Verified

Building

500 human-validated GitHub issues resolved by producing a working patch.

Runner in progress

SWE-bench Verified is catalogued but not yet runnable, so it has no usage docs: we do not document what does not run. The fact sheet below is sourced from the paper, and the protocols the runner will implement are already stable.

Paper: SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
Citation: Jimenez et al., 2023, arXiv:2310.06770 (Verified subset, OpenAI 2024)
License: MIT
How an eval goes live
  1. Implement an EvalRunner against the stable protocols.
  2. Bundle a small real-schema sample so it runs offline.
  3. Point the catalog entry's runner at the class.
  4. Ship its docs in the same change; this is required to flip the entry live.
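
As a rough illustration of steps 1 to 3, the sketch below shows one way the pieces could fit together. It is not the agi-evals API: the `EvalRunner` protocol shape, the `Sample` and `Result` types, the `CATALOG` dictionary, and the `score` helper are all hypothetical names assumed for this example, and the bundled sample is a made-up placeholder rather than real SWE-bench data. Step 4 (shipping docs) has no code counterpart.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Protocol


@dataclass(frozen=True)
class Sample:
    """One benchmark item. Field names are illustrative, not the real schema."""
    instance_id: str
    problem_statement: str


@dataclass(frozen=True)
class Result:
    instance_id: str
    resolved: bool


class EvalRunner(Protocol):
    """Hypothetical stand-in for the stable protocol named in step 1."""

    def load_samples(self) -> Iterable[Sample]: ...

    def run(self, sample: Sample, patch: str) -> Result: ...


class ToyRunner:
    """Step 1: a class that satisfies the assumed protocol."""

    # Step 2: a tiny sample bundled in the module, so no network is needed.
    # This record is invented for the sketch.
    _BUNDLED = (Sample("demo__repo-1", "Fix off-by-one in pagination."),)

    def load_samples(self) -> Iterable[Sample]:
        return self._BUNDLED

    def run(self, sample: Sample, patch: str) -> Result:
        # Placeholder check only: a real runner would apply the patch to the
        # repository and run its tests. Here any non-empty patch "resolves".
        return Result(sample.instance_id, resolved=bool(patch.strip()))


# Step 3: the catalog entry's runner field points at the class (assumed layout).
CATALOG = {"swe-bench-verified": {"status": "building", "runner": ToyRunner}}


def score(runner: EvalRunner, patches: Dict[str, str]) -> float:
    """Fraction of bundled samples marked resolved for the given patches."""
    results = [
        runner.run(s, patches.get(s.instance_id, ""))
        for s in runner.load_samples()
    ]
    return sum(r.resolved for r in results) / len(results)
```

Under these assumptions, a caller would look up the entry, instantiate its runner, and pass a mapping from instance ID to candidate patch to `score`. Keeping the sample inside the module is what lets the runner work offline.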

pip install agi-evals