
SWE-bench Verified

500 human-validated GitHub issues resolved by producing a working patch.

How it works

01
SWE-bench is a widely used benchmark for software-engineering agents: each task is a real GitHub issue from a popular Python repository, pinned to a commit. The model reads the issue and produces a patch (a unified diff); the patch is applied inside the repository's Docker image and the issue's test suite is run.
02
A task counts as resolved only when the previously failing tests (FAIL_TO_PASS) now pass and the previously passing tests (PASS_TO_PASS) still pass. This is the benchmark's exact criterion, computed by the official swebench harness; we shell out to it rather than reimplementing it.
03
The environment is the benchmark: grading requires pip install 'agi-eval[swebench]' and a running Docker daemon. load() enumerates instances from bundled metadata without touching Docker, so listing and --limit are free; only grading builds containers.
04
Honest scope, like LIBERO: there is no --model echo smoke test, because grading even one instance builds a container and runs a real test suite. The model must emit a valid unified diff; agent scaffolds that retrieve files and iterate score far higher than single-shot prompting.
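To make the resolution criterion concrete, here is a minimal sketch of the FAIL_TO_PASS + PASS_TO_PASS check. It is an illustration only: the real decision is made by the official swebench harness, and the function name is_resolved, the test_status mapping, and the "PASSED" status string are hypothetical names chosen for this example.

```python
from typing import Mapping, Sequence


def is_resolved(
    test_status: Mapping[str, str],
    fail_to_pass: Sequence[str],
    pass_to_pass: Sequence[str],
) -> bool:
    """Return True only if every FAIL_TO_PASS and PASS_TO_PASS test passed.

    `test_status` maps a test name to its outcome after the patch was applied.
    The "PASSED" string is an assumption for this sketch, not the harness format.
    """

    def passed(name: str) -> bool:
        # A test that is missing from the results counts as not passed.
        return test_status.get(name) == "PASSED"

    fixes_issue = all(passed(name) for name in fail_to_pass)
    breaks_nothing = all(passed(name) for name in pass_to_pass)
    return fixes_issue and breaks_nothing
```

Both conditions are required: a patch that turns the failing tests green but breaks one previously passing test is not resolved.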

Scoring

01
score = resolve rate (fraction of instances resolved). Binary per instance: resolved or not, by the official FAIL_TO_PASS + PASS_TO_PASS criterion.
02
detail.resolved and detail.patch let you reconstruct each attempt. A patch that applies but doesn't fix the issue and a patch that fixes it but breaks a passing test both score 0, matching upstream.
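The arithmetic behind the score can be sketched as follows. This is an assumption-laden illustration, not the runner's code: the Case dataclass and its field names are hypothetical, and the exclusion set reflects the troubleshooting table's note that ADAPTER_ERROR / HARNESS_ERROR cases are excluded from the score.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

# Tags the troubleshooting table describes as excluded from the score.
EXCLUDED_TAGS = {"ADAPTER_ERROR", "HARNESS_ERROR"}


@dataclass
class Case:
    """Hypothetical per-instance record used only for this sketch."""

    resolved: bool
    failure_tag: Optional[str] = None  # None means the case passed


def resolve_rate(cases: Iterable[Case]) -> float:
    """Fraction of scored instances that were resolved (binary per instance)."""
    scored = [c for c in cases if c.failure_tag not in EXCLUDED_TAGS]
    if not scored:
        # Nothing to score; returning 0.0 is a choice made for this sketch.
        return 0.0
    return sum(c.resolved for c in scored) / len(scored)
```

For example, one resolved case, one WRONG_ANSWER, one NO_ANSWER, and one HARNESS_ERROR give a resolve rate of 1/3 under these assumptions, because the HARNESS_ERROR case drops out of the denominator.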

Using it

01
pip install 'agi-eval[swebench]' # + Docker running
02
agi-eval download swe-bench-verified # 500 instances (or swe-bench-lite, 300)
03
agi-eval run swe-bench-verified --model openai:gpt-4o --limit 20 --concurrency 1
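Because grading needs a running Docker daemon and plenty of disk, a small preflight check before a long run can save time. The sketch below is not part of agi-eval; the helper names docker_available and free_disk_gib are hypothetical, and it relies only on the standard docker info command and the Python standard library.

```python
import shutil
import subprocess


def docker_available(timeout_s: float = 10.0) -> bool:
    """Return True if the docker CLI is on PATH and the daemon responds."""
    if shutil.which("docker") is None:
        return False
    try:
        # `docker info` exits non-zero when the daemon is not reachable.
        result = subprocess.run(
            ["docker", "info"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


def free_disk_gib(path: str = ".") -> float:
    """Free space, in GiB, on the filesystem that contains `path`."""
    return shutil.disk_usage(path).free / 2**30


if __name__ == "__main__":
    print("docker ready:", docker_available())
    print(f"free disk: {free_disk_gib():.1f} GiB")
```

Note that Docker may store images on a different filesystem than the current directory, so pass the path of Docker's data directory to free_disk_gib if it lives elsewhere.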

Troubleshooting

Every non-pass carries exactly one typed failure tag, so the diagnosis is mechanical: look up the tag in report.failure_counts, then the per-case detail.

Tag	What it means	What to do
NO_ANSWER	The model returned no unified diff.	Single-shot prompting often fails to emit a clean diff. Use an agent scaffold, or a model fine-tuned for patches; the runner accepts ```diff blocks or raw diffs.
WRONG_ANSWER	The patch didn't resolve the issue (didn't apply, didn't fix, or broke a passing test).	detail.patch shows what was submitted; inspect it against the repo. This is the honest hard part of SWE-bench: resolve rates are modest even for frontier models.
HARNESS_ERROR	The swebench harness or Docker isn't available.	pip install 'agi-eval[swebench]', start Docker, and make sure there is enough disk space, since instance images are large.
ADAPTER_ERROR / HARNESS_ERROR	Our side, not the model's: transport/auth failures or a harness bug. Excluded from the score automatically.	Check endpoint credentials and connectivity; if HARNESS_ERROR persists, file an issue with the per-case detail.error string.
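For NO_ANSWER cases it helps to see what "a fenced diff block or a raw diff" might mean in practice. The following is a rough sketch of one way to pull a unified diff out of a model response; it is not the runner's actual extraction logic, and extract_patch and _looks_like_diff are hypothetical names used only here.

```python
import re
from typing import Optional

# Built at runtime so this example does not contain a literal code fence.
FENCE = "`" * 3
_FENCED = re.compile(FENCE + r"(?:diff|patch)?[ \t]*\n(.*?)" + FENCE, re.DOTALL)


def _looks_like_diff(text: str) -> bool:
    """Heuristic: a unified diff has ---/+++ file headers and an @@ hunk."""
    lines = text.splitlines()
    return (
        any(line.startswith("--- ") for line in lines)
        and any(line.startswith("+++ ") for line in lines)
        and any(line.startswith("@@ ") for line in lines)
    )


def extract_patch(response: str) -> Optional[str]:
    """Return the first unified diff found in `response`, or None."""
    # Prefer a fenced block whose body looks like a diff.
    for body in _FENCED.findall(response):
        if _looks_like_diff(body):
            return body
    # Otherwise treat the response as a raw diff, skipping leading prose.
    if _looks_like_diff(response):
        lines = response.splitlines(keepends=True)
        for i, line in enumerate(lines):
            if line.startswith(("diff --git ", "--- ")):
                return "".join(lines[i:])
    return None
```

Under this sketch, a response with no recognizable diff yields None, which corresponds to the NO_ANSWER row above; whether a returned diff actually applies is a separate question settled at grading time.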
