HumanEval+
HumanEval with 80x more tests to catch incorrect-but-plausible solutions.
How it works
- 01
The model completes a Python function from its signature and docstring. The runner extracts the code block from the reply, assembles candidate + hidden test suite + check(entry_point) into one program, and executes that program in a fresh Python subprocess with a wall-clock timeout.
- 02
Subprocess isolation is deliberate: generated code never touches the harness interpreter, and a hung solution is killed by the timeout instead of blocking the run.
- 03
Extraction prefers a ```python fence containing the entry point, then the largest fence, then a bare 'def entry_point' slice of the reply.
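The three steps above can be sketched as follows. This is a minimal illustration, not the harness's actual code: the names `extract_code` and `run_candidate`, the `"PASS"` label, and the regular expression are assumptions made for the example. Only the extraction order, the candidate + test + check(entry_point) assembly, the fresh subprocess, and the timeout come from the description above.

```python
import re
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Optional, Tuple

FENCE = "`" * 3  # three backticks, built at runtime to keep this example readable
FENCE_RE = re.compile(FENCE + r"(\w*)[ \t]*\n(.*?)" + FENCE, re.DOTALL)


def extract_code(reply: str, entry_point: str) -> Optional[str]:
    """Pick candidate code from a model reply (hypothetical helper)."""
    fences = FENCE_RE.findall(reply)  # list of (language tag, body)
    needle = f"def {entry_point}"
    # 1. Prefer a python fence that contains the entry point.
    for lang, body in fences:
        if lang.lower() in ("python", "py") and needle in body:
            return body
    # 2. Otherwise take the largest fence, whatever its language tag.
    if fences:
        return max((body for _, body in fences), key=len)
    # 3. Otherwise slice the reply from a bare "def <entry_point>" onward.
    start = reply.find(needle)
    return reply[start:] if start != -1 else None


def run_candidate(code: str, test: str, entry_point: str,
                  timeout_s: float = 10.0) -> Tuple[str, str]:
    """Run candidate + tests in a fresh interpreter; returns (status, stderr)."""
    program = f"{code}\n\n{test}\n\ncheck({entry_point})\n"
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "candidate.py"
        path.write_text(program, encoding="utf-8")
        try:
            # A separate process: generated code never runs in this interpreter.
            proc = subprocess.run(
                [sys.executable, str(path)],
                capture_output=True, text=True, timeout=timeout_s, cwd=tmp,
            )
        except subprocess.TimeoutExpired:
            # subprocess.run kills the child before raising, so nothing hangs.
            return "TIMEOUT", ""
    status = "PASS" if proc.returncode == 0 else "WRONG_ANSWER"
    return status, proc.stderr
```

Writing the program to a temporary directory and running it with `sys.executable` keeps the child on the same Python version as the harness while sharing no interpreter state with it. This does not sandbox the code; it only isolates crashes and hangs.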
Scoring
- 01
score = pass rate, the fraction of cases that pass. A case passes only if every hidden assertion passes, that is, the assembled program exits with code 0.
- 02
This is pass@1 with greedy decoding by default; pass@k requires sampling k completions upstream.
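As a rough sketch of the arithmetic, the snippet below computes a pass rate from per-case outcomes and, for the sampled case, the standard unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), for n samples of which c pass. The function names, the `"PASS"` label, and the choice to drop ADAPTER_ERROR / HARNESS_ERROR cases from the denominator are assumptions for illustration; the harness may account for them differently.

```python
from math import comb

# Assumed here: harness-side failures are removed before scoring.
EXCLUDED = {"ADAPTER_ERROR", "HARNESS_ERROR"}


def pass_rate(outcomes):
    """Fraction of scored cases whose outcome is "PASS" (hypothetical label)."""
    scored = [o for o in outcomes if o not in EXCLUDED]
    if not scored:
        return 0.0
    return sum(o == "PASS" for o in scored) / len(scored)


def pass_at_k(n, c, k):
    """Unbiased pass@k for one task: n samples drawn, c of them correct."""
    if n - c < k:
        # Fewer than k failures exist, so any k samples include a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With one greedy completion per task (n = 1, k = 1), `pass_at_k` reduces to 0 or 1 per task, and averaging over tasks gives the same number as the pass rate.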
Using it
- 01
CLI: agi-evals run humaneval-plus --model ollama:qwen2.5-coder --concurrency 4
- 02
SDK: raise the timeout above the 10s default, for example HumanEvalPlusRunner(timeout_s=20.0), if your machine is slow or solutions are heavy.
- 03
SECURITY: this executes model-generated code locally. Run only models you trust, or wrap the runner in a container/OS sandbox.
Troubleshooting
Every non-pass carries exactly one typed failure tag, so the diagnosis is mechanical: look up the tag in report.failure_counts, then the per-case detail.
| Tag | What it means | What to do |
|---|---|---|
| NO_ANSWER | No code could be extracted from the reply (no fence, no recognizable def). | Tell the model to return one ```python block with the full function. Chat models that 'explain first, code later' usually fix themselves with that one instruction. |
| WRONG_ANSWER | The assembled program exited non-zero: an assertion failed or the candidate raised. | detail.stderr carries the traceback tail and detail.completion the exact code that ran — reproduce locally by concatenating completion + test + check(entry_point). |
| TIMEOUT | Execution exceeded timeout_s (default 10s) — infinite loop or pathological solution. | Raise timeout_s if legitimate solutions are slow on your machine; a persistent TIMEOUT on one task is almost always a real non-terminating solution. |
| ADAPTER_ERROR / HARNESS_ERROR | Our side, not the model's: transport/auth failures or a harness bug. Excluded from the score automatically. | Check endpoint credentials and connectivity; if HARNESS_ERROR persists, file an issue with the per-case detail.error string. |
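A minimal triage sketch for the workflow above: count failures by tag, then rebuild the exact program for a WRONG_ANSWER case. The per-case dictionary shape used here (`task_id`, `failure_tag`, `completion`, `test`, `entry_point`) is hypothetical and stands in for whatever the report's per-case detail actually exposes; only the completion + test + check(entry_point) recipe comes from the table.

```python
from collections import Counter, defaultdict


def triage(cases):
    """Group non-passing cases by failure tag (passing cases carry tag None)."""
    counts = Counter()
    by_tag = defaultdict(list)
    for case in cases:
        tag = case.get("failure_tag")
        if tag is not None:
            counts[tag] += 1
            by_tag[tag].append(case["task_id"])
    return counts, dict(by_tag)


def build_repro(case):
    """Rebuild the program that ran: completion + test + check(entry_point)."""
    return (
        f"{case['completion']}\n\n"
        f"{case['test']}\n\n"
        f"check({case['entry_point']})\n"
    )
```

Saving the output of `build_repro` to a file and running it with the same Python version should reproduce the traceback seen in detail.stderr. Because this runs model-generated code, the same security caveat applies.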