← Docs/ Code

HumanEval+

Live

HumanEval with 80x more tests to catch incorrect-but-plausible solutions.

How it works

01
The model completes a Python function from its signature + docstring. The runner extracts the code block, assembles candidate + hidden test suite + check(entry_point), and executes it in a fresh Python subprocess with a wall-clock timeout.
02
Subprocess isolation is deliberate: generated code never touches the harness interpreter, and a hung solution is killed by the timeout instead of blocking the run.
03
Extraction prefers a ```python fence containing the entry point, then the largest fence, then a bare 'def entry_point' slice of the reply.

Scoring

01
score = pass rate: a case passes only if every hidden assertion passes (exit code 0).
02
This is pass@1 with greedy decoding by default; pass@k requires sampling k completions upstream.

Using it

01
CLI: agi-evals run humaneval-plus --model ollama:qwen2.5-coder --concurrency 4
02
SDK: HumanEvalPlusRunner(timeout_s=20.0) if your machine is slow or solutions are heavy.
03
SECURITY: this executes model-generated code locally. Run models you trust, or wrap the runner in a container/OS sandbox.

Troubleshooting

Every non-pass carries exactly one typed failure tag, so the diagnosis is mechanical: look up the tag in report.failure_counts, then the per-case detail.

Tag	What it means	What to do
NO_ANSWER	No code could be extracted from the reply (no fence, no recognizable def).	Tell the model to return one ```python block with the full function. Chat models that 'explain first, code later' usually fix themselves with that one instruction.
WRONG_ANSWER	The assembled program exited non-zero: an assertion failed or the candidate raised.	detail.stderr carries the traceback tail and detail.completion the exact code that ran — reproduce locally by concatenating completion + test + check(entry_point).
TIMEOUT	Execution exceeded timeout_s (default 10s) — infinite loop or pathological solution.	Raise timeout_s if legitimate solutions are slow on your machine; a persistent TIMEOUT on one task is almost always a real non-terminating solution.
ADAPTER_ERROR / HARNESS_ERROR	Our side, not the model's: transport/auth failures or a harness bug. Excluded from the score automatically.	Check endpoint credentials and connectivity; if HARNESS_ERROR persists, file an issue with the per-case detail.error string.

Run HumanEval+ →Leaderboard