
HumanEval+

Live

HumanEval with 80x more tests to catch incorrect-but-plausible solutions.

How it works

  • 01

    The model completes a Python function from its signature + docstring. The runner extracts the code block, assembles candidate + hidden test suite + check(entry_point), and executes it in a fresh Python subprocess with a wall-clock timeout.

  • 02

    Subprocess isolation is deliberate: generated code never touches the harness interpreter, and a hung solution is killed by the timeout instead of blocking the run.

  • 03

    Extraction prefers a ```python fence containing the entry point, then the largest fence, then a bare 'def entry_point' slice of the reply.
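
The extraction order above can be sketched as follows. This is a minimal illustration, not the harness's actual code: the function name `extract_code`, the regular expression, and the choice to treat `py` as a Python fence are all assumptions made for the example.

```python
import re
from typing import Optional

# Matches a fenced block: optional language tag, then the body up to the closing fence.
FENCE_RE = re.compile(r"```([A-Za-z0-9_+-]*)[ \t]*\n(.*?)```", re.DOTALL)


def extract_code(reply: str, entry_point: str) -> Optional[str]:
    """Hypothetical helper: return candidate code from a reply, or None.

    None corresponds to the NO_ANSWER case (nothing could be extracted).
    """
    marker = f"def {entry_point}("
    fences = [(lang.lower(), body) for lang, body in FENCE_RE.findall(reply)]

    # 1. Prefer a python fence that contains the entry point.
    for lang, body in fences:
        if lang in ("python", "py") and marker in body:
            return body

    # 2. Otherwise fall back to the largest fence of any language.
    if fences:
        return max((body for _, body in fences), key=len)

    # 3. Otherwise slice the reply from a bare definition of the entry point.
    idx = reply.find(marker)
    if idx != -1:
        return reply[idx:]

    return None
```

A reply that explains first and puts the function in a later python fence still resolves through step 1, because every fence is checked for the entry point before any fallback is used.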

Scoring

  • 01

    score = pass rate, the fraction of cases that pass. A case passes only if every hidden assertion passes, i.e. the subprocess exits with code 0.

  • 02

    This is pass@1 with greedy decoding by default; pass@k requires sampling k completions upstream.
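
The scoring rules can be illustrated with a short sketch. The outcome label `"PASS"`, the function names, and the way excluded tags are dropped from the denominator are assumptions for the example. `pass_at_k` is the commonly used unbiased estimator for n sampled completions of which c pass; whether the harness computes it this way is not stated here.

```python
from math import comb
from typing import Iterable

# Tags described as "our side, not the model's"; assumed to leave the denominator.
EXCLUDED = {"ADAPTER_ERROR", "HARNESS_ERROR"}


def pass_rate(outcomes: Iterable[str]) -> float:
    """Fraction of scored cases whose outcome is "PASS" (hypothetical label)."""
    scored = [o for o in outcomes if o not in EXCLUDED]
    if not scored:
        return 0.0  # assumption: an empty scored set reports 0.0
    return sum(o == "PASS" for o in scored) / len(scored)


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task: 1 - C(n-c, k) / C(n, k).

    n = completions sampled, c = completions that passed, k <= n.
    """
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With one greedy completion per task (n = 1, k = 1), `pass_at_k` is 1.0 for a passing task and 0.0 for a failing one, so averaging it over tasks gives the plain pass rate.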

Using it

  • 01

    CLI: agi-evals run humaneval-plus --model ollama:qwen2.5-coder --concurrency 4

  • 02

    SDK: construct HumanEvalPlusRunner(timeout_s=20.0) to raise the timeout above the 10s default if your machine is slow or solutions are heavy.

  • 03

    SECURITY: this executes model-generated code locally. Run models you trust, or wrap the runner in a container/OS sandbox.

Troubleshooting

Every non-pass carries exactly one typed failure tag, so the diagnosis is mechanical: look up the tag in report.failure_counts, then the per-case detail.

| Tag | What it means | What to do |
| --- | --- | --- |
| NO_ANSWER | No code could be extracted from the reply (no fence, no recognizable def). | Tell the model to return one ```python block with the full function. Chat models that 'explain first, code later' usually fix themselves with that one instruction. |
| WRONG_ANSWER | The assembled program exited non-zero: an assertion failed or the candidate raised. | detail.stderr carries the traceback tail and detail.completion the exact code that ran. Reproduce locally by concatenating completion + test + check(entry_point). |
| TIMEOUT | Execution exceeded timeout_s (default 10s): infinite loop or pathological solution. | Raise timeout_s if legitimate solutions are slow on your machine; a persistent TIMEOUT on one task is almost always a real non-terminating solution. |
| ADAPTER_ERROR / HARNESS_ERROR | Our side, not the model's: transport/auth failures or a harness bug. Excluded from the score automatically. | Check endpoint credentials and connectivity; if HARNESS_ERROR persists, file an issue with the per-case detail.error string. |
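
To reproduce a WRONG_ANSWER or TIMEOUT locally, the assembled program can be rebuilt and run in its own subprocess. The sketch below is an approximation under stated assumptions: `run_case` is a hypothetical helper, the returned labels simply mirror the tags above, and the real runner may assemble or truncate output differently. It executes the code you pass in, so the same security caveat applies.

```python
import os
import subprocess
import sys
import tempfile
from typing import Tuple


def run_case(completion: str, test: str, entry_point: str,
             timeout_s: float = 10.0) -> Tuple[str, str]:
    """Hypothetical helper: run completion + test + check(entry_point).

    Returns (label, stderr_tail), where label is "PASS", "WRONG_ANSWER",
    or "TIMEOUT".
    """
    program = f"{completion}\n\n{test}\n\ncheck({entry_point})\n"
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "case.py")
        with open(path, "w", encoding="utf-8") as fh:
            fh.write(program)
        try:
            # A fresh interpreter keeps candidate code out of this process;
            # subprocess.run kills the child if the timeout expires.
            proc = subprocess.run(
                [sys.executable, path],
                capture_output=True,
                text=True,
                timeout=timeout_s,
                cwd=tmp,
            )
        except subprocess.TimeoutExpired:
            return "TIMEOUT", ""
    if proc.returncode == 0:
        return "PASS", ""
    # Non-zero exit: an assertion failed or the candidate raised.
    return "WRONG_ANSWER", proc.stderr[-2000:]
```

Passing the `detail.completion` string and the task's test code to a helper like this should surface the same traceback tail that appears in `detail.stderr`, provided the local Python version matches the one used for the run.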