
BigCodeBench

Practical programming tasks chaining many real library function calls.

How it works

01
BigCodeBench extends function synthesis to realistic, library-heavy programming: each task's hidden test suite is a unittest.TestCase exercising the completed function with diverse calls, often through popular third-party packages (numpy, pandas, flask, sklearn...).
02
The model completes the function from its signature and docstring (the 'complete' split). The harness then assembles the candidate with the test class and runs it in a fresh subprocess under a wall-clock timeout. This is the same execution model as HumanEval+, scaled up.
03
Scope note on dependencies: BigCodeBench tasks import from 139 libraries. The bundled offline sample is restricted to stdlib-only tasks so it runs out of the box; the full set needs those libraries installed (the upstream bigcodebench package pins them). Missing imports surface in detail.stderr.
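
The execution model described above can be sketched in a few lines of stdlib Python. This is a minimal illustration, not the harness's actual code: `run_candidate`, its return keys, and the file name `case.py` are hypothetical, and the 30-second default is taken from the TIMEOUT row in Troubleshooting.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


# Hypothetical helper: the name, signature, and return shape are illustrative,
# not the harness's real API.
def run_candidate(candidate_src: str, test_src: str, timeout_s: float = 30.0) -> dict:
    """Assemble candidate + test class into one file and run it in a fresh subprocess."""
    program = "\n\n".join([
        candidate_src,
        test_src,
        "import unittest\nif __name__ == '__main__':\n    unittest.main()",
    ])
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "case.py"
        script.write_text(program, encoding="utf-8")
        try:
            proc = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                text=True,
                timeout=timeout_s,  # wall-clock budget for the whole suite
                cwd=tmp,
            )
        except subprocess.TimeoutExpired:
            return {"passed": False, "timed_out": True, "stderr": ""}
    # unittest exits non-zero if any test method fails or errors,
    # so a zero exit code means the full suite passed.
    return {"passed": proc.returncode == 0, "timed_out": False, "stderr": proc.stderr}


# Toy stdlib-only task, made up for this example.
candidate = "def task_func(xs):\n    return sorted(xs)\n"
tests = (
    "import unittest\n"
    "class TestCases(unittest.TestCase):\n"
    "    def test_sorted(self):\n"
    "        self.assertEqual(task_func([3, 1, 2]), [1, 2, 3])\n"
)
result = run_candidate(candidate, tests)
print(result["passed"], result["timed_out"])
```

Because the candidate runs as a separate interpreter process, a crash or an unresolved import in the task code ends up in the captured stderr rather than taking the harness down.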

Scoring

01
score = pass@1 (fraction of tasks whose full unittest suite passes). For sampled decoding, pass@k is also supported via the n_samples and k settings, computed with the unbiased estimator of Chen et al.
02
A task passes only when every test method succeeds — partial test passes count as a fail, matching upstream.
03
Import errors and other environment failures inside the test subprocess land as WRONG_ANSWER with the traceback in detail.stderr. Read it before blaming the model: a missing library is an environment problem, not a capability one.
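
For reference, the unbiased pass@k estimator from Chen et al. is 1 - C(n-c, k) / C(n, k), where n is the number of samples drawn for a task and c is how many of them pass the full suite. The sketch below assumes that formulation; the function names are illustrative and not part of the harness.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one task: n samples drawn, c of them fully passing."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # Probability that a random size-k subset of the n samples contains
    # no passing sample is C(n - c, k) / C(n, k); pass@k is its complement.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_score(per_task: list[tuple[int, int]], k: int) -> float:
    """Mean pass@k over tasks; each entry is (n_samples, n_fully_passing)."""
    return sum(pass_at_k(n, c, k) for n, c in per_task) / len(per_task)


# With one sample per task and k = 1 this reduces to the plain pass fraction.
print(benchmark_score([(1, 1), (1, 0), (1, 1), (1, 0)], k=1))  # 0.5
```

A sample counts toward c only when every test method succeeds, so the all-or-nothing rule applies before the estimator is ever computed.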

Using it

01
agi-eval run bigcodebench --model echo --limit 3 # offline, stdlib-only sample
02
agi-eval download bigcodebench # full set (problems + tests)
03
# install the tasks' libraries for full runs, then:
agi-eval run bigcodebench --model openai:gpt-4o --concurrency 4
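
Before a full run it can help to check which task libraries are importable in the current environment. The sketch below uses only the stdlib; the `wanted` list is a made-up sample for illustration, since the real set is whatever the upstream bigcodebench package pins.

```python
from importlib.util import find_spec


def missing_modules(names: list[str]) -> list[str]:
    """Return the import names that cannot be resolved in this environment."""
    missing = []
    for name in names:
        try:
            found = find_spec(name) is not None
        except (ImportError, ValueError):
            # Raised for some dotted or half-initialised modules; treat as missing.
            found = False
        if not found:
            missing.append(name)
    return missing


# Illustrative sample only, not the benchmark's actual dependency list.
wanted = ["json", "numpy", "pandas", "flask", "sklearn"]
print("missing:", missing_modules(wanted))
```

Note that these are import names, which can differ from the package names used at install time (sklearn is installed as scikit-learn).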

Troubleshooting

Every non-pass carries exactly one typed failure tag, so the diagnosis is mechanical: look up the tag in report.failure_counts, then the per-case detail.

Tag	What it means	What to do
NO_ANSWER	No code block in the reply.	Instruct the model to return one ```python block with the full function; smaller models sometimes answer in prose.
WRONG_ANSWER	Tests failed — or a library the task imports isn't installed.	Check detail.stderr: a unittest assertion is a real failure; a ModuleNotFoundError means install the task's libraries (detail shows the task's libs).
TIMEOUT	The suite exceeded the 30s budget (BigCodeBench suites are larger than HumanEval's).	Usually an infinite loop in the candidate; raise timeout_s if a legitimately heavy task needs it.
ADAPTER_ERROR / HARNESS_ERROR	Our side, not the model's: transport/auth failures or a harness bug. Excluded from the score automatically.	Check endpoint credentials and connectivity; if HARNESS_ERROR persists, file an issue with the per-case detail.error string.
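
The WRONG_ANSWER row asks you to tell a real test failure from a missing library. A rough way to automate that first pass over detail.stderr is sketched below; `triage_stderr` and its return labels are hypothetical, and the patterns rely only on the standard wording of Python's import error and unittest's failure summary.

```python
import re

# Python reports an unresolved import as:
#   ModuleNotFoundError: No module named 'pandas'
_MISSING = re.compile(r"(?:ModuleNotFoundError|ImportError): No module named '([^']+)'")


def triage_stderr(stderr: str) -> str:
    """Classify a WRONG_ANSWER traceback as an environment or a model problem."""
    match = _MISSING.search(stderr)
    if match:
        return f"environment: install '{match.group(1)}'"
    # unittest prints "FAILED (failures=N)" or "FAILED (errors=N)" on stderr.
    if "AssertionError" in stderr or "FAILED (" in stderr:
        return "model: test failure"
    return "unknown: read the traceback"


print(triage_stderr("ModuleNotFoundError: No module named 'pandas'"))
print(triage_stderr("AssertionError: 3 != 4\n\nFAILED (failures=1)"))
```

This is only a first filter: a candidate that imports a module the task never asked for would also match the missing-library pattern, so check the module name against the task's listed libraries before reinstalling anything.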
