LiveCodeBench

Continuously updated contest problems to avoid training-set contamination.

How it works

01
Contest problems from LeetCode, AtCoder, and Codeforces are collected continuously, so any model can be evaluated on problems published after its training cutoff; avoiding contamination is the point of the benchmark. Each result records platform, difficulty, and contest_date so you can filter to post-cutoff problems for your model.
02
Downloads default to the most recent releases (v5+v6, ~660MB) because recency is the benchmark's premise; set AGI_EVALS_LCB_RELEASES=all for the full ~4.3GB archive of every release delta.
03
Two test shapes, both supported: stdin/stdout programs (output compared line-by-line, trailing whitespace normalized) and LeetCode-style functional tests (JSON args applied to Solution.<func_name>, JSON return compared). Every test runs in a fresh subprocess with a wall-clock timeout.
04
Private test suites ship compressed upstream; the runner keeps them compressed in the cache (~128MB instead of gigabytes) and decodes lazily per case.
05
Sampling: LiveCodeBenchRunner(n_samples=10, k=1) draws 10 completions at temperature 0.8 and reports the unbiased pass@k estimator (Chen et al. 2021) instead of greedy pass@1.
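The stdin/stdout test shape described in step 03 can be sketched roughly as follows. This is a minimal illustration and not the runner's actual code: the helper names normalize and run_stdin_test, the "PASS" tag, and the choice to also drop trailing blank lines are assumptions made for the example. Only the 6-second default and the WRONG_ANSWER and TIMEOUT tags come from this page.

```python
import subprocess
import sys


def normalize(text: str) -> list[str]:
    """Split into lines, strip trailing whitespace per line, drop trailing blank lines."""
    lines = [line.rstrip() for line in text.splitlines()]
    while lines and lines[-1] == "":
        lines.pop()  # assumption: trailing blank lines are not significant
    return lines


def run_stdin_test(code: str, stdin_text: str, expected: str,
                   timeout_s: float = 6.0) -> str:
    """Run `code` in a fresh Python subprocess and return a result tag."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],  # fresh interpreter for every test
            input=stdin_text,
            capture_output=True,
            text=True,
            timeout=timeout_s,  # wall-clock limit; the child is killed on expiry
        )
    except subprocess.TimeoutExpired:
        return "TIMEOUT"
    if proc.returncode != 0:
        return "WRONG_ANSWER"  # runtime errors are reported as a wrong answer
    if normalize(proc.stdout) == normalize(expected):
        return "PASS"  # hypothetical tag for a passing test
    return "WRONG_ANSWER"
```

A functional (LeetCode-style) test would follow the same pattern, with the subprocess decoding the JSON arguments, calling the Solution method, and printing the JSON-encoded return value for comparison.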

Scoring

01
score = mean pass@k over problems (greedy pass@1 by default). A problem passes only if every test passes; a single failing test case fails the problem, matching the benchmark.
02
For honest contamination-free numbers, filter to problems with contest_date after your model's cutoff; an all-time score mixes memorizable and fresh problems.
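As a rough sketch of the scoring above, the unbiased pass@k estimator from Chen et al. 2021 is 1 - C(n-c, k) / C(n, k) for n samples of which c passed, and the score is its mean over problems. The record layout below (dicts with "contest_date", "n", and "c" keys) and the function names are assumptions made for illustration; the real report objects may be shaped differently.

```python
from datetime import date
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for n samples with c passing: 1 - C(n-c, k) / C(n, k)."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results, k, cutoff=None):
    """Mean pass@k over problems, optionally keeping only post-cutoff problems."""
    if cutoff is not None:
        results = [r for r in results if r["contest_date"] > cutoff]
    if not results:
        raise ValueError("no problems left after filtering")
    return sum(pass_at_k(r["n"], r["c"], k) for r in results) / len(results)


# Hypothetical per-problem records: 10 samples each, c of them passing.
results = [
    {"contest_date": date(2024, 1, 1), "n": 10, "c": 10},
    {"contest_date": date(2024, 9, 1), "n": 10, "c": 0},
    {"contest_date": date(2024, 10, 1), "n": 10, "c": 5},
]
all_time = mean_pass_at_k(results, k=1)
post_cutoff = mean_pass_at_k(results, k=1, cutoff=date(2024, 6, 1))
```

With these made-up records the all-time mean is 0.5 while the post-cutoff mean is 0.25, which illustrates how an older, possibly memorized problem can inflate an unfiltered score.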

Using it

01
CLI: agi-evals download livecodebench && agi-evals run livecodebench --model openai:gpt-4o-mini --limit 50
02
SDK pass@k: run_eval(LiveCodeBenchRunner(n_samples=10, k=5), patient) — needs a sampling-capable adapter (temperature is set automatically).
03
Full suites can run hundreds of tests per problem; set max_private_tests=20 for fast iteration, and run the full suites for any numbers you push.

Troubleshooting

Every non-pass carries exactly one typed failure tag, so the diagnosis is mechanical: look up the tag in report.failure_counts, then the per-case detail.

Tag	What it means	What to do
NO_ANSWER	No code block extracted from the reply.	The instruction asks for exactly one ```python block; models that chat first usually comply when reminded to return only the block.
WRONG_ANSWER	A test failed: wrong stdout, wrong return value, or a runtime error (stderr is captured in detail).	detail.failed_test_index plus detail.got/expected pinpoint the failing test. For functional problems confirm the model kept the starter signature.
TIMEOUT	One test exceeded timeout_s (default 6s, matching the official harness).	Usually a genuinely slow or non-terminating solution, which is a real finding rather than a harness bug. Raise timeout_s only if your machine is slow enough that reference solutions also miss the limit.
ADAPTER_ERROR / HARNESS_ERROR	Our side, not the model's: transport/auth failures or a harness bug. Excluded from the score automatically.	Check endpoint credentials and connectivity; if HARNESS_ERROR persists, file an issue with the per-case detail.error string.
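
The tag-first diagnosis can be sketched as a simple tally. The per-case record layout here (dicts with "passed" and "tag" keys) and the summarize helper are hypothetical, and dropping ADAPTER_ERROR and HARNESS_ERROR cases from the denominator is one plausible reading of "excluded from the score", not a confirmed description of the harness.

```python
from collections import Counter

# Tags that are attributed to the harness side rather than the model.
EXCLUDED = {"ADAPTER_ERROR", "HARNESS_ERROR"}


def summarize(cases):
    """Return (failure_counts, score) for a list of per-case records."""
    # Count one tag per non-passing case.
    failure_counts = Counter(c["tag"] for c in cases if not c["passed"])
    # Assumption: excluded tags are removed from the denominator entirely.
    scored = [c for c in cases if c["passed"] or c["tag"] not in EXCLUDED]
    score = sum(c["passed"] for c in scored) / len(scored) if scored else float("nan")
    return failure_counts, score


# Hypothetical records for five problems.
cases = [
    {"passed": True, "tag": None},
    {"passed": False, "tag": "WRONG_ANSWER"},
    {"passed": False, "tag": "TIMEOUT"},
    {"passed": False, "tag": "ADAPTER_ERROR"},
    {"passed": False, "tag": "WRONG_ANSWER"},
]
failure_counts, score = summarize(cases)
```

Sorting failure_counts.most_common() puts the dominant failure mode first, which is usually the right row of the table to start from.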
