
LiveCodeBench

Continuously updated contest problems to avoid training-set contamination.

How it works

  • 01

    Contest problems from LeetCode, AtCoder, and Codeforces, collected continuously so that every model can be scored on problems published after its training cutoff; avoiding contamination is the point. Each result records platform, difficulty, and contest_date so you can filter to post-cutoff problems for your model.

  • 02

    Downloads default to the most recent releases (v5+v6, ~660MB) because recency is the benchmark's premise; set AGI_EVALS_LCB_RELEASES=all for the full ~4.3GB archive of every release delta.

  • 03

    Two test shapes, both supported: stdin/stdout programs (output compared line-by-line, trailing whitespace normalized) and LeetCode-style functional tests (JSON args applied to Solution.<func_name>, JSON return compared). Every test runs in a fresh subprocess with a wall-clock timeout.

  • 04

    Private test suites ship compressed upstream; the runner keeps them compressed in the cache (~128MB instead of gigabytes) and decodes lazily per case.

  • 05

    Sampling: LiveCodeBenchRunner(n_samples=10, k=1) draws 10 completions at temperature 0.8 and reports the unbiased pass@k estimator (Chen et al. 2021) instead of greedy pass@1.
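As a rough illustration of the two test shapes in step 03, the sketch below compares stdout line by line with trailing whitespace stripped, and applies JSON arguments to a Solution method. It is not the runner's actual code: the function names, the choice to ignore trailing blank lines, and the one-JSON-value-per-line argument encoding are assumptions made for the example. The subprocess isolation and timeout are left out.

```python
import json


def stdout_matches(got: str, expected: str) -> bool:
    """Compare program output line by line, ignoring trailing whitespace.

    Assumption: trailing blank lines are ignored as well.
    """
    got_lines = [line.rstrip() for line in got.rstrip().splitlines()]
    expected_lines = [line.rstrip() for line in expected.rstrip().splitlines()]
    return got_lines == expected_lines


def functional_matches(solution_cls, func_name: str,
                       args_json: str, expected_json: str) -> bool:
    """Call Solution.<func_name> with JSON-decoded args and compare the return.

    Assumption: args_json holds one JSON value per line, one line per
    positional argument.
    """
    args = [json.loads(line) for line in args_json.splitlines() if line.strip()]
    result = getattr(solution_cls(), func_name)(*args)
    # Round-trip the result through JSON so tuples and lists compare equal.
    return json.loads(json.dumps(result)) == json.loads(expected_json)


class Solution:  # stand-in for a model-written solution
    def twoSum(self, nums, target):
        seen = {}
        for i, x in enumerate(nums):
            if target - x in seen:
                return [seen[target - x], i]
            seen[x] = i
        return []


print(stdout_matches("1 2 \n3\n\n", "1 2\n3\n"))
print(functional_matches(Solution, "twoSum", "[2, 7, 11, 15]\n9", "[0, 1]"))
```

Both calls return True here: the first differs from the expected output only in trailing whitespace, and the second returns the expected index pair.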

Scoring

  • 01

    score = mean pass@k over problems (pass@1 greedy by default). A problem passes only if EVERY test passes — one wrong test case fails the problem, matching the benchmark.

  • 02

    For honest contamination-free numbers, filter to problems with contest_date after your model's cutoff; an all-time score mixes memorizable and fresh problems.
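The scoring rules above can be sketched in a few lines. The estimator is the standard unbiased pass@k from Chen et al. 2021, 1 - C(n - c, k) / C(n, k), where n is the number of samples and c the number that pass every test. The result records here are hypothetical dicts: contest_date comes from the docs above, but n_samples and n_correct as field names, and the helper mean_pass_at_k, are assumptions for illustration only.

```python
from datetime import date
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n - c, k) / C(n, k)."""
    if not 0 < k <= n:
        raise ValueError("k must satisfy 0 < k <= n")
    if n - c < k:
        # Fewer than k failing samples: every draw of k contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results, k, cutoff=None):
    """Average pass@k over problems, optionally keeping only post-cutoff ones."""
    kept = [r for r in results if cutoff is None or r["contest_date"] > cutoff]
    if not kept:
        raise ValueError("no problems after the cutoff")
    return sum(pass_at_k(r["n_samples"], r["n_correct"], k) for r in kept) / len(kept)


# Hypothetical per-problem records; a sample counts as correct only if it
# passed every test for that problem.
results = [
    {"contest_date": date(2023, 5, 1), "n_samples": 10, "n_correct": 10},
    {"contest_date": date(2024, 9, 1), "n_samples": 10, "n_correct": 3},
    {"contest_date": date(2024, 11, 1), "n_samples": 10, "n_correct": 0},
]

all_time = mean_pass_at_k(results, k=1)
fresh = mean_pass_at_k(results, k=1, cutoff=date(2024, 6, 1))
print(f"all-time pass@1:    {all_time:.3f}")
print(f"post-cutoff pass@1: {fresh:.3f}")
```

In this made-up data the older problem is solved by every sample, so the all-time mean comes out well above the post-cutoff mean. That gap is the effect the contest_date filter is meant to expose.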

Using it

  • 01

    CLI: agi-evals download livecodebench && agi-evals run livecodebench --model openai:gpt-4o-mini --limit 50

  • 02

    SDK pass@k: run_eval(LiveCodeBenchRunner(n_samples=10, k=5), patient) — needs a sampling-capable adapter (temperature is set automatically).

  • 03

    Full suites can run hundreds of tests per problem; use max_private_tests=20 for fast iteration, full tests for pushed numbers.

Troubleshooting

Every non-pass carries exactly one typed failure tag, so the diagnosis is mechanical: look up the tag in report.failure_counts, then the per-case detail.

Tag | What it means | What to do
--- | --- | ---
NO_ANSWER | No code block extracted from the reply. | The instruction asks for exactly one ```python block; models that chat first usually comply when reminded to return only the block.
WRONG_ANSWER | A test failed: wrong stdout, wrong return value, or a runtime error (stderr is captured in detail). | detail.failed_test_index plus detail.got/expected pinpoint the failing test. For functional problems confirm the model kept the starter signature.
TIMEOUT | One test exceeded timeout_s (default 6s, matching the official harness). | Usually a genuinely slow/non-terminating solution (the finding, not a bug). Raise timeout_s only if your machine is slow enough that reference solutions also miss the limit.
ADAPTER_ERROR / HARNESS_ERROR | Our side, not the model's: transport/auth failures or a harness bug. Excluded from the score automatically. | Check endpoint credentials and connectivity; if HARNESS_ERROR persists, file an issue with the per-case detail.error string.
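To make the tag workflow concrete, here is a minimal sketch of tallying failure tags and leaving harness-side errors out of the score. The per-case record shape (a dict whose "tag" is None for a pass) and the summarize helper are assumptions for the example, not the library's data model. Only the tag names and the exclusion rule come from the table above.

```python
from collections import Counter

# Tags that are the harness's fault and do not count against the model.
EXCLUDED_FROM_SCORE = {"ADAPTER_ERROR", "HARNESS_ERROR"}


def summarize(cases):
    """Return (failure_counts, score) for a list of per-case records.

    Assumption: each case is a dict with a "tag" key that is None when the
    case passed and one failure tag string otherwise.
    """
    failure_counts = Counter(c["tag"] for c in cases if c["tag"] is not None)
    scored = [c for c in cases if c["tag"] not in EXCLUDED_FROM_SCORE]
    passed = sum(1 for c in scored if c["tag"] is None)
    score = passed / len(scored) if scored else 0.0
    return failure_counts, score


# Hypothetical run: two passes, three model failures, one transport failure.
cases = [
    {"tag": None},
    {"tag": None},
    {"tag": "WRONG_ANSWER"},
    {"tag": "WRONG_ANSWER"},
    {"tag": "TIMEOUT"},
    {"tag": "ADAPTER_ERROR"},
]

failure_counts, score = summarize(cases)
for tag, count in failure_counts.most_common():
    print(f"{tag}: {count}")
print(f"score over {len(cases) - failure_counts['ADAPTER_ERROR']} scored cases: {score:.2f}")
```

The ADAPTER_ERROR case still shows up in the counts, so it can be diagnosed, but it is dropped from the denominator. The score is therefore 2 passes out of 5 scored cases rather than 2 out of 6.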