BIG-Bench Hard

27 BIG-Bench tasks where prior models underperformed humans.

How it works

  • 01

    BIG-Bench Hard collects 27 BIG-Bench tasks where earlier models underperformed average human raters — boolean expressions, date understanding, navigation, word sorting, and more. Each case is one task input with a single string target.

  • 02

    The runner asks for step-by-step reasoning ending in 'Answer: <answer>', mirroring the paper's chain-of-thought setting, then grades by exact match after normalization (case, surrounding parens, trailing periods) — so '(A)', 'False', 'valid', and numbers all compare cleanly.

  • 03

    The bundled sample spans several task types; `agi-evals download bbh` fetches all 27 tasks (~6.5k cases) and caches them locally.
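
The extraction and normalization steps described above can be sketched as follows. This is a minimal illustration, not the runner's actual code: the function names `extract_answer`, `normalize`, and `is_match` are hypothetical, and the order in which the normalization rules are applied is an assumption.

```python
import re

def extract_answer(reply: str) -> str:
    """Text after the last 'Answer:' marker, else the last non-empty line."""
    matches = re.findall(r"Answer:[ \t]*(.*)", reply)
    if matches:
        return matches[-1].strip()
    lines = [ln.strip() for ln in reply.splitlines() if ln.strip()]
    return lines[-1] if lines else ""

def normalize(text: str) -> str:
    """Lowercase, drop trailing periods, then drop one pair of surrounding parens."""
    text = text.strip().lower().rstrip(".").strip()
    if len(text) >= 2 and text[0] == "(" and text[-1] == ")":
        text = text[1:-1].strip()
    return text

def is_match(reply: str, target: str) -> bool:
    # Both sides go through the same normalization, so the comparison stays exact.
    return normalize(extract_answer(reply)) == normalize(target)
```

Because the reply and the target pass through the same function, the comparison remains an exact match; normalization only removes formatting differences such as `(A)` versus `A` or `False.` versus `false`.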

Scoring

  • 01

    score = normalized exact-match rate over all cases, pooled across tasks, so tasks with more cases carry more weight. The paper reports per-task accuracy instead; to get that breakdown, group results by each result's detail.task field.

  • 02

    Targets vary in shape per task — letters for MCQ-style tasks, words for classification-style, free strings for sorting/Dyck — which is why grading normalizes rather than pattern-matching one format.
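
To see how the pooled score can differ from an average of per-task accuracies, here is a small worked sketch. The tallies are made-up illustrative numbers, not real results, and the helper names `pooled_score` and `macro_average` are hypothetical.

```python
# Made-up tallies for illustration only: task name -> (correct, total).
tallies = {
    "boolean_expressions": (45, 50),
    "word_sorting": (20, 50),
    "dyck_languages": (10, 100),
}

def pooled_score(tallies):
    """Exact-match rate over all cases pooled together (what `score` reports)."""
    correct = sum(c for c, _ in tallies.values())
    total = sum(t for _, t in tallies.values())
    return correct / total

def macro_average(tallies):
    """Unweighted mean of per-task accuracies (closer to per-task reporting)."""
    return sum(c / t for c, t in tallies.values()) / len(tallies)

print(f"pooled: {pooled_score(tallies):.3f}")
print(f"macro:  {macro_average(tallies):.3f}")
```

With these numbers the pooled rate is 75/200 = 0.375, while the unweighted per-task mean is (0.9 + 0.4 + 0.1)/3, about 0.467. The gap comes from the largest task also being the weakest, which is worth keeping in mind when comparing a pooled score against per-task figures.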

Using it

  • 01

    CLI: `agi-evals download bbh && agi-evals run bbh --model openai:gpt-4o-mini --limit 200`

  • 02

    Per-task analysis: group results by detail.task to see which of the 27 tasks drags the average.
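
The per-task grouping could look roughly like the sketch below. The record shape is an assumption: each result is modeled as a dict with a hypothetical `passed` boolean and a nested `detail` dict carrying `task`; adapt the field access to whatever your results actually expose.

```python
from collections import defaultdict

def accuracy_by_task(results):
    """Return (task, accuracy, n_cases) tuples, worst task first."""
    counts = defaultdict(lambda: [0, 0])  # task -> [passed, total]
    for r in results:
        bucket = counts[r["detail"]["task"]]
        bucket[0] += 1 if r["passed"] else 0
        bucket[1] += 1
    # Sort by accuracy, then task name, so ties come out in a stable order.
    ranked = sorted((p / n, task, n) for task, (p, n) in counts.items())
    return [(task, acc, n) for acc, task, n in ranked]
```

Keeping the case count next to each accuracy helps when `--limit` leaves some tasks with only a handful of cases, since a low accuracy on very few cases is weak evidence.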

Troubleshooting

Every non-pass carries exactly one typed failure tag, so the diagnosis is mechanical: look up the tag in report.failure_counts, then the per-case detail.

| Tag | What it means | What to do |
| --- | --- | --- |
| NO_ANSWER | Empty reply. Extraction falls back to the last non-empty line, so only a truly empty response tags this. | Check the adapter; an empty string usually means a transport-level truncation rather than a model choice. |
| WRONG_ANSWER | The extracted answer didn't match after normalization. Multi-word targets (word_sorting, dyck_languages) must match exactly, token for token. | Inspect detail.parsed_answer vs detail.expected per task. For sorting/Dyck tasks confirm the model isn't adding commas or quotes; exact match is the paper's protocol, so the grader will not loosen it. |
| ADAPTER_ERROR / HARNESS_ERROR | Our side, not the model's: transport/auth failures or a harness bug. Excluded from the score automatically. | Check endpoint credentials and connectivity; if HARNESS_ERROR persists, file an issue with the per-case detail.error string. |
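
As a rough sketch of how the tags feed into triage and scoring, the snippet below tallies failure tags and computes a score that leaves out the two infrastructure tags. The record shape is an assumption: each result is modeled as a dict with a hypothetical `tag` key that is `None` for a pass. The harness already provides report.failure_counts, so this only illustrates the bookkeeping.

```python
from collections import Counter

# Infrastructure failures are excluded from the score, per the table above.
EXCLUDED = {"ADAPTER_ERROR", "HARNESS_ERROR"}

def summarize(results):
    """Return (score, failure_counts) for a list of result dicts."""
    failure_counts = Counter(r["tag"] for r in results if r["tag"] is not None)
    scored = [r for r in results if r["tag"] not in EXCLUDED]
    passed = sum(1 for r in scored if r["tag"] is None)
    score = passed / len(scored) if scored else 0.0
    return score, failure_counts
```

A large ADAPTER_ERROR count shrinks the scored set without lowering the score, so it is worth checking the tallies before trusting a score computed from a run with connectivity problems.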