
BIG-Bench Hard


27 BIG-Bench tasks where prior models underperformed humans.

How it works

01
BIG-Bench Hard collects 27 BIG-Bench tasks where earlier models underperformed average human raters — boolean expressions, date understanding, navigation, word sorting, and more. Each case is one task input with a single string target.
02
The runner asks for step-by-step reasoning ending in 'Answer: <answer>', mirroring the paper's chain-of-thought setting, then grades by exact match after normalization (case, surrounding parens, trailing periods) — so '(A)', 'False', 'valid', and numbers all compare cleanly.
03
The bundled sample spans several task types; `agi-evals download bbh` fetches all 27 tasks (~6.5k cases) and caches them locally.
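The extraction and normalization steps described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the harness's actual code: the function names (`extract_answer`, `normalize`, `is_match`) are hypothetical, and the real grader may apply its rules in a different order or handle edge cases differently.

```python
import re

# Hypothetical helpers; names and exact rules are assumptions, not the harness API.
ANSWER_RE = re.compile(r"answer:\s*(.*)", re.IGNORECASE)

def extract_answer(reply: str) -> str:
    """Return the text after the last 'Answer:' marker, else the last non-empty line."""
    lines = [line.strip() for line in reply.splitlines() if line.strip()]
    for line in reversed(lines):
        match = ANSWER_RE.search(line)
        if match:
            return match.group(1).strip()
    # No marker found: fall back to the last non-empty line (empty string if none).
    return lines[-1] if lines else ""

def normalize(text: str) -> str:
    """Lowercase, drop trailing periods, then drop one pair of surrounding parens."""
    text = text.strip().lower().rstrip(".").strip()
    if len(text) >= 2 and text.startswith("(") and text.endswith(")"):
        text = text[1:-1].strip()
    return text

def is_match(reply: str, target: str) -> bool:
    """Exact match between the normalized extracted answer and the normalized target."""
    return normalize(extract_answer(reply)) == normalize(target)
```

With these definitions, a reply ending in `Answer: (B).` matches the target `(B)`, while a reply with no `Answer:` marker is graded on its last non-empty line.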

Scoring

01
score = exact-match rate after normalization, pooled over every case from every task, so tasks with more cases carry more weight. The paper reports per-task accuracy; to get the same breakdown, group results by each result's detail.task field.
02
Targets vary in shape per task — letters for MCQ-style tasks, words for classification-style, free strings for sorting/Dyck — which is why grading normalizes rather than pattern-matching one format.
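To make the pooled-versus-per-task distinction concrete, here is a small sketch. The tallies are made-up illustrative numbers, and `pooled_score` and `macro_average` are hypothetical helpers rather than harness functions. The two figures diverge whenever tasks contribute different numbers of cases.

```python
# Hypothetical per-task tallies: task name -> (correct, total). Illustrative numbers only.
tallies = {
    "boolean_expressions": (9, 10),
    "word_sorting": (10, 40),
}

def pooled_score(tallies):
    """All cases in one pool: total correct over total cases."""
    correct = sum(c for c, _ in tallies.values())
    total = sum(n for _, n in tallies.values())
    return correct / total if total else 0.0

def macro_average(tallies):
    """Unweighted mean of per-task accuracies."""
    rates = [c / n for c, n in tallies.values() if n]
    return sum(rates) / len(rates) if rates else 0.0

pooled = pooled_score(tallies)  # 19 / 50 = 0.38
macro = macro_average(tallies)  # (0.9 + 0.25) / 2 = 0.575
```

Here the larger, harder task dominates the pooled number, which is why the pooled score is not directly comparable to an average of per-task accuracies.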

Using it

01
CLI: `agi-evals download bbh && agi-evals run bbh --model openai:gpt-4o-mini --limit 200`
02
Per-task analysis: group results by detail.task and compare pass rates to see which of the 27 tasks pull the pooled score down.
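A per-task breakdown along these lines might look like the sketch below. It assumes each result is a dict with a boolean `passed` key and a nested `detail` dict carrying `task`; only `detail.task` comes from this page, and the rest of the shape is an assumption to adapt to the real result objects.

```python
from collections import defaultdict

def per_task_accuracy(results):
    """Group results by detail.task; return (task, accuracy, n) rows, worst task first."""
    buckets = defaultdict(lambda: [0, 0])  # task -> [passed, total]
    for r in results:
        bucket = buckets[r["detail"]["task"]]  # 'passed' / dict shape are assumptions
        bucket[0] += 1 if r["passed"] else 0
        bucket[1] += 1
    rows = [(task, p / n, n) for task, (p, n) in buckets.items()]
    return sorted(rows, key=lambda row: row[1])
```

Keeping the case count `n` in each row helps separate a genuinely weak task from one that looks weak only because few of its cases were sampled.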

Troubleshooting

Every non-pass carries exactly one typed failure tag, so the diagnosis is mechanical: look up the tag in report.failure_counts, then read the per-case detail for the affected cases.

Tag	What it means	What to do
NO_ANSWER	Empty reply — extraction falls back to the last non-empty line, so only a truly empty response tags this.	Check the adapter; an empty string usually means a transport-level truncation rather than a model choice.
WRONG_ANSWER	The extracted answer didn't match after normalization. Multi-word targets (word_sorting, dyck_languages) must match exactly, token for token.	Inspect detail.parsed_answer vs detail.expected per task. For sorting/Dyck tasks confirm the model isn't adding commas or quotes — exact match is the paper's protocol, so the grader will not loosen it.
ADAPTER_ERROR / HARNESS_ERROR	Our side, not the model's: transport/auth failures or a harness bug. Excluded from the score automatically.	Check endpoint credentials and connectivity; if HARNESS_ERROR persists, file an issue with the per-case detail.error string.
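The triage flow in the table can be sketched as follows. The result shape is an assumption: a dict with a boolean `passed`, a `failure_tag` string, and a `detail` dict. Only the tag names and the `detail.task`, `detail.parsed_answer`, `detail.expected`, and `detail.error` fields are taken from this page; the helper names are hypothetical.

```python
from collections import Counter

# Tags this page describes as harness-side and excluded from the score.
INFRA_TAGS = {"ADAPTER_ERROR", "HARNESS_ERROR"}

def tally_failures(results):
    """Count failure tags over non-passing results (similar in spirit to report.failure_counts)."""
    return Counter(r["failure_tag"] for r in results if not r["passed"])

def mismatches(results):
    """Return (task, parsed_answer, expected) for every WRONG_ANSWER case."""
    return [
        (r["detail"]["task"], r["detail"]["parsed_answer"], r["detail"]["expected"])
        for r in results
        if r.get("failure_tag") == "WRONG_ANSWER"
    ]

def infra_errors(results):
    """Collect detail.error strings for adapter or harness failures."""
    return [
        r["detail"].get("error", "")
        for r in results
        if r.get("failure_tag") in INFRA_TAGS
    ]
```

Reading the `mismatches` output side by side is usually enough to spot formatting drift such as added commas or quotes in sorting answers.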
