BIG-Bench Hard
27 BIG-Bench tasks where prior models underperformed humans.
How it works
- 01
BIG-Bench Hard collects 27 BIG-Bench tasks where earlier models underperformed average human raters — boolean expressions, date understanding, navigation, word sorting, and more. Each case is one task input with a single string target.
- 02
The runner prompts for step-by-step reasoning that ends in 'Answer: <answer>', mirroring the paper's chain-of-thought setting. It then grades by exact match after normalizing case, surrounding parentheses, and trailing periods, so '(A)', 'False', 'valid', and numbers all compare cleanly.
- 03
The bundled sample spans several task types; `agi-evals download bbh` fetches all 27 tasks (~6.5k cases) and caches them locally.
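The extract-then-normalize flow described above can be sketched roughly as follows. This is a minimal illustration, not the harness's actual code: the function names (`extract_answer`, `normalize`, `is_match`) and the exact regular expressions are assumptions, and the fallback to the last non-empty line follows the behavior described for NO_ANSWER under Troubleshooting.

```python
import re

def extract_answer(reply: str) -> str:
    """Return the text after the last 'Answer:' marker, else the last non-empty line."""
    matches = re.findall(r"answer:[ \t]*(.*)", reply, flags=re.IGNORECASE)
    if matches:
        return matches[-1].strip()
    lines = [line.strip() for line in reply.splitlines() if line.strip()]
    return lines[-1] if lines else ""

def normalize(text: str) -> str:
    """Lowercase, drop trailing periods, and unwrap one pair of surrounding parentheses."""
    text = text.strip().lower().rstrip(".").strip()
    wrapped = re.fullmatch(r"\(([^()]*)\)", text)
    if wrapped:
        text = wrapped.group(1).strip()
    return text

def is_match(reply: str, target: str) -> bool:
    """Exact match between the normalized extracted answer and the normalized target."""
    return normalize(extract_answer(reply)) == normalize(target)
```

Under these assumptions, a reply ending in `Answer: (A).` matches the target `(A)`, while `Answer: apple, banana` does not match `apple banana`, because nothing beyond case, parentheses, and trailing periods is relaxed.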
Scoring
- 01
score = normalized exact-match rate over the cases of all tasks pooled together, so tasks with more cases weigh more. The paper reports per-task accuracy; a per-task breakdown can be derived by grouping on each result's detail.task field.
- 02
Targets vary in shape per task — letters for MCQ-style tasks, words for classification-style, free strings for sorting/Dyck — which is why grading normalizes rather than pattern-matching one format.
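To make the difference between the pooled score and the paper's per-task view concrete, here is a small sketch. The tallies are invented for illustration only, and `pooled_score` and `macro_average` are hypothetical helper names, not part of the tool.

```python
# Invented tallies for illustration: task name -> (correct, total).
tallies = {
    "boolean_expressions": (90, 100),
    "word_sorting": (40, 100),
    "date_understanding": (10, 50),
}

def pooled_score(tallies):
    """Exact-match rate over all cases pooled together (each case counts once)."""
    correct = sum(c for c, _ in tallies.values())
    total = sum(t for _, t in tallies.values())
    return correct / total if total else 0.0

def macro_average(tallies):
    """Unweighted mean of per-task accuracies (each task counts once)."""
    accuracies = [c / t for c, t in tallies.values() if t]
    return sum(accuracies) / len(accuracies) if accuracies else 0.0

print(round(pooled_score(tallies), 3))   # 140 correct out of 250 cases
print(round(macro_average(tallies), 3))  # mean of 0.9, 0.4 and 0.2
```

With these made-up numbers the pooled rate is 0.56 and the mean of per-task accuracies is 0.5. The two diverge whenever tasks differ in size, which is worth keeping in mind when comparing against per-task figures.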
Using it
- 01
CLI: `agi-evals download bbh && agi-evals run bbh --model openai:gpt-4o-mini --limit 200`
- 02
Per-task analysis: group results by detail.task to see which of the 27 tasks drags the average.
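The per-task analysis can be sketched as a simple group-by. The record shape here is an assumption: each result is taken to be a dict with a boolean `passed` key and a nested `detail` dict holding `task`. Only the `detail.task` field comes from the text above, so adapt the accessors to whatever the real result objects expose.

```python
from collections import defaultdict

def accuracy_by_task(results):
    """Group results by detail.task and return (task, accuracy, n) rows, worst first."""
    counts = defaultdict(lambda: [0, 0])  # task -> [correct, total]
    for result in results:
        entry = counts[result["detail"]["task"]]
        entry[0] += 1 if result["passed"] else 0
        entry[1] += 1
    rows = [(task, correct / total, total) for task, (correct, total) in counts.items()]
    return sorted(rows, key=lambda row: row[1])

# Hypothetical records, for illustration only.
sample = [
    {"passed": True, "detail": {"task": "boolean_expressions"}},
    {"passed": True, "detail": {"task": "boolean_expressions"}},
    {"passed": False, "detail": {"task": "word_sorting"}},
    {"passed": True, "detail": {"task": "word_sorting"}},
]
for task, accuracy, n in accuracy_by_task(sample):
    print(f"{task}: {accuracy:.2f} over {n} cases")
```

Sorting ascending puts the tasks that drag the average at the top of the list.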
Troubleshooting
Every non-pass carries exactly one typed failure tag, so the diagnosis is mechanical: look up the tag in report.failure_counts, then the per-case detail.
| Tag | What it means | What to do |
|---|---|---|
| NO_ANSWER | The reply was empty. Extraction falls back to the last non-empty line, so only a truly empty response gets this tag. | Check the adapter; an empty string usually means a transport-level truncation rather than a model choice. |
| WRONG_ANSWER | The extracted answer didn't match after normalization. Multi-word targets (word_sorting, dyck_languages) must match exactly, token for token. | Inspect detail.parsed_answer vs detail.expected per task. For sorting/Dyck tasks confirm the model isn't adding commas or quotes — exact match is the paper's protocol, so the grader will not loosen it. |
| ADAPTER_ERROR / HARNESS_ERROR | Our side, not the model's: transport/auth failures or a harness bug. Excluded from the score automatically. | Check endpoint credentials and connectivity; if HARNESS_ERROR persists, file an issue with the per-case detail.error string. |
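A rough sketch of the triage described above: count failure tags, then compute a score that leaves out cases that failed on the harness side. The record shape (a `passed` flag and a `failure_tag` string) and the `summarize` helper are assumptions for illustration. The real report already exposes `report.failure_counts`, so this only mirrors the idea.

```python
from collections import Counter

# Tags that the table above describes as excluded from the score.
EXCLUDED_TAGS = {"ADAPTER_ERROR", "HARNESS_ERROR"}

def summarize(results):
    """Return (failure tag counts, score with harness-side errors excluded)."""
    failure_counts = Counter(r["failure_tag"] for r in results if not r["passed"])
    scored = [r for r in results if r["passed"] or r["failure_tag"] not in EXCLUDED_TAGS]
    score = sum(1 for r in scored if r["passed"]) / len(scored) if scored else 0.0
    return failure_counts, score

# Hypothetical records, for illustration only.
sample_results = [
    {"passed": True, "failure_tag": None},
    {"passed": True, "failure_tag": None},
    {"passed": True, "failure_tag": None},
    {"passed": False, "failure_tag": "WRONG_ANSWER"},
    {"passed": False, "failure_tag": "NO_ANSWER"},
    {"passed": False, "failure_tag": "ADAPTER_ERROR"},
]
counts, score = summarize(sample_results)
print(counts.most_common())
print(score)
```

In this made-up sample the ADAPTER_ERROR case drops out of the denominator, so the score is 3 passes over 5 scored cases rather than over 6.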