GAIA
Live
Real-world assistant questions needing tools, web, and multi-step reasoning.
How it works
- 01
GAIA asks real-world assistant questions — 'what was the enrollment of this clinical trial on the NIH site', 'which astronaut in that APOD photo's group spent least time in space' — that humans find tedious but doable and models find genuinely hard. Three difficulty levels, from single-lookup (Level 1) to long multi-source chains (Level 3).
- 02
The protocol is the paper's, verbatim: models receive the official system prompt and must end with 'FINAL ANSWER: ...' — a number, a few words, or a comma-separated list, with strict formatting rules (no commas in numbers, no articles in strings).
- 03
We grade the public validation split (165 questions) with the leaderboard's own question_scorer, vendored 1:1 (Apache-2.0). Test-split answers are withheld upstream — submit to the official leaderboard for those.
- 04
Scope note: some questions reference an attached file (spreadsheet, image, audio). The runner names the attachment in the prompt but does not deliver its contents — matching the paper's no-tools text baseline. Expect file questions to fail unless your adapter does its own retrieval; detail.has_file marks them.
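The 'FINAL ANSWER: ...' convention in step 02 can be illustrated with a small parser. This is a minimal sketch, not the harness's own code: the helper name `extract_final_answer`, the case-insensitive match, and the choice to keep only the last occurrence are all assumptions made for illustration.

```python
import re
from typing import Optional

# Hypothetical pattern: captures the rest of the line after the marker.
# Case-insensitivity is an assumption, not documented behaviour.
_FINAL_ANSWER = re.compile(r"FINAL ANSWER:\s*(.*)", re.IGNORECASE)


def extract_final_answer(response: str) -> Optional[str]:
    """Return the text after the last 'FINAL ANSWER:' marker, or None.

    None corresponds to the case where the model never produced the template.
    """
    matches = _FINAL_ANSWER.findall(response)
    if not matches:
        return None
    answer = matches[-1].strip()
    # An empty answer after the marker is treated like a missing one.
    return answer or None
```

Taking the last match rather than the first is a design choice: a model that revises itself mid-response is judged on what it finally committed to.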
Scoring
- 01
Binary per question via the official scorer: numbers compare as floats after stripping '$', '%', and commas; lists split on ',' or ';' and compare element-wise; strings compare after whitespace/punctuation/case normalization.
- 02
score = pass_rate = fraction of questions whose answer matches the expected one under the official scorer. detail.level enables the paper's per-level breakdown.
- 03
Formatting matters by design: an answer of '5,876' inside a semicolon list fails against '5876', because the comma is itself a list separator and the element counts no longer match. The system prompt warns models about exactly this.
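The comparison rules above can be sketched in a few lines. This is a simplified approximation for reading purposes only; the vendored question_scorer is authoritative and may differ in edge cases. The function names here are hypothetical, and the exact whitespace and punctuation handling is an assumption based on the description above.

```python
import re
import string


def _is_float(text: str) -> bool:
    try:
        float(text)
        return True
    except ValueError:
        return False


def _to_number(text: str) -> float:
    # Strip $, % and commas before parsing; unparseable input becomes inf
    # so that it never equals a finite expected value.
    for ch in "$%,":
        text = text.replace(ch, "")
    try:
        return float(text)
    except ValueError:
        return float("inf")


def _norm_str(text: str, remove_punct: bool = True) -> str:
    # Assumed normalization: drop all whitespace, lowercase, optionally drop punctuation.
    text = re.sub(r"\s", "", text).lower()
    if remove_punct:
        text = text.translate(str.maketrans("", "", string.punctuation))
    return text


def score(model_answer: str, expected: str) -> bool:
    """Sketch of the binary match: number, list, or string, chosen by `expected`."""
    if _is_float(expected):
        return _to_number(model_answer) == float(expected)
    if any(sep in expected for sep in ",;"):
        want = re.split(r"[,;]", expected)
        got = re.split(r"[,;]", model_answer)
        if len(want) != len(got):
            return False
        for g, w in zip(got, want):
            if _is_float(w):
                ok = _to_number(g) == float(w)
            else:
                ok = _norm_str(g, remove_punct=False) == _norm_str(w, remove_punct=False)
            if not ok:
                return False
        return True
    return _norm_str(model_answer) == _norm_str(expected)
```

Under this sketch, `score("$5,876", "5876")` passes because a bare number tolerates commas, while `score("5,876; 12", "5876; 12")` fails because splitting on both separators yields three elements against two.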
Using it
- 01
agi-evals run gaia --model echo # offline smoke test (3 paper examples)
- 02
# Full validation set is gated: accept terms at huggingface.co/datasets/gaia-benchmark/GAIA
- 03
HF_TOKEN=hf_... agi-evals download gaia
- 04
agi-evals run gaia --model openai:gpt-4o --push
Troubleshooting
Every non-pass carries exactly one typed failure tag, so the diagnosis is mechanical: look up the tag in report.failure_counts, then the per-case detail.
| Tag | What it means | What to do |
|---|---|---|
| NO_ANSWER | The model never produced the 'FINAL ANSWER:' template. | The official system prompt is already in every request; smaller models may need a reminder appended. GPT-4-class models follow it reliably. |
| WRONG_ANSWER | Answer didn't match under the official normalization — often a formatting miss (units, articles, commas in numbers) rather than a knowledge miss. | Compare detail.parsed_answer to detail.expected. If detail.has_file is true, the model likely never saw the attachment — that's the documented scope boundary, not a bug. |
| download fails | GAIA is gated on Hugging Face. | Accept the dataset terms while signed in, create a read token, and set HF_TOKEN before agi-evals download gaia. |
| ADAPTER_ERROR / HARNESS_ERROR | Our side, not the model's: transport/auth failures or a harness bug. Excluded from the score automatically. | Check endpoint credentials and connectivity; if HARNESS_ERROR persists, file an issue with the per-case detail.error string. |
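
The triage described in this section can be sketched as a small summary function. The per-case record shape used here, `{"tag": ..., "detail": {...}}` with `tag` set to `None` for a pass, is an assumption for illustration; the real report format may differ, and `triage` is a hypothetical helper, not part of agi-evals.

```python
from collections import Counter
from typing import Any

# Tags the table above describes as excluded from the score.
EXCLUDED = {"ADAPTER_ERROR", "HARNESS_ERROR"}


def triage(cases: list[dict[str, Any]]) -> dict[str, Any]:
    """Summarize per-case results (assumed shape: {"tag": str | None, "detail": dict})."""
    failure_counts = Counter(c["tag"] for c in cases if c["tag"] is not None)
    # Harness-side failures do not count for or against the model.
    scored = [c for c in cases if c["tag"] not in EXCLUDED]
    passed = sum(1 for c in scored if c["tag"] is None)
    # WRONG_ANSWER on a file question is the documented scope boundary.
    file_misses = sum(
        1
        for c in scored
        if c["tag"] == "WRONG_ANSWER" and c["detail"].get("has_file")
    )
    return {
        "failure_counts": dict(failure_counts),
        "pass_rate": passed / len(scored) if scored else 0.0,
        "wrong_answer_with_file": file_misses,
    }
```

Separating out the WRONG_ANSWER cases where has_file is true makes it easier to see how much of a low score comes from attachments the model never saw rather than from formatting or knowledge misses.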