GAIA

Real-world assistant questions needing tools, web, and multi-step reasoning.

How it works

01
GAIA asks real-world assistant questions — 'what was the enrollment of this clinical trial on the NIH site', 'which astronaut in that APOD photo's group spent least time in space' — that humans find tedious but doable and models find genuinely hard. Three difficulty levels, from single-lookup (Level 1) to long multi-source chains (Level 3).
02
The protocol is the paper's, verbatim: models receive the official system prompt and must end with 'FINAL ANSWER: ...' — a number, a few words, or a comma-separated list, with strict formatting rules (no commas in numbers, no articles in strings).
03
We grade the public validation split (165 questions) with the leaderboard's own question_scorer, vendored 1:1 (Apache-2.0). Test-split answers are withheld upstream — submit to the official leaderboard for those.
04
Scope note: some questions reference an attached file (spreadsheet, image, audio). The runner names the attachment in the prompt but does not deliver its contents — matching the paper's no-tools text baseline. Expect file questions to fail unless your adapter does its own retrieval; detail.has_file marks them.
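
The answer template from step 02 can be pulled out of a reply with a small parser. The following is a minimal sketch, not the runner's actual code: extract_final_answer is a hypothetical helper, and the case-insensitive match and "take the last occurrence" rule are assumptions made for illustration.

```python
import re
from typing import Optional

# Hypothetical helper, not part of agi-evals. Captures the rest of the line
# after each "FINAL ANSWER:" marker (case-insensitive by assumption).
_FINAL_ANSWER = re.compile(r"FINAL ANSWER:\s*(.*)", re.IGNORECASE)


def extract_final_answer(reply: str) -> Optional[str]:
    """Return the text after the last 'FINAL ANSWER:' marker, or None."""
    matches = _FINAL_ANSWER.findall(reply)
    if not matches:
        # No template in the reply: nothing to grade.
        return None
    # Assumption: if the model repeats the template, the last one wins.
    return matches[-1].strip()
```

Returning None rather than an empty string keeps "the model never used the template" distinct from "the model used the template but left it blank".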

Scoring

01
Binary per question via the official scorer: numbers compare as floats after stripping $, %, and commas; lists are split on commas and semicolons and compared element-wise; strings compare after whitespace, punctuation, and case normalization.
02
score = pass_rate = fraction of questions whose final answer matches the expected answer under the official scorer. detail.level enables the paper's per-level breakdown.
03
Formatting matters by design: an answer of '5,876' inside a semicolon list fails against '5876' — the system prompt warns models about exactly this.
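
The three comparison modes above can be sketched in a few lines. This is a simplified illustration of the rules as described here, not the vendored question_scorer: the function names are hypothetical, and the choice to keep punctuation when comparing string elements inside a list is an assumption.

```python
import re
import string


def _is_float(text: str) -> bool:
    try:
        float(text)
        return True
    except ValueError:
        return False


def _to_number(text: str) -> float:
    # Strip $, % and commas, then parse. Unparseable input maps to infinity,
    # which never equals a finite expected value.
    for ch in "$%,":
        text = text.replace(ch, "")
    try:
        return float(text)
    except ValueError:
        return float("inf")


def _norm_text(text: str, remove_punct: bool = True) -> str:
    # Drop all whitespace, lowercase, and optionally drop ASCII punctuation.
    text = re.sub(r"\s", "", text).lower()
    if remove_punct:
        text = text.translate(str.maketrans("", "", string.punctuation))
    return text


def score(model_answer: str, expected: str) -> bool:
    """Hypothetical sketch: True if model_answer matches expected."""
    if _is_float(expected):  # number mode
        return _to_number(model_answer) == float(expected)
    if any(sep in expected for sep in ",;"):  # list mode
        got = re.split(r"[,;]", model_answer)
        want = re.split(r"[,;]", expected)
        if len(got) != len(want):
            return False
        # Assumption: string elements keep their punctuation inside lists.
        return all(
            _to_number(g) == float(w) if _is_float(w)
            else _norm_text(g, remove_punct=False) == _norm_text(w, remove_punct=False)
            for g, w in zip(got, want)
        )
    return _norm_text(model_answer) == _norm_text(expected)  # string mode
```

Under this sketch, score("5,876", "5876") passes because number mode strips the comma, while score("5,876; 1024", "5876; 1024") fails: list mode splits on the comma first, so the answer has three elements against the expected two.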

Using it

01
agi-evals run gaia --model echo # offline smoke test (3 paper examples)
02
# Full validation set is gated: accept terms at huggingface.co/datasets/gaia-benchmark/GAIA
03
HF_TOKEN=hf_... agi-evals download gaia
04
agi-evals run gaia --model openai:gpt-4o --push

Troubleshooting

Every non-pass carries exactly one typed failure tag, so the diagnosis is mechanical: look up the tag in report.failure_counts, then the per-case detail.

Tag or symptom	What it means	What to do
NO_ANSWER	The model never produced the 'FINAL ANSWER:' template.	The official system prompt is already in every request; smaller models may need a reminder appended. GPT-4-class models follow it reliably.
WRONG_ANSWER	Answer didn't match under the official normalization — often a formatting miss (units, articles, commas in numbers) rather than a knowledge miss.	Compare detail.parsed_answer to detail.expected. If detail.has_file is true, the model likely never saw the attachment — that's the documented scope boundary, not a bug.
download fails	GAIA is gated on Hugging Face.	Accept the dataset terms while signed in, create a read token, and set HF_TOKEN before agi-evals download gaia.
ADAPTER_ERROR / HARNESS_ERROR	Our side, not the model's: transport/auth failures or a harness bug. Excluded from the score automatically.	Check endpoint credentials and connectivity; if HARNESS_ERROR persists, file an issue with the per-case detail.error string.
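
The lookup described above (tag counts first, then per-case detail) can be scripted over a list of per-case results. This is a rough sketch under stated assumptions: the record layout with "passed", "tag", and "detail" keys is hypothetical and may not match the real report format; only the field names level and has_file and the tag names come from this page.

```python
from collections import Counter, defaultdict

# Tags that this page says are excluded from the score.
EXCLUDED = {"ADAPTER_ERROR", "HARNESS_ERROR"}


def summarize(cases):
    """Hypothetical triage over per-case records (assumed shape, see above).

    Returns (failure_counts, pass_rate_by_level, file_misses).
    """
    # One tag per non-pass, so a plain counter is enough.
    failure_counts = Counter(c["tag"] for c in cases if not c["passed"])

    per_level = defaultdict(lambda: {"passed": 0, "scored": 0})
    file_misses = 0
    for c in cases:
        if c["tag"] in EXCLUDED:
            continue  # harness-side failures do not count for or against
        bucket = per_level[c["detail"]["level"]]
        bucket["scored"] += 1
        bucket["passed"] += int(c["passed"])
        # Failures on attachment questions are the documented scope boundary.
        if not c["passed"] and c["detail"].get("has_file"):
            file_misses += 1

    pass_rate_by_level = {
        level: b["passed"] / b["scored"] for level, b in sorted(per_level.items())
    }
    return failure_counts, pass_rate_by_level, file_misses
```

Separating file_misses from the rest makes it easier to tell how much of a low score comes from attachment questions the model never saw rather than from reasoning or formatting misses.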
