
GAIA

Live

Real-world assistant questions needing tools, web, and multi-step reasoning.

How it works

  • 01

    GAIA asks real-world assistant questions — 'what was the enrollment of this clinical trial on the NIH site', 'which astronaut in that APOD photo's group spent least time in space' — that humans find tedious but doable and models find genuinely hard. Three difficulty levels, from single-lookup (Level 1) to long multi-source chains (Level 3).

  • 02

    The protocol is the paper's, verbatim: models receive the official system prompt and must end with 'FINAL ANSWER: ...' — a number, a few words, or a comma-separated list, with strict formatting rules (no commas in numbers, no articles in strings).

  • 03

    We grade the public validation split (165 questions) with the leaderboard's own question_scorer, vendored 1:1 (Apache-2.0). Test-split answers are withheld upstream — submit to the official leaderboard for those.

  • 04

    Scope note: some questions reference an attached file (spreadsheet, image, audio). The runner names the attachment in the prompt but does not deliver its contents — matching the paper's no-tools text baseline. Expect file questions to fail unless your adapter does its own retrieval; detail.has_file marks them.
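The answer template from step 02 can be checked mechanically. Below is a minimal Python sketch, assuming the text after the last 'FINAL ANSWER:' marker is taken and only its first line is kept. `extract_final_answer` is a hypothetical helper for illustration, not part of agi-evals, and the harness's real parser may behave differently.

```python
from typing import Optional

MARKER = "FINAL ANSWER:"


def extract_final_answer(completion: str) -> Optional[str]:
    """Return the first line after the last 'FINAL ANSWER:' marker.

    Returns None when the marker never appears. Using the last marker and
    the first line after it are assumptions made for this sketch.
    """
    idx = completion.rfind(MARKER)
    if idx == -1:
        return None
    tail = completion[idx + len(MARKER):].strip()
    lines = tail.splitlines()
    return lines[0].strip() if lines else ""


if __name__ == "__main__":
    print(extract_final_answer("I checked the trial page.\nFINAL ANSWER: 5876"))
    print(extract_final_answer("The enrollment seems to be about 5876."))
```

A completion without the marker returns None here, which corresponds to the case where the model never produced the template.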

Scoring

  • 01

    Binary per question via the official scorer: numbers compare as floats after stripping $, %, and commas; lists split on commas and semicolons and compare element-wise; strings compare after whitespace, punctuation, and case normalization.

  • 02

    score = pass_rate = fraction of questions answered exactly. detail.level enables the paper's per-level breakdown.

  • 03

    Formatting matters by design: an answer of '5,876' inside a semicolon list fails against '5876', because the comma is read as a list separator and the element counts no longer match. The system prompt warns models about exactly this.
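The three comparison modes and the pass rate described above can be sketched in Python. This is a simplified illustration written from the description in this section, not the vendored question_scorer. All function names are hypothetical, and edge cases may differ from the official scorer.

```python
import re
import string
from collections import defaultdict
from typing import Dict, List, Optional, Tuple


def is_float(text: str) -> bool:
    try:
        float(text)
        return True
    except ValueError:
        return False


def parse_number(text: str) -> Optional[float]:
    """Strip $, % and commas, then parse as float; None if that fails."""
    cleaned = text
    for ch in "$%,":
        cleaned = cleaned.replace(ch, "")
    try:
        return float(cleaned)
    except ValueError:
        return None


def normalize_text(text: str) -> str:
    """Drop whitespace and ASCII punctuation, then lowercase."""
    no_space = re.sub(r"\s", "", text)
    return no_space.translate(str.maketrans("", "", string.punctuation)).lower()


def compare_element(answer: str, expected: str) -> bool:
    if is_float(expected):
        return parse_number(answer) == float(expected)
    return normalize_text(answer) == normalize_text(expected)


def matches(answer: str, expected: str) -> bool:
    """Binary match: number, list, or string mode, chosen by the expected answer."""
    if is_float(expected):
        return parse_number(answer) == float(expected)
    if "," in expected or ";" in expected:
        want = re.split(r"[,;]", expected)
        got = re.split(r"[,;]", answer)
        # A stray comma inside a number changes the element count and fails here.
        return len(want) == len(got) and all(
            compare_element(g, w) for g, w in zip(got, want)
        )
    return normalize_text(answer) == normalize_text(expected)


def pass_rate(pairs: List[Tuple[str, str]]) -> float:
    """Fraction of (answer, expected) pairs that match; 0.0 for an empty list."""
    return sum(matches(a, e) for a, e in pairs) / len(pairs) if pairs else 0.0


def pass_rate_by_level(graded: List[Tuple[int, bool]]) -> Dict[int, float]:
    """Per-level pass rate from (level, passed) pairs."""
    totals: Dict[int, int] = defaultdict(int)
    passes: Dict[int, int] = defaultdict(int)
    for level, passed in graded:
        totals[level] += 1
        passes[level] += int(passed)
    return {level: passes[level] / totals[level] for level in sorted(totals)}
```

In this sketch, `matches("$5,876", "5876")` succeeds because the whole answer is treated as one number, while `matches("5,876; 12", "5876; 12")` fails because splitting on commas and semicolons yields three elements against two.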

Using it

  • 01

    agi-evals run gaia --model echo # offline smoke test (3 paper examples)

  • 02

    # Full validation set is gated: accept terms at huggingface.co/datasets/gaia-benchmark/GAIA

  • 03

    HF_TOKEN=hf_... agi-evals download gaia

  • 04

    agi-evals run gaia --model openai:gpt-4o --push

Troubleshooting

Every non-pass carries exactly one typed failure tag, so the diagnosis is mechanical: look up the tag in report.failure_counts, then the per-case detail.

| Tag | What it means | What to do |
| --- | --- | --- |
| NO_ANSWER | The model never produced the 'FINAL ANSWER:' template. | The official system prompt is already in every request; smaller models may need a reminder appended. GPT-4-class models follow it reliably. |
| WRONG_ANSWER | Answer didn't match under the official normalization. This is often a formatting miss (units, articles, commas in numbers) rather than a knowledge miss. | Compare detail.parsed_answer to detail.expected. If detail.has_file is true, the model likely never saw the attachment; that's the documented scope boundary, not a bug. |
| download fails | GAIA is gated on Hugging Face. | Accept the dataset terms while signed in, create a read token, and set HF_TOKEN before agi-evals download gaia. |
| ADAPTER_ERROR / HARNESS_ERROR | Our side, not the model's: transport/auth failures or a harness bug. Excluded from the score automatically. | Check endpoint credentials and connectivity; if HARNESS_ERROR persists, file an issue with the per-case detail.error string. |
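The lookup described above (tag counts first, then per-case detail) can be sketched in Python. The record shape is an assumption: each case is taken to be a dict with a "tag" key (None for a pass) and a "detail" dict carrying the fields named in this section. The real report format may differ, and the sample records are invented for illustration only.

```python
from collections import Counter
from typing import Any, Dict, List, Tuple

Case = Dict[str, Any]  # hypothetical per-case record shape

# Invented sample records, for illustration only.
SAMPLE_CASES: List[Case] = [
    {"tag": None,
     "detail": {"level": 1, "has_file": False,
                "parsed_answer": "5876", "expected": "5876"}},
    {"tag": "WRONG_ANSWER",
     "detail": {"level": 2, "has_file": False,
                "parsed_answer": "the Moon", "expected": "Moon"}},
    {"tag": "WRONG_ANSWER",
     "detail": {"level": 2, "has_file": True,
                "parsed_answer": "12", "expected": "17"}},
    {"tag": "NO_ANSWER",
     "detail": {"level": 3, "has_file": False,
                "parsed_answer": None, "expected": "Paris"}},
]


def failure_counts(cases: List[Case]) -> Counter:
    """Count failure tags, skipping passes (tag is None in this sketch)."""
    return Counter(c["tag"] for c in cases if c["tag"] is not None)


def wrong_answers_to_review(cases: List[Case]) -> List[Tuple[str, str]]:
    """(parsed_answer, expected) pairs for WRONG_ANSWER cases without an attachment.

    File-backed cases are skipped because they fall under the documented
    scope boundary rather than pointing at a formatting or knowledge miss.
    """
    return [
        (c["detail"]["parsed_answer"], c["detail"]["expected"])
        for c in cases
        if c["tag"] == "WRONG_ANSWER" and not c["detail"].get("has_file", False)
    ]


if __name__ == "__main__":
    print(failure_counts(SAMPLE_CASES))
    print(wrong_answers_to_review(SAMPLE_CASES))
```

Separating file-backed misses from the rest keeps the review list focused on cases where comparing detail.parsed_answer to detail.expected is informative.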