GPQA Diamond

Live

198 expert-written graduate science questions designed to be Google-proof.

How it works

  • 01

    Each case is one expert-written, Google-proof graduate science question with four options. The runner renders the question with lettered choices A–D and an instruction to finish with 'Answer: X'.

  • 02

    Grading is single-letter exact match against the gold choice. The extraction is deliberately conservative: explicit 'Answer: X' beats a trailing '(X)', which beats a bare standalone letter.

  • 03

    A bundled sample ships in the package so the eval runs offline. For real numbers, supply the full GPQA Diamond JSONL (fields: question, choices, answer_index), for example by passing data_path= to the runner.
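The rendering and extraction steps above can be sketched as follows. This is a minimal illustration, not the runner's actual code: the function names, the exact prompt wording, the 80-character "near the end" window, and the rule that a bare letter only counts when it sits alone on its own line are all assumptions, as is treating answer_index as zero-based.

```python
import re
from typing import List, Optional

LETTERS = "ABCD"

def render_prompt(question: str, choices: List[str]) -> str:
    """Render a question with lettered choices A-D and the final-line instruction."""
    lines = [question, ""]
    lines += [f"{letter}) {text}" for letter, text in zip(LETTERS, choices)]
    lines += ["", "Finish with 'Answer: X', where X is one of A, B, C, D."]
    return "\n".join(lines)

def extract_letter(reply: str) -> Optional[str]:
    """Return the chosen letter, or None when nothing parseable is found."""
    # 1. An explicit 'Answer: X' wins; take the last occurrence.
    #    The lookahead rejects words such as 'Answer: Because ...'.
    explicit = re.findall(r"Answer:\s*\(?([A-D])\)?(?![A-Za-z])", reply)
    if explicit:
        return explicit[-1]
    # 2. Otherwise a parenthesized letter near the end (assumed: last 80 chars).
    tail = reply.strip()[-80:]
    parenthesized = re.findall(r"\(([A-D])\)", tail)
    if parenthesized:
        return parenthesized[-1]
    # 3. Otherwise a letter alone on its own line, scanning from the bottom up.
    #    Letters inside prose are never matched, so a bare 'A' in a sentence is ignored.
    for line in reversed(reply.strip().splitlines()):
        if re.fullmatch(r"\s*[A-D][.)]?\s*", line):
            return line.strip()[0]
    return None

def is_correct(reply: str, answer_index: int) -> bool:
    """Single-letter exact match against the gold choice (zero-based index assumed)."""
    return extract_letter(reply) == LETTERS[answer_index]
```

Because the explicit form is checked first, a reply that discusses "(A)" in its reasoning and then ends with "Answer: C" is read as C.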

Scoring

  • 01

    score = fraction of questions answered correctly (0–1). Random guessing baseline is 0.25.

  • 02

    pass_rate equals score for this eval since each case is binary.

  • 03

    Infra failures (ADAPTER_ERROR / HARNESS_ERROR) are excluded from the mean and reported separately in failure_counts.
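The scoring rules above amount to a mean over graded cases only. A minimal sketch, assuming each case is summarized as either None (a pass) or a failure tag string; the summarize function, its return keys other than failure_counts, and returning None for the score when nothing was graded are illustrative assumptions, not the harness's API.

```python
from collections import Counter

# Tags treated as infrastructure failures, per the scoring rules above.
INFRA_TAGS = {"ADAPTER_ERROR", "HARNESS_ERROR"}

def summarize(tags):
    """tags: one entry per case; None means pass, otherwise a failure tag string."""
    # Every non-pass is counted by tag, including infra failures.
    failure_counts = Counter(tag for tag in tags if tag is not None)
    # Infra failures are dropped before taking the mean.
    graded = [tag for tag in tags if tag not in INFRA_TAGS]
    passes = sum(tag is None for tag in graded)
    # Assumption: the score is undefined (None) when no case could be graded.
    score = passes / len(graded) if graded else None
    return {
        "score": score,
        "pass_rate": score,  # identical here because each case is binary
        "n_graded": len(graded),
        "failure_counts": dict(failure_counts),
    }

summary = summarize([None, None, "WRONG_ANSWER", "NO_ANSWER", "ADAPTER_ERROR"])
# Two passes out of four graded cases give a score of 0.5;
# the ADAPTER_ERROR case appears only in failure_counts.
```

Note that NO_ANSWER and WRONG_ANSWER both count against the score; only the two infra tags are excluded.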

Using it

  • 01

    CLI: agi-evals run gpqa-diamond --model anthropic:claude-opus-4-8 --limit 50

  • 02

    SDK: run_eval(load_runner('gpqa-diamond', data_path='gpqa_diamond.jsonl'), patient)

  • 03

    The full set has 198 questions, so a full run is cheap; use --concurrency 8 freely on hosted APIs.
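Before pointing data_path at a file, it can help to check that every line has the question/choices/answer_index fields named earlier. The sketch below is an assumption-laden illustration: it supposes choices is a list of four strings and answer_index is a zero-based integer, and the helper names are hypothetical rather than part of the package.

```python
import json

REQUIRED_FIELDS = ("question", "choices", "answer_index")

def validate_record(record: dict) -> None:
    """Raise ValueError if a record does not match the assumed layout."""
    missing = [field for field in REQUIRED_FIELDS if field not in record]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    if len(record["choices"]) != 4:
        raise ValueError("expected exactly four choices")
    if not 0 <= record["answer_index"] < 4:
        raise ValueError("answer_index out of range (zero-based index assumed)")

def parse_jsonl(lines):
    """Parse an iterable of JSONL lines (an open file works) into validated records."""
    records = []
    for line in lines:
        if line.strip():  # skip blank lines
            record = json.loads(line)
            validate_record(record)
            records.append(record)
    return records

# Placeholder content only; this is not a real GPQA question.
sample_line = json.dumps({
    "question": "Placeholder question text?",
    "choices": ["first", "second", "third", "fourth"],
    "answer_index": 2,
})
records = parse_jsonl([sample_line, ""])
```

With a real file, `parse_jsonl(open("gpqa_diamond.jsonl", encoding="utf-8"))` would run the same checks line by line.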

Troubleshooting

Every non-pass carries exactly one typed failure tag, so the diagnosis is mechanical: look up the tag in report.failure_counts, then the per-case detail.

| Tag | What it means | What to do |
| --- | --- | --- |
| NO_ANSWER | The reply contained no parseable choice letter. The grader looks for an 'Answer: X' line, then a parenthesized (X) near the end, then a standalone capital letter (bare 'A'/'I' are ignored inside prose to avoid false reads). | Instruct the model to end with 'Answer: X'. Models with heavy chain-of-thought sometimes bury the letter; raising max_tokens is not the fix, the final-line instruction is. |
| WRONG_ANSWER | A letter was parsed but did not match the gold choice. | Inspect detail.parsed_answer vs detail.expected on the result to confirm the parse was faithful before blaming the model. |
| ADAPTER_ERROR | The endpoint raised (auth, rate limit, network); not a model failure. | These are excluded from the score. Check API keys/quota; rerun with --concurrency 2 if you are being rate-limited. |
| ADAPTER_ERROR / HARNESS_ERROR | Our side, not the model's: transport/auth failures or a harness bug. Excluded from the score automatically. | Check endpoint credentials and connectivity; if HARNESS_ERROR persists, file an issue with the per-case detail.error string. |
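The lookup described in this section can be sketched as a small triage pass over per-case results. The result shape used here (plain dicts with case_id, tag, and a detail mapping) is a hypothetical stand-in for whatever the report object actually exposes; only the tag names and the parsed_answer, expected, and error fields come from the text above.

```python
from collections import defaultdict

INFRA_TAGS = ("ADAPTER_ERROR", "HARNESS_ERROR")

def triage(results):
    """Group non-passing cases by tag, with the detail worth reading first."""
    by_tag = defaultdict(list)
    for result in results:
        tag = result.get("tag")
        if tag is None:
            continue  # passing case, nothing to diagnose
        detail = result.get("detail", {})
        if tag == "WRONG_ANSWER":
            # Confirm the parse was faithful before blaming the model.
            note = f"parsed={detail.get('parsed_answer')} expected={detail.get('expected')}"
        elif tag in INFRA_TAGS:
            # Infra failures: the error string is what goes into an issue report.
            note = f"error={detail.get('error')}"
        else:
            # NO_ANSWER: check that the prompt asks for a final 'Answer: X' line.
            note = "no parseable letter"
        by_tag[tag].append((result.get("case_id"), note))
    return dict(by_tag)

# Hypothetical per-case results for illustration.
example = triage([
    {"case_id": "q1", "tag": None},
    {"case_id": "q2", "tag": "WRONG_ANSWER",
     "detail": {"parsed_answer": "B", "expected": "D"}},
    {"case_id": "q3", "tag": "ADAPTER_ERROR", "detail": {"error": "rate limited"}},
    {"case_id": "q4", "tag": "NO_ANSWER"},
])
```

Grouping by tag first mirrors the order suggested above: the counts show which row of the table applies, and the notes carry the per-case detail.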