AGI·EVALS SHEET NO. AE-001 / REV 0.1


MMLU-Pro


Harder MMLU with ten answer choices and reasoning-heavy questions.

How it works

01
MMLU-Pro extends MMLU from four options to up to ten (A–J) and uses more reasoning-heavy questions, cutting the guessing baseline from 0.25 to ~0.10.
02
The runner renders all options with letters and grades the extracted letter against the gold index — identical machinery to GPQA, widened to ten choices.
03
Dataset rows use fields question/choices/answer_index; the bundled sample mirrors the upstream schema exactly.
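
The rendering step above can be sketched roughly as follows. This is a minimal illustration, not the runner's actual code: the function names render_prompt and gold_letter, the exact prompt wording, and the sample row are hypothetical; only the field names question/choices/answer_index come from this sheet.

```python
import string

LETTERS = string.ascii_uppercase[:10]  # "ABCDEFGHIJ"


def render_prompt(row: dict) -> str:
    """Render a question with lettered options and a final-line instruction.

    Hypothetical helper: the real runner's prompt text is not shown in the docs.
    """
    choices = row["choices"]
    if not 1 < len(choices) <= len(LETTERS):
        raise ValueError(f"expected 2-10 choices, got {len(choices)}")
    lines = [row["question"], ""]
    lines += [f"{letter}. {text}" for letter, text in zip(LETTERS, choices)]
    lines += ["", "End your reply with a line of the form 'Answer: X'."]
    return "\n".join(lines)


def gold_letter(row: dict) -> str:
    """Map the zero-based gold index to its option letter."""
    return LETTERS[row["answer_index"]]


# Made-up row using the field names listed above.
row = {
    "question": "Which number is prime?",
    "choices": ["4", "6", "7", "9"],
    "answer_index": 2,
}
print(render_prompt(row))
print(gold_letter(row))  # C
```

Rows with fewer than ten choices simply use a prefix of A–J, which is why the sheet says "up to ten".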

Scoring

01
score = exact-match accuracy over answered questions (0–1); guessing baseline ≈ 0.10 with ten options.
02
Because choices go to J, watch NO_ANSWER counts: models that answer 'the third option' instead of a letter are unparseable by design — fix the prompt, not the grader.
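
As a rough sketch of how such a score and the NO_ANSWER count could be derived from per-case tags: the summarize function and the list-of-tags input are hypothetical, and the code assumes that NO_ANSWER counts against the score while ADAPTER_ERROR and HARNESS_ERROR are dropped from the denominator. Only the exclusion of the error tags is stated on this sheet.

```python
from collections import Counter

# Tags the sheet says are excluded from the score.
EXCLUDED = {"ADAPTER_ERROR", "HARNESS_ERROR"}


def summarize(tags):
    """tags holds one entry per case: None for a pass, else a failure tag.

    Assumption: NO_ANSWER and WRONG_ANSWER both count as misses.
    """
    failure_counts = Counter(t for t in tags if t is not None)
    scored = [t for t in tags if t not in EXCLUDED]
    passed = sum(1 for t in scored if t is None)
    score = passed / len(scored) if scored else 0.0
    return score, failure_counts


tags = [None, None, "WRONG_ANSWER", "NO_ANSWER", "ADAPTER_ERROR"]
score, failure_counts = summarize(tags)
print(score)                        # 0.5 (2 passes out of 4 scored cases)
print(failure_counts["NO_ANSWER"])  # 1
```

A NO_ANSWER count that is large relative to WRONG_ANSWER points at the prompt format rather than at the model's knowledge.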

Using it

01
CLI: agi-evals run mmlu-pro --model openai:gpt-4o-mini --limit 100
02
Full MMLU-Pro is ~12k questions: use --limit for iteration and a full run only for numbers you intend to push.
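
To see why a --limit 100 run is fine for iteration but not for reported numbers, here is a small sketch of the sampling error of an accuracy estimate. It uses the standard binomial approximation and treats questions as independent draws, which is a simplification; the helper name accuracy_stderr is hypothetical.

```python
import math


def accuracy_stderr(p: float, n: int) -> float:
    """Standard error of an accuracy p measured on n independent questions."""
    return math.sqrt(p * (1.0 - p) / n)


# Worst case is p = 0.5; 1.96 standard errors gives an approximate 95% interval.
for n in (100, 12_000):
    half_width = 1.96 * accuracy_stderr(0.5, n)
    print(f"n={n}: +/-{half_width:.3f}")
# n=100: +/-0.098
# n=12000: +/-0.009
```

At 100 questions, differences of several points between two models are within noise; at roughly 12k they are not.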

Troubleshooting

Every non-pass carries exactly one typed failure tag, so the diagnosis is mechanical: look up the tag in report.failure_counts, then the per-case detail.

Tag	What it means	What to do
NO_ANSWER	The reply contained no parseable choice letter. The grader looks for an 'Answer: X' line, then a parenthesized (X) near the end, then a standalone capital letter (bare 'A'/'I' are ignored inside prose to avoid false reads).	Instruct the model to end with 'Answer: X'. Models with heavy chain-of-thought sometimes bury the letter — raising max_tokens is not the fix; the final-line instruction is.
WRONG_ANSWER	A letter was parsed but did not match the gold choice.	Inspect detail.parsed_answer vs detail.expected on the result to confirm the parse was faithful before blaming the model.
ADAPTER_ERROR	The endpoint raised (auth, rate limit, network) — not a model failure.	These are excluded from the score. Check API keys/quota; rerun with --concurrency 2 if you are being rate-limited.
HARNESS_ERROR	Our side, not the model's: a harness bug. Excluded from the score automatically.	If HARNESS_ERROR persists, file an issue with the per-case detail.error string.
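
The three-stage extraction described in the NO_ANSWER row can be approximated with the sketch below. It is an illustration under assumptions, not the grader's source: the regular expressions, the 200-character window standing in for "near the end", and the rule that a bare 'A' or 'I' only counts when it is the entire reply are all guesses at behavior the sheet describes only in prose.

```python
import re
from typing import Optional

ANSWER_LINE = re.compile(r"^\s*Answer:\s*\(?([A-J])\)?\s*\.?\s*$", re.MULTILINE)
PAREN = re.compile(r"\(([A-J])\)")
STANDALONE = re.compile(r"\b([A-J])\b")


def extract_letter(reply: str, tail_chars: int = 200) -> Optional[str]:
    """Return the parsed choice letter, or None (which would map to NO_ANSWER)."""
    # 1. An explicit 'Answer: X' line; the last one wins.
    matches = ANSWER_LINE.findall(reply)
    if matches:
        return matches[-1]
    # 2. A parenthesized (X) within the last tail_chars characters (assumed window).
    matches = PAREN.findall(reply[-tail_chars:])
    if matches:
        return matches[-1]
    # 3. A standalone capital letter, scanning from the end. Bare 'A' and 'I'
    #    are skipped unless the whole reply is just that letter (assumed rule).
    stripped = reply.strip().rstrip(".")
    for letter in reversed(STANDALONE.findall(reply)):
        if letter in "AI" and stripped != letter:
            continue
        return letter
    return None


print(extract_letter("Step 1...\nAnswer: C"))            # C
print(extract_letter("I think the best choice is (D)."))  # D
print(extract_letter("A careful reading suggests I should pick the third option"))  # None
```

The last call shows the case the Scoring section warns about: a reply that names "the third option" yields no letter, so the remedy is the final-line instruction in the prompt.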
