AGI·EVALS SHEET NO. AE-001 / REV 0.1


AIME 2024


Olympiad-style competition problems with integer answers in 0–999, graded by exact match.

How it works

01
AIME answers are integers 0–999, which makes grading unambiguous: exact integer match.
02
Extraction prefers a boxed integer, then an 'answer is N' phrase, then the last integer in the reply — in that order of trust.
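The extraction order above can be sketched in a few lines of Python. This is a minimal illustration, not the runner's actual code: the function name extract_answer and all three regular expressions are assumptions made for this example, and the real patterns may be stricter or looser.

```python
import re
from typing import Optional

# Hypothetical patterns (assumptions for this sketch, not the runner's regexes).
BOXED = re.compile(r"\\boxed\{\s*(\d{1,3})\s*\}")                       # \boxed{N}
PHRASE = re.compile(r"answer\s+is\s*:?\s*\$?(\d{1,3})\b", re.IGNORECASE)  # "answer is N"
ANY_INT = re.compile(r"(?<!\d)\d{1,3}(?!\d)")                            # standalone 1-3 digit integer


def extract_answer(reply: str) -> Optional[int]:
    """Return the most trusted integer found in the reply, or None."""
    # Try each pattern in decreasing order of trust; within one pattern,
    # take the last occurrence, since final answers usually come last.
    for pattern in (BOXED, PHRASE, ANY_INT):
        matches = pattern.findall(reply)
        if matches:
            return int(matches[-1])
    return None
```

For example, extract_answer("The answer is 42, since 7*6") returns 42 because the phrase match outranks the trailing 6, while a reply containing no integer at all returns None, which corresponds to the NO_ANSWER case.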

Scoring

01
score = (number of problems whose extracted integer exactly matches the reference) / (number of scored problems). Answers are integers in 0–999 and there is no partial credit, matching the competition.
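As a rough illustration of the scoring rule, the sketch below computes an exact-match rate from parsed and reference answers. The function name exact_match_rate and its list-based inputs are assumptions for this example; the harness's internal representation is not documented here.

```python
from typing import Optional, Sequence


def exact_match_rate(parsed: Sequence[Optional[int]],
                     expected: Sequence[int]) -> float:
    """Fraction of problems whose parsed answer equals the reference integer."""
    if len(parsed) != len(expected):
        raise ValueError("parsed and expected must have the same length")
    if not expected:
        return 0.0  # assumption: an empty run scores 0 rather than raising
    # A missing answer (None) never matches; there is no partial credit.
    hits = sum(p is not None and p == e for p, e in zip(parsed, expected))
    return hits / len(expected)
```

With four problems and two exact matches, exact_match_rate([73, None, 5, 204], [73, 10, 6, 204]) gives 0.5.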

Using it

01
CLI: agi-evals run aime-2024 --model openai:o4 --limit 15
02
These problems reward long reasoning: low temperature and a large token budget measurably help most models.

Troubleshooting

Every non-pass carries exactly one typed failure tag, so the diagnosis is mechanical: look up the tag in report.failure_counts, then the per-case detail.

Tag	What it means	What to do
NO_ANSWER	No integer could be extracted from the reply at all.	Almost always truncation — raise the completion budget. The runner's last-integer fallback means any finished solution parses.
WRONG_ANSWER	The extracted integer did not match. Note the last-integer fallback can grab a stray number from an unfinished solution.	Inspect detail.parsed_answer; instruct the model to end with \boxed{N} so extraction never relies on the fallback.
ADAPTER_ERROR / HARNESS_ERROR	Our side, not the model's: transport/auth failures or a harness bug. Excluded from the score automatically.	Check endpoint credentials and connectivity; if HARNESS_ERROR persists, file an issue with the per-case detail.error string.
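
The tagging rules in the table can be approximated as follows. This is a sketch under stated assumptions: the CaseResult dataclass, its error_tag field, and the classify and summarize helpers are hypothetical names invented for illustration. Only the tag names and the exclusion of ADAPTER_ERROR / HARNESS_ERROR from the score come from the table above.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

# Tags that are "our side" and therefore excluded from the score.
EXCLUDED = {"ADAPTER_ERROR", "HARNESS_ERROR"}


@dataclass
class CaseResult:  # hypothetical shape, not the harness's real type
    expected: int
    parsed_answer: Optional[int] = None
    error_tag: Optional[str] = None  # set when transport/harness failed


def classify(case: CaseResult) -> Optional[str]:
    """Return exactly one failure tag, or None for a pass."""
    if case.error_tag in EXCLUDED:
        return case.error_tag
    if case.parsed_answer is None:
        return "NO_ANSWER"
    if case.parsed_answer != case.expected:
        return "WRONG_ANSWER"
    return None


def summarize(cases: List[CaseResult]) -> Tuple[float, Dict[str, int]]:
    """Compute the score over non-excluded cases plus per-tag counts."""
    tags = [classify(c) for c in cases]
    failure_counts = Counter(t for t in tags if t is not None)
    scored = [t for t in tags if t not in EXCLUDED]
    passes = sum(t is None for t in scored)
    score = passes / len(scored) if scored else 0.0
    return score, dict(failure_counts)
```

In this sketch, a run with one pass, one NO_ANSWER, one WRONG_ANSWER, and one ADAPTER_ERROR scores 1/3, because the adapter failure is counted in the tag totals but dropped from the denominator.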
