AGI·EVALS

AIME 2024

Live

Olympiad-style problems with integer answers in 0–999, graded by exact match.

How it works

  • 01

    AIME answers are integers 0–999, which makes grading unambiguous: exact integer match.

  • 02

    Extraction prefers a boxed integer, then an 'answer is N' phrase, then the last integer in the reply — in that order of trust.
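The extraction order described above can be sketched as follows. This is a minimal illustration, not the runner's actual code: the function name `extract_answer` and the three regular expressions are assumptions, and the real extractor may accept more phrasings or reject out-of-range values.

```python
import re
from typing import Optional

# Hypothetical patterns, listed from most to least trusted.
BOXED = re.compile(r"\\boxed\{\s*(\d+)\s*\}")
PHRASE = re.compile(r"answer\s+is\s*:?\s*(\d+)", re.IGNORECASE)
INTEGER = re.compile(r"\d+")


def extract_answer(reply: str) -> Optional[int]:
    """Return the extracted integer, or None if the reply contains no integer."""
    # 1. A boxed integer is the most trusted signal; take the last one.
    boxed = BOXED.findall(reply)
    if boxed:
        return int(boxed[-1])  # int("073") == 73, so leading zeros are fine

    # 2. Next, an explicit "answer is N" phrase.
    phrase = PHRASE.findall(reply)
    if phrase:
        return int(phrase[-1])

    # 3. Least trusted: the last integer anywhere in the reply.
    integers = INTEGER.findall(reply)
    if integers:
        return int(integers[-1])

    # Nothing to extract: this is the case the NO_ANSWER tag describes.
    return None
```

Note how the third step can return a stray intermediate number when a solution stops before stating its result, which is why a final `\boxed{N}` is the safer convention.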

Scoring

  • 01

    score = exact-match rate over 0–999 integer answers. There is no partial credit, matching the competition.
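As a rough sketch of the scoring rule, the function below computes an exact-match rate from parsed and reference answers. The name `exact_match_rate` and its signature are illustrative assumptions, and it assumes cases tagged ADAPTER_ERROR or HARNESS_ERROR (see Troubleshooting) have already been filtered out before it is called.

```python
from typing import Optional, Sequence


def exact_match_rate(parsed: Sequence[Optional[int]], expected: Sequence[int]) -> float:
    """Fraction of cases whose parsed answer equals the reference integer.

    A parsed value of None (nothing extracted) counts as a miss.
    """
    if len(parsed) != len(expected):
        raise ValueError("parsed and expected must have the same length")
    if not expected:
        return 0.0  # assumption: an empty run scores zero rather than raising

    # No partial credit: a case is either an exact integer match or a miss.
    correct = sum(1 for p, e in zip(parsed, expected) if p is not None and p == e)
    return correct / len(expected)
```

For example, two exact matches out of four scored problems give 0.5, regardless of how close the two misses were.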

Using it

  • 01

    CLI: agi-evals run aime-2024 --model openai:o4 --limit 15

  • 02

    These problems reward long reasoning: low temperature and a large token budget measurably help most models.

Troubleshooting

Every non-pass carries exactly one typed failure tag, so the diagnosis is mechanical: look up the tag in report.failure_counts, then the per-case detail.

| Tag | What it means | What to do |
| --- | --- | --- |
| NO_ANSWER | No integer could be extracted from the reply at all. | Almost always truncation: raise the completion budget. The runner's last-integer fallback means any finished solution parses. |
| WRONG_ANSWER | The extracted integer did not match. Note the last-integer fallback can grab a stray number from an unfinished solution. | Inspect detail.parsed_answer; instruct the model to end with \boxed{N} so extraction never relies on the fallback. |
| ADAPTER_ERROR / HARNESS_ERROR | Our side, not the model's: transport/auth failures or a harness bug. Excluded from the score automatically. | Check endpoint credentials and connectivity; if HARNESS_ERROR persists, file an issue with the per-case detail.error string. |
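The lookup described in this section might be scripted along these lines. The report schema is not shown on this page, so the sketch assumes each non-passing case is a plain dict with a `tag` key; the helper names `failure_counts` and `triage` are hypothetical and only mirror the field names mentioned above.

```python
from collections import Counter
from typing import Any, Dict, List

# Advice per tag, paraphrased from the table above.
ADVICE = {
    "NO_ANSWER": "raise the completion budget",
    "WRONG_ANSWER": "inspect detail.parsed_answer; ask for a final \\boxed{N}",
    "ADAPTER_ERROR": "check endpoint credentials and connectivity",
    "HARNESS_ERROR": "file an issue with the detail.error string",
}


def failure_counts(cases: List[Dict[str, Any]]) -> Counter:
    """Count failure tags; cases without a tag (passes) are skipped."""
    return Counter(c["tag"] for c in cases if c.get("tag"))


def triage(cases: List[Dict[str, Any]]) -> List[str]:
    """Return one advice line per tag, most frequent tag first."""
    counts = failure_counts(cases)
    return [
        f"{tag} x{n}: {ADVICE.get(tag, 'unknown tag')}"
        for tag, n in counts.most_common()
    ]
```

Sorting by frequency puts the dominant failure mode first, which is usually the one worth fixing before re-running.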