AIME 2024
Olympiad problems with integer answers in 0–999; exact-match graded.
How it works
- 01
AIME answers are integers 0–999, which makes grading unambiguous: exact integer match.
- 02
Extraction prefers a boxed integer, then an 'answer is N' phrase, then the last integer in the reply — in that order of trust.
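The trust order above can be sketched roughly as follows. This is an illustration only: the function name `extract_answer` and the three regular expressions are assumptions, not the runner's actual patterns, which this page does not show.

```python
import re
from typing import Optional

# Hypothetical patterns, ordered from most to least trusted.
BOXED = re.compile(r"\\boxed\{\s*(\d+)\s*\}")
PHRASE = re.compile(r"answer\s+is\s*:?\s*\$?(\d+)", re.IGNORECASE)
ANY_INT = re.compile(r"\d+")


def extract_answer(reply: str) -> Optional[int]:
    """Return the extracted integer, or None if the reply contains no integer."""
    for pattern in (BOXED, PHRASE, ANY_INT):
        matches = pattern.findall(reply)
        if matches:
            # Take the last match of the most trusted pattern that fires;
            # int() accepts leading zeros such as "073".
            return int(matches[-1])
    return None
```

Under these assumptions, a reply ending in `\boxed{073}` yields 73 even if other numbers follow it, while a reply with no boxed value and no "answer is" phrase falls through to its final integer.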
Scoring
- 01
score = exact-match rate over 0–999 integer answers. There is no partial credit, matching the competition.
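As a minimal sketch of the scoring rule, assuming each case is reduced to a (parsed answer, reference answer) pair, the exact-match rate might be computed like this. The name `exact_match_rate` and the pair layout are illustrative, not the harness's API.

```python
from typing import Iterable, Optional, Tuple


def exact_match_rate(pairs: Iterable[Tuple[Optional[int], int]]) -> float:
    """Fraction of cases whose parsed integer equals the reference integer."""
    items = list(pairs)
    if not items:
        return 0.0
    # A missing answer (None) never equals an integer, so it counts as wrong.
    correct = sum(1 for parsed, expected in items if parsed == expected)
    return correct / len(items)
```

An answer of 204 against a reference of 205 scores the same as no answer at all, which is what "no partial credit" means here.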
Using it
- 01
CLI: agi-evals run aime-2024 --model openai:o4 --limit 15
- 02
These problems reward long reasoning: low temperature and a large token budget measurably help most models.
Troubleshooting
Every non-pass carries exactly one typed failure tag, so the diagnosis is mechanical: look up the tag in report.failure_counts, then the per-case detail.
| Tag | What it means | What to do |
|---|---|---|
| NO_ANSWER | No integer could be extracted from the reply at all. | Almost always truncation — raise the completion budget. The runner's last-integer fallback means any finished solution parses. |
| WRONG_ANSWER | The extracted integer did not match the reference answer. Note that the last-integer fallback can grab a stray number from an unfinished solution. | Inspect detail.parsed_answer; instruct the model to end with \boxed{N} so extraction never relies on the fallback. |
| ADAPTER_ERROR / HARNESS_ERROR | Our side, not the model's: transport/auth failures or a harness bug. Excluded from the score automatically. | Check endpoint credentials and connectivity; if HARNESS_ERROR persists, file an issue with the per-case detail.error string. |
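The triage described above can be illustrated with plain dictionaries. This sketch assumes each case record has a `passed` flag and a `tag` field, and that "excluded from the score" means the case is dropped from the set of scored cases. Both the record layout and the `triage` helper are hypothetical and do not reflect the real report object.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple

# Tags the table attributes to the harness side rather than the model.
INFRA_TAGS = {"ADAPTER_ERROR", "HARNESS_ERROR"}


def triage(cases: List[Dict[str, Optional[object]]]) -> Tuple[Counter, int]:
    """Return (failure counts per tag, number of cases that count toward the score)."""
    failure_counts = Counter(c["tag"] for c in cases if not c["passed"])
    # Assumption: infra-side failures are removed before scoring.
    scored = [c for c in cases if c["tag"] not in INFRA_TAGS]
    return failure_counts, len(scored)


cases = [
    {"passed": True, "tag": None},
    {"passed": False, "tag": "NO_ANSWER"},
    {"passed": False, "tag": "WRONG_ANSWER"},
    {"passed": False, "tag": "ADAPTER_ERROR"},
]
counts, scored = triage(cases)
```

A high NO_ANSWER count points at the completion budget, a high WRONG_ANSWER count at the model or the extraction fallback, and any infra tag at the endpoint or the harness.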