AGI·EVALSSign in
← Docs/ Reasoning

MATH

Live

Competition mathematics graded on the final boxed answer.

How it works

  • 01

    Each case is a competition problem graded on the final answer enclosed in \boxed{}. The runner instructs the model to show reasoning and finish with the boxed answer.

  • 02

    Extraction scans for the LAST \boxed{...} with balanced-brace parsing (answers often contain nested braces, so a naive regex would truncate).

  • 03

    Comparison normalizes LaTeX decoration (\left/\right, \frac{a}{b} → a/b, $, spacing), then tries numeric equality with tolerance, then optional sympy symbolic equality if installed.

Scoring

  • 01

    score = fraction of problems whose normalized boxed answer matches the gold answer.

  • 02

    Equivalent-but-differently-written answers (0.5 vs \frac{1}{2}) are handled by normalization + numeric comparison; install sympy for symbolic edge cases.

Using it

  • 01

    CLI: agi-evals run math --model anthropic:claude-opus-4-8 --limit 50

  • 02

    Set a generous max_tokens budget upstream if your adapter defaults are low — MATH solutions are long; truncated reasoning usually means a missing \boxed and a NO_ANSWER tag.

Troubleshooting

Every non-pass carries exactly one typed failure tag, so the diagnosis is mechanical: look up the tag in report.failure_counts, then the per-case detail.

TagWhat it meansWhat to do
NO_ANSWERNo \boxed{...} found in the reply — usually truncation or the model ignoring the format.Confirm the response wasn't cut off (raise max_tokens); re-state 'final answer in \boxed{}' in your system prompt if the model drops it.
WRONG_ANSWERA boxed answer was found but did not match after normalization.Check detail.parsed_answer vs detail.expected. If they are mathematically equal but string-different, install sympy (pip install sympy) to enable symbolic comparison, and file an issue with the pair.
ADAPTER_ERROR / HARNESS_ERROROur side, not the model's: transport/auth failures or a harness bug. Excluded from the score automatically.Check endpoint credentials and connectivity; if HARNESS_ERROR persists, file an issue with the per-case detail.error string.