MATH
Competition mathematics graded on the final boxed answer.
How it works
- 01
Each case is a competition problem graded on the final answer enclosed in \boxed{}. The runner instructs the model to show reasoning and finish with the boxed answer.
- 02
Extraction scans for the LAST \boxed{...} with balanced-brace parsing (answers often contain nested braces, so a naive regex would truncate).
- 03
Comparison normalizes LaTeX decoration (\left/\right, \frac{a}{b} → a/b, $, spacing), then tries numeric equality within a tolerance, then symbolic equality via sympy if it is installed.
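The extraction and comparison steps above can be sketched roughly as follows. This is a minimal illustration, not the harness's actual code: the function names (`extract_last_boxed`, `normalize`, `to_number`, `answers_match`), the tolerance value, and the exact normalization rules are all assumptions, and the optional sympy step is left out.

```python
import math
import re
from typing import Optional

# Hypothetical helper names; the real harness may structure this differently.

def extract_last_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...}, or None if absent/unbalanced."""
    marker = r"\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1
    chars = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(ch)
    return None  # braces never closed, e.g. a truncated reply

# Matches only innermost fractions (no braces inside either argument).
_FRAC = re.compile(r"\\[dt]?frac\{([^{}]+)\}\{([^{}]+)\}")

def normalize(answer: str) -> str:
    """Strip LaTeX decoration: \\left/\\right, $, whitespace, \\frac{a}{b} -> (a)/(b)."""
    s = re.sub(r"\\(?:left|right)(?![A-Za-z])", "", answer)
    s = s.replace("$", "")
    s = re.sub(r"\s+", "", s)
    prev = None
    while prev != s:  # repeat so nested fractions are rewritten inside-out
        prev = s
        s = _FRAC.sub(r"(\1)/(\2)", s)
    return s

def to_number(s: str) -> Optional[float]:
    """Parse a plain number or a simple a/b quotient; None for anything else."""
    s = s.replace("(", "").replace(")", "")
    num, sep, den = s.partition("/")
    try:
        value = float(num)
        if sep:
            value /= float(den)
        return value
    except (ValueError, ZeroDivisionError):
        return None

def answers_match(pred: str, gold: str, tol: float = 1e-9) -> bool:
    """String match after normalization, then numeric match (tol is an assumed value)."""
    p, g = normalize(pred), normalize(gold)
    if p == g:
        return True
    pn, gn = to_number(p), to_number(g)
    if pn is not None and gn is not None:
        return math.isclose(pn, gn, rel_tol=tol, abs_tol=tol)
    return False  # a symbolic (sympy) comparison would be tried here
```

Under these assumptions, `answers_match("0.5", r"\frac{1}{2}")` passes on the numeric step even though the normalized strings differ, while a reply whose final `\boxed{` is never closed yields no answer at all.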
Scoring
- 01
score = fraction of problems whose normalized boxed answer matches the gold answer. Cases tagged ADAPTER_ERROR or HARNESS_ERROR are excluded from the score.
- 02
Equivalent-but-differently-written answers (0.5 vs \frac{1}{2}) are handled by normalization + numeric comparison; install sympy for symbolic edge cases.
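The scoring rule reduces to a few lines. In this sketch the per-case representation (`None` for a pass, otherwise the failure tag as a string) and the function name `score` are hypothetical, and dropping ADAPTER_ERROR / HARNESS_ERROR cases from the denominator is one reading of "excluded from the score", not confirmed behavior.

```python
from typing import Iterable, Optional

# Tags that reflect harness-side problems rather than model errors.
EXCLUDED_TAGS = {"ADAPTER_ERROR", "HARNESS_ERROR"}

def score(tags: Iterable[Optional[str]]) -> Optional[float]:
    """Fraction of scored cases that passed.

    `tags` holds one entry per case: None for a pass, else the failure tag.
    Assumption: excluded cases are removed from the denominator entirely.
    """
    scored = [t for t in tags if t not in EXCLUDED_TAGS]
    if not scored:
        return None  # nothing left to score
    return sum(t is None for t in scored) / len(scored)
```

For example, two passes, one WRONG_ANSWER, one NO_ANSWER, and one HARNESS_ERROR would score 2/4 under this reading, since the harness error does not count for or against the model.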
Using it
- 01
CLI: agi-evals run math --model anthropic:claude-opus-4-8 --limit 50
- 02
Set a generous max_tokens budget upstream if your adapter defaults are low. MATH solutions are long, and truncated reasoning usually means a missing \boxed{} and a NO_ANSWER tag.
Troubleshooting
Every non-pass carries exactly one typed failure tag, so the diagnosis is mechanical: look up the tag in report.failure_counts, then inspect the per-case detail.
| Tag | What it means | What to do |
|---|---|---|
| NO_ANSWER | No \boxed{...} found in the reply — usually truncation or the model ignoring the format. | Confirm the response wasn't cut off (raise max_tokens); re-state 'final answer in \boxed{}' in your system prompt if the model drops it. |
| WRONG_ANSWER | A boxed answer was found but did not match after normalization. | Check detail.parsed_answer vs detail.expected. If they are mathematically equal but string-different, install sympy (pip install sympy) to enable symbolic comparison, and file an issue with the pair. |
| ADAPTER_ERROR / HARNESS_ERROR | Our side, not the model's: transport/auth failures or a harness bug. Excluded from the score automatically. | Check endpoint credentials and connectivity; if HARNESS_ERROR persists, file an issue with the per-case detail.error string. |
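The triage loop described above can be approximated with a small grouping helper. The shape of the per-case records here (plain dicts with `failure_tag` and `detail` keys) and the function name `triage` are assumptions for illustration; only the tag names and the `parsed_answer` / `expected` / `error` detail fields come from the table.

```python
from collections import Counter, defaultdict
from typing import Any, Dict, Iterable, List, Tuple

def triage(
    cases: Iterable[Dict[str, Any]],
) -> Tuple[Counter, Dict[str, List[Dict[str, Any]]]]:
    """Count failure tags and group per-case detail by tag.

    Assumes each case is a dict with an optional "failure_tag" (None on pass)
    and an optional "detail" dict; the real report object may differ.
    """
    counts: Counter = Counter()
    by_tag: Dict[str, List[Dict[str, Any]]] = defaultdict(list)
    for case in cases:
        tag = case.get("failure_tag")
        if tag is None:
            continue  # passing case, nothing to diagnose
        counts[tag] += 1
        by_tag[tag].append(case.get("detail", {}))
    return counts, by_tag
```

Sorting the counts with `counts.most_common()` shows which tag dominates, and `by_tag["WRONG_ANSWER"]` then gives the parsed/expected pairs to check by hand.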