Every eval here is somebody's research
A benchmark is years of work: task design, data collection, validation, grading code that holds up under adversarial submissions. When a score travels, it usually travels without the names attached. We built the crediting in, so using an eval sends something back to the people who made it.
Every eval page, every run report, and every CLI run carries the original paper and citation. Running agi-evals cite <eval> prints ready-to-paste BibTeX. Citations are the currency of research; we mint them at the point of use.
Each eval links straight to its upstream repository so you can star it yourself. Of the 19 live evals, 16 carry a verified upstream link. We never star on your behalf: a star only means something when it's real.
Our runner code is Apache-2.0, but every dataset keeps its original upstream license, documented on its catalog entry. Faithfulness to the original work — its data, its scoring, its protocol — beats breadth every time. When we approximate, the scope note says so.
Eval pages show how many runs the community has pushed against each benchmark. Authors get a live number that proves their work is being used — something a PDF can never show.
Built an eval you want runnable here, with the credit flowing back to you? The protocols are stable and adding an eval is deliberately small. Reach out — or watch the catalog for your benchmark and tell us what we got wrong.