Docs
One runner, two protocols, typed failures. Everything below is documented because it runs today; per-eval pages are status-aware and never describe what is not implemented.
Install
pip install agi-evals                # core (OpenAI-compatible, Ollama, custom)
pip install 'agi-evals[openai]'      # + OpenAI SDK
pip install 'agi-evals[anthropic]'   # + Anthropic SDK
pip install 'agi-evals[hf]'          # + Transformers / torch
pip install 'agi-evals[mlx]'         # + MLX (Apple Silicon)
Quickstart
Every live eval bundles a small real-schema sample, so your first run needs no dataset download and no API key:
agi-evals list --status live              # what runs today
agi-evals run gpqa-diamond --model echo   # offline smoke test
agi-evals download gpqa-diamond           # fetch + cache the full dataset
agi-evals run gpqa-diamond --model openai:gpt-4o-mini --limit 50
Same thing from Python:
from agi_evals import load_runner, run_eval
from agi_evals.adapters import OpenAIAdapter
report = run_eval(
    load_runner("gpqa-diamond"),
    OpenAIAdapter("gpt-4o-mini"),
    limit=50,
    concurrency=8,
)
print(report.score, report.pass_rate, report.failure_counts)
Datasets resolve in order: explicit data_path= → ~/.cache/agi-evals/ (populated by agi-evals download, which uses the HF datasets-server / GitHub with no heavy deps) → the bundled offline sample. GPQA is gated upstream: set HF_TOKEN after accepting its terms, or the downloader falls back to the GPQA repo's published-password zip.
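The resolution order above can be sketched as a small lookup. This is an illustration only, not the library's code: `resolve_dataset`, the per-eval subdirectory inside the cache, and the `samples` directory for bundled data are all assumptions made for the example.

```python
from pathlib import Path
from typing import Optional, Tuple, Union

# Cache root named in the docs; the per-eval subdirectory layout is an assumption.
CACHE_DIR = Path.home() / ".cache" / "agi-evals"


def resolve_dataset(
    eval_id: str,
    data_path: Optional[Union[str, Path]] = None,
    cache_dir: Path = CACHE_DIR,
    bundled_dir: Path = Path("samples"),  # hypothetical location of bundled samples
) -> Tuple[str, Path]:
    """Return (source, path) following: explicit path, then cache, then bundled sample."""
    if data_path is not None:
        # 1. An explicit data_path= always wins.
        return ("explicit", Path(data_path))
    cached = cache_dir / eval_id
    if cached.exists():
        # 2. Otherwise use whatever a previous download left in the cache.
        return ("cache", cached)
    # 3. Otherwise fall back to the small offline sample.
    return ("bundled", bundled_dir / eval_id)
```

Returning the source label alongside the path makes it easy to log which tier a run actually used, which matters when a score comes from the small sample rather than the full dataset.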
Adapters
A patient is anything with a name and a respond(request) method. Eight ship in the box:
from agi_evals.adapters import (
    OpenAIAdapter,          # OpenAIAdapter("gpt-4o-mini") [env OPENAI_API_KEY]
    AnthropicAdapter,       # AnthropicAdapter("claude-opus-4-8") [ANTHROPIC_API_KEY]
    GrokAdapter,            # GrokAdapter("grok-4") [XAI_API_KEY]
    OllamaAdapter,          # OllamaAdapter("llama3.1:8b") [local daemon]
    VLLMAdapter,            # VLLMAdapter("meta-llama/Llama-3.1-8B-Instruct")
    HFTransformersAdapter,  # in-process transformers; concurrency=1 on one GPU
    MLXAdapter,             # Apple Silicon local models
    CustomAdapter,          # CustomAdapter(lambda req: my_model(req.prompt))
)
)
CustomAdapter is the escape hatch: wrap any callable (a private server, a research harness, a mock) and it is a first-class patient. Heavy SDKs import lazily, so unused backends cost nothing.
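The "name plus respond(request)" contract can be written down as a structural type. The sketch below is not taken from the package: `Request` is a stand-in with only the `prompt` field that the CustomAdapter example uses, `Patient` is a hypothetical name for the protocol, and returning a plain string from `respond` is an assumption.

```python
from dataclasses import dataclass
from typing import Protocol, runtime_checkable


@dataclass
class Request:
    """Stand-in request type; only `prompt` appears in the docs above."""
    prompt: str


@runtime_checkable
class Patient(Protocol):
    """Anything with a name and a respond(request) method."""
    name: str

    def respond(self, request: Request) -> str: ...


class ReverseModel:
    """Toy model that satisfies the protocol without inheriting from anything."""
    name = "reverse"

    def respond(self, request: Request) -> str:
        return request.prompt[::-1]


model = ReverseModel()
print(isinstance(model, Patient))      # True
print(model.respond(Request("abc")))   # cba
```

Because the check is structural, a mock for tests and a client for a private server look identical to the runner as long as both expose those two members.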
API keys & push
- Sign in with GitHub and mint a key under Settings → API keys. The plaintext is shown once; only a hash is stored.
- Export it: export AGI_EVALS_API_KEY=agik_…
- Add --push to any run, or call push_report() from the SDK.
agi-evals run math --model anthropic:claude-opus-4-8 --push
# or from Python
from agi_evals.client import push_report
push_report(report, model="anthropic:claude-opus-4-8",
            run_meta={"github": "you/your-model", "endpoint": None})
Pushed runs power your private scoreboard-over-time on the dashboard and rank on the public leaderboard. Attach a GitHub repo or an endpoint via run_meta so results stay reproducible.
Variant vs base
Fine-tuned a model? Don't compare two averages — run a paired comparison. Both models answer the identical cases, so the discordant pairs carry all the signal: which cases your variant newly solves, and which it newly fails.
agi-evals compare gpqa-diamond \
  --model openai:my-finetune --baseline openai:gpt-4o-mini --push
# gpqa-diamond: openai:my-finetune vs openai:gpt-4o-mini
# paired cases : 198 (both pass 121, both fail 52)
# improvements : 19 (variant newly solves)
# regressions  : 6 (variant newly fails)
# score delta  : +0.066 (0.641 -> 0.707)
# McNemar p    : 0.0153 (statistically significant)
Significance is McNemar's exact test on the discordant pairs — the standard paired test, so a +5% from 19-wins/6-losses reads very differently than +5% from 5-wins/0-losses on a tiny sample. Cases that hit an infrastructure error under either model are excluded from pairing, so endpoint flakes never masquerade as regressions. With --push, both runs land on your dashboard as a vs-card under a shared comparison id, and the regression case ids are printed so you can rerun exactly the cases your fine-tune broke.
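For intuition, here is a minimal sketch of the pairing and the test, written from the textbook definition of the two-sided exact McNemar test rather than from the tool's source. The function names, the `"PASS"` label, and the dict-of-outcomes input shape are assumptions for the example, and the CLI's printed p-value may be computed with a different variant.

```python
from math import comb
from typing import Dict, List, Tuple

# Tags the docs describe as infrastructure failures; such cases are not paired.
INFRA_TAGS = {"ADAPTER_ERROR", "HARNESS_ERROR"}


def discordant_pairs(
    variant: Dict[str, str], baseline: Dict[str, str]
) -> Tuple[List[str], List[str]]:
    """Map case_id -> "PASS" or a failure tag; return (improvements, regressions)."""
    improvements, regressions = [], []
    for case_id in variant.keys() & baseline.keys():
        v, b = variant[case_id], baseline[case_id]
        if v in INFRA_TAGS or b in INFRA_TAGS:
            continue  # endpoint flakes must not count as wins or losses
        if v == "PASS" and b != "PASS":
            improvements.append(case_id)
        elif v != "PASS" and b == "PASS":
            regressions.append(case_id)
    return sorted(improvements), sorted(regressions)


def mcnemar_exact(wins: int, losses: int) -> float:
    """Two-sided exact p-value: binomial test on discordant pairs with p = 0.5."""
    n = wins + losses
    if n == 0:
        return 1.0  # no discordant pairs, no evidence either way
    k = min(wins, losses)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


print(mcnemar_exact(5, 0))  # 0.0625
```

Five wins and zero losses is the strongest result a five-pair sample can give, yet it still misses the conventional 0.05 threshold, which is why the discordant counts matter more than the score delta alone.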
Failure taxonomy
Every non-pass carries exactly one tag. Aggregates stay comparable across all evals, and troubleshooting is mechanical:
| Tag | Meaning | Counted against model? |
|---|---|---|
| WRONG_ANSWER | Graded fairly; answer was wrong | yes |
| NO_ANSWER | No parseable answer in the reply | yes |
| REFUSED | Model declined the task | yes |
| MALFORMED_OUTPUT | Output unusable for grading | yes |
| TOOL_ERROR | Bad tool call during an episode | yes |
| TIMEOUT | Hit the per-case execution limit | yes |
| CONTEXT_OVERFLOW | Case can't fit the context window | yes |
| ADAPTER_ERROR | Endpoint/transport/auth failure | no — excluded |
| HARNESS_ERROR | Bug in the runner/harness | no — excluded |
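One way to read the last column is that excluded tags drop out of the denominator entirely. The sketch below shows that reading; it is an assumption about how the aggregates might be computed, not the runner's implementation, and `summarize` and the `"PASS"` label are hypothetical names.

```python
from collections import Counter
from typing import Dict, List, Optional, Union

# Tags from the table above, split by the "Counted against model?" column.
COUNTED = {
    "WRONG_ANSWER", "NO_ANSWER", "REFUSED", "MALFORMED_OUTPUT",
    "TOOL_ERROR", "TIMEOUT", "CONTEXT_OVERFLOW",
}
EXCLUDED = {"ADAPTER_ERROR", "HARNESS_ERROR"}


def summarize(outcomes: List[str]) -> Dict[str, Union[int, Optional[float], Dict[str, int]]]:
    """Aggregate per-case outcomes, where each item is "PASS" or one failure tag."""
    counts = Counter(outcomes)
    excluded = sum(counts[tag] for tag in EXCLUDED)
    graded = len(outcomes) - excluded  # only cases the model is responsible for
    return {
        "graded": graded,
        "excluded": excluded,
        "pass_rate": counts["PASS"] / graded if graded else None,
        "failure_counts": {t: counts[t] for t in sorted(COUNTED | EXCLUDED) if counts[t]},
    }


summary = summarize(["PASS", "PASS", "WRONG_ANSWER", "ADAPTER_ERROR"])
print(summary["graded"], summary["excluded"])  # 3 1
```

Under this reading a flaky endpoint lowers the number of graded cases but not the pass rate, so a run with many excluded cases is better rerun than compared.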
Challenges
A challenge is a bigger, time-boxed board. Forward any of your runs to it with one call:
curl -X POST https://agi-eval.studio/api/v1/challenges/reasoning-open-2026/submissions \
  -H "Authorization: Bearer $AGI_EVALS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"run_id": "<id returned when you pushed>"}'
Eligibility (which evals count, open/close dates) is enforced server-side; see open challenges.
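The same request can be built from Python with only the standard library. This mirrors the curl call above and nothing more: `build_submission` and `submit` are hypothetical helper names, and the response format is not documented here, so the sketch returns only the HTTP status.

```python
import json
import urllib.request

BASE_URL = "https://agi-eval.studio/api/v1"


def build_submission(challenge: str, run_id: str, api_key: str) -> urllib.request.Request:
    """Build the POST request shown in the curl example, without sending it."""
    return urllib.request.Request(
        f"{BASE_URL}/challenges/{challenge}/submissions",
        data=json.dumps({"run_id": run_id}).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def submit(request: urllib.request.Request) -> int:
    """Send the request and return the HTTP status code."""
    with urllib.request.urlopen(request) as response:
        return response.status
```

Call `submit(build_submission("reasoning-open-2026", run_id, os.environ["AGI_EVALS_API_KEY"]))` with the run id returned by your push. Keeping construction separate from sending lets you inspect or log the request before it touches the network.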
Per-eval docs
Live evals have full how-it-works / usage / troubleshooting docs. Building and roadmap evals show a fact sheet and the contribution path — an eval cannot flip to live without its docs shipping in the same PR.