AGI·EVALS

Docs

One runner, two protocols, typed failures. Everything below is documented because it runs today; per-eval pages are status-aware and never describe what is not implemented.

Install

pip install agi-evals                 # core (OpenAI-compatible, Ollama, custom)
pip install 'agi-evals[openai]'       # + OpenAI SDK
pip install 'agi-evals[anthropic]'    # + Anthropic SDK
pip install 'agi-evals[hf]'           # + Transformers / torch
pip install 'agi-evals[mlx]'          # + MLX (Apple Silicon)

Quickstart

Every live eval bundles a small real-schema sample, so your first run needs no dataset download and no API key:

agi-evals list --status live          # what runs today
agi-evals run gpqa-diamond --model echo   # offline smoke test
agi-evals download gpqa-diamond           # fetch + cache the full dataset
agi-evals run gpqa-diamond --model openai:gpt-4o-mini --limit 50

Same thing from Python:

from agi_evals import load_runner, run_eval
from agi_evals.adapters import OpenAIAdapter

report = run_eval(
    load_runner("gpqa-diamond"),
    OpenAIAdapter("gpt-4o-mini"),
    limit=50,
    concurrency=8,
)
print(report.score, report.pass_rate, report.failure_counts)

Datasets resolve in order: explicit data_path= → ~/.cache/agi-evals/ (populated by agi-evals download, which uses the HF datasets-server / GitHub with no heavy deps) → the bundled offline sample. GPQA is gated upstream: set HF_TOKEN after accepting its terms, or the downloader falls back to the GPQA repo's published-password zip.
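The lookup order above can be sketched as a small function. This is an illustration only: resolve_dataset, the per-eval cache layout, and the bundled_samples directory are hypothetical names, not part of the agi-evals API.

```python
from pathlib import Path
from typing import Optional, Tuple


# Hypothetical helper: it only mirrors the documented lookup order.
def resolve_dataset(
    eval_name: str,
    data_path: Optional[str] = None,
    cache_dir: Path = Path.home() / ".cache" / "agi-evals",
    bundled_dir: Path = Path("bundled_samples"),  # assumed location
) -> Tuple[str, Path]:
    """Return (source, path) for the first location that applies."""
    # 1. An explicit data_path always wins.
    if data_path is not None:
        return "explicit", Path(data_path)
    # 2. Otherwise use the cache populated by `agi-evals download`.
    #    One entry per eval name is an assumption about the layout.
    cached = cache_dir / eval_name
    if cached.exists():
        return "cache", cached
    # 3. Fall back to the small offline sample shipped with the eval.
    return "bundled", bundled_dir / eval_name
```

On a fresh machine with no download and no data_path, a call such as resolve_dataset("gpqa-diamond") would land on the bundled sample, which is why the first run needs no network access.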

Adapters

A patient is anything with a name and a respond(request) method. Eight ship in the box:

from agi_evals.adapters import (
    OpenAIAdapter,        # OpenAIAdapter("gpt-4o-mini")  [env OPENAI_API_KEY]
    AnthropicAdapter,     # AnthropicAdapter("claude-opus-4-8")  [ANTHROPIC_API_KEY]
    GrokAdapter,          # GrokAdapter("grok-4")  [XAI_API_KEY]
    OllamaAdapter,        # OllamaAdapter("llama3.1:8b")  [local daemon]
    VLLMAdapter,          # VLLMAdapter("meta-llama/Llama-3.1-8B-Instruct")
    HFTransformersAdapter, # in-process transformers; concurrency=1 on one GPU
    MLXAdapter,           # Apple Silicon local models
    CustomAdapter,        # CustomAdapter(lambda req: my_model(req.prompt))
)

CustomAdapter is the escape hatch: wrap any callable — a private server, a research harness, a mock — and it is a first-class patient. Heavy SDKs import lazily, so unused backends cost nothing.
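The "name plus respond(request)" contract is a structural one, which can be sketched with typing.Protocol. The Request dataclass and Patient protocol below are stand-ins written for this example, and the str return type is an assumption; the library's real request and response types may carry more fields.

```python
from dataclasses import dataclass
from typing import Protocol, runtime_checkable


@dataclass
class Request:
    """Hypothetical stand-in for the library's request object."""
    prompt: str


@runtime_checkable
class Patient(Protocol):
    """Anything with a name and a respond(request) method."""
    name: str

    def respond(self, request: Request) -> str: ...


class UppercaseModel:
    """A toy model: no base class, it just has the right shape."""
    name = "uppercase-mock"

    def respond(self, request: Request) -> str:
        return request.prompt.upper()


model = UppercaseModel()
print(isinstance(model, Patient))             # True: structural check only
print(model.respond(Request(prompt="ping")))  # PING
```

Because the check is structural, a mock like this needs no inheritance from any adapter class, which is the same idea CustomAdapter applies to a bare callable.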

API keys & push

  1. Sign in with GitHub and mint a key under Settings → API keys. The plaintext is shown once; only a hash is stored.
  2. Export it: export AGI_EVALS_API_KEY=agik_…
  3. Add --push to any run, or call push_report() from the SDK.

agi-evals run math --model anthropic:claude-opus-4-8 --push

# or from Python
from agi_evals.client import push_report
push_report(report, model="anthropic:claude-opus-4-8",
            run_meta={"github": "you/your-model", "endpoint": None})

Pushed runs power your private scoreboard-over-time on the dashboard and rank on the public leaderboard. Attach a GitHub repo or an endpoint via run_meta so results stay reproducible.

Variant vs base

Fine-tuned a model? Don't compare two averages — run a paired comparison. Both models answer the identical cases, so the discordant pairs carry all the signal: which cases your variant newly solves, and which it newly fails.

agi-evals compare gpqa-diamond \
  --model openai:my-finetune --baseline openai:gpt-4o-mini --push

# gpqa-diamond: openai:my-finetune vs openai:gpt-4o-mini
#   paired cases : 198  (both pass 121, both fail 52)
#   improvements : 19  (variant newly solves)
#   regressions  : 6   (variant newly fails)
#   score delta  : +0.066  (0.641 -> 0.707)
#   McNemar p    : 0.0146  (statistically significant)

Significance is McNemar's exact test on the discordant pairs — the standard paired test, so a +5% from 19-wins/6-losses reads very differently than +5% from 5-wins/0-losses on a tiny sample. Cases that hit an infrastructure error under either model are excluded from pairing, so endpoint flakes never masquerade as regressions. With --push, both runs land on your dashboard as a vs-card under a shared comparison id, and the regression case ids are printed so you can rerun exactly the cases your fine-tune broke.
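For readers who want to check a comparison by hand, here is a minimal sketch of the two-sided exact McNemar test using only the standard library. It is a textbook binomial computation on the discordant counts, not the tool's own code, so the tool's reported value may differ in detail; mcnemar_exact is a name chosen for this example.

```python
from math import comb


def mcnemar_exact(improvements: int, regressions: int) -> float:
    """Two-sided exact McNemar p-value from the discordant pair counts."""
    n = improvements + regressions
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence either way
    k = min(improvements, regressions)
    # Under the null hypothesis each discordant pair is a fair coin flip,
    # so the smaller count follows Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    # Double the one-sided tail and cap at 1.
    return min(1.0, 2 * tail)


p_large = mcnemar_exact(19, 6)  # 25 discordant pairs
p_small = mcnemar_exact(5, 0)   # 5 discordant pairs
print(p_large < 0.05, p_small < 0.05)
```

With 5 wins and 0 losses the p-value is 2 × (1/32) = 0.0625, which misses the conventional 0.05 threshold even though the variant never lost a case, while 19 wins against 6 losses clears it. Concordant pairs (both pass, both fail) never enter the calculation.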

Failure taxonomy

Every non-pass carries exactly one tag. Aggregates stay comparable across all evals, and troubleshooting is mechanical:

Tag               Meaning                             Counted against model?
WRONG_ANSWER      Graded fairly; answer was wrong     yes
NO_ANSWER         No parseable answer in the reply    yes
REFUSED           Model declined the task             yes
MALFORMED_OUTPUT  Output unusable for grading         yes
TOOL_ERROR        Bad tool call during an episode     yes
TIMEOUT           Hit the per-case execution limit    yes
CONTEXT_OVERFLOW  Case can't fit the context window   yes
ADAPTER_ERROR     Endpoint/transport/auth failure     no (excluded)
HARNESS_ERROR     Bug in the runner/harness           no (excluded)
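One way to read the "counted against model?" column is as a rule for the denominator: excluded tags drop out before the pass rate is computed. The sketch below assumes that reading. The tag names come from the table, while the Tag enum, the summarize function, and the convention of None for a pass are hypothetical and not the library's implementation.

```python
from collections import Counter
from enum import Enum
from typing import Dict, List, Optional


class Tag(str, Enum):
    WRONG_ANSWER = "WRONG_ANSWER"
    NO_ANSWER = "NO_ANSWER"
    REFUSED = "REFUSED"
    MALFORMED_OUTPUT = "MALFORMED_OUTPUT"
    TOOL_ERROR = "TOOL_ERROR"
    TIMEOUT = "TIMEOUT"
    CONTEXT_OVERFLOW = "CONTEXT_OVERFLOW"
    ADAPTER_ERROR = "ADAPTER_ERROR"
    HARNESS_ERROR = "HARNESS_ERROR"


# Infrastructure problems: not the model's fault, so not in the denominator.
EXCLUDED = frozenset({Tag.ADAPTER_ERROR, Tag.HARNESS_ERROR})


def summarize(outcomes: List[Optional[Tag]]) -> Dict[str, object]:
    """outcomes holds None for a pass, otherwise the case's single tag."""
    counted = [o for o in outcomes if o not in EXCLUDED]
    passes = sum(1 for o in counted if o is None)
    return {
        "pass_rate": passes / len(counted) if counted else 0.0,
        "excluded": len(outcomes) - len(counted),
        "failure_counts": dict(
            Counter(o.value for o in outcomes if o is not None)
        ),
    }
```

Under this sketch, two passes, one WRONG_ANSWER, and one ADAPTER_ERROR give a pass rate of 2/3 rather than 2/4, because the endpoint failure is reported in the counts but never scored.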

Challenges

A challenge is a bigger, time-boxed board. Forward any of your runs to it with one call:

curl -X POST https://agi-eval.studio/api/v1/challenges/reasoning-open-2026/submissions \
  -H "Authorization: Bearer $AGI_EVALS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"run_id": "<id returned when you pushed>"}'

Eligibility (which evals count, open/close dates) is enforced server-side — see open challenges.
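The same request can be built from Python with only the standard library. This sketch mirrors the curl call above field for field; build_submission and submit are names chosen for this example, and the response format is not documented here, so the body is returned as raw text.

```python
import json
import urllib.request


def build_submission(
    challenge: str, run_id: str, api_key: str
) -> urllib.request.Request:
    """Build the POST request that forwards a pushed run to a challenge."""
    url = f"https://agi-eval.studio/api/v1/challenges/{challenge}/submissions"
    body = json.dumps({"run_id": run_id}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


def submit(challenge: str, run_id: str, api_key: str) -> str:
    """Send the request; raises urllib.error.HTTPError on a 4xx/5xx reply."""
    request = build_submission(challenge, run_id, api_key)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


# Usage (performs a real network call, so it is left commented out):
# import os
# print(submit("reasoning-open-2026", "<id returned when you pushed>",
#              os.environ["AGI_EVALS_API_KEY"]))
```

Separating request construction from sending keeps the payload easy to inspect or unit-test without touching the network.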

Per-eval docs

Live evals have full how-it-works / usage / troubleshooting docs. Building and roadmap evals show a fact sheet and the contribution path — an eval cannot flip to live without its docs shipping in the same PR.