
Berkeley Function-Calling Leaderboard


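Scores function calling by parsing each generated call and comparing its abstract syntax tree (AST) against the expected call. Only the simple category is live so far; multi-call categories are coming.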
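AST matching means the model's output is parsed as code and compared structurally rather than as a raw string. In the sketch below, keyword order and quoting style therefore don't affect the result. It uses only Python's standard `ast` module and is a simplified illustration, not BFCL's actual checker. The ground-truth shape, where each parameter maps to a list of acceptable values and `""` marks an optional parameter, is modeled loosely on BFCL's possible-answer files. The failure-reason strings are made up for this example.

```python
from __future__ import annotations

import ast
from typing import Any

# Illustrative ground truth: function name -> {param: [acceptable values]}.
# An empty string "" in a value list marks the parameter as optional.
GroundTruth = dict[str, dict[str, list[Any]]]


def parse_call(src: str) -> tuple[str, dict[str, Any]]:
    """Parse one call such as 'f(a=1, b="x")' into (name, kwargs)."""
    node = ast.parse(src.strip(), mode="eval").body
    if not isinstance(node, ast.Call):
        raise ValueError("not a function call")
    if node.args:
        # Positional arguments are out of scope for this sketch.
        raise ValueError("positional arguments not supported")
    kwargs: dict[str, Any] = {}
    for kw in node.keywords:
        if kw.arg is None:  # **kwargs unpacking
            raise ValueError("keyword unpacking not supported")
        # literal_eval accepts only literals, so f(x=some_var) raises ValueError.
        kwargs[kw.arg] = ast.literal_eval(kw.value)
    # unparse keeps dotted names such as 'math.hypot' intact.
    return ast.unparse(node.func), kwargs


def check_call(src: str, ground_truth: GroundTruth) -> str | None:
    """Return None if the call matches, else a short (hypothetical) failure reason."""
    try:
        name, kwargs = parse_call(src)
    except (SyntaxError, ValueError, TypeError):
        return "unparseable"
    if name not in ground_truth:
        return "wrong_function"
    allowed = ground_truth[name]
    if set(kwargs) - set(allowed):
        return "unexpected_param"
    for param, options in allowed.items():
        if param not in kwargs:
            if "" in options:  # optional parameter may be omitted
                continue
            return "missing_param"
        if kwargs[param] not in options:
            return "wrong_value"
    return None


gt = {"get_weather": {"city": ["Berlin", "Berlin, Germany"], "unit": ["celsius", ""]}}
print(check_call('get_weather(city="Berlin")', gt))  # None: unit is optional
print(check_call("get_weather(city='Paris')", gt))   # wrong_value
```

Listing several acceptable values per parameter lets a checker of this kind accept equivalent answers (for example "Berlin" vs "Berlin, Germany") without any fuzzy string matching.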

Run it

# CLI

agi-evals run bfcl --model openai:gpt-4o-mini --push

# SDK

from agi_evals import load_runner, run_eval
from agi_evals.adapters import OpenAIAdapter

report = run_eval(
    load_runner("bfcl"),           # the BFCL eval runner
    OpenAIAdapter("gpt-4o-mini"),  # the model under test
    concurrency=8,                 # samples evaluated in parallel
)
print(report.score, report.failure_counts)
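The report exposes an overall score and per-reason failure counts. How agi_evals computes these fields isn't described here. The hypothetical `Report` and `summarize` below illustrate one plausible reading, not the library's API. They assume the score is the fraction of samples that pass and that `failure_counts` tallies failure reasons.

```python
from __future__ import annotations

from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Report:
    """Hypothetical stand-in for the object run_eval returns (not the agi_evals API)."""
    score: float
    failure_counts: Counter = field(default_factory=Counter)


def summarize(outcomes: list[str | None]) -> Report:
    """Turn per-sample outcomes (None = pass, str = failure reason) into a report."""
    if not outcomes:
        # No samples: report a zero score rather than dividing by zero.
        return Report(score=0.0)
    failures = Counter(reason for reason in outcomes if reason is not None)
    passed = len(outcomes) - sum(failures.values())
    return Report(score=passed / len(outcomes), failure_counts=failures)


report = summarize([None, None, "wrong_value", None, "missing_param"])
print(report.score, report.failure_counts)  # score 0.6: 3 of 5 samples passed
```

Keeping failure reasons as a tally, not just a pass rate, shows *why* a model loses points (for example wrong argument values vs. calling the wrong function).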

The eval ships with a small bundled sample, so it runs out of the box without fetching the dataset. Scores on the sample are not representative; point data_path= at the full upstream dataset for real numbers.
