
Berkeley Function-Calling Leaderboard

Live

Function-calling graded by AST match (simple category live; multi-call coming).

How it works

  • 01

    BFCL (simple category, v4 Python) hands the model one user request plus one function schema; a pass means calling the right function with acceptable argument values. The ground truth lists the allowed values for each parameter, and an empty string ("") among them marks the parameter as optional.

  • 02

    The runner advertises the function as a native tool (ToolSpec). Adapters with tool support (OpenAI, Anthropic, Grok, Ollama) return structured tool_calls, which are graded directly. For models without native tools, the runner falls back to parsing the reply text for either a JSON object with "name" and "arguments" keys or Python-style name(arg=val) syntax, mirroring BFCL's prompting mode.

  • 03

    Grading is the benchmark's AST-style match. The function name must match (dotted and underscored spellings both count, since hosted APIs forbid dots), every required parameter must be present with an allowed value, and no unexpected parameters may appear.
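The text fallback from step 02 can be sketched roughly as follows. This is an illustration rather than the runner's actual code: parse_call is a hypothetical name, and the sketch assumes Python-style calls use only keyword arguments with literal values.

```python
import ast
import json


def parse_call(text):
    """Return (name, arguments) parsed from model text, or None if nothing parses.

    Hypothetical helper: tries a JSON object with "name" and "arguments"
    keys first, then Python-style name(arg=val) syntax.
    """
    text = text.strip()

    # Attempt 1: a JSON object carrying the function name and an arguments dict.
    try:
        obj = json.loads(text)
    except json.JSONDecodeError:
        obj = None
    if isinstance(obj, dict) and "name" in obj and isinstance(obj.get("arguments"), dict):
        return obj["name"], obj["arguments"]

    # Attempt 2: a single Python call expression such as math.gcd(a=12, b=18).
    try:
        node = ast.parse(text, mode="eval").body
    except (SyntaxError, ValueError):
        return None
    if not isinstance(node, ast.Call) or node.args:
        return None  # not a call, or it uses positional arguments
    if any(kw.arg is None for kw in node.keywords):
        return None  # **kwargs unpacking is not supported in this sketch
    try:
        arguments = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
    except (ValueError, TypeError):
        return None  # an argument value was not a plain literal
    name = ast.unparse(node.func)  # keeps dotted names like "math.gcd"
    return name, arguments
```

A reply that yields None from a helper like this is the kind of case that would surface as MALFORMED_OUTPUT rather than WRONG_ANSWER.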
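The matching rules from steps 01 and 03 can be illustrated with a small sketch. The names grade_call, normalize_name, and loosely_equal are hypothetical, the ground-truth shape (a dict mapping each parameter to its list of allowed values) is an assumption based on the description above, and the value coercion is only a rough stand-in for whatever the real grader does.

```python
def normalize_name(name):
    # Hosted tool APIs forbid dots, so "math.gcd" and "math_gcd" are treated alike.
    return name.replace(".", "_")


def loosely_equal(a, b):
    # Rough stand-in for int/float/string coercion; the real rules may differ.
    if a == b:
        return True
    try:
        return float(a) == float(b)
    except (TypeError, ValueError):
        return str(a) == str(b)


def grade_call(called_name, called_args, expected_name, allowed):
    """Return (passed, reason).

    `allowed` maps each parameter to its list of allowed values;
    an "" entry in that list marks the parameter as optional.
    """
    if normalize_name(called_name) != normalize_name(expected_name):
        return False, f"wrong function: {called_name}"
    for param in called_args:
        if param not in allowed:
            return False, f"unexpected parameter: {param}"
    for param, values in allowed.items():
        if param not in called_args:
            if "" in values:
                continue  # optional parameter, safe to omit
            return False, f"missing required parameter: {param}"
        if not any(loosely_equal(called_args[param], v) for v in values):
            return False, f"value not allowed for {param}: {called_args[param]!r}"
    return True, "ok"


# Illustrative ground truth (not taken from the dataset): "unit" may be omitted.
allowed = {"base": [10], "height": [5], "unit": ["units", ""]}
print(grade_call("calculate_triangle_area", {"base": 10, "height": "5"},
                 "calculate_triangle_area", allowed))
```

Returning a reason string alongside the verdict mirrors the idea of a per-case explanation, so a failed match can say which rule was broken.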

Scoring

  • 01

    score = fraction of requests answered with an acceptable call. This implements the 'simple' category; multi-call, parallel, and irrelevance-detection categories are tracked separately on the catalog as they come live.

  • 02

    MALFORMED_OUTPUT vs WRONG_ANSWER is the key split: the former means no call was parseable at all (a capability/formatting failure), the latter means a call was made but didn't match.
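As a rough illustration of the scoring arithmetic, the sketch below computes the pass fraction and tallies failure tags. The summarize function and the "PASS" label are hypothetical, and the exclusion of ADAPTER_ERROR and HARNESS_ERROR from the denominator follows the Troubleshooting section rather than any documented API.

```python
from collections import Counter

# Infrastructure failures are not the model's fault and are left out of the score.
EXCLUDED = {"ADAPTER_ERROR", "HARNESS_ERROR"}


def summarize(outcomes):
    """Return (score, failure_counts) for a list of per-case outcomes.

    Each outcome is either "PASS" (a hypothetical label) or a failure tag.
    """
    scored = [o for o in outcomes if o not in EXCLUDED]
    passes = sum(1 for o in scored if o == "PASS")
    score = passes / len(scored) if scored else 0.0
    failure_counts = Counter(o for o in outcomes if o != "PASS")
    return score, dict(failure_counts)


outcomes = ["PASS", "PASS", "WRONG_ANSWER", "MALFORMED_OUTPUT", "ADAPTER_ERROR"]
score, counts = summarize(outcomes)
print(score)   # 0.5: two passes out of four scored cases
print(counts)
```

Comparing the MALFORMED_OUTPUT and WRONG_ANSWER counts side by side separates formatting failures from genuine mismatches.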

Using it

  • 01

    CLI: agi-evals download bfcl && agi-evals run bfcl --model openai:gpt-4o-mini

  • 02

    Tool-native adapters give the truest signal. When testing a text-only model, expect the JSON-fallback path and read detail.called to see what was parsed.

Troubleshooting

Every non-pass carries exactly one typed failure tag, so the diagnosis is mechanical: look up the tag in report.failure_counts, then the per-case detail.

| Tag | What it means | What to do |
| --- | --- | --- |
| MALFORMED_OUTPUT | Neither a native tool call nor a parseable JSON/Python-syntax call was found in the reply. | For text-mode models, tighten the instruction to 'reply with exactly one JSON object and nothing else'. Persistent malformed output on a tool-capable adapter usually means the adapter isn't passing request.tools through; check the adapter, not the model. |
| WRONG_ANSWER | A call was parsed but failed the match: wrong function, missing required parameter, out-of-range value, or extra parameters. | detail.reason states exactly which rule failed and detail.called shows the parsed call. Values compare loosely (int/float/string coercion), so a genuine mismatch is a real model error. |
| ADAPTER_ERROR / HARNESS_ERROR | Our side, not the model's: transport/auth failures or a harness bug. Excluded from the score automatically. | Check endpoint credentials and connectivity; if HARNESS_ERROR persists, file an issue with the per-case detail.error string. |
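The lookup described above (tag counts first, then per-case detail) could be scripted along these lines. The report shape here is entirely assumed: a dict with "failure_counts" and a "cases" list whose entries carry "id", "tag", and "detail". The real report object may expose these fields differently, and triage is a hypothetical helper.

```python
def triage(report):
    """Return readable lines, most frequent failure tag first.

    Assumes `report` is a dict with "failure_counts" (tag -> count) and
    "cases" (a list of {"id", "tag", "detail"} dicts); this layout is a guess.
    """
    lines = []
    by_count = sorted(report["failure_counts"].items(), key=lambda kv: -kv[1])
    for tag, count in by_count:
        lines.append(f"{tag}: {count}")
        for case in report["cases"]:
            if case["tag"] != tag:
                continue
            detail = case["detail"]
            # WRONG_ANSWER carries a reason; infrastructure errors carry an error string.
            note = detail.get("reason") or detail.get("error") or "no parseable call"
            lines.append(f"  {case['id']}: {note} (called={detail.get('called')})")
    return lines


# Made-up sample data, for illustration only.
sample = {
    "failure_counts": {"MALFORMED_OUTPUT": 1, "WRONG_ANSWER": 2},
    "cases": [
        {"id": "c1", "tag": "WRONG_ANSWER",
         "detail": {"reason": "missing required parameter: height", "called": "area(base=10)"}},
        {"id": "c2", "tag": "MALFORMED_OUTPUT", "detail": {}},
        {"id": "c3", "tag": "WRONG_ANSWER",
         "detail": {"reason": "unexpected parameter: color", "called": "area(color='red')"}},
    ],
}
print("\n".join(triage(sample)))
```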