Berkeley Function-Calling Leaderboard
Live. Function-calling graded by AST match (simple category live; multi-call coming).
How it works
- 01
BFCL (simple category, v4 python) hands the model one user request plus one function schema; a pass means calling the right function with acceptable argument values. The ground truth lists the allowed values for each parameter, and an empty string ("") among those values marks the parameter as omissible.
- 02
The runner advertises the function as a native tool (ToolSpec). Adapters with tool support (OpenAI, Anthropic, Grok, Ollama) return structured tool_calls, which are graded directly. For models without native tool support, the runner falls back to parsing the call from the reply text, either as a JSON {"name", "arguments"} object or as Python-style name(arg=val) syntax, mirroring BFCL's prompting mode.
- 03
Grading follows the benchmark's AST-style match. A call passes when the function name matches (dotted and underscored spellings both count, since hosted APIs forbid dots), every required parameter is present with an allowed value, and there are no unexpected parameters.
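The matching rules above can be sketched in a few lines of Python. This is a minimal illustration, not the runner's actual grader: the function names (`normalize_name`, `loosely_equal`, `grade_call`), the ground-truth shape (one function name mapped to per-parameter lists of allowed values), and the exact coercion rule are all assumptions made for the example. It also treats any parameter without "" in its allowed values as required, which simplifies the schema-based check.

```python
def normalize_name(name: str) -> str:
    """Treat dotted and underscored spellings as the same function name."""
    return name.replace(".", "_")


def loosely_equal(actual, allowed) -> bool:
    """Compare with simple int/float/string coercion (an assumed rule)."""
    if actual == allowed:
        return True
    try:
        return float(actual) == float(allowed)
    except (TypeError, ValueError):
        return str(actual) == str(allowed)


def grade_call(called: dict, ground_truth: dict) -> tuple[bool, str]:
    """Return (passed, reason) for one parsed call.

    ground_truth is assumed to look like {"func.name": {"param": [allowed, ...]}}.
    """
    (expected_name, allowed), = ground_truth.items()
    if normalize_name(called.get("name", "")) != normalize_name(expected_name):
        return False, "wrong function"

    args = called.get("arguments", {})
    for param in args:
        if param not in allowed:
            return False, f"unexpected parameter: {param}"

    for param, options in allowed.items():
        if param not in args:
            if "" in options:  # "" marks the parameter as omissible
                continue
            return False, f"missing required parameter: {param}"
        if not any(loosely_equal(args[param], option) for option in options):
            return False, f"value not allowed for parameter: {param}"

    return True, "match"


# Hypothetical case: "unit" may be omitted, and "5" is accepted for 5.
ground_truth = {"math.triangle_area": {"base": [10], "height": [5], "unit": ["units", ""]}}
call = {"name": "math_triangle_area", "arguments": {"base": 10, "height": "5"}}
print(grade_call(call, ground_truth))  # (True, 'match')
```

Returning a reason string alongside the verdict keeps a failed match explainable, which is the role detail.reason plays in the real reports.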
Scoring
- 01
score = fraction of requests answered with an acceptable call. This implements the 'simple' category only; the multi-call, parallel, and irrelevance-detection categories are tracked separately in the catalog as they come live.
- 02
MALFORMED_OUTPUT vs WRONG_ANSWER is the key split: the former means no call was parseable at all (a capability/formatting failure), the latter means a call was made but didn't match.
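As a rough sketch of how the score and the failure split fit together, the snippet below tallies one outcome label per case. The `summarize` helper and the "PASS" label are hypothetical names chosen for the example; the failure tags are the ones this page documents, and the exclusion of ADAPTER_ERROR and HARNESS_ERROR from the denominator follows the troubleshooting table.

```python
from collections import Counter

# Tags that reflect our side, not the model's, and so do not count toward the score.
EXCLUDED = {"ADAPTER_ERROR", "HARNESS_ERROR"}


def summarize(outcomes: list[str]) -> tuple[float, Counter]:
    """Compute the pass fraction and per-tag failure counts.

    Each outcome is either "PASS" (an assumed label) or one failure tag.
    """
    scored = [o for o in outcomes if o not in EXCLUDED]
    passes = sum(1 for o in scored if o == "PASS")
    score = passes / len(scored) if scored else 0.0
    failure_counts = Counter(o for o in outcomes if o != "PASS")
    return score, failure_counts


outcomes = ["PASS", "PASS", "WRONG_ANSWER", "MALFORMED_OUTPUT", "ADAPTER_ERROR"]
score, failure_counts = summarize(outcomes)
print(score)                               # 0.5
print(failure_counts["MALFORMED_OUTPUT"])  # 1
```

Comparing the MALFORMED_OUTPUT count against the WRONG_ANSWER count shows at a glance whether a model mostly fails at producing a call or at choosing the right one.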
Using it
- 01
CLI: agi-evals download bfcl && agi-evals run bfcl --model openai:gpt-4o-mini
- 02
Tool-native adapters give the truest signal. When testing a text-only model, expect the JSON-fallback path and read detail.called to see what was parsed.
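To make the text-fallback path concrete, here is a minimal sketch of parsing a reply as either a JSON {"name", "arguments"} object or a Python-style name(arg=val) call. The `parse_text_call` function is hypothetical and not the runner's parser: it expects the reply to contain only the call, rejects positional arguments, and needs Python 3.9+ for `ast.unparse`.

```python
import ast
import json
from typing import Optional


def parse_text_call(text: str) -> Optional[dict]:
    """Return {"name": ..., "arguments": {...}}, or None if nothing parses."""
    text = text.strip()

    # 1) Try a JSON object with "name" and "arguments" keys.
    try:
        obj = json.loads(text)
    except json.JSONDecodeError:
        obj = None
    if isinstance(obj, dict) and "name" in obj and isinstance(obj.get("arguments"), dict):
        return {"name": obj["name"], "arguments": obj["arguments"]}

    # 2) Try Python-style syntax such as math.area(base=10, height=5).
    try:
        node = ast.parse(text, mode="eval").body
    except (SyntaxError, ValueError):
        return None
    if not isinstance(node, ast.Call) or node.args:
        return None  # not a call, or it uses positional arguments
    if any(kw.arg is None for kw in node.keywords):
        return None  # **kwargs unpacking is not supported in this sketch

    try:
        # literal_eval only accepts literals, so no model-written code is executed.
        arguments = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
    except (ValueError, TypeError):
        return None
    return {"name": ast.unparse(node.func), "arguments": arguments}


print(parse_text_call('{"name": "get_weather", "arguments": {"city": "Paris"}}'))
print(parse_text_call("math.triangle_area(base=10, height=5)"))
print(parse_text_call("I cannot help with that."))  # None
```

A None result here corresponds to the case reported as MALFORMED_OUTPUT, while a returned dictionary is the kind of value to expect in detail.called.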
Troubleshooting
Every non-pass carries exactly one typed failure tag, so the diagnosis is mechanical: look up the tag in report.failure_counts, then the per-case detail.
| Tag | What it means | What to do |
|---|---|---|
| MALFORMED_OUTPUT | Neither a native tool call nor a parseable JSON/Python-syntax call was found in the reply. | For text-mode models, tighten the instruction to 'reply with exactly one JSON object and nothing else'. Persistent malformed output on a tool-capable adapter usually means the adapter isn't passing request.tools through — check the adapter, not the model. |
| WRONG_ANSWER | A call was parsed but failed the match: wrong function, missing required parameter, a value outside the allowed list, or extra parameters. | detail.reason states exactly which rule failed and detail.called shows the parsed call. Values compare loosely (int/float/string coercion), so a reported mismatch reflects a real model error rather than a type difference. |
| ADAPTER_ERROR / HARNESS_ERROR | Our side, not the model's: transport/auth failures or a harness bug. Excluded from the score automatically. | Check endpoint credentials and connectivity; if HARNESS_ERROR persists, file an issue with the per-case detail.error string. |