
τ-bench


Customer-service episodes (retail and airline) judged by the final database state and the outputs communicated to the user.

How it works

01
A faithful port of Sierra Research's τ-bench: each case is a full customer-service episode in the retail (115 tasks) or airline (50 tasks) domain. The agent gets the domain's policy wiki as its system prompt and the domain's tools as native tool specs. A simulated user, itself just another PatientAdapter, opens with a scenario and then converses with the agent.
02
The domain tools (16 retail, 14 airline) are vendored 1:1 from the original MIT-licensed repo, operating on the original mock databases. Every agent turn is one action: a tool call (executed against the episode's private DB copy) or a message to the user. The episode ends when the user says ###STOP### or the 30-step cap fires.
03
Reward is the paper's, exactly: 1.0 if and only if the hash of the final database state equals the hash of the state produced by replaying the task's ground-truth actions AND every required output was communicated to the user (matched case-insensitively, with commas stripped). It is all-or-nothing: there is no partial credit.
04
Port verification: a gold-replay oracle (an agent that executes the ground-truth actions verbatim) scores 165/165 across both domains on the real data.
05
The user simulator uses the upstream system prompt verbatim. Pass user=<any adapter> for the benchmark protocol; without one, a deterministic relay user keeps the eval runnable offline. Relay runs are recorded as user_mode=relay on every result and are NOT comparable with LLM-user runs.
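The reward rule in step 03 can be sketched in a few lines. This is an illustrative sketch, not the harness's code: the function names (state_hash, normalize, reward), the SHA-256 over canonical JSON, and the substring match are all assumptions chosen to mirror the description above.

```python
import hashlib
import json


def state_hash(db: dict) -> str:
    """Hash a JSON-serializable DB snapshot (assumed representation)."""
    # sort_keys makes the hash independent of dict insertion order.
    canonical = json.dumps(db, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def normalize(text: str) -> str:
    """Lowercase and strip commas, as the output check describes."""
    return text.lower().replace(",", "")


def reward(final_db, gold_db, agent_messages, required_outputs):
    """All-or-nothing reward: DB state must match AND all outputs must be said."""
    r_action = state_hash(final_db) == state_hash(gold_db)
    said = [normalize(m) for m in agent_messages]
    # An output counts as communicated if it appears in any agent message.
    missed = [o for o in required_outputs
              if not any(normalize(o) in m for m in said)]
    r_output = not missed
    return {
        "reward": 1.0 if (r_action and r_output) else 0.0,
        "r_action": r_action,
        "r_output": r_output,
        "missed": missed,
    }


# Hypothetical episode: the DB matches the gold replay and the amount is stated.
gold = {"orders": {"#W1": {"status": "cancelled"}}}
final = {"orders": {"#W1": {"status": "cancelled"}}}
result = reward(final, gold, ["Your refund is $1,234.50."], ["1234.50"])
print(result["reward"], result["missed"])
```

Stripping commas before matching is what lets a message containing "$1,234.50" satisfy a required output of "1234.50".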

Scoring

01
score = fraction of episodes with reward 1.0, which is the paper's headline metric. pass^k consistency over repeated trials can be computed by running the eval k times.
02
detail.r_action and detail.r_output split every failure: wrong/missing database mutations vs failing to tell the user something required. detail.outputs shows exactly which required outputs were missed.
03
TIMEOUT means the 30-step cap fired before the user ended the conversation — usually an agent stuck in a tool loop or never concluding.
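For pass^k, a minimal sketch of the usual estimator follows. It assumes you have run the eval n times and counted, per task, how many runs earned reward 1.0. The function name pass_hat_k and the input layout are hypothetical and not part of the SDK. With n = k it reduces to the fraction of tasks that passed in all k runs.

```python
from math import comb


def pass_hat_k(successes_per_task, n_trials, k):
    """Estimate pass^k: the chance that k independent trials of a task all pass.

    successes_per_task[i] is the number of passing trials (out of n_trials)
    for task i. Each task contributes C(c, k) / C(n, k); the result is the
    mean over tasks.
    """
    if not 1 <= k <= n_trials:
        raise ValueError("k must satisfy 1 <= k <= n_trials")
    if not successes_per_task:
        raise ValueError("need at least one task")
    total = sum(comb(c, k) / comb(n_trials, k) for c in successes_per_task)
    return total / len(successes_per_task)


# Hypothetical counts: three tasks, four trials each.
counts = [4, 2, 0]
for k in (1, 2, 4):
    print(k, round(pass_hat_k(counts, n_trials=4, k=k), 4))
```

Note that math.comb(c, k) returns 0 when k > c, so a task with fewer than k passing trials contributes nothing.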

Using it

01
CLI (relay user, smoke only): agi-evals download tau-bench && agi-evals run tau-bench --model openai:gpt-4o
02
SDK (benchmark protocol): run_eval(TauBenchRunner(domain='retail', user=OpenAIAdapter('gpt-4o')), patient) — the paper used GPT-4 as the user simulator.
03
TauBenchRunner(domain='airline') selects the harder airline domain. Transcripts are adapter-universal (tool calls/results embedded as text turns), so even non-function-calling models can play via the JSON fallback.
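To illustrate how a text-only model could still emit tool calls, here is a sketch of a JSON fallback parser. The wire format shown ({"name": ..., "arguments": {...}}), the function parse_action, and the tool name lookup_order are assumptions made for illustration; the harness's actual fallback format is not documented in this section.

```python
import json


def parse_action(text: str):
    """Classify one agent turn as a tool call or a message to the user.

    Returns ("tool", name, arguments) when the text is a JSON object with a
    string "name" and an optional dict "arguments" (assumed schema);
    otherwise returns ("message", text).
    """
    try:
        payload = json.loads(text.strip())
    except json.JSONDecodeError:
        return ("message", text)
    if (
        isinstance(payload, dict)
        and isinstance(payload.get("name"), str)
        and isinstance(payload.get("arguments", {}), dict)
    ):
        return ("tool", payload["name"], payload.get("arguments", {}))
    # Valid JSON that is not a tool call is treated as ordinary text.
    return ("message", text)


# Hypothetical turns from a model without native function calling.
print(parse_action('{"name": "lookup_order", "arguments": {"order_id": "#W1"}}'))
print(parse_action("Your order has been cancelled."))
```

Falling back to a plain message whenever parsing fails keeps every turn a single action, matching the one-action-per-turn rule described earlier.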

Troubleshooting

Every non-pass carries exactly one typed failure tag, so the diagnosis is mechanical: look up the tag in report.failure_counts, then the per-case detail.

Tag	What it means	What to do
WRONG_ANSWER	Episode ended but reward is 0: the DB diverged from the gold replay (r_action=false — wrong tool, wrong args, missing or extra mutation) and/or a required output never reached the user (r_output=false).	Check detail.r_action vs detail.r_output first. For r_action failures, compare the agent's tool calls against the task's gold actions; common causes are skipping identity verification (policy) or mutating before user confirmation.
TIMEOUT	30 agent actions without the user ending the episode.	Inspect detail.n_tool_calls: high counts mean a tool loop (often retrying a failing call); low counts mean the agent never concluded the conversation. With the relay user, remember it only stops after the agent responds several times.
ADAPTER_ERROR / HARNESS_ERROR	Our side, not the model's: transport/auth failures or a harness bug. Excluded from the score automatically.	Check endpoint credentials and connectivity; if HARNESS_ERROR persists, file an issue with the per-case detail.error string.
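The table's triage steps can be written as a small helper. This is a sketch over plain dictionaries: the field names (r_action, r_output, n_tool_calls, error) come from the text above, while the function diagnose and the loop_threshold of 15 tool calls are assumptions, since the docs do not define what counts as a "high" count.

```python
def diagnose(tag: str, detail: dict, loop_threshold: int = 15) -> str:
    """Map a failure tag plus per-case detail to a one-line diagnosis."""
    if tag == "WRONG_ANSWER":
        causes = []
        if not detail.get("r_action", True):
            causes.append("database diverged from the gold replay")
        if not detail.get("r_output", True):
            causes.append("a required output was not communicated")
        return "; ".join(causes) or "reward 0 but no failing component recorded"
    if tag == "TIMEOUT":
        # loop_threshold is an arbitrary cut-off chosen for this sketch.
        if detail.get("n_tool_calls", 0) >= loop_threshold:
            return "likely tool loop"
        return "agent never concluded the conversation"
    if tag in ("ADAPTER_ERROR", "HARNESS_ERROR"):
        return "infrastructure failure: " + str(detail.get("error", "unknown"))
    return "unrecognized tag"


# Hypothetical per-case details.
print(diagnose("WRONG_ANSWER", {"r_action": False, "r_output": True}))
print(diagnose("TIMEOUT", {"n_tool_calls": 28}))
```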
