ALFWorld

Live

Household tasks in the real TextWorld engine; success rate over 134 unseen games.

How it works

  • 01

    Household tasks (pick & place, heat/cool/clean then place, examine in light, two-object place) played as text adventures in the REAL engine: TextWorld games compiled from ALFRED room layouts with a PDDL backend deciding what every action does. We drive the official alfworld package rather than reimplementing the simulator — the engine is the benchmark.

  • 02

    Heads up: this is the one live eval that cannot run with zero dependencies. Install the extra with pip install 'agi-evals[alfworld]', then run agi-evals download alfworld (the official downloader fetches the game files into ~/.cache/alfworld).

  • 03

    Protocol (fixed, documented): the agent sees the room observation (which states the task), each turn lists the admissible commands, and the agent replies with exactly one command. Episode ends on goal completion or the 50-step cap. This instructed zero-shot protocol differs from ReAct's two-shot prompting — compare scores within the platform, not against the ReAct paper.

  • 04

    Default split is eval_out_of_distribution: 134 games in rooms never seen during the benchmark's training phase. eval_in_distribution selects seen rooms.
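The turn loop in the protocol can be sketched as follows. This is a minimal illustration, not the platform's harness: ToyEnv, run_episode, and the agent callable are hypothetical names, and ToyEnv is a two-move stand-in rather than the alfworld engine. It only shows the shape of an episode: one command per turn, chosen against the current admissible list, ending on goal completion or the 50-step cap.

```python
from typing import Callable, Dict, List, Tuple

MAX_STEPS = 50  # step cap stated in the protocol

TAKE = "take apple 1 from countertop 1"
PUT = "put apple 1 in/on fridge 1"


class ToyEnv:
    """Hypothetical two-move game used only for illustration (not the alfworld API)."""

    def __init__(self) -> None:
        self.holding = False

    def reset(self) -> Tuple[str, List[str]]:
        self.holding = False
        obs = "You are in a kitchen. Your task is to: put an apple in the fridge."
        return obs, [TAKE]

    def step(self, command: str) -> Tuple[str, List[str], bool]:
        """Return (observation, admissible commands, goal_reached)."""
        if command == TAKE and not self.holding:
            self.holding = True
            return "You pick up the apple 1 from the countertop 1.", [PUT], False
        if command == PUT and self.holding:
            return "You put the apple 1 in/on the fridge 1.", [], True
        # Anything outside the admissible list leaves the state unchanged.
        return "Nothing happens.", ([PUT] if self.holding else [TAKE]), False


Agent = Callable[[str, List[str]], str]


def run_episode(env: ToyEnv, agent: Agent, max_steps: int = MAX_STEPS) -> Dict[str, object]:
    """Play one episode: one command per turn until the goal or the step cap."""
    obs, admissible = env.reset()
    for step in range(1, max_steps + 1):
        command = agent(obs, admissible).strip()
        obs, admissible, won = env.step(command)
        if won:
            return {"success": True, "steps": step}
    return {"success": False, "steps": max_steps}


def first_admissible(obs: str, admissible: List[str]) -> str:
    """Trivial agent: always pick the first listed command."""
    return admissible[0]
```

An agent that ignores the admissible list (for example, one that always replies "grab the apple") never changes the state in this toy game and runs into the cap.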

Scoring

  • 01

    score = task success rate (goal satisfied per the engine's PDDL check). Binary per episode — no partial credit for almost-completed tasks.

  • 02

    detail.task_type enables the paper's per-type breakdown (pick_and_place_simple is easiest; pick_two_obj_and_place and clean/heat/cool chains are where models fall down).

  • 03

    detail.steps shows how long success took; low-step TIMEOUTs usually mean malformed commands ('Nothing happens.' loops).
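The scoring rules above amount to a mean over binary outcomes, optionally grouped by detail.task_type. The sketch below assumes each per-episode record is a dict with a boolean "passed" field and a "detail" dict holding task_type and steps; that record shape and the function names are illustrative assumptions, not the platform's report schema, and the sample data is made up.

```python
from collections import defaultdict
from typing import Dict, List, Mapping


def success_rate(results: List[Mapping]) -> float:
    """Fraction of episodes whose goal was satisfied (binary, no partial credit)."""
    if not results:
        return 0.0
    return sum(1 for r in results if r["passed"]) / len(results)


def success_by_task_type(results: List[Mapping]) -> Dict[str, float]:
    """Per-type breakdown keyed on detail.task_type."""
    groups: Dict[str, List[Mapping]] = defaultdict(list)
    for r in results:
        groups[r["detail"]["task_type"]].append(r)
    return {task_type: success_rate(rs) for task_type, rs in groups.items()}


# Made-up sample records, assuming the shape described above.
sample = [
    {"passed": True, "detail": {"task_type": "pick_and_place_simple", "steps": 9}},
    {"passed": True, "detail": {"task_type": "pick_and_place_simple", "steps": 14}},
    {"passed": False, "detail": {"task_type": "pick_two_obj_and_place", "steps": 50}},
    {"passed": True, "detail": {"task_type": "pick_two_obj_and_place", "steps": 31}},
]

overall = success_rate(sample)
breakdown = success_by_task_type(sample)
```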

Using it

  • 01

    pip install 'agi-evals[alfworld]' && agi-evals download alfworld

  • 02

    CLI: agi-evals run alfworld --model openai:gpt-4o --concurrency 4

  • 03

    SDK: run_eval(ALFWorldRunner(split='eval_in_distribution', max_steps=50), patient). Each episode constructs its own env, so harness concurrency is safe.
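Before a long run it can help to confirm that the download step actually produced the game files. The check below is a sketch under two assumptions: that the data root is taken from the ALFWORLD_DATA environment variable and falls back to ~/.cache/alfworld when it is unset, and that a json_2.1.1 directory under that root indicates a completed download. The function names are hypothetical and not part of agi-evals or alfworld.

```python
import os
from pathlib import Path
from typing import Optional


def alfworld_data_dir() -> Path:
    """Resolve the data root: $ALFWORLD_DATA if set, else ~/.cache/alfworld (assumed default)."""
    root = os.environ.get("ALFWORLD_DATA") or "~/.cache/alfworld"
    return Path(root).expanduser()


def game_data_present(root: Optional[Path] = None) -> bool:
    """True if the json_2.1.1 directory exists under the data root."""
    base = Path(root) if root is not None else alfworld_data_dir()
    return (base / "json_2.1.1").is_dir()


if __name__ == "__main__":
    if not game_data_present():
        print(f"Game data not found under {alfworld_data_dir()}; run: agi-evals download alfworld")
```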

Troubleshooting

Every non-pass carries exactly one typed failure tag, so the diagnosis is mechanical: look up the tag in report.failure_counts, then the per-case detail.

Tag | What it means | What to do
TIMEOUT | 50 actions without satisfying the goal: flailing, command-format errors, or a plan that never converges. | Read the transcript tail: repeated 'Nothing happens.' means the agent is issuing commands not in the admissible list (check exact object naming like 'apple 1'); varied-but-aimless commands mean a planning failure, which is the finding.
WRONG_ANSWER | The engine terminated the episode without a win before the cap (rare). | Usually an engine-level termination; inspect detail.steps and rerun the single game with --limit 1 to reproduce.
ADAPTER_ERROR / HARNESS_ERROR | Engine not installed or game data missing surfaces here as an infra failure, excluded from the score. | pip install 'agi-evals[alfworld]' and agi-evals download alfworld; confirm $ALFWORLD_DATA/json_2.1.1 exists.
ADAPTER_ERROR / HARNESS_ERROR | Our side, not the model's: transport/auth failures or a harness bug. Excluded from the score automatically. | Check endpoint credentials and connectivity; if HARNESS_ERROR persists, file an issue with the per-case detail.error string.
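The TIMEOUT triage described in the table can be partly automated. The sketch below assumes you have the episode's observations as a list of strings and, for a given turn, the admissible commands; the function names, the ten-turn tail, and the 50% threshold are illustrative choices, not platform behavior. It labels a transcript tail dominated by 'Nothing happens.' as a command-format problem, and uses difflib from the standard library to suggest the closest admissible command for a malformed one.

```python
import difflib
from typing import List, Optional


def diagnose_timeout(observations: List[str], tail: int = 10, threshold: float = 0.5) -> str:
    """Classify a TIMEOUT from the last `tail` observations of the transcript."""
    recent = observations[-tail:]
    if not recent:
        return "no-transcript"
    stuck = sum(1 for obs in recent if obs.strip() == "Nothing happens.")
    # Mostly no-op turns suggests commands outside the admissible list.
    return "command-format" if stuck / len(recent) >= threshold else "planning"


def closest_admissible(command: str, admissible: List[str]) -> Optional[str]:
    """Suggest the admissible command most similar to a rejected one, if any."""
    matches = difflib.get_close_matches(command, admissible, n=1)
    return matches[0] if matches else None
```

For example, closest_admissible("take apple from countertop 1", ["take apple 1 from countertop 1", "go to fridge 1"]) points at the missing object index, which is the 'apple 1' naming issue the table mentions.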