
ALFWorld

Live

Household tasks in the real TextWorld engine; success rate over 134 unseen games.

How it works

01
Household tasks (pick & place, heat/cool/clean then place, examine in light, two-object place) played as text adventures in the real engine: TextWorld games compiled from ALFRED room layouts, with a PDDL backend deciding what every action does. We drive the official alfworld package rather than reimplementing the simulator, because the engine is the benchmark.
02
Heads up: this is the one live eval that cannot run with zero dependencies. Run pip install 'agi-eval[alfworld]' first, then agi-evals download alfworld (the official downloader fetches the game files into ~/.cache/alfworld).
03
Protocol (fixed, documented): the agent sees the room observation (which states the task), each turn lists the admissible commands, and the agent replies with exactly one command. The episode ends on goal completion or at the 50-step cap. This instructed zero-shot protocol differs from ReAct's two-shot prompting, so compare scores within the platform, not against the ReAct paper.
04
Default split is eval_out_of_distribution: 134 games in rooms never seen during the benchmark's training phase. eval_in_distribution selects seen rooms.
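
The turn loop above can be sketched as follows. This is a minimal illustration, not the harness's code: ToyGame is a hypothetical stand-in for one game and does not use the alfworld API, and run_episode and its result keys are assumed names. Only the 50-step cap, the one-command-per-turn rule, the 'Nothing happens.' response, and the TIMEOUT / WRONG_ANSWER tags come from this page.

```python
from typing import Callable, Dict, List, Tuple

MAX_STEPS = 50  # step cap stated in the protocol


class ToyGame:
    """Hypothetical one-room game; a stand-in, NOT the alfworld engine."""

    def __init__(self) -> None:
        self.holding = False
        self.won = False

    def reset(self) -> Tuple[str, List[str]]:
        obs = "You are in a kitchen. Your task is to: put apple 1 in fridge 1."
        return obs, self.admissible()

    def admissible(self) -> List[str]:
        # The admissible list changes with the game state.
        if self.holding:
            return ["put apple 1 in/on fridge 1", "look"]
        return ["take apple 1 from countertop 1", "look"]

    def step(self, command: str) -> Tuple[str, bool]:
        # Commands outside the admissible list are rejected.
        if command not in self.admissible():
            return "Nothing happens.", False
        if command.startswith("take"):
            self.holding = True
            return "You pick up the apple 1.", False
        if command.startswith("put"):
            self.won = True
            return "You put the apple 1 in the fridge 1.", True
        return "You look around.", False


Agent = Callable[[str, List[str]], str]


def run_episode(game: ToyGame, agent: Agent, max_steps: int = MAX_STEPS) -> Dict:
    """One command per turn; stop on termination or at the step cap."""
    obs, commands = game.reset()
    for step in range(1, max_steps + 1):
        command = agent(obs, commands).strip()
        obs, done = game.step(command)
        if done:
            failure = None if game.won else "WRONG_ANSWER"
            return {"success": game.won, "steps": step, "failure": failure}
        commands = game.admissible()
    return {"success": False, "steps": max_steps, "failure": "TIMEOUT"}


if __name__ == "__main__":
    # An agent that always picks the first admissible command wins in 2 steps.
    print(run_episode(ToyGame(), lambda obs, cmds: cmds[0]))
    # An agent that ignores the admissible list loops until the cap.
    print(run_episode(ToyGame(), lambda obs, cmds: "grab apple"))
```

The second agent shows why exact command strings matter: every reply outside the admissible list costs a step and changes nothing.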

Scoring

01
score = task success rate (goal satisfied per the engine's PDDL check). Binary per episode — no partial credit for almost-completed tasks.
02
detail.task_type enables the paper's per-type breakdown (pick_and_place_simple is easiest; pick_two_obj_and_place and clean/heat/cool chains are where models fall down).
03
detail.steps shows how long success took; a TIMEOUT always sits at the 50-step cap and often means malformed commands ('Nothing happens.' loops).
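
As a rough illustration of the scoring rules above, the sketch below computes the binary success rate and a per-type breakdown from per-case records. The record shape (a "failure" key holding None or a tag, plus a "detail" dict) and the summarize function are assumptions made for this example, not the platform's actual report schema. Infra tags are dropped before scoring, as described under Troubleshooting.

```python
from collections import defaultdict
from typing import Dict, List

# Infra tags are excluded from the score (see Troubleshooting).
INFRA_TAGS = {"ADAPTER_ERROR", "HARNESS_ERROR"}


def summarize(cases: List[Dict]) -> Dict:
    """Binary success rate overall and per task type (hypothetical record shape)."""
    scored = [c for c in cases if c["failure"] not in INFRA_TAGS]
    per_type = defaultdict(lambda: [0, 0])  # task_type -> [wins, total]
    for case in scored:
        bucket = per_type[case["detail"]["task_type"]]
        bucket[0] += int(case["failure"] is None)  # a pass carries no failure tag
        bucket[1] += 1
    wins = sum(w for w, _ in per_type.values())
    score = wins / len(scored) if scored else 0.0
    by_type = {t: w / n for t, (w, n) in per_type.items()}
    return {"score": score, "by_type": by_type}


if __name__ == "__main__":
    cases = [
        {"failure": None, "detail": {"task_type": "pick_and_place_simple", "steps": 12}},
        {"failure": None, "detail": {"task_type": "pick_and_place_simple", "steps": 9}},
        {"failure": "TIMEOUT", "detail": {"task_type": "pick_two_obj_and_place", "steps": 50}},
        {"failure": None, "detail": {"task_type": "pick_two_obj_and_place", "steps": 31}},
        {"failure": "HARNESS_ERROR", "detail": {"task_type": "pick_two_obj_and_place", "steps": 0}},
    ]
    print(summarize(cases))
```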

Using it

01
pip install 'agi-eval[alfworld]' && agi-evals download alfworld
02
CLI: agi-evals run alfworld --model openai:gpt-4o --concurrency 4
03
SDK: run_eval(ALFWorldRunner(split='eval_in_distribution', max_steps=50), patient). Each episode constructs its own env, so harness concurrency is safe.
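
Before a long run it can help to confirm that the download landed. The check below is a sketch under two assumptions: the data root is read from ALFWORLD_DATA, and it falls back to ~/.cache/alfworld when that variable is unset. The json_2.1.1 directory name is the one this page mentions under Troubleshooting. The helper names alfworld_data_dir and games_present are hypothetical and not part of the agi-eval SDK.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def alfworld_data_dir(env: Optional[Mapping[str, str]] = None) -> Path:
    """Resolve the data root: $ALFWORLD_DATA, else ~/.cache/alfworld (assumed fallback)."""
    env = os.environ if env is None else env
    root = env.get("ALFWORLD_DATA") or str(Path.home() / ".cache" / "alfworld")
    return Path(root).expanduser()


def games_present(env: Optional[Mapping[str, str]] = None) -> bool:
    """True when the json_2.1.1 game directory exists under the data root."""
    return (alfworld_data_dir(env) / "json_2.1.1").is_dir()


if __name__ == "__main__":
    if games_present():
        print("game data found under", alfworld_data_dir())
    else:
        print("game data missing; run: agi-evals download alfworld")
```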

Troubleshooting

Every non-pass carries exactly one typed failure tag, so the diagnosis is mechanical: look up the tag in report.failure_counts, then the per-case detail.

Tag	What it means	What to do
TIMEOUT	50 actions without satisfying the goal: flailing, command-format errors, or a plan that never converges.	Read the transcript tail: repeated 'Nothing happens.' means the agent is issuing commands not in the admissible list (check exact object naming like 'apple 1'); varied-but-aimless commands mean a planning failure, which is itself the finding rather than a harness problem.
WRONG_ANSWER	The engine terminated the episode without a win before the cap (rare).	Usually an engine-level termination; inspect detail.steps and rerun the single game with --limit 1 to reproduce.
ADAPTER_ERROR / HARNESS_ERROR	A missing engine install or missing game data surfaces here as an infra failure and is excluded from the score.	Run pip install 'agi-eval[alfworld]' and agi-evals download alfworld; confirm $ALFWORLD_DATA/json_2.1.1 exists.
ADAPTER_ERROR / HARNESS_ERROR	Our side, not the model's: transport/auth failures or a harness bug. Excluded from the score automatically.	Check endpoint credentials and connectivity; if HARNESS_ERROR persists, file an issue with the per-case detail.error string.
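
The TIMEOUT row's "read the transcript tail" advice can be roughly automated. The sketch below assumes the transcript is available as a list of observation strings; the function name triage_timeout, the 10-turn tail, and the 50% threshold are illustrative choices, not platform settings.

```python
from typing import List


def triage_timeout(observations: List[str], tail: int = 10, threshold: float = 0.5) -> str:
    """Classify a TIMEOUT by how often the engine rejected commands near the end."""
    tail_obs = observations[-tail:]
    if not tail_obs:
        return "no transcript"
    # 'Nothing happens.' is the engine's reply to a command it did not accept.
    rejected = sum(1 for obs in tail_obs if obs.strip() == "Nothing happens.")
    if rejected / len(tail_obs) >= threshold:
        return "command-format"  # check exact object naming, e.g. 'apple 1'
    return "planning"  # commands are accepted but never converge on the goal


if __name__ == "__main__":
    stuck = ["You open the fridge 1."] + ["Nothing happens."] * 9
    wandering = ["You arrive at shelf 1.", "You arrive at shelf 2."] * 5
    print(triage_timeout(stuck))
    print(triage_timeout(wandering))
```

A "planning" label is only a hint to read the full transcript; the threshold separates the two cases coarsely and will misjudge mixed episodes.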
