WebShop

Buy the right product from 1.18M items given a natural-language goal.

How it works

01
WebShop is a simulated e-commerce site built from 1.18M real products. The agent reads a shopping instruction ('long sleeve navy shirt under $30'), then searches, browses result pages, opens items, picks options, and buys, all through two text actions: search[query] and click[element]. Every turn, the observation lists the current page's clickable elements.
02
The environment is itself the benchmark (as with ALFWorld and ScienceWorld), so this runner drives the official web_agent_site engine. WebShop isn't pip-installable: clone github.com/princeton-nlp/WebShop, run its setup.sh (-d small indexes 1,000 products for smoke runs; the full data set is needed to reproduce the paper), then set WEBSHOP_PATH to the checkout.
03
Episodes follow the paper: the 500 fixed test goals (indices 0–499) and a 100-action cap per episode. The engine holds one product index in memory and isn't thread-safe, so episodes are serialized internally and raising --concurrency does not make this benchmark run faster.
04
Protocol note: this instructed zero-shot protocol differs from the paper's IL/RL baselines and ReAct's few-shot prompting; compare scores within the platform, not against paper tables.
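The action format and the action cap described above can be sketched in a few lines of Python. This is an illustration only, not the runner's code: `parse_action`, `run_episode`, and the `step` callable (action in, `(observation, reward, done)` out) are hypothetical names chosen for the example, and the real engine's interface may differ.

```python
import re
from typing import Callable, Optional, Tuple

MAX_ACTIONS = 100  # per-episode action cap stated in the docs above

# The two text actions: search[query] and click[element].
_ACTION_RE = re.compile(r"^(search|click)\[(.+)\]$", re.DOTALL)


def parse_action(text: str) -> Optional[Tuple[str, str]]:
    """Return (verb, argument) for a well-formed action, else None."""
    match = _ACTION_RE.match(text.strip())
    if match is None:
        return None
    return match.group(1), match.group(2).strip()


def run_episode(
    policy: Callable[[str], str],
    step: Callable[[str], Tuple[str, float, bool]],  # hypothetical env interface
    first_observation: str,
    max_actions: int = MAX_ACTIONS,
) -> Tuple[float, int]:
    """Run one episode and return (reward, steps_taken).

    An episode that never finishes within the cap returns reward 0.0.
    """
    observation = first_observation
    for steps_taken in range(1, max_actions + 1):
        action = policy(observation)
        observation, reward, done = step(action)
        if done:  # in this sketch, done means the purchase happened
            return reward, steps_taken
    return 0.0, max_actions
```

For example, `parse_action("click[Buy Now]")` yields `("click", "Buy Now")`, while free-form text such as `"buy it"` yields `None`.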

Scoring

01
The reward is the benchmark's own: an attribute/option/price/type match in [0,1] computed by the environment at purchase. Partial credit is WebShop's design — buying a near-miss product scores its match fraction.
02
The score is the mean purchase reward; an episode counts as passed only on a perfect 1.0 match. detail.reward, detail.steps, and detail.purchased are enough to reconstruct every episode.
03
No purchase within the cap scores 0 (TIMEOUT) — an agent that browses forever earns nothing, exactly as upstream.
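The scoring rules above amount to a small aggregation, sketched here under stated assumptions. The `Episode` class is hypothetical: its fields mirror detail.reward, detail.steps, and detail.purchased, and `purchased` is assumed to be a boolean. Harness-side errors, which are excluded from the score, are not modeled.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Episode:
    reward: float    # mirrors detail.reward, in [0, 1]
    steps: int       # mirrors detail.steps
    purchased: bool  # mirrors detail.purchased (assumed boolean)


def classify(ep: Episode) -> str:
    """Map one episode to PASS, WRONG_ANSWER, or TIMEOUT."""
    if not ep.purchased:
        return "TIMEOUT"       # no purchase within the cap
    if ep.reward >= 1.0:
        return "PASS"          # only a perfect match passes
    return "WRONG_ANSWER"      # purchased, but partial credit only


def summarize(episodes: List[Episode]) -> Dict[str, float]:
    """Mean purchase reward plus pass count; no purchase counts as 0."""
    rewards = [ep.reward if ep.purchased else 0.0 for ep in episodes]
    score = sum(rewards) / len(rewards) if rewards else 0.0
    passed = sum(1 for ep in episodes if classify(ep) == "PASS")
    return {"score": score, "passed": passed, "total": len(episodes)}
```

With one perfect purchase, one purchase at reward 0.5, and one episode that never buys, this gives a score of 0.5 with one pass out of three.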

Using it

01
git clone https://github.com/princeton-nlp/WebShop && cd WebShop && ./setup.sh -d small
02
export WEBSHOP_PATH=/path/to/WebShop
03
agi-evals run webshop --model openai:gpt-4o --limit 50 --concurrency 1
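Before the first run, it can help to confirm that WEBSHOP_PATH points at a usable checkout. The following is a minimal preflight sketch, not part of the tool: `check_webshop_path` is a hypothetical helper, and it assumes the web_agent_site package sits as a directory at the root of the cloned repository.

```python
import os
from pathlib import Path
from typing import List, Mapping


def check_webshop_path(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return a list of problems with WEBSHOP_PATH (empty means it looks fine)."""
    raw = env.get("WEBSHOP_PATH")
    if not raw:
        return ["WEBSHOP_PATH is not set"]
    root = Path(raw).expanduser()
    problems: List[str] = []
    if not root.is_dir():
        problems.append(f"{root} is not a directory")
    elif not (root / "web_agent_site").is_dir():
        # Assumption: the engine package lives at the checkout root.
        problems.append(f"{root} has no web_agent_site/ directory")
    return problems


if __name__ == "__main__":
    for problem in check_webshop_path():
        print(f"preflight: {problem}")
```

An empty result only shows that the directory layout looks right; it does not verify that setup.sh finished or that the data files are present.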

Troubleshooting

Every non-pass carries exactly one typed failure tag, so the diagnosis is mechanical: look up the tag in report.failure_counts, then the per-case detail.

Tag	What it means	What to do
HARNESS_ERROR	The web_agent_site engine isn't importable.	Clone the WebShop repo, run its setup.sh, and set WEBSHOP_PATH. The engine needs its data files and (for the full index) Java/Pyserini.
TIMEOUT	100 actions without clicking Buy Now — usually the agent loops between search and results.	Check the transcript for repeated identical searches; smaller models often need the reminder that options must be clicked before Buy Now.
WRONG_ANSWER	Purchase completed but matched imperfectly (reward < 1.0).	detail.reward shows how close: ~0.5–0.9 usually means right product, missed options (size/color); low values mean wrong product category entirely.
ADAPTER_ERROR / HARNESS_ERROR	Our side, not the model's: transport/auth failures or a harness bug. Excluded from the score automatically.	Check endpoint credentials and connectivity; if HARNESS_ERROR persists, file an issue with the per-case detail.error string.
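The lookup described above can be sketched as a small triage helper. It assumes report.failure_counts is available as a plain mapping from tag to count; the function names are hypothetical, and the 0.5 threshold is only a rule of thumb taken from the WRONG_ANSWER row of the table, not a boundary defined by the benchmark.

```python
from typing import Dict, List, Tuple


def rank_failures(failure_counts: Dict[str, int]) -> List[Tuple[str, int]]:
    """Order failure tags by count, most frequent first (ties by tag name)."""
    return sorted(failure_counts.items(), key=lambda item: (-item[1], item[0]))


def wrong_answer_hint(reward: float) -> str:
    """Rough reading of detail.reward for a completed purchase."""
    if reward >= 1.0:
        return "perfect match; not a WRONG_ANSWER"
    if reward >= 0.5:  # assumed cutoff, see the lead-in
        return "likely right product, missed options (size/color)"
    return "likely wrong product category"
```

Starting with the most frequent tag and then reading the per-case detail for that tag usually shortens the diagnosis.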
