WebShop

Live

Buy the right product from 1.18M items given a natural-language goal.

How it works

  • 01

    WebShop is a simulated e-commerce site built from 1.18M real products. The agent reads a shopping instruction (for example, 'long sleeve navy shirt under $30'), then searches, browses result pages, opens items, picks options, and buys. Every step is a text action, either search[query] or click[element], and each page's clickable elements are listed every turn.

  • 02

    The environment IS the benchmark (like ALFWorld and ScienceWorld), so this runner drives the official web_agent_site engine. WebShop isn't pip-installable: clone github.com/princeton-nlp/WebShop, run its setup.sh (-d small indexes 1,000 products for smoke runs; full data reproduces the paper), then set WEBSHOP_PATH to the checkout.

  • 03

    Episodes follow the paper: the 500 fixed test goals (indices 0–499) and a 100-action cap. The engine holds one product index in memory and isn't thread-safe — episodes serialize internally, so --concurrency adds no speed here.

  • 04

    Protocol note: this instructed zero-shot protocol differs from the paper's IL/RL baselines and ReAct's few-shot prompting; compare scores within the platform, not against paper tables.
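
The text-action format described above can be illustrated with a small parser. This is a minimal sketch, not the runner's actual code: the function name parse_action, the has_search_bar flag, and the case-insensitive matching of click targets are all assumptions made for illustration.

```python
import re

# Matches the two text actions described above: search[query] and click[element].
ACTION_RE = re.compile(r"^(search|click)\[(.*)\]$", re.DOTALL)


def parse_action(text, clickables, has_search_bar=True):
    """Return (verb, argument) for a valid action, or None if it is invalid.

    `clickables` is the list of clickable labels shown for the current page.
    `has_search_bar` is a hypothetical flag for pages that accept a search.
    """
    match = ACTION_RE.match(text.strip())
    if match is None:
        return None  # not in search[...] / click[...] form
    verb, arg = match.group(1), match.group(2).strip()
    if verb == "search":
        # A search needs a non-empty query and a page that offers a search bar.
        return (verb, arg) if has_search_bar and arg else None
    # Assumption of this sketch: click targets are compared case-insensitively.
    allowed = {label.lower() for label in clickables}
    return (verb, arg.lower()) if arg.lower() in allowed else None
```

A click on a label that is not in the current page's list is rejected here, which mirrors the idea that only the listed clickables are valid targets on a given turn.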

Scoring

  • 01

    The reward is the benchmark's own: an attribute/option/price/type match in [0,1] computed by the environment at purchase. Partial credit is WebShop's design — buying a near-miss product scores its match fraction.

  • 02

    The run score is the mean purchase reward across episodes; an episode counts as passed only on a perfect 1.0 match. The detail.reward, detail.steps, and detail.purchased fields let you reconstruct every episode.

  • 03

    No purchase within the cap scores 0 (TIMEOUT) — an agent that browses forever earns nothing, exactly as upstream.
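
The scoring rules above can be sketched as a small aggregation. The Episode class and the summarize and classify helpers are hypothetical names, and treating detail.purchased as a boolean is an assumption; the sketch also ignores ADAPTER_ERROR and HARNESS_ERROR cases.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Episode:
    reward: float    # mirrors detail.reward, a match fraction in [0, 1]
    steps: int       # mirrors detail.steps
    purchased: bool  # mirrors detail.purchased (assumed boolean in this sketch)


def classify(ep: Episode) -> Optional[str]:
    """Return None for a pass, otherwise the failure tag for the episode."""
    if not ep.purchased:
        return "TIMEOUT"       # no purchase within the action cap
    if ep.reward < 1.0:
        return "WRONG_ANSWER"  # purchase completed but matched imperfectly
    return None


def summarize(episodes: List[Episode]) -> dict:
    """Mean purchase reward plus pass count; no purchase contributes 0."""
    rewards = [ep.reward if ep.purchased else 0.0 for ep in episodes]
    score = sum(rewards) / len(rewards) if rewards else 0.0
    passed = sum(1 for ep in episodes if classify(ep) is None)
    return {"score": score, "passed": passed, "total": len(episodes)}
```

For example, one perfect purchase, one purchase with reward 0.5, and one episode that never buys give a score of 0.5 with one pass out of three.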

Using it

  • 01

    git clone https://github.com/princeton-nlp/WebShop && cd WebShop && ./setup.sh -d small

  • 02

    export WEBSHOP_PATH=/path/to/WebShop

  • 03

    agi-evals run webshop --model openai:gpt-4o --limit 50 --concurrency 1
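
Before a run, it can help to confirm that WEBSHOP_PATH points at a usable checkout. The check below is a sketch only: check_webshop_path is a hypothetical helper, not part of agi-evals, and it assumes the web_agent_site package sits at the top level of the cloned repository.

```python
import os
from pathlib import Path


def check_webshop_path(env=None):
    """Return a list of problems with WEBSHOP_PATH; an empty list means it looks fine."""
    env = os.environ if env is None else env
    raw = env.get("WEBSHOP_PATH")
    if not raw:
        return ["WEBSHOP_PATH is not set"]
    root = Path(raw).expanduser()
    if not root.is_dir():
        return [f"WEBSHOP_PATH does not point to a directory: {root}"]
    problems = []
    # Assumption: the engine package is a top-level directory of the checkout.
    if not (root / "web_agent_site").is_dir():
        problems.append(f"no web_agent_site directory under {root}")
    return problems


if __name__ == "__main__":
    for problem in check_webshop_path():
        print("problem:", problem)
```

An empty result only shows that the directory layout looks right; it does not verify that setup.sh finished or that the data files are present.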

Troubleshooting

Every non-pass carries exactly one typed failure tag, so the diagnosis is mechanical: look up the tag in report.failure_counts, then the per-case detail.

| Tag | What it means | What to do |
| --- | --- | --- |
| HARNESS_ERROR | The web_agent_site engine isn't importable. | Clone the WebShop repo, run its setup.sh, and set WEBSHOP_PATH. The engine needs its data files and (for the full index) Java/Pyserini. |
| TIMEOUT | 100 actions without clicking Buy Now; usually the agent loops between search and results. | Check the transcript for repeated identical searches; smaller models often need the reminder that options must be clicked before Buy Now. |
| WRONG_ANSWER | Purchase completed but matched imperfectly (reward < 1.0). | detail.reward shows how close: ~0.5–0.9 usually means right product, missed options (size/color); low values mean wrong product category entirely. |
| ADAPTER_ERROR / HARNESS_ERROR | Our side, not the model's: transport/auth failures or a harness bug. Excluded from the score automatically. | Check endpoint credentials and connectivity; if HARNESS_ERROR persists, file an issue with the per-case detail.error string. |
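
Two of the checks above are easy to script: spotting repeated identical searches in a TIMEOUT transcript, and reading detail.reward for a WRONG_ANSWER. The sketch below assumes the transcript is available as a list of action strings; the helper names and the 0.5 cutoff (taken from the rough ~0.5–0.9 band above) are illustrative, not part of the platform.

```python
from collections import Counter


def repeated_searches(actions, min_repeats=3):
    """Return (action, count) pairs for searches issued at least `min_repeats` times."""
    normalized = (a.strip().lower() for a in actions)
    counts = Counter(a for a in normalized if a.startswith("search["))
    return [(query, n) for query, n in counts.most_common() if n >= min_repeats]


def wrong_answer_hint(reward, near_miss_floor=0.5):
    """Map a reward to a rough diagnosis; the cutoff is a heuristic assumption."""
    if reward >= 1.0:
        return "perfect match; not a WRONG_ANSWER"
    if reward >= near_miss_floor:
        return "likely right product, missed options (size/color)"
    return "likely wrong product category"
```

A query that shows up many times in repeated_searches suggests the search-and-results loop described for TIMEOUT; the reward hint is only a starting point for reading the per-case detail.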