WebShop
Live
Buy the right product from 1.18M items given a natural-language goal.
How it works
- 01
WebShop is a simulated e-commerce site built from 1.18M real products. The agent reads a shopping instruction ('long sleeve navy shirt under $30'), then searches, browses result pages, opens items, picks options, and buys — all through text actions: search[query] and click[element], with each page's clickables listed every turn.
- 02
The environment IS the benchmark (like ALFWorld and ScienceWorld), so this runner drives the official web_agent_site engine. WebShop isn't pip-installable: clone github.com/princeton-nlp/WebShop, run its setup.sh (-d small indexes 1,000 products for smoke runs; full data reproduces the paper), then set WEBSHOP_PATH to the checkout.
- 03
Episodes follow the paper: the 500 fixed test goals (indices 0–499) and a 100-action cap. The engine holds one product index in memory and isn't thread-safe, so episodes are serialized internally and raising --concurrency adds no speed here.
- 04
Protocol note: this instructed zero-shot protocol differs from the paper's IL/RL baselines and ReAct's few-shot prompting; compare scores within the platform, not against paper tables.
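The two text actions described above can be checked before they reach the engine. The following is a minimal sketch, not the runner's actual parser: `parse_action` and `is_valid_click` are hypothetical helper names, and the case-insensitive comparison against the page's clickables is an assumption.

```python
import re
from typing import Optional, Sequence, Tuple

# Matches the two action shapes named above: search[...] and click[...].
_ACTION_RE = re.compile(r"^\s*(search|click)\[(.*)\]\s*$", re.IGNORECASE | re.DOTALL)


def parse_action(text: str) -> Optional[Tuple[str, str]]:
    """Return (verb, argument) for a well-formed action, else None."""
    match = _ACTION_RE.match(text)
    if match is None:
        return None
    verb = match.group(1).lower()
    argument = match.group(2).strip()
    if not argument:
        # An empty query or element name cannot be executed.
        return None
    return verb, argument


def is_valid_click(argument: str, clickables: Sequence[str]) -> bool:
    """True if the argument names one of the clickables listed for the page.

    Assumption: matching is case-insensitive.
    """
    wanted = argument.lower()
    return any(wanted == candidate.lower() for candidate in clickables)
```

For example, `parse_action("search[long sleeve navy shirt]")` returns `("search", "long sleeve navy shirt")`, while free text such as `"buy it"` returns `None`.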
Scoring
- 01
The reward is the benchmark's own: an attribute/option/price/type match in [0,1] computed by the environment at purchase. Partial credit is WebShop's design — buying a near-miss product scores its match fraction.
- 02
score = mean purchase reward; an episode counts as passed only on a perfect 1.0 match. detail.reward, detail.steps, and detail.purchased let you reconstruct every episode.
- 03
No purchase within the cap scores 0 (TIMEOUT) — an agent that browses forever earns nothing, exactly as upstream.
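The scoring rules above can be sketched as a small aggregation. This is an illustration under stated assumptions, not the platform's implementation: the `Episode` class and `summarize` function are hypothetical, and only the field names `reward`, `steps`, and `purchased` come from the per-case detail described above.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Episode:
    # Field names mirror detail.reward, detail.steps, detail.purchased.
    reward: float
    steps: int
    purchased: bool


def effective_reward(episode: Episode) -> float:
    """An episode with no purchase within the cap scores 0."""
    return episode.reward if episode.purchased else 0.0


def summarize(episodes: List[Episode]) -> Dict[str, float]:
    """Mean purchase reward plus the count of perfect (1.0) matches."""
    if not episodes:
        raise ValueError("no episodes to score")
    rewards = [effective_reward(e) for e in episodes]
    return {
        "score": sum(rewards) / len(rewards),
        "passed": sum(1 for r in rewards if r == 1.0),
        "total": len(rewards),
    }
```

With three episodes (a perfect purchase, a near-miss at 0.5, and a timeout), `score` is 0.5 and `passed` is 1: the near-miss contributes partial credit to the mean but does not count as a pass.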
Using it
- 01
git clone https://github.com/princeton-nlp/WebShop && cd WebShop && ./setup.sh -d small
- 02
export WEBSHOP_PATH=/path/to/WebShop
- 03
agi-evals run webshop --model openai:gpt-4o --limit 50 --concurrency 1
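Before launching a run, it can help to confirm that WEBSHOP_PATH is usable. The following is a hedged preflight sketch: `check_webshop_path` is a hypothetical helper, and it assumes the checkout contains a `web_agent_site` directory at its top level, matching the engine name mentioned above. It does not verify that setup.sh has built the data files.

```python
import os
from pathlib import Path
from typing import List, Mapping


def check_webshop_path(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    raw = env.get("WEBSHOP_PATH")
    if not raw:
        return ["WEBSHOP_PATH is not set"]

    root = Path(raw).expanduser()
    if not root.is_dir():
        return [f"WEBSHOP_PATH does not point to a directory: {root}"]

    problems: List[str] = []
    # Assumption: the engine package sits at the top level of the checkout.
    if not (root / "web_agent_site").is_dir():
        problems.append(f"no web_agent_site directory under {root}")
    return problems


if __name__ == "__main__":
    for problem in check_webshop_path():
        print("preflight:", problem)
```

Passing the environment as a mapping keeps the check easy to exercise without touching the real process environment.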
Troubleshooting
Every non-pass carries exactly one typed failure tag, so the diagnosis is mechanical: look up the tag in report.failure_counts, then the per-case detail.
| Tag | What it means | What to do |
|---|---|---|
| HARNESS_ERROR | The web_agent_site engine isn't importable. | Clone the WebShop repo, run its setup.sh, and set WEBSHOP_PATH. The engine needs its data files and (for the full index) Java/Pyserini. |
| TIMEOUT | 100 actions without clicking Buy Now — usually the agent loops between search and results. | Check the transcript for repeated identical searches; smaller models often need the reminder that options must be clicked before Buy Now. |
| WRONG_ANSWER | Purchase completed but matched imperfectly (reward < 1.0). | detail.reward shows how close: ~0.5–0.9 usually means right product, missed options (size/color); low values mean wrong product category entirely. |
| ADAPTER_ERROR / HARNESS_ERROR (other causes) | Our side, not the model's: transport/auth failures or a harness bug. Excluded from the score automatically. | Check endpoint credentials and connectivity; if HARNESS_ERROR persists, file an issue with the per-case detail.error string. |
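For the TIMEOUT row, the check for repeated identical searches can be automated. This is a minimal sketch that assumes the transcript is available as a plain list of action strings; the function name `repeated_searches` and the `min_repeats` threshold are hypothetical and not part of the platform.

```python
from collections import Counter
from typing import Dict, Iterable


def repeated_searches(actions: Iterable[str], min_repeats: int = 3) -> Dict[str, int]:
    """Count search actions that recur at least min_repeats times.

    Actions are lowercased and whitespace-collapsed first, so trivially
    different spellings of the same query are counted together.
    """
    counts = Counter(
        " ".join(action.lower().split())
        for action in actions
        if action.strip().lower().startswith("search[")
    )
    return {query: n for query, n in counts.items() if n >= min_repeats}
```

A non-empty result on a TIMEOUT episode suggests the agent was looping between search and results rather than opening items and selecting options.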