ScienceWorld

Interactive science experiments requiring procedural understanding.

How it works

01
Thirty elementary-science tasks (boiling water, measuring melting points, growing plants, wiring circuits, testing conductivity, genetic crosses, and more) in an interactive text environment where the PROCEDURE is what's tested: you boil water by finding a pot, filling it, activating the stove, and waiting, not by saying 'boil water'.
02
Like ALFWorld, the environment is the benchmark: the platform drives the official scienceworld package. The engine is a bundled JVM jar, so run pip install 'agi-eval[scienceworld]' and make sure Java is on PATH.
03
Protocol: on each turn the agent sees the observation and task description and replies with exactly one command ('open door to kitchen', 'activate stove', 'focus on substance in pot'), within a 100-step budget. Each episode constructs its own engine instance and therefore spawns its own JVM, so keep --concurrency modest.
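
The turn protocol above can be sketched as a simple loop. This is a minimal illustration, not the platform's harness: ToyEnv, run_episode, and scripted_policy are hypothetical names, and the toy environment only stands in for the real JVM-backed engine so the example runs without Java.

```python
from typing import Callable, Iterable, Tuple

STEP_BUDGET = 100  # per-episode step budget stated in the protocol


class ToyEnv:
    """Hypothetical stand-in for the engine: a scripted two-step task."""

    def __init__(self) -> None:
        self.score = 0

    def reset(self) -> str:
        self.score = 0
        return "You are in the hallway. Task: activate the stove."

    def step(self, command: str) -> Tuple[str, int, bool]:
        # Returns (observation, raw score so far, episode finished?).
        if command == "open door to kitchen":
            self.score = max(self.score, 50)
            return "The door is now open.", self.score, False
        if command == "activate stove" and self.score >= 50:
            self.score = 100
            return "The stove is now activated.", self.score, True
        return "Nothing happens.", self.score, False


def scripted_policy(commands: Iterable[str]) -> Callable[[str], str]:
    """Hypothetical agent: replays fixed commands, then idles."""
    remaining = iter(commands)
    return lambda observation: next(remaining, "look around")


def run_episode(env, policy, budget: int = STEP_BUDGET) -> Tuple[int, int]:
    """One command per turn until the task completes or the budget runs out."""
    observation = env.reset()
    score = 0
    for step in range(1, budget + 1):
        command = policy(observation)  # exactly one command per turn
        observation, score, done = env.step(command)
        if done:
            return score, step
    return score, budget  # budget exhausted: this is the TIMEOUT case


if __name__ == "__main__":
    agent = scripted_policy(["open door to kitchen", "activate stove"])
    print(run_episode(ToyEnv(), agent))  # (100, 2)
```

An agent that never issues a useful command runs the full 100 steps and ends with a raw score of 0, which matches the TIMEOUT case described under Troubleshooting.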

Scoring

01
ScienceWorld gives PARTIAL CREDIT: episodes earn 0-100 for completed sub-goals, and the engine can return a negative score for an irreversible blunder. The platform reports score = max(0, raw)/100 per episode (negative raw scores are clamped to zero before averaging, the conventional choice) and keeps the raw engine score in detail.raw_score.
02
passed=true only for a perfect 100; a TIMEOUT with raw_score 50 still contributes 0.5 to the mean — read score and pass_rate together.
03
detail.task enables the paper's per-task breakdown across the 30 task types.
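
The scoring rules above reduce to a few lines of arithmetic. The sketch below assumes each episode is available as a dict with task and raw_score keys; that record shape and the function names are illustrative assumptions, not the platform's report schema.

```python
from collections import defaultdict
from typing import Dict, List


def episode_score(raw_score: float) -> float:
    """score = max(0, raw) / 100, as stated in the scoring rules."""
    return max(0, raw_score) / 100


def summarize(cases: List[dict]) -> Dict[str, float]:
    """Mean clamped score plus the share of perfect (raw == 100) episodes."""
    scores = [episode_score(c["raw_score"]) for c in cases]
    passes = [c["raw_score"] == 100 for c in cases]
    return {
        "score": sum(scores) / len(scores),
        "pass_rate": sum(passes) / len(passes),
    }


def per_task(cases: List[dict]) -> Dict[str, float]:
    """Mean clamped score per task type (the per-task breakdown)."""
    grouped = defaultdict(list)
    for c in cases:
        grouped[c["task"]].append(episode_score(c["raw_score"]))
    return {task: sum(v) / len(v) for task, v in grouped.items()}


if __name__ == "__main__":
    # Hypothetical episodes; task names are placeholders.
    cases = [
        {"task": "boil", "raw_score": 100},   # perfect: passes
        {"task": "boil", "raw_score": 50},    # partial credit, no pass
        {"task": "melt", "raw_score": -100},  # blunder: clamped to 0
        {"task": "melt", "raw_score": 0},
    ]
    print(summarize(cases))  # {'score': 0.375, 'pass_rate': 0.25}
    print(per_task(cases))   # {'boil': 0.75, 'melt': 0.0}
```

The gap between score (0.375) and pass_rate (0.25) in this example comes entirely from the partial-credit episode, which is why the two numbers should be read together.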

Using it

01
pip install 'agi-eval[scienceworld]' # needs a Java runtime
02
agi-evals download scienceworld # verifies engine + lists test episodes
03
agi-evals run scienceworld --model openai:gpt-4o --limit 30 --concurrency 2

Troubleshooting

Every non-pass carries exactly one typed failure tag, so the diagnosis is mechanical: look up the tag in report.failure_counts, then the per-case detail.

Tag	What it means	What to do
TIMEOUT	100 steps without completing the task — check detail.raw_score to see how far the agent got.	raw_score 0 means the agent never started the procedure (often navigation failure — rooms must be reached via 'go to'); mid-range scores mean it stalled at a sub-goal, which the transcript pinpoints.
WRONG_ANSWER	Episode ended below 100 — commonly focusing on the wrong object ('focus on' is how the env attributes your result) or an irreversible mistake (negative raw_score).	detail.raw_score < 0 means the agent broke the experiment; re-read the task's focus instructions.
ADAPTER_ERROR / HARNESS_ERROR	A missing engine or Java runtime surfaces as an infra failure and is excluded from the score.	Run pip install 'agi-eval[scienceworld]' and confirm `java -version` works.
ADAPTER_ERROR / HARNESS_ERROR	Our side, not the model's: transport/auth failures or a harness bug. Excluded from the score automatically.	Check endpoint credentials and connectivity; if HARNESS_ERROR persists, file an issue with the per-case detail.error string.
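
The table's decision logic can be sketched as a small triage helper. The per-case record shape used here (a failure key holding the tag and a detail dict holding raw_score) is an assumption made for illustration; only the tag names and detail.raw_score come from the text above, so adapt the field access to the actual report.

```python
from collections import Counter
from typing import List, Optional

INFRA_TAGS = {"ADAPTER_ERROR", "HARNESS_ERROR"}  # excluded from the score


def failure_counts(cases: List[dict]) -> Counter:
    """Tally failure tags across cases (passes carry no tag)."""
    return Counter(c["failure"] for c in cases if c.get("failure"))


def triage(case: dict) -> str:
    """Map one case to the next diagnostic step from the table."""
    tag: Optional[str] = case.get("failure")
    raw = case.get("detail", {}).get("raw_score")
    if tag is None:
        return "passed"
    if tag in INFRA_TAGS:
        return "infra: check Java, the install, credentials, connectivity"
    if raw is None:
        return "no raw_score recorded: inspect the per-case detail"
    if tag == "TIMEOUT":
        if raw == 0:
            return "never started the procedure: check navigation"
        return "stalled at a sub-goal: read the transcript"
    if tag == "WRONG_ANSWER":
        if raw < 0:
            return "irreversible mistake: experiment was broken"
        return "ended below 100: check the 'focus on' target"
    return "unrecognized tag"


if __name__ == "__main__":
    # Hypothetical cases for illustration only.
    cases = [
        {"failure": None, "detail": {"raw_score": 100}},
        {"failure": "TIMEOUT", "detail": {"raw_score": 0}},
        {"failure": "TIMEOUT", "detail": {"raw_score": 50}},
        {"failure": "WRONG_ANSWER", "detail": {"raw_score": -100}},
        {"failure": "HARNESS_ERROR", "detail": {}},
    ]
    print(failure_counts(cases))
    for case in cases:
        print(case["failure"], "->", triage(case))
```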
