ScienceWorld

Interactive science experiments requiring procedural understanding.

How it works

  • 01

    Thirty elementary-science tasks (boiling water, measuring melting points, growing plants, wiring circuits, testing conductivity, genetic crosses, and more) in an interactive text environment where the PROCEDURE is what's tested: you boil water by finding a pot, filling it, activating the stove, and waiting, not by saying 'boil water'.

  • 02

    Like ALFWorld, the environment is the benchmark: the harness drives the official scienceworld package. The engine is a bundled JVM jar, so run pip install 'agi-evals[scienceworld]' and make sure Java is on PATH.

  • 03

    Protocol: the agent sees the observation and the task description and replies with one command per turn ('open door to kitchen', 'activate stove', 'focus on substance in pot'), within a 100-step budget. Each episode constructs its own engine instance, and each instance spawns a JVM, so keep --concurrency modest.
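
The turn-by-turn protocol above can be sketched as a plain loop. This is a minimal illustration only: ScriptedStubEnv is a toy stand-in for the real engine, and the names run_episode, ScriptedStubEnv, and the (observation, raw_score, done) step signature are assumptions made for this example, not the platform's or the scienceworld package's actual API.

```python
from typing import Callable, Dict, List, Tuple

STEP_BUDGET = 100  # per-episode step limit stated in the protocol


class ScriptedStubEnv:
    """Toy stand-in for the engine (hypothetical, for illustration only).

    Awards equal partial credit for each expected command issued in order.
    """

    def __init__(self, expected: List[str]) -> None:
        self.expected = list(expected)
        self.progress = 0

    def reset(self) -> str:
        self.progress = 0
        return "You are in the hallway."

    def step(self, command: str) -> Tuple[str, float, bool]:
        if self.progress < len(self.expected) and command == self.expected[self.progress]:
            self.progress += 1
        raw_score = 100.0 * self.progress / len(self.expected)
        done = self.progress == len(self.expected)
        return f"You tried: {command}", raw_score, done


def run_episode(
    env: ScriptedStubEnv,
    task_description: str,
    policy: Callable[[str, str], str],
    budget: int = STEP_BUDGET,
) -> Dict[str, object]:
    """Ask the policy for one command per turn until done or out of steps."""
    observation = env.reset()
    raw_score = 0.0
    for step in range(1, budget + 1):
        command = policy(task_description, observation)  # exactly one command
        observation, raw_score, done = env.step(command)
        if done:
            return {"raw_score": raw_score, "steps": step, "timed_out": False}
    return {"raw_score": raw_score, "steps": budget, "timed_out": True}
```

A policy that keeps issuing an unhelpful command exhausts the budget and returns timed_out=True with whatever raw score it had reached, which is the situation the TIMEOUT tag describes.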

Scoring

  • 01

    ScienceWorld gives PARTIAL CREDIT: episodes score 0-100 for completed sub-goals, with negative scores for irreversible blunders. The platform reports score = max(0, raw)/100 per episode (clamped at zero, the conventional averaging choice) with the raw engine score in detail.raw_score.

  • 02

    passed=true only for a perfect 100; a TIMEOUT with raw_score 50 still contributes 0.5 to the mean — read score and pass_rate together.

  • 03

    detail.task enables the paper's per-task breakdown across the 30 task types.
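
The scoring rules above reduce to a few lines of arithmetic. The sketch below assumes each episode record is a dict with "task" and "raw_score" keys; that record shape and the function names are hypothetical, chosen only to illustrate the clamping, the pass criterion, and a per-task mean.

```python
from collections import defaultdict
from typing import Dict, List


def episode_score(raw_score: float) -> float:
    """Clamp negative raw scores to zero, then scale 0-100 down to 0-1."""
    return max(0.0, raw_score) / 100.0


def is_pass(raw_score: float) -> bool:
    """An episode passes only with a perfect raw score of 100."""
    return raw_score == 100


def summarize(records: List[dict]) -> Dict[str, float]:
    """Mean clamped score and fraction of perfect episodes."""
    n = len(records)
    return {
        "score": sum(episode_score(r["raw_score"]) for r in records) / n,
        "pass_rate": sum(is_pass(r["raw_score"]) for r in records) / n,
    }


def per_task_mean(records: List[dict]) -> Dict[str, float]:
    """Group clamped scores by task type and average each group."""
    grouped = defaultdict(list)
    for r in records:
        grouped[r["task"]].append(episode_score(r["raw_score"]))
    return {task: sum(vals) / len(vals) for task, vals in grouped.items()}
```

For example, raw scores of 100, 50, -100, and 0 give a mean score of 0.375 but a pass_rate of 0.25, which is why the two numbers are best read together.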

Using it

  • 01

    pip install 'agi-evals[scienceworld]' # needs a Java runtime

  • 02

    agi-evals download scienceworld # verifies engine + lists test episodes

  • 03

    agi-evals run scienceworld --model openai:gpt-4o --limit 30 --concurrency 2

Troubleshooting

Every non-pass carries exactly one typed failure tag, so the diagnosis is mechanical: look up the tag in report.failure_counts, then the per-case detail.

| Tag | What it means | What to do |
| --- | --- | --- |
| TIMEOUT | 100 steps without completing the task; check detail.raw_score to see how far the agent got. | raw_score 0 means the agent never started the procedure (often navigation failure: rooms must be reached via 'go to'); mid-range scores mean it stalled at a sub-goal, which the transcript pinpoints. |
| WRONG_ANSWER | Episode ended below 100, commonly from focusing on the wrong object ('focus on' is how the env attributes your result) or an irreversible mistake (negative raw_score). | detail.raw_score < 0 means the agent broke the experiment; re-read the task's focus instructions. |
| ADAPTER_ERROR / HARNESS_ERROR | Engine or Java missing surfaces as infra failure, excluded from the score. | pip install 'agi-evals[scienceworld]' and confirm `java -version` works. |
| ADAPTER_ERROR / HARNESS_ERROR | Our side, not the model's: transport/auth failures or a harness bug. Excluded from the score automatically. | Check endpoint credentials and connectivity; if HARNESS_ERROR persists, file an issue with the per-case detail.error string. |
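
The lookup described in this section can be mechanized. The sketch below is illustrative only: it assumes each case is a dict with "passed", "failure_tag", and "raw_score" keys, which is a hypothetical shape rather than the platform's documented report schema, and the hint strings simply paraphrase the table above.

```python
from collections import Counter
from typing import Dict, List, Optional

INFRA_TAGS = {"ADAPTER_ERROR", "HARNESS_ERROR"}  # excluded from the score


def count_failures(cases: List[dict]) -> Dict[str, int]:
    """Tally the single failure tag carried by every non-passing case."""
    return dict(Counter(c["failure_tag"] for c in cases if not c["passed"]))


def diagnose(tag: str, raw_score: Optional[float] = None) -> str:
    """Map a failure tag (plus raw score, where relevant) to a first hint."""
    if tag in INFRA_TAGS:
        return "infra: check the install, `java -version`, credentials, connectivity"
    if tag == "TIMEOUT":
        if raw_score == 0:
            return "never started: check navigation ('go to' between rooms)"
        return "stalled at a sub-goal: read the transcript"
    if tag == "WRONG_ANSWER":
        if raw_score is not None and raw_score < 0:
            return "irreversible mistake: the agent broke the experiment"
        return "check which object the agent used 'focus on' with"
    return "unknown tag: inspect the per-case detail"
```

Running count_failures over a report's cases first shows which tag dominates; diagnose then gives a starting point for each individual case before opening its transcript.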