ScienceWorld
Interactive science experiments requiring procedural understanding.
How it works
- 01
Thirty elementary-science tasks (boil water, measure melting points, grow plants, wire circuits, test conductivity, genetics crosses...) in an interactive text environment where the PROCEDURE is what's tested: you boil water by finding a pot, filling it, activating the stove, and waiting — not by saying 'boil water'.
- 02
Like ALFWorld, the environment is the benchmark: the harness drives the official scienceworld package. Its engine is a bundled JVM jar, so install with pip install 'agi-evals[scienceworld]' and make sure Java is on PATH.
- 03
Protocol: the agent sees the observation and the task description and replies with one command per turn ('open door to kitchen', 'activate stove', 'focus on substance in pot'), within a 100-step budget. Each episode constructs its own engine instance and each instance spawns a JVM, so keep --concurrency modest.
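The turn-by-turn protocol can be sketched as a plain loop. This is an illustration under stated assumptions, not the harness's actual code: `ToyEnv` is a hypothetical stand-in for the real engine (so the sketch runs without Java), its four-step procedure and 25-points-per-step scoring are invented, and `scripted_agent_factory` stands in for a model call.

```python
from dataclasses import dataclass

STEP_BUDGET = 100  # per-episode step limit described above


@dataclass
class ToyEnv:
    """Hypothetical stand-in for the engine: equal credit per correct step."""
    procedure: tuple = ("go to kitchen", "fill pot", "activate stove", "wait")
    done_steps: int = 0

    def reset(self) -> str:
        self.done_steps = 0
        return "You are in the hallway. Task: boil water."

    def step(self, command: str):
        # Only the next command in the procedure makes progress.
        if (self.done_steps < len(self.procedure)
                and command == self.procedure[self.done_steps]):
            self.done_steps += 1
        score = 100 * self.done_steps // len(self.procedure)
        return f"You {command}.", score, score == 100


def scripted_agent_factory(commands):
    """Hypothetical agent: replays a fixed list, then repeats 'look around'."""
    queue = list(commands)
    return lambda observation: queue.pop(0) if queue else "look around"


def run_episode(env, agent, budget=STEP_BUDGET):
    """One command per turn until the task completes or the budget runs out."""
    observation = env.reset()
    score, steps = 0, 0
    for steps in range(1, budget + 1):
        command = agent(observation)
        observation, score, completed = env.step(command)
        if completed:
            return {"raw_score": score, "steps": steps, "timed_out": False}
    return {"raw_score": score, "steps": steps, "timed_out": True}


full = run_episode(ToyEnv(), scripted_agent_factory(ToyEnv().procedure))
partial = run_episode(ToyEnv(), scripted_agent_factory(["go to kitchen", "fill pot"]))
print(full)
print(partial)
```

The second episode shows why partial credit matters: the agent completes half of the toy procedure, then burns the rest of its budget and times out with a nonzero raw score.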
Scoring
- 01
ScienceWorld gives PARTIAL CREDIT: the engine scores each episode up to 100 for completed sub-goals, and the raw score goes negative after an irreversible blunder. The platform reports score = max(0, raw)/100 per episode (clamped at zero, the conventional averaging choice) and keeps the raw engine score in detail.raw_score.
- 02
passed=true only for a perfect 100; a TIMEOUT with raw_score 50 still contributes 0.5 to the mean — read score and pass_rate together.
- 03
detail.task enables the paper's per-task breakdown across the 30 task types.
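The scoring rules above reduce to a few lines of arithmetic. The sketch below assumes each per-case record is a dict with a nested `detail` holding `raw_score` and `task`; that layout, the `summarize` helper, and the sample records are illustrative assumptions, not the platform's actual report schema.

```python
from collections import defaultdict


def episode_score(raw_score: float) -> float:
    """Platform score as described above: max(0, raw) / 100."""
    return max(0.0, raw_score) / 100.0


def summarize(cases):
    """Aggregate mean score, pass rate, and a per-task mean score."""
    scores = [episode_score(c["detail"]["raw_score"]) for c in cases]
    # passed is true only for a perfect raw score of 100.
    passes = [c["detail"]["raw_score"] == 100 for c in cases]
    per_task = defaultdict(list)
    for case, score in zip(cases, scores):
        per_task[case["detail"]["task"]].append(score)
    return {
        "score": sum(scores) / len(scores),
        "pass_rate": sum(passes) / len(passes),
        "per_task": {task: sum(v) / len(v) for task, v in per_task.items()},
    }


# Made-up records for illustration only; task names are placeholders.
cases = [
    {"detail": {"task": "boil", "raw_score": 100}},   # perfect run
    {"detail": {"task": "boil", "raw_score": 50}},    # e.g. a TIMEOUT at half credit
    {"detail": {"task": "melt", "raw_score": -100}},  # irreversible mistake, clamps to 0
]
summary = summarize(cases)
print(summary)
```

Here the mean score is 0.5 while the pass rate is one in three, which is the gap that reading score and pass_rate together is meant to expose.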
Using it
- 01
pip install 'agi-evals[scienceworld]' # needs a Java runtime
- 02
agi-evals download scienceworld # verifies engine + lists test episodes
- 03
agi-evals run scienceworld --model openai:gpt-4o --limit 30 --concurrency 2
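Because the engine needs a Java runtime, a quick preflight check before a long run can save a batch of infra failures. The helper below is a hypothetical convenience written with the Python standard library only; it is not part of agi-evals.

```python
import shutil
import subprocess


def java_available(executable: str = "java") -> bool:
    """Return True if a Java launcher is on PATH and `-version` exits cleanly."""
    path = shutil.which(executable)
    if path is None:
        return False
    try:
        result = subprocess.run([path, "-version"], capture_output=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


if __name__ == "__main__":
    print("Java found" if java_available() else "Java missing: install a runtime first")
```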
Troubleshooting
Every non-pass carries exactly one typed failure tag, so the diagnosis is mechanical: look up the tag in report.failure_counts, then the per-case detail.
| Tag | What it means | What to do |
|---|---|---|
| TIMEOUT | 100 steps without completing the task — check detail.raw_score to see how far the agent got. | raw_score 0 means the agent never started the procedure (often navigation failure — rooms must be reached via 'go to'); mid-range scores mean it stalled at a sub-goal, which the transcript pinpoints. |
| WRONG_ANSWER | Episode ended below 100 — commonly focusing on the wrong object ('focus on' is how the env attributes your result) or an irreversible mistake (negative raw_score). | detail.raw_score < 0 means the agent broke the experiment; re-read the task's focus instructions. |
| ADAPTER_ERROR / HARNESS_ERROR | A missing engine or Java runtime surfaces as an infrastructure failure and is excluded from the score. | Run pip install 'agi-evals[scienceworld]' and confirm `java -version` works. |
| ADAPTER_ERROR / HARNESS_ERROR | Our side, not the model's: transport/auth failures or a harness bug. Excluded from the score automatically. | Check endpoint credentials and connectivity; if HARNESS_ERROR persists, file an issue with the per-case detail.error string. |
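The lookup described in the table can be expressed as a small triage function. This is a sketch under assumptions: the per-case field name `failure_tag` and the record layout are hypothetical, and the returned hints simply paraphrase the table rows rather than reproduce any platform output.

```python
from collections import Counter


def triage(case: dict) -> str:
    """Map a failed case to a first diagnostic hint, following the table above."""
    tag = case.get("failure_tag")  # hypothetical field name
    raw = case.get("detail", {}).get("raw_score")
    if tag in ("ADAPTER_ERROR", "HARNESS_ERROR"):
        return "infra: check Java, the engine install, credentials, and connectivity"
    if tag == "TIMEOUT":
        if raw == 0:
            return "never started: check navigation ('go to') in the transcript"
        return "stalled at a sub-goal: find the stall point in the transcript"
    if tag == "WRONG_ANSWER":
        if raw is not None and raw < 0:
            return "broke the experiment: re-read the task's focus instructions"
        return "ended below 100: check which object the agent focused on"
    return "unknown tag: inspect the per-case detail"


# Made-up failed cases for illustration only.
failed = [
    {"failure_tag": "TIMEOUT", "detail": {"raw_score": 0}},
    {"failure_tag": "TIMEOUT", "detail": {"raw_score": 40}},
    {"failure_tag": "WRONG_ANSWER", "detail": {"raw_score": -100}},
    {"failure_tag": "HARNESS_ERROR", "detail": {"error": "connection reset"}},
]
failure_counts = Counter(case["failure_tag"] for case in failed)
hints = [triage(case) for case in failed]
for case, hint in zip(failed, hints):
    print(case["failure_tag"], "->", hint)
```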