LIBERO
Lifelong manipulation suites testing declarative and procedural transfer.
How it works
- 01
LIBERO benchmarks language-conditioned robot manipulation in MuJoCo. Its four task suites (libero_spatial, libero_object, libero_goal, libero_10) probe spatial reasoning, object generalization, goal transfer, and long-horizon tasks, with 10 tasks each. The benchmark's fixed initial states make every trial deterministic and reproducible.
- 02
This is the platform's first robotics eval, and the first to take a PolicyAdapter instead of a text patient: the runner steps your policy with camera images (agentview + wrist), proprioception, and the task instruction; the policy returns 7-DoF actions (6 end-effector deltas + gripper). Serve your VLA over HTTP (--model policy:http://gpu-box:8000) or wrap a local checkpoint with CallablePolicyAdapter.
- 03
The environment IS the benchmark: episodes run the official libero package (MuJoCo + robosuite), with the standard settle period of no-op actions before the policy takes over and the established per-suite step budgets (spatial 220, object 280, goal 300, libero_10 520).
- 04
Honest scope: there is no echo smoke test here — a meaningful run needs a real policy checkpoint, usually on a GPU. The scripted-policy test suite proves the harness; your first VLA evaluation is plug-in, not build-out.
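The policy contract described above can be sketched in a few lines of Python. This is an illustration only: the function name `hold_still_policy`, its `(observation, instruction)` call signature, and the zero gripper value are assumptions made for this sketch, not the platform's documented PolicyAdapter interface. The camera keys and per-suite step budgets are the ones named on this page.

```python
from typing import Any, Dict, List

# Per-suite step budgets as listed on this page.
STEP_BUDGETS: Dict[str, int] = {
    "libero_spatial": 220,
    "libero_object": 280,
    "libero_goal": 300,
    "libero_10": 520,
}

# Camera observation keys named in the troubleshooting table.
CAMERA_KEYS = ("agentview_image", "robot0_eye_in_hand_image")


def hold_still_policy(observation: Dict[str, Any], instruction: str) -> List[float]:
    """Hypothetical placeholder policy: validate inputs, return a 7-DoF action.

    The signature is an assumption for illustration; a real policy would run
    a VLA checkpoint on the images, proprioception, and instruction.
    """
    missing = [key for key in CAMERA_KEYS if key not in observation]
    if missing:
        raise KeyError(f"observation is missing camera keys: {missing}")
    end_effector_deltas = [0.0] * 6  # 6 end-effector deltas (no motion)
    gripper = [0.0]                  # 1 gripper command (placeholder value, an assumption)
    return end_effector_deltas + gripper
```

A policy like this never reaches the goal, so every episode would end in TIMEOUT at the suite's step budget. It is only useful for checking that observations arrive with the expected keys and that actions have the expected length.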
Scoring
- 01
Success rate is the benchmark's metric. Each (task, trial) episode scores 1.0 when the environment reports the goal achieved; otherwise it scores 0.0 and is tagged TIMEOUT once the step cap is reached.
- 02
trials_per_task defaults to 10 for tractable first runs; the literature standard is 50 — raise it for paper-comparable rates (SDK: LIBERORunner(suite=..., trials_per_task=50)).
- 03
detail carries suite/task_id/trial/steps/instruction for per-task breakdowns.
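The arithmetic behind these numbers is small enough to sketch. With 10 tasks per suite, the default of 10 trials per task gives 100 episodes per suite, and the literature standard of 50 gives 500. The snippet below also shows one way to compute overall and per-task success rates. The record shape (a dict with `score` and a `detail` dict holding `task_id`) is an assumption for illustration, modeled on the fields listed above rather than taken from the platform's actual report schema.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

TASKS_PER_SUITE = 10  # each suite has 10 tasks


def episodes_per_suite(trials_per_task: int = 10) -> int:
    """Number of (task, trial) episodes in one suite."""
    return TASKS_PER_SUITE * trials_per_task


def success_rates(episodes: List[dict]) -> Tuple[float, Dict[int, float]]:
    """Return (overall success rate, per-task success rate).

    Each episode is a hypothetical record such as
    {"score": 1.0, "detail": {"task_id": 3, "trial": 0}}.
    """
    if not episodes:
        raise ValueError("no episodes to score")
    scores_by_task = defaultdict(list)
    for episode in episodes:
        scores_by_task[episode["detail"]["task_id"]].append(episode["score"])
    overall = sum(episode["score"] for episode in episodes) / len(episodes)
    per_task = {
        task_id: sum(scores) / len(scores)
        for task_id, scores in scores_by_task.items()
    }
    return overall, per_task
```

A per-task breakdown is often more informative than the single overall rate, because a policy can score well on average while failing a few tasks entirely.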
Using it
- 01
pip install git+https://github.com/Lifelong-Robot-Learning/LIBERO
- 02
# serve your policy (openpi / LeRobot serve pattern), then:
- 03
agi-evals run libero --model policy:http://localhost:8000 --limit 20 --concurrency 1
- 04
# SDK: run_eval(LIBERORunner(suite='libero_object'), HTTPPolicyAdapter('http://gpu:8000'))
Troubleshooting
Every non-pass carries exactly one typed failure tag, so the diagnosis is mechanical: look up the tag in report.failure_counts, then the per-case detail.
| Tag | What it means | What to do |
|---|---|---|
| HARNESS_ERROR | The libero package (or MuJoCo underneath) isn't importable. | pip install from the upstream repo; headless machines need no display (offscreen rendering), but MuJoCo must install cleanly. |
| TypeError: PolicyAdapter | A text model spec (openai:..., echo) was passed to a robotics eval. | LIBERO evaluates policies: use --model policy:http://host:port or a CallablePolicyAdapter in the SDK. |
| TIMEOUT | The policy never achieved the goal within the suite's step budget. | That's the honest failure mode for robotics. detail.steps near the cap with no progress usually means an observation-key mismatch: confirm your policy reads agentview_image/robot0_eye_in_hand_image at your trained resolution (camera_size param). |
| ADAPTER_ERROR / HARNESS_ERROR | Our side, not the model's: transport/auth failures or a harness bug. Excluded from the score automatically. | Check endpoint credentials and connectivity; if HARNESS_ERROR persists, file an issue with the per-case detail.error string. |
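The lookup described above can be approximated offline. The sketch below tallies failure tags and recomputes a success rate that leaves out ADAPTER_ERROR and HARNESS_ERROR cases, mirroring the exclusion rule in the table. The per-case record shape (`tag` and `score` keys) and the `summarize` helper are assumptions for illustration; the platform's own report already exposes report.failure_counts, so this only shows the logic.

```python
from collections import Counter
from typing import List, Optional, Tuple

# Tags the table describes as "our side" and excluded from the score.
EXCLUDED_TAGS = {"ADAPTER_ERROR", "HARNESS_ERROR"}


def summarize(cases: List[dict]) -> Tuple[Counter, Optional[float]]:
    """Return (failure tag counts, success rate over scored cases).

    Each case is a hypothetical record such as
    {"score": 0.0, "tag": "TIMEOUT"}; passing cases carry tag None.
    """
    failure_counts = Counter(
        case["tag"] for case in cases if case["tag"] is not None
    )
    scored = [case for case in cases if case["tag"] not in EXCLUDED_TAGS]
    if not scored:
        return failure_counts, None  # nothing left to score
    rate = sum(case["score"] for case in scored) / len(scored)
    return failure_counts, rate
```

If the excluded tags dominate the counts, the reported rate rests on very few episodes, which is a sign to fix the endpoint or installation before reading anything into the score.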