LIBERO

Live

Lifelong manipulation suites testing declarative and procedural transfer.

How it works

  • 01

    LIBERO benchmarks language-conditioned robot manipulation in MuJoCo. Its four task suites (libero_spatial, libero_object, libero_goal, libero_10) probe spatial reasoning, object generalization, goal transfer, and long-horizon tasks, with 10 tasks each. The benchmark's fixed init states give every trial a deterministic, reproducible starting configuration.

  • 02

    This is the platform's first robotics eval, and the first to take a PolicyAdapter instead of a text model. At each step the runner passes your policy the camera images (agentview + wrist), proprioception, and the task instruction; the policy returns a 7-DoF action (6 end-effector deltas + gripper). Serve your VLA over HTTP (--model policy:http://gpu-box:8000) or wrap a local checkpoint with CallablePolicyAdapter.

  • 03

    The environment IS the benchmark: episodes run the official libero package (MuJoCo + robosuite), with the standard settle period of no-op actions before the policy takes over and the established per-suite step budgets (spatial 220, object 280, goal 300, libero_10 520).

  • 04

    Honest scope: there is no echo smoke test here — a meaningful run needs a real policy checkpoint, usually on a GPU. The scripted-policy test suite proves the harness; your first VLA evaluation is plug-in, not build-out.
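
The policy contract and step budgets described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the harness's actual code: noop_policy, validate_action, run_episode, and the step_fn stand-in for the environment are hypothetical names, the gripper value of -1.0 is an assumed convention, and the settle period is not modeled. Only the 7-value action shape and the per-suite budgets come from the text.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Per-suite step budgets quoted in the text above.
STEP_BUDGETS: Dict[str, int] = {
    "libero_spatial": 220,
    "libero_object": 280,
    "libero_goal": 300,
    "libero_10": 520,
}

ACTION_DIM = 7  # 6 end-effector deltas + 1 gripper command


def noop_policy(observation: dict, instruction: str) -> List[float]:
    """Hypothetical policy callable: ignores its inputs and holds still.

    The call signature and the gripper value (-1.0) are assumptions.
    """
    return [0.0] * 6 + [-1.0]


def validate_action(action: Sequence[float]) -> List[float]:
    """Reject anything that is not a flat 7-element numeric action."""
    if len(action) != ACTION_DIM:
        raise ValueError(f"expected {ACTION_DIM} values, got {len(action)}")
    return [float(a) for a in action]


def run_episode(
    policy: Callable[[dict, str], Sequence[float]],
    step_fn: Callable[[List[float]], Tuple[dict, bool]],
    suite: str,
    instruction: str,
    observation: dict,
) -> Tuple[bool, int]:
    """Step the policy until success or the suite's step cap.

    step_fn is a hypothetical stand-in for the environment: it takes an
    action and returns (next_observation, goal_achieved).
    Returns (success, steps_taken).
    """
    cap = STEP_BUDGETS[suite]
    for step in range(1, cap + 1):
        action = validate_action(policy(observation, instruction))
        observation, done = step_fn(action)
        if done:
            return True, step
    return False, cap
```

A policy that never reaches the goal exhausts the budget, so run_episode returns False together with the suite's cap; that is the case the Scoring section reports as TIMEOUT.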

Scoring

  • 01

    Success rate is the benchmark's metric. Each (task, trial) episode scores 1.0 when the environment reports the goal achieved; otherwise it scores 0 and is tagged TIMEOUT once the step cap is reached.

  • 02

    trials_per_task defaults to 10 for tractable first runs; the literature standard is 50 — raise it for paper-comparable rates (SDK: LIBERORunner(suite=..., trials_per_task=50)).

  • 03

    detail carries suite/task_id/trial/steps/instruction for per-task breakdowns.
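
As a rough illustration of the scoring rules above, the sketch below averages per-episode scores into an overall success rate and a per-task breakdown. The episode record layout (a dict with "score" and "detail" keys) is an assumption made for this example; only the detail field names and the 1.0/0 scoring come from the text. Error-tagged cases are assumed to have been filtered out already.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

# Hypothetical episode records; the real result objects may differ.
EXAMPLE_EPISODES: List[dict] = [
    {"score": 1.0, "detail": {"suite": "libero_object", "task_id": 0, "trial": 0, "steps": 140}},
    {"score": 0.0, "detail": {"suite": "libero_object", "task_id": 0, "trial": 1, "steps": 280}},
    {"score": 1.0, "detail": {"suite": "libero_object", "task_id": 1, "trial": 0, "steps": 95}},
    {"score": 1.0, "detail": {"suite": "libero_object", "task_id": 1, "trial": 1, "steps": 110}},
]


def success_rate(episodes: Iterable[dict]) -> float:
    """Mean of the per-episode scores (1.0 for success, 0.0 otherwise)."""
    scores = [ep["score"] for ep in episodes]
    return sum(scores) / len(scores) if scores else 0.0


def per_task_rates(episodes: Iterable[dict]) -> Dict[int, float]:
    """Group episodes by detail.task_id and average each group."""
    by_task: Dict[int, List[float]] = defaultdict(list)
    for ep in episodes:
        by_task[ep["detail"]["task_id"]].append(ep["score"])
    return {task: sum(s) / len(s) for task, s in by_task.items()}


overall = success_rate(EXAMPLE_EPISODES)     # 3 successes out of 4 episodes
breakdown = per_task_rates(EXAMPLE_EPISODES)  # task 0: 1 of 2, task 1: 2 of 2
```

With only a handful of trials per task, each per-task rate moves in large increments, which is one reason to raise trials_per_task when comparing against published numbers.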

Using it

  • 01

    pip install git+https://github.com/Lifelong-Robot-Learning/LIBERO

  • 02

    # serve your policy (openpi / LeRobot serve pattern), then:

  • 03

    agi-evals run libero --model policy:http://localhost:8000 --limit 20 --concurrency 1

  • 04

    # SDK: run_eval(LIBERORunner(suite='libero_object'), HTTPPolicyAdapter('http://gpu:8000'))

Troubleshooting

Every non-pass carries exactly one typed failure tag, so the diagnosis is mechanical: look up the tag in report.failure_counts, then the per-case detail.

| Tag | What it means | What to do |
|---|---|---|
| HARNESS_ERROR | The libero package (or MuJoCo underneath) isn't importable. | pip install from the upstream repo; headless machines need no display (offscreen rendering), but MuJoCo must install cleanly. |
| TypeError: PolicyAdapter | A text model spec (openai:..., echo) was passed to a robotics eval. | LIBERO evaluates policies: use --model policy:http://host:port or a CallablePolicyAdapter in the SDK. |
| TIMEOUT | The policy never achieved the goal within the suite's step budget. | That's the honest failure mode for robotics. detail.steps near the cap with no progress usually means observation-key mismatch: confirm your policy reads agentview_image/robot0_eye_in_hand_image at your trained resolution (camera_size param). |
| ADAPTER_ERROR / HARNESS_ERROR | Our side, not the model's: transport/auth failures or a harness bug. Excluded from the score automatically. | Check endpoint credentials and connectivity; if HARNESS_ERROR persists, file an issue with the per-case detail.error string. |
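
The tag-first triage described above might look like the following sketch. The per-case record layout (dicts with "tag", "score", and "detail" keys) and the helper names are assumptions for illustration; the platform's own report.failure_counts already provides the tally, so this only shows the logic on plain data.

```python
from collections import Counter
from typing import Iterable, List

# Tags the text says are excluded from the score.
EXCLUDED_TAGS = {"ADAPTER_ERROR", "HARNESS_ERROR"}

# Hypothetical per-case records; a pass carries no failure tag.
EXAMPLE_CASES: List[dict] = [
    {"tag": None, "score": 1.0, "detail": {"task_id": 0, "steps": 130}},
    {"tag": "TIMEOUT", "score": 0.0, "detail": {"task_id": 0, "steps": 280}},
    {"tag": "TIMEOUT", "score": 0.0, "detail": {"task_id": 1, "steps": 280}},
    {"tag": "ADAPTER_ERROR", "score": 0.0, "detail": {"task_id": 1, "error": "connection refused"}},
]


def failure_counts(cases: Iterable[dict]) -> Counter:
    """Tally failure tags; passes (tag is None) are skipped."""
    return Counter(case["tag"] for case in cases if case.get("tag"))


def scored_cases(cases: Iterable[dict]) -> List[dict]:
    """Drop cases whose tag marks a harness-side or transport-side failure."""
    return [case for case in cases if case.get("tag") not in EXCLUDED_TAGS]


def timed_out_at_cap(case: dict, step_cap: int) -> bool:
    """True when a TIMEOUT case ran all the way to the suite's step cap."""
    return case.get("tag") == "TIMEOUT" and case["detail"].get("steps", 0) >= step_cap


counts = failure_counts(EXAMPLE_CASES)
kept = scored_cases(EXAMPLE_CASES)
stalled = [c for c in kept if timed_out_at_cap(c, step_cap=280)]
```

If most failures land in the stalled list, the observation-key and camera resolution check from the TIMEOUT row is a sensible first thing to verify.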