
ZebraLogic

Live

Einstein-style logic-grid puzzles scored by full-grid correctness.

How it works

  • 01

    Each case is an Einstein-style logic-grid puzzle: N houses, M attributes, and a set of constraints with exactly one consistent assignment. The model must output the ENTIRE solution grid as JSON ({"solution": {"House 1": {...}}}), not just answer one question.

  • 02

    Grading is the paper's protocol: puzzle-level accuracy — every cell must be correct for a pass. Cell-level accuracy is computed and kept in detail.cell_accuracy for analysis, but a 95%-correct grid is still a failed puzzle.

  • 03

    Data note: the public copies of ZebraLogicBench have every solution blanked to keep answers out of training crawls. To run `agi-evals download zebralogic` you must request access to allenai/ZebraLogicBench-private on Hugging Face and set HF_TOKEN. The bundled sample contains original puzzles with verified-unique solutions, so the eval runs offline.
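The output shape and the all-or-nothing pass rule described above can be sketched roughly as follows. This is an illustration, not the harness's own code: `grade_grid`, the attribute names (`Color`, `Pet`) and the sample puzzle are hypothetical. Only the `{"solution": {"House 1": {...}}}` shape and the `detail.cell_accuracy` field come from the text.

```python
import json

def grade_grid(predicted: dict, gold: dict) -> dict:
    """Grade one puzzle: pass only if every gold cell is matched."""
    total = correct = 0
    for house, attrs in gold.items():
        row = predicted.get(house)
        for attr, value in attrs.items():
            total += 1
            # A missing house or attribute counts as a wrong cell.
            if isinstance(row, dict) and row.get(attr) == value:
                correct += 1
    return {
        "passed": total > 0 and correct == total,
        "detail": {"cell_accuracy": correct / total if total else 0.0},
    }

# Hypothetical 3-house, 2-attribute puzzle.
gold = {
    "House 1": {"Color": "red", "Pet": "cat"},
    "House 2": {"Color": "blue", "Pet": "dog"},
    "House 3": {"Color": "green", "Pet": "fish"},
}

# A model reply in the required shape, with the pets of House 2 and 3 swapped.
reply = json.dumps({"solution": {
    "House 1": {"Color": "red", "Pet": "cat"},
    "House 2": {"Color": "blue", "Pet": "fish"},
    "House 3": {"Color": "green", "Pet": "dog"},
}})

result = grade_grid(json.loads(reply)["solution"], gold)
# 4 of 6 cells match, so cell_accuracy is 4/6 and the puzzle still fails.
print(result)
```

Keeping the per-cell figure next to the pass flag is what lets a near miss be told apart from a collapse without changing the headline score.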

Scoring

  • 01

    score = fraction of puzzles with a fully correct grid. This is intentionally harsh: it measures end-to-end constraint propagation, not per-cell guessing (random cell accuracy is high; random puzzle accuracy is ~0).

  • 02

    detail.wrong_cells names exactly which house/attribute pairs missed, so you can see whether a model fails by one swapped pair or collapses entirely.
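As a rough sketch of the two scoring points above, the snippet below lists missed house/attribute pairs, aggregates a puzzle-level score, and estimates what a blind guesser would get. All function names are hypothetical. The baseline assumes "random" means an independent, uniformly random permutation of values for each attribute, which is one reasonable reading rather than the benchmark's definition.

```python
from math import factorial

def wrong_cells(predicted: dict, gold: dict) -> list:
    """Return (house, attribute) pairs where the prediction misses gold."""
    misses = []
    for house, attrs in gold.items():
        row = predicted.get(house)
        row = row if isinstance(row, dict) else {}
        for attr, value in attrs.items():
            if row.get(attr) != value:
                misses.append((house, attr))
    return misses

def puzzle_score(cases: list) -> float:
    """Fraction of (predicted, gold) pairs with a fully correct grid."""
    if not cases:
        return 0.0
    passed = sum(1 for predicted, gold in cases if not wrong_cells(predicted, gold))
    return passed / len(cases)

def random_baseline(n_houses: int, n_attrs: int) -> tuple:
    """Expected (cell accuracy, puzzle accuracy) for the assumed random guesser."""
    cell = 1 / n_houses                             # expected share of fixed points
    puzzle = (1 / factorial(n_houses)) ** n_attrs   # every permutation must be right
    return cell, puzzle

# Hypothetical puzzle and a prediction with one swapped pair.
gold = {
    "House 1": {"Color": "red", "Pet": "cat"},
    "House 2": {"Color": "blue", "Pet": "dog"},
    "House 3": {"Color": "green", "Pet": "fish"},
}
swapped = {
    "House 1": {"Color": "red", "Pet": "cat"},
    "House 2": {"Color": "blue", "Pet": "fish"},
    "House 3": {"Color": "green", "Pet": "dog"},
}

print(wrong_cells(swapped, gold))    # the two Pet cells in House 2 and House 3
print(puzzle_score([(gold, gold), (swapped, gold)]))
print(random_baseline(2, 2))         # (0.5, 0.25)
print(random_baseline(6, 6))         # puzzle accuracy is vanishingly small
```

Under that assumption the gap between cell-level and puzzle-level chance widens quickly with grid size, which is the motivation for scoring on full grids.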

Using it

  • 01

    CLI: agi-evals run zebralogic --model anthropic:claude-opus-4-8 --limit 20

  • 02

    Large grids (6x6) need long outputs — the runner allows 4096 completion tokens; budget accordingly on metered APIs.

Troubleshooting

Every non-pass carries exactly one typed failure tag, so the diagnosis is mechanical: look up the tag in report.failure_counts, then the per-case detail.

| Tag | What it means | What to do |
| --- | --- | --- |
| NO_ANSWER | No parseable JSON grid in the reply (no fenced block, no {...} span with House keys). | Models that narrate without committing to JSON usually recover when the prompt adds 'output ONLY the json block last'. Also check that the response wasn't truncated mid-JSON; that is the most common cause on big grids. |
| WRONG_ANSWER | Grid parsed but at least one cell is wrong. | Look at detail.cell_accuracy: near-1.0 means a single swapped pair (often two houses exchanged); near-random means the model isn't actually propagating constraints. detail.wrong_cells lists the misses. |
| ADAPTER_ERROR / HARNESS_ERROR | Our side, not the model's: transport/auth failures or a harness bug. Excluded from the score automatically. | Check endpoint credentials and connectivity; if HARNESS_ERROR persists, file an issue with the per-case detail.error string. |
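To make the NO_ANSWER and WRONG_ANSWER rows concrete, here is a minimal sketch of how a reply might be triaged. It assumes the extraction order described in the table (fenced block first, then a {...} span with House keys). `extract_grid`, `classify` and the `PASS` label are hypothetical names, and the real parser may differ.

```python
import json
import re
from collections import Counter

FENCE = "`" * 3  # built at runtime so this example nests cleanly in a fenced block
FENCE_RE = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)

def extract_grid(reply: str):
    """Return the solution grid as a dict, or None if nothing parseable is found."""
    candidates = FENCE_RE.findall(reply)
    # Fallback: the widest {...} span anywhere in the reply.
    start, end = reply.find("{"), reply.rfind("}")
    if start != -1 and end > start:
        candidates.append(reply[start:end + 1])
    for text in candidates:
        try:
            data = json.loads(text)
        except json.JSONDecodeError:
            continue  # truncated or malformed JSON ends up here
        if not isinstance(data, dict):
            continue
        grid = data.get("solution", data)
        if isinstance(grid, dict) and grid and all(k.startswith("House") for k in grid):
            return grid
    return None

def classify(reply: str, gold: dict) -> str:
    """Map a reply to PASS, NO_ANSWER or WRONG_ANSWER."""
    grid = extract_grid(reply)
    if grid is None:
        return "NO_ANSWER"
    ok = all(
        isinstance(grid.get(house), dict) and grid[house].get(attr) == value
        for house, attrs in gold.items()
        for attr, value in attrs.items()
    )
    return "PASS" if ok else "WRONG_ANSWER"

# Hypothetical 2-house puzzle and four replies.
gold = {"House 1": {"Color": "red"}, "House 2": {"Color": "blue"}}
good = "Done.\n" + FENCE + "json\n" + json.dumps({"solution": gold}) + "\n" + FENCE
swapped = json.dumps({"solution": {"House 1": {"Color": "blue"},
                                   "House 2": {"Color": "red"}}})
narrated = "House 1 is red and House 2 is blue."
truncated = '{"solution": {"House 1": {"Color": "red"'

failure_counts = Counter(classify(r, gold) for r in [good, swapped, narrated, truncated])
print(failure_counts)
```

The truncated reply lands in NO_ANSWER because its JSON never closes, which matches the note above about cut-off responses on big grids.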