ZebraLogic
Einstein-style logic-grid puzzles scored by full-grid correctness.
How it works
- 01
Each case is an Einstein-style logic-grid puzzle: N houses, M attributes, and a set of constraints with exactly one consistent assignment. The model must output the ENTIRE solution grid as JSON ({"solution": {"House 1": {...}}}), not just answer one question.
- 02
Grading follows the paper's protocol, puzzle-level accuracy: every cell must be correct for a pass. Cell-level accuracy is computed and kept in detail.cell_accuracy for analysis, but a 95%-correct grid is still a failed puzzle.
- 03
Data note: the public copies of ZebraLogicBench blank out all solutions to keep answers out of training crawls. `agi-evals download zebralogic` therefore requires requesting access to allenai/ZebraLogicBench-private on Hugging Face and setting HF_TOKEN; the bundled sample contains original puzzles with verified-unique solutions, so the eval runs offline.
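The grading rule above can be sketched in a few lines. This is a minimal illustration, not the harness's own code: `grade_grid` is a hypothetical helper, the attribute names are made up, and exact string equality is an assumption (the real grader may normalize case or whitespace).

```python
def grade_grid(predicted: dict, gold: dict) -> dict:
    """Compare a predicted grid to the gold grid, cell by cell.

    Hypothetical helper: exact string match per cell is an assumption.
    """
    total = 0
    correct = 0
    for house, attrs in gold.items():
        pred_attrs = predicted.get(house)
        if not isinstance(pred_attrs, dict):
            pred_attrs = {}  # a missing or malformed house counts as all-wrong
        for attr, value in attrs.items():
            total += 1
            if pred_attrs.get(attr) == value:
                correct += 1
    cell_accuracy = correct / total if total else 0.0
    # Puzzle-level pass: every single cell must match.
    return {"passed": total > 0 and correct == total,
            "cell_accuracy": cell_accuracy}


gold = {
    "House 1": {"Name": "Alice", "Pet": "cat"},
    "House 2": {"Name": "Bob", "Pet": "dog"},
}
# The pets are swapped between the two houses.
predicted = {
    "House 1": {"Name": "Alice", "Pet": "dog"},
    "House 2": {"Name": "Bob", "Pet": "cat"},
}
result = grade_grid(predicted, gold)
print(result)  # {'passed': False, 'cell_accuracy': 0.5}
```

Half the cells are right, yet the puzzle fails, which is exactly the gap between the two metrics.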
Scoring
- 01
score = fraction of puzzles with a fully correct grid. This is intentionally harsh: it measures end-to-end constraint propagation, not per-cell guessing (random guessing still gets a meaningful fraction of cells right, while random puzzle accuracy is ~0).
- 02
detail.wrong_cells names exactly which house/attribute pairs missed, so you can see whether a model fails by one swapped pair or collapses entirely.
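To see how far apart the two random baselines are, here is a rough back-of-the-envelope calculation. It assumes one particular guessing strategy: for each attribute, a uniformly random permutation of its N values across the N houses. Under that assumption the expected cell accuracy is 1/N and the chance of a fully correct grid is (1/N!)^M. `random_baseline` is a hypothetical name used only for this sketch.

```python
from math import factorial


def random_baseline(n_houses: int, n_attributes: int) -> tuple[float, float]:
    """Random-guess baselines, assuming one random permutation per attribute."""
    # Each value lands in the right house with probability 1/N.
    expected_cell_accuracy = 1 / n_houses
    # All M attribute permutations must be exactly right at once.
    puzzle_accuracy = (1 / factorial(n_houses)) ** n_attributes
    return expected_cell_accuracy, puzzle_accuracy


for n, m in [(2, 2), (4, 4), (6, 6)]:
    cell, puzzle = random_baseline(n, m)
    print(f"{n}x{m}: cell ~ {cell:.3f}, puzzle ~ {puzzle:.2e}")
```

Cell accuracy shrinks only linearly with grid size, while puzzle accuracy collapses factorially, so the puzzle-level score leaves essentially no room for lucky guesses.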
Using it
- 01
CLI: `agi-evals run zebralogic --model anthropic:claude-opus-4-8 --limit 20`
- 02
Large grids (6x6) need long outputs: the runner allows 4096 completion tokens per case, so budget accordingly on metered APIs.
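For budgeting, a worst-case bound is just the case count times the completion cap. The sketch below assumes every case uses the full 4096 tokens and ignores prompt tokens. The helper names are hypothetical, and the price is a placeholder you pass in, not a real rate.

```python
MAX_COMPLETION_TOKENS = 4096  # per-case completion cap stated above


def worst_case_completion_tokens(num_cases: int,
                                 cap: int = MAX_COMPLETION_TOKENS) -> int:
    """Upper bound on completion tokens if every case hits the cap."""
    return num_cases * cap


def worst_case_cost(num_cases: int, price_per_million_output_tokens: float) -> float:
    """Upper bound on completion cost; the price is a caller-supplied placeholder."""
    tokens = worst_case_completion_tokens(num_cases)
    return tokens * price_per_million_output_tokens / 1_000_000


print(worst_case_completion_tokens(20))  # 81920
```

With `--limit 20`, the completion side is bounded at 81,920 tokens; actual usage on small grids will typically be lower.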
Troubleshooting
Every non-pass carries exactly one typed failure tag, so the diagnosis is mechanical: look up the tag in report.failure_counts, then the per-case detail.
| Tag | What it means | What to do |
|---|---|---|
| NO_ANSWER | No parseable JSON grid in the reply (no fenced block, no {...} span with House keys). | Models that narrate without committing to JSON usually recover when the prompt adds an instruction like 'output ONLY the json block last'. Check that the response wasn't truncated mid-JSON; that's the most common cause on big grids. |
| WRONG_ANSWER | Grid parsed but at least one cell is wrong. | Look at detail.cell_accuracy: near-1.0 means a single swapped pair (often two houses exchanged); near-random means the model isn't actually propagating constraints. detail.wrong_cells lists the misses. |
| ADAPTER_ERROR / HARNESS_ERROR | Our side, not the model's: transport/auth failures or a harness bug. Excluded from the score automatically. | Check endpoint credentials and connectivity; if HARNESS_ERROR persists, file an issue with the per-case detail.error string. |
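The first two rows of the table can be reproduced offline when you want to re-check a saved reply by hand. The sketch below is an approximation under stated assumptions, not the harness's parser: `extract_grid` and `tag_reply` are hypothetical helpers, the extraction order (fenced blocks first, then the widest {...} span) is a guess, and returning `None` as the tag for a pass is a convention chosen for this example.

```python
import json
import re

FENCE = "`" * 3  # built at runtime so this example can sit inside a fenced block
FENCE_RE = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def extract_grid(reply: str):
    """Return the solution grid dict, or None if no parseable grid is found."""
    candidates = FENCE_RE.findall(reply)
    # Fallback: the widest {...} span in the raw reply.
    start, end = reply.find("{"), reply.rfind("}")
    if start != -1 and end > start:
        candidates.append(reply[start:end + 1])
    for text in candidates:
        try:
            data = json.loads(text)
        except json.JSONDecodeError:
            continue
        grid = data.get("solution", data) if isinstance(data, dict) else None
        if isinstance(grid, dict) and grid and all(k.startswith("House") for k in grid):
            return grid
    return None


def tag_reply(reply: str, gold: dict) -> dict:
    """Assign NO_ANSWER, WRONG_ANSWER, or no tag (pass) to one reply."""
    grid = extract_grid(reply)
    if grid is None:
        return {"tag": "NO_ANSWER"}
    wrong_cells = [
        (house, attr)
        for house, attrs in gold.items()
        for attr, value in attrs.items()
        if not isinstance(grid.get(house), dict) or grid[house].get(attr) != value
    ]
    if wrong_cells:
        return {"tag": "WRONG_ANSWER", "wrong_cells": wrong_cells}
    return {"tag": None, "wrong_cells": []}


gold = {
    "House 1": {"Name": "Alice", "Pet": "cat"},
    "House 2": {"Name": "Bob", "Pet": "dog"},
}
good = "Reasoning...\n" + FENCE + "json\n" + json.dumps({"solution": gold}) + "\n" + FENCE
truncated = 'Here is my answer: {"solution": {"House 1": {"Name": "Al'
swapped = json.dumps({"solution": {
    "House 1": {"Name": "Alice", "Pet": "dog"},
    "House 2": {"Name": "Bob", "Pet": "cat"},
}})

print(tag_reply(good, gold)["tag"])       # None
print(tag_reply(truncated, gold)["tag"])  # NO_ANSWER
print(tag_reply(swapped, gold))
```

The truncated reply has an opening brace but no closing one, so nothing parses and it falls into NO_ANSWER; the swapped reply parses cleanly and reports the two Pet cells as the misses.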