MuSR
Long narrative puzzles (murder mysteries, logistics) requiring soft reasoning.
How it works
- 01
MuSR embeds a reasoning chain inside a long natural narrative — murder mysteries (who did it), object placements (where does a character think an item is), and team allocations (optimal assignment). Each case is narrative + question + 2-3 choices.
- 02
The runner renders the full narrative, then the question and lettered options, and grades the extracted letter — same conservative extraction as GPQA/MMLU-Pro.
- 03
Full dataset: 756 cases across the three domains via `agi-evals download musr` (the domain is kept in each result's detail/subject for per-domain breakdown).
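The render-then-extract flow described above can be sketched roughly as follows. This is an illustration, not the runner's actual code: `render_prompt` and `extract_letter` are hypothetical names, and the extraction rule (accept only an explicit `Answer: X` on the last non-empty line) is one assumed reading of "conservative".

```python
import re
import string
from typing import Optional


def render_prompt(narrative: str, question: str, choices: list) -> str:
    """Narrative first, then the question, then lettered options (assumed layout)."""
    letters = string.ascii_uppercase
    options = "\n".join(f"{letters[i]}. {c}" for i, c in enumerate(choices))
    return (
        f"{narrative}\n\n{question}\n\n{options}\n\n"
        "End your reply with a final line of the form 'Answer: X'."
    )


def extract_letter(reply: str, n_choices: int) -> Optional[str]:
    """Return the chosen letter, or None when no unambiguous answer line exists."""
    valid = string.ascii_uppercase[:n_choices]  # e.g. "AB" for a 2-choice case
    lines = [ln.strip() for ln in reply.splitlines() if ln.strip()]
    if not lines:
        return None
    # Only the last non-empty line is inspected; letters outside the
    # valid range for this case are rejected rather than guessed at.
    m = re.fullmatch(rf"Answer:\s*\(?([{valid}])\)?\.?", lines[-1])
    return m.group(1) if m else None
```

Under this rule, a reply that mentions a letter only in passing yields no answer rather than a guess, which is the trade-off a conservative extractor makes: fewer false matches at the cost of more unparsed replies.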
Scoring
- 01
score = exact-match accuracy. Guessing baseline varies by domain (2 choices in mysteries, 3 elsewhere), so compare per-domain rather than against a single random baseline.
- 02
MuSR rewards genuinely reading the narrative: models that skim reliably fall to near-baseline on object placements, which test belief states (where a character thinks an item is) rather than facts.
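A per-domain comparison against chance could be computed along these lines. This is a minimal sketch: the chance values follow the choice counts stated above (2 in mysteries, 3 elsewhere), while the domain keys other than `object_placements` and the `(domain, passed)` input shape are assumptions made for illustration.

```python
from collections import defaultdict

# Chance accuracy per domain; keys are assumed names, values follow the
# choice counts given in the Scoring notes (2 choices -> 1/2, 3 -> 1/3).
CHANCE = {
    "murder_mysteries": 1 / 2,
    "object_placements": 1 / 3,
    "team_allocation": 1 / 3,
}


def per_domain_accuracy(results):
    """results: iterable of (domain, passed) pairs.

    Returns {domain: (accuracy, accuracy minus chance)}.
    """
    totals = defaultdict(int)
    passes = defaultdict(int)
    for domain, passed in results:
        totals[domain] += 1
        passes[domain] += int(passed)
    out = {}
    for domain, n in totals.items():
        acc = passes[domain] / n
        out[domain] = (acc, acc - CHANCE[domain])
    return out
```

Reporting the lift over chance alongside raw accuracy keeps a 55% on two-choice mysteries from looking stronger than a 50% on a three-choice domain.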
Using it
- 01
CLI: `agi-evals download musr && agi-evals run musr --model anthropic:claude-opus-4-8`
- 02
Narratives run to ~1k words — ensure your adapter allows enough input context; completion budgets matter less.
Troubleshooting
Every non-pass carries exactly one typed failure tag, so the diagnosis is mechanical: look up the tag in report.failure_counts, then the per-case detail.
| Tag | What it means | What to do |
|---|---|---|
| NO_ANSWER | No parseable choice letter in the reply. | Same fix as all MCQ evals: instruct the model to end with 'Answer: X'. Long narratives make models chatty; the final-line format instruction is what keeps extraction reliable. |
| WRONG_ANSWER | Letter parsed but wrong. | Check detail.parsed_answer vs expected, and look at per-domain accuracy — a model failing only object_placements is failing theory-of-mind tracking, not reading comprehension. |
| CONTEXT_OVERFLOW | The narrative plus instruction exceeded the model's context window. | Use a longer-context model or an adapter with a larger window; truncating narratives invalidates the eval. |
| ADAPTER_ERROR / HARNESS_ERROR | Our side, not the model's: transport/auth failures or a harness bug. Excluded from the score automatically. | Check endpoint credentials and connectivity; if HARNESS_ERROR persists, file an issue with the per-case detail.error string. |
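To make the table concrete, the sketch below tallies failure tags and computes a score that leaves harness-side errors out of the denominator. The per-case record shape (a dict with a `tag` key, `None` for a pass) and the `summarize` helper are hypothetical, and the sketch assumes every tag other than ADAPTER_ERROR and HARNESS_ERROR counts as a miss; the real report object may differ.

```python
from collections import Counter

# Harness-side tags: per the table, these are excluded from the score.
EXCLUDED = {"ADAPTER_ERROR", "HARNESS_ERROR"}


def summarize(cases):
    """cases: list of dicts with a 'tag' key (None means the case passed).

    Returns (score, failure_counts). The record shape is an assumption.
    """
    failure_counts = Counter(c["tag"] for c in cases if c["tag"] is not None)
    # Drop harness-side failures before scoring so they neither help nor hurt.
    scored = [c for c in cases if c["tag"] not in EXCLUDED]
    passed = sum(1 for c in scored if c["tag"] is None)
    score = passed / len(scored) if scored else 0.0
    return score, dict(failure_counts)
```

A tally like this is the starting point for the lookup described above: the largest count indicates which row of the table to read first.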