MuSR

Live

Long narrative puzzles (murder mysteries, logistics) requiring soft reasoning.

How it works

  • 01

    MuSR embeds a reasoning chain inside a long natural narrative — murder mysteries (who did it), object placements (where does a character think an item is), and team allocations (optimal assignment). Each case is narrative + question + 2-3 choices.

  • 02

    The runner renders the full narrative, then the question and lettered options, and grades the extracted letter — same conservative extraction as GPQA/MMLU-Pro.

  • 03

    Full dataset: 756 cases across the three domains via `agi-evals download musr` (the domain is kept in each result's detail/subject for per-domain breakdown).
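The render-then-extract flow described above can be sketched as follows. This is a minimal illustration, not the runner's actual code: `render_prompt`, `extract_letter`, the prompt wording, and the specific "final line must be `Answer: X`" rule are all assumptions chosen to show what a conservative extraction could look like.

```python
import re
from string import ascii_uppercase
from typing import Optional, Sequence


def render_prompt(narrative: str, question: str, choices: Sequence[str]) -> str:
    """Hypothetical renderer: full narrative, then question, then lettered options."""
    options = "\n".join(
        f"{ascii_uppercase[i]}. {choice}" for i, choice in enumerate(choices)
    )
    return (
        f"{narrative}\n\n{question}\n\n{options}\n\n"
        "End your reply with 'Answer: X'."
    )


def extract_letter(reply: str, n_choices: int) -> Optional[str]:
    """Assumed conservative rule: accept only an explicit 'Answer: X' on the
    last non-empty line, where X is one of the offered letters."""
    lines = [ln.strip() for ln in reply.splitlines() if ln.strip()]
    if not lines:
        return None
    valid = ascii_uppercase[:n_choices]
    match = re.fullmatch(rf"Answer:\s*\(?([{valid}])\)?\.?", lines[-1])
    return match.group(1) if match else None
```

Under this rule a reply that mentions a letter only in running prose yields no answer rather than a guess, which is the sense in which the extraction is conservative.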

Scoring

  • 01

    score = exact-match accuracy. Guessing baseline varies by domain (2 choices in mysteries, 3 elsewhere), so compare per-domain rather than against a single random baseline.

  • 02

    MuSR rewards reading the narrative closely: models that skim reliably fall to near-baseline on object placements, the domain that tracks characters' belief states rather than facts.
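To make the per-domain comparison concrete, here is a small sketch that computes accuracy next to each domain's guessing baseline. The record shape (`subject`, `passed`) and the domain keys other than `object_placements` are assumptions for illustration; the choice counts come from the text above (2 in mysteries, 3 elsewhere).

```python
from collections import defaultdict

# Choice counts per domain as stated above; the key "murder_mysteries" is an
# assumed name. Any domain not listed falls back to 3 choices.
N_CHOICES = {"murder_mysteries": 2}
DEFAULT_CHOICES = 3


def per_domain_report(results):
    """results: iterable of dicts with hypothetical keys 'subject' and 'passed'."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for r in results:
        totals[r["subject"]] += 1
        hits[r["subject"]] += int(r["passed"])

    report = {}
    for domain, n in totals.items():
        accuracy = hits[domain] / n
        baseline = 1.0 / N_CHOICES.get(domain, DEFAULT_CHOICES)
        report[domain] = {
            "accuracy": accuracy,
            "baseline": baseline,
            "lift": accuracy - baseline,  # margin over random guessing
        }
    return report
```

Reading `lift` per domain avoids the trap of a pooled score that looks respectable only because the two-choice domain has a 50% floor.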

Using it

  • 01

    CLI: `agi-evals download musr && agi-evals run musr --model anthropic:claude-opus-4-8`

  • 02

    Narratives run to ~1k words — ensure your adapter allows enough input context; completion budgets matter less.
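A rough pre-flight check for the input-context point might look like the sketch below. The 1.5 tokens-per-word ratio is a loose rule of thumb assumed here, not a property of any particular tokenizer, and `estimate_tokens`, `fits_context`, and the 256-token completion budget are hypothetical names and defaults; use the model's real tokenizer when an exact count matters.

```python
import math


def estimate_tokens(text: str, tokens_per_word: float = 1.5) -> int:
    """Crude token estimate from whitespace-separated words (assumed ratio)."""
    return math.ceil(len(text.split()) * tokens_per_word)


def fits_context(
    prompt: str,
    context_window: int,
    completion_budget: int = 256,
    tokens_per_word: float = 1.5,
) -> bool:
    """True if the estimated prompt size plus completion budget fits the window."""
    needed = estimate_tokens(prompt, tokens_per_word) + completion_budget
    return needed <= context_window
```

With a ~1k-word narrative this estimate lands around 1,500 input tokens, so the check mostly flags adapters configured with very small windows.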

Troubleshooting

Every non-pass carries exactly one typed failure tag, so the diagnosis is mechanical: look up the tag in report.failure_counts, then the per-case detail.

| Tag | What it means | What to do |
| --- | --- | --- |
| NO_ANSWER | No parseable choice letter in the reply. | Same fix as all MCQ evals: instruct the model to end with 'Answer: X'. Long narratives make models chatty; the final-line format instruction is what keeps extraction reliable. |
| WRONG_ANSWER | Letter parsed but wrong. | Check detail.parsed_answer vs expected, and look at per-domain accuracy: a model failing only object_placements is failing theory-of-mind tracking, not reading comprehension. |
| CONTEXT_OVERFLOW | The narrative plus instruction exceeded the model's context window. | Use a longer-context model or an adapter with a larger window; truncating narratives invalidates the eval. |
| ADAPTER_ERROR / HARNESS_ERROR | Our side, not the model's: transport/auth failures or a harness bug. Excluded from the score automatically. | Check endpoint credentials and connectivity; if HARNESS_ERROR persists, file an issue with the per-case detail.error string. |
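The triage above can be mimicked offline with a few lines of Python. This is a sketch under assumptions: the per-case record shape (`passed`, `failure_tag`) is hypothetical, and treating "excluded from the score" as "dropped from the denominator" is one plausible reading rather than confirmed harness behavior.

```python
from collections import Counter

# Tags the docs attribute to the harness side rather than the model.
INFRA_TAGS = {"ADAPTER_ERROR", "HARNESS_ERROR"}


def summarize(cases):
    """cases: list of dicts with hypothetical keys 'passed' and 'failure_tag'.

    Returns (score, failure_counts). Infrastructure failures are counted in
    failure_counts but left out of the score denominator (assumed behavior).
    """
    failure_counts = Counter(c["failure_tag"] for c in cases if not c["passed"])
    scored = [c for c in cases if c["passed"] or c["failure_tag"] not in INFRA_TAGS]
    score = sum(c["passed"] for c in scored) / len(scored) if scored else 0.0
    return score, failure_counts
```

Sorting `failure_counts` by frequency gives a quick view of whether a run is dominated by formatting problems (NO_ANSWER) or by reasoning errors (WRONG_ANSWER).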