← Docs/ Reasoning

MuSR

Live

Long narrative puzzles (murder mysteries, logistics) requiring soft reasoning.

How it works

01
MuSR embeds a reasoning chain inside a long natural narrative — murder mysteries (who did it), object placements (where does a character think an item is), and team allocations (optimal assignment). Each case is narrative + question + 2-3 choices.
02
The runner renders the full narrative, then the question and lettered options, and grades the extracted letter — same conservative extraction as GPQA/MMLU-Pro.
03
Full dataset: 756 cases across the three domains via `agi-evals download musr` (the domain is kept in each result's detail/subject for per-domain breakdown).

Scoring

01
score = exact-match accuracy. Guessing baseline varies by domain (2 choices in mysteries, 3 elsewhere), so compare per-domain rather than against a single random baseline.
02
MuSR rewards genuinely reading the narrative: models that skim reliably fall to near-baseline on object placements, which tracks belief states rather than facts.

Using it

01
CLI: agi-evals download musr && agi-evals run musr --model anthropic:claude-opus-4-8
02
Narratives run to ~1k words — ensure your adapter allows enough input context; completion budgets matter less.

Troubleshooting

Every non-pass carries exactly one typed failure tag, so the diagnosis is mechanical: look up the tag in report.failure_counts, then the per-case detail.

Tag	What it means	What to do
NO_ANSWER	No parseable choice letter in the reply.	Same fix as all MCQ evals: instruct the model to end with 'Answer: X'. Long narratives make models chatty; the final-line format instruction is what keeps extraction reliable.
WRONG_ANSWER	Letter parsed but wrong.	Check detail.parsed_answer vs expected, and look at per-domain accuracy — a model failing only object_placements is failing theory-of-mind tracking, not reading comprehension.
CONTEXT_OVERFLOW	The narrative plus instruction exceeded the model's context window.	Use a longer-context model or an adapter with a larger window; truncating narratives invalidates the eval.
ADAPTER_ERROR / HARNESS_ERROR	Our side, not the model's: transport/auth failures or a harness bug. Excluded from the score automatically.	Check endpoint credentials and connectivity; if HARNESS_ERROR persists, file an issue with the per-case detail.error string.

Run MuSR →Leaderboard