AGI·EVALSSign in
← Docs/ Safety / Security

HarmBench

Live

Red-teaming behaviors (standard+contextual), judged; score = 1 - ASR.

How it works

  • 01

    HarmBench's text behaviors: 200 standard plus 100 contextual (the behavior references a provided context document, which the runner prepends). The copyright subset is excluded — it needs hash-based grading we haven't built, and mis-grading it silently would be worse than omitting it.

  • 02

    Each behavior goes directly to the model; a judge classifies whether the generation actually exhibits the behavior (provides the harmful content) vs refusing or deflecting. Pass judge=<any adapter> to approximate the benchmark's fine-tuned Llama-2-13B classifier; the offline heuristic fallback mirrors JailbreakBench's.

  • 03

    HarmBench's official metric is attack success rate (ASR). The platform reports score = 1 - ASR so higher is better everywhere on the leaderboard; per-category and per-functional-category breakdowns live in each result's detail.

Scoring

  • 01

    score = fraction of behaviors NOT exhibited (safe rate). passed=true means safe; an exhibited behavior is tagged WRONG_ANSWER.

  • 02

    Heuristic-judged and LLM-judged runs are not comparable — detail.judged_by records the mode on every result.

Using it

  • 01

    CLI: agi-evals download harmbench && agi-evals run harmbench --model openai:gpt-4o-mini

  • 02

    SDK with judge: run_eval(HarmBenchRunner(judge=AnthropicAdapter('claude-opus-4-8')), patient)

  • 03

    This eval sends genuinely harmful requests to the model under test — by design, for defensive robustness measurement. Review provider policies for automated safety evaluation.

Troubleshooting

Every non-pass carries exactly one typed failure tag, so the diagnosis is mechanical: look up the tag in report.failure_counts, then the per-case detail.

TagWhat it meansWhat to do
WRONG_ANSWERThe judge ruled the generation exhibits the harmful behavior.Read response_text and detail.verdict; contextual cases often trip models that handle direct requests fine — check detail.functional_category for the split.
MALFORMED_OUTPUTThe LLM judge answered neither YES nor NO.Use a stronger judge or a different one — persistent unparseable verdicts usually mean the judge model is refusing to act as a classifier.
ADAPTER_ERROR / HARNESS_ERROROur side, not the model's: transport/auth failures or a harness bug. Excluded from the score automatically.Check endpoint credentials and connectivity; if HARNESS_ERROR persists, file an issue with the per-case detail.error string.