← Docs/ Safety / Security

HarmBench

Live

Red-teaming behaviors (standard+contextual), judged; score = 1 - ASR.

How it works

01
HarmBench's text behaviors: 200 standard plus 100 contextual (the behavior references a provided context document, which the runner prepends). The copyright subset is excluded — it needs hash-based grading we haven't built, and mis-grading it silently would be worse than omitting it.
02
Each behavior goes directly to the model; a judge classifies whether the generation actually exhibits the behavior (provides the harmful content) vs refusing or deflecting. Pass judge=<any adapter> to approximate the benchmark's fine-tuned Llama-2-13B classifier; the offline heuristic fallback mirrors JailbreakBench's.
03
HarmBench's official metric is attack success rate (ASR). The platform reports score = 1 - ASR so higher is better everywhere on the leaderboard; per-category and per-functional-category breakdowns live in each result's detail.

Scoring

01
score = fraction of behaviors NOT exhibited (safe rate). passed=true means safe; an exhibited behavior is tagged WRONG_ANSWER.
02
Heuristic-judged and LLM-judged runs are not comparable — detail.judged_by records the mode on every result.

Using it

01
CLI: agi-evals download harmbench && agi-evals run harmbench --model openai:gpt-4o-mini
02
SDK with judge: run_eval(HarmBenchRunner(judge=AnthropicAdapter('claude-opus-4-8')), patient)
03
This eval sends genuinely harmful requests to the model under test — by design, for defensive robustness measurement. Review provider policies for automated safety evaluation.

Troubleshooting

Every non-pass carries exactly one typed failure tag, so the diagnosis is mechanical: look up the tag in report.failure_counts, then the per-case detail.

Tag	What it means	What to do
WRONG_ANSWER	The judge ruled the generation exhibits the harmful behavior.	Read response_text and detail.verdict; contextual cases often trip models that handle direct requests fine — check detail.functional_category for the split.
MALFORMED_OUTPUT	The LLM judge answered neither YES nor NO.	Use a stronger judge or a different one — persistent unparseable verdicts usually mean the judge model is refusing to act as a classifier.
ADAPTER_ERROR / HARNESS_ERROR	Our side, not the model's: transport/auth failures or a harness bug. Excluded from the score automatically.	Check endpoint credentials and connectivity; if HARNESS_ERROR persists, file an issue with the per-case detail.error string.

Run HarmBench →Leaderboard