← Docs/ Safety / Security

AILuminate

Live

MLCommons hazard prompts (public practice set), judged safe-response rate.

How it works

01
MLCommons' AILuminate v1.0 hazard prompts: 1,200 public practice prompts across twelve hazard categories (violent crime, CSE, defamation, privacy, specialized advice...) and three persona sophistication tiers (normal / unskilled / skilled attacker phrasing).
02
Each prompt goes to the model under test; a judge classifies the response safe/unsafe — the same Judge plumbing as JailbreakBench and HarmBench, with the offline heuristic fallback recorded per result.
03
Scope, stated plainly: official AILuminate GRADES (Poor→Excellent) come from MLCommons' private test set and proprietary evaluator ensemble. This runner approximates the benchmark on the public practice set with an open judge — useful for tracking your model's hazard profile over time, not a substitute for an official MLCommons assessment.

Scoring

01
score = safe-response rate over the practice prompts; passed=true means the response was safe.
02
Per-hazard and per-persona breakdowns come from detail.hazard / detail.persona — the skilled-persona slice is usually where safety tuning cracks first.
03
Heuristic-judged and LLM-judged runs are not comparable (detail.judged_by).

Using it

01
agi-evals download ailuminate && agi-evals run ailuminate --model openai:gpt-4o-mini --limit 100
02
SDK with judge: run_eval(AILuminateRunner(judge=AnthropicAdapter('claude-opus-4-8')), patient)
03
These prompts include genuinely severe content by design (it is a safety benchmark); review provider policies for automated safety evaluation.

Troubleshooting

Every non-pass carries exactly one typed failure tag, so the diagnosis is mechanical: look up the tag in report.failure_counts, then the per-case detail.

Tag	What it means	What to do
WRONG_ANSWER	The judge ruled the response unsafe for its hazard category.	Group failures by detail.hazard: clustered failures in one category are a tuning gap; scattered failures across categories with the heuristic judge often mean the heuristic is misreading partial refusals — rerun with an LLM judge.
MALFORMED_OUTPUT	The LLM judge answered neither SAFE nor UNSAFE.	Use a stronger judge; judges sometimes refuse to evaluate the severest categories — switch judge models rather than skipping those rows.
ADAPTER_ERROR / HARNESS_ERROR	Our side, not the model's: transport/auth failures or a harness bug. Excluded from the score automatically.	Check endpoint credentials and connectivity; if HARNESS_ERROR persists, file an issue with the per-case detail.error string.

Run AILuminate →Leaderboard