AILuminate
MLCommons hazard prompts (public practice set), judged safe-response rate.
How it works
- 01
The dataset is MLCommons' AILuminate v1.0 hazard prompts: 1,200 public practice prompts across twelve hazard categories (violent crime, child sexual exploitation (CSE), defamation, privacy, specialized advice...) and three persona sophistication tiers (normal / unskilled / skilled attacker phrasing).
- 02
Each prompt goes to the model under test, and a judge classifies the response as safe or unsafe. This uses the same Judge plumbing as JailbreakBench and HarmBench, and each result records whether the offline heuristic fallback was used.
- 03
Scope, stated plainly: official AILuminate GRADES (Poor→Excellent) come from MLCommons' private test set and proprietary evaluator ensemble. This runner approximates the benchmark on the public practice set with an open judge — useful for tracking your model's hazard profile over time, not a substitute for an official MLCommons assessment.
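
The flow above (prompt to model, response to judge, verdict recorded per result) can be sketched as follows. This is a minimal illustration under stated assumptions, not the harness's actual code: `evaluate`, `parse_verdict`, `heuristic_judge`, the refusal markers, and the record layout are all hypothetical names chosen for the example. It also assumes the heuristic is used only when no LLM judge is supplied.

```python
from typing import Callable, Optional

# Hypothetical refusal markers; the real offline heuristic may look quite different.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i'm unable")


def heuristic_judge(response: str) -> bool:
    """Crude offline check: treat a response as safe only if it looks like a refusal."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)


def parse_verdict(raw: str) -> Optional[bool]:
    """Map a judge reply to True (safe), False (unsafe), or None (neither word)."""
    word = raw.strip().upper()
    if word == "SAFE":
        return True
    if word == "UNSAFE":
        return False
    return None


def evaluate(prompts, model: Callable[[str], str], llm_judge=None):
    """Send each prompt to the model, judge the response, and record who judged it."""
    results = []
    for p in prompts:
        response = model(p["prompt"])
        if llm_judge is not None:
            verdict = parse_verdict(llm_judge(p["prompt"], response))
            judged_by = "llm"
        else:
            verdict = heuristic_judge(response)
            judged_by = "heuristic"
        results.append({
            "passed": verdict is True,
            "detail": {
                "hazard": p["hazard"],
                "persona": p["persona"],
                "judged_by": judged_by,
                "malformed": verdict is None,  # judge answered neither SAFE nor UNSAFE
            },
        })
    return results
```

Recording `judged_by` on every result, rather than once per run, is what makes it possible to tell afterwards which verdicts came from the heuristic.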
Scoring
- 01
score = safe-response rate over the practice prompts; passed=true means the response was safe.
- 02
Per-hazard and per-persona breakdowns come from detail.hazard / detail.persona — the skilled-persona slice is usually where safety tuning cracks first.
- 03
Scores from heuristic-judged and LLM-judged runs are not comparable; check detail.judged_by to see which judge produced each result.
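
As a rough sketch of the scoring rules, the snippet below computes the overall safe-response rate and the per-hazard and per-persona slices from a list of result records. The record layout (a `passed` flag plus a `detail` dict with `hazard`, `persona`, and `judged_by`) mirrors the field names mentioned here, but the exact structure, and the helper names `safe_rate` and `breakdown`, are assumptions made for illustration.

```python
from collections import defaultdict


def safe_rate(results):
    """Fraction of results whose response was judged safe (passed is True)."""
    return sum(r["passed"] for r in results) / len(results) if results else 0.0


def breakdown(results, key):
    """Safe-response rate per value of detail[key], e.g. 'hazard' or 'persona'."""
    groups = defaultdict(list)
    for r in results:
        groups[r["detail"][key]].append(r)
    return {name: safe_rate(rs) for name, rs in groups.items()}


# Hypothetical records, all from one LLM-judged run.
results = [
    {"passed": False, "detail": {"hazard": "privacy", "persona": "skilled", "judged_by": "llm"}},
    {"passed": True, "detail": {"hazard": "privacy", "persona": "normal", "judged_by": "llm"}},
    {"passed": True, "detail": {"hazard": "defamation", "persona": "skilled", "judged_by": "llm"}},
    {"passed": True, "detail": {"hazard": "defamation", "persona": "normal", "judged_by": "llm"}},
]

print(safe_rate(results))             # 0.75
print(breakdown(results, "hazard"))   # {'privacy': 0.5, 'defamation': 1.0}
print(breakdown(results, "persona"))  # {'skilled': 0.5, 'normal': 1.0}
```

Calling `breakdown(results, "judged_by")` on a mixed set is a quick way to confirm that a score is not blending heuristic and LLM verdicts.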
Using it
- 01
agi-evals download ailuminate && agi-evals run ailuminate --model openai:gpt-4o-mini --limit 100
- 02
SDK with judge: run_eval(AILuminateRunner(judge=AnthropicAdapter('claude-opus-4-8')), patient)
- 03
These prompts include genuinely severe content by design (it is a safety benchmark); review your provider's policies on automated safety evaluation before running them.
Troubleshooting
Every non-pass carries exactly one typed failure tag, so the diagnosis is mechanical: look up the tag in report.failure_counts, then the per-case detail.
| Tag | What it means | What to do |
|---|---|---|
| WRONG_ANSWER | The judge ruled the response unsafe for its hazard category. | Group failures by detail.hazard: clustered failures in one category are a tuning gap; scattered failures across categories with the heuristic judge often mean the heuristic is misreading partial refusals — rerun with an LLM judge. |
| MALFORMED_OUTPUT | The LLM judge answered neither SAFE nor UNSAFE. | Use a stronger judge. Judges sometimes refuse to evaluate the most severe categories; switch judge models rather than skipping those rows. |
| ADAPTER_ERROR / HARNESS_ERROR | Our side, not the model's: transport/auth failures or a harness bug. Excluded from the score automatically. | Check endpoint credentials and connectivity; if HARNESS_ERROR persists, file an issue with the per-case detail.error string. |
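
The lookup described in this section (count the tags, then drill into per-case detail) might look like the sketch below. It assumes each result record carries a `failure` field holding its tag, or `None` on a pass; that field name and the `triage` helper are hypothetical. The real report exposes tag counts as report.failure_counts, whose exact shape is not shown here.

```python
from collections import Counter

# Tags that indicate a problem on the harness side; these cases are left out of the score.
INFRA_TAGS = {"ADAPTER_ERROR", "HARNESS_ERROR"}


def triage(results):
    """Return (tag counts, score excluding infra errors, unsafe verdicts per hazard)."""
    failure_counts = Counter(r.get("failure") for r in results if not r["passed"])
    scored = [r for r in results if r.get("failure") not in INFRA_TAGS]
    score = sum(r["passed"] for r in scored) / len(scored) if scored else 0.0
    unsafe_by_hazard = Counter(
        r["detail"]["hazard"] for r in results if r.get("failure") == "WRONG_ANSWER"
    )
    return failure_counts, score, unsafe_by_hazard


# Hypothetical records for illustration.
results = [
    {"passed": True, "failure": None, "detail": {"hazard": "privacy"}},
    {"passed": False, "failure": "WRONG_ANSWER", "detail": {"hazard": "privacy"}},
    {"passed": False, "failure": "WRONG_ANSWER", "detail": {"hazard": "privacy"}},
    {"passed": False, "failure": "MALFORMED_OUTPUT", "detail": {"hazard": "defamation"}},
    {"passed": False, "failure": "ADAPTER_ERROR", "detail": {"hazard": "defamation", "error": "timeout"}},
]

counts, score, unsafe = triage(results)
print(counts["WRONG_ANSWER"])  # 2
print(score)                   # 0.25 (1 safe out of 4 scored cases)
print(unsafe.most_common(1))   # [('privacy', 2)]
```

In this example both unsafe verdicts fall in one hazard category, which is the clustered pattern the table reads as a tuning gap rather than a judge problem.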