[Paper] Evaluating Metrics for Safety with LLM-as-Judges
Source: arXiv - 2512.15617v1
Overview
The paper Evaluating Metrics for Safety with LLM-as-Judges examines how we can reliably assess the safety of large language models (LLMs) when they are used as automated “judges” in critical decision‑making pipelines. By proposing a multi‑metric evaluation framework, the authors show how to flag uncertain or high‑risk judgments for human review, aiming to make LLM‑driven workflows safer for domains such as healthcare triage or nuclear‑facility scheduling.
Key Contributions
- Safety‑focused evaluation paradigm: Shifts the discussion from “how good is the model?” to “how trustworthy are its judgments in safety‑critical contexts?”
- Basket‑of‑metrics approach: Introduces a weighted set of complementary metrics (e.g., factual consistency, confidence calibration, error severity) to capture different failure modes.
- Context‑sensitive error severity: Defines a taxonomy that grades mistakes by real‑world impact, allowing the system to treat a harmless typo differently from a dangerous mis‑triage (see the sketch after this list).
- Dynamic confidence thresholds: Proposes a mechanism that triggers human oversight when evaluator agreement falls below a configurable confidence level.
- Empirical validation: Demonstrates the framework on two simulated safety‑critical tasks (post‑operative care triage and nuclear site‑access scheduling) using LLM‑as‑Judge (LaJ) pipelines.
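The severity taxonomy lends itself to a simple lookup from error type to penalty weight. Below is a minimal sketch of that idea; the category names, weights, and error types are hypothetical assumptions for illustration, not the taxonomy actually used in the paper.

```python
# Sketch of a context-sensitive severity taxonomy (hypothetical grades and weights).
from enum import Enum

class Severity(Enum):
    NEGLIGIBLE = 0.05   # e.g., a harmless typo in free-text notes
    MINOR = 0.25        # recoverable errors with limited real-world impact
    MAJOR = 0.60        # errors likely to cause delays or rework
    CRITICAL = 1.00     # errors that could cause harm, e.g., a mis-triage

# Hypothetical mapping from error type to severity grade for the triage domain.
TRIAGE_SEVERITY = {
    "typo_in_notes": Severity.NEGLIGIBLE,
    "wrong_ward_label": Severity.MINOR,
    "delayed_escalation": Severity.MAJOR,
    "missed_icu_referral": Severity.CRITICAL,
}

def severity_penalty(error_type: str) -> float:
    """Penalty weight applied when scoring a judgment; unknown errors default to MAJOR."""
    return TRIAGE_SEVERITY.get(error_type, Severity.MAJOR).value

print(severity_penalty("typo_in_notes"), severity_penalty("missed_icu_referral"))  # 0.05 1.0
```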
Methodology
- LLM‑as‑Judge (LaJ) pipeline: The target LLM generates a decision (e.g., “patient needs ICU”) and a separate LLM instance evaluates that decision, producing a score or verdict (the first sketch after this list illustrates the structure).
- Metric basket construction: The authors combine several automatic metrics, such as:
  - Factual consistency (does the judgment align with source documents?),
  - Calibration confidence (how certain is the LaJ?),
  - Semantic similarity (how close is the judgment to a gold‑standard answer?), and
  - Domain‑specific severity weighting (assigning higher penalties to errors that could cause harm).
- Weighted aggregation: Each metric receives a weight reflecting its relevance to the task; the weighted sum yields an overall safety score.
- Thresholding & human‑in‑the‑loop: If the safety score drops below a pre‑set threshold or if multiple LaJ instances disagree, the case is escalated to a human reviewer (a sketch of this scoring‑and‑escalation step follows this list).
- Experimental setup: Two benchmark datasets were created to mimic real‑world safety scenarios. The authors ran several LLM families (GPT‑4, Claude, Llama 2) through the LaJ pipeline, recording metric values, agreement rates, and downstream error costs.
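The following is a minimal sketch of the generate‑then‑judge structure. The prompts, stub models, and the 0‑to‑1 scoring convention are placeholders and assumptions, not the paper's actual setup; any chat‑completion client could stand in for the callables.

```python
# Sketch of an LLM-as-Judge (LaJ) pipeline: one model decides, another reviews.
from typing import Callable

def llm_as_judge(case_description: str,
                 decision_model: Callable[[str], str],
                 judge_model: Callable[[str], str]) -> tuple[str, str]:
    """Run the target model to get a decision, then ask a separate judge to rate it."""
    decision = decision_model(
        f"Given this post-operative case, recommend a care level:\n{case_description}"
    )
    verdict = judge_model(
        "You are reviewing a triage decision for safety.\n"
        f"Case: {case_description}\nDecision: {decision}\n"
        "Reply with a score from 0 (unsafe) to 1 (safe) and a one-line justification."
    )
    return decision, verdict

# Stub models so the sketch runs without any API access.
decision_stub = lambda prompt: "admit_icu"
judge_stub = lambda prompt: "0.9 - decision is consistent with the vitals described"
print(llm_as_judge("Post-op patient, falling blood pressure, high heart rate",
                   decision_stub, judge_stub))
```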
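The next sketch shows the weighted metric basket feeding a score‑based gate. The metric names, weights, and the 0.75 cut‑off are illustrative assumptions, not values taken from the paper.

```python
# Sketch of weighted aggregation over a basket of metrics, plus a score threshold.
# Hypothetical per-metric weights; in practice these would be tuned per domain.
WEIGHTS = {
    "factual_consistency": 0.35,
    "calibration_confidence": 0.25,
    "semantic_similarity": 0.20,
    "severity_weighted_accuracy": 0.20,
}

def safety_score(metrics: dict[str, float], weights: dict[str, float] = WEIGHTS) -> float:
    """Weighted sum of the metric basket; each metric is assumed to lie in [0, 1]."""
    total = sum(weights.values())
    return sum(weights[name] * metrics.get(name, 0.0) for name in weights) / total

def gate_judgment(metrics: dict[str, float], threshold: float = 0.75) -> str:
    """Accept when the aggregate score clears the threshold, otherwise escalate."""
    return "accept" if safety_score(metrics) >= threshold else "escalate_to_human"

# Example: scores produced by the judge model for one triage decision.
example = {
    "factual_consistency": 0.9,
    "calibration_confidence": 0.7,
    "semantic_similarity": 0.8,
    "severity_weighted_accuracy": 0.6,
}
print(safety_score(example), gate_judgment(example))  # ~0.77 -> 'accept'
```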
Results & Findings
| Task | Model | Avg. Safety Score | Human‑Escalation Rate | Critical Error Reduction |
|---|---|---|---|---|
| Post‑op triage | GPT‑4 | 0.84 | 12 % | 68 % fewer high‑severity errors |
| Site‑access schedule | Claude | 0.78 | 15 % | 61 % fewer dangerous mis‑assignments |
| Site‑access schedule | Llama 2 | 0.71 | 22 % | 45 % fewer critical errors |
- Higher safety scores correlate with lower incidence of severe mistakes.
- Dynamic thresholds cut the number of catastrophic errors by more than half while keeping human workload manageable (≈10‑15 % of cases).
- Weighted metrics outperform any single metric in predicting when a judgment needs review.
The authors also show that agreement among multiple LaJ evaluators is a strong predictor of judgment reliability, supporting the use of ensemble‑style confidence checks.
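One simple way to realise such an ensemble‑style check is to measure how often independent judge instances back the same verdict. The sketch below is illustrative only; the verdict labels and the 0.8 agreement cut‑off are assumptions, not the paper's settings.

```python
# Sketch of an agreement check across several judge instances.
from collections import Counter

def agreement_rate(verdicts: list[str]) -> float:
    """Fraction of judges voting for the most common verdict."""
    if not verdicts:
        return 0.0
    most_common_count = Counter(verdicts).most_common(1)[0][1]
    return most_common_count / len(verdicts)

verdicts = ["admit_icu", "admit_icu", "general_ward"]  # three judge instances
if agreement_rate(verdicts) < 0.8:
    print("Judges disagree -> escalate to human reviewer")
```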
Practical Implications
- Safer automation pipelines: Companies can embed LaJ evaluators with the proposed metric basket to automatically gate LLM outputs before they affect patient care, industrial safety, or compliance reporting.
- Human‑in‑the‑loop scaling: By only surfacing low‑confidence cases, teams can focus expert attention where it matters most, reducing review fatigue and operational costs.
- Regulatory alignment: The severity‑aware scoring aligns with risk‑based compliance frameworks (e.g., FDA’s Good Machine Learning Practice), making it easier to justify LLM deployment to auditors.
- Tooling roadmap: The paper’s methodology can be wrapped into a lightweight SDK that plugs into existing LLM APIs, exposing configurable metric weights and escalation thresholds for different domains.
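As a rough illustration of what such an SDK surface could look like, the sketch below exposes per‑domain metric weights and escalation thresholds as a configuration object. All class names, fields, and values are assumptions rather than an API described in the paper.

```python
# Sketch of a per-domain configuration surface for a safety-gating SDK (hypothetical).
from dataclasses import dataclass

@dataclass
class SafetyGateConfig:
    metric_weights: dict[str, float]     # e.g., {"factual_consistency": 0.4, ...}
    escalation_threshold: float = 0.75   # escalate when the safety score falls below this
    min_judge_agreement: float = 0.8     # escalate when judge agreement falls below this

TRIAGE_PROFILE = SafetyGateConfig(
    metric_weights={"factual_consistency": 0.4, "calibration": 0.3, "severity": 0.3},
    escalation_threshold=0.8,
)
SCHEDULING_PROFILE = SafetyGateConfig(
    metric_weights={"factual_consistency": 0.5, "calibration": 0.2, "severity": 0.3},
    escalation_threshold=0.7,
)
```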
Limitations & Future Work
- Synthetic evaluation data: The experiments rely on constructed datasets; real‑world deployments may reveal additional failure modes.
- Metric calibration overhead: Determining optimal weights and thresholds requires domain expertise and iterative tuning, which could be costly for niche applications.
- Scalability of multiple LaJ instances: Running several evaluator models in parallel adds latency and compute expense, a factor for high‑throughput systems.
- Future directions: The authors suggest exploring adaptive weight learning (e.g., reinforcement learning from human feedback) and extending the framework to multimodal inputs (images, sensor data) where safety judgments are also critical.
Authors
- Kester Clegg
- Richard Hawkins
- Ibrahim Habli
- Tom Lawton
Paper Information
- arXiv ID: 2512.15617v1
- Categories: cs.CL, cs.AI
- Published: December 17, 2025