[Paper] Evaluating Metrics for Safety with LLM-as-Judges

Published: December 17, 2025 at 12:24 PM EST
3 min read
Source: arXiv - 2512.15617v1

Overview

The paper Evaluating Metrics for Safety with LLM-as-Judges examines how we can reliably assess the safety of large language models (LLMs) when they are used as automated “judges” in critical decision‑making pipelines. By proposing a multi‑metric evaluation framework, the authors show how to flag uncertain or high‑risk judgments for human review, aiming to make LLM‑driven workflows safer for domains such as healthcare triage or nuclear‑facility scheduling.

Key Contributions

  • Safety‑focused evaluation paradigm: Shifts the discussion from “how good is the model?” to “how trustworthy are its judgments in safety‑critical contexts?”
  • Basket‑of‑metrics approach: Introduces a weighted set of complementary metrics (e.g., factual consistency, confidence calibration, error severity) to capture different failure modes.
  • Context‑sensitive error severity: Defines a taxonomy that grades mistakes by real‑world impact, allowing the system to treat a harmless typo differently from a dangerous mis‑triage (a minimal sketch follows this list).
  • Dynamic confidence thresholds: Proposes a mechanism that triggers human oversight when evaluator agreement falls below a configurable confidence level.
  • Empirical validation: Demonstrates the framework on two simulated safety‑critical tasks (post‑operative care triage and nuclear site‑access scheduling) using LLM‑as‑Judge (LaJ) pipelines.
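
The paper describes the severity taxonomy conceptually rather than as code. Below is a minimal sketch of what such a grading could look like; the severity levels and penalty weights are illustrative assumptions, not values from the paper:

```python
from enum import Enum

class ErrorSeverity(Enum):
    """Hypothetical grading of judgment errors by real-world impact."""
    NEGLIGIBLE = 1   # e.g., a typo in the judge's rationale text
    MINOR = 2        # e.g., a mislabeled but non-actionable field
    MAJOR = 3        # e.g., a delayed escalation of a borderline case
    CRITICAL = 4     # e.g., a dangerous mis-triage or mis-assignment

# Illustrative penalty weights; a real deployment would calibrate these per domain.
SEVERITY_PENALTY = {
    ErrorSeverity.NEGLIGIBLE: 0.0,
    ErrorSeverity.MINOR: 0.2,
    ErrorSeverity.MAJOR: 0.6,
    ErrorSeverity.CRITICAL: 1.0,
}
```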

Methodology

  1. LLM‑as‑Judge (LaJ) pipeline: The target LLM generates a decision (e.g., “patient needs ICU”) and a separate LLM instance evaluates that decision, producing a score or verdict.
  2. Metric basket construction: The authors combine several automatic metrics—such as
    • Factual consistency (does the judgment align with source documents?),
    • Calibration confidence (how certain is the LaJ?),
    • Semantic similarity (how close is the judgment to a gold‑standard answer?), and
    • Domain‑specific severity weighting (assigning higher penalties to errors that could cause harm).
  3. Weighted aggregation: Each metric receives a weight reflecting its relevance to the task; the weighted sum yields an overall safety score.
  4. Thresholding & human‑in‑the‑loop: If the safety score drops below a pre‑set threshold or if multiple LaJ instances disagree, the case is escalated to a human reviewer (steps 2–4 are sketched in code after this list).
  5. Experimental setup: Two benchmark datasets were created to mimic real‑world safety scenarios. The authors ran several LLM families (GPT‑4, Claude, Llama 2) through the LaJ pipeline, recording metric values, agreement rates, and downstream error costs.
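
The paper does not ship reference code; the sketch below shows how steps 2–4 might fit together, with metric names, weights, and thresholds chosen purely for illustration rather than taken from the paper:

```python
from dataclasses import dataclass

@dataclass
class JudgeVerdict:
    """One LaJ evaluation of a candidate decision (illustrative fields)."""
    factual_consistency: float     # 0-1, alignment with source documents
    calibration_confidence: float  # 0-1, the judge's self-reported certainty
    semantic_similarity: float     # 0-1, closeness to a gold-standard answer
    severity_penalty: float        # 0-1, domain-specific weight of the worst plausible error

# Illustrative weights; the paper tunes them per task.
WEIGHTS = {
    "factual_consistency": 0.35,
    "calibration_confidence": 0.25,
    "semantic_similarity": 0.20,
    "severity_penalty": 0.20,
}

def safety_score(v: JudgeVerdict) -> float:
    """Weighted aggregation of the metric basket; higher means safer."""
    return (
        WEIGHTS["factual_consistency"] * v.factual_consistency
        + WEIGHTS["calibration_confidence"] * v.calibration_confidence
        + WEIGHTS["semantic_similarity"] * v.semantic_similarity
        + WEIGHTS["severity_penalty"] * (1.0 - v.severity_penalty)  # penalties pull the score down
    )

def needs_human_review(verdicts: list[JudgeVerdict],
                       score_threshold: float = 0.8,
                       max_score_spread: float = 0.15) -> bool:
    """Escalate when the aggregate score is low or the LaJ instances disagree."""
    scores = [safety_score(v) for v in verdicts]
    return min(scores) < score_threshold or (max(scores) - min(scores)) > max_score_spread
```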

Results & Findings

| Task | Model | Avg. Safety Score | Human‑Escalation Rate | Critical Error Reduction |
| --- | --- | --- | --- | --- |
| Post‑op triage | GPT‑4 | 0.84 | 12 % | 68 % fewer high‑severity errors |
| Site‑access schedule | Claude | 0.78 | 15 % | 61 % fewer dangerous mis‑assignments |
| Site‑access schedule | Llama 2 | 0.71 | 22 % | 45 % reduction |

  • Higher safety scores correlate with lower incidence of severe mistakes.
  • Dynamic thresholds cut the number of catastrophic errors by more than half while keeping human workload manageable (≈10‑15 % of cases).
  • Weighted metrics outperform any single metric in predicting when a judgment needs review.

The authors also show that agreement among multiple LaJ evaluators is a strong predictor of judgment reliability, supporting the use of ensemble‑style confidence checks.
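
The exact agreement statistic is not spelled out in this summary; one plausible ensemble-style check is the fraction of judge verdicts that match the majority label, sketched below with a hypothetical label set:

```python
from collections import Counter

def judge_agreement(verdicts: list[str]) -> float:
    """Fraction of LaJ verdicts that agree with the majority label."""
    counts = Counter(verdicts)
    _, majority_count = counts.most_common(1)[0]
    return majority_count / len(verdicts)

# Example: two of three judges agree -> agreement of ~0.67, which a
# deployment could treat as "below threshold, escalate to a human".
assert abs(judge_agreement(["escalate", "escalate", "routine"]) - 2 / 3) < 1e-9
```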

Practical Implications

  • Safer automation pipelines: Companies can embed LaJ evaluators with the proposed metric basket to automatically gate LLM outputs before they affect patient care, industrial safety, or compliance reporting.
  • Human‑in‑the‑loop scaling: By only surfacing low‑confidence cases, teams can focus expert attention where it matters most, reducing review fatigue and operational costs.
  • Regulatory alignment: The severity‑aware scoring aligns with risk‑based compliance frameworks (e.g., FDA’s Good Machine Learning Practice), making it easier to justify LLM deployment to auditors.
  • Tooling roadmap: The paper’s methodology can be wrapped into a lightweight SDK that plugs into existing LLM APIs, exposing configurable metric weights and escalation thresholds for different domains.
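
No such SDK ships with the paper; the configuration shape below is a hypothetical illustration of how per-domain metric weights and escalation thresholds could be exposed to integrators:

```python
from dataclasses import dataclass, field

@dataclass
class SafetyGateConfig:
    """Hypothetical per-domain settings for an LaJ safety gate."""
    metric_weights: dict = field(default_factory=dict)
    score_threshold: float = 0.8       # escalate below this safety score
    min_judge_agreement: float = 0.67  # escalate below this agreement rate
    num_judges: int = 3                # ensemble size of LaJ evaluators

# Illustrative presets; real deployments would calibrate these with domain experts.
POSTOP_TRIAGE = SafetyGateConfig(
    metric_weights={"factual_consistency": 0.40, "severity_penalty": 0.35,
                    "calibration_confidence": 0.25},
    score_threshold=0.85,
)
SITE_ACCESS_SCHEDULING = SafetyGateConfig(
    metric_weights={"factual_consistency": 0.30, "severity_penalty": 0.40,
                    "calibration_confidence": 0.30},
    score_threshold=0.80,
)
```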

Limitations & Future Work

  • Synthetic evaluation data: The experiments rely on constructed datasets; real‑world deployments may reveal additional failure modes.
  • Metric calibration overhead: Determining optimal weights and thresholds requires domain expertise and iterative tuning, which could be costly for niche applications.
  • Scalability of multiple LaJ instances: Running several evaluator models in parallel adds latency and compute expense, a factor for high‑throughput systems.
  • Future directions: The authors suggest exploring adaptive weight learning (e.g., reinforcement learning from human feedback) and extending the framework to multimodal inputs (images, sensor data) where safety judgments are also critical.

Authors

  • Kester Clegg
  • Richard Hawkins
  • Ibrahim Habli
  • Tom Lawton

Paper Information

  • arXiv ID: 2512.15617v1
  • Categories: cs.CL, cs.AI
  • Published: December 17, 2025