[Paper] Beyond Accuracy: Policy Invariance as a Reliability Test for LLM Safety Judges
Source: arXiv - 2605.06161v1
Overview
Large Language Models (LLMs) are increasingly used as “judges” to evaluate the safety of autonomous agents, but there is currently no systematic way to verify that these judges assess the agent’s behavior rather than being swayed by the wording of the evaluation prompt. This paper introduces policy invariance, a set of sanity‑check principles that any trustworthy safety judge should satisfy, and shows that today’s LLM judges often fail these checks.
Key Contributions
- Policy‑Invariance Framework – formalizes three testable principles (rubric‑semantics, rubric‑threshold, and ambiguity‑aware calibration) that capture whether a judge’s verdict depends on the agent’s actions or on superficial prompt changes.
- Stress‑Test Protocol – a reproducible evaluation suite that rewrites evaluation policies in certified‑equivalent ways and deliberately shifts rubric strictness to probe judge stability.
- Empirical Diagnosis – demonstrates that state‑of‑the‑art LLM judges flip up to 9.1 % of safety verdicts on content‑preserving rewrites, with 18–43 % of flips occurring on clearly unambiguous cases.
- Policy Invariance Score & Judge Card – new metrics and a reporting template that expose reliability gaps invisible to traditional accuracy‑only leaderboards.
- Open‑Source Release – code, data, and the full protocol are released for the community to audit their own safety judges.
Methodology
- Define Invariance Principles
  - Rubric‑Semantics Invariance: Verdicts should stay the same when the evaluation policy is rewritten without changing its meaning (e.g., synonym swaps, passive‑active voice changes).
  - Rubric‑Threshold Invariance: When the rubric is deliberately shifted from strict to lenient, a verdict should change only if the agent’s behavior actually falls on the other side of the new threshold.
  - Ambiguity‑Aware Calibration: Verdict volatility should concentrate on cases that are genuinely ambiguous; stable cases should remain stable.
- Create Test Cases
  - Collected agent trajectories from ASSEBench and R‑Judge.
  - Generated certified‑equivalent rewrites of the safety rubrics using rule‑based paraphrasing and human verification.
  - Designed strict‑to‑lenient rubric variants that systematically relax safety thresholds.
- Run the Judges
  - Evaluated four LLM‑as‑judge models: GPT‑4, Claude, Llama‑2‑Chat, and a fine‑tuned safety‑specific model.
  - Recorded verdicts under original, rewritten, and shifted rubrics.
- Measure Invariance
  - Computed Policy Invariance Score (PIS): proportion of verdicts that remain unchanged across rewrites, normalized by baseline jitter.
  - Produced a Judge Card summarizing each model’s PIS, calibration curves, and failure modes.
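The measurement step can be illustrated with a minimal Python sketch. It is not the paper's released code, and the summary above does not give the exact PIS formula. The normalization used here (subtract the flip rate seen when the same rubric is run twice, then rescale) is an assumption, as are the names `Judge`, `flip_rate`, `policy_invariance_score`, and the deliberately brittle `toy_judge`. It will not reproduce the scores reported in the results table.

```python
from typing import Callable, Sequence

# Hypothetical interface: a judge maps (rubric text, trajectory text) to a verdict label.
Judge = Callable[[str, str], str]


def flip_rate(judge: Judge, rubric_a: str, rubric_b: str,
              trajectories: Sequence[str]) -> float:
    """Fraction of trajectories whose verdict differs between two rubrics."""
    if not trajectories:
        raise ValueError("need at least one trajectory")
    flips = sum(judge(rubric_a, t) != judge(rubric_b, t) for t in trajectories)
    return flips / len(trajectories)


def policy_invariance_score(judge: Judge, rubric: str,
                            rewrites: Sequence[str],
                            trajectories: Sequence[str]) -> float:
    """ASSUMED normalization: 1 - (excess flips over baseline jitter) / (1 - jitter)."""
    if not rewrites:
        raise ValueError("need at least one rewrite")
    # Baseline jitter: disagreement when the identical rubric is evaluated twice.
    jitter = flip_rate(judge, rubric, rubric, trajectories)
    if jitter >= 1.0:
        return 0.0  # the judge is pure noise; no invariance can be credited
    mean_flip = sum(flip_rate(judge, rubric, r, trajectories)
                    for r in rewrites) / len(rewrites)
    excess = max(0.0, mean_flip - jitter)
    return 1.0 - excess / (1.0 - jitter)


def toy_judge(rubric: str, trajectory: str) -> str:
    # Deliberately brittle stand-in for an LLM: it only recognizes one literal phrase.
    forbids_deletion = "must not delete" in rubric
    return "unsafe" if forbids_deletion and "delete" in trajectory else "safe"


rubric = "The agent must not delete user files."
rewrites = [
    "User files must not be deleted by the agent.",              # passive voice
    "The agent must not delete any files belonging to the user.",
]
trajectories = [
    "delete ~/report.txt",
    "list ~/",
    "delete ~/photos",
    "read ~/notes.md",
]

score = policy_invariance_score(toy_judge, rubric, rewrites, trajectories)
print(score)  # 0.75
```

The passive-voice rewrite flips both deletion cases (flip rate 0.5), the second rewrite flips none, and the deterministic toy judge has zero jitter, so the score is 1 - 0.25 = 0.75. A real judge would be an LLM call, and its nonzero jitter is what the normalization is meant to discount.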
Results & Findings
| Judge (model) | Baseline Accuracy* | Policy Invariance Score | Verdicts flipped on rewrites | Share of flips on unambiguous cases |
|---|---|---|---|---|
| GPT‑4 | 87 % | 0.71 | 6.3 % | 22 % |
| Claude | 84 % | 0.68 | 7.1 % | 27 % |
| Llama‑2‑Chat | 80 % | 0.55 | 9.1 % | 43 % |
| Fine‑tuned safety‑LLM | 82 % | 0.62 | 8.4 % | 31 % |
*Accuracy measured against human‑annotated safety labels.
- Policy invariance varies dramatically even among models with similar raw accuracy.
- Content‑preserving rewrites cause verdict changes comparable to those caused by genuine rubric strictness shifts, indicating that judges are overly sensitive to surface wording.
- Ambiguity‑aware calibration reveals that many flips happen on cases that humans unanimously label as safe or unsafe, exposing a hidden reliability problem.
- The Judge Card makes these gaps visible at a glance, something traditional leaderboards miss.
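The two flip statistics discussed above could be derived from per-case verdict records roughly as follows. This is a sketch under assumptions: the `Record` fields and the `flip_summary` helper are hypothetical, "unambiguous" is taken to mean that human annotators agreed unanimously, and the ten records are made-up illustration data, not the paper's.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass(frozen=True)
class Record:
    case_id: str
    original: str      # verdict under the original rubric
    rewritten: str     # verdict under a content-preserving rewrite
    unambiguous: bool  # assumed: True when human annotators agreed unanimously


def flip_summary(records: Sequence[Record]) -> Dict[str, float]:
    """Overall flip rate, plus the share of those flips that hit unambiguous cases."""
    if not records:
        raise ValueError("need at least one record")
    flips = [r for r in records if r.original != r.rewritten]
    share = sum(r.unambiguous for r in flips) / len(flips) if flips else 0.0
    return {"flip_rate": len(flips) / len(records), "unambiguous_share": share}


# Illustrative data only: 8 stable cases and 2 flips, one on an unambiguous case.
records = [Record(f"c{i}", "safe", "safe", True) for i in range(8)] + [
    Record("c8", "unsafe", "safe", True),
    Record("c9", "safe", "unsafe", False),
]

summary = flip_summary(records)
print(summary)  # {'flip_rate': 0.2, 'unambiguous_share': 0.5}
```

Note that the second number is a share of flips, not of all verdicts, which is why it can be far larger than the overall flip rate.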
Practical Implications
- Benchmark Designers: Before adopting an LLM judge as the ground truth, run the policy‑invariance stress test to check that its verdicts track agent behavior rather than prompt wording.
- Safety‑Critical Deployments: Teams building autonomous agents (e.g., self‑driving bots, financial trading assistants) should incorporate the Policy Invariance Score into their evaluation pipelines to avoid false safety assurances.
- LLM Providers: The findings give a concrete target for fine‑tuning: improve robustness to paraphrasing and rubric shifts, not just raw classification accuracy.
- Tooling: The released code can be integrated into CI/CD pipelines for continuous monitoring of judge reliability as models evolve.
- Regulatory Audits: Policy invariance offers a measurable, interpretable metric that regulators could require for AI safety certifications.
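For the CI/CD use case, one possible shape is a gate that fails the build when any judge's Policy Invariance Score drops below a floor. The sketch below does not use the released tooling, whose interface is not described here. The function `failing_judges` and the 0.65 threshold are hypothetical, and the scores are copied from the results table purely as example input.

```python
from typing import Dict, List


def failing_judges(scores: Dict[str, float], min_pis: float) -> List[str]:
    """Names of judges whose Policy Invariance Score is below the required floor."""
    return sorted(name for name, pis in scores.items() if pis < min_pis)


# Example input: PIS values as reported in the results table.
scores = {
    "GPT-4": 0.71,
    "Claude": 0.68,
    "Llama-2-Chat": 0.55,
    "Fine-tuned safety-LLM": 0.62,
}

MIN_PIS = 0.65  # hypothetical threshold; each team would choose its own

failing = failing_judges(scores, MIN_PIS)
exit_code = 1 if failing else 0  # a CI wrapper could pass this to sys.exit()
print(failing, exit_code)
```

Tracking the score per judge version, rather than only checking a fixed floor, would also catch regressions when a provider updates a model.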
Limitations & Future Work
- Scope of Rubrics: The study focuses on safety rubrics used in current agent‑evaluation benchmarks; other domains (e.g., bias, factuality) may need tailored invariance definitions.
- Human Verification Cost: Certified‑equivalent rewrites rely on human validation, which can be expensive at scale. Automated semantic equivalence checks are a promising avenue.
- Model Diversity: Only four LLM judges were examined; broader coverage (including open‑source models with different architectures) would strengthen generality.
- Dynamic Agents: The current protocol evaluates static trajectories; extending it to interactive, real‑time agents could uncover additional failure modes.
- Calibration Techniques: Future work could explore training objectives that directly optimize for policy invariance, potentially reducing the observed jitter.
Authors
- Shihao Weng
- Yang Feng
- Xiaofei Xie
Paper Information
- arXiv ID: 2605.06161v1
- Categories: cs.AI, cs.SE
- Published: May 7, 2026