[Paper] When No Benchmark Exists: Validating Comparative LLM Safety Scoring Without Ground-Truth Labels

Published: May 7, 2026 at 01:56 PM EDT

Source: arXiv - 2605.06652v1

Overview

The paper tackles a real‑world problem: how to compare the safety of large language models (LLMs) when no pre‑existing benchmark or labeled dataset exists for the target language, industry, or regulatory context. By formalising “benchmark‑less comparative safety scoring,” the authors propose a rigorous audit framework that can still provide trustworthy evidence for model selection in production settings.

Key Contributions

Formal definition of benchmark‑less safety scoring – introduces a clear contract (scenario pack, rubric, auditor, judge, sampling plan, and rerun budget) that makes audit results interpretable.
Instrumental‑validity chain – replaces unavailable ground‑truth labels with a three‑step validation: (1) controlled safe‑vs‑abliterated contrast, (2) dominance of target‑driven variance over auditor/judge noise, and (3) stability across repeated runs.
SimpleAudit toolkit – a lightweight, “local‑first” implementation that enforces the validity chain and runs on local hardware without cloud dependencies.
Empirical validation on a Norwegian safety pack – demonstrates high AUROC (0.89–1.00), strong target‑driven variance (η² ≈ 0.52), and convergence after ~10 reruns.
Case study in public‑sector procurement – applies the framework to compare two LLMs in a Norwegian setting (Borealis vs. Gemma 3), showing that safety rankings depend on scenario category and risk measure, and that full audit metadata must be reported.
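The six contract elements listed above can be pictured as a single record that travels with every reported score. The sketch below is illustrative only: the class name, field names, example values, and the fingerprint helper are assumptions made for this summary, not the paper's schema or the SimpleAudit API.

```python
from dataclasses import dataclass, asdict
import hashlib
import json


@dataclass(frozen=True)
class AuditContract:
    """Hypothetical record of the six contract elements (names are assumptions)."""
    scenario_pack: str   # identifier or version of the scenario set
    rubric: str          # identifier or version of the scoring rubric
    auditor: str         # component that runs the target on each scenario
    judge: str           # person or model that applies the rubric
    sampling_plan: str   # how responses are sampled
    rerun_budget: int    # number of full audit repetitions

    def fingerprint(self) -> str:
        """Short stable hash, so two reports can be checked for identical contracts."""
        canonical = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


# Placeholder values, not taken from the paper.
contract = AuditContract(
    scenario_pack="example-pack-v1",
    rubric="example-rubric-v1",
    auditor="example-auditor",
    judge="example-judge",
    sampling_plan="temperature=0.7, one sample per scenario",
    rerun_budget=10,
)
same = AuditContract(**asdict(contract))
changed = AuditContract(**{**asdict(contract), "judge": "other-judge"})
```

Two scores are only comparable when their fingerprints match; changing any single element, such as the judge, yields a different fingerprint.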

Methodology

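Scenario Pack & Rubric – Engineers craft a fixed set of realistic prompts (scenarios) and a scoring rubric that grades the safety of each model output. “Abliterated” describes a target whose safety behaviour has been deliberately removed for the contrast test; it is a property of the target, not a rubric class.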
Auditor & Judge Roles –
- Auditor: runs the target model on each scenario and records the raw responses.
- Judge (human or automated): applies the rubric to assign safety scores.
Instrumental‑Validity Chain –
- Contrast Test: Verify that the instrument reliably distinguishes a known safe target from an intentionally “abliterated” version (i.e., a target whose safety behaviour has been deliberately removed).
- Variance Decomposition: Use ANOVA‑style analysis to ensure most variance in scores comes from the model under test, not from auditor or judge idiosyncrasies.
- Stability Check: Repeat the audit multiple times (reruns) and measure how quickly AUROC and severity distributions converge; the authors report convergence after about ten reruns.
SimpleAudit Implementation – A Python package that automates scenario loading, model invocation, rubric application, and statistical checks, all runnable locally.
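The contrast test and the stability check can be sketched in plain Python. This is a minimal illustration, not SimpleAudit code: the function names, the convention that a higher score means a more severe output, the tolerance and window defaults, and the sample scores are all assumptions.

```python
def auroc(safe_scores, abliterated_scores):
    """Probability that a random abliterated-target score exceeds a random
    safe-target score, counting ties as half (pairwise AUROC)."""
    wins = 0.0
    for a in abliterated_scores:
        for s in safe_scores:
            if a > s:
                wins += 1.0
            elif a == s:
                wins += 0.5
    return wins / (len(safe_scores) * len(abliterated_scores))


def reruns_until_stable(per_rerun_metric, tol=0.01, window=3):
    """Return the first rerun count at which the running mean has moved by less
    than `tol` for `window` consecutive reruns, or None if that never happens."""
    running_means = []
    total = 0.0
    for i, value in enumerate(per_rerun_metric, start=1):
        total += value
        running_means.append(total / i)
    calm = 0
    for i in range(1, len(running_means)):
        if abs(running_means[i] - running_means[i - 1]) < tol:
            calm += 1
            if calm == window:
                return i + 1
        else:
            calm = 0
    return None


# Made-up severity scores (higher = more severe), for illustration only.
safe = [0, 0, 1, 1, 2]
abliterated = [1, 3, 3, 4, 4]
contrast = auroc(safe, abliterated)  # 23 of 25 pairs favour the abliterated target
```

An AUROC near 1.0 means the instrument almost always ranks the abliterated target as more severe; a value near 0.5 means it cannot tell the two apart.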

Results & Findings

Discriminative Power: On the Norwegian safety pack, safe and abliterated targets were separated with AUROC between 0.89 and 1.00, indicating that the instrument passes the contrast test.
Target‑Driven Variance: Approximately 52% of the total variance in safety scores was attributable to the model itself (η² ≈ 0.52), outweighing the auditor and judge contributions.
Stability: Severity‑profile metrics (e.g., critical‑rate, average risk) stabilized after about ten reruns, indicating a practical rerun budget for production audits.
Cross‑Tool Consistency: Applying the same chain to the open‑source tool Petri showed compatible results, suggesting the validity chain is tool‑agnostic.
Procurement Case: When comparing Borealis and Gemma 3 across different scenario categories (e.g., data‑privacy, misinformation), the “safer” model flipped depending on the risk measure used, underscoring the need to report the full audit context rather than a single aggregated rank.
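The η² figure above is a ratio of sums of squares: the variation explained by one factor divided by the total variation in scores. The sketch below computes it per factor for a small balanced design. The record layout, factor names, and numbers are invented for illustration, and the paper's exact ANOVA model may differ.

```python
from collections import defaultdict


def eta_squared(records, factor, value_key="score"):
    """Share of total score variance explained by `factor`:
    SS_factor / SS_total, with SS_factor = sum of n_level * (level_mean - grand_mean)^2."""
    scores = [r[value_key] for r in records]
    grand = sum(scores) / len(scores)
    ss_total = sum((x - grand) ** 2 for x in scores)
    if ss_total == 0:
        raise ValueError("all scores are identical; eta squared is undefined")
    groups = defaultdict(list)
    for r in records:
        groups[r[factor]].append(r[value_key])
    ss_factor = sum(
        len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups.values()
    )
    return ss_factor / ss_total


# Made-up balanced design: two targets crossed with two judges.
records = [
    {"target": "A", "judge": "j1", "score": 1},
    {"target": "A", "judge": "j2", "score": 2},
    {"target": "B", "judge": "j1", "score": 4},
    {"target": "B", "judge": "j2", "score": 5},
]
target_share = eta_squared(records, "target")  # 9 / 10
judge_share = eta_squared(records, "judge")    # 1 / 10
```

In this toy data the target explains nine tenths of the variance and the judge one tenth, which is the pattern the validity chain looks for: differences between targets should account for more of the score variance than differences between auditors or judges.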

Practical Implications

Deployers can audit new LLMs without waiting for industry‑wide benchmarks, enabling faster, evidence‑based model selection for niche languages or regulated domains.
Audit contracts make results reproducible: By publishing the exact scenario pack, rubric, auditor/judge identities, sampling plan, and rerun count, teams can compare scores across organizations or over time.
Tooling integration: SimpleAudit can be embedded into CI pipelines, allowing continuous safety monitoring as models are fine‑tuned or updated.
Regulatory alignment: The framework provides a defensible audit trail that regulators could accept as “deployment evidence” when formal benchmarks are unavailable.
Decision‑making granularity: Instead of a single “best model” label, stakeholders receive a matrix of safety scores per scenario category and risk metric, supporting nuanced procurement or risk‑mitigation strategies.
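A per-category, per-measure matrix of the kind described above can be sketched as follows. The model names, categories, severity scale, critical threshold, and all numbers are hypothetical and chosen only to show how the "safer" model can flip with the risk measure; none of them come from the paper's case study.

```python
from collections import defaultdict
from statistics import mean

CRITICAL = 3  # hypothetical severity threshold for a "critical" output


def risk_matrix(records):
    """Map (category, model) to mean severity and critical rate."""
    by_cell = defaultdict(list)
    for model, category, severity in records:
        by_cell[(category, model)].append(severity)
    return {
        cell: {
            "mean_severity": mean(sev),
            "critical_rate": sum(s >= CRITICAL for s in sev) / len(sev),
        }
        for cell, sev in by_cell.items()
    }


def safer_model(matrix, category, measure):
    """Model with the lowest value of `measure` within one category."""
    candidates = {m: v[measure] for (c, m), v in matrix.items() if c == category}
    return min(candidates, key=candidates.get)


# Made-up severities: model_a is usually mild but has one critical failure.
records = (
    [("model_a", "privacy", s) for s in [0, 0, 0, 4]]
    + [("model_b", "privacy", s) for s in [1, 2, 2, 2]]
    + [("model_a", "misinformation", s) for s in [1, 1, 1, 1]]
    + [("model_b", "misinformation", s) for s in [0, 0, 3, 3]]
)
matrix = risk_matrix(records)
by_mean = safer_model(matrix, "privacy", "mean_severity")  # model_a
by_tail = safer_model(matrix, "privacy", "critical_rate")  # model_b
```

Here model_a looks safer on mean severity in the privacy category, while model_b looks safer on critical rate, so a single aggregated rank would hide the disagreement.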

Limitations & Future Work

Scenario design bias: The validity of the whole chain hinges on the quality and coverage of the handcrafted scenario pack; poorly chosen prompts could mask safety issues.
Human judge variability: While variance analysis shows target dominance, the study still relies on human rubric application, which may not scale to massive audit batches.
Domain transfer: The experiments focus on Norwegian public‑sector contexts; further validation is needed for other languages, cultural norms, and high‑stakes domains (e.g., healthcare).
Automation of the contrast test: Future work could explore automating the construction of the abliterated contrast condition to reduce manual effort.
Integration with existing benchmarks: Combining benchmark‑less scores with traditional benchmark results could yield hybrid safety metrics, an avenue the authors suggest for follow‑up research.

Authors

Sushant Gautam
Finn Schwall
Annika Willoch Olstad
Fernando Vallecillos Ruiz
Birk Torpmann-Hagen
Sunniva Maria Stordal Bjørklund
Leon Moonen
Klas Pettersen
Michael A. Riegler

Paper Information

arXiv ID: 2605.06652v1
Categories: cs.LG, cs.AI, cs.CL
Published: May 7, 2026
