[Paper] When No Benchmark Exists: Validating Comparative LLM Safety Scoring Without Ground-Truth Labels

Published: May 7, 2026 at 01:56 PM EDT
Source: arXiv - 2605.06652v1

Overview

The paper tackles a real‑world problem: how to compare the safety of large language models (LLMs) when no pre‑existing benchmark or labeled dataset exists for the target language, industry, or regulatory context. By formalising “benchmark‑less comparative safety scoring,” the authors propose a rigorous audit framework that can still provide trustworthy evidence for model selection in production settings.

Key Contributions

  • Formal definition of benchmark‑less safety scoring – introduces a clear contract (scenario pack, rubric, auditor, judge, sampling plan, and rerun budget) that makes audit results interpretable.
  • Instrumental‑validity chain – replaces unavailable ground‑truth labels with a three‑step validation: (1) controlled safe‑vs‑abliterated contrast, (2) dominance of target‑driven variance over auditor/judge noise, and (3) stability across repeated runs.
  • SimpleAudit toolkit – a lightweight, “local‑first” implementation that enforces the validity chain and can be run on any hardware without cloud dependencies.
  • Empirical validation on a Norwegian safety pack – demonstrates high AUROC (0.89–1.00), strong target‑driven variance (η² ≈ 0.52), and convergence after ~10 reruns.
  • Case study in public‑sector procurement – applies the framework to compare two LLMs (Borealis and Gemma 3), showing that safety rankings depend on scenario categories and risk measures, and that full audit metadata must be reported.
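
To make the audit contract concrete, here is a minimal sketch of how its six elements could be recorded alongside a set of scores. The class name `AuditContract`, its field names, and all example values are illustrative assumptions, not SimpleAudit's actual data model.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class AuditContract:
    """Hypothetical record of the six contract elements listed above."""
    scenario_pack: str   # identifier of the fixed scenario set
    rubric: str          # identifier/version of the scoring rubric
    auditor: str         # who or what runs the target on each scenario
    judge: str           # who or what applies the rubric
    sampling_plan: str   # free-text description of how responses are sampled
    rerun_budget: int    # number of repeated audit runs

    def __post_init__(self):
        if self.rerun_budget < 1:
            raise ValueError("rerun_budget must be at least 1")

    def to_json(self) -> str:
        # Sorted keys give a stable serialization that can be published with scores.
        return json.dumps(asdict(self), sort_keys=True)


# All values below are placeholders, not the paper's configuration.
contract = AuditContract(
    scenario_pack="example-pack-v1",
    rubric="example-rubric-v1",
    auditor="example-auditor",
    judge="example-judge",
    sampling_plan="one sampled response per scenario",
    rerun_budget=10,
)
print(contract.to_json())
```

Publishing such a record next to every score table is one way to keep results interpretable when they are compared later.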

Methodology

  1. Scenario Pack & Rubric – Engineers craft a fixed set of realistic prompts (scenarios) and a scoring rubric that rates how safe or unsafe each model output is.
  2. Auditor & Judge Roles
    • Auditor runs the model on each scenario, records raw responses.
    • Judge (human or automated) applies the rubric to assign safety scores.
  3. Instrumental‑Validity Chain
    • Contrast Test: Verify that the instrument reliably distinguishes a known safe target from an intentionally “abliterated” version (i.e., a variant of the model whose safety behaviour has been deliberately removed).
    • Variance Decomposition: Use ANOVA‑style analysis to ensure most variance in scores comes from the model under test, not from auditor or judge idiosyncrasies.
    • Stability Check: Repeat the audit multiple times (reruns) and measure how quickly AUROC and severity distributions converge; the authors find ten reruns sufficient.
  4. SimpleAudit Implementation – A Python package that automates scenario loading, model invocation, rubric application, and statistical checks, all runnable locally.
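
The contrast test and the stability check can be sketched in a few lines of plain Python. This is a rough illustration under stated assumptions, not the SimpleAudit implementation: it assumes the judge produces a numeric severity per response (higher means less safe), and the function names, tolerance, and window size are hypothetical choices.

```python
from statistics import mean


def auroc(safe_scores, abliterated_scores):
    """Probability that a randomly chosen abliterated-target score exceeds a
    randomly chosen safe-target score; ties count as half a win."""
    if not safe_scores or not abliterated_scores:
        raise ValueError("both groups need at least one score")
    wins = 0.0
    for a in abliterated_scores:
        for s in safe_scores:
            if a > s:
                wins += 1.0
            elif a == s:
                wins += 0.5
    return wins / (len(safe_scores) * len(abliterated_scores))


def reruns_until_stable(rerun_values, tolerance=0.01, window=3):
    """Smallest rerun count at which the running mean has moved by less than
    `tolerance` for `window` consecutive reruns, or None if that never happens."""
    running = [mean(rerun_values[:k]) for k in range(1, len(rerun_values) + 1)]
    calm = 0
    for k in range(1, len(running)):
        if abs(running[k] - running[k - 1]) < tolerance:
            calm += 1
            if calm >= window:
                return k + 1
        else:
            calm = 0
    return None


# Made-up severities (0 = harmless, 4 = critical) for the two contrast targets.
safe = [0, 0, 1, 0, 1]
abliterated = [3, 2, 4, 1, 3]
print(auroc(safe, abliterated))

# Made-up per-rerun critical rates for one target.
print(reruns_until_stable([0.20, 0.10, 0.15, 0.15, 0.15, 0.15]))
```

An AUROC near 0.5 would mean the instrument cannot tell the two targets apart, in which case comparative scores from it carry little evidence.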

Results & Findings

  • Discriminative Power: On the Norwegian safety pack, safe and abliterated targets were separated with AUROC between 0.89 and 1.00, indicating that the contrast test works.
  • Target‑Driven Variance: Approximately 52% of the total variance in safety scores was attributable to the model itself (η² ≈ 0.52), outweighing auditor and judge contributions.
  • Stability: Severity‑profile metrics (e.g., critical‑rate, average risk) stabilized after about ten reruns, indicating a practical rerun budget for production audits.
  • Cross‑Tool Consistency: Applying the same chain to the open‑source tool Petri showed compatible results, suggesting the validity chain is tool‑agnostic.
  • Procurement Case: When comparing Borealis and Gemma 3 across different scenario categories (e.g., data‑privacy, misinformation), the “safer” model flipped depending on the risk measure used, underscoring the need to report the full audit context rather than a single aggregated rank.
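
For readers unfamiliar with η², the following sketch computes it as the between-group share of the total sum of squares. It is a one-way simplification that groups scores by target only; the paper's decomposition also involves auditor and judge factors, and the function name and data here are illustrative assumptions.

```python
from statistics import mean


def eta_squared(scores_by_group):
    """One-way eta squared: SS_between / SS_total for scores grouped by target."""
    all_scores = [x for group in scores_by_group.values() for x in group]
    grand_mean = mean(all_scores)
    ss_total = sum((x - grand_mean) ** 2 for x in all_scores)
    if ss_total == 0:
        raise ValueError("scores have no variance")
    ss_between = sum(
        len(group) * (mean(group) - grand_mean) ** 2
        for group in scores_by_group.values()
    )
    return ss_between / ss_total


# Made-up severity scores for two hypothetical targets.
scores = {"target_a": [1, 2, 3], "target_b": [3, 4, 5]}
print(eta_squared(scores))  # SS_between = 6, SS_total = 10
```

Read this way, η² ≈ 0.52 says that about half of the spread in scores is explained by which target was audited.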

Practical Implications

  • Deployers can audit new LLMs without waiting for industry‑wide benchmarks, enabling faster, evidence‑based model selection for niche languages or regulated domains.
  • Audit contracts make results reproducible: By publishing the exact scenario pack, rubric, auditor/judge identities, sampling plan, and rerun count, teams can compare scores across organizations or over time.
  • Tooling integration: SimpleAudit can be embedded into CI pipelines, allowing continuous safety monitoring as models are fine‑tuned or updated.
  • Regulatory alignment: The framework provides a defensible audit trail that regulators could accept as “deployment evidence” when formal benchmarks are unavailable.
  • Decision‑making granularity: Instead of a single “best model” label, stakeholders receive a matrix of safety scores per scenario category and risk metric, supporting nuanced procurement or risk‑mitigation strategies.
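
The per-category matrix, and the way a ranking can flip with the risk measure, can be illustrated with a small sketch. The model names, category, severity scale (0 to 4), and critical threshold are all hypothetical; none of the numbers come from the paper.

```python
from collections import defaultdict
from statistics import mean

CRITICAL = 3  # assumed threshold: severity >= 3 counts as a critical response


def score_matrix(records):
    """Aggregate (model, category, severity) records into two risk measures
    per (model, category) cell."""
    grouped = defaultdict(list)
    for model, category, severity in records:
        grouped[(model, category)].append(severity)
    return {
        key: {
            "mean_severity": mean(values),
            "critical_rate": sum(v >= CRITICAL for v in values) / len(values),
        }
        for key, values in grouped.items()
    }


# Made-up judgments: model_a is uniformly mediocre, model_b is mostly
# harmless but has one critical failure.
records = (
    [("model_a", "privacy", 2)] * 4
    + [("model_b", "privacy", 0)] * 3
    + [("model_b", "privacy", 4)]
)

for (model, category), cell in sorted(score_matrix(records).items()):
    print(model, category, cell)
```

In this toy data model_b looks safer by mean severity while model_a looks safer by critical rate, which is the kind of disagreement a single aggregated rank would hide.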

Limitations & Future Work

  • Scenario design bias: The validity of the whole chain hinges on the quality and coverage of the handcrafted scenario pack; poorly chosen prompts could mask safety issues.
  • Human judge variability: While variance analysis shows target dominance, the study still relies on human rubric application, which may not scale to massive audit batches.
  • Domain transfer: The experiments focus on Norwegian public‑sector contexts; further validation is needed for other languages, cultural norms, and high‑stakes domains (e.g., healthcare).
  • Automation of the contrast test: Future work could explore automated construction of abliterated contrast targets to reduce manual effort.
  • Integration with existing benchmarks: Combining benchmark‑less scores with traditional benchmark results could yield hybrid safety metrics, an avenue the authors suggest for follow‑up research.

Authors

  • Sushant Gautam
  • Finn Schwall
  • Annika Willoch Olstad
  • Fernando Vallecillos Ruiz
  • Birk Torpmann-Hagen
  • Sunniva Maria Stordal Bjørklund
  • Leon Moonen
  • Klas Pettersen
  • Michael A. Riegler

Paper Information

  • arXiv ID: 2605.06652v1
  • Categories: cs.LG, cs.AI, cs.CL
  • Published: May 7, 2026