[Paper] SCORE: Specificity, Context Utilization, Robustness, and Relevance for Reference-Free LLM Evaluation

Published: February 10, 2026 at 12:39 PM EST

Source: arXiv - 2602.10017v1

Overview

Large language models (LLMs) are being deployed for high‑stakes, domain‑specific tasks such as disaster response planning and infrastructure design. Existing evaluation tools, however, focus on surface similarity or generic factuality and miss whether an answer actually contains the specific, decision‑critical details professionals need. This paper introduces SCORE, a reference‑free, multi‑dimensional framework that measures LLM outputs on specificity, context utilization, robustness, and relevance, and validates it with a new, professionally curated dataset.

Key Contributions

  • SCORE framework: Four complementary, reference‑free metrics (Specificity, Context Utilization, Robustness, Relevance) that together provide a nuanced picture of answer quality.
  • Domain‑rich benchmark: 1,412 question‑answer pairs covering 40 professional roles (e.g., emergency managers, civil engineers) and seven natural‑hazard scenarios, enabling systematic testing of LLMs in real‑world contexts.
  • Human‑aligned evaluation: Extensive human annotation study showing inter‑annotator agreement patterns and highlighting the inherent subjectivity of open‑ended, domain‑specific judgments.
  • Empirical analysis: Demonstrates that no single metric predicts human preferences; a combination of SCORE dimensions correlates best with expert assessments.
  • Open‑source release: Dataset, annotation guidelines, and evaluation scripts are publicly available to foster reproducible research and industry adoption.

Methodology

  1. Metric Design

    • Specificity: Checks whether the answer includes fine‑grained, actionable details (e.g., exact flood‑depth thresholds).
    • Context Utilization: Scores how well the model leverages provided background documents or retrieval results.
    • Robustness: Measures answer stability under paraphrased prompts or semantic perturbations (e.g., synonym swaps).
    • Relevance: Assesses whether the response stays on topic and addresses the core decision question.
  2. Dataset Construction

    • Collected real‑world queries from professionals in emergency management, civil engineering, urban planning, etc.
    • Paired each query with a high‑quality reference answer written by domain experts.
    • Annotated each reference for the four SCORE dimensions to create a gold‑standard for calibration.
  3. Human Evaluation

    • Recruited 12 domain experts to rate a subset of model outputs on the four dimensions.
    • Computed Krippendorff’s α to quantify inter‑annotator reliability (α ≈ 0.71 overall—above the conventional 0.667 floor for tentative conclusions, though below the 0.8 benchmark for high reliability—indicating moderate agreement).
  4. Model Testing

    • Ran several state‑of‑the‑art LLMs (e.g., GPT‑4, Claude, Llama‑2) with retrieval‑augmented generation pipelines on the benchmark.
    • Applied the SCORE metrics automatically (via lightweight classifiers fine‑tuned on the annotated data) and compared against human scores.
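The Robustness dimension above can be sketched in code. The paper scores dimensions with lightweight fine‑tuned classifiers; as a minimal illustrative stand‑in, the sketch below measures answer stability under paraphrased prompts using token‑level Jaccard overlap (a hypothetical similarity function, not the paper's classifier):

```python
# Illustrative Robustness check: how stable is a model's answer when the
# prompt is paraphrased? Jaccard token overlap is a hypothetical stand-in
# for the paper's learned answer-similarity scoring.

def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity between two answers, in [0, 1]."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

def robustness(original_answer: str, paraphrase_answers: list[str]) -> float:
    """Mean similarity between the original answer and answers obtained
    under paraphrased prompts; 1.0 means perfectly stable."""
    if not paraphrase_answers:
        return 1.0
    sims = [jaccard(original_answer, p) for p in paraphrase_answers]
    return sum(sims) / len(sims)

base = "Evacuate zones below the 2 m flood-depth threshold within 6 hours."
variants = [
    "Evacuate zones below the 2 m flood-depth threshold within 6 hours.",
    "Areas under the 2 m flood depth should be evacuated in 6 hours.",
]
score = robustness(base, variants)  # between 0 and 1
```

A production version would swap the lexical overlap for an embedding‑ or classifier‑based similarity, since paraphrased answers can be semantically identical with little token overlap.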

Results & Findings

  • Metric Correlation: Specificity and Context Utilization showed the strongest alignment with expert judgments (ρ = 0.62 and 0.58 respectively). Robustness and Relevance were weaker individually but crucial when combined.
  • Model Rankings: GPT‑4 achieved the highest overall SCORE composite (0.74), but lagged on Robustness (0.51), indicating susceptibility to prompt paraphrases. Llama‑2 performed competitively on Specificity but struggled with Context Utilization.
  • Composite Advantage: A simple weighted sum of the four dimensions (weights tuned on a validation split) yielded a Pearson correlation of 0.78 with human overall quality scores—significantly higher than any single metric (max 0.62).
  • Human‑Model Gap: Even top‑performing models missed critical domain nuances in ~18% of cases, underscoring the need for post‑deployment human oversight in high‑risk settings.

Practical Implications

  • Better RAG Pipelines: Developers can integrate SCORE as a runtime sanity check, flagging answers that lack specificity or ignore retrieved context before surfacing them to users.
  • Fine‑Tuning Targets: The four dimensions provide clear, interpretable loss signals for reinforcement learning from human feedback (RLHF) or supervised fine‑tuning, enabling more targeted improvements.
  • Risk Management: Organizations deploying LLMs for disaster response can use SCORE scores to set acceptance thresholds (e.g., reject any answer with Specificity < 0.6), reducing the chance of incomplete or misleading guidance.
  • Tooling Ecosystem: The released evaluation scripts can be wrapped into CI pipelines, allowing product teams to monitor metric drift as models are updated or as new domain corpora are added.
  • Cross‑Domain Extensibility: While the benchmark focuses on natural hazards, the SCORE framework is generic enough to be adapted for medical triage, legal advice, or financial risk analysis—any scenario where decision‑critical detail matters.
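The acceptance‑threshold idea above can be sketched as a simple runtime gate in a RAG pipeline. The Specificity < 0.6 cutoff comes from the example in the text; the other thresholds are hypothetical:

```python
# Minimal sketch of a runtime acceptance gate: reject an answer if any
# SCORE dimension falls below its threshold, and report which ones failed.
# Only the specificity cutoff (0.6) is from the text; the rest are
# hypothetical placeholders an operator would tune per deployment.

THRESHOLDS = {"specificity": 0.6, "context_utilization": 0.5,
              "robustness": 0.4, "relevance": 0.5}

def gate(scores: dict[str, float]) -> tuple[bool, list[str]]:
    """Return (accepted, failing_dimensions) for one candidate answer."""
    failing = [d for d, t in THRESHOLDS.items() if scores.get(d, 0.0) < t]
    return (not failing, failing)

ok, why = gate({"specificity": 0.72, "context_utilization": 0.64,
                "robustness": 0.55, "relevance": 0.81})
# ok is True here; an answer scoring 0.4 on specificity would instead be
# rejected with "specificity" listed in the failing dimensions.
```

A rejected answer could be routed back for regeneration with more retrieved context, or escalated to a human reviewer in high‑risk settings.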

Limitations & Future Work

  • Subjectivity: Even with clear guidelines, annotators disagreed on borderline cases, suggesting that some dimensions (especially Relevance) may need richer contextual definitions.
  • Domain Coverage: The dataset, though diverse, is limited to natural‑hazard contexts; extending to other high‑stakes domains will test the generality of SCORE.
  • Metric Automation: Current automatic classifiers rely on fine‑tuning with the annotated set; scaling to new domains may require additional labeled data or few‑shot prompting strategies.
  • Robustness Scope: The robustness tests focused on lexical paraphrases; future work should explore more adversarial perturbations (e.g., misinformation injection).
  • Human-in-the-Loop: Integrating SCORE with active learning loops—where low‑scoring outputs trigger expert review—remains an open research direction.

By providing a structured, reference‑free way to evaluate the very details that matter most in professional decision‑making, SCORE moves LLM evaluation beyond “does it sound right?” toward “does it say the right thing?”

Authors

  • Homaira Huda Shomee
  • Rochana Chaturvedi
  • Yangxinyu Xie
  • Tanwi Mallick

Paper Information

  • arXiv ID: 2602.10017v1
  • Categories: cs.CL
  • Published: February 10, 2026
