[Paper] ResponseRank: Data-Efficient Reward Modeling through Preference Strength Learning

Published: December 31, 2025 at 01:21 PM EST
4 min read

Source: arXiv - 2512.25023v1

Overview

The paper ResponseRank tackles a subtle but important problem in reinforcement learning from human feedback (RLHF): binary preference data tells us which of two outputs a user likes, but it says nothing about how much they prefer it. By exploiting noisy side‑signals such as response times or annotator agreement, the authors devise a way to infer the strength of a preference and use it to train more data‑efficient reward models.

Key Contributions

  • ResponseRank algorithm – a robust framework that learns preference strength from locally comparable proxy signals (e.g., response latency, inter‑annotator agreement).
  • Pearson Distance Correlation (PDC) – a new evaluation metric that isolates how well a model captures cardinal utility (strength) from mere ordinal correctness.
  • Empirical validation across three domains:
    1. Synthetic preference datasets with simulated response‑time signals.
    2. Large‑scale language‑model fine‑tuning using real annotator‑agreement data.
    3. RL control environments where episode returns serve as a proxy for strength.
  • Demonstrated sample‑efficiency gains (up to ~30 % fewer human labels needed for comparable performance) and increased robustness to noisy strength cues.

Methodology

  1. Collect proxy strength signals – For each pairwise comparison, the system records an auxiliary scalar (e.g., how fast the annotator responded, how many annotators agreed).
  2. Stratify the data – Comparisons are grouped into strata that share similar contextual factors (e.g., same prompt, similar difficulty). This limits systematic bias (e.g., some prompts are always answered quickly).
  3. Local ranking – Within each stratum, the proxy signals are used to produce a relative ranking of the two responses (which one appears “stronger”). Only the ordering matters, not the absolute value of the signal.
  4. Utility‑difference learning – The model is trained to predict a scalar utility for each response such that the difference between utilities respects the locally inferred ranking. A margin‑based loss encourages larger gaps for stronger‑ranked pairs (see the sketch after this list).
  5. Evaluation with PDC – After training, the Pearson correlation between predicted utility differences and the true (simulated) strength values is computed, providing a clean measure of cardinal learning.
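
The stratum‑wise ranking (steps 2–3) and the margin‑based utility‑difference loss (step 4) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the rank‑to‑margin mapping, the `base_margin` value, and the sign convention for response times are assumptions made for the example.

```python
import torch
import torch.nn.functional as F

def stratum_strength_ranks(strength_proxy):
    """Rank proxy-derived strength scores within one stratum, scaled to [0, 1].

    Only the relative order is used, so no calibration of the raw proxy is
    needed. The caller fixes the sign convention (e.g., pass negated response
    times if faster decisions are taken to signal stronger preferences).
    """
    proxy = torch.as_tensor(strength_proxy, dtype=torch.float32)
    ranks = torch.argsort(torch.argsort(proxy)).float()    # dense ranks 0..n-1
    return ranks / max(len(proxy) - 1, 1)

def strength_margin_loss(u_preferred, u_rejected, strength_rank, base_margin=0.1):
    """Margin-based utility-difference loss: the higher the within-stratum
    strength rank, the larger the required gap between the two utilities.
    (The rank-to-margin mapping here is an assumption of this sketch.)"""
    margin = base_margin * (1.0 + strength_rank)           # stronger pair -> bigger margin
    gap = u_preferred - u_rejected                         # predicted utility difference
    return F.relu(margin - gap).mean()                     # hinge penalty below the margin

# Toy usage: three comparisons from one stratum, proxy = response time in seconds.
u_pref = torch.tensor([1.2, 0.4, 0.9])                     # reward-model utilities (preferred)
u_rej  = torch.tensor([0.8, 0.5, 0.1])                     # reward-model utilities (rejected)
ranks  = stratum_strength_ranks([-2.1, -7.5, -3.0])        # negated: faster -> stronger (assumed)
loss   = strength_margin_loss(u_pref, u_rej, ranks)
```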

The whole pipeline requires no explicit calibration of the proxy signals; it only assumes that relative differences are meaningful within a well‑constructed stratum.
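
For step 5, the paper describes PDC as a Pearson correlation between predicted utility differences and ground‑truth strength values (available in the synthetic setting). A minimal version consistent with that description, with illustrative names, could look like the following; the paper's exact formulation may differ.

```python
import numpy as np

def pearson_distance_correlation(u_pref, u_rej, true_strength):
    """Correlate predicted utility gaps with ground-truth preference strength.
    Follows the description above; the paper's exact definition may differ."""
    pred_gap = np.asarray(u_pref) - np.asarray(u_rej)
    # np.corrcoef returns the 2x2 correlation matrix; take the off-diagonal entry.
    return np.corrcoef(pred_gap, np.asarray(true_strength))[0, 1]
```

A reward model that orders every pair correctly but predicts nearly constant gaps would score well on accuracy yet poorly on this correlation, which is the ordinal‑versus‑cardinal distinction the metric is meant to expose.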

Results & Findings

| Domain | Baseline (binary RLHF) | ResponseRank | Sample‑efficiency gain |
| --- | --- | --- | --- |
| Synthetic (RT) | 0.71 accuracy, 0.45 PDC | 0.78 accuracy, 0.62 PDC | ≈30 % fewer labels |
| Language‑model (agreement) | 0.84 win‑rate on held‑out prompts | 0.89 win‑rate | ≈25 % fewer annotations |
| RL control (episode return) | 0.62 average return | 0.71 average return | ≈20 % fewer episodes |

  • Robustness to noise: When the proxy signal was deliberately corrupted (adding Gaussian noise), ResponseRank degraded gracefully, while a naïve strength‑regression baseline collapsed.
  • Ablation: Removing the stratum‑wise ranking step reduced PDC by ~0.15, confirming the importance of local comparison.
  • Generalization: Models trained with strength information transferred better to out‑of‑distribution prompts, suggesting that cardinal utility captures richer semantics than pure ordinal labels.

Practical Implications

  • Faster RLHF pipelines – By extracting more signal from each human annotation (strength ≈ “how confident” the annotator is), product teams can substantially reduce the number of required preference queries (the paper reports roughly 20–30 % fewer labels for comparable performance), cutting labeling costs and time‑to‑market for LLM fine‑tuning.
  • Better safety & alignment – Strength‑aware reward models can differentiate between “mildly undesirable” and “strongly undesirable” outputs, enabling more nuanced policy updates and reducing over‑penalization of borderline cases.
  • Adaptive UI for data collection – Systems can prioritize presenting comparisons where proxy signals indicate high uncertainty (low agreement, long response time), focusing human effort where it yields the biggest utility gain (a minimal sketch follows this list).
  • Cross‑domain applicability – Any setting that already logs meta‑data (click‑through rates, dwell time, confidence scores) can plug in ResponseRank without redesigning the annotation workflow.
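
As a concrete sketch of the adaptive‑collection idea above: score each pending comparison by how uncertain its logged metadata looks and surface the most uncertain pairs first. The fields, weights, and the 30‑second normalization below are placeholders, not values from the paper.

```python
from dataclasses import dataclass

@dataclass
class PendingComparison:
    prompt_id: str
    agreement: float         # fraction of annotators agreeing so far (0..1)
    response_time_s: float   # median time annotators took on this pair

def uncertainty_score(c: PendingComparison) -> float:
    """Higher score = less consensus / slower decisions = more worth re-annotating.
    The weights are arbitrary placeholders, not taken from the paper."""
    disagreement = 1.0 - c.agreement
    slowness = min(c.response_time_s / 30.0, 1.0)   # cap at an assumed 30 s budget
    return 0.7 * disagreement + 0.3 * slowness

def prioritize(queue: list[PendingComparison]) -> list[PendingComparison]:
    """Order pending comparisons so annotators see the most uncertain ones first."""
    return sorted(queue, key=uncertainty_score, reverse=True)
```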

Limitations & Future Work

  • Dependence on meaningful strata – The method assumes that within‑stratum proxy differences are reliable. Poorly chosen strata (e.g., mixing very different prompts) can re‑introduce bias.
  • Proxy quality variance – In domains where response time is not correlated with preference strength (e.g., multitasking users), the signal may be too noisy to help.
  • Scalability of stratum construction – For massive datasets, building and maintaining strata may add overhead; automated clustering techniques are needed.
  • Future directions suggested by the authors include:
    1. Learning the stratum partitioning jointly with the reward model.
    2. Extending ResponseRank to multi‑option (k‑ary) comparisons.
    3. Integrating calibrated confidence estimates from LLMs themselves as additional strength cues.

Authors

  • Timo Kaufmann
  • Yannick Metz
  • Daniel Keim
  • Eyke Hüllermeier

Paper Information

  • arXiv ID: 2512.25023v1
  • Categories: cs.LG
  • Published: December 31, 2025