[Paper] ResponseRank: Data-Efficient Reward Modeling through Preference Strength Learning
Source: arXiv - 2512.25023v1
Overview
The paper ResponseRank tackles a subtle but important problem in reinforcement learning from human feedback (RLHF): binary preference data tells us which of two outputs a user likes, but it says nothing about how much they prefer it. By exploiting noisy side‑signals such as response times or annotator agreement, the authors devise a way to infer the strength of a preference and use it to train more data‑efficient reward models.
Key Contributions
- ResponseRank algorithm – a robust framework that learns preference strength from locally comparable proxy signals (e.g., response latency, inter‑annotator agreement).
- Pearson Distance Correlation (PDC) – a new evaluation metric that measures how well a model captures cardinal utility (preference strength), independently of mere ordinal correctness (a sketch follows this list).
- Empirical validation across three domains:
  - Synthetic preference datasets with simulated response‑time signals.
  - Large‑scale language‑model fine‑tuning using real annotator‑agreement data.
  - RL control environments where episode returns serve as a proxy for strength.
- Demonstrated sample‑efficiency gains (up to ~30 % fewer human labels needed for comparable performance) and increased robustness to noisy strength cues.
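The PDC metric is only described at a high level in this summary, so the following is a minimal sketch of how such a correlation could be computed, assuming predicted utilities and simulated strength labels are available as NumPy arrays; the function name `pdc` and the array interface are illustrative, and the paper's exact definition may differ.

```python
import numpy as np

def pdc(pred_util_a, pred_util_b, true_strength):
    """Correlate predicted utility gaps with ground-truth preference strength.

    pred_util_a, pred_util_b: predicted scalar utilities for the two responses
    in each comparison; true_strength: the (simulated) strength of the
    preference for response A over response B.
    """
    pred_gap = np.asarray(pred_util_a) - np.asarray(pred_util_b)
    true_gap = np.asarray(true_strength)
    # Pearson correlation between predicted utility differences and true strength.
    return np.corrcoef(pred_gap, true_gap)[0, 1]

# Toy usage: a model whose utility gaps track preference strength scores
# a high PDC even when a purely ordinal metric (accuracy) looks identical.
print(pdc([2.0, 1.5, 0.9], [1.0, 1.2, 0.8], [0.9, 0.3, 0.1]))
```

A purely ordinal model can order every pair correctly and still score a low PDC if its utility gaps are uninformative, which is exactly the distinction the metric is meant to expose.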
Methodology
- Collect proxy strength signals – For each pairwise comparison, the system records an auxiliary scalar (e.g., how fast the annotator responded, how many annotators agreed).
- Stratify the data – Comparisons are grouped into strata that share similar contextual factors (e.g., same prompt, similar difficulty). This limits systematic bias (e.g., some prompts are always answered quickly).
- Local ranking – Within each stratum, the proxy signals are used to produce a relative ranking of the two responses (which one appears “stronger”). Only the ordering matters, not the absolute value of the signal.
- Utility‑difference learning – The model is trained to predict a scalar utility for each response such that the difference between utilities respects the locally inferred ranking. A margin‑based loss encourages larger gaps for stronger‑ranked pairs (see the sketch after this list).
- Evaluation with PDC – After training, the Pearson correlation between predicted utility differences and the true (simulated) strength values is computed, providing a clean measure of cardinal learning.
The whole pipeline requires no explicit calibration of the proxy signals; it only assumes that relative differences are meaningful within a well‑constructed stratum.
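The sketch below ties the steps above together as a single PyTorch loss: comparisons are grouped by stratum, ranked locally by their proxy signal, and penalized with a margin that grows with the inferred strength rank. The dictionary-based comparison format, the convention that a larger proxy value means a stronger preference, and the linear margin schedule are illustrative assumptions rather than the authors' implementation; in practice the utilities would come from a reward model rather than a fixed tensor.

```python
import torch
import torch.nn.functional as F
from collections import defaultdict

def responserank_loss(utilities, comparisons, base_margin=0.1):
    """Margin-based utility-difference loss with stratum-local strength ranks.

    utilities:   tensor of predicted scalar utilities, one per response.
    comparisons: list of dicts with keys
        'chosen', 'rejected' : indices into `utilities`
        'stratum'            : hashable key (e.g., prompt id)
        'proxy'              : scalar proxy signal; here, larger = stronger
                               preference (e.g., higher annotator agreement).
    The linear margin schedule is an illustrative choice, not the paper's.
    """
    # 1. Group comparisons by stratum so only locally comparable proxy
    #    signals are ever ranked against each other.
    strata = defaultdict(list)
    for comp in comparisons:
        strata[comp["stratum"]].append(comp)

    losses = []
    for comps in strata.values():
        # 2. Rank comparisons within the stratum by proxy strength;
        #    only the ordering of the proxy is used, never its raw value.
        comps = sorted(comps, key=lambda c: c["proxy"])
        for rank, comp in enumerate(comps):
            # 3. Stronger-ranked pairs must show a larger utility gap.
            margin = base_margin * (rank + 1)
            gap = utilities[comp["chosen"]] - utilities[comp["rejected"]]
            losses.append(F.relu(margin - gap))
    return torch.stack(losses).mean()

# Toy usage with a hypothetical 4-response batch and two comparisons
# sharing one stratum (same prompt).
utilities = torch.tensor([1.2, 0.7, 0.3, 0.9], requires_grad=True)
comparisons = [
    {"chosen": 0, "rejected": 1, "stratum": "prompt-1", "proxy": 0.9},
    {"chosen": 3, "rejected": 2, "stratum": "prompt-1", "proxy": 0.4},
]
loss = responserank_loss(utilities, comparisons)
loss.backward()
```

Because only the within-stratum ordering of the proxy enters the loss, systematic differences between strata (e.g., prompts that are always answered quickly) never translate into margin differences.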
Results & Findings
| Domain | Baseline (binary RLHF) | ResponseRank | Sample‑efficiency gain |
|---|---|---|---|
| Synthetic (RT) | 0.71 accuracy, 0.45 PDC | 0.78 accuracy, 0.62 PDC | ≈30 % fewer labels |
| Language‑model (agreement) | 0.84 win‑rate on held‑out prompts | 0.89 win‑rate | ≈25 % fewer annotations |
| RL control (episode return) | 0.62 average return | 0.71 average return | ≈20 % fewer episodes |
- Robustness to noise: When the proxy signal was deliberately corrupted (adding Gaussian noise), ResponseRank degraded gracefully, while a naïve strength‑regression baseline collapsed.
- Ablation: Removing the stratum‑wise ranking step reduced PDC by ~0.15, confirming the importance of local comparison.
- Generalization: Models trained with strength information transferred better to out‑of‑distribution prompts, suggesting that cardinal utility captures richer semantics than pure ordinal labels.
Practical Implications
- Faster RLHF pipelines – By extracting more signal from each human annotation (strength ≈ “how confident” the annotator is), product teams can cut the number of required preference queries by roughly 20–30 %, reducing labeling costs and time‑to‑market for LLM fine‑tuning.
- Better safety & alignment – Strength‑aware reward models can differentiate between “mildly undesirable” and “strongly undesirable” outputs, enabling more nuanced policy updates and reducing over‑penalization of borderline cases.
- Adaptive UI for data collection – Systems can prioritize presenting comparisons where proxy signals indicate high uncertainty (low agreement, long response time), focusing human effort where it yields the biggest utility gain (a sketch follows this list).
- Cross‑domain applicability – Any setting that already logs meta‑data (click‑through rates, dwell time, confidence scores) can plug in ResponseRank without redesigning the annotation workflow.
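As a rough illustration of the adaptive data-collection idea above, the following sketch orders candidate comparisons by how uncertain their logged meta-data looks; the field names and the weighting are hypothetical, not part of the paper.

```python
def collection_priority(candidates):
    """Order candidate comparisons so the most uncertain ones are labeled first.

    candidates: list of dicts with hypothetical fields
        'agreement'       : fraction of annotators agreeing so far (0..1)
        'response_time_s' : mean annotator response time in seconds
    Split votes and long response times are treated as signs of a weak,
    uncertain preference, i.e. the comparisons worth extra human effort.
    """
    def uncertainty(c):
        # Agreement near 0.5 is maximally uncertain; slow answers add to it.
        return (1.0 - abs(c["agreement"] - 0.5) * 2.0) + 0.1 * c["response_time_s"]
    return sorted(candidates, key=uncertainty, reverse=True)

# Example: the split-vote, slow comparison is queued before the easy one.
queue = collection_priority([
    {"agreement": 0.95, "response_time_s": 3.0},
    {"agreement": 0.55, "response_time_s": 14.0},
])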
Limitations & Future Work
- Dependence on meaningful strata – The method assumes that within‑stratum proxy differences are reliable. Poorly chosen strata (e.g., mixing very different prompts) can re‑introduce bias.
- Proxy quality variance – In domains where response time is not correlated with preference strength (e.g., multitasking users), the signal may be too noisy to help.
- Scalability of stratum construction – For massive datasets, building and maintaining strata may add overhead; automated clustering techniques are needed.
- Future directions suggested by the authors include:
  - Learning the stratum partitioning jointly with the reward model.
  - Extending ResponseRank to multi‑option (k‑ary) comparisons.
  - Integrating calibrated confidence estimates from LLMs themselves as additional strength cues.
Authors
- Timo Kaufmann
- Yannick Metz
- Daniel Keim
- Eyke Hüllermeier
Paper Information
- arXiv ID: 2512.25023v1
- Categories: cs.LG
- Published: December 31, 2025