[Paper] An Empirical Study on Preference Tuning Generalization and Diversity Under Domain Shift

Published: January 9, 2026 at 10:56 AM EST
4 min read
Source: arXiv - 2601.05882v1

Overview

This paper investigates why preference‑tuned language models (i.e., models aligned to human judgments of helpfulness, safety, etc.) often degrade when applied to data that differs from the data they were tuned on. By systematically testing several alignment objectives and a range of adaptation strategies, especially pseudo‑labeling, the authors show how to retain the benefits of preference tuning while mitigating the performance drop that domain shift typically causes.

Key Contributions

  • Comprehensive benchmark of five widely used preference‑tuning objectives on two downstream tasks (summarization and QA helpfulness) under multiple domain‑shift scenarios.
  • Systematic comparison of adaptation strategies, including direct supervised fine‑tuning on target data and unsupervised pseudo‑labeling pipelines.
  • Empirical evidence that pseudo‑labeling consistently narrows the performance gap caused by domain shift, often outperforming naive fine‑tuning.
  • Insightful analysis of how different alignment losses (e.g., KL‑divergence, pairwise ranking, reward‑model regression) trade off between generalization and diversity of model outputs.
  • Open‑source release of the evaluation suite, data splits, and code to reproduce the experiments.

Methodology

  1. Base Models – The authors start from several strong pretrained language models (e.g., LLaMA‑7B, FLAN‑T5‑XXL).
  2. Preference‑Tuning Objectives – Five loss functions are examined:
    • KL‑divergence to a reference distribution,
    • Pairwise ranking (Bradley‑Terry; a minimal loss sketch follows this list),
    • Direct reward‑model regression,
    • Contrastive alignment, and
    • A hybrid “helpfulness‑safety” multi‑task loss.
  3. Domain‑Shift Setup – Two source domains (news summarization & Stack‑Exchange QA) are paired with out‑of‑distribution target domains (scientific abstracts & medical QA).
  4. Adaptation Strategies
    • Supervised fine‑tuning on a small labeled target set,
    • Pseudo‑labeling: generate model outputs on unlabeled target data, score them with the original reward model, and then fine‑tune on the high‑scoring pseudo‑labels (a minimal sketch follows this list),
    • Hybrid (mix of supervised + pseudo).
  5. Evaluation – Helpfulness is measured with human‑rated scores and automatic proxies (e.g., ROUGE for summarization, BLEU + answer correctness for QA). Diversity is quantified via distinct‑n and entropy metrics.
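
To make the ranking objective concrete, the following is a minimal sketch of a Bradley‑Terry style pairwise loss in PyTorch. The function name and the idea of feeding it scalar scores for the chosen and rejected responses (e.g., reward‑model outputs or summed log‑probabilities) are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def pairwise_ranking_loss(chosen_scores: torch.Tensor,
                          rejected_scores: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry style pairwise loss (illustrative sketch).

    `chosen_scores` / `rejected_scores` are per-example scalar scores for the
    preferred and dispreferred responses. The loss maximizes the probability
    that the chosen response outranks the rejected one:
    -log sigmoid(s_chosen - s_rejected).
    """
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()

# Toy usage: scores for a batch of three preference pairs (made-up values).
chosen = torch.tensor([1.2, 0.4, 2.0])
rejected = torch.tensor([0.3, 0.9, 1.1])
print(pairwise_ranking_loss(chosen, rejected))
```

Step 4's pseudo‑labeling pipeline can be summarized in a similarly hedged sketch. The `generate_candidates` and `reward_model` callables are hypothetical placeholders for the tuned policy and the original reward model, and the fixed confidence threshold mirrors the cutoff the paper describes.

```python
from typing import Callable, List, Tuple

def build_pseudo_labels(unlabeled_prompts: List[str],
                        generate_candidates: Callable[[str, int], List[str]],
                        reward_model: Callable[[str, str], float],
                        num_candidates: int = 4,
                        threshold: float = 0.8) -> List[Tuple[str, str]]:
    """Generate responses on unlabeled target-domain prompts, score them with
    the original reward model, and keep only high-scoring (prompt, response)
    pairs for subsequent fine-tuning. All names and values are illustrative."""
    pseudo_labeled = []
    for prompt in unlabeled_prompts:
        candidates = generate_candidates(prompt, num_candidates)
        scored = [(reward_model(prompt, c), c) for c in candidates]
        best_score, best_response = max(scored)
        if best_score >= threshold:          # fixed confidence cutoff
            pseudo_labeled.append((prompt, best_response))
    return pseudo_labeled
```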

Results & Findings

| Alignment Objective | Source‑Only Score | + Supervised FT | + Pseudo‑Labeling |
|---------------------|-------------------|-----------------|-------------------|
| KL‑divergence       | 0.62              | 0.66 (+4)       | 0.71 (+9)         |
| Pairwise Ranking    | 0.60              | 0.64 (+4)       | 0.70 (+10)        |
| Reward Regression   | 0.58              | 0.62 (+4)       | 0.68 (+10)        |
| Contrastive         | 0.61              | 0.65 (+4)       | 0.69 (+8)         |
| Hybrid              | 0.63              | 0.67 (+4)       | 0.72 (+9)         |

Numbers are averaged helpfulness scores (higher is better); values in parentheses are gains over the source‑only score, in hundredths of a point.

  • Generalization Gap: All objectives lose ~5‑10 % when evaluated on the target domain without adaptation.
  • Pseudo‑labeling Wins: Adding high‑confidence pseudo‑labels recovers most of the lost performance, often surpassing the supervised fine‑tuning baseline despite using no human labels in the target domain.
  • Diversity Trade‑off: Pure KL‑divergence yields the most diverse outputs, while ranking‑based losses produce tighter, higher‑quality responses with slightly lower diversity (the metric sketch after this list shows how diversity is quantified).
  • Objective‑Specific Trends: The hybrid loss combines the best of both worlds—strong helpfulness and respectable diversity—making it the most robust across shifts.
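
For reference, the diversity side of this trade‑off is measured with distinct‑n and entropy. The following is a minimal sketch of both metrics; the whitespace tokenization is an illustrative assumption, not the paper's actual tokenizer.

```python
import math
from collections import Counter
from typing import List

def distinct_n(texts: List[str], n: int = 2) -> float:
    """Ratio of unique n-grams to total n-grams across all generated texts."""
    ngrams = []
    for text in texts:
        tokens = text.split()  # simple whitespace tokenization (assumption)
        ngrams += [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / max(len(ngrams), 1)

def token_entropy(texts: List[str]) -> float:
    """Shannon entropy (in bits) of the unigram distribution over all outputs."""
    counts = Counter(tok for text in texts for tok in text.split())
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Toy usage on two made-up generations.
outputs = ["the study covers domain shift", "the study covers pseudo labels"]
print(distinct_n(outputs, n=2), token_entropy(outputs))
```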

Practical Implications

  • Deploying Aligned LLMs: Companies can safely roll out preference‑tuned models to new verticals (e.g., from customer‑support chat to medical triage) by first running a lightweight pseudo‑labeling pipeline instead of costly human annotation.
  • Cost‑Effective Adaptation: Pseudo‑labeling requires only the original reward model and unlabeled target data, cutting adaptation budgets by up to 80 % compared to full supervised fine‑tuning.
  • Product Roadmaps: Teams building “helpful” assistants can pick an alignment objective based on their priority—if output variety matters (e.g., creative writing), KL‑divergence is preferable; for safety‑critical domains, pairwise ranking or the hybrid loss may be better.
  • Tooling Integration: The released code can be plugged into existing RLHF pipelines (e.g., Hugging Face's trl library) to add a “pseudo‑labeling stage” before production rollout (a minimal example follows this list).
  • Regulatory Compliance: By maintaining alignment quality under domain shift, organizations can better meet AI‑risk standards that require consistent behavior across use‑cases.
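
To illustrate the tooling point above, here is a minimal sketch of a pseudo‑labeling stage placed in front of supervised fine‑tuning with Hugging Face's trl. The dataset columns, the 0.8 cutoff, and the checkpoint name are assumptions, and SFTTrainer/SFTConfig arguments vary across trl releases, so this shows the shape of the pipeline rather than a drop‑in recipe.

```python
# Sketch: insert a pseudo-labeling filter before supervised fine-tuning with trl.
# Assumes a datasets.Dataset with "text" (prompt + generated response) and
# "reward" (score from the original reward model) columns already prepared.
from datasets import Dataset
from trl import SFTConfig, SFTTrainer

scored = Dataset.from_dict({
    "text": ["<prompt 1> <response 1>", "<prompt 2> <response 2>"],
    "reward": [0.91, 0.42],
})

# Keep only high-confidence pseudo-labels (fixed cutoff, as in the paper).
pseudo_labeled = scored.filter(lambda ex: ex["reward"] >= 0.8)

trainer = SFTTrainer(
    model="meta-llama/Llama-2-7b-hf",         # placeholder checkpoint name
    args=SFTConfig(output_dir="sft-pseudo"),  # minimal config; tune as needed
    train_dataset=pseudo_labeled,
)
trainer.train()
```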

Limitations & Future Work

  • Scale Sensitivity: Experiments were limited to models of at most 13 B parameters; it remains unclear whether the same trends hold for substantially larger systems.
  • Reward Model Bias: The pseudo‑labeling process inherits any systematic bias present in the original reward model, which could amplify undesirable behaviors in the target domain.
  • Task Breadth: Only summarization and QA were examined; other modalities (code generation, dialogue) may exhibit different shift dynamics.
  • Human Evaluation Depth: While the study includes human ratings, deeper qualitative analyses (e.g., error typology) are left for future work.
  • Adaptive Pseudo‑Label Thresholds: The paper uses a fixed confidence cutoff; exploring dynamic or curriculum‑based thresholds could further improve robustness.

Overall, the study offers a practical roadmap for keeping preference‑aligned language models useful and reliable when they venture beyond the data they were originally trained on.

Authors

  • Constantinos Karouzos
  • Xingwei Tan
  • Nikolaos Aletras

Paper Information

  • arXiv ID: 2601.05882v1
  • Categories: cs.CL, cs.AI, cs.LG
  • Published: January 9, 2026
  • PDF: Download PDF