[Paper] Human Label Variation as Stable Signal: Learning Annotator-Specific Explanation Behavior via Cross-Annotator Preference Optimization

Published: May 27, 2026 at 01:55 PM EDT
Source: arXiv - 2605.28802v1

Overview

This paper investigates whether large language models (LLMs) can capture individual annotators’ reasoning when they provide free‑text explanations for classification decisions. By treating the variation in human explanations as a stable signal rather than noise, the authors show that models can learn to mimic the explain‑and‑label behavior of specific annotators across tasks such as Natural Language Inference (NLI) and paraphrase detection.

Key Contributions

  • Empirical evidence of annotator stability: Demonstrates that, after accounting for content effects, each annotator exhibits a recognizable pattern in both labels and free‑text explanations.
  • Cross‑Annotator Preference Optimization (CAPO): Introduces a novel training objective that explicitly contrasts a target annotator’s output with other valid but less target‑specific outputs for the same input.
  • Comprehensive benchmark: Evaluates prompting, standard supervised fine‑tuning (SFT), and CAPO on two sentence‑pair tasks with four annotators each, providing a clear picture of what works and why.
  • Human validation of reasoning: Shows that CAPO‑trained models retain the target annotator’s reasoning style, as confirmed by human judges.
  • Open‑source resources: Releases the annotation datasets, code for CAPO, and evaluation scripts to facilitate reproducibility.

Methodology

  1. Data collection – For each of the two tasks (NLI and paraphrase), four human annotators labeled 1,000 sentence pairs and wrote a short free‑text explanation for every decision.
  2. Stability analysis – The authors first measured how much of the variation was due to the input itself versus the annotator. By aggregating predictions per annotator and stripping away content‑specific cues, they revealed consistent individual “explanation signatures.”
  3. Modeling approaches
    • Prompting: Zero‑shot or few‑shot prompts that ask a pre‑trained LLM to generate a label and explanation.
    • Supervised fine‑tuning (SFT): Standard cross‑entropy training on the (label, explanation) pairs of a single annotator.
    • CAPO: A contrastive preference objective that, for each example, raises the model’s preference for the target annotator’s output relative to the other three annotators’ valid outputs for the same input. This encourages the model to learn what makes the target’s reasoning distinctive, not just which answer is correct.
  4. Evaluation – Metrics include label accuracy, BLEU/ROUGE for explanation similarity, and a judge‑based attribution test where humans rate how well the model’s output matches the target annotator’s style.
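The contrastive idea behind CAPO can be sketched in a few lines. This summary does not give the exact objective, so the snippet below assumes a DPO-style pairwise loss as a stand-in; the function names, the `beta` value, the annotator keys, and the example explanations are all hypothetical. Each pair sets the target annotator's output (chosen) against another annotator's valid output for the same input (rejected).

```python
import math
from typing import Dict, List, Tuple


def build_preference_pairs(outputs: Dict[str, str], target: str) -> List[Tuple[str, str]]:
    """Return (chosen, rejected) pairs: the target's output vs. each other annotator's."""
    chosen = outputs[target]
    return [(chosen, text) for name, text in outputs.items() if name != target]


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_style_loss(
    policy_chosen_lp: float,
    policy_rejected_lp: float,
    ref_chosen_lp: float,
    ref_rejected_lp: float,
    beta: float = 0.1,  # assumed temperature; not taken from the paper
) -> float:
    """Assumed DPO-style loss: -log(sigmoid(beta * margin)).

    The *_lp arguments are sequence log-probabilities of the chosen and
    rejected outputs under the trained policy and a frozen reference model.
    """
    margin = beta * (
        (policy_chosen_lp - ref_chosen_lp) - (policy_rejected_lp - ref_rejected_lp)
    )
    return softplus(-margin)


# Hypothetical label-plus-explanation outputs from four annotators for one NLI item.
outputs = {
    "annotator_1": "entailment: the hypothesis restates the premise with a synonym.",
    "annotator_2": "entailment: both sentences share most of their words.",
    "annotator_3": "neutral: the premise does not say when the event happened.",
    "annotator_4": "entailment: the premise logically implies the hypothesis.",
}
pairs = build_preference_pairs(outputs, target="annotator_1")
```

With four annotators, each input yields three pairs for a given target. The loss equals log 2 when the policy and the reference agree, and it falls as the policy assigns relatively more probability to the target annotator's output.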

Results & Findings

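| Approach               | Label Accuracy | Explanation Similarity (BLEU) | Human Attribution |
|------------------------|----------------|-------------------------------|-------------------|
| Prompting (zero‑shot)  | 62%            | 12%                           | 48%               |
| Prompting (few‑shot)   | 68%            | 18%                           | 55%               |
| SFT (single annotator) | 74%            | 27%                           | 71%               |
| CAPO                   | 77%            | 31%                           | 78%               |
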
  • Prompting struggles to consistently reproduce a specific annotator’s reasoning; performance is highly variable across examples.
  • SFT captures annotator‑specific patterns better than prompting but still treats each example in isolation.
  • CAPO yields the strongest gains, especially in the human attribution test, confirming that the model not only predicts the right label but also mirrors the annotator’s explanatory style.
  • Qualitative analysis shows that CAPO‑trained models preserve subtle preferences (e.g., focusing on lexical overlap vs. logical entailment) that differ between annotators.
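To make the size of the gains concrete, the short script below computes percentage-point differences between approaches from the figures reported in the results table. The dictionary layout and the `gain` helper are illustrative choices, not part of the paper's evaluation code.

```python
# (label accuracy, BLEU, human attribution), in percent, as reported in the results table.
results = {
    "Prompting (zero-shot)": (62, 12, 48),
    "Prompting (few-shot)": (68, 18, 55),
    "SFT (single annotator)": (74, 27, 71),
    "CAPO": (77, 31, 78),
}


def gain(approach: str, baseline: str) -> tuple:
    """Percentage-point difference of `approach` over `baseline` for each metric."""
    return tuple(a - b for a, b in zip(results[approach], results[baseline]))


capo_vs_sft = gain("CAPO", "SFT (single annotator)")
capo_vs_zero_shot = gain("CAPO", "Prompting (zero-shot)")
print(capo_vs_sft)        # (3, 4, 7)
print(capo_vs_zero_shot)  # (15, 19, 30)
```

Over SFT, CAPO adds 3 points of label accuracy, 4 points of BLEU, and 7 points of human attribution, so the attribution test shows the largest difference of the three metrics.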

Practical Implications

  • Personalized AI assistants: Customer‑support bots could be tuned to adopt the explanatory tone of a particular support agent, ensuring consistency with existing knowledge bases.
  • Explainable AI pipelines: Instead of generic post‑hoc explanations, developers can train models that generate explanations aligned with the reasoning of domain experts, improving trust and auditability.
  • Annotation cost reduction: By learning from a handful of annotators’ histories, an LLM can generate high‑quality explanations for new data, reducing the need for exhaustive human annotation.
  • Regulatory compliance: In sectors where explanations must follow specific guidelines (e.g., finance, healthcare), CAPO can enforce annotator‑specific compliance patterns automatically.
  • Multi‑annotator aggregation: CAPO’s contrastive framework can be extended to blend multiple expert styles, enabling “style‑aware” ensemble explanations.

Limitations & Future Work

  • Dataset size & diversity: The study uses only two tasks and four annotators per task; broader domains (e.g., code review, medical diagnosis) may exhibit different stability properties.
  • Explanation length: Free‑text explanations are short (≈1‑2 sentences); scaling to longer, more complex rationales remains an open question.
  • Model size dependency: Experiments were conducted with GPT‑Neo‑2.7B and Llama‑7B; it is unclear how CAPO behaves with much larger or smaller models.
  • Potential bias amplification: Training on a single annotator’s style could inadvertently propagate that annotator’s systematic biases; future work should explore fairness‑aware regularization.
  • Interactive fine‑tuning: Incorporating real‑time feedback from annotators (e.g., correction loops) could further improve personalization and reduce drift over time.

Authors

  • Beiduo Chen
  • Pingjun Hong
  • Ziyun Zhang
  • Benjamin Roth
  • Anna Korhonen
  • Barbara Plank

Paper Information

  • arXiv ID: 2605.28802v1
  • Categories: cs.CL
  • Published: May 27, 2026
  • PDF: Download PDF