[Paper] Human Label Variation as Stable Signal: Learning Annotator-Specific Explanation Behavior via Cross-Annotator Preference Optimization

Published: May 27, 2026 at 01:55 PM EDT

Source: arXiv - 2605.28802v1

Overview

This paper investigates whether large language models (LLMs) can capture individual annotators’ reasoning when they provide free‑text explanations for classification decisions. By treating the variation in human explanations as a stable signal rather than noise, the authors show that models can learn to mimic the explain‑and‑label behavior of specific annotators across tasks such as Natural Language Inference (NLI) and paraphrase detection.

Key Contributions

Empirical evidence of annotator stability: Demonstrates that, after accounting for content effects, each annotator exhibits a recognizable pattern in both labels and free‑text explanations.
Cross‑Annotator Preference Optimization (CAPO): Introduces a novel training objective that explicitly contrasts a target annotator’s output with other valid but less target‑specific outputs for the same input.
Comprehensive benchmark: Evaluates prompting, standard supervised fine‑tuning (SFT), and CAPO on two sentence‑pair tasks with four annotators each, providing a clear picture of what works and why.
Human validation of reasoning: Shows that CAPO‑trained models retain the target annotator’s reasoning style, as confirmed by human judges.
Open‑source resources: Releases the annotation datasets, code for CAPO, and evaluation scripts to facilitate reproducibility.

Methodology

Data collection – For each of the two tasks (NLI and paraphrase), four human annotators labeled 1,000 sentence pairs and wrote a short free‑text explanation for every decision.
Stability analysis – The authors first measured how much of the variation was due to the input itself versus the annotator. By aggregating predictions per annotator and stripping away content‑specific cues, they revealed consistent individual “explanation signatures.”
Modeling approaches
- Prompting: Zero‑shot or few‑shot prompts that ask a pre‑trained LLM to generate a label and explanation.
- Supervised fine‑tuning (SFT): Standard cross‑entropy training on the (label, explanation) pairs of a single annotator.
- CAPO: A contrastive loss that, for each example, pulls the model toward the target annotator’s output while pushing it away from the other three annotators’ valid outputs for the same input. This encourages the model to learn what makes the target’s reasoning distinctive, not just the correct answer.
Evaluation – Metrics include label accuracy, BLEU/ROUGE for explanation similarity, and a judge‑based attribution test where humans rate how well the model’s output matches the target annotator’s style.
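The CAPO item above can be made concrete with a small sketch. This summary does not spell out the exact objective, so the code assumes a DPO-style pairwise loss computed against a frozen reference model; it is an illustration under that assumption, not the authors' implementation. The names `PreferencePair`, `build_pairs`, and `dpo_style_loss`, the layout of the `example` dictionary, and the `beta` value are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class PreferencePair:
    prompt: str    # the sentence pair shown to the annotators
    chosen: str    # target annotator's label + explanation
    rejected: str  # another annotator's label + explanation for the same input


def build_pairs(example, target):
    """Pair the target annotator's output with each other annotator's output.

    `example` uses a hypothetical layout:
    {"prompt": str, "outputs": {annotator_id: "label + explanation"}}.
    """
    outputs = example["outputs"]
    return [
        PreferencePair(example["prompt"], outputs[target], text)
        for annotator, text in outputs.items()
        if annotator != target
    ]


def dpo_style_loss(logp_chosen, logp_rejected,
                   ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Assumed DPO-style loss: -log(sigmoid(beta * margin)).

    Inputs are sequence log-probabilities under the trained model and
    under a frozen reference model. The margin grows when the trained
    model favors the target annotator's output more than the reference does.
    """
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    x = beta * margin
    # Numerically stable -log(sigmoid(x)).
    if x > 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))
```

With four annotators per task, `build_pairs` yields three pairs per input for a given target. The loss equals log 2 when the model and the reference agree on both outputs, and it shrinks as the model assigns relatively more probability to the target annotator's output.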

Results & Findings

Approach	Label Accuracy	Explanation Similarity (BLEU)	Human Attribution
Prompting (zero‑shot)	62 %	12 %	48 %
Prompting (few‑shot)	68 %	18 %	55 %
SFT (single annotator)	74 %	27 %	71 %
CAPO	77 %	31 %	78 %

Prompting struggles to consistently reproduce a specific annotator’s reasoning; performance is highly variable across examples.
SFT captures annotator‑specific patterns better than prompting but still treats each example in isolation.
CAPO yields the strongest gains, especially in the human attribution test, confirming that the model not only predicts the right label but also mirrors the annotator’s explanatory style.
Qualitative analysis shows that CAPO‑trained models preserve subtle preferences (e.g., focusing on lexical overlap vs. logical entailment) that differ between annotators.
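To make the size of these gains explicit, the short snippet below recomputes the percentage-point differences from the results table. The numbers are copied from the table; the `results` dictionary, the `metrics` tuple, and the `gains` helper are hypothetical names used only for this illustration.

```python
# Scores copied from the results table, in percent.
metrics = ("label accuracy", "BLEU", "human attribution")
results = {
    "Prompting (zero-shot)": (62, 12, 48),
    "Prompting (few-shot)": (68, 18, 55),
    "SFT (single annotator)": (74, 27, 71),
    "CAPO": (77, 31, 78),
}


def gains(results, system, baseline):
    """Percentage-point difference of `system` over `baseline` per metric."""
    return {
        metric: s - b
        for metric, s, b in zip(metrics, results[system], results[baseline])
    }


capo_vs_sft = gains(results, "CAPO", "SFT (single annotator)")
# {'label accuracy': 3, 'BLEU': 4, 'human attribution': 7}

capo_vs_few_shot = gains(results, "CAPO", "Prompting (few-shot)")
# {'label accuracy': 9, 'BLEU': 13, 'human attribution': 23}
```

Relative to SFT, the human attribution score moves the most (7 points, against 3 for label accuracy and 4 for BLEU), which is consistent with the statement that CAPO's gains are largest on the attribution test.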

Practical Implications

Personalized AI assistants: Customer‑support bots could be tuned to adopt the explanatory tone of a particular support agent, ensuring consistency with existing knowledge bases.
Explainable AI pipelines: Instead of generic post‑hoc explanations, developers can train models that generate explanations aligned with the reasoning of domain experts, improving trust and auditability.
Annotation cost reduction: By learning from a handful of annotators’ histories, an LLM could draft explanations for new data in a given annotator’s style, reducing the need for exhaustive human annotation.
Regulatory compliance: In sectors where explanations must follow specific guidelines (e.g., finance, healthcare), CAPO could help models reproduce the explanation patterns of annotators who already follow those guidelines.
Multi‑annotator aggregation: CAPO’s contrastive framework can be extended to blend multiple expert styles, enabling “style‑aware” ensemble explanations.

Limitations & Future Work

Dataset size & diversity: The study uses only two tasks and four annotators per task; broader domains (e.g., code review, medical diagnosis) may exhibit different stability properties.
Explanation length: Free‑text explanations are short (≈1‑2 sentences); scaling to longer, more complex rationales remains an open question.
Model size dependency: Experiments were conducted with GPT‑Neo‑2.7B and Llama‑7B; it is unclear how CAPO behaves with much larger or smaller models.
Potential bias amplification: Training on a single annotator’s style could inadvertently propagate that annotator’s systematic biases; future work should explore fairness‑aware regularization.
Interactive fine‑tuning: Incorporating real‑time feedback from annotators (e.g., correction loops) could further improve personalization and reduce drift over time.

Authors

Beiduo Chen
Pingjun Hong
Ziyun Zhang
Benjamin Roth
Anna Korhonen
Barbara Plank

Paper Information

arXiv ID: 2605.28802v1
Categories: cs.CL
Published: May 27, 2026
