[Paper] Three Models of RLHF Annotation: Extension, Evidence, and Authority
Source: arXiv:2604.25895v1
Overview
Steve Coyne’s paper dissects the often‑overlooked assumptions behind Reinforcement Learning from Human Feedback (RLHF) – the technique that powers today’s most capable language models. By framing annotator judgments through three distinct lenses – extension, evidence, and authority – the work clarifies why current pipelines sometimes behave unpredictably and offers a roadmap for building more reliable, ethically grounded systems.
Key Contributions
- Conceptual taxonomy of three normative roles for human annotators in RLHF:
  - Extension – annotators amplify the designer’s intent.
  - Evidence – annotators supply independent factual or moral information.
  - Authority – annotators act as representatives of a broader stakeholder population.
- Critical analysis of landmark RLHF papers, showing which model they implicitly adopt and where mismatches cause failure modes (e.g., bias amplification, “over‑alignment”, or loss of factual accuracy).
- Design guidelines recommending that RLHF pipelines be decomposed into orthogonal annotation dimensions (e.g., factuality, style, safety) and that each dimension be matched to the most appropriate model (see the sketch after this list).
- Normative criteria for selecting a model, including transparency, accountability, and the intended deployment context.
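To make the decomposition guideline concrete, here is a minimal sketch of what a dimension-to-role pipeline spec could look like. The structure, field names, and annotator descriptions are our illustrative assumptions, not an interface defined in the paper.

```python
# Hypothetical pipeline spec: each annotation dimension is matched to the
# normative role (extension / evidence / authority) that best fits it.
ANNOTATION_MODULES = [
    {"dimension": "factuality", "role": "evidence",  "annotators": "domain experts"},
    {"dimension": "style",      "role": "extension", "annotators": "designer proxies"},
    {"dimension": "safety",     "role": "authority", "annotators": "representative panel"},
]

for module in ANNOTATION_MODULES:
    print(f"{module['dimension']}: {module['role']} ({module['annotators']})")
```

Keeping the role explicit per dimension is what later makes the pipeline auditable: each reward signal can be traced back to a documented normative justification.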
Methodology
Coyne conducts a theoretical review rather than an empirical experiment. The steps are:
- Model definition – Formalizes the three annotation roles in simple decision‑theoretic language (e.g., utility functions for designers vs. annotators; see the sketch after this section).
- Literature mapping – Surveys a curated set of influential RLHF studies (OpenAI’s InstructGPT, DeepMind’s Sparrow, Anthropic’s Claude, etc.) and tags each pipeline step (prompt design, reward modeling, policy optimization) with the model it implicitly assumes.
- Failure‑mode taxonomy – Identifies real‑world incidents (bias spikes, hallucinations, “gaming” of reward models) that arise when a pipeline conflates models.
- Normative framework – Proposes a checklist for practitioners to decide which model fits each annotation task, based on factors like stakeholder diversity, regulatory requirements, and product goals.
The analysis stays high‑level, using intuitive examples (e.g., “Should a model refuse to answer a political question?”) to illustrate each model’s implications.
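One way to render the three roles in the decision‑theoretic language the paper gestures at (the notation below is ours, not Coyne’s): let $u_D$ be the designer’s utility, $t$ a latent ground truth, $a_i$ the $i$-th annotator’s label, and $\mathcal{P}$ a stakeholder population.

```latex
\begin{align*}
\textbf{Extension:} \quad & a(x) \approx \arg\max_y \, u_D(x, y)
  && \text{(labels proxy the designer's utility)} \\
\textbf{Evidence:} \quad & p(t \mid a_{1:n}) \propto p(t) \prod_{i=1}^{n} p(a_i \mid t)
  && \text{(labels are noisy signals of a latent truth)} \\
\textbf{Authority:} \quad & u^{*}(x, y) = \mathbb{E}_{A \sim \mathcal{P}}\!\left[ u_A(x, y) \right]
  && \text{(labels constitute the target utility)}
\end{align*}
```

The contrast is in what a disagreement means: under extension it is annotator error relative to $u_D$, under evidence it is noise to be aggregated away, and under authority it is legitimate normative variation to be represented.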
Results & Findings
- Extension dominates current commercial RLHF pipelines: annotators are treated as proxies for the product team’s preferences, leading to overfitting to internal values and under‑representation of external user groups.
- Evidence‑oriented annotation is rare but crucial for factuality and safety; when omitted, models can confidently generate misinformation.
- Authority‑based pipelines appear mainly in open‑source or community‑driven projects, where annotators are explicitly positioned as representatives of a target user base. These pipelines better capture diverse norms but suffer from coordination and quality‑control challenges.
- Mixed‑model pipelines (e.g., using extension for style and authority for policy) are argued to outperform single‑model pipelines on benchmark suites that test both factual accuracy and alignment with user expectations.
Practical Implications
- Modular annotation pipelines – Teams should split the RLHF workflow into separate “modules” (e.g., factuality, toxicity, tone) and assign each a model that matches its purpose. This reduces cross‑contamination of biases.
- Tailored data collection – For factuality, recruit domain experts and treat their judgments as evidence; for cultural sensitivity, recruit a demographically diverse panel and treat them as authority.
- Dynamic reward weighting – Instead of a monolithic reward model, combine sub‑rewards (evidence score, authority score, extension score) with adjustable coefficients depending on deployment context (e.g., a higher authority weight for consumer‑facing chatbots); see the sketch after this list.
- Auditability & compliance – By making the normative role explicit, organizations can better document why a model behaves a certain way, satisfying regulatory demands (e.g., EU AI Act) that require “human oversight” to be clearly defined.
- Risk mitigation – Recognizing when a pipeline unintentionally mixes models helps anticipate failure modes: e.g., using extension‑style annotators for factuality can cause hallucinations, while using evidence‑style annotators for policy may ignore societal norms.
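As a hedged sketch of how dynamic reward weighting might look in code: all names, profiles, and coefficients below are illustrative assumptions under the paper’s framing, not an implementation it provides.

```python
from dataclasses import dataclass

@dataclass
class SubRewards:
    evidence: float   # e.g., factuality score from an expert-trained reward model
    authority: float  # e.g., norm-alignment score from a representative-panel model
    extension: float  # e.g., style/intent score from a designer-proxy model

# Hypothetical deployment profiles; the coefficients are illustrative only.
PROFILES = {
    "consumer_chatbot": {"evidence": 0.3, "authority": 0.5, "extension": 0.2},
    "internal_tool":    {"evidence": 0.5, "authority": 0.1, "extension": 0.4},
}

def combined_reward(scores: SubRewards, profile: str) -> float:
    """Weighted sum of sub-rewards; the weights depend on deployment context."""
    w = PROFILES[profile]
    return (w["evidence"] * scores.evidence
            + w["authority"] * scores.authority
            + w["extension"] * scores.extension)

# The same completion earns a different reward in different contexts.
s = SubRewards(evidence=0.9, authority=0.4, extension=0.7)
print(combined_reward(s, "consumer_chatbot"))  # authority weighted highest
print(combined_reward(s, "internal_tool"))     # evidence and extension dominate
```

Keeping the sub-rewards separate until the final weighted sum is what allows the coefficients to be tuned, logged, and audited per deployment rather than baked opaquely into a single reward model.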
Limitations & Future Work
- The paper is conceptual; it does not provide large‑scale empirical validation of the proposed modular pipelines.
- Scalability of authority‑based annotation (recruiting representative crowds) remains an open challenge, especially for high‑throughput model updates.
- Coyne notes that metric design for each model (e.g., how to quantify “authority”) needs further research, as does the integration of these metrics into existing RLHF toolchains.
- Future work could explore automated model selection (e.g., meta‑learning which annotation role best fits a new task) and cross‑domain studies to test the framework on multimodal models beyond text.
Authors
- Steve Coyne
Paper Information
- arXiv ID: 2604.25895v1
- Categories: cs.CY, cs.AI, cs.CL
- Published: April 28, 2026