[Paper] Three Models of RLHF Annotation: Extension, Evidence, and Authority
Source: arXiv:2604.25895v1
Overview
Steve Coyne’s paper dissects the often‑overlooked assumptions behind Reinforcement Learning from Human Feedback (RLHF) – the technique that powers today’s most capable language models. By framing annotator judgments through three distinct lenses – extension, evidence, and authority – the work clarifies why current pipelines sometimes behave unpredictably and offers a roadmap for building more reliable, ethically grounded systems.
Key Contributions
- Conceptual taxonomy of three normative roles for human annotators in RLHF:
  - Extension – annotators amplify the designer’s intent.
  - Evidence – annotators supply independent factual or moral information.
  - Authority – annotators act as representatives of a broader stakeholder population.
- Critical analysis of landmark RLHF papers, showing which model they implicitly adopt and where mismatches cause failure modes (e.g., bias amplification, “over‑alignment”, or loss of factual accuracy).
- Design guidelines recommending that RLHF pipelines be decomposed into orthogonal annotation dimensions (e.g., factuality, style, safety) and that each dimension be matched to the most appropriate model (see the sketch after this list).
- Normative criteria for selecting a model, including transparency, accountability, and the intended deployment context.
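To make the decomposition guideline concrete, here is a minimal sketch of what a dimension-to-role pipeline spec could look like. The structure, field names, and annotator descriptions are our illustrative assumptions, not an interface defined in the paper.

```python
# Hypothetical pipeline spec: each annotation dimension is matched to the
# normative role (extension / evidence / authority) that best fits it.
ANNOTATION_MODULES = [
    {"dimension": "factuality", "role": "evidence",  "annotators": "domain experts"},
    {"dimension": "style",      "role": "extension", "annotators": "designer proxies"},
    {"dimension": "safety",     "role": "authority", "annotators": "representative panel"},
]

for module in ANNOTATION_MODULES:
    print(f"{module['dimension']}: {module['role']} ({module['annotators']})")
```

Keeping the role explicit per dimension is what later makes the pipeline auditable: each reward signal can be traced back to a documented normative justification.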
Methodology
Coyne conducts a theoretical review rather than an empirical experiment. The steps are:
- Model definition – Formalizes the three annotation roles in simple decision‑theoretic language (e.g., utility functions for designers vs. annotators; see the sketch after this section).
- Literature mapping – Surveys a curated set of influential RLHF studies (OpenAI’s InstructGPT, DeepMind’s Sparrow, Anthropic’s Claude, etc.) and tags each pipeline step (prompt design, reward modeling, policy optimization) with the model it implicitly assumes.
- Failure‑mode taxonomy – Identifies real‑world incidents (bias spikes, hallucinations, “gaming” of reward models) that arise when a pipeline conflates models.
- Normative framework – Proposes a checklist for practitioners to decide which model fits each annotation task, based on factors like stakeholder diversity, regulatory requirements, and product goals.
The analysis stays high‑level, using intuitive examples (e.g., “Should a model refuse to answer a political question?”) to illustrate each model’s implications.
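One way to render the three roles in the decision‑theoretic language the paper gestures at (the notation below is ours, not Coyne’s): let $u_D$ be the designer’s utility, $t$ a latent ground truth, $a_i$ the $i$-th annotator’s label, and $\mathcal{P}$ a stakeholder population.

```latex
\begin{align*}
\textbf{Extension:} \quad & a(x) \approx \arg\max_y \, u_D(x, y)
  && \text{(labels proxy the designer's utility)} \\
\textbf{Evidence:} \quad & p(t \mid a_{1:n}) \propto p(t) \prod_{i=1}^{n} p(a_i \mid t)
  && \text{(labels are noisy signals of a latent truth)} \\
\textbf{Authority:} \quad & u^{*}(x, y) = \mathbb{E}_{A \sim \mathcal{P}}\!\left[ u_A(x, y) \right]
  && \text{(labels constitute the target utility)}
\end{align*}
```

The contrast is in what a disagreement means: under extension it is annotator error relative to $u_D$, under evidence it is noise to be aggregated away, and under authority it is legitimate normative variation to be represented.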
Results & Findings
- Extension dominates current commercial RLHF pipelines: annotators are treated as proxies for the product team’s preferences, leading to overfitting to internal values and under‑representation of external user groups.
- Evidence‑oriented annotation is rare but crucial for factuality and safety; when omitted, models can confidently generate misinformation.
- Authority‑based pipelines appear mainly in open‑source or community‑driven projects, where annotators are explicitly positioned as representatives of a target user base. These pipelines better capture diverse norms but suffer from coordination and quality‑control challenges.
- Mixed‑model pipelines (e.g., using extension for style and authority for policy) are argued to outperform single‑model pipelines on benchmark suites that test both factual accuracy and alignment with user expectations.
Practical Implications
- Modular annotation pipelines – Teams should split the RLHF workflow into separate “modules” (e.g., factuality, toxicity, tone) and assign each a model that matches its purpose. This reduces cross‑contamination of biases.
- Tailored data collection – For factuality, recruit domain experts and treat their judgments as evidence; for cultural sensitivity, recruit a demographically diverse panel and treat them as authority.
- Dynamic reward weighting – Instead of a monolithic reward model, combine sub‑rewards (evidence score, authority score, extension score) with adjustable coefficients depending on deployment context (e.g., a higher authority weight for consumer‑facing chatbots); see the sketch after this list.
- Auditability & compliance – By making the normative role explicit, organizations can better document why a model behaves a certain way, satisfying regulatory demands (e.g., EU AI Act) that require “human oversight” to be clearly defined.
- Risk mitigation – Recognizing when a pipeline unintentionally mixes models helps anticipate failure modes: e.g., using extension‑style annotators for factuality can cause hallucinations, while using evidence‑style annotators for policy may ignore societal norms.
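As a hedged sketch of how dynamic reward weighting might look in code: all names, profiles, and coefficients below are illustrative assumptions under the paper’s framing, not an implementation it provides.

```python
from dataclasses import dataclass

@dataclass
class SubRewards:
    evidence: float   # e.g., factuality score from an expert-trained reward model
    authority: float  # e.g., norm-alignment score from a representative-panel model
    extension: float  # e.g., style/intent score from a designer-proxy model

# Hypothetical deployment profiles; the coefficients are illustrative only.
PROFILES = {
    "consumer_chatbot": {"evidence": 0.3, "authority": 0.5, "extension": 0.2},
    "internal_tool":    {"evidence": 0.5, "authority": 0.1, "extension": 0.4},
}

def combined_reward(scores: SubRewards, profile: str) -> float:
    """Weighted sum of sub-rewards; the weights depend on deployment context."""
    w = PROFILES[profile]
    return (w["evidence"] * scores.evidence
            + w["authority"] * scores.authority
            + w["extension"] * scores.extension)

# The same completion earns a different reward in different contexts.
s = SubRewards(evidence=0.9, authority=0.4, extension=0.7)
print(combined_reward(s, "consumer_chatbot"))  # authority weighted highest
print(combined_reward(s, "internal_tool"))     # evidence and extension dominate
```

Keeping the sub-rewards separate until the final weighted sum is what allows the coefficients to be tuned, logged, and audited per deployment rather than baked opaquely into a single reward model.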
Limitations & Future Work
- The paper is conceptual; it does not provide large‑scale empirical validation of the proposed modular pipelines.
- Scalability of authority‑based annotation (recruiting representative crowds) remains an open challenge, especially for high‑throughput model updates.
- Coyne notes that metric design for each model (e.g., how to quantify “authority”) needs further research, as does the integration of these metrics into existing RLHF toolchains.
- Future work could explore automated model selection (e.g., meta‑learning which annotation role best fits a new task) and cross‑domain studies to test the framework on multimodal models beyond text.
Authors
- Steve Coyne
Paper Information
- arXiv ID: 2604.25895v1
- Categories: cs.CY, cs.AI, cs.CL
- Published: April 28, 2026