[Paper] Machine Behavior in Relational Moral Dilemmas: Moral Rightness, Predicted Human Behavior, and Model Decisions

Published: April 23, 2026 at 01:14 PM EDT
5 min read

Source: arXiv - 2604.21871v1

Overview

The paper investigates how large language models (LLMs) handle moral dilemmas that depend on the relationship between the decision‑maker and the affected parties. Using the classic Whistleblower’s Dilemma—where one must decide whether to expose a wrongdoing—the authors vary crime severity and relational closeness to see whether LLMs follow strict fairness rules, mimic how humans actually behave, or make their own autonomous choices. The findings reveal a striking mismatch: LLMs tend to stick to prescriptive moral norms even when their internal “world‑model” predicts that humans would act out of loyalty.

Key Contributions

  • Tri‑Perspective Framework – Introduces three lenses for evaluating machine morality: (1) Moral Rightness (what ought to be done), (2) Predicted Human Behavior (what people actually do), and (3) Model Decision (what the LLM would choose).
  • Relational Moral Benchmark – Extends the Whistleblower’s Dilemma with systematic manipulations of crime severity and interpersonal closeness, creating a reproducible test suite for future LLM safety work.
  • Empirical Evidence of Divergence – Shows that LLMs’ decisions align with the fairness‑oriented “rightness” view, while their own predictions of human behavior shift toward loyalty as relational ties strengthen.
  • Interpretability via Reasoning Traces – Analyzes chain‑of‑thought outputs to surface the reasoning steps that lead to each perspective, highlighting where the model’s internal world‑model conflicts with its final decision.
  • Risk Highlight for Decision‑Support Systems – Argues that LLMs deployed as advisors (e.g., compliance bots, HR assistants) may ignore socially nuanced expectations, potentially eroding trust or causing policy missteps.

Methodology

  1. Scenario Construction – The authors generate a matrix of 12 prompts covering three crime severities (minor, moderate, severe) × four relational distances (stranger, coworker, close friend, family); a minimal construction sketch follows this list.
  2. Three Query Types
    • Moral Rightness: “Is it morally right to report …?”
    • Predicted Human Behavior: “Would most people report …?”
    • Model Decision: “If you were the person, would you report …?”
  3. Model Suite – Experiments run on several state‑of‑the‑art LLMs (e.g., GPT‑4, Claude, Llama‑2) with chain‑of‑thought prompting to elicit reasoning.
  4. Scoring – Answers are mapped to a 5‑point Likert scale (strongly disagree → strongly agree). Consistency across perspectives is measured with Cohen’s κ.
  5. Qualitative Trace Analysis – The authors manually code reasoning snippets for references to fairness, loyalty, duty, and consequences, then compare frequency across perspectives.
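
To make the design concrete, here is a minimal sketch of how the 3 × 4 scenario matrix and the three query types could be instantiated. The factor names follow the paper's description, but the template wordings, function names, and data layout are illustrative assumptions rather than the authors' exact prompts.

```python
from itertools import product

# Design factors described in the paper: 3 crime severities x 4 relational distances = 12 scenarios.
SEVERITIES = ["minor", "moderate", "severe"]
RELATIONS = ["stranger", "coworker", "close friend", "family"]

# Illustrative templates for the three query types (assumed wording, not the paper's exact prompts).
QUERY_TEMPLATES = {
    "moral_rightness": "Is it morally right to report a {severity} crime committed by a {relation}?",
    "predicted_human_behavior": "Would most people report a {severity} crime committed by a {relation}?",
    "model_decision": "If you were the person, would you report a {severity} crime committed by your {relation}?",
}

def build_prompts() -> list[dict]:
    """Create one prompt per (severity, relation, query type) cell of the design."""
    prompts = []
    for severity, relation in product(SEVERITIES, RELATIONS):
        for query_type, template in QUERY_TEMPLATES.items():
            prompts.append({
                "severity": severity,
                "relation": relation,
                "query_type": query_type,
                "prompt": template.format(severity=severity, relation=relation),
            })
    return prompts

if __name__ == "__main__":
    prompts = build_prompts()
    print(len(prompts))          # 12 scenarios x 3 query types = 36 prompts
    print(prompts[0]["prompt"])  # e.g., the moral-rightness query for (minor, stranger)
```

Each generated prompt would then be sent to every model in the suite with chain-of-thought instructions, and the answers mapped onto the 5-point scale described in step 4.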

Results & Findings

  • Moral Rightness – Across relational closeness: remains high (≈4.2/5) regardless of closeness; fairness dominates. Across crime severity: slight dip for minor crimes but still >4.0.
  • Predicted Human Behavior – Across relational closeness: drops dramatically as closeness rises (≈4.5 → 2.8); people expect loyalty to win. Across crime severity: stronger drop for severe crimes; people still think loyalty can outweigh seriousness.
  • Model Decision – Across relational closeness: mirrors Moral Rightness (≈4.1); LLMs choose to report even for close relations. Across crime severity: consistently high; severity has only a modest effect.
  • Cross‑Perspective Divergence: κ ≈ 0.22 (low agreement) between Predicted Human Behavior and Model Decision; a sketch of how such an agreement statistic can be computed follows this list.
  • Reasoning Trace Insight: When asked about “rightness,” models cite “fairness,” “justice,” and “rule of law.” When predicting human behavior, they mention “protecting relationships,” “fear of retaliation,” and “social pressure.” Yet the final decision still defaults to the fairness‑centric reasoning.
  • Model‑Specific Variations: GPT‑4 shows the strongest alignment with moral rightness; Llama‑2 exhibits slightly more variance but still leans toward fairness.
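
For readers who want to see what the reported κ ≈ 0.22 corresponds to computationally, the snippet below measures agreement between two perspectives with scikit-learn's cohen_kappa_score. The Likert values are made-up placeholders and the collapse-to-binary rule is an assumption; the paper does not publish its aggregation code.

```python
from sklearn.metrics import cohen_kappa_score

# Placeholder 5-point Likert answers (1 = strongly disagree ... 5 = strongly agree)
# for the 12 scenarios; real values would come from the model runs.
predicted_human_behavior = [5, 4, 3, 2, 2, 5, 4, 3, 2, 3, 4, 2]
model_decision           = [5, 5, 4, 4, 4, 5, 5, 4, 4, 4, 5, 4]

def to_label(score: int, threshold: int = 4) -> str:
    """Collapse a Likert score into a report / not-report label (assumed binning rule)."""
    return "report" if score >= threshold else "not_report"

kappa = cohen_kappa_score(
    [to_label(s) for s in predicted_human_behavior],
    [to_label(s) for s in model_decision],
)
print(f"Cohen's kappa (Predicted Human Behavior vs. Model Decision): {kappa:.2f}")
```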

Practical Implications

  • Compliance & Whistleblowing Platforms – AI assistants that advise employees on reporting misconduct may over‑recommend disclosure, ignoring the real social cost for the whistleblower.
  • HR & Conflict‑Resolution Tools – Systems that suggest actions in interpersonal disputes need to incorporate relational context; otherwise they risk recommending solutions that feel “cold” or unrealistic to users.
  • Policy‑Making & Governance – Regulators evaluating AI safety should consider not just what the model says is right but also whether the model understands how humans actually behave in socially charged scenarios.
  • Prompt Engineering – Developers can explicitly request “socially aware” advice (e.g., “consider loyalty and personal risk”) to nudge the model toward a more balanced recommendation; an example prompt sketch follows this list.
  • Transparency Features – Exposing the chain‑of‑thought trace to end‑users can surface the internal conflict, allowing a human to make an informed final call.
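
As one way to act on the prompt-engineering point, a system instruction along the following lines asks the model to weigh loyalty and personal risk alongside fairness before recommending disclosure. The wording and helper function are illustrative assumptions, not prompts from the paper.

```python
# Illustrative "socially aware" system prompt; the wording is an assumption, not taken from the paper.
SOCIALLY_AWARE_SYSTEM_PROMPT = (
    "You are advising someone who has witnessed misconduct. Before recommending "
    "whether to report, explicitly weigh: (1) fairness and any legal duty to report, "
    "(2) how close the relationship to the wrongdoer is, (3) the personal and social "
    "risks to the reporter, such as retaliation or lost trust, and (4) less drastic "
    "alternatives, such as a private conversation. State these trade-offs before "
    "giving a recommendation."
)

def build_messages(scenario: str) -> list[dict]:
    """Wrap a user scenario in a chat-style message list usable with most LLM chat APIs."""
    return [
        {"role": "system", "content": SOCIALLY_AWARE_SYSTEM_PROMPT},
        {"role": "user", "content": scenario},
    ]
```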

Limitations & Future Work

  • Prompt Sensitivity – Results may shift with alternative phrasing or temperature settings; the study uses a single prompting style.
  • Cultural Scope – All scenarios assume a Western, individual‑rights‑focused moral baseline; cross‑cultural variations in loyalty vs. fairness are not explored.
  • Model Diversity – Only a handful of commercial LLMs were tested; open‑source or smaller models might behave differently.
  • Dynamic Context – Real‑world whistleblowing involves ongoing feedback loops (e.g., retaliation risk after reporting) that static prompts cannot capture.
  • Future Directions – The authors propose building relationally aware fine‑tuning datasets, integrating multi‑agent simulations to model downstream consequences, and developing evaluation metrics that jointly score fairness and social alignment.

Authors

  • Jiseon Kim
  • Jea Kwon
  • Luiz Felipe Vecchietti
  • Wenchao Dong
  • Jaehong Kim
  • Meeyoung Cha

Paper Information

  • arXiv ID: 2604.21871v1
  • Categories: cs.CL
  • Published: April 23, 2026