[Paper] Does Reasoning Preserve Alignment? On the Trustworthiness of Large Reasoning Models

Published: June 9, 2026 at 12:14 PM EDT

Source: arXiv - 2606.11046v1

Overview

Instruction-tuned LLMs are increasingly converted into reasoning models through post-training to improve multi-step task performance. This conversion is usually optimized for reasoning accuracy, without explicitly preserving the alignment behavior of the instruction-tuned model, such as safe refusal, bias avoidance, and privacy protection. We ask: does this conversion preserve alignment? We study this question through a trustworthiness audit and find that it is not behavior-preserving by default. For a systematic analysis, we compare reasoning models produced via supervised fine-tuning, RL-based post-training, and distillation against matched instruction-tuned baselines across six trustworthiness dimensions: safety, toxicity, stereotyping and bias, machine ethics, privacy, and out-of-distribution robustness. We observe that reasoning models often improve on reasoning benchmarks but exhibit alignment regressions, including increased toxicity, amplified stereotyping, miscalibrated refusal, and contextual privacy leakage. These regressions are consistent with behavioral drift from the instruction-tuned baseline, measured by KL divergence. Overall, our results point to the broader conclusion that trustworthiness metrics are essential for evaluating reasoning models and should be reported alongside gains in reasoning capability.
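
The abstract says behavioral drift from the instruction-tuned baseline is measured by KL divergence, but it does not give the exact procedure. The block below is a minimal sketch of one plausible reading: compare the next-token distributions of the two models at each position of the same prompt and average KL(baseline || reasoning). The function names (`softmax`, `kl_divergence`, `mean_drift`), the direction of the divergence, the per-position averaging, and the toy logits are all assumptions made for illustration and are not taken from the paper.

```python
import math


def softmax(logits):
    """Turn a list of raw logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two discrete distributions over the same vocabulary.

    Terms with p_i == 0 contribute nothing; q_i is floored at eps to avoid
    division by zero (an illustrative choice, not the paper's).
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0.0)


def mean_drift(baseline_logits, reasoning_logits):
    """Average per-position KL(baseline || reasoning) over one sequence.

    Each argument is a list of per-position logit vectors that the two models
    produced for the same input tokens (hypothetical inputs).
    """
    if len(baseline_logits) != len(reasoning_logits):
        raise ValueError("sequences must have the same number of positions")
    kls = [
        kl_divergence(softmax(b), softmax(r))
        for b, r in zip(baseline_logits, reasoning_logits)
    ]
    return sum(kls) / len(kls)


# Toy logits over a 3-token vocabulary at two positions (made-up numbers).
baseline = [[2.0, 0.5, -1.0], [0.1, 0.1, 0.1]]
reasoning = [[1.0, 1.5, -1.0], [0.1, 0.1, 0.1]]

drift = mean_drift(baseline, reasoning)
print(f"mean per-position KL drift: {drift:.4f} nats")
```

A drift of zero means the two models assign identical next-token distributions at every position. Larger values indicate that the reasoning model has moved further from the baseline. With real models, the logits would come from running both checkpoints on the same tokenized prompts.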

Key Contributions

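According to the abstract, the paper contributes the following:

A trustworthiness audit showing that converting an instruction-tuned LLM into a reasoning model is not behavior-preserving by default.
Evidence that reasoning models often improve on reasoning benchmarks while regressing on alignment, including increased toxicity, amplified stereotyping, miscalibrated refusal, and contextual privacy leakage.
An observation that these regressions are consistent with behavioral drift from the instruction-tuned baseline, measured by KL divergence.
A recommendation that trustworthiness metrics be reported alongside gains in reasoning capability.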

Methodology

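The authors compare reasoning models produced via supervised fine-tuning, RL-based post-training, and distillation against matched instruction-tuned baselines. The comparison spans six trustworthiness dimensions: safety, toxicity, stereotyping and bias, machine ethics, privacy, and out-of-distribution robustness. Please refer to the full paper for the specific models, benchmarks, and evaluation details.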

Practical Implications

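The results suggest that gains on reasoning benchmarks should not be taken as evidence that a converted model retains the alignment behavior of its instruction-tuned baseline. The authors conclude that trustworthiness metrics are essential for evaluating reasoning models and should be reported alongside reasoning capability.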

Authors

Prajakta Kini
Avinash Reddy
Souradip Chakraborty
Satya Sai Srinath Namburi GNVV
Furong Huang
Amrit Singh Bedi
Alvaro Velasquez

Paper Information

arXiv ID: 2606.11046v1
Categories: cs.CL
Published: June 9, 2026
