[Paper] Anatomy of Post-Training: Using Interpretability to Characterize Data and Shape the Learning Signal

Published: June 10, 2026 at 01:31 PM EDT

Source: arXiv - 2606.12360v1

Overview

Post-training is the main stage at which a language model's behavior is shaped, yet it still largely consists of optimizing scalar rewards that summarize diverse desiderata. This abstraction gives practitioners little visibility into what their data actually teaches a model, so spurious correlations can be learned and can induce undesirable behaviors such as over-stylization and sycophancy. To address this problem, we ask: can we inspect a preference dataset before optimization and decide, at the level of concepts, which behaviors a model should be allowed to learn? We introduce a data-centric post-training pipeline that uses interpretability protocols to develop statistical hypotheses about the latent concepts separating preferred from dispreferred generations, making those concepts explicit so that users can give fine-grained feedback on them. Building on this view, we unify several interpretability-based training protocols as ways of shaping rewards via feature or data interventions. Empirically, we show that our pipeline diagnoses undesirable signals in existing preference data, mitigates off-target learning, and can also help amplify or shape desired properties such as safeguards and model personality. More broadly, our results suggest that interpretability can turn post-training from optimizing opaque proxy rewards into a process of auditing and sculpting the learning signal itself.

Key Contributions

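Based on the abstract, the paper's stated contributions are:

A data-centric post-training pipeline that uses interpretability protocols to form statistical hypotheses about the latent concepts separating preferred from dispreferred generations.
A unified view of several interpretability-based training protocols as ways of shaping rewards via feature or data interventions.
Empirical results showing that the pipeline diagnoses undesirable signals in existing preference data, mitigates off-target learning, and can amplify or shape desired properties such as safeguards and model personality.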

Methodology

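The abstract describes the approach only at a high level: interpretability protocols are applied to a preference dataset before optimization to form statistical hypotheses about the latent concepts that separate preferred from dispreferred generations, and the learning signal is then shaped through feature or data interventions. Please refer to the full paper for the detailed methodology.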
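To make the idea of inspecting a preference dataset at the level of concepts more concrete, here is a minimal Python sketch. It is not the authors' pipeline. It assumes that each generation has already been mapped to named concept activations by some interpretability tool, and it swaps in a plain paired t-statistic as the "statistical hypothesis" and simple pair filtering as the "data intervention". The function names, concept names, toy data, and the margin value are all hypothetical.

```python
import math
from statistics import mean, stdev

# Hypothetical toy data: each pair is (preferred, dispreferred), and each
# generation is a dict of concept activations assumed to come from an
# interpretability tool. The concept names are invented for illustration.
pairs = [
    ({"flattery": 0.9, "correctness": 0.8}, {"flattery": 0.1, "correctness": 0.7}),
    ({"flattery": 0.8, "correctness": 0.6}, {"flattery": 0.2, "correctness": 0.7}),
    ({"flattery": 0.7, "correctness": 0.9}, {"flattery": 0.1, "correctness": 0.4}),
    ({"flattery": 0.2, "correctness": 0.9}, {"flattery": 0.3, "correctness": 0.2}),
]


def concept_gaps(pairs):
    """Return {concept: (mean_gap, t_stat)} over preferred-minus-dispreferred
    activation differences. A large positive gap means the preference labels
    reward that concept, whether or not that was intended."""
    concepts = sorted({c for chosen, rejected in pairs for c in (*chosen, *rejected)})
    gaps = {}
    for concept in concepts:
        diffs = [ch.get(concept, 0.0) - rj.get(concept, 0.0) for ch, rj in pairs]
        gap = mean(diffs)
        spread = stdev(diffs) if len(diffs) > 1 else 0.0
        # Paired t-statistic; left undefined (NaN) when the differences have no spread.
        t_stat = gap / (spread / math.sqrt(len(diffs))) if spread > 0 else float("nan")
        gaps[concept] = (gap, t_stat)
    return gaps


def drop_pairs_driven_by(pairs, concept, margin):
    """Data intervention (one assumed variant): drop pairs in which the
    preferred generation exceeds the dispreferred one on `concept` by more
    than `margin`, so that concept contributes less to the learning signal."""
    return [
        (ch, rj)
        for ch, rj in pairs
        if ch.get(concept, 0.0) - rj.get(concept, 0.0) <= margin
    ]


gaps = concept_gaps(pairs)
filtered = drop_pairs_driven_by(pairs, "flattery", margin=0.5)

for concept, (gap, t_stat) in gaps.items():
    print(f"{concept}: mean gap {gap:.3f}, t {t_stat:.2f}")
print(f"kept {len(filtered)} of {len(pairs)} pairs")
```

In this toy data the hypothetical "flattery" concept separates preferred from dispreferred generations more strongly than "correctness", which is the kind of signal a reviewer might flag before training. Filtering pairs is only one possible intervention; the abstract also mentions feature interventions, which this sketch does not cover.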

Practical Implications

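According to the abstract, practitioners can use the pipeline to audit existing preference data for undesirable signals before optimization, mitigate off-target learning, and amplify or shape desired properties such as safeguards and model personality. The paper is filed under the arXiv category cs.LG (Machine Learning).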

Authors

Leon Bergen
Usha Bhalla
Sidharth Baskaran
Max Loeffler
Raphael Sarfati
Dhruvil Gala
Ryan Panwar
Santiago Aranguri
Thomas Fel
Atticus Geiger
Matthew Kowal
Siddharth Boppana
Daniel Balsam
Owen Lewis
Jack Merullo
Thomas McGrath
Ekdeep Singh Lubana

Paper Information

arXiv ID: 2606.12360v1
Categories: cs.LG
Published: June 10, 2026
