[Paper] Predicting Future Behaviors in Reasoning Models Enables Better Steering

Published: June 9, 2026 at 01:49 PM EDT

Source: arXiv - 2606.11172v1

Overview

Deployed large reasoning models (LRMs) often behave unexpectedly. Test-time steering controls LRM outputs by intervening on their hidden representations, but it can degrade output quality. We argue that prior steering work implicitly relies on internal features that detect behavior in already generated text. We show that these detection features are poor predictors of future behavioral outcomes, and thus not the natural intervention target. Instead, we train activation probes to predict future behavior likelihoods from intermediate reasoning steps. These probes predict the most likely behavior with 64%-91% accuracy, revealing a separate type of internal prediction features. Building on these prediction features, we introduce a text-level steering method, Future Probe Controlled Generation. FPCG samples multiple candidate sentences and chooses the best one according to a probe predicting the future behavior likelihood. This enables steering with almost no output quality degradation. FPCG also enables steering in several evaluations where activation steering fails. These results show that distinguishing detection and prediction features enables a more nuanced approach to controlling LRM behaviors.
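
The sample-then-select loop the abstract describes for FPCG can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The names `fpcg_step`, `fpcg_generate`, `toy_sampler`, and `toy_probe` are hypothetical. In the paper the probe reads model activations at intermediate reasoning steps; here a keyword-based toy score stands in for it, and a fixed cycle of sentences stands in for sampling from the model.

```python
from itertools import cycle
from typing import Callable, List

# Hypothetical interfaces (assumptions, not taken from the paper):
#   Sampler: given the reasoning so far, return one candidate next sentence.
#   Probe:   given the reasoning so far and a candidate, return the estimated
#            likelihood of the desired future behavior (higher is better).
Sampler = Callable[[List[str]], str]
Probe = Callable[[List[str], str], float]


def fpcg_step(prefix: List[str], sample_sentence: Sampler,
              probe_score: Probe, num_candidates: int = 4) -> str:
    """Sample several candidate sentences and keep the one the probe rates highest."""
    candidates = [sample_sentence(prefix) for _ in range(num_candidates)]
    return max(candidates, key=lambda c: probe_score(prefix, c))


def fpcg_generate(prompt: str, sample_sentence: Sampler, probe_score: Probe,
                  num_sentences: int = 3, num_candidates: int = 4) -> List[str]:
    """Build a reasoning trace sentence by sentence, steering at the text level."""
    trace = [prompt]
    for _ in range(num_sentences):
        trace.append(fpcg_step(trace, sample_sentence, probe_score, num_candidates))
    return trace[1:]  # drop the prompt, return only the generated sentences


# Toy stand-ins so the sketch runs without a model (purely illustrative).
_pool = cycle(["I will just guess.", "Let me verify this step.", "Skip the check."])


def toy_sampler(prefix: List[str]) -> str:
    # Stands in for sampling one sentence from the reasoning model.
    return next(_pool)


def toy_probe(prefix: List[str], candidate: str) -> float:
    # Stands in for an activation probe predicting a "verifies its work" behavior.
    return 1.0 if "verify" in candidate else 0.0


steered = fpcg_generate("Solve 17 * 3.", toy_sampler, toy_probe,
                        num_sentences=2, num_candidates=3)
print(steered)
```

Because selection happens over whole sampled sentences rather than by editing hidden states, every emitted sentence is one the model itself produced, which is consistent with the abstract's claim of almost no output quality degradation.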

Key Contributions

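Based on the abstract, the paper makes the following contributions:

Evidence that internal features which detect behavior in already generated text are poor predictors of future behavioral outcomes.
Activation probes that predict future behavior likelihoods from intermediate reasoning steps, identifying the most likely behavior with 64%-91% accuracy.
Future Probe Controlled Generation (FPCG), a text-level steering method built on these prediction features.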

Methodology

Please refer to the full paper for detailed methodology.

Practical Implications

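According to the abstract, FPCG steers LRM behavior with almost no degradation in output quality and succeeds in several evaluations where activation steering fails.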

Authors

Evgenii Kortukov
Piotr Komorowski
Florian Klein
Paula Engl
Gabriele Sarti
Seong Joon Oh
Sebastian Lapuschkin
Wojciech Samek

Paper Information

arXiv ID: 2606.11172v1
Categories: cs.LG
Published: June 9, 2026