[Paper] Proxy Reward Internalization and Mechanistic Exploitation: A Learned Precursor to Reward Hacking and Its Generalization

Published: June 8, 2026 at 12:32 PM EDT

Source: arXiv - 2606.09711v1

Overview

Reward hacking is usually studied after it becomes visible, once a model earns high proxy reward while failing the intended task. We instead study what proxy RL teaches before that failure appears. We introduce Proxy Reward Internalization and Mechanistic Exploitation (PRIME), a learned capability to assess task correctness, predict proxy acceptance, and reason about exploitable proxy-gold gaps. In coding RL environments with exploitable pytest rewards, we measure PRIME through chain-of-thought monitoring, direct probes, and activation-level concept vectors. We find that PRIME emerges in a staged sequence before sustained reward hacking, and that its current direct-probe score forecasts later hack onset and severity even when the visible hack rate is still low. PRIME also adapts when the evaluator changes, retargeting to whichever proxy-gold gap remains rewarded and persisting when gold reward suppresses overt hacking, and ablating its activation directions reduces hacking. Across checkpoints, in-domain PRIME tracks out-of-domain misalignment. Together these results suggest that exploitable proxy RL amplifies a proxy-internalization capability upstream of visible hacking, making PRIME a candidate early-warning signal for broader alignment risk.
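
The abstract mentions activation-level concept vectors and ablating activation directions but gives no implementation details. As a rough illustration only, the sketch below uses a common stand-in technique: a difference-of-means direction between two sets of activations, followed by projection ablation. Everything here is an assumption for illustration (the function names `concept_direction` and `ablate_direction`, the synthetic data, the hidden size); it is not the paper's method.

```python
import numpy as np


def concept_direction(pos_acts: np.ndarray, neg_acts: np.ndarray) -> np.ndarray:
    """Unit vector pointing from the mean of neg_acts to the mean of pos_acts.

    Hypothetical helper: a difference-of-means direction, one common way to
    build a concept vector. The paper may construct its vectors differently.
    """
    diff = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)


def ablate_direction(acts: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Remove the component of each activation row that lies along direction."""
    unit = direction / np.linalg.norm(direction)
    # Project each row onto the unit direction, then subtract that projection.
    return acts - np.outer(acts @ unit, unit)


# Synthetic stand-in for hidden activations (assumed data, not from the paper).
rng = np.random.default_rng(0)
hidden = 16
true_dir = np.zeros(hidden)
true_dir[0] = 1.0

# "Positive" examples carry an extra offset along the first coordinate.
pos = rng.normal(size=(200, hidden)) + 3.0 * true_dir
neg = rng.normal(size=(200, hidden))

d = concept_direction(pos, neg)
ablated = ablate_direction(pos, d)

# After ablation, no row has a measurable component along d.
residual = np.abs(ablated @ d).max()
```

In a real model the activations would come from a chosen layer rather than random draws, and the ablation would be applied during the forward pass; which layer and which contrast sets to use are choices this summary does not specify.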

Authors

Mohammad Beigi
Ming Jin
Lifu Huang

Paper Information

arXiv ID: 2606.09711v1
Categories: cs.AI, cs.LG
Published: June 8, 2026
