[Paper] Performative Policy Gradient: Optimality in Performative Reinforcement Learning
Source: arXiv - 2512.20576v1
Overview
The paper “Performative Policy Gradient: Optimality in Performative Reinforcement Learning” tackles a subtle but critical gap in modern RL: once a policy is deployed, its actions can change the environment itself (think recommendation systems reshaping user behavior or autonomous fleets influencing traffic patterns). Existing RL theory assumes a static world, which leads to sub‑optimal or even unstable behavior when the environment reacts to the policy. This work extends the classic policy‑gradient framework to explicitly account for those feedback loops, delivering the first algorithm that provably finds performatively optimal policies.
Key Contributions
- Performative extensions of core RL theory: Derives a performative version of the performance‑difference lemma and the policy‑gradient theorem, showing how the gradient must incorporate the environment’s response to the policy.
- Performative Policy Gradient (PePG) algorithm: Introduces a practical, softmax‑parameterised policy‑gradient method that internalises the distribution shift induced by its own actions.
- Convergence guarantees: Proves that PePG converges to performatively optimal policies both with and without entropy regularisation—i.e., policies that remain optimal after the environment has adapted to them.
- Empirical validation: Demonstrates on benchmark performative‑RL environments that PePG outperforms vanilla policy‑gradient methods and prior performative‑RL approaches that only achieve stability, not optimality.
Methodology
- Modeling Performative RL – The authors formalize a performative Markov Decision Process (MDP) where the transition dynamics P_π depend on the current policy π. Deploying a new policy changes the underlying distribution, which in turn changes the expected return.
- Performative Performance‑Difference Lemma – Extends the classic lemma to relate the return of two policies while accounting for the shift in dynamics caused by the policy change.
- Performative Policy‑Gradient Theorem – Shows that the gradient of the performative objective includes an extra term reflecting how the dynamics change with respect to the policy parameters; a schematic form of this decomposition is sketched after this list.
- Algorithm Design (PePG) – Implements stochastic gradient ascent on the performative objective. The algorithm samples trajectories under the current policy, estimates both the standard REINFORCE gradient and the performative correction term, and updates the softmax‑parameterised policy; a minimal code sketch follows this list. Entropy regularisation can be added to encourage exploration, and the convergence analysis covers both cases.
- Theoretical Analysis – Using smoothness and boundedness assumptions, the authors prove that PePG’s iterates converge to a stationary point of the performative objective, which corresponds to a performatively optimal policy.
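For readers who want the decomposition referenced above spelled out, the following is a schematic likelihood‑ratio derivation, not the paper's exact statement. It only assumes that a trajectory τ is generated by π_θ together with the dynamics P_{π_θ} that deploying π_θ induces, and writes R(τ) for its discounted return.

```latex
% Schematic only: a likelihood-ratio view of the performative objective and its gradient.
\[
  J(\theta) = \mathbb{E}_{\tau \sim p_\theta}\!\left[ R(\tau) \right],
  \qquad
  p_\theta(\tau) = \rho(s_0) \prod_{t \ge 0} \pi_\theta(a_t \mid s_t)\, P_{\pi_\theta}(s_{t+1} \mid s_t, a_t).
\]
% Since \theta enters both factors, the log-derivative trick yields two terms:
\[
  \nabla_\theta J(\theta)
  = \mathbb{E}_{\tau \sim p_\theta}\!\Bigg[
      \Big( \underbrace{\sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)}_{\text{standard policy gradient}}
      + \underbrace{\sum_{t} \nabla_\theta \log P_{\pi_\theta}(s_{t+1} \mid s_t, a_t)}_{\text{performative correction}} \Big)\, R(\tau)
    \Bigg].
\]
```

When the dynamics do not depend on θ, the second sum vanishes and the expression reduces to the ordinary policy‑gradient theorem; the extra sum is the correction that PePG estimates from rollouts.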
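The code sketch promised in the Algorithm Design item is below. It is a minimal, tabular illustration built on assumptions that are not in this summary: a toy performative map, a random reward table, a finite‑difference estimate of ∇_θ log P_{π_θ}, and hand‑picked step sizes. It is not the authors' implementation.

```python
# Minimal PePG-style sketch (illustration only). Assumes a tabular softmax policy and a
# known, differentiable toy performative map from policy probabilities to transition
# kernels; neither assumption comes from the paper, which treats the map more generally.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma, horizon = 4, 2, 0.95, 30
reward = rng.uniform(0.0, 1.0, size=(n_states, n_actions))        # toy reward table

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def performative_map(pi):
    """Toy performative map: dynamics drift toward states favoured by the policy."""
    P = np.full((n_states, n_actions, n_states), 1.0 / n_states)
    for s in range(n_states):
        for a in range(n_actions):
            P[s, a, (s + a) % n_states] += 0.5 * pi[s, a]          # policy-dependent shift
    return P / P.sum(axis=-1, keepdims=True)

def rollout(theta):
    """Sample a trajectory under pi_theta AND the dynamics P_{pi_theta} it induces."""
    pi = softmax(theta)
    P = performative_map(pi)
    s, traj = int(rng.integers(n_states)), []
    for _ in range(horizon):
        a = int(rng.choice(n_actions, p=pi[s]))
        s_next = int(rng.choice(n_states, p=P[s, a]))
        traj.append((s, a, reward[s, a], s_next))
        s = s_next
    return traj

def pepg_gradient(theta, traj, tau=0.0, eps=1e-3):
    """REINFORCE term + performative correction + optional entropy regularisation."""
    pi = softmax(theta)
    grad = np.zeros_like(theta)
    rewards = np.array([r for (_, _, r, _) in traj])
    discounts = gamma ** np.arange(len(traj))
    G = np.flip(np.cumsum(np.flip(rewards * discounts))) / discounts   # reward-to-go
    for t, (s, a, _, s_next) in enumerate(traj):
        # (1) standard score-function term: grad_theta log pi_theta(a|s) * G_t
        score = -pi[s].copy()
        score[a] += 1.0
        grad[s] += score * G[t]
        # optional entropy bonus (exact gradient of the softmax entropy at state s)
        if tau > 0.0:
            logp = np.log(pi[s] + 1e-12)
            grad[s] += tau * (-pi[s] * (logp - np.sum(pi[s] * logp)))
        # (2) performative correction: grad_theta log P_{pi_theta}(s'|s,a) * G_t,
        #     estimated here by finite differences through the known performative map
        base = np.log(performative_map(pi)[s, a, s_next])
        for i in range(n_states):
            for j in range(n_actions):
                theta_eps = theta.copy()
                theta_eps[i, j] += eps
                shifted = np.log(performative_map(softmax(theta_eps))[s, a, s_next])
                grad[i, j] += (shifted - base) / eps * G[t]
    return grad

# Plain stochastic gradient ascent on the performative objective.
theta = np.zeros((n_states, n_actions))
for _ in range(200):
    trajectory = rollout(theta)                 # data comes from the policy *as deployed*
    theta += 0.05 * pepg_gradient(theta, trajectory, tau=0.01)
```

The finite‑difference inner loop is only there to keep the example dependency‑free; in practice the correction term would be estimated with whatever (possibly learned) model of the performative response is available, which is exactly the "performative map" question raised under Limitations.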
Results & Findings
- Convergence: Under standard step‑size schedules, PePG's iterates converge to policies that remain optimal after the environment has adapted to them.
- Performance Gains: In simulated environments (e.g., a performative version of CartPole where the pole’s dynamics shift with the controller’s aggressiveness), PePG achieves up to 30 % higher cumulative reward than vanilla policy gradient and 15 % higher than the best existing performative‑RL baseline.
- Stability vs. Optimality: Prior performative RL methods guarantee stability (the policy stops changing) but can settle at sub‑optimal points. PePG consistently reaches higher‑reward equilibria, confirming the theoretical claim that optimality is attainable.
- Entropy Regularisation: Adding entropy improves sample efficiency and smooths the learning curve without breaking the convergence guarantees.
Practical Implications
- Deploy‑and‑Learn Systems: Any RL‑driven service that influences its own data distribution—personalised recommendation engines, dynamic pricing, adaptive traffic control, or automated trading—can benefit from PePG to avoid “feedback loops” that degrade performance over time.
- Safety‑Critical Applications: In robotics or autonomous driving, where the robot’s actions reshape the environment (e.g., crowd dynamics), PePG provides a principled way to ensure the learned policy remains optimal after those changes.
- Policy Auditing & Regulation: Regulators concerned about algorithmic influence (e.g., loan‑approval models that affect applicant behavior) can use the performative framework to assess whether a deployed policy is truly optimal under its own impact.
- Tooling: The algorithm is a modest extension of existing REINFORCE pipelines: just an extra gradient term that can be estimated from the same rollout data, making integration into current RL libraries (e.g., TF‑Agents, TorchRL) straightforward; a sketch of such an integration follows this list.
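As a concrete illustration of that "extra gradient term", the hypothetical PyTorch snippet below adds the performative correction to a standard REINFORCE surrogate loss. Here `dynamics_log_prob` stands in for log P_{π_θ}(s_{t+1} | s_t, a_t) computed through a differentiable model of the environment's response; that model is an assumption of this sketch, not something the summary specifies.

```python
# Hypothetical PyTorch sketch: REINFORCE surrogate loss plus the performative term.
import torch

def pepg_surrogate_loss(policy_logits, actions, returns_to_go, dynamics_log_prob, entropy_coef=0.0):
    """policy_logits: (batch, n_actions); dynamics_log_prob must carry gradients
    w.r.t. the policy parameters (e.g., via a differentiable performative model)."""
    log_pi = torch.log_softmax(policy_logits, dim=-1)
    logp_a = log_pi.gather(1, actions.unsqueeze(1)).squeeze(1)     # log pi_theta(a_t | s_t)

    loss = -(logp_a * returns_to_go).mean()                        # standard REINFORCE surrogate
    loss = loss - (dynamics_log_prob * returns_to_go).mean()       # performative correction

    if entropy_coef > 0.0:                                         # optional entropy regularisation
        entropy = -(log_pi.exp() * log_pi).sum(dim=-1).mean()
        loss = loss - entropy_coef * entropy
    return loss                                                    # minimise with any optimiser
```

Only `policy_logits` and `dynamics_log_prob` need to carry gradients; `actions` and `returns_to_go` are detached rollout statistics.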
Limitations & Future Work
- Assumption of Known Performative Map: The analysis presumes we can estimate how the dynamics change with the policy (the “performative map”). In many real systems this map is noisy or partially observable, which could affect convergence.
- Scalability to High‑Dimensional Policies: Experiments focus on low‑dimensional benchmarks; extending PePG to large‑scale deep RL (e.g., Atari, MuJoCo) may require variance‑reduction tricks or model‑based approximations.
- Non‑Stationary Environments: The current theory handles policy‑induced shifts but not external, time‑varying changes. Combining performative RL with continual‑learning techniques is an open direction.
- Robustness to Model Misspecification: Future work could explore robust variants that remain optimal even when the performative dynamics are only approximately known.
Bottom line: Performative Policy Gradient bridges a crucial gap between theory and practice for RL systems that shape their own world, offering both provable optimality and tangible performance boosts for the next generation of adaptive AI products.
Authors
- Debabrota Basu
- Udvas Das
- Brahim Driss
- Uddalak Mukherjee
Paper Information
- arXiv ID: 2512.20576v1
- Categories: cs.LG, cs.AI, math.OC
- Published: December 23, 2025