[Paper] When Learning Rates Go Wrong: Early Structural Signals in PPO Actor-Critic

Published: March 10, 2026 at 01:46 PM EDT
5 min read
Source: arXiv


Overview

The paper investigates why the learning rate (LR) is such a fickle hyper‑parameter in Proximal Policy Optimization (PPO) actor‑critic agents. By looking inside the neural nets—specifically at how hidden‑unit activations flip sign during training—the authors devise a lightweight metric that can flag “bad” LR choices after only a fraction of the total training time.

Key Contributions

  • Overfitting‑Underfitting Indicator (OUI) for RL – adapts a binary‑activation balance metric to the RL setting and provides a batch‑based, computationally cheap formulation.
  • Theoretical link between LR and activation sign changes – shows how step size controls the rate at which hidden neurons switch polarity, which in turn governs stability vs. stagnation.
  • Early‑stage diagnostic – demonstrates that OUI measured at ~10 % of total training already separates “good” from “bad” learning‑rate regimes across three discrete‑control benchmarks.
  • Empirical asymmetry between actor and critic – the best‑performing critic networks sit in a moderate OUI band (avoiding saturation), while top‑performing actors exhibit higher OUI values.
  • Screening benchmark – compares OUI‑based early pruning against classic early‑return, clip‑based, divergence‑based, and flip‑based rules, showing OUI yields the highest precision for a given recall and synergizes best when combined with early‑return.

Methodology

  1. Probe batch creation – a small, fixed set of environment observations (≈ 1 % of the rollout buffer) is sampled at the start of training.
  2. Batch‑based OUI computation – for each hidden neuron, the sign of its pre‑activation (positive vs. negative) is recorded over the probe batch at every training step. OUI is the normalized variance of these binary patterns, reflecting how often a neuron flips between the two states.
  3. Theoretical analysis – using a first‑order Taylor expansion of the weight update, the authors prove that larger LRs increase the probability of sign flips, while very small LRs keep neurons stuck in one polarity, leading to under‑utilization of network capacity.
  4. Experimental protocol – PPO agents are trained on CartPole, Acrobot, and LunarLander with a grid of LR values (both actor and critic). For each run, OUI is logged every 10 % of total timesteps. The final returns are used to label runs as “successful” or “collapsed”.
  5. Screening evaluation – various early‑stop criteria are applied at the 10 % checkpoint. Precision‑recall curves are plotted under a matched‑recall constraint to compare how well each rule filters out doomed runs while retaining good ones.
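The batch-based OUI computation in step 2 can be sketched as follows. This is a minimal illustration in the spirit of the description above, not the paper's exact formulation: it binarizes pre-activations over the probe batch, stacks the patterns across training checkpoints, and normalizes each neuron's Bernoulli variance so that 0 means "frozen in one polarity" and 1 means "maximal flipping".

```python
import numpy as np

def activation_signs(preacts):
    """Binarize pre-activations: 1 if positive, 0 otherwise.
    preacts: (batch, neurons) array evaluated on the fixed probe batch."""
    return (preacts > 0).astype(np.int8)

def oui(sign_history):
    """Illustrative OUI: normalized variance of each neuron's binary
    activation pattern across training checkpoints, averaged over
    neurons and probe inputs.
    sign_history: (steps, batch, neurons) stack of binarized patterns.
    A Bernoulli variable's variance p*(1-p) peaks at 0.25, so dividing
    by 0.25 maps the score onto [0, 1]."""
    p = sign_history.mean(axis=0)   # fraction of checkpoints "on"
    var = p * (1.0 - p)             # Bernoulli variance per (input, neuron)
    return float((var / 0.25).mean())
```

Under this sketch, a network whose neurons never change polarity scores 0 (the "too low LR" regime), while one whose activations flip at random scores near 1 (the "too high LR" regime).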

Results & Findings

| Environment | LR regime | OUI trend (10 % training) | Final return (avg) |
| --- | --- | --- | --- |
| CartPole | Too low | Near 0 (no sign flips) | < 50 % of optimal |
| CartPole | Optimal | Moderate (≈ 0.35) | Near max (≈ 200) |
| CartPole | Too high | Near 1 (constant flips) | Divergence / collapse |
| Acrobot / LunarLander | All regimes | Same pattern: a sweet-spot OUI band for critics, higher OUI for actors | — |
  • Early discrimination: A simple OUI threshold at 10 % training separates > 90 % of the runs that later collapse from those that achieve high returns.
  • Actor vs. critic asymmetry: Critics benefit from staying out of saturation (moderate OUI), whereas actors need more dynamic hidden‑unit activity (higher OUI) to explore policies effectively.
  • Screening performance:
    • OUI alone attains the highest precision at any recall level compared with early‑return, KL‑divergence, or weight‑flip criteria.
    • Combining OUI with early‑return (i.e., “return > threshold and OUI ∈ band”) yields the best overall precision, allowing aggressive pruning of up to 70 % of runs without sacrificing the top‑performing ones.
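The combined pruning rule above can be sketched as a simple predicate evaluated at the 10 % checkpoint. The return floor and OUI band below are illustrative placeholders, not fitted values from the paper:

```python
def keep_run(early_return, oui_value,
             return_floor=50.0, oui_band=(0.2, 0.5)):
    """Combined early-screening rule: keep a run only if its early
    return clears a floor AND its OUI sits inside the sweet-spot
    band. Both thresholds are hypothetical examples."""
    lo, hi = oui_band
    return early_return >= return_floor and lo <= oui_value <= hi
```

Runs failing either test are pruned; per the matched-recall comparison, the conjunction is stricter (higher precision) than either criterion alone.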

Practical Implications

  • Hyper‑parameter tuning pipelines: Integrate OUI as a cheap “early‑stop” checkpoint. Instead of running dozens of full PPO trainings to find a good LR, you can discard > 60 % of candidates after a few hundred thousand steps.
  • Automated RL services (e.g., RL‑as‑a‑service, AutoRL): OUI can be exposed as a metric in dashboards, giving engineers a real‑time health indicator of the network’s internal dynamics.
  • Robust production deployments: When rolling out new policies, monitor OUI on a validation batch; a sudden drift toward saturation or chaotic flipping can signal that the learning‑rate schedule (or optimizer) needs adjustment before the model degrades in production.
  • Curriculum or adaptive LR schedules: The theoretical link suggests that a schedule which keeps OUI within the “sweet‑spot” band (e.g., gradually decreasing LR as OUI rises) could improve stability without manual tuning.
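A schedule like the one suggested in the last bullet could take the form of a simple feedback controller. This is a hypothetical sketch, not a method from the paper; the band and adjustment factor are assumed values:

```python
def adapt_lr(lr, oui_value, target_band=(0.2, 0.5),
             factor=0.9, lr_min=1e-6, lr_max=1e-2):
    """Illustrative controller: shrink the LR when OUI drifts above
    the band (too many sign flips), grow it when OUI falls below
    (neurons freezing), and clamp the result to a safe range."""
    lo, hi = target_band
    if oui_value > hi:
        lr *= factor        # cool down a flip-happy network
    elif oui_value < lo:
        lr /= factor        # wake up a saturating network
    return min(max(lr, lr_min), lr_max)
```

Called once per OUI measurement, this keeps the step size hovering wherever the indicator stays inside the sweet-spot band.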

Limitations & Future Work

  • Scope limited to discrete‑action PPO – the study does not cover continuous‑action algorithms (e.g., SAC, TD3) where activation dynamics may differ.
  • Fixed probe batch – while efficient, a static batch may not capture distributional shifts in later training phases; adaptive probing could be explored.
  • Only learning‑rate examined – other hyper‑parameters (entropy coefficient, clipping epsilon) likely interact with OUI; joint analysis is left for future research.
  • Theoretical assumptions – the sign‑flip analysis relies on first‑order approximations; extending the theory to higher‑order dynamics or non‑linear optimizers (Adam) remains an open question.

Overall, the paper offers a practical, theoretically grounded tool for early detection of problematic learning‑rate settings in PPO, opening the door to faster, more reliable RL experimentation and deployment.

Authors

  • Alberto Fernández-Hernández
  • Cristian Pérez-Corral
  • Jose I. Mestre
  • Manuel F. Dolz
  • Jose Duato
  • Enrique S. Quintana-Ortí

Paper Information

  • arXiv ID: 2603.09950v1
  • Categories: cs.LG, cs.AI
  • Published: March 10, 2026