[Paper] Stochastic Actor-Critic: Mitigating Overestimation via Temporal Aleatoric Uncertainty

Published: January 2, 2026 at 11:33 AM EST
4 min read

Source: arXiv - 2601.00737v1

Overview

The paper introduces Stochastic Actor‑Critic (STAC), a new off‑policy reinforcement‑learning algorithm that tackles the chronic problem of value‑overestimation in actor‑critic methods. Instead of relying on costly ensembles to estimate epistemic (model) uncertainty, STAC leverages temporal aleatoric uncertainty—the inherent randomness of transitions, rewards, and policy‑induced variability—to inject a principled pessimistic bias into TD updates. The result is a more sample‑efficient, computationally lightweight algorithm that also exhibits risk‑averse behavior in stochastic environments.

Key Contributions

  • Aleatoric‑based pessimism: Uses one‑step aleatoric uncertainty (from stochastic dynamics) to scale the pessimistic term in TD updates, eliminating the need for ensemble‑based epistemic uncertainty estimates.
  • Single distributional critic: Introduces a distributional critic that directly models the full return distribution, providing both mean value and uncertainty from a single network (see the sketch after this list).
  • Dropout regularization for actor and critic: Applies dropout to both networks, improving training stability and acting as an implicit Bayesian approximation for additional uncertainty handling.
  • Computational efficiency: Achieves comparable or superior performance to ensemble‑based baselines while using far fewer parameters and forward passes.
  • Risk‑averse policy emergence: Demonstrates that aleatoric‑driven pessimism naturally leads to policies that avoid high‑variance (risky) outcomes in stochastic settings.
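
The paper does not ship reference code, so the following is only a minimal PyTorch sketch of what a single Gaussian distributional critic could look like: one network whose forward pass returns both a mean Q‑value and an aleatoric standard deviation. The class name `GaussianCritic`, the layer sizes, and the clamping range are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn


class GaussianCritic(nn.Module):
    """Single distributional critic (illustrative): predicts a Gaussian over the
    return for a (state, action) pair, so the mean and the aleatoric spread come
    from one forward pass instead of an ensemble."""

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 256):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mean_head = nn.Linear(hidden, 1)     # mu_Q(s, a)
        self.log_std_head = nn.Linear(hidden, 1)  # log sigma_Q(s, a)

    def forward(self, obs: torch.Tensor, act: torch.Tensor):
        h = self.body(torch.cat([obs, act], dim=-1))
        mean = self.mean_head(h)
        # Clamp for numerical stability; the exact bounds are an assumption.
        log_std = self.log_std_head(h).clamp(-10.0, 2.0)
        return mean, log_std.exp()
```

A single network of this form replaces the multiple critic copies an ensemble method would maintain, which is where the memory and compute savings cited below come from.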

Methodology

  1. Distributional Critic:

    • The critic outputs a parametric distribution (e.g., Gaussian or categorical) over the one‑step return rather than a single scalar Q‑value.
    • The mean of this distribution serves as the usual value estimate; the variance captures aleatoric uncertainty.
  2. Temporal‑Aleatoric Pessimism:

    • When computing the TD target \( y = r + \gamma \hat{Q}(s', a') \), STAC subtracts a pessimism term proportional to the predicted variance:

\[ y_{\text{pess}} = r + \gamma \big( \mu_{Q}(s', a') - \beta \, \sigma_{Q}(s', a') \big) \]

    • \( \beta \) is a tunable coefficient controlling the degree of conservatism (a code sketch of this target follows the list below).
  3. Dropout as Bayesian Approximation:

    • Both actor and critic networks employ dropout during training and inference. This yields stochastic forward passes that further capture model uncertainty without maintaining multiple network copies.
  4. Learning Loop:

    • Sample a minibatch from the replay buffer.
    • Compute the distributional TD error using the pessimistic target.
    • Update the critic by minimizing a distributional loss (e.g., quantile regression or KL divergence).
    • Update the actor via policy gradient using the pessimistic Q‑estimate as the advantage signal.
  5. Implementation Simplicity:

    • No ensemble management, no extra target networks beyond the usual soft‑updates, and a single forward pass per sample.
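
To make the loop above concrete, the sketch below shows one way the pessimistic target and the two update losses could be written against the Gaussian critic sketched earlier. The Gaussian negative log‑likelihood stands in for whichever distributional loss the authors actually use (they mention quantile regression or KL divergence), the deterministic `actor(obs)` call and the `done` masking convention are assumptions, `gamma = 0.99` is a common default, and `beta = 0.5` echoes the robust default the authors report.

```python
import torch
import torch.nn.functional as F


def pessimistic_target(reward, done, next_obs, next_act, target_critic,
                       gamma: float = 0.99, beta: float = 0.5):
    """y_pess = r + gamma * (mu_Q(s', a') - beta * sigma_Q(s', a')).

    `target_critic` is assumed to return (mean, std) as in the earlier sketch;
    `done` is a 0/1 float tensor marking terminal transitions.
    """
    with torch.no_grad():
        mu, sigma = target_critic(next_obs, next_act)
        return reward + gamma * (1.0 - done) * (mu - beta * sigma)


def critic_loss(critic, obs, act, target):
    """One possible distributional loss: Gaussian negative log-likelihood of
    the pessimistic target under the predicted return distribution."""
    mu, sigma = critic(obs, act)
    return F.gaussian_nll_loss(mu, target, sigma.pow(2))


def actor_loss(actor, critic, obs, beta: float = 0.5):
    """Policy gradient through the pessimistic Q-estimate: the actor is pushed
    toward actions whose predicted return stays high after the variance penalty.
    A deterministic actor is assumed here for brevity."""
    act = actor(obs)
    mu, sigma = critic(obs, act)
    return -(mu - beta * sigma).mean()
```

Each minibatch update then amounts to computing `pessimistic_target`, taking a gradient step on `critic_loss`, taking a gradient step on `actor_loss`, and soft‑updating the target critic, with no ensemble bookkeeping.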

Results & Findings

| Environment | Baseline (e.g., SAC, Ensemble‑TD3) | STAC (mean ± std) | Overestimation Gap |
| --- | --- | --- | --- |
| MuJoCo Hopper (deterministic) | 3450 ± 120 | 3520 ± 95 | ↓ 0.3% |
| MuJoCo HalfCheetah (stochastic) | 4800 ± 210 | 4925 ± 180 | ↓ 1.2% |
| Stochastic GridWorld (risk‑sensitive) | 0.68 success rate | 0.81 success rate | ↓ 0.15 (risk‑averse) |

  • Overestimation mitigation: STAC’s pessimistic targets consistently reduced the bias between predicted and true returns, as measured by the “overestimation gap”.
  • Sample efficiency: Achieved comparable performance to ensemble methods with ~30% fewer environment steps.
  • Stability: Training curves showed lower variance across random seeds, attributed to dropout regularization.
  • Risk‑averse behavior: In environments with high transition noise, STAC preferred safer actions (e.g., avoiding slippery tiles) without any explicit risk‑penalty term.

Practical Implications

  • Faster prototyping: Developers can replace ensemble‑based critics (which require multiple forward passes per update) with a single distributional network, cutting GPU memory and compute costs.
  • Safer RL deployments: The built‑in aleatoric pessimism yields policies that naturally hedge against stochasticity—useful for robotics, autonomous driving, or finance where worst‑case outcomes matter.
  • Dropout as a plug‑and‑play regularizer: Adding dropout layers to existing actor‑critic codebases is trivial, yet it provides both regularization and an extra uncertainty signal (a minimal example follows this list).
  • Simplified hyper‑parameter tuning: The only new knob is the pessimism coefficient \( \beta \); the authors report a robust default (\( \beta \approx 0.5 \)) that works across domains.
  • Compatibility: STAC can be integrated into popular libraries (e.g., Stable‑Baselines3, RLlib) by swapping the critic implementation and adding dropout, making it accessible to engineers without deep RL expertise.
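
As one deliberately simplified illustration of the dropout point above, the snippet below adds `nn.Dropout` layers to a plain actor MLP. The 0.1 dropout rate and the 17/6 observation/action dimensions are placeholders, not values reported in the paper.

```python
import torch.nn as nn

# Illustrative only: turning an ordinary actor MLP into a dropout-regularized
# one. Rates and dimensions are placeholders, not settings from the paper.
actor = nn.Sequential(
    nn.Linear(17, 256), nn.ReLU(), nn.Dropout(p=0.1),
    nn.Linear(256, 256), nn.ReLU(), nn.Dropout(p=0.1),
    nn.Linear(256, 6), nn.Tanh(),
)
```

Keeping dropout active at evaluation time (rather than switching the network to eval mode) is what turns these stochastic forward passes into the implicit Bayesian approximation described in the methodology section.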

Limitations & Future Work

  • Aleatoric focus: The method assumes that most overestimation stems from stochasticity; in highly deterministic but data‑scarce regimes, epistemic uncertainty may still dominate.
  • Distributional choice: The paper uses a simple Gaussian parameterization; richer distribution families (e.g., categorical or mixture models) could capture multimodal returns more accurately.
  • Scalability to high‑dimensional observation spaces: Experiments were limited to standard continuous control benchmarks; applying STAC to vision‑based tasks (e.g., Atari, 3D navigation) may require architectural tweaks.
  • Adaptive \( \beta \): Future work could explore learning the pessimism coefficient online, possibly conditioned on environment statistics.
  • Theoretical guarantees: While empirical results are strong, a formal analysis of convergence under aleatoric pessimism remains an open research direction.

Authors

  • Uğurcan Özalp

Paper Information

  • arXiv ID: 2601.00737v1
  • Categories: cs.LG, cs.AI, eess.SY
  • Published: January 2, 2026