[Paper] Stochastic Actor-Critic: Mitigating Overestimation via Temporal Aleatoric Uncertainty

Published: January 2, 2026 at 11:33 AM EST
4 min read

Source: arXiv - 2601.00737v1

Overview

The paper introduces Stochastic Actor‑Critic (STAC), a new off‑policy reinforcement‑learning algorithm that tackles the chronic problem of value‑overestimation in actor‑critic methods. Instead of relying on costly ensembles to estimate epistemic (model) uncertainty, STAC leverages temporal aleatoric uncertainty—the inherent randomness of transitions, rewards, and policy‑induced variability—to inject a principled pessimistic bias into TD updates. The result is a more sample‑efficient, computationally lightweight algorithm that also exhibits risk‑averse behavior in stochastic environments.

Key Contributions

  • Aleatoric‑based pessimism: Uses one‑step aleatoric uncertainty (from stochastic dynamics) to scale the pessimistic term in TD updates, eliminating the need for ensemble‑based epistemic uncertainty estimates.
  • Single distributional critic: Introduces a distributional critic that directly models the full return distribution, providing both mean value and uncertainty from a single network (see the sketch after this list).
  • Dropout regularization for actor and critic: Applies dropout to both networks, improving training stability and acting as an implicit Bayesian approximation for additional uncertainty handling.
  • Computational efficiency: Achieves comparable or superior performance to ensemble‑based baselines while using far fewer parameters and forward passes.
  • Risk‑averse policy emergence: Demonstrates that aleatoric‑driven pessimism naturally leads to policies that avoid high‑variance (risky) outcomes in stochastic settings.
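
The paper does not ship reference code, so the following is only a minimal PyTorch sketch of what a single Gaussian distributional critic could look like: one network whose forward pass returns both a mean Q‑value and an aleatoric standard deviation. The class name `GaussianCritic`, the layer sizes, and the clamping range are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn


class GaussianCritic(nn.Module):
    """Single distributional critic (illustrative): predicts a Gaussian over the
    return for a (state, action) pair, so the mean and the aleatoric spread come
    from one forward pass instead of an ensemble."""

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 256):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mean_head = nn.Linear(hidden, 1)     # mu_Q(s, a)
        self.log_std_head = nn.Linear(hidden, 1)  # log sigma_Q(s, a)

    def forward(self, obs: torch.Tensor, act: torch.Tensor):
        h = self.body(torch.cat([obs, act], dim=-1))
        mean = self.mean_head(h)
        # Clamp for numerical stability; the exact bounds are an assumption.
        log_std = self.log_std_head(h).clamp(-10.0, 2.0)
        return mean, log_std.exp()
```

A single network of this form replaces the multiple critic copies an ensemble method would maintain, which is where the memory and compute savings cited below come from.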

Methodology

  1. Distributional Critic:

    • The critic outputs a parametric distribution (e.g., Gaussian or categorical) over the one‑step return rather than a single scalar Q‑value.
    • The mean of this distribution serves as the usual value estimate; the variance captures aleatoric uncertainty.
  2. Temporal‑Aleatoric Pessimism:

    • When computing the TD target \( y = r + \gamma \hat{Q}(s', a') \), STAC subtracts a pessimism term proportional to the predicted variance:

\[ y_{\text{pess}} = r + \gamma \big( \mu_{Q}(s', a') - \beta \, \sigma_{Q}(s', a') \big) \]

    • \( \beta \) is a tunable coefficient controlling the degree of conservatism (a code sketch of this target follows the list below).
  3. Dropout as Bayesian Approximation:

    • Both actor and critic networks employ dropout during training and inference. This yields stochastic forward passes that further capture model uncertainty without maintaining multiple network copies.
  4. Learning Loop:

    • Sample a minibatch from the replay buffer.
    • Compute the distributional TD error using the pessimistic target.
    • Update the critic by minimizing a distributional loss (e.g., quantile regression or KL divergence).
    • Update the actor via policy gradient using the pessimistic Q‑estimate as the advantage signal.
  5. Implementation Simplicity:

    • No ensemble management, no extra target networks beyond the usual soft‑updates, and a single forward pass per sample.
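
To make the loop above concrete, the sketch below shows one way the pessimistic target and the two update losses could be written against the Gaussian critic sketched earlier. The Gaussian negative log‑likelihood stands in for whichever distributional loss the authors actually use (they mention quantile regression or KL divergence), the deterministic `actor(obs)` call and the `done` masking convention are assumptions, `gamma = 0.99` is a common default, and `beta = 0.5` echoes the robust default the authors report.

```python
import torch
import torch.nn.functional as F


def pessimistic_target(reward, done, next_obs, next_act, target_critic,
                       gamma: float = 0.99, beta: float = 0.5):
    """y_pess = r + gamma * (mu_Q(s', a') - beta * sigma_Q(s', a')).

    `target_critic` is assumed to return (mean, std) as in the earlier sketch;
    `done` is a 0/1 float tensor marking terminal transitions.
    """
    with torch.no_grad():
        mu, sigma = target_critic(next_obs, next_act)
        return reward + gamma * (1.0 - done) * (mu - beta * sigma)


def critic_loss(critic, obs, act, target):
    """One possible distributional loss: Gaussian negative log-likelihood of
    the pessimistic target under the predicted return distribution."""
    mu, sigma = critic(obs, act)
    return F.gaussian_nll_loss(mu, target, sigma.pow(2))


def actor_loss(actor, critic, obs, beta: float = 0.5):
    """Policy gradient through the pessimistic Q-estimate: the actor is pushed
    toward actions whose predicted return stays high after the variance penalty.
    A deterministic actor is assumed here for brevity."""
    act = actor(obs)
    mu, sigma = critic(obs, act)
    return -(mu - beta * sigma).mean()
```

Each minibatch update then amounts to computing `pessimistic_target`, taking a gradient step on `critic_loss`, taking a gradient step on `actor_loss`, and soft‑updating the target critic, with no ensemble bookkeeping.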

Results & Findings

| Environment | Baseline (e.g., SAC, Ensemble‑TD3) | STAC (mean ± std) | Overestimation Gap |
| --- | --- | --- | --- |
| MuJoCo Hopper (deterministic) | 3450 ± 120 | 3520 ± 95 | ↓ 0.3% |
| MuJoCo HalfCheetah (stochastic) | 4800 ± 210 | 4925 ± 180 | ↓ 1.2% |
| Stochastic GridWorld (risk‑sensitive) | 0.68 success rate | 0.81 success rate | ↓ 0.15 (risk‑averse) |

  • Overestimation mitigation: STAC’s pessimistic targets consistently reduced the bias between predicted and true returns, as measured by the “overestimation gap”.
  • Sample efficiency: Achieved comparable performance to ensemble methods with ~30% fewer environment steps.
  • Stability: Training curves showed lower variance across random seeds, attributed to dropout regularization.
  • Risk‑averse behavior: In environments with high transition noise, STAC preferred safer actions (e.g., avoiding slippery tiles) without any explicit risk‑penalty term.

Practical Implications

  • Faster prototyping: Developers can replace ensemble‑based critics (which require multiple forward passes per update) with a single distributional network, cutting GPU memory and compute costs.
  • Safer RL deployments: The built‑in aleatoric pessimism yields policies that naturally hedge against stochasticity—useful for robotics, autonomous driving, or finance where worst‑case outcomes matter.
  • Dropout as a plug‑and‑play regularizer: Adding dropout layers to existing actor‑critic codebases is trivial, yet it provides both regularization and an extra uncertainty signal (a minimal example follows this list).
  • Simplified hyper‑parameter tuning: The only new knob is the pessimism coefficient \( \beta \); the authors report a robust default (\( \beta \approx 0.5 \)) that works across domains.
  • Compatibility: STAC can be integrated into popular libraries (e.g., Stable‑Baselines3, RLlib) by swapping the critic implementation and adding dropout, making it accessible to engineers without deep RL expertise.
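
As one deliberately simplified illustration of the dropout point above, the snippet below adds `nn.Dropout` layers to a plain actor MLP. The 0.1 dropout rate and the 17/6 observation/action dimensions are placeholders, not values reported in the paper.

```python
import torch.nn as nn

# Illustrative only: turning an ordinary actor MLP into a dropout-regularized
# one. Rates and dimensions are placeholders, not settings from the paper.
actor = nn.Sequential(
    nn.Linear(17, 256), nn.ReLU(), nn.Dropout(p=0.1),
    nn.Linear(256, 256), nn.ReLU(), nn.Dropout(p=0.1),
    nn.Linear(256, 6), nn.Tanh(),
)
```

Keeping dropout active at evaluation time (rather than switching the network to eval mode) is what turns these stochastic forward passes into the implicit Bayesian approximation described in the methodology section.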

Limitations & Future Work

  • Aleatoric focus: The method assumes that most overestimation stems from stochasticity; in highly deterministic but data‑scarce regimes, epistemic uncertainty may still dominate.
  • Distributional choice: The paper uses a simple Gaussian parameterization; richer distribution families (e.g., categorical or mixture models) could capture multimodal returns more accurately.
  • Scalability to high‑dimensional observation spaces: Experiments were limited to standard continuous control benchmarks; applying STAC to vision‑based tasks (e.g., Atari, 3D navigation) may require architectural tweaks.
  • Adaptive \( \beta \): Future work could explore learning the pessimism coefficient online, possibly conditioned on environment statistics.
  • Theoretical guarantees: While empirical results are strong, a formal analysis of convergence under aleatoric pessimism remains an open research direction.

Authors

  • Uğurcan Özalp

Paper Information

  • arXiv ID: 2601.00737v1
  • Categories: cs.LG, cs.AI, eess.SY
  • Published: January 2, 2026