[Paper] Model-Agnostic Solutions for Deep Reinforcement Learning in Non-Ergodic Contexts
Source: arXiv - 2601.08726v1
Overview
The paper Model‑Agnostic Solutions for Deep Reinforcement Learning in Non‑Ergodic Contexts shows that standard deep RL algorithms—built around the Bellman equation’s expected‑value formulation—systematically miss the true optimum when the environment is non‑ergodic (i.e., when long‑run outcomes depend on the actual trajectory rather than on an ensemble average). By injecting explicit time information into the agent’s function approximator, the authors demonstrate that deep agents can learn value functions that align with the time‑average growth rate, closing the performance gap without reshaping rewards or redesigning the objective.
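To make the ergodicity gap concrete, the short simulation below runs the textbook multiplicative coin-flip game. The payoff factors 1.5x and 0.6x are illustrative choices for this summary, not parameters taken from the paper: the ensemble average grows every round while almost every individual trajectory decays, which is exactly the mismatch between expected-value Bellman targets and time-average growth that the paper addresses.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative multiplicative game (hypothetical parameters, not from the paper):
# each round, wealth is multiplied by 1.5 (heads) or 0.6 (tails) with equal
# probability. The per-round ensemble-average factor is 0.5*1.5 + 0.5*0.6 = 1.05 > 1,
# but the time-average (geometric-mean) factor is sqrt(1.5*0.6) ~ 0.95 < 1,
# so almost every individual trajectory decays.
n_agents, n_steps = 100_000, 100
factors = rng.choice([1.5, 0.6], size=(n_agents, n_steps))
wealth = factors.cumprod(axis=1)[:, -1]          # terminal wealth of each agent

print(f"theoretical ensemble growth 1.05**T:        {1.05 ** n_steps:.1f}")
print(f"sample mean (noisy, rare-event dominated):  {wealth.mean():.1f}")
print(f"median trajectory (~0.95**T, well below 1): {np.median(wealth):.2e}")
```

An agent trained purely on expected value keeps favoring such bets even though its own bankroll almost surely collapses; recovering the time-average-optimal behavior is what the temporal augmentation is meant to enable.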
Key Contributions
- Theoretical clarification of why expected‑value Bellman updates are mismatched to non‑ergodic dynamics, extending earlier work from tabular to deep RL settings.
- Proof‑of‑concept architecture that augments the state representation with a temporal feature (e.g., episode step count or a learned time embedding) while keeping the rest of the learning pipeline unchanged.
- Empirical validation across several synthetic non‑ergodic benchmarks (multiplicative‑growth processes, stochastic gambling games, and a non‑stationary navigation task) showing up to 30 % higher cumulative reward compared with vanilla DQN, PPO, and A2C.
- Model‑agnostic claim: the temporal augmentation works with any off‑policy or on‑policy deep RL algorithm, making it a drop‑in improvement rather than a new algorithmic family.
- Practical recipe for developers: minimal code changes (add a time channel to the observation tensor, optionally normalize it) and no need to redesign reward shaping or policy objectives.
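As a rough illustration of the practical recipe above, the wrapper below appends a normalized step counter to every observation. It is a minimal sketch assuming a Gymnasium-style environment with a Box observation space and a known horizon (`max_steps` is a hypothetical parameter introduced here); the paper's own implementation may differ in its details.

```python
import numpy as np
import gymnasium as gym


class TimeAugmentedObs(gym.ObservationWrapper):
    """Append a normalized elapsed-time feature tau_t, i.e. s'_t = [s_t; tau_t].

    Sketch only: assumes a Box observation space and a known episode horizon
    (`max_steps` is a hypothetical parameter, not a detail from the paper).
    """

    def __init__(self, env, max_steps):
        super().__init__(env)
        self.max_steps = max_steps
        self._t = 0
        low = np.append(env.observation_space.low, 0.0).astype(np.float32)
        high = np.append(env.observation_space.high, 1.0).astype(np.float32)
        self.observation_space = gym.spaces.Box(low=low, high=high, dtype=np.float32)

    def reset(self, **kwargs):
        self._t = 0
        return super().reset(**kwargs)   # ObservationWrapper applies observation()

    def step(self, action):
        self._t += 1
        return super().step(action)

    def observation(self, obs):
        tau = self._t / self.max_steps   # normalized step count in [0, 1]
        return np.append(obs, tau).astype(np.float32)
```

Because only the observation changes, any downstream algorithm sees the time feature as just another input dimension, which is the sense in which the fix is model-agnostic.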
Methodology
- Problem framing – The authors formalize non‑ergodicity as a divergence between the ensemble‑average expectation used in the Bellman equation and the time‑average growth that an individual agent experiences.
- Temporal augmentation – They extend the observation vector from s_t to s'_t = [s_t; τ_t], where τ_t is a scalar or low‑dimensional encoding of the elapsed time (e.g., normalized step count, sinusoidal positional encoding, or a learned recurrent hidden state).
- Network architecture – Existing deep RL networks (CNNs for visual inputs, MLPs for low‑dimensional states) are left untouched except for an extra input channel. The rest of the pipeline—experience replay, target networks, policy gradients—remains identical.
- Training protocol – Agents are trained on a suite of non‑ergodic environments:
- Multiplicative wealth games where rewards compound multiplicatively, leading to geometric‑mean optimality.
- Stochastic gambling (e.g., Kelly‑type bets) where the optimal policy maximizes long‑run growth, not expected payoff.
- Non‑stationary gridworld where transition probabilities drift over time.
- Evaluation – Performance is measured by the time‑average cumulative reward over long horizons (10⁴–10⁵ steps) and compared against baseline agents lacking the time feature.
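For the multiplicative benchmarks, the evaluation criterion can be read as the per-step log growth rate of a single long trajectory. The sketch below assumes that per-step rewards are wealth multipliers (an assumption made for illustration; the paper's exact evaluation code is not reproduced here) and contrasts that metric with the growth rate implied by the ensemble average.

```python
import numpy as np

def time_average_growth_rate(growth_factors):
    """Per-step log growth rate (1/T) * ln(W_T / W_0) of one long trajectory.

    `growth_factors` is assumed to hold per-step wealth multipliers of a
    multiplicative environment (an assumption for this sketch).
    """
    return np.log(np.asarray(growth_factors, dtype=np.float64)).mean()

def ensemble_average_growth_rate(growth_factors):
    """Growth rate implied by the ensemble-average factor, ln E[factor]."""
    return np.log(np.asarray(growth_factors, dtype=np.float64).mean())

# Example: the 1.5/0.6 coin-flip game from the Overview section.
factors = np.random.default_rng(1).choice([1.5, 0.6], size=100_000)
print(time_average_growth_rate(factors))       # ~ ln(0.95) < 0
print(ensemble_average_growth_rate(factors))   # ~ ln(1.05) > 0
```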
Results & Findings
| Environment | Baseline (DQN/PPO) | Temporal‑augmented | Relative Gain |
|---|---|---|---|
| Multiplicative wealth (log‑normal returns) | 0.62 × optimal growth | 0.94 × optimal growth | +52 % |
| Stochastic gambling (Kelly benchmark) | 0.71 × optimal growth | 0.96 × optimal growth | +35 % |
| Drifting gridworld | 0.78 × optimal reward | 0.88 × optimal reward | +13 % |
- Policy quality: Agents with the time channel learned policies that explicitly avoided “risk‑seeking” actions that look attractive under expectation but lead to ruin over time (a stylized numerical illustration follows this list).
- Stability: Training curves were smoother; variance across random seeds dropped by ~40 %, indicating that the temporal signal helps the optimizer converge to a more robust optimum.
- Generalization: The same augmentation worked for both value‑based (DQN) and policy‑gradient (PPO, A2C) methods, confirming the model‑agnostic claim.
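The stylized example below illustrates the risk-seeking failure mode noted in the first bullet: for a simple repeated bet (hypothetical numbers, not the paper's benchmark), maximizing expected profit pushes the stake to its limit, while maximizing the time-average log growth recovers the much smaller Kelly fraction.

```python
import numpy as np

# Stylized repeated even-money bet (illustrative numbers, not the paper's
# benchmark): win probability p = 0.6, net odds b = 1. Expected one-step
# profit grows linearly with the staked fraction f, so expectation
# maximization pushes f to its upper limit, while the time-average (log)
# growth rate peaks at the Kelly fraction f* = p - (1 - p)/b = 0.2.
p, b = 0.6, 1.0
f = np.linspace(0.0, 0.99, 100)                  # staked fraction of bankroll

expected_profit = f * (p * b - (1 - p))          # linear -> max at largest f
log_growth = p * np.log1p(f * b) + (1 - p) * np.log1p(-f)

print(f"f maximizing expected profit: {f[np.argmax(expected_profit)]:.2f}")  # 0.99
print(f"f maximizing log growth:      {f[np.argmax(log_growth)]:.2f}")       # 0.20
```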
Practical Implications
- Finance & Trading Bots – Strategies that must maximize geometric returns (e.g., portfolio growth, Kelly betting) can be trained with off‑the‑shelf deep RL libraries simply by feeding the elapsed trade count or a calendar embedding.
- Robotics in Degrading Environments – When wear‑and‑tear or battery depletion changes dynamics over time, adding a time feature lets the policy adapt to the actual degradation trajectory rather than an averaged model.
- Long‑Running Services (e.g., Cloud Autoscaling) – Systems that experience non‑stationary load patterns can benefit from temporal context to avoid policies that look optimal on average but cause cascading failures under sustained high load.
- Minimal engineering overhead – The fix is a one‑liner in most RL codebases:
`obs = np.concatenate([obs, time_feature], axis=-1)`. No need to redesign reward functions, implement custom loss terms, or switch to risk‑sensitive RL frameworks.
- Compatibility with existing tooling – Works with OpenAI Gym, RLlib, Stable‑Baselines3, and even custom simulators, making it instantly testable in production prototypes (see the usage sketch after this list).
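As a usage sketch of the tooling claim above, a recent Stable-Baselines3 release (one that accepts Gymnasium environments) trains on the augmented observations without modification. CartPole-v1 is only a stand-in with a Box observation space; the paper's non-ergodic benchmarks are not public Gym IDs, and `TimeAugmentedObs` is the wrapper sketched earlier in this summary.

```python
import gymnasium as gym
from stable_baselines3 import PPO

# Assumes the TimeAugmentedObs wrapper defined in the Key Contributions
# section is available in scope. CartPole-v1 is a placeholder environment;
# substitute your own non-ergodic simulator. max_steps=500 matches CartPole's
# default episode limit and is used only to normalize the time feature.
env = TimeAugmentedObs(gym.make("CartPole-v1"), max_steps=500)

model = PPO("MlpPolicy", env, verbose=0)   # any SB3 algorithm works the same way
model.learn(total_timesteps=50_000)
```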
Limitations & Future Work
- Synthetic focus – Experiments are limited to controlled, synthetic environments; real‑world benchmarks (e.g., stock market simulators, large‑scale robotics) are still pending.
- Time representation choice – The paper uses simple scalar step counts; more complex temporal encodings (Fourier features, learned embeddings) could further improve performance but were not explored.
- Scalability – Adding a time dimension modestly increases input size; for high‑dimensional visual inputs the impact is negligible, but for ultra‑low‑latency edge devices the extra computation might matter.
- Theoretical bounds – While the authors provide intuition, a formal convergence proof for arbitrary deep function approximators under non‑ergodic dynamics remains open.
Future directions suggested by the authors:
- Extending the approach to multi‑agent non‑ergodic settings.
- Integrating with risk‑sensitive objectives (e.g., CVaR).
- Automating the discovery of the most informative temporal features via meta‑learning.
Authors
- Bert Verbruggen
- Arne Vanhoyweghen
- Vincent Ginis
Paper Information
- arXiv ID: 2601.08726v1
- Categories: cs.LG
- Published: January 13, 2026