[Paper] Almost Sure Convergence of Differential Temporal Difference Learning for Average Reward Markov Decision Processes
Source: arXiv - 2602.16629v1
Overview
This paper tackles a long‑standing gap between theory and practice in average‑reward reinforcement learning. While Differential Temporal‑Difference (TD) methods are the go‑to algorithms for learning long‑run performance, their existing convergence proofs rely on a “local clock”—learning‑rate schedules that depend on how often each state is visited. Practitioners never use such schedules, especially in large‑scale or function‑approximation settings. The authors close this gap by proving almost‑sure convergence of on‑policy and, under mild conditions, off‑policy differential TD using the standard, globally diminishing step‑sizes that developers already employ.
Key Contributions
- Unified convergence proof for on‑policy n‑step differential TD (any horizon n) with ordinary diminishing step‑sizes—no per‑state visit counters needed.
- Three sufficient off‑policy conditions (bounded importance‑sampling ratios, ergodicity, and a trace‑norm bound) that guarantee convergence of off‑policy n‑step differential TD without a local clock.
- Extension beyond tabular MDPs: the analysis holds for any finite‑state Markov decision process, paving the way for future work with function approximation.
- Bridging the theory‑practice divide: the results align the mathematical guarantees with the learning‑rate schedules developers already use in practice (e.g., the global step‑size schedulers shipped with PyTorch and TensorFlow).
Methodology
- Problem Setup – The authors consider a finite‑state, ergodic Markov Decision Process (MDP) with an average‑reward objective. Under the evaluated policy $\pi$, the differential value function $h(s)$ satisfies the Poisson equation $r_\pi(s) - \rho + \sum_{s'} P_\pi(s' \mid s)\, h(s') - h(s) = 0$, where $\rho$ is the average reward and $r_\pi$, $P_\pi$ denote the expected one‑step reward and the transition kernel under $\pi$.
- Algorithmic Framework – They study the n‑step differential TD update
  $$\Delta_t = \alpha_t \Bigl( \sum_{k=0}^{n-1} \bigl( R_{t+k+1} - \rho_t \bigr) + h_t(s_{t+n}) - h_t(s_t) \Bigr),$$
  where the inner sum is the n‑step cumulative reward with each reward centered by $\rho_t$, $\alpha_t$ is a global step size, and $\rho_t$ is an online estimate of the average reward. Because the objective is undiscounted, no discount factor appears; centering by the average‑reward estimate plays its role.
- Stochastic Approximation Lens – The update is cast as a stochastic approximation (SA) recursion. By constructing a suitable Lyapunov function and leveraging the ODE method, they show that the SA iterates track the solution of the underlying Poisson equation.
- Key Technical Tricks –
  - Elimination of the local clock: they prove that a globally diminishing step size already suffices because the underlying Markov chain is ergodic, so every state is visited at a well‑defined long‑run frequency.
  - Bounding the noise term: they use martingale convergence theorems to handle the stochasticity introduced by sampling and, in the off‑policy case, by importance‑sampling ratios.
  - Off‑policy conditions: they derive three verifiable criteria that keep the importance‑sampling weights from exploding, ensuring the SA noise remains bounded.
- Almost‑Sure Convergence – By imposing the Robbins‑Monro conditions on $\alpha_t$ (i.e., $\sum_t \alpha_t = \infty$ and $\sum_t \alpha_t^2 < \infty$), they prove that the pair $(h_t, \rho_t)$ converges to the true differential value function and average reward with probability 1.
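As a sanity check on the convergence target, the Poisson equation from the problem setup can be solved directly on a toy chain. The transition matrix and rewards below are illustrative numbers, not from the paper:

```python
# Numerically solve the average-reward Poisson equation on a toy
# 3-state Markov chain under a fixed policy (hypothetical numbers).
import numpy as np

# Transition kernel P_pi and expected one-step rewards r_pi.
P = np.array([[0.5, 0.5, 0.0],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])
r = np.array([1.0, 0.0, 2.0])

# Stationary distribution d solves d P = d with sum(d) = 1.
evals, evecs = np.linalg.eig(P.T)
d = np.real(evecs[:, np.argmax(np.real(evals))])
d /= d.sum()

rho = d @ r  # true average reward

# Solve (I - P) h = r - rho; the system is rank-deficient, so pin h[0] = 0.
A = np.eye(3) - P
A[0] = [1.0, 0.0, 0.0]   # replace one redundant row with the pinning constraint
b = r - rho
b[0] = 0.0
h = np.linalg.solve(A, b)

# Verify the Poisson equation r - rho + P h - h = 0 holds in every state.
residual = r - rho + P @ h - h
print(rho, np.max(np.abs(residual)))
```

Pinning one entry of $h$ is needed because differential values are only determined up to an additive constant; any such convention works.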
Results & Findings
- On‑policy: For any finite horizon n (including the classic 1‑step case), the algorithm converges almost surely to the correct differential value function and average reward, using only a globally diminishing learning rate.
- Off‑policy: Under any of the three sufficient conditions (bounded importance‑sampling ratios, uniform ergodicity of the behavior policy, or a trace‑norm bound), the same convergence guarantee holds.
- No need for state‑wise step‑size schedules: The authors illustrate empirically that standard schedules (e.g., $\alpha_t = c / t^{\beta}$ with $\beta \in (0.5, 1]$) work as well as the local‑clock schedules demanded by earlier analyses.
These findings confirm that the differential TD family is theoretically sound under the exact learning‑rate schemes already used in practice.
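A minimal tabular sketch of this recipe — the n‑step differential update driven by a single global $\alpha_t = c/t^{\beta}$ schedule — on a hypothetical two‑state MDP. The environment, constants, and the coupling of the $\rho$ step to $\alpha_t$ are illustrative assumptions, not the paper's experiments:

```python
import numpy as np

class TwoStateMDP:
    """Toy 2-state, 2-action MDP (illustrative, not from the paper)."""
    n_states, n_actions = 2, 2
    def __init__(self, seed=0):
        self.rng = np.random.default_rng(seed)
        self.s = 0
    def reset(self):
        self.s = 0
        return self.s
    def step(self, a):
        # action 0 tends to stay, action 1 tends to switch states
        switch = self.rng.random() < (0.9 if a == 1 else 0.2)
        self.s = 1 - self.s if switch else self.s
        return self.s, (1.0 if self.s == 1 else 0.0)  # reward for reaching state 1

def n_step_differential_td(env, n=4, steps=100_000, c=0.5, beta=0.8, eta=0.1, seed=1):
    rng = np.random.default_rng(seed)
    h = np.zeros(env.n_states)   # differential value estimates
    rho = 0.0                    # running average-reward estimate
    s = env.reset()
    window = []                  # last n (state, reward) pairs
    for t in range(1, steps + 1):
        a = int(rng.integers(env.n_actions))   # on-policy: uniform random
        s_next, r = env.step(a)
        window.append((s, r))
        if len(window) == n:
            s0 = window[0][0]
            # n-step differential TD error: each reward centered by rho
            delta = sum(rk - rho for _, rk in window) + h[s_next] - h[s0]
            alpha = c / t ** beta              # single global schedule, no local clock
            h[s0] += alpha * delta
            rho += eta * alpha * delta
            window.pop(0)
        s = s_next
    return h, rho

h, rho = n_step_differential_td(TwoStateMDP(seed=0))
print(h, rho)  # rho should settle near the true average reward (0.5 for this chain)
```

Note there is no per‑state visit counter anywhere: every state's update uses the same global $\alpha_t$, which is exactly the regime the paper's proof covers.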
Practical Implications
- Simplified implementations – Library developers can drop the complex bookkeeping of per‑state visit counters and rely on a single global scheduler, reducing code complexity and memory overhead.
- Robust off‑policy learning – The three off‑policy conditions are easy to check (e.g., clipping importance‑sampling ratios, ensuring the behavior policy is sufficiently exploratory). This makes differential TD viable for batch‑learning from logged data or for actor‑critic architectures where the critic runs off‑policy.
- Scalability to large‑scale RL – Since the convergence proof no longer hinges on tabular representations, the results give confidence that extending differential TD to linear or neural function approximation will inherit similar guarantees (subject to additional technical work).
- Better average‑reward benchmarks – Many real‑world systems (network routing, inventory management, continual control) care about long‑run average performance. Practitioners can now adopt differential TD with standard hyper‑parameter tuning pipelines, knowing the algorithm is theoretically sound.
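The first off‑policy condition — bounded importance‑sampling ratios — can be enforced with a one‑line clip. The policies and threshold below are made‑up numbers for illustration; note that clipping introduces bias, which is the usual price for boundedness:

```python
import numpy as np

def clipped_is_ratio(target_probs, behavior_probs, action, clip=10.0):
    """rho = pi(a|s) / b(a|s), truncated so it can never explode."""
    ratio = target_probs[action] / behavior_probs[action]
    return min(ratio, clip)

# A nearly deterministic target policy evaluated under a uniform
# behavior policy: the raw ratio for action 0 is 0.98 / 0.25 = 3.92.
pi = np.array([0.98, 0.01, 0.005, 0.005])
b = np.array([0.25, 0.25, 0.25, 0.25])
print(clipped_is_ratio(pi, b, action=0))            # under the default clip
print(clipped_is_ratio(pi, b, action=0, clip=2.0))  # truncated to the threshold
```

Keeping the behavior policy sufficiently exploratory (so that $b(a \mid s)$ never gets too small wherever $\pi(a \mid s) > 0$) bounds the ratios without any clipping at all.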
Limitations & Future Work
- Finite‑state assumption – The current proofs require a finite state space; extending to continuous or high‑dimensional spaces with function approximation remains open.
- Specific off‑policy conditions – While the three sufficient conditions are practical, they may be restrictive for highly skewed importance‑sampling ratios; tighter analyses could relax these constraints.
- Empirical validation – The paper provides synthetic experiments; large‑scale benchmarks (e.g., Atari, MuJoCo) with neural critics are needed to confirm the theory in deep RL settings.
- Extension to eligibility traces (λ < 1) – The analysis focuses on pure n‑step updates; integrating eligibility traces and proving convergence for the full TD(λ) family is a natural next step.
Bottom line: This work removes a major theoretical obstacle that has kept differential TD from being a “plug‑and‑play” tool in modern RL pipelines. Developers can now implement average‑reward TD algorithms with the same simple learning‑rate schedules they already use, confident that the method converges almost surely under realistic conditions.
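For intuition, the Robbins‑Monro requirements on the standard schedule $\alpha_t = c/t^{\beta}$ can be checked numerically; the constants here are arbitrary:

```python
# Partial sums of alpha_t = c / t**beta with beta in (0.5, 1]:
# sum(alpha_t) must diverge while sum(alpha_t**2) stays finite.
import numpy as np

def partial_sums(c=0.5, beta=0.8, T=10**6):
    t = np.arange(1, T + 1, dtype=np.float64)
    alpha = c / t ** beta
    return alpha.sum(), (alpha ** 2).sum()

s1_small, s2_small = partial_sums(T=10**4)
s1_large, s2_large = partial_sums(T=10**6)

print(s1_small, s1_large)  # keeps growing (roughly like T**(1 - beta))
print(s2_small, s2_large)  # levels off toward a finite limit
```

Any $\beta \in (0.5, 1]$ behaves this way, which is why the whole family of standard schedules falls under the paper's guarantee.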
Authors
- Ethan Blaser
- Jiuqi Wang
- Shangtong Zhang
Paper Information
- arXiv ID: 2602.16629v1
- Categories: cs.LG, cs.AI
- Published: February 18, 2026