[Paper] Almost Sure Convergence of Differential Temporal Difference Learning for Average Reward Markov Decision Processes
Source: arXiv - 2602.16629v1
Overview
This paper tackles a long‑standing gap between theory and practice in average‑reward reinforcement learning. While Differential Temporal‑Difference (TD) methods are the go‑to algorithms for learning long‑run performance, their existing convergence proofs rely on a “local clock”—learning‑rate schedules that depend on how often each state is visited. Practitioners never use such schedules, especially in large‑scale or function‑approximation settings. The authors close this gap by proving almost‑sure convergence of on‑policy and, under mild conditions, off‑policy differential TD using the standard, globally diminishing step‑sizes that developers already employ.
Key Contributions
- Unified convergence proof for on‑policy n‑step differential TD (any horizon n) with ordinary diminishing step‑sizes—no per‑state visit counters needed.
- Three sufficient off‑policy conditions (bounded importance‑sampling ratios, ergodicity, and a trace‑norm bound) that guarantee convergence of off‑policy n‑step differential TD without a local clock.
- Extension beyond tabular MDPs: the analysis holds for any finite‑state Markov decision process, paving the way for future work with function approximation.
- Bridging the theory‑practice divide: the results align the mathematical guarantees with the learning‑rate schedules developers already use in practice (e.g., the global step‑size schedulers shipped with PyTorch and TensorFlow).
Methodology
- Problem Setup – The authors consider a finite‑state, ergodic Markov Decision Process (MDP) with an average‑reward objective. Under the evaluated policy $\pi$, the differential value function $h(s)$ satisfies the Poisson equation $r_\pi(s) - \rho + \sum_{s'} P_\pi(s' \mid s)\, h(s') - h(s) = 0$, where $\rho$ is the average reward and $r_\pi$, $P_\pi$ denote the expected one‑step reward and the transition kernel under $\pi$.
- Algorithmic Framework – They study the n‑step differential TD update
  $$\Delta_t = \alpha_t \Bigl( \sum_{k=0}^{n-1} \bigl( R_{t+k+1} - \rho_t \bigr) + h_t(s_{t+n}) - h_t(s_t) \Bigr),$$
  where the inner sum is the n‑step cumulative reward with each reward centered by $\rho_t$, $\alpha_t$ is a global step size, and $\rho_t$ is an online estimate of the average reward. Because the objective is undiscounted, no discount factor appears; centering by the average‑reward estimate plays its role.
- Stochastic Approximation Lens – The update is cast as a stochastic approximation (SA) recursion. By constructing a suitable Lyapunov function and leveraging the ODE method, they show that the SA iterates track the solution of the underlying Poisson equation.
- Key Technical Tricks –
  - Elimination of the local clock: they prove that a globally diminishing step size already suffices because the underlying Markov chain is ergodic, so every state is visited at a well‑defined long‑run frequency.
  - Bounding the noise term: they use martingale convergence theorems to handle the stochasticity introduced by sampling and, in the off‑policy case, by importance‑sampling ratios.
  - Off‑policy conditions: they derive three verifiable criteria that keep the importance‑sampling weights from exploding, ensuring the SA noise remains bounded.
- Almost‑Sure Convergence – By imposing the Robbins‑Monro conditions on $\alpha_t$ (i.e., $\sum_t \alpha_t = \infty$ and $\sum_t \alpha_t^2 < \infty$), they prove that the pair $(h_t, \rho_t)$ converges to the true differential value function and average reward with probability 1.
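As a sanity check on the convergence target, the Poisson equation from the problem setup can be solved directly on a toy chain. The transition matrix and rewards below are illustrative numbers, not from the paper:

```python
# Numerically solve the average-reward Poisson equation on a toy
# 3-state Markov chain under a fixed policy (hypothetical numbers).
import numpy as np

# Transition kernel P_pi and expected one-step rewards r_pi.
P = np.array([[0.5, 0.5, 0.0],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])
r = np.array([1.0, 0.0, 2.0])

# Stationary distribution d solves d P = d with sum(d) = 1.
evals, evecs = np.linalg.eig(P.T)
d = np.real(evecs[:, np.argmax(np.real(evals))])
d /= d.sum()

rho = d @ r  # true average reward

# Solve (I - P) h = r - rho; the system is rank-deficient, so pin h[0] = 0.
A = np.eye(3) - P
A[0] = [1.0, 0.0, 0.0]   # replace one redundant row with the pinning constraint
b = r - rho
b[0] = 0.0
h = np.linalg.solve(A, b)

# Verify the Poisson equation r - rho + P h - h = 0 holds in every state.
residual = r - rho + P @ h - h
print(rho, np.max(np.abs(residual)))
```

Pinning one entry of $h$ is needed because differential values are only determined up to an additive constant; any such convention works.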
Results & Findings
- On‑policy: For any finite horizon n (including the classic 1‑step case), the algorithm converges almost surely to the correct differential value function and average reward, using only a globally diminishing learning rate.
- Off‑policy: Under any of the three sufficient conditions (bounded importance‑sampling ratios, uniform ergodicity of the behavior policy, or a trace‑norm bound), the same convergence guarantee holds.
- No need for state‑wise step‑size schedules: The authors illustrate empirically that standard schedules (e.g., $\alpha_t = c / t^{\beta}$ with $\beta \in (0.5, 1]$) work as well as the local‑clock schedules demanded by earlier analyses.
These findings confirm that the differential TD family is theoretically sound under the exact learning‑rate schemes already used in practice.
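A minimal tabular sketch of this recipe — the n‑step differential update driven by a single global $\alpha_t = c/t^{\beta}$ schedule — on a hypothetical two‑state MDP. The environment, constants, and the coupling of the $\rho$ step to $\alpha_t$ are illustrative assumptions, not the paper's experiments:

```python
import numpy as np

class TwoStateMDP:
    """Toy 2-state, 2-action MDP (illustrative, not from the paper)."""
    n_states, n_actions = 2, 2
    def __init__(self, seed=0):
        self.rng = np.random.default_rng(seed)
        self.s = 0
    def reset(self):
        self.s = 0
        return self.s
    def step(self, a):
        # action 0 tends to stay, action 1 tends to switch states
        switch = self.rng.random() < (0.9 if a == 1 else 0.2)
        self.s = 1 - self.s if switch else self.s
        return self.s, (1.0 if self.s == 1 else 0.0)  # reward for reaching state 1

def n_step_differential_td(env, n=4, steps=100_000, c=0.5, beta=0.8, eta=0.1, seed=1):
    rng = np.random.default_rng(seed)
    h = np.zeros(env.n_states)   # differential value estimates
    rho = 0.0                    # running average-reward estimate
    s = env.reset()
    window = []                  # last n (state, reward) pairs
    for t in range(1, steps + 1):
        a = int(rng.integers(env.n_actions))   # on-policy: uniform random
        s_next, r = env.step(a)
        window.append((s, r))
        if len(window) == n:
            s0 = window[0][0]
            # n-step differential TD error: each reward centered by rho
            delta = sum(rk - rho for _, rk in window) + h[s_next] - h[s0]
            alpha = c / t ** beta              # single global schedule, no local clock
            h[s0] += alpha * delta
            rho += eta * alpha * delta
            window.pop(0)
        s = s_next
    return h, rho

h, rho = n_step_differential_td(TwoStateMDP(seed=0))
print(h, rho)  # rho should settle near the true average reward (0.5 for this chain)
```

Note there is no per‑state visit counter anywhere: every state's update uses the same global $\alpha_t$, which is exactly the regime the paper's proof covers.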
Practical Implications
- Simplified implementations – Library developers can drop the complex bookkeeping of per‑state visit counters and rely on a single global scheduler, reducing code complexity and memory overhead.
- Robust off‑policy learning – The three off‑policy conditions are easy to check (e.g., clipping importance‑sampling ratios, ensuring the behavior policy is sufficiently exploratory). This makes differential TD viable for batch‑learning from logged data or for actor‑critic architectures where the critic runs off‑policy.
- Scalability to large‑scale RL – Since the convergence proof no longer hinges on tabular representations, the results give confidence that extending differential TD to linear or neural function approximation will inherit similar guarantees (subject to additional technical work).
- Better average‑reward benchmarks – Many real‑world systems (network routing, inventory management, continual control) care about long‑run average performance. Practitioners can now adopt differential TD with standard hyper‑parameter tuning pipelines, knowing the algorithm is theoretically sound.
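The first off‑policy condition — bounded importance‑sampling ratios — can be enforced with a one‑line clip. The policies and threshold below are made‑up numbers for illustration; note that clipping introduces bias, which is the usual price for boundedness:

```python
import numpy as np

def clipped_is_ratio(target_probs, behavior_probs, action, clip=10.0):
    """rho = pi(a|s) / b(a|s), truncated so it can never explode."""
    ratio = target_probs[action] / behavior_probs[action]
    return min(ratio, clip)

# A nearly deterministic target policy evaluated under a uniform
# behavior policy: the raw ratio for action 0 is 0.98 / 0.25 = 3.92.
pi = np.array([0.98, 0.01, 0.005, 0.005])
b = np.array([0.25, 0.25, 0.25, 0.25])
print(clipped_is_ratio(pi, b, action=0))            # under the default clip
print(clipped_is_ratio(pi, b, action=0, clip=2.0))  # truncated to the threshold
```

Keeping the behavior policy sufficiently exploratory (so that $b(a \mid s)$ never gets too small wherever $\pi(a \mid s) > 0$) bounds the ratios without any clipping at all.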
Limitations & Future Work
- Finite‑state assumption – The current proofs require a finite state space; extending to continuous or high‑dimensional spaces with function approximation remains open.
- Specific off‑policy conditions – While the three sufficient conditions are practical, they may be restrictive for highly skewed importance‑sampling ratios; tighter analyses could relax these constraints.
- Empirical validation – The paper provides synthetic experiments; large‑scale benchmarks (e.g., Atari, MuJoCo) with neural critics are needed to confirm the theory in deep RL settings.
- Extension to eligibility traces (λ < 1) – The analysis focuses on pure n‑step updates; integrating eligibility traces and proving convergence for the full TD(λ) family is a natural next step.
Bottom line: This work removes a major theoretical obstacle that has kept differential TD from being a “plug‑and‑play” tool in modern RL pipelines. Developers can now implement average‑reward TD algorithms with the same simple learning‑rate schedules they already use, confident that the method converges almost surely under realistic conditions.
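For intuition, the Robbins‑Monro requirements on the standard schedule $\alpha_t = c/t^{\beta}$ can be checked numerically; the constants here are arbitrary:

```python
# Partial sums of alpha_t = c / t**beta with beta in (0.5, 1]:
# sum(alpha_t) must diverge while sum(alpha_t**2) stays finite.
import numpy as np

def partial_sums(c=0.5, beta=0.8, T=10**6):
    t = np.arange(1, T + 1, dtype=np.float64)
    alpha = c / t ** beta
    return alpha.sum(), (alpha ** 2).sum()

s1_small, s2_small = partial_sums(T=10**4)
s1_large, s2_large = partial_sums(T=10**6)

print(s1_small, s1_large)  # keeps growing (roughly like T**(1 - beta))
print(s2_small, s2_large)  # levels off toward a finite limit
```

Any $\beta \in (0.5, 1]$ behaves this way, which is why the whole family of standard schedules falls under the paper's guarantee.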
Authors
- Ethan Blaser
- Jiuqi Wang
- Shangtong Zhang
Paper Information
- arXiv ID: 2602.16629v1
- Categories: cs.LG, cs.AI
- Published: February 18, 2026