[Paper] Reinforcement Learning for Exponential Utility: Algorithms and Convergence in Discounted MDPs
Source: arXiv - 2605.08053v1
Overview
This paper tackles a long‑standing gap in reinforcement learning (RL): how to learn policies that optimise an exponential‑utility (risk‑averse) objective in discounted Markov decision processes (MDPs). While classic RL focuses on maximizing expected cumulative reward, many real‑world systems (finance, robotics, cloud resource allocation) care about risk. The authors develop the first principled, value‑based RL algorithms that provably converge to optimal risk‑averse policies under a fixed risk‑aversion parameter.
Key Contributions
- Two new Bellman‑type operators for exponential‑utility Q‑functions, shown to be contractions under (i) the standard sup‑norm and (ii) a sup‑log/Thompson metric.
- Proof that the greedy stationary policy derived from the fixed point of either operator is optimal among all stationary policies for the exponential‑utility criterion.
- A two‑timescale model‑free Q‑learning algorithm (analogous to classic Q‑learning) with:
- Almost‑sure convergence guarantees.
- Finite‑time convergence rates derived via timescale separation analysis.
- A one‑timescale algorithm based on a sub‑linear power‑law operator, together with a novel convergence proof that leverages local Lipschitzness, monotonicity, homogeneity, and Dini derivatives.
- Scalar finite‑time analysis that highlights why extending rate results to the full vector‑valued case is challenging.
Methodology
-
Problem Setup – The authors consider a discounted MDP ((\mathcal{S},\mathcal{A},P,R,\gamma)) and a risk‑aversion parameter (\eta>0). The performance of a policy (\pi) is measured by the exponential utility
$$ J^\pi(s)=\mathbb{E}!\left[\exp!\Big(\eta\sum_{t=0}^\infty \gamma^t R(s_t,a_t)\Big),\bigg|,s_0=s\right]. $$
Maximising (J^\pi) is equivalent to minimising the risk‑sensitive cost (-\frac{1}{\eta}\log J^\pi).
-
Bellman‑type Equations – Building on Porteus (1975), they derive a non‑linear Bellman equation for the exponential‑utility Q‑function (Q(s,a)). Two operator forms are introduced:
- (T_{\infty}): a sup‑norm contraction.
- (T_{\log}): a contraction in the sup‑log (Thompson) metric, which better captures the multiplicative nature of the exponential utility.
-
Algorithm Design –
- Two‑timescale Q‑learning updates the Q‑values on a fast timescale while simultaneously estimating a normalising scalar on a slower timescale, ensuring stability despite the non‑linearity.
- One‑timescale power‑law update replaces the contraction with a sub‑linear mapping, avoiding the need for a second learning rate but requiring a more delicate convergence argument.
-
Convergence Analysis – For the two‑timescale method, standard stochastic approximation tools (ODE method, Robbins‑Monro) give almost‑sure convergence and explicit finite‑time bounds. For the one‑timescale method, the authors prove local Lipschitzness and monotonicity of the operator, then use Dini derivatives to control the error trajectory, establishing convergence without a global contraction.
Results & Findings
-
Both operators admit unique fixed points; the associated greedy policies are provably optimal for the exponential‑utility objective among stationary policies.
-
The two‑timescale Q‑learning algorithm converges almost surely to these fixed points, with a finite‑time error bound that scales as
$$ O!\bigg(\frac{\log T}{\sqrt{T}}\bigg) $$
(up to constants depending on (\eta) and (\gamma)).
-
The one‑timescale power‑law algorithm also converges, though the analysis only yields a scalar convergence rate; extending this to the full vector case remains an open technical challenge.
-
Empirical simulations (present in the full paper) illustrate that the algorithms learn risk‑averse behaviours—e.g., avoiding high‑variance reward regions—while still achieving competitive discounted returns.
Practical Implications
| Domain | Why Exponential Utility Matters | How the Paper Helps |
|---|---|---|
| Finance & Trading | Portfolio managers care about downside risk, not just expected profit. | Provides a model‑free RL method to learn policies that directly optimise a risk‑averse utility, enabling automated, data‑driven hedging strategies. |
| Robotics & Autonomous Systems | Safety‑critical tasks require avoiding rare catastrophic failures. | Allows robots to learn policies that penalise high‑variance outcomes (e.g., slipping, collisions) without hand‑crafted safety constraints. |
| Cloud & Edge Resource Allocation | Service‑level agreements often involve tail‑latency guarantees. | RL agents can be trained to minimise the exponential cost of latency spikes, leading to more reliable provisioning. |
| Operations & Supply Chain | Demand uncertainty can cause costly stockouts or overstock. | The algorithms can learn ordering policies that hedge against demand volatility, improving service levels. |
For developers, the two‑timescale Q‑learning algorithm can be implemented with minor modifications to existing Q‑learning codebases: add a secondary learning rate to track a normalising scalar (often called a “log‑partition” term). The one‑timescale variant offers a simpler implementation (single learning rate) but may require more careful tuning and monitoring of convergence.
Limitations & Future Work
- Fixed risk‑aversion: Theory assumes a constant (\eta). Adapting the methods to state‑dependent or learned risk parameters is an open problem.
- Contraction only in specialised metrics: The sup‑log/Thompson metric is not commonly used in RL libraries, which may complicate practical debugging and hyper‑parameter selection.
- Finite‑time rates for the one‑timescale algorithm are currently only scalar; extending to full vector‑valued Q‑functions remains a technical hurdle.
- Scalability: Experiments are limited to modest‑size MDPs; integrating function approximation (e.g., deep neural nets) and proving convergence in that setting is future work.
- Policy class: Optimality is shown among stationary policies; exploring non‑stationary or hierarchical policies could yield further performance gains.
Overall, this work lays the theoretical groundwork for risk‑aware value‑based RL, opening the door for practical, risk‑sensitive agents in a variety of high‑stakes applications.
Authors
- Gugan Thoppe
- L. A. Prashanth
- Ankur Naskar
- Sanjay Bhat
Paper Information
- arXiv ID: 2605.08053v1
- Categories: cs.LG
- Published: May 8, 2026
- PDF: Download PDF