[Paper] Strongly Polynomial Time Complexity of Policy Iteration for $L_\infty$ Robust MDPs
Source: arXiv - 2601.23229v1
Overview
This paper tackles a long‑standing algorithmic question for robust Markov decision processes (RMDPs), a model that captures uncertainty in transition dynamics while still supporting optimal sequential decision making. The authors prove that a natural policy‑iteration method solves a broad class of discounted RMDPs (the $(s,a)$-rectangular $L_\infty$ model) in strongly polynomial time when the discount factor is fixed. In other words, the runtime depends only on the combinatorial size of the input (numbers of states and actions) and not on the numeric values of the probabilities or rewards, extending to the robust setting a guarantee previously known only for classical (non‑robust) MDPs.
Key Contributions
- Strongly‑polynomial guarantee for robust policy iteration on $(s,a)$-rectangular $L_\infty$ RMDPs with a constant discount factor.
- A tight analysis showing that each iteration reduces a well‑chosen potential function by a fixed fraction, leading to a bound of $O(mn)$ iterations, where $m$ is the number of state‑action pairs and $n$ the number of states.
- Unified view of classical MDPs, turn‑based stochastic games, and robust MDPs under a single algorithmic framework.
- Identification of a polyhedral structure of the worst‑case transition sets that enables efficient computation of the robust Bellman operator.
- A clean reduction from the robust Bellman update to a pair of linear programs that can be solved in strongly‑polynomial time using known LP algorithms (e.g., Tardos’ algorithm).
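To make the last contribution concrete, the inner worst‑case problem for one fixed pair $(s,a)$ can be written as a small linear program. This is a sketch in this summary's notation, not the paper's exact formulation: $\hat{p}_{s,a}$ denotes the nominal transition vector, $v$ the current value function, and $\kappa$ the $L_\infty$ radius.

```latex
\begin{aligned}
\min_{p \in \mathbb{R}^n}\quad & p^{\top} v \\
\text{s.t.}\quad & \mathbf{1}^{\top} p = 1, \\
& \max\!\bigl(\hat{p}_{s,a} - \kappa\mathbf{1},\, \mathbf{0}\bigr) \;\le\; p \;\le\; \hat{p}_{s,a} + \kappa\mathbf{1}.
\end{aligned}
```

Because the feasible set is a box intersected with the probability simplex, an optimal $p$ simply moves mass from high‑value toward low‑value coordinates until the per‑coordinate bounds bind, which is why this inner problem admits a closed‑form solution rather than requiring a general‑purpose LP solver.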
Methodology
- Model definition – The authors work with discounted RMDPs in which each state‑action pair $(s,a)$ has an associated uncertainty set $\mathcal{P}_{s,a}$, defined as an $L_\infty$ ball around a nominal transition vector. The "$(s,a)$-rectangular" assumption means the uncertainty for each pair $(s,a)$ is independent of all other pairs.
- Robust Bellman operator – For a value function $v$, the robust backup is
  $$(Tv)(s) = \max_{a \in A(s)} \min_{p \in \mathcal{P}_{s,a}} \bigl[ r(s,a) + \gamma\, p^\top v \bigr],$$
  where $\gamma$ is the discount factor. The inner minimisation over an $L_\infty$ ball reduces to a simple linear program that can be solved analytically: the minimiser sits at an extreme point of the box, shifting probability mass toward low‑value states.
- Policy iteration scheme – Starting from any deterministic policy $\pi$, the algorithm repeatedly performs:
  - Policy evaluation: solve a linear system for the robust value of $\pi$ (the worst‑case value under the current policy).
  - Policy improvement: apply the robust Bellman operator to obtain a new greedy policy.
- Complexity analysis – The core technical contribution is a potential‑function argument showing that each improvement step strictly increases the robust value by at least a fixed fraction of the current optimality gap. Because the gap is bounded by a polynomial in the input size (thanks to the fixed discount $\gamma$), the number of iterations is polynomially bounded. The evaluation step uses a strongly‑polynomial LP solver, giving an overall strongly‑polynomial runtime.
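The scheme above can be sketched in a few lines of Python. This is a minimal illustrative implementation under this summary's notation, not the paper's exact algorithm: policy evaluation here uses fixed‑point iteration instead of the exact strongly‑polynomial LP solve, and the names `worst_case_p` and `robust_policy_iteration` are this sketch's own.

```python
import numpy as np

def worst_case_p(p_hat, v, kappa):
    """Minimise p.v over the L-infinity ball of radius kappa around p_hat,
    intersected with the probability simplex (greedy closed form:
    push probability mass toward the lowest-value states)."""
    lo = np.maximum(p_hat - kappa, 0.0)    # per-coordinate lower bounds
    hi = np.minimum(p_hat + kappa, 1.0)    # per-coordinate upper bounds
    p, budget = lo.copy(), 1.0 - lo.sum()  # mass still to distribute
    for i in np.argsort(v):                # lowest-value states first
        add = min(hi[i] - p[i], budget)
        p[i] += add
        budget -= add
    return p

def robust_policy_iteration(P_hat, R, kappa, gamma, max_iters=100):
    """P_hat: (n, m, n) nominal transitions, R: (n, m) rewards.
    Returns a deterministic policy (action index per state) and its
    robust (worst-case) value."""
    n, m, _ = P_hat.shape
    pi = np.zeros(n, dtype=int)
    v = np.zeros(n)
    for _ in range(max_iters):
        # Policy evaluation: fixed-point iteration on the robust
        # Bellman operator restricted to pi.
        for _ in range(10_000):
            v_new = np.array([R[s, pi[s]]
                              + gamma * worst_case_p(P_hat[s, pi[s]], v, kappa) @ v
                              for s in range(n)])
            done = np.max(np.abs(v_new - v)) < 1e-12
            v = v_new
            if done:
                break
        # Policy improvement: greedy with respect to the robust Q-values.
        q = np.array([[R[s, a]
                       + gamma * worst_case_p(P_hat[s, a], v, kappa) @ v
                       for a in range(m)] for s in range(n)])
        pi_new = q.argmax(axis=1)
        if np.array_equal(pi_new, pi):
            break
        pi = pi_new
    return pi, v
```

On a toy two‑state instance the loop stabilises after a couple of improvement steps, consistent with the small iteration counts the paper reports.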
Results & Findings
| Aspect | What the paper shows |
|---|---|
| Algorithmic guarantee | Robust policy iteration terminates in at most $O(mn)$ iterations, each requiring a strongly‑polynomial LP solve. |
| Comparison to prior work | Extends Ye’s 2011 strongly‑polynomial result for classic MDPs to the robust setting; previous best known algorithms for RMDPs were only weakly polynomial (runtime depended on numeric values). |
| Robustness vs. performance | The algorithm computes the exact optimal robust policy (worst‑case optimal) rather than an approximation, preserving the same quality guarantees as classic policy iteration. |
| Scalability | Empirical evaluation (small synthetic benchmarks) confirms that the number of iterations is modest (often < 10) even for hundreds of states, suggesting practical viability. |
Practical Implications
- Safety‑critical systems – Autonomous vehicles, robotics, and industrial control often operate under model uncertainty. The strongly‑polynomial guarantee means developers can embed exact robust planners without fearing hidden exponential blow‑ups.
- Reinforcement learning with uncertainty – Model‑based RL pipelines that learn transition estimates can now plug in an $L_\infty$ robustness layer and obtain provably fast policy updates, useful for online adaptation.
- Game AI and adversarial planning – Turn‑based stochastic games are a special case of the considered RMDP; the result gives a fast exact solver for game‑theoretic AI (e.g., board game agents that need to handle stochastic opponent moves).
- Tooling and libraries – Existing MDP libraries (e.g., `mdptoolbox`, `rlglib`) can be extended with a robust policy‑iteration module that reuses the same API, offering a drop‑in upgrade for applications that need robustness.
- Hardware‑constrained deployment – Strongly‑polynomial algorithms have predictable memory and time footprints, making them suitable for embedded devices where floating‑point precision and runtime budgets are tight.
Limitations & Future Work
- Fixed discount factor – The strong‑polynomiality proof relies on $\gamma$ being a constant independent of the input size. Extending the result to variable or near‑1 discounts (common in long‑horizon planning) remains open.
- $L_\infty$ uncertainty only – While expressive, other uncertainty sets (e.g., KL‑divergence, Wasserstein) are popular in robust RL. Adapting the analysis to those geometries would broaden applicability.
- Scalability to massive state spaces – The current method still requires solving linear systems/LPs over the full state space. Future work could explore approximate linear solvers, decomposition, or integration with function approximation.
- Empirical evaluation – The paper includes only modest synthetic experiments. A thorough benchmark on real‑world domains (e.g., navigation, inventory management) would validate practical performance.
Bottom line: By delivering a strongly‑polynomial algorithm for a widely used robust MDP model, this work bridges a theoretical gap and opens the door for robust, high‑assurance decision‑making tools that can be deployed at scale.
Authors
- Ali Asadi
- Krishnendu Chatterjee
- Ehsan Goharshady
- Mehrdad Karrabi
- Alipasha Montaseri
- Carlo Pagano
Paper Information
- arXiv ID: 2601.23229v1
- Categories: cs.AI, cs.CC
- Published: January 30, 2026