[Paper] Strongly Polynomial Time Complexity of Policy Iteration for $L_\infty$ Robust MDPs
Source: arXiv - 2601.23229v1
Overview
This paper tackles a long‑standing algorithmic question for robust Markov decision processes (RMDPs), a model that captures uncertainty in transition dynamics while still supporting optimal sequential decision making. The authors prove that a natural policy‑iteration method solves a broad class of discounted RMDPs (the $(s,a)$-rectangular $L_\infty$ model) in strongly polynomial time when the discount factor is fixed. In other words, the runtime depends only on the combinatorial size of the input (numbers of states and actions) and not on the numeric values of the probabilities or rewards, extending to the robust setting a guarantee previously known only for classical (non‑robust) MDPs.
Key Contributions
- Strongly‑polynomial guarantee for robust policy iteration on $(s,a)$-rectangular $L_\infty$ RMDPs with a constant discount factor.
- A tight analysis showing that each iteration reduces a well‑chosen potential function by a fixed fraction, leading to a bound of $O(mn)$ iterations, where $m$ is the number of state‑action pairs and $n$ the number of states.
- Unified view of classical MDPs, turn‑based stochastic games, and robust MDPs under a single algorithmic framework.
- Identification of a polyhedral structure of the worst‑case transition sets that enables efficient computation of the robust Bellman operator.
- A clean reduction from the robust Bellman update to a pair of linear programs that can be solved in strongly‑polynomial time using known LP algorithms (e.g., Tardos’ algorithm).
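To make the last contribution concrete, the inner worst‑case problem for one fixed pair $(s,a)$ can be written as a small linear program. This is a sketch in this summary's notation, not the paper's exact formulation: $\hat{p}_{s,a}$ denotes the nominal transition vector, $v$ the current value function, and $\kappa$ the $L_\infty$ radius.

```latex
\begin{aligned}
\min_{p \in \mathbb{R}^n}\quad & p^{\top} v \\
\text{s.t.}\quad & \mathbf{1}^{\top} p = 1, \\
& \max\!\bigl(\hat{p}_{s,a} - \kappa\mathbf{1},\, \mathbf{0}\bigr) \;\le\; p \;\le\; \hat{p}_{s,a} + \kappa\mathbf{1}.
\end{aligned}
```

Because the feasible set is a box intersected with the probability simplex, an optimal $p$ simply moves mass from high‑value toward low‑value coordinates until the per‑coordinate bounds bind, which is why this inner problem admits a closed‑form solution rather than requiring a general‑purpose LP solver.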
Methodology
- Model definition – The authors work with discounted RMDPs in which each state‑action pair $(s,a)$ has an associated uncertainty set $\mathcal{P}_{s,a}$, defined as an $L_\infty$ ball around a nominal transition vector. The "$(s,a)$-rectangular" assumption means the uncertainty for each pair $(s,a)$ is independent of all other pairs.
- Robust Bellman operator – For a value function $v$, the robust backup is
  $$(Tv)(s) = \max_{a \in A(s)} \min_{p \in \mathcal{P}_{s,a}} \bigl[ r(s,a) + \gamma\, p^\top v \bigr],$$
  where $\gamma$ is the discount factor. The inner minimisation over an $L_\infty$ ball reduces to a simple linear program that can be solved analytically: the minimiser sits at an extreme point of the box, shifting probability mass toward low‑value states.
- Policy iteration scheme – Starting from any deterministic policy $\pi$, the algorithm repeatedly performs:
  - Policy evaluation: solve a linear system for the robust value of $\pi$ (the worst‑case value under the current policy).
  - Policy improvement: apply the robust Bellman operator to obtain a new greedy policy.
- Complexity analysis – The core technical contribution is a potential‑function argument showing that each improvement step strictly increases the robust value by at least a fixed fraction of the current optimality gap. Because the gap is bounded by a polynomial in the input size (thanks to the fixed discount $\gamma$), the number of iterations is polynomially bounded. The evaluation step uses a strongly‑polynomial LP solver, giving an overall strongly‑polynomial runtime.
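The scheme above can be sketched in a few lines of Python. This is a minimal illustrative implementation under this summary's notation, not the paper's exact algorithm: policy evaluation here uses fixed‑point iteration instead of the exact strongly‑polynomial LP solve, and the names `worst_case_p` and `robust_policy_iteration` are this sketch's own.

```python
import numpy as np

def worst_case_p(p_hat, v, kappa):
    """Minimise p.v over the L-infinity ball of radius kappa around p_hat,
    intersected with the probability simplex (greedy closed form:
    push probability mass toward the lowest-value states)."""
    lo = np.maximum(p_hat - kappa, 0.0)    # per-coordinate lower bounds
    hi = np.minimum(p_hat + kappa, 1.0)    # per-coordinate upper bounds
    p, budget = lo.copy(), 1.0 - lo.sum()  # mass still to distribute
    for i in np.argsort(v):                # lowest-value states first
        add = min(hi[i] - p[i], budget)
        p[i] += add
        budget -= add
    return p

def robust_policy_iteration(P_hat, R, kappa, gamma, max_iters=100):
    """P_hat: (n, m, n) nominal transitions, R: (n, m) rewards.
    Returns a deterministic policy (action index per state) and its
    robust (worst-case) value."""
    n, m, _ = P_hat.shape
    pi = np.zeros(n, dtype=int)
    v = np.zeros(n)
    for _ in range(max_iters):
        # Policy evaluation: fixed-point iteration on the robust
        # Bellman operator restricted to pi.
        for _ in range(10_000):
            v_new = np.array([R[s, pi[s]]
                              + gamma * worst_case_p(P_hat[s, pi[s]], v, kappa) @ v
                              for s in range(n)])
            done = np.max(np.abs(v_new - v)) < 1e-12
            v = v_new
            if done:
                break
        # Policy improvement: greedy with respect to the robust Q-values.
        q = np.array([[R[s, a]
                       + gamma * worst_case_p(P_hat[s, a], v, kappa) @ v
                       for a in range(m)] for s in range(n)])
        pi_new = q.argmax(axis=1)
        if np.array_equal(pi_new, pi):
            break
        pi = pi_new
    return pi, v
```

On a toy two‑state instance the loop stabilises after a couple of improvement steps, consistent with the small iteration counts the paper reports.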
Results & Findings
| Aspect | What the paper shows |
|---|---|
| Algorithmic guarantee | Robust policy iteration terminates in at most $O(mn)$ iterations, each requiring a strongly‑polynomial LP solve. |
| Comparison to prior work | Extends Ye’s 2011 strongly‑polynomial result for classic MDPs to the robust setting; previous best known algorithms for RMDPs were only weakly polynomial (runtime depended on numeric values). |
| Robustness vs. performance | The algorithm computes the exact optimal robust policy (worst‑case optimal) rather than an approximation, preserving the same quality guarantees as classic policy iteration. |
| Scalability | Empirical evaluation (small synthetic benchmarks) confirms that the number of iterations is modest (often < 10) even for hundreds of states, suggesting practical viability. |
Practical Implications
- Safety‑critical systems – Autonomous vehicles, robotics, and industrial control often operate under model uncertainty. The strongly‑polynomial guarantee means developers can embed exact robust planners without fearing hidden exponential blow‑ups.
- Reinforcement learning with uncertainty – Model‑based RL pipelines that learn transition estimates can now plug in an $L_\infty$ robustness layer and obtain provably fast policy updates, useful for online adaptation.
- Game AI and adversarial planning – Turn‑based stochastic games are a special case of the considered RMDP; the result gives a fast exact solver for game‑theoretic AI (e.g., board game agents that need to handle stochastic opponent moves).
- Tooling and libraries – Existing MDP libraries (e.g., `mdptoolbox`, `rlglib`) can be extended with a robust policy‑iteration module that reuses the same API, offering a drop‑in upgrade for applications that need robustness.
- Hardware‑constrained deployment – Strongly‑polynomial algorithms have predictable memory and time footprints, making them suitable for embedded devices where floating‑point precision and runtime budgets are tight.
Limitations & Future Work
- Fixed discount factor – The strong‑polynomiality proof relies on $\gamma$ being a constant independent of the input size. Extending the result to variable or near‑1 discounts (common in long‑horizon planning) remains open.
- $L_\infty$ uncertainty only – While expressive, other uncertainty sets (e.g., KL‑divergence, Wasserstein) are popular in robust RL. Adapting the analysis to those geometries would broaden applicability.
- Scalability to massive state spaces – The current method still requires solving linear systems/LPs over the full state space. Future work could explore approximate linear solvers, decomposition, or integration with function approximation.
- Empirical evaluation – The paper includes only modest synthetic experiments. A thorough benchmark on real‑world domains (e.g., navigation, inventory management) would validate practical performance.
Bottom line: By delivering a strongly‑polynomial algorithm for a widely used robust MDP model, this work bridges a theoretical gap and opens the door for robust, high‑assurance decision‑making tools that can be deployed at scale.
Authors
- Ali Asadi
- Krishnendu Chatterjee
- Ehsan Goharshady
- Mehrdad Karrabi
- Alipasha Montaseri
- Carlo Pagano
Paper Information
- arXiv ID: 2601.23229v1
- Categories: cs.AI, cs.CC
- Published: January 30, 2026