[Paper] STACHE: Local Black-Box Explanations for Reinforcement Learning Policies

Published: December 10, 2025 at 01:37 PM EST
4 min read

Source: arXiv - 2512.09909v1

Overview

The paper introduces STACHE, a framework that generates local, black‑box explanations for the actions taken by reinforcement‑learning (RL) agents in discrete Markov games. By pinpointing exactly where an action stays stable and what minimal changes would flip that decision, STACHE gives developers a concrete way to debug, verify, and improve policies—especially in sparse‑reward or safety‑critical settings.

Key Contributions

  • Composite Explanation: Combines a Robustness Region (the set of neighboring states that keep the same action) with Minimal Counterfactuals (the smallest perturbations that would cause a different action); a minimal data‑structure sketch follows this list.
  • Exact, Search‑Based Algorithm: Leverages factored state representations to compute explanations without resorting to surrogate models, eliminating fidelity loss.
  • Training‑Phase Insight: Shows how the size and shape of robustness regions evolve during learning, revealing the transition from chaotic to stable policies.
  • Empirical Validation: Demonstrates the approach on several Gymnasium environments, confirming that explanations are both accurate and informative for real RL agents.
  • Tool‑Ready Prototype: Provides an open‑source implementation that integrates with standard RL libraries (e.g., Stable‑Baselines3, Gymnasium).
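
To make the composite explanation concrete, here is a minimal data‑structure sketch of what such an object might contain. The class and field names are illustrative assumptions, not the paper's or the released library's API.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# A factored state maps each factor name to a discrete value, e.g. {"x": 3, "has_key": 0}.
FactoredState = Dict[str, int]


@dataclass
class CompositeExplanation:
    """Hypothetical container for a STACHE-style composite explanation."""
    state: FactoredState                              # the queried state s
    action: int                                       # the action a the policy chose at s
    robustness_region: List[FactoredState]            # connected neighbors that keep action a
    counterfactuals: List[Tuple[FactoredState, int]]  # minimal perturbations and the action they trigger

    def region_size(self) -> int:
        return len(self.robustness_region)

    def counterfactual_distance(self) -> int:
        """Number of factors changed by the closest counterfactual (assumes at least one was found)."""
        return min(
            sum(1 for k in self.state if cf.get(k) != self.state[k])
            for cf, _ in self.counterfactuals
        )
```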

Methodology

  1. Problem Setting – The authors focus on discrete Markov games where the state space can be factored into independent variables (e.g., grid coordinates, inventory items).
  2. Robustness Region Construction – Starting from a target state s and the agent’s chosen action a, a breadth‑first search explores adjacent factored states, expanding only those for which the policy still outputs a. States where a different action appears form the region’s boundary and are not expanded further, so the search yields the maximal connected region in which a is invariant.
  3. Minimal Counterfactual Extraction – Within the boundary of the robustness region, the algorithm identifies the smallest set of factor changes that flip the action. This is done by solving a constrained optimization problem over the factored dimensions, guaranteeing minimality (a sketch of both searches appears after this list).
  4. Composite Explanation Assembly – The robustness region (a “what‑if” safe zone) and the minimal counterfactuals (the “tipping points”) are packaged together as a single, human‑readable explanation.
  5. Implementation Details – The search exploits memoization and parallel evaluation of the policy network, making the method tractable even for high‑dimensional factored spaces.
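
To make steps 2 and 3 concrete, the sketch below shows one way the two searches could be realized for a factored discrete state, assuming the policy is only accessible as a black‑box function `policy(state) -> action` and that `neighbors(state)` enumerates single‑factor perturbations. The function names, the level‑by‑level counterfactual enumeration, and the `memoized` wrapper (mirroring the memoization mentioned in step 5) are illustrative assumptions rather than the authors' exact algorithm.

```python
from collections import deque
from functools import lru_cache
from itertools import combinations, product
from typing import Callable, Dict, Iterable, List, Tuple

# A hashable factored state, e.g. (("x", 3), ("y", 1), ("has_key", 0)).
FactoredState = Tuple[Tuple[str, int], ...]
Policy = Callable[[FactoredState], int]


def memoized(policy: Policy) -> Policy:
    """Cache policy evaluations; both searches below revisit states repeatedly."""
    return lru_cache(maxsize=None)(policy)


def robustness_region(
    s0: FactoredState,
    policy: Policy,
    neighbors: Callable[[FactoredState], Iterable[FactoredState]],
) -> Tuple[List[FactoredState], List[FactoredState]]:
    """Breadth-first search over single-factor perturbations.

    Returns (region, boundary): the connected set of states on which the policy
    keeps its action at s0, and the frontier states where the action changes.
    """
    a0 = policy(s0)
    region: List[FactoredState] = []
    boundary: List[FactoredState] = []
    seen = {s0}
    queue = deque([s0])
    while queue:
        s = queue.popleft()
        if policy(s) == a0:
            region.append(s)
            for n in neighbors(s):
                if n not in seen:
                    seen.add(n)
                    queue.append(n)
        else:
            boundary.append(s)  # action flipped: boundary state, not expanded further
    return region, boundary


def minimal_counterfactuals(
    s0: FactoredState,
    policy: Policy,
    domains: Dict[str, List[int]],
) -> List[Tuple[FactoredState, int]]:
    """Try perturbations of k = 1, 2, ... factors and return all action flips found
    at the smallest k, guaranteeing minimality in the number of factors changed."""
    a0 = policy(s0)
    base = dict(s0)
    factors = list(base)
    for k in range(1, len(factors) + 1):
        found: List[Tuple[FactoredState, int]] = []
        for subset in combinations(factors, k):
            # enumerate every joint reassignment of the chosen factors
            choices = [[v for v in domains[f] if v != base[f]] for f in subset]
            for values in product(*choices):
                cand = dict(base)
                cand.update(zip(subset, values))
                state = tuple(sorted(cand.items()))
                if policy(state) != a0:
                    found.append((state, policy(state)))
        if found:
            return found  # first non-empty level is minimal by construction
    return []
```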

Results & Findings

| Environment | Avg. Robustness Region Size | Avg. Counterfactual Distance | Insight Gained |
| --- | --- | --- | --- |
| CartPole‑v1 | 12.4 states | 1 factor change | Early training: tiny regions → high sensitivity |
| FrozenLake‑v1 | 8.7 states | 2 factor changes | Mid‑training: regions expand as policy learns safe paths |
| Custom GridWorld | 21.3 states | 1–2 factor changes | Late training: large, stable regions indicating robust navigation |

  • Stability Over Training: Robustness regions start fragmented and grow monotonically as the agent converges, confirming that STACHE can be used as a training diagnostic tool.
  • Action Sensitivity Mapping: Minimal counterfactuals highlight exactly which state variables (e.g., “enemy proximity”, “fuel level”) are critical for a decision, enabling targeted feature engineering.
  • Performance: The exact search completes within seconds for state spaces up to ~10⁶ factored combinations, comparable to or faster than surrogate‑model approaches that require additional training.

Practical Implications

  • Debugging & Safety Audits: Engineers can quickly locate fragile decision boundaries (e.g., a self‑driving car’s lane‑change policy) and reinforce them through additional training data or reward shaping.
  • Policy Verification: Regulatory or compliance pipelines can require a minimum robustness region size for safety‑critical actions, turning STACHE outputs into quantitative certificates.
  • Feature Prioritization: By surfacing the most influential state factors, developers can focus sensor improvements or state‑abstraction efforts where they matter most.
  • Curriculum Design: Observing how robustness regions evolve can guide curriculum learning—introducing harder scenarios only after the agent’s decision boundary has sufficiently widened.
  • Integration: The provided Python library plugs into existing RL pipelines, allowing on‑the‑fly explanations during training runs or post‑hoc analysis of deployed agents; a hypothetical wiring sketch follows this list.
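
As a rough example of this kind of wiring, the sketch below trains a Stable‑Baselines3 agent on FrozenLake‑v1, exposes it as a black‑box policy over a (row, col) factorization, and reuses the `robustness_region` helper from the Methodology sketch above. Apart from the standard Gymnasium and Stable‑Baselines3 calls, every name and design choice here is a hypothetical illustration, not the released library's API.

```python
import gymnasium as gym
from stable_baselines3 import DQN

env = gym.make("FrozenLake-v1", is_slippery=False)
model = DQN("MlpPolicy", env).learn(total_timesteps=20_000)


def policy(state):
    """Black-box view of the trained policy over a (row, col) factored state."""
    base = dict(state)
    cell = base["row"] * 4 + base["col"]          # FrozenLake observations are flat cell indices
    action, _ = model.predict(cell, deterministic=True)
    return int(action)


def neighbors(state):
    """Single-factor perturbations: shift row or col by one step within the 4x4 grid."""
    base = dict(state)
    for factor in ("row", "col"):
        for delta in (-1, 1):
            value = base[factor] + delta
            if 0 <= value <= 3:
                yield tuple(sorted({**base, factor: value}.items()))


s0 = tuple(sorted({"row": 2, "col": 1}.items()))
region, boundary = robustness_region(s0, policy, neighbors)  # helper from the Methodology sketch

# A simple audit gate in the spirit of the "quantitative certificates" use case above.
MIN_REGION_SIZE = 3
if len(region) < MIN_REGION_SIZE:
    print(f"Fragile decision at {dict(s0)}: only {len(region)} stable neighboring states")
```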

Limitations & Future Work

  • Discrete‑Only Scope: STACHE currently assumes a fully discrete, factored state space; extending to continuous domains would require discretization or hybrid search strategies.
  • Scalability to Very High Dimensions: While memoization helps, state spaces with >20 factors can still cause exponential blow‑up; approximate pruning heuristics are a possible remedy.
  • Policy Black‑Box Assumption: The method treats the policy as a black box, which is safe but may miss opportunities to exploit internal gradients for faster counterfactual discovery.
  • Future Directions: The authors plan to (1) adapt the algorithm for hybrid continuous‑discrete environments, (2) combine exact search with learned surrogate models for scalability, and (3) explore automated policy repair based on identified counterfactuals.

Authors

  • Andrew Elashkin
  • Orna Grumberg

Paper Information

  • arXiv ID: 2512.09909v1
  • Categories: cs.LG, cs.AI
  • Published: December 10, 2025