[Paper] Predictive Safety Shield for Dyna-Q Reinforcement Learning
Source: arXiv - 2511.21531v1
Overview
The paper introduces a Predictive Safety Shield that plugs into model‑based reinforcement learning (RL) agents—specifically Dyna‑Q—in discrete environments. By simulating a few steps ahead with a learned model, the shield selects, among the actions verified as safe, the one with the best predicted future performance, delivering hard safety guarantees without sacrificing learning speed.
Key Contributions
- Predictive shielding: Extends classic safety shields by using short‑horizon model predictions to evaluate the downstream impact of each safe action.
- Local Q‑function updates: The shield adjusts the agent’s Q‑values on‑the‑fly based on simulated safe trajectories, effectively “teaching” the agent which safe actions are actually beneficial (a sketch of this update appears after this list).
- Performance‑aware safety: Demonstrates that safety need not be a blunt fallback; the shield can steer the agent along optimal or near‑optimal safe paths.
- Robustness to distribution shift: Shows that the approach tolerates mismatches between the simulated model used for shielding and the real environment, without extra retraining.
- Empirical validation: Experiments on gridworld benchmarks illustrate that even a 2‑step prediction horizon can recover the optimal safe policy.
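As a rough illustration of the local Q‑value correction (an assumed n‑step form for intuition, not an equation quoted from the paper), the shield can be read as replacing the usual one‑step target with the return of the length‑h simulated safe rollout:

$$
Q(s,a) \;\leftarrow\; Q(s,a) + \alpha \left[ \sum_{k=0}^{h-1} \gamma^{k}\,\hat r_{k} + \gamma^{h} \max_{a'} Q(\hat s_{h}, a') - Q(s,a) \right]
$$

where the predicted rewards $\hat r_{k}$ and states $\hat s_{h}$ are produced by the learned Dyna‑Q model, and rollouts that leave the safe set are discarded.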
Methodology
- Base RL algorithm – Dyna‑Q: The agent learns a Q‑function while simultaneously building a learned model of the environment (transition and reward).
- Safety shield layer: Before executing an action, the shield checks whether the action is a priori safe (e.g., keeps the agent within a predefined safe set).
- Predictive simulation: For each candidate safe action, the shield rolls out the learned model for a short horizon h (typically 1–3 steps) and evaluates the simulated trajectory’s cumulative reward and safety status.
- Local Q‑value correction: The shield updates the Q‑value of the current state–action pair with the simulated return, biasing the agent toward safe actions that also promise higher future reward.
- Execution: The agent selects the action with the highest (shield‑adjusted) Q‑value; if no safe action exists, a predefined fallback controller is used.
The whole process runs online and requires only the existing Dyna‑Q model; no extra neural networks or offline data collection are needed. A minimal sketch of this decision loop follows.
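The Python sketch below illustrates one way the shielded decision loop could look under the assumptions above; the helper names (`simulated_return`, `shielded_step`), the tabular `q_table`, the callable `model`, and the safety callbacks are hypothetical stand‑ins, not the paper’s implementation.

```python
import numpy as np

def simulated_return(model, q_table, state, action, horizon, gamma, is_safe):
    """Roll out the learned model for `horizon` steps starting with (state, action).
    Returns the discounted simulated return, or None if the rollout leaves the safe set."""
    total, discount = 0.0, 1.0
    s, a = state, action
    for _ in range(horizon):
        s_next, r = model(s, a)                      # learned transition + reward
        if not is_safe(s_next):
            return None                              # simulated trajectory becomes unsafe
        total += discount * r
        discount *= gamma
        s = s_next
        a = int(np.argmax(q_table[s]))               # greedy continuation under current Q
    return total + discount * np.max(q_table[s])     # bootstrap with the current Q-function

def shielded_step(model, q_table, state, safe_actions, horizon, gamma, alpha,
                  is_safe, fallback_action):
    """Evaluate each a-priori safe action with a short predictive rollout, locally
    correct its Q-value toward the simulated return, and pick the best safe action."""
    best_action, best_value = None, -np.inf
    for a in safe_actions:
        ret = simulated_return(model, q_table, state, a, horizon, gamma, is_safe)
        if ret is None:
            continue                                 # action leads into an unsafe dead end
        # Local Q-value correction toward the h-step simulated return.
        q_table[state][a] += alpha * (ret - q_table[state][a])
        if q_table[state][a] > best_value:
            best_action, best_value = a, q_table[state][a]
    return best_action if best_action is not None else fallback_action
```

Here `q_table` is a states‑by‑actions array and `model(s, a)` returns a predicted `(next_state, reward)` pair; if no candidate survives the rollout check, the predefined fallback controller takes over, mirroring the execution step above.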
Results & Findings
| Environment | Horizon (h) | Success Rate (Safety) | Cumulative Reward |
|---|---|---|---|
| 5×5 Gridworld (static obstacles) | 1 | 100 % | Near‑optimal |
| 10×10 Gridworld (moving hazards) | 2 | 100 % | 15 % higher than baseline Dyna‑Q |
| Sim‑to‑Real transfer (model drift) | 3 | 100 % | No degradation vs. in‑sim |
- Short horizons suffice: Even with h = 1 the shield can avoid dead‑ends and guide the agent to the optimal path.
- No safety violations: Across all trials the shield guarantees hard safety—no unsafe state is ever visited.
- Robustness: When the environment dynamics are perturbed (simulating a “real‑world” shift), the shield still prevents unsafe actions without needing to retrain the model.
Practical Implications
- Safety‑critical robotics: Mobile robots navigating warehouses or factories can use the shield to guarantee collision‑free motion while still learning efficient routes.
- Autonomous vehicles in discrete decision layers: High‑level maneuver planning (e.g., lane changes) can be protected by a predictive shield that respects traffic rules and anticipates downstream risks.
- Industrial control: PLCs that learn to optimize production sequences can embed the shield to avoid unsafe actuator commands, reducing downtime and maintenance costs.
- Rapid prototyping: Developers can plug the shield into existing Dyna‑Q or other model‑based RL codebases with minimal changes, gaining safety guarantees without a separate verification pipeline.
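To make the “minimal changes” claim concrete, the fragment below sketches a hypothetical integration point that reuses the `shielded_step` helper from the Methodology sketch; `env`, `model.predict`, and the safety callbacks are illustrative assumptions rather than APIs from the paper.

```python
def run_episode_with_shield(env, model, q_table, safe_actions, is_safe,
                            fallback_action, horizon=2, gamma=0.95, alpha=0.1):
    """Run one episode of an existing Dyna-Q agent in which only the action
    selection is swapped for the shielded version; everything else is unchanged."""
    state = env.reset()
    done = False
    while not done:
        action = shielded_step(model.predict, q_table, state,
                               safe_actions(state), horizon, gamma, alpha,
                               is_safe, fallback_action)
        state_next, reward, done = env.step(action)
        # The usual Dyna-Q machinery (real-experience Q-update, model learning,
        # planning sweeps) would run here exactly as in the unshielded agent.
        state = state_next
    return q_table
```

Because only the action‑selection call changes, the shield can be removed or re‑enabled without touching the rest of the training loop.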
Limitations & Future Work
- Discrete state‑action spaces: The current formulation assumes a finite grid‑like environment; extending to continuous domains will require function approximation for the predictive rollout.
- Model fidelity: The shield’s effectiveness hinges on the learned model being reasonably accurate over the short horizon; large model errors could mislead the Q‑updates.
- Scalability of rollout: While short horizons keep computation cheap, larger state spaces may still pose a combinatorial explosion when evaluating many safe actions.
Future research directions include adapting the predictive shield to continuous control via learned dynamics ensembles, integrating uncertainty quantification to weight the Q‑updates, and testing on real robotic platforms to validate the simulated robustness claims.
Authors
- Pin Jin
- Hanna Krasowski
- Elena Vanneaux
Paper Information
- arXiv ID: 2511.21531v1
- Categories: cs.LG, cs.AI, cs.RO, eess.SY
- Published: November 26, 2025
- PDF: https://arxiv.org/pdf/2511.21531v1