[Paper] Predictive Safety Shield for Dyna-Q Reinforcement Learning
Source: arXiv - 2511.21531v1
Overview
The paper introduces a Predictive Safety Shield that plugs into model‑based reinforcement learning (RL) agents—specifically Dyna‑Q—in discrete environments. By simulating a few steps ahead with a learned model, the shield selects, among the actions verified as safe, the one with the best predicted future performance, delivering hard safety guarantees without sacrificing learning speed.
Key Contributions
- Predictive shielding: Extends classic safety shields by using short‑horizon model predictions to evaluate the downstream impact of each safe action.
- Local Q‑function updates: The shield adjusts the agent’s Q‑values on‑the‑fly based on simulated safe trajectories, effectively “teaching” the agent which safe actions are actually beneficial (a sketch of this update appears after this list).
- Performance‑aware safety: Demonstrates that safety need not be a blunt fallback; the shield can steer the agent along optimal or near‑optimal safe paths.
- Robustness to distribution shift: Shows that the approach tolerates mismatches between the simulated model used for shielding and the real environment, without extra retraining.
- Empirical validation: Experiments on gridworld benchmarks illustrate that even a 2‑step prediction horizon can recover the optimal safe policy.
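As a rough illustration of the local Q‑value correction (an assumed n‑step form for intuition, not an equation quoted from the paper), the shield can be read as replacing the usual one‑step target with the return of the length‑h simulated safe rollout:

$$
Q(s,a) \;\leftarrow\; Q(s,a) + \alpha \left[ \sum_{k=0}^{h-1} \gamma^{k}\,\hat r_{k} + \gamma^{h} \max_{a'} Q(\hat s_{h}, a') - Q(s,a) \right]
$$

where the predicted rewards $\hat r_{k}$ and states $\hat s_{h}$ are produced by the learned Dyna‑Q model, and rollouts that leave the safe set are discarded.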
Methodology
- Base RL algorithm – Dyna‑Q: The agent learns a Q‑function while simultaneously building a learned model of the environment (transition and reward).
- Safety shield layer: Before executing an action, the shield checks whether the action is a priori safe (e.g., keeps the agent within a predefined safe set).
- Predictive simulation: For each candidate safe action, the shield rolls out the learned model for a short horizon h (typically 1–3 steps) and evaluates the simulated trajectory’s cumulative reward and safety status.
- Local Q‑value correction: The shield updates the Q‑value of the current state–action pair with the simulated return, biasing the agent toward safe actions that also promise higher future reward.
- Execution: The agent selects the action with the highest (shield‑adjusted) Q‑value; if no safe action exists, a predefined fallback controller is used.
The whole process runs online and requires only the existing Dyna‑Q model; no extra neural networks or offline data collection are needed. A minimal sketch of this decision loop follows.
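The Python sketch below illustrates one way the shielded decision loop could look under the assumptions above; the helper names (`simulated_return`, `shielded_step`), the tabular `q_table`, the callable `model`, and the safety callbacks are hypothetical stand‑ins, not the paper’s implementation.

```python
import numpy as np

def simulated_return(model, q_table, state, action, horizon, gamma, is_safe):
    """Roll out the learned model for `horizon` steps starting with (state, action).
    Returns the discounted simulated return, or None if the rollout leaves the safe set."""
    total, discount = 0.0, 1.0
    s, a = state, action
    for _ in range(horizon):
        s_next, r = model(s, a)                      # learned transition + reward
        if not is_safe(s_next):
            return None                              # simulated trajectory becomes unsafe
        total += discount * r
        discount *= gamma
        s = s_next
        a = int(np.argmax(q_table[s]))               # greedy continuation under current Q
    return total + discount * np.max(q_table[s])     # bootstrap with the current Q-function

def shielded_step(model, q_table, state, safe_actions, horizon, gamma, alpha,
                  is_safe, fallback_action):
    """Evaluate each a-priori safe action with a short predictive rollout, locally
    correct its Q-value toward the simulated return, and pick the best safe action."""
    best_action, best_value = None, -np.inf
    for a in safe_actions:
        ret = simulated_return(model, q_table, state, a, horizon, gamma, is_safe)
        if ret is None:
            continue                                 # action leads into an unsafe dead end
        # Local Q-value correction toward the h-step simulated return.
        q_table[state][a] += alpha * (ret - q_table[state][a])
        if q_table[state][a] > best_value:
            best_action, best_value = a, q_table[state][a]
    return best_action if best_action is not None else fallback_action
```

Here `q_table` is a states‑by‑actions array and `model(s, a)` returns a predicted `(next_state, reward)` pair; if no candidate survives the rollout check, the predefined fallback controller takes over, mirroring the execution step above.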
Results & Findings
| Environment | Horizon (h) | Success Rate (Safety) | Cumulative Reward |
|---|---|---|---|
| 5×5 Gridworld (static obstacles) | 1 | 100 % | Near‑optimal |
| 10×10 Gridworld (moving hazards) | 2 | 100 % | 15 % higher than baseline Dyna‑Q |
| Sim‑to‑Real transfer (model drift) | 3 | 100 % | No degradation vs. in‑sim |
- Short horizons suffice: Even with h = 1 the shield can avoid dead‑ends and guide the agent to the optimal path.
- No safety violations: Across all trials the shield guarantees hard safety—no unsafe state is ever visited.
- Robustness: When the environment dynamics are perturbed (simulating a “real‑world” shift), the shield still prevents unsafe actions without needing to retrain the model.
Practical Implications
- Safety‑critical robotics: Mobile robots navigating warehouses or factories can use the shield to guarantee collision‑free motion while still learning efficient routes.
- Autonomous vehicles in discrete decision layers: High‑level maneuver planning (e.g., lane changes) can be protected by a predictive shield that respects traffic rules and anticipates downstream risks.
- Industrial control: PLCs that learn to optimize production sequences can embed the shield to avoid unsafe actuator commands, reducing downtime and maintenance costs.
- Rapid prototyping: Developers can plug the shield into existing Dyna‑Q or other model‑based RL codebases with minimal changes, gaining safety guarantees without a separate verification pipeline.
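To make the “minimal changes” claim concrete, the fragment below sketches a hypothetical integration point that reuses the `shielded_step` helper from the Methodology sketch; `env`, `model.predict`, and the safety callbacks are illustrative assumptions rather than APIs from the paper.

```python
def run_episode_with_shield(env, model, q_table, safe_actions, is_safe,
                            fallback_action, horizon=2, gamma=0.95, alpha=0.1):
    """Run one episode of an existing Dyna-Q agent in which only the action
    selection is swapped for the shielded version; everything else is unchanged."""
    state = env.reset()
    done = False
    while not done:
        action = shielded_step(model.predict, q_table, state,
                               safe_actions(state), horizon, gamma, alpha,
                               is_safe, fallback_action)
        state_next, reward, done = env.step(action)
        # The usual Dyna-Q machinery (real-experience Q-update, model learning,
        # planning sweeps) would run here exactly as in the unshielded agent.
        state = state_next
    return q_table
```

Because only the action‑selection call changes, the shield can be removed or re‑enabled without touching the rest of the training loop.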
Limitations & Future Work
- Discrete state‑action spaces: The current formulation assumes a finite grid‑like environment; extending to continuous domains will require function approximation for the predictive rollout.
- Model fidelity: The shield’s effectiveness hinges on the learned model being reasonably accurate over the short horizon; large model errors could mislead the Q‑updates.
- Scalability of rollout: While short horizons keep computation cheap, larger state spaces may still pose a combinatorial explosion when evaluating many safe actions.
Future research directions include adapting the predictive shield to continuous control via learned dynamics ensembles, integrating uncertainty quantification to weight the Q‑updates, and testing on real robotic platforms to validate the simulated robustness claims.
Authors
- Pin Jin
- Hanna Krasowski
- Elena Vanneaux
Paper Information
- arXiv ID: 2511.21531v1
- Categories: cs.LG, cs.AI, cs.RO, eess.SY
- Published: November 26, 2025
- PDF: https://arxiv.org/pdf/2511.21531v1