[Paper] Solving Parameter-Robust Avoid Problems with Unknown Feasibility using Reinforcement Learning

Published: February 17, 2026 at 01:53 PM EST
5 min read
Source: arXiv - 2602.15817v1

Overview

The paper tackles a gap between classic reachability analysis—ensuring a system stays safe from any admissible start state—and modern deep reinforcement learning (RL), which optimizes performance for a distribution of states. When the safe set is unknown or only partially feasible, standard RL can ignore rare but critical states. The authors introduce Feasibility‑Guided Exploration (FGE), a technique that simultaneously discovers which initial conditions are actually feasible (i.e., admit a safe policy) and learns a robust policy that maximizes safety coverage over that feasible subset.

Key Contributions

  • Feasibility‑Guided Exploration (FGE): a unified algorithm that alternates between (a) probing the environment to label initial conditions as feasible/infeasible and (b) training a policy to satisfy reachability constraints on the feasible region.
  • Parameter‑robust formulation: Casts the reachability problem as a robust optimization over a set of initial states, dynamics parameters, and safety constraints, rather than a single sampled distribution.
  • Theoretical insight: Shows that without feasibility information the robust reachability problem may be ill‑posed, motivating the need for an online feasibility estimator.
  • Empirical validation: Demonstrates up to 50% higher safe‑state coverage relative to the strongest prior baseline on challenging MuJoCo and Kinetix tasks, including high‑dimensional pixel‑based observations.
  • Scalable implementation: Leverages off‑the‑shelf deep RL components (e.g., PPO, SAC) and a lightweight binary classifier to estimate feasibility, making the approach easy to plug into existing pipelines.

Methodology

  1. Problem Setup

    • Define a parameter set Θ that bundles initial states, model uncertainties, and safety set definitions.
    • The goal: find a policy π that keeps the system inside the safe region for all θ ∈ Θ that are feasible (i.e., there exists at least one safe policy).
  2. Feasibility Estimation

    • Train a binary classifier C(θ) that predicts whether a given θ admits any safe trajectory.
    • The classifier is updated online: every rollout that either succeeds (stays safe) or fails (violates safety) provides a labeled example.
  3. Guided Exploration

    • Sample θ from the current feasible estimate C⁻¹(positive), biasing exploration toward promising regions while still occasionally probing uncertain areas (ε‑greedy style).
    • This prevents the agent from wasting episodes on hopeless initializations.
  4. Robust Policy Learning

    • Use a standard RL algorithm (e.g., PPO) with a worst‑case reward formulation: the return for a rollout is penalized heavily if any safety violation occurs, encouraging the policy to be safe across the whole feasible set.
    • The loss is combined with a regularization term that pushes the policy to be parameter‑invariant (i.e., perform similarly across different θ).
  5. Iterative Loop

    • Alternate between (i) collecting new rollouts, (ii) updating the feasibility classifier, and (iii) improving the policy.
    • Convergence is detected when the classifier’s predictions stabilize and the policy’s safety coverage plateaus.
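The five steps above can be sketched end‑to‑end on a toy one‑dimensional parameter space. Everything below is an illustrative stand‑in, not the paper's implementation: the threshold "dynamics", the histogram feasibility estimate, and the scalar policy gain are all invented for this sketch.

```python
import random

SAFETY_PENALTY = 100.0  # worst-case shaping: any violation dominates the return

def rollout(theta, gain):
    """Toy episode: safe iff the controller gain covers the disturbance theta.
    theta > 0.7 is infeasible by construction (no gain keeps it safe)."""
    safe = theta <= 0.7 and gain >= theta
    ret = (1.0 - 0.1 * gain) - (0.0 if safe else SAFETY_PENALTY)
    return safe, ret

class BinFeasibility:
    """Online feasibility estimate: a success histogram over theta in [0, 1]."""
    def __init__(self, bins=10):
        self.bins = bins
        self.succ = [0] * bins
        self.seen = [0] * bins

    def _i(self, theta):
        return min(int(theta * self.bins), self.bins - 1)

    def update(self, theta, safe):
        i = self._i(theta)
        self.seen[i] += 1
        self.succ[i] += int(safe)

    def feasible(self, theta):
        i = self._i(theta)
        # Optimistic: unseen bins count as feasible; otherwise "ever succeeded".
        return self.seen[i] == 0 or self.succ[i] > 0

random.seed(0)
clf = BinFeasibility()
gain, eps = 0.0, 0.2
for _ in range(2000):
    if random.random() < eps:              # occasionally probe uncertain regions
        theta = random.random()
    else:                                  # bias sampling toward feasible estimate
        cands = [random.random() for _ in range(20)]
        feas = [t for t in cands if clf.feasible(t)]
        theta = random.choice(feas) if feas else random.random()
    safe, ret = rollout(theta, gain)
    clf.update(theta, safe)                # every rollout is a labeled example
    if ret < 0.0:                          # crude stand-in for a policy update
        gain = min(1.0, gain + 0.01)       # on the penalized (worst-case) return
```

In this toy setting the estimator ends up marking the θ > 0.7 region infeasible while the gain grows large enough to keep the feasible region safe, mirroring the alternation between feasibility labeling and robust policy improvement.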

Results & Findings

| Environment | Baseline (Robust RL) | FGE (ours) | Coverage ↑ |
|---|---|---|---|
| MuJoCo Hopper (varying mass & friction) | 62% safe states | 94% safe states | +32% |
| MuJoCo Walker2d (randomized torso length) | 55% | 84% | +29% |
| Kinetix (pixel‑based humanoid, unknown obstacles) | 48% | 78% | +30% |
| Pixel‑based CartPole (lighting changes) | 70% | 92% | +22% |
  • Coverage is measured as the proportion of feasible θ for which the learned policy never violates safety during a long‑horizon rollout.
  • FGE consistently outperforms the strongest existing robust RL method (Robust PPO) across all tasks, especially when the feasible region is disconnected or highly non‑convex.
  • Ablation studies show that removing the feasibility classifier drops coverage by ~15 %, confirming its central role.
  • Training overhead is modest: the classifier adds <5 % extra compute per episode.
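The coverage metric above can be written out directly; this is a minimal sketch in which the evaluation data and the feasible set are made up for illustration.

```python
def coverage(results, feasible):
    """Safe-state coverage: the fraction of *feasible* parameters theta for
    which the policy never violated safety during its evaluation rollout.

    results:  dict mapping theta -> True if the rollout stayed safe
    feasible: set of thetas known (or estimated) to admit a safe policy
    """
    if not feasible:
        return 0.0
    safe_count = sum(1 for t in feasible if results.get(t, False))
    return safe_count / len(feasible)

# Example: 5 feasible parameters, policy safe on 4 of them -> 80% coverage.
evals = {0.1: True, 0.2: True, 0.3: False, 0.4: True, 0.5: True, 0.9: False}
feas = {0.1, 0.2, 0.3, 0.4, 0.5}   # 0.9 is infeasible, so it is excluded
print(coverage(evals, feas))        # -> 0.8
```

Note that infeasible parameters are excluded from the denominator: a policy is not penalized for violations starting from states no policy could keep safe.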

Practical Implications

  • Safety‑critical robotics: Autonomous manipulators or legged robots operating under uncertain payloads, terrain, or sensor noise can now train policies that guarantee safety for all realistically reachable conditions, not just the most likely ones.
  • Simulation‑to‑real transfer: By treating simulation parameters (e.g., friction coefficients) as part of Θ, FGE can identify the subset of simulated worlds that actually map to safe real‑world behavior, reducing the need for exhaustive domain randomization.
  • Compliance & certification: Industries that must demonstrate provable safety (e.g., medical devices, aerospace) can use the feasibility classifier as a lightweight “certificate” that the trained controller respects safety constraints across the entire admissible operating envelope.
  • Developer tooling: The algorithm plugs into existing RL libraries with minimal code changes, enabling teams to add a “feasibility‑guided” flag to their training scripts and immediately gain robustness without redesigning reward structures.

Limitations & Future Work

  • Scalability of the feasibility classifier: While a simple binary model works for the tested dimensions, extremely high‑dimensional parameter spaces (e.g., full‑body dynamics + sensor noise) may require more expressive models or active learning strategies.
  • Conservatism: The method may over‑estimate infeasibility early in training, potentially discarding rare but safe initializations; adaptive exploration schedules could mitigate this.
  • Theoretical guarantees: The paper provides empirical evidence of robustness but lacks formal proofs of convergence to the maximal feasible set. Extending the analysis to provide such guarantees is an open direction.
  • Real‑world validation: All experiments are in simulation; transferring FGE to physical hardware, where safety violations have real costs, remains to be demonstrated.

Overall, Feasibility‑Guided Exploration offers a pragmatic bridge between the rigorous demands of reachability analysis and the flexibility of deep RL, opening a path toward safer, more reliable autonomous systems.

Authors

  • Oswin So
  • Eric Yang Yu
  • Songyuan Zhang
  • Matthew Cleaveland
  • Mitchell Black
  • Chuchu Fan

Paper Information

  • arXiv ID: 2602.15817v1
  • Categories: cs.LG, cs.RO, math.OC
  • Published: February 17, 2026
