[Paper] Rule-based High-Level Coaching for Goal-Conditioned Reinforcement Learning in Search-and-Rescue UAV Missions Under Limited-Simulation Training
Source: arXiv - 2604.26833v1
Overview
The paper proposes a hierarchical decision‑making framework that lets a UAV learn to complete search‑and‑rescue (SAR) missions even when it has only a handful of simulated training runs. By pairing a rule‑based high‑level coach (derived from a formal task specification) with a goal‑conditioned reinforcement‑learning (RL) low‑level controller, the system can stay safe from the start while still adapting online to the specifics of each mission.
Key Contributions
- Hybrid architecture: Combines deterministic, interpretable rules (high‑level advisor) with an online goal‑conditioned RL controller (low‑level).
- Zero‑pretraining deployment: Demonstrates that the system can be deployed without any offline RL pre‑training, i.e., under a strict “no‑simulation‑pretraining” regime in which all learning happens online.
- Rule‑derived metadata for replay: Extends prioritized experience replay with mode‑aware tags and safety hints supplied by the high‑level advisor, improving sample efficiency.
- Two realistic SAR tasks: Battery‑aware multi‑goal delivery and moving‑target delivery in cluttered 3‑D environments, both featuring dynamic obstacles and strict safety constraints.
- Early‑stage safety gains: Shows a substantial reduction in collision‑induced episode terminations during the first few hundred learning steps.
Methodology
Task Specification → Rules
- Engineers write a structured mission description (e.g., “avoid no‑fly zones, keep battery > 20 % before returning”).
- An offline compiler translates this into a deterministic rule set that can recommend actions, forbid unsafe actions, and assign arbitration weights (how much influence the rule should have vs. the RL policy).
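As a rough illustration of what a compiled rule might look like, here is a minimal Python sketch, assuming each rule exposes a state predicate, a recommended action, a set of forbidden actions, and an arbitration weight. All names and the weight schedule are hypothetical, not the paper's implementation.

```python
# Minimal sketch (not the paper's code) of a compiled rule: each rule maps
# the current mission state to an optional recommended action, a set of
# forbidden actions, and an arbitration weight. Names are illustrative.
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Rule:
    name: str
    applies: Callable[[dict], bool]              # predicate on the mission state
    recommend: Callable[[dict], Optional[str]]   # suggested high-level action
    forbids: Callable[[dict], set]               # actions ruled out in this state
    weight: Callable[[dict], float]              # arbitration weight in [0, 1]

# Hypothetical rule derived from "keep battery > 20 % before returning".
battery_rule = Rule(
    name="battery_reserve",
    applies=lambda s: s["battery"] <= 0.20,
    recommend=lambda s: "return_to_base",
    forbids=lambda s: {"pursue_target", "extend_search"},
    # Weight grows from 0.5 at the 20 % threshold toward 1.0 as battery drains.
    weight=lambda s: min(1.0, (0.20 - s["battery"]) / 0.20 + 0.5),
)
```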
Goal‑Conditioned Low‑Level RL
- The UAV receives a dense reward signal that encodes progress toward the current goal (e.g., distance to a moving victim).
- A standard off‑policy algorithm (e.g., DDPG/SAC) is used, but the policy is conditioned on the current goal so the same network can handle many way‑points without retraining.
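A minimal sketch of the goal‑conditioning idea, assuming the common recipe of concatenating the goal to the observation and rewarding progress toward it; the function names are illustrative, not from the paper.

```python
# Sketch of goal conditioning for an off-policy actor (e.g., DDPG/SAC):
# observation and current goal are concatenated so one policy network can
# serve many waypoints. The dense reward encodes progress toward the goal.
import numpy as np

def goal_conditioned_input(obs: np.ndarray, goal: np.ndarray) -> np.ndarray:
    """Concatenate state and goal; the same actor handles any waypoint."""
    return np.concatenate([obs, goal])

def dense_reward(pos: np.ndarray, goal: np.ndarray, prev_pos: np.ndarray) -> float:
    """Reward the change in distance to the (possibly moving) goal,
    so the signal stays dense even far from the target."""
    return float(np.linalg.norm(prev_pos - goal) - np.linalg.norm(pos - goal))
```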
Mode‑Aware Prioritized Replay
- Each transition stored in the replay buffer is tagged with metadata from the high‑level advisor (e.g., “safe region”, “near battery limit”).
- The replay sampler gives higher priority to transitions that are both informative for learning and aligned with safety rules, allowing the agent to learn the right behavior faster.
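The following sketch shows one way the advisor's tags could modulate standard TD‑error‑based prioritized replay; the tag names and boost factors are assumptions for illustration, not values from the paper.

```python
# Mode-aware prioritization sketch: the advisor's metadata tags scale the
# usual TD-error priority so that safety-relevant transitions (e.g. near the
# battery limit) are replayed more often. Boost values are illustrative.
import numpy as np

TAG_BOOST = {"safe_region": 1.0, "near_battery_limit": 2.0, "near_obstacle": 2.5}

def priority(td_error: float, tags: list[str], alpha: float = 0.6) -> float:
    """Standard PER priority |delta|^alpha, scaled by the strongest tag boost."""
    boost = max((TAG_BOOST.get(t, 1.0) for t in tags), default=1.0)
    return (abs(td_error) + 1e-6) ** alpha * boost

def sample_indices(priorities: np.ndarray, batch_size: int, rng=np.random):
    """Sample transitions proportionally to their boosted priorities."""
    probs = priorities / priorities.sum()
    return rng.choice(len(priorities), size=batch_size, p=probs)
```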
Arbitration at Runtime
- At each decision step, the system computes a weighted blend:
  \[ a = w_{\text{rule}} \cdot a_{\text{rule}} + (1 - w_{\text{rule}}) \cdot a_{\text{RL}} \]
- The weight \(w_{\text{rule}}\) is dynamic: it rises in high‑risk regimes (low battery, dense obstacles) and falls when the environment is benign.
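The blend above translates directly into code. A minimal sketch follows; the risk‑to‑weight schedule is a hypothetical stand‑in for whatever mapping the paper uses.

```python
# Runtime arbitration: a = w * a_rule + (1 - w) * a_RL, per the equation above.
# The risk-to-weight mapping below is an illustrative assumption.
import numpy as np

def rule_weight(battery: float, obstacle_density: float) -> float:
    """Raise w_rule in high-risk regimes (low battery, dense obstacles)."""
    risk = max(1.0 - battery / 0.20, 0.0) + obstacle_density
    return float(np.clip(0.2 + 0.8 * risk, 0.0, 1.0))

def blended_action(a_rule: np.ndarray, a_rl: np.ndarray, w: float) -> np.ndarray:
    """Weighted blend of the rule-recommended and RL-proposed actions."""
    return w * a_rule + (1.0 - w) * a_rl
```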
Results & Findings
| Task | Metric | Baseline (pure RL) | Proposed Hybrid |
|---|---|---|---|
| Battery‑aware multi‑goal delivery | Collision terminations per 10 k steps (early phase) | 27 | 9 |
| Moving‑target delivery | Steps to 80 % success | 45 k | 28 k |
| Both tasks (overall) | Mission success after 100 k steps | 90 % | 92 % |
- Safety: The hybrid system cuts early collisions by roughly two‑thirds (27 → 9 terminations per 10 k steps), meaning the UAV can operate in real‑world airspace much sooner.
- Sample efficiency: By reusing rule‑guided experiences, the RL component reaches 80 % mission success with about a third fewer environment steps (28 k vs. 45 k).
- Adaptability: Even without any offline pre‑training, the agent learns to chase moving targets and respect battery constraints, showing that the rule coach does not lock the policy into a static behavior.
Practical Implications
- Rapid field deployment: Rescue teams can upload a mission spec and launch UAVs without spending weeks on simulation pre‑training, dramatically shortening response times.
- Regulatory compliance: The rule‑based layer guarantees that hard safety constraints (no‑fly zones, minimum battery) are never violated, easing certification with aviation authorities.
- Developer‑friendly integration: The high‑level advisor is expressed in a declarative JSON/YAML format; developers can tweak safety policies without touching the RL code (see the spec sketch after this list).
- Transferable to other domains: Any robotics problem with clear safety rules (warehouse drones, autonomous forklifts, planetary rovers) can adopt the same coach‑plus‑RL pattern.
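To make the declarative format above concrete, here is a hypothetical mission spec in YAML, loaded with PyYAML; the schema and field names are invented for illustration and are not the paper's actual format.

```python
# Hypothetical mission spec in the declarative style described above.
# The schema (mission, constraints, goals) is an illustrative assumption.
import yaml  # PyYAML

MISSION_SPEC = """
mission: search_and_rescue
constraints:
  - avoid: no_fly_zones
  - battery_min_before_return: 0.20
goals:
  - type: moving_target
    priority: high
"""

spec = yaml.safe_load(MISSION_SPEC)
assert spec["constraints"][1]["battery_min_before_return"] == 0.20
```

Because the spec lives outside the learned policy, a safety engineer can tighten a constraint (say, raising the battery reserve) and redeploy without retraining.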
Limitations & Future Work
- Rule authoring overhead: Crafting comprehensive, conflict‑free rule sets still requires domain expertise; automated rule synthesis is an open challenge.
- Scalability of arbitration: The simple linear weighting may not capture complex interactions in highly dynamic environments; future work could explore learned meta‑controllers for arbitration.
- Simulation‑to‑real gap: Although the paper uses limited simulation, transferring to real UAV hardware may expose sensor noise and latency issues not captured in the current experiments.
- Extending to multi‑agent teams: Coordinating several UAVs under a shared rule coach raises questions about conflict resolution and communication overhead, a promising direction for follow‑up studies.
Authors
- Mahya Ramezani
- Holger Voos
Paper Information
- arXiv ID: 2604.26833v1
- Categories: cs.RO, cs.AI, cs.LG
- Published: April 29, 2026