[Paper] Rule-based High-Level Coaching for Goal-Conditioned Reinforcement Learning in Search-and-Rescue UAV Missions Under Limited-Simulation Training

Published: April 29, 2026
Source: arXiv (2604.26833v1)

Overview

The paper proposes a hierarchical decision‑making framework that lets a UAV learn to complete search‑and‑rescue (SAR) missions even when it has only a handful of simulated training runs. By pairing a rule‑based high‑level coach (derived from a formal task specification) with a goal‑conditioned reinforcement‑learning (RL) low‑level controller, the system can stay safe from the start while still adapting online to the specifics of each mission.

Key Contributions

  • Hybrid architecture: Combines deterministic, interpretable rules (high‑level advisor) with an online goal‑conditioned RL controller (low‑level).
  • Zero‑pretraining deployment: Demonstrates that the system can be launched without any offline RL pre‑training, operating under a strict “no‑simulation‑pretraining” regime.
  • Rule‑derived metadata for replay: Extends prioritized experience replay with mode‑aware tags and safety hints supplied by the high‑level advisor, improving sample efficiency.
  • Two realistic SAR tasks: Battery‑aware multi‑goal delivery and moving‑target delivery in cluttered 3‑D environments, both featuring dynamic obstacles and strict safety constraints.
  • Early‑stage safety gains: Shows a substantial reduction in collision‑induced episode terminations during the first few hundred learning steps.

Methodology

  1. Task Specification → Rules

    • Engineers write a structured mission description (e.g., “avoid no‑fly zones, keep battery > 20 % before returning”).
    • An offline compiler translates this into a deterministic rule set that can recommend actions, forbid unsafe actions, and assign arbitration weights (how much influence the rule should have vs. the RL policy).
  2. Goal‑Conditioned Low‑Level RL

    • The UAV receives a dense reward signal that encodes progress toward the current goal (e.g., distance to a moving victim).
    • A standard off‑policy algorithm (e.g., DDPG/SAC) is used, but the policy is conditioned on the current goal so the same network can handle many waypoints without retraining (a minimal sketch follows this list).
  3. Mode‑Aware Prioritized Replay

    • Each transition stored in the replay buffer is tagged with metadata from the high‑level advisor (e.g., “safe region”, “near battery limit”).
    • The replay sampler gives higher priority to transitions that are both informative for learning and aligned with safety rules, allowing the agent to learn the right behavior faster (sketched after this list).
  4. Arbitration at Runtime

    • At each decision step, the system computes a weighted blend of the rule‑recommended action and the RL action:
      \[ a = w_{\text{rule}} \cdot a_{\text{rule}} + (1 - w_{\text{rule}}) \cdot a_{\text{RL}} \]
    • The weight \(w_{\text{rule}}\) is dynamic: it rises in high‑risk regimes (low battery, dense obstacles) and falls when the environment is benign (sketched after this list).
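
To make step 2 concrete, here is a minimal PyTorch sketch of a goal‑conditioned actor. The class name, network sizes, and the dense‑reward helper are illustrative assumptions, not the paper’s implementation; the point is simply that concatenating the goal with the observation lets one network serve many waypoints.

```python
import torch
import torch.nn as nn

class GoalConditionedActor(nn.Module):
    """Goal-conditioned actor (illustrative sketch, not the paper's code).

    Observation and goal are concatenated, so one network can serve many
    waypoints without retraining."""

    def __init__(self, obs_dim: int, goal_dim: int, act_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + goal_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim), nn.Tanh(),  # bounded actions in [-1, 1]
        )

    def forward(self, obs: torch.Tensor, goal: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([obs, goal], dim=-1))

def dense_reward(pos: torch.Tensor, goal_pos: torch.Tensor) -> torch.Tensor:
    # Dense progress signal: negative distance to the (possibly moving) goal.
    return -torch.norm(goal_pos - pos, dim=-1)
```

In a DDPG/SAC setup the critic would be conditioned on the goal in the same way.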
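Step 3 can be sketched as a thin, mode‑aware layer over prioritized replay. The tag names, the safety‑boost factor, and the class below are assumptions made for illustration; the paper’s actual prioritization scheme may differ.

```python
import random
from dataclasses import dataclass, field

@dataclass
class Transition:
    obs: object
    action: object
    reward: float
    next_obs: object
    done: bool
    tags: frozenset = field(default_factory=frozenset)  # advisor metadata

class ModeAwareReplay:
    """Prioritized replay with rule-derived tags (illustrative sketch).

    Priority = TD-error magnitude (informativeness) times a boost when the
    high-level advisor flagged the transition as safety-relevant."""

    SAFETY_TAGS = {"near_battery_limit", "near_obstacle", "no_fly_boundary"}

    def __init__(self, capacity: int = 100_000, safety_boost: float = 2.0):
        self.capacity = capacity
        self.safety_boost = safety_boost
        self.buffer: list[Transition] = []
        self.priorities: list[float] = []

    def add(self, t: Transition, td_error: float = 1.0) -> None:
        prio = abs(td_error) + 1e-3  # small floor so no transition is starved
        if t.tags & self.SAFETY_TAGS:
            prio *= self.safety_boost  # rule-aligned transitions replay more often
        if len(self.buffer) >= self.capacity:  # drop oldest when full
            self.buffer.pop(0)
            self.priorities.pop(0)
        self.buffer.append(t)
        self.priorities.append(prio)

    def sample(self, batch_size: int) -> list[Transition]:
        # Weighted sampling with replacement, proportional to priority.
        return random.choices(self.buffer, weights=self.priorities, k=batch_size)
```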
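Finally, the runtime blend in step 4 is a convex combination of the two action proposals. Only the blending formula comes from the text above; the risk proxy and the weight schedule below are illustrative assumptions.

```python
import numpy as np

def rule_weight(battery_frac: float, obstacle_density: float,
                w_min: float = 0.1, w_max: float = 0.9) -> float:
    """Dynamic arbitration weight (illustrative schedule, not the paper's).

    Rises toward w_max in high-risk regimes (low battery, dense obstacles)
    and falls toward w_min when the environment is benign."""
    risk = max(1.0 - battery_frac, obstacle_density)  # crude risk proxy in [0, 1]
    return w_min + (w_max - w_min) * float(np.clip(risk, 0.0, 1.0))

def arbitrate(a_rule: np.ndarray, a_rl: np.ndarray, w_rule: float) -> np.ndarray:
    # a = w_rule * a_rule + (1 - w_rule) * a_RL, as in the formula above.
    return w_rule * a_rule + (1.0 - w_rule) * a_rl

# Low battery near obstacles: the rule-recommended action dominates the blend.
w = rule_weight(battery_frac=0.15, obstacle_density=0.7)
action = arbitrate(np.array([0.0, 0.0, 1.0]), np.array([0.4, -0.2, 0.1]), w)
```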

Results & Findings

| Task | Metric | Baseline (pure RL) | Proposed Hybrid |
| --- | --- | --- | --- |
| Battery‑aware multi‑goal delivery | Collision terminations per 10k steps (early phase) | 27 | 9 |
| Moving‑target delivery | Sample efficiency (steps to 80 % success) | 45 k | 28 k |
| Both tasks | Overall mission success (after 100k steps) | 90 % | 92 % |

  • Safety: The hybrid system cuts early collisions by ~65 %, meaning the UAV can operate in real‑world airspace much sooner.
  • Sample efficiency: By reusing rule‑guided experiences, the RL component reaches a competent policy roughly 30 % faster.
  • Adaptability: Even without any offline pre‑training, the agent learns to chase moving targets and respect battery constraints, showing that the rule coach does not lock the policy into a static behavior.

Practical Implications

  • Rapid field deployment: Rescue teams can upload a mission spec and launch UAVs without spending weeks on simulation pre‑training, dramatically shortening response times.
  • Regulatory compliance: The rule‑based layer guarantees that hard safety constraints (no‑fly zones, minimum battery) are never violated, easing certification with aviation authorities.
  • Developer‑friendly integration: The high‑level advisor is expressed in a declarative JSON/YAML format; developers can tweak safety policies without touching the RL code (a minimal spec sketch follows this list).
  • Transferable to other domains: Any robotics problem with clear safety rules (warehouse drones, autonomous forklifts, planetary rovers) can adopt the same coach‑plus‑RL pattern.
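
As a rough illustration of such a declarative spec, the sketch below parses a hypothetical JSON mission description and compiles one rule into a plain Python check. The schema and field names are invented for this example; the paper does not publish its exact format.

```python
import json

# Hypothetical mission spec: the schema and field names are invented for
# illustration and are not the paper's published format.
MISSION_SPEC = json.loads("""
{
  "no_fly_zones": [{"center": [120.0, 40.0], "radius_m": 50.0}],
  "min_battery_frac": 0.20,
  "rules": [
    {"if": "battery_below_min", "then": "return_to_base", "weight": 0.9},
    {"if": "inside_no_fly_zone", "then": "forbid_heading", "weight": 1.0}
  ]
}
""")

def battery_rule(battery_frac: float) -> str | None:
    """Deterministic check compiled from the spec: recommends an action
    (or None), entirely outside the RL code."""
    if battery_frac < MISSION_SPEC["min_battery_frac"]:
        return "return_to_base"
    return None

print(battery_rule(0.15))  # -> return_to_base
```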

Limitations & Future Work

  • Rule authoring overhead: Crafting comprehensive, conflict‑free rule sets still requires domain expertise; automated rule synthesis is an open challenge.
  • Scalability of arbitration: The simple linear weighting may not capture complex interactions in highly dynamic environments; future work could explore learned meta‑controllers for arbitration.
  • Simulation‑to‑real gap: Although the paper uses limited simulation, transferring to real UAV hardware may expose sensor noise and latency issues not captured in the current experiments.
  • Extending to multi‑agent teams: Coordinating several UAVs under a shared rule coach raises questions about conflict resolution and communication overhead, a promising direction for follow‑up studies.

Authors

  • Mahya Ramezani
  • Holger Voos

Paper Information

  • arXiv ID: 2604.26833v1
  • Categories: cs.RO, cs.AI, cs.LG
  • Published: April 29, 2026