[Paper] LongFly: Long-Horizon UAV Vision-and-Language Navigation with Spatiotemporal Context Integration

Published: December 26, 2025 at 07:09 AM EST
4 min read
Source: arXiv - 2512.22010v1

Overview

The paper introduces LongFly, a new framework that enables unmanned aerial vehicles (UAVs) to follow natural‑language navigation instructions over long distances while coping with the visual complexity of disaster‑scene environments. By explicitly modeling the spatiotemporal context of past observations and flight trajectories, LongFly dramatically improves the reliability of vision‑and‑language navigation (VLN) for UAVs, a capability that is critical for time‑sensitive search‑and‑rescue missions.

Key Contributions

  • History‑aware spatiotemporal modeling that converts raw, multi‑view UAV footage into compact, expressive context vectors.
  • Slot‑based historical image compression module that dynamically distills redundant visual data into a fixed‑length representation, reducing memory and compute overhead.
  • Spatiotemporal trajectory encoding that captures both the order of visited waypoints and the geometric structure of the flight path.
  • Prompt‑guided multimodal integration that fuses past context with the current visual frame using language prompts, enabling time‑aware reasoning for waypoint prediction.
  • State‑of‑the‑art performance: gains of +7.89 % in success rate (SR) and +6.33 % in success weighted by path length (SPL) over existing UAV VLN baselines, consistent across seen and unseen environments.

Methodology

  1. Data Collection & Pre‑processing – The UAV records RGB images from multiple onboard cameras while executing navigation episodes defined by natural‑language instructions.
  2. Slot‑Based Historical Image Compression
    • The recent visual stream is partitioned into slots (e.g., every 0.5 s or per waypoint).
    • A lightweight attention encoder selects the most informative frames per slot and aggregates them into a fixed‑size vector, discarding redundancy (a minimal sketch follows this list).
  3. Spatiotemporal Trajectory Encoding
    • The UAV’s 3‑D pose sequence (position + orientation) is fed into a transformer‑style encoder that learns temporal dynamics (speed, turn rate) and spatial relationships (relative distances).
  4. Prompt‑Guided Multimodal Integration
    • A language model generates a prompt that describes the current instruction step (e.g., “fly toward the collapsed building”).
    • The prompt conditions a cross‑modal attention layer that merges the compressed visual history, trajectory embedding, and the live camera view, producing a context‑aware representation for decision making (an illustrative fusion example appears below).
  5. Waypoint Prediction & Control
    • The integrated representation is passed to a policy network that outputs the next waypoint or low‑level control commands.
    • The loop repeats until the instruction is satisfied or a timeout occurs.
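To make step 2 concrete, here is a minimal PyTorch sketch of slot‑based history compression: a fixed set of learned slot queries cross‑attends over per‑frame features so that an arbitrarily long visual history collapses into a fixed‑size memory. The class name, slot count, dimensions, and the specific cross‑attention design are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of slot-based historical image compression (step 2).
# Frame features are assumed to come from a pretrained visual backbone;
# slot count, dimensions, and the cross-attention design are illustrative.
import torch
import torch.nn as nn


class SlotHistoryCompressor(nn.Module):
    def __init__(self, feat_dim: int = 512, num_slots: int = 8, num_heads: int = 8):
        super().__init__()
        # Learned slot queries: a fixed-length "memory" regardless of history length.
        self.slot_queries = nn.Parameter(torch.randn(num_slots, feat_dim))
        self.cross_attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(feat_dim)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        """frame_feats: (batch, num_frames, feat_dim) features of past frames.
        Returns (batch, num_slots, feat_dim): a fixed-size visual history."""
        batch = frame_feats.size(0)
        queries = self.slot_queries.unsqueeze(0).expand(batch, -1, -1)
        # Each slot attends over all historical frames and keeps only what it
        # needs, instead of storing the full redundant stream.
        slots, _ = self.cross_attn(queries, frame_feats, frame_feats)
        return self.norm(slots)


if __name__ == "__main__":
    compressor = SlotHistoryCompressor()
    history = torch.randn(2, 120, 512)   # e.g., 120 past frames per episode
    print(compressor(history).shape)     # torch.Size([2, 8, 512])
```

In a design like this, the number of slots bounds the memory and compute spent on history at every step, which is the property the paper credits for the reduced overhead.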

The entire pipeline runs in near real time on a typical UAV edge compute platform (e.g., NVIDIA Jetson), thanks to the compact representations and efficient attention mechanisms.
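As an illustration of step 4, the sketch below shows one way a prompt embedding could drive cross‑modal attention over the compressed history, the trajectory embedding, and current‑view features, ending in a waypoint head (step 5). All module names, token layouts, and dimensions are assumptions made for exposition rather than details taken from the paper.

```python
# Hypothetical sketch of prompt-guided multimodal integration (step 4).
# The prompt embedding is assumed to come from a language model; token layouts,
# dimensions, and the single cross-attention layer are illustrative choices.
import torch
import torch.nn as nn


class PromptGuidedFusion(nn.Module):
    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim), nn.GELU())
        self.waypoint_head = nn.Linear(dim, 3)  # predict a relative (x, y, z) waypoint

    def forward(self, prompt_tokens, history_slots, traj_embed, current_view):
        # Concatenate all available context: compressed visual history,
        # trajectory embedding, and features of the live camera frame.
        context = torch.cat([history_slots, traj_embed, current_view], dim=1)
        # The instruction prompt acts as the query, pulling out only the
        # context relevant to the current navigation step.
        fused, _ = self.cross_attn(prompt_tokens, context, context)
        fused = self.ffn(fused).mean(dim=1)      # pool prompt tokens
        return self.waypoint_head(fused)         # next-waypoint prediction (step 5)


if __name__ == "__main__":
    fusion = PromptGuidedFusion()
    out = fusion(
        torch.randn(1, 12, 512),   # prompt token embeddings
        torch.randn(1, 8, 512),    # compressed history slots
        torch.randn(1, 16, 512),   # trajectory encoding
        torch.randn(1, 49, 512),   # current-view patch features
    )
    print(out.shape)               # torch.Size([1, 3])
```

Because the prompt serves as the attention query, the fused representation is biased toward context relevant to the current instruction step, which is the time‑aware reasoning role described above.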

Results & Findings

Metric                                   LongFly   Prior Best   Δ
Success Rate (SR)                        78.4 %    70.5 %       +7.9 %
Success weighted by Path Length (SPL)    62.1 %    55.8 %       +6.3 %
Inference latency (per step)             45 ms     62 ms        -27 %
  • Robustness to unseen environments: LongFly’s gains hold when the UAV navigates in entirely new disaster zones, indicating strong generalization.
  • Ablation studies show that removing either the slot‑compression or trajectory encoder drops SR by >3 %, confirming that both visual and motion histories are essential.
  • Qualitative analysis reveals smoother flight paths with fewer back‑tracking loops, thanks to the temporal reasoning enabled by the prompt‑guided integration.

Practical Implications

  • Search‑and‑Rescue (SAR): First responders can issue high‑level spoken or textual commands (“search the east side of the collapsed bridge”) and rely on UAVs to autonomously execute long‑range missions without constant tele‑operation.
  • Infrastructure Inspection: LongFly can be adapted for routine inspections of large structures (bridges, power lines) where operators need to specify “inspect the left side of tower 3” and let the drone handle the navigation.
  • Edge Deployment: The compact context representations make it feasible to run the model on existing UAV compute modules, avoiding the need for costly cloud off‑loading and reducing latency—critical for time‑critical disaster response.
  • Developer APIs: The modular design (compression, trajectory encoder, integration) can be exposed as reusable SDK components, allowing robotics developers to plug LongFly into custom flight controllers or simulation environments (an illustrative composition sketch follows this list).
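As a purely hypothetical illustration of how such components might be composed, the snippet below wires a visual backbone, the history compressor, a trajectory encoder, and the fusion module into a single navigation step. Every name here is invented for the example and does not correspond to a released LongFly API.

```python
# Purely illustrative composition of modular components into one control step.
# All names (SlotHistoryCompressor, PromptGuidedFusion, backbone, etc.) are
# hypothetical; they are not part of any released LongFly SDK.
import torch


def navigation_step(backbone, compressor, traj_encoder, fusion,
                    frame_history, pose_history, prompt_tokens, current_frame):
    """One iteration of the perceive -> compress -> fuse -> predict loop."""
    with torch.no_grad():
        history_feats = backbone(frame_history)    # per-frame visual features
        history_slots = compressor(history_feats)  # fixed-size visual memory
        traj_embed = traj_encoder(pose_history)    # spatiotemporal trajectory code
        current_view = backbone(current_frame)     # live camera features
        next_waypoint = fusion(prompt_tokens, history_slots,
                               traj_embed, current_view)
    return next_waypoint  # handed off to the flight controller
```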

Limitations & Future Work

  • Sensor Dependence: The current system assumes reliable RGB vision; performance may degrade in low‑light or heavy smoke where visual cues are scarce.
  • Scalability of Prompt Design: The prompt‑guided integration relies on manually crafted instruction templates; automating prompt generation for arbitrary natural language remains an open challenge.
  • Real‑World Flight Tests: Experiments were conducted in simulated disaster environments; extensive field trials are needed to validate robustness against wind, GPS drift, and communication loss.
  • Future Directions: The authors plan to incorporate multimodal sensors (LiDAR, thermal imaging) to bolster perception in adverse conditions, and to explore continual learning so the UAV can refine its spatiotemporal model during deployment.

Authors

  • Wen Jiang
  • Li Wang
  • Kangyao Huang
  • Wei Fan
  • Jinyuan Liu
  • Shaoyu Liu
  • Hongwei Duan
  • Bin Xu
  • Xiangyang Ji

Paper Information

  • arXiv ID: 2512.22010v1
  • Categories: cs.CV, cs.AI
  • Published: December 26, 2025