[Paper] Visual Heading Prediction for Autonomous Aerial Vehicles
Source: arXiv - 2512.09898v1
Overview
The paper presents a vision‑only, data‑driven pipeline that lets an autonomous drone (UAV) reliably locate a ground robot (UGV) and compute the heading correction needed to align with it, without relying on GPS or external motion‑capture systems. By combining a fine‑tuned YOLOv5 detector with a tiny neural network for angle regression, the authors achieve sub‑degree heading accuracy using only a single onboard camera, opening the door to UAV‑UGV teamwork in GPS‑denied or indoor settings.
Key Contributions
- Real‑time UGV detection with a YOLOv5 model reaching ≈95 % accuracy on a custom dataset of >13 k annotated images.
- Lightweight heading‑prediction ANN that consumes bounding‑box geometry and outputs a heading angle with MAE = 0.1506° and RMSE = 0.1957°.
- End‑to‑end, infrastructure‑independent framework that works with a single monocular camera, eliminating the need for GNSS, LiDAR, or external motion‑capture rigs.
- Comprehensive dataset and training pipeline (VICON‑ground‑truthed images) released publicly, facilitating reproducibility and further research.
- Live demo showing UAV‑UGV alignment under dynamic conditions, confirming the approach’s feasibility for real‑world missions.
Methodology
- Data Collection – A VICON motion‑capture system recorded precise UAV and UGV poses while a downward‑facing RGB camera captured the scene. Over 13 k frames were manually annotated with UGV bounding boxes and the corresponding ground‑truth heading angles.
- Object Detection – The authors fine‑tuned YOLOv5 (a popular single‑stage detector) on the annotated set. The model runs at >30 fps on a modest GPU, outputting the UGV’s bounding box (center, width, height).
- Feature Extraction – From each bounding box they compute simple geometric cues (relative size, offset from image center) that correlate with the UAV’s relative orientation to the UGV.
- Heading Regression – A shallow feed‑forward ANN (2 hidden layers, ~200 parameters) ingests these cues and predicts the required yaw angle for the UAV to face the UGV. The network is trained with mean‑squared‑error loss against the VICON headings (a minimal model sketch follows this list).
- Inference Loop – In deployment, the UAV captures a frame, runs YOLOv5, feeds the bounding‑box features to the ANN, and immediately commands a yaw adjustment to align with the ground robot (this loop is also sketched below).
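To make the regression step concrete, here is a minimal PyTorch sketch of the bounding‑box feature extraction and the shallow heading regressor. The four geometric features, the hidden‑layer width (chosen so the network lands near the reported ~200‑parameter budget), and the Adam optimizer are illustrative assumptions, not the authors' exact configuration; only the MSE objective and the bounding‑box‑to‑yaw mapping follow the paper's description.

```python
import torch
import torch.nn as nn

def bbox_features(x_c, y_c, w, h, img_w, img_h):
    """Geometric cues from a YOLO-style box: offset of the box center from the
    image center and relative box size (assumed feature set)."""
    dx = (x_c - img_w / 2) / img_w   # normalized horizontal offset
    dy = (y_c - img_h / 2) / img_h   # normalized vertical offset
    rw = w / img_w                   # relative width
    rh = h / img_h                   # relative height
    return torch.tensor([dx, dy, rw, rh], dtype=torch.float32)

class HeadingRegressor(nn.Module):
    """Shallow feed-forward ANN: two small hidden layers mapping box geometry
    to a yaw correction in degrees (~229 parameters with hidden=12)."""
    def __init__(self, in_dim: int = 4, hidden: int = 12):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

# One training step against VICON ground-truth headings (MSE loss, per the paper).
model = HeadingRegressor()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

def train_step(features: torch.Tensor, gt_heading_deg: torch.Tensor) -> float:
    optimizer.zero_grad()
    loss = loss_fn(model(features), gt_heading_deg)
    loss.backward()
    optimizer.step()
    return loss.item()
```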
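And a hedged sketch of the inference loop. Loading a custom YOLOv5 checkpoint via torch.hub is the standard Ultralytics mechanism, but the weight‑file names, the choice of the first (highest‑confidence) detection, and the send_yaw_command placeholder are assumptions standing in for the paper's deployment code; bbox_features and HeadingRegressor come from the sketch above.

```python
import torch

# Fine-tuned YOLOv5 weights loaded through the Ultralytics hub interface
# (the checkpoint paths below are hypothetical).
detector = torch.hub.load("ultralytics/yolov5", "custom", path="ugv_yolov5.pt")
regressor = HeadingRegressor()
regressor.load_state_dict(torch.load("heading_ann.pt"))
regressor.eval()

def send_yaw_command(yaw_deg: float) -> None:
    """Placeholder for the flight-stack interface (e.g., a MAVLink or ROS call)."""
    print(f"commanded yaw adjustment: {yaw_deg:+.2f} deg")

def align_step(frame) -> None:
    """One iteration of the vision-only alignment loop: detect, regress, command."""
    results = detector(frame)               # YOLOv5 inference on one RGB frame
    boxes = results.xywh[0]                 # rows: (x_c, y_c, w, h, conf, cls)
    if len(boxes) == 0:
        return                              # no UGV detected in this frame
    x_c, y_c, w, h = boxes[0, :4].tolist()  # take the top detection
    feats = bbox_features(x_c, y_c, w, h,
                          img_w=frame.shape[1], img_h=frame.shape[0])
    with torch.no_grad():
        yaw_deg = regressor(feats.unsqueeze(0)).item()
    send_yaw_command(yaw_deg)
```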
Results & Findings
- Detection: YOLOv5 achieved 95 % precision/recall on a held‑out test split, with an average inference time of 12 ms per frame.
- Angle Prediction: The ANN’s MAE of 0.1506° and RMSE of 0.1957° indicate that the predicted heading is virtually indistinguishable from the ground truth, even when the UGV appears at various distances and orientations (the metric definitions are sketched after this list).
- Robustness: Experiments with moving UGVs and varying lighting showed the system maintaining sub‑degree accuracy, confirming resilience to moderate visual disturbances.
- Real‑time Performance: The full pipeline (detection + regression) runs at ~25 fps on an NVIDIA Jetson Xavier, satisfying typical UAV control loop requirements.
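For reference, the reported MAE and RMSE correspond to the standard definitions below (a small NumPy sketch, not the authors' evaluation code):

```python
import numpy as np

def heading_errors(pred_deg: np.ndarray, gt_deg: np.ndarray) -> tuple[float, float]:
    """MAE and RMSE of predicted headings against VICON ground truth, in degrees."""
    err = pred_deg - gt_deg
    mae = float(np.mean(np.abs(err)))
    rmse = float(np.sqrt(np.mean(err ** 2)))
    return mae, rmse
```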
Practical Implications
- GPS‑Denied Operations – Search‑and‑rescue, indoor inspection, or subterranean missions can now rely on pure vision for UAV‑UGV coordination, reducing hardware cost and mission risk.
- Swarm Scalability – Because the model is lightweight, multiple drones can run the pipeline concurrently on edge devices, enabling larger multi‑robot teams without centralized processing.
- Plug‑and‑Play Integration – The approach works with any monocular RGB camera and can be dropped into existing UAV flight stacks (e.g., PX4, ROS) with minimal code changes (see the integration sketch after this list).
- Rapid Prototyping – The released dataset and training scripts let developers fine‑tune the system for different ground‑robot shapes, colors, or camera placements, accelerating custom deployments.
- Safety & Redundancy – Vision‑only heading estimation provides a fallback when GNSS signals are jammed or spoofed, enhancing overall system robustness.
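As an example of the kind of integration meant above, here is a minimal ROS 1 (rospy) node that subscribes to a camera topic and publishes a yaw‑rate command. The topic names, the proportional gain, and the run_pipeline helper (assumed to wrap the detection‑plus‑regression steps from the Methodology section) are all illustrative assumptions, not part of any released code.

```python
import math

import rospy
from cv_bridge import CvBridge
from geometry_msgs.msg import Twist
from sensor_msgs.msg import Image

# run_pipeline(frame) -> yaw correction in degrees (or None if no UGV is seen)
# is assumed to wrap the YOLOv5 + heading-ANN steps sketched in Methodology.
from heading_pipeline import run_pipeline  # hypothetical module

def main() -> None:
    rospy.init_node("ugv_heading_alignment")
    bridge = CvBridge()
    cmd_pub = rospy.Publisher("/uav/cmd_vel", Twist, queue_size=1)

    def on_image(msg: Image) -> None:
        frame = bridge.imgmsg_to_cv2(msg, desired_encoding="bgr8")
        yaw_deg = run_pipeline(frame)
        if yaw_deg is None:
            return  # no UGV detected in this frame
        cmd = Twist()
        # Proportional yaw-rate command in rad/s; the gain is illustrative only.
        cmd.angular.z = 0.5 * math.radians(yaw_deg)
        cmd_pub.publish(cmd)

    rospy.Subscriber("/camera/image_raw", Image, on_image, queue_size=1)
    rospy.spin()

if __name__ == "__main__":
    main()
```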
Limitations & Future Work
- Controlled Environment Bias – Training data were captured in a lab with relatively uniform backgrounds; performance in cluttered outdoor scenes remains to be validated.
- Single‑UGV Focus – The current model assumes one target per frame; handling multiple ground robots or occlusions will require additional detection and data association logic.
- Depth Ambiguity – Using only a monocular camera limits absolute distance estimation; integrating lightweight depth cues (e.g., stereo or monocular depth networks) could improve long‑range alignment.
- Dynamic Lighting & Weather – Future work should test robustness under harsh illumination, rain, or dust, possibly by augmenting training data or employing domain‑adaptation techniques.
Overall, the paper delivers a practical, low‑cost solution for UAV‑UGV heading alignment that can be immediately leveraged by developers building autonomous multi‑robot systems for GPS‑challenged environments.
Authors
- Reza Ahmari
- Ahmad Mohammadi
- Vahid Hemmati
- Mohammed Mynuddin
- Parham Kebria
- Mahmoud Nabil Mahmoud
- Xiaohong Yuan
- Abdollah Homaifar
Paper Information
- arXiv ID: 2512.09898v1
- Categories: cs.RO, cs.AI, cs.CV, cs.MA, eess.SY
- Published: December 10, 2025