[Paper] MANTA: Physics-Informed Generalized Underwater Object Tracking

Published: November 28, 2025 at 12:59 PM EST
4 min read
Source: arXiv


Overview

Underwater object tracking has long lagged behind its terrestrial counterpart because water’s physics—wavelength‑dependent attenuation and scattering—dramatically alter a target’s appearance as depth and water conditions change. The paper “MANTA: Physics‑Informed Generalized Underwater Object Tracking” tackles this gap by marrying physical models of light propagation with modern deep‑learning‑based tracking, delivering a system that stays robust across diverse underwater scenes.

Key Contributions

  • Physics‑aware contrastive pre‑training: Introduces a dual‑positive contrastive loss that couples temporal consistency with Beer‑Lambert‑based augmentations, teaching the encoder to ignore water‑induced color/contrast shifts.
  • Two‑stage tracking pipeline: Combines a fast motion‑based tracker with a secondary, physics‑informed association module that fuses geometric consistency and appearance similarity for re‑identification during occlusions or drift.
  • New evaluation metrics: Proposes Center‑Scale Consistency (CSC) and Geometric Alignment Score (GAS) to measure geometric fidelity beyond the traditional IoU‑based Success AUC.
  • Comprehensive benchmark suite: Validates the approach on four large‑scale underwater datasets (WebUOT‑1M, UOT32, UTB180, UWCOT220), achieving up to 6.2 % higher Success AUC than the previous state‑of‑the‑art.
  • Real‑time performance: Maintains efficient runtime suitable for on‑board processing on autonomous underwater vehicles (AUVs) or ROVs.
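The dual‑positive loss is only described at a high level in this summary. As an illustrative sketch (not the authors' implementation), an InfoNCE‑style objective where each anchor has two positives — the adjacent frame and a Beer‑Lambert‑augmented copy — might look like:

```python
import numpy as np

def dual_positive_info_nce(anchor, pos_temporal, pos_physics, negatives, tau=0.1):
    """InfoNCE-style loss with two positives per anchor (illustrative sketch).

    anchor:        (d,) embedding of the anchor frame
    pos_temporal:  (d,) embedding of the temporally adjacent frame
    pos_physics:   (d,) embedding of the Beer-Lambert-augmented frame
    negatives:     (n, d) embeddings of unrelated frames
    """
    def normalize(v):
        return v / np.linalg.norm(v, axis=-1, keepdims=True)

    a = normalize(anchor)
    positives = normalize(np.stack([pos_temporal, pos_physics]))
    negs = normalize(negatives)

    pos_sim = positives @ a / tau        # cosine similarity to each positive, (2,)
    neg_sim = negs @ a / tau             # cosine similarity to each negative, (n,)

    # log-sum-exp over all candidates; average the per-positive InfoNCE terms
    denom = np.logaddexp.reduce(np.concatenate([pos_sim, neg_sim]))
    return float(np.mean(denom - pos_sim))
```

Pulling both positives toward the anchor while pushing negatives away is what encourages the encoder to treat water‑induced color shifts as nuisance variation rather than identity changes.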

Methodology

  1. Physics‑driven data augmentation – Using the Beer‑Lambert law, the authors synthesize realistic underwater degradations (color cast, contrast loss) on existing video frames. This forces the network to see the same object under many physically plausible appearances.
  2. Dual‑positive contrastive learning – For each anchor frame, two positives are generated: (a) the temporally adjacent frame (ensuring temporal coherence) and (b) an augmented version with Beer‑Lambert effects (ensuring invariance to water optics). The encoder is trained to pull these together while pushing away unrelated frames.
  3. Primary motion tracker – A lightweight correlation‑filter or Siamese‑based tracker runs frame‑by‑frame, providing fast location estimates.
  4. Secondary physics‑informed association – When the primary tracker’s confidence drops (e.g., due to occlusion), a re‑identification module evaluates candidate detections using:
    • Geometric consistency (predicted motion trajectory, scale change)
    • Appearance similarity (features from the physics‑aware encoder)
      The best match is selected to re‑anchor the track.
  5. Metric suite – CSC measures how well the predicted center and scale follow the ground‑truth trajectory, while GAS evaluates alignment of the predicted bounding box shape with the true object geometry.
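The Beer‑Lambert augmentation in step 1 can be sketched with the standard underwater image‑formation model, where each color channel is attenuated exponentially with distance and mixed with veiling (backscatter) light. The coefficients below are illustrative placeholders, not values from the paper:

```python
import numpy as np

def beer_lambert_augment(img, depth,
                         attenuation=(0.45, 0.12, 0.05),
                         backscatter=(0.0, 0.2, 0.3)):
    """Synthesize an underwater color cast on an RGB image in [0, 1].

    Per channel c: I_c = J_c * exp(-beta_c * depth) + B_c * (1 - exp(-beta_c * depth)),
    where beta_c is a wavelength-dependent attenuation coefficient and B_c the
    backscatter light. Red attenuates fastest, so with growing depth the image
    drifts toward blue-green -- the color cast the encoder must learn to ignore.
    """
    beta = np.asarray(attenuation, dtype=float)
    B = np.asarray(backscatter, dtype=float)
    t = np.exp(-beta * depth)                      # per-channel transmission
    return np.clip(img * t + B * (1.0 - t), 0.0, 1.0)
```

Sampling `depth` (and, optionally, the coefficients) at random per training clip yields the family of physically plausible appearances described above.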

Results & Findings

| Dataset | Success AUC (MANTA) | Δ over previous SOTA | Runtime (FPS) |
| --- | --- | --- | --- |
| WebUOT‑1M | 71.4 % | +5.8 % | 28 |
| UOT32 | 68.9 % | +6.2 % | 30 |
| UTB180 | 73.1 % | +4.5 % | 27 |
| UWCOT220 | 70.2 % | +5.1 % | 29 |

  • Robustness to depth & turbidity: Ablation studies show that removing Beer‑Lambert augmentations drops AUC by ~3 %, confirming the importance of physics‑aware training.
  • Long‑term stability: On sequences with prolonged occlusions, the secondary association module reduces drift events by 40 % compared to a vanilla Siamese tracker.
  • Metric validation: CSC and GAS correlate strongly (ρ ≈ 0.78) with human‑rated tracking quality, indicating they capture failure modes missed by IoU alone.
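The exact formulas for CSC and GAS are not reproduced in this summary. A hypothetical sketch of a center‑and‑scale consistency score — penalizing normalized center drift and log‑scale mismatch per frame, then mapping to (0, 1] — might look like:

```python
import numpy as np

def center_scale_consistency(pred_boxes, gt_boxes):
    """Hypothetical CSC-style score (not the paper's exact definition).

    Boxes are (N, 4) arrays of [cx, cy, w, h]. Center error is normalized by
    the ground-truth box diagonal so the score is resolution-independent;
    scale error is the summed absolute log-ratio of widths and heights.
    """
    pred = np.asarray(pred_boxes, dtype=float)
    gt = np.asarray(gt_boxes, dtype=float)

    diag = np.sqrt(gt[:, 2] ** 2 + gt[:, 3] ** 2)
    center_err = np.linalg.norm(pred[:, :2] - gt[:, :2], axis=1) / diag
    scale_err = np.abs(np.log(pred[:, 2:] / gt[:, 2:])).sum(axis=1)

    # 1.0 for a perfect track, decaying toward 0 as drift/scale error grows
    return float(np.mean(np.exp(-(center_err + scale_err))))
```

Unlike IoU, a score of this shape keeps discriminating after the boxes stop overlapping, which is one way such metrics can capture failure modes that IoU alone misses.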

Practical Implications

  • AUV/ROV navigation: Reliable object tracking enables autonomous inspection of pipelines, coral reefs, or shipwrecks without frequent operator intervention.
  • Marine wildlife monitoring: Researchers can follow fish or marine mammals across varying depths, improving data collection for ecology studies.
  • Underwater AR/VR: Real‑time, geometry‑consistent tracking is a prerequisite for overlaying virtual annotations on live video feeds for diver assistance.
  • Edge deployment: Since MANTA runs at ~28 FPS on a modest GPU (e.g., NVIDIA Jetson Xavier), it can be embedded on board small robots where power and compute budgets are tight.
  • Transferable framework: The dual‑positive contrastive scheme can be repurposed for any domain where physical degradations (e.g., fog, smoke, dust) affect visual appearance, extending its relevance beyond marine environments.

Limitations & Future Work

  • Domain‑specific augmentations: The current Beer‑Lambert model assumes homogeneous water; highly stratified or particulate‑rich waters may still challenge the encoder.
  • Dataset bias: Benchmarks focus on relatively clear‑water scenes; performance in murky, low‑visibility conditions remains to be quantified.
  • Scalability of secondary association: While efficient for a single target, multi‑object scenarios could increase computational load; future work may explore hierarchical or attention‑based association mechanisms.
  • End‑to‑end training: The two‑stage pipeline is still modular; jointly optimizing motion prediction and physics‑informed re‑identification could yield further gains.

Overall, MANTA demonstrates that embedding domain physics directly into representation learning and tracking logic can bridge the gap between terrestrial computer‑vision breakthroughs and the demanding underwater world—a promising direction for any vision system operating under non‑ideal physical conditions.

Authors

  • Suhas Srinath
  • Hemang Jamadagni
  • Aditya Chadrasekar
  • Prathosh AP

Paper Information

  • arXiv ID: 2511.23405v1
  • Categories: cs.CV
  • Published: November 28, 2025
  • PDF: Download PDF
