[Paper] Assessing Vision-Language Models for Perception in Autonomous Underwater Robotic Software
Source: arXiv - 2602.10655v1
Overview
This paper investigates how modern Vision‑Language Models (VLMs) perform as perception components in autonomous underwater robots (AURs). By focusing on the detection of underwater trash—a task that typifies the low‑visibility, noisy conditions AURs face—the authors provide the first software‑engineering‑centric evaluation of VLM reliability, uncertainty, and suitability for maritime robotics.
Key Contributions
- Empirical benchmark of several state‑of‑the‑art VLMs (e.g., CLIP, BLIP, Flamingo) on underwater trash detection datasets.
- Quantitative analysis of model uncertainty using Monte‑Carlo dropout and predictive entropy, linking uncertainty to real‑world risk assessment.
- Guidelines for software engineers on selecting and integrating VLMs into AUR perception pipelines, based on trade‑offs among accuracy, robustness, and computational cost.
- Open‑source evaluation framework that can be reused for other underwater perception tasks (e.g., marine life monitoring, infrastructure inspection).
Methodology
- Dataset preparation – The authors curated a collection of underwater images containing various types of debris (plastic bags, bottles, nets) captured in different lighting and turbidity conditions. Images were manually annotated to serve as ground truth.
- Model selection – Four widely used VLMs were chosen:
  - CLIP (image‑text contrastive learning)
  - BLIP (bootstrapped language‑image pre‑training)
  - Flamingo (few‑shot multimodal reasoning)
  - ViLT (vision‑language transformer without a CNN backbone)
- Inference pipeline – Each VLM was prompted with natural‑language queries such as “Is there trash in this image?” The models output a confidence score for the presence of trash.
- Uncertainty estimation – For each prediction, the authors performed 30 stochastic forward passes with dropout enabled, then computed predictive entropy and variance as uncertainty metrics.
- Evaluation metrics – Standard classification measures (precision, recall, F1) were complemented by calibration curves (expected calibration error) and a risk‑aware metric that penalizes high‑confidence false positives.
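The uncertainty-estimation step above (repeated stochastic forward passes with dropout enabled, then predictive entropy and per-class variance) can be sketched as follows. This is a minimal illustration, not the authors' code: the probability arrays are stand-ins for real model outputs, and entropy is computed over the class distribution so that values above 1 nat are possible, as in the paper's reported threshold.

```python
import numpy as np

def mc_dropout_uncertainty(pass_probs):
    """Summarize T stochastic forward passes of a C-way classifier.

    pass_probs: array of shape (T, C), each row a probability
    distribution produced by one dropout-enabled forward pass.
    Returns (mean distribution, predictive entropy in nats,
    per-class variance across passes).
    """
    p = np.asarray(pass_probs, dtype=float)
    p_mean = p.mean(axis=0)                      # average over passes
    eps = 1e-12                                  # guard against log(0)
    entropy = -np.sum(p_mean * np.log(p_mean + eps))
    return p_mean, entropy, p.var(axis=0)

# Toy example: 30 passes (as in the paper), hypothetical values.
rng = np.random.default_rng(0)
uncertain = rng.dirichlet(np.ones(5), size=30)   # near-uniform -> high entropy
_, high_h, _ = mc_dropout_uncertainty(uncertain)
```

In practice the same summary would be computed per detection, with the entropy value feeding the risk-aware metric described above.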
Results & Findings
| Model | F1 Score | ECE (expected calibration error; lower is better) | Avg. Inference Time (ms) |
|---|---|---|---|
| CLIP | 0.78 | 0.12 | 45 |
| BLIP | 0.73 | 0.08 | 62 |
| Flamingo | 0.71 | 0.15 | 120 |
| ViLT | 0.66 | 0.13 | 38 |
- Accuracy vs. Calibration: CLIP achieved the highest raw detection performance, but BLIP was better calibrated, meaning its confidence scores more faithfully reflected true correctness.
- Uncertainty as a safety signal: Predictions with high predictive entropy (>1.5 nats) corresponded to a 70 % false‑positive rate, suggesting that uncertainty thresholds can be used to trigger fallback behaviors (e.g., re‑scanning or switching to sonar).
- Robustness to turbidity: All models degraded as water clarity worsened, but to markedly different degrees; BLIP’s performance dropped only 8 % from clear to murky water, whereas ViLT fell 20 %.
- Computational trade‑offs: ViLT was the fastest but least accurate; Flamingo offered richer contextual reasoning at the cost of latency unsuitable for real‑time control loops.
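The second finding suggests a simple gating rule: trust the VLM only when its predictive entropy is below the reported 1.5-nat threshold. A sketch of such a gate is below; the confidence floor and the action names are illustrative assumptions, not from the paper.

```python
from enum import Enum

ENTROPY_THRESHOLD_NATS = 1.5  # high-entropy threshold reported in the paper

class PerceptionAction(Enum):
    ACCEPT = "accept_vlm_detection"
    RESCAN = "rescan_scene"
    SONAR_FALLBACK = "switch_to_sonar"

def decide(confidence, entropy_nats,
           conf_floor=0.5,
           entropy_threshold=ENTROPY_THRESHOLD_NATS):
    """Map one VLM detection to a control action.

    High predictive entropy -> the detection is unreliable, so fall
    back to sonar. Low entropy but low confidence -> re-scan the
    scene. Otherwise accept. `conf_floor` is a hypothetical tuning
    parameter, not a value from the paper.
    """
    if entropy_nats > entropy_threshold:
        return PerceptionAction.SONAR_FALLBACK
    if confidence < conf_floor:
        return PerceptionAction.RESCAN
    return PerceptionAction.ACCEPT
```

A gate like this is what the paper means by using uncertainty to "trigger fallback behaviors": the VLM stays in the loop, but its output is only acted on when both signals look trustworthy.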
Practical Implications
- Risk‑aware perception pipelines – Developers can embed uncertainty checks to decide when to trust a VLM’s output or fall back to traditional sonar sensing or sonar–image fusion, improving overall system safety.
- Model selection guidance – For missions where real‑time response is critical (e.g., obstacle avoidance), CLIP or ViLT may be preferred. For inspection tasks where false alarms are costly, BLIP’s better calibration makes it a stronger candidate.
- Reduced labeling burden – Because VLMs can leverage textual prompts, engineers can prototype new detection categories (e.g., “marine debris”) without collecting large annotated image sets, accelerating development cycles.
- Integration with existing ROS/ROS‑2 stacks – The open‑source framework provides ROS nodes that wrap VLM inference and uncertainty estimation, enabling drop‑in replacement of current vision modules.
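The prompt-driven prototyping described above works by comparing an image embedding against one embedding per text prompt, CLIP-style. The sketch below uses toy embedding vectors to keep it self-contained; a real pipeline would obtain both from a pretrained image/text encoder, and the prompt strings and temperature value are illustrative.

```python
import numpy as np

def zero_shot_scores(image_emb, text_embs, temperature=0.07):
    """CLIP-style zero-shot classification over text prompts.

    Computes cosine similarity between one image embedding and one
    embedding per prompt, then softmaxes the scaled similarities
    into class probabilities.
    """
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = txt @ img / temperature  # scaled cosine similarities
    logits -= logits.max()            # numerical stability for exp
    probs = np.exp(logits)
    return probs / probs.sum()

# Hypothetical prompts for a new category: no labeled images needed,
# only new text -- the point of the reduced labeling burden.
prompts = ["a photo of marine debris",
           "a photo of a fish",
           "a photo of empty seabed"]
```

Swapping in a new detection category is then a one-line change to `prompts`, which is why the paper argues VLMs shorten development cycles for new perception tasks.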
Limitations & Future Work
- Domain shift: The evaluation used a curated dataset; performance on completely unseen underwater locales (different flora/fauna, lighting spectra) remains untested.
- Hardware constraints: Experiments ran on high‑end GPUs; embedded underwater platforms may need model pruning or quantization, which could affect accuracy and uncertainty behavior.
- Multi‑modal fusion: The study focused on vision‑language alone. Future research should explore combining VLMs with acoustic or lidar data to further boost robustness.
- Long‑term reliability: The paper does not address model drift over time (e.g., bio‑fouling on camera lenses). Continuous learning or periodic re‑calibration strategies are open research directions.
Authors
- Muhammad Yousaf
- Aitor Arrieta
- Shaukat Ali
- Paolo Arcaini
- Shuai Wang
Paper Information
- arXiv ID: 2602.10655v1
- Categories: cs.SE, cs.RO
- Published: February 11, 2026