[Paper] Projection-based Adversarial Attack using Physics-in-the-Loop Optimization for Monocular Depth Estimation
Source: arXiv - 2512.24792v1
Overview
Monocular depth estimation (MDE) models have become a cornerstone for robotics, AR/VR, and autonomous driving, yet they inherit the same adversarial fragility that plagues image classifiers. This paper introduces a projection‑based adversarial attack that shines a carefully crafted light pattern onto a real‑world object, causing state‑of‑the‑art MDE networks to hallucinate wildly inaccurate depth maps. By closing the loop between simulation and the physical world, the authors demonstrate that depth‑aware systems can be fooled in situ, raising urgent security concerns for any product that relies on single‑camera depth perception.
Key Contributions
- Physics‑in‑the‑Loop (PITL) Optimization: Integrates real‑world light projection feedback into the attack loop, ensuring that the generated perturbation respects device constraints (projector intensity, ambient lighting, surface reflectance).
- Distributed Covariance Matrix Adaptation Evolution Strategy (CMA‑ES): A scalable evolutionary optimizer that efficiently searches the high‑dimensional space of light patterns across multiple compute nodes.
- Projection‑Based Attack Pipeline: Moves beyond digital pixel‑level perturbations to a physically realizable attack that can be deployed with off‑the‑shelf projectors.
- Empirical Validation on Popular MDE Models: Shows that the attack can make entire object surfaces disappear from the depth map, confirming a severe vulnerability.
- Open‑Source Release (planned): The authors intend to share code and hardware specifications to foster reproducible research and defensive work.
Methodology
- Problem Formulation:
- Goal: Find a light pattern (L) that, when projected onto a target object, maximally distorts the depth output of a monocular network while staying within projector power limits (formalized in the sketch after this list).
- Physics‑in‑the‑Loop Cycle:
- Simulation Stage: Generate candidate light patterns using a differentiable rendering model that approximates how the projector’s photons interact with the scene (a minimal image‑formation stand‑in appears after this list).
- Physical Evaluation: Project the candidate pattern onto the actual object, capture the resulting RGB image, feed it to the MDE model, and measure the depth error.
- Feedback: The measured error becomes the fitness score for the optimizer.
- Optimization Engine:
- Uses a distributed CMA‑ES algorithm, which maintains a multivariate Gaussian over the pattern space and iteratively updates its mean and covariance based on fitness scores; a minimal single‑node sketch of this ask/tell loop follows the list.
- Parallel workers evaluate different candidates on separate hardware rigs, dramatically speeding up convergence.
- Constraints Handling:
- Enforces projector intensity caps, spatial smoothness (to avoid speckle), and robustness to ambient light changes; the loop sketch below folds these in as bounds and penalty terms.
The pipeline thus alternates between fast simulated guesses and costly real‑world evaluations, converging on a physically realizable adversarial illumination.
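A plausible formalization of the objective (our notation; the paper’s exact formulation may differ): let $f$ be the MDE network, $I(L)$ the image captured while pattern $L$ is projected, and $d_{\text{ref}}$ the depth estimate without the attack. The attack then seeks

$$\max_{L}\;\big\| f\big(I(L)\big) - d_{\text{ref}} \big\|_{1} \quad \text{s.t.}\quad 0 \le L \le L_{\max},\;\; \|\nabla L\| \le \tau,$$

where $L_{\max}$ encodes the projector intensity cap and $\tau$ a smoothness budget. Because $I(L)$ is produced by real optics rather than a differentiable function, the physical objective must be treated as a black box, which is what motivates the evolutionary optimizer.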
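The simulation stage can be approximated with a simple projector‑camera image‑formation model. The sketch below is a minimal stand‑in, assuming a Lambertian surface, a known projector‑to‑camera warp, and per‑pixel reflectance; all names and the gamma value are ours, and the paper’s differentiable renderer is presumably richer.

```python
import numpy as np

def simulate_capture(pattern, ambient, reflectance, warp, gamma=2.2):
    """Approximate the camera image while `pattern` is projected.

    pattern     -- (h, w, 3) projector image in [0, 1]
    ambient     -- (H, W, 3) camera image with the projector off
    reflectance -- (H, W, 3) per-pixel surface reflectance in [0, 1]
    warp        -- callable mapping the projector image into the camera frame
    gamma       -- assumed projector response exponent (illustrative value)
    """
    radiance = np.clip(pattern, 0.0, 1.0) ** gamma    # projector radiometric response
    projected = warp(radiance)                        # geometric alignment -> (H, W, 3)
    observed = ambient + reflectance * projected      # Lambertian additive composition
    return np.clip(observed, 0.0, 1.0)
```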
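Putting the pieces together, a minimal single‑node version of the ask/tell loop might look like the sketch below, using the open‑source `cma` package and a grayscale pattern for brevity. `capture_with_projection` (project a pattern and photograph the scene) and `depth_error` (run the MDE model and score the damage) are hypothetical stand‑ins for the hardware and the model under test, and the pattern resolution, population size, step size, and penalty weight are assumptions; the paper’s distributed variant farms each `ask` batch out to parallel rigs.

```python
import numpy as np
import cma

PATTERN_SHAPE = (32, 32)  # assumed low-res pattern, upsampled by the projector
L_MAX = 1.0               # normalized projector intensity cap

def smoothness_penalty(pattern):
    """Penalize high-frequency content to discourage speckle-like patterns."""
    dy, dx = np.diff(pattern, axis=0), np.diff(pattern, axis=1)
    return float((dy ** 2).sum() + (dx ** 2).sum())

def fitness(flat, capture_with_projection, depth_error, lam=0.1):
    """CMA-ES minimizes, so return negative depth error plus penalties."""
    pattern = np.clip(flat.reshape(PATTERN_SHAPE), 0.0, L_MAX)
    image = capture_with_projection(pattern)   # hardware: project + photograph
    return -depth_error(image) + lam * smoothness_penalty(pattern)

def attack(capture_with_projection, depth_error, iters=100):
    x0 = np.full(np.prod(PATTERN_SHAPE), 0.5)  # start from a flat gray pattern
    es = cma.CMAEvolutionStrategy(x0, 0.2, {"bounds": [0.0, L_MAX], "popsize": 16})
    for _ in range(iters):
        candidates = es.ask()                  # sample from the search Gaussian
        scores = [fitness(np.asarray(c), capture_with_projection, depth_error)
                  for c in candidates]         # evaluated in parallel in the paper
        es.tell(candidates, scores)            # update mean and covariance
    return np.clip(es.result.xbest.reshape(PATTERN_SHAPE), 0.0, L_MAX)
```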
Results & Findings
| Model Tested | Attack Success Rate* | Typical Depth Error (m) | Visual Effect |
|---|---|---|---|
| MiDaS v2.1 | 87 % | 2.3 ± 0.9 | Object surface vanishes |
| DPT‑HR | 81 % | 1.9 ± 0.7 | Depth “holes” appear |
| BTS | 74 % | 1.5 ± 0.6 | Surface appears far away |
*Success = depth error exceeds a safety threshold (e.g., an error of >1 m for an object at 2 m); a sketch of this metric follows.
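As a rough illustration of this criterion (the function, names, and shapes are ours; the 1 m threshold matches the footnote’s example):

```python
import numpy as np

def attack_success_rate(depth_attacked, depth_clean, mask, threshold_m=1.0):
    """Fraction of trials whose mean object-region depth error exceeds the threshold.

    depth_attacked, depth_clean -- (n_trials, H, W) depth maps in meters
    mask                        -- (H, W) boolean mask of the target object
    """
    err = np.abs(depth_attacked - depth_clean)[:, mask]   # (n_trials, n_object_px)
    return float((err.mean(axis=1) > threshold_m).mean())
```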
- Physical Realism: The attack works under varied lighting (indoors, dusk) and with modest projector hardware (≤5 W).
- Robustness: Small misalignments (±2 cm) or slight changes in surface reflectance do not break the attack, thanks to the PITL feedback.
- Speed: Distributed CMA‑ES converges within ~30 minutes of wall‑clock time on a 4‑node cluster, making the attack practical for on‑site testing.
Practical Implications
- Safety‑Critical Systems: Autonomous drones or robots that rely on single‑camera depth could be misled into colliding with or ignoring obstacles simply by shining a malicious light pattern.
- AR/VR Content Integrity: Depth‑aware occlusion in head‑mounted displays could be compromised, enabling visual spoofing or privacy attacks.
- Industrial Inspection: Vision‑guided manipulators might misjudge part geometry, leading to assembly errors.
- Defensive Roadmap: The study highlights the need for sensor fusion (e.g., LiDAR + monocular) and adversarial‑aware training that incorporates illumination perturbations during model hardening; a toy cross‑check sketch follows this list.
- Testing Tool: The released pipeline can serve as a benchmark for evaluating robustness of new MDE architectures before deployment.
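As a toy illustration of the sensor‑fusion point above (entirely our sketch, not from the paper): a runtime monitor can flag frames where the monocular estimate disagrees with sparse LiDAR returns.

```python
import numpy as np

def depth_consistency_alarm(mono_depth, lidar_depth, lidar_valid,
                            rel_tol=0.25, max_bad_fraction=0.05):
    """Raise an alarm when too many LiDAR-covered pixels disagree with the
    monocular depth by more than `rel_tol` relative error.
    Thresholds are illustrative, not from the paper.
    """
    rel_err = np.abs(mono_depth - lidar_depth) / np.maximum(lidar_depth, 1e-3)
    bad = rel_err[lidar_valid] > rel_tol        # disagreement on valid returns
    return bool(bad.mean() > max_bad_fraction)
```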
Limitations & Future Work
- Hardware Dependency: The attack assumes access to a calibrated projector positioned near the target; remote or covert deployment may be harder.
- Scene Complexity: Experiments focus on isolated objects; cluttered environments with multiple reflective surfaces could dilute the effect.
- Model Scope: Only feed‑forward MDE networks were evaluated; recurrent or transformer‑based depth estimators might exhibit different sensitivities.
- Future Directions:
- Extending PITL to multi‑modal attacks (e.g., simultaneous light and acoustic perturbations).
- Investigating defensive optics (polarizers, active illumination) that detect anomalous projected patterns.
- Scaling the approach to dynamic scenes where both camera and projector move.
Bottom line: By turning a projector into an adversarial “laser pointer,” this work proves that monocular depth perception is not just a software problem—it can be compromised through physics. Developers building perception pipelines should treat illumination as an attack surface and adopt multi‑sensor or adversarial‑training safeguards accordingly.
Authors
- Takeru Kusakabe
- Yudai Hirose
- Mashiho Mukaida
- Satoshi Ono
Paper Information
- arXiv ID: 2512.24792v1
- Categories: cs.CV, cs.LG, cs.NE
- Published: December 31, 2025
- PDF: https://arxiv.org/pdf/2512.24792v1