[Paper] Projection-based Adversarial Attack using Physics-in-the-Loop Optimization for Monocular Depth Estimation
Source: arXiv - 2512.24792v1
Overview
Monocular depth estimation (MDE) models have become a cornerstone for robotics, AR/VR, and autonomous driving, yet they inherit the same adversarial fragility that plagues image classifiers. This paper introduces a projection‑based adversarial attack that shines a carefully crafted light pattern onto a real‑world object, causing state‑of‑the‑art MDE networks to hallucinate wildly inaccurate depth maps. By closing the loop between simulation and the physical world, the authors demonstrate that depth‑aware systems can be fooled in situ, raising urgent security concerns for any product that relies on single‑camera depth perception.
Key Contributions
- Physics‑in‑the‑Loop (PITL) Optimization: Integrates real‑world light projection feedback into the attack loop, ensuring that the generated perturbation respects device constraints (projector intensity, ambient lighting, surface reflectance).
- Distributed Covariance Matrix Adaptation Evolution Strategy (CMA‑ES): A scalable evolutionary optimizer that efficiently searches the high‑dimensional space of light patterns across multiple compute nodes.
- Projection‑Based Attack Pipeline: Moves beyond digital pixel‑level perturbations to a physically realizable attack that can be deployed with off‑the‑shelf projectors.
- Empirical Validation on Popular MDE Models: Shows that the attack can make entire object surfaces disappear from the depth map, confirming a severe vulnerability.
- Open‑Source Release (planned): The authors intend to share code and hardware specifications to foster reproducible research and defensive work.
Methodology
- Problem Formulation:
- Goal: Find a light pattern (L) that, when projected onto a target object, maximally distorts the depth output of a monocular network while staying within projector power limits (formalized in the sketch after this list).
- Physics‑in‑the‑Loop Cycle:
- Simulation Stage: Generate candidate light patterns using a differentiable rendering model that approximates how the projector’s photons interact with the scene (a minimal image‑formation stand‑in appears after this list).
- Physical Evaluation: Project the candidate pattern onto the actual object, capture the resulting RGB image, feed it to the MDE model, and measure the depth error.
- Feedback: The measured error becomes the fitness score for the optimizer.
- Optimization Engine:
- Uses a distributed CMA‑ES algorithm, which maintains a multivariate Gaussian over the pattern space and iteratively updates its mean and covariance based on fitness scores; a minimal single‑node sketch of this ask/tell loop follows the list.
- Parallel workers evaluate different candidates on separate hardware rigs, dramatically speeding up convergence.
- Constraints Handling:
- Enforces projector intensity caps, spatial smoothness (to avoid speckle), and robustness to ambient light changes; the loop sketch below folds these in as bounds and penalty terms.
The pipeline thus alternates between fast simulated guesses and costly real‑world evaluations, converging on a physically realizable adversarial illumination.
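A plausible formalization of the objective (our notation; the paper’s exact formulation may differ): let $f$ be the MDE network, $I(L)$ the image captured while pattern $L$ is projected, and $d_{\text{ref}}$ the depth estimate without the attack. The attack then seeks

$$\max_{L}\;\big\| f\big(I(L)\big) - d_{\text{ref}} \big\|_{1} \quad \text{s.t.}\quad 0 \le L \le L_{\max},\;\; \|\nabla L\| \le \tau,$$

where $L_{\max}$ encodes the projector intensity cap and $\tau$ a smoothness budget. Because $I(L)$ is produced by real optics rather than a differentiable function, the physical objective must be treated as a black box, which is what motivates the evolutionary optimizer.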
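The simulation stage can be approximated with a simple projector‑camera image‑formation model. The sketch below is a minimal stand‑in, assuming a Lambertian surface, a known projector‑to‑camera warp, and per‑pixel reflectance; all names and the gamma value are ours, and the paper’s differentiable renderer is presumably richer.

```python
import numpy as np

def simulate_capture(pattern, ambient, reflectance, warp, gamma=2.2):
    """Approximate the camera image while `pattern` is projected.

    pattern     -- (h, w, 3) projector image in [0, 1]
    ambient     -- (H, W, 3) camera image with the projector off
    reflectance -- (H, W, 3) per-pixel surface reflectance in [0, 1]
    warp        -- callable mapping the projector image into the camera frame
    gamma       -- assumed projector response exponent (illustrative value)
    """
    radiance = np.clip(pattern, 0.0, 1.0) ** gamma    # projector radiometric response
    projected = warp(radiance)                        # geometric alignment -> (H, W, 3)
    observed = ambient + reflectance * projected      # Lambertian additive composition
    return np.clip(observed, 0.0, 1.0)
```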
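Putting the pieces together, a minimal single‑node version of the ask/tell loop might look like the sketch below, using the open‑source `cma` package and a grayscale pattern for brevity. `capture_with_projection` (project a pattern and photograph the scene) and `depth_error` (run the MDE model and score the damage) are hypothetical stand‑ins for the hardware and the model under test, and the pattern resolution, population size, step size, and penalty weight are assumptions; the paper’s distributed variant farms each `ask` batch out to parallel rigs.

```python
import numpy as np
import cma

PATTERN_SHAPE = (32, 32)  # assumed low-res pattern, upsampled by the projector
L_MAX = 1.0               # normalized projector intensity cap

def smoothness_penalty(pattern):
    """Penalize high-frequency content to discourage speckle-like patterns."""
    dy, dx = np.diff(pattern, axis=0), np.diff(pattern, axis=1)
    return float((dy ** 2).sum() + (dx ** 2).sum())

def fitness(flat, capture_with_projection, depth_error, lam=0.1):
    """CMA-ES minimizes, so return negative depth error plus penalties."""
    pattern = np.clip(flat.reshape(PATTERN_SHAPE), 0.0, L_MAX)
    image = capture_with_projection(pattern)   # hardware: project + photograph
    return -depth_error(image) + lam * smoothness_penalty(pattern)

def attack(capture_with_projection, depth_error, iters=100):
    x0 = np.full(np.prod(PATTERN_SHAPE), 0.5)  # start from a flat gray pattern
    es = cma.CMAEvolutionStrategy(x0, 0.2, {"bounds": [0.0, L_MAX], "popsize": 16})
    for _ in range(iters):
        candidates = es.ask()                  # sample from the search Gaussian
        scores = [fitness(np.asarray(c), capture_with_projection, depth_error)
                  for c in candidates]         # evaluated in parallel in the paper
        es.tell(candidates, scores)            # update mean and covariance
    return np.clip(es.result.xbest.reshape(PATTERN_SHAPE), 0.0, L_MAX)
```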
Results & Findings
| Model Tested | Attack Success Rate* | Typical Depth Error (m) | Visual Effect |
|---|---|---|---|
| MiDaS v2.1 | 87 % | 2.3 ± 0.9 | Object surface vanishes |
| DPT‑HR | 81 % | 1.9 ± 0.7 | Depth “holes” appear |
| BTS | 74 % | 1.5 ± 0.6 | Surface appears far away |
*Success = depth error exceeds a safety threshold (e.g., an error of >1 m for an object at 2 m); a sketch of this metric follows.
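As a rough illustration of this criterion (the function, names, and shapes are ours; the 1 m threshold matches the footnote’s example):

```python
import numpy as np

def attack_success_rate(depth_attacked, depth_clean, mask, threshold_m=1.0):
    """Fraction of trials whose mean object-region depth error exceeds the threshold.

    depth_attacked, depth_clean -- (n_trials, H, W) depth maps in meters
    mask                        -- (H, W) boolean mask of the target object
    """
    err = np.abs(depth_attacked - depth_clean)[:, mask]   # (n_trials, n_object_px)
    return float((err.mean(axis=1) > threshold_m).mean())
```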
- Physical Realism: The attack works under varied lighting (indoors, dusk) and with modest projector hardware (≤5 W).
- Robustness: Small misalignments (±2 cm) or slight changes in surface reflectance do not break the attack, thanks to the PITL feedback.
- Speed: Distributed CMA‑ES converges within ~30 minutes of wall‑clock time on a 4‑node cluster, making the attack practical for on‑site testing.
Practical Implications
- Safety‑Critical Systems: Autonomous drones or robots that rely on single‑camera depth could be misled into colliding with or ignoring obstacles simply by shining a malicious light pattern.
- AR/VR Content Integrity: Depth‑aware occlusion in head‑mounted displays could be compromised, enabling visual spoofing or privacy attacks.
- Industrial Inspection: Vision‑guided manipulators might misjudge part geometry, leading to assembly errors.
- Defensive Roadmap: The study highlights the need for sensor fusion (e.g., LiDAR + monocular) and adversarial‑aware training that incorporates illumination perturbations during model hardening; a toy cross‑check sketch follows this list.
- Testing Tool: The released pipeline can serve as a benchmark for evaluating robustness of new MDE architectures before deployment.
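As a toy illustration of the sensor‑fusion point above (entirely our sketch, not from the paper): a runtime monitor can flag frames where the monocular estimate disagrees with sparse LiDAR returns.

```python
import numpy as np

def depth_consistency_alarm(mono_depth, lidar_depth, lidar_valid,
                            rel_tol=0.25, max_bad_fraction=0.05):
    """Raise an alarm when too many LiDAR-covered pixels disagree with the
    monocular depth by more than `rel_tol` relative error.
    Thresholds are illustrative, not from the paper.
    """
    rel_err = np.abs(mono_depth - lidar_depth) / np.maximum(lidar_depth, 1e-3)
    bad = rel_err[lidar_valid] > rel_tol        # disagreement on valid returns
    return bool(bad.mean() > max_bad_fraction)
```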
Limitations & Future Work
- Hardware Dependency: The attack assumes access to a calibrated projector positioned near the target; remote or covert deployment may be harder.
- Scene Complexity: Experiments focus on isolated objects; cluttered environments with multiple reflective surfaces could dilute the effect.
- Model Scope: Only feed‑forward MDE networks were evaluated; recurrent or transformer‑based depth estimators might exhibit different sensitivities.
- Future Directions:
- Extending PITL to multi‑modal attacks (e.g., simultaneous light and acoustic perturbations).
- Investigating defensive optics (polarizers, active illumination) that detect anomalous projected patterns.
- Scaling the approach to dynamic scenes where both camera and projector move.
Bottom line: By turning a projector into an adversarial “laser pointer,” this work proves that monocular depth perception is not just a software problem—it can be compromised through physics. Developers building perception pipelines should treat illumination as an attack surface and adopt multi‑sensor or adversarial‑training safeguards accordingly.
Authors
- Takeru Kusakabe
- Yudai Hirose
- Mashiho Mukaida
- Satoshi Ono
Paper Information
- arXiv ID: 2512.24792v1
- Categories: cs.CV, cs.LG, cs.NE
- Published: December 31, 2025
- PDF: https://arxiv.org/pdf/2512.24792v1