[Paper] BEVDilation: LiDAR-Centric Multi-Modal Fusion for 3D Object Detection
Source: arXiv - 2512.02972v1
Overview
The paper BEVDilation introduces a new way to fuse LiDAR and camera data for 3D object detection that puts LiDAR at the core of the pipeline. By treating image‑derived BEV features as guidance rather than raw inputs, the authors substantially reduce the spatial misalignment that typically hurts performance when image‑based depth estimates are noisy.
Key Contributions
- LiDAR‑centric fusion paradigm – prioritizes accurate LiDAR geometry and uses camera cues only as implicit guidance.
- Sparse Voxel Dilation Block – densifies foreground voxels by injecting image priors, mitigating point‑cloud sparsity.
- Semantic‑Guided BEV Dilation Block – enriches LiDAR feature diffusion with semantic information from images and captures long‑range context.
- Robustness to depth noise – demonstrates that the guidance‑only approach is far less sensitive to erroneous depth estimates than naïve concatenation.
- State‑of‑the‑art results on nuScenes – outperforms existing multi‑modal detectors while keeping inference speed competitive.
Methodology
- Base LiDAR Backbone – The system starts with a conventional voxel‑based LiDAR encoder that produces a BEV feature map.
- Image‑to‑BEV Projection (Guidance Only) – Camera images are processed by a 2‑D CNN, then projected into BEV space using estimated depth. Instead of being concatenated with the LiDAR features, these image BEV features are kept separate and later applied as soft guidance.
- Sparse Voxel Dilation Block
- Identifies foreground voxels (e.g., potential vehicle locations).
- Uses the projected image BEV as a mask to “dilate” these voxels, filling gaps caused by LiDAR sparsity.
- Semantic‑Guided BEV Dilation Block
- Takes the dilated voxel map and runs a diffusion‑style operation that spreads semantic cues (road, vehicle, pedestrian) from the image across the LiDAR BEV.
- Incorporates a long‑range context module (e.g., deformable attention) to capture relationships beyond the immediate neighborhood.
- Detection Head – The enriched BEV feature map feeds into a standard anchor‑free 3D detection head that predicts bounding boxes and class scores.
The overall pipeline can be visualized as LiDAR → BEV encoder → (guided dilation using image BEV) → enriched BEV → detector.
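The two dilation blocks can be sketched as follows. This is a toy NumPy illustration of the ideas described above, not the authors' implementation: the shapes, the foreground criterion, the 3×3 binary dilation, and the semantic‑weighted smoothing are all simplifying assumptions standing in for the paper's learned modules.

```python
import numpy as np

def sparse_voxel_dilation(lidar_bev, image_bev, fg_thresh=0.5):
    """Densify foreground cells of a LiDAR BEV using the image BEV as a prior.

    lidar_bev, image_bev: (H, W) single-channel feature maps (toy shapes).
    """
    # Foreground wherever either modality responds strongly (assumed criterion).
    fg_mask = (np.abs(lidar_bev) > fg_thresh) | (image_bev > fg_thresh)
    # 3x3 binary dilation of the foreground mask.
    dilated = np.zeros_like(fg_mask)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            dilated |= np.roll(np.roll(fg_mask, dy, axis=0), dx, axis=1)
    # Fill newly activated empty cells with image-derived features (soft guidance).
    return np.where(dilated & (lidar_bev == 0), image_bev, lidar_bev)

def semantic_guided_dilation(bev, semantic_bev, iters=2):
    """Diffuse BEV features, gated by image semantic confidence (assumed form)."""
    for _ in range(iters):
        # Average each cell with its 4-neighbourhood, weighted by semantics.
        neigh = (np.roll(bev, 1, 0) + np.roll(bev, -1, 0) +
                 np.roll(bev, 1, 1) + np.roll(bev, -1, 1)) / 4.0
        bev = bev + semantic_bev * (neigh - bev)
    return bev

# Toy usage: a sparse LiDAR BEV enriched by a dense image BEV.
lidar = np.zeros((8, 8)); lidar[3, 3] = 1.0   # one occupied voxel
image = np.full((8, 8), 0.6)                  # dense image prior
sem   = np.full((8, 8), 0.3)                  # semantic confidence map
enriched = semantic_guided_dilation(sparse_voxel_dilation(lidar, image), sem)
```

In the real model both blocks are learned and operate on multi-channel features; the sketch only shows the data flow: densify foreground first, then spread semantics.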
Results & Findings
- nuScenes validation: BEVDilation achieves +1.8 % mAP and +2.3 % NDS over the previous best LiDAR‑camera fusion model, while adding only ~10 ms of latency.
- Depth‑noise robustness test: When synthetic depth noise is added to the image branch, performance drops < 0.5 % for BEVDilation, compared to > 3 % for naïve concatenation methods.
- Ablation studies: Removing either the Sparse Voxel Dilation or the Semantic‑Guided BEV Dilation reduces mAP by ~1 % each, confirming that both blocks contribute uniquely.
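The depth‑noise robustness test can be illustrated with a toy experiment: perturb per‑pixel depth with Gaussian noise before projecting image features to BEV, and measure how much feature mass lands in the wrong cell. The projection model, noise scale, and metric below are assumptions for illustration, not the paper's exact protocol.

```python
import numpy as np

rng = np.random.default_rng(0)

def project_to_bev(feat, depth, bev_size=16, max_depth=40.0):
    """Scatter a 1-D row of image features into BEV rows by depth bin (toy model)."""
    bev = np.zeros((bev_size, feat.shape[0]))
    rows = np.clip((depth / max_depth * bev_size).astype(int), 0, bev_size - 1)
    for col, row in enumerate(rows):
        bev[row, col] += feat[col]
    return bev

feat  = rng.random(32)                      # image features along one row
depth = rng.uniform(5.0, 35.0, size=32)     # clean per-pixel depth (metres)
noisy = depth + rng.normal(0.0, 2.0, 32)    # sigma = 2 m synthetic noise (assumed)

clean_bev = project_to_bev(feat, depth)
noisy_bev = project_to_bev(feat, noisy)
# Misplaced mass: fraction of feature energy that lands in the wrong BEV cell.
misplacement = np.abs(clean_bev - noisy_bev).sum() / feat.sum()
```

Under naïve concatenation the misplaced mass directly corrupts the fused features; in BEVDilation the image BEV only guides dilation of geometrically accurate LiDAR features, which is why the reported degradation stays under 0.5 %.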
Practical Implications
- Safer autonomous driving stacks – By relying on LiDAR geometry first, the detector remains reliable even when camera depth estimation fails (e.g., adverse lighting or weather).
- Easier integration – Existing LiDAR‑only pipelines can adopt BEVDilation by plugging in the two dilation blocks, without redesigning the whole backbone.
- Edge‑friendly deployment – The method adds only modest compute overhead, making it suitable for real‑time inference on automotive‑grade GPUs or specialized ASICs.
- Improved perception for low‑density LiDARs – The sparsity‑filling mechanism is especially valuable for cost‑reduced LiDAR sensors that produce fewer points per frame.
Limitations & Future Work
- The approach still depends on a reasonably accurate depth estimate for the image‑to‑BEV projection; extreme depth failures could limit guidance quality.
- Experiments are confined to the nuScenes dataset; broader validation on other domains (e.g., highway or indoor robotics) is needed.
- The authors suggest exploring self‑supervised semantic guidance and dynamic dilation rates as future work, to better adapt to varying scene densities.
Authors
- Guowen Zhang
- Chenhang He
- Liyi Chen
- Lei Zhang
Paper Information
- arXiv ID: 2512.02972v1
- Categories: cs.CV, cs.RO
- Published: December 2, 2025