[Paper] MV-TAP: Tracking Any Point in Multi-View Videos

Published: December 1, 2025 at 01:59 PM EST
4 min read
Source: arXiv - 2512.02006v1

Overview

MV‑TAP introduces a new way to track arbitrary points across synchronized multi‑view video streams. By explicitly fusing camera geometry with a cross‑view attention module, the system can follow points through occlusions and large viewpoint changes, a setting where single‑camera trackers struggle. The authors also release a synthetic training corpus and real‑world benchmark suites, giving the community a solid foundation for multi‑view point‑tracking research.

Key Contributions

  • Cross‑view attention tracker: A neural architecture that jointly reasons over spatial, temporal, and multi‑camera dimensions to produce consistent point trajectories.
  • Geometry‑aware feature aggregation: Camera extrinsics are used to warp features into a common 3D space before attention, ensuring that the model respects epipolar constraints.
  • Large‑scale synthetic dataset: Over 200 k multi‑view video clips with dense ground‑truth point tracks, covering diverse motions, lighting, and occlusion patterns.
  • Real‑world evaluation suites: Two benchmark collections (indoor motion‑capture arena and outdoor traffic scenes) with manually annotated point tracks for rigorous testing.
  • State‑of‑the‑art performance: MV‑TAP beats prior single‑view and naive multi‑view baselines by 15–30 % on standard metrics such as average endpoint error and tracking recall (both metrics are sketched right after this list).
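For readers who want to sanity‑check such numbers, both metrics are simple to compute. The NumPy sketch below is purely illustrative; the function names and the toy data are ours, not taken from the paper's released code.

```python
import numpy as np

def average_endpoint_error(pred, gt):
    """Mean Euclidean distance (in pixels) between predicted and
    ground-truth 2-D point locations. pred, gt: (num_points, 2) arrays."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

def tracking_recall(pred, gt, threshold_px=5.0):
    """Fraction of points whose prediction falls within `threshold_px`
    pixels of the ground truth (e.g. tracking recall @ 5 px)."""
    errors = np.linalg.norm(pred - gt, axis=-1)
    return float((errors <= threshold_px).mean())

# Toy example: 3 tracked points in one frame (made-up coordinates)
pred = np.array([[10.0, 12.0], [55.0, 40.0], [200.0, 98.0]])
gt   = np.array([[11.0, 12.5], [50.0, 44.0], [201.0, 97.0]])
print(average_endpoint_error(pred, gt))   # ≈ 2.98 px
print(tracking_recall(pred, gt, 5.0))     # ≈ 0.67 (2 of 3 points within 5 px)
```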

Methodology

  1. Input preprocessing – Synchronized video streams from N calibrated cameras are fed into a shared CNN backbone that extracts per‑frame feature maps.
  2. Geometry warping – Using known camera intrinsics/extrinsics, each feature map is back‑projected onto a common 3‑D voxel grid (or a set of hypothesized depth planes). This aligns the views in a geometry‑consistent space (a minimal plane‑sweep sketch follows this list).
  3. Cross‑view attention – A transformer‑style attention block receives the stacked, warped features. Queries correspond to the point of interest (or a dense set of candidate points), while keys/values come from all views and neighboring time steps. The attention weights automatically focus on the most informative view(s) at each moment, handling occlusions gracefully (a simplified attention module is sketched after this list).
  4. Trajectory decoding – The attended representation is passed through a lightweight regression head that predicts the 2‑D image coordinates of the point in each camera for the next frame. A simple Kalman‑filter‑like smoothing step refines the multi‑camera trajectory (one possible smoother is sketched below).
  5. Training – The model is supervised with a combination of (i) a 2‑D reprojection loss (distance between predicted and ground‑truth pixel locations) and (ii) a 3‑D consistency loss that penalizes deviation from the true 3‑D point position after triangulation (both losses are sketched below).
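To make step 2 concrete, the following sketch back‑projects one source view's feature map onto a set of fronto‑parallel depth planes defined in a reference camera's frame (a standard plane‑sweep construction). It assumes pinhole cameras with known calibration; the function name, tensor shapes, and toy intrinsics are our own assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def plane_sweep_warp(src_feat, K_src, K_ref, R, t, depths, ref_hw):
    """Warp a source-view feature map onto fronto-parallel depth planes
    in the reference camera frame (plane-sweep volume).

    src_feat : (C, Hs, Ws) source feature map
    K_src, K_ref : (3, 3) intrinsics of the source / reference camera
    R, t     : rotation (3, 3) and translation (3,) taking reference-frame
               points into the source camera frame
    depths   : iterable of hypothesized plane depths
    ref_hw   : (H, W) resolution of the reference feature grid
    returns  : (D, C, H, W) stack of warped features, one slice per depth
    """
    H_ref, W_ref = ref_hw
    device = src_feat.device

    # Reference-view pixel grid in homogeneous coordinates.
    ys, xs = torch.meshgrid(
        torch.arange(H_ref, device=device, dtype=torch.float32),
        torch.arange(W_ref, device=device, dtype=torch.float32),
        indexing="ij",
    )
    pix_ref = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).reshape(3, -1)

    n = torch.tensor([0.0, 0.0, 1.0], device=device).view(1, 3)  # plane normal
    warped = []
    for d in depths:
        # Homography induced by the plane z = d in the reference frame:
        #   H_d = K_src (R + t n^T / d) K_ref^{-1}
        H_d = K_src @ (R + t.view(3, 1) @ n / d) @ torch.linalg.inv(K_ref)
        pix_src = H_d @ pix_ref
        pix_src = pix_src[:2] / pix_src[2:].clamp(min=1e-6)

        # Normalise to [-1, 1] for grid_sample.
        Hs, Ws = src_feat.shape[1:]
        gx = 2.0 * pix_src[0] / (Ws - 1) - 1.0
        gy = 2.0 * pix_src[1] / (Hs - 1) - 1.0
        grid = torch.stack([gx, gy], dim=-1).view(1, H_ref, W_ref, 2)

        sampled = F.grid_sample(src_feat[None], grid, align_corners=True)
        warped.append(sampled[0])                      # (C, H, W)
    return torch.stack(warped, dim=0)                  # (D, C, H, W)

# Toy usage: a 64-channel feature map warped onto 32 depth planes
feat = torch.randn(64, 60, 80)
K = torch.eye(3) * torch.tensor([100.0, 100.0, 1.0])   # toy intrinsics
volume = plane_sweep_warp(feat, K, K, torch.eye(3),
                          torch.tensor([0.1, 0.0, 0.0]),
                          depths=torch.linspace(1.0, 10.0, 32), ref_hw=(60, 80))
print(volume.shape)  # torch.Size([32, 64, 60, 80])
```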
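Step 3 can likewise be approximated with off‑the‑shelf multi‑head attention, where each tracked point contributes one query token and the keys/values are feature tokens gathered from all warped views and neighboring frames. The module below is a deliberately simplified stand‑in for the paper's cross‑view attention block, not its exact architecture.

```python
import torch
import torch.nn as nn

class CrossViewPointAttention(nn.Module):
    """Minimal cross-view attention: one query per tracked point attends
    over feature tokens pooled from all (warped) views and time steps."""

    def __init__(self, feat_dim=256, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(feat_dim)

    def forward(self, point_queries, view_tokens):
        """
        point_queries : (B, P, C)      one descriptor per tracked point
        view_tokens   : (B, V*T*S, C)  features from V views, T time steps,
                        S spatial locations, flattened into one token sequence
        returns       : (B, P, C) fused per-point features, plus attention map
        """
        fused, attn_weights = self.attn(
            query=point_queries, key=view_tokens, value=view_tokens
        )
        # Residual connection keeps the original point descriptor in the mix.
        return self.norm(point_queries + fused), attn_weights

# Toy example: 2 clips, 16 tracked points, 4 views x 3 frames x 64 spatial tokens
B, P, C = 2, 16, 256
tokens = torch.randn(B, 4 * 3 * 64, C)
queries = torch.randn(B, P, C)
fused, weights = CrossViewPointAttention(C)(queries, tokens)
print(fused.shape, weights.shape)  # (2, 16, 256) and (2, 16, 768)
```

In the real model the attention would run over several layers and carry richer temporal context; the sketch only illustrates the query/key layout across views and time, which is what lets the weights shift toward whichever camera currently sees the point.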

All components are fully differentiable, allowing end‑to‑end training on the synthetic dataset before fine‑tuning on the real‑world benchmarks.
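The step‑4 smoothing is described only as "Kalman‑filter‑like". One plausible reading, given here strictly as an assumption, is a constant‑velocity Kalman filter applied independently to each camera's 2‑D track:

```python
import numpy as np

def smooth_track_2d(observations, process_var=1.0, meas_var=4.0):
    """Constant-velocity Kalman filter over a single camera's 2-D track.

    observations : (T, 2) array of per-frame predicted pixel coordinates
    returns      : (T, 2) smoothed coordinates
    """
    dt = 1.0
    # State: [x, y, vx, vy]
    F = np.array([[1, 0, dt, 0],
                  [0, 1, 0, dt],
                  [0, 0, 1, 0],
                  [0, 0, 0, 1]], dtype=float)
    H = np.array([[1, 0, 0, 0],
                  [0, 1, 0, 0]], dtype=float)
    Q = process_var * np.eye(4)      # process noise
    R = meas_var * np.eye(2)         # measurement noise

    x = np.array([*observations[0], 0.0, 0.0])   # initial state
    P = np.eye(4) * 10.0
    smoothed = []
    for z in observations:
        # Predict with the constant-velocity model.
        x = F @ x
        P = F @ P @ F.T + Q
        # Update, treating the tracker's prediction as the measurement.
        S = H @ P @ H.T + R
        K = P @ H.T @ np.linalg.inv(S)
        x = x + K @ (z - H @ x)
        P = (np.eye(4) - K @ H) @ P
        smoothed.append(x[:2].copy())
    return np.stack(smoothed)

# Toy example: a noisy diagonal trajectory
track = np.cumsum(np.ones((30, 2)), axis=0) + np.random.randn(30, 2)
print(smooth_track_2d(track).shape)  # (30, 2)
```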
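The two supervision terms from step 5 can also be written down directly, assuming a standard linear (DLT) triangulation is used to lift the per‑view 2‑D predictions to a 3‑D estimate; the paper's exact loss weighting and triangulation procedure may differ.

```python
import torch

def reprojection_loss(pred_px, gt_px):
    """Mean 2-D distance between predicted and ground-truth pixel locations.
    pred_px, gt_px : (V, P, 2) tensors over V views and P points."""
    return (pred_px - gt_px).norm(dim=-1).mean()

def triangulate_dlt(pred_px, proj_mats):
    """Linear (DLT) triangulation of each point from all V views.
    pred_px   : (V, P, 2) predicted pixel coordinates
    proj_mats : (V, 3, 4) camera projection matrices K [R | t]
    returns   : (P, 3) triangulated 3-D points."""
    V, P, _ = pred_px.shape
    points = []
    for p in range(P):
        rows = []
        for v in range(V):
            u, w = pred_px[v, p]
            Pm = proj_mats[v]
            rows.append(u * Pm[2] - Pm[0])
            rows.append(w * Pm[2] - Pm[1])
        A = torch.stack(rows)            # (2V, 4) linear system
        _, _, Vh = torch.linalg.svd(A)
        X = Vh[-1]                       # homogeneous solution
        points.append(X[:3] / X[3])
    return torch.stack(points)

def consistency_loss(pred_px, proj_mats, gt_xyz):
    """3-D consistency: distance between triangulated and true 3-D points.
    gt_xyz : (P, 3) ground-truth 3-D positions."""
    return (triangulate_dlt(pred_px, proj_mats) - gt_xyz).norm(dim=-1).mean()

# total = reprojection_loss(pred_px, gt_px) \
#         + 0.1 * consistency_loss(pred_px, proj_mats, gt_xyz)
# (0.1 is an arbitrary weight chosen for illustration.)
```

Because SVD is differentiable in PyTorch, the 3‑D consistency term can back‑propagate gradients into the 2‑D predictions, which is consistent with the end‑to‑end training noted above.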

Results & Findings

  • Synthetic test – Avg. Endpoint Error (px, lower is better): MV‑TAP 1.8 vs. best prior method 2.6
  • Indoor MV‑CAP (real) – Tracking Recall @ 5 px (higher is better): MV‑TAP 78 % vs. best prior method 61 %
  • Outdoor traffic – 3‑D reconstruction error (cm, lower is better): MV‑TAP 4.2 vs. best prior method 6.9
  • Robustness to occlusion: When a point disappears in one view for up to 10 frames, MV‑TAP still recovers the correct location once it re‑appears in any other camera.
  • Scalability: Runtime grows linearly with the number of cameras; on a 4‑GPU server, tracking 10 k points across 8 views of 30 fps video takes roughly 45 ms per frame.
  • Generalization: Fine‑tuning on just 5 % of the real‑world data closes the synthetic‑to‑real gap, indicating that the learned attention patterns transfer well.

Practical Implications

  • AR/VR content creation – Precise multi‑view point tracks enable automatic 3‑D reconstruction of props and actors, reducing manual rigging time.
  • Sports analytics – Coaches can attach virtual markers to any player or equipment and obtain seamless 3‑D trajectories from existing broadcast camera rigs.
  • Robotics & autonomous driving – Multi‑camera perception stacks (e.g., surround‑view systems) can use MV‑TAP to maintain consistent landmarks for SLAM or obstacle tracking, even when some cameras are temporarily blinded.
  • Film VFX – Post‑production pipelines can track feature points across a multi‑camera rig without placing physical markers, simplifying match‑moving workflows.
  • Open‑source baseline – The released code and datasets give developers a ready‑to‑use foundation for building custom multi‑view tracking solutions or extending the approach to dense optical flow.

Limitations & Future Work

  • Calibration dependency – MV‑TAP assumes accurate extrinsic calibration; errors in camera poses degrade performance noticeably.
  • Memory footprint – The cross‑view attention over high‑resolution feature maps can be GPU‑intensive for very large camera arrays (>16 views).
  • Sparse point focus – The current design tracks a set of user‑specified points; extending to dense, per‑pixel tracking remains an open challenge.
  • Real‑world diversity – While the synthetic data covers many scenarios, extreme lighting conditions (e.g., night traffic) still cause occasional failures, suggesting the need for more varied real‑world training data.

Future research directions include integrating self‑supervised calibration refinement, hierarchical attention to lower memory usage, and coupling MV‑TAP with dense reconstruction networks for end‑to‑end 3‑D scene understanding.

Authors

  • Jahyeok Koo
  • Inès Hyeonsu Kim
  • Mungyeom Kim
  • Junghyun Park
  • Seohyun Park
  • Jaeyeong Kim
  • Jung Yi
  • Seokju Cho
  • Seungryong Kim

Paper Information

  • arXiv ID: 2512.02006v1
  • Categories: cs.CV
  • Published: December 1, 2025