[Paper] 3AM: Segment Anything with Geometric Consistency in Videos

Published: January 13, 2026 at 01:59 PM EST
3 min read
Source: arXiv - 2601.08831v1

Overview

The paper “3AM: Segment Anything with Geometric Consistency in Videos” tackles a long‑standing problem in video object segmentation (VOS): maintaining accurate masks when the camera viewpoint swings dramatically. By marrying the powerful appearance‑based SAM2 model with lightweight 3‑D‑aware features from the MUSt3R framework, the authors achieve geometry‑consistent segmentation without needing depth maps, camera poses, or any preprocessing at inference time.

Key Contributions

  • 3AM architecture: A training‑time plug‑in that fuses MUSt3R’s multi‑level 3‑D features with SAM2’s appearance features via a lightweight Feature Merger.
  • Implicit geometric correspondence: The merged representation encodes spatial position, enabling the model to stay “anchored” to the same physical object across wide‑baseline frames.
  • Field‑of‑view aware sampling: A novel data‑sampling strategy that forces training frames to share consistent object regions, strengthening the learning of 3‑D correspondences.
  • Zero‑extra inference cost: At test time the system only requires raw RGB frames—no depth, pose, or heavy preprocessing—making it drop‑in compatible with existing SAM2 pipelines.
  • State‑of‑the‑art performance: On challenging wide‑baseline video benchmarks (ScanNet++, Replica), 3AM reaches 90.6 % IoU and 71.7 % Positive IoU, beating the previous best VOS methods by +15.9 and +30.4 points, respectively.

Methodology

Backbone Fusion

  • SAM2 supplies strong per‑frame appearance embeddings (color, texture).
  • MUSt3R provides multi‑scale 3‑D‑aware embeddings that capture implicit geometry (e.g., relative depth, surface orientation) learned from large‑scale RGB‑only video data.
  • A Feature Merger (a few 1×1 convolutions plus residual connections) combines these two streams into a single token set that is fed to SAM2’s memory encoder; a minimal sketch follows this list.

Training‑time Geometry Enforcement

  • The authors introduce a field‑of‑view aware sampler that selects frame pairs where the same object occupies overlapping image regions despite large camera motions.
  • A contrastive loss encourages the merged tokens of overlapping regions to be close, while non‑overlapping regions are pushed apart, teaching the network an implicit notion of 3‑D consistency (see the sketch after this list).

Inference Simplicity

  • After training, the model runs exactly like vanilla SAM2: feed an RGB frame, retrieve the memory bank, and predict masks (see the usage sketch after this list).
  • The geometry knowledge is baked into the learned weights, so no external 3‑D data is required.

Results & Findings

| Dataset (subset)        | Metric       | SAM2 (baseline) | 3AM (ours) | Δ     |
|-------------------------|--------------|-----------------|------------|-------|
| ScanNet++ (selected)    | IoU          | 74.7 %          | 90.6 %     | +15.9 |
| ScanNet++ (selected)    | Positive IoU | 41.3 %          | 71.7 %     | +30.4 |
| Replica (wide‑baseline) | IoU          | 68.2 %          | 84.5 %     | +16.3 |

  • Robustness to viewpoint change: 3AM maintains mask continuity even when objects rotate out of view or undergo severe perspective distortion.
  • Ablation studies show that removing the Feature Merger or the field‑of‑view sampler drops performance back to SAM2‑level, confirming each component’s necessity.
  • Runtime impact is negligible (< 5 % overhead) because the merger is lightweight and inference remains RGB‑only.

Practical Implications

  • Plug‑and‑play upgrade for any product already using SAM2 (e.g., video editing tools, AR/VR pipelines, autonomous‑driving perception stacks).
  • Reduced engineering burden: No need to collect or synchronize depth sensors or SLAM pose estimates, which are often noisy or unavailable on consumer devices.
  • Better user experience in applications that require persistent object masks across camera moves—think interactive video retargeting, virtual try‑on, or robotics manipulation where the robot’s viewpoint constantly changes.
  • Lower compute cost compared to full 3‑D instance‑segmentation pipelines that rely on expensive point‑cloud processing, making it suitable for edge devices or real‑time streaming services.

Limitations & Future Work

  • Training data dependency: The geometry encoder (MUSt3R) is pre‑trained on large RGB video corpora; performance may degrade on domains with drastically different scene geometry (e.g., underwater or medical videos).
  • No explicit depth output: While masks stay consistent, the model does not provide depth or 3‑D shape estimates, which could be valuable for downstream tasks.
  • Memory scaling: Like SAM2, 3AM still stores a memory bank of past frames; very long videos may require additional strategies (e.g., hierarchical memory pruning).
  • Future directions suggested by the authors include extending the merger to handle multi‑modal inputs (e.g., LiDAR), learning to predict coarse depth jointly with masks, and exploring self‑supervised fine‑tuning on domain‑specific video streams.

Authors

  • Yang‑Che Sun
  • Cheng Sun
  • Chin‑Yang Lin
  • Fu‑En Yang
  • Min‑Hung Chen
  • Yen‑Yu Lin
  • Yu‑Lun Liu

Paper Information

  • arXiv ID: 2601.08831v1
  • Categories: cs.CV
  • Published: January 13, 2026