[Paper] GMOS: Grounding Moving Object Segmentation in 3D Space and Time
Source: arXiv - 2605.30352v1
Overview
The paper introduces GMOS, a new framework that segments and tracks moving objects directly in 3‑D space and time, using only raw RGB video as input. By grounding motion understanding in 3‑D geometry and treating motion as an instantaneous property of each object, GMOS pushes moving‑object segmentation (MOS) beyond the 2‑D, sequence‑level approaches that dominate today.
Key Contributions
- 3‑D‑aware MOS: First MOS system that reasons about motion in a unified 3‑D spatio‑temporal volume, eliminating the need for pre‑computed 2‑D cues such as optical flow.
- Instantaneous Motion Modeling (MOS‑I): Introduces a fine‑grained evaluation protocol that measures per‑frame object motion states, highlighting the ability to detect motion changes frame‑by‑frame.
- GMOS‑S variant: A lightweight “foreground‑background” version that runs up to 3× faster, suitable for real‑time or edge deployment.
- GMOS‑2K dataset: Curated 2,210 real‑world videos with per‑object temporal motion annotations, built from five existing VOS benchmarks, to train and benchmark 3‑D MOS models.
- State‑of‑the‑art performance: Sets new records on standard MOS, MOS‑I, and unsupervised VOS benchmarks while running at considerably lower inference latency.
Methodology
- 3‑D Feature Backbone: The video is processed by a 3‑D CNN that extracts spatio‑temporal features directly from the RGB frames, preserving depth cues implicitly through motion parallax.
- Object‑level Query Embeddings: Inspired by transformer‑based detection, a set of learnable query vectors represents potential moving objects. Each query attends to the 3‑D feature map, producing an object‑specific embedding that captures both appearance and motion.
- Instantaneous Motion Decoder: For every frame, the decoder predicts a binary mask for each query (object) and a per‑pixel motion confidence indicating whether the object is moving in that exact frame. This yields a fine‑grained “moving/not‑moving” label per object per timestamp.
- Training Signals: Supervision comes from the GMOS‑2K annotations: (i) per‑object masks, (ii) temporal motion flags, and (iii) optional depth cues derived from structure‑from‑motion pipelines. A multi‑task loss combines mask segmentation (Dice + BCE), motion classification (cross‑entropy), and a consistency term that encourages smoothness across adjacent frames.
- GMOS‑S Simplification: The “S” version collapses the object queries into a single foreground/background query, dramatically reducing compute while still delivering high‑quality moving‑object masks for applications that do not need object identity.
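To make the multi‑task loss described under Training Signals more concrete, here is a minimal pure‑Python sketch for a single object. It is an illustration under stated assumptions, not the authors' implementation: the function names, the loss weights (`w_mask`, `w_motion`, `w_smooth`), the use of one motion probability per frame instead of a per‑pixel confidence map, and the squared‑difference form of the smoothness term are all hypothetical choices made for this example.

```python
import math

EPS = 1e-7  # numerical guard against log(0) and division by zero


def bce(pred, target):
    """Mean binary cross-entropy between probabilities and 0/1 labels."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, EPS), 1.0 - EPS)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(pred)


def dice_loss(pred, target):
    """1 - Dice coefficient for one mask given as a flat list of pixels."""
    inter = sum(p * t for p, t in zip(pred, target))
    denom = sum(pred) + sum(target)
    return 1.0 - (2.0 * inter + EPS) / (denom + EPS)


def temporal_smoothness(motion_probs):
    """Mean squared change in motion probability between adjacent frames."""
    if len(motion_probs) < 2:
        return 0.0
    diffs = [(a - b) ** 2 for a, b in zip(motion_probs, motion_probs[1:])]
    return sum(diffs) / len(diffs)


def multitask_loss(mask_preds, mask_targets, motion_probs, motion_flags,
                   w_mask=1.0, w_motion=1.0, w_smooth=0.1):
    """Combine the three terms for one object over a clip.

    mask_preds / mask_targets: one flat pixel list per frame.
    motion_probs / motion_flags: one moving-probability / 0-1 flag per frame.
    The weights are hypothetical; the summary does not state them.
    """
    n = len(mask_preds)
    mask_term = sum(dice_loss(p, t) + bce(p, t)
                    for p, t in zip(mask_preds, mask_targets)) / n
    # Two-class cross-entropy reduces to binary cross-entropy here.
    motion_term = bce(motion_probs, motion_flags)
    smooth_term = temporal_smoothness(motion_probs)
    return w_mask * mask_term + w_motion * motion_term + w_smooth * smooth_term


# Tiny two-frame clip with four pixels per frame.
targets = [[1, 0, 1, 0], [1, 0, 1, 0]]
good = multitask_loss([[1.0, 0.0, 1.0, 0.0]] * 2, targets, [1.0, 1.0], [1, 1])
bad = multitask_loss([[0.5] * 4] * 2, targets, [0.5, 0.5], [1, 1])
print(good < bad)
```

One design tension is visible even in this toy version: a smoothness term penalizes genuine motion changes as well as flicker, so its weight would need to stay small relative to the motion classification term.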
Results & Findings
| Benchmark (metric, higher = better) | GMOS | GMOS‑S | Prior Best |
|---|---|---|---|
| MOS (overall IoU) | 0.78 | 0.71 | 0.73 |
| MOS‑I (instantaneous F‑score) | 0.74 | 0.68 | 0.66 |
| Unsupervised VOS (J&F) | 0.81 / 0.78 | 0.75 / 0.72 | 0.77 / 0.74 |
| Inference speed (FPS, 1080p) | 12 | 30 | 5‑7 |
- Accuracy: GMOS outperforms all previous multi‑object MOS methods, especially on the newly proposed MOS‑I protocol that rewards per‑frame motion detection.
- Speed: Even the full model runs at 12 FPS, roughly 1.7× to 2.4× the 5‑7 FPS of the previous state of the art, thanks to the end‑to‑end 3‑D backbone that removes costly optical‑flow pre‑processing.
- Online capability: The architecture processes frames sequentially without needing the entire video, enabling streaming inference for live cameras.
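The two headline metrics in the table can be illustrated with a short sketch. This summary does not spell out the exact MOS‑I definition, so the code below assumes the simplest reading: each object has a binary moving/static state per frame, and the F‑score is computed over all (object, frame) pairs. The function names are hypothetical.

```python
def mask_iou(pred, gt):
    """Intersection over union of two binary masks given as flat 0/1 lists."""
    inter = sum(1 for p, g in zip(pred, gt) if p and g)
    union = sum(1 for p, g in zip(pred, gt) if p or g)
    return inter / union if union else 1.0


def instantaneous_f_score(pred_states, gt_states):
    """F1 over per-frame moving (1) / static (0) states.

    pred_states / gt_states: one list of per-frame states per object.
    Assumed reading of MOS-I, not the paper's official definition.
    """
    tp = fp = fn = 0
    for obj_pred, obj_gt in zip(pred_states, gt_states):
        for p, g in zip(obj_pred, obj_gt):
            if p and g:
                tp += 1
            elif p and not g:
                fp += 1
            elif g and not p:
                fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# One object that starts moving at frame 2 and stops at frame 5.
gt = [[0, 0, 1, 1, 1, 0]]
sequence_level = [[1, 1, 1, 1, 1, 1]]  # "moving" for the whole clip
per_frame = [[0, 0, 1, 1, 1, 0]]       # tracks the state changes

print(round(instantaneous_f_score(sequence_level, gt), 3))  # 0.667
print(instantaneous_f_score(per_frame, gt))                 # 1.0
```

Under this assumed metric, a sequence‑level label that marks the object as moving for the entire clip is penalized for the three static frames, which is the behavior a per‑frame protocol is meant to expose.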
Practical Implications
- Autonomous robotics & drones: Real‑time detection of moving obstacles with 3‑D awareness can improve navigation safety without the latency of separate depth or flow pipelines.
- AR/VR content creation: Instantaneous motion masks enable dynamic occlusion handling and realistic object insertion, all from a single RGB feed.
- Surveillance & smart city analytics: Fine‑grained motion state per object helps differentiate transient motion (e.g., a passing car) from persistent activity (e.g., a loitering person).
- Edge deployment: GMOS‑S’s high FPS on a single GPU makes it viable for on‑device processing in smartphones, wearables, or low‑power edge servers.
- Data‑centric pipelines: The GMOS‑2K dataset and MOS‑I evaluation protocol provide a new benchmark for developers building downstream tasks such as action recognition or scene understanding that rely on accurate motion segmentation.
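As a rough illustration of how a downstream application might consume per‑frame motion states, the sketch below groups one object's moving/static flags into bursts of motion and summarizes them. Everything here is hypothetical (the function names, the 30 FPS default, the summary fields); it only shows the kind of signal that instantaneous labels make available, and any transient‑versus‑persistent rule built on top of it would be application‑specific.

```python
def motion_segments(flags):
    """Return (start, end) index pairs, end exclusive, for runs of moving frames."""
    segments = []
    start = None
    for i, moving in enumerate(flags):
        if moving and start is None:
            start = i                      # a new burst of motion begins
        elif not moving and start is not None:
            segments.append((start, i))    # the burst ended on the previous frame
            start = None
    if start is not None:
        segments.append((start, len(flags)))
    return segments


def summarize_track(flags, fps=30.0):
    """Summarize one object's per-frame motion flags (fps is an assumed value)."""
    segs = motion_segments(flags)
    moving_frames = sum(end - begin for begin, end in segs)
    return {
        "num_bursts": len(segs),
        "moving_seconds": moving_frames / fps,
        "visible_seconds": len(flags) / fps,
        "moving_ratio": moving_frames / len(flags) if flags else 0.0,
    }


passing_car = [1] * 30                                       # moves the whole time
loiterer = [0] * 20 + [1] * 5 + [0] * 20 + [1] * 5 + [0] * 10  # mostly still

print(summarize_track(passing_car)["num_bursts"])  # 1
print(summarize_track(loiterer)["num_bursts"])     # 2
```

A sequence‑level label would mark both tracks simply as "moving"; the per‑frame flags separate one continuous burst from short, intermittent ones.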
Limitations & Future Work
- Depth ambiguity: While the model learns implicit 3‑D cues, it still struggles in texture‑less regions where depth inference from motion alone is weak.
- Scalability to many objects: Performance degrades modestly when more than 10 moving objects appear simultaneously, suggesting a need for more efficient query handling.
- Domain shift: The current training data is sourced from VOS benchmarks; performance on highly specialized domains (e.g., underwater or medical video) remains untested.
Future research directions highlighted by the authors include integrating explicit monocular depth estimation to reinforce 3‑D reasoning, exploring hierarchical query structures for large numbers of objects, and extending the framework to multimodal inputs (e.g., LiDAR + RGB) for even richer motion understanding.
Authors
- Junyu Xie
- Tengda Han
- Weidi Xie
- Andrew Zisserman
Paper Information
- arXiv ID: 2605.30352v1
- Categories: cs.CV
- Published: May 28, 2026