[Paper] ReScene4D: Temporally Consistent Semantic Instance Segmentation of Evolving Indoor 3D Scenes

Published: January 16, 2026 at 01:45 PM EST
4 min read
Source: arXiv - 2601.11508v1

Overview

The paper ReScene4D tackles a surprisingly common problem for anyone building long‑term indoor perception systems: how to keep track of what each object is and where it is over time when 3‑D scans are taken only intermittently. By formalizing “temporally sparse 4‑D semantic instance segmentation,” the authors enable robots, AR/VR platforms, and facility‑management tools to maintain consistent object identities even as furniture is moved, added, or removed.

Key Contributions

  • Task definition: Introduces the novel problem of temporally sparse 4‑D semantic instance segmentation (4DSIS) and proposes a dedicated evaluation metric, t‑mAP, that rewards both spatial accuracy and temporal identity consistency.
  • ReScene4D architecture: Adapts state‑of‑the‑art 3‑D SIS networks to the 4‑D setting without requiring dense, high‑frequency scans. The model shares latent context across time steps, effectively “remembering” past observations.
  • Cross‑observation information sharing: Demonstrates three practical strategies (feature aggregation, memory banks, and attention‑based fusion) for propagating instance cues between sparsely captured scans.
  • Performance boost: Shows that the temporal sharing not only solves the tracking problem but also improves pure 3‑D instance segmentation quality on each individual scan.
  • Benchmarking: Sets a new state‑of‑the‑art on the 3RScan dataset, establishing the first public benchmark for 4DSIS on evolving indoor scenes.

Methodology

  1. Base 3‑D SIS backbone – The authors start with a proven 3‑D semantic instance segmentation network (e.g., PointGroup or Mask3D) that processes a single point cloud to output per‑point class labels and instance masks.
  2. Temporal memory module – A lightweight memory bank stores a compact embedding for each discovered instance (its geometry, semantics, and a learned “identity vector”).
  3. Cross‑frame fusion – When a new scan arrives, its point features are projected into the memory space. The system uses an attention mechanism to retrieve the most relevant past embeddings (see the first sketch after this list), allowing it to:
    • Match current detections to existing IDs (or create new ones).
    • Refine the current segmentation using historical context (e.g., smoothing noisy boundaries).
  4. Training regime – The network is trained end‑to‑end on sequences of scans with a combined loss: (i) a standard 3‑D SIS loss (semantic cross‑entropy + instance mask loss) and (ii) a temporal consistency loss that penalizes ID switches across frames; a plausible form is sketched below.
  5. t‑mAP metric – Extends the classic mean Average Precision (mAP) by counting a detection as correct only if its predicted instance ID matches the ground‑truth ID throughout the evaluated time window; the last sketch below illustrates this matching rule.
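To make steps 2–3 concrete, below is a minimal sketch of a per‑instance memory bank with attention‑based retrieval, refinement, and ID assignment. The names (`InstanceMemory`, `match_and_fuse`), the cosine‑similarity threshold, and the momentum update are illustrative assumptions; the paper's actual module layout and hyperparameters may differ.

```python
import torch
import torch.nn.functional as F

SIM_THRESHOLD = 0.5  # assumed cosine-similarity cutoff for reusing an existing ID


class InstanceMemory:
    """Stores one compact embedding per tracked instance, keyed by a persistent ID."""

    def __init__(self, dim: int):
        self.dim = dim
        self.ids = []                          # persistent instance IDs
        self.embeddings = torch.empty(0, dim)  # one row per remembered instance
        self._next_id = 0

    def match_and_fuse(self, queries: torch.Tensor, momentum: float = 0.9):
        """Match current-scan instance embeddings against memory and refine them."""
        if self.ids:
            # Attention over stored instances retrieves relevant historical context.
            attn = torch.softmax(queries @ self.embeddings.T / self.dim ** 0.5, dim=-1)
            refined = F.normalize(queries + attn @ self.embeddings, dim=-1)
            sim = F.normalize(queries, dim=-1) @ F.normalize(self.embeddings, dim=-1).T
        else:
            refined, sim = F.normalize(queries, dim=-1), None

        assigned_ids = []
        for i, q in enumerate(refined):
            best = int(sim[i].argmax()) if sim is not None else -1
            if sim is not None and float(sim[i, best]) > SIM_THRESHOLD:
                # Same object as before: reuse its ID and update the stored embedding.
                assigned_ids.append(self.ids[best])
                self.embeddings[best] = momentum * self.embeddings[best] + (1 - momentum) * q
            else:
                # Unseen object: create a new persistent ID.
                assigned_ids.append(self._next_id)
                self.ids.append(self._next_id)
                self.embeddings = torch.cat([self.embeddings, q.unsqueeze(0)])
                self._next_id += 1
        return assigned_ids, refined


# Usage: two sparse scans, each yielding a few instance embeddings from the 3-D backbone.
memory = InstanceMemory(dim=32)
ids_t0, _ = memory.match_and_fuse(torch.randn(3, 32))  # first scan: three new IDs
ids_t1, _ = memory.match_and_fuse(torch.randn(4, 32))  # later scan: matched or new IDs
print(ids_t0, ids_t1)
```

A real tracker would enforce one‑to‑one assignments (e.g., via Hungarian matching); the greedy loop here is kept deliberately simple.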
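The combined loss in step 4 could take the following plausible form: a per‑scan SIS term (semantic cross‑entropy plus an instance mask loss) plus a temporal term that pulls embeddings of the same ground‑truth instance together across scans. The weighting `lam` and the exact shape of the consistency term are assumptions, not the paper's reported formulation.

```python
import torch
import torch.nn.functional as F


def temporal_consistency_loss(emb_t0, emb_t1, gt_ids_t0, gt_ids_t1):
    """Pull embeddings of the same ground-truth instance together across two scans."""
    losses = []
    for i in range(len(gt_ids_t0)):
        for j in range(len(gt_ids_t1)):
            if int(gt_ids_t0[i]) == int(gt_ids_t1[j]):
                # 1 - cosine similarity: zero when the two embeddings agree perfectly.
                losses.append(1.0 - F.cosine_similarity(emb_t0[i], emb_t1[j], dim=0))
    # If no instance persists across the pair, the term contributes nothing.
    return torch.stack(losses).mean() if losses else emb_t0.sum() * 0.0


def combined_loss(sem_logits, sem_labels, mask_logits, mask_labels,
                  emb_t0, emb_t1, gt_ids_t0, gt_ids_t1, lam=0.5):
    # (i) standard per-scan 3-D SIS loss: semantic cross-entropy + instance mask loss.
    sis = F.cross_entropy(sem_logits, sem_labels) \
        + F.binary_cross_entropy_with_logits(mask_logits, mask_labels)
    # (ii) temporal consistency term penalizing embedding drift (and hence ID switches).
    return sis + lam * temporal_consistency_loss(emb_t0, emb_t1, gt_ids_t0, gt_ids_t1)
```

Here the consistency term only compares a pair of scans; in training it would be applied across the whole sequence.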
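Finally, a toy illustration of the identity‑consistency rule behind t‑mAP in step 5: a predicted track only counts as correct if it overlaps the ground‑truth instance in every evaluated scan and never switches its predicted ID. The real metric additionally sweeps IoU thresholds and averages precision over classes, as in standard mAP; this simplified function captures only the matching criterion.

```python
def temporally_consistent_match(pred_track, gt_track, iou_thr=0.5):
    """pred_track / gt_track map scan index -> (instance_id, set of point indices)."""
    pred_ids = set()
    for t, (_gt_id, gt_points) in gt_track.items():
        if t not in pred_track:
            return False                       # instance missed in one of the scans
        pred_id, pred_points = pred_track[t]
        pred_ids.add(pred_id)
        inter = len(pred_points & gt_points)
        union = len(pred_points | gt_points)
        if union == 0 or inter / union < iou_thr:
            return False                       # insufficient spatial overlap at time t
    return len(pred_ids) == 1                  # the predicted ID must never switch


# Toy example: perfect per-scan overlap, but the predicted ID switches at t=1,
# so the track is rejected even though each individual scan looks correct.
gt = {0: (7, {1, 2, 3}), 1: (7, {1, 2, 3})}
pred_switch = {0: (0, {1, 2, 3}), 1: (1, {1, 2, 3})}
print(temporally_consistent_match(pred_switch, gt))  # False
```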

Results & Findings

| Metric | ReScene4D | Prior 3‑D SIS (no temporal) | 4‑D LiDAR baseline |
| --- | --- | --- | --- |
| mAP (per‑frame) | 58.7 % | 53.2 % | 42.1 % |
| t‑mAP (temporal) | 45.3 % | 28.7 % | 19.4 % |
| ID switches (per 100 scans) | 3.2 | 12.8 | 21.5 |

Key takeaways

  • Temporal sharing improves raw segmentation – even when evaluated frame‑by‑frame, ReScene4D outperforms the same backbone without memory, indicating that historical context helps resolve ambiguous geometry.
  • Consistent IDs are dramatically better – t‑mAP jumps by ~16 points over the same backbone without temporal sharing, demonstrating the effectiveness of the memory‑attention design.
  • Sparse data works – Unlike LiDAR‑centric 4‑D methods that need high‑frequency streams, ReScene4D maintains performance with scans taken minutes or hours apart, matching realistic indoor capture schedules.

Practical Implications

  • Robotics & autonomous navigation – Service robots can reliably know that “the coffee mug on the table” is the same object after the table is cleaned, enabling better task planning and safety checks.
  • AR/VR content persistence – Developers can anchor virtual objects to real‑world items that move over days, without re‑training models for each new scene.
  • Facility management & digital twins – Asset tracking systems can automatically detect when equipment is relocated or missing, reducing manual inventory audits.
  • Data‑efficient perception pipelines – Because ReScene4D works with sparse scans, companies can avoid costly continuous LiDAR deployments and instead rely on periodic RGB‑D or handheld scans.
  • Open‑source benchmark – The introduced t‑mAP metric and 3RScan split give the community a clear target for future 4‑D perception research, encouraging reproducible progress.

Limitations & Future Work

  • Memory scalability – The current memory bank grows linearly with the number of unique instances; very large environments (e.g., warehouses) may need hierarchical or pruning strategies.
  • Assumption of static semantics – The model assumes object class labels stay constant; handling objects that change function (e.g., a chair turned into a table) remains an open challenge.
  • Sparse temporal resolution – While the method tolerates long gaps, extremely rapid motions (e.g., a rolling ball) could be missed; integrating short‑burst high‑frequency data could improve such cases.
  • Generalization to outdoor or mixed indoor‑outdoor scenes – Extending ReScene4D to outdoor environments with weather‑induced point‑cloud noise is a promising direction.

Overall, ReScene4D marks a solid step toward perception systems that remember the world they see, opening up new possibilities for long‑term autonomous operation in dynamic indoor spaces.

Authors

  • Emily Steiner
  • Jianhao Zheng
  • Henry Howard-Jenkins
  • Chris Xie
  • Iro Armeni

Paper Information

  • arXiv ID: 2601.11508v1
  • Categories: cs.CV
  • Published: January 16, 2026