[Paper] SAM3-DMS: Decoupled Memory Selection for Multi-target Video Segmentation of SAM3
Source: arXiv - 2601.09699v1
Overview
Segment Anything 3 (SAM3) has become a go‑to foundation model for detecting, segmenting, and tracking objects in video streams. While it works well for single‑object or low‑density scenes, its original design makes a single, collective decision about which memory frames to use when many objects appear simultaneously. This “group‑level” memory selection can cause identity swaps and jittery masks in crowded videos. The new SAM3‑DMS (Decoupled Memory Selection) plug‑in fixes that problem without any extra training, giving developers a drop‑in upgrade that keeps each object’s memory independent and more reliable.
Key Contributions
- Decoupled Memory Selection (DMS): A training‑free module that selects memory frames per‑object rather than globally, preserving individual object reliability.
- Zero‑Shot Compatibility: Works with the off‑the‑shelf SAM3 model; no fine‑tuning or extra data required.
- Scalable Multi‑Target Performance: Gains grow with the number of concurrent targets, making the method well suited to dense‑scene applications (e.g., sports, surveillance).
- Robust Identity Preservation: Reduces mask swapping and improves temporal consistency across long video sequences.
- Comprehensive Evaluation: Benchmarked on standard multi‑object video segmentation datasets, reporting state‑of‑the‑art identity‑preservation metrics.
Methodology
- Memory Bank in SAM3: SAM3 stores a set of past frames (the “memory”) that it queries to propagate masks forward. In the original design, the same memory set is used for all objects in a frame.
- Per‑Object Scoring: SAM3‑DMS computes a lightweight confidence score for each object‑memory pair using the existing encoder features (no extra network).
- Decoupled Selection: For every active target, the module picks the top‑k memory frames with the highest confidence for that specific target. This yields a different memory subset per object.
- Mask Propagation: The selected memories are fed back into SAM3’s decoder, producing masks that are conditioned on the most relevant history for each object.
- Training‑Free Integration: Because the scoring function re‑uses SAM3’s internal embeddings, the whole pipeline can be inserted as a pre‑processing step during inference, requiring only a few lines of code.
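The per‑object selection step above can be sketched in a few lines. This is an illustrative NumPy sketch, not the paper's implementation: cosine similarity between object embeddings and pooled memory‑frame features stands in for the paper's encoder‑based confidence score, and all shapes and names are assumptions.

```python
import numpy as np

def decoupled_memory_selection(object_queries, memory_feats, k=3):
    """Per-object top-k memory selection (illustrative sketch).

    object_queries: (num_objects, dim) embeddings of the active targets.
    memory_feats:   (num_frames, dim) pooled features of stored memory frames.
    Returns an index array of shape (num_objects, k): each row is that
    object's own memory subset, ordered by descending similarity.
    """
    # Cosine similarity as a lightweight confidence score for every
    # object-memory pair (a stand-in for the paper's encoder-based score).
    q = object_queries / np.linalg.norm(object_queries, axis=1, keepdims=True)
    m = memory_feats / np.linalg.norm(memory_feats, axis=1, keepdims=True)
    scores = q @ m.T                          # (num_objects, num_frames)
    # Top-k memory frames per object, chosen independently for each row --
    # this is the "decoupling": no single global memory set is shared.
    topk = np.argsort(-scores, axis=1)[:, :k]
    return topk

# Toy example: 2 objects, 5 memory frames, 4-dim features.
rng = np.random.default_rng(0)
queries = rng.standard_normal((2, 4))
memories = rng.standard_normal((5, 4))
subsets = decoupled_memory_selection(queries, memories, k=3)
print(subsets.shape)  # (2, 3) -> one independent memory subset per object
```

Each row of `subsets` would then index the memory bank when propagating that object's mask, so two objects crossing paths can condition on different histories.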
Results & Findings
| Metric (higher is better) | SAM3 (baseline) | SAM3‑DMS (ours) |
|---|---|---|
| ID‑F1 (identity F1) | 71.2% | 78.9% (+7.7) |
| mIoU (mean IoU) | 68.5% | 70.1% (+1.6) |
| FPS (inference speed) | 12.4 | 11.9 (≈ 4% drop) |
- Identity preservation improves dramatically, especially when >10 objects are present (ID‑F1 gain >10%).
- Mask quality (mIoU) sees modest gains, confirming that the decoupled memory does not sacrifice spatial accuracy.
- Speed impact is minimal; the extra scoring and selection add only a few milliseconds per frame, keeping the system real‑time for most applications.
Qualitative examples show smoother tracks and fewer “mask swaps” when objects cross or occlude each other.
Practical Implications
- Video Analytics & Surveillance: Deploying SAM3‑DMS enables reliable tracking of many people or vehicles without custom re‑training, reducing false alarms caused by identity swaps.
- AR/VR & Real‑Time Effects: Developers can overlay persistent masks on multiple moving objects (e.g., sports players) with stable identities, improving user immersion.
- Robotics & Autonomous Systems: Multi‑object perception pipelines can inherit SAM3‑DMS to maintain consistent object IDs across frames, simplifying downstream planning and decision‑making.
- Content Creation Tools: Video editors using SAM3 for rotoscoping or background replacement will experience fewer manual corrections when handling crowded scenes.
- Easy Integration: Since the method is training‑free and only touches the inference path, it can be added to existing SAM3 deployments with a single function call or a lightweight wrapper.
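A lightweight wrapper of the kind described above might look like the following. All names here (`DMSWrapper`, `select_memory`, `predict`) are hypothetical placeholders, not the paper's or SAM3's actual API; the point is only that the change lives entirely on the inference path.

```python
class DMSWrapper:
    """Hypothetical wrapper: swaps in per-object memory selection
    around an unmodified SAM3-style predictor. No weights change."""

    def __init__(self, predictor, k=3):
        self.predictor = predictor  # placeholder predictor interface
        self.k = k                  # memory frames kept per object

    def segment(self, frame, object_ids):
        # Pick a separate top-k memory subset for each active object,
        # then delegate mask prediction to the wrapped predictor.
        memories = {oid: self.predictor.select_memory(oid, self.k)
                    for oid in object_ids}
        return self.predictor.predict(frame, memories)

# Minimal stand-in predictor so the sketch is self-contained.
class DummyPredictor:
    def select_memory(self, oid, k):
        return list(range(k))
    def predict(self, frame, memories):
        return {oid: f"mask-{oid}" for oid in memories}

wrapped = DMSWrapper(DummyPredictor())
masks = wrapped.segment(frame=None, object_ids=[1, 2])
print(masks)  # {1: 'mask-1', 2: 'mask-2'}
```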
Limitations & Future Work
- Memory Overhead: Maintaining separate memory subsets per object slightly increases GPU memory usage, which could become a bottleneck on very low‑resource devices.
- Confidence Scoring Simplicity: The current scoring relies on raw encoder embeddings; more sophisticated learned metrics might further boost performance, especially for highly similar objects.
- Extremely Dense Scenes: While gains grow with target count, the method still faces diminishing returns when >50 objects occupy the frame, suggesting a need for hierarchical or region‑based memory management.
- Future Directions: The authors propose exploring adaptive memory budgets per object, integrating lightweight learning‑based selectors, and extending the approach to 3‑D point‑cloud video streams.
Authors
- Ruiqi Shen
- Chang Liu
- Henghui Ding
Paper Information
- arXiv ID: 2601.09699v1
- Categories: cs.CV
- Published: January 14, 2026