[Paper] ReSAM: Refine, Requery, and Reinforce: Self-Prompting Point-Supervised Segmentation for Remote Sensing Images

Published: November 26, 2025, 12:26 PM EST
3 min read

Source: arXiv - 2511.21606v1

Overview

The paper introduces ReSAM, a self‑prompting framework that adapts the Segment Anything Model (SAM) to remote‑sensing imagery using only sparse point annotations. By iteratively refining pseudo‑masks, re‑querying SAM with self‑generated box prompts, and aligning embeddings across iterations, ReSAM substantially improves SAM’s performance on aerial and satellite images without requiring costly dense masks.

Key Contributions

  • Self‑prompting loop (Refine‑Requery‑Reinforce) that turns a few user‑provided points into progressively better segmentation masks.
  • Box‑prompt generation from coarse masks, enabling SAM to “re‑query” the image with richer spatial cues while remaining point‑supervised.
  • Embedding alignment across iterations to mitigate confirmation bias and keep the model from over‑fitting its own mistakes.
  • Domain‑agnostic adaptation that works on three diverse remote‑sensing benchmarks (WHU, HRSID, NWPU VHR‑10) and outperforms both the vanilla SAM and recent point‑supervised methods.
  • No dense mask supervision required, making the approach scalable to large satellite datasets where full annotations are prohibitive.

Methodology

  1. Initial Point Input (Refine) – The user supplies a handful of foreground/background points on an image. SAM produces a coarse pseudo‑mask based on these points.
  2. Self‑Constructed Box Prompt (Requery) – From the coarse mask, the system automatically extracts a tight bounding box and feeds it back to SAM as an additional prompt, so the model re‑segments the region with a richer spatial cue (see the first sketch after this list).
  3. Semantic Alignment (Reinforce) – Feature embeddings from the current iteration are compared with those from the previous one, and a contrastive loss encourages consistency while still allowing corrections, reducing the risk that early errors get reinforced (a possible loss is sketched after this list).
  4. Iterative Loop – Steps 1‑3 repeat a few times, each cycle producing a cleaner mask and a more reliable prompt set. The whole pipeline requires only the original point annotations; all other supervision is generated internally.
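
To make steps 1–2 concrete, here is a minimal sketch of one Refine‑Requery cycle using the public segment‑anything SamPredictor API. The checkpoint filename is the standard ViT‑B release; image, points, and labels are assumed inputs, and mask_to_box is an illustrative helper, not the paper’s code.

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

def mask_to_box(mask: np.ndarray) -> np.ndarray:
    """Tight (x0, y0, x1, y1) box around the foreground of a binary mask."""
    ys, xs = np.nonzero(mask)
    return np.array([xs.min(), ys.min(), xs.max(), ys.max()])

# Refine: SAM turns the sparse user points into a coarse pseudo-mask.
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
predictor = SamPredictor(sam)
predictor.set_image(image)            # image: HxWx3 uint8 RGB array (assumed)
coarse, _, _ = predictor.predict(
    point_coords=points,              # (N, 2) xy pixel coordinates (assumed)
    point_labels=labels,              # (N,) 1 = foreground, 0 = background
    multimask_output=False)

# Requery: feed the self-constructed box back to SAM as an extra prompt.
box = mask_to_box(coarse[0])
refined, _, _ = predictor.predict(
    point_coords=points, point_labels=labels,
    box=box, multimask_output=False)
```

Looping this cycle, recomputing the box from the latest mask each time, reproduces the iterative loop of step 4.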
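
This summary states only that a contrastive loss aligns embeddings across iterations; the exact formulation is not given here, so the sketch below is one plausible InfoNCE‑style instantiation (an assumption, not the paper’s loss), with a stop‑gradient on the previous iteration’s embeddings to damp confirmation bias.

```python
import torch
import torch.nn.functional as F

def alignment_loss(curr: torch.Tensor, prev: torch.Tensor,
                   tau: float = 0.07) -> torch.Tensor:
    """InfoNCE-style consistency between per-object embeddings of two
    iterations: matching rows are positives, all other rows are negatives."""
    curr = F.normalize(curr, dim=-1)            # (N, D) current embeddings
    prev = F.normalize(prev.detach(), dim=-1)   # stop-grad on the old iterate
    logits = curr @ prev.t() / tau              # (N, N) similarity matrix
    targets = torch.arange(curr.size(0), device=curr.device)
    return F.cross_entropy(logits, targets)
```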

Results & Findings

  • On WHU, HRSID, and NWPU VHR‑10, ReSAM improves mean Intersection‑over‑Union (mIoU; see the metric sketch below) by 8–12% over out‑of‑the‑box SAM.
  • Compared with recent point‑supervised segmentation methods, ReSAM achieves 3–5% higher mIoU while using the same number of points.
  • Ablation studies confirm that each component (box re‑query, embedding reinforcement) contributes significantly; removing reinforcement drops performance by ~4%.
  • Qualitative examples show sharper object boundaries and better handling of the small, densely packed structures typical of satellite imagery (e.g., vehicles, buildings).
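
For reference, the mIoU behind these numbers is the standard per‑class intersection‑over‑union averaged across classes; a minimal NumPy version (not the paper’s evaluation code) is:

```python
import numpy as np

def mean_iou(pred: np.ndarray, gt: np.ndarray, num_classes: int) -> float:
    """Mean IoU over label maps, skipping classes absent from both maps."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious))
```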

Practical Implications

  • Rapid map creation – Urban planners can generate accurate building footprints from a few clicks, cutting annotation time from hours to minutes.
  • Disaster response – First responders can quickly delineate flood extents or fire perimeters with minimal input, enabling faster situational awareness.
  • Dataset scaling – Companies building large remote‑sensing datasets can bootstrap segmentation masks from point‑level crowdsourced labels, drastically reducing labeling costs.
  • Foundation model reuse – ReSAM demonstrates a recipe for adapting other foundation vision models (e.g., CLIP, DINO) to niche domains without full‑mask fine‑tuning.
  • Edge deployment – Because the loop runs on top of SAM’s existing encoder‑decoder, it can be integrated into existing GIS pipelines or even on‑device inference setups with modest compute overhead.

Limitations & Future Work

  • The method still depends on the quality of the initial points; poorly placed points yield suboptimal pseudo‑masks from which the loop may not recover.
  • Computational cost grows with the number of refinement iterations, which may be a bottleneck for very large satellite tiles.
  • The current reinforcement strategy uses a simple contrastive loss; more sophisticated uncertainty modeling could further reduce confirmation bias.
  • Future research could explore multi‑modal prompts (e.g., textual cues) or extend the loop to video‑based remote sensing where temporal consistency is crucial.

Authors

  • M. Naseer Subhani

Paper Information

  • arXiv ID: 2511.21606v1
  • Categories: cs.CV
  • Published: November 26, 2025
  • PDF: https://arxiv.org/pdf/2511.21606v1