[Paper] ReSAM: Refine, Requery, and Reinforce: Self-Prompting Point-Supervised Segmentation for Remote Sensing Images

Published: November 26, 2025, 12:26 PM EST
3 min read

Source: arXiv - 2511.21606v1

Overview

The paper introduces ReSAM, a self‑prompting framework that adapts the Segment Anything Model (SAM) to remote‑sensing imagery using only sparse point annotations. By iteratively refining pseudo‑masks, re‑querying SAM with self‑generated box prompts, and aligning embeddings across iterations, ReSAM substantially improves SAM’s performance on aerial and satellite images without requiring costly dense masks.

Key Contributions

  • Self‑prompting loop (Refine‑Requery‑Reinforce) that turns a few user‑provided points into progressively better segmentation masks.
  • Box‑prompt generation from coarse masks, enabling SAM to “re‑query” the image with richer spatial cues while remaining point‑supervised.
  • Embedding alignment across iterations to mitigate confirmation bias and keep the model from over‑fitting its own mistakes.
  • Domain‑agnostic adaptation that works on three diverse remote‑sensing benchmarks (WHU, HRSID, NWPU VHR‑10) and outperforms both the vanilla SAM and recent point‑supervised methods.
  • No dense mask supervision required, making the approach scalable to large satellite datasets where full annotations are prohibitive.

Methodology

  1. Initial Point Input (Refine) – The user supplies a handful of foreground/background points on an image. SAM produces a coarse pseudo‑mask based on these points.
  2. Self‑Constructed Box Prompt (Requery) – From the coarse mask, the system automatically extracts a tight bounding box and feeds it back to SAM as an additional prompt, so the model re‑segments the region with a richer spatial cue (see the first sketch after this list).
  3. Semantic Alignment (Reinforce) – Feature embeddings from the current iteration are compared with those from the previous one, and a contrastive loss encourages consistency while still allowing corrections, reducing the risk that early errors get reinforced (a possible loss is sketched after this list).
  4. Iterative Loop – Steps 1‑3 repeat a few times, each cycle producing a cleaner mask and a more reliable prompt set. The whole pipeline requires only the original point annotations; all other supervision is generated internally.
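
To make steps 1–2 concrete, here is a minimal sketch of one Refine‑Requery cycle using the public segment‑anything SamPredictor API. The checkpoint filename is the standard ViT‑B release; image, points, and labels are assumed inputs, and mask_to_box is an illustrative helper, not the paper’s code.

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

def mask_to_box(mask: np.ndarray) -> np.ndarray:
    """Tight (x0, y0, x1, y1) box around the foreground of a binary mask."""
    ys, xs = np.nonzero(mask)
    return np.array([xs.min(), ys.min(), xs.max(), ys.max()])

# Refine: SAM turns the sparse user points into a coarse pseudo-mask.
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
predictor = SamPredictor(sam)
predictor.set_image(image)            # image: HxWx3 uint8 RGB array (assumed)
coarse, _, _ = predictor.predict(
    point_coords=points,              # (N, 2) xy pixel coordinates (assumed)
    point_labels=labels,              # (N,) 1 = foreground, 0 = background
    multimask_output=False)

# Requery: feed the self-constructed box back to SAM as an extra prompt.
box = mask_to_box(coarse[0])
refined, _, _ = predictor.predict(
    point_coords=points, point_labels=labels,
    box=box, multimask_output=False)
```

Looping this cycle, recomputing the box from the latest mask each time, reproduces the iterative loop of step 4.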
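
This summary states only that a contrastive loss aligns embeddings across iterations; the exact formulation is not given here, so the sketch below is one plausible InfoNCE‑style instantiation (an assumption, not the paper’s loss), with a stop‑gradient on the previous iteration’s embeddings to damp confirmation bias.

```python
import torch
import torch.nn.functional as F

def alignment_loss(curr: torch.Tensor, prev: torch.Tensor,
                   tau: float = 0.07) -> torch.Tensor:
    """InfoNCE-style consistency between per-object embeddings of two
    iterations: matching rows are positives, all other rows are negatives."""
    curr = F.normalize(curr, dim=-1)            # (N, D) current embeddings
    prev = F.normalize(prev.detach(), dim=-1)   # stop-grad on the old iterate
    logits = curr @ prev.t() / tau              # (N, N) similarity matrix
    targets = torch.arange(curr.size(0), device=curr.device)
    return F.cross_entropy(logits, targets)
```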

Results & Findings

  • On WHU, HRSID, and NWPU VHR‑10, ReSAM improves mean Intersection‑over‑Union (mIoU; see the metric sketch below) by 8–12% over out‑of‑the‑box SAM.
  • Compared with recent point‑supervised segmentation methods, ReSAM achieves 3–5% higher mIoU while using the same number of points.
  • Ablation studies confirm that each component (box re‑query, embedding reinforcement) contributes significantly; removing reinforcement drops performance by ~4%.
  • Qualitative examples show sharper object boundaries and better handling of the small, densely packed structures typical of satellite imagery (e.g., vehicles, buildings).
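
For reference, the mIoU behind these numbers is the standard per‑class intersection‑over‑union averaged across classes; a minimal NumPy version (not the paper’s evaluation code) is:

```python
import numpy as np

def mean_iou(pred: np.ndarray, gt: np.ndarray, num_classes: int) -> float:
    """Mean IoU over label maps, skipping classes absent from both maps."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious))
```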

Practical Implications

  • Rapid map creation – Urban planners can generate accurate building footprints from a few clicks, cutting annotation time from hours to minutes.
  • Disaster response – First responders can quickly delineate flood extents or fire perimeters with minimal input, enabling faster situational awareness.
  • Dataset scaling – Companies building large remote‑sensing datasets can bootstrap segmentation masks from point‑level crowdsourced labels, drastically reducing labeling costs.
  • Foundation model reuse – ReSAM demonstrates a recipe for adapting other foundation vision models (e.g., CLIP, DINO) to niche domains without full‑mask fine‑tuning.
  • Edge deployment – Because the loop runs on top of SAM’s existing encoder‑decoder, it can be integrated into existing GIS pipelines or even on‑device inference setups with modest compute overhead.

Limitations & Future Work

  • The method still depends on the quality of the initial points; poorly placed points yield suboptimal pseudo‑masks from which the loop may not recover.
  • Computational cost grows with the number of refinement iterations, which may be a bottleneck for very large satellite tiles.
  • The current reinforcement strategy uses a simple contrastive loss; more sophisticated uncertainty modeling could further reduce confirmation bias.
  • Future research could explore multi‑modal prompts (e.g., textual cues) or extend the loop to video‑based remote sensing where temporal consistency is crucial.

Authors

  • M. Naseer Subhani

Paper Information

  • arXiv ID: 2511.21606v1
  • Categories: cs.CV
  • Published: November 26, 2025
  • PDF: https://arxiv.org/pdf/2511.21606v1