[Paper] SAM Guided Semantic and Motion Changed Region Mining for Remote Sensing Change Captioning

Published: November 26, 2025 at 09:11 AM EST
4 min read
Source: arXiv - 2511.21420v1

Overview

Remote‑sensing change captioning asks a model to look at two satellite images taken at different times and generate a natural‑language sentence describing what has changed (e.g., “a new building was constructed”). The paper introduces a pipeline that plugs the Segment Anything Model (SAM) into the captioning stack, giving the system an explicit sense of where changes happen and which objects are involved. This region‑level awareness yields state‑of‑the‑art performance on several benchmark datasets.

Key Contributions

  • SAM‑driven region mining: Uses the foundation model SAM to automatically segment both semantic (object‑level) and motion (temporal) change regions between the two images.
  • Hybrid feature fusion: Combines global CNN/Transformer visual embeddings, SAM‑derived region embeddings, and a knowledge graph of object attributes via cross‑attention.
  • Knowledge‑graph integration: Constructs a lightweight graph that injects prior information about typical remote‑sensing objects (roads, buildings, vegetation) into the caption generator.
  • Transformer decoder for captioning: Generates fluent change descriptions conditioned on the fused multi‑modal representation.
  • State‑of‑the‑art results: Sets new performance records on multiple public remote‑sensing change captioning benchmarks (e.g., LEVIR‑CC, WHU‑CD).

Methodology

  1. Global feature extraction – A backbone CNN or Vision Transformer processes each of the two input images, producing high‑level feature maps that capture overall scene context.
  2. Region extraction with SAM – The pre‑trained SAM model receives the image pair and produces two sets of masks:
    • Semantic masks that outline known object categories (buildings, roads, water).
    • Motion masks that highlight pixels whose appearance changes between timestamps.
      These masks are pooled into compact region embeddings (see the pooling sketch after this list).
  3. Knowledge graph construction – A small graph encodes relationships such as “building → has → roof” or “road → connects → intersection”. Nodes are linked to the region embeddings, providing semantic priors.
  4. Cross‑attention fusion – A multi‑head cross‑attention module lets the caption decoder attend simultaneously to global features, region embeddings, and graph node vectors, aligning spatial and temporal cues.
  5. Caption generation – A standard Transformer decoder, initialized with a language model head, autoregressively emits the change description token‑by‑token (see the fusion‑and‑decoding sketch below).
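Below is a minimal PyTorch sketch of how step 2's mask pooling could be implemented, assuming masked average pooling of a backbone feature map under each SAM mask; the tensor shapes and the helper `masks_to_region_embeddings` are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn.functional as F

def masks_to_region_embeddings(feat_map: torch.Tensor,
                               masks: torch.Tensor) -> torch.Tensor:
    """Pool a backbone feature map into one embedding per SAM mask.

    feat_map: (C, H, W) backbone features for one image.
    masks:    (R, Hm, Wm) binary masks from SAM (semantic or motion regions).
    Returns:  (R, C) compact region embeddings.
    """
    C, H, W = feat_map.shape
    # Resize masks to the feature-map resolution.
    masks = F.interpolate(masks.unsqueeze(1).float(), size=(H, W),
                          mode="nearest").squeeze(1)                  # (R, H, W)
    # Masked average pooling: each region embedding averages the features it covers.
    weights = masks / masks.sum(dim=(1, 2), keepdim=True).clamp(min=1e-6)
    return torch.einsum("rhw,chw->rc", weights, feat_map)             # (R, C)

# Example: 256-d features on a 32x32 grid, 5 SAM masks at 512x512 image resolution.
feat = torch.randn(256, 32, 32)
sam_masks = torch.rand(5, 512, 512) > 0.5
region_embs = masks_to_region_embeddings(feat, sam_masks)             # -> (5, 256)
```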

The whole pipeline is end‑to‑end trainable; only the SAM weights remain frozen, leveraging its zero‑shot segmentation capability without extra annotation.
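To make the fusion and decoding steps (3–5) concrete, here is a minimal PyTorch sketch in which the caption decoder cross‑attends to a single memory sequence built from global features, SAM region embeddings, and knowledge‑graph node vectors. The dimensions, toy graph nodes, and module choices are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

D = 256  # shared embedding width (assumed)

# Step 3 (toy): knowledge-graph nodes for a few typical remote-sensing concepts.
kg_nodes = ["building", "roof", "road", "intersection", "vegetation"]
kg_embed = nn.Embedding(len(kg_nodes), D)
node_vecs = kg_embed(torch.arange(len(kg_nodes)))                     # (K, D)

# Inputs produced by earlier stages (shapes are placeholders).
global_feats = torch.randn(32 * 32, D)   # flattened backbone features for the image pair
region_embs  = torch.randn(8, D)         # SAM-derived semantic + motion region embeddings

# Step 4: concatenate all cues into one memory the decoder can cross-attend to.
memory = torch.cat([global_feats, region_embs, node_vecs], dim=0).unsqueeze(0)  # (1, S, D)

# Step 5: a standard Transformer decoder emits the caption autoregressively.
vocab_size = 10_000
tok_embed = nn.Embedding(vocab_size, D)
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model=D, nhead=8, batch_first=True),
    num_layers=3,
)
lm_head = nn.Linear(D, vocab_size)

caption_so_far = torch.tensor([[1, 42, 57]])                           # <bos> + partial caption ids
tgt = tok_embed(caption_so_far)                                        # (1, T, D)
causal_mask = nn.Transformer.generate_square_subsequent_mask(tgt.size(1))
hidden = decoder(tgt, memory, tgt_mask=causal_mask)                    # cross-attention fusion
next_token_logits = lm_head(hidden[:, -1])                             # scores for the next word
```

Packing everything into one memory sequence is just one way to realize the cross‑attention fusion; separate attention streams per modality would be an equally plausible reading of the paper's description.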

Results & Findings

  • Quantitative gains: The proposed method improves CIDEr by ~7–10 points and BLEU‑4 by ~3–5 points over the previous best models on the LEVIR‑CC and WHU‑CD datasets (a minimal scoring sketch follows this list).
  • Ablation studies: Removing SAM‑derived masks drops performance by ~4 CIDEr points, confirming the importance of region‑level cues. Adding the knowledge graph yields an extra ~2 CIDEr improvement.
  • Qualitative insights: Visualizations show the model correctly isolates newly built structures and distinguishes them from seasonal vegetation changes, producing captions like “A new residential block appeared north of the highway.”
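For context, BLEU‑4 and CIDEr on captioning benchmarks are commonly computed with the pycocoevalcap toolkit; the sketch below uses placeholder captions and is not the authors' evaluation script.

```python
# pip install pycocoevalcap
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.cider.cider import Cider

# References and model outputs, keyed by image-pair id (placeholder data).
gts = {"pair_001": ["a new building was constructed near the road"],
       "pair_002": ["the vegetation area has decreased"]}
res = {"pair_001": ["a new building appears beside the road"],
       "pair_002": ["some vegetation was removed"]}

bleu_scores, _ = Bleu(4).compute_score(gts, res)   # BLEU-1 through BLEU-4
cider_score, _ = Cider().compute_score(gts, res)
print("BLEU-4:", bleu_scores[3], "CIDEr:", cider_score)
```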

Practical Implications

  • Rapid disaster assessment: Emergency responders can feed pre‑ and post‑event satellite imagery to obtain concise textual summaries of damaged infrastructure, speeding up situational awareness.
  • Urban planning & monitoring: City planners can automatically generate change logs (e.g., “A new parking lot was added”) for large‑scale GIS databases, reducing manual annotation effort.
  • Environmental tracking: Agencies monitoring deforestation or water‑body shrinkage can receive natural‑language alerts that are easier to parse than raw change maps.
  • Integration with existing pipelines: Because SAM is used as a plug‑and‑play module, developers can retrofit the approach onto existing remote‑sensing analytics stacks with minimal code changes, as sketched below.
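To illustrate the plug‑and‑play point, a SAM mask generator can be attached to an existing bitemporal pipeline in a few lines using the official segment_anything package; the checkpoint and image paths below are placeholders.

```python
# pip install segment-anything opencv-python
import cv2
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

# Load a frozen SAM backbone (placeholder checkpoint path).
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
mask_generator = SamAutomaticMaskGenerator(sam)

# Bitemporal pair from an existing analytics stack (placeholder file names).
img_t1 = cv2.cvtColor(cv2.imread("scene_t1.png"), cv2.COLOR_BGR2RGB)
img_t2 = cv2.cvtColor(cv2.imread("scene_t2.png"), cv2.COLOR_BGR2RGB)

# Zero-shot masks per timestamp; downstream code can compare them to mine changed regions.
masks_t1 = mask_generator.generate(img_t1)   # list of dicts with "segmentation", "area", ...
masks_t2 = mask_generator.generate(img_t2)
```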

Limitations & Future Work

  • Dependence on SAM quality: SAM may produce over‑segmented masks in low‑resolution or heavily cloud‑covered images, which can propagate errors to the captioning stage.
  • Scalability of the knowledge graph: The current graph covers a limited set of common objects; extending it to niche domains (e.g., agricultural crops) will require additional curation.
  • Temporal granularity: The method handles only a pair of timestamps; future work could explore multi‑temporal sequences to capture gradual changes.
  • Real‑time constraints: While inference is fast on a GPU, deploying on edge devices or low‑power platforms may need model compression or pruning techniques.

The authors plan to open‑source their code, which should accelerate adoption and enable the community to address these challenges.

Authors

  • Futian Wang
  • Mengqi Wang
  • Xiao Wang
  • Haowen Wang
  • Jin Tang

Paper Information

  • arXiv ID: 2511.21420v1
  • Categories: cs.CV, cs.AI
  • Published: November 26, 2025