[Paper] Prithvi-Complementary Adaptive Fusion Encoder (CAFE): unlocking full potential for flood inundation mapping

Published: January 5, 2026 at 01:07 PM EST
4 min read

Source: arXiv - 2601.02315v1

Overview

The paper introduces Prithvi‑Complementary Adaptive Fusion Encoder (CAFE), a hybrid architecture that marries a large‑scale Geo‑Foundation Model (Prithvi) with a lightweight CNN branch enriched by Convolutional Attention Modules. By fusing global, long‑range representations from the foundation model with fine‑grained local cues, CAFE pushes flood‑inundation mapping accuracy beyond both classic U‑Net baselines and other state‑of‑the‑art GFMs.

Key Contributions

  • Hybrid encoder design: Combines a pretrained Prithvi transformer encoder with a parallel residual CNN branch, enabling complementary learning of global context and local detail.
  • Convolutional Attention Modules (CAM): Integrated into the CNN path to dynamically weight spatial features, improving the capture of subtle flood boundaries.
  • Adapter‑based fine‑tuning: Uses lightweight adapter layers on top of Prithvi, keeping the massive backbone frozen while allowing rapid adaptation to new flood datasets.
  • Multi‑scale, multi‑level fusion: Features from both branches are merged at several decoder stages, preserving hierarchical information throughout the segmentation pipeline.
  • State‑of‑the‑art performance: Sets new IoU records on Sen1Floods11 (83.41) and FloodPlanet (64.70), outperforming strong baselines such as U‑Net, TerraMind, DOFA, and the original Prithvi model.
  • Open‑source release: Full code and pretrained adapters are publicly available, facilitating reproducibility and downstream experimentation.

Methodology

  1. Backbone selection – The authors start with Prithvi, a transformer‑based GFM pretrained on massive multi‑spectral satellite imagery. Its self‑attention layers excel at modeling long‑range spatial dependencies.
  2. Parallel CNN residual branch – A conventional ResNet‑style CNN processes the same input, but with Convolutional Attention Modules that learn channel‑wise and spatial attention maps, sharpening edge and texture cues that are often lost in transformer tokenization (a CAM sketch follows this list).
  3. Adapter layers – Instead of fine‑tuning the entire Prithvi model (costly in GPU memory and time), small trainable adapter modules are inserted between transformer blocks. This keeps the bulk of the pretrained weights intact while allowing the model to specialize on flood‑mapping data (an adapter sketch follows this list).
  4. Feature fusion – At multiple decoder stages, the transformer and CNN feature maps are upsampled to a common resolution and concatenated. A lightweight convolutional mixer then blends the two streams, letting the network decide how much global vs. local information to trust at each pixel (a fusion and loss sketch follows this list).
  5. Training regime – The combined encoder‑decoder is trained end‑to‑end on labeled flood masks using a standard cross‑entropy + Dice loss. Because the adapters are tiny, convergence is fast (≈ 2–3 epochs on Sen1Floods11) and the trainable parameter count stays modest compared to fine‑tuning the full transformer.
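
The paper's exact CAM layout is not reproduced in this summary. As a rough mental model, the following is a minimal PyTorch sketch of a CBAM‑style convolutional attention module (channel gating followed by spatial gating), which is the common pattern such modules follow; all class and parameter names are illustrative, not the authors' code.

```python
import torch
import torch.nn as nn

class ConvAttentionModule(nn.Module):
    """CBAM-style attention: channel gating followed by spatial gating.
    Illustrative sketch; the paper's exact CAM design may differ."""
    def __init__(self, channels, reduction=16, spatial_kernel=7):
        super().__init__()
        # Channel attention: squeeze spatially, re-weight each channel
        self.channel_mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
        )
        # Spatial attention: convolution over pooled channel statistics
        self.spatial_conv = nn.Conv2d(2, 1, spatial_kernel, padding=spatial_kernel // 2)

    def forward(self, x):
        # Channel attention map from average- and max-pooled descriptors
        avg = torch.mean(x, dim=(2, 3), keepdim=True)
        mx = torch.amax(x, dim=(2, 3), keepdim=True)
        ca = torch.sigmoid(self.channel_mlp(avg) + self.channel_mlp(mx))
        x = x * ca
        # Spatial attention map highlighting informative locations (e.g., flood boundaries)
        avg_sp = torch.mean(x, dim=1, keepdim=True)
        max_sp = torch.amax(x, dim=1, keepdim=True)
        sa = torch.sigmoid(self.spatial_conv(torch.cat([avg_sp, max_sp], dim=1)))
        return x * sa
```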
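
Step 3's adapter‑based fine‑tuning can be pictured as standard bottleneck adapters wrapped around frozen transformer blocks. The sketch below is a minimal version under that assumption, not the paper's exact adapter design; `BottleneckAdapter` and `AdaptedBlock` are hypothetical names.

```python
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Lightweight adapter: down-project, non-linearity, up-project, residual add."""
    def __init__(self, dim, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)  # start as an identity mapping
        nn.init.zeros_(self.up.bias)

    def forward(self, tokens):
        return tokens + self.up(self.act(self.down(tokens)))

class AdaptedBlock(nn.Module):
    """Wraps a pretrained (frozen) transformer block with a trainable adapter."""
    def __init__(self, block, dim):
        super().__init__()
        self.block = block
        for p in self.block.parameters():
            p.requires_grad = False  # keep the Prithvi weights frozen
        self.adapter = BottleneckAdapter(dim)

    def forward(self, tokens):
        return self.adapter(self.block(tokens))
```

Only the adapter parameters (and the CNN branch, decoder, and fusion layers) receive gradients, which is what keeps memory use and training time low.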
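
Steps 4–5 amount to upsample‑concatenate‑mix at each decoder stage plus a cross‑entropy + Dice objective. The sketch below shows one plausible form of the fusion mixer and loss; the channel counts, bilinear upsampling choice, and binary water‑vs‑background assumption are mine, not the paper's.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionMixer(nn.Module):
    """Blends transformer and CNN feature maps at one decoder stage."""
    def __init__(self, c_transformer, c_cnn, c_out):
        super().__init__()
        self.mix = nn.Sequential(
            nn.Conv2d(c_transformer + c_cnn, c_out, 3, padding=1),
            nn.BatchNorm2d(c_out),
            nn.ReLU(inplace=True),
        )

    def forward(self, f_transformer, f_cnn):
        # Upsample the coarser transformer features to the CNN resolution, then concatenate
        f_transformer = F.interpolate(f_transformer, size=f_cnn.shape[-2:],
                                      mode="bilinear", align_corners=False)
        return self.mix(torch.cat([f_transformer, f_cnn], dim=1))

def ce_dice_loss(logits, target, eps=1.0):
    """Cross-entropy plus soft Dice on the water class (binary segmentation assumed)."""
    ce = F.cross_entropy(logits, target)
    prob = logits.softmax(dim=1)[:, 1]       # probability of the water class
    tgt = (target == 1).float()
    inter = (prob * tgt).sum(dim=(1, 2))
    dice = 1 - (2 * inter + eps) / (prob.sum(dim=(1, 2)) + tgt.sum(dim=(1, 2)) + eps)
    return ce + dice.mean()
```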

Results & Findings

| Dataset | IoU (CAFE) | Best prior (baseline) | Δ vs. U‑Net |
| --- | --- | --- | --- |
| Sen1Floods11 (test) | 83.41 | Prithvi 82.50 / TerraMind 82.90 | +12.84 |
| Sen1Floods11 (hold‑out site) | 81.37 | Prithvi 72.42 / U‑Net 70.57 | +10.80 |
| FloodPlanet | 64.70 | Prithvi 2.0 61.91 / TerraMind 62.33 | +4.56 |

  • Global context from Prithvi captures the overall flood extent, while the CNN‑CAM branch sharpens riverbanks and small water patches, leading to higher Intersection‑over‑Union (IoU).
  • Adapter‑only fine‑tuning reduces training time and GPU memory by ~70 % compared to full transformer fine‑tuning, without sacrificing accuracy.
  • Ablation studies (not detailed here) show that removing either the CNN branch or the CAMs drops IoU by 2–3 pts, confirming the complementary nature of the two streams.

Practical Implications

  • Rapid deployment for disaster response – Agencies can fine‑tune the lightweight adapters on newly acquired SAR/optical data within hours, delivering up‑to‑date flood maps for emergency teams.
  • Scalable to other multi‑modal segmentation tasks – The fusion paradigm works wherever satellite data combines many spectral bands (e.g., land‑cover change, wildfire burn‑scar detection).
  • Reduced compute budget – By freezing the massive GFM and only training adapters, smaller cloud instances or on‑premise GPUs (8‑12 GB) suffice, lowering operational costs.
  • Plug‑and‑play architecture – Developers can swap the CNN branch for other lightweight backbones (e.g., MobileNet) or replace CAMs with newer attention mechanisms, tailoring the model to edge‑device constraints.
  • Open‑source code – The GitHub repo includes ready‑to‑run notebooks, pretrained adapters, and scripts for converting raw Sentinel‑1/2 tiles into the required multi‑channel tensors (a rough sketch of this conversion follows the list), accelerating integration into existing GIS pipelines.
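
For orientation only, here is a minimal sketch of what converting co‑registered Sentinel‑1/2 GeoTIFF tiles into a multi‑channel tensor typically looks like with rasterio; the file paths, band ordering, and min‑max normalization are placeholder assumptions, and the repo's own scripts define the real layout.

```python
import numpy as np
import rasterio
import torch

def load_stacked_tile(s1_path, s2_path):
    """Stack co-registered Sentinel-1 (e.g., VV/VH) and Sentinel-2 bands into one tensor.
    Paths, band order, and normalization are illustrative placeholders."""
    with rasterio.open(s1_path) as s1:
        sar = s1.read().astype(np.float32)       # shape: (n_sar_bands, H, W)
    with rasterio.open(s2_path) as s2:
        optical = s2.read().astype(np.float32)   # shape: (n_optical_bands, H, W)
    stack = np.concatenate([sar, optical], axis=0)
    # Simple per-channel min-max scaling; real pipelines often use dataset-wide statistics
    mins = stack.min(axis=(1, 2), keepdims=True)
    maxs = stack.max(axis=(1, 2), keepdims=True)
    stack = (stack - mins) / (maxs - mins + 1e-6)
    return torch.from_numpy(stack)               # (C, H, W) tensor ready for the encoder
```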

Limitations & Future Work

  • Domain specificity – The current adapters are tuned for flood inundation; performance on drastically different phenomena (e.g., urban heat islands) may require additional modality‑specific adapters.
  • Resolution trade‑off – While the fusion improves boundary precision, the model still operates at a fixed 10 m resolution; finer‑scale mapping would need higher‑resolution inputs or super‑resolution post‑processing.
  • Interpretability – The paper does not provide extensive visual explanations of what the CAMs attend to; future work could incorporate saliency maps to aid trust in critical decision‑making contexts.
  • Extending to time‑series – Flood dynamics evolve rapidly; integrating temporal attention (e.g., video transformers) could further boost early‑warning capabilities.

Overall, Prithvi‑CAFE demonstrates that a thoughtfully engineered hybrid of foundation models and classic CNNs can unlock practical performance gains for real‑world geospatial segmentation challenges.

Authors

  • Saurabh Kaushik
  • Lalit Maurya
  • Beth Tellman

Paper Information

  • arXiv ID: 2601.02315v1
  • Categories: cs.CV
  • Published: January 5, 2026