[Paper] Prithvi-Complementary Adaptive Fusion Encoder (CAFE): unlocking full potential for flood inundation mapping

Published: January 5, 2026 at 01:07 PM EST
4 min read

Source: arXiv - 2601.02315v1

Overview

The paper introduces Prithvi‑Complementary Adaptive Fusion Encoder (CAFE), a hybrid architecture that marries a large‑scale Geo‑Foundation Model (Prithvi) with a lightweight CNN branch enriched by Convolutional Attention Modules. By fusing global, long‑range representations from the foundation model with fine‑grained local cues, CAFE pushes flood‑inundation mapping accuracy beyond both classic U‑Net baselines and other state‑of‑the‑art GFMs.

Key Contributions

  • Hybrid encoder design: Combines a pretrained Prithvi transformer encoder with a parallel residual CNN branch, enabling complementary learning of global context and local detail.
  • Convolutional Attention Modules (CAM): Integrated into the CNN path to dynamically weight spatial features, improving the capture of subtle flood boundaries.
  • Adapter‑based fine‑tuning: Uses lightweight adapter layers on top of Prithvi, keeping the massive backbone frozen while allowing rapid adaptation to new flood datasets.
  • Multi‑scale, multi‑level fusion: Features from both branches are merged at several decoder stages, preserving hierarchical information throughout the segmentation pipeline.
  • State‑of‑the‑art performance: Sets new IoU records on Sen1Floods11 (83.41) and FloodPlanet (64.70), outperforming strong baselines such as U‑Net, TerraMind, DOFA, and the original Prithvi model.
  • Open‑source release: Full code and pretrained adapters are publicly available, facilitating reproducibility and downstream experimentation.

Methodology

  1. Backbone selection – The authors start with Prithvi, a transformer‑based GFM pretrained on massive multi‑spectral satellite imagery. Its self‑attention layers excel at modeling long‑range spatial dependencies.
  2. Parallel CNN residual branch – A conventional ResNet‑style CNN processes the same input, but with Convolutional Attention Modules that learn channel‑wise and spatial attention maps, sharpening edge and texture cues that are often lost in transformer tokenization (a CAM sketch follows this list).
  3. Adapter layers – Instead of fine‑tuning the entire Prithvi model (costly in GPU memory and time), small trainable adapter modules are inserted between transformer blocks. This keeps the bulk of the pretrained weights intact while allowing the model to specialize on flood‑mapping data (an adapter sketch follows this list).
  4. Feature fusion – At multiple decoder stages, the transformer and CNN feature maps are upsampled to a common resolution and concatenated. A lightweight convolutional mixer then blends the two streams, letting the network decide how much global vs. local information to trust at each pixel (a fusion and loss sketch follows this list).
  5. Training regime – The combined encoder‑decoder is trained end‑to‑end on labeled flood masks using a standard cross‑entropy + Dice loss. Because the adapters are tiny, convergence is fast (≈ 2–3 epochs on Sen1Floods11) and the trainable parameter count stays modest compared to fine‑tuning the full transformer.
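
The paper's exact CAM layout is not reproduced in this summary. As a rough mental model, the following is a minimal PyTorch sketch of a CBAM‑style convolutional attention module (channel gating followed by spatial gating), which is the common pattern such modules follow; all class and parameter names are illustrative, not the authors' code.

```python
import torch
import torch.nn as nn

class ConvAttentionModule(nn.Module):
    """CBAM-style attention: channel gating followed by spatial gating.
    Illustrative sketch; the paper's exact CAM design may differ."""
    def __init__(self, channels, reduction=16, spatial_kernel=7):
        super().__init__()
        # Channel attention: squeeze spatially, re-weight each channel
        self.channel_mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
        )
        # Spatial attention: convolution over pooled channel statistics
        self.spatial_conv = nn.Conv2d(2, 1, spatial_kernel, padding=spatial_kernel // 2)

    def forward(self, x):
        # Channel attention map from average- and max-pooled descriptors
        avg = torch.mean(x, dim=(2, 3), keepdim=True)
        mx = torch.amax(x, dim=(2, 3), keepdim=True)
        ca = torch.sigmoid(self.channel_mlp(avg) + self.channel_mlp(mx))
        x = x * ca
        # Spatial attention map highlighting informative locations (e.g., flood boundaries)
        avg_sp = torch.mean(x, dim=1, keepdim=True)
        max_sp = torch.amax(x, dim=1, keepdim=True)
        sa = torch.sigmoid(self.spatial_conv(torch.cat([avg_sp, max_sp], dim=1)))
        return x * sa
```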
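
Step 3's adapter‑based fine‑tuning can be pictured as standard bottleneck adapters wrapped around frozen transformer blocks. The sketch below is a minimal version under that assumption, not the paper's exact adapter design; `BottleneckAdapter` and `AdaptedBlock` are hypothetical names.

```python
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Lightweight adapter: down-project, non-linearity, up-project, residual add."""
    def __init__(self, dim, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)  # start as an identity mapping
        nn.init.zeros_(self.up.bias)

    def forward(self, tokens):
        return tokens + self.up(self.act(self.down(tokens)))

class AdaptedBlock(nn.Module):
    """Wraps a pretrained (frozen) transformer block with a trainable adapter."""
    def __init__(self, block, dim):
        super().__init__()
        self.block = block
        for p in self.block.parameters():
            p.requires_grad = False  # keep the Prithvi weights frozen
        self.adapter = BottleneckAdapter(dim)

    def forward(self, tokens):
        return self.adapter(self.block(tokens))
```

Only the adapter parameters (and the CNN branch, decoder, and fusion layers) receive gradients, which is what keeps memory use and training time low.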
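
Steps 4–5 amount to upsample‑concatenate‑mix at each decoder stage plus a cross‑entropy + Dice objective. The sketch below shows one plausible form of the fusion mixer and loss; the channel counts, bilinear upsampling choice, and binary water‑vs‑background assumption are mine, not the paper's.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionMixer(nn.Module):
    """Blends transformer and CNN feature maps at one decoder stage."""
    def __init__(self, c_transformer, c_cnn, c_out):
        super().__init__()
        self.mix = nn.Sequential(
            nn.Conv2d(c_transformer + c_cnn, c_out, 3, padding=1),
            nn.BatchNorm2d(c_out),
            nn.ReLU(inplace=True),
        )

    def forward(self, f_transformer, f_cnn):
        # Upsample the coarser transformer features to the CNN resolution, then concatenate
        f_transformer = F.interpolate(f_transformer, size=f_cnn.shape[-2:],
                                      mode="bilinear", align_corners=False)
        return self.mix(torch.cat([f_transformer, f_cnn], dim=1))

def ce_dice_loss(logits, target, eps=1.0):
    """Cross-entropy plus soft Dice on the water class (binary segmentation assumed)."""
    ce = F.cross_entropy(logits, target)
    prob = logits.softmax(dim=1)[:, 1]       # probability of the water class
    tgt = (target == 1).float()
    inter = (prob * tgt).sum(dim=(1, 2))
    dice = 1 - (2 * inter + eps) / (prob.sum(dim=(1, 2)) + tgt.sum(dim=(1, 2)) + eps)
    return ce + dice.mean()
```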

Results & Findings

| Dataset | IoU (CAFE) | Best prior (baseline) | Δ vs. U‑Net |
| --- | --- | --- | --- |
| Sen1Floods11 (test) | 83.41 | Prithvi 82.50 / TerraMind 82.90 | +12.84 |
| Sen1Floods11 (hold‑out site) | 81.37 | Prithvi 72.42 / U‑Net 70.57 | +10.80 |
| FloodPlanet | 64.70 | Prithvi 2.0 61.91 / TerraMind 62.33 | +4.56 |

  • Global context from Prithvi captures the overall flood extent, while the CNN‑CAM branch sharpens riverbanks and small water patches, leading to higher Intersection‑over‑Union (IoU).
  • Adapter‑only fine‑tuning reduces training time and GPU memory by ~70 % compared to full transformer fine‑tuning, without sacrificing accuracy.
  • Ablation studies (not detailed here) show that removing either the CNN branch or the CAMs drops IoU by 2–3 pts, confirming the complementary nature of the two streams.

Practical Implications

  • Rapid deployment for disaster response – Agencies can fine‑tune the lightweight adapters on newly acquired SAR/optical data within hours, delivering up‑to‑date flood maps for emergency teams.
  • Scalable to other multi‑modal segmentation tasks – The fusion paradigm works wherever satellite data combines many spectral bands (e.g., land‑cover change, wildfire burn‑scar detection).
  • Reduced compute budget – By freezing the massive GFM and only training adapters, smaller cloud instances or on‑premise GPUs (8‑12 GB) suffice, lowering operational costs.
  • Plug‑and‑play architecture – Developers can swap the CNN branch for other lightweight backbones (e.g., MobileNet) or replace CAMs with newer attention mechanisms, tailoring the model to edge‑device constraints.
  • Open‑source code – The GitHub repo includes ready‑to‑run notebooks, pretrained adapters, and scripts for converting raw Sentinel‑1/2 tiles into the required multi‑channel tensors (a rough sketch of this conversion follows the list), accelerating integration into existing GIS pipelines.
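
For orientation only, here is a minimal sketch of what converting co‑registered Sentinel‑1/2 GeoTIFF tiles into a multi‑channel tensor typically looks like with rasterio; the file paths, band ordering, and min‑max normalization are placeholder assumptions, and the repo's own scripts define the real layout.

```python
import numpy as np
import rasterio
import torch

def load_stacked_tile(s1_path, s2_path):
    """Stack co-registered Sentinel-1 (e.g., VV/VH) and Sentinel-2 bands into one tensor.
    Paths, band order, and normalization are illustrative placeholders."""
    with rasterio.open(s1_path) as s1:
        sar = s1.read().astype(np.float32)       # shape: (n_sar_bands, H, W)
    with rasterio.open(s2_path) as s2:
        optical = s2.read().astype(np.float32)   # shape: (n_optical_bands, H, W)
    stack = np.concatenate([sar, optical], axis=0)
    # Simple per-channel min-max scaling; real pipelines often use dataset-wide statistics
    mins = stack.min(axis=(1, 2), keepdims=True)
    maxs = stack.max(axis=(1, 2), keepdims=True)
    stack = (stack - mins) / (maxs - mins + 1e-6)
    return torch.from_numpy(stack)               # (C, H, W) tensor ready for the encoder
```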

Limitations & Future Work

  • Domain specificity – The current adapters are tuned for flood inundation; performance on drastically different phenomena (e.g., urban heat islands) may require additional modality‑specific adapters.
  • Resolution trade‑off – While the fusion improves boundary precision, the model still operates at a fixed 10 m resolution; finer‑scale mapping would need higher‑resolution inputs or super‑resolution post‑processing.
  • Interpretability – The paper does not provide extensive visual explanations of what the CAMs attend to; future work could incorporate saliency maps to aid trust in critical decision‑making contexts.
  • Extending to time‑series – Flood dynamics evolve rapidly; integrating temporal attention (e.g., video transformers) could further boost early‑warning capabilities.

Overall, Prithvi‑CAFE demonstrates that a thoughtfully engineered hybrid of foundation models and classic CNNs can unlock practical performance gains for real‑world geospatial segmentation challenges.

Authors

  • Saurabh Kaushik
  • Lalit Maurya
  • Beth Tellman

Paper Information

  • arXiv ID: 2601.02315v1
  • Categories: cs.CV
  • Published: January 5, 2026