[Paper] OpenEarthAgent: A Unified Framework for Tool-Augmented Geospatial Agents

Published: February 19, 2026 at 01:59 PM EST
5 min read
Source: arXiv (2602.17665v1)

Overview

OpenEarthAgent presents a new, unified framework that lets AI agents reason over satellite imagery the way a GIS analyst would—by chaining together specialized tools (e.g., NDVI calculators, vector overlays, map queries) while following a natural‑language instruction. By training on thousands of annotated reasoning traces, the system learns to produce step‑by‑step, tool‑driven solutions that are both accurate and interpretable, opening the door to reliable, multi‑modal geospatial assistants for developers and industry practitioners.

Key Contributions

  • Unified tool‑augmented architecture for geospatial reasoning that integrates vision, language, and GIS operations in a single agent.
  • Large, publicly released dataset: 14,538 training and 1,169 evaluation examples covering urban, environmental, disaster‑response, and infrastructure scenarios, with more than 100,000 annotated reasoning steps.
  • Supervised fine‑tuning on explicit reasoning trajectories, enabling the model to learn stable multi‑step logic and to invoke the correct GIS tool at each step.
  • Demonstrated performance gains over strong baselines and competitive results compared with recent open‑source and closed‑source multimodal models.
  • Interpretability by design: every decision is traceable to a concrete tool call (e.g., “compute NDVI for polygon X”), making debugging and compliance easier for real‑world deployments.

Methodology

  1. Data collection & annotation – Satellite images (multispectral, RGB, SAR) are paired with natural‑language queries (e.g., “Identify flood‑affected areas in the last 48 h”). Human annotators then produce a full reasoning trace: a sequence of tool calls (NDVI, raster clipping, vector buffering, etc.) and intermediate textual explanations.
  2. Tool library – A modular set of GIS primitives (index calculators, raster algebra, vector geometry ops, map‑style retrieval) is wrapped as API calls that the agent can invoke during inference.
  3. Model backbone – A vision‑language transformer (similar to Flamingo/BLIP‑2) processes the image and query, while a decoder predicts the next action in the trace (tool name + arguments) and optional explanatory text.
  4. Supervised fine‑tuning – The model is trained to mimic the human‑written traces using teacher‑forcing, encouraging it to learn the correct ordering of tool usage and to keep spatial context across steps.
  5. Inference – At test time the agent generates a trace autoregressively, executes each tool, feeds the tool’s output back into the model, and continues until a final answer is produced.
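The inference procedure in step 5 can be sketched as a simple generate‑execute‑feedback loop. The `Action` record, the `run_agent` helper, and the tool names below are illustrative assumptions, not the paper's actual API:

```python
# Sketch of the autoregressive tool-execution loop (assumed interface,
# not OpenEarthAgent's real implementation).
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Action:
    tool: str                 # name of the GIS tool the decoder selected
    args: dict = field(default_factory=dict)  # predicted arguments
    final: bool = False       # True when the agent emits its final answer

def run_agent(predict_next: Callable[[list], Action],
              tools: dict,
              max_steps: int = 16) -> list:
    """Generate a reasoning trace step by step: predict an action,
    execute the tool, and feed the result back as context."""
    trace = []
    for _ in range(max_steps):
        action = predict_next(trace)        # decoder picks tool + args
        if action.final:
            trace.append(("answer", action.args))
            break
        result = tools[action.tool](**action.args)  # execute GIS primitive
        trace.append((action.tool, result))  # tool output re-enters context
    return trace
```

Feeding each tool's output back into the model is what lets the agent keep spatial context across steps, as the training setup in step 4 encourages.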

Results & Findings

| Metric | OpenEarthAgent | Strong Baseline* | Recent Open‑Source Model |
|---|---|---|---|
| Exact‑match answer accuracy | 68.4 % | 58.7 % | 62.1 % |
| Tool‑call correctness (precision) | 91.2 % | 78.4 % | 84.5 % |
| Reasoning trace length (avg.) | 7.3 steps | 6.9 steps | 8.1 steps |
| Cross‑domain robustness (urban/env/disaster) | +7 % avg. gain | n/a | n/a |

*Baseline = a vision‑language model with a single “answer‑only” head, no tool augmentation.

Key takeaways

  • The tool‑augmented agent consistently outperforms a vanilla V‑L model, especially on tasks that require index calculations (e.g., NDVI, NBR).
  • High precision in tool selection shows the model learns to map linguistic cues (“vegetation health”) to the right GIS operation.
  • The trace‑based supervision yields interpretable pipelines that can be inspected or edited by a human analyst.
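The index calculations where the agent gains the most are standard band arithmetic. A minimal NDVI implementation, written here with numpy as an assumed stand-in for whatever raster backend the paper's tool library uses:

```python
import numpy as np

def ndvi(nir, red):
    """Normalized Difference Vegetation Index: (NIR - Red) / (NIR + Red).
    Values near +1 indicate dense, healthy vegetation; near 0, bare soil."""
    nir = np.asarray(nir, dtype=np.float64)
    red = np.asarray(red, dtype=np.float64)
    denom = nir + red
    safe = np.where(denom == 0, 1.0, denom)   # guard no-data pixels
    return np.where(denom == 0, 0.0, (nir - red) / safe)
```

NBR (Normalized Burn Ratio) follows the same pattern with the SWIR band in place of Red, which is why a single "index calculator" primitive covers both takeaways above.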

Practical Implications

  • Rapid prototyping of geospatial analytics – Developers can embed the agent in a web service to answer ad‑hoc queries like “show me the change in built‑up area over the past year” without writing custom GIS scripts.
  • Disaster response automation – First responders can query satellite feeds (“Where are the worst‑hit flood zones?”) and receive a ready‑to‑use raster mask generated by the agent’s tool chain.
  • Compliance & auditability – Because each decision is tied to a concrete tool call, organizations can log the full reasoning trace for regulatory review (e.g., environmental impact assessments).
  • Extensible ecosystem – The modular tool library means new remote‑sensing indices or vector operations can be added, and the same agent will learn to use them with minimal re‑training.
  • Lower barrier for GIS‑light teams – Small startups or municipal IT departments lacking in‑house GIS expertise can leverage the model as a “smart analyst” that bridges the gap between raw satellite data and actionable insights.
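The extensibility claim rests on the tool library being a flat registry of named primitives. One plausible (hypothetical) way to structure such a registry, so that a new index is plugged in without touching the agent itself:

```python
# Hypothetical decorator-based tool registry; names and structure are
# assumptions for illustration, not the paper's actual library.
TOOLS = {}

def tool(name):
    """Register a GIS primitive under a name the agent can predict."""
    def wrap(fn):
        TOOLS[name] = fn
        return fn
    return wrap

@tool("ndvi")
def ndvi(nir, red):
    return (nir - red) / (nir + red)

@tool("nbr")  # a new index added later, with no changes elsewhere
def nbr(nir, swir):
    return (nir - swir) / (nir + swir)

def invoke(name, **kwargs):
    """Dispatch a predicted tool call to its implementation."""
    return TOOLS[name](**kwargs)
```

Under this design, teaching the agent a new tool is mostly a data problem (adding traces that use it) rather than a code problem, consistent with the "minimal re‑training" claim above.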

Limitations & Future Work

  • Tool coverage – The current library focuses on common indices and basic vector ops; more advanced analyses (e.g., time‑series change detection, 3D point‑cloud processing) are not yet supported.
  • Scalability of reasoning traces – Very long or highly conditional workflows can cause error propagation; future work will explore hierarchical planning or retrieval‑augmented reasoning to keep traces robust.
  • Domain shift – The dataset is heavily curated; performance on completely unseen sensor modalities (e.g., hyperspectral, SAR‑interferometry) may degrade. Expanding training data and incorporating self‑supervised adaptation are planned.
  • Real‑time constraints – Each tool call incurs a round‑trip to a GIS backend, which can be a bottleneck for latency‑critical applications. Optimizing tool execution (e.g., batched raster operations, GPU‑accelerated GIS kernels) is an open research direction.
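The batched-raster idea mentioned in the last point amounts to vectorizing one index computation over many tiles in a single call instead of one backend round-trip per tile. A sketch, assuming a `(n_tiles, 2, H, W)` stack with band 0 = NIR and band 1 = Red (both shape and band order are illustrative assumptions):

```python
import numpy as np

def ndvi_batched(stack):
    """Compute NDVI for a whole stack of tiles in one vectorized pass.
    stack: float array of shape (n_tiles, 2, H, W); band 0 = NIR, band 1 = Red."""
    nir = stack[:, 0].astype(np.float64)
    red = stack[:, 1].astype(np.float64)
    denom = nir + red
    safe = np.where(denom == 0, 1.0, denom)   # guard zero-sum pixels
    return (nir - red) / safe                  # one pass over all tiles
```

Collapsing N tool calls into one amortizes the backend round-trip, which is the kind of optimization the authors flag as open work for latency-critical deployments.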

Overall, OpenEarthAgent demonstrates that grounding multimodal language models in concrete GIS tools yields both higher accuracy and interpretability, paving the way for practical AI assistants in the remote‑sensing and geospatial analytics space.

Authors

  • Akashah Shabbir
  • Muhammad Umer Sheikh
  • Muhammad Akhtar Munir
  • Hiyam Debary
  • Mustansar Fiaz
  • Muhammad Zaigham Zaheer
  • Paolo Fraccaro
  • Fahad Shahbaz Khan
  • Muhammad Haris Khan
  • Xiao Xiang Zhu
  • Salman Khan

Paper Information

  • arXiv ID: 2602.17665v1
  • Categories: cs.CV
  • Published: February 19, 2026