[Paper] GeoDiT: Point-Conditioned Diffusion Transformer for Satellite Image Synthesis
Source: arXiv - 2603.02172v1
Overview
GeoDiT is a new diffusion‑based transformer that can generate realistic satellite imagery from natural‑language prompts and a handful of user‑placed points. By letting developers specify just a few geolocated points together with descriptive text, the model produces high‑fidelity, geographically coherent images without the need for costly pixel‑wise masks or exhaustive annotations.
Key Contributions
- Point‑conditioned control: Introduces a lightweight conditioning scheme where each point carries a textual label (e.g., “river”, “urban area”), enabling fine‑grained spatial guidance.
- Adaptive local attention: A novel attention mechanism that dynamically narrows the transformer’s focus around the supplied points, improving both efficiency and fidelity.
- Domain‑specific design study: Systematic experiments on how to represent satellite imagery (e.g., RGB, multispectral) and geolocation (absolute coordinates vs. relative offsets) for optimal diffusion alignment.
- State‑of‑the‑art performance: Outperforms existing remote‑sensing generative models on standard benchmarks (e.g., SpaceNet, DeepGlobe) in terms of realism, diversity, and geographic consistency.
- Annotation‑friendly pipeline: Requires only a few point annotations per image, dramatically reducing data‑collection overhead compared with pixel‑level masks.
Methodology
GeoDiT builds on the diffusion‑transformer paradigm: a latent diffusion model iteratively denoises a random tensor into a satellite image while a transformer predicts the denoising direction. The key twist is the point‑conditioned input:
- Point queries – Each user‑provided point is encoded as a 2‑D coordinate plus a learned embedding of its associated text label.
- Adaptive local attention – In each transformer layer, attention scores are re‑weighted so that tokens near a point receive greater influence from that point’s query, while distant tokens attend more globally. This keeps the model’s receptive field focused where control is needed, yet still captures large‑scale landscape context.
- Training data – The authors curate a large corpus of satellite tiles paired with automatically generated point‑label pairs (e.g., using OpenStreetMap tags). The diffusion process is trained to reconstruct the original image given the noisy latent and the point set.
- Inference – At generation time, developers supply a textual prompt (e.g., “coastal city with a harbor”) and a few points (e.g., a point labeled “harbor” at the desired location). The model then synthesizes an image that respects both the global description and the local point constraints.
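The point encoding and attention re‑weighting described above can be sketched in a few lines of NumPy. This is a minimal, illustrative toy: coordinates are assumed normalized to [0, 1], the projection and label tables are random stand‑ins for learned parameters, and the Gaussian distance bias is one plausible realization of the paper’s adaptive local attention, not its exact formulation.

```python
import numpy as np

def encode_points(points, labels, label_vocab, d_model=64, seed=0):
    """Encode each user-placed point as a projected 2-D coordinate plus a
    label embedding. Weights are random stand-ins for learned parameters."""
    rng = np.random.default_rng(seed)
    coord_proj = rng.standard_normal((2, d_model)) * 0.02
    label_table = rng.standard_normal((len(label_vocab), d_model)) * 0.02
    coords = np.asarray(points, dtype=np.float64)            # (P, 2) in [0, 1]
    label_ids = np.array([label_vocab[lab] for lab in labels])
    return coords @ coord_proj + label_table[label_ids]      # (P, d_model)

def adaptive_local_attention(tokens, token_xy, point_queries, point_xy, sigma=0.1):
    """Toy single-head cross-attention from image tokens to point queries.
    A Gaussian distance bias makes tokens near a point attend almost
    entirely to that point -- a simplified sketch of the idea."""
    d = tokens.shape[-1]
    scores = tokens @ point_queries.T / np.sqrt(d)           # (T, P) content scores
    dist2 = ((token_xy[:, None, :] - point_xy[None, :, :]) ** 2).sum(-1)
    scores = scores - dist2 / (2.0 * sigma ** 2)             # locality bias
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)            # softmax over points
    return weights @ point_queries                           # (T, d_model) update
```

For example, a token sitting exactly on a point labeled “river” receives nearly all of its attention mass from the river query, while a token equidistant from two points blends their queries evenly.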
Results & Findings
- Quantitative gains: GeoDiT improves Fréchet Inception Distance (FID) by 12 % (lower is better) and Inception Score (IS) by 9 % over the previous best remote‑sensing diffusion model.
- Spatial fidelity: When evaluated on a held‑out set of point‑annotated images, the model places the requested objects within a mean error of < 5 pixels, confirming that point conditioning works as intended.
- Efficiency: Adaptive local attention reduces memory consumption by ~30 % and speeds up inference by ~25 % compared to a vanilla global‑attention transformer of the same size.
- Ablation studies: Experiments show that absolute geolocation embeddings outperform relative offsets, and that using multispectral (RGB + NIR) inputs yields modest but consistent quality gains.
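The spatial‑fidelity number above is a mean placement error in pixels. A minimal sketch of such a metric, assuming requested and detected object centers are available as pixel coordinates (the paper’s exact evaluation protocol is not reproduced here):

```python
import numpy as np

def mean_placement_error(requested_xy, detected_xy):
    """Mean Euclidean distance (in pixels) between where each object was
    requested via a point and where it appears in the generated image."""
    requested = np.asarray(requested_xy, dtype=np.float64)
    detected = np.asarray(detected_xy, dtype=np.float64)
    return float(np.linalg.norm(requested - detected, axis=1).mean())

# Example: two requested points whose generated objects landed a few pixels off.
err = mean_placement_error([[100, 50], [200, 220]], [[103, 54], [198, 223]])
```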
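The ablation’s winning choice, absolute geolocation embeddings, could be realized with a standard sinusoidal encoding of latitude and longitude. This is an illustrative assumption (function name, frequency schedule, and dimensions are hypothetical), not the paper’s exact encoding:

```python
import numpy as np

def absolute_geo_embedding(lat, lon, d_model=64, max_freq=1000.0):
    """Sinusoidal embedding of absolute (lat, lon) in degrees.
    Illustrative stand-in; the paper's exact encoding is not specified here."""
    half = d_model // 2                                     # dims per coordinate
    n_freq = half // 2
    freqs = max_freq ** (-np.arange(n_freq) / n_freq)       # geometric frequencies

    def enc(x):
        angles = np.deg2rad(x) * freqs
        return np.concatenate([np.sin(angles), np.cos(angles)])

    return np.concatenate([enc(lat), enc(lon)])             # (d_model,)
```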
Practical Implications
- Rapid prototyping of geospatial datasets: Developers can generate synthetic satellite imagery for training computer‑vision models (e.g., building detection, flood mapping) without manually labeling large image collections.
- Scenario planning & simulation: Urban planners or disaster‑response teams can quickly visualize “what‑if” scenarios (e.g., adding a new road, expanding a coastline) by placing a few points and letting GeoDiT render the surrounding landscape.
- Content creation for games and VR: Game studios can produce large, varied terrain textures anchored to specific landmarks, cutting down on manual asset creation.
- Reduced annotation cost: Point‑level labeling is far cheaper than pixel‑wise segmentation, making it feasible to scale up training data for remote‑sensing AI pipelines.
Limitations & Future Work
- Geographic generalization: The model was trained primarily on mid‑latitude regions; performance may degrade in polar or desert environments where training data is sparse.
- Resolution ceiling: Current experiments stop at 256 × 256 px; scaling to higher‑resolution satellite imagery (e.g., 1 m/pixel) will require architectural tweaks and more compute.
- Point density: While a few points work well, extremely dense point sets can overwhelm the adaptive attention mechanism, leading to artifacts.
- Future directions: The authors suggest extending GeoDiT to handle temporal conditioning (e.g., “image of this area in 2030”), integrating additional modalities such as elevation maps, and exploring hierarchical diffusion to reach higher resolutions.
Authors
- Srikumar Sastry
- Dan Cher
- Brian Wei
- Aayush Dhakal
- Subash Khanal
- Dev Gupta
- Nathan Jacobs
Paper Information
- arXiv ID: 2603.02172v1
- Categories: cs.CV
- Published: March 2, 2026