[Paper] EO-VAE: Towards A Multi-sensor Tokenizer for Earth Observation Data
Source: arXiv - 2602.12177v1
Overview
The paper introduces EO‑VAE, a single variational auto‑encoder that can tokenize a wide variety of Earth‑observation (EO) sensor data—ranging from multispectral optical imagery to radar and hyperspectral cubes. By handling many sensor modalities with one model, EO‑VAE paves the way for generative AI systems (e.g., diffusion or transformer‑based models) that can work directly on satellite data without needing a separate tokenizer for each instrument.
Key Contributions
- Unified multi‑sensor tokenizer: A single VAE that accepts arbitrary channel subsets (e.g., RGB, NIR, SAR) and learns a common latent space.
- Dynamic hypernetwork conditioning: The encoder/decoder weights are modulated on the fly by a lightweight hypernetwork that encodes the sensor’s spectral configuration, enabling flexible channel combinations.
- Improved reconstruction quality: On the TerraMesh benchmark, EO‑VAE outperforms the previously released TerraMind tokenizers across all tested modalities.
- Open‑source baseline: The authors release the trained model, training scripts, and a simple API for downstream generative tasks, establishing a reference point for future EO generative research.
Methodology
- Variational Auto‑Encoder backbone – A standard convolutional VAE is used to map high‑dimensional images to a compact latent code (≈ 256‑dimensional).
- Hypernetwork controller – For each input, a small MLP receives a sensor descriptor (list of wavelengths, spatial resolution, and modality flags). It outputs scaling vectors that adapt the main encoder/decoder’s convolutional kernels, effectively customizing the network for the given channel set.
- Training regime – The model is trained end‑to‑end on the TerraMesh dataset, which contains co‑registered scenes from multiple satellites (Sentinel‑2, Landsat‑8, Sentinel‑1 SAR, etc.). The loss combines the usual VAE reconstruction term, a KL‑divergence regularizer, and a spectral consistency penalty that encourages the latent space to be agnostic to the specific sensor ordering.
- Token extraction – After training, the encoder’s mean latent vector is quantized (e.g., using a learned codebook) to produce discrete tokens that can be fed to downstream generative models just like the tokens used in text‑to‑image diffusion pipelines.
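The hypernetwork conditioning described above can be sketched in a few lines of numpy. The descriptor layout, layer sizes, and the `1 + tanh` scaling used here are illustrative assumptions, not the paper's exact design; the point is the mechanism: a small MLP maps sensor metadata to per-filter scale factors that modulate shared convolution kernels.

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN, N_FILTERS = 16, 8

def hypernet(descriptor, w1, b1, w2, b2):
    """Map a sensor descriptor (wavelengths, resolution, modality flags)
    to one scaling factor per shared conv filter."""
    h = np.maximum(descriptor @ w1 + b1, 0.0)   # ReLU hidden layer
    return 1.0 + np.tanh(h @ w2 + b2)           # scales stay near 1, in (0, 2)

# Shared 3x3 conv filters of the main encoder (single input channel here).
shared_kernels = rng.normal(size=(N_FILTERS, 3, 3))

# Hypothetical descriptor: four band wavelengths (um), resolution (m), SAR flag.
descriptor = np.array([0.49, 0.56, 0.66, 0.84, 10.0, 0.0])

w1 = rng.normal(scale=0.1, size=(descriptor.size, HIDDEN)); b1 = np.zeros(HIDDEN)
w2 = rng.normal(scale=0.1, size=(HIDDEN, N_FILTERS));       b2 = np.zeros(N_FILTERS)

scales = hypernet(descriptor, w1, b1, w2, b2)
adapted = shared_kernels * scales[:, None, None]  # customize filters per sensor
```

Because only the small MLP is sensor-specific, swapping instruments means swapping a descriptor vector, not retraining the backbone.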
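The three-term training objective can be written out as follows; the MSE forms of the reconstruction and consistency terms and the `beta`/`gamma` weights are illustrative guesses, since the summary only names the terms.

```python
import numpy as np

def vae_loss(x, x_hat, mu, logvar, z_a, z_b, beta=1e-3, gamma=0.1):
    """Three-term objective: pixel reconstruction, KL regularizer, and a
    spectral-consistency penalty that pulls together the latents (z_a, z_b)
    of the same scene encoded under two different channel orderings."""
    recon = np.mean((x - x_hat) ** 2)
    kl = -0.5 * np.mean(1.0 + logvar - mu**2 - np.exp(logvar))
    consistency = np.mean((z_a - z_b) ** 2)
    return recon + beta * kl + gamma * consistency

# Perfect reconstruction, standard-normal posterior, order-invariant latents
# should give (near) zero loss.
x = np.ones((4, 4))
z = np.zeros(8)
loss = vae_loss(x, x, np.zeros(8), np.zeros(8), z, z)
```

Each term is non-negative, so the loss is zero only when all three objectives are satisfied simultaneously.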
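The token-extraction step reduces, at inference time, to a nearest-neighbour lookup against the learned codebook. A minimal sketch, assuming a VQ-style quantizer (the actual codebook size and distance metric are not specified in this summary):

```python
import numpy as np

def quantize(z, codebook):
    """Map each continuous mean latent in z (shape (n, d)) to the index
    of the closest codebook vector (codebook shape (k, d))."""
    d2 = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)          # discrete token ids, shape (n,)

codebook = np.eye(4)                  # toy 4-entry codebook in 4-D
z = np.array([[0.1, 0.9, 0.0, 0.0],  # nearest to codeword 1
              [0.0, 0.0, 0.0, 1.1]]) # nearest to codeword 3
tokens = quantize(z, codebook)        # -> array([1, 3])
```

The resulting integer ids play the same role as subword tokens in text models, which is what lets standard diffusion or transformer heads consume satellite imagery unchanged.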
Results & Findings
| Metric | EO‑VAE | TerraMind (per‑sensor) |
|---|---|---|
| PSNR (average across modalities) | 32.8 dB | 30.1 dB |
| SSIM (average) | 0.91 | 0.86 |
| Latent size (bits/pixel) | 0.45 | 0.58 |
| Cross‑sensor reconstruction (train on optical, test on SAR) | 0.78 SSIM | N/A (separate models) |
- Higher fidelity: EO‑VAE consistently reconstructs fine spatial details and preserves spectral signatures better than the baseline tokenizers.
- Compact representation: Because the hypernetwork shares most parameters, the overall model size is ~30 % smaller than training a distinct VAE per sensor.
- Cross‑modal robustness: The shared latent space enables reasonable reconstructions even when the model sees a sensor configuration it has not been explicitly trained on, hinting at a degree of sensor‑agnostic generalization.
Practical Implications
- Simplified pipelines: Developers building generative models for satellite imagery (e.g., cloud‑removal diffusion, synthetic SAR generation) can now rely on a single tokenizer instead of maintaining a zoo of per‑sensor encoders.
- Multi‑modal data fusion: Since EO‑VAE maps different sensors into a common latent space, downstream models can more easily learn relationships across modalities—useful for tasks like joint optical‑SAR segmentation or change detection.
- Edge deployment: The hypernetwork adds only a few kilobytes of per‑sensor metadata, making it feasible to run the tokenizer on‑board small satellites or ground stations with limited compute.
- Accelerated research: With an open‑source baseline, the community can focus on improving generative heads (e.g., diffusion, VQ‑GAN) rather than reinventing the tokenization layer for each new satellite mission.
Limitations & Future Work
- Spectral resolution ceiling: The current hypernetwork encodes sensor specs as a low‑dimensional vector; extremely high‑resolution hyperspectral data (hundreds of bands) still challenges reconstruction fidelity.
- Temporal dynamics omitted: EO‑VAE processes single frames; extending the architecture to handle time‑series (e.g., video‑style tokenization for repeat‑pass monitoring) is left for future work.
- Domain shift: While cross‑sensor tests are promising, performance drops when encountering sensors with drastically different noise characteristics (e.g., L‑band SAR vs. C‑band). Further regularization or domain‑adaptation strategies are needed.
Bottom line: EO‑VAE demonstrates that a single, adaptable VAE can serve as a universal tokenizer for the heterogeneous world of Earth‑observation data, opening the door for more unified and efficient generative AI solutions in remote sensing.
Authors
- Nils Lehmann
- Yi Wang
- Zhitong Xiong
- Xiaoxiang Zhu
Paper Information
- arXiv ID: 2602.12177v1
- Categories: cs.CV
- Published: February 12, 2026