[Paper] Vision-Language Model for Accurate Crater Detection
Source: arXiv - 2601.07795v1
Overview
The paper presents a new crater‑detection pipeline that leverages a vision‑language model (OWL‑v2) built on a Vision Transformer (ViT). By fine‑tuning this model with a parameter‑efficient Low‑Rank Adaptation (LoRA) strategy on high‑resolution lunar imagery, the authors achieve high recall (94 %) and solid precision (73 %) even under difficult lighting and terrain conditions—an important step toward safer lunar landings for ESA’s Argonaut mission.
Key Contributions
- Vision‑language model for planetary science – Adapts the state‑of‑the‑art OWL‑v2 (ViT + language encoder) to the crater detection problem, a first in lunar surface analysis.
- Parameter‑efficient fine‑tuning – Uses LoRA to inject a small set of trainable weights, keeping the massive pretrained backbone frozen and drastically reducing GPU memory and training time.
- Hybrid loss design – Combines Complete IoU (CIoU) for precise bounding‑box regression with a contrastive loss that encourages the model to separate crater vs. non‑crater patches in the joint visual‑text embedding space.
- High‑resolution, manually annotated dataset – Fine‑tunes on the IMPACT project’s curated LRO‑C DRC images, providing a reliable benchmark for future lunar CDA research.
- Robust performance across illumination extremes – Demonstrates consistent detection on images with harsh shadows, low contrast, and varied terrain roughness.
Methodology
- Backbone selection – The authors start from OWL‑v2, a multimodal transformer that embeds an image patch and a textual prompt (e.g., “crater”) in a shared embedding space. Its ViT encoder extracts rich visual features, while the language encoder supplies semantic guidance (a zero‑shot usage sketch follows this list).
- Low‑Rank Adaptation (LoRA) – Instead of retraining the entire transformer (hundreds of millions of parameters), LoRA injects a pair of small trainable rank‑r matrices into each attention layer. This reduces the number of updated parameters by >99 % and allows fine‑tuning on a single GPU (see the adapter sketch after this list).
- Dataset & labeling – The IMPACT dataset contains ~10 k manually annotated craters on LRO‑C DRC images (0.5 m/pixel). Each crater is represented by a tight bounding box and a class label (“crater”).
- Loss function – combines two complementary terms (a combined sketch follows this list):
  - CIoU loss penalizes misaligned bounding boxes, accounting for overlap, distance between centers, and aspect‑ratio consistency.
  - Contrastive loss pulls the visual embeddings of crater patches toward the textual “crater” token and away from non‑crater patches, improving classification confidence.
- Training pipeline – Images are tiled into 224 × 224 patches and fed to the frozen OWL‑v2 backbone; only the LoRA adapters are updated, for 30 epochs with AdamW. Early stopping is based on validation recall (a tiling helper is sketched after this list).
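
To make the backbone step concrete, here is a minimal zero‑shot sketch using the public OWL‑v2 checkpoint on Hugging Face. The checkpoint name, score threshold, and file name are illustrative assumptions, not the authors' exact configuration:

```python
import torch
from PIL import Image
from transformers import Owlv2Processor, Owlv2ForObjectDetection

# Public checkpoint (assumption: the paper's exact weights are not named here)
ckpt = "google/owlv2-base-patch16-ensemble"
processor = Owlv2Processor.from_pretrained(ckpt)
model = Owlv2ForObjectDetection.from_pretrained(ckpt)

image = Image.open("lunar_tile.png").convert("RGB")  # hypothetical image tile
texts = [["crater"]]                                 # the textual prompt
inputs = processor(text=texts, images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Convert raw outputs to (x1, y1, x2, y2) boxes in image coordinates
target_sizes = torch.tensor([image.size[::-1]])      # (height, width)
results = processor.post_process_object_detection(
    outputs=outputs, target_sizes=target_sizes, threshold=0.3  # threshold is illustrative
)
for box, score in zip(results[0]["boxes"], results[0]["scores"]):
    print(f"crater candidate at {box.tolist()} (score {score:.2f})")
```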
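The LoRA mechanism itself fits in a few lines. Below is a minimal sketch of a low‑rank adapter wrapped around a frozen linear projection; the rank and scaling values are hypothetical defaults (in practice one would typically use a library such as `peft`):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base layer plus a trainable low-rank update: y = Wx + (alpha/r) * B A x."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # the pretrained backbone stays frozen
        # A starts small-random, B starts at zero, so training begins
        # exactly at the pretrained model's behavior
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.lora_A.T @ self.lora_B.T)

# Usage: wrap e.g. the query/value projections of each attention layer
# layer.attn.q_proj = LoRALinear(layer.attn.q_proj, r=8)
```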
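A sketch of the hybrid loss is below. The CIoU term uses `torchvision`'s implementation; the contrastive term here is a simple binary cross‑entropy over cosine similarities to the "crater" text embedding, a stand‑in for the paper's exact formulation. The temperature `tau` and weight `lam` are assumed hyperparameters:

```python
import torch
import torch.nn.functional as F
from torchvision.ops import complete_box_iou_loss

def hybrid_loss(pred_boxes, gt_boxes, patch_embeds, text_embed, is_crater,
                tau=0.07, lam=1.0):
    """pred_boxes, gt_boxes: matched (N, 4) boxes as (x1, y1, x2, y2);
    patch_embeds: (M, D); text_embed: (D,); is_crater: (M,) floats in {0, 1}."""
    # CIoU jointly penalizes low overlap, center offset, and aspect-ratio mismatch
    box_loss = complete_box_iou_loss(pred_boxes, gt_boxes, reduction="mean")

    # Cosine similarity of each patch embedding to the "crater" text embedding
    sim = F.normalize(patch_embeds, dim=-1) @ F.normalize(text_embed, dim=-1)

    # Pull crater patches toward the text token, push non-crater patches away
    contrastive = F.binary_cross_entropy_with_logits(sim / tau, is_crater)

    return box_loss + lam * contrastive
```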
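Finally, a minimal tiling helper for the training pipeline. The tile size matches the 224 × 224 patches described above; the stride (and hence the overlap policy) and edge handling are assumptions, since the paper's exact scheme is not spelled out here:

```python
from PIL import Image

def tile_image(img: Image.Image, tile: int = 224, stride: int = 224):
    """Yield (left, top) offsets and fixed-size crops from a large scene.
    A stride smaller than the tile size produces overlapping tiles; crops
    that would run past the image edge are simply skipped in this sketch."""
    w, h = img.size
    for top in range(0, max(h - tile, 0) + 1, stride):
        for left in range(0, max(w - tile, 0) + 1, stride):
            yield (left, top), img.crop((left, top, left + tile, top + tile))

# Usage: for (x, y), patch in tile_image(scene, stride=112): ...
```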
Results & Findings
| Metric | Best value (on IMPACT test set) |
|---|---|
| Recall | 94.0 % (detects almost all true craters) |
| Precision | 73.1 % (reasonable false‑positive rate) |
| F1‑score | 0.82 |
| Inference speed | ~12 fps on an RTX 3090 (single‑image tile) |
- Visual inspection shows the model correctly identifies craters as small as 3 m in diameter and remains stable under strong shadows.
- Ablation studies confirm that LoRA contributes a ~2 % boost in recall while cutting training memory by ~80 %.
- Removing the contrastive component drops precision by ~8 %, highlighting the benefit of the multimodal signal.
Practical Implications
- Mission planning – Automated, high‑recall crater maps can be integrated into ESA’s landing‑site selection tools, reducing manual cartography workload and improving safety margins for the Argonaut lander.
- On‑board processing – The lightweight LoRA adapters make it feasible to run the model on edge‑class hardware (e.g., NVIDIA Jetson) for near‑real‑time hazard detection during descent.
- Cross‑domain reuse – The same vision‑language fine‑tuning pipeline can be applied to other planetary bodies (Mars, asteroids) or to related tasks such as boulder detection, rock classification, or terrain roughness estimation.
- Open‑source tooling – By exposing the LoRA weights and the CIoU‑contrastive loss implementation, developers can quickly prototype custom CDA solutions without retraining massive transformers from scratch (an adapter‑loading sketch follows this list).
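
As an illustration of how released adapters could be reused, a minimal loading sketch with the `peft` library follows. The adapter path is hypothetical; the paper's actual release location and format are not specified here:

```python
from peft import PeftModel
from transformers import Owlv2ForObjectDetection

# Base checkpoint and adapter path are assumptions for illustration
base = Owlv2ForObjectDetection.from_pretrained("google/owlv2-base-patch16-ensemble")
model = PeftModel.from_pretrained(base, "path/to/crater-lora-adapter")

# Optionally fold the low-rank updates into the base weights for deployment
model = model.merge_and_unload()
```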
Limitations & Future Work
- Precision ceiling – While recall is excellent, the 73 % precision indicates a non‑trivial false‑positive rate, especially for small, ambiguous features (e.g., shadows that mimic craters).
- Dataset bias – The IMPACT annotations focus on high‑resolution LRO‑C images; performance on lower‑resolution or different sensor modalities (e.g., SAR) remains untested.
- Scalability to full‑scene inference – The current tiling approach incurs overlap‑handling overhead; future work could explore end‑to‑end detection heads that output variable‑size masks.
- Temporal consistency – Incorporating multi‑temporal imagery could help disambiguate transient lighting effects from true topographic depressions.
The authors suggest extending the multimodal prompt set (e.g., “large crater”, “shallow pit”) and experimenting with larger LoRA ranks or hybrid adapters to push precision higher while keeping the model lightweight for spacecraft deployment.
Authors
- Patrick Bauer
- Marius Schwinning
- Florian Renk
- Andreas Weinmann
- Hichem Snoussi
Paper Information
- arXiv ID: 2601.07795v1
- Categories: cs.CV
- Published: January 12, 2026