[Paper] An assessment of data-centric methods for label noise identification in remote sensing data sets
Source: arXiv - 2603.16835v1
Overview
This paper investigates how well three data‑centric label‑noise detection methods work on remote‑sensing image datasets. By deliberately corrupting the ground‑truth labels at varying intensities (10‑70 %), the authors show that these techniques can both spot noisy annotations and boost downstream model performance, offering a practical roadmap for developers dealing with imperfect satellite or aerial imagery data.
Key Contributions
- Systematic benchmark of three label‑noise identification algorithms on two widely used remote‑sensing datasets.
- Comprehensive noise injection study covering symmetric, asymmetric, and class‑dependent noise types across a broad range of corruption levels.
- Quantitative analysis of how well each method isolates noisy samples and how that filtering translates into higher classification accuracy.
- Guidelines for selecting the most suitable method based on noise characteristics and project goals.
- Identification of research gaps in adapting data‑centric noise‑handling to the unique challenges of remote‑sensing imagery (e.g., high intra‑class variability, multi‑spectral data).
Methodology
- Datasets & Baselines – The authors use two benchmark remote‑sensing collections (e.g., a land‑cover scene classification set and an aerial object detection set). A standard convolutional neural network (CNN) serves as the baseline classifier.
- Synthetic Label Noise – They corrupt the true labels with three noise models, varying the noise level from 10 % up to 70 %:
  - Symmetric: any label can flip to any other with equal probability.
  - Asymmetric: flips follow a predefined confusion matrix (e.g., “forest” ↔ “grassland”).
  - Class‑dependent: certain classes are more prone to errors.
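As a concrete illustration, symmetric noise injection reduces to flipping a random subset of labels to a uniformly chosen different class. A minimal NumPy sketch of the general technique (function name and sampling details are ours, not taken from the paper):

```python
import numpy as np

def inject_symmetric_noise(labels, noise_rate, num_classes, seed=0):
    """Flip a `noise_rate` fraction of labels to a uniformly chosen
    *different* class; return the noisy labels and the flipped indices."""
    rng = np.random.default_rng(seed)
    noisy = labels.copy()
    n_flip = int(noise_rate * len(labels))
    flip_idx = rng.choice(len(labels), size=n_flip, replace=False)
    for i in flip_idx:
        # Sample uniformly from all classes except the current one.
        candidates = [c for c in range(num_classes) if c != noisy[i]]
        noisy[i] = rng.choice(candidates)
    return noisy, flip_idx

clean = np.zeros(1000, dtype=int)  # toy ground truth: all class 0
noisy, flipped = inject_symmetric_noise(clean, noise_rate=0.3, num_classes=5)
```

Asymmetric noise would replace the uniform draw with a row of a predefined confusion matrix, and class‑dependent noise would vary `noise_rate` per class.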
- Data‑Centric Methods Evaluated – three approaches are compared:
  - Loss‑Based Filtering (e.g., the small‑loss trick): assumes clean samples incur lower training loss.
  - Agreement‑Based Ensemble: trains multiple models and flags samples with low consensus.
  - Feature‑Space Outlier Detection: extracts deep features and applies clustering/outlier scoring to spot mislabeled points.
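The small‑loss trick in the first method amounts to a sort over per‑sample losses. A minimal sketch, assuming per‑sample cross‑entropy losses have already been recorded during training (names are illustrative, not the authors' code):

```python
import numpy as np

def small_loss_filter(per_sample_losses, keep_ratio):
    """Keep the `keep_ratio` fraction of samples with the smallest loss
    (presumed clean); flag the remainder as suspected noisy labels."""
    order = np.argsort(per_sample_losses)   # ascending: small loss first
    n_keep = int(keep_ratio * len(per_sample_losses))
    return order[:n_keep], order[n_keep:]   # (clean_idx, flagged_idx)

losses = np.array([0.10, 2.30, 0.20, 1.80, 0.05, 3.10])
clean_idx, flagged_idx = small_loss_filter(losses, keep_ratio=0.5)
```

In practice `keep_ratio` has to track an estimate of the true noise rate: set too high, noisy samples leak through; set too low, clean samples are discarded.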
- Evaluation Pipeline – For each noise setting, a method first flags a subset of suspected noisy labels; those samples are either removed or relabeled, and the classifier is retrained. Performance is measured by:
  - Noise‑identification quality (precision/recall of the flagged samples).
  - Task performance (overall classification accuracy, IoU, or F1 score).
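Because the noise is injected synthetically, the flipped indices are known, and identification precision/recall can be computed directly against them. A small self‑contained sketch (hypothetical helper, not the authors' code):

```python
def noise_id_metrics(flagged_idx, true_noisy_idx):
    """Precision/recall of flagged samples against the known injected noise."""
    flagged, noisy = set(flagged_idx), set(true_noisy_idx)
    tp = len(flagged & noisy)                        # correctly flagged
    precision = tp / len(flagged) if flagged else 0.0
    recall = tp / len(noisy) if noisy else 0.0
    return precision, recall

# Two of four flags are real noise; one injected error goes undetected.
p, r = noise_id_metrics(flagged_idx=[1, 3, 5, 7], true_noisy_idx=[1, 3, 8])
```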
Results & Findings
- Noise Identification – All three methods outperform random guessing, but their strengths differ:
  - Loss‑Based Filtering excels at low‑to‑moderate symmetric noise (≤30 %).
  - Agreement‑Based Ensemble is most robust to asymmetric and class‑dependent noise, maintaining >70 % precision even at 50 % corruption.
  - Feature‑Space Outlier Detection shines when the data have strong visual separability (e.g., distinct spectral signatures).
- Impact on Model Performance – Removing the identified noisy samples yields 5‑12 % absolute gains in classification accuracy compared to training on the corrupted set, with the biggest jumps observed at higher noise levels (≥50 %).
- Trade‑off – Aggressive filtering can discard too many clean samples, slightly hurting performance when noise is low; a calibrated threshold is essential.
- Best‑Practice Recommendation – For most remote‑sensing pipelines, a hybrid approach (combine loss‑based and agreement‑based signals) provides the most consistent improvements across noise types.
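The hybrid recommendation amounts to intersecting the two signals: a sample is flagged only when it both incurs a high loss and draws low ensemble consensus on its given label. A sketch under those assumptions (the quantile and agreement thresholds are illustrative, not the paper's calibrated values):

```python
import numpy as np

def hybrid_flags(losses, ensemble_preds, given_labels,
                 loss_quantile=0.7, min_agreement=0.5):
    """Flag indices where the loss falls in the top tail AND fewer than
    `min_agreement` of the ensemble members vote for the given label."""
    high_loss = losses > np.quantile(losses, loss_quantile)
    # Fraction of ensemble members agreeing with the (possibly noisy) label.
    agreement = (ensemble_preds == given_labels).mean(axis=0)
    return np.where(high_loss & (agreement < min_agreement))[0]

losses = np.array([0.1, 2.5, 0.3, 3.0])
preds = np.array([[0, 1, 0, 2],    # each row: one ensemble member's predictions
                  [0, 2, 0, 2],
                  [0, 1, 0, 1]])
labels = np.array([0, 0, 0, 0])
suspects = hybrid_flags(losses, preds, labels)
```

Requiring both signals to agree is what makes the combination conservative, which directly addresses the over‑filtering trade‑off noted above.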
Practical Implications
- Data‑Cleaning Pipelines – Developers can integrate these lightweight detection modules into existing training loops to automatically prune or flag suspect annotations before model deployment.
- Cost Savings – By pinpointing noisy labels, teams can focus human annotation effort on a small subset of problematic samples, reducing costly re‑labeling campaigns.
- Robust Model Deployment – In operational remote‑sensing applications (e.g., disaster mapping, agricultural monitoring), the ability to maintain high accuracy despite noisy crowdsourced or legacy labels translates to more reliable decision‑support tools.
- Tooling Compatibility – The evaluated methods rely on standard deep‑learning libraries (PyTorch/TensorFlow) and require only the model’s loss values, predictions, or feature embeddings—no specialized hardware or external datasets.
Limitations & Future Work
- Synthetic Noise Only – The study uses artificially injected label errors; real‑world noise patterns (e.g., systematic labeling bias) may behave differently.
- Scalability – Ensemble‑based agreement methods increase training time linearly with the number of models, which could be prohibitive for very large satellite datasets.
- Multi‑Modal Data – The experiments focus on RGB or multispectral imagery; extending to SAR, LiDAR, or fused modalities remains an open challenge.
- Adaptive Thresholding – Future research should explore self‑tuning mechanisms that adjust filtering aggressiveness based on observed noise levels, possibly via meta‑learning.
Bottom line: This work demonstrates that data‑centric label‑noise detection is not just an academic curiosity—it’s a practical lever for improving the reliability of remote‑sensing AI systems, and the provided guidelines give developers a clear starting point for integrating these techniques into production pipelines.
Authors
- Felix Kröber
- Genc Hoxha
- Ribana Roscher
Paper Information
- arXiv ID: 2603.16835v1
- Categories: cs.CV
- Published: March 17, 2026