[Paper] Reliable Mislabel Detection for Video Capsule Endoscopy Data

Published: February 6, 2026 at 01:33 PM EST
3 min read
Source: arXiv - 2602.06938v1

Overview

Deep learning models for medical imaging are only as good as the data they are trained on—yet high‑quality labels are scarce because they require expert physicians. This paper presents a systematic framework for detecting mislabeled samples in large Video Capsule Endoscopy (VCE) datasets, and demonstrates that cleaning the data boosts anomaly‑detection performance.

Key Contributions

  • A generic mislabel‑detection pipeline that works with any image‑ or video‑based medical dataset, requiring only the trained classifier’s confidence scores and a small validation set.
  • Application to the two largest public VCE datasets (the “Kvasir‑Capsule” and “Capsule‑Endoscopy” collections), each containing tens of thousands of low‑resolution frames.
  • Human‑in‑the‑loop verification: three board‑certified gastroenterologists re‑annotated the flagged samples, confirming that a substantial fraction were indeed mislabeled.
  • Quantitative improvement: after removing the identified noisy labels, state‑of‑the‑art anomaly detectors gained +5.2 AUC percentage points on both datasets compared with training on the original, noisy sets.
  • Open‑source release of the detection code and the cleaned annotation files, enabling reproducibility and immediate reuse by the community.

Methodology

  1. Train a baseline classifier (e.g., a ResNet‑50 or EfficientNet) on the original, potentially noisy dataset.
  2. Collect prediction confidences for every training sample using a k‑fold cross‑validation scheme, so that each sample is scored by models that never saw it during training.
  3. Score each sample with a mislabel likelihood based on two simple heuristics:
    • Low confidence (the model is uncertain even after seeing the sample many times).
    • High disagreement across folds (different models consistently predict a different class).
  4. Rank samples by this likelihood and pass the top‑N candidates to domain experts for manual review.
  5. Iteratively refine: after expert re‑annotation, retrain the classifier on the cleaned set and repeat the detection step if needed.

The approach deliberately avoids complex meta‑learning tricks; it leverages the already‑available model outputs, making it easy to plug into existing training pipelines.
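The cross‑validated scoring of steps 2–4 can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors’ released code: the function names and the exact way the two heuristics are combined into one score are assumptions.

```python
import numpy as np

def mislabel_scores(fold_probs, labels):
    """Score each sample's mislabel likelihood from out-of-fold predictions.

    fold_probs: (n_folds, n_samples, n_classes) softmax outputs, where each
                sample's predictions come from models that did not train on it.
    labels:     (n_samples,) integer labels as given in the (noisy) dataset.
    """
    mean_probs = fold_probs.mean(axis=0)                     # (n_samples, n_classes)
    confidence = mean_probs[np.arange(len(labels)), labels]  # p(given label)
    fold_preds = fold_probs.argmax(axis=2)                   # (n_folds, n_samples)
    disagreement = (fold_preds != labels).mean(axis=0)       # fraction of folds voting elsewhere
    # High score = low confidence in the given label + consistent cross-fold disagreement.
    return (1.0 - confidence) + disagreement

def top_candidates(scores, fraction=0.05):
    """Indices of the highest-scoring samples, to be passed to expert review."""
    n = max(1, int(len(scores) * fraction))
    return np.argsort(scores)[::-1][:n]
```

A frame whose given label receives low averaged confidence while most folds vote for another class rises to the top of the ranking and is queued for expert re‑annotation.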

Results & Findings

Dataset              Original AUC (anomaly detection)   Cleaned AUC   Relative Gain
Kvasir‑Capsule       0.842                              0.894         +6.2 %
Capsule‑Endoscopy    0.815                              0.867         +5.2 %

  • Mislabel rate: Roughly 8–10 % of the frames were flagged as suspicious; expert review confirmed that ≈70 % of those were indeed wrong labels.
  • Robustness: The detection pipeline performed consistently across two very different network architectures, indicating that the signal is not model‑specific.
  • Efficiency: Only the top 5 % of samples needed expert inspection to achieve the reported gains, keeping the manual effort manageable.
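To get a feel for the review budget these figures imply, a quick back‑of‑the‑envelope calculation helps. The dataset size below is a hypothetical round number (the paper only says "tens of thousands of frames"):

```python
# Hypothetical review-budget estimate from the reported rates.
n_frames = 40_000        # assumed dataset size (tens of thousands of frames)
review_fraction = 0.05   # only the top 5% of ranked samples go to experts
precision = 0.70         # ~70% of flagged frames confirmed mislabeled

reviewed = int(n_frames * review_fraction)   # frames experts must inspect
confirmed = int(reviewed * precision)        # expected true label errors found
```

Under these assumptions, experts inspect 2,000 frames and recover roughly 1,400 genuine label errors, a small fraction of the full annotation effort.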

Practical Implications

  • Cleaner training data → more reliable AI assistants for gastroenterologists, reducing false alarms in capsule‑endoscopy screenings.
  • Rapid quality‑control tool for any medical imaging consortium that aggregates data from multiple hospitals, helping to enforce annotation standards before model development.
  • Cost‑saving: By catching labeling errors early, institutions can avoid costly re‑annotation campaigns and accelerate regulatory‑grade model certification.
  • Generalizable to other domains (e.g., dermatology, radiology) where expert labeling is expensive and label noise is common.
  • Integration-friendly: The pipeline can be added as a post‑processing step in popular ML platforms (TensorFlow, PyTorch Lightning) without major code changes.
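As a post‑processing step, the cleaned annotations only need to be applied to the dataset before the next training run. The helper below is a minimal framework‑agnostic sketch (its name and the flagged/relabels inputs are illustrative, not from the paper):

```python
def apply_clean_annotations(samples, flagged, relabels):
    """Drop confirmed-bad frames and apply expert corrections before retraining.

    samples:  list of (frame_id, label) pairs from the original annotation file.
    flagged:  set of frame_ids experts confirmed wrong, with no agreed correction.
    relabels: dict mapping frame_id -> corrected label from expert re-annotation.
    """
    cleaned = []
    for frame_id, label in samples:
        if frame_id in flagged:
            continue  # remove confirmed mislabels that could not be corrected
        cleaned.append((frame_id, relabels.get(frame_id, label)))
    return cleaned
```

The returned list can feed any existing data loader unchanged, which is what makes this kind of cleaning easy to bolt onto an existing pipeline.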

Limitations & Future Work

  • The method relies on sufficiently expressive base models; extremely under‑fitted classifiers may not generate reliable confidence signals, limiting detection power.
  • Human verification remains a bottleneck; future research could explore semi‑automated relabeling using active learning to further reduce expert workload.
  • The study focused on binary anomaly detection (normal vs. abnormal frames). Extending the framework to multi‑class pathology labeling (e.g., ulcer, bleeding, polyp) is an open direction.
  • Real‑world deployment would need to handle streaming video data and class‑imbalance more aggressively—areas the authors plan to investigate next.

Authors

  • Julia Werner
  • Julius Oexle
  • Oliver Bause
  • Maxime Le Floch
  • Franz Brinkmann
  • Hannah Tolle
  • Jochen Hampe
  • Oliver Bringmann

Paper Information

  • arXiv ID: 2602.06938v1
  • Categories: cs.CV, cs.LG
  • Published: February 6, 2026