[Paper] Divide-then-Diagnose: Weaving Clinician-Inspired Contexts for Ultra-Long Capsule Endoscopy Videos
Source: arXiv - 2604.21814v1
Overview
Capsule endoscopy (CE) lets doctors “fly” a tiny camera through a patient’s gut, capturing hours‑long video of the gastrointestinal (GI) tract. While frame‑level detection of individual abnormalities is now fairly mature, turning an entire ultra‑long video into a concise, clinically useful report remains an open problem. This paper defines a new diagnosis‑driven video summarization task, releases the first real‑world CE dataset with report‑level annotations (VideoCAP), and proposes a clinician‑inspired pipeline (DiCE) that mimics how gastroenterologists actually read these videos.
Key Contributions
- New task definition: Diagnosis‑driven CE video summarization – automatically extract a handful of “evidence frames” that together support a correct diagnosis.
- VideoCAP dataset: 240 full‑length CE videos (≈ 30 GB total) annotated with both key evidence frames and the final clinical diagnosis, derived from authentic clinical reports.
- DiCE framework: A three‑stage system that (1) screens the raw video for candidate frames, (2) weaves candidates into coherent diagnostic contexts, and (3) converges multi‑frame evidence into clip‑level judgments.
- State‑of‑the‑art performance: DiCE outperforms existing video‑level classification and summarization baselines on both evidence‑frame recall and diagnostic accuracy.
- Open‑source release: Code, pretrained models, and annotation tools are made publicly available to spur further research.
Methodology
- Candidate Screening – A lightweight CNN scans the full‑length, hours‑long video at a low frame rate (≈ 1 fps) to flag frames that might contain pathology (e.g., unusual texture, color, or shape). This reduces the search space from tens of thousands of frames to a few hundred.
- Context Weaver – The screened frames are grouped into “contexts” using a temporal clustering algorithm that respects the natural reading workflow: clinicians first locate a suspicious region, then scroll forward/backward to see the lesion from multiple angles. The Weaver builds short clips (3‑5 seconds) that preserve the continuity of each potential lesion while discarding isolated noise.
- Evidence Converger – Each clip is fed to a transformer‑based encoder that aggregates visual cues across frames, producing a robust clip‑level representation. A lightweight classifier then predicts the presence of specific pathologies (e.g., ulcer, angiodysplasia). Finally, a decision‑fusion module combines predictions from all clips to output the overall diagnosis and selects the most representative frames as the final evidence set.
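The candidate‑screening stage can be sketched in a few lines. The code below is a minimal illustration, not the paper's implementation: `score_frame` is a toy stand‑in for the lightweight CNN, and frames are plain lists of pixel values.

```python
# Coarse candidate screening: sample the ultra-long video at ~1 fps and keep
# only frames whose anomaly score clears a threshold. `score_frame` is a toy
# stand-in for the paper's lightweight CNN; frames here are flat pixel lists.

def score_frame(frame):
    """Placeholder scorer: mean intensity (a real system would use a CNN)."""
    return sum(frame) / len(frame)

def screen_candidates(frames, native_fps=30, target_fps=1, threshold=0.5):
    """Return (frame_index, score) pairs for sampled frames above threshold."""
    stride = max(1, native_fps // target_fps)  # e.g. keep 1 of every 30 frames
    return [(i, score_frame(frames[i]))
            for i in range(0, len(frames), stride)
            if score_frame(frames[i]) > threshold]

# Synthetic "video": 300 dark frames with a bright patch around frame 150.
video = [[0.1] * 16 for _ in range(300)]
for i in range(140, 160):
    video[i] = [0.9] * 16

hits = screen_candidates(video)
print(hits)  # only the sampled frame inside the bright patch survives
```

The coarse‑to‑fine idea is that this cheap pass runs on every sampled frame, so the expensive downstream stages only ever see a few hundred candidates.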
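The Context Weaver's temporal grouping can be sketched as simple gap‑based clustering. The parameter values (`max_gap_s`, `min_len`) are illustrative assumptions, not the paper's actual rule.

```python
# Gap-based temporal clustering for the Context Weaver: candidate frame
# indices closer than `max_gap_s` seconds are woven into one clip; clips with
# fewer than `min_len` frames are dropped as isolated noise.

def weave_contexts(candidate_indices, fps=30, max_gap_s=2.0, min_len=2):
    max_gap = int(max_gap_s * fps)
    clips, current = [], []
    for idx in sorted(candidate_indices):
        if current and idx - current[-1] > max_gap:
            if len(current) >= min_len:
                clips.append(current)
            current = []
        current.append(idx)
    if len(current) >= min_len:
        clips.append(current)
    return clips

# Two dense lesion clusters plus one isolated false positive (index 5000).
cands = [300, 330, 360, 5000, 9000, 9030]
clips = weave_contexts(cands)
print(clips)  # → [[300, 330, 360], [9000, 9030]]
```

Dropping single‑frame clips mirrors the paper's point that isolated detections are usually noise, while a lesion seen across consecutive frames is worth keeping.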
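The Evidence Converger can likewise be sketched in miniature: mean‑pooling stands in for the transformer encoder and a per‑pathology maximum stands in for the decision‑fusion module, both deliberate simplifications; the label set and feature values are hypothetical.

```python
# Clip-level evidence convergence, heavily simplified: mean-pooling replaces
# the paper's transformer aggregation, and per-pathology probabilities are
# fused across clips by taking the maximum (any strong clip can trigger a
# diagnosis).

PATHOLOGIES = ["ulcer", "angiodysplasia"]  # illustrative label set

def encode_clip(clip_features):
    """Aggregate per-frame feature vectors into one clip representation."""
    dim = len(clip_features[0])
    return [sum(f[d] for f in clip_features) / len(clip_features)
            for d in range(dim)]

def classify_clip(clip_repr):
    """Toy classifier: each feature dimension maps to one pathology score."""
    return {p: clip_repr[i] for i, p in enumerate(PATHOLOGIES)}

def converge(clips, threshold=0.5):
    """Fuse clip-level predictions into a video-level diagnosis set."""
    fused = {p: 0.0 for p in PATHOLOGIES}
    for clip in clips:
        preds = classify_clip(encode_clip(clip))
        for p, s in preds.items():
            fused[p] = max(fused[p], s)
    return {p: s for p, s in fused.items() if s >= threshold}

# Two clips of per-frame "features" (one value per pathology, for brevity).
clips = [
    [[0.8, 0.1], [0.9, 0.2]],  # strong ulcer evidence
    [[0.1, 0.3], [0.2, 0.2]],  # weak everywhere
]
result = converge(clips)
print(result)  # only "ulcer" crosses the threshold
```

Max‑fusion reflects the clinical intuition that one clearly abnormal clip suffices for a positive finding, even if most of the video is unremarkable.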
The whole pipeline runs end‑to‑end on a single GPU in under 2 minutes per video, making it practical for clinical deployment.
Results & Findings
| Metric | DiCE | Best Baseline (ViViT) | Relative Gain |
|---|---|---|---|
| Evidence‑frame recall @ 5 frames | 0.78 | 0.52 | +50% |
| Diagnosis accuracy (top‑1) | 0.91 | 0.84 | +8% |
| Summary length (frames) | 7.3 ± 1.2 | 14.8 ± 3.5 | 50% fewer frames |
| Inference time (per video) | 1.8 min | 4.3 min | 2.4× faster |
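Evidence‑frame recall@k, the first metric in the table, can be computed as below. This is one plausible reading of the metric (exact frame‑index match); the paper may instead allow a temporal tolerance, and the frame indices here are made up.

```python
# Evidence-frame recall@k: the fraction of expert-annotated evidence frames
# that appear among the model's top-k selected frames (exact-match variant).

def recall_at_k(predicted_frames, gt_frames, k=5):
    top_k = set(predicted_frames[:k])
    hits = sum(1 for g in gt_frames if g in top_k)
    return hits / len(gt_frames)

predicted = [1520, 877, 2034, 411, 990, 1600]  # model ranking (hypothetical)
ground_truth = [877, 990, 1601, 2034]          # expert evidence frames

r = recall_at_k(predicted, ground_truth, k=5)
print(r)  # 3 of 4 ground-truth frames are in the top 5 → 0.75
```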
Key takeaways:
- Contextual reasoning (grouping frames into coherent clips) is crucial; naïve frame‑wise classifiers miss subtle lesions that become clear only when viewed as a short sequence.
- The candidate screening step reduces computational load without sacrificing recall, proving that a coarse‑to‑fine strategy works well for ultra‑long medical videos.
- DiCE’s evidence frames align closely with those selected by expert gastroenterologists (Cohen’s κ = 0.73), indicating strong clinical relevance.
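The reported κ = 0.73 is Cohen's kappa, a chance‑corrected agreement statistic. The snippet below shows the standard formula for two binary raters; the ratings are made‑up data, and the paper's exact evaluation protocol is not reproduced here.

```python
# Cohen's kappa for frame-level agreement between a model and an expert:
# 1 = frame selected as evidence, 0 = not selected.

def cohens_kappa(a, b):
    """Chance-corrected agreement between two binary rating sequences."""
    n = len(a)
    po = sum(1 for x, y in zip(a, b) if x == y) / n  # observed agreement
    pa1, pb1 = sum(a) / n, sum(b) / n                # marginal "yes" rates
    pe = pa1 * pb1 + (1 - pa1) * (1 - pb1)           # chance agreement
    return (po - pe) / (1 - pe)

model  = [1, 1, 0, 0, 1, 0, 0, 0, 1, 0]
expert = [1, 1, 0, 0, 0, 0, 0, 1, 1, 0]

kappa = cohens_kappa(model, expert)
print(round(kappa, 2))  # → 0.58
```

A κ near 0 means agreement no better than chance, while values around 0.7, as reported for DiCE, indicate substantial agreement with the expert selections.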
Practical Implications
- Accelerated workflow: Endoscopy reading teams can review a 5‑minute summary instead of a 30‑minute raw video, cutting reading time by > 50 % while preserving diagnostic confidence.
- Decision support: The system can flag high‑risk videos for immediate review, helping prioritize urgent cases in busy endoscopy units.
- Tele‑medicine & AI‑assisted screening: Deploying DiCE on edge devices (e.g., hospital servers) enables remote specialists to receive concise diagnostic packets, facilitating second opinions without transferring massive video files.
- Training & education: The evidence‑frame annotations serve as a valuable teaching aid for junior clinicians learning to spot subtle GI lesions.
- Regulatory pathway: Because DiCE mirrors the human reading process and provides traceable evidence frames, it aligns well with emerging AI‑medical device guidelines that demand explainability.
Limitations & Future Work
- Dataset size & diversity: VideoCAP, while the largest of its kind, still covers a limited set of pathologies and patient demographics; broader multi‑center collections are needed to validate generalization.
- Rare lesions: Extremely infrequent findings (e.g., small submucosal tumors) remain challenging due to insufficient training examples.
- Real‑time constraints: Although inference is fast, true real‑time processing (as the capsule streams data) would require further optimization or dedicated hardware accelerators.
- Explainability depth: Current evidence frames are visual; integrating textual explanations derived from the original clinical reports could improve interpretability for non‑specialists.
Future research directions include expanding VideoCAP with multi‑modal data (e.g., patient history, lab results), exploring self‑supervised pretraining on unlabeled CE footage, and adapting the DiCE paradigm to other ultra‑long medical video domains such as colonoscopy or intra‑operative endoscopy.
Authors
- Bowen Liu
- Li Yang
- Shanshan Song
- Mingyu Tang
- Zhifang Gao
- Qifeng Chen
- Yangqiu Song
- Huimin Chen
- Xiaomeng Li
Paper Information
- arXiv ID: 2604.21814v1
- Categories: cs.CV, cs.AI
- Published: April 23, 2026