[Paper] Divide-then-Diagnose: Weaving Clinician-Inspired Contexts for Ultra-Long Capsule Endoscopy Videos
Source: arXiv - 2604.21814v1
Overview
Capsule endoscopy (CE) lets doctors “fly” a tiny camera through a patient’s gut, capturing hours‑long video of the gastrointestinal (GI) tract. While frame‑level detection of individual abnormalities is now fairly mature, turning an entire ultra‑long video into a concise, clinically useful report remains an open problem. This paper defines a new diagnosis‑driven video summarization task, releases the first real‑world CE dataset with report‑level annotations (VideoCAP), and proposes a clinician‑inspired pipeline (DiCE) that mimics how gastroenterologists actually read these videos.
Key Contributions
- New task definition: Diagnosis‑driven CE video summarization – automatically extract a handful of “evidence frames” that together support a correct diagnosis.
- VideoCAP dataset: 240 full‑length CE videos (≈ 30 GB total) annotated with both key evidence frames and the final clinical diagnosis, derived from authentic clinical reports.
- DiCE framework: A three‑stage system that (1) screens the raw video for candidate frames, (2) weaves candidates into coherent diagnostic contexts, and (3) converges multi‑frame evidence into clip‑level judgments.
- State‑of‑the‑art performance: DiCE outperforms existing video‑level classification and summarization baselines on both evidence‑frame recall and diagnostic accuracy.
- Open‑source release: Code, pretrained models, and annotation tools are made publicly available to spur further research.
Methodology
- Candidate Screening – A lightweight CNN scans the full‑length, hours‑long video at a low frame rate (≈ 1 fps) to flag frames that might contain pathology (e.g., unusual texture, color, or shape). This reduces the search space from tens of thousands of frames to a few hundred.
- Context Weaver – The screened frames are grouped into “contexts” using a temporal clustering algorithm that respects the natural reading workflow: clinicians first locate a suspicious region, then scroll forward/backward to see the lesion from multiple angles. The Weaver builds short clips (3‑5 seconds) that preserve the continuity of each potential lesion while discarding isolated noise.
- Evidence Converger – Each clip is fed to a transformer‑based encoder that aggregates visual cues across frames, producing a robust clip‑level representation. A lightweight classifier then predicts the presence of specific pathologies (e.g., ulcer, angiodysplasia). Finally, a decision‑fusion module combines predictions from all clips to output the overall diagnosis and selects the most representative frames as the final evidence set.
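The candidate‑screening stage can be sketched in a few lines. The code below is a minimal illustration, not the paper's implementation: `score_frame` is a toy stand‑in for the lightweight CNN, and frames are plain lists of pixel values.

```python
# Coarse candidate screening: sample the ultra-long video at ~1 fps and keep
# only frames whose anomaly score clears a threshold. `score_frame` is a toy
# stand-in for the paper's lightweight CNN; frames here are flat pixel lists.

def score_frame(frame):
    """Placeholder scorer: mean intensity (a real system would use a CNN)."""
    return sum(frame) / len(frame)

def screen_candidates(frames, native_fps=30, target_fps=1, threshold=0.5):
    """Return (frame_index, score) pairs for sampled frames above threshold."""
    stride = max(1, native_fps // target_fps)  # e.g. keep 1 of every 30 frames
    return [(i, score_frame(frames[i]))
            for i in range(0, len(frames), stride)
            if score_frame(frames[i]) > threshold]

# Synthetic "video": 300 dark frames with a bright patch around frame 150.
video = [[0.1] * 16 for _ in range(300)]
for i in range(140, 160):
    video[i] = [0.9] * 16

hits = screen_candidates(video)
print(hits)  # only the sampled frame inside the bright patch survives
```

The coarse‑to‑fine idea is that this cheap pass runs on every sampled frame, so the expensive downstream stages only ever see a few hundred candidates.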
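The Context Weaver's temporal grouping can be sketched as simple gap‑based clustering. The parameter values (`max_gap_s`, `min_len`) are illustrative assumptions, not the paper's actual rule.

```python
# Gap-based temporal clustering for the Context Weaver: candidate frame
# indices closer than `max_gap_s` seconds are woven into one clip; clips with
# fewer than `min_len` frames are dropped as isolated noise.

def weave_contexts(candidate_indices, fps=30, max_gap_s=2.0, min_len=2):
    max_gap = int(max_gap_s * fps)
    clips, current = [], []
    for idx in sorted(candidate_indices):
        if current and idx - current[-1] > max_gap:
            if len(current) >= min_len:
                clips.append(current)
            current = []
        current.append(idx)
    if len(current) >= min_len:
        clips.append(current)
    return clips

# Two dense lesion clusters plus one isolated false positive (index 5000).
cands = [300, 330, 360, 5000, 9000, 9030]
clips = weave_contexts(cands)
print(clips)  # → [[300, 330, 360], [9000, 9030]]
```

Dropping single‑frame clips mirrors the paper's point that isolated detections are usually noise, while a lesion seen across consecutive frames is worth keeping.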
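The Evidence Converger can likewise be sketched in miniature: mean‑pooling stands in for the transformer encoder and a per‑pathology maximum stands in for the decision‑fusion module, both deliberate simplifications; the label set and feature values are hypothetical.

```python
# Clip-level evidence convergence, heavily simplified: mean-pooling replaces
# the paper's transformer aggregation, and per-pathology probabilities are
# fused across clips by taking the maximum (any strong clip can trigger a
# diagnosis).

PATHOLOGIES = ["ulcer", "angiodysplasia"]  # illustrative label set

def encode_clip(clip_features):
    """Aggregate per-frame feature vectors into one clip representation."""
    dim = len(clip_features[0])
    return [sum(f[d] for f in clip_features) / len(clip_features)
            for d in range(dim)]

def classify_clip(clip_repr):
    """Toy classifier: each feature dimension maps to one pathology score."""
    return {p: clip_repr[i] for i, p in enumerate(PATHOLOGIES)}

def converge(clips, threshold=0.5):
    """Fuse clip-level predictions into a video-level diagnosis set."""
    fused = {p: 0.0 for p in PATHOLOGIES}
    for clip in clips:
        preds = classify_clip(encode_clip(clip))
        for p, s in preds.items():
            fused[p] = max(fused[p], s)
    return {p: s for p, s in fused.items() if s >= threshold}

# Two clips of per-frame "features" (one value per pathology, for brevity).
clips = [
    [[0.8, 0.1], [0.9, 0.2]],  # strong ulcer evidence
    [[0.1, 0.3], [0.2, 0.2]],  # weak everywhere
]
result = converge(clips)
print(result)  # only "ulcer" crosses the threshold
```

Max‑fusion reflects the clinical intuition that one clearly abnormal clip suffices for a positive finding, even if most of the video is unremarkable.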
The whole pipeline runs end‑to‑end on a single GPU in under 2 minutes per video, making it practical for clinical deployment.
Results & Findings
| Metric | DiCE | Best Baseline (ViViT) | Relative Gain |
|---|---|---|---|
| Evidence‑frame recall @ 5 frames | 0.78 | 0.52 | +50% |
| Diagnosis accuracy (top‑1) | 0.91 | 0.84 | +8% |
| Summary length (frames) | 7.3 ± 1.2 | 14.8 ± 3.5 | 50% fewer frames |
| Inference time (per video) | 1.8 min | 4.3 min | 2.4× faster |
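Evidence‑frame recall@k, the first metric in the table, can be computed as below. This is one plausible reading of the metric (exact frame‑index match); the paper may instead allow a temporal tolerance, and the frame indices here are made up.

```python
# Evidence-frame recall@k: the fraction of expert-annotated evidence frames
# that appear among the model's top-k selected frames (exact-match variant).

def recall_at_k(predicted_frames, gt_frames, k=5):
    top_k = set(predicted_frames[:k])
    hits = sum(1 for g in gt_frames if g in top_k)
    return hits / len(gt_frames)

predicted = [1520, 877, 2034, 411, 990, 1600]  # model ranking (hypothetical)
ground_truth = [877, 990, 1601, 2034]          # expert evidence frames

r = recall_at_k(predicted, ground_truth, k=5)
print(r)  # 3 of 4 ground-truth frames are in the top 5 → 0.75
```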
Key takeaways:
- Contextual reasoning (grouping frames into coherent clips) is crucial; naïve frame‑wise classifiers miss subtle lesions that become clear only when viewed as a short sequence.
- The candidate screening step reduces computational load without sacrificing recall, proving that a coarse‑to‑fine strategy works well for ultra‑long medical videos.
- DiCE’s evidence frames align closely with those selected by expert gastroenterologists (Cohen’s κ = 0.73), indicating strong clinical relevance.
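The reported κ = 0.73 is Cohen's kappa, a chance‑corrected agreement statistic. The snippet below shows the standard formula for two binary raters; the ratings are made‑up data, and the paper's exact evaluation protocol is not reproduced here.

```python
# Cohen's kappa for frame-level agreement between a model and an expert:
# 1 = frame selected as evidence, 0 = not selected.

def cohens_kappa(a, b):
    """Chance-corrected agreement between two binary rating sequences."""
    n = len(a)
    po = sum(1 for x, y in zip(a, b) if x == y) / n  # observed agreement
    pa1, pb1 = sum(a) / n, sum(b) / n                # marginal "yes" rates
    pe = pa1 * pb1 + (1 - pa1) * (1 - pb1)           # chance agreement
    return (po - pe) / (1 - pe)

model  = [1, 1, 0, 0, 1, 0, 0, 0, 1, 0]
expert = [1, 1, 0, 0, 0, 0, 0, 1, 1, 0]

kappa = cohens_kappa(model, expert)
print(round(kappa, 2))  # → 0.58
```

A κ near 0 means agreement no better than chance, while values around 0.7, as reported for DiCE, indicate substantial agreement with the expert selections.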
Practical Implications
- Accelerated workflow: Endoscopy reading teams can review a 5‑minute summary instead of a 30‑minute raw video, cutting reading time by > 50 % while preserving diagnostic confidence.
- Decision support: The system can flag high‑risk videos for immediate review, helping prioritize urgent cases in busy endoscopy units.
- Tele‑medicine & AI‑assisted screening: Deploying DiCE on edge devices (e.g., hospital servers) enables remote specialists to receive concise diagnostic packets, facilitating second opinions without transferring massive video files.
- Training & education: The evidence‑frame annotations serve as a valuable teaching aid for junior clinicians learning to spot subtle GI lesions.
- Regulatory pathway: Because DiCE mirrors the human reading process and provides traceable evidence frames, it aligns well with emerging AI‑medical device guidelines that demand explainability.
Limitations & Future Work
- Dataset size & diversity: VideoCAP, while the largest of its kind, still covers a limited set of pathologies and patient demographics; broader multi‑center collections are needed to validate generalization.
- Rare lesions: Extremely infrequent findings (e.g., small submucosal tumors) remain challenging due to insufficient training examples.
- Real‑time constraints: Although inference is fast, true real‑time processing (as the capsule streams data) would require further optimization or dedicated hardware accelerators.
- Explainability depth: Current evidence frames are visual; integrating textual explanations derived from the original clinical reports could improve interpretability for non‑specialists.
Future research directions include expanding VideoCAP with multi‑modal data (e.g., patient history, lab results), exploring self‑supervised pretraining on unlabeled CE footage, and adapting the DiCE paradigm to other ultra‑long medical video domains such as colonoscopy or intra‑operative endoscopy.
Authors
- Bowen Liu
- Li Yang
- Shanshan Song
- Mingyu Tang
- Zhifang Gao
- Qifeng Chen
- Yangqiu Song
- Huimin Chen
- Xiaomeng Li
Paper Information
- arXiv ID: 2604.21814v1
- Categories: cs.CV, cs.AI
- Published: April 23, 2026