[Paper] Video Evidence to Reasoning: Efficient Video Understanding via Explicit Evidence Grounding
Source: arXiv - 2601.07761v1
Overview
The paper tackles a core bottleneck in large vision‑language models (LVLMs) for video reasoning: how to keep reasoning fast without sacrificing factual grounding. The authors propose Chain of Evidence (CoE), a framework that first extracts a concise set of visual “evidence” clips and then forces the language model to base its answers strictly on those anchors. By doing so, CoE slashes computation while dramatically reducing hallucinations, setting a new performance bar on several video‑understanding benchmarks.
Key Contributions
- Chain of Evidence (CoE) framework that cleanly separates visual grounding from textual reasoning, enabling joint optimization of both stages.
- Evidence Grounding Module (EGM) – a lightweight, query‑guided filter that selects a minimal, high‑quality subset of video frames/clips as evidence.
- Evidence‑Anchoring Protocol trained with reinforcement learning (RL) and a composite reward that penalizes reasoning that strays from the identified anchors.
- CoE‑Instruct dataset (≈164 k samples) featuring a dual‑annotation scheme: separate labels for perception (what to look at) and reasoning (how to answer).
- State‑of‑the‑art results on five video‑QA/understanding benchmarks (Video‑MME, MVBench, VSI‑Bench, etc.), with consistent gains in accuracy and a measurable drop in hallucination rates.
Methodology
- Query‑Guided Evidence Extraction – When a user asks a question about a video, the EGM receives the textual query and the raw video frames. It runs a fast visual encoder (e.g., a lightweight ConvNet or ViT), scores each temporal segment for relevance to the query, and returns only the top‑k segments (typically 2–4); a minimal sketch of this selection step follows the list.
- Evidence‑Anchored Reasoning – The selected evidence clips, together with the original query, are fed to a pre‑trained LVLM (e.g., LLaVA‑Video or Flamingo). The model’s decoder is constrained by an RL‑based policy that receives a reward for:
  - Alignment – referencing the exact timestamps/segments used as evidence.
  - Correctness – matching ground‑truth answers.
  - Efficiency – keeping the answer length short.
The composite reward pushes the model to “anchor” each reasoning step to a concrete visual snippet, effectively turning the chain of thought into a chain of evidence; a sketch of this reward appears after the list.
- Training Pipeline – The EGM is first pretrained on the perception part of CoE‑Instruct (segment‑level relevance labels). Then the whole CoE system is fine‑tuned end‑to‑end with the RL loop, allowing the grounding and reasoning components to co‑adapt.
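The selection step above can be pictured as a simple similarity-ranking routine. The sketch below is a minimal illustration under assumed details (a shared embedding space, cosine-similarity scoring, fixed top‑k); the paper describes the EGM only at this level, so the encoder choices and exact scoring function here are assumptions, not the released implementation.

```python
# Minimal sketch of query-guided evidence selection (illustrative, not the authors' code).
# Assumptions: the query and each temporal segment are already embedded into a shared
# space by some text/visual encoders; relevance is cosine similarity; top-k segments kept.
import torch
import torch.nn.functional as F


def select_evidence_segments(
    query_emb: torch.Tensor,     # (d,) embedding of the textual question
    segment_embs: torch.Tensor,  # (num_segments, d) one embedding per temporal segment
    k: int = 3,                  # the paper reports keeping roughly 2-4 segments
) -> tuple[torch.Tensor, torch.Tensor]:
    """Return indices and scores of the k segments most relevant to the query."""
    query_emb = F.normalize(query_emb, dim=-1)
    segment_embs = F.normalize(segment_embs, dim=-1)
    scores = segment_embs @ query_emb                      # cosine similarity per segment
    top_scores, top_idx = torch.topk(scores, min(k, segment_embs.shape[0]))
    order = torch.argsort(top_idx)                         # restore temporal order
    return top_idx[order], top_scores[order]


if __name__ == "__main__":
    torch.manual_seed(0)
    q = torch.randn(512)          # stand-in for a real query embedding
    segs = torch.randn(16, 512)   # stand-in for 16 encoded video segments
    idx, sc = select_evidence_segments(q, segs, k=3)
    print("evidence segment indices:", idx.tolist())
```

Only the clips at the returned indices are handed to the LVLM, which is what produces the reduced frame budget discussed later.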
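The composite reward can likewise be summarized as a weighted combination of the three terms listed above. The weights, the binary correctness signal, and the length-based efficiency term in this sketch are assumptions for illustration; the paper names the reward components, but this is not its exact formulation.

```python
# Illustrative composite reward (assumed weighted-sum form; weights are not the paper's values).
def composite_reward(
    cited_segments: set[int],      # segments the model references in its answer
    evidence_segments: set[int],   # segments selected by the EGM
    answer_is_correct: bool,       # match against the ground-truth answer
    answer_length: int,            # tokens in the generated answer
    max_length: int = 128,         # assumed length budget
    w_align: float = 1.0,
    w_correct: float = 1.0,
    w_eff: float = 0.2,
) -> float:
    # Alignment: fraction of cited segments that are genuine evidence anchors.
    align = (len(cited_segments & evidence_segments) / len(cited_segments)
             if cited_segments else 0.0)
    # Correctness: simple binary signal for matching the ground truth.
    correct = 1.0 if answer_is_correct else 0.0
    # Efficiency: mild bonus for staying under the length budget.
    efficiency = max(0.0, 1.0 - answer_length / max_length)
    return w_align * align + w_correct * correct + w_eff * efficiency
```

A reward of this shape is what discourages reasoning steps that cite no anchored segment, which is the mechanism behind the hallucination reductions reported below.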
Results & Findings
- Accuracy boost: Across five benchmarks, CoE‑enhanced models improve top‑1 accuracy by 4–9 % over strong baselines that use full‑video reasoning.
- Hallucination reduction: The proportion of answers containing unsupported claims drops from ~22 % to <7 % (measured by human evaluation and automated fact‑checking).
- Speedup: Because only a handful of clips are processed, inference time is cut by ≈45 % compared with methods that attend to the entire video.
- Ablation studies confirm that both the EGM and the RL‑driven anchoring are necessary; removing the RL reward leads to a 3 % accuracy loss and a 12 % increase in hallucinations.
Practical Implications
- Cost‑effective video AI services – Cloud providers can offer video QA or summarization APIs that run on cheaper GPU instances, thanks to the reduced frame budget.
- More trustworthy assistants – Virtual agents (e.g., customer‑support bots that reference product demo videos) can now point to exact timestamps, improving user confidence and auditability.
- Developer‑friendly integration – The EGM is lightweight enough to be packaged as a plug‑in for existing video‑LLM pipelines (e.g., Hugging Face Transformers), requiring only a small fine‑tuning step on domain‑specific data; a hypothetical integration sketch follows this list.
- Regulatory compliance – In sectors where AI explanations must be traceable (e.g., medical imaging, autonomous driving), the evidence‑anchoring mechanism provides a concrete audit trail linking answers to visual evidence.
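As a rough picture of the plug‑in idea mentioned above, the sketch below places an EGM-style filter in front of an existing video‑LLM call. The encoder and LLM callables are placeholders for whatever models a deployment already uses; none of these names come from the paper or from a specific library.

```python
# Hypothetical integration sketch: evidence selection in front of an existing video-LLM.
# `encode_query`, `encode_segments`, and `video_llm_answer` are caller-supplied stand-ins,
# not real library APIs; only the top-k selection step mirrors the CoE idea.
import torch
import torch.nn.functional as F


def answer_with_evidence(video_segments, question,
                         encode_query, encode_segments, video_llm_answer, k=3):
    """Select the k most query-relevant clips, then query the video-LLM on those only."""
    q = F.normalize(encode_query(question), dim=-1)
    s = F.normalize(encode_segments(video_segments), dim=-1)
    top_idx = torch.topk(s @ q, min(k, len(video_segments))).indices.sort().values
    evidence_clips = [video_segments[i] for i in top_idx.tolist()]
    # Only the selected clips reach the LVLM; the rest of the video is never encoded
    # by the large model, which is where the cost savings come from.
    return video_llm_answer(evidence_clips, question)
```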
Limitations & Future Work
- Domain shift – The EGM is trained on the CoE‑Instruct dataset, which focuses on relatively clean, short clips. Performance may degrade on highly noisy or extremely long videos (e.g., surveillance footage).
- RL stability – The reinforcement‑learning stage can be sensitive to reward weighting; reproducing the exact training dynamics may require careful hyper‑parameter tuning.
- Scalability of annotations – The dual‑annotation schema is labor‑intensive; extending CoE‑Instruct to new domains will need efficient semi‑automatic labeling tools.
- Future directions suggested by the authors include:
  - Self‑supervised evidence discovery to reduce annotation cost.
  - Hierarchical evidence chains for multi‑step reasoning.
  - Integration with multimodal retrieval systems for open‑world video corpora.
Authors
- Yanxiang Huang
- Guohua Gao
- Zhaoyang Wei
- Jianyuan Ni
Paper Information
- arXiv ID: 2601.07761v1
- Categories: cs.CV
- Published: January 12, 2026