[Paper] Skyra: AI-Generated Video Detection via Grounded Artifact Reasoning
Source: arXiv - 2512.15693v1
Overview
AI‑generated video tools have become good enough that telling real footage from synthetic content is now a genuine security and trust problem. Skyra tackles this by building a multimodal large language model that not only flags AI‑crafted videos but also points out the specific visual glitches that give them away, providing a human‑readable “why” alongside the “what”.
Key Contributions
- ViF‑CoT‑4K dataset – the first large‑scale, fine‑grained collection of AI‑generated video frames annotated with human‑perceivable artifacts (e.g., flickering textures, inconsistent lighting).
- Skyra MLLM – a multimodal large language model trained to locate spatio‑temporal artifacts and generate natural‑language explanations for each detection.
- Two‑stage training pipeline – (1) supervised fine‑tuning on ViF‑CoT‑4K for artifact perception, (2) contrastive alignment with video‑level labels to boost detection accuracy.
- ViF‑Bench benchmark – 3K high‑quality videos from >10 state‑of‑the‑art generators, covering diverse domains (deepfakes, text‑to‑video, style transfer).
- Explainable detection – Skyra outperforms prior binary classifiers on multiple metrics while also delivering concise, artifact‑grounded rationales.
Methodology
- Data Curation – Human annotators watched thousands of AI‑generated clips and marked any visual oddities they could perceive (e.g., jittery motion, missing shadows). These annotations were turned into a structured chain‑of‑thought (CoT) format that pairs a video segment with a textual description of the flaw (a sample record is sketched after this list).
- Model Architecture – Skyra builds on a pretrained vision‑language backbone (e.g., CLIP‑ViT + LLaMA). The visual encoder processes video frames as a short clip, a temporal transformer aggregates the frame‑level features, and the language decoder receives both the aggregated visual embedding and a prompt like “Explain why this video might be synthetic.” (see the backbone sketch after this list).
- Two‑Stage Training
- Stage 1 (SFT): Supervised fine‑tuning on ViF‑CoT‑4K teaches the model to map visual cues to artifact descriptions.
- Stage 2 (Alignment): A contrastive loss aligns the model’s video‑level embedding with binary “real / synthetic” labels, sharpening its overall detection capability without sacrificing explanation quality (both objectives are sketched after this list).
- Inference – Given a new video, Skyra returns: (a) a confidence score for AI generation, (b) a list of detected artifacts with timestamps, and (c) a short natural‑language justification (see the output sketch below).
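To make the annotation format concrete, here is a minimal sketch of what one ViF‑CoT‑4K‑style record could look like. The field names and values are illustrative assumptions; the summary does not give the actual schema.

```python
# Hypothetical ViF-CoT-4K-style annotation record (field names are
# assumptions for illustration; the actual schema is not specified here).
annotation = {
    "clip_id": "vifcot_000123",
    "label": "synthetic",                 # video-level ground truth
    "artifacts": [
        {
            "time_range": [1.2, 1.8],     # seconds within the clip
            "type": "flickering_texture",
            "description": "Brick wall texture flickers between frames.",
        },
        {
            "time_range": [0.0, 2.0],
            "type": "missing_shadow",
            "description": "The walking figure casts no shadow.",
        },
    ],
    # CoT target: the artifacts woven into a textual rationale.
    "cot_explanation": (
        "The wall texture flickers around 1.2-1.8 s and the figure casts "
        "no shadow, so the clip is likely AI-generated."
    ),
}
```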
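The backbone described above can be summarized in a few lines of PyTorch. This is a minimal sketch under stated assumptions: the dimensions, layer counts, and the linear stand‑in for a CLIP‑ViT frame encoder are all illustrative, not the paper’s exact configuration.

```python
import torch
import torch.nn as nn

class SkyraStyleBackbone(nn.Module):
    """Sketch: per-frame visual encoder -> temporal transformer ->
    projection into the language model's token-embedding space."""

    def __init__(self, vis_dim=768, llm_dim=4096):
        super().__init__()
        # Stand-in for a pretrained frame encoder such as CLIP-ViT.
        self.frame_encoder = nn.Linear(3 * 224 * 224, vis_dim)
        layer = nn.TransformerEncoderLayer(
            d_model=vis_dim, nhead=8, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=2)
        self.proj = nn.Linear(vis_dim, llm_dim)

    def forward(self, frames):                         # (B, T, 3, 224, 224)
        feats = self.frame_encoder(frames.flatten(2))  # (B, T, vis_dim)
        feats = self.temporal(feats)                   # temporal aggregation
        return self.proj(feats)                        # (B, T, llm_dim)

# The resulting visual tokens are prepended to the tokenized prompt
# ("Explain why this video might be synthetic.") before the LLM decodes.
```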
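A compact sketch of the two training objectives follows, assuming a standard next‑token loss for Stage 1 and an InfoNCE‑style form for Stage 2; the summary only says “contrastive loss with video‑level labels”, so the exact Stage 2 formulation is an assumption.

```python
import torch.nn.functional as F

def stage1_sft_loss(lm_logits, target_ids):
    """Stage 1 (SFT): next-token cross-entropy on ViF-CoT-4K artifact
    descriptions; ignore_index=-100 masks prompt/padding positions."""
    return F.cross_entropy(
        lm_logits.view(-1, lm_logits.size(-1)),
        target_ids.view(-1),
        ignore_index=-100,
    )

def stage2_alignment_loss(video_emb, label_emb, labels, temperature=0.07):
    """Stage 2 (alignment): pull each video-level embedding toward the
    embedding of its label (real=0 / synthetic=1). The InfoNCE-style
    form is an assumption; only 'contrastive' is stated in the summary."""
    video_emb = F.normalize(video_emb, dim=-1)        # (B, D)
    label_emb = F.normalize(label_emb, dim=-1)        # (2, D)
    logits = video_emb @ label_emb.t() / temperature  # (B, 2)
    return F.cross_entropy(logits, labels)            # labels: (B,)
```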
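Finally, here is a minimal sketch of the three‑part inference output described above; the key names are hypothetical, not Skyra’s actual API.

```python
# Hypothetical shape of a per-video detection report (key names assumed).
result = {
    "synthetic_confidence": 0.93,          # (a) confidence of AI generation
    "artifacts": [                         # (b) artifacts with timestamps
        {"time_range": [0.4, 1.1], "type": "inconsistent_lighting"},
        {"time_range": [1.6, 1.9], "type": "jittery_motion"},
    ],
    "justification": (                     # (c) short rationale
        "Lighting on the subject's face changes direction between 0.4 s "
        "and 1.1 s, and hand motion jitters near 1.7 s."
    ),
}
```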
Results & Findings
| Metric | Skyra | Prior SOTA (binary) |
|---|---|---|
| Accuracy (ViF‑Bench) | 92.4 % | 84.1 % |
| AUROC | 0.96 | 0.88 |
| Explanation quality (BLEU‑4) | 31.2 | N/A |
| Avg. # of correctly identified artifacts per video | 3.7 | 1.2 (implicit) |
- Skyra consistently detects subtle artifacts that human reviewers missed, especially in low‑motion or heavily stylized clips.
- The explanation module achieves high correlation (≈0.78) with human judgments of “useful justification”.
- Ablation studies show that the two‑stage training adds ~5 % accuracy over a single‑stage fine‑tune, and that temporal aggregation is crucial for catching motion‑related glitches.
Practical Implications
- Content moderation pipelines can integrate Skyra to automatically flag suspicious videos and surface the exact frames/artifacts that triggered the alert, reducing manual review time (a triage sketch follows this list).
- Media forensics tools gain an explainable layer, helping investigators present evidence in court or to the public with concrete visual proof.
- Developer APIs could expose Skyra’s artifact‑level output, enabling downstream applications (e.g., watermarking, deepfake detection SaaS) to provide richer feedback to end‑users.
- Video generation platforms can use the artifact detector as a quality‑control loop, automatically warning creators when their output contains perceptible flaws before publishing.
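As a concrete illustration of the moderation use case, here is a minimal triage sketch; `detect_video` and both thresholds are hypothetical stand‑ins for whatever detector and policy a pipeline actually uses.

```python
# Hypothetical triage step for a moderation queue; `detect_video` is a
# stand-in returning a report shaped like the inference sketch above.
REVIEW_THRESHOLD = 0.5
AUTO_FLAG_THRESHOLD = 0.9

def triage(video_path, detect_video):
    report = detect_video(video_path)
    score = report["synthetic_confidence"]
    if score >= AUTO_FLAG_THRESHOLD:
        # Surface the exact artifacts/frames that triggered the alert.
        return ("auto_flag", report["artifacts"])
    if score >= REVIEW_THRESHOLD:
        # Hand the rationale to a human reviewer to cut review time.
        return ("human_review", report["justification"])
    return ("pass", None)
```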
Limitations & Future Work
- Dataset bias – ViF‑CoT‑4K focuses on current generation models; emerging techniques may produce artifacts not represented in the training set.
- Temporal window – Skyra processes short clips (≈2 s); very long‑range inconsistencies (e.g., narrative continuity) remain out of scope.
- Explainability granularity – While the model lists artifacts, it does not yet quantify their severity or provide visual heatmaps.
- Future directions include expanding the dataset with adversarially crafted videos, scaling the temporal horizon, and coupling the artifact explanations with visual attention maps for tighter human‑machine interpretability.
Authors
- Yifei Li
- Wenzhao Zheng
- Yanran Zhang
- Runze Sun
- Yu Zheng
- Lei Chen
- Jie Zhou
- Jiwen Lu
Paper Information
- arXiv ID: 2512.15693v1
- Categories: cs.CV
- Published: December 17, 2025
- PDF: https://arxiv.org/pdf/2512.15693v1