[Paper] Skyra: AI-Generated Video Detection via Grounded Artifact Reasoning

Published: December 17, 2025 at 01:48 PM EST
3 min read
Source: arXiv - 2512.15693v1

Overview

AI‑generated video tools are getting so good that distinguishing real footage from synthetic content is becoming a real‑world security and trust issue. Skyra tackles this problem by building a multimodal large language model that not only flags AI‑crafted videos but also points out the specific visual glitches that give them away—providing a human‑readable “why” alongside the “what”.

Key Contributions

  • ViF‑CoT‑4K dataset – the first large‑scale, fine‑grained collection of AI‑generated video frames annotated with human‑perceivable artifacts (e.g., flickering textures, inconsistent lighting); a schematic annotation record is sketched after this list.
  • Skyra MLLM – a multimodal large language model trained to locate spatio‑temporal artifacts and generate natural‑language explanations for each detection.
  • Two‑stage training pipeline – (1) supervised fine‑tuning on ViF‑CoT‑4K for artifact perception, (2) contrastive alignment with video‑level labels to boost detection accuracy.
  • ViF‑Bench benchmark – 3K high‑quality videos from >10 state‑of‑the‑art generators, covering diverse domains (deepfakes, text‑to‑video, style transfer).
  • Explainable detection – Skyra outperforms prior binary classifiers on multiple metrics while also delivering concise, artifact‑grounded rationales.
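
The summary does not reproduce the ViF‑CoT‑4K annotation schema, but the dataset description above implies that each clip carries a real/synthetic label plus human‑written artifact notes tied to time spans. A minimal Python sketch of what such a record could look like (all field and type names here are assumptions, not the authors' schema):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ArtifactAnnotation:
    """One human-perceived artifact in a clip (hypothetical schema)."""
    artifact_type: str   # e.g. "flickering_texture", "missing_shadow"
    start_sec: float     # start of the affected segment
    end_sec: float       # end of the affected segment
    description: str     # free-text rationale written by the annotator

@dataclass
class ClipRecord:
    """One ViF-CoT-4K-style training example (hypothetical schema)."""
    video_id: str
    generator: str                       # which model produced the clip
    label: str                           # "real" or "synthetic"
    artifacts: List[ArtifactAnnotation] = field(default_factory=list)

# Purely illustrative example record
record = ClipRecord(
    video_id="clip_0001",
    generator="text-to-video-model-X",
    label="synthetic",
    artifacts=[
        ArtifactAnnotation("flickering_texture", 0.4, 1.1,
                           "Brick wall texture shimmers between frames."),
    ],
)
```

Records of this shape are the kind of supervision Stage 1 of the training pipeline described below would consume.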

Methodology

  1. Data Curation – Human annotators watched thousands of AI‑generated clips and marked any visual oddities they could perceive (e.g., jittery motion, missing shadows). These annotations were turned into a structured chain‑of‑thought (CoT) format that pairs a video segment with a textual description of the flaw.
  2. Model Architecture – Skyra builds on a pretrained vision‑language backbone (e.g., CLIP‑ViT + LLaMA). The visual encoder processes video frames as a short clip, while a temporal transformer aggregates frame‑level features. The language decoder receives both the aggregated visual embedding and a prompt like “Explain why this video might be synthetic.”
  3. Two‑Stage Training
    • Stage 1 (SFT): Supervised fine‑tuning on ViF‑CoT‑4K teaches the model to map visual cues to artifact descriptions.
    • Stage 2 (Alignment): A contrastive loss aligns the model’s video‑level embedding with binary “real / synthetic” labels, sharpening its overall detection capability without sacrificing explanation quality (a schematic version of the two losses follows this list).
  4. Inference – Given a new video, Skyra returns: (a) a confidence score for AI generation, (b) a list of detected artifacts with timestamps, and (c) a short natural‑language justification.
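
The loss functions are not spelled out in this summary, so the following is only a sketch of how the two stages could be implemented: Stage 1 as token‑level cross‑entropy over the artifact description, and Stage 2 as a simple logistic alignment of the pooled video embedding with the real/synthetic label, which is a plain stand‑in for the contrastive objective the paper describes. Function names and tensor shapes are illustrative, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def stage1_sft_loss(decoder_logits: torch.Tensor,
                    target_tokens: torch.Tensor,
                    pad_id: int = 0) -> torch.Tensor:
    """Stage 1 (SFT): token-level cross-entropy on the artifact description,
    teaching the decoder to verbalize visual cues.
    decoder_logits: (batch, seq_len, vocab); target_tokens: (batch, seq_len)."""
    return F.cross_entropy(
        decoder_logits.flatten(0, 1),   # (batch * seq_len, vocab)
        target_tokens.flatten(),        # (batch * seq_len,)
        ignore_index=pad_id,
    )

def stage2_alignment_loss(video_embedding: torch.Tensor,
                          real_or_synthetic: torch.Tensor,
                          classifier_weight: torch.Tensor) -> torch.Tensor:
    """Stage 2: align the pooled video-level embedding with the binary label.
    A logistic objective is used here as a simplified stand-in for the
    contrastive alignment loss described in the summary.
    video_embedding: (batch, dim); real_or_synthetic: (batch,) in {0, 1}."""
    logits = video_embedding @ classifier_weight   # (batch,)
    return F.binary_cross_entropy_with_logits(logits,
                                              real_or_synthetic.float())

if __name__ == "__main__":
    # Toy shapes only; a real run would use the MLLM's actual outputs.
    logits = torch.randn(2, 8, 1000)
    targets = torch.randint(1, 1000, (2, 8))
    emb, labels, w = torch.randn(2, 256), torch.tensor([1, 0]), torch.randn(256)
    print(stage1_sft_loss(logits, targets).item())
    print(stage2_alignment_loss(emb, labels, w).item())
```

In a full pipeline these losses would be applied to the MLLM’s decoder logits and pooled visual embedding; at inference the same model then emits the confidence score, timestamped artifact list, and rationale described in step 4.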

Results & Findings

| Metric | Skyra | Prior SOTA (binary) |
| --- | --- | --- |
| Accuracy (ViF‑Bench) | 92.4 % | 84.1 % |
| AUROC | 0.96 | 0.88 |
| Explanation BLEU‑4 (human‑rated) | 31.2 | N/A |
| Avg. # of correctly identified artifacts per video | 3.7 | 1.2 (implicit) |
  • Skyra consistently detects subtle artifacts that human reviewers missed, especially in low‑motion or heavily stylized clips.
  • The explanation module achieves high correlation (≈0.78) with human judgments of “useful justification”.
  • Ablation studies show that the two‑stage training adds ~5 % accuracy over a single‑stage fine‑tune, and that temporal aggregation is crucial for catching motion‑related glitches.
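
The detection metrics in the table above are standard; for readers evaluating a detector of their own, a minimal sketch using scikit‑learn on toy scores (not the paper's data):

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Toy example: 1 = synthetic, 0 = real; scores are model confidences
y_true = np.array([1, 1, 0, 0, 1, 0])
y_score = np.array([0.92, 0.81, 0.12, 0.35, 0.66, 0.48])

y_pred = (y_score >= 0.5).astype(int)   # threshold the confidence score
print("Accuracy:", accuracy_score(y_true, y_pred))
print("AUROC:   ", roc_auc_score(y_true, y_score))
```

Explanation quality (e.g., BLEU‑4) would additionally require reference artifact descriptions and an NLG metric toolkit.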

Practical Implications

  • Content moderation pipelines can integrate Skyra to automatically flag suspicious videos and surface the exact frames/artifacts that triggered the alert, reducing manual review time (a minimal integration sketch follows this list).
  • Media forensics tools gain an explainable layer, helping investigators present evidence in court or to the public with concrete visual proof.
  • Developer APIs could expose Skyra’s artifact‑level output, enabling downstream applications (e.g., watermarking, deepfake detection SaaS) to provide richer feedback to end‑users.
  • Video generation platforms can use the artifact detector as a quality‑control loop, automatically warning creators when their output contains perceptible flaws before publishing.
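
As a concrete illustration of the moderation use case, here is a small, hypothetical helper that consumes Skyra‑style output (confidence, timestamped artifacts, rationale, as listed in the methodology) and decides what to surface to a human reviewer. The types and the threshold are assumptions for illustration, not part of the paper.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class DetectedArtifact:
    """One artifact surfaced by the detector (hypothetical fields)."""
    kind: str            # e.g. "inconsistent_lighting"
    start_sec: float
    end_sec: float

@dataclass
class DetectionResult:
    """Skyra-style output: confidence score, timestamped artifacts,
    and a short rationale (hypothetical type)."""
    confidence: float    # probability that the video is AI-generated
    artifacts: List[DetectedArtifact]
    rationale: str

def route_for_review(result: DetectionResult, threshold: float = 0.8) -> dict:
    """Decide whether to auto-flag a video and list the evidence a
    human moderator should look at first."""
    flagged = result.confidence >= threshold
    return {
        "flagged": flagged,
        # Longest (most visible) artifacts first for quick human inspection
        "evidence": sorted(result.artifacts,
                           key=lambda a: a.end_sec - a.start_sec,
                           reverse=True),
        "explanation": result.rationale if flagged else "",
    }
```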

Limitations & Future Work

  • Dataset bias – ViF‑CoT‑4K focuses on current generation models; emerging techniques may produce artifacts not represented in the training set.
  • Temporal window – Skyra processes short clips (≈2 s); very long‑range inconsistencies (e.g., narrative continuity) remain out of scope.
  • Explainability granularity – While the model lists artifacts, it does not yet quantify their severity or provide visual heatmaps.
  • Future directions include expanding the dataset with adversarially crafted videos, scaling the temporal horizon, and coupling the artifact explanations with visual attention maps for tighter human‑machine interpretability.

Authors

  • Yifei Li
  • Wenzhao Zheng
  • Yanran Zhang
  • Runze Sun
  • Yu Zheng
  • Lei Chen
  • Jie Zhou
  • Jiwen Lu

Paper Information

  • arXiv ID: 2512.15693v1
  • Categories: cs.CV
  • Published: December 17, 2025