[Paper] Generalized Design Choices for Deepfake Detectors

Published: November 26, 2025 at 10:40 AM EST
3 min read
Source: arXiv - 2511.21507v1

Overview

Deepfake detection research often gets tangled in “secret sauce” tricks—how data are pre‑processed, which augmentations are used, or which optimizer is chosen—making it hard to tell whether a model’s success comes from its architecture or from these peripheral choices. This paper systematically untangles those factors, showing that a handful of well‑chosen design decisions can boost detection accuracy across any backbone and set a new state‑of‑the‑art on the AI‑GenBench benchmark.

Key Contributions

  • Comprehensive factor analysis – isolates the effect of training, inference, and incremental‑update choices on detection performance.
  • Architecture‑agnostic best‑practice checklist – identifies a small set of preprocessing, augmentation, and optimization tricks that consistently improve results regardless of the underlying CNN/Transformer.
  • Benchmark‑level gains – applying the recommended settings pushes several baseline detectors to top‑rank performance on AI‑GenBench, a large, diverse deepfake benchmark.
  • Open‑source reproducibility kit – provides scripts, config files, and a modular evaluation framework so other teams can replicate and extend the study.

Methodology

  1. Baseline models – the authors start with a variety of popular deepfake detectors (e.g., Xception, EfficientNet, ViT‑based) trained on the same raw dataset.
  2. Factor grid – they define a matrix of design choices covering:
    • Data preprocessing: face alignment precision, color space (RGB vs. YUV), resolution scaling.
    • Augmentation: random cropping, temporal jitter, frequency‑domain perturbations, mixup/cutmix.
    • Optimization: learning‑rate schedules (cosine vs. step), weight decay, batch size, mixed‑precision training.
    • Inference tricks: test‑time augmentation (TTA), ensembling, confidence calibration.
    • Incremental updates: fine‑tuning on new deepfake generation methods without catastrophic forgetting.
  3. Controlled experiments – each factor is toggled while keeping all others constant, allowing a clean attribution of performance changes (a minimal sweep sketch follows this list).
  4. Evaluation – models are tested on AI‑GenBench’s held‑out splits, measuring accuracy, AUC, and cross‑dataset generalization (e.g., training on FaceForensics++ and testing on DeepFakeDetection).
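To make the controlled-experiment setup concrete, here is a minimal sketch of a one-factor-at-a-time sweep. The factor names, option values, and the commented `train_and_evaluate` hook are illustrative assumptions, not the authors' actual configuration space or harness.

```python
# Hypothetical factor grid: the factor names and option values below are
# illustrative only, not the authors' exact configuration space.
FACTORS = {
    "color_space": ["RGB", "YUV"],
    "face_alignment": ["5-point", "68-point"],
    "augmentation": ["none", "mixup", "cutmix"],
    "lr_schedule": ["step", "cosine"],
    "test_time_augmentation": [False, True],
}

# The first option of every factor is treated as the baseline setting.
BASELINE = {name: options[0] for name, options in FACTORS.items()}

def one_factor_at_a_time():
    """Yield configurations where exactly one factor deviates from the
    baseline, so a performance change can be attributed to that factor."""
    yield dict(BASELINE)  # the untouched baseline run
    for name, options in FACTORS.items():
        for option in options[1:]:
            config = dict(BASELINE)
            config[name] = option
            yield config

if __name__ == "__main__":
    for config in one_factor_at_a_time():
        # A real harness would call something like train_and_evaluate(config);
        # here we only print the sweep to show its shape.
        print(config)
```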

Results & Findings

Each design choice below is listed with its typical AUC gain over the baseline configuration:

  • High‑precision face alignment (5‑point vs. 68‑point): +2.1% AUC – better facial geometry reduces spurious cues.
  • Color‑space conversion to YUV: +1.4% AUC – highlights chroma artifacts introduced by synthesis pipelines.
  • Temporal jitter (±2 frames): +1.8% AUC – forces the model to learn consistency over time.
  • Mixup augmentation (α = 0.2): +2.5% AUC – regularizes decision boundaries and improves generalization to unseen deepfake types.
  • Cosine LR schedule + warm‑up: +1.9% AUC – stabilizes early training, especially for deeper backbones.
  • Test‑time augmentation (5‑crop + flip): +1.2% AUC – small but consistent boost without extra training cost.
  • Incremental fine‑tuning with replay buffer: +3.0% AUC – mitigates forgetting when new generation methods appear.
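As a concrete illustration of one of the entries above, the snippet below sketches how mixup with α = 0.2 could be applied to a training batch. It is a minimal PyTorch sketch; the `mixup_batch` helper, the binary real/fake label setup, and the commented training step are assumptions for illustration, not the paper's released code.

```python
import torch

def mixup_batch(images, labels, alpha=0.2, num_classes=2):
    """Blend pairs of samples and their one-hot labels (mixup, alpha = 0.2).
    num_classes=2 assumes a binary real/fake classification task."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(images.size(0))
    mixed_images = lam * images + (1.0 - lam) * images[perm]
    one_hot = torch.nn.functional.one_hot(labels, num_classes).float()
    mixed_labels = lam * one_hot + (1.0 - lam) * one_hot[perm]
    return mixed_images, mixed_labels

# Typical training step (model, optimizer and loader are assumed to exist):
# images, labels = next(iter(loader))
# x, y = mixup_batch(images, labels)
# loss = torch.nn.functional.cross_entropy(model(x), y)  # soft targets
```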

When the full “best‑practice” bundle is applied, baseline Xception jumps from 86.3% AUC to 92.7%, and a ViT‑B/16 model reaches 94.1%, surpassing the previous AI‑GenBench leader by ~2.5 points.

Practical Implications

  • Rapid prototyping – Developers can plug the recommended preprocessing and augmentation pipeline into any off‑the‑shelf detector and see immediate gains, without redesigning the network.
  • Robust production services – The incremental‑learning recipe enables continuous updates as new deepfake generators emerge, reducing the need for full retraining cycles (see the replay‑buffer sketch after this list).
  • Cost‑effective scaling – Many of the tricks (e.g., YUV conversion, cosine LR) are computationally cheap, making them suitable for edge‑deployed detectors or cloud services with tight latency budgets.
  • Standardized benchmarking – By adopting the authors’ open‑source evaluation harness, teams can compare their own models on AI‑GenBench fairly, fostering more transparent progress in the field.
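The replay‑buffer idea behind the incremental‑learning recommendation can be sketched as follows. This is a generic reservoir‑sampling buffer written for illustration; the class name, capacity, and the commented fine‑tuning loop are assumptions, not the authors' implementation.

```python
import random

class ReplayBuffer:
    """Minimal reservoir-style replay buffer for incremental fine-tuning."""
    def __init__(self, capacity=10_000):
        self.capacity = capacity
        self.samples = []   # e.g., (image_path, label) pairs from past generators
        self.seen = 0

    def add(self, sample):
        self.seen += 1
        if len(self.samples) < self.capacity:
            self.samples.append(sample)
        else:
            # Reservoir sampling keeps a uniform sample of everything seen so far.
            idx = random.randrange(self.seen)
            if idx < self.capacity:
                self.samples[idx] = sample

    def draw(self, batch_size):
        return random.sample(self.samples, min(batch_size, len(self.samples)))

# Fine-tuning on a new generator: mix new data with replayed old data so the
# detector keeps its accuracy on previously seen synthesis methods.
# for new_batch in new_generator_loader:
#     replay_batch = buffer.draw(len(new_batch))
#     train_step(model, new_batch + replay_batch)
```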

Limitations & Future Work

  • The study focuses on visual‑only detectors; audio‑visual or multimodal deepfake systems may react differently to the same tricks.
  • Experiments are limited to the AI‑GenBench dataset; while it is diverse, real‑world platforms (e.g., social media streams) present distribution shifts not fully captured.
  • The incremental learning approach uses a simple replay buffer; more sophisticated continual‑learning strategies (e.g., parameter isolation) could further reduce forgetting.

The authors plan to extend the factor analysis to multimodal pipelines, explore domain‑adaptation techniques for streaming data, and open‑source a “design‑choice optimizer” that automatically suggests the optimal configuration for a given backbone and hardware budget.

Authors

  • Lorenzo Pellegrini
  • Serafino Pandolfini
  • Davide Maltoni
  • Matteo Ferrara
  • Marco Prati
  • Marco Ramilli

Paper Information

  • arXiv ID: 2511.21507v1
  • Categories: cs.CV
  • Published: November 26, 2025