[Paper] Generalized Design Choices for Deepfake Detectors
Source: arXiv - 2511.21507v1
Overview
Deepfake detection research often gets tangled in “secret sauce” tricks—how data are pre‑processed, which augmentations are used, or which optimizer is chosen—making it hard to tell whether a model’s success comes from its architecture or from these peripheral choices. This paper systematically untangles those factors, showing that a handful of well‑chosen design decisions can boost detection accuracy regardless of backbone and set a new state‑of‑the‑art on the AI‑GenBench benchmark.
Key Contributions
- Comprehensive factor analysis – isolates the effect of training, inference, and incremental‑update choices on detection performance.
- Architecture‑agnostic best‑practice checklist – identifies a small set of preprocessing, augmentation, and optimization tricks that consistently improve results regardless of the underlying CNN/Transformer.
- Benchmark‑level gains – applying the recommended settings pushes several baseline detectors to top‑rank performance on AI‑GenBench, a large, diverse deepfake benchmark.
- Open‑source reproducibility kit – provides scripts, config files, and a modular evaluation framework so other teams can replicate and extend the study.
Methodology
- Baseline models – the authors start with a variety of popular deepfake detectors (e.g., Xception, EfficientNet, ViT‑based) trained on the same raw dataset.
- Factor grid – they define a matrix of design choices covering:
  - Data preprocessing: face alignment precision, color space (RGB vs. YUV), resolution scaling.
  - Augmentation: random cropping, temporal jitter, frequency‑domain perturbations, mixup/cutmix.
  - Optimization: learning‑rate schedules (cosine vs. step), weight decay, batch size, mixed‑precision training.
  - Inference tricks: test‑time augmentation (TTA), ensembling, confidence calibration.
  - Incremental updates: fine‑tuning on new deepfake generation methods without catastrophic forgetting.
- Controlled experiments – each factor is toggled while keeping all others constant, allowing clean attribution of performance changes (a minimal sketch of this toggle protocol follows this list).
- Evaluation – models are tested on AI‑GenBench’s held‑out splits, measuring accuracy, AUC, and cross‑dataset generalization (e.g., training on FaceForensics++ and testing on DeepFakeDetection).
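To make the one‑factor‑at‑a‑time protocol concrete, the minimal Python sketch below expresses a baseline configuration and a grid of single‑factor toggles. The configuration keys, the alternative values, and the `train_and_evaluate` placeholder are illustrative assumptions for this summary, not the authors' actual code or settings.

```python
# Minimal sketch of a one-factor-at-a-time study (illustrative values, not the paper's exact grid).

BASELINE = {
    "alignment": "5-point",        # face-alignment precision
    "color_space": "RGB",          # RGB vs. YUV
    "augmentation": "random_crop", # basic augmentation only
    "lr_schedule": "step",         # step decay
    "tta": False,                  # no test-time augmentation
}

# Each design choice and the alternatives to toggle against the baseline.
FACTOR_GRID = {
    "alignment": ["68-point"],
    "color_space": ["YUV"],
    "augmentation": ["mixup", "temporal_jitter", "freq_perturbation"],
    "lr_schedule": ["cosine_warmup"],
    "tta": [True],
}


def train_and_evaluate(config: dict) -> float:
    """Placeholder: train a detector under `config` and return held-out AUC."""
    raise NotImplementedError


def run_controlled_study() -> dict:
    """Toggle one factor at a time, keeping every other choice at its baseline value."""
    results = {"baseline": train_and_evaluate(BASELINE)}
    for factor, alternatives in FACTOR_GRID.items():
        for value in alternatives:
            config = {**BASELINE, factor: value}  # change exactly one factor
            results[f"{factor}={value}"] = train_and_evaluate(config)
    return results
```

Any delta relative to the `baseline` entry can then be attributed to the single toggled factor, which is the attribution logic the controlled experiments rely on.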
Results & Findings
| Design Choice | Typical Δ AUC (vs. baseline) | Remarks |
|---|---|---|
| High‑precision face alignment (5‑point vs. 68‑point) | +2.1% | Better facial geometry reduces spurious cues. |
| Color‑space conversion to YUV | +1.4% | Highlights chroma artifacts introduced by synthesis pipelines. |
| Temporal jitter (±2 frames) | +1.8% | Forces the model to learn consistency over time. |
| Mixup augmentation (α=0.2) | +2.5% | Regularizes decision boundaries, improves unseen deepfake types. |
| Cosine LR schedule + warm‑up | +1.9% | Stabilizes early training, especially for deeper backbones. |
| Test‑time augmentation (5‑crop + flip) | +1.2% | Small but consistent boost without extra training cost. |
| Incremental fine‑tuning with replay buffer | +3.0% | Mitigates forgetting when new generation methods appear. |
When the full “best‑practice” bundle is applied, baseline Xception jumps from 86.3% AUC to 92.7%, and a ViT‑B/16 model reaches 94.1%, surpassing the previous AI‑GenBench leader by ~2.5 points.
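As a rough illustration of two of the cheaper tricks in the table, the PyTorch‑style sketch below applies mixup with α=0.2 to a mini‑batch and builds a cosine learning‑rate schedule with linear warm‑up. The function names, the binary real/fake loss, and the warm‑up handling are assumptions made for this summary, not the paper's exact training recipe.

```python
import math

import torch
import torch.nn.functional as F
from torch.optim.lr_scheduler import LambdaLR


def mixup_batch(images, labels, alpha=0.2):
    """Mix random pairs of samples within a batch; labels are float tensors (0=real, 1=fake)."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(images.size(0))
    mixed = lam * images + (1.0 - lam) * images[perm]
    return mixed, labels, labels[perm], lam


def mixup_loss(logits, labels_a, labels_b, lam):
    """Interpolate the binary cross-entropy between the original and permuted labels."""
    return (lam * F.binary_cross_entropy_with_logits(logits, labels_a)
            + (1.0 - lam) * F.binary_cross_entropy_with_logits(logits, labels_b))


def cosine_with_warmup(optimizer, warmup_steps, total_steps):
    """Linear warm-up to the base learning rate, then cosine decay towards zero."""
    def lr_lambda(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))
    return LambdaLR(optimizer, lr_lambda)
```

In a training loop, each mini‑batch would pass through `mixup_batch` before the forward pass, `mixup_loss` would replace the plain binary cross‑entropy, and the scheduler returned by `cosine_with_warmup` would be stepped once per optimizer update.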
Practical Implications
- Rapid prototyping – Developers can plug the recommended preprocessing and augmentation pipeline into any off‑the‑shelf detector and see immediate gains, without redesigning the network.
- Robust production services – The incremental‑learning recipe enables continuous updates as new deepfake generators emerge, reducing the need for full retraining cycles (a minimal replay‑buffer sketch follows this list).
- Cost‑effective scaling – Many of the tricks (e.g., YUV conversion, cosine LR) are computationally cheap, making them suitable for edge‑deployed detectors or cloud services with tight latency budgets.
- Standardized benchmarking – By adopting the authors’ open‑source evaluation harness, teams can compare their own models on AI‑GenBench fairly, fostering more transparent progress in the field.
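To illustrate the incremental‑update idea referenced above, here is a minimal sketch of fine‑tuning with a small replay buffer: every mini‑batch from a newly observed generator is mixed with samples drawn from earlier generators so the detector does not forget them. The buffer capacity, replay fraction, reservoir‑sampling policy, and single‑logit model output are assumptions for this summary; the paper's actual continual‑update procedure may differ.

```python
import random

import torch
import torch.nn.functional as F


class ReplayBuffer:
    """Fixed-capacity store of (image, label) pairs from earlier generators."""

    def __init__(self, capacity=10_000):
        self.capacity = capacity
        self.data = []
        self.seen = 0

    def add(self, image, label):
        """Reservoir sampling keeps an unbiased sample of everything seen so far."""
        self.seen += 1
        if len(self.data) < self.capacity:
            self.data.append((image, label))
        else:
            idx = random.randrange(self.seen)
            if idx < self.capacity:
                self.data[idx] = (image, label)

    def sample(self, n):
        return random.sample(self.data, min(n, len(self.data)))


def finetune_on_new_generator(model, optimizer, new_loader, buffer, replay_frac=0.5):
    """One pass over data from a new generator, mixed with replayed older samples."""
    model.train()
    for images, labels in new_loader:
        new_count = images.size(0)
        n_replay = int(replay_frac * new_count)
        if n_replay and buffer.data:
            old = buffer.sample(n_replay)
            images = torch.cat([images, torch.stack([x for x, _ in old])])
            labels = torch.cat([labels, torch.stack([y for _, y in old])])
        logits = model(images).squeeze(1)  # assumes one real/fake logit per image
        loss = F.binary_cross_entropy_with_logits(logits, labels.float())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # Only the genuinely new samples enter the buffer for future replay.
        for img, lab in zip(images[:new_count], labels[:new_count]):
            buffer.add(img.detach(), lab.detach())
```

In production, `finetune_on_new_generator` would be called whenever data from a newly observed generation method becomes available, keeping the detector current without a full retraining cycle.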
Limitations & Future Work
- The study focuses on visual‑only detectors; audio‑visual or multimodal deepfake systems may react differently to the same tricks.
- Experiments are limited to the AI‑GenBench dataset; while it is diverse, real‑world platforms (e.g., social media streams) present distribution shifts not fully captured.
- The incremental learning approach uses a simple replay buffer; more sophisticated continual‑learning strategies (e.g., parameter isolation) could further reduce forgetting.
The authors plan to extend the factor analysis to multimodal pipelines, explore domain‑adaptation techniques for streaming data, and open‑source a “design‑choice optimizer” that automatically suggests the optimal configuration for a given backbone and hardware budget.
Authors
- Lorenzo Pellegrini
- Serafino Pandolfini
- Davide Maltoni
- Matteo Ferrara
- Marco Prati
- Marco Ramilli
Paper Information
- arXiv ID: 2511.21507v1
- Categories: cs.CV
- Published: November 26, 2025
- PDF: https://arxiv.org/pdf/2511.21507v1