[Paper] Generalized Design Choices for Deepfake Detectors
Source: arXiv - 2511.21507v1
Overview
Deepfake detection research often gets tangled in “secret sauce” tricks—how data are pre‑processed, which augmentations are used, or which optimizer is chosen—making it hard to tell whether a model’s success comes from its architecture or from these peripheral choices. This paper systematically untangles those factors, showing that a handful of well‑chosen design decisions can boost detection accuracy regardless of backbone and set a new state‑of‑the‑art on the AI‑GenBench benchmark.
Key Contributions
- Comprehensive factor analysis – isolates the effect of training, inference, and incremental‑update choices on detection performance.
- Architecture‑agnostic best‑practice checklist – identifies a small set of preprocessing, augmentation, and optimization tricks that consistently improve results regardless of the underlying CNN/Transformer.
- Benchmark‑level gains – applying the recommended settings pushes several baseline detectors to top‑rank performance on AI‑GenBench, a large, diverse deepfake benchmark.
- Open‑source reproducibility kit – provides scripts, config files, and a modular evaluation framework so other teams can replicate and extend the study.
Methodology
- Baseline models – the authors start with a variety of popular deepfake detectors (e.g., Xception, EfficientNet, ViT‑based) trained on the same raw dataset.
- Factor grid – they define a matrix of design choices covering:
  - Data preprocessing: face alignment precision, color space (RGB vs. YUV), resolution scaling.
  - Augmentation: random cropping, temporal jitter, frequency‑domain perturbations, mixup/cutmix.
  - Optimization: learning‑rate schedules (cosine vs. step), weight decay, batch size, mixed‑precision training.
  - Inference tricks: test‑time augmentation (TTA), ensembling, confidence calibration.
  - Incremental updates: fine‑tuning on new deepfake generation methods without catastrophic forgetting.
- Controlled experiments – each factor is toggled while keeping all others constant, allowing clean attribution of performance changes (a minimal sketch of this toggle protocol follows this list).
- Evaluation – models are tested on AI‑GenBench’s held‑out splits, measuring accuracy, AUC, and cross‑dataset generalization (e.g., training on FaceForensics++ and testing on DeepFakeDetection).
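To make the one‑factor‑at‑a‑time protocol concrete, the minimal Python sketch below expresses a baseline configuration and a grid of single‑factor toggles. The configuration keys, the alternative values, and the `train_and_evaluate` placeholder are illustrative assumptions for this summary, not the authors' actual code or settings.

```python
# Minimal sketch of a one-factor-at-a-time study (illustrative values, not the paper's exact grid).

BASELINE = {
    "alignment": "5-point",        # face-alignment precision
    "color_space": "RGB",          # RGB vs. YUV
    "augmentation": "random_crop", # basic augmentation only
    "lr_schedule": "step",         # step decay
    "tta": False,                  # no test-time augmentation
}

# Each design choice and the alternatives to toggle against the baseline.
FACTOR_GRID = {
    "alignment": ["68-point"],
    "color_space": ["YUV"],
    "augmentation": ["mixup", "temporal_jitter", "freq_perturbation"],
    "lr_schedule": ["cosine_warmup"],
    "tta": [True],
}


def train_and_evaluate(config: dict) -> float:
    """Placeholder: train a detector under `config` and return held-out AUC."""
    raise NotImplementedError


def run_controlled_study() -> dict:
    """Toggle one factor at a time, keeping every other choice at its baseline value."""
    results = {"baseline": train_and_evaluate(BASELINE)}
    for factor, alternatives in FACTOR_GRID.items():
        for value in alternatives:
            config = {**BASELINE, factor: value}  # change exactly one factor
            results[f"{factor}={value}"] = train_and_evaluate(config)
    return results
```

Any delta relative to the `baseline` entry can then be attributed to the single toggled factor, which is the attribution logic the controlled experiments rely on.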
Results & Findings
| Design Choice | Typical Δ AUC (vs. baseline) | Remarks |
|---|---|---|
| High‑precision face alignment (5‑point vs. 68‑point) | +2.1% | Better facial geometry reduces spurious cues. |
| Color‑space conversion to YUV | +1.4% | Highlights chroma artifacts introduced by synthesis pipelines. |
| Temporal jitter (±2 frames) | +1.8% | Forces the model to learn consistency over time. |
| Mixup augmentation (α=0.2) | +2.5% | Regularizes decision boundaries, improves unseen deepfake types. |
| Cosine LR schedule + warm‑up | +1.9% | Stabilizes early training, especially for deeper backbones. |
| Test‑time augmentation (5‑crop + flip) | +1.2% | Small but consistent boost without extra training cost. |
| Incremental fine‑tuning with replay buffer | +3.0% | Mitigates forgetting when new generation methods appear. |
When the full “best‑practice” bundle is applied, baseline Xception jumps from 86.3% AUC to 92.7%, and a ViT‑B/16 model reaches 94.1%, surpassing the previous AI‑GenBench leader by ~2.5 points.
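As a rough illustration of two of the cheaper tricks in the table, the PyTorch‑style sketch below applies mixup with α=0.2 to a mini‑batch and builds a cosine learning‑rate schedule with linear warm‑up. The function names, the binary real/fake loss, and the warm‑up handling are assumptions made for this summary, not the paper's exact training recipe.

```python
import math

import torch
import torch.nn.functional as F
from torch.optim.lr_scheduler import LambdaLR


def mixup_batch(images, labels, alpha=0.2):
    """Mix random pairs of samples within a batch; labels are float tensors (0=real, 1=fake)."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(images.size(0))
    mixed = lam * images + (1.0 - lam) * images[perm]
    return mixed, labels, labels[perm], lam


def mixup_loss(logits, labels_a, labels_b, lam):
    """Interpolate the binary cross-entropy between the original and permuted labels."""
    return (lam * F.binary_cross_entropy_with_logits(logits, labels_a)
            + (1.0 - lam) * F.binary_cross_entropy_with_logits(logits, labels_b))


def cosine_with_warmup(optimizer, warmup_steps, total_steps):
    """Linear warm-up to the base learning rate, then cosine decay towards zero."""
    def lr_lambda(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))
    return LambdaLR(optimizer, lr_lambda)
```

In a training loop, each mini‑batch would pass through `mixup_batch` before the forward pass, `mixup_loss` would replace the plain binary cross‑entropy, and the scheduler returned by `cosine_with_warmup` would be stepped once per optimizer update.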
Practical Implications
- Rapid prototyping – Developers can plug the recommended preprocessing and augmentation pipeline into any off‑the‑shelf detector and see immediate gains, without redesigning the network.
- Robust production services – The incremental‑learning recipe enables continuous updates as new deepfake generators emerge, reducing the need for full retraining cycles (a minimal replay‑buffer sketch follows this list).
- Cost‑effective scaling – Many of the tricks (e.g., YUV conversion, cosine LR) are computationally cheap, making them suitable for edge‑deployed detectors or cloud services with tight latency budgets.
- Standardized benchmarking – By adopting the authors’ open‑source evaluation harness, teams can compare their own models on AI‑GenBench fairly, fostering more transparent progress in the field.
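To illustrate the incremental‑update idea referenced above, here is a minimal sketch of fine‑tuning with a small replay buffer: every mini‑batch from a newly observed generator is mixed with samples drawn from earlier generators so the detector does not forget them. The buffer capacity, replay fraction, reservoir‑sampling policy, and single‑logit model output are assumptions for this summary; the paper's actual continual‑update procedure may differ.

```python
import random

import torch
import torch.nn.functional as F


class ReplayBuffer:
    """Fixed-capacity store of (image, label) pairs from earlier generators."""

    def __init__(self, capacity=10_000):
        self.capacity = capacity
        self.data = []
        self.seen = 0

    def add(self, image, label):
        """Reservoir sampling keeps an unbiased sample of everything seen so far."""
        self.seen += 1
        if len(self.data) < self.capacity:
            self.data.append((image, label))
        else:
            idx = random.randrange(self.seen)
            if idx < self.capacity:
                self.data[idx] = (image, label)

    def sample(self, n):
        return random.sample(self.data, min(n, len(self.data)))


def finetune_on_new_generator(model, optimizer, new_loader, buffer, replay_frac=0.5):
    """One pass over data from a new generator, mixed with replayed older samples."""
    model.train()
    for images, labels in new_loader:
        new_count = images.size(0)
        n_replay = int(replay_frac * new_count)
        if n_replay and buffer.data:
            old = buffer.sample(n_replay)
            images = torch.cat([images, torch.stack([x for x, _ in old])])
            labels = torch.cat([labels, torch.stack([y for _, y in old])])
        logits = model(images).squeeze(1)  # assumes one real/fake logit per image
        loss = F.binary_cross_entropy_with_logits(logits, labels.float())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # Only the genuinely new samples enter the buffer for future replay.
        for img, lab in zip(images[:new_count], labels[:new_count]):
            buffer.add(img.detach(), lab.detach())
```

In production, `finetune_on_new_generator` would be called whenever data from a newly observed generation method becomes available, keeping the detector current without a full retraining cycle.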
Limitations & Future Work
- The study focuses on visual‑only detectors; audio‑visual or multimodal deepfake systems may react differently to the same tricks.
- Experiments are limited to the AI‑GenBench dataset; while it is diverse, real‑world platforms (e.g., social media streams) present distribution shifts not fully captured.
- The incremental learning approach uses a simple replay buffer; more sophisticated continual‑learning strategies (e.g., parameter isolation) could further reduce forgetting.
The authors plan to extend the factor analysis to multimodal pipelines, explore domain‑adaptation techniques for streaming data, and open‑source a “design‑choice optimizer” that automatically suggests the optimal configuration for a given backbone and hardware budget.
Authors
- Lorenzo Pellegrini
- Serafino Pandolfini
- Davide Maltoni
- Matteo Ferrara
- Marco Prati
- Marco Ramilli
Paper Information
- arXiv ID: 2511.21507v1
- Categories: cs.CV
- Published: November 26, 2025
- PDF: https://arxiv.org/pdf/2511.21507v1