[Paper] Deepfake detectors are DUMB: A benchmark to assess adversarial training robustness under transferability constraints
Source: arXiv - 2601.05986v1
Overview
Deepfake detection models are increasingly deployed in platforms that need to verify the authenticity of video content. This paper shows that many of these detectors, even when hardened with adversarial training, can still be fooled by subtle, transferable perturbations—especially when the attacker’s data or model differs from the defender’s. By extending the DUMB benchmarking framework to deepfake detection, the authors provide a realistic stress‑test that mirrors how adversaries operate in the wild.
Key Contributions
- DUMB‑er Benchmark for Deepfakes – adapts the Dataset‑Sources‑Model‑Balance (DUMB) methodology to evaluate robustness under transferability constraints (i.e., attacker and defender use different data or architectures).
- Comprehensive Empirical Study – tests five state‑of‑the‑art detectors (RECCE, SRM, Xception, UCF, SPSL) against three popular attacks (PGD, FGSM, FPBA) on two widely used datasets (FaceForensics++ and Celeb‑DF‑V2).
- Cross‑Dataset Insight – reveals that adversarial training improves in‑distribution robustness but can hurt performance when the test data comes from a different distribution.
- Case‑Aware Defense Recommendations – proposes that defense strategies must be tuned to the expected mismatch scenario (e.g., same‑source vs. cross‑source attacks).
- Open‑Source Evaluation Suite – releases code and benchmark scripts so the community can reproduce and extend the analysis.
Methodology
Benchmark Construction (DUMB‑er)
- Dataset Sources: Two deepfake corpora (FaceForensics++ and Celeb‑DF‑V2) serve as source and target domains.
- Model Architecture: Five detectors covering handcrafted features (SRM), deep CNNs (Xception), and hybrid approaches (RECCE, UCF, SPSL).
- Balance: Each detector is trained on a balanced mix of real and fake videos, then optionally fine‑tuned with adversarial examples.
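Viewed this way, the DUMB dimensions define a grid of attacker/defender pairings. The short Python sketch below enumerates those pairings for the two corpora and five detectors used here; the structure and field names are illustrative only (the balance dimension is omitted for brevity) and are not taken from the paper's released suite.

```python
# Illustrative enumeration of DUMB-er attacker/defender pairings (not the paper's code).
from itertools import product

DATASETS = ["FaceForensics++", "Celeb-DF-V2"]
MODELS = ["RECCE", "SRM", "Xception", "UCF", "SPSL"]

def enumerate_cases():
    """List every attacker/defender combination of dataset source and model."""
    cases = []
    for atk_data, def_data, atk_model, def_model in product(DATASETS, DATASETS, MODELS, MODELS):
        cases.append({
            "attacker_dataset": atk_data,
            "defender_dataset": def_data,
            "attacker_model": atk_model,
            "defender_model": def_model,
            # Matching dataset and model approximates the white-box baseline;
            # any mismatch is a transferability-constrained scenario.
            "transfer_constrained": atk_data != def_data or atk_model != def_model,
        })
    return cases

print(len(enumerate_cases()))  # 2 datasets x 2 datasets x 5 models x 5 models = 100 pairings
```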
Adversarial Attack Scenarios
- White‑Box: the attacker knows the exact model and training data (baseline).
- Transferability‑Constrained: the attacker trains a surrogate model on a different dataset or with a different architecture, generates perturbations (PGD, FGSM, FPBA) on that surrogate, and applies them to the target detector.
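To make the transferability‑constrained setting concrete, here is a minimal L∞ PGD sketch in PyTorch: perturbations are crafted on an attacker‑side surrogate and then scored against the defender's detector. The `surrogate` and `target` models, the [0, 1] input scaling, and the hyperparameters are assumptions for illustration, not values from the paper.

```python
# Minimal L-infinity PGD transfer sketch (assumes PyTorch, inputs in [0, 1],
# and two hypothetical binary real/fake classifiers: `surrogate` and `target`).
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8 / 255, alpha=2 / 255, steps=10):
    """Craft PGD perturbations on `model`, staying within an eps ball around x."""
    x = x.clone().detach()
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()            # ascend the surrogate's loss
            x_adv = x + torch.clamp(x_adv - x, -eps, eps)  # project back into the eps ball
            x_adv = torch.clamp(x_adv, 0.0, 1.0)           # keep valid pixel values
    return x_adv.detach()

# Transfer step (hypothetical tensors): craft on the surrogate, score on the target.
# x_adv = pgd_attack(surrogate, frames, labels)
# transfer_acc = (target(x_adv).argmax(dim=1) == labels).float().mean()
```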
Evaluation Protocol
- In‑Distribution: Test and attack both use the same dataset the detector was trained on.
- Cross‑Dataset: Test set comes from the other dataset, simulating real‑world distribution shift.
- Metrics: detection accuracy, AUC, and robustness drop (difference between clean and adversarial performance).
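A minimal sketch of how these metrics can be computed from per‑sample "fake" scores is shown below; the function name, thresholding, and score convention are assumptions, not part of the paper's released evaluation suite.

```python
# Illustrative metric computation from per-sample "fake" scores (higher = more fake).
import numpy as np
from sklearn.metrics import roc_auc_score

def robustness_report(y_true, clean_scores, adv_scores, threshold=0.5):
    """Accuracy, AUC, and robustness drop on clean vs. adversarial inputs."""
    y_true = np.asarray(y_true)
    clean_acc = np.mean((np.asarray(clean_scores) >= threshold).astype(int) == y_true)
    adv_acc = np.mean((np.asarray(adv_scores) >= threshold).astype(int) == y_true)
    return {
        "clean_accuracy": clean_acc,
        "adversarial_accuracy": adv_acc,
        "clean_auc": roc_auc_score(y_true, clean_scores),
        "adversarial_auc": roc_auc_score(y_true, adv_scores),
        "robustness_drop": clean_acc - adv_acc,  # clean minus adversarial performance
    }
```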
Results & Findings
| Scenario | Clean Accuracy | Adversarial Accuracy (PGD) | Effect of Adversarial Training |
|---|---|---|---|
| In‑distribution (same source) | ~92 % | ~45 % | ↑ to ~78 % (robustness gain) |
| Cross‑dataset (different source) | ~85 % | ~38 % | ↓ to ~70 % (performance loss after adversarial training) |
- Adversarial training helps when the attacker’s surrogate matches the defender’s data distribution (e.g., both use FaceForensics++).
- When data mismatches, some defenses overfit to the adversarial patterns of the source domain, causing a negative transfer that harms detection on the target domain.
- Attack transferability varies: FPBA (feature‑preserving) is the most successful across datasets, while FGSM’s impact drops sharply under cross‑dataset conditions.
- Detector‑specific trends: Handcrafted‑feature models (SRM) are more resilient to transfer attacks than pure CNNs, but they still suffer under aggressive PGD perturbations.
Practical Implications
- Deployments must anticipate distribution shift – platforms that ingest user‑generated videos from diverse sources should not rely on a single adversarial‑training recipe.
- Hybrid defenses are promising – combining handcrafted cues (e.g., SRM) with learned features can mitigate transfer attacks without sacrificing clean‑data performance.
- Continuous fine‑tuning – periodic re‑training on freshly collected, possibly adversarially perturbed data from the target platform helps keep robustness from eroding over time.
- Security‑by‑design – developers should integrate a robustness‑monitoring pipeline that flags sudden drops in detection confidence, which can indicate an adversarial campaign (a minimal sketch follows this list).
- Tooling – the released benchmark can be plugged into CI pipelines to evaluate new detector versions against realistic adversarial scenarios before production rollout.
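As one illustration of the robustness‑monitoring idea (an assumption of this summary, not a component described in the paper), the sketch below tracks mean detection confidence over a sliding window and flags a sudden drop against a historical baseline.

```python
# Hypothetical confidence monitor: flags a sustained drop in mean detection
# confidence, which may indicate an ongoing adversarial campaign.
from collections import deque

class ConfidenceMonitor:
    def __init__(self, window=500, drop_threshold=0.15):
        self.recent = deque(maxlen=window)   # most recent detection confidences
        self.baseline = None                 # long-run mean on presumed-clean traffic
        self.drop_threshold = drop_threshold

    def update(self, confidence: float) -> bool:
        """Record one confidence score; return True if an alert should fire."""
        self.recent.append(confidence)
        if len(self.recent) < self.recent.maxlen:
            return False                     # wait until the window is full
        window_mean = sum(self.recent) / len(self.recent)
        if self.baseline is None:
            self.baseline = window_mean      # first full window sets the baseline
            return False
        return (self.baseline - window_mean) > self.drop_threshold
```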
Limitations & Future Work
- Scope of Datasets – Only two deepfake corpora were examined; emerging datasets with higher visual fidelity may exhibit different transfer dynamics.
- Attack Diversity – The study focuses on gradient‑based attacks; future work should explore generative adversarial attacks that synthesize more naturalistic perturbations.
- Real‑World Constraints – Perturbations are assumed to be imperceptible at the pixel level; in practice, compression, streaming, and device‑specific processing could alter attack efficacy.
- Defense Strategies – The paper evaluates standard adversarial training; exploring certified defenses, ensemble methods, or meta‑learning could yield more universally robust detectors.
Bottom line: adversarial training isn't a silver bullet for deepfake detection. Its benefits hinge on how closely the training and deployment environments align, so practitioners should adopt adaptive, data‑aware defense pipelines.
Authors
- Adrian Serrano
- Erwan Umlil
- Ronan Thomas
Paper Information
- arXiv ID: 2601.05986v1
- Categories: cs.CV, cs.CR
- Published: January 9, 2026