[Paper] ExposeAnyone: Personalized Audio-to-Expression Diffusion Models Are Robust Zero-Shot Face Forgery Detectors
Source: arXiv - 2601.02359v1
Overview
The paper introduces ExposeAnyone, a self‑supervised system that detects deepfake videos without ever seeing a forged example during training. By learning how a person's facial expressions should synchronize with their speech, the model spots inconsistencies that betray a fake, achieving state‑of‑the‑art performance on several benchmark datasets and strong resilience to common video degradations.
Key Contributions
- Zero‑shot forgery detection – Uses a diffusion‑based audio‑to‑expression generator that can flag unseen deepfakes without any supervised fake data.
- Personalized modeling – The system is “personalized” to a target subject via a small reference video set, enabling identity‑aware detection through reconstruction error.
- Self‑supervised training – Learns purely from authentic audio‑visual pairs, sidestepping the over‑fitting problems of supervised fake‑detector pipelines.
- Robustness to corruptions – Maintains high detection accuracy under blur, compression, and other real‑world video artifacts.
- Broad benchmark gains – Improves average AUC by 4.22 % over the previous best on DF‑TIMIT, DFDCP, KoDF, and IDForge, and successfully detects Sora2‑generated fakes where other methods fail.
Methodology
1. Audio‑to‑Expression Diffusion Model
- Trains a conditional diffusion network to synthesize a sequence of facial expression parameters (e.g., 3D landmarks or blendshape coefficients) given an audio clip.
- The diffusion process iteratively denoises a random latent, guided by the audio, until a plausible expression trajectory emerges.
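As a concrete illustration, here is a minimal sketch of audio‑conditioned ancestral sampling in the DDPM style. The denoiser `eps_net(x_t, t, audio_feat)`, the linear noise schedule, and all tensor shapes are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of audio-conditioned diffusion sampling (DDPM-style).
# `eps_net(x_t, t, audio_feat)` is a hypothetical trained denoiser that
# predicts the noise added at step t; schedule and shapes are illustrative.
import torch

T = 1000                                    # number of diffusion steps
betas = torch.linspace(1e-4, 0.02, T)       # linear noise schedule
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)   # cumulative products of alphas

@torch.no_grad()
def sample_expressions(eps_net, audio_feat, seq_len, expr_dim):
    """Denoise a random latent into an expression trajectory, guided by audio."""
    x = torch.randn(1, seq_len, expr_dim)            # start from pure noise
    for t in reversed(range(T)):
        t_batch = torch.full((1,), t, dtype=torch.long)
        eps = eps_net(x, t_batch, audio_feat)        # audio-conditioned noise estimate
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        mean = (x - coef * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise      # DDPM posterior step
    return x                                         # (1, seq_len, expr_dim)
```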
2. Personalization (Subject‑Specific Fine‑Tuning)
- For each person of interest, a short “reference set” of genuine video clips is used to adapt the generic diffusion model.
- This step aligns the model’s latent space with the subject’s unique facial dynamics and identity cues.
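A hedged sketch of what this fine‑tuning stage could look like, assuming paired `(audio_feat, expr_seq)` tensors extracted from the reference clips and the same hypothetical denoiser as in the sampling sketch; the standard noise‑prediction loss is assumed here, not confirmed by the paper.

```python
# Hedged sketch of subject-specific fine-tuning on the reference set.
# `reference_pairs` is a list of (audio_feat, expr_seq) tensors from genuine
# clips; `eps_net` is the same hypothetical denoiser as in the sampling sketch.
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bars = torch.cumprod(1.0 - betas, dim=0)

def personalize(eps_net, reference_pairs, steps=500, lr=1e-5):
    """Adapt the generic model to one subject's facial dynamics."""
    opt = torch.optim.AdamW(eps_net.parameters(), lr=lr)
    for step in range(steps):
        audio_feat, expr_seq = reference_pairs[step % len(reference_pairs)]
        t = torch.randint(0, T, (expr_seq.shape[0],))        # random timestep per clip
        noise = torch.randn_like(expr_seq)
        a_bar = alpha_bars[t].view(-1, 1, 1)                 # broadcast over (seq, dim)
        x_t = torch.sqrt(a_bar) * expr_seq + torch.sqrt(1.0 - a_bar) * noise
        loss = torch.nn.functional.mse_loss(eps_net(x_t, t, audio_feat), noise)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return eps_net
```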
3. Forgery Scoring via Reconstruction Error
- When a test video is presented, the system feeds its audio to the personalized model and reconstructs the expected expression sequence.
- The identity distance (e.g., L2 norm between reconstructed and observed facial features) serves as a forgery score: larger errors suggest the visual stream does not match the audio‑driven expectation, indicating manipulation.
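A minimal numeric illustration of the score, assuming `pred_expr` is sampled from the personalized model and `obs_expr` is extracted from the test video, both `(seq_len, dim)` arrays of expression parameters; the actual feature space and distance used by the paper may differ.

```python
# Illustrative forgery score: mean per-frame L2 distance between the
# audio-driven prediction and the observed expression sequence.
import numpy as np

def forgery_score(pred_expr: np.ndarray, obs_expr: np.ndarray) -> float:
    frame_dists = np.linalg.norm(pred_expr - obs_expr, axis=-1)   # (seq_len,)
    return float(frame_dists.mean())   # larger => audio/visual mismatch => likely fake
```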
4. Zero‑Shot Detection Pipeline
- No fake examples are required at any stage; the detector relies solely on the mismatch between audio‑driven predictions and the actual video.
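Tying the previous sketches together, a hypothetical end‑to‑end check might look like the following; `extract_audio_features` and `extract_expressions` are stand‑in preprocessing helpers, and the threshold would be calibrated on genuine clips only, so no fake data ever enters the pipeline.

```python
# Hypothetical zero-shot detection pipeline built from the sketches above.
def detect(eps_net, video, threshold):
    audio_feat = extract_audio_features(video)       # stand-in helper (assumed)
    obs_expr = extract_expressions(video)            # (seq_len, dim), stand-in helper
    pred = sample_expressions(eps_net, audio_feat,
                              obs_expr.shape[0], obs_expr.shape[1])
    score = forgery_score(pred.squeeze(0).numpy(), obs_expr)
    return score > threshold, score                  # (is_fake, raw score)
```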
Results & Findings
| Dataset | Prior SOTA AUC | ExposeAnyone AUC | Δ AUC |
|---|---|---|---|
| DF‑TIMIT | 84.1 % | 88.3 % | +4.2 % |
| DFDCP | 81.7 % | 85.9 % | +4.2 % |
| KoDF | 78.4 % | 82.6 % | +4.2 % |
| IDForge | 80.2 % | 84.5 % | +4.3 % |
- Sora2 detection – ExposeAnyone correctly flags Sora2‑generated videos (AUC ≈ 87 %) while the best competing method drops below 70 %.
- Corruption robustness – Under heavy Gaussian blur (σ = 5) and JPEG compression (Q = 20), the AUC drop is < 2 %, whereas supervised baselines lose > 6 %.
These numbers demonstrate that the audio‑driven reconstruction error is a powerful, manipulation‑agnostic cue.
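For reference, AUC values like those above can be computed from per‑video forgery scores with scikit‑learn; the scores and labels below are toy inputs, not the paper's data.

```python
from sklearn.metrics import roc_auc_score

scores = [0.12, 0.91, 0.08, 0.77]   # toy forgery scores from the detector
labels = [0, 1, 0, 1]               # ground truth: 1 = manipulated video
print(f"AUC = {roc_auc_score(labels, scores):.3f}")
```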
Practical Implications
- Content‑moderation platforms – Deploy a lightweight “personalization” step for high‑risk accounts (e.g., public figures) and run real‑time forgery checks without maintaining a constantly updated dataset of fakes.
- Authentication pipelines – Add an “audio‑expression consistency” check to video‑based identity verification (e.g., remote KYC) to thwart deep‑fake attacks.
- Tooling for developers – Export the diffusion model as an ONNX/TensorRT graph, enabling integration into existing video‑processing back‑ends with modest GPU resources (a minimal export sketch follows this list).
- Forensic analysis – Use the reconstruction error heatmap to pinpoint exactly where a video diverges from expected facial dynamics, aiding manual review.
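A minimal export sketch using `torch.onnx.export`, assuming a PyTorch denoiser; `DenoiserStub` and all shapes (`seq_len`, `expr_dim=64`, `audio_dim=128`) stand in for the real network, which the paper does not specify at this level.

```python
# Hedged sketch of exporting a denoiser to ONNX; DenoiserStub is a stand-in
# for the real audio-to-expression network, with illustrative dimensions.
import torch

class DenoiserStub(torch.nn.Module):
    def __init__(self, expr_dim=64, audio_dim=128):
        super().__init__()
        self.proj = torch.nn.Linear(expr_dim + audio_dim + 1, expr_dim)

    def forward(self, x_t, t, audio_feat):
        # Concatenate latent, audio condition, and timestep per frame.
        t_emb = t.float().view(-1, 1, 1).expand(-1, x_t.shape[1], 1)
        return self.proj(torch.cat([x_t, audio_feat, t_emb], dim=-1))

model = DenoiserStub().eval()
x_t = torch.randn(1, 100, 64)       # (batch, seq_len, expr_dim)
t = torch.tensor([500])             # diffusion timestep
audio = torch.randn(1, 100, 128)    # (batch, seq_len, audio_dim)
torch.onnx.export(
    model, (x_t, t, audio), "denoiser.onnx",
    input_names=["x_t", "t", "audio"], output_names=["eps"],
    dynamic_axes={"x_t": {1: "seq"}, "audio": {1: "seq"}},  # variable clip length
    opset_version=17,
)
```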
Limitations & Future Work
- Reference data requirement – Personalization needs a few seconds of clean video per subject; the method is less effective for completely unknown identities.
- Audio quality dependence – Extremely noisy or dubbed audio reduces reconstruction fidelity, potentially increasing false positives.
- Scalability to large user bases – Maintaining personalized models for millions of users would demand model‑sharing strategies or on‑the‑fly fine‑tuning.
- Future directions – Explore few‑shot meta‑learning to reduce reference data, extend the approach to multi‑modal cues (e.g., lip‑reading + facial motion), and optimize diffusion inference for edge‑device deployment.
Authors
- Kaede Shiohara
- Toshihiko Yamasaki
- Vladislav Golyanik
Paper Information
- arXiv ID: 2601.02359v1
- Categories: cs.CV
- Published: January 5, 2026