[Paper] ExposeAnyone: Personalized Audio-to-Expression Diffusion Models Are Robust Zero-Shot Face Forgery Detectors

Published: January 5, 2026 at 01:59 PM EST
3 min read
Source: arXiv - 2601.02359v1

Overview

The paper introduces ExposeAnyone, a self‑supervised system that detects deep‑fake videos without ever seeing a forged example during training. By learning how a person’s facial expressions should sync with their speech, the model can spot inconsistencies that betray a fake—achieving state‑of‑the‑art performance on several benchmark datasets and showing strong resilience to common video degradations.

Key Contributions

  • Zero‑shot forgery detection – Uses a diffusion‑based audio‑to‑expression generator that can flag unseen deepfakes without any supervised fake data.
  • Personalized modeling – The system is “personalized” to a target subject via a small reference video set, enabling identity‑aware detection through reconstruction error.
  • Self‑supervised training – Learns purely from authentic audio‑visual pairs, sidestepping the over‑fitting problems of supervised fake‑detector pipelines.
  • Robustness to corruptions – Maintains high detection accuracy under blur, compression, and other real‑world video artifacts.
  • Broad benchmark gains – Improves average AUC by 4.22 % over the previous best on DF‑TIMIT, DFDCP, KoDF, and IDForge, and successfully detects Sora2‑generated fakes where other methods fail.

Methodology

1. Audio‑to‑Expression Diffusion Model

  • Trains a conditional diffusion network to synthesize a sequence of facial expression parameters (e.g., 3D landmarks or blendshape coefficients) given an audio clip.
  • The diffusion process iteratively denoises a random latent, guided by the audio, until a plausible expression trajectory emerges.
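
Below is a minimal DDPM-style sketch of what audio-conditioned expression sampling could look like. The MLP denoiser, feature dimensions, shortened noise schedule, and tensor shapes are illustrative assumptions, not the paper's architecture.

```python
# Minimal sketch of audio-conditioned diffusion sampling for expression
# parameters. All sizes and the MLP denoiser are illustrative assumptions.
import torch
import torch.nn as nn

T_STEPS, SEQ_LEN, EXPR_DIM, AUDIO_DIM = 50, 50, 52, 128  # assumed sizes

class ExprDenoiser(nn.Module):
    """Predicts the noise added to an expression trajectory, given audio."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(EXPR_DIM + AUDIO_DIM + 1, 256), nn.SiLU(),
            nn.Linear(256, 256), nn.SiLU(),
            nn.Linear(256, EXPR_DIM),
        )

    def forward(self, x_t, audio, t):
        # x_t: (B, SEQ_LEN, EXPR_DIM), audio: (B, SEQ_LEN, AUDIO_DIM), t: (B,)
        t_emb = t.float().view(-1, 1, 1).expand(-1, x_t.shape[1], 1) / T_STEPS
        return self.net(torch.cat([x_t, audio, t_emb], dim=-1))

betas = torch.linspace(1e-4, 0.02, T_STEPS)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

@torch.no_grad()
def sample_expressions(model, audio):
    """Iteratively denoise a random latent into an expression trajectory."""
    x = torch.randn(audio.shape[0], SEQ_LEN, EXPR_DIM)
    for t in reversed(range(T_STEPS)):
        eps = model(x, audio, torch.full((audio.shape[0],), t))
        # Standard DDPM posterior mean; add noise except at the final step.
        x = (x - betas[t] / torch.sqrt(1.0 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
        if t > 0:
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)
    return x  # (B, SEQ_LEN, EXPR_DIM) predicted expression parameters

# Example: one clip of dummy audio features -> a plausible expression trajectory.
expr = sample_expressions(ExprDenoiser(), torch.randn(1, SEQ_LEN, AUDIO_DIM))
```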

2. Personalization (Subject‑Specific Fine‑Tuning)

  • For each person of interest, a short “reference set” of genuine video clips is used to adapt the generic diffusion model.
  • This step aligns the model’s latent space with the subject’s unique facial dynamics and identity cues.
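
A minimal sketch of the subject-specific fine-tuning step, reusing ExprDenoiser, T_STEPS, and alpha_bars from the sketch above. The dataset format, optimizer, learning rate, and step count are assumptions; the paper's exact personalization recipe is not reproduced here.

```python
# Hypothetical personalization loop: fine-tune the pretrained denoiser on a
# subject's genuine (audio, expression) pairs with the standard diffusion
# noise-prediction objective. Hyperparameters are assumptions.
import torch
import torch.nn.functional as F

def personalize(model, reference_pairs, steps=200, lr=1e-5):
    """reference_pairs: list of (audio, expr) tensors with shapes
    (SEQ_LEN, AUDIO_DIM) and (SEQ_LEN, EXPR_DIM) from real clips of one subject."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    for step in range(steps):
        audio, expr0 = reference_pairs[step % len(reference_pairs)]
        audio, expr0 = audio.unsqueeze(0), expr0.unsqueeze(0)
        t = torch.randint(0, T_STEPS, (1,))
        noise = torch.randn_like(expr0)
        # Forward-diffuse the clean expressions to timestep t, then train the
        # model to recover the injected noise.
        x_t = torch.sqrt(alpha_bars[t]) * expr0 + torch.sqrt(1 - alpha_bars[t]) * noise
        loss = F.mse_loss(model(x_t, audio, t), noise)
        opt.zero_grad(); loss.backward(); opt.step()
    return model
```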

3. Forgery Scoring via Reconstruction Error

  • When a test video is presented, the system feeds its audio to the personalized model and reconstructs the expected expression sequence.
  • The identity distance (e.g., L2 norm between reconstructed and observed facial features) serves as a forgery score: larger errors suggest the visual stream does not match the audio‑driven expectation, indicating manipulation.
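
A minimal sketch of the scoring step, reusing sample_expressions from above. The observed per-frame expression features are assumed to come from an off-the-shelf face tracker and are not extracted here.

```python
# Hypothetical forgery score: distance between the audio-driven reconstruction
# and the observed expressions; larger = more likely manipulated.
import torch

@torch.no_grad()
def forgery_score(personalized_model, audio, observed_expr):
    """audio: (1, SEQ_LEN, AUDIO_DIM); observed_expr: (1, SEQ_LEN, EXPR_DIM)."""
    predicted_expr = sample_expressions(personalized_model, audio)
    # Per-frame L2 distance between predicted and observed features,
    # averaged over the clip.
    per_frame = torch.linalg.norm(predicted_expr - observed_expr, dim=-1)
    return per_frame.mean().item()
```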

4. Zero‑Shot Detection Pipeline

  • No fake examples are required at any stage; the detector relies solely on the mismatch between audio‑driven predictions and the actual video.
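
One possible way to wire these pieces into a decision, reusing forgery_score and SEQ_LEN from above. Window scoring, max-pooling, and calibrating a threshold on the subject's genuine reference clips are assumptions rather than the paper's protocol (the paper reports AUC, which needs no fixed threshold).

```python
# Hypothetical zero-shot decision: score sliding windows of a test clip and
# compare against a threshold derived only from real reference clips.
import torch

@torch.no_grad()
def detect(personalized_model, audio, observed_expr, genuine_scores, k=3.0):
    """genuine_scores: forgery_score values precomputed on real clips of the
    same subject (at least two, so the std below is defined)."""
    scores = []
    for start in range(0, audio.shape[1] - SEQ_LEN + 1, SEQ_LEN):
        a = audio[:, start:start + SEQ_LEN]
        e = observed_expr[:, start:start + SEQ_LEN]
        scores.append(forgery_score(personalized_model, a, e))
    video_score = max(scores)  # one anomalous window is enough to flag the video
    ref = torch.tensor(genuine_scores)
    is_fake = video_score > (ref.mean() + k * ref.std()).item()
    return video_score, is_fake
```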

Results & Findings

| Dataset  | Prior SOTA AUC | ExposeAnyone AUC | Δ AUC  |
|----------|----------------|------------------|--------|
| DF‑TIMIT | 84.1 %         | 88.3 %           | +4.2 % |
| DFDCP    | 81.7 %         | 85.9 %           | +4.2 % |
| KoDF     | 78.4 %         | 82.6 %           | +4.2 % |
| IDForge  | 80.2 %         | 84.5 %           | +4.3 % |

  • Sora2 detection – ExposeAnyone correctly flags Sora2‑generated videos (AUC ≈ 87 %) while the best competing method drops below 70 %.
  • Corruption robustness – Under heavy Gaussian blur (σ = 5) and JPEG compression (Q = 20), the AUC drop is < 2 %, whereas supervised baselines lose > 6 %.

These numbers demonstrate that the audio‑driven reconstruction error is a powerful, manipulation‑agnostic cue.
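
For reference, a minimal OpenCV sketch of a corruption protocol matching the settings quoted above (Gaussian blur with σ = 5, JPEG quality 20), applied per frame before feature extraction. The paper's exact degradation pipeline may differ.

```python
# Hypothetical per-frame corruption used when re-scoring clips for the
# robustness test; parameters mirror the settings reported above.
import cv2
import numpy as np

def corrupt_frame(frame: np.ndarray, blur_sigma: float = 5.0, jpeg_q: int = 20) -> np.ndarray:
    """frame: HxWx3 uint8 BGR image, as read by cv2.VideoCapture."""
    blurred = cv2.GaussianBlur(frame, (0, 0), sigmaX=blur_sigma)
    ok, buf = cv2.imencode(".jpg", blurred, [int(cv2.IMWRITE_JPEG_QUALITY), jpeg_q])
    return cv2.imdecode(buf, cv2.IMREAD_COLOR)
```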

Practical Implications

  • Content‑moderation platforms – Deploy a lightweight “personalization” step for high‑risk accounts (e.g., public figures) and run real‑time forgery checks without maintaining a constantly updated fake‑dataset.
  • Authentication pipelines – Add an “audio‑expression consistency” check to video‑based identity verification (e.g., remote KYC) to thwart deep‑fake attacks.
  • Tooling for developers – Export the diffusion model as an ONNX/TensorRT graph, enabling integration into existing video‑processing back‑ends with modest GPU resources.
  • Forensic analysis – Use the reconstruction error heatmap to pinpoint exactly where a video diverges from expected facial dynamics, aiding manual review.
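
On the tooling point, a minimal sketch of exporting the toy denoiser from the earlier sketches to ONNX via torch.onnx.export; the real model's inputs and graph will differ, and TensorRT conversion would start from the exported file.

```python
# Hypothetical ONNX export of the toy denoiser sketched above.
import torch

model = ExprDenoiser().eval()
dummy = (torch.randn(1, SEQ_LEN, EXPR_DIM),
         torch.randn(1, SEQ_LEN, AUDIO_DIM),
         torch.zeros(1, dtype=torch.long))
torch.onnx.export(
    model, dummy, "expr_denoiser.onnx",
    input_names=["x_t", "audio", "t"], output_names=["eps"],
    dynamic_axes={"x_t": {0: "batch"}, "audio": {0: "batch"}, "t": {0: "batch"}},
    opset_version=17,
)
```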

Limitations & Future Work

  • Reference data requirement – Personalization needs a few seconds of clean video per subject; the method is less effective for completely unknown identities.
  • Audio quality dependence – Extremely noisy or dubbed audio reduces reconstruction fidelity, potentially increasing false positives.
  • Scalability to large user bases – Maintaining personalized models for millions of users would demand model‑sharing strategies or on‑the‑fly fine‑tuning.
  • Future directions – Explore few‑shot meta‑learning to reduce reference data, extend the approach to multi‑modal cues (e.g., lip‑reading + facial motion), and optimize diffusion inference for edge‑device deployment.

Authors

  • Kaede Shiohara
  • Toshihiko Yamasaki
  • Vladislav Golyanik

Paper Information

  • arXiv ID: 2601.02359v1
  • Categories: cs.CV
  • Published: January 5, 2026