[Paper] MoCoTalk: Multi-Conditional Diffusion with Adaptive Router for Controllable Talking Head Generation
Source: arXiv - 2605.08050v1
Overview
MoCoTalk is a diffusion‑based framework for generating realistic talking‑head videos steered by four control signals at once: a reference portrait, facial keypoints, a 3D morphable model (3DMM) shading mesh, and the speech audio. An adaptive routing mechanism learns how to weight these heterogeneous cues so that they do not interfere with one another, yielding high‑fidelity, lip‑synchronized videos in which identity, pose, expression, and mouth movement can each be controlled separately.
Key Contributions
- Multi‑conditional diffusion pipeline that jointly consumes image, keypoint, 3DMM shading mesh, and audio inputs.
- Adaptive Multi‑Condition Router: a channel‑wise, timestep‑aware gating module that dynamically weights each condition during diffusion, preventing destructive interference.
- Mouth‑Augmented Shading Mesh: a 3DMM‑derived representation that separates head motion, expression, lighting, and mouth dynamics, providing a temporally consistent geometric prior.
- Lip‑Consistency Loss: a novel audio‑visual alignment term that tightens the correspondence between speech and generated lip motions.
- State‑of‑the‑art performance on standard structural (e.g., PSNR, SSIM), motion (e.g., FVD), and perceptual (e.g., user study) metrics, while offering fine‑grained attribute control unavailable in single‑condition models.
Methodology
Condition Encoding
- Reference Image → a CNN encoder extracts identity‑related features.
- Facial Keypoints → a lightweight graph‑based encoder captures pose and coarse expression.
- Mouth‑Augmented Shading Mesh → 3DMM parameters are rendered into a shading mesh that isolates mouth geometry; a mesh encoder supplies geometry‑aware cues.
- Audio → a pretrained speech encoder (e.g., wav2vec) provides phoneme‑level embeddings.
Diffusion Core
- A UNet‑style video diffusion model progressively denoises a latent video representation.
- At each diffusion timestep, the Adaptive Multi‑Condition Router receives the four condition embeddings and produces a set of gating masks (one per condition, per channel). These masks are multiplied with the corresponding condition features before they are summed into the UNet’s cross‑attention layers.
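The gating step described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the class name `AdaptiveRouter`, the sinusoidal timestep embedding, the mean pooling, and the single linear layer per condition are all assumptions made for illustration, and every condition is assumed to be already projected to the same `(tokens, channels)` shape.

```python
import numpy as np

CONDITIONS = ("reference", "keypoints", "mesh", "audio")


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def timestep_embedding(t, dim):
    """Sinusoidal embedding of a scalar diffusion timestep (assumed design)."""
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    angles = t * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])


class AdaptiveRouter:
    """Hypothetical channel-wise, timestep-aware gate (illustrative only)."""

    def __init__(self, channels, t_dim=16, seed=0):
        rng = np.random.default_rng(seed)
        self.t_dim = t_dim
        in_dim = channels + t_dim
        # One small linear gate per condition; real weights would be learned.
        self.weights = {
            name: rng.normal(0.0, 0.1, size=(in_dim, channels))
            for name in CONDITIONS
        }
        self.biases = {name: np.zeros(channels) for name in CONDITIONS}

    def gates(self, features, t):
        """Return one gate vector in (0, 1) per condition, shape (channels,)."""
        t_emb = timestep_embedding(t, self.t_dim)
        out = {}
        for name in CONDITIONS:
            pooled = features[name].mean(axis=0)      # pool over tokens
            x = np.concatenate([pooled, t_emb])       # condition + timestep
            out[name] = sigmoid(x @ self.weights[name] + self.biases[name])
        return out

    def __call__(self, features, t):
        """Scale each condition channel-wise, then sum into one tensor."""
        g = self.gates(features, t)
        return sum(g[name] * features[name] for name in CONDITIONS)
```

Because the timestep embedding is part of the gate input, the same condition features receive different weights early and late in the denoising trajectory, which is the property the "timestep‑aware" description refers to.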
Training Objectives
- Standard diffusion loss (the denoising objective computed on noised video latents).
- Lip‑Consistency loss: an L2 distance between audio‑derived phoneme embeddings and the mouth region features of the generated frames, encouraging tight audio‑visual sync.
- Auxiliary geometry losses (e.g., mesh‑to‑image reprojection) to keep the shading mesh aligned with the output.
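As a rough illustration of how these terms might combine, the sketch below computes a per‑frame L2 distance between audio and mouth features and adds it to the other losses. The function names, the assumption that both feature sets already live in a shared embedding space, and the weights `w_lip` and `w_geo` are hypothetical; the summary does not state how the paper weights or projects these terms.

```python
import numpy as np


def lip_consistency_loss(audio_emb, mouth_feat):
    """Mean per-frame L2 distance between audio and mouth-region features.

    Both inputs have shape (frames, dim) and are assumed to be already
    projected into a shared embedding space (an assumption of this sketch).
    """
    audio_emb = np.asarray(audio_emb, dtype=float)
    mouth_feat = np.asarray(mouth_feat, dtype=float)
    if audio_emb.shape != mouth_feat.shape:
        raise ValueError("audio and mouth features must have the same shape")
    per_frame = np.linalg.norm(audio_emb - mouth_feat, axis=1)
    return float(per_frame.mean())


def total_loss(diffusion, lip, geometry, w_lip=0.1, w_geo=0.1):
    """Weighted sum of the three objectives; the weights are placeholders."""
    return diffusion + w_lip * lip + w_geo * geometry
```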
Inference Flexibility
- Because each condition is gated independently, developers can drop or replace any of them (e.g., swap the reference image to change identity while keeping the same speech and pose).
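One way to picture dropping or swapping a condition is sketched below. The helper names `prepare_conditions` and `fuse`, and the choice of an all‑zero embedding for a missing condition, are assumptions for illustration; the paper may handle absent conditions differently (for example with a learned null embedding).

```python
import numpy as np

CONDITIONS = ("reference", "keypoints", "mesh", "audio")


def prepare_conditions(provided, tokens, channels):
    """Fill in missing conditions with zeros and record which are present."""
    unknown = set(provided) - set(CONDITIONS)
    if unknown:
        raise KeyError(f"unknown conditions: {sorted(unknown)}")
    features, present = {}, {}
    for name in CONDITIONS:
        value = provided.get(name)
        present[name] = value is not None
        if value is None:
            features[name] = np.zeros((tokens, channels))  # assumed null input
        else:
            features[name] = np.asarray(value, dtype=float)
    return features, present


def fuse(features, gates, present):
    """Gated sum in which an absent condition contributes nothing."""
    return sum(
        gates[name] * features[name] * float(present[name])
        for name in CONDITIONS
    )
```

Swapping identity then amounts to calling `prepare_conditions` again with a different `"reference"` entry while leaving the other inputs unchanged.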
Results & Findings
| Metric | MoCoTalk | Prior Multi‑Condition (e.g., StyleTalk) | Single‑Condition Baseline |
|---|---|---|---|
| PSNR (higher) | 32.8 dB | 30.1 dB | 28.7 dB |
| SSIM (higher) | 0.94 | 0.89 | 0.85 |
| FVD (lower) | 45 | 78 | 112 |
| Lip‑Sync Error (LSE‑C) | 0.12 | 0.21 | 0.34 |
| User Preference (✓) | 78 % | 58 % | 44 % |
- Visual quality: MoCoTalk produces sharper facial details and more stable lighting across frames.
- Audio‑visual alignment: The lip‑consistency loss reduces mouth jitter and improves the intelligibility of lip motion, as confirmed by both the objective LSE‑C scores and the human evaluation.
- Control granularity: Ablation studies show that disabling the router leads to noticeable artifacts (e.g., mismatched pose vs. expression), confirming its necessity.
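For readers who want to reproduce the structural metric in the table, PSNR follows the standard definition 10·log10(MAX² / MSE). The sketch below is a generic implementation of that formula, not the paper's evaluation code; the function name and the 8‑bit default for `max_value` are assumptions.

```python
import numpy as np


def psnr(reference, generated, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two same-shaped frames."""
    ref = np.asarray(reference, dtype=np.float64)
    gen = np.asarray(generated, dtype=np.float64)
    if ref.shape != gen.shape:
        raise ValueError("frames must have the same shape")
    mse = np.mean((ref - gen) ** 2)
    if mse == 0:
        return float("inf")  # identical frames
    return float(10.0 * np.log10(max_value ** 2 / mse))
```

For a video, a common convention is to average the per‑frame PSNR over all frames; whether the paper does this is not stated in the summary.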
Practical Implications
- Virtual avatars & telepresence – Companies could generate high‑fidelity avatars that lip‑sync faithfully to a speaker’s voice while allowing pose or expression overrides (e.g., for VR meetings); real‑time use depends on the speed‑ups discussed under Limitations.
- Content creation – Filmmakers and game studios can reuse a single actor’s performance across multiple characters by swapping the reference image and mesh, dramatically cutting motion‑capture costs.
- Accessibility tools – Real‑time sign‑language avatars could benefit from the fine‑grained control over mouth shapes and head pose, improving readability for deaf users.
- SDK integration – The modular condition encoders and router can be exposed as separate API endpoints, letting developers plug in custom pose detectors, proprietary 3D face models, or domain‑specific audio embeddings without retraining the whole diffusion model.
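A pluggable encoder interface along these lines could look like the sketch below. Everything here is hypothetical: `ConditionEncoder`, `EncoderRegistry`, and the toy `RandomProjectionKeypointEncoder` are illustrative names, not part of any released MoCoTalk SDK, and the random projection merely stands in for a real pose encoder.

```python
from typing import Dict, Protocol

import numpy as np


class ConditionEncoder(Protocol):
    """Minimal contract a custom encoder would satisfy (assumed interface)."""

    name: str

    def encode(self, raw: np.ndarray) -> np.ndarray: ...


class RandomProjectionKeypointEncoder:
    """Toy stand-in for a custom pose encoder: center, flatten, project."""

    name = "keypoints"

    def __init__(self, num_points, channels, seed=0):
        rng = np.random.default_rng(seed)
        self.proj = rng.normal(0.0, 0.1, size=(num_points * 2, channels))

    def encode(self, raw):
        pts = np.asarray(raw, dtype=float)              # (frames, points, 2)
        centered = pts - pts.mean(axis=1, keepdims=True)
        return centered.reshape(pts.shape[0], -1) @ self.proj


class EncoderRegistry:
    """Maps condition names to whichever encoder the developer plugs in."""

    def __init__(self):
        self._encoders: Dict[str, ConditionEncoder] = {}

    def register(self, encoder):
        self._encoders[encoder.name] = encoder

    def encode(self, raw_inputs):
        missing = set(raw_inputs) - set(self._encoders)
        if missing:
            raise KeyError(f"no encoder registered for: {sorted(missing)}")
        return {
            name: self._encoders[name].encode(raw)
            for name, raw in raw_inputs.items()
        }
```

The design choice illustrated is that the downstream router only sees named embeddings of a fixed width, so an encoder can be replaced as long as it keeps the same name and output shape.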
Limitations & Future Work
- Computation cost – Video diffusion remains memory‑intensive; real‑time deployment still requires model pruning or distillation.
- Generalization to extreme poses – The current 3DMM mesh struggles with head rotations beyond ±45° (approaching profile views), leading to occasional geometry glitches.
- Audio domain shift – The lip‑consistency loss is tuned on clean speech; noisy or accented audio may degrade sync quality.
Future research directions include lightweight diffusion variants, better handling of out‑of‑distribution head poses via dynamic mesh refinement, and extending the router to incorporate additional modalities such as text prompts or emotion tags.
Authors
- Xinyan Ye
- Jiankang Deng
- Abbas Edalat
Paper Information
- arXiv ID: 2605.08050v1
- Categories: cs.CV
- Published: May 8, 2026