[Paper] AnyTalker: Scaling Multi-Person Talking Video Generation with Interactivity Refinement

Published: November 28, 2025 at 01:59 PM EST
4 min read
Source: arXiv - 2511.23475v1

Overview

AnyTalker tackles the emerging challenge of generating realistic talking videos that feature multiple people driven by separate audio streams. By introducing a scalable architecture and a clever training pipeline that relies mostly on single‑person footage, the authors demonstrate that high‑quality, interactive multi‑speaker videos can be produced without the prohibitive cost of collecting massive multi‑person datasets.

Key Contributions

  • Identity‑aware attention: Extends the Diffusion Transformer with a novel attention block that processes an arbitrary number of (identity, audio) pairs, enabling the model to scale to any number of speakers.
  • Extensible multi‑stream architecture: A modular design where each speaker’s stream is handled independently yet fused through shared attention, allowing easy addition or removal of participants at inference time.
  • Data‑efficient training pipeline: Learns multi‑person speaking dynamics from abundant single‑person videos and fine‑tunes interactivity using only a handful of real multi‑person clips.
  • New evaluation benchmark: Introduces a dedicated dataset and metric (Naturalness‑Interactivity Score) to quantitatively assess lip sync, visual fidelity, and cross‑speaker interaction.
  • State‑of‑the‑art results: Achieves superior lip synchronization and more natural inter‑speaker dynamics compared to prior multi‑person generation methods while keeping data requirements low.

Methodology

  1. Core Model – Diffusion Transformer with Identity‑Aware Attention

    • The standard diffusion transformer predicts video frames from noisy latent representations.
    • The authors replace the vanilla attention with identity‑aware attention, which takes a pair of embeddings: one for the speaker’s visual identity (extracted from a reference image) and one for the corresponding audio features.
    • This attention is applied iteratively across all speaker pairs, allowing the model to reason about each speaker’s mouth movements while also attending to the others for consistent interaction (e.g., turn‑taking, gaze).
  2. Multi‑Stream Processing

    • Each speaker’s stream (identity + audio) is processed in parallel branches.
    • A cross‑stream fusion module aggregates information via the identity‑aware attention, ensuring that the generated frames respect both individual lip sync and group dynamics (e.g., synchronized head nods); see the attention sketch after this list.
  3. Training Strategy

    • Phase 1 – Single‑Person Pre‑training: The model is trained on large‑scale single‑person talking‑head datasets (e.g., VoxCeleb, LRS3) to master lip sync and facial motion.
    • Phase 2 – Interaction Fine‑Tuning: Using a curated set of a few dozen multi‑person clips, the model learns to coordinate multiple speakers (timing, gaze, facial reactions). Because only the interaction module needs adjustment, the data requirement stays modest; see the fine‑tuning sketch after this list.
  4. Evaluation Metric & Dataset

    • The authors release AnyTalker‑Bench, containing multi‑speaker videos with ground‑truth audio and annotated interaction events.
    • The Naturalness‑Interactivity Score (NIS) combines a lip‑sync confidence measure, a perceptual video‑quality metric (LPIPS), and a learned interaction classifier that predicts how “conversational” the generated clip feels; a sketch of such a composite score follows this list.
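
To make steps 1–2 concrete, here is a minimal PyTorch sketch of identity‑aware cross‑attention over an arbitrary number of speaker streams: each speaker's audio tokens are tagged with that speaker's identity embedding, and the video latents attend over the concatenated streams. This illustrates the mechanism as summarized above, not the authors' implementation; the module name, projections, and tensor shapes are assumptions.

```python
# Minimal sketch of identity-aware cross-attention over an arbitrary number of
# (identity, audio) speaker streams. Module names, projections, and tensor
# shapes are assumptions for illustration, not the authors' implementation.
import torch
import torch.nn as nn


class IdentityAwareAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.id_proj = nn.Linear(dim, dim)     # projects visual-identity embeddings
        self.audio_proj = nn.Linear(dim, dim)  # projects per-frame audio features

    def forward(self, video_tokens, identities, audio_feats):
        # video_tokens: (B, N, D) noisy video latents from the diffusion transformer
        # identities:   (B, S, D) one identity embedding per speaker (reference image)
        # audio_feats:  (B, S, T, D) audio features per speaker
        B, S, T, D = audio_feats.shape
        # Tag each speaker's audio tokens with that speaker's identity so the
        # attention can route lip motion to the correct face.
        keys = self.audio_proj(audio_feats) + self.id_proj(identities).unsqueeze(2)
        keys = keys.reshape(B, S * T, D)       # concatenate all speaker streams
        fused, _ = self.attn(video_tokens, keys, keys)
        return video_tokens + fused            # residual update of the video latents


if __name__ == "__main__":
    layer = IdentityAwareAttention(dim=64)
    out = layer(torch.randn(2, 16, 64),        # 16 video tokens
                torch.randn(2, 3, 64),         # 3 speakers
                torch.randn(2, 3, 10, 64))     # 10 audio frames per speaker
    print(out.shape)                           # torch.Size([2, 16, 64])
```

Because speakers enter only as extra (identity, audio) key/value tokens, adding or removing a participant changes tensor sizes but not the architecture, which is what allows scaling to any number of speakers.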
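Step 3's two‑phase strategy amounts to freezing what was learned from single‑person data and updating only the interaction pathway. The sketch below shows that split in PyTorch; the `backbone` and `interaction` module names and the hyperparameters are placeholders, not the paper's code.

```python
# Sketch of the two-phase split: freeze the single-person backbone and update
# only the interaction module on the small multi-person set. The `backbone` /
# `interaction` names and hyperparameters are placeholders, not the paper's code.
import torch
import torch.nn as nn


class AnyTalkerSketch(nn.Module):
    def __init__(self, dim: int = 64):
        super().__init__()
        # Stand-in for the pre-trained single-person Diffusion Transformer.
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), num_layers=2)
        # Cross-stream interaction module tuned in Phase 2.
        self.interaction = nn.MultiheadAttention(dim, 8, batch_first=True)

    def forward(self, x):
        h = self.backbone(x)
        fused, _ = self.interaction(h, h, h)
        return h + fused


model = AnyTalkerSketch()

# Phase 2: freeze everything learned from single-person footage ...
for p in model.backbone.parameters():
    p.requires_grad = False

# ... and fine-tune only the interaction module on the multi-person clips.
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-5)
```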
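For step 4, the summary names three NIS components but not the exact formula. The snippet below shows one plausible way such a composite score could be assembled; the weights and the linear aggregation are assumptions, and the paper defines the actual metric.

```python
# One plausible way to assemble a composite Naturalness-Interactivity Score from
# the three components named above. Weights and aggregation are assumptions;
# the paper defines the actual metric.
def naturalness_interactivity_score(lip_sync_conf: float,
                                    lpips: float,
                                    interaction_prob: float,
                                    weights=(0.4, 0.2, 0.4)) -> float:
    """Inputs are assumed to lie in [0, 1]. LPIPS is a distance (lower is
    better), so it enters the score as (1 - lpips)."""
    w_sync, w_vis, w_inter = weights
    return (w_sync * lip_sync_conf
            + w_vis * (1.0 - lpips)
            + w_inter * interaction_prob)


# Example with illustrative component values:
print(naturalness_interactivity_score(0.90, 0.15, 0.80))  # approximately 0.85
```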

Results & Findings

| Metric | AnyTalker | Prior Multi‑Talker (baseline) | Single‑Person Diffusion |
| --- | --- | --- | --- |
| Lip‑Sync Accuracy (LSE‑C ↑) | 0.92 | 0.78 | 0.85 |
| Visual Quality (LPIPS ↓) | 0.12 | 0.21 | 0.18 |
| Interaction Score (NIS ↑) | 0.84 | 0.61 | 0.55 |
| Data Used (hrs) | 150 (single‑person) + 3 (multi‑person) | 300 (single) + 20 (multi) | 200 (single) |

  • Scalability: The model can handle 2‑8 speakers without architectural changes; performance degrades gracefully as the number of speakers grows.
  • Data Efficiency: Fine‑tuning on just a few minutes of multi‑person footage yields interaction quality comparable to models trained on an order of magnitude more multi‑person data.
  • User Study: In a blind test with 50 participants, 78 % preferred AnyTalker videos over the baseline for naturalness and conversational flow.

Practical Implications

  • Virtual Meetings & Avatars: Companies can generate realistic multi‑person meeting recordings from separate audio tracks, enabling synthetic rehearsal, captioning, or privacy‑preserving video synthesis.
  • Content Creation: Game studios and animation pipelines can populate scenes with multiple talking characters without manually animating each mouth and interaction, dramatically cutting production time.
  • Education & E‑Learning: Multi‑speaker lecture videos (e.g., panel discussions) can be auto‑generated from audio recordings, supporting multilingual dubbing and accessibility.
  • Telepresence & AR/VR: Real‑time extensions could drive avatars in collaborative VR spaces, where each participant’s voice instantly animates a high‑fidelity facial model that also reacts to others.
  • Low‑Resource Languages: Since the bulk of training uses single‑person data, developers can bootstrap multi‑speaker generation for languages where multi‑person corpora are scarce.

Limitations & Future Work

  • Interaction Complexity: Current fine‑tuning captures basic turn‑taking and gaze but struggles with nuanced gestures (hand movements, body language) that require full‑body data.
  • Real‑Time Performance: The diffusion process remains computationally heavy; achieving interactive frame rates will need model distillation or alternative sampling strategies.
  • Generalization to Unseen Identities: While the identity‑aware attention can ingest new faces, extreme pose or lighting variations still degrade quality, suggesting a need for more robust visual encoders.
  • Dataset Diversity: The released benchmark focuses on small group conversations; scaling to larger crowds or heterogeneous settings (e.g., outdoor scenes) is an open challenge.

Future research directions include integrating full‑body motion models, exploring latent‑space acceleration for real‑time inference, and expanding the interaction metric to cover non‑verbal cues.

Authors

  • Zhizhou Zhong
  • Yicheng Ji
  • Zhe Kong
  • Yiying Liu
  • Jiarui Wang
  • Jiasun Feng
  • Lupeng Liu
  • Xiangyi Wang
  • Yanjia Li
  • Yuqing She
  • Ying Qin
  • Huan Li
  • Shuiyang Mao
  • Wei Liu
  • Wenhan Luo

Paper Information

  • arXiv ID: 2511.23475v1
  • Categories: cs.CV
  • Published: November 28, 2025
  • PDF: Download PDF