[Paper] Talking Together: Synthesizing Co-Located 3D Conversations from Audio
Source: arXiv - 2603.08674v1
Overview
This paper presents the first system that turns a single mixed-audio recording of a face-to-face conversation into two fully animated 3D avatars that not only lip-sync to the speech but also maintain realistic spatial relationships: relative positions, head orientations, and mutual gaze. In doing so, it bridges the gap between today's "talking-head" video-conference avatars and truly immersive, co-located virtual dialogues.
Key Contributions
- Dual‑stream 3D animation pipeline that simultaneously generates the full facial performance of both speakers from a single audio track.
- Speaker role embeddings + cross‑speaker attention to disentangle mixed audio and capture turn‑taking dynamics.
- Text‑driven control of relative head pose, allowing developers to script where each avatar should be positioned or turned.
- Eye‑gaze loss that explicitly encourages natural, mutual eye contact between the two avatars.
- Large‑scale dyadic conversation dataset (≈2 M speaker pairs) harvested from in‑the‑wild videos, enabling data‑hungry deep models to learn realistic interaction cues.
- Quantitative and user‑study evidence showing higher perceived realism and interaction coherence compared with state‑of‑the‑art talking‑head generators.
Methodology
- Data Collection – The authors built an automated pipeline that scrapes publicly available videos, detects dialogue scenes, extracts paired face tracks, and aligns them with the mixed audio. This yields a massive corpus of audio paired with synchronized 3D facial reconstructions (obtained with existing 3D face reconstruction tools).
- Dual‑Stream Architecture – Two parallel neural streams each output a 3D facial animation (mesh vertices, blendshape coefficients, eye‑gaze vectors) for one participant.
  - Speaker Role Embedding tags each stream as "Speaker A" or "Speaker B", giving the network a notion of turn order.
  - Cross-Attention Module lets each stream attend to the other's hidden state, enabling the model to infer who is speaking at any moment and to coordinate gestures (e.g., nodding while the other talks).
- Audio Disentanglement – The mixed audio is passed through a shared encoder; the cross‑attention splits the signal into speaker‑specific prosodic features that drive lip‑sync.
- Spatial & Gaze Control – A lightweight text parser converts simple commands like “Speaker A faces left, Speaker B looks at Speaker A” into target orientation vectors that are injected as conditioning inputs.
- Loss Functions –
  - Lip-Sync Loss (L1 on phoneme-aligned blendshapes)
  - Pose Consistency Loss (penalizes unrealistic head jumps)
  - Eye-Gaze Loss (encourages reciprocal gaze direction)
  - Adversarial Loss (a discriminator judges overall realism)
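The interplay of role embeddings, cross-speaker attention, and the loss terms above can be sketched in broad strokes. Everything below (array shapes, the additive role-embedding scheme, and the exact loss forms) is a hypothetical reconstruction for illustration, not the authors' implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_speaker_attention(h_self, h_other):
    """Each stream attends to the other's hidden states.
    h_self, h_other: (T, D) per-frame features for one speaker."""
    scores = h_self @ h_other.T / np.sqrt(h_self.shape[-1])  # (T, T)
    return softmax(scores, axis=-1) @ h_other                # (T, D)

rng = np.random.default_rng(0)
T, D = 8, 16
shared = rng.normal(size=(T, D))             # shared encoding of the mixed audio
role_a = rng.normal(size=D)                  # learned "Speaker A" embedding (assumed additive)
role_b = rng.normal(size=D)                  # learned "Speaker B" embedding
h_a, h_b = shared + role_a, shared + role_b  # tag each stream with its role

ctx_a = cross_speaker_attention(h_a, h_b)    # stream A conditioned on stream B
ctx_b = cross_speaker_attention(h_b, h_a)    # stream B conditioned on stream A

# Hypothetical loss terms (forms assumed, not taken from the paper):
def lip_sync_loss(pred_bs, gt_bs):
    """L1 distance on phoneme-aligned blendshape coefficients."""
    return np.abs(pred_bs - gt_bs).mean()

def eye_gaze_loss(gaze_a, gaze_b):
    """Encourage reciprocal gaze: unit gaze vectors of the two avatars
    should point in opposite directions (cosine of gaze_a and -gaze_b high)."""
    cos = (gaze_a * -gaze_b).sum(-1)
    return (1.0 - cos).mean()
```

In an actual trained model the attention and embeddings would be learned layers (e.g., multi-head attention with projections); the numpy version only shows the data flow that lets each stream condition on the other's state.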
Results & Findings
| Metric | Baseline (Talking‑Head) | Proposed Dual‑Stream |
|---|---|---|
| Lip‑Sync Error (ms) | 38 | 21 |
| Gaze Reciprocity Score (0‑1) | 0.42 | 0.78 |
| User Study – Realism (5‑point Likert) | 3.1 | 4.3 |
| User Study – Interaction Coherence (5‑point Likert) | 2.9 | 4.0 |
- The system produces smoother head movements and maintains consistent eye contact throughout the dialogue.
- Text‑driven pose control works reliably: deviations from the commanded orientation stay below 5°.
- Ablation studies confirm that removing either the cross‑attention module or the eye‑gaze loss dramatically degrades both the objective metrics and perceived realism.
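The summary does not define how the gaze reciprocity score in the table is computed. One plausible formulation (an assumption, including the cosine threshold) is the fraction of frames in which each avatar's gaze vector points at the other's head:

```python
import numpy as np

def gaze_reciprocity(gaze_a, gaze_b, pos_a, pos_b, thresh=0.9):
    """Fraction of frames in which both avatars look at each other.
    gaze_*: (T, 3) unit gaze directions; pos_*: (T, 3) head positions.
    `thresh` is an assumed cosine cutoff for counting a gaze 'hit'."""
    def unit(v):
        return v / np.linalg.norm(v, axis=-1, keepdims=True)
    a_to_b = unit(pos_b - pos_a)                 # direction from A's head to B's head
    b_to_a = unit(pos_a - pos_b)
    hit_a = (gaze_a * a_to_b).sum(-1) > thresh   # A is looking at B
    hit_b = (gaze_b * b_to_a).sum(-1) > thresh   # B is looking at A
    return float((hit_a & hit_b).mean())
```

For example, a clip where both avatars face each other in one frame and one avatar looks away in the next would score 0.5 under this definition.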
Practical Implications
- VR/AR Telepresence – Developers can replace low‑fidelity video streams with lightweight 3D avatars that still convey subtle non‑verbal cues, reducing bandwidth while preserving presence.
- Virtual Production & Gaming – Automated generation of two‑character cutscenes from voice‑over recordings cuts down on manual animation labor.
- Remote Collaboration Tools – Real‑time integration could enable “spatial chat” where participants appear around a virtual table, with the system handling turn‑taking and gaze automatically.
- Accessibility – The text‑based pose controller allows designers to script inclusive interactions (e.g., ensuring both avatars face the camera for sign‑language overlays).
Limitations & Future Work
- Audio Quality Dependency – The model assumes relatively clean speech; heavy background noise still hurts speaker disentanglement.
- Static Body Representation – Only facial and head motion are modeled; full‑body gestures remain out of scope.
- Real‑Time Performance – Current inference runs at ~8 fps on a high‑end GPU; optimizing for real‑time deployment is an open challenge.
- Cultural Nuances – The dataset is biased toward Western conversational styles; future work should broaden cultural diversity to capture different eye‑contact norms and gestural conventions.
Overall, this research pushes conversational avatar generation from static “talking heads” toward truly interactive, spatially aware 3D agents—opening new avenues for immersive communication platforms.
Authors
- Mengyi Shan
- Shouchieh Chang
- Ziqian Bai
- Shichen Liu
- Yinda Zhang
- Luchuan Song
- Rohit Pandey
- Sean Fanello
- Zeng Huang
Paper Information
- arXiv ID: 2603.08674v1
- Categories: cs.CV
- Published: March 9, 2026