[Paper] MAViD: A Multimodal Framework for Audio-Visual Dialogue Understanding and Generation

Published: December 2, 2025 at 01:55 PM EST
3 min read
Source: arXiv - 2512.03034v1

Overview

MAViD introduces a multimodal audio‑visual dialogue system that can both understand user queries and generate realistic, long‑form video‑plus‑speech responses. By tackling the twin challenges of deep multimodal fusion and controllable generation, the work pushes conversational agents beyond text‑only chatbots toward immersive, human‑like interactions.

Key Contributions

  • Conductor‑Creator architecture: separates reasoning (Conductor) from content synthesis (Creator), enabling fine‑grained control over motion and speech.
  • Hybrid AR‑Diffusion generation: combines an autoregressive audio model with a diffusion‑based video model to produce high‑fidelity, temporally consistent audiovisual clips.
  • Novel multimodal fusion module: explicitly links consecutive video clips and audio streams, preserving identity, timbre, and tone across long dialogues.
  • End‑to‑end training pipeline: jointly optimizes understanding, instruction generation, and audiovisual synthesis on a unified dataset.
  • Extensive evaluation: demonstrates coherent, context‑aware long‑duration dialogues and superior visual/audio quality compared with prior non‑interactive baselines.

Methodology

  1. Understanding & Instruction (Conductor)

    • Takes a multimodal user query (text, audio, video) and performs perception, reasoning, and planning.
    • Decomposes the desired response into two instruction streams: a motion plan (what should happen visually) and a speech plan (what should be said, with tone and timbre).
  2. Content Synthesis (Creator)

    • Audio branch: an autoregressive transformer predicts mel‑spectrogram frames conditioned on the speech plan, ensuring natural prosody and speaker consistency (a minimal sketch of such a decoder follows this list).
    • Video branch: a diffusion model (DiT‑style) generates video frames from the motion plan, guided by the audio output to maintain lip‑sync and gesture alignment.
  3. Fusion Module

    • Bridges the audio and video streams across successive clips using cross‑modal attention and temporal convolution, so that a 30‑second dialogue feels like a single continuous scene rather than disjoint snippets; a sketch of such a fusion module appears after this list.
  4. Training

    • Trained on a curated audio‑visual dialogue dataset where each turn includes user input, ground‑truth response video, and transcribed speech.
    • Losses combine language modeling, audio reconstruction (L1 + adversarial), video diffusion (denoising score matching), and a multimodal consistency term; a schematic version of this combined objective is sketched below.
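
To make step 2 concrete, here is a minimal PyTorch‑style sketch of what the Creator's autoregressive audio branch could look like: a transformer that decodes mel‑spectrogram frames one at a time, conditioned on a speech‑plan prefix. The class name ARAudioDecoder, the dimensions, and the start‑frame convention are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: names, shapes, and the conditioning scheme are
# assumptions for exposition, not MAViD's released code.
import torch
import torch.nn as nn


class ARAudioDecoder(nn.Module):
    """Autoregressive transformer that decodes mel frames from a speech plan."""

    def __init__(self, n_mels: int = 80, d_plan: int = 256, d_model: int = 256):
        super().__init__()
        self.n_mels = n_mels
        self.plan_proj = nn.Linear(d_plan, d_model)   # embed the speech plan
        self.mel_proj = nn.Linear(n_mels, d_model)    # embed generated mel frames
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(d_model, n_mels)

    @torch.no_grad()
    def generate(self, speech_plan: torch.Tensor, n_frames: int) -> torch.Tensor:
        # speech_plan: (B, T_plan, d_plan), used as a conditioning prefix.
        prefix = self.plan_proj(speech_plan)
        mels = torch.zeros(speech_plan.size(0), 1, self.n_mels,
                           device=speech_plan.device)               # silent start frame
        for _ in range(n_frames):
            tokens = torch.cat([prefix, self.mel_proj(mels)], dim=1)
            seq_len = tokens.size(1)
            # Causal mask: each position attends only to itself and earlier positions.
            mask = torch.triu(torch.full((seq_len, seq_len), float("-inf"),
                                         device=tokens.device), diagonal=1)
            h = self.backbone(tokens, mask=mask)
            mels = torch.cat([mels, self.head(h[:, -1:, :])], dim=1)  # append next frame
        return mels[:, 1:, :]                                         # (B, n_frames, n_mels)
```

In the paper's pipeline, the diffusion‑based video branch then generates frames conditioned on the motion plan and guided by this audio output, which is what keeps lip motion and gestures aligned with the speech.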
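
The fusion module in step 3 is described as cross‑modal attention plus temporal convolution linking successive clips. A compact sketch of one way such a block could be wired, with CrossClipFusion and all shapes being assumptions rather than the paper's exact design, follows:

```python
# Sketch of a cross-clip audio-video fusion block: video tokens attend to
# audio tokens, and a temporal convolution smooths features across the
# boundary with the previous clip. Names and shapes are assumptions.
import torch
import torch.nn as nn


class CrossClipFusion(nn.Module):
    def __init__(self, d_model: int = 256, n_heads: int = 4, kernel: int = 5):
        super().__init__()
        self.av_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.temporal = nn.Conv1d(d_model, d_model, kernel, padding=kernel // 2)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, video_tokens, audio_tokens, prev_clip_tokens):
        # video_tokens:     (B, T_v, D) features of the clip being generated
        # audio_tokens:     (B, T_a, D) features of the generated speech
        # prev_clip_tokens: (B, T_p, D) tail features of the previous clip
        fused, _ = self.av_attn(video_tokens, audio_tokens, audio_tokens)
        fused = self.norm(video_tokens + fused)             # audio-aware video features

        # Prepend the previous clip's tail so identity/timbre cues carry over,
        # then smooth along the time axis with a 1-D convolution.
        seq = torch.cat([prev_clip_tokens, fused], dim=1)   # (B, T_p + T_v, D)
        seq = self.temporal(seq.transpose(1, 2)).transpose(1, 2)
        return seq[:, prev_clip_tokens.size(1):, :]         # keep the current-clip span


# Usage with dummy tensors:
fusion = CrossClipFusion()
out = fusion(torch.randn(2, 48, 256),    # current-clip video tokens
             torch.randn(2, 120, 256),   # audio tokens
             torch.randn(2, 16, 256))    # tail of previous clip -> out: (2, 48, 256)
```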
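
Finally, the training objective in step 4 combines four terms. A schematic version, where the loss weights and the exact form of the consistency term are illustrative assumptions, could look like this:

```python
# Schematic combined objective: the weights and the consistency term are
# assumptions for exposition, not values reported in the paper.
import torch
import torch.nn.functional as F


def combined_loss(lm_logits, lm_targets,          # Conductor instruction tokens
                  mel_pred, mel_gt, disc_fake,    # audio branch (+ discriminator logits)
                  eps_pred, eps_true,             # video diffusion noise prediction
                  audio_emb, video_emb,           # clip-level embeddings
                  w=(1.0, 1.0, 0.1, 1.0, 0.5)):
    # Language modeling on the instruction streams; lm_logits assumed (B, T, V).
    l_lm = F.cross_entropy(lm_logits.transpose(1, 2), lm_targets)
    # Audio reconstruction: L1 on mel frames plus a non-saturating adversarial term.
    l_l1 = F.l1_loss(mel_pred, mel_gt)
    l_adv = F.binary_cross_entropy_with_logits(disc_fake, torch.ones_like(disc_fake))
    # Video diffusion: denoising score matching as noise-prediction MSE.
    l_dsm = F.mse_loss(eps_pred, eps_true)
    # Multimodal consistency: pull paired audio/video embeddings together.
    l_cons = 1.0 - F.cosine_similarity(audio_emb, video_emb, dim=-1).mean()
    return w[0]*l_lm + w[1]*l_l1 + w[2]*l_adv + w[3]*l_dsm + w[4]*l_cons
```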

Results & Findings

  • Coherence: Human evaluators rated MAViD’s dialogues 23 % more contextually coherent than the strongest baseline (a text‑to‑video model with separate TTS).
  • Audio‑Video Sync: Lip‑sync error dropped from 0.42 s (baseline) to 0.07 s, approaching real‑world recordings.
  • Identity Preservation: Over 30‑second interactions, speaker identity (face, voice timbre) remained stable in >95 % of cases, a notable improvement over prior diffusion‑only pipelines that suffered drift.
  • Generation Speed: The hybrid AR‑Diffusion design delivers 2.3× faster inference than a pure‑diffusion approach, making near‑real‑time interaction feasible on a single RTX 4090.

Practical Implications

  • Virtual Assistants & Customer Service: Deploy agents that can show a product demo while talking through instructions, reducing reliance on static screenshots or separate video clips.
  • E‑learning & Training: Generate personalized, on‑the‑fly tutorial videos that adapt to learner questions, maintaining a consistent instructor avatar.
  • Gaming & XR: Populate interactive NPCs with believable speech and gestures, enabling richer story‑driven experiences without hand‑crafted cutscenes.
  • Content Creation: Automate the production of explainer videos or marketing reels where the script and visual storyboard are generated from a single multimodal prompt.

Limitations & Future Work

  • Dataset Bias: The training corpus is limited to a few domains (e.g., indoor scenes, English speakers), which may affect generalization to outdoor or multilingual settings.
  • Compute Requirements: While faster than pure diffusion, real‑time deployment on edge devices still demands hardware acceleration.
  • Fine‑Grained Control: Current instruction granularity is limited to motion vs. speech; future versions could expose style, emotion, or camera parameters to developers.
  • Evaluation Metrics: Objective metrics for long‑duration multimodal coherence remain an open research problem; the authors plan to develop benchmark suites for this purpose.

Authors

  • Youxin Pang
  • Jiajun Liu
  • Lingfeng Tan
  • Yong Zhang
  • Feng Gao
  • Xiang Deng
  • Zhuoliang Kang
  • Xiaoming Wei
  • Yebin Liu

Paper Information

  • arXiv ID: 2512.03034v1
  • Categories: cs.CV
  • Published: December 2, 2025
  • PDF: https://arxiv.org/pdf/2512.03034v1