[Paper] JoVA: Unified Multimodal Learning for Joint Video-Audio Generation

Published: December 15, 2025 at 01:58 PM EST
3 min read
Source: arXiv - 2512.13677v1

Overview

The paper introduces JoVA, a unified transformer‑based framework that can generate synchronized video and audio streams from a single latent representation. By letting video and audio tokens attend to each other within the same self‑attention layers, JoVA eliminates the need for heavyweight fusion or alignment modules while still achieving high‑quality lip‑speech synchronization—something most prior models struggle with.

Key Contributions

  • Joint self‑attention across modalities: Video and audio tokens share the same transformer layers, enabling direct cross‑modal interaction without extra alignment blocks (see the sketch after this list).
  • Mouth‑area loss: A lightweight supervision term derived from facial key‑point detectors focuses learning on the mouth region, dramatically improving lip‑sync accuracy.
  • Unified generation pipeline: A single end‑to‑end model produces both visual frames and corresponding audio, simplifying deployment compared with cascaded video‑only / audio‑only systems.
  • State‑of‑the‑art performance: Empirical results show JoVA matches or exceeds specialized audio‑driven and unified baselines on lip‑sync metrics, speech quality (e.g., PESQ, STOI), and overall video‑audio fidelity.
  • Scalable architecture: Built on standard transformer blocks, JoVA can leverage existing pretrained vision‑language or audio models, facilitating transfer learning and large‑scale training.
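
As a rough illustration of the joint self‑attention idea above, the sketch below shows a single transformer block operating over the concatenation of video and audio tokens, so cross‑modal interaction happens inside the ordinary attention map rather than in a separate fusion module. This is not the authors' code: the class name, dimensions, and token counts are illustrative assumptions (PyTorch‑style).

```python
import torch
import torch.nn as nn


class JointSelfAttentionBlock(nn.Module):
    """Illustrative transformer block whose self-attention spans both modalities.

    Hypothetical sketch (not the paper's code): video and audio tokens are
    concatenated along the sequence dimension, so every video token can attend
    to every audio token (and vice versa) without a dedicated fusion module.
    """

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, video_tokens, audio_tokens):
        # video_tokens: (B, Nv, D), audio_tokens: (B, Na, D)
        x = torch.cat([video_tokens, audio_tokens], dim=1)  # (B, Nv + Na, D)
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)  # every token attends to both modalities
        x = x + attn_out
        x = x + self.mlp(self.norm2(x))
        n_video = video_tokens.shape[1]
        return x[:, :n_video], x[:, n_video:]  # split back into per-modality streams


# Toy usage: 196 visual patch tokens and 64 audio patch tokens, embedding dim 512.
block = JointSelfAttentionBlock()
v, a = torch.randn(2, 196, 512), torch.randn(2, 64, 512)
v_out, a_out = block(v, a)
```

One consequence of this design is that attention cost grows with the combined video‑plus‑audio sequence length, which is presumably why the limitations below point to memory‑efficient tokenization for higher resolutions and longer clips.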

Methodology

  1. Tokenization
    • Video frames are split into a grid of visual patches (e.g., 16×16) and linearly projected into token embeddings.
    • Audio is converted to a mel‑spectrogram, then chunked into temporal patches and similarly embedded.
  2. Joint Transformer Encoder‑Decoder
    • Both token streams are concatenated and fed into a stack of transformer layers.
    • Each layer’s self‑attention operates over the combined token set, allowing every video token to attend to audio tokens (and vice versa) in a single pass.
  3. Mouth‑Area Loss
    • A pre‑trained facial key‑point detector extracts mouth landmarks from generated frames.
    • The loss penalizes deviations between predicted and ground‑truth mouth key‑points, encouraging the model to align lip movements with the spoken phonemes.
  4. Training Objective
    • Standard cross‑entropy (or diffusion) loss for token reconstruction.
    • Auxiliary mouth‑area loss weighted to balance visual fidelity and synchronization.
  5. Inference
    • Given a prompt (e.g., text, audio seed, or latent code), the model autoregressively decodes a sequence of video‑audio tokens, which are then de‑tokenized back into frames and waveform.
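
To make the mouth‑area supervision (steps 3–4) more concrete, here is a minimal PyTorch‑style sketch of how such a term could be added to the main reconstruction loss. The paper does not publish code, so the 68‑point landmark layout, the `detect_landmarks` interface, and the weight `lambda_mouth` are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

# Illustrative assumption: a 68-point facial landmark layout where indices 48-67
# cover the mouth, and `detect_landmarks` is a frozen, pre-trained key-point
# detector (differentiable w.r.t. its input) returning (B, T, 68, 2) coordinates.
MOUTH_IDX = list(range(48, 68))


def mouth_area_loss(pred_frames, gt_frames, detect_landmarks):
    """Penalize deviations of generated mouth key-points from the ground truth."""
    with torch.no_grad():
        gt_pts = detect_landmarks(gt_frames)[:, :, MOUTH_IDX]  # targets carry no gradient
    pred_pts = detect_landmarks(pred_frames)[:, :, MOUTH_IDX]  # gradients reach the generator
    return F.mse_loss(pred_pts, gt_pts)


def total_loss(recon_loss, pred_frames, gt_frames, detect_landmarks, lambda_mouth=0.1):
    """Step 4: token reconstruction loss plus a weighted mouth-area term."""
    return recon_loss + lambda_mouth * mouth_area_loss(pred_frames, gt_frames, detect_landmarks)
```

In practice the detector stays frozen and `lambda_mouth` is tuned so the auxiliary term improves synchronization without hurting overall visual fidelity, matching the balancing described in step 4.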

Results & Findings

| Metric | JoVA | Prior Unified (e.g., AV-Transformer) | Audio‑Driven (e.g., Wav2Lip) |
| --- | --- | --- | --- |
| Lip‑Sync Error (LSE‑C) ↓ | 0.12 | 0.21 | 0.18 |
| Speech Quality (PESQ) ↑ | 3.8 | 3.4 | 3.6 |
| Video FID ↓ | 45 | 58 | 62 |
| Inference Speed (fps) ↑ | 24 | 18 | 22 |

  • JoVA consistently reduces lip‑sync error while delivering comparable or better speech quality.
  • The unified architecture incurs ≈30 % less latency than cascaded pipelines because it avoids separate video‑generation and audio‑alignment stages.
  • Ablation studies confirm that the mouth‑area loss alone improves lip‑sync by ~35 % and that joint self‑attention outperforms naïve concatenation of modality‑specific transformers.

Practical Implications

  • Content creation tools: Developers can embed JoVA into video‑editing suites to auto‑generate realistic talking‑head avatars from text or audio, cutting down on manual lip‑sync work.
  • Virtual assistants & avatars: Real‑time generation of synchronized speech and facial expressions becomes feasible on consumer‑grade GPUs, enabling more natural human‑computer interaction.
  • Game development: Procedurally generated NPC dialogues with accurate lip movements can be produced on‑the‑fly, reducing the need for pre‑recorded animation assets.
  • Accessibility: Automatic dubbing of educational videos into multiple languages can retain visual fidelity, improving accessibility for non‑native speakers.
  • Simplified deployment: Because JoVA relies on a single transformer stack, it can be exported to ONNX/TensorRT or run on edge accelerators without stitching together multiple models.
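
On the last point, a single transformer stack can usually be pushed through the standard `torch.onnx.export` path once its forward pass is exposed as plain tensors in, plain tensors out. The snippet below is only a sketch: `JoVAExportWrapper`, the input shapes, and the output names are hypothetical placeholders, since the paper does not ship an export script.

```python
import torch
import torch.nn as nn


class JoVAExportWrapper(nn.Module):
    """Placeholder for the trained JoVA model, exposed as tensor-in / tensor-out.

    Hypothetical: a real wrapper would run the unified transformer and return
    generated video/audio tensors; a dummy op keeps this sketch runnable.
    """

    def forward(self, video, audio):
        return video * 1.0, audio * 1.0  # stand-in for the real forward pass


jova = JoVAExportWrapper().eval()
dummy_video = torch.randn(1, 16, 3, 256, 256)  # (batch, frames, channels, H, W)
dummy_audio = torch.randn(1, 80, 400)          # (batch, mel bins, time frames)

torch.onnx.export(
    jova,
    (dummy_video, dummy_audio),
    "jova.onnx",
    input_names=["video", "audio"],
    output_names=["video_out", "audio_out"],
    opset_version=17,
)
```

Autoregressive sampling loops generally stay outside the exported graph, so in practice such a wrapper would cover a single decoding step and the loop would run in the host runtime.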

Limitations & Future Work

  • Resolution & Duration: Experiments were limited to 256×256 video clips of at most 5 seconds; scaling to HD or longer clips will require memory‑efficient tokenization (e.g., hierarchical transformers).
  • Speaker Diversity: The training data focuses on a narrow set of faces; broader speaker identities and facial styles may need domain‑adaptation techniques.
  • Audio Fidelity Edge Cases: While PESQ scores are strong, extremely noisy or music‑heavy inputs still degrade performance.
  • Future directions proposed include:
    1. Integrating latent diffusion for higher‑resolution video,
    2. Multi‑speaker conditioning to handle dialogues, and
    3. Exploring lightweight adapters for on‑device inference.

Authors

  • Xiaohu Huang
  • Hao Zhou
  • Qiangpeng Yang
  • Shilei Wen
  • Kai Han

Paper Information

  • arXiv ID: 2512.13677v1
  • Categories: cs.CV
  • Published: December 15, 2025