[Paper] Klear: Unified Multi-Task Audio-Video Joint Generation

Published: January 7, 2026 at 01:03 PM EST
4 min read

Source: arXiv - 2601.04151v1

Overview

The paper presents Klear, a unified framework that can generate synchronized audio‑video content and also handle single‑modality tasks (audio‑only or video‑only). By redesigning the model architecture, training pipeline, and data collection process, the authors achieve tight lip‑speech alignment, high visual fidelity, and strong generalization—addressing long‑standing issues of asynchrony and unimodal degradation in existing generative systems.

Key Contributions

  • Single‑tower architecture with unified DiT (Diffusion Transformer) blocks and an Omni‑Full Attention mechanism that jointly processes audio, video, and text tokens, enabling tight cross‑modal alignment (a minimal attention sketch follows this list).
  • Progressive multitask training that randomly masks modalities and follows a multistage curriculum, preventing unimodal collapse and encouraging robust audio‑visual world knowledge.
  • Large‑scale dense‑caption dataset (first of its kind) built via an automated pipeline that annotates and filters millions of audio‑video‑caption triplets with strict temporal alignment.
  • Demonstrated state‑of‑the‑art performance across a suite of tasks (joint generation, audio‑only synthesis, video‑only synthesis, and instruction‑following) with results comparable to proprietary systems like Veo 3.
  • Scalable design that can be trained on massive datasets without sacrificing inference speed, thanks to the unified attention and diffusion backbone.
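
The single‑tower idea can be pictured as one transformer block whose self‑attention spans the concatenated audio, video, and text token streams. Below is a minimal PyTorch sketch of that pattern, not the authors' implementation: the `OmniFullAttentionBlock` name, the token dimensions, the plain `nn.MultiheadAttention` layer, and the omission of diffusion‑timestep conditioning are all simplifying assumptions.

```python
import torch
import torch.nn as nn

class OmniFullAttentionBlock(nn.Module):
    """Illustrative single-tower block: one self-attention pass over the
    concatenation of audio, video, and text tokens (sketch only)."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, audio_tok, video_tok, text_tok):
        # Concatenate all modalities into one sequence so every token can
        # attend to every other token (audio <-> video <-> text).
        x = torch.cat([audio_tok, video_tok, text_tok], dim=1)
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out
        x = x + self.mlp(self.norm2(x))
        # Split back into per-modality streams.
        na, nv = audio_tok.shape[1], video_tok.shape[1]
        return x[:, :na], x[:, na:na + nv], x[:, na + nv:]

# Example: batch of 2 clips, 80 audio / 256 video / 32 text tokens.
block = OmniFullAttentionBlock()
a, v, t = torch.randn(2, 80, 512), torch.randn(2, 256, 512), torch.randn(2, 32, 512)
a_out, v_out, t_out = block(a, v, t)
```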

Methodology

  1. Model Design – Klear treats audio, video frames, and textual prompts as a single sequence of tokens. The DiT blocks process this sequence with Omni‑Full Attention, which computes full self‑attention across all modalities at each layer, so that audio cues directly influence video generation (e.g., lip movements) and vice versa.
  2. Training Regime
    • Random Modality Masking: At each training step, one or more modalities are masked out, forcing the model to reconstruct the missing parts from the remaining signals; this yields a single model capable of both joint and unimodal generation (a training‑step sketch follows this list).
    • Curriculum Stages: Training progresses from easy (high‑quality, well‑aligned clips) to harder examples (noisy, out‑of‑distribution data), gradually expanding the model’s robustness.
  3. Data Curation – An automated pipeline scrapes public video platforms, runs speech‑to‑text and visual captioning models, then applies strict temporal‑alignment checks and quality filters (see the alignment sketch below). The result is a multimodal dataset with dense captions (sentence‑level descriptions for each short video segment), providing rich supervision for both semantics and timing.
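
To make the masking idea concrete, here is a minimal sketch of how a single training step might randomly drop modalities before computing a diffusion‑style loss. The task probabilities, the zero‑masking choice, the `model(noisy_audio, noisy_video, text, t)` interface, and the simplified rectified‑flow objective are illustrative assumptions, not the paper's actual recipe.

```python
import random
import torch
import torch.nn.functional as F

def sample_task() -> str:
    """Randomly pick which modalities to generate this step (illustrative
    probabilities; the paper schedules tasks over curriculum stages)."""
    return random.choices(
        ["joint", "video_only", "audio_only"], weights=[0.6, 0.2, 0.2]
    )[0]

def training_step(model, audio_lat, video_lat, text_emb, optimizer):
    """One masked-modality diffusion step (sketch). `model` is assumed to map
    (noisy_audio, noisy_video, text, t) -> (audio_pred, video_pred)."""
    task = sample_task()
    # Mask the modality that is not being generated; zeroing its latent is
    # one simple masking choice among several possible schemes.
    if task == "video_only":
        audio_lat = torch.zeros_like(audio_lat)
    elif task == "audio_only":
        video_lat = torch.zeros_like(video_lat)

    # Simplified rectified-flow objective: interpolate toward noise and
    # predict the velocity (noise - data).
    b = video_lat.shape[0]
    t = torch.rand(b, device=video_lat.device).view(b, 1, 1)
    noise_a, noise_v = torch.randn_like(audio_lat), torch.randn_like(video_lat)
    noisy_a = (1 - t) * audio_lat + t * noise_a
    noisy_v = (1 - t) * video_lat + t * noise_v

    pred_a, pred_v = model(noisy_a, noisy_v, text_emb, t.flatten())
    loss = 0.0
    if task in ("joint", "audio_only"):
        loss = loss + F.mse_loss(pred_a, noise_a - audio_lat)
    if task in ("joint", "video_only"):
        loss = loss + F.mse_loss(pred_v, noise_v - video_lat)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return task, loss.item()
```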
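
The temporal‑alignment filtering can likewise be approximated by a simple overlap check between a caption's claimed time span and the speech segments an ASR model actually detected. The `caption_is_aligned` function and its 0.8 coverage threshold below are hypothetical illustrations; the paper's pipeline combines this kind of check with additional quality filters.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec)

def temporal_overlap(a: Segment, b: Segment) -> float:
    """Length in seconds of the overlap between two time intervals."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def caption_is_aligned(caption_span: Segment,
                       asr_word_spans: List[Segment],
                       min_coverage: float = 0.8) -> bool:
    """Keep a caption only if most of its time span is covered by speech
    that the ASR model actually detected (illustrative criterion)."""
    span_len = caption_span[1] - caption_span[0]
    if span_len <= 0:
        return False
    covered = sum(temporal_overlap(caption_span, w) for w in asr_word_spans)
    return covered / span_len >= min_coverage

# Example: a 2-second caption segment covered by three ASR word spans.
print(caption_is_aligned((10.0, 12.0), [(10.1, 10.6), (10.7, 11.4), (11.5, 12.0)]))
```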

Results & Findings

  • Audio‑Video Synchrony: Measured lip‑reading error rates drop by >30 % compared to prior open‑source baselines, indicating near‑human alignment.
  • Visual Fidelity: FID scores improve (drop) by 0.12 points on standard video‑synthesis benchmarks, while fine‑grained details (e.g., facial expressions) are preserved.
  • Instruction Following: On a newly introduced multimodal instruction benchmark, Klear achieves a 45 % higher success rate than the best existing open model, matching the performance of the commercial Veo 3 system.
  • Generalization: When evaluated on out‑of‑distribution domains (e.g., animated cartoons, low‑light footage), Klear maintains >80 % of its in‑domain performance, demonstrating the effectiveness of the curriculum and large‑scale data.

Practical Implications

  • Content Creation Pipelines: Developers can integrate Klear into video‑editing tools to auto‑generate synchronized voice‑overs or dub existing footage without manual lip‑sync work.
  • Interactive Media & Games: Real‑time generation of character speech and facial animation becomes feasible, reducing the need for pre‑recorded assets and enabling dynamic NPC dialogue.
  • Accessibility: Automatic generation of sign‑language videos from audio or captioned video can improve accessibility services.
  • Multimodal Assistants: Voice‑enabled agents could produce short explanatory videos on the fly, using a single model for both audio narration and visual illustration.
  • Scalable Training: The unified architecture and data pipeline provide a blueprint for other teams to build large‑scale multimodal generative models without stitching together separate audio and video networks.

Limitations & Future Work

  • Compute Requirements: Training Klear still demands multi‑GPU clusters and extensive diffusion steps, which may be prohibitive for smaller labs.
  • Dataset Bias: Although the data pipeline filters for quality, the source videos inherit cultural and language biases that can affect generation fairness.
  • Temporal Resolution: Very fast speech or rapid scene cuts can still cause minor misalignments; finer‑grained temporal modeling is an open challenge.
  • Future Directions: The authors suggest exploring more efficient diffusion samplers, incorporating explicit phoneme‑to‑viseme mappings for even tighter lip sync, and extending the dense‑caption dataset to cover more languages and domains.

Authors

  • Jun Wang
  • Chunyu Qiang
  • Yuxin Guo
  • Yiran Wang
  • Xijuan Zeng
  • Chen Zhang
  • Pengfei Wan

Paper Information

  • arXiv ID: 2601.04151v1
  • Categories: cs.CV, cs.AI, cs.MM, cs.SD
  • Published: January 7, 2026