[Paper] LTX-2: Efficient Joint Audio-Visual Foundation Model

Published: January 6, 2026 at 01:24 PM EST
4 min read
Source: arXiv - 2601.03233v1

Overview

LTX‑2 is an open‑source, large‑scale foundation model that generates synchronized video and audio from a single text prompt. By coupling a 14 B‑parameter video transformer with a 5 B‑parameter audio transformer through cross‑attention, the system produces cinematic‑quality clips whose soundtrack follows characters, ambience, and emotional tone, a capability that today's text‑to‑video diffusion models generally lack.

Key Contributions

  • Unified audiovisual diffusion architecture: asymmetric dual‑stream transformer (video ≫ audio) linked by bidirectional cross‑attention and shared timestep conditioning.
  • Modality‑aware classifier‑free guidance (modality‑CFG): lets users balance visual fidelity vs. audio fidelity on the fly.
  • Multilingual text encoder: broadens prompt comprehension beyond English.
  • Efficient training & inference: achieves state‑of‑the‑art quality at a fraction of the compute cost of proprietary systems.
  • Open‑source release: full model weights, training scripts, and inference pipelines are publicly available.

Methodology

  1. Dual‑stream transformer – Two separate transformer stacks process video and audio tokens. The video stream (14 B parameters) gets the bulk of capacity because visual generation is more compute‑intensive; the audio stream (5 B) focuses on high‑fidelity sound.
  2. Cross‑modal attention – At each diffusion timestep, video tokens attend to audio tokens and vice versa, allowing the model to align lip movements, environmental sounds, and musical cues. Temporal positional embeddings ensure that the attention respects the chronological order of frames and audio samples.
  3. Shared timestep conditioning – Both streams receive the same diffusion timestep embedding via an AdaLN (adaptive layer‑norm) module, guaranteeing that video and audio evolve synchronously (a minimal sketch of this block structure follows the list below).
  4. Multilingual text encoder – A pre‑trained multilingual encoder (e.g., XLM‑R) converts the user prompt into a language‑agnostic embedding that drives both streams.
  5. Modality‑CFG – Extends classifier‑free guidance by applying separate guidance scales for video and audio, giving developers fine‑grained control over the trade‑off between visual detail and audio realism.
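The block structure described in steps 1–3 can be pictured with a short PyTorch sketch. This is a minimal illustration under assumed dimensions and module names (AdaLN, DualStreamBlock, the cross‑stream projections, and all hyperparameters are hypothetical), not the released LTX‑2 implementation:

```python
# Minimal sketch of one dual-stream block: self-attention per modality,
# bidirectional cross-attention between them, and shared timestep conditioning.
import torch
import torch.nn as nn

class AdaLN(nn.Module):
    """Adaptive layer norm: scale/shift predicted from the shared timestep embedding."""
    def __init__(self, dim, t_dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.proj = nn.Linear(t_dim, 2 * dim)

    def forward(self, x, t_emb):
        scale, shift = self.proj(t_emb).chunk(2, dim=-1)
        return self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)

class DualStreamBlock(nn.Module):
    """Video and audio streams with asymmetric widths, linked by cross-attention."""
    def __init__(self, v_dim=1024, a_dim=512, t_dim=256, heads=8):
        super().__init__()
        self.v_norm, self.a_norm = AdaLN(v_dim, t_dim), AdaLN(a_dim, t_dim)
        self.v_self = nn.MultiheadAttention(v_dim, heads, batch_first=True)
        self.a_self = nn.MultiheadAttention(a_dim, heads, batch_first=True)
        # Project each stream's tokens into the other's width for cross-attention.
        self.a_to_v = nn.Linear(a_dim, v_dim)
        self.v_to_a = nn.Linear(v_dim, a_dim)
        self.v_cross = nn.MultiheadAttention(v_dim, heads, batch_first=True)
        self.a_cross = nn.MultiheadAttention(a_dim, heads, batch_first=True)

    def forward(self, v, a, t_emb):
        # Both streams are modulated by the SAME timestep embedding (shared AdaLN).
        v_n, a_n = self.v_norm(v, t_emb), self.a_norm(a, t_emb)
        v = v + self.v_self(v_n, v_n, v_n, need_weights=False)[0]
        a = a + self.a_self(a_n, a_n, a_n, need_weights=False)[0]
        a_kv, v_kv = self.a_to_v(a), self.v_to_a(v)
        v = v + self.v_cross(v, a_kv, a_kv, need_weights=False)[0]  # video attends to audio
        a = a + self.a_cross(a, v_kv, v_kv, need_weights=False)[0]  # audio attends to video
        return v, a
```

The asymmetric widths (video wider than audio) mirror the paper's capacity allocation; the specific values here are placeholders.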

Training uses a large curated dataset of paired video‑audio clips (≈2 M samples) with diffusion noise schedules applied jointly to both modalities. The loss is a weighted sum of video and audio reconstruction errors, encouraging tight audiovisual coupling.
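A hedged sketch of how such a joint objective could look, assuming a rectified‑flow‑style noising step, a velocity target, and placeholder loss weights (none of these specifics are taken from the paper):

```python
# Joint denoising loss: one shared timestep per sample, weighted sum of
# per-modality reconstruction errors. Weights and noise scheme are assumptions.
import torch
import torch.nn.functional as F

def joint_diffusion_loss(model, video_latents, audio_latents, text_emb,
                         w_video=1.0, w_audio=0.5):
    b = video_latents.shape[0]
    t = torch.rand(b, device=video_latents.device)        # shared timestep per sample
    noise_v = torch.randn_like(video_latents)
    noise_a = torch.randn_like(audio_latents)
    tv = t.view(b, *([1] * (video_latents.dim() - 1)))
    ta = t.view(b, *([1] * (audio_latents.dim() - 1)))
    noisy_v = (1 - tv) * video_latents + tv * noise_v      # linear interpolation noising
    noisy_a = (1 - ta) * audio_latents + ta * noise_a
    pred_v, pred_a = model(noisy_v, noisy_a, t, text_emb)  # both streams see the same t
    loss_v = F.mse_loss(pred_v, noise_v - video_latents)   # velocity target (assumed)
    loss_a = F.mse_loss(pred_a, noise_a - audio_latents)
    return w_video * loss_v + w_audio * loss_a
```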

Results & Findings

  • Quantitative: LTX‑2 outperforms all open‑source baselines on standard audiovisual metrics (e.g., FVD for video, Fréchet Audio Distance for sound) and narrows the gap to commercial systems by ~15 % on average.
  • Qualitative: Generated clips exhibit coherent speech that matches on‑screen lip movements, realistic ambient sounds (rain, crowd chatter), and stylistic audio cues (e.g., horror‑style drones) that align with visual mood.
  • Efficiency: Inference time is ~2× faster than comparable proprietary models, and GPU memory usage is reduced by ~30 % thanks to the asymmetric design.
  • Control: Modality‑CFG lets users prioritize visual fidelity (e.g., crisp action scenes) while keeping audio intelligible, or vice versa for audio‑centric applications like podcast video generation (see the sketch below).
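The guidance scheme can be illustrated with a small sketch; the function signature, default scales, and the null‑prompt convention are assumptions rather than the paper's exact formulation:

```python
# Modality-aware classifier-free guidance: separate guidance scales for the
# video and audio predictions, extrapolated from conditional vs. unconditional passes.
def modality_cfg(model, noisy_v, noisy_a, t, text_emb, null_emb,
                 scale_video=7.0, scale_audio=4.0):
    # One pass conditioned on the prompt, one on a "null" (empty) prompt.
    cond_v, cond_a = model(noisy_v, noisy_a, t, text_emb)
    uncond_v, uncond_a = model(noisy_v, noisy_a, t, null_emb)
    # Each modality gets its own scale, so visual detail and audio realism
    # can be traded off independently at inference time.
    guided_v = uncond_v + scale_video * (cond_v - uncond_v)
    guided_a = uncond_a + scale_audio * (cond_a - uncond_a)
    return guided_v, guided_a
```

As with standard classifier‑free guidance, a larger scale pushes that modality harder toward the prompt; here the two scales are simply decoupled.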

Practical Implications

  • Content creation pipelines – Video editors and indie developers can generate fully‑fledged video ads, explainer clips, or game cutscenes without hiring separate voice‑over artists or sound designers.
  • Multilingual media – The multilingual encoder makes it easy to produce localized videos with native‑language narration and culturally appropriate soundscapes.
  • Rapid prototyping – Teams can iterate on storyboards by swapping prompts and instantly seeing synchronized audiovisual results, cutting pre‑production time.
  • Accessibility tools – Automatic generation of descriptive audio tracks for the visually impaired becomes feasible at scale.
  • Edge deployment – Because the model is more compute‑efficient, it can be fine‑tuned or distilled for on‑device applications (e.g., AR/VR experiences with live audiovisual synthesis).

Limitations & Future Work

  • Audio fidelity ceiling – While impressive, the 5 B audio stream still lags behind dedicated speech synthesis models on nuanced prosody and high‑frequency details.
  • Dataset bias – Training data is skewed toward Western media; prompts involving non‑Western cultural contexts sometimes yield mismatched sound effects.
  • Temporal length – Current implementation handles clips up to ~10 seconds; longer narratives require segment stitching or hierarchical diffusion.
  • Future directions – The authors propose scaling the audio stream, incorporating explicit music generation modules, and extending the model to support interactive conditioning (e.g., real‑time user sketches).

LTX‑2 demonstrates that a single, well‑engineered foundation model can bridge the long‑standing gap between video generation and audio synthesis, opening new avenues for developers to build richer, more immersive media experiences with far less manual effort.

Authors

  • Yoav HaCohen
  • Benny Brazowski
  • Nisan Chiprut
  • Yaki Bitterman
  • Andrew Kvochko
  • Avishai Berkowitz
  • Daniel Shalem
  • Daphna Lifschitz
  • Dudu Moshe
  • Eitan Porat
  • Eitan Richardson
  • Guy Shiran
  • Itay Chachy
  • Jonathan Chetboun
  • Michael Finkelson
  • Michael Kupchick
  • Nir Zabari
  • Nitzan Guetta
  • Noa Kotler
  • Ofir Bibi
  • Ori Gordon
  • Poriya Panet
  • Roi Benita
  • Shahar Armon
  • Victor Kulikov
  • Yaron Inger
  • Yonatan Shiftan
  • Zeev Melumian
  • Zeev Farbman

Paper Information

  • arXiv ID: 2601.03233v1
  • Categories: cs.CV
  • Published: January 6, 2026