[Paper] TV2TV: A Unified Framework for Interleaved Language and Video Generation
Source: arXiv - 2512.05103v1
Overview
The paper introduces TV2TV, a new “omni” video‑text model that treats video generation as a back‑and‑forth dialogue between language and pixels. By letting a language model “think in words” before the visual model “acts in pixels,” TV2TV produces higher‑quality, more controllable videos and can be steered with textual prompts at any point in the generation process.
Key Contributions
- Interleaved generation framework – a single model that alternates between next‑token (text) prediction and next‑frame (video flow‑matching) prediction.
- Mixture‑of‑Transformers (MoT) architecture – separate language‑modeling and video‑modeling towers that share a common latent space and are trained jointly.
- Dynamic switching policy – at inference time the model decides autonomously when to generate text versus video, enabling on‑the‑fly reasoning.
- Fine‑grained textual control – users can inject or edit textual instructions mid‑generation to reshape the video trajectory.
- Scalable training on mixed data – combines synthetic game footage with automatically generated action captions and real‑world sports videos paired with VLM‑derived descriptions.
- Empirical gains – substantial improvements in visual fidelity (measured by FVD/IS) and prompt alignment (CLIP‑Score) over strong baselines on both game and natural video benchmarks.
Methodology
- Data preparation – The authors build two corpora:
- (a) a video‑game dataset where each frame sequence is paired with human‑written action captions,
- (b) a large collection of sports clips automatically annotated with natural‑language descriptions using a vision‑language model (the resulting interleaved text/video format is sketched in code after this list).
- Model design –
- Language tower: a standard decoder‑only transformer trained to predict the next token given previous tokens and a latent video context.
- Video tower: a flow‑matching diffusion model that predicts the next video frame conditioned on past frames and the current textual embedding.
- Mixture‑of‑Transformers: a gating network learns to route the hidden state to either tower at each step, effectively deciding “think” vs. “act.”
- Joint training – Both towers share a common embedding space and are optimized together with a combined loss: cross‑entropy for text and flow‑matching loss for video (sketched in code after this list).
- Inference algorithm – Starting from an initial prompt, the model iteratively samples either a token or a frame. A learned policy (a lightweight classifier on the hidden state) triggers a switch when the language tower signals that a new high‑level concept is needed (see the sampling‑loop sketch after this list).
- Control interface – Developers can intervene by inserting custom tokens at any generation step, which immediately influences subsequent frame predictions.
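The interleaved data format described above can be pictured as a sequence of alternating text and video segments. The sketch below is a minimal illustration, assuming a hypothetical `caption_fn` that stands in for either the game‑action captions or the VLM captioner; it is not the authors' actual data pipeline.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

import torch

@dataclass
class InterleavedExample:
    """One training example: alternating ("text", caption) and ("video", frames) segments."""
    segments: List[Tuple[str, object]]

def build_interleaved_examples(clips: List[List[torch.Tensor]],
                               caption_fn: Callable[[torch.Tensor], str]) -> List[InterleavedExample]:
    """Pair each run of frames with a caption produced by `caption_fn`.

    `caption_fn` is a placeholder for either the game-action captions or a
    vision-language captioner for sports footage (an assumption, not the
    paper's exact pipeline).
    """
    examples = []
    for clip in clips:                       # a clip is a list of frame chunks
        segments: List[Tuple[str, object]] = []
        for chunk in clip:                   # chunk: tensor of shape (frames, C, H, W)
            segments.append(("text", caption_fn(chunk)))
            segments.append(("video", chunk))
        examples.append(InterleavedExample(segments))
    return examples

# Toy usage: two clips of random "frames" captioned by a dummy function.
toy_clips = [[torch.randn(4, 3, 64, 64)] for _ in range(2)]
examples = build_interleaved_examples(toy_clips, lambda chunk: f"{chunk.shape[0]} frames of action")
```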
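The combined training objective (cross‑entropy for text, flow‑matching for video) can be sketched as follows. This is a minimal PyTorch illustration under common flow‑matching conventions (linear interpolation path, velocity target = data − noise) with equal loss weighting; the exact parameterization and weighting are assumptions, not values from the paper.

```python
import torch
import torch.nn.functional as F

def flow_matching_pair(frame_latents: torch.Tensor, noise: torch.Tensor, t: torch.Tensor):
    """Linear interpolation x_t between noise and data, and the velocity target.

    t has shape (batch, 1, 1, 1) so it broadcasts over the latent dimensions.
    """
    x_t = (1.0 - t) * noise + t * frame_latents
    velocity_target = frame_latents - noise
    return x_t, velocity_target

def combined_loss(text_logits, text_targets, pred_velocity, velocity_target, fm_weight=1.0):
    """Cross-entropy on next-token prediction plus flow-matching MSE on the next frame.

    Equal weighting (fm_weight=1.0) is an assumption, not a value from the paper.
    """
    ce = F.cross_entropy(text_logits.reshape(-1, text_logits.size(-1)), text_targets.reshape(-1))
    fm = F.mse_loss(pred_velocity, velocity_target)
    return ce + fm_weight * fm

# Toy tensors standing in for the language tower's logits and the video tower's
# velocity prediction on x_t; in the real model both towers share an embedding space.
B, T, V = 2, 16, 32000          # batch, text length, vocab size
C, H, W = 4, 32, 32             # latent channels / spatial size of one frame
text_logits = torch.randn(B, T, V)
text_targets = torch.randint(0, V, (B, T))
frame_latents = torch.randn(B, C, H, W)
noise = torch.randn_like(frame_latents)
t = torch.rand(B, 1, 1, 1)
x_t, velocity_target = flow_matching_pair(frame_latents, noise, t)
pred_velocity = torch.randn_like(velocity_target)   # the video tower would predict this from x_t
print(float(combined_loss(text_logits, text_targets, pred_velocity, velocity_target)))
```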
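The sampling loop and the mid‑generation control interface can be sketched together. In the sketch below, `language_step`, `video_step`, and `switch_policy` are placeholder callables for the two towers and the lightweight switching classifier; the per‑step intervention dictionary and the simple boolean policy are illustrative assumptions, not the paper's exact mechanism.

```python
import torch

def generate_interleaved(prompt_tokens, language_step, video_step, switch_policy,
                         max_steps=64, interventions=None):
    """Alternate between next-token and next-frame generation.

    language_step(tokens, frames) -> (next_token_id, hidden_state)
    video_step(tokens, frames)    -> (next_frame_latent, hidden_state)
    switch_policy(hidden, mode)   -> True if the model should switch modality
    interventions: optional {step_index: [token_ids]} injected mid-generation,
    mirroring the control interface described above. All callables are placeholders.
    """
    interventions = dict(interventions or {})
    tokens, frames = list(prompt_tokens), []
    mode = "text"                              # begin by "thinking in words"

    for step in range(max_steps):
        if step in interventions:              # user edits take effect immediately
            tokens.extend(interventions[step])

        if mode == "text":
            token, hidden = language_step(tokens, frames)
            tokens.append(token)
        else:
            frame, hidden = video_step(tokens, frames)
            frames.append(frame)

        # A lightweight classifier on the hidden state decides "think" vs. "act".
        if switch_policy(hidden, mode):
            mode = "video" if mode == "text" else "text"

    return tokens, frames

# Toy stand-ins just to show the calling convention.
lang = lambda toks, frs: (0, torch.randn(8))
vid = lambda toks, frs: (torch.randn(4, 32, 32), torch.randn(8))
policy = lambda hidden, mode: bool(torch.rand(()) < 0.3)
tokens, frames = generate_interleaved([1, 2, 3], lang, vid, policy,
                                      interventions={20: [42, 43]})
```

In this framing, a corrective instruction such as "the car should turn left" is simply an entry in `interventions`, which then conditions every subsequent frame prediction.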
Results & Findings
| Dataset | Metric (arrow = direction of improvement) | TV2TV | Prior SOTA |
|---|---|---|---|
| Game‑play (synthetic) | FVD ↓ | 45 | 78 |
| Game‑play | CLIP‑Score ↑ | 0.71 | 0.58 |
| Sports (real) | IS ↑ | 12.4 | 9.1 |
| Sports | Prompt‑Alignment (BLEU‑4) ↑ | 0.34 | 0.22 |
- Visual quality: TV2TV reduces the Fréchet Video Distance (FVD) by ~40 % on synthetic data and improves Inception Score on real videos, indicating sharper, more coherent frames.
- Prompt alignment: The interleaved language step yields higher CLIP‑Score and BLEU‑4, meaning the generated video follows the textual description more faithfully.
- Control experiments: Inserting a single corrective sentence (“the car should turn left”) halfway through generation reliably altered the trajectory without degrading visual quality.
- Ablation: Removing the dynamic switching policy (forcing a fixed text‑then‑video schedule) worsens both FVD and the alignment scores, confirming the importance of on‑the‑fly reasoning.
Practical Implications
- Content creation pipelines – Game studios or advertising teams can generate prototype cut‑scenes by writing a script and letting TV2TV flesh out the visuals, dramatically cutting iteration time.
- Interactive media – Developers of VR/AR experiences could let users type or speak commands that instantly reshape ongoing video streams, enabling “text‑driven gameplay.”
- Data augmentation – Synthetic video data with aligned captions can be produced at scale for training downstream vision‑language models, reducing the need for costly manual annotation.
- Fine‑grained editing – Existing video assets can be edited by inserting textual patches (e.g., “add a rainstorm here”), offering a new workflow for post‑production.
- Open‑ended AI agents – The architecture demonstrates a viable path toward agents that plan actions in language before executing them visually, useful for robotics simulators or autonomous vehicle scenario generation.
Limitations & Future Work
- Scalability to long videos – The current model handles clips up to ~8 seconds; extending the horizon may require hierarchical planning or memory‑efficient transformers.
- Reliance on caption quality – For natural videos, the VLM‑generated descriptions can be noisy, which can propagate errors into the video tower.
- Compute cost – Joint training of two large transformers with flow‑matching diffusion is resource‑intensive, limiting accessibility for smaller labs.
- User control granularity – While textual interventions work, more precise spatial control (e.g., specifying object locations) is not yet supported.
Future research directions include hierarchical interleaving (scene‑level language → shot‑level video), multimodal conditioning with audio, and lightweight distillation techniques to bring TV2TV to edge devices.
Authors
- Xiaochuang Han
- Youssef Emad
- Melissa Hall
- John Nguyen
- Karthik Padthe
- Liam Robbins
- Amir Bar
- Delong Chen
- Michal Drozdzal
- Maha Elbayad
- Yushi Hu
- Shang‑Wen Li
- Sreya Dutta Roy
- Jakob Verbeek
- XuDong Wang
- Marjan Ghazvininejad
- Luke Zettlemoyer
- Emily Dinan
Paper Information
- arXiv ID: 2512.05103v1
- Categories: cs.LG, cs.AI, cs.CV
- Published: December 4, 2025