[Paper] TV2TV: A Unified Framework for Interleaved Language and Video Generation
Source: arXiv - 2512.05103v1
Overview
The paper introduces TV2TV, a new “omni” video‑text model that treats video generation as a back‑and‑forth dialogue between language and pixels. By letting a language model “think in words” before the visual model “acts in pixels,” TV2TV produces higher‑quality, more controllable videos and can be steered with textual prompts at any point in the generation process.
Key Contributions
- Interleaved generation framework – a single model that alternates between next‑token (text) prediction and next‑frame (video flow‑matching) prediction.
- Mixture‑of‑Transformers (MoT) architecture – separate language‑modeling and video‑modeling towers that share a common latent space and are trained jointly.
- Dynamic switching policy – at inference time the model decides autonomously when to generate text versus video, enabling on‑the‑fly reasoning.
- Fine‑grained textual control – users can inject or edit textual instructions mid‑generation to reshape the video trajectory.
- Scalable training on mixed data – combines synthetic game footage with automatically generated action captions and real‑world sports videos paired with VLM‑derived descriptions.
- Empirical gains – substantial improvements in visual fidelity (measured by FVD/IS) and prompt alignment (CLIP‑Score) over strong baselines on both game and natural video benchmarks.
Methodology
- Data preparation – The authors build two corpora:
- (a) a video‑game dataset where each frame sequence is paired with human‑written action captions,
- (b) a large collection of sports clips automatically annotated with natural‑language descriptions using a vision‑language model (the resulting interleaved text/video format is sketched in code after this list).
- Model design –
- Language tower: a standard decoder‑only transformer trained to predict the next token given previous tokens and a latent video context.
- Video tower: a flow‑matching diffusion model that predicts the next video frame conditioned on past frames and the current textual embedding.
- Mixture‑of‑Transformers: a gating network learns to route the hidden state to either tower at each step, effectively deciding “think” vs. “act.”
- Joint training – Both towers share a common embedding space and are optimized together with a combined loss: cross‑entropy for text and flow‑matching loss for video (sketched in code after this list).
- Inference algorithm – Starting from an initial prompt, the model iteratively samples either a token or a frame. A learned policy (a lightweight classifier on the hidden state) triggers a switch when the language tower signals that a new high‑level concept is needed (see the sampling‑loop sketch after this list).
- Control interface – Developers can intervene by inserting custom tokens at any generation step, which immediately influences subsequent frame predictions.
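The interleaved data format described above can be pictured as a sequence of alternating text and video segments. The sketch below is a minimal illustration, assuming a hypothetical `caption_fn` that stands in for either the game‑action captions or the VLM captioner; it is not the authors' actual data pipeline.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

import torch

@dataclass
class InterleavedExample:
    """One training example: alternating ("text", caption) and ("video", frames) segments."""
    segments: List[Tuple[str, object]]

def build_interleaved_examples(clips: List[List[torch.Tensor]],
                               caption_fn: Callable[[torch.Tensor], str]) -> List[InterleavedExample]:
    """Pair each run of frames with a caption produced by `caption_fn`.

    `caption_fn` is a placeholder for either the game-action captions or a
    vision-language captioner for sports footage (an assumption, not the
    paper's exact pipeline).
    """
    examples = []
    for clip in clips:                       # a clip is a list of frame chunks
        segments: List[Tuple[str, object]] = []
        for chunk in clip:                   # chunk: tensor of shape (frames, C, H, W)
            segments.append(("text", caption_fn(chunk)))
            segments.append(("video", chunk))
        examples.append(InterleavedExample(segments))
    return examples

# Toy usage: two clips of random "frames" captioned by a dummy function.
toy_clips = [[torch.randn(4, 3, 64, 64)] for _ in range(2)]
examples = build_interleaved_examples(toy_clips, lambda chunk: f"{chunk.shape[0]} frames of action")
```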
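The combined training objective (cross‑entropy for text, flow‑matching for video) can be sketched as follows. This is a minimal PyTorch illustration under common flow‑matching conventions (linear interpolation path, velocity target = data − noise) with equal loss weighting; the exact parameterization and weighting are assumptions, not values from the paper.

```python
import torch
import torch.nn.functional as F

def flow_matching_pair(frame_latents: torch.Tensor, noise: torch.Tensor, t: torch.Tensor):
    """Linear interpolation x_t between noise and data, and the velocity target.

    t has shape (batch, 1, 1, 1) so it broadcasts over the latent dimensions.
    """
    x_t = (1.0 - t) * noise + t * frame_latents
    velocity_target = frame_latents - noise
    return x_t, velocity_target

def combined_loss(text_logits, text_targets, pred_velocity, velocity_target, fm_weight=1.0):
    """Cross-entropy on next-token prediction plus flow-matching MSE on the next frame.

    Equal weighting (fm_weight=1.0) is an assumption, not a value from the paper.
    """
    ce = F.cross_entropy(text_logits.reshape(-1, text_logits.size(-1)), text_targets.reshape(-1))
    fm = F.mse_loss(pred_velocity, velocity_target)
    return ce + fm_weight * fm

# Toy tensors standing in for the language tower's logits and the video tower's
# velocity prediction on x_t; in the real model both towers share an embedding space.
B, T, V = 2, 16, 32000          # batch, text length, vocab size
C, H, W = 4, 32, 32             # latent channels / spatial size of one frame
text_logits = torch.randn(B, T, V)
text_targets = torch.randint(0, V, (B, T))
frame_latents = torch.randn(B, C, H, W)
noise = torch.randn_like(frame_latents)
t = torch.rand(B, 1, 1, 1)
x_t, velocity_target = flow_matching_pair(frame_latents, noise, t)
pred_velocity = torch.randn_like(velocity_target)   # the video tower would predict this from x_t
print(float(combined_loss(text_logits, text_targets, pred_velocity, velocity_target)))
```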
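The sampling loop and the mid‑generation control interface can be sketched together. In the sketch below, `language_step`, `video_step`, and `switch_policy` are placeholder callables for the two towers and the lightweight switching classifier; the per‑step intervention dictionary and the simple boolean policy are illustrative assumptions, not the paper's exact mechanism.

```python
import torch

def generate_interleaved(prompt_tokens, language_step, video_step, switch_policy,
                         max_steps=64, interventions=None):
    """Alternate between next-token and next-frame generation.

    language_step(tokens, frames) -> (next_token_id, hidden_state)
    video_step(tokens, frames)    -> (next_frame_latent, hidden_state)
    switch_policy(hidden, mode)   -> True if the model should switch modality
    interventions: optional {step_index: [token_ids]} injected mid-generation,
    mirroring the control interface described above. All callables are placeholders.
    """
    interventions = dict(interventions or {})
    tokens, frames = list(prompt_tokens), []
    mode = "text"                              # begin by "thinking in words"

    for step in range(max_steps):
        if step in interventions:              # user edits take effect immediately
            tokens.extend(interventions[step])

        if mode == "text":
            token, hidden = language_step(tokens, frames)
            tokens.append(token)
        else:
            frame, hidden = video_step(tokens, frames)
            frames.append(frame)

        # A lightweight classifier on the hidden state decides "think" vs. "act".
        if switch_policy(hidden, mode):
            mode = "video" if mode == "text" else "text"

    return tokens, frames

# Toy stand-ins just to show the calling convention.
lang = lambda toks, frs: (0, torch.randn(8))
vid = lambda toks, frs: (torch.randn(4, 32, 32), torch.randn(8))
policy = lambda hidden, mode: bool(torch.rand(()) < 0.3)
tokens, frames = generate_interleaved([1, 2, 3], lang, vid, policy,
                                      interventions={20: [42, 43]})
```

In this framing, a corrective instruction such as "the car should turn left" is simply an entry in `interventions`, which then conditions every subsequent frame prediction.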
Results & Findings
| Dataset | Metric (arrow = direction of improvement) | TV2TV | Prior SOTA |
|---|---|---|---|
| Game‑play (synthetic) | FVD ↓ | 45 | 78 |
| Game‑play | CLIP‑Score ↑ | 0.71 | 0.58 |
| Sports (real) | IS ↑ | 12.4 | 9.1 |
| Sports | Prompt‑Alignment (BLEU‑4) ↑ | 0.34 | 0.22 |
- Visual quality: TV2TV reduces the Fréchet Video Distance (FVD) by ~40 % on synthetic data and improves Inception Score on real videos, indicating sharper, more coherent frames.
- Prompt alignment: The interleaved language step yields higher CLIP‑Score and BLEU‑4, meaning the generated video follows the textual description more faithfully.
- Control experiments: Inserting a single corrective sentence (“the car should turn left”) halfway through generation reliably altered the trajectory without degrading visual quality.
- Ablation: Removing the dynamic switching policy (forcing a fixed text‑then‑video schedule) worsens both FVD and the alignment scores, confirming the importance of on‑the‑fly reasoning.
Practical Implications
- Content creation pipelines – Game studios or advertising teams can generate prototype cut‑scenes by writing a script and letting TV2TV flesh out the visuals, dramatically cutting iteration time.
- Interactive media – Developers of VR/AR experiences could let users type or speak commands that instantly reshape ongoing video streams, enabling “text‑driven gameplay.”
- Data augmentation – Synthetic video data with aligned captions can be produced at scale for training downstream vision‑language models, reducing the need for costly manual annotation.
- Fine‑grained editing – Existing video assets can be edited by inserting textual patches (e.g., “add a rainstorm here”), offering a new workflow for post‑production.
- Open‑ended AI agents – The architecture demonstrates a viable path toward agents that plan actions in language before executing them visually, useful for robotics simulators or autonomous vehicle scenario generation.
Limitations & Future Work
- Scalability to long videos – The current model handles clips up to ~8 seconds; extending the horizon may require hierarchical planning or memory‑efficient transformers.
- Reliance on caption quality – For natural videos, the VLM‑generated descriptions can be noisy, which can propagate errors into the video tower.
- Compute cost – Joint training of two large transformers with flow‑matching diffusion is resource‑intensive, limiting accessibility for smaller labs.
- User control granularity – While textual interventions work, more precise spatial control (e.g., specifying object locations) is not yet supported.
Future research directions include hierarchical interleaving (scene‑level language → shot‑level video), multimodal conditioning with audio, and lightweight distillation techniques to bring TV2TV to edge devices.
Authors
- Xiaochuang Han
- Youssef Emad
- Melissa Hall
- John Nguyen
- Karthik Padthe
- Liam Robbins
- Amir Bar
- Delong Chen
- Michal Drozdzal
- Maha Elbayad
- Yushi Hu
- Shang‑Wen Li
- Sreya Dutta Roy
- Jakob Verbeek
- XuDong Wang
- Marjan Ghazvininejad
- Luke Zettlemoyer
- Emily Dinan
Paper Information
- arXiv ID: 2512.05103v1
- Categories: cs.LG, cs.AI, cs.CV
- Published: December 4, 2025