[Paper] OneStory: Coherent Multi-Shot Video Generation with Adaptive Memory
Source: arXiv - 2512.07802v1
Overview
The paper introduces OneStory, a new framework for generating coherent multi‑shot videos—think of a short film made of several clips that together tell a story. By treating video creation as a sequence of “next‑shot” predictions and leveraging powerful image‑to‑video (I2V) models, OneStory can keep narrative consistency across many shots while remaining computationally efficient.
Key Contributions
- Next‑shot formulation – Recasts multi‑shot video generation as an autoregressive task, enabling the model to generate each new shot conditioned on everything that came before (see the factorization sketched after this list).
- Global memory via Frame Selection – A lightweight module picks the most informative frames from previous shots to build a compact, semantically‑rich memory bank.
- Adaptive Conditioner – Dynamically patches and weights the memory, delivering a concise context vector that guides the I2V generator without overwhelming it.
- Curated multi‑shot dataset – 60 K high‑quality video clips with referential captions that reflect real‑world storytelling patterns, filling a gap in existing benchmarks.
- State‑of‑the‑art coherence – Demonstrates superior narrative consistency in both text‑conditioned and image‑conditioned generation compared to prior MSV methods.
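Read as a probability factorization, the next‑shot idea amounts to the following sketch in our own shorthand (not the paper's notation): $x_k$ denotes the k‑th shot, $c_k$ its prompt, and $m_k$ the compact memory built from earlier shots.

```latex
% Autoregressive factorization of a multi-shot video into next-shot predictions.
% x_k: k-th shot, c_k: its prompt, m_k: compact memory built from the shots before it.
p(x_1, \dots, x_K \mid c_{1:K}) \;=\; \prod_{k=1}^{K} p\!\left(x_k \mid m_k(x_{<k}),\, c_k\right)
```

Each conditional is realized by the pretrained I2V backbone, so training reduces to a per‑shot generation objective.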
Methodology
Autoregressive Shot Generation
- The system starts with an initial shot (either from a text prompt or a reference image).
- For each subsequent shot, it predicts the next sequence of frames using a pretrained I2V backbone (e.g., a diffusion or transformer‑based video generator).
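A minimal sketch of this next‑shot loop; `i2v_generate` and `build_memory` are illustrative stand‑ins for the pretrained I2V backbone and the Frame Selection module, not the paper's actual API:

```python
from dataclasses import dataclass, field

@dataclass
class Story:
    shots: list = field(default_factory=list)   # each shot is a list of frames
    memory: list = field(default_factory=list)  # compact bank of selected key frames

def i2v_generate(context_frames, prompt):
    """Placeholder for a pretrained image-to-video backbone (e.g., a video diffusion model)."""
    return [f"frame conditioned on '{prompt}'"] * 16  # 16 dummy frames

def build_memory(shots):
    """Placeholder for Frame Selection: keep a few key frames from previous shots."""
    return [shot[0] for shot in shots]  # naive stand-in: first frame of each shot

def generate_story(shot_prompts, first_shot):
    story = Story(shots=[first_shot])
    for prompt in shot_prompts[1:]:
        story.memory = build_memory(story.shots)        # global memory from all prior shots
        next_shot = i2v_generate(story.memory, prompt)  # condition the backbone on memory + prompt
        story.shots.append(next_shot)
    return story

story = generate_story(["a dog in a park", "the dog chases a ball", "the dog naps"],
                       first_shot=["initial frame"])
print(len(story.shots))  # 3 shots
```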
Frame Selection (Global Memory Construction)
- From all previously generated shots, the model extracts a small set of “key frames” based on visual saliency and semantic relevance to the story.
- These frames are stored in a memory bank that grows only linearly with the number of shots, keeping memory usage low.
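A hedged sketch of how key‑frame selection could combine visual saliency with semantic relevance; the cosine‑similarity scoring, the blending weight `alpha`, and the top‑k cutoff are assumptions for illustration, not the paper's exact criterion:

```python
import numpy as np

def select_key_frames(frame_embs, saliency, prompt_emb, k=4, alpha=0.5):
    """Pick the k most informative frames from previously generated shots.

    frame_embs: (N, D) visual embeddings of all previously generated frames
    saliency:   (N,) visual-saliency scores for those frames
    prompt_emb: (D,) embedding of the upcoming shot's prompt
    alpha:      assumed trade-off between saliency and semantic relevance
    """
    # Semantic relevance: cosine similarity between each frame and the prompt.
    frame_norm = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
    prompt_norm = prompt_emb / np.linalg.norm(prompt_emb)
    relevance = frame_norm @ prompt_norm
    # Blend saliency with relevance, then keep the top-k frames as the memory bank.
    score = alpha * saliency + (1 - alpha) * relevance
    return np.argsort(score)[-k:][::-1]  # indices of selected key frames, best first

rng = np.random.default_rng(0)
idx = select_key_frames(rng.normal(size=(32, 64)), rng.random(32), rng.normal(size=64), k=4)
print(idx)
```

Because only a handful of frames per shot survive selection, the memory bank grows linearly with the number of shots rather than with the total frame count.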
Adaptive Conditioner (Importance‑Guided Patchification)
- The memory bank is split into patches; each patch receives an importance score derived from its relevance to the upcoming shot’s prompt.
- A weighted aggregation produces a compact context vector that is fed into the I2V generator, ensuring the model focuses on the most pertinent story elements.
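A toy version of importance‑guided patchification and weighted aggregation; the non‑overlapping patch split, the softmax weighting against a prompt embedding, and the single pooled context vector are illustrative choices, not the paper's exact design:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def adaptive_condition(memory_frames, prompt_emb, patch=8):
    """Compress the memory bank into one compact context vector.

    memory_frames: (M, H, W, C) selected key frames
    prompt_emb:    (patch*patch*C,) embedding used to score patch relevance (assumed shape)
    """
    M, H, W, C = memory_frames.shape
    # Split every memory frame into non-overlapping patches and flatten them.
    patches = memory_frames.reshape(M, H // patch, patch, W // patch, patch, C)
    patches = patches.transpose(0, 1, 3, 2, 4, 5).reshape(-1, patch * patch * C)
    # Importance score per patch: similarity to the upcoming shot's prompt embedding.
    weights = softmax(patches @ prompt_emb)
    # Weighted aggregation into a single compact context vector for the I2V generator.
    return weights @ patches

mem = np.random.rand(4, 32, 32, 3)                        # 4 key frames
ctx = adaptive_condition(mem, np.random.rand(8 * 8 * 3))  # context vector of size patch*patch*C
print(ctx.shape)  # (192,)
```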
Training Strategy
- The I2V backbone is first pretrained on large video corpora, then fine‑tuned on the curated 60 K multi‑shot dataset using the next‑shot objective.
- Curriculum learning gradually increases shot length and narrative complexity, helping the model learn long‑range dependencies.
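A minimal sketch of such a curriculum schedule; the stage boundaries and shot counts below are placeholders, not the paper's actual settings:

```python
def curriculum_max_shots(epoch, stages=((0, 2), (5, 4), (10, 8))):
    """Return the maximum number of shots per training story at a given epoch.

    stages: (start_epoch, max_shots) pairs; later stages allow longer,
            more complex narratives (illustrative values only).
    """
    max_shots = stages[0][1]
    for start, shots in stages:
        if epoch >= start:
            max_shots = shots
    return max_shots

for epoch in (0, 5, 12):
    print(epoch, curriculum_max_shots(epoch))  # 2, 4, then 8 shots per story
```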
Results & Findings
| Metric | Prior MSV Baselines | OneStory |
|---|---|---|
| Text‑conditioned coherence (Narrative Consistency Score, higher is better) | 0.78 | 0.86 |
| Image‑conditioned coherence (higher is better) | 0.71 | 0.80 |
| Per‑shot FVD (lower is better) | 45.2 | 31.8 |
| Memory footprint (GPU GB, lower is better) | 12 | 7 |
- Narrative coherence improves by 8–10 % across both conditioning modes.
- The adaptive memory reduces GPU memory usage by ~40 % while still delivering richer context.
- Qualitative examples show smoother transitions, consistent character appearance, and logical story progression even over 8‑10 shot sequences.
Practical Implications
- Content creation pipelines – Studios and indie developers can use OneStory to prototype storyboards or generate filler footage, dramatically cutting down on manual animation effort.
- Interactive media & games – Real‑time generation of narrative cutscenes that adapt to player choices becomes feasible because the model only needs to process a compact memory rather than the full video history.
- Advertising & marketing – Brands can generate multi‑shot video ads from a single product image and a short script, ensuring visual consistency across all shots.
- Education & e‑learning – Automated creation of illustrative video sequences for textbooks or tutorials, where each shot builds on the previous concept.
Limitations & Future Work
- Domain specificity – The curated dataset focuses on relatively clean, well‑lit scenes; performance may drop on highly chaotic or low‑light footage.
- Long‑term character identity – While memory helps, the model can still lose fine‑grained details (e.g., a scar) after many shots.
- Scalability to very long narratives – Autoregressive generation remains sequential, which can be a bottleneck for stories exceeding 15–20 shots.
Future research directions suggested by the authors include:
- Incorporating explicit object‑tracking or identity embeddings to preserve character traits over longer horizons.
- Exploring hierarchical generation (scene‑level planning + shot‑level synthesis) to parallelize parts of the process.
- Expanding the dataset to cover diverse filming styles (e.g., handheld, night, CGI) to improve robustness.
Authors
- Zhaochong An
- Menglin Jia
- Haonan Qiu
- Zijian Zhou
- Xiaoke Huang
- Zhiheng Liu
- Weiming Ren
- Kumara Kahatapitiya
- Ding Liu
- Sen He
- Chenyang Zhang
- Tao Xiang
- Fanny Yang
- Serge Belongie
- Tian Xie
Paper Information
- arXiv ID: 2512.07802v1
- Categories: cs.CV
- Published: December 8, 2025