[Paper] OneStory: Coherent Multi-Shot Video Generation with Adaptive Memory
Source: arXiv - 2512.07802v1
Overview
The paper introduces OneStory, a new framework for generating coherent multi‑shot videos—think of a short film made of several clips that together tell a story. By treating video creation as a sequence of “next‑shot” predictions and leveraging powerful image‑to‑video (I2V) models, OneStory can keep narrative consistency across many shots while remaining computationally efficient.
Key Contributions
- Next‑shot formulation – Recasts multi‑shot video generation as an autoregressive task, enabling the model to generate each new shot conditioned on everything that came before (see the factorization sketched after this list).
- Global memory via Frame Selection – A lightweight module picks the most informative frames from previous shots to build a compact, semantically‑rich memory bank.
- Adaptive Conditioner – Dynamically patches and weights the memory, delivering a concise context vector that guides the I2V generator without overwhelming it.
- Curated multi‑shot dataset – 60 K high‑quality video clips with referential captions that reflect real‑world storytelling patterns, filling a gap in existing benchmarks.
- State‑of‑the‑art coherence – Demonstrates superior narrative consistency in both text‑conditioned and image‑conditioned generation compared to prior MSV methods.
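Read as a probability factorization, the next‑shot idea amounts to the following sketch in our own shorthand (not the paper's notation): $x_k$ denotes the k‑th shot, $c_k$ its prompt, and $m_k$ the compact memory built from earlier shots.

```latex
% Autoregressive factorization of a multi-shot video into next-shot predictions.
% x_k: k-th shot, c_k: its prompt, m_k: compact memory built from the shots before it.
p(x_1, \dots, x_K \mid c_{1:K}) \;=\; \prod_{k=1}^{K} p\!\left(x_k \mid m_k(x_{<k}),\, c_k\right)
```

Each conditional is realized by the pretrained I2V backbone, so training reduces to a per‑shot generation objective.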
Methodology
Autoregressive Shot Generation
- The system starts with an initial shot (either from a text prompt or a reference image).
- For each subsequent shot, it predicts the next sequence of frames using a pretrained I2V backbone (e.g., a diffusion or transformer‑based video generator).
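A minimal sketch of this next‑shot loop; `i2v_generate` and `build_memory` are illustrative stand‑ins for the pretrained I2V backbone and the Frame Selection module, not the paper's actual API:

```python
from dataclasses import dataclass, field

@dataclass
class Story:
    shots: list = field(default_factory=list)   # each shot is a list of frames
    memory: list = field(default_factory=list)  # compact bank of selected key frames

def i2v_generate(context_frames, prompt):
    """Placeholder for a pretrained image-to-video backbone (e.g., a video diffusion model)."""
    return [f"frame conditioned on '{prompt}'"] * 16  # 16 dummy frames

def build_memory(shots):
    """Placeholder for Frame Selection: keep a few key frames from previous shots."""
    return [shot[0] for shot in shots]  # naive stand-in: first frame of each shot

def generate_story(shot_prompts, first_shot):
    story = Story(shots=[first_shot])
    for prompt in shot_prompts[1:]:
        story.memory = build_memory(story.shots)        # global memory from all prior shots
        next_shot = i2v_generate(story.memory, prompt)  # condition the backbone on memory + prompt
        story.shots.append(next_shot)
    return story

story = generate_story(["a dog in a park", "the dog chases a ball", "the dog naps"],
                       first_shot=["initial frame"])
print(len(story.shots))  # 3 shots
```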
Frame Selection (Global Memory Construction)
- From all previously generated shots, the model extracts a small set of “key frames” based on visual saliency and semantic relevance to the story.
- These frames are stored in a memory bank that grows only linearly with the number of shots, keeping memory usage low.
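A hedged sketch of how key‑frame selection could combine visual saliency with semantic relevance; the cosine‑similarity scoring, the blending weight `alpha`, and the top‑k cutoff are assumptions for illustration, not the paper's exact criterion:

```python
import numpy as np

def select_key_frames(frame_embs, saliency, prompt_emb, k=4, alpha=0.5):
    """Pick the k most informative frames from previously generated shots.

    frame_embs: (N, D) visual embeddings of all previously generated frames
    saliency:   (N,) visual-saliency scores for those frames
    prompt_emb: (D,) embedding of the upcoming shot's prompt
    alpha:      assumed trade-off between saliency and semantic relevance
    """
    # Semantic relevance: cosine similarity between each frame and the prompt.
    frame_norm = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
    prompt_norm = prompt_emb / np.linalg.norm(prompt_emb)
    relevance = frame_norm @ prompt_norm
    # Blend saliency with relevance, then keep the top-k frames as the memory bank.
    score = alpha * saliency + (1 - alpha) * relevance
    return np.argsort(score)[-k:][::-1]  # indices of selected key frames, best first

rng = np.random.default_rng(0)
idx = select_key_frames(rng.normal(size=(32, 64)), rng.random(32), rng.normal(size=64), k=4)
print(idx)
```

Because only a handful of frames per shot survive selection, the memory bank grows linearly with the number of shots rather than with the total frame count.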
Adaptive Conditioner (Importance‑Guided Patchification)
- The memory bank is split into patches; each patch receives an importance score derived from its relevance to the upcoming shot’s prompt.
- A weighted aggregation produces a compact context vector that is fed into the I2V generator, ensuring the model focuses on the most pertinent story elements.
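A toy version of importance‑guided patchification and weighted aggregation; the non‑overlapping patch split, the softmax weighting against a prompt embedding, and the single pooled context vector are illustrative choices, not the paper's exact design:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def adaptive_condition(memory_frames, prompt_emb, patch=8):
    """Compress the memory bank into one compact context vector.

    memory_frames: (M, H, W, C) selected key frames
    prompt_emb:    (patch*patch*C,) embedding used to score patch relevance (assumed shape)
    """
    M, H, W, C = memory_frames.shape
    # Split every memory frame into non-overlapping patches and flatten them.
    patches = memory_frames.reshape(M, H // patch, patch, W // patch, patch, C)
    patches = patches.transpose(0, 1, 3, 2, 4, 5).reshape(-1, patch * patch * C)
    # Importance score per patch: similarity to the upcoming shot's prompt embedding.
    weights = softmax(patches @ prompt_emb)
    # Weighted aggregation into a single compact context vector for the I2V generator.
    return weights @ patches

mem = np.random.rand(4, 32, 32, 3)                        # 4 key frames
ctx = adaptive_condition(mem, np.random.rand(8 * 8 * 3))  # context vector of size patch*patch*C
print(ctx.shape)  # (192,)
```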
Training Strategy
- The I2V backbone is first pretrained on large video corpora, then fine‑tuned on the curated 60 K multi‑shot dataset using the next‑shot objective.
- Curriculum learning gradually increases shot length and narrative complexity, helping the model learn long‑range dependencies.
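A minimal sketch of such a curriculum schedule; the stage boundaries and shot counts below are placeholders, not the paper's actual settings:

```python
def curriculum_max_shots(epoch, stages=((0, 2), (5, 4), (10, 8))):
    """Return the maximum number of shots per training story at a given epoch.

    stages: (start_epoch, max_shots) pairs; later stages allow longer,
            more complex narratives (illustrative values only).
    """
    max_shots = stages[0][1]
    for start, shots in stages:
        if epoch >= start:
            max_shots = shots
    return max_shots

for epoch in (0, 5, 12):
    print(epoch, curriculum_max_shots(epoch))  # 2, 4, then 8 shots per story
```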
Results & Findings
| Metric | Prior MSV Baselines | OneStory |
|---|---|---|
| Text‑conditioned coherence (Narrative Consistency Score, higher is better) | 0.78 | 0.86 |
| Image‑conditioned coherence (higher is better) | 0.71 | 0.80 |
| Per‑shot FVD (lower is better) | 45.2 | 31.8 |
| Memory footprint (GPU GB, lower is better) | 12 | 7 |
- Narrative coherence improves by 8–10 % across both conditioning modes.
- The adaptive memory reduces GPU memory usage by ~40 % while still delivering richer context.
- Qualitative examples show smoother transitions, consistent character appearance, and logical story progression even over 8‑10 shot sequences.
Practical Implications
- Content creation pipelines – Studios and indie developers can use OneStory to prototype storyboards or generate filler footage, dramatically cutting down on manual animation effort.
- Interactive media & games – Real‑time generation of narrative cutscenes that adapt to player choices becomes feasible because the model only needs to process a compact memory rather than the full video history.
- Advertising & marketing – Brands can generate multi‑shot video ads from a single product image and a short script, ensuring visual consistency across all shots.
- Education & e‑learning – Automated creation of illustrative video sequences for textbooks or tutorials, where each shot builds on the previous concept.
Limitations & Future Work
- Domain specificity – The curated dataset focuses on relatively clean, well‑lit scenes; performance may drop on highly chaotic or low‑light footage.
- Long‑term character identity – While memory helps, the model can still lose fine‑grained details (e.g., a scar) after many shots.
- Scalability to very long narratives – Autoregressive generation remains sequential, which can be a bottleneck for stories exceeding 15–20 shots.
Future research directions suggested by the authors include:
- Incorporating explicit object‑tracking or identity embeddings to preserve character traits over longer horizons.
- Exploring hierarchical generation (scene‑level planning + shot‑level synthesis) to parallelize parts of the process.
- Expanding the dataset to cover diverse filming styles (e.g., handheld, night, CGI) to improve robustness.
Authors
- Zhaochong An
- Menglin Jia
- Haonan Qiu
- Zijian Zhou
- Xiaoke Huang
- Zhiheng Liu
- Weiming Ren
- Kumara Kahatapitiya
- Ding Liu
- Sen He
- Chenyang Zhang
- Tao Xiang
- Fanny Yang
- Serge Belongie
- Tian Xie
Paper Information
- arXiv ID: 2512.07802v1
- Categories: cs.CV
- Published: December 8, 2025