[Paper] OneStory: Coherent Multi-Shot Video Generation with Adaptive Memory

Published: December 8, 2025, 01:32 PM EST
4 min read
Source: arXiv (2512.07802v1)

Overview

The paper introduces OneStory, a new framework for generating coherent multi‑shot videos—think of a short film made of several clips that together tell a story. By treating video creation as a sequence of “next‑shot” predictions and leveraging powerful image‑to‑video (I2V) models, OneStory can keep narrative consistency across many shots while remaining computationally efficient.

Key Contributions

  • Next‑shot formulation – Recasts multi‑shot video generation as an autoregressive task, enabling the model to generate each new shot conditioned on everything that came before (a sketch of this factorization follows the list).
  • Global memory via Frame Selection – A lightweight module picks the most informative frames from previous shots to build a compact, semantically‑rich memory bank.
  • Adaptive Conditioner – Dynamically patches and weights the memory, delivering a concise context vector that guides the I2V generator without overwhelming it.
  • Curated multi‑shot dataset – 60K high‑quality video clips with referential captions that reflect real‑world storytelling patterns, filling a gap in existing benchmarks.
  • State‑of‑the‑art coherence – Demonstrates superior narrative consistency in both text‑conditioned and image‑conditioned generation compared to prior MSV methods.
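Read one way (our notation, not the paper's), the next‑shot formulation factorizes the joint distribution over an N‑shot story so that each shot depends only on a compact memory of earlier shots and its own caption:

```latex
% Assumed notation (not from the paper): s_i is the i-th shot, c_i its caption,
% and M(s_{<i}) the compact memory built from previously generated shots.
p(s_1, \dots, s_N \mid c_{1:N}) \;=\; \prod_{i=1}^{N} p\bigl(s_i \mid M(s_{<i}),\, c_i\bigr)
```

The memory term M is what the Frame Selection and Adaptive Conditioner modules described below approximate.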

Methodology

  1. Autoregressive Shot Generation

    • The system starts with an initial shot (either from a text prompt or a reference image).
    • For each subsequent shot, it predicts the next sequence of frames using a pretrained I2V backbone (e.g., a diffusion or transformer‑based video generator).
  2. Frame Selection (Global Memory Construction)

    • From all previously generated shots, the model extracts a small set of “key frames” based on visual saliency and semantic relevance to the story.
    • These frames are stored in a memory bank that grows only linearly with the number of shots, keeping memory usage low.
  3. Adaptive Conditioner (Importance‑Guided Patchification)

    • The memory bank is split into patches; each patch receives an importance score derived from its relevance to the upcoming shot’s prompt.
    • A weighted aggregation produces a compact context vector that is fed into the I2V generator, ensuring the model focuses on the most pertinent story elements (a minimal code sketch of steps 1–3 follows this list).
  4. Training Strategy

    • The I2V backbone is first pretrained on large video corpora, then fine‑tuned on the curated 60K multi‑shot dataset using the next‑shot objective.
    • Curriculum learning gradually increases shot length and narrative complexity, helping the model learn long‑range dependencies (a toy curriculum schedule is also sketched after this list).
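As referenced above, here is a minimal PyTorch‑style sketch of steps 1–3. All names (FrameSelector, AdaptiveConditioner, encode_frames, i2v.generate) are hypothetical stand‑ins, and each selected key‑frame feature is treated as a single memory "patch" for brevity; the authors' actual architecture will differ.

```python
# Minimal sketch of the OneStory pipeline as described above (steps 1-3).
# FrameSelector, AdaptiveConditioner, encode_frames, and i2v.generate are
# hypothetical stand-ins, not the authors' actual modules or API.
import torch
import torch.nn.functional as F


class FrameSelector(torch.nn.Module):
    """Step 2: score frames from earlier shots and keep the top-k as global memory."""

    def __init__(self, feat_dim: int, top_k: int = 8):
        super().__init__()
        self.score = torch.nn.Linear(feat_dim, 1)  # saliency / relevance score per frame
        self.top_k = top_k

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (num_frames, feat_dim) features of all previously generated frames
        scores = self.score(frame_feats).squeeze(-1)      # (num_frames,)
        k = min(self.top_k, frame_feats.size(0))
        idx = scores.topk(k).indices                      # indices of the selected key frames
        return frame_feats[idx]                           # (k, feat_dim) compact memory bank


class AdaptiveConditioner(torch.nn.Module):
    """Step 3: importance-weighted aggregation of memory patches into one context vector."""

    def __init__(self, feat_dim: int, ctx_dim: int):
        super().__init__()
        self.importance = torch.nn.Linear(2 * feat_dim, 1)  # scores each patch against the prompt
        self.proj = torch.nn.Linear(feat_dim, ctx_dim)

    def forward(self, memory: torch.Tensor, prompt_emb: torch.Tensor) -> torch.Tensor:
        # memory: (num_patches, feat_dim); prompt_emb: (feat_dim,) embedding of the next shot's prompt
        prompt = prompt_emb.expand(memory.size(0), -1)                     # broadcast to every patch
        logits = self.importance(torch.cat([memory, prompt], dim=-1)).squeeze(-1)
        weights = F.softmax(logits, dim=0)                                 # importance per patch
        context = (weights.unsqueeze(-1) * memory).sum(dim=0)              # weighted pooling
        return self.proj(context)                                          # compact context vector


def generate_story(i2v, selector, conditioner, encode_frames, shot_prompts, first_shot):
    """Step 1: autoregressive next-shot generation conditioned on adaptive memory."""
    shots = [first_shot]                                    # from a text prompt or reference image
    for prompt_emb in shot_prompts[1:]:
        history = torch.cat([encode_frames(s) for s in shots], dim=0)  # (num_frames, feat_dim)
        memory = selector(history)                          # key-frame memory bank
        context = conditioner(memory, prompt_emb)           # importance-weighted context
        shots.append(i2v.generate(prompt=prompt_emb, context=context))  # pretrained I2V backbone
    return shots
```

The design point the sketch tries to capture is the data flow: no matter how many shots have already been generated, only a small memory bank and one compact context vector reach the I2V backbone, which is why context stays manageable as the story grows.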
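And a toy curriculum schedule in the spirit of step 4; the actual schedule (epochs, step sizes, complexity measures) is not given in this summary, so the numbers below are placeholders.

```python
def max_shots_for_epoch(epoch: int, start: int = 2, step_every: int = 5, cap: int = 10) -> int:
    """Placeholder curriculum: allow one extra shot every `step_every` epochs, up to `cap`."""
    return min(cap, start + epoch // step_every)

# Example: epochs 0-4 train on 2-shot stories, epochs 5-9 on 3-shot stories, ...,
# until the curriculum caps out at 10-shot stories.
```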

Results & Findings

| Setting (metric) | Prior MSV baselines | OneStory |
| --- | --- | --- |
| Text‑conditioned coherence (Narrative Consistency Score, higher is better) | 0.78 | 0.86 |
| Image‑conditioned coherence (higher is better) | 0.71 | 0.80 |
| Per‑shot FVD (lower is better) | 45.2 | 31.8 |
| Memory footprint (GPU memory) | 12 GB | 7 GB |
  • Narrative coherence improves by 8–10 % across both conditioning modes.
  • The adaptive memory reduces GPU memory usage by ~40 % while still delivering richer context.
  • Qualitative examples show smoother transitions, consistent character appearance, and logical story progression even over 8‑10 shot sequences.

Practical Implications

  • Content creation pipelines – Studios and indie developers can use OneStory to prototype storyboards or generate filler footage, dramatically cutting down on manual animation effort.
  • Interactive media & games – Real‑time generation of narrative cutscenes that adapt to player choices becomes feasible because the model only needs to process a compact memory rather than the full video history.
  • Advertising & marketing – Brands can generate multi‑shot video ads from a single product image and a short script, ensuring visual consistency across all shots.
  • Education & e‑learning – Automated creation of illustrative video sequences for textbooks or tutorials, where each shot builds on the previous concept.

Limitations & Future Work

  • Domain specificity – The curated dataset focuses on relatively clean, well‑lit scenes; performance may drop on highly chaotic or low‑light footage.
  • Long‑term character identity – While memory helps, the model can still lose fine‑grained details (e.g., a scar) after many shots.
  • Scalability to very long narratives – Autoregressive generation remains sequential, which can be a bottleneck for stories exceeding 15–20 shots.

Future research directions suggested by the authors include:

  • Incorporating explicit object‑tracking or identity embeddings to preserve character traits over longer horizons.
  • Exploring hierarchical generation (scene‑level planning + shot‑level synthesis) to parallelize parts of the process.
  • Expanding the dataset to cover diverse filming styles (e.g., handheld, night, CGI) to improve robustness.

Authors

  • Zhaochong An
  • Menglin Jia
  • Haonan Qiu
  • Zijian Zhou
  • Xiaoke Huang
  • Zhiheng Liu
  • Weiming Ren
  • Kumara Kahatapitiya
  • Ding Liu
  • Sen He
  • Chenyang Zhang
  • Tao Xiang
  • Fanny Yang
  • Serge Belongie
  • Tian Xie

Paper Information

  • arXiv ID: 2512.07802v1
  • Categories: cs.CV
  • Published: December 8, 2025