[Paper] CAST: Modeling Visual State Transitions for Consistent Video Retrieval

Published: March 9, 2026
4 min read

Source: arXiv - 2603.08648v1

Overview

The paper CAST: Modeling Visual State Transitions for Consistent Video Retrieval tackles a gap in current video‑retrieval systems: they treat each clip in isolation, ignoring how visual states (objects, actions, and identities) evolve over time. By formalizing Consistent Video Retrieval (CVR) and introducing a lightweight adapter that can be slotted onto any frozen vision‑language model, the authors demonstrate a practical way to retrieve video segments that respect temporal coherence—crucial for building longer, story‑like video experiences.

Key Contributions

  • Formal definition of Consistent Video Retrieval (CVR) and a diagnostic benchmark covering three diverse datasets (YouCook2, COIN, CrossTask).
  • CAST adapter: a plug‑and‑play module that predicts a state‑conditioned residual (Δ) from the visual history, biasing the embedding space toward plausible state transitions.
  • Backbone‑agnostic design: works with any frozen vision‑language encoder (e.g., CLIP, BLIP, Flamingo) without fine‑tuning the large model itself.
  • Empirical gains: consistent improvements over strong zero‑shot baselines on YouCook2 and CrossTask, competitive results on COIN, and robust performance across multiple foundation models.
  • Reranking for video generation: CAST’s transition scores can be used to rank candidate continuations from black‑box generators (e.g., Veo), yielding more temporally coherent outputs.

Methodology

  1. Problem framing – CVR is cast as retrieving a clip (c_t) given a visual history (H_{t-1} = {c_1, …, c_{t-1}}). The goal is to select a clip whose latent representation aligns not only semantically but also with the state trajectory implied by (H_{t-1}).

  2. CAST architecture

    • Frozen encoder: a pre‑trained vision‑language model provides a base embedding (e_t) for each candidate clip.
    • History encoder: a lightweight transformer (or simple RNN) ingests the sequence of past embeddings, producing a state context vector (s_{t-1}).
    • Residual predictor: a small MLP takes (s_{t-1}) and outputs a residual (\Delta_t) that is added to the candidate embedding: (\tilde{e}_t = e_t + \Delta_t).
    • Scoring: similarity between (\tilde{e}_t) and a textual query (or next‑step prompt) yields the retrieval score.
  3. Training – CAST is trained on paired video sequences with supervision only on the ordering (next‑clip prediction). Because the backbone stays frozen, training is fast and requires modest GPU memory.

  4. Plug‑and‑play usage – At inference, any existing video‑retrieval pipeline can prepend CAST to re‑score candidates, or use CAST’s Δ‑scores as a reranker for generated video continuations.
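The architecture above can be sketched in a few lines of PyTorch. This is an illustrative reconstruction, not the authors' code: the module names, embedding dimension, pooling choice, and layer counts are all assumptions, and the frozen backbone is represented only by the embeddings it produces.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CASTAdapter(nn.Module):
    """Sketch of a CAST-style adapter: a lightweight history encoder plus
    a small MLP that predicts a state-conditioned residual Δ_t."""

    def __init__(self, dim: int = 512, num_layers: int = 2, num_heads: int = 4):
        super().__init__()
        # History encoder: a lightweight transformer over past clip embeddings.
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True
        )
        self.history_encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        # Residual predictor: a small MLP mapping the state context to Δ_t.
        self.residual_mlp = nn.Sequential(
            nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim)
        )

    def forward(self, history: torch.Tensor, candidates: torch.Tensor) -> torch.Tensor:
        """history: (B, T, D) past clip embeddings; candidates: (B, K, D)."""
        # State context s_{t-1}: pool the encoded history over time.
        s = self.history_encoder(history).mean(dim=1)   # (B, D)
        delta = self.residual_mlp(s)                    # (B, D)
        # Shift each candidate embedding by the state-conditioned residual.
        shifted = candidates + delta.unsqueeze(1)       # (B, K, D)
        return F.normalize(shifted, dim=-1)

# Toy usage: score 10 candidate clips per sequence against a text query embedding.
adapter = CASTAdapter(dim=512)
history = torch.randn(2, 5, 512)      # 5 past clips per sequence
candidates = torch.randn(2, 10, 512)  # 10 candidate clips
query = F.normalize(torch.randn(2, 512), dim=-1)
scores = torch.einsum("bkd,bd->bk", adapter(history, candidates), query)
print(scores.shape)  # torch.Size([2, 10])
```

Because only the adapter's parameters are trainable, next‑clip prediction supervision can be applied with an ordinary contrastive loss over these scores while the backbone stays frozen.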

Results & Findings

| Dataset   | Backbone       | Baseline (zero‑shot) | CAST‑augmented | Δ (absolute) |
|-----------|----------------|----------------------|----------------|--------------|
| YouCook2  | CLIP‑ViT‑B/32  | R@1 = 21.4%          | R@1 = 27.9%    | +6.5 pts     |
| CrossTask | BLIP‑Base      | R@1 = 18.7%          | R@1 = 24.3%    | +5.6 pts     |
| COIN      | Flamingo‑Small | R@1 = 15.2%          | R@1 = 15.8%    | +0.6 pts     |
  • Consistent gains across all backbones show that the residual update captures genuine temporal dynamics rather than overfitting to a specific encoder.
  • Reranking experiment: when CAST’s transition score is used to reorder 5 candidate continuations from the Veo generator, human evaluators reported a 12% increase in perceived temporal coherence.
  • Ablation: removing the history encoder or using a naïve averaging of past embeddings drops performance back to baseline, confirming the importance of explicit state modeling.
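The reranking use is conceptually simple: shift each candidate continuation's embedding by the predicted residual, then order candidates by similarity to a target vector. A minimal NumPy sketch follows; the residual predictor here is a hypothetical stand‑in, and scoring against the history context (rather than a next‑step prompt) is an assumption made for the toy example.

```python
import numpy as np

def rerank_by_transition_score(history_ctx, candidate_embs, predict_residual):
    """Reorder candidate continuations by a CAST-style transition score.

    history_ctx: (D,) state context vector from the visual history.
    candidate_embs: (K, D) embeddings of candidate continuations.
    predict_residual: callable mapping history_ctx -> (D,) residual Δ.
    Returns candidate indices, best first.
    """
    delta = predict_residual(history_ctx)
    shifted = candidate_embs + delta  # bias toward plausible transitions
    shifted = shifted / np.linalg.norm(shifted, axis=1, keepdims=True)
    target = history_ctx / np.linalg.norm(history_ctx)
    scores = shifted @ target         # cosine similarity to the target
    return np.argsort(-scores)

# Toy example: 5 candidates, with a stand-in residual predictor.
rng = np.random.default_rng(0)
ctx = rng.normal(size=8)
cands = rng.normal(size=(5, 8))
order = rerank_by_transition_score(ctx, cands, lambda c: 0.1 * c)
print(order)
```

In the black‑box setting described above, `candidate_embs` would come from encoding each generated continuation with the frozen backbone, so no access to the generator's internals is needed.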

Practical Implications

  • Long‑form video editing tools (e.g., automated storyboarding, highlight reels) can integrate CAST to fetch clips that naturally follow the current narrative, reducing manual stitching.
  • Content recommendation engines for platforms like YouTube Shorts or TikTok can surface follow‑up videos that respect the viewer’s recent watch history, improving watch‑time and user satisfaction.
  • Video‑to‑video generation pipelines can employ CAST as a cheap, post‑hoc consistency filter, avoiding costly retraining of generative models while still delivering smoother continuations.
  • Developer-friendly: because CAST is a small adapter, it can be dropped into existing PyTorch or TensorFlow pipelines with a few lines of code, and it runs in under 5 ms per candidate on a single GPU.

Limitations & Future Work

  • Dataset bias: the benchmark focuses on instructional and cooking domains; performance on highly dynamic content (e.g., sports, movies) remains untested.
  • State granularity: CAST models state at the embedding level, which may miss fine‑grained object‑level changes (e.g., subtle pose shifts).
  • Scalability of history length: longer histories increase the transformer’s cost; future work could explore hierarchical or memory‑compressed encodings.
  • End‑to‑end training: while freezing the backbone simplifies deployment, jointly fine‑tuning the encoder and CAST could unlock further gains, especially for domain‑specific applications.

Overall, CAST offers a pragmatic bridge between powerful frozen vision‑language models and the need for temporally consistent video retrieval—a step that can be immediately leveraged by developers building next‑generation video experiences.

Authors

  • Yanqing Liu
  • Yingcheng Liu
  • Fanghong Dong
  • Budianto Budianto
  • Cihang Xie
  • Yan Jiao

Paper Information

  • arXiv ID: 2603.08648v1
  • Categories: cs.CV
  • Published: March 9, 2026
