[Paper] CAST: Modeling Visual State Transitions for Consistent Video Retrieval
Source: arXiv - 2603.08648v1
Overview
The paper CAST: Modeling Visual State Transitions for Consistent Video Retrieval tackles a gap in current video‑retrieval systems: they treat each clip in isolation, ignoring how visual states (objects, actions, and identities) evolve over time. By formalizing Consistent Video Retrieval (CVR) and introducing a lightweight adapter that can be slotted into any frozen vision‑language model, the authors demonstrate a practical way to retrieve video segments that respect temporal coherence—crucial for building longer, story‑like video experiences.
Key Contributions
- Formal definition of Consistent Video Retrieval (CVR) and a diagnostic benchmark covering three diverse datasets (YouCook2, COIN, CrossTask).
- CAST adapter: a plug‑and‑play module that predicts a state‑conditioned residual (Δ) from the visual history, biasing the embedding space toward plausible state transitions.
- Backbone‑agnostic design: works with any frozen vision‑language encoder (e.g., CLIP, BLIP, Flamingo) without fine‑tuning the large model itself.
- Empirical gains: consistent improvements over strong zero‑shot baselines on YouCook2 and CrossTask, competitive results on COIN, and robust performance across multiple foundation models.
- Reranking for video generation: CAST’s transition scores can be used to rank candidate continuations from black‑box generators (e.g., Veo), yielding more temporally coherent outputs.
Methodology
- Problem framing – CVR is cast as retrieving a clip (c_t) given a visual history (H_{t-1} = {c_1, …, c_{t-1}}). The goal is to select a clip whose latent representation aligns not only semantically with the query but also with the state trajectory implied by (H_{t-1}).
- CAST architecture –
  - Frozen encoder: a pre‑trained vision‑language model provides a base embedding (e_t) for each candidate clip.
  - History encoder: a lightweight transformer (or simple RNN) ingests the sequence of past embeddings, producing a state context vector (s_{t-1}).
  - Residual predictor: a small MLP takes (s_{t-1}) and outputs a residual (\Delta_t) that is added to the candidate embedding: (\tilde{e}_t = e_t + \Delta_t).
  - Scoring: similarity between (\tilde{e}_t) and a textual query (or next‑step prompt) yields the retrieval score.
- Training – CAST is trained on paired video sequences with supervision only on the ordering (next‑clip prediction). Because the backbone stays frozen, training is fast and requires only modest GPU memory.
- Plug‑and‑play usage – At inference, any existing video‑retrieval pipeline can prepend CAST to re‑score candidates, or use CAST's Δ‑scores as a reranker for generated video continuations.
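Concretely, the history‑conditioned residual scoring can be sketched in a few lines of numpy. Everything below is a toy stand‑in: the pooling weights, the one‑layer MLP, and the tiny embedding size are hypothetical substitutes for the paper's learned history transformer and the frozen backbone's real embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # toy embedding dimension; real backbones use 512+

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Hypothetical frozen-encoder outputs: past-clip embeddings (the
# history H_{t-1}) and two candidate-clip embeddings e_t.
history = l2_normalize(rng.normal(size=(3, D)))
candidates = l2_normalize(rng.normal(size=(2, D)))
query = l2_normalize(rng.normal(size=(D,)))  # text-query embedding

# History encoder: the paper uses a lightweight transformer; a simple
# weighted pooling stands in for it here (per the ablation, a naive
# unweighted average underperforms).
W_pool = rng.normal(size=(D,)) * 0.1
attn = np.exp(history @ W_pool)
attn /= attn.sum()
s = attn @ history  # state context s_{t-1}

# Residual predictor: a tiny one-layer MLP mapping s_{t-1} -> Δ_t.
W1, b1 = rng.normal(size=(D, D)) * 0.1, np.zeros(D)
delta = np.tanh(s @ W1 + b1)

# Shift each candidate by the residual and score against the query:
# score = cos(e_t + Δ_t, q); the highest-scoring candidate is retrieved.
shifted = l2_normalize(candidates + delta)
scores = shifted @ query
best = int(np.argmax(scores))
print(scores, best)
```

Because the shift is a simple additive residual in embedding space, the frozen encoder's similarity machinery is reused unchanged; only the small history encoder and MLP need training.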
Results & Findings
| Dataset | Backbone | Baseline (zero‑shot) | CAST‑augmented | Δ (absolute) |
|---|---|---|---|---|
| YouCook2 | CLIP‑ViT‑B/32 | R@1 = 21.4% | R@1 = 27.9% | +6.5 pts |
| CrossTask | BLIP‑Base | R@1 = 18.7% | R@1 = 24.3% | +5.6 pts |
| COIN | Flamingo‑Small | R@1 = 15.2% | R@1 = 15.8% | +0.6 pts |
- Consistent gains across all backbones show that the residual update captures genuine temporal dynamics rather than overfitting to a specific encoder.
- Reranking experiment: when CAST’s transition score is used to reorder 5 candidate continuations from the Veo generator, human evaluators reported a 12% increase in perceived temporal coherence.
- Ablation: removing the history encoder or using a naïve averaging of past embeddings drops performance back to baseline, confirming the importance of explicit state modeling.
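The reranking use case can be illustrated with a toy transition score. The scoring rule and the matrix `W` below are illustrative assumptions, not the paper's learned adapter: we score each candidate continuation by its cosine similarity to a history‑conditioned prediction of the next state, then sort.

```python
import numpy as np

rng = np.random.default_rng(1)
D = 8  # toy embedding dimension

def unit(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def transition_score(history, cand, W):
    """Toy transition score: cosine between a history-conditioned
    next-state prediction and the candidate embedding. The real CAST
    adapter uses a learned transformer + MLP; W is a stand-in."""
    s = unit(history.mean(axis=0) @ W)  # predicted next-state direction
    return float(unit(cand) @ s)

history = unit(rng.normal(size=(4, D)))
W = np.eye(D) + 0.05 * rng.normal(size=(D, D))

# Five candidate continuations from a black-box generator (random
# vectors here), reranked by descending transition score.
cands = unit(rng.normal(size=(5, D)))
ranked = sorted(range(5),
                key=lambda i: transition_score(history, cands[i], W),
                reverse=True)
print(ranked)
```

Because the generator is treated as a black box, this reranking needs no access to its weights—only embeddings of its outputs.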
Practical Implications
- Long‑form video editing tools (e.g., automated storyboarding, highlight reels) can integrate CAST to fetch clips that naturally follow the current narrative, reducing manual stitching.
- Content recommendation engines for platforms like YouTube Shorts or TikTok can surface follow‑up videos that respect the viewer’s recent watch history, improving watch‑time and user satisfaction.
- Video‑to‑video generation pipelines can employ CAST as a cheap, post‑hoc consistency filter, avoiding costly retraining of generative models while still delivering smoother continuations.
- Developer-friendly: because CAST is a small adapter, it can be dropped into existing PyTorch or TensorFlow pipelines with a few lines of code, and it runs in under 5 ms per candidate on a single GPU.
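As an integration pattern, post‑hoc rescoring amounts to blending an existing pipeline's query‑similarity scores with CAST‑style transition scores and re‑sorting. The function `rescore` and the mixing weight `lam` are hypothetical names for illustration, not part of any published API.

```python
import numpy as np

def rescore(base_scores, transition_scores, lam=0.5):
    """Blend existing retrieval scores with transition scores and
    return candidate indices sorted best-first. lam is an assumed
    mixing weight, not a value from the paper."""
    base = np.asarray(base_scores, dtype=float)
    trans = np.asarray(transition_scores, dtype=float)
    blended = (1 - lam) * base + lam * trans
    return np.argsort(-blended)

# Candidate 2 has the best query match, but candidate 1's strong
# transition score promotes it to the top after blending.
order = rescore([0.8, 0.6, 0.9], [0.1, 0.9, 0.2])
print(order)  # prints [1 2 0] (candidate 1 first)
```

Since the blend only touches final scores, it slots in after any retrieval or generation stage without modifying the upstream models.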
Limitations & Future Work
- Dataset bias: the benchmark focuses on instructional and cooking domains; performance on highly dynamic content (e.g., sports, movies) remains untested.
- State granularity: CAST models state at the embedding level, which may miss fine‑grained object‑level changes (e.g., subtle pose shifts).
- Scalability of history length: longer histories increase the transformer’s cost; future work could explore hierarchical or memory‑compressed encodings.
- End‑to‑end training: while freezing the backbone simplifies deployment, jointly fine‑tuning the encoder and CAST could unlock further gains, especially for domain‑specific applications.
Overall, CAST offers a pragmatic bridge between powerful frozen vision‑language models and the need for temporally consistent video retrieval—a step that can be immediately leveraged by developers building next‑generation video experiences.
Authors
- Yanqing Liu
- Yingcheng Liu
- Fanghong Dong
- Budianto Budianto
- Cihang Xie
- Yan Jiao
Paper Information
- arXiv ID: 2603.08648v1
- Categories: cs.CV
- Published: March 9, 2026