[Paper] VideoAtlas: Navigating Long-Form Video in Logarithmic Compute

Published: March 18, 2026 at 01:20 PM EDT
Source: arXiv


Overview

The paper introduces VideoAtlas, a novel way to represent and navigate long‑form video that avoids the lossy text‑or‑frame summarizations used by most current video‑language models. By organizing a video into a hierarchical, lossless grid, the system lets a language model “zoom in” on any region with only logarithmic growth in compute, making it practical to reason over hours‑long footage.

Key Contributions

  • Hierarchical Grid Representation – A lossless, caption‑free structure that lets any part of a video be accessed recursively, similar to how map services zoom in on geographic regions.
  • Video‑RLM Architecture – A Master‑Worker parallel framework that couples a recursive language model (RLM) with the VideoAtlas environment, turning video understanding into a Markov Decision Process.
  • Logarithmic Compute Scaling – Demonstrates that processing cost grows only logarithmically with video length, with a 30‑60 % cache‑hit rate thanks to reusable grid cells.
  • Environment Budgeting – Introduces a principled hyper‑parameter (maximum exploration depth) that trades compute for accuracy.
  • Adaptive Compute Allocation – Shows the system automatically spends more compute on fine‑grained questions and less on coarse‑level queries.
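
The map‑like zoom idea behind the hierarchical grid can be made concrete with a small sketch. The depth formula below is an illustrative assumption (the paper does not state the branching factor or leaf size used here):

```python
import math

# Hypothetical sketch: each level splits the video into `branching` equal
# temporal chunks, so reaching a clip of length `leaf_seconds` from an
# hours-long video takes only logarithmically many zoom-in steps.
def zoom_depth(video_seconds: float, leaf_seconds: float = 60.0,
               branching: int = 10) -> int:
    """Number of zoom-in steps from the whole video to one leaf cell."""
    if video_seconds <= leaf_seconds:
        return 0
    return math.ceil(math.log(video_seconds / leaf_seconds, branching))

# A 10-hour video (36,000 s) is only 3 zoom-in steps away from any
# 1-minute clip when every level uses a 10-way split.
print(zoom_depth(36_000))  # → 3
```

This logarithmic path length is the source of the compute‑scaling claim: each query inspects one short chain of cells rather than every frame.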

Methodology

  1. Video Grid Construction – Each video is broken down into a multi‑level spatial‑temporal grid (e.g., level‑0 = whole video, level‑1 = 10‑minute chunks, level‑2 = 1‑minute clips, etc.). Every cell stores the raw pixel data for its span, preserving full visual fidelity.
  2. Markov Decision Process (MDP) Formulation – The agent’s state is the current cell; actions are “zoom‑in,” “zoom‑out,” or “stay.” Rewards are tied to how well the agent’s answer matches ground‑truth annotations.
  3. Recursive Language Model (RLM) – A transformer‑style model that can call itself on sub‑problems. The Master RLM decides which high‑level cells to explore, while Worker RLMs operate on the selected sub‑cells in parallel, each returning visual evidence.
  4. Caching & Reuse – Because many queries share overlapping cells, computed embeddings for a cell are cached. Subsequent workers can retrieve them instantly, yielding the reported 30‑60 % cache‑hit improvement.
  5. Budget Control – A hard limit on recursion depth (the “environment budget”) caps the total number of cells visited, giving developers a single knob to balance latency vs. answer quality.
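
The navigation loop in steps 2–5 can be sketched as a toy recursion with a shared cache and a hard depth budget. Everything below (`Cell`, `encode`, the span‑overlap test) is a hypothetical stand‑in for the paper’s actual components:

```python
from dataclasses import dataclass, field

@dataclass
class Cell:
    span: tuple                         # (start_sec, end_sec) of this cell
    children: list = field(default_factory=list)

cache: dict = {}                        # span -> cached evidence (reuse across queries)

def encode(cell: Cell) -> str:
    # Stand-in for running a worker model on the cell's raw pixels.
    return f"evidence{cell.span}"

def explore(cell: Cell, query_span: tuple, budget: int) -> list:
    """Collect evidence for a query, visiting at most `budget` levels."""
    if cell.span in cache:
        return [cache[cell.span]]       # cache hit: no recompute
    lo, hi = cell.span
    qlo, qhi = query_span
    if hi <= qlo or lo >= qhi:
        return []                       # cell does not overlap the query
    if budget == 0 or not cell.children:
        evidence = encode(cell)         # "stay": answer from this cell
        cache[cell.span] = evidence
        return [evidence]
    out = []                            # "zoom in": fan out to child cells
    for child in cell.children:
        out.extend(explore(child, query_span, budget - 1))
    return out

# Two-level toy video: a query about seconds 70-80 skips the first half.
root = Cell((0, 120), [Cell((0, 60)), Cell((60, 120))])
print(explore(root, (70, 80), budget=2))  # → ['evidence(60, 120)']
```

A second query over the same span would retrieve the leaf’s evidence from `cache` instead of re‑encoding it, which is the mechanism behind the reported cache‑hit savings.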

Results & Findings

| Benchmark (duration) | Baseline (linear compute) | Video‑RLM (log compute) | Accuracy Δ |
| --- | --- | --- | --- |
| 1 hour | 1× compute | 0.9× compute | –0.2 % |
| 5 hours | 5× compute | 1.3× compute | –0.5 % |
| 10 hours | 10× compute | 1.7× compute | –0.8 % |
  • Compute Growth: As video length increases tenfold, Video‑RLM’s compute grows only ~1.9× (from 0.9× to 1.7× in the table above), confirming logarithmic scaling.
  • Cache Effect: Across all runs, 30‑60 % of cell embeddings were retrieved from cache, shaving additional latency.
  • Depth Budgeting: Limiting recursion depth to 4 levels kept latency under 2 s per query while losing <1 % accuracy, illustrating a clean compute‑accuracy trade‑off.
  • Adaptive Behavior: For high‑level “what happened overall?” questions, the Master stayed at shallow levels; for “what object was in frame X at minute 42?” the Workers drilled down to the finest grid, automatically allocating more compute where needed.
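
A back‑of‑the‑envelope comparison illustrates why compute grows logarithmically: a linear baseline must touch every leaf clip, while hierarchical navigation touches roughly one cell per level. The leaf size and branching factor below are assumptions, not the paper’s settings:

```python
import math

# Cells touched per query: linear scan vs. one zoom-in path (illustrative).
def cells_visited_linear(video_seconds: float, leaf_seconds: float = 60) -> int:
    return math.ceil(video_seconds / leaf_seconds)

def cells_visited_hierarchical(video_seconds: float, leaf_seconds: float = 60,
                               branching: int = 10) -> int:
    # One cell inspected per level on the zoom-in path, plus the root.
    depth = max(0, math.ceil(math.log(video_seconds / leaf_seconds, branching)))
    return depth + 1

for hours in (1, 5, 10):
    secs = hours * 3600
    print(hours, cells_visited_linear(secs), cells_visited_hierarchical(secs))
# Linear work grows 60 → 600 cells across 1-10 hours; hierarchical work
# grows only 3 → 4 cells.
```

The actual ratios in the table also reflect per‑cell costs and cache hits, so they are flatter than this idealized count, but the trend is the same.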

Practical Implications

  • Scalable Video QA & Search: Developers can build assistants that answer questions about surveillance footage, sports replays, or lecture recordings without pre‑computing dense captions or embeddings for every frame.
  • Cost‑Effective Cloud Deployments: The logarithmic compute profile translates to predictable, low‑cost inference even for multi‑hour videos, making it feasible to expose video‑understanding APIs at scale.
  • Real‑Time Video Analytics: Because workers can operate in parallel and reuse cached cells, a live‑streaming pipeline could maintain a rolling VideoAtlas, enabling on‑the‑fly diagnostics (e.g., anomaly detection in industrial video feeds).
  • Modular Integration: VideoAtlas is task‑agnostic; the same grid can feed downstream models for summarization, captioning, or action detection, reducing the need for separate preprocessing pipelines.
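
For the live‑streaming use case, a rolling VideoAtlas could keep only a bounded window of recent clips and evict the oldest as new ones arrive. This is a minimal sketch of that idea; the class and its methods are hypothetical, not part of the paper:

```python
from collections import deque

# Hypothetical rolling atlas for live streams: a bounded deque of the most
# recent leaf clips; older clips fall off automatically.
class RollingAtlas:
    def __init__(self, max_leaves: int = 600, leaf_seconds: int = 60):
        self.leaf_seconds = leaf_seconds
        self.leaves = deque(maxlen=max_leaves)

    def append_clip(self, clip_id: str) -> None:
        self.leaves.append(clip_id)      # evicts the oldest clip when full

    def window_seconds(self) -> int:
        return len(self.leaves) * self.leaf_seconds

atlas = RollingAtlas(max_leaves=3)
for clip in ("c0", "c1", "c2", "c3"):
    atlas.append_clip(clip)
print(list(atlas.leaves))  # → ['c1', 'c2', 'c3']
```

Coarser cells over the window would need incremental rebuilding as leaves shift, which is where the cached per‑cell embeddings would again pay off.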

Limitations & Future Work

  • Memory Footprint: Storing raw pixel data for every grid cell can be memory‑intensive, especially for high‑resolution 4K video; the authors suggest lossy compression as a possible mitigation but haven’t evaluated its impact.
  • Grid Granularity Selection: Choosing the optimal spatial‑temporal granularity is still heuristic; adaptive grid refinement based on scene dynamics remains an open problem.
  • Generalization to Unstructured Domains: The current MDP assumes relatively stable video semantics; highly chaotic or rapidly changing scenes (e.g., fast‑paced video games) may require more sophisticated navigation policies.
  • Benchmark Diversity: Experiments focus on hour‑scale benchmarks; extending evaluation to streaming, multi‑camera setups, or multimodal (audio‑visual) tasks is left for future research.

VideoAtlas opens a promising path toward truly scalable, lossless video understanding—turning hours of footage into a navigable map that language models can explore with minimal compute overhead.

Authors

  • Mohamed Eltahir
  • Ali Habibullah
  • Yazan Alshoibi
  • Lama Ayash
  • Tanveer Hussain
  • Naeemullah Khan

Paper Information

  • arXiv ID: 2603.17948v1
  • Categories: cs.CV, cs.AI
  • Published: March 18, 2026