[Paper] Recurrent Video Masked Autoencoders
Source: arXiv - 2512.13684v1
Overview
The paper introduces Recurrent Video Masked Autoencoders (RVM), a new way to learn video representations using a transformer‑based recurrent network that aggregates dense image features over time. By framing learning as an asymmetric masked‑pixel reconstruction task, RVM delivers a single “generalist” encoder that rivals state‑of‑the‑art video models on action‑recognition and tracking while also excelling on dense‑spatial tasks traditionally dominated by image‑only models.
Key Contributions
- Recurrent architecture for video: Replaces costly full spatio‑temporal attention with a lightweight recurrent transformer that propagates features frame‑by‑frame, keeping computation linear in video length.
- Asymmetric masked prediction: Only the future frames are masked, allowing the model to learn temporal dynamics from a simple pixel‑reconstruction loss—no extra supervision or distillation needed.
- Parameter efficiency: Small‑scale RVM models achieve up to 30× better parameter efficiency than competing video MAE approaches while matching or surpassing their accuracy.
- Unified encoder: A single pretrained backbone performs competitively on both video‑level tasks (action classification, point/object tracking) and dense‑spatial tasks (geometry, segmentation) without task‑specific finetuning.
- Stable long‑range feature propagation: Demonstrates that recurrent updates remain coherent over long sequences, addressing the drift problems seen in vanilla recurrent nets.
- Qualitative insights: Visualizations reveal that RVM captures scene semantics, motion patterns, and structural cues, confirming that the learned embeddings are rich and interpretable.
Methodology
- Backbone encoder: A standard Vision Transformer (ViT) processes each video frame independently, producing a dense grid of patch embeddings.
- Recurrent aggregation: A lightweight transformer‑style recurrent module takes the current frame’s embeddings and the hidden state from the previous frame, updating the hidden state with a cross‑attention between the two. This yields a temporally‑aware representation for the current frame while keeping the cost O(T·N) (T = frames, N = patches).
- Masked reconstruction objective: For each training clip, a random subset of patches in the future frames is masked out. The model must reconstruct the missing pixel values from the unmasked patches and the recurrent hidden state, using a simple L2 pixel loss. Because only the future is masked, the network learns to predict upcoming visual content, implicitly capturing motion and temporal context (a minimal sketch of this update and loss follows this list).
- Training regime: No extra supervision (e.g., optical flow, labels) or knowledge‑distillation tricks are used. The model is trained on large‑scale video datasets (e.g., Kinetics‑400) with standard data augmentations.
- Fine‑tuning: After pretraining, the recurrent encoder can be frozen or fine‑tuned for downstream tasks. For classification, a simple linear head is attached; for tracking, the dense embeddings are fed into a lightweight correlation tracker.
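The recurrent update and masked objective above can be pictured with a short, self-contained PyTorch sketch. This is an illustrative reconstruction under assumptions, not the authors' code: the module names (`RecurrentAggregator`, `masked_future_loss`), embedding width, head count, and residual/MLP details are all assumptions; only the overall pattern (cross-attention from the hidden state onto the current frame's patch tokens, plus an L2 loss restricted to masked future pixels) follows the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RecurrentAggregator(nn.Module):
    """Illustrative recurrent module: the hidden state cross-attends to the current frame."""

    def __init__(self, dim=384, heads=6):  # dim/heads are assumed values, not the paper's
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, state, frame_tokens):
        # state, frame_tokens: (B, N, D). The previous hidden state queries the
        # new frame's patch embeddings and is updated residually.
        update, _ = self.attn(query=state, key=frame_tokens, value=frame_tokens)
        state = self.norm(state + update)
        return state + self.mlp(state)


def masked_future_loss(pred, target, mask):
    # L2 loss computed only on masked (future) pixels; `mask` is 1 where
    # pixels were hidden from the model, 0 elsewhere.
    per_pixel = F.mse_loss(pred, target, reduction="none")
    return (per_pixel * mask).sum() / mask.sum().clamp(min=1)
```

In the full pipeline described above, the ViT encoder would supply `frame_tokens` for each frame, a lightweight decoder would map the updated state to pixel predictions for the masked future patches, and a linear head on the (frozen or fine-tuned) state would serve classification.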
Results & Findings
| Benchmark | RVM (small) | VideoMAE (large) | V‑JEPA | DINOv2 (image) |
|---|---|---|---|---|
| Kinetics‑400 Top‑1 (finetune) | 78.3 % | 80.1 % | 79.5 % | – |
| Something‑Something‑V2 (action) | 61.2 % | 62.8 % | 62.0 % | – |
| UAV123 (object tracking) | 71.5 % AO | 70.9 % AO | 70.2 % AO | – |
| COCO‑Stuff (dense segmentation) | 45.8 % mIoU | – | – | 44.7 % mIoU |
| Parameter count | 22 M | 86 M | 84 M | 300 M (ViT‑L) |
- Competitive accuracy despite being 3–4× smaller than VideoMAE/V‑JEPA.
- Linear scaling: Inference time grows linearly with video length, whereas full spatio‑temporal attention grows quadratically with the number of frames (see the back‑of‑envelope sketch after this list).
- Robust long‑range predictions: Feature similarity remains high (>0.85 cosine) across 60‑frame horizons, indicating stable temporal propagation.
- Qualitative: Attention maps highlight moving objects and scene layout, confirming that the model learns both motion cues and geometric structure.
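As a rough illustration of the linear-versus-quadratic scaling point, the following back-of-envelope count compares token interactions; the frame and patch counts are arbitrary assumptions, not numbers from the paper.

```python
# Illustrative token-interaction counts (assumed sizes, not paper numbers).
T, N = 64, 196                     # frames, patches per frame (e.g. a 14x14 grid)

full_attention = (T * N) ** 2      # every spatio-temporal token attends to every other token
recurrent = T * N * N              # per frame: N state queries attend to N frame tokens

print(full_attention / recurrent)  # = T, i.e. 64x fewer interactions in this example
```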
Practical Implications
- Edge & mobile deployment: The small‑parameter, linear‑time design makes RVM ideal for on‑device video analytics (e.g., real‑time action detection on smartphones or drones).
- Unified pipeline: Teams can use a single pretrained encoder for diverse downstream tasks—classification, tracking, segmentation—reducing engineering overhead and storage costs.
- Scalable video indexing: Because the recurrent encoder processes streams frame by frame, it fits naturally into streaming pipelines for video search or content moderation without buffering large clips (see the streaming sketch after this list).
- Accelerated research prototyping: The simple pixel‑reconstruction loss eliminates the need for costly multi‑task pretraining or teacher models, allowing rapid iteration on new video datasets.
- Potential for multimodal extensions: The recurrent backbone can be paired with audio or text streams, opening doors to unified video‑audio‑text representation learning with minimal extra compute.
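To make the streaming point concrete, here is a minimal sketch of a frame-by-frame inference loop. `vit_encoder` and `aggregator` are hypothetical stand-ins for the pretrained per-frame encoder and the recurrent module, and the tensor shapes are assumptions.

```python
import torch


def stream_features(frames, vit_encoder, aggregator, state=None):
    """Yield a temporally aware feature grid per frame with constant memory.

    Only the recurrent state (one B x N x D token grid) is carried across
    frames, so whole clips never need to be buffered.
    """
    for frame in frames:                 # frame: (B, 3, H, W) tensor from a camera/decoder
        with torch.no_grad():            # inference only
            tokens = vit_encoder(frame)  # (B, N, D) patch embeddings for this frame
            state = tokens if state is None else aggregator(state, tokens)
        yield state                      # feed into a linear head, tracker, segmenter, ...
```

Each yielded state can be consumed immediately by a downstream head, which is what makes the design attractive for on-device and streaming settings.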
Limitations & Future Work
- Masking strategy is still uniform random: More sophisticated spatio‑temporal masking (e.g., motion‑aware) could further boost performance.
- No explicit handling of variable frame rates: The recurrent module assumes a fixed temporal stride; adapting to irregular video capture would require additional temporal modeling.
- Benchmarks limited to relatively short clips: While the recurrent design scales linearly, empirical evaluation on ultra‑long videos (e.g., hours‑long surveillance) remains to be explored.
- Future directions suggested by the authors include: integrating hierarchical recurrence (multi‑scale temporal states), combining RVM with contrastive objectives for better cross‑modal alignment, and extending the framework to self‑supervised video captioning or reinforcement‑learning agents that need compact, temporally‑aware visual embeddings.
Authors
- Daniel Zoran
- Nikhil Parthasarathy
- Yi Yang
- Drew A. Hudson
- Joao Carreira
- Andrew Zisserman
Paper Information
- arXiv ID: 2512.13684v1
- Categories: cs.CV
- Published: December 15, 2025