[Paper] Exploring High-Order Self-Similarity for Video Understanding
Source: arXiv - 2604.20760v1
Overview
The paper “Exploring High‑Order Self‑Similarity for Video Understanding” proposes a new way to capture motion patterns in video: instead of stopping at first‑order self‑similarity between spatio‑temporal positions, it recursively computes similarities of the similarity maps themselves. By stacking these “higher‑order” maps, the authors build a lightweight plug‑in—Multi‑Order Self‑Similarity (MOSS)—that can be dropped into existing video models to boost their temporal reasoning with almost no extra compute.
Key Contributions
- Higher‑order space‑time self‑similarity (STSS): Shows that similarity maps of order > 1 expose complementary motion cues (e.g., acceleration, periodicity) that first‑order STSS misses.
- MOSS module: A compact neural block that extracts, learns, and fuses multi‑order STSS features; can be attached to any backbone (CNN, Transformer, etc.).
- Broad empirical validation: Demonstrates consistent gains on three very different tasks—action classification, motion‑centric video VQA, and real‑world robot perception—while adding < 2 % FLOPs and < 5 MB memory.
- Open‑source release: Code, pretrained checkpoints, and a simple API for plugging MOSS into popular video libraries (PyTorchVideo, MMAction2).
Methodology
- Space‑time self‑similarity (STSS): For a video tensor \(X \in \mathbb{R}^{T \times H \times W \times C}\), the first‑order STSS is computed by correlating each spatio‑temporal patch with every other patch, yielding a 4‑D similarity volume.
- Higher‑order STSS: The authors recursively apply the same correlation operation on the similarity volume itself.
  - Second‑order STSS captures how similarity patterns evolve over time (e.g., a moving object that speeds up).
  - Third‑order and beyond can model more complex dynamics like oscillations or repetitive gestures.
- MOSS block:
  - Extraction: A set of 1×1 convolutions reduces the dimensionality of each STSS order.
  - Learning: Separate lightweight MLPs (or depthwise convolutions) learn order‑specific embeddings.
  - Fusion: Learned embeddings are summed/concatenated and passed through a final linear layer that produces a temporal feature map compatible with the host backbone.
- Integration: MOSS can be inserted after any intermediate feature stage (e.g., after a ResNet‑3D block or a Vision Transformer token mixer). Because the similarity calculations are performed on already‑extracted features, the extra cost is modest.
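The recursive construction described above can be sketched in a few lines. The following NumPy illustration is simplified to one descriptor per frame rather than full spatio‑temporal patches (so each order is a (T, T) map instead of a 4‑D volume); all shapes here are illustrative, not the paper's exact implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy video features: T frames, each reduced to a C-dim descriptor.
T, C = 8, 16
X = rng.standard_normal((T, C))

def self_similarity(F):
    """Cosine similarity of every row of F with every other row,
    giving an (N, N) self-similarity map."""
    Fn = F / np.linalg.norm(F, axis=1, keepdims=True)
    return Fn @ Fn.T

S1 = self_similarity(X)    # order 1: frame-to-frame similarity, (T, T)
S2 = self_similarity(S1)   # order 2: correlate rows of S1 itself
S3 = self_similarity(S2)   # order 3: and so on, recursively

print(S1.shape, S2.shape, S3.shape)
```

Each row of `S1` is a frame's "similarity profile" over the whole clip; correlating those profiles (order 2) describes how similarity patterns themselves relate, which is what lets higher orders encode acceleration or periodicity.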
Results & Findings
| Task | Baseline | +MOSS | Δ (absolute) | Δ (relative) |
|---|---|---|---|---|
| Kinetics‑400 (action recognition) | 78.2 % top‑1 | 80.5 % | +2.3 % | +2.9 % |
| MSRVTT‑QA (motion‑centric VQA) | 44.1 % | 47.8 % | +3.7 % | +8.4 % |
| Real‑world robot grasping (sim‑to‑real) | 71.5 % success | 76.2 % | +4.7 % | +6.6 % |
| Compute overhead | — | +1.8 % FLOPs | — | — |
| Memory increase | — | +4.2 MB | — | — |
Takeaway: Across very different domains, adding MOSS yields consistent relative improvements of roughly 3–8 % while keeping the model lightweight. Ablation studies confirm that each order contributes uniquely—removing the second‑order term drops performance by ~1 %, and removing the third‑order term costs another ~0.5 %.
Practical Implications
- Plug‑and‑play temporal boost: Developers can upgrade existing video pipelines (e.g., video analytics, AR/VR content moderation) by inserting a single MOSS layer without redesigning the whole architecture.
- Edge‑friendly: The marginal FLOP and memory increase make MOSS suitable for on‑device inference on smartphones, drones, or embedded robotics platforms where power budgets are tight.
- Better motion reasoning for downstream AI: Tasks that rely on subtle dynamics—gesture control, sports analytics, autonomous navigation—can benefit from the richer temporal descriptors that higher‑order STSS provides.
- Unified code base: Since the authors release a PyTorch module with a simple `MOSS(in_channels, orders=[1,2,3])` API, integrating it into frameworks like Detectron2‑Video or TensorFlow Hub is straightforward.
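To make the Extraction → Learning → Fusion pipeline concrete, here is a minimal, framework‑agnostic sketch of such a block in plain NumPy. This is a hypothetical re‑implementation, not the authors' API: matrix products stand in for 1×1 convolutions, `tanh` for the order‑specific MLPs, and all dimensions are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

class MossSketch:
    """Illustrative multi-order self-similarity fusion block
    (hypothetical; not the released MOSS module)."""

    def __init__(self, t, d_embed=8, orders=(1, 2, 3)):
        self.orders = orders
        # Per-order projection ("1x1 conv" analogue) for Extraction.
        self.proj = {k: rng.standard_normal((t, d_embed)) * 0.1
                     for k in orders}
        # Final linear layer fusing the concatenated embeddings.
        self.fuse = rng.standard_normal((d_embed * len(orders), d_embed)) * 0.1

    def __call__(self, feats):
        # feats: (T, C) per-frame features from the host backbone.
        Fn = feats / np.linalg.norm(feats, axis=1, keepdims=True)
        S = Fn @ Fn.T                      # order-1 STSS map, (T, T)
        embeds = []
        for k in self.orders:
            z = S @ self.proj[k]           # Extraction: reduce order-k map
            embeds.append(np.tanh(z))      # Learning: order-specific embedding
            Sn = S / np.linalg.norm(S, axis=1, keepdims=True)
            S = Sn @ Sn.T                  # next order: self-similarity of S
        # Fusion: concatenate and project to a temporal feature map.
        return np.concatenate(embeds, axis=1) @ self.fuse  # (T, d_embed)

T, C = 8, 32
out = MossSketch(t=T)(rng.standard_normal((T, C)))
print(out.shape)
```

The output has one embedding per frame, which is why the block can be dropped after any backbone stage that exposes per‑frame (or per‑token) features.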
Limitations & Future Work
- Scalability to very long videos: Computing similarity volumes grows quadratically with the number of frames; the current implementation caps at ~32 frames and uses temporal down‑sampling for longer clips.
- Order selection is heuristic: The paper experiments with up to third‑order STSS; higher orders may capture even richer dynamics but also risk over‑fitting and increased cost. An adaptive mechanism to select the optimal order per video is an open question.
- Domain‑specific tuning: While MOSS works out‑of‑the‑box on several benchmarks, optimal placement (which backbone stage) and hyper‑parameters still require modest task‑specific tuning.
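The quadratic‑growth limitation is easy to quantify. A back‑of‑the‑envelope calculation (the 14×14 feature‑map resolution is an assumption for illustration, not from the paper) shows why the implementation caps clips at ~32 frames:

```python
# A first-order similarity volume correlating every spatio-temporal
# position with every other one has N = T*H*W positions and N**2
# entries -- quadratic in T for a fixed spatial resolution.
H = W = 14  # assumed feature-map resolution for illustration
for T in (8, 16, 32, 64):
    n = T * H * W
    entries = n * n
    mib = entries * 4 / 2**20  # float32 storage
    print(f"T={T:3d}: {entries:>13,d} entries = {mib:,.0f} MiB")
```

Going from 32 to 64 frames quadruples the volume (roughly 150 MiB to 600 MiB per clip at this resolution), which motivates the temporal down‑sampling the authors use for longer videos and the low‑rank approximations they list as future work.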
Future directions include efficient approximations (e.g., low‑rank factorization of similarity tensors), dynamic order scheduling during inference, and extending MOSS to multimodal streams (audio‑visual self‑similarity).
If you’re building video‑centric products and want a quick win on temporal modeling, give MOSS a try—its modest footprint and strong empirical gains make it a compelling addition to modern video AI stacks.
Authors
- Manjin Kim
- Heeseung Kwon
- Karteek Alahari
- Minsu Cho
Paper Information
- arXiv ID: 2604.20760v1
- Categories: cs.CV
- Published: April 22, 2026