[Paper] Exploring High-Order Self-Similarity for Video Understanding
Source: arXiv - 2604.20760v1
Overview
The paper “Exploring High‑Order Self‑Similarity for Video Understanding” proposes a new way to capture motion patterns in video: instead of stopping at first‑order self‑similarity between spatio‑temporal positions, it recursively computes similarities of the similarity maps themselves. By stacking these “higher‑order” maps, the authors build a lightweight plug‑in—Multi‑Order Self‑Similarity (MOSS)—that can be dropped into existing video models to boost their temporal reasoning with almost no extra compute.
Key Contributions
- Higher‑order space‑time self‑similarity (STSS): Shows that similarity maps of order > 1 expose complementary motion cues (e.g., acceleration, periodicity) that first‑order STSS misses.
- MOSS module: A compact neural block that extracts, learns, and fuses multi‑order STSS features; can be attached to any backbone (CNN, Transformer, etc.).
- Broad empirical validation: Demonstrates consistent gains on three very different tasks—action classification, motion‑centric video VQA, and real‑world robot perception—while adding < 2 % FLOPs and < 5 MB memory.
- Open‑source release: Code, pretrained checkpoints, and a simple API for plugging MOSS into popular video libraries (PyTorchVideo, MMAction2).
Methodology
- Space‑time self‑similarity (STSS): For a video tensor \(X \in \mathbb{R}^{T \times H \times W \times C}\), the first‑order STSS is computed by correlating each spatio‑temporal patch with every other patch, yielding a 4‑D similarity volume.
- Higher‑order STSS: The authors recursively apply the same correlation operation on the similarity volume itself.
  - Second‑order STSS captures how similarity patterns evolve over time (e.g., a moving object that speeds up).
  - Third‑order and beyond can model more complex dynamics like oscillations or repetitive gestures.
- MOSS block:
  - Extraction: A set of 1×1 convolutions reduces the dimensionality of each STSS order.
  - Learning: Separate lightweight MLPs (or depthwise convolutions) learn order‑specific embeddings.
  - Fusion: Learned embeddings are summed/concatenated and passed through a final linear layer that produces a temporal feature map compatible with the host backbone.
- Integration: MOSS can be inserted after any intermediate feature stage (e.g., after a ResNet‑3D block or a Vision Transformer token mixer). Because the similarity calculations are performed on already‑extracted features, the extra cost is modest.
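The recursive construction described above can be sketched in a few lines. The following NumPy illustration is simplified to one descriptor per frame rather than full spatio‑temporal patches (so each order is a (T, T) map instead of a 4‑D volume); all shapes here are illustrative, not the paper's exact implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy video features: T frames, each reduced to a C-dim descriptor.
T, C = 8, 16
X = rng.standard_normal((T, C))

def self_similarity(F):
    """Cosine similarity of every row of F with every other row,
    giving an (N, N) self-similarity map."""
    Fn = F / np.linalg.norm(F, axis=1, keepdims=True)
    return Fn @ Fn.T

S1 = self_similarity(X)    # order 1: frame-to-frame similarity, (T, T)
S2 = self_similarity(S1)   # order 2: correlate rows of S1 itself
S3 = self_similarity(S2)   # order 3: and so on, recursively

print(S1.shape, S2.shape, S3.shape)
```

Each row of `S1` is a frame's "similarity profile" over the whole clip; correlating those profiles (order 2) describes how similarity patterns themselves relate, which is what lets higher orders encode acceleration or periodicity.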
Results & Findings
| Task | Baseline | +MOSS | Δ (absolute) | Δ (relative) |
|---|---|---|---|---|
| Kinetics‑400 (action recognition) | 78.2 % top‑1 | 80.5 % | +2.3 % | +2.9 % |
| MSRVTT‑QA (motion‑centric VQA) | 44.1 % | 47.8 % | +3.7 % | +8.4 % |
| Real‑world robot grasping (sim‑to‑real) | 71.5 % success | 76.2 % | +4.7 % | +6.6 % |
| Compute overhead | — | +1.8 % FLOPs | — | — |
| Memory increase | — | +4.2 MB | — | — |
Takeaway: Across very different domains, adding MOSS yields consistent relative improvements of roughly 3–8 % while keeping the model lightweight. Ablation studies confirm that each order contributes uniquely—removing the second‑order term drops performance by ~1 %, and removing the third‑order term costs another ~0.5 %.
Practical Implications
- Plug‑and‑play temporal boost: Developers can upgrade existing video pipelines (e.g., video analytics, AR/VR content moderation) by inserting a single MOSS layer without redesigning the whole architecture.
- Edge‑friendly: The marginal FLOP and memory increase make MOSS suitable for on‑device inference on smartphones, drones, or embedded robotics platforms where power budgets are tight.
- Better motion reasoning for downstream AI: Tasks that rely on subtle dynamics—gesture control, sports analytics, autonomous navigation—can benefit from the richer temporal descriptors that higher‑order STSS provides.
- Unified code base: Since the authors release a PyTorch module with a simple `MOSS(in_channels, orders=[1,2,3])` API, integrating it into frameworks like Detectron2‑Video or TensorFlow Hub is straightforward.
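To make the Extraction → Learning → Fusion pipeline concrete, here is a minimal, framework‑agnostic sketch of such a block in plain NumPy. This is a hypothetical re‑implementation, not the authors' API: matrix products stand in for 1×1 convolutions, `tanh` for the order‑specific MLPs, and all dimensions are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

class MossSketch:
    """Illustrative multi-order self-similarity fusion block
    (hypothetical; not the released MOSS module)."""

    def __init__(self, t, d_embed=8, orders=(1, 2, 3)):
        self.orders = orders
        # Per-order projection ("1x1 conv" analogue) for Extraction.
        self.proj = {k: rng.standard_normal((t, d_embed)) * 0.1
                     for k in orders}
        # Final linear layer fusing the concatenated embeddings.
        self.fuse = rng.standard_normal((d_embed * len(orders), d_embed)) * 0.1

    def __call__(self, feats):
        # feats: (T, C) per-frame features from the host backbone.
        Fn = feats / np.linalg.norm(feats, axis=1, keepdims=True)
        S = Fn @ Fn.T                      # order-1 STSS map, (T, T)
        embeds = []
        for k in self.orders:
            z = S @ self.proj[k]           # Extraction: reduce order-k map
            embeds.append(np.tanh(z))      # Learning: order-specific embedding
            Sn = S / np.linalg.norm(S, axis=1, keepdims=True)
            S = Sn @ Sn.T                  # next order: self-similarity of S
        # Fusion: concatenate and project to a temporal feature map.
        return np.concatenate(embeds, axis=1) @ self.fuse  # (T, d_embed)

T, C = 8, 32
out = MossSketch(t=T)(rng.standard_normal((T, C)))
print(out.shape)
```

The output has one embedding per frame, which is why the block can be dropped after any backbone stage that exposes per‑frame (or per‑token) features.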
Limitations & Future Work
- Scalability to very long videos: Computing similarity volumes grows quadratically with the number of frames; the current implementation caps at ~32 frames and uses temporal down‑sampling for longer clips.
- Order selection is heuristic: The paper experiments with up to third‑order STSS; higher orders may capture even richer dynamics but also risk over‑fitting and increased cost. An adaptive mechanism to select the optimal order per video is an open question.
- Domain‑specific tuning: While MOSS works out‑of‑the‑box on several benchmarks, optimal placement (which backbone stage) and hyper‑parameters still require modest task‑specific tuning.
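The quadratic‑growth limitation is easy to quantify. A back‑of‑the‑envelope calculation (the 14×14 feature‑map resolution is an assumption for illustration, not from the paper) shows why the implementation caps clips at ~32 frames:

```python
# A first-order similarity volume correlating every spatio-temporal
# position with every other one has N = T*H*W positions and N**2
# entries -- quadratic in T for a fixed spatial resolution.
H = W = 14  # assumed feature-map resolution for illustration
for T in (8, 16, 32, 64):
    n = T * H * W
    entries = n * n
    mib = entries * 4 / 2**20  # float32 storage
    print(f"T={T:3d}: {entries:>13,d} entries = {mib:,.0f} MiB")
```

Going from 32 to 64 frames quadruples the volume (roughly 150 MiB to 600 MiB per clip at this resolution), which motivates the temporal down‑sampling the authors use for longer videos and the low‑rank approximations they list as future work.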
Future directions include efficient approximations (e.g., low‑rank factorization of similarity tensors), dynamic order scheduling during inference, and extending MOSS to multimodal streams (audio‑visual self‑similarity).
If you’re building video‑centric products and want a quick win on temporal modeling, give MOSS a try—its modest footprint and strong empirical gains make it a compelling addition to modern video AI stacks.
Authors
- Manjin Kim
- Heeseung Kwon
- Karteek Alahari
- Minsu Cho
Paper Information
- arXiv ID: 2604.20760v1
- Categories: cs.CV
- Published: April 22, 2026