[Paper] CoPE-VideoLM: Codec Primitives For Efficient Video Language Models

Published: February 13, 2026 at 01:57 PM EST
4 min read
Source: arXiv - 2602.13191v1

Overview

The paper CoPE‑VideoLM introduces a new way to feed video data into large language models by exploiting the native building blocks of video codecs—motion vectors and residuals—rather than processing every frame as a full‑resolution image. This dramatically cuts the computational cost while preserving, and in many cases improving, performance on a wide range of video‑understanding tasks.

Key Contributions

  • Codec‑centric representation: Leverages motion vectors and residuals from standard video codecs as lightweight “tokens” for non‑keyframes, avoiding expensive full‑frame encoding.
  • Hybrid encoder architecture: Combines a small transformer that ingests codec primitives with a conventional image encoder for keyframes, and aligns their latent spaces via a dedicated pre‑training stage.
  • Efficiency gains: Achieves up to 86 % reduction in time‑to‑first‑token and 93 % fewer tokens compared to traditional VideoLM pipelines.
  • Performance parity or improvement: Matches or exceeds state‑of‑the‑art results on 14 video‑understanding benchmarks covering QA, temporal reasoning, long‑form comprehension, and spatial scene analysis.
  • Scalable density control: Allows developers to trade off keyframe frequency versus codec‑primitive density, tailoring compute budgets to specific application needs.
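The density trade-off in the last bullet can be made concrete with a back-of-the-envelope token budget. The per-frame token counts below are illustrative assumptions, not figures from the paper:

```python
def tokens_per_second(fps: float, keyframe_fps: float,
                      tokens_per_keyframe: int,
                      tokens_per_primitive_frame: int) -> float:
    """Estimate the visual token budget for one second of video.

    Keyframes pass through the full image encoder; every other frame
    contributes only a handful of codec-primitive tokens.
    """
    inter_fps = fps - keyframe_fps
    return (keyframe_fps * tokens_per_keyframe
            + inter_fps * tokens_per_primitive_frame)

# A conventional pipeline encodes every frame with the image encoder:
dense = tokens_per_second(fps=24, keyframe_fps=24,
                          tokens_per_keyframe=256,
                          tokens_per_primitive_frame=0)

# A codec-primitive pipeline keeps 1 keyframe/s and cheap tokens elsewhere:
sparse = tokens_per_second(fps=24, keyframe_fps=1,
                           tokens_per_keyframe=256,
                           tokens_per_primitive_frame=8)

print(dense, sparse)  # 6144.0 vs 440.0 under these assumptions
```

Halving `keyframe_fps` or `tokens_per_primitive_frame` in this model shows how the compute budget scales roughly linearly with each knob.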

Methodology

  1. Keyframe selection: A small subset of frames (e.g., 1‑2 fps) is still processed with a full‑image encoder (ViT‑style) to capture high‑level visual semantics.
  2. Codec primitive extraction: For every intervening frame, the video’s compressed bitstream is parsed to obtain motion vectors (indicating pixel displacement) and residuals (the difference after motion compensation). These are already highly sparse and encode temporal changes efficiently.
  3. Codec‑primitive encoder: A lightweight transformer ingests the motion‑vector and residual tokens, producing a compact temporal representation.
  4. Cross‑modal alignment pre‑training: The model is first trained to align codec‑primitive embeddings with the image‑encoder embeddings on a large unlabeled video corpus, which speeds up later fine‑tuning.
  5. End‑to‑end fine‑tuning: The combined encoder feeds into a standard language model (e.g., LLaMA‑based) that is trained on downstream video‑language tasks.
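Step 4 is described only at a high level; one plausible realization is a CLIP-style contrastive objective that pulls codec-primitive embeddings toward the image-encoder embeddings of the same clip. Everything below, including the choice of loss, is an assumption for illustration rather than the paper's exact recipe:

```python
import numpy as np

def contrastive_alignment_loss(prim_emb, img_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired embeddings.

    prim_emb, img_emb: (batch, dim) arrays; row i of each comes from
    the same video clip, so the i-th diagonal entry is the positive pair.
    """
    # L2-normalize so the dot product is cosine similarity.
    p = prim_emb / np.linalg.norm(prim_emb, axis=1, keepdims=True)
    v = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    logits = p @ v.T / temperature            # (batch, batch) similarity matrix
    labels = np.arange(len(logits))

    def xent(l):
        # Cross-entropy of each row's softmax against the diagonal label.
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # Average both directions: primitives→images and images→primitives.
    return 0.5 * (xent(logits) + xent(logits.T))
```

When the two embedding spaces agree (identical inputs), the loss approaches zero; mismatched batches score much higher, which is the signal the pre-training stage would minimize.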

The whole pipeline can be visualized as:

Video → Codec (keyframes + motion/residuals) → Hybrid Encoder → LLM → Text Output
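Under the same assumptions, the routing in this pipeline reduces to a simple dispatch over frame types. The function names and token shapes here are hypothetical stand-ins, chosen only to show the token-count asymmetry:

```python
from dataclasses import dataclass

@dataclass
class Frame:
    index: int
    is_keyframe: bool
    payload: object  # pixels for keyframes, (motion_vectors, residuals) otherwise

def encode_video(frames, image_encoder, primitive_encoder):
    """Route each frame to the matching encoder and concatenate the tokens."""
    tokens = []
    for f in frames:
        if f.is_keyframe:
            tokens.extend(image_encoder(f.payload))      # expensive, few frames
        else:
            tokens.extend(primitive_encoder(f.payload))  # cheap, many frames
    return tokens  # fed to the LLM as the visual prefix

# Stub encoders: 256 tokens per keyframe vs 8 per codec-primitive frame.
frames = [Frame(i, is_keyframe=(i % 8 == 0), payload=None) for i in range(16)]
toks = encode_video(frames,
                    image_encoder=lambda p: ["img"] * 256,
                    primitive_encoder=lambda p: ["mv"] * 8)
print(len(toks))  # 2*256 + 14*8 = 624
```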

Results & Findings

| Metric | Baseline VideoLM | CoPE‑VideoLM (best config) |
| --- | --- | --- |
| Token count per second of video | 1,200 | 84 |
| Time‑to‑first‑token (ms) | 420 | 58 |
| Average accuracy on VQA‑style benchmarks | 71.3 % | 71.8 % |
| Temporal reasoning (NExT‑QA) | 58.2 % | 59.5 % |
| Long‑form video QA (HowToVQA) | 45.1 % | 45.6 % |

  • Efficiency: The token reduction translates directly into lower GPU memory usage and faster inference, making real‑time or on‑device deployment feasible.
  • Robustness to density changes: Even when keyframe frequency is halved, the model retains > 95 % of its original performance, thanks to the rich motion‑vector signal.
  • Generalization: The same pretrained encoder works across diverse domains (cooking videos, sports highlights, instructional clips) without task‑specific redesign.
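The headline reductions quoted earlier can be recomputed directly from the table's rows:

```python
# Token count: 1,200 → 84 tokens per second of video
token_reduction = 1 - 84 / 1200
# Time-to-first-token: 420 ms → 58 ms
ttft_reduction = 1 - 58 / 420

print(f"{token_reduction:.0%} fewer tokens")       # 93% fewer tokens
print(f"{ttft_reduction:.0%} lower time-to-first-token")
```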

Practical Implications

  • Cost‑effective video AI services: Cloud providers can serve more concurrent video‑LLM requests with the same hardware budget, reducing inference cost per hour.
  • Edge and mobile deployment: The lightweight codec‑primitive encoder fits within the memory constraints of modern smartphones and AR glasses, enabling on‑device video understanding (e.g., real‑time captioning, activity detection).
  • Simplified data pipelines: Since the approach reuses existing codec outputs, developers can skip costly frame‑extraction and image‑tokenization steps, integrating directly with streaming pipelines (e.g., WebRTC, RTMP).
  • Customizable latency‑accuracy trade‑offs: By adjusting keyframe spacing or selecting a subset of motion vectors, product teams can fine‑tune the balance between responsiveness and depth of understanding for interactive applications like video chat assistants or live sports analytics.
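One concrete knob for that trade-off is keeping only the largest motion vectors, since near-zero vectors carry little temporal signal. The sketch below is a pure-numpy assumption about how such subsampling might look, not the paper's selection rule:

```python
import numpy as np

def top_k_motion_vectors(mvs: np.ndarray, k: int) -> np.ndarray:
    """Keep the k motion vectors with the largest displacement.

    mvs: (n, 2) array of (dx, dy) pixel displacements.
    """
    magnitudes = np.linalg.norm(mvs, axis=1)
    keep = np.argsort(magnitudes)[-k:]  # indices of the k largest magnitudes
    return mvs[keep]

mvs = np.array([[0, 1], [5, 5], [0, 0], [3, -4], [1, 1]], dtype=float)
kept = top_k_motion_vectors(mvs, k=2)
# magnitudes: 1.0, 7.07, 0.0, 5.0, 1.41 → keeps [3, -4] and [5, 5]
```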

Limitations & Future Work

  • Codec dependency: The method assumes access to the video’s compressed bitstream; raw‑frame workflows (e.g., from cameras without encoding) would need an extra encoding step.
  • Loss of fine‑grained visual detail: While motion vectors capture motion well, subtle texture changes that are not reflected in residuals may be missed, potentially affecting tasks that rely on fine visual cues (e.g., facial expression analysis).
  • Generalization to exotic codecs: The paper focuses on H.264/H.265; extending to newer or proprietary codecs may require additional engineering.
  • Future directions: The authors suggest exploring learned compression primitives, adaptive keyframe selection based on scene dynamics, and tighter integration with multimodal LLMs that handle audio and text simultaneously.

Authors

  • Sayan Deb Sarkar
  • Rémi Pautrat
  • Ondrej Miksik
  • Marc Pollefeys
  • Iro Armeni
  • Mahdi Rad
  • Mihai Dusmanu

Paper Information

  • arXiv ID: 2602.13191v1
  • Categories: cs.CV, cs.AI, cs.CL
  • Published: February 13, 2026
  • PDF: Download PDF