[Paper] CoPE-VideoLM: Codec Primitives For Efficient Video Language Models

Published: February 13, 2026 at 01:57 PM EST
4 min read
Source: arXiv - 2602.13191v1

Overview

The paper CoPE‑VideoLM introduces a new way to feed video data into large language models by exploiting the native building blocks of video codecs—motion vectors and residuals—rather than processing every frame as a full‑resolution image. This dramatically cuts the computational cost while preserving, and in many cases improving, performance on a wide range of video‑understanding tasks.

Key Contributions

  • Codec‑centric representation: Leverages motion vectors and residuals from standard video codecs as lightweight “tokens” for non‑keyframes, avoiding expensive full‑frame encoding.
  • Hybrid encoder architecture: Combines a small transformer that ingests codec primitives with a conventional image encoder for keyframes, and aligns their latent spaces via a dedicated pre‑training stage.
  • Efficiency gains: Achieves up to 86 % reduction in time‑to‑first‑token and 93 % fewer tokens compared to traditional VideoLM pipelines.
  • Performance parity or improvement: Matches or exceeds state‑of‑the‑art results on 14 video‑understanding benchmarks covering QA, temporal reasoning, long‑form comprehension, and spatial scene analysis.
  • Scalable density control: Allows developers to trade off keyframe frequency versus codec‑primitive density, tailoring compute budgets to specific application needs.
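The density trade-off in the last bullet can be made concrete with a back-of-the-envelope token budget. The per-frame token counts below are illustrative assumptions, not figures from the paper:

```python
def tokens_per_second(fps: float, keyframe_fps: float,
                      tokens_per_keyframe: int,
                      tokens_per_primitive_frame: int) -> float:
    """Estimate the visual token budget for one second of video.

    Keyframes pass through the full image encoder; every other frame
    contributes only a handful of codec-primitive tokens.
    """
    inter_fps = fps - keyframe_fps
    return (keyframe_fps * tokens_per_keyframe
            + inter_fps * tokens_per_primitive_frame)

# A conventional pipeline encodes every frame with the image encoder:
dense = tokens_per_second(fps=24, keyframe_fps=24,
                          tokens_per_keyframe=256,
                          tokens_per_primitive_frame=0)

# A codec-primitive pipeline keeps 1 keyframe/s and cheap tokens elsewhere:
sparse = tokens_per_second(fps=24, keyframe_fps=1,
                           tokens_per_keyframe=256,
                           tokens_per_primitive_frame=8)

print(dense, sparse)  # 6144.0 vs 440.0 under these assumptions
```

Halving `keyframe_fps` or `tokens_per_primitive_frame` in this model shows how the compute budget scales roughly linearly with each knob.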

Methodology

  1. Keyframe selection: A small subset of frames (e.g., 1‑2 fps) is still processed with a full‑image encoder (ViT‑style) to capture high‑level visual semantics.
  2. Codec primitive extraction: For every intervening frame, the video’s compressed bitstream is parsed to obtain motion vectors (indicating pixel displacement) and residuals (the difference after motion compensation). These are already highly sparse and encode temporal changes efficiently.
  3. Codec‑primitive encoder: A lightweight transformer ingests the motion‑vector and residual tokens, producing a compact temporal representation.
  4. Cross‑modal alignment pre‑training: The model is first trained to align codec‑primitive embeddings with the image‑encoder embeddings on a large unlabeled video corpus, which speeds up later fine‑tuning.
  5. End‑to‑end fine‑tuning: The combined encoder feeds into a standard language model (e.g., LLaMA‑based) that is trained on downstream video‑language tasks.
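Step 4 is described only at a high level; one plausible realization is a CLIP-style contrastive objective that pulls codec-primitive embeddings toward the image-encoder embeddings of the same clip. Everything below, including the choice of loss, is an assumption for illustration rather than the paper's exact recipe:

```python
import numpy as np

def contrastive_alignment_loss(prim_emb, img_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired embeddings.

    prim_emb, img_emb: (batch, dim) arrays; row i of each comes from
    the same video clip, so the i-th diagonal entry is the positive pair.
    """
    # L2-normalize so the dot product is cosine similarity.
    p = prim_emb / np.linalg.norm(prim_emb, axis=1, keepdims=True)
    v = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    logits = p @ v.T / temperature            # (batch, batch) similarity matrix
    labels = np.arange(len(logits))

    def xent(l):
        # Cross-entropy of each row's softmax against the diagonal label.
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # Average both directions: primitives→images and images→primitives.
    return 0.5 * (xent(logits) + xent(logits.T))
```

When the two embedding spaces agree (identical inputs), the loss approaches zero; mismatched batches score much higher, which is the signal the pre-training stage would minimize.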

The whole pipeline can be visualized as:

Video → Codec (keyframes + motion/residuals) → Hybrid Encoder → LLM → Text Output
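Under the same assumptions, the routing in this pipeline reduces to a simple dispatch over frame types. The function names and token shapes here are hypothetical stand-ins, chosen only to show the token-count asymmetry:

```python
from dataclasses import dataclass

@dataclass
class Frame:
    index: int
    is_keyframe: bool
    payload: object  # pixels for keyframes, (motion_vectors, residuals) otherwise

def encode_video(frames, image_encoder, primitive_encoder):
    """Route each frame to the matching encoder and concatenate the tokens."""
    tokens = []
    for f in frames:
        if f.is_keyframe:
            tokens.extend(image_encoder(f.payload))      # expensive, few frames
        else:
            tokens.extend(primitive_encoder(f.payload))  # cheap, many frames
    return tokens  # fed to the LLM as the visual prefix

# Stub encoders: 256 tokens per keyframe vs 8 per codec-primitive frame.
frames = [Frame(i, is_keyframe=(i % 8 == 0), payload=None) for i in range(16)]
toks = encode_video(frames,
                    image_encoder=lambda p: ["img"] * 256,
                    primitive_encoder=lambda p: ["mv"] * 8)
print(len(toks))  # 2*256 + 14*8 = 624
```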

Results & Findings

| Metric | Baseline VideoLM | CoPE‑VideoLM (best config) |
| --- | --- | --- |
| Token count per second of video | 1,200 | 84 |
| Time‑to‑first‑token (ms) | 420 | 58 |
| Average accuracy on VQA‑style benchmarks | 71.3 % | 71.8 % |
| Temporal reasoning (NExT‑QA) | 58.2 % | 59.5 % |
| Long‑form video QA (HowToVQA) | 45.1 % | 45.6 % |

  • Efficiency: The token reduction translates directly into lower GPU memory usage and faster inference, making real‑time or on‑device deployment feasible.
  • Robustness to density changes: Even when keyframe frequency is halved, the model retains > 95 % of its original performance, thanks to the rich motion‑vector signal.
  • Generalization: The same pretrained encoder works across diverse domains (cooking videos, sports highlights, instructional clips) without task‑specific redesign.
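The headline reductions quoted earlier can be recomputed directly from the table's rows:

```python
# Token count: 1,200 → 84 tokens per second of video
token_reduction = 1 - 84 / 1200
# Time-to-first-token: 420 ms → 58 ms
ttft_reduction = 1 - 58 / 420

print(f"{token_reduction:.0%} fewer tokens")       # 93% fewer tokens
print(f"{ttft_reduction:.0%} lower time-to-first-token")
```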

Practical Implications

  • Cost‑effective video AI services: Cloud providers can serve more concurrent video‑LLM requests with the same hardware budget, reducing inference cost per hour.
  • Edge and mobile deployment: The lightweight codec‑primitive encoder fits within the memory constraints of modern smartphones and AR glasses, enabling on‑device video understanding (e.g., real‑time captioning, activity detection).
  • Simplified data pipelines: Since the approach reuses existing codec outputs, developers can skip costly frame‑extraction and image‑tokenization steps, integrating directly with streaming pipelines (e.g., WebRTC, RTMP).
  • Customizable latency‑accuracy trade‑offs: By adjusting keyframe spacing or selecting a subset of motion vectors, product teams can fine‑tune the balance between responsiveness and depth of understanding for interactive applications like video chat assistants or live sports analytics.
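One concrete knob for that trade-off is keeping only the largest motion vectors, since near-zero vectors carry little temporal signal. The sketch below is a pure-numpy assumption about how such subsampling might look, not the paper's selection rule:

```python
import numpy as np

def top_k_motion_vectors(mvs: np.ndarray, k: int) -> np.ndarray:
    """Keep the k motion vectors with the largest displacement.

    mvs: (n, 2) array of (dx, dy) pixel displacements.
    """
    magnitudes = np.linalg.norm(mvs, axis=1)
    keep = np.argsort(magnitudes)[-k:]  # indices of the k largest magnitudes
    return mvs[keep]

mvs = np.array([[0, 1], [5, 5], [0, 0], [3, -4], [1, 1]], dtype=float)
kept = top_k_motion_vectors(mvs, k=2)
# magnitudes: 1.0, 7.07, 0.0, 5.0, 1.41 → keeps [3, -4] and [5, 5]
```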

Limitations & Future Work

  • Codec dependency: The method assumes access to the video’s compressed bitstream; raw‑frame workflows (e.g., from cameras without encoding) would need an extra encoding step.
  • Loss of fine‑grained visual detail: While motion vectors capture motion well, subtle texture changes that are not reflected in residuals may be missed, potentially affecting tasks that rely on fine visual cues (e.g., facial expression analysis).
  • Generalization to exotic codecs: The paper focuses on H.264/H.265; extending to newer or proprietary codecs may require additional engineering.
  • Future directions: The authors suggest exploring learned compression primitives, adaptive keyframe selection based on scene dynamics, and tighter integration with multimodal LLMs that handle audio and text simultaneously.

Authors

  • Sayan Deb Sarkar
  • Rémi Pautrat
  • Ondrej Miksik
  • Marc Pollefeys
  • Iro Armeni
  • Mahdi Rad
  • Mihai Dusmanu

Paper Information

  • arXiv ID: 2602.13191v1
  • Categories: cs.CV, cs.AI, cs.CL
  • Published: February 13, 2026
  • PDF: Download PDF