[Paper] AdaCodec: A Predictive Visual Code for Video MLLMs

Published: June 1, 2026 at 01:56 PM EDT

Source: arXiv - 2606.02569v1

Overview

The paper introduces AdaCodec, a “predictive visual code” that lets video multimodal large language models (video MLLMs) encode only the information that actually changes between frames. A frame is encoded in full, as a reference, only when the scene cannot be reliably predicted from earlier context. Every other frame is represented by compact “P‑tokens” that capture motion and residuals instead of redundant RGB tokens. According to the reported results, this cuts the visual token budget by up to 7× (224k to 32k tokens) and average time‑to‑first‑token from 9.26 s to 1.62 s, while improving results across the 11 evaluated video benchmarks.

Key Contributions

Predictive visual coding paradigm: Formalizes the idea of sending a full reference frame only when necessary and otherwise encoding inter‑frame changes.
AdaCodec architecture: A lightweight encoder that decides, per frame, whether to emit full visual tokens or compact predictive tokens based on a conditional predictive cost.
Token‑budget efficiency: Achieves comparable or superior accuracy to a strong 224k‑token baseline while using as few as 32k tokens (≈ 1/7 of the budget).
Inference speedup: Reduces average time‑to‑first‑token from 9.26 s to 1.62 s (roughly 5.7× faster), making real‑time or near‑real‑time video‑LLM applications more feasible.
Broad benchmark validation: Improves results on all 11 evaluated video‑MLLM benchmarks, including long‑video understanding tasks where redundancy is most severe.

Methodology

Reference‑frame selection – For each incoming frame, AdaCodec estimates a conditional predictive cost: how hard it would be to reconstruct the frame from previous context. If the cost exceeds a learned threshold, the frame is treated as a new reference and encoded with the full set of visual tokens (the same tokenization used by existing video‑MLLMs).
Predictive token generation – When the cost is low, the model computes motion vectors and residuals (the difference between the predicted frame and the actual frame). These are quantized into a small number of P‑tokens that capture only the change information.
Adaptive token budgeting – A lightweight controller dynamically balances the mix of full‑frame tokens and P‑tokens to stay within a pre‑specified token budget while minimizing predictive error.
Integration with video‑MLLM – The token stream (full‑frame tokens + P‑tokens) is fed unchanged into an off‑the‑shelf video‑LLM (e.g., Qwen3‑VL‑8B). No architectural changes to the language model are required; AdaCodec acts as a pre‑processor that compresses the visual stream.
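
The reference‑versus‑predictive decision in the steps above can be sketched as a simple gating loop. This is a minimal illustration, not the authors' implementation. The cost function (mean absolute error of a predictor that just repeats the last reference frame), the fixed threshold, the names `predictive_cost` and `plan_stream`, and the per‑frame token counts are all assumptions chosen for readability. The paper uses a learned threshold and motion/residual P‑tokens whose exact form is not described in this summary.

```python
from typing import List, Optional, Sequence, Tuple

Frame = Sequence[float]  # toy stand-in: a frame flattened to pixel intensities


def predictive_cost(frame: Frame, reference: Frame) -> float:
    """Mean absolute error of a trivial predictor that repeats the reference.

    Hypothetical proxy for the paper's conditional predictive cost.
    """
    return sum(abs(a - b) for a, b in zip(frame, reference)) / len(frame)


def plan_stream(
    frames: List[Frame],
    threshold: float,
    full_tokens: int = 256,  # assumed token count for a reference frame
    p_tokens: int = 16,      # assumed token count for a predicted frame
) -> List[Tuple[str, int]]:
    """Label each frame 'REF' or 'P' and return (label, token_count) pairs."""
    plan: List[Tuple[str, int]] = []
    reference: Optional[Frame] = None
    for frame in frames:
        # The first frame, or any frame that is too hard to predict,
        # becomes a new reference and is encoded with full visual tokens.
        if reference is None or predictive_cost(frame, reference) > threshold:
            reference = frame
            plan.append(("REF", full_tokens))
        else:
            # Otherwise only the change relative to the reference is sent.
            plan.append(("P", p_tokens))
    return plan


frames = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.1, 0.0, 0.0],  # small change
    [0.0, 0.1, 0.1, 0.0],  # small change
    [1.0, 1.0, 1.0, 1.0],  # scene cut
    [1.0, 0.9, 1.0, 1.0],  # small change
]
plan = plan_stream(frames, threshold=0.25)
labels = [label for label, _ in plan]
total_tokens = sum(count for _, count in plan)
print(labels, total_tokens)  # ['REF', 'P', 'P', 'REF', 'P'] 560
```

In this toy stream only the first frame and the scene cut are encoded in full, so five frames cost 560 tokens rather than the 1,280 that five full frames would need under the same assumed counts.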

Results & Findings

Benchmark / metric	Baseline	AdaCodec	Accuracy Δ	Latency Δ
Long‑video QA (e.g., ActivityNet-QA)	224k tokens	32k tokens	+3.2 %	n/a
General‑video QA (5 datasets)	224k tokens	32k tokens	+1.5 % avg.	n/a
Time‑to‑first‑token	9.26 s	1.62 s	n/a	≈ 82 % lower (≈ 5.7× faster)

Token efficiency: With a 7× smaller token count (32k vs. 224k), AdaCodec is reported to match or outperform the full‑frame baseline across the evaluated benchmarks.
Scalability: Gains are larger on longer videos where frame‑to‑frame redundancy is higher.
Compatibility: Works with existing video‑LLM pipelines without retraining the language model, demonstrating that visual token compression alone can yield substantial benefits.
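
The headline ratios follow directly from the reported numbers. The short check below recomputes them; it assumes only that "k" means thousands of tokens, and the variable names are illustrative.

```python
# Figures as reported in this summary.
baseline_tokens = 224_000
adacodec_tokens = 32_000
baseline_ttft_s = 9.26   # time-to-first-token, baseline
adacodec_ttft_s = 1.62   # time-to-first-token, AdaCodec

token_ratio = baseline_tokens / adacodec_tokens
latency_reduction = 1 - adacodec_ttft_s / baseline_ttft_s
speedup = baseline_ttft_s / adacodec_ttft_s

print(f"token reduction: {token_ratio:.1f}x")         # token reduction: 7.0x
print(f"TTFT reduction:  {latency_reduction:.1%}")    # TTFT reduction:  82.5%
print(f"TTFT speedup:    {speedup:.2f}x")             # TTFT speedup:    5.72x
```

Note that an 82.5 % reduction in time‑to‑first‑token corresponds to a speedup of about 5.7×, not 1.8×; the two ways of stating the gain are easy to confuse.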

Practical Implications

Cost‑effective video AI services – Cloud providers can lower GPU memory and compute costs by feeding fewer visual tokens, enabling cheaper inference‑as‑a‑service for video QA, captioning, or summarization.
Real‑time applications – The dramatic reduction in latency opens doors for interactive use‑cases such as live video assistants, AR/VR overlays, or on‑device video understanding where bandwidth and compute are limited.
Edge deployment – Compact P‑tokens can be transmitted over bandwidth‑constrained links (e.g., cellular or IoT networks), allowing edge devices to send only predictive updates to a central LLM rather than full frames.
Framework integration – Since AdaCodec is a pre‑processing layer, developers can plug it into existing pipelines (e.g., Hugging Face Transformers, LangChain) with minimal code changes.

Limitations & Future Work

Predictive cost estimation relies on a learned threshold; sub‑optimal thresholds could either waste tokens on unnecessary reference frames or degrade visual fidelity.
Residual encoding quality may suffer on highly dynamic scenes (e.g., fast motion, rapid lighting changes), where the predictive model struggles.
The current evaluation focuses on English‑language benchmarks; multilingual or domain‑specific video data may expose new challenges.
Future research could explore joint training of the predictive encoder with the language model, richer motion representations (optical flow, depth), and adaptive token budgets that react to real‑time latency constraints.
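
The threshold trade‑off noted above can be made concrete with a small sweep: a low threshold spends tokens on many reference frames, while a high one saves tokens at the risk of poorer fidelity. The sketch below picks the lowest candidate threshold whose token count fits a budget. It is an illustrative stand‑in for a learned threshold or controller, not the paper's method. The cost proxy, the token counts, and the function names `count_tokens` and `lowest_threshold_within_budget` are assumptions.

```python
from typing import List, Optional, Sequence

Frame = Sequence[float]  # toy stand-in: a frame flattened to pixel intensities


def count_tokens(
    frames: List[Frame],
    threshold: float,
    full_tokens: int = 256,  # assumed token count for a reference frame
    p_tokens: int = 16,      # assumed token count for a predicted frame
) -> int:
    """Total tokens used when frames are gated by a fixed cost threshold."""
    reference: Optional[Frame] = None
    total = 0
    for frame in frames:
        if reference is None:
            cost = float("inf")  # the first frame is always a reference
        else:
            # Assumed cost proxy: mean absolute difference to the reference.
            cost = sum(abs(a - b) for a, b in zip(frame, reference)) / len(frame)
        if cost > threshold:
            reference = frame
            total += full_tokens
        else:
            total += p_tokens
    return total


def lowest_threshold_within_budget(
    frames: List[Frame], candidates: Sequence[float], budget: int
) -> Optional[float]:
    """Return the smallest candidate threshold that stays within the budget.

    A lower threshold means more reference frames, i.e. higher fidelity
    but more tokens, so the smallest feasible value is preferred.
    """
    for threshold in sorted(candidates):
        if count_tokens(frames, threshold) <= budget:
            return threshold
    return None  # no candidate fits the budget


frames = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.1, 0.0, 0.0],
    [0.0, 0.1, 0.1, 0.0],
    [1.0, 1.0, 1.0, 1.0],  # scene cut
    [1.0, 0.9, 1.0, 1.0],
]
candidates = [0.0, 0.03, 0.25, 1.0]
for t in candidates:
    print(t, count_tokens(frames, t))
best = lowest_threshold_within_budget(frames, candidates, budget=600)
print("chosen threshold:", best)
```

On this toy input the token count falls from 1,280 at threshold 0.0 to 320 at threshold 1.0, and a 600‑token budget selects 0.25, the lowest value that still treats the scene cut as a new reference.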

Authors

Haowen Hou
Zhen Huang
Zheming Liang
Qingyi Si
Chenglin Li
Shuai Dong
Kele Shao
Ruilin Li
Dianyi Wang
Nan Duan
Jiaqi Wang

Paper Information

arXiv ID: 2606.02569v1
Categories: cs.CV, cs.AI, cs.CL
Published: June 1, 2026
