[Paper] AdaCodec: A Predictive Visual Code for Video MLLMs
Source: arXiv - 2606.02569v1
Overview
The paper introduces AdaCodec, a “predictive visual code” that lets video multimodal large language models (video MLLMs) process only the information that changes between frames. A frame is encoded in full as a reference only when the scene cannot be reliably predicted from earlier context; otherwise AdaCodec replaces its redundant RGB tokens with compact “P‑tokens” that capture motion and residuals. The reported effect is a roughly 7× smaller token budget (32k instead of 224k) and a lower time‑to‑first‑token (1.62 s instead of 9.26 s), with improved results across the video benchmarks evaluated.
Key Contributions
- Predictive visual coding paradigm: Formalizes the idea of sending a full reference frame only when necessary and otherwise encoding inter‑frame changes.
- AdaCodec architecture: A lightweight encoder that decides, per frame, whether to emit full visual tokens or compact predictive tokens based on a conditional predictive cost.
- Token‑budget efficiency: Achieves comparable or superior accuracy to a strong 224k‑token baseline while using as few as 32k tokens (≈ 1/7 of the budget).
- Faster inference: Reduces average time‑to‑first‑token from 9.26 s to 1.62 s (about 5.7× faster), bringing near‑real‑time video‑LLM applications within reach.
- Broad benchmark validation: Improves results on all 11 evaluated video‑MLLM benchmarks, including long‑video understanding tasks where redundancy is most severe.
Methodology
- Reference‑frame selection – For each incoming frame, AdaCodec estimates a conditional predictive cost: how hard it would be to reconstruct the frame from previous context. If the cost exceeds a learned threshold, the frame is treated as a new reference and encoded with the full set of visual tokens (the same tokenization used by existing video‑MLLMs).
- Predictive token generation – When the cost is low, the model computes motion vectors and residuals (the difference between the predicted frame and the actual frame). These are quantized into a small number of P‑tokens that capture only the change information.
- Adaptive token budgeting – A lightweight controller dynamically balances the mix of full‑frame tokens and P‑tokens to stay within a pre‑specified token budget while minimizing predictive error.
- Integration with video‑MLLM – The token stream (full‑frame tokens + P‑tokens) is fed unchanged into an off‑the‑shelf video‑LLM (e.g., Qwen3‑VL‑8B). No architectural changes to the language model are required; AdaCodec acts as a pre‑processor that compresses the visual stream.
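The four steps above can be sketched as a toy pipeline. This is a minimal sketch under stated assumptions, not the authors' implementation. The names `predictive_cost`, `encode`, `FULL_TOKENS`, and `P_TOKENS` are hypothetical. The cost is a mean absolute difference against the last reference frame rather than a learned estimate, the threshold is passed in by hand rather than learned, and P‑tokens are only counted, not built from real motion vectors and residuals.

```python
from dataclasses import dataclass
from typing import List, Sequence

Frame = Sequence[float]  # toy stand-in: a flattened grayscale frame

FULL_TOKENS = 256  # assumed token cost of a full reference frame
P_TOKENS = 16      # assumed token cost of a predictive frame


@dataclass
class CodedFrame:
    index: int
    kind: str    # "ref" for full visual tokens, "pred" for P-tokens
    tokens: int


def predictive_cost(frame: Frame, reference: Frame) -> float:
    """Mean absolute difference to the last reference (a toy proxy, not a learned cost)."""
    return sum(abs(a - b) for a, b in zip(frame, reference)) / len(frame)


def encode(frames: List[Frame], threshold: float, budget: int) -> List[CodedFrame]:
    """Label each frame "ref" or "pred" without exceeding the token budget."""
    n = len(frames)
    if n == 0:
        return []
    if FULL_TOKENS + (n - 1) * P_TOKENS > budget:
        raise ValueError("budget too small for one reference plus P-tokens")
    coded: List[CodedFrame] = []
    reference = None
    used = 0
    for i, frame in enumerate(frames):
        later = n - i - 1
        # Only spend on a reference if every later frame can still get P-tokens.
        affordable = used + FULL_TOKENS + later * P_TOKENS <= budget
        hard = reference is None or predictive_cost(frame, reference) > threshold
        if hard and affordable:
            coded.append(CodedFrame(i, "ref", FULL_TOKENS))
            reference = frame
            used += FULL_TOKENS
        else:
            coded.append(CodedFrame(i, "pred", P_TOKENS))
            used += P_TOKENS
    return coded


# Three static frames, a scene cut, then three more static frames.
frames = [[0.0] * 4] * 3 + [[1.0] * 4] * 3
for budget in (1000, 400):
    coded = encode(frames, threshold=0.5, budget=budget)
    print(budget, [c.kind for c in coded], sum(c.tokens for c in coded))
```

With the generous budget, the scene cut at frame 3 triggers a second reference and the stream costs 576 tokens. With the tight budget, the controller cannot afford a second reference, so every frame after the first falls back to P‑tokens and the stream costs 336 tokens. This is the trade‑off between token budget and predictive error that the adaptive budgeting step is described as managing.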
Results & Findings
| Benchmark / metric | Baseline | AdaCodec | Accuracy Δ | Latency Δ |
|---|---|---|---|---|
| Long‑video QA (e.g., ActivityNet-QA) | 224k tokens | 32k tokens | +3.2 % | n/a |
| General‑video QA (5 datasets) | 224k tokens | 32k tokens | +1.5 % avg. | n/a |
| Time‑to‑first‑token | 9.26 s | 1.62 s | n/a | ≈ 82.5 % lower (≈ 5.7× faster) |
- Token efficiency: Even with a 7× reduction in token count, AdaCodec outperforms the full‑frame baseline on every benchmark.
- Scalability: Gains are larger on longer videos where frame‑to‑frame redundancy is higher.
- Compatibility: Works with existing video‑LLM pipelines without retraining the language model, demonstrating that visual token compression alone can yield substantial benefits.
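The headline ratios follow directly from the reported figures. A short check, assuming “224k” and “32k” denote 224,000 and 32,000 tokens (the ratio is the same if they denote multiples of 1,024):

```python
baseline_tokens = 224_000
adacodec_tokens = 32_000
baseline_ttft_s = 9.26
adacodec_ttft_s = 1.62

token_ratio = baseline_tokens / adacodec_tokens          # how many times fewer tokens
ttft_reduction = 1 - adacodec_ttft_s / baseline_ttft_s   # fraction of latency removed
ttft_speedup = baseline_ttft_s / adacodec_ttft_s         # how many times faster

print(f"token reduction: {token_ratio:.1f}x")   # token reduction: 7.0x
print(f"TTFT reduction: {ttft_reduction:.1%}")  # TTFT reduction: 82.5%
print(f"TTFT speedup: {ttft_speedup:.2f}x")     # TTFT speedup: 5.72x
```

Note that “82.5 % lower latency” and “5.7× faster” describe the same change. The first is the fraction of time removed and the second is the ratio of the two times.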
Practical Implications
- Cost‑effective video AI services – Cloud providers can lower GPU memory and compute costs by feeding fewer visual tokens, enabling cheaper inference‑as‑a‑service for video QA, captioning, or summarization.
- Real‑time applications – The dramatic reduction in latency opens doors for interactive use‑cases such as live video assistants, AR/VR overlays, or on‑device video understanding where bandwidth and compute are limited.
- Edge deployment – Compact P‑tokens can be transmitted over bandwidth‑constrained links (e.g., cellular or IoT uplinks), allowing edge devices to send only predictive updates to a central LLM rather than full frames.
- Framework integration – Since AdaCodec is a pre‑processing layer, developers can plug it into existing pipelines (e.g., Hugging Face Transformers, LangChain) with minimal code changes.
Limitations & Future Work
- Predictive cost estimation relies on a learned threshold: a threshold set too low wastes tokens on unnecessary reference frames, while one set too high skips references that are needed and degrades visual fidelity.
- Residual encoding quality may suffer on highly dynamic scenes (e.g., fast motion, rapid lighting changes), where the predictive model struggles.
- The current evaluation focuses on English‑language benchmarks; multilingual or domain‑specific video data may expose new challenges.
- Future research could explore joint training of the predictive encoder with the language model, richer motion representations (optical flow, depth), and adaptive token budgets that react to real‑time latency constraints.
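The threshold sensitivity noted in the first limitation can be illustrated with a small sweep. This is a hedged sketch: the per‑frame costs are made‑up values, `reference_stats` is a hypothetical helper, and the token costs of 256 and 16 are assumptions rather than figures from the paper. For simplicity each frame's cost is fixed in advance, whereas a real encoder would recompute it against the most recent reference.

```python
from typing import List, Tuple


def reference_stats(costs: List[float], threshold: float,
                    full_tokens: int = 256, p_tokens: int = 16) -> Tuple[int, int]:
    """Return (number of reference frames, total tokens) for a given threshold."""
    # The first frame is always a reference; later frames become one
    # only when their predictive cost exceeds the threshold.
    refs = 1 + sum(1 for c in costs[1:] if c > threshold)
    preds = len(costs) - refs
    return refs, refs * full_tokens + preds * p_tokens


# Hypothetical predictive costs for eight consecutive frames.
costs = [0.0, 0.1, 0.2, 0.9, 0.1, 0.4, 0.8, 0.05]

for threshold in (0.05, 0.3, 0.85):
    refs, tokens = reference_stats(costs, threshold)
    print(f"threshold={threshold}: {refs} reference frames, {tokens} tokens")
```

On this toy sequence the lowest threshold marks 7 of the 8 frames as references (1,808 tokens), while the highest keeps only 2 (608 tokens) and would leave the cost‑0.8 frame to be predicted despite its large change.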
Authors
- Haowen Hou
- Zhen Huang
- Zheming Liang
- Qingyi Si
- Chenglin Li
- Shuai Dong
- Kele Shao
- Ruilin Li
- Dianyi Wang
- Nan Duan
- Jiaqi Wang
Paper Information
- arXiv ID: 2606.02569v1
- Categories: cs.CV, cs.AI, cs.CL
- Published: June 1, 2026