[Paper] When LLaVA Meets Objects: Token Composition for Vision-Language Models

Published: February 4, 2026 at 01:50 PM EST
4 min read

Source: arXiv - 2602.04864v1

Overview

The paper “When LLaVA Meets Objects: Token Composition for Vision‑Language Models” tackles a core bottleneck in modern autoregressive vision‑language models (VLMs): they need thousands of visual tokens to encode an image, which makes inference slow and costly. The authors introduce Mask‑LLaVA, a token‑efficient framework that mixes object‑level masks, global scene tokens, and fine‑grained patch tokens, enabling the model to drop many tokens at test time without a noticeable loss in accuracy.

Key Contributions

  • Multi‑level token composition: Combines mask‑based object tokens, global image tokens, and local patch tokens into a single visual representation.
  • Dynamic token pruning at inference: Allows the number of object tokens to be reduced on‑the‑fly, adapting compute to the hardware budget.
  • Training‑time token sharing: All token types are used during training, so the model learns to share information across scales, but only a subset is required during deployment.
  • Competitive performance: Achieves results on par with the original LLaVA and other token‑efficient baselines while using ≤ 30 % of the visual tokens.
  • Extensive benchmark evaluation: Tested on standard VQA, captioning, and multimodal reasoning datasets, demonstrating robustness across tasks.

Methodology

  1. Feature Extraction

    • Global token: A single vector from a CNN/ViT backbone summarizing the whole image.
    • Patch tokens: Regular grid of small patches (e.g., 16×16) providing fine‑grained detail.
    • Mask‑based object tokens: Regions detected by a pretrained object detector (e.g., Mask R‑CNN). Each region is pooled into a token that captures object shape and semantics.
  2. Token Fusion

    • All three token sets are concatenated and fed to a lightweight transformer encoder that learns cross‑attention among them.
    • During training, the model sees the full set, encouraging it to distribute information across scales.
  3. Dynamic Inference

    • At test time, a token budget can be specified. The model can drop a configurable number of object tokens (or even all of them) while still using the global and patch tokens.
    • No retraining is required; the encoder has already learned to compensate for missing tokens.
  4. Autoregressive Language Decoder

    • The fused visual representation conditions a large language model (the LLaVA decoder) that generates answers, captions, or other text outputs token‑by‑token. A minimal sketch of this pipeline follows the list.
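
To make these steps concrete, here is a minimal PyTorch sketch of the multi‑level token composition and inference‑time pruning described above. The `TokenComposer` class, the dimensions, and the keep‑first‑k pruning heuristic are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of multi-level token composition with inference-time pruning.
# Class name, dimensions, and the pruning heuristic are assumptions; the
# paper's actual Mask-LLaVA implementation may differ.
from typing import Optional

import torch
import torch.nn as nn


class TokenComposer(nn.Module):
    """Fuses global, patch, and mask-based object tokens with a light transformer."""

    def __init__(self, dim: int = 768, num_layers: int = 2, num_heads: int = 8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(
        self,
        global_tok: torch.Tensor,          # (B, 1, D) whole-image summary
        patch_toks: torch.Tensor,          # (B, P, D) grid of patch features
        object_toks: torch.Tensor,         # (B, O, D) mask-pooled object features
        object_budget: Optional[int] = None,  # max object tokens kept at inference
    ) -> torch.Tensor:
        if object_budget is not None:
            # Keep only the first `object_budget` object tokens; a real system
            # might rank them by detector confidence instead (assumption).
            object_toks = object_toks[:, :object_budget]
        # Concatenate all three levels and let self-attention mix them.
        visual = torch.cat([global_tok, patch_toks, object_toks], dim=1)
        return self.encoder(visual)  # (B, 1 + P + O', D) fused visual tokens


if __name__ == "__main__":
    B, D = 2, 768
    composer = TokenComposer(dim=D)
    g = torch.randn(B, 1, D)     # global token
    p = torch.randn(B, 576, D)   # e.g., a 24x24 patch grid
    o = torch.randn(B, 20, D)    # up to 20 detected objects
    full = composer(g, p, o)                      # training-style full token set
    pruned = composer(g, p, o, object_budget=5)   # test-time budget of 5 objects
    print(full.shape, pruned.shape)  # (2, 597, 768) and (2, 582, 768)
```

In a full system, the fused tokens would then be projected into the language model's embedding space and prepended to the text prompt, following standard LLaVA‑style conditioning.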

Results & Findings

| Dataset | Baseline (LLaVA) | Mask‑LLaVA (full tokens) | Mask‑LLaVA (30 % tokens) |
|---|---|---|---|
| VQAv2 | 73.2 % | 72.8 % | 71.9 % |
| COCO Caption | 126.4 CIDEr | 125.9 CIDEr | 124.3 CIDEr |
| GQA | 61.5 % | 60.9 % | 60.1 % |

  • Token reduction: Using only ~30 % of the visual tokens (mostly global + a few object tokens) incurs < 2 % absolute drop in accuracy.
  • Speedup: Inference time improves by 2.5×–3× on a single A100 GPU because the transformer processes fewer tokens.
  • Ablation: Removing any token type (global, patch, or object) degrades performance more than the dynamic pruning, confirming that the three levels provide complementary information.

Practical Implications

  • Cost‑effective deployment: Cloud services or edge devices can throttle the token budget based on latency or cost constraints, making VLMs viable for real‑time applications (e.g., interactive assistants, AR overlays).
  • Scalable multimodal pipelines: Existing LLaVA‑based products can adopt Mask‑LLaVA with minimal code changes by swapping the visual encoder and optionally setting a token budget (a short illustrative sketch follows this list).
  • Better handling of crowded scenes: Object masks let the model focus on salient entities, which is useful for robotics, autonomous driving, or retail analytics where specific objects matter more than background texture.
  • Energy savings: Fewer tokens mean less memory traffic and lower GPU power draw, aligning with sustainability goals for large‑scale AI services.
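
As a rough illustration of budget throttling, a deployment wrapper could map a latency target to an object‑token budget and pass it to the `TokenComposer` sketched in the Methodology section. The thresholds and interface here are assumptions, not an API from the paper.

```python
# Illustrative helper: choose an object-token budget from a latency target.
# The thresholds and the TokenComposer interface come from the sketch in the
# Methodology section and are assumptions, not part of Mask-LLaVA's code.
def object_budget_for_latency(target_ms: float) -> int:
    if target_ms < 50:    # tight real-time budget: global + patch tokens only
        return 0
    if target_ms < 150:   # interactive use: keep a handful of salient objects
        return 5
    return 20             # offline / batch: keep the full object set


# Example usage with the earlier sketch:
# fused = composer(g, p, o, object_budget=object_budget_for_latency(80.0))
```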

Limitations & Future Work

  • Dependency on a pretrained detector: The quality of mask‑based tokens hinges on the object detector; failures in detection can propagate to the language model.
  • Fixed token hierarchy: The current design uses three static levels; exploring adaptive token granularity (e.g., dynamically merging patches) could yield further gains.
  • Benchmark scope: Experiments focus on standard VQA and captioning tasks; evaluating on more diverse domains (medical imaging, video) remains open.
  • Hardware‑specific tuning: Optimal token budgets may vary across GPUs/TPUs; automated profiling tools could help developers choose the right trade‑off.

Mask‑LLaVA demonstrates that clever token composition can dramatically cut the compute cost of vision‑language models while preserving most of their capabilities—an insight that could accelerate the adoption of multimodal AI in production environments.

Authors

  • Soumya Jahagirdar
  • Walid Bousselham
  • Anna Kukleva
  • Hilde Kuehne

Paper Information

  • arXiv ID: 2602.04864v1
  • Categories: cs.CV
  • Published: February 4, 2026