[Paper] When LLaVA Meets Objects: Token Composition for Vision-Language Models

Published: February 4, 2026 at 01:50 PM EST
4 min read

Source: arXiv - 2602.04864v1

Overview

The paper “When LLaVA Meets Objects: Token Composition for Vision‑Language Models” tackles a core bottleneck in modern autoregressive vision‑language models (VLMs): they need thousands of visual tokens to encode an image, which makes inference slow and costly. The authors introduce Mask‑LLaVA, a token‑efficient framework that mixes object‑level masks, global scene tokens, and fine‑grained patch tokens, enabling the model to drop many tokens at test time without a noticeable loss in accuracy.

Key Contributions

  • Multi‑level token composition: Combines mask‑based object tokens, global image tokens, and local patch tokens into a single visual representation.
  • Dynamic token pruning at inference: Allows the number of object tokens to be reduced on‑the‑fly, adapting compute to the hardware budget.
  • Training‑time token sharing: All token types are used during training, so the model learns to share information across scales, but only a subset is required during deployment.
  • Competitive performance: Achieves results on par with the original LLaVA and other token‑efficient baselines while using ≤ 30 % of the visual tokens.
  • Extensive benchmark evaluation: Tested on standard VQA, captioning, and multimodal reasoning datasets, demonstrating robustness across tasks.

Methodology

  1. Feature Extraction

    • Global token: A single vector from a CNN/ViT backbone summarizing the whole image.
    • Patch tokens: Regular grid of small patches (e.g., 16×16) providing fine‑grained detail.
    • Mask‑based object tokens: Regions detected by a pretrained object detector (e.g., Mask R‑CNN). Each region is pooled into a token that captures object shape and semantics.
  2. Token Fusion

    • All three token sets are concatenated and fed to a lightweight transformer encoder that learns cross‑attention among them.
    • During training, the model sees the full set, encouraging it to distribute information across scales.
  3. Dynamic Inference

    • At test time, a token budget can be specified. The model can drop a configurable number of object tokens (or even all of them) while still using the global and patch tokens.
    • No retraining is required; the encoder has already learned to compensate for missing tokens.
  4. Autoregressive Language Decoder

    • The fused visual representation conditions a large language model (the LLaVA decoder) that generates answers, captions, or other text outputs token‑by‑token. A minimal sketch of this pipeline follows the list.
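
To make these steps concrete, here is a minimal PyTorch sketch of the multi‑level token composition and inference‑time pruning described above. The `TokenComposer` class, the dimensions, and the keep‑first‑k pruning heuristic are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of multi-level token composition with inference-time pruning.
# Class name, dimensions, and the pruning heuristic are assumptions; the
# paper's actual Mask-LLaVA implementation may differ.
from typing import Optional

import torch
import torch.nn as nn


class TokenComposer(nn.Module):
    """Fuses global, patch, and mask-based object tokens with a light transformer."""

    def __init__(self, dim: int = 768, num_layers: int = 2, num_heads: int = 8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(
        self,
        global_tok: torch.Tensor,          # (B, 1, D) whole-image summary
        patch_toks: torch.Tensor,          # (B, P, D) grid of patch features
        object_toks: torch.Tensor,         # (B, O, D) mask-pooled object features
        object_budget: Optional[int] = None,  # max object tokens kept at inference
    ) -> torch.Tensor:
        if object_budget is not None:
            # Keep only the first `object_budget` object tokens; a real system
            # might rank them by detector confidence instead (assumption).
            object_toks = object_toks[:, :object_budget]
        # Concatenate all three levels and let self-attention mix them.
        visual = torch.cat([global_tok, patch_toks, object_toks], dim=1)
        return self.encoder(visual)  # (B, 1 + P + O', D) fused visual tokens


if __name__ == "__main__":
    B, D = 2, 768
    composer = TokenComposer(dim=D)
    g = torch.randn(B, 1, D)     # global token
    p = torch.randn(B, 576, D)   # e.g., a 24x24 patch grid
    o = torch.randn(B, 20, D)    # up to 20 detected objects
    full = composer(g, p, o)                      # training-style full token set
    pruned = composer(g, p, o, object_budget=5)   # test-time budget of 5 objects
    print(full.shape, pruned.shape)  # (2, 597, 768) and (2, 582, 768)
```

In a full system, the fused tokens would then be projected into the language model's embedding space and prepended to the text prompt, following standard LLaVA‑style conditioning.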

Results & Findings

| Dataset | Baseline (LLaVA) | Mask‑LLaVA (full tokens) | Mask‑LLaVA (30 % tokens) |
|---|---|---|---|
| VQAv2 | 73.2 % | 72.8 % | 71.9 % |
| COCO Caption | 126.4 CIDEr | 125.9 CIDEr | 124.3 CIDEr |
| GQA | 61.5 % | 60.9 % | 60.1 % |

  • Token reduction: Using only ~30 % of the visual tokens (mostly global + a few object tokens) incurs < 2 % absolute drop in accuracy.
  • Speedup: Inference time improves by 2.5×–3× on a single A100 GPU because the transformer processes fewer tokens.
  • Ablation: Removing any token type (global, patch, or object) degrades performance more than the dynamic pruning, confirming that the three levels provide complementary information.

Practical Implications

  • Cost‑effective deployment: Cloud services or edge devices can throttle the token budget based on latency or cost constraints, making VLMs viable for real‑time applications (e.g., interactive assistants, AR overlays).
  • Scalable multimodal pipelines: Existing LLaVA‑based products can adopt Mask‑LLaVA with minimal code changes by swapping the visual encoder and optionally setting a token budget (a short illustrative sketch follows this list).
  • Better handling of crowded scenes: Object masks let the model focus on salient entities, which is useful for robotics, autonomous driving, or retail analytics where specific objects matter more than background texture.
  • Energy savings: Fewer tokens mean less memory traffic and lower GPU power draw, aligning with sustainability goals for large‑scale AI services.
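
As a rough illustration of budget throttling, a deployment wrapper could map a latency target to an object‑token budget and pass it to the `TokenComposer` sketched in the Methodology section. The thresholds and interface here are assumptions, not an API from the paper.

```python
# Illustrative helper: choose an object-token budget from a latency target.
# The thresholds and the TokenComposer interface come from the sketch in the
# Methodology section and are assumptions, not part of Mask-LLaVA's code.
def object_budget_for_latency(target_ms: float) -> int:
    if target_ms < 50:    # tight real-time budget: global + patch tokens only
        return 0
    if target_ms < 150:   # interactive use: keep a handful of salient objects
        return 5
    return 20             # offline / batch: keep the full object set


# Example usage with the earlier sketch:
# fused = composer(g, p, o, object_budget=object_budget_for_latency(80.0))
```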

Limitations & Future Work

  • Dependency on a pretrained detector: The quality of mask‑based tokens hinges on the object detector; failures in detection can propagate to the language model.
  • Fixed token hierarchy: The current design uses three static levels; exploring adaptive token granularity (e.g., dynamically merging patches) could yield further gains.
  • Benchmark scope: Experiments focus on standard VQA and captioning tasks; evaluating on more diverse domains (medical imaging, video) remains open.
  • Hardware‑specific tuning: Optimal token budgets may vary across GPUs/TPUs; automated profiling tools could help developers choose the right trade‑off.

Mask‑LLaVA demonstrates that clever token composition can dramatically cut the compute cost of vision‑language models while preserving most of their capabilities—an insight that could accelerate the adoption of multimodal AI in production environments.

Authors

  • Soumya Jahagirdar
  • Walid Bousselham
  • Anna Kukleva
  • Hilde Kuehne

Paper Information

  • arXiv ID: 2602.04864v1
  • Categories: cs.CV
  • Published: February 4, 2026