[Paper] Delta-LLaVA: Base-then-Specialize Alignment for Token-Efficient Vision-Language Models

Published: December 21, 2025 at 06:02 PM EST
3 min read
Source: arXiv - 2512.18910v1

Overview

Delta‑LLaVA tackles one of the biggest pain points in multimodal large language models (MLLMs): the massive computational overhead caused by dense visual tokens. By redesigning the visual‑to‑language projector, the authors achieve a token‑efficient pipeline that keeps reasoning quality while slashing inference latency and training time.

Key Contributions

  • DeltaProjection: A low‑rank, multi‑level alignment module that compresses raw vision features into a compact sub‑space before they reach the language model (a rough parameter count is sketched after this list).
  • Base‑then‑Specialize Architecture: A two‑stage design where a lightweight “base” projector handles coarse alignment, followed by a few Transformer “specialization” blocks that refine global and local context under a strict token budget (144 tokens).
  • Significant Speedups: Up to 55 % faster inference, ~4‑5× faster pre‑training, and 1.5× faster fine‑tuning compared to conventional MLP projectors.
  • Broad Benchmark Gains: Consistent performance improvements across standard vision‑language tasks (e.g., VQAv2, COCO captioning) despite using far fewer visual tokens.
  • Extensive Ablations: Demonstrates that most of the benefit comes from the early token formation step rather than simply adding more Transformer layers.
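
To make the low‑parameter‑count claim concrete, here is a rough comparison for a single projection layer. The dimensions (1024‑d vision features, 4096‑d LLM embeddings, rank 64) are illustrative assumptions, not figures from the paper.

```python
# Rough parameter count: dense full-rank projector vs. a rank-r factorization.
# All dimensions below are assumed for illustration, not taken from the paper.
vis_dim, llm_dim, rank = 1024, 4096, 64

full_rank_params = vis_dim * llm_dim          # one dense W: 1024 * 4096
low_rank_params = rank * (vis_dim + llm_dim)  # factors A (1024x64) and B (64x4096)

print(full_rank_params)   # 4194304
print(low_rank_params)    # 327680
print(f"{full_rank_params / low_rank_params:.1f}x fewer parameters")  # 12.8x
```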

Methodology

  1. Vision Encoder → Multi‑Level Features: A standard CNN/ViT extracts feature maps at several resolutions.
  2. DeltaProjection (Base Layer):
    • Applies a low‑rank linear transform (the “delta”) to each feature level, projecting them into a shared low‑dimensional space.
    • The projection is additive: it learns the difference (Δ) between the raw feature and its compact representation, which keeps the parameter count low.
  3. Token Consolidation: The projected features are concatenated and down‑sampled to 144 tokens using a simple pooling operation.
  4. Specialization Transformers: One‑to‑three shallow Transformer blocks (≈2‑4 layers each) operate on the 144 tokens, allowing the model to capture higher‑order interactions without exploding the token count.
  5. Language Model Integration: The refined 144‑token sequence is fed into the LLM (e.g., LLaVA’s language backbone) alongside the text embeddings, exactly where the much longer dense visual‑token sequence would normally go.

The whole pipeline is end‑to‑end trainable, but the low‑rank base alignment can be pre‑trained separately, further accelerating later fine‑tuning.
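
A minimal end‑to‑end sketch of this base‑then‑specialize pipeline is shown below, assuming PyTorch; the class names, dimensions, and pooling choice are illustrative assumptions rather than the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DeltaProjection(nn.Module):
    """Low-rank additive ("delta") projection for one vision feature level."""

    def __init__(self, vis_dim: int, llm_dim: int, rank: int = 64):
        super().__init__()
        self.base = nn.Linear(vis_dim, llm_dim)           # coarse base alignment
        self.down = nn.Linear(vis_dim, rank, bias=False)  # low-rank factor A
        self.up = nn.Linear(rank, llm_dim, bias=False)    # low-rank factor B

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Base projection plus a learned low-rank correction (the "delta").
        return self.base(x) + self.up(self.down(x))


class BaseThenSpecializeProjector(nn.Module):
    """Project multi-level features, pool to a fixed token budget,
    then refine with a few shallow Transformer blocks."""

    def __init__(self, vis_dim=1024, llm_dim=4096, num_tokens=144,
                 num_levels=3, depth=2, heads=8):
        super().__init__()
        self.levels = nn.ModuleList(
            DeltaProjection(vis_dim, llm_dim) for _ in range(num_levels)
        )
        self.num_tokens = num_tokens
        layer = nn.TransformerEncoderLayer(d_model=llm_dim, nhead=heads,
                                           batch_first=True)
        self.specialize = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, feats: list) -> torch.Tensor:
        # feats: one (batch, n_tokens_l, vis_dim) tensor per feature level.
        projected = [proj(f) for proj, f in zip(self.levels, feats)]
        tokens = torch.cat(projected, dim=1)              # (batch, sum_l n_l, llm_dim)
        # Consolidate to the fixed 144-token budget with simple pooling.
        tokens = F.adaptive_avg_pool1d(tokens.transpose(1, 2),
                                       self.num_tokens).transpose(1, 2)
        return self.specialize(tokens)                    # (batch, 144, llm_dim)


# Example: three feature levels of different spatial resolution.
feats = [torch.randn(2, n, 1024) for n in (576, 144, 36)]
visual_tokens = BaseThenSpecializeProjector()(feats)
print(visual_tokens.shape)                                # torch.Size([2, 144, 4096])
```

The resulting `visual_tokens` tensor would then be handed to the language model in place of the dense patch tokens.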

Results & Findings

| Metric | Baseline (MLP projector) | Delta‑LLaVA (144 tokens) | Speedup |
| --- | --- | --- | --- |
| VQAv2 accuracy | 73.1 % | 74.6 % | +55 % inference |
| COCO Caption CIDEr | 124.3 | 126.8 | 4‑5× pre‑train |
| LLaVA‑Chat win rate | 68 % | 70 % | 1.5× fine‑tune |
| FLOPs (per image) | 12.8 G | 5.6 G | |

  • Token budget matters: When the same number of tokens (144) is used, the DeltaProjection consistently outperforms a naïve down‑sampling + MLP pipeline.
  • Ablation: Removing the specialization Transformers drops performance by ~1 % absolute, confirming their role in refining the compact token set.
  • Scalability: Experiments with higher‑resolution inputs (up to 4K) show that Delta‑LLaVA’s runtime grows linearly with image size, unlike the quadratic blow‑up of dense tokenizers.
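
As a back‑of‑envelope illustration of that scaling argument, the LLM‑side token count of a dense patch tokenizer grows quadratically with image side length, while Delta‑LLaVA's budget stays fixed. The 14‑pixel patch size below is an assumption for illustration, not a number from the paper.

```python
# Illustration only: dense patch-token count vs. the fixed 144-token budget.
PATCH = 14  # assumed ViT patch size (pixels)

for side in (336, 672, 1344):
    dense_tokens = (side // PATCH) ** 2  # tokens a dense tokenizer hands the LLM
    print(f"{side:4d}px  dense={dense_tokens:5d}  Delta-LLaVA=144")
# Output:
#  336px  dense=  576  Delta-LLaVA=144
#  672px  dense= 2304  Delta-LLaVA=144
# 1344px  dense= 9216  Delta-LLaVA=144
```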

Practical Implications

  • Faster Prototyping: Developers can iterate on vision‑language applications (e.g., visual assistants, document understanding) with sub‑second latency on commodity GPUs.
  • Cost‑Effective Cloud Deployments: Lower FLOPs translate directly into reduced inference cost, making MLLM services more economically viable at scale.
  • Edge‑Friendly Deployments: The compact 144‑token visual representation fits comfortably within memory‑constrained environments, opening doors for on‑device multimodal AI (AR glasses, robotics).
  • Simplified Pipeline Integration: Because DeltaProjection is a drop‑in replacement for the usual MLP projector, existing LLaVA‑style stacks can adopt it with minimal code changes (a hypothetical swap is sketched after this list).
  • Future‑Proofing: The base‑then‑specialize paradigm separates coarse alignment from fine‑grained reasoning, allowing teams to swap in stronger vision encoders or larger language backbones without re‑engineering the whole projector.
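
A hypothetical swap, reusing the BaseThenSpecializeProjector sketch from the Methodology section, might look like the following. The `mm_projector` attribute name and the surrounding model structure are assumptions about a LLaVA‑style codebase, not the authors' integration code.

```python
# Hypothetical drop-in swap for a LLaVA-style model; `mm_projector` is an
# assumed attribute name, and BaseThenSpecializeProjector is the sketch above.
def swap_in_delta_projector(model, vis_dim=1024, llm_dim=4096):
    # Vision encoder and LLM stay untouched; only the projector between
    # them is replaced with the token-budgeted base-then-specialize module.
    model.mm_projector = BaseThenSpecializeProjector(
        vis_dim=vis_dim, llm_dim=llm_dim, num_tokens=144
    )
    return model
```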

Limitations & Future Work

  • Fixed Token Budget: The current design locks the token count at 144; dynamic token allocation based on image complexity could yield further gains.
  • Specialization Depth: Only shallow Transformers were explored; deeper specialization may be needed for tasks requiring fine‑grained spatial reasoning (e.g., detailed diagram parsing).
  • Generalization to Non‑Vision Modalities: While the paper focuses on images, extending DeltaProjection to video or 3‑D data remains an open question.
  • Benchmark Diversity: Experiments were limited to mainstream vision‑language datasets; real‑world industrial workloads (e.g., medical imaging reports) may expose new challenges.

The authors suggest exploring adaptive rank selection for the DeltaProjection and modality‑aware token budgeting as promising avenues for the next generation of token‑efficient MLLMs.

Authors

  • Mohamad Zamini
  • Diksha Shukla

Paper Information

  • arXiv ID: 2512.18910v1
  • Categories: cs.CV
  • Published: December 21, 2025