[Paper] From One-to-One to Many-to-Many: Dynamic Cross-Layer Injection for Deep Vision-Language Fusion

Published: January 15, 2026 at 01:59 PM EST
4 min read
Source: arXiv - 2601.10710v1

Overview

Vision‑Language Models (VLMs) have become the backbone of many AI products that need to “see and talk,” from image captioning tools to visual assistants.
Current VLMs, however, suffer from a visual bottleneck: they pass only the final output of the vision encoder to the language model, ignoring the rich hierarchy of low‑level and mid‑level visual features computed along the way. The paper “From One-to-One to Many-to-Many: Dynamic Cross‑Layer Injection for Deep Vision‑Language Fusion” proposes a lightweight plug‑in that lets the language model tap into any vision layer on demand, substantially improving multimodal reasoning.

Key Contributions

  • Cross‑Layer Injection (CLI) – a generic framework that builds a many‑to‑many bridge between vision encoders and large language models (LLMs).
  • Adaptive Multi‑Projection (AMP) – aligns and compresses feature maps from multiple vision depths into a common space without heavy retraining.
  • Adaptive Gating Fusion (AGF) – a context‑aware gating mechanism that lets the LLM decide which visual signals are most useful at each decoding step.
  • Parameter‑efficient integration – CLI adds only a tiny fraction of extra parameters (≈0.5 % of the total model size) and can be dropped into existing VLMs such as LLaVA‑OneVision and LLaVA‑1.5.
  • Broad empirical validation – improvements reported on 18 benchmarks covering captioning, visual question answering, reasoning, and grounding, with gains of 3–12 % absolute over strong baselines.

Methodology

  1. Multi‑layer feature extraction – The vision encoder (e.g., a ViT or ConvNeXt) produces a stack of hidden states at different depths, each capturing a different granularity (edges → textures → objects → scene semantics).
  2. Adaptive Multi‑Projection (AMP) – Each layer’s feature map is passed through a lightweight linear projection (or a tiny MLP) that reshapes it to a unified dimension. AMP also learns a small set of scaling factors so that deeper layers don’t dominate shallower ones.
  3. Dynamic injection into the LLM – During text generation, the LLM’s decoder hidden state is fed into the Adaptive Gating Fusion (AGF) module. AGF computes a gating vector (via a sigmoid‑activated attention) that weights the projected visual tokens according to the current linguistic context (e.g., the question being answered). A minimal sketch of AMP and AGF follows this list.
  4. On‑demand fusion – The gated visual tokens are concatenated with the LLM’s token embeddings, allowing the language model to “look” at the most relevant visual cues at each step, rather than being forced to rely on a single static visual token.
  5. Training – Only the AMP and AGF parameters are fine‑tuned (≈1–2 M weights). The rest of the vision encoder and LLM stay frozen, making the approach fast to adapt to new models or datasets.
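
To make the data flow concrete, here is a minimal PyTorch sketch of the AMP and AGF components as described above. The class names, tensor shapes, and the exact gate formulation (one linear layer plus a sigmoid, with the gated layers summed into a single visual token stream) are illustrative assumptions rather than the authors’ reference implementation; the paper may, for example, concatenate rather than sum the per‑layer tokens.

```python
# Minimal sketch of the two CLI components described above.
# Class names, dimensions, and the exact gate formulation are illustrative
# assumptions, not the authors' reference implementation.
import torch
import torch.nn as nn


class AdaptiveMultiProjection(nn.Module):
    """AMP: project feature maps from several vision-encoder depths into a
    shared dimension, with a learned per-layer scale so deeper layers do not
    dominate shallower ones."""

    def __init__(self, num_layers: int, vision_dim: int, llm_dim: int):
        super().__init__()
        self.projs = nn.ModuleList(
            nn.Linear(vision_dim, llm_dim) for _ in range(num_layers)
        )
        self.scales = nn.Parameter(torch.ones(num_layers))

    def forward(self, layer_feats):
        # layer_feats: list of [batch, num_patches, vision_dim] tensors,
        # one per selected encoder depth.
        projected = [
            self.scales[i] * proj(feats)
            for i, (proj, feats) in enumerate(zip(self.projs, layer_feats))
        ]
        return torch.stack(projected, dim=1)  # [batch, layers, patches, llm_dim]


class AdaptiveGatingFusion(nn.Module):
    """AGF: a sigmoid gate conditioned on the current decoder hidden state
    decides how much each vision layer contributes at this decoding step."""

    def __init__(self, num_layers: int, llm_dim: int):
        super().__init__()
        self.gate = nn.Linear(llm_dim, num_layers)

    def forward(self, decoder_hidden, projected_visual):
        # decoder_hidden:   [batch, llm_dim]
        # projected_visual: [batch, layers, patches, llm_dim]
        gates = torch.sigmoid(self.gate(decoder_hidden))  # [batch, layers]
        gates = gates[:, :, None, None]                   # broadcast over patches/dim
        # Weight each layer's tokens, then merge layers into one token stream.
        return (gates * projected_visual).sum(dim=1)      # [batch, patches, llm_dim]


if __name__ == "__main__":
    batch, patches, vision_dim, llm_dim, num_layers = 2, 16, 1024, 4096, 4
    amp = AdaptiveMultiProjection(num_layers, vision_dim, llm_dim)
    agf = AdaptiveGatingFusion(num_layers, llm_dim)

    layer_feats = [torch.randn(batch, patches, vision_dim) for _ in range(num_layers)]
    decoder_hidden = torch.randn(batch, llm_dim)

    visual_tokens = agf(decoder_hidden, amp(layer_feats))
    print(visual_tokens.shape)  # torch.Size([2, 16, 4096])
```

In a full system these gated visual tokens would be combined with the text embeddings at each decoding step, and only the AMP/AGF weights would receive gradients while the vision encoder and LLM stay frozen, as described in step 5.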

Results & Findings

| Benchmark | Baseline (LLaVA‑1.5) | + CLI | Gain |
| --- | --- | --- | --- |
| VQAv2 (answer accuracy) | 71.2 % | 78.4 % | +7.2 % |
| COCO Caption (CIDEr) | 124.5 | 133.8 | +7.5 % |
| OK-VQA (accuracy) | 45.1 % | 51.3 % | +6.2 % |
| RefCOCO (referring expression) | 68.9 % | 74.5 % | +5.6 % |
| ScienceQA (multimodal reasoning) | 78.0 % | 84.1 % | +6.1 % |

  • Consistent gains across tasks: Whether the task demands fine‑grained visual detail (e.g., referring‑expression grounding) or high‑level reasoning (e.g., ScienceQA), CLI’s dynamic access to the visual hierarchy helps.
  • Parameter efficiency: Adding < 2 M trainable parameters yields > 5 % absolute improvements, far cheaper than re‑training the whole vision encoder.
  • Scalability: The same CLI module works with both LLaVA‑OneVision (a smaller LLM) and LLaVA‑1.5 (a 13 B model), showing that the approach scales with model size.

Practical Implications

  • Richer AI assistants – Developers can embed CLI into chat‑based assistants (e.g., customer‑support bots that need to interpret product images) to let the language side ask for “more detail” from the vision side on the fly.
  • Improved visual debugging tools – When building tools that explain model decisions, CLI’s gating signals reveal which visual layers contributed to a particular answer, aiding interpretability (see the sketch after this list).
  • Cost‑effective model upgrades – Companies can upgrade existing VLM deployments by adding the tiny CLI plug‑in rather than retraining massive vision encoders, saving GPU hours and cloud spend.
  • Better multimodal retrieval – Search engines that match text queries to images can benefit from multi‑layer cues (e.g., texture for “silky fabric” vs. object for “red car”), leading to higher relevance.
  • Edge‑device friendliness – Because CLI adds minimal parameters and inference overhead (a few extra matrix multiplications), it can be deployed on edge AI chips where memory is limited.
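
As an illustration of the debugging point above, the short sketch below shows how one might surface the AGF gate values during generation. It reuses the hypothetical AdaptiveGatingFusion module from the methodology sketch; the function name and reporting format are assumptions, not part of the paper.

```python
# Hypothetical helper for surfacing AGF gate values; assumes an
# AdaptiveGatingFusion-style module (as in the methodology sketch) that
# exposes a `gate` nn.Linear mapping the decoder state to per-layer logits.
import torch


def inspect_layer_contributions(agf, decoder_hidden, layer_names=None):
    """Return the sigmoid gate weight assigned to each vision-encoder depth
    for the current decoding step, averaged over the batch."""
    with torch.no_grad():
        gates = torch.sigmoid(agf.gate(decoder_hidden))  # [batch, num_layers]
    names = layer_names or [f"vision_layer_{i}" for i in range(gates.shape[1])]
    return dict(zip(names, gates.mean(dim=0).tolist()))
```

A high weight on a shallow layer suggests the answer leaned on low‑level cues (edges, texture), while a high weight on a deep layer points to object‑ or scene‑level semantics.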

Limitations & Future Work

  • Static vision encoder – CLI does not fine‑tune the underlying vision backbone, so any systematic bias or blind spot in the encoder remains.
  • Gating complexity – While lightweight, the AGF gating still introduces a per‑token computation that may become a bottleneck for extremely long generation sequences.
  • Generalization to non‑transformer vision models – The paper focuses on ViT‑style encoders; extending AMP/AGF to CNN‑based or hybrid backbones may need additional engineering.
  • Future directions suggested by the authors include: (1) jointly training the vision encoder with CLI for end‑to‑end optimality, (2) exploring hierarchical gating where the LLM can request multiple layers simultaneously, and (3) applying CLI to video‑language models where temporal dynamics add another dimension to the injection process.

Authors

  • Cheng Chen
  • Yuyu Guo
  • Pengpeng Zeng
  • Jingkuan Song
  • Peng Di
  • Hang Yu
  • Lianli Gao

Paper Information

  • arXiv ID: 2601.10710v1
  • Categories: cs.CV
  • Published: January 15, 2026