[Paper] MHA2MLA-VLM: Enabling DeepSeek's Economical Multi-Head Latent Attention across Vision-Language Models

Published: January 16, 2026 at 12:45 PM EST
3 min read

Source: arXiv - 2601.11464v1

Overview

The paper introduces MHA2MLA‑VLM, a lightweight framework that converts existing vision‑language models (VLMs) into DeepSeek's Multi‑Head Latent Attention (MLA) format. By compressing the key‑value (KV) cache that grows throughout transformer inference, the method slashes memory usage and speeds up inference without the need for expensive full‑model pre‑training.

Key Contributions

  • Parameter‑efficient conversion pipeline that retrofits off‑the‑shelf VLMs to MLA with only a few hundred training steps.
  • Modality‑adaptive partial‑RoPE: a selective rotary‑position‑embedding mask that preserves essential dimensions for both image and text streams while discarding redundant ones.
  • Modality‑decoupled low‑rank KV compression: independent low‑rank approximations for visual and textual KV matrices, yielding higher compression ratios than a monolithic approach.
  • Activation‑error‑driven fine‑tuning: optimizing the difference in model outputs (instead of raw parameter distance) dramatically reduces performance degradation after conversion.
  • Compatibility with existing KV‑quantization techniques, enabling a combined memory‑saving effect.
  • Empirical validation on three popular VLMs (CLIP‑ViT, BLIP‑2, and a Flamingo‑style model) showing near‑original accuracy with < 30 % of the original KV footprint.

Methodology

  1. Partial‑RoPE Masking – Traditional rotary position embeddings are applied uniformly across every head dimension. The authors instead propose a mask that zeroes out the rotary components in dimensions irrelevant to a given modality (visual vs. textual), allowing the same transformer block to handle both streams without cross‑contamination (see the partial‑RoPE sketch after this list).

  2. Separate Low‑Rank Approximation – The KV cache for each modality is factorized as

    $$
    K_{\text{vision}} \approx U_V S_V V_V^\top ,\qquad
    K_{\text{text}} \approx U_T S_T V_T^\top ,
    $$

    where the rank is chosen per modality. This decoupling respects the different statistical properties of image patches and token embeddings (see the low‑rank KV sketch after this list).

  3. Fine‑Tuning Objective – Instead of minimizing the parameter distance

    $$\| \theta_{\text{orig}} - \theta_{\text{MLA}} \|_2,$$

    the authors minimize the activation error

    $$\| f_{\text{orig}}(x) - f_{\text{MLA}}(x) \|_2$$

    over a small supervised set, directly aligning the converted model's predictions with the original's (see the fine‑tuning sketch after this list).

  4. Parameter‑Efficient Adaptation – Only a tiny set of adapter layers (≈ 0.5 % of total parameters) is introduced, keeping the conversion cost low and allowing rapid deployment on edge devices (see the adapter sketch after this list).
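
The sketches below illustrate each methodology step in turn; all tensor shapes, masks, ranks, and function names are illustrative assumptions rather than the authors' released code. First, the partial‑RoPE sketch (step 1): rotary rotations are applied only to the dimension pairs selected by a per‑modality mask, and the remaining pairs pass through unrotated.

```python
import torch

def rope_angles(positions: torch.Tensor, dim: int, base: float = 10000.0) -> torch.Tensor:
    """Standard RoPE angles for each (position, dimension-pair)."""
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))   # (dim/2,)
    return positions[:, None].float() * inv_freq[None, :]                # (seq, dim/2)

def partial_rope(x: torch.Tensor, positions: torch.Tensor,
                 keep_mask: torch.Tensor) -> torch.Tensor:
    """Rotate only the dimension pairs selected by keep_mask; pass the rest through.

    x:         (seq, dim) query/key slice of a single attention head
    positions: (seq,) integer token positions
    keep_mask: (dim/2,) bool, True = keep the rotary component for this pair
    """
    theta = rope_angles(positions, x.shape[-1])      # (seq, dim/2)
    cos, sin = torch.cos(theta), torch.sin(theta)
    x1, x2 = x[:, 0::2], x[:, 1::2]                  # even / odd halves of each pair
    rot1 = x1 * cos - x2 * sin
    rot2 = x1 * sin + x2 * cos
    # Where keep_mask is False the rotation is dropped (identity), freeing those
    # dimensions to be folded into the MLA latent projection.
    out = torch.empty_like(x)
    out[:, 0::2] = torch.where(keep_mask, rot1, x1)
    out[:, 1::2] = torch.where(keep_mask, rot2, x2)
    return out

# Per-modality masks, e.g. keep more rotary pairs for text than for image patches
# (the 16/8 split is an assumption for illustration).
head_dim = 64
text_mask  = torch.arange(head_dim // 2) < 16
image_mask = torch.arange(head_dim // 2) < 8

x = torch.randn(10, head_dim)                        # 10 tokens of one attention head
pos = torch.arange(10)
x_text  = partial_rope(x, pos, text_mask)
x_image = partial_rope(x, pos, image_mask)
```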
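
Step 2, modality‑decoupled low‑rank KV compression: each modality's KV matrix is factorized with a truncated SVD at its own rank, mirroring the equations above. The token counts, head dimension, and ranks below are assumed for illustration.

```python
import torch

def low_rank_factor(K: torch.Tensor, rank: int):
    """Factor a (tokens, d) KV matrix as U S V^T, keeping only `rank` components."""
    U, S, Vh = torch.linalg.svd(K, full_matrices=False)
    return U[:, :rank], S[:rank], Vh[:rank, :]

# Separate factorizations for the visual and textual KV streams, each with its own rank.
K_vision = torch.randn(576, 128)     # e.g. 576 image-patch tokens, head dim 128
K_text   = torch.randn(64, 128)      # e.g. 64 text tokens
U_v, S_v, V_v = low_rank_factor(K_vision, rank=32)
U_t, S_t, V_t = low_rank_factor(K_text,   rank=48)

# Cache only the small factors; reconstruct the full keys approximately on demand.
# (Random data has no low-rank structure, so the error below is only a mechanics check.)
K_vision_hat = U_v @ torch.diag(S_v) @ V_v
rel_err = torch.linalg.norm(K_vision - K_vision_hat) / torch.linalg.norm(K_vision)
print(f"vision KV relative reconstruction error: {rel_err:.3f}")
```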
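
Step 3, the activation‑error objective: the converted model is trained to match the frozen original model's outputs rather than its parameters. The two stand‑in models and the synthetic batches below are placeholders, not the paper's architecture or data.

```python
import torch
import torch.nn as nn

def activation_error_loss(orig_model: nn.Module, mla_model: nn.Module,
                          inputs: torch.Tensor) -> torch.Tensor:
    """L2 distance between the frozen original outputs and the converted model's outputs."""
    with torch.no_grad():
        target = orig_model(inputs)              # frozen "teacher" activations
    pred = mla_model(inputs)                     # converted MLA "student" activations
    return torch.linalg.vector_norm(pred - target, ord=2)

# Stand-in "original" and "converted" models (placeholders for illustration).
orig_model = nn.Sequential(nn.Linear(128, 128), nn.GELU(), nn.Linear(128, 64))
mla_model  = nn.Sequential(nn.Linear(128, 128), nn.GELU(), nn.Linear(128, 64))

optimizer = torch.optim.AdamW(mla_model.parameters(), lr=1e-4)
for _ in range(100):                             # the paper reports only a few hundred steps
    batch = torch.randn(8, 128)                  # stand-in for features of image-text pairs
    loss = activation_error_loss(orig_model, mla_model, batch)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```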
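
Step 4, parameter‑efficient adaptation: a residual bottleneck adapter of the kind commonly used for this purpose; the hidden size and bottleneck width below are assumed values chosen so the added weights stay at a fraction of a percent of the backbone.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Residual bottleneck adapter; only these weights would train during conversion."""
    def __init__(self, d_model: int, bottleneck: int):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        nn.init.zeros_(self.up.weight)        # start as an identity (pure residual) mapping
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(torch.relu(self.down(x)))

d_model = 4096                                 # assumed hidden size
adapter = Adapter(d_model, bottleneck=16)      # assumed bottleneck width
n_params = sum(p.numel() for p in adapter.parameters())
print(f"adapter parameters per insertion point: {n_params:,}")   # ~135k for 4096/16
```

In the fine‑tuning loop sketched above, the optimizer would be handed only these adapter parameters rather than the full backbone, which is what keeps the conversion cheap.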

Results & Findings

| Model (original) | KV size (GB) | KV size after MHA2MLA‑VLM | Top‑1 Image‑Text Retrieval Δ | Inference latency ↓ |
|---|---|---|---|---|
| CLIP‑ViT‑B/32 | 4.2 | 1.2 (≈ 71 % reduction) | –0.3 % | 28 % faster |
| BLIP‑2‑FlanT5 | 6.8 | 1.9 (≈ 72 % reduction) | –0.5 % | 31 % faster |
| Flamingo‑7B | 9.5 | 2.6 (≈ 73 % reduction) | –0.2 % | 27 % faster |

  • Performance loss stays under 0.5 % on standard VLM benchmarks (MS‑COCO, Flickr30K).
  • Fine‑tuning data required is tiny: ~5 k image‑text pairs (≈ 0.1 % of the original pre‑training corpus).
  • When combined with 8‑bit KV quantization, total memory drops to ≈ 10 % of the baseline while preserving accuracy.

Practical Implications

  • Edge Deployment – Developers can now run large VLMs on devices with < 2 GB RAM (e.g., smartphones, AR glasses) by swapping the KV cache for its MLA counterpart.
  • Cost‑Effective Scaling – Cloud inference services can serve more concurrent requests per GPU because the KV cache no longer dominates memory consumption.
  • Rapid Prototyping – Existing VLM pipelines (e.g., captioning, visual QA) can be upgraded to MLA with a few hours of fine‑tuning, avoiding the need to train a new model from scratch.
  • Interoperability – The method works with any transformer‑based VLM, making it a drop‑in upgrade for open‑source projects like HuggingFace’s transformers library.
  • Energy Savings – Smaller KV footprints translate to fewer memory accesses, which is a notable win for green AI initiatives and battery‑powered devices.

Limitations & Future Work

  • Modality‑specific rank selection still requires manual tuning; an automated rank‑selection algorithm could streamline the process.
  • The approach assumes a fixed transformer architecture; extending to models with mixed‑modality cross‑attention layers (e.g., Perceiver‑IO) remains open.
  • Experiments focus on retrieval and captioning tasks; applying MHA2MLA‑VLM to more complex multimodal reasoning (e.g., video‑language) is left for future research.
  • While the KV cache is heavily compressed, model weights themselves are unchanged; combining MLA conversion with weight quantization or pruning could push memory savings even further.

Authors

  • Xiaoran Fan
  • Zhichao Sun
  • Tao Ji
  • Lixing Shen
  • Tao Gui

Paper Information

  • arXiv ID: 2601.11464v1
  • Categories: cs.CV, cs.AI, cs.CL, cs.LG
  • Published: January 16, 2026