[Paper] MHA2MLA-VLM: Enabling DeepSeek's Economical Multi-Head Latent Attention across Vision-Language Models

Published: January 16, 2026 at 12:45 PM EST
3 min read

Source: arXiv - 2601.11464v1

Overview

The paper introduces MHA2MLA‑VLM, a lightweight framework that converts existing vision‑language models (VLMs) into DeepSeek's Multi‑Head Latent Attention (MLA) format. By compressing the key‑value (KV) cache that grows throughout transformer inference, the method slashes memory usage and speeds up inference without the need for expensive full‑model pre‑training.

Key Contributions

  • Parameter‑efficient conversion pipeline that retrofits off‑the‑shelf VLMs to MLA with only a few hundred training steps.
  • Modality‑adaptive partial‑RoPE: a selective rotary‑position‑embedding mask that preserves essential dimensions for both image and text streams while discarding redundant ones.
  • Modality‑decoupled low‑rank KV compression: independent low‑rank approximations for visual and textual KV matrices, yielding higher compression ratios than a monolithic approach.
  • Activation‑error‑driven fine‑tuning: optimizing the difference in model outputs (instead of raw parameter distance) dramatically reduces performance degradation after conversion.
  • Compatibility with existing KV‑quantization techniques, enabling a combined memory‑saving effect.
  • Empirical validation on three popular VLMs (CLIP‑ViT, BLIP‑2, and a Flamingo‑style model) showing near‑original accuracy with < 30 % of the original KV footprint.

Methodology

  1. Partial‑RoPE Masking – Traditional rotary position embeddings are applied uniformly across every head dimension. The authors instead propose a mask that zeroes out the rotary components in dimensions irrelevant to a given modality (visual vs. textual), allowing the same transformer block to handle both streams without cross‑contamination (see the partial‑RoPE sketch after this list).

  2. Separate Low‑Rank Approximation – The KV cache for each modality is factorized as

    $$
    K_{\text{vision}} \approx U_V S_V V_V^\top ,\qquad
    K_{\text{text}} \approx U_T S_T V_T^\top ,
    $$

    where the rank is chosen per modality. This decoupling respects the different statistical properties of image patches and token embeddings (see the low‑rank KV sketch after this list).

  3. Fine‑Tuning Objective – Instead of minimizing the parameter distance

    $$\| \theta_{\text{orig}} - \theta_{\text{MLA}} \|_2,$$

    the authors minimize the activation error

    $$\| f_{\text{orig}}(x) - f_{\text{MLA}}(x) \|_2$$

    over a small supervised set, directly aligning the converted model's predictions with the original's (see the fine‑tuning sketch after this list).

  4. Parameter‑Efficient Adaptation – Only a tiny set of adapter layers (≈ 0.5 % of total parameters) is introduced, keeping the conversion cost low and allowing rapid deployment on edge devices (see the adapter sketch after this list).
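
The sketches below illustrate each methodology step in turn; all tensor shapes, masks, ranks, and function names are illustrative assumptions rather than the authors' released code. First, the partial‑RoPE sketch (step 1): rotary rotations are applied only to the dimension pairs selected by a per‑modality mask, and the remaining pairs pass through unrotated.

```python
import torch

def rope_angles(positions: torch.Tensor, dim: int, base: float = 10000.0) -> torch.Tensor:
    """Standard RoPE angles for each (position, dimension-pair)."""
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))   # (dim/2,)
    return positions[:, None].float() * inv_freq[None, :]                # (seq, dim/2)

def partial_rope(x: torch.Tensor, positions: torch.Tensor,
                 keep_mask: torch.Tensor) -> torch.Tensor:
    """Rotate only the dimension pairs selected by keep_mask; pass the rest through.

    x:         (seq, dim) query/key slice of a single attention head
    positions: (seq,) integer token positions
    keep_mask: (dim/2,) bool, True = keep the rotary component for this pair
    """
    theta = rope_angles(positions, x.shape[-1])      # (seq, dim/2)
    cos, sin = torch.cos(theta), torch.sin(theta)
    x1, x2 = x[:, 0::2], x[:, 1::2]                  # even / odd halves of each pair
    rot1 = x1 * cos - x2 * sin
    rot2 = x1 * sin + x2 * cos
    # Where keep_mask is False the rotation is dropped (identity), freeing those
    # dimensions to be folded into the MLA latent projection.
    out = torch.empty_like(x)
    out[:, 0::2] = torch.where(keep_mask, rot1, x1)
    out[:, 1::2] = torch.where(keep_mask, rot2, x2)
    return out

# Per-modality masks, e.g. keep more rotary pairs for text than for image patches
# (the 16/8 split is an assumption for illustration).
head_dim = 64
text_mask  = torch.arange(head_dim // 2) < 16
image_mask = torch.arange(head_dim // 2) < 8

x = torch.randn(10, head_dim)                        # 10 tokens of one attention head
pos = torch.arange(10)
x_text  = partial_rope(x, pos, text_mask)
x_image = partial_rope(x, pos, image_mask)
```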
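
Step 2, modality‑decoupled low‑rank KV compression: each modality's KV matrix is factorized with a truncated SVD at its own rank, mirroring the equations above. The token counts, head dimension, and ranks below are assumed for illustration.

```python
import torch

def low_rank_factor(K: torch.Tensor, rank: int):
    """Factor a (tokens, d) KV matrix as U S V^T, keeping only `rank` components."""
    U, S, Vh = torch.linalg.svd(K, full_matrices=False)
    return U[:, :rank], S[:rank], Vh[:rank, :]

# Separate factorizations for the visual and textual KV streams, each with its own rank.
K_vision = torch.randn(576, 128)     # e.g. 576 image-patch tokens, head dim 128
K_text   = torch.randn(64, 128)      # e.g. 64 text tokens
U_v, S_v, V_v = low_rank_factor(K_vision, rank=32)
U_t, S_t, V_t = low_rank_factor(K_text,   rank=48)

# Cache only the small factors; reconstruct the full keys approximately on demand.
# (Random data has no low-rank structure, so the error below is only a mechanics check.)
K_vision_hat = U_v @ torch.diag(S_v) @ V_v
rel_err = torch.linalg.norm(K_vision - K_vision_hat) / torch.linalg.norm(K_vision)
print(f"vision KV relative reconstruction error: {rel_err:.3f}")
```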
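
Step 3, the activation‑error objective: the converted model is trained to match the frozen original model's outputs rather than its parameters. The two stand‑in models and the synthetic batches below are placeholders, not the paper's architecture or data.

```python
import torch
import torch.nn as nn

def activation_error_loss(orig_model: nn.Module, mla_model: nn.Module,
                          inputs: torch.Tensor) -> torch.Tensor:
    """L2 distance between the frozen original outputs and the converted model's outputs."""
    with torch.no_grad():
        target = orig_model(inputs)              # frozen "teacher" activations
    pred = mla_model(inputs)                     # converted MLA "student" activations
    return torch.linalg.vector_norm(pred - target, ord=2)

# Stand-in "original" and "converted" models (placeholders for illustration).
orig_model = nn.Sequential(nn.Linear(128, 128), nn.GELU(), nn.Linear(128, 64))
mla_model  = nn.Sequential(nn.Linear(128, 128), nn.GELU(), nn.Linear(128, 64))

optimizer = torch.optim.AdamW(mla_model.parameters(), lr=1e-4)
for _ in range(100):                             # the paper reports only a few hundred steps
    batch = torch.randn(8, 128)                  # stand-in for features of image-text pairs
    loss = activation_error_loss(orig_model, mla_model, batch)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```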
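
Step 4, parameter‑efficient adaptation: a residual bottleneck adapter of the kind commonly used for this purpose; the hidden size and bottleneck width below are assumed values chosen so the added weights stay at a fraction of a percent of the backbone.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Residual bottleneck adapter; only these weights would train during conversion."""
    def __init__(self, d_model: int, bottleneck: int):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        nn.init.zeros_(self.up.weight)        # start as an identity (pure residual) mapping
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(torch.relu(self.down(x)))

d_model = 4096                                 # assumed hidden size
adapter = Adapter(d_model, bottleneck=16)      # assumed bottleneck width
n_params = sum(p.numel() for p in adapter.parameters())
print(f"adapter parameters per insertion point: {n_params:,}")   # ~135k for 4096/16
```

In the fine‑tuning loop sketched above, the optimizer would be handed only these adapter parameters rather than the full backbone, which is what keeps the conversion cheap.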

Results & Findings

| Model (original) | KV size (GB) | KV size after MHA2MLA‑VLM | Top‑1 Image‑Text Retrieval Δ | Inference latency ↓ |
|---|---|---|---|---|
| CLIP‑ViT‑B/32 | 4.2 | 1.2 (≈ 71 % reduction) | –0.3 % | 28 % faster |
| BLIP‑2‑FlanT5 | 6.8 | 1.9 (≈ 72 % reduction) | –0.5 % | 31 % faster |
| Flamingo‑7B | 9.5 | 2.6 (≈ 73 % reduction) | –0.2 % | 27 % faster |

  • Performance loss stays under 0.5 % on standard VLM benchmarks (MS‑COCO, Flickr30K).
  • Fine‑tuning data required is tiny: ~5 k image‑text pairs (≈ 0.1 % of the original pre‑training corpus).
  • When combined with 8‑bit KV quantization, total memory drops to ≈ 10 % of the baseline while preserving accuracy.

Practical Implications

  • Edge Deployment – Developers can now run large VLMs on devices with < 2 GB RAM (e.g., smartphones, AR glasses) by swapping the KV cache for its MLA counterpart.
  • Cost‑Effective Scaling – Cloud inference services can serve more concurrent requests per GPU because the KV cache no longer dominates memory consumption.
  • Rapid Prototyping – Existing VLM pipelines (e.g., captioning, visual QA) can be upgraded to MLA with a few hours of fine‑tuning, avoiding the need to train a new model from scratch.
  • Interoperability – The method works with any transformer‑based VLM, making it a drop‑in upgrade for open‑source projects like HuggingFace’s transformers library.
  • Energy Savings – Smaller KV footprints translate to fewer memory accesses, which is a notable win for green AI initiatives and battery‑powered devices.

Limitations & Future Work

  • Modality‑specific rank selection still requires manual tuning; an automated rank‑selection algorithm could streamline the process.
  • The approach assumes a fixed transformer architecture; extending to models with mixed‑modality cross‑attention layers (e.g., Perceiver‑IO) remains open.
  • Experiments focus on retrieval and captioning tasks; applying MHA2MLA‑VLM to more complex multimodal reasoning (e.g., video‑language) is left for future research.
  • While the KV cache is heavily compressed, model weights themselves are unchanged; combining MLA conversion with weight quantization or pruning could push memory savings even further.

Authors

  • Xiaoran Fan
  • Zhichao Sun
  • Tao Ji
  • Lixing Shen
  • Tao Gui

Paper Information

  • arXiv ID: 2601.11464v1
  • Categories: cs.CV, cs.AI, cs.CL, cs.LG
  • Published: January 16, 2026