[Paper] Cross-Modal Attention Analysis and Optimization in Vision-Language Models: A Study on Visual Reliability
Vision-Language Models (VLMs) achieve strong cross-modal performance, yet recent evidence suggests they over-rely on textual descriptions while under-utilizing ...