[Paper] VLMs May Not Globally Enhance Human Alignment over LLMs During Natural Reading
Source: arXiv - 2605.28818v1
Overview
The paper investigates whether vision‑language models (VLMs) produce text representations that align more closely with human brain activity and eye‑movement patterns than those of standard large language models (LLMs) during plain‑text reading. By pairing each VLM with a closely matched LLM and testing both on a purely textual reading task, the authors isolate the effect of multimodal pre‑training from any real‑time visual input. Their results suggest that multimodal training does not confer a blanket advantage; any benefit appears only for sentences that are rich in visual semantics.
Key Contributions
- Controlled comparison of LLM–VLM pairs under a text‑only reading paradigm, eliminating confounds from online visual cues.
- Alignment benchmark using whole‑cortex fMRI recordings and synchronized eye‑tracking data (saccades) from human participants reading naturalistic passages.
- Evidence that multimodal pre‑training yields selective rather than global improvements in model‑human alignment, especially for visually evocative sentences.
- Open‑source in‑silico framework for probing how training history (visual vs. purely linguistic) shapes language representations in the brain.
Methodology
- Model selection – For each VLM (e.g., CLIP‑based, Flamingo‑style), the authors chose an LLM with the same architecture, model size, and tokenizer, differing only in pre‑training data (multimodal vs. text‑only).
- Stimuli – Participants read continuous natural‑language passages while their brain activity (fMRI) and eye movements were recorded. The same passages were fed to the models as plain text.
- Representation extraction – Hidden‑state activations from each model layer were recorded for every token.
- Alignment metrics
  - Neural alignment: Linear encoding models map model activations to voxel‑wise fMRI responses; correlation scores quantify fit.
  - Eye‑movement alignment: Predicted attention weights (e.g., from transformer attention heads) are compared to actual saccade landing positions using spatial similarity metrics.
- Selective analysis – Sentences were annotated for visual semantic density (e.g., presence of concrete nouns, vivid imagery). Alignment scores were compared across low‑ vs. high‑visual‑content groups.
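The neural‑alignment step can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the summary only says "linear encoding models", so the use of ridge regression, the single train/test split, and every name here (`fit_ridge`, `voxelwise_correlation`, `encoding_score`) are assumptions, and the data are synthetic stand‑ins for model activations and fMRI responses.

```python
import numpy as np

def fit_ridge(X, Y, alpha=1.0):
    """Closed-form ridge regression: W = (X^T X + alpha * I)^-1 X^T Y."""
    n_features = X.shape[1]
    gram = X.T @ X + alpha * np.eye(n_features)
    return np.linalg.solve(gram, X.T @ Y)

def voxelwise_correlation(Y_true, Y_pred):
    """Pearson r between measured and predicted responses, one value per voxel."""
    yt = Y_true - Y_true.mean(axis=0)
    yp = Y_pred - Y_pred.mean(axis=0)
    denom = np.linalg.norm(yt, axis=0) * np.linalg.norm(yp, axis=0)
    return (yt * yp).sum(axis=0) / denom

def encoding_score(features, responses, n_train, alpha=1.0):
    """Fit on the first n_train samples and score on the held-out remainder."""
    X_train, X_test = features[:n_train], features[n_train:]
    Y_train, Y_test = responses[:n_train], responses[n_train:]
    # Standardize features using training statistics only (avoids test leakage).
    mu = X_train.mean(axis=0)
    sd = X_train.std(axis=0) + 1e-8
    y_mean = Y_train.mean(axis=0)
    W = fit_ridge((X_train - mu) / sd, Y_train - y_mean, alpha)
    Y_pred = ((X_test - mu) / sd) @ W + y_mean
    return voxelwise_correlation(Y_test, Y_pred)

# Synthetic stand-in data (assumption): rows are time points, feature columns
# play the role of one layer's hidden-state dimensions, response columns are voxels.
rng = np.random.default_rng(0)
n_samples, n_features, n_voxels = 200, 16, 5
features = rng.standard_normal((n_samples, n_features))
true_weights = rng.standard_normal((n_features, n_voxels))
responses = features @ true_weights + 0.5 * rng.standard_normal((n_samples, n_voxels))

scores = encoding_score(features, responses, n_train=150, alpha=1.0)
print(scores.shape)  # one correlation score per voxel
```

In a setup like this, the same function would be run once per model and per layer, and the resulting per‑voxel scores compared between the VLM and its matched LLM. A real analysis would typically also use cross‑validation and account for the hemodynamic delay, which this sketch omits.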
Results & Findings
- No global advantage: Across the entire corpus, VLMs and LLMs performed comparably in predicting fMRI patterns and eye‑movement locations.
- Selective boost: For sentences rich in visual semantics (e.g., “The crimson sunset painted the horizon”), VLMs showed modest but statistically significant improvements in both neural and eye‑tracking alignment (≈3–5% higher correlation).
- Layer‑wise patterns: The advantage was most pronounced in middle transformer layers, suggesting that visual pre‑training reshapes intermediate representations rather than the final output layer.
- Cross‑modal consistency: The same sentences that yielded better neural alignment also produced better eye‑movement alignment, reinforcing the notion of a shared underlying representation.
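One way to check whether a VLM gain differs between high‑ and low‑visual sentences is a permutation test on the per‑sentence difference in alignment scores. The sketch below is illustrative only: the summary does not say which statistical test the authors used, the function name `permutation_test` is hypothetical, and the numbers are made‑up placeholders, not results from the paper.

```python
import random
from statistics import mean

def permutation_test(group_a, group_b, n_permutations=10000, seed=0):
    """Two-sided permutation test on the difference of group means."""
    rng = random.Random(seed)
    observed = mean(group_a) - mean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        # Randomly reassign sentences to the two groups.
        rng.shuffle(pooled)
        diff = mean(pooled[:n_a]) - mean(pooled[n_a:])
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    p_value = (extreme + 1) / (n_permutations + 1)
    return observed, p_value

# Placeholder values (assumption, NOT from the paper): per-sentence
# "VLM alignment minus LLM alignment" for each sentence group.
high_visual_gain = [0.04, 0.05, 0.03, 0.06, 0.04, 0.05, 0.03, 0.05]
low_visual_gain = [0.00, 0.01, -0.01, 0.00, 0.01, -0.02, 0.00, 0.01]

observed, p_value = permutation_test(
    high_visual_gain, low_visual_gain, n_permutations=2000
)
print(f"mean difference = {observed:.3f}, p = {p_value:.4f}")
```

A permutation test is used here because it makes no distributional assumption about the alignment scores; with real data the groups would hold one value per annotated sentence.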
Practical Implications
- Model selection for reading‑assist tools – When building applications that need to predict human reading behavior (e.g., adaptive e‑readers, gaze‑based UI), a plain LLM may be sufficient unless the target content is highly visual.
- Fine‑tuning strategies – Instead of training from scratch on massive multimodal corpora, developers could selectively inject visual grounding (e.g., via image‑caption pairs) for domains where visual semantics matter (technical manuals, storytelling).
- Neuro‑AI diagnostics – The alignment framework can serve as a sanity check for any language model intended to simulate human cognition; developers can benchmark against fMRI/eye‑tracking data to gauge “human‑likeness.”
- Resource allocation – Since multimodal pre‑training is computationally expensive, teams can prioritize it only for niche use‑cases, saving compute and carbon budget for broader NLP tasks.
Limitations & Future Work
- Dataset scope – The study used a single natural‑reading dataset; results may differ with other languages, genres, or more diverse participant pools.
- No visual input – The models received text only; real‑time visual context (e.g., illustrated articles) could amplify VLM benefits.
- Model diversity – The analysis focused on a limited set of VLM architectures; newer vision‑language transformers might exhibit stronger alignment.
- Causal mechanisms – While correlations were identified, the exact neural mechanisms by which visual pre‑training influences language representations remain unclear. Future work could probe causality via ablation studies or integrate explicit visual grounding during inference.
Authors
- Jinzhou Wu
- Zhengwu Ma
- Jixing Li
- Baoping Tang
- Zitong Lu
Paper Information
- arXiv ID: 2605.28818v1
- Categories: cs.CL, q-bio.NC
- Published: May 27, 2026