[Paper] TEVI: Text-Conditioned Editing of Visual Representations via Sparse Autoencoders for Improved Vision-Language Alignment

Published: June 5, 2026 at 12:54 PM EDT

Source: arXiv - 2606.07451v1

Overview

Vision-language models such as CLIP are highly useful for diverse tasks due to their shared image-text embedding space. Despite this, the image and text embeddings are often poorly aligned, affecting downstream performance. Recent work has shown that this can be attributed to an information imbalance: images contain more information than their captions describe. In this work, we propose TEVI, a framework that uses captions as a signal for what to retain from image embeddings. Specifically, we use sparse autoencoders to disentangle image embeddings and train a masking module to selectively reconstruct the embedding based on a given caption. In a controlled setup with synthetic captions, we show that TEVI is effective at preserving caption-described attributes while discarding others. By applying TEVI to CLIP models trained on natural images, we further achieve improved retrieval performance across coarse-grained short-caption (MS COCO, Flickr) and fine-grained long-caption (IIW, DOCCI) benchmarks, with stronger gains on richer captions, and improved robustness on the RoCOCO benchmark.
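The abstract describes the pipeline only at a high level, so the following is a minimal sketch under stated assumptions, not the authors' implementation. It shows the general idea of encoding an image embedding into sparse autoencoder latents, gating those latents with a mask computed from a caption embedding, and decoding the result. The names `ToySAE`, `ToyMasker`, and `edit_image_embedding` are hypothetical, the sigmoid gate is an assumed form for the mask, and the weights are random rather than trained.

```python
import math
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


class ToySAE:
    """Toy sparse autoencoder: z = ReLU(W_enc x + b_enc), x_hat = W_dec z.

    Assumption: a plain ReLU autoencoder with random, untrained weights.
    """

    def __init__(self, dim, n_latents, rng):
        self.W_enc = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_latents)]
        self.b_enc = [0.0] * n_latents
        self.W_dec = [[rng.gauss(0, 1) for _ in range(n_latents)] for _ in range(dim)]

    def encode(self, x):
        pre = matvec(self.W_enc, x)
        return [max(0.0, p + b) for p, b in zip(pre, self.b_enc)]

    def decode(self, z):
        return matvec(self.W_dec, z)


class ToyMasker:
    """Hypothetical masking module: caption embedding -> one gate per latent."""

    def __init__(self, text_dim, n_latents, rng):
        self.W = [[rng.gauss(0, 1) for _ in range(text_dim)] for _ in range(n_latents)]

    def gates(self, caption_emb):
        # Assumed form: a linear map followed by a sigmoid, so each gate is in [0, 1].
        return [sigmoid(a) for a in matvec(self.W, caption_emb)]


def edit_image_embedding(sae, masker, image_emb, caption_emb):
    """Return the edited embedding plus the latents before and after masking."""
    z = sae.encode(image_emb)
    g = masker.gates(caption_emb)
    z_masked = [zi * gi for zi, gi in zip(z, g)]
    return sae.decode(z_masked), z, z_masked


if __name__ == "__main__":
    rng = random.Random(0)
    sae = ToySAE(dim=4, n_latents=8, rng=rng)
    masker = ToyMasker(text_dim=4, n_latents=8, rng=rng)
    image_emb = [0.5, -1.0, 0.25, 2.0]    # stand-in for a CLIP image embedding
    caption_emb = [1.0, 0.0, -0.5, 0.3]   # stand-in for a CLIP text embedding
    edited, z, z_masked = edit_image_embedding(sae, masker, image_emb, caption_emb)
    print("active latents:", sum(1 for v in z if v > 0))
    print("edited embedding:", edited)
```

In a trained system, the gates would presumably be pushed toward keeping the latents that correspond to caption-described attributes and suppressing the rest. With random weights, as here, the sketch only demonstrates the data flow.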

Key Contributions

This paper is cross-listed under the following arXiv categories:

cs.CV
cs.AI
cs.CL
cs.LG

Methodology

Please refer to the full paper for detailed methodology.

Practical Implications

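This research contributes to computer vision (cs.CV). According to the abstract, applying TEVI to CLIP models improves retrieval on short-caption (MS COCO, Flickr) and long-caption (IIW, DOCCI) benchmarks, with stronger gains on richer captions, and improves robustness on RoCOCO.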

Authors

Sweta Mahajan
Sukrut Rao
Jiahao Xie
Alexander Koller
Bernt Schiele

Paper Information

arXiv ID: 2606.07451v1
Categories: cs.CV, cs.AI, cs.CL, cs.LG
Published: June 5, 2026
