[Paper] Quo Vadis, Visual In-Context Learning? A Unified Benchmark Across Domains and Tasks

Published: June 9, 2026 at 11:08 AM EDT

Source: arXiv - 2606.10967v1

Overview

Visual in-context learning has been proposed as a pathway towards dynamic models that generate predictions based on a provided context and can thereby adapt to new vision tasks at test time. Yet the evaluation of these models' adaptation capabilities has been limited to narrow setups that mainly mirror tasks or image domains from pre-training, for which real adaptation is not required. We address this gap by constructing a broad Visual In-Context BEnchmark (VIBE) with a focus on diverse imaging domains and a wide range of tasks. With this, we get a much clearer picture of the adaptive capabilities of visual in-context models when faced with new image and task distributions. We stress test six models on 14 datasets and 12 tasks (106 dataset-task combinations in total) and compare them under a unified, reproducible evaluation protocol in a one-shot setting. Our evaluation uncovers key insights on the state of visual in-context learning, including limitations, systematic failure modes, and promising directions. To foster broader evaluation, we will openly release our VIBE toolkit.

Key Contributions

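According to the abstract, the paper contributes the following (arXiv category: cs.CV):

VIBE, a broad Visual In-Context BEnchmark covering diverse imaging domains and a wide range of tasks
A stress test of six models on 14 datasets and 12 tasks, for 106 dataset-task combinations in total
A unified, reproducible evaluation protocol in a one-shot setting
An analysis of limitations, systematic failure modes, and promising directions for visual in-context learning
A planned open release of the VIBE toolkit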

Methodology

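The abstract describes the setup only at a high level: six visual in-context models are compared on 106 dataset-task combinations, drawn from 14 datasets and 12 tasks, under a single reproducible protocol in a one-shot setting. Please refer to the full paper for the detailed methodology.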
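
To make the one-shot protocol concrete, here is a minimal Python sketch of what such an evaluation loop might look like. It is not the VIBE toolkit or the authors' code. The model interface, the `evaluate_one_shot` function, the metric, and the toy data are all hypothetical, and using the first example of each combination as the context is an assumption made only for illustration.

```python
from statistics import mean
from typing import Callable, Dict, List, Tuple

# Plain lists of numbers stand in for images and prediction targets.
Example = Tuple[List[float], List[float]]
# Hypothetical interface: (context_input, context_target, query_input) -> prediction
InContextModel = Callable[[List[float], List[float], List[float]], List[float]]


def mean_abs_error(pred: List[float], target: List[float]) -> float:
    """Illustrative metric; a real benchmark would use task-specific metrics."""
    return mean(abs(p - t) for p, t in zip(pred, target))


def evaluate_one_shot(
    model: InContextModel,
    benchmark: Dict[Tuple[str, str], List[Example]],
    metric: Callable[[List[float], List[float]], float] = mean_abs_error,
) -> Dict[Tuple[str, str], float]:
    """Score a model on every listed (dataset, task) pair.

    Only pairs present in `benchmark` are evaluated, so not every dataset
    has to support every task. Assumption: the first example is the single
    context pair and all remaining examples are queries.
    """
    scores = {}
    for (dataset, task), examples in benchmark.items():
        if len(examples) < 2:
            raise ValueError(f"{dataset}/{task} needs one context and one query")
        context_in, context_out = examples[0]
        per_query = [
            metric(model(context_in, context_out, query_in), query_out)
            for query_in, query_out in examples[1:]
        ]
        scores[(dataset, task)] = mean(per_query)
    return scores


def copy_context_baseline(context_in, context_out, query_in):
    """Trivial baseline: ignore the query and repeat the context target."""
    return list(context_out)


# Toy data, invented for this sketch: the "task" flips 0 and 1.
toy_benchmark = {
    ("toy_dataset", "invert"): [
        ([0.0, 1.0], [1.0, 0.0]),  # context pair
        ([1.0, 1.0], [0.0, 0.0]),  # query 1
        ([0.0, 0.0], [1.0, 1.0]),  # query 2
    ],
}

scores = evaluate_one_shot(copy_context_baseline, toy_benchmark)
print(scores)
```

Keying the benchmark by (dataset, task) pairs rather than taking a full cross product mirrors the abstract's numbers: 14 datasets and 12 tasks would give 168 pairs if every dataset supported every task, but only 106 combinations are explored.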

Practical Implications

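By testing models on image and task distributions that differ from pre-training, the benchmark gives a clearer picture of how well visual in-context models actually adapt. The authors plan to release the VIBE toolkit openly so that others can run the same evaluation.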

Authors

Pradnya Halady
Jiale Wei
Zdravko Marinov
Alexander Jaus
Simon Reiß

Paper Information

arXiv ID: 2606.10967v1
Categories: cs.CV
Published: June 9, 2026
