[Paper] Quo Vadis, Visual In-Context Learning? A Unified Benchmark Across Domains and Tasks

Published: June 9, 2026 at 11:08 AM EDT
Source: arXiv

Overview

Visual in-context learning has been proposed as a pathway towards dynamic models that generate predictions based on a provided context and can thereby adapt to new vision tasks at test time. Yet the evaluation of these models' adaptation capabilities has been limited to narrow setups that mainly mirror tasks or image domains from pre-training, for which real adaptation is not required. We address this gap by constructing a broad Visual In-Context BEnchmark (VIBE) with a focus on diverse imaging domains and a wide range of tasks. With this, we get a much clearer picture of the adaptive capabilities of visual in-context models when faced with new image and task distributions. We stress test six models on 14 datasets and 12 tasks (106 dataset-task combinations in total) and compare them under a unified, reproducible evaluation protocol in a one-shot setting. Our evaluation uncovers key insights on the state of visual in-context learning, including limitations, systematic failure modes, and promising directions. To foster broader evaluation, we will openly release our VIBE toolkit.

Key Contributions

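The paper's main contributions, as stated in its abstract:

  • VIBE, a broad Visual In-Context BEnchmark covering diverse imaging domains and a wide range of tasks
  • A stress test of six visual in-context models on 14 datasets and 12 tasks (106 dataset-task combinations)
  • A unified, reproducible evaluation protocol in a one-shot setting
  • An analysis of limitations, systematic failure modes, and promising directions
  • A planned open release of the VIBE toolkit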

Methodology

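The authors evaluate six models under a unified, reproducible one-shot protocol across 106 dataset-task combinations drawn from 14 datasets and 12 tasks. Please refer to the full paper for the detailed methodology.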
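
To make the one-shot setting concrete, here is a minimal sketch of what such an evaluation loop could look like. It is not the VIBE toolkit or the authors' protocol: the function name, the type aliases, the choice of the first sample as context, and the averaging of a single metric are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Tuple

# Every name below is hypothetical and chosen for illustration only;
# none of them come from the paper or from the VIBE toolkit.
Sample = Tuple[object, object]                      # (image, target)
Model = Callable[[object, object, object], object]  # (ctx_image, ctx_target, query_image) -> prediction
Metric = Callable[[object, object], float]          # (prediction, target) -> score
Combo = Tuple[str, str]                             # (dataset name, task name)


def evaluate_one_shot(
    model: Model,
    combos: Dict[Combo, List[Sample]],
    metric: Metric,
) -> Dict[Combo, float]:
    """Return the mean metric per dataset-task combination.

    Assumption: the first sample of each combination serves as the single
    in-context example, and all remaining samples are queries.
    """
    scores: Dict[Combo, float] = {}
    for combo, samples in combos.items():
        if len(samples) < 2:
            raise ValueError(f"{combo} needs one context sample and at least one query")
        context_image, context_target = samples[0]
        queries = samples[1:]
        total = 0.0
        for query_image, query_target in queries:
            # One-shot: the model sees exactly one (image, target) pair as context.
            prediction = model(context_image, context_target, query_image)
            total += metric(prediction, query_target)
        scores[combo] = total / len(queries)
    return scores
```

Keying the results by (dataset, task) pair keeps each combination's score separate, which is what allows per-domain and per-task failure modes to be compared rather than hidden in one aggregate number.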

Practical Implications

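By testing models on image and task distributions that differ from pre-training, VIBE gives a clearer picture of how well visual in-context models actually adapt. The planned open release of the VIBE toolkit is intended to foster broader evaluation.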

Authors

  • Pradnya Halady
  • Jiale Wei
  • Zdravko Marinov
  • Alexander Jaus
  • Simon Reiß

Paper Information

  • arXiv ID: 2606.10967v1
  • Categories: cs.CV
  • Published: June 9, 2026