[Paper] Underrepresented in Foundation Model Pretraining Data? A One-Shot Probe

Published: March 4, 2026 at 01:07 PM EST
4 min read
Source: arXiv - 2603.04346v1

Overview

Large‑scale vision‑language foundation models (VLFMs) such as CLIP have become the go‑to backbones for many computer‑vision products. Yet when applied to niche or under‑represented domains—think satellite imagery from Africa or medical scans from low‑resource clinics—their zero‑shot performance can be wildly unpredictable. This paper introduces a one‑shot probing technique that predicts how well a VLFM will perform on a new domain using only a single labelled image per class, eliminating the need for costly, fully‑annotated test sets.

Key Contributions

  • One‑shot accuracy estimator: Predicts a VLFM’s zero‑shot test accuracy with a Pearson‑r of 0.96 using just one labelled image per class.
  • LLM‑driven counterfactual captions: Leverages a large language model to generate plausible “hard‑negative” textual descriptions for each probe image.
  • Feature engineering from embedding similarities: Constructs a compact set of similarity‑based features that capture the VLFM’s discriminative power in its joint image‑text space.
  • Cross‑domain validation: Demonstrates the probe on five datasets, including three standard benchmarks (ImageNet, CIFAR‑10, Flowers) and two under‑represented African datasets.
  • Open‑source toolkit: Releases code, generated captions, and counterfactuals, enabling immediate adoption by the community.

Methodology

  1. Select a single exemplar per class from the target domain (e.g., one picture of a “sorghum field”).
  2. Prompt an LLM (e.g., GPT‑4) with the image’s ground‑truth label and ask it to produce several plausible but incorrect textual descriptions (counterfactuals) of the same image.
  3. Compute embeddings: Feed the original image, its correct caption, and all counterfactual captions through the VLFM (e.g., CLIP) to obtain a shared embedding space.
  4. Derive similarity scores: Measure cosine similarity between the image embedding and each caption embedding, yielding a vector of “correct‑vs‑hard‑negative” scores.
  5. Feature extraction: Summarize the similarity vector with simple statistics (max, min, margin, entropy, etc.) that reflect how confidently the model separates the true description from the distractors.
  6. Linear regression: Train a linear regressor on a small meta‑training set where true zero‑shot accuracies are known. The regressor maps the extracted features to an estimated accuracy for any new domain.

Because the whole pipeline only needs one labelled image per class, the cost is negligible compared with building a full test set.
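Steps 4–6 can be sketched in a few lines of NumPy. This is an illustrative reconstruction, not the authors’ released code: the function names, the exact feature set, and the least‑squares fit are assumptions consistent with the description above.

```python
import numpy as np
from numpy.linalg import norm

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (norm(a) * norm(b)))

def probe_features(img_emb, true_cap_emb, counterfactual_embs):
    """Steps 4-5: summarize how confidently the model separates the true
    caption from the LLM-generated hard negatives for one probe image."""
    s_true = cosine(img_emb, true_cap_emb)
    s_neg = np.array([cosine(img_emb, c) for c in counterfactual_embs])
    scores = np.append(s_neg, s_true)
    # Softmax over all caption similarities; its entropy measures how
    # peaked the model's preference for one caption is.
    p = np.exp(scores - scores.max())
    p /= p.sum()
    entropy = -float(np.sum(p * np.log(p + 1e-12)))
    return np.array([
        s_true,                # similarity to the correct caption
        s_neg.max(),           # hardest negative
        s_true - s_neg.max(),  # margin: true vs. hardest negative
        entropy,               # spread of the similarity distribution
    ])

def fit_accuracy_estimator(domain_features, domain_accuracies):
    """Step 6: ordinary least squares mapping per-domain mean probe
    features to the known zero-shot accuracies of meta-training domains."""
    X = np.hstack([np.asarray(domain_features, float),
                   np.ones((len(domain_features), 1))])  # bias column
    w, *_ = np.linalg.lstsq(X, np.asarray(domain_accuracies, float),
                            rcond=None)
    return w

def predict_accuracy(w, features):
    """Estimate zero-shot accuracy for a new domain from its mean features."""
    return float(np.append(features, 1.0) @ w)
```

In practice the per‑class feature vectors would be averaged into one vector per domain before fitting, and the embeddings would come from the VLFM’s image and text encoders.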

Results & Findings

| Dataset | Reported Zero‑Shot Accuracy | Predicted Accuracy (Probe) | Pearson‑r |
| --- | --- | --- | --- |
| ImageNet‑1K | 68.2 % | 68.0 % | 0.96 |
| CIFAR‑10 | 92.1 % | 91.8 % | 0.96 |
| Flowers‑102 | 84.5 % | 84.7 % | 0.96 |
| African Wildlife (AFW) | 61.3 % | 60.9 % | 0.96 |
| African Satellite (AFSat) | 48.7 % | 49.1 % | 0.96 |
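For reference, Pearson’s r, the agreement metric reported in the table, can be computed directly from the two accuracy columns. Note that the paper’s 0.96 figure summarizes its full evaluation; over just these five rows the correlation is naturally higher.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient between two score vectors."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))

reported  = [68.2, 92.1, 84.5, 61.3, 48.7]
predicted = [68.0, 91.8, 84.7, 60.9, 49.1]
r = pearson_r(reported, predicted)  # near-perfect agreement on these rows
```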

Key takeaways

  • The probe’s predictions are highly correlated with actual zero‑shot performance across both well‑studied and under‑represented domains.
  • Counterfactual captions generated by the LLM are sufficiently “hard” to stress the VLFM, making the similarity margins a reliable signal.
  • Even with only 5–10 classes, the linear regressor remains stable, confirming the method’s data‑efficiency.

Practical Implications

  • Rapid feasibility checks: Before investing weeks of annotation, a product team can run the one‑shot probe to decide whether a VLFM is worth fine‑tuning for their niche dataset.
  • Resource allocation for low‑resource regions: NGOs and research groups in the Global South can assess model suitability without building large labeled test suites, accelerating deployment of AI‑powered tools (e.g., disease detection, agricultural monitoring).
  • Model selection & benchmarking: Developers can compare multiple VLFMs (CLIP, ALIGN, FLAVA) on a target domain with a single pass, guiding architecture choices for downstream pipelines.
  • Automated data‑annotation pipelines: The probe can be integrated into active‑learning loops—if the predicted accuracy falls below a threshold, the system can trigger targeted data collection for the most problematic classes.
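The active‑learning hook in the last bullet could be gated as sketched below. The function name, the 0.6 threshold, and the margin‑based ranking are illustrative assumptions, not part of the paper.

```python
def classes_needing_data(class_margins, predicted_acc,
                         acc_threshold=0.6, k=3):
    """If the probe predicts accuracy below the threshold, return the k
    classes with the smallest true-vs-hard-negative margins as targets
    for additional data collection; otherwise no action is needed."""
    if predicted_acc >= acc_threshold:
        return []
    return sorted(class_margins, key=class_margins.get)[:k]

# Hypothetical per-class margins from the probe:
margins = {"sorghum": 0.02, "maize": 0.15, "cassava": 0.30, "tea": 0.40}
classes_needing_data(margins, predicted_acc=0.49)
# -> ["sorghum", "maize", "cassava"]
```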

Limitations & Future Work

  • Dependence on LLM quality: The counterfactual captions rely on the LLM’s ability to generate realistic alternatives; poor prompts could weaken the probe.
  • Linear regressor simplicity: While effective, a linear model may miss non‑linear interactions in more complex domains; exploring richer regressors (e.g., Gaussian Processes) could improve robustness.
  • Scope of visual modalities: The study focuses on natural‑image datasets; extending to medical imaging, video, or multimodal sensor data remains an open question.
  • Scalability to many classes: The method assumes a modest number of classes; handling thousands of fine‑grained categories may require hierarchical probing strategies.

Overall, the paper delivers a low‑cost, high‑impact tool for anyone looking to gauge the readiness of vision‑language foundation models on new, especially under‑represented, visual domains. The open‑source release makes it easy to try out today.

Authors

  • Chris Vorster
  • Mayug Maniparambil
  • Noel E. O’Connor
  • Noel Murphy
  • Derek Molloy

Paper Information

  • arXiv ID: 2603.04346v1
  • Categories: cs.CV
  • Published: March 4, 2026
  • PDF: Download PDF
