[Paper] Hold-One-Shot-Out (HOSO) for Validation-Free Few-Shot CLIP Adapters
Source: arXiv - 2603.04341v1
Overview
The paper introduces Hold‑One‑Shot‑Out (HOSO), a lightweight trick that lets CLIP‑Adapter models automatically pick the right blending ratio between frozen CLIP knowledge and a few‑shot adapter—without needing a separate validation set. By reserving just a single example from the support set, HOSO learns this hyper‑parameter on‑the‑fly, delivering a consistent boost across a wide range of few‑shot vision benchmarks.
Key Contributions
- Validation‑free blending ratio: A one‑shot hold‑out scheme that learns the optimal mixing weight between CLIP and the adapter during training.
- HOSO‑Adapter: A concrete instantiation that plugs into any CLIP‑Adapter‑style method, requiring only a minor change to the training loop.
- Strong empirical gains: Improves average accuracy by more than 4 percentage points over the vanilla CLIP‑Adapter across 11 few‑shot datasets, and even beats the test‑set‑tuned "oracle" ratio in the 8‑ and 16‑shot regimes.
- Ablation insights: Demonstrates that decoupling the blending‑ratio learning from adapter training and using a single hold‑out example are both essential for the performance lift.
- Open‑source release: Full code and reproducibility scripts are provided (https://github.com/chris-vorster/HOSO-Adapter).
Methodology
- Standard CLIP-Adapter recap: The base model keeps CLIP's image and text encoders frozen and learns a lightweight linear adapter on top of the image features. A blending ratio α mixes the original CLIP logits with the adapter logits:

  \[ \text{logits} = (1-\alpha) \cdot \text{logits}_{\text{CLIP}} + \alpha \cdot \text{logits}_{\text{Adapter}} \]

  Choosing α is critical: too low and the model ignores the few-shot data; too high and it overfits.
- Hold-One-Shot-Out (HOSO) trick:
  - From the K-shot support set, reserve one example per class as a hold-out (the "one shot").
  - Train the adapter on the remaining K−1 examples per class (the training split).
  - Simultaneously, treat α as a learnable scalar and optimize it on the hold-out split with the same cross-entropy loss. Because the hold-out split is disjoint from the training split, α is tuned without peeking at the test data.
  - After training, discard the hold-out examples; at inference time the model uses the learned α for the whole dataset.
- Decoupled training: The adapter weights and α are updated in separate gradient steps (or with separate learning rates) so that the optimizer does not collapse α to a trivial value.
The whole pipeline adds only a few lines of code and no extra hyper‑parameters beyond the usual learning rate schedule.
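The recipe above can be sketched in a few lines. This is a hypothetical toy implementation, not the authors' code: the frozen-CLIP and adapter logits are random stand-ins, and α is chosen by a simple grid search over the hold-out loss (the paper learns it by gradient descent, but for a single scalar either works).

```python
# Toy sketch of the HOSO recipe (illustrative only, not the authors' code).
import numpy as np

rng = np.random.default_rng(0)

def softmax_xent(logits, labels):
    """Mean cross-entropy of logits against integer class labels."""
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

num_classes, k_shot = 5, 4
labels = np.repeat(np.arange(num_classes), k_shot)  # K-shot support set

# Step 1: reserve one example per class as the hold-out split.
holdout_idx = np.array([np.flatnonzero(labels == c)[0] for c in range(num_classes)])
train_mask = np.ones(len(labels), dtype=bool)
train_mask[holdout_idx] = False  # adapter trains on the remaining K-1 shots

# Stand-ins for the frozen CLIP logits and the trained adapter's logits.
# In a real pipeline these come from the frozen encoders and the adapter.
clip_logits = rng.normal(size=(len(labels), num_classes))
adapter_logits = rng.normal(size=(len(labels), num_classes))

# Step 2: pick alpha by minimizing cross-entropy on the hold-out split only,
# so the test set is never touched.
alphas = np.linspace(0.0, 1.0, 101)
losses = [
    softmax_xent((1 - a) * clip_logits[holdout_idx] + a * adapter_logits[holdout_idx],
                 labels[holdout_idx])
    for a in alphas
]
alpha = alphas[int(np.argmin(losses))]
```

Note how the split in step 1 is the only bookkeeping HOSO adds on top of a standard adapter training loop; everything else is the usual cross-entropy machinery.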
Results & Findings
| Setting | Avg. Accuracy (baseline CLIP‑Adapter) | Avg. Accuracy (HOSO‑Adapter) | Δ (pp) |
|---|---|---|---|
| 4‑shot | 71.2 % | 75.6 % | +4.4 |
| 8‑shot | 73.8 % | 78.1 % | +4.3 |
| 16‑shot | 75.5 % | 79.9 % | +4.4 |
- Oracle comparison: When the baseline’s α is tuned on the test set (an unrealistic “oracle” scenario), HOSO‑Adapter still matches or exceeds it in the 8‑ and 16‑shot regimes.
- Ablations:
- Removing the hold‑out (learning α on the same K‑1 training examples) drops performance by roughly 2 percentage points.
- Jointly updating adapter and α (no decoupling) leads to unstable training and lower final accuracy.
- Using more than one hold‑out example per class yields diminishing returns, confirming that a single shot is sufficient.
Overall, the experiments confirm that a single validation‑free example per class is enough to calibrate the blending ratio reliably.
Practical Implications
- Zero‑validation few‑shot pipelines: Teams can now deploy CLIP‑based adapters in environments where a validation split is unavailable (e.g., on‑device personalization, rapid prototyping, or privacy‑sensitive domains).
- Reduced hyper‑parameter tuning overhead: No need to run grid searches for α across datasets; the model self‑tunes during the few‑shot training phase.
- Plug‑and‑play for existing adapters: Since HOSO only modifies the training loop, any CLIP‑Adapter implementation (or similar linear‑probe methods) can adopt it with minimal code changes.
- Faster iteration cycles: Developers can train a few‑shot model in a single pass and immediately evaluate on the target task, accelerating product development for vision‑AI features such as custom image classifiers, domain‑specific search, or on‑the‑fly label expansion.
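To make the plug-and-play point concrete, here is a minimal sketch of what inference looks like once α has been learned. The function name `blended_predict` and the toy logit values are illustrative assumptions, not an API from the paper.

```python
# Minimal sketch (assumed names): applying a HOSO-learned alpha at inference.
import numpy as np

def blended_predict(clip_logits, adapter_logits, alpha):
    """Mix frozen-CLIP and adapter logits with the learned blending ratio."""
    return (1 - alpha) * clip_logits + alpha * adapter_logits

# Toy logits for two images over two classes.
clip_logits = np.array([[2.0, 0.5], [0.1, 1.2]])
adapter_logits = np.array([[0.0, 1.0], [1.5, 0.2]])

preds = blended_predict(clip_logits, adapter_logits, alpha=0.4).argmax(axis=1)
# -> array([0, 1]): class predictions after blending
```

Because the blend is a single weighted sum of logits, dropping it into an existing CLIP-Adapter inference path changes one line.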
Limitations & Future Work
- One‑shot hold‑out assumption: The method assumes at least one labeled example per class; truly zero‑shot scenarios remain out of scope.
- Scalability to many classes: With hundreds of classes, reserving one example per class reduces the effective training data, which could hurt performance on extremely low‑shot regimes.
- Extension beyond linear adapters: The paper focuses on CLIP‑Adapter; applying HOSO to more complex fine‑tuning strategies (e.g., LoRA, prompt tuning) is left for future investigation.
- Theoretical analysis: While empirical results are strong, a deeper theoretical justification for why a single hold‑out suffices is an open research direction.
Bottom line: HOSO offers a pragmatic, validation‑free solution for few‑shot CLIP adaptation, turning a cumbersome hyper‑parameter search into a trivial one‑shot hold‑out step—an attractive addition to any developer’s vision‑AI toolkit.
Authors
- Chris Vorster
- Mayug Maniparambil
- Noel E. O’Connor
- Noel Murphy
- Derek Molloy
Paper Information
- arXiv ID: 2603.04341v1
- Categories: cs.CV
- Published: March 4, 2026