[Paper] Hold-One-Shot-Out (HOSO) for Validation-Free Few-Shot CLIP Adapters
Source: arXiv - 2603.04341v1
Overview
The paper introduces Hold‑One‑Shot‑Out (HOSO), a lightweight trick that lets CLIP‑Adapter models automatically pick the right blending ratio between frozen CLIP knowledge and a few‑shot adapter—without needing a separate validation set. By reserving just a single example from the support set, HOSO learns this hyper‑parameter on‑the‑fly, delivering a consistent boost across a wide range of few‑shot vision benchmarks.
Key Contributions
- Validation‑free blending ratio: A one‑shot hold‑out scheme that learns the optimal mixing weight between CLIP and the adapter during training.
- HOSO‑Adapter: A concrete instantiation that plugs into any CLIP‑Adapter‑style method, requiring only a minor change to the training loop.
- Strong empirical gains: Improves average accuracy by more than 4 percentage points over the vanilla CLIP‑Adapter across 11 few‑shot datasets, and even beats the test‑set‑tuned "oracle" ratio in the 8‑ and 16‑shot regimes.
- Ablation insights: Demonstrates that decoupling the blending‑ratio learning from adapter training and using a single hold‑out example are both essential for the performance lift.
- Open‑source release: Full code and reproducibility scripts are provided (https://github.com/chris-vorster/HOSO-Adapter).
Methodology
- Standard CLIP-Adapter recap: The base model keeps CLIP's image and text encoders frozen and learns a lightweight linear adapter on top of the image features. A blending ratio α mixes the original CLIP logits with the adapter logits:

  \[ \text{logits} = (1-\alpha) \cdot \text{logits}_{\text{CLIP}} + \alpha \cdot \text{logits}_{\text{Adapter}} \]

  Choosing α is critical: too low and the model ignores the few-shot data; too high and it overfits.
- Hold-One-Shot-Out (HOSO) trick:
  - From the K-shot support set, reserve one example per class as a hold-out (the "one shot").
  - Train the adapter on the remaining K−1 examples per class (the training split).
  - Simultaneously, treat α as a learnable scalar and optimize it on the hold-out split with the same cross-entropy loss. Because the hold-out split is disjoint from the training split, α is tuned without peeking at the test data.
  - After training, discard the hold-out examples; at inference time the model uses the learned α for the whole dataset.
- Decoupled training: The adapter weights and α are updated in separate gradient steps (or with separate learning rates) so that the optimizer does not collapse α to a trivial value.
The whole pipeline adds only a few lines of code and no extra hyper‑parameters beyond the usual learning rate schedule.
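The recipe above can be sketched in a few lines. This is a hypothetical toy implementation, not the authors' code: the frozen-CLIP and adapter logits are random stand-ins, and α is chosen by a simple grid search over the hold-out loss (the paper learns it by gradient descent, but for a single scalar either works).

```python
# Toy sketch of the HOSO recipe (illustrative only, not the authors' code).
import numpy as np

rng = np.random.default_rng(0)

def softmax_xent(logits, labels):
    """Mean cross-entropy of logits against integer class labels."""
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

num_classes, k_shot = 5, 4
labels = np.repeat(np.arange(num_classes), k_shot)  # K-shot support set

# Step 1: reserve one example per class as the hold-out split.
holdout_idx = np.array([np.flatnonzero(labels == c)[0] for c in range(num_classes)])
train_mask = np.ones(len(labels), dtype=bool)
train_mask[holdout_idx] = False  # adapter trains on the remaining K-1 shots

# Stand-ins for the frozen CLIP logits and the trained adapter's logits.
# In a real pipeline these come from the frozen encoders and the adapter.
clip_logits = rng.normal(size=(len(labels), num_classes))
adapter_logits = rng.normal(size=(len(labels), num_classes))

# Step 2: pick alpha by minimizing cross-entropy on the hold-out split only,
# so the test set is never touched.
alphas = np.linspace(0.0, 1.0, 101)
losses = [
    softmax_xent((1 - a) * clip_logits[holdout_idx] + a * adapter_logits[holdout_idx],
                 labels[holdout_idx])
    for a in alphas
]
alpha = alphas[int(np.argmin(losses))]
```

Note how the split in step 1 is the only bookkeeping HOSO adds on top of a standard adapter training loop; everything else is the usual cross-entropy machinery.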
Results & Findings
| Setting | Avg. Accuracy (baseline CLIP‑Adapter) | Avg. Accuracy (HOSO‑Adapter) | Δ (pp) |
|---|---|---|---|
| 4‑shot | 71.2 % | 75.6 % | +4.4 |
| 8‑shot | 73.8 % | 78.1 % | +4.3 |
| 16‑shot | 75.5 % | 79.9 % | +4.4 |
- Oracle comparison: When the baseline’s α is tuned on the test set (an unrealistic “oracle” scenario), HOSO‑Adapter still matches or exceeds it in the 8‑ and 16‑shot regimes.
- Ablations:
- Removing the hold‑out (learning α on the same K‑1 training examples) drops performance by roughly 2 percentage points.
- Jointly updating adapter and α (no decoupling) leads to unstable training and lower final accuracy.
- Using more than one hold‑out example per class yields diminishing returns, confirming that a single shot is sufficient.
Overall, the experiments confirm that a single validation‑free example per class is enough to calibrate the blending ratio reliably.
Practical Implications
- Zero‑validation few‑shot pipelines: Teams can now deploy CLIP‑based adapters in environments where a validation split is unavailable (e.g., on‑device personalization, rapid prototyping, or privacy‑sensitive domains).
- Reduced hyper‑parameter tuning overhead: No need to run grid searches for α across datasets; the model self‑tunes during the few‑shot training phase.
- Plug‑and‑play for existing adapters: Since HOSO only modifies the training loop, any CLIP‑Adapter implementation (or similar linear‑probe methods) can adopt it with minimal code changes.
- Faster iteration cycles: Developers can train a few‑shot model in a single pass and immediately evaluate on the target task, accelerating product development for vision‑AI features such as custom image classifiers, domain‑specific search, or on‑the‑fly label expansion.
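To make the plug-and-play point concrete, here is a minimal sketch of what inference looks like once α has been learned. The function name `blended_predict` and the toy logit values are illustrative assumptions, not an API from the paper.

```python
# Minimal sketch (assumed names): applying a HOSO-learned alpha at inference.
import numpy as np

def blended_predict(clip_logits, adapter_logits, alpha):
    """Mix frozen-CLIP and adapter logits with the learned blending ratio."""
    return (1 - alpha) * clip_logits + alpha * adapter_logits

# Toy logits for two images over two classes.
clip_logits = np.array([[2.0, 0.5], [0.1, 1.2]])
adapter_logits = np.array([[0.0, 1.0], [1.5, 0.2]])

preds = blended_predict(clip_logits, adapter_logits, alpha=0.4).argmax(axis=1)
# -> array([0, 1]): class predictions after blending
```

Because the blend is a single weighted sum of logits, dropping it into an existing CLIP-Adapter inference path changes one line.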
Limitations & Future Work
- One‑shot hold‑out assumption: The method assumes at least one labeled example per class; truly zero‑shot scenarios remain out of scope.
- Scalability to many classes: With hundreds of classes, reserving one example per class reduces the effective training data, which could hurt performance on extremely low‑shot regimes.
- Extension beyond linear adapters: The paper focuses on CLIP‑Adapter; applying HOSO to more complex fine‑tuning strategies (e.g., LoRA, prompt tuning) is left for future investigation.
- Theoretical analysis: While empirical results are strong, a deeper theoretical justification for why a single hold‑out suffices is an open research direction.
Bottom line: HOSO offers a pragmatic, validation‑free solution for few‑shot CLIP adaptation, turning a cumbersome hyper‑parameter search into a trivial one‑shot hold‑out step—an attractive addition to any developer’s vision‑AI toolkit.
Authors
- Chris Vorster
- Mayug Maniparambil
- Noel E. O’Connor
- Noel Murphy
- Derek Molloy
Paper Information
- arXiv ID: 2603.04341v1
- Categories: cs.CV
- Published: March 4, 2026