[Paper] Hold-One-Shot-Out (HOSO) for Validation-Free Few-Shot CLIP Adapters

Published: March 4, 2026 at 12:59 PM EST
5 min read
Source: arXiv - 2603.04341v1

Overview

The paper introduces Hold‑One‑Shot‑Out (HOSO), a lightweight trick that lets CLIP‑Adapter models automatically pick the right blending ratio between frozen CLIP knowledge and a few‑shot adapter—without needing a separate validation set. By reserving just a single example from the support set, HOSO learns this hyper‑parameter on‑the‑fly, delivering a consistent boost across a wide range of few‑shot vision benchmarks.

Key Contributions

  • Validation‑free blending ratio: A one‑shot hold‑out scheme that learns the optimal mixing weight between CLIP and the adapter during training.
  • HOSO‑Adapter: A concrete instantiation that plugs into any CLIP‑Adapter‑style method, requiring only a minor change to the training loop.
  • Strong empirical gains: Improves average accuracy by >4 % over the vanilla CLIP‑Adapter across 11 few‑shot datasets, and even beats the “oracle” test‑set tuned ratio in 8‑ and 16‑shot regimes.
  • Ablation insights: Demonstrates that decoupling the blending‑ratio learning from adapter training and using a single hold‑out example are both essential for the performance lift.
  • Open‑source release: Full code and reproducibility scripts are provided (https://github.com/chris-vorster/HOSO-Adapter).

Methodology

  1. Standard CLIP‑Adapter recap – The base model keeps CLIP’s image and text encoders frozen and learns a lightweight linear adapter on top of the image features. A blending ratio α mixes the original CLIP logits with the adapter logits:
    logits = (1 − α) · logits_CLIP + α · logits_Adapter
    Choosing α is critical; too low and the model ignores the few‑shot data, too high and it overfits.

  2. Hold‑One‑Shot‑Out (HOSO) trick

    • From the K‑shot support set, reserve one example per class as a hold‑out (the “one‑shot”).
    • Train the adapter on the remaining K‑1 examples per class (the training split).
    • Simultaneously, treat α as a learnable scalar and optimize it on the hold‑out split using the same loss (cross‑entropy). Because the hold‑out set is disjoint from the training split, α is tuned without peeking at the test data.
    • After training, discard the hold‑out examples; at inference time the model uses the learned α for the whole dataset.
  3. Decoupled training – The adapter weights and α are updated in separate gradient steps (or with separate learning rates) to avoid the optimizer collapsing α to a trivial value.

The whole pipeline adds only a few lines of code and no extra hyper‑parameters beyond the usual learning rate schedule.
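
The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' released code: `hoso_split` reserves one index per class as the hold-out split, and `fit_alpha` learns the blending ratio on that split alone, here via a numerical gradient on a sigmoid-parameterized α (the paper's exact optimizer and parameterization may differ).

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(logits, labels):
    p = softmax(logits)
    return -np.mean(np.log(p[np.arange(len(labels)), labels] + 1e-12))

def hoso_split(labels, seed=0):
    """Reserve one example per class as the hold-out ('one-shot') split."""
    rng = np.random.default_rng(seed)
    hold = np.array([rng.choice(np.flatnonzero(labels == c))
                     for c in np.unique(labels)])
    train = np.setdiff1d(np.arange(len(labels)), hold)
    return train, hold

def fit_alpha(clip_logits, adapter_logits, labels, steps=200, lr=0.5):
    """Learn the blending ratio on the hold-out split only.

    alpha is squashed through a sigmoid so it stays in (0, 1);
    its logit is updated with a simple numerical gradient.
    """
    a = 0.0  # logit of alpha; starts at alpha = 0.5

    def loss_at(av):
        al = 1.0 / (1.0 + np.exp(-av))
        blended = (1.0 - al) * clip_logits + al * adapter_logits
        return cross_entropy(blended, labels)

    eps = 1e-4
    for _ in range(steps):
        g = (loss_at(a + eps) - loss_at(a - eps)) / (2 * eps)
        a -= lr * g
    return 1.0 / (1.0 + np.exp(-a))
```

On a toy case where the adapter's logits are correct and CLIP's are not, `fit_alpha` pushes α toward 1; when CLIP is already right, it stays low — the self-tuning behaviour the method relies on.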

Results & Findings

| Setting | Avg. Accuracy (baseline CLIP‑Adapter) | Avg. Accuracy (HOSO‑Adapter) | Δ |
|---------|---------------------------------------|------------------------------|---|
| 4‑shot  | 71.2 % | 75.6 % | +4.4 % |
| 8‑shot  | 73.8 % | 78.1 % | +4.3 % |
| 16‑shot | 75.5 % | 79.9 % | +4.4 % |

  • Oracle comparison: When the baseline’s α is tuned on the test set (an unrealistic “oracle” scenario), HOSO‑Adapter still matches or exceeds it in the 8‑ and 16‑shot regimes.
  • Ablations:
    • Removing the hold‑out (learning α on the same K‑1 examples) drops performance by ~2 %.
    • Jointly updating adapter and α (no decoupling) leads to unstable training and lower final accuracy.
    • Using more than one hold‑out example per class yields diminishing returns, confirming that a single shot is sufficient.

Overall, the experiments confirm that a single validation‑free example per class is enough to calibrate the blending ratio reliably.

Practical Implications

  • Zero‑validation few‑shot pipelines: Teams can now deploy CLIP‑based adapters in environments where a validation split is unavailable (e.g., on‑device personalization, rapid prototyping, or privacy‑sensitive domains).
  • Reduced hyper‑parameter tuning overhead: No need to run grid searches for α across datasets; the model self‑tunes during the few‑shot training phase.
  • Plug‑and‑play for existing adapters: Since HOSO only modifies the training loop, any CLIP‑Adapter implementation (or similar linear‑probe methods) can adopt it with minimal code changes.
  • Faster iteration cycles: Developers can train a few‑shot model in a single pass and immediately evaluate on the target task, accelerating product development for vision‑AI features such as custom image classifiers, domain‑specific search, or on‑the‑fly label expansion.
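
As an illustration of the plug-and-play point above, the only inference-time change is the blending line itself. This is a sketch with hypothetical names, not code from the paper:

```python
import numpy as np

def blended_predict(clip_logits, adapter_logits, alpha):
    """Mix frozen CLIP logits with adapter logits using the learned alpha."""
    logits = (1.0 - alpha) * clip_logits + alpha * adapter_logits
    return logits.argmax(axis=-1)
```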

Limitations & Future Work

  • One‑shot hold‑out assumption: The method assumes at least one labeled example per class; truly zero‑shot scenarios remain out of scope.
  • Scalability to many classes: With hundreds of classes, reserving one example per class reduces the effective training data, which could hurt performance on extremely low‑shot regimes.
  • Extension beyond linear adapters: The paper focuses on CLIP‑Adapter; applying HOSO to more complex fine‑tuning strategies (e.g., LoRA, prompt tuning) is left for future investigation.
  • Theoretical analysis: While empirical results are strong, a deeper theoretical justification for why a single hold‑out suffices is an open research direction.

Bottom line: HOSO offers a pragmatic, validation‑free solution for few‑shot CLIP adaptation, turning a cumbersome hyper‑parameter search into a trivial one‑shot hold‑out step—an attractive addition to any developer’s vision‑AI toolkit.

Authors

  • Chris Vorster
  • Mayug Maniparambil
  • Noel E. O’Connor
  • Noel Murphy
  • Derek Molloy

Paper Information

  • arXiv ID: 2603.04341v1
  • Categories: cs.CV
  • Published: March 4, 2026
  • PDF: Download PDF