[Paper] CAOS: Conformal Aggregation of One-Shot Predictors
Source: arXiv - 2601.05219v1
Overview
One‑shot prediction lets you fine‑tune a massive pre‑trained model to a brand‑new task with just a single labeled example. While this is a huge win for rapid prototyping, it leaves developers without reliable uncertainty estimates—something that’s crucial when decisions have downstream costs. The paper CAOS: Conformal Aggregation of One‑Shot Predictors introduces a new conformal inference framework that fills this gap, delivering statistically sound prediction sets even when you only have that one labeled datum.
Key Contributions
- CAOS framework: A novel conformal method that aggregates multiple one‑shot predictors instead of relying on a single model.
- Leave‑one‑out calibration: A clever calibration scheme that makes the most of the single labeled example, avoiding the data‑waste of traditional split‑conformal approaches.
- Theoretical guarantee: Proven marginal coverage via a monotonicity argument, even though the usual exchangeability assumptions are broken (the guarantee is stated just after this list).
- Empirical validation: Demonstrated on one‑shot facial landmark detection and RAFT text classification, showing tighter (smaller) prediction sets than standard baselines while preserving the promised coverage level.
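For reference, the marginal coverage property in the third bullet is the usual conformal guarantee; in standard notation (our phrasing, not a formula quoted from the paper), with $\alpha$ the target miscoverage level and $\hat{C}_\alpha$ the CAOS prediction set:

$$\Pr\big(Y_{\text{test}} \in \hat{C}_{\alpha}(X_{\text{test}})\big) \;\ge\; 1 - \alpha .$$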
Methodology
- Generate a pool of one‑shot predictors – Starting from a frozen foundation model, the authors train several lightweight adapters, each using the same single labeled example but with different random seeds, data augmentations, or hyper‑parameter tweaks.
- Aggregate predictions – For a new input, each adapter produces a point prediction (e.g., a set of facial landmarks). CAOS combines these predictions into a score that reflects how far a candidate output deviates from the ensemble.
- Leave‑one‑out calibration – The single labeled example is temporarily treated as a “test” point while the remaining adapters are used to compute calibration scores. This process is repeated for each adapter, yielding a full set of calibration scores without discarding any data.
- Construct prediction sets – Using the calibrated quantile, CAOS builds a set of outputs that, with high probability (e.g., 90 %), contains the true answer. The construction respects the monotonicity of the aggregation score, which is the key to the coverage proof; a code sketch of the full pipeline follows this list.
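Taken together, these steps form a short pipeline. The sketch below is a minimal illustration under stated assumptions rather than the authors' implementation: outputs are treated as vectors, the aggregation score is taken to be the Euclidean distance from a candidate output to the ensemble mean, `adapters` is a list of callables returning point predictions, and every name (`loo_calibration_scores`, `caos_prediction_set`, the toy adapters) is hypothetical.

```python
import numpy as np

def aggregation_score(candidate, adapter_preds):
    """One possible monotone aggregation score (an assumption, not necessarily the
    paper's exact choice): distance of a candidate output from the ensemble mean."""
    return float(np.linalg.norm(np.asarray(candidate) - adapter_preds.mean(axis=0)))

def loo_calibration_scores(adapters, x_labeled, y_labeled):
    """Leave-one-out calibration on the single labeled example: for each adapter k,
    the remaining K-1 adapters form the ensemble and the true label plays the role
    of a test point, yielding one calibration score per adapter."""
    preds = np.stack([f(x_labeled) for f in adapters])          # shape (K, d)
    return np.array([
        aggregation_score(y_labeled, np.delete(preds, k, axis=0))
        for k in range(len(adapters))
    ])

def caos_prediction_set(adapters, x_new, candidates, cal_scores, alpha=0.1):
    """Keep every candidate output whose aggregation score at x_new is at most the
    conformal quantile of the leave-one-out calibration scores."""
    K = len(cal_scores)
    rank = int(np.ceil((K + 1) * (1 - alpha)))
    # With very few calibration scores the quantile can be vacuous (rank > K),
    # in which case every candidate is kept.
    q_hat = np.inf if rank > K else np.sort(cal_scores)[rank - 1]
    preds = np.stack([f(x_new) for f in adapters])              # shape (K, d)
    return [c for c in candidates if aggregation_score(c, preds) <= q_hat]

# Toy usage: five "adapters" that each add a fixed random offset to the input.
rng = np.random.default_rng(0)
adapters = [(lambda x, b=rng.normal(0.0, 0.1, 2): x + b) for _ in range(5)]
x1, y1 = np.zeros(2), np.array([0.05, -0.02])                   # the single labeled example
cal = loo_calibration_scores(adapters, x1, y1)
grid = [np.array([i, j]) for i in np.linspace(0.2, 0.8, 7) for j in np.linspace(0.2, 0.8, 7)]
print(caos_prediction_set(adapters, np.array([0.5, 0.5]), grid, cal, alpha=0.2))
```

Because the score increases monotonically as a candidate moves away from the ensemble, thresholding it at a calibrated quantile produces nested sets, matching the monotonicity the construction above relies on. Note also that with only K leave-one-out scores the quantile is coarse: for a 90 % target, K of roughly ten or more is needed before the threshold in this sketch stops being vacuous.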
Results & Findings
| Task | Baseline (Split‑Conformal) | CAOS | Reduction in Set Size |
|---|---|---|---|
| One‑shot facial landmarking (5‑point) | 95 % coverage, avg. set radius 4.2 px | 95 % coverage, avg. set radius 2.8 px | ≈33 % smaller |
| RAFT text classification (sentiment) | 90 % coverage, avg. set cardinality 3.1 | 90 % coverage, avg. set cardinality 2.2 | ≈29 % smaller |
- Coverage stays at the nominal level (90–95 %) across all experiments, confirming the theoretical guarantee.
- Prediction sets are consistently tighter, meaning developers get more informative uncertainty bounds without sacrificing reliability (the metrics behind these numbers are sketched below).
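For context, the coverage and set-size columns are the standard conformal evaluation metrics. The snippet below is a minimal, illustrative way to compute them on held-out labeled data, assuming prediction sets are returned as candidate lists as in the earlier sketch (the function name `evaluate_prediction_sets` and the exact-match tolerance are assumptions, not the paper's evaluation code):

```python
import numpy as np

def evaluate_prediction_sets(prediction_sets, true_labels, tol=1e-9):
    """Empirical marginal coverage (fraction of test points whose true label lies
    inside its prediction set) and the average set size."""
    covered = [
        any(np.linalg.norm(np.asarray(c) - np.asarray(y)) <= tol for c in pred_set)
        for pred_set, y in zip(prediction_sets, true_labels)
    ]
    sizes = [len(pred_set) for pred_set in prediction_sets]
    return float(np.mean(covered)), float(np.mean(sizes))
```

The ≈33 % and ≈29 % figures in the table are then simply the relative drop in the second returned value at matched empirical coverage.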
Practical Implications
- Faster product iteration – Teams can deploy one‑shot fine‑tuned models with statistically valid prediction sets built in, reducing the need for costly data collection before launch.
- Safety‑critical systems – In domains like medical imaging or autonomous driving, CAOS‑derived sets can flag when a one‑shot model’s prediction is too ambiguous, prompting human review.
- Model‑agnostic tooling – Because CAOS works with any foundation model that can be adapted in a one‑shot fashion, it can be packaged as a plug‑in for popular ML libraries (e.g., Hugging Face Transformers, PyTorch Lightning).
- Resource efficiency – The leave‑one‑out calibration eliminates the need to reserve a validation split, saving precious labeled data and compute time.
Limitations & Future Work
- Scalability of the predictor pool – Generating many one‑shot adapters incurs extra compute; the paper explores modest pool sizes (5–10), but larger ensembles may be needed for very complex tasks.
- Assumption of monotonicity – The coverage proof hinges on a monotonic aggregation score, which may not hold for all types of predictors (e.g., highly non‑linear output spaces).
- Domain‑specific calibration – While the leave‑one‑out scheme works well for the studied tasks, extending CAOS to structured outputs (e.g., full segmentation maps) may require custom score functions.
- Future directions include adaptive pool sizing, integration with active learning loops to acquire additional labels when uncertainty remains high, and broader benchmarks across vision, speech, and reinforcement‑learning settings.
Authors
- Maja Waldron
Paper Information
- arXiv ID: 2601.05219v1
- Categories: stat.ML, cs.AI, cs.LG
- Published: January 8, 2026