[Paper] Calibratable Disambiguation Loss for Multi-Instance Partial-Label Learning
Source: arXiv - 2512.17788v1
Overview
Multi‑instance partial‑label learning (MIPL) tackles two layers of weak supervision: each training bag contains multiple instances, and each bag is annotated with a set of candidate labels rather than a single ground‑truth label. Existing MIPL methods achieve decent accuracy but often produce poorly calibrated probability estimates, which limits their usefulness in downstream systems that rely on reliable confidence scores (e.g., risk‑aware decision making, active learning). This paper introduces a Calibratable Disambiguation Loss (CDL) that can be dropped into any MIPL or PLL pipeline to boost both classification performance and the quality of the predicted probabilities.
Key Contributions
- Plug‑and‑play loss function – CDL works as a drop‑in replacement for the usual disambiguation loss, requiring no architectural changes to existing models.
- Two calibrated variants –
- CDL‑C leverages probabilities from the candidate label set only.
- CDL‑CC incorporates probabilities from both candidate and non‑candidate label sets, yielding tighter calibration.
- Theoretical guarantees – The authors prove a lower bound on the expected risk and show that CDL acts as a regularizer that penalizes over‑confident, mis‑disambiguated predictions.
- Extensive empirical validation – Experiments on standard MIL/PLL benchmarks and a real‑world image‑tagging dataset demonstrate consistent gains in both accuracy (up to +4.2%) and calibration metrics (ECE reduction up to 45%).
- Broad compatibility – CDL can be integrated into popular MIPL frameworks (e.g., MI‑PLL, Pseudo‑Label MIL) and even pure PLL pipelines, making it a versatile tool for weakly supervised learning.
Methodology
- Problem setup – Each training example is a bag \(B = \{x_1, \dots, x_m\}\) with a candidate label set \(Y_c \subseteq \mathcal{Y}\). The true label \(y^*\) belongs to \(Y_c\) but is unknown.
- Standard disambiguation loss – Prior work treats all candidate labels as equally plausible and optimizes a cross‑entropy over the max instance‑label score, which often drives the model to be over‑confident on the wrong label.
- Calibratable Disambiguation Loss (CDL) –
- CDL‑C: Computes a soft probability distribution over the candidate set using the model’s raw logits, then applies a temperature‑scaled cross‑entropy that encourages the predicted distribution to match this soft target.
- CDL‑CC: Extends CDL‑C by also assigning a small uniform probability mass to non‑candidate labels, effectively regularizing the model to stay uncertain about labels it has never seen as candidates (a hedged code sketch of both variants follows this list).
- Training loop – Replace the original loss term with CDL (or combine it with a standard classification loss). Because CDL is differentiable and uses only the model’s own outputs, it integrates seamlessly with any optimizer (SGD, Adam, etc.).
- Calibration evaluation – The authors adopt Expected Calibration Error (ECE) and Maximum Calibration Error (MCE) to quantify how well predicted probabilities align with empirical accuracies.
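The summary above does not spell out the exact loss equations, so the following is only a minimal PyTorch sketch consistent with the descriptions of CDL‑C and CDL‑CC; the function names, the `temperature` parameter, and the non‑candidate mass `eps` are illustrative assumptions, not the authors' reference implementation. Bags are assumed to be already aggregated into bag‑level logits, and candidate sets are encoded as binary masks over the label space.

```python
# Illustrative sketch only: a plausible PyTorch rendering of CDL-C / CDL-CC as described
# above. Function names, temperature, and eps are assumptions, not the paper's code.
import torch
import torch.nn.functional as F


def cdl_c_loss(logits, candidate_mask, temperature=1.0):
    """CDL-C sketch: cross-entropy against a soft target supported on the candidate set.

    logits:         (batch, num_classes) bag-level scores from the model
    candidate_mask: (batch, num_classes) binary, 1 for candidate labels
    """
    # Soft target: temperature-scaled softmax restricted to the candidate labels.
    masked = logits / temperature + torch.log(candidate_mask.clamp_min(1e-12))
    soft_target = F.softmax(masked, dim=-1).detach()

    # Temperature-scaled cross-entropy between the prediction and the soft target.
    log_probs = F.log_softmax(logits / temperature, dim=-1)
    return -(soft_target * log_probs).sum(dim=-1).mean()


def cdl_cc_loss(logits, candidate_mask, temperature=1.0, eps=0.05):
    """CDL-CC sketch: additionally reserve a small uniform mass for non-candidate labels."""
    masked = logits / temperature + torch.log(candidate_mask.clamp_min(1e-12))
    candidate_target = F.softmax(masked, dim=-1).detach()

    non_candidate = 1.0 - candidate_mask
    n_non = non_candidate.sum(dim=-1, keepdim=True).clamp_min(1.0)

    # Mix: (1 - eps) on the candidate distribution, eps spread uniformly over the rest.
    soft_target = (1.0 - eps) * candidate_target + eps * non_candidate / n_non

    log_probs = F.log_softmax(logits / temperature, dim=-1)
    return -(soft_target * log_probs).sum(dim=-1).mean()
```

Under this reading, CDL‑C amounts to a self‑distilled target over the candidates, and CDL‑CC label‑smooths that target toward the non‑candidate labels, which matches the regularization interpretation described above.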
Results & Findings
| Dataset | Baseline Accuracy | CDL‑C Accuracy | CDL‑CC Accuracy | Baseline ECE | CDL‑C ECE | CDL‑CC ECE |
|---|---|---|---|---|---|---|
| MIL‑MNIST (synthetic) | 78.1% | 80.9% (+2.8) | 82.3% (+4.2) | 0.127 | 0.084 | 0.067 |
| Real‑world Image‑Tag (Flickr) | 71.4% | 73.6% (+2.2) | 74.8% (+3.4) | 0.142 | 0.099 | 0.081 |
| Benchmark PLL (UCI) | 85.6% | 86.9% (+1.3) | 87.4% (+1.8) | 0.091 | 0.058 | 0.052 |
- Accuracy: Both CDL variants consistently outperform the original disambiguation loss; CDL‑CC usually leads the pack.
- Calibration: Expected Calibration Error drops by 30‑45%, indicating that the model’s confidence scores are far more trustworthy (a minimal ECE/MCE computation is sketched after this list).
- Ablation: Removing the non‑candidate probability term (i.e., using CDL‑C only) hurts calibration, confirming the regularizing effect of the second term.
- Compatibility test: Plugging CDL into a state‑of‑the‑art PLL method (PRODEN) yields similar improvements without any hyper‑parameter retuning.
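For reference, ECE and MCE are standard binned calibration metrics; the sketch below (NumPy, equal‑width confidence bins) shows one common way to compute them, though the bin count and binning scheme used in the paper may differ.

```python
# Minimal sketch of Expected / Maximum Calibration Error with equal-width confidence bins
# (an illustrative implementation, not the authors' evaluation code).
import numpy as np


def calibration_errors(confidences, predictions, labels, n_bins=15):
    """confidences: max predicted probability per example;
    predictions: argmax class per example; labels: ground-truth class."""
    confidences = np.asarray(confidences)
    correct = (np.asarray(predictions) == np.asarray(labels)).astype(float)

    bin_edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece, mce = 0.0, 0.0
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if not in_bin.any():
            continue
        # Gap between empirical accuracy and mean confidence within the bin.
        gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
        ece += in_bin.mean() * gap  # weight by the fraction of samples in the bin
        mce = max(mce, gap)
    return ece, mce
```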
Practical Implications
- Risk‑aware AI services – Systems that need to decide whether to act on a weakly supervised prediction (e.g., medical image triage, content moderation) can now rely on calibrated scores to set sensible confidence thresholds.
- Active learning & data acquisition – Better calibrated uncertainties enable more efficient query strategies: you can prioritize bags where the model is both uncertain and mis‑calibrated, reducing labeling costs.
- Model ensembling & downstream pipelines – Since CDL produces well‑behaved probabilities, ensembles or downstream Bayesian components (e.g., probabilistic graphical models) can combine them without additional temperature scaling.
- Zero‑cost upgrade – Existing MIPL/PLL codebases can adopt CDL by swapping a single loss function call (see the sketch after this list), making it an attractive low‑effort improvement for production teams dealing with noisy label collections.
- Broader weak supervision – The calibration principle (assigning a small mass to “impossible” labels) can be transplanted to other weakly supervised settings such as noisy‑label learning, multi‑label learning, or even semi‑supervised classification.
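As a concrete, purely hypothetical illustration of the "zero‑cost upgrade" point above, the only change to a conventional MIPL training step would be the loss call itself; the snippet reuses the `cdl_cc_loss` sketch from the Methodology section and treats the model, optimizer, and data as generic placeholders.

```python
# Hypothetical one-line swap in an existing training step (reuses the cdl_cc_loss sketch above).
def training_step(model, optimizer, bags, candidate_masks):
    logits = model(bags)  # bag-level logits, shape (batch, num_classes)
    # loss = original_disambiguation_loss(logits, candidate_masks)          # before
    loss = cdl_cc_loss(logits, candidate_masks, temperature=1.0, eps=0.05)  # after
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```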
Limitations & Future Work
- Scalability to massive label spaces – The second variant (CDL‑CC) requires iterating over the full label set to allocate non‑candidate mass, which may become costly for thousands of classes. Approximation strategies are needed.
- Dependence on candidate set quality – If the candidate label set is highly noisy (e.g., missing the true label), CDL’s calibration benefits diminish; handling partial candidate sets is an open direction.
- Theoretical tightness – While a lower bound is proven, the gap between the bound and empirical risk is not fully characterized; tighter analyses could guide hyper‑parameter choices (e.g., temperature).
- Extension to deep MIL architectures – The experiments use relatively shallow networks; integrating CDL with transformer‑based MIL encoders or graph neural networks remains to be explored.
Overall, the Calibratable Disambiguation Loss offers a practical, theoretically grounded boost to both accuracy and reliability for weakly supervised learning pipelines, making it a valuable addition to the toolbox of developers building AI systems under imperfect supervision.
Authors
- Wei Tang
- Yin-Fang Yang
- Weijia Zhang
- Min-Ling Zhang
Paper Information
- arXiv ID: 2512.17788v1
- Categories: cs.LG
- Published: December 19, 2025