[Paper] Region-Normalized DPO for Medical Image Segmentation under Noisy Judges
Source: arXiv - 2601.23222v1
Overview
The paper introduces Region‑Normalized Direct Preference Optimization (RN‑DPO), a new way to fine‑tune medical image segmentation models using cheap, noisy “quality‑control” signals instead of expensive pixel‑wise annotations. By reshaping how preference feedback is applied, RN‑DPO makes it possible to improve segmentation performance without any extra ground‑truth masks, opening the door to scalable, continuously‑learning medical imaging systems.
Key Contributions
- Preference‑based fine‑tuning for segmentation: Adapts Direct Preference Optimization (DPO), originally designed for language models, to dense pixel‑wise tasks.
- Region‑normalized objective: Introduces a segmentation‑aware loss that scales the update by the size of the disagreement region between two masks, dampening the impact of noisy or misleading preferences.
- Systematic analysis of preference mining: Shows that naïvely picking the top‑ranked proposal from a noisy judge can hurt performance, and proposes a more robust mining strategy.
- Empirical validation on two medical datasets: Demonstrates consistent gains over standard DPO and strong baselines across multiple noise levels and label‑budget regimes.
- Zero additional pixel annotations: Achieves improvements using only the existing QC signals (model agreement, uncertainty, learned mask‑quality scores), keeping annotation costs at near‑zero.
Methodology
- Base segmenter: Train a conventional supervised segmentation network on a small, fully annotated set (the "seed" data).
- Generate proposals: Run the base model on unlabeled images to produce multiple candidate masks (e.g., via test-time augmentation, dropout, or different model checkpoints).
- Collect noisy preferences: Use an automatic QC judge (uncertainty estimator, agreement score, or a learned quality predictor) to rank the proposals. The judge's output is noisy: sometimes it prefers a worse mask.
- Preference pair mining: Form pairs \((m_i, m_j)\) where the judge says \(m_i\) is better than \(m_j\). The paper experiments with several mining policies (top-ranked only, random, hybrid).
- Region-Normalized DPO loss:
  \[ \mathcal{L}_{\text{RN-DPO}} = -\log \sigma\!\left(\frac{S(m_i)-S(m_j)}{\|m_i \ominus m_j\|_1 + \epsilon}\right) \]
  where \(S(\cdot)\) is the model's score for a mask, \(\ominus\) denotes pixel-wise XOR (the disagreement region), and the denominator normalizes by its area. This reduces the learning signal when the disagreement region is tiny (often a noisy comparison) and amplifies it when the masks differ substantially.
- Fine-tuning: Optimize the segmenter with the RN-DPO loss on the unlabeled pool, keeping the original supervised loss on the seed set to preserve core knowledge.
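As a concrete reference, the loss above can be sketched in plain Python. This is a minimal sketch under assumptions: the scores are plain floats standing in for \(S(m_i)\) and \(S(m_j)\), masks are flat 0/1 lists, and the `eps` default is illustrative, not a value from the paper.

```python
import math

def rn_dpo_loss(score_i, score_j, mask_i, mask_j, eps=1.0):
    """Sketch of the RN-DPO loss for one preference pair.

    score_i, score_j : the model's scalar scores S(m_i), S(m_j) for the
        preferred and dispreferred masks (plain floats in this sketch).
    mask_i, mask_j   : binary masks, given as flat lists of 0/1 pixels.
    eps              : small stabilizing constant (value assumed).
    """
    # ||m_i XOR m_j||_1: area of the region where the two masks disagree.
    disagreement = sum(a != b for a, b in zip(mask_i, mask_j))
    # Score margin, normalized by the disagreement-region area.
    margin = (score_i - score_j) / (disagreement + eps)
    # DPO-style negative log-sigmoid of the normalized margin.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

With equal scores the loss is \(\log 2\) regardless of the masks; as the preferred mask's score margin grows, the loss shrinks, and it shrinks more slowly when the disagreement region is large, which is the normalization at work.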
Results & Findings
| Dataset | Seed annotations | Preference noise level | Metric (Dice) | Standard DPO | RN‑DPO (proposed) |
|---|---|---|---|---|---|
| Abdominal CT | 5 % | Low (high‑quality judge) | 0.78 → 0.84 | 0.81 | 0.86 |
| Brain MRI | 10 % | Medium (moderate‑quality judge) | 0.71 → 0.77 | 0.73 | 0.79 |
| Brain MRI | 10 % | High (very noisy judge) | 0.71 → 0.74 | 0.72 | 0.75 |
- Stability: RN‑DPO shows smoother training curves and fewer catastrophic drops when the judge is unreliable.
- Robustness to mining strategy: Unlike vanilla DPO, RN‑DPO is less sensitive to whether the top‑ranked or random pairs are used.
- No extra pixel labels: All gains are achieved solely from the unlabeled pool and the cheap QC signals.
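The mining policies whose robustness is compared above can be sketched as follows. The ranking input and the hybrid policy's split are assumptions for illustration, not the paper's exact recipe.

```python
import random

def mine_pairs(ranked, policy="top", rng=None):
    """Form preference pairs (better, worse) from a judge's ranking (sketch).

    ranked : mask identifiers sorted best-to-worst by the noisy judge.
    policy : "top"    -> pair the top-ranked mask against every other mask;
             "random" -> pair random distinct ranked positions;
             "hybrid" -> part top pairs, part random pairs (split assumed).
    """
    rng = rng or random.Random(0)
    n = len(ranked)
    top = [(ranked[0], ranked[j]) for j in range(1, n)]
    rand = []
    for _ in range(n - 1):
        i, j = sorted(rng.sample(range(n), 2))
        rand.append((ranked[i], ranked[j]))  # earlier rank = judged better
    if policy == "top":
        return top
    if policy == "random":
        return rand
    return top[: (n - 1) // 2] + rand[(n - 1) // 2 :]  # hybrid mix
```

Because the judge is noisy, "top" pairs concentrate errors around a single (possibly mis-ranked) proposal, which is one reason a vanilla DPO objective can be hurt by that policy while the region-normalized one is less sensitive.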
Practical Implications
- Scalable model updates: Hospitals can continuously improve segmentation models as new scans arrive, using only the existing QC metrics already emitted by their pipelines.
- Reduced annotation bottleneck: Radiology teams can allocate their limited annotation budget to a small seed set, while the rest of the data fuels improvement automatically.
- Plug‑and‑play component: RN‑DPO is a loss function that can be dropped into any PyTorch/TF segmentation model without architectural changes.
- Safety net for noisy feedback: The region‑normalization acts as a safeguard, preventing a single erroneous QC signal from corrupting the model—a crucial property for regulated medical AI.
- Beyond medicine: Any domain with dense predictions (satellite imagery, autonomous‑driving perception) and cheap quality scores could adopt RN‑DPO to leverage weak supervision at scale.
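To make the plug-and-play point concrete, a fine-tuning step might combine the existing supervised loss on the seed set with the mean RN-DPO term over mined pairs. The trade-off weight `lam` is a hypothetical knob, not a value from the paper, and the per-pair margins/areas are assumed to be precomputed.

```python
import math

def combined_objective(supervised_loss, pair_margins, pair_areas,
                       lam=0.1, eps=1.0):
    """Sketch of a total fine-tuning objective: supervised loss on the
    seed set plus the mean RN-DPO loss over mined preference pairs.

    pair_margins : score margins S(m_i) - S(m_j), one per mined pair.
    pair_areas   : disagreement areas ||m_i XOR m_j||_1, one per pair.
    lam, eps     : assumed trade-off weight and stabilizing constant.
    """
    rn_terms = [
        # Negative log-sigmoid of each region-normalized margin.
        -math.log(1.0 / (1.0 + math.exp(-(m / (a + eps)))))
        for m, a in zip(pair_margins, pair_areas)
    ]
    return supervised_loss + lam * sum(rn_terms) / max(len(rn_terms), 1)
```

Keeping the supervised term in the objective is what the methodology describes as preserving core knowledge while the preference term shifts the model on the unlabeled pool.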
Limitations & Future Work
- Dependence on a base segmenter: The approach assumes a reasonably good initial model; extremely poor seeds may not generate useful proposal diversity.
- Judge quality still matters: While RN‑DPO mitigates noise, extremely biased or adversarial judges can still degrade performance.
- Region‑normalization hyper‑parameter: The small constant \(\epsilon\) and the exact form of the denominator were hand‑tuned; automated adaptation could improve robustness.
- Extension to multi‑class / multi‑organ segmentation: Experiments focused on binary masks; scaling to complex, multi‑label scenarios remains an open question.
- Real‑world deployment studies: Future work should evaluate RN‑DPO in live clinical workflows, measuring not just Dice but downstream impact on diagnosis or treatment planning.
Authors
- Hamza Kalisch
- Constantin Seibold
- Jens Kleesiek
- Ken Herrmann
- Frederic Jonske
Paper Information
- arXiv ID: 2601.23222v1
- Categories: cs.CV
- Published: January 30, 2026