[Paper] Region-Normalized DPO for Medical Image Segmentation under Noisy Judges
Source: arXiv - 2601.23222v1
Overview
The paper introduces Region‑Normalized Direct Preference Optimization (RN‑DPO), a new way to fine‑tune medical image segmentation models using cheap, noisy “quality‑control” signals instead of expensive pixel‑wise annotations. By reshaping how preference feedback is applied, RN‑DPO makes it possible to improve segmentation performance without any extra ground‑truth masks, opening the door to scalable, continuously‑learning medical imaging systems.
Key Contributions
- Preference‑based fine‑tuning for segmentation: Adapts Direct Preference Optimization (DPO), originally designed for language models, to dense pixel‑wise tasks.
- Region‑normalized objective: Introduces a segmentation‑aware loss that scales the update by the size of the disagreement region between two masks, dampening the impact of noisy or misleading preferences.
- Systematic analysis of preference mining: Shows that naïvely picking the top‑ranked proposal from a noisy judge can hurt performance, and proposes a more robust mining strategy.
- Empirical validation on two medical datasets: Demonstrates consistent gains over standard DPO and strong baselines across multiple noise levels and label‑budget regimes.
- Zero additional pixel annotations: Achieves improvements using only the existing QC signals (model agreement, uncertainty, learned mask‑quality scores), keeping annotation costs at near‑zero.
Methodology
- Base segmenter: Train a conventional supervised segmentation network on a small, fully annotated set (the "seed" data).
- Generate proposals: Run the base model on unlabeled images to produce multiple candidate masks (e.g., via test-time augmentation, dropout, or different model checkpoints).
- Collect noisy preferences: Use an automatic QC judge (uncertainty estimator, agreement score, or a learned quality predictor) to rank the proposals. The judge's output is noisy: sometimes it prefers a worse mask.
- Preference pair mining: Form pairs \((m_i, m_j)\) where the judge says \(m_i\) is better than \(m_j\). The paper experiments with several mining policies (top-ranked only, random, hybrid).
- Region-Normalized DPO loss:
  \[ \mathcal{L}_{\text{RN-DPO}} = -\log \sigma\!\left(\frac{S(m_i)-S(m_j)}{\|m_i \ominus m_j\|_1 + \epsilon}\right) \]
  where \(S(\cdot)\) is the model's score for a mask, \(\ominus\) denotes pixel-wise XOR (the disagreement region), and the denominator normalizes by its area. This reduces the learning signal when the disagreement region is tiny (often a noisy comparison) and amplifies it when the masks differ substantially.
- Fine-tuning: Optimize the segmenter with the RN-DPO loss on the unlabeled pool, keeping the original supervised loss on the seed set to preserve core knowledge.
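As a concrete reference, the loss above can be sketched in plain Python. This is a minimal sketch under assumptions: the scores are plain floats standing in for \(S(m_i)\) and \(S(m_j)\), masks are flat 0/1 lists, and the `eps` default is illustrative, not a value from the paper.

```python
import math

def rn_dpo_loss(score_i, score_j, mask_i, mask_j, eps=1.0):
    """Sketch of the RN-DPO loss for one preference pair.

    score_i, score_j : the model's scalar scores S(m_i), S(m_j) for the
        preferred and dispreferred masks (plain floats in this sketch).
    mask_i, mask_j   : binary masks, given as flat lists of 0/1 pixels.
    eps              : small stabilizing constant (value assumed).
    """
    # ||m_i XOR m_j||_1: area of the region where the two masks disagree.
    disagreement = sum(a != b for a, b in zip(mask_i, mask_j))
    # Score margin, normalized by the disagreement-region area.
    margin = (score_i - score_j) / (disagreement + eps)
    # DPO-style negative log-sigmoid of the normalized margin.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

With equal scores the loss is \(\log 2\) regardless of the masks; as the preferred mask's score margin grows, the loss shrinks, and it shrinks more slowly when the disagreement region is large, which is the normalization at work.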
Results & Findings
| Dataset | Seed annotations | Preference noise level | Metric (Dice) | Standard DPO | RN‑DPO (proposed) |
|---|---|---|---|---|---|
| Abdominal CT | 5 % | Low (high‑quality judge) | 0.78 → 0.84 | 0.81 | 0.86 |
| Brain MRI | 10 % | Medium (moderate‑quality judge) | 0.71 → 0.77 | 0.73 | 0.79 |
| Brain MRI | 10 % | High (very noisy judge) | 0.71 → 0.74 | 0.72 | 0.75 |
- Stability: RN‑DPO shows smoother training curves and fewer catastrophic drops when the judge is unreliable.
- Robustness to mining strategy: Unlike vanilla DPO, RN‑DPO is less sensitive to whether the top‑ranked or random pairs are used.
- No extra pixel labels: All gains are achieved solely from the unlabeled pool and the cheap QC signals.
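The mining policies whose robustness is compared above can be sketched as follows. The ranking input and the hybrid policy's split are assumptions for illustration, not the paper's exact recipe.

```python
import random

def mine_pairs(ranked, policy="top", rng=None):
    """Form preference pairs (better, worse) from a judge's ranking (sketch).

    ranked : mask identifiers sorted best-to-worst by the noisy judge.
    policy : "top"    -> pair the top-ranked mask against every other mask;
             "random" -> pair random distinct ranked positions;
             "hybrid" -> part top pairs, part random pairs (split assumed).
    """
    rng = rng or random.Random(0)
    n = len(ranked)
    top = [(ranked[0], ranked[j]) for j in range(1, n)]
    rand = []
    for _ in range(n - 1):
        i, j = sorted(rng.sample(range(n), 2))
        rand.append((ranked[i], ranked[j]))  # earlier rank = judged better
    if policy == "top":
        return top
    if policy == "random":
        return rand
    return top[: (n - 1) // 2] + rand[(n - 1) // 2 :]  # hybrid mix
```

Because the judge is noisy, "top" pairs concentrate errors around a single (possibly mis-ranked) proposal, which is one reason a vanilla DPO objective can be hurt by that policy while the region-normalized one is less sensitive.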
Practical Implications
- Scalable model updates: Hospitals can continuously improve segmentation models as new scans arrive, using only the existing QC metrics already emitted by their pipelines.
- Reduced annotation bottleneck: Radiology teams can allocate their limited annotation budget to a small seed set, while the rest of the data fuels improvement automatically.
- Plug‑and‑play component: RN‑DPO is a loss function that can be dropped into any PyTorch/TF segmentation model without architectural changes.
- Safety net for noisy feedback: The region‑normalization acts as a safeguard, preventing a single erroneous QC signal from corrupting the model—a crucial property for regulated medical AI.
- Beyond medicine: Any domain with dense predictions (satellite imagery, autonomous‑driving perception) and cheap quality scores could adopt RN‑DPO to leverage weak supervision at scale.
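To make the plug-and-play point concrete, a fine-tuning step might combine the existing supervised loss on the seed set with the mean RN-DPO term over mined pairs. The trade-off weight `lam` is a hypothetical knob, not a value from the paper, and the per-pair margins/areas are assumed to be precomputed.

```python
import math

def combined_objective(supervised_loss, pair_margins, pair_areas,
                       lam=0.1, eps=1.0):
    """Sketch of a total fine-tuning objective: supervised loss on the
    seed set plus the mean RN-DPO loss over mined preference pairs.

    pair_margins : score margins S(m_i) - S(m_j), one per mined pair.
    pair_areas   : disagreement areas ||m_i XOR m_j||_1, one per pair.
    lam, eps     : assumed trade-off weight and stabilizing constant.
    """
    rn_terms = [
        # Negative log-sigmoid of each region-normalized margin.
        -math.log(1.0 / (1.0 + math.exp(-(m / (a + eps)))))
        for m, a in zip(pair_margins, pair_areas)
    ]
    return supervised_loss + lam * sum(rn_terms) / max(len(rn_terms), 1)
```

Keeping the supervised term in the objective is what the methodology describes as preserving core knowledge while the preference term shifts the model on the unlabeled pool.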
Limitations & Future Work
- Dependence on a base segmenter: The approach assumes a reasonably good initial model; extremely poor seeds may not generate useful proposal diversity.
- Judge quality still matters: While RN‑DPO mitigates noise, extremely biased or adversarial judges can still degrade performance.
- Region‑normalization hyper‑parameter: The small constant \(\epsilon\) and the exact form of the denominator were hand‑tuned; automated adaptation could improve robustness.
- Extension to multi‑class / multi‑organ segmentation: Experiments focused on binary masks; scaling to complex, multi‑label scenarios remains an open question.
- Real‑world deployment studies: Future work should evaluate RN‑DPO in live clinical workflows, measuring not just Dice but downstream impact on diagnosis or treatment planning.
Authors
- Hamza Kalisch
- Constantin Seibold
- Jens Kleesiek
- Ken Herrmann
- Frederic Jonske
Paper Information
- arXiv ID: 2601.23222v1
- Categories: cs.CV
- Published: January 30, 2026