[Paper] Pre-train to Gain: Robust Learning Without Clean Labels

Published: November 25, 2025 at 03:48 PM EST
4 min read

Source: arXiv - 2511.20844v1

Overview

Training deep neural networks on datasets that contain mislabeled examples is a notorious pain point—models tend to memorize the noise, which hurts their real‑world performance. The paper Pre‑train to Gain: Robust Learning Without Clean Labels shows that a simple two‑step recipe—self‑supervised pre‑training followed by ordinary supervised fine‑tuning—can dramatically improve robustness, even when no clean subset of the data is available.

Key Contributions

  • Label‑agnostic pre‑training: Demonstrates that self‑supervised learning (SSL) methods (SimCLR, Barlow Twins) can be used to learn a strong feature extractor without any labels.
  • Noise‑robust fine‑tuning: Shows that standard supervised training on top of the SSL‑pre‑trained backbone yields far higher accuracy on noisy datasets than training from scratch.
  • Comprehensive evaluation: Experiments on CIFAR‑10 and CIFAR‑100 with both synthetic (uniform, asymmetric) and real‑world noise (WebVision‑type) confirm the approach’s consistency across noise rates.
  • Improved label‑error detection: The SSL‑pre‑trained models produce better representations for downstream error‑detection tools, boosting F1 and balanced accuracy scores.
  • Competitive with ImageNet pre‑training: At low noise levels the method matches ImageNet‑pre‑trained baselines, and it outperforms them by a large margin when noise is severe.

Methodology

  1. Self‑Supervised Pre‑training

    • Choose an SSL algorithm (SimCLR or Barlow Twins).
    • Train a convolutional backbone (e.g., ResNet‑18) on the unlabeled training images using only data augmentations and a contrastive / redundancy‑reduction loss.
    • No human‑provided labels are needed; the model learns to map different views of the same image to similar embeddings (a minimal sketch of this stage follows the list).
  2. Supervised Fine‑tuning on Noisy Labels

    • Freeze or lightly fine‑tune the backbone while training a linear classifier (or a small head) on the noisy labeled dataset.
    • Use the usual cross‑entropy loss; the network now benefits from the robust features learned in step 1, which reduces its tendency to overfit the wrong labels (see the fine‑tuning sketch below).
  3. Evaluation & Error Detection

    • Measure classification accuracy on a clean test set.
    • Apply a simple label‑error detector (e.g., confidence‑thresholding or a small auxiliary network) on the fine‑tuned model’s outputs to assess how well the representations expose mislabeled samples.
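
Below is a minimal PyTorch sketch of the label‑free pre‑training stage (step 1), assuming a SimCLR‑style setup on CIFAR‑10 with a ResNet‑18 backbone; the augmentation recipe, 128‑dimensional projection, temperature of 0.5, and optimizer settings are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn.functional as F
import torchvision
from torchvision import transforms

def nt_xent_loss(z1, z2, temperature=0.5):
    """SimCLR-style contrastive loss: pull the two augmented views of each
    image together and push every other image in the batch away."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # (2N, d) unit vectors
    sim = z @ z.t() / temperature                        # scaled cosine similarities
    sim.fill_diagonal_(float("-inf"))                    # a view is never its own positive
    n = z1.size(0)
    # For row i the positive sits n rows away (the other view of the same image).
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(n)]).to(z.device)
    return F.cross_entropy(sim, targets)

# Two random augmentations of every image provide the "different views".
augment = transforms.Compose([
    transforms.RandomResizedCrop(32),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(0.4, 0.4, 0.4, 0.1),
    transforms.ToTensor(),
])

train_set = torchvision.datasets.CIFAR10(
    "data", train=True, download=True,
    transform=lambda img: (augment(img), augment(img)))
loader = torch.utils.data.DataLoader(train_set, batch_size=256, shuffle=True)

backbone = torchvision.models.resnet18(num_classes=128)  # final fc acts as the projection head
opt = torch.optim.SGD(backbone.parameters(), lr=0.03, momentum=0.9, weight_decay=5e-4)

for (v1, v2), _ in loader:        # the (possibly noisy) labels are never read here
    loss = nt_xent_loss(backbone(v1), backbone(v2))
    opt.zero_grad(); loss.backward(); opt.step()
```

The point of the sketch is that the noisy labels never enter this phase: only the images and their augmented views drive the loss.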

The pipeline requires no extra clean data, only the noisy training set and the computational budget for an SSL pre‑training phase (typically a few epochs on the same dataset).
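
Continuing the sketch above, step 2 swaps the projection head for a linear classifier and trains on the noisy labels with plain cross‑entropy. Freezing the backbone (a linear probe) versus lightly fine‑tuning it is the choice the methodology mentions; the learning rate and the `noisy_loader` below are assumptions for illustration.

```python
import torch
import torch.nn as nn

# Reuse the SSL-pre-trained ResNet-18 ("backbone") from the previous sketch and
# replace its projection head with a classifier over the noisy label space.
num_classes = 10                                       # CIFAR-10 in this illustration
backbone.fc = nn.Linear(backbone.fc.in_features, num_classes)

# Linear-probe setting: update only the new head, keep the SSL features frozen.
for name, p in backbone.named_parameters():
    p.requires_grad = name.startswith("fc")

opt = torch.optim.SGD([p for p in backbone.parameters() if p.requires_grad],
                      lr=0.01, momentum=0.9)
criterion = nn.CrossEntropyLoss()

for images, noisy_labels in noisy_loader:              # noisy_loader: an ordinary labeled loader (assumed)
    loss = criterion(backbone(images), noisy_labels)   # standard cross-entropy on the noisy labels
    opt.zero_grad(); loss.backward(); opt.step()
```

In the frozen setting only the final linear layer is updated, which further limits the model's capacity to memorize mislabeled examples.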

Results & Findings

| Dataset | Noise Type | Noise Rate | Baseline (scratch) Acc. | SSL‑pre‑trained Acc. | Δ Accuracy |
| --- | --- | --- | --- | --- | --- |
| CIFAR‑10 | Uniform | 40 % | 71.2 % | 78.9 % | +7.7 % |
| CIFAR‑10 | Asymmetric | 60 % | 64.5 % | 73.3 % | +8.8 % |
| CIFAR‑100 | Real‑world (WebVision) | 50 % | 48.1 % | 56.4 % | +8.3 % |
  • Consistent gains across all noise levels; the higher the noise, the larger the gap.
  • Label‑error detection improves by ~10 % in F1 score, meaning downstream cleaning pipelines become more reliable (a detection sketch follows this list).
  • Compared to models pre‑trained on ImageNet, the SSL‑pre‑trained approach matches performance at ≤20 % noise and outperforms by up to 12 % absolute accuracy at ≥50 % noise.
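
As a concrete illustration of the error‑detection step and the F1 numbers above, a confidence‑thresholding detector can be built directly on the fine‑tuned model's softmax outputs; the 0.5 threshold and the scikit‑learn metrics here are assumptions for the sketch, not the paper's exact protocol.

```python
import numpy as np
from sklearn.metrics import f1_score, balanced_accuracy_score

def flag_label_errors(probs, noisy_labels, threshold=0.5):
    """Flag a sample as mislabeled when the fine-tuned model assigns low
    probability to its given (possibly wrong) label.
    probs: (N, C) softmax outputs; noisy_labels: (N,) integer labels."""
    confidence_in_given_label = probs[np.arange(len(noisy_labels)), noisy_labels]
    return confidence_in_given_label < threshold       # True = suspected label error

# With synthetic noise the true labels are known, so the detector can be scored:
# is_mislabeled = noisy_labels != true_labels
# flags = flag_label_errors(probs, noisy_labels)
# print(f1_score(is_mislabeled, flags), balanced_accuracy_score(is_mislabeled, flags))
```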

These numbers illustrate that the learned representations are inherently more noise‑tolerant than those obtained from supervised ImageNet pre‑training.

Practical Implications

  • Data‑centric pipelines: Teams that collect large, imperfect datasets (e.g., web‑scraped images, user‑generated content) can plug in an SSL pre‑training stage to get a “clean‑ish” feature extractor without manual labeling.
  • Reduced reliance on curated subsets: Many existing noisy‑label methods require a small clean validation set for loss correction or sample re‑weighting. This work eliminates that requirement, simplifying data acquisition and lowering annotation costs.
  • Better downstream tools: Improved embeddings boost the performance of anomaly detectors, active‑learning query strategies, and semi‑supervised label‑propagation, leading to faster data cleaning loops.
  • Hardware‑friendly: The SSL phase can be run on the same hardware used for regular training (e.g., a single GPU) and scales linearly with dataset size, making it feasible for most production teams.
  • Transferability: Once a robust backbone is obtained on a noisy source domain, it can be fine‑tuned on related tasks (e.g., object detection, segmentation) with far fewer clean annotations.

Limitations & Future Work

  • Computation overhead: Adding an SSL pre‑training stage increases total training time (typically 2–3× the cost of a single supervised run).
  • SSL hyper‑parameters: The quality of the learned features depends on augmentation choices and loss temperature settings; sub‑optimal configs can diminish gains.
  • Domain shift: Experiments are limited to CIFAR‑scale images; it remains to be seen how the approach scales to high‑resolution or non‑visual modalities (e.g., audio, text).
  • Theoretical understanding: While empirical results are strong, a formal analysis of why SSL mitigates label noise is still an open research question.

Future work could explore lightweight SSL variants, curriculum‑style fine‑tuning that gradually introduces noisy labels, and extending the method to multimodal or streaming data scenarios.

Authors

  • David Szczecina
  • Nicholas Pellegrino
  • Paul Fieguth

Paper Information

  • arXiv ID: 2511.20844v1
  • Categories: cs.LG, cs.AI, cs.NE
  • Published: November 25, 2025
