[Paper] Pre-train to Gain: Robust Learning Without Clean Labels

Published: November 25, 2025 at 03:48 PM EST
4 min read

Source: arXiv - 2511.20844v1

Overview

Training deep neural networks on datasets that contain mislabeled examples is a notorious pain point—models tend to memorize the noise, which hurts their real‑world performance. The paper Pre‑train to Gain: Robust Learning Without Clean Labels shows that a simple two‑step recipe—self‑supervised pre‑training followed by ordinary supervised fine‑tuning—can dramatically improve robustness, even when no clean subset of the data is available.

Key Contributions

  • Label‑agnostic pre‑training: Demonstrates that self‑supervised learning (SSL) methods (SimCLR, Barlow Twins) can be used to learn a strong feature extractor without any labels.
  • Noise‑robust fine‑tuning: Shows that standard supervised training on top of the SSL‑pre‑trained backbone yields far higher accuracy on noisy datasets than training from scratch.
  • Comprehensive evaluation: Experiments on CIFAR‑10 and CIFAR‑100 with both synthetic (uniform, asymmetric) and real‑world noise (WebVision‑type) confirm the approach’s consistency across noise rates.
  • Improved label‑error detection: The SSL‑pre‑trained models produce better representations for downstream error‑detection tools, boosting F1 and balanced accuracy scores.
  • Competitive with ImageNet pre‑training: At low noise levels the method matches ImageNet‑pre‑trained baselines, and it outperforms them by a large margin when noise is severe.

Methodology

  1. Self‑Supervised Pre‑training

    • Choose an SSL algorithm (SimCLR or Barlow Twins).
    • Train a convolutional backbone (e.g., ResNet‑18) on the unlabeled training images using only data augmentations and a contrastive / redundancy‑reduction loss.
    • No human‑provided labels are needed; the model learns to map different views of the same image to similar embeddings (a minimal sketch of this stage follows the list).
  2. Supervised Fine‑tuning on Noisy Labels

    • Freeze or lightly fine‑tune the backbone while training a linear classifier (or a small head) on the noisy labeled dataset.
    • Use the usual cross‑entropy loss; the network now benefits from the robust features learned in step 1, which reduces its tendency to overfit the wrong labels (see the fine‑tuning sketch below).
  3. Evaluation & Error Detection

    • Measure classification accuracy on a clean test set.
    • Apply a simple label‑error detector (e.g., confidence‑thresholding or a small auxiliary network) on the fine‑tuned model’s outputs to assess how well the representations expose mislabeled samples.
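
Below is a minimal PyTorch sketch of the label‑free pre‑training stage (step 1), assuming a SimCLR‑style setup on CIFAR‑10 with a ResNet‑18 backbone; the augmentation recipe, 128‑dimensional projection, temperature of 0.5, and optimizer settings are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn.functional as F
import torchvision
from torchvision import transforms

def nt_xent_loss(z1, z2, temperature=0.5):
    """SimCLR-style contrastive loss: pull the two augmented views of each
    image together and push every other image in the batch away."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # (2N, d) unit vectors
    sim = z @ z.t() / temperature                        # scaled cosine similarities
    sim.fill_diagonal_(float("-inf"))                    # a view is never its own positive
    n = z1.size(0)
    # For row i the positive sits n rows away (the other view of the same image).
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(n)]).to(z.device)
    return F.cross_entropy(sim, targets)

# Two random augmentations of every image provide the "different views".
augment = transforms.Compose([
    transforms.RandomResizedCrop(32),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(0.4, 0.4, 0.4, 0.1),
    transforms.ToTensor(),
])

train_set = torchvision.datasets.CIFAR10(
    "data", train=True, download=True,
    transform=lambda img: (augment(img), augment(img)))
loader = torch.utils.data.DataLoader(train_set, batch_size=256, shuffle=True)

backbone = torchvision.models.resnet18(num_classes=128)  # final fc acts as the projection head
opt = torch.optim.SGD(backbone.parameters(), lr=0.03, momentum=0.9, weight_decay=5e-4)

for (v1, v2), _ in loader:        # the (possibly noisy) labels are never read here
    loss = nt_xent_loss(backbone(v1), backbone(v2))
    opt.zero_grad(); loss.backward(); opt.step()
```

The point of the sketch is that the noisy labels never enter this phase: only the images and their augmented views drive the loss.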

The pipeline requires no extra clean data, only the noisy training set and the computational budget for an SSL pre‑training phase (typically a few epochs on the same dataset).
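
Continuing the sketch above, step 2 swaps the projection head for a linear classifier and trains on the noisy labels with plain cross‑entropy. Freezing the backbone (a linear probe) versus lightly fine‑tuning it is the choice the methodology mentions; the learning rate and the `noisy_loader` below are assumptions for illustration.

```python
import torch
import torch.nn as nn

# Reuse the SSL-pre-trained ResNet-18 ("backbone") from the previous sketch and
# replace its projection head with a classifier over the noisy label space.
num_classes = 10                                       # CIFAR-10 in this illustration
backbone.fc = nn.Linear(backbone.fc.in_features, num_classes)

# Linear-probe setting: update only the new head, keep the SSL features frozen.
for name, p in backbone.named_parameters():
    p.requires_grad = name.startswith("fc")

opt = torch.optim.SGD([p for p in backbone.parameters() if p.requires_grad],
                      lr=0.01, momentum=0.9)
criterion = nn.CrossEntropyLoss()

for images, noisy_labels in noisy_loader:              # noisy_loader: an ordinary labeled loader (assumed)
    loss = criterion(backbone(images), noisy_labels)   # standard cross-entropy on the noisy labels
    opt.zero_grad(); loss.backward(); opt.step()
```

In the frozen setting only the final linear layer is updated, which further limits the model's capacity to memorize mislabeled examples.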

Results & Findings

| Dataset | Noise Type | Noise Rate | Baseline (scratch) Acc. | SSL‑pre‑trained Acc. | Δ Accuracy |
| --- | --- | --- | --- | --- | --- |
| CIFAR‑10 | Uniform | 40 % | 71.2 % | 78.9 % | +7.7 % |
| CIFAR‑10 | Asymmetric | 60 % | 64.5 % | 73.3 % | +8.8 % |
| CIFAR‑100 | Real‑world (WebVision) | 50 % | 48.1 % | 56.4 % | +8.3 % |
  • Consistent gains across all noise levels; the higher the noise, the larger the gap.
  • Label‑error detection improves by ~10 % in F1 score, meaning downstream cleaning pipelines become more reliable (a detection sketch follows this list).
  • Compared to models pre‑trained on ImageNet, the SSL‑pre‑trained approach matches performance at ≤20 % noise and outperforms by up to 12 % absolute accuracy at ≥50 % noise.
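
As a concrete illustration of the error‑detection step and the F1 numbers above, a confidence‑thresholding detector can be built directly on the fine‑tuned model's softmax outputs; the 0.5 threshold and the scikit‑learn metrics here are assumptions for the sketch, not the paper's exact protocol.

```python
import numpy as np
from sklearn.metrics import f1_score, balanced_accuracy_score

def flag_label_errors(probs, noisy_labels, threshold=0.5):
    """Flag a sample as mislabeled when the fine-tuned model assigns low
    probability to its given (possibly wrong) label.
    probs: (N, C) softmax outputs; noisy_labels: (N,) integer labels."""
    confidence_in_given_label = probs[np.arange(len(noisy_labels)), noisy_labels]
    return confidence_in_given_label < threshold       # True = suspected label error

# With synthetic noise the true labels are known, so the detector can be scored:
# is_mislabeled = noisy_labels != true_labels
# flags = flag_label_errors(probs, noisy_labels)
# print(f1_score(is_mislabeled, flags), balanced_accuracy_score(is_mislabeled, flags))
```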

These numbers illustrate that the learned representations are inherently more noise‑tolerant than those obtained from supervised ImageNet pre‑training.

Practical Implications

  • Data‑centric pipelines: Teams that collect large, imperfect datasets (e.g., web‑scraped images, user‑generated content) can plug in an SSL pre‑training stage to get a “clean‑ish” feature extractor without manual labeling.
  • Reduced reliance on curated subsets: Many existing noisy‑label methods require a small clean validation set for loss correction or sample re‑weighting. This work eliminates that requirement, simplifying data acquisition and lowering annotation costs.
  • Better downstream tools: Improved embeddings boost the performance of anomaly detectors, active‑learning query strategies, and semi‑supervised label‑propagation, leading to faster data cleaning loops.
  • Hardware‑friendly: The SSL phase can be run on the same hardware used for regular training (e.g., a single GPU) and scales linearly with dataset size, making it feasible for most production teams.
  • Transferability: Once a robust backbone is obtained on a noisy source domain, it can be fine‑tuned on related tasks (e.g., object detection, segmentation) with far fewer clean annotations.

Limitations & Future Work

  • Computation overhead: Adding an SSL pre‑training stage increases total training time (typically 2–3× the cost of a single supervised run).
  • SSL hyper‑parameters: The quality of the learned features depends on augmentation choices and loss temperature settings; sub‑optimal configs can diminish gains.
  • Domain shift: Experiments are limited to CIFAR‑scale images; it remains to be seen how the approach scales to high‑resolution or non‑visual modalities (e.g., audio, text).
  • Theoretical understanding: While empirical results are strong, a formal analysis of why SSL mitigates label noise is still an open research question.

Future work could explore lightweight SSL variants, curriculum‑style fine‑tuning that gradually introduces noisy labels, and extending the method to multimodal or streaming data scenarios.

Authors

  • David Szczecina
  • Nicholas Pellegrino
  • Paul Fieguth

Paper Information

  • arXiv ID: 2511.20844v1
  • Categories: cs.LG, cs.AI, cs.NE
  • Published: November 25, 2025
