[Paper] Improving Domain Generalization in Contrastive Learning using Adaptive Temperature Control

Published: January 12, 2026 at 12:32 PM EST
4 min read
Source: arXiv

Source: arXiv - 2601.07748v1

Overview

This paper tackles a common pain point in self‑supervised learning: models that excel on the data they were trained on often stumble when faced with a new, unseen domain. By dynamically adjusting the temperature parameter in the contrastive InfoNCE loss based on domain information, the authors boost the domain‑invariance of learned embeddings, delivering stronger out‑of‑distribution (OOD) performance without sacrificing in‑distribution accuracy.

Key Contributions

  • Adaptive temperature schedule: Introduces a principled way to modulate the InfoNCE temperature per negative pair, using the probability that the negative belongs to the same domain as the anchor.
  • Domain‑aware contrastive loss: Leverages available domain labels during pre‑training to explicitly encourage representations that ignore domain‑specific cues.
  • Empirical validation: Shows on a multi‑domain MNIST variant that the method outperforms standard contrastive learning and several domain‑generalization baselines both on OOD test domains and on the original in‑distribution tasks.
  • Preserves downstream utility: Demonstrates that the adaptive scheme does not degrade performance on downstream supervised tasks, making it a drop‑in replacement for existing contrastive pipelines.

Methodology

  1. Setup:

    • Training data consist of samples ((x_i, d_i)) where (d_i) is a known domain label (e.g., different handwriting styles, lighting conditions).
    • The goal is to learn an encoder (f(\cdot)) whose embeddings are useful across any future domain.
  2. InfoNCE loss recap:
    [ \mathcal{L}_{\text{InfoNCE}} = -\log \frac{\exp(\mathbf{z}_i^\top \mathbf{z}_j / \tau)}{\sum_{k=1}^{N}\exp(\mathbf{z}_i^\top \mathbf{z}_k / \tau)} ] where (\mathbf{z}_j) is the embedding of the anchor's positive (augmented) view and (\tau) is the temperature controlling how sharply the loss focuses on hard negatives.

  3. Adaptive temperature (\tau_{ik}):

    • Compute (p_{ik} = \Pr(d_k = d_i)), the empirical probability that a randomly drawn negative belongs to the same domain as the anchor.
    • Set (\tau_{ik} = \tau_0 \cdot (1 - p_{ik}) + \epsilon), where (\tau_0) is a base temperature and (\epsilon) prevents division by zero.
    • Under this schedule, negatives from the same domain receive a lower temperature (so they are weighted more heavily in the loss), while negatives from different domains receive a higher temperature and contribute less. The encoder is therefore pushed to separate same‑domain samples using domain‑agnostic features rather than relying on domain cues (a minimal implementation sketch follows this list).
  4. Training pipeline:

    • Standard data augmentations generate positive pairs.
    • Domain labels are used only to compute (\tau_{ik}); they are not fed into the encoder, preserving a clean representation space.
    • The rest of the contrastive training loop (batch construction, optimizer, etc.) remains unchanged.
  5. Evaluation:

    • After pre‑training, a linear classifier is trained on the source domains to assess in‑distribution performance.
    • For OOD evaluation, the same classifier is tested on a held‑out domain that exhibits covariate shift (e.g., rotated digits, different stroke thickness).
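
To make the schedule concrete, here is a minimal PyTorch sketch of a per‑pair adaptive‑temperature InfoNCE loss, assuming a SimCLR‑style setup with two augmented views per image and hard domain labels (so (p_{ik}) reduces to an indicator and (\tau_{ik}) is either (\tau_0 + \epsilon) or (\epsilon)). The function name, parameter names, and default values (adaptive_info_nce, tau0, eps) are illustrative assumptions, not the authors' reference implementation.

```python
# Minimal sketch (not the authors' code): per-pair adaptive-temperature InfoNCE.
import torch
import torch.nn.functional as F


def adaptive_info_nce(z1, z2, domains, tau0=0.5, eps=0.05):
    """z1, z2: (N, D) embeddings of two augmented views; domains: (N,) integer domain labels."""
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # (2N, D), unit-norm embeddings
    sim = z @ z.t()                                       # cosine similarities

    d = torch.cat([domains, domains], dim=0)              # domain label per row
    # p_ik: probability that sample k shares the anchor's domain. With hard labels
    # this is an indicator; a soft domain-classifier output could be plugged in instead.
    p = (d.unsqueeze(0) == d.unsqueeze(1)).float()
    tau = tau0 * (1.0 - p) + eps                          # tau_ik = tau0 * (1 - p_ik) + eps

    # The positive pair (the other view of the same image) keeps the base temperature.
    idx = torch.arange(2 * n, device=z.device)
    pos_idx = idx.roll(n)                                 # view i <-> view i + N
    tau[idx, pos_idx] = tau0

    # Mask self-similarities so they never appear in the denominator.
    mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    logits = (sim / tau).masked_fill(mask, float("-inf"))
    return F.cross_entropy(logits, pos_idx)
```

Because the change is confined to the per‑pair temperature matrix, the rest of the contrastive loop (batch construction, augmentations, optimizer) stays exactly as described in step 4.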

Results & Findings

| Metric | Standard Contrastive | Domain‑Generalization Baselines | Adaptive‑Temp (this work) |
| --- | --- | --- | --- |
| In‑distribution accuracy (linear probe) | 96.2 % | 95.8 % – 96.0 % | 97.4 % |
| OOD accuracy (unseen domain) | 71.5 % | 73.2 % – 75.0 % | 81.3 % |
| Gap between in‑ and OOD performance | 24.7 % | 22.0 % – 21.8 % | 16.1 % |

  • The adaptive temperature consistently yields higher OOD scores across multiple domain splits.
  • Importantly, the method does not sacrifice in‑distribution performance; it actually improves it slightly, likely because the encoder learns cleaner, more discriminative features.
  • Ablation studies confirm that the benefit stems from the temperature adaptation rather than simply adding domain labels as an auxiliary task.

Practical Implications

  • Plug‑and‑play upgrade: Developers can integrate the adaptive temperature logic into existing PyTorch/TensorFlow contrastive pipelines with minimal code changes; the only addition is computing per‑pair temperatures from domain metadata (see the wiring example after this list).
  • Robust pre‑training for edge devices: When deploying models on devices that encounter varied sensor conditions (e.g., smartphones with different camera modules), this technique can reduce the need for costly domain‑specific fine‑tuning.
  • Better transfer learning: Pre‑trained encoders that are less entangled with source‑domain quirks tend to serve downstream tasks (classification, retrieval, anomaly detection) more reliably across data drifts.
  • Data‑centric strategy: Encourages teams to capture lightweight domain identifiers (e.g., sensor type, acquisition environment) during data collection, unlocking a simple yet powerful lever for generalization.
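
As an illustration of how little wiring the drop‑in replacement needs, the toy example below reuses the adaptive_info_nce sketch from the Methodology section on random MNIST‑sized inputs; the encoder, optimizer, learning rate, and domain count are hypothetical placeholders, not values from the paper.

```python
# Toy end-to-end wiring (illustrative only): swapping in the adaptive loss is the
# sole change relative to a standard contrastive training step.
import torch
from torch import nn

encoder = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 128), nn.ReLU(), nn.Linear(128, 64))
optimizer = torch.optim.Adam(encoder.parameters(), lr=1e-3)

# Two augmented views of 32 MNIST-sized images, plus a lightweight domain tag per image
# (e.g., one of 4 handwriting styles); a real pipeline would read these from the dataset.
x1, x2 = torch.randn(32, 1, 28, 28), torch.randn(32, 1, 28, 28)
domain_id = torch.randint(0, 4, (32,))

loss = adaptive_info_nce(encoder(x1), encoder(x2), domain_id)   # sketch defined above
loss.backward()
optimizer.step()
optimizer.zero_grad()
```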

Limitations & Future Work

  • Domain label requirement: The method assumes access to domain annotations during pre‑training. In fully unsupervised settings where such metadata is unavailable, its applicability is limited.
  • Scalability of per‑pair temperature: Computing a unique temperature for every negative pair can be costly for very large batches; approximations or clustering‑based proxies may be needed.
  • Benchmark breadth: Experiments are confined to a synthetic multi‑domain MNIST variant. Validating the approach on larger, real‑world datasets (e.g., ImageNet‑style domain shifts, medical imaging) is an open next step.
  • Theoretical analysis: While empirical results are promising, a deeper information‑theoretic justification for the specific temperature schedule could strengthen the contribution.

Overall, the paper offers a pragmatic, low‑overhead tweak to contrastive learning that meaningfully improves domain generalization—a win for anyone building self‑supervised models destined for the messy, ever‑shifting real world.

Authors

  • Robert Lewis
  • Katie Matton
  • Rosalind W. Picard
  • John Guttag

Paper Information

  • arXiv ID: 2601.07748v1
  • Categories: cs.LG, cs.AI
  • Published: January 12, 2026
  • PDF: Download PDF