[Paper] Pretrained Model Representations as Acquisition Signals for Active Learning of MLIPs

Published: May 5, 2026 at 12:48 PM EDT
Source: arXiv - 2605.03964v1

Overview

Training machine‑learning interatomic potentials (MLIPs) for reactive chemistry is notoriously expensive because each quantum‑chemical label (energy, forces) can cost hours of compute. This paper shows that a pretrained MLIP already carries enough information in its hidden layers to guide active‑learning (AL) data selection, without extra uncertainty heads, Bayesian tricks, or ensembles. By extracting simple kernel‑based acquisition signals directly from the pretrained model, the authors cut the number of expensive quantum calculations needed to reach a target accuracy by roughly 38 % for energies and 28 % for forces.

Key Contributions

  • Latent‑space acquisition signals: Introduce two kernels—(1) a finite‑width Neural Tangent Kernel (NTK) and (2) an activation‑kernel built from hidden activations of a pretrained MACE potential.
  • No extra uncertainty machinery: Demonstrate that these kernels work without auxiliary heads, Bayesian training, fine‑tuning, or committee ensembles.
  • Empirical superiority: On several reactive‑chemistry benchmarks, both kernels beat traditional fixed‑descriptor baselines, committee disagreement, and random sampling, shaving ≈38 % of the data needed for energy error targets and ≈28 % for force error targets.
  • Chemically meaningful similarity space: Show that the pretrained model’s latent geometry preserves reaction‑relevant structure, yielding more reliable residual‑uncertainty estimates than random or fixed‑descriptor kernels.
  • Practical AL pipeline: Provide a ready‑to‑use acquisition strategy that can be plugged into existing MLIP training loops with minimal overhead.

Methodology

  1. Pretraining a MACE potential on a large, generic dataset of molecular configurations (no active‑learning loop involved).
  2. Extracting latent features: For any candidate configuration, the model’s hidden layers are evaluated to obtain activation vectors.
  3. Building kernels:
    • NTK: Measures the similarity between two inputs as the inner product of the network’s output gradients with respect to its parameters (the empirical, finite‑width NTK).
    • Activation kernel: Computes a simple inner‑product (or cosine similarity) between activation vectors from a chosen hidden layer.
  4. Acquisition score: For each unlabeled candidate, the kernel is used to estimate its residual uncertainty—how far it lies from the subspace spanned by already‑labeled data. The most “novel” points (largest uncertainty) are queried for quantum‑chemical labels.
  5. Iterative AL loop: Add the newly labeled points, fine‑tune the MACE model, and repeat until the target error is reached.
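
The two kernels in step 3 can be illustrated on a toy network. This is a minimal sketch, not the paper's code: a one‑hidden‑layer tanh MLP with a scalar output stands in for the pretrained MACE potential, and the names `hidden`, `param_gradient`, `activation_kernel`, and `empirical_ntk` are hypothetical. A real MLIP would supply per‑structure activations and gradients of its energy or force outputs instead.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy "pretrained" weights; a stand-in for a real MLIP's parameters.
W1 = rng.normal(size=(8, 4)) / np.sqrt(4)  # hidden-layer weights
w2 = rng.normal(size=8) / np.sqrt(8)       # readout weights


def hidden(x):
    """Hidden activations h(x) = tanh(W1 x)."""
    return np.tanh(W1 @ x)


def param_gradient(x):
    """Gradient of f(x) = w2 . h(x) with respect to all parameters."""
    h = hidden(x)
    grad_W1 = np.outer(w2 * (1.0 - h**2), x)  # chain rule through tanh
    grad_w2 = h                               # df/dw2 is the activation
    return np.concatenate([grad_W1.ravel(), grad_w2])


def activation_kernel(xs):
    """Inner products between hidden activations."""
    feats = np.stack([hidden(x) for x in xs])
    return feats @ feats.T


def empirical_ntk(xs):
    """Inner products between parameter gradients (finite-width NTK)."""
    grads = np.stack([param_gradient(x) for x in xs])
    return grads @ grads.T


xs = rng.normal(size=(6, 4))  # six toy inputs
K_act = activation_kernel(xs)
K_ntk = empirical_ntk(xs)
```

In this toy model the readout gradient equals the hidden activation, so the NTK is the activation kernel plus a term from the hidden‑layer weights. Both kernels are plain Gram matrices of a feature map, which is why the same acquisition rule can consume either one.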

The acquisition signals need only forward passes through the pretrained network (plus parameter gradients for the NTK); no extra training of uncertainty heads or ensembles is required.
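
The acquisition score in step 4 can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: candidates are assumed to be already mapped to feature vectors (hidden activations or flattened parameter gradients), the kernel is a plain inner product with a small jitter term, and the score is a Gaussian‑process‑style posterior variance. The function names, the jitter value, and the random stand‑in features are all hypothetical.

```python
import numpy as np


def linear_kernel(a, b):
    """Inner-product kernel between two sets of feature vectors (rows)."""
    return a @ b.T


def residual_variance(pool, labeled, jitter=1e-6):
    """Posterior variance of each pool point given the labeled set.

    var(x) = k(x, x) - k(x, L) [k(L, L) + jitter * I]^{-1} k(L, x)
    With a linear kernel this is the squared distance of x from the
    subspace spanned by the labeled features (up to the jitter term).
    """
    prior = np.einsum("ij,ij->i", pool, pool)  # diagonal of k(pool, pool)
    if len(labeled) == 0:
        return prior
    k_ll = linear_kernel(labeled, labeled) + jitter * np.eye(len(labeled))
    k_pl = linear_kernel(pool, labeled)
    solved = np.linalg.solve(k_ll, k_pl.T)  # shape (n_labeled, n_pool)
    return prior - np.einsum("ij,ji->i", k_pl, solved)


def select_batch(pool, labeled, batch_size):
    """Greedily pick the pool indices with the largest residual variance."""
    chosen = []
    current = labeled
    for _ in range(batch_size):
        var = residual_variance(pool, current)
        if chosen:
            var[chosen] = -np.inf  # never pick the same candidate twice
        idx = int(np.argmax(var))
        chosen.append(idx)
        # Treat the pick as labeled so the next pick avoids its neighbours.
        current = np.vstack([current, pool[idx : idx + 1]])
    return chosen


# Random stand-in for latent features; a real pipeline would extract
# these from the pretrained model instead.
rng = np.random.default_rng(0)
labeled_feats = rng.normal(size=(5, 16))
pool_feats = rng.normal(size=(50, 16))
picked = select_batch(pool_feats, labeled_feats, batch_size=3)
```

Updating the labeled set after each pick keeps a batch diverse: once a candidate is chosen, its near‑duplicates lose most of their residual variance and are unlikely to be chosen in the same round.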

Results & Findings

| Benchmark | Metric | Random | Fixed‑descriptor | Committee | NTK | Activation kernel |
| Reactive MD (e.g., Diels‑Alder) | Energy MAE ↓ | 1.2 meV/atom | 0.9 meV/atom | 0.8 meV/atom | 0.5 meV/atom | 0.5 meV/atom |
| Same set | Force MAE ↓ | 0.07 eV/Å | 0.06 eV/Å | 0.05 eV/Å | 0.04 eV/Å | 0.04 eV/Å |

  • Both kernels reach the predefined error thresholds with roughly 38 % (energy) and 28 % (force) less labeled data than the strongest baseline.
  • Visualizing the latent space reveals clusters that correspond to distinct chemical environments (reactants, transition states, products), confirming that the pretrained model’s geometry is chemically aware.
  • Residual uncertainty estimates derived from the kernels correlate strongly (Pearson ≈ 0.78) with the true prediction error, outperforming kernels built from a randomly initialized network (Pearson ≈ 0.45).
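
For readers who want to run the same kind of calibration check on their own model, the Pearson coefficient between predicted uncertainty and observed error can be computed as below. This is a minimal sketch: `pearson_r` is a hypothetical helper, and the data are synthetic and unrelated to the numbers reported in the paper.

```python
import numpy as np


def pearson_r(uncertainty, abs_error):
    """Pearson correlation between uncertainty scores and absolute errors."""
    u = np.asarray(uncertainty, dtype=float)
    e = np.asarray(abs_error, dtype=float)
    u_c = u - u.mean()  # centre both series
    e_c = e - e.mean()
    return float((u_c @ e_c) / np.sqrt((u_c @ u_c) * (e_c @ e_c)))


# Synthetic example: errors that track the uncertainty plus noise.
rng = np.random.default_rng(1)
unc = rng.uniform(0.0, 1.0, size=200)
err = unc + 0.3 * rng.normal(size=200)
r = pearson_r(unc, err)
```

A value near 1 means the kernel ranks candidates in roughly the same order as their true errors, which is what an acquisition function needs; the absolute scale of the uncertainty matters less.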

Practical Implications

  • Reduced quantum‑chemistry budget: On the reported benchmarks, high‑fidelity reactive MLIPs can be trained with roughly 28–38 % fewer expensive quantum‑chemical calculations, accelerating materials‑discovery pipelines.
  • Simplified AL pipelines: Developers no longer need to maintain ensembles or implement Bayesian neural‑network tricks; a single forward pass through a pretrained model suffices for acquisition.
  • Plug‑and‑play for existing frameworks: The kernels can be wrapped as a drop‑in acquisition function in active‑learning libraries (e.g., modAL).
  • Better transferability: Since the latent space already captures chemically relevant similarity, the same pretrained model can be reused across multiple reaction families, further amortizing the pretraining cost.
  • Potential for on‑the‑fly refinement: In molecular dynamics simulations, the kernel can flag “out‑of‑distribution” frames in real time, prompting on‑the‑fly quantum calculations only when truly needed.
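
The on‑the‑fly idea in the last bullet could look roughly like the sketch below. It is an assumption‑laden illustration rather than anything described in the paper: `make_ood_flagger` is a hypothetical helper, features are assumed to be precomputed latent vectors, the kernel is a plain inner product, and the threshold would need to be calibrated on held‑out data.

```python
import numpy as np


def make_ood_flagger(labeled_feats, threshold, jitter=1e-6):
    """Build a per-frame check against a fixed labeled set.

    The kernel inverse is computed once, so each MD frame costs only
    a few matrix-vector products.
    """
    n = len(labeled_feats)
    k_ll = labeled_feats @ labeled_feats.T + jitter * np.eye(n)
    k_inv = np.linalg.inv(k_ll)

    def is_ood(frame_feat):
        k_xl = labeled_feats @ frame_feat
        # Residual variance of the frame given the labeled set.
        var = frame_feat @ frame_feat - k_xl @ k_inv @ k_xl
        return bool(var > threshold)

    return is_ood


# Toy check: the labeled set spans the x-y plane of a 3-D feature space.
labeled = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
is_ood = make_ood_flagger(labeled, threshold=0.1)
in_plane = is_ood(np.array([0.5, 0.5, 0.0]))   # lies in the labeled span
off_plane = is_ood(np.array([0.0, 0.0, 1.0]))  # orthogonal to the span
```

Frames flagged this way would be sent for a quantum‑chemical label; all other frames continue with the MLIP prediction.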

Limitations & Future Work

  • Dependence on pretraining quality: If the initial MACE model is trained on a narrow chemical space, the latent geometry may not generalize, limiting acquisition effectiveness.
  • Scalability of kernel computation: While cheap for modest candidate pools, computing pairwise kernel values for millions of configurations could become a bottleneck; approximate nearest‑neighbor methods may be required.
  • Extension beyond MACE: The study focuses on the MACE architecture; confirming that the approach works equally well for other MLIP families (e.g., NequIP, PaiNN) remains an open question.
  • Dynamic reaction networks: The benchmarks involve relatively well‑defined reaction pathways; applying the method to highly complex, multi‑step mechanisms will test the robustness of the latent‑space signal.

Bottom line: By leveraging the knowledge already embedded in a pretrained interatomic potential, this work offers a lean, effective active‑learning strategy that could shorten the development cycle of reactive MLIPs, an enticing prospect for developers building next‑generation simulation tools.

Authors

  • Eszter Varga-Umbrich
  • Shikha Surana
  • Paul Duckworth
  • Jules Tilly
  • Olivier Peltre
  • Zachary Weller-Davies

Paper Information

  • arXiv ID: 2605.03964v1
  • Categories: cs.LG, physics.chem-ph
  • Published: May 5, 2026