[Paper] Retrieval-Augmented Foundation Models for Matched Molecular Pair Transformations to Recapitulate Medicinal Chemistry Intuition

Published: February 18, 2026 at 01:27 PM EST
Source: arXiv


Overview

This paper tackles a core challenge in computer‑aided drug design: automatically suggesting realistic chemical modifications that a medicinal chemist would make when iterating on a lead compound. By training a large “foundation” model on millions of matched molecular pair transformations (MMPTs) and augmenting it with a retrieval system, the authors enable controllable, diverse generation of analog molecules that align with human intuition.

Key Contributions

  • Variable‑to‑variable formulation – reframes analog generation as “given a source molecule, produce a target molecule” rather than treating the whole molecule as a monolithic token.
  • Large‑scale foundation model – pre‑trained on a massive corpus of MMPTs, learning the statistical patterns of medicinal‑chemistry edits.
  • Prompt‑based controllability – introduces simple textual or structural prompts (e.g., “add a methyl group”, “replace a phenyl with a pyridine”) that steer the model toward desired transformation patterns.
  • Retrieval‑Augmented Generation (MMPT‑RAG) – integrates an external similarity search over a reference library of known analogs, providing contextual cues that improve relevance and project‑specificity.
  • Comprehensive evaluation – demonstrates gains in diversity, novelty, and fidelity on both public chemical datasets and real‑world patent collections.
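The variable-to-variable formulation and the edit prompt can be made concrete with a small sketch. This is an illustrative data model, not the paper's actual schema: the `MMPT` field names, the bracketed prompt syntax, and the `to_training_example` helper are all assumptions chosen to show how a (source, target) pair plus an edit tag could be serialized for a seq2seq model.

```python
from dataclasses import dataclass

@dataclass
class MMPT:
    """One matched molecular pair transformation (illustrative schema)."""
    source: str  # SMILES of the starting analog
    target: str  # SMILES of the modified analog
    edit: str    # short edit tag, e.g. "replace=Cl->F"

def to_training_example(pair: MMPT) -> tuple[str, str]:
    """Serialize a pair as (conditioned input, output) for a seq2seq model.

    The edit tag is prepended as a plain-text prompt, so the same model
    can be steered at inference time by swapping the tag.
    """
    return (f"[{pair.edit}] {pair.source}", pair.target)

pair = MMPT(source="c1ccccc1Cl", target="c1ccccc1F", edit="replace=Cl->F")
inp, out = to_training_example(pair)
```

At inference, the same bracketed tag would be prepended to a new source molecule to request that transformation type.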

Methodology

  1. Data preparation – The authors mined public chemistry databases and patent literature to extract matched molecular pairs: two molecules that differ by a single, well‑defined chemical edit (e.g., a functional‑group swap). Each pair is represented as a source SMILES string and a target SMILES string.
  2. Model architecture – A transformer‑based encoder‑decoder is trained to map the source SMILES to the target SMILES. Because the task is variable‑to‑variable, the model learns to focus on the difference rather than memorizing whole‑molecule vocabularies.
  3. Prompting mechanism – Users can prepend a short “edit prompt” (e.g., +CH3, replace=Cl→F) to the source SMILES. The model treats this as an additional conditioning token, biasing the decoder toward the requested transformation.
  4. Retrieval‑augmented generation – Before decoding, a similarity search (FAISS over a fingerprint index) fetches the k most relevant analogs from a domain‑specific library. Their SMILES are concatenated to the prompt, giving the model extra context about how chemists have previously modified similar scaffolds.
  5. Training & fine‑tuning – The base model is pre‑trained on the full MMPT corpus, then optionally fine‑tuned on a narrower project‑specific set (e.g., a single therapeutic area) to capture subtle series‑level trends.
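Steps 3 and 4 above can be sketched end to end. The paper uses FAISS over a fingerprint index; the toy version below replaces that with brute-force Tanimoto similarity over fingerprints represented as Python sets of bit indices, and the `<ctx>` markup for retrieved analogs is an assumption, not the paper's actual prompt format.

```python
def tanimoto(a: set[int], b: set[int]) -> float:
    """Tanimoto similarity between two fingerprint bit sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def retrieve(query_fp: set[int], library: list[tuple[str, set[int]]], k: int = 2) -> list[str]:
    """Return the SMILES of the k library entries most similar to the query.

    `library` is a list of (smiles, fingerprint) pairs; a production system
    would use an approximate index (e.g. FAISS) instead of a full sort.
    """
    ranked = sorted(library, key=lambda e: tanimoto(query_fp, e[1]), reverse=True)
    return [smi for smi, _ in ranked[:k]]

def build_prompt(edit: str, source_smiles: str, retrieved: list[str]) -> str:
    """Concatenate edit tag, retrieved analogs, and source molecule."""
    context = " ".join(f"<ctx>{s}</ctx>" for s in retrieved)
    return f"[{edit}] {context} {source_smiles}"
```

The decoder then conditions on this augmented prompt, so retrieved analogs act as in-context examples of how similar scaffolds were modified before.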

Results & Findings

| Metric | Baseline (whole‑molecule models) | MMPT‑RAG (this work) |
|---|---|---|
| Diversity (Tanimoto‑based) | 0.31 | 0.48 |
| Novelty (unseen in training) | 0.62 | 0.78 |
| Edit‑accuracy (correct transformation type) | 0.55 | 0.71 |
| Human evaluation (chemists rating realism) | 3.1 / 5 | 4.2 / 5 |

  • Diversity & novelty improve because the model learns to recombine edits rather than copy whole molecules.
  • Prompt compliance reaches >80 % when the edit is explicitly specified, showing that simple textual cues are sufficient for fine‑grained control.
  • In a patent‑reconstruction scenario (given a lead scaffold, generate analogs that could plausibly appear in a new filing), MMPT‑RAG recovers >70 % of the actual analogs reported in the patent, outperforming prior rule‑based and graph‑generative baselines.
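For intuition, the diversity and novelty metrics in the table can be sketched in a few lines. This is a generic reading of "Tanimoto-based diversity" (mean pairwise distance) and "novelty" (fraction unseen in training), not necessarily the paper's exact definitions, and fingerprints are again toy bit sets.

```python
from itertools import combinations

def tanimoto(a: set[int], b: set[int]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def diversity(fps: list[set[int]]) -> float:
    """Mean pairwise Tanimoto distance (1 - similarity) over generated molecules."""
    pairs = list(combinations(fps, 2))
    if not pairs:
        return 0.0
    return sum(1.0 - tanimoto(a, b) for a, b in pairs) / len(pairs)

def novelty(generated: list[str], training: set[str]) -> float:
    """Fraction of generated SMILES absent from the training set."""
    return sum(s not in training for s in generated) / len(generated)
```

Higher diversity means the model is not collapsing onto near-duplicate analogs; higher novelty means it is recombining edits rather than replaying training molecules.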

Practical Implications

  • Lead‑optimization pipelines – Integrate MMPT‑RAG as a “suggest‑the‑next‑analog” module. Chemists can input a scaffold and a desired edit or property goal (e.g., increase lipophilicity) and receive a ranked list of synthetically plausible candidates.
  • Project‑specific knowledge transfer – By feeding a company’s internal compound library into the retrieval index, the model automatically respects proprietary SAR trends, reducing the risk of proposing chemically irrelevant changes.
  • Rapid SAR hypothesis testing – Developers can script batch generations with varied prompts, then feed the outputs into downstream property‑prediction models (ADMET, docking) for high‑throughput virtual screening.
  • Low‑code integration – The prompting interface works with plain SMILES strings, making it easy to wrap in REST APIs or Jupyter notebooks without deep ML expertise.
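The batch-generation workflow above can be sketched as a thin wrapper. The `generate` function here is a placeholder standing in for whatever model call or REST endpoint actually serves MMPT-RAG; its name, signature, and dummy output are assumptions for illustration only.

```python
def generate(prompt: str, n: int = 3) -> list[str]:
    """Placeholder for the real model/API call; returns dummy strings.

    Swap this for an actual request to the generation service.
    """
    return [f"{prompt}|analog{i}" for i in range(n)]

def batch_generate(scaffold: str, edits: list[str]) -> dict[str, list[str]]:
    """Run one generation per edit prompt, keyed by edit tag."""
    return {edit: generate(f"[{edit}] {scaffold}") for edit in edits}

results = batch_generate("c1ccccc1", ["+CH3", "replace=Cl->F"])
```

The resulting dictionary of candidate lists can then be piped into property-prediction or docking tools for triage.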

Limitations & Future Work

  • Synthetic feasibility not guaranteed – While the model learns common medicinal‑chemistry edits, it does not explicitly enforce reaction‑level constraints; coupling with a retrosynthesis engine would be needed for production‑ready suggestions.
  • Dependence on retrieval quality – The RAG component’s performance hinges on the relevance of the external library; poorly curated or overly narrow databases can bias generation.
  • Scalability of prompts – Very complex multi‑step transformations (e.g., “add a heterocycle then oxidize”) still challenge the current single‑prompt design.
  • Future directions proposed include:
    1. Joint training with a reaction‑prediction model to embed synthetic routes.
    2. Hierarchical prompting for multi‑step design.
    3. Extending the framework to protein‑targeted generative tasks (e.g., scaffold hopping guided by binding‑site information).

Authors

  • Bo Pan
  • Peter Zhiping Zhang
  • Hao‑Wei Pang
  • Alex Zhu
  • Xiang Yu
  • Liying Zhang
  • Liang Zhao

Paper Information

  • arXiv ID: 2602.16684v1
  • Categories: cs.LG
  • Published: February 18, 2026