[Paper] Order Matters in Retrosynthesis: Structure-aware Generation via Reaction-Center-Guided Discrete Flow Matching
Source: arXiv - 2602.13136v1
Overview
A new paper tackles retrosynthesis—the problem of figuring out how to make a target molecule—from a fresh angle: the order in which atoms are presented to a neural network matters. By deliberately placing the atoms that form the reaction center at the front of the input sequence, the authors turn implicit chemical knowledge into a simple positional cue that modern graph transformers can exploit. The result is a template‑free system that reaches state‑of‑the‑art accuracy while needing far fewer inference steps than previous diffusion‑based models.
Key Contributions
- Positional inductive bias: Introduces a “reaction‑center‑first” atom ordering that makes the most chemically relevant substructure easy for the model to spot.
- RetroDiT backbone: A graph transformer equipped with rotary position embeddings that directly consumes the ordered atom sequence.
- Discrete flow matching: Decouples training from sampling, allowing the model to generate retrosynthetic routes in 20‑50 steps (vs. ~500 steps for earlier diffusion approaches).
- Strong empirical results: Sets new top‑1 accuracy records on USPTO‑50K (61.2 %) and USPTO‑Full (51.3 %) with predicted reaction centers; jumps to 71.1 % / 63.4 % when oracle centers are supplied.
- Efficiency over scale: Shows that a 280 K‑parameter model with the ordering trick matches the performance of a 65 M‑parameter model lacking it, highlighting the power of structural priors over brute‑force scaling.
Methodology
- Two‑stage view of a reaction – First, identify the reaction center (atoms whose bonds change); second, reconstruct the full precursor molecules.
- Atom ordering as a bias – The authors reorder the graph’s node list so that reaction‑center atoms appear at the beginning of the sequence fed to the transformer. This turns “where the chemistry happens” into a simple positional pattern.
- RetroDiT architecture – A graph transformer that processes the ordered node list, using rotary position embeddings to preserve relative order information without sacrificing permutation invariance of the rest of the graph.
- Discrete flow matching – Rather than learning a continuous diffusion process, the model learns discrete transition probabilities that map a fully noised latent graph to a valid precursor graph. Because training is decoupled from the sampling schedule, inference can step through a short, fixed number of discrete transitions (20‑50) to produce a candidate set of precursors.
- Reaction‑center prediction – A lightweight classifier predicts the reaction center from the target molecule; its output guides the ordering for the main generator.
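The reordering step described above can be sketched in a few lines. The function names and the boolean `center_mask` input are illustrative assumptions, not the paper's actual implementation; the idea is simply a permutation that moves predicted reaction-center atoms to the front.

```python
# Sketch: reorder a molecular graph so reaction-center atoms come first.
# `center_mask[i]` is True when atom i belongs to the predicted reaction
# center. All names here are illustrative, not the paper's code.

def center_first_order(num_atoms, center_mask):
    """Return a permutation placing reaction-center atoms at the front."""
    center = [i for i in range(num_atoms) if center_mask[i]]
    rest = [i for i in range(num_atoms) if not center_mask[i]]
    return center + rest

def apply_order(perm, node_feats, adj):
    """Permute node features (a list) and a dense adjacency matrix together."""
    n = len(perm)
    new_feats = [node_feats[i] for i in perm]
    new_adj = [[adj[perm[a]][perm[b]] for b in range(n)] for a in range(n)]
    return new_feats, new_adj

# Atoms 1 and 3 form the center, so they move to the front of the sequence:
perm = center_first_order(5, [False, True, False, True, False])
# perm == [1, 3, 0, 2, 4]
```

Because the bias is just a permutation of the input sequence, it composes with any downstream sequence model without architectural changes.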
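Rotary position embeddings make attention scores depend on relative positions, which is what lets the "center-first" ordering act as a soft cue. A minimal sketch of the standard RoPE rotation follows, using the conventional base of 10000; the exact configuration in RetroDiT may differ.

```python
import math

# Minimal sketch of rotary position embeddings (RoPE): consecutive dimension
# pairs of a query/key vector are rotated by a position-dependent angle, so
# dot products between rotated queries and keys depend only on their relative
# positions. The base of 10000 is the common convention, assumed here.

def rope(vec, pos, base=10000.0):
    """Rotate consecutive dimension pairs of `vec` by angles scaled by `pos`."""
    d = len(vec)
    out = [0.0] * d
    for i in range(0, d, 2):
        theta = pos / (base ** (i / d))
        c, s = math.cos(theta), math.sin(theta)
        out[i] = vec[i] * c - vec[i + 1] * s
        out[i + 1] = vec[i] * s + vec[i + 1] * c
    return out
```

The key property: shifting both positions by the same amount leaves the query-key dot product unchanged, so only relative order within the atom sequence matters.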
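The short sampling loop of discrete flow matching can be sketched with a mask-to-data schedule, a common formulation for discrete generation. Everything below is an assumption-laden toy: `predict_target` stands in for the trained denoiser, the vocabulary is a handful of atom types, and the unmasking rate `dt / (1 - t)` is one standard choice, not necessarily the paper's.

```python
import random

# Toy sketch of discrete flow-matching sampling: every node starts as "MASK"
# and is progressively unmasked toward a model-predicted atom type over a
# small, fixed number of steps. `predict_target` is a placeholder denoiser.

def predict_target(tokens):
    # Stand-in for the trained model: a fixed distribution over atom types.
    # A real denoiser would condition on the target molecule and the ordered
    # atom sequence.
    return [{"C": 0.9, "N": 0.05, "O": 0.05} for _ in tokens]

def sample(num_nodes, num_steps=20, seed=0):
    rng = random.Random(seed)
    tokens = ["MASK"] * num_nodes          # fully noised start state
    for step in range(num_steps):
        t = step / num_steps
        dt = 1.0 / num_steps
        probs = predict_target(tokens)
        for i, tok in enumerate(tokens):
            if tok != "MASK":
                continue                    # already committed, keep it
            # Chance of unmasking node i during [t, t + dt); reaches 1.0 at
            # the final step, so no node is left masked.
            rate = dt / (1.0 - t)
            if rng.random() < rate:
                cats, weights = zip(*probs[i].items())
                tokens[i] = rng.choices(cats, weights=weights)[0]
    return tokens
```

Because the number of steps is a free inference-time knob, the same trained model can trade a few points of accuracy for speed, which is what enables the 20-50 step budget reported in the paper.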
Results & Findings
| Dataset | Setting | Top‑1 Accuracy |
|---|---|---|
| USPTO‑50K | Predicted centers | 61.2 % |
| USPTO‑Full | Predicted centers | 51.3 % |
| USPTO‑50K | Oracle (ground‑truth) centers | 71.1 % |
| USPTO‑Full | Oracle centers | 63.4 % |
- Speed: Generation requires only 20‑50 discrete flow steps, a 10×‑25× speed‑up over prior diffusion‑based retrosynthesis models that needed ~500 steps.
- Parameter efficiency: A 0.28 M‑parameter RetroDiT matches a 65 M‑parameter baseline that lacks the ordering bias, confirming that the structural prior is more valuable than sheer model size.
- Data efficiency: The approach outperforms large foundation models trained on 10 B reactions, despite using only the standard USPTO datasets (≈1 M reactions).
Practical Implications
- Faster AI‑assisted synthesis planning: Chemists can obtain candidate routes in seconds rather than minutes, enabling tighter integration into interactive design tools and automated lab workflows.
- Reduced compute costs: The discrete flow matching scheme and small model size lower GPU memory and inference time, making deployment feasible on on‑premise servers or even high‑end workstations.
- Better generalization with limited data: By encoding domain knowledge as a simple ordering, companies with proprietary reaction databases (often far smaller than public corpora) can train competitive retrosynthesis models without needing massive data collection.
- Plug‑and‑play reaction‑center predictor: The modular design lets developers swap in a more sophisticated center‑prediction model (e.g., a graph‑based classifier fine‑tuned on a specific chemistry domain) to further boost accuracy.
- Potential for downstream automation: The short, deterministic generation pipeline is well‑suited for coupling with robotic synthesis platforms that require rapid, reliable route suggestions.
Limitations & Future Work
- Reliance on accurate reaction‑center prediction: If the center classifier errs, the ordering cue can mislead the generator, degrading performance.
- Template‑free but still heuristic: although the model uses no explicit reaction templates, the discrete flow schedule is hand‑designed; fully learned flow dynamics could yield further gains.
- Scalability to exotic chemistries: The benchmarks focus on patent reactions; extending to organometallic or biocatalytic transformations may require additional domain‑specific priors.
- Integration with multi‑step planning: The paper evaluates single‑step retrosynthesis; future work could embed the method into a recursive planner that assembles multi‑step synthetic routes.
Authors
- Chenguang Wang
- Zihan Zhou
- Lei Bai
- Tianshu Yu
Paper Information
- arXiv ID: 2602.13136v1
- Categories: cs.LG
- Published: February 13, 2026