[Paper] Learning Continuous Solvent Effects from Transient Flow Data: A Graph Neural Network Benchmark on Catechol Rearrangement
Source: arXiv - 2512.19530v1
Overview
The paper tackles a long‑standing bottleneck in synthetic chemistry: predicting how reaction yields change when you continuously vary solvent composition. By building a high‑throughput “Catechol Benchmark” dataset (1,227 yield measurements across 24 pure solvents and their binary mixtures) and testing modern machine‑learning models, the authors show that a graph‑neural‑network (GNN) architecture can learn solvent effects with far greater accuracy than traditional tabular or language‑model approaches.
Key Contributions
- Catechol Benchmark dataset – a publicly released, high‑quality collection of transient flow experiments covering continuous solvent volume fractions.
- Rigorous evaluation protocols – leave‑one‑solvent‑out (LOSO) and leave‑one‑mixture‑out (LOMO) splits that truly test a model’s ability to extrapolate to unseen solvent environments.
- Hybrid GNN architecture – combines Graph Attention Networks (GAT) for molecular structure, Differential Reaction Fingerprints (DRFP) for reaction context, and a learned continuous mixture encoding for solvents.
- State‑of‑the‑art performance – achieves an MSE of 0.0039 ± 0.0003, a ~60 % reduction over the best baseline and >25× improvement over strong tabular ensembles.
- Open‑source release – dataset, evaluation scripts, and reference implementations are made available to the community.
Methodology
- Data collection – The authors performed high‑throughput flow chemistry experiments on the allyl‑substituted catechol rearrangement, measuring yields in 24 pure solvents and all possible binary mixtures (by volume). Each experiment is tagged with the exact solvent fraction (e.g., 0 % B, 23 % B, …, 100 % B).
- Feature engineering
  - Molecular graphs of reactants and products are fed into a Graph Attention Network that learns atom‑level interactions.
  - DRFP vectors (differential reaction fingerprints) capture the overall reaction transformation as a compact binary fingerprint derived from the symmetric difference of reactant and product substructure sets.
  - Mixture encoding treats the solvent composition as a continuous variable (volume fraction of solvent B) and learns an embedding jointly with the GNN.
- Model training & evaluation – The hybrid model is trained to regress the experimental yield. Performance is measured under two stringent cross‑validation schemes:
  - LOSO: all data from one pure solvent are held out.
  - LOMO: all data from one binary mixture are held out.
  Both schemes force the model to infer solvent effects it has never seen during training.
- Baselines – Gradient‑Boosted Decision Trees (GBDT), Random Forests, and large language‑model embeddings (e.g., Qwen‑7B) are trained on the same data for comparison.
- Ablation studies – The authors systematically remove components (graph message‑passing, mixture encoding, DRFP) to quantify each part’s contribution.
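To make the evaluation protocols concrete, here is a minimal sketch of LOSO and LOMO fold generation. The record schema (`solvent_a`, `solvent_b`, `frac_b`) is an invented toy format for illustration, not the benchmark's actual column layout.

```python
from collections import defaultdict

def loso_splits(records):
    """Leave-one-solvent-out: each fold holds out every experiment that
    involves one solvent. `records` uses a toy schema (solvent_a,
    solvent_b, frac_b), not the benchmark's actual column names."""
    solvents = sorted({r["solvent_a"] for r in records} |
                      {r["solvent_b"] for r in records})
    for held_out in solvents:
        test = [r for r in records
                if held_out in (r["solvent_a"], r["solvent_b"])]
        train = [r for r in records if r not in test]
        yield held_out, train, test

def lomo_splits(records):
    """Leave-one-mixture-out: each fold holds out all volume fractions of
    one solvent pair (a pure solvent is a pair of a solvent with itself)."""
    by_pair = defaultdict(list)
    for r in records:
        by_pair[tuple(sorted((r["solvent_a"], r["solvent_b"])))].append(r)
    for pair in sorted(by_pair):
        test = by_pair[pair]
        train = [r for p, rs in by_pair.items() if p != pair for r in rs]
        yield pair, train, test

# Tiny illustrative dataset (3 records, 3 solvents).
example = [
    {"solvent_a": "MeOH", "solvent_b": "MeOH", "frac_b": 0.0},
    {"solvent_a": "MeOH", "solvent_b": "H2O",  "frac_b": 0.5},
    {"solvent_a": "MeCN", "solvent_b": "H2O",  "frac_b": 0.25},
]
loso_folds = list(loso_splits(example))  # one fold per solvent
lomo_folds = list(lomo_splits(example))  # one fold per solvent pair
```

The key property of both splitters is that train and test folds never share the held‑out solvent environment, which is what makes the LOSO/LOMO numbers a test of extrapolation rather than memorization.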
Results & Findings
| Model | MSE (LOSO) | MSE (LOMO) |
|---|---|---|
| GBDT (tabular) | 0.099 | 0.102 |
| Qwen‑7B (LLM embeddings) | 0.129 | 0.135 |
| Hybrid GNN (full) | 0.0039 ± 0.0003 | 0.0042 ± 0.0004 |
| GNN w/o mixture encoding | 0.0121 | 0.0134 |
| GNN w/o DRFP | 0.0098 | 0.0105 |
- The hybrid GNN outperforms all baselines by an order of magnitude in both LOSO and LOMO settings.
- Removing the continuous solvent embedding degrades performance by ~3×, confirming that treating solvent composition as a continuous variable is essential.
- The model’s predictions remain accurate even for solvent mixtures that were never observed during training, demonstrating genuine interpolation capability across the solvent space.
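The simplest realization of a continuous mixture encoding, and a plausible baseline for the learned version the ablation shows to be essential, is to interpolate per‑solvent descriptor vectors by volume fraction. The descriptor values below are illustrative placeholders, not the paper's learned embeddings.

```python
def mixture_encoding(desc_a, desc_b, frac_b):
    """Linearly interpolate two solvent descriptor vectors by the volume
    fraction of solvent B. In the paper's hybrid model the descriptors
    would be trainable embeddings learned jointly with the GNN; the
    continuous-interpolation idea is the same."""
    if not 0.0 <= frac_b <= 1.0:
        raise ValueError("frac_b must be a volume fraction in [0, 1]")
    return [(1.0 - frac_b) * a + frac_b * b for a, b in zip(desc_a, desc_b)]

# Toy 2-d descriptors (dielectric constant, logP) -- illustrative only.
methanol = [32.7, -0.77]
water    = [78.4, -1.38]

pure_a = mixture_encoding(methanol, water, 0.0)  # recovers methanol
half   = mixture_encoding(methanol, water, 0.5)  # 50:50 blend
```

Because the encoding varies smoothly with `frac_b`, a model trained on a handful of fractions can produce sensible predictions at intermediate compositions, which is the interpolation behaviour the LOMO results demonstrate.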
Practical Implications
- Process development – Chemists can query a trained model to predict yields for arbitrary blends of the studied solvents without running costly experiments, accelerating solvent screening in drug synthesis or fine‑chemical production.
- Continuous‑flow reactors – The dataset originates from flow chemistry; integrating the model into a control loop could enable real‑time solvent composition tuning to maintain optimal yields.
- Data‑efficient reaction modeling – The benchmark shows that with a modest number of well‑designed experiments, a GNN can learn complex solvent effects, reducing the experimental burden for new reaction families.
- Tooling for ML‑enabled chemistry platforms – Open‑source code and evaluation pipelines make it straightforward to plug the model into existing cheminformatics stacks (e.g., RDKit, PyTorch Geometric).
- Beyond solvents – The continuous mixture encoding strategy can be generalized to other tunable reaction parameters (temperature gradients, catalyst loadings, pressure), opening a path toward fully differentiable reaction optimization.
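The solvent-screening workflow described above reduces to sweeping a trained model over a grid of volume fractions and picking the best predicted yield. The sketch below uses an invented quadratic surrogate in place of the trained GNN, purely to illustrate the screening loop.

```python
def screen_solvent_fractions(predict, n_points=101):
    """Sweep the volume fraction of solvent B over an even grid and return
    the fraction with the highest predicted yield. `predict` stands in for
    a trained yield model: any callable mapping frac_b -> predicted yield."""
    grid = [i / (n_points - 1) for i in range(n_points)]
    best_frac = max(grid, key=predict)
    return best_frac, predict(best_frac)

# Stand-in surrogate: a yield curve peaking at 70% solvent B
# (an invented shape, not data from the paper).
toy_model = lambda f: 0.8 - 2.0 * (f - 0.7) ** 2

best_frac, best_yield = screen_solvent_fractions(toy_model)
```

In a flow-chemistry control loop, `predict` would be the trained hybrid GNN and the chosen fraction would be fed back to the pump controllers, which is the real-time tuning scenario the authors envision.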
Limitations & Future Work
- Scope of chemistry – The study focuses on a single reaction class (catechol rearrangement). Generalizing to vastly different mechanisms will require additional data.
- Solvent representation – While the continuous volume fraction works for binary mixtures, extending to ternary or higher‑order mixtures may need more sophisticated encoding schemes.
- Scale‑up considerations – The dataset is generated in a high‑throughput flow setup; translating predictions to batch reactors could involve additional transport phenomena not captured here.
- Interpretability – Although the GNN provides accurate predictions, extracting chemically intuitive “solvent effect rules” remains an open challenge.
Future research directions include expanding the benchmark to multi‑component solvent systems, integrating temperature/pressure as continuous variables, and exploring explainable GNN techniques to surface mechanistic insights for chemists.
Authors
- Hongsheng Xing
- Qiuxin Si
Paper Information
- arXiv ID: 2512.19530v1
- Categories: cs.LG, cs.AI
- Published: December 22, 2025