[Paper] Generative structural elucidation from mass spectra as an iterative optimization problem
Source: arXiv - 2602.07709v1
Overview
The paper presents FOAM (Formula‑constrained Optimization for Annotating Metabolites), a computational workflow that treats deducing a molecule’s structure from LC‑MS/MS data as an iterative optimization task. By combining a graph‑based genetic algorithm with on‑the‑fly spectral simulation, FOAM can propose plausible structures even when no reference spectrum exists for the compound.
Key Contributions
- Iterative optimization framework for structural elucidation, moving beyond one‑shot prediction models.
- Formula‑constrained graph genetic algorithm that respects the experimentally determined molecular formula while exploring diverse structural candidates.
- Integrated spectral simulator that evaluates each candidate against the observed MS/MS spectrum, providing a feedback loop for the optimizer.
- Benchmarking on two large public datasets (NIST’20 and MassSpecGym) showing competitive or superior performance to state‑of‑the‑art inverse models.
- Demonstration that FOAM can act stand‑alone or augment existing deep‑learning inverse predictors, improving overall annotation rates.
Methodology
- Input constraints – The workflow starts with a high‑confidence molecular formula (e.g., from accurate‑mass measurement). This narrows the chemical space dramatically.
- Population initialization – A set of candidate molecular graphs is generated that satisfy the formula (atom counts, valence rules).
- Genetic operations – Standard GA operators (mutation, crossover) are applied, but they are “graph‑aware”: they add/remove bonds, swap substructures, or rewire rings while keeping the formula valid.
- Spectral simulation – For each candidate, a fast in‑silico MS/MS simulator predicts fragment ions.
- Fitness evaluation – The simulated spectrum is compared to the experimental one (e.g., cosine similarity). The similarity score becomes the fitness value guiding selection.
- Iterative refinement – The GA runs for a fixed number of generations or until convergence, progressively improving the match between simulated and real spectra.
- Post‑processing – The top‑ranked structures are optionally re‑scored by external inverse models (e.g., neural‑network predictors) to combine orthogonal evidence.
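The fitness step can be illustrated with a small cosine-similarity function. This is a minimal sketch, not FOAM's actual scoring code: the function names, the fixed-width binning, and the 0.01 m/z bin width are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def bin_spectrum(peaks, bin_width=0.01):
    """Sum intensities of (m/z, intensity) peaks into fixed-width m/z bins.

    The bin width is an illustrative assumption, not a value from the paper.
    """
    binned = defaultdict(float)
    for mz, intensity in peaks:
        binned[round(mz / bin_width)] += intensity
    return binned


def cosine_similarity(observed, simulated, bin_width=0.01):
    """Cosine similarity between two peak lists after binning (0.0 to 1.0)."""
    a = bin_spectrum(observed, bin_width)
    b = bin_spectrum(simulated, bin_width)
    # Only bins present in both spectra contribute to the dot product.
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # An empty spectrum matches nothing.
    return dot / (norm_a * norm_b)


# Made-up peak lists, used only to exercise the function.
observed = [(85.03, 40.0), (127.04, 100.0), (163.06, 25.0)]
simulated = [(85.03, 35.0), (127.04, 90.0), (145.05, 10.0)]
score = cosine_similarity(observed, simulated)
print(f"fitness = {score:.3f}")
```

A higher score means the simulated fragments of a candidate explain more of the observed intensity, which is what the selection step would reward.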
The entire pipeline is modular, allowing developers to swap in alternative simulators, fitness metrics, or GA strategies.
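To make the modular loop concrete, here is a toy sketch of a formula-preserving evolutionary search with a pluggable simulator and similarity metric. Everything in it is a stand-in chosen for illustration: candidates are strings of atom symbols rather than molecular graphs, `toy_simulate` and `toy_compare` are hypothetical placeholders for a real MS/MS simulator and spectral metric, and crossover is omitted for brevity. The population size and generation count mirror the run described in the runtime analysis.

```python
import random
from collections import Counter


def make_fitness(simulate, compare, observed):
    """Compose a swappable simulator and similarity metric into one fitness function."""
    def fitness(candidate):
        return compare(simulate(candidate), observed)
    return fitness


def swap_mutation(candidate, rng):
    """Swap two positions; the multiset of symbols (the toy 'formula') is unchanged."""
    i, j = rng.sample(range(len(candidate)), 2)
    symbols = list(candidate)
    symbols[i], symbols[j] = symbols[j], symbols[i]
    return "".join(symbols)


def rank(population, fitness):
    """Deduplicate, then sort best-first (ties fall back to alphabetical order)."""
    return sorted(sorted(set(population)), key=fitness, reverse=True)


def run_ga(initial, mutate, fitness, generations=50, population_size=200, keep=20, seed=0):
    """Mutation-only evolutionary loop with elitism: the best `keep` candidates survive."""
    rng = random.Random(seed)
    population = list(initial)
    for _ in range(generations):
        parents = rank(population, fitness)[:keep]
        children = [mutate(rng.choice(parents), rng)
                    for _ in range(population_size - len(parents))]
        population = parents + children
    return rank(population, fitness)


# Hypothetical stand-ins for a real spectral simulator and similarity metric.
def toy_simulate(candidate):
    return tuple(candidate)


def toy_compare(simulated, observed):
    return sum(s == o for s, o in zip(simulated, observed)) / len(observed)


hidden_structure = "CCOCCNCCOC"              # plays the role of the unknown molecule
observed = toy_simulate(hidden_structure)    # plays the role of the measured spectrum
start = "".join(sorted(hidden_structure))    # same atom counts, arbitrary arrangement
fitness = make_fitness(toy_simulate, toy_compare, observed)

ranked = run_ga([start], swap_mutation, fitness)
# Every candidate still has exactly the atom counts of the starting formula.
assert all(Counter(c) == Counter(start) for c in ranked)
print(ranked[0], fitness(ranked[0]))
```

Because the loop only sees `mutate` and `fitness` as callables, swapping in a different simulator or metric means passing different functions to `make_fitness`, without touching the search itself.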
Results & Findings
| Dataset | Top‑1 accuracy (FOAM) | Top‑5 accuracy (FOAM) | Compared to best inverse model |
|---|---|---|---|
| NIST’20 | 38 % | 61 % | +9 % (Top‑1) / +7 % (Top‑5) |
| MassSpecGym | 34 % | 58 % | +6 % (Top‑1) / +5 % (Top‑5) |
- When FOAM’s candidates were fed into a pretrained transformer‑based inverse model, the combined system achieved a 12 % improvement in top‑1 accuracy over the inverse model alone.
- Ablation studies showed that formula constraints contributed the most to performance gains; removing them dropped top‑1 accuracy by ~15 %.
- Runtime analysis indicated that a typical run (≈200 candidates, 50 generations) finishes in under 2 minutes on a modern CPU, making it feasible for batch processing of metabolomics datasets.
Practical Implications
- Metabolomics pipelines can now incorporate FOAM to rescue “orphan” spectra that lack database matches, increasing coverage without manual curation.
- Environmental and forensic labs gain a systematic way to propose structures for unknown contaminants, accelerating hypothesis generation.
- Because FOAM is formula‑first, it pairs naturally with high‑resolution mass spectrometers that already deliver accurate formulas, requiring no extra hardware.
- The modular design means developers can plug in their own fragment‑prediction engines (e.g., quantum‑chemical or machine‑learning based) to tailor the workflow to specific compound classes.
- FOAM’s output (a ranked list of structures with similarity scores) can be directly consumed by downstream tools for pathway analysis, toxicity prediction, or synthetic planning.
Limitations & Future Work
- The current spectral simulator relies on rule‑based fragmentation, which may struggle with exotic chemistries or heavily derivatized metabolites.
- FOAM’s performance degrades when the molecular formula is ambiguous (e.g., isobaric formulas), highlighting a dependence on high‑quality mass accuracy.
- Scaling to very large molecules (> 800 Da) increases the search space dramatically; future work will explore hierarchical GA or reinforcement‑learning‑guided search to keep runtimes low.
- The authors plan to integrate machine‑learned fragmentation models and to open‑source the framework for community‑driven extensions.
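The dependence on mass accuracy noted above can be illustrated with a short calculation. This is a sketch under stated assumptions: the helper names are hypothetical, the parser handles only flat formulas without parentheses or charges, and the two formulas are an arbitrary pair chosen because they differ by the near-isobaric C3 versus SH4 substitution. The monoisotopic masses are standard values for the most abundant isotopes.

```python
import re

# Monoisotopic masses (Da) of the most abundant isotopes.
MONOISOTOPIC_MASS = {
    "C": 12.0,
    "H": 1.00782503,
    "N": 14.0030740,
    "O": 15.9949146,
    "S": 31.9720707,
}


def parse_formula(formula):
    """Parse a flat formula such as 'C6H12O6' into {element: count}."""
    counts = {}
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + (int(number) if number else 1)
    return counts


def monoisotopic_mass(formula):
    """Sum of monoisotopic masses for a neutral formula."""
    return sum(MONOISOTOPIC_MASS[el] * n for el, n in parse_formula(formula).items())


def ppm_difference(formula_a, formula_b):
    """Relative mass difference between two formulas, in parts per million."""
    mass_a = monoisotopic_mass(formula_a)
    mass_b = monoisotopic_mass(formula_b)
    return abs(mass_a - mass_b) / mass_a * 1e6


# Illustrative pair: same nominal mass, different elemental composition.
gap_ppm = ppm_difference("C15H12O5", "C12H16O5S")
print(f"mass gap: {gap_ppm:.1f} ppm")
```

If the instrument's mass error is larger than this gap, both formulas remain plausible and a formula-first search would have to be run for each, or risk starting from the wrong one.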
FOAM demonstrates that treating structure elucidation as an iterative, constraint‑driven optimization problem can bridge the gap between data‑rich mass spectrometry and reliable, automated molecular identification, a promising direction for both academic research and real‑world analytical workflows.
Authors
- Mrunali Manjrekar
- Runzhong Wang
- Samuel Goldman
- Jenna C. Fromer
- Connor W. Coley
Paper Information
- arXiv ID: 2602.07709v1
- Categories: q-bio.QM, cs.NE
- Published: February 7, 2026