[Paper] From In Silico to In Vitro: Evaluating Molecule Generative Models for Hit Generation

Published: December 26, 2025
4 min read
Source: arXiv - 2512.22031v1

Overview

The paper From In Silico to In Vitro: Evaluating Molecule Generative Models for Hit Generation asks a simple but bold question: can modern deep generative models actually produce “hit‑like” molecules ready to enter the early‑stage drug‑discovery workflow? By treating hit generation as a stand‑alone task, the authors benchmark several state‑of‑the‑art generative models, evaluate them with a custom multi‑criteria pipeline, and synthesize a handful of predicted GSK‑3β inhibitors that prove active in the lab.

Key Contributions

  • First formal framing of “hit‑like molecule generation” as an independent, measurable task rather than a vague component of the full drug‑discovery pipeline.
  • A comprehensive evaluation framework that combines physicochemical filters, structural similarity checks, and target‑specific docking scores to define a realistic “hit‑like” chemical space.
  • Benchmarking of three generative architectures (two autoregressive models and one diffusion‑based model) across multiple training datasets and settings.
  • Empirical validation: several AI‑generated compounds were synthesized and experimentally confirmed as active GSK‑3β inhibitors.
  • Critical analysis of current metrics, exposing gaps between standard generative‑model scores (validity, uniqueness, novelty) and true drug‑discovery relevance.

Methodology

  1. Data Curation – Public bioactivity databases (e.g., ChEMBL) were filtered to create target‑specific training sets for several proteins, including GSK‑3β. Each set was split into “hit‑like” (high‑affinity) and “non‑hit” molecules.
  2. Model Selection
    • Autoregressive Model A (SMILES‑based RNN).
    • Autoregressive Model B (Transformer‑style language model).
    • Diffusion Model (graph‑based diffusion process that iteratively denoises a random molecular graph).
  3. Training Regimes – Models were trained under three conditions: (i) full‑dataset training, (ii) hit‑only fine‑tuning, and (iii) multi‑task learning with auxiliary property predictors.
  4. Multi‑Stage Filtering Pipeline – Generated molecules passed through the following gates (a minimal RDKit sketch appears after this list):
    • Physicochemical filters (Lipinski, PAINS, synthetic accessibility).
    • Structural similarity to known actives (Tanimoto ≥ 0.4).
    • Docking against the target protein (AutoDock Vina) to obtain a binding‑score threshold.
  5. Metrics – Standard generative metrics (validity, uniqueness, novelty) plus hit‑likeness score (percentage of molecules surviving the full pipeline).
  6. Experimental Validation – The top‑ranked GSK‑3β candidates were synthesized, purified, and tested in an enzymatic inhibition assay.
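
Step 4 is the heart of the evaluation, so a concrete illustration may help. Below is a minimal sketch of the first gates (validity, Lipinski, PAINS, Tanimoto similarity) using RDKit. The Lipinski rules and the Tanimoto ≥ 0.4 threshold follow the paper’s description; the function names, fingerprint settings, and the omission of the synthetic‑accessibility and docking gates are our illustrative assumptions, not the authors’ code.

```python
# Hedged sketch of the multi-stage filter in step 4, using RDKit.
# Assumptions: Morgan fingerprints (radius 2, 2048 bits) for similarity;
# synthetic-accessibility and docking gates are omitted for brevity.
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem, Descriptors, FilterCatalog

# PAINS substructure catalog shipped with RDKit.
_params = FilterCatalog.FilterCatalogParams()
_params.AddCatalog(FilterCatalog.FilterCatalogParams.FilterCatalogs.PAINS)
_pains = FilterCatalog.FilterCatalog(_params)

def passes_lipinski(mol):
    """Rule-of-five check: weight, logP, H-bond donors and acceptors."""
    return (Descriptors.MolWt(mol) <= 500
            and Descriptors.MolLogP(mol) <= 5
            and Descriptors.NumHDonors(mol) <= 5
            and Descriptors.NumHAcceptors(mol) <= 10)

def hit_like(smiles, active_fps, sim_threshold=0.4):
    """Validity -> Lipinski -> PAINS -> Tanimoto similarity to known actives."""
    mol = Chem.MolFromSmiles(smiles)          # None for invalid SMILES
    if mol is None or not passes_lipinski(mol) or _pains.HasMatch(mol):
        return False
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, 2048)
    best = max(DataStructs.TanimotoSimilarity(fp, a) for a in active_fps)
    return best >= sim_threshold              # docking gate would follow
```

Molecules surviving these gates would then be docked before the hit‑likeness score is computed.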

Results & Findings

Model              Validity   Uniqueness   Novelty   Hit‑likeness (post‑filter)
Autoregressive A   98 %       92 %         85 %      12 %
Autoregressive B   99 %       95 %         88 %      15 %
Diffusion          97 %       97 %         90 %      18 %
  • All models produced chemically valid SMILES/graphs; the diffusion model yielded the highest diversity.
  • After the full filtering pipeline, 12–18 % of generated compounds (depending on the model) qualified as “hit‑like”, a dramatic enrichment over random sampling (≈2 %).
  • Docking scores for the top‑10 candidates were comparable to known actives (average ΔG ≈ ‑9.5 kcal/mol).
  • Experimental hit rate: 4 out of 7 synthesized GSK‑3β candidates showed ≥ 50 % inhibition at 10 µM, confirming that the AI‑generated molecules are biologically relevant.
  • The authors note that standard metrics (e.g., novelty) alone are poor proxies for downstream success; the multi‑stage pipeline is essential for realistic assessment.
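
For context, the three standard metrics in the table above are conventionally computed as shown below. This is a sketch under the textbook definitions (valid = parsable, unique = distinct canonical SMILES, novel = absent from the training set); the authors’ exact implementation may differ in details such as canonicalization.

```python
# Sketch of the standard generative metrics reported in the results table,
# under their conventional definitions (not necessarily the authors' code).
from rdkit import Chem

def generative_metrics(generated_smiles, training_smiles):
    # Validity: fraction of generated strings RDKit can parse.
    canonical = []
    for smi in generated_smiles:
        mol = Chem.MolFromSmiles(smi)
        if mol is not None:
            canonical.append(Chem.MolToSmiles(mol))  # canonical form
    validity = len(canonical) / len(generated_smiles)
    # Uniqueness: fraction of valid molecules that are distinct.
    unique = set(canonical)
    uniqueness = len(unique) / max(len(canonical), 1)
    # Novelty: fraction of unique molecules not seen during training.
    train = {Chem.MolToSmiles(Chem.MolFromSmiles(s)) for s in training_smiles}
    novelty = len(unique - train) / max(len(unique), 1)
    return validity, uniqueness, novelty
```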

Practical Implications

  • Accelerated hit identification – Teams can replace a portion of high‑throughput screening with AI‑generated libraries, cutting costs and time.
  • Target‑specific library design – By fine‑tuning on a small set of known actives, developers can quickly generate focused compound collections for any protein with a structural model.
  • Integration into existing pipelines – The filtering pipeline can be scripted into CI/CD‑style workflows (e.g., using RDKit, OpenEye, and docking engines), enabling automated “AI‑first” hit generation before wet‑lab validation; a sketch of scripting the docking step appears after this list.
  • Open‑source tooling – The paper’s code and datasets (released under permissive licenses) provide a ready‑to‑use baseline for companies building proprietary generative‑chemistry platforms.
  • Risk mitigation – Since the models still produce a non‑trivial fraction of undesirable molecules (synthetic inaccessibility, PAINS), a downstream human‑in‑the‑loop review remains necessary.
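
As an example of the scripted integration mentioned in the third bullet, the docking gate can be driven from Python through the AutoDock Vina command line. The receptor/ligand file names and search‑box parameters below are placeholders, not values from the paper; the score parsing relies on Vina’s standard “REMARK VINA RESULT” lines in the output PDBQT.

```python
# Illustrative wrapper around the AutoDock Vina CLI; box center/size values
# and file names are placeholders, not parameters from the paper.
import subprocess

def dock_score(receptor_pdbqt, ligand_pdbqt, out_pdbqt="docked.pdbqt"):
    subprocess.run(
        ["vina",
         "--receptor", receptor_pdbqt,
         "--ligand", ligand_pdbqt,
         "--center_x", "10.0", "--center_y", "12.0", "--center_z", "-4.0",
         "--size_x", "20", "--size_y", "20", "--size_z", "20",
         "--out", out_pdbqt],
        check=True,
    )
    # Vina writes the best pose's affinity (kcal/mol) into the output file.
    with open(out_pdbqt) as f:
        for line in f:
            if line.startswith("REMARK VINA RESULT:"):
                return float(line.split()[3])
    return None
```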

Limitations & Future Work

  • Training data bias – Public bioactivity databases are skewed toward certain chemotypes and assay types, limiting the chemical space the models can learn.
  • Evaluation metrics – The authors highlight that docking scores are only a proxy for true binding affinity; more rigorous free‑energy calculations or ML‑based affinity predictors could improve ranking.
  • Scalability of synthesis – While a handful of hits were validated, scaling up to hundreds of candidates will require better synthetic‑route prediction and cost estimation.
  • Generalization to novel targets – The study focused on a few well‑characterized proteins; extending the approach to orphan or poorly characterized targets remains an open challenge.
  • Future directions include incorporating active‑learning loops (feeding wet‑lab assay results back to retrain the generator; a conceptual sketch follows this list), exploring multimodal models that jointly handle 3D conformations, and developing richer evaluation suites that combine ADMET predictions with the current hit‑likeness criteria.
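
The active‑learning idea can be stated compactly. The sketch below is purely conceptual, and every helper (hit_like_filter, rank_by_docking, run_wet_lab_assay) is a hypothetical placeholder: the paper proposes the loop as future work rather than implementing it.

```python
# Conceptual active-learning loop (future work in the paper, not implemented
# there): generate, filter in silico, assay a small batch, retrain on hits.
# All helper functions below are hypothetical placeholders.
def active_learning_loop(generator, assay_batch_size=10, rounds=5):
    confirmed_hits = []
    for _ in range(rounds):
        candidates = generator.sample(n=10_000)
        survivors = [m for m in candidates if hit_like_filter(m)]
        batch = rank_by_docking(survivors)[:assay_batch_size]
        results = run_wet_lab_assay(batch)      # slow, expensive oracle
        hits = [mol for mol, active in results if active]
        confirmed_hits.extend(hits)
        generator.fine_tune(hits)               # bias sampling toward actives
    return confirmed_hits
```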

Authors

  • Nagham Osman
  • Vittorio Lembo
  • Giovanni Bottegoni
  • Laura Toni

Paper Information

  • arXiv ID: 2512.22031v1
  • Categories: cs.LG, cs.AI
  • Published: December 26, 2025