[Paper] Edge-specific signal propagation on mature chromophore-region 3D mechanism graphs for fluorescent protein quantum-yield prediction

Published: May 7, 2026 at 01:51 PM EDT
Source: arXiv - 2605.06644v1

Overview

A new study introduces a graph‑based machine‑learning pipeline that predicts the quantum yield (QY) of fluorescent proteins directly from their three‑dimensional structures. By focusing on the local chemical environment around the mature chromophore, the authors achieve state‑of‑the‑art accuracy, especially for proteins that are evolutionarily distant from the training set.

Key Contributions

  • Chromophore‑centric mechanism graphs: Converts each protein structure into a typed 3‑D residue graph and explicitly partitions the chromophore into phenolate, bridge, and imidazolinone regions.
  • Edge‑specific signal propagation: Propagates physicochemical “signals” (e.g., aromatic stacking, charge interactions) along graph edges to generate 121 enriched features, of which 52 remain as non‑trivial inputs to the regression after pruning.
  • Interpretability by design: Every feature encodes a concrete contact channel, seed signal, and target chromophore region, allowing mechanistic insight without post‑hoc explainers.
  • Superior predictive performance: On a benchmark of 531 fluorescent proteins, the model reaches R = 0.772 ± 0.008 and MAE = 0.131 ± 0.002, beating strong baselines such as ESM‑C and SaProt.
  • Robustness to low sequence similarity: In the hardest “remote” bucket (< 50 % sequence identity), the method still outperforms the baselines (R = 0.697 vs. 0.633, 0.575, and 0.408), with ESM‑C the closest at 0.633.
  • Mechanistic validation: Selected features recover known biophysical mechanisms (e.g., aromatic packing in GFP, charge balance in red proteins), confirming that the model is learning meaningful chemistry.
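The “remote” bucket mentioned above can be illustrated with a minimal sketch. It assumes, since the summary does not say how identity is measured, that a test protein is remote when its highest percent identity to any training sequence is below 50 %, and that the sequences are already aligned to equal length. The function names and toy sequences are hypothetical.

```python
def percent_identity(a, b):
    """Percent identity of two pre-aligned, equal-length sequences.

    Columns where both sequences have a gap ('-') are ignored (assumed convention).
    """
    if len(a) != len(b):
        raise ValueError("sequences must be pre-aligned to the same length")
    columns = [(x, y) for x, y in zip(a, b) if not (x == "-" and y == "-")]
    matches = sum(1 for x, y in columns if x == y and x != "-")
    return 100.0 * matches / len(columns)


def is_remote(test_seq, train_seqs, cutoff=50.0):
    """True if the closest training sequence is below the identity cutoff."""
    best = max(percent_identity(test_seq, t) for t in train_seqs)
    return best < cutoff


# Toy, made-up aligned sequences (not real fluorescent proteins).
train = ["MSKGEELFTG", "MVSKGEEDNM"]
close_homolog = "MSKGEELFTA"   # 9 of 10 positions match the first training sequence
distant_protein = "AAAAEEAAAA"  # at most 2 of 10 positions match any training sequence

remote_flags = {
    "close_homolog": is_remote(close_homolog, train),
    "distant_protein": is_remote(distant_protein, train),
}
```

In practice the identities would come from a proper alignment tool; the point here is only the bucketing rule, which reports results separately for proteins that have no close relative in the training data.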

Methodology

  1. Structure → Graph: Each protein’s PDB file is turned into a graph where nodes are residues and edges represent spatial contacts. Nodes are typed (e.g., aromatic, charged) and edges carry distance information.
  2. Chromophore registration: The graph is aligned to a reference “mature chromophore” state, then split into three functional sub‑regions (phenolate, bridge, imidazolinone).
  3. Signal channels: Physicochemical properties (aromaticity, polarity, flexibility, etc.) are treated as “signals” that can travel along edges. For each channel, the algorithm aggregates how strongly it reaches each chromophore region, yielding a set of enrichment scores.
  4. Feature pruning: 121 raw scores are generated; identity‑based shortcuts (e.g., “same residue as chromophore”) are removed, leaving 52 informative features.
  5. Regression model: An ExtraTrees ensemble (extremely randomized decision trees) is trained separately for each emission band (green, red, far‑red) using the 52 features.
  6. Evaluation: Random 5‑fold cross‑validation, homology‑controlled splits, and top‑K bright‑protein screening (e.g., Bright @ 5) assess both regression quality and practical screening power.
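Steps 1 to 3 above can be sketched in a few lines of plain Python. This is a simplified illustration, not the authors' implementation: the 6 Å contact cutoff, the one‑hop propagation rule, the 1/(1 + d) distance weighting, the residue typing, and the feature‑name format are all assumptions made for the example, and the coordinates are toy values.

```python
from dataclasses import dataclass
from math import dist

REGIONS = ("phenolate", "bridge", "imidazolinone")
CUTOFF = 6.0  # assumed contact cutoff in angstroms; the source does not state one


@dataclass(frozen=True)
class Residue:
    name: str
    kind: str          # coarse node type, e.g. "aromatic" or "charged" (assumed)
    xyz: tuple         # one representative 3-D coordinate per node
    region: str = ""   # chromophore region, or "" for ordinary residues


def build_edges(residues, cutoff=CUTOFF):
    """Undirected contact edges (i, j, distance) for node pairs within the cutoff."""
    edges = []
    for i in range(len(residues)):
        for j in range(i + 1, len(residues)):
            d = dist(residues[i].xyz, residues[j].xyz)
            if d <= cutoff:
                edges.append((i, j, d))
    return edges


def propagate(residues, edges, seed_kind):
    """One-hop propagation of a seed signal into each chromophore region.

    Every non-chromophore residue of type seed_kind contributes 1 / (1 + d)
    to each chromophore region it contacts (assumed weighting).
    """
    scores = {region: 0.0 for region in REGIONS}
    for i, j, d in edges:
        for src, dst in ((i, j), (j, i)):
            seed, target = residues[src], residues[dst]
            if seed.kind == seed_kind and not seed.region and target.region:
                scores[target.region] += 1.0 / (1.0 + d)
    return scores


# Toy structure: three chromophore regions plus two neighbouring residues.
toy = [
    Residue("CRO_phenolate", "chromophore", (0.0, 0.0, 0.0), "phenolate"),
    Residue("CRO_bridge", "chromophore", (2.0, 0.0, 0.0), "bridge"),
    Residue("CRO_imidazolinone", "chromophore", (4.0, 0.0, 0.0), "imidazolinone"),
    Residue("TYR_A", "aromatic", (0.0, 3.0, 0.0)),
    Residue("ARG_B", "charged", (4.0, 4.0, 0.0)),
]

edges = build_edges(toy)
features = {
    f"{kind}->{region}": round(score, 3)
    for kind in ("aromatic", "charged")
    for region, score in propagate(toy, edges, kind).items()
}
```

Each key in `features` names a seed signal and a target chromophore region, which is what makes this kind of feature readable without a post‑hoc explainer. A fuller version would also distinguish contact channels and propagate over more than one hop.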

Results & Findings

| Metric | Proposed Method | Best Baseline |
|---|---|---|
| Pearson R (random CV) | 0.772 ± 0.008 | 0.734 (ESM‑C) |
| MAE (random CV) | 0.131 ± 0.002 | 0.152 (SaProt) |
| Bright @ 5 (top‑5 screen) | 0.704 | 0.618 (Band mean) |
| Remote bucket R (< 50 % ID) | 0.697 | 0.633 (ESM‑C) |
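For readers who want to reproduce metrics of this kind on their own predictions, here is a minimal standard‑library sketch. Pearson R and MAE follow their usual definitions. The Bright @ K function is an assumed reading of the metric (the fraction of the K highest‑predicted proteins whose measured QY clears a brightness threshold); the threshold of 0.6 and the toy values are hypothetical and are not taken from the paper.

```python
from math import sqrt


def pearson_r(y_true, y_pred):
    """Pearson correlation coefficient between measured and predicted values."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / sqrt(var_t * var_p)


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def bright_at_k(y_true, y_pred, k=5, threshold=0.6):
    """Assumed definition: share of the top-k predictions that are truly bright.

    'Truly bright' means measured QY >= threshold (hypothetical cutoff).
    """
    ranked = sorted(range(len(y_pred)), key=lambda i: y_pred[i], reverse=True)
    top = ranked[:k]
    return sum(1 for i in top if y_true[i] >= threshold) / k


# Toy measured and predicted quantum yields (made-up numbers).
qy_true = [0.10, 0.80, 0.65, 0.30, 0.90, 0.55, 0.70, 0.20]
qy_pred = [0.20, 0.75, 0.60, 0.35, 0.85, 0.65, 0.50, 0.25]

r = pearson_r(qy_true, qy_pred)
err = mae(qy_true, qy_pred)
hit_rate = bright_at_k(qy_true, qy_pred, k=5, threshold=0.6)
```

Regression metrics and screening metrics answer different questions: a model can have a respectable R while still ranking the wrong proteins first, which is why a top‑K measure is reported alongside R and MAE.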

Interpretation: The model not only predicts QY more accurately but also does better at picking the brightest candidates out of a large pool, a key need in protein engineering pipelines. Feature analysis revealed band‑specific mechanisms:

  • GFP‑like (green): aromatic packing and asymmetric “clamp” residues stabilize the phenolate.
  • Red proteins: a delicate balance of positive/negative charges around the bridge region governs radiative decay.
  • Far‑red: flexibility‑risk trade‑offs and bulky side‑chain contacts dominate.

Practical Implications

  • Accelerated protein engineering: Researchers can feed a set of candidate structures into the tool and rank them by predicted QY before going to the bench, reducing the number of experimental screening cycles.
  • Design of custom fluorophores: By inspecting the most influential graph features, engineers can rationally mutate residues that improve specific signal channels (e.g., introduce aromatic residues near the phenolate to boost green fluorescence).
  • Cross‑species applicability: Because performance holds even for low‑identity proteins, the method is useful for mining novel fluorescent proteins from metagenomic or synthetic libraries where sequence homology is minimal.
  • Integration with existing pipelines: The feature extraction step is compatible with standard structural bioinformatics tools (e.g., Biopython, PyMOL), and the ExtraTrees model can be wrapped in a lightweight API for high‑throughput cloud or on‑device inference.

Limitations & Future Work

  • Structure dependence: The approach requires high‑quality 3‑D models; prediction accuracy may degrade for proteins lacking resolved structures or reliable homology models.
  • Preprocessing cost: Although the feature set is reduced to 52, building and registering a graph for every structure is a substantial preprocessing step that could become a bottleneck for massive libraries.
  • Generalization beyond fluorescent proteins: The method is tailored to chromophore‑centric mechanisms; extending it to other functional sites (e.g., enzyme active sites) will need domain‑specific graph partitioning.
  • Future directions: The authors plan to (1) incorporate dynamic information from molecular dynamics simulations to capture conformational flexibility, (2) explore end‑to‑end graph neural networks that learn signal propagation automatically, and (3) release an open‑source package to lower the barrier for community adoption.

Authors

  • Yuchen Xiong
  • Swee Keong Yeap
  • Steven Aw Yoong Kit

Paper Information

  • arXiv ID: 2605.06644v1
  • Categories: cs.LG
  • Published: May 7, 2026