[Paper] Symbolic Discovery of Stochastic Differential Equations with Genetic Programming
Source: arXiv - 2603.09597v1
Overview
The paper presents a new way to let machines discover the underlying stochastic differential equations (SDEs) that generate noisy time‑series data. By extending symbolic regression with genetic programming to jointly learn both the deterministic “drift” term and the stochastic “diffusion” term, the authors turn raw, noisy observations into human‑readable, generative models—a capability that can accelerate scientific insight and data‑driven engineering.
Key Contributions
- Symbolic SDE discovery: First method that uses genetic programming to evolve closed‑form expressions for both drift and diffusion functions of an SDE.
- Maximum‑likelihood fitness: Introduces a likelihood‑based objective that evaluates candidate equations directly on the observed stochastic trajectories.
- Scalable to high dimensions: Demonstrates successful recovery of multi‑dimensional systems (e.g., coupled oscillators) without exponential blow‑up.
- Robust to sparse sampling: Works even when data are sampled irregularly or at low frequency, a common scenario in real‑world sensing.
- Generalization to stochastic PDEs: Extends the approach to spatially extended systems, showing that the same symbolic framework can handle partial differential equations with noise.
Methodology
- Data generation – The authors assume a set of observed trajectories $\{x(t_i)\}$ generated by an SDE of the form
  $$\mathrm{d}x = f(x)\,\mathrm{d}t + g(x)\,\mathrm{d}W_t,$$
  where $f$ (drift) and $g$ (diffusion) are unknown symbolic functions and $W_t$ is a Wiener process.
- Genetic programming (GP) backbone – A population of candidate symbolic trees for $f$ and $g$ is evolved using standard GP operators (crossover, mutation, selection).
- Fitness via maximum likelihood – For each candidate pair $(\hat f,\hat g)$, the log‑likelihood of the observed increments $\Delta x_i$ under the Euler–Maruyama discretization is computed:
  $$\log \mathcal{L} = -\sum_i \left[ \frac{(\Delta x_i - \hat f(x_i)\,\Delta t)^2}{2\hat g(x_i)^2\,\Delta t} + \frac{1}{2}\log\bigl(2\pi \hat g(x_i)^2\,\Delta t\bigr) \right].$$
  This directly rewards models that explain both the mean trend and the variance of the data.
- Parsimony pressure – A complexity penalty (e.g., on tree depth) is added to avoid bloated expressions, encouraging concise, interpretable formulas.
- Validation & selection – The best‑scoring individuals are tested on held‑out trajectories and, when needed, refined with a short local optimization (e.g., gradient‑based coefficient tuning).
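The Euler–Maruyama likelihood at the heart of the fitness function is simple to state in code. The sketch below is ours, not the authors' implementation: it assumes 1‑D trajectories sampled at a uniform spacing `dt`, and the helper names (`em_log_likelihood`, `fitness`, the `penalty` weight) are hypothetical.

```python
import numpy as np

def em_log_likelihood(x, dt, drift, diffusion):
    """Euler-Maruyama log-likelihood of a 1-D trajectory x sampled at
    uniform spacing dt, under candidate drift f and diffusion g (callables).
    Each increment dx_i is modeled as Normal(f(x_i)*dt, g(x_i)^2*dt)."""
    x = np.asarray(x, dtype=float)
    dx = np.diff(x)                 # observed increments
    f = drift(x[:-1])               # drift at the left endpoints
    g2 = diffusion(x[:-1]) ** 2     # squared diffusion at the left endpoints
    return -np.sum((dx - f * dt) ** 2 / (2.0 * g2 * dt)
                   + 0.5 * np.log(2.0 * np.pi * g2 * dt))

def fitness(x, dt, drift, diffusion, complexity, penalty=0.1):
    """Likelihood minus a parsimony penalty on expression complexity
    (e.g., total tree size of the candidate drift/diffusion pair)."""
    return em_log_likelihood(x, dt, drift, diffusion) - penalty * complexity
```

In a GP loop, each individual encodes a `(drift, diffusion)` tree pair, and this score drives selection: a candidate that matches the mean trend but misstates the noise amplitude pays through the variance term of the likelihood.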
Results & Findings
| Experiment | Ground‑truth SDE | Recovery Accuracy | Notable Observation |
|---|---|---|---|
| 1‑D Ornstein‑Uhlenbeck | $dx = -\theta x\,dt + \sigma\,dW_t$ | Exact drift & diffusion recovered (within 1 % error) | Works with as few as 50 data points |
| 2‑D Coupled Lotka‑Volterra | Non‑linear drift, state‑dependent diffusion | Correct functional forms identified, coefficients within 5 % | Scales linearly with dimension |
| Stochastic Burgers’ PDE (1‑D) | $\partial_t u = \nu\,\partial_{xx} u - u\,\partial_x u + \eta(x,t)$ | Symbolic noise term $\eta$ captured, drift terms recovered | Demonstrates extension to spatially distributed noise |
Overall, the method outperforms baseline sparse regression (which only targets drift) and remains stable when the sampling interval $\Delta t$ is increased by up to an order of magnitude.
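The Ornstein–Uhlenbeck row can be sanity‑checked numerically. The snippet below is an illustrative check of ours, not the paper's experiment: it simulates $dx = -\theta x\,dt + \sigma\,dW_t$ and recovers $\theta$ and $\sigma$ from the closed‑form maximizers of the Euler–Maruyama likelihood (the GP search would additionally discover the functional forms, not just the coefficients).

```python
import numpy as np

# Simulate an Ornstein-Uhlenbeck trajectory with Euler-Maruyama.
rng = np.random.default_rng(42)
theta, sigma, dt, n = 1.5, 0.3, 0.005, 200_000
x = np.empty(n); x[0] = 0.5
noise = rng.standard_normal(n - 1)
for i in range(n - 1):
    x[i + 1] = x[i] - theta * x[i] * dt + sigma * np.sqrt(dt) * noise[i]

# Closed-form maximum-likelihood estimates under the EM discretization:
# theta maximizes the Gaussian increment likelihood for drift -theta*x,
# sigma^2 is the residual increment variance per unit time.
dx, xl = np.diff(x), x[:-1]
theta_hat = -np.sum(xl * dx) / (np.sum(xl ** 2) * dt)
sigma_hat = np.sqrt(np.mean((dx + theta_hat * xl * dt) ** 2) / dt)
print(theta_hat, sigma_hat)   # both should land near 1.5 and 0.3
```

With a long enough trajectory both estimates fall within a few percent of the ground truth, consistent with the recovery accuracy reported in the table.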
Practical Implications
- Model‑based simulation – Engineers can automatically obtain a compact SDE that not only predicts the mean behavior but also reproduces realistic variability, enabling Monte‑Carlo style scenario testing without hand‑crafting noise models.
- Control & reinforcement learning – Accurate drift‑diffusion models are essential for stochastic optimal control and model‑based RL; symbolic forms make it easier to embed analytical solutions or constraints.
- System identification in IoT / edge devices – The algorithm tolerates sparse, irregular data, making it suitable for sensor networks where bandwidth or power limits sampling rates.
- Explainable AI – Because the output is a human‑readable equation, domain experts can validate, modify, or extend the discovered dynamics, bridging the gap between black‑box ML and traditional physics‑based modeling.
- Rapid prototyping of scientific hypotheses – Researchers can feed experimental time series (e.g., finance, neuroscience, climate) into the pipeline and instantly obtain candidate governing equations for further testing.
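The model‑based simulation use case above amounts to plugging a discovered drift/diffusion pair back into an Euler–Maruyama integrator and sampling an ensemble. The sketch below assumes a 1‑D model; the function name and arguments are ours, chosen for illustration.

```python
import numpy as np

def simulate_ensemble(drift, diffusion, x0, dt, steps, n_paths, seed=0):
    """Monte-Carlo ensemble of Euler-Maruyama paths from a (discovered)
    drift/diffusion pair. Returns an array of shape (steps+1, n_paths)."""
    rng = np.random.default_rng(seed)
    x = np.full(n_paths, float(x0))
    paths = np.empty((steps + 1, n_paths))
    paths[0] = x
    for t in range(steps):
        dw = rng.standard_normal(n_paths) * np.sqrt(dt)  # Wiener increments
        x = x + drift(x) * dt + diffusion(x) * dw
        paths[t + 1] = x
    return paths

# Scenario testing with a hypothetical recovered model dx = -x dt + 0.5 dW:
paths = simulate_ensemble(lambda x: -x, lambda x: 0.5 + 0 * x,
                          x0=1.0, dt=0.01, steps=500, n_paths=2000)
```

Because the diffusion term was learned jointly with the drift, the ensemble spread reflects the variability seen in the data rather than a hand‑tuned noise level.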
Limitations & Future Work
- Assumption of Itô‑type noise – The current likelihood formulation assumes Gaussian Wiener driving noise; extensions to Lévy flights or other noise processes with non‑Gaussian statistics are not covered.
- Computational cost of GP – While scalable, the evolutionary search can be expensive for very high‑dimensional systems (> 10 variables) or massive datasets; hybrid approaches with gradient‑based fine‑tuning are suggested.
- Model selection bias – The parsimony penalty is heuristic; more principled Bayesian model evidence could improve robustness against over‑fitting.
- Partial observability – The method presumes full state observation; handling hidden variables or latent dynamics remains an open challenge.
Future research directions include integrating deep neural surrogates for the likelihood term, extending to stochastic partial differential equations with complex boundary conditions, and developing online GP variants that update the symbolic model as new data streams in.
Bottom line: By marrying genetic programming with a likelihood‑based fitness that respects both drift and diffusion, this work opens the door to automated, interpretable discovery of noisy dynamical laws—a tool that could become a staple in the developer’s toolbox for modeling, simulation, and AI‑driven control.
Authors
- Sigur de Vries
- Sander W. Keemink
- Marcel A. J. van Gerven
Paper Information
- arXiv ID: 2603.09597v1
- Categories: cs.NE, cs.SC
- Published: March 10, 2026