[Paper] Symbolic Discovery of Stochastic Differential Equations with Genetic Programming
Source: arXiv - 2603.09597v1
Overview
The paper presents a new way to let machines discover the underlying stochastic differential equations (SDEs) that generate noisy time‑series data. By extending symbolic regression with genetic programming to jointly learn both the deterministic “drift” term and the stochastic “diffusion” term, the authors turn raw, noisy observations into human‑readable, generative models—a capability that can accelerate scientific insight and data‑driven engineering.
Key Contributions
- Symbolic SDE discovery: First method that uses genetic programming to evolve closed‑form expressions for both drift and diffusion functions of an SDE.
- Maximum‑likelihood fitness: Introduces a likelihood‑based objective that evaluates candidate equations directly on the observed stochastic trajectories.
- Scalable to high dimensions: Demonstrates successful recovery of multi‑dimensional systems (e.g., coupled oscillators) without exponential blow‑up.
- Robust to sparse sampling: Works even when data are sampled irregularly or at low frequency, a common scenario in real‑world sensing.
- Generalization to stochastic PDEs: Extends the approach to spatially extended systems, showing that the same symbolic framework can handle partial differential equations with noise.
Methodology
- Data generation – The authors assume a set of observed trajectories $\{x(t_i)\}$ generated by an SDE of the form
  $$\mathrm{d}x = f(x)\,\mathrm{d}t + g(x)\,\mathrm{d}W_t,$$
  where $f$ (drift) and $g$ (diffusion) are unknown symbolic functions and $W_t$ is a Wiener process.
- Genetic programming (GP) backbone – A population of candidate symbolic trees for $f$ and $g$ is evolved using standard GP operators (crossover, mutation, selection).
- Fitness via maximum likelihood – For each candidate pair $(\hat f,\hat g)$, the log‑likelihood of the observed increments $\Delta x_i$ under the Euler–Maruyama discretization is computed:
  $$\log \mathcal{L} = -\sum_i \left[ \frac{(\Delta x_i - \hat f(x_i)\,\Delta t)^2}{2\hat g(x_i)^2\,\Delta t} + \frac{1}{2}\log\bigl(2\pi \hat g(x_i)^2\,\Delta t\bigr) \right].$$
  This directly rewards models that explain both the mean trend and the variance of the data.
- Parsimony pressure – A complexity penalty (e.g., on tree depth) is added to avoid bloated expressions, encouraging concise, interpretable formulas.
- Validation & selection – The best‑scoring individuals are tested on held‑out trajectories and, when needed, refined with a short local optimization (e.g., gradient‑based coefficient tuning).
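The Euler–Maruyama likelihood at the heart of the fitness function is simple to state in code. The sketch below is ours, not the authors' implementation: it assumes 1‑D trajectories sampled at a uniform spacing `dt`, and the helper names (`em_log_likelihood`, `fitness`, the `penalty` weight) are hypothetical.

```python
import numpy as np

def em_log_likelihood(x, dt, drift, diffusion):
    """Euler-Maruyama log-likelihood of a 1-D trajectory x sampled at
    uniform spacing dt, under candidate drift f and diffusion g (callables).
    Each increment dx_i is modeled as Normal(f(x_i)*dt, g(x_i)^2*dt)."""
    x = np.asarray(x, dtype=float)
    dx = np.diff(x)                 # observed increments
    f = drift(x[:-1])               # drift at the left endpoints
    g2 = diffusion(x[:-1]) ** 2     # squared diffusion at the left endpoints
    return -np.sum((dx - f * dt) ** 2 / (2.0 * g2 * dt)
                   + 0.5 * np.log(2.0 * np.pi * g2 * dt))

def fitness(x, dt, drift, diffusion, complexity, penalty=0.1):
    """Likelihood minus a parsimony penalty on expression complexity
    (e.g., total tree size of the candidate drift/diffusion pair)."""
    return em_log_likelihood(x, dt, drift, diffusion) - penalty * complexity
```

In a GP loop, each individual encodes a `(drift, diffusion)` tree pair, and this score drives selection: a candidate that matches the mean trend but misstates the noise amplitude pays through the variance term of the likelihood.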
Results & Findings
| Experiment | Ground‑truth SDE | Recovery Accuracy | Notable Observation |
|---|---|---|---|
| 1‑D Ornstein‑Uhlenbeck | $dx = -\theta x\,dt + \sigma\,dW_t$ | Exact drift & diffusion recovered (within 1 % error) | Works with as few as 50 data points |
| 2‑D Coupled Lotka‑Volterra | Non‑linear drift, state‑dependent diffusion | Correct functional forms identified, coefficients within 5 % | Scales linearly with dimension |
| Stochastic Burgers’ PDE (1‑D) | $\partial_t u = \nu\,\partial_{xx} u - u\,\partial_x u + \eta(x,t)$ | Symbolic noise term $\eta$ captured, drift terms recovered | Demonstrates extension to spatially distributed noise |
Overall, the method outperforms baseline sparse regression (which only targets drift) and remains stable when the sampling interval $\Delta t$ is increased by up to an order of magnitude.
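The Ornstein–Uhlenbeck row can be sanity‑checked numerically. The snippet below is an illustrative check of ours, not the paper's experiment: it simulates $dx = -\theta x\,dt + \sigma\,dW_t$ and recovers $\theta$ and $\sigma$ from the closed‑form maximizers of the Euler–Maruyama likelihood (the GP search would additionally discover the functional forms, not just the coefficients).

```python
import numpy as np

# Simulate an Ornstein-Uhlenbeck trajectory with Euler-Maruyama.
rng = np.random.default_rng(42)
theta, sigma, dt, n = 1.5, 0.3, 0.005, 200_000
x = np.empty(n); x[0] = 0.5
noise = rng.standard_normal(n - 1)
for i in range(n - 1):
    x[i + 1] = x[i] - theta * x[i] * dt + sigma * np.sqrt(dt) * noise[i]

# Closed-form maximum-likelihood estimates under the EM discretization:
# theta maximizes the Gaussian increment likelihood for drift -theta*x,
# sigma^2 is the residual increment variance per unit time.
dx, xl = np.diff(x), x[:-1]
theta_hat = -np.sum(xl * dx) / (np.sum(xl ** 2) * dt)
sigma_hat = np.sqrt(np.mean((dx + theta_hat * xl * dt) ** 2) / dt)
print(theta_hat, sigma_hat)   # both should land near 1.5 and 0.3
```

With a long enough trajectory both estimates fall within a few percent of the ground truth, consistent with the recovery accuracy reported in the table.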
Practical Implications
- Model‑based simulation – Engineers can automatically obtain a compact SDE that not only predicts the mean behavior but also reproduces realistic variability, enabling Monte‑Carlo style scenario testing without hand‑crafting noise models.
- Control & reinforcement learning – Accurate drift‑diffusion models are essential for stochastic optimal control and model‑based RL; symbolic forms make it easier to embed analytical solutions or constraints.
- System identification in IoT / edge devices – The algorithm tolerates sparse, irregular data, making it suitable for sensor networks where bandwidth or power limits sampling rates.
- Explainable AI – Because the output is a human‑readable equation, domain experts can validate, modify, or extend the discovered dynamics, bridging the gap between black‑box ML and traditional physics‑based modeling.
- Rapid prototyping of scientific hypotheses – Researchers can feed experimental time series (e.g., finance, neuroscience, climate) into the pipeline and instantly obtain candidate governing equations for further testing.
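The model‑based simulation use case above amounts to plugging a discovered drift/diffusion pair back into an Euler–Maruyama integrator and sampling an ensemble. The sketch below assumes a 1‑D model; the function name and arguments are ours, chosen for illustration.

```python
import numpy as np

def simulate_ensemble(drift, diffusion, x0, dt, steps, n_paths, seed=0):
    """Monte-Carlo ensemble of Euler-Maruyama paths from a (discovered)
    drift/diffusion pair. Returns an array of shape (steps+1, n_paths)."""
    rng = np.random.default_rng(seed)
    x = np.full(n_paths, float(x0))
    paths = np.empty((steps + 1, n_paths))
    paths[0] = x
    for t in range(steps):
        dw = rng.standard_normal(n_paths) * np.sqrt(dt)  # Wiener increments
        x = x + drift(x) * dt + diffusion(x) * dw
        paths[t + 1] = x
    return paths

# Scenario testing with a hypothetical recovered model dx = -x dt + 0.5 dW:
paths = simulate_ensemble(lambda x: -x, lambda x: 0.5 + 0 * x,
                          x0=1.0, dt=0.01, steps=500, n_paths=2000)
```

Because the diffusion term was learned jointly with the drift, the ensemble spread reflects the variability seen in the data rather than a hand‑tuned noise level.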
Limitations & Future Work
- Assumption of Itô‑type noise – The current likelihood formulation assumes Gaussian Wiener driving noise; extensions to Lévy flights or other noise processes with non‑Gaussian statistics are not covered.
- Computational cost of GP – While scalable, the evolutionary search can be expensive for very high‑dimensional systems (> 10 variables) or massive datasets; hybrid approaches with gradient‑based fine‑tuning are suggested.
- Model selection bias – The parsimony penalty is heuristic; more principled Bayesian model evidence could improve robustness against over‑fitting.
- Partial observability – The method presumes full state observation; handling hidden variables or latent dynamics remains an open challenge.
Future research directions include integrating deep neural surrogates for the likelihood term, extending to stochastic partial differential equations with complex boundary conditions, and developing online GP variants that update the symbolic model as new data streams in.
Bottom line: By marrying genetic programming with a likelihood‑based fitness that respects both drift and diffusion, this work opens the door to automated, interpretable discovery of noisy dynamical laws—a tool that could become a staple in the developer’s toolbox for modeling, simulation, and AI‑driven control.
Authors
- Sigur de Vries
- Sander W. Keemink
- Marcel A. J. van Gerven
Paper Information
- arXiv ID: 2603.09597v1
- Categories: cs.NE, cs.SC
- Published: March 10, 2026