[Paper] Bayesian Symbolic Regression via Posterior Sampling
Source: arXiv - 2512.10849v1
Overview
A new paper proposes a Bayesian take on symbolic regression that uses Sequential Monte Carlo (SMC) to sample from the posterior distribution over mathematical expressions. By treating the search for an equation as a probabilistic inference problem, the authors dramatically improve robustness to noisy data and give developers a principled way to quantify uncertainty in the discovered models.
Key Contributions
- SMC‑based posterior sampler for symbolic expressions, replacing the usual deterministic or evolutionary heuristics.
- Adaptive tempering schedule that gradually sharpens the posterior, allowing the algorithm to escape poor local optima early on.
- Normalized marginal likelihood as a fitness metric, which naturally balances model fit against expression complexity (parsimony); the standard identities behind this are sketched after this list.
- Empirical validation on noisy benchmark problems showing less over-fitting and higher predictive accuracy than standard genetic programming (GP) baselines.
- Uncertainty quantification for the discovered equations, enabling downstream risk‑aware decision making.
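For context on the tempering and marginal-likelihood contributions above: tempered SMC interpolates between prior and posterior and yields an evidence estimate as a by-product. The identities below are the standard construction, not necessarily the paper's exact formulation (its normalization and adaptive schedule may differ):

```latex
% Tempered path from prior (beta_0 = 0) to posterior (beta_T = 1)
\pi_{\beta_t}(f) \;\propto\; p(f)\, p(\mathcal{D} \mid f)^{\beta_t},
\qquad 0 = \beta_0 < \beta_1 < \cdots < \beta_T = 1

% Evidence (marginal likelihood) estimated from the incremental weights
\widehat{p(\mathcal{D})} \;=\; \prod_{t=1}^{T} \frac{1}{N} \sum_{i=1}^{N}
    p\!\left(\mathcal{D} \mid f_{t-1}^{(i)}\right)^{\beta_t - \beta_{t-1}}
```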
Methodology
- Probabilistic model – The authors define a prior over syntax trees (the symbolic expressions) that favors shorter, simpler trees. The likelihood measures how well a candidate expression explains the observed data, accounting for measurement noise.
- Sequential Monte Carlo – A population of particles (candidate trees) is propagated through a sequence of intermediate distributions; a minimal implementation is sketched after this list. At each step:
  - Resampling selects particles in proportion to their current weights (probabilistic selection).
  - Mutation/crossover operators (similar to GP) propose new trees.
  - Adaptive tempering adjusts a temperature parameter β, moving gradually from the prior (β ≈ 0) toward the true posterior (β ≈ 1).
- Marginal likelihood estimation – The algorithm computes a normalized evidence term for each particle, which acts as a Bayesian “score” that penalizes overly complex expressions.
- Posterior summarization – After the final tempering step, the particle set approximates the posterior. The most probable expression (MAP) or a weighted ensemble can be extracted, and credible intervals on predictions are obtained directly from the particle ensemble.
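The end-to-end loop is easiest to see in code. The sketch below is a minimal, self-contained illustration of the scheme just described, not the authors' implementation: the toy expression grammar, the exponential size prior, the fixed linear tempering grid, and all names (`random_tree`, `mutate`, `mh_step`, etc.) are assumptions made here for clarity. A production version would use an adaptive schedule and account for the asymmetric proposal.

```python
"""Minimal SMC-with-tempering loop over a toy expression space (illustrative)."""
import math
import random

import numpy as np

rng = np.random.default_rng(0)

# Toy grammar: unary/binary ops over the variable x and float constants.
UNARY = {"sin": np.sin, "cos": np.cos}
BINARY = {"add": np.add, "mul": np.multiply}

def random_tree(depth=2):
    """Sample a random expression tree; shallow trees are likelier (prior)."""
    if depth == 0 or random.random() < 0.4:
        return "x" if random.random() < 0.7 else round(random.uniform(-2, 2), 2)
    if random.random() < 0.5:
        return (random.choice(list(UNARY)), random_tree(depth - 1))
    return (random.choice(list(BINARY)), random_tree(depth - 1), random_tree(depth - 1))

def evaluate(tree, x):
    if isinstance(tree, str):            # the input variable
        return x
    if isinstance(tree, float):          # a constant leaf
        return np.full_like(x, tree)
    op, *args = tree
    fn = UNARY[op] if op in UNARY else BINARY[op]
    return fn(*(evaluate(a, x) for a in args))

def size(tree):
    return 1 if not isinstance(tree, tuple) else 1 + sum(size(a) for a in tree[1:])

def log_prior(tree, lam=0.5):
    """Exponential penalty on tree size: shorter expressions are a priori likelier."""
    return -lam * size(tree)

def log_lik(tree, x, y, sigma=0.1):
    """Gaussian measurement-noise likelihood."""
    resid = y - evaluate(tree, x)
    return -0.5 * np.sum((resid / sigma) ** 2) - len(y) * math.log(sigma)

def mutate(tree):
    """GP-style proposal: regrow a randomly chosen subtree."""
    if not isinstance(tree, tuple) or random.random() < 0.3:
        return random_tree()
    op, *args = tree
    i = random.randrange(len(args))
    args[i] = mutate(args[i])
    return (op, *args)

def mh_step(tree, beta, x, y):
    """Metropolis rejuvenation move targeting the tempered posterior.
    Treats the proposal as symmetric for simplicity; a full treatment
    would include the proposal density in the acceptance ratio."""
    prop = mutate(tree)
    log_a = (log_prior(prop) + beta * log_lik(prop, x, y)
             - log_prior(tree) - beta * log_lik(tree, x, y))
    return prop if math.log(random.random() + 1e-300) < log_a else tree

# Noisy observations of a hidden ground-truth law: y = sin(x) + 1.5 x.
x = np.linspace(-3, 3, 100)
y = np.sin(x) + 1.5 * x + rng.normal(0.0, 0.1, x.shape)

N = 500
betas = np.linspace(0.0, 1.0, 21)        # fixed grid; the paper adapts this
particles = [random_tree() for _ in range(N)]
log_evidence = 0.0

for b_prev, b in zip(betas[:-1], betas[1:]):
    # Incremental weights: the likelihood raised to the tempering increment.
    log_w = np.array([(b - b_prev) * log_lik(p, x, y) for p in particles])
    log_evidence += log_w.max() + np.log(np.mean(np.exp(log_w - log_w.max())))
    w = np.exp(log_w - log_w.max())
    w /= w.sum()
    idx = rng.choice(N, size=N, p=w)     # resample in proportion to weight
    particles = [mh_step(particles[i], b, x, y) for i in idx]

best = max(particles, key=lambda p: log_prior(p) + log_lik(p, x, y))
print("MAP-like tree:", best)
print("log-evidence estimate:", round(float(log_evidence), 1))
```

The max-subtraction in the weight updates keeps everything stable in log space; the accumulated `log_evidence` is the running product of average incremental weights from the identities sketched earlier.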
Results & Findings
| Dataset (noisy) | GP-based SR (baseline) | Bayesian SMC SR (this work) |
|---|---|---|
| Synthetic ODE | RMSE 0.42; avg. tree size 12 nodes | RMSE 0.21; avg. tree size 7 nodes |
| Real-world physics (pendulum) | Over-fits; high-variance predictions | Lower variance; 15% better out-of-sample R² |
| Engineering design (aerodynamics) | 3× error increase at 10% noise | Robust to noise; error growth < 1.2× |
- Generalization: The Bayesian approach consistently yields simpler expressions that generalize better to unseen data.
- Noise resilience: Even with 20 % Gaussian noise, the posterior concentrates around the true governing equation, whereas GP often collapses to spurious high‑degree polynomials.
- Uncertainty estimates: Credible intervals derived from the particle set capture the true output at more than 95% of test points, something GP lacks out of the box; a minimal computation is sketched below.
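To make the uncertainty claim concrete: pointwise credible intervals can be read off any weighted particle set with a weighted quantile. The helper below is a generic illustration that reuses the hypothetical `particles` and `evaluate` names from the Methodology sketch; immediately after resampling, the weights are simply uniform (`np.full(N, 1.0 / N)`).

```python
import numpy as np

def weighted_quantiles(values, weights, qs):
    """Quantiles of a weighted 1-D sample via the weighted empirical CDF."""
    order = np.argsort(values)
    cdf = np.cumsum(weights[order])
    return np.interp(qs, cdf / cdf[-1], values[order])

def credible_band(particles, weights, x_new, lo=0.025, hi=0.975):
    """Pointwise 95% credible interval for predictions at each point of x_new."""
    preds = np.array([evaluate(p, x_new) for p in particles])   # shape (N, M)
    return np.array([weighted_quantiles(preds[:, j], weights, [lo, hi])
                     for j in range(preds.shape[1])])           # shape (M, 2)
```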
Practical Implications
- Model discovery pipelines – Engineers can replace noise-prone GP modules with the SMC sampler to obtain more reliable equations for control, simulation, or optimization tasks.
- Risk‑aware AI – The posterior provides natural confidence bounds, enabling safety‑critical systems (e.g., autonomous vehicles, medical devices) to assess the trustworthiness of a discovered model before deployment.
- Automated scientific discovery – Researchers can explore large experimental datasets (e.g., materials science, climate modeling) without manually tuning GP hyper‑parameters; the Bayesian framework handles model complexity automatically.
- Integration with existing tools – The algorithm can be wrapped as a drop-in replacement for popular GP/SR libraries (e.g., DEAP, gplearn) because it still uses the tree-based mutation/crossover operators familiar to developers; a sketch of that wiring follows this list.
- Scalability – While SMC adds a modest computational overhead, the particle population can be parallelized across CPUs/GPUs, making it feasible for medium‑scale problems (thousands of data points, dozens of variables).
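As an illustration of the drop-in claim above, DEAP's GP primitives can supply both the tree representation and the proposal move for an SMC kernel. The DEAP calls below (`PrimitiveSet`, `genHalfAndHalf`, `mutUniform`, `compile`) are real library API; the surrounding wiring (`new_particle`, `propose`, `log_lik`) is a hypothetical sketch, not an integration the paper ships.

```python
import math
import operator
import random

import numpy as np
from deap import gp

# Primitive set: the same kind of expression tree DEAP's GP algorithms evolve.
pset = gp.PrimitiveSet("MAIN", 1)
pset.addPrimitive(operator.add, 2)
pset.addPrimitive(operator.mul, 2)
pset.addPrimitive(math.sin, 1)
pset.addEphemeralConstant("c", lambda: round(random.uniform(-2, 2), 2))
pset.renameArguments(ARG0="x")

def new_particle():
    """Draw an initial tree from DEAP's ramped half-and-half generator."""
    return gp.PrimitiveTree(gp.genHalfAndHalf(pset, min_=1, max_=3))

def propose(tree):
    """GP-style mutation as the SMC move: regrow a uniformly chosen subtree."""
    (mutant,) = gp.mutUniform(
        gp.PrimitiveTree(tree),  # copy first; mutUniform mutates in place
        expr=lambda pset, type_: gp.genFull(pset, min_=0, max_=2, type_=type_),
        pset=pset,
    )
    return mutant

def log_lik(tree, x, y, sigma=0.1):
    """Gaussian likelihood of the data under the compiled expression."""
    f = gp.compile(tree, pset)               # tree -> callable f(x)
    resid = y - np.array([f(xi) for xi in x])
    return -0.5 * np.sum((resid / sigma) ** 2)
```

With `new_particle` and `propose` standing in for the toy generators in the Methodology sketch, the same resample-move loop applies unchanged.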
Limitations & Future Work
- Computational cost – Sampling a sufficiently large particle set for high‑dimensional expression spaces can be expensive compared to fast GP heuristics.
- Prior design – The current prior over tree structures is hand‑crafted; learning more expressive priors from domain knowledge could further improve performance.
- Scalability to very large datasets – The authors note that mini‑batch likelihood approximations are needed for datasets beyond a few hundred thousand points.
- Extension to richer function libraries – Future work could incorporate custom operators (e.g., integrals, differential operators) and handle symbolic constraints more directly.
Overall, the paper demonstrates that bringing Bayesian inference into symbolic regression yields tangible benefits in robustness, interpretability, and uncertainty quantification—qualities that are increasingly demanded in modern data‑driven engineering and scientific workflows.
Authors
- Geoffrey F. Bomarito
- Patrick E. Leser
Paper Information
- arXiv ID: 2512.10849v1
- Categories: cs.LG
- Published: December 11, 2025