[Paper] Bayesian Symbolic Regression via Posterior Sampling
Source: arXiv - 2512.10849v1
Overview
A new paper proposes a Bayesian take on symbolic regression that uses Sequential Monte Carlo (SMC) to sample from the posterior distribution over mathematical expressions. By treating the search for an equation as a probabilistic inference problem, the authors dramatically improve robustness to noisy data and give developers a principled way to quantify uncertainty in the discovered models.
Key Contributions
- SMC‑based posterior sampler for symbolic expressions, replacing the usual deterministic or evolutionary heuristics.
- Adaptive tempering schedule that gradually sharpens the posterior, allowing the algorithm to escape poor local optima early on.
- Normalized marginal likelihood as a fitness metric, which naturally balances model fit against expression complexity (parsimony); the standard identities behind this are sketched after this list.
- Empirical validation on noisy benchmark problems showing less over-fitting and higher predictive accuracy than standard genetic programming (GP) baselines.
- Uncertainty quantification for the discovered equations, enabling downstream risk‑aware decision making.
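For context on the tempering and marginal-likelihood contributions above: tempered SMC interpolates between prior and posterior and yields an evidence estimate as a by-product. The identities below are the standard construction, not necessarily the paper's exact formulation (its normalization and adaptive schedule may differ):

```latex
% Tempered path from prior (beta_0 = 0) to posterior (beta_T = 1)
\pi_{\beta_t}(f) \;\propto\; p(f)\, p(\mathcal{D} \mid f)^{\beta_t},
\qquad 0 = \beta_0 < \beta_1 < \cdots < \beta_T = 1

% Evidence (marginal likelihood) estimated from the incremental weights
\widehat{p(\mathcal{D})} \;=\; \prod_{t=1}^{T} \frac{1}{N} \sum_{i=1}^{N}
    p\!\left(\mathcal{D} \mid f_{t-1}^{(i)}\right)^{\beta_t - \beta_{t-1}}
```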
Methodology
- Probabilistic model – The authors define a prior over syntax trees (the symbolic expressions) that favors shorter, simpler trees. The likelihood measures how well a candidate expression explains the observed data, accounting for measurement noise.
- Sequential Monte Carlo – A population of particles (candidate trees) is propagated through a sequence of intermediate distributions; a minimal implementation is sketched after this list. At each step:
  - Resampling selects particles in proportion to their current weights (probabilistic selection).
  - Mutation/crossover operators (similar to GP) propose new trees.
  - Adaptive tempering adjusts a temperature parameter β, moving gradually from the prior (β ≈ 0) toward the true posterior (β ≈ 1).
- Marginal likelihood estimation – The algorithm computes a normalized evidence term for each particle, which acts as a Bayesian “score” that penalizes overly complex expressions.
- Posterior summarization – After the final tempering step, the particle set approximates the posterior. The most probable expression (MAP) or a weighted ensemble can be extracted, and credible intervals on predictions are obtained directly from the particle ensemble.
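The end-to-end loop is easiest to see in code. The sketch below is a minimal, self-contained illustration of the scheme just described, not the authors' implementation: the toy expression grammar, the exponential size prior, the fixed linear tempering grid, and all names (`random_tree`, `mutate`, `mh_step`, etc.) are assumptions made here for clarity. A production version would use an adaptive schedule and account for the asymmetric proposal.

```python
"""Minimal SMC-with-tempering loop over a toy expression space (illustrative)."""
import math
import random

import numpy as np

rng = np.random.default_rng(0)

# Toy grammar: unary/binary ops over the variable x and float constants.
UNARY = {"sin": np.sin, "cos": np.cos}
BINARY = {"add": np.add, "mul": np.multiply}

def random_tree(depth=2):
    """Sample a random expression tree; shallow trees are likelier (prior)."""
    if depth == 0 or random.random() < 0.4:
        return "x" if random.random() < 0.7 else round(random.uniform(-2, 2), 2)
    if random.random() < 0.5:
        return (random.choice(list(UNARY)), random_tree(depth - 1))
    return (random.choice(list(BINARY)), random_tree(depth - 1), random_tree(depth - 1))

def evaluate(tree, x):
    if isinstance(tree, str):            # the input variable
        return x
    if isinstance(tree, float):          # a constant leaf
        return np.full_like(x, tree)
    op, *args = tree
    fn = UNARY[op] if op in UNARY else BINARY[op]
    return fn(*(evaluate(a, x) for a in args))

def size(tree):
    return 1 if not isinstance(tree, tuple) else 1 + sum(size(a) for a in tree[1:])

def log_prior(tree, lam=0.5):
    """Exponential penalty on tree size: shorter expressions are a priori likelier."""
    return -lam * size(tree)

def log_lik(tree, x, y, sigma=0.1):
    """Gaussian measurement-noise likelihood."""
    resid = y - evaluate(tree, x)
    return -0.5 * np.sum((resid / sigma) ** 2) - len(y) * math.log(sigma)

def mutate(tree):
    """GP-style proposal: regrow a randomly chosen subtree."""
    if not isinstance(tree, tuple) or random.random() < 0.3:
        return random_tree()
    op, *args = tree
    i = random.randrange(len(args))
    args[i] = mutate(args[i])
    return (op, *args)

def mh_step(tree, beta, x, y):
    """Metropolis rejuvenation move targeting the tempered posterior.
    Treats the proposal as symmetric for simplicity; a full treatment
    would include the proposal density in the acceptance ratio."""
    prop = mutate(tree)
    log_a = (log_prior(prop) + beta * log_lik(prop, x, y)
             - log_prior(tree) - beta * log_lik(tree, x, y))
    return prop if math.log(random.random() + 1e-300) < log_a else tree

# Noisy observations of a hidden ground-truth law: y = sin(x) + 1.5 x.
x = np.linspace(-3, 3, 100)
y = np.sin(x) + 1.5 * x + rng.normal(0.0, 0.1, x.shape)

N = 500
betas = np.linspace(0.0, 1.0, 21)        # fixed grid; the paper adapts this
particles = [random_tree() for _ in range(N)]
log_evidence = 0.0

for b_prev, b in zip(betas[:-1], betas[1:]):
    # Incremental weights: the likelihood raised to the tempering increment.
    log_w = np.array([(b - b_prev) * log_lik(p, x, y) for p in particles])
    log_evidence += log_w.max() + np.log(np.mean(np.exp(log_w - log_w.max())))
    w = np.exp(log_w - log_w.max())
    w /= w.sum()
    idx = rng.choice(N, size=N, p=w)     # resample in proportion to weight
    particles = [mh_step(particles[i], b, x, y) for i in idx]

best = max(particles, key=lambda p: log_prior(p) + log_lik(p, x, y))
print("MAP-like tree:", best)
print("log-evidence estimate:", round(float(log_evidence), 1))
```

The max-subtraction in the weight updates keeps everything stable in log space; the accumulated `log_evidence` is the running product of average incremental weights from the identities sketched earlier.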
Results & Findings
| Dataset (noisy) | GP-based SR (baseline) | Bayesian SMC SR (this work) |
|---|---|---|
| Synthetic ODE | RMSE 0.42; avg. tree size 12 nodes | RMSE 0.21; avg. tree size 7 nodes |
| Real-world physics (pendulum) | Over-fits; high-variance predictions | Lower variance; 15% better out-of-sample R² |
| Engineering design (aerodynamics) | 3× error increase at 10% noise | Robust to noise; error growth < 1.2× |
- Generalization: The Bayesian approach consistently yields simpler expressions that generalize better to unseen data.
- Noise resilience: Even with 20 % Gaussian noise, the posterior concentrates around the true governing equation, whereas GP often collapses to spurious high‑degree polynomials.
- Uncertainty estimates: Credible intervals derived from the particle set capture the true output at more than 95% of test points, something GP lacks out of the box; a minimal computation is sketched below.
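To make the uncertainty claim concrete: pointwise credible intervals can be read off any weighted particle set with a weighted quantile. The helper below is a generic illustration that reuses the hypothetical `particles` and `evaluate` names from the Methodology sketch; immediately after resampling, the weights are simply uniform (`np.full(N, 1.0 / N)`).

```python
import numpy as np

def weighted_quantiles(values, weights, qs):
    """Quantiles of a weighted 1-D sample via the weighted empirical CDF."""
    order = np.argsort(values)
    cdf = np.cumsum(weights[order])
    return np.interp(qs, cdf / cdf[-1], values[order])

def credible_band(particles, weights, x_new, lo=0.025, hi=0.975):
    """Pointwise 95% credible interval for predictions at each point of x_new."""
    preds = np.array([evaluate(p, x_new) for p in particles])   # shape (N, M)
    return np.array([weighted_quantiles(preds[:, j], weights, [lo, hi])
                     for j in range(preds.shape[1])])           # shape (M, 2)
```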
Practical Implications
- Model discovery pipelines – Engineers can replace noise-prone GP modules with the SMC sampler to obtain more reliable equations for control, simulation, or optimization tasks.
- Risk‑aware AI – The posterior provides natural confidence bounds, enabling safety‑critical systems (e.g., autonomous vehicles, medical devices) to assess the trustworthiness of a discovered model before deployment.
- Automated scientific discovery – Researchers can explore large experimental datasets (e.g., materials science, climate modeling) without manually tuning GP hyper‑parameters; the Bayesian framework handles model complexity automatically.
- Integration with existing tools – The algorithm can be wrapped as a drop-in replacement for popular GP/SR libraries (e.g., DEAP, gplearn) because it still uses the tree-based mutation/crossover operators familiar to developers; a sketch of that wiring follows this list.
- Scalability – While SMC adds a modest computational overhead, the particle population can be parallelized across CPUs/GPUs, making it feasible for medium‑scale problems (thousands of data points, dozens of variables).
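As an illustration of the drop-in claim above, DEAP's GP primitives can supply both the tree representation and the proposal move for an SMC kernel. The DEAP calls below (`PrimitiveSet`, `genHalfAndHalf`, `mutUniform`, `compile`) are real library API; the surrounding wiring (`new_particle`, `propose`, `log_lik`) is a hypothetical sketch, not an integration the paper ships.

```python
import math
import operator
import random

import numpy as np
from deap import gp

# Primitive set: the same kind of expression tree DEAP's GP algorithms evolve.
pset = gp.PrimitiveSet("MAIN", 1)
pset.addPrimitive(operator.add, 2)
pset.addPrimitive(operator.mul, 2)
pset.addPrimitive(math.sin, 1)
pset.addEphemeralConstant("c", lambda: round(random.uniform(-2, 2), 2))
pset.renameArguments(ARG0="x")

def new_particle():
    """Draw an initial tree from DEAP's ramped half-and-half generator."""
    return gp.PrimitiveTree(gp.genHalfAndHalf(pset, min_=1, max_=3))

def propose(tree):
    """GP-style mutation as the SMC move: regrow a uniformly chosen subtree."""
    (mutant,) = gp.mutUniform(
        gp.PrimitiveTree(tree),  # copy first; mutUniform mutates in place
        expr=lambda pset, type_: gp.genFull(pset, min_=0, max_=2, type_=type_),
        pset=pset,
    )
    return mutant

def log_lik(tree, x, y, sigma=0.1):
    """Gaussian likelihood of the data under the compiled expression."""
    f = gp.compile(tree, pset)               # tree -> callable f(x)
    resid = y - np.array([f(xi) for xi in x])
    return -0.5 * np.sum((resid / sigma) ** 2)
```

With `new_particle` and `propose` standing in for the toy generators in the Methodology sketch, the same resample-move loop applies unchanged.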
Limitations & Future Work
- Computational cost – Sampling a sufficiently large particle set for high‑dimensional expression spaces can be expensive compared to fast GP heuristics.
- Prior design – The current prior over tree structures is hand‑crafted; learning more expressive priors from domain knowledge could further improve performance.
- Scalability to very large datasets – The authors note that mini‑batch likelihood approximations are needed for datasets beyond a few hundred thousand points.
- Extension to richer function libraries – Future work could incorporate custom operators (e.g., integrals, differential operators) and handle symbolic constraints more directly.
Overall, the paper demonstrates that bringing Bayesian inference into symbolic regression yields tangible benefits in robustness, interpretability, and uncertainty quantification—qualities that are increasingly demanded in modern data‑driven engineering and scientific workflows.
Authors
- Geoffrey F. Bomarito
- Patrick E. Leser
Paper Information
- arXiv ID: 2512.10849v1
- Categories: cs.LG
- Published: December 11, 2025