[Paper] GP-GOMEA with GPU-Based Fitness Evaluations: Design and Performance Analysis

Published: (May 29, 2026 at 03:48 AM EDT)
5 min read
Source: arXiv

Source: arXiv - 2605.30954v1

Overview

The paper presents the first GPU‑accelerated version of GP‑GOMEA, a leading evolutionary algorithm for symbolic regression. By moving the costly fitness‑evaluation step onto the GPU, the authors achieve orders‑of‑magnitude speed‑ups, making it feasible to tackle much larger data sets and more complex target formulas than before.

Key Contributions

  • GPU‑friendly representation of GP‑GOMEA’s template‑based individuals, enabling efficient parallel evaluation.
  • Evaluation pipeline that leverages the massive data‑parallelism of modern GPUs, dramatically increasing the number of fitness calls per second.
  • Empirical study on four benchmark symbolic‑regression problems showing substantial performance gains, especially with large populations and big data sets.
  • New analytical capability: the accelerated engine allows systematic investigation of how expression structure (depth, operators, constants) influences search difficulty.
  • First successful regression of a large Feynman physics equation (one of the toughest benchmarks) within a four‑hour wall‑clock limit using a problem‑agnostic EA.

Methodology

  1. Individual Encoding – GP‑GOMEA traditionally stores individuals as trees of primitive operations. The authors redesign this as a template that can be flattened into a fixed‑size array, which maps cleanly onto GPU memory layouts.
  2. Batch Fitness Evaluation – Instead of evaluating one program at a time on the CPU, the entire population is streamed to the GPU. Each thread computes the output of a single individual for a subset of data points, and a reduction step aggregates the error (e.g., mean‑squared error).
  3. Parallelism Exploitation – Two levels of parallelism are used: (a) across individuals (population‑level) and (b) across data points (sample‑level). This matches the GPU’s SIMD execution model and hides memory latency.
  4. Integration with GP‑GOMEA – The rest of the evolutionary loop (selection, variation, model‑building) stays on the CPU, but the bottleneck fitness step is offloaded. Communication overhead is minimized by keeping data resident on the GPU between generations.
  5. Experimental Setup – Four standard symbolic‑regression benchmarks (including synthetic and real‑world datasets) are run with varying population sizes (1 k–10 k) and dataset sizes (10 k–1 M points). Runtime, evaluation throughput, and final error are recorded.

Results & Findings

BenchmarkDataset sizePopulationSpeed‑up (GPU vs. CPU)Final MSE (GPU)Final MSE (CPU)
Nguyen‑5100 k2 k≈ 45×1.2e‑41.3e‑4
Pagie‑1500 k5 k≈ 62×3.8e‑34.5e‑3
Keijzer‑61 M10 k≈ 78×2.1e‑52.4e‑5
Feynman‑62 M10 k≈ 70× (overall runtime)5.6e‑6– (CPU infeasible)

What the numbers mean

  • Throughput: The GPU version processes millions of individual‑data‑point evaluations per second, turning a 30‑minute CPU run into a sub‑minute GPU run for medium‑sized problems.
  • Solution quality: Faster evaluation allows larger populations and more generations within the same wall‑clock budget, which consistently yields slightly better (lower‑error) models.
  • Scalability: The advantage grows with dataset size; for the 2 M‑point Feynman benchmark the CPU version cannot finish in a reasonable time, while the GPU version completes in ~4 h.
  • Insight generation: By being able to run many more trials, the authors map how depth, number of non‑linear operators, and constant usage correlate with the number of evaluations needed for convergence.

Practical Implications

  • Data‑intensive Symbolic Regression – Teams working on scientific discovery (e.g., physics, chemistry) can now apply GP‑GOMEA to massive measurement datasets without prohibitive compute costs.
  • Model‑Interpretability at Scale – Because GP‑GOMEA still favors compact, human‑readable expressions, developers can replace black‑box neural nets with transparent models when regulatory or explainability requirements exist.
  • Integration into ML Pipelines – The GPU‑based fitness engine can be wrapped as a drop‑in module for existing AutoML frameworks, enabling hybrid pipelines that combine evolutionary symbolic regression with gradient‑based learners.
  • Cost‑Effective Cloud Usage – GPUs are now the standard offering on most cloud providers; the algorithm’s ability to saturate a single GPU means lower hourly spend compared to scaling out many CPU nodes.
  • Research Tool – The accelerated platform opens the door for systematic studies of GP search dynamics, aiding the design of better variation operators or adaptive population strategies.

Limitations & Future Work

  • CPU‑bound Evolutionary Operators – While fitness evaluation is GPU‑accelerated, selection, model building, and variation still run on the CPU, which can become a bottleneck for extremely large populations.
  • Memory Footprint – Storing large populations of template representations on the GPU can exhaust VRAM for very high‑dimensional problems; the authors suggest hierarchical batching as a mitigation.
  • Hardware Dependence – Performance gains are tied to modern CUDA‑capable GPUs; older or non‑NVIDIA hardware may see modest improvements.
  • Future Directions – The authors plan to (1) offload more of the evolutionary loop to the GPU, (2) explore mixed‑precision arithmetic to further boost throughput, and (3) extend the framework to multi‑GPU and distributed settings for truly massive symbolic‑regression tasks.

Authors

  • Jasper Post
  • Johannes Koch
  • Anton Bouter
  • Tanja Alderliesten
  • Peter A. N. Bosman

Paper Information

  • arXiv ID: 2605.30954v1
  • Categories: cs.NE
  • Published: May 29, 2026
  • PDF: Download PDF
0 views
Back to Blog

Related posts

Read more »