[Paper] Discovering Ordinary Differential Equations with LLM-Based Qualitative and Quantitative Evaluation

Published: May 8, 2026 at 02:29 AM EDT

Source: arXiv - 2605.07323v1

Overview

The paper introduces DoLQ, a framework that uses large language models (LLMs) to evaluate both the quantitative fit and the qualitative plausibility of candidate ordinary differential equations (ODEs) discovered from data. By combining symbolic regression with LLM‑driven “scientist reasoning,” the authors report more reliable recovery of governing equations, an essential step for building trustworthy scientific‑machine‑learning models.

Key Contributions

  • LLM‑augmented evaluation: Introduces a Scientist Agent that leverages an LLM to perform qualitative checks (e.g., physical plausibility, dimensional consistency) alongside traditional quantitative error metrics.
  • Multi‑agent architecture: Combines three specialized agents (Sampler, Parameter Optimizer, and Scientist) to iteratively propose, refine, and validate ODE candidates.
  • Improved discovery performance: Demonstrates higher success rates and more accurate symbolic recovery on standard multi‑dimensional ODE benchmarks compared with state‑of‑the‑art symbolic regression methods.
  • Open‑source implementation: Provides a ready‑to‑run codebase (GitHub link) that can be plugged into existing scientific‑ML pipelines.

Methodology

  1. Sampler Agent – Randomly generates candidate ODE structures (e.g., dx/dt = a·x + b·y²).
  2. Parameter Optimizer – Uses gradient‑based or evolutionary techniques to fit the numerical coefficients of the sampled structure to the observed time‑series data, minimizing a loss such as mean‑squared error.
  3. Scientist Agent (LLM) – Sends the candidate equation and its fitted parameters to a large language model (e.g., GPT‑4). The LLM returns:
    • Qualitative feedback: checks for dimensional consistency, known physical laws, and intuitive behavior (e.g., “the term should be negative for a damped oscillator”).
    • Quantitative scoring: a confidence score derived from the LLM’s internal reasoning about the fit quality.
  4. Synthesis & Guidance – The system aggregates the LLM’s qualitative insights with the numeric loss, producing a composite score that steers the next sampling round toward more plausible candidates. This loop repeats until convergence or a preset budget is exhausted.
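
The four steps above can be sketched as a toy loop. This is a minimal illustration under stated assumptions, not the authors' implementation: the term library, the function names, the additive penalty used for the composite score, and the synthetic data are all hypothetical. The Scientist is a hard-coded damping rule standing in for an LLM call, the sampler draws uniformly instead of being steered by earlier scores, and derivatives are assumed to be given rather than estimated from a time series.

```python
import random

# Hypothetical library of right-hand-side terms for a 2-D state (x, y).
TERMS = {
    "x": lambda x, y: x,
    "y": lambda x, y: y,
    "x*y": lambda x, y: x * y,
    "y^2": lambda x, y: y * y,
}


def sample_structure(rng, n_terms=2):
    """Sampler stand-in: draw a random set of terms for dx/dt."""
    return tuple(sorted(rng.sample(sorted(TERMS), n_terms)))


def fit_coefficients(structure, states, derivs, steps=300, lr=0.2):
    """Optimizer stand-in: gradient descent on the mean-squared error."""
    feats = [[TERMS[t](x, y) for t in structure] for x, y in states]
    coefs = [0.0] * len(structure)
    n = len(feats)

    def residual(row, target):
        return sum(c * f for c, f in zip(coefs, row)) - target

    for _ in range(steps):
        grads = [0.0] * len(coefs)
        for row, target in zip(feats, derivs):
            err = residual(row, target)
            for j, f in enumerate(row):
                grads[j] += 2.0 * err * f / n
        coefs = [c - lr * g for c, g in zip(coefs, grads)]
    mse = sum(residual(r, t) ** 2 for r, t in zip(feats, derivs)) / n
    return coefs, mse


def scientist_plausibility(structure, coefs):
    """Scientist stand-in (no LLM call): x is assumed to be a damped variable,
    so a non-negative coefficient on the linear x term is rated implausible."""
    for term, coef in zip(structure, coefs):
        if term == "x" and coef >= 0.0:
            return 0.0
    return 1.0


def discover(states, derivs, budget=200, penalty=1.0, seed=0):
    """Sample, fit, review, and keep the candidate with the lowest composite score."""
    rng = random.Random(seed)
    scored = {}  # structure -> (composite score, coefficients)
    best = None
    for _ in range(budget):
        structure = sample_structure(rng)
        if structure not in scored:
            coefs, mse = fit_coefficients(structure, states, derivs)
            plaus = scientist_plausibility(structure, coefs)
            # Assumed composite: numeric loss plus a penalty for implausibility.
            scored[structure] = (mse + penalty * (1.0 - plaus), coefs)
        score, coefs = scored[structure]
        if best is None or score < best[0]:
            best = (score, structure, coefs)
    return best


# Synthetic data from dx/dt = -0.5*x + 2*y^2 on a small grid (illustrative only).
grid = [-1.0, -0.5, 0.0, 0.5, 1.0]
states = [(x, y) for x in grid for y in grid]
derivs = [-0.5 * x + 2.0 * y * y for x, y in states]
best_score, best_structure, best_coefs = discover(states, derivs)
print(best_structure, [round(c, 3) for c in best_coefs], round(best_score, 6))
```

In a real pipeline the rule in scientist_plausibility would be replaced by a prompt to an LLM, and the sampling distribution would be updated from the composite scores rather than left uniform.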

The approach is deliberately modular: any off‑the‑shelf symbolic regression engine can replace the Sampler, and any LLM with a suitable prompting interface can act as the Scientist.
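
One way to picture this modularity is as a pair of small interfaces that any sampler or reviewer can satisfy. The sketch below is an assumption about how such a boundary could look, not the paper's actual API: the Sampler and Scientist protocols, their method names, and both stand-in classes are hypothetical.

```python
from typing import List, Protocol, Sequence, Tuple


class Sampler(Protocol):
    def propose(self) -> str:
        """Return one candidate equation as text."""
        ...


class Scientist(Protocol):
    def plausibility(self, equation: str) -> float:
        """Return a plausibility rating in [0, 1]."""
        ...


class FixedListSampler:
    """Stand-in for a symbolic regression engine: cycles through a fixed list."""

    def __init__(self, equations: Sequence[str]) -> None:
        self._equations = list(equations)
        self._next = 0

    def propose(self) -> str:
        equation = self._equations[self._next % len(self._equations)]
        self._next += 1
        return equation


class KeywordScientist:
    """Offline stand-in for an LLM reviewer: rejects one forbidden substring."""

    def __init__(self, forbidden: str) -> None:
        self._forbidden = forbidden

    def plausibility(self, equation: str) -> float:
        return 0.0 if self._forbidden in equation else 1.0


def run_rounds(sampler: Sampler, scientist: Scientist, rounds: int) -> List[Tuple[str, float]]:
    """Drive any Sampler/Scientist pair without knowing their internals."""
    results = []
    for _ in range(rounds):
        equation = sampler.propose()
        results.append((equation, scientist.plausibility(equation)))
    return results


reviews = run_rounds(
    FixedListSampler(["dx/dt = -a*x", "dx/dt = +a*x"]),
    KeywordScientist("+a*x"),
    rounds=3,
)
```

Because run_rounds depends only on the two protocols, swapping in a different search engine or a different LLM backend would not require changes to the loop itself.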

Results & Findings

  • Benchmark performance: On a suite of 10 multi‑dimensional ODE problems (including Lotka‑Volterra, Lorenz, and damped harmonic oscillators), DoLQ achieved a 92 % success rate in recovering the exact symbolic form, versus 68 % for the best prior method.
  • Error reduction: The average normalized mean‑squared error dropped from 0.13 (baseline) to 0.04 with DoLQ, indicating tighter quantitative fits.
  • Qualitative gains: In cases where the baseline recovered a term that fit the data but was physically implausible (e.g., a positive feedback loop where damping is expected), DoLQ’s LLM feedback eliminated those candidates early, saving computational budget.
  • Ablation study: Removing the LLM‑based qualitative check reduced success rates by ~15 %, confirming that the “scientist” reasoning contributes meaningfully beyond raw loss minimization.
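
For readers unfamiliar with the error metric reported above, one common definition of normalized mean‑squared error divides the squared residuals by the variance of the observations, so 0 means a perfect fit and 1 matches a constant prediction at the mean. The summary does not state which normalization the paper uses, so the function below is an assumed, illustrative definition.

```python
from typing import Sequence


def normalized_mse(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """Squared error divided by the spread of the observations (assumed definition)."""
    if not observed or len(observed) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    mean = sum(observed) / len(observed)
    squared_error = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    spread = sum((o - mean) ** 2 for o in observed)
    if spread == 0.0:
        raise ValueError("observations are constant; normalization is undefined")
    return squared_error / spread


observed = [1.0, 2.0, 3.0, 4.0]
perfect = normalized_mse(observed, observed)
mean_only = normalized_mse(observed, [2.5, 2.5, 2.5, 2.5])
```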

Practical Implications

  • Faster model validation: Engineers can trust discovered ODEs sooner because the LLM flags physically impossible terms before expensive simulations are run.
  • Reduced manual tuning: Traditional symbolic regression often requires hand‑crafted constraints; DoLQ automates this via natural‑language prompts, lowering the barrier for domain experts who are not ML specialists.
  • Plug‑and‑play for digital twins: Companies building digital twins of physical systems (e.g., HVAC, robotics, power grids) can integrate DoLQ to automatically infer governing dynamics from sensor streams, accelerating the twin‑creation lifecycle.
  • Improved safety‑critical modeling: In fields like aerospace or biomedical engineering, ensuring that learned equations respect conservation laws is crucial; DoLQ’s qualitative layer provides an extra safety net.

Limitations & Future Work

  • LLM reliability: The approach inherits the stochastic nature of LLM outputs; occasional misinterpretations of physics can misguide the search, which may call for prompt engineering or an ensemble of LLMs.
  • Scalability to PDEs: The current design focuses on ODEs; extending the framework to partial differential equations (spatiotemporal systems) will need richer sampling strategies and more sophisticated LLM reasoning.
  • Computational overhead: Querying an LLM at each iteration adds latency, especially when using commercial APIs; future work could explore distilled, locally‑run models or caching mechanisms.
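
The caching idea mentioned above could look roughly like the following sketch, which skips the LLM call whenever an equation has already been reviewed. Everything here is hypothetical: the CachedReviewer class, the query_llm callable, and the whitespace‑only canonicalization are assumptions for illustration, and the fake reviewer stands in for a real API call.

```python
import hashlib
from typing import Callable, Dict


class CachedReviewer:
    """Wraps a text-in, text-out reviewer and memoizes its answers."""

    def __init__(self, query_llm: Callable[[str], str]) -> None:
        self._query_llm = query_llm
        self._cache: Dict[str, str] = {}
        self.calls = 0  # number of requests actually forwarded

    @staticmethod
    def _key(equation: str) -> str:
        # Canonicalize by removing whitespace only; algebraically equivalent
        # forms such as "a*x" and "x*a" still get separate cache entries.
        canonical = "".join(equation.split())
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def review(self, equation: str) -> str:
        key = self._key(equation)
        if key not in self._cache:
            self.calls += 1
            self._cache[key] = self._query_llm(equation)
        return self._cache[key]


def fake_llm(equation: str) -> str:
    """Offline stand-in for a remote model call."""
    return "reviewed: " + equation


reviewer = CachedReviewer(fake_llm)
first = reviewer.review("dx/dt = a*x")
second = reviewer.review("dx/dt=a*x")  # same equation, different spacing
```

A stronger canonical form (for example, parsing and simplifying the expression before hashing) would raise the hit rate, at the cost of an extra dependency.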

DoLQ opens a promising avenue where symbolic regression and large‑language‑model reasoning co‑evolve, bringing us closer to fully automated discovery of physically sound dynamical models.

Authors

  • Sum Kyun Song
  • Bong Gyun Shin
  • Jae Yong Lee

Paper Information

  • arXiv ID: 2605.07323v1
  • Categories: cs.AI, cs.LG, cs.NE, cs.SC
  • Published: May 8, 2026
  • PDF: Download PDF