[Paper] SymPyBench: A Dynamic Benchmark for Scientific Reasoning with Executable Python Code
Source: arXiv - 2512.05954v1
Overview
A new benchmark called SymPyBench puts more than 15,000 university‑level physics problems into the hands of AI researchers. Each problem is fully parameterized and comes with step‑by‑step reasoning and executable Python (SymPy) code that can generate the exact answer for any choice of parameters. By turning static textbook questions into dynamic, code‑driven tasks, the authors give developers a fresh way to test and improve scientific reasoning in large language models (LLMs).
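To make the idea concrete, here is a minimal sketch of what a parameterized problem with executable SymPy ground truth could look like. The template text, symbol names, and parameter ranges are illustrative assumptions, not items taken from the dataset.

```python
# Minimal sketch of a SymPyBench-style parameterized problem (illustrative,
# not taken from the dataset): a template, symbolic ground truth, and a
# sampler that produces one concrete variant with its exact answer.
import random
import sympy as sp

m, theta, g = sp.symbols("m theta g", positive=True)

TEMPLATE = ("A block of mass {m} kg slides down a frictionless incline of "
            "angle {theta} degrees. Find its acceleration.")

# Executable ground truth: the mass cancels on a frictionless incline.
GROUND_TRUTH = g * sp.sin(theta)

def instantiate(seed: int):
    """Sample one concrete variant and compute its exact numeric answer."""
    rng = random.Random(seed)
    mass = rng.uniform(0.5, 10.0)
    angle_deg = rng.uniform(5.0, 80.0)
    params = {m: mass, theta: sp.rad(angle_deg), g: 9.81}
    question = TEMPLATE.format(m=round(mass, 2), theta=round(angle_deg, 1))
    answer = float(GROUND_TRUTH.subs(params))
    return question, answer

question, answer = instantiate(seed=0)
print(question)
print(f"Ground-truth acceleration: {answer:.3f} m/s^2")
```

Because the answer is computed by code rather than stored as a constant, the same template can be re‑sampled indefinitely while still providing a guaranteed‑correct key for every variant.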
Key Contributions
- Large‑scale, synthetic physics suite – 15,045 problems covering mechanics, electromagnetism, thermodynamics, etc., with a 90/10 train‑test split.
- Parameterizable instances – every problem is defined by symbolic variables, enabling an effectively infinite number of concrete variants.
- Three answer formats – MC‑Symbolic, MC‑Numerical, and free‑form, each probing different reasoning capabilities.
- Executable ground truth – each problem ships with SymPy code that computes the correct solution for any parameter setting, guaranteeing reproducible answers.
- New evaluation metrics – Consistency Score, Failure Rate, and Confusion Rate capture how stable a model’s predictions are across different instantiations of the same problem.
- Comprehensive baseline study – instruction‑tuned LLMs (e.g., GPT‑4, Claude, LLaMA‑2) are evaluated, revealing specific strengths and blind spots in scientific reasoning.
Methodology
- Problem Generation – The authors start from a curated list of physics concepts and use a rule‑based generator to produce symbolic problem templates (e.g., “A block of mass m slides down an incline of angle θ …”). Random numeric ranges are assigned to each variable, creating countless concrete versions.
- Reasoning Annotation – For every template, a human‑in‑the‑loop pipeline writes a structured solution outline (premise → formula → algebraic manipulation → final answer).
- Executable Ground Truth – The same outline is translated into SymPy code that symbolically solves the problem and can be evaluated numerically for any sampled parameters.
- Dataset Split & Sampling – 90 % of the templates are reserved for training, 10 % for testing. Within each split, the authors sample multiple parameter sets to assess model consistency.
- Metrics – four measures are reported (a sketch of the consistency‑oriented metrics follows this list):
  - Accuracy – standard correct/incorrect scoring.
  - Consistency Score – the proportion of parameter variants on which the model’s answer stays the same.
  - Failure Rate – the fraction of variants that cause the model to crash or refuse to answer.
  - Confusion Rate – how often the model picks a wrong option that is close to the correct one (e.g., the same symbolic form with a different constant).
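The summary above does not include reference code for these metrics, so the following is only a plausible per‑problem computation of Consistency Score and Failure Rate under the stated definitions; treating `None` as a crash or refusal and measuring agreement with the model’s most common answer are assumptions, not the paper’s code.

```python
# Hedged sketch of per-problem Consistency Score and Failure Rate under the
# definitions above; the treatment of None as a crash/refusal and the use of
# the modal answer as the reference are assumptions, not the paper's code.
from collections import Counter

def consistency_and_failure(answers):
    """answers: one model answer per parameter variant of the same problem,
    with None marking a crash or a refusal to answer."""
    total = len(answers)
    failures = sum(1 for a in answers if a is None)
    valid = [a for a in answers if a is not None]

    failure_rate = failures / total
    if not valid:
        return 0.0, failure_rate

    # Consistency: share of variants that agree with the most common answer.
    modal_count = Counter(valid).most_common(1)[0][1]
    consistency = modal_count / total
    return consistency, failure_rate

# Five parameter draws of one problem: three agree, one differs, one crash.
variants = ["g*sin(theta)", "g*sin(theta)", "g*cos(theta)", None, "g*sin(theta)"]
print(consistency_and_failure(variants))  # -> (0.6, 0.2)
```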
Results & Findings
| Model | Accuracy (MC‑Symbolic) | Accuracy (MC‑Numerical) | Free‑form BLEU | Consistency | Failure Rate |
|---|---|---|---|---|---|
| GPT‑4 (instruction‑tuned) | 78 % | 71 % | 0.62 | 0.84 | 2 % |
| Claude 2 | 73 % | 66 % | 0.58 | 0.79 | 3 % |
| LLaMA‑2‑70B | 61 % | 55 % | 0.44 | 0.68 | 7 % |
| Open‑source baseline (GPT‑NeoX) | 48 % | 42 % | 0.31 | 0.55 | 12 % |
- Strengths: All models handle symbolic multiple‑choice questions relatively well, especially when the answer hinges on a single formula.
- Weaknesses: Numerical MC and free‑form answers suffer from rounding errors and algebraic manipulation mistakes.
- Consistency Gap: Even top‑tier models sometimes flip between correct and incorrect answers when only the numeric parameters change, indicating fragile reasoning pipelines.
- Failure Modes: Common failures include “division by zero” when a sampled parameter makes a denominator vanish, and refusal to execute code due to safety filters.
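One mitigation for the degenerate‑denominator failures noted above, offered here as an assumption rather than as the authors’ sampling procedure, is to reject parameter draws that zero out a denominator of the ground‑truth expression before the instance ever reaches a model:

```python
# Illustrative guard (not the paper's procedure): reject parameter samples
# that make a denominator of the ground-truth expression vanish.
import sympy as sp

def is_degenerate(expr, params, tol=1e-9):
    """Return True if substituting params (approximately) zeroes a denominator."""
    _, denominator = sp.fraction(sp.together(expr))
    return abs(float(denominator.subs(params))) < tol

a, b = sp.symbols("a b")
solution = (a + b) / (a - b)  # hypothetical ground-truth expression

print(is_degenerate(solution, {a: 2.0, b: 2.0}))  # True  -> resample
print(is_degenerate(solution, {a: 3.0, b: 1.0}))  # False -> keep this variant
```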
Practical Implications
- Robust Scientific Assistants – Developers building AI tutors or lab assistants can use SymPyBench to stress‑test their models before deployment, ensuring they don’t break on edge‑case parameter values.
- Automated Grading & Feedback – The executable ground truth enables on‑the‑fly generation of answer keys for custom problem sets, useful for MOOCs and adaptive learning platforms.
- Model Debugging Toolkit – Consistency, Failure, and Confusion scores give concrete signals for where a model’s reasoning pipeline needs reinforcement (e.g., better handling of symbolic simplification or numeric stability).
- Prompt Engineering – The benchmark highlights the benefit of prompting models to show their work and to output SymPy code directly, which can then be verified programmatically (see the sketch after this list).
- Safety & Reliability – By surfacing failure cases (e.g., illegal operations), developers can design guardrails that catch unsafe code generation before execution.
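As a concrete illustration of the last three points, a grading harness could parse the model’s emitted SymPy expression and compare it numerically with the executable ground truth across several parameter draws. The function names here are hypothetical, and `sympify` relies on `eval` internally, so a production harness would need stricter sandboxing; this is a sketch of the idea, not the paper’s harness.

```python
# Hedged sketch of programmatic verification of a model-emitted SymPy answer.
# Names are hypothetical; sympify uses eval internally, so real deployments
# would add proper sandboxing before running untrusted model output.
import math
import sympy as sp

g, theta = sp.symbols("g theta", positive=True)
GROUND_TRUTH = g * sp.sin(theta)  # executable answer key for one problem

def grade(model_output: str, samples, rel_tol=1e-6):
    """Classify a model's symbolic answer as correct, incorrect, or failure."""
    try:
        candidate = sp.sympify(model_output, locals={"g": g, "theta": theta})
    except (sp.SympifyError, SyntaxError):
        return "failure"  # unparsable output, refusal, or non-expression code

    for params in samples:
        try:
            got = float(candidate.subs(params))
            want = float(GROUND_TRUTH.subs(params))
        except (TypeError, ValueError, ZeroDivisionError):
            return "failure"  # e.g., a denominator vanished at this sample
        if not math.isclose(got, want, rel_tol=rel_tol):
            return "incorrect"
    return "correct"

samples = [{g: 9.81, theta: 0.3}, {g: 9.81, theta: 1.1}]
print(grade("g*sin(theta)", samples))  # correct
print(grade("g*cos(theta)", samples))  # incorrect
print(grade("import os", samples))     # failure
```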
Limitations & Future Work
- Synthetic Bias – Although the generator covers many physics topics, the problems are still rule‑based and may not capture the nuance of real textbook or experimental questions.
- Domain Scope – Currently limited to undergraduate physics; extending to chemistry, biology, or engineering would broaden applicability.
- Model Access – The study focuses on instruction‑tuned LLMs; evaluating smaller, open‑source models with fine‑tuning on SymPyBench could reveal different scaling behaviors.
- Human Evaluation – Free‑form answers are assessed with automatic metrics (BLEU, ROUGE); incorporating expert human grading would provide a richer quality signal.
- Dynamic Difficulty – Future versions could adapt the parameter ranges to generate progressively harder instances, enabling curriculum‑learning experiments.
SymPyBench opens a new frontier for measuring and improving scientific reasoning in LLMs, turning static textbook problems into living, testable code. For developers aiming to embed trustworthy physics reasoning into their products, it offers both a rigorous benchmark and a practical debugging framework.
Authors
- Shima Imani
- Seungwhan Moon
- Adel Ahmadyan
- Lu Zhang
- Kirmani Ahmed
- Babak Damavandi
Paper Information
- arXiv ID: 2512.05954v1
- Categories: cs.AI
- Published: December 5, 2025