[Paper] FEM-Bench: A Structured Scientific Reasoning Benchmark for Evaluating Code-Generating LLMs

Published: December 23, 2025 at 02:40 PM EST
4 min read
Source: arXiv - 2512.20732v1

Overview

The paper introduces FEM‑Bench, a new benchmark that tests how well large language models (LLMs) can write correct code for finite element method (FEM) simulations—a cornerstone of computational mechanics. By framing scientific reasoning as a coding problem with strict physical and numerical constraints, the authors provide a concrete way to measure progress toward AI systems that can model the real world.

Key Contributions

  • A dedicated scientific‑reasoning benchmark built around FEM tasks drawn from a first‑year graduate computational mechanics curriculum.
  • 33 well‑defined problems covering geometry creation, material modeling, boundary‑condition specification, mesh generation, and post‑processing.
  • Standardized evaluation protocol: each model gets five independent attempts per task, and success is measured both at the function level (does the code run?) and at the unit‑test level (does the output meet physics‑based tolerances?); a minimal sketch of this two‑tier check follows the list.
  • Comprehensive baseline results for several state‑of‑the‑art LLMs (Gemini 3 Pro, GPT‑5, Claude 3, Llama 2‑70B, etc.), revealing large performance gaps.
  • Open‑source benchmark suite (datasets, reference solutions, and evaluation scripts) to enable reproducible research and community extensions.
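
The paper's released evaluation scripts are not reproduced in this summary, but the two‑tier check described above can be illustrated with a minimal Python sketch. Everything here is an illustrative assumption: the task (an end‑loaded cantilever), the solver interface `generated_solver`, and the 1 % relative tolerance; the reference value is the standard Euler‑Bernoulli tip deflection P·L³ / (3·E·I).

```python
# Illustrative two-tier check (a sketch, not FEM-Bench's actual harness).
# Assumption: model-generated code exposes a callable returning the cantilever tip deflection.

def reference_tip_deflection(P, L, E, I):
    """Analytic Euler-Bernoulli tip deflection for an end-loaded cantilever: P*L**3 / (3*E*I)."""
    return P * L**3 / (3.0 * E * I)

def evaluate_attempt(generated_solver, P=1.0e3, L=2.0, E=210e9, I=8.33e-6, rel_tol=0.01):
    """Tier 1: does the generated code run? Tier 2: is its output within a physics-based tolerance?"""
    try:
        tip = generated_solver(P=P, L=L, E=E, I=I)     # function-level success if this call returns
    except Exception:
        return {"function_success": False, "joint_success": False}
    ref = reference_tip_deflection(P, L, E, I)
    joint = abs(tip - ref) <= rel_tol * abs(ref)       # e.g. max displacement error < 1 %
    return {"function_success": True, "joint_success": joint}
```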

Methodology

  1. Task Design – The authors curated 33 FEM problems that are “introductory but non‑trivial.” Each problem specifies a physical scenario (e.g., a cantilever beam under load), the required material model, and the desired output (displacement field, stress distribution, etc.).
  2. Prompt Construction – For every task, a natural‑language prompt describes the physics, the numerical method, and the target programming language (Python with FEniCS or MATLAB).
  3. Model Interaction – Selected LLMs generate code snippets in response to the prompts. The process is repeated five times per model to capture variability.
  4. Automated Verification – Generated code is executed in a sandbox. Two tiers of success are recorded:
    • Function Success – The script runs without errors and produces output (correctness is not yet checked).
    • Joint Success (Unit Tests) – The output is compared against a reference solution using tolerance‑based assertions (e.g., max displacement error < 1 %).
  5. Metrics – Success rates are aggregated across tasks and attempts, yielding per‑model scores such as "30/33 tasks solved at least once" or an "Average Joint Success Rate of 73.8 %"; the sketch after this list shows one way such scores can be tallied.
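
For concreteness, the aggregation in step 5 can be sketched as follows. This is only one plausible bookkeeping scheme, not the paper's released scripts: it takes a tasks‑by‑attempts grid of joint‑success booleans and reports the "solved at least once" count, the "solved in all attempts" count, and the average joint success rate.

```python
# Illustrative aggregation over a tasks-by-attempts grid of joint-success booleans
# (a sketch of the metric definitions above, not FEM-Bench's released evaluation code).

def aggregate(joint_success):
    """joint_success: one list of booleans per task, one entry per attempt (e.g. 33 x 5)."""
    n_tasks = len(joint_success)
    total_attempts = sum(len(attempts) for attempts in joint_success)
    return {
        "solved_at_least_once": f"{sum(any(a) for a in joint_success)}/{n_tasks}",
        "solved_in_all_attempts": f"{sum(all(a) for a in joint_success)}/{n_tasks}",
        "avg_joint_success_pct": round(100.0 * sum(sum(a) for a in joint_success) / total_attempts, 1),
    }
```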

Results & Findings

| Model (best attempt) | Function‑Level Success | Joint Success (Avg. %) |
| --- | --- | --- |
| Gemini 3 Pro (function writing) | 30 / 33 tasks solved at least once; 26 / 33 solved in all 5 attempts | – |
| GPT‑5 (unit‑test writing) | – | 73.8 % average joint success |
| Claude 3 | 18 / 33 (≥1 success) | 45 % |
| Llama 2‑70B | 12 / 33 (≥1 success) | 31 % |

Key takeaways

  • Even the strongest current models fail to consistently solve a modest set of FEM problems.
  • Performance varies dramatically between models and even between attempts for the same model, highlighting stochastic generation behavior.
  • Errors are often physical rather than syntactic—e.g., wrong boundary conditions, mis‑specified material properties, or unstable mesh parameters.

Practical Implications

  • Tooling for Engineers – Companies building AI‑assisted simulation pipelines can use FEM‑Bench to gauge whether a model is ready for production or needs additional fine‑tuning.
  • Curriculum‑Level Automation – Academic labs could deploy LLMs to generate starter code for student assignments, but the benchmark warns that human verification remains essential.
  • Model‑Driven Design – Integrating LLMs into CAD‑to‑simulation workflows (auto‑generating FEM scripts from geometry) becomes feasible only after passing structured tests like those in FEM‑Bench.
  • Benchmark‑Driven Development – LLM vendors now have a concrete target domain (computational mechanics) to optimize for, potentially spurring specialized fine‑tuning datasets and architecture tweaks.

Limitations & Future Work

  • Scope – The benchmark covers only introductory FEM tasks; real‑world engineering problems involve nonlinear materials, multi‑physics coupling, and large‑scale parallel solvers, which are not yet represented.
  • Language Bias – Current prompts focus on Python/FEniCS and MATLAB; other popular FEM frameworks (e.g., Abaqus, ANSYS) are omitted.
  • Evaluation Granularity – Success is binary (pass/fail) per unit test; richer diagnostics (e.g., error magnitude distribution) could better inform model weaknesses; a small sketch of such a diagnostic follows this list.
  • Human‑in‑the‑Loop – The study does not explore how a developer might iteratively correct LLM output, a realistic usage pattern.
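
As a sketch of the kind of graded diagnostic hinted at above (an illustrative metric choice, not something specified in the paper), one could log relative error magnitudes instead of a single pass/fail bit and then summarize their distribution:

```python
import numpy as np

# Sketch of a graded diagnostic: record relative error magnitudes rather than a binary pass/fail.

def relative_error(predicted, reference):
    """Relative L2 error between a generated displacement field and the reference solution."""
    predicted, reference = np.asarray(predicted, float), np.asarray(reference, float)
    return float(np.linalg.norm(predicted - reference) / np.linalg.norm(reference))

def error_summary(errors, tol=0.01):
    """Summarize the error distribution across tasks/attempts (median, tail, fraction within tolerance)."""
    errors = np.asarray(errors, float)
    return {
        "median": float(np.median(errors)),
        "p90": float(np.percentile(errors, 90)),
        "within_tol": float(np.mean(errors < tol)),
    }
```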

Future releases of FEM‑Bench aim to add higher‑complexity scenarios (nonlinear elasticity, fluid‑structure interaction), support additional programming environments, and incorporate interactive debugging metrics to reflect real development cycles.

Authors

  • Saeed Mohammadzadeh
  • Erfan Hamdi
  • Joel Shor
  • Emma Lejeune

Paper Information

  • arXiv ID: 2512.20732v1
  • Categories: cs.LG, cs.AI, cs.SE
  • Published: December 23, 2025