[Paper] FEM-Bench: A Structured Scientific Reasoning Benchmark for Evaluating Code-Generating LLMs

Published: December 23, 2025 at 02:40 PM EST
4 min read
Source: arXiv - 2512.20732v1

Overview

The paper introduces FEM‑Bench, a new benchmark that tests how well large language models (LLMs) can write correct code for finite element method (FEM) simulations—a cornerstone of computational mechanics. By framing scientific reasoning as a coding problem with strict physical and numerical constraints, the authors provide a concrete way to measure progress toward AI systems that can model the real world.

Key Contributions

  • A dedicated scientific‑reasoning benchmark built around FEM tasks drawn from a first‑year graduate computational mechanics curriculum.
  • 33 well‑defined problems covering geometry creation, material modeling, boundary‑condition specification, mesh generation, and post‑processing.
  • Standardized evaluation protocol: each model gets five independent attempts per task, and success is measured both at the function level (does the code run?) and at the unit‑test level (does the output meet physics‑based tolerances?); a minimal sketch of this two‑tier check follows the list.
  • Comprehensive baseline results for several state‑of‑the‑art LLMs (Gemini 3 Pro, GPT‑5, Claude 3, Llama 2‑70B, etc.), revealing large performance gaps.
  • Open‑source benchmark suite (datasets, reference solutions, and evaluation scripts) to enable reproducible research and community extensions.
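
The paper's released evaluation scripts are not reproduced in this summary, but the two‑tier check described above can be illustrated with a minimal Python sketch. Everything here is an illustrative assumption: the task (an end‑loaded cantilever), the solver interface `generated_solver`, and the 1 % relative tolerance; the reference value is the standard Euler‑Bernoulli tip deflection P·L³ / (3·E·I).

```python
# Illustrative two-tier check (a sketch, not FEM-Bench's actual harness).
# Assumption: model-generated code exposes a callable returning the cantilever tip deflection.

def reference_tip_deflection(P, L, E, I):
    """Analytic Euler-Bernoulli tip deflection for an end-loaded cantilever: P*L**3 / (3*E*I)."""
    return P * L**3 / (3.0 * E * I)

def evaluate_attempt(generated_solver, P=1.0e3, L=2.0, E=210e9, I=8.33e-6, rel_tol=0.01):
    """Tier 1: does the generated code run? Tier 2: is its output within a physics-based tolerance?"""
    try:
        tip = generated_solver(P=P, L=L, E=E, I=I)     # function-level success if this call returns
    except Exception:
        return {"function_success": False, "joint_success": False}
    ref = reference_tip_deflection(P, L, E, I)
    joint = abs(tip - ref) <= rel_tol * abs(ref)       # e.g. max displacement error < 1 %
    return {"function_success": True, "joint_success": joint}
```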

Methodology

  1. Task Design – The authors curated 33 FEM problems that are “introductory but non‑trivial.” Each problem specifies a physical scenario (e.g., a cantilever beam under load), the required material model, and the desired output (displacement field, stress distribution, etc.).
  2. Prompt Construction – For every task, a natural‑language prompt describes the physics, the numerical method, and the target programming language (Python with FEniCS or MATLAB).
  3. Model Interaction – Selected LLMs generate code snippets in response to the prompts. The process is repeated five times per model to capture variability.
  4. Automated Verification – Generated code is executed in a sandbox. Two tiers of success are recorded:
    • Function Success – The script runs without errors and produces output (correctness is not yet checked).
    • Joint Success (Unit Tests) – The output is compared against a reference solution using tolerance‑based assertions (e.g., max displacement error < 1 %).
  5. Metrics – Success rates are aggregated across tasks and attempts, yielding per‑model scores such as "30/33 tasks solved at least once" or an "Average Joint Success Rate of 73.8 %"; the sketch after this list shows one way such scores can be tallied.
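
For concreteness, the aggregation in step 5 can be sketched as follows. This is only one plausible bookkeeping scheme, not the paper's released scripts: it takes a tasks‑by‑attempts grid of joint‑success booleans and reports the "solved at least once" count, the "solved in all attempts" count, and the average joint success rate.

```python
# Illustrative aggregation over a tasks-by-attempts grid of joint-success booleans
# (a sketch of the metric definitions above, not FEM-Bench's released evaluation code).

def aggregate(joint_success):
    """joint_success: one list of booleans per task, one entry per attempt (e.g. 33 x 5)."""
    n_tasks = len(joint_success)
    total_attempts = sum(len(attempts) for attempts in joint_success)
    return {
        "solved_at_least_once": f"{sum(any(a) for a in joint_success)}/{n_tasks}",
        "solved_in_all_attempts": f"{sum(all(a) for a in joint_success)}/{n_tasks}",
        "avg_joint_success_pct": round(100.0 * sum(sum(a) for a in joint_success) / total_attempts, 1),
    }
```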

Results & Findings

| Model (best attempt) | Function‑Level Success | Joint Success (Avg. %) |
| --- | --- | --- |
| Gemini 3 Pro (function writing) | 30 / 33 tasks solved at least once; 26 / 33 solved in all 5 attempts | – |
| GPT‑5 (unit‑test writing) | – | 73.8 % average joint success |
| Claude 3 | 18 / 33 (≥1 success) | 45 % |
| Llama 2‑70B | 12 / 33 (≥1 success) | 31 % |

Key takeaways

  • Even the strongest current models fail to consistently solve a modest set of FEM problems.
  • Performance varies dramatically between models and even between attempts for the same model, highlighting stochastic generation behavior.
  • Errors are often physical rather than syntactic—e.g., wrong boundary conditions, mis‑specified material properties, or unstable mesh parameters.

Practical Implications

  • Tooling for Engineers – Companies building AI‑assisted simulation pipelines can use FEM‑Bench to gauge whether a model is ready for production or needs additional fine‑tuning.
  • Curriculum‑Level Automation – Academic labs could deploy LLMs to generate starter code for student assignments, but the benchmark warns that human verification remains essential.
  • Model‑Driven Design – Integrating LLMs into CAD‑to‑simulation workflows (auto‑generating FEM scripts from geometry) becomes feasible only after passing structured tests like those in FEM‑Bench.
  • Benchmark‑Driven Development – LLM vendors now have a concrete target domain (computational mechanics) to optimize for, potentially spurring specialized fine‑tuning datasets and architecture tweaks.

Limitations & Future Work

  • Scope – The benchmark covers only introductory FEM tasks; real‑world engineering problems involve nonlinear materials, multi‑physics coupling, and large‑scale parallel solvers, which are not yet represented.
  • Language Bias – Current prompts focus on Python/FEniCS and MATLAB; other popular FEM frameworks (e.g., Abaqus, ANSYS) are omitted.
  • Evaluation Granularity – Success is binary (pass/fail) per unit test; richer diagnostics (e.g., error magnitude distribution) could better inform model weaknesses; a small sketch of such a diagnostic follows this list.
  • Human‑in‑the‑Loop – The study does not explore how a developer might iteratively correct LLM output, a realistic usage pattern.
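
As a sketch of the kind of graded diagnostic hinted at above (an illustrative metric choice, not something specified in the paper), one could log relative error magnitudes instead of a single pass/fail bit and then summarize their distribution:

```python
import numpy as np

# Sketch of a graded diagnostic: record relative error magnitudes rather than a binary pass/fail.

def relative_error(predicted, reference):
    """Relative L2 error between a generated displacement field and the reference solution."""
    predicted, reference = np.asarray(predicted, float), np.asarray(reference, float)
    return float(np.linalg.norm(predicted - reference) / np.linalg.norm(reference))

def error_summary(errors, tol=0.01):
    """Summarize the error distribution across tasks/attempts (median, tail, fraction within tolerance)."""
    errors = np.asarray(errors, float)
    return {
        "median": float(np.median(errors)),
        "p90": float(np.percentile(errors, 90)),
        "within_tol": float(np.mean(errors < tol)),
    }
```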

Future releases of FEM‑Bench aim to add higher‑complexity scenarios (nonlinear elasticity, fluid‑structure interaction), support additional programming environments, and incorporate interactive debugging metrics to reflect real development cycles.

Authors

  • Saeed Mohammadzadeh
  • Erfan Hamdi
  • Joel Shor
  • Emma Lejeune

Paper Information

  • arXiv ID: 2512.20732v1
  • Categories: cs.LG, cs.AI, cs.SE
  • Published: December 23, 2025