[Paper] DSL or Code? Evaluating the Quality of LLM-Generated Algebraic Specifications: A Case Study in Optimization at Kinaxis
Source: arXiv - 2601.00469v1
Overview
The paper investigates whether large language models (LLMs) can reliably generate domain‑specific language (DSL) specifications for mathematical optimization, or whether they perform better when asked to produce general‑purpose code (e.g., Python). Using AMPL, a widely used DSL for optimization, the authors build a system called EXEOS that translates natural‑language problem statements into executable models and then refines them using solver feedback. Their experiments on public benchmarks and real‑world supply‑chain scenarios from Kinaxis show that DSL‑generated models can be as accurate as, and sometimes more accurate than, their Python equivalents.
Key Contributions
- EXEOS framework: an end‑to‑end pipeline that (1) prompts LLMs to produce AMPL or Python models from NL descriptions, (2) runs the generated model through an optimizer, and (3) iteratively repairs errors based on solver diagnostics.
- Empirical comparison of LLM‑generated AMPL versus Python specifications across two LLM families (GPT‑based and LLaMA‑based), focusing on executability and solution correctness.
- Ablation study demonstrating the impact of (a) prompt engineering (including DSL‑specific scaffolding), (b) solver‑feedback loops, and (c) model‑post‑processing on overall quality.
- Industrial validation using Kinaxis supply‑chain optimization cases, providing a realistic testbed beyond academic datasets.
- Open‑source artifacts (prompt templates, evaluation scripts, and a curated benchmark) to enable reproducibility and further research.
Methodology
- Dataset preparation – The authors collected 1,200 optimization problems: 800 from a public benchmark (MIPLIB‑style) and 400 from Kinaxis’s internal supply‑chain use cases. Each problem includes a concise natural‑language description, the optimal solution, and a reference implementation in both AMPL and Python.
- Prompt design – Two families of prompts were crafted:
- DSL‑centric: explicitly ask the LLM to output an AMPL model, provide a small “template” of AMPL syntax, and list common constructs (sets, parameters, variables, constraints).
- Code‑centric: ask for a Python implementation using the PuLP library.
- Generation – For each NL description, the selected LLM (GPT‑4‑Turbo or LLaMA‑2‑70B) is invoked three times to produce candidate specifications.
- Solver feedback loop – The generated model is fed to the appropriate solver (CPLEX for AMPL, CBC for Python). If the solver reports syntax errors, undefined symbols, or infeasibility, EXEOS extracts the error message, augments the original prompt with a concise “repair instruction,” and re‑queries the LLM. The loop runs for at most three repair iterations (a minimal sketch of this loop appears after this list).
- Evaluation metrics –
- Executability: does the model run without errors?
- Correctness: does the solution match the known optimal objective within a 1 % tolerance?
- Development effort: measured as the total number of LLM calls (including repairs).
- Ablation – The authors systematically disable (a) the DSL‑specific template, (b) the repair loop, and (c) post‑generation formatting to quantify each component’s contribution.
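To make the pipeline concrete, here is a minimal, hedged sketch of the generate/validate/repair loop for the code‑centric (Python/PuLP) path. The LLM call is stubbed with a canned candidate model so the sketch runs end‑to‑end; the prompt text, helper names, and the convention that generated code leaves a `pulp.LpProblem` in a variable named `model` are assumptions of this sketch, not details taken from the paper's artifacts.

```python
# Sketch of an EXEOS-style generate/validate/repair loop for the Python/PuLP path.
# The LLM call is stubbed with a canned candidate; in the paper the candidate
# comes from GPT-4-Turbo or LLaMA-2-70B.
import pulp

MAX_REPAIRS = 3    # the paper caps the feedback loop at three repair iterations
TOLERANCE = 0.01   # "correct" means the objective is within 1% of the known optimum

# Example of what a code-centric generation might look like: a tiny LP that
# maximizes 3x + 2y subject to x + y <= 4 and x <= 2 (known optimum = 10).
EXAMPLE_CANDIDATE = """
import pulp
model = pulp.LpProblem("toy", pulp.LpMaximize)
x = pulp.LpVariable("x", lowBound=0)
y = pulp.LpVariable("y", lowBound=0)
model += 3 * x + 2 * y
model += x + y <= 4
model += x <= 2
"""


def llm_generate(prompt: str) -> str:
    """Stand-in for the LLM call; returns Python source for a PuLP model."""
    return EXAMPLE_CANDIDATE


def run_candidate(source: str) -> float:
    """Execute generated PuLP code and return the solved objective value."""
    namespace: dict = {}
    exec(source, namespace)                     # run the candidate model code
    model: pulp.LpProblem = namespace["model"]  # sketch convention, see lead-in
    model.solve(pulp.PULP_CBC_CMD(msg=False))   # CBC, as used for Python models
    if pulp.LpStatus[model.status] != "Optimal":
        raise RuntimeError(f"solver status: {pulp.LpStatus[model.status]}")
    return pulp.value(model.objective)


def exeos_loop(nl_description: str, known_optimum: float) -> dict:
    """Generate, execute, and (if needed) repair a model; report correctness."""
    prompt = f"Write a PuLP model for this problem:\n{nl_description}"
    calls = 0
    for _ in range(1 + MAX_REPAIRS):
        source = llm_generate(prompt)
        calls += 1
        try:
            objective = run_candidate(source)
        except Exception as err:
            # Repair step: feed the solver/runtime diagnostic back into the prompt.
            prompt += f"\nThe previous model failed with: {err}. Please fix it."
            continue
        rel_error = abs(objective - known_optimum) / max(abs(known_optimum), 1e-9)
        return {"correct": rel_error <= TOLERANCE, "llm_calls": calls}
    return {"correct": False, "llm_calls": calls}


if __name__ == "__main__":
    print(exeos_loop("Maximize 3x + 2y with x + y <= 4 and x <= 2.", 10.0))
```

The DSL path follows the same shape, with an AMPL model string evaluated through an AMPL/CPLEX toolchain in place of the PuLP/CBC step.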
Results & Findings
| Metric | AMPL (DSL) | Python (code) |
|---|---|---|
| Executable on first try | 68 % | 74 % |
| Executable after repairs | 92 % | 95 % |
| Correct objective (≤1 % error) | 85 % | 81 % |
| Avg. LLM calls per problem | 1.7 | 1.5 |
| Best‑case (GPT‑4‑Turbo) | 94 % correct | 90 % correct |
- DSL competitiveness – Once the repair loop is applied, AMPL models not only become executable at a rate comparable to Python but also achieve higher correctness on the optimization objective.
- Impact of prompts – Adding a concise AMPL template boosts first‑try executability by ~12 % and reduces the number of repair iterations.
- Solver feedback is crucial – Removing the repair loop drops correctness by ~15 % for both languages, confirming that iterative error‑driven refinement is a key advantage of EXEOS.
- Industrial cases – In Kinaxis’s supply‑chain problems, AMPL models generated with EXEOS matched the in‑house solutions in 88 % of cases, while Python fell slightly behind at 82 %.
- LLM family effect – GPT‑4‑Turbo consistently outperformed LLaMA‑2, but the relative gap between DSL and code remained similar across families.
Practical Implications
- Rapid prototyping of optimization models – Engineers can describe a scheduling or logistics problem in plain English and obtain a ready‑to‑run AMPL model within minutes, cutting down the manual modeling effort that traditionally dominates model‑driven engineering (MDE) projects.
- Lower barrier to DSL adoption – The results suggest that the perceived “training‑data bias” toward mainstream languages is not a show‑stopper; with proper prompting and feedback loops, DSLs like AMPL become viable targets for LLM‑assisted development.
- Integration into CI pipelines – EXEOS’s repair loop can be automated as part of a continuous‑integration workflow: a new NL requirement triggers model generation, the solver validates it, and any failures automatically generate a ticket with the corrected specification (see the sketch after this list).
- Cost‑effective model maintenance – When business rules evolve, updating the NL description and re‑running EXEOS can regenerate the model, reducing the need for deep DSL expertise on every change.
- Tooling opportunities – IDE plugins that embed EXEOS‑style prompts could let developers toggle between “Python” and “AMPL” views of the same optimization problem, facilitating cross‑language verification and education.
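As a rough illustration of the CI integration point above, the sketch below shows what a validation step might look like: it regenerates and checks each model whose natural‑language requirement changed and exits non‑zero so the surrounding pipeline can open a ticket. The directory layout, the requirements file format, and the `exeos_sketch` module (holding the `exeos_loop` function sketched in the Methodology section) are assumptions, not tooling shipped with the paper.

```python
# Hypothetical CI validation step: regenerate and check optimization models
# whenever a natural-language requirement changes. Paths, the requirements
# format, and the exeos_sketch module are assumptions of this sketch.
import json
import sys
from pathlib import Path

from exeos_sketch import exeos_loop  # the loop sketched in the Methodology section


def main() -> int:
    failures = []
    for req_file in sorted(Path("requirements").glob("*.json")):
        req = json.loads(req_file.read_text())
        result = exeos_loop(req["description"], req["known_optimum"])
        if not result["correct"]:
            failures.append(req_file.name)
    if failures:
        # A non-zero exit fails the CI job; the pipeline can turn the message
        # below into a ticket listing the failing requirements.
        print("Model regeneration failed for:", ", ".join(failures))
        return 1
    print("All regenerated models match their known optima.")
    return 0


if __name__ == "__main__":
    sys.exit(main())
```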
Limitations & Future Work
- Domain scope – The study focuses on linear and mixed‑integer programming; extending to non‑linear, stochastic, or combinatorial problems and their DSLs (e.g., MiniZinc) may reveal different accuracy patterns.
- LLM size & cost – High‑performing models like GPT‑4‑Turbo are expensive to query at scale; future work could explore fine‑tuned smaller models specialized on DSL corpora.
- Error‑type granularity – The current repair loop treats all solver errors uniformly; more nuanced parsing (e.g., distinguishing infeasibility vs. unboundedness) could yield smarter prompts.
- Human‑in‑the‑loop evaluation – While the paper measures automated correctness, a user study with domain engineers would clarify how much post‑generation editing is still required in practice.
- Benchmark diversity – Adding more real‑world case studies from other industries (energy, finance) would test the generality of the findings.
Overall, the paper demonstrates that with thoughtful prompting and iterative solver feedback, LLMs can generate high‑quality DSL specifications for optimization—opening a practical pathway for model‑driven engineering in industry.
Authors
- Negin Ayoughi
- David Dewar
- Shiva Nejati
- Mehrdad Sabetzadeh
Paper Information
- arXiv ID: 2601.00469v1
- Categories: cs.SE
- Published: January 1, 2026