[Paper] A framework for assessing the capabilities of code generation of constraint domain-specific languages with large language models

Published: March 5, 2026 at 10:23 AM EST
4 min read

Source: arXiv - 2603.05278v1

Overview

The paper introduces a systematic framework for measuring how well large language models (LLMs) can generate code in constraint domain‑specific languages (DSLs) such as OCL and Alloy. By evaluating LLM output along two axes, syntactic well‑formedness and semantic correctness, the authors reveal why LLMs that excel at mainstream languages like Python often stumble on niche DSLs, and they propose practical strategies (e.g., code repair, multiple generation attempts) to boost performance.

Key Contributions

  • A generic evaluation framework that assesses LLM‑generated DSL code on two axes: well‑formedness (does the code compile?) and correctness (does it satisfy the intended specification?).
  • Empirical benchmark covering three LLM families (GPT‑4, Claude, and open‑source models) on two constraint DSLs (OCL, Alloy) and a baseline general‑purpose language (Python).
  • Insightful analysis of factors that affect DSL generation quality: context‑window size, prompt design, multi‑shot generation, and post‑hoc code repair.
  • Actionable recommendations for developers who want to integrate LLM‑driven code synthesis for DSL‑heavy workflows (e.g., model‑driven engineering, verification pipelines).

Methodology

  1. Task definition – The authors start from natural‑language specifications (e.g., “ensure every User has a unique email”) and ask the LLM to produce the corresponding constraint in the target DSL.
  2. Prompt templates – Several prompt styles are tried (plain description, few‑shot examples, chain‑of‑thought).
  3. Generation strategies – For each task they collect:
    • a single generation,
    • n‑shot generation (multiple candidates), and
    • a repair step where a second LLM is prompted to fix syntax or logical errors.
  4. Evaluation metrics
    • Well‑formedness: does the code parse/compile with the DSL toolchain?
    • Correctness: does the generated constraint hold on a set of test models (automatically generated or hand‑crafted)?
  5. Baseline comparison – The same pipeline is run on Python to quantify the gap between a popular language and the DSLs.

All steps are automated, making the framework reusable for any new DSL or LLM.
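The two evaluation axes can be captured in a few lines. The following is an illustrative Python sketch, not the authors' implementation: `parse` and `holds` are hypothetical stand-ins for whatever DSL toolchain (step 4a) and test-model oracle (step 4b) get plugged in.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalResult:
    well_formed: bool  # axis 1: does the DSL toolchain accept the code?
    correct: bool      # axis 2: does it hold on every test model?

def evaluate_candidate(constraint: str,
                       parse: Callable[[str], bool],
                       holds: Callable[[str, dict], bool],
                       test_models: list[dict]) -> EvalResult:
    # A constraint that does not parse cannot be semantically correct,
    # so correctness is only checked for well-formed candidates.
    if not parse(constraint):
        return EvalResult(well_formed=False, correct=False)
    return EvalResult(well_formed=True,
                      correct=all(holds(constraint, m) for m in test_models))
```

Because both axes are plain callables, swapping in a new DSL or LLM only means supplying a different `parse`/`holds` pair, which is what makes the pipeline reusable.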

Results & Findings

Language   Best‑case well‑formedness   Best‑case correctness
Python     ~96 %                       ~92 %
OCL        ~71 %                       ~58 %
Alloy      ~68 %                       ~55 %
  • Context window matters: Open‑source models with ≤4k‑token context windows frequently truncate the domain model, causing malformed constraints.
  • Multiple attempts help: Generating 5 candidates per prompt lifts correctness by ~12 % for OCL and Alloy.
  • Repair prompts are powerful: A second LLM fixing syntax errors raises well‑formedness by ~8 % across DSLs.
  • Prompt template choice has minor impact compared to the above two levers.
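The two effective levers, multiple candidates and a repair pass, combine naturally into a best-of-n loop. A minimal sketch under that assumption, where `generate`, `validate`, and `repair` are hypothetical stand-ins for an LLM sampling call, the DSL validation step, and a second repair-prompting LLM pass:

```python
from typing import Callable, Optional

def best_of_n(prompt: str,
              generate: Callable[[str], str],   # one LLM sampling call
              validate: Callable[[str], bool],  # DSL parser and/or test suite
              repair: Callable[[str], str],     # second LLM pass fixing errors
              n: int = 5) -> Optional[str]:
    """Return the first candidate that validates, trying one repair
    pass on each failing candidate before discarding it."""
    for _ in range(n):
        candidate = generate(prompt)
        if validate(candidate):
            return candidate
        fixed = repair(candidate)
        if validate(fixed):
            return fixed
    return None  # all n candidates (and their repairs) failed validation
```

Validating each candidate immediately, rather than sampling all n up front, means the loop stops at the first success and saves LLM calls.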

Overall, LLMs are still noticeably weaker on constraint DSLs than on Python, but targeted strategies can narrow the gap.

Practical Implications

  • Model‑Driven Engineering (MDE) pipelines can now incorporate LLM‑assisted constraint writing, but they should adopt a generate‑then‑repair workflow and keep a small pool of candidate outputs.
  • Verification and testing tools (e.g., Alloy Analyzer) can be hooked to an LLM service to auto‑suggest invariants, provided the system runs a post‑generation validation step.
  • Open‑source LLMs may need custom extensions (e.g., larger context windows or chunked model representations) to be viable for DSL‑heavy domains.
  • Prompt engineering budgets can be re‑allocated: rather than spending time fine‑tuning templates, invest in building a lightweight “repair” micro‑service that runs a second LLM pass.
  • Team workflows: Developers can treat LLM output as draft constraint code, run the DSL compiler automatically, and let the LLM iterate until it passes. This reduces manual boilerplate while keeping the safety net of formal verification.

Limitations & Future Work

  • The study focuses on only two constraint DSLs; results may differ for other DSL families (e.g., hardware description languages).
  • Test specifications are relatively small; scaling to large, real‑world models could expose additional context‑window bottlenecks.
  • The repair step relies on the same LLM family; exploring heterogeneous models (e.g., a smaller model for generation, a larger one for repair) is left for future research.
  • The framework currently evaluates correctness against a fixed test suite; integrating symbolic execution or model checking to assess deeper semantic properties is an open direction.

Authors

  • David Delgado
  • Lola Burgueño
  • Robert Clarisó

Paper Information

  • arXiv ID: 2603.05278v1
  • Categories: cs.SE
  • Published: March 5, 2026
