[Paper] A framework for assessing the capabilities of code generation of constraint domain-specific languages with large language models

Published: March 5, 2026 at 10:23 AM EST
4 min read

Source: arXiv - 2603.05278v1

Overview

The paper introduces a systematic framework for measuring how well large language models (LLMs) can generate code in constraint domain‑specific languages (DSLs) such as OCL and Alloy. By evaluating LLM output along two axes, syntactic well‑formedness and semantic correctness, the authors reveal why LLMs that excel at mainstream languages like Python often stumble on niche DSLs, and they propose practical strategies (e.g., code repair, multiple generation attempts) to boost performance.

Key Contributions

  • A generic evaluation framework that assesses LLM‑generated DSL code on two axes: well‑formedness (does the code compile?) and correctness (does it satisfy the intended specification?).
  • Empirical benchmark covering three LLM families (GPT‑4, Claude, and open‑source models) on two constraint DSLs (OCL, Alloy) and a baseline general‑purpose language (Python).
  • Insightful analysis of factors that affect DSL generation quality: context‑window size, prompt design, multi‑shot generation, and post‑hoc code repair.
  • Actionable recommendations for developers who want to integrate LLM‑driven code synthesis for DSL‑heavy workflows (e.g., model‑driven engineering, verification pipelines).

Methodology

  1. Task definition – The authors start from natural‑language specifications (e.g., “ensure every User has a unique email”) and ask the LLM to produce the corresponding constraint in the target DSL.
  2. Prompt templates – Several prompt styles are tried (plain description, few‑shot examples, chain‑of‑thought).
  3. Generation strategies – For each task they collect:
    • a single generation,
    • n‑shot generation (multiple candidates), and
    • a repair step where a second LLM is prompted to fix syntax or logical errors.
  4. Evaluation metrics
    • Well‑formedness: does the code parse/compile with the DSL toolchain?
    • Correctness: does the generated constraint hold on a set of test models (automatically generated or hand‑crafted)?
  5. Baseline comparison – The same pipeline is run on Python to quantify the gap between a popular language and the DSLs.

All steps are automated, making the framework reusable for any new DSL or LLM.
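The two evaluation axes can be captured in a few lines. The following is an illustrative Python sketch, not the authors' implementation: `parse` and `holds` are hypothetical stand-ins for whatever DSL toolchain (step 4a) and test-model oracle (step 4b) get plugged in.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalResult:
    well_formed: bool  # axis 1: does the DSL toolchain accept the code?
    correct: bool      # axis 2: does it hold on every test model?

def evaluate_candidate(constraint: str,
                       parse: Callable[[str], bool],
                       holds: Callable[[str, dict], bool],
                       test_models: list[dict]) -> EvalResult:
    # A constraint that does not parse cannot be semantically correct,
    # so correctness is only checked for well-formed candidates.
    if not parse(constraint):
        return EvalResult(well_formed=False, correct=False)
    return EvalResult(well_formed=True,
                      correct=all(holds(constraint, m) for m in test_models))
```

Because both axes are plain callables, swapping in a new DSL or LLM only means supplying a different `parse`/`holds` pair, which is what makes the pipeline reusable.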

Results & Findings

Language   Best‑case well‑formedness   Best‑case correctness
Python     ~96 %                       ~92 %
OCL        ~71 %                       ~58 %
Alloy      ~68 %                       ~55 %
  • Context window matters: Open‑source models with ≤4k‑token context windows frequently truncate the domain model, causing malformed constraints.
  • Multiple attempts help: Generating 5 candidates per prompt lifts correctness by ~12 % for OCL and Alloy.
  • Repair prompts are powerful: A second LLM fixing syntax errors raises well‑formedness by ~8 % across DSLs.
  • Prompt template choice has minor impact compared to the above two levers.
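The two effective levers, multiple candidates and a repair pass, combine naturally into a best-of-n loop. A minimal sketch under that assumption, where `generate`, `validate`, and `repair` are hypothetical stand-ins for an LLM sampling call, the DSL validation step, and a second repair-prompting LLM pass:

```python
from typing import Callable, Optional

def best_of_n(prompt: str,
              generate: Callable[[str], str],   # one LLM sampling call
              validate: Callable[[str], bool],  # DSL parser and/or test suite
              repair: Callable[[str], str],     # second LLM pass fixing errors
              n: int = 5) -> Optional[str]:
    """Return the first candidate that validates, trying one repair
    pass on each failing candidate before discarding it."""
    for _ in range(n):
        candidate = generate(prompt)
        if validate(candidate):
            return candidate
        fixed = repair(candidate)
        if validate(fixed):
            return fixed
    return None  # all n candidates (and their repairs) failed validation
```

Validating each candidate immediately, rather than sampling all n up front, means the loop stops at the first success and saves LLM calls.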

Overall, LLMs are still noticeably weaker on constraint DSLs than on Python, but targeted strategies can narrow the gap.

Practical Implications

  • Model‑Driven Engineering (MDE) pipelines can now incorporate LLM‑assisted constraint writing, but they should adopt a generate‑then‑repair workflow and keep a small pool of candidate outputs.
  • Verification and testing tools (e.g., Alloy Analyzer) can be hooked to an LLM service to auto‑suggest invariants, provided the system runs a post‑generation validation step.
  • Open‑source LLMs may need custom extensions (e.g., larger context windows or chunked model representations) to be viable for DSL‑heavy domains.
  • Prompt engineering budgets can be re‑allocated: rather than spending time fine‑tuning templates, invest in building a lightweight “repair” micro‑service that runs a second LLM pass.
  • Team workflows: Developers can treat LLM output as draft constraint code, run the DSL compiler automatically, and let the LLM iterate until it passes. This reduces manual boilerplate while keeping the safety net of formal verification.

Limitations & Future Work

  • The study focuses on only two constraint DSLs; results may differ for other DSL families (e.g., hardware description languages).
  • Test specifications are relatively small; scaling to large, real‑world models could expose additional context‑window bottlenecks.
  • The repair step relies on the same LLM family; exploring heterogeneous models (e.g., a smaller model for generation, a larger one for repair) is left for future research.
  • The framework currently evaluates correctness against a fixed test suite; integrating symbolic execution or model checking to assess deeper semantic properties is an open direction.

Authors

  • David Delgado
  • Lola Burgueño
  • Robert Clarisó

Paper Information

  • arXiv ID: 2603.05278v1
  • Categories: cs.SE
  • Published: March 5, 2026
