[Paper] Leveraging LLMs for Multi-File DSL Code Generation: An Industrial Case Study
Source: arXiv - 2604.24678v1
Overview
This paper investigates how large language models (LLMs) can be used to generate and modify code written in a domain‑specific language (DSL) that lives across many files and directories. The authors conduct a real‑world case study at BMW, adapting two open‑source code‑oriented LLMs to produce repository‑scale DSL artifacts from a single natural‑language instruction, and they evaluate the approach with both automated metrics and a developer survey.
Key Contributions
- End‑to‑end pipeline for turning an industrial DSL (built with Xtext) into a trainable, multi‑file generation task.
- Path‑preserving JSON representation of the DSL folder hierarchy, enabling a single LLM response to cover an entire repository and capture cross‑file dependencies.
- Empirical comparison of three prompting strategies—baseline, one‑shot in‑context, and parameter‑efficient fine‑tuning (QLoRA)—across two 7‑B code LLMs (Qwen2.5‑Coder, DeepSeek‑Coder).
- New evaluation metrics that go beyond BLEU/ROUGE, measuring exact edit correctness, structural fidelity of the generated file tree, and downstream code‑generator success.
- Industrial validation through a developer survey and an execution check that runs the generated DSL through the existing Java/TypeScript code generator.
Methodology
- Dataset Construction – The team extracted historical DSL changes from BMW’s repository, pairing each natural‑language (NL) change request (e.g., “add a new vehicle type with X properties”) with the corresponding before/after DSL files.
- Multi‑File Task Encoding – Each change is serialized into a JSON object where keys are file paths and values are file contents. This keeps the directory layout explicit while staying within the LLM’s context window.
- Model Adaptation –
- Baseline prompting: a simple “Generate the DSL files for …” prompt.
- One‑shot in‑context: the prompt is prefixed with a single example of NL → JSON mapping.
- QLoRA fine‑tuning: a lightweight, parameter‑efficient fine‑tune that updates ≈0.5 % of the model’s parameters on the collected dataset.
- Evaluation –
- Similarity metrics (BLEU, CodeBLEU) for surface similarity.
- Task‑specific metrics: exact‑match of edits, structural fidelity (does the generated tree match the expected file layout?), and downstream compilation success.
- Human validation: 12 BMW developers rated the usefulness of generated artifacts; a subset was fed to the existing Xtext‑based code generator to confirm that the downstream Java/TS code still builds.
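The path‑preserving encoding described above can be sketched in a few lines. This is an illustrative reconstruction, not the paper's actual implementation; the `.dsl` suffix and the helper names are assumptions.

```python
import json
from pathlib import Path

def encode_repo(root: str, suffix: str = ".dsl") -> str:
    """Serialize every DSL file under `root` into one JSON object whose
    keys are POSIX-style relative paths and whose values are the file
    contents, keeping the directory layout explicit in a single string."""
    root_path = Path(root)
    files = {
        p.relative_to(root_path).as_posix(): p.read_text()
        for p in sorted(root_path.rglob(f"*{suffix}"))
    }
    return json.dumps(files, indent=2)

def decode_repo(blob: str, out_dir: str) -> None:
    """Materialize a model's JSON response back into a directory tree."""
    for rel_path, content in json.loads(blob).items():
        target = Path(out_dir) / rel_path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(content)
```

Because keys carry full relative paths, a single LLM response can create or edit files anywhere in the repository, and decoding is a lossless round trip.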
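The one‑shot in‑context strategy amounts to prefixing the request with a single worked NL → JSON example. A minimal sketch follows; the instruction wording and the example DSL snippet are assumptions for illustration, not the paper's exact prompt.

```python
import json

# Hypothetical worked example used as the in-context demonstration.
EXAMPLE_REQUEST = "Add a new vehicle type Truck with a payload property."
EXAMPLE_RESPONSE = {"vehicles/truck.dsl": "vehicletype Truck { payload: int }"}

def build_one_shot_prompt(request: str) -> str:
    """Assemble a one-shot prompt: task instruction, one NL-to-JSON
    demonstration, then the actual change request awaiting a response."""
    return (
        "Generate the DSL files for the following change request. "
        "Respond with a single JSON object mapping file paths to file contents.\n\n"
        f"Request: {EXAMPLE_REQUEST}\n"
        f"Response: {json.dumps(EXAMPLE_RESPONSE)}\n\n"
        f"Request: {request}\n"
        "Response:"
    )
```

The demonstration both fixes the output format (one JSON object) and shows the model what a path‑preserving answer looks like.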
Results & Findings
| Model / Config | Exact‑Match Accuracy | Edit Similarity (CodeBLEU) | Structural Fidelity |
|---|---|---|---|
| Qwen2.5‑Coder – Baseline | 38 % | 0.42 | 0.78 |
| Qwen2.5‑Coder – One‑shot | 45 % | 0.48 | 0.84 |
| Qwen2.5‑Coder – QLoRA | 71 % | 0.71 | 1.00 |
| DeepSeek‑Coder – Baseline | 34 % | 0.39 | 0.73 |
| DeepSeek‑Coder – One‑shot | 41 % | 0.45 | 0.80 |
| DeepSeek‑Coder – QLoRA | 68 % | 0.68 | 1.00 |
Key takeaways
- Fine‑tuning (QLoRA) yields the biggest jump: an absolute gain of more than 30 percentage points in exact‑match accuracy, along with perfect structural fidelity.
- One‑shot in‑context learning is a low‑cost win, consistently improving over the plain baseline.
- The generated DSL files successfully passed the downstream code generator in >90 % of cases after fine‑tuning, confirming functional correctness.
- Developers rated fine‑tuned outputs as “ready to commit” in 78 % of surveyed instances, versus 42 % for baseline.
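The two task‑specific metrics in the table above can be sketched over the JSON representation. The exact‑match definition is straightforward; the Jaccard formulation of structural fidelity is an assumption, since the summary does not spell out the paper's precise formula.

```python
def structural_fidelity(generated: dict, expected: dict) -> float:
    """Jaccard similarity between the generated and expected file-path
    sets: 1.0 means the generated tree reproduces the expected layout
    exactly, regardless of file contents."""
    gen, exp = set(generated), set(expected)
    if not gen and not exp:
        return 1.0
    return len(gen & exp) / len(gen | exp)

def exact_match(generated: dict, expected: dict) -> bool:
    """True only if every file path and its contents match exactly."""
    return generated == expected
```

Separating the two scores distinguishes "put the right files in the right places" from "got every edit byte‑for‑byte correct", which is why fidelity can hit 1.00 while exact‑match sits at 71 %.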
Practical Implications
- Accelerated DSL evolution – Teams can issue a single natural‑language change request and obtain a full, repository‑wide DSL update, cutting down manual editing cycles.
- Reduced onboarding friction – New engineers can prototype DSL changes without deep knowledge of the DSL’s file layout, relying on the model to preserve the correct hierarchy.
- Continuous integration friendliness – Because the JSON representation guarantees structural fidelity, generated changes can be automatically validated by existing build pipelines (e.g., running the Xtext code generator).
- Cost‑effective customization – Parameter‑efficient fine‑tuning (QLoRA) requires only a few hundred examples and modest GPU resources, making it feasible for many enterprises that already have a DSL in production.
- Template for other DSLs – The pipeline (dataset extraction → path‑preserving JSON → fine‑tune) can be replicated for any Xtext‑based or similar DSL, opening the door to LLM‑assisted development in automotive, finance, telecom, etc.
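The CI‑friendliness point can be made concrete: materialize the generated JSON into a scratch workspace and run the existing generator over it, treating a zero exit status as a pass. `generator_cmd` is a placeholder for whatever build entry point a team already has (e.g. a task wrapping the Xtext generator); this sketch only assumes it accepts the workspace path as its final argument.

```python
import json
import subprocess
import tempfile
from pathlib import Path

def validate_in_pipeline(blob: str, generator_cmd: list[str]) -> bool:
    """Write a generated JSON repository into a temporary workspace and
    invoke the downstream code generator on it; return True iff the
    generator exits cleanly."""
    with tempfile.TemporaryDirectory() as workspace:
        for rel_path, content in json.loads(blob).items():
            target = Path(workspace) / rel_path
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_text(content)
        result = subprocess.run(generator_cmd + [workspace])
        return result.returncode == 0
```

Because the check is just "files in, exit code out", it slots into any existing pipeline stage without touching the model.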
Limitations & Future Work
- Scale of context window – The approach works for DSL projects that fit within the 8 K‑token window of the 7‑B models; larger repositories would need chunking or hierarchical prompting.
- Dataset bias – The training data consists of historical changes from a single BMW project, which may limit generalization to DSLs with very different semantics or naming conventions.
- Model size ceiling – Only 7‑B models were evaluated; larger models could further improve quality but at higher inference cost.
- Human evaluation scope – The developer survey involved a relatively small group (12 participants); broader user studies would strengthen claims about usability.
- Future directions suggested by the authors include: exploring retrieval‑augmented generation for ultra‑large repositories, extending the method to bidirectional DSL ↔ code synchronization, and integrating automated test generation to close the loop between DSL changes and downstream system behavior.
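One simple mitigation for the context‑window limit is greedy chunking: pack files into batches whose estimated token count stays under the budget and prompt the model once per batch. This is a sketch of that idea, not the authors' method, and the 4‑characters‑per‑token estimate is a rough heuristic; a real pipeline would count tokens with the model's own tokenizer.

```python
def chunk_files(files: dict[str, str], budget_tokens: int = 8000) -> list[dict[str, str]]:
    """Greedily pack files into chunks whose estimated token count stays
    under `budget_tokens`, so each chunk fits one context window."""
    est = lambda text: len(text) // 4 + 1  # crude chars-to-tokens estimate
    chunks, current, used = [], {}, 0
    for path, content in files.items():
        cost = est(content)
        if current and used + cost > budget_tokens:
            chunks.append(current)
            current, used = {}, 0
        current[path] = content
        used += cost
    if current:
        chunks.append(current)
    return chunks
```

The trade‑off is that cross‑file dependencies spanning two chunks are no longer visible in one prompt, which is why the authors point toward retrieval‑augmented generation instead.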
Authors
- Sivajeet Chand
- Kevin Nguyen
- Peter Kuntz
- Alexander Pretschner
Paper Information
- arXiv ID: 2604.24678v1
- Categories: cs.SE, cs.AI
- Published: April 27, 2026