[Paper] Leveraging LLMs for Multi-File DSL Code Generation: An Industrial Case Study

Published: April 27, 2026 at 12:38 PM EDT
5 min read
Source: arXiv - 2604.24678v1

Overview

This paper investigates how large language models (LLMs) can be used to generate and modify code written in a domain‑specific language (DSL) that lives across many files and directories. The authors conduct a real‑world case study at BMW, adapting two open‑source code‑oriented LLMs to produce repository‑scale DSL artifacts from a single natural‑language instruction, and they evaluate the approach with both automated metrics and a developer survey.

Key Contributions

  • End‑to‑end pipeline for turning an industrial DSL (built with Xtext) into a trainable, multi‑file generation task.
  • Path‑preserving JSON representation of the DSL folder hierarchy, enabling a single LLM response to cover an entire repository and capture cross‑file dependencies (a minimal sketch of this encoding follows the list).
  • Empirical comparison of three prompting strategies—baseline, one‑shot in‑context, and parameter‑efficient fine‑tuning (QLoRA)—across two 7‑B code LLMs (Qwen2.5‑Coder, DeepSeek‑Coder).
  • New evaluation metrics that go beyond BLEU/ROUGE, measuring exact edit correctness, structural fidelity of the generated file tree, and downstream code‑generator success.
  • Industrial validation through a developer survey and an execution check that runs the generated DSL through the existing Java/TypeScript code generator.
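
As a concrete illustration of the path‑preserving representation, the sketch below flattens a DSL folder into a single JSON object keyed by repository‑relative paths. This is a minimal sketch, not the authors' implementation; the `.mydsl` extension and the example paths are assumptions made purely for illustration.

```python
import json
from pathlib import Path

def repo_to_json(root: Path, suffix: str = ".mydsl") -> str:
    """Serialize every DSL file under `root` into one JSON object.

    Keys are repository-relative paths, so the directory layout stays
    explicit and a single LLM response can describe the whole repository.
    """
    files = {
        str(path.relative_to(root)): path.read_text(encoding="utf-8")
        for path in sorted(root.rglob(f"*{suffix}"))
    }
    return json.dumps(files, indent=2)

# Example output shape (paths and contents invented for illustration):
# {
#   "vehicles/sedan.mydsl": "vehicle Sedan { doors: 4 }",
#   "vehicles/common/types.mydsl": "enum FuelType { PETROL, ELECTRIC }"
# }
```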

Methodology

  1. Dataset Construction – The team extracted historical DSL changes from BMW’s repository, pairing each NL change request (e.g., “add a new vehicle type with X properties”) with the before/after DSL files.
  2. Multi‑File Task Encoding – Each change is serialized into a JSON object where keys are file paths and values are file contents. This keeps the directory layout explicit while staying within the LLM’s context window.
  3. Model Adaptation
    • Baseline prompting: a simple “Generate the DSL files for …” prompt.
    • One‑shot in‑context: the prompt is prefixed with a single example of NL → JSON mapping.
    • QLoRA fine‑tuning: a lightweight, parameter‑efficient fine‑tune (≈0.5 % of model weights) on the collected dataset (a configuration sketch follows this list).
  4. Evaluation
    • Similarity metrics (BLEU, CodeBLEU) for surface similarity.
    • Task‑specific metrics: exact‑match of edits, structural fidelity (does the generated file tree match the expected layout?), and downstream compilation success (a minimal metric sketch follows this list).
    • Human validation: 12 BMW developers rated the usefulness of generated artifacts; a subset was fed to the existing Xtext‑based code generator to confirm that the downstream Java/TS code still builds.
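
To make step 3's QLoRA setup more tangible, here is a minimal sketch using the Hugging Face transformers, peft, and bitsandbytes stack. The checkpoint name, LoRA rank, and target modules are illustrative assumptions and are not taken from the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_id = "Qwen/Qwen2.5-Coder-7B-Instruct"  # assumed checkpoint, for illustration only

# Load the base model in 4-bit precision (the "Q" in QLoRA).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config)

# Attach small trainable LoRA adapters; only these (a fraction of a percent
# of the weights) are updated during fine-tuning.
lora_config = LoraConfig(
    r=16,                      # assumed rank
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed targets
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # sanity check on the trainable fraction
```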
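And to make the task‑specific metrics in step 4 concrete, the following sketch shows one plausible formulation (not necessarily the authors' exact definitions): structural fidelity as overlap between the generated and expected file trees, and exact match as identical content for every path.

```python
def structural_fidelity(generated: dict[str, str], expected: dict[str, str]) -> float:
    """Fraction of file paths on which the two trees agree (Jaccard overlap)."""
    gen_paths, exp_paths = set(generated), set(expected)
    union = gen_paths | exp_paths
    return len(gen_paths & exp_paths) / len(union) if union else 1.0

def exact_match(generated: dict[str, str], expected: dict[str, str]) -> bool:
    """True only if every expected file exists with identical content and no extras."""
    return generated == expected

# Usage with the path-keyed JSON objects from the encoding step:
# score = structural_fidelity(json.loads(model_output), json.loads(ground_truth))
```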

Results & Findings

| Model / Config | Exact‑Match Accuracy | Edit Similarity (CodeBLEU) | Structural Fidelity |
|---|---|---|---|
| Qwen2.5‑Coder – Baseline | 38 % | 0.42 | 0.78 |
| Qwen2.5‑Coder – One‑shot | 45 % | 0.48 | 0.84 |
| Qwen2.5‑Coder – QLoRA | 71 % | 0.71 | 1.00 |
| DeepSeek‑Coder – Baseline | 34 % | 0.39 | 0.73 |
| DeepSeek‑Coder – One‑shot | 41 % | 0.45 | 0.80 |
| DeepSeek‑Coder – QLoRA | 68 % | 0.68 | 1.00 |

Key takeaways

  • Fine‑tuning (QLoRA) yields the biggest jump: an absolute gain of over 30 percentage points in exact‑match accuracy, plus perfect structural fidelity.
  • One‑shot in‑context learning is a low‑cost win, consistently improving over the plain baseline.
  • The generated DSL files successfully passed the downstream code generator in >90 % of cases after fine‑tuning, confirming functional correctness.
  • Developers rated fine‑tuned outputs as “ready to commit” in 78 % of surveyed instances, versus 42 % for baseline.

Practical Implications

  • Accelerated DSL evolution – Teams can issue a single natural‑language change request and obtain a full, repository‑wide DSL update, cutting down manual editing cycles.
  • Reduced onboarding friction – New engineers can prototype DSL changes without deep knowledge of the DSL’s file layout, relying on the model to preserve the correct hierarchy.
  • Continuous integration friendliness – Because the JSON representation guarantees structural fidelity, generated changes can be automatically validated by existing build pipelines (e.g., running the Xtext code generator).
  • Cost‑effective customization – Parameter‑efficient fine‑tuning (QLoRA) requires only a few hundred examples and modest GPU resources, making it feasible for many enterprises that already have a DSL in production.
  • Template for other DSLs – The pipeline (dataset extraction → path‑preserving JSON → fine‑tune) can be replicated for any Xtext‑based or similar DSL, opening the door to LLM‑assisted development in automotive, finance, telecom, etc.

Limitations & Future Work

  • Scale of context window – The approach works for DSL projects that fit within the 8 K‑token window of the 7‑B models; larger repositories would need chunking or hierarchical prompting.
  • Dataset bias – The training data consists of historical changes from a single BMW project, which may limit generalization to DSLs with very different semantics or naming conventions.
  • Model size ceiling – Only 7‑B models were evaluated; larger models could further improve quality but at higher inference cost.
  • Human evaluation scope – The developer survey involved a relatively small group (12 participants); broader user studies would strengthen claims about usability.
  • Future directions suggested by the authors include: exploring retrieval‑augmented generation for ultra‑large repositories, extending the method to bidirectional DSL ↔ code synchronization, and integrating automated test generation to close the loop between DSL changes and downstream system behavior.

Authors

  • Sivajeet Chand
  • Kevin Nguyen
  • Peter Kuntz
  • Alexander Pretschner

Paper Information

  • arXiv ID: 2604.24678v1
  • Categories: cs.SE, cs.AI
  • Published: April 27, 2026
