[Paper] Leveraging LLMs for Multi-File DSL Code Generation: An Industrial Case Study

Published: April 27, 2026 at 12:38 PM EDT
5 min read
Source: arXiv - 2604.24678v1

Overview

This paper investigates how large language models (LLMs) can be used to generate and modify code written in a domain‑specific language (DSL) that lives across many files and directories. The authors conduct a real‑world case study at BMW, adapting two open‑source code‑oriented LLMs to produce repository‑scale DSL artifacts from a single natural‑language instruction, and they evaluate the approach with both automated metrics and a developer survey.

Key Contributions

  • End‑to‑end pipeline for turning an industrial DSL (built with Xtext) into a trainable, multi‑file generation task.
  • Path‑preserving JSON representation of the DSL folder hierarchy, enabling a single LLM response to cover an entire repository and capture cross‑file dependencies (a minimal sketch of this encoding follows the list).
  • Empirical comparison of three prompting strategies—baseline, one‑shot in‑context, and parameter‑efficient fine‑tuning (QLoRA)—across two 7‑B code LLMs (Qwen2.5‑Coder, DeepSeek‑Coder).
  • New evaluation metrics that go beyond BLEU/ROUGE, measuring exact edit correctness, structural fidelity of the generated file tree, and downstream code‑generator success.
  • Industrial validation through a developer survey and an execution check that runs the generated DSL through the existing Java/TypeScript code generator.
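
As a concrete illustration of the path‑preserving representation, the sketch below flattens a DSL folder into a single JSON object keyed by repository‑relative paths. This is a minimal sketch, not the authors' implementation; the `.mydsl` extension and the example paths are assumptions made purely for illustration.

```python
import json
from pathlib import Path

def repo_to_json(root: Path, suffix: str = ".mydsl") -> str:
    """Serialize every DSL file under `root` into one JSON object.

    Keys are repository-relative paths, so the directory layout stays
    explicit and a single LLM response can describe the whole repository.
    """
    files = {
        str(path.relative_to(root)): path.read_text(encoding="utf-8")
        for path in sorted(root.rglob(f"*{suffix}"))
    }
    return json.dumps(files, indent=2)

# Example output shape (paths and contents invented for illustration):
# {
#   "vehicles/sedan.mydsl": "vehicle Sedan { doors: 4 }",
#   "vehicles/common/types.mydsl": "enum FuelType { PETROL, ELECTRIC }"
# }
```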

Methodology

  1. Dataset Construction – The team extracted historical DSL changes from BMW’s repository, pairing each NL change request (e.g., “add a new vehicle type with X properties”) with the before/after DSL files.
  2. Multi‑File Task Encoding – Each change is serialized into a JSON object where keys are file paths and values are file contents. This keeps the directory layout explicit while staying within the LLM’s context window.
  3. Model Adaptation
    • Baseline prompting: a simple “Generate the DSL files for …” prompt.
    • One‑shot in‑context: the prompt is prefixed with a single example of NL → JSON mapping.
    • QLoRA fine‑tuning: a lightweight, parameter‑efficient fine‑tune (≈0.5 % of model weights) on the collected dataset (a configuration sketch follows this list).
  4. Evaluation
    • Similarity metrics (BLEU, CodeBLEU) for surface similarity.
    • Task‑specific metrics: exact‑match of edits, structural fidelity (does the generated file tree match the expected layout?), and downstream compilation success (a minimal metric sketch follows this list).
    • Human validation: 12 BMW developers rated the usefulness of generated artifacts; a subset was fed to the existing Xtext‑based code generator to confirm that the downstream Java/TS code still builds.
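
To make step 3's QLoRA setup more tangible, here is a minimal sketch using the Hugging Face transformers, peft, and bitsandbytes stack. The checkpoint name, LoRA rank, and target modules are illustrative assumptions and are not taken from the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_id = "Qwen/Qwen2.5-Coder-7B-Instruct"  # assumed checkpoint, for illustration only

# Load the base model in 4-bit precision (the "Q" in QLoRA).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config)

# Attach small trainable LoRA adapters; only these (a fraction of a percent
# of the weights) are updated during fine-tuning.
lora_config = LoraConfig(
    r=16,                      # assumed rank
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed targets
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # sanity check on the trainable fraction
```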
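And to make the task‑specific metrics in step 4 concrete, the following sketch shows one plausible formulation (not necessarily the authors' exact definitions): structural fidelity as overlap between the generated and expected file trees, and exact match as identical content for every path.

```python
def structural_fidelity(generated: dict[str, str], expected: dict[str, str]) -> float:
    """Fraction of file paths on which the two trees agree (Jaccard overlap)."""
    gen_paths, exp_paths = set(generated), set(expected)
    union = gen_paths | exp_paths
    return len(gen_paths & exp_paths) / len(union) if union else 1.0

def exact_match(generated: dict[str, str], expected: dict[str, str]) -> bool:
    """True only if every expected file exists with identical content and no extras."""
    return generated == expected

# Usage with the path-keyed JSON objects from the encoding step:
# score = structural_fidelity(json.loads(model_output), json.loads(ground_truth))
```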

Results & Findings

| Model / Config | Exact‑Match Accuracy | Edit Similarity (CodeBLEU) | Structural Fidelity |
|---|---|---|---|
| Qwen2.5‑Coder – Baseline | 38 % | 0.42 | 0.78 |
| Qwen2.5‑Coder – One‑shot | 45 % | 0.48 | 0.84 |
| Qwen2.5‑Coder – QLoRA | 71 % | 0.71 | 1.00 |
| DeepSeek‑Coder – Baseline | 34 % | 0.39 | 0.73 |
| DeepSeek‑Coder – One‑shot | 41 % | 0.45 | 0.80 |
| DeepSeek‑Coder – QLoRA | 68 % | 0.68 | 1.00 |

Key takeaways

  • Fine‑tuning (QLoRA) yields the biggest jump: an absolute gain of over 30 percentage points in exact‑match accuracy, plus perfect structural fidelity.
  • One‑shot in‑context learning is a low‑cost win, consistently improving over the plain baseline.
  • The generated DSL files successfully passed the downstream code generator in >90 % of cases after fine‑tuning, confirming functional correctness.
  • Developers rated fine‑tuned outputs as “ready to commit” in 78 % of surveyed instances, versus 42 % for baseline.

Practical Implications

  • Accelerated DSL evolution – Teams can issue a single natural‑language change request and obtain a full, repository‑wide DSL update, cutting down manual editing cycles.
  • Reduced onboarding friction – New engineers can prototype DSL changes without deep knowledge of the DSL’s file layout, relying on the model to preserve the correct hierarchy.
  • Continuous integration friendliness – Because the JSON representation guarantees structural fidelity, generated changes can be automatically validated by existing build pipelines (e.g., running the Xtext code generator).
  • Cost‑effective customization – Parameter‑efficient fine‑tuning (QLoRA) requires only a few hundred examples and modest GPU resources, making it feasible for many enterprises that already have a DSL in production.
  • Template for other DSLs – The pipeline (dataset extraction → path‑preserving JSON → fine‑tune) can be replicated for any Xtext‑based or similar DSL, opening the door to LLM‑assisted development in automotive, finance, telecom, etc.

Limitations & Future Work

  • Scale of context window – The approach works for DSL projects that fit within the 8 K‑token window of the 7‑B models; larger repositories would need chunking or hierarchical prompting.
  • Dataset bias – The training data consists of historical changes from a single BMW project, which may limit generalization to DSLs with very different semantics or naming conventions.
  • Model size ceiling – Only 7‑B models were evaluated; larger models could further improve quality but at higher inference cost.
  • Human evaluation scope – The developer survey involved a relatively small group (12 participants); broader user studies would strengthen claims about usability.
  • Future directions suggested by the authors include: exploring retrieval‑augmented generation for ultra‑large repositories, extending the method to bidirectional DSL ↔ code synchronization, and integrating automated test generation to close the loop between DSL changes and downstream system behavior.

Authors

  • Sivajeet Chand
  • Kevin Nguyen
  • Peter Kuntz
  • Alexander Pretschner

Paper Information

  • arXiv ID: 2604.24678v1
  • Categories: cs.SE, cs.AI
  • Published: April 27, 2026
