[Paper] A Generalizable Framework for Building Executable Domain-Specific LLMs under Data Scarcity: Demonstration on Semiconductor TCAD Simulation

Published: January 15, 2026 at 02:13 AM EST

Source: arXiv - 2601.10128v1

Overview

The paper introduces a schema‑first alignment framework that lets developers create compact, domain‑specific large language models (LLMs) capable of generating executable code even when only a handful of real examples exist. The authors validate the approach by building TcadGPT, an LLM that writes correct TCAD (Technology Computer‑Aided Design) simulation scripts, and they show the same recipe works for a finite‑element solver (Elmer).

Key Contributions

  • Synthetic QA generation from expert docs – a pipeline that automatically turns manuals, standards, and reference guides into 1.5 M question‑answer pairs, giving the model a solid “knowledge base” without manual labeling.
  • Code‑centric IR → DPO workflow – converts verified tool decks into an intermediate representation (IR), diversifies it while preserving semantics, and creates preference pairs for Direct Preference Optimization (DPO) to directly reward syntactic validity and tool‑compilability.
  • Controlled RAG evaluation – demonstrates that Retrieval‑Augmented Generation helps generic LLMs but can slightly hurt already domain‑aligned models, highlighting the importance of proper alignment.
  • Empirical validation on two domains – TCAD (semiconductor device simulation) and Elmer (open‑source FEM solver), achieving large gains over state‑of‑the‑art general models (e.g., GPT‑4o).
  • Open‑source release – all datasets, benchmarks, and code (including the P1, P2, and IR → DPO modules) are publicly available for reproducibility.

Methodology

  1. Schema‑First Data Synthesis

    • Extract structured knowledge (tables, parameter definitions, command syntax) from vendor manuals and research papers.
    • Use prompt engineering to turn each schema entry into a QA pair (e.g., “What is the default doping concentration for a p‑type region?” → answer).
    • This yields a massive, low‑cost synthetic corpus that teaches the model the vocabulary and concepts of the domain (a minimal sketch of this step appears after this list).
  2. Intermediate Representation (IR) & Diversification

    • Take a verified TCAD deck (a script that successfully runs in the simulation tool) and parse it into a language‑agnostic IR that captures the logical flow (mesh creation → material assignment → biasing).
    • Apply equivalence‑preserving transformations (e.g., reordering independent statements, renaming variables) to generate many semantically identical but syntactically diverse variants (see the second sketch after this list).
  3. Direct Preference Optimization (DPO)

    • For each original‑IR script, pair it with a less‑optimal variant (e.g., missing a required flag).
    • Train the LLM with DPO so it learns to prefer the higher‑quality, executable version when given the same natural‑language instruction (see the third sketch after this list).
  4. Retrieval‑Augmented Generation (RAG) Study

    • Compare three setups: (a) vanilla LLM, (b) LLM + RAG, (c) domain‑aligned LLM + RAG.
    • Measure semantic correctness and syntax‑pass rates on a held‑out test suite (a scoring sketch also follows the list).
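
To make step 1 concrete, here is a minimal, illustrative sketch of schema‑first QA synthesis. The schema fields (command, parameter, default, description) and the fixed templates are assumptions chosen so the example runs standalone; the paper expands schema entries with prompt engineering over an LLM rather than hard‑coded templates.

```python
# Hypothetical sketch of schema-first QA synthesis (step 1).
# Field names are illustrative; the paper's actual schema layout may differ.
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class SchemaEntry:
    command: str        # e.g. a TCAD deck command extracted from the manual
    parameter: str      # parameter name defined for that command
    default: str        # documented default value
    description: str    # prose description extracted from the manual


def entry_to_qa_pairs(entry: SchemaEntry) -> List[Dict[str, str]]:
    """Turn one structured schema entry into several QA pairs.

    In the paper this expansion is LLM-driven; fixed templates are used
    here only so the sketch is self-contained.
    """
    return [
        {
            "question": f"What is the default value of '{entry.parameter}' "
                        f"in the '{entry.command}' command?",
            "answer": entry.default,
        },
        {
            "question": f"What does the '{entry.parameter}' parameter of "
                        f"'{entry.command}' control?",
            "answer": entry.description,
        },
    ]


# Example usage with a made-up schema entry:
entry = SchemaEntry(
    command="doping",
    parameter="concentration",
    default="1e17 cm^-3",
    description="Peak doping concentration assigned to the selected region.",
)
for qa in entry_to_qa_pairs(entry):
    print(qa["question"], "->", qa["answer"])
```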
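
Second, a toy sketch of step 2: a verified deck parsed into a list‑of‑statements IR, followed by two equivalence‑preserving transformations (reordering independent statements, consistently renaming identifiers). The IR record format, the statement names, and the dependency rule are simplified assumptions, not the paper's actual representation.

```python
# Hypothetical sketch of IR diversification (step 2).
import random
from typing import Dict, List, Tuple

# One IR statement = (operation name, keyword arguments).
IRStatement = Tuple[str, Dict]

# A toy "verified deck" parsed into a language-agnostic IR:
# mesh creation -> material assignment -> biasing.
base_ir: List[IRStatement] = [
    ("create_mesh", {"nx": 100, "ny": 50}),
    ("assign_material", {"region": "r1", "material": "Silicon"}),
    ("assign_material", {"region": "r2", "material": "Oxide"}),
    ("set_bias", {"contact": "gate", "voltage": 1.0}),
]


def shuffle_independent(ir: List[IRStatement], rng: random.Random) -> List[IRStatement]:
    """Reorder adjacent statements that have no data dependency.

    Only consecutive 'assign_material' statements on different regions are
    treated as independent here -- a deliberately simple rule.
    """
    ir = list(ir)
    for i in range(len(ir) - 1):
        a, b = ir[i], ir[i + 1]
        if (a[0] == b[0] == "assign_material"
                and a[1]["region"] != b[1]["region"]
                and rng.random() < 0.5):
            ir[i], ir[i + 1] = b, a
    return ir


def rename_regions(ir: List[IRStatement]) -> List[IRStatement]:
    """Rename region identifiers consistently across the whole deck."""
    mapping: Dict[str, str] = {}
    renamed = []
    for op, kwargs in ir:
        kwargs = dict(kwargs)
        if "region" in kwargs:
            mapping.setdefault(kwargs["region"], f"dev_region_{len(mapping)}")
            kwargs["region"] = mapping[kwargs["region"]]
        renamed.append((op, kwargs))
    return renamed


rng = random.Random(0)
variant = rename_regions(shuffle_independent(base_ir, rng))
print(variant)  # semantically equivalent, syntactically different deck
```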
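
Third, a sketch of how step 3's preference pairs could be assembled: a verified, executable deck is paired with a corrupted variant (here, one that drops a required statement) in the prompt/chosen/rejected layout commonly used for DPO training data. The toy deck syntax and the corruption rule are illustrative assumptions; the paper confirms executability by running the actual tool.

```python
# Hypothetical sketch of DPO preference-pair construction (step 3).
from typing import Dict, List, Tuple

IRStatement = Tuple[str, Dict]


def ir_to_deck(ir: List[IRStatement]) -> str:
    """Serialize the IR into a plain-text deck (toy syntax, not a real tool's)."""
    return "\n".join(
        op + " " + " ".join(f"{k}={v}" for k, v in kwargs.items())
        for op, kwargs in ir
    )


def make_preference_pair(instruction: str,
                         good_ir: List[IRStatement],
                         drop_index: int) -> Dict[str, str]:
    """Pair a verified, executable deck with a corrupted variant for DPO."""
    bad_ir = [stmt for i, stmt in enumerate(good_ir) if i != drop_index]
    return {
        "prompt": instruction,
        "chosen": ir_to_deck(good_ir),    # version that runs in the tool
        "rejected": ir_to_deck(bad_ir),   # e.g. missing the mesh-creation step
    }


good_ir: List[IRStatement] = [
    ("create_mesh", {"nx": 100, "ny": 50}),
    ("assign_material", {"region": "r1", "material": "Silicon"}),
    ("set_bias", {"contact": "gate", "voltage": 1.0}),
]
pair = make_preference_pair(
    "Create a 100x50 mesh, assign silicon to region r1, and bias the gate at 1 V.",
    good_ir,
    drop_index=0,
)
print(pair["chosen"])
print("---")
print(pair["rejected"])
```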
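
Finally, a small scoring sketch for the step‑4 evaluation protocol: each setup is scored on semantic correctness and syntax‑pass rate over a held‑out suite. The per‑example record layout and the toy outcomes are assumptions for illustration only, not the paper's benchmark harness.

```python
# Hypothetical sketch of the evaluation protocol (step 4).
from statistics import mean


def score_setup(results):
    """results: list of dicts with boolean 'semantic_ok' / 'compiles' flags."""
    return {
        "semantic_accuracy": mean(r["semantic_ok"] for r in results),
        "syntax_pass_rate": mean(r["compiles"] for r in results),
    }


# Toy outcomes for the three compared setups (illustrative values only).
setups = {
    "vanilla_llm": [{"semantic_ok": True, "compiles": False},
                    {"semantic_ok": False, "compiles": False}],
    "llm_plus_rag": [{"semantic_ok": True, "compiles": True},
                     {"semantic_ok": False, "compiles": True}],
    "aligned_llm_plus_rag": [{"semantic_ok": True, "compiles": True},
                             {"semantic_ok": True, "compiles": True}],
}
for name, results in setups.items():
    print(name, score_setup(results))
```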

Results & Findings

| Model | Semantic Accuracy | Syntax‑Pass (Executable) |
| --- | --- | --- |
| GPT‑4o (baseline) | 68.2 % | 55.1 % |
| TcadGPT (synthetic QA only) | 78.4 % | 71.3 % |
| TcadGPT (full IR → DPO) | 85.6 % | 80.0 % |
| Elmer‑GPT (same pipeline) | 82.1 % | 76.5 % |

  • Synthetic QA alone already lifts performance dramatically, confirming that domain knowledge can be injected without hand‑curated data.
  • IR‑driven DPO adds a further boost of roughly 7–9 percentage points in both semantic and syntactic metrics, showing that directly optimizing for executability is more effective than generic instruction‑following loss functions.
  • RAG improves the baseline GPT‑4o (+4 % semantic) but decreases TcadGPT’s performance by ~1 % when the model is already tightly aligned, suggesting diminishing returns for retrieval once the model internalizes the schema.

Practical Implications

  • Rapid prototyping of domain‑specific assistants – Engineers can spin up a “code‑writing” LLM for any tool that has a well‑documented command set (e.g., CAD, CFD, circuit simulators) using only manuals and a few verified scripts.
  • Reduced reliance on costly annotation – The synthetic QA pipeline eliminates the need for large human‑curated datasets, cutting onboarding time from months to weeks.
  • Higher reliability in production pipelines – Because the model is explicitly trained to output compilable scripts, it can be integrated into CI/CD for simulation jobs, automatically generating or tweaking decks based on high‑level design intents (see the sketch after this list).
  • Portability across domains – The same framework worked for an open‑source FEM solver, indicating that any engineering stack with a deterministic execution engine can benefit.
  • Open‑source ecosystem boost – With the released IR schema and DPO code, the community can contribute domain adapters, expanding the library of executable LLMs.
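
As a rough illustration of the CI/CD point above, the sketch below gates an LLM‑generated deck on an executability check before it enters the simulation pipeline. The command name tcad_tool and its --check flag are placeholders, not a real CLI; a real pipeline would invoke the actual simulator's syntax or dry‑run mode.

```python
# Minimal sketch of a CI/CD executability gate for generated decks.
# "tcad_tool --check" is a placeholder command, not a real interface.
import subprocess
import sys
import tempfile


def deck_compiles(deck_text: str, timeout_s: int = 60) -> bool:
    """Return True if the simulator accepts the deck without errors."""
    with tempfile.NamedTemporaryFile("w", suffix=".cmd", delete=False) as f:
        f.write(deck_text)
        path = f.name
    try:
        result = subprocess.run(
            ["tcad_tool", "--check", path],   # placeholder tool invocation
            capture_output=True,
            timeout=timeout_s,
        )
        return result.returncode == 0
    except (FileNotFoundError, subprocess.TimeoutExpired):
        return False


if __name__ == "__main__":
    generated_deck = sys.stdin.read()          # deck produced by the LLM
    sys.exit(0 if deck_compiles(generated_deck) else 1)
```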

Limitations & Future Work

  • Dependence on a stable IR – The approach assumes the target tool can be parsed into a lossless intermediate representation; tools with highly dynamic or undocumented syntax may need custom parsers.
  • Synthetic data bias – While large, the QA set mirrors the style of the source manuals; edge‑case behaviors not covered in documentation may still be missed.
  • Scalability of verification – Generating preference pairs requires running the tool to confirm executability, which can be expensive for large‑scale simulations.
  • Future directions suggested by the authors include:
    1. Automating IR extraction for black‑box tools.
    2. Incorporating reinforcement learning from real simulation outcomes (e.g., convergence metrics).
    3. Exploring multi‑modal inputs (figures, schematics) to enrich the knowledge base.

Authors

  • Di Wang
  • Zhenhua Wu
  • Yu Liu
  • Kai Chang
  • Shaohua Wu

Paper Information

  • arXiv ID: 2601.10128v1
  • Categories: cs.CE, cond-mat.mtrl-sci, cs.SE
  • Published: January 15, 2026
  • PDF: Download PDF