[Paper] Generative Digital Twins: Vision-Language Simulation Models for Executable Industrial Systems

Published: December 23, 2025
3 min read
Source: arXiv - 2512.20387v1

Overview

The paper introduces Vision‑Language Simulation Models (VLSMs) – a new class of AI systems that can turn a rough layout sketch and a natural‑language description into executable FlexScript code for industrial simulations. By bridging visual perception, language understanding, and code generation, the authors lay the groundwork for “generative digital twins” that can be created on‑the‑fly from informal inputs.

Key Contributions

  • Unified multimodal model that jointly processes sketches and textual prompts to output runnable simulation scripts.
  • Large‑scale dataset of 120 k+ prompt‑sketch‑code triples, the first publicly released resource for training generative digital twins.
  • Three task‑specific metrics – Structural Validity Rate (SVR), Parameter Match Rate (PMR), and Execution Success Rate (ESR) – to evaluate geometry, parameter fidelity, and actual simulator execution.
  • Extensive ablation study across vision backbones (e.g., ViT, ConvNeXt), connector architectures, and code‑pretrained language models (e.g., CodeBERT, StarCoder).
  • Near‑perfect structural accuracy (SVR ≈ 99.8 %) and high execution robustness (ESR > 92 %) on held‑out test sets.
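The three evaluation metrics are all simple pass rates over a test set. A minimal sketch of how such per-sample pass/fail checks aggregate into SVR, PMR, and ESR (the function name and input format are illustrative assumptions, not taken from the paper; the underlying topology comparison, parameter extraction, and sandboxed execution are assumed to happen elsewhere):

```python
# Illustrative aggregation of the paper's three metrics from per-sample
# boolean checks. Field names are hypothetical.

def aggregate_metrics(samples):
    """Each sample is a dict of boolean check results:
    'structure_ok' -> script matches the sketch topology (SVR)
    'params_ok'    -> numeric parameters match the prompt (PMR)
    'executes_ok'  -> script runs without errors in the simulator (ESR)
    """
    n = len(samples)
    return {
        "SVR": sum(s["structure_ok"] for s in samples) / n,
        "PMR": sum(s["params_ok"] for s in samples) / n,
        "ESR": sum(s["executes_ok"] for s in samples) / n,
    }

results = aggregate_metrics([
    {"structure_ok": True, "params_ok": True, "executes_ok": True},
    {"structure_ok": True, "params_ok": False, "executes_ok": False},
])
# Both samples pass the structural check, one passes the other two,
# so SVR = 1.0 and PMR = ESR = 0.5.
```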

Methodology

  1. Data collection – Engineers manually paired free‑hand layout sketches (CAD‑like line drawings) with concise English prompts and the corresponding FlexScript code that drives a standard industrial simulator.
  2. Model architecture
    • Vision encoder extracts a spatial embedding from the sketch.
    • Language encoder processes the natural‑language prompt.
    • A cross‑modal connector (either a simple concatenation + transformer or a cross‑attention module) fuses the two embeddings.
    • The fused representation feeds a code‑generation decoder (initialized from a code‑pretrained LLM) that emits FlexScript token by token.
  3. Training – The system is trained end‑to‑end with a mixed loss:
    • (i) token‑level cross‑entropy for code generation,
    • (ii) a structural consistency loss that penalizes mismatched geometry, and
    • (iii) a reinforcement‑style reward for successful execution in the simulator.
  4. Evaluation – The three bespoke metrics assess:
    • (i) whether the generated script respects the sketch’s topology (SVR),
    • (ii) whether numeric parameters (e.g., dimensions, speeds) match the prompt (PMR), and
    • (iii) whether the script runs without errors in the FlexScript interpreter (ESR).
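The mixed objective in training step 3 amounts to a weighted combination of the three signals. A toy sketch, assuming hypothetical lambda weights (the paper does not report its actual weighting; the execution reward is modeled here as a scalar subtracted from the loss):

```python
def mixed_loss(ce_loss, structural_loss, exec_reward,
               lam_struct=0.5, lam_exec=0.1):
    """Combine the three training signals described above.

    ce_loss         : token-level cross-entropy on the generated FlexScript
    structural_loss : penalty for geometry mismatched against the sketch
    exec_reward     : 1.0 if the script ran in the simulator, else 0.0
    The lambda weights are illustrative assumptions, not the paper's values.
    """
    # Reinforcement-style term: successful execution *lowers* the loss.
    return ce_loss + lam_struct * structural_loss - lam_exec * exec_reward

loss = mixed_loss(ce_loss=2.0, structural_loss=0.4, exec_reward=1.0)
# 2.0 + 0.5 * 0.4 - 0.1 * 1.0 = 2.1
```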

Results & Findings

| Model variant | SVR | PMR | ESR |
| --- | --- | --- | --- |
| ViT + Cross‑Attention + StarCoder | 99.8 % | 96.4 % | 93.2 % |
| ConvNeXt + Concat + CodeBERT | 98.9 % | 94.1 % | 89.7 % |
| Baseline (vision‑only) | 85.3 % | 71.2 % | 62.5 % |

  • Adding the language prompt consistently boosts parameter fidelity (PMR) and execution success (ESR).
  • Cross‑attention connectors outperform simple concatenation, especially for complex spatial relationships.
  • The model generalizes to unseen industrial domains (e.g., conveyor‑belt layouts) with only a modest drop in ESR (~4 %).
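The edge of cross-attention over concatenation comes from letting each language token weight sketch regions individually rather than fusing everything into one fixed vector. A minimal single-head sketch in plain Python (dimensions, values, and the single-head simplification are all illustrative; real connectors use batched multi-head attention with learned projections):

```python
import math

def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention, plain Python.

    queries      : language-token embeddings (list of d-dim vectors)
    keys, values : sketch-patch embeddings (lists of d-dim vectors)
    Returns one fused vector per query: a softmax-weighted mix of the
    sketch values, weighted by query-key similarity.
    """
    d = len(keys[0])
    fused = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]      # numerically stable softmax
        total = sum(exps)
        weights = [e / total for e in exps]
        fused.append([sum(w * v[i] for w, v in zip(weights, values))
                      for i in range(len(values[0]))])
    return fused

# One language query attending over two sketch patches: the patch whose
# key aligns with the query receives the larger weight.
out = cross_attention(queries=[[1.0, 0.0]],
                      keys=[[1.0, 0.0], [0.0, 1.0]],
                      values=[[10.0, 0.0], [0.0, 10.0]])
```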

Practical Implications

  • Rapid prototyping – Engineers can sketch a new production line on a tablet, describe it in a few sentences, and instantly obtain a runnable simulation, cutting weeks of manual scripting.
  • Design‑to‑simulation pipelines – CAD tools can embed VLSM APIs to auto‑generate test scenarios, enabling continuous verification as designs evolve.
  • Training simulators for RL agents – Synthetic digital twins can be mass‑produced to feed reinforcement‑learning pipelines for robotics or autonomous material handling.
  • Cross‑disciplinary collaboration – Non‑programmers (e.g., process engineers) can contribute directly to simulation models without learning FlexScript syntax.
  • Open‑source ecosystem – The released dataset and evaluation suite give the community a benchmark for future multimodal code‑generation research.

Limitations & Future Work

  • Domain specificity – The current dataset focuses on FlexScript and a limited set of industrial equipment; transferring to other simulators (e.g., ROS‑based) will require additional fine‑tuning.
  • Sketch quality sensitivity – Extremely noisy or ambiguous drawings still cause structural errors; robustness to hand‑drawn variations needs improvement.
  • Scalability of execution testing – ESR relies on running the generated script in a sandbox; scaling this to millions of samples is computationally expensive.
  • Future directions include extending VLSMs to 3‑D voxel or point‑cloud inputs, incorporating feedback loops where the simulator’s output refines the generated code, and exploring few‑shot adaptation to new simulation languages.

Authors

  • YuChe Hsu
  • AnJui Wang
  • TsaiChing Ni
  • YuanFu Yang

Paper Information

  • arXiv ID: 2512.20387v1
  • Categories: cs.AI, cs.CL, cs.CV
  • Published: December 23, 2025