[Paper] Generative Digital Twins: Vision-Language Simulation Models for Executable Industrial Systems

Published: December 23, 2025
3 min read
Source: arXiv - 2512.20387v1

Overview

The paper introduces Vision‑Language Simulation Models (VLSMs) – a new class of AI systems that can turn a rough layout sketch and a natural‑language description into executable FlexScript code for industrial simulations. By bridging visual perception, language understanding, and code generation, the authors lay the groundwork for “generative digital twins” that can be created on‑the‑fly from informal inputs.

Key Contributions

  • Unified multimodal model that jointly processes sketches and textual prompts to output runnable simulation scripts.
  • Large‑scale dataset of 120 k+ prompt‑sketch‑code triples, the first publicly released resource for training generative digital twins.
  • Three task‑specific metrics – Structural Validity Rate (SVR), Parameter Match Rate (PMR), and Execution Success Rate (ESR) – to evaluate geometry, parameter fidelity, and actual simulator execution.
  • Extensive ablation study across vision backbones (e.g., ViT, ConvNeXt), connector architectures, and code‑pretrained language models (e.g., CodeBERT, StarCoder).
  • Near‑perfect structural accuracy (SVR ≈ 99.8 %) and high execution robustness (ESR > 92 %) on held‑out test sets.
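The three evaluation metrics are all simple pass rates over a test set. A minimal sketch of how such per-sample pass/fail checks aggregate into SVR, PMR, and ESR (the function name and input format are illustrative assumptions, not taken from the paper; the underlying topology comparison, parameter extraction, and sandboxed execution are assumed to happen elsewhere):

```python
# Illustrative aggregation of the paper's three metrics from per-sample
# boolean checks. Field names are hypothetical.

def aggregate_metrics(samples):
    """Each sample is a dict of boolean check results:
    'structure_ok' -> script matches the sketch topology (SVR)
    'params_ok'    -> numeric parameters match the prompt (PMR)
    'executes_ok'  -> script runs without errors in the simulator (ESR)
    """
    n = len(samples)
    return {
        "SVR": sum(s["structure_ok"] for s in samples) / n,
        "PMR": sum(s["params_ok"] for s in samples) / n,
        "ESR": sum(s["executes_ok"] for s in samples) / n,
    }

results = aggregate_metrics([
    {"structure_ok": True, "params_ok": True, "executes_ok": True},
    {"structure_ok": True, "params_ok": False, "executes_ok": False},
])
# Both samples pass the structural check, one passes the other two,
# so SVR = 1.0 and PMR = ESR = 0.5.
```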

Methodology

  1. Data collection – Engineers manually paired free‑hand layout sketches (CAD‑like line drawings) with concise English prompts and the corresponding FlexScript code that drives a standard industrial simulator.
  2. Model architecture
    • Vision encoder extracts a spatial embedding from the sketch.
    • Language encoder processes the natural‑language prompt.
    • A cross‑modal connector (either a simple concatenation + transformer or a cross‑attention module) fuses the two embeddings.
    • The fused representation feeds a code‑generation decoder (initialized from a code‑pretrained LLM) that emits FlexScript token by token.
  3. Training – The system is trained end‑to‑end with a mixed loss:
    • (i) token‑level cross‑entropy for code generation,
    • (ii) a structural consistency loss that penalizes mismatched geometry, and
    • (iii) a reinforcement‑style reward for successful execution in the simulator.
  4. Evaluation – The three bespoke metrics assess:
    • (i) whether the generated script respects the sketch’s topology (SVR),
    • (ii) whether numeric parameters (e.g., dimensions, speeds) match the prompt (PMR), and
    • (iii) whether the script runs without errors in the FlexScript interpreter (ESR).
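The mixed objective in training step 3 amounts to a weighted combination of the three signals. A toy sketch, assuming hypothetical lambda weights (the paper does not report its actual weighting; the execution reward is modeled here as a scalar subtracted from the loss):

```python
def mixed_loss(ce_loss, structural_loss, exec_reward,
               lam_struct=0.5, lam_exec=0.1):
    """Combine the three training signals described above.

    ce_loss         : token-level cross-entropy on the generated FlexScript
    structural_loss : penalty for geometry mismatched against the sketch
    exec_reward     : 1.0 if the script ran in the simulator, else 0.0
    The lambda weights are illustrative assumptions, not the paper's values.
    """
    # Reinforcement-style term: successful execution *lowers* the loss.
    return ce_loss + lam_struct * structural_loss - lam_exec * exec_reward

loss = mixed_loss(ce_loss=2.0, structural_loss=0.4, exec_reward=1.0)
# 2.0 + 0.5 * 0.4 - 0.1 * 1.0 = 2.1
```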

Results & Findings

| Model variant | SVR | PMR | ESR |
| --- | --- | --- | --- |
| ViT + Cross‑Attention + StarCoder | 99.8 % | 96.4 % | 93.2 % |
| ConvNeXt + Concat + CodeBERT | 98.9 % | 94.1 % | 89.7 % |
| Baseline (vision‑only) | 85.3 % | 71.2 % | 62.5 % |

  • Adding the language prompt consistently boosts parameter fidelity (PMR) and execution success (ESR).
  • Cross‑attention connectors outperform simple concatenation, especially for complex spatial relationships.
  • The model generalizes to unseen industrial domains (e.g., conveyor‑belt layouts) with only a modest drop in ESR (~4 %).
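The edge of cross-attention over concatenation comes from letting each language token weight sketch regions individually rather than fusing everything into one fixed vector. A minimal single-head sketch in plain Python (dimensions, values, and the single-head simplification are all illustrative; real connectors use batched multi-head attention with learned projections):

```python
import math

def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention, plain Python.

    queries      : language-token embeddings (list of d-dim vectors)
    keys, values : sketch-patch embeddings (lists of d-dim vectors)
    Returns one fused vector per query: a softmax-weighted mix of the
    sketch values, weighted by query-key similarity.
    """
    d = len(keys[0])
    fused = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]      # numerically stable softmax
        total = sum(exps)
        weights = [e / total for e in exps]
        fused.append([sum(w * v[i] for w, v in zip(weights, values))
                      for i in range(len(values[0]))])
    return fused

# One language query attending over two sketch patches: the patch whose
# key aligns with the query receives the larger weight.
out = cross_attention(queries=[[1.0, 0.0]],
                      keys=[[1.0, 0.0], [0.0, 1.0]],
                      values=[[10.0, 0.0], [0.0, 10.0]])
```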

Practical Implications

  • Rapid prototyping – Engineers can sketch a new production line on a tablet, describe it in a few sentences, and instantly obtain a runnable simulation, cutting weeks of manual scripting.
  • Design‑to‑simulation pipelines – CAD tools can embed VLSM APIs to auto‑generate test scenarios, enabling continuous verification as designs evolve.
  • Training simulators for RL agents – Synthetic digital twins can be mass‑produced to feed reinforcement‑learning pipelines for robotics or autonomous material handling.
  • Cross‑disciplinary collaboration – Non‑programmers (e.g., process engineers) can contribute directly to simulation models without learning FlexScript syntax.
  • Open‑source ecosystem – The released dataset and evaluation suite give the community a benchmark for future multimodal code‑generation research.

Limitations & Future Work

  • Domain specificity – The current dataset focuses on FlexScript and a limited set of industrial equipment; transferring to other simulators (e.g., ROS‑based) will require additional fine‑tuning.
  • Sketch quality sensitivity – Extremely noisy or ambiguous drawings still cause structural errors; robustness to hand‑drawn variations needs improvement.
  • Scalability of execution testing – ESR relies on running the generated script in a sandbox; scaling this to millions of samples is computationally expensive.
  • Future directions include extending VLSMs to 3‑D voxel or point‑cloud inputs, incorporating feedback loops where the simulator’s output refines the generated code, and exploring few‑shot adaptation to new simulation languages.

Authors

  • YuChe Hsu
  • AnJui Wang
  • TsaiChing Ni
  • YuanFu Yang

Paper Information

  • arXiv ID: 2512.20387v1
  • Categories: cs.AI, cs.CL, cs.CV
  • Published: December 23, 2025