[Paper] Chat2Workflow: A Benchmark for Generating Executable Visual Workflows with Natural Language
Source: arXiv:2604.19667v1
Overview
The Chat2Workflow paper investigates whether today’s large language models (LLMs) can turn plain‑language requests into executable visual workflows—the drag‑and‑drop pipelines that power many low‑code automation platforms (e.g., Dify, Coze). By releasing a benchmark built from real‑world business processes and a lightweight “agentic” framework that iteratively fixes execution errors, the authors expose the current gap between LLM capabilities and the reliability needed for production‑grade automation.
Key Contributions
- Chat2Workflow benchmark – more than 5,000 curated workflow instances drawn from real enterprise use cases, each paired with a natural‑language specification and a target visual representation that can be deployed directly.
- Agentic execution loop – a simple yet effective framework that lets an LLM self‑debug generated workflows by detecting runtime failures, requesting clarifications, and re‑generating corrected steps.
- Comprehensive evaluation – systematic testing of several state‑of‑the‑art LLMs (GPT‑4, Claude, LLaMA‑2, etc.) on intent capture, structural correctness, and end‑to‑end executability.
- Open‑source toolkit – code, data loaders, and conversion scripts for turning LLM output into platform‑specific JSON/YAML, enabling reproducibility and community extensions.
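To make the benchmark format concrete, a single instance might look like the sketch below. The field names and node types are illustrative assumptions, not the paper's published schema:

```python
# Hypothetical Chat2Workflow-style benchmark instance. Field names
# ("nl_spec", "workflow", node/edge shapes) are assumptions for illustration.
instance = {
    "nl_spec": "Collect user feedback and store it in a Google Sheet",
    "platform": "dify",
    "workflow": {
        "nodes": [
            {"id": "start", "type": "trigger", "params": {"event": "form_submitted"}},
            {"id": "extract", "type": "llm", "params": {"prompt": "Summarize the feedback"}},
            {"id": "store", "type": "google_sheets", "params": {"sheet_id": "FEEDBACK_SHEET"}},
        ],
        "edges": [["start", "extract"], ["extract", "store"]],
    },
}

# A minimal structural check: every edge must connect declared nodes.
node_ids = {n["id"] for n in instance["workflow"]["nodes"]}
valid = all(src in node_ids and dst in node_ids
            for src, dst in instance["workflow"]["edges"])
print(valid)  # True
```

The final check mirrors the paper's Structural Validity metric at its simplest: a workflow whose edges reference undeclared nodes cannot be deployed.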
Methodology
- Data collection – The authors mined publicly available workflow templates from Dify, Coze, and similar services, then annotated each with a concise natural‑language description of the business goal (e.g., “collect user feedback and store it in a Google Sheet”).
- Prompt design – For each benchmark instance, a single prompt is fed to an LLM asking it to produce a visual workflow in the target platform’s DSL (a JSON‑like structure).
- Agentic loop – After the first generation, the workflow is executed in a sandbox. If any node fails (missing parameters, type mismatches, etc.), the system extracts the error, feeds it back to the LLM together with the original request, and asks for a corrected version. This cycle repeats up to three times.
- Metrics –
  - Intent Accuracy – does the workflow's overall purpose match the NL description?
  - Structural Validity – are all required fields present and correctly typed?
  - Executable Rate – does the workflow run to completion without runtime errors?
  - Resolve Rate – proportion of initially failing workflows that become executable after the agentic loop.
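The generate–execute–repair cycle described above can be sketched as a short loop. The function names and the shapes of the `generate`/`execute` callbacks are assumptions; only the three-attempt cap comes from the paper:

```python
MAX_ATTEMPTS = 3  # the paper caps the repair cycle at three iterations

def agentic_generate(request, generate, execute, max_attempts=MAX_ATTEMPTS):
    """Generate a workflow, then repair it against sandbox errors.

    generate(request, error) -> workflow  (error is None on the first call)
    execute(workflow)        -> None on success, or an error message
    """
    error, workflow = None, None
    for _ in range(max_attempts):
        workflow = generate(request, error)   # (re)generate from NL + last error
        error = execute(workflow)             # run in the sandbox
        if error is None:
            return workflow, True             # executable
    return workflow, False                    # still failing after all repairs

# Toy stand-ins: the first draft omits a parameter; the repair adds it.
def fake_generate(request, error):
    wf = {"node": "google_sheets", "params": {}}
    if error:  # the sandbox error is fed back, so the draft can be patched
        wf["params"]["sheet_id"] = "FEEDBACK_SHEET"
    return wf

def fake_execute(wf):
    return None if "sheet_id" in wf["params"] else "missing parameter: sheet_id"

wf, ok = agentic_generate("store feedback in a sheet", fake_generate, fake_execute)
print(ok)  # True
```

In the paper the `generate` role is played by the LLM itself, with the original request and the extracted error message concatenated into the repair prompt.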
Results & Findings
| Model | Intent Accuracy | Structural Validity | Executable Rate (baseline) | Executable Rate Gain (agentic loop) |
|---|---|---|---|---|
| GPT‑4 | 87 % | 81 % | 62 % | +5.34 % (≈ 67 %) |
| Claude‑2 | 83 % | 78 % | 58 % | +4.9 % |
| LLaMA‑2‑70B | 71 % | 66 % | 45 % | +3.2 % |
| Open‑source baseline (BERT‑flow) | 58 % | 52 % | 31 % | +1.1 % |
- High‑level intent is often captured, but the models frequently miss low‑level details (parameter names, data‑type conversions) that cause runtime failures.
- The agentic loop consistently improves executability, yet even the best‑performing setup (GPT‑4 with the loop, ≈ 67 % executable) still leaves roughly a third of workflows non‑runnable.
- Complex or evolving requirements (e.g., “add conditional branching based on user role”) dramatically reduce success rates, highlighting the brittleness of current LLM reasoning over visual pipeline semantics.
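Reading the agentic column of the table as percentage‑point gains in executable rate (which the "≈ 67 %" note for GPT‑4 suggests), the Resolve Rate defined in the Metrics section can be back‑calculated for GPT‑4. This interpretation is an assumption, not stated explicitly in the summary above:

```python
# Relating the table's executable-rate gain to the Resolve Rate metric
# (fraction of initially failing workflows that the agentic loop fixes).
baseline_executable = 0.62   # GPT-4 executable rate before the agentic loop
gain = 0.0534                # +5.34 percentage points after the loop

initially_failing = 1.0 - baseline_executable   # 38 % of workflows fail at first
resolve_rate = gain / initially_failing         # share of those failures fixed
print(f"{resolve_rate:.1%}")  # prints 14.1%
```

In other words, a ~5‑point gain in overall executability corresponds to repairing only about one in seven of the workflows that failed initially.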
Practical Implications
- Rapid prototyping – Developers could generate a first‑draft workflow from a user story in seconds, cutting the manual drag‑and‑drop time that typically dominates low‑code projects.
- Automated ticket triage – Customer‑support bots could translate a textual request (“reset my password and notify the admin”) into a ready‑to‑run workflow, reducing hand‑off to engineers.
- Continuous integration for automation – The agentic framework can be embedded in CI pipelines to auto‑fix broken workflows after schema changes, keeping production automations stable.
- Platform‑agnostic tooling – Because the benchmark includes adapters for multiple visual‑workflow engines, the same LLM prompt can target Dify, Coze, or any future DSL with minimal re‑training.
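As one way the CI idea could look in practice, here is a minimal, hypothetical check that reports workflows the loop cannot auto‑repair; `execute` and `repair` are stand‑ins for the paper's sandbox and agentic fix, not its actual API:

```python
def ci_check(workflows, execute, repair, max_attempts=3):
    """Return the names of workflows still broken after auto-repair attempts."""
    broken = []
    for name, wf in workflows.items():
        for _ in range(max_attempts):
            error = execute(wf)           # run in the sandbox
            if error is None:
                break                     # workflow is executable
            wf = repair(wf, error)        # LLM-backed fix driven by the error
        else:
            broken.append(name)           # attempts exhausted without success
    return broken

# Toy usage: one workflow is repairable, one is not.
fixable = {"ok": False}
stubborn = {"ok": False, "unfixable": True}
execute = lambda wf: None if wf["ok"] else "node failed"
repair = lambda wf, err: wf if wf.get("unfixable") else {"ok": True}

broken = ci_check({"a": fixable, "b": stubborn}, execute, repair)
print(broken)  # ['b']
```

A real CI step would exit non‑zero whenever `broken` is non‑empty, blocking the merge until the workflow is fixed by hand.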
Limitations & Future Work
- Domain coverage – The benchmark focuses on business‑process automation; domains like data‑science pipelines or IoT orchestration are not represented.
- Error feedback granularity – The current sandbox only returns generic error messages; richer diagnostics could enable more precise self‑correction.
- Scalability of the agentic loop – Multiple iterations increase latency, which may be unacceptable for real‑time user interactions.
- Model alignment – Fine‑tuning LLMs on workflow‑specific code (e.g., DSL syntax trees) could bridge the gap between intent understanding and syntactic correctness.
Chat2Workflow shines a light on the promise—and the current limits—of using LLMs to automate visual workflow creation. As models get better at reasoning over structured pipelines and as richer feedback loops emerge, developers can look forward to a future where “write a sentence, get a production‑ready automation” becomes the norm.
Authors
- Yi Zhong
- Buqiang Xu
- Yijun Wang
- Zifei Shan
- Shuofei Qiao
- Guozhou Zheng
- Ningyu Zhang
Paper Information
- arXiv ID: 2604.19667v1
- Categories: cs.CL, cs.AI, cs.CV, cs.LG, cs.MA
- Published: April 21, 2026