[Paper] World of Workflows: a Benchmark for Bringing World Models to Enterprise Systems
Source: arXiv - 2601.22130v1
Overview
The paper introduces World of Workflows (WoW), a realistic ServiceNow‑based sandbox that mimics the hidden, inter‑dependent processes found in large enterprises. By coupling this environment with a 234‑task benchmark (WoW‑bench), the authors expose a critical blind spot in today’s frontier large language models (LLMs): the inability to anticipate and respect the cascading side‑effects of actions inside opaque enterprise systems.
Key Contributions
- WoW Environment: A fully‑featured ServiceNow instance containing >4,000 business rules and 55 active, hidden workflows that drive state changes across multiple databases.
- WoW‑bench Benchmark: 234 carefully crafted tasks that require agents to (a) complete constrained user requests and (b) model the underlying system dynamics to avoid silent violations.
- Empirical Diagnosis: Systematic evaluation of several state‑of‑the‑art LLM agents, revealing a pervasive “dynamics blindness” – agents repeatedly miss invisible, cascading effects.
- Design Insight: Argues for a new paradigm where enterprise agents must learn and simulate hidden system dynamics rather than rely solely on surface‑level observations.
- Open‑Source Release: Full code, environment setup scripts, and evaluation pipelines are made publicly available on GitHub.
Methodology
- Environment Construction – The authors built a ServiceNow tenant populated with realistic business objects (incidents, change requests, CMDB entries) and wired them together with thousands of declarative business rules and workflow automations that are not directly observable to an external agent.
- Task Generation – Each benchmark task mimics a typical employee request (e.g., “reset a user’s VPN access”) but is deliberately designed so that the correct answer depends on hidden workflow outcomes (e.g., a downstream approval process that may reject the request).
- Agent Interface – LLM agents interact with WoW through a limited API (search, read, write) that mirrors the restricted UI a real chatbot would have. No internal state dump is provided.
- Evaluation Metrics:
  - Task Success Rate – did the agent achieve the visible goal?
  - Constraint Violation Rate – did the agent trigger any hidden rule violations (detected post hoc by the environment)?
  - Dynamics Prediction Accuracy – how accurately the agent predicts the next hidden state transition given an action.
- Model Baselines – The study tests several leading LLMs (GPT‑4, Claude‑2, Llama‑2‑70B) both zero‑shot and with few‑shot prompting, as well as a simple rule‑based baseline.
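The three metrics above can be computed from per-task outcome records. The sketch below is a minimal, hypothetical evaluator (the `Episode` fields and `score` function are illustrative names, not the paper's actual harness), assuming the environment reports post-hoc violations and the ground-truth next hidden state for each episode:

```python
from dataclasses import dataclass

@dataclass
class Episode:
    goal_achieved: bool        # visible task goal reached
    hidden_violations: int     # rule violations detected post hoc by the environment
    predicted_next_state: str  # agent's guess at the next hidden state
    actual_next_state: str     # ground truth transition from the environment

def score(episodes: list[Episode]) -> dict[str, float]:
    """Aggregate the three benchmark-style metrics over a batch of episodes."""
    n = len(episodes)
    return {
        "task_success_rate": sum(e.goal_achieved for e in episodes) / n,
        "constraint_violation_rate": sum(e.hidden_violations > 0 for e in episodes) / n,
        "dynamics_prediction_accuracy": sum(
            e.predicted_next_state == e.actual_next_state for e in episodes
        ) / n,
    }
```

Keeping success and violations as separate metrics matters: an agent can "succeed" at the visible goal while silently breaching hidden rules, which is exactly the failure mode the benchmark targets.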
Results & Findings
| Model | Task Success (%) | Constraint Violations (%) | Dynamics Prediction (%) |
|---|---|---|---|
| GPT‑4 (zero‑shot) | 58 | 42 | 31 |
| GPT‑4 (few‑shot) | 63 | 38 | 35 |
| Claude‑2 | 55 | 45 | 28 |
| Llama‑2‑70B | 48 | 51 | 22 |
| Rule‑based baseline | 34 | 62 | 15 |
- Dynamics Blindness: Even the strongest LLMs missed hidden side‑effects in roughly 40% of attempts, leading to silent policy breaches that would be costly in a real enterprise.
- Grounded Simulation Helps: Adding a lightweight "world‑model" module that predicts hidden state transitions improved dynamics prediction accuracy by roughly 10 points and cut the violation rate by roughly 5 percentage points.
- Few‑shot Prompting Provides Marginal Gains: Providing examples of workflow reasoning improves success modestly but does not close the core observability gap.
Practical Implications
- Enterprise Chatbots Need Internal Simulators: Deploying LLM‑powered assistants on platforms like ServiceNow, Salesforce, or SAP should include a component that learns the platform’s business rules and can run “what‑if” simulations before committing changes.
- Safety‑First Deployment Pipelines: Organizations must instrument hidden‑state monitors (audit logs, rule‑engine hooks) to catch silent violations that LLM agents might cause.
- Developer Tooling: The WoW repo can serve as a sandbox for testing custom prompting strategies, fine‑tuning on workflow logs, or integrating reinforcement‑learning‑from‑human‑feedback (RLHF) loops that reward correct dynamics prediction.
- Cost Savings: By catching cascading errors early, companies can avoid downstream ticket floods, compliance breaches, and costly rollbacks that typically arise from “good‑enough” automation.
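The "what‑if simulation before committing changes" pattern can be sketched as a guard around every write. The function below is a hypothetical illustration (the `simulate` and `commit` callbacks are assumed interfaces, not part of any real platform API): the simulator predicts the hidden consequences of a proposed change, and the write is refused if any violation is predicted:

```python
def guarded_write(record: dict, change: dict, simulate, commit) -> bool:
    """Simulate a proposed change before committing it.

    `simulate` maps a candidate record state to a dict of predicted effects
    (including a "violations" list); `commit` applies the change for real.
    Returns True if the change was committed, False if it was blocked.
    """
    predicted = simulate({**record, **change})  # what-if: apply change in simulation only
    if predicted.get("violations"):
        return False  # refuse rather than cause a silent policy breach
    commit(record, change)
    return True
```

The key design choice is that the agent never writes directly: every state change first passes through the learned or rule‑derived simulator, turning "dynamics blindness" from a silent failure into an explicit refusal that can be escalated to a human.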
Limitations & Future Work
- Scope of Workflows: While 55 workflows are substantial, real enterprises often run hundreds; scaling the benchmark to larger rule sets remains an open challenge.
- Static Business Rules: The current environment assumes deterministic rule execution; future versions should incorporate probabilistic outcomes and time‑based triggers.
- Human‑in‑the‑Loop Evaluation: The study focuses on fully autonomous agents; assessing how LLM assistants collaborate with human operators would broaden applicability.
- Learning the Dynamics: The paper highlights the need for world‑model learning but does not provide a concrete training pipeline; subsequent work could explore self‑supervised dynamics prediction from audit logs.
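Self‑supervised dynamics prediction from audit logs could start from something as simple as turning the log into supervised triples. The helper below is a hypothetical sketch, assuming each log entry carries a `record_id`, a timestamp `ts`, the `action` taken, and a `state` snapshot after that action; consecutive snapshots of the same record then form `(state, action, next_state)` training examples:

```python
from collections import defaultdict

def dynamics_triples(audit_log: list[dict]) -> list[tuple[dict, str, dict]]:
    """Build (state, action, next_state) training triples from an audit log."""
    # Group entries by record and order them chronologically.
    by_record: dict[str, list[dict]] = defaultdict(list)
    for entry in sorted(audit_log, key=lambda e: e["ts"]):
        by_record[entry["record_id"]].append(entry)
    # Each adjacent pair of snapshots yields one supervised example.
    triples = []
    for entries in by_record.values():
        for prev, nxt in zip(entries, entries[1:]):
            triples.append((prev["state"], nxt["action"], nxt["state"]))
    return triples
```

A dynamics model trained on such triples would need no manual labels, since the audit log already records both the action and its outcome, which is what makes this direction self‑supervised.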
Authors
- Lakshya Gupta
- Litao Li
- Yizhe Liu
- Sriram Ganapathi Subramanian
- Kaheer Suleman
- Zichen Zhang
- Haoye Lu
- Sumit Pasupalak
Paper Information
- arXiv ID: 2601.22130v1
- Categories: cs.AI, cs.SE
- Published: January 29, 2026