[Paper] World of Workflows: a Benchmark for Bringing World Models to Enterprise Systems
Source: arXiv - 2601.22130v1
Overview
The paper introduces World of Workflows (WoW), a realistic ServiceNow‑based sandbox that mimics the hidden, inter‑dependent processes found in large enterprises. By coupling this environment with a 234‑task benchmark (WoW‑bench), the authors expose a critical blind spot in today’s frontier large language models (LLMs): the inability to anticipate and respect the cascading side‑effects of actions inside opaque enterprise systems.
Key Contributions
- WoW Environment: A fully‑featured ServiceNow instance containing >4,000 business rules and 55 active, hidden workflows that drive state changes across multiple databases.
- WoW‑bench Benchmark: 234 carefully crafted tasks that require agents to (a) complete constrained user requests and (b) model the underlying system dynamics to avoid silent violations.
- Empirical Diagnosis: Systematic evaluation of several state‑of‑the‑art LLM agents, revealing a pervasive “dynamics blindness” – agents repeatedly miss invisible, cascading effects.
- Design Insight: Argues for a new paradigm where enterprise agents must learn and simulate hidden system dynamics rather than rely solely on surface‑level observations.
- Open‑Source Release: Full code, environment setup scripts, and evaluation pipelines are made publicly available on GitHub.
Methodology
- Environment Construction – The authors built a ServiceNow tenant populated with realistic business objects (incidents, change requests, CMDB entries) and wired them together with thousands of declarative business rules and workflow automations that are not directly observable to an external agent.
- Task Generation – Each benchmark task mimics a typical employee request (e.g., “reset a user’s VPN access”) but is deliberately designed so that the correct answer depends on hidden workflow outcomes (e.g., a downstream approval process that may reject the request).
- Agent Interface – LLM agents interact with WoW through a limited API (search, read, write) that mirrors the restricted UI a real chatbot would have. No internal state dump is provided.
- Evaluation Metrics:
  - Task Success Rate – did the agent achieve the visible goal?
  - Constraint Violation Rate – did the agent trigger any hidden rule violations (detected post hoc by the environment)?
  - Dynamics Prediction Accuracy – how accurately the agent predicts the next hidden state transition given an action.
- Model Baselines – The study tests several leading LLMs (GPT‑4, Claude‑2, Llama‑2‑70B) both zero‑shot and with few‑shot prompting, as well as a simple rule‑based baseline.
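The three metrics above can be computed from per-task outcome records. The sketch below is a minimal, hypothetical evaluator (the `Episode` fields and `score` function are illustrative names, not the paper's actual harness), assuming the environment reports post-hoc violations and the ground-truth next hidden state for each episode:

```python
from dataclasses import dataclass

@dataclass
class Episode:
    goal_achieved: bool        # visible task goal reached
    hidden_violations: int     # rule violations detected post hoc by the environment
    predicted_next_state: str  # agent's guess at the next hidden state
    actual_next_state: str     # ground truth transition from the environment

def score(episodes: list[Episode]) -> dict[str, float]:
    """Aggregate the three benchmark-style metrics over a batch of episodes."""
    n = len(episodes)
    return {
        "task_success_rate": sum(e.goal_achieved for e in episodes) / n,
        "constraint_violation_rate": sum(e.hidden_violations > 0 for e in episodes) / n,
        "dynamics_prediction_accuracy": sum(
            e.predicted_next_state == e.actual_next_state for e in episodes
        ) / n,
    }
```

Keeping success and violations as separate metrics matters: an agent can "succeed" at the visible goal while silently breaching hidden rules, which is exactly the failure mode the benchmark targets.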
Results & Findings
| Model | Task Success (%) | Constraint Violations (%) | Dynamics Prediction (%) |
|---|---|---|---|
| GPT‑4 (zero‑shot) | 58 | 42 | 31 |
| GPT‑4 (few‑shot) | 63 | 38 | 35 |
| Claude‑2 | 55 | 45 | 28 |
| Llama‑2‑70B | 48 | 51 | 22 |
| Rule‑based baseline | 34 | 62 | 15 |
- Dynamics Blindness: Even the strongest LLMs missed hidden side‑effects in roughly 40% of attempts, leading to silent policy breaches that would be costly in a real enterprise.
- Grounded Simulation Helps: Adding a lightweight "world‑model" module that predicts hidden state transitions improved dynamics prediction accuracy by roughly 10 points and cut the violation rate by roughly 5 percentage points.
- Few‑shot Prompting Provides Marginal Gains: Providing examples of workflow reasoning improves success modestly but does not close the core observability gap.
Practical Implications
- Enterprise Chatbots Need Internal Simulators: Deploying LLM‑powered assistants on platforms like ServiceNow, Salesforce, or SAP should include a component that learns the platform’s business rules and can run “what‑if” simulations before committing changes.
- Safety‑First Deployment Pipelines: Organizations must instrument hidden‑state monitors (audit logs, rule‑engine hooks) to catch silent violations that LLM agents might cause.
- Developer Tooling: The WoW repo can serve as a sandbox for testing custom prompting strategies, fine‑tuning on workflow logs, or integrating reinforcement‑learning‑from‑human‑feedback (RLHF) loops that reward correct dynamics prediction.
- Cost Savings: By catching cascading errors early, companies can avoid downstream ticket floods, compliance breaches, and costly rollbacks that typically arise from “good‑enough” automation.
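The "what‑if simulation before committing changes" pattern can be sketched as a guard around every write. The function below is a hypothetical illustration (the `simulate` and `commit` callbacks are assumed interfaces, not part of any real platform API): the simulator predicts the hidden consequences of a proposed change, and the write is refused if any violation is predicted:

```python
def guarded_write(record: dict, change: dict, simulate, commit) -> bool:
    """Simulate a proposed change before committing it.

    `simulate` maps a candidate record state to a dict of predicted effects
    (including a "violations" list); `commit` applies the change for real.
    Returns True if the change was committed, False if it was blocked.
    """
    predicted = simulate({**record, **change})  # what-if: apply change in simulation only
    if predicted.get("violations"):
        return False  # refuse rather than cause a silent policy breach
    commit(record, change)
    return True
```

The key design choice is that the agent never writes directly: every state change first passes through the learned or rule‑derived simulator, turning "dynamics blindness" from a silent failure into an explicit refusal that can be escalated to a human.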
Limitations & Future Work
- Scope of Workflows: While 55 workflows are substantial, real enterprises often run hundreds; scaling the benchmark to larger rule sets remains an open challenge.
- Static Business Rules: The current environment assumes deterministic rule execution; future versions should incorporate probabilistic outcomes and time‑based triggers.
- Human‑in‑the‑Loop Evaluation: The study focuses on fully autonomous agents; assessing how LLM assistants collaborate with human operators would broaden applicability.
- Learning the Dynamics: The paper highlights the need for world‑model learning but does not provide a concrete training pipeline; subsequent work could explore self‑supervised dynamics prediction from audit logs.
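Self‑supervised dynamics prediction from audit logs could start from something as simple as turning the log into supervised triples. The helper below is a hypothetical sketch, assuming each log entry carries a `record_id`, a timestamp `ts`, the `action` taken, and a `state` snapshot after that action; consecutive snapshots of the same record then form `(state, action, next_state)` training examples:

```python
from collections import defaultdict

def dynamics_triples(audit_log: list[dict]) -> list[tuple[dict, str, dict]]:
    """Build (state, action, next_state) training triples from an audit log."""
    # Group entries by record and order them chronologically.
    by_record: dict[str, list[dict]] = defaultdict(list)
    for entry in sorted(audit_log, key=lambda e: e["ts"]):
        by_record[entry["record_id"]].append(entry)
    # Each adjacent pair of snapshots yields one supervised example.
    triples = []
    for entries in by_record.values():
        for prev, nxt in zip(entries, entries[1:]):
            triples.append((prev["state"], nxt["action"], nxt["state"]))
    return triples
```

A dynamics model trained on such triples would need no manual labels, since the audit log already records both the action and its outcome, which is what makes this direction self‑supervised.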
Authors
- Lakshya Gupta
- Litao Li
- Yizhe Liu
- Sriram Ganapathi Subramanian
- Kaheer Suleman
- Zichen Zhang
- Haoye Lu
- Sumit Pasupalak
Paper Information
- arXiv ID: 2601.22130v1
- Categories: cs.AI, cs.SE
- Published: January 29, 2026