[Paper] Learning to Compose for Cross-domain Agentic Workflow Generation
Source: arXiv - 2602.11114v1
Overview
The paper presents a method for automatically creating agentic workflows: structured sequences of operators (or code snippets) that let large language models (LLMs) reason, verify, and repair their outputs. By teaching an LLM to decompose tasks into reusable capabilities, recompose them on the fly, and identify which capabilities actually contributed to success, the authors achieve reliable single‑pass workflow generation across very different domains, cutting the typical 20‑plus refinement iterations down to one.
Key Contributions
- Compact capability library: Learns a small set of reusable workflow primitives that span multiple domains.
- Sparse composition engine: Maps any new task to a lightweight, sparse combination of these primitives, enabling one‑shot workflow synthesis.
- Counterfactual attribution: Introduces a causal‑style analysis to pinpoint which capabilities contributed to a successful workflow, improving interpretability and robustness.
- Cross‑domain performance: Demonstrates that a single LLM can generate high‑quality workflows for seen, shifted, and completely unseen domains without domain‑specific fine‑tuning.
- Efficiency gains: Achieves comparable or better results than state‑of‑the‑art iterative refinement methods while reducing latency and compute cost by an order of magnitude.
Methodology
- Decompose – The authors first train an open‑source LLM to identify a basis set of workflow capabilities (e.g., “search the web”, “run a Python script”, “validate JSON”). This is done by clustering operator graphs from many domains and extracting the most common, reusable patterns.
- Recompose – Given a new user request, the model predicts a sparse vector over the learned basis, essentially selecting a handful of capabilities that together can solve the task. The selected capabilities are then stitched together into a concrete workflow graph in a single forward pass.
- Decide – After execution, the system performs a counterfactual contribution analysis: it perturbs each capability’s presence and measures the impact on success, attributing credit (or blame) to individual primitives. This feedback loop refines the capability library without full retraining.
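The three steps above can be sketched in plain Python. Everything here is illustrative: the capability names, the top‑k‑with‑threshold selection rule, and the toy success measure are assumptions for exposition, not the paper's implementation.

```python
# Hypothetical sketch of the recompose and decide steps described above.
# Capability names, scoring, and thresholds are illustrative only.

CAPABILITIES = ["web_search", "run_python", "validate_json", "summarize", "repair_output"]

def recompose(task_scores, k=3, threshold=0.2):
    """Select a sparse subset of capabilities: the top-k scores above a threshold."""
    ranked = sorted(task_scores.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, score in ranked[:k] if score >= threshold]

def counterfactual_attribution(selected, execute):
    """Re-run the workflow with each capability ablated in turn;
    a capability's credit is the drop in the success score when it is removed."""
    base = execute(selected)
    credit = {}
    for cap in selected:
        ablated = [c for c in selected if c != cap]
        credit[cap] = base - execute(ablated)
    return credit

# Demo with made-up relevance scores and a toy success function.
scores = {"web_search": 0.1, "run_python": 0.9, "validate_json": 0.7,
          "summarize": 0.3, "repair_output": 0.05}
selected = recompose(scores)  # a sparse combination of three capabilities

def toy_execute(caps):
    # Success only if the script runs and its output is validated.
    return 1.0 if "run_python" in caps and "validate_json" in caps else 0.4

credit = counterfactual_attribution(selected, toy_execute)
```

In this toy run, ablating `summarize` leaves the success score unchanged, so it receives zero credit; the attribution step would use that signal to prune or down-weight it in the capability library.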
All steps are implemented on top of a publicly available LLM (e.g., LLaMA‑2) and rely on standard fine‑tuning and prompting techniques, making the pipeline reproducible.
Results & Findings
| Evaluation Setting | Metric | Baseline (20‑step refinement) | Proposed 1‑pass method |
|---|---|---|---|
| In‑domain | Success rate (%, higher is better) | 78.3 | 84.7 |
| Cross‑domain | Success rate (%, higher is better) | 62.1 | 71.5 |
| Unseen‑domain | Success rate (%, higher is better) | 48.9 | 58.2 |
| Latency | Avg. seconds per workflow (lower is better) | 12.4 (20 iterations) | 1.1 (single pass) |
| Compute cost | GPU‑hours per 1k tasks (lower is better) | 3.6 | 0.4 |
The single‑pass generator not only outperforms the iterative baselines on success rates across all domains but also slashes generation time by ~10× and reduces GPU consumption dramatically. The counterfactual attribution analysis reveals that a small subset (≈15 %) of the learned capabilities accounts for >80 % of successful outcomes, confirming the sparsity assumption.
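The sparsity finding lends itself to a quick sanity check: given per‑capability credit from the attribution step, find the smallest fraction of capabilities whose summed credit covers a target share of the total. The function and numbers below are a toy illustration, not the paper's data.

```python
# Illustrative check of the sparsity claim: what fraction of capabilities
# accounts for a target share of total attributed credit? Toy numbers only.

def sparsity_fraction(credits, target=0.8):
    """Smallest fraction of capabilities whose summed credit reaches
    `target` of the total credit (greedy, largest credits first)."""
    total = sum(credits.values())
    running, used = 0.0, 0
    for c in sorted(credits.values(), reverse=True):
        running += c
        used += 1
        if running >= target * total:
            break
    return used / len(credits)

# A library of 20 capabilities where 3 dominate, mimicking the reported pattern.
credits = {f"cap_{i}": 5.0 for i in range(3)}
credits.update({f"cap_{i}": 0.2 for i in range(3, 20)})
print(sparsity_fraction(credits))  # → 0.15
```

With these made‑up credits, 3 of 20 capabilities (15 %) cover more than 80 % of total credit, matching the shape of the distribution the authors report.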
Practical Implications
- Faster AI‑assisted tooling: IDE plugins, data‑pipeline builders, or low‑code platforms can generate end‑to‑end automation scripts on the fly, without waiting for multi‑step refinement loops.
- Cost‑effective cloud services: SaaS providers can embed the model in their backend and bill per request rather than per iteration, lowering operational expenses.
- Robust cross‑domain assistants: Customer‑support bots, scientific analysis pipelines, or DevOps agents can adapt to new problem spaces (e.g., a new API or data format) without retraining on domain‑specific data.
- Explainable automation: The counterfactual attribution gives developers a clear view of why a generated workflow succeeded, aiding debugging and compliance audits.
Limitations & Future Work
- Capability granularity: The current basis set may miss highly specialized operators needed for niche industries, requiring manual extension.
- Counterfactual overhead: While lightweight, the attribution step adds a small runtime cost that could become noticeable at massive scale.
- Evaluation scope: Evaluation focuses on synthetic and standard benchmark tasks; real‑world deployment in safety‑critical domains (e.g., medicine or finance) still needs thorough validation.
- Future directions: The authors suggest expanding the capability library via continual learning, integrating richer execution feedback (e.g., logs, error traces), and exploring hierarchical composition for even more complex multi‑step processes.
Authors
- Jialiang Wang
- Shengxiang Xu
- Hanmo Liu
- Jiachuan Wang
- Yuyu Luo
- Shimin Di
- Min-Ling Zhang
- Lei Chen
Paper Information
- arXiv ID: 2602.11114v1
- Categories: cs.MA, cs.AI, cs.LG, cs.SE
- Published: February 11, 2026