[Paper] Learning to Compose for Cross-domain Agentic Workflow Generation
Source: arXiv - 2602.11114v1
Overview
The paper presents a method for automatically creating agentic workflows: structured sequences of operators (or code snippets) that let large language models (LLMs) reason, verify, and repair their outputs. By teaching an LLM to decompose tasks into reusable capabilities, recompose them on the fly, and identify which capabilities actually contributed to success, the authors achieve reliable single‑pass workflow generation across very different domains, cutting the typical 20‑plus refinement iterations down to one.
Key Contributions
- Compact capability library: Learns a small set of reusable workflow primitives that span multiple domains.
- Sparse composition engine: Maps any new task to a lightweight, sparse combination of these primitives, enabling one‑shot workflow synthesis.
- Counterfactual attribution: Introduces a causal‑style analysis to pinpoint which capabilities contributed to a successful workflow, improving interpretability and robustness.
- Cross‑domain performance: Demonstrates that a single LLM can generate high‑quality workflows for seen, shifted, and completely unseen domains without domain‑specific fine‑tuning.
- Efficiency gains: Achieves comparable or better results than state‑of‑the‑art iterative refinement methods while reducing latency and compute cost by an order of magnitude.
Methodology
- Decompose – The authors first train an open‑source LLM to identify a basis set of workflow capabilities (e.g., “search the web”, “run a Python script”, “validate JSON”). This is done by clustering operator graphs from many domains and extracting the most common, reusable patterns.
- Recompose – Given a new user request, the model predicts a sparse vector over the learned basis, essentially selecting a handful of capabilities that together can solve the task. The selected capabilities are then stitched together into a concrete workflow graph in a single forward pass.
- Decide – After execution, the system performs a counterfactual contribution analysis: it perturbs each capability’s presence and measures the impact on success, attributing credit (or blame) to individual primitives. This feedback loop refines the capability library without full retraining.
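The three steps above can be sketched in plain Python. Everything here is illustrative: the capability names, the top‑k‑with‑threshold selection rule, and the toy success measure are assumptions for exposition, not the paper's implementation.

```python
# Hypothetical sketch of the recompose and decide steps described above.
# Capability names, scoring, and thresholds are illustrative only.

CAPABILITIES = ["web_search", "run_python", "validate_json", "summarize", "repair_output"]

def recompose(task_scores, k=3, threshold=0.2):
    """Select a sparse subset of capabilities: the top-k scores above a threshold."""
    ranked = sorted(task_scores.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, score in ranked[:k] if score >= threshold]

def counterfactual_attribution(selected, execute):
    """Re-run the workflow with each capability ablated in turn;
    a capability's credit is the drop in the success score when it is removed."""
    base = execute(selected)
    credit = {}
    for cap in selected:
        ablated = [c for c in selected if c != cap]
        credit[cap] = base - execute(ablated)
    return credit

# Demo with made-up relevance scores and a toy success function.
scores = {"web_search": 0.1, "run_python": 0.9, "validate_json": 0.7,
          "summarize": 0.3, "repair_output": 0.05}
selected = recompose(scores)  # a sparse combination of three capabilities

def toy_execute(caps):
    # Success only if the script runs and its output is validated.
    return 1.0 if "run_python" in caps and "validate_json" in caps else 0.4

credit = counterfactual_attribution(selected, toy_execute)
```

In this toy run, ablating `summarize` leaves the success score unchanged, so it receives zero credit; the attribution step would use that signal to prune or down-weight it in the capability library.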
All steps are implemented on top of a publicly available LLM (e.g., LLaMA‑2) and rely on standard fine‑tuning and prompting techniques, making the pipeline reproducible.
Results & Findings
| Evaluation Setting | Metric | Baseline (20‑step refinement) | Proposed 1‑pass method |
|---|---|---|---|
| In‑domain | Success rate (%, higher is better) | 78.3 | 84.7 |
| Cross‑domain | Success rate (%, higher is better) | 62.1 | 71.5 |
| Unseen‑domain | Success rate (%, higher is better) | 48.9 | 58.2 |
| Latency | Avg. seconds per workflow (lower is better) | 12.4 (20 iterations) | 1.1 (single pass) |
| Compute cost | GPU‑hours per 1k tasks (lower is better) | 3.6 | 0.4 |
The single‑pass generator not only outperforms the iterative baselines on success rates across all domains but also slashes generation time by ~10× and reduces GPU consumption dramatically. The counterfactual attribution analysis reveals that a small subset (≈15 %) of the learned capabilities accounts for >80 % of successful outcomes, confirming the sparsity assumption.
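The sparsity finding lends itself to a quick sanity check: given per‑capability credit from the attribution step, find the smallest fraction of capabilities whose summed credit covers a target share of the total. The function and numbers below are a toy illustration, not the paper's data.

```python
# Illustrative check of the sparsity claim: what fraction of capabilities
# accounts for a target share of total attributed credit? Toy numbers only.

def sparsity_fraction(credits, target=0.8):
    """Smallest fraction of capabilities whose summed credit reaches
    `target` of the total credit (greedy, largest credits first)."""
    total = sum(credits.values())
    running, used = 0.0, 0
    for c in sorted(credits.values(), reverse=True):
        running += c
        used += 1
        if running >= target * total:
            break
    return used / len(credits)

# A library of 20 capabilities where 3 dominate, mimicking the reported pattern.
credits = {f"cap_{i}": 5.0 for i in range(3)}
credits.update({f"cap_{i}": 0.2 for i in range(3, 20)})
print(sparsity_fraction(credits))  # → 0.15
```

With these made‑up credits, 3 of 20 capabilities (15 %) cover more than 80 % of total credit, matching the shape of the distribution the authors report.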
Practical Implications
- Faster AI‑assisted tooling: IDE plugins, data‑pipeline builders, or low‑code platforms can generate end‑to‑end automation scripts on the fly, without waiting for multi‑step refinement loops.
- Cost‑effective cloud services: SaaS providers can embed the model in their backend and bill per request rather than per iteration, lowering operational expenses.
- Robust cross‑domain assistants: Customer‑support bots, scientific analysis pipelines, or DevOps agents can adapt to new problem spaces (e.g., a new API or data format) without retraining on domain‑specific data.
- Explainable automation: The counterfactual attribution gives developers a clear view of why a generated workflow succeeded, aiding debugging and compliance audits.
Limitations & Future Work
- Capability granularity: The current basis set may miss highly specialized operators needed for niche industries, requiring manual extension.
- Counterfactual overhead: While lightweight, the attribution step adds a small runtime cost that could become noticeable at massive scale.
- Evaluation scope: Evaluation focuses on synthetic and standard benchmark tasks; real‑world deployment in safety‑critical domains (e.g., medicine or finance) still needs thorough validation.
- Future directions: The authors suggest expanding the capability library via continual learning, integrating richer execution feedback (e.g., logs, error traces), and exploring hierarchical composition for even more complex multi‑step processes.
Authors
- Jialiang Wang
- Shengxiang Xu
- Hanmo Liu
- Jiachuan Wang
- Yuyu Luo
- Shimin Di
- Min-Ling Zhang
- Lei Chen
Paper Information
- arXiv ID: 2602.11114v1
- Categories: cs.MA, cs.AI, cs.LG, cs.SE
- Published: February 11, 2026