[Paper] From Intent to Execution: Composing Agentic Workflows with Agent Recommendation

Published: May 5, 2026 at 01:08 PM EDT
Source: arXiv - 2605.03986v1

Overview

The paper presents an end‑to‑end framework that automatically builds multi‑agent workflows from a high‑level user intent. By replacing the traditionally manual steps of planning, agent selection, and execution‑graph construction with a set of coordinated software modules, the authors demonstrate a more scalable way to spin up task‑specific AI applications.

Key Contributions

  • LLM‑driven planner that translates natural‑language intents into a structured sequence of tasks.
  • Two‑stage agent recommender (fast vector retriever + LLM re‑ranker) that selects the most suitable agents from local and global registries.
  • Dynamic call‑graph generator that assembles the selected agents into an executable workflow.
  • Critique agent that reviews the whole plan and can trigger revisions to improve recall and robustness.
  • Comprehensive empirical evaluation of embedder/re‑ranker choices, description enrichment, and the impact of the critique step, showing state‑of‑the‑art recall performance.
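
The critique step listed above can be pictured as a bounded review-and-revise loop. The following is a minimal sketch under stated assumptions, not the authors' implementation: `critique_loop`, `critique`, `revise`, `max_rounds`, and the toy plan are hypothetical names introduced here, and the stub critic only checks for tasks with no assigned agent, whereas the paper uses a dedicated agent for this review.

```python
def critique_loop(plan, critique, revise, max_rounds=3):
    """Revise the plan until the critic finds no issues or the budget runs out.

    Returns (plan, ok), where ok is True if the final plan passed the critique.
    """
    for _ in range(max_rounds):
        issues = critique(plan)
        if not issues:
            return plan, True
        plan = revise(plan, issues)
    return plan, not critique(plan)


# Toy plan (hypothetical): task -> selected agent, None means no agent was found.
plan = {"clean": "csv_cleaner", "translate": None}
fallback = {"translate": "translator"}  # hypothetical alternative agents


def critique(plan):
    """Stub critic: report every task that has no agent assigned."""
    return [task for task, agent in plan.items() if agent is None]


def revise(plan, issues):
    """Stub reviser: fill flagged tasks from the fallback table when possible."""
    return {**plan, **{task: fallback.get(task) for task in issues}}


final_plan, ok = critique_loop(plan, critique, revise)
```

Bounding the number of rounds keeps the loop from running indefinitely when the critic keeps flagging issues that no revision can resolve.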

Methodology

  1. Intent → Task Decomposition
    • An LLM (e.g., GPT‑4) receives a user’s natural‑language goal and outputs an ordered list of atomic tasks.

  2. Agent Retrieval
    • Stage 1: A dense‑vector retriever (e.g., FAISS + sentence‑transformer embeddings) quickly pulls a shortlist of candidate agents whose metadata matches each task.
    • Stage 2: A smaller LLM re‑ranks the shortlist using richer contextual cues (task description, agent capabilities, past performance).

  3. Workflow Assembly
    • The system builds a dynamic call graph that wires the chosen agents according to task dependencies, forming an executable DAG (directed acyclic graph).

  4. Critique Loop
    • A dedicated critique agent inspects the full plan and the selected agents, checks for gaps or mismatches, and can request re‑planning or alternative agents.

  5. Execution
    • The orchestrator invokes each agent in topological order, passing intermediate results downstream until the overall intent is satisfied.
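
The two-stage retrieval in step 2 can be illustrated with a small, dependency-free sketch. Everything here is an assumption made for illustration: the registry contents and the names `REGISTRY`, `embed`, `retrieve`, `rerank`, `overlap_scorer`, and `recommend` are hypothetical, a bag-of-words cosine similarity stands in for the dense embedder, and a word-overlap score stands in for the LLM re-ranker.

```python
import math
from collections import Counter

# Hypothetical agent registry: names and descriptions are invented for illustration.
REGISTRY = {
    "csv_cleaner": "clean csv data remove duplicates fix missing values",
    "translator": "translate text between languages",
    "chart_maker": "plot chart from tabular data",
    "summarizer": "summarize long text documents",
}


def embed(text):
    """Toy bag-of-words vector; a real system would use a dense embedding model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(task, k=2):
    """Stage 1: shortlist the k agents whose descriptions are most similar to the task."""
    query = embed(task)
    ranked = sorted(REGISTRY, key=lambda name: cosine(query, embed(REGISTRY[name])), reverse=True)
    return ranked[:k]


def overlap_scorer(task, name, description):
    """Placeholder for the LLM re-ranker: share of task words found in the description."""
    words = set(task.lower().split())
    return len(words & set(description.split())) / len(words)


def rerank(task, shortlist, score_fn):
    """Stage 2: reorder the shortlist with a richer (here: stubbed) scorer."""
    return sorted(shortlist, key=lambda name: score_fn(task, name, REGISTRY[name]), reverse=True)


def recommend(task, k=2):
    """Return the top agent after retrieval and re-ranking."""
    return rerank(task, retrieve(task, k), overlap_scorer)[0]


best = recommend("remove duplicates from csv data")
```

The point of the split is cost: the cheap similarity search runs over the whole registry, while the expensive scorer only sees the k shortlisted candidates.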

All components are modular, allowing developers to plug in their own LLMs, embedding models, or custom agents.
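
Workflow assembly and execution (steps 3 and 5) amount to running callables in dependency order. The sketch below uses Python's standard `graphlib` module (Python 3.9+) for the topological ordering. The function `run_workflow`, the task names, and the lambda "agents" are hypothetical stand-ins, and the paper's orchestrator may be structured quite differently.

```python
from graphlib import CycleError, TopologicalSorter


def run_workflow(dependencies, agents, intent):
    """Run each task's agent after all of its upstream tasks have finished.

    dependencies: task -> set of tasks it depends on (must form a DAG).
    agents: task -> callable(intent, upstream_results) returning that task's result.
    Raises graphlib.CycleError if the dependencies contain a cycle.
    """
    results = {}
    for task in TopologicalSorter(dependencies).static_order():
        upstream = {dep: results[dep] for dep in dependencies.get(task, ())}
        results[task] = agents[task](intent, upstream)
    return results


# Toy three-step workflow (hypothetical tasks and agents).
dependencies = {"load": set(), "clean": {"load"}, "report": {"clean"}}
agents = {
    "load": lambda intent, up: [3, 1, 1, 2],
    "clean": lambda intent, up: sorted(set(up["load"])),
    "report": lambda intent, up: f"{intent}: {up['clean']}",
}
results = run_workflow(dependencies, agents, "unique values")

# A cyclic dependency graph is rejected before any agent runs.
try:
    run_workflow({"a": {"b"}, "b": {"a"}}, {}, "unused")
    cycle_detected = False
except CycleError:
    cycle_detected = True
```

Because each agent is just a callable keyed by task name, swapping in a different model or a custom agent only changes the `agents` mapping, which mirrors the modularity described above.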

Results & Findings

Aspect | Metric | Outcome
Recall of correct agents | % of tasks matched with an appropriate agent | ~15 % higher than prior baselines (e.g., single‑stage retrieval).
Scalability | Time to retrieve agents for 100‑task workflow | Linear growth; the fast retriever keeps latency low (< 200 ms per task).
Critique impact | Recall after critique vs. before | +4–6 % absolute gain, confirming the value of a holistic review step.
Robustness | Success rate under noisy intent phrasing | Maintained > 90 % task completion, whereas baselines dropped below 70 %.
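
The recall metric above ("% of tasks matched with an appropriate agent") can be computed as follows. This is one plausible reading and the paper's exact definition may differ. The function name `agent_recall` and all task and agent names are hypothetical, and the numbers are toy values, not results from the paper.

```python
def agent_recall(selected, acceptable):
    """Fraction of tasks whose selected agent is among that task's acceptable agents.

    selected: task -> chosen agent name.
    acceptable: task -> set of agent names judged appropriate for the task.
    """
    if not acceptable:
        raise ValueError("no tasks to evaluate")
    hits = sum(1 for task, ok in acceptable.items() if selected.get(task) in ok)
    return hits / len(acceptable)


# Toy ground truth and selections (invented for illustration).
acceptable = {
    "load": {"csv_reader"},
    "clean": {"csv_cleaner", "deduper"},
    "translate": {"translator"},
    "plot": {"chart_maker"},
}
before_critique = {"load": "csv_reader", "clean": "summarizer",
                   "translate": "translator", "plot": "summarizer"}
after_critique = {**before_critique, "clean": "deduper"}

recall_before = agent_recall(before_critique, acceptable)
recall_after = agent_recall(after_critique, acceptable)
absolute_gain = recall_after - recall_before
```

An absolute gain is the plain difference between the two recall values, which is how the "+4–6 % absolute gain" in the critique row should be read.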

The experiments also showed that enriching agent descriptions (adding example inputs/outputs) significantly improves the re‑ranker’s ability to pick the right tool.
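
As a concrete picture of description enrichment, the sketch below appends example input/output pairs to an agent's description before it is indexed or shown to the re-ranker. The function `enrich_description` and the text layout are assumptions made for illustration. The paper's enrichment format is not specified in this summary.

```python
def enrich_description(description, examples):
    """Append example input/output pairs to a plain agent description."""
    lines = [description.strip()]
    for example_input, example_output in examples:
        lines.append(f"Example: input={example_input!r} -> output={example_output!r}")
    return "\n".join(lines)


enriched = enrich_description(
    "Translates text to English.",
    [("Bonjour", "Hello")],
)
```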

Practical Implications

  • Rapid prototyping of AI‑powered services: Developers can describe a new workflow in plain English and obtain a ready‑to‑run multi‑agent pipeline without hand‑crafting glue code.
  • Marketplace integration: SaaS platforms that host a catalog of specialized agents (e.g., data cleaning, translation, code generation) can use the recommender to auto‑match client requests to the best‑fit services.
  • Enterprise automation: Business process automation teams can replace brittle RPA scripts with adaptive agent chains that self‑select the most capable tool for each step.
  • Extensibility: Because the framework is modular, teams can swap in domain‑specific LLMs or embedder models to tailor performance for niche verticals (finance, healthcare, etc.).

In short, the approach lowers the barrier to building sophisticated, composable AI systems, turning “intent → execution” into a repeatable engineering pattern.

Limitations & Future Work

  • Dependency on high‑quality agent metadata: The recommender’s success hinges on well‑structured, descriptive registries; sparse or noisy descriptions degrade performance.
  • LLM cost and latency: Using large LLMs for planning and re‑ranking can be expensive for very large workflows; future work could explore distilled models or caching strategies.
  • Evaluation scope: Benchmarks focus on recall and synthetic intents; real‑world deployments with complex error handling and security constraints remain to be tested.
  • Dynamic adaptation: The current system assumes a static agent pool; extending it to discover or train new agents on the fly is an open research direction.

Overall, the paper lays a solid foundation for automated multi‑agent composition while highlighting practical challenges that the community can address next.

Authors

  • Kishan Athrey
  • Ramin Pishehvar
  • Brian Riordan
  • Mahesh Viswanathan

Paper Information

  • arXiv ID: 2605.03986v1
  • Categories: cs.AI
  • Published: May 5, 2026