[Paper] Beyond Single-Shot: Multi-step Tool Retrieval via Query Planning
Source: arXiv - 2601.07782v1
Overview
Large‑language‑model (LLM) agents are increasingly being equipped with huge, ever‑changing libraries of external tools (APIs, scripts, data sources). Picking the right tool(s) from such a library is a retrieval problem, but the usual “single‑shot” dense retrievers—where a single embedding of the user request is matched against static tool embeddings—often miss the mark on complex, multi‑step tasks. The paper Beyond Single‑Shot: Multi‑step Tool Retrieval via Query Planning introduces TOOLQP, a lightweight framework that turns tool retrieval into an iterative “query‑planning” process, dramatically improving accuracy and robustness.
Key Contributions
- Iterative Query Planning: Replaces one‑shot matching with a multi‑step decomposition of the user instruction into sub‑tasks, each generating a focused retrieval query.
- Synthetic Trajectory Pre‑training + RLVR: Trains the planner on automatically generated query trajectories and then fine‑tunes it with Reinforcement Learning using Verifiable Rewards that directly measure whether the retrieved tools enable successful execution.
- Retriever‑Agnostic Design: TOOLQP works with a variety of retrieval back‑ends (e.g., dense indexes built with FAISS or ScaNN, or late‑interaction models like ColBERT) and consistently lifts their performance.
- Zero‑Shot Generalization: Demonstrates strong out‑of‑distribution results on unseen tool sets and novel user intents without any task‑specific fine‑tuning.
- Downstream Agent Gains: Shows that agents equipped with TOOLQP retrieve the right tools more often, leading to higher success rates in end‑to‑end task execution (e.g., code generation, data pipeline orchestration).
Methodology
Problem Framing
- Input: a natural‑language user request (e.g., “Generate a weekly sales report and email it to the team”).
- Goal: retrieve a set of tools (e.g., a database query API, a CSV exporter, an email sender) that together satisfy the request.
Query Planner Architecture
- A small LLM (or a fine‑tuned encoder‑decoder) receives the full request and produces a plan: a sequence of sub‑goals (e.g., “fetch sales data”, “format as CSV”, “send email”).
- For each sub‑goal, the planner emits a targeted query (a short textual phrase) that is fed to the underlying dense retriever.
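The decompose‑then‑retrieve loop can be sketched as follows. This is a toy illustration, not the paper's implementation: `plan()` stands in for the planner LLM with a fixed decomposition of the example request, and the "embeddings" are bag‑of‑words counts rather than a trained dense encoder.

```python
from collections import Counter
import math

# Toy tool library: name -> description (illustrative, not from the paper).
TOOLS = {
    "sales_db_query": "run sql queries against the sales database",
    "csv_export": "format tabular results as a csv file",
    "email_send": "send an email message with attachments",
}

def embed(text: str) -> Counter:
    # Stand-in "embedding": bag-of-words counts instead of a dense encoder.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def plan(request: str) -> list[str]:
    # Stand-in for the planner LLM: a fixed decomposition of the example
    # request into targeted retrieval queries, one per sub-goal.
    return ["fetch sales data", "format as csv", "send email"]

def retrieve(query: str) -> str:
    # Single-query retrieval: return the best-matching tool name.
    q = embed(query)
    return max(TOOLS, key=lambda t: cosine(q, embed(TOOLS[t])))

request = "Generate a weekly sales report and email it to the team"
tools = [retrieve(q) for q in plan(request)]
```

Each sub‑goal query is deliberately short and focused, so even a weak retriever can match it to a single tool; the full request, by contrast, blends three intents into one embedding.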
Training Pipeline
- Synthetic Trajectory Generation: The authors automatically construct many (request → plan → query → tool) examples by sampling from a knowledge base of tool descriptions and composing random multi‑step tasks.
- Supervised Pre‑training: The planner learns to mimic these synthetic trajectories.
- Reinforcement Learning with Verifiable Rewards (RLVR):
- Reward = 1 if the retrieved tool set enables a verifier (a sandboxed executor) to complete the original request; 0 otherwise.
- Policy gradients update the planner to favor query sequences that lead to successful verification.
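The RLVR step can be sketched with a minimal REINFORCE loop over a toy action space. Everything here is illustrative: the paper's planner generates free‑form query sequences with an LLM, whereas this sketch scores three hand‑written candidate plans, and `verifier_reward` simulates the sandboxed executor with a simple coverage check.

```python
import math
import random

random.seed(0)

# Toy action space of candidate query plans (the real planner generates
# free-form query sequences; these are hand-written stand-ins).
PLANS = [
    ["fetch sales data", "format as csv", "send email"],  # complete plan
    ["fetch sales data", "send email"],                   # misses the export step
    ["send email"],                                       # misses most steps
]

REQUIRED = {"fetch", "csv", "email"}  # stand-in for the verifier's checks

def verifier_reward(plan: list[str]) -> int:
    # Verifiable reward: 1 iff the retrieved tools would let the (simulated)
    # sandboxed executor complete the original request, else 0.
    covered = {w for q in plan for w in q.split()}
    return 1 if REQUIRED <= covered else 0

def softmax(xs: list[float]) -> list[float]:
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

theta = [0.0] * len(PLANS)  # policy logits over candidate plans
lr = 0.5

for _ in range(200):
    probs = softmax(theta)
    i = random.choices(range(len(PLANS)), weights=probs)[0]
    r = verifier_reward(PLANS[i])
    # REINFORCE: raise the log-prob of the sampled plan in proportion
    # to its verifiable reward (zero-reward samples leave theta unchanged).
    for j in range(len(theta)):
        grad = (1.0 if j == i else 0.0) - probs[j]
        theta[j] += lr * r * grad

best = max(range(len(PLANS)), key=lambda j: theta[j])
```

Because the reward is binary and externally checkable, no learned reward model is needed; only plans whose retrieved tools actually pass verification get reinforced.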
Inference
The planner iteratively proposes queries until a stopping criterion (e.g., “no new sub‑goals” or “max steps reached”) is met, then aggregates all retrieved tools for the downstream agent.
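The inference loop above can be sketched generically. The `propose_query` and `retrieve` callables below are hypothetical stand‑ins (a fixed plan and a keyword lookup); in the paper they would be the trained planner and the dense retriever.

```python
def run_planner(request, propose_query, retrieve, max_steps=5):
    """Iteratively propose queries until the planner signals completion
    ("no new sub-goals") or the step budget runs out, then return the
    aggregated tool set. `propose_query(request, retrieved)` returns the
    next query, or None to stop."""
    retrieved = []
    for _ in range(max_steps):
        query = propose_query(request, retrieved)
        if query is None:  # stopping criterion: no new sub-goals
            break
        retrieved.append(retrieve(query))
    return set(retrieved)

PLAN = ["fetch sales data", "format as csv", "send email"]

def propose_query(request, retrieved):
    # Toy planner: emit the next sub-goal from a fixed plan, then stop.
    return PLAN[len(retrieved)] if len(retrieved) < len(PLAN) else None

def retrieve(query):
    # Toy retriever: map each query to a tool name by exact lookup.
    return {"fetch sales data": "sales_db_query",
            "format as csv": "csv_export",
            "send email": "send_email_tool"}[query]

tools = run_planner("Generate a weekly sales report and email it",
                    propose_query, retrieve)
```

Note that the planner can condition each new query on what has already been retrieved, which is what lets it cover multi‑step tasks that a single embedding cannot.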
Results & Findings
| Metric | Single‑Shot Baseline | TOOLQP (w/ FAISS) | TOOLQP (w/ ColBERT) |
|---|---|---|---|
| Top‑1 Retrieval Accuracy | 42.7 % | 68.9 % | 71.3 % |
| Zero‑Shot Task Success (end‑to‑end) | 35.4 % | 59.2 % | 61.0 % |
| Average Number of Queries per Request | 1 | 3.2 | 3.0 |
| RLVR Training Convergence (steps) | – | ~12 k | ~10 k |
- State‑of‑the‑art: TOOLQP outperforms the strongest single‑shot dense retrievers by >20 % absolute accuracy.
- Robustness: Performance gains hold across different retriever back‑ends, confirming the planner’s retriever‑agnostic nature.
- Generalization: On a held‑out “future‑tool” split (tools added after training), TOOLQP retains >60 % success versus <40 % for baselines.
- Agentic Impact: In a simulated code‑assistant scenario, overall task completion rose from 48 % to 73 % when the agent used TOOLQP for tool lookup.
Practical Implications
- Plug‑and‑Play Retrieval Layer: Developers can wrap any existing dense retriever with TOOLQP’s planner and immediately see higher tool‑matching rates without re‑indexing.
- Dynamic Tool Ecosystems: SaaS platforms that frequently add or deprecate APIs (e.g., cloud automation, low‑code platforms) can keep LLM agents functional with minimal retraining.
- Reduced Prompt Engineering: Instead of hand‑crafting elaborate prompts to coax the LLM into “thinking” about tool composition, the planner handles decomposition automatically.
- Improved Safety & Explainability: The step‑wise plan is human‑readable, making it easier to audit why a particular tool was chosen—a boon for compliance‑heavy domains.
- Cost Efficiency: Fewer failed tool calls mean less wasted compute and API usage, which translates to lower operational costs for large‑scale LLM‑driven services.
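The "plug‑and‑play" point can be made concrete with a hypothetical wrapper. The class and method names below are illustrative (the paper does not specify an API): any back‑end exposing a `search(query, k)` method can be wrapped without re‑indexing its tool embeddings.

```python
class ToolQPWrapper:
    """Hypothetical wrapper illustrating the retriever-agnostic design:
    the planner decomposes the request, the wrapped back-end answers
    each sub-query, and results are aggregated in order."""

    def __init__(self, planner, retriever, k=1):
        self.planner = planner      # callable: request -> list of queries
        self.retriever = retriever  # any object with .search(query, k)
        self.k = k

    def retrieve(self, request):
        tools = []
        for query in self.planner(request):
            for tool in self.retriever.search(query, self.k):
                if tool not in tools:  # deduplicate, preserve order
                    tools.append(tool)
        return tools

class KeywordRetriever:
    # Toy back-end scoring tools by word overlap with the query.
    def __init__(self, tools):
        self.tools = tools  # {name: description}

    def search(self, query, k):
        words = set(query.split())
        ranked = sorted(self.tools,
                        key=lambda t: -len(words & set(self.tools[t].split())))
        return ranked[:k]

backend = KeywordRetriever({
    "sales_db_query": "fetch sales data from the database",
    "email_send": "send email to recipients",
})
wrapper = ToolQPWrapper(lambda r: ["fetch sales data", "send email"], backend)
tools = wrapper.retrieve("Email the weekly sales numbers to the team")
```

Swapping `KeywordRetriever` for a FAISS‑ or ColBERT‑backed index would leave the wrapper unchanged, which is the sense in which the planner is retriever‑agnostic.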
Limitations & Future Work
- Synthetic Bias: The training data are synthetically generated; real‑world user requests may exhibit linguistic patterns not captured, potentially limiting performance on highly domain‑specific language.
- Planner Overhead: The iterative query loop adds latency (≈2–3 extra retrieval calls per request). Optimizations such as early‑stop heuristics or caching are needed for latency‑critical applications.
- Tool Description Quality: The approach assumes reasonably detailed tool documentation; sparse or noisy descriptions can degrade retrieval quality.
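One mitigation for the planner‑overhead limitation, mentioned above as caching, can be sketched in a few lines. The `retrieve` function here is a hypothetical stand‑in for an expensive dense‑retrieval call; since sub‑goal queries ("send email", "format as csv") recur across many requests, memoizing them removes redundant back‑end hits.

```python
from functools import lru_cache

CALLS = 0  # counts actual back-end hits, to show the cache working

@lru_cache(maxsize=1024)
def retrieve(query: str) -> str:
    # Stand-in for an expensive dense-retrieval call; repeated queries
    # are served from the cache instead of re-hitting the index.
    global CALLS
    CALLS += 1
    return f"tool_for::{query}"

for q in ["send email", "format as csv", "send email", "send email"]:
    retrieve(q)
```

Early‑stop heuristics (halting the loop once the verifier's requirements are plausibly covered) would compose naturally with such a cache.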
Future Directions
- Incorporating few‑shot human demonstrations to enrich the planner’s understanding of niche domains.
- Extending RLVR to reward efficiency (e.g., minimizing number of queries) alongside correctness.
- Exploring multimodal tool descriptors (code snippets, schema diagrams) to further close the semantic gap.
Bottom line: TOOLQP reframes tool retrieval from a static "match‑once" problem into a dynamic planning exercise, delivering a practical boost for any LLM‑powered system that must navigate large, evolving tool libraries. For developers building AI assistants, automation bots, or any agent that needs to call external services, integrating TOOLQP could meaningfully improve both reliability and developer experience.
Authors
- Wei Fang
- James Glass
Paper Information
- arXiv ID: 2601.07782v1
- Categories: cs.CL, cs.AI, cs.IR
- Published: January 12, 2026