[Paper] Tool-Aware Planning in Contact Center AI: Evaluating LLMs through Lineage-Guided Query Decomposition

Published: February 16, 2026
5 min read
Source: arXiv - 2602.14955v1

Overview

The paper introduces a domain‑specific benchmark for evaluating how well large language models (LLMs) can generate tool‑aware execution plans in contact‑center environments. By breaking down a business‑intelligence question into a sequence of concrete steps that call structured tools (e.g., Text‑to‑SQL on Snowflake) and unstructured tools (e.g., Retrieval‑Augmented Generation over call transcripts), the authors expose the current strengths and blind spots of LLMs when they act as autonomous agents.

Key Contributions

  • Reference‑based evaluation framework
    • Two evaluation modes: a metric‑wise scorer covering seven dimensions (tool‑prompt alignment, query adherence, step executability, etc.) and a one‑shot human‑like match evaluator.
  • Iterative data‑curation pipeline
    • An evaluator → optimizer loop that automatically refines raw LLM‑generated plans into high‑quality plan lineages (ordered revisions), dramatically cutting manual annotation effort.
  • Large‑scale empirical study
    • Benchmarked 14 LLMs (Claude, GPT‑4, Llama, Mistral, etc.) across model sizes and families, testing both with and without plan lineage prompts.
    • Identified systematic failure modes: compound queries and plans longer than four steps, even though most real‑world queries require 5–15 steps.
    • Best overall metric score: 84.8 % (Claude‑3‑7‑Sonnet).
    • Best one‑shot “A+” match rate: 49.75 % (o3‑mini).
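
The metric‑wise scorer described above can be illustrated as a simple aggregation over seven per‑dimension scores. This is only a sketch: the dimension names follow the paper, but the equal weighting and the `[0, 1]` score range are assumptions, not the authors' actual implementation.

```python
# Hypothetical sketch of the metric-wise scorer: average seven
# per-dimension scores (each assumed to be in [0, 1]) into one
# overall plan score. Equal weighting is an assumption.

DIMENSIONS = [
    "tool_prompt_alignment",
    "query_adherence",
    "step_completeness",
    "order_correctness",
    "parallelism_validity",
    "executability",
    "overall_coherence",
]

def metric_wise_score(scores: dict[str, float]) -> float:
    """Aggregate per-dimension scores into a single value in [0, 1]."""
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing dimension scores: {missing}")
    return sum(scores[d] for d in DIMENSIONS) / len(DIMENSIONS)
```

A reported headline number like 84.8 % would then correspond to an average of roughly 0.848 across these dimensions, however the paper actually weights them.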

Methodology

  1. Task definition – The target use case is a contact‑center analyst asking a data‑driven question (e.g., “What was the churn rate for customers who called about billing in the last quarter?”). The answer must be assembled by orchestrating:
    • Structured tools: Text‑to‑SQL generation that runs against a Snowflake data warehouse.
    • Unstructured tools: RAG over call‑transcript embeddings.
  2. Plan representation – A plan is a list of steps, each annotated with:
    • The tool to invoke,
    • The prompt (or query) fed to that tool,
    • Optional depends_on links to enable parallel execution.
  3. Reference‑based evaluation – For each generated plan, the framework compares it against a gold‑standard reference across seven dimensions:
    • Tool‑prompt alignment (does the prompt match the tool’s expected input?)
    • Query adherence (does the plan stay on topic?)
    • Step completeness, order correctness, parallelism validity, executability, and overall coherence.
  4. Plan lineage creation – Starting from a raw LLM plan, the evaluator flags deficiencies; an optimizer (a second LLM or rule‑based system) rewrites the plan. This loop repeats until the plan reaches a predefined quality threshold, yielding a lineage of revisions.
  5. Experimental setup – Each of the 14 LLMs is prompted in two ways:
    • Zero‑shot (no lineage context).
    • Lineage‑aware (the model sees the previous revision(s) and is asked to improve).
      Performance is aggregated per model and per metric, and the “A+” tier (Extremely Good / Very Good) is reported for the one‑shot evaluator.
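
The plan representation and the evaluator → optimizer loop above can be sketched in Python. Everything here is an illustrative assumption rather than the paper's actual code: the dataclass fields mirror the description in step 2, and the `evaluate`/`optimize` callables, quality threshold, and round limit stand in for the scorer and the rewriting LLM or rule‑based system.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Step:
    tool: str       # e.g. "text_to_sql" or "rag_transcripts" (names assumed)
    prompt: str     # the query fed to that tool
    depends_on: list[int] = field(default_factory=list)  # prerequisite step indices

@dataclass
class Plan:
    steps: list[Step]

def refine_plan(
    plan: Plan,
    evaluate: Callable[[Plan], float],   # scorer returning a quality in [0, 1]
    optimize: Callable[[Plan], Plan],    # rewriting LLM or rule-based system
    threshold: float = 0.85,             # assumed quality threshold
    max_rounds: int = 5,
) -> list[Plan]:
    """Evaluator -> optimizer loop; returns the lineage of plan revisions."""
    lineage = [plan]
    for _ in range(max_rounds):
        if evaluate(lineage[-1]) >= threshold:
            break
        lineage.append(optimize(lineage[-1]))
    return lineage
```

The returned list is exactly a "plan lineage" in the paper's sense: an ordered sequence of revisions, with the last element being the plan that cleared the quality bar (or the best attempt within the round budget).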

Results & Findings

| Metric | Best Score | Model |
| --- | --- | --- |
| Overall metric‑wise score | 84.8 % | Claude‑3‑7‑Sonnet |
| One‑shot “A+” match rate | 49.75 % | o3‑mini |
| Average step executability (≥ 4 steps) | ~30 % | Across all models |
  • Compound queries are hard – When a question requires joining multiple data sources or mixing structured and unstructured evidence, LLMs often drop a step or misuse a tool.
  • Plan length matters – Accuracy drops sharply after the fourth step; most models struggle to keep track of dependencies beyond that point.
  • Lineage helps selectively – Providing prior revisions improves the top‑performing models (Claude, GPT‑4) by ~5–7 % on executability, but the gain is marginal for smaller models.
  • Tool‑prompt alignment is the weakest dimension – Even the best models frequently generate prompts that do not match the expected schema of the downstream tool, leading to runtime failures.

Practical Implications

  1. Designing AI‑augmented contact‑center agents – Developers should limit plan depth (prefer 3–4 steps) or decompose long queries into multiple, smaller sub‑queries that can be solved independently.
  2. Prompt engineering for tool use – Explicitly include tool schemas and example prompts in the system prompt; this mitigates mis‑alignment observed in the study.
  3. Leveraging plan lineage – When building a production pipeline, incorporate an iterative refinement loop (evaluator → optimizer) to automatically polish LLM‑generated plans before execution, especially for high‑value queries.
  4. Model selection – For mission‑critical analytics, opting for larger, instruction‑tuned models (Claude‑3, GPT‑4) yields better overall plan quality, but the cost‑benefit trade‑off must be weighed against the modest one‑shot success rates.
  5. Tool‑wrapper APIs – Expose a standardized “plan execution” API that validates each step’s prompt against the tool’s contract before dispatch, catching alignment errors early and providing feedback to the LLM for the next refinement pass.
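
A minimal version of the step‑validation idea in point 5 might look like the following. The contract format (a per‑tool predicate over the prompt) and the specific checks are assumptions for illustration; a production system would validate against each tool's real input schema.

```python
# Hypothetical "plan execution" wrapper: validate each step's prompt
# against a declared tool contract before dispatch, so tool-prompt
# alignment errors are caught early and can be fed back to the LLM.

TOOL_CONTRACTS = {
    # Illustrative predicates only; real contracts would check schemas.
    "text_to_sql": lambda prompt: bool(prompt.strip()),
    "rag_transcripts": lambda prompt: len(prompt.split()) >= 3,
}

def validate_step(tool: str, prompt: str) -> list[str]:
    """Return validation errors; an empty list means the step is dispatchable."""
    errors = []
    if tool not in TOOL_CONTRACTS:
        errors.append(f"unknown tool: {tool}")
    elif not TOOL_CONTRACTS[tool](prompt):
        errors.append(f"prompt does not satisfy the contract for {tool}")
    return errors
```

Returning the error list (rather than raising) makes it easy to pass the failures back to the planner LLM as feedback for the next refinement pass, which is the loop the study's findings on tool‑prompt misalignment motivate.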

Limitations & Future Work

  • Domain confinement – The benchmark focuses on contact‑center data‑analysis; results may not transfer directly to other domains (e.g., code generation, medical QA).
  • Evaluation reliance on reference plans – While the metric‑wise scorer is comprehensive, it still depends on human‑crafted gold references, which could embed bias.
  • Scalability of lineage generation – The optimizer loop reduces manual effort but still incurs extra compute; future work could explore self‑critiquing LLMs that internalize the evaluator’s feedback.
  • Tool diversity – Only Text‑to‑SQL and RAG were examined; extending to more heterogeneous tools (e.g., dashboards, external APIs) would test the generality of the framework.
  • Real‑time constraints – The study does not measure latency; integrating plan validation and refinement into a low‑latency production system remains an open challenge.

Bottom line: The paper provides a concrete, reproducible methodology for measuring how well LLMs can act as autonomous planners in a tool‑rich environment. For developers building AI‑driven contact‑center assistants, the findings underscore the importance of short, well‑structured plans, explicit tool schemas, and iterative refinement to bridge the gap between LLM reasoning and reliable tool execution.

Authors

  • Varun Nathan
  • Shreyas Guha
  • Ayush Kumar

Paper Information

  • arXiv ID: 2602.14955v1
  • Categories: cs.CL, cs.SE
  • Published: February 16, 2026
