[Paper] Beyond IVR: Benchmarking Customer Support LLM Agents for Business-Adherence
Source: arXiv - 2601.00596v1
Overview
Customer‑support chatbots are moving beyond the stiff, script‑driven world of Interactive Voice Response (IVR). This paper introduces JourneyBench, a new benchmark that tests whether large‑language‑model (LLM) agents can follow real‑world business policies, handle multi‑step workflows, and stay robust when users or systems behave unpredictably. The authors show that a modest redesign of the prompting strategy can dramatically improve policy compliance—even letting a smaller model beat a larger one.
Key Contributions
- JourneyBench benchmark: a graph‑based framework that generates realistic, multi‑step support scenarios across three business domains.
- User Journey Coverage Score (UJCS): a novel metric that quantifies how well an agent follows prescribed policies and completes all required sub‑tasks.
- Two agent architectures:
- Static‑Prompt Agent (SPA) – a single, fixed prompt that states the policy rules once and leaves progress tracking to the LLM itself.
- Dynamic‑Prompt Agent (DPA) – a prompt that is updated on the fly to reflect the current policy state and task dependencies (a prompt‑construction sketch for both styles follows this list).
- Comprehensive evaluation: 703 simulated conversations, comparing GPT‑4o, GPT‑4o‑mini, Claude‑3, and Llama‑2‑70B under both SPA and DPA setups.
- Empirical insight: DPA consistently outperforms SPA, and the smaller GPT‑4o‑mini with DPA surpasses the larger GPT‑4o with SPA, highlighting the power of structured orchestration over raw model size.
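To make the two architectures concrete, here is a minimal prompt‑construction sketch. The rule text, step names, and checklist format are illustrative assumptions, not the paper's actual templates.

```python
# Hypothetical prompt builders for the two agent styles; rules and steps are
# invented for illustration and do not come from JourneyBench.

POLICY_RULES = [
    "Verify the customer's identity before any account change.",
    "Check inventory before promising a replacement.",
    "Issue a refund only after the return label is generated.",
]

def build_spa_prompt():
    """Static prompt: the full policy is stated once and the LLM must
    remember its own progress across turns."""
    rules = "\n".join(f"- {r}" for r in POLICY_RULES)
    return (
        "You are a customer-support agent. Follow every policy rule:\n"
        f"{rules}\n"
        "Handle the conversation end to end."
    )

def build_dpa_prompt(completed, pending):
    """Dynamic prompt: a controller re-renders the checklist each turn,
    so the current policy state is always explicit in context."""
    done = "\n".join(f"[x] {s}" for s in completed) or "(none yet)"
    todo = "\n".join(f"[ ] {s}" for s in pending)
    return (
        "You are a customer-support agent. Steps completed so far:\n"
        f"{done}\n"
        "Remaining required steps, in order:\n"
        f"{todo}\n"
        "Work only on the next pending step."
    )

if __name__ == "__main__":
    print(build_spa_prompt())
    print(build_dpa_prompt(["verify identity"], ["check inventory", "issue refund"]))
```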
Methodology
- Scenario Generation – Business processes (e.g., order returns, account upgrades, troubleshooting) are encoded as directed graphs whose nodes represent atomic actions (verify identity, check inventory, issue refund) and whose edges encode policy‑driven dependencies. Random walks through these graphs produce diverse conversation “journeys” (a toy graph sketch appears at the end of this section).
- Agent Design
- SPA: The LLM receives a single, static system prompt describing the overall task and a list of policy rules. It must keep track of progress internally.
- DPA: After each turn, a lightweight controller updates a policy state (which nodes are completed, which are pending) and injects this state into the next prompt. This explicit context acts like a checklist for the LLM.
- Evaluation – For each conversation, the ground‑truth graph is known. The UJCS measures the proportion of required nodes that the agent executes in the correct order, penalizing missed or out‑of‑order steps (a toy scoring sketch follows this list). Human annotators also verify a subset of conversations for quality control.
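The exact UJCS formula is defined in the paper; as a rough reading, it can be approximated as the fraction of required journey nodes that appear, in the prescribed order, in the agent's action trace. A minimal sketch of that reading, not the authors' implementation:

```python
# Approximation of UJCS: credit only for required nodes executed in the
# prescribed order (longest in-order match), so missed or out-of-order
# steps lose credit. Not the paper's exact formula.

def ujcs(required, executed):
    """Fraction of required journey nodes completed in the right order."""
    if not required:
        return 1.0
    m, n = len(required), len(executed)
    # Longest common subsequence between the required journey and the
    # agent's action trace, via the classic dynamic programme.
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if required[i - 1] == executed[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n] / m

# The agent skipped "check inventory", so it earns credit for 2 of 3 steps.
print(ujcs(
    ["verify identity", "check inventory", "issue refund"],
    ["verify identity", "issue refund"],
))  # ~0.67
```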
The whole pipeline is open‑source, making it easy for developers to plug in their own LLMs or policy graphs.
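To give a feel for the scenario generator, the snippet below encodes a toy policy graph and samples one journey with a random walk; the node names and branching are invented for illustration and are not JourneyBench's actual domains.

```python
# Toy policy graph: nodes are atomic actions, directed edges encode which
# actions may legally follow. A random walk to a terminal node yields one
# ground-truth "journey" for the agent to reproduce.
import random

POLICY_GRAPH = {
    "greet":            ["verify_identity"],
    "verify_identity":  ["check_order", "escalate"],
    "check_order":      ["check_inventory", "issue_refund"],
    "check_inventory":  ["ship_replacement", "issue_refund"],
    "ship_replacement": ["close_ticket"],
    "issue_refund":     ["close_ticket"],
    "escalate":         ["close_ticket"],
    "close_ticket":     [],
}

def sample_journey(start="greet", seed=None):
    """Walk the graph from the start node until a terminal node is reached."""
    rng = random.Random(seed)
    journey, node = [start], start
    while POLICY_GRAPH[node]:
        node = rng.choice(POLICY_GRAPH[node])
        journey.append(node)
    return journey

# Prints one legal path, e.g. ['greet', 'verify_identity', 'check_order', ...]
print(sample_journey(seed=42))
```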
Results & Findings
| Model (Prompt) | UJCS (avg.) | Fully compliant journeys |
|---|---|---|
| GPT‑4o (SPA) | 0.62 | 31 % |
| GPT‑4o (DPA) | 0.78 | 45 % |
| GPT‑4o‑mini (SPA) | 0.55 | 27 % |
| GPT‑4o‑mini (DPA) | 0.81 | 52 % |
| Claude‑3 (SPA) | 0.60 | 30 % |
| Claude‑3 (DPA) | 0.74 | 42 % |
| Llama‑2‑70B (SPA) | 0.48 | 22 % |
| Llama‑2‑70B (DPA) | 0.69 | 38 % |
- Dynamic prompting raises average UJCS by 0.14–0.26 across all models and lifts the share of fully compliant journeys by 12–25 percentage points.
- The smaller GPT‑4o‑mini with DPA outperforms the larger GPT‑4o with SPA, suggesting that a well‑structured orchestration layer can compensate for raw model capacity.
- Errors are dominated by state‑drift (forgetting which step was completed) in SPA, while DPA’s failures are mostly due to ambiguous user utterances that the policy graph does not cover.
Practical Implications
- Design‑first approach: When building AI‑driven support bots, invest in a lightweight policy engine that tracks task progress and feeds that state back into the LLM prompt. This is cheaper and more reliable than fine‑tuning massive models.
- Compliance & Auditing: The UJCS metric gives product teams a quantifiable way to certify that bots obey regulatory or internal SOPs—critical for finance, healthcare, and telecom.
- Rapid prototyping: JourneyBench’s graph generator can model new support flows (e.g., SaaS onboarding, warranty claims) without writing thousands of hand‑crafted test cases.
- Cost savings: Using a smaller model like GPT‑4o‑mini with DPA reduces inference latency and API spend while maintaining higher compliance than a larger model used naïvely.
- Integration hooks: The controller that updates prompts can be implemented as a microservice that consumes existing CRM tickets, policy rule engines, or knowledge‑base APIs, making the solution plug‑and‑play for existing stacks (sketched below).
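As a rough sketch of such an integration hook, the service below tracks per‑ticket policy state and renders the checklist block that an orchestrator would inject into the next LLM prompt. Flask, the endpoint paths, and the field names are illustrative choices, not part of any released code from the paper.

```python
# Minimal prompt-state controller exposed over HTTP (illustrative only).
from flask import Flask, jsonify, request

app = Flask(__name__)
STATE = {}  # ticket_id -> list of completed policy nodes (in-memory for the demo)
REQUIRED_STEPS = ["verify_identity", "check_order", "issue_refund", "close_ticket"]

@app.post("/tickets/<ticket_id>/complete")
def mark_complete(ticket_id):
    """The CRM or tool layer reports that a policy node was executed."""
    step = request.get_json()["step"]
    STATE.setdefault(ticket_id, []).append(step)
    return jsonify(completed=STATE[ticket_id])

@app.get("/tickets/<ticket_id>/prompt-state")
def prompt_state(ticket_id):
    """Return the checklist block to inject into the next LLM prompt."""
    done = STATE.get(ticket_id, [])
    pending = [s for s in REQUIRED_STEPS if s not in done]
    block = (
        "Completed: " + (", ".join(done) or "none")
        + "\nPending (in order): " + ", ".join(pending)
    )
    return jsonify(prompt_block=block, pending=pending)

if __name__ == "__main__":
    app.run(port=8080)
```

In a real deployment the state would live in the CRM or a database rather than in memory, and the orchestrator would call the prompt‑state endpoint before every LLM turn.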
Limitations & Future Work
- Synthetic conversations: While the graph‑based generator creates realistic paths, it may miss the nuance of real customer language, sarcasm, or multi‑intent utterances.
- Domain coverage: The benchmark currently spans three domains; expanding to more regulated sectors (e.g., banking) will test the metric’s robustness.
- Scalability of the controller: The DPA’s prompt‑update loop adds latency; future work could explore tighter integration (e.g., tool‑calling APIs) or caching strategies.
- Human‑in‑the‑loop evaluation: The study relies heavily on automated scoring; deeper user studies would clarify how policy adherence translates to perceived satisfaction.
Overall, the paper makes a strong case that structured orchestration beats raw model size for policy‑driven customer support, and JourneyBench provides a practical yardstick for the next generation of AI agents.
Authors
- Sumanth Balaji
- Piyush Mishra
- Aashraya Sachdeva
- Suraj Agrawal
Paper Information
- arXiv ID: 2601.00596v1
- Categories: cs.CL
- Published: January 2, 2026