[Paper] $τ$-Knowledge: Evaluating Conversational Agents over Unstructured Knowledge
Source: arXiv - 2603.04370v1
Overview
The paper introduces τ‑Knowledge, a new benchmark that pushes conversational agents to retrieve and reason over large, unstructured knowledge bases in realistic, multi‑turn interactions. By extending the earlier τ‑Bench framework with a fintech‑focused “τ‑Banking” domain, the authors expose the challenges of coordinating natural‑language knowledge with tool use—something current evaluations largely ignore.
Key Contributions
- τ‑Knowledge benchmark: a fully agentic evaluation suite that couples unstructured document retrieval with tool‑mediated actions and policy‑compliant state changes.
- τ‑Banking domain: a simulated fintech customer‑support environment containing ~700 interlinked knowledge documents and realistic account‑update tools.
- Comprehensive baselines: experiments with state‑of‑the‑art LLMs (including chain‑of‑thought reasoning, retrieval‑augmented generation, and tool‑use pipelines) across both embedding‑based and terminal‑based search.
- Empirical findings: even the strongest models achieve only ~25% pass rates, and performance degrades sharply when the same query is repeated across multiple trials.
- Open‑source release: the benchmark code, data, and evaluation scripts are publicly available to foster reproducible research on knowledge‑aware agents.
Methodology
- Environment design – The authors built a sandbox that mimics a banking support workflow: a user asks a question, the agent must locate the relevant policy or FAQ document, extract the needed information, and then invoke a simulated tool (e.g., “update‑balance”) to modify the user’s account state.
- Knowledge representation – All documents are raw text; there is no pre‑structured schema. Retrieval is performed either via dense vector embeddings (FAISS) or via a terminal‑style keyword search that mimics a human operator’s “find‑in‑page” behavior.
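The terminal-style search described above can be pictured as a "grep over documents" loop. The sketch below is illustrative only: the `grep_docs` helper and the toy document contents are hypothetical, not from the paper.

```python
import re

def grep_docs(docs: dict, pattern: str) -> list:
    """Return (doc_id, matching line) pairs, mimicking find-in-page keyword search."""
    rx = re.compile(pattern, re.IGNORECASE)
    hits = []
    for doc_id, text in docs.items():
        for line in text.splitlines():
            if rx.search(line):
                hits.append((doc_id, line.strip()))
    return hits

# Toy corpus standing in for the benchmark's raw-text documents.
docs = {
    "faq_wire": "Wire transfers over $10,000 require manager approval.\nFees: $25 domestic.",
    "policy_kyc": "KYC review is required before any account update.",
}
hits = grep_docs(docs, r"wire transfer")
```

Unlike dense retrieval, this approach returns exact lexical matches, which is consistent with the paper's finding that keyword search copes better with heavily cross-referenced documents.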
- Agent pipelines – Several pipelines were tested:
- Pure LLM – the model attempts to answer directly without external retrieval.
- RAG‑style – the model first queries the retriever, then generates a response conditioned on the retrieved snippets.
- Tool‑augmented – after reasoning, the model calls a predefined API to enact account changes, and the system verifies compliance with banking policies.
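The three pipelines can be seen as one turn loop in which retrieval and tool invocation are optional stages. This is a minimal sketch under that assumption; every name here (`run_turn`, `AgentConfig`, the toy LLM, retriever, and tool) is hypothetical, not the paper's API.

```python
from dataclasses import dataclass

@dataclass
class AgentConfig:
    use_retrieval: bool = False  # RAG-style: condition on retrieved snippets
    use_tools: bool = False      # tool-augmented: enact state changes via an API

def run_turn(query, llm, retriever, tools, cfg):
    """One conversational turn through an optional retrieve -> generate -> act loop."""
    context = retriever(query) if cfg.use_retrieval else []
    answer = llm(query, context)
    if cfg.use_tools and answer.get("tool_call"):
        name, args = answer["tool_call"]
        answer["tool_result"] = tools[name](**args)  # policy checks would sit in the tool layer
    return answer

# Toy stand-ins so the loop can be exercised end to end.
def toy_retriever(query):
    return ["Fees: $25 per domestic wire transfer."]

def toy_llm(query, context):
    if context:  # grounded answer plus a tool request
        return {"text": f"Per policy: {context[0]}", "tool_call": ("update_balance", {"delta": -25})}
    return {"text": "I'm not sure.", "tool_call": None}

balance = {"value": 100}

def update_balance(delta):
    balance["value"] += delta
    return balance["value"]

cfg = AgentConfig(use_retrieval=True, use_tools=True)
result = run_turn("What is the wire fee?", toy_llm, toy_retriever,
                  {"update_balance": update_balance}, cfg)
```

Setting both flags to `False` recovers the pure-LLM baseline; enabling only `use_retrieval` recovers the RAG-style pipeline.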
- Evaluation metric – A task is considered passed only if the final account state matches the ground‑truth outcome and the intermediate reasoning steps can be verified against the retrieved documents (the τ‑knowledge “pass” metric).
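The pass criterion combines a hard state check with a grounding check on the reasoning. A toy sketch follows; the substring-based grounding test is a simplifying stand-in for whatever verification procedure the paper actually uses, and all data here is invented.

```python
def tau_pass(final_state, gold_state, reasoning_steps, retrieved_docs):
    """Pass only if the end state matches the ground truth AND every
    cited claim can be located in the retrieved documents."""
    if final_state != gold_state:
        return False
    corpus = " ".join(retrieved_docs)
    return all(claim in corpus for claim in reasoning_steps)

docs = ["Domestic wire fee is $25.", "Fees are debited from the source account."]
ok = tau_pass({"balance": 75}, {"balance": 75}, ["wire fee is $25"], docs)
bad = tau_pass({"balance": 80}, {"balance": 75}, ["wire fee is $25"], docs)
```

Note that a fluent but ungrounded answer fails under this metric even if the final state happens to be correct.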
Results & Findings
| Model / Retrieval | Pass@1 (single attempt) | Pass@5 (5 attempts) |
|---|---|---|
| Pure LLM (GPT‑4) | 12% | 18% |
| RAG (dense embeddings) | 24% | 31% |
| RAG (terminal search) | 26% | 33% |
| Chain‑of‑thought + tool use (GPT‑4) | 25.5% | 38% |
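Pass@k numbers like those above are commonly computed with the unbiased estimator from code-generation benchmarks: the probability that at least one of k sampled attempts, out of n trials with c successes, passes. This is a standard formula, not necessarily the paper's exact procedure.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # every size-k sample must contain a success
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, a task solved in 2 of 10 trials has pass@1 = 0.2 but pass@5 ≈ 0.78, which is why the Pass@5 column sits well above Pass@1.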
- Retrieval matters: dense embeddings slightly underperform terminal search because the banking docs are heavily cross‑referenced, making exact keyword matches more reliable.
- Reasoning budget: Giving the model more “thinking steps” improves pass rates modestly, but the ceiling remains low.
- Reliability drop: When the same query is repeated across multiple simulated sessions, success rates fall by ~10% per repetition, indicating brittleness in memory and context handling.
- Policy compliance: Even when the final answer is correct, many agents violate internal policy constraints (e.g., exposing private fields), which the benchmark flags as failures.
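The policy-compliance failure mode above (correct answer, leaked private field) can be caught with a simple output scan. A minimal sketch, assuming regex patterns for sensitive fields; the pattern set and `policy_violations` helper are hypothetical, not the benchmark's checker.

```python
import re

# Hypothetical patterns for private fields an agent must never expose.
PRIVATE_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "account_number": re.compile(r"\b\d{10,12}\b"),
}

def policy_violations(response: str) -> list:
    """Return the names of private-field patterns leaked in a response."""
    return [name for name, rx in PRIVATE_PATTERNS.items() if rx.search(response)]

leaked = policy_violations("Your SSN is 123-45-6789.")
clean = policy_violations("Your request is complete.")
```

Under the benchmark's metric, a response flagged this way counts as a failure even when the substantive answer is right.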
Practical Implications
- Fintech & Customer Support – Deploying LLM‑powered bots that must obey strict regulatory policies (KYC, AML, data privacy) will need more robust retrieval pipelines and explicit policy verification layers.
- Tool Integration – Simply adding a “call‑API” step is not enough; developers must design guardrails that cross‑check tool outputs against retrieved policy text.
- Evaluation Standards – τ‑Knowledge offers a template for building domain‑specific testbeds (e.g., healthcare, legal) where unstructured knowledge and tool use intersect, encouraging more realistic QA before production roll‑out.
- Model Selection – For applications where correctness outweighs fluency, dense‑embedding retrievers may be insufficient; hybrid approaches that combine keyword search with learned reranking could yield better reliability.
- Developer Tooling – The benchmark’s open‑source SDK can be integrated into CI pipelines to automatically flag regressions in knowledge‑aware behavior as models evolve.
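The hybrid keyword-plus-reranking idea floated under Model Selection might look like the following: filter candidates lexically, then rerank the survivors by embedding similarity. This is a sketch under assumed toy embeddings (NumPy arrays); `hybrid_search` and the corpus are illustrative, not a real system.

```python
import numpy as np

def hybrid_search(query_terms, query_vec, docs, doc_vecs, top_k=3):
    """Keyword prefilter, then rerank the survivors by cosine similarity."""
    candidates = [i for i, d in enumerate(docs)
                  if any(t.lower() in d.lower() for t in query_terms)]
    if not candidates:
        candidates = list(range(len(docs)))  # fall back to pure dense search

    def cosine(i):
        denom = np.linalg.norm(query_vec) * np.linalg.norm(doc_vecs[i]) + 1e-9
        return float(query_vec @ doc_vecs[i]) / denom

    return sorted(candidates, key=cosine, reverse=True)[:top_k]

docs = ["wire transfer fee schedule", "overdraft protection policy", "international wire limits"]
doc_vecs = np.eye(3)                   # toy one-hot "embeddings"
query_vec = np.array([0.9, 0.1, 0.4])  # toy query embedding
ranked = hybrid_search(["wire"], query_vec, docs, doc_vecs, top_k=2)
```

The keyword stage preserves the exact-match reliability the paper observed for terminal search, while the dense rerank recovers a useful ordering among the lexical hits.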
Limitations & Future Work
- Synthetic domain – τ‑Banking, while realistic, is still a simulated environment; real‑world banking systems may have additional constraints (latency, multi‑modal data).
- Document scale – The benchmark uses ~700 documents; scaling to millions of pages could expose new bottlenecks in retrieval latency and index management.
- Policy formalization – Current verification relies on textual matching; future work could explore formal policy languages (e.g., Rego, the language used by Open Policy Agent) to enable automated compliance checks.
- User modeling – The benchmark assumes a cooperative user; handling ambiguous or adversarial queries remains an open challenge.
τ‑Knowledge marks a step toward evaluating truly “agentic” LLMs that must blend unstructured knowledge with actionable tools—an essential capability for the next generation of enterprise AI assistants.
Authors
- Quan Shi
- Alexandra Zytek
- Pedram Razavi
- Karthik Narasimhan
- Victor Barres
Paper Information
- arXiv ID: 2603.04370v1
- Categories: cs.AI, cs.CL, cs.IR
- Published: March 4, 2026