[Paper] DocDancer: Towards Agentic Document-Grounded Information Seeking
Source: arXiv - 2601.05163v1
Overview
DocDancer tackles a core limitation of current document‑question answering (DocQA) systems: they treat a document as a static text blob and rely heavily on large, closed‑source language models. By reframing DocQA as an information‑seeking task and giving the agent a toolbox for document exploration, the authors deliver an open‑source, end‑to‑end trainable system that can navigate and synthesize answers from long, real‑world documents.
Key Contributions
- Agentic framework for DocQA – Introduces a tool‑driven architecture that separates exploration (searching, retrieving, summarizing) from synthesis (answer generation).
- Exploration‑then‑Synthesis data pipeline – Generates high‑quality synthetic training triples (question, exploration trace, answer) to overcome the scarcity of annotated DocQA data.
- Open‑source implementation – Provides a fully trainable DocQA agent built on publicly available LLM backbones, removing the dependency on proprietary models.
- Benchmark validation – Demonstrates strong performance on two long‑context benchmarks (MMLongBench‑Doc, DocBench), outperforming baselines that lack explicit tool use.
- Insightful analysis – Offers empirical guidance on tool design (e.g., retrieval vs. summarization modules) and the impact of synthetic data quality.
Methodology
- Problem Reformulation – The authors view answering a question about a document as a multi‑step information‑seeking process, akin to how a human would skim, locate relevant passages, and then compose an answer.
- Tool‑Driven Agent Architecture (a minimal sketch of the resulting loop appears after this list)
  - Exploration Module: A set of deterministic tools (keyword search, passage retrieval, summarizer, table extractor, etc.) that the agent can invoke. Each tool returns a concise result that is fed back into the agent's reasoning loop.
  - Synthesis Module: A language model that consumes the accumulated exploration context and generates the final answer.
  - The agent's policy is learned end‑to‑end: given a question, it decides which tool to call next and when to stop and answer.
- Exploration‑then‑Synthesis Data Synthesis (see the second sketch after this list)
  - Start with raw documents and automatically generate question prompts using heuristics and LLM‑based question generators.
  - Simulate an "explorer" that runs a scripted sequence of tool calls to locate the answer span, recording the tool‑usage trace.
  - The final answer is produced by a strong LLM (teacher) using the same trace, creating a high‑quality (question, trace, answer) triple.
  - This synthetic dataset trains the agent to mimic the exploration‑then‑synthesis workflow.
- Training & Inference – The policy network (a lightweight transformer) is trained with supervised learning on the synthetic triples, then fine‑tuned on any available human‑annotated DocQA data. During inference, the agent dynamically decides which tool to invoke until a stopping criterion is met.
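To make the exploration‑then‑synthesis loop concrete, here is a minimal Python sketch. It is an illustration under assumed interfaces, not the authors' code: the tool signature, `policy_llm.next_action`, and `synthesis_llm.generate` are hypothetical names standing in for whatever the paper's policy and synthesis models actually expose.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical tool interface: each tool takes a query string and returns a concise text result.
Tool = Callable[[str], str]


@dataclass
class ExplorationState:
    question: str
    trace: list[tuple[str, str, str]] = field(default_factory=list)  # (tool, query, result)

    def context(self) -> str:
        """Flatten the accumulated tool calls into a prompt-friendly context."""
        return "\n".join(f"[{tool}] {query} -> {result}" for tool, query, result in self.trace)


def answer_question(question: str,
                    tools: dict[str, Tool],
                    policy_llm,        # assumed to expose .next_action(question, context) -> dict
                    synthesis_llm,     # assumed to expose .generate(question, context) -> str
                    max_steps: int = 5) -> str:
    """Exploration-then-synthesis: call tools until the policy decides to stop, then answer."""
    state = ExplorationState(question)
    for _ in range(max_steps):
        # The policy sees the question plus everything gathered so far and either
        # picks (tool, query) or signals that exploration is finished.
        action = policy_llm.next_action(question=question, context=state.context())
        if action["type"] == "stop":
            break
        tool_name, tool_query = action["tool"], action["query"]
        result = tools[tool_name](tool_query)
        state.trace.append((tool_name, tool_query, result))
    # Synthesis: generate the final answer from the accumulated exploration context.
    return synthesis_llm.generate(question=question, context=state.context())
```

The data‑synthesis pipeline can be pictured the same way: a function that turns a raw document into one (question, trace, answer) training triple. The `question_generator`, `explorer`, and `teacher_llm` wrappers below are likewise hypothetical placeholders for the components described in the paper.

```python
import json

def build_training_triple(document: str,
                          question_generator,  # assumed: .generate_question(doc) -> str
                          explorer,            # assumed: .explore(doc, question) -> list of tool calls
                          teacher_llm):        # assumed: .answer(question, trace) -> str
    """Produce one (question, exploration trace, answer) triple from a raw document."""
    question = question_generator.generate_question(document)
    trace = explorer.explore(document, question)   # e.g. [{"tool": "search", "query": ..., "result": ...}, ...]
    answer = teacher_llm.answer(question=question, trace=trace)
    return {"question": question, "trace": trace, "answer": answer}

def write_jsonl(triples, path: str) -> None:
    """Serialize triples as JSONL, a common format for supervised fine-tuning of the policy."""
    with open(path, "w", encoding="utf-8") as f:
        for t in triples:
            f.write(json.dumps(t, ensure_ascii=False) + "\n")
```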
Results & Findings
| Benchmark | Baseline (no tools) | DocDancer (open‑source) | Closed‑source LLM |
|---|---|---|---|
| MMLongBench‑Doc | 42.7% EM | 55.3% EM | 58.1% EM |
| DocBench | 38.4% EM | 51.9% EM | 53.6% EM |
- Tool usage matters: Ablation studies show that removing the retrieval tool drops EM by ~8 points, confirming that explicit exploration improves answer accuracy.
- Synthetic data quality: Training solely on synthetic triples yields ~90 % of the performance of models trained on the limited human‑annotated set, demonstrating the pipeline’s effectiveness.
- Efficiency: The agent typically makes 3–5 tool calls per query, keeping latency under 2 seconds on a single GPU, comparable to vanilla LLM inference.
Practical Implications
- Enterprise Knowledge Bases – Companies can deploy DocDancer to let employees query internal PDFs, manuals, or policy documents without exposing proprietary LLM APIs.
- Legal & Compliance Automation – The tool‑driven approach can be extended with domain‑specific extractors (e.g., clause finders) to surface relevant contract language quickly.
- Developer‑Friendly SDK – Because the system is open‑source and modular, developers can plug in custom tools (e.g., code search, API docs) to build specialized "document assistants" (see the sketch after this list).
- Cost Reduction – By relying on smaller open models plus deterministic tools, organizations can achieve near‑state‑of‑the‑art performance while cutting inference costs dramatically.
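The paper describes the tool set as modular, but it does not prescribe a specific plug‑in API; the registry pattern below is purely an assumed illustration of how a custom, domain‑specific tool (such as the clause finder mentioned above) could be exposed to the agent.

```python
from typing import Callable

# Hypothetical tool registry; names and decorator are assumptions, not DocDancer's actual SDK.
TOOL_REGISTRY: dict[str, Callable[[str], str]] = {}

def register_tool(name: str):
    """Decorator that adds a callable tool to the agent's toolbox under a given name."""
    def decorator(fn: Callable[[str], str]) -> Callable[[str], str]:
        TOOL_REGISTRY[name] = fn
        return fn
    return decorator

@register_tool("clause_finder")
def clause_finder(query: str) -> str:
    """Domain-specific extractor, e.g. surfacing contract clauses that mention the query term."""
    # Placeholder logic; a real tool would search an indexed contract corpus.
    return f"clauses matching: {query}"
```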
Limitations & Future Work
- Synthetic Bias – The data synthesis pipeline inherits biases from the LLM used to generate questions and answers; rare or highly nuanced queries may still be under‑represented.
- Tool Set Scope – Current tools focus on plain text retrieval and summarization; handling complex structures like nested tables, figures, or multimodal content remains an open challenge.
- Scalability to Massive Corpora – While effective on single‑document contexts, extending the exploration policy to search across thousands of documents will require more sophisticated indexing and retrieval strategies.
- User Interaction – The present agent operates autonomously; future work could incorporate interactive clarification loops with users to resolve ambiguous questions.
DocDancer demonstrates that giving a language model a well‑designed toolbox and training it on realistic exploration traces can bridge the gap between research‑grade DocQA and production‑ready, cost‑effective document assistants. Developers interested in building next‑generation knowledge‑base bots should keep an eye on this agentic paradigm.
Authors
- Qintong Zhang
- Xinjie Lv
- Jialong Wu
- Baixuan Li
- Zhengwei Tao
- Guochen Yan
- Huanyao Zhang
- Bin Wang
- Jiahao Xu
- Haitao Mi
- Wentao Zhang
Paper Information
- arXiv ID: 2601.05163v1
- Categories: cs.CL
- Published: January 8, 2026