[Paper] DocDancer: Towards Agentic Document-Grounded Information Seeking
Source: arXiv - 2601.05163v1
Overview
DocDancer tackles a core limitation of current document‑question answering (DocQA) systems: they treat a document as a static text blob and rely heavily on large, closed‑source language models. By reframing DocQA as an information‑seeking task and giving the agent a toolbox for document exploration, the authors deliver an open‑source, end‑to‑end trainable system that can navigate and synthesize answers from long, real‑world documents.
Key Contributions
- Agentic framework for DocQA – Introduces a tool‑driven architecture that separates exploration (searching, retrieving, summarizing) from synthesis (answer generation).
- Exploration‑then‑Synthesis data pipeline – Generates high‑quality synthetic training triples (question, exploration trace, answer) to overcome the scarcity of annotated DocQA data.
- Open‑source implementation – Provides a fully trainable DocQA agent built on publicly available LLM backbones, removing the dependency on proprietary models.
- Benchmark validation – Demonstrates strong performance on two long‑context benchmarks (MMLongBench‑Doc, DocBench), outperforming baselines that lack explicit tool use.
- Insightful analysis – Offers empirical guidance on tool design (e.g., retrieval vs. summarization modules) and the impact of synthetic data quality.
Methodology
- Problem Reformulation – The authors view answering a question about a document as a multi‑step information‑seeking process, akin to how a human would skim, locate relevant passages, and then compose an answer.
- Tool‑Driven Agent Architecture (a minimal sketch of the resulting loop appears after this list)
  - Exploration Module: A set of deterministic tools (keyword search, passage retrieval, summarizer, table extractor, etc.) that the agent can invoke. Each tool returns a concise result that is fed back into the agent's reasoning loop.
  - Synthesis Module: A language model that consumes the accumulated exploration context and generates the final answer.
  - The agent's policy is learned end‑to‑end: given a question, it decides which tool to call next and when to stop and answer.
- Exploration‑then‑Synthesis Data Synthesis (see the second sketch after this list)
  - Start with raw documents and automatically generate question prompts using heuristics and LLM‑based question generators.
  - Simulate an "explorer" that runs a scripted sequence of tool calls to locate the answer span, recording the tool‑usage trace.
  - The final answer is produced by a strong LLM (teacher) using the same trace, creating a high‑quality (question, trace, answer) triple.
  - This synthetic dataset trains the agent to mimic the exploration‑then‑synthesis workflow.
- Training & Inference – The policy network (a lightweight transformer) is trained with supervised learning on the synthetic triples, then fine‑tuned on any available human‑annotated DocQA data. During inference, the agent dynamically decides which tool to invoke until a stopping criterion is met.
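To make the exploration‑then‑synthesis loop concrete, here is a minimal Python sketch. It is an illustration under assumed interfaces, not the authors' code: the tool signature, `policy_llm.next_action`, and `synthesis_llm.generate` are hypothetical names standing in for whatever the paper's policy and synthesis models actually expose.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical tool interface: each tool takes a query string and returns a concise text result.
Tool = Callable[[str], str]


@dataclass
class ExplorationState:
    question: str
    trace: list[tuple[str, str, str]] = field(default_factory=list)  # (tool, query, result)

    def context(self) -> str:
        """Flatten the accumulated tool calls into a prompt-friendly context."""
        return "\n".join(f"[{tool}] {query} -> {result}" for tool, query, result in self.trace)


def answer_question(question: str,
                    tools: dict[str, Tool],
                    policy_llm,        # assumed to expose .next_action(question, context) -> dict
                    synthesis_llm,     # assumed to expose .generate(question, context) -> str
                    max_steps: int = 5) -> str:
    """Exploration-then-synthesis: call tools until the policy decides to stop, then answer."""
    state = ExplorationState(question)
    for _ in range(max_steps):
        # The policy sees the question plus everything gathered so far and either
        # picks (tool, query) or signals that exploration is finished.
        action = policy_llm.next_action(question=question, context=state.context())
        if action["type"] == "stop":
            break
        tool_name, tool_query = action["tool"], action["query"]
        result = tools[tool_name](tool_query)
        state.trace.append((tool_name, tool_query, result))
    # Synthesis: generate the final answer from the accumulated exploration context.
    return synthesis_llm.generate(question=question, context=state.context())
```

The data‑synthesis pipeline can be pictured the same way: a function that turns a raw document into one (question, trace, answer) training triple. The `question_generator`, `explorer`, and `teacher_llm` wrappers below are likewise hypothetical placeholders for the components described in the paper.

```python
import json

def build_training_triple(document: str,
                          question_generator,  # assumed: .generate_question(doc) -> str
                          explorer,            # assumed: .explore(doc, question) -> list of tool calls
                          teacher_llm):        # assumed: .answer(question, trace) -> str
    """Produce one (question, exploration trace, answer) triple from a raw document."""
    question = question_generator.generate_question(document)
    trace = explorer.explore(document, question)   # e.g. [{"tool": "search", "query": ..., "result": ...}, ...]
    answer = teacher_llm.answer(question=question, trace=trace)
    return {"question": question, "trace": trace, "answer": answer}

def write_jsonl(triples, path: str) -> None:
    """Serialize triples as JSONL, a common format for supervised fine-tuning of the policy."""
    with open(path, "w", encoding="utf-8") as f:
        for t in triples:
            f.write(json.dumps(t, ensure_ascii=False) + "\n")
```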
Results & Findings
| Benchmark | Baseline (no tools) | DocDancer (open‑source) | Closed‑source LLM |
|---|---|---|---|
| MMLongBench‑Doc | 42.7% EM | 55.3% EM | 58.1% EM |
| DocBench | 38.4% EM | 51.9% EM | 53.6% EM |
- Tool usage matters: Ablation studies show that removing the retrieval tool drops EM by ~8 points, confirming that explicit exploration improves answer accuracy.
- Synthetic data quality: Training solely on synthetic triples yields ~90 % of the performance of models trained on the limited human‑annotated set, demonstrating the pipeline’s effectiveness.
- Efficiency: The agent typically makes 3–5 tool calls per query, keeping latency under 2 seconds on a single GPU, comparable to vanilla LLM inference.
Practical Implications
- Enterprise Knowledge Bases – Companies can deploy DocDancer to let employees query internal PDFs, manuals, or policy documents without exposing proprietary LLM APIs.
- Legal & Compliance Automation – The tool‑driven approach can be extended with domain‑specific extractors (e.g., clause finders) to surface relevant contract language quickly.
- Developer‑Friendly SDK – Because the system is open‑source and modular, developers can plug in custom tools (e.g., code search, API docs) to build specialized "document assistants" (see the sketch after this list).
- Cost Reduction – By relying on smaller open models plus deterministic tools, organizations can achieve near‑state‑of‑the‑art performance while cutting inference costs dramatically.
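The paper describes the tool set as modular, but it does not prescribe a specific plug‑in API; the registry pattern below is purely an assumed illustration of how a custom, domain‑specific tool (such as the clause finder mentioned above) could be exposed to the agent.

```python
from typing import Callable

# Hypothetical tool registry; names and decorator are assumptions, not DocDancer's actual SDK.
TOOL_REGISTRY: dict[str, Callable[[str], str]] = {}

def register_tool(name: str):
    """Decorator that adds a callable tool to the agent's toolbox under a given name."""
    def decorator(fn: Callable[[str], str]) -> Callable[[str], str]:
        TOOL_REGISTRY[name] = fn
        return fn
    return decorator

@register_tool("clause_finder")
def clause_finder(query: str) -> str:
    """Domain-specific extractor, e.g. surfacing contract clauses that mention the query term."""
    # Placeholder logic; a real tool would search an indexed contract corpus.
    return f"clauses matching: {query}"
```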
Limitations & Future Work
- Synthetic Bias – The data synthesis pipeline inherits biases from the LLM used to generate questions and answers; rare or highly nuanced queries may still be under‑represented.
- Tool Set Scope – Current tools focus on plain text retrieval and summarization; handling complex structures like nested tables, figures, or multimodal content remains an open challenge.
- Scalability to Massive Corpora – While effective on single‑document contexts, extending the exploration policy to search across thousands of documents will require more sophisticated indexing and retrieval strategies.
- User Interaction – The present agent operates autonomously; future work could incorporate interactive clarification loops with users to resolve ambiguous questions.
DocDancer demonstrates that giving a language model a well‑designed toolbox and training it on realistic exploration traces can bridge the gap between research‑grade DocQA and production‑ready, cost‑effective document assistants. Developers interested in building next‑generation knowledge‑base bots should keep an eye on this agentic paradigm.
Authors
- Qintong Zhang
- Xinjie Lv
- Jialong Wu
- Baixuan Li
- Zhengwei Tao
- Guochen Yan
- Huanyao Zhang
- Bin Wang
- Jiahao Xu
- Haitao Mi
- Wentao Zhang
Paper Information
- arXiv ID: 2601.05163v1
- Categories: cs.CL
- Published: January 8, 2026