[Paper] OfficeQA Pro: An Enterprise Benchmark for End-to-End Grounded Reasoning

Published: March 9, 2026 at 01:34 PM EDT
4 min read

Source: arXiv - 2603.08655v1

Overview

The paper presents OfficeQA Pro, a new benchmark that pushes AI agents to perform grounded reasoning over a massive, real‑world document collection: nearly a century of U.S. Treasury Bulletins (≈ 89 k pages, 26 M numeric values). Unlike typical QA tests that rely on a single passage, OfficeQA Pro forces models to retrieve, parse, and analytically combine information from both free‑text and tabular sources—tasks that are common in enterprise settings such as financial analysis, compliance, and internal knowledge bases.

Key Contributions

  • Enterprise‑scale corpus: Curated a publicly available, heterogeneous dataset (text + tables) spanning 100 years of Treasury data.
  • Grounded multi‑document QA: Designed 133 questions that require precise extraction, cross‑document retrieval, and numeric reasoning.
  • Comprehensive evaluation: Benchmarked leading LLMs (Claude Opus 4.6, GPT‑5.4, Gemini 3.1 Pro) under three conditions—parametric only, web‑augmented, and with direct corpus access.
  • Structured representation boost: Demonstrated a 16.1 % relative performance gain when agents consume a parsed, structured view of documents generated by Databricks’ ai_parse_document.
  • Ablation studies: Analyzed the impact of model size, table encoding, retrieval strategies, and test‑time scaling on accuracy.

Methodology

  1. Corpus preparation – The authors scraped the full archive of Treasury Bulletins, OCR‑processed scanned pages, and extracted tables into a searchable index.
  2. Question design – Each of the 133 queries was crafted to require at least two distinct documents and a mix of textual and numeric reasoning (e.g., “What was the average interest rate for 10‑year bonds in the fiscal years 1975‑1979?”).
  3. Agent configurations – Three experimental setups were used:
    • Parametric only: The model answers from its internal knowledge.
    • Web‑augmented: The model can browse the open web but not the private corpus.
    • Corpus‑provided: The full document set is supplied, either as raw PDFs or as a structured JSON produced by ai_parse_document.
  4. Retrieval pipeline – Standard dense‑vector retrieval (FAISS) was combined with a lightweight re‑ranking step that prioritizes documents containing the required numeric fields.
  5. Evaluation – Accuracy is measured by exact match against a gold answer; partial credit is given for numerically close results (within 1 % tolerance).
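The scoring rule in step 5 can be sketched as a small function. This is a minimal illustration, not the authors' evaluation code; the function name, the number normalization, and the 0.5 partial-credit weight are my assumptions:

```python
def score_answer(pred: str, gold: str, rel_tol: float = 0.01) -> float:
    """Score a predicted answer against the gold answer.

    Exact string match earns full credit; numeric answers within a
    1 % relative tolerance of the gold value earn partial credit.
    """
    if pred.strip() == gold.strip():
        return 1.0
    try:
        # Strip common formatting (commas, percent signs) before comparing.
        p = float(pred.replace(",", "").rstrip("%").strip())
        g = float(gold.replace(",", "").rstrip("%").strip())
    except ValueError:
        return 0.0  # non-numeric and not an exact match
    if g != 0 and abs(p - g) / abs(g) <= rel_tol:
        return 0.5  # numerically close: partial credit
    return 0.0
```

A tolerance-based rule like this matters for Treasury data, where OCR noise or rounding can shift the last digit of an otherwise correct figure.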

Results & Findings

Setup                                      Avg. Accuracy
Parametric only                            < 5 %
Web‑augmented                              < 12 %
Corpus raw (no parsing)                    ≈ 34 %
Corpus + ai_parse_document (structured)    ≈ 40 % (16.1 % relative gain)

  • Even the strongest LLMs struggle to exceed 40 % when the full corpus is available, indicating that retrieval + reasoning remains a bottleneck.
  • Structured representations (tables turned into key‑value pairs, hierarchical headings) consistently help across all models, confirming that raw PDFs are too noisy for current agents.
  • Scaling model size from 7 B to 70 B parameters yields diminishing returns; retrieval quality and document parsing matter more.
  • Table‑specific encodings (e.g., row‑column positional embeddings) improve performance on questions that hinge on numeric aggregation.
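The "tables turned into key-value pairs" representation mentioned above can be illustrated with a minimal flattener. The record schema here is hypothetical; the actual output format of ai_parse_document may differ:

```python
def flatten_table(title: str, headers: list[str],
                  rows: list[list[str]]) -> list[dict]:
    """Turn a parsed table into flat key-value records.

    Each row becomes one record keyed by its column header, with the
    table title attached so a retriever can match it to a query.
    """
    records = []
    for row in rows:
        record = {"table": title}
        record.update(dict(zip(headers, row)))
        records.append(record)
    return records
```

For example, a bond-rate table with headers `["Fiscal Year", "Rate (%)"]` flattens into one dictionary per fiscal year, which is far easier for a retriever to index and for a model to reason over than raw PDF text.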

Practical Implications

  • Enterprise search & analytics – Companies that need AI to answer finance‑ or compliance‑related queries can’t rely on LLMs alone; they must integrate robust document parsing pipelines.
  • Tooling focus – Investing in high‑quality OCR, table extraction, and structured indexing (e.g., Databricks’ ai_parse_document) can yield immediate gains without changing the underlying model.
  • Hybrid architectures – The benchmark suggests a “retrieval‑first, parse‑then‑reason” stack: dense retrieval → structured parsing → LLM reasoning. This pattern aligns with emerging enterprise AI platforms.
  • Risk management – Low accuracy on grounded tasks highlights the danger of deploying LLMs for critical financial decisions without rigorous validation.
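The "retrieval-first, parse-then-reason" stack described above can be sketched as a three-stage pipeline. The stage functions are placeholders for real components (a dense retriever, a document parser, an LLM call); only the orchestration pattern is taken from the paper:

```python
def answer_query(query, retrieve, parse, reason, top_k=5):
    """Run the three-stage enterprise QA pipeline on one query.

    retrieve: query -> candidate documents (dense retrieval)
    parse:    document -> structured record (OCR + table extraction)
    reason:   (query, records) -> answer (LLM reasoning step)
    """
    docs = retrieve(query, top_k)          # 1. dense retrieval over the corpus
    structured = [parse(d) for d in docs]  # 2. parse raw docs into structured records
    return reason(query, structured)       # 3. LLM reasons over parsed evidence
```

Keeping the stages separate lets teams upgrade parsing or retrieval independently of the model, which is where the benchmark's ablations show the largest gains.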

Limitations & Future Work

  • Domain specificity – The corpus is limited to Treasury Bulletins; results may not transfer directly to other domains (legal, medical, etc.).
  • Question set size – Only 133 questions; a larger, more diverse set would better capture edge cases.
  • Retrieval baseline – The study uses a single dense‑vector retriever; exploring hybrid (BM25 + dense) or graph‑based retrieval could further improve performance.
  • Human‑in‑the‑loop – Future work could assess how modest human assistance (e.g., confirming retrieved docs) changes outcomes, moving toward practical enterprise workflows.

Authors

  • Krista Opsahl-Ong
  • Arnav Singhvi
  • Jasmine Collins
  • Ivan Zhou
  • Cindy Wang
  • Ashutosh Baheti
  • Owen Oertell
  • Jacob Portes
  • Sam Havens
  • Erich Elsen
  • Michael Bendersky
  • Matei Zaharia
  • Xing Chen

Paper Information

  • arXiv ID: 2603.08655v1
  • Categories: cs.AI, cs.CL, cs.IR
  • Published: March 9, 2026