[Paper] OfficeQA Pro: An Enterprise Benchmark for End-to-End Grounded Reasoning

Published: March 9, 2026 at 01:34 PM EDT
4 min read

Source: arXiv - 2603.08655v1

Overview

The paper presents OfficeQA Pro, a new benchmark that pushes AI agents to perform grounded reasoning over a massive, real‑world document collection: nearly a century of U.S. Treasury Bulletins (≈ 89 k pages, 26 M numeric values). Unlike typical QA tests that rely on a single passage, OfficeQA Pro forces models to retrieve, parse, and analytically combine information from both free‑text and tabular sources—tasks that are common in enterprise settings such as financial analysis, compliance, and internal knowledge bases.

Key Contributions

  • Enterprise‑scale corpus: Curated a publicly available, heterogeneous dataset (text + tables) spanning 100 years of Treasury data.
  • Grounded multi‑document QA: Designed 133 questions that require precise extraction, cross‑document retrieval, and numeric reasoning.
  • Comprehensive evaluation: Benchmarked leading LLMs (Claude Opus 4.6, GPT‑5.4, Gemini 3.1 Pro) under three conditions—parametric only, web‑augmented, and with direct corpus access.
  • Structured representation boost: Demonstrated a 16.1 % relative performance gain when agents consume a parsed, structured view of documents generated by Databricks’ ai_parse_document.
  • Ablation studies: Analyzed the impact of model size, table encoding, retrieval strategies, and test‑time scaling on accuracy.

Methodology

  1. Corpus preparation – The authors scraped the full archive of Treasury Bulletins, OCR‑processed scanned pages, and extracted tables into a searchable index.
  2. Question design – Each of the 133 queries was crafted to require at least two distinct documents and a mix of textual and numeric reasoning (e.g., “What was the average interest rate for 10‑year bonds in the fiscal years 1975‑1979?”).
  3. Agent configurations – Three experimental setups were used:
    • Parametric only: The model answers from its internal knowledge.
    • Web‑augmented: The model can browse the open web but not the private corpus.
    • Corpus‑provided: The full document set is supplied, either as raw PDFs or as a structured JSON produced by ai_parse_document.
  4. Retrieval pipeline – Standard dense‑vector retrieval (FAISS) was combined with a lightweight re‑ranking step that prioritizes documents containing the required numeric fields.
  5. Evaluation – Accuracy is measured by exact match against a gold answer; partial credit is given for numerically close results (within 1 % tolerance).
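The scoring rule in step 5 can be sketched as a small function. This is a minimal illustration, not the authors' evaluation code; the function name, the number normalization, and the 0.5 partial-credit weight are my assumptions:

```python
def score_answer(pred: str, gold: str, rel_tol: float = 0.01) -> float:
    """Score a predicted answer against the gold answer.

    Exact string match earns full credit; numeric answers within a
    1 % relative tolerance of the gold value earn partial credit.
    """
    if pred.strip() == gold.strip():
        return 1.0
    try:
        # Strip common formatting (commas, percent signs) before comparing.
        p = float(pred.replace(",", "").rstrip("%").strip())
        g = float(gold.replace(",", "").rstrip("%").strip())
    except ValueError:
        return 0.0  # non-numeric and not an exact match
    if g != 0 and abs(p - g) / abs(g) <= rel_tol:
        return 0.5  # numerically close: partial credit
    return 0.0
```

A tolerance-based rule like this matters for Treasury data, where OCR noise or rounding can shift the last digit of an otherwise correct figure.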

Results & Findings

Setup                                      Avg. Accuracy
Parametric only                            < 5 %
Web‑augmented                              < 12 %
Corpus raw (no parsing)                    ≈ 34 %
Corpus + ai_parse_document (structured)    ≈ 40 % (16.1 % relative gain)

  • Even the strongest LLMs struggle to exceed 40 % when the full corpus is available, indicating that retrieval + reasoning remains a bottleneck.
  • Structured representations (tables turned into key‑value pairs, hierarchical headings) consistently help across all models, confirming that raw PDFs are too noisy for current agents.
  • Scaling model size from 7 B to 70 B parameters yields diminishing returns; retrieval quality and document parsing matter more.
  • Table‑specific encodings (e.g., row‑column positional embeddings) improve performance on questions that hinge on numeric aggregation.
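The "tables turned into key-value pairs" representation mentioned above can be illustrated with a minimal flattener. The record schema here is hypothetical; the actual output format of ai_parse_document may differ:

```python
def flatten_table(title: str, headers: list[str],
                  rows: list[list[str]]) -> list[dict]:
    """Turn a parsed table into flat key-value records.

    Each row becomes one record keyed by its column header, with the
    table title attached so a retriever can match it to a query.
    """
    records = []
    for row in rows:
        record = {"table": title}
        record.update(dict(zip(headers, row)))
        records.append(record)
    return records
```

For example, a bond-rate table with headers `["Fiscal Year", "Rate (%)"]` flattens into one dictionary per fiscal year, which is far easier for a retriever to index and for a model to reason over than raw PDF text.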

Practical Implications

  • Enterprise search & analytics – Companies that need AI to answer finance‑ or compliance‑related queries can’t rely on LLMs alone; they must integrate robust document parsing pipelines.
  • Tooling focus – Investing in high‑quality OCR, table extraction, and structured indexing (e.g., Databricks’ ai_parse_document) can yield immediate gains without changing the underlying model.
  • Hybrid architectures – The benchmark suggests a “retrieval‑first, parse‑then‑reason” stack: dense retrieval → structured parsing → LLM reasoning. This pattern aligns with emerging enterprise AI platforms.
  • Risk management – Low accuracy on grounded tasks highlights the danger of deploying LLMs for critical financial decisions without rigorous validation.
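The "retrieval-first, parse-then-reason" stack described above can be sketched as a three-stage pipeline. The stage functions are placeholders for real components (a dense retriever, a document parser, an LLM call); only the orchestration pattern is taken from the paper:

```python
def answer_query(query, retrieve, parse, reason, top_k=5):
    """Run the three-stage enterprise QA pipeline on one query.

    retrieve: query -> candidate documents (dense retrieval)
    parse:    document -> structured record (OCR + table extraction)
    reason:   (query, records) -> answer (LLM reasoning step)
    """
    docs = retrieve(query, top_k)          # 1. dense retrieval over the corpus
    structured = [parse(d) for d in docs]  # 2. parse raw docs into structured records
    return reason(query, structured)       # 3. LLM reasons over parsed evidence
```

Keeping the stages separate lets teams upgrade parsing or retrieval independently of the model, which is where the benchmark's ablations show the largest gains.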

Limitations & Future Work

  • Domain specificity – The corpus is limited to Treasury Bulletins; results may not transfer directly to other domains (legal, medical, etc.).
  • Question set size – Only 133 questions; a larger, more diverse set would better capture edge cases.
  • Retrieval baseline – The study uses a single dense‑vector retriever; exploring hybrid (BM25 + dense) or graph‑based retrieval could further improve performance.
  • Human‑in‑the‑loop – Future work could assess how modest human assistance (e.g., confirming retrieved docs) changes outcomes, moving toward practical enterprise workflows.

Authors

  • Krista Opsahl-Ong
  • Arnav Singhvi
  • Jasmine Collins
  • Ivan Zhou
  • Cindy Wang
  • Ashutosh Baheti
  • Owen Oertell
  • Jacob Portes
  • Sam Havens
  • Erich Elsen
  • Michael Bendersky
  • Matei Zaharia
  • Xing Chen

Paper Information

  • arXiv ID: 2603.08655v1
  • Categories: cs.AI, cs.CL, cs.IR
  • Published: March 9, 2026