[Paper] SPARTA: Scalable and Principled Benchmark of Tree-Structured Multi-hop QA over Text and Tables
Source: arXiv - 2602.23286v1
Overview
The paper introduces SPARTA, a new framework for automatically building large‑scale benchmarks that test a model’s ability to answer questions requiring multi‑hop reasoning across both tables and free‑form text. By generating thousands of high‑quality QA pairs with complex operations (aggregation, grouping, nested queries), SPARTA exposes serious gaps in current cross‑modal QA systems that perform well on existing, shallow benchmarks.
Key Contributions
- Automated benchmark generation: End‑to‑end pipeline that creates Table‑Text QA datasets with minimal human validation (≈¼ the annotation effort of HybridQA).
- Fact‑grounded reference database: Enriches each source table with “grounding tables” derived from atomic facts automatically extracted from accompanying passages.
- Controlled multi‑hop query synthesis: Generates nested SQL‑style queries whose depth matches a target hop count, enabling systematic testing of deep reasoning.
- Provenance‑based refinement: Rewrites syntactically valid queries that would otherwise return empty results, guaranteeing that every question has a non‑empty, executable answer.
- Realistic‑structure enforcement: Restricts generation to post‑order traversals of the query graph, ensuring the resulting natural‑language questions sound fluent and human‑like.
- Comprehensive benchmark: Includes thousands of QA pairs covering aggregations, grouping, and deep multi‑hop reasoning across text and tables.
- Empirical gap analysis: Shows that state‑of‑the‑art models lose >30 F1 points on SPARTA compared to HybridQA/OTT‑QA, highlighting fundamental weaknesses.
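To make the "controlled multi‑hop query synthesis" concrete, here is a minimal sketch of the kind of nested, hop‑controlled SQL‑style query the pipeline targets, run over a toy source table and a grounding table. All table, column, and value names are illustrative assumptions, not the paper's actual schema:

```python
import sqlite3

# Toy source table plus a "grounding table" of facts extracted from text.
# (Schema and data are illustrative; the paper does not specify them.)
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE players(name TEXT, team TEXT, goals INTEGER)")
cur.executemany("INSERT INTO players VALUES (?,?,?)",
                [("Ann", "Reds", 12), ("Bo", "Reds", 7), ("Cy", "Blues", 9)])
cur.execute("CREATE TABLE grounding(subject TEXT, predicate TEXT, object TEXT)")
cur.executemany("INSERT INTO grounding VALUES (?,?,?)",
                [("Reds", "founded_in", "1901"), ("Blues", "founded_in", "1950")])

# A 2-hop nested query: hop 1 resolves a textual fact (which team was
# founded in 1901), hop 2 aggregates over the table rows linked to it.
query = """
SELECT SUM(goals) FROM players
WHERE team IN (
    SELECT subject FROM grounding
    WHERE predicate = 'founded_in' AND object = '1901'
)
"""
total = cur.execute(query).fetchone()[0]
print(total)  # 12 + 7 = 19
```

Deeper hop counts correspond to deeper nesting of such subqueries, which is what makes the target depth directly controllable during generation.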
Methodology
- Fact Extraction – From each passage, the system extracts atomic facts (subject‑predicate‑object triples) using off‑the‑shelf OpenIE tools.
- Grounding Table Construction – These facts are organized into auxiliary tables that “ground” the unstructured text, linking it to the original structured table.
- Query Generation – A grammar‑driven generator creates SQL‑like queries with a configurable number of hops. Queries are built as directed acyclic graphs; a post‑order traversal ensures realistic nesting.
- Provenance‑Based Refinement – If a generated query would return an empty set, the system rewrites predicates using provenance information (i.e., which tables contributed to the result) until a non‑empty answer is guaranteed.
- Natural‑Language Verbalization – The final query graph is linearized into a fluent question using template‑based surface realization, followed by lightweight human validation for fluency.
- Dataset Assembly – Each QA pair consists of the original table, the associated passage, the generated question, and the correct answer (derived from the executed query).
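The steps above can be sketched end to end as a minimal pipeline. Everything here is an assumption for illustration: the triples are hard‑coded where a real system would call an OpenIE extractor, and the `refine` fallback is a simplified stand‑in for the paper's provenance‑based rewriting:

```python
import sqlite3

# Step 1 (fact extraction): atomic (subject, predicate, object) triples;
# a real pipeline would produce these with an OpenIE tool.
facts = [("Reds", "founded_in", "1901"), ("Blues", "founded_in", "1950")]

# Step 2 (grounding table construction): load the facts into an auxiliary
# table alongside the original structured table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE grounding(subject TEXT, predicate TEXT, object TEXT)")
conn.executemany("INSERT INTO grounding VALUES (?,?,?)", facts)

# Steps 3-4 (generation + provenance-based refinement): if a candidate
# predicate value yields an empty result, rewrite it to a value that is
# known to occur in the data, so the final query is answerable.
def refine(predicate, value):
    rows = conn.execute(
        "SELECT subject FROM grounding WHERE predicate=? AND object=?",
        (predicate, value)).fetchall()
    if rows:
        return value, rows
    # Simplified provenance fallback: substitute a witnessed value.
    (witness,) = conn.execute(
        "SELECT object FROM grounding WHERE predicate=? LIMIT 1",
        (predicate,)).fetchone()
    rows = conn.execute(
        "SELECT subject FROM grounding WHERE predicate=? AND object=?",
        (predicate, witness)).fetchall()
    return witness, rows

value, rows = refine("founded_in", "1899")  # empty result, so rewritten

# Step 5 (verbalization): template-based surface realization.
question = f"Which team was founded in {value}?"
print(question, "->", [r[0] for r in rows])
```

The executed (refined) query supplies the gold answer, so each assembled QA pair is guaranteed to be answerable from the table-plus-grounding database by construction.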
Results & Findings
- Benchmark Scale: SPARTA contains ≈10K QA pairs, an order of magnitude larger than prior hybrid QA datasets.
- Model Performance Drop: Top models (e.g., TAPAS‑based, Table‑Text Fusion) that achieve 70 F1 on HybridQA fall to ≈38 F1 on SPARTA; similarly, OTT‑QA models drop from 50 F1 to ≈18 F1.
- Error Analysis: Failures concentrate on (a) correctly aligning textual facts with table rows, (b) executing aggregations/group‑by across modalities, and (c) maintaining logical consistency over >2 hops.
- Human Validation: Only ~5% of generated questions required manual correction, confirming the pipeline’s high fidelity.
Practical Implications
- Better Model Diagnostics: Developers can use SPARTA to pinpoint exactly where their cross‑modal reasoning pipelines break (e.g., aggregation handling, multi‑hop linking).
- Training Data Augmentation: The generation pipeline can be adapted to synthesize domain‑specific QA pairs (finance, healthcare) where tables and reports co‑exist, reducing the need for costly annotation.
- Benchmark for New Architectures: SPARTA encourages the design of models that natively integrate relational reasoning (SQL‑style operators) with language understanding, such as neural symbolic hybrids or graph‑augmented transformers.
- Real‑World Use Cases: Applications like business intelligence dashboards, data‑driven chatbots, and automated report generation will benefit from systems validated against SPARTA’s deep reasoning scenarios.
Limitations & Future Work
- Synthetic Bias: Although provenance refinement ensures executability, the generated queries may still reflect patterns of the underlying grammar rather than the full diversity of human queries.
- Domain Coverage: The current pipeline focuses on generic Wikipedia‑style tables and passages; extending to highly specialized domains may require custom fact‑extraction rules.
- Human Validation Scope: Only a small sample was manually reviewed; scaling validation could further improve naturalness.
- Future Directions: The authors plan to incorporate adversarial query generation, richer linguistic paraphrasing, and to open‑source tools for domain‑specific benchmark creation.
Authors
- Sungho Park
- Jueun Kim
- Wook‑Shin Han
Paper Information
- arXiv ID: 2602.23286v1
- Categories: cs.CL, cs.AI, cs.DB, cs.IR
- Published: February 26, 2026