[Paper] SPARTA: Scalable and Principled Benchmark of Tree-Structured Multi-hop QA over Text and Tables
Source: arXiv - 2602.23286v1
Overview
The paper introduces SPARTA, a new framework for automatically building large‑scale benchmarks that test a model’s ability to answer questions requiring multi‑hop reasoning across both tables and free‑form text. By generating thousands of high‑quality QA pairs with complex operations (aggregation, grouping, nested queries), SPARTA exposes serious gaps in current cross‑modal QA systems that perform well on existing, shallow benchmarks.
Key Contributions
- Automated benchmark generation: End‑to‑end pipeline that creates Table‑Text QA datasets with minimal human validation (≈¼ the annotation effort of HybridQA).
- Fact‑grounded reference database: Enriches each source table with “grounding tables” derived from atomic facts automatically extracted from accompanying passages.
- Controlled multi‑hop query synthesis: Generates nested SQL‑style queries whose depth matches a target hop count, enabling systematic testing of deep reasoning.
- Provenance‑based refinement: Rewrites syntactically valid queries that would otherwise return empty results, guaranteeing that every question has a non‑empty, executable answer.
- Realistic‑structure enforcement: Restricts generation to post‑order traversals of the query graph, ensuring the resulting natural‑language questions sound fluent and human‑like.
- Comprehensive benchmark: Includes thousands of QA pairs covering aggregations, grouping, and deep multi‑hop reasoning across text and tables.
- Empirical gap analysis: Shows that state‑of‑the‑art models lose >30 F1 points on SPARTA compared to HybridQA/OTT‑QA, highlighting fundamental weaknesses.
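To make the "controlled multi‑hop query synthesis" concrete, here is a minimal sketch of the kind of nested, hop‑controlled SQL‑style query the pipeline targets, run over a toy source table and a grounding table. All table, column, and value names are illustrative assumptions, not the paper's actual schema:

```python
import sqlite3

# Toy source table plus a "grounding table" of facts extracted from text.
# (Schema and data are illustrative; the paper does not specify them.)
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE players(name TEXT, team TEXT, goals INTEGER)")
cur.executemany("INSERT INTO players VALUES (?,?,?)",
                [("Ann", "Reds", 12), ("Bo", "Reds", 7), ("Cy", "Blues", 9)])
cur.execute("CREATE TABLE grounding(subject TEXT, predicate TEXT, object TEXT)")
cur.executemany("INSERT INTO grounding VALUES (?,?,?)",
                [("Reds", "founded_in", "1901"), ("Blues", "founded_in", "1950")])

# A 2-hop nested query: hop 1 resolves a textual fact (which team was
# founded in 1901), hop 2 aggregates over the table rows linked to it.
query = """
SELECT SUM(goals) FROM players
WHERE team IN (
    SELECT subject FROM grounding
    WHERE predicate = 'founded_in' AND object = '1901'
)
"""
total = cur.execute(query).fetchone()[0]
print(total)  # 12 + 7 = 19
```

Deeper hop counts correspond to deeper nesting of such subqueries, which is what makes the target depth directly controllable during generation.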
Methodology
- Fact Extraction – From each passage, the system extracts atomic facts (subject‑predicate‑object triples) using off‑the‑shelf OpenIE tools.
- Grounding Table Construction – These facts are organized into auxiliary tables that “ground” the unstructured text, linking it to the original structured table.
- Query Generation – A grammar‑driven generator creates SQL‑like queries with a configurable number of hops. Queries are built as directed acyclic graphs; a post‑order traversal ensures realistic nesting.
- Provenance‑Based Refinement – If a generated query would return an empty set, the system rewrites predicates using provenance information (i.e., which tables contributed to the result) until a non‑empty answer is guaranteed.
- Natural‑Language Verbalization – The final query graph is linearized into a fluent question using template‑based surface realization, followed by lightweight human validation for fluency.
- Dataset Assembly – Each QA pair consists of the original table, the associated passage, the generated question, and the correct answer (derived from the executed query).
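The steps above can be sketched end to end as a minimal pipeline. Everything here is an assumption for illustration: the triples are hard‑coded where a real system would call an OpenIE extractor, and the `refine` fallback is a simplified stand‑in for the paper's provenance‑based rewriting:

```python
import sqlite3

# Step 1 (fact extraction): atomic (subject, predicate, object) triples;
# a real pipeline would produce these with an OpenIE tool.
facts = [("Reds", "founded_in", "1901"), ("Blues", "founded_in", "1950")]

# Step 2 (grounding table construction): load the facts into an auxiliary
# table alongside the original structured table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE grounding(subject TEXT, predicate TEXT, object TEXT)")
conn.executemany("INSERT INTO grounding VALUES (?,?,?)", facts)

# Steps 3-4 (generation + provenance-based refinement): if a candidate
# predicate value yields an empty result, rewrite it to a value that is
# known to occur in the data, so the final query is answerable.
def refine(predicate, value):
    rows = conn.execute(
        "SELECT subject FROM grounding WHERE predicate=? AND object=?",
        (predicate, value)).fetchall()
    if rows:
        return value, rows
    # Simplified provenance fallback: substitute a witnessed value.
    (witness,) = conn.execute(
        "SELECT object FROM grounding WHERE predicate=? LIMIT 1",
        (predicate,)).fetchone()
    rows = conn.execute(
        "SELECT subject FROM grounding WHERE predicate=? AND object=?",
        (predicate, witness)).fetchall()
    return witness, rows

value, rows = refine("founded_in", "1899")  # empty result, so rewritten

# Step 5 (verbalization): template-based surface realization.
question = f"Which team was founded in {value}?"
print(question, "->", [r[0] for r in rows])
```

The executed (refined) query supplies the gold answer, so each assembled QA pair is guaranteed to be answerable from the table-plus-grounding database by construction.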
Results & Findings
- Benchmark Scale: SPARTA contains ≈10K QA pairs, an order of magnitude larger than prior hybrid QA datasets.
- Model Performance Drop: Top models (e.g., TAPAS‑based, Table‑Text Fusion) that achieve 70 F1 on HybridQA fall to ≈38 F1 on SPARTA; similarly, OTT‑QA models drop from 50 F1 to ≈18 F1.
- Error Analysis: Failures concentrate on (a) correctly aligning textual facts with table rows, (b) executing aggregations/group‑by across modalities, and (c) maintaining logical consistency over >2 hops.
- Human Validation: Only ~5% of generated questions required manual correction, confirming the pipeline’s high fidelity.
Practical Implications
- Better Model Diagnostics: Developers can use SPARTA to pinpoint exactly where their cross‑modal reasoning pipelines break (e.g., aggregation handling, multi‑hop linking).
- Training Data Augmentation: The generation pipeline can be adapted to synthesize domain‑specific QA pairs (finance, healthcare) where tables and reports co‑exist, reducing the need for costly annotation.
- Benchmark for New Architectures: SPARTA encourages the design of models that natively integrate relational reasoning (SQL‑style operators) with language understanding, such as neural symbolic hybrids or graph‑augmented transformers.
- Real‑World Use Cases: Applications like business intelligence dashboards, data‑driven chatbots, and automated report generation will benefit from systems validated against SPARTA’s deep reasoning scenarios.
Limitations & Future Work
- Synthetic Bias: Although provenance refinement ensures executability, the generated queries may still reflect patterns of the underlying grammar rather than the full diversity of human queries.
- Domain Coverage: The current pipeline focuses on generic Wikipedia‑style tables and passages; extending to highly specialized domains may require custom fact‑extraction rules.
- Human Validation Scope: Only a small sample was manually reviewed; scaling validation could further improve naturalness.
- Future Directions: The authors plan to incorporate adversarial query generation, richer linguistic paraphrasing, and to open‑source tools for domain‑specific benchmark creation.
Authors
- Sungho Park
- Jueun Kim
- Wook‑Shin Han
Paper Information
- arXiv ID: 2602.23286v1
- Categories: cs.CL, cs.AI, cs.DB, cs.IR
- Published: February 26, 2026