[Paper] Semantic Triplet Restoration: A Novel Protocol for Hierarchical Table Understanding in Large Language Models

Published: May 29, 2026 at 01:10 PM EDT
Source: arXiv - 2605.31550v1

Overview

The paper introduces Semantic Triplet Restoration (STR), a table representation for question‑answering systems built on large language models (LLMs). Instead of feeding the model a raw HTML or Markdown rendering of a table, STR rewrites every cell as a concise triplet (item path, feature path, value). This makes the table’s hierarchical structure explicit and cuts input tokens by roughly 35–40 % on the reported benchmarks.

Key Contributions

  • Semantic Triplet Representation: A compact, fact‑like encoding of each table cell that captures row entities, hierarchical column attributes, and cell values.
  • TripletQL Router: A lightweight, query‑aware component that selects the most relevant subset of triplets (or a suitable rendering) for a given question, reducing unnecessary context.
  • Empirical Validation: STR matches or outperforms traditional HTML/Markdown pipelines on four table‑QA benchmarks covering Chinese and English.
  • Efficiency Gains: Relative improvements are larger for smaller LLMs and for tables with many rows and columns, which matters most under tight inference budgets.
  • Open‑source Release: Code, data, and reproducible scripts are publicly available, encouraging community adoption and further research.

Methodology

  1. Triplet Construction

    • Item Path: Traverses the row hierarchy (e.g., Country → State).
    • Feature Path: Traverses the column hierarchy (e.g., Year → Revenue).
    • Value: The actual cell content (numeric, textual, etc.).

    To build the triplets, the authors parse the original table (HTML/Markdown), extract these hierarchical paths, and generate a flat list of triplets.
  2. TripletQL (Query‑aware Router)

    • Takes the user question, encodes it with a small transformer, and scores each triplet for relevance.
    • Returns either a filtered set of top‑k triplets or a fallback rendering (e.g., the original HTML) if the question is too broad.
  3. Integration with LLMs

    • The selected triplets are concatenated with the question and fed to a downstream LLM (e.g., LLaMA‑2, GPT‑3.5).
    • Because each triplet is a short, self‑contained fact, the model can reason directly over the semantic structure without learning implicit layout cues.
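
The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the names `Triplet`, `build_triplets`, `select_triplets`, and `build_prompt` are hypothetical, the `A > B | C > D | value` text layout is an assumed rendering, and the router scores triplets by plain word overlap as a stand-in for the small transformer encoder described in the paper.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Triplet:
    item_path: Tuple[str, ...]     # row hierarchy, e.g. ("USA", "California")
    feature_path: Tuple[str, ...]  # column hierarchy, e.g. ("2023", "Revenue")
    value: str                     # cell content, kept as a string

    def render(self) -> str:
        # Assumed text layout; the paper's exact serialization may differ.
        item = " > ".join(self.item_path)
        feature = " > ".join(self.feature_path)
        return f"{item} | {feature} | {self.value}"


def build_triplets(row_paths: Sequence[Sequence[str]],
                   col_paths: Sequence[Sequence[str]],
                   cells: Sequence[Sequence[str]]) -> List[Triplet]:
    """Step 1: emit one triplet per non-empty cell of an already parsed table."""
    triplets = []
    for r, item_path in enumerate(row_paths):
        for c, feature_path in enumerate(col_paths):
            value = str(cells[r][c])
            if value != "":
                triplets.append(Triplet(tuple(item_path), tuple(feature_path), value))
    return triplets


def _words(text: str) -> set:
    return set(re.findall(r"\w+", text.lower()))


def select_triplets(question: str, triplets: List[Triplet],
                    k: int = 3) -> Optional[List[Triplet]]:
    """Step 2 (stand-in): rank triplets by word overlap with the question.

    Returns None when no triplet overlaps at all, which a caller can treat
    as the signal to fall back to the original table rendering.
    """
    q = _words(question)
    scored = [(len(q & _words(t.render())), i) for i, t in enumerate(triplets)]
    scored = [(s, i) for s, i in scored if s > 0]
    if not scored:
        return None
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # best score first, stable
    return [triplets[i] for _, i in scored[:k]]


def build_prompt(question: str, selected: List[Triplet]) -> str:
    """Step 3: concatenate the selected facts with the question."""
    facts = "\n".join(t.render() for t in selected)
    return f"Facts:\n{facts}\n\nQuestion: {question}\nAnswer:"


# Toy table: two row paths, two column paths, four cells.
rows = [("USA", "California"), ("USA", "Texas")]
cols = [("2022", "Revenue"), ("2023", "Revenue")]
grid = [["10", "12"], ["7", "9"]]

all_triplets = build_triplets(rows, cols, grid)
question = "What was the 2023 revenue in Texas?"
top = select_triplets(question, all_triplets, k=1)
print(build_prompt(question, top))
```

With `k=1` the overlap score singles out the `USA > Texas | 2023 > Revenue | 9` fact, so the prompt contains one short line instead of the whole table. A learned relevance scorer would replace `select_triplets` without changing the other two steps.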

Results & Findings

Benchmark                     Baseline (HTML)   STR + TripletQL   Token Reduction
WikiTableQuestions (EN)       71.2 % EM         72.5 %            ~38 %
TabFact (CN)                  84.1 %            84.6 %            ~35 %
HybridTableQA (EN)            68.9 %            69.3 %            ~40 %
MultiLingualTableQA (CN/EN)   73.4 %            74.0 %            ~37 %

  • Accuracy: STR + TripletQL scores above the HTML baseline on all four benchmarks, by 0.4 to 1.3 points.
  • Token Efficiency: Average input length drops by roughly 35–40 %, freeing up context windows for longer reasoning chains.
  • Model Size Sensitivity: Gains are most pronounced for 7B‑parameter models, where the improvement can exceed 2 points of absolute EM.
  • Scalability: For tables exceeding 200 cells, STR’s token savings become critical, preventing context‑window overflow.
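
The token-reduction figures and the overflow argument come down to simple arithmetic, sketched below. The function names and all token counts here are hypothetical, chosen only to illustrate the calculation; they are not measurements from the paper.

```python
def token_reduction(baseline_tokens: int, str_tokens: int) -> float:
    """Fraction of input tokens saved relative to the baseline rendering."""
    if baseline_tokens <= 0:
        raise ValueError("baseline_tokens must be positive")
    return 1.0 - str_tokens / baseline_tokens


def fits_context(table_tokens: int, other_tokens: int, context_window: int) -> bool:
    """True if the table plus everything else (question, instructions,
    answer budget) fits inside the model's context window."""
    return table_tokens + other_tokens <= context_window


# Hypothetical counts for one large table under a 4096-token window.
html_tokens = 5000
triplet_tokens = 3100
overhead = 500  # question + instructions + room for the answer

print(f"reduction: {token_reduction(html_tokens, triplet_tokens):.0%}")
print("HTML fits:", fits_context(html_tokens, overhead, 4096))       # False
print("STR fits: ", fits_context(triplet_tokens, overhead, 4096))    # True
```

In this made-up case a reduction in the reported range is the difference between a prompt that overflows the window and one that fits.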

Practical Implications

  • Cost‑Effective Deployment: Companies can run table‑QA services on cheaper, smaller LLMs without sacrificing accuracy, lowering inference costs on cloud platforms.
  • Faster Response Times: Fewer tokens mean lower latency, which matters for real‑time analytics dashboards and conversational assistants that answer data‑driven queries interactively.
  • Simplified Prompt Engineering: Developers no longer need to craft elaborate prompts to teach the model about row/column spans; the triplet format is self‑describing.
  • Better Interoperability: The triplet schema can be generated from any tabular source (CSV, Excel, SQL results), making it easy to plug into existing data pipelines.
  • Enhanced Explainability: Since each fact is explicit, debugging wrong answers becomes a matter of inspecting the selected triplets rather than deciphering hidden layout cues.
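
As an illustration of the interoperability point, the sketch below turns CSV text into (item path, feature path, value) tuples using only the Python standard library. It rests on assumptions that are not from the paper: the function name `csv_to_triplets` is hypothetical, the first column is taken as the row key, and hierarchy levels are encoded in keys and headers with a `/` separator.

```python
import csv
import io
from typing import List, Tuple

Path = Tuple[str, ...]


def csv_to_triplets(text: str, key_column: int = 0,
                    path_sep: str = "/") -> List[Tuple[Path, Path, str]]:
    """Convert CSV text into (item_path, feature_path, value) tuples.

    Assumption: hierarchy is encoded with `path_sep`, e.g. the header
    "2023/Revenue" becomes the feature path ("2023", "Revenue").
    """
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    triplets = []
    for row in reader:
        if not row:  # csv.reader yields [] for blank lines
            continue
        item_path = tuple(row[key_column].split(path_sep))
        for c, cell in enumerate(row):
            if c == key_column or cell == "":
                continue  # skip the key itself and empty cells
            feature_path = tuple(header[c].split(path_sep))
            triplets.append((item_path, feature_path, cell))
    return triplets


sample = (
    "region,2022/Revenue,2023/Revenue\n"
    "USA/California,10,12\n"
    "USA/Texas,7,\n"
)

for item, feature, value in csv_to_triplets(sample):
    print(" > ".join(item), "|", " > ".join(feature), "|", value)
```

The same loop applies to rows from a spreadsheet reader or a SQL cursor, as long as the row key and the column names are available.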

Limitations & Future Work

  • Complex Cell Content: The current approach treats cell values as atomic strings; handling rich content (lists, images, nested tables) requires extensions.
  • Header Ambiguity: In extremely deep hierarchies, the feature path can become long, potentially re‑inflating token counts; smarter path compression is an open problem.
  • Cross‑Table Reasoning: STR focuses on single‑table QA; integrating multiple tables or external knowledge bases remains unexplored.
  • Language Coverage: Experiments are limited to Chinese and English; adapting the parser to languages with different script directions or table conventions may need additional work.

Overall, Semantic Triplet Restoration offers a pragmatic, token‑efficient bridge between tabular data and LLMs, opening the door for more scalable and developer‑friendly table‑question answering systems.

Authors

  • Yibin Zhao
  • Fangxin Shang
  • Dingrui Yang
  • Yuqi Wang

Paper Information

  • arXiv ID: 2605.31550v1
  • Categories: cs.CL
  • Published: May 29, 2026