[Paper] Semantic Triplet Restoration: A Novel Protocol for Hierarchical Table Understanding in Large Language Models

Published: May 29, 2026 at 01:10 PM EDT

Source: arXiv - 2605.31550v1

Overview

The paper introduces Semantic Triplet Restoration (STR), a new way to represent tables for question‑answering systems that use large language models (LLMs). Instead of feeding the model a raw HTML or Markdown rendering of a table, STR rewrites every cell as a concise “triplet” (item path, feature path, value). This makes the table’s hierarchical structure explicit and cuts input length by roughly 35 to 40 % on the reported benchmarks.

Key Contributions

Semantic Triplet Representation: A compact, fact‑like encoding of each table cell that captures row entities, hierarchical column attributes, and cell values.
TripletQL Router: A lightweight, query‑aware component that selects the most relevant subset of triplets (or a suitable rendering) for a given question, reducing unnecessary context.
Empirical Validation: STR matches or outperforms traditional HTML/Markdown pipelines on four bilingual (Chinese & English) table‑QA benchmarks.
Efficiency Gains: Relative improvements are larger for smaller LLMs and for tables with many rows or columns, which matters most under tight inference budgets.
Open‑source Release: Code, data, and reproducible scripts are publicly available, encouraging community adoption and further research.

Methodology

Triplet Construction
- Item Path: Traverses the row hierarchy (e.g., Country → State).
- Feature Path: Traverses the column hierarchy (e.g., Year → Revenue).
- Value: The actual cell content (numeric, textual, etc.).
The authors parse the original table (HTML/Markdown) to extract these hierarchical paths and generate a flat list of triplets.
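
As a rough illustration of the construction step, the sketch below flattens a small two‑level table into triplets. It assumes the header hierarchies have already been parsed into row and column paths. The names `Triplet`, `build_triplets`, and `render`, the `>` and `|` separators, the skipping of empty cells, and the sample figures are all hypothetical choices made for this example, not the authors' implementation.

```python
from typing import List, NamedTuple, Sequence, Tuple


class Triplet(NamedTuple):
    item_path: Tuple[str, ...]     # row hierarchy, e.g. ("USA", "California")
    feature_path: Tuple[str, ...]  # column hierarchy, e.g. ("2023", "Revenue")
    value: str                     # cell content, kept as a string


def build_triplets(
    row_paths: Sequence[Sequence[str]],
    col_paths: Sequence[Sequence[str]],
    cells: Sequence[Sequence[str]],
) -> List[Triplet]:
    """Flatten a grid of cells into (item path, feature path, value) triplets."""
    triplets = []
    for row_path, row in zip(row_paths, cells):
        for col_path, value in zip(col_paths, row):
            if value == "":
                continue  # assumption: empty cells carry no fact
            triplets.append(Triplet(tuple(row_path), tuple(col_path), str(value)))
    return triplets


def render(triplet: Triplet, sep: str = " > ") -> str:
    """Serialize one triplet as a single line of text."""
    return (
        f"{sep.join(triplet.item_path)} | "
        f"{sep.join(triplet.feature_path)} | {triplet.value}"
    )


# Made-up table: Country -> State rows, Year -> Metric columns.
row_paths = [("USA", "California"), ("USA", "Texas")]
col_paths = [("2023", "Revenue"), ("2023", "Profit")]
cells = [["120", "30"], ["95", "22"]]

triplets = build_triplets(row_paths, col_paths, cells)
lines = [render(t) for t in triplets]
print("\n".join(lines))
```

Each printed line is a self‑contained fact, so a cell no longer depends on its position in the grid to be interpreted.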
TripletQL (Query‑aware Router)
- Takes the user question, encodes it with a small transformer, and scores each triplet for relevance.
- Returns either a filtered set of top‑k triplets or a fallback rendering (e.g., the original HTML) if the question is too broad.
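
The routing behaviour described above can be sketched roughly as follows. The summary says the router scores triplets with a small transformer; this example swaps in a plain token‑overlap score so that it runs without any model. The function names, the `k` and `min_score` parameters, and the rule "fall back when no triplet clears the threshold" are assumptions for illustration only.

```python
import re
from typing import List, Tuple, Union

TripletText = Tuple[str, str, str]  # (item path, feature path, value) as strings


def tokenize(text: str) -> set:
    """Lowercase word tokens; a crude stand-in for a learned encoder."""
    return set(re.findall(r"\w+", text.lower()))


def score(question: str, triplet: TripletText) -> float:
    """Fraction of the triplet's tokens that also appear in the question."""
    q_tokens = tokenize(question)
    t_tokens = tokenize(" ".join(triplet))
    return len(q_tokens & t_tokens) / len(t_tokens) if t_tokens else 0.0


def route(
    question: str,
    triplets: List[TripletText],
    fallback: str,
    k: int = 3,
    min_score: float = 0.2,
) -> Tuple[str, Union[List[TripletText], str]]:
    """Return the top-k relevant triplets, or the fallback rendering if none qualify."""
    scored = sorted(
        ((score(question, t), t) for t in triplets),
        key=lambda pair: pair[0],
        reverse=True,
    )
    top = [t for s, t in scored[:k] if s >= min_score]
    if not top:
        return ("fallback", fallback)  # e.g. the original HTML
    return ("triplets", top)


# Made-up triplets and questions.
triplets = [
    ("USA > California", "2023 > Revenue", "120"),
    ("USA > California", "2023 > Profit", "30"),
    ("USA > Texas", "2023 > Revenue", "95"),
    ("USA > Texas", "2023 > Profit", "22"),
]
html = "<table>...</table>"

kind, selected = route("What was the revenue of Texas in 2023?", triplets, html, k=2, min_score=0.3)
broad_kind, broad_payload = route("Summarize the table.", triplets, html)
```

With these inputs the first question selects the Texas revenue triplet first, while the broad question matches nothing and returns the fallback rendering.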
Integration with LLMs
- The selected triplets are concatenated with the question and fed to a downstream LLM (e.g., LLaMA‑2, GPT‑3.5).
- Because each triplet is a short, self‑contained fact, the model can reason directly over the semantic structure without learning implicit layout cues.
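
A minimal sketch of the integration step, assuming the selected triplets arrive as plain strings. The prompt wording and the `build_prompt` helper are hypothetical; the summary does not specify the template the authors use, and no particular LLM client is assumed here.

```python
from typing import List, Tuple


def build_prompt(question: str, triplets: List[Tuple[str, str, str]]) -> str:
    """Concatenate selected triplets and the question into one prompt string."""
    facts = "\n".join(
        f"- {item} | {feature} | {value}" for item, feature, value in triplets
    )
    return (
        "Each fact below is one table cell, written as "
        "item path | feature path | value.\n"
        f"{facts}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Made-up example input.
prompt = build_prompt(
    "What was the revenue of Texas in 2023?",
    [("USA > Texas", "2023 > Revenue", "95")],
)
print(prompt)
```

The resulting string can be passed to whichever downstream model is in use.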

Results & Findings

Benchmark	Baseline (HTML), EM	STR + TripletQL, EM	Token Reduction
WikiTableQuestions (EN)	71.2 %	72.5 %	~38 %
TabFact (CN)	84.1 %	84.6 %	~35 %
HybridTableQA (EN)	68.9 %	69.3 %	~40 %
MultiLingualTableQA (CN/EN)	73.4 %	74.0 %	~37 %

Accuracy: STR improves exact‑match scores by 0.4 to 1.3 points over HTML‑based pipelines on all four benchmarks.
Token Efficiency: Average input length drops by roughly 35 to 40 %, leaving more of the context window for longer reasoning chains.
Model Size Sensitivity: Gains are most pronounced for 7B‑parameter models, where the improvement can exceed 2 absolute EM points.
Scalability: For tables exceeding 200 cells, STR’s token savings become critical, preventing context‑window overflow.

Practical Implications

Cost‑Effective Deployment: Companies can run table‑QA services on cheaper, smaller LLMs without sacrificing accuracy, lowering inference costs on cloud platforms.
Faster Response Times: Fewer input tokens mean lower latency, which is valuable for real‑time analytics dashboards or conversational assistants that answer data‑driven queries interactively.
Simplified Prompt Engineering: Developers no longer need to craft elaborate prompts to teach the model about row/column spans; the triplet format is self‑describing.
Better Interoperability: The triplet schema can be generated from any tabular source (CSV, Excel, SQL results), making it easy to plug into existing data pipelines.
Enhanced Explainability: Since each fact is explicit, debugging wrong answers becomes a matter of inspecting the selected triplets rather than deciphering hidden layout cues.
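
To make the interoperability point above concrete, here is a small sketch that derives triplets from CSV text using Python's standard `csv` module. CSV has no native header hierarchy, so the example assumes nested column headers are written with a `/` separator and that one named column identifies each row. Both conventions, the `csv_to_triplets` helper, and the sample data are hypothetical.

```python
import csv
import io
from typing import List, Tuple

PathTriplet = Tuple[Tuple[str, ...], Tuple[str, ...], str]


def csv_to_triplets(text: str, key_column: str, header_sep: str = "/") -> List[PathTriplet]:
    """Turn CSV rows into (item path, feature path, value) triplets."""
    reader = csv.DictReader(io.StringIO(text))
    triplets = []
    for row in reader:
        item_path = (row[key_column],)  # single-level row path
        for column, value in row.items():
            if column == key_column or not value:
                continue  # skip the key column and empty cells
            feature_path = tuple(column.split(header_sep))
            triplets.append((item_path, feature_path, value))
    return triplets


# Made-up CSV with a two-level header encoded as "Year/Metric".
sample = "State,2023/Revenue,2023/Profit\nCalifornia,120,30\nTexas,95,\n"
csv_triplets = csv_to_triplets(sample, key_column="State")
for t in csv_triplets:
    print(t)
```

The same pattern would apply to spreadsheet or SQL result rows once they are read into dictionaries keyed by column name.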

Limitations & Future Work

Complex Cell Content: The current approach treats cell values as atomic strings; handling rich content (lists, images, nested tables) requires extensions.
Header Ambiguity: In extremely deep hierarchies, the feature path can become long, potentially re‑inflating token counts; smarter path compression is an open problem.
Cross‑Table Reasoning: STR focuses on single‑table QA; integrating multiple tables or external knowledge bases remains unexplored.
Language Coverage: Experiments are limited to Chinese and English; adapting the parser to languages with different script directions or table conventions may need additional work.

Overall, Semantic Triplet Restoration offers a pragmatic, token‑efficient bridge between tabular data and LLMs, opening the door for more scalable and developer‑friendly table‑question answering systems.

Authors

Yibin Zhao
Fangxin Shang
Dingrui Yang
Yuqi Wang

Paper Information

arXiv ID: 2605.31550v1
Categories: cs.CL
Published: May 29, 2026
