[Paper] ViTaB-A: Evaluating Multimodal Large Language Models on Visual Table Attribution

Published: February 17, 2026
Source: arXiv (2602.15769v1)

Overview

The paper “ViTaB‑A: Evaluating Multimodal Large Language Models on Visual Table Attribution” investigates a critical but under‑explored capability of multimodal LLMs (mLLMs): the ability to point to the exact rows and columns in a table that justify a given answer. While many models can answer questions over tables encoded as Markdown, JSON, or images, developers often need to know where the answer came from—especially in domains like finance, healthcare, or compliance where traceability is non‑negotiable. The authors benchmark several state‑of‑the‑art mLLMs and reveal a stark gap between raw QA performance and fine‑grained attribution reliability.

Key Contributions

  • Formal definition of structured‑data attribution (row/column citation) for tables across three formats: Markdown, JSON, and rendered images.
  • ViTaB‑A benchmark suite comprising a diverse set of table‑question pairs with ground‑truth row/column references.
  • Comprehensive evaluation of multiple mLLM families (e.g., GPT‑4‑V, LLaVA, Gemini‑Pro‑Vision) using a variety of prompting strategies (zero‑shot, few‑shot, chain‑of‑thought).
  • Empirical finding that attribution accuracy is dramatically lower than QA accuracy—often near random for JSON inputs.
  • Analysis of failure modes, showing models are better at citing rows than columns and perform best on visual (image) tables versus textual (Markdown/JSON) ones.
  • Open‑source release of the benchmark data, evaluation scripts, and detailed attribution metrics to encourage reproducibility.

Methodology

Dataset Construction

  • Collected 1,200 real‑world tables from public repositories (e.g., Wikipedia, open government data).
  • For each table, generated 3–5 natural‑language questions and manually annotated the exact supporting rows and columns.
  • Rendered each table in three modalities: plain Markdown text, JSON key‑value structure, and a raster image (PNG).
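The textual two of the three modalities are straightforward to produce programmatically. The sketch below (illustrative only; the paper does not publish its rendering code, and the table contents here are made up) shows how one table can be emitted as Markdown text and as a JSON record list; the third modality, a raster PNG, would require an image renderer such as matplotlib's table artist and is omitted.

```python
import json

def to_markdown(headers, rows):
    """Render a table as GitHub-style Markdown text."""
    lines = ["| " + " | ".join(headers) + " |",
             "| " + " | ".join("---" for _ in headers) + " |"]
    lines += ["| " + " | ".join(str(c) for c in row) + " |" for row in rows]
    return "\n".join(lines)

def to_json(headers, rows):
    """Render the same table as a JSON list of key-value records."""
    return json.dumps([dict(zip(headers, row)) for row in rows], indent=2)

# Toy example table (not from the benchmark).
headers = ["Country", "Capital", "Population (M)"]
rows = [["France", "Paris", 68], ["Japan", "Tokyo", 125]]

md = to_markdown(headers, rows)
js = to_json(headers, rows)
# PNG rendering (third modality) omitted; it would rasterize the same grid.
```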

Model Selection & Prompt Design

  • Tested 7 publicly available mLLMs spanning vision‑language and text‑only families.
  • Designed three prompting templates:
    1. Direct QA: “Answer the question.”
    2. Citation‑aware: “Answer and list the row/column IDs that support the answer.”
    3. Chain‑of‑Thought: “Explain step‑by‑step, then cite the evidence.”
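The three templates can be sketched as simple prompt builders. These function names and the exact wording are illustrative, not the paper's verbatim prompts; only the quoted instruction fragments above are from the source.

```python
def direct_qa(table_text, question):
    """Template 1: plain question answering."""
    return f"{table_text}\n\nQuestion: {question}\nAnswer the question."

def citation_aware(table_text, question):
    """Template 2: answer plus supporting row/column IDs."""
    return (f"{table_text}\n\nQuestion: {question}\n"
            "Answer and list the row/column IDs that support the answer, "
            'for example as JSON: {"answer": "...", "rows": [], "columns": []}')

def chain_of_thought(table_text, question):
    """Template 3: step-by-step reasoning, then evidence citation."""
    return (f"{table_text}\n\nQuestion: {question}\n"
            "Explain step-by-step, then cite the evidence rows and columns.")
```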

Evaluation Metrics

  • QA Accuracy – exact match of the answer string.
  • Row Attribution Recall/Precision – proportion of correctly cited rows.
  • Column Attribution Recall/Precision – same for columns.
  • Combined Attribution F1 – harmonic mean of row and column scores.
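A minimal implementation of these metrics, treating cited IDs as sets (the paper's exact matching rules may differ in details such as partial-credit handling):

```python
def precision_recall_f1(predicted, gold):
    """Set-based precision/recall/F1 over cited row or column IDs."""
    pred, gold = set(predicted), set(gold)
    tp = len(pred & gold)                      # correctly cited IDs
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

def combined_attribution_f1(row_f1, col_f1):
    """Harmonic mean of row and column F1 (the Combined Attribution F1)."""
    return 2 * row_f1 * col_f1 / (row_f1 + col_f1) if row_f1 + col_f1 else 0.0
```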

Statistical Analysis

  • Used bootstrapped confidence intervals (1,000 samples) to assess significance of differences across models and formats.
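A percentile bootstrap over per-example scores can be sketched as follows; this is a generic construction, assuming the authors resample at the example level (the paper states 1,000 resamples but not the exact CI procedure).

```python
import random

def bootstrap_ci(scores, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of
    per-example scores (e.g., per-question attribution F1)."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_resamples):
        # Resample the score list with replacement, keeping its size.
        sample = [rng.choice(scores) for _ in scores]
        means.append(sum(sample) / len(sample))
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi
```

Two models' score distributions can then be compared by checking whether their intervals overlap, or by bootstrapping the paired difference directly.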

All steps were scripted in Python, leveraging the OpenAI, Hugging Face, and Google Gemini APIs for reproducibility.

Results & Findings

| Model (Family) | QA Accuracy | Row Attribution F1 | Column Attribution F1 | Overall Attribution F1 |
| --- | --- | --- | --- | --- |
| GPT‑4‑V (Vision) | 68% | 45% | 31% | 38% |
| LLaVA‑13B | 55% | 28% | 19% | 23% |
| Gemini‑Pro‑Vision | 62% | 41% | 27% | 34% |
| Others (average) | 58% | 22% | 15% | 18% |
  • QA vs. Attribution Gap: While QA accuracy hovers around 55‑70 %, attribution F1 scores drop to 15‑38 %, indicating models often “guess” the answer without grounding it.
  • Format Dependence: Attribution on JSON tables is near random (≈10 % F1), whereas image tables achieve the highest scores (≈38 % F1).
  • Row vs. Column: Models consistently cite rows more reliably (≈10 % higher F1) than columns, suggesting they treat tables more like lists than matrices.
  • Prompt Impact: Chain‑of‑thought prompts improve attribution modestly (≈5 % absolute gain) but still lag far behind QA performance.
  • Model Family Differences: Vision‑augmented models (GPT‑4‑V, Gemini‑Pro‑Vision) outperform purely text‑based mLLMs on visual tables, but none achieve robust citation.

Takeaway: Current mLLMs can answer table‑based questions, but they cannot be trusted to provide transparent, traceable evidence—especially when the source data is in a structured textual format.

Practical Implications

| Scenario | Why Attribution Matters | Impact of Findings |
| --- | --- | --- |
| Financial reporting dashboards | Auditors need to see which rows/columns justify a KPI | mLLM‑generated insights would require a human verification layer; blind reliance is risky. |
| Healthcare data analysis | Clinical decisions must be traceable to patient records | Current models could misattribute findings, leading to compliance violations. |
| Business intelligence (BI) tools | Users expect "drill‑down" capability from AI‑assisted queries | Developers should expose raw query results alongside model answers, or fall back to rule‑based extraction. |
| Regulatory compliance (e.g., GDPR, SOX) | Evidence of data provenance is mandatory | The low attribution scores mean mLLMs cannot yet satisfy audit trails. |
| Developer tooling (e.g., Copilot for data notebooks) | Inline code suggestions need to reference source cells | Integrating a verification step (e.g., prompting the model to re‑run a simple SELECT) can improve trust. |

Actionable advice for engineers

  1. Never expose raw model answers as final decisions—always pair them with a deterministic extraction routine (SQL/JSONPath) that can be independently verified.
  2. Prefer visual table inputs (e.g., screenshots) if you must rely on model attribution, but still treat the output as a hint rather than proof.
  3. Leverage chain‑of‑thought prompting to coax the model into reasoning steps, then parse the intermediate citations for sanity checks.
  4. Implement fallback mechanisms: if the model’s attribution confidence (e.g., token‑level log‑probability) falls below a threshold, revert to a rule‑based extractor.
  5. Monitor attribution metrics in production (e.g., track row/column recall) to catch drift as models are updated.
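Points 1 and 4 above can be combined into a small deterministic check: re-run the lookup the model claims to have made and confirm its answer appears among the cells it cited. The sketch below uses an in-memory SQLite table; all names (`verify_with_sql`, the `t` schema, the toy rows) are illustrative, not from the paper.

```python
import sqlite3

def verify_with_sql(rows, model_answer, cited_row_ids, column):
    """Deterministic fallback: check that the model's answer actually
    occurs in the cited rows of the given column.

    `column` must come from a trusted whitelist of column names, since it
    is interpolated into the SQL text (values go through placeholders)."""
    if not cited_row_ids:
        return False  # no citation, nothing to verify
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE t (row_id INTEGER, country TEXT, capital TEXT)")
    con.executemany("INSERT INTO t VALUES (?, ?, ?)", rows)
    placeholders = ",".join("?" for _ in cited_row_ids)
    cur = con.execute(
        f"SELECT {column} FROM t WHERE row_id IN ({placeholders})",
        list(cited_row_ids))
    values = {v for (v,) in cur.fetchall()}
    con.close()
    return model_answer in values

# Toy table: (row_id, country, capital)
table = [(0, "France", "Paris"), (1, "Japan", "Tokyo")]
ok = verify_with_sql(table, "Tokyo", [1], "capital")      # citation checks out
bad = verify_with_sql(table, "Tokyo", [0], "capital")     # wrong row cited
```

If the check fails (or the model's citation confidence is below threshold, per point 4), the pipeline would fall back to the rule-based extractor rather than surfacing the model's answer.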

Limitations & Future Work

  • Scope of Table Complexity: The benchmark focuses on moderately sized tables (≤30 rows, ≤10 columns). Larger, hierarchical tables may exacerbate attribution failures.
  • Prompt Engineering Depth: Only three prompt templates were explored; more sophisticated prompting (e.g., self‑critique loops) could improve citation.
  • Model Access: Some evaluated models are proprietary black boxes, limiting insight into why they miss citations.
  • Ground‑Truth Ambiguity: Certain questions admit multiple valid supporting cells; the current annotation scheme records a single “gold” citation, potentially penalizing correct alternative attributions.

Future research directions suggested by the authors include:

  • Extending ViTaB‑A to nested JSON and pivot tables.
  • Designing training objectives that explicitly reward citation (e.g., multi‑task fine‑tuning with a “cite‑cells” loss).
  • Exploring retrieval‑augmented pipelines where a deterministic extractor supplies candidate cells that the LLM then validates.
  • Investigating explainability tools (e.g., attention visualizations) to diagnose why models overlook column cues.

Authors

  • Yahia Alqurnawi
  • Preetom Biswas
  • Anmol Rao
  • Tejas Anvekar
  • Chitta Baral
  • Vivek Gupta

Paper Information

  • arXiv ID: 2602.15769v1
  • Categories: cs.CL
  • Published: February 17, 2026