[Paper] From Rows to Reasoning: A Retrieval-Augmented Multimodal Framework for Spreadsheet Understanding
Source: arXiv - 2601.08741v1
Overview
The paper introduces FRTR, a retrieval‑augmented multimodal framework that lets large language models (LLMs) reason over massive, real‑world Excel workbooks. By splitting spreadsheets into fine‑grained chunks, embedding each one, and fusing textual, numeric, and visual cues, FRTR dramatically improves accuracy while keeping token usage low, making spreadsheet AI practical for enterprise developers.
Key Contributions
- FRTR‑Bench: the first large‑scale benchmark for multimodal spreadsheet reasoning (30 enterprise workbooks, ~4 M cells, 50+ embedded images).
- Granular embedding pipeline: rows, columns, and logical blocks are encoded separately, enabling efficient retrieval of only the relevant pieces.
- Hybrid lexical‑dense retrieval with Reciprocal Rank Fusion (RRF): combines keyword matching and dense vector similarity for robust selection of spreadsheet fragments.
- Multimodal integration: visual embeddings (charts, receipts) are merged with numeric/textual embeddings, allowing the model to answer questions that span both data types.
- Empirical gains: 74 % accuracy on FRTR‑Bench with Claude Sonnet 4.5 (vs. 24 % prior SOTA) and 87 % accuracy on SpreadsheetLLM with GPT‑5 while cutting token consumption by ~50 %.
Methodology
- Chunking the workbook – Each sheet is parsed into three kinds of chunks (first sketch after this list):
  - Row chunks (containing the full row vector)
  - Column chunks (containing the full column vector)
  - Block chunks (user‑defined logical regions, e.g., a table or a pivot)
- Embedding generation – each modality gets its own encoder (second sketch):
  - Textual/numeric data → dense embeddings via a pre‑trained LLM encoder.
  - Images (charts, receipts) → visual embeddings using a CLIP‑style vision encoder.
- Hybrid retrieval – two searches run per query and their rankings are fused (third sketch):
  - Lexical search (BM25) finds exact matches on column headers, formulas, etc.
  - Dense search finds semantically related chunks.
  - The two rankings are merged with Reciprocal Rank Fusion, which balances the precision of lexical matching against the recall of dense retrieval.
- Prompt construction – The retrieved chunks are stitched into a concise context window and fed to the target LLM together with the user query.
- Answer generation – The LLM produces a natural‑language answer, optionally accompanied by a formula or a reference to a visual element.
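A minimal sketch of the chunking step in Python. The paper does not name a parsing library, so openpyxl is an assumption here; block chunks are user‑defined in FRTR and omitted, and the header=value serialization is invented for illustration.

```python
from openpyxl import load_workbook

def chunk_workbook(path):
    """Split every sheet into row and column chunks with stable IDs.

    Sketch only: openpyxl and the serialization format are assumptions;
    FRTR's block chunks (user-defined logical regions) are omitted.
    """
    wb = load_workbook(path, read_only=True, data_only=True)
    chunks = []
    for ws in wb.worksheets:
        rows = [list(r) for r in ws.iter_rows(values_only=True)]
        if not rows:
            continue
        header = [str(h) for h in rows[0]]
        # Row chunks: one chunk per data row, serialized as header=value pairs.
        for i, row in enumerate(rows[1:], start=2):
            body = "; ".join(f"{h}={v}" for h, v in zip(header, row) if v is not None)
            chunks.append({"id": f"row:{ws.title}!{i}", "kind": "row", "text": body})
        # Column chunks: one chunk per column, header followed by its values.
        for j, col in enumerate(zip(*rows), start=1):
            name, *values = col
            body = f"{name}: " + ", ".join(str(v) for v in values if v is not None)
            chunks.append({"id": f"col:{ws.title}!{j}", "kind": "column", "text": body})
    return chunks
```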
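The embedding step, sketched with two off‑the‑shelf sentence-transformers models. The paper specifies only "a pre‑trained LLM encoder" for text/numbers and "a CLIP‑style vision encoder" for images, so the model names and file paths below are stand‑ins, not FRTR's actual choices.

```python
# pip install sentence-transformers pillow
from sentence_transformers import SentenceTransformer
from PIL import Image

# Stand-in encoders: the paper names no concrete models.
text_encoder = SentenceTransformer("all-MiniLM-L6-v2")   # text/numeric chunks
vision_encoder = SentenceTransformer("clip-ViT-B-32")    # CLIP-style, also encodes images

chunks = chunk_workbook("q3_report.xlsx")  # from the sketch above; hypothetical file
chunk_vectors = text_encoder.encode(
    [c["text"] for c in chunks], normalize_embeddings=True
)
# Embedded visuals (charts, receipts) get vectors in the same retrieval index.
chart_vector = vision_encoder.encode(Image.open("q3_sales_chart.png"))  # hypothetical file
```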
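Finally, the fusion and prompt‑construction steps. Reciprocal Rank Fusion is a published formula, score(d) = Σ_i 1/(k + rank_i(d)); k = 60 comes from the original RRF paper (Cormack et al., 2009), not from FRTR, and build_prompt is a hypothetical template for the prompt‑construction step.

```python
from collections import defaultdict

def rrf(rankings, k=60):
    """Reciprocal Rank Fusion over ranked lists of chunk IDs (best first)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

def build_prompt(query, fused_ids, chunk_texts, top_k=5):
    """Stitch the top-k fused chunks into a concise context window."""
    context = "\n".join(chunk_texts[cid] for cid in fused_ids[:top_k])
    return f"Spreadsheet context:\n{context}\n\nQuestion: {query}\nAnswer:"

# Toy example: one BM25 ranking and one dense ranking for the same query.
bm25_hits  = ["row:Sales!17", "col:Sales!C", "block:Pivot_Q3"]
dense_hits = ["block:Pivot_Q3", "row:Sales!17", "img:chart_q3"]
fused = rrf([bm25_hits, dense_hits])
print(fused)  # chunks both retrievers agree on rise to the top
texts = {cid: f"<serialized {cid}>" for cid in fused}  # placeholder chunk bodies
print(build_prompt("What was Q3 revenue?", fused, texts, top_k=2))
```

Because RRF consumes only ranks, the BM25 and cosine‑similarity score scales never need to be calibrated against each other, which is what makes the lexical + dense combination robust in practice.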
Results & Findings
| Benchmark | Setup | Accuracy | Token Savings |
|---|---|---|---|
| FRTR‑Bench (30 workbooks) | FRTR + Claude Sonnet 4.5 | 74 % | – |
| SpreadsheetLLM | FRTR + GPT‑5 | 87 % | ≈ 50 % fewer tokens vs. full‑context compression |
| FRTR‑Bench (prior SOTA) | Various | 24 % | – |
What this means: FRTR’s retrieval step isolates only the rows/columns/visuals needed for a query, so the LLM can focus its reasoning power without being overwhelmed by millions of irrelevant cells. The multimodal fusion also lets the system answer questions like “What is the trend shown in the sales chart for Q3?”—something pure‑text approaches can’t handle.
Practical Implications
- Enterprise automation: Developers can embed FRTR into internal bots that answer finance, supply‑chain, or HR spreadsheet queries on‑the‑fly, reducing manual data‑digging.
- Cost‑effective LLM usage: Halving token consumption translates directly into lower API bills, making large‑scale spreadsheet assistants viable for SaaS products.
- Extensible to other office formats: The same retrieval‑augmented, multimodal pipeline could be adapted for Word documents, PowerPoint decks, or even PDF reports that mix tables and graphics.
- Improved UX for low‑code platforms: No‑code tools can expose “Ask your workbook” features that feel natural to end‑users while staying performant under the hood.
Limitations & Future Work
- Retrieval latency: Although chunking reduces token load, building the chunk index and running the hybrid search (BM25 + dense + RRF) add overhead that can be noticeable for very large workbooks; indexing optimizations are needed.
- Domain‑specific visual cues: The current vision encoder handles generic charts but may struggle with highly customized or low‑resolution images (e.g., scanned receipts). Fine‑tuning on domain‑specific visual data is a next step.
- Explainability: FRTR returns answers but does not yet provide a transparent trace of which rows/columns contributed most to the reasoning—a useful feature for audit‑heavy industries.
- Benchmark diversity: FRTR‑Bench focuses on enterprise Excel files; expanding to Google Sheets, LibreOffice, and cross‑file workflows would broaden applicability.
Bottom line: FRTR shows that a smart retrieval front‑end, combined with multimodal embeddings, can unlock reliable, cost‑effective spreadsheet reasoning for developers building next‑generation AI assistants.
Authors
- Anmol Gulati
- Sahil Sen
- Waqar Sarguroh
- Kevin Paul
Paper Information
- arXiv ID: 2601.08741v1
- Categories: cs.CL
- Published: January 13, 2026