[Paper] Structure First, Reason Next: Enhancing a Large Language Model using Knowledge Graph for Numerical Reasoning in Financial Documents
Source: arXiv - 2601.07754v1
Overview
The paper introduces a hybrid framework that couples a large language model (LLM) with a knowledge‑graph (KG) representation of financial documents to improve numerical reasoning. By first extracting a structured schema from the text and only then letting the LLM reason over that structure, the authors report a marked gain in answer accuracy on the FinQA benchmark.
Key Contributions
- Schema‑first extraction: A lightweight pipeline that automatically builds a domain‑specific KG (entities, relationships, and numeric attributes) directly from raw financial reports.
- LLM‑KG integration: A method for feeding the KG into Llama 3.1 8B Instruct, allowing the model to query structured facts before performing calculations.
- Empirical gains: Demonstrates a ~12 % relative improvement in execution accuracy on FinQA compared with the same LLM used without KG augmentation.
- Open‑source reproducibility: All code, KG construction scripts, and evaluation scripts are released, enabling other researchers and engineers to replicate the results.
Methodology
1. Document Parsing & KG Construction
- The raw PDF/HTML financial report is tokenized and passed through a rule‑based extractor that identifies key entities (e.g., “Revenue”, “Operating Income”), numeric values, and relational cues (e.g., “increased by”, “as a percentage of”).
- These elements are assembled into a directed graph where nodes hold numeric literals and edges encode semantic relations (e.g., has‑value, derived‑from).
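The parsing and graph‑construction steps above can be sketched as follows. This is a minimal illustration, not the paper's exact pipeline: the entity list, regex patterns, relation cues, and graph layout are all assumptions.

```python
import re

# Minimal sketch of the schema-first extraction step, assuming plain-text
# report lines such as "Revenue increased by 15% to $2,400 million".
# Entity names, patterns, and relation cues are illustrative assumptions.

ENTITY_PATTERN = re.compile(r"(Revenue|Operating Income|Net Income)", re.I)
NUMBER_PATTERN = re.compile(r"\$?([\d,]+(?:\.\d+)?)\s*(%|million|billion)?")
RELATION_CUES = {"increased by": "increased_by", "as a percentage of": "pct_of"}

def extract_kg(lines):
    """Build a small directed graph: nodes hold numeric literals,
    edges encode semantic relations such as has_value."""
    nodes, edges = {}, []
    for line in lines:
        match = ENTITY_PATTERN.search(line)
        if not match:
            continue
        entity = match.group(1)
        nodes.setdefault(entity, {})
        # Attach each numeric literal found on the line to the entity.
        for raw, unit in NUMBER_PATTERN.findall(line):
            digits = raw.replace(",", "")
            if not digits:
                continue
            literal = float(digits)
            node_id = f"{entity}:{literal}"
            nodes[node_id] = {"value": literal, "unit": unit}
            edges.append((entity, "has_value", node_id))
        # Record relational cues as edges back to the source sentence.
        for cue, relation in RELATION_CUES.items():
            if cue in line.lower():
                edges.append((entity, relation, line.strip()))
    return nodes, edges
```
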
2. Prompt Engineering for the LLM
- The KG is serialized into a concise, human‑readable “facts block” that is prepended to the original question prompt.
- The LLM receives two inputs: the facts block (structured context) and the natural‑language query.
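One plausible serialization of the KG into a "facts block" prepended to the question is sketched below; the exact line format and prompt template the authors use may differ.

```python
def serialize_facts(nodes, edges):
    """Render the KG as a concise, human-readable facts block.
    The line format here is an assumption, not the paper's template."""
    facts = []
    for head, relation, tail in edges:
        if relation == "has_value":
            node = nodes[tail]
            facts.append(f"- {head} has value {node['value']} {node['unit']}".rstrip())
        else:
            facts.append(f"- {head} {relation.replace('_', ' ')}: {tail}")
    return "\n".join(facts)

def build_prompt(nodes, edges, question):
    """Prepend the structured facts block to the natural-language query."""
    return f"Facts:\n{serialize_facts(nodes, edges)}\n\nQuestion: {question}\nAnswer:"
```
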
3. Numerical Reasoning Loop
- The model first extracts the relevant numeric nodes from the KG, performs the required arithmetic (addition, subtraction, percentage calculations, etc.), and then generates a natural‑language answer with an optional step‑by‑step explanation.
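The lookup‑then‑compute loop can be sketched as a small calculator over KG nodes. The operation names, node‑id scheme, and trace format are illustrative assumptions, not the model's actual internal procedure.

```python
def reason(nodes, op, left_id, right_id):
    """Look up two numeric nodes, apply an arithmetic operation, and
    return the result with a step-by-step trace."""
    a, b = nodes[left_id]["value"], nodes[right_id]["value"]
    operations = {
        "add": a + b,
        "subtract": a - b,
        "pct_change": (b - a) / a * 100.0,  # percentage change from a to b
    }
    result = operations[op]
    trace = f"{op}({a}, {b}) = {result}"
    return result, trace
```
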
4. Evaluation
- Experiments are run on the FinQA dataset, which contains real‑world financial Q&A pairs with ground‑truth execution traces.
- Metrics: Execution Accuracy (whether the final numeric answer matches the gold answer) and Explanation Accuracy (how well the generated reasoning steps align with the reference).
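Execution Accuracy reduces to a simple match rate over final numeric answers. The tolerance‑based comparison below is an assumption; FinQA's official scorer may normalize answers (rounding, percent signs) differently.

```python
def execution_accuracy(predicted, gold, tol=1e-6):
    """Fraction of questions whose final numeric answer matches the
    gold answer within a small tolerance (tolerance is an assumption)."""
    hits = sum(abs(p - g) <= tol for p, g in zip(predicted, gold))
    return hits / len(gold)
```
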
Results & Findings
| Model | Execution Accuracy (FinQA) | Relative Gain vs. Vanilla LLM |
|---|---|---|
| Llama 3.1 8B Instruct (baseline) | 68.4 % | – |
| Llama 3.1 8B Instruct + KG (proposed) | 76.7 % | ≈ 12 % |
- The KG‑augmented system consistently outperforms the vanilla LLM across all question types (arithmetic, comparison, aggregation).
- Explanation quality also improves, with the model more often citing the correct KG nodes when justifying its answer.
- Ablation studies show that removing the KG or feeding it in an unstructured format drops performance back to baseline levels, confirming the importance of the structured “facts block”.
Practical Implications
- Financial QA bots: Developers can embed the KG extraction pipeline into existing chat‑oriented assistants (e.g., Slack bots, customer‑service portals) to deliver more reliable numeric answers from annual reports, earnings calls, or SEC filings.
- RegTech & compliance: Automated audit tools can leverage the framework to verify numeric claims (e.g., “Revenue grew 15 % YoY”) against the structured data extracted from filings, reducing manual review effort.
- Data‑driven dashboards: By exposing the KG as a queryable API (e.g., GraphQL), downstream analytics platforms can perform ad‑hoc calculations without re‑training the LLM for each new metric.
- Cost‑effective scaling: The approach works with an 8‑billion‑parameter open‑source LLM, meaning enterprises can avoid the expense of proprietary, larger models while still achieving state‑of‑the‑art performance.
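The RegTech use case above can be sketched as a check of a stated growth figure against values extracted into the KG. The node ids and the acceptance tolerance are hypothetical.

```python
def verify_claim(nodes, claimed_growth_pct, old_id, new_id, tol=0.5):
    """Compare a claim like 'Revenue grew 15% YoY' against values
    extracted from filings. Node ids and tolerance are illustrative."""
    old = nodes[old_id]["value"]
    new = nodes[new_id]["value"]
    actual = (new - old) / old * 100.0
    return abs(actual - claimed_growth_pct) <= tol, actual
```
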
Limitations & Future Work
- Domain specificity: The rule‑based KG extractor is tuned for typical financial report language; it may need adaptation for other domains (e.g., insurance, real‑estate).
- Scalability of KG size: Very large reports generate dense graphs that can exceed prompt length limits; future work could explore hierarchical summarization or retrieval‑augmented generation.
- Error propagation: Mistakes in the initial entity/number extraction directly affect downstream reasoning; integrating a confidence‑scoring mechanism could mitigate this.
- Broader LLM integration: The study focuses on Llama 3.1 8B; evaluating the framework with newer instruction‑tuned or multimodal models could uncover additional gains.
Authors
- Aryan Mishra
- Akash Anil
Paper Information
- arXiv ID: 2601.07754v1
- Categories: cs.CL
- Published: January 12, 2026