[Paper] Structure First, Reason Next: Enhancing a Large Language Model using Knowledge Graph for Numerical Reasoning in Financial Documents

Published: January 12, 2026
Source: arXiv - 2601.07754v1

Overview

The paper introduces a hybrid framework that couples a large language model (LLM) with a knowledge‑graph (KG) representation of financial documents to boost numerical reasoning performance. By first extracting a structured schema from the text and then letting the LLM “reason” on top of that structure, the authors achieve a noticeable jump in answer accuracy on the FinQA benchmark.

Key Contributions

  • Schema‑first extraction: A lightweight pipeline that automatically builds a domain‑specific KG (entities, relationships, and numeric attributes) directly from raw financial reports.
  • LLM‑KG integration: A method for feeding the KG into Llama 3.1 8B Instruct, allowing the model to query structured facts before performing calculations.
  • Empirical gains: Demonstrates a ~12 % relative improvement in execution accuracy on FinQA compared with the same LLM used without KG augmentation.
  • Open‑source reproducibility: All code, KG construction scripts, and evaluation scripts are released, enabling other researchers and engineers to replicate the results.

Methodology

  1. Document Parsing & KG Construction

    • The raw PDF/HTML financial report is tokenized and passed through a rule‑based extractor that identifies key entities (e.g., “Revenue”, “Operating Income”), numeric values, and relational cues (e.g., “increased by”, “as a percentage of”).
    • These elements are assembled into a directed graph where nodes hold numeric literals and edges encode semantic relations (e.g., has‑value, derived‑from).
  2. Prompt Engineering for LLM

    • The KG is serialized into a concise, human‑readable “facts block” that is prepended to the original question prompt.
    • The LLM receives two inputs: the facts block (structured context) and the natural‑language query.
  3. Numerical Reasoning Loop

    • The model first extracts the relevant numeric nodes from the KG, performs the required arithmetic (addition, subtraction, percentage calculations, etc.), and then generates a natural‑language answer with an optional step‑by‑step explanation.
  4. Evaluation

    • Experiments are run on the FinQA dataset, which contains real‑world financial Q&A pairs with ground‑truth execution traces.
    • Metrics: Execution Accuracy (whether the final numeric answer matches the gold answer) and Explanation Accuracy (how well the generated reasoning steps align with the reference).
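The schema-first pipeline above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions: the regex pattern, the `has-value` relation name, and the `serialize_facts_block` helper are placeholders for the paper's actual rule-based extractor, whose rules are not given here.

```python
import re

def extract_facts(text):
    """Rule-based extraction of (entity, numeric value) pairs.

    Matches toy patterns like "Revenue was $120.0 million"; the paper's
    extractor uses a richer rule set over parsed PDF/HTML.
    """
    pattern = re.compile(
        r"(?P<entity>[A-Z][A-Za-z ]+?)\s+(?:was|increased to|rose to)\s+"
        r"\$?(?P<value>[\d,]+(?:\.\d+)?)"
    )
    facts = []
    for m in pattern.finditer(text):
        value = float(m.group("value").replace(",", ""))
        facts.append((m.group("entity").strip(), value))
    return facts

def build_kg(facts):
    """Assemble a directed graph: entity node --has-value--> numeric literal."""
    return {entity: {"has-value": value} for entity, value in facts}

def serialize_facts_block(kg):
    """Flatten the KG into the human-readable facts block prepended to the prompt."""
    lines = ["FACTS:"]
    for entity, edges in kg.items():
        for relation, value in edges.items():
            lines.append(f"- {entity} [{relation}] {value}")
    return "\n".join(lines)

report = "Revenue was $120.0 million in 2023. Operating Income rose to $30.0 million."
kg = build_kg(extract_facts(report))
prompt = (serialize_facts_block(kg)
          + "\n\nQ: What is operating income as a percentage of revenue?")
```

The resulting `prompt` string is what step 2 feeds to the LLM: structured facts first, natural-language question second.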

Results & Findings

| Model | Execution Accuracy (FinQA) | Relative Gain vs. Vanilla LLM |
| --- | --- | --- |
| Llama 3.1 8B Instruct (baseline) | 68.4 % | |
| Llama 3.1 8B Instruct + KG (proposed) | 76.7 % | ≈ 12 % |
  • The KG‑augmented system consistently outperforms the vanilla LLM across all question types (arithmetic, comparison, aggregation).
  • Explanation quality also improves, with the model more often citing the correct KG nodes when justifying its answer.
  • Ablation studies show that removing the KG or feeding it in an unstructured format drops performance back to baseline levels, confirming the importance of the structured “facts block”.
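The "≈ 12 %" figure is a relative improvement over the baseline, which can be checked directly from the two reported accuracies:

```python
baseline = 68.4   # execution accuracy, vanilla Llama 3.1 8B Instruct
augmented = 76.7  # execution accuracy with the KG facts block

# Relative gain: improvement expressed as a fraction of the baseline score.
relative_gain = (augmented - baseline) / baseline * 100
print(round(relative_gain, 1))  # → 12.1
```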

Practical Implications

  • Financial QA bots: Developers can embed the KG extraction pipeline into existing chat‑oriented assistants (e.g., Slack bots, customer‑service portals) to deliver more reliable numeric answers from annual reports, earnings calls, or SEC filings.
  • RegTech & compliance: Automated audit tools can leverage the framework to verify numeric claims (e.g., “Revenue grew 15 % YoY”) against the structured data extracted from filings, reducing manual review effort.
  • Data‑driven dashboards: By exposing the KG as a queryable API (e.g., GraphQL), downstream analytics platforms can perform ad‑hoc calculations without re‑training the LLM for each new metric.
  • Cost‑effective scaling: The approach works with an 8‑billion‑parameter open‑source LLM, meaning enterprises can avoid the expense of proprietary, larger models while still achieving state‑of‑the‑art performance.
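As a concrete illustration of the compliance use case, a claim like "Revenue grew 15 % YoY" can be checked against two extracted KGs without touching the LLM at all. The `query_metric` and `yoy_growth` helpers below are hypothetical names, not an API from the paper; they assume the simple `{entity: {relation: value}}` graph shape described in the methodology.

```python
def query_metric(kg, entity, relation="has-value"):
    """Look up a numeric fact; returns None when the fact is missing,
    so a caller can fall back to the LLM instead of guessing."""
    return kg.get(entity, {}).get(relation)

def yoy_growth(kg_prev, kg_curr, metric):
    """Year-over-year growth (%) for a metric present in both years' KGs."""
    prev = query_metric(kg_prev, metric)
    curr = query_metric(kg_curr, metric)
    if prev is None or curr is None or prev == 0:
        return None
    return (curr - prev) / prev * 100

kg_2022 = {"Revenue": {"has-value": 100.0}}
kg_2023 = {"Revenue": {"has-value": 115.0}}
growth = yoy_growth(kg_2022, kg_2023, "Revenue")
print(growth)  # → 15.0, so the "15 % YoY" claim checks out
```

Exposing exactly this kind of lookup behind an API (GraphQL or otherwise) is what lets dashboards compute new metrics without retraining the model.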

Limitations & Future Work

  • Domain specificity: The rule‑based KG extractor is tuned for typical financial report language; it may need adaptation for other domains (e.g., insurance, real‑estate).
  • Scalability of KG size: Very large reports generate dense graphs that can exceed prompt length limits; future work could explore hierarchical summarization or retrieval‑augmented generation.
  • Error propagation: Mistakes in the initial entity/number extraction directly affect downstream reasoning; integrating a confidence‑scoring mechanism could mitigate this.
  • Broader LLM integration: The study focuses on Llama 3.1 8B; evaluating the framework with newer instruction‑tuned or multimodal models could uncover additional gains.

Authors

  • Aryan Mishra
  • Akash Anil

Paper Information

  • arXiv ID: 2601.07754v1
  • Categories: cs.CL
  • Published: January 12, 2026
