[Paper] Structure First, Reason Next: Enhancing a Large Language Model using Knowledge Graph for Numerical Reasoning in Financial Documents
Source: arXiv - 2601.07754v1
Overview
The paper introduces a hybrid framework that couples a large language model (LLM) with a knowledge‑graph (KG) representation of financial documents to improve numerical reasoning. By first extracting a structured schema from the text and only then letting the LLM reason over that structure, the authors report a marked gain in answer accuracy on the FinQA benchmark.
Key Contributions
- Schema‑first extraction: A lightweight pipeline that automatically builds a domain‑specific KG (entities, relationships, and numeric attributes) directly from raw financial reports.
- LLM‑KG integration: A method for feeding the KG into Llama 3.1 8B Instruct, allowing the model to query structured facts before performing calculations.
- Empirical gains: Demonstrates a ~12 % relative improvement in execution accuracy on FinQA compared with the same LLM used without KG augmentation.
- Open‑source reproducibility: All code, KG construction scripts, and evaluation scripts are released, enabling other researchers and engineers to replicate the results.
Methodology
1. Document Parsing & KG Construction
- The raw PDF/HTML financial report is tokenized and passed through a rule‑based extractor that identifies key entities (e.g., “Revenue”, “Operating Income”), numeric values, and relational cues (e.g., “increased by”, “as a percentage of”).
- These elements are assembled into a directed graph where nodes hold numeric literals and edges encode semantic relations (e.g., has‑value, derived‑from).
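The parsing and graph‑construction steps above can be sketched as follows. This is a minimal illustration, not the paper's exact pipeline: the entity list, regex patterns, relation cues, and graph layout are all assumptions.

```python
import re

# Minimal sketch of the schema-first extraction step, assuming plain-text
# report lines such as "Revenue increased by 15% to $2,400 million".
# Entity names, patterns, and relation cues are illustrative assumptions.

ENTITY_PATTERN = re.compile(r"(Revenue|Operating Income|Net Income)", re.I)
NUMBER_PATTERN = re.compile(r"\$?([\d,]+(?:\.\d+)?)\s*(%|million|billion)?")
RELATION_CUES = {"increased by": "increased_by", "as a percentage of": "pct_of"}

def extract_kg(lines):
    """Build a small directed graph: nodes hold numeric literals,
    edges encode semantic relations such as has_value."""
    nodes, edges = {}, []
    for line in lines:
        match = ENTITY_PATTERN.search(line)
        if not match:
            continue
        entity = match.group(1)
        nodes.setdefault(entity, {})
        # Attach each numeric literal found on the line to the entity.
        for raw, unit in NUMBER_PATTERN.findall(line):
            digits = raw.replace(",", "")
            if not digits:
                continue
            literal = float(digits)
            node_id = f"{entity}:{literal}"
            nodes[node_id] = {"value": literal, "unit": unit}
            edges.append((entity, "has_value", node_id))
        # Record relational cues as edges back to the source sentence.
        for cue, relation in RELATION_CUES.items():
            if cue in line.lower():
                edges.append((entity, relation, line.strip()))
    return nodes, edges
```
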
2. Prompt Engineering for the LLM
- The KG is serialized into a concise, human‑readable “facts block” that is prepended to the original question prompt.
- The LLM receives two inputs: the facts block (structured context) and the natural‑language query.
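One plausible serialization of the KG into a "facts block" prepended to the question is sketched below; the exact line format and prompt template the authors use may differ.

```python
def serialize_facts(nodes, edges):
    """Render the KG as a concise, human-readable facts block.
    The line format here is an assumption, not the paper's template."""
    facts = []
    for head, relation, tail in edges:
        if relation == "has_value":
            node = nodes[tail]
            facts.append(f"- {head} has value {node['value']} {node['unit']}".rstrip())
        else:
            facts.append(f"- {head} {relation.replace('_', ' ')}: {tail}")
    return "\n".join(facts)

def build_prompt(nodes, edges, question):
    """Prepend the structured facts block to the natural-language query."""
    return f"Facts:\n{serialize_facts(nodes, edges)}\n\nQuestion: {question}\nAnswer:"
```
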
3. Numerical Reasoning Loop
- The model first extracts the relevant numeric nodes from the KG, performs the required arithmetic (addition, subtraction, percentage calculations, etc.), and then generates a natural‑language answer with an optional step‑by‑step explanation.
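The lookup‑then‑compute loop can be sketched as a small calculator over KG nodes. The operation names, node‑id scheme, and trace format are illustrative assumptions, not the model's actual internal procedure.

```python
def reason(nodes, op, left_id, right_id):
    """Look up two numeric nodes, apply an arithmetic operation, and
    return the result with a step-by-step trace."""
    a, b = nodes[left_id]["value"], nodes[right_id]["value"]
    operations = {
        "add": a + b,
        "subtract": a - b,
        "pct_change": (b - a) / a * 100.0,  # percentage change from a to b
    }
    result = operations[op]
    trace = f"{op}({a}, {b}) = {result}"
    return result, trace
```
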
4. Evaluation
- Experiments are run on the FinQA dataset, which contains real‑world financial Q&A pairs with ground‑truth execution traces.
- Metrics: Execution Accuracy (whether the final numeric answer matches the gold answer) and Explanation Accuracy (how well the generated reasoning steps align with the reference).
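Execution Accuracy reduces to a simple match rate over final numeric answers. The tolerance‑based comparison below is an assumption; FinQA's official scorer may normalize answers (rounding, percent signs) differently.

```python
def execution_accuracy(predicted, gold, tol=1e-6):
    """Fraction of questions whose final numeric answer matches the
    gold answer within a small tolerance (tolerance is an assumption)."""
    hits = sum(abs(p - g) <= tol for p, g in zip(predicted, gold))
    return hits / len(gold)
```
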
Results & Findings
| Model | Execution Accuracy (FinQA) | Relative Gain vs. Vanilla LLM |
|---|---|---|
| Llama 3.1 8B Instruct (baseline) | 68.4 % | – |
| Llama 3.1 8B Instruct + KG (proposed) | 76.7 % | ≈ 12 % |
- The KG‑augmented system consistently outperforms the vanilla LLM across all question types (arithmetic, comparison, aggregation).
- Explanation quality also improves, with the model more often citing the correct KG nodes when justifying its answer.
- Ablation studies show that removing the KG or feeding it in an unstructured format drops performance back to baseline levels, confirming the importance of the structured “facts block”.
Practical Implications
- Financial QA bots: Developers can embed the KG extraction pipeline into existing chat‑oriented assistants (e.g., Slack bots, customer‑service portals) to deliver more reliable numeric answers from annual reports, earnings calls, or SEC filings.
- RegTech & compliance: Automated audit tools can leverage the framework to verify numeric claims (e.g., “Revenue grew 15 % YoY”) against the structured data extracted from filings, reducing manual review effort.
- Data‑driven dashboards: By exposing the KG as a queryable API (e.g., GraphQL), downstream analytics platforms can perform ad‑hoc calculations without re‑training the LLM for each new metric.
- Cost‑effective scaling: The approach works with an 8‑billion‑parameter open‑source LLM, meaning enterprises can avoid the expense of proprietary, larger models while still achieving state‑of‑the‑art performance.
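The RegTech use case above can be sketched as a check of a stated growth figure against values extracted into the KG. The node ids and the acceptance tolerance are hypothetical.

```python
def verify_claim(nodes, claimed_growth_pct, old_id, new_id, tol=0.5):
    """Compare a claim like 'Revenue grew 15% YoY' against values
    extracted from filings. Node ids and tolerance are illustrative."""
    old = nodes[old_id]["value"]
    new = nodes[new_id]["value"]
    actual = (new - old) / old * 100.0
    return abs(actual - claimed_growth_pct) <= tol, actual
```
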
Limitations & Future Work
- Domain specificity: The rule‑based KG extractor is tuned for typical financial report language; it may need adaptation for other domains (e.g., insurance, real‑estate).
- Scalability of KG size: Very large reports generate dense graphs that can exceed prompt length limits; future work could explore hierarchical summarization or retrieval‑augmented generation.
- Error propagation: Mistakes in the initial entity/number extraction directly affect downstream reasoning; integrating a confidence‑scoring mechanism could mitigate this.
- Broader LLM integration: The study focuses on Llama 3.1 8B; evaluating the framework with newer instruction‑tuned or multimodal models could uncover additional gains.
Authors
- Aryan Mishra
- Akash Anil
Paper Information
- arXiv ID: 2601.07754v1
- Categories: cs.CL
- Published: January 12, 2026