[Paper] Evaluating Financial Intelligence in Large Language Models: Benchmarking SuperInvesting AI with LLM Engines

Published: March 9, 2026 at 01:58 PM EDT
4 min read
Source: arXiv


Overview

The paper introduces AI Financial Intelligence Benchmark (AFIB), a systematic way to measure how well large language models (LLMs) can handle real‑world financial analysis tasks. By testing five popular LLM‑based AI assistants—including the newly‑released SuperInvesting—on a curated set of 95+ equity‑research questions, the authors expose the strengths and blind spots of each system and show why “financial intelligence” is a multi‑dimensional problem.

Key Contributions

  • AFIB benchmark: a five‑dimensional evaluation suite (factual accuracy, analytical completeness, data recency, model consistency, and failure patterns) tailored to finance‑focused use cases.
  • Comprehensive dataset: >95 structured questions derived from actual equity research workflows, covering earnings analysis, valuation, macro‑economic impact, and more.
  • Cross‑model comparison: systematic head‑to‑head testing of GPT, Gemini, Perplexity, Claude, and the newly introduced SuperInvesting AI.
  • Empirical insights: quantifies trade‑offs between live‑retrieval capabilities (e.g., Perplexity) and deep analytical reasoning (e.g., SuperInvesting).
  • Open‑source artifacts: benchmark code, prompts, and scoring scripts released for reproducibility and community extensions.

Methodology

  1. Task Design – The authors distilled common equity‑research activities into 95+ question templates (e.g., “Compute the DCF valuation for Company X using FY‑2024 earnings”). Each template includes required inputs, expected output format, and reference answers.
  2. Dimension Scoring
    • Factual Accuracy: correctness of numerical facts (price, EPS, etc.), aggregated into a 0‑10 score.
    • Analytical Completeness: rubric‑based points (max 70) for covering all sub‑steps (data gathering, assumptions, calculations, interpretation).
    • Data Recency: checks whether the model used up‑to‑date market data (e.g., latest quarterly results).
    • Model Consistency: runs the same prompt three times; measures variance in answers.
    • Failure Patterns: categorizes hallucinations, omissions, or mis‑interpretations.
  3. Evaluation Pipeline – Each LLM is queried via its public API using identical prompts. Responses are automatically parsed, then manually verified by finance experts to assign rubric scores.
  4. Aggregation – Scores are normalized and combined into an overall AFIB index for each model.
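The paper releases its scoring scripts; as an illustration of step 4, here is a minimal sketch of how per‑dimension scores might be normalized and combined into a single index. The dimension names, weights, and values below are illustrative assumptions, not the paper's actual formula:

```python
def afib_index(scores, max_scores, weights=None):
    """Normalize per-dimension raw scores to [0, 1] and combine them
    into one aggregate index via a weighted average.

    scores:     dict of dimension -> raw score
    max_scores: dict of dimension -> maximum possible score
    weights:    optional dict of dimension -> weight (defaults to equal)
    """
    dims = list(scores)
    if weights is None:
        weights = {d: 1.0 / len(dims) for d in dims}
    total_w = sum(weights[d] for d in dims)
    return sum(weights[d] * scores[d] / max_scores[d] for d in dims) / total_w

# Example with AFIB-style dimensions (values illustrative only):
raw = {"accuracy": 8.96, "completeness": 56.65, "recency": 0.84, "consistency": 0.9}
caps = {"accuracy": 10, "completeness": 70, "recency": 1.0, "consistency": 1.0}
print(round(afib_index(raw, caps), 3))  # → 0.861
```

Normalizing each dimension before combining keeps a 0‑10 scale from being swamped by a 0‑70 one.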

Results & Findings

| Model | Factual Accuracy (/10) | Completeness (/70) | Recency (✓/✗) | Consistency (σ) | Hallucination Rate |
|---|---|---|---|---|---|
| SuperInvesting | 8.96 | 56.65 | ✓ (84 % up‑to‑date) | Low variance | 2 % |
| GPT | 7.42 | 48.12 | ✓ (71 %) | Moderate | 7 % |
| Gemini | 7.15 | 45.80 | ✓ (68 %) | Moderate | 8 % |
| Claude | 6.88 | 42.33 | ✗ (55 %) | Higher variance | 10 % |
| Perplexity (retrieval‑augmented) | 7.90 | 38.40 | ✓ (96 %) | Moderate | 9 % |
  • SuperInvesting tops the aggregate score, excelling at both factual correctness and analytical depth while keeping hallucinations minimal.
  • Perplexity shines on data recency thanks to live web retrieval, but its answers often miss the nuanced synthesis required for a full investment thesis.
  • All models exhibit some inconsistency across repeated runs, highlighting stochastic output as a reliability concern for high‑stakes finance work.
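The consistency dimension described above (same prompt, three runs, variance across answers) can be sketched as follows. `query_model` is a hypothetical stand‑in for a real API call, not part of the paper's released code:

```python
import statistics

def consistency_sigma(query_model, prompt, runs=3):
    """Query the same prompt several times and report the population
    standard deviation of the numeric answer returned by each run."""
    values = [query_model(prompt) for _ in range(runs)]
    return statistics.pstdev(values)

# Toy stand-in: a deterministic "model" always answers the same number,
# so its consistency sigma is exactly zero.
print(consistency_sigma(lambda prompt: 4.10, "What is Company X's FY-2024 EPS?"))  # → 0.0
```

A nonzero sigma on repeated identical prompts is precisely the stochastic‑output reliability concern the bullet above flags for high‑stakes finance work.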

Practical Implications

  • Tool Selection: For developers building AI‑assisted research platforms, the benchmark suggests pairing a retrieval layer (for fresh market data) with a reasoning‑focused model like SuperInvesting to get the best of both worlds.
  • Prompt Engineering: The completeness rubric reveals that explicit multi‑step prompts (e.g., “first gather earnings, then compute multiples, finally provide a recommendation”) dramatically improve output quality across all models.
  • Risk Management: The low hallucination rate of SuperInvesting means fewer regulatory red‑flags when automating report generation, a key consideration for fintech compliance teams.
  • API Design: Consistency metrics indicate that offering deterministic “temperature=0” endpoints or result‑caching can mitigate variance for downstream pipelines.
  • Product Roadmaps: Companies can use AFIB as a diagnostic checklist to prioritize improvements—e.g., adding a live‑price feed to a strong reasoning model or enhancing the reasoning module of a retrieval‑centric system.
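The API‑design point above (deterministic `temperature=0` endpoints plus result caching) might look like this in a downstream pipeline. `llm_call` and `toy_backend` are hypothetical stand‑ins, not a real SDK:

```python
import hashlib
import json

def cached_deterministic(llm_call):
    """Wrap an LLM call so identical prompts are served from a local
    cache, and every underlying request is pinned to temperature=0."""
    cache = {}
    def wrapper(prompt, **kwargs):
        kwargs["temperature"] = 0  # force deterministic sampling
        key = hashlib.sha256(
            json.dumps([prompt, sorted(kwargs.items())], default=str).encode()
        ).hexdigest()
        if key not in cache:
            cache[key] = llm_call(prompt, **kwargs)
        return cache[key]
    return wrapper

# Usage with a toy backend that counts how often it is actually invoked:
calls = {"n": 0}
def toy_backend(prompt, **kw):
    calls["n"] += 1
    return f"answer to: {prompt}"

ask = cached_deterministic(toy_backend)
ask("Compute the DCF for Company X")
ask("Compute the DCF for Company X")
print(calls["n"])  # → 1 (the second call is served from cache)
```

Caching only works as a variance mitigation because the temperature pin makes repeated calls reproducible in the first place.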

Limitations & Future Work

  • Domain Scope: The benchmark focuses on equity research; other finance domains (fixed income, derivatives, ESG) remain untested.
  • Static Dataset: While the authors refreshed the question set annually, rapid market regime shifts could outdate the evaluation quickly.
  • Human Scoring Overhead: Completeness and failure‑pattern annotations still require expert review, limiting large‑scale automated benchmarking.
  • Model Access: Results depend on the specific API versions and temperature settings used; future work should explore version‑agnostic evaluation and open‑source LLM baselines.

Bottom line: AFIB provides a practical, reproducible yardstick for measuring “financial IQ” in LLMs, and its early results already give developers concrete guidance on which AI engines are ready for production‑grade investment analysis.

Authors

  • Akshay Gulati
  • Kanha Singhania
  • Tushar Banga
  • Parth Arora
  • Anshul Verma
  • Vaibhav Kumar Singh
  • Agyapal Digra
  • Jayant Singh Bisht
  • Danish Sharma
  • Varun Singla
  • Shubh Garg

Paper Information

  • arXiv ID: 2603.08704v1
  • Categories: cs.AI
  • Published: March 9, 2026
  • PDF: Download PDF
