[Paper] Evaluating Financial Intelligence in Large Language Models: Benchmarking SuperInvesting AI with LLM Engines
Source: arXiv - 2603.08704v1
Overview
The paper introduces the AI Financial Intelligence Benchmark (AFIB), a systematic way to measure how well large language models (LLMs) handle real‑world financial analysis tasks. By testing five popular LLM‑based AI assistants—including the newly released SuperInvesting—on a curated set of 95+ equity‑research questions, the authors expose the strengths and blind spots of each system and show why "financial intelligence" is a multi‑dimensional problem.
Key Contributions
- AFIB benchmark: a five‑dimensional evaluation suite (factual accuracy, analytical completeness, data recency, model consistency, and failure patterns) tailored to finance‑focused use cases.
- Comprehensive dataset: >95 structured questions derived from actual equity research workflows, covering earnings analysis, valuation, macro‑economic impact, and more.
- Cross‑model comparison: systematic head‑to‑head testing of GPT, Gemini, Perplexity, Claude, and the newly introduced SuperInvesting AI.
- Empirical insights: quantifies trade‑offs between live‑retrieval capabilities (e.g., Perplexity) and deep analytical reasoning (e.g., SuperInvesting).
- Open‑source artifacts: benchmark code, prompts, and scoring scripts released for reproducibility and community extensions.
Methodology
- Task Design – The authors distilled common equity‑research activities into 95+ question templates (e.g., “Compute the DCF valuation for Company X using FY‑2024 earnings”). Each template includes required inputs, expected output format, and reference answers.
- Dimension Scoring
- Factual Accuracy: correctness of numerical facts (price, EPS, etc.), with per‑fact binary judgments aggregated onto a 0‑10 scale.
- Analytical Completeness: rubric‑based points (max 70) for covering all sub‑steps (data gathering, assumptions, calculations, interpretation).
- Data Recency: checks whether the model used up‑to‑date market data (e.g., latest quarterly results).
- Model Consistency: runs the same prompt three times; measures variance in answers.
- Failure Patterns: categorizes hallucinations, omissions, or mis‑interpretations.
- Evaluation Pipeline – Each LLM is queried via its public API using identical prompts. Responses are automatically parsed, then manually verified by finance experts to assign rubric scores.
- Aggregation – Scores are normalized and combined into an overall AFIB index for each model.
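The aggregation step above can be sketched in a few lines. The dimension weights, the consistency value on [0, 1], and the exact normalization are assumptions for illustration; the summary does not specify how the paper combines the five dimensions.

```python
# Sketch of the AFIB aggregation step, assuming equal-ish dimension
# weights and linear normalization (both are assumptions, not the
# paper's published formula).

def normalize(score: float, max_score: float) -> float:
    """Map a raw dimension score onto [0, 1]."""
    return score / max_score

def afib_index(factual: float, completeness: float,
               recency: float, consistency: float,
               hallucination_rate: float,
               weights=(0.25, 0.25, 0.2, 0.15, 0.15)) -> float:
    """Combine the five normalized dimensions into one index.

    `recency` and `consistency` are assumed to already lie on [0, 1]
    (1 = fully up-to-date / identical repeated runs); hallucinations
    are penalized by scoring (1 - rate).
    """
    dims = (
        normalize(factual, 10),       # factual accuracy, scored /10
        normalize(completeness, 70),  # analytical completeness, /70
        recency,                      # fraction of up-to-date answers
        consistency,
        1.0 - hallucination_rate,
    )
    return sum(w * d for w, d in zip(weights, dims))

# Plugging in SuperInvesting's reported numbers (consistency 0.9 is a
# hypothetical stand-in for the qualitative "low variance" rating):
score = afib_index(8.96, 56.65, 0.84, 0.9, 0.02)
```

With these assumed weights, a model strong on factual accuracy and completeness dominates the index, which matches the ranking reported in the results table.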
Results & Findings
| Model | Factual Accuracy (/10) | Completeness (/70) | Recency (✓/✗, % up‑to‑date) | Consistency | Hallucination Rate |
|---|---|---|---|---|---|
| SuperInvesting | 8.96 | 56.65 | ✓ (84 %) | Low variance | 2 % |
| GPT | 7.42 | 48.12 | ✓ (71 %) | Moderate | 7 % |
| Gemini | 7.15 | 45.80 | ✓ (68 %) | Moderate | 8 % |
| Claude | 6.88 | 42.33 | ✗ (55 %) | Higher variance | 10 % |
| Perplexity (retrieval‑augmented) | 7.90 | 38.40 | ✓ (96 %) | Moderate | 9 % |
- SuperInvesting tops the aggregate score, excelling at both factual correctness and analytical depth while keeping hallucinations minimal.
- Perplexity shines on data recency thanks to live web retrieval, but its answers often miss the nuanced synthesis required for a full investment thesis.
- All models exhibit some inconsistency across repeated runs, highlighting stochastic output as a reliability concern for high‑stakes finance work.
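The run‑to‑run inconsistency flagged above can be quantified the way the methodology describes: issue the same prompt several times and measure the spread of the parsed answers. The snippet below is a minimal sketch; the regex‑based parser and the example responses are assumptions, not the paper's actual pipeline.

```python
# Minimal sketch of the repeated-run consistency check: collect the
# responses to one prompt across several runs and report the standard
# deviation of the parsed numeric answer.

import re
import statistics

def parse_number(text: str) -> float:
    """Pull the first numeric value out of a model response."""
    match = re.search(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    if match is None:
        raise ValueError("no numeric answer found")
    return float(match.group())

def consistency_sigma(responses: list[str]) -> float:
    """Population standard deviation of answers across repeated runs."""
    values = [parse_number(r) for r in responses]
    return statistics.pstdev(values)

# Three hypothetical runs of the same EPS question:
runs = ["EPS is 4.12", "The EPS comes to 4.12", "EPS: 4.15"]
sigma = consistency_sigma(runs)
```

A σ near zero means the model gives effectively the same figure every time; a large σ is exactly the stochastic‑output reliability concern raised above.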
Practical Implications
- Tool Selection: For developers building AI‑assisted research platforms, the benchmark suggests pairing a retrieval layer (for fresh market data) with a reasoning‑focused model like SuperInvesting to get the best of both worlds.
- Prompt Engineering: The completeness rubric reveals that explicit multi‑step prompts (e.g., “first gather earnings, then compute multiples, finally provide a recommendation”) dramatically improve output quality across all models.
- Risk Management: The low hallucination rate of SuperInvesting means fewer regulatory red‑flags when automating report generation, a key consideration for fintech compliance teams.
- API Design: Consistency metrics indicate that offering deterministic “temperature=0” endpoints or result‑caching can mitigate variance for downstream pipelines.
- Product Roadmaps: Companies can use AFIB as a diagnostic checklist to prioritize improvements—e.g., adding a live‑price feed to a strong reasoning model or enhancing the reasoning module of a retrieval‑centric system.
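The result‑caching idea from the API‑design point can be sketched as a thin memoizing wrapper: key responses by (model, prompt) so downstream pipelines always see one stable answer per prompt. `call_model` here is a hypothetical placeholder for a real API client, not a library function.

```python
# Sketch of result caching to tame run-to-run variance: the first call
# for a (model, prompt) pair hits the backend, later calls replay the
# cached answer.

import hashlib

class CachedClient:
    def __init__(self, call_model):
        self._call = call_model           # e.g. a temperature=0 API call
        self._cache: dict[str, str] = {}

    def query(self, model: str, prompt: str) -> str:
        key = hashlib.sha256(f"{model}\x00{prompt}".encode()).hexdigest()
        if key not in self._cache:
            self._cache[key] = self._call(model, prompt)
        return self._cache[key]           # repeated calls are identical

# Usage with a stub backend that counts real invocations:
calls = []
def stub(model, prompt):
    calls.append(prompt)
    return f"answer for {prompt}"

client = CachedClient(stub)
a = client.query("gpt", "Compute FY-2024 EPS for Company X")
b = client.query("gpt", "Compute FY-2024 EPS for Company X")
```

Caching trades freshness for determinism, so in practice it would be scoped to a session or paired with the recency checks described earlier.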
Limitations & Future Work
- Domain Scope: The benchmark focuses on equity research; other finance domains (fixed income, derivatives, ESG) remain untested.
- Static Dataset: Although the question set is refreshed annually, rapid market regime shifts could quickly outdate the evaluation.
- Human Scoring Overhead: Completeness and failure‑pattern annotations still require expert review, limiting large‑scale automated benchmarking.
- Model Access: Results depend on the specific API versions and temperature settings used; future work should explore version‑agnostic evaluation and open‑source LLM baselines.
Bottom line: AFIB provides a practical, reproducible yardstick for measuring "financial IQ" in LLMs, and its early results already give developers concrete guidance on which AI engines are ready for production‑grade investment analysis.
Authors
- Akshay Gulati
- Kanha Singhania
- Tushar Banga
- Parth Arora
- Anshul Verma
- Vaibhav Kumar Singh
- Agyapal Digra
- Jayant Singh Bisht
- Danish Sharma
- Varun Singla
- Shubh Garg
Paper Information
- arXiv ID: 2603.08704v1
- Categories: cs.AI
- Published: March 9, 2026