[Paper] All That Glisters Is Not Gold: A Benchmark for Reference-Free Counterfactual Financial Misinformation Detection

Published: January 7, 2026 at 01:18 PM EST
3 min read

Source: arXiv - 2601.04160v1

Overview

A new benchmark called RFC Bench (Reference‑Free Counterfactual) has been released to test how well large language models (LLMs) can spot false or misleading statements in financial news—without being handed a “ground‑truth” reference. By focusing on paragraph‑level content, the benchmark mirrors the real‑world challenge where the truth of a claim often depends on subtle, dispersed cues across a story.

Key Contributions

  • RFC Bench dataset: ~10k paragraph‑level news excerpts covering real financial topics, each paired with a perturbed (misinformation) version (an illustrative record follows this list).
  • Two evaluation modes:
    1. Reference‑free detection – model decides if a single paragraph is false, mimicking a real‑time analyst’s workflow.
    2. Comparative diagnosis – model receives the original and the perturbed paragraph together and must flag the misinformation, showing how context improves performance.
  • Comprehensive baseline suite: Tested state‑of‑the‑art LLMs (GPT‑4, Claude, LLaMA‑2, etc.) and classic classifiers, exposing a consistent performance gap between the two modes.
  • Error taxonomy: Identified “unstable predictions” (outputs flip with minor wording changes) and “invalid outputs” (nonsensical or overly generic answers) as dominant failure modes in the reference‑free setting.
  • Open‑source release: Data, evaluation scripts, and a leaderboard to encourage community contributions.
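
To make the paragraph‑pair structure concrete, the sketch below shows what a single benchmark record could look like. The field names, company, and figures are hypothetical placeholders, not the released schema.

```python
# Hypothetical RFC Bench record: one original paragraph and its perturbed
# counterpart, plus annotation fields of the kind described above.
# All field names and content here are illustrative, not the released schema.
record = {
    "id": "rfc-000123",
    "original": (
        "Acme Corp reported third-quarter revenue of $4.2 billion, "
        "up 8% year over year, beating analyst estimates."
    ),
    "perturbed": (
        "Acme Corp reported third-quarter revenue of $2.4 billion, "
        "down 8% year over year, missing analyst estimates."
    ),
    "label": "perturbed",  # which paragraph carries the misinformation
    "rationale": "Revenue figure and direction of change were altered.",
}
```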

Methodology

  1. Data collection – Curators harvested financial news from reputable outlets (e.g., Bloomberg, Reuters). Professional editors rewrote each paragraph to inject realistic misinformation (e.g., altered earnings figures, swapped company names).
  2. Annotation – Human annotators labeled each pair as original vs. perturbed and provided rationales, ensuring the misinformation was subtle yet factually incorrect.
  3. Task design (illustrative prompt templates for both modes follow this list):
    • Reference‑free: Model receives only the potentially false paragraph and must output a binary label (misinformation / trustworthy) plus a confidence score.
    • Comparative: Model receives both the original and the perturbed paragraph and must indicate which one is false.
  4. Evaluation metrics – Accuracy, F1, and a “stability score” (measuring prediction consistency under paraphrasing).
  5. Baselines – Prompt‑based LLMs (zero‑shot, few‑shot) and fine‑tuned classifiers (BERT, RoBERTa) were benchmarked across both modes.
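
The two task modes can be pictured as simple prompt templates. The following is a minimal sketch assuming a generic chat‑completion interface; the prompt wording and the `ask_llm` helper are illustrative assumptions, not the authors’ exact setup.

```python
# Minimal sketch of the two evaluation modes as prompt templates.
# `ask_llm` is a hypothetical wrapper around any chat-completion API;
# the prompt wording is illustrative, not the paper's exact prompts.

REFERENCE_FREE_PROMPT = """You are a financial fact-checking assistant.
Read the paragraph below and decide whether it contains misinformation.
Answer with a label (MISINFORMATION or TRUSTWORTHY) and a confidence in [0, 1].

Paragraph:
{paragraph}
"""

COMPARATIVE_PROMPT = """You are a financial fact-checking assistant.
One of the two paragraphs below has been altered to contain misinformation.
Answer with the letter of the false paragraph (A or B).

Paragraph A:
{paragraph_a}

Paragraph B:
{paragraph_b}
"""


def reference_free_check(ask_llm, paragraph: str) -> str:
    """Mode 1: judge a single paragraph with no reference available."""
    return ask_llm(REFERENCE_FREE_PROMPT.format(paragraph=paragraph))


def comparative_check(ask_llm, paragraph_a: str, paragraph_b: str) -> str:
    """Mode 2: judge the original and perturbed paragraphs side by side."""
    return ask_llm(COMPARATIVE_PROMPT.format(paragraph_a=paragraph_a,
                                             paragraph_b=paragraph_b))
```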

Results & Findings

| Model | Reference‑free Acc. | Comparative Acc. | Stability ↓ |
| --- | --- | --- | --- |
| GPT‑4 (zero‑shot) | 68.2 % | 92.5 % | 0.71 |
| Claude‑2 (few‑shot) | 64.7 % | 89.1 % | 0.68 |
| LLaMA‑2‑13B (fine‑tuned) | 59.3 % | 84.3 % | 0.62 |
| RoBERTa‑base (fine‑tuned) | 55.1 % | 78.9 % | 0.58 |

  • Comparative context dramatically boosts performance (≈ 20‑30 percentage points of accuracy).
  • In the reference‑free setting, even the strongest LLMs hover around 65‑70 % accuracy, far from reliable for high‑stakes finance.
  • Stability scores reveal that small paraphrases can flip a model’s decision, highlighting fragile belief states (a minimal computation sketch follows this list).
  • Invalid outputs (e.g., “I’m not sure”) appear in ~12 % of reference‑free predictions, a concerning rate for automated monitoring pipelines.
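
A paraphrase‑stability score of this kind can be computed by re‑querying the model on paraphrases of the same paragraph and measuring how often the predicted label stays the same. The sketch below assumes the labels have already been collected; the paper’s exact formulation may differ.

```python
# Minimal sketch of a paraphrase-stability score: the fraction of paraphrases
# whose predicted label agrees with the prediction on the original paragraph.
# The paper's exact definition may differ; this is only an illustration.
def stability_score(original_label: str, paraphrase_labels: list[str]) -> float:
    if not paraphrase_labels:
        return 1.0  # no paraphrases to disagree with
    agreements = sum(label == original_label for label in paraphrase_labels)
    return agreements / len(paraphrase_labels)


# Example: the model flips on one of four paraphrases -> stability 0.75.
print(stability_score("MISINFORMATION",
                      ["MISINFORMATION", "MISINFORMATION",
                       "TRUSTWORTHY", "MISINFORMATION"]))
```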

Practical Implications

  • Real‑time news monitoring: Companies building AI‑driven compliance or risk‑alert systems should not rely on a single LLM pass; pairing with a comparative check (e.g., maintaining a short “baseline” version of recent headlines) can dramatically improve detection rates (a pipeline sketch follows this list).
  • Model‑as‑a‑service: Vendors offering LLM APIs for financial analysis need to expose confidence and stability metrics, allowing downstream systems to flag low‑trust predictions for human review.
  • Prompt engineering: Adding retrieval‑augmented prompts (e.g., “compare this paragraph to the last 5 minutes of market data”) may emulate the comparative advantage without storing explicit originals.
  • Regulatory tech (RegTech): The benchmark surfaces a concrete weakness that regulators can reference when evaluating AI‑based misinformation safeguards for trading firms and asset managers.
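
One way to realize the pairing suggested above is a two‑pass check: run the reference‑free judgment by default, but switch to the comparative mode whenever a cached baseline version of the story is available. The sketch below builds on the hypothetical `reference_free_check` and `comparative_check` helpers from the methodology sketch; the cache lookup and routing logic are assumptions, not a published pipeline.

```python
# Hypothetical two-pass monitoring check: use the comparative mode whenever a
# cached baseline paragraph for the same story is available, since the
# benchmark shows it is markedly more reliable than the reference-free mode.
# `reference_free_check` and `comparative_check` are the sketches above;
# `baseline_cache` maps a story key to a recently verified paragraph.
def monitor_paragraph(ask_llm, story_key: str, paragraph: str,
                      baseline_cache: dict[str, str]) -> str:
    baseline = baseline_cache.get(story_key)
    if baseline is not None:
        # Comparative pass against the cached baseline version.
        return comparative_check(ask_llm, paragraph_a=baseline,
                                 paragraph_b=paragraph)
    # Reference-free pass: weaker, so route low-confidence cases to humans.
    return reference_free_check(ask_llm, paragraph)
```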

Limitations & Future Work

  • Domain scope: RFC Bench focuses on English‑language news from major outlets; emerging markets, non‑English sources, and social‑media posts remain untested.
  • Perturbation realism: While crafted by experts, the synthetic misinformation may still be less sophisticated than adversarial attacks deployed by malicious actors.
  • Model size bias: Only a handful of large commercial LLMs were evaluated; smaller open‑source models may behave differently under fine‑tuning.
  • Future directions proposed by the authors include: expanding the dataset to multi‑paragraph and multi‑modal (tables, charts) contexts, integrating retrieval‑augmented generation to provide “soft references,” and exploring continual‑learning setups where models update their belief states as new market data arrives.

Authors

  • Yuechen Jiang
  • Zhiwei Liu
  • Yupeng Cao
  • Yueru He
  • Ziyang Xu
  • Chen Xu
  • Zhiyang Deng
  • Prayag Tiwari
  • Xi Chen
  • Alejandro Lopez-Lira
  • Jimin Huang
  • Junichi Tsujii
  • Sophia Ananiadou

Paper Information

  • arXiv ID: 2601.04160v1
  • Categories: cs.CL, cs.CE, q-fin.CP
  • Published: January 7, 2026